# Live Audio Capture & Frame Decoding

Meetstream streams real-time meeting audio to your application over a WebSocket connection. Audio arrives as speaker-tagged binary frames from Google Meet, Zoom, and Microsoft Teams — all using the same wire format.

---

## Overview

When you create a bot with the `live_audio_required` configuration, Meetstream opens a WebSocket connection from the bot to your server and continuously streams binary audio frames for the duration of the meeting.

### Enabling Live Audio

Include `live_audio_required` in your Create Bot API request:

```json
{
  "meeting_url": "https://meet.google.com/abc-defg-hij",
  "live_audio_required": {
    "websocket_url": "wss://your-server.com/audio"
  }
}
```

The `websocket_url` is a WebSocket endpoint **you host**. Meetstream connects to it as a client.

---

## Connection Lifecycle

### 1. Bot connects to your WebSocket endpoint

The bot initiates the connection when it joins the meeting.

### 2. Bot sends a JSON text handshake

The first message is always a JSON text frame:

```json
{
  "type": "ready",
  "bot_id": "bot_abc123",
  "message": "Ready to receive messages"
}
```

### 3. Binary audio frames stream continuously

Every subsequent message is a **binary WebSocket frame** containing PCM audio with embedded speaker metadata. Frames arrive continuously for the duration of the meeting.

### 4. Connection closes when the bot leaves

The WebSocket closes with a normal `1000` close code when the bot exits the meeting.
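The handshake in step 2 is the only text frame you should expect; everything after it is binary audio. A small validator makes that check explicit before you start decoding frames — a minimal sketch, where the helper name `parse_handshake` is ours, not part of any Meetstream SDK:

```python
import json


def parse_handshake(message: str) -> str:
    """Validate the initial JSON text frame and return the bot_id.

    The first message on the connection is {"type": "ready", ...};
    anything else on the text channel is unexpected.
    Raises ValueError if the message is not the expected handshake.
    """
    payload = json.loads(message)
    if payload.get("type") != "ready":
        raise ValueError(f"unexpected handshake type: {payload.get('type')!r}")
    return payload["bot_id"]
```

In your receive loop, call this on the first text frame, then treat all binary frames as audio.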
### Timeline

![Connection Lifecycle](https://files.buildwithfern.com/meetstream-ai-573402.docs.buildwithfern.com/f554c44f8926783c950916afa70281ea97f5f0459abc2f6ee31afd5897f7f6a5/docs/assets/images/connection-lifecycle.png)

---

## Binary Frame Format

Every audio frame is a single binary WebSocket message with this structure:

```
┌──────────┬────────────┬────────────┬──────────────┬──────────────┬──────────────────┐
│ msg_type │ sid_length │ speaker_id │ sname_length │ speaker_name │  pcm_audio_data  │
│  1 byte  │  2 bytes   │  L1 bytes  │   2 bytes    │   L2 bytes   │ remaining bytes  │
└──────────┴────────────┴────────────┴──────────────┴──────────────┴──────────────────┘
```

### Field-by-Field Breakdown

| Offset | Size | Type | Field | Description |
|--------|------|------|-------|-------------|
| `0` | 1 byte | `uint8` | `msg_type` | Message type. Always `0x01` for PCM audio. |
| `1` | 2 bytes | `uint16 LE` | `sid_length` | Byte length of the `speaker_id` string that follows. |
| `3` | L1 bytes | `UTF-8` | `speaker_id` | Platform-specific unique identifier for the speaker. |
| `3 + L1` | 2 bytes | `uint16 LE` | `sname_length` | Byte length of the `speaker_name` string that follows. |
| `5 + L1` | L2 bytes | `UTF-8` | `speaker_name` | Display name of the speaker as shown in the meeting. |
| `5 + L1 + L2` | remaining | `int16 LE` | `pcm_audio` | Raw PCM audio samples. |

### Important

- There are **no delimiters** between fields. The format is length-prefixed: you read the 2-byte length, then read that many bytes for the string.
- The `sid_length` and `sname_length` values change depending on the length of the speaker's name and ID. These are **not** fixed values or delimiters — they are standard unsigned 16-bit little-endian integers encoding a string length.
- `0x01` is currently the only defined message type. All binary frames on this channel will have `0x01` at byte 0.
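The layout above can also be exercised in reverse: a small encoder that builds a synthetic frame is handy for unit-testing your decoder without a live meeting. This is a test utility sketch, not part of the Meetstream API:

```python
def encode_audio_frame(speaker_id: str, speaker_name: str, pcm_bytes: bytes) -> bytes:
    """Build a frame in the wire format: msg_type 0x01, two
    length-prefixed UTF-8 strings, then raw PCM16 LE audio."""
    sid = speaker_id.encode("utf-8")
    sname = speaker_name.encode("utf-8")
    return (
        b"\x01"                          # msg_type: PCM audio
        + len(sid).to_bytes(2, "little")   # sid_length (uint16 LE)
        + sid                              # speaker_id
        + len(sname).to_bytes(2, "little") # sname_length (uint16 LE)
        + sname                            # speaker_name
        + pcm_bytes                        # remaining bytes: PCM16 LE samples
    )
```

For example, `encode_audio_frame("user_42", "Alice", pcm)` produces exactly the header bytes shown in the hex dump walkthrough below.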
---

## Audio Properties

| Property | Value |
|----------|-------|
| Encoding | Signed 16-bit integer (PCM16) |
| Byte order | Little-endian |
| Sample rate | 48,000 Hz |
| Channels | 1 (mono) |
| Bit depth | 16 bits (2 bytes per sample) |
| Container | None — raw samples, no WAV/MP3/Ogg headers |

To calculate duration from a frame:

```
duration_seconds = (len(pcm_audio_data) / 2) / 48000
```

---

## Hex Dump Walkthrough

A frame from a speaker named `"Alice"` with ID `"user_42"`:

```
Hex: 01 07 00 75 73 65 72 5F 34 32 05 00 41 6C 69 63 65 XX XX XX XX ...
     ── ───── ──────────────────── ───── ────────────── ───────────────
     │  │     │                    │     │              │
     │  │     │                    │     │              └─ PCM16 LE audio samples
     │  │     │                    │     └─ "Alice" (5 bytes UTF-8)
     │  │     │                    └─ sname_length = 5
     │  │     └─ "user_42" (7 bytes UTF-8)
     │  └─ sid_length = 7
     └─ msg_type = 0x01 (PCM audio)
```

A frame from `"James Chen"` with ID `"James Chen"`:

```
Hex: 01 0A 00 4A 61 6D 65 73 20 43 68 65 6E 0A 00 4A 61 6D 65 73 20 43 68 65 6E XX XX ...
     ── ───── ───────────────────────────── ───── ───────────────────────────── ────────
     │  │     │                             │     │                             │
     │  │     │                             │     │                             └─ PCM audio
     │  │     │                             │     └─ "James Chen" (10 bytes)
     │  │     │                             └─ sname_length = 10 (0x0A)
     │  │     └─ "James Chen" (10 bytes)
     │  └─ sid_length = 10 (0x0A)
     └─ msg_type = 0x01
```

Note: `0x07 = 7`, `0x0A = 10`, `0x0E = 14`, etc. These are string lengths, not protocol markers.

---

## Decoding Examples

### Python

```python
def decode_audio_frame(data: bytes):
    """Decode a Meetstream binary audio frame.

    Returns:
        tuple: (speaker_id, speaker_name, pcm_bytes) on success
        None: if the frame is malformed
    """
    if len(data) < 5 or data[0] != 0x01:
        return None

    # Speaker ID: 2-byte length prefix + UTF-8 string
    sid_len = int.from_bytes(data[1:3], "little")
    speaker_id = data[3 : 3 + sid_len].decode("utf-8")

    # Speaker Name: 2-byte length prefix + UTF-8 string
    off = 3 + sid_len
    sname_len = int.from_bytes(data[off : off + 2], "little")
    off += 2
    speaker_name = data[off : off + sname_len].decode("utf-8")
    off += sname_len

    # Remaining bytes are raw PCM16 LE audio
    pcm_bytes = data[off:]
    return speaker_id, speaker_name, pcm_bytes
```

### JavaScript / Node.js

```javascript
function decodeAudioFrame(buffer) {
  if (buffer.length < 5 || buffer[0] !== 0x01) return null;

  // Speaker ID
  const sidLen = buffer.readUInt16LE(1);
  const speakerId = buffer.subarray(3, 3 + sidLen).toString("utf-8");

  // Speaker Name
  let off = 3 + sidLen;
  const snameLen = buffer.readUInt16LE(off);
  off += 2;
  const speakerName = buffer.subarray(off, off + snameLen).toString("utf-8");
  off += snameLen;

  // PCM audio
  const pcmData = buffer.subarray(off);

  return { speakerId, speakerName, pcmData };
}
```

### Go

```go
import (
	"encoding/binary"
	"errors"
)

type AudioFrame struct {
	SpeakerID   string
	SpeakerName string
	PCMData     []byte
}

func DecodeAudioFrame(data []byte) (*AudioFrame, error) {
	if len(data) < 5 || data[0] != 0x01 {
		return nil, errors.New("invalid frame")
	}

	sidLen := int(binary.LittleEndian.Uint16(data[1:3]))
	if len(data) < 3+sidLen+2 {
		return nil, errors.New("frame too short for speaker ID")
	}
	speakerID := string(data[3 : 3+sidLen])

	off := 3 + sidLen
	snameLen := int(binary.LittleEndian.Uint16(data[off : off+2]))
	off += 2
	if len(data) < off+snameLen {
		return nil, errors.New("frame too short for speaker name")
	}
	speakerName := string(data[off : off+snameLen])
	off += snameLen

	return &AudioFrame{
		SpeakerID:   speakerID,
		SpeakerName: speakerName,
		PCMData:     data[off:],
	}, nil
}
```

### Java

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class MeetstreamAudioFrame {
    public final String speakerId;
    public final String speakerName;
    public final byte[] pcmData;

    private MeetstreamAudioFrame(String speakerId, String speakerName, byte[] pcmData) {
        this.speakerId = speakerId;
        this.speakerName = speakerName;
        this.pcmData = pcmData;
    }

    public static MeetstreamAudioFrame decode(byte[] data) {
        if (data.length < 5 || data[0] != 0x01) return null;

        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        buf.get(); // skip msg_type

        int sidLen = buf.getShort() & 0xFFFF;
        byte[] sidBytes = new byte[sidLen];
        buf.get(sidBytes);
        String speakerId = new String(sidBytes, StandardCharsets.UTF_8);

        int snameLen = buf.getShort() & 0xFFFF;
        byte[] snameBytes = new byte[snameLen];
        buf.get(snameBytes);
        String speakerName = new String(snameBytes, StandardCharsets.UTF_8);

        byte[] pcmData = new byte[buf.remaining()];
        buf.get(pcmData);

        return new MeetstreamAudioFrame(speakerId, speakerName, pcmData);
    }
}
```

---

## Full Receiver Examples

Because Meetstream connects to the `websocket_url` **you host**, your receiver runs as a WebSocket *server*. The port `8765` in these examples is illustrative; in production you would terminate TLS (for `wss://`) at a reverse proxy or load balancer in front of this server.

### Python — Receive and Log

```python
import asyncio
import json

import websockets


async def handle_bot(ws):
    async for message in ws:
        # First message is a JSON text handshake
        if isinstance(message, str):
            handshake = json.loads(message)
            print(f"Bot connected: {handshake['bot_id']}")
            continue

        # All subsequent messages are binary audio frames
        result = decode_audio_frame(message)
        if result is None:
            continue

        speaker_id, speaker_name, pcm_bytes = result
        num_samples = len(pcm_bytes) // 2
        duration_ms = (num_samples / 48000) * 1000
        print(f"[{speaker_name}] {num_samples} samples ({duration_ms:.0f}ms)")


async def main():
    async with websockets.serve(handle_bot, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

asyncio.run(main())
```

### Node.js — Receive and Log

```javascript
const WebSocket = require("ws");

const wss = new WebSocket.Server({ port: 8765 });

wss.on("connection", (ws) => {
  ws.on("message", (data, isBinary) => {
    if (!isBinary) {
      const handshake = JSON.parse(data.toString());
      console.log(`Bot connected: ${handshake.bot_id}`);
      return;
    }

    const frame = decodeAudioFrame(data);
    if (!frame) return;

    const numSamples = frame.pcmData.length / 2;
    const durationMs = (numSamples / 48000) * 1000;
    console.log(`[${frame.speakerName}] ${numSamples} samples (${durationMs.toFixed(0)}ms)`);
  });
});
```

---

## Working with PCM Audio

### Convert to NumPy Array (Python)

```python
import numpy as np

# To int16 sample array
samples = np.frombuffer(pcm_bytes, dtype=np.int16)

# To float32 (-1.0 to 1.0) — standard format for ML models and audio libraries
float_samples = samples.astype(np.float32) / 32768.0
```

### Save as WAV File (Python)

```python
import wave

def save_wav(pcm_bytes: bytes, path: str, sample_rate: int = 48000):
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)
```

### Accumulate and Save a Full Meeting Recording

```python
import wave

audio_buffer = bytearray()

# Inside your receive loop:
speaker_id, speaker_name, pcm_bytes = decode_audio_frame(message)
audio_buffer.extend(pcm_bytes)

# When the meeting ends:
with wave.open("meeting_recording.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(48000)
    wf.writeframes(bytes(audio_buffer))
```

### Resample to a Different Rate (Python)

Many speech-to-text services expect 16 kHz audio.
Resample with `numpy`:

```python
import numpy as np

def resample_pcm16(pcm_bytes: bytes, src_rate: int, dst_rate: int) -> bytes:
    if src_rate == dst_rate:
        return pcm_bytes

    samples = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32)
    num_output = int(len(samples) * dst_rate / src_rate)

    t_in = np.linspace(0, 1, len(samples), endpoint=False)
    t_out = np.linspace(0, 1, num_output, endpoint=False)
    resampled = np.interp(t_out, t_in, samples)

    return np.clip(resampled, -32768, 32767).astype(np.int16).tobytes()

# Example: 48kHz → 16kHz
pcm_16k = resample_pcm16(pcm_bytes, 48000, 16000)
```

### Convert to Int16 Array in JavaScript

```javascript
function pcmBytesToInt16Array(buffer) {
  const int16 = new Int16Array(buffer.length / 2);
  for (let i = 0; i < int16.length; i++) {
    int16[i] = buffer.readInt16LE(i * 2);
  }
  return int16;
}

function pcmBytesToFloat32Array(buffer) {
  const float32 = new Float32Array(buffer.length / 2);
  for (let i = 0; i < float32.length; i++) {
    float32[i] = buffer.readInt16LE(i * 2) / 32768.0;
  }
  return float32;
}
```

---

## Speaker Identification

Each frame includes both a `speaker_id` and a `speaker_name`:

| Field | Description | Example |
|-------|-------------|---------|
| `speaker_id` | A platform-specific unique identifier. Stable within a session. | `"user_42"`, `"1234567890"`, `"NoSpeaker"` |
| `speaker_name` | The display name shown in the meeting UI. May not be unique. | `"Alice"`, `"David Hill"`, `"NoSpeaker"` |

### Platform Behavior

| Platform | `speaker_id` | `speaker_name` |
|----------|-------------|----------------|
| **Google Meet** | Participant ID from the meeting DOM | Display name from the meeting |
| **Zoom** | Zoom SDK `node_id` (per-participant) | `GetUserName()` from the SDK |
| **Teams** | Dominant speaker display name | Dominant speaker display name |

### Handling `"NoSpeaker"`

If the bot cannot determine the active speaker, both fields will be `"NoSpeaker"`.
This can happen during the first moments of a meeting or during mixed audio when speaker attribution is unavailable.

---

## Message Types on the Audio Channel

| Byte 0 | Type | Format | Description |
|--------|------|--------|-------------|
| N/A | Handshake | JSON text | Sent once on connect. `{"type": "ready", ...}` |
| `0x01` | PCM Audio | Binary | Speaker-tagged audio frame (described above) |

`0x01` is currently the only binary message type. All binary frames will have `0x01` at position 0. Future protocol versions may introduce additional types — check byte 0 and skip unknown types for forward compatibility.

---

## FAQ

### What sample rate does the audio arrive at?

48,000 Hz for all platforms (Google Meet, Zoom, Teams).

### Is the audio mixed or per-speaker?

The audio is **mixed** — it contains all meeting participants combined into a single mono stream. Speaker metadata (`speaker_id`, `speaker_name`) indicates who was the **dominant speaker** when the frame was captured, but the audio itself contains everyone.

### How large is each frame?

Frame sizes vary. Typical frames contain 1,000 to 50,000+ samples (20ms to 1+ seconds of audio). The size depends on the platform's audio capture interval and buffering.

### Can I receive audio from specific speakers only?

No. The bot streams mixed audio. Use the `speaker_name` or `speaker_id` metadata for your own filtering or labeling logic after decoding.

### Do I need to send an acknowledgment for each frame?

No. The protocol is fire-and-forget. The bot streams continuously and does not expect any response on the audio channel.

### What happens if my server is slow to consume frames?

Frames will buffer in the WebSocket layer. If the buffer grows too large, the connection may drop. Ensure your receiver processes or discards frames promptly.

### Why does the `sid_length` value change between sessions?

The 2-byte length field reflects the byte length of the speaker's ID string.
Different speakers have different name/ID lengths:

- `"Alice"` = 5 bytes → `sid_length` = `0x05 0x00`
- `"James Chen"` = 10 bytes → `sid_length` = `0x0A 0x00`
- `"Bob"` = 3 bytes → `sid_length` = `0x03 0x00`

This is not a delimiter or protocol variation — it is a standard length-prefixed string encoding.

### How should I parse the binary format defensively?

Always read the 2-byte length, then read exactly that many bytes. Never hard-code expected length values.

A correct parser:

```python
sid_len = int.from_bytes(data[1:3], "little")       # read length
speaker_id = data[3 : 3 + sid_len].decode("utf-8")  # read exactly that many bytes
```

An incorrect parser:

```python
# BAD: hard-codes an expected "delimiter" byte
if data[1] == 0x09 and data[2] == 0x00:
    ...
```

### Can speaker names contain non-ASCII characters?

Yes. Speaker names are UTF-8 encoded. A name like `"Javier Martinez"` is 15 bytes, while `"佐藤太郎"` is 12 bytes (3 bytes per CJK character). Always use the length prefix — never scan for fixed byte patterns.

### What if I only need the audio and don't care about speaker info?

Skip past the headers:

```python
def extract_audio_only(data: bytes) -> bytes:
    if len(data) < 5 or data[0] != 0x01:
        return b""
    sid_len = int.from_bytes(data[1:3], "little")
    off = 3 + sid_len
    sname_len = int.from_bytes(data[off:off + 2], "little")
    off += 2 + sname_len
    return data[off:]
```
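### How can I label or group audio by speaker?

Since the audio channel carries only mixed audio, the per-speaker metadata is your labeling handle. A sketch of the filtering approach described above, which buckets each frame's PCM bytes by the dominant speaker — the names `route_frame` and `audio_by_speaker` are illustrative, and the audio in each bucket is still the mixed stream, only grouped by attribution:

```python
from collections import defaultdict

# PCM bytes keyed by the dominant speaker of each frame.
audio_by_speaker = defaultdict(bytearray)


def route_frame(data: bytes) -> None:
    """Parse a binary frame and append its PCM to a per-speaker buffer."""
    if len(data) < 5 or data[0] != 0x01:
        return  # not a valid PCM audio frame

    sid_len = int.from_bytes(data[1:3], "little")
    speaker_id = data[3 : 3 + sid_len].decode("utf-8")

    off = 3 + sid_len
    sname_len = int.from_bytes(data[off : off + 2], "little")
    off += 2 + sname_len

    if speaker_id == "NoSpeaker":
        return  # no attribution available for this frame

    audio_by_speaker[speaker_id].extend(data[off:])
```

Each buffer can then be written out with the WAV-saving helper shown earlier, one file per `speaker_id`.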