# Live Audio Capture & Frame Decoding

Meetstream streams real-time meeting audio to your application over a WebSocket connection. Audio arrives as speaker-tagged binary frames from Google Meet, Zoom, and Microsoft Teams — all using the same wire format.

---

## Overview

When you create a bot with the `live_audio_required` configuration, Meetstream opens a WebSocket connection from the bot to your server and continuously streams binary audio frames for the duration of the meeting.

### Enabling Live Audio

Include `live_audio_required` in your Create Bot API request:

```json
{
  "meeting_url": "https://meet.google.com/abc-defg-hij",
  "live_audio_required": {
    "websocket_url": "wss://your-server.com/audio"
  }
}
```

The `websocket_url` is a WebSocket endpoint **you host**. Meetstream connects to it as a client.

---

## Connection Lifecycle

### 1. Bot connects to your WebSocket endpoint

The bot initiates the connection when it joins the meeting.

### 2. Bot sends a JSON text handshake

The first message is always a JSON text frame:

```json
{
  "type": "ready",
  "bot_id": "bot_abc123",
  "message": "Ready to receive messages"
}
```

### 3. Binary audio frames stream continuously

Every subsequent message is a **binary WebSocket frame** containing PCM audio with embedded speaker metadata. Frames arrive continuously for the duration of the meeting.

### 4. Connection closes when the bot leaves

The WebSocket closes with a normal `1000` close code when the bot exits the meeting.
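The handshake in step 2 is the only text frame you should expect; everything after it is binary audio. A small validator makes that check explicit before you start decoding frames — a minimal sketch, where the helper name `parse_handshake` is ours, not part of any Meetstream SDK:

```python
import json


def parse_handshake(message: str) -> str:
    """Validate the initial JSON text frame and return the bot_id.

    The first message on the connection is {"type": "ready", ...};
    anything else on the text channel is unexpected.
    Raises ValueError if the message is not the expected handshake.
    """
    payload = json.loads(message)
    if payload.get("type") != "ready":
        raise ValueError(f"unexpected handshake type: {payload.get('type')!r}")
    return payload["bot_id"]
```

In your receive loop, call this on the first text frame, then treat all binary frames as audio.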
### Timeline

![Connection Lifecycle](https://files.buildwithfern.com/meetstream-ai-573402.docs.buildwithfern.com/f554c44f8926783c950916afa70281ea97f5f0459abc2f6ee31afd5897f7f6a5/docs/assets/images/connection-lifecycle.png)

---

## Binary Frame Format

Every audio frame is a single binary WebSocket message with this structure:

```
┌──────────┬────────────┬────────────┬──────────────┬──────────────┬──────────────────┐
│ msg_type │ sid_length │ speaker_id │ sname_length │ speaker_name │  pcm_audio_data  │
│  1 byte  │  2 bytes   │  L1 bytes  │   2 bytes    │   L2 bytes   │ remaining bytes  │
└──────────┴────────────┴────────────┴──────────────┴──────────────┴──────────────────┘
```

### Field-by-Field Breakdown

| Offset | Size | Type | Field | Description |
|--------|------|------|-------|-------------|
| `0` | 1 byte | `uint8` | `msg_type` | Message type. Always `0x01` for PCM audio. |
| `1` | 2 bytes | `uint16 LE` | `sid_length` | Byte length of the `speaker_id` string that follows. |
| `3` | L1 bytes | `UTF-8` | `speaker_id` | Platform-specific unique identifier for the speaker. |
| `3 + L1` | 2 bytes | `uint16 LE` | `sname_length` | Byte length of the `speaker_name` string that follows. |
| `5 + L1` | L2 bytes | `UTF-8` | `speaker_name` | Display name of the speaker as shown in the meeting. |
| `5 + L1 + L2` | remaining | `int16 LE` | `pcm_audio` | Raw PCM audio samples. |

### Important

- There are **no delimiters** between fields. The format is length-prefixed: you read the 2-byte length, then read that many bytes for the string.
- The `sid_length` and `sname_length` values change depending on the length of the speaker's name and ID. These are **not** fixed values or delimiters — they are standard unsigned 16-bit little-endian integers encoding a string length.
- `0x01` is currently the only defined message type. All binary frames on this channel will have `0x01` at byte 0.
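The layout above can also be exercised in reverse: a small encoder that builds a synthetic frame is handy for unit-testing your decoder without a live meeting. This is a test utility sketch, not part of the Meetstream API:

```python
def encode_audio_frame(speaker_id: str, speaker_name: str, pcm_bytes: bytes) -> bytes:
    """Build a frame in the wire format: msg_type 0x01, two
    length-prefixed UTF-8 strings, then raw PCM16 LE audio."""
    sid = speaker_id.encode("utf-8")
    sname = speaker_name.encode("utf-8")
    return (
        b"\x01"                          # msg_type: PCM audio
        + len(sid).to_bytes(2, "little")   # sid_length (uint16 LE)
        + sid                              # speaker_id
        + len(sname).to_bytes(2, "little") # sname_length (uint16 LE)
        + sname                            # speaker_name
        + pcm_bytes                        # remaining bytes: PCM16 LE samples
    )
```

For example, `encode_audio_frame("user_42", "Alice", pcm)` produces exactly the header bytes shown in the hex dump walkthrough below.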
---

## Audio Properties

| Property | Value |
|----------|-------|
| Encoding | Signed 16-bit integer (PCM16) |
| Byte order | Little-endian |
| Sample rate | 48,000 Hz |
| Channels | 1 (mono) |
| Bit depth | 16 bits (2 bytes per sample) |
| Container | None — raw samples, no WAV/MP3/Ogg headers |

To calculate duration from a frame:

```
duration_seconds = (len(pcm_audio_data) / 2) / 48000
```

---

## Hex Dump Walkthrough

A frame from a speaker named `"Alice"` with ID `"user_42"`:

```
Hex: 01 07 00 75 73 65 72 5F 34 32 05 00 41 6C 69 63 65 XX XX XX XX ...
     ── ───── ──────────────────── ───── ────────────── ───────────────
     │  │     │                    │     │              │
     │  │     │                    │     │              └─ PCM16 LE audio samples
     │  │     │                    │     └─ "Alice" (5 bytes UTF-8)
     │  │     │                    └─ sname_length = 5
     │  │     └─ "user_42" (7 bytes UTF-8)
     │  └─ sid_length = 7
     └─ msg_type = 0x01 (PCM audio)
```

A frame from `"James Chen"` with ID `"James Chen"`:

```
Hex: 01 0A 00 4A 61 6D 65 73 20 43 68 65 6E 0A 00 4A 61 6D 65 73 20 43 68 65 6E XX XX ...
     ── ───── ───────────────────────────── ───── ───────────────────────────── ────────
     │  │     │                             │     │                             │
     │  │     │                             │     │                             └─ PCM audio
     │  │     │                             │     └─ "James Chen" (10 bytes)
     │  │     │                             └─ sname_length = 10 (0x0A)
     │  │     └─ "James Chen" (10 bytes)
     │  └─ sid_length = 10 (0x0A)
     └─ msg_type = 0x01
```

Note: `0x07 = 7`, `0x0A = 10`, `0x0E = 14`, etc. These are string lengths, not protocol markers.

---

## Decoding Examples

### Python

```python
def decode_audio_frame(data: bytes):
    """Decode a Meetstream binary audio frame.

    Returns:
        tuple: (speaker_id, speaker_name, pcm_bytes) on success
        None: if the frame is malformed
    """
    if len(data) < 5 or data[0] != 0x01:
        return None

    # Speaker ID: 2-byte length prefix + UTF-8 string
    sid_len = int.from_bytes(data[1:3], "little")
    speaker_id = data[3 : 3 + sid_len].decode("utf-8")

    # Speaker Name: 2-byte length prefix + UTF-8 string
    off = 3 + sid_len
    sname_len = int.from_bytes(data[off : off + 2], "little")
    off += 2
    speaker_name = data[off : off + sname_len].decode("utf-8")
    off += sname_len

    # Remaining bytes are raw PCM16 LE audio
    pcm_bytes = data[off:]
    return speaker_id, speaker_name, pcm_bytes
```

### JavaScript / Node.js

```javascript
function decodeAudioFrame(buffer) {
  if (buffer.length < 5 || buffer[0] !== 0x01) return null;

  // Speaker ID
  const sidLen = buffer.readUInt16LE(1);
  const speakerId = buffer.subarray(3, 3 + sidLen).toString("utf-8");

  // Speaker Name
  let off = 3 + sidLen;
  const snameLen = buffer.readUInt16LE(off);
  off += 2;
  const speakerName = buffer.subarray(off, off + snameLen).toString("utf-8");
  off += snameLen;

  // PCM audio
  const pcmData = buffer.subarray(off);

  return { speakerId, speakerName, pcmData };
}
```

### Go

```go
import (
	"encoding/binary"
	"errors"
)

type AudioFrame struct {
	SpeakerID   string
	SpeakerName string
	PCMData     []byte
}

func DecodeAudioFrame(data []byte) (*AudioFrame, error) {
	if len(data) < 5 || data[0] != 0x01 {
		return nil, errors.New("invalid frame")
	}

	sidLen := int(binary.LittleEndian.Uint16(data[1:3]))
	if len(data) < 3+sidLen+2 {
		return nil, errors.New("frame too short for speaker ID")
	}
	speakerID := string(data[3 : 3+sidLen])

	off := 3 + sidLen
	snameLen := int(binary.LittleEndian.Uint16(data[off : off+2]))
	off += 2
	if len(data) < off+snameLen {
		return nil, errors.New("frame too short for speaker name")
	}
	speakerName := string(data[off : off+snameLen])
	off += snameLen

	return &AudioFrame{
		SpeakerID:   speakerID,
		SpeakerName: speakerName,
		PCMData:     data[off:],
	}, nil
}
```

### Java

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class MeetstreamAudioFrame {
    public final String speakerId;
    public final String speakerName;
    public final byte[] pcmData;

    private MeetstreamAudioFrame(String speakerId, String speakerName, byte[] pcmData) {
        this.speakerId = speakerId;
        this.speakerName = speakerName;
        this.pcmData = pcmData;
    }

    public static MeetstreamAudioFrame decode(byte[] data) {
        if (data.length < 5 || data[0] != 0x01) return null;

        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        buf.get(); // skip msg_type

        int sidLen = buf.getShort() & 0xFFFF;
        byte[] sidBytes = new byte[sidLen];
        buf.get(sidBytes);
        String speakerId = new String(sidBytes, StandardCharsets.UTF_8);

        int snameLen = buf.getShort() & 0xFFFF;
        byte[] snameBytes = new byte[snameLen];
        buf.get(snameBytes);
        String speakerName = new String(snameBytes, StandardCharsets.UTF_8);

        byte[] pcmData = new byte[buf.remaining()];
        buf.get(pcmData);

        return new MeetstreamAudioFrame(speakerId, speakerName, pcmData);
    }
}
```

---

## Full Receiver Examples

Because Meetstream connects to the `websocket_url` **you host**, your receiver runs as a WebSocket *server*. The port `8765` in these examples is illustrative; in production you would terminate TLS (for `wss://`) at a reverse proxy or load balancer in front of this server.

### Python — Receive and Log

```python
import asyncio
import json

import websockets


async def handle_bot(ws):
    async for message in ws:
        # First message is a JSON text handshake
        if isinstance(message, str):
            handshake = json.loads(message)
            print(f"Bot connected: {handshake['bot_id']}")
            continue

        # All subsequent messages are binary audio frames
        result = decode_audio_frame(message)
        if result is None:
            continue

        speaker_id, speaker_name, pcm_bytes = result
        num_samples = len(pcm_bytes) // 2
        duration_ms = (num_samples / 48000) * 1000
        print(f"[{speaker_name}] {num_samples} samples ({duration_ms:.0f}ms)")


async def main():
    async with websockets.serve(handle_bot, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

asyncio.run(main())
```

### Node.js — Receive and Log

```javascript
const WebSocket = require("ws");

const wss = new WebSocket.Server({ port: 8765 });

wss.on("connection", (ws) => {
  ws.on("message", (data, isBinary) => {
    if (!isBinary) {
      const handshake = JSON.parse(data.toString());
      console.log(`Bot connected: ${handshake.bot_id}`);
      return;
    }

    const frame = decodeAudioFrame(data);
    if (!frame) return;

    const numSamples = frame.pcmData.length / 2;
    const durationMs = (numSamples / 48000) * 1000;
    console.log(`[${frame.speakerName}] ${numSamples} samples (${durationMs.toFixed(0)}ms)`);
  });
});
```

---

## Working with PCM Audio

### Convert to NumPy Array (Python)

```python
import numpy as np

# To int16 sample array
samples = np.frombuffer(pcm_bytes, dtype=np.int16)

# To float32 (-1.0 to 1.0) — standard format for ML models and audio libraries
float_samples = samples.astype(np.float32) / 32768.0
```

### Save as WAV File (Python)

```python
import wave

def save_wav(pcm_bytes: bytes, path: str, sample_rate: int = 48000):
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)
```

### Accumulate and Save a Full Meeting Recording

```python
import wave

audio_buffer = bytearray()

# Inside your receive loop:
speaker_id, speaker_name, pcm_bytes = decode_audio_frame(message)
audio_buffer.extend(pcm_bytes)

# When the meeting ends:
with wave.open("meeting_recording.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(48000)
    wf.writeframes(bytes(audio_buffer))
```

### Resample to a Different Rate (Python)

Many speech-to-text services expect 16 kHz audio.
Resample with `numpy`:

```python
import numpy as np

def resample_pcm16(pcm_bytes: bytes, src_rate: int, dst_rate: int) -> bytes:
    if src_rate == dst_rate:
        return pcm_bytes

    samples = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32)
    num_output = int(len(samples) * dst_rate / src_rate)

    t_in = np.linspace(0, 1, len(samples), endpoint=False)
    t_out = np.linspace(0, 1, num_output, endpoint=False)
    resampled = np.interp(t_out, t_in, samples)

    return np.clip(resampled, -32768, 32767).astype(np.int16).tobytes()

# Example: 48kHz → 16kHz
pcm_16k = resample_pcm16(pcm_bytes, 48000, 16000)
```

### Convert to Int16 Array in JavaScript

```javascript
function pcmBytesToInt16Array(buffer) {
  const int16 = new Int16Array(buffer.length / 2);
  for (let i = 0; i < int16.length; i++) {
    int16[i] = buffer.readInt16LE(i * 2);
  }
  return int16;
}

function pcmBytesToFloat32Array(buffer) {
  const float32 = new Float32Array(buffer.length / 2);
  for (let i = 0; i < float32.length; i++) {
    float32[i] = buffer.readInt16LE(i * 2) / 32768.0;
  }
  return float32;
}
```

---

## Speaker Identification

Each frame includes both a `speaker_id` and a `speaker_name`:

| Field | Description | Example |
|-------|-------------|---------|
| `speaker_id` | A platform-specific unique identifier. Stable within a session. | `"user_42"`, `"1234567890"`, `"NoSpeaker"` |
| `speaker_name` | The display name shown in the meeting UI. May not be unique. | `"Alice"`, `"David Hill"`, `"NoSpeaker"` |

### Platform Behavior

| Platform | `speaker_id` | `speaker_name` |
|----------|-------------|----------------|
| **Google Meet** | Participant ID from the meeting DOM | Display name from the meeting |
| **Zoom** | Zoom SDK `node_id` (per-participant) | `GetUserName()` from the SDK |
| **Teams** | Dominant speaker display name | Dominant speaker display name |

### Handling `"NoSpeaker"`

If the bot cannot determine the active speaker, both fields will be `"NoSpeaker"`.
This can happen during the first moments of a meeting or during mixed audio when speaker attribution is unavailable.

---

## Message Types on the Audio Channel

| Byte 0 | Type | Format | Description |
|--------|------|--------|-------------|
| N/A | Handshake | JSON text | Sent once on connect. `{"type": "ready", ...}` |
| `0x01` | PCM Audio | Binary | Speaker-tagged audio frame (described above) |

`0x01` is currently the only binary message type. All binary frames will have `0x01` at position 0. Future protocol versions may introduce additional types — check byte 0 and skip unknown types for forward compatibility.

---

## FAQ

### What sample rate does the audio arrive at?

48,000 Hz for all platforms (Google Meet, Zoom, Teams).

### Is the audio mixed or per-speaker?

The audio is **mixed** — it contains all meeting participants combined into a single mono stream. Speaker metadata (`speaker_id`, `speaker_name`) indicates who was the **dominant speaker** when the frame was captured, but the audio itself contains everyone.

### How large is each frame?

Frame sizes vary. Typical frames contain 1,000 to 50,000+ samples (20ms to 1+ seconds of audio). The size depends on the platform's audio capture interval and buffering.

### Can I receive audio from specific speakers only?

No. The bot streams mixed audio. Use the `speaker_name` or `speaker_id` metadata for your own filtering or labeling logic after decoding.

### Do I need to send an acknowledgment for each frame?

No. The protocol is fire-and-forget. The bot streams continuously and does not expect any response on the audio channel.

### What happens if my server is slow to consume frames?

Frames will buffer in the WebSocket layer. If the buffer grows too large, the connection may drop. Ensure your receiver processes or discards frames promptly.

### Why does the `sid_length` value change between sessions?

The 2-byte length field reflects the byte length of the speaker's ID string.
Different speakers have different name/ID lengths:

- `"Alice"` = 5 bytes → `sid_length` = `0x05 0x00`
- `"James Chen"` = 10 bytes → `sid_length` = `0x0A 0x00`
- `"Bob"` = 3 bytes → `sid_length` = `0x03 0x00`

This is not a delimiter or protocol variation — it is a standard length-prefixed string encoding.

### How should I parse the binary format defensively?

Always read the 2-byte length, then read exactly that many bytes. Never hard-code expected length values.

A correct parser:

```python
sid_len = int.from_bytes(data[1:3], "little")       # read length
speaker_id = data[3 : 3 + sid_len].decode("utf-8")  # read exactly that many bytes
```

An incorrect parser:

```python
# BAD: hard-codes an expected "delimiter" byte
if data[1] == 0x09 and data[2] == 0x00:
    ...
```

### Can speaker names contain non-ASCII characters?

Yes. Speaker names are UTF-8 encoded. A name like `"Javier Martinez"` is 15 bytes, while `"佐藤太郎"` is 12 bytes (3 bytes per CJK character). Always use the length prefix — never scan for fixed byte patterns.

### What if I only need the audio and don't care about speaker info?

Skip past the headers:

```python
def extract_audio_only(data: bytes) -> bytes:
    if len(data) < 5 or data[0] != 0x01:
        return b""
    sid_len = int.from_bytes(data[1:3], "little")
    off = 3 + sid_len
    sname_len = int.from_bytes(data[off:off + 2], "little")
    off += 2 + sname_len
    return data[off:]
```
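### How can I label or group audio by speaker?

Since the audio channel carries only mixed audio, the per-speaker metadata is your labeling handle. A sketch of the filtering approach described above, which buckets each frame's PCM bytes by the dominant speaker — the names `route_frame` and `audio_by_speaker` are illustrative, and the audio in each bucket is still the mixed stream, only grouped by attribution:

```python
from collections import defaultdict

# PCM bytes keyed by the dominant speaker of each frame.
audio_by_speaker = defaultdict(bytearray)


def route_frame(data: bytes) -> None:
    """Parse a binary frame and append its PCM to a per-speaker buffer."""
    if len(data) < 5 or data[0] != 0x01:
        return  # not a valid PCM audio frame

    sid_len = int.from_bytes(data[1:3], "little")
    speaker_id = data[3 : 3 + sid_len].decode("utf-8")

    off = 3 + sid_len
    sname_len = int.from_bytes(data[off : off + 2], "little")
    off += 2 + sname_len

    if speaker_id == "NoSpeaker":
        return  # no attribution available for this frame

    audio_by_speaker[speaker_id].extend(data[off:])
```

Each buffer can then be written out with the WAV-saving helper shown earlier, one file per `speaker_id`.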