Live Audio Capture & Frame Decoding

Meetstream streams real-time meeting audio to your application over a WebSocket connection. Audio arrives as speaker-tagged binary frames from Google Meet, Zoom, and Microsoft Teams — all using the same wire format.


Overview

When you create a bot with the live_audio_required configuration, Meetstream opens a WebSocket connection from the bot to your server and continuously streams binary audio frames for the duration of the meeting.

Enabling Live Audio

Include live_audio_required in your Create Bot API request:

```json
{
  "meeting_url": "https://meet.google.com/abc-defg-hij",
  "live_audio_required": {
    "websocket_url": "wss://your-server.com/audio"
  }
}
```

The websocket_url is a WebSocket endpoint you host. Meetstream connects to it as a client.
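Because Meetstream acts as the WebSocket client, your side must run a server that accepts the incoming connection. A minimal sketch using the third-party `websockets` library — the handler name, host, and port here are illustrative, not part of the API:

```python
import asyncio
import json


async def handle_bot(ws):
    """Handle one bot connection: a JSON handshake, then binary audio frames."""
    async for message in ws:
        if isinstance(message, str):
            # The first message is the JSON "ready" handshake
            handshake = json.loads(message)
            print(f"Bot connected: {handshake['bot_id']}")
        else:
            # Every later message is a binary, speaker-tagged PCM frame
            print(f"Received {len(message)}-byte audio frame")


async def main():
    import websockets  # pip install websockets

    # Host and port are illustrative; expose this behind wss://your-server.com/audio
    async with websockets.serve(handle_bot, "0.0.0.0", 8765):
        await asyncio.Future()  # serve until cancelled
```

Run it with `asyncio.run(main())`. In production you would terminate TLS (the `wss://` part) at a reverse proxy or load balancer in front of this process.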


Connection Lifecycle

1. Bot connects to your WebSocket endpoint

The bot initiates the connection when it joins the meeting.

2. Bot sends a JSON text handshake

The first message is always a JSON text frame:

```json
{
  "type": "ready",
  "bot_id": "bot_abc123",
  "message": "Ready to receive messages"
}
```

3. Binary audio frames stream continuously

Every subsequent message is a binary WebSocket frame containing PCM audio with embedded speaker metadata. Frames arrive continuously for the duration of the meeting.

4. Connection closes when the bot leaves

The WebSocket closes with a normal 1000 close code when the bot exits the meeting.

Timeline

[Diagram: connection lifecycle timeline — handshake, continuous binary audio frames, normal close]


Binary Frame Format

Every audio frame is a single binary WebSocket message with this structure:

```
┌──────────┬────────────┬────────────┬──────────────┬──────────────┬──────────────────┐
│ msg_type │ sid_length │ speaker_id │ sname_length │ speaker_name │ pcm_audio_data   │
│ 1 byte   │ 2 bytes    │ L1 bytes   │ 2 bytes      │ L2 bytes     │ remaining bytes  │
└──────────┴────────────┴────────────┴──────────────┴──────────────┴──────────────────┘
```

Field-by-Field Breakdown

| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| 0 | 1 byte | uint8 | msg_type | Message type. Always 0x01 for PCM audio. |
| 1 | 2 bytes | uint16 LE | sid_length | Byte length of the speaker_id string that follows. |
| 3 | L1 bytes | UTF-8 | speaker_id | Platform-specific unique identifier for the speaker. |
| 3 + L1 | 2 bytes | uint16 LE | sname_length | Byte length of the speaker_name string that follows. |
| 5 + L1 | L2 bytes | UTF-8 | speaker_name | Display name of the speaker as shown in the meeting. |
| 5 + L1 + L2 | remaining | int16 LE | pcm_audio | Raw PCM audio samples. |

Important

  • There are no delimiters between fields. The format is length-prefixed: you read the 2-byte length, then read that many bytes for the string.
  • The sid_length and sname_length values change depending on the length of the speaker’s name and ID. These are not fixed values or delimiters — they are standard unsigned 16-bit little-endian integers encoding a string length.
  • 0x01 is currently the only defined message type. All binary frames on this channel will have 0x01 at byte 0.

Audio Properties

| Property | Value |
|---|---|
| Encoding | Signed 16-bit integer (PCM16) |
| Byte order | Little-endian |
| Sample rate | 48,000 Hz |
| Channels | 1 (mono) |
| Bit depth | 16 bits (2 bytes per sample) |
| Container | None — raw samples, no WAV/MP3/Ogg headers |

To calculate duration from a frame:

duration_seconds = (len(pcm_audio_data) / 2) / 48000
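For example, a 96,000-byte PCM payload holds 48,000 samples, which is exactly one second of audio. As a small helper (the function name is ours, not part of the API):

```python
def frame_duration_seconds(pcm_audio_data: bytes) -> float:
    """Duration of a PCM16 mono payload at 48 kHz."""
    num_samples = len(pcm_audio_data) // 2  # 2 bytes per 16-bit sample
    return num_samples / 48000


print(frame_duration_seconds(b"\x00" * 96000))  # 96,000 bytes -> 1.0 seconds
```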

Hex Dump Walkthrough

A frame from a speaker named "Alice" with ID "user_42":

```
01 07 00 75 73 65 72 5F 34 32 05 00 41 6C 69 63 65 XX XX XX ...
── ───── ──────────────────── ───── ────────────── ─────────
│  │     │                    │     │              └─ PCM16 LE audio samples
│  │     │                    │     └─ "Alice" (5 bytes UTF-8)
│  │     │                    └─ sname_length = 5
│  │     └─ "user_42" (7 bytes UTF-8)
│  └─ sid_length = 7
└─ msg_type = 0x01 (PCM audio)
```

A frame from "James Chen" with ID "James Chen":

```
01 0A 00 4A 61 6D 65 73 20 43 68 65 6E 0A 00 4A 61 6D 65 73 20 43 68 65 6E XX XX ...
── ───── ───────────────────────────── ───── ───────────────────────────── ─────────
│  │     │                             │     │                             └─ PCM audio
│  │     │                             │     └─ "James Chen" (10 bytes)
│  │     │                             └─ sname_length = 10 (0x0A)
│  │     └─ "James Chen" (10 bytes)
│  └─ sid_length = 10 (0x0A)
└─ msg_type = 0x01
```

Note: 0x07 = 7, 0x0A = 10, 0x0E = 14, etc. These are string lengths, not protocol markers.
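To sanity-check a parser, it helps to build a frame yourself and compare against the hex dumps above. A small encoder sketch for this wire format (the helper name is ours, not part of the API):

```python
import struct


def encode_audio_frame(speaker_id: str, speaker_name: str, pcm: bytes) -> bytes:
    """Build a frame: msg_type 0x01, two length-prefixed UTF-8 strings, then PCM."""
    sid = speaker_id.encode("utf-8")
    sname = speaker_name.encode("utf-8")
    return (
        b"\x01"                               # msg_type
        + struct.pack("<H", len(sid)) + sid   # sid_length (uint16 LE) + speaker_id
        + struct.pack("<H", len(sname)) + sname  # sname_length (uint16 LE) + speaker_name
        + pcm                                 # raw PCM16 LE samples
    )


frame = encode_audio_frame("user_42", "Alice", b"\x00\x00")
print(frame[:17].hex(" "))  # 01 07 00 75 73 65 72 5f 34 32 05 00 41 6c 69 63 65
```

The 17 header bytes printed here match the "Alice" walkthrough above byte for byte.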


Decoding Examples

Python

```python
def decode_audio_frame(data: bytes):
    """Decode a Meetstream binary audio frame.

    Returns:
        tuple: (speaker_id, speaker_name, pcm_bytes) on success
        None: if the frame is malformed
    """
    if len(data) < 5 or data[0] != 0x01:
        return None

    # Speaker ID: 2-byte length prefix + UTF-8 string
    sid_len = int.from_bytes(data[1:3], "little")
    if len(data) < 3 + sid_len + 2:
        return None  # frame too short for speaker ID
    speaker_id = data[3 : 3 + sid_len].decode("utf-8")

    # Speaker Name: 2-byte length prefix + UTF-8 string
    off = 3 + sid_len
    sname_len = int.from_bytes(data[off : off + 2], "little")
    off += 2
    if len(data) < off + sname_len:
        return None  # frame too short for speaker name
    speaker_name = data[off : off + sname_len].decode("utf-8")
    off += sname_len

    # Remaining bytes are raw PCM16 LE audio
    pcm_bytes = data[off:]
    return speaker_id, speaker_name, pcm_bytes
```

JavaScript / Node.js

```javascript
function decodeAudioFrame(buffer) {
  if (buffer.length < 5 || buffer[0] !== 0x01) return null;

  // Speaker ID
  const sidLen = buffer.readUInt16LE(1);
  const speakerId = buffer.subarray(3, 3 + sidLen).toString("utf-8");

  // Speaker Name
  let off = 3 + sidLen;
  const snameLen = buffer.readUInt16LE(off);
  off += 2;
  const speakerName = buffer.subarray(off, off + snameLen).toString("utf-8");
  off += snameLen;

  // PCM audio
  const pcmData = buffer.subarray(off);
  return { speakerId, speakerName, pcmData };
}
```

Go

```go
import (
	"encoding/binary"
	"errors"
)

type AudioFrame struct {
	SpeakerID   string
	SpeakerName string
	PCMData     []byte
}

func DecodeAudioFrame(data []byte) (*AudioFrame, error) {
	if len(data) < 5 || data[0] != 0x01 {
		return nil, errors.New("invalid frame")
	}

	sidLen := int(binary.LittleEndian.Uint16(data[1:3]))
	if len(data) < 3+sidLen+2 {
		return nil, errors.New("frame too short for speaker ID")
	}
	speakerID := string(data[3 : 3+sidLen])

	off := 3 + sidLen
	snameLen := int(binary.LittleEndian.Uint16(data[off : off+2]))
	off += 2
	if len(data) < off+snameLen {
		return nil, errors.New("frame too short for speaker name")
	}
	speakerName := string(data[off : off+snameLen])
	off += snameLen

	return &AudioFrame{
		SpeakerID:   speakerID,
		SpeakerName: speakerName,
		PCMData:     data[off:],
	}, nil
}
```

Java

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class MeetstreamAudioFrame {
    public final String speakerId;
    public final String speakerName;
    public final byte[] pcmData;

    private MeetstreamAudioFrame(String speakerId, String speakerName, byte[] pcmData) {
        this.speakerId = speakerId;
        this.speakerName = speakerName;
        this.pcmData = pcmData;
    }

    public static MeetstreamAudioFrame decode(byte[] data) {
        if (data.length < 5 || data[0] != 0x01) return null;

        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        buf.get(); // skip msg_type

        int sidLen = buf.getShort() & 0xFFFF;
        byte[] sidBytes = new byte[sidLen];
        buf.get(sidBytes);
        String speakerId = new String(sidBytes, StandardCharsets.UTF_8);

        int snameLen = buf.getShort() & 0xFFFF;
        byte[] snameBytes = new byte[snameLen];
        buf.get(snameBytes);
        String speakerName = new String(snameBytes, StandardCharsets.UTF_8);

        byte[] pcmData = new byte[buf.remaining()];
        buf.get(pcmData);

        return new MeetstreamAudioFrame(speakerId, speakerName, pcmData);
    }
}
```

Full Receiver Examples

Python — Receive and Log

```python
import asyncio
import json

import websockets


async def receive_audio():
    async with websockets.connect("wss://your-server.com/audio") as ws:
        async for message in ws:
            # First message is a JSON text handshake
            if isinstance(message, str):
                handshake = json.loads(message)
                print(f"Bot connected: {handshake['bot_id']}")
                continue

            # All subsequent messages are binary audio frames
            result = decode_audio_frame(message)
            if result is None:
                continue

            speaker_id, speaker_name, pcm_bytes = result
            num_samples = len(pcm_bytes) // 2
            duration_ms = (num_samples / 48000) * 1000
            print(f"[{speaker_name}] {num_samples} samples ({duration_ms:.0f}ms)")


asyncio.run(receive_audio())
```

Node.js — Receive and Log

```javascript
const WebSocket = require("ws");

const ws = new WebSocket("wss://your-server.com/audio");

ws.on("message", (data, isBinary) => {
  if (!isBinary) {
    const handshake = JSON.parse(data.toString());
    console.log(`Bot connected: ${handshake.bot_id}`);
    return;
  }

  const frame = decodeAudioFrame(data);
  if (!frame) return;

  const numSamples = frame.pcmData.length / 2;
  const durationMs = (numSamples / 48000) * 1000;
  console.log(`[${frame.speakerName}] ${numSamples} samples (${durationMs.toFixed(0)}ms)`);
});
```

Working with PCM Audio

Convert to NumPy Array (Python)

```python
import numpy as np

# To int16 sample array
samples = np.frombuffer(pcm_bytes, dtype=np.int16)

# To float32 (-1.0 to 1.0) — standard format for ML models and audio libraries
float_samples = samples.astype(np.float32) / 32768.0
```

Save as WAV File (Python)

```python
import wave


def save_wav(pcm_bytes: bytes, path: str, sample_rate: int = 48000):
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)
```

Accumulate and Save a Full Meeting Recording

```python
import wave

audio_buffer = bytearray()

# Inside your receive loop:
speaker_id, speaker_name, pcm_bytes = decode_audio_frame(message)
audio_buffer.extend(pcm_bytes)

# When the meeting ends:
with wave.open("meeting_recording.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(48000)
    wf.writeframes(bytes(audio_buffer))
```

Resample to a Different Rate (Python)

Many speech-to-text services expect 16 kHz audio. Resample with numpy:

```python
import numpy as np


def resample_pcm16(pcm_bytes: bytes, src_rate: int, dst_rate: int) -> bytes:
    if src_rate == dst_rate:
        return pcm_bytes

    samples = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32)
    num_output = int(len(samples) * dst_rate / src_rate)
    t_in = np.linspace(0, 1, len(samples), endpoint=False)
    t_out = np.linspace(0, 1, num_output, endpoint=False)
    resampled = np.interp(t_out, t_in, samples)
    return np.clip(resampled, -32768, 32767).astype(np.int16).tobytes()


# Example: 48kHz → 16kHz
pcm_16k = resample_pcm16(pcm_bytes, 48000, 16000)
```

Convert to Int16 Array in JavaScript

```javascript
function pcmBytesToInt16Array(buffer) {
  const int16 = new Int16Array(buffer.length / 2);
  for (let i = 0; i < int16.length; i++) {
    int16[i] = buffer.readInt16LE(i * 2);
  }
  return int16;
}

function pcmBytesToFloat32Array(buffer) {
  const float32 = new Float32Array(buffer.length / 2);
  for (let i = 0; i < float32.length; i++) {
    float32[i] = buffer.readInt16LE(i * 2) / 32768.0;
  }
  return float32;
}
```

Speaker Identification

Each frame includes both a speaker_id and a speaker_name:

| Field | Description | Example |
|---|---|---|
| speaker_id | A platform-specific unique identifier. Stable within a session. | "user_42", "1234567890", "NoSpeaker" |
| speaker_name | The display name shown in the meeting UI. May not be unique. | "Alice", "David Hill", "NoSpeaker" |

Platform Behavior

| Platform | speaker_id | speaker_name |
|---|---|---|
| Google Meet | Participant ID from the meeting DOM | Display name from the meeting |
| Zoom | Zoom SDK node_id (per-participant) | GetUserName() from the SDK |
| Teams | Dominant speaker display name | Dominant speaker display name |

Handling "NoSpeaker"

If the bot cannot determine the active speaker, both fields will be "NoSpeaker". This can happen during the first moments of a meeting or during mixed audio when speaker attribution is unavailable.
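If your pipeline labels audio by speaker, decide up front how unattributed frames are handled. A minimal sketch (the helper name is ours; the "NoSpeaker" sentinel is from the protocol):

```python
def attribute_frame(speaker_id: str, speaker_name: str):
    """Return a display label for a frame, or None when attribution is unavailable."""
    if speaker_id == "NoSpeaker":
        return None  # mixed/unattributed audio: keep the audio, just don't label it
    return f"{speaker_name} ({speaker_id})"


print(attribute_frame("user_42", "Alice"))        # Alice (user_42)
print(attribute_frame("NoSpeaker", "NoSpeaker"))  # None
```

Dropping "NoSpeaker" frames entirely would create gaps in a recording, so it is usually better to keep the audio and skip only the labeling step.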


Message Types on the Audio Channel

| Byte 0 | Type | Format | Description |
|---|---|---|---|
| N/A | Handshake | JSON text | Sent once on connect. {"type": "ready", ...} |
| 0x01 | PCM Audio | Binary | Speaker-tagged audio frame (described above) |

0x01 is currently the only binary message type. All binary frames will have 0x01 at position 0. Future protocol versions may introduce additional types — check byte 0 and skip unknown types for forward compatibility.
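The forward-compatible pattern is to dispatch on byte 0 and silently drop frames with an unknown type. A sketch (the function name and the tuple shape are ours):

```python
def dispatch_binary_frame(data: bytes):
    """Route a binary frame by its msg_type byte; ignore unknown types."""
    if not data:
        return None
    msg_type = data[0]
    if msg_type == 0x01:
        # Hand the rest of the frame to your PCM audio decoder
        return ("pcm_audio", data[1:])
    return None  # unknown future message type: skip for forward compatibility


print(dispatch_binary_frame(b"\x01abc"))  # ('pcm_audio', b'abc')
print(dispatch_binary_frame(b"\x02xyz"))  # None
```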


FAQ

What sample rate does the audio arrive at?

48,000 Hz for all platforms (Google Meet, Zoom, Teams).

Is the audio mixed or per-speaker?

The audio is mixed — it contains all meeting participants combined into a single mono stream. Speaker metadata (speaker_id, speaker_name) indicates who was the dominant speaker when the frame was captured, but the audio itself contains everyone.

How large is each frame?

Frame sizes vary. Typical frames contain 1,000 to 50,000+ samples (20ms to 1+ seconds of audio). The size depends on the platform’s audio capture interval and buffering.

Can I receive audio from specific speakers only?

No. The bot streams mixed audio. Use the speaker_name or speaker_id metadata for your own filtering or labeling logic after decoding.

Do I need to send an acknowledgment for each frame?

No. The protocol is fire-and-forget. The bot streams continuously and does not expect any response on the audio channel.

What happens if my server is slow to consume frames?

Frames will buffer in the WebSocket layer. If the buffer grows too large, the connection may drop. Ensure your receiver processes or discards frames promptly.
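One way to keep the socket drained is to decouple receiving from processing with a bounded queue, dropping the oldest frame under backpressure. A sketch of that policy (the helper name and queue size are illustrative):

```python
import asyncio


async def enqueue_frame(queue: asyncio.Queue, frame: bytes) -> bool:
    """Enqueue a frame; if the queue is full, drop the oldest frame to make room.

    Returns True if an old frame was dropped.
    """
    dropped = False
    if queue.full():
        queue.get_nowait()  # discard the oldest buffered frame
        dropped = True
    queue.put_nowait(frame)
    return dropped


async def demo():
    queue = asyncio.Queue(maxsize=2)  # small bound for demonstration
    print(await enqueue_frame(queue, b"f1"))  # False
    print(await enqueue_frame(queue, b"f2"))  # False
    print(await enqueue_frame(queue, b"f3"))  # True: f1 was dropped


asyncio.run(demo())
```

A separate consumer task then pulls from the queue at its own pace, so slow processing costs you old frames rather than the connection.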

Why does the sid_length value change between sessions?

The 2-byte length field reflects the byte length of the speaker’s ID string. Different speakers have different name/ID lengths:

  • "Alice" = 5 bytes → sid_length = 0x05 0x00
  • "James Chen" = 10 bytes → sid_length = 0x0A 0x00
  • "Bob" = 3 bytes → sid_length = 0x03 0x00

This is not a delimiter or protocol variation — it is a standard length-prefixed string encoding.

How should I parse the binary format defensively?

Always read the 2-byte length, then read exactly that many bytes. Never hard-code expected length values. A correct parser:

```python
sid_len = int.from_bytes(data[1:3], "little")       # read length
speaker_id = data[3 : 3 + sid_len].decode("utf-8")  # read exactly that many bytes
```

An incorrect parser:

```python
# BAD: hard-codes an expected "delimiter" byte
if data[1] == 0x09 and data[2] == 0x00:
    ...
```

Can speaker names contain non-ASCII characters?

Yes. Speaker names are UTF-8 encoded. A name like "Javier Martínez" is 16 bytes (the accented "í" takes 2 bytes), while "佐藤太郎" is 12 bytes (3 bytes per CJK character). Always use the length prefix — never scan for fixed byte patterns.
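You can verify the UTF-8 byte counts directly:

```python
# Byte length of a name differs from its character count under UTF-8
print(len("Javier Martínez".encode("utf-8")))  # 16: "í" is 2 bytes
print(len("佐藤太郎".encode("utf-8")))          # 12: 3 bytes per CJK character
```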

What if I only need the audio and don’t care about speaker info?

Skip past the headers:

```python
def extract_audio_only(data: bytes) -> bytes:
    if len(data) < 5 or data[0] != 0x01:
        return b""
    sid_len = int.from_bytes(data[1:3], "little")
    off = 3 + sid_len
    sname_len = int.from_bytes(data[off : off + 2], "little")
    off += 2 + sname_len
    return data[off:]
```