Live Audio Capture & Frame Decoding

Meetstream streams real-time meeting audio to your application over a WebSocket connection. Audio arrives as speaker-tagged binary frames from Google Meet, Zoom, and Microsoft Teams — all using the same wire format.


Overview

When you create a bot with the live_audio_required configuration, Meetstream opens a WebSocket connection from the bot to your server and continuously streams binary audio frames for the duration of the meeting.

Enabling Live Audio

Include live_audio_required in your Create Bot API request:

```json
{
  "meeting_url": "https://meet.google.com/abc-defg-hij",
  "live_audio_required": {
    "websocket_url": "wss://your-server.com/audio"
  }
}
```

The websocket_url is a WebSocket endpoint you host. Meetstream connects to it as a client.
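Because Meetstream acts as the WebSocket client, your side must run a server that accepts the incoming connection. A minimal sketch using the third-party `websockets` library — the handler name, host, and port here are illustrative, not part of the API:

```python
import asyncio
import json


async def handle_bot(ws):
    """Handle one bot connection: a JSON handshake, then binary audio frames."""
    async for message in ws:
        if isinstance(message, str):
            # The first message is the JSON "ready" handshake
            handshake = json.loads(message)
            print(f"Bot connected: {handshake['bot_id']}")
        else:
            # Every later message is a binary, speaker-tagged PCM frame
            print(f"Received {len(message)}-byte audio frame")


async def main():
    import websockets  # pip install websockets

    # Host and port are illustrative; expose this behind wss://your-server.com/audio
    async with websockets.serve(handle_bot, "0.0.0.0", 8765):
        await asyncio.Future()  # serve until cancelled
```

Run it with `asyncio.run(main())`. In production you would terminate TLS (the `wss://` part) at a reverse proxy or load balancer in front of this process.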


Connection Lifecycle

1. Bot connects to your WebSocket endpoint

The bot initiates the connection when it joins the meeting.

2. Bot sends a JSON text handshake

The first message is always a JSON text frame:

```json
{
  "type": "ready",
  "bot_id": "bot_abc123",
  "message": "Ready to receive messages"
}
```

3. Binary audio frames stream continuously

Every subsequent message is a binary WebSocket frame containing PCM audio with embedded speaker metadata. Frames arrive continuously for the duration of the meeting.

4. Connection closes when the bot leaves

The WebSocket closes with a normal 1000 close code when the bot exits the meeting.

Timeline

[Diagram: connection lifecycle timeline — handshake, continuous binary audio frames, normal close]


Binary Frame Format

Every audio frame is a single binary WebSocket message with this structure:

```
┌──────────┬────────────┬────────────┬──────────────┬──────────────┬──────────────────┐
│ msg_type │ sid_length │ speaker_id │ sname_length │ speaker_name │ pcm_audio_data   │
│ 1 byte   │ 2 bytes    │ L1 bytes   │ 2 bytes      │ L2 bytes     │ remaining bytes  │
└──────────┴────────────┴────────────┴──────────────┴──────────────┴──────────────────┘
```

Field-by-Field Breakdown

| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| 0 | 1 byte | uint8 | msg_type | Message type. Always 0x01 for PCM audio. |
| 1 | 2 bytes | uint16 LE | sid_length | Byte length of the speaker_id string that follows. |
| 3 | L1 bytes | UTF-8 | speaker_id | Platform-specific unique identifier for the speaker. |
| 3 + L1 | 2 bytes | uint16 LE | sname_length | Byte length of the speaker_name string that follows. |
| 5 + L1 | L2 bytes | UTF-8 | speaker_name | Display name of the speaker as shown in the meeting. |
| 5 + L1 + L2 | remaining | int16 LE | pcm_audio | Raw PCM audio samples. |

Important

  • There are no delimiters between fields. The format is length-prefixed: you read the 2-byte length, then read that many bytes for the string.
  • The sid_length and sname_length values change depending on the length of the speaker’s name and ID. These are not fixed values or delimiters — they are standard unsigned 16-bit little-endian integers encoding a string length.
  • 0x01 is currently the only defined message type. All binary frames on this channel will have 0x01 at byte 0.

Audio Properties

| Property | Value |
|---|---|
| Encoding | Signed 16-bit integer (PCM16) |
| Byte order | Little-endian |
| Sample rate | 48,000 Hz |
| Channels | 1 (mono) |
| Bit depth | 16 bits (2 bytes per sample) |
| Container | None — raw samples, no WAV/MP3/Ogg headers |

To calculate duration from a frame:

duration_seconds = (len(pcm_audio_data) / 2) / 48000
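For example, a 96,000-byte PCM payload holds 48,000 samples, which is exactly one second of audio. As a small helper (the function name is ours, not part of the API):

```python
def frame_duration_seconds(pcm_audio_data: bytes) -> float:
    """Duration of a PCM16 mono payload at 48 kHz."""
    num_samples = len(pcm_audio_data) // 2  # 2 bytes per 16-bit sample
    return num_samples / 48000


print(frame_duration_seconds(b"\x00" * 96000))  # 96,000 bytes -> 1.0 seconds
```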

Hex Dump Walkthrough

A frame from a speaker named "Alice" with ID "user_42":

```
01 07 00 75 73 65 72 5F 34 32 05 00 41 6C 69 63 65 XX XX XX ...
── ───── ──────────────────── ───── ────────────── ─────────
│  │     │                    │     │              └─ PCM16 LE audio samples
│  │     │                    │     └─ "Alice" (5 bytes UTF-8)
│  │     │                    └─ sname_length = 5
│  │     └─ "user_42" (7 bytes UTF-8)
│  └─ sid_length = 7
└─ msg_type = 0x01 (PCM audio)
```

A frame from "James Chen" with ID "James Chen":

```
01 0A 00 4A 61 6D 65 73 20 43 68 65 6E 0A 00 4A 61 6D 65 73 20 43 68 65 6E XX XX ...
── ───── ───────────────────────────── ───── ───────────────────────────── ─────────
│  │     │                             │     │                             └─ PCM audio
│  │     │                             │     └─ "James Chen" (10 bytes)
│  │     │                             └─ sname_length = 10 (0x0A)
│  │     └─ "James Chen" (10 bytes)
│  └─ sid_length = 10 (0x0A)
└─ msg_type = 0x01
```

Note: 0x07 = 7, 0x0A = 10, 0x0E = 14, etc. These are string lengths, not protocol markers.
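To sanity-check a parser, it helps to build a frame yourself and compare against the hex dumps above. A small encoder sketch for this wire format (the helper name is ours, not part of the API):

```python
import struct


def encode_audio_frame(speaker_id: str, speaker_name: str, pcm: bytes) -> bytes:
    """Build a frame: msg_type 0x01, two length-prefixed UTF-8 strings, then PCM."""
    sid = speaker_id.encode("utf-8")
    sname = speaker_name.encode("utf-8")
    return (
        b"\x01"                               # msg_type
        + struct.pack("<H", len(sid)) + sid   # sid_length (uint16 LE) + speaker_id
        + struct.pack("<H", len(sname)) + sname  # sname_length (uint16 LE) + speaker_name
        + pcm                                 # raw PCM16 LE samples
    )


frame = encode_audio_frame("user_42", "Alice", b"\x00\x00")
print(frame[:17].hex(" "))  # 01 07 00 75 73 65 72 5f 34 32 05 00 41 6c 69 63 65
```

The 17 header bytes printed here match the "Alice" walkthrough above byte for byte.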


Decoding Examples

Python

```python
def decode_audio_frame(data: bytes):
    """Decode a Meetstream binary audio frame.

    Returns:
        tuple: (speaker_id, speaker_name, pcm_bytes) on success
        None: if the frame is malformed
    """
    if len(data) < 5 or data[0] != 0x01:
        return None

    # Speaker ID: 2-byte length prefix + UTF-8 string
    sid_len = int.from_bytes(data[1:3], "little")
    if len(data) < 3 + sid_len + 2:
        return None  # frame too short for speaker ID
    speaker_id = data[3 : 3 + sid_len].decode("utf-8")

    # Speaker Name: 2-byte length prefix + UTF-8 string
    off = 3 + sid_len
    sname_len = int.from_bytes(data[off : off + 2], "little")
    off += 2
    if len(data) < off + sname_len:
        return None  # frame too short for speaker name
    speaker_name = data[off : off + sname_len].decode("utf-8")
    off += sname_len

    # Remaining bytes are raw PCM16 LE audio
    pcm_bytes = data[off:]
    return speaker_id, speaker_name, pcm_bytes
```

JavaScript / Node.js

```javascript
function decodeAudioFrame(buffer) {
  if (buffer.length < 5 || buffer[0] !== 0x01) return null;

  // Speaker ID
  const sidLen = buffer.readUInt16LE(1);
  const speakerId = buffer.subarray(3, 3 + sidLen).toString("utf-8");

  // Speaker Name
  let off = 3 + sidLen;
  const snameLen = buffer.readUInt16LE(off);
  off += 2;
  const speakerName = buffer.subarray(off, off + snameLen).toString("utf-8");
  off += snameLen;

  // PCM audio
  const pcmData = buffer.subarray(off);
  return { speakerId, speakerName, pcmData };
}
```

Go

```go
import (
	"encoding/binary"
	"errors"
)

type AudioFrame struct {
	SpeakerID   string
	SpeakerName string
	PCMData     []byte
}

func DecodeAudioFrame(data []byte) (*AudioFrame, error) {
	if len(data) < 5 || data[0] != 0x01 {
		return nil, errors.New("invalid frame")
	}

	sidLen := int(binary.LittleEndian.Uint16(data[1:3]))
	if len(data) < 3+sidLen+2 {
		return nil, errors.New("frame too short for speaker ID")
	}
	speakerID := string(data[3 : 3+sidLen])

	off := 3 + sidLen
	snameLen := int(binary.LittleEndian.Uint16(data[off : off+2]))
	off += 2
	if len(data) < off+snameLen {
		return nil, errors.New("frame too short for speaker name")
	}
	speakerName := string(data[off : off+snameLen])
	off += snameLen

	return &AudioFrame{
		SpeakerID:   speakerID,
		SpeakerName: speakerName,
		PCMData:     data[off:],
	}, nil
}
```

Java

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class MeetstreamAudioFrame {
    public final String speakerId;
    public final String speakerName;
    public final byte[] pcmData;

    private MeetstreamAudioFrame(String speakerId, String speakerName, byte[] pcmData) {
        this.speakerId = speakerId;
        this.speakerName = speakerName;
        this.pcmData = pcmData;
    }

    public static MeetstreamAudioFrame decode(byte[] data) {
        if (data.length < 5 || data[0] != 0x01) return null;

        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        buf.get(); // skip msg_type

        int sidLen = buf.getShort() & 0xFFFF;
        byte[] sidBytes = new byte[sidLen];
        buf.get(sidBytes);
        String speakerId = new String(sidBytes, StandardCharsets.UTF_8);

        int snameLen = buf.getShort() & 0xFFFF;
        byte[] snameBytes = new byte[snameLen];
        buf.get(snameBytes);
        String speakerName = new String(snameBytes, StandardCharsets.UTF_8);

        byte[] pcmData = new byte[buf.remaining()];
        buf.get(pcmData);

        return new MeetstreamAudioFrame(speakerId, speakerName, pcmData);
    }
}
```

Full Receiver Examples

Python — Receive and Log

```python
import asyncio
import json

import websockets


async def receive_audio():
    async with websockets.connect("wss://your-server.com/audio") as ws:
        async for message in ws:
            # First message is a JSON text handshake
            if isinstance(message, str):
                handshake = json.loads(message)
                print(f"Bot connected: {handshake['bot_id']}")
                continue

            # All subsequent messages are binary audio frames
            result = decode_audio_frame(message)
            if result is None:
                continue

            speaker_id, speaker_name, pcm_bytes = result
            num_samples = len(pcm_bytes) // 2
            duration_ms = (num_samples / 48000) * 1000
            print(f"[{speaker_name}] {num_samples} samples ({duration_ms:.0f}ms)")


asyncio.run(receive_audio())
```

Node.js — Receive and Log

```javascript
const WebSocket = require("ws");

const ws = new WebSocket("wss://your-server.com/audio");

ws.on("message", (data, isBinary) => {
  if (!isBinary) {
    const handshake = JSON.parse(data.toString());
    console.log(`Bot connected: ${handshake.bot_id}`);
    return;
  }

  const frame = decodeAudioFrame(data);
  if (!frame) return;

  const numSamples = frame.pcmData.length / 2;
  const durationMs = (numSamples / 48000) * 1000;
  console.log(`[${frame.speakerName}] ${numSamples} samples (${durationMs.toFixed(0)}ms)`);
});
```

Working with PCM Audio

Convert to NumPy Array (Python)

```python
import numpy as np

# To int16 sample array
samples = np.frombuffer(pcm_bytes, dtype=np.int16)

# To float32 (-1.0 to 1.0) — standard format for ML models and audio libraries
float_samples = samples.astype(np.float32) / 32768.0
```

Save as WAV File (Python)

```python
import wave


def save_wav(pcm_bytes: bytes, path: str, sample_rate: int = 48000):
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)
```

Accumulate and Save a Full Meeting Recording

```python
import wave

audio_buffer = bytearray()

# Inside your receive loop:
speaker_id, speaker_name, pcm_bytes = decode_audio_frame(message)
audio_buffer.extend(pcm_bytes)

# When the meeting ends:
with wave.open("meeting_recording.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(48000)
    wf.writeframes(bytes(audio_buffer))
```

Resample to a Different Rate (Python)

Many speech-to-text services expect 16 kHz audio. Resample with numpy:

```python
import numpy as np


def resample_pcm16(pcm_bytes: bytes, src_rate: int, dst_rate: int) -> bytes:
    if src_rate == dst_rate:
        return pcm_bytes

    samples = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32)
    num_output = int(len(samples) * dst_rate / src_rate)
    t_in = np.linspace(0, 1, len(samples), endpoint=False)
    t_out = np.linspace(0, 1, num_output, endpoint=False)
    resampled = np.interp(t_out, t_in, samples)
    return np.clip(resampled, -32768, 32767).astype(np.int16).tobytes()


# Example: 48kHz → 16kHz
pcm_16k = resample_pcm16(pcm_bytes, 48000, 16000)
```

Convert to Int16 Array in JavaScript

```javascript
function pcmBytesToInt16Array(buffer) {
  const int16 = new Int16Array(buffer.length / 2);
  for (let i = 0; i < int16.length; i++) {
    int16[i] = buffer.readInt16LE(i * 2);
  }
  return int16;
}

function pcmBytesToFloat32Array(buffer) {
  const float32 = new Float32Array(buffer.length / 2);
  for (let i = 0; i < float32.length; i++) {
    float32[i] = buffer.readInt16LE(i * 2) / 32768.0;
  }
  return float32;
}
```

Speaker Identification

Each frame includes both a speaker_id and a speaker_name:

| Field | Description | Example |
|---|---|---|
| speaker_id | A platform-specific unique identifier. Stable within a session. | "user_42", "1234567890", "NoSpeaker" |
| speaker_name | The display name shown in the meeting UI. May not be unique. | "Alice", "David Hill", "NoSpeaker" |

Platform Behavior

| Platform | speaker_id | speaker_name |
|---|---|---|
| Google Meet | Participant ID from the meeting DOM | Display name from the meeting |
| Zoom | Zoom SDK node_id (per-participant) | GetUserName() from the SDK |
| Teams | Dominant speaker display name | Dominant speaker display name |

Handling "NoSpeaker"

If the bot cannot determine the active speaker, both fields will be "NoSpeaker". This can happen during the first moments of a meeting or during mixed audio when speaker attribution is unavailable.
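If your pipeline labels audio by speaker, decide up front how unattributed frames are handled. A minimal sketch (the helper name is ours; the "NoSpeaker" sentinel is from the protocol):

```python
def attribute_frame(speaker_id: str, speaker_name: str):
    """Return a display label for a frame, or None when attribution is unavailable."""
    if speaker_id == "NoSpeaker":
        return None  # mixed/unattributed audio: keep the audio, just don't label it
    return f"{speaker_name} ({speaker_id})"


print(attribute_frame("user_42", "Alice"))        # Alice (user_42)
print(attribute_frame("NoSpeaker", "NoSpeaker"))  # None
```

Dropping "NoSpeaker" frames entirely would create gaps in a recording, so it is usually better to keep the audio and skip only the labeling step.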


Message Types on the Audio Channel

| Byte 0 | Type | Format | Description |
|---|---|---|---|
| N/A | Handshake | JSON text | Sent once on connect. {"type": "ready", ...} |
| 0x01 | PCM Audio | Binary | Speaker-tagged audio frame (described above) |

0x01 is currently the only binary message type. All binary frames will have 0x01 at position 0. Future protocol versions may introduce additional types — check byte 0 and skip unknown types for forward compatibility.
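The forward-compatible pattern is to dispatch on byte 0 and silently drop frames with an unknown type. A sketch (the function name and the tuple shape are ours):

```python
def dispatch_binary_frame(data: bytes):
    """Route a binary frame by its msg_type byte; ignore unknown types."""
    if not data:
        return None
    msg_type = data[0]
    if msg_type == 0x01:
        # Hand the rest of the frame to your PCM audio decoder
        return ("pcm_audio", data[1:])
    return None  # unknown future message type: skip for forward compatibility


print(dispatch_binary_frame(b"\x01abc"))  # ('pcm_audio', b'abc')
print(dispatch_binary_frame(b"\x02xyz"))  # None
```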


FAQ

What sample rate does the audio arrive at?

48,000 Hz for all platforms (Google Meet, Zoom, Teams).

Is the audio mixed or per-speaker?

The audio is mixed — it contains all meeting participants combined into a single mono stream. Speaker metadata (speaker_id, speaker_name) indicates who was the dominant speaker when the frame was captured, but the audio itself contains everyone.

How large is each frame?

Frame sizes vary. Typical frames contain 1,000 to 50,000+ samples (20ms to 1+ seconds of audio). The size depends on the platform’s audio capture interval and buffering.

Can I receive audio from specific speakers only?

No. The bot streams mixed audio. Use the speaker_name or speaker_id metadata for your own filtering or labeling logic after decoding.

Do I need to send an acknowledgment for each frame?

No. The protocol is fire-and-forget. The bot streams continuously and does not expect any response on the audio channel.

What happens if my server is slow to consume frames?

Frames will buffer in the WebSocket layer. If the buffer grows too large, the connection may drop. Ensure your receiver processes or discards frames promptly.
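One way to keep the socket drained is to decouple receiving from processing with a bounded queue, dropping the oldest frame under backpressure. A sketch of that policy (the helper name and queue size are illustrative):

```python
import asyncio


async def enqueue_frame(queue: asyncio.Queue, frame: bytes) -> bool:
    """Enqueue a frame; if the queue is full, drop the oldest frame to make room.

    Returns True if an old frame was dropped.
    """
    dropped = False
    if queue.full():
        queue.get_nowait()  # discard the oldest buffered frame
        dropped = True
    queue.put_nowait(frame)
    return dropped


async def demo():
    queue = asyncio.Queue(maxsize=2)  # small bound for demonstration
    print(await enqueue_frame(queue, b"f1"))  # False
    print(await enqueue_frame(queue, b"f2"))  # False
    print(await enqueue_frame(queue, b"f3"))  # True: f1 was dropped


asyncio.run(demo())
```

A separate consumer task then pulls from the queue at its own pace, so slow processing costs you old frames rather than the connection.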

Why does the sid_length value change between sessions?

The 2-byte length field reflects the byte length of the speaker’s ID string. Different speakers have different name/ID lengths:

  • "Alice" = 5 bytes → sid_length = 0x05 0x00
  • "James Chen" = 10 bytes → sid_length = 0x0A 0x00
  • "Bob" = 3 bytes → sid_length = 0x03 0x00

This is not a delimiter or protocol variation — it is a standard length-prefixed string encoding.

How should I parse the binary format defensively?

Always read the 2-byte length, then read exactly that many bytes. Never hard-code expected length values. A correct parser:

```python
sid_len = int.from_bytes(data[1:3], "little")       # read length
speaker_id = data[3 : 3 + sid_len].decode("utf-8")  # read exactly that many bytes
```

An incorrect parser:

```python
# BAD: hard-codes an expected "delimiter" byte
if data[1] == 0x09 and data[2] == 0x00:
    ...
```

Can speaker names contain non-ASCII characters?

Yes. Speaker names are UTF-8 encoded. A name like "Javier Martínez" is 16 bytes (the accented "í" takes 2 bytes), while "佐藤太郎" is 12 bytes (3 bytes per CJK character). Always use the length prefix — never scan for fixed byte patterns.
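You can verify the UTF-8 byte counts directly:

```python
# Byte length of a name differs from its character count under UTF-8
print(len("Javier Martínez".encode("utf-8")))  # 16: "í" is 2 bytes
print(len("佐藤太郎".encode("utf-8")))          # 12: 3 bytes per CJK character
```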

What if I only need the audio and don’t care about speaker info?

Skip past the headers:

```python
def extract_audio_only(data: bytes) -> bytes:
    if len(data) < 5 or data[0] != 0x01:
        return b""
    sid_len = int.from_bytes(data[1:3], "little")
    off = 3 + sid_len
    sname_len = int.from_bytes(data[off : off + 2], "little")
    off += 2 + sname_len
    return data[off:]
```