Daily API: Developer Tips to Build Real-time Voice, Video, and AI into Apps

Recording improvements: next-gen raw tracks and new compositor with layout animations

Pauli Olavi Ojala — Fri, 22 May 2026 17:12:05 GMT

We’re happy to announce some major upgrades to video recording on Daily for both raw-tracks and single-file cloud recording.

These updates will become defaults for all accounts on May 26th, 2026. Before that date, you can already turn on these features today by using properties (either for your entire domain or for an individual room). We have tested these extensively in production with customers, so there shouldn't be any disruption to your recording experience.

0:00

/0:13

Event-driven raw-tracks compositing

The raw-tracks recording mode on Daily does exactly what it says on the tin: instead of a single composite MPEG-4 video, you get the meeting’s raw participant track data as individual media files. Nothing lost, nothing added. But the full story is not that simple. Because these track files are really just data packets captured in realtime over a network, they can have packet loss, resolution changes, etc. — all kinds of things that are very different from the traditional stable camera-originated media that post-processing applications like video editors are designed to operate on.

For easier post-processing, we provide an open source Github repo called raw-tracks-tools. But its capabilities have been limited by what’s available in the files. The tracks on their own don’t provide essential context about what was actually happening in the meeting room: users turning on and off their cameras and microphones, active speaker status changes, application level messages, etc.

To remedy this, some time ago we added a dataOutputs configuration option to startRecording calls that lets you capture events and transcripts as data files alongside the recorded media. This event JSON data output is so useful for raw-tracks that we’re making it the default for all raw-tracks sessions.

The event JSON is a small file that gets written in the same S3 bucket as the raw-tracks media files. The raw-tracks-tools scripts have also been updated to make use of these events for more reliable meeting reconstruction and more features. So you can now create much richer post-processing composites.

Let’s say you’ve got a long meeting recorded in raw-tracks mode. There are people coming and going, pausing and resuming their video, live transcription was turned on, and people are chatting too. All this data is present in the event JSON and can be rendered into a video composite after the fact, with visual results that are identical to what you’d get by running a regular cloud recording during the meeting. You can use all VCS features like switching layouts, rendering captions (for transcriptions), sidebars for chat messages, etc.

To enable the event JSON output today, you can set the property enable_raw_tracks_event_json: true on either your domain or a room. And if you prefer not to get this file in your S3 bucket, you can use this same property to explicitly turn it off (even after it becomes the default on May 26).

💡

When you fetch an access link for a raw-tracks recording via the Daily REST API, the response object includes the event JSON separately from the media tracks. The S3 key for the event JSON can be found under dataOutputs[‘event-json’] while track keys are in their usual place, as an array under the tracks property.

Transcoded gapless audio (WAV / AAC) for raw-tracks

As was mentioned above, recording raw client tracks really means that you're capturing every artifact of the connection. If there’s packet loss or the participant is not sending, the media file will simply not contain any data at that time. For audio specifically, this “no data” situation is subtly different from silence. Actual silence means that zeros have been encoded, while a gap in data means it could be either silence or missing data. (This is basically analogous to boolean false vs. a boolean|null type.)

Many of our customers do post-processing on their raw-tracks audio recordings. To make it easier to use these files, we’ve added a “transcoded gapless” setting that lets you record audio files that are already transcoded into a standard format that all audio processing applications can easily deal with. There are no data gaps. Silence has already been mixed in, so the audio data is always uniform.

To enable this mode, you can set the property enable_raw_tracks_transcoded_audio on either your domain or a specific room. This property’s value must be either null or one of these strings:

wav-48k-mono - WAV files at 16-bit 48kHz sample rate, mono
wav-44k1-mono - WAV files at 16-bit 44.1kHz sample rate, mono
wav-48k-stereo - WAV files at 16-bit 48kHz sample rate, stereo
wav-44k1-stereo - WAV files at 16-bit 44.1kHz sample rate, stereo
wav - alias for wav-48k-stereo, the highest quality WAV setting
aac - AAC (.m4a) files at 160kbps - a good compromise if you don’t need uncompressed audio
null - turns off transcoding, audio tracks will be recorded as raw WebM (default)

💡

You can find out the audio format of a raw-tracks recording session by inspecting the event JSON file saved alongside the media files. Each time a file is written to S3, a recording-media-started event is logged. These events have a contentType field that will tell you whether it’s WebM, AAC, or WAV.

New cloud compositor with animation support

We also have improvements to cloud recording mode. It has been upgraded to use the VCSRender compositor, which is the exact same code that powers the compositing features in raw-tracks-tools that was discussed above. This means you have a guarantee that the rendering output will be identical whether your composite is rendered on the fly (using cloud recording) or as a post-processing action using raw-tracks.

💡

This upgrade also applies to RTMP/HLS live streaming which uses the same cloud rendering infrastructure.

The new compositor offers better performance overall, so even large rooms with dozens of participants visible in the recording won’t stutter. (Well, due to CPU limitations at least! If a participant isn’t sending video frames or there’s network packet loss, the cloud recording compositor can’t fix that...)

A significant new feature enabled by the performance upgrade is support for full frame rate layout animations. Video layer positions and opacities can now be changed at 30fps with negligible overhead. VCS contains a declarative animation system that lets you build entirely custom animations if you want, but there’s also an easy way to get animations automatically in your recordings with a single setting.

You can turn on the new composition param enableLayoutAnims to get smooth animations every time the video layout changes. For example, when a participant is added to a grid, the new video layer will fade in, and the other participant’s layer positions will animate to new positions as the grid changes.

These animations work with any layout mode and can be enabled at the start of recording, or in an update call during the session. For example:

startRecording({
  layout: {
    preset: "custom",
    composition_params: {
      mode: "grid",
      enableLayoutAnims: true
    }
  }
})

To enable the new compositor today, you can set the property enable_legacy_compositor: false on either your domain or a room. If for any reason you prefer to remain on the legacy compositor after the May 26 date, you can set this property to true (and please tell us why, so we can fix whatever is keeping you on the old code).

NVIDIA Nemotron 3 Super

Kwindla Hultman Kramer — Wed, 11 Mar 2026 16:38:05 GMT

NVIDIA Nemotron 3 Super, the new open source LLM launched by NVIDIA today, marks an inflection point for voice AI developers.

We've tested Nemotron 3 Super in voice agents, run our benchmarks against the model, and calculated the cost of running it in production. We can fine-tune this model. We can extensively post-train it using Nemotron's open data sets and open training code. Together with Nemotron 3 Nano (a 30B parameter model released in November), and Nemotron Speech ASR (a realtime speech-to-text model released in January), this model forms a stack that is the first meaningful open source alternative (and complement) to proprietary API services for voice AI developers.

On our long-conversation voice agent benchmarks, Nemotron 3 Super performs at the same level as the new GPT-5.4 models. It also performs better than both GPT-4.1 and Gemini 2.5 Flash, which are the two most widely used models for production voice agents.

Nemotron 3 Super matches gpt-4.1 and the new gpt-5.4 models on our long-conversation voice agent benchmarks. The one caveat is that we don't have fully optimized inference tooling, yet, so we can't do an apples-to-apples comparison of time to first token. For more on inference tuning and latency, see section [x] below.

Today, we’re also introducing a new set of agentic task benchmarks: the Gradient Bang evals. Nemotron 3 Super is one of two open models in the top 10. The other is GLM-5, which is a 745B parameter model. (Roughly 6x the size of Nemotron 3 Super!)

Voice agents today are multi-agent systems. We launch sub-agent tasks to perform searches, do smart context compression, analyze data, and interact with backend systems. This new benchmark tests these sub-agent tasks and is designed to be very challenging.

The realtime AI systems we're building today are increasingly multi-model and multi-agent. We're building voice agents that orchestrate long-running sub-agents to search and compile information, agentic controllers for robots that fuse data from onboard sensors and implement hybrid local/cloud inference, and software interfaces that combine voice input and multi-modal output.

Voice agent performance requirements

These voice agent and realtime AI use cases are extremely challenging for LLMs.

Voice conversations are multi-turn and open-ended.
Responses must be fast. We need TTFTs under 700ms to build voice agents responsive enough for people to happily talk to.
Many voice agents have fairly large system instructions. 4,000 tokens or more is not uncommon.
Typical agents define several tools, and accurate tool calling is critical.
Enterprise voice agents have demanding requirements for instruction following accuracy. For example, if the first step in a conversation with a healthcare patient is to confirm their identity, that step must be performed accurately, and the conversation must not proceed beyond that step until it’s done.

We maintain open source benchmarks that test instruction following accuracy, tool calling reliability, hallucinations/grounding, and latency. Our standard benchmark is the aiewf medium context eval. This benchmark is a 30-turn conversation with an 8,000 token system instruction and six tool calls.

As voice agent developers, we’re often in the position of having to choose between low latency and high intelligence. GPT-4.1 is the most widely used LLM for production voice agents. (As the core maintainers of the Pipecat open source framework, and the team behind the commercial voice hosting platform Pipecat Cloud, we have fairly comprehensive usage statistics.) And the benchmark shows why. GPT-4.1 is the fastest model that scores above 95%.

Nemotron 3 Super scores slightly above GPT-4.1 – 97% compared to 96.3%.

Inference tooling notes

One caveat for the benchmark numbers above: it’s not possible quite yet to do an apples-to-apples latency comparison between Nemotron 3 Super and other models.

The Nemotron 3 architecture is new and, from the perspective of those of us who've been writing transformer-centric inference code for the past couple of years, a little weird. (In a good way!) Nemotron 3 models have both transformer and Mamba layers. The Mamba layers compress context into a rolling, fixed-size state rather than a traditional transformer's key-value cache. This keeps the Mamba context small and keeps inference fast no matter how long the context is. But now we have two different state mechanisms that our inference tooling needs to manage and optimize.

So inference speed for Nemotron 3 models is not yet as fast as it is for older architectures. In particular, prefill caching for the Nemotron 3 hybrid Mamba-Transformer mixture of experts architecture is only partially implemented in the widely used inference frameworks like vLLM, SGLang, and TRT-LLM. But we expect this to change quickly as the open source community embraces these models.

For benchmarking BF16 and FP8 checkpoints of the model on NVIDIA B200 hardware on Modal, we used this command.

export TOKENIZERS_PARALLELISM=false
export VLLM_FLASHINFER_MOE_BACKEND=throughput
export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm
vllm serve $MODEL_REFERENCE \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name nemotron-3-super-120b \
    --async-scheduling \
    --tensor-parallel-size 1 \
    --swap-space 0 \
    --trust-remote-code \
    --reasoning-parser-plugin /opt/nim/nemotron_middleware/src/vllm_reasoning_parser_v2.py \
    --reasoning-parser super_v3_enhanced \
    --enable-auto-tool-choice \
    --tool-parser-plugin /opt/nim/nemotron_middleware/src/vllm_tool_parser.py \
    --tool-call-parser super_v3 \
    --gpu-memory-utilization 0.9 \
    --enable-expert-parallel \
    --max-num-seqs 4 \
    --no-enable-prefix-caching

The base TTFT we’re seeing for this model on an NVIDIA B200 is ~38ms. Prefill processing can be as fast as 8k tokens per second. Generation throughput is roughly 200 tokens per second.

The NVFP4 quantization of Nemotron 3 Super runs on the NVIDIA DGX Spark desktop AI supercomputer, too.

NVIDIA DGX Spark

We're 1,000 tokens per second prefill and 14 tokens per second generation on the DGX Spark, with this configuration.

docker run -d \
    --name $CONTAINER_REFERENCE \
    --gpus all \
    --ipc=host \
    --ulimit memlock=-1 \
    --shm-size=32g \
    -p 8001:8000 \
    -v /home/ubuntu/models:/model \
    -v /home/ubuntu/.cache/
    -v /home/ubuntu/.cache/vllm_ubuntu:/home/ubuntu/.cache/vllm \
    -v /home/ubuntu/.cache/flashinfer:/home/ubuntu/.cache/flashinfer \
    -e HF_HOME=/home/ubuntu/.cache/huggingface \
    -e HF_MODULES_CACHE=/home/ubuntu/.cache/huggingface/modules \
    -e TRANSFORMERS_CACHE=/home/ubuntu/.cache/huggingface/transformers \
    -e TOKENIZERS_PARALLELISM=false \
    -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
    -e VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm \
    -e MAX_JOBS=1 \
    --attention-backend TRITON_ATTN \
    --enforce-eager'

I , therefore I am

Nemotron 3 Super is a reasoning model. It is trained to “think” before producing final output. The initial thinking segments are visible in the inference output and are wrapped in tags.

You can disable reasoning. But, like almost all recent SOTA models, tool calling performance for Nemotron 3 Super is very poor with reasoning disabled. So we need to run the model in reasoning mode. But the initial thinking segment adds to response latency.

We can set a “thinking budget,” which forces the model to stop thinking after emitting a certain number of initial reasoning tokens. We did a parameter sweep to find the optimal thinking budget for the aiewf medium context benchmark. It turns out that setting a budget of 512 tokens is a good compromise between intelligence and latency. Because thinking is dynamic, most of the time we won’t hit the 512-token limit. But when we need those extra reasoning tokens, thinking for a few hundred milliseconds more to get an accurate tool call is worthwhile.

Here’s the distribution of thinking segment lengths, from a set of ~300 inference calls performed during an aiewf medium context benchmark test.

Percentile	Tokens
Min	11
P25	78
P50	203
P75	265
P90	288

We can fine-tune this model. (And we probably should!)

Voice AI developers have been in a funny position over the past year, as models have gotten better, but also slower.

Almost all recent SOTA models are post-trained as reasoning models. This has led to impressive improvements in agentic task abilities. But the focus on RL for reasoning has meant that new models often don’t perform well with reasoning disabled. Or they have minimum reasoning settings that all but guarantee latencies that are untenable for voice agent use cases.

OpenAI describes GPT-4.1 as the company’s “smartest non-reasoning model.” GPT-4.1 is almost a year old, which makes it a very old model in today’s incredibly fast-moving AI landscape. OpenAI may never release another generation of non-reasoning models!

From https://developers.openai.com/api/docs/models

So where does that leave those of us building for production use cases – like voice agents – that require low latency?

Nemotron 3 Super provides a path forward. Not the model in the form that was released today (though it’s an excellent model). But in future versions that might be created by the NVIDIA model team, or might be created by the open source community.

The Nemotron 3 models are completely open source. The weights are open, of course. And, in addition, the training data sets, training code, and inference tooling are open source.

This opens up many possibilities for fine-tuning. Or even for doing more intensive post-training than would normally be described as fine tuning.

The “reasoning heavy” RL that is the current norm for SOTA models is only one point on a multidimensional tradeoff surface. I’d like to see a version of Nemotron 3 Super that is additionally post-trained specifically to improve tool calling with reasoning turned off. My experience working on models suggests this is a data set generation and engineering task, not a research question. We can very likely hill climb on function call accuracy metrics in a fairly straightforward way.

Let me know if you’d like to work on this project. Training models is fun!

Deploying this model

You can run Nemotron 3 Super on a range of NVIDIA data center GPU configurations. My preferred stack for testing the model has been NVIDIA DGX B200 instances on the Modal AI cloud. Modal’s tooling makes it easy to deploy different inference server configurations, spin up capacity to run benchmarks and demos, and then spin down those applications so I’m only paying for the time I’m actually using the GPUs.

Here’s some sample code for running Nemotron on Modal.

Modal also supports a variety of low-latency networking capabilities and deployment components that are very, very useful for running voice agent inference at scale. And the Modal team also has lots of experience supporting voice agent use cases.

We did an NVIDIA livestream last month about using Nemotron models, Modal infrastructure, Pipecat, and Daily WebRTC for voice AI. Check that out if you’re interested in seeing a really great code walk-through from Ben, who builds this stuff at Modal.

And come hang out with us in the Pipecat Discord if you’re interested in general discussion about voice AI models, tooling, orchestration, and infrastructure.

Benchmarking STT for Voice Agents

Mark Backman — Fri, 13 Feb 2026 17:19:57 GMT

Today we're releasing a new benchmark for Speech-to-Text (STT) providers, focused on two dimensions that matter most for voice agents: transcription latency and semantic accuracy.

Our goal is specifically to evaluate STT performance for realtime voice agents. The best voice agents today can hear a user speak, transcribe their input, reason about a response, and begin generating audio in under a second. To hit that budget, every component in the pipeline needs to be fast. But speed alone isn't enough - the transcription also needs to be accurate enough that the LLM understands what the user actually said.

When a user speaks to a voice agent, three things matter:

Semantic accuracy: The transcription doesn't need to be a perfect written record. It needs to convey the user's intent clearly enough for an LLM to respond correctly. This is more forgiving than traditional Word Error Rate (WER), because LLMs are extraordinary next-word predictors. They handle contractions, filler words, and minor grammatical variations without issue. What they can't handle is a wrong noun or a garbled name.
Latency: A natural conversation has short pauses between turns. If your STT service takes over a second to finalize a transcript, you've burned most of the latency budget before the LLM even starts thinking.
Turn detection: Has the user finished speaking? The most natural conversations emit transcripts quickly when a user's turn is finished, but give them time to think when they need it.

Existing STT benchmarks do a good job of measuring transcription accuracy, but voice agents require a different focus. Traditional WER penalizes "gonna" vs. "going to" as two errors - technically correct, but irrelevant when the transcript is being consumed by an LLM. And accuracy alone doesn't tell you whether a service is fast enough to keep a conversation feeling natural. We wanted a benchmark that evaluates both dimensions together, on real-world audio, through the lens of voice agent performance.

This benchmark focuses on accuracy and latency. Turn detection is a separate, complex topic that deserves its own treatment (we address it briefly below).

The source code for this benchmark is available on GitHub: https://github.com/pipecat-ai/stt-benchmark

Results

We benchmarked 10 STT services on 1,000 samples of real human speech. Here are the headline numbers:

A few things stand out:

The STT space is heating up. As voice agents have grown in demand, STT providers have been investing heavily. There are now multiple excellent options delivering state-of-the-art performance on both accuracy and latency. A year ago, this table would have looked very different.

There's a clear latency-accuracy frontier. Some services are fast, some are accurate, and a handful manage to be both. The Pareto frontier below makes this trade-off visible:

Typical latency (Median TTFS)

Worst-case latency (P95 TTFS)

Services on the Pareto frontier offer the best trade-off between latency and accuracy. No other service is better on both metrics. The "ideal" corner is bottom-left: fastest and most accurate.

On median latency, three services sit on the frontier:

Deepgram Nova 3 (247ms, 1.62% WER)
Soniox (249ms, 1.29% WER)
Speechmatics (495ms, 1.07% WER)

These represent different points on the speed-accuracy curve. But median latency only tells half the story.

Why P95 latency matters more than you think

When building voice agents, the input pipeline typically looks like this:

1. Transport: Receives streaming audio from the user, sends audio to the STT

2. STT: Transcribes speech in realtime, emitting partial and final transcripts

3. Context aggregator: Application code that collects transcripts and decides when a user has started and stops speaking

4. LLM: Perform inference to generate a response

There's a critical tension in step 3: you can't send transcripts to the LLM as they arrive. Streaming STT services emit multiple transcript segments over time. If you triggered an LLM completion on every partial or final transcript, you'd waste tokens and get responses based on incomplete input.

P95 tells you how long it will take the STT service to deliver a complete transcript, 95 times out of 100.

Median latency describes the average experience. P95 latency characterizes the worst-case experience. And in conversation, the worst cases are what users remember. A service with great median but poor P95 indicates inconsistent performance: usually fast, but occasionally slow enough to break the conversational flow.

A note on speech-to-speech models: new speech-to-speech models such as Ultravox, OpenAI Realtime, and Gemini Live simplify the voice agent pipeline by eliminating separate transcription and speech generation models. For a variety of reasons, almost all production voice agents today use the multi-model pipeline described above. For more on this, see the LLMs for voice agents open source benchmark, and the Voice AI Guide.

Finalization support is important

Some STT services support finalization, which means that we can send a signal to tell the service that the user has stopped speaking. The service then confirms the final transcript has been received with tokens or metadata in the transcription message, closing the loop with our application code. Given this finalization metadata, application code knows with certainty that the complete user input has been received, and can make an informed decision: proceed to LLM inference immediately, or defer to give the user more time to speak.

Services that support finalization don't require the application to wait an arbitrary amount of time, hoping that the STT service has completed processing. In practice, this means that for services that support finalization, the response time the user experiences will be approximately the median latency. But for services that do not support finalization, we have to wait longer to try to be sure we’ve gotten a complete transcript. We generally use the service’s P95 as a guide for how long to wait, and this means that every turn takes as long as the P95.

It's worth noting that the services with the lowest latency in our benchmark all support some form of finalization. When evaluating STT providers for voice agents, finalization support is worth considering alongside raw latency numbers.

That said, P95 still matters. It characterizes the upper bound of how long finalization itself takes. When evaluating STT for production voice agents, we recommend looking at P95 (or even P99) latency, not just medians. Your users experience both the average and tail of the latency distribution.

Semantic WER: Transcription accuracy for voice agents

Traditional Word Error Rate (WER) measures transcription accuracy by counting every word-level difference between a transcript and ground truth. "Gonna" versus "going to"? That's two errors. Missing a comma? Error. "3" instead of "three"? Error. This is a well-established metric that gives a clear, reproducible measure of transcription fidelity.

But in a voice agent pipeline, the transcript isn't the end product-it's input to an LLM. And LLMs are remarkably good at understanding natural language variations. Contractions, filler words, minor grammatical differences, number formats; an LLM handles all of these without issue. It will respond identically to "I'm gonna need 3 tickets" and "I am going to need three tickets." From the LLM's perspective, those are the same request.

This means many of the errors that traditional WER counts simply don't matter for voice agents. We need a different way to think about transcription accuracy. One that asks whether the LLM would understand the transcript correctly, not whether every word matches exactly.

Consider this example:

Ground truth: "Can you describe the process for changing my legal name on official documents like my driver's license and social security card?"
STT output: "Can you describe the process for changing my legal name on official documents like my driver licenses and social security card"

Traditional WER would flag multiple errors: "driver's" vs "driver", "license" vs "licenses", missing punctuation. But would an LLM respond any differently to these two inputs? No. Both clearly ask the same question. An LLM would understand both identically.

Now consider this:

Ground truth: "I need to renew my prescription."
STT output: "I need to review my prescription."

"Renew" to "review" - that's a real error. Both are plausible requests, but an LLM would take a completely different action for each. This is the kind of error that matters.

How Semantic WER works

We use Claude to perform multi-step semantic evaluation of each transcription. The process:

1. Normalize both texts: lowercase, expand contractions, normalize numbers, remove filler words, standardize spelling variations

2. Align the normalized texts word-by-word

3. Semantic check: For each difference, ask: *"Would an LLM agent respond differently to these two versions?"*

4. Count only the differences where the answer is yes

5. Calculate WER from the semantic error counts

The key question at step 3 is what makes this different from traditional WER. Here's how the semantic check works in practice:

Not errors (LLM would understand both the same way)

Real errors (LLM would misunderstand)

The evaluation prompt

The full evaluation uses a detailed system prompt with normalization rules, few-shot examples, and a structured calculate_wer tool that Claude calls after completing the analysis. Each evaluation produces a full reasoning trace that we store for auditability. You can inspect exactly *why* Claude counted or didn't count each difference.

We use claude-sonnet-4-5-20250929 for evaluation. The full prompt and tool definition are in the benchmark source code.

Pooled WER, mean WER, and perfect samples count

We report both:

Mean WER: Average WER across all samples. Simple, but can be skewed by short samples where a single error produces a high percentage.
Pooled WER: Total errors across all samples divided by total reference words. Weighted by sample length, so longer (harder) samples contribute proportionally more. Generally a more stable metric
Perfect: How many of the input samples had perfect semantic WER scores

How the benchmark works

The pipeline

This benchmark is built on Pipecat, an open-source framework for building voice agents. Each benchmark run creates a Pipecat pipeline:

The Synthetic Transport plays pre-recorded PCM audio through the pipeline at realtime pace - 20ms chunks at 16kHz, exactly as a live microphone would deliver them. It uses Silero VAD to detect speech boundaries, emitting the same start/stop speaking events that a real transport would.

Each STT service processes the audio stream and emits transcription frames. Pipeline observers capture both the transcripts and timing metrics. After audio playback completes, the transport continues sending silence to give streaming services time to finalize their last segments.

This approach means every service sees exactly the same audio, delivered in exactly the same way, through the same pipeline infrastructure. The only variable is the STT service itself.

Measuring TTFS (Time To Final Segment)

For other services in a voice agent pipeline, like LLM and TTS, there’s a discrete input with a corresponding output, making latency straightforward to measure. Streaming STT works differently. User audio is continuously streamed to the STT provider:

This continuous stream can generate multiple transcription segments. Because we need the full transcription for the next stage in the pipeline (LLM inference), we measure from the last transcription segment received. We use this final transcript along with the VAD's stop-speaking event as two points in time to calculate a metric we call “time to final transcript,” (TTFS):

The time difference between the receipt of the last transcript and the time at which the user stopped speaking is the TTFS—the latency between the user’s final utterance and the last transcript from that utterance.

Data

All benchmark data comes from the pipecat-ai/smart-turn-data-v3.1-train dataset:

- 1,000 samples of real human speech

- PCM streaming audio at 16kHz mono

- English language only (for this benchmark)

- Real-world recording conditions: variable microphone quality, background noise, varying audio clarity. The goal is to represent what a voice agent actually hears in production, not laboratory conditions.

See the full data set and ground truth transcriptions:

https://huggingface.co/datasets/pipecat-ai/stt-benchmark-data

Ground truth

Ground truth transcriptions were generated in two passes:

1. Initial transcription by Gemini 2.5 Flash, with a prompt emphasizing literal, phonetic accuracy.

2. Human review and correction. Each transcription was reviewed with audio playback. Errors were corrected and the corrections tracked with an audit trail.

Service configuration

All STT services use their Pipecat integration. We worked with STT vendors to represent their best configuration for voice AI use cases. Key configuration choices:

- Smart formatting disabled where possible

- English language specified

- Latest models selected

- Endpoint locations closest to testing, which was performed in the US

All tests were run under similar conditions: US business hours on weekdays, from the same network location. While we can't fully control for provider-side load variations, running during consistent time windows helps minimize that noise.

A note on turn detection

We mentioned turn detection as the third critical component of the voice agent input pipeline. It's worth expanding on briefly, even though this benchmark doesn't measure it directly.

Turn detection answers: *"Is the user done speaking?"* Get it wrong in either direction and the experience suffers:

- Too aggressive: The agent cuts the user off mid-thought. They were just pausing to think, or taking a breath between clauses.

- Too conservative: Awkward silence. The user finished their sentence two seconds ago and the agent is still waiting.

Many STT services now include their own endpointing or turn detection signals. Examples include AssemblyAI, Deepgram’s Flux, Speechmatics, and Soniox.

Note: For this testing, we disabled turn detection in these models. We test this “external turn detection” configuration for two reasons. First, most production voice agents handle turn detection in application or pipeline code, so it’s important to understand how STT services perform in this mode. Second, STT services that perform turn detection internally all make different design decisions in their implementations, so there isn’t clear common ground for testing response latency. Testing turn detection introduces several additional variables over and above latency and WER. This meant excluding Deepgram Flux from the testing, as turn detection cannot be disabled for that model. Flux is Deepgram's flagship model, so if you are evaluating STT options and you do not need application-level turn detection control, you should include Flux in your testing.

In this benchmark, we use Silero VAD with a fixed stop threshold of 200ms for consistent TTFS measurement across services. In production, turn detection for these models can be enabled, or you can include a third-party model for turn detection. In the Pipecat ecosystem the Krisp VIVA turn detection model and the open source Smart Turn model are both widely used.

Turn detection is an active area of development, both at the STT service level and in the agent framework layer. We plan to address it in a future benchmark.

Caveats and limitations

All benchmarks carry bias. We've tried to be transparent about ours:

English only: Results will differ significantly for other languages. Many of these services have strong multilingual support that we haven't tested here.
Single dataset: The smart-turn dataset represents conversational voice agent input, but it's one distribution of audio. Enterprise telephony, accented speech, domain-specific terminology, and other conditions may produce different rankings.
Configuration sensitivity: STT performance depends heavily on configuration. We worked with vendors to use recommended settings, but there may be configurations we missed that would improve specific services.
Point-in-time: STT services update their models frequently. These results reflect a specific moment. We plan to re-run periodically and track changes.
Semantic WER is a judgment call: Using an LLM to evaluate transcription accuracy introduces its own subjectivity. We've designed the evaluation prompt carefully with extensive few-shot examples, and we store full reasoning traces for auditability. But reasonable people (and models) could disagree on edge cases. We welcome feedback on our evaluation criteria - the full evaluation prompt is open source.
Network conditions: Latency measurements are influenced by network path to each provider. We ran from a single US location. Results from other regions may vary.

Important takeaways

Building this benchmark reinforced a few things we already suspected, and surfaced a few surprises:

The accuracy floor has risen dramatically. Even the "worst" service in our benchmark (by WER) achieved a pooled Semantic WER under 4.4%. Every service transcribed successfully on 99%+ of samples. The days of unreliable STT are behind us for English conversational speech.

Latency varies greatly between services The gap between the fastest and slowest services is roughly 5x on median TTFS. For most voice agent applications, latency is so important that slow P95 times disqualify a service from consideration.

P95 latency reveals the real story. Several services that look competitive on median latency show significant tail latency. Building a reliable voice agent means planning for the P95 case, not the median case.

Semantic WER better reflects agent impact. Traditional WER would penalize many of these services more harshly. By focusing on errors that actually affect LLM understanding, we get a more useful picture of transcription quality for voice AI applications.

So which model should you use? There are three clear models on the Pareto frontier: Deepgram, Soniox, and Speechmatics. For many years, Deepgram was alone in delivering both accuracy and very low latency. Now there is significant competition. Competition is great for the voice AI ecosystem. It’s worth noting that there are other factors that influence model choice, beyond latency and WER:

Performance in languages other than English, including the ability to do mixed-language transcription. This is an important area to cover in future benchmarks.
Cost.
The ability to run the model “on prem” or even on end-user devices. Self-hosting models (running "on prem") can improve latency significantly. Self-hosting is also a requirement for many enterprise use cases.
Advanced features like turn detection and speaker diarization.
The ability to customize or fine-tune a model.

We expect to see even more competition among transcription models, including along all of the axes listed above.

Try it yourself

The benchmark tool is open source:

https://github.com/pipecat-ai/stt-benchmark

An additional goal of this benchmark is to produce a utility that developers can use to measure the performance of their own STT service, configuration, and network location. We encourage you to run the TTFS portion of the benchmark to understand the latency characteristics specific to your setup.

The TTFS values from this benchmark directly inform how Pipecat configures its input pipeline. Starting in Pipecat 0.0.102, you can set the ttfs_p99_latency arg on your STT service to tell the context aggregator how long to wait for final transcripts. For services that do not support finalization signals, this lets your agent make better decisions about when to proceed to LLM inference versus waiting for additional transcript segments.

Training Smart Turn on the NVIDIA DGX Spark™

Marcus — Tue, 10 Feb 2026 18:26:24 GMT

We recently got our hands on an NVIDIA DGX Spark™: a tiny, desktop device designed for AI inference and training.

The Spark’s architecture differs from a typical AI workstation, with 128GB of unified memory shared between its Arm CPU and its NVIDIA Blackwell CUDA cores.

Because of this, we wanted to find out whether we could train our open-source Smart Turn model on the device, and if so, how does the experience compare to running training with a traditional GPU?

What is the DGX Spark?

NVIDIA describes the Spark as a compact “AI supercomputer”, built around the NVIDIA GB10 Grace™ Blackwell Superchip.

“Grace” refers to the CPU, a 20-core Arm processor, and “Blackwell” refers to the GPU. The Blackwell architecture is used for NVIDIA’s 50-series consumer GPUs, and its Blackwell datacenter GPUs.

The standout feature here is 128GB of unified memory, which lets you work with much larger models than would fit inside a typical consumer GPU.

20-core Arm CPU
Blackwell GPU
128GB unified memory
4TB NVMe storage
NVIDIA AI Software Stack preinstalled
Dimensions: 15cm x 15cm x 5cm

What is Smart Turn?

If you use the Pipecat framework for voice AI, you may already have heard of our Smart Turn model. It’s fully open-source, and designed to let an AI voice agent know when the user has finished talking.

Smart Turn analyzes the raw audio data coming from the user, listening to the intonation and pacing of their voice, and the words they use, to make an accurate determination about whether it’s safe for the agent to respond without interrupting them.

The model is trained using PyTorch — we’ve released the full training code and datasets on GitHub, so if you have a DGX Spark at home, you can follow along.

Getting set up

We used NVIDIA’s PyTorch container for the training. Start it running as follows:

docker run  --ipc=host --gpus all -it --name smart_turn_training nvcr.io/nvidia/pytorch:25.12-py3

The --ipc=host argument allows the dataloader processes to communicate — the same result could be achieved by increasing the shared memory size (--shm-size).

Inside the container, get a copy of the Smart Turn training code:

git clone https://github.com/pipecat-ai/smart-turn
cd smart-turn

There are a few dependencies we’ll need to run the training. First is ffmpeg, for loading audio files from the training dataset:

apt update
apt install ffmpeg

We’ll also need to install Smart Turn’s Python dependencies, and also remove a library called apex (which causes conflicts with version of the transformers library we’re using).

Smart Turn provides requirements.txt for x86_64 systems, and requirements_aarch64.txt for Arm systems. Use the Arm version when running on the Spark.

pip install -r requirements_aarch64.txt
pip uninstall apex -y

Arm compatibility

Until now, we’d only trained Smart Turn on x86_64 devices. Many of our library dependencies use native code, and so to run on the Spark’s Arm CPU, they’d need to be compiled specifically for this architecture.

This was true for most libraries, with the notable exception of torchcodec, which at the time of writing doesn’t make aarch64 binaries available:

https://forums.developer.nvidia.com/t/cant-install-torch-torchaudio-torchcodec/348660

However, it turns out that Arm support is available in the form of nightly builds, and we were able to use these by enabling the nightly index (https://download.pytorch.org/whl/nightly/cu130) in our requirements file.

We also needed to pin the library versions to specific builds:

torchvision==0.25.0.dev20260122+cu130
torch==2.11.0.dev20260122
torchaudio==2.11.0.dev20260122

Note that cu130 refers to the supported CUDA version, and you can match these to your system by taking a look at the output of nvidia-smi.

The above changes are already part of requirements_aarch64.txt, so running pip install as described in the above section is sufficient to get the correct versions.

Start training

The training script tracks progress and training stats using Weights & Biases, and it expects the API key WANDB_API_KEY to be set.

Tweak the batch size parameters in train.py (train_batch_size and eval_batch_size) to suit the memory size of the device you’re training on. With the Spark’s 128GB, we were able to set these to 2000.

Run the training script as follows:

python train_local.py --training-run-name=my_training_run

Be prepared for this process to take a while! The script will need to download the Smart Turn training and testing datasets, which are around 45GB together.

Performance

Training the model took around an hour on the Spark, roughly in line with what we see on datacenter GPUs such as NVIDIA’s L4, and also comparable to running the same job locally on a consumer GPU such as an RTX 5060 Ti.

Device	Memory	Batch size	Runtime
NVIDIA DGX Spark	128GB	2000	61 minutes
NVIDIA RTX 5060 Ti	16GB	256	53 minutes
NVIDIA L4	24GB	384	79 minutes

Where the Spark really starts to differentiate itself is how it pairs that level of throughput with 128GB of unified memory. That extra headroom doesn’t necessarily make a small model train faster, but it does expand what you can fit comfortably: larger models, longer sequence lengths, and more demanding training configurations.

Smart Turn is a tiny 8M parameter model, so we’re nowhere near memory-limited on most devices. In this case, we used the available memory to our advantage by dialling up the batch size.

Conclusion

The DGX Spark is an interesting device, and we were pleased that our existing training scripts worked with minimal changes. Once the torchcodec aarch64 binaries are released as stable, the process will be simplified further.

To find out more about Smart Turn, and how you can integrate it with your AI voice agents, see the following links:

And to find out more about the DGX Spark, check out the NVIDIA website:

https://www.nvidia.com/en-us/products/workstations/dgx-spark/

Benchmarking LLMs for Voice Agent Use Cases

Kwindla Hultman Kramer — Mon, 02 Feb 2026 21:37:13 GMT

Today we’re releasing a new benchmark that tests tool calling, instruction following, and factual grounding in long, multi-turn LLM conversations. We test both text-mode LLMs and speech-to-speech models.

Our goal is specifically to compare LLM performance for voice agents. Voice agent adoption in challenging enterprise use cases is growing very fast. These enterprise voice agents require:

Very fast LLM response times
Accurate tool calling
Consistent instruction following throughout a long conversation
No hallucinations

Most of the standard benchmarks of LLM performance don’t tell us very much about whether a model will perform well as a voice agent. They don’t test performance over long conversations. They don’t include natural human speech, or complex tool calling.

Teams building voice agents do lots of manual testing to build intuitions about which models work best.

At Daily, we have internal tooling for various kinds of testing and evaluation of models. One of our goals in 2026 is to make as much of this source code as we can available, so other people can reproduce, benefit from, and help to improve the state-of-the-art in voice AI public benchmarks.

The source code for the benchmark we discuss in this post is here on GitHub: https://github.com/kwindla/aiewf-eval

Headlines

Models are getting better all the time

Six months ago, no publicly available model scored above 95% on this benchmark. In a 30-turn conversation, the best models made a significant error in at least one turn.

Today, three models saturate this benchmark, scoring 100% on every eval metric.

But … the best models are too slow

The models that score 100% on this benchmark are too slow for voice agents. Natural conversation requires voice-to-voice response times under 1,500ms. This translates to about ~700ms TTFT for text-mode LLMs used in a transcription -> LLM -> voice harness.

Of course, we are using models in production that score below 95% on this benchmark. To build reliable voice agents:

We do significant model-specific prompt engineering for each use case.
Our multi-turn voice agent harnesses do lots of context engineering.

Part of the point of this benchmark is to lay down a marker for how we would like to be able to use LLMs for voice agents: write a good, detailed, general-purpose prompt and the model “just works,” performing perfectly throughout a long, multi-turn conversation.

In any case, when we decide what models to use for production voice agents, we have to take latency into account. We don’t yet have access to any models that both saturate this benchmark and are fast enough for voice agent use cases.

Most voice agents in production today use GPT-4.1 or Gemini 2.5 Flash, both of which were released in April 2025. (Relatively old models by AI engineering timelines!)

Also of note, the brand new AWS Nova 2 Pro model matches GPT-4.1 and Gemini 2.5 Flash performance and latency on this benchmark. This makes it possible to run fully capable enterprise voice agents entirely on AWS, bringing AWS options up to parity with Azure (which hosts OpenAI models) and Google Cloud (Gemini).

Speech-to-speech models are closing the gap

Most production voice agents today use text-mode LLMs, not the newer speech-to-speech models. There are several reasons for this (see the Two Ways to Build Voice Agents section below), but the most important is the capabilities gap between speech-to-speech models and text-mode LLMs. Compare the GPT Realtime pass-rate score of 86.7% on this benchmark to GPT-4.1’s score of 94.9%.

Ultravox 0.7 is the first speech-to-speech model that performs well on this kind of long, multi-turn benchmark. Congratulations to the Ultravox team for this truly impressive achievement. Ultravox has set the bar at a new level for speech-to-speech models.

A number of teams are doing very interesting work on speech-to-speech models. GPT Realtime and Gemini Live were the first major speech-to-speech releases. The new Nova 2 Sonic model from AWS performs very well on the instruction following and function calling categories in this benchmark. And the NVIDIA PersonaPlex model is a research release that builds on the innovative Moshi bidirectional streaming architecture.

Open source is closing the gap

Ultravox is not only a speech-to-speech model, it’s an open weights model. When I showed Zach Koch, founder of Ultravox, an early version of the results table, above, he noted that a big story here is how good open weights models have become.

Nemotron 3 Nano, with only 30 billion parameters, nearly matches the performance of GPT-4o. Last year at this time, GPT-4o was the most widely used model for voice agents. (And we can assume that GPT-4o is much larger than 30 billion parameters.)

Many people expect open weights models to gain market share in 2026. Proprietary models definitely aren’t going away. But open weights models give us the flexibility to optimize model architecture and inference tooling for latency, run models inside private cloud domains, and post-train for specific use cases.

I had a conversation about this benchmark with Zach and with Brooke Hopkins, founder of the voice AI testing company Coval. We talked about why voice agents are a hard use case for LLMs, open weights models, the progress of speech-to-speech models like Ultravox, and what we’re looking forward to working on in 2026.

Benchmark details

What we’re trying to measure

The aiwf_medium_context benchmark simulates a very common voice agent scenario:

A system instruction describes what the voice agent should do, including how to answer general classes of questions and what questions to deflect.
At startup time, we load a few thousand tokens of background “knowledge” directly into the LLM context.
We define half a dozen tools, including an “end session” tool. Accurate tool calling is very important, because tool calls are how the agent interfaces with backend systems. In this benchmark, we simulate updating a customer’s record in a database, and the creation of support tickets.
We expect the session to last about 30 conversation turns (five to ten minutes).

Our goal is to measure how well each model adheres to the system instructions, cites the included background knowledge, and performs tool calls.

In addition, for speech-to-speech models, we measure whether the agent responds when we expect it to. Speech-to-speech models operate in a much harder domain than text models (audio input) and are much less mature than text-mode APIs. With text-mode APIs, inference call failures are fairly rare and are easy to retry. Speech model response failures are harder to debug and much more frequent. So it’s worth adding a quantitative “turn completion” category for speech-to-speech models.

This benchmark doesn’t try to measure qualitative attributes like how human-like and natural the model’s responses are. Qualitative attributes are important! But developing good benchmarks is hard, and you have to scope the work …

Good benchmarks are hard … and hard to write

Good benchmarks are “hard” in two senses of the word.

First, a benchmark should be calibrated so that models do pretty well on it, most of the time, but not too well. If a benchmark is too easy or hard for a model, you don’t learn very much from it. Because models are improving very quickly, this means that benchmarks need to be updated regularly. A good benchmark needs to be just hard enough to tell you useful things about model performance.

Second, designing a benchmark that simulates a real-world use case well is an art. A good benchmark is precise enough to be repeatable and to generate a useful quantitative comparison. But we use these models for fundamentally open-ended tasks. So a benchmark that is too precise won’t tell you very much about how the model will perform when you put it into production with actual human users. Many, many hours of hard human work goes into thinking through specific benchmark design trade-offs, iterating on a judging pipeline, and performing test runs and scrutinizing the output.

With that in mind, here are some important caveats about this benchmark and the quantitative results.

A benchmark is a single data point, not a comprehensive judgment. It’s impossible for a single benchmark to fully characterize everything important about model performance.
Every benchmark is “unfair” in multiple ways. Every model/API has strengths and weaknesses. Testing all models in the same way, while collapsing performance down to a small set of quantitative metrics, will disadvantage some models in ways that aren’t ideal.
Small problems in data input are very difficult to track down and eliminate. We have spent a lot of time scrutinizing the inputs and outputs for this benchmark and trying to fix corner cases that shouldn’t impact scoring. For example, some of the speech-to-speech models often (but not always) don’t respond to very short utterances. We made all of the input audio speech segments long enough that all tested models reliably respond to them. But if you run the benchmark yourself and scrutinize the data, you may find corner cases that we missed.
Relatedly, how we squeeze qualitative results into a quantitative score involves making lots of judgment calls. We talk about some of these in the “Notes on specific models” section, below.
Open-ended vs “on rails” testing. This benchmark sends 30 user inputs in a fixed sequence, no matter how the model responds. This is one place where this benchmark doesn’t match real-world voice agent behavior at all. In a real-world voice agent, the human talking to the agent can adjust to how the model responds. This makes our benchmark harder than it should be, in some ways, because in real-world conversation the user and the agent can adjust to each other to some extent even when a conversation takes an unexpected turn. But simulating and judging fully open-ended conversations is challenging. We’re taking a simpler approach here. (Simulation-based agent testing is very valuable and is a big, interesting area of research. Simulation-based testing is part of how Waymo self-driving cars got so good.)
We use exactly the same prompt for all the models. When developing real voice agents, we do a lot of model-specific prompt engineering. Model performance would improve on this benchmark if we optimized the prompt for each model. There’s a reason not to do that, however. The baseline performance on a general prompt captures important aspects of model performance and, loosely speaking, tells you how hard it will be to improve performance with prompt engineering, when working on a production voice agent.
Model changes and infrastructure reliability pose challenges. LLMs are new technology. Teams at OpenAI, Google, and other labs are doing a truly amazing job building and maintaining new kinds of super-computers. But AI APIs are much, much less reliable than most of the traditional APIs we’ve all gotten used to using in the cloud computing era. This means that TTFT numbers, for example, vary a huge amount between benchmark runs. And benchmark results are not repeatable. Even with inference parameters set to be as non-stochastic as possible, you will not get the same results from every benchmark run. Finally, API providers also change inference stack code and sometimes model weights without changing the model names.

Finally, we use the term “benchmark” in this post, but we use the term “eval” in a lot of places in the source code. It’s mostly fine to use these terms interchangeably; people will know what you mean. The connotations are a bit different, though. A benchmark is a standardized test. An eval can be broader: basically anything you hack together to measure model performance is an eval. All benchmarks are evals, but not all evals are benchmarks.

The `aiwf_medium_context` benchmark is implemented on top of tooling derived from Pipecat code we use internally at Daily to build various kinds of tests, measurement tools, and evals specific to various use cases. We have lots of internal benchmarks that we would like to find the time to clean up, carefully scrutinize, and make public. If you want to add a voice agent benchmark to this project, let us know. We’d love to work with you.

LLM as a judge

To evaluate model performance, we compare the model’s output for each turn against a “golden” response that we consider optimal. There is substantial judgment involved, here. We narrow the scope of judgment by specifying pass/fail criteria explicitly.

1. **turn_taking** (bool):
   - This dimension is PRE-COMPUTED based on audio timing analysis
   - If marked as a turn-taking failure in the input, set to FALSE
   - If not marked, set to TRUE
   - Turn-taking failures indicate audio timing issues (interruptions, overlaps, missing audio)

2. **tool_use_correct** (bool):
   - TRUE if the assistant correctly called the expected function with semantically equivalent arguments
   - TRUE if no function call was expected and none was made
   - TRUE if a function call was expected but was already made in an earlier turn (realignment case)
   - TRUE if a late function call is made at this turn (the call eventually happened, credit this turn)
   - FALSE if a function call was expected, not made, and NOT already made earlier
   - FALSE if the assistant's words imply waiting for confirmation but it acts without waiting
   - FALSE if the assistant asks for unnecessary confirmation instead of making the expected function call
   - For argument matching, use semantic equivalence (not verbatim)
   - Session IDs must match exactly

3. **instruction_following** (bool):
   - TRUE if assistant directly answers the question OR advances the task
   - TRUE if assistant properly deflects out-of-scope questions
   - TRUE if the turn is part of a realigned workflow that still accomplishes the goal
   - FALSE if assistant's words contradict its actions (says "Does that work?" but doesn't wait)
   - FALSE if assistant neither answers nor advances the workflow
   - FALSE if the assistant asks for unnecessary confirmation when it already has all needed information
   - **IMPORTANT**: If a turn has turn_taking=FALSE, be lenient on instruction_following since garbled audio may cause transcription issues

4. **kb_grounding** (bool):
   - TRUE unless assistant states an explicit factual error
   - TRUE if assistant provides additional correct information
   - FALSE only for clear factual contradictions (wrong dates, times, locations, speakers)

We use the Claude Agent SDK with Claude Opus 4.5. In earlier implementations of this kind of benchmark, we prompted a judge model directly rather than using a framework like the Claude Agent SDK. Using the Agent SDK makes it much easier to build robust LLM-as-a-judge systems. The framework implements the same querying, tool use, file access, and looping capability that is familiar to you if you are a Claude Code user.

The judge implementation is in this source code file. It was, of course, written by Claude. But we have spent many, many (human) hours looking at model raw data output and the output of judging runs. We’re confident that while it is always possible to improve the output of an LLM judge, the current implementation characterizes model performance on this task in a useful way.

To give you a sense of the kinds of judgment calls that are required when deciding how to score a cross-model benchmark, here are two examples of decisions encoded in our judge prompt for this project.

We do not penalize models based on whether they do or don’t output speech alongside tool calls. It is very difficult to control this behavior successfully with prompting. Some models are strongly biased to output text and function calls together. Some models exhibit the opposite bias and rarely or never mix text and tool calls. Some models seem inconsistent but still hard to steer in this regard.And different people have different preferences about this! Some engineers building voice agents want the model to output explanatory text alongside tool calls. Some engineers prefer to manage tool call vocalizations programmatically.Given that models differ widely in default behavior, are generally hard to convince to behave differently than their defaults, and people don’t agree on what they want, we choose to ignore this variation when we score model performance.

We do penalize the models if they ask the user for a piece of information the user has already provided. Here’s an example of this kind of failure:

User: I'd like to vote for the one about vibe coding.
[This should trigger a tool call because the model already has the user’s name from a previous turn.]

Model: Great! I can help you with that. Just to confirm, I'll need your name to submit the vote. Is it still Jennifer Smith?

In a production agent, this would annoy the user. We consider it an instruction following failure, of a kind that neatly highlights a very common category of prompt engineering challenges. In this case, we have two instructions in the prompt that almost all humans would consider clear, that many models view as conflicting, and that the smartest models have no problem disambiguating:

7. Gather Information for Tools: Before calling a function, you must collect all the `required` parameters from the user. Engage in a natural conversation to get this information. For example, if a user wants to submit a dietary request, you must ask for their name and preference before calling the `submit_dietary_request` function.
  
8. When using Tools, use information that has been provided previously. Whenever you use tools, you should use information you already know to help you complete the task. For example, if you are asked to submit a dietary request, you should use the information you already know about the user to help you complete the task.

You can almost certainly create a prompt that avoids this specific excessive confirmation failure mode with these tool calls for a specific LLM. (Note, though, that this might not be as easy as you expect.) However, when real-world users say a wide variety of things to a real-world agent, you will definitely see this general category of failure from less-capable models more often than from more-capable models. It is not possible to cover all possible conversation paths and corner cases, no matter how carefully you engineer your prompt. So this scoring is useful, in the sense that it captures an important, generalizable behavioral difference between models.

You may disagree! We very much welcome feedback and collaboration on this benchmark.

Latency

Because voice-to-voice response time is so important for voice agents, we need to characterize latency accurately.

For text-mode LLMs, we can measure the “time to first token,” which is a widely-understood metric and easy to calculate. A well-engineered voice pipeline can start generating voice output as soon as a small initial batch of tokens arrives, in parallel with receiving streaming output from the LLM.

The one thing to note here is that we need to measure TTFT from the receiving side of the API connection. Model providers sometimes quote TTFT numbers internal to their inference stacks. We calculate TTFT from the time we send the inference request to the first usable token we get back from the API.

For speech-to-speech models, measuring latency is more complicated. We need to measure from the end of the user speech segment to the first moments of speech emitted by the model. For this benchmark, we record the audio of each test with the “user” input speech in the left audio track and the LLM output speech in the right track. Then we mark the start and end of all the speech segments in both tracks, pair up the segments, and measure the voice-to-voice latency.

A couple of notes: we use a small, specialized “voice activity detection” model (Silero VAD) to calculate the speech segment start and end times. Silero operates on 30ms frame sizes, so the precision of each start/end timestamp is approximately 30ms. We’ve empirically tuned the Silero settings we’re using to work well for both the input samples and the model outputs of all the models we test in this benchmark. We also tag the beginning of each bot speech segment with a short beep sound, which we use to check the track alignment against the timestamps in the pipeline log logs. The beeps are also easy to see in waveform visualizations, which is helpful for manually checking sample runs.

Visually measuring silence padding in a speech-to-speech model output segment. The first bytes from the model are tagged with a short beep, visible at the beginning of the highlighted waveform segment.

For a deeper dive on latency, see the Voice Agents Primer.

More on agent architectures

Two ways to build voice agents: speech-to-speech and cascaded pipelines

Today, we have two ways to build voice agents, with different strengths.

Speech-to-speech models perform speech input and output “inside” the LLM.

A “cascaded pipeline” uses a specialized speech-to-text model to transcribe speech to text, feeds that text to an LLM, and then generates voice output using a text-to-speech model.

These two approaches have different strengths. Speech-to-speech models offer excellent audio understanding and natural-sounding output.

Using a cascaded pipeline with a text-mode LLM delivers generally better “intelligence,” system observability, flexibility, and cost. Text-mode LLM APIs are also more mature and support more sophisticated context engineering.

Today, most production voice agents use the cascaded pipeline architecture, because for most use cases, the performance of the best text-mode LLMs and the ability to manage the conversation context are important.

Most of us working in the voice AI space expect use of speech-to-speech models to grow as the models improve. Most of us also, though, think text-mode models will be around for a while. Cost, easier fine tuning, and an increasing variety of model options will keep text-model LLMs in our voice agent pipelines for a long time.

Context engineering isn’t going anywhere

Today we do a lot of context engineering to build reliable voice agents on top of models that aren’t quite as good at complex instruction following and tool calling as we would like.

We use libraries like Pipecat Flows to model voice workflows as state transitions. We try to limit how much “world knowledge” we stuff into system prompts. We use various tricks to limit the number of tools we define, and try to make the tool definitions simple.

As models continue to get better, we don’t need to do as much of this for the same use cases. And the best models today are really good!

But we are continually expanding what we do with voice agents. We’re building agents for more and more use cases. And talking to them longer. And incorporating more background information, more dynamically retrieved context, and more kinds of structured and multi-modal data into our inference calls.

My rule of thumb is: we will always want to do 20% more than the best models can (easily) do.

So context engineering is here to stay. And so are other, related, techniques like sub-agents, model routing, and RAG.

The future is multi-model

AI agent systems these days very often use multiple models. Small models for speed and specialized tasks. Large models and big thinking budgets for big, long-context, reasoning-heavy inference. Fine-tuned models when we have good data and quantifiable success metrics.

Voice agents are the original multi-model agent systems! We’ve used STT, LLM, and TTS models in voice agents since we first started building agents.

In January 2024, before any of the SOTA LLMs had vision capabilities, we built a Pipecat agent using the small, fast Moondream vision model alongside GPT-4 to implement a voice agent that could “see” the world around it at a high frame rate.

Today, we often use multiple models and multiple inference loops inside a single voice agent. We process video, do content safety “guardrails” checks, trigger asynchronous tool calls, generate user interface events, and much more with specialized models running in parallel inference loops.

For example, here’s a small LLM performing asynchronous, non-blocking context compaction alongside the main voice conversation loop, in a Pipecat ParallelPipeline.

Perhaps the most exciting multi-model trend we’re starting to see in 2026 is hybrid local and cloud agents. These agents run some of their AI inference locally on-device, and send inference requests to the cloud only when they need larger, more capable models.

NVIDIA showed an example of this in the CES keynote this year, running open source NVIDIA models on a DGX Spark NVIDIA mini-supercomputer. (This demo was built with Pipecat.)

If you’re interested in adding to the benchmark set that the aiwf_medium_context test we’ve talked about in this article is part of, all of the code is open source. We welcome contributions, ideas, and feedback.

If you’re building voice agents, the Pipecat Discord is a great community to be part of. We also host regular voice AI meetups in San Francisco, in other cities, and online.

Beyond the Context Window: Why Your Voice Agent Needs Structure with Pipecat Flows

Chad Bailey — Mon, 12 Jan 2026 19:11:48 GMT

LLMs have gotten 'smarter' at an astronomical rate, driven largely by one metric: context window size. This is how much data an LLM can effectively 'think' about at once. What was once a ceiling of 16,385 tokens for models like GPT-3.5 is now routinely measured in the hundreds of thousands—with some models, like those from Gemini, offering context windows of 1 million tokens or more.

This context window revolution has profound implications for building sophisticated voice agents. When an agent needs access to a massive knowledge base or is expected to complete a long, multi-step list of tasks, a large context window seems like the ultimate solution.

The Illusion of the Gigantic System Prompt

The knee-jerk reaction for many developers is to lean on this new capacity: write a colossal system prompt, dump in every possible instruction, define every potential action (function call), and trust the LLM to figure it out. And for simple, constrained conversations, this often works surprisingly well.

But anyone who has managed a complex process—like call center operations, for example—knows that simply providing access to everything is not the same as providing guidance. You cannot put a new employee in a room with the entire company manual and expect them to be immediately fluent and mistake-free.

LLMs are no different. Even with context windows larger than a novel, they are susceptible to "context rot," where instructions or important data from earlier in the history start to be ignored. When a conversation gets long or complex, the LLM loses its way, confusing tasks or prematurely exiting a workflow.

The Developer's Toolkit: Context vs. Control

When building voice agents, developers fundamentally rely on two tools to control the LLM's behavior:

Prompts: How we whisper instructions and knowledge into the agent's mind.
Function Calls: The buttons we give the LLM that allow it to take action (like booking a reservation or looking up a customer ID).

For basic workflows, you define these once at the start. But complexity demands dynamic control. For example, you may need:

Gated Access: Only making certain high-value tools (e.g., a "Make Payment" function) available after explicit identity verification.
Structured Progression: Ensuring a user confirms an order before moving to the shipping details.
Branching Conversations: Executing two entirely different interaction paths based on a user's single, critical answer.

In these cases, you need a way to dynamically adjust what the bot knows (by editing or appending system prompt information) and what it can do (by changing the available function calls) based on the state of the conversation.

Pipecat Flows: A Framework for Structured Conversation

This is the exact problem that Pipecat Flows solves.

Pipecat Flows is an add-on framework for Pipecat that allows you to build structured conversations in your voice AI applications.

Pipecat is a 100% open source and vendor neutral framework for voice agents, providing functionality like ultra low latency orchestration, flexible model use, native telephony support, enterprise data store integration, and more.

Pipecat Flows acts as the necessary layer of control and logic that sits outside the context window, guiding the LLM step-by-step.

By enabling you to create both predefined conversation paths and dynamically generated flows, Pipecat Flows handles the critical complexities of state management and LLM interactions that a massive context window simply cannot. It acts as guardrails, keeping the LLM focused on the task at hand. It allows for creative and flexible LLM responses while preventing fully open-ended conversations.

Pipecat Flows is useful in many contexts. Here are a few examples:

Food ordering: The agent accepts the order, then confirms the delivery address, then verifies payment.
Enterprise healthcare workflows, like patient intake: The agent starts by confirming the patient's identity. Then it asks for a list of the patient's current prescriptions, asking clarifying questions (such as dosage information) as necessary until the list is complete. Then, it moves through the rest of the intake process.
Hotel reservation: The agent asks for the booking date and confirms availability with a back-end system. Then, it asks about room type and other preferences.

The Pipecat Flows repo has more information on how to build flows. There’s a visual editor to quickly build out predefined conversation paths. There’s also a full-featured API for building dynamically generated flows in code.

Knowing When to Adopt Structure

Given how capable modern models are, how do you know if you need to adopt a structured framework like Pipecat Flows?

While a simple agent might run perfectly fine without it, scope creep is inevitable. As you add more features, functions, and complexity, relying solely on the LLM's raw context will lead to failure.

The key is to integrate evals (evaluations) into your development process early. Use these evaluations to objectively characterize how reliably your agent completes expected tasks. Once you observe certain kinds of conversations consistently "going off the rails," that is your cue.

Pipecat Flows is designed to make that gradual transition possible, allowing you to move from an unstructured, LLM-driven conversation to one managed by explicit, reliable steps, ensuring your voice agent remains scalable, maintainable, and most importantly, trustworthy.

Pipecat Cloud is Now Generally Available

Nina Kuruvilla — Thu, 08 Jan 2026 21:50:17 GMT

Deploy and scale your Pipecat agents on enterprise-grade infrastructure

Pipecat Cloud is now GA, following an nine-month beta period with more than 1,000 teams building and scaling voice agents on Pipecat Cloud’s global infrastructure.

We built Pipecat Cloud to handle the low-level operational and scaling challenges of voice AI so that teams building enterprise voice agents can focus on agent code and business logic. Engineering and product teams rely on Pipecat Cloud for auto-scaling, multi-region deployments, world-class redundancy and resilience, and compliance and data security, all while avoiding any vendor lock-in.

Pipecat Cloud supports direct connections to telephony providers like Twilio and bundles value-added services like the industry-leading Krisp VIVA noise reduction models and Daily WebRTC transport.

With the help of our developer community and their feedback, today Pipecat Cloud is powering voice AI across use cases like agentic interviewers; enterprise healthcare workflows like patient intake and schedule reminders; embedded hardware platforms; and more.

Build on open source, “docker push” to Pipecat Cloud

Pipecat Cloud is the managed service for Pipecat, the most widely used voice agents and multimodal AI framework.

Pipecat’s architecture is built around a programmable, AI-native multimodal pipeline. It’s fully open source, and composable, to support how engineers build and enterprises preserve strategic value: use any model and easily swap them out; integrate with any data store; connect to AI-native observability and eval tooling; run on any transport; leverage cross-platform libraries.

But voice AI developers also face a second challenge, scaling voice agent infrastructure. Deploying at scale with production reliability involves complexities like configuring optimal network routing, implementing rolling deploys with connection-aware drain times, avoiding cold starts, managing long-running connections, allocating CPU efficiently in Kubernetes, and more. (See Section 10 of the Voice AI & Voice Agents: An Illustrated Primer.)

Pipecat Cloud is built by Daily, and reflects our 10 years of experience building the world’s leading global realtime developer infrastructure. Our platforms and tooling are trusted by industry leaders like NVIDIA, Mercor, Descript, Epic, Vapi, and Tavus.

With Pipecat Cloud, you build your voice agent leveraging Pipecat’s open source core, add your custom code, and then “docker push” to Pipecat Cloud.

Pipecat is vendor neutral by design, and in designing Pipecat Cloud we followed the Pipecat principles that flexibility and avoiding lock-in are key values.

The code you deploy to Pipecat Cloud is “just” Pipecat code. Anything you run on Pipecat Cloud you can self-host exactly the same way. All of the deployment code is open source and the Pipecat Cloud lifecycle events are fully documented.
Pipecat Cloud leverages Daily’s global infrastructure and includes Daily WebRTC at no additional cost, but Pipecat Cloud also supports direct connections to telephony providers, WebSocket network transport, and the non-commercial peer-to-peer SmallWebRTCTransport module.

Engineering enterprise-grade service

Over the past nine months, our engineering team has focused on:

Fast agent start times:

P99 agent start times are < 1 second
We automatically over-provision as your scale increases and we give you control over how many “reserved instances” you want to keep alive during low-traffic periods. A reserved instance is 1/20th the cost of an active instance.

Multi-region hosting: host voice agents where your users are

us-west (Oregon)
us-east (Virginia)
eu-central (Frankfurt)
ap-south (Mumbai)

Delivering features that help your agents succeed

Krisp VIVA noise cancellation
Smart Turn model access - native audio turn detection, with open weights and open datasets.
Agent profiles for use cases like video avatars and screen sharing that need more CPU
Observability for usage and performance metrics

Network transport flexibility: You can configure your agents for direct client connections using WebRTC, WhatsApp, Twilio, and more.

Daily (WebRTC & PSTN) — Daily is recognized as the top WebRTC developer platform by third-party analysts like Tsahi Levent-Levi. You can use Daily WebRTC and buy phone numbers for telephony connections directly from Daily.
SmallWebRTC — a direct peer-to-peer transport that is particularly useful if you have a regulatory or security requirement not to route traffic through WebRTC servers.
WhatsApp
Twilio
Telnyx
Plivo
Exotel

Reliability: Kubernetes redundancy, logging, and observability

Improving the developer experience and supporting automation: Use the Pipecat Cloud REST API to set up CI/CD workflows that automatically deploy updates to your agent.

HIPAA: Adding compliance enablement for HIPAA workflows, plus advanced privacy and security controls relating to SOC 2 and other certifications.

Our roadmap ahead includes further support for enterprise scale, including expanded regions and SOC 2 for Pipecat Cloud. (Daily’s WebRTC infrastructure is SOC 2 compliant.)

Single-tenant enterprise Pipecat Cloud: Contact us if you need to run Pipecat Cloud in your VPC.

Transparent pricing

Pipecat Cloud pricing is simple: $0.01 per running agent. You can add on, as needed, reserved instances, audio recording, and enterprise support. For enterprise customers, we can also bundle your AI inference costs (transcription, LLM, and voice models) into a single bill.

Our Capacity Planning Guide walks you through how to budget for active agents and reserved agents, as your scale increases.

Enabling developers, supporting realtime AI

The mission behind our work at Daily is to support the development of voice and multimodal AI, from enterprise voice agents to new use cases like the robot personal assistant demo that opened the NVIDIA CES Keynote this year.

Pipecat is a vendor-neutral, open source framework that started life inside Daily as our internal tooling for realtime, conversational AI. Pipecat is now used by thousands of startups, scale-ups and enterprises, all of the foundation AI labs, and technology giants like NVIDIA and AWS.

Pipecat Cloud is the hosting platform we designed to solve infrastructure pain points we were hearing about from many or our customers and partners.

To talk with other developers building on the frontier of realtime AI, join the Pipecat Discord.

If you’re new to voice agents, you can find the Pipecat quickstart here. Thanks again to the Pipecat developer community and all of the engineers who have contributed to Pipecat.

Smart Turn v3.2: Handling noisy environments and short responses

Marcus — Wed, 07 Jan 2026 18:05:44 GMT

We’re happy to kick off the New Year with a new Smart Turn release, with two key improvements to the responsiveness of your AI voice agents.

Smart Turn is an open-source turn detection model, which listens to raw audio data and determines when a user has finished speaking. Using Smart Turn, an AI voice agent can tell precisely when to respond to the user, without interrupting them, or waiting unnecessarily.

As usual, all parts of the model are open: the weights, the datasets, and the training code.

What's new in v3.2

Short utterances

We’ve significantly improved the model’s handling of short utterances, for example single words like “yes” or “okay”. These samples are now miscategorized 40% less often according to our public benchmarks.

We’ve made two changes which make this possible: firstly, a new dataset of short utterances which we plan to expand over time, and secondly, a fix for a padding issue during training reported by the community, which was reducing accuracy.

Background noise

Smart Turn v3.2 is more robust to background ambience, thanks to the addition of realistic cafe/office noise to our training and testing datasets. The result is that the model will perform better in real-world scenarios where the user’s audio isn’t studio-quality.

Usage

The new version is a drop-in replacement for v3.1, and as before, we’re shipping the model in 8MB (CPU) and 32MB (GPU) variants. The weights are available now on HuggingFace:

https://huggingface.co/pipecat-ai/smart-turn-v3/tree/main

As with v3.1, we’ll bundle the weights with the next Pipecat release for use with LocalSmartTurnAnalyzerV3. You can also use v3.2 with Pipecat right now by setting the smart_turn_model_path parameter in the LocalSmartTurnAnalyzerV3 constructor.

More information and benchmarks

For more details on how the model was trained, including our full training code, please see our GitHub repo:

https://github.com/pipecat-ai/smart-turn

We’ve released two new datasets, which were used to train and test this release respectively:

For accuracy benchmarks with the new test dataset, please see the following link:

https://huggingface.co/pipecat-ai/smart-turn-v3/tree/main/benchmarks

Stay in touch

We hope you enjoy the new model! If you have questions about Smart Turn or run into any issues, feel free to join our Discord server, or open a ticket on GitHub.

Building Voice Agents with NVIDIA Open Models

Kwindla Hultman Kramer — Tue, 06 Jan 2026 01:34:53 GMT

How to Build Ultra-low-latency Voice Agents With NVIDIA Cache-aware Streaming ASR

This post accompanies the launch of NVIDIA Nemotron Speech ASR on Hugging Face. Read the full model announcement here.

In this post, we’ll build a voice agent using three NVIDIA open models:

The new Nemotron Speech ASR model
Nemotron 3 Nano LLM
A preview checkpoint of the upcoming NVIDIA Magpie text-to-speech model

This voice agent leverages the new streaming ASR model, Pipecat’s low-latency voice agent building blocks, and some fun code experiments to optimize all three models for very fast response times.

All the code for the post is here in this GitHub repository.

You can clone the repo and run this voice agent:

Scalably for multi-user workloads on the Modal cloud platform.
On an NVIDIA DGX Spark or RTX 5090 for single-user, local development and experimentation.

Feel free to just jump over to the code. Or read on for technical notes about building fast voice agents and the NVIDIA open models.

The state of voice AI agents in 2026

Voice agent deployments are growing by leaps and bounds across a wide range of use cases. For example, we’re seeing voice agents used at scale today in:

Customer support
Answering the phone for small businesses (for example, restaurants)
User research
Outbound phone calls to prepare patients for healthcare appointments
Validation workflows for loan applications
And many, many other scenarios

Both startups and large, established companies are building voice agents that are successful in real-world deployments. The best voice agents today achieve very high “task completed” success metrics and customer satisfaction scores.

Voice AI architecture

As is the case with everything in AI, voice agent technology is evolving rapidly. Today, there are two ways to build voice agents.

Most production voice agents use specialized models together in a pipeline – a speech-to-text model, a text-mode LLM, and a text-to-speech model.
Voice agent developers are beginning to experiment with new speech-to-speech models that take voice input directly and output audio instead of text.

On the left, a block diagram of a voice agent that uses a “pipeline” of specialized AI models. On the right, a voice agent built with a speech-to-speech LLM.

Using three specialized models is currently the best approach for enterprise use cases that require the highest degree of model intelligence and flexibility. But speech-to-speech models are an exciting development and will be a big part of the future of voice AI.

Whether we use a pipeline or a unified speech-to-speech model, voice agents are doing more and more sophisticated tasks. This means that, increasingly, production voice agents are actually multi-agent systems. Inside an agent, sub-agents handle asynchronous tasks, manage the conversation context, and allow code re-use between text and voice agents.

A voice agent that is a multi-agent system under the covers. This agent uses tool calls to start long-running tasks that stream structured data into the context of the voice conversation.

For a deep dive into voice agent architectures, models, and infrastructure, see the Voice AI & Voice Agents Illustrated Primer.

Open source models

Open models have not been widely used for production voice agents.

Voice agents are among the most demanding AI use cases. Voice agents perform long conversations. They must operate on noisy input audio and respond very quickly. Enterprise voice agent use cases require highly accurate instruction following and function calling. People interacting with voice agents have very high expectations for naturalness and “human-like” qualities of voice audio. In all of these areas, proprietary AI models have performed better than open models.

However, this is changing. Nemotron Speech ASR is both fast and accurate. On our benchmarks it performs comparably with or better than commercial speech-to-text models used today in production voice agents. Nemotron 3 Nano is the best-performing LLM in its class on our long-context, multi-turn conversation benchmarks.

Using open models allows us to configure and customize our models and inference stacks for the specific needs of our voice agents in ways that we can’t do with proprietary models. We can optimize for latency, fine-tune on our own data, host inference within our VPCs to satisfy data privacy and regulatory requirements, and implement observability that allows us to deliver the highest levels of reliability, scalability, and consistency.

We expect open models to be used in a larger and larger proportion of voice agent deployments over time. There are various flavors of “open” model licenses. NVIDIA has made the Nemotron Speech ASR and Nemotron 3 Nano available under the NVIDIA Permissive Open-Model License, which allows for unrestricted commercial use and the creation of derivative works.

An ultra-responsive voice agent

Fast, streaming transcription

The Nemotron Speech ASR model is designed specifically for use cases that demand very low latency transcription, such as voice agents.

The headline number here is that Nemotron Speech ASR consistently delivers final transcripts in under 24ms!

ASR (Automatic Speech Recognition) is the general term for machine learning models that process speech input, then output text and other information about that speech. Previous generations of ASR models were generally designed for batch processing rather than realtime transcription. For example, the latency of the Whisper model is 600-800ms, and most commercial speech-to-text models today have latencies in the 200-400ms range.

Model	Openness	Deployment
Parakeet	open weights, open training data, open source inference	local in-cluster
Widely used commercial ASR	proprietary	cloud
Whisper Large V3	open weights, open source inference	local in-cluster

For more about the cache-aware architecture that enables this impressively low latency, see the NVIDIA post announcing the new model.

The model is also very accurate. The industry standard for measuring ASR model accuracy is word error rate. Nemotron Speech ASR has a word error rate on all of our benchmarks roughly equivalent to the best commercial ASR models, and substantially better than previous generation open models like Whisper.

To integrate Nemotron Speech ASR into Pipecat, we created a WebSocket server that performs the transcription inference and a client-side Pipecat service that can be used in any Pipecat agent.

ASR server architecture showing a streaming transcription pipeline. Audio enters through a WebSocket handler, flows to an audio accumulator, then to a mel-spectrogram preprocessor, followed by a streaming encoder. The encoded output is decoded using a greedy decoder to produce transcript output. A reset signal can be sent from the WebSocket handler directly to the decoder.

Running turn detection in parallel with transcription

The Nemotron Speech ASR model can be configured with four different context sizes, each of which have different latency/accuracy trade-offs. The context sizes are 80ms, 160ms, 560ms, and 1.2s. We use the 160ms context size, because this aligns with how we perform turn detection.

Turn detection means determining when the user has stopped speaking and the voice agent should respond. Accurate turn detection is critical to natural conversation. We’re using the open source Pipecat Smart Turn model in this voice agent. The Smart Turn model operates on input audio and runs in parallel with the Nemotron Speech ASR transcription.

We trigger both turn detection and transcript finalization any time we see a 200ms pause in the user’s speech. This gives us 200ms of “non-speech” trailing context after the user’s speech has finished. The Nemotron Speech ASR model actually needs a bit more trailing silence than this, to properly finalize the last words in the user speech. The padding calculation is:

nemotron_final_padding = (right_context + 1) * shift_frames * hop_samples
    = (1 + 1) * 16 * 160
    = 5120 samples = 320ms

Our WebSocket transcription server receives 200ms of “non-speech” trailing audio data from the Pipecat service, and adds 120ms of synthetic silence to enable immediate finalization of the transcript. This works nicely.

Nemotron 3 Nano

Nemotron 3 Nano is a new 30 billion parameter open source LLM from NVIDIA. Nemotron 3 Nano is the best performing model in its size class on our multi-turn conversation benchmarks.

Model	Tool Use	Instruction	KB Ground	Pass Rate	Median Rate	TTFB Med	TTFB P95	TTFB Max
gpt-5.1	300/300	300/300	300/300	100.0%	100.0%	916ms	2011ms	5216ms
gemini-3-flash-preview	300/300	300/300	300/300	100.0%	100.0%	1193ms	1635ms	6653ms
claude-sonnet-4-5	300/300	300/300	300/300	100.0%	100.0%	2234ms	3062ms	5438ms
gpt-4.1	283/300	273/300	298/300	94.9%	97.8%	683ms	1052ms	3860ms
gemini-2.5-flash	275/300	268/300	300/300	93.7%	94.4%	594ms	1349ms	2104ms
gpt-5-mini	271/300	272/300	289/300	92.4%	95.6%	6339ms	17845ms	27028ms
gpt-4o-mini	271/300	262/300	293/300	91.8%	92.2%	760ms	1322ms	3256ms
nemotron-3-nano-30b-a3b*	287/304	286/304	298/304	91.4%	93.3%	171ms	199ms	255ms
gpt-4o	278/300	249/300	294/300	91.2%	95.6%	625ms	1222ms	13378ms
gpt-oss-120b (groq)	272/300	270/300	298/300	89.3%	90.0%	98ms	226ms	2117ms
gpt-5.2	224/300	228/300	250/300	78.0%	92.2%	819ms	1483ms	1825ms
claude-haiku-4-5	221/300	172/300	299/300	76.9%	75.6%	732ms	1334ms	4654ms

[*] Nemotron 3 Nano hosted locally in-cluster on Blackwell GPUs

Like Nemotron Speech ASR, Nemotron 3 Nano is part of a new generation of open models that are designed specifically for speed and inference efficiency. See this resource from NVIDIA research for an overview of the Nemotron 3 hybrid Mamba-Transformer MoE architecture and links to technical papers.

A 30B parameter model is small enough to run very fast on high-end hardware, and can be quantized to run well on GPUs that many developers have at home!

Model variant	Deployment	Resident memory
Nemotron-3-Nano BF16	full weights, Modal Cloud or DGX Spark	72GB
Nemotron-3-Nano Q8	8-bit quantization, faster operation on DGX Spark	32GB
Nemotron-3-Nano Q4	4-bit quantization, RTX 5090	24GB

One note on which LLMs are generally used today for production voice agents: in general, voice agents for applications like customer support need the most “intelligent” models we have available. Voice agent use cases are demanding. A customer support AI agent must do highly accurate instruction following and function calling tasks throughout a long, open-ended, unpredictable human conversation. A 30B parameter model – even one as good as Nemotron 3 Nano – is generally best suited for specialized voice tasks like a home assistant or software voice UI interface.

NVIDIA has announced that two larger Nemotron 3 models are coming soon. If the performance of these larger models relative to their size is similar to Nemotron 3 Nano’s performance, we expect these models to be terrific intelligence engines for voice agents.

In the meantime, Nemotron 3 Nano is the best-performing LLM that I can run on hardware I have at home. I’ve been using this model for a wide variety of “local” voice agent tasks and development experiments on both an NVIDIA DGX Spark and on my desktop computer with an RTX 5090.

You can use Nemotron 3 in reasoning or non-reasoning mode. We usually turn off reasoning for the fast-response core voice agent loop.

For details on using Nemotron 3 Nano in the cloud and building local containers with the latest CUDA, vLLM and llama.cpp support for this new model, see the GitHub repository accompanying this post. There are a couple of inference tooling patches (relating to the reasoning output format in vLLM and to llama.cpp KV caching) that you might find useful if you’re experimenting with this model.

Magpie streaming server

Magpie is a family of text-to-speech models from NVIDIA. In our voice agent project, we’re using an experimental preview checkpoint of an upcoming open source version of Magpie.

Kudos to NVIDIA for releasing this early look at a Magpie model designed, like Nemotron Speech ASR, for streaming, low-latency use cases! We’ve been having a lot of fun experimenting with this preview, doing things that are only possible with open source weights and inference code.

You can use this Magpie model in batch mode by sending an HTTP request with a chunk of text. This batch mode inference delivers audio for a single sentence in about 600ms on the DGX Spark and 300ms on the RTX 5090. But for voice agents, we like to stream all tokens as much as we can, and because Magpie is open source, we can hack together a hybrid streaming mode that optimizes for initial audio chunk latency! This hybrid streaming approach improves average initial response latency 3x.

TTS TTFB Comparison: Batch → Streaming

Hardware	P50 Improvement	Mean Improvement	P90 Improvement
RTX 5090	90 ms (1.9x)	204 ms (3.0x)	430 ms (5.2x)
DGX Spark	236 ms (2.3x)	415 ms (3.3x)	836 ms (4.6x)

Details

RTX 5090

Mode	Min	Max	P50	P90	Mean
Batch	106 ms	630 ms	191 ms	533 ms	305 ms
Pipeline	99 ms	103 ms	101 ms	103 ms	101 ms

DGX Spark

Mode	Min	Max	P50	P90	Mean
Batch	193 ms	1440 ms	422 ms	1067 ms	595 ms
Pipeline	15 ms	276 ms	186 ms	231 ms	180 ms

There’s definitely a quality trade-off with our simple streaming implementation. Try the agent yourself, or listen carefully to the conversation in the video at the beginning of this blog post. You can usually hear a slight disfluency where we “stitch” together the streaming chunks at the beginning of the model response.

To do better, we’d need to retrain part of the model and use a slightly more sophisticated inference approach. Fortunately, this is on the NVIDIA road map.

We integrated this model into Pipecat by creating a WebSocket server for streaming inference, and a client-side Pipecat service. (This is the same approach we used with Nemotron Speech ASR).

Putting the models together and measuring latency

These Nemotron and upcoming Magpie models are completely open: open weights, open source training data sets, and open source inference tooling. Working with open models in production feels like a super-power. We can do things like:

Read the inference code to understand the context requirements of the ASR model, so that we can optimize the interactions between our Pipecat pipeline components and text-to-speech audio buffer handling. (See our description of this above, in the section Fast, streaming transcription.
Fix issues with inference tooling support in new models and on whatever platforms we’re running on. See the code and README.md in the GitHub repo for the small patches we made for vLLM and llama.cpp, and the Docker container build with full MX4FP support for both of those inference servers on DGX Spark and RTX 5090.
Build a semi-streaming inference server for a preview model checkpoint.

Often when we’re building voice agents, our primary concern is to engineer the agent to respond quickly in a real-world conversation. The difference between good latency and an agent too slow to use in production is often a combination of several optimizations, each one cutting peak latencies by 100 or 200ms. Working with open models gives us control over how we prioritize for latency compared to throughput, how we design streaming and chunking of inference results, how to use models together optimally, and many other small things that add up (or subtract down) to fast response times.

It’s useful to measure voice-to-voice latency – the time between the user’s voice stopping and the bot’s voice response starting – in two places: on the server-side and at the client.

We can easily automate the server-side latency measurement. Our bot outputs a log line with a voice-to-voice latency metric for each turn.

2026-01-01 22:43:26.208 | INFO     | v2v_metrics:process_frame:54 - V2VMetrics: ServerVoiceToVoice TTFB: 465ms

We also output log lines with time-to-first-byte for each of our models, and several other log lines that are useful for understanding exactly where we’re “spending our latency budget.” The Pipecat Playground shows graphs of these metrics, which is useful during development and testing. Here’s a test session with our bot running on an RTX 5090.

RTX 5090

Metric	Min	P50	P90	Max
ASR	13ms	19ms	23ms	70ms
LLM	71ms	171ms	199ms	255ms
TTS	99ms	108ms	113ms	146ms
V2V	415ms	508ms	544ms	639ms

DGX Spark

Metric	Min	P50	P90	Max
ASR	24ms	27ms	69ms	122ms
LLM	343ms	750ms	915ms	1669ms
TTS	158ms	185ms	204ms	1171ms
V2V	759ms	1180ms	1359ms	2981ms

It’s also critical to measure the voice-to-voice latency as actually perceived by the user. This is harder to do automatically, especially for telephone call voice agents. The best approach to measuring client-side voice-to-voice latency is to record a call, load the audio file into an audio editor, and measure the gap between the end of the user’s speech waveform and the start of the bot speech waveform. You can’t cheat this measurement, or forget to include an important processing component! We do this periodically in both development and testing, as a sanity check. Here I’m measuring latency in the Descript editor of one turn in the conversation we recorded for the video at the top of this post.

You will typically see client-side voice-to-voice latency numbers about 250ms higher than server-side numbers for a WebRTC voice agent. This is time spent in audio processing at the operating system level, encoding and decoding, and network transport. Usually, this delta is a bit worse for telephone call agents: 300-600ms of extra latency in the telephony path that you don’t have much way to optimize. (Though there are some basic things you should do, such as make sure your voice agent is hosted in the same region as your telephony providers servers.) For more on latency, see the Voice AI and Voice Agents Illustrated Guide.

An inference optimization for local voice agents

We have one more trick up our sleeve when we’re running voice agents locally on a single GPU.

When we run voice agents in production in the cloud, we run each AI model on a dedicated GPU. We stream tokens from each model as fast as we can, and send them down the Pipecat pipeline as they arrive.

But when we’re running locally, all the models are sharing one GPU. In this context, we can engineer much faster voice-to-voice responses if we carefully schedule inference. In our voice agent for this project, we’re doing two things:

We run the Smart Turn model on the CPU so that we can dedicate the GPU to transcription when user speech is arriving. The Smart Turn model runs faster on GPU, but it runs fast enough on CPU, and dividing up the workload this way gives us the best possible performance between the two models.
We interleave small segments of LLM and TTS inference so that GPU resources are dedicated to one model at a time. This significantly reduces time-to-first-token for each model. First we generate a few small chunks of LLM tokens, then TTS audio, then LLM again, then TTS, etc. We generate a smaller segment for the very first response, so we can start audio playout as quickly as possible. We designed this interleaved chunking approach to work in concert with the hybrid Magpie streaming hack described above.

Here’s a sequence diagram showing the interleaved LLM and TTS inference. The three vertical lines in the diagram represent, from left to right:

Tokens arriving in small batches to the Pipecat LLM service in the agent and being pushed down the pipeline.
The Pipecat TTS service, managing the frames from the LLM service, dividing the stream on sentence boundaries, and making inference requests to the Magpie WebSocket server running in our local Docker container.
The Magpie WebSocket server doing inference and sending back audio.

We wrote a custom WebSocket inference server for Magpie, so we control the Pipecat-to-Magpie protocol completely. We’re using llama-server code from the llama.cpp project for LLM inference. Traditional inference stacks aren’t really designed to do this specific kind of chunking, so our code sets a max tokens count (n_predict in llama.cpp), runs repeated small inference chunks, and does some of the buffer management client-side. This could be done more efficiently, using the llama.cpp primitives directly. Writing a perfectly optimized inference server for this interleaved design would be a fun weekend project, and is something that almost anyone with a little bit of programming experience and a willingness to go down some rabbit holes could work together with Claude Code to implement.

Running this voice agent

For enterprise-scale, production use, deploy this agent to the Modal GPU cloud. There are instructions in the GitHub Readme.md. Modal is a serverless GPU platform that makes it easy to deploy AI models for development or production use.

For local development, the GitHub repo has a Dockerfile for DGX Spark (arm64 + Blackwell GB10 CUDA 13.1) and RTX 5090 (x86_64 + Blackwell CUDA 13.0)

If you’re interested in building voice agents, here are some resources you might be interested in:

Voice AI & Voice Agents Illustrated Primer
YouTube recordings of the community voice agents course sessions from last year
The Pipecat Discord, where lots of knowledgeable voice agent developers hang out.

Improved accuracy in Smart Turn v3.1

Marcus — Wed, 03 Dec 2025 18:13:57 GMT

We’re pleased to announce the availability of Smart Turn v3.1, now with improved accuracy thanks to a larger dataset of human audio samples, and improvements to how the model is quantized.

The model uses the same architecture as v3.0, and so v3.1 is a drop-in replacement — simply use your existing inference code with the new ONNX file.

As with v3.0, this new model will be integrated directly into the next Pipecat release, allowing you to integrate Smart Turn into your voice agent with minimal code changes. You can also use v3.1 with your existing version of Pipecat by manually specifying the path to the model weights.

Model weights: https://huggingface.co/pipecat-ai/smart-turn-v3
GitHub repo: https://github.com/pipecat-ai/smart-turn
Datasets: https://huggingface.co/pipecat-ai

New training data

We’d like to thank our partners, Liva AI, Midcentury, and MundoAI, who have provided new human audio samples in English and Spanish.

These new samples have been added to our datasets on HuggingFace – smart-turn-data-v3.1-train and smart-turn-data-v3.1-test contain all the audio data used to train and test Smart Turn v3.1. As with previous versions, these datasets are released openly.

Smart Turn has historically been heavily reliant on synthetic data generated by TTS models. Synthetic data often lacks the natural variability and subtle cues in actual human speech, and so this new human data has had a significant and measurable effect on accuracy, and we're grateful to be able to include it in the new version.

We've included some statistics in the "accuracy" section below showing the effect this new data has had on the model.

Liva AI

Liva AI provides real human voice data to improve speech models. We identify gaps in current research and collect targeted data to fill them, helping speech models better understand and express different languages, accents, and emotions. The company was founded by Ashley Mo, who has published audio research with MIT (IEEE), and Aoi Otani, who has published ML research in ICML and Nature.

We were honoured to partner with the Pipecat team on Smart Turn v3.1, contributing targeted training data that helped it better recognize subtle audio cues for natural turn-taking in conversations. We're entering an era where models not only sound expressive, but can understand speech and respond in conversation with the same dynamics as a human. Smart Turn is an exciting contribution to this field, and by open sourcing their models and datasets, they're enabling others to build better conversational AI for a wide variety of specific use cases. We're thrilled to have contributed to this milestone."

-- Ashley Mo, Co-Founder

Midcentury

Midcentury is a multimodal-native research company advancing AI with real-world datasets. We generate and license voice, video, and interaction data that trains models to perform in the environments where they actually matter. We’ve built large-scale conversational audio datasets across 12+ languages (including Japanese and Korean) and we license proprietary long-form video directly from creators. Leading AI labs rely on us to deliver the proprietary audio and voice data that power their models.

We worked on the Pipecat Smart Turn model because we’re huge fans of how Pipecat is pushing the OSS voice community forward. We genuinely believe that responsibly publishing high-quality data is one of the strongest levers for accelerating AI. Getting the right data to build better models is still way too hard, and we hope projects like this make it easier.

-- David Guo, Co-Founder

MundoAI

MundoAI is advancing AI by building the world's largest and highest quality multimodal datasets, starting with voice. Our bespoke data, spanning 16+ languages and sourced from a global network of contributors, already powers frontier models at leading research labs.

With our focus on multilingual and multimodal systems, we were thrilled to contribute to the Pipecat Smart Turn model and support its open-source mission. Increasing diversity in training data is essential for building models that work for everyone, so we're proud to play a role in enabling better and more inclusive AI.

-- Garreth Lee, Co-Founder

Model variants

In this release, you have the option of the two following variants of Smart Turn, depending on whether you're performing CPU or GPU inference.

CPU model (8MB, int8 quantized): this model is small and fast (the same size as v3.0), with CPU inference in as little as 12ms. As with v3.0, this model will be integrated directly into Pipecat in the next release, and inference completes in around 70ms on Pipecat Cloud.
GPU model (32MB, unquantized): if you’re running the model on a GPU, this larger variant provides better inference time compared to the 8MB version, and accuracy is improved by around 1%. It's also possible to run this model on CPU but with a longer inference time – see the "performance" section below.

Accuracy

Evaluating the model on our new smart-turn-data-v3.1-test dataset, we see a dramatic accuracy improvement for English and Spanish compared to the previous v3.0 model, thanks to the new human datasets:

Language	v3.0	v3.1 (8MB)	v3.1 (32MB)
English	88.3%	94.7%	95.6%
Spanish	86.7%	90.1%	91.0%

We still support 23 languages in total, and the remaining 21 languages have similar performance in v3.1 compared to v3.0. For a full list, see the benchmark results in our HuggingFace repo.

Performance

The 8MB variant of Smart Turn v3.1 maintains the same great inference speed of v3.0. We also now offer the unquantized 32MB variant for GPUs.

The preprocessing (feature extractor) cost is the same for both models and is listed separately.

Note: these results are for a single inference. The model supports batching, which can significantly improve performance.

Device	v3.1 (8MB)	v3.1 (32MB)	Preprocessing
GPU (NVIDIA L40S)	2 ms	1 ms	1 ms
GPU (NVIDIA T4)	5 ms	4 ms	2 ms
CPU (AWS c7a.2xlarge)	9 ms	13 ms	7 ms
CPU (AWS c8g.2xlarge)	20 ms	32 ms	9 ms
CPU (AWS c7a.medium)	37 ms	73 ms	7 ms
CPU (AWS c8g.medium)	57 ms	159 ms	9 ms

We’ve found that setting the following environment variables has a dramatic effect on performance and consistency, even in cases where the ONNX runtime itself is run with multiple threads, due to what seems to be a high level of contention and inter-dependency between the OpenMP threads. Relying on the ONNX runtime for higher-level parallelism on multi-CPU machines seems to give better results.

OMP_NUM_THREADS=1
OMP_WAIT_POLICY="PASSIVE"

What's next

Smart Turn v3.1 is another step forward in turn-taking accuracy, and we couldn't have done it without the contributions from our partners at Liva AI, Midcentury, and MundoAI. Their high-quality human audio data has been instrumental in closing the gap between synthetic and real-world performance.

We're already hard at work on Smart Turn v3.2, where we'll continue our focus on accuracy improvements across all supported languages. As always, we'll be releasing our models, datasets, and benchmarks openly so the community can build on this work.

If you have feedback or questions, reach out to us on GitHub or join the conversation in our Discord.

Announcing Smart Turn v3, with CPU inference in just 12ms

Marcus — Thu, 11 Sep 2025 16:15:32 GMT

Today we’re excited to share our updated Smart Turn v3 model, which leapfrogs both the previous version and competing models in size and performance. For the first time, Smart Turn is small and fast enough to run on CPU.

This new version is still fully open source — this includes the weights, training data, and the training script.

Model weights
GitHub repo with code for training and inference
Previous v2 announcement with additional background info about the model
Datasets used for training and testing
- pipecat-ai/smart-turn-data-v3-train
- pipecat-ai/smart-turn-data-v3-test

Changes

Nearly 50x smaller than v2, at only 8 MB 🤯
Lightning-fast CPU inference: 12ms on modern CPUs, 60ms on a low cost AWS instance. No GPU required — run directly inside your Pipecat Cloud instance!
Expanded language support: Now covers 23 languages:
- 🇸🇦 Arabic, 🇧🇩 Bengali, 🇨🇳 Chinese, 🇩🇰 Danish, 🇳🇱 Dutch, 🇩🇪 German, 🇬🇧🇺🇸 English, 🇫🇮 Finnish, 🇫🇷 French, 🇮🇳 Hindi, 🇮🇩 Indonesian, 🇮🇹 Italian, 🇯🇵 Japanese, 🇰🇷 Korean, 🇮🇳 Marathi, 🇳🇴 Norwegian, 🇵🇱 Polish, 🇵🇹 Portuguese, 🇷🇺 Russian, 🇪🇸 Spanish, 🇹🇷 Turkish, 🇺🇦 Ukrainian, and 🇻🇳 Vietnamese.
Better accuracy compared to v2, despite the size reduction

How Smart Turn v3 compares

We’re always pleased to see innovation from other developers in this space, and since we released Smart Turn v2, two other promising native audio turn detection models have been announced. Here is a high-level comparison:

	Smart Turn v3	Krisp	Ultravox
Size	8 MB	65 MB	1.37 GB
Language support	23 languages	Trained/tested on English only	26 languages
Availability	Open weights, data, and training script	Proprietary	Open weights
Architecture Focus	Single-inference decision latency	Multiple inferences to maximize decision confidence	Using conversation context alongside audio from current turn

We’re currently working on an open and transparent benchmark to compare the accuracy of models, and are working together with both the Krisp and Ultravox teams on this project. We’ve included our own accuracy benchmarks below, and you can reproduce these using benchmark.py and our open test dataset.

Performance

Smart Turn v3 has dramatically improved performance, with a 100x speedup on a c8g.medium AWS instance compared to v2, and a 20-60x improvement on other CPU types.

The figures below include both audio preprocessing and inference. We found that CPU preprocessing contributes approximately 3ms to the execution time, and this starts to outweigh the actual inference time on fast GPUs.

	Smart Turn v2	Smart Turn v3
NVIDIA L40S (Modal)	12.5 ms	3.3 ms
NVIDIA L4 (Modal)	30.8 ms	3.6 ms
NVIDIA A100 (Modal)	19.1 ms	4.3 ms
NVIDIA T4	74.5 ms	6.6 ms
CPU (AWS c7a.2xlarge)	450.6 ms	12.6 ms
CPU (AWS c8g.2xlarge)	903.1 ms	15.2 ms
CPU (Modal, 6 cores)	410.1 ms	17.7 ms
CPU (AWS t3.2xlarge)	900.4 ms	33.8 ms
CPU (AWS c8g.medium)	6272.4 ms	59.8 ms
CPU (AWS t3.medium)	-	94.8 ms

For CPU inference, we got the best results with the following session options, and it may be possible to increase performance further with additional tuning.

def build_cpu_session(onnx_path):
    so = ort.SessionOptions()
    so.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
    so.inter_op_num_threads = 1
    so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    return ort.InferenceSession(onnx_path, sess_options=so, providers=["CPUExecutionProvider"])

Architecture

Smart Turn v2 was based on the wav2vec2 speech model, which is around 400MB in size.

For v3, we experimented with several architectures before settling on Whisper Tiny, which has only 39M parameters. In our testing, despite the small size of the model, it was able to achieve better accuracy than v2 on our testing set.

Only the encoder layers of Whisper were required, onto which we added the existing linear classification layers from Smart Turn v2, resulting in a model with 8M parameters in total.

We have also applied int8 quantization to the model, in the form of static QAT (quantization aware training). We found that this preserves the accuracy of v2, while leading to significantly increased performance, and a 4x smaller filesize of 8 MB.

Currently we’re exporting the model in ONNX format. Since we’re focusing on quantization and optimized CPU inference in this release, ONNX seemed like a great fit.

Accuracy results

Smart Turn v3 maintains or improves accuracy across all supported languages compared to v2. Please see the table below for the results from our test dataset.

You can reproduce these results yourself using our open testing dataset, and benchmark.py from the Smart Turn GitHub repo.

If you’d like to help clean up the dataset to improve accuracy further by listening to audio samples, please visit the following link: https://smart-turn-dataset.pipecat.ai/

Language	Test samples	Accuracy (%)	False Positives (%)	False Negatives (%)
🇹🇷 Turkish	966	97.10	1.66	1.24
🇰🇷 Korean	890	96.85	1.12	2.02
🇯🇵 Japanese	834	96.76	2.04	1.20
🇳🇱 Dutch	1,401	96.29	1.86	1.86
🇩🇪 German	1,322	96.14	2.50	1.36
🇫🇷 French	1,253	96.01	1.60	2.39
🇵🇹 Portuguese	1,398	95.42	2.79	1.79
🇮🇹 Italian	782	95.01	3.07	1.92
🇫🇮 Finnish	1,010	94.65	3.27	2.08
🇵🇱 Polish	976	94.47	2.87	2.66
🇮🇩 Indonesian	971	94.44	4.22	1.34
🇬🇧 🇺🇸 English	2,846	94.31	2.64	3.06
🇺🇦 Ukrainian	929	94.29	2.80	2.91
🇳🇴 Norwegian	1,014	93.69	3.65	2.66
🇷🇺 Russian	1,470	93.67	3.33	2.99
🇮🇳 Hindi	1,295	93.44	4.40	2.16
🇩🇰 Danish	779	93.07	4.88	2.05
🇪🇸 Spanish	1,295	91.97	4.48	3.55
🇸🇦 Arabic	947	88.60	6.97	4.44
🇨🇳 Chinese	945	88.57	4.76	6.67
🇮🇳 Marathi	774	87.60	8.27	4.13
🇧🇩 Bengali	1,000	84.10	10.80	5.10
🇻🇳 Vietnamese	1,004	81.27	14.84	3.88

How to use the model

As with v2, there are several ways to use the model.

With Pipecat

Support for Smart Turn v3 is already integrated into Pipecat using LocalSmartTurnAnalyzerV3. You’ll need to download the ONNX model file from our HuggingFace repo.

To see this in action in an application, please see our local-smart-turn sample code.

⚠️

Note: the LocalSmartTurnAnalyzerV3 class will be added in Pipecat v0.0.85 (out soon). You can use it right away by using the main branch of Pipecat.

Standalone

You can run the model directly using the ONNX runtime. We’ve included some sample code for this in inference.py in the GitHub repo, and this is used in predict.py and record_and_predict.py.

Note that a VAD model like Silero should be used in conjunction with Smart Turn. The model works with audio chunks up to 8 seconds, and you should include as much context from the current turn as possible. For more details, see the README.

Conclusion

Support for CPU inference is a huge step for Smart Turn, and we encourage you to use this new release directly in your Pipecat Cloud bot instances.

If you speak any of the languages in the list above (particularly those with lower accuracy), we’d appreciate your help listening to some data samples to improve the quality: https://smart-turn-dataset.pipecat.ai/

And if you have any thoughts or questions about the new release, you can get in touch with us at the Pipecat Discord server or on our GitHub repo.

You don’t need a WebRTC server for your voice agents

Kwindla Hultman Kramer — Tue, 22 Jul 2025 00:51:34 GMT

Summary:

There are two ways to set up WebRTC connections to voice agents running in the cloud. You can route through a WebRTC server. Or you can set up direct, “serverless” WebRTC connections. If you are deploying your own WebRTC voice agent infrastructure, you should almost certainly use the serverless approach.
The most widely used serverless WebRTC transport for voice agents is the Pipecat SmallWebRTCTransport, built on the Python aiortc library. The SmallWebRTCTransport ecosystem includes SDK support for JavaScript, iOS, Android, Python, C++, and embedded systems.
Note that using a WebRTC cloud with a large geographic footprint is often the best choice for conversational AI use cases.

Network protocols for voice agents

Voice AI agents use WebRTC, WebSockets, or SIP connections to stream realtime audio.

WebRTC is optimized for client-server (edge-to-cloud) network connections. Use WebRTC if your voice agent is running in a native mobile app or in a web browser.
WebSockets are great for server-to-server audio connections. For example, your voice agent code running in the cloud is connecting to a realtime audio API for transcription. (You shouldn't use WebSockets for edge-to-cloud realtime audio.)
SIP is used for interconnection with telephone systems.

In this post, we'll focus on WebRTC. It's the current state-of-the-art for delivering audio reliably, at the lowest possible latency, over real-world network connections.

Voice agents at scale, for all use cases other than telephone calls, require WebRTC.

To learn more, check out the open source text Voice AI & Voice Agents: An Illustrated Primer. (PRs welcome.) Specifically, we'll discuss different ways developers can build with WebRTC — a traditional WebRTC server approach, as well as the newer "serverless" WebRTC connection. We'll keep in mind requirements for agentic use cases, and start with a brief discussion of WebRTC architecture.

WebRTC architecture

WebRTC was designed to handle the complexities of voice and video, at ultra low latency. A WebRTC connection automatically adjusts to changing network conditions, making audio conversations possible even on poor connections. (Typical poor connections in the real world are cellular data in a congested environment, or a user who is far away from their WiFi router.)

WebRTC is widely used for video and audio calls over the Internet. Google Meet, Microsoft Teams, Facebook Messenger, and WhatsApp all use WebRTC.

For these systems, it makes sense to route video and audio through dedicated WebRTC servers. Because …

Calls with several participants require routing support from a centralized server.
Calls between people who are far apart geographically benefit from mesh routing.
Routing media directly between end-user devices can be challenging. Putting a server in the middle increases call success rates and improves average latency and throughput metrics. In the early days of the Internet, most routes across the network were roughly equivalent. Today, Internet peering strongly favors connectivity to the hyper-scaler clouds (AWS, Google Cloud, Microsoft Azure, and Oracle Cloud).

A technical diagram showing how a multiparticipant call is routed through a WebRTC server

However, WebRTC connections don't need to route through a server. You can set up a direct connection between your client code and your voice agent running in the cloud.

On the left is a voice agent connection routed through a traditional WebRTC server. On the right is a “serverless” voice agent connection, directly between a web client and the agent.

The pros and cons of routing through a server

Routing through a server is the traditional way of scaling WebRTC. Commercial WebRTC cloud providers (like Daily) have spent tens of thousands of engineering hours building out reliable, flexible cloud infrastructure based on the server approach.

A WebRTC cloud:

Enables calls with hundreds or thousands of simultaneous participants.
Routes media packets across private network connections that are faster than long-haul public Internet routes.
Puts servers “close to the edge” in a large number of geographic regions, for optimal first-hop latency and fast regional routing.
Auto-scales to handle large amounts of traffic.
Incorporates redundancy and fail-over between servers, regions, and cloud providers.

On the other hand, routing directly:

Eliminates the network hop through the server.
Massively reduces the complexity of building out and maintaining WebRTC infrastructure for voice agents. With a serverless approach, you don’t have to maintain any WebRTC-specific infrastructure at all, because the WebRTC code is integrated into the client and agent SDKs directly.

Optimizing for two things: latency and reliability

Realtime voice infrastructure has to deliver very low latency and very high reliability. These are the two things we are most concerned with when we design, build, and manage large WebRTC systems.

Latency

A state-of-the-art WebRTC cloud, operating globally, will deliver better average latency than the serverless approach.

A global WebRTC cloud can connect each client to a server very close by, then route server-to-server over private network connections that are faster than the public Internet. This is called mesh routing. Daily’s WebRTC cloud operates approximately 75 points of presence in 10 geographic regions. The P50 first-hop latency to the edge of Daily’s cloud is 13ms.

The benefits of mesh routing more than make up for the extra network hop(s) through Daily’s WebRTC server(s), compared to the direct connections of the serverless WebRTC approach.

However, if you are deploying WebRTC servers without mesh routing, the calculation is reversed. The extra network hop from the WebRTC server to the agent adds 10 to 100ms, depending on exactly how you deploy your infrastructure and where your servers and users are located.

Building and managing a WebRTC cloud with mesh connectivity is a very large engineering, devops, and static infrastructure cost commitment. You must have auto-scaling clusters of WebRTC servers wherever you have users. You will have to set up and maintain VPC routes between your WebRTC server clusters. You will have to write the mesh routing code. (There is no mesh WebRTC open source implementation.) Very few teams will have the engineering resources available to build out a WebRTC cloud with mesh routing, or have the baseline traffic volume to justify WebRTC server clusters in 5 or more geographic regions.

So, if you want to maintain your own WebRTC infrastructure rather than use a commercial WebRTC cloud, you should use the serverless WebRTC approach. Your average latency will be 10 to 100ms lower for direct, serverless WebRTC connections than for connections that route through non-mesh WebRTC servers.

Reliability

Delivering reliable WebRTC sessions means minimizing a large number of potential failure modes: connection setup blocked by firewalls, network issues that impact audio quality, infrastructure components that hit scaling bottlenecks, retry logic in the network code that does not work for every corner case, and many more.

A state-of-the-art WebRTC cloud, plus heavily tested client SDKs, will deliver higher reliability than serverless WebRTC. In particular, the best WebRTC servers and client SDKs have sophisticated, extensively tested code for adapting to changing network conditions.

However, managing a WebRTC cloud is a complex, specific devops job. If you have a non-trivial amount of traffic, you will need auto-scaling, service discovery, geographic routing, new version roll-out, and observability code that is unique to WebRTC workloads. None of this functionality is available in open source WebRTC servers. You will have to design, test, deploy, and learn edge case lessons about all of the Kubernetes-related and other devops components yourself.

Here again, if you want to maintain your own WebRTC infrastructure rather than use a commercial WebRTC cloud, you will have better reliability numbers if you avoid trying to build and manage WebRTC servers at all. You can bundle the cloud side of the WebRTC transport code into your agents, which means that you’re only managing the agent workloads themselves and you get the WebRTC devops nearly “for free.”

When should you use a WebRTC server and when should you “go serverless?”

If you’ve read this far, you know that the basic advice for whether to use WebRTC servers or go serverless for voice agents depends on whether you want to manage all of the infrastructure yourself:

If you are happy to use a commercial WebRTC cloud with a global footprint, you will benefit from the engineering effort that has gone into minimizing latency and maximizing total reliability.
If you want to run your own infrastructure, you should go serverless. You’ll have lower average latency, higher reliability, and much simpler infrastructure to manage.
Serverless WebRTC can be easier to integrate into existing software tools, or to implement for embedded hardware. See the pipecat-esp32 project for a serverless WebRTC client for the ESP32-S3 family of microprocessors. ESP32 chips are widely used in small electronics devices.

There are use cases that require WebRTC servers.

For sessions with more than two participants, you will need to use WebRTC servers. Serverless WebRTC only works for simple voice agent connections: one human and one AI agent. If you want to connect multiple people or multiple agents, you can’t use serverless WebRTC.
For video, you should probably use WebRTC servers. Video requires much more bandwidth than audio and depends heavily on the sophisticated network adaptions that WebRTC servers are good at.

Pipecat is an open source voice (and video) realtime agents toolkit that gives you the flexibility to use a wide range of network transports, depending on your goals and use cases. Pipecat is vendor neutral, and supports serverless WebRTC, cloud WebRTC and SIP (Daily and LiveKit transport implementations), WebSockets (including Twilio Media Streams WebSockets), and telephony connections via Twilio, Plivo, and other providers.

You can use Pipecat’s SmallWebRTCTransport for serverless WebRTC. The SmallWebRTCTransport is designed with all of the serverless advantages described above in mind, and has no dependencies on any external service or infrastructure. All of the Pipecat examples and getting started repos use SmallWebRTCTransport.

You can experiment with both the SmallWebRTCTransport and the DailyTransport side-by-side. You can use serverless WebRTC for development and prototyping, and deploy to production on the Daily global WebRTC cloud with the DailyTransport. Or, of course, you can write your own transport implementation! Pipecat is completely, 100% Open Source and vendor neutral.

The Pipecat Cloud voice agent hosting service bundles Daily WebRTC for free. You don’t pay anything extra to use Daily’s global WebRTC cloud when you deploy to Pipecat Cloud. One of the goals of Pipecat Cloud is to make it easy to use the best performing voice agent building blocks, including Daily WebRTC, the Krisp voice isolation and noise reduction models, and the smart-turn semantic VAD model, all of which are bundled into Pipecat Cloud at no cost.

A special shout out to Sean DuBois, Thor Schaeff, and Aleix Conchillo Flaqué for creating community momentum around the open source WebRTC client for the ESP32 chips. If you’re interested in voice AI hardware, or just in contributing to a fun project, please join us and create PRs, add to the docs, or post videos of things you’re building!

Smart Turn v2: faster inference, and 13 new languages for voice AI

Marcus — Fri, 18 Jul 2025 17:49:37 GMT

We’re releasing a new version of our open source Smart Turn model, with improved inference speed, and support for multiple languages.

When used alongside a traditional voice activity detection (VAD) model, Smart Turn accurately detects when a speaker has finished their sentence, making use of semantic and vocal cues.

Smart Turn specifically uses the audio input (not just a text transcription) to the voice agent to perform the most accurate possible turn detection. The model outputs a prediction, indicating how likely it is that the user has finished speaking.

Using this information, an AI agent can avoid talking over a user, making conversations feel more natural and enjoyable.

We make everything about the model available as open source: the weights, the training script, and the datasets. Smart Turn is also integrated into the Pipecat framework.

Model weights: https://huggingface.co/pipecat-ai/smart-turn-v2
Code for training and inference: https://github.com/pipecat-ai/smart-turn

Native audio input

Smart Turn is trained on audio data and uses the speaker's audio as input. This allows us to make decisions using the intonation and pace of the user's speech — which provide essential cues about the user's intent — rather than just the words themselves.

Additionally, transcription models often ignore critical filler words like "um" and "hmm", whereas Smart Turn is explicitly trained to recognise these. As a result, the voice agent is able to perform the most accurate possible turn detection.

New features

Support for 🇬🇧 🇺🇸 English, 🇫🇷 French, 🇩🇪 German, 🇪🇸 Spanish, 🇵🇹 Portuguese, 🇨🇳 Chinese, 🇯🇵 Japanese, 🇮🇳 Hindi, 🇮🇹 Italian, 🇰🇷 Korean, 🇳🇱 Dutch, 🇵🇱 Polish, 🇷🇺 Russian, and 🇹🇷 Turkish
More than 6x smaller, weighing in at only 360MB compared to the previous model’s 2.3GB
Inference is now 3x faster, with 12ms inference times on an NVIDIA L40S

Using the model

There are several ways to integrate Smart Turn v2 into your voice agent:

With Pipecat

Pipecat supports local inference using LocalSmartTurnAnalyzerV2 (available in v0.0.77), and also supports using the instance hosted on Fal using FalSmartTurnAnalyzer.

For more information, see the Pipecat documentation:

https://docs.pipecat.ai/server/utilities/smart-turn/smart-turn-overview

With Pipecat Cloud

Pipecat Cloud users can make use of Fal's hosted Smart Turn v2 inference using FalSmartTurnAnalyzer. This service is provided at no extra cost.

See the following page for details:

https://pipecat-cloud.mintlify.app/pipecat-in-production/smart-turn

With local inference

From the Smart Turn source repository, obtain the files model.py and inference.py. Import these files into your project and invoke the predict_endpoint() function with your audio. For an example, please see predict.py:

https://github.com/pipecat-ai/smart-turn/blob/main/predict.py

With Fal hosted inference

Fal provides a hosted Smart Turn endpoint which has been updated with the latest v2 model.

https://fal.ai/models/fal-ai/smart-turn/api

Please see the link above for documentation, or try the sample curl command below.

curl -X POST --url https://fal.run/fal-ai/smart-turn \
    --header "Authorization: Key $FAL_KEY" \
    --header "Content-Type: application/json" \
    --data '{ "audio_url": "https://fal.media/files/panda/5-QaAOC32rB_hqWaVdqEH.mpga" }'

Accuracy

Smart Turn v2 achieves around 99% accuracy on unseen data from our human_5_all dataset, which contains English samples recorded by actual human speakers.

We also evaluate the model on synthetic data, broken down by language. The table below shows the approximate accuracy of the model when tested against unseen synthetic data.

Language	Accuracy
🇬🇧 🇺🇸 English	94.27%
🇮🇹 Italian	94.37%
🇫🇷 French	95.46%
🇪🇸 Spanish	92.14%
🇳🇱 Dutch	96.72%
🇷🇺 Russian	93.02%
🇩🇪 German	95.79%
🇨🇳 Chinese (Mandarin)	87.20%
🇰🇷 Korean	95.51%
🇵🇹 Portuguese	95.50%
🇹🇷 Turkish	96.80%
🇯🇵 Japanese	95.38%
🇵🇱 Polish	94.57%
🇮🇳 Hindi	91.20%

In this case, accuracy is defined as the percentage of times when the model correctly classified a sample as either “complete” or “incomplete”, with a 50/50 split of input samples of each type.

We suspect that the main thing holding back the accuracy is invalid or ambiguous samples in the dataset. Manually cleaning up the human_5_all dataset took us from 95% accuracy up to 99%, and so we’re planning to do the same for our new chirp3_1 dataset.

Performance

Smart Turn v2 is around three times as fast as the first version. We've included some approximate benchmarks below showing the performance of the model on various devices, when processing 8 seconds of input audio.

Device	Inference time
NVIDIA L40S (Modal)	12.5 ms
NVIDIA A100 (Modal)	19.1 ms
NVIDIA L4 (Modal)	30.8 ms
NVIDIA T4 (AWS g4dn.xlarge)	74.5 ms
CPU (Modal)	410.1 ms
CPU (AWS c7a.2xlarge)	450.6 ms
CPU (AWS t3.2xlarge)	900.4 ms
CPU (AWS c8g.2xlarge)	903.1 ms
CPU (AWS c8g.medium)	6272.4 ms

Architecture

For the first version of Smart Turn, we opted to base the model on wav2vec2-BERT, in the hope that BERT’s pretraining on 4.5 million hours of multilingual data would help us generalize turn detection to multiple languages.

In practice, during the training of our v2 model, we found that wav2vec2-BERT actually gives less accurate results on unseen data than wav2vec2, possibly because of overfitting with the larger model.

We experimented with several different architectures, including LSTM-based models, and also models where we appended some extra transformer classifier layers to the end of wav2vec2. So far, the best performing architecture we’ve found has been wav2vec2 with a linear classifier.

We’ll continue investigating how we can improve the model’s performance in future, and it’s likely that even smaller versions of the model will continue to produce accurate results.

Dataset

Until now, Smart Turn has only been trained on English datasets, and for this new multilingual version, we’ve added samples in 13 additional languages.

Our goal has always been to rely on synthetic training data as much as possible, because it allows us to generate a much higher number of samples at a lower cost. This means the audio samples are generated using a TTS model, rather than recorded by a human.

All the datasets used to train Smart Turn are freely available here: https://huggingface.co/pipecat-ai/datasets

Text to speech

There are two main types of incomplete sentence we want the model to detect:

Those ending in a filler word, for example, “My phone number is, um…”
Sentences where the speaker uses intonation and vocal cues to indicate that they have more to say

Many text to speech models perform poorly in both these cases. In our testing, few seem to be capable of ending a partial sentence with the desired intonation, and often, filler words like “um” and “er” are pronounced in an unnatural way.

We eventually settled on Google’s Chirp3 model. Pronunciation of filler words is excellent, and we found that ending a sentence with a comma generally causes the model to use the correct intonation.

Sentence list

We used the following dataset as a starting point, as it contains a large number of textual sentences in a variety of languages:

https://huggingface.co/datasets/agentlans/high-quality-multilingual-sentences

Despite the name, we found that not all of the sentences in the dataset were high quality. Many of them were grammatically incorrect, incomplete, or they resembled article section headers rather than something a person would speak out loud.

{"text": "HOW TO GET YOUR There is two ways to get them."}

{"text": "ALGOL and programming language research As Peter Landin noted, ALGOL was the first language to combine seamlessly imperative effects with the  lambda calculus."}

{"text": "Etymology The word aesthetic is derived from the Ancient Greek , which in turn comes from  and is related to ."}

To clean up the data, we used the Gemini 2.5 Flash LLM, asking it to classify each sentence as follows:

Please look at the following sentence, and place it in one of the following categories:

G: The sentence is ungrammatical
P: The punctuation is invalid
I: It is grammatically or semantically incomplete
S: It is unlikely to appear in spoken text (as opposed to written text, e.g. a section header)
C: It mentions controversial topics such as politics or religion
L: The sentence does not match the expected language
X: It doesn't fall into any of the above categories

Provide a brief rationale in at most 10 words, and then give your classification in the form @X@ (with an "at" symbol on each side).

Depending on the language, around 50-80% of the sentences were thrown away.

Filler words

Each language has a different set of filler words. Whereas someone speaking English might fill in a gap with “um…”, someone speaking Japanese might use “えーと” or “あの”.

Using Claude and GPT-o3, we built up a list of filler words in each language. We also included various connective words such as “and”, “but”, and “so”, because these will almost never end a sentence and would imply the speaker has more to say.

For previous English datasets, we were able to split the sentence randomly at a space, end the sentence there, and append a random filler word. This time, we used an LLM (Gemini Flash). This allows us to handle languages which don’t use spaces to separate words, such as Chinese, and also allows us to split the sentence at a more natural point.

Take this sentence and cut it off near the end, then add the filler word "{filler}" and ellipses (...) at the end.

For example:

"How tall is the Eiffel Tower" → "How tall is the, um..."

"What time does the store close" → "What time does the store, uh..."

Sentence to process: "{sentence}"
Filler word to use: "{filler}"

Think about where to cut the sentence naturally (as close to the end as possible), then provide your final answer wrapped in @@@ symbols like this: @@@Your incomplete sentence here@@@

We also generated subsets of data which had filler words in the middle of the sentence, to teach the model to distinguish between the two cases — just because someone said “um” halfway through their sentence, it doesn’t mean they didn’t finish talking.

Training runs

We use Modal to perform model training. For this new version, we switched to an L40S GPU, and also modified the training script to make use of more CPU cores and system memory (to speed up dataset processing).

@app.function(
    image=image,
    gpu="L40S",
    memory=16384,
    cpu=16.0,
    volumes={"/data": volume},
    timeout=86400,
    secrets=[modal.Secret.from_name("wandb-secret")],
)
def training_run(run_number):

Evaluation

We log evaluation results to Weights & Biases live during each training run, which lets us observe how the accuracy of the model evolves as each run progresses.

The training script breaks down the evaluation data by language, so that we can see the performance of each language individually.

We keep the training and evaluation data separate, to ensure that the model can handle inputs that weren’t in its training set. On average, we see around 95% accuracy on this unseen training data.

Help needed: Cleaning up the dataset

As the dataset is synthetic, some of the samples are inaccurate or sound unnatural. To combat this, we’re aiming to manually verify and classify each of the generated samples.

Anyone can help out with this effort, which will improve the accuracy of the next generation of Smart Turn models.

https://smart-turn-dataset.pipecat.ai/

Help needed: Contributing human data samples

Data samples contributed by people (as opposed to synthetic data samples) are essential for evaluating the model and also for training. If you’d like to contribute to this, please visit the following link:

https://conversation-collector.vercel.app/

Conclusion

We’d love to hear your thoughts on the new model — you can get in touch with us at the Pipecat Discord server or on our GitHub repo.

We hope the new multilingual version will open up Smart Turn to more users and use cases, and we look forward to adding even more languages in future!

Advice on Building Voice AI in June 2025

Kwindla Hultman Kramer — Tue, 24 Jun 2025 17:47:09 GMT

My top three pieces of advice for people getting started with voice agents.

Spend time up front understanding why latency and instruction following accuracy drive voice AI tech choices.
You will need to add significant tooling complexity as you go from proof of concept to production. Prepare for that. Especially important: build lightweight evals as early as you can.
The right path is: start with a proven, "best practices" tech stack -> get everything working one piece at a time -> deploy to real-world users and collect data -> then think about optimizing cost/latency/etc.

Let's take these one at a time.

Latency and function calling accuracy drive voice AI architecture

Latency

A good rule of thumb is that you should be aiming for 800ms median voice-to-voice latency (eventually).

It's okay to start with a looser target (1,500ms in initial proof of concept, for example). But you should understand from the beginning what contributes to latency. This drives model choice, network stack, design of your main conversation loop, the fact that you shouldn't (mostly) use MCP in a voice agent, etc.

Big contributors to latency

Network - 200ms (if you use WebRTC, worse with WebSockets)
Turn detection and transcription - 400ms
LLM - 500ms
Text to speech - 200ms

Some things to think about:

Any conversation turn with a tool call doubles the LLM latency.
The above are P50 numbers for the best hosted services today. Using services with worse P50 numbers or big P95 spreads will have a substantial impact on your total voice-to-voice latency.
Trade-offs abound: you really like the voice from provider X, but the P50 TTFB is 700ms rather than 200ms. Is that worth it? It might be! But measure everything so you can make choices that are both quantitative and qualitative.

References:

https://voiceaiandvoiceagents.com/#latency

Instruction following

"Instruction following" just means that the LLM does what you expect it to do, given a good prompt.

The most important subset of instruction following is function calling accuracy.

Almost all production voice AI agents rely heavily on function calling for things like:

context look-up (RAG)
saving data to back-end systems
integrating with telephony systems
cleanly terminating a session

Instruction following accuracy is critical for voice agents. Sadly, even today's best LLMs don't have great instruction following performance in multi-turn conversation contexts.

GPT-4o is the best general-purpose model on the Berkeley Function-Calling Leaderboard. It scores 72% overall accuracy on that benchmark. But on the "multi-turn" subset of the BFCL, GPT-4o scores 50%. GPT-4o-mini scores 34% on multi-turn accuracy.

This has three implications:

For almost all voice AI use cases, you need to use the current best available model. Any other model choice reduces agent performance unacceptably. That means that today you should generally be starting with GPT-4o or Gemini 2.5 Flash, until you collect enough data to write evals that show you if other models work well enough for your app.
Make things as easy as possible for the LLM. Define as few tools as possible. Write detailed, multi-shot prompts. Don't inject extra indeterminacy if you can avoid it (be very selective about where you use MCP vs hard-coding your tool calls, for example).
Because instruction following degrades quickly as multi-turn context length grows, you will very often need to do in-session "context engineering" to achieve acceptable success rates. This means compressing and focusing conversation context at specific points in the voice workflow.

A note on speech-to-speech models/APIs like the OpenAI Realtime API, the Gemini Live API, and AWS Nova Sonic

These next-generation models and APIs are the future. But for most use cases, they aren't the present, yet. Instruction following performance, ability to manage context flexibly via the API, and ability to do end-to-end monitoring and debugging are all worse with speech-to-speech APIs, today.

Most of us building in voice AI today recommend starting with the three-model (transcription -> text-mode LLM -> voice generation) in almost all cases. The exceptions to this rule are if your use case is "narrative" and doesn't require high instruction following accuracy, or if you are doing mixed-language conversations such as language tutoring.

The good news is that if you're using a framework like Pipecat, it's fairly easy to write your agent code so you can switch between the three-model approach and speech-to-speech models without changing any of your app logic. So you can test every new model release, benchmark against your evals, etc. If a speech-to-speech model performs well on your evals, you can move from the 3-model approach to that speech-to-speech model.

References:

Understand the tooling necessary to go from proof of concept to production

It's fairly easy today to build a really good voice AI demo. It's not as easy to go from ~90% conversation success rate to >99%.

On the way from initial POC to production ...

You will definitely need

Traces and observability tooling.

In production, your agent will fail. You will have:

Prompt issues (that you can relatively easily improve if you have good monitoring in place).
Service-level issues (your providers will not be as reliable as you wish they were, at this stage in the evolution of AI).
Bugs in your code :-).

The easier it is to look at stack traces, otel spans, and inference results for every unsuccessful conversation, the faster you will be able to improve your agents.

Inference traces are especially important for building an evals flywheel. Once you have an agent running in production, you can constantly improve your conversation success rates by pulling real-world data into a lighweight evals regime. This is one of the highest impact ways you can spend your engineering/product effort.

You will probably need

Context compression

As described above, instruction following and function calling accuracy degrade considerably over the course of a multi-turn conversation.

If your voice agent needs to follow a series of steps reliably, or will perform conversations longer than a few turns, you will probably need to think about doing "context engineering" to keep the conversation context short and focused.

One useful way to think about context engineering is to design your conversation as a series of workflow states. Each state corresponds to a specific "job to be done" during the voice interaction. For each state, you can define:

A system instruction for the state.
A context transformation to do when you enter the state. This is typically an LLM prompt that takes the full previous conversation state and summarizes it, focusing on the most relevant information.
Tool calls available in this state.
Next states that the LLM can proceed to from this state.

The popular Pipecat Flows library implements helpers for this state machine approach.

Async tool use

Does your agent need to look up information in back-end systems? (To do RAG, for example.) Do you need to use MCP servers? Web search?

All of the above things are probably too slow to do synchronously in your core conversation loop. So you will need to implement some kind of async function calling approach.

The basic idea is to either:

Return from the tool call right away and insert a place-holder function call output message in the conversation context. Later, when the actual tool call returns, you can incorporate the results in the context. (There are a few different ways you might want to do this, depending on your use case.)
Don't do the tool call from the main conversation loop at all. Run a parallel inference pipeline dedicated only to tool calling that inserts information into the conversation context from outside the main conversation loop. This works well for situations where the tool call is contextual but not actually triggered by anything specific that the user says. For example, updating an image to match the context of an interactive children's story, or doing safety checks on input content and generated content.

References:

Optimizing AI Voice Agents: https://www.youtube.com/watch?v=I86dFivLzXY
https://voiceaiandvoiceagents.com/#async-function-calls
https://github.com/pipecat-ai/pipecat-flows

You might need a long tail of additional capabilities

Production voice agents often need things like language switching, content safety filters, voicemail detection logic, phone tree navigation, call center "warm transfers," and many more.

Use a proven stack and make it work, before you optimize or get creative

There are enough moving parts in a production-quality voice AI stack that it generally makes sense to start with elements you know people are using successfully in production, at scale. Get things working reasonably well. And then start optimizing/experimenting.

This means use a framework that gives you reliable network transport, echo cancellation, good turn detection and interruption handling, context management and async function calling helpers, audio resampling and buffer management, etc.

You'll also need to choose a speech-to-text (transcription) model, an LLM, and a text-to-speech (voice) model.

Pick a transcription model that works well for your language. The most widely used transcription models for realtime AI are (in alphabetical order): Cartesia, Deepgram, and Gladia
For the LLM, use GPT-4o or Gemini 2.5 Flash
Pick a voice you like from Cartesia, ElevenLabs, or Rime.

There are other choices in all of these categories. But the fastest path to production is to use models that work well for realtime voice AI as your starting point.

Here are a couple of examples of issues you may face with models other than the ones above.

There are great voices from other TTS providers, including some amazing open source options. But if your TTS model doesn't have word-level timestamps, you can't align the conversation context with what the user heard, when the user interrupts the agent. For most use cases, you need to be able to do this. Cartesia, ElevenLabs, and Rime all have word-level timestamp support.
For STT, if your model delivers final transcripts outside your VAD timeout window, you have to figure out how to handle that. There's no perfect option. You can introduce an additional aggregation delay. (That's what Pipecat does.) But that pads your voice-to-voice response time substantially.

These kinds of things take a lot of time to wrap your head around when they happen in your voice agent and you're new to debugging voice AI pipelines.

Relatedly, start with the default VAD (voice activity detection) and interruption handling settings in your framework. Get everything else working before you start trying to tweak turn detection and interruption handling.

Summary

I'll summarize all of the above recommendations by turning them into a list of don'ts.

Don't start with a speech-to-speech model, an open weights model, a "mini" or "light" model, a fine-tuned model, or anything other than GPT-4o or Gemini 2.5 Flash. You may very well be able to use a wide variety of models for your use case. But you won't know that until you have good, real-world usage data and basic evals.
Don't use MCP unless you have a specific reason you need MCP. Hard-code all the tools calls that you can.

Don't go down any rabbit holes trying to improve VAD, turn-taking logic, hacking together your own echo cancellation, etc. If you are building on a widely used voice tooling/platform (like Pipecat), people are shipping voice agents at scale to production using the defaults of that platform. Get everything working reasonably well before you start changing default settings.

A plain-text version of these notes is also available as a GitHub Gist

How Lemon Slice builds AI characters with Pipecat and Daily

Nina Kuruvilla — Fri, 13 Jun 2025 13:36:58 GMT

Lemon Slice is building the next generation of video foundation models focused on humans. Their platform allows anyone to create videos of expressive, talking characters, and has been used to generate over 1 million clips that range in style from photorealism to cartoons.

Lemon Slice envisions AI video not just as a creator tool, but as the future of interactive media and embodied AI.

“After becoming one of the top creator tools for talking head videos, we recognized that Generative AI is at an inflection point. Eventually a new type of media and entertainment will emerge that is centered around interactivity. Real-time AI video marks the beginning of that era.”
– Lina Colucci, CEO Lemon Slice

To bring this vision to life, Lemon Slice first had to make their existing video transformer model 10 times faster. Through extensive research and experimentation, they developed a zero-shot model that supports 25fps streaming. What sets Lemon Slice apart from other real-time video chat experiences is that their model doesn't rely on pre-existing videos of real people, 3D game engines, or human-driven motions. With just a single image of a character and audio input, it can generate streaming video of that character speaking in real-time.

Going from 0 to 1 with Daily and Pipecat

With their real-time video transformer ready, Lemon Slice faced a new challenge: how to productize it as a complete interactive experience. Building this would require them to incorporate their model into a broader framework that handles user input, AI voice processing (TTS, STT), and LLM inference. Of course, minimizing latency would be critical.

“It was a no-brainer to work with Daily from Day 1”, Colucci notes. “They were the only vendor that could support us across the full tech stack, from audio and video transport to building an entirely new type of AI application.”

Lemon Slice went from concept to functioning prototype in just a few days by leveraging Pipecat, the open-source framework developed by Daily for building multi-modal AI applications. The team seamlessly integrated their video model service into existing Pipecat workflows without ever worrying about low-level video transport details.

For their initial MVP and internal demos, Lemon Slice used Daily Prebuilt, Daily's turnkey interface for video conferencing. This let them quickly go end-to-end during early development and validate that real-time interactive video calls with their model were actually possible.

Before launching, they decided to migrate the video call UI to their own website for greater control over user experience and layout design. What they expected would take at least two days was completed in under three hours. "It came together really fast. Integrating Daily React into our Next.js application was dead simple," says Colucci.

What Made Daily and Pipecat the Perfect Fit

Built-in support for video call recording. Lemon Slice viewed recording as a core part of the user experience they wanted to build. “Talking to AI characters is fun and entertaining. A recording would allow our users to share the experience with others or replay later for themselves,” Colucci explains. Using Daily, they were able to add that functionality with just a single line of code.
Built-in support for chat (text messaging). Lemon Slice wanted to support both audio and text input since a user might prefer one over the other depending on context. Daily and Pipecat’s built-in support for messaging made this an easy extension of their original prototype.
Third-party integrations. Pipecat's extensive support for third-party providers for services like transcription and LLMs made it easy for Lemon Slice to experiment with different options. "This flexibility allowed us to evaluate various providers and identify the optimal combination of latency and performance for our unique needs," notes Colucci. "We could easily swap out components without rewriting our entire pipeline."

Daily as a Partner

Lemon Slice and Daily worked hand-in-hand throughout the development process.

"We were initially excited to work with Daily because of their expertise in WebRTC and video streaming. However, Daily's API and SDKs were so easy to use, and their video infrastructure so robust, that we rarely had questions on that front."

Where Lemon Slice found the most value was in Daily's expertise with Pipecat and orchestrating complex multi-modal AI applications. Daily's engineers were responsive to questions in the Pipecat developer forums, eager to help, and played a pivotal role in helping Lemon Slice launch on an aggressive timeline.

The collaboration enabled Lemon Slice to focus on what they do best: pushing the boundaries of AI video generation, while Daily and Pipecat handled the complex infrastructure and architectural challenges of delivering a unique real-time AI experience at scale. Today, Lemon Slice's interactive AI characters are live and engaging users in real-time conversations - try it yourself here: lemonslice.com