Building a Real-time Voice RAG Agent - by Avi Chawla

Hands-on implementation.

Mar 21, 2025

Typing to interact with AI applications can be a bit tedious and boring.

That is why real-time voice interaction is likely to become increasingly popular.

Today, let us show you how we built a real-time Voice RAG Agent, step-by-step.

Here’s an overview of what the app does:

Listens to real-time audio.
Transcribes it via AssemblyAI—a leading speech-to-text platform.
Uses your docs (via LlamaIndex) to craft an answer.
Speaks that answer back with Cartesia—a platform to generate seamless speech, power voice apps, and fine-tune your own voice models in near real-time.

The code is linked later in the article. We have also added a video at the top if you prefer to watch a walkthrough.

Now, let's jump into the code!

Set up environment and logging

This lets us load configuration from a .env file and log what the app is doing in real time.
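
The original code for this step is not reproduced in this text, so here is a minimal sketch of the idea using only the Python standard library. The helper names `load_env_file` and `setup_logging` and the logger name `voice-rag-agent` are assumptions made for illustration; in practice, the `python-dotenv` package's `load_dotenv()` is the usual way to read a .env file.

```python
import logging
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Copy KEY=VALUE pairs from a .env-style file into os.environ.

    Variables that are already set are left untouched. Returns only the
    pairs that were actually added.
    """
    loaded: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # drop surrounding quotes
        if key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded


def setup_logging(level: int = logging.INFO) -> logging.Logger:
    """Configure root logging once and return a named logger (hypothetical name)."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    return logging.getLogger("voice-rag-agent")
```

Calling `load_env_file()` followed by `setup_logging()` at startup makes API keys available through `os.environ` and gives every later step a shared logger.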

Set up RAG

This is where your documents get indexed for search and retrieval, powered by LlamaIndex.

The Agent's answers are grounded in this knowledge base.
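
To make "indexed for search and retrieval" concrete, the sketch below builds a toy in-memory index and retrieves the best-matching documents for a question. It is a deliberately simplified stand-in that uses bag-of-words cosine similarity, not LlamaIndex's API or the code from the original post; `ToyIndex`, `tokenize`, and `build_prompt` are hypothetical names. With LlamaIndex, the same role is typically played by `VectorStoreIndex.from_documents(...)` over embedded documents.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyIndex:
    """In-memory bag-of-words index (illustrative stand-in, not LlamaIndex)."""

    def __init__(self, documents: list[str]) -> None:
        self.documents = documents
        # "Indexing": precompute one term-count vector per document.
        self.vectors = [Counter(tokenize(doc)) for doc in documents]

    @staticmethod
    def _cosine(a: Counter, b: Counter) -> float:
        dot = sum(count * b[token] for token, count in a.items())
        norm_a = math.sqrt(sum(c * c for c in a.values()))
        norm_b = math.sqrt(sum(c * c for c in b.values()))
        norm = norm_a * norm_b
        return dot / norm if norm else 0.0

    def retrieve(self, query: str, top_k: int = 2) -> list[str]:
        """Return the top_k documents most similar to the query."""
        query_vec = Counter(tokenize(query))
        scored = sorted(
            zip(self.documents, self.vectors),
            key=lambda pair: self._cosine(query_vec, pair[1]),
            reverse=True,
        )
        return [doc for doc, _ in scored[:top_k]]


def build_prompt(index: ToyIndex, question: str) -> str:
    """Ground the question in retrieved context before it reaches the LLM."""
    context = "\n".join(index.retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The grounding happens in `build_prompt`: the model only sees the question together with the retrieved passages, which is what keeps answers tied to the knowledge base.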

Set up Voice Activity Detection

We also want Voice Activity Detection (VAD) for a smooth real-time experience—so we’ll “prewarm” the Silero VAD model.

This helps us detect when someone is actually speaking.
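
As a rough illustration of what a VAD decides, the sketch below flags speech using a simple energy threshold on 16-bit PCM frames. This is a much cruder technique than Silero VAD, which is a neural model, and it is not the code from the original post; `EnergyVAD`, `frame_rms`, and the default threshold are assumptions chosen for the example.

```python
import math
import struct


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a frame of 16-bit little-endian mono PCM."""
    sample_count = len(frame) // 2
    if sample_count == 0:
        return 0.0
    samples = struct.unpack(f"<{sample_count}h", frame[: sample_count * 2])
    return math.sqrt(sum(s * s for s in samples) / sample_count)


class EnergyVAD:
    """Toy energy-threshold voice activity detector (stand-in for a real model)."""

    def __init__(self, threshold: float = 500.0, hangover_frames: int = 3) -> None:
        self.threshold = threshold
        # Number of consecutive quiet frames tolerated before speech "ends",
        # so that short pauses between words do not cut the speaker off.
        self.hangover_frames = hangover_frames
        self._quiet_frames = 0
        self.speaking = False

    def process(self, frame: bytes) -> bool:
        """Feed one audio frame; return True while the speaker is considered active."""
        if frame_rms(frame) >= self.threshold:
            self.speaking = True
            self._quiet_frames = 0
        elif self.speaking:
            self._quiet_frames += 1
            if self._quiet_frames >= self.hangover_frames:
                self.speaking = False
        return self.speaking
```

The transition from `True` back to `False` is the signal a voice agent can use to decide that the user has finished a turn and that transcription of the utterance can be finalized.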

The VoicePipelineAgent and Entry Point

This is where we bring it all together. The Agent:

Listens to real-time audio.
Transcribes it using AssemblyAI.
Crafts an answer with your documents via LlamaIndex.
Speaks that answer back using Cartesia.
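
The four steps above form a straight pipeline: audio in, text, grounded answer, audio out. The sketch below captures only that shape, with each stage passed in as a plain callable. It is not the `VoicePipelineAgent` class itself; `VoicePipeline` and its field names are hypothetical, and real AssemblyAI, LlamaIndex, and Cartesia clients would be wrapped to fit these signatures.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Chains speech-to-text, retrieval-grounded answering, and text-to-speech."""

    transcribe: Callable[[bytes], str]   # speech-to-text stage
    answer: Callable[[str], str]         # RAG stage: question -> grounded reply
    synthesize: Callable[[str], bytes]   # text-to-speech stage

    def handle_utterance(self, audio: bytes) -> bytes:
        """Process one complete user utterance and return the spoken reply."""
        text = self.transcribe(audio)
        if not text.strip():
            # Nothing intelligible was said, so stay silent.
            return b""
        reply = self.answer(text)
        return self.synthesize(reply)
```

Keeping each stage behind a callable makes it straightforward to swap providers or to test the flow with stubs, for example `VoicePipeline(transcribe=lambda a: a.decode(), answer=str.upper, synthesize=str.encode)`.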

Run the app

Finally, we run the Agent, specifying the prewarm function and the main entry point.
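
The framework call used in the original post is not shown in this text, so the sketch below only illustrates the lifecycle described here: a prewarm function runs once to load expensive resources, and the entry point then starts with that shared state. `run_app`, `UserData`, and the placeholder model string are hypothetical and do not correspond to any specific library's API.

```python
import asyncio
from typing import Any, Callable, Coroutine

UserData = dict[str, Any]


def run_app(
    prewarm: Callable[[UserData], None],
    entrypoint: Callable[[UserData], Coroutine[Any, Any, Any]],
) -> Any:
    """Run prewarm once, then hand the shared state to the async entry point."""
    userdata: UserData = {}
    prewarm(userdata)
    return asyncio.run(entrypoint(userdata))


def prewarm(userdata: UserData) -> None:
    # Placeholder for an expensive one-time load, such as a VAD model.
    userdata["vad"] = "loaded-vad-model"


async def entrypoint(userdata: UserData) -> str:
    # A real entry point would connect to the audio stream and start the agent.
    return f"agent started with {userdata['vad']}"


if __name__ == "__main__":
    print(run_app(prewarm, entrypoint))
```

Separating the two functions means the slow model load is paid once at startup rather than at the beginning of each conversation.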

That’s it—your Real-Time Voice RAG Agent is ready to roll!

We added a video at the top if you want to see this in action!

A note about Cartesia Sonic 2.0

Cartesia recently launched industry-leading voice models featuring 40ms latency and best-in-class voice quality to build real-time voice AI Agents.

Try Cartesia Sonic 2.0

Get instant cloning with just 3 seconds of audio.
A voice changer for fine-grained control.
Audio infilling to generate personalized content at scale.

Join over 50k developers and build your voice product with Cartesia today →

The entire code is 100% open-source and available in this GitHub repo →

A big shoutout to Cartesia for giving us access to their platform and working with us on today’s demo.

Thanks for reading!
