
How I Built a Talking, Knowledgeable AI Sidekick (and How You Can Build Your Own Voice AI RAG Agent)

Cover: Cogniwerk-image.png
Slug: voice-ai-rag-agent
Published: Jun 30, 2025
Category: AI, Gen AI, Hugging Face, llama-index, OpenAI, Python, voice-ai, RAG, agents

A Story of Code and a Chatty Voice AI Agent That Actually Knows Stuff from Your Docs

Chapter 1: The Dream

It all started on a rainy afternoon. I was talking to my computer (as one does when working remotely), and realized:
Wouldn’t it be cool if my computer could actually listen, understand, and answer me with real knowledge from my own files?
Not just “Hey Siri, what’s the weather?” but “Hey AI, what’s in my project docs?” or “Remind me what the HR policy says about bringing cats to work?”
And so, the quest began:
I would build a Voice AI RAG agent!
(That’s Retrieval-Augmented Generation, but let’s just call it “RAG” because it sounds like a pirate.)

Chapter 2: The Ingredients

Before you can summon your own digital sidekick, you’ll need a few magical artifacts:
  • Python 3.11+ (the spellbook)
  • Cartesia (for making your AI talk like a human, not a fax machine)
  • AssemblyAI (so your AI can understand your voice, even if you mumble)
  • Anthropic Claude (the brain—OpenAI is cool, but Claude is the new wizard in town)
  • LiveKit (for real-time voice rooms, so your AI can join you in a virtual “room”)
  • A pile of your own documents (so your AI knows your world)
  • API keys (the secret runes—don’t lose them!)
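
Before we start casting spells, here is one way to lay out the project. The docs/ folder and .env file names match what the code in Chapter 3 expects; chat-engine-storage/ is created automatically on the first run, and the script name is simply whatever you save the code as:

voice-agent/
├── docs/                     # your documents (the AI's knowledge base)
├── chat-engine-storage/      # persisted index, created on first run
├── .env                      # API keys (see Chapter 5)
├── requirements.txt          # dependencies (see Chapter 5)
└── voice_agent_anthropic.py  # the spell from Chapter 3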

Chapter 3: The Spell (a.k.a. The Code)

Here’s the full incantation. Don’t worry, I’ll explain every part after you read it.
(Copy, paste, and prepare to be amazed!)
import logging
import os

from dotenv import load_dotenv
from livekit.agents import JobContext, JobProcess, WorkerOptions, cli
from livekit.agents.job import AutoSubscribe
from livekit.agents.llm import ChatContext
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import cartesia, silero, llama_index, assemblyai

from llama_index.llms.anthropic import Anthropic
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
    Settings,
)
from llama_index.core.chat_engine.types import ChatMode
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

load_dotenv()
logger = logging.getLogger("voice-assistant")

# Set up the embedding model and LLM
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
llm = Anthropic(model="claude-3-haiku-20240307", max_tokens=512)
Settings.llm = llm
Settings.embed_model = embed_model

# Check if storage already exists
PERSIST_DIR = "./chat-engine-storage"
if not os.path.exists(PERSIST_DIR):
    # Load the documents and create the index
    documents = SimpleDirectoryReader("docs").load_data()
    index = VectorStoreIndex.from_documents(documents)
    # Store it for later
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # Load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)


def prewarm(proc: JobProcess):
    # Load the Silero voice-activity-detection model once per worker process
    proc.userdata["vad"] = silero.VAD.load()


async def entrypoint(ctx: JobContext):
    chat_context = ChatContext().append(
        role="system",
        text=(
            "You are a funny, witty assistant. "
            "Respond with short and concise answers. "
            "Avoid using unpronounceable punctuation or emojis."
        ),
    )
    chat_engine = index.as_chat_engine(chat_mode=ChatMode.CONTEXT)

    logger.info(f"Connecting to room {ctx.room.name}")
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    participant = await ctx.wait_for_participant()
    logger.info(f"Starting voice assistant for participant {participant.identity}")

    agent = VoicePipelineAgent(
        vad=ctx.proc.userdata["vad"],
        stt=assemblyai.STT(),
        llm=llama_index.LLM(chat_engine=chat_engine),
        tts=cartesia.TTS(
            model="sonic-2",
            voice="bf0a246a-8642-498a-9950-80c35e9276b5",
        ),
        chat_ctx=chat_context,
    )
    agent.start(ctx.room, participant)

    await agent.say(
        "Hey there! How can I help you today?",
        allow_interruptions=True,
    )


if __name__ == "__main__":
    print("Starting voice agent with Anthropic...")
    cli.run_app(
        WorkerOptions(
            entrypoint_fnc=entrypoint,
            prewarm_fnc=prewarm,
        ),
    )

Chapter 4: The Magic Explained

Let’s break down this spellbook, piece by piece:
1. Imports and Setup
We import all the libraries:
  • livekit for voice rooms
  • cartesia for text-to-speech
  • assemblyai for speech-to-text
  • llama_index for RAG (so your AI can actually know things from your docs)
  • Anthropic for the LLM (the brain)
We also load environment variables with dotenv—because hardcoding API keys is a rookie mistake.
2. Embeddings and LLM
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
llm = Anthropic(model="claude-3-haiku-20240307", max_tokens=512)
Settings.llm = llm
Settings.embed_model = embed_model
  • The embedding model turns your docs into “AI food” (vectors).
  • The LLM (Claude) is the brain that answers questions using those vectors.
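Curious what “AI food” looks like? Here is a quick sanity check you can run on its own (the sentence is made up; bge-small-en-v1.5 produces 384-dimensional vectors):
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Turn one sentence into a vector and inspect its size
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
vector = embed_model.get_text_embedding("Can I bring my cat to work?")
print(len(vector))  # 384 for bge-small-en-v1.5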
3. Document Indexing
PERSIST_DIR = "./chat-engine-storage"
if not os.path.exists(PERSIST_DIR):
    documents = SimpleDirectoryReader("docs").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
  • If you haven’t indexed your docs before, it reads everything in docs/ and builds a knowledge base.
  • If you have, it loads the existing index (so it doesn’t have to re-read your 500-page PDF every time).
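Before adding voice on top, you can sanity-check the index in plain text. A minimal sketch (the question is hypothetical; ask anything your docs actually cover):
# Text-only sanity check of the index, no voice pipeline involved
query_engine = index.as_query_engine()
response = query_engine.query("What does the HR policy say about cats?")
print(response)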
4. Voice Activity Detection (VAD)
def prewarm(proc: JobProcess):
    proc.userdata["vad"] = silero.VAD.load()
  • This makes sure your AI only listens when you’re actually talking, not when you’re yelling at your cat.
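Silero’s defaults are sensible, but the loader takes tuning knobs if your AI keeps interrupting you. A small sketch; the parameter names below are assumptions based on livekit-plugins-silero, so check the signature of VAD.load in your installed version:
from livekit.plugins import silero

# Assumed parameters; verify against your livekit-plugins-silero version
vad = silero.VAD.load(
    min_speech_duration=0.05,   # ignore blips shorter than ~50 ms
    min_silence_duration=0.55,  # wait ~550 ms of silence before ending a turn
)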
5. The Entrypoint: Where the Magic Happens
async def entrypoint(ctx: JobContext):
    chat_context = ChatContext().append(
        role="system",
        text=(
            "You are a funny, witty assistant. "
            "Respond with short and concise answers. "
            "Avoid using unpronounceable punctuation or emojis."
        ),
    )
    chat_engine = index.as_chat_engine(chat_mode=ChatMode.CONTEXT)
    ...
  • Sets the “personality” of your AI (witty, concise, no weird punctuation).
  • Prepares the chat engine with your indexed docs.
logger.info(f"Connecting to room {ctx.room.name}")
await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

participant = await ctx.wait_for_participant()
logger.info(f"Starting voice assistant for participant {participant.identity}")
  • Connects to a LiveKit room (more on this soon).
  • Waits for a participant (that’s you!) to join.
agent = VoicePipelineAgent(
    vad=ctx.proc.userdata["vad"],
    stt=assemblyai.STT(),
    llm=llama_index.LLM(chat_engine=chat_engine),
    tts=cartesia.TTS(
        model="sonic-2",
        voice="bf0a246a-8642-498a-9950-80c35e9276b5",
    ),
    chat_ctx=chat_context,
)
agent.start(ctx.room, participant)

await agent.say(
    "Hey there! How can I help you today?",
    allow_interruptions=True,
)
  • Sets up the full voice pipeline: listens, understands, thinks, and talks back.
  • Greets you with a friendly message.
6. The Main Event
if __name__ == "__main__":
    print("Starting voice agent with Anthropic...")
    cli.run_app(
        WorkerOptions(
            entrypoint_fnc=entrypoint,
            prewarm_fnc=prewarm,
        ),
    )
  • Registers the prewarm and entrypoint functions with the LiveKit CLI and starts the worker, which then waits for you to join a room.

Chapter 5: Summoning Your AI (a.k.a. Running the Code)

  1. Install your dependencies (see requirements.txt; a sketch of one follows these steps).
  2. Put your API keys in a .env file:
ANTHROPIC_API_KEY=your_anthropic_key
ASSEMBLYAI_API_KEY=your_assemblyai_key
CARTESIA_API_KEY=your_cartesia_api_key
LIVEKIT_URL=your_livekit_url
LIVEKIT_API_KEY=your_livekit_api_key
LIVEKIT_API_SECRET=your_livekit_api_secret
  3. Add your documents to the docs/ folder.
  4. Run:
python voice_agent_anthropic.py start
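
For reference, here is a sketch of what requirements.txt might contain. The package names below follow the usual LiveKit plugin and LlamaIndex integration naming, but treat them as assumptions and pin the versions that work for you:
python-dotenv
livekit-agents
livekit-plugins-silero
livekit-plugins-cartesia
livekit-plugins-assemblyai
livekit-plugins-llama-index
llama-index
llama-index-llms-anthropic
llama-index-embeddings-huggingface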

Chapter 6: Entering the LiveKit Room

  • What’s a LiveKit room?
Think of it as a virtual meeting room where your AI is always waiting for you.
  • How do you join?
Use the LiveKit Playground or your own LiveKit client, enter your room name, and your AI will greet you like an old friend (who actually remembers your last conversation).
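
If you’d rather join from your own client than the Playground, you need an access token for the room. Here is a minimal sketch using the livekit-api server SDK (the room and identity names are made up; adapt them to your setup):
import os

from livekit import api  # from the livekit-api package

# Mint a join token signed with your LiveKit API key/secret
token = (
    api.AccessToken(os.environ["LIVEKIT_API_KEY"], os.environ["LIVEKIT_API_SECRET"])
    .with_identity("human-friend")
    .with_grants(api.VideoGrants(room_join=True, room="my-ai-room"))
    .to_jwt()
)
print(token)  # hand this to your LiveKit client to join the room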

Chapter 7: The Result

Now, you can:
  • Talk to your AI: Ask questions, get answers from your own docs.
  • Get witty, concise responses: No more boring bots!
  • Impress your friends: “Yeah, my Voice AI actually knows what’s in my files.”
If you have any questions, feel free to contact me 🙂

Happy coding! 🚀
