Building a RAG App with Groq (Llama-3), a Hugging Face Open-Source Model, and Llama-Index: A Love-Hate Story
Let's build a Retrieval-Augmented Generation (RAG) application using Groq (Llama-3) and Llama-Index without any OpenAI API keys and with minimal compute. If this project can run on my laptop, it can run on anyone’s laptop because I only have 8 GB of RAM and an SSD—no graphics card, no fancy stuff. But before we dive into the code, let me share a little tale about my experience with Llama-Index.
Llama-Index is a gem. However, its documentation is... how do I put this nicely? Let's just say it could use a lot more hand-holding and a lot less cryptic messaging. I was frustrated because I was unable to build a RAG chat app with my data, my custom prompt, and chat history memory in Llama-Index. The documentation didn’t even hint that it was possible! I almost gave up, but then I decided to dive into the source code of Llama-Index to see how it was implemented. And bless the engineers who wrote that code—it's beautifully well-structured and commented. Finally, I understood how to implement exactly what I wanted and even found some great ways to customize it to fit my needs.
So, if you're ready, let's get our hands dirty and build some cool stuff!
Setting Up the Environment
First, we need to set up our environment. We use Flask to create a web application and several modules from Llama-Index for document processing and embedding.
pip install Flask requests llama-index==0.10.18 llama-index-llms-groq==0.1.3 llama-index-embeddings-huggingface==0.2.0
Installing a few necessary packages
from flask import Flask, request, render_template
import re

from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    ServiceContext,
    load_index_from_storage,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.groq import Groq
from llama_index.core.base.llms.types import ChatMessage, MessageRole

import warnings
warnings.filterwarnings('ignore')
We silence warnings here only to keep the console output clean while prototyping; it's entirely optional.
Initializing Flask App and API Key
Next, we initialize the Flask app and set up the API key for Groq.
GROQ_API_KEY = "<insert your API key here>"

app = Flask(__name__)
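If you'd rather not hard-code the key, a minimal alternative is to read it from an environment variable. This is just a sketch and assumes you export GROQ_API_KEY in your shell before running the app:

import os

# Read the Groq key from the environment instead of hard-coding it in the source;
# fall back to the placeholder so the app still starts without the variable set.
GROQ_API_KEY = os.environ.get("GROQ_API_KEY", "<insert your API key here>")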
Defining the RAG Chatbot Function
The core of our application is the chat_with_rag function. It takes a user message and the chat history, processes the documents, and generates a response using the RAG approach. In your root directory, create a text file named "paul_graham_essays.txt" and put some of his essays in it (a PDF works too, if you adjust the filename).

def chat_with_rag(message, history):
    # Read the source documents from disk
    reader = SimpleDirectoryReader(input_files=["./paul_graham_essays.txt"])
    documents = reader.load_data()

    # Split the documents into overlapping chunks
    text_splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=200)
    nodes = text_splitter.get_nodes_from_documents(documents, show_progress=True)

    # Local embedding model plus the Groq-hosted Llama-3 LLM
    embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
    llm = Groq(model="llama3-70b-8192", api_key=GROQ_API_KEY)
    prompt = "enter your system prompt here"

    service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm, system_prompt=prompt)

    # Build the vector index and persist it to disk
    vector_index = VectorStoreIndex.from_documents(
        documents, show_progress=True, service_context=service_context, node_parser=nodes
    )
    vector_index.storage_context.persist(persist_dir="./storage_mini")

    # Reload the index from disk and wrap it in a chat engine
    storage_context = StorageContext.from_defaults(persist_dir="./storage_mini")
    index = load_index_from_storage(storage_context, service_context=service_context)

    query_engine = index.as_chat_engine(service_context=service_context, chat_mode='condense_plus_context')
    resp = query_engine.chat(message, chat_history=history)
    return resp.response
Step-by-step Explanation:
- Document Reading: We use SimpleDirectoryReader to read the documents from a file.
- Text Splitting: SentenceSplitter breaks the documents into chunks for efficient processing.
- Embedding Model: We use HuggingFaceEmbedding with the lightweight sentence-transformers/all-MiniLM-L6-v2 model.
- Language Model: The Groq LLM is initialized with the llama3-70b-8192 model. (Why Groq? Because it runs on LPUs, which are insanely fast; well worth reading about.)
- Service Context: Combines the embedding model, language model, and a system prompt. (I found this in the source code; it's not mentioned in the docs.)
- Vector Index: Creates a vector index from the documents and persists it to disk (see the reload sketch after this list).
- Query Engine: Loads the index and sets up the query engine to handle chat interactions.
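One thing to note: chat_with_rag rebuilds the index from scratch on every call. Here is a minimal sketch of how you might build it once and simply reload it on later runs, reusing the same service_context and "./storage_mini" directory from the code above. Treat it as an illustration under those assumptions, not the exact pattern from the docs:

import os

def load_or_build_index(service_context, persist_dir="./storage_mini"):
    """Reload the vector index from disk if it exists; otherwise build and persist it."""
    if os.path.exists(persist_dir):
        # Reuse the index persisted on a previous run instead of re-embedding everything.
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        return load_index_from_storage(storage_context, service_context=service_context)
    # First run: read the essays, build the index, and save it for next time.
    documents = SimpleDirectoryReader(input_files=["./paul_graham_essays.txt"]).load_data()
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
    index.storage_context.persist(persist_dir=persist_dir)
    return index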
With query_engine, the two methods you will mostly use are chat(), which returns the full response at once, and stream_chat(), which streams it token by token. The one I have used here is chat().
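If you want tokens to appear as they are generated, a small sketch of the streaming variant (assuming the same query_engine and history as above) could look like this:

# Stream the answer token by token instead of waiting for the full response.
streaming_resp = query_engine.stream_chat(message, chat_history=history)
for token in streaming_resp.response_gen:
    print(token, end="", flush=True)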
Something else you can play with is this line in the chat_with_rag function:
query_engine = index.as_chat_engine(service_context=service_context, chat_mode='condense_plus_context')
I used chat_mode='condense_plus_context', but you can pick whichever mode suits your needs.
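For example, swapping in one of the other built-in modes such as 'context' or 'condense_question' only changes this one line; which mode works best depends on your data, so treat this as a sketch rather than a recommendation:

# 'context' retrieves chunks for every turn and places them in the system prompt;
# 'condense_question' first rewrites the user's message into a standalone question.
query_engine = index.as_chat_engine(service_context=service_context, chat_mode='context')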
Setting Up Flask Routes
We define the chat route in our Flask application:
@app.route('/chat', methods=['POST'])
def chat():
    data = request.json
    history_json = data.get('messages', [])

    # Prepend an optional system prompt to the incoming history
    system_prompt = {
        'role': 'system',
        'content': '<enter your system prompt here optional>'
    }
    history_json.insert(0, system_prompt)

    user_message = ''
    history = []
    if history_json:
        user_message = history_json[-1]['content']
        history = [convert_to_chat_message(msg['role'], msg['content']) for msg in history_json]

    bot_response = chat_with_rag(user_message, history=history)
    return {'response': bot_response}
Explanation:
- Chat Route: Handles POST requests with user messages, processes the chat history, and returns the bot's response. The messages field (read via data.get('messages', [])) is an array of JSON objects that should come from your client, or from another server if you have a microservices architecture. Each object should be of the form
{"role": "<user or assistant>", "content": "some sample text"}
Helper Function
The convert_to_chat_message function converts chat history entries from JSON format into ChatMessage objects. This step is essential if you want to maintain a chat history and actually chat with your RAG, and guess what… this step is not mentioned in the docs 😞

def convert_to_chat_message(role: str, content: str) -> ChatMessage:
    """Converts a role and content into a ChatMessage object."""
    # Convert role string to MessageRole enum
    role_enum = MessageRole(role)
    # Create and return a ChatMessage object
    return ChatMessage(role=role_enum, content=content)
Running the Application
Finally, we run the Flask app.
if __name__ == '__main__':
    app.run(debug=True)
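With the server running (Flask's default address is http://127.0.0.1:5000), a quick way to smoke-test the endpoint is a tiny client script. It uses the requests package from the pip install line above, and the question text is just a placeholder:

import requests

payload = {
    "messages": [
        {"role": "user", "content": "What does Paul Graham say about doing things that don't scale?"}
    ]
}

# POST the chat history to the local Flask server and print the bot's reply.
resp = requests.post("http://127.0.0.1:5000/chat", json=payload)
print(resp.json()["response"])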
Conclusion
By following these steps, you can build a RAG application using Groq and Llama-Index without relying on OpenAI API keys. This setup ensures minimal compute requirements while maintaining the efficiency and effectiveness of the RAG approach. Just remember, while Llama-Index might have a bit of a learning curve due to its documentation, it's definitely worth the effort. Happy coding!