Building a RAG App with Groq (Llama-3), a Hugging Face Open-Source Model, and Llama-Index: A Love-Hate Story
Let's build a Retrieval-Augmented Generation (RAG) application using Groq (Llama-3) and Llama-Index without any OpenAI API keys and with minimal compute. If this project can run on my laptop, it can run on anyone’s laptop because I only have 8 GB of RAM and an SSD—no graphics card, no fancy stuff. But before we dive into the code, let me share a little tale about my experience with Llama-Index.
Llama-Index is a gem. However, its documentation is... how do I put this nicely? Let's just say it could use a lot more hand-holding and a lot less cryptic messaging. I was frustrated because I was unable to build a RAG chat app with my data, my custom prompt, and chat history memory in Llama-Index. The documentation didn’t even hint that it was possible! I almost gave up, but then I decided to dive into the source code of Llama-Index to see how it was implemented. And bless the engineers who wrote that code—it's beautifully well-structured and commented. Finally, I understood how to implement exactly what I wanted and even found some great ways to customize it to fit my needs.
So, if you're ready, let's get our hands dirty and build some cool stuff!
Setting Up the Environment
First, we need to set up our environment. We use Flask to create a web application and several modules from Llama-Index for document processing and embedding.
pip install Flask requests llama-index==0.10.18 llama-index-llms-groq==0.1.3 llama-index-embeddings-huggingface==0.2.0
Installing a few necessary packages
from flask import Flask, request, render_template
import re

from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    ServiceContext,
    load_index_from_storage,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.groq import Groq
from llama_index.core.base.llms.types import ChatMessage, MessageRole

import warnings
warnings.filterwarnings('ignore')
We silence warnings here only to keep the console output clean while prototyping; it's entirely optional.
Initializing Flask App and API Key
Next, we initialize the Flask app and set up the API key for Groq.
GROQ_API_KEY = "<insert your API key here>"

app = Flask(__name__)
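If you'd rather not hard-code the key, a minimal alternative is to read it from an environment variable. This is just a sketch and assumes you export GROQ_API_KEY in your shell before running the app:

import os

# Read the Groq key from the environment instead of hard-coding it in the source;
# fall back to the placeholder so the app still starts without the variable set.
GROQ_API_KEY = os.environ.get("GROQ_API_KEY", "<insert your API key here>")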
Defining the RAG Chatbot Function
The core of our application is the chat_with_rag function. It takes a user message and the chat history, processes the documents, and generates a response using the RAG approach. In your root directory, create a text file named "paul_graham_essays.txt" and put some of his essays in it (a PDF works too, if you adjust the filename).

def chat_with_rag(message, history):
    # Read the source documents from disk
    reader = SimpleDirectoryReader(input_files=["./paul_graham_essays.txt"])
    documents = reader.load_data()

    # Split the documents into overlapping chunks
    text_splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=200)
    nodes = text_splitter.get_nodes_from_documents(documents, show_progress=True)

    # Local embedding model plus the Groq-hosted Llama-3 LLM
    embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
    llm = Groq(model="llama3-70b-8192", api_key=GROQ_API_KEY)
    prompt = "enter your system prompt here"

    service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm, system_prompt=prompt)

    # Build the vector index and persist it to disk
    vector_index = VectorStoreIndex.from_documents(
        documents, show_progress=True, service_context=service_context, node_parser=nodes
    )
    vector_index.storage_context.persist(persist_dir="./storage_mini")

    # Reload the index from disk and wrap it in a chat engine
    storage_context = StorageContext.from_defaults(persist_dir="./storage_mini")
    index = load_index_from_storage(storage_context, service_context=service_context)

    query_engine = index.as_chat_engine(service_context=service_context, chat_mode='condense_plus_context')
    resp = query_engine.chat(message, chat_history=history)
    return resp.response
Step-by-step Explanation:
- Document Reading: We use SimpleDirectoryReader to read the documents from a file.
- Text Splitting: SentenceSplitter breaks the documents into chunks for efficient processing.
- Embedding Model: We use HuggingFaceEmbedding with the lightweight sentence-transformers/all-MiniLM-L6-v2 model.
- Language Model: The Groq LLM is initialized with the llama3-70b-8192 model. (Why Groq? Because it runs on LPUs, which are insanely fast; well worth reading about.)
- Service Context: Combines the embedding model, language model, and a system prompt. (I found this in the source code; it's not mentioned in the docs.)
- Vector Index: Creates a vector index from the documents and persists it to disk (see the reload sketch after this list).
- Query Engine: Loads the index and sets up the query engine to handle chat interactions.
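One thing to note: chat_with_rag rebuilds the index from scratch on every call. Here is a minimal sketch of how you might build it once and simply reload it on later runs, reusing the same service_context and "./storage_mini" directory from the code above. Treat it as an illustration under those assumptions, not the exact pattern from the docs:

import os

def load_or_build_index(service_context, persist_dir="./storage_mini"):
    """Reload the vector index from disk if it exists; otherwise build and persist it."""
    if os.path.exists(persist_dir):
        # Reuse the index persisted on a previous run instead of re-embedding everything.
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        return load_index_from_storage(storage_context, service_context=service_context)
    # First run: read the essays, build the index, and save it for next time.
    documents = SimpleDirectoryReader(input_files=["./paul_graham_essays.txt"]).load_data()
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
    index.storage_context.persist(persist_dir=persist_dir)
    return index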
With query_engine, the two methods you will mostly use are chat(), which returns the full response at once, and stream_chat(), which streams it token by token. The one I have used here is chat().
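If you want tokens to appear as they are generated, a small sketch of the streaming variant (assuming the same query_engine and history as above) could look like this:

# Stream the answer token by token instead of waiting for the full response.
streaming_resp = query_engine.stream_chat(message, chat_history=history)
for token in streaming_resp.response_gen:
    print(token, end="", flush=True)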
Something else you can play with is this line in the chat_with_rag function:
query_engine = index.as_chat_engine(service_context=service_context, chat_mode='condense_plus_context')
I used chat_mode='condense_plus_context', but you can pick whichever mode suits your needs.
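For example, swapping in one of the other built-in modes such as 'context' or 'condense_question' only changes this one line; which mode works best depends on your data, so treat this as a sketch rather than a recommendation:

# 'context' retrieves chunks for every turn and places them in the system prompt;
# 'condense_question' first rewrites the user's message into a standalone question.
query_engine = index.as_chat_engine(service_context=service_context, chat_mode='context')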
Setting Up Flask Routes
We define the chat route in our Flask application:
@app.route('/chat', methods=['POST'])
def chat():
    data = request.json
    history_json = data.get('messages', [])

    # Prepend an optional system prompt to the incoming history
    system_prompt = {
        'role': 'system',
        'content': '<enter your system prompt here optional>'
    }
    history_json.insert(0, system_prompt)

    user_message = ''
    history = []
    if history_json:
        user_message = history_json[-1]['content']
        history = [convert_to_chat_message(msg['role'], msg['content']) for msg in history_json]

    bot_response = chat_with_rag(user_message, history=history)
    return {'response': bot_response}
Explanation:
- Chat Route: Handles POST requests with user messages, processes the chat history, and returns the bot's response. The messages field (read via data.get('messages', [])) is an array of JSON objects that should come from your client, or from another server if you have a microservices architecture. Each object should be of the form
{"role": "<user or assistant>", "content": "some sample text"}
Helper Function
The convert_to_chat_message function converts chat history entries from JSON format into ChatMessage objects. This step is essential if you want to maintain a chat history and actually chat with your RAG, and guess what… this step is not mentioned in the docs 😞

def convert_to_chat_message(role: str, content: str) -> ChatMessage:
    """Converts a role and content into a ChatMessage object."""
    # Convert role string to MessageRole enum
    role_enum = MessageRole(role)
    # Create and return a ChatMessage object
    return ChatMessage(role=role_enum, content=content)
Running the Application
Finally, we run the Flask app.
if __name__ == '__main__':
    app.run(debug=True)
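With the server running (Flask's default address is http://127.0.0.1:5000), a quick way to smoke-test the endpoint is a tiny client script. It uses the requests package from the pip install line above, and the question text is just a placeholder:

import requests

payload = {
    "messages": [
        {"role": "user", "content": "What does Paul Graham say about doing things that don't scale?"}
    ]
}

# POST the chat history to the local Flask server and print the bot's reply.
resp = requests.post("http://127.0.0.1:5000/chat", json=payload)
print(resp.json()["response"])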
Conclusion
By following these steps, you can build a RAG application using Groq and Llama-Index without relying on OpenAI API keys. This setup ensures minimal compute requirements while maintaining the efficiency and effectiveness of the RAG approach. Just remember, while Llama-Index might have a bit of a learning curve due to its documentation, it's definitely worth the effort. Happy coding!