Large language models like ChatGPT are powerful and versatile generators of natural language, but they’re also extremely limited by the data they were trained on. To get around this problem, there’s a lot of recent talk about leveraging RAG-based systems. But what is RAG, what can it be used for, and why should you care?
In this article, I’m going to talk about what RAG is and how to implement a RAG-based LLM application (yes, with a complete code sample).
Let’s dive right in.
Retrieval augmented generation (RAG) is a technique in NLP that allows LLMs like ChatGPT to generate customized outputs beyond the scope of the data they were trained on. An LLM application without RAG is akin to asking ChatGPT to summarize an email without providing the actual email as context.
A RAG system consists of two primary components: the retriever and the generator.
The retriever is responsible for searching through the knowledge base for the pieces of information most relevant to the given input, which are referred to as retrieval results. The generator, in turn, uses these retrieval results to craft a series of prompts based on a predefined prompt template and produce a coherent, relevant response to the input.
Here’s a diagram of a RAG architecture.
In most cases, your “knowledge base” consists of vector embeddings stored in a vector database like ChromaDB, and your “retriever” will 1) embed the given input at runtime, 2) search through the vector space containing your data to find the top K most relevant retrieval results, and 3) rank those results by relevancy (i.e., distance to your vectorized input). These results are then processed into a series of prompts and passed to your “generator”, which is your LLM of choice (GPT-4, Llama 2, etc.).
For more curious users, here are the models a retriever commonly employs to extract the most pertinent retrieval results:
RAG has various applications across different fields thanks to its ability to combine retrieval with text generation for enhanced responses. Having worked with numerous companies building LLM applications at Confident, here are the top four use cases I’ve seen:
In the following section we’ll be building a generalized QA chatbot, and you’ll be able to customize its functionality for any of the use cases listed above by tweaking the prompts and the data stored in your vector database.
For this project, we’re going to build a question-answering (QA) chatbot based on your knowledge base. We’re not going to cover the part on how to index your knowledge base, as that’s a discussion for another day.
We’re going to use Python, ChromaDB for our vector database, and OpenAI for both vector embeddings and chat completions, and we’ll build the chatbot on top of your favorite Wikipedia page.
First, set up a new project directory and install the dependencies we need.
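Something along these lines should do the trick (the project folder name is just an example; `chromadb` and `openai` are the only packages we need):

```bash
mkdir rag-qa-chatbot && cd rag-qa-chatbot

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate

# Install the vector database and the OpenAI client
pip install chromadb openai
```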
Your terminal prompt should now start with something like `(venv)`, indicating that the virtual environment is active.
Next, create a new `main.py` file — the entry point to your LLM application.
Lastly, go ahead and get your OpenAI API key here if you don’t already have one, and set it as an environment variable:
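On macOS or Linux, for example:

```bash
export OPENAI_API_KEY="sk-..."
```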
You’re good to go! Let’s start coding.
Begin by creating a Retriever class that will retrieve the most relevant data from ChromaDB for a given user question.
Open main.py and paste in the following code:
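Here’s a minimal sketch of what the retriever could look like (the `Retriever` class name, its `retrieve` method, the `top_k` parameter, and the collection name are illustrative; adjust them to match however you indexed your knowledge base):

```python
import os

import chromadb
from chromadb.utils import embedding_functions

# Embedding function ChromaDB uses under the hood to vectorize inputs
# with OpenAI's text-embedding-ada-002 model.
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-ada-002",
)


class Retriever:
    def __init__(self, collection_name: str = "knowledge-base", top_k: int = 3):
        # Connect to ChromaDB and load the collection that holds your
        # already-indexed knowledge base.
        self.client = chromadb.PersistentClient(path="./chroma")
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            embedding_function=openai_ef,
        )
        self.top_k = top_k

    def retrieve(self, query: str) -> list[str]:
        # Embed the query and run a vector similarity search, returning the
        # top K most relevant documents ranked by distance.
        results = self.collection.query(
            query_texts=[query],
            n_results=self.top_k,
        )
        return results["documents"][0]
```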
Here, `openai_ef` is the embedding function used under the hood by ChromaDB to vectorize an input. When a user sends a question to your chatbot, a vector embedding will be created from this question using OpenAI’s `text-embedding-ada-002` model. This vector embedding will then be used by ChromaDB to perform a vector similarity search in the `collection` vector space, which contains data from your knowledge base (remember, we’re assuming you’ve already indexed data for this tutorial). This process allows you to search for the top K most relevant retrieval results for any given input.
Now that you’ve created your retriever, paste in the following code to create a generator:
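Again, treat this as a sketch rather than the definitive implementation: the prompt template is a placeholder, and the snippet assumes the `openai` v1 Python client (which picks up `OPENAI_API_KEY` from the environment):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


class Generator:
    def __init__(self, model: str = "gpt-4"):
        self.model = model
        # Predefined prompt template that injects retrieval results as context.
        self.prompt_template = (
            "Answer the user's question using only the following context:\n\n{context}"
        )

    def generate_response(self, question: str, retrieval_results: list[str]) -> str:
        # Construct the series of prompts: a system prompt built from the
        # template plus the retrieved context, followed by the user's question.
        context = "\n\n".join(retrieval_results)
        messages = [
            {"role": "system", "content": self.prompt_template.format(context=context)},
            {"role": "user", "content": question},
        ]

        # Send the prompts to OpenAI's chat completion endpoint.
        response = client.chat.completions.create(
            model=self.model,
            messages=messages,
        )
        return response.choices[0].message.content
```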
Here, we constructed a series of prompts in the `generate_response` method based on a list of `retrieval_results` that will be provided by the retriever we built earlier. We then send this series of prompts to OpenAI to generate an answer. Using RAG, your QA chatbot can now produce more customized outputs by enhancing the generation with retrieval results!
To wrap things up, let’s put everything together:
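Assuming the `Retriever` and `Generator` sketched above, the final step could look like this:

```python
if __name__ == "__main__":
    retriever = Retriever()
    generator = Generator()

    while True:
        question = input("Ask a question (or type 'exit' to quit): ")
        if question.strip().lower() == "exit":
            break

        # 1. Retrieve the most relevant chunks from your knowledge base.
        retrieval_results = retriever.retrieve(question)

        # 2. Generate an answer grounded in those retrieval results.
        answer = generator.generate_response(question, retrieval_results)
        print(answer)
```

Run it with `python main.py` and start asking questions about your knowledge base.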
That’s all folks! You just built your very first RAG-based chatbot.
In this article, you’ve learnt what RAG is, some use cases for RAG, and how to build your own RAG-based LLM application. However, you might have noticed that building your own RAG application is pretty complicated, and indexing your data is often a non-trivial task. Luckily, there are existing open-source frameworks like LangChain and LlamaIndex that allow you to implement what we’ve demonstrated in a much simpler way.
If you like the article, don’t forget to give us a star on GitHub ❤️: https://github.com/confident-ai/deepeval
You can also find the full code example here: https://github.com/confident-ai/blog-examples/tree/main/rag-llm-app
Till next time!
Subscribe to our weekly newsletter to stay confident in the AI systems you build.