Large language models (LLMs) like GPT-4 are powerful and versatile generators of natural language, but they are also limited by the data they were trained on. To get around this problem, there's a lot of recent talk about leveraging RAG-based systems, but what is RAG, what can it be used for, and why should you care?
In this article, I'm going to cover what RAG is and how to implement a RAG-based LLM application (yes, with a complete code sample).
P.S. Click here for a great read on how to unit test RAG applications in CI/CD pipelines.
What is RAG?
Retrieval-augmented generation (RAG) is a technique in NLP that allows LLMs like ChatGPT to generate customized outputs that are outside the scope of the data they were trained on. An LLM application without RAG is akin to asking ChatGPT to summarize an email without providing the actual email as context.
A RAG system consists of two primary components: the retriever and the generator.
The retriever is responsible for searching through the knowledge base for the pieces of information most relevant to the given input, which are referred to as retrieval results. The generator, on the other hand, uses these retrieval results to craft a series of prompts based on a predefined prompt template, producing a coherent and relevant response to the input. A great RAG system is the product of a great retriever and a great generator, which is why most LLM evaluation metrics nowadays focus on evaluating either the retriever or the generator.
Here’s a diagram of a RAG architecture.

In most cases, your “knowledge base” consists of vector embeddings stored in a vector database like ChromaDB, and your “retriever” will 1) embed the given input at runtime, 2) search through the vector space containing your data to find the top K most relevant retrieval results, and 3) rank the results based on relevancy (or distance to your vectorized input embedding). These results are then interpolated into a series of prompts and passed on to your “generator”, which is your LLM of choice (GPT-4, Llama 2, etc.).
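To make this concrete, here's a minimal sketch of the retrieve-then-generate loop using ChromaDB and the OpenAI SDK. It assumes `chromadb` and `openai` are installed and `OPENAI_API_KEY` is set in your environment; the collection name, sample documents, and prompt template are illustrative, not a prescribed setup:

```python
# A minimal retrieve-then-generate loop (sketch).
# Assumes `pip install chromadb openai` and an OPENAI_API_KEY in your environment.
import chromadb
from openai import OpenAI

chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="knowledge_base")  # illustrative name

# 1) Index your documents — Chroma embeds them with its default embedding model
collection.add(
    documents=["Our refund window is 30 days.", "Support is available 24/7 via chat."],
    ids=["doc1", "doc2"],
)

def answer(query: str) -> str:
    # 2) Retrieve: embed the query and fetch the top K most relevant chunks,
    #    ranked by distance to the query embedding
    results = collection.query(query_texts=[query], n_results=2)
    context = "\n".join(results["documents"][0])

    # 3) Generate: interpolate the retrieval results into a prompt template
    prompt = f"Answer the question using only this context:\n{context}\n\nQuestion: {query}"
    llm = OpenAI()
    response = llm.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(answer("How long do I have to request a refund?"))
```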

For the more curious, here are the models a retriever commonly employs to extract the most pertinent retrieval results:
Neural Network Embeddings (e.g., OpenAI's or Cohere's embedding models): rank documents based on their proximity in a multidimensional vector space, enabling an understanding of textual relationships and relevance between an input and the document corpus.
Best Match 25 (BM25): a probabilistic retrieval model that enhances text retrieval precision. By combining term frequencies with inverse document frequencies, it accounts for term significance, ensuring that both common and rare terms influence the relevance ranking (see the sketch after this list).
TF-IDF (Term Frequency—Inverse Document Frequency): calculates the significance of a term within a document relative to the broader corpus. By weighing a term's frequency in a document against its rarity across the corpus, it produces a relevance ranking that rewards distinctive terms.
Hybrid Search: optimizes the relevance of search results by assigning distinct weights to different methodologies, such as Neural Network Embeddings, BM25, and TF-IDF, and combining their scores.
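To ground the BM25 bullet above, here's a minimal, self-contained scoring function (the tokenized inputs and the default k1/b parameters are illustrative; in practice you'd likely reach for a library such as rank_bm25):

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query using BM25."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N  # average document length
    # Document frequency: how many documents contain each query term
    df = {t: sum(1 for d in docs_tokens if t in d) for t in set(query_tokens)}
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for t in query_tokens:
            if tf[t] == 0:
                continue
            # Rare terms get higher IDF, so they influence the ranking more
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # Term frequency saturates (k1) and is normalized by document length (b)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

# Toy usage: the first document mentions both query terms, so it should score higher
docs = [d.lower().split() for d in ["the cat sat on the mat", "dogs chase cats"]]
print(bm25_scores("cat mat".split(), docs))
```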
Applications
RAG has various applications across different fields thanks to its ability to combine retrieval and generation for enhanced responses. Having worked with numerous companies building LLM applications at Confident AI, here are the top four use cases I've seen:
Customer support / user onboarding chatbots: retrieve data from internal documents to generate more personalized responses. You can find a tutorial on how to do it here.
Data extraction: interestingly, RAG can also be used to extract relevant data from documents such as PDFs. You can find a tutorial on how to do it here.
Sales enablement: retrieve data from LinkedIn profiles and email threads to generate more personalized outreach messages.
Content creation and enhancement: retrieve data from past message conversations to generate suggested message replies.
In the following section, we'll be building a generalized QA chatbot, and you'll be able to customize its functionality for any of the use cases listed above by tweaking the prompts and the data stored in your vector database.