TL;DR
In this article, you'll learn how to build a RAG based chatbot to chat with any PDF of your choice so you can achieve your lifelong dream of talking to PDFs 😏 In the end, I'll also show how you can test what you've built.
I know, I wrote something similar in my last article on building a customer support chatbot 😅 but this week we're going to dive deep into how to use the raw OpenAI API to chat with PDF data (including text trapped in visuals like tables) stored in ChromaDB, as well as how to use Streamlit to build the chatbot UI.
Introducing RAG, Vector Databases, and OCR
Before we dive into the code, let's debunk what we're going to implement 🕵️ To begin, OCR (Optical Character Recognition) is a technology within the field of computer vision that recognizes the characters present in the document and converts them into text - this is particularly helpful in the case of tables and charts in documents 😬 We'll be using OCR provided by Azure Cognitive Services in this tutorial.
Once text chunks are extracted using OCR, they are converted into a high-dimensional vector (aka. vectorized) using embedding models like Word2Vec, FastText, or BERT. These vectors, which encapsulate the semantic meaning of the text, are then indexed in a vector database. We'll be using ChromaDB as our in-memory vector database 🥳
Now, let's see what happens when a user asks their PDF something. First, the user query is first vectorized using the same embedding model used to vectorize the extracted PDF text chunks. Then, the top K most semantically similar text chunk is fetched by searching through the vector database, which remember, contains the text chunks from our PDF. The retrieved text chunks are then provided as context for ChatGPT to generate an answer based on information in their PDF. This is the process of retrieval, augmented, generation (RAG).
(Click here to learn how to evaluate RAG applications in CI/CD pipelines!)

Feeling educated? 😊 Let's begin.
Project Setup
First, I'm going to guide you through how to set up your project folders and any dependencies you need to install.
Create a project folder and a python virtual environment by running the following command:
mkdir chat-with-pdf
cd chat-with-pdf
python3 -m venv venv
source venv/bin/activateYour terminal should now start something like this:
(venv)Installing dependencies
Run the following command to install OpenAI API, ChromaDB, and Azure:
pip install openai chromadb azure-ai-formrecognizer streamlit tabulateLet's briefly go over what each of those package does:
streamlit - sets up the chat UI, which includes a PDF uploader (thank god 😌)
azure-ai-formrecognizer - extracts textual content from PDFs using OCR
chromadb - is an in-memory vector database that stores the extracted PDF content
openai - we all know what this does (receives relevant data from chromadb and returns a response based on your chatbot input)
Next, create a new main.py file - the entry point to your application
touch main.pyGetting your API keys
Lastly, get your OpenAI and Azure API key ready (click the hyperlink to get them if you don't already have one)
Note: It's pretty troublesome to sign up for an account on Azure Cognitive Services. You'll need a card (although they won't charge you automatically), and phone number 😔 but do give it a try if you're trying to build something serious!
Confident AI: The DeepEval LLM Evaluation Platform
The leading platform to evaluate and test LLM applications on the cloud, native to DeepEval.







