In this article, you'll learn how to build a RAG based chatbot to chat with any PDF of your choice so you can achieve your lifelong dream of talking to PDFs 😏 In the end, I'll also show how you can test what you've built.
I know, I wrote something similar in my last article on building a customer support chatbot 😅 but this week we're going to dive deep into how to use the raw OpenAI API to chat with PDF data (including text trapped in visuals like tables) stored in ChromaDB, as well as how to use Streamlit to build the chatbot UI.
Before we dive into the code, let's debunk what we're going to implement 🕵️ To begin, OCR (Optical Character Recognition) is a technology within the field of computer vision that recognizes the characters present in the document and converts them into text - this is particularly helpful in the case of tables and charts in documents 😬 We'll be using OCR provided by Azure Cognitive Services in this tutorial.
Once text chunks are extracted using OCR, they are converted into a high-dimensional vector (aka. vectorized) using embedding models like Word2Vec, FastText, or BERT. These vectors, which encapsulate the semantic meaning of the text, are then indexed in a vector database. We'll be using ChromaDB as our in-memory vector database 🥳
Now, let's see what happens when a user asks their PDF something. First, the user query is first vectorized using the same embedding model used to vectorize the extracted PDF text chunks. Then, the top K most semantically similar text chunk is fetched by searching through the vector database, which remember, contains the text chunks from our PDF. The retrieved text chunks are then provided as context for ChatGPT to generate an answer based on information in their PDF. This is the process of retrieval, augmented, generation (RAG).
Feeling educated? 😊 Let's begin.
First, I'm going to guide you through how to set up your project folders and any dependencies you need to install.
Create a project folder and a python virtual environment by running the following command:
Your terminal should now start something like this:
Run the following command to install OpenAI API, ChromaDB, and Azure:
Let's briefly go over what each of those package does:
Next, create a new main.py file - the entry point to your application
Lastly, get your OpenAI and Azure API key ready (click the hyperlink to get them if you don't already have one)
Note: It's pretty troublesome to sign up for an account on Azure Cognitive Services. You'll need a card (although they won't charge you automatically), and phone number 😔 but do give it a try if you're trying to build something serious!
Streamlit is an easy way to build frontend applications using python. Lets import streamlit along with setting up everything else we'll need:
Give our chat UI a title and create a file uploader:
Listen for a change event in `uploaded_file`. This will be triggered when you upload a file:
View your streamlit app by running `main.py` (we'll implement the chat input UI later):
That's the easy part done 🥳! Next comes the not so easy part...
Carrying on from the previous code snippet, we're going to send `temp_file` to Azure Cognitive Services for OCR:
Here, `dict_info` is a dictionary containing information on the extracted text chunks. It's a pretty complicated dictionary, so I would recommend printing it out and seeing for yourself what it looks like.
Paste in the following to finish processing the data received from Azure:
Here, we accessed various properties of the dictionary returned by Azure to get texts on the page, and data stored in tables. The logic is pretty complex because of all the nested structures 😨 but from personal experience, Azure OCR works well even for complex PDF structures, so I highly recommend giving it a try :)
Still with me? 😅 Great, we're almost there so hang in there!
Paste in the code below to store extracted text chunks from `res` in ChromaDB.
The first try block ensures that we can continue uploading PDFs without having to refresh the page.
You might have noticed that we add data into a collection and not to the database directly. A collection in ChromaDB is a vector space. When a user enters a query, it performs a search inside this collection, instead of the entire database. In Chroma, this collection is identified by a unique name, and with a simple line of code, you can add all extracted text chunks via to this collection via `collection.add(...)`.
I get asked a lot about how to build a RAG chatbot without relying on frameworks like langchain and lLamaIndex. Well here's how you do it - you construct a list of prompts dynamically based on the results retrieved from your vector database.
Paste in the following code to wrap things up:
Notice how we reversed `prompts` after constructing a list of prompts according to the list of retrieved text chunks from ChromaDB. This is because the results returned from ChromaDB is ordered in descending order, meaning the most relevant text chunk will always be the first in the results list. However, the way ChatGPT works is it considers the last prompt in a list of prompts more, hence why we have to reverse it.
Run the streamlit app and try things out for yourself 😙:
🎉 Congratulations, you made it to the end!
As you know, LLM applications are a black box and so for production use cases, you'll want to safeguard the performance of your PDF chatbot to keep your users happy. To learn how to build a simple evaluation framework that could get you setup in less than 30 minutes, click here.
In this article, you've learnt:
This tutorial walked you through an example of how you can build a "chat with PDF" application using just Azure OCR, OpenAI, and ChromaDB. With what you've learnt, you can build powerful applications that help increase the productivity of workforces (at least that's the most prominent use case I've came across).
The source code for this tutorial is available here:
Thank you for reading!
Subscribe to our weekly newsletter to stay confident in the AI systems you build.
In this article, I'll share how JudgmentalGPT, our in-house evaluator was built using OpenAI's Assistants.
In this interactive tutorial, I'll show you how to become a Midjournalist to create image you image.