Despite their magic, we’ve all brushed up against the limits of LLMs. You probably have September 2021 burned into your mind now, as I do, as the date history stopped, at least as far as ChatGPT goes. Similarly, LLMs don’t know about things they can’t know about: your company’s internal documentation, your private notes, and so on. There is a way around this, and it’s also the technique behind all of those “chat with your data” or “chat with your PDF” applications.
It’s called Retrieval Augmented Generation (RAG for short), and it’s conceptually quite simple and clever. Let’s dig into how it works.
Let’s work from a use case where we want the AI to answer questions about something it wouldn’t normally handle well:
Alright, first let’s take the bird’s-eye view: what does a RAG workflow look like?
The idea is that we first ingest documents into a database, and then retrieve the relevant ones when a user asks a question. Finally, we add the relevant information to the prompt, and get a much better answer.
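To make the shape of this pipeline concrete, here is a minimal sketch in Python. Everything in it is a stand-in: `embed()`, `call_llm()`, and the in-memory `vector_store` are toy placeholders rather than a real embedding model, LLM, or vector database. But the two phases, ingestion and retrieval plus generation, mirror the workflow described above.

```python
# A minimal, self-contained sketch of the two RAG phases.
# embed() and call_llm() are toy placeholders: a real system would call an
# embedding model and an LLM API here.

def chunk(text: str, size: int = 500) -> list[str]:
    """Naive fixed-size chunking; real strategies are covered below."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str) -> list[float]:
    """Placeholder embedding: letter frequencies (not a real embedding)."""
    return [text.lower().count(c) / max(len(text), 1) for c in "abcdefghijklmnopqrstuvwxyz"]

def similarity(a: list[float], b: list[float]) -> float:
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    return f"[LLM would answer here, given a prompt of {len(prompt)} chars]"

# Ingestion phase: chunk each document and store (embedding, chunk) pairs.
vector_store: list[tuple[list[float], str]] = []

def ingest(document: str) -> None:
    for piece in chunk(document):
        vector_store.append((embed(piece), piece))

# Retrieval + generation phase: find the most relevant chunks,
# add them to the prompt, and ask the model.
def answer(question: str, top_k: int = 3) -> str:
    query_vec = embed(question)
    ranked = sorted(vector_store, key=lambda pair: similarity(query_vec, pair[0]), reverse=True)
    context = "\n\n".join(text for _, text in ranked[:top_k])
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)

ingest("Our internal handbook says support tickets are triaged every morning at 9am.")
print(answer("When are support tickets triaged?"))
```

The two halves are deliberately independent: ingestion runs ahead of time over your documents, while retrieval and generation run on every user question.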
Let’s dive into the specifics of the ingestion phase first, and then we’ll look at the augmented retrieval.
The goal here is to get text data from a data source (this could be anything: PDFs, CSVs, HTML web pages, the Constitution, etc.) and prepare it for the later stages.
A first issue is that the data size is potentially unbounded: how big is a PDF or a Google Doc? Ingesting the whole of The Lord of the Rings as a single row would be too much for most databases to handle.
The solution to arbitrarily sized documents is chunking: we split each document into smaller parts that are bounded in size. There are many chunking strategies, and they depend heavily on the structure of the document being ingested: