Nitin Borwankar
Contributing Writer

Fully local retrieval-augmented generation, step by step

how-to
Apr 10, 2024 | 7 mins
Artificial Intelligence | Generative AI | Software Development

How to implement a local RAG system using LangChain, SQLite-vss, Ollama, and Meta’s Llama 2 large language model.


In “Retrieval-augmented generation, step by step,” we walked through a very simple RAG example. Our little application augmented a large language model (LLM) with our own documents, enabling the language model to answer questions about our own content. That example used an embedding model from OpenAI, which meant we had to send our content to OpenAI’s servers—a potential data privacy violation, depending on the application. We also used OpenAI’s public LLM.

This time we will build a fully local version of a retrieval-augmented generation system, using a local embedding model and a local LLM. As in the previous article, we’ll use LangChain to stitch together the various components of our application. Instead of FAISS (Facebook AI Similarity Search), we’ll use SQLite-vss to store our vector data. SQLite-vss is our familiar friend SQLite with an extension that makes it capable of similarity search.

Remember that similarity search for text does a best match on meaning (or semantics) using embeddings, which are numerical representations of words or phrases in a vector space. The shorter the distance between two embeddings in the vector space, the closer in meaning the two words or phrases are. Therefore, to make our own documents searchable by meaning, we first need to convert them to embeddings; the chunks that the search retrieves are what we ultimately hand to the LLM as context.
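Here is a minimal sketch of that idea, using the sentence-transformers package and the same all-MiniLM-L6-v2 embedding model we will use below (the example sentences are my own):

# A quick illustration of semantic similarity with embeddings
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The cat sat on the mat.",
    "A kitten is resting on the rug.",
    "Quarterly earnings exceeded expectations.",
]
embeddings = model.encode(sentences)

# Cosine similarity: a higher score means the two sentences are closer in meaning
print(util.cos_sim(embeddings[0], embeddings[1]))  # semantically close, higher score
print(util.cos_sim(embeddings[0], embeddings[2]))  # unrelated, lower score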

We save the embeddings in the local vector store and then integrate that vector store with our LLM. Our LLM will be Llama 2, which we’ll run locally using an app called Ollama. Ollama is available for macOS, Linux, and Windows (the latter in preview). You can read about installing Ollama in this InfoWorld article.

Here is the list of components we will need to build a simple, fully local RAG system:

  1. A document corpus. Here we will use just one document, the text of President Biden’s February 7, 2023, State of the Union Address. You can download this text at the link below.
  2. A loader for the document. This code will extract text from the document and pre-process it into chunks for generating an embedding.
  3. An embedding model. This model takes the pre-processed document chunks as input, and outputs an embedding (i.e. a set of vectors that represent the document chunks).
  4. A local vector data store with an index for searching.
  5. An LLM tuned for following instructions and running on your own machine. This machine could be a desktop, a laptop, or a VM in the cloud. In my example it is a Llama 2 model running on Ollama on my Mac.
  6. A chat template for asking questions. This template creates a framework for the LLM to respond in a format that human beings will understand.

Now for the code, with some more explanation in the comments.

Download: Text file of President Biden’s February 7, 2023, State of the Union Address

Fully local RAG example—retrieval code

# LocalRAG.py
# LangChain is a framework and toolkit for interacting with LLMs programmatically

from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import SQLiteVSS
from langchain.document_loaders import TextLoader

# Load the document using a LangChain text loader
loader = TextLoader("./sotu2023.txt")
documents = loader.load()

# Split the document into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
texts = [doc.page_content for doc in docs]

# Use the sentence transformer package with the all-MiniLM-L6-v2 embedding model
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Load the text embeddings in SQLiteVSS in a table named state_union
db = SQLiteVSS.from_texts(
    texts = texts,
    embedding = embedding_function,
    table = "state_union",
    db_file = "/tmp/vss.db"
)

# First, we will do a simple retrieval using similarity search
# Query
question = "What did the president say about Nancy Pelosi?"
data = db.similarity_search(question)

# print results
print(data[0].page_content)

Fully local RAG example—retrieval output

Mr. Speaker. Madam Vice President. Our First Lady and Second Gentleman.

Members of Congress and the Cabinet. Leaders of our military.

Mr. Chief Justice, Associate Justices, and retired Justices of the Supreme Court.

And you, my fellow Americans.

I start tonight by congratulating the members of the 118th Congress and the new Speaker of the House, Kevin McCarthy.

Mr. Speaker, I look forward to working together.

I also want to congratulate the new leader of the House Democrats and the first Black House Minority Leader in history, Hakeem Jeffries.

Congratulations to the longest serving Senate Leader in history, Mitch McConnell.

And congratulations to Chuck Schumer for another term as Senate Majority Leader, this time with an even bigger majority.

And I want to give special recognition to someone who I think will be considered the greatest Speaker in the history of this country, Nancy Pelosi.

Note that the result is a literal chunk of text from the document that is relevant to the query. It is what the similarity search of the vector database returns, but it is not a direct answer to the question. The last line of the chunk contains the answer; the rest is the context for it.

Note that chunks of your documents are just what you will get if you do a raw similarity search on a vector database. Often you will get more than one chunk, depending on your question and how broad or narrow it is. Because our example question was rather narrow, and because there is only one mention of Nancy Pelosi in the text, we received just one chunk back.
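If your question is broader, you can ask the vector store for more than one chunk explicitly. The k argument to similarity_search controls how many chunks come back, using the same db object created above:

# Retrieve the top three matching chunks instead of just the best one
data = db.similarity_search(question, k=3)
for chunk in data:
    print(chunk.page_content)
    print("----")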

Now we will use the LLM to ingest the chunk of text that came from the similarity search and generate a compact answer to the query.

Before you can run the following code, Ollama must be installed and the llama2:7b model downloaded. Note that on macOS and Linux, Ollama stores the model in the .ollama subdirectory of the user’s home directory.

Fully local RAG example—query code

# LLM
from langchain.llms import Ollama
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
llm = Ollama(
    model = "llama2:7b",
    verbose = True,
    callback_manager = CallbackManager([StreamingStdOutCallbackHandler()]),
)

# QA chain
from langchain.chains import RetrievalQA
from langchain import hub

# LangChain Hub is a repository of LangChain prompts shared by the community
QA_CHAIN_PROMPT = hub.pull("rlm/rag-prompt-llama")
qa_chain = RetrievalQA.from_chain_type(
    llm,
    # expose the vector store as a retriever so the chain can fetch relevant chunks
    retriever = db.as_retriever(),
    chain_type_kwargs = {"prompt": QA_CHAIN_PROMPT},
)

result = qa_chain({"query": question})
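Because we attached a streaming callback handler, the answer is printed to the console token by token as Llama 2 generates it. The chain also returns the finished text in its output dictionary, typically under the result key, so you can capture it in a variable instead:

# The chain returns a dict; the generated answer is under the "result" key
print(result["result"])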

Fully local RAG example—query output

In the retrieved context, President Biden refers to Nancy Pelosi as 

“someone who I think will be considered the greatest Speaker in the history of this country.” 

This suggests that the President has a high opinion of Pelosi’s leadership skills and accomplishments as Speaker of the House.

Note the difference in the output of the two snippets. The first one is a literal chunk of text from the document relevant to the query. The second one is a distilled answer to the query. In the first case we are not using the LLM. We are just using the vector store to retrieve a chunk of text from the document. Only in the second case are we using the LLM, which generates a compact answer to the query.

To use RAG in practical applications you will need to import multiple document types such as PDF, DOCX, RTF, XLSX, and PPTX. Both LangChain and LlamaIndex (another popular framework for building LLM applications) have specialized loaders for a variety of document types.  
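As a hypothetical illustration, here is how loading a PDF and a Word document might look with LangChain’s loaders (the file names are placeholders, and the pypdf and docx2txt helper packages must be installed):

# Sketch only: load a PDF and a DOCX file with LangChain document loaders
from langchain.document_loaders import PyPDFLoader, Docx2txtLoader

pdf_docs = PyPDFLoader("./report.pdf").load()       # placeholder file name
docx_docs = Docx2txtLoader("./notes.docx").load()   # placeholder file name
documents = pdf_docs + docx_docs                    # feed these to the text splitter as before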

In addition, you may want to explore other vector stores besides FAISS and SQLite-vss. Like large language models and other areas of generative AI, the vector database space is rapidly evolving. We’ll dive into other options along all of these fronts in future articles here.
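In the meantime, because LangChain hides the vector store behind a common interface, trying a different store is mostly a matter of swapping one class for another. As a rough sketch, assuming the chromadb package is installed, the same texts and embedding function could be stored in Chroma instead of SQLite-vss:

# Sketch only: the same texts and embeddings, stored in Chroma instead of SQLite-vss
from langchain.vectorstores import Chroma

chroma_db = Chroma.from_texts(
    texts=texts,
    embedding=embedding_function,
    persist_directory="/tmp/chroma_db"  # placeholder location
)
print(chroma_db.similarity_search(question)[0].page_content)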

Nitin Borwankar
Contributing Writer

Nitin Borwankar is a seasoned data scientist and database professional with a background in the development and implementation of enterprise data solutions. With a career spanning over three decades, Nitin is known for his work on data science education, advocacy for the use of open-source tools for data science, and contributions to open-source machine learning curriculum. He is a frequent speaker at conferences and meetups and recently talked about in-database machine learning in Postgres at SFPython and about using a common data model to unify data from personal wearables at the Apache Conference. He approaches AI and LLMs from a pragmatic data application perspective.
