Large language models know a lot, but they don’t know your stuff. They don’t know your company’s internal documentation, your product’s support tickets, last quarter’s meeting notes, or the custom knowledge base your team spent three years building. Retrieval-Augmented Generation (RAG) is the engineering pattern that solves this: instead of retraining the model on your data (expensive, slow, quickly outdated), you retrieve the most relevant pieces of your data at query time and inject them into the model’s context window. The model reasons over real information rather than hallucinating from training data.
LangChain is the Python library that makes building RAG pipelines significantly less painful. It provides composable abstractions for document loaders, text splitters, embedding models, vector stores, and retrieval chains — all the components you’d otherwise have to wire together yourself. The abstraction layer also means you can swap OpenAI embeddings for a local model, or swap ChromaDB for Pinecone, without rewriting your pipeline.
In this tutorial you’ll build a complete RAG system from scratch: loading and chunking documents, creating embeddings, storing them in a vector database, and building a question-answering chain that retrieves relevant context and generates grounded answers. By the end you’ll have a working system you can point at your own documents.
RAG = load documents -> split into chunks -> embed chunks -> store in vector DB -> at query time: embed query -> similarity search -> retrieve top-K chunks -> stuff into LLM prompt -> generate answer. LangChain's RetrievalQA chain handles the retrieval-and-generation step. Use FAISS or ChromaDB for the vector store, OpenAIEmbeddings or a local model for embeddings.

What Is RAG and Why Does It Work?
LLMs generate text by predicting the most likely next token given their training distribution. That training distribution is frozen at the training cutoff date and contains only what was publicly available on the internet. When you ask about your internal documentation, the model has never seen it — so it either says “I don’t know” or, more troublingly, makes something up that sounds plausible.
RAG short-circuits this problem. When a user asks a question, you first search your document database for the most relevant chunks of text. You then include those chunks in the prompt sent to the LLM: “Here is relevant context from our documentation. Using only this context, answer the following question.” The model reasons over real information you’ve provided rather than its training data.
The key technology enabling fast document search is vector embeddings. An embedding model converts text into a dense vector — a list of hundreds or thousands of numbers that encodes the semantic meaning of the text. Two texts that mean similar things will have vectors close to each other in high-dimensional space. You embed all your documents once, store the vectors in a vector database, and at query time embed the question and find the nearest document vectors. This semantic search finds relevant content even when the exact words don’t match.
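To make nearest-neighbor search concrete, here is a minimal sketch with invented 3-dimensional vectors (real embeddings have hundreds to thousands of dimensions, but the math is identical):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-D "embeddings", invented for illustration
docs = {
    "How to request vacation days": [0.9, 0.1, 0.2],
    "Annual revenue report": [0.1, 0.9, 0.3],
    "Office seating chart": [0.3, 0.2, 0.9],
}
query = [0.85, 0.15, 0.25]  # pretend embedding of "time off policy"

# Semantic search = rank documents by similarity to the query vector
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked[0])  # "How to request vacation days" is the nearest neighbor
```

Note that "time off" and "vacation days" share no words; the vectors are close because (in a real embedding model) the meanings are close. That is what makes this search semantic rather than lexical.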
| RAG Component | What It Does | Common Options |
|---|---|---|
| Document Loader | Reads files into LangChain Document objects | TextLoader, PyPDFLoader, WebBaseLoader, CSVLoader |
| Text Splitter | Breaks documents into chunks | RecursiveCharacterTextSplitter, TokenTextSplitter |
| Embedding Model | Converts text to vectors | OpenAIEmbeddings, HuggingFaceEmbeddings, OllamaEmbeddings |
| Vector Store | Stores and searches embeddings | FAISS, ChromaDB, Pinecone, Weaviate |
| Retriever | Finds relevant chunks for a query | VectorStoreRetriever, BM25Retriever, EnsembleRetriever |
| LLM | Generates the final answer | ChatOpenAI, Ollama, Anthropic, Google |
Setting Up Dependencies
Our RAG system needs several libraries working together: langchain and langchain-openai for the LLM orchestration layer, langchain-community for document loaders, faiss-cpu for the local vector store, tiktoken for token counting, and pypdf for reading PDF files. Install them all with:
pip install langchain langchain-openai langchain-community faiss-cpu tiktoken pypdf
The langchain core package provides the abstractions. langchain-openai adds OpenAI-specific integrations (ChatGPT, embeddings). langchain-community adds community-maintained integrations for document loaders, vector stores, and other tools. faiss-cpu is Facebook’s fast similarity search library for vector storage. tiktoken is OpenAI’s tokenizer library, used internally for accurate chunk sizing. pypdf enables PDF loading.
You’ll need an OpenAI API key. Set it as an environment variable:
export OPENAI_API_KEY="sk-your-key-here" # Linux/Mac
set OPENAI_API_KEY=sk-your-key-here # Windows Command Prompt
Step 1: Loading Documents
LangChain’s document loaders convert files into a standard Document object with page_content (the text) and metadata (source file, page number, etc.):
# rag_system.py
from langchain_community.document_loaders import (
    TextLoader,
    PyPDFLoader,
    DirectoryLoader,
    WebBaseLoader,
)
# Load a single text file
loader = TextLoader("company_handbook.txt", encoding="utf-8")
docs = loader.load()
print(f"Loaded {len(docs)} document(s)")
print(f"Content preview: {docs[0].page_content[:200]}")
print(f"Metadata: {docs[0].metadata}")
# Load a PDF (splits by page automatically)
pdf_loader = PyPDFLoader("annual_report.pdf")
pdf_docs = pdf_loader.load()
print(f"PDF has {len(pdf_docs)} pages")
# Load all .txt files from a directory
dir_loader = DirectoryLoader(
    path="./documents/",
    glob="**/*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
)
all_docs = dir_loader.load()
print(f"Loaded {len(all_docs)} documents from directory")
# Load web pages
web_loader = WebBaseLoader([
    "https://docs.python.org/3/library/functions.html",
    "https://docs.python.org/3/library/exceptions.html",
])
web_docs = web_loader.load()
The metadata attached to each document is important — when you retrieve a chunk later, you want to know which document it came from so you can cite your sources. LangChain’s loaders populate source in the metadata automatically for file-based loaders.

Step 2: Splitting Documents into Chunks
LLMs have context window limits. You can’t stuff an entire 200-page manual into a prompt — you need to split documents into chunks and retrieve only the relevant ones. Good chunking is more important than most people realize: chunks that are too small lack context; chunks that are too large waste the context window on irrelevant information.
# rag_system.py
from langchain.text_splitter import RecursiveCharacterTextSplitter
# RecursiveCharacterTextSplitter: tries to split on natural boundaries
# (paragraphs, then sentences, then words, then characters)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # Characters per chunk
    chunk_overlap=200,  # Overlap prevents losing context at boundaries
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""],  # Priority order for splitting
)
# Split all loaded documents
chunks = text_splitter.split_documents(all_docs)
print(f"Split {len(all_docs)} documents into {len(chunks)} chunks")
print(f"Average chunk size: {sum(len(c.page_content) for c in chunks) / len(chunks):.0f} chars")
# Inspect a chunk
sample = chunks[5]
print(f"\nChunk content:\n{sample.page_content}")
print(f"\nChunk metadata: {sample.metadata}")
The chunk_overlap parameter is important. When you split at a boundary, you risk losing the context that connects two adjacent chunks. A 200-character overlap ensures each chunk includes the end of the previous chunk, so sentences that span boundaries aren’t orphaned. The tradeoff is slightly more storage and some redundancy in retrieved chunks — worth it for coherence.
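A minimal sliding-window chunker makes the size/overlap interaction concrete. This is a deliberate simplification: RecursiveCharacterTextSplitter additionally prefers paragraph and sentence boundaries rather than cutting at fixed positions.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slide a fixed window over the text, stepping by chunk_size - chunk_overlap."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 2500 characters of distinct content so the overlaps are verifiable
text = "".join(f"{i:04d}" for i in range(625))
chunks = chunk_text(text, chunk_size=1000, chunk_overlap=200)

print([len(c) for c in chunks])             # [1000, 1000, 900, 100]
# Each chunk's last 200 chars repeat as the next chunk's first 200 chars
print(chunks[0][-200:] == chunks[1][:200])  # True
```

The repeated 200 characters are exactly the redundancy the tradeoff paragraph above describes: you store about 25% more text in exchange for never orphaning a sentence at a boundary.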
Step 3: Creating and Storing Embeddings
Now the important part: convert each chunk into a vector and store them in a vector database. This is the one-time indexing step — you run it when you load new documents, not on every query.
# embeddings.py
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
import os
# Initialize the embedding model
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",  # Cheaper than ada-002, better quality
    openai_api_key=os.environ["OPENAI_API_KEY"],
)
# Test the embedding
test_vector = embeddings.embed_query("What is Python used for?")
print(f"Embedding dimensions: {len(test_vector)}") # 1536
# Create the vector store from our chunks
vector_store = FAISS.from_documents(
    documents=chunks,
    embedding=embeddings,
)
# Save to disk so you don't re-embed every time
vector_store.save_local("faiss_index")
print(f"Vector store created with {vector_store.index.ntotal} vectors")
The FAISS.from_documents() call sends each chunk to the OpenAI embeddings API and builds the FAISS index. This is the most API-expensive step — you pay per token for embeddings. For a 100-page PDF (~50,000 tokens), the cost is about $0.001. After saving to disk, you reload it for every subsequent query without re-embedding.
# Loading a saved vector store on subsequent runs
vector_store = FAISS.load_local(
    "faiss_index",
    embeddings,
    allow_dangerous_deserialization=True,  # Required flag in newer LangChain
)

Step 4: Building the Retriever
A retriever takes a query, embeds it, and returns the most similar document chunks:
# rag_system.py
# Create retriever from vector store
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 4,  # Return the top 4 most relevant chunks
    },
)
# Test retrieval directly
query = "How do I request time off?"
relevant_docs = retriever.invoke(query)
print(f"Retrieved {len(relevant_docs)} chunks for query: '{query}'")
for i, doc in enumerate(relevant_docs):
    print(f"\n--- Chunk {i+1} (source: {doc.metadata.get('source', 'unknown')}) ---")
    print(doc.page_content[:300] + "...")
The retriever above uses search_type="similarity"; the alternative search_type="mmr" uses Maximal Marginal Relevance — it balances relevance with diversity, preventing the retriever from returning four nearly identical chunks when your question matches a repeated section of the document. For knowledge bases with redundant content, MMR produces better results.
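MMR's core idea fits in a few lines. This is a simplified sketch with made-up similarity scores, not FAISS's actual implementation: each round it greedily picks the candidate maximizing λ·relevance − (1−λ)·(similarity to anything already selected).

```python
def mmr_select(query_sim, pairwise_sim, k=2, lam=0.5):
    """Greedy Maximal Marginal Relevance over precomputed similarities.

    query_sim[i]      : similarity of candidate i to the query
    pairwise_sim[i][j]: similarity between candidates i and j
    """
    selected, remaining = [], list(range(len(query_sim)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            # Penalize candidates similar to anything already picked
            redundancy = max((pairwise_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Candidates 0 and 1 are near-duplicates; 2 is less relevant but distinct.
query_sim = [0.90, 0.88, 0.70]
pairwise = [[1.0, 0.95, 0.2],
            [0.95, 1.0, 0.2],
            [0.2, 0.2, 1.0]]
print(mmr_select(query_sim, pairwise))  # [0, 2]: skips the near-duplicate
```

Plain similarity search would return candidates 0 and 1 (the two highest query similarities); MMR trades the redundant second copy for the distinct, still-relevant chunk 2.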
Step 5: Building the RAG Chain
The retrieval chain ties everything together: it retrieves relevant chunks for a query and passes them to the LLM with an appropriate prompt:
# rag_system.py
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
# Initialize the LLM
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,  # Deterministic answers for factual QA
    openai_api_key=os.environ["OPENAI_API_KEY"],
)
# Custom prompt that instructs the model to use only the provided context
qa_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""You are a helpful assistant that answers questions based on the provided context.
If the answer is not contained in the context below, say "I don't have information about that."
Do not make up information.

Context:
{context}

Question: {question}
Answer:""",
)
# Build the QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" = put all chunks in one prompt
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": qa_prompt},
)
# Ask a question
result = qa_chain.invoke({"query": "What is the policy for remote work?"})
print("Answer:")
print(result["result"])
print("\nSources:")
for doc in result["source_documents"]:
    print(f"  - {doc.metadata.get('source', 'unknown')}: {doc.page_content[:100]}...")
The chain_type="stuff" approach puts all retrieved chunks into a single prompt. This is simple and works well for small-to-medium retrievals. For larger retrievals or when chunks exceed context limits, use "map_reduce" (summarizes each chunk separately then combines) or "refine" (iteratively refines the answer with each chunk).
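Under the hood, "stuff" is just string assembly. A sketch of the idea (illustrative only, not LangChain's internal code — the template text here is made up):

```python
def stuff_prompt(chunks: list[str], question: str, template: str) -> str:
    """Join all retrieved chunks into one context block and fill the template."""
    # "stuff" only works if the combined context fits the model's window;
    # map_reduce and refine exist precisely for when it doesn't.
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)

template = (
    "Answer from this context only.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)
prompt = stuff_prompt(
    ["Remote work requires manager approval.", "Employees may work remotely 3 days/week."],
    "What is the remote work policy?",
    template,
)
print(prompt)
```

Seeing the assembled prompt also explains a debugging tactic used later in this tutorial: when answers go wrong, print what the LLM actually received.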
The Modern Approach: LCEL Chains
LangChain’s newer “Expression Language” (LCEL) provides a more composable way to build the same pipeline using the pipe operator:
# rag_system.py
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
# Define prompt
prompt = ChatPromptTemplate.from_template("""Answer the question based only on the following context.
If you cannot answer from the context, say so clearly.
Context:
{context}
Question: {question}
Answer:""")
def format_docs(docs):
    """Format retrieved documents for injection into the prompt."""
    return "\n\n".join(
        f"[Source: {doc.metadata.get('source', 'unknown')}]\n{doc.page_content}"
        for doc in docs
    )
# Compose the chain with | operator
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
# Invoke the chain
answer = rag_chain.invoke("How does the annual review process work?")
print(answer)
LCEL’s pipe syntax makes the data flow explicit: the question goes to both the retriever (to find context) and passthrough (to reach the prompt unchanged), both get formatted into the prompt, the prompt goes to the LLM, and the LLM’s response is parsed to a string. Each | is a step in the pipeline.
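The | syntax works because LCEL's Runnable objects overload Python's __or__ operator. A toy re-implementation (illustrative only; LangChain's actual classes are far richer) shows the mechanics:

```python
class Step:
    """Minimal runnable: wraps a function and composes with | like LCEL."""
    def __init__(self, fn):
        self.fn = fn
    def __or__(self, other):
        # self | other -> a Step that runs self first, then feeds its output to other
        return Step(lambda x: other.fn(self.fn(x)))
    def invoke(self, x):
        return self.fn(x)

# Fake pipeline stages standing in for retriever, prompt, and LLM
retrieve = Step(lambda q: {"context": f"[docs about {q}]", "question": q})
build_prompt = Step(lambda d: f"Context: {d['context']}\nQ: {d['question']}")
fake_llm = Step(lambda p: f"Answer based on {p.splitlines()[0]}")

chain = retrieve | build_prompt | fake_llm
print(chain.invoke("vacation policy"))
```

Each | produces a new composed step, which is why LCEL chains are declarative: nothing runs until invoke() is called at the end.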

Adding Conversation History
RAG chains that don’t remember previous questions in a conversation frustrate users. Here’s a conversational RAG chain that maintains chat history:
# rag_system.py
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
# Memory stores the conversation history
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    output_key="answer",
)
# Conversational retrieval chain
conv_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    return_source_documents=True,
    verbose=False,
)
# Multi-turn conversation
questions = [
    "What are the company's core values?",
    "How do those values apply to customer service?",  # References "those values" from above
    "What's an example of living one of those values at work?",
]
for question in questions:
    result = conv_chain.invoke({"question": question})
    print(f"Q: {question}")
    print(f"A: {result['answer']}\n")
The memory object accumulates conversation turns and the chain automatically reformulates follow-up questions to be self-contained before retrieval. “How do those values apply?” becomes something like “How do the company’s core values apply to customer service?” before the retriever searches — because “those values” alone wouldn’t find relevant chunks.
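A sketch of what that reformulation step consumes (the prompt wording here is invented for illustration, not LangChain's actual condense-question prompt):

```python
def build_condense_prompt(chat_history: list[tuple[str, str]], follow_up: str) -> str:
    """Assemble a prompt asking the LLM to rewrite a follow-up as a standalone question."""
    history_text = "\n".join(f"{role}: {msg}" for role, msg in chat_history)
    return (
        "Given the conversation below, rewrite the follow-up question "
        "so it is fully self-contained.\n\n"
        f"{history_text}\n\n"
        f"Follow-up: {follow_up}\n"
        "Standalone question:"
    )

history = [
    ("Human", "What are the company's core values?"),
    ("AI", "Integrity, curiosity, and customer focus."),
]
print(build_condense_prompt(history, "How do those values apply to customer service?"))
```

The LLM's answer to this prompt, not the user's raw follow-up, is what gets embedded and sent to the retriever.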
Real-Life Example: A Python Documentation Assistant
Here’s a complete, runnable RAG system built over Python’s official documentation:
# real_life_project.py
import os
from pathlib import Path
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
def build_python_docs_assistant():
    """Build a RAG assistant over Python documentation pages."""
    print("Loading Python documentation...")
    urls = [
        "https://docs.python.org/3/library/functions.html",
        "https://docs.python.org/3/library/exceptions.html",
        "https://docs.python.org/3/library/stdtypes.html",
        "https://docs.python.org/3/library/itertools.html",
    ]
    loader = WebBaseLoader(urls)
    docs = loader.load()
    print(f"Loaded {len(docs)} pages")

    splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=150)
    chunks = splitter.split_documents(docs)
    print(f"Created {len(chunks)} chunks")

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    index_path = "python_docs_index"
    # save_local() writes a directory containing index.faiss and index.pkl,
    # so we check for the directory itself
    if Path(index_path).exists():
        print("Loading existing index...")
        vector_store = FAISS.load_local(index_path, embeddings,
                                        allow_dangerous_deserialization=True)
    else:
        print("Creating new index...")
        vector_store = FAISS.from_documents(chunks, embeddings)
        vector_store.save_local(index_path)

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    retriever = vector_store.as_retriever(search_kwargs={"k": 5})
    prompt = PromptTemplate(
        input_variables=["context", "question"],
        template="""You are a Python expert assistant. Answer questions about Python using
only the documentation excerpts provided below. Include relevant function signatures when available.
If the answer isn't in the context, say so.

Documentation excerpts:
{context}

Question: {question}
Answer:""",
    )
    chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True,
        chain_type_kwargs={"prompt": prompt},
    )
    return chain

def ask(chain, question: str):
    """Ask a question and display the answer with sources."""
    print(f"\nQ: {question}")
    result = chain.invoke({"query": question})
    print(f"A: {result['result']}")
    sources = {doc.metadata.get("source", "unknown") for doc in result["source_documents"]}
    print(f"Sources: {', '.join(sources)}")

# Build and use the assistant
assistant = build_python_docs_assistant()
ask(assistant, "What does the sorted() function return and how do I use the key parameter?")
ask(assistant, "What is the difference between StopIteration and GeneratorExit exceptions?")

Frequently Asked Questions
How much does it cost to build a RAG system with OpenAI?
The main cost is embedding your documents. text-embedding-3-small costs $0.02 per million tokens, so a 100-page PDF (~50,000 tokens) costs about $0.001 to embed. Query costs are minimal — a question is only a few hundred tokens to embed. For most internal knowledge bases, the total indexing cost is under $1.
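The arithmetic behind that estimate, for plugging in your own document sizes (pricing as quoted above; check OpenAI's current price list before relying on it):

```python
# text-embedding-3-small: $0.02 per 1M tokens (price as quoted in this article)
price_per_token = 0.02 / 1_000_000

doc_tokens = 50_000  # roughly a 100-page PDF
indexing_cost = doc_tokens * price_per_token
print(f"Indexing: ${indexing_cost:.4f}")  # Indexing: $0.0010

query_tokens = 300  # a typical question
print(f"Per query: ${query_tokens * price_per_token:.8f}")  # fractions of a cent
```

Note this covers embeddings only; the LLM calls that generate answers are billed separately at the chat model's per-token rates.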
Can I use a local LLM instead of OpenAI?
Yes — swap ChatOpenAI for OllamaLLM (with Ollama running locally) or HuggingFacePipeline. Similarly, swap OpenAIEmbeddings for OllamaEmbeddings or HuggingFaceEmbeddings. The rest of the pipeline stays identical. This is LangChain’s main value proposition — swappable components.
How do I handle documents that update frequently?
Use a vector store that supports upsert operations (ChromaDB, Pinecone). Track a hash or last-modified timestamp for each source document, and re-embed only changed files. For frequently updating data, consider a refresh schedule rather than real-time updates.
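A sketch of hash-based change detection using only the standard library (the manifest filename and directory layout are illustrative; the upsert call itself depends on your vector store's API and is omitted):

```python
import hashlib
import json
from pathlib import Path

def find_changed_files(doc_dir: str, manifest_path: str = "index_manifest.json") -> list[Path]:
    """Return files whose content hash differs from the stored manifest.

    The manifest maps file path -> SHA-256 of contents from the last index run.
    """
    manifest_file = Path(manifest_path)
    old = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    new, changed = {}, []
    for path in sorted(Path(doc_dir).glob("**/*.txt")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        new[str(path)] = digest
        if old.get(str(path)) != digest:
            changed.append(path)  # new or modified since the last run
    manifest_file.write_text(json.dumps(new, indent=2))
    return changed

# Usage sketch: re-embed and upsert only what find_changed_files() returns,
# instead of rebuilding the whole index on every refresh.
```

Running this on a schedule and re-embedding only the returned files keeps indexing costs proportional to what actually changed.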
What chunk size should I use?
It depends on your content and model context window. A common starting point is 512-1000 characters with 10-20% overlap. For technical documentation with dense information, smaller chunks (256-512) work better. For narrative text, larger chunks (1000-2000) preserve context. Always test with representative queries and adjust based on answer quality.
Why does my RAG system give wrong answers even when the right document is in the database?
The most common causes are: chunks too small (losing context), poor chunk boundaries (splitting mid-sentence), the embedding model not capturing domain-specific terminology well, or not retrieving enough chunks (increase k). Check the retrieved chunks for a failing query — if the right content isn’t being retrieved, the problem is in chunking or retrieval. If it is retrieved but the answer is wrong, the problem is in the prompt or LLM.
What’s the difference between FAISS and ChromaDB?
FAISS is a pure similarity-search library — fast, in-memory, no persistence overhead. ChromaDB is a full vector database with built-in persistence, metadata filtering, and a server mode for multi-process access. FAISS is better for prototyping and read-heavy workloads. ChromaDB is better for production use cases where you need metadata filtering or document updates.
Summary
You’ve built a complete RAG system: document loading, chunking, embedding, vector storage, retrieval, and LLM-powered answer generation. The pipeline turns any document collection into a queryable knowledge base that gives grounded, source-cited answers instead of hallucinations. The LangChain abstractions mean you can swap any component — different embedding models, vector stores, or LLMs — without rewriting the pipeline.
The next level is improving retrieval quality with hybrid search (combining vector search with BM25 keyword search), implementing reranking to improve chunk ordering, and adding metadata filtering to target specific document subsets. For related tutorials, see How To Build a Chatbot with Python and Ollama (local LLMs) and Pydantic V2 Data Validation for structuring the outputs of your RAG chain.