Large language models know a lot, but they don’t know your stuff. They don’t know your company’s internal documentation, your product’s support tickets, last quarter’s meeting notes, or the custom knowledge base your team spent three years building. Retrieval-Augmented Generation (RAG) is the engineering pattern that solves this: instead of retraining the model on your data (expensive, slow, quickly outdated), you retrieve the most relevant pieces of your data at query time and inject them into the model’s context window. The model reasons over real information rather than hallucinating from training data.
LangChain is the Python library that makes building RAG pipelines significantly less painful. It provides composable abstractions for document loaders, text splitters, embedding models, vector stores, and retrieval chains — all the components you’d otherwise have to wire together yourself. The abstraction layer also means you can swap OpenAI embeddings for a local model, or swap ChromaDB for Pinecone, without rewriting your pipeline.
In this tutorial you’ll build a complete RAG system from scratch: loading and chunking documents, creating embeddings, storing them in a vector database, and building a question-answering chain that retrieves relevant context and generates grounded answers. By the end you’ll have a working system you can point at your own documents.
RAG = load documents -> split into chunks -> embed chunks -> store in vector DB -> at query time: embed query -> similarity search -> retrieve top-K chunks -> stuff into LLM prompt -> generate answer. LangChain's RetrievalQA chain handles the retrieval-and-generation step. Use FAISS or ChromaDB for the vector store, OpenAIEmbeddings or a local model for embeddings.

What Is RAG and Why Does It Work?
LLMs generate text by predicting the most likely next token given their training distribution. That training distribution is frozen at the training cutoff date and contains only what was publicly available on the internet. When you ask about your internal documentation, the model has never seen it — so it either says “I don’t know” or, more troublingly, makes something up that sounds plausible.
RAG short-circuits this problem. When a user asks a question, you first search your document database for the most relevant chunks of text. You then include those chunks in the prompt sent to the LLM: “Here is relevant context from our documentation. Using only this context, answer the following question.” The model reasons over real information you’ve provided rather than its training data.
The key technology enabling fast document search is vector embeddings. An embedding model converts text into a dense vector — a list of hundreds or thousands of numbers that encodes the semantic meaning of the text. Two texts that mean similar things will have vectors close to each other in high-dimensional space. You embed all your documents once, store the vectors in a vector database, and at query time embed the question and find the nearest document vectors. This semantic search finds relevant content even when the exact words don’t match.
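To make nearest-neighbor search concrete, here is a minimal sketch with invented 3-dimensional vectors (real embeddings have hundreds to thousands of dimensions, but the math is identical):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-D "embeddings", invented for illustration
docs = {
    "How to request vacation days": [0.9, 0.1, 0.2],
    "Annual revenue report": [0.1, 0.9, 0.3],
    "Office seating chart": [0.3, 0.2, 0.9],
}
query = [0.85, 0.15, 0.25]  # pretend embedding of "time off policy"

# Semantic search = rank documents by similarity to the query vector
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked[0])  # "How to request vacation days" is the nearest neighbor
```

Note that "time off" and "vacation days" share no words; the vectors are close because (in a real embedding model) the meanings are close. That is what makes this search semantic rather than lexical.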
| RAG Component | What It Does | Common Options |
|---|---|---|
| Document Loader | Reads files into LangChain Document objects | TextLoader, PyPDFLoader, WebBaseLoader, CSVLoader |
| Text Splitter | Breaks documents into chunks | RecursiveCharacterTextSplitter, TokenTextSplitter |
| Embedding Model | Converts text to vectors | OpenAIEmbeddings, HuggingFaceEmbeddings, OllamaEmbeddings |
| Vector Store | Stores and searches embeddings | FAISS, ChromaDB, Pinecone, Weaviate |
| Retriever | Finds relevant chunks for a query | VectorStoreRetriever, BM25Retriever, EnsembleRetriever |
| LLM | Generates the final answer | ChatOpenAI, Ollama, Anthropic, Google |
Setting Up Dependencies
Our RAG system needs several libraries working together: langchain and langchain-openai for the LLM orchestration layer, langchain-community for document loaders, faiss-cpu for the local vector store, tiktoken for token counting, and pypdf for reading PDF files. Install them all with:
pip install langchain langchain-openai langchain-community faiss-cpu tiktoken pypdf
The langchain core package provides the abstractions. langchain-openai adds OpenAI-specific integrations (ChatGPT, embeddings). langchain-community adds community-maintained integrations for document loaders, vector stores, and other tools. faiss-cpu is Facebook’s fast similarity search library for vector storage. tiktoken is OpenAI’s tokenizer library, used internally for accurate chunk sizing. pypdf enables PDF loading.
You’ll need an OpenAI API key. Set it as an environment variable:
export OPENAI_API_KEY="sk-your-key-here" # Linux/Mac
set OPENAI_API_KEY=sk-your-key-here # Windows Command Prompt
Step 1: Loading Documents
LangChain’s document loaders convert files into a standard Document object with page_content (the text) and metadata (source file, page number, etc.):
# rag_system.py
from langchain_community.document_loaders import (
    TextLoader,
    PyPDFLoader,
    DirectoryLoader,
    WebBaseLoader,
)
# Load a single text file
loader = TextLoader("company_handbook.txt", encoding="utf-8")
docs = loader.load()
print(f"Loaded {len(docs)} document(s)")
print(f"Content preview: {docs[0].page_content[:200]}")
print(f"Metadata: {docs[0].metadata}")
# Load a PDF (splits by page automatically)
pdf_loader = PyPDFLoader("annual_report.pdf")
pdf_docs = pdf_loader.load()
print(f"PDF has {len(pdf_docs)} pages")
# Load all .txt files from a directory
dir_loader = DirectoryLoader(
    path="./documents/",
    glob="**/*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
)
all_docs = dir_loader.load()
print(f"Loaded {len(all_docs)} documents from directory")
# Load web pages
web_loader = WebBaseLoader([
    "https://docs.python.org/3/library/functions.html",
    "https://docs.python.org/3/library/exceptions.html",
])
web_docs = web_loader.load()
The metadata attached to each document is important — when you retrieve a chunk later, you want to know which document it came from so you can cite your sources. LangChain’s loaders populate source in the metadata automatically for file-based loaders.

Step 2: Splitting Documents into Chunks
LLMs have context window limits. You can’t stuff an entire 200-page manual into a prompt — you need to split documents into chunks and retrieve only the relevant ones. Good chunking is more important than most people realize: chunks that are too small lack context; chunks that are too large waste the context window on irrelevant information.
# rag_system.py
from langchain.text_splitter import RecursiveCharacterTextSplitter
# RecursiveCharacterTextSplitter: tries to split on natural boundaries
# (paragraphs, then sentences, then words, then characters)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # Characters per chunk
    chunk_overlap=200,  # Overlap prevents losing context at boundaries
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""],  # Priority order for splitting
)
# Split all loaded documents
chunks = text_splitter.split_documents(all_docs)
print(f"Split {len(all_docs)} documents into {len(chunks)} chunks")
print(f"Average chunk size: {sum(len(c.page_content) for c in chunks) / len(chunks):.0f} chars")
# Inspect a chunk
sample = chunks[5]
print(f"\nChunk content:\n{sample.page_content}")
print(f"\nChunk metadata: {sample.metadata}")
The chunk_overlap parameter is important. When you split at a boundary, you risk losing the context that connects two adjacent chunks. A 200-character overlap ensures each chunk includes the end of the previous chunk, so sentences that span boundaries aren’t orphaned. The tradeoff is slightly more storage and some redundancy in retrieved chunks — worth it for coherence.
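A minimal sliding-window chunker makes the size/overlap interaction concrete. This is a deliberate simplification: RecursiveCharacterTextSplitter additionally prefers paragraph and sentence boundaries rather than cutting at fixed positions.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slide a fixed window over the text, stepping by chunk_size - chunk_overlap."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 2500 characters of distinct content so the overlaps are verifiable
text = "".join(f"{i:04d}" for i in range(625))
chunks = chunk_text(text, chunk_size=1000, chunk_overlap=200)

print([len(c) for c in chunks])             # [1000, 1000, 900, 100]
# Each chunk's last 200 chars repeat as the next chunk's first 200 chars
print(chunks[0][-200:] == chunks[1][:200])  # True
```

The repeated 200 characters are exactly the redundancy the tradeoff paragraph above describes: you store about 25% more text in exchange for never orphaning a sentence at a boundary.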
Step 3: Creating and Storing Embeddings
Now the important part: convert each chunk into a vector and store them in a vector database. This is the one-time indexing step — you run it when you load new documents, not on every query.
# embeddings.py
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
import os
# Initialize the embedding model
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",  # Cheaper than ada-002, better quality
    openai_api_key=os.environ["OPENAI_API_KEY"],
)
# Test the embedding
test_vector = embeddings.embed_query("What is Python used for?")
print(f"Embedding dimensions: {len(test_vector)}") # 1536
# Create the vector store from our chunks
vector_store = FAISS.from_documents(
    documents=chunks,
    embedding=embeddings,
)
# Save to disk so you don't re-embed every time
vector_store.save_local("faiss_index")
print(f"Vector store created with {vector_store.index.ntotal} vectors")
The FAISS.from_documents() call sends each chunk to the OpenAI embeddings API and builds the FAISS index. This is the most API-expensive step — you pay per token for embeddings. For a 100-page PDF (~50,000 tokens), the cost is about $0.001. After saving to disk, you reload it for every subsequent query without re-embedding.
# Loading a saved vector store on subsequent runs
vector_store = FAISS.load_local(
    "faiss_index",
    embeddings,
    allow_dangerous_deserialization=True,  # Required flag in newer LangChain
)

Step 4: Building the Retriever
A retriever takes a query, embeds it, and returns the most similar document chunks:
# rag_system.py
# Create retriever from vector store
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 4,  # Return the top 4 most relevant chunks
    },
)
# Test retrieval directly
query = "How do I request time off?"
relevant_docs = retriever.invoke(query)
print(f"Retrieved {len(relevant_docs)} chunks for query: '{query}'")
for i, doc in enumerate(relevant_docs):
    print(f"\n--- Chunk {i+1} (source: {doc.metadata.get('source', 'unknown')}) ---")
    print(doc.page_content[:300] + "...")
The retriever above uses search_type="similarity"; the alternative search_type="mmr" uses Maximal Marginal Relevance — it balances relevance with diversity, preventing the retriever from returning four nearly identical chunks when your question matches a repeated section of the document. For knowledge bases with redundant content, MMR produces better results.
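MMR's core idea fits in a few lines. This is a simplified sketch with made-up similarity scores, not FAISS's actual implementation: each round it greedily picks the candidate maximizing λ·relevance − (1−λ)·(similarity to anything already selected).

```python
def mmr_select(query_sim, pairwise_sim, k=2, lam=0.5):
    """Greedy Maximal Marginal Relevance over precomputed similarities.

    query_sim[i]      : similarity of candidate i to the query
    pairwise_sim[i][j]: similarity between candidates i and j
    """
    selected, remaining = [], list(range(len(query_sim)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            # Penalize candidates similar to anything already picked
            redundancy = max((pairwise_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Candidates 0 and 1 are near-duplicates; 2 is less relevant but distinct.
query_sim = [0.90, 0.88, 0.70]
pairwise = [[1.0, 0.95, 0.2],
            [0.95, 1.0, 0.2],
            [0.2, 0.2, 1.0]]
print(mmr_select(query_sim, pairwise))  # [0, 2]: skips the near-duplicate
```

Plain similarity search would return candidates 0 and 1 (the two highest query similarities); MMR trades the redundant second copy for the distinct, still-relevant chunk 2.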
Step 5: Building the RAG Chain
The retrieval chain ties everything together: it retrieves relevant chunks for a query and passes them to the LLM with an appropriate prompt:
# rag_system.py
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
# Initialize the LLM
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,  # Deterministic answers for factual QA
    openai_api_key=os.environ["OPENAI_API_KEY"],
)
# Custom prompt that instructs the model to use only the provided context
qa_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""You are a helpful assistant that answers questions based on the provided context.
If the answer is not contained in the context below, say "I don't have information about that."
Do not make up information.

Context:
{context}

Question: {question}
Answer:""",
)
# Build the QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" = put all chunks in one prompt
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": qa_prompt},
)
# Ask a question
result = qa_chain.invoke({"query": "What is the policy for remote work?"})
print("Answer:")
print(result["result"])
print("\nSources:")
for doc in result["source_documents"]:
    print(f"  - {doc.metadata.get('source', 'unknown')}: {doc.page_content[:100]}...")
The chain_type="stuff" approach puts all retrieved chunks into a single prompt. This is simple and works well for small-to-medium retrievals. For larger retrievals or when chunks exceed context limits, use "map_reduce" (summarizes each chunk separately then combines) or "refine" (iteratively refines the answer with each chunk).
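Under the hood, "stuff" is just string assembly. A sketch of the idea (illustrative only, not LangChain's internal code — the template text here is made up):

```python
def stuff_prompt(chunks: list[str], question: str, template: str) -> str:
    """Join all retrieved chunks into one context block and fill the template."""
    # "stuff" only works if the combined context fits the model's window;
    # map_reduce and refine exist precisely for when it doesn't.
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)

template = (
    "Answer from this context only.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)
prompt = stuff_prompt(
    ["Remote work requires manager approval.", "Employees may work remotely 3 days/week."],
    "What is the remote work policy?",
    template,
)
print(prompt)
```

Seeing the assembled prompt also explains a debugging tactic used later in this tutorial: when answers go wrong, print what the LLM actually received.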
The Modern Approach: LCEL Chains
LangChain’s newer “Expression Language” (LCEL) provides a more composable way to build the same pipeline using the pipe operator:
# rag_system.py
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
# Define prompt
prompt = ChatPromptTemplate.from_template("""Answer the question based only on the following context.
If you cannot answer from the context, say so clearly.
Context:
{context}
Question: {question}
Answer:""")
def format_docs(docs):
    """Format retrieved documents for injection into the prompt."""
    return "\n\n".join(
        f"[Source: {doc.metadata.get('source', 'unknown')}]\n{doc.page_content}"
        for doc in docs
    )
# Compose the chain with | operator
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
# Invoke the chain
answer = rag_chain.invoke("How does the annual review process work?")
print(answer)
LCEL’s pipe syntax makes the data flow explicit: the question goes to both the retriever (to find context) and passthrough (to reach the prompt unchanged), both get formatted into the prompt, the prompt goes to the LLM, and the LLM’s response is parsed to a string. Each | is a step in the pipeline.
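The | syntax works because LCEL's Runnable objects overload Python's __or__ operator. A toy re-implementation (illustrative only; LangChain's actual classes are far richer) shows the mechanics:

```python
class Step:
    """Minimal runnable: wraps a function and composes with | like LCEL."""
    def __init__(self, fn):
        self.fn = fn
    def __or__(self, other):
        # self | other -> a Step that runs self first, then feeds its output to other
        return Step(lambda x: other.fn(self.fn(x)))
    def invoke(self, x):
        return self.fn(x)

# Fake pipeline stages standing in for retriever, prompt, and LLM
retrieve = Step(lambda q: {"context": f"[docs about {q}]", "question": q})
build_prompt = Step(lambda d: f"Context: {d['context']}\nQ: {d['question']}")
fake_llm = Step(lambda p: f"Answer based on {p.splitlines()[0]}")

chain = retrieve | build_prompt | fake_llm
print(chain.invoke("vacation policy"))
```

Each | produces a new composed step, which is why LCEL chains are declarative: nothing runs until invoke() is called at the end.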

Adding Conversation History
RAG chains that don’t remember previous questions in a conversation frustrate users. Here’s a conversational RAG chain that maintains chat history:
# rag_system.py
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
# Memory stores the conversation history
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    output_key="answer",
)
# Conversational retrieval chain
conv_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    return_source_documents=True,
    verbose=False,
)
# Multi-turn conversation
questions = [
    "What are the company's core values?",
    "How do those values apply to customer service?",  # References "those values" from above
    "What's an example of living one of those values at work?",
]
for question in questions:
    result = conv_chain.invoke({"question": question})
    print(f"Q: {question}")
    print(f"A: {result['answer']}\n")
The memory object accumulates conversation turns and the chain automatically reformulates follow-up questions to be self-contained before retrieval. “How do those values apply?” becomes something like “How do the company’s core values apply to customer service?” before the retriever searches — because “those values” alone wouldn’t find relevant chunks.
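A sketch of what that reformulation step consumes (the prompt wording here is invented for illustration, not LangChain's actual condense-question prompt):

```python
def build_condense_prompt(chat_history: list[tuple[str, str]], follow_up: str) -> str:
    """Assemble a prompt asking the LLM to rewrite a follow-up as a standalone question."""
    history_text = "\n".join(f"{role}: {msg}" for role, msg in chat_history)
    return (
        "Given the conversation below, rewrite the follow-up question "
        "so it is fully self-contained.\n\n"
        f"{history_text}\n\n"
        f"Follow-up: {follow_up}\n"
        "Standalone question:"
    )

history = [
    ("Human", "What are the company's core values?"),
    ("AI", "Integrity, curiosity, and customer focus."),
]
print(build_condense_prompt(history, "How do those values apply to customer service?"))
```

The LLM's answer to this prompt, not the user's raw follow-up, is what gets embedded and sent to the retriever.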
Real-Life Example: A Python Documentation Assistant
Here’s a complete, runnable RAG system built over Python’s official documentation:
# real_life_project.py
import os
from pathlib import Path
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
def build_python_docs_assistant():
    """Build a RAG assistant over Python documentation pages."""
    print("Loading Python documentation...")
    urls = [
        "https://docs.python.org/3/library/functions.html",
        "https://docs.python.org/3/library/exceptions.html",
        "https://docs.python.org/3/library/stdtypes.html",
        "https://docs.python.org/3/library/itertools.html",
    ]
    loader = WebBaseLoader(urls)
    docs = loader.load()
    print(f"Loaded {len(docs)} pages")

    splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=150)
    chunks = splitter.split_documents(docs)
    print(f"Created {len(chunks)} chunks")

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    index_path = "python_docs_index"
    # save_local() writes a directory containing index.faiss and index.pkl,
    # so we check for the directory itself
    if Path(index_path).exists():
        print("Loading existing index...")
        vector_store = FAISS.load_local(index_path, embeddings,
                                        allow_dangerous_deserialization=True)
    else:
        print("Creating new index...")
        vector_store = FAISS.from_documents(chunks, embeddings)
        vector_store.save_local(index_path)

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    retriever = vector_store.as_retriever(search_kwargs={"k": 5})
    prompt = PromptTemplate(
        input_variables=["context", "question"],
        template="""You are a Python expert assistant. Answer questions about Python using
only the documentation excerpts provided below. Include relevant function signatures when available.
If the answer isn't in the context, say so.

Documentation excerpts:
{context}

Question: {question}
Answer:""",
    )
    chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True,
        chain_type_kwargs={"prompt": prompt},
    )
    return chain

def ask(chain, question: str):
    """Ask a question and display the answer with sources."""
    print(f"\nQ: {question}")
    result = chain.invoke({"query": question})
    print(f"A: {result['result']}")
    sources = {doc.metadata.get("source", "unknown") for doc in result["source_documents"]}
    print(f"Sources: {', '.join(sources)}")

# Build and use the assistant
assistant = build_python_docs_assistant()
ask(assistant, "What does the sorted() function return and how do I use the key parameter?")
ask(assistant, "What is the difference between StopIteration and GeneratorExit exceptions?")

Frequently Asked Questions
How much does it cost to build a RAG system with OpenAI?
The main cost is embedding your documents. text-embedding-3-small costs $0.02 per million tokens, so a 100-page PDF (~50,000 tokens) costs about $0.001 to embed. Query costs are minimal — a question is only a few hundred tokens to embed. For most internal knowledge bases, the total indexing cost is under $1.
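The arithmetic behind that estimate, for plugging in your own document sizes (pricing as quoted above; check OpenAI's current price list before relying on it):

```python
# text-embedding-3-small: $0.02 per 1M tokens (price as quoted in this article)
price_per_token = 0.02 / 1_000_000

doc_tokens = 50_000  # roughly a 100-page PDF
indexing_cost = doc_tokens * price_per_token
print(f"Indexing: ${indexing_cost:.4f}")  # Indexing: $0.0010

query_tokens = 300  # a typical question
print(f"Per query: ${query_tokens * price_per_token:.8f}")  # fractions of a cent
```

Note this covers embeddings only; the LLM calls that generate answers are billed separately at the chat model's per-token rates.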
Can I use a local LLM instead of OpenAI?
Yes — swap ChatOpenAI for OllamaLLM (with Ollama running locally) or HuggingFacePipeline. Similarly, swap OpenAIEmbeddings for OllamaEmbeddings or HuggingFaceEmbeddings. The rest of the pipeline stays identical. This is LangChain’s main value proposition — swappable components.
How do I handle documents that update frequently?
Use a vector store that supports upsert operations (ChromaDB, Pinecone). Track a hash or last-modified timestamp for each source document, and re-embed only changed files. For frequently updating data, consider a refresh schedule rather than real-time updates.
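A sketch of hash-based change detection using only the standard library (the manifest filename and directory layout are illustrative; the upsert call itself depends on your vector store's API and is omitted):

```python
import hashlib
import json
from pathlib import Path

def find_changed_files(doc_dir: str, manifest_path: str = "index_manifest.json") -> list[Path]:
    """Return files whose content hash differs from the stored manifest.

    The manifest maps file path -> SHA-256 of contents from the last index run.
    """
    manifest_file = Path(manifest_path)
    old = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    new, changed = {}, []
    for path in sorted(Path(doc_dir).glob("**/*.txt")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        new[str(path)] = digest
        if old.get(str(path)) != digest:
            changed.append(path)  # new or modified since the last run
    manifest_file.write_text(json.dumps(new, indent=2))
    return changed

# Usage sketch: re-embed and upsert only what find_changed_files() returns,
# instead of rebuilding the whole index on every refresh.
```

Running this on a schedule and re-embedding only the returned files keeps indexing costs proportional to what actually changed.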
What chunk size should I use?
It depends on your content and model context window. A common starting point is 512-1000 characters with 10-20% overlap. For technical documentation with dense information, smaller chunks (256-512) work better. For narrative text, larger chunks (1000-2000) preserve context. Always test with representative queries and adjust based on answer quality.
Why does my RAG system give wrong answers even when the right document is in the database?
The most common causes are: chunks too small (losing context), poor chunk boundaries (splitting mid-sentence), the embedding model not capturing domain-specific terminology well, or not retrieving enough chunks (increase k). Check the retrieved chunks for a failing query — if the right content isn’t being retrieved, the problem is in chunking or retrieval. If it is retrieved but the answer is wrong, the problem is in the prompt or LLM.
What’s the difference between FAISS and ChromaDB?
FAISS is a pure similarity-search library — fast, in-memory, no persistence overhead. ChromaDB is a full vector database with built-in persistence, metadata filtering, and a server mode for multi-process access. FAISS is better for prototyping and read-heavy workloads. ChromaDB is better for production use cases where you need metadata filtering or document updates.
Summary
You’ve built a complete RAG system: document loading, chunking, embedding, vector storage, retrieval, and LLM-powered answer generation. The pipeline turns any document collection into a queryable knowledge base that gives grounded, source-cited answers instead of hallucinations. The LangChain abstractions mean you can swap any component — different embedding models, vector stores, or LLMs — without rewriting the pipeline.
The next level is improving retrieval quality with hybrid search (combining vector search with BM25 keyword search), implementing reranking to improve chunk ordering, and adding metadata filtering to target specific document subsets. For related tutorials, see How To Build a Chatbot with Python and Ollama (local LLMs) and Pydantic V2 Data Validation for structuring the outputs of your RAG chain.