You have 200 pages of product documentation, a folder of research papers, or a collection of support tickets, and you want to ask questions about that content in plain English and get accurate, specific answers back. Not summaries — answers. “What does section 4.2 say about refund eligibility?” or “Which support ticket mentions the login timeout bug?” This is exactly what LlamaIndex was built for: turning your own documents into a queryable knowledge base powered by a language model, without requiring a data science background to set up.
LlamaIndex (formerly GPT Index) is a Python framework that connects language models to your data. It handles the mechanics of loading documents, splitting them into chunks, converting those chunks into vector embeddings, storing them in an index, and retrieving the most relevant chunks when a question is asked — then passing those chunks plus your question to an LLM to synthesize a final answer. The OpenAI API is the default backend, but LlamaIndex supports local models via Ollama, Hugging Face, and many others. You will need an OpenAI API key for the examples in this article, though the local model alternative is shown at the end.
This article walks through the complete LlamaIndex workflow: installing the library, loading documents from text files and directories, building a VectorStoreIndex, querying it with a QueryEngine, understanding how the retrieval-augmented generation (RAG) pipeline works under the hood, customizing chunk size and retrieval parameters, using a persistent storage backend so you do not re-index on every run, and building a practical document Q&A tool. By the end you will have a working system you can point at any folder of documents.
LlamaIndex Document Q&A: Quick Example
This five-minute example loads a text file, builds a vector index, and answers a question about its content. It requires an OpenAI API key (set as an environment variable) and a text file to query.
# quick_llamaindex.py
import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
# Set your OpenAI key before running:
# export OPENAI_API_KEY="sk-..."
# Load all documents from the ./docs folder
documents = SimpleDirectoryReader("docs").load_data()
# Build a vector index from those documents
index = VectorStoreIndex.from_documents(documents)
# Create a query engine and ask a question
engine = index.as_query_engine()
response = engine.query("What is the main topic of these documents?")
print(response)
The documents cover Python testing best practices, including unit testing
with pytest, test coverage measurement with coverage.py, and setting up
CI pipelines for automated quality checks.
Three steps: load documents with SimpleDirectoryReader, build an index with VectorStoreIndex.from_documents(), and query with engine.query(). LlamaIndex handles chunking, embedding, and retrieval transparently. The answer synthesizes information from across your documents rather than returning raw excerpts. The sections below explain each component and show how to customize the pipeline for real-world use cases.
What is LlamaIndex and How Does the RAG Pipeline Work?
LlamaIndex implements a pattern called Retrieval-Augmented Generation (RAG). Instead of asking a language model to answer from its training data alone, RAG retrieves relevant passages from your documents and includes them as context when asking the question. This gives the model specific, up-to-date information and reduces hallucination on domain-specific content, because the answer can be drawn from the supplied passages rather than from memory.
| Component | LlamaIndex Class | What It Does |
|---|---|---|
| Document loading | SimpleDirectoryReader | Reads files from disk (txt, pdf, docx, etc.) |
| Chunking | SentenceSplitter | Splits documents into overlapping text chunks |
| Embedding | OpenAIEmbedding | Converts chunks to vector representations |
| Indexing | VectorStoreIndex | Stores vectors for similarity search |
| Retrieval | VectorIndexRetriever | Finds top-k most relevant chunks for a query |
| Synthesis | get_response_synthesizer() | Builds the synthesizer that sends chunks + query to the LLM for the final answer |
When you call engine.query("..."), LlamaIndex converts your question to a vector, finds the chunks with the highest cosine similarity to that vector, builds a prompt containing those chunks plus your question, and sends it to the LLM. The model’s answer is grounded in what your documents actually say, not what the model memorized during training.
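To make the retrieval step concrete, here is a minimal sketch of the same flow in plain Python. It is a toy under stated assumptions, not LlamaIndex's implementation: the "embedding" is just a word-count vector (real systems use a neural embedding model), and the function names embed, retrieve, and build_prompt are made up for this illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (an assumption for illustration)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[tuple[float, str]]:
    """Return the top_k (score, chunk) pairs, highest score first."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def build_prompt(query: str, retrieved: list[tuple[float, str]]) -> str:
    """Assemble the context-plus-question prompt that would be sent to the LLM."""
    context = "\n---\n".join(chunk for _, chunk in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context above."


chunks = [
    "Refunds are available within 30 days of purchase.",
    "The login page times out after 15 minutes of inactivity.",
    "Shipping takes 3 to 5 business days.",
]
question = "When does the login page time out?"
top = retrieve(question, chunks, top_k=2)
prompt = build_prompt(question, top)
print(prompt)
```

The chunk about the login page ranks first because it shares the most words with the question. A real embedding model would also match paraphrases that share no words at all, which is why it is worth the API call.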
Installing LlamaIndex
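LlamaIndex is split into a core package (llama-index-core) and optional integration packages. The llama-index starter bundle installs the core together with the default OpenAI integrations; the command below also names the OpenAI packages explicitly so the dependencies used in this article are visible.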
# install_llamaindex.sh
pip install llama-index llama-index-llms-openai llama-index-embeddings-openai
Successfully installed llama-index-core-0.11.0 llama-index-llms-openai-0.3.0
llama-index-embeddings-openai-0.3.0 openai-1.30.0
If you plan to load PDFs, also install pypdf. For Word documents, install docx2txt. LlamaIndex’s SimpleDirectoryReader auto-detects file types and uses the appropriate parser when those packages are available. Set your OpenAI API key as an environment variable before running any of the examples: export OPENAI_API_KEY="sk-your-key-here".
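A missing key usually surfaces as an authentication error deep inside the first API call, so it can help to check for it up front. The following is a small sketch using only the standard library; the helper name has_openai_key is an assumption for this example and is not part of LlamaIndex.

```python
import os
from collections.abc import Mapping


def has_openai_key(env: Mapping[str, str] = os.environ) -> bool:
    """Return True when OPENAI_API_KEY is set to a non-empty value."""
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not has_openai_key():
    print('OPENAI_API_KEY is not set. Run: export OPENAI_API_KEY="sk-your-key-here"')
```

The env parameter exists only so the check can be exercised with a plain dict; in a script you would call has_openai_key() with no arguments.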
Loading Documents
SimpleDirectoryReader is the fastest way to load a folder of files. By default it reads .txt, .pdf, .docx, .md, and several other formats. You can also load individual files or use custom readers for APIs and databases.
# load_documents.py
from llama_index.core import SimpleDirectoryReader
# Load all files from a directory
documents = SimpleDirectoryReader("docs").load_data()
print(f"Loaded {len(documents)} document(s)")
for doc in documents[:3]:
    print(f" File: {doc.metadata.get('file_name', 'unknown')}")
    print(f" Characters: {len(doc.text)}")
    print()
# Load specific file extensions only
pdf_docs = SimpleDirectoryReader(
    "docs",
    required_exts=[".pdf"],
    recursive=True
).load_data()
print(f"PDF documents: {len(pdf_docs)}")
Loaded 4 document(s)
File: python_testing.txt
Characters: 8432
File: deployment_guide.txt
Characters: 12891
File: api_reference.txt
Characters: 5672
PDF documents: 2
Each loaded file becomes a Document object with a text property and a metadata dict containing the file name, path, and other attributes. Documents are loaded in full before chunking — if you have very large files, the chunking step (covered next) handles splitting them into manageable pieces. The recursive=True flag traverses subdirectories.
Building a VectorStoreIndex
Once documents are loaded, VectorStoreIndex.from_documents() chunks them, generates embeddings via the OpenAI API, and stores everything in an in-memory vector store. This step calls the OpenAI embeddings API for each chunk, so it costs a small amount and takes a few seconds to minutes depending on document volume.
# build_index.py
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.node_parser import SentenceSplitter
# Configure chunking: 512 token chunks with 50-token overlap
Settings.text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
documents = SimpleDirectoryReader("docs").load_data()
print(f"Building index from {len(documents)} document(s)...")
index = VectorStoreIndex.from_documents(documents, show_progress=True)
print("Index built successfully.")
print(f"Total nodes (chunks) indexed: {len(index.docstore.docs)}")
Building index from 4 document(s)...
Parsing nodes: 100%|########| 4/4 [00:00<00:00, 12.32it/s]
Generating embeddings: 100%|####| 38/38 [00:03<00:00, 11.8 embeddings/s]
Index built successfully.
Total nodes (chunks) indexed: 38
The chunk_size and chunk_overlap parameters control the granularity of retrieval. Smaller chunks (256-512 tokens) give more precise retrieval, but each chunk carries less surrounding context. Larger chunks (1024+ tokens) preserve more context but may pull in less relevant material alongside the passage you wanted. The 50-token overlap repeats the end of one chunk at the start of the next, so text near a boundary keeps some of its surrounding context in both chunks. For most document Q&A tasks, 512 tokens with 50-token overlap is a good starting point.
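The effect of overlap is easiest to see on a tiny input. The sketch below is a simplified sliding window over a list of tokens, written for illustration only: SentenceSplitter additionally tries to respect sentence boundaries and counts tokens with a real tokenizer, which this toy does not attempt. The function name chunk_tokens is hypothetical.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


words = [f"w{i}" for i in range(10)]
for chunk in chunk_tokens(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
# ['w0', 'w1', 'w2', 'w3']
# ['w3', 'w4', 'w5', 'w6']
# ['w6', 'w7', 'w8', 'w9']
```

Each chunk begins with the last token of the previous one. With chunk_size=512 and chunk_overlap=50 the same thing happens at a larger scale: the final 50 tokens of one chunk are repeated at the start of the next.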
Querying the Index
The QueryEngine handles retrieval and response synthesis. The default retrieves the top 2 most similar chunks. For document Q&A you usually want more context -- retrieving 5-10 chunks gives the model more to work with and reduces "I don't know" responses on complex questions.
# query_index.py
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader("docs").load_data()
index = VectorStoreIndex.from_documents(documents)
# Default query engine (top-2 retrieval)
engine = index.as_query_engine()
response = engine.query("How do I set up pytest for a new project?")
print("Answer:", response)
print()
# Custom retrieval: top-5 chunks, more context
engine_detailed = index.as_query_engine(similarity_top_k=5)
response2 = engine_detailed.query("What testing libraries are recommended?")
print("Detailed answer:", response2)
print()
# Access the source nodes used to generate the answer
for i, node in enumerate(response2.source_nodes):
    print(f"Source {i+1}: {node.metadata.get('file_name')} (score: {node.score:.3f})")
Answer: To set up pytest for a new project, install it with pip install pytest,
create a tests/ directory, name your test files test_*.py, and run pytest from
your project root.
Detailed answer: The recommended testing libraries include pytest for test
execution, coverage.py (via pytest-cov) for coverage measurement, and
pytest-mock for mocking external dependencies...
Source 1: python_testing.txt (score: 0.912)
Source 2: python_testing.txt (score: 0.887)
Source 3: deployment_guide.txt (score: 0.743)
Source 4: api_reference.txt (score: 0.701)
Source 5: python_testing.txt (score: 0.694)
The source_nodes attribute is extremely useful for building trust in RAG systems -- you can show users exactly which documents and sections the answer came from. The score is the cosine similarity between the query vector and the chunk vector; higher is more relevant. Displaying sources next to answers is standard practice in production document Q&A systems.
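Because the score is a similarity value, one simple refinement is to hide weak matches before showing citations. Here is a minimal sketch of that idea using stand-in objects rather than real LlamaIndex nodes; the Source class, the confident_sources helper, and the 0.75 cutoff are all assumptions chosen for illustration, and a sensible cutoff depends on your embedding model and data.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical stand-in for one retrieved chunk: its file name and score."""
    file_name: str
    score: float


def confident_sources(sources: list[Source], cutoff: float = 0.75) -> list[Source]:
    """Keep sources scoring at or above cutoff, best first."""
    kept = [s for s in sources if s.score >= cutoff]
    return sorted(kept, key=lambda s: s.score, reverse=True)


retrieved = [
    Source("python_testing.txt", 0.912),
    Source("deployment_guide.txt", 0.743),
    Source("python_testing.txt", 0.887),
]
for source in confident_sources(retrieved):
    print(f"{source.file_name} (score: {source.score:.3f})")
```

LlamaIndex also provides a SimilarityPostprocessor with a similarity_cutoff option for applying this kind of filter inside the query engine, so that weak chunks never reach the LLM in the first place.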
Persisting the Index to Disk
Re-indexing every time you run your application is slow and costs money (each embedding call hits the OpenAI API). LlamaIndex can save the index to disk and reload it in under a second on subsequent runs.
# persist_index.py
import os
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)
INDEX_DIR = "./index_storage"
if os.path.exists(INDEX_DIR):
    # Load from disk -- no API calls needed
    print("Loading index from disk...")
    storage_ctx = StorageContext.from_defaults(persist_dir=INDEX_DIR)
    index = load_index_from_storage(storage_ctx)
else:
    # Build and save
    print("Building index (first run)...")
    documents = SimpleDirectoryReader("docs").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=INDEX_DIR)
    print(f"Index saved to {INDEX_DIR}")
engine = index.as_query_engine(similarity_top_k=4)
print(engine.query("Summarize the key points from all documents."))
Loading index from disk...
The documents cover Python development best practices including testing
strategies with pytest and coverage.py, deployment automation using CI/CD
pipelines, and REST API design patterns with FastAPI...
On the first run, documents are loaded, chunked, and embedded -- which takes a few seconds and costs API credits. Every subsequent run skips that entirely and loads the pre-built index from the ./index_storage folder. This pattern makes LlamaIndex practical for applications that are restarted frequently, like web servers or CLI tools. When your documents change, delete the index_storage directory and let it rebuild on the next run.
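Remembering to delete the index directory by hand is easy to forget. One possible way to detect changed documents automatically is to store a fingerprint of the source folder next to the index and compare it on startup. This is a sketch using only the standard library, not a LlamaIndex feature; the function names and the idea of a manifest file are assumptions for this example.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(docs_dir: str) -> dict[str, str]:
    """Map each file's relative path to the SHA-256 of its contents."""
    root = Path(docs_dir)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def index_is_stale(docs_dir: str, manifest_path: str) -> bool:
    """True if there is no manifest yet or the documents differ from it."""
    manifest = Path(manifest_path)
    if not manifest.exists():
        return True
    return json.loads(manifest.read_text()) != fingerprint(docs_dir)


def save_manifest(docs_dir: str, manifest_path: str) -> None:
    """Record the current fingerprints; call this after a successful rebuild."""
    # Keep the manifest outside docs_dir so it does not alter the fingerprint.
    Path(manifest_path).write_text(json.dumps(fingerprint(docs_dir), indent=2))
```

In the script above, the os.path.exists(INDEX_DIR) check could be combined with index_is_stale("docs", "./index_manifest.json") so that an added, edited, or removed file triggers a rebuild, followed by save_manifest once the new index has been persisted.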
Real-Life Example: Company Policy Q&A Tool
# policy_qa.py
"""Simple command-line Q&A tool for a folder of policy documents."""
import os
import sys
from llama_index.core import (
    VectorStoreIndex, SimpleDirectoryReader,
    StorageContext, load_index_from_storage, Settings
)
from llama_index.core.node_parser import SentenceSplitter
DOCS_DIR = "./policies"
INDEX_DIR = "./policy_index"
Settings.text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
def build_or_load_index():
    if os.path.exists(INDEX_DIR):
        storage = StorageContext.from_defaults(persist_dir=INDEX_DIR)
        return load_index_from_storage(storage)
    docs = SimpleDirectoryReader(DOCS_DIR, recursive=True).load_data()
    if not docs:
        sys.exit(f"No documents found in {DOCS_DIR}")
    print(f"Indexing {len(docs)} document(s)...")
    idx = VectorStoreIndex.from_documents(docs, show_progress=True)
    idx.storage_context.persist(persist_dir=INDEX_DIR)
    return idx
def main():
    index = build_or_load_index()
    engine = index.as_query_engine(similarity_top_k=5)
    print("Policy Q&A ready. Type 'quit' to exit.\n")
    while True:
        question = input("Question: ").strip()
        if question.lower() in ("quit", "exit", "q"):
            break
        if not question:
            continue
        response = engine.query(question)
        print(f"\nAnswer: {response}\n")
        for i, node in enumerate(response.source_nodes[:2], 1):
            fname = node.metadata.get("file_name", "unknown")
            print(f" Source {i}: {fname} (relevance: {node.score:.2f})")
        print()
if __name__ == "__main__":
    main()
Policy Q&A ready. Type 'quit' to exit.
Question: How many vacation days do employees get per year?
Answer: Full-time employees receive 15 days of paid vacation per year,
increasing to 20 days after 3 years of service...
Source 1: employee_handbook.txt (relevance: 0.94)
Source 2: benefits_summary.txt (relevance: 0.87)
Question: What is the remote work policy?
Answer: Employees may work remotely up to 3 days per week with manager
approval. Core hours of 10am-3pm must be observed regardless of location...
Source 1: remote_work_policy.txt (relevance: 0.96)
Source 2: employee_handbook.txt (relevance: 0.71)
This tool loads all documents from ./policies, builds a persistent index on first run, and provides an interactive query loop. The source citations after each answer build trust -- users can verify answers against the original documents. To extend this for a web application, replace the input() loop with a Flask or FastAPI endpoint and return str(response) plus the source metadata as JSON.
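For the web application case, the part that changes is how the response is packaged. The sketch below shows one possible JSON shape, using stand-in classes so it runs without LlamaIndex or a web framework; FakeNode, FakeResponse, and to_payload are hypothetical names, and the only attributes they mimic are the ones the tool above already uses (str(response), source_nodes, metadata, score).

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeNode:
    """Hypothetical stand-in for a retrieved source node."""
    metadata: dict
    score: Optional[float]


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a query response object."""
    text: str
    source_nodes: list = field(default_factory=list)

    def __str__(self) -> str:
        return self.text


def to_payload(response, max_sources: int = 2) -> dict:
    """Convert a response into a JSON-serializable dict with source citations."""
    return {
        "answer": str(response),
        "sources": [
            {
                "file_name": node.metadata.get("file_name", "unknown"),
                # A score can be missing, so avoid rounding None.
                "score": round(node.score, 2) if node.score is not None else None,
            }
            for node in response.source_nodes[:max_sources]
        ],
    }


demo = FakeResponse(
    "Employees may work remotely up to 3 days per week.",
    [FakeNode({"file_name": "remote_work_policy.txt"}, 0.96)],
)
print(json.dumps(to_payload(demo), indent=2))
```

A Flask or FastAPI handler would call engine.query(question) and return to_payload(response); keeping the conversion in its own function makes it easy to test without calling the LLM.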
Frequently Asked Questions
How much does using LlamaIndex with OpenAI cost?
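The main cost is embedding generation during indexing. At the time of writing, OpenAI's text-embedding-3-small model costs $0.02 per million tokens. A 100-page document is roughly 50,000 tokens, so indexing it with that model costs about $0.001. The examples in this article do not choose an embedding model, so they use LlamaIndex's default OpenAI embedding model, which may be priced differently; set Settings.embed_model explicitly if you want a specific model. Query costs depend on the LLM used: GPT-4o mini queries (which include the retrieved chunks) typically run $0.001-$0.01 per question. For most small to medium document collections, total monthly costs are under $5 with typical usage.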
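The indexing estimate above is a single multiplication, which is easy to redo for your own corpus. This small sketch uses the figures quoted in the answer (50,000 tokens at $0.02 per million); the function name is made up for the example, and you should substitute current pricing for whichever model you actually use.

```python
def embedding_cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Cost of embedding `tokens` tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million_usd


cost = embedding_cost_usd(tokens=50_000, price_per_million_usd=0.02)
print(f"${cost:.4f}")  # $0.0010
```

Chunk overlap means the number of tokens embedded is somewhat higher than the raw document length, so treat the result as a lower bound.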
Can I use LlamaIndex without the OpenAI API?
Yes. LlamaIndex supports Ollama, Hugging Face, Anthropic, Google Gemini, and many other providers. Install llama-index-llms-ollama and llama-index-embeddings-huggingface, then configure Settings.llm = Ollama(model="llama3") and Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5") before building the index. Local embeddings via Hugging Face have no per-token cost, but indexing a large document collection is much faster with a GPU.
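Put together, the local configuration might look like the sketch below. It assumes both integration packages are installed, that an Ollama server is running locally, and that the llama3 model has already been pulled; the function name configure_local_models and the 120-second timeout are choices made for this example.

```python
def configure_local_models() -> None:
    """Point LlamaIndex at a local Ollama LLM and a Hugging Face embedding model."""
    # Imported here so this file still loads if the optional packages are missing.
    from llama_index.core import Settings
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    from llama_index.llms.ollama import Ollama

    # Assumes `ollama pull llama3` was run and the Ollama server is up.
    Settings.llm = Ollama(model="llama3", request_timeout=120.0)
    # The model weights are downloaded on first use.
    Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")


# Usage: call configure_local_models() once, before building or loading an index.
```

An index must be queried with the same embedding model that built it, so if you switch from OpenAI embeddings to a local model, rebuild any index you previously persisted to disk.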
How do I improve answer accuracy?
The most impactful changes are: increase similarity_top_k to retrieve more context (try 5-10), reduce chunk_size to 256-384 tokens for more precise matching, use a more capable LLM for synthesis (GPT-4o instead of GPT-4o mini), and experiment with reranking (the CohereRerank or SentenceTransformerRerank postprocessors). Also ensure your documents are clean -- scanned PDFs with poor OCR produce bad embeddings regardless of the retrieval configuration.
What about very large document collections?
The in-memory VectorStoreIndex works well up to a few thousand documents. For larger collections, use a dedicated vector database: Chroma, Pinecone, Qdrant, and Weaviate all integrate with LlamaIndex via their respective llama-index-vector-stores-* packages. You still build a VectorStoreIndex; the difference is that you wrap the database in a vector store object such as ChromaVectorStore or PineconeVectorStore, pass it in through StorageContext.from_defaults(vector_store=...), and hand that storage context to VectorStoreIndex.from_documents(). The query interface remains identical.
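As an illustration with Chroma, the wiring could look like the following sketch. It assumes llama-index-vector-stores-chroma and chromadb are installed; the function name, the ./chroma_db path, and the collection name "documents" are placeholders for this example.

```python
def build_chroma_index(docs_dir: str = "docs", db_path: str = "./chroma_db"):
    """Index docs_dir into a Chroma collection stored on disk at db_path."""
    # Imported here so this file still loads if the optional packages are missing.
    import chromadb
    from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
    from llama_index.vector_stores.chroma import ChromaVectorStore

    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection("documents")  # placeholder name
    vector_store = ChromaVectorStore(chroma_collection=collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    documents = SimpleDirectoryReader(docs_dir).load_data()
    # Embeddings are written to Chroma instead of the in-memory store.
    return VectorStoreIndex.from_documents(documents, storage_context=storage_context)


# Usage: index = build_chroma_index(); engine = index.as_query_engine()
```

On later runs you can skip re-embedding by reconnecting to the same collection and calling VectorStoreIndex.from_vector_store(vector_store) instead of from_documents().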
How do I add new documents to an existing index?
Use index.insert(document) to add an individual document to an existing index, or index.refresh_ref_docs(documents) to sync a list of documents: it inserts new ones and re-indexes those whose content changed, matching them by document ID. After either call, run index.storage_context.persist(persist_dir=INDEX_DIR) to save the updated index back to the directory it was loaded from. Document IDs are random by default, so refreshing only works if the IDs are stable across runs. Passing filename_as_id=True to SimpleDirectoryReader derives each ID from the file path, which also means that renaming a file makes it look like a new document.
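A refresh step along those lines might be sketched as follows. It assumes the index was originally built from documents loaded with filename_as_id=True, so that document IDs match between runs; the function name refresh_index is hypothetical.

```python
def refresh_index(index, docs_dir: str, persist_dir: str) -> int:
    """Re-sync index with docs_dir; return how many documents were inserted or updated."""
    # Imported here so this file still loads if llama-index is not installed.
    from llama_index.core import SimpleDirectoryReader

    # filename_as_id=True derives stable document IDs from file paths.
    documents = SimpleDirectoryReader(docs_dir, filename_as_id=True).load_data()
    changed = index.refresh_ref_docs(documents)  # one bool per document
    index.storage_context.persist(persist_dir=persist_dir)
    return sum(changed)


# Usage: n = refresh_index(index, "docs", "./index_storage")
```

Only new or changed documents are re-embedded, so this is usually cheaper than deleting the index directory and rebuilding. It does not remove documents whose files were deleted from the folder; those need to be deleted from the index separately.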
Conclusion
LlamaIndex makes it practical to build document Q&A systems without needing to understand the full stack of embedding models, vector databases, and prompt engineering. You learned how to load documents with SimpleDirectoryReader, build a VectorStoreIndex, query it with configurable retrieval depth, access source node citations, persist the index to disk for fast reloads, and assemble a practical policy Q&A command-line tool.
The next natural extensions are connecting to a real vector database for large-scale document collections, adding a web API wrapper with FastAPI, and experimenting with different chunking strategies and retrieval parameters to tune answer quality. For production systems, always display source citations alongside answers -- this makes the system auditable and helps users trust the output.
Official documentation: docs.llamaindex.ai.