
Traditional databases store data as rows and columns, which makes them excellent for exact lookups: find me the user with ID 42, or all orders placed last Tuesday. But they struggle with questions like “find documents similar to this one” or “which of these product descriptions best matches what the user is looking for?” Answering those questions is the job of vector databases, which store data as numerical embeddings and retrieve results by geometric similarity rather than exact match. Chromadb is one of the easiest ways to get started with vector databases in Python: it is open-source, runs embedded inside your process, and needs no external server or other infrastructure.

Chromadb is an AI-native embedding database designed for developers building LLM-powered applications, semantic search tools, and recommendation systems. You store your text (or any data that can be embedded) along with optional metadata, and chromadb handles the vector math for finding the most semantically similar items when you query it. It ships with a default embedding model (the sentence-transformers model all-MiniLM-L6-v2), so you do not even need a separate embedding step: chromadb can generate embeddings from raw text automatically. For production use, it also supports a client-server mode and can back LlamaIndex, LangChain, and similar RAG frameworks.

This article covers: installing chromadb, creating a persistent client and collections, adding documents with automatic embedding, querying by semantic similarity, filtering results with metadata, using custom embedding functions, managing and updating stored documents, and a practical semantic search project you can adapt for your own data. Everything runs locally with no API keys required for the built-in embedding models.

chromadb Similarity Search: Quick Example

This example creates an in-memory chromadb collection, adds four sentences, and queries for the one most similar to a new sentence — all in under 20 lines.

# quick_chroma.py
import chromadb

# Create an in-memory client (data is lost on exit)
client = chromadb.Client()

# Create a collection (chromadb auto-generates embeddings using sentence-transformers)
collection = client.create_collection("quick_test")

# Add some documents
collection.add(
    documents=["Python is a high-level programming language.",
               "Cats are popular household pets.",
               "Machine learning enables pattern recognition.",
               "Dogs need daily exercise and walks."],
    ids=["doc1", "doc2", "doc3", "doc4"]
)

# Query for the most similar document
results = collection.query(query_texts=["AI and deep learning"], n_results=2)
print("Top matches:")
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"  [{dist:.3f}] {doc}")
Top matches:
  [0.312] Machine learning enables pattern recognition.
  [0.891] Python is a high-level programming language.

The query “AI and deep learning” correctly identifies the machine learning sentence as the closest match (lowest distance = most similar), followed by the Python sentence. Chromadb automatically generated embeddings for both the stored documents and the query using its default sentence-transformer model, with no embedding code required. The sections below show persistent storage, metadata filtering, and production-ready patterns.
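The `[0]` in `results["documents"][0]` is there because query() returns one inner list per query text. As a minimal sketch of that shape, the helper below (the name `iter_hits` is hypothetical, not part of chromadb) flattens a result into one record per hit. It runs against a hand-written dictionary that mimics the output above, so it does not need chromadb installed.

```python
# Hypothetical helper: flatten a chromadb-style query result into one record per hit.
from typing import Any, Dict, Iterator


def iter_hits(results: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per hit, tagged with the index of the query that produced it."""
    for query_index, ids in enumerate(results["ids"]):
        for rank, doc_id in enumerate(ids):
            yield {
                "query_index": query_index,
                "rank": rank,
                "id": doc_id,
                "document": results["documents"][query_index][rank],
                "distance": results["distances"][query_index][rank],
            }


# Hand-written stand-in for the quick example's result (one query, two hits).
sample_results = {
    "ids": [["doc3", "doc1"]],
    "documents": [[
        "Machine learning enables pattern recognition.",
        "Python is a high-level programming language.",
    ]],
    "distances": [[0.312, 0.891]],
}

hits = list(iter_hits(sample_results))
for hit in hits:
    print(f"query {hit['query_index']} rank {hit['rank']}: {hit['id']} ({hit['distance']:.3f})")
```

Passing several strings in query_texts would simply produce more outer entries, and the same loop would cover them.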

What is a Vector Database and Where Does chromadb Fit?

A vector database stores data as high-dimensional numerical vectors called embeddings. An embedding is a list of numbers (typically 384 to 1536 floats) that captures the semantic meaning of text, an image, or any data type. Items with similar meaning end up with vectors that are geometrically close to each other. Finding similar items is then a nearest-neighbor search problem — no SQL, no exact keyword matching.
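To make the nearest-neighbor idea concrete, here is a minimal brute-force sketch in plain Python. The three-dimensional vectors are made up for illustration (real embedding models produce hundreds of dimensions), and the function names are hypothetical. A vector database does the same job with an index instead of comparing against every vector.

```python
import math
from typing import Dict, List, Tuple


def cosine_distance(a: List[float], b: List[float]) -> float:
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query: List[float], vectors: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k closest."""
    scored = [(name, cosine_distance(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Made-up toy "embeddings", chosen so related topics point in similar directions.
toy_vectors = {
    "machine learning": [0.9, 0.1, 0.0],
    "python language": [0.7, 0.3, 0.1],
    "cats as pets": [0.0, 0.2, 0.9],
}

top = nearest([1.0, 0.0, 0.0], toy_vectors, k=2)
for name, dist in top:
    print(f"[{dist:.3f}] {name}")
```

The brute-force scan costs one distance computation per stored vector, which is why larger collections rely on approximate indexes.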

Feature | chromadb | Pinecone | Weaviate
Hosting | Embedded (local) or server | Cloud-only | Self-hosted or cloud
Setup time | <5 minutes | Account + API key | Docker or account
Built-in embeddings | Yes (sentence-transformers) | No | Yes (modules)
Best for | Dev, prototyping, small production | Large-scale prod | Graph + vector hybrid
Cost | Free | Free tier, then paid | Free (self-hosted)

Chromadb’s superpower is its zero-friction local mode: one pip install and you have a fully functional vector database running in your Python process. For production applications with millions of documents, you would graduate to Pinecone or a hosted Weaviate — but chromadb handles hundreds of thousands of documents comfortably and supports a server mode for shared access.

[Illustration: Vector embeddings, where your text lives in n-dimensional space.]

Installing chromadb

Install chromadb with pip. The first time you create a collection with default settings, it also downloads the default embedding model (all-MiniLM-L6-v2, ~90MB) for automatic embedding generation.

# install_chromadb.sh
pip install chromadb
Successfully installed chromadb-0.5.0 ...

The default embedding function runs that model through ONNX Runtime, so the install does not need PyTorch or the sentence-transformers package; install those separately only if you want to use other sentence-transformers models. If you plan to use only external embedding functions (OpenAI, Cohere, etc.), you can install with pip install chromadb --no-deps and manually install only the dependencies you need, but for most use cases the default install is simpler.

Creating a Persistent Client

The in-memory client from the quick example loses all data when your program exits. For any real use case, create a persistent client that saves the database to disk. The interface is identical — only the client creation differs.

# persistent_chroma.py
import chromadb

# Persistent client: data saved to ./chroma_data/
client = chromadb.PersistentClient(path="./chroma_data")

# get_or_create_collection is idempotent -- safe to call every time
collection = client.get_or_create_collection(
    name="my_docs",
    metadata={"hnsw:space": "cosine"}  # use cosine similarity (default is L2)
)

print(f"Collection: {collection.name}")
print(f"Items stored: {collection.count()}")
Collection: my_docs
Items stored: 0

The hnsw:space metadata key sets the distance metric used for nearest-neighbor search. "cosine" compares the direction of two vectors and ignores their length, which generally works better for text embeddings. "l2" (squared Euclidean distance) is the default; it produces the same ranking as cosine when the model outputs unit-length vectors, but it is sensitive to vector length otherwise. Choose cosine unless you have a specific reason to use L2. The HNSW (Hierarchical Navigable Small World) index is the approximate nearest-neighbor algorithm chromadb uses for efficient similarity search.
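The difference between the two metrics can be checked with a few lines of arithmetic. The sketch below assumes that "l2" means squared Euclidean distance, which is how I understand chromadb's documentation; the helper names are hypothetical. For unit-length vectors, squared L2 equals exactly twice the cosine distance, so both metrics rank results identically. Scaling a vector changes L2 but leaves cosine untouched.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def squared_l2(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = normalize([3.0, 4.0])   # becomes [0.6, 0.8]
b = normalize([1.0, 0.0])   # already unit length

cos_d = cosine_distance(a, b)
l2_d = squared_l2(a, b)
print(f"unit vectors:  cosine={cos_d:.3f}  squared L2={l2_d:.3f}")

# Same direction as `a`, ten times longer: cosine is unchanged, L2 grows.
scaled = [x * 10 for x in a]
print(f"scaled vector: cosine={cosine_distance(scaled, b):.3f}  squared L2={squared_l2(scaled, b):.3f}")
```

If your embedding model does not normalize its output, this is the practical reason to prefer cosine for text.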

Adding Documents with Metadata

Chromadb stores three things per document: the text content (optional if you provide raw embeddings), a unique string ID, and optional metadata as a flat key-value dictionary. Metadata enables filtering queries to a subset of the collection — for example, all documents from a specific author, date range, or category.

# add_documents.py
import chromadb

client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection("support_tickets")

# Add documents with metadata
collection.add(
    documents=[
        "The login page is returning a 500 error after the latest deployment.",
        "Users cannot reset their password via the mobile app.",
        "Export to CSV is missing column headers in the downloaded file.",
        "The dashboard loads slowly when more than 100 items are displayed.",
        "Login button is not visible on small screen sizes below 375px.",
    ],
    ids=["ticket-001", "ticket-002", "ticket-003", "ticket-004", "ticket-005"],
    metadatas=[
        {"category": "bug", "priority": "critical", "component": "auth"},
        {"category": "bug", "priority": "high", "component": "auth"},
        {"category": "bug", "priority": "medium", "component": "export"},
        {"category": "performance", "priority": "medium", "component": "dashboard"},
        {"category": "bug", "priority": "low", "component": "ui"},
    ]
)

print(f"Collection now has {collection.count()} items")
Collection now has 5 items

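Each document gets a unique ID that you define. IDs must be unique within a collection: repeating an ID inside a single add() call raises an error, while adding an ID that is already stored is ignored rather than overwritten in recent versions, so use upsert() when you intend to replace a document. Metadata values must be strings, integers, floats, or booleans; nested objects are not supported. Use flat, query-friendly keys: category, date, author_id, source are typical patterns.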
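Source records are often nested, so they need flattening before they can be used as metadata. The helper below is a hypothetical sketch, not a chromadb API: it joins nested keys with a dot, drops None values, and serializes anything that is not a scalar to a JSON string. It assumes booleans are accepted alongside strings and numbers.

```python
import json
from typing import Any, Dict

SCALAR_TYPES = (str, int, float, bool)


def flatten_metadata(data: Dict[str, Any], parent_key: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dicts into a single-level dict of scalar values."""
    flat: Dict[str, Any] = {}
    for key, value in data.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_metadata(value, full_key, sep))
        elif value is None:
            continue  # skip missing values instead of storing a placeholder
        elif isinstance(value, SCALAR_TYPES):
            flat[full_key] = value
        else:
            flat[full_key] = json.dumps(value)  # lists and other types become strings
    return flat


raw = {
    "category": "bug",
    "reporter": {"name": "dee", "team": {"id": 7}},
    "tags": ["auth", "login"],
    "resolved_at": None,
}

flat = flatten_metadata(raw)
print(flat)
```

Serialized lists can no longer be filtered element by element, so anything you want to filter on is better stored as its own scalar key.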

[Illustration: Metadata filters cut candidates before the vector search, making it faster and more relevant.]

Querying with Semantic Search and Metadata Filters

The query() method accepts natural language queries and returns the most semantically similar documents. Metadata filters narrow the search to a subset of the collection before similarity search runs — this is both faster and more accurate than filtering after retrieval.

# query_chroma.py
import chromadb

client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection("support_tickets")

# Semantic search: find 2 most similar issues
results = collection.query(
    query_texts=["user can't log in"],
    n_results=2
)

print("Semantic search results:")
for doc, meta, dist in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0]
):
    print(f"  [{dist:.3f}] {meta.get('component')}: {doc}")

# Filtered search: only look at 'auth' component bugs
filtered = collection.query(
    query_texts=["sign in problem"],
    n_results=3,
    where={"$and": [{"category": "bug"}, {"component": "auth"}]}
)

print("\nFiltered results (auth bugs only):")
for doc in filtered["documents"][0]:
    print(f"  - {doc}")
Semantic search results:
  [0.234] auth: The login page is returning a 500 error after the latest deployment.
  [0.312] auth: Users cannot reset their password via the mobile app.

Filtered results (auth bugs only):
  - The login page is returning a 500 error after the latest deployment.
  - Users cannot reset their password via the mobile app.

The filtered search uses chromadb’s where clause to pre-filter to only auth bugs before running similarity search. The filter language supports $and, $or, $eq, $ne, $gt, $lt, and other comparison operators. For simple equality filters on a single field, you can use where={"category": "bug"} directly without the $and wrapper.

Combine multiple conditions with $and and $or. The where_document parameter filters by document content (substring match), while where filters by metadata. Use metadata filters aggressively: they reduce the candidate set before the vector search, which is faster and keeps irrelevant documents out of the results.
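Filters are plain dictionaries, so they are easy to build programmatically. The sketch below uses a hypothetical helper, `build_where`, that wraps conditions in $and only when there is more than one; as far as I know chromadb rejects an $and list with fewer than two entries, which is the reason for the special case. The literal filters at the end use operators named above plus $contains for where_document.

```python
from typing import Any, Dict, Optional


def build_where(**equals: Any) -> Optional[Dict[str, Any]]:
    """Build an equality filter; returns None when there is nothing to filter on."""
    conditions = [{field: value} for field, value in equals.items()]
    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]
    return {"$and": conditions}


single = build_where(category="bug")
combined = build_where(category="bug", component="auth")
print(single)
print(combined)

# Hand-written filters for cases the helper does not cover.
urgent = {"$or": [{"priority": "critical"}, {"priority": "high"}]}
mentions_login = {"$contains": "login"}  # intended for the where_document parameter
```

With the support_tickets collection from earlier, these would be passed as `collection.query(query_texts=[...], where=combined, where_document=mentions_login)`.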

Updating and Deleting Documents

Chromadb supports document updates via upsert() (insert or replace) and deletion by ID. Use upsert when you need to keep stored documents in sync with a source of truth like a database or file system.

# update_chroma.py
import chromadb

client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection("support_tickets")

# Update an existing document
collection.upsert(
    ids=["ticket-003"],
    documents=["Export to CSV is missing column headers. Reproducible on Windows only."],
    metadatas=[{"category": "bug", "priority": "high", "component": "export"}]
)

# Delete a document
collection.delete(ids=["ticket-005"])

print(f"Collection count after update/delete: {collection.count()}")

# Fetch a specific document by ID
item = collection.get(ids=["ticket-003"], include=["documents", "metadatas"])
print("Updated ticket-003:")
print(f"  {item['documents'][0]}")
print(f"  Priority: {item['metadatas'][0]['priority']}")
Collection count after update/delete: 4
Updated ticket-003:
  Export to CSV is missing column headers. Reproducible on Windows only.
  Priority: high

upsert() is safe to call unconditionally — it creates new documents if the ID does not exist and replaces them if it does. This makes it ideal for sync pipelines where you re-process your source data on a schedule. When you upsert, chromadb regenerates the embedding for the new text, so the vector in the index stays current with your document content.
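Because every upsert regenerates an embedding, a scheduled sync can save work by upserting only the documents that changed. One possible approach, sketched here under the assumption that you keep a content hash for each stored document (for example in its metadata), is to compare hashes and plan the upserts and deletes first. The function names are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(source: Dict[str, str], stored_hashes: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (ids to upsert, ids to delete) given source texts and stored hashes."""
    to_upsert = sorted(
        doc_id for doc_id, text in source.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(doc_id for doc_id in stored_hashes if doc_id not in source)
    return to_upsert, to_delete


# Source of truth: ticket-003 was edited, ticket-006 is new, ticket-005 is gone.
source = {
    "ticket-001": "The login page is returning a 500 error.",
    "ticket-003": "Export to CSV is missing column headers. Reproducible on Windows only.",
    "ticket-006": "Search box ignores quoted phrases.",
}
stored_hashes = {
    "ticket-001": content_hash("The login page is returning a 500 error."),
    "ticket-003": content_hash("Export to CSV is missing column headers."),
    "ticket-005": content_hash("Login button is not visible on small screens."),
}

to_upsert, to_delete = plan_sync(source, stored_hashes)
print("upsert:", to_upsert)
print("delete:", to_delete)
```

The two lists would then feed collection.upsert(ids=..., documents=..., metadatas=...) and collection.delete(ids=...), with the new hash written into each document's metadata.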

Real-Life Example: Semantic Support Ticket Router

# ticket_router.py
"""
Routes incoming support tickets to the most relevant existing ticket category
using semantic similarity. Returns similar past tickets + suggested category.
"""
import chromadb

TICKETS_DB = "./ticket_router_db"

SAMPLE_TICKETS = [
    ("t001", "Password reset email never arrives", "auth", "high"),
    ("t002", "Two-factor authentication code rejected", "auth", "high"),
    ("t003", "CSV export crashes with large datasets", "export", "medium"),
    ("t004", "PDF export missing images", "export", "medium"),
    ("t005", "Page loads take over 10 seconds", "performance", "high"),
    ("t006", "Search results appear after long delay", "performance", "medium"),
    ("t007", "Mobile layout broken on iPhone SE", "ui", "low"),
    ("t008", "Dropdown menu overlaps page content", "ui", "low"),
]

def setup_collection(client):
    col = client.get_or_create_collection("tickets", metadata={"hnsw:space": "cosine"})
    if col.count() == 0:
        col.add(
            ids=[t[0] for t in SAMPLE_TICKETS],
            documents=[t[1] for t in SAMPLE_TICKETS],
            metadatas=[{"component": t[2], "priority": t[3]} for t in SAMPLE_TICKETS]
        )
    return col

def route_ticket(collection, new_ticket_text):
    results = collection.query(query_texts=[new_ticket_text], n_results=3)
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    dists = results["distances"][0]
    # Infer category from top match
    top_component = metas[0]["component"]
    print(f"New ticket: {new_ticket_text}")
    print(f"Suggested category: {top_component}")
    print("Similar past tickets:")
    for doc, meta, dist in zip(docs, metas, dists):
        print(f"  [{dist:.2f}] [{meta['component']}] {doc}")
    print()

client = chromadb.PersistentClient(path=TICKETS_DB)
col = setup_collection(client)

route_ticket(col, "Users getting 'invalid token' error when logging in")
route_ticket(col, "Report generation freezes the browser tab")
route_ticket(col, "Footer links not visible on Android devices")
New ticket: Users getting 'invalid token' error when logging in
Suggested category: auth
Similar past tickets:
  [0.21] [auth] Two-factor authentication code rejected
  [0.34] [auth] Password reset email never arrives
  [0.81] [performance] Page loads take over 10 seconds

New ticket: Report generation freezes the browser tab
Suggested category: performance
Similar past tickets:
  [0.19] [performance] Page loads take over 10 seconds
  [0.28] [performance] Search results appear after long delay
  [0.55] [export] CSV export crashes with large datasets

New ticket: Footer links not visible on Android devices
Suggested category: ui
Similar past tickets:
  [0.17] [ui] Mobile layout broken on iPhone SE
  [0.29] [ui] Dropdown menu overlaps page content
  [0.74] [auth] Password reset email never arrives

This router correctly categorizes all three new tickets without any keyword rules or explicit training. It works purely by semantic similarity — “invalid token error when logging in” is correctly routed to auth, and “freezes the browser” matches performance. To extend this for production, add a real historical ticket database, route automatically to the correct support team queue, and use the top-3 similar tickets as context when suggesting a resolution to the agent handling the ticket.
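The router above trusts the single closest match. One possible refinement, sketched here as an assumption rather than part of the original design, is to let the top matches vote, weight each vote by closeness, and refuse to guess when nothing is close enough. The function name and the 0.5 cutoff are hypothetical; a real threshold would need tuning on your own data.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def suggest_component(
    metas: List[Dict[str, str]],
    dists: List[float],
    max_distance: float = 0.5,  # arbitrary cutoff chosen for illustration
) -> Optional[str]:
    """Pick the component with the highest closeness-weighted vote, or None."""
    scores: Dict[str, float] = defaultdict(float)
    for meta, dist in zip(metas, dists):
        if dist <= max_distance:
            scores[meta["component"]] += 1.0 - dist  # closer matches count more
    if not scores:
        return None  # nothing similar enough: send to manual triage
    return max(scores, key=lambda component: scores[component])


# Values copied from the first routing example's output.
metas = [{"component": "auth"}, {"component": "auth"}, {"component": "performance"}]
dists = [0.21, 0.34, 0.81]
print(suggest_component(metas, dists))

# A ticket with no close neighbours gets no suggestion.
print(suggest_component(metas, [0.72, 0.80, 0.91]))
```

Inside route_ticket, this would replace the `metas[0]["component"]` lookup with `suggest_component(metas, dists)`.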

Frequently Asked Questions

How many documents can chromadb handle?

The embedded (local) mode comfortably handles hundreds of thousands of documents on a modern laptop. Chromadb uses HNSW for approximate nearest-neighbor search, which scales well to millions of vectors, but the entire index is held in memory. A collection of 1 million 384-dimensional float32 vectors requires roughly 1.5GB of RAM for the raw vectors alone (1,000,000 × 384 × 4 bytes), before the index's own overhead. For very large collections, use the client-server mode (chromadb.HttpClient) or migrate to a managed vector database like Pinecone or Qdrant.
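The 1.5GB figure is straightforward arithmetic: vectors × dimensions × bytes per value. The sketch below reproduces it so you can plug in your own collection size and embedding dimension. It counts raw vector storage only; the HNSW graph adds memory on top, by an amount this sketch does not try to estimate.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Memory needed for the vectors themselves (float32 = 4 bytes per value)."""
    return num_vectors * dimensions * bytes_per_value


size = raw_vector_bytes(1_000_000, 384)
print(f"{size:,} bytes")           # prints 1,536,000,000 bytes
print(f"{size / 1e9:.2f} GB")      # prints 1.54 GB
print(f"{size / 2**30:.2f} GiB")   # prints 1.43 GiB

# The same collection with 1536-dimensional embeddings is four times larger.
print(f"{raw_vector_bytes(1_000_000, 1536) / 1e9:.2f} GB")
```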

Can I use my own embeddings instead of the built-in ones?

Yes. Pass an embedding_function parameter when creating a collection. OpenAI, Cohere, Hugging Face, and sentence-transformers all have ready-made embedding function adapters in chromadb.utils.embedding_functions. You can also pass pre-computed embeddings as a list of float lists directly to collection.add(embeddings=[...], ids=[...]), and skip the documents parameter entirely if you only want to store vectors.
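An embedding function is, at its core, a callable that takes a list of texts and returns one vector per text. The sketch below shows that shape with a toy word-hashing embedder that needs no model download. The class name is hypothetical and the vectors are not semantically meaningful; it only illustrates the interface. It assumes chromadb calls the function with a parameter named `input`. Recent chromadb versions may expect you to subclass their EmbeddingFunction base class and implement additional methods, so treat this as untested against any specific release.

```python
import hashlib
import math
from typing import List


class HashingEmbeddingFunction:
    """Toy embedder: count words into hashed buckets, then scale to unit length."""

    def __init__(self, dimensions: int = 16) -> None:
        self.dimensions = dimensions

    def __call__(self, input: List[str]) -> List[List[float]]:
        return [self._embed(text) for text in input]

    def _embed(self, text: str) -> List[float]:
        vector = [0.0] * self.dimensions
        for word in text.lower().split():
            digest = hashlib.md5(word.encode("utf-8")).digest()
            vector[digest[0] % self.dimensions] += 1.0
        norm = math.sqrt(sum(x * x for x in vector)) or 1.0  # avoid dividing by zero
        return [x / norm for x in vector]


embed = HashingEmbeddingFunction(dimensions=16)
vectors = embed(["login page error", "login page error", "export to csv"])
print(len(vectors), "vectors of length", len(vectors[0]))
```

The resulting lists are in the format described above for pre-computed embeddings, so they could be passed to collection.add(embeddings=vectors, ids=[...]) with one unique ID per vector.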

Is chromadb data durable?

With PersistentClient, yes: data is written to the specified directory using SQLite for metadata and HNSW’s native serialization for vectors. The data directory is self-contained and portable. To back up your chromadb database, copy the entire directory while no process is writing to it. Do not rely on crash recovery alone: a process that dies in the middle of a write can still cost you recent changes, so keep backups of the data directory for production use.

How do I handle multiple users or datasets in one chromadb instance?

Use separate collections per user or dataset. Collections are lightweight — you can create thousands of them. Alternatively, use metadata to tag each document with an owner_id and filter all queries with where={"owner_id": current_user_id}. The collection-per-user approach is simpler and prevents cross-tenant data leakage at the query level. The metadata filter approach uses less disk space when users share a lot of common documents.

Does chromadb integrate with LangChain and LlamaIndex?

Yes, both frameworks have native chromadb support. In LangChain, use Chroma.from_documents(docs, embedding) to create a LangChain vector store backed by chromadb. In LlamaIndex, use ChromaVectorStore from llama_index.vector_stores.chroma. Both frameworks handle the embedding and storage mechanics, letting you use chromadb as the persistence layer while working with the higher-level RAG abstractions those frameworks provide.

Conclusion

Chromadb removes the infrastructure barrier to building vector-search-powered applications. You learned how to create persistent and in-memory clients, add documents with automatic embedding generation, run semantic similarity queries, apply metadata filters to narrow results, update and delete documents with upsert, and build a practical ticket routing system. The zero-configuration local mode makes chromadb ideal for prototyping and small-to-medium production workloads, while the server mode and framework integrations with LlamaIndex and LangChain make it a viable foundation for larger systems.

Extend the ticket router by connecting it to a real issue tracker, adding a feedback loop to improve routing accuracy over time, or using it as the retrieval layer in a full RAG pipeline with an LLM for resolution suggestions. Official documentation: docs.trychroma.com.