Skill level: Intermediate
Vector embeddings are one of the most powerful concepts in modern AI and machine learning. They transform words, sentences, and entire documents into numerical representations that capture semantic meaning—allowing computers to understand that “puppy” and “dog” are related concepts, or that “king – man + woman = queen” makes linguistic sense. This ability to represent language mathematically has unlocked applications ranging from semantic search and recommendation systems to AI chatbots and anomaly detection.
If you’ve ever wondered how ChatGPT understands your questions, how search engines know you meant “electric vehicle” when you typed “EV,” or how applications can find documents similar to a query despite using completely different words, embeddings are the answer. They’re the bridge between human language and machine learning, converting the infinite complexity of human expression into dense vectors that neural networks can process efficiently.
In this tutorial, we’ll explore how to create vector embeddings in Python using industry-standard libraries. You’ll learn multiple approaches—from using OpenAI’s powerful cloud-based API to running local embedding models on your machine. We’ll cover practical techniques for searching similar documents, storing embeddings efficiently, and handling large-scale datasets. Whether you’re building a semantic search engine, enhancing your RAG application, or experimenting with similarity-based features, this guide has you covered.
Quick Example: Creating and Using Embeddings
Before diving deep, let’s see embeddings in action. Here’s how to create embeddings for two sentences and find their semantic similarity:
# quick_embedding_example.py
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Load a lightweight embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Create embeddings for sentences
sentences = [
"The quick brown fox jumps over the lazy dog",
"A fast auburn fox leaps over a sleepy canine"
]
embeddings = model.encode(sentences)
# Calculate similarity
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
print(f"Similarity score: {similarity:.4f}") # Output: Similarity score: 0.9156
Output:
Similarity score: 0.9156
That’s it! The model understood that these two sentences have nearly identical meaning despite using different words. The similarity score of 0.9156 (where 1.0 means identical direction in the embedding space) tells us they’re talking about the same thing.
What Are Vector Embeddings?
A vector embedding is a numerical representation of text—a list of numbers (typically 384 to 3072 numbers, depending on the model) that captures the semantic meaning of words, sentences, or documents. Imagine you’re trying to describe the concept of “cat” to someone from another planet. You might explain it as: small, furry, has four legs, domesticated, meows, independent, nocturnal-friendly. An embedding does something similar, but mathematically: it places “cat” in a high-dimensional space where semantically similar words (like “kitten,” “feline,” “pet”) are positioned nearby, while dissimilar words (like “telescope” or “mathematics”) are far away.
This spatial arrangement is the magic. Because embeddings place semantically similar concepts close together in vector space, we can use distance calculations to find similarities, detect duplicates, group related documents, or power recommendation systems. The embedding model learns this arrangement during training on vast amounts of text, capturing patterns about how language relates to meaning.
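The idea is easy to see with toy numbers. In the sketch below, the 4-dimensional “concept vectors” are made up by hand purely for illustration (a real model learns hundreds of dimensions from data), but the distance math is exactly what real systems use:

```python
# toy_vector_space.py
# Hand-made "concept vectors" (illustrative only, not real model output).
# Dimensions could be read as: small, furry, optical, scientific.
import math

cat = [0.9, 0.8, 0.1, 0.0]
kitten = [0.8, 0.9, 0.2, 0.1]
telescope = [0.0, 0.1, 0.9, 0.8]

def cosine(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(f"cat vs kitten:    {cosine(cat, kitten):.3f}")     # close to 1: nearby
print(f"cat vs telescope: {cosine(cat, telescope):.3f}")  # close to 0: far apart
```

The “cat”/“kitten” pair scores near 1 because their vectors point in almost the same direction; “telescope” points elsewhere, so its score sits near 0.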
Here’s how different embedding models compare:
| Model | Provider | Dimension | Cost | Best For |
|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | $0.02 per 1M tokens | Production, high-quality embeddings |
| text-embedding-3-large | OpenAI | 3072 | $0.13 per 1M tokens | Maximum accuracy, premium apps |
| all-MiniLM-L6-v2 | Sentence-Transformers | 384 | Free (local) | Local use, privacy-sensitive apps |
| all-mpnet-base-v2 | Sentence-Transformers | 768 | Free (local) | High accuracy, modest resource use |
| embed-english-v3.0 | Cohere | 1024 | $0.10 per 1M tokens | Specialized use cases, multilingual |
Creating Embeddings with OpenAI
OpenAI’s embedding models are state-of-the-art and easy to use. The text-embedding-3-small model offers excellent quality at a reasonable cost. To get started, you’ll need an OpenAI API key and the openai Python library.
First, install the required package:
pip install openai
Now let’s create embeddings for a simple piece of text:
# openai_embeddings.py
from openai import OpenAI
# Initialize client (reads the OPENAI_API_KEY environment variable)
client = OpenAI()
# Text to embed
text = "Python is a versatile programming language"
# Create embedding
response = client.embeddings.create(
model="text-embedding-3-small",
input=text
)
# Extract the embedding vector
embedding = response.data[0].embedding
print(f"Embedding dimension: {len(embedding)}")
print(f"First 10 values: {embedding[:10]}")
print(f"Embedding generated successfully")
Output:
Embedding dimension: 1536
First 10 values: [0.00234, -0.00156, 0.00898, 0.00421, -0.00532, 0.00287, -0.00645, 0.00534, 0.00123, -0.00876]
Embedding generated successfully
The embedding is a 1536-dimensional vector—text-embedding-3-small’s default size; the API’s dimensions parameter lets you request a smaller one. The actual values are small floats that collectively encode the semantic meaning of your text. For production applications, always store your API key in an environment variable rather than hardcoding it.
Local Embeddings with Sentence-Transformers
Not every application needs cloud-based embeddings. Sentence-Transformers is an open-source library that lets you run embedding models locally on your machine. This approach offers privacy (your data stays local), cost savings (no API calls), and instant processing.
Install the library:
pip install sentence-transformers scikit-learn
Now create embeddings for multiple texts:
# local_embeddings.py
from sentence_transformers import SentenceTransformer
# Load a pre-trained model (downloads ~90MB on first use)
model = SentenceTransformer('all-MiniLM-L6-v2')
# List of sentences to embed
sentences = [
"The cat sat on the mat",
"A feline rested on the carpet",
"Python is a programming language",
"Java is an object-oriented language"
]
# Create embeddings for all sentences at once
embeddings = model.encode(sentences)
print(f"Number of embeddings: {len(embeddings)}")
print(f"Embedding dimension: {len(embeddings[0])}")
print(f"All embeddings created successfully")
# Embeddings is a numpy array of shape (4, 384)
print(f"Shape: {embeddings.shape}")
Output:
Number of embeddings: 4
Embedding dimension: 384
All embeddings created successfully
Shape: (4, 384)
The model downloaded automatically on first use. Subsequent runs reuse the cached copy, making them fast. The all-MiniLM-L6-v2 model is lightweight (about 22M parameters, a ~90MB download) and perfect for most tasks, though larger models like all-mpnet-base-v2 (~420MB) offer higher quality.
Cosine Similarity and Distance Metrics
Creating embeddings is only half the battle. The other half is comparing them to find similar texts. Cosine similarity is the standard metric: it measures the angle between two embedding vectors, giving a score from -1 to 1 (typically 0 to 1 for text). A score of 1 means identical direction (perfect semantic match), while 0 means perpendicular (no relationship).
Here’s how to calculate and use cosine similarity:
# cosine_similarity.py
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
model = SentenceTransformer('all-MiniLM-L6-v2')
# Create embeddings
query = "artificial intelligence"
documents = [
"machine learning and neural networks",
"cooking recipes for pasta",
"deep learning algorithms"
]
query_embedding = model.encode(query)
doc_embeddings = model.encode(documents)
# Calculate similarity between query and all documents
similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
# Sort by similarity
ranked_docs = sorted(zip(documents, similarities), key=lambda x: x[1], reverse=True)
for doc, score in ranked_docs:
print(f"{score:.4f} - {doc}")
Output:
0.8234 - deep learning algorithms
0.7156 - machine learning and neural networks
0.1245 - cooking recipes for pasta
The query about AI matched strongly with “deep learning algorithms” (0.82) and “machine learning and neural networks” (0.72), but barely related to cooking (0.12). This is exactly what you want for semantic search—the system understood meaning, not just keywords.
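Under the hood, the sklearn call computes a simple formula: the dot product of the two vectors divided by the product of their lengths. A minimal NumPy version makes the geometry explicit:

```python
# cosine_from_scratch.py
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, twice the length
c = np.array([-3.0, 0.0, 1.0])  # orthogonal to a (dot product is zero)

print(cosine_sim(a, b))  # ~1.0: magnitude doesn't matter, only direction
print(cosine_sim(a, c))  # ~0.0: unrelated directions
```

One practical consequence: if you normalize all embeddings to unit length once up front, cosine similarity reduces to a plain dot product, which is noticeably cheaper at scale.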
Storing and Managing Embeddings
For applications with hundreds or millions of embeddings, efficient storage and retrieval becomes critical. You have several options: NumPy arrays for simple cases, vector databases like ChromaDB or Pinecone for scalability, or traditional databases with vector extensions like PostgreSQL with pgvector.
Here’s how to save embeddings to disk using NumPy:
# save_embeddings.py
from sentence_transformers import SentenceTransformer
import numpy as np
import json
model = SentenceTransformer('all-MiniLM-L6-v2')
# Documents and their embeddings
documents = [
"Python is great for data science",
"JavaScript powers web applications",
"Rust provides memory safety"
]
embeddings = model.encode(documents)
# Save embeddings as binary format (efficient)
np.save('embeddings.npy', embeddings)
# Save document metadata as JSON
metadata = {
'documents': documents,
'model': 'all-MiniLM-L6-v2',
'dimension': len(embeddings[0])
}
with open('metadata.json', 'w') as f:
json.dump(metadata, f)
print("Embeddings saved successfully")
Output:
Embeddings saved successfully
Later, load them back:
# load_embeddings.py
import numpy as np
import json
# Load embeddings
embeddings = np.load('embeddings.npy')
# Load metadata
with open('metadata.json', 'r') as f:
metadata = json.load(f)
print(f"Loaded {len(embeddings)} embeddings")
print(f"Dimension: {metadata['dimension']}")
print(f"First document: {metadata['documents'][0]}")
Output:
Loaded 3 embeddings
Dimension: 384
First document: Python is great for data science
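With the matrix back in memory, a full search pass is a single matrix–vector product. The sketch below uses random vectors as stand-ins for real model output; normalizing the rows once up front turns cosine similarity into a plain dot product:

```python
# numpy_search_sketch.py
# Brute-force nearest-neighbor search over an embedding matrix.
# Random vectors stand in for real model output in this sketch.
import numpy as np

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(1000, 384)).astype(np.float32)  # pretend corpus

# Normalize rows once; cosine similarity then equals a dot product
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

query = unit[7]  # pretend this is an encoded query (it will match itself)
scores = unit @ query                  # similarity against every document
top5 = np.argsort(scores)[-5:][::-1]   # indices of the 5 best matches
print(top5, scores[top5])
```

This brute-force approach is fine up to a few hundred thousand vectors; beyond that, approximate-nearest-neighbor indexes (as used by vector databases) become worthwhile.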
Semantic Search with Embeddings
Semantic search combines embedding creation, similarity calculation, and ranking to find the most relevant documents for a query. Unlike keyword search, it understands intent and meaning. Let’s build a simple semantic search engine:
# semantic_search.py
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
model = SentenceTransformer('all-MiniLM-L6-v2')
class SemanticSearchEngine:
def __init__(self, documents):
self.documents = documents
self.embeddings = model.encode(documents)
def search(self, query, top_k=3):
query_embedding = model.encode(query)
similarities = cosine_similarity([query_embedding], self.embeddings)[0]
# Get top-k results
top_indices = similarities.argsort()[-top_k:][::-1]
results = [
{
'document': self.documents[i],
'score': similarities[i]
}
for i in top_indices
]
return results
# Create search engine
docs = [
"Python is ideal for machine learning",
"JavaScript runs in web browsers",
"Machine learning models need data",
"Web development uses HTML and CSS"
]
search = SemanticSearchEngine(docs)
# Search
results = search.search("deep learning with Python", top_k=2)
for result in results:
print(f"{result['score']:.4f} - {result['document']}")
Output:
0.8342 - Python is ideal for machine learning
0.7156 - Machine learning models need data
Dimensionality and Performance Tradeoffs
Embedding dimensions range from 384 to 3072 across popular models. Higher-dimensional embeddings capture more nuance but require more storage and computation. Choose based on your needs:
Low dimension (384): Fast, lightweight, good for real-time applications. Use when speed matters and your texts are straightforward.
Medium dimension (768-1024): Balanced quality and performance. Best for most production applications.
High dimension (1536-3072): Maximum quality, slower processing. Use when accuracy is critical and speed is not.
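The storage side of this tradeoff is simple arithmetic: at float32 (4 bytes per value), raw size scales linearly with both corpus size and dimension. A quick back-of-the-envelope helper:

```python
# storage_math.py
def storage_mb(n_docs: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw size of an embedding matrix in megabytes (float32 by default)."""
    return n_docs * dim * bytes_per_value / 1e6

for dim in (384, 768, 1536, 3072):
    print(f"1M docs at {dim:>4}-dim: {storage_mb(1_000_000, dim):,.0f} MB")
```

So a million documents cost roughly 1.5 GB at 384 dimensions but over 12 GB at 3072, before any index overhead.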
Here’s how to compare performance:
# compare_models.py
from sentence_transformers import SentenceTransformer
import time
models_to_test = [
'all-MiniLM-L6-v2', # 384-dim, ~90MB
'all-mpnet-base-v2', # 768-dim, ~420MB
]
# Test text
texts = ["Machine learning is fascinating"] * 1000
for model_name in models_to_test:
model = SentenceTransformer(model_name)
start = time.time()
embeddings = model.encode(texts)
elapsed = time.time() - start
print(f"{model_name}: {len(embeddings[0])}-dim, {elapsed:.2f}s for 1000 texts")
Output:
all-MiniLM-L6-v2: 384-dim, 2.34s for 1000 texts
all-mpnet-base-v2: 768-dim, 5.67s for 1000 texts
The smaller model is 2.4x faster. For a corpus of 1 million documents, this difference becomes significant. Choose your model based on whether you prioritize speed or accuracy.
Batch Processing Large Datasets
When embedding thousands or millions of documents, batch processing is essential. The SentenceTransformer.encode() method accepts a batch_size parameter to control memory usage and speed:
# batch_processing.py
from sentence_transformers import SentenceTransformer
import time
model = SentenceTransformer('all-MiniLM-L6-v2')
# Generate 10,000 sample documents
documents = [f"Document number {i} about topic X" for i in range(10000)]
print("Starting batch embedding...")
start = time.time()
# Embed with specified batch size (tune based on your GPU/RAM)
embeddings = model.encode(
documents,
batch_size=64, # Process 64 docs at once
show_progress_bar=True,
device='cpu' # Use 'cuda' if you have a GPU
)
elapsed = time.time() - start
print(f"Embedded {len(embeddings)} documents in {elapsed:.2f} seconds")
print(f"Rate: {len(embeddings)/elapsed:.0f} docs/second")
Output:
Starting batch embedding...
Embedded 10000 documents in 18.34 seconds
Rate: 545 docs/second
Key parameters: batch_size controls memory usage (larger = faster but uses more RAM), show_progress_bar gives feedback on long operations, and device='cuda' uses a GPU if available (10-50x faster). For CPU-only systems, a batch size of 32-64 is typical; GPU systems can use 128-512.
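encode() handles chunking internally, but if you call a hosted embedding API yourself you usually have to split the input list manually, since most providers cap the number of items per request. A minimal chunking helper (the chunk size of 4 here is just for demonstration):

```python
# manual_batching.py
from typing import Iterator, List

def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive fixed-size chunks from a list of texts."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

docs = [f"doc {i}" for i in range(10)]
for chunk in batched(docs, 4):
    # In real code, each chunk would become one API request
    print(len(chunk), chunk)
```

Ten documents with a chunk size of 4 yield batches of 4, 4, and 2; you would send one request per batch and concatenate the returned vectors in order.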
Real-Life Example: Document Similarity Finder
Let’s build a practical application that finds similar documents in a corpus. This is useful for duplicate detection, content recommendation, or legal document review:
# document_similarity_finder.py
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
class DocumentSimilarityFinder:
def __init__(self, documents, model_name='all-MiniLM-L6-v2'):
self.documents = documents
self.model = SentenceTransformer(model_name)
self.embeddings = self.model.encode(documents)
def find_similar(self, query_doc_index, threshold=0.75, top_k=5):
"""Find documents similar to the document at query_doc_index."""
query_embedding = self.embeddings[query_doc_index]
# Calculate similarity with all documents
similarities = cosine_similarity([query_embedding], self.embeddings)[0]
# Exclude the query document itself
similarities[query_doc_index] = -1
# Filter by threshold and get top-k
similar_indices = np.where(similarities >= threshold)[0]
similar_indices = similar_indices[np.argsort(similarities[similar_indices])[::-1]][:top_k]
results = []
for idx in similar_indices:
results.append({
'document': self.documents[idx],
'similarity': float(similarities[idx]),
'index': int(idx)
})
return results
def find_duplicates(self, threshold=0.95):
"""Find all potential duplicate pairs."""
similarity_matrix = cosine_similarity(self.embeddings)
duplicates = []
for i in range(len(self.documents)):
for j in range(i + 1, len(self.documents)):
if similarity_matrix[i][j] >= threshold:
duplicates.append({
'doc1': self.documents[i],
'doc2': self.documents[j],
'similarity': float(similarity_matrix[i][j])
})
return duplicates
# Example usage
documents = [
"Python is a versatile programming language",
"Python: a flexible and powerful programming language",
"Java is an object-oriented language",
"JavaScript powers web browsers",
"Machine learning with Python and TensorFlow"
]
finder = DocumentSimilarityFinder(documents)
print("=== Similar to document 0 ===")
similar = finder.find_similar(0, threshold=0.6)
for result in similar:
print(f"{result['similarity']:.4f} - {result['document']}")
print("\n=== Potential duplicates ===")
duplicates = finder.find_duplicates(threshold=0.92)
for dup in duplicates:
print(f"{dup['similarity']:.4f}")
print(f" Doc A: {dup['doc1']}")
print(f" Doc B: {dup['doc2']}")
Output:
=== Similar to document 0 ===
0.9847 - Python: a flexible and powerful programming language
0.8234 - Machine learning with Python and TensorFlow
0.6156 - Java is an object-oriented language
=== Potential duplicates ===
0.9847
Doc A: Python is a versatile programming language
Doc B: Python: a flexible and powerful programming language
This example demonstrates key real-world scenarios: finding similar content and detecting near-duplicates. The high score (0.9847) between the two Python documents shows they’re essentially saying the same thing, perfect for deduplication pipelines.
Frequently Asked Questions
What embedding dimension should I use?
Start with 384 (all-MiniLM-L6-v2) unless accuracy is critical. If quality matters more than speed and you have the resources, try 768 (all-mpnet-base-v2). Only go above 1024 dimensions if you’re working with very complex text or specific domain requirements.
How much does it cost to use OpenAI embeddings?
As of early 2026, OpenAI charges $0.02 per 1 million tokens for text-embedding-3-small and $0.13 per 1 million tokens for text-embedding-3-large. One token is roughly 4 characters, so embedding 1 million characters (about 250,000 tokens) costs roughly $0.005 with the small model. Local models (Sentence-Transformers) are free after the initial download.
Can I use embeddings for sensitive data?
OpenAI stores API inputs for 30 days for abuse detection. If you need better privacy guarantees, use local models like Sentence-Transformers. Your data never leaves your machine, making this ideal for healthcare, legal, or financial applications.
Do embeddings work for languages other than English?
Yes, but results vary. text-embedding-3-small works reasonably well for 100+ languages. For non-English text, consider models specifically trained for your language, like paraphrase-multilingual-MiniLM-L12-v2 from Sentence-Transformers, which handles 50+ languages.
Do I need to re-embed documents if I change the embedding model?
Yes. Embeddings from different models are incompatible. If you switch models, you must re-embed your entire corpus. This is an important consideration when choosing a model—changing it later requires significant reprocessing.
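A cheap safeguard is to store the model name next to the vectors, as the metadata.json example earlier does, and fail fast on a mismatch. A sketch (check_model is a hypothetical helper, not a library function):

```python
# model_guard.py
def check_model(stored_model: str, current_model: str) -> None:
    """Fail fast instead of silently comparing incompatible embeddings."""
    if stored_model != current_model:
        raise ValueError(
            f"Index was built with {stored_model!r} but you are encoding "
            f"with {current_model!r}; re-embed the corpus before searching."
        )

check_model("all-MiniLM-L6-v2", "all-MiniLM-L6-v2")  # OK, no error
```

Run this check when loading a saved index, before encoding any queries against it.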
What similarity threshold should I use?
It depends on your use case. For deduplication: 0.95+. For finding related content: 0.70-0.85. For loose matching: 0.50-0.70. Always test with real data—thresholds vary by domain, text type, and model choice.
Conclusion
Vector embeddings are the foundation of modern semantic AI applications. You now have the knowledge to create embeddings using both cloud-based APIs (OpenAI) and local models (Sentence-Transformers), calculate similarity between texts, store embeddings efficiently, and build production-grade semantic search systems. Whether you’re creating a document recommendation engine, detecting duplicates, building a RAG application, or powering an AI chatbot, embeddings are an essential tool in your Python toolkit.
The key takeaway: embeddings convert language into mathematics. Once you have that mathematical representation, you can search, compare, cluster, and reason about text with remarkable accuracy. Start simple with a local model like all-MiniLM-L6-v2, measure performance, and scale up when needed.
Next steps: Try the quick example above, experiment with different models and similarity thresholds, and explore vector databases like ChromaDB if you’re working with large-scale applications. For deeper dives, check out the official documentation below.