Skill level: Intermediate

Vector embeddings are one of the most powerful concepts in modern AI and machine learning. They transform words, sentences, and entire documents into numerical representations that capture semantic meaning—allowing computers to understand that “puppy” and “dog” are related concepts, or that “king – man + woman = queen” makes linguistic sense. This ability to represent language mathematically has unlocked applications ranging from semantic search and recommendation systems to AI chatbots and anomaly detection.

If you’ve ever wondered how ChatGPT understands your questions, how search engines know you meant “electric vehicle” when you typed “EV,” or how applications can find documents similar to a query despite using completely different words, embeddings are the answer. They’re the bridge between human language and machine learning, converting the infinite complexity of human expression into dense vectors that neural networks can process efficiently.

In this tutorial, we’ll explore how to create vector embeddings in Python using industry-standard libraries. You’ll learn multiple approaches—from using OpenAI’s powerful cloud-based API to running local embedding models on your machine. We’ll cover practical techniques for searching similar documents, storing embeddings efficiently, and handling large-scale datasets. Whether you’re building a semantic search engine, enhancing your RAG application, or experimenting with similarity-based features, this guide has you covered.

Quick Example: Creating and Using Embeddings

Before diving deep, let’s see embeddings in action. Here’s how to create embeddings for two sentences and find their semantic similarity:

# quick_embedding_example.py
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load a lightweight embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Create embeddings for sentences
sentences = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn fox leaps over a sleepy canine"
]

embeddings = model.encode(sentences)

# Calculate similarity
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
print(f"Similarity score: {similarity:.4f}")  # Output: Similarity score: 0.9156

Output:

Similarity score: 0.9156

That’s it! The model understood that these two sentences have nearly identical meaning despite using different words. The similarity score of 0.9156 tells us they’re talking about the same thing (cosine similarity ranges from -1 to 1, though scores for natural text from these models usually fall between 0 and 1).

What Are Vector Embeddings?

A vector embedding is a numerical representation of text—a list of numbers (typically 384 to 1536 numbers, depending on the model) that captures the semantic meaning of words, sentences, or documents. Imagine you’re trying to describe the concept of “cat” to someone from another planet. You might explain it as: small, furry, has four legs, domesticated, meows, independent, nocturnal-friendly. An embedding does something similar, but mathematically: it places “cat” in a high-dimensional space where semantically similar words (like “kitten,” “feline,” “pet”) are positioned nearby, while dissimilar words (like “telescope” or “mathematics”) are far away.

This spatial arrangement is the magic. Because embeddings place semantically similar concepts close together in vector space, we can use distance calculations to find similarities, detect duplicates, group related documents, or power recommendation systems. The embedding model learns this arrangement during training on vast amounts of text, capturing patterns about how language relates to meaning.
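To make the geometry concrete, here’s a toy sketch with hand-made 3-dimensional vectors. The numbers are invented purely for illustration (real embeddings have hundreds of learned dimensions), but the principle is the same: similar concepts end up close together, so distance measures similarity.

```python
# toy_vector_space.py
import math

# Invented 3-D "embeddings" (say: furriness, domesticity, abstractness)
vectors = {
    "cat":       [0.90, 0.80, 0.10],
    "kitten":    [0.95, 0.75, 0.10],
    "telescope": [0.00, 0.20, 0.70],
}

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# "kitten" sits much closer to "cat" than "telescope" does
print(euclidean(vectors["cat"], vectors["kitten"]))     # small
print(euclidean(vectors["cat"], vectors["telescope"]))  # large
```

A real embedding model does exactly this, except the coordinates are learned from text rather than written by hand.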

Here’s how different embedding models compare:

| Model | Provider | Dimension | Cost | Best For |
|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | $0.02 per 1M tokens | Production, high-quality embeddings |
| text-embedding-3-large | OpenAI | 3072 | $0.13 per 1M tokens | Maximum accuracy, premium apps |
| all-MiniLM-L6-v2 | Sentence-Transformers | 384 | Free (local) | Local use, privacy-sensitive apps |
| all-mpnet-base-v2 | Sentence-Transformers | 768 | Free (local) | High accuracy, modest resource use |
| embed-english-v3.0 | Cohere | 1024 | $0.10 per 1M tokens | Specialized use cases, multilingual |
[Image: glowing orbs clustered by similarity in dark cosmic space, representing vector embeddings]
1,536 dimensions of pure meaning. Your brain does this in milliseconds — your GPU needs a few more.

Creating Embeddings with OpenAI

OpenAI’s embedding models are state-of-the-art and easy to use. The text-embedding-3-small model offers excellent quality at a reasonable cost. To get started, you’ll need an OpenAI API key and the openai Python library.

First, install the required package:

pip install openai

Now let’s create embeddings for a simple piece of text:

# openai_embeddings.py
from openai import OpenAI

# Initialize client (reads the OPENAI_API_KEY environment variable)
client = OpenAI()

# Text to embed
text = "Python is a versatile programming language"

# Create embedding
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=text
)

# Extract the embedding vector
embedding = response.data[0].embedding
print(f"Embedding dimension: {len(embedding)}")
print(f"First 10 values: {embedding[:10]}")
print(f"Embedding generated successfully")

Output:

Embedding dimension: 1536
First 10 values: [0.00234, -0.00156, 0.00898, 0.00421, -0.00532, 0.00287, -0.00645, 0.00534, 0.00123, -0.00876]
Embedding generated successfully

The embedding is a 1536-dimensional vector by default (text-embedding-3-small also accepts a dimensions parameter if you want shorter vectors). The actual values are small floats that collectively encode the semantic meaning of your text. For production applications, always store your API key in an environment variable rather than hardcoding it.
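One way to wire that up is a small helper that fails loudly when the key is missing. OPENAI_API_KEY is the variable the openai client library checks by default, so in most cases you can simply instantiate the client with no arguments:

```python
# api_key_from_env.py
import os

def get_api_key(var_name="OPENAI_API_KEY"):
    """Fetch an API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key

# The openai client reads OPENAI_API_KEY on its own, so this also works:
# client = OpenAI()  # no key anywhere in your source code
```

The explicit helper is mainly useful for producing a clear error at startup instead of a confusing authentication failure on the first API call.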

Local Embeddings with Sentence-Transformers

Not every application needs cloud-based embeddings. Sentence-Transformers is an open-source library that lets you run embedding models locally on your machine. This approach offers privacy (your data never leaves your hardware), cost savings (no API calls), and no per-request network latency.

Install the library:

pip install sentence-transformers scikit-learn

Now create embeddings for multiple texts:

# local_embeddings.py
from sentence_transformers import SentenceTransformer

# Load a pre-trained model (downloads ~90MB on first use)
model = SentenceTransformer('all-MiniLM-L6-v2')

# List of sentences to embed
sentences = [
    "The cat sat on the mat",
    "A feline rested on the carpet",
    "Python is a programming language",
    "Java is an object-oriented language"
]

# Create embeddings for all sentences at once
embeddings = model.encode(sentences)

print(f"Number of embeddings: {len(embeddings)}")
print(f"Embedding dimension: {len(embeddings[0])}")
print(f"All embeddings created successfully")

# Embeddings is a numpy array of shape (4, 384)
print(f"Shape: {embeddings.shape}")

Output:

Number of embeddings: 4
Embedding dimension: 384
All embeddings created successfully
Shape: (4, 384)

The model downloads automatically on first use; subsequent runs reuse the cached copy, so they start much faster. The all-MiniLM-L6-v2 model is lightweight (roughly 90MB on disk, 22 million parameters) and works well for most tasks, though larger models like all-mpnet-base-v2 (roughly 420MB) offer higher quality.

[Image: a glowing cloud beside a local computer tower, representing cloud vs local embeddings]
Cloud or local — one costs money per token, the other costs your GPU’s will to live.

Cosine Similarity and Distance Metrics

Creating embeddings is only half the battle. The other half is comparing them to find similar texts. Cosine similarity is the standard metric: it measures the angle between two embedding vectors, giving a score from -1 to 1 (typically 0 to 1 for text). A score of 1 means identical direction (perfect semantic match), while 0 means perpendicular (no relationship).
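The formula itself is simple: the dot product of the two vectors divided by the product of their lengths. A dependency-free sketch makes that concrete:

```python
# cosine_from_scratch.py
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine([1, 0], [1, 0]))   # 1.0  -> identical direction
print(cosine([1, 0], [0, 1]))   # 0.0  -> perpendicular, unrelated
print(cosine([1, 0], [-1, 0]))  # -1.0 -> opposite direction
```

In practice you’ll want sklearn’s cosine_similarity, which computes the same thing vectorized over whole matrices; the hand-rolled version is only here to demystify the metric.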

Here’s how to calculate and use cosine similarity:

# cosine_similarity.py
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# Create embeddings
query = "artificial intelligence"
documents = [
    "machine learning and neural networks",
    "cooking recipes for pasta",
    "deep learning algorithms"
]

query_embedding = model.encode(query)
doc_embeddings = model.encode(documents)

# Calculate similarity between query and all documents
similarities = cosine_similarity([query_embedding], doc_embeddings)[0]

# Sort by similarity
ranked_docs = sorted(zip(documents, similarities), key=lambda x: x[1], reverse=True)

for doc, score in ranked_docs:
    print(f"{score:.4f} - {doc}")

Output:

0.8234 - deep learning algorithms
0.7156 - machine learning and neural networks
0.1245 - cooking recipes for pasta

The query about AI matched strongly with “deep learning algorithms” (0.82) and “machine learning and neural networks” (0.72), but barely with the cooking document (0.12). This is exactly what you want from semantic search: the system matched meaning, not just keywords.

Storing and Managing Embeddings

For applications with hundreds or millions of embeddings, efficient storage and retrieval becomes critical. You have several options: NumPy arrays for simple cases, vector databases like ChromaDB or Pinecone for scalability, or traditional databases with vector extensions like PostgreSQL with pgvector.

Here’s how to save embeddings to disk using NumPy:

# save_embeddings.py
from sentence_transformers import SentenceTransformer
import numpy as np
import json

model = SentenceTransformer('all-MiniLM-L6-v2')

# Documents and their embeddings
documents = [
    "Python is great for data science",
    "JavaScript powers web applications",
    "Rust provides memory safety"
]

embeddings = model.encode(documents)

# Save embeddings as binary format (efficient)
np.save('embeddings.npy', embeddings)

# Save document metadata as JSON
metadata = {
    'documents': documents,
    'model': 'all-MiniLM-L6-v2',
    'dimension': len(embeddings[0])
}

with open('metadata.json', 'w') as f:
    json.dump(metadata, f)

print("Embeddings saved successfully")

Output:

Embeddings saved successfully

Later, load them back:

# load_embeddings.py
import numpy as np
import json

# Load embeddings
embeddings = np.load('embeddings.npy')

# Load metadata
with open('metadata.json', 'r') as f:
    metadata = json.load(f)

print(f"Loaded {len(embeddings)} embeddings")
print(f"Dimension: {metadata['dimension']}")
print(f"First document: {metadata['documents'][0]}")

Output:

Loaded 3 embeddings
Dimension: 384
First document: Python is great for data science

Building a Semantic Search Engine

Semantic search combines embedding creation, similarity calculation, and ranking to find the most relevant documents for a query. Unlike keyword search, it understands intent and meaning. Let’s build a simple semantic search engine:

# semantic_search.py
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class SemanticSearchEngine:
    def __init__(self, documents, model_name='all-MiniLM-L6-v2'):
        self.documents = documents
        self.model = SentenceTransformer(model_name)
        self.embeddings = self.model.encode(documents)

    def search(self, query, top_k=3):
        query_embedding = self.model.encode(query)
        similarities = cosine_similarity([query_embedding], self.embeddings)[0]

        # Get top-k results
        top_indices = similarities.argsort()[-top_k:][::-1]
        results = [
            {
                'document': self.documents[i],
                'score': similarities[i]
            }
            for i in top_indices
        ]
        return results

# Create search engine
docs = [
    "Python is ideal for machine learning",
    "JavaScript runs in web browsers",
    "Machine learning models need data",
    "Web development uses HTML and CSS"
]

search = SemanticSearchEngine(docs)

# Search
results = search.search("deep learning with Python", top_k=2)
for result in results:
    print(f"{result['score']:.4f} - {result['document']}")

Output:

0.8342 - Python is ideal for machine learning
0.7156 - Machine learning models need data

[Image: a detective with a magnifying glass searching floating documents, representing semantic search]
cosine_similarity() finds what you meant, not just what you typed.

Dimensionality and Performance Tradeoffs

Embedding dimensions range from 384 to 3072 across popular models. Higher-dimensional embeddings capture more nuance but require more storage and computation. Choose based on your needs:

Low dimension (384): Fast, lightweight, good for real-time applications. Use when speed matters and your texts are straightforward.

Medium dimension (768-1024): Balanced quality and performance. Best for most production applications.

High dimension (1536-3072): Maximum quality, slower processing. Use when accuracy is critical and speed is not.
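Dimension also drives storage: each float32 value takes 4 bytes, so the raw footprint of an embedding matrix is simply count × dimension × 4. A quick back-of-the-envelope calculator:

```python
# embedding_storage_cost.py
def storage_mb(num_vectors, dim, bytes_per_value=4):
    """Raw size of a float32 embedding matrix in megabytes."""
    return num_vectors * dim * bytes_per_value / (1024 ** 2)

# One million documents at each common dimension tier
for dim in (384, 768, 1536, 3072):
    print(f"{dim:>4}-dim: {storage_mb(1_000_000, dim):,.0f} MB")
```

At 384 dimensions, a million vectors fit comfortably in RAM; at 3072 dimensions the same corpus is eight times larger, which is when vector databases and compression start to matter.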

Here’s how to compare performance:

# compare_models.py
from sentence_transformers import SentenceTransformer
import time

models_to_test = [
    'all-MiniLM-L6-v2',      # 384-dim, ~90MB
    'all-mpnet-base-v2',     # 768-dim, ~420MB
]

# Test text
texts = ["Machine learning is fascinating"] * 1000

for model_name in models_to_test:
    model = SentenceTransformer(model_name)

    start = time.time()
    embeddings = model.encode(texts)
    elapsed = time.time() - start

    print(f"{model_name}: {len(embeddings[0])}-dim, {elapsed:.2f}s for 1000 texts")

Output:

all-MiniLM-L6-v2: 384-dim, 2.34s for 1000 texts
all-mpnet-base-v2: 768-dim, 5.67s for 1000 texts

The smaller model is 2.4x faster. For a corpus of 1 million documents, this difference becomes significant. Choose your model based on whether you prioritize speed or accuracy.

Batch Processing Large Datasets

When embedding thousands or millions of documents, batch processing is essential. The SentenceTransformer.encode() method accepts a batch_size parameter to control memory usage and speed:

# batch_processing.py
from sentence_transformers import SentenceTransformer
import time

model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate 10,000 sample documents
documents = [f"Document number {i} about topic X" for i in range(10000)]

print("Starting batch embedding...")
start = time.time()

# Embed with specified batch size (tune based on your GPU/RAM)
embeddings = model.encode(
    documents,
    batch_size=64,           # Process 64 docs at once
    show_progress_bar=True,
    device='cpu'             # Use 'cuda' if you have a GPU
)

elapsed = time.time() - start
print(f"Embedded {len(embeddings)} documents in {elapsed:.2f} seconds")
print(f"Rate: {len(embeddings)/elapsed:.0f} docs/second")

Output:

Starting batch embedding...
Embedded 10000 documents in 18.34 seconds
Rate: 545 docs/second

Key parameters: batch_size controls memory usage (larger = faster but uses more RAM), show_progress_bar gives feedback on long operations, and device='cuda' uses a GPU if available (10-50x faster). For CPU-only systems, a batch size of 32-64 is typical; GPU systems can use 128-512.
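If the corpus doesn’t even fit in memory as one Python list, you can stream it in chunks. Here’s a sketch: encode_stream is a hypothetical helper that accepts any model object exposing an .encode(list) method (such as a SentenceTransformer), so nothing below depends on a particular library.

```python
# stream_encoding.py
from itertools import islice

def batched(iterable, n):
    """Yield successive lists of up to n items from any iterable."""
    it = iter(iterable)
    while chunk := list(islice(it, n)):
        yield chunk

def encode_stream(model, texts, chunk_size=1000):
    """Encode an arbitrarily large iterable of texts, one chunk at a time."""
    for chunk in batched(texts, chunk_size):
        # Each chunk's vectors could be written to disk here
        # instead of accumulating everything in RAM
        yield from model.encode(chunk)
```

Each chunk’s vectors can be appended to disk as they are produced rather than held in memory. Python 3.12 ships itertools.batched with the same behavior as the helper above.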

[Image: a conveyor belt sorting glowing cubes into bins, representing batch processing of embeddings]
Batch size 32 on a GPU: 500 docs/sec. Batch size 1 on a CPU: existential crisis.

Real-Life Example: Document Similarity Finder

Let’s build a practical application that finds similar documents in a corpus. This is useful for duplicate detection, content recommendation, or legal document review:

# document_similarity_finder.py
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

class DocumentSimilarityFinder:
    def __init__(self, documents, model_name='all-MiniLM-L6-v2'):
        self.documents = documents
        self.model = SentenceTransformer(model_name)
        self.embeddings = self.model.encode(documents)

    def find_similar(self, query_doc_index, threshold=0.75, top_k=5):
        """Find documents similar to the document at query_doc_index."""
        query_embedding = self.embeddings[query_doc_index]

        # Calculate similarity with all documents
        similarities = cosine_similarity([query_embedding], self.embeddings)[0]

        # Exclude the query document itself
        similarities[query_doc_index] = -1

        # Filter by threshold and get top-k
        similar_indices = np.where(similarities >= threshold)[0]
        similar_indices = similar_indices[np.argsort(similarities[similar_indices])[::-1]][:top_k]

        results = []
        for idx in similar_indices:
            results.append({
                'document': self.documents[idx],
                'similarity': float(similarities[idx]),
                'index': int(idx)
            })

        return results

    def find_duplicates(self, threshold=0.95):
        """Find all potential duplicate pairs."""
        similarity_matrix = cosine_similarity(self.embeddings)

        duplicates = []
        for i in range(len(self.documents)):
            for j in range(i + 1, len(self.documents)):
                if similarity_matrix[i][j] >= threshold:
                    duplicates.append({
                        'doc1': self.documents[i],
                        'doc2': self.documents[j],
                        'similarity': float(similarity_matrix[i][j])
                    })

        return duplicates

# Example usage
documents = [
    "Python is a versatile programming language",
    "Python: a flexible and powerful programming language",
    "Java is an object-oriented language",
    "JavaScript powers web browsers",
    "Machine learning with Python and TensorFlow"
]

finder = DocumentSimilarityFinder(documents)

print("=== Similar to document 0 ===")
similar = finder.find_similar(0, threshold=0.6)
for result in similar:
    print(f"{result['similarity']:.4f} - {result['document']}")

print("\n=== Potential duplicates ===")
duplicates = finder.find_duplicates(threshold=0.92)
for dup in duplicates:
    print(f"{dup['similarity']:.4f}")
    print(f"  Doc A: {dup['doc1']}")
    print(f"  Doc B: {dup['doc2']}")

Output:

=== Similar to document 0 ===
0.9847 - Python: a flexible and powerful programming language
0.8234 - Machine learning with Python and TensorFlow
0.6156 - Java is an object-oriented language

=== Potential duplicates ===
0.9847
  Doc A: Python is a versatile programming language
  Doc B: Python: a flexible and powerful programming language

This example demonstrates key real-world scenarios: finding similar content and detecting near-duplicates. The high score (0.9847) between the two Python documents shows they’re essentially saying the same thing, perfect for deduplication pipelines.

[Image: a web of interconnected colorful nodes, representing a document similarity network]
Forty lines of Python and your documents rank themselves. The future is lazy and beautiful.

Frequently Asked Questions

What embedding dimension should I use?

Start with 384 (all-MiniLM-L6-v2) unless accuracy is critical. If quality matters more than speed and you have the resources, try 768 (all-mpnet-base-v2). Only go above 1024 dimensions if you’re working with very complex text or specific domain requirements.

How much does it cost to use OpenAI embeddings?

As of early 2026, OpenAI charges $0.02 per 1 million tokens for text-embedding-3-small and $0.13 per 1 million tokens for text-embedding-3-large. One token is roughly 4 characters, so embedding 1 million characters (about 250,000 tokens) costs roughly half a cent with the small model. Local models (Sentence-Transformers) are free after the initial download.
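To budget for your own corpus, the arithmetic is easy to script. The prices below are the ones quoted above, and the 4-characters-per-token figure is a rough heuristic, not an exact tokenizer count:

```python
# embedding_cost.py
PRICE_PER_M_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

def estimate_cost(total_chars, model="text-embedding-3-small"):
    """Rough embedding cost in dollars, assuming ~4 characters per token."""
    tokens = total_chars / 4
    return tokens / 1_000_000 * PRICE_PER_M_TOKENS[model]

# A 10,000-document corpus averaging 2,000 characters each
corpus_chars = 10_000 * 2_000
print(f"small: ${estimate_cost(corpus_chars):.2f}")  # small: $0.10
print(f"large: ${estimate_cost(corpus_chars, 'text-embedding-3-large'):.2f}")  # large: $0.65
```

For a precise count, tokenize first (for example with the tiktoken library) instead of dividing characters by four.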

Can I use embeddings for sensitive data?

OpenAI stores API inputs for 30 days for abuse detection. If you need better privacy guarantees, use local models like Sentence-Transformers. Your data never leaves your machine, making this ideal for healthcare, legal, or financial applications.

Do embeddings work for languages other than English?

Yes, but results vary. text-embedding-3-small works reasonably well for 100+ languages. For non-English text, consider models specifically trained for your language, like paraphrase-multilingual-MiniLM-L12-v2 from Sentence-Transformers, which handles 50+ languages.

Do I need to re-embed documents if I change the embedding model?

Yes. Embeddings from different models are incompatible. If you switch models, you must re-embed your entire corpus. This is an important consideration when choosing a model—changing it later requires significant reprocessing.
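A cheap safeguard is to store the model name next to the vectors (as the metadata.json example earlier does) and verify it at load time. A minimal sketch, where assert_model_matches is a hypothetical helper:

```python
# check_embedding_model.py
import json

def load_metadata(path):
    """Read the metadata JSON saved alongside the embeddings."""
    with open(path) as f:
        return json.load(f)

def assert_model_matches(metadata, expected_model):
    """Refuse to mix embeddings produced by different models."""
    stored = metadata.get("model")
    if stored != expected_model:
        raise ValueError(
            f"Embeddings were created with {stored!r}, but the current "
            f"model is {expected_model!r}; re-embed the corpus."
        )

meta = {"model": "all-MiniLM-L6-v2", "dimension": 384}
assert_model_matches(meta, "all-MiniLM-L6-v2")  # passes silently
```

One guard clause at startup is far cheaper than debugging silently meaningless similarity scores.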

What similarity threshold should I use?

It depends on your use case. For deduplication: 0.95+. For finding related content: 0.70-0.85. For loose matching: 0.50-0.70. Always test with real data—thresholds vary by domain, text type, and model choice.
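Those bands are starting points, not laws. Encoded as a small helper for experimentation (the cutoffs below are the illustrative ones from above; tune them on your own data):

```python
# similarity_bands.py
def classify_match(score, dup_cutoff=0.95, related_cutoff=0.70, loose_cutoff=0.50):
    """Map a cosine similarity score to a rough relationship label."""
    if score >= dup_cutoff:
        return "likely duplicate"
    if score >= related_cutoff:
        return "related content"
    if score >= loose_cutoff:
        return "loose match"
    return "unrelated"

print(classify_match(0.97))  # likely duplicate
print(classify_match(0.80))  # related content
print(classify_match(0.30))  # unrelated
```

Making the cutoffs keyword arguments keeps them easy to sweep while you validate against labeled examples from your own domain.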

Conclusion

Vector embeddings are the foundation of modern semantic AI applications. You now have the knowledge to create embeddings using both cloud-based APIs (OpenAI) and local models (Sentence-Transformers), calculate similarity between texts, store embeddings efficiently, and build production-grade semantic search systems. Whether you’re creating a document recommendation engine, detecting duplicates, building a RAG application, or powering an AI chatbot, embeddings are an essential tool in your Python toolkit.

The key takeaway: embeddings convert language into mathematics. Once you have that mathematical representation, you can search, compare, cluster, and reason about text with remarkable accuracy. Start simple with a local model like all-MiniLM-L6-v2, measure performance, and scale up when needed.

Next steps: Try the quick example above, experiment with different models and similarity thresholds, and explore vector databases like ChromaDB if you’re working with large-scale applications. For deeper dives, check out the official documentation below.

Resources