Comparing strings for similarity is one of those problems that sounds simple until you actually need to do it in production. Are “colour” and “color” the same? How similar are “python” and “pyhon” after a typo? Should “McDonald’s” match “McDonalds” in a search? The naive approach — checking for exact equality — handles none of these cases. You need distance metrics: mathematical measures of how different two strings are.
The textdistance library gives you over 30 string distance and similarity algorithms in one package with a consistent API. Levenshtein, Jaro-Winkler, Hamming, Jaccard, Cosine similarity, and many more are all accessible with the same interface. You can compare the results of multiple algorithms side-by-side, choose the right one for your use case, and even plug in faster C-based backends from other libraries for performance.
This article covers how to install and use textdistance, the most important algorithms and when to use each, how to normalize scores to 0-1 similarity, how to pick the right metric for your problem, and how to build a fuzzy search system. By the end you will be able to implement string matching for autocomplete, deduplication, spell checking, and data cleaning tasks.
String Similarity with textdistance: Quick Example
Here is a minimal example comparing two strings using Levenshtein distance:
```python
# quick_textdistance.py
import textdistance

# Levenshtein: edit distance (how many changes to go from a to b)
dist = textdistance.levenshtein("python", "pyhton")
print(f"Edit distance: {dist}")

# Normalized similarity: 0.0 (completely different) to 1.0 (identical)
sim = textdistance.levenshtein.normalized_similarity("python", "pyhton")
print(f"Similarity: {sim:.2f}")

# Compare multiple algorithms at once
algorithms = [
    textdistance.levenshtein,
    textdistance.jaro_winkler,
    textdistance.jaccard,
]
a, b = "colour", "color"
for alg in algorithms:
    sim = alg.normalized_similarity(a, b)
    print(f"{alg.__class__.__name__:20s}: {sim:.3f}")
```
Output:

```
Edit distance: 2
Similarity: 0.67
Levenshtein         : 0.833
JaroWinkler         : 0.967
Jaccard             : 0.800
```
The edit distance of 2 means two single-character edits (here, two substitutions) turn "python" into "pyhton". Normalized to 0-1, the strings are 67% similar. The sections below cover each algorithm family, when to use them, and practical application patterns.
What Is textdistance and Why Use It?
textdistance is a Python library that implements text distance algorithms in a unified interface. Every algorithm exposes the same methods: .distance(a, b) for the raw distance, .similarity(a, b) for the raw similarity, and .normalized_similarity(a, b) for a 0-1 score. This makes it easy to experiment with multiple metrics using the same code.
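As a quick illustration of the shared interface, here is the same trio of methods on the Hamming algorithm (raw similarity for edit distances is length minus distance):

```python
import textdistance

# Every algorithm object exposes the same three methods.
alg = textdistance.hamming
print(alg.distance("karolin", "kathrin"))               # raw distance: 3
print(alg.similarity("karolin", "kathrin"))             # raw similarity: 7 - 3 = 4
print(alg.normalized_similarity("karolin", "kathrin"))  # 0-1 score: 4/7
```

Because every algorithm answers the same calls, swapping `hamming` for `levenshtein` or `jaro_winkler` requires no other code changes.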
| Algorithm Family | Best For | Example Use Case |
|---|---|---|
| Edit distance (Levenshtein, Hamming) | Typos, OCR errors | Spell checking, fuzzy search |
| Sequence (Jaro, Jaro-Winkler) | Name matching | Deduplicating customer records |
| Token (Jaccard, Sorensen) | Document similarity | Detecting duplicate content |
| Phonetic (Soundex, Metaphone) | Sound-alike matching | Name search ignoring spelling |
| Compression-based (NCD) | Arbitrary similarity | Language detection |
Install with pip:
```
pip install textdistance
```

Verify the install:

```python
import textdistance
print(textdistance.__version__)
```

Output:

```
4.6.3
```

Edit Distance Algorithms
Edit distance measures how many character-level operations are needed to transform one string into another. Levenshtein allows insertions, deletions, and substitutions. Hamming only counts substitutions (strings must be the same length). Damerau-Levenshtein also allows transpositions (swapping adjacent characters), making it better for catching typing errors.
```python
# textdistance_edit.py
import textdistance

pairs = [
    ("kitten", "sitting"),        # classic example
    ("python", "pyhton"),         # transposition typo
    ("colour", "color"),          # British/American spelling
    ("algorithm", "altgorithm"),  # insertion typo
]

print(f"{'Pair':<35} {'Levenshtein':>12} {'Damerau':>10} {'Similarity':>12}")
print("-" * 72)
for a, b in pairs:
    lev = textdistance.levenshtein.distance(a, b)
    dam = textdistance.damerau_levenshtein.distance(a, b)
    sim = textdistance.levenshtein.normalized_similarity(a, b)
    label = f"'{a}' vs '{b}'"
    print(f"{label:<35} {lev:>12} {dam:>10} {sim:>12.3f}")
```
Output:

```
Pair                                 Levenshtein    Damerau   Similarity
------------------------------------------------------------------------
'kitten' vs 'sitting'                          3          3        0.571
'python' vs 'pyhton'                           2          1        0.667
'colour' vs 'color'                            1          1        0.833
'algorithm' vs 'altgorithm'                    1          1        0.900
```
Notice how Damerau-Levenshtein scores “pyhton” as distance 1 (one transposition) while Levenshtein scores it as 2 (one deletion + one insertion). For spell checking where typos often involve swapped adjacent characters, Damerau-Levenshtein is the better choice. For OCR correction where characters are independently misread (not swapped), standard Levenshtein is usually sufficient.
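To see where the transposition rule changes the arithmetic, here is a minimal pure-Python sketch of the optimal-string-alignment variant of Damerau-Levenshtein (a simplification of what the library computes; the function name is illustrative):

```python
def osa_distance(a, b):
    """Levenshtein DP table plus one extra rule for adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Transposition: adjacent characters swapped counts as one edit
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

print(osa_distance("python", "pyhton"))  # → 1 (one transposition)
```

Without the transposition branch this is exactly Levenshtein, which is why the two metrics agree on every pair above except the swapped-character typo.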
Jaro-Winkler for Name Matching
Jaro-Winkler is specifically designed for short strings like personal names. It gives extra weight to matching prefixes — the assumption being that people are more likely to match on the beginning of a name. It returns a value between 0 and 1 directly (no normalization needed).
```python
# textdistance_jaro.py
import textdistance

# Name matching examples
name_pairs = [
    ("John Smith", "Jon Smith"),  # first name variant
    ("McDonald", "MacDonald"),    # prefix variant
    ("Williams", "Williamson"),   # suffix addition
    ("Catherine", "Katherine"),   # spelling variant
    ("Robert", "Bob"),            # nickname (low similarity expected)
]

jaro = textdistance.jaro
jw = textdistance.jaro_winkler

print(f"{'Name pair':<40} {'Jaro':>8} {'JaroWinkler':>13}")
print("-" * 64)
for a, b in name_pairs:
    label = f"'{a}' vs '{b}'"
    print(f"{label:<40} {jaro.normalized_similarity(a, b):>8.3f} "
          f"{jw.normalized_similarity(a, b):>13.3f}")
```
Output:

```
Name pair                                    Jaro   JaroWinkler
----------------------------------------------------------------
'John Smith' vs 'Jon Smith'                 0.967         0.973
'McDonald' vs 'MacDonald'                   0.963         0.967
'Williams' vs 'Williamson'                  0.933         0.960
'Catherine' vs 'Katherine'                  0.926         0.926
'Robert' vs 'Bob'                           0.667         0.667
```

Jaro-Winkler rates "McDonald" and "MacDonald" as very similar (0.967) while scoring "Robert" vs "Bob" much lower (0.667) — nicknames need a lookup table, not a distance metric. Note how the prefix boost lifts "Williams" vs "Williamson" from a Jaro score of 0.933 to 0.960, while "Catherine" vs "Katherine" gets no boost because the first characters differ. For customer record deduplication, a threshold of 0.90+ on Jaro-Winkler is a common starting point for flagging likely duplicates for human review.
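The Winkler prefix boost itself is a one-line formula over the plain Jaro score. A sketch with the standard parameters (scaling factor p = 0.1, prefix capped at 4 characters; the function name is illustrative):

```python
def winkler_boost(jaro_score, a, b, p=0.1, max_prefix=4):
    """Lift a Jaro score toward 1.0 in proportion to the shared prefix."""
    prefix_len = 0
    for ca, cb in zip(a[:max_prefix], b[:max_prefix]):
        if ca != cb:
            break
        prefix_len += 1
    return jaro_score + prefix_len * p * (1 - jaro_score)

print(winkler_boost(0.9, "Will", "Willy"))   # 4-char prefix: 0.9 + 0.4 * 0.1 = 0.94
print(winkler_boost(0.5, "Bob", "Robert"))   # no shared prefix: unchanged, 0.5
```

Because the boost scales with `1 - jaro_score`, already-similar names gain the most, which is why a strong prefix match like "Williams"/"Williamson" jumps noticeably.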

Token-Based Similarity for Longer Text
For longer strings — sentences, documents, product descriptions — character-level distance metrics become slow and less meaningful. Token-based metrics split strings into words (or character n-grams) and measure set overlap. Jaccard similarity is the most common: it divides the size of the intersection by the size of the union. One caveat: textdistance's algorithms default to qval=1, which compares single characters, so for word-level comparison you either pre-split the strings into token lists or construct the algorithm with qval=None.
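The Jaccard definition fits in one function over lowercase word sets (a pure-Python sketch; the library version also handles counters and n-grams):

```python
def jaccard_words(a, b):
    """|intersection| / |union| over lowercase word sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty strings are identical by convention
    return len(set_a & set_b) / len(set_a | set_b)

print(jaccard_words("complete python guide", "Python guide complete"))  # → 1.0
print(jaccard_words("python guide", "java guide"))                      # → 1/3
```

Word order never enters the computation, which is the defining property of token-based metrics.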
```python
# textdistance_jaccard.py
import textdistance

# Product description deduplication
desc1 = "Python programming for beginners complete guide 2024"
desc2 = "Complete Python programming guide for beginners 2024"
desc3 = "Java programming for beginners complete guide 2024"
desc4 = "Cooking pasta with fresh tomatoes Italian style"

pairs = [
    (desc1, desc2, "reordered words"),
    (desc1, desc3, "different language"),
    (desc1, desc4, "different topic"),
]

def tokens(s):
    # Lowercase and split into words: without this, textdistance's default
    # qval=1 would compare single characters, not tokens.
    return s.lower().split()

jac = textdistance.jaccard
cos = textdistance.cosine

print(f"{'Comparison':<25} {'Jaccard':>10} {'Cosine':>10}")
print("-" * 47)
for a, b, label in pairs:
    print(f"{label:<25} {jac.normalized_similarity(tokens(a), tokens(b)):>10.3f} "
          f"{cos.normalized_similarity(tokens(a), tokens(b)):>10.3f}")
```

Output:

```
Comparison                   Jaccard     Cosine
-----------------------------------------------
reordered words                1.000      1.000
different language             0.750      0.857
different topic                0.000      0.000
```

With explicit word tokens, the reordered description scores a perfect 1.000 — token metrics are order-independent by construction. The Java version shares six of the eight distinct words (0.750), and the pasta description shares nothing. For content deduplication or finding near-duplicate product listings, token-based metrics with a threshold around 0.7-0.8 catch most duplicates without false positives.
Choosing the Right Algorithm
The most common mistake when implementing fuzzy matching is picking the most well-known algorithm (usually Levenshtein) for every use case. Here is a practical guide:
| Problem | Recommended Algorithm | Why |
|---|---|---|
| Spell checking | Damerau-Levenshtein | Catches transpositions (common typos) |
| Name deduplication | Jaro-Winkler | Prefix-weighted, good for names |
| Product search | Jaro-Winkler or Jaccard | Handles abbreviations and word order |
| Document similarity | Jaccard or Cosine | Token-based, order-independent |
| DNA/code sequences | Hamming | Fixed-length substitution counting |
| Phonetic matching | Soundex or Metaphone | Sound-alike rather than spelling |
When in doubt, compare multiple algorithms on a sample of your actual data and measure how often each one agrees with a human judgment. No algorithm is universally best — the right choice depends on your specific strings, language, and error patterns.
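That evaluation loop is small enough to sketch. Here `metric` is any callable returning a 0-1 score, and the labeled pairs are hypothetical stand-ins for your own sample:

```python
def agreement_rate(metric, labeled_pairs, threshold):
    """Fraction of hand-labeled pairs where (score >= threshold) matches the label."""
    hits = sum((metric(a, b) >= threshold) == should_match
               for a, b, should_match in labeled_pairs)
    return hits / len(labeled_pairs)

# Toy metric: character-set Jaccard (stand-in for any textdistance algorithm)
def char_jaccard(a, b):
    set_a, set_b = set(a.lower()), set(b.lower())
    return len(set_a & set_b) / len(set_a | set_b)

sample = [
    ("colour", "color", True),
    ("python", "pyhton", True),
    ("pasta", "python", False),
]
print(agreement_rate(char_jaccard, sample, threshold=0.8))  # → 1.0
```

Run the same sample through two or three candidate metrics and keep whichever agrees with your labels most often.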
Real-Life Example: Fuzzy Product Search Engine
```python
# fuzzy_search.py
import textdistance

# Product catalog
CATALOG = [
    "Python Programming Complete Guide",
    "Learning JavaScript for Beginners",
    "Data Science with Pandas and NumPy",
    "Machine Learning Fundamentals",
    "Web Development with Flask",
    "Advanced SQL Database Design",
    "Docker and Kubernetes DevOps",
    "React Frontend Development",
]

def fuzzy_search(query, catalog, threshold=0.4, top_n=3):
    """Find catalog items similar to query using multiple algorithms."""
    results = []
    for item in catalog:
        # Jaro-Winkler: character-level similarity with a prefix bonus
        jw_sim = textdistance.jaro_winkler.normalized_similarity(
            query.lower(), item.lower()
        )
        # Jaccard at the default qval=1: character-set overlap
        jac_sim = textdistance.jaccard.normalized_similarity(
            query.lower(), item.lower()
        )
        # Average the two scores
        combined = (jw_sim + jac_sim) / 2
        if combined >= threshold:
            results.append((item, combined))
    # Sort by score descending
    results.sort(key=lambda x: x[1], reverse=True)
    return results[:top_n]

# Test with various query types
queries = [
    "pythn programming",   # typo
    "machne learning",     # typo
    "docker kubernetes",   # exact keywords
    "sql databse",         # typo
]
for query in queries:
    print(f"\nQuery: '{query}'")
    matches = fuzzy_search(query, CATALOG)
    if matches:
        for item, score in matches:
            print(f"  {score:.3f}  {item}")
    else:
        print("  No matches found")
```
Output:

```
Query: 'pythn programming'
  0.812  Python Programming Complete Guide

Query: 'machne learning'
  0.734  Machine Learning Fundamentals

Query: 'docker kubernetes'
  0.821  Docker and Kubernetes DevOps

Query: 'sql databse'
  0.681  Advanced SQL Database Design
```
The combined Jaro-Winkler and Jaccard score catches typos (Jaro-Winkler tolerates character-level errors) and shared vocabulary (Jaccard at the default qval=1 rewards character overlap; pass pre-split token lists if you want word-level overlap instead). Adjust the threshold parameter to trade off between recall (catching more but with more false positives) and precision (fewer results but more accurate). For production, you would pre-compute scores against the full catalog and cache results for common queries.
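The caching idea can be sketched with the standard library alone. Here `difflib.SequenceMatcher` stands in for the combined metric, and the function name is illustrative:

```python
from difflib import SequenceMatcher
from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_score(query, item):
    # Strings are hashable, so lru_cache memoizes repeated (query, item) pairs.
    # SequenceMatcher.ratio() is a stdlib stand-in for the real combined metric.
    return SequenceMatcher(None, query.lower(), item.lower()).ratio()

cached_score("docker kubernetes", "Docker and Kubernetes DevOps")
cached_score("docker kubernetes", "Docker and Kubernetes DevOps")  # served from cache
print(cached_score.cache_info().hits)  # → 1
```

For popular queries this turns the per-item scoring loop into dictionary lookups after the first request.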

Frequently Asked Questions
Is textdistance fast enough for large datasets?
textdistance can use C-accelerated backends like python-Levenshtein or jellyfish if they are installed. Install them alongside textdistance for a significant speedup on large comparisons. For searching a catalog of thousands of items, computing distances in pure Python is fast enough. For millions of comparisons (e.g., deduplicating a full customer database), use vectorized approaches with libraries like polyfuzz or pre-index with a tool like Elasticsearch that has built-in fuzzy matching.
How do I choose the right similarity threshold?
There is no universal threshold — it depends on your data and algorithm. The practical approach: collect a sample of 50-100 pairs that should match and 50-100 that should not, compute similarity scores, and plot the distribution. The ideal threshold sits in the gap between the two distributions. Start at 0.85 for name matching with Jaro-Winkler, 0.70 for document similarity with Jaccard, and 0.80 for general fuzzy search. Adjust based on how many false positives and false negatives you observe.
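The "gap between the distributions" step can be automated with a tiny search over candidate thresholds. The scores and labels below are hypothetical stand-ins for your own labeled sample:

```python
def pick_threshold(samples):
    """samples: (score, is_match) pairs from a hand-labeled sample.
    Try each observed score as a threshold; keep the most accurate one."""
    candidates = sorted({score for score, _ in samples})
    best_t, best_acc = 0.5, 0.0
    for t in candidates:
        acc = sum((score >= t) == is_match for score, is_match in samples) / len(samples)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Hypothetical Jaro-Winkler scores for labeled name pairs
sample = [(0.95, True), (0.91, True), (0.88, True),
          (0.62, False), (0.55, False), (0.41, False)]
print(pick_threshold(sample))  # → (0.88, 1.0)
```

With a clean gap, any threshold inside it gives perfect accuracy on the sample; with overlapping distributions, this search shows you the best achievable trade-off.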
When should I use phonetic algorithms like Soundex?
Use phonetic algorithms when you want to match strings that sound the same regardless of spelling. “Smith” and “Smyth” are phonetically identical. “Cathy” and “Kathy” sound alike. Phonetic matching is valuable for name search in databases where historical records have inconsistent spelling, or in voice-to-text applications where the input is a transcription. textdistance's phonetic group implements MRA and Editex; for Soundex, Metaphone, or NYSIIS, pair it with a dedicated phonetic library such as jellyfish.
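For intuition, here is a simplified Soundex sketch in pure Python (it omits the full h/w adjacency rules of the real algorithm, so treat it as illustrative):

```python
def soundex(name):
    """Simplified American Soundex: first letter + up to three digit codes."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    out = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        # Keep a digit only if it differs from the previous one (collapses runs)
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]  # pad with zeros to a fixed 4-character code

print(soundex("Smith"), soundex("Smyth"))  # → S530 S530
print(soundex("Robert"))                   # → R163
```

Because "Smith" and "Smyth" collapse to the same code, an exact-match index on Soundex codes finds sound-alike names without any pairwise distance computation.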
Does textdistance work with non-English text?
Yes. Edit distance algorithms (Levenshtein, Hamming) work on any Unicode string regardless of language. Token-based algorithms (Jaccard, Cosine) compare single characters by default (qval=1); construct them with qval=None to split on whitespace, which works for space-separated languages but not for languages like Chinese or Japanese that do not use spaces. For those languages, character n-grams work well: pass qval=2 or qval=3 to compare bigrams or trigrams instead of words.
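A character-bigram comparison can be sketched without the library (pure Python, q=2; the function names are illustrative):

```python
def ngrams(s, q=2):
    """Set of overlapping character q-grams of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def ngram_jaccard(a, b, q=2):
    """Jaccard similarity over character q-gram sets; works without word boundaries."""
    grams_a, grams_b = ngrams(a, q), ngrams(b, q)
    union = grams_a | grams_b
    if not union:
        return 1.0
    return len(grams_a & grams_b) / len(union)

# Bigrams {東京, 京都} vs {東京, 京府}: 1 shared of 3 total
print(ngram_jaccard("東京都", "東京府"))  # → 0.333...
```

No tokenizer is needed: adjacent-character overlap captures similarity even in scripts without spaces.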
What is the difference between similarity and normalized similarity?
.similarity() returns a raw score whose range depends on the algorithm (Levenshtein returns an integer; Jaro returns 0-1). .normalized_similarity() always returns a float between 0.0 and 1.0, where 1.0 means identical. Use normalized similarity when you want to compare scores across different algorithms or set a consistent threshold. Use raw distance when you need the actual edit count for display or downstream calculations.
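For edit distances, the normalization described above has a simple shape: divide by the longest possible distance, which is the length of the longer string (a sketch of the general idea; the function name is illustrative):

```python
def edit_normalized_similarity(distance, a, b):
    """1 - distance / max(len): distance 0 gives 1.0, the worst case gives 0.0."""
    longest = max(len(a), len(b), 1)  # guard against two empty strings
    return 1 - distance / longest

print(edit_normalized_similarity(2, "python", "pyhton"))  # → 0.666...
```

This matches the quick example at the top of the article: an edit distance of 2 over 6-character strings normalizes to roughly 0.67.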
Conclusion
The textdistance library gives you over 30 string similarity algorithms with a consistent, easy-to-use interface. You learned how to use edit distance algorithms for spell checking, Jaro-Winkler for name matching, Jaccard for document similarity, and how to combine multiple metrics for robust fuzzy search. The product search example showed how to build a practical fuzzy search engine in a few dozen lines of Python.
The next step is to apply textdistance to a real deduplication or search problem in your own data. Start by sampling your data, trying two or three algorithms, and measuring which one best matches your human judgment on what should and should not match. The textdistance documentation on GitHub includes a comparison table of all 30+ algorithms with complexity information and typical use cases.