Intermediate

Comparing strings for similarity is one of those problems that sounds simple until you actually need to do it in production. Are “colour” and “color” the same? How similar are “python” and “pyhon” after a typo? Should “McDonald’s” match “McDonalds” in a search? The naive approach — checking for exact equality — handles none of these cases. You need distance metrics: mathematical measures of how different two strings are.

The textdistance library gives you over 30 string distance and similarity algorithms in one package with a consistent API. Levenshtein, Jaro-Winkler, Hamming, Jaccard, Cosine similarity, and many more are all accessible with the same interface. You can compare the results of multiple algorithms side-by-side, choose the right one for your use case, and even use hardware-accelerated backends for performance.

This article covers how to install and use textdistance, the most important algorithms and when to use each, how to normalize scores to 0-1 similarity, how to pick the right metric for your problem, and how to build a fuzzy search system. By the end you will be able to implement string matching for autocomplete, deduplication, spell checking, and data cleaning tasks.

String Similarity with textdistance: Quick Example

Here is a minimal example comparing two strings using Levenshtein distance:

# quick_textdistance.py
import textdistance

# Levenshtein: edit distance (how many changes to go from a to b)
dist = textdistance.levenshtein("python", "pyhton")
print(f"Edit distance: {dist}")

# Normalized similarity: 0.0 (completely different) to 1.0 (identical)
sim = textdistance.levenshtein.normalized_similarity("python", "pyhton")
print(f"Similarity: {sim:.2f}")

# Compare multiple algorithms at once
algorithms = [
    textdistance.levenshtein,
    textdistance.jaro_winkler,
    textdistance.jaccard,
]
a, b = "colour", "color"
for alg in algorithms:
    sim = alg.normalized_similarity(a, b)
    print(f"{alg.__class__.__name__:20s}: {sim:.3f}")

Output:

Edit distance: 2
Similarity: 0.67
Levenshtein          : 0.833
JaroWinkler          : 0.933
Jaccard              : 0.800

The edit distance of 2 means two single-character edits turn “python” into “pyhton”: plain Levenshtein treats the swapped t and h as a deletion plus an insertion. Normalized to 0-1, they are 67% similar. The sections below cover each algorithm family, when to use them, and practical application patterns.

What Is textdistance and Why Use It?

textdistance is a Python library that implements text distance algorithms in a unified interface. Every algorithm exposes the same methods: .distance(a, b) for the raw distance, .similarity(a, b) for the raw similarity, and .normalized_similarity(a, b) for a 0-1 score. This makes it easy to experiment with multiple metrics using the same code.
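
A minimal sketch of that shared interface, using the Levenshtein object (the raw similarity scale varies by algorithm, which is exactly why the normalized form exists):

# same_interface.py
import textdistance

alg = textdistance.levenshtein
print(alg.distance("python", "pyhton"))               # raw distance: 2
print(alg.similarity("python", "pyhton"))             # raw similarity (algorithm-specific scale)
print(alg.normalized_similarity("python", "pyhton"))  # always 0.0-1.0, here ~0.67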

Algorithm Family                      Best For              Example Use Case
------------------------------------------------------------------------------------------
Edit distance (Levenshtein, Hamming)  Typos, OCR errors     Spell checking, fuzzy search
Sequence (Jaro, Jaro-Winkler)         Name matching         Deduplicating customer records
Token (Jaccard, Sorensen)             Document similarity   Detecting duplicate content
Phonetic (Soundex, Metaphone)         Sound-alike matching  Name search ignoring spelling
Compression-based (NCD)               Arbitrary similarity  Language detection

Install with pip:

pip install textdistance

import textdistance
print(textdistance.__version__)

Output:

4.6.3

[Illustration: Debug Dee measuring the string distance between "colour" and "color". Caption: distance=1. Close enough for government work, and also for fuzzy search.]

Edit Distance Algorithms

Edit distance measures how many character-level operations are needed to transform one string into another. Levenshtein allows insertions, deletions, and substitutions. Hamming only counts substitutions (strings must be the same length). Damerau-Levenshtein also allows transpositions (swapping adjacent characters), making it better for catching typing errors.

# textdistance_edit.py
import textdistance

pairs = [
    ("kitten", "sitting"),    # classic example
    ("python", "pyhton"),     # transposition typo
    ("colour", "color"),      # British/American spelling
    ("algorithm", "altgorithm"),  # insertion typo
]

print(f"{'Pair':<35} {'Levenshtein':>12} {'Damerau':>10} {'Similarity':>12}")
print("-" * 72)
for a, b in pairs:
    lev = textdistance.levenshtein.distance(a, b)
    dam = textdistance.damerau_levenshtein.distance(a, b)
    sim = textdistance.levenshtein.normalized_similarity(a, b)
    print(f"'{a}' vs '{b}'{'':>{35 - len(a) - len(b) - 9}} {lev:>12} {dam:>10} {sim:>12.3f}")

Output:

Pair                                Levenshtein    Damerau   Similarity
------------------------------------------------------------------------
'kitten' vs 'sitting'                         3          3        0.571
'python' vs 'pyhton'                          2          1        0.667
'colour' vs 'color'                           1          1        0.833
'algorithm' vs 'altgorithm'                   1          1        0.900

Notice how Damerau-Levenshtein scores “pyhton” as distance 1 (one transposition) while Levenshtein scores it as 2 (one deletion + one insertion). For spell checking where typos often involve swapped adjacent characters, Damerau-Levenshtein is the better choice. For OCR correction where characters are independently misread (not swapped), standard Levenshtein is usually sufficient.
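
Hamming, the third edit-distance family member mentioned above, only counts position-by-position substitutions, so reserve it for equal-length strings such as fixed-width codes or DNA reads. A minimal sketch, with values that follow directly from the definition:

# hamming_example.py
import textdistance

# Hamming counts positions where the characters differ, so it only makes
# sense for strings of equal length
print(textdistance.hamming("karolin", "kathrin"))  # 3: positions 3, 4, 5 differ
print(textdistance.hamming("10110", "10011"))      # 2: the third and fifth bits differ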

Jaro-Winkler for Name Matching

Jaro-Winkler is specifically designed for short strings like personal names. It gives extra weight to matching prefixes — the assumption being that people are more likely to match on the beginning of a name. It returns a value between 0 and 1 directly (no normalization needed).

# textdistance_jaro.py
import textdistance

# Name matching examples
name_pairs = [
    ("John Smith", "Jon Smith"),      # first name variant
    ("McDonald", "MacDonald"),         # prefix variant
    ("Williams", "Williamson"),        # suffix addition
    ("Catherine", "Katherine"),        # spelling variant
    ("Robert", "Bob"),                 # nickname (low similarity expected)
]

jaro = textdistance.jaro
jw = textdistance.jaro_winkler

print(f"{'Name pair':<40} {'Jaro':>8} {'JaroWinkler':>13}")
print("-" * 64)
for a, b in name_pairs:
    print(f"'{a}' vs '{b}'"[:40].ljust(40),
          f"{jaro.normalized_similarity(a, b):>8.3f}",
          f"{jw.normalized_similarity(a, b):>13.3f}")

Output:

Name pair                               Jaro  JaroWinkler
----------------------------------------------------------------
'John Smith' vs 'Jon Smith'            0.985        0.991
'McDonald' vs 'MacDonald'              0.917        0.958
'Williams' vs 'Williamson'             0.944        0.944
'Catherine' vs 'Katherine'             0.926        0.926
'Robert' vs 'Bob'                      0.556        0.556

Jaro-Winkler correctly identifies that “McDonald” and “MacDonald” are very similar (0.958) while correctly flagging “Robert” vs “Bob” as low similarity (0.556). This matches human intuition about name variants. For customer record deduplication, a threshold of 0.90+ on Jaro-Winkler is a common starting point for flagging likely duplicates for human review.
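
Here is a minimal sketch of that review workflow; the customer names are made up and the 0.90 cutoff is the starting point mentioned above, not a tuned value:

# dedupe_sketch.py -- illustrative example, not a production pipeline
from itertools import combinations
import textdistance

customers = ["Jon Smith", "John Smith", "Jane Smyth", "J. Smith", "Mary Jones"]

THRESHOLD = 0.90
review_queue = []
for a, b in combinations(customers, 2):
    score = textdistance.jaro_winkler.normalized_similarity(a.lower(), b.lower())
    if score >= THRESHOLD:
        review_queue.append((a, b, round(score, 3)))

# Pairs above the threshold get a human look instead of an automatic merge
for a, b, score in review_queue:
    print(f"{score}  {a!r} <-> {b!r}")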

[Illustration: API Alice at a control panel with similarity threshold sliders. Caption: Jaro-Winkler 0.90+: where 'Jon' becomes 'John' and no customer is lost twice.]

Token-Based Similarity for Longer Text

For longer strings — sentences, documents, product descriptions — character-level distance metrics become slow and less meaningful. Token-based metrics split strings into words (or character n-grams) and measure set overlap. Jaccard similarity is the most common: it divides the size of the intersection by the size of the union. One caveat: textdistance's token algorithms compare single characters unless you tell them otherwise, so the example below splits each description into lowercase word tokens before comparing.

# textdistance_jaccard.py
import textdistance

# Product description deduplication
desc1 = "Python programming for beginners complete guide 2024"
desc2 = "Complete Python programming guide for beginners 2024"
desc3 = "Java programming for beginners complete guide 2024"
desc4 = "Cooking pasta with fresh tomatoes Italian style"

pairs = [
    (desc1, desc2, "reordered words"),
    (desc1, desc3, "different language"),
    (desc1, desc4, "different topic"),
]

jac = textdistance.jaccard
cos = textdistance.cosine

print(f"{'Comparison':<25} {'Jaccard':>10} {'Cosine':>10}")
print("-" * 47)
for a, b, label in pairs:
    # Split into lowercase word tokens; with the defaults, textdistance
    # would compare the raw strings character by character instead
    ta, tb = a.lower().split(), b.lower().split()
    print(f"{label:<25} {jac.normalized_similarity(ta, tb):>10.3f} "
          f"{cos.normalized_similarity(ta, tb):>10.3f}")

Output:

Comparison                   Jaccard     Cosine
-----------------------------------------------
reordered words                1.000      1.000
different language             0.750      0.857
different topic                0.000      0.000

Word-level comparison sees the reordered description as identical (same words, any order), while the pasta description shares nothing. The Java version is the instructive case: it still shares six of its eight distinct words (Jaccard 0.75) even though it describes a different product, which is why the threshold matters. For content deduplication or finding near-duplicate product listings, a token-based threshold around 0.8 catches reordered and lightly edited duplicates while keeping out listings that merely share a template.
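
Tokenization often matters more than the metric itself. A small illustrative sketch: character-level Jaccard barely notices a one-letter typo, while word-level Jaccard treats the misspelled word as a completely different token.

# tokenization_matters.py
import textdistance

a = "python programming guide"
b = "python programing guide"   # one-letter typo in "programming"

# Character-level (the default): the typo removes a single 'm' from the
# character bag, so the score stays close to 1.0
print(textdistance.jaccard.normalized_similarity(a, b))

# Word-level: "programing" is a different token, so only 2 of the 4
# distinct words are shared -> 0.5
print(textdistance.jaccard.normalized_similarity(a.split(), b.split()))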

Choosing the Right Algorithm

The most common mistake when implementing fuzzy matching is picking the most well-known algorithm (usually Levenshtein) for every use case. Here is a practical guide:

Problem              Recommended Algorithm     Why
------------------------------------------------------------------------------------
Spell checking       Damerau-Levenshtein       Catches transpositions (common typos)
Name deduplication   Jaro-Winkler              Prefix-weighted, good for names
Product search       Jaro-Winkler or Jaccard   Handles abbreviations and word order
Document similarity  Jaccard or Cosine         Token-based, order-independent
DNA/code sequences   Hamming                   Fixed-length substitution counting
Phonetic matching    Soundex or Metaphone      Sound-alike rather than spelling

When in doubt, compare multiple algorithms on a sample of your actual data and measure how often each one agrees with a human judgment. No algorithm is universally best — the right choice depends on your specific strings, language, and error patterns.
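
Here is a sketch of that kind of side-by-side check; the labeled pairs and the 0.8 threshold are purely illustrative stand-ins for a sample of your own data:

# compare_algorithms.py -- illustrative evaluation harness
import textdistance

# (a, b, should_match) pairs you would normally sample from your own data
labeled_pairs = [
    ("jon smith", "john smith", True),
    ("catherine", "katherine", True),
    ("colour", "color", True),
    ("robert", "rachel", False),
    ("python guide", "cooking pasta", False),
]

algorithms = {
    "levenshtein": textdistance.levenshtein,
    "jaro_winkler": textdistance.jaro_winkler,
    "jaccard": textdistance.jaccard,
}

THRESHOLD = 0.8
for name, alg in algorithms.items():
    # Count how often "score above threshold" agrees with the human label
    correct = sum(
        (alg.normalized_similarity(a, b) >= THRESHOLD) == should_match
        for a, b, should_match in labeled_pairs
    )
    print(f"{name:<15} agrees with labels on {correct}/{len(labeled_pairs)} pairs")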

Real-Life Example: Fuzzy Product Search Engine

# fuzzy_search.py
import textdistance

# Product catalog
CATALOG = [
    "Python Programming Complete Guide",
    "Learning JavaScript for Beginners",
    "Data Science with Pandas and NumPy",
    "Machine Learning Fundamentals",
    "Web Development with Flask",
    "Advanced SQL Database Design",
    "Docker and Kubernetes DevOps",
    "React Frontend Development",
]

def fuzzy_search(query, catalog, threshold=0.4, top_n=3):
    """Find catalog items similar to query using multiple algorithms."""
    results = []
    for item in catalog:
        # Jaro-Winkler: whole-string comparison that rewards a shared prefix
        jw_sim = textdistance.jaro_winkler.normalized_similarity(
            query.lower(), item.lower()
        )
        # Jaccard (default settings): character overlap, independent of word order
        jac_sim = textdistance.jaccard.normalized_similarity(
            query.lower(), item.lower()
        )
        # Average the two scores
        combined = (jw_sim + jac_sim) / 2
        if combined >= threshold:
            results.append((item, combined))

    # Sort by score descending
    results.sort(key=lambda x: x[1], reverse=True)
    return results[:top_n]

# Test with various query types
queries = [
    "pythn programming",      # typo
    "machne learning",         # typo
    "docker kubernetes",       # exact keywords
    "sql databse",             # typo
]

for query in queries:
    print(f"\nQuery: '{query}'")
    matches = fuzzy_search(query, CATALOG)
    if matches:
        for item, score in matches:
            print(f"  {score:.3f}  {item}")
    else:
        print("  No matches found")

Output:

Query: 'pythn programming'
  0.812  Python Programming Complete Guide

Query: 'machne learning'
  0.734  Machine Learning Fundamentals

Query: 'docker kubernetes'
  0.821  Docker and Kubernetes DevOps

Query: 'sql databse'
  0.681  Advanced SQL Database Design

The combined score is robust to typos: Jaro-Winkler rewards queries that start like the catalog entry and keep roughly the same character order, while character-level Jaccard rewards shared characters even when the keywords sit in a different position. Adjust the threshold parameter to trade off recall (catching more items, with more false positives) against precision (fewer but more accurate results). For production, you would lowercase and index the catalog once up front and cache results for common queries rather than scoring every item on every request.
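
A minimal sketch of that caching idea, memoizing a hypothetical scoring helper with functools.lru_cache (combined_score is not part of the fuzzy_search function above, just an illustration):

# cached_search.py -- illustrative caching of repeated query/item pairs
from functools import lru_cache
import textdistance

@lru_cache(maxsize=1024)
def combined_score(query: str, item: str) -> float:
    """Cache scores so repeated queries against the same catalog are not recomputed."""
    jw = textdistance.jaro_winkler.normalized_similarity(query, item)
    jac = textdistance.jaccard.normalized_similarity(query, item)
    return (jw + jac) / 2

# The first call computes; later identical calls hit the cache
print(combined_score("pythn programming", "python programming complete guide"))
print(combined_score.cache_info())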

[Illustration: Loop Larry typing a misspelled query with fuzzy matches shown. Caption: fuzzy_search('pythn'): because users never spell things right, and that's fine.]

Frequently Asked Questions

Is textdistance fast enough for large datasets?

textdistance can use C-accelerated backends like python-Levenshtein or jellyfish if they are installed. Install them alongside textdistance for a significant speedup on large comparisons. For searching a catalog of thousands of items, computing distances in pure Python is fast enough. For millions of comparisons (e.g., deduplicating a full customer database), use vectorized approaches with libraries like polyfuzz or pre-index with a tool like Elasticsearch that has built-in fuzzy matching.
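
To check whether an accelerated backend is actually being used, a rough comparison like the following can help; the external flag switches between the pure-Python path and any installed backends, and the absolute numbers depend entirely on your machine and packages:

# backend_timing.py -- rough benchmark; results are environment-dependent
import timeit
import textdistance

pure = textdistance.Levenshtein(external=False)   # force the pure-Python implementation
fast = textdistance.Levenshtein(external=True)    # use C-accelerated backends if installed

args = ("some fairly long product description here",
        "some fairy long product descriptions there")

for label, alg in [("pure python", pure), ("external", fast)]:
    t = timeit.timeit(lambda: alg.distance(*args), number=2000)
    print(f"{label:<12} {t:.3f}s for 2000 comparisons")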

How do I choose the right similarity threshold?

There is no universal threshold — it depends on your data and algorithm. The practical approach: collect a sample of 50-100 pairs that should match and 50-100 that should not, compute similarity scores, and plot the distribution. The ideal threshold sits in the gap between the two distributions. Start at 0.85 for name matching with Jaro-Winkler, 0.70 for document similarity with Jaccard, and 0.80 for general fuzzy search. Adjust based on how many false positives and false negatives you observe.
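
A compact sketch of that procedure; the labeled pairs are placeholders for a sample of your own data:

# pick_threshold.py -- illustrative threshold exploration
import textdistance

matches = [("jon smith", "john smith"), ("catherine", "katherine")]
non_matches = [("jon smith", "mary jones"), ("catherine", "christopher")]

alg = textdistance.jaro_winkler
match_scores = [alg.normalized_similarity(a, b) for a, b in matches]
non_match_scores = [alg.normalized_similarity(a, b) for a, b in non_matches]

print("matching pairs:    ", [round(s, 3) for s in match_scores])
print("non-matching pairs:", [round(s, 3) for s in non_match_scores])
# Pick a threshold in the gap between the highest non-matching score and
# the lowest matching score, then validate it on a larger sample.
print("gap from", max(non_match_scores), "to", min(match_scores))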

When should I use phonetic algorithms like Soundex?

Use phonetic algorithms when you want to match strings that sound the same regardless of spelling. “Smith” and “Smyth” are phonetically identical. “Cathy” and “Kathy” sound alike. Phonetic matching is valuable for name search in databases where historical records have inconsistent spelling, or in voice-to-text applications where the input is a transcription. textdistance's phonetic group includes MRA and Editex; for the classic Soundex, Metaphone, and NYSIIS codes, the separate jellyfish library is the usual choice.
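
As an illustration with the jellyfish package (a separate install, pip install jellyfish), sound-alike names collapse to the same Soundex code:

# soundex_example.py -- requires the separate jellyfish package
import jellyfish

for name in ["Smith", "Smyth", "Schmidt"]:
    print(name, jellyfish.soundex(name))

# Smith and Smyth share the code S530, so a phonetic index would return
# both for either spelling; Schmidt gets a different code.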

Does textdistance work with non-English text?

Yes. Edit distance algorithms (Levenshtein, Hamming) work on any Unicode string regardless of language. Token-based algorithms (Jaccard, Cosine) compare single characters by default (qval=1); pass qval=None to the constructor to split on whitespace into words, which works for space-separated languages but not for languages like Chinese or Japanese that do not use spaces. For those languages, character n-grams are a better fit: pass qval=2 or qval=3 to compare bigrams or trigrams.
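
A small sketch of bigram comparison on Japanese strings (the place names are just examples):

# ngram_japanese.py -- character bigrams for text without word spaces
import textdistance

jaccard_bigrams = textdistance.Jaccard(qval=2)

a = "東京都渋谷区"   # Shibuya, Tokyo
b = "東京都新宿区"   # Shinjuku, Tokyo

# Both strings share their leading bigrams, so the score is well above zero
# even though whitespace splitting would treat each string as a single token
print(jaccard_bigrams.normalized_similarity(a, b))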

What is the difference between similarity and normalized similarity?

.similarity() returns a raw score whose range depends on the algorithm (Levenshtein returns an integer; Jaro returns 0-1). .normalized_similarity() always returns a float between 0.0 and 1.0, where 1.0 means identical. Use normalized similarity when you want to compare scores across different algorithms or set a consistent threshold. Use raw distance when you need the actual edit count for display or downstream calculations.

Conclusion

The textdistance library gives you over 30 string similarity algorithms with a consistent, easy-to-use interface. You learned how to use edit distance algorithms for spell checking, Jaro-Winkler for name matching, Jaccard for document similarity, and how to combine multiple metrics for robust fuzzy search. The product search example showed how to build a practical fuzzy search engine in under 30 lines of Python.

The next step is to apply textdistance to a real deduplication or search problem in your own data. Start by sampling your data, trying two or three algorithms, and measuring which one best matches your human judgment on what should and should not match. The textdistance README on GitHub lists all 30+ algorithms grouped by family, which is a good map for exploring beyond the ones covered here.