Intermediate
You need to show what changed between two versions of a config file. Or find the closest matching product name from a misspelled user query. Or detect whether two documents are the same despite minor formatting differences. These are all sequence comparison problems, and Python’s standard library has a dedicated module for them: difflib. It is the same engine that powers many code review tools, fuzzy finders, and spell checkers.
difflib is a pure-Python standard library module (no installation needed) that computes differences between sequences. A sequence can be a list of strings (lines of a file), a list of characters, or any other ordered collection. The module provides several tools: SequenceMatcher for computing similarity ratios, Differ for human-readable line-by-line diffs, HtmlDiff for side-by-side HTML diffs, and get_close_matches() for fuzzy string matching.
This article covers the full toolkit: computing similarity scores with SequenceMatcher, generating diffs in unified and context formats, finding close matches, building an HTML diff viewer, and applying everything in a real-world config file auditor. By the end, you will have practical tools for text comparison tasks that previously required external libraries.
difflib Quick Example
# quick_difflib.py
import difflib
# Compare two versions of a config file
old = ["host = localhost\n", "port = 5432\n", "debug = False\n"]
new = ["host = db.prod.example.com\n", "port = 5432\n", "debug = False\n", "pool_size = 10\n"]
# Generate a unified diff (like git diff)
diff = difflib.unified_diff(old, new, fromfile="config.old", tofile="config.new", n=1)
print("".join(diff))
# Similarity ratio between two strings
matcher = difflib.SequenceMatcher(None, "python", "pytohn")
print(f"\nSimilarity: {matcher.ratio():.2%}") # 0.833 = 83.3%
# Fuzzy close matches
matches = difflib.get_close_matches("pythno", ["python", "java", "ruby", "perl"])
print(f"Close matches: {matches}")
Output:
--- config.old
+++ config.new
@@ -1,3 +1,4 @@
-host = localhost
+host = db.prod.example.com
 port = 5432
 debug = False
+pool_size = 10
Similarity: 83.33%
Close matches: ['python']
Three distinct tools shown in one example: unified diff for change tracking, SequenceMatcher for similarity scoring, and get_close_matches() for fuzzy lookup. Each addresses a different comparison need, and together they cover the majority of text comparison tasks you will encounter.
What Is difflib and What Can It Do?
difflib implements the Ratcliff/Obershelp algorithm, also known as gestalt pattern matching, which recursively finds the longest contiguous matching blocks shared by two sequences. It does not compute an edit distance (Levenshtein); instead it anchors the diff on large matching regions, which tends to produce results that look natural to human readers and makes it well suited to line-based comparison of text files.
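The anchoring behaviour is visible in find_longest_match(), the primitive the whole module is built on. A minimal sketch:

```python
# longest_match_demo.py -- how Ratcliff/Obershelp anchors on matching blocks
from difflib import SequenceMatcher

matcher = SequenceMatcher(None, "abxcd", "abcd")

# find_longest_match returns the longest block common to both ranges;
# ties are broken by the earliest position in the first sequence.
print(matcher.find_longest_match(0, 5, 0, 4))
# Match(a=0, b=0, size=2) -- the block "ab"

# get_matching_blocks() applies this recursively on both sides of each match;
# the final Match with size=0 is a sentinel marking the end.
print(matcher.get_matching_blocks())
# [Match(a=0, b=0, size=2), Match(a=3, b=2, size=2), Match(a=5, b=4, size=0)]
```

Everything else in the table below (ratios, opcodes, diffs) is derived from these matching blocks.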
| Function / Class | Input | Output | Use Case |
|---|---|---|---|
| SequenceMatcher | Two sequences | Similarity ratio, opcodes | Similarity scoring, change detection |
| unified_diff() | Two line lists | Unified diff lines | git-style change display |
| context_diff() | Two line lists | Context diff lines | Traditional diff format |
| Differ | Two line lists | Human-readable diff | Readable change display |
| HtmlDiff | Two line lists | HTML table | Side-by-side web display |
| get_close_matches() | String + word list | List of close matches | Fuzzy search, spell check |
SequenceMatcher: Similarity Ratios and Change Blocks
SequenceMatcher is the core engine underlying all other difflib tools. Instantiate it with two sequences and call ratio() for a 0.0-to-1.0 similarity score, or get_opcodes() to get a list of edit operations that transforms sequence A into sequence B.
# sequence_matcher.py
from difflib import SequenceMatcher
# Compare two strings character by character
a = "The quick brown fox jumps over the lazy dog"
b = "The quick brown cat jumps over the lazy dog"
matcher = SequenceMatcher(None, a, b)
print(f"Ratio: {matcher.ratio():.4f}") # 0.9302
print(f"Quick ratio: {matcher.quick_ratio():.4f}") # Upper bound, faster to compute
# Get the exact changes
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(f" {tag}: '{a[i1:i2]}' -> '{b[j1:j2]}'")
print()
# Compare lists of lines (more natural for text files)
lines_a = ["def greet(name):", " print(f'Hello {name}')", " return True"]
lines_b = ["def greet(name):", " print(f'Hi {name}!')", " return True"]
matcher2 = SequenceMatcher(None, lines_a, lines_b)
print(f"Line similarity: {matcher2.ratio():.4f}")
for tag, i1, i2, j1, j2 in matcher2.get_opcodes():
    if tag == "replace":
        print(f" Changed line {i1+1}:")
        print(f" FROM: {lines_a[i1:i2]}")
        print(f" TO: {lines_b[j1:j2]}")
Output:
Ratio: 0.9302
Quick ratio: 0.9302
replace: 'fox' -> 'cat'
Line similarity: 0.6667
Changed line 2:
FROM: [" print(f'Hello {name}')"]
TO: [" print(f'Hi {name}!')"]
The opcodes are a low-level but powerful API. The four operations are "equal" (no change), "insert" (add from B), "delete" (remove from A), and "replace" (swap a slice of A for a slice of B). Build a custom diff renderer by iterating opcodes and coloring each section: this is exactly what code editors do for inline change highlighting.
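Opcode iteration is enough for a small inline renderer. The sketch below is illustrative: render_inline and its [-...-]/{+...+} markers (borrowed from wdiff-style notation) are choices made here, not part of difflib.

```python
# inline_diff.py -- a hypothetical inline renderer built on get_opcodes()
from difflib import SequenceMatcher

def render_inline(a: str, b: str) -> str:
    """Mark deletions as [-...-] and insertions as {+...+}, wdiff-style."""
    parts = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag == "equal":
            parts.append(a[i1:i2])
        elif tag == "delete":
            parts.append(f"[-{a[i1:i2]}-]")
        elif tag == "insert":
            parts.append(f"{{+{b[j1:j2]}+}}")
        elif tag == "replace":
            parts.append(f"[-{a[i1:i2]}-]{{+{b[j1:j2]}+}}")
    return "".join(parts)

print(render_inline("abc", "abd"))  # ab[-c-]{+d+}
```

Swapping the markers for ANSI color codes or HTML spans turns this into a terminal or web diff highlighter.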
The autojunk Parameter
By default, SequenceMatcher applies an "autojunk" heuristic: when the second sequence is at least 200 elements long, any element that appears in more than 1% of it is treated as popular and excluded from match anchoring. For large files this usually speeds comparison up, but it can produce surprisingly low ratios when lines repeat heavily. Pass autojunk=False to disable the heuristic whenever exact similarity scores matter.
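A minimal sketch of the gotcha, assuming a 300-line file of repeated status lines: because every line is "popular", the heuristic leaves nothing to anchor a match on, and two nearly identical sequences score close to zero.

```python
# autojunk_gotcha.py -- the popular-element heuristic in action
from difflib import SequenceMatcher

# 300 lines, 299 of them identical between the two versions.
a = ["status = ok\n"] * 300
b = ["# header\n"] + ["status = ok\n"] * 299

# The repeated line appears in far more than 1% of b, so autojunk
# excludes it from matching and the ratio collapses.
print(SequenceMatcher(None, a, b).ratio())                            # 0.0
print(round(SequenceMatcher(None, a, b, autojunk=False).ratio(), 4))  # 0.9967
```

Behaviour verified against CPython's difflib; the exact collapsed value depends on whether any non-popular lines remain to anchor matches.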
Generating Diffs: unified_diff and context_diff
For human-readable change reports, unified_diff() produces the familiar ---/+++/@@ format used by git, patch, and most code review tools.
# unified_diff_demo.py
import difflib
def compare_files(file_a: str, file_b: str, context_lines: int = 3) -> str:
    """Compare two file contents and return a unified diff string."""
    lines_a = file_a.splitlines(keepends=True)
    lines_b = file_b.splitlines(keepends=True)
    diff = difflib.unified_diff(
        lines_a, lines_b,
        fromfile="original.py",
        tofile="modified.py",
        n=context_lines
    )
    return "".join(diff)

original = """def calculate_discount(price, percent):
    discount = price * percent
    return discount
def apply_coupon(order, code):
    if code == "SAVE10":
        return order - 10
    return order
"""
modified = """def calculate_discount(price, percent):
    if percent > 100:
        raise ValueError("Discount cannot exceed 100%")
    discount = price * (percent / 100)
    return discount
def apply_coupon(order, code, user_id=None):
    if code == "SAVE10":
        return order - 10
    if code == "SAVE20":
        return order - 20
    return order
"""
result = compare_files(original, modified)
print(result if result else "No differences found.")
Output:
--- original.py
+++ modified.py
@@ -1,7 +1,11 @@
 def calculate_discount(price, percent):
-    discount = price * percent
+    if percent > 100:
+        raise ValueError("Discount cannot exceed 100%")
+    discount = price * (percent / 100)
     return discount
-def apply_coupon(order, code):
+def apply_coupon(order, code, user_id=None):
     if code == "SAVE10":
         return order - 10
+    if code == "SAVE20":
+        return order - 20
     return order
The n=context_lines parameter controls how many unchanged lines to show around each change. The default is 3. Use n=0 to show only changed lines (like a “what changed” summary), or n=999 to show the full file with changes highlighted.
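The module's context_diff() function emits the older "context" format instead, where each file's version of a hunk is shown in its own block and changed lines are marked with an exclamation point. A minimal sketch:

```python
# context_diff_demo.py
import difflib

old = ["alpha\n", "beta\n", "gamma\n"]
new = ["alpha\n", "BETA\n", "gamma\n"]

# fromfile/tofile become the *** and --- header labels
diff = difflib.context_diff(old, new, fromfile="before.txt", tofile="after.txt")
print("".join(diff), end="")
```

Changed lines appear prefixed with "! " in both the *** (before) and --- (after) blocks, while unchanged context lines are indented with two spaces. Unified diffs are more compact; context diffs are easier to read when a hunk changes many adjacent lines.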
get_close_matches: Fuzzy String Lookup
get_close_matches() is the simplest path to fuzzy matching: give it a word and a vocabulary list, and it returns the best matches above a similarity threshold.
# close_matches.py
from difflib import get_close_matches
vocabulary = [
"python", "javascript", "typescript", "java", "kotlin",
"swift", "rust", "golang", "ruby", "scala",
]
# Basic usage
print(get_close_matches("pyhton", vocabulary)) # ['python']
print(get_close_matches("jvascript", vocabulary)) # ['javascript']
print(get_close_matches("xyz", vocabulary)) # [] -- no close match
# Control sensitivity with n and cutoff
print(get_close_matches("java", vocabulary, n=3, cutoff=0.4))
# ['java', 'javascript'] -- lower cutoff finds more results
# Practical use: did-you-mean suggestion
def did_you_mean(word: str, options: list[str]) -> str | None:
    matches = get_close_matches(word, options, n=1, cutoff=0.6)
    return matches[0] if matches else None

commands = ["start", "stop", "restart", "status", "reload"]
user_input = "reestart"
suggestion = did_you_mean(user_input, commands)
if suggestion:
    print(f"Unknown command '{user_input}'. Did you mean '{suggestion}'?")
Output:
['python']
['javascript']
[]
['java', 'javascript']
Unknown command 'reestart'. Did you mean 'restart'?
The cutoff parameter (default 0.6) controls how similar a match must be to be included. A lower cutoff (0.4) catches more distant matches but produces more false positives. A higher cutoff (0.8) is stricter but misses matches with more typos. For command-line “did you mean?” suggestions, 0.6 is a reasonable starting point.
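One gotcha worth knowing: get_close_matches() is case-sensitive, because the underlying SequenceMatcher compares characters exactly. A minimal sketch (lowercasing both sides is a simple normalization that suits ASCII command names):

```python
from difflib import get_close_matches

commands = ["start", "stop", "restart", "status", "reload"]

# Uppercase input shares no characters with the lowercase vocabulary,
# so the similarity ratio is 0 and nothing clears the cutoff.
print(get_close_matches("START", commands, n=1))          # []
print(get_close_matches("START".lower(), commands, n=1))  # ['start']
```

For case-insensitive matching, normalize the vocabulary once up front and keep a mapping back to the original spellings.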
Real-Life Example: Config File Auditor
This project compares a deployed config file against a template to detect drift, showing exactly what changed in a human-readable report.
# config_auditor.py
import difflib
from dataclasses import dataclass, field
from typing import List
@dataclass
class ConfigAuditResult:
    similarity: float
    added_lines: List[str] = field(default_factory=list)
    removed_lines: List[str] = field(default_factory=list)
    changed_sections: List[str] = field(default_factory=list)

    @property
    def has_drift(self) -> bool:
        return self.similarity < 1.0

    def summary(self) -> str:
        if not self.has_drift:
            return "No drift detected. Config matches template."
        return (
            f"Config drift detected (similarity: {self.similarity:.1%})\n"
            f" Added lines: {len(self.added_lines)}\n"
            f" Removed lines: {len(self.removed_lines)}"
        )

def audit_config(template: str, deployed: str) -> ConfigAuditResult:
    """Compare deployed config against template and return audit result."""
    template_lines = template.splitlines(keepends=True)
    deployed_lines = deployed.splitlines(keepends=True)
    matcher = difflib.SequenceMatcher(None, template_lines, deployed_lines)
    result = ConfigAuditResult(similarity=matcher.ratio())
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            result.removed_lines.extend(template_lines[i1:i2])
        elif tag == "insert":
            result.added_lines.extend(deployed_lines[j1:j2])
        elif tag == "replace":
            result.removed_lines.extend(template_lines[i1:i2])
            result.added_lines.extend(deployed_lines[j1:j2])
            result.changed_sections.append(template_lines[i1].strip())
    return result

def print_diff(template: str, deployed: str) -> None:
    """Print a unified diff between template and deployed config."""
    diff = difflib.unified_diff(
        template.splitlines(keepends=True),
        deployed.splitlines(keepends=True),
        fromfile="template.conf",
        tofile="deployed.conf",
        n=2
    )
    diff_text = "".join(diff)
    if diff_text:
        print(diff_text)
    else:
        print("Files are identical.")
# Example usage
TEMPLATE = """[server]
host = 0.0.0.0
port = 8080
workers = 4
timeout = 30
[database]
host = db.internal
port = 5432
pool_size = 10
"""
DEPLOYED = """[server]
host = 0.0.0.0
port = 9090
workers = 4
timeout = 30
[database]
host = db.internal
port = 5432
pool_size = 20
max_overflow = 5
"""
result = audit_config(TEMPLATE, DEPLOYED)
print(result.summary())
print()
print("--- Unified Diff ---")
print_diff(TEMPLATE, DEPLOYED)
print()
print(f"Changed sections: {result.changed_sections}")
Output:
Config drift detected (similarity: 73.7%)
Added lines: 3
Removed lines: 2
--- Unified Diff ---
--- template.conf
+++ deployed.conf
@@ -1,5 +1,5 @@
 [server]
 host = 0.0.0.0
-port = 8080
+port = 9090
 workers = 4
 timeout = 30
@@ -7,3 +7,4 @@
 host = db.internal
 port = 5432
-pool_size = 10
+pool_size = 20
+max_overflow = 5
Changed sections: ['port = 8080', 'pool_size = 10']
The ConfigAuditResult dataclass separates the raw diff data (added, removed lines) from the derived properties (has_drift, summary()). This structure makes the auditor easy to extend: add a critical_fields list to flag specific settings (like host changes) as high-severity drift.
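As a sketch of that extension (flag_critical and the prefix tuple are hypothetical names, not part of the auditor above), one might filter the removed template lines by key prefix:

```python
def flag_critical(
    removed_lines: list[str],
    critical_prefixes: tuple[str, ...] = ("host =", "port ="),
) -> list[str]:
    """Return removed template lines that touch critical settings."""
    # str.startswith accepts a tuple, so one call checks every prefix
    return [
        line.strip()
        for line in removed_lines
        if line.strip().startswith(critical_prefixes)
    ]

# With the drift found by the auditor above:
removed = ["port = 8080\n", "pool_size = 10\n"]
print(flag_critical(removed))  # ['port = 8080']
```

A CI wrapper could then exit non-zero whenever flag_critical() returns anything, while tolerating low-severity drift.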
Frequently Asked Questions
What is the difference between ratio() and quick_ratio()?
ratio() computes the exact similarity ratio by performing the full sequence comparison. quick_ratio() computes a cheaper upper bound that is never lower than the true ratio, and real_quick_ratio() an even cheaper, looser upper bound. Use quick_ratio() as a preliminary filter when you have many candidates: if quick_ratio() is already below your threshold, skip the expensive ratio() call. get_close_matches() applies exactly this optimisation internally.
When should I use autojunk=False?
Disable autojunk when comparing long sequences in which heavily repeated lines (blank lines, braces, boilerplate) should still count as matches. The heuristic only activates when the second sequence has 200 or more elements; it then marks any element appearing in more than 1% of that sequence as junk and excludes it from match anchoring. In a long config or log file, a blank line can easily cross that threshold and be ignored, deflating the similarity ratio. Pass SequenceMatcher(None, a, b, autojunk=False) to disable this behaviour.
How do I generate a side-by-side HTML diff?
Use difflib.HtmlDiff(). Call HtmlDiff().make_file(a_lines, b_lines, fromdesc, todesc) to get a complete HTML page with side-by-side tables, highlighting, and legend. Save it to a .html file and open it in a browser. This is useful for generating code review reports: just call make_file() in a loop for each changed file and write the outputs to a folder.
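A minimal sketch (the output file name diff_report.html is arbitrary):

```python
# html_diff_report.py
import difflib

old = ["host = localhost\n", "port = 5432\n"]
new = ["host = db.prod.example.com\n", "port = 5432\n"]

# wrapcolumn soft-wraps long lines inside the side-by-side table cells
html = difflib.HtmlDiff(wrapcolumn=70).make_file(
    old, new, fromdesc="template.conf", todesc="deployed.conf"
)
with open("diff_report.html", "w", encoding="utf-8") as fh:
    fh.write(html)
print(f"Wrote {len(html)} bytes of HTML")
```

Pass context=True and numlines=N to make_file() to show only changed regions instead of the full files.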
How does difflib compare to Levenshtein distance?
Levenshtein distance counts the minimum number of character-level edits (insertions, deletions, substitutions) between two strings. difflib uses the Ratcliff/Obershelp algorithm, which recursively finds the longest matching blocks. Levenshtein is a better fit for single-word spell checking, where the edit count maps directly to the number of typos. difflib is better for multi-line text comparison because it anchors on long matching regions, producing diffs that read naturally. For production spell checkers, use the python-Levenshtein or rapidfuzz library; both are implemented in C and are significantly faster than difflib.
Is difflib fast enough for large files?
For files up to a few thousand lines, difflib is fast enough for interactive use. For larger inputs, reuse a single SequenceMatcher, setting the common sequence once with set_seq2() (difflib caches detail about the second sequence) and swapping candidates with set_seq1(), or switch to a C-backed library such as rapidfuzz for pure string similarity. The autojunk heuristic is the other performance lever: it is on by default and speeds up comparison of long files with heavily repeated lines (like logs), at the cost of sometimes skewing the ratio.
Conclusion
Python’s difflib module provides a complete text comparison toolkit without any external dependencies. You have learned how to compute similarity ratios with SequenceMatcher, generate unified and context diffs for change tracking, use get_close_matches() for fuzzy string lookup, and build a config auditor that detects and reports configuration drift. The Ratcliff/Obershelp algorithm at difflib’s core handles multi-line block moves well, making it a natural fit for file comparison tasks.
Extend the config auditor by adding an HTML diff output (using HtmlDiff().make_file()), integrating it into a CI pipeline to fail builds when critical settings drift from the template, or adapting the fuzzy matcher into a search autocomplete feature for a command-line tool. All three extensions build directly on what you have learned here.
For the full API reference, see the official difflib documentation.