Intermediate
You’re building a search feature and a user types “cafe”, but the database stores “café” with the accent mark — so the lookup silently fails. Or you’re cleaning user-submitted data and need to strip diacritics to normalize names. Or you’re processing text from multiple sources and the “same” character appears in different Unicode representations: a composed form where the accent is part of the character, and a decomposed form where it’s a separate combining character. These aren’t edge cases; they’re everyday Unicode problems that trip up Python developers regularly.
Python’s built-in unicodedata module gives you direct access to the Unicode Character Database — the official reference for every character’s name, category, numeric value, and decomposition. It also provides the normalize() function that converts between the four Unicode normalization forms (NFC, NFD, NFKC, NFKD). No third-party packages needed.
In this tutorial, you’ll learn how to normalize Unicode strings with unicodedata.normalize(), look up character properties with name(), category(), and decimal(), strip diacritics to create ASCII-safe slugs, detect and filter characters by category, and build a robust Unicode text cleaning pipeline. By the end, you’ll be able to handle international text data reliably.
Python unicodedata: Quick Example
Here’s the classic problem — the same word in two different Unicode representations — and how to normalize it:
# unicodedata_quick.py
import unicodedata
# "cafe" in two forms -- visually identical but different bytes
composed = "caf\u00e9"    # NFC: é as a single precomposed char U+00E9
decomposed = "cafe\u0301" # NFD: e followed by combining acute U+0301

# ascii() escapes non-ASCII, so the two forms are visibly different
print(f"Composed: {ascii(composed)} len={len(composed)}")
print(f"Decomposed: {ascii(decomposed)} len={len(decomposed)}")
print(f"Equal? {composed == decomposed}")

# Normalize both to NFC -- canonical composed form
nfc_1 = unicodedata.normalize("NFC", composed)
nfc_2 = unicodedata.normalize("NFC", decomposed)
print(f"After NFC: {ascii(nfc_1)} == {ascii(nfc_2)} -> {nfc_1 == nfc_2}")
# Get character info
char = "\u00e9"
print(f"\nCharacter: {char}")
print(f"Name: {unicodedata.name(char)}")
print(f"Category: {unicodedata.category(char)}")
Output:
Composed: 'caf\xe9' len=4
Decomposed: 'cafe\u0301' len=5
Equal? False
After NFC: 'caf\xe9' == 'caf\xe9' -> True
Character: é
Name: LATIN SMALL LETTER E WITH ACUTE
Category: Ll
The two strings are visually identical (both display as “café”), but Python’s == operator treats them as different because they contain different code point sequences. After normalizing both to NFC, they compare equal. This is the root cause of “mysterious” string comparison bugs in internationalized applications.
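One defensive pattern is to normalize at every comparison boundary. Here’s a minimal sketch (the helper name nfc_equal is illustrative, not from the standard library):

```python
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "caf\u00e9"     # é as one code point
decomposed = "cafe\u0301"  # e + combining acute

print(composed == decomposed)       # False -- raw comparison fails
print(nfc_equal(composed, decomposed))  # True -- normalized comparison succeeds
```

The same idea applies to dictionary keys and set membership: normalize once on the way in, and every later lookup uses the canonical form.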
What Is Unicode Normalization?
Unicode defines multiple ways to represent the same text. The character “é” can be a single precomposed character (U+00E9, “LATIN SMALL LETTER E WITH ACUTE”) or a base character “e” followed by a combining acute accent (U+0301). Both are valid Unicode; they just have different code point sequences. Normalization converts text to one canonical form so comparisons and processing work correctly.
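You can inspect a character’s decomposition directly with unicodedata.decomposition(), which returns the replacement code points as a hex string (empty if the character doesn’t decompose):

```python
import unicodedata

# Canonical decomposition: é -> 'e' + combining acute
print(unicodedata.decomposition("\u00e9"))   # '0065 0301'
# Compatibility decomposition: the fi ligature -> 'f' + 'i'
print(unicodedata.decomposition("\ufb01"))   # '<compat> 0066 0069'
# Plain 'e' has no decomposition
print(unicodedata.decomposition("e"))        # ''
```

The `<compat>` prefix marks compatibility mappings — exactly the ones that only NFKC/NFKD apply.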
| Form | Name | What It Does | When To Use |
|---|---|---|---|
| NFC | Canonical Composed | Decomposes then recomposes characters | Storage, comparison, web output |
| NFD | Canonical Decomposed | Decomposes into base + combining chars | Stripping diacritics (then filter combining) |
| NFKC | Compatibility Composed | NFC + normalizes compatibility chars (e.g. ligatures) | Search indexing, slug generation |
| NFKD | Compatibility Decomposed | NFD + normalizes compatibility chars | Aggressive text cleaning |
The key practical distinction: NFC/NFD preserve all characters (they only change how characters are composed and ordered), while NFKC/NFKD also replace “compatibility equivalents” — for example, the Roman numeral “Ⅲ” (U+2162) becomes the three-letter string “III”, and the ligature “ﬁ” (U+FB01) becomes “fi”. Use NFC for storing and comparing text; use NFKD for generating slugs and search indexes.
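You can verify the difference directly:

```python
import unicodedata

# Compatibility characters collapse to plain text under NFKC/NFKD
print(unicodedata.normalize("NFKC", "\u2162"))    # Roman numeral three -> 'III'
print(unicodedata.normalize("NFKC", "\ufb01ne"))  # fi ligature -> 'fine'

# NFC leaves compatibility characters untouched
print(unicodedata.normalize("NFC", "\ufb01ne"))   # the ligature survives
```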
Looking Up Character Properties
The unicodedata module exposes the full Unicode Character Database. For any character, you can get its official name, general category, numeric value, bidirectional class, and more. These properties let you write character-type tests that work across all scripts, not just ASCII.
# unicodedata_properties.py
import unicodedata
chars = [
    "A", "a", "5", " ", "!", "\n",
    "\u00e9",  # e with acute (Latin)
    "\u03b1",  # Greek small letter alpha
    "\u0660",  # Arabic-Indic digit zero
    "\u4e2d",  # CJK character (zhong, "middle")
    "\u200b",  # Zero-width space
    "\u0301",  # Combining acute accent
]

print(f"{'Char':<8} {'Category':<10} {'Name'}")
print("-" * 60)
for c in chars:
    name = unicodedata.name(c, "(no name)")
    cat = unicodedata.category(c)
    # ascii() escapes chars that are invisible or would combine with the quote
    display = ascii(c) if c in (" ", "\n", "\u200b", "\u0301") else c
    print(f"{display:<8} {cat:<10} {name}")
Output:
Char     Category   Name
------------------------------------------------------------
A        Lu         LATIN CAPITAL LETTER A
a        Ll         LATIN SMALL LETTER A
5        Nd         DIGIT FIVE
' '      Zs         SPACE
'\n'     Cc         (no name)
!        Po         EXCLAMATION MARK
é        Ll         LATIN SMALL LETTER E WITH ACUTE
α        Ll         GREEK SMALL LETTER ALPHA
٠        Nd         ARABIC-INDIC DIGIT ZERO
中       Lo         CJK UNIFIED IDEOGRAPH-4E2D
'\u200b' Cf         ZERO WIDTH SPACE
'\u0301' Mn         COMBINING ACUTE ACCENT
The two-letter category codes are key for script-agnostic character classification: Lu = uppercase letter, Ll = lowercase letter, Nd = decimal digit, Zs = space separator, Mn = non-spacing mark (combining character), Cf = format character. Using category codes instead of character ranges means your code handles Arabic, Greek, and CJK text without any modifications.
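As a quick illustration of working with category codes (the helper name category_histogram is just for this example), you can tally the categories present in a string with collections.Counter:

```python
import unicodedata
from collections import Counter

def category_histogram(text):
    """Count Unicode general categories appearing in a string."""
    return Counter(unicodedata.category(c) for c in text)

hist = category_histogram("Hello, world! 42")
print(hist)  # lowercase letters (Ll) dominate; Po, Zs, Nd, Lu also appear
```

A histogram like this is a cheap first pass when profiling scraped or imported text: an unexpected spike in Cf or Cn characters is a strong hint that the data needs cleaning.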
Stripping Diacritics and Creating Slugs
A common need in web applications is converting Unicode text to ASCII-safe slugs for URLs, filenames, or database keys. The standard approach uses NFD normalization (which splits composed characters into base + combining marks) followed by filtering out all combining mark characters (category Mn).
# unicodedata_slugify.py
import unicodedata
import re
def remove_diacritics(text):
    """
    Remove diacritical marks (accents) from text.
    'café' -> 'cafe', 'Müller' -> 'Muller'
    """
    # NFD splits 'é' into 'e' + combining acute
    nfd = unicodedata.normalize("NFD", text)
    # Filter out all combining marks (category Mn)
    return "".join(c for c in nfd if unicodedata.category(c) != "Mn")

def slugify(text):
    """
    Convert Unicode text to a URL-safe ASCII slug.
    'Café de Paris!' -> 'cafe-de-paris'
    """
    # NFKD for compatibility decomposition (handles ligatures etc.)
    text = unicodedata.normalize("NFKD", text)
    # Remove combining marks
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    # Convert to ASCII, ignoring non-ASCII characters
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Lowercase and replace non-alphanumeric with hyphens
    text = re.sub(r"[^\w\s-]", "", text.lower())
    text = re.sub(r"[-\s]+", "-", text).strip("-")
    return text
# Test cases
test_strings = [
    "Café de Paris",
    "Müller und Schröder",
    "Español: Muñoz, Sánchez",
    "Japanese: 日本語 (not convertible to ASCII)",
    "Ligatures: \ufb01ne (fi ligature)",
    "Symbols: C++ & Python 3.12",
]

print("=== remove_diacritics ===")
for s in test_strings[:4]:
    print(f" {s!r}")
    print(f" -> {remove_diacritics(s)!r}\n")

print("=== slugify ===")
for s in test_strings:
    print(f" {s[:40]!r} -> {slugify(s)!r}")
Output:
=== remove_diacritics ===
 'Café de Paris'
 -> 'Cafe de Paris'

 'Müller und Schröder'
 -> 'Muller und Schroder'

 'Español: Muñoz, Sánchez'
 -> 'Espanol: Munoz, Sanchez'

 'Japanese: 日本語 (not convertible to ASCII)'
 -> 'Japanese: 日本語 (not convertible to ASCII)'

=== slugify ===
 'Café de Paris' -> 'cafe-de-paris'
 'Müller und Schröder' -> 'muller-und-schroder'
 'Español: Muñoz, Sánchez' -> 'espanol-munoz-sanchez'
 'Japanese: 日本語 (not convertible to ASCII)' -> 'japanese-not-convertible-to-ascii'
 'Ligatures: ﬁne (fi ligature)' -> 'ligatures-fine-fi-ligature'
 'Symbols: C++ & Python 3.12' -> 'symbols-c-python-312'
The NFKD normalization correctly converts the "ﬁ" ligature (U+FB01) to the two-character sequence "fi" before the slug is built. CJK characters (Japanese, Chinese, Korean) have no ASCII equivalents, so they're dropped by the encode("ascii", errors="ignore") step -- acceptable for slugs, but note that the original text should be stored separately for display.
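One practical consequence: a slug built this way from pure CJK input comes out empty, so production code usually needs a fallback. A hedged sketch (safe_slug and the uuid-based fallback are illustrative choices, not from the tutorial):

```python
import re
import unicodedata
import uuid

def slugify(text):
    """NFKD + strip combining marks + ASCII-fold + hyphenate (as above)."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = re.sub(r"[^\w\s-]", "", text.lower())
    return re.sub(r"[-\s]+", "-", text).strip("-")

def safe_slug(text):
    """Fall back to a random suffix when the slug comes out empty (e.g. pure CJK input)."""
    slug = slugify(text)
    return slug or f"item-{uuid.uuid4().hex[:8]}"

print(safe_slug("Café de Paris"))  # 'cafe-de-paris'
print(safe_slug("日本語"))          # e.g. 'item-3f2a9c1b' -- random suffix
```

Other reasonable fallbacks are a database ID or a transliteration library; the key point is to check for the empty-slug case explicitly.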
Detecting Character Types
Using unicodedata.category(), you can write character type checks that work for any script. This is more reliable than character ranges like [a-zA-Z], which miss non-Latin letters entirely.
# unicodedata_categories.py
import unicodedata
def is_letter(c):
    """True for any Unicode letter (Latin, Greek, Arabic, CJK, etc.)"""
    return unicodedata.category(c).startswith("L")

def is_digit(c):
    """True for any Unicode digit (Arabic-Indic, Devanagari, etc.)"""
    return unicodedata.category(c) == "Nd"

def is_whitespace(c):
    """True for any Unicode whitespace (space, NBSP, ideographic space, etc.)"""
    return unicodedata.category(c).startswith("Z") or c in "\t\n\r\f\v"

def is_punctuation(c):
    """True for any Unicode punctuation mark"""
    return unicodedata.category(c).startswith("P")
# Test mixed-script text
# Test mixed-script text
samples = [
    ("A", "Latin letter"),
    ("\u03b1", "Greek letter alpha"),
    ("\u4e2d", "CJK character"),
    ("\u0660", "Arabic-Indic digit"),
    ("5", "ASCII digit"),
    ("\u00a0", "Non-breaking space"),
    ("\u3000", "Ideographic space"),
    ("\u2019", "Right single quotation mark"),
    ("\u200b", "Zero-width space"),
]

print(f"{'Char':<8} {'Letter?':<10} {'Digit?':<10} {'Space?':<10} {'Punct?'}")
print("-" * 55)
for c, desc in samples:
    display = repr(c) if unicodedata.category(c) in ("Cf", "Zs") else c
    print(f"{display:<8} {is_letter(c)!s:<10} {is_digit(c)!s:<10} {is_whitespace(c)!s:<10} {is_punctuation(c)!s}")
Output:
Char     Letter?    Digit?     Space?     Punct?
-------------------------------------------------------
A        True       False      False      False
α        True       False      False      False
中       True       False      False      False
٠        False      True       False      False
5        False      True       False      False
'\xa0'   False      False      True       False
'\u3000' False      False      True       False
’        False      False      False      True
'\u200b' False      False      False      False
Notice that zero-width space (U+200B) is neither a letter, digit, space, nor punctuation -- its category is Cf (format character). This is exactly the kind of invisible character that can silently corrupt text processing. The category check lets you detect and strip these: "".join(c for c in text if unicodedata.category(c) != "Cf").
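Building on that idea, a small audit helper can report where the invisible characters hide before you strip them (find_invisible is an illustrative name for this sketch):

```python
import unicodedata

def find_invisible(text):
    """Report format (Cf) and control (Cc) characters with positions and names."""
    return [
        (i, f"U+{ord(c):04X}", unicodedata.name(c, "(no name)"))
        for i, c in enumerate(text)
        if unicodedata.category(c) in ("Cf", "Cc")
    ]

print(find_invisible("Hello\u200b World\ufeff"))
# [(5, 'U+200B', 'ZERO WIDTH SPACE'), (12, 'U+FEFF', 'ZERO WIDTH NO-BREAK SPACE')]
```

Logging these hits, rather than silently stripping them, makes it much easier to trace which upstream source keeps injecting the junk.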
Real-Life Example: Unicode Text Cleaning Pipeline
This pipeline normalizes and cleans text from diverse sources (web scraping, user input, CSV imports) into a consistent, processable form.
# unicode_cleaner.py
import unicodedata
import re
# Category prefixes for invisible/formatting characters to strip
STRIP_CATEGORIES = {"Cf", "Cc"}  # format and control chars (except those in KEEP_CONTROL)
KEEP_CONTROL = {"\n", "\t", "\r"}
def clean_unicode_text(text, normalize_form="NFC", strip_accents=False, ascii_only=False):
    """
    Clean and normalize Unicode text.

    normalize_form: NFC (default), NFD, NFKC, or NFKD
    strip_accents: Remove diacritical marks (NFD must be applied first)
    ascii_only: Encode to ASCII, drop non-ASCII chars
    """
    if not isinstance(text, str):
        raise TypeError(f"Expected str, got {type(text).__name__}")
    # Step 1: Normalize
    text = unicodedata.normalize(normalize_form, text)
    # Step 2: Strip combining marks (diacritics) if requested
    # Note: only makes sense after NFD or NFKD normalization
    if strip_accents:
        if normalize_form not in ("NFD", "NFKD"):
            text = unicodedata.normalize("NFD", text)
        text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    # Step 3: Remove invisible formatting characters (zero-width space, BOM, etc.)
    text = "".join(
        c for c in text
        if unicodedata.category(c) not in STRIP_CATEGORIES or c in KEEP_CONTROL
    )
    # Step 4: Normalize whitespace (collapse multiple spaces, map NBSP and friends to space)
    text = re.sub(r"[\u00a0\u202f\u2009\u2008\u2007\u2006\u2005\u2004\u2003\u2002\u2001\u2000]", " ", text)
    text = re.sub(r" {2,}", " ", text)
    # Step 5: ASCII-only output if requested
    if ascii_only:
        text = text.encode("ascii", errors="ignore").decode("ascii")
    return text.strip()
# Sample dirty text from a web scrape
samples = [
    "caf\u00e9 vs caf\u0065\u0301",  # composed vs decomposed
    "Hello\u200b World",             # zero-width space between words
    "Price:\u00a0$9.99",             # non-breaking space
    "Se\u00f1or Mu\u00f1oz",         # Spanish with tildes
    "Python\ufeff Tutorial",         # BOM character mid-string
    "Multi\n\tline\ttext",           # control chars to keep
]

print(f"{'Original':<40} {'Cleaned (NFC)'}")
print("-" * 80)
for s in samples:
    cleaned = clean_unicode_text(s, normalize_form="NFC")
    # ascii() escapes non-ASCII, so the invisible differences show up in the output
    print(f"{ascii(s):<40} {ascii(cleaned)}")

print("\n=== strip_accents=True ===")
for s in samples[3:4]:
    print(f" Original: {ascii(s)}")
    print(f" Cleaned: {ascii(clean_unicode_text(s, normalize_form='NFD', strip_accents=True))}")
Output:
Original                                 Cleaned (NFC)
--------------------------------------------------------------------------------
'caf\xe9 vs cafe\u0301'                  'caf\xe9 vs caf\xe9'
'Hello\u200b World'                      'Hello World'
'Price:\xa0$9.99'                        'Price: $9.99'
'Se\xf1or Mu\xf1oz'                      'Se\xf1or Mu\xf1oz'
'Python\ufeff Tutorial'                  'Python Tutorial'
'Multi\n\tline\ttext'                    'Multi\n\tline\ttext'
=== strip_accents=True ===
Original: 'Se\xf1or Mu\xf1oz'
Cleaned: 'Senor Munoz'
The pipeline handles five distinct problems in order: normalization form unification, optional diacritic stripping, invisible character removal, non-standard whitespace normalization, and optional ASCII-only output. Running all incoming text through this pipeline before storage or comparison eliminates a whole class of hard-to-debug string matching bugs.
Frequently Asked Questions
When should I use NFC vs NFKC for normalization?
Use NFC for general text storage and display -- it's the most compact composed form and what most operating systems and web browsers produce. Use NFKC when building a search index, generating slugs, or comparing text where "compatibility equivalents" (ligatures, circled letters, Roman numerals as single characters) should match their plain-text equivalents. NFKC is more aggressive: it changes the identity of certain characters (the ligature "ﬁ" becomes "fi"), so only apply it to normalized search keys, not to the stored display text.
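A sketch of that search-key approach (search_key is an illustrative name; pairing NFKC with str.casefold for case-insensitive matching is a common convention, not part of unicodedata itself):

```python
import unicodedata

def search_key(text):
    """Build an aggressive matching key: NFKC compatibility fold + casefold."""
    return unicodedata.normalize("NFKC", text).casefold()

# Ligature and plain spellings collapse to the same key
print(search_key("\ufb01ne"))  # 'fine'
print(search_key("FINE"))      # 'fine'
```

Store this key in a separate indexed column and keep the NFC original for display.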
Where can I find the full list of Unicode category codes?
The Unicode standard defines 30 general categories grouped under 7 major groups: L (letters), M (marks), N (numbers), P (punctuation), S (symbols), Z (separators), and C (other/control). The two-letter code is major + minor category, e.g. Lu = Letter Uppercase, Nd = Number Decimal Digit. The full table is in Unicode Standard Annex #44 at unicode.org, or search for "Unicode general category" in the Python documentation.
What's the difference between unicodedata.decimal() and unicodedata.numeric()?
decimal(char) returns an integer for characters that represent decimal digits (0-9) in any script -- Arabic-Indic "٠" returns 0, ASCII "5" returns 5. For non-digit characters it raises ValueError unless you pass a default. digit(char) is similar but also covers superscript/subscript digit forms. numeric(char) is the broadest: it returns a float for any character with a numeric value, including fractions ("½", U+00BD, returns 0.5) and Roman numerals. Use decimal() when you want to extract actual digit values from text, and str.isnumeric() when you just want a boolean test.
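A quick demonstration of the three lookups side by side:

```python
import unicodedata

print(unicodedata.decimal("5"))       # 5
print(unicodedata.digit("\u00b2"))    # 2 -- superscript two; decimal() would raise
print(unicodedata.numeric("\u00bd"))  # 0.5 -- vulgar fraction one half
print(unicodedata.numeric("\u2162"))  # 3.0 -- Roman numeral three

try:
    unicodedata.decimal("x")
except ValueError:
    print("decimal() raises ValueError without a default")
print(unicodedata.decimal("x", None))  # None -- the supplied default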
How do I detect and fix Unicode corruption (mojibake)?
Mojibake (garbled text from encoding mismatches) typically happens when UTF-8 bytes are decoded as latin-1 or Windows-1252. The character "é" in UTF-8 is the byte pair 0xC3 0xA9, which decoded as latin-1 reads as "Ã©". The ftfy library (pip install ftfy) automatically detects and fixes most mojibake patterns. With unicodedata alone, you can spot likely mojibake by checking for suspicious sequences -- characters like "Ã" ("LATIN CAPITAL LETTER A WITH TILDE") immediately followed by symbol characters rarely appear in ordinary prose.
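When you know the mismatch was exactly UTF-8 bytes decoded as latin-1, the damage can even be reversed by hand with the inverse round trip (a narrow fix; ftfy handles the messier real-world cases):

```python
# Simulate the classic corruption: UTF-8 bytes decoded as latin-1
garbled = "café".encode("utf-8").decode("latin-1")
print(garbled)  # 'cafÃ©'

# Reverse the round trip -- only safe when the mismatch really was UTF-8/latin-1
fixed = garbled.encode("latin-1").decode("utf-8")
print(fixed)    # 'café'
```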
Does unicodedata help with right-to-left text (Arabic, Hebrew)?
The unicodedata.bidirectional(char) function returns the bidirectional category of a character: "L" for left-to-right (Latin), "R" for right-to-left (Hebrew letters), "AL" for Arabic letters, "AN" for Arabic numbers, and so on. This is useful for detecting the text direction of a string and for flagging bidirectional override characters (categories "RLO" and "LRO") that have been abused in security attacks (the "trojan source" vulnerability). Always strip bidirectional override characters from untrusted input.
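A minimal filter for those control characters might look like this (strip_bidi_controls is an illustrative name, and the exact set of bidirectional classes to strip is a policy choice; adjust it to your threat model):

```python
import unicodedata

# Bidirectional classes of the explicit embedding/override/isolate controls
BIDI_CONTROLS = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def strip_bidi_controls(text):
    """Drop explicit bidi control characters from untrusted input."""
    return "".join(c for c in text if unicodedata.bidirectional(c) not in BIDI_CONTROLS)

# U+202E (RIGHT-TO-LEFT OVERRIDE) can make a filename render misleadingly
payload = "file\u202ecod.exe"
print(strip_bidi_controls(payload))  # 'filecod.exe'
```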
Conclusion
The unicodedata module gives you direct access to the Unicode Character Database from Python's standard library. The key tools are normalize() for ensuring consistent text representation (NFC for storage, NFKD for slug generation), category() for script-agnostic character classification, name() for debugging and introspection, and decimal()/numeric() for extracting numeric values from any script's digit characters. The text cleaning pipeline above is a production-ready foundation -- extend it with the ftfy library for mojibake correction or langdetect for script-aware processing.
For comprehensive international text handling beyond what unicodedata provides -- transliteration, locale-aware collation, and script detection -- look at the PyICU library, which provides Python bindings to IBM's International Components for Unicode.
Official documentation: https://docs.python.org/3/library/unicodedata.html