Intermediate
You’re building a search feature and a user types “cafe”, but the database stores “café” with the accent mark — so the lookup silently fails. Or you’re cleaning user-submitted data and need to strip diacritics to normalize names. Or you’re processing text from multiple sources and the “same” character appears in different Unicode representations: a composed form where the accent is part of the character, and a decomposed form where it’s a separate combining character. These aren’t edge cases; they’re everyday Unicode problems that trip up Python developers regularly.
Python’s built-in unicodedata module gives you direct access to the Unicode Character Database — the official reference for every character’s name, category, numeric value, and decomposition. It also provides the normalize() function that converts between the four Unicode normalization forms (NFC, NFD, NFKC, NFKD). No third-party packages needed.
In this tutorial, you’ll learn how to normalize Unicode strings with unicodedata.normalize(), look up character properties with name(), category(), and decimal(), strip diacritics to create ASCII-safe slugs, detect and filter characters by category, and build a robust Unicode text cleaning pipeline. By the end, you’ll be able to handle international text data reliably.
Python unicodedata: Quick Example
Here’s the classic problem — the same word in two different Unicode representations — and how to normalize it:
# unicodedata_quick.py
import unicodedata
# "cafe" in two forms -- visually identical but different bytes
composed = "caf\u00e9"    # NFC: é as a single precomposed char U+00E9
decomposed = "cafe\u0301" # NFD: e followed by combining acute U+0301

# ascii() escapes non-ASCII, so the two forms are visibly different
print(f"Composed: {ascii(composed)} len={len(composed)}")
print(f"Decomposed: {ascii(decomposed)} len={len(decomposed)}")
print(f"Equal? {composed == decomposed}")

# Normalize both to NFC -- canonical composed form
nfc_1 = unicodedata.normalize("NFC", composed)
nfc_2 = unicodedata.normalize("NFC", decomposed)
print(f"After NFC: {ascii(nfc_1)} == {ascii(nfc_2)} -> {nfc_1 == nfc_2}")
# Get character info
char = "\u00e9"
print(f"\nCharacter: {char}")
print(f"Name: {unicodedata.name(char)}")
print(f"Category: {unicodedata.category(char)}")
Output:
Composed: 'caf\xe9' len=4
Decomposed: 'cafe\u0301' len=5
Equal? False
After NFC: 'caf\xe9' == 'caf\xe9' -> True
Character: é
Name: LATIN SMALL LETTER E WITH ACUTE
Category: Ll
The two strings are visually identical (both display as “café”), but Python’s == operator treats them as different because they contain different code point sequences. After normalizing both to NFC, they compare equal. This is the root cause of “mysterious” string comparison bugs in internationalized applications.
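One defensive pattern is to normalize at every comparison boundary. Here’s a minimal sketch (the helper name nfc_equal is illustrative, not from the standard library):

```python
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "caf\u00e9"     # é as one code point
decomposed = "cafe\u0301"  # e + combining acute

print(composed == decomposed)       # False -- raw comparison fails
print(nfc_equal(composed, decomposed))  # True -- normalized comparison succeeds
```

The same idea applies to dictionary keys and set membership: normalize once on the way in, and every later lookup uses the canonical form.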
What Is Unicode Normalization?
Unicode defines multiple ways to represent the same text. The character “é” can be a single precomposed character (U+00E9, “LATIN SMALL LETTER E WITH ACUTE”) or a base character “e” followed by a combining acute accent (U+0301). Both are valid Unicode; they just have different code point sequences. Normalization converts text to one canonical form so comparisons and processing work correctly.
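You can inspect a character’s decomposition directly with unicodedata.decomposition(), which returns the replacement code points as a hex string (empty if the character doesn’t decompose):

```python
import unicodedata

# Canonical decomposition: é -> 'e' + combining acute
print(unicodedata.decomposition("\u00e9"))   # '0065 0301'
# Compatibility decomposition: the fi ligature -> 'f' + 'i'
print(unicodedata.decomposition("\ufb01"))   # '<compat> 0066 0069'
# Plain 'e' has no decomposition
print(unicodedata.decomposition("e"))        # ''
```

The `<compat>` prefix marks compatibility mappings — exactly the ones that only NFKC/NFKD apply.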
| Form | Name | What It Does | When To Use |
|---|---|---|---|
| NFC | Canonical Composed | Decomposes then recomposes characters | Storage, comparison, web output |
| NFD | Canonical Decomposed | Decomposes into base + combining chars | Stripping diacritics (then filter combining) |
| NFKC | Compatibility Composed | NFC + normalizes compatibility chars (e.g. ligatures) | Search indexing, slug generation |
| NFKD | Compatibility Decomposed | NFD + normalizes compatibility chars | Aggressive text cleaning |
The key practical distinction: NFC/NFD preserve all characters (they only change how characters are composed and ordered), while NFKC/NFKD also replace “compatibility equivalents” — for example, the Roman numeral “Ⅲ” (U+2162) becomes the three-letter string “III”, and the ligature “ﬁ” (U+FB01) becomes “fi”. Use NFC for storing and comparing text; use NFKD for generating slugs and search indexes.
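You can verify the difference directly:

```python
import unicodedata

# Compatibility characters collapse to plain text under NFKC/NFKD
print(unicodedata.normalize("NFKC", "\u2162"))    # Roman numeral three -> 'III'
print(unicodedata.normalize("NFKC", "\ufb01ne"))  # fi ligature -> 'fine'

# NFC leaves compatibility characters untouched
print(unicodedata.normalize("NFC", "\ufb01ne"))   # the ligature survives
```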
Looking Up Character Properties
The unicodedata module exposes the full Unicode Character Database. For any character, you can get its official name, general category, numeric value, bidirectional class, and more. These properties let you write character-type tests that work across all scripts, not just ASCII.
# unicodedata_properties.py
import unicodedata
chars = [
    "A", "a", "5", " ", "!", "\n",
    "\u00e9",  # e with acute (Latin)
    "\u03b1",  # Greek small letter alpha
    "\u0660",  # Arabic-Indic digit zero
    "\u4e2d",  # CJK character (zhong, "middle")
    "\u200b",  # Zero-width space
    "\u0301",  # Combining acute accent
]

print(f"{'Char':<8} {'Category':<10} {'Name'}")
print("-" * 60)
for c in chars:
    name = unicodedata.name(c, "(no name)")
    cat = unicodedata.category(c)
    # ascii() escapes chars that are invisible or would combine with the quote
    display = ascii(c) if c in (" ", "\n", "\u200b", "\u0301") else c
    print(f"{display:<8} {cat:<10} {name}")
Output:
Char     Category   Name
------------------------------------------------------------
A        Lu         LATIN CAPITAL LETTER A
a        Ll         LATIN SMALL LETTER A
5        Nd         DIGIT FIVE
' '      Zs         SPACE
'\n'     Cc         (no name)
!        Po         EXCLAMATION MARK
é        Ll         LATIN SMALL LETTER E WITH ACUTE
α        Ll         GREEK SMALL LETTER ALPHA
٠        Nd         ARABIC-INDIC DIGIT ZERO
中       Lo         CJK UNIFIED IDEOGRAPH-4E2D
'\u200b' Cf         ZERO WIDTH SPACE
'\u0301' Mn         COMBINING ACUTE ACCENT
The two-letter category codes are key for script-agnostic character classification: Lu = uppercase letter, Ll = lowercase letter, Nd = decimal digit, Zs = space separator, Mn = non-spacing mark (combining character), Cf = format character. Using category codes instead of character ranges means your code handles Arabic, Greek, and CJK text without any modifications.
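As a quick illustration of working with category codes (the helper name category_histogram is just for this example), you can tally the categories present in a string with collections.Counter:

```python
import unicodedata
from collections import Counter

def category_histogram(text):
    """Count Unicode general categories appearing in a string."""
    return Counter(unicodedata.category(c) for c in text)

hist = category_histogram("Hello, world! 42")
print(hist)  # lowercase letters (Ll) dominate; Po, Zs, Nd, Lu also appear
```

A histogram like this is a cheap first pass when profiling scraped or imported text: an unexpected spike in Cf or Cn characters is a strong hint that the data needs cleaning.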
Stripping Diacritics and Creating Slugs
A common need in web applications is converting Unicode text to ASCII-safe slugs for URLs, filenames, or database keys. The standard approach uses NFD normalization (which splits composed characters into base + combining marks) followed by filtering out all combining mark characters (category Mn).
# unicodedata_slugify.py
import unicodedata
import re
def remove_diacritics(text):
    """
    Remove diacritical marks (accents) from text.
    'café' -> 'cafe', 'Müller' -> 'Muller'
    """
    # NFD splits 'é' into 'e' + combining acute
    nfd = unicodedata.normalize("NFD", text)
    # Filter out all combining marks (category Mn)
    return "".join(c for c in nfd if unicodedata.category(c) != "Mn")

def slugify(text):
    """
    Convert Unicode text to a URL-safe ASCII slug.
    'Café de Paris!' -> 'cafe-de-paris'
    """
    # NFKD for compatibility decomposition (handles ligatures etc.)
    text = unicodedata.normalize("NFKD", text)
    # Remove combining marks
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    # Convert to ASCII, ignoring non-ASCII characters
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Lowercase and replace non-alphanumeric with hyphens
    text = re.sub(r"[^\w\s-]", "", text.lower())
    text = re.sub(r"[-\s]+", "-", text).strip("-")
    return text
# Test cases
test_strings = [
    "Café de Paris",
    "Müller und Schröder",
    "Español: Muñoz, Sánchez",
    "Japanese: 日本語 (not convertible to ASCII)",
    "Ligatures: \ufb01ne (fi ligature)",
    "Symbols: C++ & Python 3.12",
]

print("=== remove_diacritics ===")
for s in test_strings[:4]:
    print(f" {s!r}")
    print(f" -> {remove_diacritics(s)!r}\n")

print("=== slugify ===")
for s in test_strings:
    print(f" {s[:40]!r} -> {slugify(s)!r}")
Output:
=== remove_diacritics ===
 'Café de Paris'
 -> 'Cafe de Paris'

 'Müller und Schröder'
 -> 'Muller und Schroder'

 'Español: Muñoz, Sánchez'
 -> 'Espanol: Munoz, Sanchez'

 'Japanese: 日本語 (not convertible to ASCII)'
 -> 'Japanese: 日本語 (not convertible to ASCII)'

=== slugify ===
 'Café de Paris' -> 'cafe-de-paris'
 'Müller und Schröder' -> 'muller-und-schroder'
 'Español: Muñoz, Sánchez' -> 'espanol-munoz-sanchez'
 'Japanese: 日本語 (not convertible to ASCII)' -> 'japanese-not-convertible-to-ascii'
 'Ligatures: ﬁne (fi ligature)' -> 'ligatures-fine-fi-ligature'
 'Symbols: C++ & Python 3.12' -> 'symbols-c-python-312'
The NFKD normalization correctly converts the "ﬁ" ligature (U+FB01) to the two-character sequence "fi" before the slug is built. CJK characters (Japanese, Chinese, Korean) have no ASCII equivalents, so they're dropped by the encode("ascii", errors="ignore") step -- acceptable for slugs, but note that the original text should be stored separately for display.
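One practical consequence: a slug built this way from pure CJK input comes out empty, so production code usually needs a fallback. A hedged sketch (safe_slug and the uuid-based fallback are illustrative choices, not from the tutorial):

```python
import re
import unicodedata
import uuid

def slugify(text):
    """NFKD + strip combining marks + ASCII-fold + hyphenate (as above)."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = re.sub(r"[^\w\s-]", "", text.lower())
    return re.sub(r"[-\s]+", "-", text).strip("-")

def safe_slug(text):
    """Fall back to a random suffix when the slug comes out empty (e.g. pure CJK input)."""
    slug = slugify(text)
    return slug or f"item-{uuid.uuid4().hex[:8]}"

print(safe_slug("Café de Paris"))  # 'cafe-de-paris'
print(safe_slug("日本語"))          # e.g. 'item-3f2a9c1b' -- random suffix
```

Other reasonable fallbacks are a database ID or a transliteration library; the key point is to check for the empty-slug case explicitly.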
Detecting Character Types
Using unicodedata.category(), you can write character type checks that work for any script. This is more reliable than character ranges like [a-zA-Z], which miss non-Latin letters entirely.
# unicodedata_categories.py
import unicodedata
def is_letter(c):
    """True for any Unicode letter (Latin, Greek, Arabic, CJK, etc.)"""
    return unicodedata.category(c).startswith("L")

def is_digit(c):
    """True for any Unicode digit (Arabic-Indic, Devanagari, etc.)"""
    return unicodedata.category(c) == "Nd"

def is_whitespace(c):
    """True for any Unicode whitespace (space, NBSP, ideographic space, etc.)"""
    return unicodedata.category(c).startswith("Z") or c in "\t\n\r\f\v"

def is_punctuation(c):
    """True for any Unicode punctuation mark"""
    return unicodedata.category(c).startswith("P")
# Test mixed-script text
# Test mixed-script text
samples = [
    ("A", "Latin letter"),
    ("\u03b1", "Greek letter alpha"),
    ("\u4e2d", "CJK character"),
    ("\u0660", "Arabic-Indic digit"),
    ("5", "ASCII digit"),
    ("\u00a0", "Non-breaking space"),
    ("\u3000", "Ideographic space"),
    ("\u2019", "Right single quotation mark"),
    ("\u200b", "Zero-width space"),
]

print(f"{'Char':<8} {'Letter?':<10} {'Digit?':<10} {'Space?':<10} {'Punct?'}")
print("-" * 55)
for c, desc in samples:
    display = repr(c) if unicodedata.category(c) in ("Cf", "Zs") else c
    print(f"{display:<8} {is_letter(c)!s:<10} {is_digit(c)!s:<10} {is_whitespace(c)!s:<10} {is_punctuation(c)!s}")
Output:
Char     Letter?    Digit?     Space?     Punct?
-------------------------------------------------------
A        True       False      False      False
α        True       False      False      False
中       True       False      False      False
٠        False      True       False      False
5        False      True       False      False
'\xa0'   False      False      True       False
'\u3000' False      False      True       False
’        False      False      False      True
'\u200b' False      False      False      False
Notice that zero-width space (U+200B) is neither a letter, digit, space, nor punctuation -- its category is Cf (format character). This is exactly the kind of invisible character that can silently corrupt text processing. The category check lets you detect and strip these: "".join(c for c in text if unicodedata.category(c) != "Cf").
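Building on that idea, a small audit helper can report where the invisible characters hide before you strip them (find_invisible is an illustrative name for this sketch):

```python
import unicodedata

def find_invisible(text):
    """Report format (Cf) and control (Cc) characters with positions and names."""
    return [
        (i, f"U+{ord(c):04X}", unicodedata.name(c, "(no name)"))
        for i, c in enumerate(text)
        if unicodedata.category(c) in ("Cf", "Cc")
    ]

print(find_invisible("Hello\u200b World\ufeff"))
# [(5, 'U+200B', 'ZERO WIDTH SPACE'), (12, 'U+FEFF', 'ZERO WIDTH NO-BREAK SPACE')]
```

Logging these hits, rather than silently stripping them, makes it much easier to trace which upstream source keeps injecting the junk.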
Real-Life Example: Unicode Text Cleaning Pipeline
This pipeline normalizes and cleans text from diverse sources (web scraping, user input, CSV imports) into a consistent, processable form.
# unicode_cleaner.py
import unicodedata
import re
# Category prefixes for invisible/formatting characters to strip
STRIP_CATEGORIES = {"Cf", "Cc"}  # format and control chars (except those in KEEP_CONTROL)
KEEP_CONTROL = {"\n", "\t", "\r"}
def clean_unicode_text(text, normalize_form="NFC", strip_accents=False, ascii_only=False):
    """
    Clean and normalize Unicode text.

    normalize_form: NFC (default), NFD, NFKC, or NFKD
    strip_accents: Remove diacritical marks (NFD must be applied first)
    ascii_only: Encode to ASCII, drop non-ASCII chars
    """
    if not isinstance(text, str):
        raise TypeError(f"Expected str, got {type(text).__name__}")
    # Step 1: Normalize
    text = unicodedata.normalize(normalize_form, text)
    # Step 2: Strip combining marks (diacritics) if requested
    # Note: only makes sense after NFD or NFKD normalization
    if strip_accents:
        if normalize_form not in ("NFD", "NFKD"):
            text = unicodedata.normalize("NFD", text)
        text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    # Step 3: Remove invisible formatting characters (zero-width space, BOM, etc.)
    text = "".join(
        c for c in text
        if unicodedata.category(c) not in STRIP_CATEGORIES or c in KEEP_CONTROL
    )
    # Step 4: Normalize whitespace (collapse multiple spaces, map NBSP and friends to space)
    text = re.sub(r"[\u00a0\u202f\u2009\u2008\u2007\u2006\u2005\u2004\u2003\u2002\u2001\u2000]", " ", text)
    text = re.sub(r" {2,}", " ", text)
    # Step 5: ASCII-only output if requested
    if ascii_only:
        text = text.encode("ascii", errors="ignore").decode("ascii")
    return text.strip()
# Sample dirty text from a web scrape
samples = [
    "caf\u00e9 vs caf\u0065\u0301",  # composed vs decomposed
    "Hello\u200b World",             # zero-width space between words
    "Price:\u00a0$9.99",             # non-breaking space
    "Se\u00f1or Mu\u00f1oz",         # Spanish with tildes
    "Python\ufeff Tutorial",         # BOM character mid-string
    "Multi\n\tline\ttext",           # control chars to keep
]

print(f"{'Original':<40} {'Cleaned (NFC)'}")
print("-" * 80)
for s in samples:
    cleaned = clean_unicode_text(s, normalize_form="NFC")
    # ascii() escapes non-ASCII, so the invisible differences show up in the output
    print(f"{ascii(s):<40} {ascii(cleaned)}")

print("\n=== strip_accents=True ===")
for s in samples[3:4]:
    print(f" Original: {ascii(s)}")
    print(f" Cleaned: {ascii(clean_unicode_text(s, normalize_form='NFD', strip_accents=True))}")
Output:
Original                                 Cleaned (NFC)
--------------------------------------------------------------------------------
'caf\xe9 vs cafe\u0301'                  'caf\xe9 vs caf\xe9'
'Hello\u200b World'                      'Hello World'
'Price:\xa0$9.99'                        'Price: $9.99'
'Se\xf1or Mu\xf1oz'                      'Se\xf1or Mu\xf1oz'
'Python\ufeff Tutorial'                  'Python Tutorial'
'Multi\n\tline\ttext'                    'Multi\n\tline\ttext'
=== strip_accents=True ===
Original: 'Se\xf1or Mu\xf1oz'
Cleaned: 'Senor Munoz'
The pipeline handles five distinct problems in order: normalization form unification, optional diacritic stripping, invisible character removal, non-standard whitespace normalization, and optional ASCII-only output. Running all incoming text through this pipeline before storage or comparison eliminates a whole class of hard-to-debug string matching bugs.
Frequently Asked Questions
When should I use NFC vs NFKC for normalization?
Use NFC for general text storage and display -- it's the most compact composed form and what most operating systems and web browsers produce. Use NFKC when building a search index, generating slugs, or comparing text where "compatibility equivalents" (ligatures, circled letters, Roman numerals as single characters) should match their plain-text equivalents. NFKC is more aggressive: it changes the identity of certain characters (the ligature "ﬁ" becomes "fi"), so only apply it to normalized search keys, not to the stored display text.
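A sketch of that search-key approach (search_key is an illustrative name; pairing NFKC with str.casefold for case-insensitive matching is a common convention, not part of unicodedata itself):

```python
import unicodedata

def search_key(text):
    """Build an aggressive matching key: NFKC compatibility fold + casefold."""
    return unicodedata.normalize("NFKC", text).casefold()

# Ligature and plain spellings collapse to the same key
print(search_key("\ufb01ne"))  # 'fine'
print(search_key("FINE"))      # 'fine'
```

Store this key in a separate indexed column and keep the NFC original for display.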
Where can I find the full list of Unicode category codes?
The Unicode standard defines 30 general categories grouped under 7 major groups: L (letters), M (marks), N (numbers), P (punctuation), S (symbols), Z (separators), and C (other/control). The two-letter code is major + minor category, e.g. Lu = Letter Uppercase, Nd = Number Decimal Digit. The full table is in Unicode Standard Annex #44 at unicode.org, or search for "Unicode general category" in the Python documentation.
What's the difference between unicodedata.decimal() and unicodedata.numeric()?
decimal(char) returns an integer for characters that represent decimal digits (0-9) in any script -- Arabic-Indic "٠" returns 0, ASCII "5" returns 5. For non-digit characters it raises ValueError unless you pass a default. digit(char) is similar but also covers superscript/subscript digit forms. numeric(char) is the broadest: it returns a float for any character with a numeric value, including fractions ("½", U+00BD, returns 0.5) and Roman numerals. Use decimal() when you want to extract actual digit values from text, and str.isnumeric() when you just want a boolean test.
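A quick demonstration of the three lookups side by side:

```python
import unicodedata

print(unicodedata.decimal("5"))       # 5
print(unicodedata.digit("\u00b2"))    # 2 -- superscript two; decimal() would raise
print(unicodedata.numeric("\u00bd"))  # 0.5 -- vulgar fraction one half
print(unicodedata.numeric("\u2162"))  # 3.0 -- Roman numeral three

try:
    unicodedata.decimal("x")
except ValueError:
    print("decimal() raises ValueError without a default")
print(unicodedata.decimal("x", None))  # None -- the supplied default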
How do I detect and fix Unicode corruption (mojibake)?
Mojibake (garbled text from encoding mismatches) typically happens when UTF-8 bytes are decoded as latin-1 or Windows-1252. The character "é" in UTF-8 is the byte pair 0xC3 0xA9, which decoded as latin-1 reads as "Ã©". The ftfy library (pip install ftfy) automatically detects and fixes most mojibake patterns. With unicodedata alone, you can spot likely mojibake by checking for suspicious sequences -- characters like "Ã" ("LATIN CAPITAL LETTER A WITH TILDE") immediately followed by symbol characters rarely appear in ordinary prose.
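When you know the mismatch was exactly UTF-8 bytes decoded as latin-1, the damage can even be reversed by hand with the inverse round trip (a narrow fix; ftfy handles the messier real-world cases):

```python
# Simulate the classic corruption: UTF-8 bytes decoded as latin-1
garbled = "café".encode("utf-8").decode("latin-1")
print(garbled)  # 'cafÃ©'

# Reverse the round trip -- only safe when the mismatch really was UTF-8/latin-1
fixed = garbled.encode("latin-1").decode("utf-8")
print(fixed)    # 'café'
```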
Does unicodedata help with right-to-left text (Arabic, Hebrew)?
The unicodedata.bidirectional(char) function returns the bidirectional category of a character: "L" for left-to-right (Latin), "R" for right-to-left (Hebrew letters), "AL" for Arabic letters, "AN" for Arabic numbers, and so on. This is useful for detecting the text direction of a string and for flagging bidirectional override characters (categories "RLO" and "LRO") that have been abused in security attacks (the "trojan source" vulnerability). Always strip bidirectional override characters from untrusted input.
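A minimal filter for those control characters might look like this (strip_bidi_controls is an illustrative name, and the exact set of bidirectional classes to strip is a policy choice; adjust it to your threat model):

```python
import unicodedata

# Bidirectional classes of the explicit embedding/override/isolate controls
BIDI_CONTROLS = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def strip_bidi_controls(text):
    """Drop explicit bidi control characters from untrusted input."""
    return "".join(c for c in text if unicodedata.bidirectional(c) not in BIDI_CONTROLS)

# U+202E (RIGHT-TO-LEFT OVERRIDE) can make a filename render misleadingly
payload = "file\u202ecod.exe"
print(strip_bidi_controls(payload))  # 'filecod.exe'
```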
Conclusion
The unicodedata module gives you direct access to the Unicode Character Database from Python's standard library. The key tools are normalize() for ensuring consistent text representation (NFC for storage, NFKD for slug generation), category() for script-agnostic character classification, name() for debugging and introspection, and decimal()/numeric() for extracting numeric values from any script's digit characters. The text cleaning pipeline above is a production-ready foundation -- extend it with the ftfy library for mojibake correction or langdetect for script-aware processing.
For comprehensive international text handling beyond what unicodedata provides -- transliteration, locale-aware collation, and script detection -- look at the PyICU library, which provides Python bindings to IBM's International Components for Unicode.
Official documentation: https://docs.python.org/3/library/unicodedata.html