Some text problems are impossible to solve with split(), replace(), and in checks. Extracting all email addresses from a document. Validating that a phone number matches any of fifteen regional formats. Finding every date that appears in a 10,000-line log file. These are pattern-matching problems, and regular expressions — regex — are built exactly for them. Once you understand regex, a problem that would take 50 lines of string manipulation collapses into a single well-crafted pattern.
Python’s re module is built into the standard library and provides a full regex engine. You write a pattern that describes what you’re looking for, and re finds it — in strings of any length, with any number of matches, extracted as individual strings or as named groups. No installation required.
In this article we’ll cover the essential regex syntax (character classes, quantifiers, anchors, groups), the five core re functions (match, search, findall, sub, split), named groups and compiled patterns, lookaheads and lookbehinds, common real-world patterns (email, phone, URL, date), and a practical log file parser. By the end, you’ll be able to write and read regex confidently for most everyday text parsing tasks.
Python Regex: Quick Example
Here’s how to extract all email addresses from a block of text in two lines:
# quick_regex.py
import re
text = "Contact us at support@example.com or sales@company.org for help. Spam: fake@.com"
emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
print(emails)
Output:
['support@example.com', 'sales@company.org']
re.findall() returns a list of all non-overlapping matches. The pattern [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} matches the local part of an email ([a-zA-Z0-9._%+-]+), then @, then a domain ([a-zA-Z0-9.-]+), then a dot (\.), then a TLD of 2+ letters ([a-zA-Z]{2,}). Notice fake@.com wasn't matched: after the @, the pattern needs at least one domain character, then a literal dot, then a TLD, and .com has only one dot to go around.
What Are Regular Expressions?
A regular expression is a sequence of characters that defines a search pattern. The pattern \d{3}-\d{4} matches “555-1234” — exactly three digits, a hyphen, and four digits. Patterns can describe fixed strings, ranges of characters, repetition, alternatives, and complex structures like “a word followed by a number followed by an optional suffix.”
| Pattern | Matches | What It Means |
|---|---|---|
| . | Any character except newline | Wildcard |
| \d | Any digit (0-9) | Digit shorthand |
| \w | Word character (a-z, A-Z, 0-9, _) | Word char shorthand |
| \s | Whitespace (space, tab, newline) | Space shorthand |
| ^ | Start of string (or line with MULTILINE) | Anchor |
| $ | End of string (or line with MULTILINE) | Anchor |
| [abc] | Any of a, b, or c | Character class |
| [^abc] | Any character NOT a, b, or c | Negated class |
| + | One or more of the preceding | Quantifier |
| * | Zero or more of the preceding | Quantifier |
| ? | Zero or one (optional) | Quantifier |
| {n,m} | Between n and m repetitions | Quantifier |
| a\|b | Either a or b | Alternation |
| (abc) | Capture group | Grouping |
Always use raw strings (r'...') for regex patterns in Python. Without the r prefix, backslashes like \d and \w would need to be doubled (\\d, \\w) because Python treats \d as a string escape sequence. Raw strings pass the backslash through unchanged, making patterns cleaner and less error-prone.
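A quick sketch makes the equivalence concrete: a raw pattern and its doubled-backslash equivalent parse to the exact same characters (the sample text here is invented for illustration):

```python
# raw_strings.py
import re

text = 'Order 12345 shipped'

# These two patterns are identical once Python parses the string literals
raw = re.findall(r'\d+', text)       # raw string: backslash passes through
escaped = re.findall('\\d+', text)   # regular string: backslash must be doubled

print(raw, escaped)      # ['12345'] ['12345']
print(r'\d+' == '\\d+')  # True -- same characters after parsing
```

The raw form is shorter and stays readable as patterns grow, which is why it is the convention in Python regex code.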
The Five Core re Functions
re.match() — Match at the Start
re.match() only checks for a match at the very beginning of the string. It’s useful for validating format when you expect the string to start with a specific pattern.
# re_match.py
import re

# Only matches if the pattern is at the START of the string
result = re.match(r'\d{4}-\d{2}-\d{2}', '2026-04-16 09:00:00')
if result:
    print('Matched date:', result.group())
else:
    print('No match')

# Does NOT match -- pattern not at start
result2 = re.match(r'\d{4}-\d{2}-\d{2}', 'Log entry: 2026-04-16')
print('Match with prefix:', result2)  # None
Output:
Matched date: 2026-04-16
Match with prefix: None
re.search() — Find Anywhere in the String
re.search() scans the entire string and returns the first match wherever it appears. This is the function to use when you’re looking for a pattern that might appear anywhere in the text.
# re_search.py
import re

log_line = 'ERROR 2026-04-16 09:23:45 - Connection timeout on port 5432'

# Find the timestamp anywhere in the string
ts_match = re.search(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', log_line)
if ts_match:
    print('Timestamp found:', ts_match.group())
    print('At position:', ts_match.start(), 'to', ts_match.end())

# Find the port number
port_match = re.search(r'port (\d+)', log_line)
if port_match:
    print('Port:', port_match.group(1))  # group(1) = first capture group
Output:
Timestamp found: 2026-04-16 09:23:45
At position: 6 to 25
Port: 5432
The match.group() method returns the full matched string. match.group(1) returns the first capture group (the content inside the first set of parentheses). match.start() and match.end() give you the character positions of the match in the original string.
re.findall() — Find All Matches
re.findall() returns a list of all non-overlapping matches. If the pattern has no groups, it returns a list of matched strings. If it has one group, it returns the group contents. If it has multiple groups, it returns a list of tuples.
# re_findall.py
import re
text = '''
Server logs for 2026-04-16:
192.168.1.10 -> request 200 OK
10.0.0.5 -> request 404 Not Found
172.16.0.1 -> request 200 OK
192.168.1.10 -> request 500 Internal Server Error
'''
# Find all IP addresses
ips = re.findall(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b', text)
print('IPs found:', ips)
# Find all status codes
codes = re.findall(r'request (\d{3})', text)
print('Status codes:', codes)
# Count 404s and 500s
errors = [c for c in codes if c.startswith(('4', '5'))]
print('Error responses:', len(errors))
Output:
IPs found: ['192.168.1.10', '10.0.0.5', '172.16.0.1', '192.168.1.10']
Status codes: ['200', '404', '200', '500']
Error responses: 2
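The multiple-groups case is worth seeing once, since it changes the shape of the result. A small sketch (the sample text is invented for illustration):

```python
# findall_groups.py
import re

text = 'GET /index.html 200, POST /api/login 401'

# Two capture groups -> findall returns a list of (method, status) tuples
pairs = re.findall(r'(GET|POST) \S+ (\d{3})', text)
print(pairs)  # [('GET', '200'), ('POST', '401')]
```

Each tuple holds one group per set of parentheses, in order, which makes the result easy to unpack in a for loop.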
re.sub() — Replace Matches
re.sub() replaces all occurrences of a pattern with a replacement string or the result of a function. This is the regex-powered version of str.replace().
# re_sub.py
import re

# Normalize phone numbers to a consistent format
phones = [
    'Call us: (02) 9876-5432',
    'Mobile: 0412 345 678',
    'Fax: 02-9876-5432',
]
for phone_text in phones:
    # Grab the number (starts with a digit or '(', ends with a digit),
    # then strip every non-digit character
    number = re.search(r'[\d(][\d\s\-()]*\d', phone_text).group()
    digits_only = re.sub(r'\D', '', number)
    print(f'{phone_text:30} -> {digits_only}')

# Redact sensitive data: replace card numbers
text = 'Card: 4532-1234-5678-9012, expires 04/28'
redacted = re.sub(r'\d{4}-\d{4}-\d{4}-\d{4}', '[REDACTED]', text)
print('\nRedacted:', redacted)
Output:
Call us: (02) 9876-5432        -> 0298765432
Mobile: 0412 345 678           -> 0412345678
Fax: 02-9876-5432              -> 0298765432

Redacted: Card: [REDACTED], expires 04/28
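re.split() — Split on a Pattern
re.split() rounds out the five core functions: it is str.split() with a regex as the delimiter, so one call can split on any mix of separators. A short sketch (the sample data is invented for illustration):

```python
# re_split.py
import re

# Split on any run of commas, semicolons, or whitespace
data = 'apples, oranges;bananas  grapes,  pears'
items = re.split(r'[,;\s]+', data)
print(items)  # ['apples', 'oranges', 'bananas', 'grapes', 'pears']

# With a capture group, the delimiters themselves are kept in the result
parts = re.split(r'(\d+)', 'abc123def456')
print(parts)  # ['abc', '123', 'def', '456', '']
```

Note the trailing empty string in the second result: the input ends with a match, and re.split() always returns the (possibly empty) text on both sides of every delimiter.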
Named Groups and Compiled Patterns
For complex patterns you’ll reuse frequently, named capture groups make the code self-documenting. Instead of match.group(1), you write match.group('year'). Compiled patterns (re.compile()) also avoid re-parsing the pattern on every call, which is important in loops.
# named_groups.py
import re

# Compile a pattern with named groups
log_pattern = re.compile(
    r'(?P<level>DEBUG|INFO|WARNING|ERROR|CRITICAL)\s+'
    r'(?P<date>\d{4}-\d{2}-\d{2})\s+'
    r'(?P<time>\d{2}:\d{2}:\d{2})\s+-\s+'
    r'(?P<message>.+)'
)

log_lines = [
    'INFO 2026-04-16 09:00:01 - Application started',
    'WARNING 2026-04-16 09:15:32 - High memory usage: 85%',
    'ERROR 2026-04-16 09:23:45 - Database connection failed',
]

for line in log_lines:
    m = log_pattern.match(line)
    if m:
        print(f"Level: {m.group('level')}")
        print(f"Time: {m.group('date')} {m.group('time')}")
        print(f"Message: {m.group('message')}")
        print()
Output:
Level: INFO
Time: 2026-04-16 09:00:01
Message: Application started

Level: WARNING
Time: 2026-04-16 09:15:32
Message: High memory usage: 85%

Level: ERROR
Time: 2026-04-16 09:23:45
Message: Database connection failed
Named groups use the syntax (?P<name>pattern). The ?P<name> is Python-specific regex syntax (the P stands for “Python extension”). You can also access named groups as a dict via match.groupdict(), which returns {'level': 'INFO', 'date': '2026-04-16', ...} — very useful for feeding parsed log data into data structures.
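A minimal sketch of groupdict() in action (the pattern and input are simplified versions of the example above):

```python
# groupdict_demo.py
import re

pattern = re.compile(r'(?P<level>\w+)\s+(?P<date>\d{4}-\d{2}-\d{2})')
m = pattern.match('INFO 2026-04-16')
if m:
    print(m.groupdict())  # {'level': 'INFO', 'date': '2026-04-16'}
```

Because the result is a plain dict, you can pass it straight to a dataclass constructor, a CSV writer, or json.dumps() without touching individual groups.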
Real-Life Example: Server Log Analyzer
Here’s a complete log file analyzer that parses Apache/nginx-style access logs, extracts metrics, and generates a summary report.
# log_analyzer.py
import re
from collections import Counter

# Sample nginx-style access log data
ACCESS_LOG = """
192.168.1.10 - - [16/Apr/2026:09:00:01 +0000] "GET /index.html HTTP/1.1" 200 1234
10.0.0.5 - - [16/Apr/2026:09:00:05 +0000] "GET /api/users HTTP/1.1" 200 892
192.168.1.10 - - [16/Apr/2026:09:01:12 +0000] "POST /api/login HTTP/1.1" 401 145
10.0.0.7 - - [16/Apr/2026:09:02:30 +0000] "GET /images/logo.png HTTP/1.1" 200 45678
192.168.1.25 - - [16/Apr/2026:09:03:11 +0000] "GET /api/data HTTP/1.1" 500 312
10.0.0.5 - - [16/Apr/2026:09:04:00 +0000] "GET /api/users/42 HTTP/1.1" 200 456
192.168.1.10 - - [16/Apr/2026:09:05:22 +0000] "DELETE /api/users/42 HTTP/1.1" 403 88
10.0.0.7 - - [16/Apr/2026:09:06:45 +0000] "GET /api/data HTTP/1.1" 200 789
""".strip()

# Compile the access log pattern
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) HTTP/[\d.]+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)'
)

def analyze_logs(log_text):
    """Parse log entries and return summary statistics."""
    ip_counter = Counter()
    status_counter = Counter()
    path_counter = Counter()
    method_counter = Counter()
    total_bytes = 0
    errors = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = LOG_PATTERN.match(line)
        if not m:
            errors.append(f'Could not parse: {line}')
            continue
        ip_counter[m.group('ip')] += 1
        status_counter[m.group('status')] += 1
        path_counter[m.group('path')] += 1
        method_counter[m.group('method')] += 1
        total_bytes += int(m.group('bytes'))
    return {
        'total_requests': sum(ip_counter.values()),
        'unique_ips': len(ip_counter),
        'top_ips': ip_counter.most_common(3),
        'status_codes': dict(sorted(status_counter.items())),
        'top_paths': path_counter.most_common(3),
        'methods': dict(method_counter),
        'total_bytes': total_bytes,
        'parse_errors': errors,
    }

stats = analyze_logs(ACCESS_LOG)
print(f'Total Requests: {stats["total_requests"]}')
print(f'Unique IPs: {stats["unique_ips"]}')
print(f'Total Data: {stats["total_bytes"] / 1024:.1f} KB')
print('\nStatus Codes:')
for code, count in stats['status_codes'].items():
    label = 'OK' if code.startswith('2') else 'ERR' if code.startswith('5') else ''
    print(f'  {code}: {count:3} {label}')
print('\nTop IPs:')
for ip, count in stats['top_ips']:
    print(f'  {ip:15} {count} requests')
print(f'\nHTTP Methods: {stats["methods"]}')
if stats['parse_errors']:
    print(f'\nParse errors: {len(stats["parse_errors"])}')
Output:
Total Requests: 8
Unique IPs: 4
Total Data: 48.4 KB

Status Codes:
  200:   5 OK
  401:   1
  403:   1
  500:   1 ERR

Top IPs:
  192.168.1.10    3 requests
  10.0.0.5        2 requests
  10.0.0.7        2 requests

HTTP Methods: {'GET': 6, 'POST': 1, 'DELETE': 1}
The compiled LOG_PATTERN with named groups is the heart of this analyzer — it extracts all six fields from each log line in a single match() call. Calling re.compile() once outside the loop means the pattern is parsed only once, which matters when processing millions of log lines.
Frequently Asked Questions
What is greedy vs non-greedy matching?
By default, quantifiers like + and * are greedy — they match as much text as possible. re.search(r'<.+>', '<b>text</b>') matches the entire string <b>text</b>, not just <b>. Add ? after the quantifier to make it non-greedy (lazy): r'<.+?>' matches <b> and stops. Use non-greedy quantifiers when you want the shortest possible match between two delimiters.
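The difference is easiest to see side by side; a small sketch (the sample HTML string is invented for illustration):

```python
# greedy_vs_lazy.py
import re

html = '<b>bold</b> and <i>italic</i>'

greedy = re.findall(r'<.+>', html)   # .+ grabs as much as possible
lazy = re.findall(r'<.+?>', html)    # .+? stops at the first closing >

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```

The greedy version swallows everything between the first < and the last >, while the lazy version yields each tag individually.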
How do I match across multiple lines?
By default, . doesn’t match newlines. Pass re.DOTALL as a flag to make . match any character including newlines: re.search(r'START.+END', text, re.DOTALL). For patterns where ^ and $ should match line boundaries (not just string boundaries), use re.MULTILINE. Both flags can be combined: re.DOTALL | re.MULTILINE.
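Both flags in a short sketch (the sample text is invented for illustration):

```python
# regex_flags.py
import re

text = 'START\nline one\nline two\nEND'

# Without DOTALL, . stops at newlines -- no match across lines
print(re.search(r'START.+END', text))  # None

# With DOTALL, . matches newlines too
print(re.search(r'START.+END', text, re.DOTALL).group())

# MULTILINE makes ^ match at the start of every line, not just the string
print(re.findall(r'^line \w+', text, re.MULTILINE))  # ['line one', 'line two']
```

Remember that the two flags solve different problems: DOTALL changes what . matches, MULTILINE changes where ^ and $ anchor.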
When should I use re.compile()?
Use re.compile() when you’re calling the same pattern multiple times — in a loop, or in a function that’s called repeatedly. Compiled patterns cache the parsed regex, avoiding redundant work. For one-off searches in simple scripts, the module-level functions (re.search(), re.findall(), etc.) are fine — they also cache internally under the hood.
How do I match literal special characters like dot or parenthesis?
Escape them with a backslash: \. matches a literal dot (not the “any character” wildcard), \( matches a literal opening parenthesis. In raw strings, that’s r'\.' and r'\('. Common characters that need escaping: . ^ $ * + ? { } [ ] \ | ( ). Use re.escape(your_string) to automatically escape all special characters in a variable you want to match literally.
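A brief sketch of both techniques (the sample strings are invented for illustration):

```python
# escaping.py
import re

# \. matches a literal dot instead of the "any character" wildcard
versions = re.findall(r'\d+\.\d+', 'versions 3.12 and 3.13')
print(versions)  # ['3.12', '3.13']

# re.escape() escapes every regex special character in a plain string,
# so user-supplied text can be matched literally
user_input = 'price (USD): $4.99'
pattern = re.escape(user_input)
found = re.search(pattern, 'Item: price (USD): $4.99, in stock')
print(found is not None)  # True
```

Without re.escape(), the parentheses, dollar sign, and dot in user_input would all be interpreted as regex metacharacters.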
My regex is running slowly. What can I do?
Several patterns cause catastrophic backtracking: nested quantifiers like (a+)+, alternations with overlapping patterns, or very long strings with no match. Solutions: compile the pattern once with re.compile(), use anchors (^ and $) to limit where Python searches, make quantifiers as specific as possible (use [a-z]+ instead of .+ when you know the character set), and test with tools like regex101.com which shows match steps and warnings about slow patterns.
Conclusion
Python’s re module gives you a full regex engine for any text processing challenge. We covered the core syntax (character classes, quantifiers, anchors, groups), the five main functions (match, search, findall, sub, split), named capture groups for readable code, compiled patterns for performance, greedy vs non-greedy matching, and a complete server log analyzer. Regular expressions have a reputation for being hard to read, but well-named groups and small, focused patterns keep them maintainable.
Extend the log analyzer to write hourly request rate breakdowns, flag IPs that generate more than 10 errors per hour, or parse a different log format by updating only the compiled pattern. The (?P<name>) named group system makes updating patterns clean because the code downstream references groups by name, not by index.
For the full syntax reference, flag descriptions, and advanced features like conditional matching, see the official re module documentation. The interactive regex101.com is invaluable for testing and debugging patterns.