Some text problems are impossible to solve with split(), replace(), and in checks. Extracting all email addresses from a document. Validating that a phone number matches any of fifteen regional formats. Finding every date that appears in a 10,000-line log file. These are pattern-matching problems, and regular expressions — regex — are built exactly for them. Once you understand regex, a problem that would take 50 lines of string manipulation collapses into a single well-crafted pattern.
Python’s re module is built into the standard library and provides a full regex engine. You write a pattern that describes what you’re looking for, and re finds it — in strings of any length, with any number of matches, extracted as individual strings or as named groups. No installation required.
In this article we’ll cover the essential regex syntax (character classes, quantifiers, anchors, groups), the five core re functions (match, search, findall, sub, split), named groups and compiled patterns, lookaheads and lookbehinds, common real-world patterns (email, phone, URL, date), and a practical log file parser. By the end, you’ll be able to write and read regex confidently for most everyday text parsing tasks.
Python Regex: Quick Example
Here’s how to extract all email addresses from a block of text in two lines:
# quick_regex.py
import re
text = "Contact us at support@example.com or sales@company.org for help. Spam: fake@.com"
emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
print(emails)
Output:
['support@example.com', 'sales@company.org']
re.findall() returns a list of all non-overlapping matches. The pattern [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} matches the local part of an email ([a-zA-Z0-9._%+-]+), then @, then a domain ([a-zA-Z0-9.-]+), then a dot (\.), then a TLD of 2+ letters ([a-zA-Z]{2,}). Notice fake@.com wasn't matched: after the @, the pattern needs at least one domain character, then a literal dot, then a TLD, and .com has only one dot to go around.
What Are Regular Expressions?
A regular expression is a sequence of characters that defines a search pattern. The pattern \d{3}-\d{4} matches “555-1234” — exactly three digits, a hyphen, and four digits. Patterns can describe fixed strings, ranges of characters, repetition, alternatives, and complex structures like “a word followed by a number followed by an optional suffix.”
| Pattern | Matches | What It Means |
|---|---|---|
| . | Any character except newline | Wildcard |
| \d | Any digit (0-9) | Digit shorthand |
| \w | Word character (a-z, A-Z, 0-9, _) | Word char shorthand |
| \s | Whitespace (space, tab, newline) | Space shorthand |
| ^ | Start of string (or line with MULTILINE) | Anchor |
| $ | End of string (or line with MULTILINE) | Anchor |
| [abc] | Any of a, b, or c | Character class |
| [^abc] | Any character NOT a, b, or c | Negated class |
| + | One or more of the preceding | Quantifier |
| * | Zero or more of the preceding | Quantifier |
| ? | Zero or one (optional) | Quantifier |
| {n,m} | Between n and m repetitions | Quantifier |
| a\|b | Either a or b | Alternation |
| (abc) | Capture group | Grouping |
Always use raw strings (r'...') for regex patterns in Python. Without the r prefix, backslashes like \d and \w would need to be doubled (\\d, \\w) because Python treats \d as a string escape sequence. Raw strings pass the backslash through unchanged, making patterns cleaner and less error-prone.
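A quick sketch makes the equivalence concrete: a raw pattern and its doubled-backslash equivalent parse to the exact same characters (the sample text here is invented for illustration):

```python
# raw_strings.py
import re

text = 'Order 12345 shipped'

# These two patterns are identical once Python parses the string literals
raw = re.findall(r'\d+', text)       # raw string: backslash passes through
escaped = re.findall('\\d+', text)   # regular string: backslash must be doubled

print(raw, escaped)      # ['12345'] ['12345']
print(r'\d+' == '\\d+')  # True -- same characters after parsing
```

The raw form is shorter and stays readable as patterns grow, which is why it is the convention in Python regex code.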
The Five Core re Functions
re.match() — Match at the Start
re.match() only checks for a match at the very beginning of the string. It’s useful for validating format when you expect the string to start with a specific pattern.
# re_match.py
import re

# Only matches if the pattern is at the START of the string
result = re.match(r'\d{4}-\d{2}-\d{2}', '2026-04-16 09:00:00')
if result:
    print('Matched date:', result.group())
else:
    print('No match')

# Does NOT match -- pattern not at start
result2 = re.match(r'\d{4}-\d{2}-\d{2}', 'Log entry: 2026-04-16')
print('Match with prefix:', result2)  # None
Output:
Matched date: 2026-04-16
Match with prefix: None
re.search() — Find Anywhere in the String
re.search() scans the entire string and returns the first match wherever it appears. This is the function to use when you’re looking for a pattern that might appear anywhere in the text.
# re_search.py
import re

log_line = 'ERROR 2026-04-16 09:23:45 - Connection timeout on port 5432'

# Find the timestamp anywhere in the string
ts_match = re.search(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', log_line)
if ts_match:
    print('Timestamp found:', ts_match.group())
    print('At position:', ts_match.start(), 'to', ts_match.end())

# Find the port number
port_match = re.search(r'port (\d+)', log_line)
if port_match:
    print('Port:', port_match.group(1))  # group(1) = first capture group
Output:
Timestamp found: 2026-04-16 09:23:45
At position: 6 to 25
Port: 5432
The match.group() method returns the full matched string. match.group(1) returns the first capture group (the content inside the first set of parentheses). match.start() and match.end() give you the character positions of the match in the original string.
re.findall() — Find All Matches
re.findall() returns a list of all non-overlapping matches. If the pattern has no groups, it returns a list of matched strings. If it has one group, it returns the group contents. If it has multiple groups, it returns a list of tuples.
# re_findall.py
import re
text = '''
Server logs for 2026-04-16:
192.168.1.10 -> request 200 OK
10.0.0.5 -> request 404 Not Found
172.16.0.1 -> request 200 OK
192.168.1.10 -> request 500 Internal Server Error
'''
# Find all IP addresses
ips = re.findall(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b', text)
print('IPs found:', ips)
# Find all status codes
codes = re.findall(r'request (\d{3})', text)
print('Status codes:', codes)
# Count 404s and 500s
errors = [c for c in codes if c.startswith(('4', '5'))]
print('Error responses:', len(errors))
Output:
IPs found: ['192.168.1.10', '10.0.0.5', '172.16.0.1', '192.168.1.10']
Status codes: ['200', '404', '200', '500']
Error responses: 2
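The multiple-groups case is worth seeing once, since it changes the shape of the result. A small sketch (the sample text is invented for illustration):

```python
# findall_groups.py
import re

text = 'GET /index.html 200, POST /api/login 401'

# Two capture groups -> findall returns a list of (method, status) tuples
pairs = re.findall(r'(GET|POST) \S+ (\d{3})', text)
print(pairs)  # [('GET', '200'), ('POST', '401')]
```

Each tuple holds one group per set of parentheses, in order, which makes the result easy to unpack in a for loop.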
re.sub() — Replace Matches
re.sub() replaces all occurrences of a pattern with a replacement string or the result of a function. This is the regex-powered version of str.replace().
# re_sub.py
import re

# Normalize phone numbers to a consistent format
phones = [
    'Call us: (02) 9876-5432',
    'Mobile: 0412 345 678',
    'Fax: 02-9876-5432',
]
for phone_text in phones:
    # Grab the number (starts with a digit or '(', ends with a digit),
    # then strip every non-digit character
    number = re.search(r'[\d(][\d\s\-()]*\d', phone_text).group()
    digits_only = re.sub(r'\D', '', number)
    print(f'{phone_text:30} -> {digits_only}')

# Redact sensitive data: replace card numbers
text = 'Card: 4532-1234-5678-9012, expires 04/28'
redacted = re.sub(r'\d{4}-\d{4}-\d{4}-\d{4}', '[REDACTED]', text)
print('\nRedacted:', redacted)
Output:
Call us: (02) 9876-5432        -> 0298765432
Mobile: 0412 345 678           -> 0412345678
Fax: 02-9876-5432              -> 0298765432

Redacted: Card: [REDACTED], expires 04/28
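re.split() — Split on a Pattern
re.split() rounds out the five core functions: it is str.split() with a regex as the delimiter, so one call can split on any mix of separators. A short sketch (the sample data is invented for illustration):

```python
# re_split.py
import re

# Split on any run of commas, semicolons, or whitespace
data = 'apples, oranges;bananas  grapes,  pears'
items = re.split(r'[,;\s]+', data)
print(items)  # ['apples', 'oranges', 'bananas', 'grapes', 'pears']

# With a capture group, the delimiters themselves are kept in the result
parts = re.split(r'(\d+)', 'abc123def456')
print(parts)  # ['abc', '123', 'def', '456', '']
```

Note the trailing empty string in the second result: the input ends with a match, and re.split() always returns the (possibly empty) text on both sides of every delimiter.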
Named Groups and Compiled Patterns
For complex patterns you’ll reuse frequently, named capture groups make the code self-documenting. Instead of match.group(1), you write match.group('year'). Compiled patterns (re.compile()) also avoid re-parsing the pattern on every call, which is important in loops.
# named_groups.py
import re

# Compile a pattern with named groups
log_pattern = re.compile(
    r'(?P<level>DEBUG|INFO|WARNING|ERROR|CRITICAL)\s+'
    r'(?P<date>\d{4}-\d{2}-\d{2})\s+'
    r'(?P<time>\d{2}:\d{2}:\d{2})\s+-\s+'
    r'(?P<message>.+)'
)

log_lines = [
    'INFO 2026-04-16 09:00:01 - Application started',
    'WARNING 2026-04-16 09:15:32 - High memory usage: 85%',
    'ERROR 2026-04-16 09:23:45 - Database connection failed',
]

for line in log_lines:
    m = log_pattern.match(line)
    if m:
        print(f"Level: {m.group('level')}")
        print(f"Time: {m.group('date')} {m.group('time')}")
        print(f"Message: {m.group('message')}")
        print()
Output:
Level: INFO
Time: 2026-04-16 09:00:01
Message: Application started

Level: WARNING
Time: 2026-04-16 09:15:32
Message: High memory usage: 85%

Level: ERROR
Time: 2026-04-16 09:23:45
Message: Database connection failed
Named groups use the syntax (?P<name>pattern). The ?P<name> is Python-specific regex syntax (the P stands for “Python extension”). You can also access named groups as a dict via match.groupdict(), which returns {'level': 'INFO', 'date': '2026-04-16', ...} — very useful for feeding parsed log data into data structures.
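A minimal sketch of groupdict() in action (the pattern and input are simplified versions of the example above):

```python
# groupdict_demo.py
import re

pattern = re.compile(r'(?P<level>\w+)\s+(?P<date>\d{4}-\d{2}-\d{2})')
m = pattern.match('INFO 2026-04-16')
if m:
    print(m.groupdict())  # {'level': 'INFO', 'date': '2026-04-16'}
```

Because the result is a plain dict, you can pass it straight to a dataclass constructor, a CSV writer, or json.dumps() without touching individual groups.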
Real-Life Example: Server Log Analyzer
Here’s a complete log file analyzer that parses Apache/nginx-style access logs, extracts metrics, and generates a summary report.
# log_analyzer.py
import re
from collections import Counter

# Sample nginx-style access log data
ACCESS_LOG = """
192.168.1.10 - - [16/Apr/2026:09:00:01 +0000] "GET /index.html HTTP/1.1" 200 1234
10.0.0.5 - - [16/Apr/2026:09:00:05 +0000] "GET /api/users HTTP/1.1" 200 892
192.168.1.10 - - [16/Apr/2026:09:01:12 +0000] "POST /api/login HTTP/1.1" 401 145
10.0.0.7 - - [16/Apr/2026:09:02:30 +0000] "GET /images/logo.png HTTP/1.1" 200 45678
192.168.1.25 - - [16/Apr/2026:09:03:11 +0000] "GET /api/data HTTP/1.1" 500 312
10.0.0.5 - - [16/Apr/2026:09:04:00 +0000] "GET /api/users/42 HTTP/1.1" 200 456
192.168.1.10 - - [16/Apr/2026:09:05:22 +0000] "DELETE /api/users/42 HTTP/1.1" 403 88
10.0.0.7 - - [16/Apr/2026:09:06:45 +0000] "GET /api/data HTTP/1.1" 200 789
""".strip()

# Compile the access log pattern
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) HTTP/[\d.]+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)'
)

def analyze_logs(log_text):
    """Parse log entries and return summary statistics."""
    ip_counter = Counter()
    status_counter = Counter()
    path_counter = Counter()
    method_counter = Counter()
    total_bytes = 0
    errors = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = LOG_PATTERN.match(line)
        if not m:
            errors.append(f'Could not parse: {line}')
            continue
        ip_counter[m.group('ip')] += 1
        status_counter[m.group('status')] += 1
        path_counter[m.group('path')] += 1
        method_counter[m.group('method')] += 1
        total_bytes += int(m.group('bytes'))
    return {
        'total_requests': sum(ip_counter.values()),
        'unique_ips': len(ip_counter),
        'top_ips': ip_counter.most_common(3),
        'status_codes': dict(sorted(status_counter.items())),
        'top_paths': path_counter.most_common(3),
        'methods': dict(method_counter),
        'total_bytes': total_bytes,
        'parse_errors': errors,
    }

stats = analyze_logs(ACCESS_LOG)
print(f'Total Requests: {stats["total_requests"]}')
print(f'Unique IPs: {stats["unique_ips"]}')
print(f'Total Data: {stats["total_bytes"] / 1024:.1f} KB')
print('\nStatus Codes:')
for code, count in stats['status_codes'].items():
    label = 'OK' if code.startswith('2') else 'ERR' if code.startswith('5') else ''
    print(f'  {code}: {count:3} {label}')
print('\nTop IPs:')
for ip, count in stats['top_ips']:
    print(f'  {ip:15} {count} requests')
print(f'\nHTTP Methods: {stats["methods"]}')
if stats['parse_errors']:
    print(f'\nParse errors: {len(stats["parse_errors"])}')
Output:
Total Requests: 8
Unique IPs: 4
Total Data: 48.4 KB

Status Codes:
  200:   5 OK
  401:   1
  403:   1
  500:   1 ERR

Top IPs:
  192.168.1.10    3 requests
  10.0.0.5        2 requests
  10.0.0.7        2 requests

HTTP Methods: {'GET': 6, 'POST': 1, 'DELETE': 1}
The compiled LOG_PATTERN with named groups is the heart of this analyzer — it extracts all six fields from each log line in a single match() call. Calling re.compile() once outside the loop means the pattern is parsed only once, which matters when processing millions of log lines.
Frequently Asked Questions
What is greedy vs non-greedy matching?
By default, quantifiers like + and * are greedy — they match as much text as possible. re.search(r'<.+>', '<b>text</b>') matches the entire string <b>text</b>, not just <b>. Add ? after the quantifier to make it non-greedy (lazy): r'<.+?>' matches <b> and stops. Use non-greedy quantifiers when you want the shortest possible match between two delimiters.
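The difference is easiest to see side by side; a small sketch (the sample HTML string is invented for illustration):

```python
# greedy_vs_lazy.py
import re

html = '<b>bold</b> and <i>italic</i>'

greedy = re.findall(r'<.+>', html)   # .+ grabs as much as possible
lazy = re.findall(r'<.+?>', html)    # .+? stops at the first closing >

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```

The greedy version swallows everything between the first < and the last >, while the lazy version yields each tag individually.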
How do I match across multiple lines?
By default, . doesn’t match newlines. Pass re.DOTALL as a flag to make . match any character including newlines: re.search(r'START.+END', text, re.DOTALL). For patterns where ^ and $ should match line boundaries (not just string boundaries), use re.MULTILINE. Both flags can be combined: re.DOTALL | re.MULTILINE.
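Both flags in a short sketch (the sample text is invented for illustration):

```python
# regex_flags.py
import re

text = 'START\nline one\nline two\nEND'

# Without DOTALL, . stops at newlines -- no match across lines
print(re.search(r'START.+END', text))  # None

# With DOTALL, . matches newlines too
print(re.search(r'START.+END', text, re.DOTALL).group())

# MULTILINE makes ^ match at the start of every line, not just the string
print(re.findall(r'^line \w+', text, re.MULTILINE))  # ['line one', 'line two']
```

Remember that the two flags solve different problems: DOTALL changes what . matches, MULTILINE changes where ^ and $ anchor.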
When should I use re.compile()?
Use re.compile() when you’re calling the same pattern multiple times — in a loop, or in a function that’s called repeatedly. Compiled patterns cache the parsed regex, avoiding redundant work. For one-off searches in simple scripts, the module-level functions (re.search(), re.findall(), etc.) are fine — they also cache internally under the hood.
How do I match literal special characters like dot or parenthesis?
Escape them with a backslash: \. matches a literal dot (not the “any character” wildcard), \( matches a literal opening parenthesis. In raw strings, that’s r'\.' and r'\('. Common characters that need escaping: . ^ $ * + ? { } [ ] \ | ( ). Use re.escape(your_string) to automatically escape all special characters in a variable you want to match literally.
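A brief sketch of both techniques (the sample strings are invented for illustration):

```python
# escaping.py
import re

# \. matches a literal dot instead of the "any character" wildcard
versions = re.findall(r'\d+\.\d+', 'versions 3.12 and 3.13')
print(versions)  # ['3.12', '3.13']

# re.escape() escapes every regex special character in a plain string,
# so user-supplied text can be matched literally
user_input = 'price (USD): $4.99'
pattern = re.escape(user_input)
found = re.search(pattern, 'Item: price (USD): $4.99, in stock')
print(found is not None)  # True
```

Without re.escape(), the parentheses, dollar sign, and dot in user_input would all be interpreted as regex metacharacters.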
My regex is running slowly. What can I do?
Several patterns cause catastrophic backtracking: nested quantifiers like (a+)+, alternations with overlapping patterns, or very long strings with no match. Solutions: compile the pattern once with re.compile(), use anchors (^ and $) to limit where Python searches, make quantifiers as specific as possible (use [a-z]+ instead of .+ when you know the character set), and test with tools like regex101.com which shows match steps and warnings about slow patterns.
Conclusion
Python’s re module gives you a full regex engine for any text processing challenge. We covered the core syntax (character classes, quantifiers, anchors, groups), the five main functions (match, search, findall, sub, split), named capture groups for readable code, compiled patterns for performance, greedy vs non-greedy matching, and a complete server log analyzer. Regular expressions have a reputation for being hard to read, but well-named groups and small, focused patterns keep them maintainable.
Extend the log analyzer to write hourly request rate breakdowns, flag IPs that generate more than 10 errors per hour, or parse a different log format by updating only the compiled pattern. The (?P<name>) named group system makes updating patterns clean because the code downstream references groups by name, not by index.
For the full syntax reference, flag descriptions, and advanced features like conditional matching, see the official re module documentation. The interactive regex101.com is invaluable for testing and debugging patterns.