Beginner
For most serious applications, you will need persistent storage of some sort, that is, storage that still exists after your application stops running. For new developers it can be quite daunting to decide which option to go for. Is a simple flat file enough? When should you use a database? Which database should you use? There are so many options available that it is hard to know which way to go.
This is a starting guide that provides an overview of some of the many data storage options available to you, and how you can go about deciding between them. One thing to keep in mind: if you are developing an application that is planned, or has the potential, to scale over time, your underlying data will also grow over time. A file may be quick and easy to implement as storage, but as your data grows a relational database may serve you better, even though it takes a little more effort. Let's look at this in more depth.

What are the possible ways to store data?
There are many methods of persistent storage you can use (persistent storage simply means that your data is not lost after your program finishes running). The simplest are a file that you save data to, or the Python pickle mechanism, but there are several more. First, here is a brief explanation of each option:
- File: you store the data in a text-based file, in a format such as CSV (comma-separated values), JSON, or others
- Python pickle: a mechanism where you can save a data structure directly to a file, and then retrieve it directly from the file the next time you run your program. You can do this with the standard library module called "pickle"
- Config file: similar to the two above in that the data is stored in a file, but the format is intended to be edited directly by a user
- SQLite database: a database you can run queries against, with the data stored in a single file
- PostgreSQL (or another SQL-based database): a database service where a separate program manages the data, and you run SQL queries against the service to get data back efficiently. SQL-based databases are great for structured, table-like (Excel-like) data, where you search for records by category fields
- Key-value database (e.g. Redis, one of the most famous): exactly what the name says: a database where you search by a key and it returns a value. The value can be a single item or a set of fields associated with that key, much like a dictionary in Python, with the benefit that it lives in persistent storage
- Graph database (e.g. Neo4j): a database built to navigate relationships between records. This is rather cumbersome in a relational database, where you need many intermediary tables, but becomes trivial with a graph query language such as Cypher
- Text search database (e.g. Elasticsearch): a purpose-built database that is extremely fast at searching strings and long text
- Time series database (e.g. InfluxDB): ideal for IoT-style data where each record is stored against a timestamp key and you need to run queries over time blocks. Common operations such as aggregating, searching, and slicing data are supported through specific query operations
- NoSQL document database (e.g. MongoDB, CouchDB): a database that also runs as a separate service but is built for "unstructured" (non-tabular) data such as text and images, where you search for records in a free-form way, for example by text strings
There is no one persistent storage mechanism that fits all. Which one works best really depends on your purpose (or "use case"), as each option has its own pros and cons.
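To make a few of these options concrete before we compare them, here is a minimal sketch using only the standard library (the file names are made up for illustration) that saves and reloads the same small record three ways: as a JSON file, as a pickle, and in SQLite.

```python
# storage_sketch.py -- an illustrative comparison; file names are arbitrary
import json
import pickle
import sqlite3

record = {"name": "Alice", "score": 42}

# Flat file (JSON): human-readable and editable in any text editor
with open("record.json", "w") as f:
    json.dump(record, f)
with open("record.json") as f:
    from_json = json.load(f)

# Pickle: saves the Python object directly, but only Python can read it back
with open("record.pkl", "wb") as f:
    pickle.dump(record, f)
with open("record.pkl", "rb") as f:
    from_pickle = pickle.load(f)

# SQLite: a queryable database stored in a single local file
conn = sqlite3.connect("records.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (name TEXT, score INTEGER)")
conn.execute("INSERT INTO records VALUES (?, ?)", ("Alice", 42))
conn.commit()
from_db = conn.execute("SELECT name, score FROM records WHERE name = ?",
                       ("Alice",)).fetchone()
conn.close()

print(from_json, from_pickle, from_db)
```

All three round-trips recover the data, but notice how the SQLite version lets you query by field, which is exactly where files and pickles start to fall short as your data grows.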
| Storage Type | Setup | Editable Outside Python | Volume | Read Speed | Write Speed | Inbuilt Redundancy |
|---|---|---|---|---|---|---|
| File | None – you can create a file in your Python code | Yes, for text-based formats | Small | Slow | Slow | No – manual |
| Python pickle | None – you can create this in your Python code | No – only in Python | Small | Slow | Slow | No – manual |
| Config file | Optional – you can create a config file beforehand | Yes – any text editor | Small | Slow | Slow | No – manual |
| SQLite | None – database created automatically | Yes – via the sqlite3 command-line tool or a GUI client | Small–Med | Slow–Med | Slow–Med | No – manual |
| Relational SQL database | Separate installation of server | Yes – through the SQL console or other SQL clients | Large | Fast | Fast | Yes, requires extra setup |
| NoSQL column database | Separate installation of server | Yes, through external client | Very large | Very fast | Very fast | Yes, inbuilt |
| Key-value database | Separate installation of server | Yes, through external client | Very large | Very fast | Fast–Very fast | Yes, requires extra setup |
| Graph database | Separate installation of server | Yes, through external client | Large | Med | Med | Yes, requires extra setup |
| Time series database | Separate installation of server | Yes, through external client | Very large | Very fast | Fast | Yes, requires extra setup |
| Text search database | Separate installation of server | Yes, through external client | Very large | Very fast | Fast | Yes, requires extra setup |
| NoSQL document database | Separate installation of server | Yes, through external client | Very large | Very fast | Fast | Yes, requires extra setup |

A big disclaimer here: for some of these entries, the more accurate answer is "it depends". For example, redundancy is built into some relational databases (such as Oracle RAC enterprise databases), while for others you can set it up yourself as an infrastructure solution. To keep the guidance simple, I've made the table a bit more prescriptive. If you would like to dive deeper, please don't rely purely on the table above! Look into the documentation of the particular database product you are considering, or reach out to me and I'm happy to provide some advice.
Summary
There are also plenty of SaaS-based options for databases and persistent storage popping up, which is exciting. These newer SaaS options (for example Firebase, restdb.io, anvil.works) are great in that they save you the heavy lifting, but there may still be times when you want to manage your own database. This may be because you want to keep your data yourself, or simply to save costs when you already have an environment, either on your own laptop or on a virtual machine you're paying a fixed price for. In those cases, managing your own persistent storage may be more cost effective than paying for another SaaS. That said, don't discount the SaaS options altogether: they will at least handle things like backups and security updates for you.
How To Use Python Regular Expressions with the re Module
Intermediate
Some text problems are impossible to solve with split(), replace(), and in checks. Extracting all email addresses from a document. Validating that a phone number matches any of fifteen regional formats. Finding every date that appears in a 10,000-line log file. These are pattern-matching problems, and regular expressions — regex — are built exactly for them. Once you understand regex, a problem that would take 50 lines of string manipulation collapses into a single well-crafted pattern.
Python’s re module is built into the standard library and provides a full regex engine. You write a pattern that describes what you’re looking for, and re finds it — in strings of any length, with any number of matches, extracted as individual strings or as named groups. No installation required.
In this article we’ll cover the essential regex syntax (character classes, quantifiers, anchors, groups), the five core re functions (match, search, findall, sub, split), named groups and compiled patterns, lookaheads and lookbehinds, common real-world patterns (email, phone, URL, date), and a practical log file parser. By the end, you’ll be able to write and read regex confidently for most everyday text parsing tasks.
Python Regex: Quick Example
Here’s how to extract all email addresses from a block of text in two lines:
# quick_regex.py
import re
text = "Contact us at support@example.com or sales@company.org for help. Spam: fake@.com"
emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
print(emails)
Output:
['support@example.com', 'sales@company.org']
re.findall() returns a list of all non-overlapping matches. The pattern [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} matches the local part of an email ([a-zA-Z0-9._%+-]+), then @, then a domain ([a-zA-Z0-9.-]+), then a dot (\.), then a TLD of 2+ letters ([a-zA-Z]{2,}). Notice fake@.com wasn’t matched — the domain part requires at least one character before the dot.
What Are Regular Expressions?
A regular expression is a sequence of characters that defines a search pattern. The pattern \d{3}-\d{4} matches “555-1234” — exactly three digits, a hyphen, and four digits. Patterns can describe fixed strings, ranges of characters, repetition, alternatives, and complex structures like “a word followed by a number followed by an optional suffix.”
| Pattern | Matches | What It Means |
|---|---|---|
| . | Any character except newline | Wildcard |
| \d | Any digit (0-9) | Digit shorthand |
| \w | Word character (a-z, A-Z, 0-9, _) | Word char shorthand |
| \s | Whitespace (space, tab, newline) | Space shorthand |
| ^ | Start of string (or line with MULTILINE) | Anchor |
| $ | End of string (or line with MULTILINE) | Anchor |
| [abc] | Any of a, b, or c | Character class |
| [^abc] | Any character NOT a, b, or c | Negated class |
| + | One or more of the preceding | Quantifier |
| * | Zero or more of the preceding | Quantifier |
| ? | Zero or one (optional) | Quantifier |
| {n,m} | Between n and m repetitions | Quantifier |
| a\|b | Either a or b | Alternation |
| (abc) | Capture group | Grouping |
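To see a few of these building blocks in action, here is a short sketch (the sample string is invented for illustration):

```python
import re

s = "Order 66 shipped on 2026-04-16 to unit A7"

digits = re.findall(r"\d+", s)                     # every run of digits
code = re.findall(r"[A-Z]\d", s)                   # capital letter followed by a digit
starts_with_order = bool(re.search(r"^Order", s))  # anchored at the start
verb = re.findall(r"shipped|delivered", s)         # alternation

print(digits)             # ['66', '2026', '04', '16', '7']
print(code)               # ['A7']
print(starts_with_order)  # True
print(verb)               # ['shipped']
```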
Always use raw strings (r'...') for regex patterns in Python. Without the r prefix, backslashes like \d and \w would need to be doubled (\\d, \\w) because Python treats \d as a string escape sequence. Raw strings pass the backslash through unchanged, making patterns cleaner and less error-prone.
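A quick sketch to convince yourself of both points:

```python
import re

# With the raw prefix, the backslash reaches the regex engine unchanged;
# without it, you must double every backslash to build the same pattern
assert re.findall(r'\d+', 'abc 123') == re.findall('\\d+', 'abc 123')

# The danger of forgetting the prefix: '\b' in a normal string is a
# backspace character (one char), while r'\b' is a two-character sequence
# that the regex engine interprets as a word boundary
print(len('\b'), len(r'\b'))  # 1 2
```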
The Five Core re Functions
re.match() — Match at the Start
re.match() only checks for a match at the very beginning of the string. It’s useful for validating format when you expect the string to start with a specific pattern.
# re_match.py
import re
# Only matches if the pattern is at the START of the string
result = re.match(r'\d{4}-\d{2}-\d{2}', '2026-04-16 09:00:00')
if result:
    print('Matched date:', result.group())
else:
    print('No match')
# Does NOT match -- pattern not at start
result2 = re.match(r'\d{4}-\d{2}-\d{2}', 'Log entry: 2026-04-16')
print('Match with prefix:', result2) # None
Output:
Matched date: 2026-04-16
Match with prefix: None
re.search() — Find Anywhere in the String
re.search() scans the entire string and returns the first match wherever it appears. This is the function to use when you’re looking for a pattern that might appear anywhere in the text.
# re_search.py
import re
log_line = 'ERROR 2026-04-16 09:23:45 - Connection timeout on port 5432'
# Find the timestamp anywhere in the string
ts_match = re.search(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', log_line)
if ts_match:
    print('Timestamp found:', ts_match.group())
    print('At position:', ts_match.start(), 'to', ts_match.end())
# Find the port number
port_match = re.search(r'port (\d+)', log_line)
if port_match:
    print('Port:', port_match.group(1))  # group(1) = first capture group
Output:
Timestamp found: 2026-04-16 09:23:45
At position: 6 to 25
Port: 5432
The match.group() method returns the full matched string. match.group(1) returns the first capture group (the content inside the first set of parentheses). match.start() and match.end() give you the character positions of the match in the original string.
re.findall() — Find All Matches
re.findall() returns a list of all non-overlapping matches. If the pattern has no groups, it returns a list of matched strings. If it has one group, it returns the group contents. If it has multiple groups, it returns a list of tuples.
# re_findall.py
import re
text = '''
Server logs for 2026-04-16:
192.168.1.10 -> request 200 OK
10.0.0.5 -> request 404 Not Found
172.16.0.1 -> request 200 OK
192.168.1.10 -> request 500 Internal Server Error
'''
# Find all IP addresses
ips = re.findall(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b', text)
print('IPs found:', ips)
# Find all status codes
codes = re.findall(r'request (\d{3})', text)
print('Status codes:', codes)
# Count 404s and 500s
errors = [c for c in codes if c.startswith(('4', '5'))]
print('Error responses:', len(errors))
Output:
IPs found: ['192.168.1.10', '10.0.0.5', '172.16.0.1', '192.168.1.10']
Status codes: ['200', '404', '200', '500']
Error responses: 2
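As noted above, a pattern with multiple groups makes re.findall() return tuples. A sketch of that behaviour, reusing log-style sample data (invented for illustration):

```python
import re

text = '''
192.168.1.10 -> request 200 OK
10.0.0.5 -> request 404 Not Found
'''

# Two capture groups per match: the IP and the status code.
# The (?:...) group is non-capturing, so it doesn't add a tuple element.
pairs = re.findall(r'(\d{1,3}(?:\.\d{1,3}){3}) -> request (\d{3})', text)
print(pairs)  # [('192.168.1.10', '200'), ('10.0.0.5', '404')]

# re.finditer() yields full match objects instead, useful for positions
for m in re.finditer(r'request (\d{3})', text):
    print(m.group(1), 'at offset', m.start())
```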
re.sub() — Replace Matches
re.sub() replaces all occurrences of a pattern with a replacement string or the result of a function. This is the regex-powered version of str.replace().
# re_sub.py
import re
# Normalize phone numbers to a consistent format
phones = [
    'Call us: (02) 9876-5432',
    'Mobile: 0412 345 678',
    'Fax: 02-9876-5432',
]
for phone_text in phones:
    # Pull out the phone-number span (starting at a digit or '('), then strip non-digits
    digits_only = re.sub(r'[^\d]', '', re.search(r'[\d(][\d\s\-()]*', phone_text).group())
    print(f'{phone_text:30} -> {digits_only}')
# Redact sensitive data: replace card numbers
text = 'Card: 4532-1234-5678-9012, expires 04/28'
redacted = re.sub(r'\d{4}-\d{4}-\d{4}-\d{4}', '[REDACTED]', text)
print('\nRedacted:', redacted)
Output:
Call us: (02) 9876-5432 -> 0298765432
Mobile: 0412 345 678 -> 0412345678
Fax: 02-9876-5432 -> 0298765432
Redacted: Card: [REDACTED], expires 04/28
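The fifth core function, re.split(), rounds out the set. It splits a string wherever the pattern matches, which str.split() cannot do with mixed delimiters. A short sketch:

```python
# re_split.py
import re

# Split on any mix of commas, semicolons, and surrounding whitespace
items = re.split(r'\s*[,;]\s*', 'apples, oranges;bananas ,  pears')
print(items)  # ['apples', 'oranges', 'bananas', 'pears']

# With a capture group, the delimiters themselves are kept in the result
parts = re.split(r'([+-])', '10+20-5')
print(parts)  # ['10', '+', '20', '-', '5']
```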
Named Groups and Compiled Patterns
For complex patterns you’ll reuse frequently, named capture groups make the code self-documenting. Instead of match.group(1), you write match.group('year'). Compiled patterns (re.compile()) also avoid re-parsing the pattern on every call, which is important in loops.
# named_groups.py
import re
# Compile a pattern with named groups
log_pattern = re.compile(
    r'(?P<level>DEBUG|INFO|WARNING|ERROR|CRITICAL)\s+'
    r'(?P<date>\d{4}-\d{2}-\d{2})\s+'
    r'(?P<time>\d{2}:\d{2}:\d{2})\s+-\s+'
    r'(?P<message>.+)'
)
log_lines = [
    'INFO 2026-04-16 09:00:01 - Application started',
    'WARNING 2026-04-16 09:15:32 - High memory usage: 85%',
    'ERROR 2026-04-16 09:23:45 - Database connection failed',
]
for line in log_lines:
    m = log_pattern.match(line)
    if m:
        print(f"Level: {m.group('level')}")
        print(f"Time: {m.group('date')} {m.group('time')}")
        print(f"Message: {m.group('message')}")
        print()
Output:
Level: INFO
Time: 2026-04-16 09:00:01
Message: Application started
Level: WARNING
Time: 2026-04-16 09:15:32
Message: High memory usage: 85%
Level: ERROR
Time: 2026-04-16 09:23:45
Message: Database connection failed
Named groups use the syntax (?P<name>pattern). The ?P<name> is Python-specific regex syntax (the P stands for “Python extension”). You can also access named groups as a dict via match.groupdict(), which returns {'level': 'INFO', 'date': '2026-04-16', ...} — very useful for feeding parsed log data into data structures.
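A sketch of match.groupdict() using the same pattern as above:

```python
import re

log_pattern = re.compile(
    r'(?P<level>DEBUG|INFO|WARNING|ERROR|CRITICAL)\s+'
    r'(?P<date>\d{4}-\d{2}-\d{2})\s+'
    r'(?P<time>\d{2}:\d{2}:\d{2})\s+-\s+'
    r'(?P<message>.+)'
)

m = log_pattern.match('INFO 2026-04-16 09:00:01 - Application started')
record = m.groupdict()  # every named group as a plain dict
print(record)
# {'level': 'INFO', 'date': '2026-04-16', 'time': '09:00:01', 'message': 'Application started'}
```

Because record is an ordinary dict, it drops straight into json.dumps(), a csv.DictWriter row, or a pandas DataFrame.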
Real-Life Example: Server Log Analyzer
Here’s a complete log file analyzer that parses Apache/nginx-style access logs, extracts metrics, and generates a summary report.
# log_analyzer.py
import re
from collections import Counter
# Sample nginx-style access log data
ACCESS_LOG = """
192.168.1.10 - - [16/Apr/2026:09:00:01 +0000] "GET /index.html HTTP/1.1" 200 1234
10.0.0.5 - - [16/Apr/2026:09:00:05 +0000] "GET /api/users HTTP/1.1" 200 892
192.168.1.10 - - [16/Apr/2026:09:01:12 +0000] "POST /api/login HTTP/1.1" 401 145
10.0.0.7 - - [16/Apr/2026:09:02:30 +0000] "GET /images/logo.png HTTP/1.1" 200 45678
192.168.1.25 - - [16/Apr/2026:09:03:11 +0000] "GET /api/data HTTP/1.1" 500 312
10.0.0.5 - - [16/Apr/2026:09:04:00 +0000] "GET /api/users/42 HTTP/1.1" 200 456
192.168.1.10 - - [16/Apr/2026:09:05:22 +0000] "DELETE /api/users/42 HTTP/1.1" 403 88
10.0.0.7 - - [16/Apr/2026:09:06:45 +0000] "GET /api/data HTTP/1.1" 200 789
""".strip()
# Compile the access log pattern
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) HTTP/[\d.]+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)'
)

def analyze_logs(log_text):
    """Parse log entries and return summary statistics."""
    ip_counter = Counter()
    status_counter = Counter()
    path_counter = Counter()
    method_counter = Counter()
    total_bytes = 0
    errors = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = LOG_PATTERN.match(line)
        if not m:
            errors.append(f'Could not parse: {line}')
            continue
        ip_counter[m.group('ip')] += 1
        status_counter[m.group('status')] += 1
        path_counter[m.group('path')] += 1
        method_counter[m.group('method')] += 1
        total_bytes += int(m.group('bytes'))
    return {
        'total_requests': sum(ip_counter.values()),
        'unique_ips': len(ip_counter),
        'top_ips': ip_counter.most_common(3),
        'status_codes': dict(sorted(status_counter.items())),
        'top_paths': path_counter.most_common(3),
        'methods': dict(method_counter),
        'total_bytes': total_bytes,
        'parse_errors': errors,
    }

stats = analyze_logs(ACCESS_LOG)
print(f'Total Requests: {stats["total_requests"]}')
print(f'Unique IPs: {stats["unique_ips"]}')
print(f'Total Data: {stats["total_bytes"] / 1024:.1f} KB')
print('\nStatus Codes:')
for code, count in stats['status_codes'].items():
    label = 'OK' if code.startswith('2') else 'ERR' if code.startswith('5') else ''
    print(f'  {code}: {count:3} {label}')
print('\nTop IPs:')
for ip, count in stats['top_ips']:
    print(f'  {ip:15} {count} requests')
print(f'\nHTTP Methods: {stats["methods"]}')
if stats['parse_errors']:
    print(f'\nParse errors: {len(stats["parse_errors"])}')
Output:
Total Requests: 8
Unique IPs: 4
Total Data: 48.4 KB
Status Codes:
  200:   5 OK
  401:   1
  403:   1
  500:   1 ERR
Top IPs:
  192.168.1.10    3 requests
  10.0.0.5        2 requests
  10.0.0.7        2 requests
HTTP Methods: {'GET': 6, 'POST': 1, 'DELETE': 1}
The compiled LOG_PATTERN with named groups is the heart of this analyzer — it extracts all seven fields from each log line in a single match() call. Calling re.compile() once outside the loop means the pattern is parsed only once, which matters when processing millions of log lines.
Frequently Asked Questions
What is greedy vs non-greedy matching?
By default, quantifiers like + and * are greedy — they match as much text as possible. re.search(r'<.+>', '<b>text</b>') matches the entire string <b>text</b>, not just <b>. Add ? after the quantifier to make it non-greedy (lazy): r'<.+?>' matches <b> and stops. Use non-greedy quantifiers when you want the shortest possible match between two delimiters.
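A quick sketch of the difference, using a made-up HTML snippet:

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ grabs as much as possible, from the first '<' to the last '>'
print(re.findall(r'<.+>', html))   # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .+? stops at the first '>' it can reach
print(re.findall(r'<.+?>', html))  # ['<b>', '</b>', '<i>', '</i>']
```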
How do I match across multiple lines?
By default, . doesn’t match newlines. Pass re.DOTALL as a flag to make . match any character including newlines: re.search(r'START.+END', text, re.DOTALL). For patterns where ^ and $ should match line boundaries (not just string boundaries), use re.MULTILINE. Both flags can be combined: re.DOTALL | re.MULTILINE.
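A sketch of both flags on a small multi-line string:

```python
import re

text = 'START\nline one\nline two\nEND'

# Without DOTALL, '.' refuses to cross the newlines, so there is no match
print(re.search(r'START.+END', text))  # None

# With DOTALL, '.' matches newlines too and the whole block is captured
m = re.search(r'START.+END', text, re.DOTALL)
print(m.group() == text)  # True

# MULTILINE makes ^ anchor at the start of every line, not just the string
print(re.findall(r'^line \w+', text, re.MULTILINE))  # ['line one', 'line two']
```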
When should I use re.compile()?
Use re.compile() when you’re calling the same pattern multiple times — in a loop, or in a function that’s called repeatedly. Compiled patterns cache the parsed regex, avoiding redundant work. For one-off searches in simple scripts, the module-level functions (re.search(), re.findall(), etc.) are fine — they also cache internally under the hood.
How do I match literal special characters like dot or parenthesis?
Escape them with a backslash: \. matches a literal dot (not the “any character” wildcard), \( matches a literal opening parenthesis. In raw strings, that’s r'\.' and r'\('. Common characters that need escaping: . ^ $ * + ? { } [ ] \ | ( ). Use re.escape(your_string) to automatically escape all special characters in a variable you want to match literally.
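For example:

```python
import re

# re.escape() backslash-escapes every regex metacharacter in a string,
# so a price pulled from user input can be matched literally
pattern = re.escape('$5.99')
print(re.findall(pattern, 'Total: $5.99 (sale)'))  # ['$5.99']

# Without escaping, the unescaped dot acts as a wildcard and also
# matches strings you did not intend
print(re.findall(r'\$5.99', '$5X99'))  # ['$5X99']
```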
My regex is running slowly. What can I do?
Several patterns cause catastrophic backtracking: nested quantifiers like (a+)+, alternations with overlapping patterns, or very long strings with no match. Solutions: compile the pattern once with re.compile(), use anchors (^ and $) to limit where Python searches, make quantifiers as specific as possible (use [a-z]+ instead of .+ when you know the character set), and test with tools like regex101.com which shows match steps and warnings about slow patterns.
Conclusion
Python’s re module gives you a full regex engine for any text processing challenge. We covered the core syntax (character classes, quantifiers, anchors, groups), the five main functions (match, search, findall, sub, split), named capture groups for readable code, compiled patterns for performance, greedy vs non-greedy matching, and a complete server log analyzer. Regular expressions have a reputation for being hard to read, but well-named groups and small, focused patterns keep them maintainable.
Extend the log analyzer to write hourly request rate breakdowns, flag IPs that generate more than 10 errors per hour, or parse a different log format by updating only the compiled pattern. The (?P<name>) named group system makes updating patterns clean because the code downstream references groups by name, not by index.
For the full syntax reference, flag descriptions, and advanced features like conditional matching, see the official re module documentation. The interactive regex101.com is invaluable for testing and debugging patterns.
Further Reading: For more details, see the Python sqlite3 documentation.
Frequently Asked Questions
What are the main data storage options in Python?
Python supports flat files (text, CSV, JSON), databases (SQLite, PostgreSQL, MySQL), key-value stores (Redis, shelve), pickle serialization, and cloud storage. The best choice depends on data size, structure, and access patterns.
When should I use SQLite vs a full database?
Use SQLite for single-user apps, prototypes, and small-to-medium datasets. Switch to PostgreSQL or MySQL for concurrent multi-user access, complex queries at scale, or production-grade reliability.
How do I save Python objects to disk?
Use pickle for Python-specific serialization, json for interoperable data, shelve for dictionary-like persistent storage, or databases for structured data. For data analysis, pandas can save to CSV, Parquet, or HDF5.
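As a quick illustration of shelve, which behaves like a dictionary whose contents survive between runs (the file name 'app_state' here is arbitrary):

```python
# shelve_sketch.py -- illustrative; the file name is made up
import shelve

# Write: assign to the shelf exactly as you would a dict
with shelve.open('app_state') as db:
    db['high_score'] = 42
    db['player'] = {'name': 'Alice', 'level': 3}

# Read back (in a later run this still works: the data persists on disk)
with shelve.open('app_state') as db:
    high_score = db['high_score']
    player_name = db['player']['name']

print(high_score, player_name)  # 42 Alice
```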
Is JSON or CSV better for storing data?
JSON handles nested, hierarchical data well. CSV is simpler for tabular, flat data. Use JSON for API data and configuration; use CSV for datasets and spreadsheet-compatible exports.
How do I choose between file storage and a database?
Use file storage for simple, single-user scenarios. Use a database when you need querying, indexing, concurrent access, or ACID transactions. SQLite bridges both worlds for simpler applications.