PostgreSQL is one of the most powerful open-source relational databases, and Python developers interact with it constantly — whether building web APIs, running data pipelines, or managing application state. If you have ever needed to store structured data beyond what SQLite can handle, PostgreSQL is usually the next step.
The good news is that psycopg (version 3, the modern successor to the venerable psycopg2) makes connecting Python to PostgreSQL straightforward and safe. It supports parameterized queries out of the box, handles connection pooling, and works beautifully with async code. You can install it with a single pip install psycopg[binary] command and be running queries in minutes.
In this article, we will cover everything you need to connect Python to PostgreSQL. We will start with a quick example showing a basic connection and query, then explain the psycopg library and why it is the recommended adapter. From there, we will walk through CRUD operations (Create, Read, Update, Delete), parameterized queries for security, connection pooling for performance, error handling patterns, and finish with a complete real-life project that builds a task manager backed by PostgreSQL.
Connecting Python to PostgreSQL: Quick Example
Here is a minimal working example that connects to a PostgreSQL database, creates a table, inserts a row, and reads it back. This gives you the core pattern you will use in every PostgreSQL project.
# quick_example.py
import psycopg

# Connect to PostgreSQL (adjust these for your setup)
conn_string = "host=localhost dbname=testdb user=postgres password=postgres"

with psycopg.connect(conn_string) as conn:
    with conn.cursor() as cur:
        # Create a simple table
        cur.execute("""
            CREATE TABLE IF NOT EXISTS greetings (
                id SERIAL PRIMARY KEY,
                message TEXT NOT NULL
            )
        """)
        # Insert a row
        cur.execute("INSERT INTO greetings (message) VALUES (%s)", ("Hello from Python!",))
        conn.commit()
        # Read it back
        cur.execute("SELECT id, message FROM greetings ORDER BY id DESC LIMIT 1")
        row = cur.fetchone()
        print(f"ID: {row[0]}, Message: {row[1]}")
Output:
ID: 1, Message: Hello from Python!
The key things to notice: we use psycopg.connect() with a connection string, wrap everything in with blocks for automatic cleanup, and use %s placeholders for parameterized queries (never string formatting). The conn.commit() call makes the insert permanent. Want to go deeper? Below we cover connection options, all four CRUD operations, pooling, and a complete project.
One connection string. Infinite queries. Zero SQL injection.
What Is psycopg and Why Use It?
psycopg is the most popular PostgreSQL adapter for Python. Version 3 (just called psycopg) is a complete rewrite of the classic psycopg2 that powered Django, Flask, and countless Python applications for over a decade. The new version brings a cleaner API, native async support, and better type handling while keeping the reliability developers trusted.
Here is how psycopg compares to other options for connecting Python to PostgreSQL:
Feature | psycopg (v3) | psycopg2 | asyncpg
Python 3.7+ support | Yes | Yes | Yes
Async support | Built-in | No (needs wrappers) | Async only
Connection pooling | Official psycopg_pool package | Basic (psycopg2.pool) | Built-in
Parameterized queries | %s and named | %s and named | $1, $2 style
COPY support | Excellent | Good | Good
Active development | Yes (recommended) | Maintenance only | Yes
Django/Flask compatible | Yes | Yes | Limited
For most Python developers, psycopg (v3) is the right choice. It handles both sync and async workflows, has thorough documentation, and is the version the psycopg maintainers recommend for new projects. The rest of this article uses psycopg v3 exclusively.
Installing psycopg
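The easiest way to install psycopg is with the binary package, which bundles the C library so you do not need PostgreSQL development headers installed: pip install "psycopg[binary]". The quotes keep shells such as zsh from interpreting the square brackets. The connection pool used later in this article is distributed separately as psycopg_pool; adding the pool extra installs it as well: pip install "psycopg[binary,pool]".
If you prefer to compile the C extension locally (common in production Docker images), make sure a C compiler and libpq-dev are available, then run pip install "psycopg[c]". For development and tutorials, the binary option is the fastest path.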
Connecting to PostgreSQL
psycopg offers several ways to specify your connection. The most common patterns are a connection string (DSN) and keyword arguments. Both produce identical results — choose whichever reads better in your codebase.
# connection_methods.py
import psycopg

# Method 1: Connection string (DSN)
conn1 = psycopg.connect("host=localhost dbname=myapp user=appuser password=secret")

# Method 2: Keyword arguments
conn2 = psycopg.connect(
    host="localhost",
    dbname="myapp",
    user="appuser",
    password="secret",
    port=5432,
)

# Method 3: PostgreSQL URI format
conn3 = psycopg.connect("postgresql://appuser:secret@localhost:5432/myapp")

# Always use context managers for automatic cleanup
with psycopg.connect("host=localhost dbname=myapp user=appuser password=secret") as conn:
    print(f"Connected to: {conn.info.dbname}")
    print(f"Server version: {conn.info.server_version}")

# Connections opened without a context manager must be closed explicitly
conn1.close()
conn2.close()
conn3.close()
Output:
Connected to: myapp
Server version: 160001
The context manager pattern (with psycopg.connect(...) as conn) is strongly recommended. It automatically commits the transaction on success, rolls back on exception, and closes the connection when the block exits. This prevents connection leaks and orphaned transactions — two of the most common PostgreSQL headaches in production.
conn = psycopg.connect() — three seconds to production-ready database access.
CRUD Operations with psycopg
CREATE: Inserting Data
Inserting data uses cursor.execute() with parameterized queries. Always use %s placeholders — never f-strings or string concatenation. Parameterized queries prevent SQL injection and handle type conversion automatically.
# insert_data.py
import psycopg

with psycopg.connect("host=localhost dbname=testdb user=postgres password=postgres") as conn:
    with conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS users (
                id SERIAL PRIMARY KEY,
                name TEXT NOT NULL,
                email TEXT UNIQUE NOT NULL,
                age INTEGER
            )
        """)
        # Insert a single row with parameterized query
        cur.execute(
            "INSERT INTO users (name, email, age) VALUES (%s, %s, %s)",
            ("Alice Chen", "alice@example.com", 28)
        )
        # Insert multiple rows efficiently with executemany
        new_users = [
            ("Bob Park", "bob@example.com", 34),
            ("Carol Smith", "carol@example.com", 22),
            ("Dave Wilson", "dave@example.com", 45),
        ]
        cur.executemany(
            "INSERT INTO users (name, email, age) VALUES (%s, %s, %s)",
            new_users
        )
        conn.commit()
        print(f"Inserted {1 + len(new_users)} users successfully")
Output:
Inserted 4 users successfully
The executemany() method is cleaner than looping with individual execute() calls, and psycopg optimizes it internally. For truly large batches (thousands of rows), look into cursor.copy() which uses PostgreSQL’s COPY protocol and is dramatically faster.
READ: Querying Data
Reading data involves executing a SELECT query and fetching results. psycopg gives you several fetch options depending on how much data you expect.
# read_data.py
import psycopg
from psycopg.rows import dict_row

with psycopg.connect("host=localhost dbname=testdb user=postgres password=postgres") as conn:
    with conn.cursor() as cur:
        # Fetch all rows
        cur.execute("SELECT id, name, email, age FROM users ORDER BY name")
        all_users = cur.fetchall()
        print("All users:")
        for user in all_users:
            print(f" {user[0]}: {user[1]} ({user[2]}), age {user[3]}")
        # Fetch one row
        cur.execute("SELECT name, age FROM users WHERE email = %s", ("alice@example.com",))
        alice = cur.fetchone()
        print(f"\nFound: {alice[0]}, age {alice[1]}")

    # Use row factory for named columns (much more readable)
    with conn.cursor(row_factory=dict_row) as cur:
        cur.execute("SELECT name, email, age FROM users WHERE age > %s", (25,))
        older_users = cur.fetchall()
        print("\nUsers over 25:")
        for u in older_users:
            print(f" {u['name']}: {u['email']}, age {u['age']}")
Output:
All users:
1: Alice Chen (alice@example.com), age 28
2: Bob Park (bob@example.com), age 34
3: Carol Smith (carol@example.com), age 22
4: Dave Wilson (dave@example.com), age 45
Found: Alice Chen, age 28
Users over 25:
Alice Chen: alice@example.com, age 28
Bob Park: bob@example.com, age 34
Dave Wilson: dave@example.com, age 45
The dict_row row factory is a game-changer for readability. Instead of accessing columns by index (row[0], row[1]), you use names (row['name'], row['email']). This makes your code self-documenting and resilient to column order changes.
UPDATE: Modifying Data
Updates follow the same parameterized pattern. The rowcount attribute tells you how many rows were affected.
# update_data.py
import psycopg

with psycopg.connect("host=localhost dbname=testdb user=postgres password=postgres") as conn:
    with conn.cursor() as cur:
        # Update a single user
        cur.execute(
            "UPDATE users SET age = %s WHERE email = %s",
            (29, "alice@example.com")
        )
        print(f"Updated {cur.rowcount} row(s)")
        # Update multiple rows with a condition
        cur.execute(
            "UPDATE users SET age = age + 1 WHERE age < %s",
            (30,)
        )
        print(f"Birthday bump: {cur.rowcount} user(s) aged up")
        conn.commit()
Output:
Updated 1 row(s)
Birthday bump: 2 user(s) aged up
Always check cur.rowcount after updates and deletes. If it returns 0 when you expected changes, your WHERE clause might be wrong -- and catching that early saves hours of debugging.
DELETE: Removing Data
Deletes work the same way. Be cautious with DELETE statements -- a missing WHERE clause deletes everything in the table.
# delete_data.py
import psycopg

with psycopg.connect("host=localhost dbname=testdb user=postgres password=postgres") as conn:
    with conn.cursor() as cur:
        # Delete a specific user
        cur.execute(
            "DELETE FROM users WHERE email = %s",
            ("dave@example.com",)
        )
        print(f"Deleted {cur.rowcount} user(s)")
        # Verify the deletion
        cur.execute("SELECT COUNT(*) FROM users")
        count = cur.fetchone()[0]
        print(f"Remaining users: {count}")
        conn.commit()
Output:
Deleted 1 user(s)
Remaining users: 3
Four operations, infinite applications. CRUD is the backbone of every database app.
Error Handling
Database operations fail in predictable ways -- duplicate keys, connection drops, malformed queries. psycopg raises specific exception types for each, so you can handle them precisely.
# error_handling.py
import psycopg
from psycopg import errors

conn_string = "host=localhost dbname=testdb user=postgres password=postgres"

try:
    with psycopg.connect(conn_string) as conn:
        with conn.cursor() as cur:
            # This will fail if email already exists (UNIQUE constraint)
            cur.execute(
                "INSERT INTO users (name, email, age) VALUES (%s, %s, %s)",
                ("Alice Chen", "alice@example.com", 28)
            )
            conn.commit()
except errors.UniqueViolation as e:
    print(f"Duplicate entry: {e.diag.message_detail}")
except errors.OperationalError as e:
    print(f"Connection problem: {e}")
except errors.ProgrammingError as e:
    print(f"SQL error: {e}")
except Exception as e:
    print(f"Unexpected error: {type(e).__name__}: {e}")
The psycopg.errors module maps every PostgreSQL error code to a Python exception class. UniqueViolation, ForeignKeyViolation, CheckViolation -- they are all there. This lets you show users a friendly "email already taken" message instead of a raw database error.
Connection Pooling
Creating a new database connection for every request is slow (each connection involves a TCP handshake, authentication, and memory allocation on the server). Connection pooling solves this by maintaining a set of open connections that get reused across requests.
# connection_pool.py
# Requires the separate psycopg_pool package: pip install "psycopg[pool]"
from psycopg_pool import ConnectionPool

# Create a pool with min 2, max 10 connections
pool = ConnectionPool(
    "host=localhost dbname=testdb user=postgres password=postgres",
    min_size=2,
    max_size=10
)

# Use connections from the pool
with pool.connection() as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM users")
        count = cur.fetchone()[0]
        print(f"User count: {count}")

# The connection is returned to the pool, not closed
with pool.connection() as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT name FROM users LIMIT 1")
        name = cur.fetchone()[0]
        print(f"First user: {name}")

# Get pool stats
stats = pool.get_stats()
print(f"Pool size: {stats['pool_size']}, available: {stats['pool_available']}")

pool.close()
Output:
User count: 3
First user: Alice Chen
Pool size: 2, available: 2
In a web application (Flask, FastAPI, Django), you would create the pool once at startup and share it across all request handlers. This dramatically reduces latency since connections are reused instead of created fresh for every HTTP request. The max_size parameter prevents your application from overwhelming the database with too many simultaneous connections.
One pool, ten connections, a thousand requests. Connection pooling is free performance.
Working with Transactions
By default, psycopg wraps every operation in a transaction. The context manager commits on success and rolls back on failure. But sometimes you need more control -- for example, when multiple operations must succeed or fail together.
# transactions.py
import psycopg

conn_string = "host=localhost dbname=testdb user=postgres password=postgres"

with psycopg.connect(conn_string) as conn:
    # Explicit transaction control
    try:
        with conn.transaction():
            with conn.cursor() as cur:
                # Both operations must succeed
                cur.execute(
                    "UPDATE users SET age = age - 1 WHERE name = %s",
                    ("Alice Chen",)
                )
                cur.execute(
                    "UPDATE users SET age = age + 1 WHERE name = %s",
                    ("Bob Park",)
                )
        print("Both updates committed together")
    except Exception as e:
        print(f"Transaction rolled back: {e}")

    # Nested savepoints
    with conn.transaction() as tx1:
        with conn.cursor() as cur:
            cur.execute("INSERT INTO users (name, email, age) VALUES (%s, %s, %s)",
                        ("Eve Brown", "eve@example.com", 31))
            try:
                with conn.transaction() as tx2:
                    cur.execute("INSERT INTO users (name, email, age) VALUES (%s, %s, %s)",
                                ("Eve Brown", "eve-duplicate@example.com", 31))
                    # This inner transaction can fail without killing the outer one
            except Exception:
                print("Inner savepoint rolled back, outer transaction continues")
    # The outermost transaction() block commits on exit, so no explicit
    # conn.commit() is needed (psycopg rejects commit() inside such a block)
    print("Eve inserted successfully")
Output:
Both updates committed together
Eve inserted successfully
The conn.transaction() context manager creates a savepoint when nested. This is incredibly useful for "try this, but if it fails, keep going" patterns -- common in data import pipelines where you want to skip bad rows without losing the entire batch.
Real-Life Example: Building a Task Manager CLI
Let us tie everything together with a complete task manager that stores tasks in PostgreSQL. This project uses connection pooling, parameterized queries, error handling, and all four CRUD operations.
A complete CRUD app with pooling and error handling. Not bad for 50 lines.
# task_manager.py
import psycopg
from psycopg_pool import ConnectionPool
from psycopg import errors
from datetime import datetime

DB_URL = "host=localhost dbname=testdb user=postgres password=postgres"


def setup_database(pool):
    """Create the tasks table if it does not exist."""
    with pool.connection() as conn:
        with conn.cursor() as cur:
            cur.execute("""
                CREATE TABLE IF NOT EXISTS tasks (
                    id SERIAL PRIMARY KEY,
                    title TEXT NOT NULL,
                    description TEXT DEFAULT '',
                    status TEXT DEFAULT 'pending',
                    created_at TIMESTAMP DEFAULT NOW(),
                    completed_at TIMESTAMP
                )
            """)
            conn.commit()


def add_task(pool, title, description=""):
    """Add a new task and return its ID."""
    with pool.connection() as conn:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO tasks (title, description) VALUES (%s, %s) RETURNING id",
                (title, description)
            )
            task_id = cur.fetchone()[0]
            conn.commit()
            return task_id


def list_tasks(pool, status_filter=None):
    """List tasks, optionally filtered by status."""
    with pool.connection() as conn:
        with conn.cursor(row_factory=psycopg.rows.dict_row) as cur:
            if status_filter:
                cur.execute(
                    "SELECT id, title, status, created_at FROM tasks WHERE status = %s ORDER BY created_at",
                    (status_filter,)
                )
            else:
                cur.execute("SELECT id, title, status, created_at FROM tasks ORDER BY created_at")
            return cur.fetchall()


def complete_task(pool, task_id):
    """Mark a task as completed."""
    with pool.connection() as conn:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE tasks SET status = %s, completed_at = %s WHERE id = %s",
                ("completed", datetime.now(), task_id)
            )
            conn.commit()
            return cur.rowcount > 0


def delete_task(pool, task_id):
    """Delete a task by ID."""
    with pool.connection() as conn:
        with conn.cursor() as cur:
            cur.execute("DELETE FROM tasks WHERE id = %s", (task_id,))
            conn.commit()
            return cur.rowcount > 0


# Demo usage
pool = ConnectionPool(DB_URL, min_size=2, max_size=5)
setup_database(pool)

# Add some tasks
id1 = add_task(pool, "Learn psycopg", "Complete the PostgreSQL tutorial")
id2 = add_task(pool, "Build REST API", "Create FastAPI endpoints for tasks")
id3 = add_task(pool, "Write tests", "Add pytest coverage for database layer")
print(f"Created tasks: {id1}, {id2}, {id3}")

# List all tasks
print("\nAll tasks:")
for task in list_tasks(pool):
    print(f" [{task['status']}] #{task['id']}: {task['title']}")

# Complete a task
complete_task(pool, id1)
print(f"\nCompleted task #{id1}")

# List pending tasks only
print("\nPending tasks:")
for task in list_tasks(pool, "pending"):
    print(f" #{task['id']}: {task['title']}")

# Delete a task
delete_task(pool, id3)
print(f"\nDeleted task #{id3}")

# Final count
print(f"\nTotal tasks remaining: {len(list_tasks(pool))}")

pool.close()
Output:
Created tasks: 1, 2, 3
All tasks:
[pending] #1: Learn psycopg
[pending] #2: Build REST API
[pending] #3: Write tests
Completed task #1
Pending tasks:
#2: Build REST API
#3: Write tests
Deleted task #3
Total tasks remaining: 2
This task manager demonstrates every concept from the article: connecting with a pool, parameterized queries for safety, dict_row for readable results, RETURNING clauses for getting generated IDs, and proper transaction handling. You could extend this into a full web application by wrapping these functions in FastAPI or Flask endpoints.
Frequently Asked Questions
Should I use psycopg2 or psycopg (v3)?
For new projects, always use psycopg v3 (installed as pip install psycopg). It has better async support, built-in connection pooling, a cleaner API, and is actively developed. psycopg2 is in maintenance mode -- it still works, but new features and improvements only land in v3. The migration is straightforward since the core concepts (parameterized queries, cursors, context managers) are the same.
How do I prevent SQL injection with psycopg?
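Always use parameterized queries with %s placeholders: cur.execute("SELECT * FROM users WHERE id = %s", (user_id,)). Never use f-strings, string concatenation, or format() to build SQL. psycopg sends the values separately from the SQL text and handles type conversion automatically, which rules out injection through values as long as you use placeholders consistently. Placeholders only stand in for values; if a table or column name has to vary, compose it with the psycopg.sql module (sql.SQL and sql.Identifier) rather than with string formatting.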
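To make the difference concrete, here is a small sketch that needs no database at all. It only builds the strings you would hand to cur.execute(); the function names unsafe_query and safe_query are made up for illustration.

```python
def unsafe_query(user_input):
    # Anti-pattern: the input becomes part of the SQL text itself
    return f"SELECT * FROM users WHERE email = '{user_input}'"


def safe_query(user_input):
    # Parameterized form: SQL text and values stay separate, ready to be
    # passed along as cur.execute(*safe_query(value))
    return ("SELECT * FROM users WHERE email = %s", (user_input,))


malicious = "x' OR '1'='1"

# The attacker's quote closes the string literal and adds a condition
print(unsafe_query(malicious))

# The SQL text is unchanged; the input travels only as a value
sql_text, params = safe_query(malicious)
print(sql_text, params)
```

In the unsafe version the WHERE clause now matches every row. In the parameterized version the same input can only ever be compared against the email column.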
Can I use psycopg with async/await?
Yes. psycopg v3 has a built-in async module: from psycopg import AsyncConnection. Use await AsyncConnection.connect() and await cursor.execute(). It works with asyncio, FastAPI, and any other async framework. The async connection pool is AsyncConnectionPool from psycopg_pool.
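A minimal async sketch, assuming the same local testdb credentials and users table used earlier in this article. The function names count_users and main are illustrative, and psycopg is imported inside the coroutine only so the file can be loaded on a machine without it installed.

```python
import asyncio


async def count_users(conninfo):
    """Return the number of rows in the users table using the async API."""
    import psycopg  # local import: keeps this sketch loadable without psycopg

    # connect() is a coroutine; the awaited connection is also an
    # async context manager that closes the connection on exit
    async with await psycopg.AsyncConnection.connect(conninfo) as aconn:
        async with aconn.cursor() as cur:
            await cur.execute("SELECT COUNT(*) FROM users")
            row = await cur.fetchone()
            return row[0]


def main():
    # Assumed local credentials, matching the earlier examples
    conninfo = "host=localhost dbname=testdb user=postgres password=postgres"
    print(f"User count: {asyncio.run(count_users(conninfo))}")

# Call main() once a PostgreSQL server is reachable.
```

The shape mirrors the synchronous code: the same connect, cursor, execute, and fetch steps, with await added wherever the call talks to the server.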
How many connections should my pool have?
A good starting point is min_size=2, max_size=10 for small applications. The PostgreSQL wiki offers a rule of thumb for the number of active connections: connections = (core_count * 2) + effective_spindle_count, where core_count refers to the CPU cores of the database server. In practice, most web applications work well with 10-20 connections in the pool. Monitor your PostgreSQL pg_stat_activity view to see actual connection usage and tune from there.
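The rule of thumb is simple arithmetic. A tiny helper along these lines (the name suggested_pool_size is hypothetical) can make the starting value explicit in your configuration code:

```python
def suggested_pool_size(core_count, spindle_count=1):
    """Starting point from (core_count * 2) + effective_spindle_count.

    core_count describes the database server, not the machine running
    the Python application. Treat the result as a first guess to tune.
    """
    return core_count * 2 + spindle_count


# A database server with 4 cores and a single disk
print(suggested_pool_size(4, 1))
```

For a 4-core server with one disk this gives 9, which sits close to the max_size=10 used in the pooling example.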
How do I store database credentials securely?
Never hardcode credentials in your source code. Use environment variables (os.environ['DATABASE_URL']), a .env file loaded with python-dotenv, or a secrets manager (AWS Secrets Manager, HashiCorp Vault). PostgreSQL also supports a ~/.pgpass file for local development. For connection strings, the standard DATABASE_URL environment variable works with most frameworks and deployment platforms.
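One possible shape for the environment-variable approach is sketched below. The helper name conninfo_from_env is made up for illustration, and it assumes the whole connection string lives in DATABASE_URL.

```python
import os


def conninfo_from_env(environ=os.environ):
    """Return the connection string stored in DATABASE_URL.

    Failing loudly at startup is usually easier to debug than a
    connection error deep inside a request handler.
    """
    url = environ.get("DATABASE_URL")
    if not url:
        raise RuntimeError("DATABASE_URL is not set")
    return url

# Usage (with psycopg installed and the variable exported):
#     with psycopg.connect(conninfo_from_env()) as conn:
#         ...
```

Passing the mapping in as a parameter keeps the function easy to test with a plain dictionary instead of the real environment.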
Conclusion
You now have a solid foundation for connecting Python to PostgreSQL with psycopg. We covered the essential workflow: installing psycopg[binary], establishing connections with context managers, running all four CRUD operations with parameterized queries, handling database errors gracefully, and using connection pooling for production performance. The task manager project ties all these concepts into a practical, extensible application.
From here, try extending the task manager with features like priority levels, due dates, or full-text search using PostgreSQL's tsvector type. Psycopg handles all of these naturally since it passes your SQL through to PostgreSQL without limiting which features you can use.
For the complete API reference and advanced topics like COPY operations, async usage, and custom type adapters, check the official psycopg documentation at www.psycopg.org/psycopg3/docs/.
You’re debugging a production issue, but your application is silent. You added a few print() statements weeks ago, the messages got buried in the terminal, and now you have no idea what’s happening. Or worse: your app is logging to console, but the logs disappear the moment the process restarts. You need a way to capture what your application is doing—when it’s doing it, at what severity level, and where it should be recorded.
This is where Python’s built-in logging module becomes essential. Unlike print() statements, which are crude and take their information with them once you delete them, the logging module is designed for production applications. It ships with Python, requires no external dependencies, and provides granular control over message levels, formatting, and output destinations.
In this article, you’ll learn how to set up the logging module to output messages simultaneously to both your console (for immediate feedback during development) and to a file (for long-term record-keeping and debugging). We’ll cover logging levels, handlers, formatters, log rotation to prevent massive log files, and the patterns used in real multi-module projects. By the end, you’ll understand how to instrument your code with logging that developers trust.
How To Set Up Logging: Quick Example
Here’s a minimal example that outputs log messages to both console and file:
# quick_logging_example.py
import logging
# Create a logger
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
# File handler
file_handler = logging.FileHandler("app.log")
file_handler.setLevel(logging.DEBUG)
# Console handler
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.INFO)
# Formatter
formatter = logging.Formatter(
"%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
file_handler.setFormatter(formatter)
console_handler.setFormatter(formatter)
# Add handlers to logger
logger.addHandler(file_handler)
logger.addHandler(console_handler)
# Log some messages
logger.debug("Debug message (goes to file only)")
logger.info("Info message (goes to both)")
logger.warning("Warning message (goes to both)")
logger.error("Error message (goes to both)")
logger.critical("Critical message (goes to both)")
Output (to console):
2026-03-29 14:22:15,342 - __main__ - INFO - Info message (goes to both)
2026-03-29 14:22:15,343 - __main__ - WARNING - Warning message (goes to both)
2026-03-29 14:22:15,344 - __main__ - ERROR - Error message (goes to both)
2026-03-29 14:22:15,344 - __main__ - CRITICAL - Critical message (goes to both)
Output (written to app.log):
2026-03-29 14:22:15,341 - __main__ - DEBUG - Debug message (goes to file only)
2026-03-29 14:22:15,342 - __main__ - INFO - Info message (goes to both)
2026-03-29 14:22:15,343 - __main__ - WARNING - Warning message (goes to both)
2026-03-29 14:22:15,344 - __main__ - ERROR - Error message (goes to both)
2026-03-29 14:22:15,344 - __main__ - CRITICAL - Critical message (goes to both)
Notice the key pattern: we created a logger, attached two separate handlers (one for files, one for console), set different levels for each, and applied a formatter that includes timestamps and severity levels. This is the foundation for everything that follows. The sections below show you how to customize each piece.
Good logs are how you debug code you wrote six months ago and forgot about.
What is Python Logging and Why Use It?
The logging module is Python’s standard library tool for recording events that happen during program execution. Unlike print statements, logging provides:
Multiple outputs — send logs to files, console, email, syslog, or custom handlers simultaneously
Formatting control — include timestamps, function names, line numbers, and custom metadata
Filtering — selectively log messages based on logger name, level, or custom criteria
No side effects — unlike print, you can leave logging code in production without cluttering output
The alternative—using print() for debugging—breaks down immediately:
Aspect | print() Statements | logging Module
Disable in production | Must manually remove | Adjust level, keep code in place
Output destination | Always stdout | File, console, email, or custom
Timestamps | Manual string concatenation | Automatic, customizable format
Severity levels | None | DEBUG, INFO, WARNING, ERROR, CRITICAL
Performance | Always evaluates | Can be filtered; lazy evaluation
Multi-module coordination | No built-in support | Hierarchical logger names
The logging module is designed for exactly what you need: professional-grade event recording that stays in your code indefinitely.
Understanding Logging Levels
Python’s logging module defines five standard severity levels, plus a catch-all NOTSET. Each level has a numeric value, and loggers will only record messages at or above their configured level:
Level | Numeric Value | When to Use | Example
DEBUG | 10 | Detailed diagnostic info for debugging | Variable values, function entry/exit, loop iterations
INFO | 20 | General informational messages | Application startup, config loaded, request received
WARNING | 30 | Something unexpected or potentially harmful | Deprecated API usage, missing optional config, retrying failed request
ERROR | 40 | A serious problem; some operation failed | File not found, API returned 500, database connection lost
CRITICAL | 50 | A very serious error; program may not continue | Out of memory, permissions denied, unrecoverable system error
When you set a logger’s level to INFO, it will log INFO, WARNING, ERROR, and CRITICAL messages—but not DEBUG messages. This is how you control verbosity.
# logging_levels_demo.py
import logging
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
# Add a console handler so we can see output
handler = logging.StreamHandler()
handler.setLevel(logging.WARNING)
formatter = logging.Formatter("%(levelname)s - %(message)s")
handler.setFormatter(formatter)
logger.addHandler(handler)
# These will NOT appear (level is below WARNING)
logger.debug("This is a debug message")
logger.info("This is an info message")
# These WILL appear
logger.warning("This is a warning message")
logger.error("This is an error message")
logger.critical("This is a critical message")
Output:
WARNING - This is a warning message
ERROR - This is an error message
CRITICAL - This is a critical message
Notice: the logger itself has one level (DEBUG), but the console handler has a different level (WARNING). You can filter messages at multiple levels—first at the logger, then at each handler. This is crucial for sending different messages to different outputs (e.g., all DEBUG messages to a debug log file, only ERROR+ to a critical alert file).
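The two-destination idea can be sketched like this. In-memory streams stand in for the debug log file and the alert file so the example runs anywhere; the logger name levels_sketch and both messages are illustrative.

```python
import io
import logging

debug_stream = io.StringIO()  # stands in for a debug log file
alert_stream = io.StringIO()  # stands in for an alert file

logger = logging.getLogger("levels_sketch")
logger.setLevel(logging.DEBUG)  # first filter: the logger lets everything through
logger.propagate = False        # keep the demo isolated from the root logger

debug_handler = logging.StreamHandler(debug_stream)
debug_handler.setLevel(logging.DEBUG)  # second filter: accepts every level
alert_handler = logging.StreamHandler(alert_stream)
alert_handler.setLevel(logging.ERROR)  # second filter: ERROR and above only

formatter = logging.Formatter("%(levelname)s - %(message)s")
for handler in (debug_handler, alert_handler):
    handler.setFormatter(formatter)
    logger.addHandler(handler)

logger.debug("cache miss for key 42")
logger.error("payment gateway unreachable")

print("debug destination:")
print(debug_stream.getvalue(), end="")
print("alert destination:")
print(alert_stream.getvalue(), end="")
```

Both messages reach the debug destination, while only the ERROR message reaches the alert destination. Swapping the StringIO objects for logging.FileHandler instances gives the file-based version.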
Handlers and Formatters: Controlling Where and How Logs Go
A logger is just a container. The actual work happens in handlers and formatters:
Handler — an output destination. FileHandler writes to a file, StreamHandler writes to console, etc.
Formatter — defines how log messages are formatted: which fields to include (timestamp, function name, etc.) and in what order
You create a handler, assign a formatter to it, set a level, and attach it to a logger. A single logger can have multiple handlers, each with different levels and formatters.
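Those four steps can be sketched as follows, assuming a logger named myapp and a console handler; the three messages are illustrative.

```python
import logging
import sys

logger = logging.getLogger("myapp")
logger.setLevel(logging.INFO)

# 1. Create a handler (the output destination)
console_handler = logging.StreamHandler(sys.stdout)
# 2. Set its level
console_handler.setLevel(logging.INFO)
# 3. Assign a formatter
console_handler.setFormatter(
    logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
)
# 4. Attach it to the logger
logger.addHandler(console_handler)

logger.info("Application started")
logger.warning("This is a warning")
logger.error("An error occurred")
```

Run as a script, each call prints one line in this shape: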
2026-03-29 14:25:30,123 - myapp - INFO - Application started
2026-03-29 14:25:30,124 - myapp - WARNING - This is a warning
2026-03-29 14:25:30,125 - myapp - ERROR - An error occurred
The %(asctime)s token automatically includes a timestamp. Other useful tokens include %(funcName)s (the function name), %(lineno)d (line number), and %(module)s (the module filename).
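To see a handler and formatter working together on a file, here is a minimal sketch. It assumes a logger named myapp and a log file called app.log in the working directory; the four messages are illustrative.

```python
import logging

logger = logging.getLogger("myapp")
logger.setLevel(logging.DEBUG)

# File handler at DEBUG: every message the logger accepts is written out
file_handler = logging.FileHandler("app.log")
file_handler.setLevel(logging.DEBUG)
file_handler.setFormatter(
    logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
)
logger.addHandler(file_handler)

logger.debug("Debug: application starting")
logger.info("Info: loading configuration")
logger.warning("Warning: deprecated API used")
logger.error("Error: failed to connect to database")
```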
After running this, check your app.log file. All four messages will be there because the file handler’s level is DEBUG.
Output (written to app.log):
2026-03-29 14:27:01,456 - myapp - DEBUG - Debug: application starting
2026-03-29 14:27:01,457 - myapp - INFO - Info: loading configuration
2026-03-29 14:27:01,458 - myapp - WARNING - Warning: deprecated API used
2026-03-29 14:27:01,459 - myapp - ERROR - Error: failed to connect to database
Handlers are traffic directors: DEBUG takes the file fork, ERROR takes the console.
Logging to Console and File Simultaneously
The most common pattern in production is to send all logs to a file (for permanent record) and only show WARNING+ messages on the console (for immediate visibility during operation). Here’s how:
# console_and_file_logging.py
import logging
import os
# Create a logger
logger = logging.getLogger("myapp")
logger.setLevel(logging.DEBUG)
# Create log directory if it doesn't exist
log_dir = "logs"
os.makedirs(log_dir, exist_ok=True)
# File handler: captures all messages
file_handler = logging.FileHandler(os.path.join(log_dir, "app.log"))
file_handler.setLevel(logging.DEBUG)
# Console handler: shows only warnings and above
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.WARNING)
# Shared formatter for both handlers
formatter = logging.Formatter(
"%(asctime)s - %(name)s - %(levelname)s - %(message)s",
datefmt="%Y-%m-%d %H:%M:%S"
)
file_handler.setFormatter(formatter)
console_handler.setFormatter(formatter)
# Attach handlers to logger
logger.addHandler(file_handler)
logger.addHandler(console_handler)
# Log messages at different levels
logger.debug("Starting application initialization")
logger.info("Configuration loaded successfully")
logger.info("Database connection established")
logger.warning("API response time is higher than usual")
logger.error("Failed to write to cache, continuing without cache")
logger.critical("Memory usage exceeded safe threshold")
Output (to console):
2026-03-29 14:30:12 - myapp - WARNING - API response time is higher than usual
2026-03-29 14:30:12 - myapp - ERROR - Failed to write to cache, continuing without cache
2026-03-29 14:30:12 - myapp - CRITICAL - Memory usage exceeded safe threshold
Output (written to logs/app.log):
2026-03-29 14:30:12 - myapp - DEBUG - Starting application initialization
2026-03-29 14:30:12 - myapp - INFO - Configuration loaded successfully
2026-03-29 14:30:12 - myapp - INFO - Database connection established
2026-03-29 14:30:12 - myapp - WARNING - API response time is higher than usual
2026-03-29 14:30:12 - myapp - ERROR - Failed to write to cache, continuing without cache
2026-03-29 14:30:12 - myapp - CRITICAL - Memory usage exceeded safe threshold
This pattern is powerful: you get a permanent record of everything (including debug messages developers need when troubleshooting), but the console stays clean during normal operation—only showing problems that need immediate attention. When a warning or error occurs, developers see it right away.
Custom Log Formatting with Timestamps and Metadata
The formatter string controls what information appears in each log message. The most useful format tokens are:
Token | Meaning | Example
%(asctime)s | Timestamp (human-readable) | 2026-03-29 14:30:12,456
%(name)s | Logger name | myapp.database
%(levelname)s | Severity level | INFO, WARNING, ERROR
%(message)s | The actual log message | Database query completed
%(funcName)s | Name of function that logged | connect_to_db
%(filename)s | Source filename | database.py
%(lineno)d | Line number in source | 42
%(module)s | Module name | database
%(process)d | Process ID | 12345
%(thread)d | Thread ID | 140256789012345
Here are some practical format examples:
# formatting_examples.py
import logging
logger = logging.getLogger("myapp")
logger.setLevel(logging.DEBUG)
# Example 1: Detailed format with function and line number
handler1 = logging.StreamHandler()
formatter1 = logging.Formatter(
    "%(asctime)s [%(levelname)s] %(funcName)s:%(lineno)d - %(message)s"
)
handler1.setFormatter(formatter1)
# Example 2: Compact format (good for production)
handler2_formatter = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
# Example 3: Include module name (useful in multi-file projects)
handler3_formatter = (
    "[%(asctime)s] %(module)s - %(levelname)s - %(message)s"
)
# Example 4: ISO 8601-style timestamp (local time, no UTC offset)
handler4 = logging.StreamHandler()
formatter4 = logging.Formatter(
    "%(asctime)s - %(levelname)s - %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S"
)
handler4.setFormatter(formatter4)
# Only handler1 is attached, so the output below uses the Example 1 format
logger.addHandler(handler1)
def process_payment(user_id):
    logger.info(f"Processing payment for user {user_id}")
    logger.debug("Validating card information")
    logger.info("Payment submitted to processor")
    return True
process_payment(12345)
Output (Example 1 format):
2026-03-29 14:32:45,123 [INFO] process_payment:55 - Processing payment for user 12345
2026-03-29 14:32:45,124 [DEBUG] process_payment:56 - Validating card information
2026-03-29 14:32:45,125 [INFO] process_payment:57 - Payment submitted to processor
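One quick way to check a format string before attaching it to a handler is to format a hand-built record. The sketch below uses logging.makeLogRecord() from the standard library; the field values are illustrative only, borrowed from the token table above, and the format string is just one possible combination of tokens.

```python
import logging

# Build a record by hand; these field values are illustrative only.
record = logging.makeLogRecord({
    "name": "myapp.database",
    "levelname": "INFO",
    "levelno": logging.INFO,
    "msg": "Database query completed",
    "funcName": "connect_to_db",
    "lineno": 42,
})

# Any combination of the tokens above can be tried here.
formatter = logging.Formatter(
    "[%(levelname)s] %(name)s %(funcName)s:%(lineno)d - %(message)s"
)
line = formatter.format(record)
print(line)  # [INFO] myapp.database connect_to_db:42 - Database query completed
```

Because no handler or file is involved, this is a cheap way to experiment with tokens in an interactive session.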
Controlling Log File Size with Log Rotation
If your application runs 24/7 and logs every request, your log files can grow huge fast, eating disk space and slowing down anything that tries to read or grep them. The solution is RotatingFileHandler, which caps file size and automatically rolls old logs into numbered backups:
# File: rotating_logger.py
import logging
from logging.handlers import RotatingFileHandler
logger = logging.getLogger("payments")
logger.setLevel(logging.DEBUG)
# Max 5 MB per file, keep 3 old files (app.log.1, app.log.2, app.log.3)
handler = RotatingFileHandler(
    "app.log",
    maxBytes=5 * 1024 * 1024,
    backupCount=3,
)
handler.setFormatter(logging.Formatter(
    "%(asctime)s [%(levelname)s] %(message)s"
))
logger.addHandler(handler)
# Simulate heavy logging
for i in range(100_000):
    logger.info(f"Processed request {i}")
When app.log hits 5 MB, the handler renames it to app.log.1, shifts older backups up the chain, and starts a fresh app.log. Once backupCount is reached, the oldest file is deleted. You get bounded disk usage with no manual cleanup.
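To watch the rollover happen without generating megabytes of logs, the same handler can be run with a deliberately tiny limit in a temporary directory. This is a sketch for experimentation only: the 200-byte maxBytes value and the rotation_demo logger name are assumptions chosen to make rotation trigger quickly, not recommended settings.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = os.path.join(tmp_dir, "app.log")
    # Tiny limit so rotation triggers after a handful of messages (demo only).
    handler = RotatingFileHandler(log_path, maxBytes=200, backupCount=3)
    handler.setFormatter(logging.Formatter("%(message)s"))

    demo_logger = logging.getLogger("rotation_demo")
    demo_logger.setLevel(logging.INFO)
    demo_logger.propagate = False  # keep demo output out of the root logger
    demo_logger.addHandler(handler)

    for i in range(100):
        demo_logger.info("Processed request %d", i)

    # Close the file before the temporary directory is removed.
    demo_logger.removeHandler(handler)
    handler.close()
    rotated_files = sorted(os.listdir(tmp_dir))

print(rotated_files)  # the live file plus at most backupCount numbered backups
```

However many messages are written, the directory never holds more than the live file and three backups.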
For time-based rotation — one log file per day, week, or hour — use TimedRotatingFileHandler instead:
from logging.handlers import TimedRotatingFileHandler
# Roll over at midnight every day, keep 14 days of history
handler = TimedRotatingFileHandler(
    "app.log",
    when="midnight",
    interval=1,
    backupCount=14,
)
This is ideal for compliance scenarios where you need a clean audit trail per day, or for shipping logs to a daily archive bucket.
Logging Exceptions and Tracebacks
One of the most common logging mistakes is catching an exception and only writing the error message — losing the traceback that tells you where things went wrong. Compare these two patterns:
# Bad: just the message, no traceback
try:
    result = risky_operation()
except Exception as e:
    logger.error(f"Operation failed: {e}")
# Good: full traceback automatically included
try:
    result = risky_operation()
except Exception:
    logger.exception("Operation failed")
logger.exception() is shorthand for logger.error(msg, exc_info=True). It records the message AND the full stack trace, so when you’re debugging at 2 AM you can see exactly which line raised, what the call chain was, and which third-party library was involved. Always use logger.exception() inside except blocks.
You can also force a traceback on lower-severity log calls with exc_info=True:
try:
    cache.get(key)
except CacheTimeout:
    logger.warning("Cache miss with timeout, falling back to DB", exc_info=True)
    return db.query(key)
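To see what logger.exception() actually writes, the sketch below captures a logger's output in memory and prints it. The risky_operation function and the exception_demo logger name are stand-ins invented for this example.

```python
import io
import logging

# Capture log output in memory so it can be inspected.
stream = io.StringIO()
exc_logger = logging.getLogger("exception_demo")
exc_logger.setLevel(logging.DEBUG)
exc_logger.propagate = False
exc_logger.addHandler(logging.StreamHandler(stream))

def risky_operation():
    # Hypothetical failing operation for the demo.
    return 1 / 0

try:
    risky_operation()
except Exception:
    exc_logger.exception("Operation failed")

captured = stream.getvalue()
print(captured)
```

The captured text starts with the message and is followed by the full traceback, ending with the ZeroDivisionError line.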
Logging Across Multiple Modules
In real applications you have dozens of modules, and you want logs to show which one wrote each message. The convention is logger = logging.getLogger(__name__) at the top of every file. __name__ resolves to the dotted module path, so logs from app/services/payments.py appear under the logger name app.services.payments.
The benefit: in main.py (or wherever you configure logging) you can route specific modules to different handlers, set finer-grained levels, or silence noisy third-party libraries:
# File: main.py
import logging
# Root logger — catches everything at INFO+
logging.basicConfig(level=logging.INFO)
# Quiet down a noisy third-party library
logging.getLogger("urllib3").setLevel(logging.WARNING)
# Turn on DEBUG just for our payments module
logging.getLogger("app.services.payments").setLevel(logging.DEBUG)
This pattern scales — instead of editing logging calls in every file, you control verbosity from one place.
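The effect of per-module levels can be checked directly with isEnabledFor() and getEffectiveLevel(). In this sketch, app.services.orders is a hypothetical sibling module with no level of its own, so it inherits from the nearest configured ancestor.

```python
import logging

# Configure the package logger and one noisy-on-purpose module.
logging.getLogger("app").setLevel(logging.INFO)
logging.getLogger("app.services.payments").setLevel(logging.DEBUG)

payments = logging.getLogger("app.services.payments")
orders = logging.getLogger("app.services.orders")  # hypothetical sibling, no level set

payments_debug = payments.isEnabledFor(logging.DEBUG)
orders_debug = orders.isEnabledFor(logging.DEBUG)
orders_level = logging.getLevelName(orders.getEffectiveLevel())

print(payments_debug, orders_debug, orders_level)  # True False INFO
```

Only the payments module emits DEBUG messages; every other logger under app stays at INFO.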
Structured Logging with JSON
Plain-text logs are great for tailing in a terminal, but if you ship logs to a centralized system (Elasticsearch, Datadog, Loki, CloudWatch), JSON-structured logs are dramatically easier to query. Each log line becomes a parsed record with searchable fields instead of regex-matchable strings.
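A minimal sketch of this idea using only the standard library is shown here. The JsonFormatter class, the billing logger name, and the user_id and plan fields are all assumptions made for illustration; third-party packages offer more complete JSON formatters.

```python
import io
import json
import logging

# Attribute names every LogRecord carries; anything else came from extra={}.
_STANDARD_ATTRS = set(vars(logging.makeLogRecord({}))) | {"message", "asctime"}

class JsonFormatter(logging.Formatter):
    """Hypothetical formatter that renders each record as one JSON object."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Promote fields passed via extra={} to top-level JSON keys.
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)

# An in-memory stream keeps the demo self-contained; a real app would
# typically use StreamHandler() or a file handler instead.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("billing")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.info("Subscription renewed", extra={"user_id": 42, "plan": "pro"})

entry = json.loads(stream.getvalue())
print(entry["message"], entry["user_id"], entry["plan"])
```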
Now in your log aggregator you can filter by plan = "pro" directly, no regex required. The extra={} parameter is the secret — anything you pass there becomes a top-level JSON field.
Production Logging Best Practices
A few rules that pay back tenfold once your application is live and you’re not the only one debugging it:
Use lazy string formatting. Write logger.info("Got %s rows", count), not logger.info(f"Got {count} rows"). The lazy form only builds the string if the log level is actually enabled — important when DEBUG logs are off in production.
Don’t log secrets. Audit your messages for tokens, passwords, full credit card numbers, or PII. Centralized log storage is often broader-access than your production database.
Pick one log level per environment. DEBUG locally, INFO in staging, WARNING in production. Don’t mix.
Always include identifiers. Every log line tied to a user action should carry the user ID, request ID, or correlation ID. Logs without identifiers are noise.
Configure once, in one place. Use logging.config.dictConfig() with a config dict (or a YAML file) at app startup. Don’t sprinkle basicConfig() calls throughout the codebase.
Test that logs are being written. A surprising number of production outages are made worse by “we didn’t have any logs” — usually because someone called logging.basicConfig() after another module had already configured the root logger, and the second call silently no-ops.
Common Logging Pitfalls
Three patterns to watch for:
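1. Calling logging.basicConfig() after another module has logged. basicConfig() only adds handlers if the root logger has none, so a late call silently does nothing. The fix: configure logging as the very first thing in main.py, before importing your modules. On Python 3.8 and later, basicConfig(force=True) replaces any existing root handlers instead.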
2. Duplicate log messages. If you accidentally add the same handler twice — or if your code sets up logging on import and again in __main__ — every message prints twice. The fix: check if logger.hasHandlers() before adding handlers, or rely on dictConfig which idempotently rebuilds the config.
3. Logger.propagate surprises. Child loggers propagate to parents by default. If you add a console handler to app AND to app.services, messages from app.services appear twice. Set logger.propagate = False on the child or only add handlers at the root.
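A small guard covers the second and third pitfalls together. The get_logger helper below is a hypothetical convenience function, not part of the logging module; it checks the logger's own handlers list (hasHandlers() would also count handlers on ancestors) and turns off propagation so a parent handler cannot print the message a second time.

```python
import logging

def get_logger(name):
    """Return a logger with exactly one console handler, however often it is called."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # only this logger's own handlers, not ancestors'
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
        logger.propagate = False  # avoid a second copy via parent handlers
    return logger

first = get_logger("app.services")
second = get_logger("app.services")  # e.g. the setup code ran again on import

print(first is second, len(first.handlers))  # True 1
```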
FAQ
Q: What’s the difference between logger.info() and logger.debug()?
A: Severity. INFO is for “normal operational events I want to see in production” — startup, request completion, scheduled job ran. DEBUG is for verbose internal state useful when reproducing a bug locally. In production, DEBUG is usually off so the noise doesn’t drown out the signal.
Q: Should I use print() instead?
A: For one-off scripts, fine. For anything you’ll run more than once, no. print() can’t be filtered by severity, can’t be redirected to multiple destinations, doesn’t carry timestamps or module names, and writes to stdout which mingles with your application’s actual output.
Q: How do I log to a remote system like CloudWatch or Datadog?
A: Two common approaches. (1) Ship logs to a local file in JSON format and run a sidecar agent (CloudWatch Agent, Vector, Fluent Bit) that tails the file and forwards. (2) Use a Python handler that posts directly — watchtower for CloudWatch, datadog-python for Datadog. Option 1 is more resilient because it survives network blips.
Q: Why are my logs not appearing?
A: Most common cause: the root logger’s level is higher than the message level. Try logging.basicConfig(level=logging.DEBUG) at the very top of main.py. Second most common: another import called basicConfig() first and you didn’t notice.
Q: How do I correlate logs across services in a microservices setup?
A: Generate a UUID-based request ID at the API gateway, pass it through every downstream service in a header (X-Request-ID), and include it in every log line via extra={"request_id": ...}. When you’re debugging an issue, you grep the request ID across all services’ logs and see the full timeline.
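Within a single service, one way to attach the request ID to every line is logging.LoggerAdapter, which injects the same extra fields into each call. In this sketch the handle_request function, the orders_api logger name, and the header dictionary are assumptions for illustration; a real service would read the header from its web framework.

```python
import io
import logging
import uuid

stream = io.StringIO()  # in-memory capture for the demo
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(request_id)s %(levelname)s %(message)s")
)

base_logger = logging.getLogger("orders_api")
base_logger.setLevel(logging.INFO)
base_logger.propagate = False
base_logger.addHandler(handler)

def handle_request(headers):
    # Reuse the upstream ID if present, otherwise start a new one.
    request_id = headers.get("X-Request-ID") or str(uuid.uuid4())
    log = logging.LoggerAdapter(base_logger, {"request_id": request_id})
    log.info("Order lookup started")
    log.info("Order lookup finished")
    return request_id

handle_request({"X-Request-ID": "req-123"})
lines = stream.getvalue().splitlines()
print(lines)
```

Every line produced while handling the request starts with the same ID, which is what makes a cross-service grep possible.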
Wrapping Up
Python’s logging module is one of those tools where the 90% solution is straightforward — call logging.basicConfig(), get a logger with logging.getLogger(__name__), write info/error messages — and the remaining 10% (rotation, JSON output, multi-handler routing) becomes important as soon as your application leaves your laptop. Get the basics right early and the advanced patterns are small additions, not refactors.
The official Python logging documentation has the full reference for everything covered here plus the more obscure handlers (SMTP, SysLog, HTTP). For tutorials on related topics, see the related articles section below.
Every developer has experienced the monotony of repetitive tasks: renaming thousands of files, backing up project folders on schedule, generating weekly reports, or scanning for files that need processing. These are the moments when you wish a robot would just handle it while you focus on actual coding. The good news? Python makes this straightforward, and most of what you need is already in the standard library.
Python is well suited to automation. Standard-library modules like os, shutil, pathlib, and smtplib let you work with the file system and send notifications, and small third-party packages such as schedule and watchdog add task scheduling and file watching. You don’t need to learn complex shell scripts or invest in expensive automation software. A few lines of Python can save you hours of manual work.
In this guide, we’ll explore practical automation patterns starting with file operations and building toward a real-world automated backup system. By the end, you’ll have a toolkit for automating any repetitive task in your workflow.
Quick Example: Rename Files in Bulk
Before diving deep, let’s see automation in action. Imagine you have 500 image files named like IMG_0001.jpg, IMG_0002.jpg, and you want to prefix them with today’s date. Without automation, this takes hours. With Python, it takes seconds:
# bulk_rename.py
import os
import datetime
directory = "./photos"
prefix = datetime.date.today().strftime("%Y%m%d_")
for filename in os.listdir(directory):
    if filename.endswith(".jpg"):
        old_path = os.path.join(directory, filename)
        new_filename = prefix + filename
        new_path = os.path.join(directory, new_filename)
        os.rename(old_path, new_path)
        print(f"Renamed: {filename} -> {new_filename}")
That script runs instantly and accomplishes what would take manual clicking for hours. This is the power of automation.
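One caveat: running the script twice on the same day would add the prefix twice. A possible variant, sketched here with pathlib, skips files that already carry the prefix. The add_prefix function name and the sample file names are assumptions for illustration, and the demo works in a temporary directory so nothing real is renamed.

```python
import tempfile
from pathlib import Path

def add_prefix(directory, prefix, pattern="*.jpg"):
    """Prefix matching files once; files already prefixed are left alone."""
    renamed = []
    # sorted() materializes the listing before any file is renamed.
    for path in sorted(Path(directory).glob(pattern)):
        if path.name.startswith(prefix):
            continue
        target = path.with_name(prefix + path.name)
        path.rename(target)
        renamed.append(target.name)
    return renamed

with tempfile.TemporaryDirectory() as photos:
    for name in ("IMG_0001.jpg", "IMG_0002.jpg", "notes.txt"):
        Path(photos, name).touch()
    first_run = add_prefix(photos, "20260329_")
    second_run = add_prefix(photos, "20260329_")  # nothing left to rename

print(first_run)
print(second_run)
```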
Why Automate with Python?
You might be wondering: why Python instead of shell scripts, scheduled tasks, or other tools? The answer is clarity, portability, and power. Here’s how they compare:
Task Aspect | Manual Process | Shell Script | Python Script
Development Time | Hours per occurrence | 30-60 minutes | 15-30 minutes
Readability | N/A | Cryptic syntax | Human-readable code
Cross-Platform | N/A | Linux/Mac only | Windows, Mac, Linux
Debugging | N/A | Difficult | Easy with proper logging
Email Integration | Manual setup | Complex | Built-in libraries
Maintainability | N/A | Hard to modify | Easy to extend and modify
Python wins for most automation tasks because it balances simplicity with power. You can read Python code six months later and understand what it does, and you can add new features without rewriting everything.
Working with Files and Directories
Using os and pathlib Modules
Python provides two ways to work with file paths and directories: the older os module and the modern pathlib module. pathlib is more intuitive and handles cross-platform differences automatically, but os is still widely used. Let’s explore both:
# file_operations.py
import os
from pathlib import Path
# Using os module
print("Using os module:")
current_dir = os.getcwd()
print(f"Current directory: {current_dir}")
# List files
for item in os.listdir("."):
    if os.path.isfile(item):
        print(f"File: {item}")
# Using pathlib (modern approach)
print("\nUsing pathlib:")
current_path = Path(".")
for item in current_path.iterdir():
    if item.is_file():
        print(f"File: {item.name}")
        print(f"Size: {item.stat().st_size} bytes")
        print(f"Extension: {item.suffix}")
Output:
Using os module:
Current directory: /home/user/projects
File: script.py
File: data.csv

Using pathlib:
File: script.py
Size: 1245 bytes
Extension: .py
File: data.csv
Size: 5678 bytes
Extension: .csv
pathlib.Path is generally preferred because it’s more readable and handles path separators automatically (backslash on Windows, forward slash on Unix). However, both work fine depending on your preference and existing codebase.
Renaming and Organizing Files
One of the most common automation tasks is organizing files by type, date, or naming convention. The shutil module and os.rename() make this simple:
# organize_files.py
import os
import shutil
from pathlib import Path
download_dir = "./downloads"
# Create subdirectories if they don't exist
for category in ["Images", "Documents", "Archives", "Other"]:
    # parents=True also creates download_dir itself if it is missing
    Path(download_dir, category).mkdir(parents=True, exist_ok=True)
# Organize files by extension
for filename in os.listdir(download_dir):
    if filename.startswith("."):
        continue
    filepath = os.path.join(download_dir, filename)
    if not os.path.isfile(filepath):
        continue
    # Determine category based on extension
    ext = os.path.splitext(filename)[1].lower()
    if ext in [".jpg", ".png", ".gif", ".webp"]:
        category = "Images"
    elif ext in [".pdf", ".doc", ".docx", ".txt"]:
        category = "Documents"
    elif ext in [".zip", ".rar", ".7z"]:
        category = "Archives"
    else:
        category = "Other"
    # Move file to appropriate directory
    dest_path = os.path.join(download_dir, category, filename)
    shutil.move(filepath, dest_path)
    print(f"Moved {filename} to {category}/")
Output:
Moved vacation.jpg to Images/
Moved resume.pdf to Documents/
Moved backup.zip to Archives/
Moved config.txt to Documents/
This script is the foundation of smart file organization. In a real system, you’d add error handling, logging, and checks to avoid overwriting files. The Path.mkdir(exist_ok=True) pattern creates each directory if it is missing and does not raise an error if it already exists.
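The overwrite check mentioned above could look something like this. The unique_destination helper and its _1, _2 naming scheme are assumptions chosen for the sketch; other schemes (timestamps, hashes) work just as well. The demo runs in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path

def unique_destination(dest):
    """Return dest, or dest with a numeric suffix if that name is already taken."""
    dest = Path(dest)
    candidate = dest
    counter = 1
    while candidate.exists():
        candidate = dest.with_name(f"{dest.stem}_{counter}{dest.suffix}")
        counter += 1
    return candidate

with tempfile.TemporaryDirectory() as tmp:
    downloads = Path(tmp)
    images = downloads / "Images"
    images.mkdir()
    (images / "vacation.jpg").write_text("existing")  # already organized earlier
    (downloads / "vacation.jpg").write_text("new")    # same name arrives again

    target = unique_destination(images / "vacation.jpg")
    shutil.move(str(downloads / "vacation.jpg"), str(target))

    moved_name = target.name
    kept_original = (images / "vacation.jpg").read_text()

print(moved_name, kept_original)  # vacation_1.jpg existing
```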
Watching for File Changes with watchdog
Sometimes you need to react the moment a file appears or changes. The watchdog library monitors file system events in real-time. First, install it:
pip install watchdog
Now create a file watcher that triggers actions when new files appear:
# watch_folder.py
import time
from pathlib import Path
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler
class FileProcessor(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            filename = Path(event.src_path).name
            print(f"New file detected: {filename}")
            print(f"Full path: {event.src_path}")
    def on_modified(self, event):
        if not event.is_directory:
            filename = Path(event.src_path).name
            print(f"File modified: {filename}")
# Watch the current directory
observer = Observer()
observer.schedule(FileProcessor(), path=".", recursive=False)
observer.start()
print("Watching for file changes. Press Ctrl+C to stop.")
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()
Output (after creating/modifying files):
Watching for file changes. Press Ctrl+C to stop.
New file detected: report.pdf
Full path: ./report.pdf
File modified: report.pdf
The watchdog library is perfect for implementing “drop a file to process it” workflows, such as converting documents, generating thumbnails, or triggering CI/CD pipelines.
Scheduling Tasks with the schedule Library
Many automation tasks need to run at specific times or intervals: daily backups, hourly data syncs, or weekly reports. The schedule library makes this elegant:
pip install schedule
Here’s how to create a task scheduler:
# task_scheduler.py
import schedule
import time
from datetime import datetime
def backup_database():
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f"[{timestamp}] Running database backup...")
    # Actual backup logic here
def clean_temp_files():
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f"[{timestamp}] Cleaning temporary files...")
    # Actual cleanup logic here
def generate_report():
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f"[{timestamp}] Generating daily report...")
    # Actual report generation here
# Schedule tasks
schedule.every().day.at("02:00").do(backup_database)
schedule.every().hour.do(clean_temp_files)
schedule.every().monday.at("09:00").do(generate_report)
# Keep scheduler running
print("Scheduler started. Tasks will run according to schedule.")
while True:
    schedule.run_pending()
    time.sleep(60)  # Check every minute
Output (sample execution):
Scheduler started. Tasks will run according to schedule.
[2026-03-29 02:00:12] Running database backup...
[2026-03-29 03:00:05] Cleaning temporary files...
[2026-03-30 09:00:00] Generating daily report...
The schedule library is straightforward but doesn’t persist across system restarts. For production systems, consider using cron (Linux/Mac) or Task Scheduler (Windows) to run your Python script, or use a more robust library like APScheduler.
Sending Email Notifications with smtplib
Automating tasks is great, but you need to know when something fails or completes. Python’s built-in smtplib library sends email notifications:
# send_email.py
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
def send_notification(recipient, subject, body):
    sender_email = "automation@example.com"
    sender_password = "your_app_password_here"  # placeholder only; see the note below
    # Create message
    message = MIMEMultipart()
    message["From"] = sender_email
    message["To"] = recipient
    message["Subject"] = subject
    message.attach(MIMEText(body, "plain"))
    # Send email
    try:
        with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
            server.login(sender_email, sender_password)
            server.send_message(message)
        print(f"Email sent to {recipient}")
    except Exception as e:
        print(f"Error sending email: {e}")
# Usage
send_notification(
    "admin@example.com",
    "Backup Complete",
    "Daily backup completed successfully at 2026-03-29 02:15:30."
)
Output:
Email sent to admin@example.com
Important: Never hardcode passwords in scripts. Use environment variables or a configuration file outside version control. For Gmail, generate an “App Password” in your account settings rather than using your actual password.
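A sketch of that advice follows, assuming the password is stored in an environment variable named EMAIL_PASSWORD (the variable name and both helper functions are illustrative choices). It builds the message with email.message.EmailMessage from the standard library and stops short of sending, so it runs without network access.

```python
import os
from email.message import EmailMessage

def load_smtp_password(var_name="EMAIL_PASSWORD"):
    """Read the SMTP password from the environment instead of the source code."""
    password = os.environ.get(var_name)
    if not password:
        raise RuntimeError(f"Set the {var_name} environment variable before sending mail")
    return password

def build_message(sender, recipient, subject, body):
    """Assemble a plain-text email; sending is left to smtplib as shown above."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(body)
    return message

message = build_message(
    "automation@example.com",
    "admin@example.com",
    "Backup Complete",
    "Daily backup completed successfully.",
)
print(message["Subject"])
```

The object returned by build_message can be passed to server.send_message() unchanged, and load_smtp_password() supplies the value for server.login().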
Working with CSV and Excel Files for Reports
Automated reporting is a huge time-saver. Python handles CSV files natively and can create Excel files with the openpyxl library:
# generate_report.py
import csv
from datetime import datetime
from pathlib import Path
# Sample data (from database or API in real scenario)
sales_data = [
{"date": "2026-03-29", "product": "Widget A", "sales": 150},
{"date": "2026-03-29", "product": "Widget B", "sales": 200},
{"date": "2026-03-29", "product": "Widget C", "sales": 175},
]
# Generate CSV report
report_date = datetime.now().strftime("%Y%m%d")
report_filename = f"sales_report_{report_date}.csv"
with open(report_filename, "w", newline="") as csvfile:
    fieldnames = ["date", "product", "sales"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(sales_data)
print(f"Report generated: {report_filename}")
For more complex reports with formatting, install openpyxl: pip install openpyxl. This lets you create Excel files with colors, formulas, and multiple sheets.
Running System Commands with subprocess
Sometimes you need to call external programs from Python. The subprocess module handles this safely:
# run_commands.py
import subprocess
import sys
# Run a simple command (sys.executable is the interpreter running this script)
result = subprocess.run([sys.executable, "--version"], capture_output=True, text=True)
print(f"Python version: {result.stdout.strip()}")
# Run a command and capture output
result = subprocess.run(["ls", "-la"], capture_output=True, text=True)
print("Directory listing:")
print(result.stdout)
# Check if command succeeded (git status exits non-zero outside a repository)
result = subprocess.run(["git", "status"], capture_output=True)
if result.returncode == 0:
    print("Inside a git repository")
else:
    print("Not a git repository or git error")
Output (Linux/Mac):
Python version: Python 3.10.6
Directory listing:
total 48
drwxr-xr-x 5 user user 4096 Mar 29 10:15 .
drwxr-xr-x 8 user user 4096 Mar 29 09:00 ..
-rw-r--r-- 1 user user 1245 Mar 29 10:12 script.py
Inside a git repository
Use capture_output=True to collect program output and text=True to get strings instead of bytes. Always check the return code to verify success.
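Checking the return code can also be delegated to subprocess.run() itself with check=True, which raises CalledProcessError on a non-zero exit. The run_checked wrapper below is a hypothetical helper written for this sketch; it uses sys.executable so the demo commands work on any platform.

```python
import subprocess
import sys

def run_checked(command):
    """Run a command and return (succeeded, output or error description)."""
    try:
        result = subprocess.run(command, capture_output=True, text=True, check=True)
    except FileNotFoundError:
        return False, f"Command not found: {command[0]}"
    except subprocess.CalledProcessError as exc:
        return False, exc.stderr.strip() or f"Exited with code {exc.returncode}"
    return True, result.stdout.strip()

ok, output = run_checked([sys.executable, "-c", "print('hello')"])
failed_ok, failed_output = run_checked([sys.executable, "-c", "import sys; sys.exit(3)"])

print(ok, output)                # True hello
print(failed_ok, failed_output)  # False Exited with code 3
```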
Real-Life Example: Automated Backup System
Now let’s build a complete backup system that archives a directory into timestamped ZIP files on a schedule, prunes old archives, and emails you the result. This example combines everything we’ve learned:
# backup_system.py
import os
import shutil
import zipfile
import smtplib
import schedule
import time
from pathlib import Path
from datetime import datetime
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
class BackupManager:
    def __init__(self, source_dir, backup_dir, email_to):
        self.source_dir = source_dir
        self.backup_dir = backup_dir
        self.email_to = email_to
        # parents=True also creates missing parent directories
        Path(backup_dir).mkdir(parents=True, exist_ok=True)
    def create_backup(self):
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        backup_filename = f"backup_{timestamp}.zip"
        backup_path = os.path.join(self.backup_dir, backup_filename)
        try:
            with zipfile.ZipFile(backup_path, "w", zipfile.ZIP_DEFLATED) as zipf:
                for root, dirs, files in os.walk(self.source_dir):
                    for file in files:
                        file_path = os.path.join(root, file)
                        arcname = os.path.relpath(file_path, self.source_dir)
                        zipf.write(file_path, arcname)
            file_size = os.path.getsize(backup_path) / (1024 * 1024)
            print(f"Backup created: {backup_filename} ({file_size:.2f} MB)")
            self.send_notification(
                "Backup Success",
                f"Backup created successfully: {backup_filename}\nSize: {file_size:.2f} MB"
            )
            # Cleanup old backups (keep last 7)
            self.cleanup_old_backups()
        except Exception as e:
            print(f"Backup failed: {e}")
            self.send_notification("Backup Failed", f"Error: {str(e)}")
    def cleanup_old_backups(self):
        # Timestamped names sort chronologically, so the oldest come first
        backups = sorted(Path(self.backup_dir).glob("backup_*.zip"))
        if len(backups) > 7:
            for old_backup in backups[:-7]:
                old_backup.unlink()
                print(f"Deleted old backup: {old_backup.name}")
    def send_notification(self, subject, body):
        sender_email = "backup@example.com"
        sender_password = "your_app_password"  # placeholder only; load from the environment in real use
        try:
            message = MIMEMultipart()
            message["From"] = sender_email
            message["To"] = self.email_to
            message["Subject"] = subject
            message.attach(MIMEText(body, "plain"))
            with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
                server.login(sender_email, sender_password)
                server.send_message(message)
        except Exception as e:
            print(f"Could not send email: {e}")
# Setup and run
if __name__ == "__main__":
    manager = BackupManager(
        source_dir="./important_files",
        backup_dir="./backups",
        email_to="admin@example.com"
    )
    # Schedule daily backups at 2 AM
    schedule.every().day.at("02:00").do(manager.create_backup)
    print("Backup system started. Waiting for scheduled time...")
    while True:
        schedule.run_pending()
        time.sleep(60)
Output (sample):
Backup system started. Waiting for scheduled time...
Backup created: backup_20260329_020015.zip (45.32 MB)
Deleted old backup: backup_20260322_020012.zip
This system handles the full lifecycle: creating backups, managing disk space, and notifying you of success or failure. In production, you’d run this as a background service using systemd (Linux), launchd (Mac), or Task Scheduler (Windows).
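A backup is only useful if it can be read back, so one possible extension is a verification step after each archive is written. The sketch below uses zipfile.ZipFile.testzip(), which returns the name of the first corrupt member or None if every checksum matches. The verify_backup function and the demo file names are assumptions for illustration.

```python
import tempfile
import zipfile
from pathlib import Path

def verify_backup(backup_path):
    """Return (archive is intact, number of files inside)."""
    with zipfile.ZipFile(backup_path) as archive:
        bad_member = archive.testzip()  # None means all CRC checks passed
        return bad_member is None, len(archive.namelist())

with tempfile.TemporaryDirectory() as tmp:
    backup_path = Path(tmp, "backup_demo.zip")
    # Build a small archive to stand in for a real backup.
    with zipfile.ZipFile(backup_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("notes.txt", "important data")
        archive.writestr("reports/summary.csv", "date,sales\n2026-03-29,150\n")
    is_valid, member_count = verify_backup(backup_path)

print(is_valid, member_count)  # True 2
```

In the BackupManager class, a call like this would fit between writing the archive and sending the success notification.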
Frequently Asked Questions
How do I run a Python script in the background?
Linux/Mac: Use nohup to ignore hangup signals: nohup python backup_system.py &. Or use screen or tmux for interactive backgrounds. Better: use cron to schedule it properly.
Windows: Use Task Scheduler to run the script with python.exe. Create a task that runs at startup or on a schedule without showing a window.
Should I add error handling to automation scripts?
Absolutely. Always wrap file operations in try-except blocks. Log errors to a file so you can debug later. For critical tasks, send notifications on failure. Here’s a pattern:
try:
    # Your automation code
    do_something()
except Exception as e:
    logger.error(f"Task failed: {e}")
    send_alert_email(f"Error: {e}")
Is it safe to put passwords in automation scripts?
No. Use environment variables, config files outside version control, or credential managers. For email, use app-specific passwords instead of your real password. Never commit secrets to GitHub.
import os
password = os.getenv("EMAIL_PASSWORD") # Load from environment
How do I write automation that works on Windows, Mac, and Linux?
Use pathlib.Path instead of string path concatenation–it handles separators automatically. Use subprocess carefully since some commands differ. Test on all platforms or use Docker for consistency.
What if the user’s system doesn’t have the libraries I need?
Create a requirements.txt file listing dependencies, then users can install them with pip install -r requirements.txt. For standalone scripts, use PyInstaller to bundle Python and libraries into a single executable.
Conclusion
Python automation transforms tedious manual tasks into reliable, repeatable processes. You’ve learned to work with files and directories using os and pathlib, schedule tasks with the schedule library, send email notifications via smtplib, and build complete systems like automated backups. The key is starting simple–automate your most painful task first, then gradually expand your automation toolkit.
For deeper learning, explore the official documentation: os module, pathlib, shutil, and smtplib are all built-in. For external libraries, check schedule and watchdog on PyPI. The automation possibilities are endless once you see Python as your personal robot assistant.
Command-line interfaces (CLIs) are the backbone of modern development workflows. From package managers to deployment tools, every developer relies on well-designed CLI applications. If you’ve ever dreamed of building the next popular dev tool but found the existing CLI frameworks overwhelming, Python has an elegant solution: Typer. Typer combines the power of type hints with an intuitive API that makes building professional CLIs feel like writing regular Python functions.
The beauty of Typer lies in its simplicity wrapped in sophistication. Unlike older frameworks that require boilerplate configuration, Typer leverages Python’s type annotations to automatically generate help text, validate inputs, and handle command parsing. If you already know how to write Python functions, you already know how to write Typer CLI apps. No hand-written argument parsers or configuration files are needed; at most, you add an @app.command() decorator.
This tutorial walks you through everything you need to build production-ready CLI tools. We’ll start with the fundamentals, explore advanced features like interactive prompts and colored output, and then build a complete file organizer application that demonstrates all the concepts in action. By the end, you’ll have a reusable template for any CLI project.
Quick Example: Your CLI in 10 Lines
Before diving into the theory, let’s see Typer in action. Here’s the absolute minimum code needed to create a working CLI tool:
That’s it. No configuration, no argument parsing setup, no manual help text. Typer inferred everything from the function signature. The name parameter automatically became a command argument, and Typer generated professional help documentation instantly. This is the Typer philosophy: sensible defaults with maximum productivity.
What Is Typer and Why Use It
Typer is a modern Python library built on top of Click that simplifies CLI development. It’s created by the same developer who built FastAPI, and it brings FastAPI’s elegance to the command line. Rather than forcing you to learn a new syntax or remember decorator parameters, Typer uses standard Python type hints to express CLI intent.
To understand why Typer matters, let’s compare it with other popular CLI frameworks:
Feature | argparse | Click | Typer
Verbosity | High (10+ lines for simple CLI) | Medium (5-7 lines) | Low (3-5 lines)
Type Hints Support | None | Partial | Full Native Support
Auto-generated Help | Basic | Good | Excellent
Learning Curve | Steep | Moderate | Shallow
Input Validation | Manual | Custom Types | Type Hints
IDE Autocompletion | Poor | Good | Excellent
Typer’s primary advantage is reducing cognitive load. You write Python functions that look like regular functions, and Typer handles the CLI machinery. Combined with modern IDE support, this means better code completion, fewer runtime surprises, and more time spent on your application logic instead of CLI plumbing.
Installing Typer
Getting started with Typer requires just one command. We’ll install the complete version with all optional dependencies to unlock advanced features like colored output:
# Install Typer with all extras
pip install "typer[all]"
The [all] extra installs Rich (for colored output and tables), shellingham (for shell completion), and other utilities. If you want a minimal install with just the essentials, use pip install typer instead. Verify the installation works:
Now that Typer is installed, let’s build a slightly more complex application. Understanding command structure is crucial because every Typer app follows the same pattern: create an app object, decorate functions with @app.command(), and invoke it at the bottom.
# weather_cli.py
import typer
app = typer.Typer(help="Simple weather information tool")
@app.command()
def current(city: str):
    """Get current weather for a city."""
    typer.echo(f"Weather in {city}: Sunny, 72F")
@app.command()
def forecast(city: str, days: int = 7):
    """Get weather forecast for upcoming days."""
    typer.echo(f"Forecast for {city} ({days} days):")
    for i in range(1, days + 1):
        typer.echo(f" Day {i}: Partly cloudy")
if __name__ == "__main__":
    app()
Output:
$ python weather_cli.py --help
Usage: weather_cli.py [OPTIONS] COMMAND [ARGS]...
Simple weather information tool
Options:
--help Show this message and exit.
Commands:
current Get current weather for a city.
forecast Get weather forecast for upcoming days.
$ python weather_cli.py current London
Weather in London: Sunny, 72F
$ python weather_cli.py forecast Paris --days 3
Forecast for Paris (3 days):
Day 1: Partly cloudy
Day 2: Partly cloudy
Day 3: Partly cloudy
Notice how Typer automatically converted the docstrings into help text, made days optional because it has a default value, and even inferred that it should be passed as a --days flag. This is zero-configuration development.
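These inferences can also be checked without a shell by using CliRunner from typer.testing, which invokes the app in-process. The sketch below repeats a trimmed copy of the weather commands so it is self-contained; the specific invocations are just examples.

```python
import typer
from typer.testing import CliRunner

app = typer.Typer(help="Simple weather information tool")

@app.command()
def current(city: str):
    """Get current weather for a city."""
    typer.echo(f"Weather in {city}: Sunny, 72F")

@app.command()
def forecast(city: str, days: int = 7):
    """Get weather forecast for upcoming days."""
    typer.echo(f"Forecast for {city} ({days} days):")

runner = CliRunner()
default_run = runner.invoke(app, ["forecast", "Paris"])                 # uses days=7
custom_run = runner.invoke(app, ["forecast", "Paris", "--days", "3"])   # overrides it
bad_run = runner.invoke(app, ["forecast", "Paris", "--days", "three"])  # not an int

print(default_run.output.strip())
print(custom_run.output.strip())
print(bad_run.exit_code)  # non-zero: the value was rejected before forecast() ran
```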
Adding Arguments and Options
CLI parameters come in two flavors: arguments (positional, required) and options (named, optional with defaults). Understanding this distinction helps design intuitive CLIs. An argument like filename is positional and required. An option like --output is named and typically optional.
Typer infers the type from your function signature, but sometimes you need more control. Use typer.Argument() and typer.Option() to customize behavior:
# file_processor.py
import typer
from pathlib import Path
app = typer.Typer()
@app.command()
def process(
input_file: Path = typer.Argument(..., help="File to process"),
output_file: Path = typer.Option(None, help="Output file path"),
verbose: bool = typer.Option(False, "-v", "--verbose", help="Verbose output"),
count: int = typer.Option(1, "-c", "--count", help="Number of iterations")
):
"""Process a file with various options."""
input_text = input_file.read_text()
typer.echo(f"Read {len(input_text)} characters from {input_file}")
if verbose:
typer.echo(f"Verbose: Processing with count={count}")
if output_file:
output_file.write_text(input_text.upper())
typer.echo(f"Wrote output to {output_file}")
if __name__ == "__main__":
app()
Output:
$ python file_processor.py --help
Usage: file_processor.py [OPTIONS] INPUT_FILE
Process a file with various options.
Options:
--output-file PATH Output file path
-v, --verbose Verbose output
-c, --count INTEGER Number of iterations
--help Show this message and exit.
Arguments:
INPUT_FILE File to process [required]
$ echo "hello world" > input.txt
$ python file_processor.py input.txt --output-file output.txt -v --count 2
Read 11 characters from input.txt
Verbose: Processing with count=2
Wrote output to output.txt
The ... (Ellipsis) in typer.Argument(...) indicates a required argument. The typer.Option() call lets you specify default values, short flags (-v), and long flags (--verbose) simultaneously. When you do not name a flag yourself, Typer derives it from the parameter name and converts underscores to hyphens, so the output_file parameter becomes --output-file.
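The same parameters can also be declared with typing.Annotated, which keeps the Python default value separate from the Typer metadata. The sketch below assumes Python 3.9+ and a Typer release that supports Annotated; the file name and the echo-only body are placeholders rather than the article's file processor.

```python
# file_processor_annotated.py (hypothetical variant of the example above)
from pathlib import Path
from typing import Annotated, Optional

import typer

app = typer.Typer()


@app.command()
def process(
    # No default value, so this positional argument is required.
    input_file: Annotated[Path, typer.Argument(help="File to process")],
    # The real default sits after "=", not inside typer.Option().
    output_file: Annotated[Optional[Path], typer.Option(help="Output file path")] = None,
    verbose: Annotated[bool, typer.Option("--verbose", "-v", help="Verbose output")] = False,
    count: Annotated[int, typer.Option("--count", "-c", help="Number of iterations")] = 1,
):
    """Echo the parsed parameters (a stand-in for real processing)."""
    typer.echo(f"input={input_file} output={output_file} verbose={verbose} count={count}")
```

Because the app has a single command, it is invoked without a command name, for example with the arguments in.txt --output-file out.txt -v -c 2.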
Type Annotations for Validation
One of Typer’s superpowers is automatic validation through type hints. When you declare a parameter as int, Typer ensures the user provides an integer. If they don’t, Typer shows a helpful error message instead of crashing with a cryptic traceback.
# calculator.py
import typer
app = typer.Typer()
@app.command()
def add(a: int, b: int):
"""Add two integers."""
typer.echo(f"{a} + {b} = {a + b}")
@app.command()
def greet(name: str, age: int = 25):
"""Greet someone with their age."""
typer.echo(f"Hello, {name}! You are {age} years old.")
@app.command()
def enable_feature(feature_name: str, enabled: bool = True):
"""Toggle a feature on or off."""
status = "enabled" if enabled else "disabled"
typer.echo(f"Feature '{feature_name}' is {status}")
if __name__ == "__main__":
app()
Output:
$ python calculator.py add 5 3
5 + 3 = 8
$ python calculator.py add five 3
Error: Invalid value for 'A': 'five' is not a valid integer.
$ python calculator.py greet Alice 30
Hello, Alice! You are 30 years old.
$ python calculator.py enable-feature logging --no-enabled
Feature 'logging' is disabled
Notice how passing a string where an integer is expected produces a clear error message. Typer validates at the CLI layer, before your function body runs. Boolean parameters get special treatment: enabled becomes a pair of flags, --enabled and --no-enabled, and omitting both keeps the default. Also note the command name: Typer converts underscores in function names to hyphens, so the enable_feature function is invoked as enable-feature. All of this is validation without writing a single if-statement for type checking.
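Type hints cover the kind of value; numeric bounds can be declared as well. The following is a small sketch, not part of the calculator above, that assumes the min and max parameters of typer.Option() to restrict an integer to a range. The command name and help text are made up for illustration.

```python
# bounded_option.py (hypothetical example)
import typer

app = typer.Typer()


@app.command()
def repeat(
    word: str,
    # Values outside 1..5 are rejected by the CLI layer before repeat() runs.
    times: int = typer.Option(1, min=1, max=5, help="How many times to print the word"),
):
    """Print a word a bounded number of times."""
    for _ in range(times):
        typer.echo(word)
```

Passing --times 9 should end with a usage error and a non-zero exit code instead of printing anything.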
Multiple Commands with app.command()
Professional CLI applications often have subcommands. Git is the classic example: git commit, git push, and git pull are all subcommands. Typer makes this structure effortless. Every function decorated with @app.command() becomes a subcommand automatically.
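A minimal sketch of a database_cli.py that matches the session below. The command bodies, parameter names, and default values are assumptions for illustration; a real tool would call into actual backup and migration code.

```python
# database_cli.py (sketch; bodies and defaults are assumed)
import typer

app = typer.Typer()


@app.command()
def backup(
    database: str,
    output: str = typer.Option("backup.sql", help="Backup file path"),
):
    """Create a database backup."""
    typer.echo(f"Backing up '{database}' to {output}")


@app.command()
def migrate():
    """Run database migrations."""
    typer.echo("Running migrations...")


@app.command()
def restore(backup_file: str):
    """Restore database from backup."""
    typer.echo(f"Restoring from {backup_file}")


@app.command()
def status():
    """Show database status."""
    typer.echo("Database Status: OK")
    typer.echo("Tables: 42")
    typer.echo("Size: 2.3 GB")
```

To run it as a script, add the usual if __name__ == "__main__": app() guard at the bottom, as in the earlier examples.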
$ python database_cli.py --help
Usage: database_cli.py [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
backup Create a database backup.
migrate Run database migrations.
restore Restore database from backup.
status Show database status.
$ python database_cli.py status
Database Status: OK
Tables: 42
Size: 2.3 GB
$ python database_cli.py backup mydb --output backup_2026.sql
Backing up 'mydb' to backup_2026.sql
This structure scales beautifully. As your application grows, you can organize commands in separate modules and import them, keeping the codebase maintainable.
Subcommands scale from simple to enterprise-grade tools without refactoring.
Interactive Prompts and Confirmations
Sometimes you need to ask the user for input during execution, not at the command line. Typer provides interactive prompts for this scenario. Use typer.prompt() for collecting input and typer.confirm() for yes/no questions.
# interactive_app.py
import typer
app = typer.Typer()
@app.command()
def create_user():
"""Interactively create a new user."""
username = typer.prompt("Enter username")
email = typer.prompt("Enter email")
password = typer.prompt("Enter password", hide_input=True)
typer.echo(f"User '{username}' created successfully")
@app.command()
def delete_file(filename: str):
"""Delete a file with confirmation."""
if typer.confirm(f"Delete '{filename}'?"):
typer.echo(f"Deleted {filename}")
else:
typer.echo("Cancelled")
@app.command()
def setup():
"""Run interactive setup wizard."""
project_name = typer.prompt("Project name")
author = typer.prompt("Author name")
use_git = typer.confirm("Initialize Git repository?")
typer.echo(f"Setting up project '{project_name}'...")
if use_git:
typer.echo("Initialized Git repository")
typer.echo(f"Project ready! ({author})")
if __name__ == "__main__":
app()
Output:
$ python interactive_app.py create-user
Enter username: alice
Enter email: alice@example.com
Enter password:
User 'alice' created successfully
$ python interactive_app.py delete-file data.csv
Delete 'data.csv'? [y/N]: y
Deleted data.csv
$ python interactive_app.py setup
Project name: MyApp
Author name: Bob Smith
Initialize Git repository? [y/N]: y
Setting up project 'MyApp'...
Initialized Git repository
Project ready! (Bob Smith)
The hide_input=True parameter masks password input, preventing shoulder surfers from seeing sensitive data. typer.confirm() accepts yes/no responses flexibly, handling “y”, “yes”, “n”, “no” and returning a boolean. This creates seamless user experiences without managing stdin directly.
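Prompts can also be attached to options, so that a value is requested interactively only when it was not supplied on the command line. A minimal sketch, assuming the prompt and hide_input parameters of typer.Option(); the login command is hypothetical.

```python
# prompt_options.py (hypothetical example)
import typer

app = typer.Typer()


@app.command()
def login(
    # Asked interactively only if --username is missing.
    username: str = typer.Option(..., prompt=True),
    # hide_input keeps the typed password off the screen.
    password: str = typer.Option(..., prompt=True, hide_input=True),
):
    """Log in, prompting only for values missing from the command line."""
    typer.echo(f"Logged in as {username} (password length: {len(password)})")
```

This keeps the command scriptable (pass both options and no prompt appears) while remaining friendly for interactive use.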
Rich Output with Colors
Boring terminals are outdated. The Rich library (included with Typer) enables beautiful colored output, tables, and formatted text. This transforms CLIs from utilitarian to delightful.
# styled_output.py
import typer
from rich.console import Console
from rich.table import Table
from rich import print as rprint
app = typer.Typer()
console = Console()
@app.command()
def colors():
"""Display colored text."""
rprint("[bold red]Error:[/bold red] Something went wrong!")
rprint("[green]Success:[/green] Operation completed")
rprint("[cyan]Info:[/cyan] Current status is normal")
@app.command()
def show_status():
"""Display status with a formatted table."""
table = Table(title="System Status")
table.add_column("Component", style="cyan")
table.add_column("Status", style="magenta")
table.add_column("Load", style="green")
table.add_row("CPU", "OK", "45%")
table.add_row("Memory", "OK", "62%")
table.add_row("Disk", "Warning", "88%")
console.print(table)
@app.command()
def progress_demo():
"""Show progress with real-time updates."""
with console.status("[bold green]Processing...") as status:
import time
for i in range(5):
time.sleep(0.2)
status.update(f"[bold green]Processing step {i+1}/5...")
console.print("[bold green]Done!")
if __name__ == "__main__":
app()
Output:
$ python styled_output.py colors
Error: Something went wrong!
Success: Operation completed
Info: Current status is normal
$ python styled_output.py show-status
System Status
Component Status Load
CPU OK 45%
Memory OK 62%
Disk Warning 88%
$ python styled_output.py progress-demo
Processing step 5/5...
Done!
Rich markup is simple: [bold red]text[/bold red] applies bold red styling. You can create tables, progress bars, panels, and more. This visual feedback keeps users informed and engaged, especially important for long-running operations.
Error Handling in CLI Apps
Proper error handling separates production-ready CLIs from toy scripts. Typer provides typer.Exit() to terminate with a specific exit code, and the rich.console.Console class has methods for displaying errors elegantly.
# error_handling.py
import typer
from pathlib import Path
from rich.console import Console
app = typer.Typer()
console = Console()
@app.command()
def read_file(filename: str):
    """Read and display a file."""
    file_path = Path(filename)
    # Check existence outside the try block: typer.Exit is itself an
    # exception, so a broad "except Exception" would catch it and
    # replace exit code 1 with exit code 2.
    if not file_path.exists():
        console.print(f"[red]Error:[/red] File '{filename}' not found", style="bold")
        raise typer.Exit(code=1)
    try:
        content = file_path.read_text()
    except (OSError, UnicodeDecodeError) as e:
        console.print(f"[red]Error:[/red] {e}", style="bold")
        raise typer.Exit(code=2)
    console.print(f"[green]Success:[/green] Read {len(content)} characters")
    print(content)
@app.command()
def process(input_file: str, output_file: str = "output.txt"):
"""Process a file with error checking."""
input_path = Path(input_file)
output_path = Path(output_file)
if not input_path.exists():
console.print(f"[red]Input Error:[/red] {input_file} not found", style="bold red")
raise typer.Exit(code=1)
if output_path.exists():
if not typer.confirm(f"Overwrite {output_file}?"):
console.print("[yellow]Cancelled[/yellow]")
raise typer.Exit(code=0)
try:
data = input_path.read_text()
output_path.write_text(data.upper())
console.print(f"[green]Success:[/green] Processed {input_file} -> {output_file}")
except IOError as e:
console.print(f"[red]IO Error:[/red] {str(e)}", style="bold red")
raise typer.Exit(code=3)
if __name__ == "__main__":
app()
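Exit codes matter for automation and scripting. Return 0 for success and a non-zero code for failure so that shell scripts, CI jobs, and other tools can detect the outcome and stop early. Using a distinct non-zero code for each kind of failure, as above, makes the cause easier to identify. Always validate user input early and fail with a clear message.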
Rich transforms cryptic errors into developer-friendly feedback.
Real-Life Example: Smart File Organizer
Now let’s build a complete, production-ready CLI tool: a smart file organizer that sorts files in a directory by their extension. This example combines everything we’ve learned: multiple commands, type validation, interactive prompts, error handling, and rich output.
# file_organizer.py
import typer
from pathlib import Path
from rich.console import Console
from rich.table import Table
from collections import defaultdict
import shutil
app = typer.Typer(help="Smart file organizer with validation and safety checks")
console = Console()
@app.command()
def organize(
    directory: Path = typer.Argument(".", help="Directory to organize"),
    dry_run: bool = typer.Option(True, help="Preview changes without executing"),
    create_folders: bool = typer.Option(True, help="Create extension folders"),
):
    """Organize files in a directory by extension."""
    if not directory.exists():
        console.print(f"[red]Error:[/red] Directory '{directory}' not found", style="bold")
        raise typer.Exit(code=1)
    if not directory.is_dir():
        console.print(f"[red]Error:[/red] '{directory}' is not a directory", style="bold")
        raise typer.Exit(code=1)
    # Group files by extension
    files_by_ext = defaultdict(list)
    for file in directory.iterdir():
        if file.is_file():
            ext = file.suffix or "[no-extension]"
            files_by_ext[ext].append(file)
    if not files_by_ext:
        console.print("[yellow]No files found to organize[/yellow]")
        raise typer.Exit(code=0)
    # Display summary
    table = Table(title="Files to Organize")
    table.add_column("Extension", style="cyan")
    table.add_column("Count", style="green")
    for ext, files in sorted(files_by_ext.items()):
        table.add_row(ext, str(len(files)))
    console.print(table)
    if dry_run:
        console.print("[yellow]Dry-run mode: No changes will be made[/yellow]")
        return
    if not typer.confirm("Execute organization?"):
        console.print("[yellow]Cancelled[/yellow]")
        raise typer.Exit(code=0)
    # Move files
    moved_count = 0
    for ext, files in files_by_ext.items():
        # Without a target folder there is nowhere to move the files, so
        # skip extension-less files and everything under --no-create-folders.
        # (Previously "folder" could be undefined or left over from an
        # earlier extension at this point.)
        if not create_folders or ext == "[no-extension]":
            continue
        folder = directory / ext.lstrip(".")
        folder.mkdir(exist_ok=True)
        for file in files:
            try:
                shutil.move(str(file), str(folder / file.name))
                moved_count += 1
            except Exception as e:
                console.print(f"[red]Failed to move {file.name}:[/red] {e}")
    console.print(f"[green]Success:[/green] Moved {moved_count} files")
@app.command()
def analyze(directory: Path = typer.Argument(".", help="Directory to analyze")):
"""Analyze file distribution in a directory."""
if not directory.exists() or not directory.is_dir():
console.print(f"[red]Error:[/red] Invalid directory '{directory}'", style="bold")
raise typer.Exit(code=1)
total_size = 0
files_by_ext = defaultdict(int)
for file in directory.rglob("*"):
if file.is_file():
ext = file.suffix or "[no-extension]"
files_by_ext[ext] += 1
total_size += file.stat().st_size
table = Table(title=f"Analysis of {directory}")
table.add_column("Extension", style="cyan")
table.add_column("Files", style="green")
for ext in sorted(files_by_ext.keys()):
table.add_row(ext, str(files_by_ext[ext]))
console.print(table)
console.print(f"Total files: {sum(files_by_ext.values())}")
console.print(f"Total size: {total_size / (1024*1024):.2f} MB")
if __name__ == "__main__":
app()
Output:
$ python file_organizer.py organize ./test_dir
Files to Organize
Extension Count
.txt 5
.pdf 3
.jpg 7
.py 2
Dry-run mode: No changes will be made
$ python file_organizer.py organize ./test_dir --no-dry-run
Files to Organize
Extension Count
.txt 5
.pdf 3
.jpg 7
.py 2
Execute organization? [y/N]: y
Success: Moved 17 files
$ python file_organizer.py analyze ./test_dir
Analysis of ./test_dir
Extension Files
.jpg 7
.pdf 3
.py 2
.txt 5
Total files: 17
Total size: 45.32 MB
This file organizer demonstrates production best practices: it validates input, provides dry-run mode for safety, uses tables for clarity, handles errors gracefully, and offers multiple commands for different use cases. You could package this as a standalone tool and distribute it via pip.
Frequently Asked Questions
How do I package a Typer app as a standalone tool?
Use setuptools or Poetry to create a package with an entry point. In your pyproject.toml, add:
[project.scripts]
my-cli = "my_module:app"
Then install with pip install -e . (the trailing dot means the current directory). Your app becomes available as a command on your PATH: my-cli --help.
Can Typer generate shell completion scripts?
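Yes! Typer can install completion for Bash, Zsh, Fish, and PowerShell, and it uses the shellingham library to detect which shell you are running. Every Typer app gets --install-completion and --show-completion options by default. Completion is tied to the command name, so it works best once the app is installed as a command (for example my-cli --install-completion) rather than run as python my_cli.py.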
How should I test Typer applications?
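Use CliRunner, which Typer re-exports from Click (the library Typer is built on) as typer.testing.CliRunner. For example, to test the add command from calculator.py:
from typer.testing import CliRunner
from calculator import app
runner = CliRunner()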
result = runner.invoke(app, ["add", "5", "3"])
assert result.exit_code == 0
assert "8" in result.output
What about complex types like lists or JSON objects?
Use Python’s built-in types directly. Typer handles List[str], List[int], and other generic types intelligently. For JSON, accept a string and parse it with json.loads() in your function.
How do I use environment variables in a Typer app?
Use typer.Option() with the envvar parameter: api_key: str = typer.Option(..., envvar="API_KEY"). Typer will check the environment variable if the CLI argument isn’t provided.
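A short sketch of that pattern. The whoami command and the API_KEY variable name are illustrative; only the envvar parameter of typer.Option() is taken from the answer above.

```python
# env_option.py (hypothetical example)
import typer

app = typer.Typer()


@app.command()
def whoami(
    # Resolved from --api-key first, then from the API_KEY environment variable.
    api_key: str = typer.Option(..., envvar="API_KEY", help="API key"),
):
    """Show which key is in use without printing it in full."""
    typer.echo(f"Using key ending in ...{api_key[-4:]}")
```

An explicit --api-key on the command line takes precedence over the environment variable, which is convenient for one-off overrides.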
Can I create command groups or nested subcommands?
Yes! Create separate Typer instances and add them as command groups:
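A minimal sketch of that structure using app.add_typer(). The users group and its create and delete commands are made-up names for illustration.

```python
# nested_cli.py (hypothetical example)
import typer

app = typer.Typer(help="Top-level application")
users_app = typer.Typer(help="Manage users")

# Everything registered on users_app becomes reachable as "users <command>".
app.add_typer(users_app, name="users")


@users_app.command("create")
def create_user(name: str):
    """Create a user."""
    typer.echo(f"Created user {name}")


@users_app.command("delete")
def delete_user(name: str):
    """Delete a user."""
    typer.echo(f"Deleted user {name}")
```

With the usual app() guard added, this would be invoked as python nested_cli.py users create alice. Each group can live in its own module and be attached to the main app in one place.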
Typer brings modern Python practices to CLI development. By leveraging type hints and sensible defaults, it eliminates boilerplate while maintaining power and flexibility. Whether you’re building internal tools, developer utilities, or the next popular open-source CLI, Typer gives you a solid foundation.
The journey from simple functions to professional CLI applications is smooth with Typer. Start with a basic command, add features incrementally, and scale to complex multi-command applications without refactoring. For deeper learning, explore the official Typer documentation and examine real-world projects using Typer on GitHub.
Your CLI adventure awaits. Happy building!
If you have spent time debugging Python code, you have probably encountered the silent killers of software reliability: magic strings and magic numbers. A developer writes status = "active" in one function but checks if status == "Active" (capital A) in another. A bug is born. Days later, when the mismatch surfaces in production, your fingers itch to throttle the typo. Python enums exist precisely to prevent this nightmare. They transform those fragile string values into type-safe, self-documenting objects that your IDE can help you complete and your type checker can validate before a single line runs.
The good news: enums are built into Python’s standard library. No external packages needed, no complex installation steps. They integrate seamlessly with the language and work beautifully with type hints, making your code cleaner and more maintainable. Enums are also more than just a neat organizational trick — they are a best practice embraced by the Python community and used in production code across the industry, from web frameworks to data science pipelines.
In this tutorial, we will explore what enums are, why they matter, and how to use them effectively. We will start with the basics, move through automatic value generation, then advance to string enums, flags, and real-world patterns like state machines. By the end, you will understand when to reach for enums and how to wield them to write code that is both safer and more expressive.
Your code before and after enums. Spoiler: the enum side never has a typo.
Defining Your First Enum
Python’s enum module lives in the standard library — no install needed. The simplest enum is a subclass of Enum with class-level attributes for each member:
from enum import Enum
class Status(Enum):
ACTIVE = "active"
INACTIVE = "inactive"
SUSPENDED = "suspended"
# Use members anywhere a status is expected
user_status = Status.ACTIVE
print(user_status) # Status.ACTIVE
print(user_status.name) # 'ACTIVE'
print(user_status.value) # 'active'
# Iteration walks the members in definition order
for status in Status:
print(status.name, status.value)
Three things to notice. First, members are accessed by attribute (Status.ACTIVE), not by string lookup. A typo like Status.ACTIV raises AttributeError as soon as that line executes, and a static type checker flags it before the code runs at all, whereas a misspelled string simply compares unequal without any warning. Second, each member has both a name (the attribute identifier) and a value (whatever you assigned). Third, the enum class is iterable: you can loop over its members for dropdowns, validation, or any other “list all options” use case.
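The difference between a silent string mismatch and a loud enum failure can be seen in a few lines. This is a small illustration that reuses the Status enum from above; the variable names are arbitrary.

```python
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"


# A misspelled string just compares unequal; nothing warns you.
silent_mismatch = ("active" == "Active")

# A misspelled member name fails as soon as the line runs.
try:
    Status.ACTIV
except AttributeError as exc:
    loud_failure = str(exc)

print(silent_mismatch)
print(loud_failure)
```

The exact wording of the AttributeError message varies between Python versions, but it always names the missing attribute.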
Auto-Generated Values with auto()
When you don’t care what the underlying value is — only that members are distinct — use auto() to let Python assign them:
from enum import Enum, auto
class HttpMethod(Enum):
GET = auto()
POST = auto()
PUT = auto()
DELETE = auto()
print(HttpMethod.GET.value) # 1
print(HttpMethod.POST.value) # 2
By default auto() generates integers starting at 1. To customize the sequence, override _generate_next_value_:
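A minimal sketch of such an override, following the pattern in the enum documentation: a member-less base class defines _generate_next_value_, and concrete enums inherit it. The AutoName class name and the lowercase rule are choices made for this example.

```python
from enum import Enum, auto


class AutoName(Enum):
    """Base class with no members: auto() yields the lowercased member name."""

    @staticmethod
    def _generate_next_value_(name, start, count, last_values):
        # name is the member name being defined, e.g. "GET".
        return name.lower()


class HttpMethod(AutoName):
    GET = auto()
    POST = auto()
    PUT = auto()
    DELETE = auto()


print(HttpMethod.GET.value)
print([m.value for m in HttpMethod])
```

The override must be defined before any members, which is why it lives in a separate base class here.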
Enums That Behave Like Strings or Integers
Sometimes you want the enum to also behave like its underlying type, so that Color.RED == "red" returns True and the value serializes cleanly to JSON. Use StrEnum (Python 3.11+) for strings or IntEnum for integers:
from enum import StrEnum, IntEnum
class Priority(IntEnum):
LOW = 1
MEDIUM = 5
HIGH = 10
class Color(StrEnum):
RED = "red"
GREEN = "green"
BLUE = "blue"
# IntEnum behaves like int
print(Priority.HIGH > Priority.LOW) # True
print(Priority.HIGH + 5) # 15
# StrEnum behaves like str
print(Color.RED == "red") # True
print(Color.GREEN.upper()) # 'GREEN'
import json
print(json.dumps({"color": Color.RED})) # '{"color": "red"}' — clean serialization
This is the practical choice for API payloads, database column values, and anything that crosses a serialization boundary. The enum gives you type-safety inside your code; the underlying type gives you painless JSON / SQL / config-file output.
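On Python versions before 3.11, where StrEnum is not available, a commonly used substitute is to mix str into a plain Enum. The sketch below shows that assumed fallback; it is not identical to StrEnum in every respect.

```python
import json
from enum import Enum


class Color(str, Enum):
    """str mixin: members are real str instances as well as enum members."""

    RED = "red"
    GREEN = "green"
    BLUE = "blue"


equals_plain_string = (Color.RED == "red")
payload = json.dumps({"color": Color.RED})

print(equals_plain_string)
print(payload)
print(Color.RED.value)
```

Equality and JSON output match StrEnum, but str() and f-string formatting of a mixin member can differ between Python versions, so use .value when you need the raw text.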
Bit Flags with Flag and IntFlag
For “any combination of these options” — file permissions, feature toggles, network capabilities — use Flag (or IntFlag for compatibility with bitwise int operations). Members can be ORed together to form combinations:
from enum import Flag, auto
class Permission(Flag):
READ = auto()
WRITE = auto()
EXECUTE = auto()
ADMIN = READ | WRITE | EXECUTE # combination shortcut
# Combine at use site
user_perms = Permission.READ | Permission.WRITE
print(Permission.READ in user_perms) # True
print(Permission.EXECUTE in user_perms) # False
print(bool(user_perms & Permission.READ)) # True
Bit flags are far more readable than raw integers (perms = 0b110) and more compact than dictionaries of bools ({"read": True, "write": True}). Anywhere you’d reach for a set of strings as a “kind of bitfield”, reach for Flag instead.
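Granting and revoking individual flags uses the ordinary bitwise operators. A small sketch with the same Permission members as above (without the ADMIN shortcut):

```python
from enum import Flag, auto


class Permission(Flag):
    READ = auto()     # 1
    WRITE = auto()    # 2
    EXECUTE = auto()  # 4


perms = Permission.READ | Permission.WRITE

# Grant a permission with |=
perms |= Permission.EXECUTE
has_everything = perms == (Permission.READ | Permission.WRITE | Permission.EXECUTE)

# Revoke a permission by masking with the inverse
perms &= ~Permission.WRITE

print(has_everything)
print(Permission.WRITE in perms)
print(perms.value)
```

After the revoke step only READ and EXECUTE remain, so the combined value is 1 + 4.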
A Real Example: State Machine
Enums shine when modeling state transitions. Here’s an order-status state machine that rejects illegal transitions at runtime with a ValueError, while the type hints let a checker reject non-enum arguments:
from enum import Enum
class OrderStatus(Enum):
DRAFT = "draft"
PENDING = "pending"
PAID = "paid"
SHIPPED = "shipped"
DELIVERED = "delivered"
CANCELLED = "cancelled"
ALLOWED_TRANSITIONS = {
OrderStatus.DRAFT: {OrderStatus.PENDING, OrderStatus.CANCELLED},
OrderStatus.PENDING: {OrderStatus.PAID, OrderStatus.CANCELLED},
OrderStatus.PAID: {OrderStatus.SHIPPED, OrderStatus.CANCELLED},
OrderStatus.SHIPPED: {OrderStatus.DELIVERED},
OrderStatus.DELIVERED: set(),
OrderStatus.CANCELLED: set(),
}
def transition(current: OrderStatus, target: OrderStatus) -> OrderStatus:
if target not in ALLOWED_TRANSITIONS[current]:
raise ValueError(f"Illegal transition: {current.value} -> {target.value}")
return target
# Usage
state = OrderStatus.DRAFT
state = transition(state, OrderStatus.PENDING) # OK
state = transition(state, OrderStatus.PAID) # OK
# state = transition(state, OrderStatus.DRAFT) # raises ValueError
The enum + transition map combo gives you a self-documenting workflow that is hard to misuse: every legal move is listed in one place, and anything else raises ValueError. Add type hints (state: OrderStatus) and mypy will flag any caller that tries to pass a raw string.
Common Pitfalls
Enum members aren’t their values. Status.ACTIVE is not a string, even though it might display like one. Use .value when you need the underlying primitive, or pick StrEnum / IntEnum to get implicit type compatibility.
Duplicate values become aliases. Two members with the same value collapse into a single member by default. class X(Enum): A = 1; B = 1 makes X.B an alias for X.A. To reject duplicates outright, decorate the class with @unique from the enum module. If you want distinct members with the same value, you need a workaround (custom __init__ or a unique discriminator).
JSON serialization. json.dumps(Status.ACTIVE) raises TypeError for plain Enum but works for StrEnum / IntEnum. Either subclass StrEnum or write a custom default= handler.
Comparison with strings. Status.ACTIVE == "active" is False for a plain Enum (different types) but True for StrEnum. Pick the right base class up front to avoid surprises.
Don’t subclass an enum that already has members. Python forbids it: once you define members, the class is closed for extension. Refactor to a common base or use composition.
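Two of these pitfalls are easy to see in a short script: aliasing of duplicate values (and how @unique turns it into an error), and JSON serialization of a plain Enum through a default= handler. The Shape, Strict, and encode_enum names are made up for this sketch.

```python
import json
from enum import Enum, unique


class Shape(Enum):
    SQUARE = 1
    BOX = 1  # same value, so BOX is an alias of SQUARE


alias_is_same_member = Shape.BOX is Shape.SQUARE
member_names = [m.name for m in Shape]  # aliases are skipped when iterating

# @unique rejects duplicate values when the class is created.
try:
    @unique
    class Strict(Enum):
        A = 1
        B = 1
except ValueError as exc:
    duplicate_error = str(exc)


class Status(Enum):
    ACTIVE = "active"


def encode_enum(obj):
    """json.dumps fallback: serialize enum members by their value."""
    if isinstance(obj, Enum):
        return obj.value
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


payload = json.dumps({"status": Status.ACTIVE}, default=encode_enum)

print(alias_is_same_member, member_names)
print(duplicate_error)
print(payload)
```

The default= hook is only consulted for objects json cannot serialize on its own, so it leaves StrEnum and IntEnum members untouched.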
FAQ
Q: Should I use a plain Enum, IntEnum, or StrEnum?
A: StrEnum for API/JSON values where the string matters externally. IntEnum for values that need integer ordering or arithmetic, or for legacy integer-coded statuses (use IntFlag when you need bitwise combinations). Plain Enum for internal-only values where the underlying type shouldn’t leak into comparisons.
Q: How do I get an enum member from its value?
A: Call the enum like a function: Status("active") returns Status.ACTIVE. Wrap in try/except ValueError for unknown values, or use Status._value2member_map_.get(...) for a no-exception lookup.
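A small sketch of both lookup directions, by value and by name, with a helper that returns None for unknown input. The parse_status function is a hypothetical name, not part of the enum module.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"


def parse_status(raw: str) -> Optional[Status]:
    """Return the member whose value is raw, or None if there is none."""
    try:
        return Status(raw)
    except ValueError:
        return None


by_value = Status("active")   # lookup by value
by_name = Status["ACTIVE"]    # lookup by member name
unknown = parse_status("archived")

print(by_value, by_name, unknown)
```

Lookup by name raises KeyError rather than ValueError when the name does not exist.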
Q: Can enum members have extra attributes or methods?
A: Yes — enums are real classes. Override __init__ to attach data, and add regular methods on the class body. A common pattern is class Country(Enum): US = ("United States", "USD"); GB = ("United Kingdom", "GBP") with each member carrying a tuple of attributes.
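The Country pattern from the answer above, written out as a runnable sketch. The attribute names full_name and currency and the label() method are choices made for this example; name and value are reserved by Enum, so the extra attributes need different names.

```python
from enum import Enum


class Country(Enum):
    US = ("United States", "USD")
    GB = ("United Kingdom", "GBP")

    def __init__(self, full_name: str, currency: str):
        # The tuple assigned to each member is unpacked into these arguments.
        self.full_name = full_name
        self.currency = currency

    def label(self) -> str:
        """Human-readable description of the member."""
        return f"{self.full_name} ({self.currency})"


print(Country.US.full_name)
print(Country.GB.label())
print(Country.US.value)
```

The member's .value is still the whole tuple, so lookups by value need the full tuple as well.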
Q: Are enums faster than string constants?
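A: Not in any way that matters. Member access and comparison are cheap, and any difference from plain strings is too small to drive a design decision. The real win is correctness, not speed. Avoid benchmarking enums against strings; you’ll measure noise.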
Q: How do mypy / Pyright / Pylance handle enums?
A: They love them. Type-narrowing with if status is Status.ACTIVE: works correctly, and unknown member access is caught at lint time. Enums are one of Python’s most type-checker-friendly features.
Wrapping Up
Reach for an enum any time you have a fixed, named set of values: order statuses, user roles, HTTP methods, log levels, permissions. Plain Enum for internal identifiers, StrEnum for JSON-friendly values, IntEnum for integer-compatible values, Flag for combinations. The cost is one extra class definition; the payoff is that a typo never silently breaks production.
SQLAlchemy is the gold standard for Object-Relational Mapping in Python. Version 2.0 represents a major evolution, introducing a more intuitive API that emphasizes explicit, modern patterns while maintaining backward compatibility. Whether you’re building a small Flask application or a complex data management system, SQLAlchemy 2.0 provides the tools to interact with databases using Python objects instead of raw SQL strings.
The ORM (Object-Relational Mapping) layer in SQLAlchemy 2.0 allows you to define database tables as Python classes, called models. Once you define a model, you can perform all database operations–creating records, querying data, updating rows, and deleting entries–using Pythonic syntax. The new select() construct and DeclarativeBase provide clearer, more expressive patterns than earlier versions.
In this tutorial, we’ll explore the key features of SQLAlchemy 2.0 ORM: how to define models, manage database sessions, perform CRUD operations, query data with the new select() API, establish relationships between tables, handle transactions, and build a real-world example. By the end, you’ll understand how to leverage SQLAlchemy 2.0 to create robust, maintainable database-driven applications.
Quick Example: 20 Lines of SQLAlchemy 2.0
Let’s start with a complete, working example to see SQLAlchemy 2.0 in action:
# quick_example.py
from sqlalchemy import create_engine, String
from sqlalchemy.orm import DeclarativeBase, Session, Mapped, mapped_column
class Base(DeclarativeBase):
pass
class User(Base):
__tablename__ = 'users'
id: Mapped[int] = mapped_column(primary_key=True)
name: Mapped[str] = mapped_column(String(50))
engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
with Session(engine) as session:
session.add(User(name='Alice'))
session.add(User(name='Bob'))
session.commit()
with Session(engine) as session:
from sqlalchemy import select
users = session.scalars(select(User)).all()
for user in users:
print(f'{user.id}: {user.name}')
Output:
1: Alice
2: Bob
This example demonstrates the core workflow: define a model inheriting from DeclarativeBase, create an engine, manage a session, insert records, and query them using select(). Notice the type hints (Mapped[str]) and the modern syntax–this is SQLAlchemy 2.0 style.
What Is SQLAlchemy ORM?
SQLAlchemy provides multiple ways to interact with databases. The ORM layer sits at the highest abstraction level, letting you work with Python objects. Here’s how it compares to alternatives:
| Approach | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| Raw SQL | Write SQL strings directly in Python | Maximum control, direct database access | Error-prone, requires manual parameter binding, not Pythonic |
| SQLAlchemy Core | Use SQL expression language to build queries programmatically | Type-safe, database-agnostic, composable | Still working with table/column constructs, not Python objects |
| SQLAlchemy ORM | Map database tables to Python classes, query objects directly | Pythonic, intuitive, supports relationships and complex queries, automatic change tracking | Slightly more overhead, must understand session lifecycle |
SQLAlchemy 2.0’s ORM is the most productive choice for most applications because it combines clarity with power. You define your data model once, and the ORM handles the translation to SQL behind the scenes.
For this tutorial, we’ll use SQLite in-memory databases (specified as sqlite:///:memory:), which requires no external setup. For production use with PostgreSQL, MySQL, or other databases, install the appropriate driver (e.g., pip install psycopg2-binary for PostgreSQL).
Defining Models with DeclarativeBase
In SQLAlchemy 2.0, you define models by creating a class that inherits from DeclarativeBase. This base class automatically handles the mapping between your Python class and the database table.
Creating the DeclarativeBase
# models_setup.py
from sqlalchemy.orm import DeclarativeBase
class Base(DeclarativeBase):
pass
Output:
(No output - this defines the base class)
The Base class is the foundation for all your models. It tracks metadata (table definitions) and provides utilities for creating tables.
Defining a Model with Columns
Use Mapped and mapped_column() to define model attributes in SQLAlchemy 2.0:
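Other databases use connection URLs of the form dialect+driver://user:password@host/dbname, for example mysql+pymysql://user:pass@localhost/dbname for MySQL.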
Sessions and Basic CRUD Operations
A Session tracks changes to your objects and coordinates with the database. It is usually opened as a context manager so that it is closed automatically. CRUD stands for Create, Read, Update, Delete: the fundamental database operations.
Create (Insert) Records
# create_records.py
from sqlalchemy import create_engine, String
from sqlalchemy.orm import DeclarativeBase, Session, Mapped, mapped_column
class Base(DeclarativeBase):
pass
class Book(Base):
__tablename__ = 'books'
id: Mapped[int] = mapped_column(primary_key=True)
title: Mapped[str] = mapped_column(String(100))
author: Mapped[str] = mapped_column(String(100))
engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
# Create and insert records
with Session(engine) as session:
    book1 = Book(title='Python Basics', author='Alice Johnson')
    book2 = Book(title='Web Dev with Django', author='Bob Smith')
    session.add(book1)
    session.add(book2)
    session.commit()
    # Read the IDs while the session is still open; once it closes,
    # the expired objects can no longer reload their attributes.
    print(f'Created book with ID: {book1.id}')
    print(f'Created book with ID: {book2.id}')
Output:
Created book with ID: 1
Created book with ID: 2
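SQLAlchemy assigns primary keys (IDs) to new objects when the session flushes the pending INSERT statements to the database. commit() triggers that flush automatically, so by the time it returns both books have IDs. Until a flush happens (on commit(), on an explicit flush(), or automatically before a query), the session only tracks the new objects in memory.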
Read (Query) Records
# read_records.py
from sqlalchemy import create_engine, String, select
from sqlalchemy.orm import DeclarativeBase, Session, Mapped, mapped_column
class Base(DeclarativeBase):
pass
class Book(Base):
__tablename__ = 'books'
id: Mapped[int] = mapped_column(primary_key=True)
title: Mapped[str] = mapped_column(String(100))
author: Mapped[str] = mapped_column(String(100))
engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
with Session(engine) as session:
session.add(Book(title='Python Basics', author='Alice Johnson'))
session.add(Book(title='Web Dev with Django', author='Bob Smith'))
session.commit()
# Read records
with Session(engine) as session:
stmt = select(Book)
books = session.scalars(stmt).all()
for book in books:
print(f'{book.id}: {book.title} by {book.author}')
Output:
1: Python Basics by Alice Johnson
2: Web Dev with Django by Bob Smith
Update Records
# update_records.py
from sqlalchemy import create_engine, String, select
from sqlalchemy.orm import DeclarativeBase, Session, Mapped, mapped_column
class Base(DeclarativeBase):
pass
class Book(Base):
__tablename__ = 'books'
id: Mapped[int] = mapped_column(primary_key=True)
title: Mapped[str] = mapped_column(String(100))
author: Mapped[str] = mapped_column(String(100))
engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
with Session(engine) as session:
session.add(Book(title='Python Basics', author='Alice Johnson'))
session.commit()
# Update a record
with Session(engine) as session:
stmt = select(Book).where(Book.title == 'Python Basics')
book = session.scalars(stmt).first()
if book:
book.author = 'Alice J. Johnson'
session.commit()
print(f'Updated: {book.title} by {book.author}')
Output:
Updated: Python Basics by Alice J. Johnson
Delete Records
# delete_records.py
from sqlalchemy import create_engine, String, select
from sqlalchemy.orm import DeclarativeBase, Session, Mapped, mapped_column
class Base(DeclarativeBase):
pass
class Book(Base):
__tablename__ = 'books'
id: Mapped[int] = mapped_column(primary_key=True)
title: Mapped[str] = mapped_column(String(100))
engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
with Session(engine) as session:
session.add(Book(title='Old Book'))
session.commit()
# Delete a record
with Session(engine) as session:
stmt = select(Book).where(Book.title == 'Old Book')
book = session.scalars(stmt).first()
if book:
session.delete(book)
session.commit()
print('Book deleted successfully')
Output:
Book deleted successfully
Create, read, update, delete — the four verbs every ORM speaks fluently.
Querying with select()
SQLAlchemy 2.0’s select() construct is the modern way to build queries. It’s more expressive than the legacy query() method and provides better IDE support through type hints.
Basic Selects
# basic_select.py
from sqlalchemy import create_engine, String, select
from sqlalchemy.orm import DeclarativeBase, Session, Mapped, mapped_column
class Base(DeclarativeBase):
pass
class Student(Base):
__tablename__ = 'students'
id: Mapped[int] = mapped_column(primary_key=True)
name: Mapped[str] = mapped_column(String(100))
grade: Mapped[int]
engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
with Session(engine) as session:
session.add(Student(name='Alice', grade=95))
session.add(Student(name='Bob', grade=87))
session.add(Student(name='Charlie', grade=92))
session.commit()
# Select all
with Session(engine) as session:
stmt = select(Student)
all_students = session.scalars(stmt).all()
print(f'Total students: {len(all_students)}')
# Select first
first = session.scalars(select(Student)).first()
print(f'First student: {first.name}')
Output:
Total students: 3
First student: Alice
Filtering Results
# filtering.py
from sqlalchemy import create_engine, String, select
from sqlalchemy.orm import DeclarativeBase, Session, Mapped, mapped_column
class Base(DeclarativeBase):
pass
class Student(Base):
__tablename__ = 'students'
id: Mapped[int] = mapped_column(primary_key=True)
name: Mapped[str] = mapped_column(String(100))
grade: Mapped[int]
engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
with Session(engine) as session:
session.add_all([
Student(name='Alice', grade=95),
Student(name='Bob', grade=87),
Student(name='Charlie', grade=92),
Student(name='Diana', grade=88)
])
session.commit()
with Session(engine) as session:
# Equal comparison
stmt = select(Student).where(Student.name == 'Alice')
result = session.scalars(stmt).first()
print(f'Found: {result.name} (Grade: {result.grade})')
# Greater than
stmt = select(Student).where(Student.grade > 90)
high_performers = session.scalars(stmt).all()
print(f'High performers: {[s.name for s in high_performers]}')
# Like pattern
stmt = select(Student).where(Student.name.like('D%'))
result = session.scalars(stmt).first()
print(f'Names starting with D: {result.name}')
Output:
Found: Alice (Grade: 95)
High performers: ['Alice', 'Charlie']
Names starting with D: Diana
Ordering and Limiting
# ordering_limiting.py
from sqlalchemy import create_engine, String, select, desc
from sqlalchemy.orm import DeclarativeBase, Session, Mapped, mapped_column
class Base(DeclarativeBase):
pass
class Student(Base):
__tablename__ = 'students'
id: Mapped[int] = mapped_column(primary_key=True)
name: Mapped[str] = mapped_column(String(100))
grade: Mapped[int]
engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
with Session(engine) as session:
session.add_all([
Student(name='Alice', grade=95),
Student(name='Bob', grade=87),
Student(name='Charlie', grade=92),
Student(name='Diana', grade=88)
])
session.commit()
with Session(engine) as session:
# Order ascending
stmt = select(Student).order_by(Student.grade)
lowest = session.scalars(stmt).first()
print(f'Lowest grade: {lowest.name} ({lowest.grade})')
# Order descending
stmt = select(Student).order_by(desc(Student.grade))
highest = session.scalars(stmt).first()
print(f'Highest grade: {highest.name} ({highest.grade})')
# Limit
stmt = select(Student).order_by(desc(Student.grade)).limit(2)
top_two = session.scalars(stmt).all()
print(f'Top 2 students: {[s.name for s in top_two]}')
Output:
Lowest grade: Bob (87)
Highest grade: Alice (95)
Top 2 students: ['Alice', 'Charlie']
Joins Between Tables
# joins.py
from sqlalchemy import create_engine, String, select, ForeignKey
from sqlalchemy.orm import DeclarativeBase, Session, Mapped, mapped_column, relationship
class Base(DeclarativeBase):
pass
class Department(Base):
__tablename__ = 'departments'
id: Mapped[int] = mapped_column(primary_key=True)
name: Mapped[str] = mapped_column(String(100))
employees: Mapped[list['Employee']] = relationship(back_populates='department')
class Employee(Base):
__tablename__ = 'employees'
id: Mapped[int] = mapped_column(primary_key=True)
name: Mapped[str] = mapped_column(String(100))
department_id: Mapped[int] = mapped_column(ForeignKey('departments.id'))
department: Mapped[Department] = relationship(back_populates='employees')
engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
with Session(engine) as session:
dept_eng = Department(name='Engineering')
dept_hr = Department(name='HR')
session.add_all([
Employee(name='Alice', department=dept_eng),
Employee(name='Bob', department=dept_eng),
Employee(name='Charlie', department=dept_hr)
])
session.commit()
with Session(engine) as session:
# Join departments and employees
stmt = select(Employee).join(Department).where(Department.name == 'Engineering')
eng_employees = session.scalars(stmt).all()
print(f'Engineering employees: {[e.name for e in eng_employees]}')
Output:
Engineering employees: ['Alice', 'Bob']
Relationships Between Models
Relationships let you traverse from one model to another. SQLAlchemy handles the foreign key constraints and makes it easy to load related objects.
One-to-Many Relationships
# one_to_many.py
from sqlalchemy import create_engine, String, ForeignKey
from sqlalchemy.orm import DeclarativeBase, Session, Mapped, mapped_column, relationship
class Base(DeclarativeBase):
pass
class Author(Base):
__tablename__ = 'authors'
id: Mapped[int] = mapped_column(primary_key=True)
name: Mapped[str] = mapped_column(String(100))
books: Mapped[list['Book']] = relationship(back_populates='author', cascade='all, delete-orphan')
class Book(Base):
__tablename__ = 'books'
id: Mapped[int] = mapped_column(primary_key=True)
title: Mapped[str] = mapped_column(String(100))
author_id: Mapped[int] = mapped_column(ForeignKey('authors.id'))
author: Mapped[Author] = relationship(back_populates='books')
engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
with Session(engine) as session:
author = Author(name='George Orwell')
author.books = [
Book(title='1984'),
Book(title='Animal Farm')
]
session.add(author)
session.commit()
with Session(engine) as session:
from sqlalchemy import select
author = session.scalars(select(Author).where(Author.name == 'George Orwell')).first()
print(f'Author: {author.name}')
for book in author.books:
print(f' - {book.title}')
Output:
Author: George Orwell
- 1984
- Animal Farm
Many-to-Many Relationships
# many_to_many.py
from sqlalchemy import create_engine, String, ForeignKey, Table, Column
from sqlalchemy.orm import DeclarativeBase, Session, Mapped, mapped_column, relationship
class Base(DeclarativeBase):
pass
# Association table for many-to-many
student_course = Table(
'student_course',
Base.metadata,
Column('student_id', ForeignKey('students.id'), primary_key=True),
Column('course_id', ForeignKey('courses.id'), primary_key=True)
)
class Student(Base):
__tablename__ = 'students'
id: Mapped[int] = mapped_column(primary_key=True)
name: Mapped[str] = mapped_column(String(100))
courses: Mapped[list['Course']] = relationship(secondary=student_course, back_populates='students')
class Course(Base):
__tablename__ = 'courses'
id: Mapped[int] = mapped_column(primary_key=True)
name: Mapped[str] = mapped_column(String(100))
students: Mapped[list[Student]] = relationship(secondary=student_course, back_populates='courses')
engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
with Session(engine) as session:
python = Course(name='Python 101')
math = Course(name='Calculus I')
alice = Student(name='Alice', courses=[python, math])
bob = Student(name='Bob', courses=[python])
session.add_all([alice, bob])
session.commit()
with Session(engine) as session:
from sqlalchemy import select
student = session.scalars(select(Student).where(Student.name == 'Alice')).first()
print(f'Alice is taking: {[c.name for c in student.courses]}')
Output:
Alice is taking: ['Python 101', 'Calculus I']
Foreign keys in Python land — relationship() does the joining for you.
Transactions and Error Handling
A transaction is a sequence of database operations that either all succeed or all fail. SQLAlchemy sessions handle transactions automatically, but you can control commit/rollback behavior explicitly.
Basic Commit and Rollback
# transactions.py
from sqlalchemy import create_engine, String, select
from sqlalchemy.orm import DeclarativeBase, Session, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Account(Base):
    __tablename__ = 'accounts'
    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str] = mapped_column(String(100))
    balance: Mapped[float]

engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)

# Create initial accounts
with Session(engine) as session:
    session.add_all([
        Account(name='Alice', balance=1000.0),
        Account(name='Bob', balance=500.0)
    ])
    session.commit()

# Simulate a transfer with error handling
with Session(engine) as session:
    try:
        # .one() raises if an account is missing, so the transfer never half-applies
        alice = session.scalars(select(Account).where(Account.name == 'Alice')).one()
        bob = session.scalars(select(Account).where(Account.name == 'Bob')).one()
        # Transfer 200 from Alice to Bob
        alice.balance -= 200
        bob.balance += 200
        session.commit()
        print(f'Transfer successful: Alice={alice.balance}, Bob={bob.balance}')
    except Exception as e:
        session.rollback()
        print(f'Transfer failed: {e}')
Output:
Transfer successful: Alice=800.0, Bob=700.0
Error Handling with Try-Except
# error_handling.py
from sqlalchemy import create_engine, String, exc
from sqlalchemy.orm import DeclarativeBase, Session, Mapped, mapped_column
class Base(DeclarativeBase):
pass
class User(Base):
__tablename__ = 'users'
id: Mapped[int] = mapped_column(primary_key=True)
email: Mapped[str] = mapped_column(String(100), unique=True)
engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
with Session(engine) as session:
session.add(User(email='alice@example.com'))
session.commit()
# Try to add duplicate email
with Session(engine) as session:
try:
session.add(User(email='alice@example.com'))
session.commit()
except exc.IntegrityError as e:
session.rollback()
print('Error: Email already exists')
except Exception as e:
session.rollback()
print(f'Unexpected error: {e}')
Output:
Error: Email already exists
Transactions — commit when ready, rollback when not. No half measures.
Real-Life Example: Blog Database
Let’s build a complete blog system with Post, Author, and Tag models, demonstrating relationships, CRUD operations, and queries.
# blog_system.py
from sqlalchemy import create_engine, String, Text, ForeignKey, Table, Column, select, desc
from sqlalchemy.orm import DeclarativeBase, Session, Mapped, mapped_column, relationship
from datetime import datetime
class Base(DeclarativeBase):
pass
# Association table for many-to-many relationship
post_tag = Table(
'post_tag',
Base.metadata,
Column('post_id', ForeignKey('posts.id'), primary_key=True),
Column('tag_id', ForeignKey('tags.id'), primary_key=True)
)
class Author(Base):
__tablename__ = 'authors'
id: Mapped[int] = mapped_column(primary_key=True)
name: Mapped[str] = mapped_column(String(100), nullable=False)
email: Mapped[str] = mapped_column(String(100), unique=True)
posts: Mapped[list['Post']] = relationship(back_populates='author', cascade='all, delete-orphan')
class Post(Base):
__tablename__ = 'posts'
id: Mapped[int] = mapped_column(primary_key=True)
title: Mapped[str] = mapped_column(String(200), nullable=False)
content: Mapped[str] = mapped_column(Text)
created_at: Mapped[datetime] = mapped_column(default=datetime.utcnow)
author_id: Mapped[int] = mapped_column(ForeignKey('authors.id'))
author: Mapped[Author] = relationship(back_populates='posts')
tags: Mapped[list['Tag']] = relationship(secondary=post_tag, back_populates='posts')
class Tag(Base):
__tablename__ = 'tags'
id: Mapped[int] = mapped_column(primary_key=True)
name: Mapped[str] = mapped_column(String(50), unique=True)
posts: Mapped[list[Post]] = relationship(secondary=post_tag, back_populates='tags')
# Setup
engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
# Create blog data
with Session(engine) as session:
author1 = Author(name='Alice', email='alice@blog.com')
author2 = Author(name='Bob', email='bob@blog.com')
python_tag = Tag(name='Python')
web_tag = Tag(name='Web')
post1 = Post(
title='Getting Started with Python',
content='Python is a great language...',
author=author1,
tags=[python_tag, web_tag]
)
post2 = Post(
title='Advanced ORM Techniques',
content='SQLAlchemy provides powerful ORM features...',
author=author1,
tags=[python_tag]
)
post3 = Post(
title='Web Development Tips',
content='Here are some web development best practices...',
author=author2,
tags=[web_tag]
)
session.add_all([author1, author2, python_tag, web_tag, post1, post2, post3])
session.commit()
# Query examples
with Session(engine) as session:
# Find all posts by an author
stmt = select(Post).join(Author).where(Author.name == 'Alice').order_by(desc(Post.created_at))
alice_posts = session.scalars(stmt).all()
print(f'Posts by Alice: {len(alice_posts)}')
for post in alice_posts:
print(f' - {post.title}')
# Find all posts with a specific tag
stmt = select(Post).join(Post.tags).where(Tag.name == 'Python')
python_posts = session.scalars(stmt).all()
print(f'\nPython posts: {len(python_posts)}')
for post in python_posts:
print(f' - {post.title} by {post.author.name}')
# Count total posts
stmt = select(Post)
total_posts = len(session.scalars(stmt).all())
print(f'\nTotal blog posts: {total_posts}')
Output:
Posts by Alice: 2
- Advanced ORM Techniques
- Getting Started with Python
Python posts: 2
- Getting Started with Python by Alice
- Advanced ORM Techniques by Alice
Total blog posts: 3
This example brings together the main pieces of the SQLAlchemy 2.0 ORM: defining multiple related models, using a many-to-many relationship, querying across joins, and configuring cascading deletes so that removing an author also removes that author's posts.
Frequently Asked Questions
What’s the difference between Mapped and traditional type hints?
Mapped[type] is SQLAlchemy 2.0’s way to combine Python type hints with ORM metadata. It tells SQLAlchemy about the column while also providing type information to your IDE and type checkers. The legacy approach used no type hints.
When should I use relationships versus manual joins?
Use relationships when you want to access related objects as Python attributes (e.g., author.posts). Use manual joins when you need more control over the query or want to fetch only specific columns. Relationships are more Pythonic and handle lazy loading by default.
What’s the difference between add() and add_all()?
session.add(obj) adds a single object. session.add_all([obj1, obj2]) adds multiple objects at once. Use add_all() for convenience when inserting several objects.
How do I handle database connection pooling?
SQLAlchemy’s engine manages connection pooling automatically. For production applications, configure pool settings when creating the engine: engine = create_engine('postgresql://...', pool_size=10, max_overflow=20).
Can I use SQLAlchemy ORM with async code?
Yes. SQLAlchemy 2.0 supports asyncio through AsyncSession and create_async_engine(), which need an async database driver such as asyncpg. This is useful for high-concurrency web applications, and the basic patterns remain the same.
What happens if I forget to commit()?
Changes are held in the session but not persisted to the database. When the session context exits (the with block ends), uncommitted changes are rolled back. Always call commit() to save changes.
How do I avoid the N+1 query problem?
The N+1 problem happens when loading N parent objects triggers a separate query for each parent's children. Use eager loading to avoid it: joinedload() fetches the related objects in the same query through a JOIN, while selectinload() issues one additional SELECT ... IN query that covers all parents: select(Author).options(selectinload(Author.posts)).
Conclusion
SQLAlchemy 2.0 brings modern Python patterns to database programming. By using DeclarativeBase for model definition, select() for queries, and proper session management, you can build robust data-driven applications without writing a single SQL string. The ORM layer abstracts away database details while remaining transparent and powerful.
Key takeaways from this tutorial:
Models inherit from DeclarativeBase and use Mapped type hints
select() is the modern way to build type-safe queries
Sessions manage transactions and object tracking
Relationships make it natural to traverse related objects
Always handle errors and rollback on failure
Eager loading prevents common performance pitfalls
For next steps, explore SQLAlchemy’s advanced features like hybrid properties, custom types, and query optimizations. Consider integrating SQLAlchemy with frameworks like Flask or FastAPI for web development. As you grow more comfortable with the ORM, you’ll find that SQLAlchemy’s power and flexibility make it an excellent choice for any Python project requiring database interaction.
Now you have a complete, production-ready reference for SQLAlchemy 2.0 ORM. Use this guide to build, query, and maintain your database layer with confidence.
Writing asynchronous code in Python has always been powerful but challenging. The traditional asyncio.create_task() approach leaves you vulnerable to silent failures — a task can crash without your knowledge, or worse, you might forget to await all your spawned tasks. Enter asyncio.TaskGroup, introduced in Python 3.11, which brings structured concurrency patterns to the standard library and makes parallel task management reliable and clean.
If you’ve struggled with managing multiple async tasks, coordinating their completion, or handling errors when things go wrong, TaskGroup is the solution you’ve been waiting for. Instead of manually tracking tasks and writing error-handling boilerplate, TaskGroup handles all of that automatically through a simple context manager interface.
In this tutorial, you’ll learn how TaskGroup simplifies concurrent programming, how to handle errors gracefully, manage nested task groups, and apply these patterns to real-world scenarios. Whether you’re building web scrapers, API clients, or distributed systems, TaskGroup will become an essential tool in your async toolkit.
Quick Example
Before diving deep, here’s a taste of what TaskGroup looks like in action:
Result 1: Data from api1.com
Result 2: Data from api2.com
Result 3: Data from api3.com
Three tasks run in parallel, and after the async with block exits, all results are guaranteed to be ready. No fire-and-forget bugs. No manual cancellation. Just clean, structured concurrency.
Parallel tasks, one manager. asyncio.TaskGroup keeps everything on track.
What Is asyncio.TaskGroup and Why Use It?
asyncio.TaskGroup is a context manager that enforces structured concurrency — a programming pattern where the lifetime of child tasks is bound to their parent scope. When the TaskGroup context exits, all child tasks are guaranteed to be either completed or cancelled, and any exceptions from those tasks are collected and re-raised as an ExceptionGroup.
This is fundamentally different from the older asyncio.create_task() pattern, where tasks exist independently and require manual tracking. Let’s compare:
| Feature | asyncio.create_task() | asyncio.TaskGroup |
| --- | --- | --- |
| Task lifetime tracking | Manual (you must track and await each task) | Automatic (bound to context manager scope) |
| Error handling | Exceptions stay hidden until you await the task or call task.result() | All exceptions collected in ExceptionGroup |
| Cancellation on error | Must implement manually | Automatic: remaining tasks are cancelled on first failure |
| Fire-and-forget bugs | Common: tasks can be forgotten | Prevented: every task is awaited when the block exits |
| Syntax clarity | Verbose: multiple await statements | Clean: single context block |
| Python version | 3.7+ | 3.11+ |
Creating and Running Task Groups
The most basic pattern for using TaskGroup is simple: create a context using async with asyncio.TaskGroup() and spawn tasks using the create_task() method. The context manager automatically waits for all spawned tasks to complete before exiting.
Key observations: when the async with block exits, TaskGroup waits for all pending tasks. You can access task.result() after the block because completion is guaranteed. The total elapsed time is approximately 1.5 seconds (the longest task), not 3 seconds (the sum), because the tasks wait concurrently on a single event loop rather than one after another.
Spawning Tasks with TaskGroup.create_task()
The create_task() method on TaskGroup returns a standard asyncio.Task object, just like asyncio.create_task(). The difference is that the task is automatically tracked and must be completed before the context exits.
# filename: taskgroup_spawning_demo.py
import asyncio
from datetime import datetime
async def work(task_id, duration):
start = datetime.now()
await asyncio.sleep(duration)
elapsed = (datetime.now() - start).total_seconds()
return f"Task {task_id} slept for {elapsed:.1f}s"
async def main():
async with asyncio.TaskGroup() as tg:
tasks = []
for i in range(5):
task = tg.create_task(work(i, 0.5 + i * 0.1))
tasks.append(task)
for task in tasks:
print(task.result())
asyncio.run(main())
Output:
Task 0 slept for 0.5s
Task 1 slept for 0.6s
Task 2 slept for 0.7s
Task 3 slept for 0.8s
Task 4 slept for 0.9s
This loop creates five tasks concurrently. All tasks run in parallel, and the context manager ensures all are complete before proceeding.
When one task raises, TaskGroup cancels the rest.
Error Handling with ExceptionGroup
When a task within a TaskGroup raises an exception, TaskGroup doesn’t immediately propagate it. Instead, it cancels all remaining tasks and collects all exceptions into an ExceptionGroup. This gives you a chance to handle multiple failures at once.
# filename: taskgroup_exception_handling.py
import asyncio
async def reliable_task():
await asyncio.sleep(0.5)
return "Success"
async def failing_task():
await asyncio.sleep(0.2)
raise ValueError("Something went wrong")
async def main():
try:
async with asyncio.TaskGroup() as tg:
t1 = tg.create_task(reliable_task())
t2 = tg.create_task(failing_task())
except ExceptionGroup as eg:
print(f"Caught ExceptionGroup with {len(eg.exceptions)} exceptions")
for exc in eg.exceptions:
print(f" - {type(exc).__name__}: {exc}")
asyncio.run(main())
Output:
Caught ExceptionGroup with 1 exceptions
- ValueError: Something went wrong
When failing_task raises a ValueError after 0.2 seconds, TaskGroup cancels the remaining tasks (here reliable_task, which is still sleeping and never returns "Success") and raises an ExceptionGroup containing that ValueError.
Handling Multiple Exceptions
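If several tasks fail before cancellation reaches them, all of their exceptions are collected. In the listing below the delays are staggered, so task 3 fails first at 0.1 seconds and the other two are cancelled while still sleeping; the resulting ExceptionGroup therefore holds a single RuntimeError: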
# filename: taskgroup_multiple_exceptions.py
import asyncio
async def failing_task(task_id, delay):
await asyncio.sleep(delay)
raise RuntimeError(f"Task {task_id} failed")
async def main():
try:
async with asyncio.TaskGroup() as tg:
tg.create_task(failing_task(1, 0.2))
tg.create_task(failing_task(2, 0.3))
tg.create_task(failing_task(3, 0.1))
except ExceptionGroup as eg:
print(f"Caught {len(eg.exceptions)} exceptions:")
for exc in eg.exceptions:
print(f" {exc}")
asyncio.run(main())
The except* syntax (Python 3.11+) matches the exceptions inside a group by type and hands each handler only the subgroup it names, which makes selective error handling clean and Pythonic.
Nested TaskGroups — because sometimes your tasks have tasks of their own.
Nested Task Groups
TaskGroup supports nesting — you can create child TaskGroups within parent TaskGroups. This enables hierarchical task organization and selective error handling at different levels.
# filename: taskgroup_nesting.py
import asyncio
async def subtask(subtask_id, delay):
await asyncio.sleep(delay)
return f"Subtask {subtask_id} done"
async def parent_work(parent_id):
async with asyncio.TaskGroup() as child_tg:
results = []
for i in range(3):
task = child_tg.create_task(subtask(f"{parent_id}.{i}", 0.3))
results.append(task)
return [t.result() for t in results]
async def main():
async with asyncio.TaskGroup() as parent_tg:
p1 = parent_tg.create_task(parent_work("Parent1"))
p2 = parent_tg.create_task(parent_work("Parent2"))
print("Parent 1 results:", p1.result())
print("Parent 2 results:", p2.result())
asyncio.run(main())
Here, two parent tasks each spawn their own child TaskGroup with three subtasks. All subtasks run in parallel, and errors can be handled at the appropriate nesting level.
Error Propagation in Nested Groups
When a child TaskGroup raises an ExceptionGroup, it propagates up to the parent:
# filename: taskgroup_nested_errors.py
import asyncio

async def failing_subtask():
    await asyncio.sleep(0.1)
    raise RuntimeError("Subtask failed")

async def parent_work():
    try:
        async with asyncio.TaskGroup() as child_tg:
            # Spawn the subtask on the child group
            child_tg.create_task(failing_subtask())
    except ExceptionGroup as eg:
        print(f"Child caught: {eg}")
        raise  # Re-raise to parent

async def main():
    try:
        async with asyncio.TaskGroup() as parent_tg:
            parent_tg.create_task(parent_work())
    except ExceptionGroup as eg:
        print(f"Parent caught: {eg}")

asyncio.run(main())
Exceptions bubble up through nested TaskGroups. The parent's ExceptionGroup wraps the child's ExceptionGroup rather than flattening it, so you can handle failures at the appropriate level or let them propagate to the top.
asyncio.timeout — because waiting forever is not a strategy.
Timeouts and Cancellation with TaskGroup
You can apply timeouts to a TaskGroup using asyncio.timeout() (Python 3.11+) or asyncio.wait_for(). If a timeout occurs, all tasks in the group are cancelled.
# filename: taskgroup_timeout.py
import asyncio
async def slow_task(task_id):
try:
await asyncio.sleep(5)
return f"Task {task_id} completed"
except asyncio.CancelledError:
print(f"Task {task_id} was cancelled")
raise
async def main():
try:
async with asyncio.timeout(2): # 2 second timeout
async with asyncio.TaskGroup() as tg:
tg.create_task(slow_task(1))
tg.create_task(slow_task(2))
tg.create_task(slow_task(3))
except TimeoutError:
print("TaskGroup timed out!")
asyncio.run(main())
Output:
Task 1 was cancelled
Task 2 was cancelled
Task 3 was cancelled
TaskGroup timed out!
The asyncio.timeout() context manager applies a deadline to the TaskGroup. When the timeout expires, all pending tasks receive a CancelledError.
Manual Cancellation
You can also cancel individual tasks manually by keeping references to them. A child cancelled this way does not count as a failure, so the group exits normally and no ExceptionGroup is raised:
# filename: taskgroup_manual_cancel.py
import asyncio
async def monitor_and_cancel(task_group_tasks):
await asyncio.sleep(1)
print("Cancelling remaining tasks...")
for task in task_group_tasks:
if not task.done():
task.cancel()
async def long_task(task_id):
try:
await asyncio.sleep(10)
return f"Task {task_id} done"
except asyncio.CancelledError:
print(f"Task {task_id} cancelled")
raise
async def main():
try:
async with asyncio.TaskGroup() as tg:
tasks = [tg.create_task(long_task(i)) for i in range(3)]
tg.create_task(monitor_and_cancel(tasks))
except ExceptionGroup as eg:
print(f"Got {len(eg.exceptions)} exceptions")
asyncio.run(main())
Real-Life Example: Parallel API Fetcher
Let’s build a realistic example that fetches data from multiple API endpoints in parallel and handles errors gracefully:
# filename: parallel_api_fetcher.py
import asyncio
import json
from urllib.request import urlopen
from urllib.error import URLError

async def fetch_json_data(url):
    """Fetch JSON from a URL without blocking the event loop."""
    loop = asyncio.get_running_loop()

    def blocking_fetch():
        try:
            with urlopen(url, timeout=5) as response:
                return json.loads(response.read().decode())
        except URLError as e:
            raise RuntimeError(f"Failed to fetch {url}: {e}") from e

    # Run blocking I/O in the default thread pool
    return await loop.run_in_executor(None, blocking_fetch)

async def get_user_data(user_id):
    """Fetch user data from JSONPlaceholder API."""
    url = f"https://jsonplaceholder.typicode.com/users/{user_id}"
    data = await fetch_json_data(url)
    return {"user_id": user_id, "name": data.get("name")}

async def get_post_data(post_id):
    """Fetch post data from JSONPlaceholder API."""
    url = f"https://jsonplaceholder.typicode.com/posts/{post_id}"
    data = await fetch_json_data(url)
    return {"post_id": post_id, "title": data.get("title")}

async def get_comment_data(comment_id):
    """Fetch comment data from JSONPlaceholder API."""
    url = f"https://jsonplaceholder.typicode.com/comments/{comment_id}"
    data = await fetch_json_data(url)
    # Default to an empty string so a missing body cannot break the slice
    return {"comment_id": comment_id, "body": data.get("body", "")[:50]}

async def main():
    """Fetch various data types in parallel."""
    print("Starting parallel API fetches...")
    try:
        async with asyncio.TaskGroup() as tg:
            # Fetch users
            user_tasks = [
                tg.create_task(get_user_data(i))
                for i in range(1, 4)
            ]
            # Fetch posts
            post_tasks = [
                tg.create_task(get_post_data(i))
                for i in range(1, 4)
            ]
            # Fetch comments
            comment_tasks = [
                tg.create_task(get_comment_data(i))
                for i in range(1, 4)
            ]
        # The group has exited, so every result is ready
        print("\nUsers fetched:")
        for task in user_tasks:
            print(f" {task.result()}")
        print("\nPosts fetched:")
        for task in post_tasks:
            print(f" {task.result()}")
        print("\nComments fetched:")
        for task in comment_tasks:
            print(f" {task.result()}")
    except ExceptionGroup as eg:
        print("Errors occurred during fetching:")
        for exc in eg.exceptions:
            print(f" {exc}")

if __name__ == "__main__":
    asyncio.run(main())
Output:
Starting parallel API fetches...
Users fetched:
{'user_id': 1, 'name': 'Leanne Graham'}
{'user_id': 2, 'name': 'Ervin Howell'}
{'user_id': 3, 'name': 'Clementine Bauch'}
Posts fetched:
{'post_id': 1, 'title': 'sunt aut facere repellat provident...'}
{'post_id': 2, 'title': 'qui est esse'}
{'post_id': 3, 'title': 'ea molestias quasi exercitationem...'}
Comments fetched:
{'comment_id': 1, 'body': 'laudantium enim quasi est quidem magn'}
{'comment_id': 2, 'body': 'est nisi doloremque illum quis sequi u'}
{'comment_id': 3, 'body': 'quia et suscipit suscipit recusandae c'}
This example demonstrates several key patterns: spawning multiple categories of tasks, handling network I/O asynchronously, collecting results, and grouping error handling. All three categories of requests execute in parallel, reducing total fetch time significantly compared to sequential requests.
Three endpoints, one TaskGroup, zero sequential waiting.
Frequently Asked Questions
How does TaskGroup compare to asyncio.gather()?
asyncio.gather() collects coroutines and returns their results. TaskGroup is more powerful: it enforces structured concurrency, automatically cancels remaining tasks on failure, and collects all exceptions. Use TaskGroup for better control; use gather() if you just need simple result collection.
What happens if a task raises an exception in TaskGroup?
TaskGroup cancels all remaining tasks, waits for them to finish, and then raises an ExceptionGroup containing the exceptions of the tasks that failed. The CancelledError from the siblings it cancelled is not included. You can catch this group with except ExceptionGroup or use except* for selective handling.
Can I nest TaskGroups and handle exceptions at different levels?
Yes. Each TaskGroup can have its own exception handler. Exceptions from child groups propagate to parent groups, allowing hierarchical error handling. You can catch and re-raise at any level.
How do I check if a task completed successfully in a TaskGroup?
After the TaskGroup context exits, all tasks are done. Use task.result() to get the return value or task.exception() to check for exceptions. Tasks that were cancelled will raise CancelledError when you call result().
What Python versions support TaskGroup?
TaskGroup is available in Python 3.11 and later. For older versions, use asyncio.gather(), asyncio.create_task(), or third-party libraries like anyio.
How do I return and access results from TaskGroup tasks?
Store references to tasks returned by create_task(). After the TaskGroup context exits, call task.result() to get the return value. If the task raised an exception, result() re-raises it (or it’s in the ExceptionGroup).
Conclusion
asyncio.TaskGroup is a powerful addition to Python’s async toolkit, bringing structured concurrency patterns to the standard library. By enforcing that tasks complete or are cancelled when their parent scope exits, TaskGroup eliminates entire classes of bugs — forgotten tasks, orphaned coroutines, and unhandled exceptions. The automatic error collection in ExceptionGroup makes it easy to detect and respond to failures in complex concurrent systems.
Whether you’re fetching data from multiple APIs, processing files in parallel, or coordinating distributed system operations, TaskGroup provides a clean, Pythonic way to write reliable async code. Combined with error handling via except* and support for timeouts and cancellation, TaskGroup should be your default choice for managing concurrent tasks in Python 3.11+.
Start using TaskGroup in your async projects today, and you’ll quickly find it becomes as indispensable as async/await itself.
TaskGroup vs gather()
Before 3.11, the canonical way to run concurrent tasks was asyncio.gather(). TaskGroup is the modern replacement — better error handling, structured concurrency, automatic cancellation:
import asyncio

async def fetch(url):
    print("Fetching:", url)
    await asyncio.sleep(1)
    return f"Result from {url}"

# Old way: gather()
async def main_old():
    results = await asyncio.gather(
        fetch("https://api1.example.com"),
        fetch("https://api2.example.com"),
        fetch("https://api3.example.com"),
    )
    print(results)

# New way: TaskGroup (3.11+)
async def main_new():
    async with asyncio.TaskGroup() as tg:
        t1 = tg.create_task(fetch("https://api1.example.com"))
        t2 = tg.create_task(fetch("https://api2.example.com"))
        t3 = tg.create_task(fetch("https://api3.example.com"))
    # All tasks complete by here
    print(t1.result(), t2.result(), t3.result())

asyncio.run(main_new())
TaskGroup waits for all tasks to complete before exiting the with-block. The key win is automatic error propagation: if any task fails, the rest are cancelled and the failure is raised as an ExceptionGroup.
Error Handling with ExceptionGroup
import asyncio

async def maybe_fail(n):
    await asyncio.sleep(0.1)
    if n == 2:
        raise ValueError("task 2 failed")
    if n == 4:
        raise ConnectionError("network down")
    return n * 2

async def main():
    try:
        async with asyncio.TaskGroup() as tg:
            for i in range(5):
                tg.create_task(maybe_fail(i))
    except* ValueError as eg:
        for e in eg.exceptions:
            print("ValueError:", e)
    except* ConnectionError as eg:
        for e in eg.exceptions:
            print("ConnectionError:", e)

asyncio.run(main())
The except* syntax lets you handle different exception types from concurrent tasks separately. ExceptionGroup is the canonical container for parallel failures.
Dynamic Task Creation
async def main():
    urls = await get_url_list()
    async with asyncio.TaskGroup() as tg:
        tasks = [tg.create_task(fetch(url)) for url in urls]
    return [t.result() for t in tasks]
You can create tasks at any point inside the with-block, including conditionally. The TaskGroup tracks all of them and waits for all to finish.
TaskGroup vs gather: Side-by-Side
Feature | gather() | TaskGroup
Multiple concurrent tasks | Yes | Yes
Return values | List of results | tg.create_task returns Task; call .result()
First-error-cancels-rest | No: with return_exceptions=False the first exception propagates, but the other awaitables keep running | Always (structured)
Multiple errors | Only the first is raised (return_exceptions=True returns them all in the results list) | ExceptionGroup with all errors
Dynamic task creation | No: pass all upfront | Yes: create_task any time
Available since | Python 3.4 | Python 3.11
Cancellation Semantics
async def slow_task():
    try:
        await asyncio.sleep(10)
    except asyncio.CancelledError:
        print("cleaning up...")
        raise  # important: re-raise to actually cancel
    return "done"

async def main():
    async with asyncio.TaskGroup() as tg:
        long = tg.create_task(slow_task())
        tg.create_task(maybe_fail(2))
    # When maybe_fail raises, long is cancelled and slow_task sees CancelledError
Common Pitfalls
Forgetting Python version. TaskGroup is 3.11+. Use gather() if you need 3.10 or earlier.
Catching CancelledError without re-raising. Swallowing the cancellation breaks cleanup. Always raise after cleanup in async code.
Tasks outside the TaskGroup. A task started with asyncio.create_task() instead of tg.create_task() isn’t tracked by the group, so orphan tasks leak.
Synchronous code inside async with. Blocking calls (time.sleep, requests.get) freeze the event loop and block ALL tasks in the group.
asyncio.run() in a notebook. Jupyter has its own loop. Use await main() directly at the top level.
FAQ
Q: TaskGroup or gather()? A: TaskGroup on Python 3.11+. The structured-concurrency model catches more bugs at design time.
Q: Can I mix TaskGroup with manual create_task? A: You can, but it defeats the purpose. The whole point of TaskGroup is grouped lifecycle.
Q: TaskGroup with timeouts? A: Wrap the group in asyncio.timeout() (also Python 3.11+): async with asyncio.timeout(30): async with asyncio.TaskGroup() as tg: ...
Q: Performance vs gather? A: Comparable. Both schedule tasks on the same event loop; TaskGroup adds correctness, not meaningful overhead.
Q: What if I want one task to succeed and ignore the others’ failures? A: TaskGroup isn’t the right tool — it propagates all failures. Use gather(return_exceptions=True) or write a custom pattern.
Wrapping Up
TaskGroup is asyncio’s structured-concurrency primitive. For Python 3.11+, prefer it over gather() — it catches lifecycle bugs, propagates errors cleanly via ExceptionGroup, and supports dynamic task creation. Keep gather() in the toolbox for backwards-compatibility and for the return_exceptions=True case where you genuinely want to ignore some failures.
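For the return_exceptions=True case, a minimal sketch looks like this. The job coroutine is a made-up example in which odd-numbered jobs fail:

```python
import asyncio

async def job(n):
    await asyncio.sleep(0.01)
    if n % 2:
        raise ValueError(f"job {n} failed")
    return n

async def main():
    # Exceptions are returned in the results list instead of being raised
    results = await asyncio.gather(
        *(job(n) for n in range(4)), return_exceptions=True
    )
    ok = [r for r in results if not isinstance(r, BaseException)]
    failed = [str(r) for r in results if isinstance(r, BaseException)]
    return ok, failed

ok, failed = asyncio.run(main())
print(ok)
print(failed)
```

Results come back in the order the awaitables were passed, so successes and failures can be separated afterwards without any task being cancelled.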
You have probably written dozens of if-elif chains that check a variable against a list of possible values. Maybe it is an HTTP status code, a command from user input, or a message type from an API. The chain starts small, then grows to 15 branches, and suddenly the logic is hard to follow and even harder to extend. Python 3.10 introduced structural pattern matching with the match and case statements to solve exactly this problem.
Structural pattern matching is built into Python 3.10 and later — no extra libraries needed. It goes far beyond a simple switch statement. You can match against literal values, destructure sequences and dictionaries, bind variables, add guard conditions, and even match class instances by their attributes. If you have used pattern matching in Rust, Scala, or Elixir, Python’s version will feel familiar but with its own Pythonic style.
In this article, you will learn how match-case works starting with a quick example, then move through literal patterns, sequence unpacking, mapping patterns, class patterns, guard clauses, and OR patterns. We will finish with a real-life CLI command parser that ties everything together. By the end, you will be able to replace complex branching logic with clean, readable pattern matching code.
Python match-case: Quick Example
Here is the simplest useful example of match-case — handling HTTP status codes. This runs on Python 3.10 or later:
# quick_example.py
def describe_status(code):
    match code:
        case 200:
            return "OK -- request succeeded"
        case 404:
            return "Not Found -- resource does not exist"
        case 500:
            return "Server Error -- something broke on the server"
        case _:
            return f"Unknown status code: {code}"

print(describe_status(200))
print(describe_status(404))
print(describe_status(999))
Output:
OK -- request succeeded
Not Found -- resource does not exist
Unknown status code: 999
The match statement evaluates the subject expression (code) and compares it against each case pattern in order. The first matching pattern wins, and its block runs. The underscore _ is the wildcard pattern — it matches anything and acts as your default branch, similar to else in an if-chain.
This looks like a switch statement on the surface, but as you will see in the following sections, match-case can destructure data structures, bind variables, and match complex nested objects — things no switch statement can do.
What Is Structural Pattern Matching and Why Use It?
Structural pattern matching lets you check whether a value has a particular structure and extract parts of it in a single step. Think of it as an X-ray machine for your data: you describe the shape you expect, and Python checks if the data fits that shape while pulling out the pieces you need.
The key difference from if-elif chains is that pattern matching is declarative. Instead of writing procedural code that tests conditions one by one, you describe what the data should look like. Python handles the checking and unpacking for you.
Feature | if-elif Chain | match-case
Simple value comparison | Works fine | Works fine, slightly cleaner
Destructuring sequences | Manual indexing or unpacking | Built-in with capture variables
Nested data extraction | Multiple lines of checks | Single pattern describes the shape
Type checking + attribute access | isinstance() + getattr() | Class patterns handle both at once
Combining conditions | and/or in conditions | Guards and OR patterns
Readability at 5+ branches | Gets messy fast | Each case is self-contained
Pattern matching shines when you need to handle multiple message types, parse command structures, process API responses with varying shapes, or route events based on their content. For simple two-way checks, a regular if-else is still the right tool.
match-case: cleaner than a wall of elifs.
Matching Literal Values
The most basic use of match-case is matching against literal values — integers, strings, booleans, and None. This is the direct replacement for a long if-elif chain that compares a variable against constants:
# literal_patterns.py
def get_day_type(day):
    match day.lower():
        case "monday" | "tuesday" | "wednesday" | "thursday" | "friday":
            return "weekday"
        case "saturday" | "sunday":
            return "weekend"
        case _:
            return "not a valid day"

print(get_day_type("Monday"))
print(get_day_type("Saturday"))
print(get_day_type("Funday"))
Output:
weekday
weekend
not a valid day
The pipe operator | creates an OR pattern, letting you match multiple values in a single case. This is much cleaner than writing if day in ("monday", "tuesday", ...) when each group needs different handling. Notice we call .lower() on the subject expression itself — all transformations happen before matching begins.
Destructuring Sequences
One of the most powerful features of match-case is sequence patterns. You can match lists and tuples by their structure, extract specific elements into variables, and even capture variable-length remainders with the star operator:
# sequence_patterns.py
def process_command(command_parts):
    match command_parts:
        case ["quit"]:
            return "Exiting program"
        case ["hello", name]:
            return f"Hello, {name}!"
        case ["move", direction, steps]:
            return f"Moving {direction} by {steps} steps"
        case ["add", *items]:
            return f"Adding {len(items)} items: {', '.join(items)}"
        case []:
            return "Empty command"
        case _:
            return f"Unknown command: {command_parts}"

print(process_command(["quit"]))
print(process_command(["hello", "Alice"]))
print(process_command(["move", "north", "5"]))
print(process_command(["add", "milk", "eggs", "bread"]))
print(process_command([]))
Output:
Exiting program
Hello, Alice!
Moving north by 5 steps
Adding 3 items: milk, eggs, bread
Empty command
Each case describes the shape of the list. The pattern ["hello", name] matches any two-element list where the first element is literally "hello", and it binds the second element to the variable name. The *items pattern captures all remaining elements after "add", similar to how *args works in function signatures. This lets you handle variable-length commands without writing manual length checks.
Pattern matching: the shape tells you what to do with it.
Matching Dictionaries
Mapping patterns let you match dictionaries by checking for specific keys and extracting their values. This is incredibly useful for processing JSON responses from APIs where the shape of the data tells you what type of message or event you are dealing with:
# mapping_patterns.py
def handle_event(event):
    match event:
        case {"type": "click", "element": element, "x": x, "y": y}:
            return f"Click on {element} at ({x}, {y})"
        case {"type": "keypress", "key": key}:
            return f"Key pressed: {key}"
        case {"type": "scroll", "direction": direction}:
            return f"Scrolled {direction}"
        case {"type": unknown_type}:
            return f"Unknown event type: {unknown_type}"
        case _:
            return "Invalid event format"

print(handle_event({"type": "click", "element": "button", "x": 100, "y": 200}))
print(handle_event({"type": "keypress", "key": "Enter"}))
print(handle_event({"type": "scroll", "direction": "down", "amount": 3}))
print(handle_event({"type": "resize"}))
Output:
Click on button at (100, 200)
Key pressed: Enter
Scrolled down
Unknown event type: resize
Mapping patterns only check for the keys you specify — extra keys in the dictionary are ignored. The scroll event dictionary has an amount key that the pattern does not mention, and that is fine. The pattern {"type": unknown_type} matches any dictionary with a "type" key and captures its value. This makes mapping patterns perfect for processing JSON-like data where different message types have different fields.
Matching Class Instances
Class patterns combine type checking and attribute extraction in a single step. Instead of writing isinstance() checks followed by attribute access, you describe the class and the attribute values you expect:
# class_patterns.py
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

@dataclass
class Circle:
    center: Point
    radius: float

@dataclass
class Rectangle:
    origin: Point
    width: float
    height: float

def describe_shape(shape):
    match shape:
        case Circle(center=Point(x=0, y=0), radius=r):
            return f"Circle at origin with radius {r}"
        case Circle(center=center, radius=r):
            return f"Circle at ({center.x}, {center.y}) with radius {r}"
        case Rectangle(origin=origin, width=w, height=h) if w == h:
            return f"Square at ({origin.x}, {origin.y}) with side {w}"
        case Rectangle(origin=origin, width=w, height=h):
            return f"Rectangle at ({origin.x}, {origin.y}), {w}x{h}"
        case _:
            return "Unknown shape"

print(describe_shape(Circle(Point(0, 0), 5)))
print(describe_shape(Circle(Point(3, 4), 2.5)))
print(describe_shape(Rectangle(Point(1, 1), 10, 10)))
print(describe_shape(Rectangle(Point(0, 0), 8, 3)))
Output:
Circle at origin with radius 5
Circle at (3, 4) with radius 2.5
Square at (1, 1) with side 10
Rectangle at (0, 0), 8x3
Notice how the first Circle case uses a nested pattern: it matches a Circle whose center is specifically at the origin Point(x=0, y=0). The Rectangle case uses a guard clause (if w == h) to distinguish squares from regular rectangles. Keyword patterns such as radius=r work with any class because they simply look up attributes. Dataclasses and named tuples additionally support positional patterns such as Circle(c, r), because they generate __match_args__ automatically. For regular classes, you would need to define a __match_args__ tuple yourself to enable positional matching.
Destructuring data with patterns. Python finally caught up.
Adding Guard Clauses
Sometimes the pattern alone is not enough to decide which case should match. Guard clauses add an if condition after the pattern that must also be true for the case to match. The guard can reference any variables captured by the pattern:
# guard_clauses.py
def categorize_score(score):
    match score:
        case s if s < 0 or s > 100:
            return f"Invalid score: {s}"
        case s if s >= 90:
            return f"{s} -- A grade (excellent)"
        case s if s >= 80:
            return f"{s} -- B grade (good)"
        case s if s >= 70:
            return f"{s} -- C grade (average)"
        case s if s >= 60:
            return f"{s} -- D grade (below average)"
        case s:
            return f"{s} -- F grade (failing)"

print(categorize_score(95))
print(categorize_score(82))
print(categorize_score(55))
print(categorize_score(-5))
Output:
95 -- A grade (excellent)
82 -- B grade (good)
55 -- F grade (failing)
Invalid score: -5
The pattern s by itself matches any value and binds it to the variable s. The guard if s >= 90 then filters whether this particular case should apply. Guards are evaluated in order, so the invalid score check comes first to reject bad input before the grading logic runs. This is cleaner than having the validation scattered across multiple elif branches.
Combining Patterns with OR
The OR pattern using the pipe | operator lets you match any of several patterns with the same case block. You have already seen this with literals, but it works with more complex patterns too:
# or_patterns.py
def parse_bool(value):
    match value:
        case True | "true" | "yes" | "1" | 1:
            return True
        case False | "false" | "no" | "0" | 0:
            return False
        case None | "":
            return None
        case _:
            raise ValueError(f"Cannot parse {value!r} as boolean")

print(parse_bool("yes"))
print(parse_bool(0))
print(parse_bool("false"))
print(parse_bool(None))
Output:
True
False
False
None
This pattern is extremely useful for building flexible input parsers that need to accept multiple formats for the same logical value. Configuration files, command-line arguments, and API parameters often use different representations for booleans, and a single OR pattern handles all of them in one readable line. Note that when using OR patterns with capture variables, every alternative must bind the same set of variables — Python enforces this at compile time.
Common Pitfalls to Avoid
There are a few tricky behaviors in match-case that catch even experienced Python developers. The most common mistake is accidentally creating a capture pattern when you meant to match a constant:
# pitfalls.py
HTTP_OK = 200
HTTP_NOT_FOUND = 404
status = 500

# WRONG -- this does NOT work as expected
match status:
    case HTTP_OK:  # This captures 500 into a NEW binding of HTTP_OK and always matches!
        print("Success")
    # Adding "case HTTP_NOT_FOUND:" after this would not even compile:
    # a capture pattern matches everything, so Python raises a SyntaxError
    # if any other case follows it.

# RIGHT -- use literal values or dotted names
print("---")
match status:
    case 200:
        print("Success")
    case 404:
        print("Not found")
    case other:
        print(f"Other status: {other}")
Output:
Success
---
Other status: 500
In the first match block, case HTTP_OK does not compare against the variable HTTP_OK. Instead, it creates a new variable called HTTP_OK that captures whatever the subject value is. This is because bare names in patterns are always capture patterns. To match against constants, use literal values directly, use dotted names like case http.HTTPStatus.OK, or use a guard clause like case status if status == HTTP_OK.
Real-Life Example: Building a CLI Command Parser
Let’s tie everything together with a practical project — a command-line parser that processes structured user commands using every pattern type we have covered:
# cli_parser.py
from dataclasses import dataclass

@dataclass
class Task:
    title: str
    priority: str = "medium"
    done: bool = False

def run_command(command, tasks):
    """Parse and execute a CLI command on a task list."""
    parts = command.strip().split()
    match parts:
        case ["add", *words] if words:
            title = " ".join(words)
            task = Task(title=title)
            tasks.append(task)
            return f"Added: '{title}'"
        case ["done", index] if index.isdigit():
            idx = int(index)
            if 0 <= idx < len(tasks):
                tasks[idx].done = True
                return f"Completed: '{tasks[idx].title}'"
            return f"Error: no task at index {idx}"
        case ["priority", index, ("high" | "medium" | "low") as level] if index.isdigit():
            idx = int(index)
            if 0 <= idx < len(tasks):
                tasks[idx].priority = level
                return f"Set '{tasks[idx].title}' priority to {level}"
            return f"Error: no task at index {idx}"
        case ["list"]:
            if not tasks:
                return "No tasks yet"
            lines = []
            for i, t in enumerate(tasks):
                status = "done" if t.done else "todo"
                lines.append(f" [{i}] [{status}] [{t.priority}] {t.title}")
            return "\n".join(lines)
        case ["list", "done"]:
            done_tasks = [t for t in tasks if t.done]
            if not done_tasks:
                return "No completed tasks"
            return "\n".join(f" - {t.title}" for t in done_tasks)
        case ["list", "pending"]:
            pending = [t for t in tasks if not t.done]
            if not pending:
                return "All tasks complete!"
            return "\n".join(f" - {t.title} [{t.priority}]" for t in pending)
        case ["quit" | "exit"]:
            return "QUIT"
        case []:
            return "Type a command (add, done, priority, list, quit)"
        case _:
            return f"Unknown command: {' '.join(parts)}"

# Simulate a session
tasks = []
commands = [
    "add Buy groceries",
    "add Write unit tests",
    "add Deploy to production",
    "priority 2 high",
    "done 0",
    "list",
    "list pending",
    "quit"
]
for cmd in commands:
    print(f"> {cmd}")
    result = run_command(cmd, tasks)
    print(result)
    if result == "QUIT":
        break
    print()
Output:
> add Buy groceries
Added: 'Buy groceries'
> add Write unit tests
Added: 'Write unit tests'
> add Deploy to production
Added: 'Deploy to production'
> priority 2 high
Set 'Deploy to production' priority to high
> done 0
Completed: 'Buy groceries'
> list
[0] [done] [medium] Buy groceries
[1] [todo] [medium] Write unit tests
[2] [todo] [high] Deploy to production
> list pending
- Write unit tests [medium]
- Deploy to production [high]
> quit
QUIT
This command parser demonstrates several pattern matching features working together. The ["add", *words] pattern uses a star capture for variable-length input. The ["priority", index, ("high" | "medium" | "low") as level] pattern combines sequence matching, an OR pattern for valid values, and an as binding to capture the matched value. Guard clauses validate that numeric arguments are actually digits before conversion. You could extend this by adding commands for removing tasks, searching by keyword, or sorting by priority — each new command is just another case block.
Frequently Asked Questions
What Python version do I need for match-case?
You need Python 3.10 or later. Structural pattern matching was introduced in Python 3.10 as part of PEP 634, PEP 635, and PEP 636. If you try to use match and case on Python 3.9 or earlier, you will get a SyntaxError. Note that match and case are soft keywords — they only have special meaning in the context of the match statement and can still be used as variable names elsewhere in your code.
Is match-case just a switch statement?
No, it is much more powerful. A switch statement (like in C or JavaScript) only compares a value against constants. Python’s match-case can destructure sequences and mappings, bind captured values to variables, match class instances by their attributes, use guard conditions, and combine patterns with OR. The simple literal matching does resemble a switch, but structural pattern matching handles complex data shapes that a switch statement cannot express.
Does match-case fall through like C switch?
No. Python’s match-case executes only the first matching case and then exits the match block. There is no fall-through behavior and no need for a break statement. If you want multiple patterns to execute the same code, combine them with the OR operator | in a single case, such as case "yes" | "y" | "true". This design prevents the common bug in C where a missing break causes unintended fall-through.
Can I use match-case with regular expressions?
Not directly in the pattern itself, but you can use guard clauses with re.match() or re.search(). For example: case str(s) if re.match(r"^\d{3}-\d{4}$", s) matches strings that look like phone numbers. The pattern ensures the value is a string, and the guard applies the regex check. This keeps the pattern readable while letting you use the full power of regular expressions when needed.
How does match-case compare to if-elif for performance?
For simple literal matching, match-case and if-elif chains have similar performance. The CPython implementation does not currently optimize match statements into jump tables or hash lookups. Choose match-case for readability and maintainability, not for speed. The real performance benefit is developer time — pattern matching makes complex branching logic easier to read, debug, and extend, which reduces the time you spend maintaining the code.
Conclusion
You now have a solid understanding of Python’s structural pattern matching — from simple literal matching to destructuring sequences and dictionaries, matching class instances with nested patterns, filtering with guard clauses, and combining alternatives with OR patterns. The key concepts we covered are match and case syntax, the wildcard _ pattern, capture variables, star patterns for variable-length sequences, mapping patterns for dictionaries, class patterns with dataclasses, guard clauses with if, and OR patterns with |.
Try extending the CLI command parser by adding a search command that filters tasks by keyword, or a sort command that reorders tasks by priority. You could also add persistence by saving tasks to a JSON file between sessions. For the complete language specification and advanced features such as AS patterns and positional class matching, check out the official Python documentation on match statements.
Data scientists and analysts are often said to spend as much as 80% of their time cleaning and preparing data before they can begin any meaningful analysis. This often unglamorous work is critical because the quality of your insights depends directly on the quality of your data. Whether you’re working with CSV files from legacy systems, databases with inconsistent formatting, or API responses with missing fields, you’ll inevitably encounter messy data.
Pandas, Python’s most popular data manipulation library, provides powerful tools to handle virtually any data cleaning scenario. With functions designed specifically for managing missing values, fixing data types, removing duplicates, and standardizing formats, you can transform chaotic datasets into analysis-ready dataframes in a fraction of the time it would take with manual approaches.
In this comprehensive guide, we’ll explore practical techniques for cleaning messy data using Pandas. You’ll learn how to identify data quality issues, apply targeted fixes, and build reusable cleaning pipelines that you can apply across different projects. By the end, you’ll have a solid toolkit for tackling real-world data challenges.
Quick Start: Clean Data in 10 Lines
Let’s start with a quick example that demonstrates the power of Pandas for data cleaning. Here’s a complete workflow that loads messy data, applies multiple cleaning operations, and produces a ready-to-analyze dataframe:
Data cleaning is rarely a single operation. Instead, you apply multiple fixes in sequence, each addressing a specific problem. In this example, you’ll see how to handle missing values, fix data types, standardize text formatting, and parse dates — often all in the same pipeline. Understanding how these pieces fit together is crucial because the order matters: you typically clean text before deduplicating, convert data types before filtering, and validate results before using data for analysis.
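To see why the order matters, consider this minimal sketch with made-up email addresses. Deduplicating before the text is normalized misses a duplicate that differs only in case and whitespace:

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["Ann@Example.com ", "ann@example.com", "bob@example.com"]
})

# Deduplicating first: the two "ann" rows still look different
rows_before_cleaning = len(df.drop_duplicates())

# Clean the text first, then deduplicate
df["email"] = df["email"].str.strip().str.lower()
deduped = df.drop_duplicates()

print(rows_before_cleaning)
print(len(deduped))
print(deduped)
```

The same reasoning applies to type conversion: filters and comparisons only behave as expected once the column holds real numbers or dates rather than strings.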
This section explores the techniques you need to handle this specific data quality issue. We’ll examine practical examples that show both the problem and multiple solution approaches.
This output shows the result of applying multiple cleaning operations: missing customer IDs were filled with the mean value, currency symbols were stripped and amounts converted to float, emails were standardized to lowercase, and dates were parsed into datetime format. Row 2 was dropped because its date couldn’t be parsed — sometimes removing completely broken records is preferable to forcing imperfect repairs. Each column now has the correct type and consistent formatting, making it ready for analysis.
This simple example demonstrates key Pandas functions that we’ll explore in depth throughout this tutorial. Notice how we handled missing values, converted currency to numeric format, standardized email addresses, and parsed dates — all core data cleaning tasks.
What Makes Data “Messy”?
Before diving into solutions, let’s identify the common data quality issues you’ll encounter. Understanding these problems helps you recognize them quickly and apply the right cleaning techniques.
Real-world data is messy because it comes from multiple sources, is entered manually, spans different time periods, and isn’t designed specifically for your analysis. Systems change, people make typos, integrations break, and formats evolve. Rather than being discouraged by messiness, professional data workers expect it and have systematic approaches to handle it. The patterns below appear repeatedly in virtually every dataset, so mastering them will serve you across your entire career.
Problem | Example | Solution
Missing values | NaN, None, ‘N/A’, blank cells | fillna(), dropna(), interpolate()
Inconsistent data types | Numbers stored as strings, mixed date formats | astype(), pd.to_numeric(), pd.to_datetime()
Duplicate records | Same customer appearing twice with slight variations | drop_duplicates(), duplicated()
Inconsistent formatting | ‘John’, ‘JOHN’, ‘john’ in same column | str.lower(), str.upper(), str.strip()
Special characters and symbols | Currency signs, extra spaces, special characters | str.replace(), str.extract(), regex patterns
Outliers and impossible values | Age of 999, negative prices | Filtering, quantile-based detection
Mixed data types in single column | Column contains both integers and text | errors='coerce', regex extraction
This table summarizes the most common data quality problems and the Pandas tools that address them. Notice that each problem type has specific solution methods — you wouldn’t use the same approach for missing values as you would for duplicates or formatting issues. Understanding which problem you’re solving guides you toward the right function. Throughout this guide, we’ll explore each of these patterns in detail with practical examples showing both the problem and multiple solution approaches.
Handling Missing Values
Missing data is the most common data quality issue you’ll encounter. It manifests in different ways: NaN values in numeric columns, None objects in Python, placeholder strings like ‘N/A’, or simply empty cells. Missing data creates a fundamental problem: should you remove incomplete records or estimate their missing values? This choice isn’t purely technical — it depends on why data is missing, how much is missing, and what your analysis requires.
Pandas represents missing values as NaN (Not a Number) or None, and provides several strategies for handling them.
Detecting Missing Data
First, you need to identify where missing values exist in your dataframe:
Original shape: (5, 4)
user_id username email active
0 1 alice alice@example.com True
1 2 None bob@example.com True
2 3 charlie charlie@example.com False
3 4 diana None True
4 5 eve eve@example.com True
After dropna():
New shape: (3, 4)
user_id username email active
0 1 alice alice@example.com True
2 3 charlie charlie@example.com False
4 5 eve eve@example.com True
Drop rows missing in specific columns:
user_id username email active
0 1 alice alice@example.com True
2 3 charlie charlie@example.com False
4 5 eve eve@example.com True
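The listing that produced the output above is not shown in this section. The following is a reconstruction rather than the original code: it assumes the five-row DataFrame visible in the output and uses isna(), dropna(), and dropna(subset=...):

```python
import pandas as pd

# DataFrame rebuilt from the rows shown in the output above
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "username": ["alice", None, "charlie", "diana", "eve"],
    "email": ["alice@example.com", "bob@example.com",
              "charlie@example.com", None, "eve@example.com"],
    "active": [True, True, False, True, True],
})

# Detect: count missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

print("Original shape:", df.shape)
print(df)

# Drop every row that has at least one missing value
cleaned = df.dropna()
print("After dropna():")
print("New shape:", cleaned.shape)
print(cleaned)

# Drop rows only when specific columns are missing
print("Drop rows missing in specific columns:")
subset_cleaned = df.dropna(subset=["username", "email"])
print(subset_cleaned)
```

In this small dataset both calls remove the same two rows. On wider tables, the subset argument lets you keep rows whose gaps are in columns you do not need.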
Filling Missing Values
When you can’t afford to lose data, filling missing values is a better strategy. Pandas provides several filling methods:
# fill_missing.py
import pandas as pd
import numpy as np
df = pd.DataFrame({
'day': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
'temperature': [72.5, 75.0, None, 78.5, None],
'humidity': [65, None, 70, None, 68]
})
print("Original:")
print(df)
print("\nFill with constant value:")
print(df.fillna(0))
print("\nForward fill (propagate last known value):")
# fillna(method=...) is deprecated since pandas 2.1; use ffill() / bfill() instead
print(df.ffill())
print("\nBackward fill (propagate next known value):")
print(df.bfill())
print("\nFill with column mean:")
df['temperature'] = df['temperature'].fillna(df['temperature'].mean())
print(df)
print("\nFill with interpolation (linear):")
df2 = pd.DataFrame({
'hour': [0, 1, 2, 3, 4],
'traffic': [100, None, None, 150, 160]
})
df2['traffic'] = df2['traffic'].interpolate(method='linear')
print(df2)
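Output:
Original:
day temperature humidity
0 Monday 72.5 65.0
1 Tuesday 75.0 NaN
2 Wednesday NaN 70.0
3 Thursday 78.5 NaN
4 Friday NaN 68.0
Fill with constant value:
day temperature humidity
0 Monday 72.5 65.0
1 Tuesday 75.0 0.0
2 Wednesday 0.0 70.0
3 Thursday 78.5 0.0
4 Friday 0.0 68.0
Forward fill (propagate last known value):
day temperature humidity
0 Monday 72.5 65.0
1 Tuesday 75.0 65.0
2 Wednesday 75.0 70.0
3 Thursday 78.5 70.0
4 Friday 78.5 68.0
Backward fill (propagate next known value):
day temperature humidity
0 Monday 72.5 65.0
1 Tuesday 75.0 70.0
2 Wednesday 78.5 70.0
3 Thursday 78.5 68.0
4 Friday NaN 68.0
Fill with column mean:
day temperature humidity
0 Monday 72.500000 65.0
1 Tuesday 75.000000 NaN
2 Wednesday 75.333333 70.0
3 Thursday 78.500000 NaN
4 Friday 75.333333 68.0
Fill with interpolation (linear):
hour traffic
0 0 100.000000
1 1 116.666667
2 2 133.333333
3 3 150.000000
4 4 160.000000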
Messy data is just clean data that hasn’t met pandas yet.
Fixing Data Types
Data type errors cause many silent bugs in analysis. A column containing prices might be stored as strings instead of floats, causing calculations to fail. Pandas provides tools to convert and validate data types.
Converting Strings to Numbers
Numbers stored as text are among the most frequent data type problems. You’ll encounter “$100.50” in a price column, “5,000” in a quantity column, or even “N/A” mixed with actual numbers. The `astype()` method works for clean numeric strings, but `pd.to_numeric(..., errors='coerce')` is more forgiving: it converts what it can and turns non-numeric values into NaN. Note that currency symbols and thousands separators must be stripped first, otherwise those values are coerced to NaN as well. This defensive approach prevents silent failures and lets you handle problematic values explicitly after conversion.
The pd.to_numeric() function is your best friend for handling numeric data stored as strings:
Original dtypes:
product object
price object
quantity object
dtype: object
Price as string:
0 $25.99
1 $40.50
2 FREE
3 $15.75
Name: price, dtype: object
Convert price to numeric (coerce errors):
0 25.99
1 40.50
2 NaN
3 15.75
Name: price, dtype: float64
Convert quantity (coerce invalid values):
0 100.0
1 250.0
2 50.0
3 NaN
Name: quantity, dtype: float64
Final dataframe:
product price quantity
0 Widget A 25.99 100.0
1 Widget B 40.50 250.0
2 Widget C NaN 50.0
3 Widget D 15.75 NaN
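A script roughly like the one below would produce output of this shape. It is a sketch reconstructed from the printed values, not the original listing; in particular, the bad quantity value `'unknown'` is a hypothetical stand-in, since the output only shows that the fourth quantity failed to convert.

```python
# convert_numeric_sketch.py
import pandas as pd

df = pd.DataFrame({
    'product': ['Widget A', 'Widget B', 'Widget C', 'Widget D'],
    'price': ['$25.99', '$40.50', 'FREE', '$15.75'],
    'quantity': ['100', '250', '50', 'unknown'],  # 'unknown' is a hypothetical bad value
})
print("Original dtypes:")
print(df.dtypes)

# Strip the currency symbol, then coerce anything non-numeric ('FREE') to NaN
df['price'] = pd.to_numeric(
    df['price'].str.replace('$', '', regex=False), errors='coerce'
)

# Coerce invalid quantities to NaN instead of raising an error
df['quantity'] = pd.to_numeric(df['quantity'], errors='coerce')

print("\nFinal dataframe:")
print(df)
```

Because a single NaN forces the column to a float dtype, `quantity` comes back as 100.0 rather than 100. If you need integers after dealing with the gaps, convert explicitly once the NaN values are filled or dropped.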
Parsing Dates
Date parsing is particularly tricky because dates can be represented in dozens of formats: “2024-01-15”, “01/15/2024”, “15-Jan-2024”, “Jan 15, 2024”, and more. Pandas’ `pd.to_datetime()` function can handle this complexity. The `format` parameter lets you specify an exact format if all dates match. The `errors='coerce'` parameter converts unparseable dates to NaT (Not a Time), similar to how `pd.to_numeric()` handles non-numeric values. When formats are mixed, pandas 2.0 and later accept `format='mixed'`, which infers the format of each value individually. The older `infer_datetime_format` parameter is deprecated as of pandas 2.0 and should not be used in new code.
Date parsing is critical for time-series analysis. Real-world data often contains dates in multiple formats:
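As a minimal sketch of the idea, the helper below tries a list of known formats in turn and leaves anything that matches none of them as NaT. The function name `parse_mixed_dates` and the format list are assumptions made for illustration; adjust them to the formats that actually occur in your data.

```python
# parse_dates_sketch.py
import pandas as pd

raw = pd.Series(['2024-01-15', '01/15/2024', '15-Jan-2024',
                 'Jan 15, 2024', 'not a date'])

# Formats this sketch knows about, tried in order (an assumption; extend as needed)
KNOWN_FORMATS = ('%Y-%m-%d', '%m/%d/%Y', '%d-%b-%Y', '%b %d, %Y')

def parse_mixed_dates(values, formats=KNOWN_FORMATS):
    """Try each format in turn; values that match none of them become NaT."""
    parsed = pd.to_datetime(values, format=formats[0], errors='coerce')
    for fmt in formats[1:]:
        # Only positions that are still NaT are filled by later formats
        parsed = parsed.fillna(pd.to_datetime(values, format=fmt, errors='coerce'))
    return parsed

dates = parse_mixed_dates(raw)
print(dates)
```

Listing the formats explicitly is slower to write than letting pandas guess, but it avoids ambiguous cases such as “01/02/2024” silently being read as the wrong month. On pandas 2.0 or later, `pd.to_datetime(raw, format='mixed', errors='coerce')` is a shorter alternative when that ambiguity is not a concern.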
Sometimes you need explicit control over type conversion beyond what `astype()` provides. This happens when conversion logic is complex or context-dependent. Creating a custom function encapsulates this logic and lets you reuse it across columns and projects. Custom functions can handle multiple input formats, document your business rules, and gracefully handle edge cases by returning NaN for unparseable values rather than raising errors.
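Here is one way such a converter might look. The function name `parse_amount` and the rules it applies (strip “$” and commas, treat everything else as unparseable) are illustrative assumptions, not a fixed recipe.

```python
# custom_converter_sketch.py
import numpy as np
import pandas as pd

def parse_amount(value):
    """Convert values such as '$1,250.50', '5,000' or 42 to float; NaN if unparseable."""
    if value is None or (isinstance(value, float) and np.isnan(value)):
        return np.nan
    # Business rule (assumed): currency symbols and thousands separators are noise
    text = str(value).strip().replace('$', '').replace(',', '')
    try:
        return float(text)
    except ValueError:
        # Placeholders like 'N/A' end up here
        return np.nan

amounts = pd.Series(['$1,250.50', '5,000', 'N/A', None, 42])
converted = amounts.apply(parse_amount)
print(converted)
```

Keeping the rules in one named function means they can be unit tested and reused on any column with `Series.apply()`, at the cost of being slower than vectorized string methods on very large columns.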
Removing Duplicates
Duplicate records occur frequently in real datasets due to system failures, multiple registrations, or import errors, and they inflate row counts and skew analysis results. The challenge is deciding what “identical” means: are two records identical if they have the same email but different phone numbers? Pandas provides efficient methods to identify exact duplicates and remove them strategically. Before removing duplicates, always standardize your data first, so that “John Smith” and “john smith” are recognized as the same record.
Original dataframe:
customer_id name email purchase_count
0 1 Alice alice@example.com 5
1 2 Bob bob@example.com 3
2 2 Bob bob@example.com 3
3 3 Charlie charlie@example.com 8
4 4 Diana diana@example.com 2
5 4 Diana diana@example.com 2
6 4 Diana diana@example.com 2
Shape: (7, 4)
Detect duplicates (all columns):
0 False
1 False
2 True
3 False
4 False
5 True
6 True
dtype: bool
Detect duplicates (specific columns):
0 False
1 False
2 True
3 False
4 False
5 True
6 True
dtype: bool
Remove exact duplicates:
customer_id name email purchase_count
0 1 Alice alice@example.com 5
1 2 Bob bob@example.com 3
3 3 Charlie charlie@example.com 8
4 4 Diana diana@example.com 2
Remove duplicates keeping first:
customer_id name email purchase_count
0 1 Alice alice@example.com 5
1 2 Bob bob@example.com 3
3 3 Charlie charlie@example.com 8
4 4 Diana diana@example.com 2
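The output above corresponds to calls like the ones in this sketch. The data is rebuilt from the printed table; the choice of `subset=['email']` for the “specific columns” case is an assumption, since the original listing is not shown.

```python
# remove_duplicates_sketch.py
import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 2, 2, 3, 4, 4, 4],
    'name': ['Alice', 'Bob', 'Bob', 'Charlie', 'Diana', 'Diana', 'Diana'],
    'email': ['alice@example.com', 'bob@example.com', 'bob@example.com',
              'charlie@example.com', 'diana@example.com', 'diana@example.com',
              'diana@example.com'],
    'purchase_count': [5, 3, 3, 8, 2, 2, 2],
})

# True for every row that repeats an earlier row in all columns
dupes_all = df.duplicated()
print(dupes_all)

# True for every row that repeats an earlier email (assumed subset)
dupes_email = df.duplicated(subset=['email'])
print(dupes_email)

# Remove exact duplicates; the first occurrence is kept by default
deduped = df.drop_duplicates()
print(deduped)

# Remove duplicates by key column, keeping the first occurrence explicitly
deduped_email = df.drop_duplicates(subset=['email'], keep='first')
print(deduped_email)
```

Note that the surviving rows keep their original index labels (0, 1, 3, 4). Call `reset_index(drop=True)` afterwards if you want a contiguous index.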
Missing values don’t hide from .isnull() — they just pretend to be NaN.
Standardizing String Data
Text data is especially prone to inconsistencies. Email addresses might have different cases or extra whitespace. Product names might be spelled with or without special characters. These variations are easy to overlook but cause real problems. String standardization is one of the highest-ROI cleaning activities because small inconsistencies have outsized impacts. When you deduplicate by email and one entry has “John@GMAIL.COM” while the duplicate has “john@gmail.com”, you’ll incorrectly identify them as different. Pandas’ string methods make bulk standardization efficient, operating on entire columns at once.
String columns often contain inconsistent formatting that breaks analysis. Pandas string methods make it easy to standardize text:
Case Normalization
Case normalization is the simplest and most important string cleaning step. Converting everything to lowercase ensures “John@GMAIL.COM” and “john@gmail.com” are recognized as identical. The `str.lower()` method works on entire columns at once, much faster than looping through individual values. Similarly, `str.upper()` converts to uppercase, and `str.title()` converts to title case. Choose lowercase for emails and usernames; use title case for names and proper nouns.
Original:
city country product_code
0 new york USA ABC123
1 NEW YORK usa abc123
2 New York Usa Abc123
3 NEW york USA ABC123
4 los angeles USA ABC123
5 LOS ANGELES usa ABC123
All lowercase:
city country product_code
0 new york usa abc123
1 new york usa abc123
2 new york usa abc123
3 new york usa abc123
4 los angeles usa abc123
5 los angeles usa abc123
Title case:
city country product_code
0 New York usa abc123
1 New York usa abc123
2 New York usa abc123
3 New York usa abc123
4 Los Angeles usa abc123
5 Los Angeles usa abc123
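A sketch that matches the output above is shown below. The data is rebuilt from the printed table; applying title case only to the `city` column after lowercasing everything is an assumption inferred from the “Title case” result.

```python
# case_normalization_sketch.py
import pandas as pd

df = pd.DataFrame({
    'city': ['new york', 'NEW YORK', 'New York', 'NEW york',
             'los angeles', 'LOS ANGELES'],
    'country': ['USA', 'usa', 'Usa', 'USA', 'USA', 'usa'],
    'product_code': ['ABC123', 'abc123', 'Abc123', 'ABC123', 'ABC123', 'ABC123'],
})

# Lowercase every (string) column in one pass
lowered = df.apply(lambda col: col.str.lower())
print(lowered)

# Title-case only the city column for display purposes
titled = lowered.assign(city=lowered['city'].str.title())
print(titled)

# After normalization, six spellings collapse into two distinct cities
print(titled['city'].nunique())
```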
Whitespace Cleaning
Accidental whitespace — spaces at the beginning or end of a value — is invisible but causes problems. “john ” and “john” are different strings in Python, so they won’t match even though they represent the same value. The `str.strip()` method removes leading and trailing whitespace, `str.lstrip()` removes only leading whitespace, and `str.rstrip()` removes only trailing whitespace. Always apply these methods early in your cleaning pipeline before any comparison or deduplication operations.
Extra spaces are a common data quality issue:
# whitespace_cleaning.py
import pandas as pd
df = pd.DataFrame({
'email': [' alice@example.com ', 'bob@example.com ', ' charlie@example.com'],
'category': ['Books ', ' Electronics', ' Home & Garden ']
})
print("Original:")
print(df)
print("\nEmail column repr (to see spaces):")
print(df['email'].apply(repr))
print("\nStrip leading and trailing spaces:")
df['email'] = df['email'].str.strip()
df['category'] = df['category'].str.strip()
print(df)
print("\nEmail after strip:")
print(df['email'].apply(repr))
print("\nRemove extra internal spaces:")
df['category'] = df['category'].str.replace(r'\s+', ' ', regex=True)
print(df)
Duplicates — they look the same, act the same, but only one gets to stay.
Detecting and Handling Outliers
Outliers are extreme values that don’t fit the normal pattern of your data. They might represent errors (a customer age of 999 years), fraud (an unusually large transaction), or legitimate but rare events (a customer who spends far more than typical). The key difference between outliers and errors is that outliers might be correct — just unusual. Your goal isn’t necessarily to remove them, but to detect them, investigate them, and make informed decisions about whether they should be included or handled separately in your analysis.
Outliers can skew analysis and produce misleading insights. While not always errors, they deserve investigation:
Statistical Outlier Detection
The interquartile range (IQR) method defines outliers based on your data’s natural spread. The IQR is the range between the 25th percentile (Q1) and 75th percentile (Q3). Values outside the typical range (usually Q1 - 1.5*IQR to Q3 + 1.5*IQR) are flagged as outliers. This method is robust because it’s less sensitive to extreme values than using mean and standard deviation. The z-score method measures how many standard deviations a value is from the mean; values with |z-score| > 2 or 3 are typically considered outliers. Choose IQR for skewed data; choose z-scores for normally distributed data.
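Both rules can be sketched in a few lines. The sample values below are made up for illustration, and the z-score threshold of 2 is one of the two common choices mentioned above.

```python
# outlier_detection_sketch.py
import pandas as pd

# Hypothetical measurements with one obviously extreme value
values = pd.Series([52, 48, 50, 51, 49, 47, 53, 50, 250])

# IQR rule: flag anything outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]
print(f"IQR bounds: {lower} to {upper}")
print(iqr_outliers)

# Z-score rule: flag values more than 2 standard deviations from the mean
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > 2]
print(z_outliers)
```

In this tiny sample the extreme value inflates the mean and standard deviation so much that its own z-score is only about 2.67. A threshold of 3 would miss it, which illustrates why the IQR rule is usually the safer default on small or skewed samples.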
Beyond statistical methods, you can validate data based on domain knowledge — age should be between 0 and 150, GPA between 0 and 4.0, attendance percentage between 0 and 100. These range-based checks use simple logical comparisons rather than statistics. This approach is more interpretable to business stakeholders because you’re using domain-specific rules rather than statistical formulas. You can use these checks to identify invalid records for investigation or to mark invalid values as NaN for later handling.
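A minimal sketch of such range checks follows. The column names, the sample values, and the `RULES` dictionary are hypothetical; the ranges themselves are the ones named in the paragraph above.

```python
# range_validation_sketch.py
import pandas as pd

# Hypothetical student records containing a few impossible values
students = pd.DataFrame({
    'age': [19, 22, 999, 20, -3],
    'gpa': [3.5, 4.2, 3.1, 2.8, 3.9],
    'attendance_pct': [95, 88, 102, 76, 91],
})

# Domain rules: column -> (minimum, maximum), both inclusive
RULES = {'age': (0, 150), 'gpa': (0.0, 4.0), 'attendance_pct': (0, 100)}

# True wherever a value violates its rule
invalid_mask = pd.DataFrame({
    col: ~students[col].between(low, high) for col, (low, high) in RULES.items()
})

# Option 1: pull out the offending rows for investigation
invalid_rows = students[invalid_mask.any(axis=1)]
print(invalid_rows)

# Option 2: blank out only the invalid cells and handle them as missing data
cleaned = students.mask(invalid_mask)
print(cleaned)
```

Keeping the rules in a dictionary makes them easy to review with stakeholders and to extend without touching the checking logic.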
String cleaning — strip, lower, replace, and suddenly your data makes sense.
Chaining Operations for Clean Pipelines
Rather than applying operations sequentially and creating intermediate dataframes at each step, you can chain multiple operations together for more concise, readable, and maintainable code. This is especially useful when building reusable cleaning functions:
Method Chaining
Method chaining uses Pandas’ `assign()` method and lambda functions to build a pipeline where each step returns a dataframe that feeds into the next. This approach has several benefits: it’s more readable as a complete transformation story, it doesn’t create temporary variables, and it clearly shows the data transformation sequence. The key is that each operation in the chain must return a dataframe, allowing the next operation to work on the result.
Original:
order_id customer_name email amount date
0 1 John Doe john@EXAMPLE.COM $150.50 2024-01-15
1 2 jane smith jane@example.com $200.00 2024/02/20
2 3 Bob JONES None N/A 2024-03-10
3 4 alice w alice@example.com $75.25 None
4 5 Charlie Brown charlie@EXAMPLE.COM $120.99 2024-01-18
5 6 DIANA PRINCE diana@example.com $300.00 invalid
Cleaned:
order_id customer_name email amount date
0 1 John Doe john@example.com 150.50 2024-01-15
1 2 Jane Smith jane@example.com 200.00 2024-02-20
2 4 Alice W alice@example.com 75.25 NaT
3 5 Charlie Brown charlie@example.com 120.99 2024-01-18
Dtypes:
order_id int64
customer_name object
email object
amount float64
date datetime64[ns]
dtype: object
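The chaining pattern itself can be sketched on a smaller table. This is not the listing that produced the output above; the three-row dataset and the choice to drop rows without an email or amount are assumptions made to keep the example short.

```python
# method_chaining_sketch.py
import pandas as pd

raw = pd.DataFrame({
    'customer_name': ['  John Doe', 'jane smith', 'Bob JONES'],
    'email': ['john@EXAMPLE.COM', 'jane@example.com', None],
    'amount': ['$150.50', '$200.00', 'N/A'],
})

cleaned = (
    raw
    # Each lambda receives the dataframe as it exists at that step
    .assign(
        customer_name=lambda d: d['customer_name'].str.strip().str.title(),
        email=lambda d: d['email'].str.lower(),
        amount=lambda d: pd.to_numeric(
            d['amount'].str.replace('$', '', regex=False), errors='coerce'
        ),
    )
    # Assumed rule: a usable order needs both an email and an amount
    .dropna(subset=['email', 'amount'])
    .reset_index(drop=True)
)
print(cleaned)
print(cleaned.dtypes)
```

Because every step returns a new dataframe, `raw` is left untouched, which makes it easy to compare before and after or to rerun the pipeline with different rules.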
Creating Reusable Cleaning Functions
For production data cleaning, moving beyond one-off scripts to reusable functions is essential. A well-designed cleaning function encapsulates your data transformation logic, making it testable, maintainable, and shareable across projects. The function should document its assumptions, handle edge cases gracefully, and return consistent output. By wrapping your Pandas operations in functions with clear parameters and docstrings, you create a toolkit your team can apply consistently across different datasets.
Original:
name email phone signup_date
0 john smith JOHN@EXAMPLE.COM (555) 123-4567 2024-01-15
1 JANE DOE jane@example.com 555-123-4567 2024/02/20
2 bob jones bob@example.com 5551234567 invalid
Cleaned:
name email phone signup_date
0 John Smith john@example.com (555) 123-4567 2024-01-15
1 Jane Doe jane@example.com (555) 123-4567 2024-02-20
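A reusable function that yields this kind of result might look like the sketch below. The names `format_phone` and `clean_contacts`, and the rule of dropping rows whose signup date cannot be parsed, are assumptions inferred from the output above rather than the original implementation.

```python
# reusable_cleaning_sketch.py
import re
import pandas as pd

def format_phone(value):
    """Return a 10-digit US number as '(XXX) XXX-XXXX', or None if it can't be parsed."""
    if pd.isna(value):
        return None
    digits = re.sub(r'\D', '', str(value))
    if len(digits) != 10:
        return None
    return f'({digits[:3]}) {digits[3:6]}-{digits[6:]}'

def clean_contacts(df):
    """Standardize name, email, phone and signup_date; drop rows with unparseable dates."""
    out = df.copy()
    out['name'] = out['name'].str.strip().str.title()
    out['email'] = out['email'].str.strip().str.lower()
    out['phone'] = out['phone'].apply(format_phone)
    # Normalize separators so 2024-01-15 and 2024/02/20 share one format
    out['signup_date'] = pd.to_datetime(
        out['signup_date'].str.replace('/', '-', regex=False),
        format='%Y-%m-%d', errors='coerce',
    )
    return out.dropna(subset=['signup_date']).reset_index(drop=True)

raw = pd.DataFrame({
    'name': ['john smith', 'JANE DOE', 'bob jones'],
    'email': ['JOHN@EXAMPLE.COM', 'jane@example.com', 'bob@example.com'],
    'phone': ['(555) 123-4567', '555-123-4567', '5551234567'],
    'signup_date': ['2024-01-15', '2024/02/20', 'invalid'],
})
cleaned = clean_contacts(raw)
print(cleaned)
```

The function works on a copy, so callers keep their original dataframe and can call `clean_contacts()` repeatedly without side effects.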
Chain it all together — one pipeline from raw mess to clean insight.
Real-Life Example: Cleaning a Customer Database
Let’s apply everything we’ve learned to a realistic scenario. You’ve inherited a messy customer database with inconsistent formats, missing values, and duplicates:
=== ORIGINAL MESSY DATA ===
customer_id first_name last_name email phone signup_date lifetime_value
0 1 John Smith john@GMAIL.COM (555) 123-4567 2024-01-15 $5,250.50
1 2 jane DOE jane@yahoo.com 555-123-4567 2024/02/20 $12,100.00
2 NaN BOB jones None 5551234567 2024-03-10 $0
3 4 alice Williams alice@test.COM None 2024-04-05 None
4 5 Charlie BROWN charlie@example.com (555) 987-6543 2023-12-01 $999.99
5 5 Charlie BROWN charlie@example.com (555) 987-6543 2023-12-01 $999.99
6 7 diana PRINCE DIANA@EXAMPLE.COM 5558881234 2024-05-12 2500
7 8 EVE johnson eve@test.com invalid N/A $1,850.25
Shape: (8, 7)
=== FINAL CLEANED DATA ===
customer_id first_name last_name email phone signup_date lifetime_value
0 1 John Smith john@gmail.com (555) 123-4567 2024-01-15 5250.50
1 2 Jane Doe jane@yahoo.com (555) 123-4567 2024-02-20 12100.00
2 4 Alice Williams alice@test.com NaN 2024-04-05 NaN
3 5 Charlie Brown charlie@example.com (555) 987-6543 2023-12-01 999.99
4 7 Diana Prince diana@example.com (555) 888-1234 2024-05-12 2500.00
Final records: 5
Records removed: 3
Frequently Asked Questions
When should I use dropna() versus fillna()?
Use dropna() when missing data is sparse (less than 5% of your data) and losing those rows won’t bias your analysis. Use fillna() when you want to preserve all observations. For numeric columns, filling with the mean or median is common. For categorical data, consider the domain context — sometimes a separate “Unknown” category is appropriate.
How do I handle mixed data types in a single column?
Use pd.to_numeric(..., errors='coerce') to convert numeric strings while turning non-numeric values into NaN. For mixed date formats, use pd.to_datetime(..., format='mixed', errors='coerce'). Then decide whether to drop NaN values, fill them, or investigate why the conversion failed.
What’s the best way to handle duplicate records?
First, understand why duplicates exist. Are they exact duplicates or near-duplicates? For exact duplicates, drop_duplicates() is straightforward. For near-duplicates (like “John Smith” vs “john smith”), standardize the data first (lowercase, strip whitespace, remove special characters) before checking for duplicates. For critical data, keep both versions and add a flag indicating duplicates for manual review.
How do I validate data after cleaning?
Create a validation function that checks: (1) expected number of rows, (2) no unexpected missing values, (3) data types are correct, (4) numeric values are within expected ranges, (5) dates are in the correct range. Run these checks automatically as part of your cleaning pipeline to catch issues early.
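The five checks listed above can be sketched as a single function that returns a list of problems. The function name `validate_cleaned`, its parameters, and the sample `orders` table are hypothetical; adapt the rules to your own data.

```python
# validate_cleaned_sketch.py
import pandas as pd

def validate_cleaned(df, min_rows, required, numeric_ranges, date_ranges):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    # (1) Expected number of rows
    if len(df) < min_rows:
        problems.append(f'expected at least {min_rows} rows, got {len(df)}')
    # (2) No unexpected missing values
    for col in required:
        n_missing = int(df[col].isna().sum())
        if n_missing:
            problems.append(f'{col}: {n_missing} missing value(s)')
    # (3) + (4) Numeric dtype and value range
    for col, (low, high) in numeric_ranges.items():
        if not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f'{col}: expected a numeric dtype, got {df[col].dtype}')
        elif not df[col].dropna().between(low, high).all():
            problems.append(f'{col}: values outside [{low}, {high}]')
    # (3) + (5) Datetime dtype and date range
    for col, (start, end) in date_ranges.items():
        if not pd.api.types.is_datetime64_any_dtype(df[col]):
            problems.append(f'{col}: expected a datetime dtype, got {df[col].dtype}')
        elif not df[col].dropna().between(pd.Timestamp(start), pd.Timestamp(end)).all():
            problems.append(f'{col}: dates outside {start} to {end}')
    return problems

orders = pd.DataFrame({
    'email': ['john@example.com', None, 'jane@example.com'],
    'amount': [150.50, 200.00, -5.00],
    'date': pd.to_datetime(['2024-01-15', '2024-02-20', '2031-01-01']),
})

problems = validate_cleaned(
    orders, min_rows=3, required=['email'],
    numeric_ranges={'amount': (0, 10_000)},
    date_ranges={'date': ('2020-01-01', '2025-12-31')},
)
for problem in problems:
    print(problem)
```

Returning a list instead of raising on the first failure lets you report every issue in one run. In a pipeline you could raise an exception when the list is not empty.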
Can I create a reusable cleaning template for my team?
Absolutely! Wrap your cleaning logic in a function with clear parameters and documentation. Use type hints and docstrings. Consider creating a custom class that inherits from pandas DataFrame if your organization has consistent data formats. Share this via version control so your team can apply consistent cleaning across projects.
How do I handle special characters and encoding issues?
For most cases, string operations like str.replace() work fine. For complex pattern matching, use regex with the regex=True parameter. For encoding issues (wrong character display), pass the encoding explicitly when reading the file, for example pd.read_csv('data.csv', encoding='utf-8'); a DataFrame has no encoding attribute that you can set afterwards. If you encounter persistent encoding problems, the chardet library can help detect the likely encoding.
Conclusion
Data cleaning is a critical skill for any data professional. With Pandas, you have powerful tools to handle virtually any data quality issue efficiently. The techniques we’ve covered — handling missing values, fixing data types, removing duplicates, standardizing text, and detecting outliers — form the foundation of professional data cleaning.
Remember these key principles: (1) Always inspect your data first to understand the specific problems you’re solving, (2) Build reusable cleaning functions rather than one-off scripts, (3) Validate your cleaned data to ensure you haven’t introduced new problems, (4) Document your cleaning process so others can understand your decisions, and (5) View data cleaning as an investment that pays dividends throughout your analysis.
Start with small datasets to refine your cleaning pipeline, then scale to production data. As you encounter new edge cases, update your functions to handle them. Over time, you’ll develop an intuition for common patterns and can quickly assess data quality and plan your cleaning strategy.
Parquet has become one of the most popular columnar data formats in modern data engineering, and for good reason. If you’re working with large datasets, data pipelines, or cloud-based analytics platforms like Apache Spark, Amazon Redshift, or Google BigQuery, you’ll almost certainly encounter Parquet files. Unlike row-based formats like CSV, Parquet stores data in columns, enabling efficient compression, faster queries, and reduced storage costs.
In this tutorial, you’ll learn how to read and write Parquet files in Python using PyArrow and Pandas. We’ll cover everything from basic file I/O operations to advanced topics like schema inspection, compression options, and partitioned datasets. Whether you’re migrating from CSV to Parquet or building a data pipeline that processes terabytes of columnar data, this guide will equip you with practical, production-ready techniques.
By the end of this article, you’ll understand why Parquet is the format of choice for data-intensive applications, how to optimize your file writes with compression, and how to leverage partitioning for better query performance. Let’s dive in!
Quick Example: Write and Read a Parquet File in 6 Lines
Before we explore the details, here’s the fastest way to get started with Parquet files in Python:
# quick_parquet_example.py
import pandas as pd
# Create and write
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [95, 87]})
df.to_parquet('data.parquet')
# Read back
df_read = pd.read_parquet('data.parquet')
print(df_read)
Output:
name score
0 Alice 95
1 Bob 87
That’s it! Pandas makes reading and writing Parquet files as simple as CSV operations. However, there’s much more you can do with Parquet, and understanding its strengths will help you make better decisions for your data architecture.
What Is Parquet and Why Use It?
Apache Parquet is a columnar storage format designed for distributed data processing. Instead of storing data row-by-row like CSV or JSON, Parquet organizes data by column. This architectural difference has profound implications for performance and storage efficiency.
Here’s how Parquet compares to other popular formats:
Characteristic | CSV | Parquet | JSON
Storage Model | Row-based | Columnar | Row-based
Compression | External (gzip, etc.) | Built-in (SNAPPY, GZIP) | External
Data Types | All strings | Strongly typed | Native types
File Size | Large (uncompressed) | Very small (compressed) | Medium to large
Query Speed | Slow (full scan) | Very fast (column projection) | Slow (parsing)
Nested Structure Support | None | Yes | Yes
Schema Enforcement | None | Yes | Optional
Parquet excels when you need to:
Analyze specific columns: Read only the columns you need, not the entire dataset
Minimize storage: Files are often 80-90% smaller than the equivalent CSV, depending on the data and the compression codec
Process large datasets: Integrate seamlessly with Spark, Hadoop, and cloud data warehouses
Preserve data types: Maintain integers, floats, timestamps, and complex types without conversion
Enable predicate pushdown: Filter rows at the storage layer for dramatic performance gains
Installing PyArrow
To work with Parquet files in Python, you’ll need PyArrow, the Python library of the Apache Arrow project for columnar data and the Arrow format. While Pandas can read/write Parquet using PyArrow as a backend, we’ll install both for maximum flexibility with pip install pyarrow pandas.
PyArrow is the default engine that powers Parquet I/O in Pandas. If neither PyArrow nor the alternative fastparquet engine is installed, to_parquet() and read_parquet() raise an ImportError. Ensure you have PyArrow 1.0.0 or later for best compatibility with modern Parquet files.
Writing Parquet Files
There are multiple ways to write Parquet files in Python, each suited to different scenarios. Let’s explore the most common approaches:
Writing from a Pandas DataFrame
The simplest approach is using Pandas to write a DataFrame directly to Parquet:
Parquet supports predicate pushdown, allowing you to filter rows at the storage layer before loading data into memory:
# read_parquet_filtering.py
import pyarrow.parquet as pq
# Read with filters using PyArrow; all conditions in the list must hold (logical AND)
table = pq.read_table(
    'users.parquet',
    filters=[
        ('is_active', '==', True),
        ('login_count', '>', 100)
    ]
)
df_filtered = table.to_pandas()
print("Active users with more than 100 logins:")
print(df_filtered)
print(f"\nRows after filter: {len(df_filtered)}")
Output:
Active users with more than 100 logins:
user_id username signup_date login_count is_active
0 1001 alice_wonder 2023-01-15 00:00:00 142 True
1 1003 charlie_brown 2023-03-10 00:00:00 256 True
2 1005 eve_johnson 2023-05-12 00:00:00 198 True
Rows after filter: 3
Schema Inspection and Metadata
Understanding the schema of a Parquet file is crucial before processing. PyArrow makes schema inspection easy:
# inspect_parquet_schema.py
import pyarrow.parquet as pq
# Read parquet file metadata
parquet_file = pq.ParquetFile('users.parquet')
# Inspect schema (schema_arrow exposes Arrow types such as int64 and string)
print("Schema:")
print(parquet_file.schema_arrow)
# Get column information
print("\n\nColumn Information:")
for i, field in enumerate(parquet_file.schema_arrow):
    print(f" {i+1}. {field.name}: {field.type}")
# Read metadata
print(f"\n\nFile Metadata:")
print(f" Number of rows: {parquet_file.metadata.num_rows}")
print(f" Number of columns: {parquet_file.metadata.num_columns}")
print(f" Number of row groups: {parquet_file.metadata.num_row_groups}")
# Get compression info for each column chunk of the first row group
print(f"\n\nCompression Information:")
row_group = parquet_file.metadata.row_group(0)
for i in range(row_group.num_columns):
    col = row_group.column(i)
    print(f" {col.path_in_schema}: {col.compression}")
Output:
Schema:
user_id: int64
username: string
signup_date: timestamp[ns]
login_count: int64
is_active: bool
Column Information:
1. user_id: int64
2. username: string
3. signup_date: timestamp[ns]
4. login_count: int64
5. is_active: bool
File Metadata:
Number of rows: 5
Number of columns: 5
Number of row groups: 1
Compression Information:
user_id: SNAPPY
username: SNAPPY
signup_date: SNAPPY
login_count: SNAPPY
is_active: SNAPPY
Column selection — skip what you don’t need, load what you do.
Partitioned Datasets
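When dealing with massive datasets, partitioning by date, region, or other dimensions is essential for performance. Parquet supports a partitioned dataset structure, where data is organized into directories named after the partition values (for example, sales_data/date=2024-01-01/region=US/), so a reader can skip directories that a query does not need.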
Writing Partitioned Parquet Files
PyArrow’s parquet module can automatically organize data into partitions:
# write_partitioned_parquet.py
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime, timedelta
# Create sample data with dates and regions
records = []
for day in range(5):
    for region in ['US', 'EU', 'APAC']:
        for i in range(10):
            records.append({
                'date': (datetime(2024, 1, 1) + timedelta(days=day)).date(),
                'region': region,
                'sales': 1000 + day * 100 + i * 50,
                'user_count': 100 + day * 10 + i * 5
            })
df = pd.DataFrame(records)
# Write as partitioned dataset
table = pa.Table.from_pandas(df)
pq.write_to_dataset(
table,
root_path='sales_data',
partition_cols=['date', 'region'],
compression='snappy'
)
print("Partitioned dataset written!")
print(f"Total records: {len(df)}")
print(f"Partition columns: date, region")
Output:
Partitioned dataset written!
Total records: 150
Partition columns: date, region
Reading Partitioned Parquet Datasets
Reading partitioned datasets is transparent to the user:
# read_partitioned_parquet.py
import pyarrow.parquet as pq
import pandas as pd
# Read entire partitioned dataset
table = pq.read_table('sales_data')
df_all = table.to_pandas()
print(f"Total records read: {len(df_all)}")
print(f"\nFirst few records:")
print(df_all.head())
# Read specific partition
table_us = pq.read_table('sales_data',
filters=[('region', '==', 'US')]
)
df_us = table_us.to_pandas()
print(f"\n\nUS region records: {len(df_us)}")
print(df_us.head())
Output:
Total records read: 150
First few records:
date region sales user_count
0 2024-01-01 US 1000 100
1 2024-01-01 US 1050 105
2 2024-01-01 US 1100 110
3 2024-01-01 US 1150 115
4 2024-01-01 US 1200 120
US region records: 50
date region sales user_count
0 2024-01-01 US 1000 100
1 2024-01-01 US 1050 105
2 2024-01-01 US 1100 110
3 2024-01-01 US 1150 115
4 2024-01-01 US 1200 120
Partitioned datasets — organize once, query fast forever.
Real-Life Example: Log File Converter
Let’s build a practical example that converts CSV log files to partitioned Parquet format with compression statistics. This is a common task in data engineering:
# log_file_converter.py
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime
import os
def convert_csv_logs_to_parquet(csv_file, output_dir, partition_cols=('date', 'log_level')):
    """
    Convert CSV logs to partitioned Parquet with compression statistics.
    """
    # Read CSV
    print(f"Reading {csv_file}...")
    df = pd.read_csv(csv_file)
    # Ensure date column is datetime
    if 'timestamp' in df.columns:
        df['timestamp'] = pd.to_datetime(df['timestamp'])
        df['date'] = df['timestamp'].dt.date
    # Convert to PyArrow table
    table = pa.Table.from_pandas(df)
    # Get original CSV size
    csv_size = os.path.getsize(csv_file)
    # Write partitioned parquet
    print(f"Writing to {output_dir}...")
    pq.write_to_dataset(
        table,
        root_path=output_dir,
        partition_cols=list(partition_cols),
        compression='gzip'
    )
    # Calculate compression statistics
    total_parquet_size = 0
    for root, dirs, files in os.walk(output_dir):
        for file in files:
            if file.endswith('.parquet'):
                total_parquet_size += os.path.getsize(os.path.join(root, file))
    compression_ratio = csv_size / total_parquet_size if total_parquet_size > 0 else 0
    print(f"\nConversion Complete!")
    print(f" Original CSV size: {csv_size:,} bytes")
    print(f" Parquet total size: {total_parquet_size:,} bytes")
    print(f" Compression ratio: {compression_ratio:.2f}x")
    print(f" Space saved: {100 * (1 - total_parquet_size/csv_size):.1f}%")
    return {
        'csv_size': csv_size,
        'parquet_size': total_parquet_size,
        'compression_ratio': compression_ratio,
        'rows': len(df)
    }
# Example usage: Create sample log data and convert
if __name__ == '__main__':
    # Create sample log CSV
    log_data = pd.DataFrame({
        'timestamp': pd.date_range('2024-01-01', periods=1000, freq='1min'),
        'log_level': ['DEBUG', 'INFO', 'WARNING', 'ERROR'] * 250,
        'service': ['api', 'worker', 'db', 'cache'] * 250,
        'message': [f'Process event {i}' for i in range(1000)],
        'duration_ms': [10 + i % 100 for i in range(1000)]
    })
    log_data.to_csv('application.log.csv', index=False)
    # Convert to Parquet
    stats = convert_csv_logs_to_parquet(
        'application.log.csv',
        'logs_parquet',
        partition_cols=['log_level']
    )
Output:
Reading application.log.csv...
Writing to logs_parquet...
Conversion Complete!
Original CSV size: 156,234 bytes
Parquet total size: 31,456 bytes
Compression ratio: 4.97x
Space saved: 79.9%
This example demonstrates real-world value: a roughly 5x compression ratio, which translates to substantial storage savings when dealing with millions of log lines. Combined with partitioning by log_level, analytics queries become much faster because the query engine can skip entire directories of unneeded data.
Same data, fraction of the size. Parquet compression is no joke.
Frequently Asked Questions
Q1: Can I append data to an existing Parquet file?
Direct appending is not supported by Parquet’s design (it’s immutable). Instead, use one of these approaches:
Write new files to a partitioned dataset directory and query them together
Read the existing file, merge with new data, and overwrite the file
Use a data lake framework like Delta Lake or Apache Iceberg that layers transaction support over Parquet
Q2: What compression codec should I choose?
It depends on your use case:
Real-time systems: Use SNAPPY (fast) or no compression
Balanced scenarios: Use GZIP (good compression, reasonable speed)
Archive/storage: Use BROTLI or ZSTD (excellent compression)
Cloud storage: GZIP or SNAPPY (cloud providers don’t charge extra for fast decompression)
Q3: How does Parquet handle schema evolution?
Parquet supports schema evolution through explicit schema merging. When reading files with different schemas, you can merge them with PyArrow’s pa.unify_schemas() and convert columns with Table.cast(), whose safe flag controls whether lossy conversions raise an error. For production systems, always maintain explicit versioning of your schemas.
Q4: Can I use Parquet with streaming data?
Parquet is row-group based and requires completing a row group before writing. For streaming scenarios, consider buffering data in memory and periodically flushing to Parquet files. Alternatively, use streaming formats like Avro for real-time systems, then convert to Parquet for analytics.
Q5: What’s the maximum file size for Parquet?
Parquet does not impose a practical limit on file size, but keeping individual files under 1-2 GB and distributing data across partitions is recommended for performance. Most cloud data warehouses work best with files in the 100 MB to 1 GB range.
Q6: How do I handle nested data types in Parquet?
Parquet natively supports nested structures (structs, lists, maps). PyArrow represents these as complex types. When reading, they convert to Python objects; when writing from Pandas, you can use columns holding Python dicts and lists, or PyArrow’s explicit typing for complex structures.
Conclusion
Parquet has established itself as the de facto standard for columnar data storage in modern data pipelines. Its combination of efficient compression, strong type safety, schema support, and integration with big data frameworks makes it indispensable for anyone working with large datasets.
In this tutorial, you learned how to:
Read and write Parquet files using both Pandas and PyArrow
Leverage compression to reduce storage costs
Optimize queries by reading only needed columns
Use row filtering for efficient data access
Inspect schemas and metadata
Organize data into partitioned datasets
Build practical data conversion tools
Whether you’re migrating legacy CSV systems to modern data architecture or building cloud-native analytics pipelines, Parquet gives you the performance and efficiency your applications demand. Start with simple read/write operations, then progressively adopt compression and partitioning strategies as your data grows.
The investment in learning Parquet pays dividends — your queries will run faster, storage costs will shrink, and your data infrastructure becomes compatible with the entire ecosystem of modern data tools.
For years, Pandas has been the go-to library for data manipulation and analysis in Python. However, as datasets grow larger and performance becomes critical, Polars has emerged as a powerful alternative that can be significantly faster while offering a more intuitive API. Whether you’re processing CSV files with millions of rows or performing complex data transformations, Polars delivers better performance through lazy evaluation, optimized memory management, and expressive query syntax.
Polars represents a fresh take on DataFrame design, unencumbered by the need to maintain backward compatibility with older Pandas code. This freedom has allowed the Polars developers to make better architectural choices from the ground up. If you have ever been frustrated by Pandas’ performance on large datasets, struggled with type inference issues, or found yourself writing `.apply()` functions for operations that should be simple, Polars offers a refreshing alternative. The learning curve is gentle for Pandas users since the API is familiar, yet the performance improvements can be dramatic.
In this tutorial, we’ll explore how to transition from Pandas to Polars, understand why it’s faster, and learn practical techniques to leverage Polars’ most powerful features. We’ll examine real-world scenarios, compare performance side-by-side with Pandas code, and show you how to integrate Polars into your existing data science workflows. By the end, you’ll have the skills to confidently choose Polars for performance-critical applications.
This guide assumes you have intermediate Python knowledge and are familiar with Pandas concepts like DataFrames, filtering, and grouping. While we’ll cover the basics of Polars syntax, the focus is on helping experienced data professionals migrate their skills effectively.
Quick Example: Pandas vs Polars Performance
Let’s start with a practical comparison. Here’s the same operation performed in both Pandas and Polars, with timing to demonstrate the speed difference:
This example performs a typical data analysis task: reading a CSV file, filtering by a column value, and computing aggregated statistics. Both libraries accomplish the same goal with very similar syntax, but Polars completes faster, and the gap tends to widen with larger datasets. The timing difference is not just a matter of implementation quality; it stems from architectural choices. Pandas is built on NumPy arrays and executes each step eagerly on a single thread, often materializing intermediate copies, while Polars is written in Rust on top of the Apache Arrow columnar format and runs its operations in parallel. For filtering operations that examine specific columns, this design is efficient because Polars can read only the columns it needs and make good use of the CPU cache.
=== PANDAS ===
Time: 0.001234 seconds
department mean max
0 Engineering 87666.67 90000
=== POLARS ===
Time: 0.000456 seconds
department salary salary
0 Engineering 87666.67 90000
Polars is 2.71x faster than Pandas
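Notice how both libraries achieve the same result, but Polars completes in roughly a third of the time. On a dataset this small the measurement is dominated by fixed overhead, so treat the ratio as illustrative; for larger datasets with millions of rows, the difference typically becomes more pronounced. The advantage comes from Polars’ Arrow-based columnar memory, multi-threaded execution, and, in lazy mode, query optimization.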
What Is Polars and Why Is It Faster?
Polars is a DataFrame library written in Rust with Python bindings, designed from the ground up for performance. Unlike Pandas, which prioritizes flexibility and backward compatibility, Polars was built with speed and memory efficiency in mind. Here’s how they compare:
| Feature | Pandas | Polars |
| --- | --- | --- |
| Implementation Language | Python, C (NumPy) | Rust with Python bindings |
| Memory Model | NumPy-backed column blocks (operations often copy data) | Apache Arrow columnar format (memory-efficient) |
| Evaluation Mode | Eager (immediate execution) | Eager or lazy (lazy mode builds optimized execution plans) |
| Data Types | Implicit coercion (can cause issues) | Strict typing (safer operations) |
| Missing Values | NaN (float-based) by default | Null (type-aware) |
| Performance | Good for small-medium datasets | Strong across dataset sizes, especially large ones |
| Parallel Processing | Limited without manual optimization | Built-in multi-threading |
| SQL Support | Not native | Native SQL interface available |
The three main reasons Polars outperforms Pandas are: (1) Arrow-based columnar memory keeps each column in a contiguous, typed buffer, which suits vectorized operations and CPU caches; (2) Lazy evaluation builds an execution plan before running queries, allowing the query optimizer to eliminate redundant operations; and (3) The Rust implementation runs as compiled native code across multiple threads, without being constrained by Python’s global interpreter lock.
Understanding these architectural differences helps explain why Polars can be so much faster. Because data is laid out by column, a filter on a single column only touches that column’s buffer, and with lazy scans Polars can skip reading unneeded columns from the file altogether, whereas an eager Pandas workflow typically loads every column unless you restrict it yourself (for example with usecols). Lazy evaluation means Polars can see your entire query before execution and reorder operations for efficiency, for example pushing filters down before group-by operations to reduce the amount of data that needs to be grouped. The Rust implementation eliminates Python interpreter overhead, which is particularly significant for tight loops and large-scale operations. These advantages compound on large datasets, where Polars is often several times faster and, for some workloads, faster by an order of magnitude or more.
Installing Polars and Creating DataFrames
Getting started with Polars is straightforward. First, install the library using pip:
pip install polars
Polars is available on PyPI with pre-compiled binaries for most platforms, so no additional configuration is needed. The library is actively maintained, with frequent releases that add features and performance improvements.
Once installed, import Polars and create your first DataFrame. There are several ways to construct a DataFrame, similar to Pandas but with some syntactic differences:
Polars provides multiple ways to construct DataFrames, each suited to different data sources. The pl.DataFrame() constructor is flexible — you can pass dictionaries, lists of dictionaries, or even specify schemas explicitly for strict type control. When you define a schema, Polars enforces type consistency from the start, preventing silent type coercion bugs that can plague Pandas workflows. The pl.read_csv() function, by contrast, infers types automatically, which is convenient for quick exploratory work but may require schema validation for production pipelines.
# creating_dataframes.py
from io import StringIO

import polars as pl

# Method 1: From a dictionary (most common)
df1 = pl.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [28, 34, 25],
    'city': ['New York', 'London', 'Paris']
})
print("Method 1: From Dictionary")
print(df1)
print()

# Method 2: From a list of dictionaries
data = [
    {'product': 'Laptop', 'price': 1200, 'quantity': 5},
    {'product': 'Mouse', 'price': 25, 'quantity': 50},
    {'product': 'Keyboard', 'price': 75, 'quantity': 30}
]
df2 = pl.DataFrame(data)
print("Method 2: From List of Dictionaries")
print(df2)
print()

# Method 3: Specify data types explicitly
df3 = pl.DataFrame(
    {
        'id': [1, 2, 3],
        'email': ['user@example.com', 'admin@example.com', 'guest@example.com'],
        'active': [True, True, False]
    },
    schema={
        'id': pl.Int32,
        'email': pl.Utf8,
        'active': pl.Boolean
    }
)
print("Method 3: With Explicit Types")
print(df3)
print()

# Method 4: Read from CSV (inline data)
csv_data = """year,revenue,profit
2021,150000,30000
2022,185000,42000
2023,220000,55000"""
df4 = pl.read_csv(StringIO(csv_data))
print("Method 4: From CSV String")
print(df4)
Each of these four methods is useful in different scenarios. Method 1 is the most common for programmatically creating small test DataFrames. Method 2 is useful when you have data coming from a database query or API response as a list of dictionaries. Method 3 with explicit schema specification is critical for production code where you need to guarantee that, for example, IDs are 32-bit integers and not inferred as 64-bit. Method 4 demonstrates reading CSV data; Polars can also read Parquet, JSON, and many other formats. pl.read_csv() infers a dtype for each column and reports it in the printed output, so inference mistakes are easy to spot.
Notice how Polars displays the data type beneath each column header (e.g., str, i64, bool). This explicit type information is invaluable for debugging: you will immediately see if a column has the wrong type, whereas Pandas can silently change a type, for example upcasting an integer column to float when it contains missing values. The output format is also designed for readability in terminal environments, using box-drawing characters to clearly delineate rows and columns. The table header shows the shape (number of rows and columns), followed by each column’s name and data type. Type annotations like i64 mean 64-bit signed integer, f64 means 64-bit float, and str means string. With Pandas, you often need to call .dtypes or .info() to see types, and even then you might discover type inference issues that lead to bugs downstream.
Selecting, Filtering, and Sorting Data
Once you have a DataFrame, you’ll frequently need to select columns, filter rows, and sort data. Polars provides clean syntax for these operations that feels more intuitive than Pandas in many cases:
The filtering API in Polars is one of its greatest strengths: it is built around expressions that operate on entire columns at once. Where Pandas uses boolean-mask indexing such as df[df["salary"] > 90000], Polars uses the filter() method with pl.col() expressions. This functional approach is not only more readable, it also lets Polars parallelize the work and, in lazy mode, lets the query optimizer eliminate unnecessary data movement. You can combine conditions using & for AND and | for OR, just like in Pandas, and in lazy mode Polars can reorder and optimize the operations before execution.
Notice how the filtering operations chain together in a readable way. The select() method picks just the columns you need, reducing memory usage immediately. The filter() method uses expressions to evaluate conditions across the entire column in one pass, which avoids the row-by-row iteration that Pandas code often falls back to with `.apply()`. When you combine multiple filters with `&` or `|`, Polars evaluates them together. Finally, sort() arranges results by one or multiple columns, with control over ascending vs. descending order per column. This composable API is one of Polars’ greatest strengths: each method returns a new DataFrame, allowing you to chain operations naturally and readably.
Output:
=== Select Columns ===
shape: (7, 2)
┌────────────────┬────────┐
│ name ┆ salary │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════════════╪════════╡
│ Alice Johnson ┆ 95000 │
│ Bob Smith ┆ 65000 │
│ Charlie Brown ┆ 88000 │
│ Diana Prince ┆ 72000 │
│ Eve Wilson ┆ 68000 │
│ Frank Miller ┆ 92000 │
│ Grace Lee ┆ 58000 │
└────────────────┴────────┘
=== Filter by Department ===
shape: (3, 5)
┌─────────────┬────────────────┬──────────────┬────────┬─────────────────┐
│ employee_id ┆ name ┆ department ┆ salary ┆ years_employed │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 ┆ i64 │
╞═════════════╪════════════════╪══════════════╪════════╪═════════════════╡
│ 101 ┆ Alice Johnson ┆ Engineering ┆ 95000 ┆ 5 │
│ 103 ┆ Charlie Brown ┆ Engineering ┆ 88000 ┆ 4 │
│ 106 ┆ Frank Miller ┆ Engineering ┆ 92000 ┆ 7 │
└─────────────┴────────────────┴──────────────┴────────┴─────────────────┘
=== Filter Multiple Conditions ===
shape: (2, 5)
┌─────────────┬───────────────┬──────────────┬────────┬────────────────┐
│ employee_id ┆ name ┆ department ┆ salary ┆ years_employed │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 ┆ i64 │
╞═════════════╪═══════════════╪══════════════╪════════╪════════════════╡
│ 101 ┆ Alice Johnson ┆ Engineering ┆ 95000 ┆ 5 │
│ 106 ┆ Frank Miller ┆ Engineering ┆ 92000 ┆ 7 │
└─────────────┴───────────────┴──────────────┴────────┴────────────────┘
=== Filter with OR ===
shape: (4, 5)
┌─────────────┬──────────────┬────────────┬────────┬────────────────┐
│ employee_id ┆ name ┆ department ┆ salary ┆ years_employed │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 ┆ i64 │
╞═════════════╪══════════════╪════════════╪════════╪════════════════╡
│ 102 ┆ Bob Smith ┆ Sales ┆ 65000 ┆ 3 │
│ 105 ┆ Eve Wilson ┆ Sales ┆ 68000 ┆ 2 │
│ 107 ┆ Grace Lee ┆ HR ┆ 58000 ┆ 1 │
└─────────────┴──────────────┴────────────┴────────┴────────────────┘
=== Sort by Salary (Descending) ===
shape: (7, 5)
┌─────────────┬────────────────┬──────────────┬────────┬─────────────────┐
│ employee_id ┆ name ┆ department ┆ salary ┆ years_employed │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 ┆ i64 │
╞═════════════╪════════════════╪══════════════╪════════╪═════════════════╡
│ 101 ┆ Alice Johnson ┆ Engineering ┆ 95000 ┆ 5 │
│ 106 ┆ Frank Miller ┆ Engineering ┆ 92000 ┆ 7 │
│ 103 ┆ Charlie Brown ┆ Engineering ┆ 88000 ┆ 4 │
│ 104 ┆ Diana Prince ┆ Marketing ┆ 72000 ┆ 6 │
│ 105 ┆ Eve Wilson ┆ Sales ┆ 68000 ┆ 2 │
│ 102 ┆ Bob Smith ┆ Sales ┆ 65000 ┆ 3 │
│ 107 ┆ Grace Lee ┆ HR ┆ 58000 ┆ 1 │
└─────────────┴────────────────┴──────────────┴────────┴─────────────────┘
=== Sort by Department, then Salary ===
shape: (7, 5)
[similar output showing sorted results]
Polars — because life’s too short for slow DataFrames.
Expressions and Column Operations
One of Polars’ most powerful features is its expression system. Expressions allow you to define transformations that are lazily evaluated and optimized by Polars’ query engine. This is a paradigm shift from Pandas, where operations are evaluated immediately:
Expressions form the core of Polars’ query language. Think of them as recipes for transforming columns: they describe what you want to do, not how to do it. When you write pl.col("salary").mean(), you are not immediately computing the mean; you are defining an expression that says “take the salary column and calculate its mean.” This separation between definition and execution is what enables Polars to apply aggressive optimizations. The query optimizer can see your entire pipeline of expressions and decide the most efficient order of operations, potentially combining multiple steps into a single pass through the data.
In Pandas, you often reach for `.apply()` or create intermediate columns with `.assign()` when you need to transform data. These approaches are flexible but can be inefficient: `.apply()` calls a Python function for each row or element, and chained assignments create intermediate DataFrames. With Polars expressions, you define your transformation declaratively and let the engine handle execution. Another key difference: Polars expressions are type-aware and vectorized. They operate on entire columns, not individual rows, so the work runs in pre-compiled Rust kernels rather than in the Python interpreter. This is why Polars expressions are often one to two orders of magnitude faster than the equivalent `.apply()` in Pandas for numerical operations. The composability of expressions is another major win: you can chain method calls together, combining filtering, transformation, and aggregation in a single readable expression.
# polars_expressions.py
from io import StringIO

import polars as pl

csv_data = """product,q1_sales,q2_sales,q3_sales,q4_sales
Laptop,45000,52000,58000,61000
Tablet,28000,31000,35000,38000
Smartphone,120000,135000,150000,165000
Monitor,18000,19000,22000,24000"""
df = pl.read_csv(StringIO(csv_data))

# Basic arithmetic expressions
print("=== Total Sales by Product ===")
result = df.select([
    pl.col('product'),
    (pl.col('q1_sales') + pl.col('q2_sales') + pl.col('q3_sales') + pl.col('q4_sales')).alias('total_sales')
])
print(result)
print()

# Combining arithmetic on multiple columns into an average
print("=== Average Quarterly Sales ===")
result = df.select([
    pl.col('product'),
    ((pl.col('q1_sales') + pl.col('q2_sales') + pl.col('q3_sales') + pl.col('q4_sales')) / 4).alias('avg_quarterly')
])
print(result)
print()

# Conditional expressions
# String results are wrapped in pl.lit(); a bare string in then()/otherwise()
# is treated as a column name by current Polars releases.
print("=== High Performers (Q4 > 50k) ===")
result = df.select([
    pl.col('product'),
    pl.when(pl.col('q4_sales') > 50000)
    .then(pl.lit('High'))
    .otherwise(pl.lit('Standard'))
    .alias('category')
])
print(result)
print()

# String operations
print("=== Product Names with Prefix ===")
result = df.select([
    (pl.lit('PRODUCT_') + pl.col('product')).alias('full_name'),
    pl.col('q1_sales')
])
print(result)
print()

# Multiple row-wise statistics in one select
print("=== Complex Statistics ===")
q_cols = ['q1_sales', 'q2_sales', 'q3_sales', 'q4_sales']
result = df.select([
    pl.col('product'),
    pl.concat_list(q_cols).list.mean().alias('mean_sales'),
    pl.concat_list(q_cols).list.max().alias('max_sales'),
    pl.concat_list(q_cols).list.min().alias('min_sales')
])
print(result)
These examples demonstrate the power and flexibility of expressions. Notice that expressions can be nested and combined: you can use `pl.lit()` for literal values, `pl.col()` to reference columns, arithmetic and string operations, and higher-order functions like `list.mean()` for more complex transformations. The key advantage is that all these operations compose elegantly, and the expressions passed to a single select() are evaluated together (in parallel where possible), which in lazy mode also lets Polars optimize them as one plan. Compare this to Pandas, where you might need to chain multiple `.apply()` calls or use `assign()` repeatedly, each of which creates an intermediate DataFrame and executes eagerly.
The expressions we have seen so far operate on entire columns. But often, you will want to apply expressions within groups, for example computing the total revenue for each product category, or finding the average salary by department. This is where group_by() combined with agg() (aggregate) becomes essential. The agg() method accepts a list of expressions and applies each one to every group, giving you fine-grained control over which aggregations happen on which columns.
GroupBy and Aggregation
Aggregating data by groups is fundamental to data analysis. Polars makes grouping and aggregation intuitive and fast:
In Polars, group_by() (spelled groupby() before version 0.19) is typically paired immediately with agg() to perform aggregations on groups. Unlike Pandas, where you might call .groupby().mean() or .groupby()["column"].sum(), Polars requires you to be explicit about which columns get which operations. This explicitness might feel verbose at first, but it is actually a feature: you are forced to think clearly about what you are aggregating and how. Moreover, Polars runs grouped aggregations across multiple CPU cores automatically, often giving you parallel speedups without any extra code on your part.
# polars_groupby.py
from io import StringIO

import polars as pl

csv_data = """region,product,units_sold,revenue
North,Laptop,120,240000
North,Desktop,80,128000
North,Monitor,200,40000
South,Laptop,150,300000
South,Desktop,95,152000
South,Monitor,180,36000
East,Laptop,110,220000
East,Desktop,70,112000
East,Monitor,220,44000
West,Laptop,140,280000
West,Desktop,85,136000
West,Monitor,190,38000"""
df = pl.read_csv(StringIO(csv_data))

# Note: group_by() was named groupby() before Polars 0.19.

# Simple group_by with single aggregation
print("=== Total Revenue by Region ===")
result = df.group_by('region').agg(pl.col('revenue').sum()).sort('revenue', descending=True)
print(result)
print()

# Multiple aggregations
print("=== Region Statistics ===")
result = df.group_by('region').agg([
    pl.col('revenue').sum().alias('total_revenue'),
    pl.col('units_sold').sum().alias('total_units'),
    pl.col('revenue').mean().alias('avg_revenue'),
    pl.col('units_sold').count().alias('product_count')
])
print(result)
print()

# Group by multiple columns
print("=== Revenue by Region and Product ===")
result = df.group_by(['region', 'product']).agg(
    pl.col('revenue').sum().alias('total_revenue'),
    pl.col('units_sold').sum().alias('total_units')
).sort(['region', 'total_revenue'], descending=[False, True])
print(result)
print()

# Group by with conditional aggregation
print("=== High-Value Sales (>40k) ===")
result = df.group_by('product').agg(
    pl.col('revenue').filter(pl.col('revenue') > 40000).sum().alias('high_value_revenue'),
    pl.col('revenue').count().alias('total_sales_count')
)
print(result)
Aggregations are powerful, but they are even more powerful when combined with other operations. For instance, you might filter rows, transform columns, group by a category, and then aggregate — all in a single logical operation. By default, each operation executes immediately, which is fine for small datasets but wastes computational resources on large ones. This is where lazy evaluation enters the picture. Lazy evaluation defers execution until you explicitly request results, allowing Polars to analyze your entire query and find the optimal execution plan.
Expressions chain like magic — filter, transform, aggregate, done.
Lazy Evaluation with LazyFrames
Lazy evaluation is one of Polars’ defining features and a major source of its performance advantage. Instead of executing operations immediately, Polars builds an execution plan and optimizes it before running. This allows the query optimizer to eliminate redundant operations, push filters down, and parallelize efficiently:
With lazy evaluation, you chain your operations together without worrying about intermediate results. Polars builds a directed acyclic graph (DAG) of your operations, analyzes the dependencies, and figures out the best way to execute everything. For example, if you filter and then select only a few columns, Polars will reorder operations to select columns first (reducing memory traffic) before filtering. If you have multiple aggregations on the same grouped data, Polars will combine them into a single pass. These optimizations happen automatically — you do not need to think about it, but understanding that it is happening can help you write more efficient queries.
Notice the query plan output — it shows how Polars intends to execute your operations. The optimizer reorders and combines steps for efficiency. When you call collect(), this optimized plan is executed. This is fundamentally different from Pandas, where operations happen one by one as you write them. The performance gains from lazy evaluation can be dramatic on large datasets with complex pipelines — sometimes 10x or even 100x faster, depending on the operations and data size.
The lazy approach can be significantly faster because Polars’ query optimizer performs several optimizations: (1) Predicate pushdown moves filters as early as possible to reduce data processed; (2) Projection pushdown selects only needed columns; (3) Common subexpression elimination avoids redundant calculations; and (4) Parallel execution processes data across multiple CPU cores automatically. These optimizations are sophisticated — they involve analyzing the entire computation graph and intelligently reordering operations while preserving correctness. This is something Pandas cannot do because it executes eagerly, one operation at a time.
Understanding lazy evaluation changes how you think about data processing. Instead of thinking “execute this step, then this step,” you think “build a description of what I want, then execute it optimally.” This mental shift is subtle but powerful. It encourages you to compose operations declaratively, expressing what data you want rather than how to get it. The Polars optimizer then handles the “how” — and it is usually smarter than what you would write manually.
Lazy evaluation — Polars reads the whole plan before lifting a finger.
Converting Between Pandas and Polars
If you’re working in an environment where you need both Pandas and Polars, or migrating existing Pandas code, conversion between the two is straightforward:
Sometimes you cannot immediately rewrite an entire codebase in Polars: maybe you have legacy Pandas code, or you need a library that only works with Pandas DataFrames. Fortunately, conversion between Pandas and Polars is quick. The to_pandas() method converts a Polars DataFrame to Pandas, and pl.from_pandas() does the reverse. Both conversions go through Apache Arrow, so the pyarrow package needs to be installed alongside Pandas. The conversion is relatively fast because both libraries hold data column by column, although some columns may still be copied depending on their types. This makes it practical to use Polars for the heavy lifting (loading, filtering, aggregating) and then hand off results to Pandas or other libraries for specialized analysis or visualization.
A practical approach is to adopt Polars incrementally. Start by identifying the most performance-critical sections of your data pipeline — typically data loading and initial filtering. Replace those sections with Polars code using lazy evaluation to maximize performance benefits. Once you have the processed results, convert back to Pandas if you need to use legacy code or specific libraries that depend on Pandas. This hybrid approach gives you immediate performance gains without requiring a complete rewrite. Over time, as you become more comfortable with Polars’ API, you can migrate more of your pipeline, eventually eliminating the Pandas dependency entirely if desired.
# pandas_polars_conversion.py
# Requires pandas, polars, and pyarrow (used for the conversions).
from io import StringIO

import pandas as pd
import polars as pl

csv_data = """name,department,salary
Alice,Engineering,95000
Bob,Sales,65000
Charlie,Engineering,88000
Diana,Marketing,72000"""

# Method 1: Pandas DataFrame to Polars
print("=== Convert Pandas to Polars ===")
df_pandas = pd.read_csv(StringIO(csv_data))
print("Original Pandas DataFrame:")
print(df_pandas)
print(f"Type: {type(df_pandas)}")
print()
df_polars = pl.from_pandas(df_pandas)
print("Converted to Polars:")
print(df_polars)
print(f"Type: {type(df_polars)}")
print()

# Method 2: Polars DataFrame to Pandas
print("=== Convert Polars to Pandas ===")
df_polars_new = pl.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard'],
    'price': [1200, 25, 75],
    'in_stock': [True, True, False]
})
print("Original Polars DataFrame:")
print(df_polars_new)
print()
df_pandas_new = df_polars_new.to_pandas()
print("Converted to Pandas:")
print(df_pandas_new)
print(f"Type: {type(df_pandas_new)}")
print()

# Method 3: Working with Polars then converting back
print("=== Polars Processing + Pandas Export ===")
df_work = pl.DataFrame({
    'quarter': ['Q1', 'Q1', 'Q2', 'Q2', 'Q3', 'Q3'],
    'region': ['North', 'South', 'North', 'South', 'North', 'South'],
    'sales': [45000, 52000, 58000, 61000, 62000, 68000]
})
# Process with Polars (faster); group_by() was named groupby() before 0.19.
# The sort makes the row order deterministic, since group_by() does not
# guarantee any particular group order.
result = (
    df_work
    .group_by('region')
    .agg(pl.col('sales').mean().alias('avg_sales'))
    .sort('region')
)
# Convert to Pandas for compatibility with other tools
result_pandas = result.to_pandas()
print(result_pandas)
print(f"Pandas type: {type(result_pandas)}")
Output:
=== Convert Pandas to Polars ===
Original Pandas DataFrame:
name department salary
0 Alice Engineering 95000
1 Bob Sales 65000
2 Charlie Engineering 88000
3 Diana Marketing 72000
Type: <class 'pandas.core.frame.DataFrame'>
Converted to Polars:
shape: (4, 3)
┌─────────┬──────────────┬────────┐
│ name ┆ department ┆ salary │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═════════╪══════════════╪════════╡
│ Alice ┆ Engineering ┆ 95000 │
│ Bob ┆ Sales ┆ 65000 │
│ Charlie ┆ Engineering ┆ 88000 │
│ Diana ┆ Marketing ┆ 72000 │
└─────────┴──────────────┴────────┘
Type: <class 'polars.dataframe.frame.DataFrame'>
=== Convert Polars to Pandas ===
Original Polars DataFrame:
shape: (3, 3)
┌──────────┬───────┬──────────┐
│ product ┆ price ┆ in_stock │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ bool │
╞══════════╪═══════╪══════════╡
│ Laptop ┆ 1200 ┆ true │
│ Mouse ┆ 25 ┆ true │
│ Keyboard ┆ 75 ┆ false │
└──────────┴───────┴──────────┘
Converted to Pandas:
product price in_stock
0 Laptop 1200 True
1 Mouse 25 True
2 Keyboard 75 False
Type: <class 'pandas.core.frame.DataFrame'>
=== Polars Processing + Pandas Export ===
region avg_sales
0 North 55000.0
1 South 60333.333333
Pandas type: <class 'pandas.core.frame.DataFrame'>
The conversion workflow is straightforward: load your data with Polars for speed, perform transformations using lazy evaluation and expressions, and collect the results. If you need to pass the data to a Pandas-dependent library or visualization tool, convert it at that point. This hybrid approach lets you get the best of both worlds — Polars performance for data wrangling and whatever specialized tools your workflow requires.
Pandas and Polars — best friends when you use .to_pandas() wisely.
Real-Life Example: Sales Data Analyzer
Let’s build a practical example that demonstrates multiple Polars features in a realistic scenario. This analyzer reads transaction data, performs complex aggregations, identifies trends, and generates insights:
Real-world data pipelines combine multiple techniques: filtering, grouping, joining, and creating new computed columns. This sales analyzer demonstrates how to structure a Polars pipeline for a typical business use case. Notice how the entire sequence of operations reads like a narrative: “Start with sales data, lazy-load it, filter by region and date, group by product and salesperson, compute metrics, and collect results.” Each step is a Polars expression or method call that chains naturally. Because we are using lazy evaluation, Polars will optimize this entire pipeline before processing a single row of data.
This example shows a realistic data processing pipeline where you start with raw CSV data, progressively filter and transform it, and end up with summarized metrics. In a production setting, you would likely save these results to a database or export them for reporting. The beauty of the Polars approach is that it scales: whether you have 1 million rows or 1 billion rows, the code structure remains the same, and Polars’ optimizer and parallelization kick in automatically. With Pandas, you would need to be more careful about memory usage and might have to restructure the code for larger datasets. The power of lazy evaluation combined with expressions means you can write concise, readable queries that also execute quickly.
Pipeline complete — clean data in, insights out, milliseconds flat.
Frequently Asked Questions
As you begin integrating Polars into your data science workflow, several questions naturally arise. This section addresses the most common concerns and misconceptions about Polars, its relationship to Pandas, and how to best leverage it in production environments. We will cover adoption strategies, performance expectations, and practical guidance for transitioning existing codebases.
1. Is Polars a complete replacement for Pandas?
Polars is a powerful alternative but not 100% compatible with every Pandas operation. Polars is excellent for data manipulation, aggregation, and analysis, which cover 80-90% of typical data tasks. Some areas where Pandas still excels include time series operations (Polars’ temporal support is improving), certain statistical functions, and specific visualization integrations. For most projects, you can migrate to Polars entirely, but it’s good to know both libraries. The beauty is that you do not need to choose one or the other — you can use both strategically within the same project. Use Polars where you need performance and a clean API, and fall back to Pandas where you need specific functionality or library support.
2. How much faster is Polars really?
Performance gains depend heavily on dataset size and operation type. For small datasets (< 100K rows), differences may be negligible. For medium datasets (1-100M rows), Polars is typically 2-10x faster. For large datasets (> 100M rows), the difference can be 10-100x or more, especially with lazy evaluation and multi-column operations. Benchmarks consistently show Polars outperforming Pandas on standard operations like groupby, filtering, and joins. The speedups come not just from being written in Rust, but from algorithmic optimizations made possible by lazy evaluation. When Polars can see your entire operation graph before execution, it can make decisions that Pandas never can. For example, it can decide to read only the columns you need from a CSV file, skip rows that will be filtered out, and parallelize across cores without any explicit parallel programming on your part.
3. Can I use Polars with Pandas code I already have?
Absolutely. You can convert between Polars and Pandas using pl.from_pandas() and .to_pandas(). A practical approach is to use Polars for heavy data processing where speed matters, then convert to Pandas if you need specific functionality or library integrations. Many projects use both libraries strategically. For instance, you might use Pandas for data exploration in notebooks and Polars for production pipelines, or vice versa. The conversion goes through Apache Arrow (so pyarrow must be installed) and is usually cheap compared with the processing itself, although some column types are copied rather than shared, so avoid converting back and forth inside a hot loop.
4. What about memory usage? Is Polars more memory-efficient?
Yes, Polars uses less memory than Pandas in most scenarios. The columnar storage model is more efficient, and Polars does not create unnecessary intermediate copies during operations. For a 1GB dataset, Polars might use 300-500MB while Pandas might use 2-3GB. This becomes critical when working with datasets approaching available RAM. The memory efficiency comes from multiple sources: (1) columnar storage means data is stored densely without padding; (2) lazy evaluation avoids creating intermediate DataFrames for chained operations; and (3) Polars uses more efficient data type representations (e.g., a validity bitmap for nulls instead of NaN sentinels, so an integer column with missing values stays an integer column, and strings stored in Arrow buffers rather than as Python objects). On systems with limited RAM, using Polars instead of Pandas can mean the difference between a workload completing and running out of memory.
5. How do I debug Polars lazy evaluation if something goes wrong?
Use the .explain() method to print the execution plan as text, or .show_graph() for a visual representation (which requires Graphviz to be installed). If an error occurs, call .collect() on an earlier part of the lazy chain to narrow down which step fails. You can also use eager evaluation (remove .lazy()) temporarily for debugging, then switch back to lazy mode once fixed. Lazy evaluation can seem mysterious at first because nothing executes until you call .collect(). If your code fails, the error message might not point to where you expected. The .explain() output helps demystify this: it shows you the exact execution plan Polars will use, allowing you to see if columns are being selected correctly, if filters are in the right position, and if joins are happening on the correct keys. This visibility is invaluable for diagnosing performance issues or unexpected results.
6. Does Polars support distributed computing like Spark?
Polars is designed for single-machine multi-core processing and is not a distributed computing framework like Spark. However, Polars is so fast that many workloads that would require Spark with Pandas can run efficiently on a single machine with Polars. For true distributed computing, you would still use Spark, but consider whether Polars might solve your problem first. The computing power of modern machines has grown tremendously — a single laptop can process gigabytes of data in seconds with Polars, which would have required a cluster a few years ago. This is why many data teams find they do not need Spark when they switch to Polars.
7. What about null/missing values in Polars?
Polars uses a proper Null type (similar to SQL) instead of NaN, making it more type-safe. By default, Polars allows nulls in any column. You can use .fill_null(), .drop_nulls(), or conditional logic with pl.when().then().otherwise() to handle missing data. The syntax is often more explicit and safer than Pandas’ approach. One of Polars’ design wins is that every data type can have a true null value, just like in databases. Pandas conflates missing values (NaN for floats, None for objects) which can lead to subtle bugs. Polars forces you to think clearly about whether a value is truly missing (null) or a valid data point. This explicitness prevents entire classes of bugs and makes your data pipelines more reliable.
Conclusion
Polars represents a significant evolution in Python data processing. Its combination of speed, memory efficiency, and expressive syntax makes it an excellent choice for modern data work. Whether you’re analyzing millions of rows of transaction data, processing sensor readings, or building data pipelines, Polars delivers measurable performance improvements over Pandas. The library has matured significantly in recent years and now supports the vast majority of data manipulation tasks that Pandas users encounter daily.
The key advantages are clear: lazy evaluation optimizes complex queries, the expression-based API is intuitive and composable, and the Rust implementation eliminates Python’s performance bottlenecks. For intermediate and advanced Python developers familiar with Pandas, the learning curve is minimal, and the payoff is substantial. You are not learning a completely new paradigm — you are adopting a better implementation of the same concepts you already know.
What we have covered in this guide provides you with a solid foundation for using Polars effectively. We started with basic DataFrame creation and manipulation, progressed through filtering and expressions, explored groupby aggregations, and discovered the power of lazy evaluation. We examined real-world examples and discussed practical integration strategies with existing Pandas code. These techniques form the core of most data analysis workflows — master these, and you will be equipped to handle complex data problems efficiently.
Start by trying Polars on your most performance-critical data operations. Use lazy evaluation for complex multi-step transformations, and leverage groupby and expressions for aggregations. Convert to and from Pandas as needed for compatibility with existing tools. Over time, you will likely find Polars becoming your default choice for data analysis, with Pandas reserved for specific edge cases. The performance benefits are not merely academic — they directly translate to faster iteration during exploration, shorter pipeline runtimes in production, and the ability to handle larger datasets on the same hardware.
The future of Python data processing is here, and it is fast. Give Polars a try in your next project and experience the difference firsthand. You will not regret the investment in learning this powerful library.
Best Practices and Tips for Polars Success
As you integrate Polars into your workflows, keep a few best practices in mind. First, always prefer lazy evaluation for production code — the performance benefits are substantial and there is rarely a downside to deferring execution until you call .collect(). Second, be explicit with your schemas whenever possible, especially for CSV and JSON files. Polars can infer types, but explicit schemas prevent surprises and make your code more maintainable. Third, use .explain() when you are curious about how Polars plans to execute your query — this is educational and helps you understand what optimizations are happening behind the scenes.
Fourth, take advantage of Polars’ rich expression system rather than falling back to Python loops or .map_elements() (named .apply() in older releases). Expressions are faster, more readable, and often shorter. Fifth, remember that Polars works in memory by default: it reads data efficiently, but datasets that do not fit in RAM still require strategies like filtering early or processing in chunks. Finally, stay up to date with Polars releases. The library is actively developed and new features, optimizations, and bug fixes arrive regularly. The community is welcoming and the documentation continues to improve. Polars is used in production by data teams at major companies and has proven itself as a reliable, performant alternative to Pandas. It is not an experimental project; it is battle-tested and production-ready.
Making HTTP requests is a fundamental task in web development and data collection. However, when you need to fetch data from multiple endpoints simultaneously, traditional blocking requests become a bottleneck. If you want to retrieve data from 100 different APIs, making sequential requests could take minutes. This is where concurrency comes in, allowing you to send multiple requests at the same time and dramatically speed up your application.
aiohttp is a Python library that enables asynchronous HTTP client and server functionality. Built on top of asyncio, aiohttp allows you to handle hundreds or thousands of concurrent requests without creating a thread for each one. This makes it ideal for web scraping, working with REST APIs, and building high-performance applications that need to juggle multiple I/O operations.
In this tutorial, we’ll explore how to use aiohttp to make concurrent HTTP requests, manage sessions efficiently, handle errors gracefully, and implement best practices like rate limiting and timeout management. By the end, you’ll understand how to build scalable applications that can fetch data from multiple sources simultaneously with minimal resource usage.
Quick Example: Fetching 5 URLs Concurrently
Let’s start with a simple example that demonstrates the power of concurrent requests. This script fetches data from five endpoints at the same time:
# concurrent_fetch_example.py
import asyncio
import aiohttp

async def fetch_url(session, url):
    async with session.get(url) as response:
        return await response.json()

async def main():
    urls = [
        'https://httpbin.org/delay/2',
        'https://httpbin.org/delay/2',
        'https://httpbin.org/delay/2',
        'https://httpbin.org/delay/2',
        'https://httpbin.org/delay/2',
    ]
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_url(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        print(f"Fetched {len(results)} responses")

asyncio.run(main())
Output:
Fetched 5 responses
Notice how all five requests are sent immediately and handled concurrently. With traditional blocking requests, this would take at least 10 seconds (2 seconds per request, one after another). With aiohttp and asyncio, it completes in roughly 2 seconds plus network overhead because the waits overlap.
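The same arithmetic can be checked without any network access. In the sketch below, asyncio.sleep stands in for a request that takes a fixed time; the helper names and the 0.2-second delay are assumptions made for the demonstration.

```python
import asyncio
import time

async def fake_request(delay: float) -> float:
    # asyncio.sleep stands in for a network call that takes `delay` seconds.
    await asyncio.sleep(delay)
    return delay

async def run_sequential(delays):
    # Each request starts only after the previous one has finished.
    return [await fake_request(d) for d in delays]

async def run_concurrent(delays):
    # All requests are started together and their waits overlap.
    return await asyncio.gather(*(fake_request(d) for d in delays))

def timed(coro) -> float:
    start = time.perf_counter()
    asyncio.run(coro)
    return time.perf_counter() - start

delays = [0.2] * 5
sequential_time = timed(run_sequential(delays))  # roughly the sum of the delays
concurrent_time = timed(run_concurrent(delays))  # roughly the longest delay
print(f"sequential: {sequential_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

Sequential time grows with the number of requests, while concurrent time stays close to the slowest single request.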
aiohttp.ClientSession: one connection pool to rule them all.
What is aiohttp?
aiohttp is a Python async HTTP client and server library. It’s built on top of asyncio, Python’s standard asynchronous I/O framework, which means it uses coroutines and event loops instead of threading to handle multiple operations concurrently. This approach is more efficient than threading because it avoids the overhead of context switching and thread management.
Key features of aiohttp include:
Asynchronous HTTP requests and responses
Connection pooling and session management
Automatic handling of redirects and cookies
Support for streaming and multipart uploads
Built-in timeout and error handling
WebSocket support
Both client and server functionality
The library is essential for building modern Python web applications that need to handle I/O efficiently. Whether you’re scraping data, calling APIs, or building a backend service that needs to communicate with multiple external services, aiohttp provides the tools you need.
Installing aiohttp
aiohttp is available on PyPI and can be installed with pip:
# terminal
pip install aiohttp
You can verify the installation by checking the version:
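One way to check is to print the package's __version__ attribute. The sketch below also defines a minimal GET helper of the kind the next paragraph describes; the function name get_json and the httpbin URL in the usage comment are illustrative assumptions.

```python
import asyncio
import aiohttp

# Confirms the package is importable and shows which release is installed.
print(aiohttp.__version__)

async def get_json(url: str) -> dict:
    # The session owns the connection pool; `async with` closes it on exit.
    async with aiohttp.ClientSession() as session:
        # `await` suspends this coroutine while the response arrives,
        # leaving the event loop free to run other work.
        async with session.get(url) as response:
            print(f"Status: {response.status}")
            return await response.json()

# Example call (needs network access; the URL is only an example endpoint):
# data = asyncio.run(get_json("https://httpbin.org/get"))
```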
The key parts of this code are: a ClientSession manages connections and cookies, and the async with statement ensures the connection is properly closed. Notice we use await to wait for the response without blocking other operations.
Making POST Requests
POST requests send data to a server. Here’s how to create a POST request with aiohttp:
# post_request_example.py
import asyncio
import aiohttp

async def post_example():
    async with aiohttp.ClientSession() as session:
        payload = {'name': 'Alice', 'age': 30}
        # json= serializes the dict and sets Content-Type: application/json
        async with session.post('https://httpbin.org/post', json=payload) as response:
            result = await response.json()
            print(f"Status: {response.status}")
            print(f"Sent data: {result['json']}")

asyncio.run(post_example())
Output:
Status: 200
Sent data: {'name': 'Alice', 'age': 30}
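The json parameter automatically serializes your dictionary and sets the correct Content-Type header. You can also pass a dictionary to data for form-encoded bodies, or pass an aiohttp.FormData object (or a dictionary containing open file objects) to data for multipart uploads. Unlike requests, aiohttp has no separate files parameter.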
400 requests. One event loop. Zero blocking.
Making Concurrent Requests with asyncio.gather()
The real power of aiohttp comes from running multiple requests concurrently. The asyncio.gather() function is the key tool for this:
# concurrent_requests.py
import asyncio
import aiohttp
import time

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.json()

async def fetch_multiple(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        return results

async def main():
    urls = [
        'https://jsonplaceholder.typicode.com/posts/1',
        'https://jsonplaceholder.typicode.com/posts/2',
        'https://jsonplaceholder.typicode.com/posts/3',
    ]
    start = time.time()
    results = await fetch_multiple(urls)
    elapsed = time.time() - start
    print(f"Fetched {len(results)} posts in {elapsed:.2f} seconds")

asyncio.run(main())
Output:
Fetched 3 posts in 0.85 seconds
The pattern here is crucial: create a list of coroutines (tasks), then pass them to asyncio.gather(). This sends all requests immediately and waits for all of them to complete. If you need to handle errors individually, you can pass return_exceptions=True to gather().
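To see what return_exceptions=True changes, here is a minimal sketch that needs no network. The fake_fetch coroutine and its fail flag are hypothetical stand-ins for real requests.

```python
import asyncio

async def fake_fetch(name: str, fail: bool = False) -> str:
    await asyncio.sleep(0.01)  # stand-in for network latency
    if fail:
        raise ConnectionError(f"{name} is unreachable")
    return f"{name}: ok"

async def main():
    # With return_exceptions=True, a failure is returned in its slot
    # of the results list instead of being raised from gather().
    results = await asyncio.gather(
        fake_fetch("a"),
        fake_fetch("b", fail=True),
        fake_fetch("c"),
        return_exceptions=True,
    )
    for item in results:
        if isinstance(item, Exception):
            print("failed:", item)
        else:
            print(item)
    return results

results = asyncio.run(main())
```

Results keep the order of the inputs, so each entry can be matched back to its URL with zip().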
ClientTimeout exists. Your 30-second silence has been noticed.
Session Management Best Practices
A ClientSession is a container for making HTTP requests and managing connections. It’s important to reuse the same session for multiple requests because it maintains a connection pool, which significantly improves performance. Here’s the right way to manage sessions:
# session_management.py
import asyncio
import aiohttp

async def fetch_with_session(session, url):
    async with session.get(url) as response:
        return await response.json()

async def main():
    # Create session once
    async with aiohttp.ClientSession() as session:
        urls = [
            'https://httpbin.org/get?id=1',
            'https://httpbin.org/get?id=2',
            'https://httpbin.org/get?id=3',
        ]
        # Reuse the same session for all requests
        tasks = [fetch_with_session(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        print(f"Successfully fetched {len(results)} responses")

asyncio.run(main())
Output:
Successfully fetched 3 responses
Never create a new session for each request. Creating sessions is expensive because they initialize connection pools and other resources. Instead, create one session and reuse it for all your requests in a given scope.
Error Handling and Timeouts
Network requests can fail for various reasons. aiohttp provides built-in mechanisms to handle errors and set timeouts:
# error_handling.py
import asyncio
import aiohttp

async def fetch_with_error_handling(session, url):
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as response:
            if response.status == 200:
                return await response.json()
            else:
                print(f"Error: {response.status} for {url}")
                return None
    except asyncio.TimeoutError:
        print(f"Timeout fetching {url}")
        return None
    except aiohttp.ClientError as e:
        print(f"Request failed for {url}: {e}")
        return None

async def main():
    urls = [
        'https://httpbin.org/delay/1',
        'https://httpbin.org/status/500',
        'https://httpbin.org/delay/10',  # Will timeout
    ]
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_with_error_handling(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        successful = len([r for r in results if r is not None])
        print(f"Successfully fetched {successful} out of {len(urls)} responses")

asyncio.run(main())
Output:
Error: 500 for https://httpbin.org/status/500
Timeout fetching https://httpbin.org/delay/10
Successfully fetched 1 out of 3 responses
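The ClientTimeout object lets you set different timeout durations. You can set a total timeout, or separate limits such as connect (time allowed to acquire a connection), sock_connect (time allowed to connect to the peer), and sock_read (longest allowed gap between reads of response data). Always wrap requests in try-except blocks to handle network errors gracefully.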
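As a small sketch of a more fine-grained configuration, the values below (30, 5, and 10 seconds) are arbitrary assumptions rather than recommendations.

```python
import aiohttp

# total: cap on the whole request, from start to the end of the body.
# connect: time allowed to obtain a connection, including waiting for
#          a free slot in the pool.
# sock_read: longest allowed gap between reads of response data.
timeout = aiohttp.ClientTimeout(total=30, connect=5, sock_read=10)
print(timeout.total, timeout.connect, timeout.sock_read)

# Usage: aiohttp.ClientSession(timeout=timeout) applies it to every request,
# while session.get(url, timeout=timeout) overrides it for a single call.
```

Setting the timeout on the session gives every request a sensible default, and individual calls can still pass their own value.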
Implementing Rate Limiting
When making many concurrent requests, you may need to respect rate limits imposed by the server. Here’s how to implement basic rate limiting with asyncio:
# rate_limiting.py
import asyncio
import aiohttp
import time

class RateLimiter:
    def __init__(self, max_requests, time_period):
        self.max_requests = max_requests
        self.time_period = time_period
        self.requests = []

    async def acquire(self):
        now = time.time()
        # Remove requests older than the time period
        self.requests = [req_time for req_time in self.requests
                         if now - req_time < self.time_period]
        if len(self.requests) >= self.max_requests:
            # Wait until the oldest request leaves the window, then retry
            sleep_time = self.time_period - (now - self.requests[0])
            await asyncio.sleep(sleep_time)
            await self.acquire()
        else:
            self.requests.append(time.time())

async def fetch_with_limit(session, url, limiter):
    await limiter.acquire()
    async with session.get(url) as response:
        return await response.json()

async def main():
    limiter = RateLimiter(max_requests=2, time_period=1.0)
    urls = [
        'https://httpbin.org/get?id=1',
        'https://httpbin.org/get?id=2',
        'https://httpbin.org/get?id=3',
        'https://httpbin.org/get?id=4',
    ]
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_with_limit(session, url, limiter) for url in urls]
        start = time.time()
        results = await asyncio.gather(*tasks)
        elapsed = time.time() - start
        print(f"Fetched {len(results)} responses in {elapsed:.2f} seconds")

asyncio.run(main())
Output:
Fetched 4 responses in 2.05 seconds
This rate limiter ensures no more than 2 requests happen within any 1-second window. You can adjust max_requests and time_period to match your API’s limits.
Three HTTP requests at once. Total time = max, not sum.
Working with Headers and Authentication
Many APIs require custom headers or authentication tokens. Here’s how to handle them with aiohttp:
You can also set default headers for all requests in a session by passing them during session creation. For basic authentication, use auth=aiohttp.BasicAuth('user', 'pass').
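A minimal sketch of both header styles follows. The token value, the X-Request-Source header, and the helper names are hypothetical placeholders, and the httpbin URL in the usage comment is only an example endpoint.

```python
import asyncio
import aiohttp

API_TOKEN = "example-token"  # hypothetical placeholder; load real secrets from config

def make_session() -> aiohttp.ClientSession:
    # Default headers are sent with every request made through this session.
    # Call this from inside a running event loop.
    return aiohttp.ClientSession(
        headers={"Authorization": f"Bearer {API_TOKEN}", "Accept": "application/json"}
    )

async def fetch_with_headers(url: str) -> dict:
    async with make_session() as session:
        # Per-request headers are merged with the session defaults.
        async with session.get(url, headers={"X-Request-Source": "tutorial"}) as response:
            response.raise_for_status()
            return await response.json()

async def default_auth_header() -> str:
    # Inspect the session defaults without sending a request.
    async with make_session() as session:
        return session.headers["Authorization"]

print(asyncio.run(default_auth_header()))
# Example call (needs network access):
# asyncio.run(fetch_with_headers("https://httpbin.org/headers"))
```

Keeping the token in the session defaults avoids repeating it on every call, while one-off headers stay next to the request that needs them.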
Real-Life Example: Concurrent API Data Collector
Let’s build a practical example that fetches data from multiple endpoints, implements error handling, rate limiting, and timeout management:
# api_data_collector.py
import asyncio
import aiohttp
import time
from typing import List, Dict, Any

class APICollector:
    def __init__(self, requests_per_second=2, timeout=10):
        self.requests_per_second = requests_per_second
        self.timeout = aiohttp.ClientTimeout(total=timeout)
        self.request_times = []

    async def rate_limit(self):
        now = time.time()
        # Remove old timestamps
        self.request_times = [t for t in self.request_times
                              if now - t < 1.0]
        if len(self.request_times) >= self.requests_per_second:
            sleep_time = 1.0 - (now - self.request_times[0])
            await asyncio.sleep(sleep_time)
            await self.rate_limit()
        else:
            self.request_times.append(time.time())

    async def fetch(self, session, url: str) -> Dict[str, Any]:
        await self.rate_limit()
        try:
            async with session.get(url, timeout=self.timeout) as response:
                if response.status == 200:
                    return {
                        'url': url,
                        'status': 'success',
                        'data': await response.json(),
                    }
                else:
                    return {
                        'url': url,
                        'status': 'error',
                        'code': response.status,
                    }
        except asyncio.TimeoutError:
            return {
                'url': url,
                'status': 'timeout',
            }
        except Exception as e:
            return {
                'url': url,
                'status': 'error',
                'error': str(e),
            }

    async def collect(self, urls: List[str]) -> List[Dict[str, Any]]:
        async with aiohttp.ClientSession() as session:
            tasks = [self.fetch(session, url) for url in urls]
            return await asyncio.gather(*tasks)

async def main():
    collector = APICollector(requests_per_second=3, timeout=5)
    urls = [
        'https://jsonplaceholder.typicode.com/posts/1',
        'https://jsonplaceholder.typicode.com/posts/2',
        'https://jsonplaceholder.typicode.com/posts/3',
        'https://jsonplaceholder.typicode.com/users/1',
        'https://jsonplaceholder.typicode.com/comments/1',
    ]
    start = time.time()
    results = await collector.collect(urls)
    elapsed = time.time() - start
    successful = sum(1 for r in results if r['status'] == 'success')
    print(f"Collected {successful}/{len(urls)} responses in {elapsed:.2f}s")
    for result in results:
        print(f"  {result['url'].split('/')[-1]}: {result['status']}")

asyncio.run(main())
This APICollector class demonstrates a production-ready pattern for making concurrent requests with all the best practices: rate limiting, timeout handling, error recovery, and clean result reporting. You can extend it further with retry logic, exponential backoff, or caching.
Frequently Asked Questions
What’s the difference between aiohttp and requests?
The requests library is synchronous and blocks while waiting for responses, making it suitable for simple scripts. aiohttp is asynchronous and allows thousands of concurrent requests without blocking, making it essential for high-performance applications. Use requests for simple scripts and aiohttp for anything that needs concurrency.
Should I create a new session for each request?
No, absolutely not. Creating a new session is expensive because it initializes connection pooling and other resources. Always create one session and reuse it for all requests within a given scope. When you’re done with all requests, close the session using a context manager or the await session.close() method.
How do I limit the number of concurrent connections?
You can set connection limits when creating a ClientSession using the connector parameter: connector = aiohttp.TCPConnector(limit=100, limit_per_host=30). The limit parameter sets the total number of connections, while limit_per_host limits connections to a single host.
How do I handle large file downloads?
For large files, read the response in chunks instead of all at once: async for chunk in response.content.iter_chunked(8192): file.write(chunk). This prevents loading the entire file into memory at once.
Does aiohttp support WebSockets?
Yes, aiohttp has built-in WebSocket support for both client and server use cases. You can establish WebSocket connections with async with session.ws_connect(url) as ws: ... and exchange messages bidirectionally.
What exceptions should I catch?
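The main exceptions to catch are asyncio.TimeoutError and aiohttp.ClientError (and its subclasses like ClientConnectionError, ClientSSLError). Always catch the more specific exceptions before the general ones. asyncio.CancelledError signals that a task was cancelled; since Python 3.8 it derives from BaseException, so except Exception does not catch it. If you catch it explicitly to clean up, re-raise it afterwards so the cancellation still takes effect.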
Conclusion
aiohttp is the go-to library for making concurrent HTTP requests in Python. By leveraging asyncio, it allows you to handle dozens, hundreds, or even thousands of concurrent connections efficiently without the overhead of threading. The key takeaways are: create one session and reuse it, use asyncio.gather() for concurrency, always implement proper error handling and timeouts, and respect server rate limits.
Whether you’re building a web scraper, integrating multiple APIs, or creating a high-performance backend service, mastering aiohttp will significantly improve your application’s responsiveness and efficiency. The patterns and best practices shown in this tutorial will serve you well in production environments.
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as r:
        r.raise_for_status()
        return await r.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return results

urls = ["https://example.com", "https://python.org", "https://github.com"]
results = asyncio.run(main(urls))
for url, result in zip(urls, results):
    if isinstance(result, Exception):
        print(url, "FAILED:", result)
    else:
        print(url, "got", len(result), "characters")
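Create one ClientSession per program (or per long-lived context), because re-creating sessions is expensive. With gather(return_exceptions=True), a failed request is returned as an exception object in the results list instead of being raised, so one failure does not discard the other responses.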
Concurrency Limit
import asyncio
import aiohttp

async def fetch_with_sem(sem, session, url):
    # At most `concurrency` coroutines can be inside this block at once
    async with sem:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as r:
            return await r.text()

async def main(urls, concurrency=20):
    sem = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(
            *[fetch_with_sem(sem, session, url) for url in urls],
            return_exceptions=True,
        )
        return results

# big_url_list is assumed to be a list of URLs defined elsewhere.
# Now scraping 10,000 URLs maxes out at 20 in flight
results = asyncio.run(main(big_url_list, concurrency=20))
Without a semaphore, asyncio.gather() launches ALL tasks at once. 10,000 simultaneous connections = errors, throttling, OS file-handle exhaustion. The semaphore caps how many run concurrently.
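The cap can be verified offline with a small experiment. In this sketch, asyncio.sleep stands in for the HTTP request, and the stats dictionary is a hypothetical helper that records how many workers are active at once.

```python
import asyncio

async def worker(sem: asyncio.Semaphore, stats: dict) -> None:
    async with sem:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stand-in for the HTTP request
        stats["active"] -= 1

async def main(total: int = 50, concurrency: int = 5) -> int:
    sem = asyncio.Semaphore(concurrency)
    stats = {"active": 0, "peak": 0}
    # All 50 tasks are created at once, but only 5 hold the semaphore at a time.
    await asyncio.gather(*(worker(sem, stats) for _ in range(total)))
    return stats["peak"]

peak = asyncio.run(main())
print(f"peak concurrency: {peak}")
```

All tasks still exist in memory from the start; the semaphore only limits how many are doing work at the same moment.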
POST, Headers, and JSON
import aiohttp

async def post_examples(url, token):
    # url and token are placeholders supplied by the caller
    async with aiohttp.ClientSession() as session:
        # POST with JSON body
        async with session.post(
            "https://api.example.com/users",
            json={"name": "Alice", "email": "alice@example.com"},
            headers={"Authorization": "Bearer " + token},
        ) as resp:
            result = await resp.json()

        # POST form data
        async with session.post(url, data={"key": "value"}) as resp:
            text = await resp.text()

        # File upload (multipart)
        with open("photo.jpg", "rb") as f:
            async with session.post(url, data={"file": f}) as resp:
                ...
Retries with tenacity
from tenacity import retry, stop_after_attempt, wait_random_exponential
import aiohttp

# Retry up to 5 times with randomized exponential backoff, capped at 30 seconds
@retry(
    stop=stop_after_attempt(5),
    wait=wait_random_exponential(multiplier=1, max=30),
)
async def fetch_with_retry(session, url):
    async with session.get(url) as r:
        r.raise_for_status()
        return await r.text()
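Streaming Large Downloads
import aiohttp

async def download(big_file_url):
    # big_file_url is a placeholder supplied by the caller
    async with aiohttp.ClientSession() as session:
        async with session.get(big_file_url) as resp:
            with open("output.bin", "wb") as f:
                # Write each 8 KB chunk as it arrives; memory use stays constant
                async for chunk in resp.content.iter_chunked(8192):
                    f.write(chunk)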
Common Pitfalls
Creating a Session per request. A new ClientSession allocates a new connection pool. Open once, reuse for all requests.
Forgetting to close the session. Use async with. Forgetting leaks connections.
No explicit timeout. The default total timeout is 5 minutes, so a hanging request can stall a task for a long time. Always pass timeout=aiohttp.ClientTimeout(total=N).
Using synchronous requests inside async code. requests.get blocks the event loop. Switch to aiohttp or wrap the call in asyncio.to_thread.
SSL errors. If you see “SSL certificate verify failed”, DON’T disable verification in production. Update certifi, check the actual cert, or use ssl.SSLContext explicitly.
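The asyncio.to_thread option from the list above can be sketched as follows, assuming Python 3.9 or newer. Here time.sleep stands in for a blocking call such as requests.get, and the heartbeat coroutine is a hypothetical helper that shows the event loop stays responsive.

```python
import asyncio
import time

def blocking_call(seconds: float) -> str:
    time.sleep(seconds)  # stand-in for requests.get or other blocking I/O
    return "done"

async def heartbeat(ticks: list) -> None:
    # Keeps ticking only if the event loop is not blocked.
    for _ in range(3):
        await asyncio.sleep(0.05)
        ticks.append(time.perf_counter())

async def main():
    ticks = []
    # to_thread runs the blocking function in a worker thread,
    # so the heartbeat coroutine keeps running in the meantime.
    result, _ = await asyncio.gather(
        asyncio.to_thread(blocking_call, 0.2),
        heartbeat(ticks),
    )
    return result, len(ticks)

result, tick_count = asyncio.run(main())
print(result, tick_count)
```

Calling blocking_call directly inside a coroutine would freeze the heartbeat for the full 0.2 seconds.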
FAQ
Q: aiohttp or httpx? A: aiohttp is older and battle-tested; httpx supports sync AND async, with a requests-like API. httpx for new code; aiohttp when you have existing aiohttp code or need server features.
Q: How fast is concurrent vs sequential? A: For I/O-bound work, near-linear speedup until you saturate network or target server. 100 sequential 100ms requests = 10 seconds; 100 concurrent = ~150ms.
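Q: How do I handle rate limits? A: Honor the Retry-After header from 429 responses. A semaphore caps how many requests are in flight at once; for a requests-per-second limit, add a time-based limiter. tenacity can then retry throttled requests with backoff.
Q: Connection pooling settings? A: aiohttp’s TCPConnector has limit (total) and limit_per_host. The defaults are 100 and 0 (no per-host cap), which is usually fine. Tune limit higher for many-host scrapers, and set limit_per_host to avoid hammering a single host.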
Q: Streaming POST? A: session.post(url, data=async_iterator) — pass an async generator that yields chunks. aiohttp streams them as they come.
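For the Retry-After advice above, one possible helper is sketched here using only the standard library. The function name, the 1-second default, and the now parameter are assumptions made for illustration; the header itself may carry either a number of seconds or an HTTP date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(header_value, now=None, default=1.0):
    """Convert a Retry-After header value to a non-negative wait in seconds."""
    if not header_value:
        return default
    value = header_value.strip()
    if value.isdigit():
        # Form 1: delay in seconds, e.g. "120"
        return float(value)
    try:
        # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (retry_at - now).total_seconds())

print(retry_after_seconds("120"))
```

With aiohttp, a handler for a 429 response could call await asyncio.sleep(retry_after_seconds(response.headers.get("Retry-After"))) before retrying.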
Wrapping Up
Concurrent HTTP in Python is asyncio + aiohttp + a semaphore. One Session, scoped concurrency limit, timeouts on every request, retries via tenacity. That four-piece combo handles everything from small concurrent fetches to massive web crawlers without unraveling.
Python Threading vs Multiprocessing vs Asyncio: Choosing the Right Concurrency Tool
Python offers three primary ways to write concurrent code, and choosing between them is one of the most consequential decisions you’ll make when building performant applications. The wrong choice can leave your app crawling along with CPU cores sitting idle, while the right approach can transform your code’s responsiveness and throughput. However, these three models work fundamentally differently, each with distinct tradeoffs that make them suitable for different problems. Understanding when threading makes sense, when multiprocessing is necessary, and when asyncio shines is essential knowledge for intermediate Python developers.
The good news is that this decision doesn’t have to be complicated. While threading, multiprocessing, and asyncio each have nuances, once you understand the core differences in how they work, the right choice for your problem becomes obvious. You’ll learn to recognize the patterns that favor each approach, and you’ll gain the confidence to build applications that scale smoothly from development to production. This guide walks you through each model’s internals, provides working code examples you can run immediately, and equips you with a decision framework that handles virtually every concurrency scenario you’ll encounter.
In this article, we’ll start with a quick side-by-side example that shows all three approaches tackling the same problem. Then we’ll dive deep into how Python’s Global Interpreter Lock shapes these decisions, explore each concurrency model’s strengths and weaknesses with detailed code, benchmark real performance differences, build a complete multi-stage pipeline that uses all three techniques, and finally provide a decision framework you can return to whenever you face a concurrency choice.
Three concurrency models, three different tradeoffs. Choose wisely.
Quick Example: Three Approaches to Fetching URLs
Before diving into theory, let’s see how each approach handles the same task: fetching data from 10 URLs and processing the responses. This concrete example illustrates how differently these models approach concurrency.
Threading Approach
# threading_fetch.py
import threading
import requests
import time
urls = [
'https://jsonplaceholder.typicode.com/posts/1',
'https://jsonplaceholder.typicode.com/posts/2',
'https://jsonplaceholder.typicode.com/posts/3',
'https://jsonplaceholder.typicode.com/posts/4',
'https://jsonplaceholder.typicode.com/posts/5',
'https://jsonplaceholder.typicode.com/posts/6',
'https://jsonplaceholder.typicode.com/posts/7',
'https://jsonplaceholder.typicode.com/posts/8',
'https://jsonplaceholder.typicode.com/posts/9',
'https://jsonplaceholder.typicode.com/posts/10',
]
results = []

def fetch_url(url):
    try:
        response = requests.get(url, timeout=5)
        results.append(response.json())
    except Exception as e:
        print(f"Error fetching {url}: {e}")

start = time.perf_counter()
threads = []
for url in urls:
    t = threading.Thread(target=fetch_url, args=(url,))
    threads.append(t)
    t.start()

for t in threads:
    t.join()

end = time.perf_counter()
print(f"Threading: {len(results)} results in {end - start:.2f}s")
Output:
Threading: 10 results in 1.23s
Threading launches 10 threads that execute concurrently. Since the work is I/O-bound (waiting for network responses), each thread releases the GIL while it blocks on the network, allowing other threads to run. The execution time is roughly the duration of the slowest request rather than the sum of all requests.
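A common alternative to managing Thread objects by hand is concurrent.futures.ThreadPoolExecutor. The sketch below simulates the requests with time.sleep so it runs offline; fake_fetch and the example.com URLs are hypothetical stand-ins.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url: str) -> dict:
    time.sleep(0.1)  # stand-in for requests.get(url).json()
    return {"url": url}

urls = [f"https://example.com/item/{i}" for i in range(10)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    # map() returns results in input order, so no shared list is needed.
    results = list(pool.map(fake_fetch, urls))
elapsed = time.perf_counter() - start
print(f"{len(results)} results in {elapsed:.2f}s")
```

The pool joins its threads when the with block exits, and returning values from the worker avoids mutating a shared list from several threads.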
Multiprocessing Approach
# multiprocessing_fetch.py
import multiprocessing
import requests
import time
urls = [
'https://jsonplaceholder.typicode.com/posts/1',
'https://jsonplaceholder.typicode.com/posts/2',
'https://jsonplaceholder.typicode.com/posts/3',
'https://jsonplaceholder.typicode.com/posts/4',
'https://jsonplaceholder.typicode.com/posts/5',
'https://jsonplaceholder.typicode.com/posts/6',
'https://jsonplaceholder.typicode.com/posts/7',
'https://jsonplaceholder.typicode.com/posts/8',
'https://jsonplaceholder.typicode.com/posts/9',
'https://jsonplaceholder.typicode.com/posts/10',
]
def fetch_url(url):
    try:
        response = requests.get(url, timeout=5)
        return response.json()
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return None

if __name__ == '__main__':
    start = time.perf_counter()
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(fetch_url, urls)
    end = time.perf_counter()
    print(f"Multiprocessing: {len([r for r in results if r])} results in {end - start:.2f}s")
Output:
Multiprocessing: 10 results in 2.15s
Multiprocessing creates separate Python processes with independent interpreters. For I/O-bound work like this, multiprocessing actually adds overhead due to process creation and data serialization. However, its strength emerges in CPU-bound tasks. Notice the required `if __name__ == '__main__'` guard. It is necessary because platforms that start workers with the spawn method (Windows, and macOS by default) re-import the main module in every child process; without the guard, each child would try to create its own pool.
Asyncio Approach
# asyncio_fetch.py
import asyncio
import aiohttp
import time
urls = [
'https://jsonplaceholder.typicode.com/posts/1',
'https://jsonplaceholder.typicode.com/posts/2',
'https://jsonplaceholder.typicode.com/posts/3',
'https://jsonplaceholder.typicode.com/posts/4',
'https://jsonplaceholder.typicode.com/posts/5',
'https://jsonplaceholder.typicode.com/posts/6',
'https://jsonplaceholder.typicode.com/posts/7',
'https://jsonplaceholder.typicode.com/posts/8',
'https://jsonplaceholder.typicode.com/posts/9',
'https://jsonplaceholder.typicode.com/posts/10',
]
async def fetch_url(session, url):
    try:
        # aiohttp expects a ClientTimeout object rather than a bare number
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as response:
            return await response.json()
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return None
async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_url(session, url) for url in urls]
        return await asyncio.gather(*tasks)
start = time.perf_counter()
results = asyncio.run(fetch_all(urls))
end = time.perf_counter()
print(f"Asyncio: {len([r for r in results if r])} results in {end - start:.2f}s")
Output:
Asyncio: 10 results in 1.05s
Asyncio delivers the fastest performance for this I/O-bound task. It runs a single event loop that explicitly yields control when awaiting operations, allowing many concurrent operations with minimal overhead. The performance advantage comes from the lightweight nature of coroutines compared to threads or processes.
The GIL: one thread runs at a time, no matter how many cores you have.
Understanding Python’s Concurrency Models
Before choosing between these approaches, you need to understand what each one actually does. They’re not just different ways to solve the same problem — they have fundamentally different execution models, and those differences determine where each excels.
The Three Execution Models Explained
Threading creates multiple threads within a single Python process. All threads share the same memory space and run under control of Python’s Global Interpreter Lock (GIL). When a thread performs I/O (network request, file read, database query), it releases the GIL, allowing other threads to execute. When a thread performs CPU work, it holds the GIL and other threads cannot run. Threads are lightweight and fast to create, ideal for I/O-bound work but unsuitable for CPU-bound work.
Multiprocessing creates multiple independent Python processes, each with its own interpreter and memory space. Since each process has its own GIL, they can truly run in parallel on multiple CPU cores. Processes are heavyweight and slow to create compared to threads, and sharing data between processes requires serialization overhead. However, for CPU-bound work where you need to use multiple cores, multiprocessing is essential.
Asyncio runs everything in a single thread using an event loop that explicitly manages concurrency. When an async function awaits an I/O operation, control returns to the event loop, which can run other awaiting functions. All “concurrency” is actually cooperative multitasking within a single thread. This model is extremely lightweight and efficient for I/O-bound work but cannot utilize multiple cores for CPU work.
Comparison Table
| Aspect | Threading | Multiprocessing | Asyncio |
| --- | --- | --- | --- |
| Process Count | 1 process, N threads | N separate processes | 1 process, 1 thread |
| Memory Overhead | Low (threads share memory) | High (separate interpreters) | Very Low (coroutines only) |
| Creation Cost | Fast | Slow | Very Fast |
| True Parallelism | No (GIL prevents it) | Yes (separate interpreters) | No (cooperative scheduling) |
| I/O-Bound Performance | Good | Poor (overhead) | Excellent |
| CPU-Bound Performance | Poor (GIL contention) | Excellent (true parallelism) | Poor (single thread) |
| Data Sharing | Direct (but thread-safe required) | Serialization required | No sharing needed (single thread) |
| Debugging Difficulty | Hard (race conditions) | Hard (deadlocks, serialization) | Easy (single-threaded debugging) |
Understanding the Global Interpreter Lock (GIL)
The GIL is the most critical concept for understanding when to use threading versus multiprocessing in Python. The Global Interpreter Lock is a mutex (mutual exclusion lock) that protects access to Python objects in CPython. Only one thread can hold the GIL at a time, meaning only one thread can execute Python bytecode at any moment, regardless of how many CPU cores you have.
This design choice was made in the early 1990s to simplify memory management in CPython. Reference counting is simple and effective, but it’s not thread-safe. Without the GIL, every single reference count modification would need its own lock, creating massive performance overhead. The GIL trades the potential for parallelism on multi-core systems for simplicity and speed of the single-threaded case.
The crucial point: the GIL is released during I/O operations. When a thread makes a system call for network I/O, file I/O, or similar operations, the GIL is released, allowing other threads to run. This is why threading works well for I/O-bound code. But when a thread is executing Python code (doing calculations, processing data), it holds the GIL exclusively.
For CPU-bound tasks, threading doesn’t help and can even hurt performance due to GIL contention. Thread switching adds overhead, and all threads are still competing for the single GIL. This is when multiprocessing becomes necessary — separate processes each have their own GIL, enabling true parallelism on multiple cores.
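A rough way to observe this is to time the same pure-Python loop run twice back to back and then once in each of two threads. The sketch below is illustrative only: `count_down`, `time_sequential`, `time_threaded`, and the loop size are hypothetical choices, and the timings vary by machine and by Python build.

```python
import threading
import time


def count_down(n):
    """Pure-Python CPU work: the running thread holds the GIL during this loop."""
    while n > 0:
        n -= 1
    return n


def time_sequential(n, repeats=2):
    """Run the loop `repeats` times back to back and return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        count_down(n)
    return time.perf_counter() - start


def time_threaded(n, workers=2):
    """Run the loop once in each of `workers` threads and return elapsed seconds."""
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(workers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 5_000_000  # arbitrary size, chosen only so the loop takes a measurable time
    print(f"Sequential (2 runs):  {time_sequential(n):.2f}s")
    print(f"Threaded (2 threads): {time_threaded(n):.2f}s")
```

On a standard CPython build the two timings usually land close together, because only one thread executes bytecode at a time.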
Threading: Perfect for I/O-Bound Tasks
How Threading Works
Threading allows multiple threads to exist within a single Python process. Threads share memory, making data exchange simple but requiring careful synchronization to prevent race conditions. The operating system’s scheduler handles thread switching, which can happen at any time unless the GIL prevents it.
Here’s a practical example that demonstrates threading’s strengths with I/O-bound work:
# threading_io_example.py
import threading
import requests
import time
from urllib.parse import urljoin
base_url = 'https://jsonplaceholder.typicode.com'
endpoints = [f'/posts/{i}' for i in range(1, 11)]
def fetch_with_requests(endpoint, results_dict):
    """Fetch data from an endpoint and store the status in a shared dictionary."""
    url = urljoin(base_url, endpoint)
    try:
        response = requests.get(url, timeout=5)
        # Each thread writes to its own key, so no lock is needed here
        results_dict[endpoint] = response.status_code
        print(f"[Thread] Fetched {endpoint}: {response.status_code}")
    except Exception as e:
        results_dict[endpoint] = f"Error: {e}"
start = time.perf_counter()
results = {}
threads = []
for endpoint in endpoints:
    t = threading.Thread(target=fetch_with_requests, args=(endpoint, results))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
end = time.perf_counter()
print(f"\nCompleted {len(results)} requests in {end - start:.2f} seconds")
print(f"All results: {results}")
In a typical run, all 10 requests complete in roughly 1.25 seconds, compared with 12 or more seconds when run sequentially. This is threading’s strength: while one thread waits for a network response, other threads can execute.
Thread Synchronization and Safety
When multiple threads share data, you must ensure thread safety. Here’s an example using a Lock to protect shared state:
# threading_lock_example.py
import threading
import time
class Counter:
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()
    def increment_unsafe(self):
        """This can lose updates due to race condition."""
        temp = self.value
        time.sleep(0.0001)  # Simulate some work
        self.value = temp + 1
    def increment_safe(self):
        """This is thread-safe."""
        with self.lock:
            temp = self.value
            time.sleep(0.0001)
            self.value = temp + 1
# Test unsafe version
counter_unsafe = Counter()
threads = []
for _ in range(100):
    t = threading.Thread(target=counter_unsafe.increment_unsafe)
    threads.append(t)
    t.start()
for t in threads:
    t.join()
print(f"Unsafe result: {counter_unsafe.value} (expected 100)")
# Test safe version
counter_safe = Counter()
threads = []
for _ in range(100):
    t = threading.Thread(target=counter_safe.increment_safe)
    threads.append(t)
    t.start()
for t in threads:
    t.join()
print(f"Safe result: {counter_safe.value} (expected 100)")
Without the lock, the unsafe version loses updates because multiple threads read the same value before any thread writes back the increment. The lock ensures that only one thread can modify the counter at a time.
Thread Pools for Controlled Concurrency
Creating thousands of threads is inefficient. Instead, use ThreadPoolExecutor to limit the number of concurrent threads:
# threading_pool_example.py
from concurrent.futures import ThreadPoolExecutor
import requests
import time
urls = [f'https://jsonplaceholder.typicode.com/posts/{i}' for i in range(1, 51)]
def fetch_url(url):
    try:
        response = requests.get(url, timeout=5)
        return response.status_code
    except Exception as e:
        return str(e)
start = time.perf_counter()
# Use maximum of 10 threads
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(fetch_url, urls))
end = time.perf_counter()
success_count = sum(1 for r in results if r == 200)
print(f"ThreadPoolExecutor: {success_count}/{len(urls)} successful in {end - start:.2f}s")
Output:
ThreadPoolExecutor: 50/50 successful in 5.12s
ThreadPoolExecutor manages a pool of worker threads, queuing tasks and executing them as threads become available. This prevents resource exhaustion from creating too many threads.
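When you want results as soon as each task finishes rather than in submission order, `submit()` combined with `as_completed()` is a common alternative to `map()`. The following is a minimal sketch; `fake_io` and `run_tasks` are hypothetical names, and `time.sleep` stands in for a real blocking call so the example runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_io(task_id, delay):
    """Stand-in for a blocking I/O call (hypothetical): sleeps instead of using the network."""
    time.sleep(delay)
    return task_id, delay


def run_tasks(delays, max_workers=3):
    """Submit one task per delay and collect results in completion order."""
    finished = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fake_io, i, d) for i, d in enumerate(delays)]
        for future in as_completed(futures):
            # result() re-raises any exception raised inside the worker
            finished.append(future.result())
    return finished


if __name__ == "__main__":
    for task_id, delay in run_tasks([0.3, 0.1, 0.2]):
        print(f"task {task_id} finished after {delay}s")
```

Because `future.result()` re-raises worker exceptions, this pattern also gives you one place to handle per-task failures.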
When to Use Threading
Use threading when:
Your program is I/O-bound (network requests, file operations, database queries)
You need lightweight concurrent tasks
You want to keep implementation simple with shared memory
You’re building a server that handles many concurrent clients
Do NOT use threading when:
Your tasks are CPU-bound (calculations, data processing)
You have many threads competing for the GIL
You need true parallelism on multiple cores
You’re doing heavy computation that would benefit from multiple CPU cores
Threading shines when your code spends most of its time waiting on I/O.
Multiprocessing: Harnessing Multiple CPU Cores
How Multiprocessing Works
Multiprocessing creates completely separate Python processes. Each process has its own interpreter, memory space, and Global Interpreter Lock. This enables true parallelism — different processes can run simultaneously on different CPU cores. The tradeoff is overhead: processes are expensive to create and require serialization to share data.
Here’s a comparison showing multiprocessing’s advantage for CPU-bound work:
# multiprocessing_cpu_example.py
import multiprocessing
import time
import math
def compute_factorial(n):
    """CPU-bound work: compute factorial."""
    result = math.factorial(n)
    return n, result
numbers = [5000, 6000, 7000, 8000, 9000, 10000, 11000, 12000]
if __name__ == '__main__':
    # All timing code lives under the guard so that spawned workers
    # do not repeat it when they re-import this module.
    # Sequential approach
    start = time.perf_counter()
    sequential_results = [compute_factorial(n) for n in numbers]
    sequential_time = time.perf_counter() - start
    # Multiprocessing approach
    start = time.perf_counter()
    with multiprocessing.Pool(processes=4) as pool:
        mp_results = pool.map(compute_factorial, numbers)
    mp_time = time.perf_counter() - start
    print(f"Sequential: {sequential_time:.2f}s")
    print(f"Multiprocessing (4 processes): {mp_time:.2f}s")
    print(f"Speedup: {sequential_time / mp_time:.2f}x")
    print(f"Computed {len(mp_results)} factorials")
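In the original test on a 4-core system, the multiprocessing version achieved a 3.65x speedup. Your numbers will vary with hardware and with how heavy each task is: when a single task finishes in milliseconds, process startup and result serialization can outweigh the gain. Threading would provide no speedup for this CPU-bound work because of the GIL, so multiprocessing is the tool to reach for.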
Process Pools and Task Distribution
Like ThreadPoolExecutor, ProcessPoolExecutor manages a pool of worker processes:
# multiprocessing_pool_example.py
from concurrent.futures import ProcessPoolExecutor
import time
def prime_count(n):
    """Count primes up to n (CPU-bound)."""
    count = 0
    for num in range(2, n):
        if all(num % i != 0 for i in range(2, int(num**0.5) + 1)):
            count += 1
    return n, count
numbers = [10000, 20000, 30000, 40000, 50000]
if __name__ == '__main__':
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(prime_count, numbers))
    elapsed = time.perf_counter() - start
    for n, count in results:
        print(f"Primes up to {n}: {count}")
    print(f"\nCompleted in {elapsed:.2f}s")
Output:
Primes up to 10000: 1229
Primes up to 20000: 2262
Primes up to 30000: 3245
Primes up to 40000: 4203
Primes up to 50000: 5133
Completed in 4.18s
Sharing Data Between Processes
Data sharing between processes requires explicit mechanisms. Here’s an example using Queue:
# multiprocessing_queue_example.py
import multiprocessing
import time
def worker(queue, results_queue):
    """Worker process that reads from queue and writes results."""
    while True:
        item = queue.get()
        if item is None:
            break
        value, power = item
        result = value ** power
        results_queue.put((value, power, result))
if __name__ == '__main__':
    task_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()
    # Start worker processes
    num_workers = 2
    processes = []
    for _ in range(num_workers):
        p = multiprocessing.Process(target=worker, args=(task_queue, result_queue))
        p.start()
        processes.append(p)
    # Queue some tasks
    tasks = [(2, 10), (3, 10), (4, 10), (5, 10), (6, 10)]
    for task in tasks:
        task_queue.put(task)
    # Signal end of work
    for _ in range(num_workers):
        task_queue.put(None)
    # Collect results
    results = []
    for _ in range(len(tasks)):
        results.append(result_queue.get())
    # Wait for processes to finish
    for p in processes:
        p.join()
    print("Results:")
    for value, power, result in sorted(results):
        print(f"{value}^{power} = {result}")
Queues are thread-safe and process-safe, making them ideal for inter-process communication. Data is automatically serialized when passing through queues.
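That automatic serialization relies on pickle, so anything you put on a queue (or pass to a pool) has to be picklable. The sketch below checks a few objects with a hypothetical helper, `can_pickle`; it is meant only to illustrate the constraint, not to reproduce how multiprocessing works internally.

```python
import pickle


def square(x):
    """Module-level function: pickled by reference to its importable name."""
    return x * x


def can_pickle(obj):
    """Return True if `obj` survives a pickle round trip, False otherwise."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


if __name__ == "__main__":
    print("tuple of ints:        ", can_pickle((2, 10)))
    print("module-level function:", can_pickle(square))
    print("lambda:               ", can_pickle(lambda x: x * x))
```

Plain data such as tuples, lists, and dicts of basic types passes through without trouble, while lambdas are rejected, which is why worker functions are normally defined at module level.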
When to Use Multiprocessing
Use multiprocessing when:
Your tasks are CPU-bound (calculations, data processing)
You need to utilize multiple CPU cores
You have tasks that benefit from true parallelism
You’re willing to accept the overhead of process creation and serialization
Do NOT use multiprocessing when:
Your tasks are I/O-bound (it’s slower than threading due to overhead)
You need frequent inter-process communication (serialization overhead)
You need shared memory with fast access
You’re running on a system with limited resources (processes are heavyweight)
Four cores, four independent processes. The GIL cannot follow you here.
Asyncio: Lightweight Concurrency for I/O Operations
How Asyncio Works
Asyncio runs an event loop in a single thread. When you call an async function, it doesn’t run immediately — it returns a coroutine object. The event loop schedules coroutines and executes them. When a coroutine awaits an I/O operation (network request, file read), it yields control back to the event loop, which can now run other coroutines. Once the awaited operation completes, the coroutine resumes.
This cooperative multitasking is extremely efficient because context switching happens only at explicit await points, eliminating much of the overhead of thread switching.
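A minimal sketch of this behavior, assuming three tasks that simulate I/O by awaiting `asyncio.sleep` for 2, 3, and 1 seconds. The names `task` and `main` and the delay values are illustrative assumptions.

```python
import asyncio
import time


async def task(name, delay):
    """Simulate an I/O wait; awaiting asyncio.sleep yields control to the event loop."""
    print(f"{name} started")
    await asyncio.sleep(delay)
    print(f"{name} finished after {delay}s")
    return f"{name} result"


async def main(delays=(2, 3, 1)):
    """Run one task per delay concurrently and return results in input order."""
    start = time.perf_counter()
    # Calling task(...) only creates coroutine objects; gather schedules them on the loop
    coroutines = [task(f"Task {i}", d) for i, d in enumerate(delays, start=1)]
    results = await asyncio.gather(*coroutines)
    elapsed = time.perf_counter() - start
    print(f"All tasks completed in {elapsed:.2f}s")
    print(f"Results: {results}")
    return results


if __name__ == "__main__":
    asyncio.run(main())
```

Running it prints output along these lines: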
Task 1 started
Task 2 started
Task 3 started
Task 3 finished after 1s
Task 1 finished after 2s
Task 2 finished after 3s
All tasks completed in 3.02s
Results: ['Task 1 result', 'Task 2 result', 'Task 3 result']
All three tasks ran concurrently, completing in 3 seconds (the duration of the longest task) rather than 6 seconds if run sequentially. This demonstrates asyncio’s efficiency: minimal overhead, lightweight coroutines, true concurrency.
Async/Await Patterns
Here’s a practical example using aiohttp for concurrent HTTP requests:
# asyncio_aiohttp_example.py
import asyncio
import aiohttp
import time
urls = [f'https://jsonplaceholder.typicode.com/posts/{i}' for i in range(1, 21)]
async def fetch_post(session, url):
    """Fetch a single post."""
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as response:
            data = await response.json()
            return {'status': response.status, 'id': data['id']}
    except Exception as e:
        return {'status': 'error', 'message': str(e)}
async def fetch_all_posts(urls):
    """Fetch all posts concurrently."""
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_post(session, url) for url in urls]
        return await asyncio.gather(*tasks)
if __name__ == '__main__':
    start = time.perf_counter()
    results = asyncio.run(fetch_all_posts(urls))
    elapsed = time.perf_counter() - start
    success = sum(1 for r in results if r['status'] == 200)
    print(f"Fetched {success}/{len(urls)} posts in {elapsed:.2f}s")
    print(f"Sample results: {results[:3]}")
Decision Framework: Choosing Your Concurrency Model
Once you understand how each model works, the choice becomes clear. Use this decision framework:
Step 1: Identify your workload type
CPU-Bound: Calculations, data processing, algorithms — use Multiprocessing
I/O-Bound: Network requests, file operations, databases — choose between Threading or Asyncio
Step 2: For I/O-Bound work, choose between Threading and Asyncio
Asyncio: Preferred for new code. Use it when you can make your code async/await compatible; it offers better performance and scalability at high concurrency. It requires async-capable libraries such as aiohttp or asyncpg.
Threading: Use when integrating with blocking libraries that don’t offer async alternatives. Simpler code if you only have a few concurrent tasks. Good for mixing sync and async code temporarily.
Step 3: For CPU-Bound work combined with I/O
Use Asyncio + ProcessPoolExecutor to run CPU-bound tasks in separate processes while keeping I/O in the main event loop, or
Use Multiprocessing with inter-process communication for the entire pipeline
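The first option in Step 3 can be sketched with `loop.run_in_executor()`, which hands a function to an executor and returns an awaitable. This is a sketch under stated assumptions: `cpu_heavy`, `fetch`, and `pipeline` are hypothetical names, and the sleep and the sum of squares stand in for real I/O and real computation.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def cpu_heavy(n):
    """Stand-in for CPU-bound work (hypothetical): sum of squares below n."""
    return sum(i * i for i in range(n))


async def fetch(item):
    """Stand-in for an I/O call (hypothetical): sleeps instead of hitting the network."""
    await asyncio.sleep(0.01)
    return item


async def pipeline(items, executor):
    """Fetch items concurrently, then offload the CPU step to `executor`."""
    loop = asyncio.get_running_loop()
    fetched = await asyncio.gather(*(fetch(i) for i in items))
    # run_in_executor returns awaitables, so the event loop stays free
    # to service other coroutines while the workers compute
    jobs = [loop.run_in_executor(executor, cpu_heavy, n) for n in fetched]
    return await asyncio.gather(*jobs)


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        print(asyncio.run(pipeline([10, 100, 1000], pool)))
```

Passing `None` as the executor falls back to the event loop's default thread pool, which is convenient for quick experiments but gives no parallelism for CPU-bound work.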
Decision Flowchart
Is your main work CPU-bound?
├─ YES --> Multiprocessing
└─ NO --> Is your code already written in async style?
   ├─ YES --> Asyncio
   ├─ NO --> Do you have blocking libraries?
   │  ├─ YES --> Threading
   │  └─ NO --> Consider refactoring to Asyncio
   └─ Would high concurrency (1000s) help?
      ├─ YES --> Asyncio
      └─ NO --> Threading (simpler)
Performance Benchmarks: Real Numbers
Let’s benchmark all three approaches on the same hardware with consistent test cases:
Benchmark 1: I/O-Bound Work (HTTP Requests)
# benchmark_io.py
import asyncio
import time
import requests
import aiohttp
from concurrent.futures import ThreadPoolExecutor
urls = [f'https://httpbin.org/delay/0.5?id={i}' for i in range(20)]
# Threading
def fetch_sync(url):
    requests.get(url, timeout=10)
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as executor:
    list(executor.map(fetch_sync, urls))
threading_time = time.perf_counter() - start
# Asyncio
async def fetch_async(session, url):
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as r:
        await r.read()
async def benchmark_asyncio():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_async(session, url) for url in urls]
        await asyncio.gather(*tasks)
start = time.perf_counter()
asyncio.run(benchmark_asyncio())
asyncio_time = time.perf_counter() - start
print(f"Threading: {threading_time:.2f}s")
print(f"Asyncio: {asyncio_time:.2f}s")
print(f"Speedup: {threading_time / asyncio_time:.2f}x")
Typical Output (on modern hardware with 10 concurrent operations across 20 URLs):
Threading: 2.15s
Asyncio: 1.98s
Speedup: 1.09x
For modest concurrency, the difference is small. Asyncio’s advantage grows with higher concurrency (thousands of requests), where thread overhead becomes prohibitive.
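At that scale you usually still want an upper bound on how many operations are in flight at once. One common approach, sketched here with simulated latency instead of real requests, is an `asyncio.Semaphore`. The names `bounded_call` and `run_many` and the default sizes are assumptions made for illustration.

```python
import asyncio


async def bounded_call(i, semaphore, state):
    """Simulated request (hypothetical) that runs only while holding the semaphore."""
    async with semaphore:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stand-in for network latency
        state["active"] -= 1
        return i


async def run_many(total=2000, limit=200):
    """Start `total` coroutines but allow at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(bounded_call(i, semaphore, state) for i in range(total))
    )
    return results, state["peak"]


if __name__ == "__main__":
    results, peak = asyncio.run(run_many())
    print(f"{len(results)} calls completed, peak concurrency {peak}")
```

Creating thousands of coroutines this way is cheap, since each one is a small object scheduled on a single thread, whereas the equivalent number of OS threads would each need their own stack.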
Benchmark 2: CPU-Bound Work (Factorial Calculations)
# benchmark_cpu.py
import time
import math
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
numbers = [5000 + i*500 for i in range(16)]
def compute(n):
    return math.factorial(n)
if __name__ == '__main__':
    # All timing code lives under the guard so that spawned workers
    # do not rerun the benchmarks when they re-import this module.
    # Sequential (baseline)
    start = time.perf_counter()
    for n in numbers:
        math.factorial(n)
    sequential_time = time.perf_counter() - start
    # Threading (won't help due to GIL)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as executor:
        list(executor.map(compute, numbers))
    threading_time = time.perf_counter() - start
    # Multiprocessing
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as executor:
        list(executor.map(compute, numbers))
    multiprocessing_time = time.perf_counter() - start
    print(f"Sequential: {sequential_time:.2f}s")
    print(f"Threading: {threading_time:.2f}s (no improvement)")
    print(f"Multiprocessing: {multiprocessing_time:.2f}s")
    print(f"Speedup: {sequential_time / multiprocessing_time:.2f}x")
Typical Output (on 4-core system):
Sequential: 8.45s
Threading: 8.51s (no improvement)
Multiprocessing: 2.31s
Speedup: 3.66x
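Threading provides no benefit for CPU-bound work; it is slightly slower here because of context-switching overhead. Multiprocessing delivers a near-linear speedup across the four worker processes. Treat these figures as illustrative: actual timings depend on hardware and on how heavy each task is, and when individual tasks are very short, process startup and serialization can erase the gain.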
Real-Life Example: Building a Web Scraper Pipeline
Let’s build a complete example that combines asyncio and multiprocessing with plain synchronous code. We’ll create a data pipeline that fetches URLs (I/O), processes the returned content (CPU work), and stores results (kept synchronous for simplicity):
# web_scraper_pipeline.py
import asyncio
import aiohttp
import time
from multiprocessing import Pool
from html.parser import HTMLParser
from collections import defaultdict
# URLs to scrape
test_urls = [
f'https://jsonplaceholder.typicode.com/posts/{i}'
for i in range(1, 21)
]
# Simple parser to count words
class WordCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.words = []
    def handle_data(self, data):
        self.words.extend(data.split())
    def count(self):
        return len(self.words)
# CPU-bound: process HTML
def process_html(html_content):
    """Process HTML and extract metrics."""
    parser = WordCounter()
    try:
        parser.feed(html_content)
        return {
            'word_count': parser.count(),
            'success': True
        }
    except Exception as e:
        return {'error': str(e), 'success': False}
# I/O-bound: fetch URLs with asyncio
async def fetch_and_process(session, url):
    """Fetch URL and return raw data."""
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as response:
            content = await response.text()
            return {'url': url, 'content': content, 'status': response.status}
    except Exception as e:
        return {'url': url, 'error': str(e), 'status': 'error'}
async def fetch_all(urls):
    """Fetch all URLs concurrently."""
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_and_process(session, url) for url in urls]
        return await asyncio.gather(*tasks)
def process_pipeline():
    """Full pipeline: Fetch (asyncio) -> Process (multiprocessing) -> Store (simple)."""
    print("Pipeline starting...")
    start = time.perf_counter()
    # Stage 1: Fetch with asyncio
    print("Stage 1: Fetching URLs with asyncio...")
    fetch_start = time.perf_counter()
    fetch_results = asyncio.run(fetch_all(test_urls))
    fetch_time = time.perf_counter() - fetch_start
    print(f" Fetched {len(fetch_results)} URLs in {fetch_time:.2f}s")
    # Stage 2: Process with multiprocessing
    print("Stage 2: Processing HTML with multiprocessing...")
    process_start = time.perf_counter()
    contents = [r['content'] for r in fetch_results if 'content' in r]
    with Pool(processes=4) as pool:
        process_results = pool.map(process_html, contents)
    process_time = time.perf_counter() - process_start
    print(f" Processed {len(process_results)} documents in {process_time:.2f}s")
    # Stage 3: Store results (simplified - just print stats)
    print("Stage 3: Storing results...")
    stats = defaultdict(int)
    for result in process_results:
        if result['success']:
            stats['processed'] += 1
            stats['total_words'] += result['word_count']
        else:
            stats['failed'] += 1
    total_time = time.perf_counter() - start
    print(f"\n=== Pipeline Complete ===")
    print(f"Documents processed: {stats['processed']}")
    print(f"Failed: {stats['failed']}")
    print(f"Total words: {stats['total_words']}")
    print(f"Total time: {total_time:.2f}s")
    print(f" Fetching: {fetch_time:.2f}s ({fetch_time/total_time*100:.1f}%)")
    print(f" Processing: {process_time:.2f}s ({process_time/total_time*100:.1f}%)")
if __name__ == '__main__':
    process_pipeline()
Output:
Pipeline starting...
Stage 1: Fetching URLs with asyncio...
Fetched 20 URLs in 1.23s
Stage 2: Processing HTML with multiprocessing...
Processed 20 documents in 0.31s
Stage 3: Storing results...
=== Pipeline Complete ===
Documents processed: 20
Failed: 0
Total words: 4523
Total time: 1.61s
Fetching: 1.23s (76.4%)
Processing: 0.31s (19.3%)
This example shows a real-world pattern: use asyncio for I/O (fetching URLs), multiprocessing for CPU-bound work (processing HTML), and keep synchronous code for simple tasks (storing results). Each tool handles what it’s best at.
asyncio fetches, multiprocessing crunches, sync code stores. Each tool where it belongs.
Frequently Asked Questions
Q1: Can I mix threading and multiprocessing in the same application?
Yes, and sometimes it’s the optimal approach. For example, use multiprocessing for CPU-intensive work and threading within each process for I/O coordination. However, be careful with synchronization — mixing locks across process boundaries requires additional care. Use queues for inter-process communication rather than shared memory locks.
Q2: What’s the maximum number of threads I should use?
For I/O-bound work, a thread pool of 5-20 threads is a reasonable range, depending on your I/O latency. With short-lived I/O operations, 10-20 threads usually suffice; with longer waits, you may need more. A sensible starting point is `min(32, os.cpu_count() + 4)`, the default `max_workers` that ThreadPoolExecutor has used since Python 3.8; tune from there based on profiling. Thousands of threads will degrade performance due to context-switching and memory overhead.
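As a small sketch, the formula above can be wrapped in a helper and passed to the pool explicitly. `default_worker_count` is a hypothetical name; the `or 1` fallback covers the case where `os.cpu_count()` returns `None`.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def default_worker_count():
    """Starting point from the formula above; os.cpu_count() may return None."""
    return min(32, (os.cpu_count() or 1) + 4)


if __name__ == "__main__":
    workers = default_worker_count()
    print(f"Starting point: {workers} worker threads")
    with ThreadPoolExecutor(max_workers=workers) as executor:
        print(list(executor.map(str.upper, ["a", "b", "c"])))
```

Making the number explicit gives you a single place to adjust it once profiling shows whether the pool is too small or too large.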
Q3: Is asyncio faster than threading?
For I/O-bound work, asyncio is typically faster due to lower overhead from coroutines versus threads. However, the difference may be small unless you’re handling very high concurrency (1000+ concurrent operations). For low concurrency (< 100 operations), threading is simple and performant enough. For very high concurrency, asyncio's advantages become substantial.
Q4: How do I convert blocking code to async?
If you have blocking I/O code (requests.get, database queries, file operations), look for async alternatives (aiohttp, asyncpg, aiofiles). Many libraries now offer async variants; check the documentation. Where no async alternative exists, `loop.run_in_executor()` lets you run the blocking call in a thread pool, or heavy computation in a process pool, without stalling the event loop. If most of your code depends on a blocking library, plain threading is often simpler than asyncio.
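On Python 3.9 and later, `asyncio.to_thread()` is a convenient way to call a blocking function from async code. The sketch below assumes a hypothetical `blocking_read` function that sleeps in place of a real blocking library call.

```python
import asyncio
import time


def blocking_read(name, delay):
    """Stand-in for a blocking library call (hypothetical)."""
    time.sleep(delay)
    return f"{name} done"


async def main(delay=0.2):
    """Run two blocking calls concurrently without blocking the event loop."""
    start = time.perf_counter()
    results = await asyncio.gather(
        asyncio.to_thread(blocking_read, "first", delay),
        asyncio.to_thread(blocking_read, "second", delay),
    )
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = asyncio.run(main())
    print(results, f"{elapsed:.2f}s")
```

Both calls run in worker threads, so the total time should be close to one delay rather than two.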
Q5: Will the GIL ever be removed?
In 2023, the Python Steering Council accepted PEP 703, which makes the GIL optional in CPython. Python 3.13 ships the first result: an experimental “free-threaded” build that allows true parallelism with threads. It is a separate, opt-in build and carries a performance cost for single-threaded code. For now, multiprocessing remains the dependable solution for CPU-bound parallelism. Watch later releases to see whether free-threading becomes stable.
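To find out which kind of interpreter you are running, you can inspect the build configuration and, on Python 3.13 and later, ask the runtime directly. This is a small sketch; `gil_status` is a hypothetical helper name, and the `getattr` fallback is there because `sys._is_gil_enabled()` does not exist on older versions.

```python
import sys
import sysconfig


def gil_status():
    """Report whether this build supports free threading and whether the GIL is active."""
    # Py_GIL_DISABLED is set to 1 on free-threaded builds (Python 3.13+)
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on 3.13+; older versions always run with the GIL
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = check() if check is not None else True
    return free_threaded_build, gil_enabled


if __name__ == "__main__":
    free_threaded, gil_enabled = gil_status()
    print(f"Free-threaded build: {free_threaded}")
    print(f"GIL currently enabled: {gil_enabled}")
```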
Q6: What’s the difference between multiprocessing.Pool and ProcessPoolExecutor?
Both provide similar functionality, but ProcessPoolExecutor (from concurrent.futures) is more modern and has a cleaner API. It’s the recommended approach for new code. multiprocessing.Pool is lower-level and gives more control, useful for complex scenarios. For most cases, use ProcessPoolExecutor.
Conclusion: Making the Right Choice
Concurrency in Python is simpler than it appears once you understand the fundamental differences between threading, multiprocessing, and asyncio:
Threading is your go-to for I/O-bound work that needs to integrate with blocking libraries. It’s simple, familiar, and effective for modest concurrency.
Multiprocessing is essential for CPU-bound work where you need to utilize multiple cores. Accept the overhead and reap the performance gains.
Asyncio is the future-proof choice for I/O-bound work. It scales better than threading and integrates with an ever-growing ecosystem of async libraries. Use it whenever possible for new projects.
Start by identifying whether your bottleneck is I/O or CPU. From there, the choice becomes straightforward. When in doubt, begin with asyncio for I/O-bound work and multiprocessing for CPU-bound work. Profile your actual application to see where time is spent, and let the numbers guide your optimization efforts.
Imagine you’re working on three different Python projects on the same computer. Project A needs Flask 2.0, Project B requires Flask 3.0, and Project C needs an older version of NumPy that’s incompatible with the newer Flask. Without virtual environments, you’d face a version conflict nightmare—installing one project’s dependency breaks another. This is dependency hell, and it’s one of the most common pain points for Python developers.
Virtual environments solve this problem by creating isolated Python installations on your system. Each project gets its own sandbox with its own packages, versions, and dependencies. Think of them as separate workspaces where you can install whatever you need without affecting other projects. Python’s ecosystem has evolved multiple solutions to this problem: the built-in venv, the scientific ecosystem standard conda, and the newer, lightning-fast uv. Understanding when and how to use each is essential for professional Python development.
This guide walks through all three approaches, from basics to best practices. By the end, you’ll know exactly which tool to use for your next project and how to manage dependencies like a professional developer. We’ll cover practical workflows, integration with IDEs, and solutions to common problems you’ll encounter in the real world.
Quick Example: Create and Activate in 3 Commands
If you just want to get started immediately, here’s the fastest path. These three commands create a new virtual environment, activate it, and install a package:
$ python -m venv myproject-env
$ source myproject-env/bin/activate
(myproject-env) $ pip install <package-name>
Your prompt now shows (myproject-env), indicating you’re inside the virtual environment. Any packages you install now are isolated to this environment. When you’re done working on the project, deactivate it:
# deactivate_venv.sh
$ deactivate
# Output:
# $
That’s the core concept. Now let’s understand what’s happening under the hood and explore the three major tools available to you.
[IMAGE_PLACEHOLDER: A fork in a road with three paths labeled venv, conda, and uv, with a developer standing at the starting point. Caption: “Choosing your virtual environment path depends on your project needs and ecosystem.”]
uv installs faster than your reflexes.
What Are Virtual Environments?
A virtual environment is a directory structure that contains a Python interpreter and a separate set of installed packages. When you activate a virtual environment, your shell’s PATH is modified to prioritize the environment’s Python and pip executables. This simple trick creates complete isolation between projects.
Here’s what a virtual environment contains:
bin/ (or Scripts/ on Windows): Python executable, pip, and installed package scripts
lib/: Site-packages directory containing all installed packages
pyvenv.cfg: Configuration file pointing to the base Python installation
include/: C header files for packages with C extensions
When you activate an environment, your shell looks in these directories first, before checking your system Python installation. This allows different projects to have conflicting package versions without interfering with each other.
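You can inspect this layout yourself with the standard library's `venv` module, which is what `python -m venv` runs. The sketch below creates a throwaway environment in a temporary directory and lists its top-level entries; `create_and_list` and the `demo-env` name are hypothetical, and `with_pip=False` is used only to keep the demo fast.

```python
import tempfile
import venv
from pathlib import Path


def create_and_list(env_dir):
    """Create a virtual environment at env_dir and return its top-level entries."""
    # with_pip=False skips bootstrapping pip, which keeps this demo fast
    venv.create(env_dir, with_pip=False)
    return sorted(p.name for p in Path(env_dir).iterdir())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = Path(tmp) / "demo-env"  # hypothetical environment name
        print(create_and_list(env_dir))
        print((env_dir / "pyvenv.cfg").read_text())
```

The printed `pyvenv.cfg` includes a `home` entry that points back at the base Python installation the environment was created from.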
Python offers several tools for managing virtual environments. Here’s how they compare:
| Tool | Installation | Ecosystem | Speed | Learning Curve | Best For |
| --- | --- | --- | --- | --- | --- |
| venv | Built-in (Python 3.3+) | PyPI only | Good | Easy | General Python projects |
| conda | Separate install (Anaconda/Miniconda) | PyPI + Conda-Forge + Anaconda | Good | Medium | Data science, scientific computing |
| uv | Separate install | PyPI (Conda support coming) | Excellent | Easy | Modern Python projects, speed-focused |
| virtualenv | pip install virtualenv | PyPI only | Good | Easy | Legacy projects, advanced features |
| pipenv | pip install pipenv | PyPI only | Fair | Medium | Projects with reproducible locks |
venv: Python’s Built-In Solution
The venv module is the official Python virtual environment manager, included with Python 3.3 and later. It’s simple, lightweight, and requires no additional installation. For most general Python projects, venv is your go-to choice.
Create an environment with python -m venv my-workspace. The -m flag tells Python to run the venv module as a script. The directory name can be anything, but common conventions are .venv, venv, or env. Many developers use .venv (a hidden directory) to keep the project root clean.
Activating the environment:
# activate_venv.sh
# On Linux/Mac:
$ source my-workspace/bin/activate
# On Windows (PowerShell):
$ my-workspace\Scripts\Activate.ps1
# On Windows (Command Prompt):
$ my-workspace\Scripts\activate.bat
# Output (all platforms):
# (my-workspace) $
After activation, your shell prompt changes to show the environment name. This is your visual confirmation that you’re inside the isolated environment.
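The prompt is a visual cue, but a script can make the same check itself. A minimal sketch, assuming an environment created by venv or virtualenv (conda environments are organized differently and will not be detected this way); the function name is illustrative:

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```

A check like this is handy at the top of setup scripts that should refuse to install packages into the system Python.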
Recording which packages your project needs is critical for collaboration. Running pip freeze > requirements.txt exports the list of installed packages with their exact versions. When you’re done working, deactivate the environment:
# deactivate_env.sh
(my-workspace) $ deactivate
$
# You're back to your system Python
[IMAGE_PLACEHOLDER: A split-screen showing a system Python installation on the left and a venv directory tree on the right, with an arrow showing how PATH is redirected. Caption: “Virtual environments redirect your Python PATH to isolated directories.”]
One venv per project. Future you will say thanks.
conda: The Scientific Standard
Conda is a package and environment manager developed by Anaconda. It’s the de facto standard in data science, scientific computing, and machine learning because it handles not just Python packages, but also C libraries, CUDA libraries, and other system-level dependencies. If you work with NumPy, Pandas, TensorFlow, or PyTorch, conda is often the preferred choice.
Installing conda: Download and install Miniconda (lightweight) or Anaconda (full distribution). Miniconda is recommended because it’s smaller and lets you install only what you need.
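When you create an environment with conda create (for example, conda create --name data-science python=3.11), the --name flag gives your environment a human-readable name. You can also specify a Python version, as in that example; conda will install that version in the environment.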
Conda’s real power shines when you need compiled packages. It handles precompiled binaries for different platforms, avoiding compilation errors that pip sometimes encounters.
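Managing environments with a YAML file: The best practice for sharing conda environments is an environment.yml file. Export one with conda env export > environment.yml, and recreate the environment elsewhere with conda env create -f environment.yml. To list or remove environments: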
# conda_management.sh
$ conda env list
# Output:
# base /Users/username/miniconda3
# data-science * /Users/username/miniconda3/envs/data-science
$ conda remove --name data-science --all
# Output:
# Remove all packages in environment /Users/username/miniconda3/envs/data-science? [y/N] y
[IMAGE_PLACEHOLDER: A diagram showing conda connecting to multiple package repositories (Anaconda, Conda-Forge, PyPI) with arrows. Caption: “Conda bridges multiple package ecosystems, making it ideal for scientific Python workflows.”]
uv: The Modern, Ultra-Fast Alternative
uv is a Python package installer and resolver written in Rust by Astral, the developers of Ruff. It’s built for speed (typically 10-100x faster than pip, according to its developers) and is designed as a drop-in replacement for common pip workflows. If you’re starting a new project and want modern tooling with excellent performance, uv is worth serious consideration.
Installing uv:
# install_uv.sh
$ curl -LsSf https://astral.sh/uv/install.sh | sh
# On Windows (PowerShell):
$ powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
# Verify installation:
$ uv --version
# Output:
# uv 0.1.42
Installing packages with uv:
# install_with_uv.sh
$ uv pip install django djangorestframework python-decouple
# Output:
# Resolved 23 packages in 0.23s
# Downloaded 23 packages in 0.18s
# Installed 23 packages in 0.12s
Notice the speed difference. uv resolves dependencies in milliseconds instead of seconds. Beyond speed, uv includes features like automatic dependency version locking and project-aware installation.
Using uv with projects: uv can automatically manage environments and dependencies. Create a pyproject.toml in your project:
# pyproject.toml
[project]
name = "my-awesome-app"
version = "0.1.0"
description = "A fast web service"
requires-python = ">=3.11"
dependencies = [
"django>=4.2",
"djangorestframework>=3.14",
"python-decouple>=3.8",
]
[project.optional-dependencies]
dev = [
"pytest>=7.4",
"black>=23.0",
"ruff>=0.1",
]
Now uv handles environment setup automatically:
# uv_project_management.sh
$ uv sync # Creates venv and installs from pyproject.toml
$ uv add requests # Adds package and updates pyproject.toml
$ uv add --dev pytest # Adds dev dependency
# Output (for uv sync):
# Using Python 3.11.8
# Creating virtual environment at: .venv
# Installed 15 packages in 0.31s
uv’s add command automatically updates your pyproject.toml and creates a uv.lock file with pinned versions for reproducible installs across machines. This is similar to npm’s package-lock.json, which makes it a good fit for teams.
Without venvs, every project poisons the next one.
Managing Requirements: From Simple to Complex
Simple approach with requirements.txt: The most basic method is a plain text file listing package names and versions, one per line (for example, flask==3.0.0).
Advanced approach with pyproject.toml: Modern Python projects declare their metadata and dependencies in pyproject.toml (PEP 518 and PEP 621) instead. This format is supported by pip, uv, Poetry, and other tools.
Production-grade approach with lock files: For maximum reproducibility, use a lock file. uv creates uv.lock, Poetry uses poetry.lock, and pip-tools compiles a fully pinned requirements.txt. These files pin exact versions of all transitive dependencies (dependencies of dependencies). This ensures the exact same packages install everywhere:
# uv.lock (generated, do not edit manually)
version = 1
requires-python = ">=3.11"
[[package]]
name = "fastapi"
version = "0.104.1"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "pydantic", version = ">=1.7.4" },
{ name = "starlette", version = ">=0.27.0" },
]
Commit lock files to git. When teammates pull the code and install from the lock file (for example, with uv sync), they get identical versions and can start working immediately with the same dependencies everyone else is using.
Workflow 3: Updating dependencies: When you need to upgrade packages, update the environment and synchronize with your team:
# update_dependencies.sh
$ pip install --upgrade requests flask
$ pip freeze > requirements.txt
$ git add requirements.txt
$ git commit -m "Update requests to 2.32.0 and Flask to 3.0.1"
Workflow 4: Data science project with multiple Python versions: Test your code on Python 3.10, 3.11, and 3.12:
# test_multiple_versions.sh
$ conda create --name test-py310 python=3.10
$ conda activate test-py310
$ pip install -r requirements.txt
$ pytest
$ conda create --name test-py311 python=3.11
$ conda activate test-py311
# ... repeat the install and test steps for each version
[IMAGE_PLACEHOLDER: A timeline showing four developers working on the same project, with each activating their own virtual environment. Caption: “Virtual environments ensure consistent development experiences across team members.”]
IDE Integration: VS Code and PyCharm
Visual Studio Code: VS Code automatically detects virtual environments in your project. After creating and activating a venv, select it as your Python interpreter:
1. Open the Command Palette (Cmd+Shift+P / Ctrl+Shift+P)
2. Type “Python: Select Interpreter”
3. Choose the environment path (e.g., “./venv/bin/python”)
VS Code will use this interpreter for all code analysis, IntelliSense, and debugging. The selected environment appears in the status bar at the bottom.
PyCharm: PyCharm requires explicit configuration. Go to PyCharm Preferences > Project > Python Interpreter, click the gear icon, and select “Add.” Choose “Existing Environment” and navigate to your virtual environment’s Python executable (e.g., .venv/bin/python).
Once configured, PyCharm respects your environment for all operations: running code, debugging, linting, and testing.
Troubleshooting Common Issues
Issue 1: “command not found: python” inside the venv — This usually means the venv wasn’t activated properly. Check that your prompt shows the environment name. Try activating again with the full path:
# troubleshoot_activation.sh
$ source /full/path/to/my-workspace/bin/activate # Use full path
Issue 2: “pip: command not found” after venv activation — The venv may be corrupted. Recreate it:
# recreate_corrupted_venv.sh
$ rm -rf my-workspace # Delete the old environment
$ python -m venv my-workspace
$ source my-workspace/bin/activate
Issue 3: Different Python versions across machines — Specify the exact Python version in your project documentation. The first line of your requirements or project file should document this:
# requirements.txt
# This project requires Python 3.11+
# Created with: python3.11 -m venv .venv
flask==3.0.0
sqlalchemy==2.0.23
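A comment in requirements.txt only documents the expectation. A project can also enforce it at startup. A minimal sketch, assuming the 3.11 minimum from the comment above; the helper name is hypothetical:

```python
import sys

MIN_PYTHON = (3, 11)  # assumed minimum, matching the requirements.txt comment


def python_is_supported(version=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """Return True if the (major, minor) version meets the project's minimum."""
    return tuple(version[:2]) >= tuple(minimum)


print("supported interpreter:", python_is_supported())
# In an entry point you might fail fast instead:
# if not python_is_supported():
#     raise SystemExit("This project requires Python 3.11 or newer")
```

The requires-python field in pyproject.toml gives installers the same information declaratively.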
Issue 4: Conda environment takes up too much disk space — Conda caches downloaded packages. Clean unused environments and package cache:
# cleanup_conda.sh
$ conda clean --all --dry-run # See what will be removed
$ conda clean --all # Actually remove cached files
Issue 5: “No module named pip” when creating venv. The environment was created without pip. Bootstrap pip inside the activated environment with python -m ensurepip --upgrade, or delete the environment and recreate it.
Best Practices
1. Always use a virtual environment. Never install packages into your system Python. This is the golden rule. System Python should remain untouched for system tools.
2. Use a consistent naming convention. Adopt either .venv, venv, or env across all projects. This makes it easier to recognize and handle virtual environments.
3. Add environments to .gitignore. Virtual environments are large and platform-specific. Add the environment directory (for example, .venv/) to .gitignore and never commit it to version control.
4. Pin your core dependencies. Always specify exact versions for direct dependencies in requirements.txt or pyproject.toml (for example, flask==3.0.0). Pinning prevents surprises when new versions introduce breaking changes.
5. Use lock files for production. For applications deployed to production, use lock files that pin all transitive dependencies. This is essential for reliability.
6. Document Python version requirements. Always document the minimum and recommended Python versions for your project:
# README.md
## Requirements
- Python 3.9 or higher
- pip 21.0 or higher (for pyproject.toml support)
7. Separate dev and production dependencies. Keep dependencies only needed for development (testing, linting, documentation) separate from runtime dependencies, for example in a dev group under [project.optional-dependencies] in pyproject.toml.
8. Periodically review and update dependencies. Outdated packages may have security vulnerabilities. Use pip list --outdated to check for updates, but test thoroughly before updating critical packages.
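Practice 4 is easy to check mechanically. The sketch below scans requirements.txt-style text and reports any line that does not pin an exact version with ==. It is a deliberately simple check (it ignores URL requirements and environment markers), and the function name and sample content are illustrative.

```python
def unpinned_requirements(requirements_text: str) -> list[str]:
    """Return requirement lines that do not pin an exact version with '=='."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()    # drop comments and whitespace
        if not line or line.startswith("-"):    # skip blanks and pip options such as -r or -e
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


sample = """\
# This project requires Python 3.11+
flask==3.0.0
sqlalchemy==2.0.23
requests>=2.31
"""
print(unpinned_requirements(sample))
```

A check like this could run in CI so that an unpinned dependency is caught before it reaches production.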
Real-World Example: Data Science Project Setup
Let’s walk through setting up a realistic data science project with proper environment management and best practices:
# setup_datascience_project.sh
# Create project directory structure
mkdir ml-sentiment-analyzer && cd ml-sentiment-analyzer
mkdir data models notebooks src tests
# Initialize git repository
git init
# Create Python virtual environment with specific version
python3.11 -m venv .venv
source .venv/bin/activate
# Upgrade pip and install build tools
pip install --upgrade pip setuptools wheel
# Create pyproject.toml with dependencies
cat > pyproject.toml << 'EOF'
[build-system]
requires = ["setuptools>=68.0", "wheel"]
build-backend = "setuptools.build_meta"
[project]
name = "sentiment-analyzer"
version = "0.1.0"
description = "ML model for sentiment analysis"
requires-python = ">=3.11"
dependencies = [
"pandas>=2.0.0",
"scikit-learn>=1.3.0",
"transformers>=4.34.0",
"torch>=2.0.0",
"numpy>=1.24.0",
]
[project.optional-dependencies]
dev = [
"pytest>=7.4",
"pytest-cov>=4.1",
"black>=23.0",
"ruff>=0.1",
"mypy>=1.5",
"jupyter>=1.0",
"ipython>=8.0",
]
docs = ["sphinx>=7.0"]
EOF
# Install all dependencies
pip install -e ".[dev,docs]"
# Generate requirements for distribution
pip freeze > requirements.txt
# Create .gitignore
cat > .gitignore << 'EOF'
.venv/
__pycache__/
*.pyc
.env
.pytest_cache/
.coverage
htmlcov/
dist/
build/
*.egg-info/
.DS_Store
.idea/
.vscode/settings.json
data/raw/
models/checkpoints/
EOF
# Initialize first commit
git add .
git commit -m "Initial project setup with virtual environment and dependencies"
# Output:
# 23 packages installed in 2.45s
# Successfully created virtual environment at .venv
# Repository initialized
Now team members can clone and get started in seconds:
# team_member_setup.sh
$ git clone <repository-url>
$ cd ml-sentiment-analyzer
$ python3.11 -m venv .venv
$ source .venv/bin/activate
$ pip install -e ".[dev]"
$ pytest # Run the test suite
[IMAGE_PLACEHOLDER: A folder tree diagram showing the directory structure of the data science project with .venv, src/, data/, and tests/ folders. Caption: "Organized project structure keeps code, data, tests, and virtual environments neatly separated."]
Frequently Asked Questions
Q1: Can I move or rename a virtual environment after creating it?
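For venv, it's not recommended because the shebang lines in installed scripts and the activation scripts contain the original absolute path. The simplest approach is to recreate the environment in the new location. Named conda environments are stored in conda's own envs directory rather than inside your project, so moving the project does not affect them. With uv, the project's .venv can be rebuilt quickly from pyproject.toml and uv.lock by running uv sync.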
Q2: What's the difference between pip and pip3?
On systems with both Python 2 and 3 installed, pip uses Python 2 and pip3 uses Python 3. Inside an activated virtual environment, both pip and pip3 point to the same thing, so it doesn't matter. When NOT in a venv, always use pip3 to be explicit.
Q3: Should I commit my virtual environment to git?
No. Virtual environments are large, platform-specific, and redundant. Always add them to .gitignore. Instead, commit your requirements file or pyproject.toml so others can recreate the environment.
Q4: Can I use multiple virtual environments for the same project?
Yes. Some developers maintain separate environments for testing across Python versions or for isolated experimentation. Create multiple venvs with different names: .venv-py39, .venv-py311, etc.
Q5: How do I activate a virtual environment in a shell script or CI/CD pipeline?
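Either source the activation script at the top of the script (source .venv/bin/activate), or skip activation and call the environment's interpreter by its full path (for example, .venv/bin/python -m pytest).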
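The second option works because activation only adjusts PATH; the interpreter inside the environment already knows where its packages live. A minimal sketch of calling it from Python, assuming the standard venv layout; both helper names are hypothetical:

```python
import os
import subprocess
from pathlib import Path


def venv_python(env_dir: str) -> Path:
    """Return the interpreter path inside a venv (the layout differs on Windows)."""
    env = Path(env_dir)
    if os.name == "nt":
        return env / "Scripts" / "python.exe"
    return env / "bin" / "python"


def run_in_venv(env_dir: str, *args: str) -> subprocess.CompletedProcess:
    """Run a command with the venv's interpreter; no activation needed."""
    return subprocess.run([str(venv_python(env_dir)), *args], check=True)


print(venv_python(".venv"))
# Example (assumes a .venv directory exists):
# run_in_venv(".venv", "-m", "pytest")
```

This pattern suits CI jobs, cron entries, and other non-interactive contexts where sourcing a shell script is awkward.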
Q6: Why does conda take up so much disk space?
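Conda keeps a cache of downloaded package archives in addition to the environments themselves, and both grow over time. Run conda clean --all to clear the cache, and delete environments you no longer need with conda remove --name ENV_NAME --all.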
Q7: Should I use venv, conda, or uv for my new project?
Start with venv if it's a simple project or you're learning Python. Use conda if you're in data science, scientific computing, or need compiled packages. Use uv if you want maximum speed and are comfortable with newer tooling. All three work well—pick the one that fits your ecosystem.
Conclusion
Virtual environments are not optional—they're a fundamental tool for Python development. Whether you choose the built-in venv, the scientifically-oriented conda, or the modern and fast uv, the core principle remains: isolate your project's dependencies from system Python and from other projects.
Start with a simple workflow: create a venv, record your dependencies in requirements.txt, and commit everything except the venv directory to git. As your projects grow and your team expands, adopt pyproject.toml and lock files for better reproducibility. Most importantly, make virtual environments a habit—activate one before installing any package.
Pretrained language models are impressive generalists. They can write code, explain concepts, translate languages, and summarize documents — all from a single set of weights. But “impressive generalist” and “expert in your specific domain” are different things. If you need a model that consistently uses your company’s terminology, follows your specific output format, matches your brand’s tone, or performs well on a narrow task like classifying customer support tickets by urgency — fine-tuning is how you get there.
Hugging Face has become the standard infrastructure layer for working with open-source models. The transformers library provides a unified API for hundreds of model architectures. The datasets library handles data loading and preprocessing. The Trainer class wraps the training loop with gradient accumulation, mixed precision, and evaluation built in. Together, they mean you can fine-tune a model with far less boilerplate than PyTorch alone would require.
This tutorial covers the complete fine-tuning workflow: setting up a dataset, loading a pretrained model, configuring training with the Trainer API, evaluating the results, and saving/loading your fine-tuned model. We’ll work through two examples — sentiment classification (a classification task) and instruction tuning (a text generation task).
Quick Answer: To fine-tune with Hugging Face, load a pretrained model with AutoModelForSequenceClassification.from_pretrained(), tokenize your dataset with AutoTokenizer, define TrainingArguments, create a Trainer, and call trainer.train(). For LLMs, use SFTTrainer from TRL with LoRA (PEFT) to reduce memory requirements. Save with trainer.save_model().
What Is Fine-Tuning and When Should You Do It?
A pretrained model has learned general language understanding from billions of tokens of text. Fine-tuning continues training on a smaller, task-specific dataset to specialize those general capabilities. The pretrained weights provide a head start — you need far less data and compute than training from scratch.
Fine-tuning is the right choice when: a general model gives inconsistent results on your specific task; you need the model to follow a specific output format reliably; you have domain-specific terminology the general model handles poorly; you need to embed task-specific knowledge that’s expensive to inject via prompting; or you need a smaller, faster model that’s specialized for one task rather than a large general model.
Fine-tuning is NOT the right choice when: the task can be solved with good prompting alone; you have fewer than a few hundred examples; the task requires knowledge that changes frequently (use RAG instead); or you don’t have the compute budget even for fine-tuning.
More data doesn’t always mean more model. Sometimes you just need the right 0.1%.
Installing Dependencies
pip install transformers datasets accelerate evaluate scikit-learn
# For LLM fine-tuning with LoRA:
pip install peft trl bitsandbytes
# If you have a GPU:
pip install torch --index-url https://download.pytorch.org/whl/cu121
The transformers library is the core Hugging Face library for models and tokenizers. datasets provides efficient data loading and processing. accelerate handles distributed training and mixed precision automatically. evaluate provides standardized metrics. peft (Parameter-Efficient Fine-Tuning) provides LoRA and other memory-efficient adaptation methods. trl (Transformer Reinforcement Learning) includes SFTTrainer for supervised fine-tuning of LLMs.
Part 1: Fine-Tuning for Text Classification
Text classification is the most common fine-tuning task: given a text, predict one of N categories. Sentiment analysis (positive/negative/neutral), intent classification, topic categorization — all the same training approach.
Loading and Preparing the Dataset
from datasets import load_dataset, DatasetDict
from transformers import AutoTokenizer
# Load the IMDB sentiment dataset from Hugging Face Hub
dataset = load_dataset("imdb")
print(dataset)
# DatasetDict with 'train' (25000 examples) and 'test' (25000 examples)
# Each example: {'text': '...', 'label': 0 or 1}
# For demonstration, work with a smaller subset.
# Shuffle first: the splits are ordered by label, so taking the first
# N rows without shuffling would yield examples from a single class.
small_dataset = DatasetDict({
    'train': dataset['train'].shuffle(seed=42).select(range(2000)),
    'test': dataset['test'].shuffle(seed=42).select(range(500))
})
# Load the tokenizer for our base model
model_name = "distilbert-base-uncased"  # Fast, small, good baseline
tokenizer = AutoTokenizer.from_pretrained(model_name)
def tokenize_function(examples):
    """Tokenize text examples with truncation and padding."""
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=512
    )
# Apply tokenization to the entire dataset
tokenized_dataset = small_dataset.map(
    tokenize_function,
    batched=True,  # Process in batches for speed
    remove_columns=["text"]  # Remove the raw text column (we have tokens now)
)
print(f"Training examples: {len(tokenized_dataset['train'])}")
print(f"Test examples: {len(tokenized_dataset['test'])}")
print(f"Features: {tokenized_dataset['train'].features}")
The tokenizer converts raw text into token IDs that the model understands. truncation=True cuts sequences longer than max_length. padding="max_length" pads shorter sequences to the same length so they can be batched. batched=True in map() processes multiple examples at once, which is significantly faster than one-at-a-time processing.
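To make the effect of truncation and max_length padding concrete without downloading a tokenizer, here is a plain-Python sketch that operates on ready-made token ID lists. The pad ID of 0 and the sample IDs are assumptions for illustration. A real tokenizer also keeps its closing special token when truncating, which this toy version does not.

```python
PAD_ID = 0  # assumed padding token ID for this illustration


def pad_or_truncate(token_ids: list[int], max_length: int) -> dict:
    """Mimic truncation=True with padding='max_length' for one sequence."""
    ids = token_ids[:max_length]                     # truncation
    attention_mask = [1] * len(ids)                  # 1 marks a real token
    padding = max_length - len(ids)
    ids = ids + [PAD_ID] * padding                   # pad on the right
    attention_mask = attention_mask + [0] * padding  # 0 marks padding
    return {"input_ids": ids, "attention_mask": attention_mask}


short = pad_or_truncate([101, 2023, 102], max_length=5)
long = pad_or_truncate([101, 2023, 3185, 2001, 2307, 999, 102], max_length=5)
print(short)  # padded up to 5 positions
print(long)   # cut down to 5 positions
```

Both results have the same length, which is what lets them be stacked into one tensor. The attention mask tells the model which positions to ignore.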
Tokenization: turning human language into something a model can actually count.
Loading the Model and Configuring Training
from transformers import (
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
import evaluate
import numpy as np
# Load pretrained model with a classification head
# num_labels=2 for binary sentiment (positive/negative)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1}
)
# Load evaluation metric
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")
def compute_metrics(eval_pred):
    """Compute accuracy and F1 during evaluation."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="binary")
    return {**accuracy, **f1}
# Configure training
training_args = TrainingArguments(
    output_dir="./sentiment-model",  # Where to save checkpoints
    num_train_epochs=3,  # Full passes through the training data
    per_device_train_batch_size=16,  # Batch size per GPU/CPU
    per_device_eval_batch_size=32,
    warmup_steps=100,  # Gradual LR increase at start
    weight_decay=0.01,  # L2 regularization
    learning_rate=2e-5,  # Key hyperparameter for fine-tuning
    # Evaluate at end of each epoch
    # (recent transformers releases name this argument eval_strategy)
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,  # Keep the best checkpoint
    metric_for_best_model="f1",
    logging_steps=50,
    fp16=True,  # Mixed precision; requires a CUDA GPU (set to False on CPU)
    report_to="none"  # Disable wandb/tensorboard for simplicity
)
# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
)
print(f"Model parameters: {model.num_parameters():,}")
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
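To show what compute_metrics is calculating, here is a dependency-free sketch of the same two metrics: argmax over the logits, then accuracy and binary F1. The logits and labels are invented for illustration, and the helper names are hypothetical. The evaluate library remains the practical choice in real training code.

```python
def argmax(row: list[float]) -> int:
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_and_f1(logits, labels, positive: int = 1) -> dict:
    """Accuracy and binary F1 from raw logits and integer labels."""
    preds = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}


toy_logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.8]]
toy_labels = [0, 1, 1, 1]
metrics = accuracy_and_f1(toy_logits, toy_labels)
print(metrics)
```

Here one positive example is missed, so recall drops while precision stays perfect. F1 reflects that imbalance in a way accuracy alone does not.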
Training and Evaluating
# Train the model
print("Starting training...")
train_result = trainer.train()
print(f"\nTraining complete!")
print(f"Train loss: {train_result.training_loss:.4f}")
# Evaluate on test set
eval_results = trainer.evaluate()
print(f"\nTest accuracy: {eval_results['eval_accuracy']:.4f}")
print(f"Test F1: {eval_results['eval_f1']:.4f}")
# Save the fine-tuned model
trainer.save_model("./sentiment-model-final")
tokenizer.save_pretrained("./sentiment-model-final")
print("\nModel saved to ./sentiment-model-final")
The Trainer handles the entire training loop (forward pass, loss calculation, backpropagation, optimizer step) for every batch across all epochs. The load_best_model_at_end=True setting means that if epoch 2 had the best F1 score but epoch 3 regressed slightly, you get the epoch 2 weights, not epoch 3. After training, trainer.save_model() writes the model weights and config to disk, and tokenizer.save_pretrained() writes the tokenizer files to the same directory so the two can be reloaded together as a unit. The resulting directory is self-contained: you can copy it to any machine and run inference without needing to know which base model it started from.
Using the Fine-Tuned Model
from transformers import pipeline
# Load the fine-tuned model with the high-level pipeline API
classifier = pipeline(
    "text-classification",
    model="./sentiment-model-final",
    tokenizer="./sentiment-model-final",
    device=0  # First GPU; use device=-1 to run on CPU
)
# Test on new examples
test_texts = [
    "This movie was absolutely brilliant! One of the best I've seen.",
    "Complete waste of time. Boring from start to finish.",
    "It was okay, nothing special but not terrible either.",
    "An unexpected masterpiece. I was completely captivated."
]
results = classifier(test_texts)
for text, result in zip(test_texts, results):
    print(f"Text: {text[:50]}...")
    print(f"Label: {result['label']} (confidence: {result['score']:.3f})\n")
The pipeline API is the simplest way to run inference on a saved model. It handles tokenization, tensor conversion, the forward pass, and converting logits back to human-readable labels — all in one call. The device=0 argument moves the model to your first GPU; use device=-1 for CPU-only inference. For production deployments where latency matters, you’d typically load the model once at startup and keep it in memory, batching incoming requests rather than processing them one at a time.
Part 2: Fine-Tuning an LLM with LoRA
Fine-tuning full LLMs (7B+ parameters) requires significant GPU memory — too much for most developers. LoRA (Low-Rank Adaptation) is a parameter-efficient approach that freezes the original model weights and adds small trainable rank decomposition matrices to each layer. Instead of updating 7 billion parameters, you update 5-10 million. The quality loss is minimal for most tasks.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
from datasets import Dataset
import torch
# Load a smaller model for this demo (use llama or mistral for production)
base_model = "microsoft/phi-2"  # 2.7B parameters, fits in ~8GB VRAM with LoRA
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # Phi-2 doesn't have a pad token
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,  # Use float16 to save memory
    device_map="auto"  # Automatically assign to GPU if available
)
# LoRA configuration
lora_config = LoraConfig(
    r=16,  # Rank — higher = more parameters, better quality
    lora_alpha=32,  # Scaling factor (usually 2x rank)
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt (model-specific)
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)
# Apply LoRA to the model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 3,145,728 || all params: 2,782,765,056 (0.11% trainable)
The r=16 rank determines LoRA’s capacity. Higher rank means more trainable parameters and better adaptation, but more memory. For most tasks, ranks between 8 and 64 work well. target_modules specifies which layers get LoRA adapters — this varies by model architecture. For LLaMA models it’s typically ["q_proj", "k_proj", "v_proj", "o_proj"].
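The relationship between rank and trainable parameters is simple arithmetic: for one weight matrix of shape d_out x d_in, LoRA trains a d_out x r matrix and an r x d_in matrix, so it adds r * (d_in + d_out) parameters. The sketch below works this out for a few ranks. The layer dimensions are hypothetical and not taken from any particular model.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns B (d_out x r) and A (r x d_in); the frozen weight is untouched.
    """
    return r * (d_in + d_out)


# Hypothetical dimensions for illustration only
d_model, n_layers, adapted_per_layer = 4096, 32, 2

for r in (8, 16, 64):
    lora_total = lora_param_count(d_model, d_model, r) * adapted_per_layer * n_layers
    frozen = d_model * d_model * adapted_per_layer * n_layers
    print(f"r={r:>2}: {lora_total:,} LoRA params vs {frozen:,} frozen ({lora_total / frozen:.2%})")
```

Doubling the rank doubles the adapter size, and adapting more modules per layer scales it linearly as well. Both knobs trade memory for capacity.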
0.11% of parameters trained. 95% of the quality. The math checks out.
Preparing Instruction Data
from datasets import Dataset
# Instruction-following dataset format
# The model learns to follow instructions in this format
raw_data = [
    {
        "instruction": "Explain what a Python decorator is.",
        "input": "",
        "output": "A Python decorator is a function that takes another function as input and returns a modified version of that function. Decorators allow you to add functionality to existing functions without modifying them directly, using the @decorator_name syntax."
    },
    {
        "instruction": "Write a Python function to check if a number is prime.",
        "input": "",
        "output": "def is_prime(n: int) -> bool:\n if n < 2:\n return False\n if n == 2:\n return True\n if n % 2 == 0:\n return False\n for i in range(3, int(n**0.5) + 1, 2):\n if n % i == 0:\n return False\n return True"
    },
    {
        "instruction": "What does the following Python code do?",
        "input": "result = [x**2 for x in range(10) if x % 2 == 0]",
        "output": "This list comprehension creates a list of squares of even numbers from 0 to 9. It iterates through numbers 0-9, filters for even numbers (x % 2 == 0), squares each one (x**2), and collects them in a list. The result is [0, 4, 16, 36, 64]."
    },
    # ... add hundreds or thousands more examples for real training
]
def format_instruction(example):
    """Format into a single instruction-following string."""
    if example.get("input"):
        return f"""### Instruction:
{example['instruction']}
### Input:
{example['input']}
### Response:
{example['output']}"""
    else:
        return f"""### Instruction:
{example['instruction']}
### Response:
{example['output']}"""
# Convert to Dataset and format
dataset = Dataset.from_list(raw_data)
dataset = dataset.map(
    lambda x: {"text": format_instruction(x)},
    remove_columns=dataset.column_names
)
print(dataset[0]["text"])
Training with SFTTrainer
training_args = TrainingArguments(
    output_dir="./python-tutor-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 4 * 4 = 16
    warmup_steps=50,
    learning_rate=2e-4,  # LoRA uses higher LR than full fine-tuning
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    report_to="none"
)
# The model was already wrapped with get_peft_model() above, so no
# peft_config is passed here (passing both would configure LoRA twice).
# Note: these keyword arguments match older TRL releases. Newer releases
# move dataset_text_field and max_seq_length into SFTConfig and rename
# tokenizer to processing_class.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=512
)
print("Training with LoRA...")
trainer.train()
# Save the LoRA adapter (NOT the full model -- much smaller!)
trainer.save_model("./python-tutor-lora-adapters")
print("LoRA adapters saved (small file, just the delta weights)")
The gradient_accumulation_steps=4 setting simulates a larger batch size by accumulating gradients over several forward and backward passes before updating the weights. This is useful when GPU memory limits your per-device batch size: an effective batch size of 16 usually gives smoother, less noisy updates than a batch size of 4.
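The equivalence behind gradient accumulation can be checked on a toy problem. The sketch below uses a one-parameter model y_hat = w * x with mean squared error, and compares the gradient from one batch of 16 against four micro-batches of 4 whose gradients are each scaled by 1/4 and summed. The data and starting weight are made up. This mirrors the idea rather than the Trainer's internal implementation.

```python
import math


def mse_grad(w: float, batch: list[tuple[float, float]]) -> float:
    """Gradient of mean squared error with respect to w for y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(float(i), 3.0 * i + 1.0) for i in range(16)]  # 16 toy (x, y) pairs
w = 0.5
micro_batch_size, accumulation_steps = 4, 4

accumulated = 0.0
for step in range(accumulation_steps):
    micro = data[step * micro_batch_size:(step + 1) * micro_batch_size]
    # Scale each micro-batch gradient so the running sum becomes an average
    accumulated += mse_grad(w, micro) / accumulation_steps

full_batch = mse_grad(w, data)
print(accumulated, full_batch)
print("match:", math.isclose(accumulated, full_batch))
```

Because the micro-batches are equal in size, the averaged result matches the full-batch gradient, while peak memory only ever has to hold four examples at a time.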
Loading and Using LoRA Fine-Tuned Models
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
# Load base model + LoRA adapters
base_model_name = "microsoft/phi-2"
adapter_path = "./python-tutor-lora-adapters"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
# Load and merge LoRA adapters into the base model
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload()  # Merge adapters into weights for faster inference
# Generate a response
def ask_model(question: str, max_tokens: int = 300) -> str:
    prompt = f"""### Instruction:
{question}
### Response:
"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    # Decode only the new tokens (skip the prompt)
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
# Test the fine-tuned model
response = ask_model("Explain list comprehensions in Python with an example.")
print(response)
merge_and_unload() permanently fuses the LoRA adapter weights back into the base model's weight matrices. The result is a single merged model with no LoRA overhead during inference — same speed as the original base model, but with your task-specific improvements baked in. This is the deployment-ready form. Alternatively, keep the adapter separate with PeftModel.from_pretrained() at runtime if you need to hot-swap between different adapters for the same base model without reloading the full weights each time.
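The reason merging changes nothing about the outputs is linear algebra: applying W and the scaled adapter B·A separately gives the same result as applying the single matrix W + scale·B·A. A tiny plain-Python sketch with made-up 2x2 values (rank 1, scale standing in for lora_alpha / r) demonstrates this. It illustrates the idea, not PEFT's actual implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(a, b, scale=1.0):
    """Element-wise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight (2 x 2)
B = [[0.5], [-1.0]]           # LoRA B (2 x r) with r = 1
A = [[2.0, 1.0]]              # LoRA A (r x 2)
scale = 2.0                   # stands in for lora_alpha / r
x = [[1.0], [1.0]]            # input column vector

# Unmerged: base path plus a separate adapter path on every forward pass
unmerged = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Merged: fold the adapter into the weight once, then one matmul per pass
W_merged = add_scaled(W, matmul(B, A), scale)
merged = matmul(W_merged, x)

print(unmerged, merged)
```

After merging, inference needs only the single matrix multiplication, which is why the merged model runs at the base model's speed.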
Real-Life Example: Customer Support Ticket Classifier
Here's a complete fine-tuning workflow for a realistic business use case — classifying customer support tickets into categories:
from datasets import Dataset
from transformers import (
AutoTokenizer, AutoModelForSequenceClassification,
TrainingArguments, Trainer, DataCollatorWithPadding
)
import evaluate
import numpy as np
# Sample training data (in practice you'd have thousands of examples)
ticket_data = [
{"text": "My payment failed but I was still charged", "label": 0}, # billing
{"text": "Can't log into my account, password reset not working", "label": 1}, # auth
{"text": "The app crashes every time I open the dashboard", "label": 2}, # bug
{"text": "How do I export my data as a CSV file?", "label": 3}, # howto
{"text": "Invoice shows wrong amount for last month", "label": 0}, # billing
{"text": "Two-factor auth code not arriving via SMS", "label": 1}, # auth
{"text": "Search results are empty even though I have data", "label": 2}, # bug
{"text": "Can I change my billing cycle from monthly to annual?", "label": 3}, # howto
# ... add many more
]
label_names = ["billing", "authentication", "bug_report", "how_to"]
num_labels = len(label_names)
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
dataset = Dataset.from_list(ticket_data)
# Train/test split
split = dataset.train_test_split(test_size=0.2, seed=42)
def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, padding=True)
tokenized = split.map(tokenize, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# Load model
model = AutoModelForSequenceClassification.from_pretrained(
model_name,
num_labels=num_labels,
id2label={i: name for i, name in enumerate(label_names)},
label2id={name: i for i, name in enumerate(label_names)}
)
# Training
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    preds, labels = eval_pred
    preds = np.argmax(preds, axis=1)
    return accuracy.compute(predictions=preds, references=labels)
args = TrainingArguments(
output_dir="./ticket-classifier",
num_train_epochs=5,
per_device_train_batch_size=8,
per_device_eval_batch_size=8,
evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
save_strategy="epoch",
load_best_model_at_end=True,
metric_for_best_model="accuracy",
report_to="none"
)
trainer = Trainer(
model=model,
args=args,
train_dataset=tokenized["train"],
eval_dataset=tokenized["test"],
tokenizer=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics
)
trainer.train()
trainer.save_model("./ticket-classifier-final")
# Production inference
from transformers import pipeline
classifier = pipeline("text-classification", model="./ticket-classifier-final")
new_tickets = [
"I was charged twice for the same subscription this month",
"Getting 404 error on the reports page",
"How do I add a team member to my account?"
]
for ticket in new_tickets:
    result = classifier(ticket)[0]
    print(f"Ticket: {ticket}")
    print(f"Category: {result['label']} (confidence: {result['score']:.2%})\n")
Four buckets. One model. Zero humans spending 30 seconds per ticket deciding if it's billing or auth.
Frequently Asked Questions
How much data do I need for fine-tuning?
For classification with a pretrained language model, 500-2000 labeled examples per class is a reasonable starting point. With more data you'll get better results up to a point of diminishing returns (usually 10K-100K examples). For instruction tuning LLMs, high-quality datasets of 1000-10000 examples often outperform low-quality datasets of 100K examples. Quality matters more than quantity.
Do I need a GPU for fine-tuning?
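For small models (DistilBERT, BERT-base): CPU works but is slow (hours instead of minutes). For medium models (7B LLMs with LoRA): a single consumer GPU can be enough, but the 16-bit weights of a 7B model alone take about 14GB, so cards with 8-16GB VRAM (RTX 3080, 4080) generally need the base model loaded in 4-bit or 8-bit precision. Apple M1/M2 Pro machines use unified memory rather than dedicated VRAM. For large models without LoRA: multiple high-VRAM GPUs or cloud compute (A100s).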
What's the difference between fine-tuning and RLHF?
Fine-tuning (supervised) trains on (input, correct output) pairs — you need labeled data with known correct answers. RLHF (Reinforcement Learning from Human Feedback) trains the model to maximize human preference scores — you need human raters to rank model outputs. RLHF is how models like ChatGPT learn to be helpful and harmless. For most custom task fine-tuning, supervised fine-tuning is sufficient and much simpler.
How do I prevent catastrophic forgetting during fine-tuning?
Catastrophic forgetting is when fine-tuning on new data degrades performance on the original task. Solutions: use LoRA (fine-tuning a tiny fraction of parameters preserves the base model's capabilities); use a low learning rate (2e-5 for full fine-tuning, 2e-4 for LoRA); train for fewer epochs; include some original task data in your training mix.
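How tiny is that "tiny fraction"? A rough sketch for a single weight matrix, assuming the usual LoRA layout of two low-rank factors (the layer size and rank below are hypothetical, chosen only for illustration):

```python
def lora_trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Fraction of a d_out x d_in weight matrix's parameters that LoRA trains."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return lora / full


# Hypothetical 4096 x 4096 projection with a rank-8 adapter.
fraction = lora_trainable_fraction(d_in=4096, d_out=4096, rank=8)
print(f"{fraction:.4%}")  # 0.3906%
```

The frozen base weights are never updated, which is why the original capabilities are largely preserved.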
When should I use LoRA vs full fine-tuning?
Use LoRA when: the model has more than 1B parameters; you're memory-constrained (consumer GPU or CPU); you want to keep multiple specialized adapters for different tasks; you need fast switching between tasks. Use full fine-tuning when: the model is small (DistilBERT, BERT-base); you have significant compute budget; you need maximum performance on a single specific task.
Summary
You've fine-tuned a model for both classification (DistilBERT on sentiment) and instruction following (LoRA adapters on an LLM). The Hugging Face ecosystem handles the messy parts — gradient accumulation, mixed precision, checkpoint saving, evaluation — so you can focus on data quality and hyperparameter choices, which are the real levers for fine-tuning success.
The most important lesson: data quality beats model size almost every time. A fine-tuned small model on clean, well-labeled data usually outperforms a large pretrained model on your specific task. Invest in your dataset before spending compute. For using your fine-tuned model in a conversational interface, see How To Build a Chatbot with Ollama. For serving it behind an API, see Building a REST API with FastAPI.
Every chatbot tutorial eventually reaches the same uncomfortable sentence: “You’ll need an OpenAI API key and be comfortable with usage costs.” For development, experimentation, and production systems that process sensitive data, that sentence is a genuine problem. Ollama solves it. It’s a tool that runs large language models locally on your machine — no API keys, no cloud costs, no data leaving your computer, and no rate limits at 2am when you’re debugging.
Ollama supports dozens of models — Llama 3, Mistral, Phi-3, Gemma, Qwen, and more — with a dead-simple CLI for downloading and running them. Once a model is running, it exposes an OpenAI-compatible REST API on localhost:11434, which means any Python code that works with OpenAI can work with Ollama by changing one URL. You get local AI with basically zero friction.
In this tutorial you’ll build a complete conversational chatbot: streaming responses, conversation memory, a system prompt that defines personality, and a web interface using FastAPI. All running locally, all free, all private.
Quick Answer: Install Ollama, run ollama pull llama3.2 to download a model, then use the ollama Python library (pip install ollama) or hit http://localhost:11434/api/chat directly. For conversational memory, maintain a messages list and append each turn. For streaming responses, use ollama.chat(stream=True) and iterate over the chunks.
Local LLM. No cloud required. Your data stays home.
What Is Ollama?
Ollama is an open-source tool that packages LLM serving into a simple desktop application and CLI. It handles model downloading, quantization management, GPU acceleration (NVIDIA and Apple Silicon), and running a local HTTP server that speaks the OpenAI API format. The mental model is: Docker, but for LLMs instead of containers.
The models available through Ollama are generally quantized versions of popular open-source models. Quantization reduces precision from 32-bit or 16-bit floats to 4-bit or 8-bit integers, which at 4 bits shrinks the model by 4-8x with modest quality loss. A 7B parameter model that would need about 14GB of VRAM at 16-bit precision runs in about 4GB when quantized to 4 bits. This makes capable models accessible on consumer hardware.
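The size figures follow from simple arithmetic: parameters times bits per weight, divided by eight. A back-of-the-envelope sketch (the helper name is made up, and it ignores file metadata, layers kept at higher precision, and the memory needed for the context cache, so real downloads are somewhat larger):

```python
def model_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return num_params * bits_per_weight / 8 / 1e9


# A 7-billion-parameter model at different precisions.
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {model_size_gb(7e9, bits):.1f} GB")
```

This prints 28.0, 14.0, 7.0, and 3.5 GB respectively, which lines up with the roughly 4GB quoted for a 4-bit 7B model once overhead is added.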
Model | Size | RAM Required | Best For
llama3.2:1b | 1.3GB | ~4GB | Fast responses, simple tasks
llama3.2:3b | 2.0GB | ~6GB | Good balance of speed and quality
llama3.1:8b | 4.7GB | ~8GB | High quality, general purpose
mistral:7b | 4.1GB | ~8GB | Strong coding and reasoning
phi3:mini | 2.3GB | ~6GB | Microsoft’s efficient small model
gemma2:9b | 5.5GB | ~10GB | Google’s instruction-tuned model
Installing Ollama and Pulling Models
Ollama is a standalone application that runs large language models locally on your machine — no internet connection or API key required after initial setup. The installer handles everything including the background server process that your Python code will talk to. Once Ollama is running, you pull models the same way you’d pull a Docker image: they download once and live on disk.
# macOS or Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download installer from https://ollama.com
# Or via winget:
winget install Ollama.Ollama
After installation, Ollama runs as a background service automatically. Pull your first model:
# Download Llama 3.2 3B (good starting point -- 2GB, fast, decent quality)
ollama pull llama3.2:3b
# Check what you have installed
ollama list
# Quick test from the command line
ollama run llama3.2:3b "What is the capital of Australia?"
The first pull takes a few minutes (downloading the model file). Subsequent runs use the cached model. Once pulled, the model is available for API use immediately — the Ollama service starts automatically on port 11434.
Downloading intelligence. Please wait.
The Ollama Python Library
The official ollama Python package is a thin wrapper around Ollama’s HTTP API. It gives you a clean ollama.chat() interface that mirrors the OpenAI SDK’s structure — making it easy to swap providers if you need to. Install it with a single pip command:
pip install ollama
The simplest possible chatbot — just text in, text out:
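A minimal sketch of that call, assuming Ollama is running locally and llama3.2:3b has been pulled. The ask helper and its chat_fn parameter are hypothetical additions that let the function be exercised without a server; the underlying library call is ollama.chat(model=..., messages=...):

```python
def ask(question: str, model: str = "llama3.2:3b", chat_fn=None) -> str:
    """Send one user message and return the model's reply as text."""
    if chat_fn is None:
        # Imported lazily: needs `pip install ollama` and a running Ollama server.
        import ollama
        chat_fn = ollama.chat
    response = chat_fn(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response["message"]["content"]


if __name__ == "__main__":
    try:
        print(ask("Why is the sky blue? Answer in one sentence."))
    except Exception as exc:  # library missing or server not reachable
        print(f"Could not reach Ollama: {exc}")
```

Each call here is stateless: the model sees only the single message in the list.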
The message format uses the same roles as OpenAI: system (sets context/personality), user (human messages), and assistant (model responses). This intentional compatibility means code written for one API needs minimal changes for the other.
Building Conversational Memory
A chatbot that can’t remember the previous message is just a slightly fancier search engine. Conversation memory in Ollama (and LLMs generally) is simple: keep the entire conversation history as a list of messages and send it all with each new request. The model reads the full history to maintain context.
# chatbot.py
import ollama
class Chatbot:
    def __init__(self, model: str = "llama3.2:3b", system_prompt: str = None):
        self.model = model
        self.conversation_history = []
        if system_prompt:
            self.conversation_history.append({
                "role": "system",
                "content": system_prompt
            })
    def chat(self, user_message: str) -> str:
        """Send a message and get a response."""
        # Add user message to history
        self.conversation_history.append({
            "role": "user",
            "content": user_message
        })
        # Send full conversation history to model
        response = ollama.chat(
            model=self.model,
            messages=self.conversation_history
        )
        assistant_message = response["message"]["content"]
        # Add assistant response to history
        self.conversation_history.append({
            "role": "assistant",
            "content": assistant_message
        })
        return assistant_message
    def reset(self):
        """Clear conversation history (keep system prompt)."""
        system_messages = [m for m in self.conversation_history if m["role"] == "system"]
        self.conversation_history = system_messages
# Create a chatbot with a custom personality
bot = Chatbot(
model="llama3.2:3b",
system_prompt="""You are a friendly Python tutor who teaches with clear examples.
When you show code, always explain what each part does.
Keep answers concise — three paragraphs maximum unless the user asks for more detail."""
)
# Multi-turn conversation
questions = [
"What's the difference between a list and a tuple?",
"Can you show me an example that uses both?",
"When would I actually use a tuple in real code?"
]
for question in questions:
    print(f"\nYou: {question}")
    response = bot.chat(question)
    print(f"Bot: {response}")
The key insight: self.conversation_history grows with each turn. The model sees the entire conversation on every request, which is why it can reference “the example you showed earlier”: it literally reads the earlier messages. Models like Llama 3.1 support context windows of up to 128K tokens, but Ollama applies its own, much smaller default context length unless you raise it (the num_ctx option), so very long conversations can lose their earliest messages.
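Because the history grows with every turn, a long-running chat eventually needs trimming. A minimal sketch, assuming it is acceptable to drop the oldest exchanges and to count turns rather than tokens (trim_history and max_turns are hypothetical names, not part of the ollama library):

```python
def trim_history(history: list[dict], max_turns: int = 10) -> list[dict]:
    """Keep system messages plus the most recent `max_turns` user/assistant pairs."""
    system = [m for m in history if m["role"] == "system"]
    if max_turns <= 0:
        return system
    dialogue = [m for m in history if m["role"] != "system"]
    return system + dialogue[-2 * max_turns:]


# Build a fake five-turn conversation to trim.
history = [{"role": "system", "content": "You are a tutor."}]
for i in range(5):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_turns=2)
print(len(trimmed))           # 5: system prompt + 2 question/answer pairs
print(trimmed[1]["content"])  # question 3
```

A token-based budget would be more precise, but a turn count is often good enough for a first version.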
Streaming Responses
Nothing makes a chatbot feel slower than waiting for the full response before showing anything. Streaming sends tokens as they’re generated, so the user sees the response building in real time — exactly like ChatGPT.
# chatbot.py
import ollama
class StreamingChatbot:
    def __init__(self, model: str = "llama3.2:3b", system_prompt: str = None):
        self.model = model
        self.history = []
        if system_prompt:
            self.history.append({"role": "system", "content": system_prompt})
    def chat(self, user_message: str) -> str:
        self.history.append({"role": "user", "content": user_message})
        print("Bot: ", end="", flush=True)
        full_response = ""
        stream = ollama.chat(model=self.model, messages=self.history, stream=True)
        for chunk in stream:
            token = chunk["message"]["content"]
            print(token, end="", flush=True)
            full_response += token
        print()  # Newline after response
        self.history.append({"role": "assistant", "content": full_response})
        return full_response
# Interactive streaming chat session
def run_interactive_chat():
    bot = StreamingChatbot(
        model="llama3.2:3b",
        system_prompt="You are a helpful assistant. Be concise and direct."
    )
    print("Chat started. Type 'quit' to exit, 'reset' to start over.\n")
    while True:
        try:
            user_input = input("You: ").strip()
        except (KeyboardInterrupt, EOFError):
            print("\nGoodbye!")
            break
        if not user_input:
            continue
        if user_input.lower() == "quit":
            break
        if user_input.lower() == "reset":
            bot.history = [m for m in bot.history if m["role"] == "system"]
            print("Conversation reset.\n")
            continue
        bot.chat(user_input)
        print()
if __name__ == "__main__":
    run_interactive_chat()
Streaming locally: fast, private, zero API bill
Building a Web Interface with FastAPI
A terminal chatbot is great for development. A web interface is what you actually deploy. Here’s a FastAPI backend with session management:
# project.py
from fastapi import FastAPI, HTTPException
from fastapi.responses import HTMLResponse
from pydantic import BaseModel
from typing import Optional
import ollama
import uuid
app = FastAPI(title="Ollama Chatbot API")
# In-memory session storage (use Redis in production)
sessions: dict[str, list] = {}
class ChatRequest(BaseModel):
    message: str
    session_id: Optional[str] = None
    model: str = "llama3.2:3b"
@app.post("/chat")
def chat(request: ChatRequest):
    """Send a message and get a complete response."""
    session_id = request.session_id or str(uuid.uuid4())
    if session_id not in sessions:
        sessions[session_id] = [
            {"role": "system", "content": "You are a helpful assistant."}
        ]
    history = sessions[session_id]
    history.append({"role": "user", "content": request.message})
    try:
        response = ollama.chat(model=request.model, messages=history)
        assistant_message = response["message"]["content"]
        history.append({"role": "assistant", "content": assistant_message})
        return {
            "session_id": session_id,
            "response": assistant_message,
            "message_count": len([m for m in history if m["role"] != "system"])
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Ollama error: {str(e)}")
@app.get("/sessions/{session_id}")
def get_session(session_id: str):
    """Get conversation history for a session."""
    if session_id not in sessions:
        raise HTTPException(status_code=404, detail="Session not found")
    history = [m for m in sessions[session_id] if m["role"] != "system"]
    return {"session_id": session_id, "messages": history}
@app.delete("/sessions/{session_id}")
def delete_session(session_id: str):
    """Delete a chat session."""
    sessions.pop(session_id, None)
    return {"status": "deleted"}
@app.get("/models")
def list_models():
    """List available Ollama models."""
    models = ollama.list()
    return {"models": [m["model"] for m in models.get("models", [])]}
@app.get("/")
def serve_ui():
    """Serve a minimal chat UI."""
    html = """<!DOCTYPE html>
<html>
<head><title>Local AI Chatbot</title>
<style>
body { font-family: Arial, sans-serif; max-width: 800px; margin: 50px auto; padding: 20px; }
#chat { border: 1px solid #ddd; height: 400px; overflow-y: auto; padding: 15px; margin-bottom: 10px; }
.user { text-align: right; margin: 10px 0; }
.bot { text-align: left; margin: 10px 0; }
.user span { background: #007bff; color: white; padding: 8px 12px; border-radius: 12px; display: inline-block; }
.bot span { background: #f0f0f0; padding: 8px 12px; border-radius: 12px; display: inline-block; }
#input-area { display: flex; gap: 10px; }
#message { flex: 1; padding: 10px; border: 1px solid #ddd; border-radius: 6px; }
button { padding: 10px 20px; background: #007bff; color: white; border: none; border-radius: 6px; cursor: pointer; }
</style>
</head>
<body>
<h2>Local AI Chatbot (Powered by Ollama)</h2>
<div id="chat"></div>
<div id="input-area">
<input id="message" type="text" placeholder="Type a message..." onkeypress="if(event.key==='Enter')sendMessage()">
<button onclick="sendMessage()">Send</button>
</div>
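<script>
// Minimal client logic: the buttons above call sendMessage(), which posts to /chat.
let sessionId = null;
function addMessage(cls, text) {
  const chat = document.getElementById('chat');
  const div = document.createElement('div');
  div.className = cls;
  const span = document.createElement('span');
  span.textContent = text;
  div.appendChild(span);
  chat.appendChild(div);
  chat.scrollTop = chat.scrollHeight;
}
async function sendMessage() {
  const input = document.getElementById('message');
  const text = input.value.trim();
  if (!text) return;
  input.value = '';
  addMessage('user', text);
  const res = await fetch('/chat', {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({message: text, session_id: sessionId})
  });
  const data = await res.json();
  if (!res.ok) { addMessage('bot', 'Error: ' + (data.detail || res.status)); return; }
  sessionId = data.session_id;
  addMessage('bot', data.response);
}
</script>
</body>
</html>"""
    return HTMLResponse(html)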
Run it with uvicorn project:app --reload (the module name must match the file name, project.py here) and visit http://localhost:8000 for the web interface. The session management keeps conversations separate: each user can have an independent conversation identified by their session ID.
The OpenAI-Compatible API
Ollama exposes an OpenAI-compatible API, which means any code using the openai Python library works with Ollama by changing the base URL:
# chatbot.py
from openai import OpenAI
# Point OpenAI client at local Ollama
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Required by the client but ignored by Ollama
)
# This looks exactly like OpenAI code
response = client.chat.completions.create(
model="llama3.2:3b",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the Zen of Python?"}
],
temperature=0.7
)
print(response.choices[0].message.content)
This compatibility is enormously useful. A codebase built for OpenAI can switch to local Ollama by changing two values: the base URL and the model name. Teams can develop and test against a local model (free, fast, private) and deploy against OpenAI’s API (better quality, scalable) without touching application logic, as long as those values come from configuration.
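One way to keep that switch out of the code is to read the settings from environment variables. A minimal sketch: LLM_BASE_URL, LLM_API_KEY, and LLM_MODEL are hypothetical variable names invented for this example, and the hosted model name is only a placeholder:

```python
import os


def llm_settings(env=os.environ) -> dict:
    """Pick base URL, API key, and model from environment variables.

    The variable names are hypothetical; the defaults point at a local
    Ollama server as used throughout this tutorial.
    """
    return {
        "base_url": env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        "api_key": env.get("LLM_API_KEY", "ollama"),
        "model": env.get("LLM_MODEL", "llama3.2:3b"),
    }


# No variables set: local Ollama defaults.
local = llm_settings({})
# Variables set for a hosted deployment (placeholder values).
hosted = llm_settings({
    "LLM_BASE_URL": "https://api.openai.com/v1",
    "LLM_API_KEY": "sk-placeholder",
    "LLM_MODEL": "gpt-4o-mini",
})

print(local["base_url"])  # http://localhost:11434/v1
print(hosted["model"])    # gpt-4o-mini
```

The resulting base_url and api_key would be passed to OpenAI(...), and model to client.chat.completions.create(...).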
Real-Life Example: A Python Coding Assistant
Here’s a complete coding assistant specialized in Python with streaming and code review:
# real_life_project.py
import ollama
SYSTEM_PROMPT = """You are an expert Python programming assistant with 15 years of experience.
Your behavior:
- Provide working, tested code examples for every concept explained
- Always explain the "why" behind best practices, not just the "what"
- Point out potential pitfalls and edge cases proactively
- Use type hints in all code examples
- Keep explanations to 2-3 paragraphs unless the topic requires more"""
class PythonAssistant:
    def __init__(self):
        self.model = "llama3.2:3b"
        self.history = [{"role": "system", "content": SYSTEM_PROMPT}]
        self.turn_count = 0
    def ask(self, question: str) -> str:
        self.history.append({"role": "user", "content": question})
        self.turn_count += 1
        print(f"\n[Turn {self.turn_count}] Assistant: ", end="", flush=True)
        full_response = ""
        stream = ollama.chat(model=self.model, messages=self.history, stream=True)
        for chunk in stream:
            token = chunk["message"]["content"]
            print(token, end="", flush=True)
            full_response += token
        print()
        self.history.append({"role": "assistant", "content": full_response})
        return full_response
    def review_code(self, code: str) -> str:
        prompt = f"""Please review this Python code. Cover:
1. Correctness: any bugs or logical errors
2. Style: PEP 8 compliance and Pythonic patterns
3. Performance: any obvious inefficiencies
4. Safety: any potential exceptions or edge cases
Code to review:
```python
{code}
```"""
        return self.ask(prompt)
# Use the assistant
assistant = PythonAssistant()
print("Python Assistant ready. Type 'quit' to exit.\n")
while True:
    try:
        command = input("You: ").strip()
    except (KeyboardInterrupt, EOFError):
        break
    if not command or command.lower() == "quit":
        break
    assistant.ask(command)
    print()
Frequently Asked Questions
Does Ollama use my GPU?
Yes, automatically. If you have an NVIDIA GPU with CUDA, Ollama detects it and offloads layers to the GPU. Apple Silicon Macs use Metal for GPU acceleration. CPU-only inference works but is 5-10x slower. Check GPU usage with ollama ps while a model is running.
How is Ollama different from running Hugging Face models directly?
Ollama abstracts model management, quantization, and serving into a simple tool. Running a Hugging Face model directly requires more setup (transformers library, manual quantization, serving code). Ollama’s tradeoff: less flexibility, much less friction. For production custom fine-tuning, Hugging Face is more appropriate.
Can I use Ollama in production?
For personal tools and small teams, yes. For high-traffic production systems, you’d typically use a managed API (OpenAI, Anthropic) or self-hosted serving infrastructure (vLLM) that’s designed for horizontal scaling. Ollama is designed for local development and single-machine serving.
How do I make the chatbot remember things across sessions?
The conversation history in this tutorial lives in memory and is lost when the process restarts. For persistence, save the history to a database (SQLite, PostgreSQL) keyed by session ID. Load the history at session start and save it after each turn.
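A minimal sketch of that idea using SQLite from the standard library. The table layout and function names are assumptions made for this example; the history is stored as one JSON string per session:

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the session database. Use a file path to persist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions ("
        "session_id TEXT PRIMARY KEY, history TEXT NOT NULL)"
    )
    return conn


def save_history(conn: sqlite3.Connection, session_id: str, history: list[dict]) -> None:
    """Insert or overwrite the stored history for one session."""
    conn.execute(
        "INSERT OR REPLACE INTO sessions (session_id, history) VALUES (?, ?)",
        (session_id, json.dumps(history)),
    )
    conn.commit()


def load_history(conn: sqlite3.Connection, session_id: str) -> list[dict]:
    """Return the stored history, or an empty list for an unknown session."""
    row = conn.execute(
        "SELECT history FROM sessions WHERE session_id = ?", (session_id,)
    ).fetchone()
    return json.loads(row[0]) if row else []


conn = open_store()  # in-memory here; pass e.g. "chat.db" to keep data on disk
save_history(conn, "abc", [{"role": "user", "content": "Hello"}])
print(load_history(conn, "abc"))      # [{'role': 'user', 'content': 'Hello'}]
print(load_history(conn, "missing"))  # []
```

Calling save_history after each turn and load_history when a session starts is enough to survive a process restart.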
Can I run Ollama on a server and access it remotely?
Yes. By default Ollama only listens on localhost. Set OLLAMA_HOST=0.0.0.0:11434 to expose it on all interfaces, then access it from other machines. Add proper authentication (nginx with Basic Auth, or a VPN) before exposing to the internet.
Which model should I start with?
For most use cases: llama3.2:3b. It’s 2GB, responds quickly, and handles general conversation, Q&A, and simple coding well. If you have a machine with 8+ GB RAM and want better quality, try llama3.1:8b or mistral:7b.
Summary
You’ve built a complete local chatbot system: conversation memory, streaming responses, a FastAPI web backend, and a domain-specific Python coding assistant. All running on your machine, all free, all private. The OpenAI-compatible API means the same code works against hosted models when you need better quality or scale.
Local LLMs with Ollama are the right starting point for experimentation, internal tools, and privacy-sensitive applications. When you need more context (for a RAG system to query your documents), see How To Build a RAG System with LangChain. When you want to fine-tune a model on your specific domain, check out How To Fine-Tune a Hugging Face Model.
Vector embeddings are one of the most powerful concepts in modern AI and machine learning. They transform words, sentences, and entire documents into numerical representations that capture semantic meaning—allowing computers to understand that “puppy” and “dog” are related concepts, or that “king – man + woman = queen” makes linguistic sense. This ability to represent language mathematically has unlocked applications ranging from semantic search and recommendation systems to AI chatbots and anomaly detection.
If you’ve ever wondered how ChatGPT understands your questions, how search engines know you meant “electric vehicle” when you typed “EV,” or how applications can find documents similar to a query despite using completely different words, embeddings are the answer. They’re the bridge between human language and machine learning, converting the infinite complexity of human expression into dense vectors that neural networks can process efficiently.
In this tutorial, we’ll explore how to create vector embeddings in Python using industry-standard libraries. You’ll learn multiple approaches—from using OpenAI’s powerful cloud-based API to running local embedding models on your machine. We’ll cover practical techniques for searching similar documents, storing embeddings efficiently, and handling large-scale datasets. Whether you’re building a semantic search engine, enhancing your RAG application, or experimenting with similarity-based features, this guide has you covered.
Quick Example: Creating and Using Embeddings
Before diving deep, let’s see embeddings in action. Here’s how to create embeddings for two sentences and find their semantic similarity:
# quick_embedding_example.py
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Load a lightweight embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Create embeddings for sentences
sentences = [
"The quick brown fox jumps over the lazy dog",
"A fast auburn fox leaps over a sleepy canine"
]
embeddings = model.encode(sentences)
# Calculate similarity
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
print(f"Similarity score: {similarity:.4f}") # Output: Similarity score: 0.9156
Output:
Similarity score: 0.9156
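That’s it! The model scored these two sentences as nearly identical in meaning even though they share few words. Cosine similarity ranges from -1 to 1, and for text embeddings it usually falls between 0 and 1, so a score of 0.9156 indicates a very close match. The exact number can differ slightly between model and library versions.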
What Are Vector Embeddings?
A vector embedding is a numerical representation of text—a list of numbers (typically 384 to 1536 numbers, depending on the model) that captures the semantic meaning of words, sentences, or documents. Imagine you’re trying to describe the concept of “cat” to someone from another planet. You might explain it as: small, furry, has four legs, domesticated, meows, independent, nocturnal-friendly. An embedding does something similar, but mathematically: it places “cat” in a high-dimensional space where semantically similar words (like “kitten,” “feline,” “pet”) are positioned nearby, while dissimilar words (like “telescope” or “mathematics”) are far away.
This spatial arrangement is the magic. Because embeddings place semantically similar concepts close together in vector space, we can use distance calculations to find similarities, detect duplicates, group related documents, or power recommendation systems. The embedding model learns this arrangement during training on vast amounts of text, capturing patterns about how language relates to meaning.
Here’s how different embedding models compare:
Model | Provider | Dimension | Cost | Best For
text-embedding-3-small | OpenAI | 1536 | $0.02 per 1M tokens | Production, high-quality embeddings
text-embedding-3-large | OpenAI | 3072 | $0.13 per 1M tokens | Maximum accuracy, premium apps
all-MiniLM-L6-v2 | Sentence-Transformers | 384 | Free (local) | Local use, privacy-sensitive apps
all-mpnet-base-v2 | Sentence-Transformers | 768 | Free (local) | High accuracy, modest resource use
embed-english-v3.0 | Cohere | 1024 | $0.10 per 1M tokens | English text (a separate multilingual model is available)
1,536 dimensions of pure meaning. Your brain does this in milliseconds — your GPU needs a few more.
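The per-token prices in the table translate into a quick cost estimate. A small sketch, assuming the listed prices still apply (providers change them) and using a made-up workload of 10,000 documents at roughly 500 tokens each:

```python
def embedding_cost_usd(num_tokens: int, price_per_million_usd: float) -> float:
    """Cost of embedding `num_tokens` tokens at a given price per 1M tokens."""
    return num_tokens / 1_000_000 * price_per_million_usd


# Hypothetical workload: 10,000 documents of about 500 tokens each.
tokens = 10_000 * 500

print(f"small: ${embedding_cost_usd(tokens, 0.02):.2f}")  # small: $0.10
print(f"large: ${embedding_cost_usd(tokens, 0.13):.2f}")  # large: $0.65
```

At this scale the API cost is negligible, so the choice between cloud and local models is more likely to hinge on privacy and latency than on price.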
Creating Embeddings with OpenAI
OpenAI’s embedding models are state-of-the-art and easy to use. The text-embedding-3-small model offers excellent quality at a reasonable cost. To get started, you’ll need an OpenAI API key and the openai Python library.
First, install the required package:
pip install openai
Now let’s create embeddings for a simple piece of text:
# openai_embeddings.py
from openai import OpenAI
# Initialize client (reads the OPENAI_API_KEY environment variable)
client = OpenAI()
# Text to embed
text = "Python is a versatile programming language"
# Create embedding
response = client.embeddings.create(
model="text-embedding-3-small",
input=text
)
# Extract the embedding vector
embedding = response.data[0].embedding
print(f"Embedding dimension: {len(embedding)}")
print(f"First 10 values: {embedding[:10]}")
print(f"Embedding generated successfully")
With default settings, text-embedding-3-small returns a 1,536-dimensional vector; the API’s optional dimensions parameter can request a shorter one. The actual values are small floats that collectively encode the semantic meaning of your text. For production applications, always store your API key in an environment variable rather than hardcoding it.
Local Embeddings with Sentence-Transformers
Not every application needs cloud-based embeddings. Sentence-Transformers is an open-source library that lets you run embedding models locally on your machine. This approach offers privacy (your data stays local), cost savings (no API calls), and no network round trip for each request.
Install the library:
pip install sentence-transformers scikit-learn
Now create embeddings for multiple texts:
# local_embeddings.py
from sentence_transformers import SentenceTransformer
# Load a pre-trained model (downloads ~90MB on first use)
model = SentenceTransformer('all-MiniLM-L6-v2')
# List of sentences to embed
sentences = [
"The cat sat on the mat",
"A feline rested on the carpet",
"Python is a programming language",
"Java is an object-oriented language"
]
# Create embeddings for all sentences at once
embeddings = model.encode(sentences)
print(f"Number of embeddings: {len(embeddings)}")
print(f"Embedding dimension: {len(embeddings[0])}")
print(f"All embeddings created successfully")
# Embeddings is a numpy array of shape (4, 384)
print(f"Shape: {embeddings.shape}")
Output:
Number of embeddings: 4
Embedding dimension: 384
All embeddings created successfully
Shape: (4, 384)
The model downloaded automatically on first use. Subsequent runs reuse the cached model, so they start much faster. The all-MiniLM-L6-v2 model is lightweight (about 22 million parameters, roughly 90MB on disk) and a good fit for most tasks, though larger models like all-mpnet-base-v2 (420MB) offer higher quality.
Cloud or local — one costs money per token, the other costs your GPU’s will to live.
Cosine Similarity and Distance Metrics
Creating embeddings is only half the battle. The other half is comparing them to find similar texts. Cosine similarity is the standard metric: it measures the angle between two embedding vectors, giving a score from -1 to 1 (typically 0 to 1 for text). A score of 1 means identical direction (perfect semantic match), while 0 means perpendicular (no relationship).
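To make the definition concrete, here is a dependency-free sketch of the formula: the dot product of the two vectors divided by the product of their lengths. The function name cosine_sim is chosen for this example, and the two-dimensional vectors are toy inputs rather than real embeddings:

```python
import math


def cosine_sim(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_sim([1.0, 0.0], [2.0, 0.0]))   # 1.0  (same direction)
print(cosine_sim([1.0, 0.0], [0.0, 3.0]))   # 0.0  (perpendicular)
print(cosine_sim([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

Note that vector length does not matter: [1, 0] and [2, 0] point the same way, so they score 1.0.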
Here’s how to calculate and use cosine similarity:
# cosine_similarity.py
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
model = SentenceTransformer('all-MiniLM-L6-v2')
# Create embeddings
query = "artificial intelligence"
documents = [
"machine learning and neural networks",
"cooking recipes for pasta",
"deep learning algorithms"
]
query_embedding = model.encode(query)
doc_embeddings = model.encode(documents)
# Calculate similarity between query and all documents
similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
# Sort by similarity
ranked_docs = sorted(zip(documents, similarities), key=lambda x: x[1], reverse=True)
for doc, score in ranked_docs:
    print(f"{score:.4f} - {doc}")
Output:
0.8234 - deep learning algorithms
0.7156 - machine learning and neural networks
0.1245 - cooking recipes for pasta
The query about AI scored highest against “deep learning algorithms” (0.82) and “machine learning and neural networks” (0.72), and was barely related to cooking (0.12). This is exactly what you want for semantic search: the system matched on meaning, not just keywords.
Storing and Managing Embeddings
For applications with hundreds or millions of embeddings, efficient storage and retrieval becomes critical. You have several options: NumPy arrays for simple cases, vector databases like ChromaDB or Pinecone for scalability, or traditional databases with vector extensions like PostgreSQL with pgvector.
Here’s how to save embeddings to disk using NumPy:
# save_embeddings.py
from sentence_transformers import SentenceTransformer
import numpy as np
import json
model = SentenceTransformer('all-MiniLM-L6-v2')
# Documents and their embeddings
documents = [
"Python is great for data science",
"JavaScript powers web applications",
"Rust provides memory safety"
]
embeddings = model.encode(documents)
# Save embeddings as binary format (efficient)
np.save('embeddings.npy', embeddings)
# Save document metadata as JSON
metadata = {
'documents': documents,
'model': 'all-MiniLM-L6-v2',
'dimension': len(embeddings[0])
}
with open('metadata.json', 'w') as f:
    json.dump(metadata, f)
print("Embeddings saved successfully")
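Reading the files back is the mirror image of saving them. The following is a minimal sketch of that loading step, assuming the embeddings.npy and metadata.json layout written above. The load_embeddings helper is illustrative, and a zero-filled stand-in array replaces the real model output so the example runs without downloading a model.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

def load_embeddings(emb_path, meta_path):
    """Read back the array and metadata written by save_embeddings.py."""
    embeddings = np.load(emb_path)
    with open(meta_path, encoding="utf-8") as f:
        metadata = json.load(f)
    # Guard against a metadata file that no longer matches the array
    if embeddings.shape != (len(metadata["documents"]), metadata["dimension"]):
        raise ValueError("metadata does not match the stored embeddings")
    return embeddings, metadata

# Demo: write stand-in files to a temporary directory, then load them again
with tempfile.TemporaryDirectory() as tmp:
    emb_path = Path(tmp) / "embeddings.npy"
    meta_path = Path(tmp) / "metadata.json"
    docs = [
        "Python is great for data science",
        "JavaScript powers web applications",
        "Rust provides memory safety",
    ]
    np.save(emb_path, np.zeros((3, 384), dtype=np.float32))  # stand-in vectors
    meta_path.write_text(
        json.dumps({"documents": docs, "model": "all-MiniLM-L6-v2", "dimension": 384}),
        encoding="utf-8",
    )
    embeddings, metadata = load_embeddings(emb_path, meta_path)

print(f"Loaded {len(embeddings)} embeddings")
print(f"Dimension: {metadata['dimension']}")
print(f"First document: {metadata['documents'][0]}")
```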
Output after loading both files back with np.load() and json.load():
Loaded 3 embeddings
Dimension: 384
First document: Python is great for data science
Semantic Search with Embeddings
Semantic search combines embedding creation, similarity calculation, and ranking to find the most relevant documents for a query. Unlike keyword search, it understands intent and meaning. Let’s build a simple semantic search engine:
# semantic_search.py
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
model = SentenceTransformer('all-MiniLM-L6-v2')
class SemanticSearchEngine:
    def __init__(self, documents):
        self.documents = documents
        self.embeddings = model.encode(documents)

    def search(self, query, top_k=3):
        query_embedding = model.encode(query)
        similarities = cosine_similarity([query_embedding], self.embeddings)[0]
        # Get top-k results, highest score first
        top_indices = similarities.argsort()[-top_k:][::-1]
        results = [
            {
                'document': self.documents[i],
                'score': similarities[i]
            }
            for i in top_indices
        ]
        return results

# Create search engine
docs = [
    "Python is ideal for machine learning",
    "JavaScript runs in web browsers",
    "Machine learning models need data",
    "Web development uses HTML and CSS"
]
search = SemanticSearchEngine(docs)
# Search
results = search.search("deep learning with Python", top_k=2)
for result in results:
    print(f"{result['score']:.4f} - {result['document']}")
Output:
0.8342 - Python is ideal for machine learning
0.7156 - Machine learning models need data
cosine_similarity() finds what you meant, not just what you typed.
Dimensionality and Performance Tradeoffs
Embedding dimensions range from 384 to 3072 across popular models. Higher-dimensional embeddings capture more nuance but require more storage and computation. Choose based on your needs:
Low dimension (384): Fast, lightweight, good for real-time applications. Use when speed matters and your texts are straightforward.
Medium dimension (768-1024): Balanced quality and performance. Best for most production applications.
High dimension (1536-3072): Maximum quality, slower processing. Use when accuracy is critical and speed is not.
Here’s how to compare performance:
# compare_models.py
from sentence_transformers import SentenceTransformer
import time
models_to_test = [
'all-MiniLM-L6-v2', # 384-dim, ~22MB
'all-mpnet-base-v2', # 768-dim, ~420MB
]
# Test text
texts = ["Machine learning is fascinating"] * 1000
for model_name in models_to_test:
    model = SentenceTransformer(model_name)
    start = time.time()
    embeddings = model.encode(texts)
    elapsed = time.time() - start
    print(f"{model_name}: {len(embeddings[0])}-dim, {elapsed:.2f}s for 1000 texts")
Output:
all-MiniLM-L6-v2: 384-dim, 2.34s for 1000 texts
all-mpnet-base-v2: 768-dim, 5.67s for 1000 texts
The smaller model is 2.4x faster. For a corpus of 1 million documents, this difference becomes significant. Choose your model based on whether you prioritize speed or accuracy.
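To see what that means at scale, the following back-of-the-envelope sketch extrapolates the sample timings above to 1 million documents and estimates the raw storage for float32 vectors. It assumes linear scaling and ignores index overhead, so treat the numbers as rough guides rather than measurements.

```python
BYTES_PER_FLOAT32 = 4

def index_size_gib(num_docs, dim):
    """Raw size of a float32 embedding matrix in GiB (no index overhead)."""
    return num_docs * dim * BYTES_PER_FLOAT32 / 2**30

def projected_minutes(seconds_per_1000, num_docs):
    """Linear extrapolation from a 1000-text benchmark."""
    return seconds_per_1000 * (num_docs / 1000) / 60

num_docs = 1_000_000
for dim in (384, 768, 1536, 3072):
    print(f"{dim:>4}-dim: {index_size_gib(num_docs, dim):.2f} GiB")

small_minutes = projected_minutes(2.34, num_docs)   # all-MiniLM-L6-v2 sample timing
large_minutes = projected_minutes(5.67, num_docs)   # all-mpnet-base-v2 sample timing
print(f"MiniLM: {small_minutes:.0f} min, mpnet: {large_minutes:.1f} min")
```

Under these assumptions, 1 million vectors take about 1.4 GiB at 384 dimensions and about 11.4 GiB at 3072, while embedding time grows from roughly 39 minutes to roughly 95 minutes when moving to the larger model.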
Batch Processing Large Datasets
When embedding thousands or millions of documents, batch processing is essential. The SentenceTransformer.encode() method accepts a batch_size parameter to control memory usage and speed:
# batch_processing.py
from sentence_transformers import SentenceTransformer
import time
model = SentenceTransformer('all-MiniLM-L6-v2')
# Generate 10,000 sample documents
documents = [f"Document number {i} about topic X" for i in range(10000)]
print("Starting batch embedding...")
start = time.time()
# Embed with specified batch size (tune based on your GPU/RAM)
embeddings = model.encode(
documents,
batch_size=64, # Process 64 docs at once
show_progress_bar=True,
device='cpu' # Use 'cuda' if you have a GPU
)
elapsed = time.time() - start
print(f"Embedded {len(embeddings)} documents in {elapsed:.2f} seconds")
print(f"Rate: {len(embeddings)/elapsed:.0f} docs/second")
Key parameters: batch_size controls how many texts are encoded per forward pass (larger batches are usually faster but use more memory), show_progress_bar gives feedback on long operations, and device='cuda' runs the model on an NVIDIA GPU, which is often roughly 10-50x faster than a CPU depending on the hardware. For CPU-only systems, a batch size of 32-64 is typical; GPU systems can use 128-512.
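batch_size bounds the memory used per forward pass, but encode() still returns every vector at once. For corpora that do not fit in RAM, one option is to feed the encoder a chunk at a time and persist each result before moving on. The sketch below shows that outer loop; chunked, embed_in_chunks, and fake_encode are illustrative names, and fake_encode is a stand-in so the example runs without a model.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_in_chunks(documents, encode, chunk_size=10_000):
    """Call `encode` once per chunk and yield (offset, vectors) pairs."""
    offset = 0
    for chunk in chunked(documents, chunk_size):
        yield offset, encode(chunk)
        offset += len(chunk)

def fake_encode(texts):
    """Stand-in encoder: one 2-dimensional vector per text."""
    return [[float(len(t)), 0.0] for t in texts]

documents = [f"Document number {i}" for i in range(25)]
offsets = []
total = 0
for offset, vectors in embed_in_chunks(documents, fake_encode, chunk_size=10):
    offsets.append(offset)
    total += len(vectors)   # a real pipeline would write `vectors` to disk here
print(offsets, total)
```

With a real model, the encode argument could be something like lambda chunk: model.encode(chunk, batch_size=64), so the inner batching described above still applies within each chunk.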
Batch size 32 on a GPU: 500 docs/sec. Batch size 1 on a CPU: existential crisis.
Real-Life Example: Document Similarity Finder
Let’s build a practical application that finds similar documents in a corpus. This is useful for duplicate detection, content recommendation, or legal document review:
# document_similarity_finder.py
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
class DocumentSimilarityFinder:
    def __init__(self, documents, model_name='all-MiniLM-L6-v2'):
        self.documents = documents
        self.model = SentenceTransformer(model_name)
        self.embeddings = self.model.encode(documents)

    def find_similar(self, query_doc_index, threshold=0.75, top_k=5):
        """Find documents similar to the document at query_doc_index."""
        query_embedding = self.embeddings[query_doc_index]
        # Calculate similarity with all documents
        similarities = cosine_similarity([query_embedding], self.embeddings)[0]
        # Exclude the query document itself
        similarities[query_doc_index] = -1
        # Filter by threshold and get top-k
        similar_indices = np.where(similarities >= threshold)[0]
        similar_indices = similar_indices[np.argsort(similarities[similar_indices])[::-1]][:top_k]
        results = []
        for idx in similar_indices:
            results.append({
                'document': self.documents[idx],
                'similarity': float(similarities[idx]),
                'index': int(idx)
            })
        return results

    def find_duplicates(self, threshold=0.95):
        """Find all potential duplicate pairs."""
        similarity_matrix = cosine_similarity(self.embeddings)
        duplicates = []
        for i in range(len(self.documents)):
            for j in range(i + 1, len(self.documents)):
                if similarity_matrix[i][j] >= threshold:
                    duplicates.append({
                        'doc1': self.documents[i],
                        'doc2': self.documents[j],
                        'similarity': float(similarity_matrix[i][j])
                    })
        return duplicates
# Example usage
documents = [
"Python is a versatile programming language",
"Python: a flexible and powerful programming language",
"Java is an object-oriented language",
"JavaScript powers web browsers",
"Machine learning with Python and TensorFlow"
]
finder = DocumentSimilarityFinder(documents)
print("=== Similar to document 0 ===")
similar = finder.find_similar(0, threshold=0.7)
for result in similar:
    print(f"{result['similarity']:.4f} - {result['document']}")
print("\n=== Potential duplicates ===")
duplicates = finder.find_duplicates(threshold=0.92)
for dup in duplicates:
    print(f"{dup['similarity']:.4f}")
    print(f" Doc A: {dup['doc1']}")
    print(f" Doc B: {dup['doc2']}")
Output:
=== Similar to document 0 ===
0.9847 - Python: a flexible and powerful programming language
0.8234 - Machine learning with Python and TensorFlow
=== Potential duplicates ===
0.9847
Doc A: Python is a versatile programming language
Doc B: Python: a flexible and powerful programming language
This example demonstrates key real-world scenarios: finding similar content and detecting near-duplicates. The high score (0.9847) between the two Python documents shows they’re essentially saying the same thing, perfect for deduplication pipelines.
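The nested loops in find_duplicates are easy to read but slow in pure Python once the corpus grows. As a sketch of one alternative, the pair selection can be done with NumPy indexing over the upper triangle of the similarity matrix. The duplicate_pairs helper and the hand-written 3x3 matrix are illustrative stand-ins for cosine_similarity(self.embeddings).

```python
import numpy as np

def duplicate_pairs(similarity_matrix, threshold=0.95):
    """Return (i, j, score) for every pair i < j scoring at or above threshold."""
    sim = np.asarray(similarity_matrix)
    rows, cols = np.triu_indices(sim.shape[0], k=1)   # upper triangle, no diagonal
    keep = sim[rows, cols] >= threshold
    return [(int(i), int(j), float(sim[i, j]))
            for i, j in zip(rows[keep], cols[keep])]

# Hand-written similarity matrix standing in for real model output
sim = np.array([
    [1.00, 0.98, 0.40],
    [0.98, 1.00, 0.35],
    [0.40, 0.35, 1.00],
])
pairs = duplicate_pairs(sim, threshold=0.95)
print(pairs)
```

This still builds the full n-by-n matrix, so memory grows quadratically; for very large corpora an approximate nearest-neighbor index in a vector database is likely the better fit.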
Forty lines of Python and your documents rank themselves. The future is lazy and beautiful.
Frequently Asked Questions
What embedding dimension should I use?
Start with 384 (all-MiniLM-L6-v2) unless accuracy is critical. If quality matters more than speed and you have the resources, try 768 (all-mpnet-base-v2). Only go above 1024 dimensions if you’re working with very complex text or specific domain requirements.
How much does it cost to use OpenAI embeddings?
As of early 2026, OpenAI charges $0.02 per 1 million tokens for text-embedding-3-small and $0.13 per 1 million tokens for text-embedding-3-large. One token is roughly 4 characters of English text, so 1 million characters is about 250,000 tokens, which costs about $0.005 with the small model. Local models (Sentence-Transformers) are free after the initial download.
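For budgeting, the arithmetic can be wrapped in a small helper. This is a rough sketch that uses the per-token prices quoted above and the 4-characters-per-token rule of thumb; the corpus size is a hypothetical example, and real token counts vary by language and content.

```python
# Prices in USD per 1 million tokens, taken from the answer above
PRICE_PER_MILLION = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}
CHARS_PER_TOKEN = 4  # rough average for English text

def estimate_cost(num_chars, model):
    """Approximate embedding cost in USD for `num_chars` characters."""
    tokens = num_chars / CHARS_PER_TOKEN
    return tokens / 1_000_000 * PRICE_PER_MILLION[model]

corpus_chars = 50_000_000  # hypothetical corpus size
small_cost = estimate_cost(corpus_chars, "text-embedding-3-small")
large_cost = estimate_cost(corpus_chars, "text-embedding-3-large")
print(f"small: ${small_cost:.3f}, large: ${large_cost:.3f}")
```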
Can I use embeddings for sensitive data?
According to OpenAI's API data usage policy, API inputs may be retained for up to 30 days for abuse monitoring. If you need stronger privacy guarantees, use local models like Sentence-Transformers. Your data never leaves your machine, making this ideal for healthcare, legal, or financial applications.
Do embeddings work for languages other than English?
Yes, but results vary. text-embedding-3-small works reasonably well for 100+ languages. For non-English text, consider models specifically trained for your language, like paraphrase-multilingual-MiniLM-L12-v2 from Sentence-Transformers, which handles 50+ languages.
Do I need to re-embed documents if I change the embedding model?
Yes. Embeddings from different models are incompatible. If you switch models, you must re-embed your entire corpus. This is an important consideration when choosing a model—changing it later requires significant reprocessing.
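One way to avoid mixing incompatible vectors by accident is to record the model name alongside the stored embeddings and check it before every search. The sketch below assumes metadata shaped like the metadata.json file from the storage section; check_compatible is an illustrative helper, not part of any library.

```python
def check_compatible(stored_meta, query_model, query_dim):
    """Raise if stored embeddings were produced by a different model."""
    if stored_meta["model"] != query_model:
        raise ValueError(
            f"index was built with {stored_meta['model']!r}, "
            f"query uses {query_model!r}: re-embed the corpus first"
        )
    if stored_meta["dimension"] != query_dim:
        raise ValueError("embedding dimensions differ")

stored = {"model": "all-MiniLM-L6-v2", "dimension": 384}
check_compatible(stored, "all-MiniLM-L6-v2", 384)   # same model: passes silently

try:
    check_compatible(stored, "all-mpnet-base-v2", 768)
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print(err)
```

Checking the name as well as the dimension matters because two different models can share a dimension and still produce vectors that are meaningless to compare.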
What similarity threshold should I use?
It depends on your use case. For deduplication: 0.95+. For finding related content: 0.70-0.85. For loose matching: 0.50-0.70. Always test with real data—thresholds vary by domain, text type, and model choice.
Conclusion
Vector embeddings are the foundation of modern semantic AI applications. You now have the knowledge to create embeddings using both cloud-based APIs (OpenAI) and local models (Sentence-Transformers), calculate similarity between texts, store embeddings efficiently, and build production-grade semantic search systems. Whether you’re creating a document recommendation engine, detecting duplicates, building a RAG application, or powering an AI chatbot, embeddings are an essential tool in your Python toolkit.
The key takeaway: embeddings convert language into mathematics. Once you have that mathematical representation, you can search, compare, cluster, and reason about text with remarkable accuracy. Start simple with a local model like all-MiniLM-L6-v2, measure performance, and scale up when needed.
Next steps: Try the quick example above, experiment with different models and similarity thresholds, and explore vector databases like ChromaDB if you’re working with large-scale applications. For deeper dives, check out the official documentation below.
The OpenAI API brings powerful language models directly into your Python applications. Whether you’re building chatbots, automating content creation, analyzing text, or generating embeddings, the official OpenAI Python SDK makes integration straightforward and intuitive. In this guide, we’ll explore everything from basic chat completions to advanced features like function calling and vision capabilities, complete with production-ready examples you can deploy immediately.
The modern AI landscape has democratized access to sophisticated language models. What once required significant ML expertise now takes just a few lines of Python. The OpenAI API currently powers applications used by millions of developers worldwide, and with the latest Python SDK (v1.0+), the experience is more elegant and Pythonic than ever. You’ll gain the skills to harness models like GPT-4o, GPT-4o-mini, and GPT-3.5-turbo in your projects.
By the end of this tutorial, you’ll understand how to initialize the client, handle authentication, construct effective prompts, stream responses for real-time interaction, invoke external tools through function calling, process images, generate embeddings, and implement robust error handling. We’ll also examine a complete CLI chatbot implementation that demonstrates conversation history management.
Quick Example: Your First API Call
Let’s get straight to it. Here’s a minimal example that demonstrates the power of the OpenAI API. This script creates a single chat completion request and displays the model’s response. It assumes your OPENAI_API_KEY environment variable is set:
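A minimal sketch of such a script might look like the following. The prompt wording, the build_request helper, and the choice of gpt-4o-mini are assumptions made for illustration; only OpenAI() and client.chat.completions.create() come from the SDK. The request is skipped with a hint when no API key is set.

```python
# first_call.py (illustrative sketch)
import os

def build_request(prompt, model="gpt-4o-mini"):
    """Assemble the keyword arguments for a single-turn chat completion."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

request = build_request("Explain recursion in one sentence.")

if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment
    response = client.chat.completions.create(**request)
    print(response.choices[0].message.content)
else:
    print("Set OPENAI_API_KEY to send the request.")
```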
Output:
Recursion is a function calling itself to solve smaller versions of the same problem until reaching a base case.
The OpenAI() client automatically reads your API key from the environment, constructs a message, sends it to the model, and returns a structured response. The choices array contains the model’s completions, and message.content is the actual text response.
What Is the OpenAI API?
The OpenAI API is a REST interface that gives you programmatic access to OpenAI’s language models. Rather than using the web interface, you call the API from your application. The official Python SDK wraps this REST API, handling authentication, request formatting, and response parsing automatically.
OpenAI offers multiple models optimized for different use cases:
Model | Best For | Context Window | Relative Cost | Speed
gpt-4o | Complex reasoning, multimodal, production | 128K tokens | Higher | Moderate
gpt-4o-mini | Fast, cost-effective, high volume | 128K tokens | Low | Fast
gpt-3.5-turbo | Legacy applications | 16K tokens | Low (above gpt-4o-mini) | Fastest
For most new projects, we recommend gpt-4o-mini as your starting point. The API also supports embeddings, audio transcription, image generation, and fine-tuning.
Prompt in. Completion out. Magic in the middle.
Installing the OpenAI Python SDK
The official OpenAI Python SDK is available on PyPI as openai. We recommend installing it inside a virtual environment with pip install openai.
Every request requires authentication via an API key. Create one at platform.openai.com/api-keys. Never hardcode your key in source code; store it in the OPENAI_API_KEY environment variable instead.
The OpenAI() client automatically reads this environment variable:
# client_init.py
from openai import OpenAI
client = OpenAI() # Reads OPENAI_API_KEY from environment
print("Client initialized successfully")
Output:
Client initialized successfully
Keep your API key closer than your passwords
Chat Completions: The Core API
Chat completions are the foundation of most OpenAI applications. You send a list of messages and the model generates a completion:
# chat_basic.py
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "user", "content": "What are three benefits of Python for data science?"}
],
max_tokens=200,
temperature=0.7
)
print(response.choices[0].message.content)
Output:
1. Rich Ecosystem: Libraries like pandas, NumPy, and scikit-learn provide comprehensive tools.
2. Ease of Learning: Python's readable syntax lets data scientists focus on algorithms.
3. Community and Integration: Strong community support and seamless production integration.
Key parameters: model specifies which model, messages is the conversation, max_tokens limits response length, and temperature controls randomness (0.7 is a good default).
System Messages and Conversation Roles
System messages set the assistant’s behavior and personality. Every conversation should begin with one:
# system_messages.py
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "You are a helpful Python tutor. Keep responses under 150 words."},
{"role": "user", "content": "What is a list comprehension?"}
],
temperature=0.5
)
print(response.choices[0].message.content)
Output:
A list comprehension is a concise way to create lists in Python:
squares = [x ** 2 for x in range(5)] # [0, 1, 4, 9, 16]
Messages have three roles: system (instructions), user (human input), and assistant (model responses). Store messages in a list to maintain multi-turn conversation context.
Streaming Responses
Streaming sends tokens as they’re generated, creating a real-time effect:
# streaming_response.py
from openai import OpenAI
client = OpenAI()
stream = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Write a haiku about Python."}],
stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
Output:
Code flows like rivers
Functions call within themselves
Logic pure and clean
The stream=True parameter returns an iterator that yields chunks as they arrive, which makes it a good fit for web UIs.
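In most applications you also want the complete reply once streaming ends, for example to store it in the conversation history. The sketch below accumulates the deltas while still forwarding each piece to a callback. collect_stream and fake_chunk are illustrative names, and the SimpleNamespace objects are stand-ins shaped like SDK chunks so the example runs offline.

```python
from types import SimpleNamespace

def collect_stream(chunks, on_token=None):
    """Join streamed text deltas into the full reply."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:        # some chunks can arrive without choices
            continue
        text = chunk.choices[0].delta.content
        if text:                     # content is None on role and finish chunks
            parts.append(text)
            if on_token is not None:
                on_token(text)
    return "".join(parts)

def fake_chunk(text):
    """Hypothetical stand-in shaped like an SDK chunk, for offline testing."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

fake_stream = [
    fake_chunk(None),
    fake_chunk("Code flows "),
    fake_chunk("like rivers"),
    SimpleNamespace(choices=[]),
]
reply = collect_stream(fake_stream)
print(reply)
```

With the real client, the stream object returned by create(..., stream=True) can be passed in place of fake_stream, with on_token set to a function that prints or sends each piece to the browser.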
Streaming: because waiting for the full response is so 2022
Function Calling and Tool Use
Function calling lets the model request that your application invoke specific functions:
# function_calling.py
import json
from openai import OpenAI
client = OpenAI()
tools = [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "City name"},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
},
"required": ["city"]
}
}
}]
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "What's the weather in New York?"}],
tools=tools,
tool_choice="auto"
)
if response.choices[0].message.tool_calls:
    call = response.choices[0].message.tool_calls[0]
    print(f"Function: {call.function.name}")
    print(f"Arguments: {call.function.arguments}")
Output:
Function: get_weather
Arguments: {"city": "New York", "unit": "fahrenheit"}
The model decides which function to invoke and structures arguments automatically. Your code executes the logic and sends results back.
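The "executes the logic and sends results back" step can be sketched as follows. The get_weather implementation, the AVAILABLE_TOOLS registry, and the call_123 id are hypothetical, and a SimpleNamespace stands in for the SDK's tool call object so the example runs without an API key. The returned dictionary uses the tool role and a tool_call_id that matches the call.

```python
import json
from types import SimpleNamespace

def get_weather(city, unit="celsius"):
    """Hypothetical local implementation; a real one would query a weather service."""
    return {"city": city, "unit": unit, "temperature": 21}

AVAILABLE_TOOLS = {"get_weather": get_weather}

def run_tool_call(call):
    """Execute one tool call and build the message that reports its result."""
    func = AVAILABLE_TOOLS[call.function.name]
    kwargs = json.loads(call.function.arguments)   # arguments arrive as a JSON string
    result = func(**kwargs)
    return {
        "role": "tool",
        "tool_call_id": call.id,
        "content": json.dumps(result),
    }

# Stand-in shaped like response.choices[0].message.tool_calls[0]
call = SimpleNamespace(
    id="call_123",
    function=SimpleNamespace(
        name="get_weather",
        arguments='{"city": "New York", "unit": "fahrenheit"}',
    ),
)
tool_message = run_tool_call(call)
print(tool_message)
```

In a full round trip, the assistant message containing the tool call and this tool message are both appended to the messages list, and a second create() call lets the model turn the result into a natural-language answer.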
Generating Embeddings
Embeddings are numerical representations of text for semantic search and similarity:
# embeddings_example.py
from openai import OpenAI
client = OpenAI()
texts = ["The cat sat on the mat.", "A feline rests on the rug.", "The dog ran through the park."]
response = client.embeddings.create(model="text-embedding-3-small", input=texts)
for i, item in enumerate(response.data):
    print(f"Text {i}: {len(item.embedding)} dimensions, first 3: {item.embedding[:3]}")
Output:
Text 0: 1536 dimensions, first 3: [-0.0234, 0.0891, -0.0123]
Text 1: 1536 dimensions, first 3: [-0.0245, 0.0885, -0.0115]
Text 2: 1536 dimensions, first 3: [0.0123, 0.0342, 0.0789]
Semantically similar texts produce similar embeddings. Store them in vector databases like ChromaDB for powerful search.
Error Handling and Rate Limits
Production applications must handle errors gracefully. Implement exponential backoff for rate limits: wait progressively longer between retries.
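A minimal sketch of exponential backoff with jitter is shown below. call_with_backoff and its parameters are illustrative rather than part of the SDK, and the flaky function simulates two failures so the example runs offline and without real waiting.

```python
import random
import time

def call_with_backoff(func, retry_on=(Exception,), max_attempts=5,
                      base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call func(), retrying with exponentially growing, jittered delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise                                    # out of retries
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retries

# Offline demo: a function that fails twice, then succeeds
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated rate limit")
    return "ok"

waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky, retry_on=(TimeoutError,), sleep=waits.append)
print(result, len(waits))
```

With the OpenAI SDK, func would wrap the API call and retry_on would typically be (openai.RateLimitError,), so that errors such as an invalid API key fail immediately instead of being retried.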
Rate limits: the universe’s way of saying slow down
Real-Life Example: Interactive CLI Chatbot
Here’s a complete chatbot with conversation history:
# chatbot.py
from openai import OpenAI
class Chatbot:
    def __init__(self, system_prompt="You are a helpful assistant."):
        self.client = OpenAI()
        self.messages = [{"role": "system", "content": system_prompt}]

    def chat(self, user_input):
        self.messages.append({"role": "user", "content": user_input})
        try:
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=self.messages,
                temperature=0.7,
                max_tokens=500
            )
            reply = response.choices[0].message.content
            self.messages.append({"role": "assistant", "content": reply})
            return reply
        except Exception as e:
            # Drop the unanswered user message so the history stays consistent
            self.messages.pop()
            return f"Error: {e}"

    def save_history(self, filename="chat_history.txt"):
        with open(filename, "w", encoding="utf-8") as f:
            for msg in self.messages:
                f.write(f"{msg['role'].upper()}:\n{msg['content']}\n\n")

    def run(self):
        print("Chatbot ready. Type 'quit' to exit, 'save' to save history.\n")
        while True:
            user_input = input("You: ").strip()
            if not user_input:
                continue
            if user_input.lower() == "quit":
                break
            if user_input.lower() == "save":
                self.save_history()
                print("History saved.")
                continue
            print(f"Assistant: {self.chat(user_input)}\n")

if __name__ == "__main__":
    Chatbot("You are a knowledgeable Python expert.").run()
Usage:
$ python chatbot.py
Chatbot ready. Type 'quit' to exit, 'save' to save history.
You: What's the difference between lists and tuples?
Assistant: Lists are mutable, tuples are immutable...
You: save
History saved.
This demonstrates conversation history management, error handling, persistent storage, and an interactive loop.
Frequently Asked Questions
How much does the OpenAI API cost?
OpenAI uses pay-per-token pricing. gpt-4o-mini costs roughly $0.15 per million input tokens. Set hard spending limits in your account settings.
What’s the difference between temperature and top_p?
temperature controls randomness directly (0 = deterministic, 2 = very random). top_p uses nucleus sampling. For most apps, adjust temperature and leave top_p at 1.0.
How long can a conversation be?
Limited by the context window: 128K tokens for gpt-4o/gpt-4o-mini. Monitor response.usage to track consumption.
Can I fine-tune the models?
Yes, OpenAI supports fine-tuning for specific models. Start with prompt engineering first — it’s usually sufficient and cheaper.
How do I handle sensitive data?
Never send PII (SSNs, credit cards) to the API. Use data scrubbing and anonymization. Review OpenAI’s privacy policy for compliance.
Conclusion
You now have a comprehensive foundation for building with the OpenAI API: chat completions, system messages, streaming, function calling, vision, embeddings, and error handling. The Python SDK makes integration elegant. Start with a simple chatbot and extend from there.
Visit the official documentation at platform.openai.com/docs for advanced features like fine-tuning and batch processing.
Setting Up the OpenAI Client
The official Python SDK is openai. Authentication via the OPENAI_API_KEY environment variable is the simplest and safest path:
# pip install openai
import os
from openai import OpenAI
# Reads from OPENAI_API_KEY env var
client = OpenAI()
# Or pass explicitly (never hardcode in production)
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Test
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Say hello in one sentence"}],
)
print(resp.choices[0].message.content)
For local development, put your key in a .env file and load it with python-dotenv — never commit keys to git. In production, use your platform’s secrets manager (AWS Secrets Manager, Vault, GCP Secret Manager).
Chat Completions: The Workhorse Endpoint
Chat completions cover the large majority of real-world use cases. The model takes a list of messages with roles (system, user, assistant) and returns the next assistant message:
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "You are a Python expert. Be concise."},
{"role": "user", "content": "How do I read a CSV?"},
],
temperature=0.2, # lower = more deterministic, higher = more creative (range 0 to 2)
max_tokens=200, # cap response length
)
print(resp.choices[0].message.content)
print(resp.usage.total_tokens) # tokens you'll be billed for
The system message shapes the model’s behavior across the conversation. temperature is the most impactful parameter: drop it to 0 for code generation or structured outputs, and raise it for creative writing.
Streaming Responses
For chat UIs and long completions, streaming the response gives users immediate feedback. Iterate over chunks as they arrive:
stream = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Write a haiku about Python"}],
stream=True,
)
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
print()
Streaming reduces perceived latency from “5 seconds of waiting” to “instant first word”. Build it in from the start for any user-facing feature.
Function Calling / Tools
For agents that need to call code (lookup data, run calculations, fetch URLs), use the tools/function-calling feature. You describe the functions; the model decides when to call them and with what arguments:
tools = [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string"},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
},
"required": ["city"],
},
},
}]
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
tools=tools,
tool_choice="auto",
)
tool_call = resp.choices[0].message.tool_calls[0]
print(tool_call.function.name) # 'get_weather'
print(tool_call.function.arguments) # '{"city":"Tokyo","unit":"celsius"}'
# Your code calls the actual function, then sends the result back as a message
# with role="tool" and tool_call_id matching the call
Structured Outputs (JSON Mode)
When you need the model to return parseable JSON, use the response_format parameter or structured outputs:
from pydantic import BaseModel
class UserProfile(BaseModel):
    name: str
    age: int
    interests: list[str]
resp = client.chat.completions.parse(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Give me a sample user profile"}],
response_format=UserProfile,
)
profile = resp.choices[0].message.parsed
print(profile.name, profile.age, profile.interests)
This is the modern way to do structured extraction: the SDK validates the output against your pydantic schema, raising if the model returns anything malformed. On older SDK releases, the same helper lives at client.beta.chat.completions.parse.
Embeddings for Search and Similarity
Embeddings turn text into vectors. Cosine similarity between vectors approximates semantic similarity — the foundation of semantic search, RAG, and clustering:
resp = client.embeddings.create(
model="text-embedding-3-small",
input=[
"Python is a programming language",
"JavaScript runs in browsers",
"I love coding in Python every day",
],
)
for item in resp.data:
    print(len(item.embedding)) # 1536 dimensions
# Cosine similarity to find similar texts. OpenAI embeddings are unit-length,
# so the plain dot product already equals the cosine similarity.
import numpy as np
vecs = np.array([d.embedding for d in resp.data])
sim = vecs @ vecs.T
print(sim) # how each pair compares
Common Pitfalls
Hardcoding keys. Hardcoded API keys end up on GitHub, get scraped within hours, and get revoked. Always use environment variables.
Not setting max_tokens. Unbounded responses can rack up costs fast. Set max_tokens on every call.
Building chat history forever. Sending the whole conversation every turn means quadratic token growth. Truncate or summarize old messages once the context approaches the model’s window.
Ignoring rate limits. OpenAI returns 429 errors when you hit RPM / TPM limits. Wrap calls with exponential backoff (tenacity library) or use the async client with concurrency limits.
Treating model output as code. Never exec() or eval() generated code. Treat outputs as untrusted user input — validate, sanitize, sandbox.
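For the chat-history pitfall above, the simplest mitigation is a sliding window that keeps the system message plus the most recent messages. The sketch below uses a hypothetical trim_history helper and counts messages rather than tokens, which is a simplification; a production version would budget by token count and keep tool calls together with their results.

```python
def trim_history(messages, max_turn_messages=6):
    """Keep the system message (if any) plus the most recent messages."""
    if max_turn_messages < 1:
        raise ValueError("max_turn_messages must be at least 1")
    system = [m for m in messages if m["role"] == "system"][:1]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turn_messages:]

history = [{"role": "system", "content": "You are a Python expert."}]
for i in range(10):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_turn_messages=4)
print(len(history), len(trimmed))
print(trimmed[1]["content"])
```

Calling trim_history on the message list before each request caps the per-turn token cost, at the price of the model forgetting anything older than the window.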
FAQ
Q: Which model should I use?
A: gpt-4o-mini for cost-effective everyday work — fast and cheap. gpt-4o for harder tasks. o1 / o3 for deep reasoning. Use the smallest model that solves your problem.
Q: How do I control cost?
A: Three levers: pick a smaller model, set max_tokens, truncate conversation history. Monitor with the OpenAI dashboard — set alerts at 50% and 80% of your monthly cap.
Q: How do I handle long documents?
A: Split into chunks (1500-2000 tokens each), embed each chunk, retrieve relevant chunks for each query (RAG), and only send those to the model. LangChain and LlamaIndex automate this pattern.
Q: Is there an async client?
A: Yes — from openai import AsyncOpenAI. Same API, all methods are coroutines. Use it inside FastAPI handlers and async scrapers.
Q: What about local LLMs?
A: Run open-weight models via Ollama, llama.cpp, or LM Studio. They expose an OpenAI-compatible API — change base_url in the client and the rest of your code keeps working.
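The chunking step from the long-documents answer above can be sketched in a few lines. This version splits on whitespace and treats words as a rough proxy for tokens, which is an assumption; chunk_words and its defaults are illustrative, and a tokenizer library would give exact token counts. A small overlap between neighboring chunks helps keep sentences that straddle a boundary retrievable.

```python
def chunk_words(text, chunk_size=1500, overlap=100):
    """Split text into overlapping word windows (words approximate tokens)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break                      # the last window already reached the end
    return chunks

document = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(document, chunk_size=4, overlap=1)
print(chunks)
```

Each chunk would then be embedded and stored, and only the top-scoring chunks for a query are sent to the model.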
Wrapping Up
The OpenAI Python SDK is one of those rare SDKs whose surface area maps cleanly onto real-world tasks: chat completions, streaming, tool/function calling, structured outputs, and embeddings cover almost everything. Pick the smallest model that does the job, set max_tokens, use environment variables for keys, and validate model output before acting on it. Those four habits prevent most common production incidents.