How To Use Python concurrent.futures for Thread and Process Pools

Intermediate

You have a Python script that makes 50 API calls or processes 100 image files, and it runs painfully slowly because it does each task one at a time. Every developer hits this wall. The fix is parallelism, but Python’s threading and multiprocessing modules can be verbose and error-prone. The concurrent.futures module is the modern answer — a clean, high-level interface that makes parallel execution almost as simple as a regular function call.

The concurrent.futures module ships with Python 3.2+ and requires zero installation. It gives you two executor classes: ThreadPoolExecutor for I/O-bound tasks (network requests, file operations) and ProcessPoolExecutor for CPU-bound tasks (image processing, number crunching). Both share the same API, so switching between them is usually a one-line change.

In this guide, we’ll cover how both executors work, when to choose threads vs processes, how to submit tasks and collect results with map() and submit(), how to handle errors gracefully, and how to process results as they complete with as_completed(). By the end, you’ll be able to turn any slow sequential loop into a fast parallel pipeline.

concurrent.futures: Quick Example

Here is a minimal example that downloads 5 URLs in parallel using a thread pool. It replaces a sequential loop that would wait for each response before starting the next request:

# quick_concurrent.py
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URLS = [
    "https://httpbin.org/delay/1",
    "https://httpbin.org/get",
    "https://httpbin.org/ip",
    "https://httpbin.org/uuid",
    "https://httpbin.org/user-agent",
]

def fetch(url):
    with urllib.request.urlopen(url, timeout=10) as response:
        return url, response.status

with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch, URLS))

for url, status in results:
    print(f"{status} -- {url}")

Output:

200 -- https://httpbin.org/delay/1
200 -- https://httpbin.org/get
200 -- https://httpbin.org/ip
200 -- https://httpbin.org/uuid
200 -- https://httpbin.org/user-agent

The with ThreadPoolExecutor(max_workers=5) as executor: block creates a pool of up to 5 threads and shuts them down cleanly when the block exits. executor.map(fetch, URLS) dispatches all 5 calls in parallel and returns results in the same order as the input. Run sequentially, the total time is the sum of the five response times; run in parallel, it is roughly the time of the slowest request, here the /delay/1 endpoint at a little over 1 second.

What Is concurrent.futures and When Should You Use It?

The concurrent.futures module provides a unified interface for running callables asynchronously. Under the hood, it manages worker threads or processes for you — no manual threading.Thread creation, no Queue wiring, no join() calls. You describe what to run, and the executor handles the rest.

The key question is which executor to use. In standard CPython, the Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time, so threads do not speed up pure-Python computation. However, the GIL is released during I/O operations, so threads do speed up I/O-bound work dramatically. Each process has its own interpreter and its own GIL, so processes can run on separate CPU cores, making them right for CPU-bound work.

| Task Type | Examples | Best Executor | Why |
| --- | --- | --- | --- |
| I/O-bound | HTTP requests, file reads, DB queries | ThreadPoolExecutor | GIL released during I/O; threads are lightweight |
| CPU-bound | Image processing, parsing, math | ProcessPoolExecutor | True parallelism across CPU cores; bypasses GIL |
| Mixed | Download + process | Both in pipeline | Thread pool to download, process pool to compute |

If you are unsure, start with ThreadPoolExecutor. It is simpler (no pickling overhead) and works well for most real-world tasks that involve any I/O at all.
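
Because both executors share the same interface, the choice can be isolated to a single argument. The sketch below is illustrative only: run_batch and square are hypothetical names, not part of concurrent.futures, and only the thread-based call is shown running.

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    """Stand-in task; a real workload would do I/O or heavy computation."""
    return n * n

def run_batch(executor_cls, func, items, max_workers=4):
    """Run func over items with whichever executor class is passed in (hypothetical helper)."""
    with executor_cls(max_workers=max_workers) as executor:
        return list(executor.map(func, items))

thread_results = run_batch(ThreadPoolExecutor, square, range(5))
print(thread_results)  # [0, 1, 4, 9, 16]
```

Passing ProcessPoolExecutor instead should work the same way, provided the call sits under an if __name__ == "__main__": guard and the function is defined at module level.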

ThreadPoolExecutor: Running I/O Tasks in Parallel

The ThreadPoolExecutor is the workhorse for network-heavy Python code. Create one with max_workers to control how many threads run simultaneously. A good starting number for HTTP requests is 10-20; going higher risks hitting server rate limits or exhausting local ports.

Using executor.map() for Uniform Tasks

executor.map(fn, iterable) is the easiest pattern. It mirrors Python’s built-in map() but runs the function in parallel. Results are returned in the same order as the input, even if some tasks finish earlier.

# thread_map.py
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch_length(url):
    """Return the byte length of a URL's response body."""
    try:
        with urllib.request.urlopen(url, timeout=10) as r:
            return url, len(r.read())
    except Exception as e:
        return url, f"ERROR: {e}"

urls = [
    "https://httpbin.org/get",
    "https://httpbin.org/headers",
    "https://httpbin.org/ip",
    "https://httpbin.org/uuid",
    "https://httpbin.org/anything",
]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch_length, urls))
elapsed = time.perf_counter() - start

for url, length in results:
    print(f"{length:>8}  {url}")
print(f"\nCompleted in {elapsed:.2f}s")

Output:

     312  https://httpbin.org/get
     183  https://httpbin.org/headers
      45  https://httpbin.org/ip
      53  https://httpbin.org/uuid
     401  https://httpbin.org/anything

Completed in 0.61s

Running these 5 requests sequentially would take 2-4 seconds depending on network latency. In parallel, they all run at once and finish in the time it takes the slowest one to respond. The try/except inside fetch_length matters: without it, an exception raised in a worker would be re-raised when you iterate the results of executor.map(), and the results after the failing one would not be delivered.

Using executor.submit() for Flexible Futures

executor.submit(fn, *args) gives you more control. It returns a Future object immediately — a handle to a computation that may not have finished yet. You can collect futures and inspect them later, which is useful when tasks have different arguments or you want to process results as they arrive.

# thread_submit.py
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

def check_url(url, timeout=5):
    """Return (url, status_code) or (url, error_message)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as r:
            return url, r.status
    except urllib.error.HTTPError as e:
        # urlopen raises HTTPError for 4xx/5xx responses; report the code
        return url, e.code
    except Exception as e:
        return url, f"FAILED: {type(e).__name__}"

urls = [
    "https://httpbin.org/status/200",
    "https://httpbin.org/status/404",
    "https://httpbin.org/status/500",
    "https://httpbin.org/delay/0",
    "https://httpbin.org/get",
]

with ThreadPoolExecutor(max_workers=5) as executor:
    # Submit all tasks and keep a dict mapping Future -> url
    future_to_url = {executor.submit(check_url, url): url for url in urls}

    # Process results as each Future completes (not in submission order)
    for future in as_completed(future_to_url):
        url = future_to_url[future]
        try:
            _, status = future.result()
            print(f"[{status}] {url}")
        except Exception as exc:
            print(f"[EXCEPTION] {url}: {exc}")

Output (order varies by completion time):

[200] https://httpbin.org/get
[200] https://httpbin.org/delay/0
[404] https://httpbin.org/status/404
[500] https://httpbin.org/status/500
[200] https://httpbin.org/status/200

as_completed(future_to_url) yields futures in the order they finish, not the order they were submitted. This is ideal for displaying progress or handling results the moment they are ready. The future.result() call either returns the return value of your function or re-raises any exception that occurred inside the worker.
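
A Future can also be inspected directly. The following is a small sketch using time.sleep as a stand-in for real I/O; slow_square and record are made-up names for illustration. It shows done(), result(), and add_done_callback().

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_square(n):
    time.sleep(0.2)  # simulated I/O wait
    return n * n

completed = []

def record(future):
    """Callback invoked once the future has finished."""
    completed.append(future.result())

with ThreadPoolExecutor(max_workers=2) as executor:
    future = executor.submit(slow_square, 7)
    future.add_done_callback(record)
    print("done right after submit?", future.done())  # usually False
    value = future.result()                            # blocks until finished
    print("done after result()?", future.done())       # True

print(value, completed)
```

The callback runs in the worker thread, so it should stay short and avoid raising.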

ProcessPoolExecutor: True Parallel CPU Work

For CPU-bound tasks, threads provide no speedup because the GIL prevents true parallel execution. ProcessPoolExecutor spawns separate Python interpreter processes, each with its own GIL and memory space, enabling genuine multi-core parallelism.

# process_pool.py
import math
import time
from concurrent.futures import ProcessPoolExecutor

def is_prime(n):
    """CPU-intensive primality test."""
    if n < 2:
        return n, False
    if n == 2:
        return n, True
    if n % 2 == 0:
        return n, False
    for i in range(3, int(math.isqrt(n)) + 1, 2):
        if n % i == 0:
            return n, False
    return n, True

# Large numbers that require real computation to check
numbers = [
    999_999_937,
    999_999_929,
    999_999_893,
    999_999_883,
    999_999_877,
    999_999_613,
    999_999_541,
    999_999_527,
]

# The __main__ guard is required wherever worker processes are started by
# spawning a fresh interpreter (the default on Windows and macOS).
# Without it, each worker re-runs this module and Python raises RuntimeError.
if __name__ == "__main__":
    start = time.perf_counter()
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(is_prime, numbers))
    elapsed = time.perf_counter() - start

    for n, prime in results:
        status = "PRIME" if prime else "composite"
        print(f"{n:>15,}  {status}")
    print(f"\nChecked {len(numbers)} numbers in {elapsed:.2f}s")

Output:

    999,999,937  PRIME
    999,999,929  PRIME
    999,999,893  PRIME
    999,999,883  PRIME
    999,999,877  PRIME
    999,999,613  PRIME
    999,999,541  PRIME
    999,999,527  PRIME

Checked 8 numbers in 0.38s

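With ProcessPoolExecutor, the checks are spread across the available CPU cores (by default, one worker per CPU). Each check here needs at most about 16,000 loop iterations, so a sequential run is likely to finish in milliseconds and the time measured above is probably dominated by process start-up; the multi-core speedup becomes visible once each task does substantially more work. Note that code that creates processes should sit inside an if __name__ == '__main__': guard. On platforms that start workers by spawning a fresh interpreter (Windows, and macOS since Python 3.8), each worker re-imports the main module; without the guard, that import tries to create the pool again and Python raises a RuntimeError. The guard is not strictly needed when workers are forked, the traditional default on Linux, but it is good practice everywhere.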

The Pickling Constraint

Everything passed to a ProcessPoolExecutor must be picklable, because pickle is the serialization format Python uses to send data between processes. That means the function must be defined at module level (not nested inside another function or written as a lambda), and its arguments and return values must be built-in types, dataclass instances, or other picklable objects. This is the main gotcha that catches developers switching from ThreadPoolExecutor.

# pickling_gotcha.py
from concurrent.futures import ProcessPoolExecutor

# This works fine -- module-level function
def double(x):
    return x * 2

# This will FAIL -- lambda is not picklable
transform = lambda x: x * 3

if __name__ == "__main__":
    with ProcessPoolExecutor() as executor:
        # OK:
        results = list(executor.map(double, [1, 2, 3, 4, 5]))
        print("double results:", results)

        # This raises PicklingError:
        # results = list(executor.map(transform, [1, 2, 3]))  # DO NOT DO THIS

Output:

double results: [2, 4, 6, 8, 10]
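
One common way around the lambda restriction is functools.partial applied to a module-level function. The sketch below checks picklability directly with the pickle module rather than starting a process pool; scale is a hypothetical helper used only for illustration.

```python
import pickle
from functools import partial

def scale(x, factor):
    """Module-level function, so pickle can find it by name."""
    return x * factor

triple = partial(scale, factor=3)   # picklable replacement for the lambda
triple_lambda = lambda x: x * 3     # same behavior, but not picklable

restored = pickle.loads(pickle.dumps(triple))
print(restored(5))  # 15

try:
    pickle.dumps(triple_lambda)
    lambda_picklable = True
except (pickle.PicklingError, AttributeError) as exc:
    lambda_picklable = False
    print(f"lambda cannot be pickled: {type(exc).__name__}")
```

Anything that survives a pickle.dumps/pickle.loads round trip like this should also be acceptable as a ProcessPoolExecutor task.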

Timeouts and Cancellation

Production code must handle slow or hanging tasks. Both executors support per-call timeouts via future.result(timeout=N). If the result is not available within N seconds of that call, a TimeoutError is raised. The task itself is not cancelled: it continues running in the background, but your main thread can move on.

# timeout_example.py
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def slow_fetch(name, delay_seconds=3):
    """Fetch a URL that deliberately delays the response."""
    full_url = f"https://httpbin.org/delay/{delay_seconds}"
    try:
        with urllib.request.urlopen(full_url, timeout=10) as r:
            return name, r.status
    except Exception as e:
        return name, f"ERROR: {e}"

tasks = [
    ("fast", 0),
    ("medium", 2),
    ("slow", 5),
]

with ThreadPoolExecutor(max_workers=3) as executor:
    futures = {
        executor.submit(slow_fetch, name, delay): name
        for name, delay in tasks
    }
    deadline = time.monotonic() + 3  # shared 3-second deadline from submission

    # Iterate the futures directly: as_completed() only yields futures that
    # have already finished, so result(timeout=...) could never time out there.
    for future, name in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            _, status = future.result(timeout=remaining)
            print(f"[OK] {name}: {status}")
        except TimeoutError:
            print(f"[TIMEOUT] {name}: took too long")

Output:

[OK] fast: 200
[OK] medium: 200
[TIMEOUT] slow: took too long

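The “slow” task requested a 5-second delay, so its result is not available within the 3-second limit and future.result(timeout=...) raises TimeoutError. A future yielded by as_completed() has already finished, so a timeout on its result() can never fire; apply the timeout to futures that may still be pending, or pass timeout= to as_completed() itself. The underlying thread is still running, and the with block waits for it on exit, so it is your responsibility to design workers that can be abandoned safely. Future.cancel() only stops tasks that have not started yet. For true cancellation, consider using asyncio with task cancellation support instead.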
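
To make the limits of cancellation concrete, here is a minimal sketch with a single worker and time.sleep standing in for real work (work is a made-up task name). Future.cancel() succeeds only for tasks still waiting in the queue.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def work(n):
    time.sleep(0.3)  # simulated slow task
    return n

executor = ThreadPoolExecutor(max_workers=1)
futures = [executor.submit(work, n) for n in range(4)]
time.sleep(0.05)  # let the single worker pick up the first task

# cancel() returns True only for tasks that have not started running
cancelled = [f.cancel() for f in futures]
executor.shutdown(wait=True)

print(cancelled)            # [False, True, True, True]
print(futures[0].result())  # 0 (the running task still completed)
```

On Python 3.9+, executor.shutdown(cancel_futures=True) cancels all still-pending tasks in one call; running tasks are still allowed to finish.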

Real-Life Example: Parallel Website Health Checker

Let’s build a practical tool that checks a list of URLs in parallel, reports status codes and response times, and flags any that fail or respond too slowly.

# url_health_checker.py
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass

@dataclass
class CheckResult:
    url: str
    status: int
    elapsed_ms: float
    error: str = ""

def check_url(url):
    """Check a URL and return a CheckResult."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=8) as response:
            elapsed = (time.perf_counter() - start) * 1000
            return CheckResult(url=url, status=response.status, elapsed_ms=round(elapsed, 1))
    except urllib.error.HTTPError as e:
        elapsed = (time.perf_counter() - start) * 1000
        return CheckResult(url=url, status=e.code, elapsed_ms=round(elapsed, 1))
    except Exception as e:
        elapsed = (time.perf_counter() - start) * 1000
        return CheckResult(url=url, status=0, elapsed_ms=round(elapsed, 1), error=str(e))

def run_health_check(urls, max_workers=10, slow_threshold_ms=2000):
    """Run parallel health checks and print a report."""
    print(f"Checking {len(urls)} URLs with {max_workers} workers...\n")
    print(f"{'Status':<8} {'Time (ms)':>10} {'URL'}")
    print("-" * 60)

    results = []
    start_total = time.perf_counter()

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(check_url, url): url for url in urls}
        for future in as_completed(future_to_url):
            result = future.result()
            results.append(result)
            flag = " [SLOW]" if result.elapsed_ms > slow_threshold_ms else ""
            flag = " [ERROR]" if result.error else flag
            flag = " [DOWN]" if result.status >= 500 else flag
            print(f"{result.status:<8} {result.elapsed_ms:>10.1f} {result.url}{flag}")

    total_elapsed = (time.perf_counter() - start_total)
    ok = sum(1 for r in results if 200 <= r.status < 300)
    print(f"\nDone in {total_elapsed:.2f}s | OK: {ok}/{len(urls)}")
    return results

if __name__ == "__main__":
    urls_to_check = [
        "https://httpbin.org/get",
        "https://httpbin.org/status/200",
        "https://httpbin.org/status/404",
        "https://httpbin.org/delay/1",
        "https://httpbin.org/ip",
        "https://httpbin.org/uuid",
    ]
    run_health_check(urls_to_check, max_workers=6)

Output:

Checking 6 URLs with 6 workers...

Status   Time (ms)  URL
------------------------------------------------------------
200        312.4  https://httpbin.org/get
200        198.7  https://httpbin.org/ip
200        201.1  https://httpbin.org/uuid
200        203.9  https://httpbin.org/status/200
404        195.3  https://httpbin.org/status/404
200       1203.8  https://httpbin.org/delay/1

Done in 1.21s | OK: 5/6

This checker runs all 6 requests in parallel, prints each result as it arrives (thanks to as_completed), and provides a summary. The @dataclass makes the result clean and typed. You can extend it by adding CSV export, retry logic for 5xx errors, or an alert when a response exceeds slow_threshold_ms.

Frequently Asked Questions

How many workers should I use?

For ThreadPoolExecutor with I/O-bound tasks, a common rule is 10-50 workers depending on the task. The default (when max_workers is omitted) is min(32, os.cpu_count() + 4) in Python 3.8+. For ProcessPoolExecutor with CPU-bound tasks, use os.cpu_count() or leave max_workers unset, which defaults to the number of CPUs. Too many workers add overhead and can trigger rate limiting on the server side.

When should I use map() vs submit()?

Use executor.map(fn, *iterables) when every task calls the same function and you want results in input order; pass one iterable per positional argument. Use executor.submit(fn, *args, **kwargs) when tasks have different arguments, you need the results as they complete (via as_completed), or you want to inspect Future objects individually. For most batch processing, map() is simpler; for monitoring progress or mixed tasks, use submit().
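
As a small illustration of the two call styles, the sketch below runs the same two-argument function both ways. power is a hypothetical example function.

```python
from concurrent.futures import ThreadPoolExecutor

def power(base, exponent):
    return base ** exponent

pairs = [(2, 10), (3, 2), (4, 3)]

with ThreadPoolExecutor(max_workers=3) as executor:
    # map(): one iterable per positional argument, results in input order
    mapped = list(executor.map(power, [2, 3, 4], [10, 2, 3]))

    # submit(): one call per task, positional or keyword arguments allowed
    futures = [executor.submit(power, b, exponent=e) for b, e in pairs]
    submitted = [f.result() for f in futures]

print(mapped)     # [1024, 9, 64]
print(submitted)  # [1024, 9, 64]
```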

How do exceptions work inside workers?

Any exception raised inside a worker function is captured and stored in the Future. It is re-raised when you call future.result() or when iterating executor.map() results. With map(), the exception is raised at the point you access the failing result in the iterator -- so wrap the iteration in a try/except, and note that the iterator ends there, so later results are not delivered. With submit() and as_completed(), wrap each future.result() call individually so one failing task does not stop the rest.
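
The difference is easiest to see side by side. This sketch uses a hypothetical parse function that fails on one input, and compares what each approach recovers.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def parse(text):
    return int(text)  # raises ValueError for non-numeric text

inputs = ["1", "2", "oops", "4"]

with ThreadPoolExecutor(max_workers=2) as executor:
    # map(): iteration stops at the first failing result
    mapped = []
    try:
        for value in executor.map(parse, inputs):
            mapped.append(value)
    except ValueError as exc:
        print(f"map() stopped early: {exc}")

    # submit() + as_completed(): each result is handled on its own
    futures = {executor.submit(parse, text): text for text in inputs}
    good, bad = [], []
    for future in as_completed(futures):
        try:
            good.append(future.result())
        except ValueError:
            bad.append(futures[future])

print(mapped)        # [1, 2]
print(sorted(good))  # [1, 2, 4]
print(bad)           # ['oops']
```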

When should I use concurrent.futures vs asyncio?

Use concurrent.futures when you have synchronous (blocking) functions you want to run in parallel without rewriting them. It works with any existing code. Use asyncio when you are writing new I/O-heavy code from scratch and want maximum concurrency with minimal thread overhead -- asyncio can handle thousands of concurrent connections in a single thread. You can also combine them: the event loop's loop.run_in_executor() method (or asyncio.to_thread() in Python 3.9+) lets you run blocking code in a thread pool from inside an async function.
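
A minimal sketch of combining the two, assuming a blocking function (blocking_io, a stand-in that just sleeps) that cannot be rewritten as a coroutine:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_io(n):
    time.sleep(0.1)  # stands in for a blocking library call
    return n * 2

async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=3) as pool:
        # Each call runs in a pool thread; the event loop stays responsive
        tasks = [loop.run_in_executor(pool, blocking_io, n) for n in range(3)]
        return await asyncio.gather(*tasks)

results = asyncio.run(main())
print(results)  # [0, 2, 4]
```

run_in_executor() is a method of the running event loop, which is why the sketch fetches the loop with asyncio.get_running_loop() first.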

What does the with block do for executors?

Using with ThreadPoolExecutor() as executor: calls executor.shutdown(wait=True) when the block exits. This waits for all submitted futures to complete before proceeding. If you create an executor without the context manager, you must call executor.shutdown() manually or risk leaving threads/processes running after your script ends. The context manager is the safer and recommended pattern.

Conclusion

The concurrent.futures module gives you clean, high-level parallelism with minimal code. You have learned when to use ThreadPoolExecutor for I/O-bound tasks and ProcessPoolExecutor for CPU-bound work, how executor.map() delivers ordered results effortlessly, and how executor.submit() with as_completed() lets you handle results the moment they arrive. You also know how to handle timeouts, exceptions, and the pickling constraint that affects process pools.

The health checker example is a real starting point -- extend it to check your own URLs, write results to a CSV, or send Slack alerts when a site goes down. The pattern scales from 5 URLs to 5,000 with a single max_workers change. For the full API reference, see the Python concurrent.futures documentation.

How To Use Python Decorators: A Complete Guide

Intermediate

You’ve probably seen the @ symbol above function definitions in Python code and wondered what it does. That’s a decorator — one of Python’s most powerful and elegant features. Decorators let you wrap a function with additional behavior (logging, caching, access control, rate limiting, timing) without modifying the function’s code. They’re the reason you can add authentication to a Flask route with a single line, or enable caching with @functools.lru_cache.

Decorators are a pure Python feature — no installation required. They’re built on Python’s first-class functions (functions that can be passed as arguments and returned from other functions). Once you understand how decorators work mechanically, you’ll be able to read and write the patterns used by virtually every Python framework, from Django’s @login_required to FastAPI’s @app.get() to pytest’s @pytest.fixture.

In this tutorial, you’ll learn how decorators work from first principles, how to use functools.wraps to preserve function metadata, how to write parameterized decorators (decorators that take arguments), how to stack multiple decorators, how to use class-based decorators, and how to apply these techniques in real-world scenarios like timing, retry logic, and access control.

Decorators: Quick Example

Here’s the simplest useful decorator — one that logs when a function is called:

# decorator_quick.py
import functools

def log_calls(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f"Calling {func.__name__}({args}, {kwargs})")
        result = func(*args, **kwargs)
        print(f"{func.__name__} returned {result}")
        return result
    return wrapper

@log_calls
def add(a, b):
    return a + b

# This is equivalent to: add = log_calls(add)
result = add(3, 4)
print(f"Final result: {result}")

# The function's identity is preserved
print(f"Function name: {add.__name__}")

Output:

Calling add((3, 4), {})
add returned 7
Final result: 7
Function name: add

The @log_calls syntax is shorthand for add = log_calls(add). The decorator receives the original function, returns a new wrapper function that adds behavior before and after calling the original, and replaces the name add with the wrapper. The @functools.wraps(func) line copies the original function’s name, docstring, and other metadata onto the wrapper — always include this.

How Decorators Work: First Principles

To truly understand decorators, you need to understand that in Python, functions are objects — they can be passed as arguments and returned from other functions. This is called “first-class functions.” Decorators are just a syntax shortcut for a function transformation pattern.

# first_class_functions.py

# Functions can be passed as arguments
def apply_twice(func, value):
    return func(func(value))

def double(x):
    return x * 2

result = apply_twice(double, 3)
print(f"Apply twice: {result}")  # 3 -> 6 -> 12

# Functions can be returned from other functions
def make_multiplier(n):
    def multiplier(x):
        return x * n
    return multiplier  # Returns the inner function

triple = make_multiplier(3)
print(f"Triple 5: {triple(5)}")  # 15

# The decorator pattern manually, without @ syntax
def shout(func):
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        return result.upper() + "!!!"
    return wrapper

def greet(name):
    return f"Hello, {name}"

# Without @ syntax -- same result
greet = shout(greet)
print(greet("alice"))  # HELLO, ALICE!!!

Output:

Apply twice: 12
Triple 5: 15
HELLO, ALICE!!!

The key insight: @shout above a function definition is exactly equivalent to writing greet = shout(greet) after the definition. The @ syntax just makes it more readable and places the decoration visually near the function definition where it belongs.

Always Use functools.wraps

Without @functools.wraps(func), your decorator replaces the original function’s metadata with the wrapper’s. This causes problems with debugging, documentation, and tools that inspect function names. Always include it:

# wraps_example.py
import functools

# WITHOUT functools.wraps -- breaks function identity
def bad_decorator(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

# WITH functools.wraps -- preserves identity
def good_decorator(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@bad_decorator
def my_function_bad():
    """This function does something important."""
    pass

@good_decorator
def my_function_good():
    """This function does something important."""
    pass

print(f"Bad decorator name:    {my_function_bad.__name__}")
print(f"Bad decorator docstr:  {my_function_bad.__doc__}")
print()
print(f"Good decorator name:   {my_function_good.__name__}")
print(f"Good decorator docstr: {my_function_good.__doc__}")

Output:

Bad decorator name:    wrapper
Bad decorator docstr:  None

Good decorator name:   my_function_good
Good decorator docstr: This function does something important.

Practical Decorator Examples

Timing Functions

A timer decorator measures how long a function takes to execute — great for performance monitoring and identifying bottlenecks:

# timer_decorator.py
import functools
import time

def timer(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f} seconds")
        return result
    return wrapper

@timer
def slow_function():
    time.sleep(0.1)
    return "done"

@timer
def sum_million():
    return sum(range(1_000_000))

slow_function()
result = sum_million()
print(f"Sum result: {result:,}")

Output:

slow_function took 0.1002 seconds
sum_million took 0.0312 seconds
Sum result: 499,999,500,000

Retry Logic

A retry decorator automatically re-runs a function if it raises an exception — essential for network calls, database operations, and any code that can fail transiently:

# retry_decorator.py
import functools
import time
import random

def retry(max_attempts=3, delay=1.0, exceptions=(Exception,)):
    """Decorator factory: retries a function up to max_attempts times."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as e:
                    last_error = e
                    print(f"Attempt {attempt}/{max_attempts} failed: {e}")
                    if attempt < max_attempts:
                        time.sleep(delay)
            raise last_error
        return wrapper
    return decorator

# Simulated unreliable function (fails 70% of the time)
call_count = 0

@retry(max_attempts=5, delay=0.1, exceptions=(ValueError,))
def unreliable_api_call():
    global call_count
    call_count += 1
    if random.random() < 0.7:
        raise ValueError(f"API timeout on call #{call_count}")
    return f"Success on call #{call_count}"

random.seed(42)
result = unreliable_api_call()
print(f"Final result: {result}")

Output:

Attempt 1/5 failed: API timeout on call #1
Attempt 2/5 failed: API timeout on call #2
Attempt 3/5 failed: API timeout on call #3
Attempt 4/5 failed: API timeout on call #4
Final result: Success on call #5

Notice the decorator factory pattern: retry(max_attempts=5, delay=0.1) returns a decorator, which then returns a wrapper. This is a three-level nesting -- outer function configures, middle function receives the function to decorate, inner function is what actually runs. This is the standard pattern for parameterized decorators.

Parameterized Decorators

When your decorator needs configuration (like the number of retries in the example above), you add one more level of nesting -- a "decorator factory" that takes the parameters and returns the actual decorator:

# parameterized_decorator.py
import functools

def repeat(n):
    """Call the decorated function n times."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            results = []
            for _ in range(n):
                results.append(func(*args, **kwargs))
            return results
        return wrapper
    return decorator

@repeat(3)
def say_hello(name):
    return f"Hello, {name}!"

results = say_hello("Alice")
for r in results:
    print(r)

Output:

Hello, Alice!
Hello, Alice!
Hello, Alice!

Stacking Multiple Decorators

You can apply multiple decorators to the same function by stacking them. They apply from bottom to top (closest to the function first):

# stacking_decorators.py
import functools
import time

def timer(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"  [timer] {func.__name__}: {time.perf_counter()-start:.4f}s")
        return result
    return wrapper

def log_result(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        print(f"  [log] {func.__name__} returned: {result}")
        return result
    return wrapper

# Applied bottom-up: log_result wraps the original,
# then timer wraps log_result's wrapper
@timer
@log_result
def compute(x, y):
    return x ** y

result = compute(2, 10)
print(f"Final result: {result}")

Output:

  [log] compute returned: 1024
  [timer] compute: 0.0001s
Final result: 1024
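
Stacking order changes the result when each decorator transforms the return value. The sketch below uses two hypothetical decorators, bold and italic, and compares the @ form with the equivalent manual composition.

```python
import functools

def bold(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return f"<b>{func(*args, **kwargs)}</b>"
    return wrapper

def italic(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return f"<i>{func(*args, **kwargs)}</i>"
    return wrapper

@bold
@italic
def greet(name):
    return f"Hello, {name}"

def plain(name):
    return f"Hello, {name}"

# Manual equivalent: the decorator nearest the def is applied first
manual = bold(italic(plain))

print(greet("Ada"))   # <b><i>Hello, Ada</i></b>
print(manual("Ada"))  # <b><i>Hello, Ada</i></b>
```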

Real-Life Example: Access Control Decorators

Here's a practical access control system using decorators -- the same pattern used by web frameworks for route authentication:

# access_control.py
import functools

# Simulated current user session
current_user = {'name': 'alice', 'roles': ['user', 'editor'], 'logged_in': True}

def login_required(func):
    """Decorator that requires the user to be logged in."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if not current_user.get('logged_in'):
            print(f"Access denied: login required for {func.__name__}")
            return None
        return func(*args, **kwargs)
    return wrapper

def require_role(role):
    """Decorator factory: requires the user to have a specific role."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if role not in current_user.get('roles', []):
                print(f"Access denied: '{role}' role required for {func.__name__}")
                return None
            return func(*args, **kwargs)
        return wrapper
    return decorator

@login_required
def view_dashboard():
    return f"Dashboard for {current_user['name']}"

@login_required
@require_role('admin')
def delete_user(user_id):
    return f"Deleted user {user_id}"

@login_required
@require_role('editor')
def publish_post(post_id):
    return f"Published post {post_id}"

# Alice is logged in and has 'editor' but not 'admin'
print(view_dashboard())
print(delete_user(42))
print(publish_post(101))

# Simulate a logged-out user
current_user['logged_in'] = False
print(view_dashboard())

Output:

Dashboard for alice
Access denied: 'admin' role required for delete_user
Published post 101
Access denied: login required for view_dashboard

This is the same general pattern used by Flask-Login's @login_required and Django's @permission_required. The decorators are reusable across any number of functions -- add access control to a new function by adding one line above its definition. Stacking @login_required on top of @require_role('admin') means the user must pass both checks: logged in AND has the required role.

Frequently Asked Questions

When should I use a decorator instead of a helper function?

Use a decorator when you want to add the same cross-cutting behavior (logging, timing, validation, caching) to multiple functions without repeating the logic. If you find yourself writing the same "before" and "after" code in many functions, that's a strong signal to extract it into a decorator. For one-off or highly specific behavior, a regular helper function is simpler.

Can I use a class as a decorator?

Yes -- any callable can be a decorator. A class with a __call__ method works as a decorator. Class-based decorators are useful when you need to maintain state between calls (like call counts or cached results). Define __init__(self, func) to receive the function and __call__(self, *args, **kwargs) to wrap it. Because the wrapper is an instance rather than a function, call functools.update_wrapper(self, func) in __init__ instead of applying @functools.wraps to __call__.
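
A minimal sketch of a class-based decorator that keeps state between calls. CountCalls is an illustrative name, not a standard-library class.

```python
import functools

class CountCalls:
    """Decorator class that counts how often the wrapped function is called."""

    def __init__(self, func):
        functools.update_wrapper(self, func)  # copy __name__, __doc__, etc.
        self.func = func
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        return self.func(*args, **kwargs)

@CountCalls
def greet(name):
    """Return a greeting."""
    return f"Hello, {name}"

greet("Ada")
greet("Grace")
print(greet.calls)     # 2
print(greet.__name__)  # greet
```

As written, this version is meant for plain functions; decorating instance methods with a class like this would additionally require implementing __get__.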

Do decorators work on class methods?

Yes. An instance method receives self as its first positional argument, and a wrapper that accepts *args, **kwargs passes it through automatically. There is one caveat: @staticmethod and @classmethod are themselves decorators. When stacking with them, place @staticmethod or @classmethod outermost (at the top of the stack, farthest from the def) so that your own decorator wraps the plain function first.
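The sketch below decorates an instance method, a class method, and a static method with the same wrapper. The log_call decorator and the Temperature class are hypothetical names chosen for the example.

```python
import functools

log = []  # names of the functions that were called, in order


def log_call(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.append(func.__name__)
        return func(*args, **kwargs)
    return wrapper


class Temperature:
    def __init__(self, celsius):
        self.celsius = celsius

    @log_call                 # self simply arrives as args[0]
    def fahrenheit(self):
        return self.celsius * 9 / 5 + 32

    @classmethod              # on top: applied last
    @log_call                 # wraps the plain function first
    def from_fahrenheit(cls, value):
        return cls((value - 32) * 5 / 9)

    @staticmethod             # on top: applied last
    @log_call
    def is_freezing(celsius):
        return celsius <= 0


boiling = Temperature(100).fahrenheit()
from_f = Temperature.from_fahrenheit(212).celsius
freezing = Temperature.is_freezing(-5)
print(boiling, from_f, freezing)
print(log)
```

Keeping @classmethod and @staticmethod on top means log_call always receives an ordinary function, which works on every supported Python version.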

What is @functools.lru_cache and when should I use it?

@functools.lru_cache(maxsize=128) memoizes a function's return values -- if the function is called again with the same arguments, it returns the cached result instead of recomputing. Use it for pure functions (no side effects) that are called repeatedly with the same inputs. It's especially powerful for recursive functions like Fibonacci where the same sub-problems repeat many times.
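As a small illustration, here is a recursive Fibonacci function with and without repeated work. The fib name is just for the example; cache_info() is part of the standard lru_cache API.

```python
import functools


@functools.lru_cache(maxsize=128)
def fib(n):
    """Naive recursive Fibonacci; the cache removes the repeated sub-calls."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


value = fib(30)
info = fib.cache_info()  # named tuple: hits, misses, maxsize, currsize
print(value)
print(info)
```

Each distinct argument is computed only once, so the call count grows linearly with n instead of exponentially. Note that the arguments must be hashable, because they are used as dictionary keys inside the cache.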

Why does my IDE show wrong type hints after applying a decorator?

Without @functools.wraps, the decorated function's signature shows as (*args, **kwargs) -- losing the original type hints. With @functools.wraps, the function identity is preserved, but the signature the type checker sees is still the wrapper's. For full type hint preservation in decorated functions, use typing.ParamSpec and typing.Concatenate (Python 3.10+) to annotate the wrapper correctly.

Conclusion

Decorators are one of Python's most powerful code-reuse mechanisms. In this tutorial, you learned how Python's first-class functions make decorators possible, why @functools.wraps(func) is essential in every decorator, how to write practical decorators for timing, retry logic, and logging, how to create parameterized decorators using a decorator factory pattern, how to stack multiple decorators on a single function, and how the access control pattern mirrors real framework implementations.

The access control project is a foundation you can extend: add role inheritance, time-based access restrictions, or rate limiting. Every web framework you'll encounter -- Flask, Django, FastAPI -- relies heavily on decorators for its most important features.

For deeper coverage, see the functools module documentation and PEP 318 which introduced decorator syntax to Python.

How To Use Python Generators and yield for Memory-Efficient Code

Intermediate

Imagine you need to process a log file with 10 million lines. The naive approach — reading the whole file into a list first — would use several gigabytes of memory before you even start processing. Python generators solve this elegantly: instead of building the entire sequence in memory, they produce values one at a time, on demand. This “lazy evaluation” approach lets you work with datasets of any size using a constant, tiny amount of memory.

Generators are a core Python feature that you’ll find everywhere in Python’s standard library — range(), zip(), enumerate(), and file iteration all use generator-like lazy evaluation. Understanding generators not only helps you write memory-efficient code, it also makes you a better Python developer because you’ll understand why these built-in tools work the way they do. No installation needed — generators are a built-in language feature.

In this tutorial, you’ll learn how to create generators with yield, understand how the generator protocol works under the hood, use generator expressions as compact alternatives to list comprehensions, delegate to sub-generators with yield from, build generator pipelines for data processing, and apply all of this in a practical log file analysis project.

Generators: Quick Example

Here’s a generator function next to its equivalent list-based function, demonstrating the memory difference:

# generators_quick.py
import sys

# List version: builds entire sequence in memory
def first_n_squares_list(n):
    return [i * i for i in range(n)]

# Generator version: produces values one at a time
def first_n_squares_gen(n):
    for i in range(n):
        yield i * i

# Compare memory usage
squares_list = first_n_squares_list(1000000)
squares_gen = first_n_squares_gen(1000000)

print(f"List size:      {sys.getsizeof(squares_list):,} bytes")
print(f"Generator size: {sys.getsizeof(squares_gen):,} bytes")

# Use the generator just like any iterable
total = sum(squares_gen)
print(f"Sum of first 1M squares: {total:,}")

Output:

List size:      8,448,728 bytes
Generator size: 104 bytes
Sum of first 1M squares: 333,332,833,333,500,000

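The generator object itself stays tiny, around a hundred bytes (the exact figure varies by Python version), regardless of how many values it will produce. The list consumed over 8 MB for its array of references alone; sys.getsizeof() does not even count the integer objects the list points to. Both can be iterated the same way, because sum() works with any iterable. The key difference is when and how the values are created.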

Using yield to pause execution
yield is just return with commitment issues.

What Are Generators and How Do They Work?

A generator function looks like a regular function, but uses yield instead of return. When you call a generator function, it doesn’t execute the function body — it returns a generator object. The body executes lazily, only when you ask for the next value.

| Feature | Regular Function | Generator Function |
|---|---|---|
| Returns | A value immediately | A generator object immediately |
| Execution | Runs completely on call | Runs step by step, pausing at each yield |
| Memory | All values in memory at once | One value in memory at a time |
| Re-usable | Yes, call it again | No (exhausted after one iteration) |
| Syntax | return value | yield value |

The yield keyword does two things: it sends a value out of the generator, and it pauses execution at that point. The generator’s local variables and execution state are preserved between yields. When next() is called again, execution resumes from right after the yield statement.
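A short sketch of state being preserved between yields: the running_total generator below (a name chosen for this example) keeps its total variable alive across calls to next().

```python
def running_total(numbers):
    total = 0              # local state survives every pause
    for n in numbers:
        total += n
        yield total        # hand back the current total, then pause here


gen = running_total([5, 10, 20])
first = next(gen)    # runs until the first yield
second = next(gen)   # resumes after the yield, with total still set
third = next(gen)
print(first, second, third)
```

A regular function would need a class or a global variable to remember total between calls; the generator gets that bookkeeping for free.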

The Generator Protocol

Generators implement Python’s iterator protocol: they have a __next__() method. You can call next() manually to see exactly how this works step by step:

# generator_protocol.py

def countdown(n):
    print(f"Starting countdown from {n}")
    while n > 0:
        print(f"  About to yield {n}")
        yield n
        print(f"  Resumed after yielding {n}")
        n -= 1
    print("Countdown complete!")

# Create the generator object (nothing runs yet)
gen = countdown(3)
print(f"Generator object: {gen}")
print()

# Manually advance the generator
val1 = next(gen)
print(f"Got: {val1}\n")

val2 = next(gen)
print(f"Got: {val2}\n")

val3 = next(gen)
print(f"Got: {val3}\n")

# One more next() raises StopIteration
try:
    next(gen)
except StopIteration:
    print("Generator exhausted -- StopIteration raised")

Output:

Generator object: <generator object countdown at 0x7f8b1c2d3a50>

Starting countdown from 3
  About to yield 3
Got: 3

  Resumed after yielding 3
  About to yield 2
Got: 2

  Resumed after yielding 2
  About to yield 1
Got: 1

  Resumed after yielding 1
Countdown complete!
Generator exhausted -- StopIteration raised

This output reveals the exact sequence: calling countdown(3) did nothing. The first next(gen) started execution, ran until the first yield 3, and paused. The second next(gen) resumed right after that yield. For loops handle StopIteration automatically — they call next() and stop when the exception is raised.
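To make the role of StopIteration concrete, here is roughly what a for loop does on your behalf, written out by hand. This is a simplified sketch using a shortened countdown generator.

```python
def countdown(n):
    while n > 0:
        yield n
        n -= 1


# Hand-written equivalent of: for value in countdown(3): collected.append(value)
collected = []
iterator = iter(countdown(3))   # a generator is its own iterator
while True:
    try:
        value = next(iterator)
    except StopIteration:
        break                   # the for loop swallows this exception silently
    collected.append(value)

print(collected)
```

The explicit loop and the for loop produce the same values; the for statement just hides the next() calls and the exception handling.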

Generator pipelines for data processing
Chain generators and watch your data flow like water.

Generator Expressions

Generator expressions are the compact syntax for creating simple generators — they look exactly like list comprehensions but use parentheses instead of square brackets. They’re ideal for single-use transformations passed directly to functions:

# generator_expressions.py

# List comprehension: builds all values immediately
squares_list = [x*x for x in range(10)]
print(f"List: {squares_list}")

# Generator expression: lazy, no brackets
squares_gen = (x*x for x in range(10))
print(f"Generator: {squares_gen}")
print(f"Sum via generator: {sum(squares_gen)}")

# Use generator expressions inline -- no extra parentheses needed
total = sum(x*x for x in range(1000000))
print(f"Sum of 1M squares: {total:,}")

# Filter with generator expressions
big_squares = (x*x for x in range(100) if x*x > 500)
print(f"Squares > 500: {list(big_squares)[:5]}...")

# Chained transformations (memory-efficient pipeline)
data = range(1, 1000001)
even_nums = (x for x in data if x % 2 == 0)
squared = (x*x for x in even_nums)
under_million = (x for x in squared if x < 1_000_000)
result = list(under_million)
print(f"Even squares under 1M: {len(result)} values")

Output:

List: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Generator: <generator object <genexpr> at 0x7f8b1c2d3b60>
Sum via generator: 285
Sum of 1M squares: 333,332,833,333,500,000
Squares > 500: [529, 576, 625, 676, 729]...
Even squares under 1M: 499 values

Delegating with yield from

yield from lets a generator delegate to another iterable -- it's cleaner than looping and yielding each item manually. This is especially useful when building recursive generators or combining multiple generators:

# yield_from.py

# Without yield from -- verbose
def flatten_manual(nested):
    for sublist in nested:
        for item in sublist:
            yield item

# With yield from -- cleaner
def flatten(nested):
    for sublist in nested:
        yield from sublist

data = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print("Flattened:", list(flatten(data)))

# Chain multiple generators with yield from
def chain_generators(*iterables):
    for it in iterables:
        yield from it

gen1 = (x for x in range(3))
gen2 = (x*10 for x in range(3))
gen3 = ['a', 'b', 'c']

chained = list(chain_generators(gen1, gen2, gen3))
print("Chained:", chained)

Output:

Flattened: [1, 2, 3, 4, 5, 6, 7, 8, 9]
Chained: [0, 1, 2, 0, 10, 20, 'a', 'b', 'c']
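The recursive use of yield from mentioned above can be sketched as follows. The flatten_deep name is made up for this example; it handles nesting of any depth, unlike the single-level flatten shown earlier.

```python
from collections.abc import Iterable


def flatten_deep(items):
    """Yield the leaves of an arbitrarily nested iterable, depth-first."""
    for item in items:
        # Strings are iterable too, but should be treated as single values.
        if isinstance(item, Iterable) and not isinstance(item, (str, bytes)):
            yield from flatten_deep(item)   # delegate to the recursive call
        else:
            yield item


nested = [1, [2, [3, 4], 5], [[6]], "seven"]
flat = list(flatten_deep(nested))
print(flat)
```

Without yield from, the recursive branch would need its own inner for loop that re-yields every item from the nested call.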

Real-Life Example: Log File Analyzer

Here's a practical generator pipeline that processes a large log file without loading it all into memory at once. This pattern handles files of any size efficiently:

# log_analyzer.py
import re
from datetime import datetime

# Sample log data (representing a large file in real use)
SAMPLE_LOG = """2026-04-17 09:00:01 INFO  User alice logged in
2026-04-17 09:01:15 ERROR Database connection timeout
2026-04-17 09:01:16 ERROR Retrying connection (attempt 1/3)
2026-04-17 09:01:17 INFO  Database reconnected successfully
2026-04-17 09:02:33 WARNING High memory usage: 87%
2026-04-17 09:03:45 INFO  User bob logged in
2026-04-17 09:04:12 ERROR Disk write failed: /var/log/app.log
2026-04-17 09:05:00 INFO  Backup completed successfully
2026-04-17 09:06:22 ERROR Authentication failed for user charlie
2026-04-17 09:07:11 INFO  Scheduled job completed in 1.23s"""

# Generator: read lines one at a time (use open() for real files)
def read_lines(text):
    for line in text.strip().splitlines():
        yield line

# Generator: parse each line into a structured dict
LOG_PATTERN = re.compile(
    r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+)\s+(.+)'
)

def parse_logs(lines):
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            timestamp_str, level, message = match.groups()
            yield {
                'timestamp': datetime.strptime(timestamp_str, '%Y-%m-%d %H:%M:%S'),
                'level': level,
                'message': message
            }

# Generator: filter to only error entries
def filter_level(entries, level):
    for entry in entries:
        if entry['level'] == level:
            yield entry

# Build the pipeline (nothing runs yet)
lines = read_lines(SAMPLE_LOG)
parsed = parse_logs(lines)
errors = filter_level(parsed, 'ERROR')

# Consume the pipeline -- data flows through all stages
print("ERROR log entries:")
for entry in errors:
    time_str = entry['timestamp'].strftime('%H:%M:%S')
    print(f"  [{time_str}] {entry['message']}")

# Rebuild the pipeline and count levels (a generator is exhausted after one pass).
# Iterating directly keeps the streaming behavior instead of building a list.
counts = {}
for entry in parse_logs(read_lines(SAMPLE_LOG)):
    counts[entry['level']] = counts.get(entry['level'], 0) + 1

print("\nLog level summary:")
for level, count in sorted(counts.items()):
    print(f"  {level}: {count}")

Output:

ERROR log entries:
  [09:01:15] Database connection timeout
  [09:01:16] Retrying connection (attempt 1/3)
  [09:04:12] Disk write failed: /var/log/app.log
  [09:06:22] Authentication failed for user charlie

Log level summary:
  ERROR: 4
  INFO: 5
  WARNING: 1

This pipeline pattern is memory-efficient because at any moment, only one log line is being processed as it flows through read_lines -> parse_logs -> filter_level. For a 10 GB log file, this approach uses roughly the same small amount of memory as it does for a 10-line file. To use this with a real log file, replace read_lines(SAMPLE_LOG) with a file object opened inside a with block, for example with open('app.log', 'r') as log_file. File objects are lazy iterators that yield one line at a time, so they plug into the pipeline just like a generator.
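A minimal sketch of the file-based variant follows. It writes a two-line temporary file to stand in for app.log, and it keeps timestamps as plain strings to stay short; both choices are assumptions made for the example.

```python
import os
import re
import tempfile

LOG_PATTERN = re.compile(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+)\s+(.+)')


def parse_logs(lines):
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            timestamp, level, message = match.groups()
            yield {'timestamp': timestamp, 'level': level, 'message': message}


def filter_level(entries, level):
    return (entry for entry in entries if entry['level'] == level)


# Create a small temporary file that plays the role of app.log.
with tempfile.NamedTemporaryFile('w', suffix='.log', delete=False) as tmp:
    tmp.write("2026-04-17 09:00:01 INFO  User alice logged in\n"
              "2026-04-17 09:01:15 ERROR Database connection timeout\n")
    log_path = tmp.name

# The file object is passed straight into the pipeline and read line by line.
with open(log_path, 'r', encoding='utf-8') as log_file:
    error_messages = [e['message']
                      for e in filter_level(parse_logs(log_file), 'ERROR')]

os.remove(log_path)
print(error_messages)
```

The pipeline must be consumed while the file is still open, which is why the list is built inside the with block.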

Memory-efficient iteration with generators
Why load a million rows when you can yield one at a time?

Frequently Asked Questions

Can I iterate a generator multiple times?

No -- generators are single-use. Once exhausted, calling next() always raises StopIteration. If you need to iterate multiple times, either store the values in a list first (items = list(my_gen())), or call the generator function again to create a fresh generator object. This is why generator expressions are usually passed directly to functions like sum() or list() that consume them in one pass.
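The single-use behavior is easy to check directly. The letters generator below is a throwaway example.

```python
def letters():
    yield from "abc"


gen = letters()
first_pass = list(gen)     # consumes every value
second_pass = list(gen)    # the same object has nothing left to give
fresh_pass = list(letters())  # calling the function again starts over

print(first_pass, second_pass, fresh_pass)
```

Note that the second pass fails silently: it returns an empty list rather than raising an error, which is a common source of subtle bugs.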

What is the difference between a generator and an iterator?

Every generator is an iterator, but not every iterator is a generator. An iterator is any object with a __next__() and __iter__() method. A generator is a specific way to create an iterator using a function with yield -- Python automatically implements the iterator protocol for you. Generators are the most convenient way to create iterators in Python.
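For comparison, here is the same countdown written twice: once as a hand-written iterator class and once as a generator function. Both names are illustrative.

```python
class Countdown:
    """Iterator written by hand: both protocol methods are explicit."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


def countdown(start):
    """Generator version: Python supplies __iter__ and __next__."""
    while start > 0:
        yield start
        start -= 1


from_class = list(Countdown(3))
from_generator = list(countdown(3))
gen = countdown(3)
print(from_class, from_generator)
print(iter(gen) is gen, hasattr(gen, '__next__'))
```

The generator version is a fraction of the length because the state tracking and the StopIteration handling are generated automatically.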

What does generator.send() do?

The .send(value) method resumes a generator and sends a value in as the result of the current yield expression. This turns generators into coroutines -- two-way communication channels. It's used in advanced patterns like cooperative multitasking. For most use cases, you'll never need .send() -- standard next() iteration is sufficient.
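A small sketch of .send() in action: a running-average coroutine. The running_average name is hypothetical; the priming call to next() is required before the first send().

```python
def running_average():
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average   # receives whatever .send() passes in
        total += value
        count += 1
        average = total / count


avg = running_average()
next(avg)             # prime the generator: run up to the first yield
a1 = avg.send(10)
a2 = avg.send(20)
a3 = avg.send(30)
print(a1, a2, a3)
```

Each send() both delivers a value into the generator and returns the next yielded value, which is what makes the communication two-way.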

When should I use a generator vs. a list?

Use a generator when: you're processing a large sequence and don't need all values in memory at once, you're reading from a file or network stream, you only need to iterate once, or you're building a pipeline of transformations. Use a list when: you need to index into the sequence, iterate multiple times, get its length with len(), or pass it to code that explicitly expects a list.

Can generators produce infinite sequences?

Yes -- this is one of the most powerful uses of generators. A generator can loop indefinitely, yielding values forever, because it never builds a finite collection in memory. The standard library's itertools.count() and itertools.cycle() are examples of infinite iterators that work the same way. Just make sure your consuming code has a termination condition (like itertools.islice or a loop with a break) to stop pulling values.
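Here is a brief sketch of an infinite generator with both kinds of termination condition. The fibonacci name is just for the example.

```python
from itertools import islice


def fibonacci():
    """Yield Fibonacci numbers forever."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b


# Option 1: take a fixed number of values with islice.
first_ten = list(islice(fibonacci(), 10))

# Option 2: stop with an explicit break.
for value in fibonacci():
    if value > 100:
        first_over_100 = value
        break

print(first_ten)
print(first_over_100)
```

Calling list(fibonacci()) without a limit would never finish, so the limit always belongs in the consuming code.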

Conclusion

Generators are one of Python's most elegant features. In this tutorial, you learned how yield turns a function into a lazy generator, how the generator protocol works with next() and StopIteration, how generator expressions provide compact syntax for simple generators, how yield from delegates to sub-iterables, and how to compose generators into data processing pipelines.

The log analyzer project shows the real-world payoff: a memory-efficient pipeline that scales to gigabyte files with no changes. Try extending it to count errors per hour, find the longest gap between errors, or write the filtered entries to a new file.

For more on generators and iteration in Python, see the Python HOWTO: Generators and the itertools documentation for powerful generator-based utilities.

How To Build a Flask Web Application in Python

Intermediate

You’ve written Python scripts, maybe built some command-line tools, and now you want to build something others can access through a browser or call from a mobile app. Flask is the fastest path from Python knowledge to a working web application. It’s a lightweight web framework that gives you just what you need — routing, templates, request handling, and JSON responses — without the complexity of larger frameworks like Django.

Flask is a third-party package, so you’ll need to install it with pip install flask. Once installed, a minimal Flask app runs in fewer than 10 lines. Flask works great for REST APIs, small web applications, internal tools, and prototyping ideas quickly. When your app grows large and needs built-in admin panels, ORM, and authentication systems, you might switch to Django — but for most projects, Flask’s simplicity is its superpower.

In this tutorial, you’ll build a Flask web application from scratch. You’ll learn how routes map URLs to functions, how to render HTML templates with Jinja2, how to handle GET and POST form submissions, how to return JSON for API endpoints, and how to use Flask’s development server. By the end, you’ll have a working contact book web application that demonstrates all these concepts together.

Flask: Quick Example

Here’s the smallest possible Flask app — a web server that responds to HTTP requests:

# app.py
from flask import Flask

app = Flask(__name__)

@app.route('/')
def home():
    return '<h1>Hello, World!</h1><p>Your Flask app is running.</p>'

@app.route('/about')
def about():
    return '<p>A simple Flask application.</p>'

if __name__ == '__main__':
    app.run(debug=True)

To run it:

$ pip install flask
$ python app.py
 * Running on http://127.0.0.1:5000
 * Debug mode: on

Open http://127.0.0.1:5000 in a browser and you’ll see “Hello, World!”. The @app.route('/') decorator registers the home function as the handler for requests to the root URL. The string you return becomes the HTTP response body. The debug=True option enables the auto-reloader (restarts when you change files) and the interactive debugger — never use this in production.
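To see what "registering a handler" means, here is a toy registry written in plain Python. TinyApp is a made-up class for illustration only; it is not how Flask is implemented internally, and it skips everything a real framework does (HTTP parsing, URL converters, responses).

```python
class TinyApp:
    """Toy stand-in for a web app that maps paths to functions."""

    def __init__(self):
        self.routes = {}

    def route(self, path):
        def decorator(func):
            self.routes[path] = func   # remember which function serves this path
            return func                # the function itself is left unchanged
        return decorator

    def handle(self, path):
        view = self.routes.get(path)
        if view is None:
            return "404 Not Found"
        return view()


app = TinyApp()


@app.route('/')
def home():
    return 'Hello, World!'


print(app.handle('/'))
print(app.handle('/missing'))
```

The key idea carries over: the decorator runs once at import time to record the mapping, and the function is only called later, when a matching request arrives.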

Setting up Flask routes
Route decorators are just URL bouncers for your functions.

What Is Flask and When Should You Use It?

Flask is a “micro” web framework for Python. “Micro” doesn’t mean small or limited — it means Flask provides the core tools (routing, request handling, templates) and lets you add everything else (database, authentication, forms) as separate packages you choose yourself. This makes Flask lightweight, flexible, and easy to understand.

| Feature | Flask | Django | FastAPI |
|---|---|---|---|
| Learning curve | Low | Medium-High | Medium |
| Built-in ORM | No (use SQLAlchemy) | Yes | No |
| Best for | APIs, small apps, prototypes | Large full-stack apps | High-performance APIs |
| Templates | Jinja2 | Django templates | No built-in |
| Async support | Limited (Flask 2.0+) | Limited | First-class |

Flask is the right choice when you want to build something functional quickly without learning a large framework, when you need a REST API backend, or when you want full control over which libraries you use for database access, authentication, and other concerns.

Routes and View Functions

A route maps a URL pattern to a Python function. When a browser requests that URL, Flask calls the function and returns its return value as the HTTP response. Routes can include dynamic segments (variables in the URL) captured with angle brackets:

# routes.py
from flask import Flask, request
from markupsafe import escape

app = Flask(__name__)

# Static route
@app.route('/')
def index():
    return '<h1>Home Page</h1>'

# Dynamic route: captures a string segment
@app.route('/user/<username>')
def user_profile(username):
    # escape() keeps user-supplied text from being interpreted as HTML
    return f'<h1>Profile: {escape(username)}</h1>'

# Dynamic route with type conversion (int only)
@app.route('/post/<int:post_id>')
def show_post(post_id):
    return f'<p>Post ID: {post_id} (type: {type(post_id).__name__})</p>'

# Route that accepts GET and POST
@app.route('/submit', methods=['GET', 'POST'])
def submit():
    if request.method == 'POST':
        return '<p>Form submitted!</p>'
    # Minimal form; the original markup was lost, so this is a placeholder
    return '<form method="post"><button type="submit">Submit</button></form>'

if __name__ == '__main__':
    app.run(debug=True)

Example URLs and responses:

GET /              -> Home Page
GET /user/alice    -> Profile: alice
GET /post/42       -> Post ID: 42 (type: int)
GET /post/abc      -> 404 Not Found (not an int)

The <int:post_id> converter ensures Flask only matches the route when the URL segment is a valid integer, and automatically converts it for you. If someone requests /post/abc, Flask returns a 404 response. Flask also supports <float:> and <path:> converters.

HTML Templates with Jinja2

Returning raw HTML strings from view functions gets unwieldy fast. Flask uses the Jinja2 template engine to render HTML files stored in a templates/ directory. Templates can include Python-like logic: loops, conditionals, and variable interpolation.

Template File Structure

Flask looks for templates in a folder named templates next to your app.py. Create this structure:

# Project structure:
# myapp/
#   app.py
#   templates/
#     base.html
#     index.html
#     user.html

A base template defines the common page structure that other templates extend:

<!-- templates/base.html -->
<!DOCTYPE html>
<html>
<head>
    <title>{% block title %}My App{% endblock %}</title>
</head>
<body>
    <nav><a href="/">Home</a> | <a href="/users">Users</a></nav>
    <main>{% block content %}{% endblock %}</main>
</body>
</html>

<!-- templates/index.html -->
{% extends "base.html" %}

{% block title %}Home - My App{% endblock %}

{% block content %}
<h1>Welcome, {{ name }}!</h1>
<ul>
{% for item in items %}
    <li>{{ item }}</li>
{% endfor %}
</ul>
{% endblock %}

# app_templates.py
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('index.html',
        name='Alice',
        items=['Learn Flask', 'Build an API', 'Deploy to cloud']
    )

if __name__ == '__main__':
    app.run(debug=True)

The render_template() function loads the template file, substitutes the variables you pass as keyword arguments, and returns the resulting HTML string. Jinja2’s {{ variable }} syntax outputs values, while {% for %}, {% if %}, and {% block %} are control structures.

Jinja templates in Flask
Templates turn raw data into something humans want to look at.

Handling Forms and POST Requests

Web forms submit data as HTTP POST requests. Flask's request object gives you access to form data, query parameters, and request headers. Use request.form.get() for form fields rather than request.form[]: indexing a missing field raises a KeyError, which Flask turns into a 400 Bad Request response before your own validation code can run:

# form_handling.py
from flask import Flask, render_template, request, redirect, url_for

app = Flask(__name__)

# In-memory storage (use a database in production)
contacts = []

@app.route('/contacts')
def contact_list():
    return render_template('contacts.html', contacts=contacts)

@app.route('/contacts/add', methods=['GET', 'POST'])
def add_contact():
    if request.method == 'POST':
        name = request.form.get('name', '').strip()
        email = request.form.get('email', '').strip()

        if name and email:  # Basic validation
            contacts.append({'name': name, 'email': email})
            return redirect(url_for('contact_list'))
        else:
            error = 'Both name and email are required.'
            return render_template('add_contact.html', error=error)

    return render_template('add_contact.html', error=None)

The Post/Redirect/Get pattern is critical for form handling: after a successful POST, redirect the user to a GET route. This prevents the “Are you sure you want to resubmit the form?” browser dialog when the user refreshes the page. The url_for('contact_list') call generates the URL for the contact_list view function — use this instead of hardcoding URLs to keep your routes maintainable.
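The redirect can be observed without a browser by using Flask's built-in test client. The sketch below is a trimmed-down variant of the form handler (it returns plain strings instead of rendering templates, which is an assumption made to keep it self-contained) and requires Flask to be installed.

```python
from flask import Flask, redirect, request, url_for

app = Flask(__name__)
contacts = []


@app.route('/contacts')
def contact_list():
    return ', '.join(c['name'] for c in contacts) or 'No contacts yet'


@app.route('/contacts/add', methods=['POST'])
def add_contact():
    name = request.form.get('name', '').strip()
    if not name:
        return 'Name is required.', 400
    contacts.append({'name': name})
    return redirect(url_for('contact_list'))   # POST -> redirect -> GET


client = app.test_client()
response = client.post('/contacts/add', data={'name': 'Alice'})
location = response.headers['Location']
followed = client.get(location)

print(response.status_code)   # the redirect status
print(location)               # where the browser is sent next
print(followed.get_data(as_text=True))
```

Because the browser's last request is the GET to the list page, refreshing repeats only that harmless GET and never the POST.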

Flask: micro-framework. Mighty-results.

Building a JSON API

Flask makes it straightforward to return JSON responses for API endpoints. Use jsonify() to convert Python dicts and lists into properly formatted JSON responses with the correct Content-Type header:

# json_api.py
from flask import Flask, jsonify, request

app = Flask(__name__)

# Simple in-memory task list
tasks = [
    {'id': 1, 'title': 'Learn Flask', 'done': True},
    {'id': 2, 'title': 'Build an API', 'done': False},
]

@app.route('/api/tasks', methods=['GET'])
def get_tasks():
    return jsonify({'tasks': tasks, 'count': len(tasks)})

@app.route('/api/tasks/<int:task_id>', methods=['GET'])
def get_task(task_id):
    task = next((t for t in tasks if t['id'] == task_id), None)
    if task is None:
        return jsonify({'error': 'Task not found'}), 404
    return jsonify(task)

@app.route('/api/tasks', methods=['POST'])
def create_task():
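    # silent=True returns None for a missing or malformed JSON body
    # instead of aborting, so the check below can send our own error
    data = request.get_json(silent=True)
    if not data or 'title' not in data:
        return jsonify({'error': 'title is required'}), 400

    new_task = {
        # default=0 keeps max() from raising ValueError when the list is empty
        'id': max((t['id'] for t in tasks), default=0) + 1,
        'title': data['title'],
        'done': data.get('done', False)
    }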
    tasks.append(new_task)
    return jsonify(new_task), 201

if __name__ == '__main__':
    app.run(debug=True)

Test with curl:

$ curl http://127.0.0.1:5000/api/tasks
{"count": 2, "tasks": [{"done": true, "id": 1, "title": "Learn Flask"}, ...]}

$ curl -X POST http://127.0.0.1:5000/api/tasks \
  -H "Content-Type: application/json" \
  -d '{"title": "Deploy to production"}'
{"done": false, "id": 3, "title": "Deploy to production"}

Return the appropriate HTTP status code as the second element of the tuple your view returns, right after the jsonify() response: 200 for success (the default when omitted), 201 for created, 400 for bad request, 404 for not found. This tells API clients what happened without them needing to parse the response body.

Real-Life Example: Contact Book Web App

Let’s bring everything together in a complete contact book application with a list view, add form, and JSON API endpoint:

# contact_book.py
from flask import Flask, request, redirect, url_for, jsonify
from markupsafe import escape

app = Flask(__name__)

contacts = [
    {'id': 1, 'name': 'Alice Chen', 'email': 'alice@example.com', 'phone': '555-0101'},
    {'id': 2, 'name': 'Bob Smith', 'email': 'bob@example.com', 'phone': '555-0102'},
]
next_id = 3

@app.route('/')
def index():
    search = request.args.get('q', '').lower()
    if search:
        results = [c for c in contacts
                   if search in c['name'].lower() or search in c['email'].lower()]
    else:
        results = contacts

    # Return HTML list (simplified -- in production use render_template)
    # escape() keeps user-supplied values from being interpreted as HTML
    rows = ''.join(
        f'<tr><td>{escape(c["name"])}</td><td>{escape(c["email"])}</td>'
        f'<td>{escape(c["phone"])}</td></tr>'
        for c in results
    )
    return f'''
    <h1>Contact Book</h1>
    <form method="get">
        Search: <input name="q">
        <button type="submit">Search</button>
    </form>
    <table>
        <tr><th>Name</th><th>Email</th><th>Phone</th></tr>
        {rows}
    </table>
    <p><a href="/add">Add Contact</a></p>
    '''

@app.route('/add', methods=['GET', 'POST'])
def add():
    global next_id
    error = None
    if request.method == 'POST':
        name = request.form.get('name', '').strip()
        email = request.form.get('email', '').strip()
        phone = request.form.get('phone', '').strip()
        if not name or not email:
            error = 'Name and email are required.'
        else:
            contacts.append({'id': next_id, 'name': name, 'email': email, 'phone': phone})
            next_id += 1
            return redirect(url_for('index'))
    return f'''
    <h1>Add Contact</h1>
    {'<p>' + error + '</p>' if error else ''}
    <form method="post">
        Name: <input name="name"><br>
        Email: <input name="email"><br>
        Phone: <input name="phone"><br>
        <button type="submit">Save</button>
    </form>
    '''

@app.route('/api/contacts')
def api_contacts():
    return jsonify({'contacts': contacts, 'total': len(contacts)})

if __name__ == '__main__':
    app.run(debug=True, port=5000)

Test the API endpoint:

$ curl http://127.0.0.1:5000/api/contacts
{
  "contacts": [
    {"email": "alice@example.com", "id": 1, "name": "Alice Chen", "phone": "555-0101"},
    {"email": "bob@example.com", "id": 2, "name": "Bob Smith", "phone": "555-0102"}
  ],
  "total": 2
}

This app demonstrates routing, form handling with the Post/Redirect/Get pattern, query parameter reading (request.args.get()), and JSON API responses in a single file. To extend this into a production-ready app, add a database with Flask-SQLAlchemy, use proper Jinja2 templates instead of HTML strings, add Flask-WTF for form validation and CSRF protection, and deploy with Gunicorn behind an Nginx proxy.

Deploying your Flask app
Your Flask app works on localhost. Now make it work everywhere else.

Frequently Asked Questions

Is debug=True safe for production?

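No. Never use debug=True in production. Debug mode enables the Werkzeug interactive debugger, which can let anyone who triggers an exception run arbitrary Python code on your server. In production, run Flask behind a WSGI server such as Gunicorn or uWSGI: gunicorn app:app. Debug mode is off unless you enable it, so remove debug=True from app.run() and leave the FLASK_DEBUG environment variable unset. (The older FLASK_ENV variable was deprecated in Flask 2.2 and removed in Flask 2.3.)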

How do I serve static files (CSS, JS, images) with Flask?

Create a folder named static next to your app.py. Flask automatically serves files from this folder at /static/filename. In templates, use url_for('static', filename='style.css') to generate the correct URL. For production, serve static files directly from Nginx or a CDN for better performance.

How do I connect a database to Flask?

The most common choice is Flask-SQLAlchemy, which wraps SQLAlchemy’s ORM with Flask integration. Install it with pip install flask-sqlalchemy and configure app.config['SQLALCHEMY_DATABASE_URI']. For a simpler option for small apps, use the built-in sqlite3 module directly. For production APIs, consider Flask-SQLAlchemy with PostgreSQL.

How do I handle 404 and 500 errors with custom pages?

Use the @app.errorhandler() decorator: @app.errorhandler(404) on a function that returns a custom error page. The function receives the error object as an argument. Return a tuple of (response, status_code) to ensure the correct HTTP status code is sent: return render_template('404.html'), 404.

What are Flask Blueprints?

Blueprints let you split a large Flask app into smaller, modular components. Each blueprint is like a mini Flask app with its own routes, templates, and static files. For example, you might have an auth blueprint for login/logout routes and a dashboard blueprint for the main app. Register blueprints with app.register_blueprint(auth_bp, url_prefix='/auth'). They’re the standard way to organize larger Flask applications.

Conclusion

Flask makes it possible to go from Python developer to web application builder in a single session. In this tutorial, you covered the core building blocks: creating a Flask app and defining routes with @app.route(), using dynamic URL segments with type converters, rendering HTML with Jinja2 templates, handling GET and POST requests with request.form.get(), implementing the Post/Redirect/Get pattern, and returning JSON responses with jsonify().

The contact book project ties all these concepts together. Extend it by adding Flask-SQLAlchemy for persistent storage, Flask-Login for user authentication, and deploying with Gunicorn on any cloud provider.

For deeper learning, see the official Flask documentation and the Jinja2 template documentation.

Minimal Flask App

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/")
def index():
    return "Hello, Flask"

@app.route("/users/<int:user_id>")
def get_user(user_id):
    return jsonify({"id": user_id, "name": "Alice"})

@app.route("/users", methods=["POST"])
def create_user():
    data = request.get_json()
    return jsonify({"id": 1, **data}), 201

if __name__ == "__main__":
    app.run(debug=True, port=5000)

Routes map URLs to functions. <int:user_id> is a typed path parameter. request.get_json() parses the body. jsonify sets Content-Type to application/json. That’s 80% of a typical Flask app.

Application Factory Pattern

# myapp/__init__.py
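from flask import Flask

# Assumption: a myapp/extensions.py module creates the shared extension
# objects, e.g. db = SQLAlchemy() and mail = Mail(), without binding an app
from myapp.extensions import db, mail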

def create_app(config_name="production"):
    app = Flask(__name__)
    app.config.from_object(f"myapp.config.{config_name.title()}Config")

    # Register blueprints
    from myapp.routes.users import bp as users_bp
    from myapp.routes.posts import bp as posts_bp
    app.register_blueprint(users_bp, url_prefix="/api/users")
    app.register_blueprint(posts_bp, url_prefix="/api/posts")

    # Initialize extensions
    db.init_app(app)
    mail.init_app(app)

    return app

# myapp/routes/users.py
from flask import Blueprint, jsonify

bp = Blueprint("users", __name__)

@bp.route("/")
def list_users():
    return jsonify([])

@bp.route("/<int:user_id>")
def get_user(user_id):
    return jsonify({"id": user_id})

Factory pattern + Blueprints lets you split a large app across files. Each Blueprint registers its own routes; the factory wires them together. Tests can create app instances with custom config.

Templates and Jinja2

{# myapp/templates/users.html #}
<!DOCTYPE html>
<html>
<body>
  <h1>{{ title }}</h1>
  <ul>
    {% for user in users %}
    <li>{{ user.name }} - {{ user.email }}</li>
    {% endfor %}
  </ul>
</body>
</html>

# myapp/routes/users.py
from flask import render_template

@bp.route("/")
def list_users():
    users = User.query.all()
    return render_template("users.html", title="All Users", users=users)

Request Handling

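import os

from flask import jsonify, request
from werkzeug.utils import secure_filename

@app.route("/search")
def search():
    q = request.args.get("q", "")
    limit = request.args.get("limit", 10, type=int)
    return jsonify({"query": q, "limit": limit})

@app.route("/upload", methods=["POST"])
def upload():
    if "file" not in request.files:
        return jsonify({"error": "No file"}), 400
    f = request.files["file"]
    # Never trust the client-supplied filename; strip path components first
    filename = secure_filename(f.filename or "")
    if not filename:
        return jsonify({"error": "Invalid filename"}), 400
    f.save(os.path.join("/uploads", filename))
    return jsonify({"status": "ok"})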

@app.route("/json", methods=["POST"])
def json_endpoint():
    data = request.get_json(silent=True)  # None instead of an automatic error on bad JSON
    if not data:
        return jsonify({"error": "Invalid JSON"}), 400
    return jsonify({"echo": data})

Error Handling

from flask import abort, jsonify
from werkzeug.exceptions import BadRequest, HTTPException

@app.errorhandler(404)
def not_found(error):
    return jsonify({"error": "Not found"}), 404

@app.errorhandler(BadRequest)
def bad_request(error):
    return jsonify({"error": "Bad request"}), 400

@app.errorhandler(Exception)
def handle_exception(error):
    # Let other HTTP errors (405, 415, ...) keep their own status codes
    if isinstance(error, HTTPException):
        return error
    app.logger.exception("Unhandled error")
    return jsonify({"error": "Internal server error"}), 500

# Inside routes
@app.route("/users/<int:user_id>")
def get_user(user_id):
    user = User.query.get(user_id)
    if not user:
        abort(404)
    return jsonify(user.to_dict())

Common Pitfalls

  • Using app.run() in production. The built-in development server is not designed to be secure, stable, or efficient under load. Deploy with Gunicorn or uWSGI.
  • Mutating app.config at request time. Config is read once at startup. Changes mid-request only affect that worker.
  • Flask globals across requests. request, g, session are request-scoped. Don’t reference them at module level — they have no context outside a request.
  • No DB connection pool. Opening a fresh DB connection per request is slow. Use Flask-SQLAlchemy or a connection pool.
  • CSRF on JSON APIs. Flask-WTF’s CSRF protection blocks JSON POSTs by default. Either disable for the API blueprint or use a token-based auth scheme.

FAQ

Q: Flask or FastAPI?
A: Flask for sync web apps with templates. FastAPI for async APIs with Pydantic validation + OpenAPI docs.

Q: How do I deploy Flask?
A: gunicorn (or uwsgi) behind nginx. Don’t expose the dev server to the internet.

Q: Flask-RESTful, Flask-Smorest, or plain Flask?
A: Plain Flask for simple APIs. Flask-Smorest for OpenAPI docs + Marshmallow validation. Flask-RESTful is older and has been mostly replaced.

Q: How do I add database support?
A: Flask-SQLAlchemy for SQL + ORM. Flask-Migrate for migrations (Alembic). Plain SQLAlchemy works too without the Flask wrapper.

Q: Authentication?
A: Flask-Login for session-based, Flask-JWT-Extended for tokens. For OAuth, use Authlib.

Wrapping Up

Flask is the minimalist web framework that scales — start with one route in 10 lines, grow into application factories with Blueprints, deploy with gunicorn. The ecosystem (Flask-SQLAlchemy, Flask-Login, Flask-Migrate) covers nearly every web-app concern. For modern async APIs, FastAPI may be a better fit, but Flask remains the right answer for traditional template-driven web apps.

How To Use Python os Module for File System Operations

Beginner

Every real Python project touches the file system sooner or later — reading config files, creating output directories, scanning folders for data files, or reading environment variables set by deployment tools. Without the right tools, these tasks involve fragile hardcoded paths that break when you move the project to another machine or operating system. Python’s built-in os module gives you a portable, consistent interface to file system operations that works the same on Windows, macOS, and Linux.

The os module is part of Python’s standard library — no installation needed. It provides functions for working with file paths, creating and removing directories, listing directory contents, reading environment variables, and running system-level operations. For modern path handling, Python 3.4+ also offers the pathlib module as an object-oriented alternative — but understanding os is essential because you’ll encounter it in virtually every Python codebase.

In this tutorial, you’ll learn how to navigate and inspect the file system with os.path, create and remove directories, list and filter files, read environment variables, walk directory trees recursively, and apply all of this in a practical file organizer project. By the end, you’ll have a solid foundation for writing scripts that interact reliably with the file system on any platform.

Quick Example: File System Basics

Here’s a quick demonstration of the most commonly used os functions:

# os_quick.py
import os

# Current working directory
cwd = os.getcwd()
print(f"Working directory: {cwd}")

# List files and folders
entries = os.listdir('.')
print(f"Entries in current dir: {len(entries)}")

# Build a cross-platform path
config_path = os.path.join(cwd, 'config', 'settings.json')
print(f"Config path: {config_path}")

# Check what exists
print(f"Path exists: {os.path.exists(config_path)}")
print(f"Is file: {os.path.isfile(config_path)}")
print(f"Is directory: {os.path.isdir(os.path.dirname(config_path))}")

# Read an environment variable with a default
home = os.environ.get('HOME', '/tmp')
print(f"Home directory: {home}")

Output:

Working directory: /home/user/myproject
Entries in current dir: 12
Config path: /home/user/myproject/config/settings.json
Path exists: False
Is file: False
Is directory: False
Home directory: /home/user

Notice that os.path.join() assembles paths using the correct separator for your operating system (forward slash on Unix, backslash on Windows). This is the correct way to build file paths — never concatenate strings with hardcoded slashes.
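To see the separator difference without switching machines, here is a minimal sketch that calls the two platform-specific implementations directly. It assumes nothing beyond the standard library: posixpath and ntpath are the modules that back os.path on Unix-like systems and on Windows respectively. The path components are made up for illustration.

```python
import ntpath      # the os.path implementation used on Windows
import posixpath   # the os.path implementation used on Unix-like systems

parts = ("reports", "2026", "summary.csv")  # hypothetical path components

unix_style = posixpath.join("/home/user", *parts)
windows_style = ntpath.join(r"C:\Users\user", *parts)

print(unix_style)     # /home/user/reports/2026/summary.csv
print(windows_style)  # C:\Users\user\reports\2026\summary.csv
```

In everyday code you would simply call os.path.join() and let Python pick the right implementation for the machine it runs on.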

Working with file paths using os
Hardcoded paths break on every OS except yours.

What Is the os Module?

The os module is Python’s interface to operating system functionality. It abstracts over the differences between Windows, macOS, and Linux, letting you write cross-platform code that works on any system. Think of it as a Python wrapper around the file system commands you’d normally type in a terminal.

Category | Functions | Purpose
Navigation | getcwd(), chdir() | Get or change the current working directory
Listing | listdir(), scandir() | List files and directories
Paths | os.path.join(), os.path.exists() | Build and inspect paths
Directories | mkdir(), makedirs(), rmdir() | Create and remove directories
Files | remove(), rename() | Delete and rename files
Environment | environ, environ.get() | Read environment variables
Walking | walk() | Recursively traverse directories

For the pure path manipulation parts (join, exists, basename, etc.), the newer pathlib module provides a more ergonomic object-oriented interface. But os is still the right tool for operations like walking directories and reading environment variables.

Working with Paths

The os.path submodule contains functions for inspecting and manipulating file paths. These are the most frequently used functions in day-to-day Python scripting.

Core Path Functions

Use these functions to check what exists and extract information from paths without actually opening files:

# path_functions.py
import os

# Build paths portably
base_dir = '/home/user/project'
data_path = os.path.join(base_dir, 'data', 'input.csv')
print(f"Full path: {data_path}")

# Extract parts of a path
print(f"Directory: {os.path.dirname(data_path)}")
print(f"Filename:  {os.path.basename(data_path)}")
name, ext = os.path.splitext(os.path.basename(data_path))
print(f"Name: {name}, Extension: {ext}")

# Expand special shortcuts
home_config = os.path.expanduser('~/.bashrc')
print(f"Home config: {home_config}")

# Absolute path (resolves ./ and ../ references)
relative = '../data/file.txt'
absolute = os.path.abspath(relative)
print(f"Absolute: {absolute}")

# Check file/directory properties
test_path = '/etc/hosts'  # Exists on most Unix systems
print(f"\n/etc/hosts exists: {os.path.exists(test_path)}")
print(f"Is file: {os.path.isfile(test_path)}")
print(f"Is dir:  {os.path.isdir(os.path.dirname(test_path))}")

Output:

Full path: /home/user/project/data/input.csv
Directory: /home/user/project/data
Filename:  input.csv
Name: input, Extension: .csv
Home config: /home/user/.bashrc
Absolute: /home/user/data/file.txt

/etc/hosts exists: True
Is file: True
Is dir:  True

Creating and Managing Directories

Scripts frequently need to create output directories before writing files. The key is using makedirs with exist_ok=True so your script doesn’t crash if the directory already exists:

# directories.py
import os

# Create a single directory
os.makedirs('output', exist_ok=True)
print("Created 'output/' directory (or it already existed)")

# Create nested directories in one call
nested = os.path.join('output', '2026', 'april', 'reports')
os.makedirs(nested, exist_ok=True)
print(f"Created nested directories: {nested}")

# List what we created
for item in os.listdir('output'):
    item_path = os.path.join('output', item)
    kind = 'DIR' if os.path.isdir(item_path) else 'FILE'
    print(f"  [{kind}] {item}")

# Remove an empty directory
os.rmdir(nested)
print("\nRemoved 'reports' directory")
# Note: os.rmdir() only removes EMPTY directories
# For non-empty directories, use shutil.rmtree()

Output:

Created 'output/' directory (or it already existed)
Created nested directories: output/2026/april/reports
  [DIR] 2026

Removed 'reports' directory

Environment variables with os.environ
os.environ is your app’s secret keeper.

Listing and Filtering Files

Two functions list directory contents: os.listdir() returns filenames as strings, while os.scandir() returns DirEntry objects with built-in file metadata — which is more efficient when you need to check whether entries are files or directories.

# list_files.py
import os

# Create some test files to list
os.makedirs('test_dir', exist_ok=True)
for fname in ['report.csv', 'data.json', 'notes.txt', 'archive.zip']:
    open(os.path.join('test_dir', fname), 'w').close()

# listdir: simple string list
all_entries = os.listdir('test_dir')
print("All entries:", all_entries)

# Filter by extension
csv_files = [f for f in all_entries if f.endswith('.csv')]
print("CSV files:", csv_files)

# scandir: more efficient when you need metadata
print("\nUsing scandir:")
with os.scandir('test_dir') as entries:
    for entry in entries:
        if entry.is_file():
            size = entry.stat().st_size
            print(f"  {entry.name} ({size} bytes)")

# Clean up test files
import shutil
shutil.rmtree('test_dir')

Output:

All entries: ['report.csv', 'data.json', 'notes.txt', 'archive.zip']
CSV files: ['report.csv']

Using scandir:
  archive.zip (0 bytes)
  data.json (0 bytes)
  notes.txt (0 bytes)
  report.csv (0 bytes)

Generators: produce one. Don’t materialize the list.
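As a further illustration of why scandir() is handy when you need metadata, here is a small sketch that reports the largest files in a directory. The helper name largest_files and the sample file names are hypothetical; the demo runs in a temporary directory so it leaves nothing behind.

```python
import os
import tempfile

def largest_files(directory, top_n=3):
    """Return (name, size) pairs for the largest regular files in directory."""
    sizes = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # is_file() can usually answer from data cached during the scan
            if entry.is_file(follow_symlinks=False):
                sizes.append((entry.name, entry.stat().st_size))
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)[:top_n]

# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    for name, payload in [("a.txt", "x"), ("b.txt", "xxx"), ("c.txt", "xx")]:
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write(payload)
    os.mkdir(os.path.join(tmp, "subdir"))   # directories are ignored
    result = largest_files(tmp, top_n=2)

print(result)  # [('b.txt', 3), ('c.txt', 2)]
```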

Reading Environment Variables

Environment variables are the standard way to pass configuration to Python scripts — API keys, database URLs, feature flags, and deployment settings. Never hardcode secrets in your code; read them from the environment instead:

# env_vars.py
import os

# Read an environment variable (returns None if not set)
api_key = os.environ.get('MY_API_KEY')
print(f"API key set: {api_key is not None}")

# Read with a default value
debug_mode = os.environ.get('DEBUG', 'false').lower() == 'true'
print(f"Debug mode: {debug_mode}")

# Indexing raises KeyError if the variable is missing; here we catch it and fall back
try:
    db_url = os.environ['DATABASE_URL']
except KeyError:
    print("DATABASE_URL not set -- using default SQLite")
    db_url = 'sqlite:///local.db'

print(f"Database: {db_url}")

# Set an environment variable for child processes
os.environ['APP_ENV'] = 'testing'
print(f"APP_ENV: {os.environ.get('APP_ENV')}")

# List all environment variables (just the keys)
env_keys = sorted(os.environ.keys())
print(f"\nTotal env vars: {len(env_keys)}")

Output:

API key set: False
Debug mode: False
DATABASE_URL not set -- using default SQLite
Database: sqlite:///local.db
APP_ENV: testing

Total env vars: 47

Use os.environ.get('KEY', 'default') for optional settings and os.environ['KEY'] (which raises KeyError if missing) for required settings. This way, your code fails fast with a clear error when required configuration is absent rather than failing later with a cryptic message.
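One way to apply this rule is to gather all settings in a single function at startup. The sketch below is only an illustration: the function name load_settings and the PORT variable are assumptions, and the env parameter exists so the function can be exercised with a plain dict instead of the real environment.

```python
import os

def load_settings(env=None):
    """Build a settings dict; env defaults to os.environ but can be any mapping."""
    env = os.environ if env is None else env
    return {
        # Required: a missing key raises KeyError naming the variable
        "database_url": env["DATABASE_URL"],
        # Optional: fall back to defaults
        "debug": env.get("DEBUG", "false").lower() == "true",
        "port": int(env.get("PORT", "8000")),
    }

settings = load_settings({"DATABASE_URL": "sqlite:///local.db", "DEBUG": "True"})
print(settings)

try:
    load_settings({})          # required variable absent
except KeyError as exc:
    missing = exc.args[0]
    print(f"Missing required setting: {missing}")
```

Calling load_settings() with no argument reads from the real environment, so the same function serves both production code and tests.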

Walking Directory Trees with os.walk

os.walk() recursively traverses a directory tree, yielding a tuple of (dirpath, dirnames, filenames) for every directory it visits. This is invaluable for finding files deep in nested folder structures:

# walk_example.py
import os

def find_files_by_extension(root_dir, extension):
    """Find all files with a given extension in a directory tree."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        # Skip hidden directories (those starting with '.')
        dirnames[:] = [d for d in dirnames if not d.startswith('.')]

        for filename in filenames:
            if filename.endswith(extension):
                full_path = os.path.join(dirpath, filename)
                found.append(full_path)
    return found

# Example: find all Python files in the current directory
python_files = find_files_by_extension('.', '.py')
for f in python_files[:5]:  # Show first 5
    print(f)

print(f"\nTotal .py files found: {len(python_files)}")

Output:

./walk_example.py
./os_quick.py
./path_functions.py

Total .py files found: 3

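The key trick here is dirnames[:] = [...]. Assigning to the slice modifies the list in place, and because os.walk() runs top-down by default (topdown=True), it consults that same list to decide which subdirectories to visit next. This “prune” technique keeps the walker out of directories you don’t want to search. The example above skips hidden directories such as .git; the same pattern works for names like __pycache__ or node_modules.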
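The same in-place pruning can target specific directory names rather than hidden ones. The following sketch assumes a skip list of .git, __pycache__, and node_modules (chosen purely for illustration) and builds a tiny throwaway tree to show the effect.

```python
import os
import tempfile

SKIP_DIRS = {".git", "__pycache__", "node_modules"}  # names chosen for illustration

def walk_pruned(root):
    """Yield file paths relative to root, skipping any directory in SKIP_DIRS."""
    for dirpath, dirnames, filenames in os.walk(root):  # topdown=True by default
        # Slice assignment edits the same list object os.walk() is holding
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for filename in filenames:
            yield os.path.relpath(os.path.join(dirpath, filename), root)

with tempfile.TemporaryDirectory() as tmp:
    for rel in ["src/app.py", "node_modules/pkg/index.js", "src/__pycache__/app.pyc"]:
        path = os.path.join(tmp, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    found = sorted(walk_pruned(tmp))

print(found)  # only the file under src/ survives the pruning
```

Note that a plain assignment such as dirnames = [...] would rebind the local name and leave the walker's list untouched, so nothing would be pruned.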

Walking directory trees with os.walk
os.walk goes deeper than your nested folder habit.

Real-Life Example: File Organizer Script

Let’s build a script that organizes files in a folder by sorting them into subdirectories based on their file extension — the classic “Downloads folder cleaner” project:

# file_organizer.py
import os
import shutil

# Map extensions to folder names
EXTENSION_MAP = {
    '.pdf':  'Documents',
    '.docx': 'Documents',
    '.txt':  'Documents',
    '.jpg':  'Images',
    '.jpeg': 'Images',
    '.png':  'Images',
    '.gif':  'Images',
    '.mp4':  'Videos',
    '.mov':  'Videos',
    '.mp3':  'Audio',
    '.wav':  'Audio',
    '.zip':  'Archives',
    '.tar':  'Archives',
    '.py':   'Code',
    '.js':   'Code',
    '.csv':  'Data',
    '.json': 'Data',
}

def organize_folder(source_dir):
    """Sort files in source_dir into subdirectories by type."""
    moved = 0
    skipped = 0

    for filename in os.listdir(source_dir):
        src_path = os.path.join(source_dir, filename)

        # Skip directories and hidden files
        if os.path.isdir(src_path) or filename.startswith('.'):
            skipped += 1
            continue

        # Get the file extension
        _, ext = os.path.splitext(filename)
        folder_name = EXTENSION_MAP.get(ext.lower(), 'Other')

        # Create destination directory if needed
        dest_dir = os.path.join(source_dir, folder_name)
        os.makedirs(dest_dir, exist_ok=True)

        # Move the file
        dest_path = os.path.join(dest_dir, filename)
        if not os.path.exists(dest_path):  # Don't overwrite
            shutil.move(src_path, dest_path)
            moved += 1
            print(f"  Moved: {filename} -> {folder_name}/")
        else:
            print(f"  Skipped (exists): {filename}")
            skipped += 1

    print(f"\nDone: {moved} files moved, {skipped} skipped.")

# Set up a test directory with sample files
test_dir = 'test_downloads'
os.makedirs(test_dir, exist_ok=True)
for fname in ['report.pdf', 'photo.jpg', 'backup.zip',
              'script.py', 'data.csv', 'video.mp4']:
    open(os.path.join(test_dir, fname), 'w').close()

print(f"Organizing '{test_dir}'...")
organize_folder(test_dir)

# Show result
print("\nResulting structure:")
for item in sorted(os.listdir(test_dir)):
    print(f"  {item}/")

# Clean up test dir
shutil.rmtree(test_dir)

Output:

Organizing 'test_downloads'...
  Moved: backup.zip -> Archives/
  Moved: data.csv -> Data/
  Moved: photo.jpg -> Images/
  Moved: report.pdf -> Documents/
  Moved: script.py -> Code/
  Moved: video.mp4 -> Videos/

Done: 6 files moved, 0 skipped.

Resulting structure:
  Archives/
  Code/
  Data/
  Documents/
  Images/
  Videos/

This project uses os.listdir(), os.path.join(), os.path.isdir(), os.path.splitext(), and os.makedirs() together. The shutil.move() function handles the actual file moving (the shutil module complements os for higher-level file operations). Extend this by reading the extension-to-folder mapping from a JSON config file, adding a dry-run mode that shows what would be moved without doing it, or recursively organizing nested subfolders.
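As a starting point for the dry-run idea, here is one possible sketch. It assumes a separate helper, hypothetically named plan_moves, that applies the same skip rules as organize_folder() but only returns the planned moves instead of touching any files.

```python
import os
import tempfile

def plan_moves(source_dir, extension_map, default_folder="Other"):
    """Return (filename, folder) pairs describing what would be moved."""
    plan = []
    for filename in sorted(os.listdir(source_dir)):
        path = os.path.join(source_dir, filename)
        if os.path.isdir(path) or filename.startswith("."):
            continue  # same skip rule as organize_folder()
        _, ext = os.path.splitext(filename)
        plan.append((filename, extension_map.get(ext.lower(), default_folder)))
    return plan

with tempfile.TemporaryDirectory() as tmp:
    for name in ["photo.JPG", "notes.txt", "unknown.xyz", ".hidden"]:
        open(os.path.join(tmp, name), "w").close()
    plan = plan_moves(tmp, {".jpg": "Images", ".txt": "Documents"})
    remaining = sorted(os.listdir(tmp))  # nothing was moved

for filename, folder in plan:
    print(f"Would move: {filename} -> {folder}/")
```

Separating planning from moving also makes the logic easy to test, because the plan is just a list of tuples.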

Frequently Asked Questions

Should I use os.path or pathlib?

For new code targeting Python 3.4+, pathlib is generally preferred for path operations because it’s more readable and object-oriented. For example, Path.home() / 'data' / 'file.txt' is cleaner than os.path.join(os.path.expanduser('~'), 'data', 'file.txt'). However, os remains necessary for environment variables, os.walk(), and other OS-level operations. In practice, most projects use both.
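For a side-by-side feel of the two styles, the sketch below runs the same operations through both. It uses posixpath and PurePosixPath so the results are identical on any machine; in real code you would normally use os.path and Path. The sample path is hypothetical.

```python
import posixpath
from pathlib import PurePosixPath

raw = "/home/user/project/data/input.csv"   # hypothetical path
p = PurePosixPath(raw)

# os.path style on the left, pathlib style on the right
print(posixpath.dirname(raw), "|", p.parent)      # /home/user/project/data
print(posixpath.basename(raw), "|", p.name)       # input.csv
print(posixpath.splitext(raw)[1], "|", p.suffix)  # .csv

sibling_os = posixpath.join(posixpath.dirname(raw), "output.csv")
sibling_pl = p.with_name("output.csv")
print(sibling_os, "|", sibling_pl)                # /home/user/project/data/output.csv
```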

How do I delete a non-empty directory?

os.rmdir() only removes empty directories and raises OSError if the directory has contents. To delete a directory and all its contents, use shutil.rmtree(path) from the shutil module. Be careful — this is irreversible and doesn’t send files to the Recycle Bin. Always double-check the path before calling rmtree.
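The difference is easy to see in a scratch directory. This sketch creates a small non-empty tree (the directory and file names are made up), shows os.rmdir() refusing to delete it, and then removes it with shutil.rmtree().

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()                       # scratch directory for the demo
target = os.path.join(base, "old_reports")
os.makedirs(os.path.join(target, "2025"))
open(os.path.join(target, "2025", "q1.txt"), "w").close()

try:
    os.rmdir(target)                            # fails: directory is not empty
    rmdir_failed = False
except OSError as exc:
    rmdir_failed = True
    print(f"os.rmdir refused: {exc.strerror}")

shutil.rmtree(target)                           # removes the whole tree
print("Still exists:", os.path.exists(target))  # Still exists: False

shutil.rmtree(base)                             # clean up the scratch directory
```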

How do I get a file’s size and modification time?

Use os.stat(path) to read file metadata. The result has st_size (bytes), st_mtime (last modification time as a Unix timestamp), and other fields. Use datetime.fromtimestamp(os.stat(p).st_mtime) to convert the timestamp to a readable date. When using scandir(), call entry.stat() on the DirEntry: the result is cached on the entry, and on Windows it usually needs no extra system call (on Unix it still makes one the first time).
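Here is a short sketch of that recipe, run against a temporary file so it works anywhere. The file contents are arbitrary, and the printed timestamp will naturally differ on your machine.

```python
import os
import tempfile
from datetime import datetime

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write("hello")
    sample_path = fh.name

info = os.stat(sample_path)
modified = datetime.fromtimestamp(info.st_mtime)   # local time, naive

print(f"Size: {info.st_size} bytes")               # Size: 5 bytes
print(f"Modified: {modified:%Y-%m-%d %H:%M:%S}")

os.remove(sample_path)
```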

How do I rename or move a file?

os.rename(src, dst) renames or moves a file within the same filesystem. If you need to move across filesystems or drives, use shutil.move(src, dst) instead. os.rename() silently overwrites an existing destination file on Unix but raises FileExistsError on Windows. If you want the overwrite on every platform, use os.replace(src, dst); if you want to avoid it, check os.path.exists(dst) first.
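When overwriting is what you actually want, the standard library's os.replace() behaves the same way on Unix and Windows. The sketch below uses made-up file names in a temporary directory to show a new file taking the place of an old one.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "report.tmp")
    dst = os.path.join(tmp, "report.txt")
    with open(dst, "w") as fh:
        fh.write("old")
    with open(src, "w") as fh:
        fh.write("new")

    os.replace(src, dst)      # overwrites dst on both Unix and Windows

    with open(dst) as fh:
        final_content = fh.read()
    src_still_there = os.path.exists(src)

print(final_content, src_still_there)  # new False
```

Writing to a temporary name and then calling os.replace() is a common way to update a file without readers ever seeing a half-written version.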

When should I use os.getcwd() vs os.path.abspath?

os.getcwd() returns the process’s current working directory. Use it when you need to know where the script is running from. os.path.abspath(path) resolves a relative path against the cwd and normalizes any ../ segments. If you’re building paths from a relative reference, use abspath to make them unambiguous.
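The relationship between the two can be checked directly. According to the Python documentation, on most platforms abspath(path) is equivalent to normpath(join(os.getcwd(), path)); this sketch verifies that for a sample relative path (the path itself is hypothetical and does not need to exist).

```python
import os

relative = os.path.join("..", "data", "file.txt")

manual = os.path.normpath(os.path.join(os.getcwd(), relative))
resolved = os.path.abspath(relative)

print(resolved == manual)          # True
print(os.path.isabs(resolved))     # True
```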

Conclusion

You’ve seen the core functions of Python’s os module: getcwd() and chdir() for navigation, the os.path submodule for portable path manipulation, makedirs() and rmdir() for directory management, listdir() and scandir() for listing files, environ.get() for reading environment variables, and walk() for recursive directory traversal. Take the file organizer example and extend it with a JSON config file, a dry-run flag, or a log file that records every file move.

As your projects grow, you’ll find yourself reaching for pathlib for new path-manipulation code but sticking with os for environment variables and walking. Both modules are well-documented in the official Python documentation.

Environment Variables

The os.environ dict gives you access to the process’s environment variables:

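import os

api_key = os.environ.get("API_KEY", "")                       # optional, empty default
db_url = os.environ.get("DATABASE_URL", "sqlite:///dev.db")   # optional, dev default
debug = os.environ.get("DEBUG", "0") == "1"                   # string flag to bool
secret = os.environ["SECRET_KEY"]                             # required: KeyError if unset
os.environ["TZ"] = "UTC"                                      # visible to this process and its children

for k, v in os.environ.items():   # dumps everything, including secrets
    print(k, "=", v)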

Use os.environ.get(name, default) for optional vars, os.environ[name] for required ones — KeyError tells you which variable was missing.

File System Operations

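import os

os.getcwd()                            # current working directory
os.chdir("/tmp")                       # change it
names = os.listdir(".")                # entry names as strings

with os.scandir(".") as entries:       # DirEntry objects with cached metadata
    for entry in entries:
        print(entry.name, entry.is_file(), entry.stat().st_size)

for root, dirs, files in os.walk("/tmp/data"):   # recursive traversal
    for f in files:
        path = os.path.join(root, f)   # full path of each file
        print(path)

os.rename("old.txt", "new.txt")        # rename or move within a filesystem
os.remove("delete-me.txt")             # delete a file
os.makedirs("a/b/c", exist_ok=True)    # create nested directories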

Process Information

import os
os.getpid()
os.getppid()
os.cpu_count()
exit_code = os.system("ls -la")  # use subprocess instead for new code
pid = os.fork()  # Unix only
if pid == 0:
    print("child")
else:
    os.waitpid(pid, 0)

File Permissions and Stats

import os, stat
os.chmod("/tmp/data.txt", 0o644)
os.chmod("/tmp/script.sh", 0o755)
s = os.stat("/tmp/data.txt")
print(s.st_size, s.st_mtime)
print(stat.S_ISDIR(s.st_mode))
os.symlink("/real/path", "/some/link")

Cross-Platform Path Manipulation

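import os.path

os.path.join("/tmp", "data", "file.txt")   # '/tmp/data/file.txt' on Unix
os.path.exists("/tmp/data")                # True or False
os.path.dirname("/tmp/data/file.txt")      # '/tmp/data'
os.path.basename("/tmp/data/file.txt")     # 'file.txt'
os.path.splitext("file.tar.gz")            # ('file.tar', '.gz'): only the last suffix splits off
os.path.abspath(".")                       # absolute form of the current directory
os.path.expanduser("~/docs")               # '~' replaced by the home directory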

Common Pitfalls

  • Using os.system for shell. No output capture, shell-injection-vulnerable. Use subprocess.run instead.
  • Modifying os.environ after subprocess starts. Children inherit at spawn time. Set vars BEFORE the call.
  • Forgetting Windows path separators. Use os.path.join or pathlib. Don’t hardcode /.
  • Mutating os.environ in tests. Tests that change env vars without cleanup poison the test runner.
  • os.walk on huge trees. It visits every directory unless you prune dirnames in place. For simple pattern matching, pathlib’s rglob() is more concise, but it still traverses the whole tree.
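The first two pitfalls can be addressed together. The sketch below, offered as one reasonable approach rather than the only one, runs a child Python process with subprocess.run, passes it a modified copy of the environment before it starts, and captures its output. The variable name APP_ENV is just an example.

```python
import os
import subprocess
import sys

# Copy the environment and add a variable BEFORE spawning the child
child_env = dict(os.environ, APP_ENV="testing")   # APP_ENV is just an example name

completed = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['APP_ENV'])"],
    env=child_env,
    capture_output=True,   # collect stdout/stderr instead of printing them
    text=True,             # decode bytes to str
    check=True,            # raise CalledProcessError on a non-zero exit code
)
child_output = completed.stdout.strip()
print(child_output)        # testing
```

Passing the command as a list means no shell is involved, so untrusted arguments cannot be interpreted as shell syntax. Passing env explicitly also leaves the parent's os.environ untouched, which helps with the test-pollution pitfall.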

FAQ

Q: os vs pathlib?
A: pathlib for new code. os.path when working with code that passes string paths.

Q: os.system vs subprocess?
A: subprocess every time. os.system has no stdin/stdout/stderr control.

Q: Home directory?
A: os.path.expanduser("~") or str(Path.home()).

Q: Username?
A: os.environ.get("USER") or os.environ.get("USERNAME") — Unix vs Windows.

Q: Detect OS?
A: os.name is ‘posix’ on Linux and macOS and ‘nt’ on Windows. Use platform.system() for more detail.

Wrapping Up

The os module covers environment variables, file operations, process information, and system metadata. For modern path manipulation, prefer pathlib; for process management, subprocess; for env vars, os.environ is still the right answer.

Running system commands from Python
Sometimes Python needs to call in the big guns.

How To Use Python datetime Module for Date and Time Operations

Beginner

Dates and times appear in almost every real-world Python project — logging when an event occurred, scheduling tasks, calculating how long something took, or displaying timestamps to users. Without proper tools, working with dates in code becomes a nightmare of string parsing, timezone confusion, and off-by-one-day errors. Python’s built-in datetime module solves all of this cleanly and consistently.

The datetime module is part of Python’s standard library — no installation required. It provides several classes: date for calendar dates, time for time-of-day values, datetime for combined date and time, timedelta for representing durations, and timezone for handling timezone offsets. Together these classes cover the vast majority of date/time tasks you’ll encounter.

In this tutorial, you’ll learn how to create and manipulate date and datetime objects, format dates for display and parse them from strings, do date arithmetic with timedelta, work with timezones, and apply everything in a practical project that calculates age and upcoming birthdays. By the end you’ll handle dates in Python with confidence.

Working with Dates: Quick Example

Here’s a quick working example that covers the most common operations — getting today’s date, formatting it, and calculating a future date:

# datetime_quick.py
from datetime import date, datetime, timedelta

# Today's date
today = date.today()
print("Today:", today)

# Current date and time
now = datetime.now()
print("Now:", now.strftime("%Y-%m-%d %H:%M:%S"))

# Date arithmetic: 30 days from now
future = today + timedelta(days=30)
print("30 days from now:", future)

# Days until end of year
end_of_year = date(today.year, 12, 31)
days_left = (end_of_year - today).days
print(f"Days until end of year: {days_left}")

Output:

Today: 2026-04-17
Now: 2026-04-17 09:23:45
30 days from now: 2026-05-17
Days until end of year: 258

The key insight here is that subtracting two date objects returns a timedelta object, and you access its .days attribute to get the integer count. The strftime method formats datetimes into readable strings — we’ll cover all the format codes later in this tutorial.

What Is the datetime Module?

The datetime module provides classes for working with dates and times in Python. Think of it as Python’s built-in calendar and clock library. Unlike Unix timestamps (which are just large integers), datetime objects are human-readable, support arithmetic, and can be converted to and from formatted strings.

Here’s how the main classes relate to each other:

Class | What It Represents | Example
date | A calendar date (year, month, day) | date(2026, 4, 17)
time | A time of day (hour, minute, second, microsecond) | time(9, 30, 0)
datetime | A specific moment in time (date + time combined) | datetime(2026, 4, 17, 9, 30)
timedelta | A duration or difference between two moments | timedelta(days=7, hours=2)
timezone | A fixed UTC offset for timezone-aware datetimes | timezone(timedelta(hours=5))

In most everyday code, you’ll use date, datetime, and timedelta most often. The timezone class becomes important when your application serves users in multiple regions or interacts with APIs that return UTC timestamps.

datetime basics and creating dates
datetime.now() is only accurate until you deploy to another timezone.

Creating Date and Datetime Objects

There are several ways to create date and datetime objects depending on whether you know the specific values or need the current moment.

Getting the Current Date and Time

The most common starting point is getting today’s date or the current datetime. Use date.today() for just the date, or datetime.now() for date plus time:

# create_dates.py
from datetime import date, datetime

# Just the date (no time component)
today = date.today()
print(f"date.today(): {today}")
print(f"Year: {today.year}, Month: {today.month}, Day: {today.day}")

# Date + time (uses system's local time)
now = datetime.now()
print(f"\ndatetime.now(): {now}")
print(f"Hour: {now.hour}, Minute: {now.minute}, Second: {now.second}")
print(f"Microsecond: {now.microsecond}")

# UTC time (timezone-naive; utcnow() is deprecated since Python 3.12, prefer datetime.now(timezone.utc))
utc_now = datetime.utcnow()
print(f"\ndatetime.utcnow(): {utc_now}")

Output:

date.today(): 2026-04-17
Year: 2026, Month: 4, Day: 17

datetime.now(): 2026-04-17 09:23:45.123456
Hour: 9, Minute: 23, Second: 45
Microsecond: 123456

datetime.utcnow(): 2026-04-17 07:23:45.123456
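If you want the current UTC time with the timezone attached, datetime.now(timezone.utc) is the usual alternative. The short sketch below contrasts a naive and an aware datetime and shows that Python refuses to mix them in arithmetic.

```python
from datetime import datetime, timezone

naive_now = datetime.now()                 # local wall-clock time, tzinfo is None
aware_now = datetime.now(timezone.utc)     # current time with UTC attached

print(naive_now.tzinfo)                    # None
print(aware_now.tzinfo)                    # UTC
print(aware_now.isoformat())               # ends with +00:00

# Mixing the two kinds in arithmetic raises TypeError
try:
    aware_now - naive_now
    mixed_ok = True
except TypeError:
    mixed_ok = False
print("Can subtract naive from aware:", mixed_ok)   # False
```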

Creating Specific Dates

When you need to represent a fixed date (a birthday, a deadline, a historical event), pass the year, month, and day directly to the constructor. The datetime constructor accepts the same arguments plus optional hour, minute, second, and microsecond:

# specific_dates.py
from datetime import date, datetime

# Create a specific date
python_release = date(1991, 2, 20)  # Python's first public release
print(f"Python released: {python_release}")

# Create a specific datetime
meeting = datetime(2026, 5, 1, 14, 30, 0)  # May 1, 2026 at 2:30 PM
print(f"Meeting scheduled: {meeting}")

# Access individual components
print(f"Meeting day of week (0=Mon): {meeting.weekday()}")
print(f"Meeting ISO weekday (1=Mon): {meeting.isoweekday()}")

Output:

Python released: 1991-02-20
Meeting scheduled: 2026-05-01 14:30:00
Meeting day of week (0=Mon): 4
Meeting ISO weekday (1=Mon): 5

The weekday() method returns 0 for Monday through 6 for Sunday. isoweekday() returns 1 for Monday through 7 for Sunday — which one you use depends on your preference and how the day numbering will appear in your output.
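A typical use of weekday() is finding the next occurrence of a given day. The helper below is a hypothetical sketch (the name next_weekday is not part of the datetime module) that uses modular arithmetic on the 0-to-6 numbering.

```python
from datetime import date, timedelta

d = date(2026, 5, 1)                      # a Friday
print(d.weekday(), d.isoweekday())        # 4 5

def next_weekday(start, target):
    """Return the first date strictly after start that falls on target (0=Mon)."""
    days_ahead = (target - start.weekday() - 1) % 7 + 1
    return start + timedelta(days=days_ahead)

following_monday = next_weekday(d, 0)
following_friday = next_weekday(d, 4)
print(following_monday)                   # 2026-05-04
print(following_friday)                   # 2026-05-08
```

The "minus 1, modulo 7, plus 1" step guarantees a result between 1 and 7 days ahead, so asking for the same weekday returns next week's date rather than the start date itself.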

Formatting dates with strftime
strftime gives you dates your users can actually read.

Date Arithmetic with timedelta

One of the most powerful features of the datetime module is the ability to add and subtract time using timedelta objects. A timedelta represents a fixed duration — it can hold days, seconds, and microseconds internally (though you can specify it in any combination of units).

Creating and Using timedelta

Create a timedelta by specifying the duration, then add or subtract it from a date or datetime object:

# timedelta_basics.py
from datetime import date, datetime, timedelta

today = date.today()

# Create timedeltas
one_week = timedelta(weeks=1)
two_days = timedelta(days=2)
ninety_days = timedelta(days=90)

print(f"Today: {today}")
print(f"One week from now: {today + one_week}")
print(f"Two days ago: {today - two_days}")
print(f"90 days from now: {today + ninety_days}")

# Timedelta from subtraction
deadline = date(2026, 12, 31)
days_remaining = deadline - today
print(f"\nDays until Dec 31: {days_remaining.days}")

# Timedelta with hours/minutes (use datetime)
start = datetime(2026, 4, 17, 9, 0, 0)
duration = timedelta(hours=2, minutes=30)
end = start + duration
print(f"\nMeeting start: {start.strftime('%H:%M')}")
print(f"Meeting end:   {end.strftime('%H:%M')}")

Output:

Today: 2026-04-17
One week from now: 2026-04-24
Two days ago: 2026-04-15
90 days from now: 2026-07-16

Days until Dec 31: 258

Meeting start: 09:00
Meeting end:   11:30

When you subtract two dates, Python returns a timedelta object. Access its .days attribute for the integer count of days. For timedeltas involving hours, work with datetime objects instead of bare date objects — date has no concept of hours or minutes.
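One detail worth seeing in action is how a timedelta stores its value. In the sketch below (the timestamps are invented), .seconds holds only the part beyond whole days, while total_seconds() gives the full duration.

```python
from datetime import datetime, timedelta

started = datetime(2026, 4, 17, 9, 0, 0)
finished = datetime(2026, 4, 18, 11, 30, 15)
elapsed = finished - started

print(elapsed)                    # 1 day, 2:30:15
print(elapsed.days)               # 1
print(elapsed.seconds)            # 9015  (only the part beyond whole days)
print(elapsed.total_seconds())    # 95415.0

# Break the full duration into hours, minutes, and seconds
hours, remainder = divmod(int(elapsed.total_seconds()), 3600)
minutes, seconds = divmod(remainder, 60)
summary = f"{hours}h {minutes}m {seconds}s"
print(summary)                    # 26h 30m 15s
```

Reaching for .seconds when you meant total_seconds() is a common source of bugs for durations longer than a day.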

Formatting and Parsing Dates

Dates need to be displayed to users and parsed from user input, config files, API responses, and databases. Python provides strftime for formatting (datetime to string) and strptime for parsing (string to datetime).

Formatting with strftime

The strftime method formats a datetime using format codes. The most important codes to know:

Code | Meaning | Example
%Y | 4-digit year | 2026
%m | 2-digit month (01-12) | 04
%d | 2-digit day (01-31) | 17
%H | Hour (00-23, 24-hr) | 14
%I | Hour (01-12, 12-hr) | 02
%M | Minute (00-59) | 30
%S | Second (00-59) | 00
%p | AM or PM | PM
%A | Full weekday name | Friday
%B | Full month name | April
%Z | Timezone name | UTC

# strftime_examples.py
from datetime import datetime

now = datetime(2026, 4, 17, 14, 30, 0)

print(now.strftime("%Y-%m-%d"))           # ISO format
print(now.strftime("%d/%m/%Y"))           # UK format
print(now.strftime("%B %d, %Y"))          # Human-readable
print(now.strftime("%A, %B %d, %Y"))      # Full weekday
print(now.strftime("%I:%M %p"))           # 12-hour clock
print(now.strftime("%Y-%m-%dT%H:%M:%S"))  # ISO 8601 / API format

Output:

2026-04-17
17/04/2026
April 17, 2026
Friday, April 17, 2026
02:30 PM
2026-04-17T14:30:00
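For the ISO 8601 case specifically, you do not need a format string at all. As a brief sketch, isoformat() produces the same text as the last strftime call above, and fromisoformat() (available since Python 3.7) reads it back.

```python
from datetime import datetime

moment = datetime(2026, 4, 17, 14, 30, 0)

text = moment.isoformat()                 # no format string needed
print(text)                               # 2026-04-17T14:30:00

restored = datetime.fromisoformat(text)   # round-trips the string
print(restored == moment)                 # True
```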

Parsing with strptime

When you receive a date as a string (from a form input, a CSV file, or an API response), use strptime to convert it into a datetime object. The format string must match the input exactly:

# strptime_examples.py
from datetime import datetime

# Parse common date formats
date_str1 = "2026-04-17"
dt1 = datetime.strptime(date_str1, "%Y-%m-%d")
print(f"Parsed ISO: {dt1}")

date_str2 = "April 17, 2026"
dt2 = datetime.strptime(date_str2, "%B %d, %Y")
print(f"Parsed long form: {dt2}")

date_str3 = "17/04/2026 14:30:00"
dt3 = datetime.strptime(date_str3, "%d/%m/%Y %H:%M:%S")
print(f"Parsed with time: {dt3}")

# Now you can do arithmetic on parsed dates
delta = dt1 - dt2  # Both represent 2026-04-17
print(f"Difference: {delta.days} days")

Output:

Parsed ISO: 2026-04-17 00:00:00
Parsed long form: 2026-04-17 00:00:00
Parsed with time: 2026-04-17 14:30:00
Difference: 0 days

A common mistake is mismatching the format string with the actual string. If the format doesn’t match, Python raises a ValueError. Always test your format string against real data before deploying to production.
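
When input can arrive in more than one layout, one way to cope with the ValueError is to try a short list of formats in turn. The sketch below is only an illustration: the KNOWN_FORMATS list and the parse_date helper are hypothetical names, not part of the datetime module.

```python
# parse_flexible.py
from datetime import datetime

# Hypothetical list of formats this sketch is willing to accept
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"]

def parse_date(text):
    """Try each known format in turn; return None if none of them match."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # This format did not match, try the next one
    return None

print(parse_date("2026-04-17"))      # 2026-04-17 00:00:00
print(parse_date("17/04/2026"))      # 2026-04-17 00:00:00
print(parse_date("April 17, 2026"))  # 2026-04-17 00:00:00
print(parse_date("17th of April"))   # None
```

Returning None keeps the caller in charge of what to do with unparseable input; raising a custom error would be an equally reasonable choice.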

Working with timezones in Python
UTC is the one true timezone. Everything else is just an opinion.

Working with Timezones

By default, datetime objects created with datetime.now() are “naive” — they have no timezone information. This is fine for local scripts, but problematic for applications that serve global users or interact with APIs. Python’s timezone class provides simple fixed-offset timezone support:

# timezone_examples.py
from datetime import datetime, timezone, timedelta

# UTC-aware datetime
utc_now = datetime.now(timezone.utc)
print(f"UTC now: {utc_now}")
print(f"UTC offset: {utc_now.utcoffset()}")

# Create specific timezone offsets
eastern = timezone(timedelta(hours=-5))   # EST (UTC-5)
india = timezone(timedelta(hours=5, minutes=30))  # IST (UTC+5:30)

# Convert UTC to other timezones
eastern_time = utc_now.astimezone(eastern)
india_time = utc_now.astimezone(india)

print(f"\nEastern time: {eastern_time.strftime('%Y-%m-%d %H:%M %Z')}")
print(f"India time:   {india_time.strftime('%Y-%m-%d %H:%M %Z')}")

# Compare aware datetimes
is_same = eastern_time == india_time
print(f"\nSame moment? {is_same}")  # True -- same instant, different display

Output:

UTC now: 2026-04-17 07:23:45.123456+00:00
UTC offset: 0:00:00

Eastern time: 2026-04-17 02:23 UTC-05:00
India time:   2026-04-17 12:53 UTC+05:30

Same moment? True

For production applications with complex timezone requirements (daylight saving time, historical timezone data), use the zoneinfo module (Python 3.9+), or the third-party pytz library on older versions. Both provide IANA timezone database support.

Real-Life Example: Birthday Countdown Calculator

Let’s build a practical birthday calculator that tells you a person’s current age, how many days until their next birthday, and what day of the week it falls on:

# birthday_calculator.py
from datetime import date

def calculate_age(birthdate):
    """Calculate age in years from a birthdate."""
    today = date.today()
    age = today.year - birthdate.year
    # Adjust if birthday hasn't occurred yet this year
    if (today.month, today.day) < (birthdate.month, birthdate.day):
        age -= 1
    return age

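def days_until_birthday(birthdate):
    """Return days until next birthday and the date it falls on."""
    today = date.today()

    def birthday_in(year):
        # replace() raises ValueError for Feb 29 in a non-leap year.
        # Assumption: such birthdays are celebrated on Feb 28 instead.
        try:
            return birthdate.replace(year=year)
        except ValueError:
            return birthdate.replace(year=year, day=28)

    # Next birthday this year
    next_birthday = birthday_in(today.year)

    # If birthday already passed this year, use next year
    if next_birthday < today:
        next_birthday = birthday_in(today.year + 1)

    days_left = (next_birthday - today).days
    return days_left, next_birthday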

def birthday_report(name, birthdate_str):
    """Print a full birthday report for a person."""
    birthdate = date.fromisoformat(birthdate_str)  # Parses YYYY-MM-DD

    age = calculate_age(birthdate)
    days_left, next_bday = days_until_birthday(birthdate)

    weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                'Friday', 'Saturday', 'Sunday']
    bday_weekday = weekdays[next_bday.weekday()]

    print(f"--- Birthday Report for {name} ---")
    print(f"Birthdate:  {birthdate.strftime('%B %d, %Y')}")
    print(f"Age:        {age} years old")
    print(f"Next bday:  {next_bday.strftime('%B %d, %Y')} ({bday_weekday})")
    print(f"Countdown:  {days_left} days to go")
    if days_left == 0:
        print("           ** Happy Birthday! **")
    print()

birthday_report("Alice", "1990-07-15")
birthday_report("Bob", "1985-04-20")
birthday_report("Carol", "2000-12-31")

Output:

--- Birthday Report for Alice ---
Birthdate:  July 15, 1990
Age:        35 years old
Next bday:  July 15, 2026 (Wednesday)
Countdown:  89 days to go

--- Birthday Report for Bob ---
Birthdate:  April 20, 1985
Age:        40 years old
Next bday:  April 20, 2026 (Monday)
Countdown:  3 days to go

--- Birthday Report for Carol ---
Birthdate:  December 31, 2000
Age:        25 years old
Next bday:  December 31, 2026 (Thursday)
Countdown:  258 days to go

This project demonstrates several key concepts: using date.fromisoformat() to parse ISO-formatted date strings (a cleaner alternative to strptime for YYYY-MM-DD), using replace() to adjust a date's year while keeping month and day, subtracting dates to get day counts, and the subtle off-by-one logic needed to correctly calculate age. You could extend this by reading birthdays from a CSV file or sending email notifications when a birthday is approaching.

Calculating date differences with timedelta
timedelta does the calendar math so you don't have to count on your fingers.

Frequently Asked Questions

How do I compare two dates in Python?

Use standard comparison operators (<, >, ==, <=, >=) directly on date or datetime objects. For example, if date1 < date2: works exactly as you'd expect. Just make sure both objects are the same kind: ordering a naive datetime against a timezone-aware one (or a date against a datetime) raises a TypeError, while == between a naive and an aware datetime simply returns False.

How do I convert a Unix timestamp to a datetime?

Use datetime.fromtimestamp(ts) to convert a Unix timestamp (seconds since epoch) to a naive datetime in local time. For UTC, prefer the timezone-aware form datetime.fromtimestamp(ts, tz=timezone.utc); the older datetime.utcfromtimestamp(ts) returns a naive datetime and is deprecated as of Python 3.12. To go the other direction, call dt.timestamp() on any datetime object (naive datetimes are interpreted as local time).
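
As a quick illustration of the round trip between timestamps and aware datetimes, here is a minimal sketch. The variable names are arbitrary, and the timestamp 0 is used only because its UTC value is easy to verify.

```python
# timestamp_roundtrip.py
from datetime import datetime, timezone

ts = 0  # The Unix epoch itself
epoch_utc = datetime.fromtimestamp(ts, tz=timezone.utc)
print(epoch_utc)              # 1970-01-01 00:00:00+00:00
print(epoch_utc.timestamp())  # 0.0

# Aware datetime -> timestamp -> aware datetime
dt = datetime(2026, 4, 17, 14, 30, tzinfo=timezone.utc)
round_trip = datetime.fromtimestamp(dt.timestamp(), tz=timezone.utc)
print(round_trip == dt)       # True
```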

What is the easiest way to get an ISO 8601 formatted date string?

Call .isoformat() on any date or datetime object. This returns a standard ISO 8601 string like "2026-04-17" for dates or "2026-04-17T14:30:00" for datetimes. To parse ISO strings back into date objects, use date.fromisoformat() or datetime.fromisoformat() -- both were added in Python 3.7.

How do I find the first or last day of a month?

For the first day, use dt.replace(day=1). For the last day, use the calendar module: import calendar; last_day = calendar.monthrange(year, month)[1]. Then create the date with date(year, month, last_day). This handles the varying lengths of months (and leap years) correctly.
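
The two steps can be wrapped in a small helper. This is a sketch only; month_bounds is a hypothetical name chosen for this example.

```python
# month_bounds.py
import calendar
from datetime import date

def month_bounds(d):
    """Return (first_day, last_day) of the month that contains d."""
    first = d.replace(day=1)
    # monthrange returns (weekday of first day, number of days in month)
    last_day = calendar.monthrange(d.year, d.month)[1]
    return first, d.replace(day=last_day)

first, last = month_bounds(date(2026, 4, 17))
print(first, last)  # 2026-04-01 2026-04-30

leap_first, leap_last = month_bounds(date(2024, 2, 10))
print(leap_last)    # 2024-02-29
```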

How do I measure elapsed time in seconds or milliseconds?

Subtract two datetime objects to get a timedelta, then call .total_seconds() on the result. For example: elapsed = (end_time - start_time).total_seconds(). For high-precision timing of code execution, use time.perf_counter() from the time module instead -- it's designed for benchmarking with sub-millisecond precision.

Conclusion

The datetime module gives you everything you need to work with dates and times in Python without installing third-party libraries. In this tutorial, you learned how to create date and datetime objects with date.today(), datetime.now(), and constructor calls; do date arithmetic using timedelta; format datetimes into strings with strftime and its format codes; parse date strings back to datetime objects with strptime and fromisoformat; and handle timezones with the built-in timezone class.

The birthday calculator project shows how these pieces fit together in a real application. Try extending it: read birthdays from a CSV file using the csv module, sort the list by upcoming birthday, or send a Telegram notification when a birthday is fewer than 7 days away.

For full documentation and additional classes, see the official Python datetime documentation. For complex timezone requirements, explore the zoneinfo module added in Python 3.9.

How To Use Python Regular Expressions with the re Module

Intermediate

Some text problems are impossible to solve with split(), replace(), and in checks. Extracting all email addresses from a document. Validating that a phone number matches any of fifteen regional formats. Finding every date that appears in a 10,000-line log file. These are pattern-matching problems, and regular expressions — regex — are built exactly for them. Once you understand regex, a problem that would take 50 lines of string manipulation collapses into a single well-crafted pattern.

Python’s re module is built into the standard library and provides a full regex engine. You write a pattern that describes what you’re looking for, and re finds it — in strings of any length, with any number of matches, extracted as individual strings or as named groups. No installation required.

In this article we’ll cover the essential regex syntax (character classes, quantifiers, anchors, groups), the five core re functions (match, search, findall, sub, split), named groups and compiled patterns, lookaheads and lookbehinds, common real-world patterns (email, phone, URL, date), and a practical log file parser. By the end, you’ll be able to write and read regex confidently for most everyday text parsing tasks.

Python Regex: Quick Example

Here’s how to extract all email addresses from a block of text in two lines:

# quick_regex.py
import re

text = "Contact us at support@example.com or sales@company.org for help. Spam: fake@.com"

emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
print(emails)

Output:

['support@example.com', 'sales@company.org']

re.findall() returns a list of all non-overlapping matches. The pattern [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} matches the local part of an email ([a-zA-Z0-9._%+-]+), then @, then a domain ([a-zA-Z0-9.-]+), then a dot (\.), then a TLD of 2+ letters ([a-zA-Z]{2,}). Notice fake@.com wasn’t matched — the domain part requires at least one character before the dot.

What Are Regular Expressions?

A regular expression is a sequence of characters that defines a search pattern. The pattern \d{3}-\d{4} matches “555-1234” — exactly three digits, a hyphen, and four digits. Patterns can describe fixed strings, ranges of characters, repetition, alternatives, and complex structures like “a word followed by a number followed by an optional suffix.”

Pattern   Matches                                     What It Means
.         Any character except newline                Wildcard
\d        Any digit (0-9)                             Digit shorthand
\w        Word character (a-z, A-Z, 0-9, _)           Word char shorthand
\s        Whitespace (space, tab, newline)            Space shorthand
^         Start of string (or line with MULTILINE)    Anchor
$         End of string (or line with MULTILINE)      Anchor
[abc]     Any of a, b, or c                           Character class
[^abc]    Any character NOT a, b, or c                Negated class
+         One or more of the preceding                Quantifier
*         Zero or more of the preceding               Quantifier
?         Zero or one (optional)                      Quantifier
{n,m}     Between n and m repetitions                 Quantifier
a|b       Either a or b                               Alternation
(abc)     Capture group                               Grouping

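Always use raw strings (r'...') for regex patterns in Python. Without the r prefix, Python processes backslash escapes before the regex engine ever sees the pattern: '\b' becomes a backspace character instead of a word boundary, and unrecognized escapes such as '\d' only survive by accident (recent Python versions emit a warning for them). Without raw strings you would have to double every backslash (\\d, \\w, \\b). Raw strings pass the backslash through unchanged, making patterns cleaner and less error-prone.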
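
A small sketch of why the r prefix matters, using \b (word boundary in regex, backspace in an ordinary string literal) as the example. The sample text is made up for illustration.

```python
# raw_strings.py
import re

print(len('\b'))   # 1: a single backspace character
print(len(r'\b'))  # 2: a backslash followed by the letter b

text = 'cat concatenate cat'
raw_hits = re.findall(r'\bcat\b', text)    # word-boundary match
plain_hits = re.findall('\bcat\b', text)   # looks for backspace + cat + backspace

print(raw_hits)    # ['cat', 'cat']
print(plain_hits)  # []
```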

Pattern matching with re
Pattern matching is detective work for your data.

The Five Core re Functions

re.match() — Match at the Start

re.match() only checks for a match at the very beginning of the string. It’s useful for validating format when you expect the string to start with a specific pattern.

# re_match.py
import re

# Only matches if the pattern is at the START of the string
result = re.match(r'\d{4}-\d{2}-\d{2}', '2026-04-16 09:00:00')
if result:
    print('Matched date:', result.group())
else:
    print('No match')

# Does NOT match -- pattern not at start
result2 = re.match(r'\d{4}-\d{2}-\d{2}', 'Log entry: 2026-04-16')
print('Match with prefix:', result2)  # None

Output:

Matched date: 2026-04-16
Match with prefix: None

re.search() — Find the First Match

re.search() scans the entire string and returns the first match wherever it appears. This is the function to use when you’re looking for a pattern that might appear anywhere in the text.

# re_search.py
import re

log_line = 'ERROR 2026-04-16 09:23:45 - Connection timeout on port 5432'

# Find the timestamp anywhere in the string
ts_match = re.search(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', log_line)
if ts_match:
    print('Timestamp found:', ts_match.group())
    print('At position:', ts_match.start(), 'to', ts_match.end())

# Find the port number
port_match = re.search(r'port (\d+)', log_line)
if port_match:
    print('Port:', port_match.group(1))  # group(1) = first capture group

Output:

Timestamp found: 2026-04-16 09:23:45
At position: 6 to 25
Port: 5432

The match.group() method returns the full matched string. match.group(1) returns the first capture group (the content inside the first set of parentheses). match.start() and match.end() give you the character positions of the match in the original string.

re.findall() — Find All Matches

re.findall() returns a list of all non-overlapping matches. If the pattern has no groups, it returns a list of matched strings. If it has one group, it returns the group contents. If it has multiple groups, it returns a list of tuples.

# re_findall.py
import re

text = '''
Server logs for 2026-04-16:
  192.168.1.10 -> request 200 OK
  10.0.0.5 -> request 404 Not Found
  172.16.0.1 -> request 200 OK
  192.168.1.10 -> request 500 Internal Server Error
'''

# Find all IP addresses
ips = re.findall(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b', text)
print('IPs found:', ips)

# Find all status codes
codes = re.findall(r'request (\d{3})', text)
print('Status codes:', codes)

# Count 404s and 500s
errors = [c for c in codes if c.startswith(('4', '5'))]
print('Error responses:', len(errors))

Output:

IPs found: ['192.168.1.10', '10.0.0.5', '172.16.0.1', '192.168.1.10']
Status codes: ['200', '404', '200', '500']
Error responses: 2
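
The way the return shape of re.findall() changes with the number of capture groups is easiest to see side by side. A minimal sketch, with a made-up input string:

```python
# findall_groups.py
import re

text = 'alice=30, bob=25, carol=41'

no_groups = re.findall(r'\w+=\d+', text)       # whole matches
one_group = re.findall(r'\w+=(\d+)', text)     # contents of the single group
two_groups = re.findall(r'(\w+)=(\d+)', text)  # one tuple per match

print(no_groups)   # ['alice=30', 'bob=25', 'carol=41']
print(one_group)   # ['30', '25', '41']
print(two_groups)  # [('alice', '30'), ('bob', '25'), ('carol', '41')]
```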

re.sub() — Replace Matches

re.sub() replaces all occurrences of a pattern with a replacement string or the result of a function. This is the regex-powered version of str.replace().

# re_sub.py
import re

# Normalize phone numbers to a consistent format
phones = [
    'Call us: (02) 9876-5432',
    'Mobile: 0412 345 678',
    'Fax: 02-9876-5432',
]

for phone_text in phones:
    # Strip every non-digit character, leaving only the phone number's digits
    digits_only = re.sub(r'\D', '', phone_text)
    print(f'{phone_text:30} -> {digits_only}')

# Redact sensitive data: replace card numbers
text = 'Card: 4532-1234-5678-9012, expires 04/28'
redacted = re.sub(r'\d{4}-\d{4}-\d{4}-\d{4}', '[REDACTED]', text)
print('\nRedacted:', redacted)

Output:

Call us: (02) 9876-5432        -> 0298765432
Mobile: 0412 345 678           -> 0412345678
Fax: 02-9876-5432              -> 0298765432

Redacted: Card: [REDACTED], expires 04/28
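
The replacement argument of re.sub() can also be a function that receives each match object and returns the replacement text. The sketch below masks all but the last four digits of a card number; mask_card is a hypothetical helper written for this example.

```python
# re_sub_function.py
import re

def mask_card(match):
    """Keep the last four digits of a matched card number, mask the rest."""
    return '****-****-****-' + match.group(4)

text = 'Card: 4532-1234-5678-9012, expires 04/28'
masked = re.sub(r'(\d{4})-(\d{4})-(\d{4})-(\d{4})', mask_card, text)
print(masked)  # Card: ****-****-****-9012, expires 04/28
```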
Regex substitution and replacement
re.sub replaces what re.search finds. Division of labor.
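
The fifth function named in the introduction, re.split(), splits a string wherever a pattern matches, which helps when the separator is not one fixed character. A minimal sketch with made-up input:

```python
# re_split.py
import re

# Split on any run of commas, semicolons, or whitespace
line = 'apple, banana;cherry   date'
parts = re.split(r'[,;\s]+', line)
print(parts)  # ['apple', 'banana', 'cherry', 'date']

# maxsplit limits how many splits are performed
pair = re.split(r'\s*=\s*', 'key = value = more', maxsplit=1)
print(pair)   # ['key', 'value = more']
```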

Named Groups and Compiled Patterns

For complex patterns you’ll reuse frequently, named capture groups make the code self-documenting. Instead of match.group(1), you write match.group('year'). Compiling the pattern once with re.compile() gives you a reusable pattern object and skips the per-call cache lookup that the module-level functions perform, which adds up in tight loops.

# named_groups.py
import re

# Compile a pattern with named groups
log_pattern = re.compile(
    r'(?P<level>DEBUG|INFO|WARNING|ERROR|CRITICAL)\s+'
    r'(?P<date>\d{4}-\d{2}-\d{2})\s+'
    r'(?P<time>\d{2}:\d{2}:\d{2})\s+-\s+'
    r'(?P<message>.+)'
)

log_lines = [
    'INFO 2026-04-16 09:00:01 - Application started',
    'WARNING 2026-04-16 09:15:32 - High memory usage: 85%',
    'ERROR 2026-04-16 09:23:45 - Database connection failed',
]

for line in log_lines:
    m = log_pattern.match(line)
    if m:
        print(f"Level:   {m.group('level')}")
        print(f"Time:    {m.group('date')} {m.group('time')}")
        print(f"Message: {m.group('message')}")
        print()

Output:

Level:   INFO
Time:    2026-04-16 09:00:01
Message: Application started

Level:   WARNING
Time:    2026-04-16 09:15:32
Message: High memory usage: 85%

Level:   ERROR
Time:    2026-04-16 09:23:45
Message: Database connection failed

Named groups use the syntax (?P<name>pattern). The ?P<name> form originated in Python (the P stands for “Python extension”), though some other regex engines now accept it too. You can also get all named groups as a dict via match.groupdict(), which returns {'level': 'INFO', 'date': '2026-04-16', ...}. This is very useful for feeding parsed log data into data structures.
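
A short sketch of groupdict() in use, reusing the same log pattern as the example above so that the snippet runs on its own:

```python
# groupdict_example.py
import re

log_pattern = re.compile(
    r'(?P<level>DEBUG|INFO|WARNING|ERROR|CRITICAL)\s+'
    r'(?P<date>\d{4}-\d{2}-\d{2})\s+'
    r'(?P<time>\d{2}:\d{2}:\d{2})\s+-\s+'
    r'(?P<message>.+)'
)

m = log_pattern.match('ERROR 2026-04-16 09:23:45 - Database connection failed')
record = m.groupdict()  # One dict entry per named group
print(record['level'], '|', record['message'])
# ERROR | Database connection failed
```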

Real-Life Example: Server Log Analyzer

Validating data with regex
When your input data has trust issues, regex is the bouncer.

Here’s a complete log file analyzer that parses Apache/nginx-style access logs, extracts metrics, and generates a summary report.

# log_analyzer.py
import re
from collections import Counter

# Sample nginx-style access log data
ACCESS_LOG = """
192.168.1.10 - - [16/Apr/2026:09:00:01 +0000] "GET /index.html HTTP/1.1" 200 1234
10.0.0.5 - - [16/Apr/2026:09:00:05 +0000] "GET /api/users HTTP/1.1" 200 892
192.168.1.10 - - [16/Apr/2026:09:01:12 +0000] "POST /api/login HTTP/1.1" 401 145
10.0.0.7 - - [16/Apr/2026:09:02:30 +0000] "GET /images/logo.png HTTP/1.1" 200 45678
192.168.1.25 - - [16/Apr/2026:09:03:11 +0000] "GET /api/data HTTP/1.1" 500 312
10.0.0.5 - - [16/Apr/2026:09:04:00 +0000] "GET /api/users/42 HTTP/1.1" 200 456
192.168.1.10 - - [16/Apr/2026:09:05:22 +0000] "DELETE /api/users/42 HTTP/1.1" 403 88
10.0.0.7 - - [16/Apr/2026:09:06:45 +0000] "GET /api/data HTTP/1.1" 200 789
""".strip()

# Compile the access log pattern
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) HTTP/[\d.]+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)'
)

def analyze_logs(log_text):
    """Parse log entries and return summary statistics."""
    ip_counter = Counter()
    status_counter = Counter()
    path_counter = Counter()
    method_counter = Counter()
    total_bytes = 0
    errors = []

    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue

        m = LOG_PATTERN.match(line)
        if not m:
            errors.append(f'Could not parse: {line}')
            continue

        ip_counter[m.group('ip')] += 1
        status_counter[m.group('status')] += 1
        path_counter[m.group('path')] += 1
        method_counter[m.group('method')] += 1
        total_bytes += int(m.group('bytes'))

    return {
        'total_requests': sum(ip_counter.values()),
        'unique_ips': len(ip_counter),
        'top_ips': ip_counter.most_common(3),
        'status_codes': dict(sorted(status_counter.items())),
        'top_paths': path_counter.most_common(3),
        'methods': dict(method_counter),
        'total_bytes': total_bytes,
        'parse_errors': errors
    }

stats = analyze_logs(ACCESS_LOG)

print(f'Total Requests:  {stats["total_requests"]}')
print(f'Unique IPs:      {stats["unique_ips"]}')
print(f'Total Data:      {stats["total_bytes"] / 1024:.1f} KB')
print(f'\nStatus Codes:')
for code, count in stats['status_codes'].items():
    label = 'OK' if code.startswith('2') else 'ERR' if code.startswith('5') else ''
    print(f'  {code}: {count:3}  {label}')
print(f'\nTop IPs:')
for ip, count in stats['top_ips']:
    print(f'  {ip:15} {count} requests')
print(f'\nHTTP Methods: {stats["methods"]}')
if stats['parse_errors']:
    print(f'\nParse errors: {len(stats["parse_errors"])}')

Output:

Total Requests:  8
Unique IPs:      4
Total Data:      48.4 KB

Status Codes:
  200:   5  OK
  401:   1
  403:   1
  500:   1  ERR

Top IPs:
  192.168.1.10    3 requests
  10.0.0.5        2 requests
  10.0.0.7        2 requests

HTTP Methods: {'GET': 6, 'POST': 1, 'DELETE': 1}

The compiled LOG_PATTERN with named groups is the heart of this analyzer: it extracts all six fields (ip, datetime, method, path, status, bytes) from each log line in a single match() call. Calling re.compile() once outside the loop means the pattern is parsed only once, which matters when processing millions of log lines.

Frequently Asked Questions

What is greedy vs non-greedy matching?

By default, quantifiers like + and * are greedy — they match as much text as possible. re.search(r'<.+>', '<b>text</b>') matches the entire string <b>text</b>, not just <b>. Add ? after the quantifier to make it non-greedy (lazy): r'<.+?>' matches <b> and stops. Use non-greedy quantifiers when you want the shortest possible match between two delimiters.
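
The difference is easy to see with re.findall() on a string that contains several tags. A minimal sketch with a made-up HTML fragment:

```python
# greedy_vs_lazy.py
import re

html = '<b>bold</b> and <i>italic</i>'

greedy = re.findall(r'<.+>', html)   # .+ runs to the LAST > in the string
lazy = re.findall(r'<.+?>', html)    # .+? stops at the FIRST > it can

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```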

How do I match across multiple lines?

By default, . doesn’t match newlines. Pass re.DOTALL as a flag to make . match any character including newlines: re.search(r'START.+END', text, re.DOTALL). For patterns where ^ and $ should match line boundaries (not just string boundaries), use re.MULTILINE. Both flags can be combined: re.DOTALL | re.MULTILINE.
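
A minimal sketch of both flags on a small multi-line string (the sample text is made up for illustration):

```python
# multiline_flags.py
import re

text = 'START first\nsecond END\nERROR disk full\nINFO ok\nERROR timeout'

# Without DOTALL, . cannot cross the newline between START and END
no_flag = re.search(r'START.+END', text)
print(no_flag)  # None

# With DOTALL, . matches newlines too
block = re.search(r'START.+END', text, re.DOTALL).group()
print(repr(block))  # 'START first\nsecond END'

# With MULTILINE, ^ and $ match at the start and end of every line
errors = re.findall(r'^ERROR (.+)$', text, re.MULTILINE)
print(errors)  # ['disk full', 'timeout']
```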

When should I use re.compile()?

Use re.compile() when you’re calling the same pattern multiple times — in a loop, or in a function that’s called repeatedly. Compiled patterns cache the parsed regex, avoiding redundant work. For one-off searches in simple scripts, the module-level functions (re.search(), re.findall(), etc.) are fine — they also cache internally under the hood.

How do I match literal special characters like dot or parenthesis?

Escape them with a backslash: \. matches a literal dot (not the “any character” wildcard), \( matches a literal opening parenthesis. In raw strings, that’s r'\.' and r'\('. Common characters that need escaping: . ^ $ * + ? { } [ ] \ | ( ). Use re.escape(your_string) to automatically escape all special characters in a variable you want to match literally.
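
A short sketch of re.escape() applied to a string that happens to contain several special characters. The sample strings are invented for this example.

```python
# re_escape.py
import re

needle = '$9.99 (sale)'        # Contains $, ., ( and ) which are all special
pattern = re.escape(needle)    # Backslash-escapes the special characters

text = 'Now only $9.99 (sale) while stocks last; was $9x99 (sale)'
hits = re.findall(pattern, text)
print(hits)  # ['$9.99 (sale)']
```

Because the dot is escaped, the near-miss "$9x99 (sale)" is not matched.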

My regex is running slowly. What can I do?

Several patterns cause catastrophic backtracking: nested quantifiers like (a+)+, alternations with overlapping patterns, or very long strings with no match. Solutions: compile the pattern once with re.compile(), use anchors (^ and $) to limit where Python searches, make quantifiers as specific as possible (use [a-z]+ instead of .+ when you know the character set), and test with tools like regex101.com which shows match steps and warnings about slow patterns.

re.fullmatch beats re.match. Almost always.

Conclusion

Python’s re module gives you a full regex engine for any text processing challenge. We covered the core syntax (character classes, quantifiers, anchors, groups), the five main functions (match, search, findall, sub, split), named capture groups for readable code, compiled patterns for performance, greedy vs non-greedy matching, and a complete server log analyzer. Regular expressions have a reputation for being hard to read, but well-named groups and small, focused patterns keep them maintainable.

Extend the log analyzer to write hourly request rate breakdowns, flag IPs that generate more than 10 errors per hour, or parse a different log format by updating only the compiled pattern. The (?P<name>) named group system makes updating patterns clean because the code downstream references groups by name, not by index.

For the full syntax reference, flag descriptions, and advanced features like conditional matching, see the official re module documentation. The interactive regex101.com is invaluable for testing and debugging patterns.

How To Read and Write CSV Files in Python

Beginner

CSV (Comma-Separated Values) files are the most universal format for tabular data. Excel exports CSV. Databases export CSV. Every data analytics tool imports CSV. When a colleague sends you “the data,” there’s a good chance it’s a .csv file. If you work with spreadsheets, databases, or any form of tabular data, you’ll read and write CSV files all the time.

Python’s built-in csv module handles CSV reading and writing cleanly. It manages quoting, delimiters, line endings, and encoding edge cases that would break a naive split(',') approach — like fields that contain commas inside quotes, or newlines inside values. Just import csv and you’re ready.

In this article we’ll cover reading with csv.reader and csv.DictReader, writing with csv.writer and csv.DictWriter, handling different delimiters and encodings, dealing with real-world CSV quirks, filtering and transforming CSV data, and a complete sales report generator as a practical example. By the end, you’ll be fluent with CSV handling in Python for both simple and complex files.

Reading CSV in Python: Quick Example

Let’s read a CSV file and print its contents in just a few lines of code:

# quick_csv.py
import csv

# Create a sample CSV file to read
with open('people.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'Age', 'City'])
    writer.writerow(['Alice', '30', 'Sydney'])
    writer.writerow(['Bob', '25', 'Melbourne'])
    writer.writerow(['Charlie', '35', 'Brisbane'])

# Read and print it
with open('people.csv', 'r', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

Output:

['Name', 'Age', 'City']
['Alice', '30', 'Sydney']
['Bob', '25', 'Melbourne']
['Charlie', '35', 'Brisbane']

Note the newline='' argument when opening files for the csv module. The csv documentation recommends it for both reading and writing: when writing, it prevents extra blank lines between rows on Windows, and when reading, it lets the module handle newlines embedded in quoted fields correctly. The csv.reader returns each row as a list of strings. We’ll look at csv.DictReader shortly, which gives you named fields as dicts instead of positional lists.

What Is CSV and Why Use Python’s csv Module?

A CSV file stores tabular data as plain text with values separated by a delimiter (usually a comma, but sometimes a tab, semicolon, or pipe). The first row is typically a header row with column names. While it looks simple, CSV has many edge cases that a naive line.split(',') approach gets wrong.

CSV Edge Case             Example                      What Breaks split(',')
Field contains comma      "Smith, John",30             Splits in the wrong place
Field contains newline    "line1\nline2",30            Breaks row detection
Field contains quotes     "say ""hello""",30           Escaping ignored
Tab-separated values      Alice\t30\tSydney            Delimiter mismatch
Different encodings       Accented chars in latin-1    UnicodeDecodeError

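Python’s csv module handles the quoting and delimiter cases correctly, and the encoding case is handled by passing the right encoding to open(). Its default "excel" dialect closely follows the conventions described in RFC 4180, and you can configure delimiters, quoting, and line terminators through the dialect system.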
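
To make the difference concrete, here is a minimal sketch that feeds the same line to str.split(',') and to csv.reader. The line is invented for the example, and io.StringIO stands in for a real file.

```python
# split_vs_csv.py
import csv
import io

line = '"Smith, John",30,"say ""hello"""'

naive = line.split(',')
print(naive)
# ['"Smith', ' John"', '30', '"say ""hello"""']

parsed = next(csv.reader(io.StringIO(line)))
print(parsed)
# ['Smith, John', '30', 'say "hello"']
```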

Reading CSV data in Python
Commas separate your data. Misplaced quotes separate you from your sanity.

Reading CSV with DictReader

csv.DictReader is the preferred way to read CSV files for most use cases. It reads the header row and returns each subsequent row as an OrderedDict (or regular dict in Python 3.8+) with column names as keys. No more remembering which index is which.

# dict_reader.py
import csv

# Create a sample CSV with product data
csv_data = """id,product,price,stock,category
1,Python Handbook,29.99,150,Books
2,USB-C Hub,49.99,75,Electronics
3,Mechanical Keyboard,89.99,40,Electronics
4,Standing Desk Pad,24.99,200,Office
5,Monitor Light,35.00,90,Electronics
"""

with open('products.csv', 'w', newline='', encoding='utf-8') as f:
    f.write(csv_data)

# Read with DictReader
with open('products.csv', 'r', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    products = list(reader)

print(f'Loaded {len(products)} products\n')

# Access by column name
for p in products:
    name = p['product']
    price = float(p['price'])
    stock = int(p['stock'])
    print(f'  {name}: ${price:.2f} ({stock} in stock)')

# Filter: electronics under $60
print('\nElectronics under $60:')
electronics = [p for p in products
               if p['category'] == 'Electronics' and float(p['price']) < 60]
for p in electronics:
    print(f'  {p["product"]} - ${p["price"]}')

Output:

Loaded 5 products

  Python Handbook: $29.99 (150 in stock)
  USB-C Hub: $49.99 (75 in stock)
  Mechanical Keyboard: $89.99 (40 in stock)
  Standing Desk Pad: $24.99 (200 in stock)
  Monitor Light: $35.00 (90 in stock)

Electronics under $60:
  USB-C Hub - $49.99
  Monitor Light - $35.00

Remember that all values from DictReader are strings -- use int() or float() to convert numeric fields before arithmetic. A common defensive pattern is to wrap conversions in try/except so that empty or malformed fields don't crash the script. A one-liner such as safe_float = lambda x: float(x) if x else 0.0 covers empty fields only; a value like 'N/A' would still raise ValueError.
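
One possible shape for such a helper is sketched below. The name safe_float and the choice of 0.0 as the fallback are assumptions for this example; depending on the data, returning None or skipping the row may be more appropriate.

```python
# safe_float.py
def safe_float(value, default=0.0):
    """Convert value to float, returning default for empty or malformed input."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # TypeError covers None, ValueError covers '' and text like 'N/A'
        return default

print(safe_float('29.99'))  # 29.99
print(safe_float(''))       # 0.0
print(safe_float('N/A'))    # 0.0
print(safe_float(None))     # 0.0
```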

Writing CSV with DictWriter

csv.DictWriter lets you write dicts to CSV rows without tracking column order manually. You specify the field names once and then write rows as dicts.

# dict_writer.py
import csv

# Data to write
orders = [
    {'order_id': 'ORD-001', 'customer': 'Alice', 'amount': 149.97, 'date': '2026-04-16', 'status': 'shipped'},
    {'order_id': 'ORD-002', 'customer': 'Bob', 'amount': 89.99, 'date': '2026-04-16', 'status': 'pending'},
    {'order_id': 'ORD-003', 'customer': 'Charlie', 'amount': 24.99, 'date': '2026-04-15', 'status': 'delivered'},
]

fieldnames = ['order_id', 'customer', 'amount', 'date', 'status']

with open('orders.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()         # Write the header row
    writer.writerows(orders)     # Write all rows at once

print('Written to orders.csv')

# Verify: read it back
with open('orders.csv', 'r', encoding='utf-8') as f:
    print(f.read())

Output:

Written to orders.csv
order_id,customer,amount,date,status
ORD-001,Alice,149.97,2026-04-16,shipped
ORD-002,Bob,89.99,2026-04-16,pending
ORD-003,Charlie,24.99,2026-04-15,delivered

writer.writeheader() writes the field names as the first row. writer.writerows(orders) writes all rows in one call, which is tidier than looping and calling writer.writerow() for each item. If a dict has extra keys not in fieldnames, DictWriter raises a ValueError by default (extrasaction='raise'); pass extrasaction='ignore' to drop the extra keys silently. Keys that are missing from a dict are filled with the restval value, which defaults to an empty string.
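
A minimal sketch of the two extrasaction settings; in the csv module the default is 'raise'. The row contents are invented for this example, and io.StringIO stands in for a real file.

```python
# extrasaction_demo.py
import csv
import io

row = {'order_id': 'ORD-004', 'customer': 'Dana', 'notes': 'gift wrap'}
fieldnames = ['order_id', 'customer']  # 'notes' is deliberately left out

# Default behaviour: an unexpected key raises ValueError
strict = csv.DictWriter(io.StringIO(), fieldnames=fieldnames)
rejected = False
try:
    strict.writerow(row)
except ValueError as exc:
    rejected = True
    print('Rejected:', exc)

# extrasaction='ignore' silently drops keys that are not in fieldnames
buffer = io.StringIO()
lenient = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction='ignore')
lenient.writeheader()
lenient.writerow(row)
print(buffer.getvalue())
```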

Writing structured CSV with DictWriter
DictWriter gives your rows names so you don't have to count columns.

Handling Different Delimiters and Encodings

Not all "CSV" files use commas. Tab-separated files (.tsv) are common in bioinformatics. Semicolon-separated files appear frequently in European locales (where commas are decimal separators). The delimiter parameter handles all of these.

# custom_delimiter.py
import csv

# Write a tab-separated file
tsv_data = [
    ['gene_id', 'chromosome', 'start', 'end', 'strand'],
    ['BRCA1', 'chr17', '43044295', '43125482', '-'],
    ['TP53', 'chr17', '7661779', '7687538', '-'],
    ['EGFR', 'chr7', '55086725', '55275031', '+'],
]

with open('genes.tsv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(tsv_data)

# Read it back
with open('genes.tsv', 'r', newline='', encoding='utf-8') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        print(' | '.join(f'{col:15}' for col in row))

Output:

gene_id         | chromosome     | start          | end            | strand
BRCA1           | chr17          | 43044295       | 43125482       | -
TP53            | chr17          | 7661779        | 7687538        | -
EGFR            | chr7           | 55086725       | 55275031       | +

For Windows-generated CSV files, you may encounter cp1252 encoding instead of UTF-8. If you get a UnicodeDecodeError, try encoding='cp1252' or encoding='latin-1'. For files where you don't know the encoding, the chardet library (install via pip) can detect it automatically.
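
One way to apply that advice without extra dependencies is to try a short list of encodings in order. The sketch below is an illustration only: read_csv_rows and its default encoding list are hypothetical, and the demo file is written to a temporary directory.

```python
# encoding_fallback.py
import csv
import os
import tempfile

def read_csv_rows(path, encodings=('utf-8', 'cp1252')):
    """Return all rows, trying each encoding until one decodes cleanly."""
    for enc in encodings:
        try:
            with open(path, 'r', newline='', encoding=enc) as f:
                return list(csv.reader(f))  # Read fully so decode errors surface here
        except UnicodeDecodeError:
            continue  # Wrong guess, try the next encoding
    raise ValueError(f'Could not decode {path} with any of {encodings}')

# Demo: write a cp1252 file that is not valid UTF-8, then read it back
path = os.path.join(tempfile.mkdtemp(), 'legacy.csv')
with open(path, 'w', newline='', encoding='cp1252') as f:
    csv.writer(f).writerows([['name', 'city'], ['Zoë', 'Zürich']])

rows = read_csv_rows(path)
print(rows)  # [['name', 'city'], ['Zoë', 'Zürich']]
```

Keep in mind that single-byte encodings such as cp1252 and latin-1 accept almost any byte sequence, so a fallback that decodes without an error is not proof that the guess was right.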

Real-Life Example: Monthly Sales Report Generator

Analyzing CSV data
csv.reader handles the parsing. You handle the business logic.

Here's a complete script that reads a raw sales CSV, filters and aggregates the data, and writes a formatted monthly summary report.

# sales_report.py
import csv
from collections import defaultdict
from datetime import datetime

# Create sample sales data
raw_sales = [
    ['date', 'product', 'category', 'quantity', 'unit_price', 'region'],
    ['2026-04-01', 'Python Handbook', 'Books', '3', '29.99', 'NSW'],
    ['2026-04-01', 'USB-C Hub', 'Electronics', '2', '49.99', 'VIC'],
    ['2026-04-02', 'Standing Desk Pad', 'Office', '5', '24.99', 'QLD'],
    ['2026-04-03', 'Python Handbook', 'Books', '1', '29.99', 'NSW'],
    ['2026-04-03', 'Monitor Light', 'Electronics', '4', '35.00', 'VIC'],
    ['2026-04-04', 'Mechanical Keyboard', 'Electronics', '2', '89.99', 'NSW'],
    ['2026-04-05', 'USB-C Hub', 'Electronics', '3', '49.99', 'WA'],
    ['2026-04-06', 'Python Handbook', 'Books', '2', '29.99', 'VIC'],
]

with open('sales_raw.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(raw_sales)

# --- Process the data ---
category_totals = defaultdict(float)
region_totals = defaultdict(float)
product_units = defaultdict(int)
grand_total = 0.0

with open('sales_raw.csv', 'r', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        try:
            qty = int(row['quantity'])
            price = float(row['unit_price'])
            revenue = qty * price
        except (ValueError, KeyError) as e:
            print(f'Skipping malformed row: {row} | Error: {e}')
            continue

        category_totals[row['category']] += revenue
        region_totals[row['region']] += revenue
        product_units[row['product']] += qty
        grand_total += revenue

# --- Write the summary report ---
report_rows = []
report_rows.append(['=== MONTHLY SALES REPORT: April 2026 ===', '', ''])
report_rows.append(['', '', ''])

report_rows.append(['REVENUE BY CATEGORY', '', ''])
report_rows.append(['Category', 'Revenue', '% of Total'])
for cat, rev in sorted(category_totals.items(), key=lambda x: -x[1]):
    pct = rev / grand_total * 100
    report_rows.append([cat, f'${rev:.2f}', f'{pct:.1f}%'])

report_rows.append(['', '', ''])
report_rows.append(['REVENUE BY REGION', '', ''])
report_rows.append(['Region', 'Revenue', '% of Total'])
for region, rev in sorted(region_totals.items(), key=lambda x: -x[1]):
    pct = rev / grand_total * 100
    report_rows.append([region, f'${rev:.2f}', f'{pct:.1f}%'])

report_rows.append(['', '', ''])
report_rows.append(['GRAND TOTAL', f'${grand_total:.2f}', '100.0%'])

with open('sales_report.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(report_rows)

# Print summary to console
print('Sales Report Generated\n')
print('Revenue by Category:')
for cat, rev in sorted(category_totals.items(), key=lambda x: -x[1]):
    print(f'  {cat:20} ${rev:8.2f}')
print(f'\n  {"TOTAL":20} ${grand_total:8.2f}')
print('\nReport written to sales_report.csv')

Output:

Sales Report Generated

Revenue by Category:
  Electronics          $  569.93
  Books                $  179.94
  Office               $  124.95

  TOTAL                $  874.82

Report written to sales_report.csv

This script demonstrates the complete CSV pipeline: reading raw data row by row using DictReader, aggregating with defaultdict, handling malformed rows gracefully, and writing a structured multi-section report using csv.writer. The same pattern scales to millions of rows with minimal modification.

Frequently Asked Questions

Why does my CSV file have blank lines between rows on Windows?

This happens when you open the file without newline=''. The csv writer ends each row with \r\n on its own. If the file was opened in the default text mode, Windows then translates the \n into \r\n a second time, producing \r\r\n, which many programs display as a blank line after every row. Always use open('file.csv', 'w', newline='', encoding='utf-8') when writing CSV files to prevent this.
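The writer's own line ending can be seen without touching the disk. This is a minimal sketch that uses io.StringIO as a stand-in for a file opened with newline=''; the variable name buffer is just for illustration.

```python
# csv_line_endings.py
import csv
import io

buffer = io.StringIO()                    # In-memory stand-in for a file opened with newline=''
csv.writer(buffer).writerow(['a', 'b'])

print(repr(buffer.getvalue()))            # 'a,b\r\n'
```

Because the row already ends in \r\n, any further newline translation by the file object is what introduces the extra blank lines.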

My CSV file has garbled characters. What's wrong?

The file was probably saved with a different encoding -- most commonly Windows' cp1252 or latin-1. Try encoding='cp1252' or encoding='latin-1' when opening the file. Excel in particular often exports as cp1252 rather than UTF-8. If you're not sure, install the chardet library and run chardet.detect(open('file.csv', 'rb').read()), which returns its best guess at the encoding along with a confidence score.

How do I handle a very large CSV file without running out of memory?

The csv.reader and csv.DictReader objects are iterators -- they read one row at a time, so they don't load the entire file into memory. Don't call list(reader) on a huge file; instead, process rows in a for loop. For multi-gigabyte files, this approach uses only a few KB of memory regardless of file size.
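As a rough sketch of that row-at-a-time style, the function below sums one column without ever building a list of rows. The function name total_column and the tiny generated file are assumptions for the demo; a real file would simply be much longer.

```python
# stream_large_csv.py
import csv
import os
import tempfile

def total_column(path, column):
    """Sum one numeric column while holding only a single row in memory."""
    total = 0.0
    with open(path, 'r', newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):   # Rows are produced lazily, one per iteration
            total += float(row[column])
    return total

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'big.csv')

    # Small stand-in file; the same loop works unchanged on a much larger one
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['id', 'amount'])
        writer.writerows([i, i * 0.5] for i in range(1000))

    result = total_column(path, 'amount')

print(result)  # 249750.0
```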

How do I handle fields that contain commas or newlines?

The csv module handles this automatically. When you write a field that contains a comma, quote, or newline, the writer automatically wraps it in double quotes. When you read it back, the reader correctly identifies the entire quoted field as a single value. You don't need to do any manual escaping -- just let the csv module handle it.
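A quick round trip shows the quoting at work. This sketch writes one awkward row to an in-memory buffer and reads it back; the sample values are made up for illustration.

```python
# quoted_fields.py
import csv
import io

row = ['Acme, Inc.', 'He said "hi"', 'line one\nline two']

buffer = io.StringIO()
csv.writer(buffer).writerow(row)
print(buffer.getvalue())         # Fields are quoted; inner quotes are doubled

buffer.seek(0)
parsed = next(csv.reader(buffer))
print(parsed == row)             # True
```

When reading such data from a real file, open it with newline='' so that newlines embedded in quoted fields reach the reader untouched.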

Should I use the csv module or pandas for CSV?

For simple reading/writing and lightweight processing, the built-in csv module is perfect -- no dependencies, fast startup, and minimal memory overhead. For heavy data manipulation (filtering, grouping, joining, sorting large datasets), pandas is faster and more expressive. The csv module is the right choice for scripts that need to run on any machine without installing dependencies.

Conclusion

Python's csv module is a reliable, production-ready tool for all CSV work. We covered csv.reader and csv.writer for row-level access, csv.DictReader and csv.DictWriter for named-field access, handling custom delimiters like tabs and semicolons, always using newline='' on Windows, encoding issues and how to debug them, defensive parsing with try/except, and a complete aggregation-based sales report generator. The csv module handles all the quoting and escaping edge cases that would trip up a naive split-based approach.

Try extending the sales report example to generate a pivot table by region and category, or add a chart using matplotlib based on the aggregated data. CSV processing and data visualization are a natural combination.

For the complete API including dialect configuration and custom quoting behavior, see the official csv module documentation.

How To Read and Write JSON Files in Python

Beginner

JSON (JavaScript Object Notation) is the universal language of data exchange on the web. REST APIs return JSON. Configuration files are written in JSON. NoSQL databases store JSON. When you fetch data from any web service — weather APIs, payment processors, social media platforms — you’re almost certainly receiving JSON. Knowing how to read and write JSON in Python is one of the most practical skills you can have.

Python makes JSON handling easy with its built-in json module. You can parse a JSON string into a Python dictionary with one function call, and serialize Python data back to JSON with another. No installation required — import json is all you need. The module handles the translation between Python types (dicts, lists, strings, numbers, booleans) and their JSON equivalents automatically.

In this article we’ll cover reading JSON from strings and files, writing JSON to strings and files, pretty-printing, handling nested structures, working with real API data, customizing serialization for Python objects, and error handling. By the end, you’ll be comfortable parsing any JSON structure you encounter and serializing your Python data to clean, readable JSON output.

Reading JSON in Python: Quick Example

Here’s how to parse a JSON string and work with the resulting Python data in just a few lines:

# quick_json.py
import json

json_string = '{"name": "Alice", "age": 30, "languages": ["Python", "SQL"]}'

# Parse JSON string -> Python dict
data = json.loads(json_string)

print(type(data))           # <class 'dict'>
print(data['name'])         # Alice
print(data['languages'])    # ['Python', 'SQL']

# Serialize Python dict -> JSON string
output = json.dumps(data, indent=2)
print(output)

Output:

<class 'dict'>
Alice
['Python', 'SQL']
{
  "name": "Alice",
  "age": 30,
  "languages": [
    "Python",
    "SQL"
  ]
}

json.loads() parses a JSON string (the “s” stands for “string”), while json.dumps() serializes to a string. The indent=2 argument to dumps() pretty-prints with 2-space indentation. For reading and writing files directly, use json.load() and json.dump() (without the “s”).

What Is JSON and How Does It Map to Python?

JSON is a text-based data format derived from JavaScript object syntax. It stores data as key-value pairs (objects), ordered lists (arrays), strings, numbers, booleans, and null. Python’s json module automatically converts between JSON types and Python types.

JSON Type      | Python Type  | Example
object         | dict         | {"key": "value"}
array          | list         | [1, 2, 3]
string         | str          | "hello"
number (int)   | int          | 42
number (float) | float        | 3.14
true / false   | True / False | true
null           | None         | null

One important difference: JSON only supports string keys in objects, while Python dicts can have any hashable key. When you serialize a Python dict with integer keys, the json module automatically converts them to strings. Keep this in mind when working with round-trip serialization.
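The effect is easy to miss, so here is a small sketch of the round trip. The variable names are illustrative, and converting the keys back with int() is just one possible fix that assumes every key really was an integer.

```python
# int_keys_roundtrip.py
import json

original = {1: 'one', 2: 'two'}

restored = json.loads(json.dumps(original))
print(restored)                # {'1': 'one', '2': 'two'}
print(restored == original)    # False: the keys are now strings

# One way to recover the original keys, assuming they were all integers
fixed = {int(key): value for key, value in restored.items()}
print(fixed == original)       # True
```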

Loading JSON data in Python
json.loads turns a string into a dictionary. json.dumps does the reverse. Pick the right one.

Reading JSON from a File

Reading JSON from a file is extremely common — configuration files, data exports, and API response caches are often stored as JSON files. Use json.load() (no “s”) with a file object.

# read_json_file.py
import json

# First, create a sample JSON file to read
sample_data = {
    "app": "MyApp",
    "version": "2.1.0",
    "database": {
        "host": "localhost",
        "port": 5432,
        "name": "myapp_db"
    },
    "features": ["auth", "notifications", "analytics"]
}

with open('config.json', 'w', encoding='utf-8') as f:
    json.dump(sample_data, f, indent=2)

# Now read it back
with open('config.json', 'r', encoding='utf-8') as f:
    config = json.load(f)

print('App:', config['app'])
print('DB host:', config['database']['host'])
print('DB port:', config['database']['port'])
print('Features:', ', '.join(config['features']))

Output:

App: MyApp
DB host: localhost
DB port: 5432
Features: auth, notifications, analytics

Always open files with encoding='utf-8'. The JSON specification (RFC 8259) requires UTF-8 for JSON exchanged between systems, and many JSON files contain non-ASCII characters. The with statement ensures the file is properly closed even if an error occurs during parsing.

Writing JSON to a File

Serializing Python data to a JSON file is just as straightforward. The json.dump() function writes directly to a file object, which is more efficient than creating a string with json.dumps() and then writing it.

# write_json_file.py
import json

# Python data to serialize
user_data = {
    "users": [
        {"id": 1, "name": "Alice", "active": True, "score": 98.5},
        {"id": 2, "name": "Bob", "active": False, "score": 72.0},
        {"id": 3, "name": "Charlie", "active": True, "score": 85.25},
    ],
    "total": 3,
    "generated": "2026-04-16"
}

# Write with pretty printing and sorted keys
with open('users.json', 'w', encoding='utf-8') as f:
    json.dump(user_data, f, indent=2, sort_keys=True)

print('Written to users.json')

# Verify by reading it back
with open('users.json', 'r', encoding='utf-8') as f:
    content = f.read()

print(content[:300])

Output (abbreviated):

Written to users.json
{
  "generated": "2026-04-16",
  "total": 3,
  "users": [
    {
      "active": true,
      "id": 1,
      "name": "Alice",
      "score": 98.5
    },
    ...
  ]
}

The sort_keys=True option outputs keys in alphabetical order, which makes JSON diffs much cleaner in version control. Python dicts preserve insertion order, so without it the same data can serialize differently depending on the order in which the keys happened to be added. Use it for any JSON file that will be committed to a git repository.
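A two-dict comparison makes the point concrete. In this sketch, a and b are illustrative names for the same data built in a different key order.

```python
# sort_keys_demo.py
import json

a = {'name': 'Alice', 'id': 1}
b = {'id': 1, 'name': 'Alice'}   # Same data, keys added in a different order

print(a == b)                                                          # True
print(json.dumps(a) == json.dumps(b))                                  # False
print(json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True))  # True
```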

Exchanging JSON via APIs
JSON is the universal language of APIs. Speak it fluently.

Working with Real API Data

The most common use of JSON in Python is parsing data from REST APIs. Here’s how to fetch and parse real JSON data from a public practice API:

# api_json.py
import json
import urllib.request

# Fetch a list of users from JSONPlaceholder (a free practice REST API)
url = 'https://jsonplaceholder.typicode.com/users'

with urllib.request.urlopen(url) as response:
    raw = response.read().decode('utf-8')

users = json.loads(raw)

print(f'Fetched {len(users)} users\n')

for user in users[:3]:  # Show first 3 users
    name = user.get('name', 'Unknown')
    email = user.get('email', 'Unknown')
    city = user.get('address', {}).get('city', 'Unknown')
    company = user.get('company', {}).get('name', 'Unknown')
    print(f'{name} | {email} | {city} | {company}')

Output:

Fetched 10 users

Leanne Graham | Sincere@april.biz | Gwenborough | Romaguera-Crona
Ervin Howell | Shanna@melissa.tv | Wisokyburgh | Deckow-Crist
Clementine Bauch | Nathan@yesenia.net | McKenziehaven | Romaguera-Jacobson

The .get('key', default) pattern is defensive JSON parsing — it returns the default value if the key is missing rather than raising a KeyError. For nested structures like address.city, chain the .get() calls: user.get('address', {}).get('city', 'Unknown'). If 'address' is missing, the inner .get() runs on an empty dict and safely returns 'Unknown' instead of crashing.

Navigating Nested JSON

Real-world API responses are often deeply nested. Here’s how to extract data from a complex nested structure safely:

# nested_json.py
import json

# Simulate a complex API response
response_text = '''
{
  "status": "success",
  "data": {
    "post": {
      "id": 42,
      "title": "Understanding Python JSON",
      "author": {"id": 7, "name": "Sam Dev"},
      "tags": ["python", "json", "tutorial"],
      "stats": {"views": 1250, "likes": 87, "comments": 14}
    }
  }
}
'''

data = json.loads(response_text)

# Defensive nested access
post = data.get('data', {}).get('post', {})
title = post.get('title', 'Unknown')
author_name = post.get('author', {}).get('name', 'Unknown')
views = post.get('stats', {}).get('views', 0)
tags = post.get('tags', [])

print(f'Title: {title}')
print(f'Author: {author_name}')
print(f'Views: {views:,}')
print(f'Tags: {", ".join(tags)}')

Output:

Title: Understanding Python JSON
Author: Sam Dev
Views: 1,250
Tags: python, json, tutorial

The chained .get() approach is much safer than writing data['data']['post']['title'] — any missing key in the chain would raise a KeyError and crash your script. With .get(), you control the default at every level.
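One gap remains: if a key is present but its value is null, .get() returns None rather than the default, and the next .get() in the chain fails with AttributeError. A small helper can cover that case. The sketch below is one possible approach, not a standard-library function; the name dig and its signature are assumptions for this example.

```python
# safe_dig.py
import json

def dig(data, *keys, default=None):
    """Hypothetical helper: follow keys through nested dicts, or return default."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return default if current is None else current

payload = json.loads('{"data": {"post": {"author": null, "stats": {"views": 1250}}}}')

# Chained .get() would fail here: post.get('author', {}) returns None, not {}
author = dig(payload, 'data', 'post', 'author', 'name', default='Unknown')
views = dig(payload, 'data', 'post', 'stats', 'views', default=0)

print(author)  # Unknown
print(views)   # 1250
```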

Custom Serialization for Python Objects

The json module can’t serialize Python objects like datetime by default — they’re not JSON-native types. You have two options: use a custom encoder class or use the default parameter of json.dumps().

# custom_json.py
import json
from datetime import datetime, date

# Option 1: default function for simple cases
def json_default(obj):
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    raise TypeError(f'Object of type {type(obj).__name__} is not JSON serializable')

data = {
    'event': 'User signup',
    'timestamp': datetime(2026, 4, 16, 9, 30, 0),
    'date_only': date(2026, 4, 16),
    'user_id': 123
}

result = json.dumps(data, default=json_default, indent=2)
print(result)

Output:

{
  "event": "User signup",
  "timestamp": "2026-04-16T09:30:00",
  "date_only": "2026-04-16",
  "user_id": 123
}

The default function is called whenever json.dumps() encounters an object it can’t serialize natively. Return a JSON-serializable value (a string, number, list, or dict), and json.dumps() will use it in place of the original object. The ISO 8601 format for datetime strings (2026-04-16T09:30:00) is the widely-accepted standard.
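The other option mentioned above, a custom encoder class, looks like this. It is a minimal sketch: the class name DateTimeEncoder is an assumption for the example, and it handles only datetime and date.

```python
# custom_encoder.py
import json
from datetime import date, datetime

class DateTimeEncoder(json.JSONEncoder):
    """Illustrative encoder that serializes datetime and date as ISO 8601 strings."""

    def default(self, obj):
        if isinstance(obj, (datetime, date)):
            return obj.isoformat()
        # Let the base class raise TypeError for anything else
        return super().default(obj)

data = {
    'timestamp': datetime(2026, 4, 16, 9, 30, 0),
    'date_only': date(2026, 4, 16),
}

encoded = json.dumps(data, cls=DateTimeEncoder)
print(encoded)
```

An encoder class is convenient when the same conversions are needed in many places, since it can be imported and passed as cls= wherever JSON is written.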

Handling JSON Errors

JSON parsing fails when the input is malformed. Always wrap json.loads() in a try/except when dealing with data from external sources.

# json_errors.py
import json

def safe_parse(json_string):
    """Parse JSON safely, returning None on failure."""
    try:
        return json.loads(json_string)
    except json.JSONDecodeError as e:
        print(f'JSON parse error at line {e.lineno}, col {e.colno}: {e.msg}')
        print(f'Bad input: {json_string[:100]}')
        return None

# Valid JSON
result = safe_parse('{"name": "Alice", "age": 30}')
print('Valid:', result)

# Invalid JSON (missing closing brace)
result2 = safe_parse('{"name": "Bob"')
print('Invalid:', result2)

# Invalid JSON (trailing comma -- not valid in JSON)
result3 = safe_parse('{"key": "value",}')
print('Trailing comma:', result3)

Output:

Valid: {'name': 'Alice', 'age': 30}
JSON parse error at line 1, col 15: Expecting ',' delimiter
Bad input: {"name": "Bob"
Invalid: None
JSON parse error at line 1, col 17: Expecting property name enclosed in double quotes
Bad input: {"key": "value",}
Trailing comma: None

json.JSONDecodeError is a subclass of ValueError and carries the line number, column, and a descriptive message about what went wrong. Always check for this error when parsing API responses, user-provided input, or files from external sources — any of these can contain malformed JSON.

Real-Life Example: JSON Config Manager

JSON configuration files
Config files in JSON keep your settings out of your code.

Here’s a complete configuration manager that reads a JSON config file, applies defaults for missing keys, validates required fields, and writes updated config back to disk.

# config_manager.py
import json
import os

DEFAULTS = {
    "debug": False,
    "log_level": "INFO",
    "database": {
        "host": "localhost",
        "port": 5432,
        "pool_size": 5
    },
    "cache": {
        "enabled": True,
        "ttl_seconds": 300
    }
}

REQUIRED_KEYS = ["database.host", "database.port"]

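def deep_merge(base, override):
    """Merge override into base recursively without mutating either dict."""
    result = {}
    for key, val in base.items():
        # Copy nested dicts so later edits to the result never leak back into base
        result[key] = deep_merge(val, {}) if isinstance(val, dict) else val
    for key, val in override.items():
        if isinstance(result.get(key), dict) and isinstance(val, dict):
            result[key] = deep_merge(result[key], val)
        else:
            result[key] = val
    return result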

def get_nested(d, dotted_key, default=None):
    """Access nested dict value using dot notation."""
    keys = dotted_key.split('.')
    for key in keys:
        if not isinstance(d, dict) or key not in d:
            return default
        d = d[key]
    return d

def load_config(config_path):
    """Load config from file, merging with defaults."""
    if os.path.exists(config_path):
        try:
            with open(config_path, 'r', encoding='utf-8') as f:
                user_config = json.load(f)
        except json.JSONDecodeError as e:
            print(f'Error reading config: {e}')
            user_config = {}
    else:
        print(f'Config file not found at {config_path}, using defaults')
        user_config = {}

    config = deep_merge(DEFAULTS, user_config)

    # Validate required keys
    missing = [k for k in REQUIRED_KEYS if get_nested(config, k) is None]
    if missing:
        raise ValueError(f'Missing required config keys: {missing}')

    return config

def save_config(config, config_path):
    """Write config to JSON file."""
    with open(config_path, 'w', encoding='utf-8') as f:
        json.dump(config, f, indent=2, sort_keys=True)
    print(f'Config saved to {config_path}')

# Demo
config = load_config('app_config.json')
config['debug'] = True
config['database']['pool_size'] = 10
save_config(config, 'app_config.json')

print(f"Debug mode: {config['debug']}")
print(f"DB pool size: {config['database']['pool_size']}")
print(f"Cache TTL: {config['cache']['ttl_seconds']}s")

Output:

Config file not found at app_config.json, using defaults
Config saved to app_config.json
Debug mode: True
DB pool size: 10
Cache TTL: 300s

The deep_merge() function recursively merges user settings into defaults, so users only need to specify the keys they want to override. The dot-notation accessor get_nested() makes validation and access of nested keys clean and readable. This pattern is common in production applications that keep their settings in a JSON config file.

Frequently Asked Questions

What is the difference between json.loads and json.load?

json.loads() parses a JSON string (the “s” = string). json.load() reads from a file object. Similarly, json.dumps() serializes to a string and json.dump() writes to a file. This naming convention is consistent across Python’s standard library (e.g., pickle.loads/pickle.load follows the same pattern).

How do I pretty-print JSON?

Use json.dumps(data, indent=2) for 2-space indentation or indent=4 for 4-space. Add sort_keys=True to sort keys alphabetically. From the command line, you can pretty-print a JSON file with: python -m json.tool myfile.json. This is built into Python and works on any valid JSON file.

How does Python handle Unicode in JSON?

By default, json.dumps() escapes non-ASCII characters as \uXXXX escape sequences. To output them directly as Unicode (which is valid JSON and more readable), use json.dumps(data, ensure_ascii=False). Always open JSON files with encoding='utf-8' to handle any Unicode content correctly.
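The difference is easiest to see side by side. In this short sketch the sample value is arbitrary; both strings are valid JSON and parse back to the same data.

```python
# ensure_ascii_demo.py
import json

data = {'city': 'Zürich'}

escaped = json.dumps(data)                       # Default: ASCII-only output
readable = json.dumps(data, ensure_ascii=False)  # Keep non-ASCII characters as-is

print(escaped)   # {"city": "Z\u00fcrich"}
print(readable)  # {"city": "Zürich"}
print(json.loads(escaped) == json.loads(readable))  # True
```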

Why can’t I serialize datetime objects?

The JSON spec only defines six data types: object, array, string, number, boolean, and null. Python’s datetime doesn’t map to any of these, so the json module raises TypeError. The idiomatic solution is to convert to ISO 8601 strings using dt.isoformat(). Pass a default function to json.dumps() that handles these conversions, as shown in the custom serialization section above.

How do I handle very large JSON files efficiently?

For JSON files too large to load into memory at once, use the ijson library (install via pip install ijson) for streaming incremental JSON parsing. It parses the file as it reads, yielding items one at a time. The standard json module always loads the entire file into memory — fine for files up to hundreds of MB, but not for multi-GB JSON datasets.

Conclusion

Python’s json module makes JSON handling simple and reliable. We covered json.loads()/json.load() for parsing, json.dumps()/json.dump() for serialization, pretty-printing with indent and sort_keys, defensive nested access with chained .get(), parsing real API responses, custom serialization for datetime objects, and robust error handling with json.JSONDecodeError. JSON fluency is an essential Python skill — you’ll use it in almost every project that touches the internet or stores configuration.

Try extending the config manager to support environment variable overrides (keys from os.environ take precedence over the file) or to validate values against a schema using jsonschema (available via pip). Both are common patterns in production-grade config management.

For the full API reference and additional encoder/decoder customization options, see the official json module documentation.

How To Use the Python logging Module for Application Logging

Intermediate

Every production Python application needs logging. Not print() statements that vanish when your script closes — real, structured logs with timestamps, severity levels, file rotation, and the ability to turn verbosity up or down without touching your code. When something goes wrong at 2am on a production server, your log file is the only witness. If all you left behind are a few print("here") calls, you’re debugging blind.

Python ships with a powerful, flexible logging module in its standard library. It’s built around a hierarchy of loggers, handlers, and formatters that you configure once and use everywhere. The learning curve is a bit steeper than print(), but the payoff — structured, timestamped, level-filtered, file-backed logs — is enormous. No third-party packages are required to get started.

In this article we’ll cover the five log levels, the basicConfig shortcut, named loggers and the logger hierarchy, handlers (console, file, rotating), formatters, logging from multiple modules, and a complete real-world logging setup for a data pipeline application. By the end, you’ll have a professional logging setup you can drop into any project.

Python Logging: Quick Example

Here’s the fastest way to get meaningful logging output with timestamps and levels:

# quick_logging.py
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S'
)

logging.debug('This is a debug message')
logging.info('Application started')
logging.warning('Disk space running low')
logging.error('Failed to connect to database')
logging.critical('Application cannot continue')

Output:

2026-04-16 09:00:01 - DEBUG - This is a debug message
2026-04-16 09:00:01 - INFO - Application started
2026-04-16 09:00:01 - WARNING - Disk space running low
2026-04-16 09:00:01 - ERROR - Failed to connect to database
2026-04-16 09:00:01 - CRITICAL - Application cannot continue

basicConfig() sets up the root logger — the single logging object all other loggers inherit from if not configured themselves. The format string uses special tokens like %(asctime)s, %(levelname)s, and %(message)s. In the sections below we’ll go beyond the root logger to set up named, per-module loggers and file handlers.

What Is the logging Module and Why Use It?

The logging module provides a standardized way to emit messages from your application at different severity levels. Unlike print(), log messages carry metadata (timestamp, level, logger name, file, line number), can be routed to multiple destinations simultaneously (console AND file), and can be filtered by level without changing any code.

Level    | Numeric Value | When to Use
DEBUG    | 10            | Detailed diagnostic info during development
INFO     | 20            | Confirmation that things are working as expected
WARNING  | 30            | Something unexpected happened, but the app continues
ERROR    | 40            | A serious problem: part of the app couldn’t run
CRITICAL | 50            | A severe error: the app may not be able to continue

The level you set on a logger or handler acts as a filter: only messages at that level or higher are processed. Set to DEBUG in development to see everything; set to WARNING or ERROR in production to reduce noise. No code changes required — just a config change.
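The filtering can be observed directly by capturing output in memory. This is a small sketch; the logger name level_demo and the use of io.StringIO in place of a real console are choices made for the demo.

```python
# level_filter_demo.py
import io
import logging

stream = io.StringIO()                      # Captures output so it can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(levelname)s:%(message)s'))

logger = logging.getLogger('level_demo')    # Hypothetical logger name
logger.propagate = False                    # Keep the demo isolated from the root logger
logger.addHandler(handler)

logger.setLevel(logging.WARNING)            # "Production" setting
logger.info('not shown')
logger.warning('shown')

logger.setLevel(logging.DEBUG)              # "Development" setting
logger.debug('now shown')

print(stream.getvalue())
```

Only the threshold changed between the calls; the logging statements themselves stayed the same.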

Understanding logging levels
Five logging levels. Choose wisely or drown in noise.

Named Loggers and the Logger Hierarchy

The best practice is to create a named logger for each module using __name__. This gives every log message a module-level identifier and lets you control logging granularity per module in large applications.

# named_logger.py
import logging

# Create a logger named after this module
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

# Create a console handler with formatting
handler = logging.StreamHandler()
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)

# Use the logger
logger.info('Module initialized')
logger.debug('Loading configuration from file')
logger.warning('Config file not found, using defaults')

Output:

2026-04-16 09:00:01 - __main__ - INFO - Module initialized
2026-04-16 09:00:01 - __main__ - DEBUG - Loading configuration from file
2026-04-16 09:00:01 - __main__ - WARNING - Config file not found, using defaults

Loggers form a hierarchy based on their names. A logger named myapp.database is a child of myapp, which is a child of the root logger. Messages propagate up the hierarchy by default — so configuring handlers on the root logger or a parent logger affects all children. This hierarchy is what makes it possible to configure logging once in your main module and have it work across all your imports.
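Propagation can be demonstrated with two loggers and a single handler. In this sketch the names myapp and myapp.database follow the paragraph above, and io.StringIO stands in for the console so the result can be checked.

```python
# hierarchy_demo.py
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(name)s - %(levelname)s - %(message)s'))

parent = logging.getLogger('myapp')
parent.setLevel(logging.INFO)
parent.addHandler(handler)                   # The only handler in this demo

child = logging.getLogger('myapp.database')  # No handlers or level of its own
child.info('Connection pool ready')          # Handled by the parent's handler

print(child.parent is parent)   # True
print(stream.getvalue())
```

The child also inherits its effective level (INFO) from the parent, which is why the message is not filtered out.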

Logging to a File

Writing logs to a file ensures you have a record of what happened, even after the terminal session closes. The FileHandler writes log messages to a file you specify.

# file_logging.py
import logging

logger = logging.getLogger('myapp')
logger.setLevel(logging.DEBUG)

# Console handler -- only show WARNING and above in the terminal
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.WARNING)
console_handler.setFormatter(logging.Formatter('%(levelname)s: %(message)s'))

# File handler -- write everything DEBUG and above to a file
file_handler = logging.FileHandler('app.log', encoding='utf-8')
file_handler.setLevel(logging.DEBUG)
file_handler.setFormatter(
    logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
)

logger.addHandler(console_handler)
logger.addHandler(file_handler)

# Now emit messages
logger.debug('Processing item 1')       # Only in file
logger.info('Item 1 processed OK')     # Only in file
logger.warning('Item 2 skipped')        # Console AND file
logger.error('Item 3 failed: timeout') # Console AND file

Terminal output:

WARNING: Item 2 skipped
ERROR: Item 3 failed: timeout

app.log contents:

2026-04-16 09:00:01 - myapp - DEBUG - Processing item 1
2026-04-16 09:00:01 - myapp - INFO - Item 1 processed OK
2026-04-16 09:00:01 - myapp - WARNING - Item 2 skipped
2026-04-16 09:00:01 - myapp - ERROR - Item 3 failed: timeout

This dual-handler pattern is extremely common in production: the console shows only what operators need to see in real time (warnings and errors), while the file captures the full diagnostic history for post-mortem debugging.

Configuring log handlers
Handlers decide where your logs go. Choose file, console, or both.

Rotating Log Files

Log files grow indefinitely if nothing manages them. The RotatingFileHandler automatically rotates log files when they hit a size limit, keeping a configurable number of backup files. The TimedRotatingFileHandler rotates on a schedule (daily, hourly, etc.).

# rotating_logs.py
import logging
from logging.handlers import RotatingFileHandler, TimedRotatingFileHandler

logger = logging.getLogger('rotating_demo')
logger.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')

# Rotate when file hits 1MB, keep 5 backups
size_handler = RotatingFileHandler(
    'app_size.log',
    maxBytes=1_000_000,   # 1 MB
    backupCount=5,
    encoding='utf-8'
)
size_handler.setFormatter(formatter)

# Rotate daily at midnight, keep 7 days of logs
time_handler = TimedRotatingFileHandler(
    'app_daily.log',
    when='midnight',
    interval=1,
    backupCount=7,
    encoding='utf-8'
)
time_handler.setFormatter(formatter)

logger.addHandler(size_handler)
logger.addHandler(time_handler)

for i in range(100):
    logger.info(f'Processing record {i}')

print('Logging complete. Check app_size.log and app_daily.log')

Output:

Logging complete. Check app_size.log and app_daily.log

When app_size.log reaches 1MB, it’s renamed to app_size.log.1 and a fresh app_size.log is started. Any existing backups shift up by one (app_size.log.1 becomes app_size.log.2, and so on), and the oldest is deleted once there are more than backupCount of them. For long-running services like web servers or data pipelines, TimedRotatingFileHandler with when='midnight' and backupCount=30 gives you a month of daily logs with zero maintenance.
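To watch rotation happen without writing megabytes of logs, the limit can be shrunk for a quick experiment. The sketch below uses an unrealistically small maxBytes and a temporary directory; the file name demo.log and the logger name rotation_demo are assumptions for the demo.

```python
# rotation_demo.py
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

with tempfile.TemporaryDirectory() as tmp:
    handler = RotatingFileHandler(
        os.path.join(tmp, 'demo.log'),
        maxBytes=200,        # Tiny on purpose, to force rollovers quickly
        backupCount=2,
        encoding='utf-8',
    )
    handler.setFormatter(logging.Formatter('%(message)s'))

    logger = logging.getLogger('rotation_demo')   # Hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)

    for i in range(100):
        logger.info('record %d', i)

    logger.removeHandler(handler)
    handler.close()          # Release the file before the directory is cleaned up
    files = sorted(os.listdir(tmp))

print(files)  # ['demo.log', 'demo.log.1', 'demo.log.2']
```

Although the loop triggers several rollovers, only the current file and two backups remain, because backupCount caps how many old files are kept.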

Logging Exceptions

One of the most valuable logging features is capturing full exception tracebacks. Use logger.exception() inside an except block — it logs the message at ERROR level and automatically appends the full traceback.

# exception_logging.py
import logging

logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

def divide(a, b):
    try:
        result = a / b
        logger.info(f'Divided {a} / {b} = {result}')
        return result
    except ZeroDivisionError:
        logger.exception(f'Failed to divide {a} by {b}')
        return None

divide(10, 2)
divide(10, 0)  # Will log the traceback

Output:

2026-04-16 09:00:01 - INFO - Divided 10 / 2 = 5.0
2026-04-16 09:00:01 - ERROR - Failed to divide 10 by 0
Traceback (most recent call last):
  File "exception_logging.py", line 8, in divide
    result = a / b
ZeroDivisionError: division by zero

logger.exception() is equivalent to logger.error(msg, exc_info=True). The traceback is automatically included — you don’t need to call traceback.format_exc() or format it yourself. This is the pattern every production application should use inside exception handlers.
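The equivalence can be checked by logging the same failure three ways and counting the tracebacks. This is a sketch with an in-memory stream; the logger name exc_demo is an assumption for the demo.

```python
# exc_info_demo.py
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(levelname)s - %(message)s'))

logger = logging.getLogger('exc_demo')   # Hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.addHandler(handler)

try:
    int('not a number')
except ValueError:
    logger.exception('Conversion failed (exception)')
    logger.error('Conversion failed (error + exc_info)', exc_info=True)
    logger.error('Conversion failed (error only)')   # No traceback attached

output = stream.getvalue()
print(output)
print(output.count('Traceback (most recent call last):'))  # 2
```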

Real-Life Example: Data Pipeline Logger

Monitoring log pipelines
Multiple loggers, multiple destinations, one clean pipeline.

Here’s a complete logging setup for a data pipeline that processes records from a source file, transforms them, and writes them to an output file — with full logging at every stage.

# data_pipeline.py
import logging
from logging.handlers import RotatingFileHandler

def setup_logger(name, log_file, level=logging.DEBUG):
    """Create a configured logger with console and rotating file handlers."""
    logger = logging.getLogger(name)
    logger.setLevel(level)

    if logger.handlers:
        return logger  # Prevent duplicate handlers on re-import

    formatter = logging.Formatter(
        '%(asctime)s | %(name)-20s | %(levelname)-8s | %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S'
    )

    # Console: WARNING and above
    console = logging.StreamHandler()
    console.setLevel(logging.WARNING)
    console.setFormatter(formatter)

    # File: everything, rotate at 500KB, keep 3 backups
    fh = RotatingFileHandler(log_file, maxBytes=500_000, backupCount=3, encoding='utf-8')
    fh.setLevel(logging.DEBUG)
    fh.setFormatter(formatter)

    logger.addHandler(console)
    logger.addHandler(fh)
    return logger

def process_records(records, logger):
    """Process a list of record dicts. Returns (success_count, error_count)."""
    success = 0
    errors = 0

    for i, record in enumerate(records):
        try:
            if 'name' not in record:
                raise ValueError('Missing required field: name')
            if not isinstance(record.get('age', 0), int):
                raise TypeError(f'age must be an integer, got {type(record["age"]).__name__}')

            # Simulate transform
            transformed = {
                'id': i + 1,
                'name': record['name'].strip().title(),
                'age': record['age'],
                'status': 'active'
            }
            logger.debug(f'Processed record {i+1}: {transformed["name"]}')
            success += 1

        except (ValueError, TypeError) as e:
            logger.warning(f'Skipping record {i+1}: {e}')
            errors += 1
        except Exception:
            logger.exception(f'Unexpected error on record {i+1}')
            errors += 1

    return success, errors

def run_pipeline(input_records, output_path):
    logger = setup_logger('pipeline', 'pipeline.log')

    logger.info(f'Pipeline started. Input records: {len(input_records)}')

    success, errors = process_records(input_records, logger)

    logger.info(f'Pipeline complete. Success: {success}, Errors: {errors}')
    if errors > 0:
        logger.warning(f'{errors} records skipped due to errors')

    return success, errors

# Run the pipeline
sample_data = [
    {'name': 'alice johnson', 'age': 30},
    {'name': 'bob smith', 'age': 25},
    {'age': 40},                         # Missing name -- will warn
    {'name': 'charlie brown', 'age': 'thirty'},  # Wrong type -- will warn
    {'name': 'diana prince', 'age': 28},
]

success, errors = run_pipeline(sample_data, 'output.json')
print(f'\nFinal result: {success} processed, {errors} skipped')
print('Check pipeline.log for full details')

Output:

2026-04-16 09:00:01 | pipeline             | WARNING  | Skipping record 3: Missing required field: name
2026-04-16 09:00:01 | pipeline             | WARNING  | Skipping record 4: age must be an integer, got str
2026-04-16 09:00:01 | pipeline             | WARNING  | 2 records skipped due to errors

Final result: 3 processed, 2 skipped
Check pipeline.log for full details

The setup_logger() function uses a guard (if logger.handlers: return logger) to prevent duplicate handlers when the function is called multiple times, which is a common gotcha in larger projects. The pipeline logs every step to the file (DEBUG level) while showing only warnings and errors on the console, giving operators a clean output while preserving the full diagnostic trail in the log file.

Frequently Asked Questions

When should I use basicConfig vs named loggers?

Use basicConfig() for scripts and quick tools where you just need output to the console. For any application with multiple modules, use named loggers (logging.getLogger(__name__)) so you can identify which module emitted each message and control logging per module. Named loggers are the standard for libraries and production code.

Why am I seeing duplicate log messages?

This almost always happens because you added a handler to a logger that also propagates to the root logger, which has its own handler. Fix it either by setting logger.propagate = False on your named logger, or by removing the handler from the root logger. The pattern if logger.handlers: return logger in a setup function also prevents duplicate handlers when the function is called more than once.
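A small experiment makes the propagation effect concrete. This is a minimal sketch with assumed names (count_emits and the dup_demo_* loggers are hypothetical): it attaches one handler to the root logger and one to a named logger, then counts how many times a single message is written.

```python
# duplicate_demo.py (hypothetical file name)
import io
import logging

def count_emits(propagate):
    """Return how many times one message is written for a given propagate setting."""
    stream = io.StringIO()
    root = logging.getLogger()
    child = logging.getLogger(f'dup_demo_{propagate}')
    root_handler = logging.StreamHandler(stream)
    child_handler = logging.StreamHandler(stream)
    root.addHandler(root_handler)
    child.addHandler(child_handler)
    child.propagate = propagate
    try:
        child.warning('hello')
    finally:
        # Leave the global logging state as we found it
        root.removeHandler(root_handler)
        child.removeHandler(child_handler)
    return stream.getvalue().count('hello')

print(count_emits(propagate=True))   # child handler and root handler both fire
print(count_emits(propagate=False))  # only the child handler fires
```

With propagation on, the message is written once by each handler; turning it off leaves only the named logger's own handler.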

How do I turn off all logging in production?

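Call logging.disable(logging.CRITICAL): it suppresses every message at CRITICAL and below, which is effectively everything, regardless of individual logger levels. Setting the root logger’s level to logging.CRITICAL + 1 has a similar effect for loggers that inherit their level from the root; do not use logging.NOTSET for this, because on the root logger it lets every message through. More typically, set the production handler level to WARNING or ERROR rather than disabling logging entirely, since you still want errors logged in production.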
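The effect of logging.disable() can be checked with a short sketch. The logger name and the in-memory stream are assumptions for the demonstration; logging.disable() itself is process-wide, so the sketch restores normal behavior afterwards.

```python
# disable_demo.py (hypothetical file name)
import io
import logging

stream = io.StringIO()
quiet_logger = logging.getLogger('disable_demo')
quiet_logger.setLevel(logging.DEBUG)
quiet_logger.propagate = False
quiet_logger.addHandler(logging.StreamHandler(stream))

logging.disable(logging.CRITICAL)   # suppress CRITICAL and everything below
quiet_logger.critical('dropped')

logging.disable(logging.NOTSET)     # restore normal behavior
quiet_logger.error('kept')

silenced_output = stream.getvalue()
print(repr(silenced_output))
```

Only the second message reaches the handler, even though the logger itself is set to DEBUG.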

How do I configure logging from a config file?

Use logging.config.fileConfig('logging.ini') for INI-format config files, or logging.config.dictConfig(config_dict) for dictionary-based config (which you can load from a JSON or YAML file). Dictionary config is the modern approach — it’s more flexible and easier to version-control alongside your application code.
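As an illustration of the dictionary approach, here is a minimal sketch that parses a JSON document and hands it to dictConfig(). The JSON is inlined as a string so the example is self-contained; in practice it would live in a file, and the logger name config_demo is an assumption.

```python
# dictconfig_demo.py (hypothetical file name)
import json
import logging
import logging.config

# Assumed configuration; normally loaded from a JSON file on disk
CONFIG_JSON = '''
{
  "version": 1,
  "disable_existing_loggers": false,
  "formatters": {
    "simple": {"format": "%(levelname)s - %(name)s - %(message)s"}
  },
  "handlers": {
    "console": {
      "class": "logging.StreamHandler",
      "level": "INFO",
      "formatter": "simple"
    }
  },
  "loggers": {
    "config_demo": {"level": "DEBUG", "handlers": ["console"], "propagate": false}
  }
}
'''

logging.config.dictConfig(json.loads(CONFIG_JSON))
config_logger = logging.getLogger('config_demo')
config_logger.info('configured from a dictionary')
```

Because levels and handlers are plain data here, changing the verbosity in production becomes a config edit rather than a code change.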

Does logging slow down my application?

At DEBUG level with many log messages, yes: logging adds overhead, especially when writing to disk. In production, set the level to WARNING or ERROR so that most logger.debug() and logger.info() calls return almost immediately without any I/O. Arguments are still evaluated before the call, so on extremely hot code paths guard expensive message construction with if logger.isEnabledFor(logging.DEBUG):.
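The difference between an unguarded and a guarded call can be sketched as follows. The helper expensive_summary() is a hypothetical stand-in for any costly computation; the calls list simply records how often it ran.

```python
# guard_demo.py (hypothetical file name)
import logging

perf_logger = logging.getLogger('perf_demo')
perf_logger.setLevel(logging.WARNING)  # DEBUG messages are disabled

calls = []

def expensive_summary():
    """Stand-in for a costly computation (hypothetical helper)."""
    calls.append(1)
    return 'summary'

# Unguarded f-string: expensive_summary() runs even though DEBUG is disabled
perf_logger.debug(f'state: {expensive_summary()}')

# Guarded: the helper is skipped entirely when DEBUG is disabled
if perf_logger.isEnabledFor(logging.DEBUG):
    perf_logger.debug('state: %s', expensive_summary())

print(len(calls))
```

The helper runs once, for the unguarded call only. Passing arguments with %s placeholders also defers string formatting until a handler actually needs the message, though it does not avoid evaluating the arguments themselves.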

Conclusion

Python’s logging module gives you a production-grade observability system built into the standard library. We covered the five log levels (DEBUG through CRITICAL), setting up named loggers with getLogger(__name__), combining console and file handlers with different levels, automatic log rotation with RotatingFileHandler and TimedRotatingFileHandler, capturing exception tracebacks with logger.exception(), and building a complete pipeline logger. Replace every print() statement in your applications with the appropriate logging call — your future debugging self will thank you.

Extend the data pipeline example by loading the logging configuration from a JSON file so it can be adjusted without code changes, or add an SMTPHandler to email you when a CRITICAL event fires. The logging module’s handler ecosystem is extensive.

See the official logging documentation and the logging cookbook for advanced patterns including thread-safe logging and multiprocessing log handlers.

How To Use Python subprocess Module to Run System Commands


Intermediate

Sometimes Python alone isn’t enough — you need to run a shell command, launch another program, or query your operating system directly from a script. Maybe you want to zip a directory, ping a server, run a linter, or call a tool that only exists as a command-line binary. Python’s subprocess module is exactly what you need for all of these tasks, and it’s built right into the standard library.

The subprocess module lets you spawn new processes, connect to their input/output/error pipes, and obtain their return codes. It is the recommended replacement for the older os.system() and os.popen() functions, with a cleaner, more powerful API. The primary function you’ll use is subprocess.run(), introduced in Python 3.5, which covers the vast majority of use cases (the capture_output and text arguments used throughout this article require Python 3.7 or later). No third-party packages required: just import subprocess.

In this article we’ll cover the core subprocess.run() function, capturing standard output and error, handling return codes and exceptions, running commands with shell features, using Popen for advanced control, and building a real-world disk usage scanner. By the end, you’ll be able to integrate shell commands seamlessly into any Python script.

Running a Command: Quick Example

Let’s get a win immediately. Here’s how to run a simple command and capture its output in three lines of Python:

# quick_subprocess.py
import subprocess

result = subprocess.run(['echo', 'Hello from subprocess!'], capture_output=True, text=True)
print(result.stdout)
print('Return code:', result.returncode)

Output:

Hello from subprocess!

Return code: 0

We pass the command as a list of strings (the command and its arguments separately), set capture_output=True to capture stdout and stderr, and text=True to decode the output as a string instead of bytes. A return code of 0 means the command succeeded. We’ll dive deeper into each of these parameters below.

What Is subprocess and When Should You Use It?

The subprocess module is Python’s interface for spawning external processes — programs that run independently from your Python interpreter but whose output and status you can monitor and collect. Think of it as Python holding a terminal in one hand, running a command, and handing you the results.

Common use cases include: running shell utilities (ls, grep, curl), calling compiled binaries or other language tools, automating build pipelines, interacting with version control (git commands), and checking system status (disk usage, process lists, network info).

Approach | When to Use | Notes
subprocess.run() | Most use cases | Waits for process to finish, returns CompletedProcess
subprocess.Popen() | Need streaming I/O or async control | Low-level, non-blocking
os.system() | Legacy code only | Returns only an exit status, no output capture; avoid in new code
os.popen() | Legacy code only | Captures stdout only, limited error handling

In virtually all new code, use subprocess.run(). The older os.system() and os.popen() functions are still available, but os.system() cannot capture output and os.popen() gives you stdout with little control over stderr or error handling, so subprocess is the better choice in both cases.
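For the streaming case that subprocess.Popen() is meant for, a minimal sketch might look like the following. The child process here is the current Python interpreter printing three lines, an assumption made so the example runs on any platform; in real use it would be whatever long-running command you need to follow.

```python
# popen_stream.py (hypothetical file name)
import subprocess
import sys

# Assumed child command: the current interpreter, as a portable stand-in
child_code = "for i in range(3): print('tick', i, flush=True)"

lines = []
with subprocess.Popen(
    [sys.executable, '-c', child_code],
    stdout=subprocess.PIPE,
    text=True,
) as proc:
    for line in proc.stdout:      # yields each line as the child prints it
        lines.append(line.rstrip())
        print('got:', lines[-1])

# Leaving the with-block waits for the child, so returncode is set here
print('exit code:', proc.returncode)
```

Unlike subprocess.run(), this lets your script react to each line while the child is still running, which is useful for progress output or log tailing.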

Running system commands with subprocess
subprocess.run() is Python phoning a friend outside the interpreter.

Capturing stdout and stderr

Capturing command output is the most common subprocess task. You need to see what a command printed before your script can act on it — whether you’re parsing a list of files, checking a version number, or validating command output.

# capture_output.py
import subprocess

# Run 'python3 --version' and capture the output
result = subprocess.run(['python3', '--version'], capture_output=True, text=True)

print('stdout:', result.stdout.strip())
print('stderr:', result.stderr.strip())
print('returncode:', result.returncode)

Output:

stdout: Python 3.12.0
stderr:
returncode: 0

When capture_output=True, both stdout and stderr are captured as strings (with text=True) and stored in the result.stdout and result.stderr attributes. Without capture_output=True, output goes directly to the terminal and you can’t read it in Python.

Capturing Error Output

Some commands write their useful output to stderr (like ffmpeg, git, and many C tools). Always capture both streams:

# capture_stderr.py
import subprocess

# A command that writes to stderr -- 'ls' on a nonexistent directory
result = subprocess.run(
    ['ls', '/nonexistent/path'],
    capture_output=True,
    text=True
)

print('stdout:', repr(result.stdout))
print('stderr:', repr(result.stderr))
print('returncode:', result.returncode)

Output:

stdout: ''
stderr: "ls: cannot access '/nonexistent/path': No such file or directory\n"
returncode: 2

A non-zero return code (here, 2) signals that the command failed. The error message landed in stderr, not stdout — which is why capturing both is important.

Handling Errors and Return Codes

Checking whether a subprocess succeeded is essential for any production script. There are two patterns: checking returncode manually, or using check=True to raise an exception automatically on failure.

# handle_errors.py
import subprocess

# Pattern 1: Check return code manually
result = subprocess.run(['ls', '/tmp'], capture_output=True, text=True)
if result.returncode == 0:
    print('Files in /tmp:')
    print(result.stdout[:200])
else:
    print('Error:', result.stderr)

# Pattern 2: Raise exception on failure (CalledProcessError)
try:
    subprocess.run(['ls', '/nonexistent'], capture_output=True, text=True, check=True)
except subprocess.CalledProcessError as e:
    print(f'Command failed with code {e.returncode}: {e.stderr.strip()}')

Output:

Files in /tmp:
tmpXYZ123
tmpabc456

Command failed with code 2: ls: cannot access '/nonexistent': No such file or directory

Use check=True when failure should abort your script. Use manual return code checking when you want to handle errors gracefully and keep running. The CalledProcessError exception carries .returncode, .stdout, and .stderr for full context.

Handling subprocess return codes
Return codes are the only honest answer your subprocess gives you.

Running Shell Commands with shell=True

Sometimes you need shell features: pipes (|), redirects (>), glob expansion (*.txt), or environment variable substitution. Passing a command string with shell=True runs it through the system shell, giving you access to all of these.

# shell_true.py
import subprocess

# Using shell=True to pipe commands together
result = subprocess.run(
    'echo "line1\nline2\nline3" | wc -l',
    shell=True,
    capture_output=True,
    text=True
)
print('Line count:', result.stdout.strip())

# Using shell=True for glob expansion
result2 = subprocess.run(
    'ls /tmp/*.log 2>/dev/null | head -5',
    shell=True,
    capture_output=True,
    text=True
)
print('Log files:', result2.stdout.strip() or 'None found')

Output:

Line count: 3
Log files: None found

When using shell=True, pass the command as a single string rather than a list. This is convenient for complex shell pipelines but carries a security risk: if any part of the string comes from user input, an attacker could inject malicious commands. For production code, always use the list form when possible and reserve shell=True for controlled, developer-written strings.
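The injection risk, and why the list form avoids it, can be shown with a short sketch. The user_input value is a hypothetical piece of untrusted text, and the child process is the current Python interpreter echoing its first argument.

```python
# injection_demo.py (hypothetical file name)
import shlex
import subprocess
import sys

# Hypothetical untrusted input containing a shell metacharacter
user_input = 'report.txt; echo INJECTED'

# List form: the whole string reaches the child as one literal argument,
# so the ';' is never interpreted by a shell
safe = subprocess.run(
    [sys.executable, '-c', 'import sys; print(sys.argv[1])', user_input],
    capture_output=True, text=True, check=True,
)
print(safe.stdout.strip())

# If a shell string is unavoidable on POSIX, quote each untrusted piece
print(shlex.quote(user_input))
```

Had the same text been interpolated into a shell=True command string, the shell would have treated everything after the semicolon as a second command. shlex.quote() targets POSIX shells only, so the list form remains the more portable default.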

Setting Timeouts
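
subprocess.run() accepts a timeout argument in seconds. If the child is still running when the timeout expires, run() kills it and raises subprocess.TimeoutExpired. A minimal sketch, using the current Python interpreter sleeping for five seconds as an assumed stand-in for a slow command:

```python
# timeout_demo.py (hypothetical file name)
import subprocess
import sys

try:
    subprocess.run(
        [sys.executable, '-c', 'import time; time.sleep(5)'],
        capture_output=True, text=True, timeout=1,
    )
    timed_out = False
except subprocess.TimeoutExpired as e:
    timed_out = True
    print(f'Gave up after {e.timeout} seconds')
```

Setting a timeout on any command that touches the network or waits on external resources keeps a hung child from stalling the whole script.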

How To Build CLI Apps with Python Click


Intermediate

Every serious Python developer eventually needs to build a command-line interface. Whether it is a deployment tool, a data processing script, or a developer utility, a well-designed CLI makes the difference between a tool your team actually uses and one that sits forgotten. Python’s standard argparse module works, but it is verbose — you write 20 lines of setup code before you handle your first argument. Click is the modern alternative: decorator-based, expressive, and composable, it cuts that boilerplate in half and adds features argparse simply does not have.

Click was created by the team behind Flask and follows the same philosophy: explicit is better than implicit, but explicit does not have to be painful. You decorate a Python function with @click.command() and @click.option(), and Click handles argument parsing, help text, type conversion, validation, and error messages automatically. Install it with pip install click.

This article covers everything you need to build production-quality CLI tools with Click: basic commands and options, arguments, type validation, prompts, multi-command groups (subcommands), progress bars, and output formatting. By the end, we will build a complete file processing CLI that demonstrates these features working together.

Click Quick Example

Here is a complete Click CLI that greets a user, with an optional count parameter:

# quick_click.py
import click

@click.command()
@click.option('--name', default='World', help='Who to greet.')
@click.option('--count', default=1, type=int, help='Number of greetings.')
@click.option('--loud', is_flag=True, help='Use uppercase.')
def greet(name, count, loud):
    """A friendly greeting command."""
    for _ in range(count):
        message = f"Hello, {name}!"
        if loud:
            message = message.upper()
        click.echo(message)

if __name__ == '__main__':
    greet()

Run it from the terminal:

$ python quick_click.py --name Alice --count 3
Hello, Alice!
Hello, Alice!
Hello, Alice!

$ python quick_click.py --name Bob --loud
HELLO, BOB!

$ python quick_click.py --help
Usage: quick_click.py [OPTIONS]

  A friendly greeting command.

Options:
  --name TEXT      Who to greet.
  --count INTEGER  Number of greetings.
  --loud           Use uppercase.
  --help           Show this message and exit.

Click generated a complete help page automatically from the function’s docstring and decorator metadata. The --help flag, type validation, and default values all come for free.

Options vs Arguments

Click distinguishes between two kinds of inputs: options (named flags like --name Alice) and arguments (positional inputs like a filename). Options are optional by default; arguments are required by default.

Feature | Option (@click.option) | Argument (@click.argument)
Syntax | --flag value | Positional: cmd value
Required | Optional by default | Required by default
Help text | Shown in --help | Shown in usage line
Best for | Configuration, flags | Primary inputs (files, names)

# options_arguments.py
import click

@click.command()
@click.argument('filename')                        # Required positional arg
@click.option('--output', '-o', default='-',       # -o is a short alias
              help='Output file (default: stdout)')
@click.option('--lines', '-n', default=10,
              type=int, help='Number of lines to show.')
@click.option('--verbose', '-v', is_flag=True,
              help='Show extra information.')
def head(filename, output, lines, verbose):
    """Show the first N lines of FILENAME."""
    if verbose:
        click.echo(f"Reading {filename}, showing {lines} lines")
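    try:
        # click.open_file treats '-' as stdout, so the default needs no special case
        with open(filename) as f, click.open_file(output, 'w') as out:
            for i, line in enumerate(f):
                if i >= lines:
                    break
                out.write(line)
    except FileNotFoundError:
        click.echo(f"Error: {filename} not found", err=True)
        raise SystemExit(1)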

if __name__ == '__main__':
    head()

Run it as python options_arguments.py myfile.txt --lines 5 --verbose. The -o short alias for --output is defined right in the option decorator. Click handles both -o file.txt and --output file.txt automatically.

Building Click commands
Click turns your functions into CLI commands with one decorator.

Types and Validation

Click converts option and argument values to the specified Python type and shows a helpful error if the conversion fails. Beyond basic types, Click has specialized types like click.Path for file paths and click.Choice for enumerated values.

# types_demo.py
import click

@click.command()
@click.argument('input_file', type=click.Path(exists=True, readable=True))
@click.option('--format', 'output_format',
              type=click.Choice(['json', 'csv', 'text'], case_sensitive=False),
              default='text', help='Output format.')
@click.option('--max-size', type=click.IntRange(1, 1000),
              default=100, help='Max size (1-1000).')
@click.option('--scale', type=float, help='Scaling factor.')
def process(input_file, output_format, max_size, scale):
    """Process INPUT_FILE with validation."""
    click.echo(f"Processing: {input_file}")
    click.echo(f"Format: {output_format}")
    click.echo(f"Max size: {max_size}")
    if scale is not None:
        click.echo(f"Scale: {scale}")

if __name__ == '__main__':
    process()

When you pass an invalid value, Click provides a clear error message:

$ python types_demo.py myfile.txt --format xml
Error: Invalid value for '--format': 'xml' is not one of 'json', 'csv', 'text'.

$ python types_demo.py nonexistent.txt
Error: Invalid value for 'INPUT_FILE': Path 'nonexistent.txt' does not exist.

click.Path(exists=True) validates the file exists before your function even runs. click.IntRange(1, 1000) ensures the integer is within bounds. These validations happen automatically and produce user-friendly error messages — no manual error handling needed.

Interactive Prompts and Confirmation

For destructive operations, you often want to confirm with the user. Click provides @click.confirmation_option(), @click.password_option(), and click.prompt() for interactive input collection.

# prompts_demo.py
import click

@click.command()
@click.option('--username', prompt='Username',
              help='Your username.')
@click.option('--password', prompt=True,
              hide_input=True, confirmation_prompt=True,
              help='Your password.')
@click.option('--database', prompt='Database name',
              default='mydb', show_default=True)
def setup_connection(username, password, database):
    """Set up a database connection."""
    click.echo(f"Connecting to {database} as {username}...")
    click.echo(f"Password length: {len(password)} chars")
    # In a real app, you'd use these to create a connection
    click.echo("Connection configured successfully!")

@click.command()
@click.argument('filename')
@click.confirmation_option(prompt='Are you sure you want to delete this file?')
def delete_file(filename):
    """Permanently delete FILENAME."""
    import os
    try:
        os.remove(filename)
        click.echo(f"Deleted: {filename}")
    except FileNotFoundError:
        click.echo(f"File not found: {filename}", err=True)

if __name__ == '__main__':
    setup_connection()

Run python prompts_demo.py and Click interactively prompts for each option declared with prompt. The password is hidden during input (no echo to terminal) and must be entered twice for confirmation. The @click.confirmation_option decorator adds a yes/no prompt before the command body runs, and it also adds a --yes flag that skips the prompt in automated scripts.

Advanced Click features
Nested commands give your CLI the depth of a real tool.

Multi-Command Groups (Subcommands)

Real CLI tools like git and docker use subcommands: git commit, git push, docker build, docker run. Click’s @click.group() decorator creates this structure cleanly. Each subcommand is just another decorated function.

# groups_demo.py
import click

@click.group()
@click.option('--debug/--no-debug', default=False,
              help='Enable debug output.')
@click.pass_context
def cli(ctx, debug):
    """Project management tool."""
    ctx.ensure_object(dict)
    ctx.obj['DEBUG'] = debug

@cli.command()
@click.argument('name')
@click.option('--template', default='basic',
              type=click.Choice(['basic', 'flask', 'fastapi']),
              help='Project template.')
@click.pass_context
def create(ctx, name, template):
    """Create a new project."""
    if ctx.obj['DEBUG']:
        click.echo(f"[DEBUG] Creating {name} with template {template}")
    click.echo(f"Creating project '{name}'...")
    click.echo(f"Template: {template}")
    click.echo(f"Done! Run: cd {name} && python main.py")

@cli.command()
@click.argument('name')
@click.pass_context
def delete(ctx, name):
    """Delete a project."""
    if ctx.obj['DEBUG']:
        click.echo(f"[DEBUG] Deleting {name}")
    click.confirm(f"Delete project '{name}'? This cannot be undone.", abort=True)
    click.echo(f"Project '{name}' deleted.")

# Register the command under the name 'list'; the function keeps a
# different name so it does not shadow Python's built-in list
@cli.command(name='list')
@click.pass_context
def list_projects(ctx):
    """List all projects."""
    click.echo("Projects:")
    for project in ['api-service', 'data-pipeline', 'dashboard']:
        click.echo(f"  - {project}")

if __name__ == '__main__':
    cli()

Run it as:

$ python groups_demo.py --help
Usage: groups_demo.py [OPTIONS] COMMAND [ARGS]...

  Project management tool.

Options:
  --debug / --no-debug  Enable debug output.
  --help                Show this message and exit.

Commands:
  create  Create a new project.
  delete  Delete a project.
  list    List all projects.

$ python groups_demo.py create myapp --template flask
Creating project 'myapp'...
Template: flask
Done! Run: cd myapp && python main.py

$ python groups_demo.py --debug create myapp
[DEBUG] Creating myapp with template basic
Creating project 'myapp'...
Template: basic
Done! Run: cd myapp && python main.py

The @click.pass_context decorator passes a shared context object to the group and to each subcommand that requests it. The --debug flag is defined at the group level and handed down through ctx.obj, which is the Click pattern for global flags that affect all subcommands.

Real-Life Example: A File Processing CLI

Here is a complete, practical CLI tool for processing text files: counting lines, words, and characters, and searching for patterns with highlighted matches.

# filetools.py
import click
import re
from pathlib import Path

@click.group()
def cli():
    """File processing toolkit."""

@cli.command()
@click.argument('files', nargs=-1, type=click.Path(exists=True), required=True)
@click.option('--words/--no-words', default=True, help='Count words.')
@click.option('--lines/--no-lines', default=True, help='Count lines.')
@click.option('--chars/--no-chars', default=False, help='Count characters.')
def count(files, words, lines, chars):
    """Count words/lines/chars in FILES."""
    # Only report the metrics whose flags are enabled
    enabled = [key for key, on in
               (('lines', lines), ('words', words), ('chars', chars)) if on]
    totals = dict.fromkeys(('lines', 'words', 'chars'), 0)
    for filepath in files:
        content = Path(filepath).read_text()
        counts = {
            'lines': content.count('\n'),
            'words': len(content.split()),
            'chars': len(content),
        }
        for key, value in counts.items():
            totals[key] += value
        parts = [f"{counts[key]:>8} {key}" for key in enabled]
        click.echo(f"{'  '.join(parts)}  {filepath}")
    if len(files) > 1:
        click.echo('-' * 40)
        parts = [f"{totals[key]:>8} {key}" for key in enabled]
        click.echo(f"{'  '.join(parts)}  total")

@cli.command()
@click.argument('pattern')
@click.argument('files', nargs=-1, type=click.Path(exists=True), required=True)
@click.option('--ignore-case', '-i', is_flag=True, help='Case-insensitive.')
@click.option('--count-only', '-c', is_flag=True, help='Print match count only.')
def search(pattern, files, ignore_case, count_only):
    """Search for PATTERN in FILES."""
    flags = re.IGNORECASE if ignore_case else 0
    for filepath in files:
        content = Path(filepath).read_text()
        matches = [(i+1, line) for i, line in enumerate(content.splitlines())
                   if re.search(pattern, line, flags)]
        if count_only:
            click.echo(f"{len(matches):>5}  {filepath}")
        else:
            for lineno, line in matches:
                click.secho(f"{filepath}:{lineno}: ", nl=False, fg='cyan')
                # Highlight the match in yellow
                highlighted = re.sub(pattern,
                    lambda m: click.style(m.group(), fg='yellow', bold=True),
                    line, flags=flags)
                click.echo(highlighted)

if __name__ == '__main__':
    cli()

Run as:

$ python filetools.py count README.md
      45 lines      312 words  README.md

$ python filetools.py search "import" *.py --ignore-case
filetools.py:2: import click
filetools.py:3: import re
filetools.py:4: from pathlib import Path

The nargs=-1 pattern on FILES accepts any number of file arguments, like the Unix convention. click.secho() combines echo with styled output (colors). The --ignore-case short alias -i matches grep’s convention, making the tool feel natural to Unix users.

@click.command(). One decorator. Full CLI. Done.

Frequently Asked Questions

When should I use Click instead of argparse?

Use Click for new CLI tools — it is less verbose and more composable. argparse is already in the standard library and requires no installation, so it is better for simple scripts that need zero dependencies. Click shines for multi-command CLIs with many options, complex validation, interactive prompts, and colored output. If you are building something beyond a simple script, Click’s developer experience wins decisively.

How does Click compare to Typer?

Typer is built on top of Click and generates Click CLI definitions from Python function type hints. If you use type annotations throughout your code, Typer reduces Click boilerplate further — you get options and arguments from type hints with no decorators. The trade-off: Typer adds a dependency and is less flexible than Click for complex CLI patterns. Click is more explicit; Typer is more magic. Both are excellent choices.

How do I test Click commands?

Click provides a CliRunner for testing. Use from click.testing import CliRunner; runner = CliRunner(); result = runner.invoke(my_command, ['--option', 'value']). The result object has exit_code, output, and exception attributes. This lets you test CLI behavior in pytest without spawning a subprocess, and it works with input prompts by passing input='yes\n' to invoke().
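A sketch of what such tests might look like is shown below. The hello command is a hypothetical example defined only for this illustration; the test functions are written in pytest style but are also called directly so the file runs on its own. It assumes Click is installed.

```python
# test_cli_demo.py (hypothetical file name)
import click
from click.testing import CliRunner

@click.command()
@click.option('--name', default='World', help='Who to greet.')
def hello(name):
    """Hypothetical command used only for this test sketch."""
    click.echo(f'Hello, {name}!')

def test_hello_default():
    result = CliRunner().invoke(hello, [])
    assert result.exit_code == 0
    assert result.output == 'Hello, World!\n'

def test_hello_with_name():
    result = CliRunner().invoke(hello, ['--name', 'Alice'])
    assert result.exit_code == 0
    assert 'Hello, Alice!' in result.output

# pytest would discover these automatically; call them directly when run as a script
test_hello_default()
test_hello_with_name()
```

Because invoke() runs the command in-process, a failing command shows up as a non-zero exit_code and a populated exception attribute rather than a crashed test run.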

Can Click read options from environment variables?

Yes. Pass auto_envvar_prefix='MYAPP' when you invoke the top-level command, for example cli(auto_envvar_prefix='MYAPP'), and Click automatically reads MYAPP_OPTION_NAME from the environment for any option not provided on the command line (for options on subcommands, the variable name also includes the subcommand name). You can also set it per-option: @click.option('--api-key', envvar='API_KEY'). This is the standard pattern for 12-factor applications where configuration comes from the environment.
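The per-option form can be exercised without touching your real environment by using CliRunner's env argument. This is a minimal sketch: the show_key command, the API_KEY variable name, and the values are all assumptions for illustration, and it requires Click to be installed.

```python
# envvar_demo.py (hypothetical file name)
import click
from click.testing import CliRunner

@click.command()
@click.option('--api-key', envvar='API_KEY', default='unset')
def show_key(api_key):
    """Hypothetical command that echoes the resolved option value."""
    click.echo(f'key={api_key}')

runner = CliRunner()

# No flag given: the value comes from the environment variable
from_env = runner.invoke(show_key, [], env={'API_KEY': 'abc123'})

# Flag given: the command line takes precedence over the environment
from_flag = runner.invoke(show_key, ['--api-key', 'cli-value'],
                          env={'API_KEY': 'abc123'})

print(from_env.output, from_flag.output)
```

The precedence shown here (command line first, then environment, then default) is what makes the pattern convenient for deployments: defaults live in code, overrides live in the environment, and one-off changes go on the command line.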

How do I package a Click app as a proper CLI command?

Add a [project.scripts] table to your pyproject.toml with an entry such as mytool = "mypackage.cli:main". After pip install -e ., running mytool in the terminal invokes your Click function directly. This is the standard way to distribute CLI tools on PyPI: users install your package and the command becomes available on the PATH of that environment.

Conclusion

We covered the full Click toolkit: defining commands with @click.command(), options with @click.option(), arguments with @click.argument(), type validation with click.Path and click.Choice, interactive prompts, multi-command groups with shared context using @click.pass_context, and colored output with click.secho(). The file processing CLI showed how to compose these features into a tool that feels like a native Unix command.

From here, explore Click’s progress bar support (click.progressbar()), file path handling with lazy file opening, and the CliRunner for testing. Click’s plugin system also allows distributing CLI extensions as separate packages — the same pattern used by Flask extensions.

Official documentation: click.palletsprojects.com

Click Basics

Click turns Python functions into CLIs via decorators. The simplest CLI is two decorators on a function:

# pip install click

import click

@click.command()
@click.option("--name", default="World", help="Whom to greet")
@click.option("--count", default=1, type=int, help="Number of greetings")
def greet(name, count):
    """Print a friendly greeting."""
    for _ in range(count):
        click.echo(f"Hello, {name}!")

if __name__ == "__main__":
    greet()

Save as hello.py and run python hello.py --name Alice --count 3. Click generates --help automatically from the docstring and option help text.

Commands and Subcommands

For multi-command CLIs (think git commit, git push), use a group:

import click

@click.group()
@click.option("--verbose", is_flag=True)
@click.pass_context
def cli(ctx, verbose):
    ctx.ensure_object(dict)
    ctx.obj["verbose"] = verbose

@cli.command()
@click.argument("path")
@click.pass_context
def status(ctx, path):
    """Show repo status."""
    click.echo(f"Status of {path}")

@cli.command()
@click.argument("message")
@click.pass_context
def commit(ctx, message):
    """Make a commit."""
    click.echo(f"Committing: {message}")

if __name__ == "__main__":
    cli(obj={})

Run: python tool.py --verbose status . or python tool.py commit "fix bug".

Type Conversion and Validation

Click’s types validate and convert arguments before they hit your function:

import click

@click.command()
@click.option("--port", type=click.IntRange(1, 65535), default=8080)
@click.option("--path", type=click.Path(exists=True, dir_okay=False))
@click.option("--format", type=click.Choice(["json", "yaml", "toml"]))
@click.option("--threshold", type=click.FloatRange(0, 1))
@click.option("--cert", type=click.File("rb"))
@click.option("--config-dir", type=click.Path(file_okay=False, writable=True))
def serve(port, path, format, threshold, cert, config_dir):
    ...

Each type rejects bad input at the CLI boundary with a clean error message — saves you from writing validation by hand.
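To see the rejection in action, here is a small sketch that trims the command above down to two options and invokes it with good and bad input through CliRunner. The parameter name fmt is an assumption made here to avoid shadowing the built-in format.

```python
import click
from click.testing import CliRunner


@click.command()
@click.option("--port", type=click.IntRange(1, 65535), default=8080)
@click.option("--format", "fmt", type=click.Choice(["json", "yaml", "toml"]), default="json")
def serve(port, fmt):
    # By the time this body runs, both values have been validated and converted.
    click.echo(f"port={port} format={fmt}")


runner = CliRunner()
ok = runner.invoke(serve, ["--port", "9000", "--format", "yaml"])
bad_port = runner.invoke(serve, ["--port", "70000"])    # outside IntRange
bad_format = runner.invoke(serve, ["--format", "xml"])  # not in Choice

print(ok.exit_code, bad_port.exit_code, bad_format.exit_code)
```

Click reports usage errors such as these with exit status 2, so the function body never runs for the two bad invocations.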

Confirmation and Prompts

For dangerous commands, prompt for confirmation. For missing values, prompt interactively:

@click.command()
@click.argument("project")
@click.confirmation_option(prompt="Are you sure you want to delete?")
def delete(project):
    click.echo(f"Deleted {project}")

@click.command()
@click.option("--username", prompt="Username")
@click.option("--password", prompt="Password", hide_input=True, confirmation_prompt=True)
def login(username, password):
    click.echo(f"Logging in as {username}")

Output: echo, secho, progressbar

Use click.echo instead of print(): it handles platform encoding correctly, strips ANSI styling when the output is not a terminal, and can write to stderr with err=True:

click.echo("Plain message")
click.secho("Success!", fg="green", bold=True)
click.secho("Error!", fg="red", err=True)   # to stderr

# Built-in progress bar
import time
with click.progressbar(range(1000), label="Processing") as items:
    for i in items:
        time.sleep(0.001)

# Pager output (like `git log`)
click.echo_via_pager("\n".join(f"Line {i}" for i in range(1000)))

Distributing as a Real Command

To make mytool available system-wide (not python tool.py), use the entry-points pattern in pyproject.toml:

# pyproject.toml
[project]
name = "mytool"
version = "0.1.0"
dependencies = ["click"]

[project.scripts]
mytool = "mytool.cli:cli"

# Then: pip install -e .
# Now: mytool status . (anywhere on the system)

Common Pitfalls

  • Mixing argument and option names. Options start with -- (or -x for shortcuts). Arguments don’t. Get this wrong and Click complains at decoration time.
  • Forgetting @click.pass_context. If a subcommand needs the parent group’s settings, decorate it with @click.pass_context and accept ctx as the first parameter.
  • Calling sys.exit() inside commands. Raise click.Abort() or call ctx.exit(code) instead; both go through Click’s own exit handling.
  • Testing CLIs manually. Use click.testing.CliRunner instead — runs the command in-process and gives you a result object with output and exit code.
  • Long help text. Docstrings get truncated. For long help, use @click.command(help="""...""") instead.
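The exit-handling and testing pitfalls above can be illustrated together. This is a sketch under assumptions: the deploy command, its target names, and the exit status 3 are all invented for the example.

```python
import click
from click.testing import CliRunner


@click.command()
@click.argument("target")
@click.pass_context
def deploy(ctx, target):
    """Hypothetical command that signals failure through exit codes."""
    if target == "prod":
        # Abort makes Click print "Aborted!" and exit with status 1.
        raise click.Abort()
    if target == "unknown":
        click.echo("unknown target", err=True)
        ctx.exit(3)  # custom non-zero status chosen for this sketch
    click.echo(f"deployed to {target}")


# CliRunner runs the command in-process and captures output and exit code.
runner = CliRunner()
ok = runner.invoke(deploy, ["staging"])
unknown = runner.invoke(deploy, ["unknown"])
aborted = runner.invoke(deploy, ["prod"])
print(ok.exit_code, unknown.exit_code, aborted.exit_code)
```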

FAQ

Q: Click, Typer, or argparse?
A: Typer if you love type hints (it’s Click with Pydantic-style annotations). Click for the largest ecosystem and proven track record. argparse for stdlib-only constraints.

Q: How do I add tab completion?
A: Click ships with completion generators for bash, zsh, and fish. For zsh, add eval "$(_MYTOOL_COMPLETE=zsh_source mytool)" to your shell init file (substitute bash_source for bash).

Q: How do I write tests for a Click CLI?
A: from click.testing import CliRunner; result = CliRunner().invoke(cli, ["arg1", "--opt", "value"]). The result has output, exit_code, and exception for assertions.

Q: Can a CLI command return a value?
A: Click commands return None — they’re CLI entry points, not function returns. To pass data between subcommands, store it in ctx.obj.

Q: How do I make a command with optional flags AND positional arguments?
A: @click.argument("file", required=False, default=None) for an optional positional argument. Stack with @click.option for flags.

Wrapping Up

Click is the canonical Python CLI framework — battle-tested, well-documented, and friendly to both small scripts and complex multi-command tools. Start with @click.command + @click.option + @click.argument; graduate to groups when you need subcommands. Pair with entry-points in pyproject.toml to ship real shell commands. For type-hint enthusiasts, Typer wraps Click in a more modern API; the underlying ideas are identical.

How To Format Python Code with Black and isort

Beginner

Code reviews should be about logic, architecture, and correctness — not about whether you put a space before a colon or how you sorted your imports. But without automated formatting, every team spends time on style debates, inconsistent diffs pollute git history, and onboarding new developers is a friction-filled process. Black and isort are the two tools that eliminate this problem entirely: Black reformats your Python code in one consistent opinionated style, and isort keeps your imports sorted and organized. Combined, they handle the vast majority of Python style decisions automatically.

Black calls itself “the uncompromising code formatter.” It has almost no configuration options by design — the goal is for every Black-formatted project to look the same, so developers can read any Python project without adjusting to a new style. isort sorts imports alphabetically within their sections (standard library, third-party, local), keeping them clean and diff-friendly. Install both with pip install black isort.

In this article, we will cover: how to use Black to format Python files from the command line, how to configure Black’s line length and target Python version, how isort organizes imports, how to combine Black and isort without conflicts, how to run both as pre-commit hooks so formatting is automatic on every commit, and how to integrate them into CI. By the end, you will have a fully automated formatting pipeline that requires zero style decisions from your team.

Black Quick Example

Here is some unformatted Python code and what Black does to it:

# unformatted.py (before Black)
import sys,os
def   calculate(x,y,   z):
    result=x+y*z
    if result>100: return True
    else:
      return    False

my_list=[1,2,   3,4,
  5,6]
d={'key1':'value1','key2':   'value2','key3':'value3'}

Run black unformatted.py and the file becomes:

# unformatted.py (after Black)
import sys, os


def calculate(x, y, z):
    result = x + y * z
    if result > 100:
        return True
    else:
        return False


my_list = [1, 2, 3, 4, 5, 6]
d = {"key1": "value1", "key2": "value2", "key3": "value3"}

Black added consistent spacing, split the one-line if statement onto two lines, joined the list onto a single line (it fits within the line length and has no trailing comma), kept the dictionary on one line, and switched the strings to double quotes. Notice that it added a space in import sys, os but did not split or sort the imports; that is isort’s job.

What Is Black and What Does It Enforce?

Black is a PEP 8-compliant code formatter that enforces a specific subset of style choices. Unlike flake8 (which only reports style violations), Black actually rewrites your code. It makes these decisions for you:

Style Aspect | Black’s Choice | Reason
Quotes | Double quotes always | Consistency; avoids escaping
Trailing commas | Added in multi-line structures | Cleaner git diffs
Line length | 88 characters (configurable) | Slightly more than PEP 8’s 79
Blank lines | 2 between top-level, 1 between methods | PEP 8 standard
Magic trailing comma | Respects it; keeps multi-line if comma present | Developer intent preserved

The key insight is that Black removes the decision-making burden from developers. You do not debate whether to use single or double quotes — Black uses double. You do not argue about line wrapping — Black wraps at 88 characters. Once your team agrees to use Black, style debates disappear from code reviews.

Using Black on the Command Line

Black’s command-line interface is straightforward. You can format a single file, a directory, or use --check to preview what would change without modifying files.

# Install Black
# pip install black

# Format a single file (modifies in place)
# black my_script.py

# Format an entire directory
# black src/

# Check what would change (exit 1 if changes needed)
# black --check src/

# Show the diff without applying changes
# black --diff my_script.py

# Format with a custom line length (79 for strict PEP 8)
# black --line-length 79 src/

# Target a specific Python version
# black --target-version py311 src/

The --check flag is what you use in CI pipelines — it returns exit code 1 if any files need reformatting, which fails the CI build. This forces developers to run Black locally before pushing. The --diff flag shows exactly what would change, which is useful for understanding Black’s decisions.

# pyproject.toml -- configure Black project-wide
[tool.black]
line-length = 88
target-version = ['py311']
include = '\.pyi?$'
exclude = '''
/(
    \.git
  | \.venv
  | build
  | dist
)/
'''

Put pyproject.toml in your project root and Black will use these settings automatically. The target-version setting tells Black which Python syntax features are available — this affects magic trailing comma behavior and some string formatting decisions.

Using isort to Organize Imports

isort sorts Python imports into three sections separated by blank lines: standard library imports, third-party imports, and local imports. Within each section, it sorts alphabetically. This matches PEP 8 and makes diffs clean — changing one import only affects one line.

# messy_imports.py (before isort)
import json
from flask import Flask, request
import os
from myapp.models import User
import sys
from datetime import datetime
import requests
from myapp.utils import format_date

After running isort messy_imports.py:

# messy_imports.py (after isort)
import json
import os
import sys
from datetime import datetime

import requests
from flask import Flask, request

from myapp.models import User
from myapp.utils import format_date

Standard library imports (json, os, sys, datetime) come first. Third-party imports (requests, flask) come second. Local imports (myapp.*) come last. Within each section, plain import statements come before from ... import statements, and each group is sorted alphabetically, which is why import requests appears above from flask import. The blank lines between sections are isort’s signature: they make the import structure visually clear.

Making Black and isort Work Together

Black and isort can conflict: Black sometimes reformats lines that isort just organized. The fix is to tell isort to use Black-compatible settings. isort has a built-in --profile black option that makes them cooperate perfectly.

# pyproject.toml -- configure isort for Black compatibility
[tool.isort]
profile = "black"
line_length = 88

With this configuration, isort will format multi-line imports the same way Black would. Run them in order — isort first, then Black — or use a pre-commit hook that runs both:

# Run both tools on the src/ directory
# isort src/
# black src/

# Verify both are satisfied (for CI)
# isort --check-only src/ && black --check src/

Automating with Pre-Commit Hooks

The most effective way to use Black and isort is as pre-commit hooks — they run automatically every time you commit code, so formatting is never forgotten. The pre-commit framework makes this easy.

# Install pre-commit
# pip install pre-commit

# .pre-commit-config.yaml -- put this in your project root
repos:
  - repo: https://github.com/psf/black
    rev: 24.3.0
    hooks:
      - id: black
        language_version: python3.11

  - repo: https://github.com/PyCQA/isort
    rev: 5.13.2
    hooks:
      - id: isort
        args: ["--profile", "black"]

  - repo: https://github.com/PyCQA/flake8
    rev: 7.0.0
    hooks:
      - id: flake8

# Initialize pre-commit (run once after creating the config)
# pre-commit install

# Run manually on all files
# pre-commit run --all-files

After pre-commit install, every git commit automatically runs Black, isort, and flake8 on the staged files. If any formatting changes are needed, the commit is blocked and the files are auto-fixed — you just git add the changes and commit again. This means formatting violations never reach the repository.

Real-Life Example: Setting Up a Full Python Project

Here is a complete project setup script that installs Black, isort, and pre-commit and configures them to work together:

# project_setup.py
"""
Script to set up Black + isort + pre-commit for a Python project.
Run from your project root directory.
"""
import subprocess
import sys
from pathlib import Path

def run(cmd):
    """Run a shell command and print output."""
    print(f"Running: {' '.join(cmd)}")
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.stdout:
        print(result.stdout)
    if result.returncode != 0:
        print(f"ERROR: {result.stderr}")
    return result.returncode == 0

# Install tools
run([sys.executable, '-m', 'pip', 'install', 'black', 'isort', 'pre-commit', '--quiet'])

# Write pyproject.toml config
pyproject = Path('pyproject.toml')
if not pyproject.exists():
    pyproject.write_text('''[tool.black]
line-length = 88
target-version = ['py311']

[tool.isort]
profile = "black"
line_length = 88
''')
    print("Created pyproject.toml")

# Write pre-commit config
precommit = Path('.pre-commit-config.yaml')
precommit.write_text('''repos:
  - repo: https://github.com/psf/black
    rev: 24.3.0
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/isort
    rev: 5.13.2
    hooks:
      - id: isort
        args: ["--profile", "black"]
''')
print("Created .pre-commit-config.yaml")

# Install pre-commit hooks
if Path('.git').exists():
    run(['pre-commit', 'install'])
    print("pre-commit hooks installed!")
else:
    print("Not a git repo -- skipping pre-commit install")
    print("Run 'git init' first, then 'pre-commit install'")

print("\nSetup complete! Run 'pre-commit run --all-files' to format existing code.")

Output:

Running: /usr/bin/python3 -m pip install black isort pre-commit --quiet
Created pyproject.toml
Created .pre-commit-config.yaml
Running: pre-commit install
pre-commit installed at .git/hooks/pre-commit

pre-commit hooks installed!

Setup complete! Run 'pre-commit run --all-files' to format existing code.

This script sets up the entire formatting pipeline in under a minute. Once it runs, every commit in the project will be automatically formatted by Black and isort with no developer effort. You can extend the pre-commit config to add mypy for type checking or bandit for security scanning.

Frequently Asked Questions

How is Black different from autopep8?

autopep8 fixes PEP 8 violations but tries to be minimally invasive — it only changes what it must. Black is more opinionated and makes more sweeping changes to ensure a consistent style everywhere. Black produces deterministic output (running it twice gives the same result), while autopep8 may not. Most teams prefer Black because it eliminates more style decisions and produces cleaner diffs.

How do I introduce Black to a large existing codebase?

Run black . --diff first to see the scope. Then run black . in a single dedicated commit with the message “Apply Black formatting”, which isolates all formatting changes from logic changes. Exclude this commit from git blame with git blame --ignore-rev <commit-hash>, or add the hash to a .git-blame-ignore-revs file and run git config blame.ignoreRevsFile .git-blame-ignore-revs so that every blame skips it. From that point on, each new commit only reformats the lines it actually changes.

Can I skip Black formatting for specific code?

Yes. Wrap code with # fmt: off and # fmt: on to disable Black for a specific block. This is useful for manually aligned code like lookup tables or matrix definitions where Black’s formatting would hurt readability. Use it sparingly — the value of Black comes from it being consistently applied everywhere.
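As an illustration, here is a hand-aligned matrix of the kind where the markers help. The TRANSFORM table and translate function are hypothetical; without the markers, Black would remove the padding spaces that line up the columns.

```python
# fmt: off
TRANSFORM = [
    [  1.0,   0.0, -12.5],
    [  0.0,   1.0,   3.0],
    [  0.0,   0.0,   1.0],
]
# fmt: on


def translate(x, y):
    """Apply the 2D affine transform above to a point."""
    row_x, row_y, _ = TRANSFORM
    return (
        row_x[0] * x + row_x[1] * y + row_x[2],
        row_y[0] * x + row_y[1] * y + row_y[2],
    )


print(translate(10.0, 2.0))
```

Everything outside the fmt: off / fmt: on pair, including the function, is still formatted normally.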

How do I fail CI if code is not formatted?

Run black --check . and isort --check-only . in your CI pipeline. Both commands return exit code 1 if any files need formatting, which fails the CI build. In GitHub Actions, add a formatting job that runs before your tests. When it fails, developers must run Black and isort locally before the CI passes.

How do I set up VS Code to auto-format with Black?

Install the Black Formatter extension from the VS Code marketplace. Then in your settings.json: set "editor.formatOnSave": true, "[python]": { "editor.defaultFormatter": "ms-python.black-formatter" }, and install the isort extension for import sorting. Now every time you save a Python file, Black and isort run automatically.

Conclusion

We covered the complete Black and isort workflow: formatting files with black and isort on the command line, configuring both tools via pyproject.toml, using --profile black to make isort Black-compatible, automating everything with pre-commit hooks, and checking formatting in CI with --check flags. The project setup script ties it all together into a one-command installation.

The cumulative benefit of Black and isort is significant — teams report that code review time drops noticeably when formatting is no longer a discussion topic. Developers spend more mental energy on logic and less on whitespace. New contributors can get up to speed faster because the formatting standards are enforced automatically rather than documented in a style guide.

Official documentation: black.readthedocs.io and pycqa.github.io/isort

How To Use Hypothesis for Property-Based Testing in Python

Intermediate

You wrote unit tests. You covered the happy path. You even tested a few edge cases — empty strings, zero values, negative numbers. But then production breaks on an input you never imagined: a Unicode string with a zero-width space, a list with 2 billion elements, or a float that is technically not-a-number. Property-based testing is the approach that finds these bugs before your users do. Instead of you specifying test inputs, the library generates hundreds of random inputs automatically and searches for ones that break your code.

Hypothesis is Python’s leading property-based testing library. You describe the shape of valid inputs using strategies, and Hypothesis generates inputs of that shape, tries to break your code, and if it finds a failing case, automatically shrinks it to the smallest possible example that still fails. This gives you a precise, minimal reproduction case instead of a random mess. Install it with pip install hypothesis. It works alongside pytest and unittest with zero configuration.

In this article, we will cover: what property-based testing is and when to use it, how to write your first Hypothesis test, how to use built-in strategies for common types, how to compose strategies for custom data structures, how to use stateful testing for sequences of operations, and how to apply Hypothesis to real code to find real bugs. By the end, you will have a new tool that makes your test suite dramatically more thorough.

Hypothesis Quick Example

Here is a Hypothesis test that checks a property of Python’s built-in sorted() function — that sorting a list and then reversing it should equal sorting in reverse order:

# quick_hypothesis.py
from hypothesis import given
from hypothesis import strategies as st

@given(st.lists(st.integers()))
def test_sort_reverse_equivalent(numbers):
    """sorted then reversed == sorted with reverse=True"""
    sorted_then_reversed = list(reversed(sorted(numbers)))
    sorted_reversed = sorted(numbers, reverse=True)
    assert sorted_then_reversed == sorted_reversed

# Run with pytest: pytest quick_hypothesis.py -v
# Or call directly:
test_sort_reverse_equivalent()
print("All tests passed!")

Output:

All tests passed!

Hypothesis ran this function against 100 generated lists (the default): empty lists, single-element lists, lists with negative numbers, lists with very large integers, and lists with duplicates. It found no counterexample, so the property held for every input it tried. If the property had been wrong, Hypothesis would show you the smallest list that breaks it.

What Is Property-Based Testing?

Traditional unit tests are example-based: you write specific inputs and expected outputs. Property-based tests are contract-based: you describe invariants that must hold for ANY valid input. The library’s job is to find inputs that violate those invariants.

Aspect | Example-Based (pytest) | Property-Based (Hypothesis)
Input source | You write it manually | Library generates it
Coverage | Only cases you thought of | Hundreds of random cases
Bug discovery | Known edge cases | Unknown edge cases
Failure output | The input you wrote | Smallest failing example
Best for | Known requirements | Algorithmic invariants

Property-based testing does not replace example-based tests — it complements them. Use both together. Write example-based tests for known requirements, add property-based tests for algorithmic invariants and data transformations.
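To make the contrast concrete, here is a hand-rolled sketch that uses only the random module. It is not how Hypothesis generates data (there is no shrinking and no smart edge-case selection); check_property and random_int_list are names invented for this illustration.

```python
import random


def check_property(prop, generate, runs=100, seed=0):
    """Call prop on `runs` generated inputs; return the first failing input or None."""
    rng = random.Random(seed)
    for _ in range(runs):
        value = generate(rng)
        if not prop(value):
            return value
    return None


def random_int_list(rng):
    size = rng.randint(0, 10)
    return [rng.randint(-50, 50) for _ in range(size)]


# Example-based: one input chosen by hand.
assert sorted([3, 1, 2]) == [1, 2, 3]


# Property-based: invariants that must hold for every generated list.
def sorted_is_ordered(xs):
    out = sorted(xs)
    return all(a <= b for a, b in zip(out, out[1:]))


def sorted_keeps_length(xs):
    return len(sorted(xs)) == len(xs)


# A deliberately false property, to show what a counterexample looks like.
def list_is_already_sorted(xs):
    return xs == sorted(xs)


print(check_property(sorted_is_ordered, random_int_list))
print(check_property(sorted_keeps_length, random_int_list))
counterexample = check_property(list_is_already_sorted, random_int_list)
print("counterexample:", counterexample)
```

The counterexample this loop reports is whatever unsorted list it happens to hit first, which is usually longer than necessary. Producing a minimal one is the job of shrinking, which Hypothesis adds on top.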

Understanding Strategies

A strategy tells Hypothesis how to generate values for a particular type. The hypothesis.strategies module (conventionally imported as st) provides strategies for all Python built-in types, plus tools to compose them into complex structures.

# strategies_demo.py
from hypothesis import given
from hypothesis import strategies as st

# Basic strategies
@given(st.integers(min_value=0, max_value=100))
def test_squares_are_positive(n):
    assert n * n >= 0

# Text strategies
@given(st.text(min_size=1, max_size=50))
def test_strip_never_longer(s):
    assert len(s.strip()) <= len(s)

# Float strategies
@given(st.floats(allow_nan=False, allow_infinity=False))
def test_abs_never_negative(f):
    assert abs(f) >= 0

# Lists of specific type
@given(st.lists(st.integers(), min_size=1))
def test_max_is_in_list(lst):
    assert max(lst) in lst

# Run all tests
test_squares_are_positive()
test_strip_never_longer()
test_abs_never_negative()
test_max_is_in_list()
print("All strategy tests passed!")

Output:

All strategy tests passed!

The constraints in strategies are important. st.floats(allow_nan=False, allow_infinity=False) excludes the special IEEE 754 values that would break most arithmetic. min_size=1 on a list ensures max() does not raise a ValueError on an empty list — though you might want to test that case separately.

Composing Custom Strategies

Real applications use complex data structures, not just integers. Hypothesis lets you compose strategies using st.fixed_dictionaries(), st.builds(), and the @st.composite decorator to generate custom objects.

# custom_strategies.py
from hypothesis import given
from hypothesis import strategies as st
from dataclasses import dataclass

@dataclass
class Product:
    name: str
    price: float
    quantity: int

# Strategy for valid products
product_strategy = st.builds(
    Product,
    name=st.text(alphabet=st.characters(whitelist_categories=('Lu', 'Ll', 'Nd', 'Zs')),
                 min_size=1, max_size=50),
    price=st.floats(min_value=0.01, max_value=10000.0, allow_nan=False),
    quantity=st.integers(min_value=0, max_value=10000)
)

def calculate_total(products):
    """Calculate total value of product inventory."""
    return sum(p.price * p.quantity for p in products)

@given(st.lists(product_strategy, min_size=1, max_size=20))
def test_total_always_non_negative(products):
    """Total inventory value must never be negative."""
    total = calculate_total(products)
    assert total >= 0, f"Negative total: {total}"

@given(st.lists(product_strategy, min_size=2))
def test_adding_product_increases_total(products):
    """Adding a product with positive price and quantity increases total."""
    base_total = calculate_total(products[:-1])
    last = products[-1]
    if last.price > 0 and last.quantity > 0:
        full_total = calculate_total(products)
        assert full_total > base_total

test_total_always_non_negative()
test_adding_product_increases_total()
print("Custom strategy tests passed!")

Output:

Custom strategy tests passed!

The st.builds() strategy calls the Product constructor with generated values for each field. You can nest strategies arbitrarily — a list of products, each with a composed strategy for its fields. This mirrors how your real application data is structured, so Hypothesis generates realistic test data automatically.
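When one generated value has to depend on another, @st.composite is the usual tool. The sketch below is illustrative only: price_ranges and clamp are hypothetical names, not part of the inventory example above.

```python
from hypothesis import given
from hypothesis import strategies as st


@st.composite
def price_ranges(draw):
    """Draw a (low, high) pair of prices where low <= high always holds."""
    low = draw(st.integers(min_value=0, max_value=1000))
    # The second draw depends on the first, which st.builds() cannot express.
    high = draw(st.integers(min_value=low, max_value=low + 1000))
    return low, high


def clamp(value, low, high):
    """Hypothetical helper under test: restrict value to [low, high]."""
    return max(low, min(value, high))


@given(price_ranges(), st.integers())
def test_clamp_stays_in_range(price_range, value):
    low, high = price_range
    assert low <= clamp(value, low, high) <= high


test_clamp_stays_in_range()
print("Composite strategy test passed!")
```

The draw function passed to a composite strategy pulls a value from any other strategy, so later draws can use earlier results as bounds.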

Finding Real Bugs with Hypothesis

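The real value of property-based testing shows when Hypothesis finds a bug you would never have written a test for. Roundtrip properties are a good place to look: here is a run-length encoder and decoder, checked by asserting that decoding an encoded list returns the original: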

# finding_bugs.py
from hypothesis import given
from hypothesis import strategies as st

def encode(data: list) -> str:
    """Run-length encode a list of integers. E.g., [1,1,2,3,3] -> '2x1,1x2,2x3'"""
    if not data:
        return ''
    result = []
    count = 1
    for i in range(1, len(data)):
        if data[i] == data[i-1]:
            count += 1
        else:
            result.append(f"{count}x{data[i-1]}")
            count = 1
    result.append(f"{count}x{data[-1]}")
    return ','.join(result)

def decode(encoded: str) -> list:
    """Decode run-length encoded string back to list."""
    if not encoded:
        return []
    result = []
    for part in encoded.split(','):
        count_str, val_str = part.split('x')
        result.extend([int(val_str)] * int(count_str))
    return result

# Property: encode then decode should give back the original
@given(st.lists(st.integers(min_value=-100, max_value=100)))
def test_encode_decode_roundtrip(data):
    encoded = encode(data)
    decoded = decode(encoded)
    assert decoded == data, f"Roundtrip failed: {data} -> '{encoded}' -> {decoded}"

test_encode_decode_roundtrip()
print("Roundtrip test passed!")

Output:

Roundtrip test passed!

Hypothesis tested this function with 100 generated inputs (the default), including empty lists, single-element lists, all-same lists, and alternating values, and the roundtrip property held for all of them. If decode() had a bug (say, only handling positive integers), Hypothesis would quickly find a minimal failing input like [-1] and show you the exact failing case with the encoded string.

Controlling Hypothesis Settings

Hypothesis provides a settings decorator to control how many examples are generated, the per-example deadline, and the verbosity of output. You can also use @example() to always include specific cases alongside the generated ones.

# settings_demo.py
from hypothesis import given, settings, example
from hypothesis import strategies as st

def divide(a: int, b: int) -> float:
    """Divide a by b, raise ValueError if b is zero."""
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b

# Always test these specific cases, plus 200 random ones
@settings(max_examples=200)
@example(a=0, b=1)
@example(a=-10, b=2)
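@given(
    # Bounded so that float rounding error stays below the 1e-9 tolerance
    a=st.integers(min_value=-10**6, max_value=10**6),
    b=st.integers(min_value=-10**6, max_value=10**6).filter(lambda x: x != 0)
)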
def test_divide_properties(a, b):
    result = divide(a, b)
    # Property 1: result times b should equal a (within float precision)
    assert abs(result * b - a) < 1e-9
    # Property 2: dividing by positive number preserves sign
    if b > 0:
        assert (result >= 0) == (a >= 0)

test_divide_properties()
print("Divide properties verified over 200+ examples!")

Output:

Divide properties verified over 200+ examples!

The .filter(lambda x: x != 0) call on the strategy excludes zero from the generated values. The @example() decorator guarantees that specific cases always run, even if Hypothesis would not randomly generate them. Combined with max_examples=200, this gives you a thorough test with precise control.

Real-Life Example: Testing a Sorted Data Structure

We will test a custom SortedList class that maintains elements in sorted order. Hypothesis will verify several invariants hold under all valid operations.

# sorted_list_test.py
from hypothesis import given
from hypothesis import strategies as st
import bisect

class SortedList:
    """A list that maintains sorted order on insert."""
    def __init__(self):
        self._data = []

    def insert(self, value):
        bisect.insort(self._data, value)

    def remove(self, value):
        idx = bisect.bisect_left(self._data, value)
        if idx < len(self._data) and self._data[idx] == value:
            self._data.pop(idx)
        else:
            raise ValueError(f"{value} not in list")

    def contains(self, value) -> bool:
        idx = bisect.bisect_left(self._data, value)
        return idx < len(self._data) and self._data[idx] == value

    def to_list(self) -> list:
        return list(self._data)

    def __len__(self):
        return len(self._data)

# Property 1: after inserting a value, the list is always sorted
@given(st.lists(st.integers()), st.integers())
def test_insert_preserves_order(existing, new_value):
    sl = SortedList()
    for v in existing:
        sl.insert(v)
    sl.insert(new_value)
    lst = sl.to_list()
    assert lst == sorted(lst), f"Not sorted after insert: {lst}"

# Property 2: after inserting, the value is always present
@given(st.integers())
def test_insert_then_contains(value):
    sl = SortedList()
    sl.insert(value)
    assert sl.contains(value)

# Property 3: length matches number of inserts
@given(st.lists(st.integers()))
def test_length_matches_inserts(values):
    sl = SortedList()
    for v in values:
        sl.insert(v)
    assert len(sl) == len(values)

test_insert_preserves_order()
test_insert_then_contains()
test_length_matches_inserts()
print("SortedList: all 3 properties verified!")

Output:

SortedList: all 3 properties verified!

These three properties (sorted order preserved, inserted value found, length consistent) form a behavioral contract for SortedList’s insert path. An implementation that fails any of them is definitely wrong; passing all three is strong evidence of correctness, though remove() is not yet covered and deserves its own property. You can refactor the internals (swap bisect.insort for a balanced BST, for example) and use these property tests as your regression suite: they will catch any violation of the contract.

Frequently Asked Questions

Hypothesis makes my test suite slow. How do I speed it up?

Run time is mostly controlled by max_examples. The example database stores previously failing inputs and replays them first, but it does not make passing runs faster. Use @settings(max_examples=50) for fast CI runs and max_examples=1000 for deeper local testing. The suppress_health_check option disables specific health checks if they are triggering false positives. In CI, point the database setting at a cached directory (for example with hypothesis.database.DirectoryBasedExampleDatabase) to preserve failing examples across runs.
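One way to switch between a quick and a deep configuration without editing each test is a settings profile. The sketch below assumes two profile names, "fast" and "thorough", chosen for this example; in a real project the registration would typically live in conftest.py.

```python
from hypothesis import given, settings
from hypothesis import strategies as st

# Profile names are arbitrary labels picked for this sketch.
settings.register_profile("fast", max_examples=50)
settings.register_profile("thorough", max_examples=1000)

# Load the quick profile; tests defined afterwards inherit its settings.
settings.load_profile("fast")


@given(st.lists(st.integers()))
def test_reverse_twice_is_identity(xs):
    assert list(reversed(list(reversed(xs)))) == xs


test_reverse_twice_is_identity()
print("Ran with profile max_examples =", settings.get_profile("fast").max_examples)
```

A per-test @settings(...) decorator still overrides whatever the loaded profile specifies.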

What is shrinking and why does it matter?

When Hypothesis finds a failing input, it automatically tries to shrink it — finding a smaller or simpler input that still triggers the same failure. Instead of showing you a list of 1,000 random integers that caused a bug, it will show you the 2-element list [0, -1] that triggers the same bug. This makes debugging dramatically easier, because you can immediately understand what property of the input caused the failure.
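The idea can be shown with a naive greedy shrinker written in plain Python. This is a teaching sketch, not Hypothesis's shrinking algorithm, which is considerably more sophisticated; shrink, simpler_values, and the crashes_on_large bug are all invented for the example.

```python
def simpler_values(value):
    """Candidate replacements for one integer, simplest first."""
    if value == 0:
        return []
    step = value - 1 if value > 0 else value + 1
    return [0, int(value / 2), step]


def shrink(failing, still_fails):
    """Greedily simplify a failing list of ints while still_fails() stays true."""
    current = list(failing)
    progress = True
    while progress:
        progress = False
        # First try dropping each element, then try simplifying each element.
        candidates = [current[:i] + current[i + 1:] for i in range(len(current))]
        candidates += [
            current[:i] + [v] + current[i + 1:]
            for i, value in enumerate(current)
            for v in simpler_values(value)
        ]
        for candidate in candidates:
            if still_fails(candidate):
                current = candidate
                progress = True
                break
    return current


def crashes_on_large(xs):
    """Stand-in for a bug: the code under test fails on any element >= 100."""
    return any(x >= 100 for x in xs)


failing_input = [57, -3, 188, 12, 241]
minimal = shrink(failing_input, crashes_on_large)
print(minimal)  # [100]
```

The five-element random-looking input collapses to [100], which points directly at the threshold that triggers the failure.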

What is stateful testing?

Stateful testing (also called model-based testing) lets you test sequences of operations, not just single function calls. Use hypothesis.stateful.RuleBasedStateMachine to define rules (operations like insert, delete, query) and invariants. Hypothesis generates sequences of these operations and checks invariants after each step. This is powerful for testing state machines, databases, queues, and any system where order of operations matters.
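A minimal sketch of a rule-based state machine follows. It compares a bisect-maintained list against a plain-list model; the class name SortedBagMachine and the small 0 to 5 value range (chosen so removals hit existing values often) are assumptions made for this example.

```python
import bisect

from hypothesis import strategies as st
from hypothesis.stateful import (
    RuleBasedStateMachine,
    invariant,
    rule,
    run_state_machine_as_test,
)


class SortedBagMachine(RuleBasedStateMachine):
    """Compare a bisect-backed sorted list against a plain-list model."""

    def __init__(self):
        super().__init__()
        self.actual = []  # kept sorted with bisect
        self.model = []   # reference: plain list, sorted on demand

    @rule(value=st.integers(min_value=0, max_value=5))
    def insert(self, value):
        bisect.insort(self.actual, value)
        self.model.append(value)

    @rule(value=st.integers(min_value=0, max_value=5))
    def remove_if_present(self, value):
        if value in self.model:
            self.model.remove(value)
            idx = bisect.bisect_left(self.actual, value)
            self.actual.pop(idx)

    @invariant()
    def matches_model(self):
        # Checked after every step of every generated sequence.
        assert self.actual == sorted(self.model)


# pytest collects this generated TestCase; it can also be run directly.
TestSortedBag = SortedBagMachine.TestCase
run_state_machine_as_test(SortedBagMachine)
print("Stateful test passed!")
```

Hypothesis chooses the order and arguments of the rule calls, so interleavings such as insert, insert, remove, insert are explored without being written out by hand.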

Should I replace my existing tests with Hypothesis?

No — use both. Example-based tests document specific known behaviors and are fast to run. Property-based tests explore unknown edge cases and validate invariants. A typical approach: write a few example-based tests to nail down the specification, then add property-based tests to verify invariants hold broadly. Your test suite becomes both precise (example-based) and thorough (property-based).

How does Hypothesis remember failing examples?

Hypothesis stores its discovered failing examples in a database directory (by default, .hypothesis/ in your project root). When you run the tests again, it tries the previously failing examples first. This means once a bug is found, it is retested every run — even if the main random generation would not have generated that input again. Add .hypothesis/ to .gitignore for local databases, or commit it to retain team-shared learned examples.

Conclusion

Property-based testing with Hypothesis changes how you think about test coverage. Instead of asking “did I test this specific case?” you ask “what properties must always hold?” We covered the basics: the @given decorator and strategies for common types, composing custom strategies with st.builds(), writing meaningful properties (roundtrip, ordering, length consistency), and controlling test settings. The real payoff comes when Hypothesis finds a minimal failing input you never would have written yourself.

From here, explore Hypothesis’s stateful testing with RuleBasedStateMachine for testing complex state machines, and the st.data() strategy for dynamic input generation within tests. The Hypothesis documentation includes a gallery of real-world bugs found by property testing that is worth reading for inspiration.

Official documentation: hypothesis.readthedocs.io

How To Analyze Networks and Graphs with NetworkX in Python

Intermediate

Networks are everywhere: social connections, airline routes, software dependencies, financial transactions, biological pathways. When you need to find the shortest route between two cities, detect communities in a social network, or identify the most influential node in a dependency graph, you need graph analysis tools. Python’s NetworkX library provides all of this in a clean, expressive API that integrates naturally with NumPy, SciPy, and Matplotlib.

NetworkX represents graphs as Python objects, so you can build networks programmatically and then apply powerful algorithms without writing them yourself. It supports undirected graphs, directed graphs (digraphs), multigraphs, and weighted graphs. Installation is simple: pip install networkx matplotlib. The Matplotlib package is needed for visualization.

This article covers everything you need to start working with graphs in Python: creating graphs and adding nodes and edges, calculating basic graph metrics, finding shortest paths, running centrality analysis, detecting connected components, and visualizing networks. By the end, we will analyze a real social network dataset to find the most connected people in a group.

NetworkX Quick Example

Here is a complete example building a small social network and running basic analysis on it:

# quick_networkx.py
import networkx as nx

# Create an undirected graph
G = nx.Graph()

# Add edges (nodes are created automatically)
G.add_edges_from([
    ('Alice', 'Bob'), ('Alice', 'Carol'),
    ('Bob', 'Carol'), ('Bob', 'Dave'),
    ('Carol', 'Eve'), ('Dave', 'Eve')
])

print("Nodes:", list(G.nodes()))
print("Edges:", list(G.edges()))
print("Number of nodes:", G.number_of_nodes())
print("Number of edges:", G.number_of_edges())

# Shortest path between Alice and Eve
path = nx.shortest_path(G, source='Alice', target='Eve')
print("Shortest path Alice -> Eve:", path)

# Degree of each node (number of connections)
for node, degree in G.degree():
    print(f"  {node}: {degree} connections")

Output:

Nodes: ['Alice', 'Bob', 'Carol', 'Dave', 'Eve']
Edges: [('Alice', 'Bob'), ('Alice', 'Carol'), ('Bob', 'Carol'), ('Bob', 'Dave'), ('Carol', 'Eve'), ('Dave', 'Eve')]
Number of nodes: 5
Number of edges: 6
Shortest path Alice -> Eve: ['Alice', 'Carol', 'Eve']
  Alice: 2 connections
  Bob: 3 connections
  Carol: 3 connections
  Dave: 2 connections
  Eve: 2 connections

NetworkX found the shortest social path from Alice to Eve through Carol in two hops. Bob and Carol have the most connections (degree 3), making them the most central figures in this small network. We built this in about 10 lines of code.

Graph Types in NetworkX

NetworkX supports four main graph types, each suited to different real-world scenarios. Choosing the right type matters because it changes which algorithms are available and what the edges mean.

| Class | Direction | Multi-edges | Use Case |
|---|---|---|---|
| Graph | Undirected | No | Friendships, protein interactions |
| DiGraph | Directed | No | Twitter follows, dependency graphs |
| MultiGraph | Undirected | Yes | Multi-lane roads, parallel networks |
| MultiDiGraph | Directed | Yes | Financial transactions, flight routes |

# graph_types.py
import networkx as nx

# Undirected graph: friendship is mutual
G = nx.Graph()
G.add_edge('Alice', 'Bob')
print("Alice in Bob's neighbors:", 'Alice' in G.neighbors('Bob'))  # True

# Directed graph: following is one-way
DG = nx.DiGraph()
DG.add_edge('Alice', 'Bob')  # Alice follows Bob
print("Bob follows Alice:", DG.has_edge('Bob', 'Alice'))  # False
print("Alice follows Bob:", DG.has_edge('Alice', 'Bob'))  # True

# Weighted graph: distances or strengths
WG = nx.Graph()
WG.add_edge('NYC', 'Chicago', weight=790)
WG.add_edge('Chicago', 'LA', weight=2015)
WG.add_edge('NYC', 'LA', weight=2800)
print("NYC-Chicago distance:", WG['NYC']['Chicago']['weight'])

Output:

Alice in Bob's neighbors: True
Bob follows Alice: False
Alice follows Bob: True
NYC-Chicago distance: 790

Edge weights can represent anything: distance, cost, bandwidth, similarity, or strength of relationship. The weight key is the standard convention in NetworkX. Dijkstra-based functions such as nx.dijkstra_path() read it by default, while nx.shortest_path() counts hops unless you pass weight='weight'.
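
The attribute does not have to be called weight. As a minimal sketch, assuming a made-up three-node network whose edges carry a cost attribute, you can name the attribute explicitly when asking for a path:

```python
import networkx as nx

# Hypothetical network where the edge attribute is called "cost", not "weight"
G = nx.Graph()
G.add_edge("A", "B", cost=1)
G.add_edge("B", "C", cost=1)
G.add_edge("A", "C", cost=5)

# Without a weight argument, shortest_path counts hops only
by_hops = nx.shortest_path(G, "A", "C")

# Name the attribute explicitly to minimize total cost instead
by_cost = nx.shortest_path(G, "A", "C", weight="cost")
total = nx.shortest_path_length(G, "A", "C", weight="cost")

print(by_hops)   # ['A', 'C']
print(by_cost)   # ['A', 'B', 'C']
print(total)     # 2
```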

Centrality Analysis

Centrality measures answer the question “which nodes are most important in this network?” Different centrality metrics define “important” in different ways. Degree centrality looks at number of direct connections. Betweenness centrality identifies nodes that bridge different parts of the network. PageRank (the algorithm behind Google Search) measures influence based on the quality of incoming connections.

# centrality_demo.py
import networkx as nx

# Build a slightly larger network
edges = [
    ('Alice', 'Bob'), ('Alice', 'Carol'), ('Alice', 'Dave'),
    ('Bob', 'Eve'), ('Carol', 'Frank'), ('Dave', 'Grace'),
    ('Eve', 'Frank'), ('Frank', 'Grace'), ('Grace', 'Bob')
]
G = nx.Graph()
G.add_edges_from(edges)

# Degree centrality: fraction of nodes connected to this node
degree_c = nx.degree_centrality(G)
print("Degree centrality:")
for node, val in sorted(degree_c.items(), key=lambda x: -x[1]):
    print(f"  {node}: {val:.3f}")

# Betweenness centrality: fraction of shortest paths passing through node
between_c = nx.betweenness_centrality(G)
print("\nBetweenness centrality (bridges):")
# Round before sorting so tied nodes fall back to alphabetical order
ranked = sorted(between_c.items(), key=lambda x: (-round(x[1], 3), x[0]))
for node, val in ranked[:4]:
    print(f"  {node}: {val:.3f}")

Output:

Degree centrality:
  Alice: 0.500
  Bob: 0.500
  Frank: 0.500
  Grace: 0.500
  Carol: 0.333
  Dave: 0.333
  Eve: 0.333

Betweenness centrality (bridges):
  Alice: 0.189
  Frank: 0.189
  Bob: 0.178
  Grace: 0.178

Alice, Bob, Frank, and Grace each have three connections and carry most of the shortest paths between other nodes, with Alice and Frank slightly ahead on betweenness. No single node is a cut point here: the graph stays connected if any one of them is removed. This kind of analysis is invaluable for understanding vulnerability in supply chains, critical nodes in infrastructure, or key connectors in organizational charts.
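
Betweenness is a score, not a yes/no answer about whether removing a node splits the network. To test that directly, NetworkX provides articulation_points() and bridges(). A minimal sketch on a made-up graph (the node names are illustrative):

```python
import networkx as nx

# Hypothetical network: a triangle with a two-edge chain hanging off "Hub"
G = nx.Graph()
G.add_edges_from([
    ("A", "B"), ("B", "Hub"), ("Hub", "A"),   # triangle
    ("Hub", "X"), ("X", "Y"),                 # chain
])

# Nodes whose removal disconnects the graph
cut_nodes = sorted(nx.articulation_points(G))

# Edges whose removal disconnects the graph
cut_edges = sorted(tuple(sorted(e)) for e in nx.bridges(G))

print(cut_nodes)  # ['Hub', 'X']
print(cut_edges)  # [('Hub', 'X'), ('X', 'Y')]
```

An empty list from articulation_points() means no single node failure can split the graph.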

Shortest Paths and Weighted Routing

Finding the shortest path between nodes is one of the most common graph problems. NetworkX implements Dijkstra’s algorithm, Bellman-Ford, and A* for weighted graphs. For unweighted graphs, it uses BFS. All of these are available with a single function call.

# shortest_path_demo.py
import networkx as nx

# City distance network (weights in miles)
G = nx.Graph()
G.add_weighted_edges_from([
    ('NYC', 'Philadelphia', 95),
    ('Philadelphia', 'Baltimore', 100),
    ('Baltimore', 'DC', 45),
    ('NYC', 'Boston', 215),
    ('Boston', 'Providence', 50),
    ('NYC', 'DC', 230),
    ('DC', 'Richmond', 100)
])

# Shortest path by number of hops (ignoring weight)
hop_path = nx.shortest_path(G, 'NYC', 'Richmond')
print("Shortest by hops:", hop_path)
print("Hops:", len(hop_path) - 1)

# Shortest path by distance (using weights)
dist_path = nx.shortest_path(G, 'NYC', 'Richmond', weight='weight')
dist_len = nx.shortest_path_length(G, 'NYC', 'Richmond', weight='weight')
print("\nShortest by distance:", dist_path)
print("Total distance:", dist_len, "miles")

# Hop counts from NYC to every reachable city
all_paths = dict(nx.single_source_shortest_path_length(G, 'NYC'))
print("\nAll cities reachable from NYC and hop count:")
for city, hops in sorted(all_paths.items()):
    print(f"  {city}: {hops} hops")

Output:

Shortest by hops: ['NYC', 'DC', 'Richmond']
Hops: 2

Shortest by distance: ['NYC', 'DC', 'Richmond']
Total distance: 330 miles

All cities reachable from NYC and hop count:
  Baltimore: 2 hops
  Boston: 1 hops
  DC: 1 hops
  NYC: 0 hops
  Philadelphia: 1 hops
  Providence: 2 hops
  Richmond: 2 hops

Here the route with the fewest hops (NYC -> DC -> Richmond) is also the shortest by distance: 230 + 100 = 330 miles, compared with 95 + 100 + 45 + 100 = 340 miles via Philadelphia and Baltimore. The two answers will not always agree. If the direct NYC-DC edge were longer than 240 miles, the four-hop route through Philadelphia would win on distance while the two-hop route still won on hops. With weight='weight', NetworkX uses Dijkstra’s algorithm to find the minimum-distance route.
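
To see what a weighted search does under the hood, here is a minimal pure-Python sketch of Dijkstra’s algorithm run on the same city distances. It is an illustration only, not NetworkX’s implementation, and the function name dijkstra is our own.

```python
import heapq


def dijkstra(adj, source, target):
    """Return (distance, path) for the cheapest route, or (inf, []) if unreachable."""
    best = {source: 0}          # cheapest known distance to each node
    prev = {}                   # predecessor on the cheapest known route
    queue = [(0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            break
        if dist > best.get(node, float("inf")):
            continue            # stale queue entry, a cheaper route was found already
        for neighbor, weight in adj.get(node, {}).items():
            new_dist = dist + weight
            if new_dist < best.get(neighbor, float("inf")):
                best[neighbor] = new_dist
                prev[neighbor] = node
                heapq.heappush(queue, (new_dist, neighbor))
    if target not in best:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return best[target], path[::-1]


edges = [
    ("NYC", "Philadelphia", 95), ("Philadelphia", "Baltimore", 100),
    ("Baltimore", "DC", 45), ("NYC", "Boston", 215),
    ("Boston", "Providence", 50), ("NYC", "DC", 230), ("DC", "Richmond", 100),
]

# Undirected graph stored as a dict of dicts
adj = {}
for u, v, w in edges:
    adj.setdefault(u, {})[v] = w
    adj.setdefault(v, {})[u] = w

distance, route = dijkstra(adj, "NYC", "Richmond")
print(distance, route)  # 330 ['NYC', 'DC', 'Richmond']
```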

Connected Components and Clustering

A connected component is a subgraph where every node can reach every other node. Identifying components helps you find isolated clusters, detect when a network has split, or understand the community structure of a social network.

# components_demo.py
import networkx as nx

# Build a graph with two separate clusters
G = nx.Graph()
# Cluster 1: Python developers
G.add_edges_from([('Alice', 'Bob'), ('Bob', 'Carol'), ('Carol', 'Alice')])
# Cluster 2: Data scientists (not connected to cluster 1 yet)
G.add_edges_from([('Dave', 'Eve'), ('Eve', 'Frank')])
# Isolated node
G.add_node('Greg')

# Find connected components
components = list(nx.connected_components(G))
print(f"Number of components: {len(components)}")
for i, comp in enumerate(components, 1):
    print(f"  Component {i}: {comp}")

# Is the graph connected?
print("Fully connected:", nx.is_connected(G))

# Add a bridge between clusters
G.add_edge('Carol', 'Dave')
print("After bridge -- fully connected:", nx.is_connected(G))

# Clustering coefficient: how tightly knit is each node's neighborhood?
clustering = nx.clustering(G)
print("\nClustering coefficients:")
for node, coef in clustering.items():
    print(f"  {node}: {coef:.3f}")

Output:

Number of components: 3
  Component 1: {'Alice', 'Bob', 'Carol'}
  Component 2: {'Dave', 'Eve', 'Frank'}
  Component 3: {'Greg'}
Fully connected: False
After bridge -- fully connected: False

Clustering coefficients:
  Alice: 1.000
  Bob: 1.000
  Carol: 0.333
  Dave: 0.000
  Eve: 0.000
  Frank: 0.000
  Greg: 0.000

The bridge merges the two clusters, but is_connected() still returns False because Greg remains isolated; the graph now has two components instead of three. Alice and Bob have clustering coefficient 1.0: all their neighbors are also connected to each other (a perfect triangle). Carol’s coefficient drops to 0.333 after she bridges to Dave, because Dave is not connected to Alice or Bob. High clustering indicates tight-knit groups; low clustering indicates bridging roles.
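
A common follow-up is to keep only the largest component before running algorithms that require a connected graph. A minimal sketch, assuming a small made-up graph similar to the one above:

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("Alice", "Bob"), ("Bob", "Carol"), ("Carol", "Alice")])
G.add_edges_from([("Dave", "Eve")])
G.add_node("Greg")

# Components are plain sets of nodes; pick the biggest one
largest = max(nx.connected_components(G), key=len)

# subgraph() returns a read-only view; copy() makes an independent graph
core = G.subgraph(largest).copy()

print(nx.number_connected_components(G))   # 3
print(sorted(core.nodes()))                # ['Alice', 'Bob', 'Carol']
print(nx.is_connected(core))               # True

# Look up the component that contains one specific node
dave_group = nx.node_connected_component(G, "Dave")
print(sorted(dave_group))                  # ['Dave', 'Eve']
```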

Real-Life Example: Analyzing a Software Package Dependency Graph

We will build a directed dependency graph for a Python project, find circular dependencies, and identify the most critical packages by centrality.

# dependency_graph.py
import networkx as nx

# Define package dependencies (A depends on B means edge A -> B)
dependencies = {
    'app': ['flask', 'sqlalchemy', 'celery'],
    'flask': ['werkzeug', 'jinja2', 'click'],
    'sqlalchemy': ['greenlet'],
    'celery': ['kombu', 'billiard', 'redis'],
    'kombu': ['redis', 'amqp'],
    'amqp': ['vine'],
    'jinja2': ['markupsafe'],
    'click': [],
    'werkzeug': [],
    'greenlet': [],
    'billiard': [],
    'redis': [],
    'vine': [],
    'markupsafe': []
}

# Build directed graph
DG = nx.DiGraph()
for package, deps in dependencies.items():
    DG.add_node(package)
    for dep in deps:
        DG.add_edge(package, dep)

print(f"Total packages: {DG.number_of_nodes()}")
print(f"Total dependencies: {DG.number_of_edges()}")

# Find packages with no dependencies (leaf nodes)
leaves = [n for n, d in DG.out_degree() if d == 0]
print(f"\nLeaf packages (no dependencies): {leaves}")

# Find the most depended-upon packages (high in-degree)
print("\nMost required packages:")
for pkg, in_deg in sorted(DG.in_degree(), key=lambda x: -x[1])[:3]:
    print(f"  {pkg}: required by {in_deg} others")

# Topological sort: valid installation order
try:
    install_order = list(nx.topological_sort(DG))
    print(f"\nInstall order (reversed): {install_order[:5]}...")
except nx.NetworkXUnfeasible:
    print("Circular dependency detected!")

# Check for cycles
cycles = list(nx.simple_cycles(DG))
if cycles:
    print(f"Circular dependencies: {cycles}")
else:
    print("\nNo circular dependencies found.")

Output:

Total packages: 14
Total dependencies: 14

Leaf packages (no dependencies): ['werkzeug', 'click', 'greenlet', 'billiard', 'redis', 'vine', 'markupsafe']

Most required packages:
  redis: required by 2 others
  flask: required by 1 others
  sqlalchemy: required by 1 others

Install order (reversed): ['app', 'flask', 'sqlalchemy', 'celery', 'werkzeug']...

No circular dependencies found.

This analysis immediately reveals that redis is a shared dependency (required by both celery and kombu), making it a critical package to monitor for version conflicts. Because edges point from a package to its dependencies, the topological sort lists dependents first; reading it in reverse gives a valid installation order, with every dependency installed before the packages that need it. You can extend this to parse actual requirements.txt or pyproject.toml files using the packaging module.
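
Two questions usually follow from a dependency graph: what does a package pull in transitively, and what is affected if a shared package changes? The sketch below answers both with nx.descendants() and nx.ancestors() on a trimmed, hypothetical subset of the data above.

```python
import networkx as nx

# Trimmed, hypothetical subset of the dependency data (edge A -> B: A depends on B)
DG = nx.DiGraph()
DG.add_edges_from([
    ("app", "celery"), ("celery", "kombu"), ("celery", "redis"),
    ("kombu", "redis"), ("kombu", "amqp"), ("amqp", "vine"),
])

# Everything celery needs, directly or indirectly
needs = nx.descendants(DG, "celery")

# Everything that could be affected by a redis upgrade
affected = nx.ancestors(DG, "redis")

# Dependencies first: reverse the topological order
install_order = list(reversed(list(nx.topological_sort(DG))))

print(sorted(needs))      # ['amqp', 'kombu', 'redis', 'vine']
print(sorted(affected))   # ['app', 'celery', 'kombu']
print(install_order[-1])  # app (installed last)
```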

Frequently Asked Questions

Can NetworkX handle large graphs with millions of nodes?

NetworkX works well for graphs up to a few hundred thousand nodes on a modern machine. For graphs with millions of nodes, consider compiled libraries such as graph-tool (C++ backend) or NetworKit. For distributed processing of billion-scale graphs, Apache Spark’s GraphX or GraphFrames are common choices.

How do I visualize larger networks properly?

For small graphs (under ~100 nodes), nx.draw(G, with_labels=True) with Matplotlib works well. For larger graphs, use nx.spring_layout() for force-directed layouts, or export to Gephi with nx.write_gexf(G, 'graph.gexf') for interactive exploration. The pyvis library creates interactive HTML visualizations that work in Jupyter notebooks.

How do I load graph data from files?

NetworkX supports many formats: nx.read_edgelist('edges.txt') for simple edge lists, nx.read_gml() for GML format, nx.read_graphml() for GraphML, and nx.from_pandas_edgelist(df) for Pandas DataFrames. For large datasets, edge list files are the most efficient — each line is a pair of node IDs.

How do weighted shortest paths differ from unweighted?

Unweighted shortest paths minimize the number of edges (hops). Weighted shortest paths (using Dijkstra’s algorithm) minimize the total edge weight. Always pass weight='weight' (or your custom weight attribute name) to nx.shortest_path() when you want distance-based routing. Without this parameter, NetworkX ignores weights and counts hops only.

How do I detect communities in a network?

NetworkX includes several community detection algorithms. The most popular for undirected graphs is the Louvain algorithm, available via the community module: from networkx.algorithms.community import louvain_communities; communities = louvain_communities(G). For smaller graphs, Girvan-Newman algorithm works well: from networkx.algorithms.community import girvan_newman. Community detection is useful for finding friend groups, topic clusters, or organizational units.

Conclusion

We have covered the NetworkX fundamentals: creating graphs with Graph, DiGraph, and weighted edges; calculating centrality metrics with degree_centrality() and betweenness_centrality(); finding shortest paths with shortest_path() and shortest_path_length(); identifying connected components with connected_components(); and applying topological sorting with topological_sort() for dependency resolution. The dependency graph example showed how to translate a real-world engineering problem into a graph analysis workflow.

From here, explore NetworkX’s community detection algorithms for social network analysis, or try exporting your graphs to Gephi for interactive visualization. The built-in graph generators, such as nx.erdos_renyi_graph() and nx.barabasi_albert_graph(), provide random benchmark graphs for testing your algorithms at scale.

Official documentation: networkx.org/documentation

Creating and Modifying Graphs

NetworkX represents graphs as objects with add/remove methods for nodes and edges. The four graph types cover most needs:

import networkx as nx

# Undirected — friendships, transit lines
G = nx.Graph()
G.add_node("Alice")
G.add_nodes_from(["Bob", "Carol", "Dave"])
G.add_edge("Alice", "Bob")
G.add_edges_from([("Alice", "Carol"), ("Bob", "Dave")])

# Directed — Twitter follows, dependency graphs
D = nx.DiGraph()
D.add_edge("user1", "user2")    # user1 follows user2

# Weighted — road networks, network capacity
W = nx.Graph()
W.add_edge("A", "B", weight=10)
W.add_edge("B", "C", weight=5)
W.add_edge("A", "C", weight=20)

# Multi — multiple edges between same nodes (parallel routes)
M = nx.MultiGraph()
M.add_edge("X", "Y", route="highway")
M.add_edge("X", "Y", route="back-road")

print(G.number_of_nodes(), G.number_of_edges())   # 4 3
print(list(G.neighbors("Alice")))                 # ['Bob', 'Carol']

Reading Graphs from Files

For real-world graphs, load from files instead of building manually:

# Edge list — one edge per line, space-separated
G = nx.read_edgelist("friendships.txt")

# CSV with attributes
import pandas as pd
df = pd.read_csv("edges.csv")
G = nx.from_pandas_edgelist(df, source="from", target="to", edge_attr="weight")

# GraphML — preserves all attributes
G = nx.read_graphml("network.graphml")
nx.write_graphml(G, "out.graphml")

# JSON (for D3.js visualization)
import json
data = nx.node_link_data(G)
with open("graph.json", "w") as f:
    json.dump(data, f)
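
When the edge list is already in memory (for example, lines read from an API response), nx.parse_edgelist() builds the graph without touching the file system. A minimal sketch; the line contents below are made up for illustration:

```python
import networkx as nx

# Hypothetical contents: "source target weight" per line
lines = [
    "Alice Bob 3",
    "Bob Carol 1.5",
    "# comment lines are skipped",
    "Carol Dave 2",
]

# data names each extra column and the type to convert it to
G = nx.parse_edgelist(lines, nodetype=str, data=(("weight", float),))

print(G.number_of_edges())          # 3
print(G["Bob"]["Carol"]["weight"])  # 1.5
```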

Shortest Paths and Distances

One of the most-used graph operations — find the shortest route between two nodes:

G = nx.Graph()
G.add_weighted_edges_from([
    ("Home", "Work", 40),
    ("Home", "Gym", 10),
    ("Gym", "Work", 25),
    ("Home", "Cafe", 5),
    ("Cafe", "Work", 35),
])

# Dijkstra with edge weights
path = nx.dijkstra_path(G, "Home", "Work", weight="weight")
print(path)        # ['Home', 'Gym', 'Work']
distance = nx.dijkstra_path_length(G, "Home", "Work", weight="weight")
print(distance)    # 35

# BFS for unweighted graphs (faster)
print(nx.shortest_path(G, "Home", "Work"))

# Shortest-path lengths from one node to every other node
print(dict(nx.shortest_path_length(G, "Home", weight="weight")))

Centrality: Who’s Important?

Centrality measures rank nodes by importance. Four common ones:

# Degree centrality — most connections
nx.degree_centrality(G)
# {'Alice': 0.8, 'Bob': 0.4, ...}

# Betweenness — most common "bridge" between others
nx.betweenness_centrality(G)

# Closeness — quickest to reach everyone
nx.closeness_centrality(G)

# PageRank — Google's algorithm
nx.pagerank(G, alpha=0.85)

For a social network, PageRank or betweenness identifies influencers. For a transit network, closeness identifies hubs.

Community Detection

Communities are dense clusters of connected nodes — groups, neighborhoods, market segments:

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, louvain_communities

communities = louvain_communities(G, seed=42)
for i, c in enumerate(communities):
    print(f"Community {i}: {len(c)} nodes — {list(c)[:5]}")

# Compare community quality
print(nx.community.modularity(G, communities))
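
To get a feel for what the modularity score means, it helps to compare an obviously good split with a deliberately poor one. The sketch below uses nx.barbell_graph() (two cliques joined by one edge) and the deterministic greedy_modularity_communities() rather than Louvain; both partitions are hand-written for illustration.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

# Two 5-node cliques (nodes 0-4 and 5-9) joined by a single edge
G = nx.barbell_graph(5, 0)

found = greedy_modularity_communities(G)
print([sorted(c) for c in found])

# Score a sensible split against a deliberately poor one
good = [set(range(0, 5)), set(range(5, 10))]
poor = [set(range(0, 2)), set(range(2, 10))]

print(round(modularity(G, good), 3))   # 0.452
print(round(modularity(G, poor), 3))   # 0.023
```

Higher modularity means more edges fall inside communities than a random graph with the same degrees would produce.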

Visualizing Graphs

NetworkX uses matplotlib for plots. For small graphs, the defaults are fine; for larger ones, use a proper graph viz tool (Gephi, Graphistry):

import matplotlib.pyplot as plt

pos = nx.spring_layout(G, seed=42)
nx.draw_networkx_nodes(G, pos, node_color="lightblue", node_size=400)
nx.draw_networkx_edges(G, pos, alpha=0.5)
nx.draw_networkx_labels(G, pos)
plt.axis("off")
plt.savefig("graph.png", dpi=150)

Common Pitfalls

  • Wrong graph type. Using Graph when you have directed data hides the directionality. Pick DiGraph for follows, citations, dependencies.
  • Forgetting weight parameter. Many algorithms accept a weight argument to use edge weights. Default ignores them — your “shortest path” may not be shortest.
  • Slow on large graphs. NetworkX is pure Python — fine to ~100K nodes. For millions, switch to graph-tool (C++) or networkit.
  • Assuming the graph is connected. Some algorithms only work on connected graphs. Check nx.is_connected(G) first or operate on subgraphs.
  • Mutating during iteration. Modifying a graph (add/remove nodes) while iterating over it raises RuntimeError. Build a snapshot first.
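
The last pitfall is easy to reproduce. A minimal sketch on a small generated graph, showing the failing loop and the snapshot fix side by side:

```python
import networkx as nx

G = nx.path_graph(5)          # 0-1-2-3-4
G.add_node("isolated")

# Anti-pattern: removing nodes while iterating over the live node view
H = G.copy()
try:
    for n in H.nodes:
        if H.degree(n) <= 1:
            H.remove_node(n)
    failed = False
except RuntimeError:
    failed = True
print(failed)                 # True

# Fix: collect the nodes to delete first, then mutate
to_remove = [n for n, deg in G.degree() if deg <= 1]
G.remove_nodes_from(to_remove)
print(sorted(G.nodes()))      # [1, 2, 3]
```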

FAQ

Q: NetworkX, graph-tool, or igraph?
A: NetworkX for general use — pure Python, clean API, comprehensive algorithm library. graph-tool for performance (compiled). igraph for R interop or specific algorithms NetworkX lacks.

Q: How do I scale to millions of nodes?
A: NetworkX is too slow. Use cuGraph (GPU-accelerated), graph-tool, or specialized graph databases (Neo4j, Memgraph).

Q: Can I run graph algorithms on a DataFrame directly?
A: nx.from_pandas_edgelist converts; nx.to_pandas_edgelist goes back. For graph operations, you need the NetworkX object — keeping it pandas-only doesn’t work.

Q: How do I find cycles in a directed graph?
A: nx.simple_cycles(D) for all simple cycles. nx.is_directed_acyclic_graph(D) to check if a graph is a DAG. nx.topological_sort(D) orders DAG nodes.

Q: Best layout algorithm for visualization?
A: spring_layout for general use, kamada_kawai_layout for smaller graphs (better quality), circular_layout for highly connected graphs.

Wrapping Up

NetworkX is the Swiss Army knife of graph analysis in Python — undirected, directed, weighted, multi-edge graphs all behave consistently. Add a graph from edges, ask shortest-path or centrality questions, and you’re already 80% of the way to solving real network problems. For graphs too large for pure Python (millions of nodes), specialized libraries take over; for everything else NetworkX is the right tool.

How To Use SymPy for Symbolic Mathematics in Python

Intermediate

Have you ever solved an algebra problem only to get a decimal approximation when you wanted the exact symbolic answer? Python’s SymPy library solves this problem by treating mathematics the way a mathematician does — symbolically. Instead of computing pi as 3.14159..., SymPy keeps it as the exact symbol pi. Instead of approximating a square root, it returns sqrt(2). This is symbolic computation, and it transforms Python into a full-featured computer algebra system.

SymPy is a pure Python library — no C extensions, no compiled code — so installation is straightforward with pip install sympy. It works alongside NumPy and SciPy but solves a different problem: those libraries compute numerical answers fast, while SymPy computes exact symbolic answers. You can use SymPy to solve equations, expand polynomials, compute derivatives and integrals, factor expressions, and even generate LaTeX output for publication-quality math.

In this article, we will cover the fundamentals of SymPy from the ground up: how to define symbolic variables, simplify and expand expressions, solve equations, compute limits and derivatives, evaluate integrals, and apply SymPy to a practical calculus problem. By the end, you will be able to use Python as a complete algebra and calculus tool.

SymPy Quick Example

Here is a self-contained example that shows SymPy solving a quadratic equation and computing a derivative — both with exact results:

# quick_sympy.py
from sympy import symbols, solve, diff, expand

x = symbols('x')

# Solve a quadratic equation
equation = x**2 - 5*x + 6
solutions = solve(equation, x)
print("Solutions:", solutions)

# Compute the derivative of x^3 - 2x + 1
expr = x**3 - 2*x + 1
derivative = diff(expr, x)
print("Derivative:", derivative)

# Expand a factored expression
expanded = expand((x + 2) * (x - 3))
print("Expanded:", expanded)

Output:

Solutions: [2, 3]
Derivative: 3*x**2 - 2
Expanded: x**2 - x - 6

These are exact symbolic results. solve() found the roots of the quadratic as integers, not decimals. diff() computed the derivative using the power rule. expand() multiplied out the factors. Every result is a SymPy expression that can be further manipulated, printed as LaTeX, or evaluated numerically.
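
As a short sketch of those three follow-ups, the snippet below takes the derivative from the example and substitutes values, evaluates numerically, and renders LaTeX:

```python
from sympy import symbols, diff, latex, sqrt

x = symbols('x')
derivative = diff(x**3 - 2*x + 1, x)     # 3*x**2 - 2

# Substitute a value: the result stays exact
print(derivative.subs(x, 2))             # 10
print(derivative.subs(x, sqrt(2)))       # 4

# Evaluate numerically only when you ask for it
print(sqrt(2).evalf(10))                 # 1.414213562

# Render as LaTeX source for a paper or notebook
print(latex(derivative))                 # 3 x^{2} - 2
```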

What Is SymPy and Why Use It?

SymPy is a computer algebra system (CAS) written entirely in Python. A CAS is software that manipulates mathematical expressions in symbolic form, the way a human mathematician writes on paper, rather than computing numerical approximations. SymPy’s closest analogues are Mathematica and Maple, but SymPy is free, open source, and integrates naturally with the Python data science ecosystem.

| Library | Purpose | Result Type | Best For |
|---|---|---|---|
| NumPy | Numerical computation | Float approximation | Fast array math |
| SciPy | Scientific algorithms | Float approximation | Optimization, stats |
| SymPy | Symbolic computation | Exact expression | Algebra, calculus proofs |

Use SymPy when you need exact answers — factoring polynomials, solving equations analytically, or deriving formulas before implementing them numerically.

Defining Symbolic Variables

The foundation of SymPy is the Symbol class. Before using any variable in a symbolic expression, you declare it with symbols(). This tells SymPy “this letter stands for a mathematical unknown, not a Python value.”

# symbols_demo.py
from sympy import symbols, Symbol

# Single symbol
x = symbols('x')

# Multiple symbols at once
a, b, c = symbols('a b c')

# Symbol with assumptions (positive, real, integer, etc.)
n = symbols('n', integer=True, positive=True)
t = symbols('t', real=True)

print(type(x))        # <class 'sympy.core.symbol.Symbol'>
print(x + x)          # 2*x
print(x * x)          # x**2
print(a + b + a)      # 2*a + b

Output:

<class 'sympy.core.symbol.Symbol'>
2*x
x**2
2*a + b

Assumptions like positive=True or integer=True help SymPy simplify expressions correctly. For example, sqrt(x**2) only simplifies to x when SymPy knows x is non-negative.
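
A minimal sketch of that effect, using three symbols that differ only in their assumptions (the symbol names are arbitrary):

```python
from sympy import symbols, sqrt, Abs

x = symbols('x')                    # no assumptions: could be complex
r = symbols('r', real=True)
p = symbols('p', positive=True)

print(sqrt(x**2))   # sqrt(x**2)
print(sqrt(r**2))   # Abs(r)
print(sqrt(p**2))   # p

# Assumptions can be queried; None means "unknown"
print(x.is_positive, r.is_real, p.is_positive)  # None True True
```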

Simplifying and Manipulating Expressions

Once you have symbols, you can build expressions and simplify them. SymPy provides simplify(), expand(), factor(), cancel(), and collect() — each targeting different transformation goals.

# simplify_demo.py
from sympy import symbols, simplify, expand, factor, cancel, trigsimp, sin, cos

x, y = symbols('x y')

# cancel -- cancel common factors in a rational expression
expr1 = (x**2 - 1) / (x - 1)
print("cancel:", cancel(expr1))           # x + 1

# expand -- distribute multiplication
expr2 = (x + y)**3
print("expand:", expand(expr2))           # x**3 + 3*x**2*y + 3*x*y**2 + y**3

# factor -- reverse of expand
expr3 = x**3 + 3*x**2*y + 3*x*y**2 + y**3
print("factor:", factor(expr3))           # (x + y)**3

# trigsimp -- simplify trig expressions
trig_expr = sin(x)**2 + cos(x)**2
print("trigsimp:", trigsimp(trig_expr))   # 1

Output:

cancel: x + 1
expand: x**3 + 3*x**2*y + 3*x*y**2 + y**3
factor: (x + y)**3
trigsimp: 1

Notice that cancel() correctly identified that (x**2 - 1)/(x - 1) simplifies to x + 1 by canceling the common factor (x - 1). This is exact symbolic cancellation — floating-point arithmetic would have struggled near x = 1.
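
The demo above does not exercise collect() or simplify(), so here is a brief sketch of both. The expressions are arbitrary examples chosen for illustration.

```python
from sympy import symbols, collect, simplify, expand, sin, cos

x, y, z = symbols('x y z')

# collect -- group terms by powers of a chosen symbol
expr = x*y + x - 3 + 2*x**2 - z*x**2 + x**3
collected = collect(expr, x)
print(collected)                 # x**3 + x**2*(2 - z) + x*(y + 1) - 3
print(collected.coeff(x, 2))     # 2 - z

# simplify -- tries several strategies and keeps the simplest result it finds
mixed = sin(x)**2 + cos(x)**2 + (x**2 - 1)/(x - 1)
result = simplify(mixed)
print(result)                    # x + 2
```

simplify() is convenient for exploration, but the targeted functions (cancel, factor, trigsimp, collect) are more predictable when you know which transformation you want.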

Solving Equations

The solve() function finds the values of unknowns that satisfy an equation. You can solve for one variable in terms of others, solve systems of equations, and even solve inequalities. Pass the expression (or a list of expressions) and the variable(s) to solve for.

# solve_demo.py
from sympy import symbols, solve, Eq, sqrt

x, y = symbols('x y')

# Solve x^2 = 9
solutions = solve(x**2 - 9, x)
print("x^2 = 9:", solutions)           # [-3, 3]

# Use Eq() for equations with both sides
eq = Eq(2*x + 3, 11)
print("2x + 3 = 11:", solve(eq, x))      # [4]

# System of equations
eq1 = Eq(x + y, 7)
eq2 = Eq(2*x - y, 2)
print("System:", solve([eq1, eq2], [x, y]))  # {x: 3, y: 4}

# Solve quadratic formula symbolically
a, b, c = symbols('a b c')
quad = a*x**2 + b*x + c
print("Quadratic formula:", solve(quad, x))

Output:

x^2 = 9: [-3, 3]
2x + 3 = 11: [4]
System: {x: 3, y: 4}
Quadratic formula: [(-b - sqrt(-4*a*c + b**2))/(2*a), (-b + sqrt(-4*a*c + b**2))/(2*a)]

The last result is the closed-form quadratic formula — SymPy returns both roots as exact symbolic expressions. No numerical approximation, no round-off error. This is the kind of thing you’d otherwise look up in a reference; SymPy derives it from a*x**2 + b*x + c = 0 in one call.
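
Two habits pair well with solve(): substituting the roots back in to verify them, and using solveset() when the answer is a range rather than a list of points. A minimal sketch with arbitrary example expressions:

```python
from sympy import symbols, solve, solveset, S, Interval

x = symbols('x')

# Verify roots by substituting them back into the expression
expr = x**2 - 5*x + 6
roots = solve(expr, x)
print(roots)                                      # [2, 3]
print(all(expr.subs(x, r) == 0 for r in roots))   # True

# Inequalities: solveset returns the solution as a set over the chosen domain
inside = solveset(x**2 - 4 < 0, x, domain=S.Reals)
print(inside)                                     # Interval.open(-2, 2)
print(1 in inside, 3 in inside)                   # True False
```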

Calculus: Derivatives, Integrals, Limits

SymPy’s calculus toolbox is what makes it indispensable for physics, engineering, and machine-learning gradient work. The four core functions you’ll use most are diff() (derivatives), integrate() (integrals), limit() (limits), and series() (Taylor expansion).

# calculus_demo.py
from sympy import symbols, diff, integrate, limit, series, sin, oo, exp, pi

x = symbols('x')

# Derivatives
f = x**3 + 2*x**2 - 5*x + 1
print("f(x)    =", f)
print("f'(x)   =", diff(f, x))         # first derivative
print("f''(x)  =", diff(f, x, 2))      # second derivative

# Partial derivatives
y = symbols('y')
g = x**2 * y + y**3
print("dg/dx  =", diff(g, x))          # treat y as constant
print("dg/dy  =", diff(g, y))

# Indefinite integral (antiderivative)
print("∫(2x + 3) dx =", integrate(2*x + 3, x))

# Definite integral (use the symbol pi, not 3.14159, to keep the result exact)
print("∫₀^π sin(x) dx =", integrate(sin(x), (x, 0, pi)))

# Limits
print("lim x→0 sin(x)/x =", limit(sin(x)/x, x, 0))
print("lim x→∞ 1/x      =", limit(1/x, x, oo))

# Taylor series: first 5 terms of e^x around 0
print("e^x series:", series(exp(x), x, 0, 5))

Output:

f(x)    = x**3 + 2*x**2 - 5*x + 1
f'(x)   = 3*x**2 + 4*x - 5
f''(x)  = 6*x + 4
dg/dx  = 2*x*y
dg/dy  = x**2 + 3*y**2
∫(2x + 3) dx = x**2 + 3*x
∫₀^π sin(x) dx = 2
lim x→0 sin(x)/x = 1
lim x→∞ 1/x      = 0
e^x series: 1 + x + x**2/2 + x**3/6 + x**4/24 + O(x**5)

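oo is SymPy’s infinity. Notice limit(sin(x)/x, x, 0) returns exactly 1, not a numerical estimate: SymPy evaluates the limit symbolically (its limit engine is based on the Gruntz algorithm rather than repeated L’Hôpital steps). In the Taylor series, O(x**5) is big-O notation for the omitted terms of order x**5 and higher; the terms shown are exact.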
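
These pieces combine naturally. As a sketch of a typical calculus workflow, the snippet below finds the critical points of an arbitrary example function and classifies them with the second-derivative test:

```python
from sympy import symbols, diff, solve

x = symbols('x', real=True)
f = x**3 - 3*x

critical_points = solve(diff(f, x), x)     # where f'(x) = 0
second = diff(f, x, 2)                     # f''(x) = 6*x

classified = {}
for point in critical_points:
    curvature = second.subs(x, point)
    if curvature > 0:
        kind = "local minimum"
    elif curvature < 0:
        kind = "local maximum"
    else:
        kind = "inconclusive"              # the test says nothing when f'' = 0
    classified[point] = kind
    print(f"x = {point}: f(x) = {f.subs(x, point)}, {kind}")

# x = -1: f(x) = 2, local maximum
# x = 1: f(x) = -2, local minimum
```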

Linear Algebra with Matrices

SymPy ships with a Matrix class that resembles a two-dimensional NumPy array but holds exact symbolic entries (note that * on SymPy matrices is matrix multiplication, not element-wise). You can compute determinants, inverses, eigenvalues, and reduced row echelon form symbolically:

# matrix_demo.py
from sympy import Matrix, symbols, eye

a, b = symbols('a b')

M = Matrix([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 10],   # 10 instead of 9 so the matrix is invertible
])

print("Determinant:", M.det())
print("Inverse:")
print(M.inv())
print("Eigenvalues:", M.eigenvals())
print("Rank:", M.rank())

# Symbolic 2x2 matrix
S = Matrix([[a, b], [b, a]])
print("S squared:", S * S)
print("S inverse:", S.inv())

Output:

Determinant: -3
Inverse:
Matrix([[-2/3, -4/3, 1], [-2/3, 11/3, -2], [1, -2, 1]])
Eigenvalues: {...}   # three irrational roots of an irreducible cubic, each with multiplicity 1 (long closed forms omitted)
Rank: 3
S squared: Matrix([[a**2 + b**2, 2*a*b], [2*a*b, a**2 + b**2]])
S inverse: Matrix([[a/(a**2 - b**2), -b/(a**2 - b**2)], [-b/(a**2 - b**2), a/(a**2 - b**2)]])

For numerical linear algebra at scale, use NumPy or SciPy — SymPy is slower because every cell carries its symbolic structure. SymPy shines when you want the closed form, not a number.
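
Reduced row echelon form is mentioned above but not shown. A minimal sketch, reusing the two-equation system from the solve() section as an augmented matrix:

```python
from sympy import Matrix

# Augmented matrix for:  x + y = 7,  2x - y = 2
augmented = Matrix([
    [1,  1, 7],
    [2, -1, 2],
])

# rref() returns the reduced matrix and the indices of the pivot columns
reduced, pivots = augmented.rref()
print(reduced)   # Matrix([[1, 0, 3], [0, 1, 4]])
print(pivots)    # (0, 1)

# The same system solved as A * v = b
A = augmented[:, :2]
b = augmented[:, 2]
solution = A.LUsolve(b)
print(solution)  # Matrix([[3], [4]])
```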

From Symbolic to Numerical: evalf() and lambdify()

Symbolic math is great, but eventually you want numbers. Two functions bridge the gap:

evalf() — evaluate a symbolic expression to a numeric value with arbitrary precision:

from sympy import pi, sqrt, E

print(pi.evalf())                  # 3.14159265358979
print(pi.evalf(50))                # 50 digits of pi
print(sqrt(2).evalf())             # 1.41421356237310
print((E**2).evalf())              # 7.38905609893065

lambdify() — convert a symbolic expression into a fast NumPy/SciPy-compatible function. This is the bridge to numerical work, plotting, and ML:

from sympy import symbols, lambdify, sin, cos
import numpy as np

x = symbols('x')
expr = sin(x) + cos(x) * x**2

# Build a numeric function from the symbolic expression
f = lambdify(x, expr, modules='numpy')

# Now call it like any vectorized function
xs = np.linspace(0, 6, 7)
print(f(xs))

lambdify generates Python source code from the symbolic expression and turns it into a callable. That code generation happens once, when lambdify() itself is called; after that, every call to f runs at NumPy speed. This pattern is the standard way to derive a gradient symbolically with SymPy, then plug it into a numerical optimizer like scipy.optimize.
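The gradient hand-off described above can be sketched as follows. The quadratic objective, the starting point, and the choice of BFGS are assumptions made for illustration; any differentiable expression would follow the same steps.

```python
import numpy as np
from scipy.optimize import minimize
from sympy import symbols, diff, lambdify

# Illustrative objective (an assumption for this sketch): a simple quadratic bowl
x, y = symbols('x y')
expr = (x - 1)**2 + 2 * (y + 3)**2

# Derive the gradient symbolically, one partial derivative per variable
grad_exprs = [diff(expr, v) for v in (x, y)]

# Turn both the objective and its gradient into NumPy-backed functions
f_num = lambdify((x, y), expr, modules='numpy')
grad_num = lambdify((x, y), grad_exprs, modules='numpy')

# minimize passes a single parameter array, so unpack it for the lambdified functions
result = minimize(
    lambda p: f_num(*p),
    x0=np.array([0.0, 0.0]),
    jac=lambda p: np.array(grad_num(*p), dtype=float),
    method='BFGS',
)
print(result.x)
```

Supplying jac means the optimizer does not have to estimate the gradient by finite differences, which usually saves function evaluations.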

Common Pitfalls and Gotchas

  • Forgetting to declare symbols. solve(x**2 - 1, x) fails with NameError if x isn’t declared first via symbols('x'). SymPy doesn’t treat Python variable names as symbolic — you have to opt in.
  • Integer division surprises. 1/2 in Python 3 is 0.5 (a float), but inside SymPy expressions you often want the exact fraction 1/2. Wrap literals with Rational(1, 2) when exactness matters: integrate(x, (x, 0, Rational(1, 2))).
  • Confusing = and Eq. solve(x**2 = 9, x) is a Python syntax error. Use either solve(x**2 - 9, x) (move everything to one side, SymPy assumes = 0) or solve(Eq(x**2, 9), x) (build an explicit equation object).
  • Slowness on large expressions. SymPy is exact, not fast. Expressions with hundreds of nested symbols can take minutes to simplify. For hot loops, lambdify the expression to NumPy and run the loop there.
  • Assumptions matter. sqrt(x**2) returns sqrt(x**2) by default because SymPy doesn’t know if x is positive. Declare it: x = symbols('x', positive=True) and now sqrt(x**2) simplifies to x.
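Several of the pitfalls above can be checked in a few lines. This is a minimal sketch; the symbol name p is only an illustration.

```python
from sympy import symbols, sqrt, Rational, Eq, solve, integrate

x = symbols('x')
p = symbols('p', positive=True)

# Without assumptions SymPy leaves sqrt(x**2) alone; with positive=True it simplifies
unsimplified = sqrt(x**2)
simplified = sqrt(p**2)
print(unsimplified, simplified)

# Rational keeps the upper bound exact, so the integral comes back as an exact fraction
exact_area = integrate(x, (x, 0, Rational(1, 2)))
print(exact_area)

# Both ways of stating x**2 = 9 give the same roots
roots_expr = solve(x**2 - 9, x)
roots_eq = solve(Eq(x**2, 9), x)
print(roots_expr, roots_eq)
```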

FAQ

Q: When should I use SymPy vs NumPy?
A: Use SymPy when you need exact answers, closed-form expressions, derivatives, or to prove an identity. Use NumPy when you have numeric arrays and need speed. They complement each other — SymPy derives the math, NumPy runs the math.

Q: Why does integrate() return an unevaluated Integral?
A: SymPy returns the unevaluated form when it cannot find a closed-form antiderivative. Some integrands have no elementary antiderivative: SymPy expresses a few of these through special functions (the Gaussian exp(-x**2) integrates to an erf expression) and leaves others, such as x**x, unevaluated. Try a definite integral with explicit bounds, or call .evalf() on a definite Integral for numerical integration via mpmath.
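A short sketch of the three situations: an integrand SymPy leaves unevaluated, one it expresses through a special function, and a numerical evaluation of a definite Integral. The integrand x**x is used here as a standard example of a function with no closed-form antiderivative.

```python
from sympy import Integral, symbols, exp, sqrt, pi, integrate, erf

x = symbols('x')

# No closed form is found, so integrate() hands back an Integral object
unevaluated = integrate(x**x, x)
print(unevaluated)

# The Gaussian has no elementary antiderivative, but SymPy writes it with erf
gaussian = integrate(exp(-x**2), x)
print(gaussian)

# A definite Integral can still be evaluated numerically with evalf()
numeric = Integral(x**x, (x, 0, 1)).evalf()
print(numeric)
```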

Q: How do I plot a SymPy expression?
A: Either use SymPy’s built-in sympy.plotting.plot() (good for quick plots), or lambdify the expression and pass it to matplotlib. The matplotlib path gives you full styling control.

Q: Can SymPy solve differential equations?
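A: Yes. dsolve() handles ordinary differential equations symbolically. It works for linear ODEs, separable equations, and many classical forms. For nonlinear ODEs with no closed-form solution, fall back to scipy.integrate.solve_ivp with a lambdified right-hand side. For PDEs, SymPy’s pdsolve() covers some first-order cases; beyond that you need a dedicated numerical PDE method.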
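As a minimal illustration, here is dsolve() applied to a simple decay equation, once for the general solution and once with an initial condition. The equation itself is just an example chosen for this sketch.

```python
from sympy import Function, Eq, symbols, dsolve, exp, checkodesol

t = symbols('t')
y = Function('y')

# y'(t) = -2*y(t): a first-order linear ODE
ode = Eq(y(t).diff(t), -2 * y(t))

# General solution, containing the integration constant C1
general = dsolve(ode, y(t))
print(general)

# Particular solution for the initial condition y(0) = 1
particular = dsolve(ode, y(t), ics={y(0): 1})
print(particular)

# checkodesol substitutes the solution back into the equation
print(checkodesol(ode, general))
```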

Q: Is SymPy thread-safe?
A: Mostly yes for read-only operations, but the assumption system has global state. If you’re running symbolic math in multiple threads, give each thread its own set of symbols and don’t share simplification caches.

Wrapping Up

SymPy gives Python the same symbolic-math power that Mathematica and Maple have offered for decades, in pure Python with a clean API. Start with symbols(), simplify(), solve(), diff(), and integrate() — those five functions cover 80% of everyday symbolic work. When you need a number, reach for evalf() or lambdify() and hand off to NumPy.

The SymPy documentation has the complete reference, including the more specialized modules (statistics, geometry, combinatorics, physics, quantum). For tutorials on related Python topics, see below.

How To Use SciPy for Scientific Computing in Python

Intermediate

NumPy handles arrays. Pandas handles tables. But when you need to solve a system of equations, find the minimum of a complex function, integrate a curve, or run a statistical test, you need SciPy. It is the Swiss Army knife of scientific computing in Python, built on top of NumPy and packed with algorithms that scientists, engineers, and data professionals use every day.

SciPy is organised into subpackages, each covering a different domain: optimization, linear algebra, statistics, signal processing, interpolation, and more. You rarely import all of SciPy at once — instead, you pull in just the subpackage you need, keeping your code clean and your imports explicit.

In this tutorial, you will learn how to install SciPy, solve optimization problems, perform statistical tests, work with linear algebra, integrate functions numerically, interpolate data points, and process signals. By the end, you will have a practical toolkit for tackling real scientific and engineering problems in Python.

Setting up SciPy environment
SciPy installs in one pip command. The math takes a bit longer.

SciPy for Scientific Computing: Quick Example

Let us start with a common task: finding the minimum of a mathematical function. This comes up everywhere from machine learning (gradient descent) to engineering (minimising material cost).

# quick_scipy.py
from scipy.optimize import minimize
import numpy as np

# Define a function: the Rosenbrock function (a classic test problem)
def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

# Find the minimum starting from an initial guess
result = minimize(rosenbrock, x0=[0, 0], method='Nelder-Mead')

print(f"Minimum found at: x={result.x[0]:.4f}, y={result.x[1]:.4f}")
print(f"Minimum value: {result.fun:.6f}")
print(f"Converged: {result.success}")
print(f"Iterations: {result.nit}")

Output:

Minimum found at: x=1.0000, y=1.0000
Minimum value: 0.000000
Converged: True
Iterations: 141

What this does step by step: The Rosenbrock function is a classic test problem with a global minimum at (1, 1). We pass it to scipy.optimize.minimize with an initial guess of (0, 0) and the Nelder-Mead algorithm, a gradient-free simplex method. In the run shown above, SciPy reports 141 iterations and converges to the minimum within its default tolerances (the printed values are rounded, not exact). The same minimize call works with any Python callable that takes an array of parameters and returns a scalar, which makes it useful for fitting models, calibrating parameters, and solving engineering problems.

Installing SciPy

SciPy depends on NumPy and ships with precompiled binaries for all major platforms, so installation is straightforward.

# Install SciPy
pip install scipy

# Verify installation and check version
python -c "import scipy; print(scipy.__version__)"

# SciPy subpackages are imported individually:
# from scipy import optimize, stats, linalg, integrate, interpolate, signal

Unlike some scientific libraries, SciPy installs cleanly on Windows, macOS, and Linux without needing a C compiler. The pip package includes optimised BLAS and LAPACK routines, so you get near-Fortran performance out of the box. If you are using Anaconda, SciPy comes preinstalled.

Statistical analysis with SciPy
scipy.stats speaks fluent probability so you don't have to.

Optimization: Finding Minimums and Solving Equations

The scipy.optimize module is one of the most-used parts of SciPy. It handles curve fitting, root finding, and general optimization.

Curve Fitting with curve_fit

When you have experimental data and want to find the best parameters for a model, curve_fit is your go-to tool.

import numpy as np
from scipy.optimize import curve_fit

# Generate noisy exponential decay data
np.random.seed(42)
x_data = np.linspace(0, 10, 50)
y_data = 3.5 * np.exp(-0.7 * x_data) + np.random.normal(0, 0.2, 50)

# Define the model function
def exponential_decay(x, amplitude, decay_rate):
    return amplitude * np.exp(-decay_rate * x)

# Fit the model to data
params, covariance = curve_fit(exponential_decay, x_data, y_data)
amplitude, decay_rate = params

print(f"Fitted amplitude: {amplitude:.3f} (true: 3.500)")
print(f"Fitted decay rate: {decay_rate:.3f} (true: 0.700)")

# Calculate uncertainties from covariance matrix
uncertainties = np.sqrt(np.diag(covariance))
print(f"Amplitude uncertainty: +/- {uncertainties[0]:.3f}")
print(f"Decay rate uncertainty: +/- {uncertainties[1]:.3f}")

curve_fit returns two things: the optimal parameters and a covariance matrix. The diagonal of the covariance matrix gives you the variance of each parameter, and taking the square root gives you the standard error. This tells you not just the best fit, but how confident you should be in each parameter.

Root Finding

Finding where a function equals zero is fundamental to solving equations. SciPy offers several root-finding algorithms.

from scipy.optimize import brentq, fsolve
import numpy as np

# Find where x^3 - 2x - 5 = 0 using Brent's method (bracketed)
def cubic(x):
    return x**3 - 2*x - 5

root = brentq(cubic, 1, 3)  # Root must be between 1 and 3
print(f"Root of x^3 - 2x - 5: {root:.6f}")
print(f"Verification: f({root:.4f}) = {cubic(root):.2e}")

# Solve a system of nonlinear equations with fsolve
def system(variables):
    x, y = variables
    eq1 = x**2 + y**2 - 4    # Circle of radius 2
    eq2 = x - y**2 + 1        # Parabola
    return [eq1, eq2]

solution = fsolve(system, x0=[1, 1])
print(f"\nSystem solution: x={solution[0]:.4f}, y={solution[1]:.4f}")
print(f"Check circle: {solution[0]**2 + solution[1]**2:.4f} (should be 4)")

brentq is generally the best choice among SciPy's bracketing root finders: you give it an interval on which the function changes sign, and for a continuous function it is guaranteed to converge to a root inside that interval. fsolve handles systems of nonlinear equations; it starts from an initial guess and uses MINPACK's hybrid Powell method, so the solution it returns depends on where you start.
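The sign-change requirement is easy to check before calling brentq, which otherwise raises a ValueError. The helper below, find_root, is a hypothetical wrapper written for this sketch, not part of SciPy.

```python
from scipy.optimize import brentq

def cubic(x):
    return x**3 - 2*x - 5

def find_root(f, a, b):
    """Return a root of f in [a, b], or None when [a, b] is not a valid bracket."""
    # brentq needs f(a) and f(b) to have opposite signs
    if f(a) * f(b) > 0:
        return None
    return brentq(f, a, b)

root = find_root(cubic, 1, 3)        # f(1) < 0 < f(3): valid bracket
no_bracket = find_root(cubic, 3, 5)  # f is positive at both ends: no bracket
print(root, no_bracket)
```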

Statistical Analysis with scipy.stats

The scipy.stats module contains over 100 probability distributions plus a comprehensive collection of statistical tests.

from scipy import stats
import numpy as np

# Generate two samples
np.random.seed(42)
group_a = np.random.normal(loc=75, scale=10, size=100)  # Mean 75, std 10
group_b = np.random.normal(loc=78, scale=10, size=100)  # Mean 78, std 10

# Descriptive statistics
desc_a = stats.describe(group_a)
print(f"Group A: mean={desc_a.mean:.2f}, variance={desc_a.variance:.2f}")
print(f"  skewness={desc_a.skewness:.3f}, kurtosis={desc_a.kurtosis:.3f}")

# Independent samples t-test
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"\nT-test: t={t_stat:.3f}, p-value={p_value:.4f}")
print(f"Significant at 0.05? {'Yes' if p_value < 0.05 else 'No'}")

# Mann-Whitney U test (non-parametric alternative)
u_stat, p_mann = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')
print(f"\nMann-Whitney U: U={u_stat:.1f}, p-value={p_mann:.4f}")

# Shapiro-Wilk normality test
w_stat, p_normal = stats.shapiro(group_a)
print(f"\nShapiro-Wilk (Group A): W={w_stat:.4f}, p-value={p_normal:.4f}")
print(f"Normal distribution? {'Yes' if p_normal > 0.05 else 'No'}")

# Pearson correlation
correlation, p_corr = stats.pearsonr(group_a[:50], group_b[:50])
print(f"\nPearson correlation: r={correlation:.3f}, p-value={p_corr:.4f}")

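The t-test tells you whether two groups have significantly different means. The Mann-Whitney U test is the rank-based alternative: it does not assume normal distributions, and it tests whether values in one group tend to be larger than in the other rather than comparing means directly. The Shapiro-Wilk test checks for departures from normality; a p-value above 0.05 means the data are consistent with a normal distribution, not that normality is proven. Always check your assumptions before picking a statistical test.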

Working with Probability Distributions

from scipy import stats
import numpy as np

# Normal distribution
normal = stats.norm(loc=100, scale=15)  # IQ distribution: mean=100, std=15
print(f"P(IQ > 130): {1 - normal.cdf(130):.4f}")
print(f"P(85 < IQ < 115): {normal.cdf(115) - normal.cdf(85):.4f}")
print(f"IQ at 95th percentile: {normal.ppf(0.95):.1f}")

# Generate random samples
samples = normal.rvs(size=1000, random_state=42)
print(f"\nSample mean: {np.mean(samples):.1f}, std: {np.std(samples):.1f}")

# Fit a distribution to data
params = stats.norm.fit(samples)
print(f"Fitted params: mean={params[0]:.1f}, std={params[1]:.1f}")

# Chi-squared goodness of fit
observed = [18, 22, 20, 25, 15]
expected = [20, 20, 20, 20, 20]
chi2, p_chi = stats.chisquare(observed, expected)
print(f"\nChi-squared test: chi2={chi2:.2f}, p-value={p_chi:.4f}")

Every continuous distribution in SciPy has the same interface: .pdf() for probability density, .cdf() for cumulative probability, .ppf() for percentiles (inverse CDF), .rvs() for random samples, and .fit() for parameter estimation. Discrete distributions follow the same pattern but use .pmf() in place of .pdf(). Once you learn one distribution, you know them all.
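To illustrate the shared interface, the sketch below calls the same methods on a normal and an exponential distribution, then shows the discrete counterpart with a binomial. The parameter values are arbitrary examples.

```python
from scipy import stats

# Two frozen continuous distributions and one discrete one
normal = stats.norm(loc=0, scale=1)
expo = stats.expon(scale=2.0)
coin = stats.binom(n=10, p=0.5)

# ppf is the inverse of cdf, so the 0.5 quantile is the median
median_normal = normal.ppf(0.5)
median_expo = expo.ppf(0.5)          # equals scale * ln(2)
round_trip = expo.cdf(median_expo)   # back to 0.5

# Discrete distributions expose pmf() in place of pdf()
p_five_heads = coin.pmf(5)

print(median_normal, median_expo, round_trip, p_five_heads)
```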

Optimization with SciPy
scipy.optimize finds the minimum your gradient descent missed.

Linear Algebra with scipy.linalg

While NumPy has basic linear algebra, scipy.linalg adds decompositions, matrix functions, and specialised solvers that go well beyond the basics.

from scipy import linalg
import numpy as np

# Solve a system of linear equations: Ax = b
A = np.array([[3, 1, -1],
              [1, 4, 2],
              [2, 1, 3]])
b = np.array([4, 17, 13])

x = linalg.solve(A, b)
print(f"Solution: {x}")
print(f"Verification Ax: {A @ x}")

# LU decomposition
P, L, U = linalg.lu(A)
print(f"\nLU decomposition:")
print(f"L (lower triangular):\n{np.round(L, 3)}")
print(f"U (upper triangular):\n{np.round(U, 3)}")

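# Eigenvalues and eigenvectors
# A is not symmetric, so eigenvalues can be complex; print them in full
# rather than keeping only the real parts
eigenvalues, eigenvectors = linalg.eig(A)
print(f"\nEigenvalues: {np.round(eigenvalues, 3)}")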

# Matrix determinant and inverse
det = linalg.det(A)
inv = linalg.inv(A)
print(f"\nDeterminant: {det:.1f}")
print(f"A @ inv(A) (should be identity):\n{np.round(A @ inv, 1)}")

# Singular Value Decomposition
U_svd, s, Vt = linalg.svd(A)
print(f"\nSingular values: {s.round(3)}")

linalg.solve is faster and more numerically stable than computing the inverse and multiplying. LU decomposition breaks a matrix into lower and upper triangular parts, which is useful when you need to solve the same system with different right-hand sides. SVD is the workhorse behind dimensionality reduction, recommendation systems, and image compression.
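Reusing a factorisation for several right-hand sides can be sketched with lu_factor and lu_solve. The second right-hand side, b2, is made up for this illustration.

```python
import numpy as np
from scipy import linalg

A = np.array([[3.0, 1.0, -1.0],
              [1.0, 4.0, 2.0],
              [2.0, 1.0, 3.0]])

# Factor the matrix once
lu, piv = linalg.lu_factor(A)

# Reuse the factorisation for each right-hand side
b1 = np.array([4.0, 17.0, 13.0])
b2 = np.array([1.0, 0.0, 0.0])
x1 = linalg.lu_solve((lu, piv), b1)
x2 = linalg.lu_solve((lu, piv), b2)

print(x1)
print(x2)
```

The expensive step is the factorisation; each additional solve only needs forward and back substitution.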

Numerical Integration

When you cannot find an analytical solution to an integral, numerical integration gives you an accurate approximation.

from scipy import integrate
import numpy as np

# Integrate a simple function: integral of sin(x) from 0 to pi
result, error = integrate.quad(lambda x: np.sin(x), 0, np.pi)
print(f"Integral of sin(x) from 0 to pi: {result:.6f} (exact: 2.0)")
print(f"Estimated error: {error:.2e}")

# Integrate a more complex function: Gaussian integral
result, error = integrate.quad(
    lambda x: np.exp(-x**2),
    -np.inf, np.inf  # Infinite limits are supported!
)
print(f"\nGaussian integral: {result:.6f} (exact: sqrt(pi) = {np.sqrt(np.pi):.6f})")

# Double integral: integral of x*y over the unit circle
def integrand(y, x):  # Note: y first, then x
    return x * y

def y_lower(x):
    return -np.sqrt(1 - x**2)

def y_upper(x):
    return np.sqrt(1 - x**2)

result, error = integrate.dblquad(integrand, -1, 1, y_lower, y_upper)
print(f"\nDouble integral of xy over unit circle: {result:.6f} (exact: 0)")

# Solve an ODE: dy/dt = -2y, y(0) = 1 (exponential decay)
def decay(t, y):
    return -2 * y

t_span = (0, 5)
t_eval = np.linspace(0, 5, 100)
solution = integrate.solve_ivp(decay, t_span, [1.0], t_eval=t_eval)

print(f"\ny(0) = {solution.y[0][0]:.4f}")
print(f"y(5) = {solution.y[0][-1]:.6f} (exact: {np.exp(-10):.6f})")

quad handles single integrals and even supports infinite limits. dblquad handles double integrals. solve_ivp solves initial value problems for ordinary differential equations, which is essential for modelling anything that changes over time: population growth, chemical reactions, mechanical systems, or circuit dynamics.
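The accuracy of solve_ivp is governed by its rtol and atol arguments, which default to 1e-3 and 1e-6. The sketch below reruns the decay problem with tighter tolerances; the specific values chosen are only an example.

```python
import numpy as np
from scipy import integrate

def decay(t, y):
    return -2 * y

exact = np.exp(-10)

# Default tolerances (rtol=1e-3, atol=1e-6)
loose = integrate.solve_ivp(decay, (0, 5), [1.0])

# Tighter tolerances for a more accurate end value
tight = integrate.solve_ivp(decay, (0, 5), [1.0], rtol=1e-9, atol=1e-12)

err_loose = abs(loose.y[0][-1] - exact)
err_tight = abs(tight.y[0][-1] - exact)
print(f"default tolerances: error = {err_loose:.2e}")
print(f"tight tolerances:   error = {err_tight:.2e}")
```

Tighter tolerances cost more function evaluations, so it is worth matching them to the accuracy you actually need.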

Interpolation: Filling in the Gaps

When you have data at discrete points and need values between them, interpolation creates a smooth function that passes through your data.

from scipy.interpolate import interp1d, CubicSpline
import numpy as np

# Known data points (e.g., temperature readings every 3 hours)
hours = np.array([0, 3, 6, 9, 12, 15, 18, 21, 24])
temps = np.array([15, 13, 14, 18, 24, 26, 23, 19, 16])

# Linear interpolation
linear_interp = interp1d(hours, temps, kind='linear')

# Cubic spline interpolation (smoother)
cubic_interp = CubicSpline(hours, temps)

# Evaluate at every half hour
fine_hours = np.linspace(0, 24, 49)
linear_temps = linear_interp(fine_hours)
cubic_temps = cubic_interp(fine_hours)

# Compare at hour 7.5
print(f"Temperature at 7:30 AM:")
print(f"  Linear interpolation: {linear_interp(7.5):.1f} C")
print(f"  Cubic spline: {cubic_interp(7.5):.1f} C")

# Cubic spline also gives derivatives
print(f"\nRate of change at noon: {cubic_interp(12, 1):.2f} C/hour")
print(f"Rate of change at 6 PM: {cubic_interp(18, 1):.2f} C/hour")

Linear interpolation draws straight lines between points -- simple but produces sharp corners. Cubic splines create smooth curves that pass through every point and have continuous first and second derivatives. The CubicSpline object can also compute derivatives at any point, which is useful for finding rates of change.
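Beyond point derivatives, a CubicSpline can be integrated and its derivative can be searched for roots. The sketch below uses the same temperature readings to estimate a daily average and the warmest time of day; treating the spline's maximum as the true peak is of course an assumption about the data.

```python
import numpy as np
from scipy.interpolate import CubicSpline

hours = np.array([0, 3, 6, 9, 12, 15, 18, 21, 24])
temps = np.array([15, 13, 14, 18, 24, 26, 23, 19, 16])
spline = CubicSpline(hours, temps)

# The spline reproduces every measurement (up to rounding)
at_knots = spline(hours)

# The definite integral of the spline gives a time-averaged temperature
mean_temp = spline.integrate(0, 24) / 24

# Roots of the first derivative are candidate extrema inside the data range
critical = spline.derivative().roots(extrapolate=False)
warmest_hour = critical[np.argmax(spline(critical))]

print(f"Mean temperature: {mean_temp:.1f} C")
print(f"Warmest at hour: {warmest_hour:.2f}")
```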

Interpolation and curve fitting
Connecting dots is easy. Connecting them correctly is scipy.interpolate.

Signal Processing Basics

The scipy.signal module handles filtering, spectral analysis, and signal manipulation -- essential for audio processing, sensor data, and time series analysis.

from scipy import signal
import numpy as np

# Create a noisy signal: clean sine wave + noise
np.random.seed(42)
fs = 1000  # Sampling frequency (Hz)
t = np.linspace(0, 1, fs, endpoint=False)
clean_signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 50 * t)
noisy_signal = clean_signal + 0.5 * np.random.randn(len(t))

# Design a low-pass Butterworth filter (remove frequencies above 20 Hz)
nyquist = fs / 2
cutoff = 20 / nyquist  # Normalize to Nyquist frequency
b, a = signal.butter(N=4, Wn=cutoff, btype='low')

# Apply the filter
filtered = signal.filtfilt(b, a, noisy_signal)

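# Measure error against the 5 Hz component: the 20 Hz low-pass is meant to
# remove the 50 Hz tone as well as the random noise
target = np.sin(2 * np.pi * 5 * t)
rms_error_before = np.sqrt(np.mean((noisy_signal - target)**2))
rms_error_after = np.sqrt(np.mean((filtered - target)**2))

print(f"Original signal RMS: {np.sqrt(np.mean(noisy_signal**2)):.3f}")
print(f"Filtered signal RMS: {np.sqrt(np.mean(filtered**2)):.3f}")
print(f"Error reduced by: {(1 - rms_error_after / rms_error_before) * 100:.1f}%")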

# Find peaks in the filtered signal
peaks, properties = signal.find_peaks(filtered, height=0.5, distance=50)
print(f"\nPeaks found: {len(peaks)}")
print(f"Peak heights: {properties['peak_heights'][:5].round(3)}")

signal.butter designs a Butterworth filter with a smooth frequency response. signal.filtfilt applies it forward and backward to eliminate phase distortion. find_peaks locates local maxima in a signal, with parameters to control minimum height and distance between peaks. This pipeline -- design filter, apply filter, find features -- is the backbone of most signal processing workflows.
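A variation on the same pipeline, offered as a sketch: passing fs to signal.butter lets you state the cutoff directly in Hz, and the second-order-sections form (output='sos' with sosfiltfilt) is generally more robust numerically than the (b, a) form for higher filter orders. The noise-free two-tone signal here is a simplification for illustration.

```python
import numpy as np
from scipy import signal

fs = 1000  # Sampling frequency (Hz)
t = np.linspace(0, 1, fs, endpoint=False)
slow = np.sin(2 * np.pi * 5 * t)          # 5 Hz component to keep
fast = 0.5 * np.sin(2 * np.pi * 50 * t)   # 50 Hz component to remove

# With fs given, Wn is in Hz; no manual Nyquist normalisation needed
sos = signal.butter(N=4, Wn=20, btype='low', fs=fs, output='sos')

# Zero-phase filtering, the SOS counterpart of filtfilt
filtered = signal.sosfiltfilt(sos, slow + fast)

residual_rms = np.sqrt(np.mean((filtered - slow) ** 2))
print(f"Residual RMS vs 5 Hz component: {residual_rms:.4f}")
```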

Real-World Example: Analysing Experimental Data

Let us combine multiple SciPy tools to analyse a realistic dataset: fitting a model to noisy measurements, testing hypotheses, and quantifying uncertainty.

import numpy as np
from scipy import stats, optimize, integrate

# Simulate experimental data: drug dosage vs response
np.random.seed(42)
doses = np.array([0, 0.5, 1, 2, 5, 10, 20, 50, 100])
true_response = 100 * doses / (10 + doses)  # Hill equation (pharmacology)
measured_response = true_response + np.random.normal(0, 5, len(doses))
measured_response = np.clip(measured_response, 0, 100)

# Fit the Hill equation to data
def hill_equation(dose, v_max, k_half):
    return v_max * dose / (k_half + dose)

params, cov = optimize.curve_fit(
    hill_equation, doses, measured_response,
    p0=[100, 10],  # Initial guesses
    bounds=([0, 0], [200, 100])  # Parameter bounds
)
v_max, k_half = params
errors = np.sqrt(np.diag(cov))

print("Hill Equation Fit Results:")
print(f"  V_max = {v_max:.1f} +/- {errors[0]:.1f} (true: 100)")
print(f"  K_half = {k_half:.1f} +/- {errors[1]:.1f} (true: 10)")

# Calculate R-squared
predicted = hill_equation(doses, *params)
ss_res = np.sum((measured_response - predicted)**2)
ss_tot = np.sum((measured_response - np.mean(measured_response))**2)
r_squared = 1 - ss_res / ss_tot
print(f"  R-squared = {r_squared:.4f}")

# Calculate Area Under the Curve (AUC) using integration
auc, auc_error = integrate.quad(
    lambda d: hill_equation(d, *params), 0, 100
)
print(f"\nArea Under Curve (0-100): {auc:.1f} +/- {auc_error:.2e}")

# Statistical test: is the response at dose=50 significantly different from dose=5?
np.random.seed(42)
samples_dose5 = hill_equation(5, *params) + np.random.normal(0, 5, 30)
samples_dose50 = hill_equation(50, *params) + np.random.normal(0, 5, 30)

t_stat, p_value = stats.ttest_ind(samples_dose5, samples_dose50)
print(f"\nDose 5 vs Dose 50 comparison:")
print(f"  Mean at dose 5: {np.mean(samples_dose5):.1f}")
print(f"  Mean at dose 50: {np.mean(samples_dose50):.1f}")
print(f"  t-statistic: {t_stat:.3f}")
print(f"  p-value: {p_value:.2e}")
print(f"  Significant difference: {'Yes' if p_value < 0.05 else 'No'}")

This example mirrors a real pharmacology workflow: fit a dose-response curve, quantify parameter uncertainty, calculate the area under the curve (a standard efficacy metric), and test whether two dosage levels produce significantly different responses. The same pattern applies to any field where you need to model data and draw statistical conclusions.

Frequently Asked Questions

What is the difference between NumPy and SciPy?

NumPy provides the fundamental array data structure and basic operations like element-wise math, reshaping, and basic linear algebra. SciPy builds on NumPy and adds higher-level scientific algorithms: optimization, statistics, signal processing, integration, interpolation, and advanced linear algebra. Think of NumPy as the foundation and SciPy as the specialised toolkit.

Should I use scipy.linalg or numpy.linalg?

scipy.linalg contains everything in numpy.linalg plus additional decompositions and solvers. It is also always built against BLAS and LAPACK, which can make it faster. For basic operations like norm or det, either works fine. For decompositions (LU, Cholesky, SVD) or specialised solvers, prefer scipy.linalg.

How do I choose between different optimization methods?

If your function is smooth and differentiable, use L-BFGS-B or BFGS (fast, gradient-based). If you cannot compute gradients, use Nelder-Mead or Powell. If your function has many local minima, consider differential_evolution (global optimizer). For bounded problems, L-BFGS-B handles box constraints natively.
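To compare a gradient-free and a gradient-based method on the same problem, the sketch below uses SciPy's built-in rosen and rosen_der helpers. The starting point is an arbitrary choice, and the evaluation counts you see will depend on your SciPy version.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = [0, 0]

# Gradient-free simplex search
gradient_free = minimize(rosen, x0, method='Nelder-Mead')

# Quasi-Newton method using the analytic gradient
gradient_based = minimize(rosen, x0, method='BFGS', jac=rosen_der)

print("Nelder-Mead:", gradient_free.x, "function evaluations:", gradient_free.nfev)
print("BFGS:       ", gradient_based.x, "function evaluations:", gradient_based.nfev)
```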

Can SciPy handle large datasets?

SciPy works with NumPy arrays, so it handles arrays that fit in memory efficiently. For very large sparse matrices, use scipy.sparse, which stores only non-zero elements. For datasets larger than memory, consider chunked processing or libraries like Dask that parallelize SciPy operations.
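As a small illustration of scipy.sparse, the sketch below builds a tridiagonal system and solves it without ever forming the dense matrix. The matrix and its size are assumptions chosen for the example.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

n = 1000

# Tridiagonal matrix: only about 3n of the n*n entries are stored
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A = sparse.diags([off, main, off], offsets=[-1, 0, 1], format='csc')

b = np.ones(n)
x = spsolve(A, b)

print(f"Stored non-zeros: {A.nnz} of {n * n} entries")
```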

How do I choose the right statistical test?

For comparing two group means with normal data, use the t-test (ttest_ind). For non-normal data, use Mann-Whitney U (mannwhitneyu). For more than two groups, use one-way ANOVA (f_oneway) or Kruskal-Wallis (kruskal). Always check normality with shapiro first and check equal variances with levene before choosing a parametric test.
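The decision rule above can be sketched as a small helper. The function compare_two_groups and the 0.05 threshold are hypothetical choices for this illustration; a real analysis should also weigh sample size and study design rather than rely on an automatic rule.

```python
import numpy as np
from scipy import stats

def compare_two_groups(a, b, alpha=0.05):
    """Pick a two-sample test from simple assumption checks (rule of thumb only)."""
    normal = stats.shapiro(a).pvalue > alpha and stats.shapiro(b).pvalue > alpha
    if not normal:
        # Non-normal data: rank-based test
        name = "mannwhitneyu"
        result = stats.mannwhitneyu(a, b, alternative="two-sided")
    else:
        # Normal data: t-test, switching to Welch's version for unequal variances
        equal_var = stats.levene(a, b).pvalue > alpha
        name = "ttest_ind" if equal_var else "welch"
        result = stats.ttest_ind(a, b, equal_var=equal_var)
    return name, result.pvalue

# Strongly skewed samples, so the normality check should fail
rng = np.random.default_rng(0)
skewed_a = rng.exponential(size=200)
skewed_b = rng.exponential(size=200) + 1.0
chosen, p = compare_two_groups(skewed_a, skewed_b)
print(chosen, p)
```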

Wrapping Up

SciPy gives you access to decades of scientific computing algorithms through a clean, consistent Python interface. You have learned how to optimise functions and fit models with scipy.optimize, run statistical tests and work with distributions using scipy.stats, solve linear algebra problems with scipy.linalg, integrate functions numerically with scipy.integrate, interpolate data with scipy.interpolate, and process signals with scipy.signal. The key to using SciPy effectively is knowing which subpackage to reach for and understanding the assumptions behind each algorithm. Start with the examples in this tutorial, adapt them to your own data, and you will find that SciPy handles the mathematical heavy lifting while you focus on the science.

How To Create Data Visualizations with Seaborn in Python

Intermediate

Raw numbers in a spreadsheet rarely tell a compelling story. A well-crafted chart, on the other hand, can reveal patterns in seconds that would take minutes of scanning rows and columns. Python’s seaborn library sits on top of matplotlib and turns complex statistical visualizations into one-line function calls with beautiful default styling. Whether you need a quick histogram, a correlation heatmap, or a multi-faceted regression plot, Seaborn handles the heavy lifting so you can focus on understanding your data.

In this tutorial, you will learn how to install Seaborn, create common chart types including scatter plots, bar charts, histograms, and heatmaps, customise styles and colour palettes, work with real-world datasets, build multi-plot grids with FacetGrid, and export publication-ready figures. By the end, you will have the tools to turn any pandas DataFrame into a visual story.

Getting started with Seaborn
Seaborn makes matplotlib pretty. That's the whole pitch.

Data Visualizations with Seaborn: Quick Example

Let us start with a scatter plot that reveals the relationship between two variables in a built-in dataset. This gets you from zero to a polished chart in a handful of lines.

# quick_seaborn.py
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")
plt.title("Tips by Total Bill")
plt.tight_layout()
plt.savefig("tips_scatter.png", dpi=150)
plt.show()

What this does step by step:

sns.load_dataset("tips") loads Seaborn's example DataFrame of restaurant tipping data (it is downloaded from the seaborn-data repository on first use, so an internet connection is needed, and cached afterwards). sns.scatterplot creates a scatter plot with total_bill on the x-axis and tip on the y-axis, automatically colouring points by time (Lunch vs Dinner). The plt.tight_layout() call prevents labels from being clipped, and savefig exports the chart at 150 DPI. You should see a clear upward trend: larger bills tend to produce larger tips, with dinner and lunch points distinguished by colour.

Installing Seaborn and Its Dependencies

Seaborn requires matplotlib, pandas, and numpy, but pip handles all of them automatically.

# Install seaborn (pulls matplotlib, pandas, numpy)
pip install seaborn

# Verify the installation
python -c "import seaborn as sns; print(sns.__version__)"

If you are using Jupyter notebooks, Seaborn works out of the box. For scripts, remember to call plt.show() to display charts in a window, or use plt.savefig() to save them directly to files. Seaborn 0.12 and above also offers a newer interface in the seaborn.objects module, but the classic functional API we use here remains fully supported and is still the most common approach you will find in tutorials and production code. The errorbar parameter used in later examples also requires version 0.12 or newer.

Creating plots with Seaborn
One line of code, one beautiful plot. Seaborn delivers.

Essential Chart Types in Seaborn

Seaborn organises its plotting functions into categories based on the type of relationship you want to show. Understanding which function to use for your data is half the battle.

Scatter Plots and Line Plots for Relationships

Use sns.scatterplot when you want to see how two continuous variables relate. Use sns.lineplot when your x-axis represents a sequence like time.

import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plot with size encoding
tips = sns.load_dataset("tips")
sns.scatterplot(
    data=tips,
    x="total_bill",
    y="tip",
    hue="day",
    size="size",
    sizes=(20, 200),
    alpha=0.7
)
plt.title("Tip Amount vs Total Bill by Day")
plt.show()

# Line plot for time series
fmri = sns.load_dataset("fmri")
sns.lineplot(
    data=fmri,
    x="timepoint",
    y="signal",
    hue="region",
    style="event",
    errorbar="sd"
)
plt.title("FMRI Signal Over Time")
plt.show()

The hue parameter assigns colours by category, size maps a numeric column to marker size (sizes=(20, 200) sets the smallest and largest marker), and alpha controls transparency so overlapping points remain visible. For line plots, errorbar="sd" adds a shaded band showing the standard deviation, giving you a sense of how much the data varies at each time point.

Bar Charts and Count Plots for Categories

When one of your axes represents a category rather than a continuous number, bar charts are the right choice.

# Average tip by day
sns.barplot(data=tips, x="day", y="tip", hue="sex", errorbar="sd")
plt.title("Average Tip by Day and Gender")
plt.show()

# Count occurrences in each category
sns.countplot(data=tips, x="day", hue="smoker", palette="Set2")
plt.title("Visits per Day by Smoker Status")
plt.show()

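sns.barplot aggregates each category (the mean by default) and adds error bars: a 95% confidence interval by default, or the standard deviation when you pass errorbar="sd" as we do here. sns.countplot simply tallies how many rows fall into each category, which is perfect for understanding the distribution of categorical variables in your dataset.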

Histograms and Distribution Plots

Understanding how a single variable is distributed is fundamental to any analysis. Seaborn gives you several options.

# Histogram with KDE overlay
sns.histplot(data=tips, x="total_bill", bins=25, kde=True, color="steelblue")
plt.title("Distribution of Total Bills")
plt.show()

# KDE plot comparing groups
sns.kdeplot(data=tips, x="total_bill", hue="time", fill=True, alpha=0.5)
plt.title("Bill Distribution: Lunch vs Dinner")
plt.show()

# Box plot for comparing distributions
sns.boxplot(data=tips, x="day", y="total_bill", hue="smoker", palette="coolwarm")
plt.title("Bill Amounts by Day and Smoker Status")
plt.show()

Setting kde=True overlays a kernel density estimate curve on the histogram, smoothing out the bars into a continuous shape. sns.kdeplot with fill=True creates shaded density curves, making it easy to compare two groups visually. Box plots show the median, quartiles, and outliers in a compact format that works well when you have multiple categories to compare side by side.

Customizing Seaborn charts
Colors, styles, themes. Make your data look like it deserves a gallery.

Building Heatmaps for Correlation Analysis

Heatmaps turn a matrix of numbers into a colour-coded grid, making it easy to spot strong correlations at a glance. They are one of Seaborn’s most popular features.

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Load a dataset and compute correlation matrix
penguins = sns.load_dataset("penguins").dropna()
numeric_cols = penguins.select_dtypes(include=[np.number])
corr_matrix = numeric_cols.corr()

# Create an annotated heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(
    corr_matrix,
    annot=True,
    fmt=".2f",
    cmap="RdBu_r",
    center=0,
    square=True,
    linewidths=0.5,
    vmin=-1,
    vmax=1
)
plt.title("Penguin Measurements Correlation Matrix")
plt.tight_layout()
plt.show()

The annot=True parameter prints the correlation value inside each cell, while fmt=".2f" rounds it to two decimal places. cmap="RdBu_r" uses a red-blue diverging colour scheme where red means strong positive correlation and blue means strong negative. Setting center=0 ensures that zero correlation appears as white, making the pattern immediately interpretable.
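Because a correlation matrix is symmetric, a common refinement is to mask the upper triangle so each pair appears once. The sketch below uses synthetic data (an assumption, so it runs without downloading a dataset) and a non-interactive matplotlib backend.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (an assumption for this sketch)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))
df["b"] = df["a"] * 0.8 + df["b"] * 0.2  # make columns a and b strongly correlated
corr = df.corr()

# Hide the upper triangle (including the diagonal), which mirrors the lower one
mask = np.triu(np.ones_like(corr, dtype=bool))

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap="RdBu_r",
            center=0, vmin=-1, vmax=1, ax=ax)
fig.tight_layout()
fig.savefig("corr_lower_triangle.png", dpi=150)
```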

Customising Styles and Colour Palettes

Seaborn comes with five built-in themes and a flexible palette system that lets you match your charts to any brand or presentation style.

# Set a global theme
sns.set_theme(style="whitegrid", font_scale=1.2)

# Compare available styles
fig = plt.figure(figsize=(16, 4))
styles = ["darkgrid", "whitegrid", "dark", "white"]

for i, style in enumerate(styles, start=1):
    # A style only affects axes created while the context is active
    with sns.axes_style(style):
        ax = fig.add_subplot(1, 4, i)
    sns.barplot(data=tips, x="day", y="tip", ax=ax, palette="viridis")
    ax.set_title(style)

plt.tight_layout()
plt.show()

The five styles are darkgrid, whitegrid, dark, white, and ticks. For colour palettes, you can use named palettes like "viridis", "Set2", or "coolwarm", or create your own with sns.color_palette("husl", 8) for 8 evenly spaced hues. The font_scale parameter is especially useful when preparing charts for presentations where you need larger text.

# Custom colour palette
custom_palette = sns.color_palette(["#2ecc71", "#e74c3c", "#3498db", "#f39c12"])
sns.barplot(data=tips, x="day", y="tip", palette=custom_palette)
plt.title("Tips by Day (Custom Colours)")
plt.show()
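If you want to check what a palette contains before handing it to a plot, you can inspect it directly. The sketch below assumes nothing beyond Seaborn itself: sns.color_palette returns a list-like object of RGB tuples, and its as_hex() method converts them to hex strings.

```python
import seaborn as sns

# Eight evenly spaced hues from the HUSL colour space
husl = sns.color_palette("husl", 8)
print(len(husl))          # number of colours in the palette
print(husl.as_hex()[:3])  # first three colours as hex strings

# A palette built from explicit hex codes round-trips through as_hex()
custom = sns.color_palette(["#2ecc71", "#e74c3c", "#3498db", "#f39c12"])
custom_hex = list(custom.as_hex())
print(custom_hex)
```

In a notebook, evaluating a palette on its own line also renders it as a strip of colour swatches.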

Multi-Plot Grids with FacetGrid

When you want to see how a pattern changes across different categories, FacetGrid creates a grid of small multiples, each showing the same chart type for a different subset of your data.

# FacetGrid: histogram for each day
g = sns.FacetGrid(tips, col="day", col_wrap=2, height=4)
g.map_dataframe(sns.histplot, x="total_bill", kde=True, color="steelblue")
g.set_titles("{col_name}")
g.set_axis_labels("Total Bill ($)", "Count")
g.tight_layout()
plt.show()

# pairplot: scatter matrix for numeric columns (returns a PairGrid)
penguins = sns.load_dataset("penguins").dropna()
g = sns.pairplot(
    penguins,
    hue="species",
    diag_kind="kde",
    plot_kws={"alpha": 0.6, "s": 40}
)
g.figure.suptitle("Penguin Species Comparison", y=1.02)
plt.show()

FacetGrid takes a DataFrame and a column to split on (col for columns, row for rows). The col_wrap parameter controls how many plots fit in each row before wrapping. pairplot is a convenience function that creates a scatter matrix showing every numeric column against every other, with distribution plots on the diagonal. It is one of the fastest ways to explore a new dataset.

Multi-plot layouts with Seaborn
Subplots let you tell multiple stories on one canvas.

Real-World Example: Analysing Airline Passenger Trends

Let us put everything together with a real-world scenario. Suppose you have a dataset of flight information and want to understand seasonal passenger patterns.

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Load the flights dataset (monthly passenger counts 1949-1960)
flights = sns.load_dataset("flights")

# Pivot for heatmap
flights_pivot = flights.pivot(index="month", columns="year", values="passengers")

# Create a comprehensive dashboard
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Heatmap of passengers by month and year
sns.heatmap(flights_pivot, annot=True, fmt="d", cmap="YlOrRd",
            ax=axes[0, 0], cbar_kws={"label": "Passengers"})
axes[0, 0].set_title("Monthly Passengers (1949-1960)")

# 2. Line plot showing yearly trends
sns.lineplot(data=flights, x="year", y="passengers", hue="month",
             palette="tab20", ax=axes[0, 1], legend=False)
axes[0, 1].set_title("Passenger Trends by Month")

# 3. Box plot of monthly distributions
sns.boxplot(data=flights, x="month", y="passengers",
            palette="coolwarm", ax=axes[1, 0])
axes[1, 0].set_title("Monthly Passenger Distribution")
axes[1, 0].tick_params(axis="x", rotation=45)

# 4. Bar plot of yearly totals
yearly = flights.groupby("year")["passengers"].sum().reset_index()
sns.barplot(data=yearly, x="year", y="passengers",
            palette="Blues_d", ax=axes[1, 1])
axes[1, 1].set_title("Total Passengers per Year")
axes[1, 1].tick_params(axis="x", rotation=45)

plt.suptitle("Flight Passenger Analysis Dashboard", fontsize=16, y=1.02)
plt.tight_layout()
plt.savefig("flight_dashboard.png", dpi=150, bbox_inches="tight")
plt.show()

This dashboard combines four chart types into a single figure. The heatmap reveals that summer months consistently have the highest passenger counts. The line plot shows a clear upward trend across all months over the years. The box plot highlights that July has the widest range of values, while the bar chart confirms that total yearly passengers grew steadily from 1949 to 1960. Building multi-panel dashboards like this is one of the most practical skills you can develop with Seaborn.
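The pivot step is what makes the heatmap panel possible: sns.heatmap expects a wide grid with one row per month and one column per year, while the dataset stores one row per observation. The sketch below shows the same reshape on a tiny hand-typed table (illustrative values, not loaded from the dataset).

```python
import pandas as pd

# Long format: one row per (year, month) observation
long_df = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# Wide format: months become rows, years become columns
wide = long_df.pivot(index="month", columns="year", values="passengers")
print(wide)
```

With plain strings the months come out sorted alphabetically. The flights dataset stores month as an ordered categorical, which is why its pivot keeps calendar order.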

Saving and Exporting Publication-Ready Figures

Creating a great chart is only half the job. You also need to export it at the right resolution and format for your audience.

# Save as PNG at high resolution
fig, ax = plt.subplots(figsize=(10, 6))
sns.histplot(data=tips, x="total_bill", kde=True, ax=ax)
ax.set_title("Distribution of Total Bills")

# PNG at print quality (dpi=150 is usually enough for web and slides)
plt.savefig("chart.png", dpi=300, bbox_inches="tight", facecolor="white")

# SVG for scalable vector graphics (papers, reports)
plt.savefig("chart.svg", format="svg", bbox_inches="tight")

# PDF for LaTeX documents
plt.savefig("chart.pdf", format="pdf", bbox_inches="tight")

plt.close()
print("Charts saved successfully")

Use dpi=300 for print quality and dpi=150 for web use. The bbox_inches="tight" parameter trims whitespace around the chart. SVG format is ideal for reports because it scales without pixelation. Always call plt.close() after saving to free memory, especially when generating many charts in a loop.

Exporting Seaborn visualizations
Export your masterpiece before the kernel crashes.

Frequently Asked Questions

What is the difference between Seaborn and Matplotlib?

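Matplotlib is the foundation that handles the actual drawing of axes, lines, and shapes. Seaborn sits on top of matplotlib and provides a higher-level interface with better default styles, built-in statistical aggregation, and simpler syntax for common chart types. You can mix both in the same script: axes-level functions such as sns.scatterplot return a matplotlib Axes object, while figure-level functions such as sns.relplot, sns.catplot, and sns.pairplot return a grid object (FacetGrid or PairGrid) that exposes the underlying matplotlib figure and axes.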
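A quick way to see how the two libraries fit together is to check what Seaborn hands back. The sketch below uses a tiny made-up DataFrame: the axes-level call returns a matplotlib Axes that accepts ordinary matplotlib methods, while the figure-level call returns a FacetGrid.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
from matplotlib.axes import Axes
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 3, 5]})  # illustrative data

# Axes-level function: returns a matplotlib Axes
ax = sns.scatterplot(data=df, x="x", y="y")
ax.axhline(df["y"].mean(), linestyle="--", color="grey")  # plain matplotlib call
returns_axes = isinstance(ax, Axes)

# Figure-level function: returns a FacetGrid wrapping the figure
g = sns.relplot(data=df, x="x", y="y", height=3)
g.ax.set_title("Figure-level plot")  # the single Axes inside the grid
returns_grid = isinstance(g, sns.FacetGrid)

print(returns_axes, returns_grid)
plt.close("all")
```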

Can I use Seaborn with data that is not in a pandas DataFrame?

Yes, most Seaborn functions accept numpy arrays, Python lists, or dictionaries in addition to DataFrames. However, the DataFrame interface is the most powerful because it allows you to reference column names directly for parameters like hue, size, and style. Converting your data to a DataFrame first is almost always worth the extra line of code.
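As a small illustration, the sketch below passes a Python list and a NumPy array straight to sns.scatterplot, then draws the same data from a DataFrame so the hue parameter can refer to a column by name. All values and the kind column are invented for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

x = [1, 2, 3, 4, 5]                        # plain Python list
y = np.array([2.0, 3.5, 3.0, 5.5, 6.0])    # NumPy array

# Vectors can be passed directly, with no DataFrame involved
fig1, ax1 = plt.subplots()
sns.scatterplot(x=x, y=y, ax=ax1)
n_points = len(ax1.collections[0].get_offsets())

# The DataFrame route lets semantic parameters use column names
df = pd.DataFrame({"x": x, "y": y, "kind": ["a", "a", "b", "b", "b"]})
fig2, ax2 = plt.subplots()
sns.scatterplot(data=df, x="x", y="y", hue="kind", ax=ax2)

print(n_points)
plt.close("all")
```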

How do I change the figure size in Seaborn?

For axes-level functions like sns.scatterplot, create the figure first with plt.figure(figsize=(10, 6)) or fig, ax = plt.subplots(figsize=(10, 6)) and pass the ax parameter. For figure-level functions like sns.catplot, use the height and aspect parameters directly.
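The two sizing approaches can be compared side by side. In the sketch below (synthetic data, hypothetical column names), the axes-level plot takes its size from plt.subplots, while the figure-level plot derives its width from height multiplied by aspect.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c", "c"],
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2.0, 2.5, 3.1, 3.8, 4.4, 5.0],
})

# Axes-level: size the figure yourself, then pass ax
fig, ax = plt.subplots(figsize=(10, 6))
sns.scatterplot(data=df, x="x", y="y", ax=ax)
axes_level_size = tuple(fig.get_size_inches())

# Figure-level: each facet is `height` inches tall and height * aspect wide
g = sns.catplot(data=df, x="group", y="y", height=4, aspect=1.5)
figure_level_size = tuple(g.figure.get_size_inches())

print(axes_level_size, figure_level_size)
plt.close("all")
```

With a single facet and no legend, height=4 and aspect=1.5 give a figure 6 inches wide and 4 inches tall. Adding more facets multiplies the overall figure size accordingly.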

Why do my Seaborn charts look different from the examples online?

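Seaborn's defaults and API have changed across releases. Since version 0.8, importing Seaborn no longer applies its theme automatically, so a chart drawn without an explicit theme uses plain matplotlib styling. Version 0.11 added sns.set_theme() and deprecated distplot in favour of histplot, displot, and kdeplot, and version 0.12 made most plotting parameters keyword-only. Run sns.set_theme() at the start of your script to apply the Seaborn defaults, and check your version with sns.__version__ when a tutorial's code behaves differently from yours.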
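A version check can be done in a couple of lines. The sketch below is one possible approach rather than an official recipe: it reads the first two components of sns.__version__ and falls back to the older sns.set() call on installations that predate set_theme().

```python
import seaborn as sns

version = sns.__version__
# Compare only major and minor, which avoids parsing suffixes such as "dev0"
major, minor = (int(part) for part in version.split(".")[:2])

uses_old_api = (major, minor) < (0, 11)
if uses_old_api:
    sns.set()        # pre-0.11 spelling of the same call
else:
    sns.set_theme()  # applies the Seaborn default theme

print(f"Seaborn {version}, pre-0.11 API: {uses_old_api}")
```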

How do I add labels and annotations to Seaborn plots?

Since axes-level Seaborn functions return a matplotlib Axes object, you can use all matplotlib annotation functions on the result. Call ax.set_xlabel(), ax.set_ylabel(), and ax.set_title() for basic labels. For annotations pointing to specific data points, use ax.annotate("text", xy=(x, y), xytext=(x2, y2), arrowprops=dict(arrowstyle="->")).

Wrapping Up

Seaborn transforms the often tedious process of data visualization into something fast and enjoyable. You have learned how to create scatter plots, bar charts, histograms, heatmaps, and multi-plot grids, all with Seaborn’s clean one-line syntax. The key to mastering Seaborn is understanding which chart type matches your data: use scatter plots for two continuous variables, bar charts for categorical comparisons, histograms and KDE plots for distributions, and heatmaps for correlation matrices. Combined with FacetGrid for multi-panel layouts and Seaborn’s built-in themes for consistent styling, you now have a complete toolkit for turning raw data into compelling visual stories. Start with the built-in datasets to practice, then apply these patterns to your own data.

Related Articles