You have probably written dozens of if-elif chains that check a variable against a list of possible values. Maybe it is an HTTP status code, a command from user input, or a message type from an API. The chain starts small, then grows to 15 branches, and suddenly the logic is hard to follow and even harder to extend. Python 3.10 introduced structural pattern matching with the match and case statements to solve exactly this problem.
Structural pattern matching is built into Python 3.10 and later — no extra libraries needed. It goes far beyond a simple switch statement. You can match against literal values, destructure sequences and dictionaries, bind variables, add guard conditions, and even match class instances by their attributes. If you have used pattern matching in Rust, Scala, or Elixir, Python’s version will feel familiar but with its own Pythonic style.
In this article, you will learn how match-case works starting with a quick example, then move through literal patterns, sequence unpacking, mapping patterns, class patterns, guard clauses, and OR patterns. We will finish with a real-life CLI command parser that ties everything together. By the end, you will be able to replace complex branching logic with clean, readable pattern matching code.
Python match-case: Quick Example
Here is the simplest useful example of match-case — handling HTTP status codes. This runs on Python 3.10 or later:
# quick_example.py
def describe_status(code):
    match code:
        case 200:
            return "OK -- request succeeded"
        case 404:
            return "Not Found -- resource does not exist"
        case 500:
            return "Server Error -- something broke on the server"
        case _:
            return f"Unknown status code: {code}"

print(describe_status(200))
print(describe_status(404))
print(describe_status(999))
Output:
OK -- request succeeded
Not Found -- resource does not exist
Unknown status code: 999
The match statement evaluates the subject expression (code) and compares it against each case pattern in order. The first matching pattern wins, and its block runs. The underscore _ is the wildcard pattern — it matches anything and acts as your default branch, similar to else in an if-chain.
This looks like a switch statement on the surface, but as you will see in the following sections, match-case can destructure data structures, bind variables, and match complex nested objects — things no switch statement can do.
What Is Structural Pattern Matching and Why Use It?
Structural pattern matching lets you check whether a value has a particular structure and extract parts of it in a single step. Think of it as an X-ray machine for your data: you describe the shape you expect, and Python checks if the data fits that shape while pulling out the pieces you need.
The key difference from if-elif chains is that pattern matching is declarative. Instead of writing procedural code that tests conditions one by one, you describe what the data should look like. Python handles the checking and unpacking for you.
| Feature | if-elif Chain | match-case |
| --- | --- | --- |
| Simple value comparison | Works fine | Works fine, slightly cleaner |
| Destructuring sequences | Manual indexing or unpacking | Built-in with capture variables |
| Nested data extraction | Multiple lines of checks | Single pattern describes the shape |
| Type checking + attribute access | isinstance() + getattr() | Class patterns handle both at once |
| Combining conditions | and/or in conditions | Guards and OR patterns |
| Readability at 5+ branches | Gets messy fast | Each case is self-contained |
Pattern matching shines when you need to handle multiple message types, parse command structures, process API responses with varying shapes, or route events based on their content. For simple two-way checks, a regular if-else is still the right tool.
Matching Literal Values
The most basic use of match-case is matching against literal values — integers, strings, booleans, and None. This is the direct replacement for a long if-elif chain that compares a variable against constants:
# literal_patterns.py
def get_day_type(day):
    match day.lower():
        case "monday" | "tuesday" | "wednesday" | "thursday" | "friday":
            return "weekday"
        case "saturday" | "sunday":
            return "weekend"
        case _:
            return "not a valid day"

print(get_day_type("Monday"))
print(get_day_type("Saturday"))
print(get_day_type("Funday"))
Output:
weekday
weekend
not a valid day
The pipe operator | creates an OR pattern, letting you match multiple values in a single case. This is much cleaner than writing if day in ("monday", "tuesday", ...) when each group needs different handling. Notice we call .lower() on the subject expression itself — all transformations happen before matching begins.
Destructuring Sequences
One of the most powerful features of match-case is sequence patterns. You can match lists and tuples by their structure, extract specific elements into variables, and even capture variable-length remainders with the star operator:
# sequence_patterns.py
def process_command(command_parts):
    match command_parts:
        case ["quit"]:
            return "Exiting program"
        case ["hello", name]:
            return f"Hello, {name}!"
        case ["move", direction, steps]:
            return f"Moving {direction} by {steps} steps"
        case ["add", *items]:
            return f"Adding {len(items)} items: {', '.join(items)}"
        case []:
            return "Empty command"
        case _:
            return f"Unknown command: {command_parts}"

print(process_command(["quit"]))
print(process_command(["hello", "Alice"]))
print(process_command(["move", "north", "5"]))
print(process_command(["add", "milk", "eggs", "bread"]))
print(process_command([]))
Output:
Exiting program
Hello, Alice!
Moving north by 5 steps
Adding 3 items: milk, eggs, bread
Empty command
Each case describes the shape of the list. The pattern ["hello", name] matches any two-element list where the first element is literally "hello", and it binds the second element to the variable name. The *items pattern captures all remaining elements after "add", similar to how *args works in function signatures. This lets you handle variable-length commands without writing manual length checks.
Matching Dictionaries
Mapping patterns let you match dictionaries by checking for specific keys and extracting their values. This is incredibly useful for processing JSON responses from APIs where the shape of the data tells you what type of message or event you are dealing with:
# mapping_patterns.py
def handle_event(event):
    match event:
        case {"type": "click", "element": element, "x": x, "y": y}:
            return f"Click on {element} at ({x}, {y})"
        case {"type": "keypress", "key": key}:
            return f"Key pressed: {key}"
        case {"type": "scroll", "direction": direction}:
            return f"Scrolled {direction}"
        case {"type": unknown_type}:
            return f"Unknown event type: {unknown_type}"
        case _:
            return "Invalid event format"

print(handle_event({"type": "click", "element": "button", "x": 100, "y": 200}))
print(handle_event({"type": "keypress", "key": "Enter"}))
print(handle_event({"type": "scroll", "direction": "down", "amount": 3}))
print(handle_event({"type": "resize"}))
Output:
Click on button at (100, 200)
Key pressed: Enter
Scrolled down
Unknown event type: resize
Mapping patterns only check for the keys you specify — extra keys in the dictionary are ignored. The scroll event dictionary has an amount key that the pattern does not mention, and that is fine. The pattern {"type": unknown_type} matches any dictionary with a "type" key and captures its value. This makes mapping patterns perfect for processing JSON-like data where different message types have different fields.
Matching Class Instances
Class patterns combine type checking and attribute extraction in a single step. Instead of writing isinstance() checks followed by attribute access, you describe the class and the attribute values you expect:
# class_patterns.py
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

@dataclass
class Circle:
    center: Point
    radius: float

@dataclass
class Rectangle:
    origin: Point
    width: float
    height: float

def describe_shape(shape):
    match shape:
        case Circle(center=Point(x=0, y=0), radius=r):
            return f"Circle at origin with radius {r}"
        case Circle(center=center, radius=r):
            return f"Circle at ({center.x}, {center.y}) with radius {r}"
        case Rectangle(origin=origin, width=w, height=h) if w == h:
            return f"Square at ({origin.x}, {origin.y}) with side {w}"
        case Rectangle(origin=origin, width=w, height=h):
            return f"Rectangle at ({origin.x}, {origin.y}), {w}x{h}"
        case _:
            return "Unknown shape"

print(describe_shape(Circle(Point(0, 0), 5)))
print(describe_shape(Circle(Point(3, 4), 2.5)))
print(describe_shape(Rectangle(Point(1, 1), 10, 10)))
print(describe_shape(Rectangle(Point(0, 0), 8, 3)))
Output:
Circle at origin with radius 5
Circle at (3, 4) with radius 2.5
Square at (1, 1) with side 10
Rectangle at (0, 0), 8x3
Notice how the first Circle case uses a nested pattern — it matches a Circle whose center is specifically at the origin Point(x=0, y=0). The Rectangle case uses a guard clause (if w == h) to distinguish squares from regular rectangles. Class patterns work best with dataclasses and named tuples because Python can automatically match keyword arguments to attributes. For regular classes, you would need to define a __match_args__ tuple to enable positional matching.
Adding Guard Clauses
Sometimes the pattern alone is not enough to decide which case should match. Guard clauses add an if condition after the pattern that must also be true for the case to match. The guard can reference any variables captured by the pattern:
# guard_clauses.py
def categorize_score(score):
    match score:
        case s if s < 0 or s > 100:
            return f"Invalid score: {s}"
        case s if s >= 90:
            return f"{s} -- A grade (excellent)"
        case s if s >= 80:
            return f"{s} -- B grade (good)"
        case s if s >= 70:
            return f"{s} -- C grade (average)"
        case s if s >= 60:
            return f"{s} -- D grade (below average)"
        case s:
            return f"{s} -- F grade (failing)"

print(categorize_score(95))
print(categorize_score(82))
print(categorize_score(55))
print(categorize_score(-5))
Output:
95 -- A grade (excellent)
82 -- B grade (good)
55 -- F grade (failing)
Invalid score: -5
The pattern s by itself matches any value and binds it to the variable s. The guard if s >= 90 then filters whether this particular case should apply. Guards are evaluated in order, so the invalid score check comes first to reject bad input before the grading logic runs. This is cleaner than having the validation scattered across multiple elif branches.
Combining Patterns with OR
The OR pattern using the pipe | operator lets you match any of several patterns with the same case block. You have already seen this with literals, but it works with more complex patterns too:
# or_patterns.py
def parse_bool(value):
    match value:
        case True | "true" | "yes" | "1" | 1:
            return True
        case False | "false" | "no" | "0" | 0:
            return False
        case None | "":
            return None
        case _:
            raise ValueError(f"Cannot parse {value!r} as boolean")

print(parse_bool("yes"))
print(parse_bool(0))
print(parse_bool("false"))
print(parse_bool(None))
Output:
True
False
False
None
This pattern is extremely useful for building flexible input parsers that need to accept multiple formats for the same logical value. Configuration files, command-line arguments, and API parameters often use different representations for booleans, and a single OR pattern handles all of them in one readable line. Note that when using OR patterns with capture variables, every alternative must bind the same set of variables — Python enforces this at compile time.
Common Pitfalls to Avoid
There are a few tricky behaviors in match-case that catch even experienced Python developers. The most common mistake is accidentally creating a capture pattern when you meant to match a constant:
# pitfalls.py
HTTP_OK = 200
HTTP_NOT_FOUND = 404
status = 500

# WRONG -- this does NOT work as expected
match status:
    case HTTP_OK:  # This captures 500 into a NEW variable called HTTP_OK!
        print("Success")
# (Adding a second case such as case HTTP_NOT_FOUND after it would not even
# compile: the capture pattern makes the remaining patterns unreachable,
# which is a SyntaxError.)

# RIGHT -- use literal values or dotted names
print("---")
match status:
    case 200:
        print("Success")
    case 404:
        print("Not found")
    case other:
        print(f"Other status: {other}")
Output:
Success
---
Other status: 500
In the first match block, case HTTP_OK does not compare against the variable HTTP_OK. Instead, it creates a new variable called HTTP_OK that captures whatever the subject value is, so the block prints "Success" for a 500 status. This is because bare names in patterns are always capture patterns. To match against constants, use literal values directly, use dotted names like case http.HTTPStatus.OK, or use a guard clause like case code if code == HTTP_OK.
Real-Life Example: Building a CLI Command Parser
Let’s tie everything together with a practical project — a command-line parser that processes structured user commands using every pattern type we have covered:
# cli_parser.py
from dataclasses import dataclass

@dataclass
class Task:
    title: str
    priority: str = "medium"
    done: bool = False

def run_command(command, tasks):
    """Parse and execute a CLI command on a task list."""
    parts = command.strip().split()
    match parts:
        case ["add", *words] if words:
            title = " ".join(words)
            task = Task(title=title)
            tasks.append(task)
            return f"Added: '{title}'"
        case ["done", index] if index.isdigit():
            idx = int(index)
            if 0 <= idx < len(tasks):
                tasks[idx].done = True
                return f"Completed: '{tasks[idx].title}'"
            return f"Error: no task at index {idx}"
        case ["priority", index, ("high" | "medium" | "low") as level] if index.isdigit():
            idx = int(index)
            if 0 <= idx < len(tasks):
                tasks[idx].priority = level
                return f"Set '{tasks[idx].title}' priority to {level}"
            return f"Error: no task at index {idx}"
        case ["list"]:
            if not tasks:
                return "No tasks yet"
            lines = []
            for i, t in enumerate(tasks):
                status = "done" if t.done else "todo"
                lines.append(f" [{i}] [{status}] [{t.priority}] {t.title}")
            return "\n".join(lines)
        case ["list", "done"]:
            done_tasks = [t for t in tasks if t.done]
            if not done_tasks:
                return "No completed tasks"
            return "\n".join(f" - {t.title}" for t in done_tasks)
        case ["list", "pending"]:
            pending = [t for t in tasks if not t.done]
            if not pending:
                return "All tasks complete!"
            return "\n".join(f" - {t.title} [{t.priority}]" for t in pending)
        case ["quit" | "exit"]:
            return "QUIT"
        case []:
            return "Type a command (add, done, priority, list, quit)"
        case _:
            return f"Unknown command: {' '.join(parts)}"

# Simulate a session
tasks = []
commands = [
    "add Buy groceries",
    "add Write unit tests",
    "add Deploy to production",
    "priority 2 high",
    "done 0",
    "list",
    "list pending",
    "quit"
]
for cmd in commands:
    print(f"> {cmd}")
    result = run_command(cmd, tasks)
    print(result)
    if result == "QUIT":
        break
    print()
Output:
> add Buy groceries
Added: 'Buy groceries'
> add Write unit tests
Added: 'Write unit tests'
> add Deploy to production
Added: 'Deploy to production'
> priority 2 high
Set 'Deploy to production' priority to high
> done 0
Completed: 'Buy groceries'
> list
 [0] [done] [medium] Buy groceries
 [1] [todo] [medium] Write unit tests
 [2] [todo] [high] Deploy to production
> list pending
 - Write unit tests [medium]
 - Deploy to production [high]
> quit
QUIT
This command parser demonstrates several pattern matching features working together. The ["add", *words] pattern uses a star capture for variable-length input. The ["priority", index, ("high" | "medium" | "low") as level] pattern combines sequence matching, an OR pattern for valid values, and an as binding to capture the matched value. Guard clauses validate that numeric arguments are actually digits before conversion. You could extend this by adding commands for removing tasks, searching by keyword, or sorting by priority — each new command is just another case block.
Frequently Asked Questions
What Python version do I need for match-case?
You need Python 3.10 or later. Structural pattern matching was introduced in Python 3.10 as part of PEP 634, PEP 635, and PEP 636. If you try to use match and case on Python 3.9 or earlier, you will get a SyntaxError. Note that match and case are soft keywords — they only have special meaning in the context of the match statement and can still be used as variable names elsewhere in your code.
Is match-case just a switch statement?
No, it is much more powerful. A switch statement (like in C or JavaScript) only compares a value against constants. Python’s match-case can destructure sequences and mappings, bind captured values to variables, match class instances by their attributes, use guard conditions, and combine patterns with OR. The simple literal matching does resemble a switch, but structural pattern matching handles complex data shapes that a switch statement cannot express.
Does match-case fall through like C switch?
No. Python’s match-case executes only the first matching case and then exits the match block. There is no fall-through behavior and no need for a break statement. If you want multiple patterns to execute the same code, combine them with the OR operator | in a single case, such as case "yes" | "y" | "true". This design prevents the common bug in C where a missing break causes unintended fall-through.
Can I use match-case with regular expressions?
Not directly in the pattern itself, but you can use guard clauses with re.match() or re.search(). For example: case str(s) if re.match(r"^\d{3}-\d{4}$", s) matches strings that look like phone numbers. The pattern ensures the value is a string, and the guard applies the regex check. This keeps the pattern readable while letting you use the full power of regular expressions when needed.
How does match-case compare to if-elif for performance?
For simple literal matching, match-case and if-elif chains have similar performance. The CPython implementation does not currently optimize match statements into jump tables or hash lookups. Choose match-case for readability and maintainability, not for speed. The real performance benefit is developer time — pattern matching makes complex branching logic easier to read, debug, and extend, which reduces the time you spend maintaining the code.
Conclusion
You now have a solid understanding of Python’s structural pattern matching — from simple literal matching to destructuring sequences and dictionaries, matching class instances with nested patterns, filtering with guard clauses, and combining alternatives with OR patterns. The key concepts we covered are match and case syntax, the wildcard _ pattern, capture variables, star patterns for variable-length sequences, mapping patterns for dictionaries, class patterns with dataclasses, guard clauses with if, and OR patterns with |.
Try extending the CLI command parser by adding a search command that filters tasks by keyword, or a sort command that reorders tasks by priority. You could also add persistence by saving tasks to a JSON file between sessions. For the complete language specification and advanced features like walrus patterns and positional class matching, check out the official Python documentation on match statements.
You have just finished writing a Python function that calculates discounts, parses user input, or fetches data from an API. It works when you test it manually — but how do you know it will still work next week after you refactor? How do you catch the edge case where someone passes a negative number or an empty string? Unit tests are the safety net that catches these problems before your users do, and pytest is the tool that makes writing those tests genuinely enjoyable.
pytest is Python’s most popular testing framework, and it comes with zero boilerplate. Unlike the built-in unittest module that requires classes and special method names, pytest lets you write tests as simple functions using plain assert statements. It automatically discovers your test files, provides detailed failure reports, and has a rich ecosystem of plugins. You can install it with a single pip install pytest command and start testing immediately.
In this guide, you will learn how to write your first pytest test, organize test files properly, use fixtures for setup and teardown, parametrize tests to cover multiple inputs, mock external dependencies, and structure a real testing suite. By the end, you will have the skills to write comprehensive tests for any Python project.
Writing Your First pytest Test: Quick Example
Let’s start with a complete, runnable example that shows how pytest works in under 30 seconds. Create two files in the same directory — the code you want to test and the test file itself:
# calculator.py
def add(a, b):
    return a + b

def divide(a, b):
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b
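A matching test file might look like this; each test is a plain function using assert. (The try/except fallback is only there so the sketch also runs on its own; in the two-file setup the import line is all you need.)

```python
# test_calculator.py
import pytest

try:
    from calculator import add, divide
except ImportError:
    # Fallback definitions so this sketch is self-contained
    def add(a, b):
        return a + b

    def divide(a, b):
        if b == 0:
            raise ValueError("Cannot divide by zero")
        return a / b

def test_add_positive_numbers():
    assert add(2, 3) == 5

def test_add_negative_numbers():
    assert add(-2, -3) == -5

def test_divide():
    assert divide(10, 4) == 2.5

def test_divide_by_zero_raises():
    with pytest.raises(ValueError):
        divide(1, 0)
```

Save it next to calculator.py and run pytest -v in that directory.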
Notice how simple that is — each test is a plain function that starts with test_, and you verify results with normal assert statements. pytest discovers test files automatically (any file matching test_*.py) and gives you a clear pass/fail report. The -v flag enables verbose output so you can see each individual test result. In the following sections, we will explore every feature that makes pytest the go-to testing tool for Python developers.
What is pytest and Why Use It?
pytest is a mature, full-featured testing framework for Python that has become the de facto standard for Python testing. It was created to remove the friction of writing tests — no more subclassing TestCase, no more self.assertEqual calls, just plain functions and assertions.
The philosophy behind pytest is simple: if writing tests feels like a chore, developers will not write them. By making the syntax minimal and the output informative, pytest encourages a test-driven workflow where you actually want to write tests.
| Feature | pytest | unittest (built-in) |
| --- | --- | --- |
| Test syntax | Plain functions with assert | Classes inheriting TestCase |
| Setup/teardown | Fixtures (flexible, composable) | setUp/tearDown methods |
| Test discovery | Automatic (test_*.py files) | Automatic (similar pattern) |
| Parametrized tests | Built-in @pytest.mark.parametrize | Requires subTest or loops |
| Failure output | Detailed assertion introspection | Basic assertion messages |
| Plugin ecosystem | 1000+ plugins available | Limited |
| Boilerplate | Minimal | Significant (classes, methods) |
One of pytest’s most powerful features is assertion introspection. When an assertion fails, pytest shows you exactly what values were compared, rather than a generic “assertion failed” message. This alone saves hours of debugging time across a project’s lifetime.
Organizing Your Test Files
As your project grows, you need a consistent structure for your tests. The standard convention is to create a tests/ directory at the root of your project that mirrors your source code structure:
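For example, a layout using the modules from this guide might look like this (illustrative):

```
my_project/
    calculator.py
    weather_service.py
    shopping_cart.py
    tests/
        conftest.py
        test_calculator.py
        test_weather_service.py
        test_shopping_cart.py
```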
pytest discovers tests by looking for files that match test_*.py or *_test.py patterns. Inside those files, it collects functions starting with test_ and classes starting with Test. You do not need to register tests anywhere — just follow the naming convention and pytest finds them automatically.
You can run all tests with pytest, run a specific file with pytest tests/test_calculator.py, or even run a single test function with pytest tests/test_calculator.py::test_add_positive_numbers. This granular control is essential when debugging a failing test in a large test suite.
Using Fixtures for Test Setup
Real tests often need data or resources to work with — a database connection, a temporary file, or a pre-configured object. pytest fixtures let you define this setup code once and inject it into any test that needs it. Think of fixtures as reusable building blocks for your tests:
# test_user_service.py
import pytest

class User:
    def __init__(self, name, email, role="viewer"):
        self.name = name
        self.email = email
        self.role = role

    def promote(self):
        if self.role == "viewer":
            self.role = "editor"
        elif self.role == "editor":
            self.role = "admin"

    def __repr__(self):
        return f"User({self.name}, {self.role})"

@pytest.fixture
def sample_user():
    """Create a fresh user for each test."""
    return User("Alice", "alice@example.com")

@pytest.fixture
def admin_user():
    """Create an admin user for permission tests."""
    user = User("Bob", "bob@example.com", role="admin")
    return user

def test_new_user_is_viewer(sample_user):
    assert sample_user.role == "viewer"

def test_promote_viewer_to_editor(sample_user):
    sample_user.promote()
    assert sample_user.role == "editor"

def test_promote_editor_to_admin(sample_user):
    sample_user.promote()  # viewer -> editor
    sample_user.promote()  # editor -> admin
    assert sample_user.role == "admin"

def test_admin_stays_admin(admin_user):
    admin_user.promote()
    assert admin_user.role == "admin"
Each fixture runs fresh for every test that uses it, so tests never interfere with each other. You request a fixture by adding its name as a parameter to your test function — pytest handles the injection automatically. This is fundamentally different from unittest where setup code lives in a setUp method that runs before every test in the class, regardless of whether each test needs it.
Parametrize: Test Multiple Inputs in One Function
One of pytest’s most time-saving features is @pytest.mark.parametrize, which lets you run the same test logic with different inputs. Instead of writing five nearly identical test functions, you define the inputs as a list and pytest generates a separate test case for each one:
# test_validators.py
import pytest

def is_valid_email(email):
    """Simple email validation."""
    if not isinstance(email, str):
        return False
    parts = email.split("@")
    if len(parts) != 2:
        return False
    local, domain = parts
    return len(local) > 0 and "." in domain

@pytest.mark.parametrize("email,expected", [
    ("user@example.com", True),
    ("admin@mail.co.uk", True),
    ("test@localhost", False),
    ("@example.com", False),
    ("user@", False),
    ("plaintext", False),
    ("", False),
    (None, False),
    ("user@domain.com", True),
])
def test_email_validation(email, expected):
    assert is_valid_email(email) == expected
This is incredibly powerful for validation functions, parsers, and any code that needs to handle a variety of inputs. Each parametrized case shows up as its own test in the output, so if one fails you know exactly which input caused the problem. Without parametrize, you would need nine separate functions that all look nearly identical.
Mocking External Dependencies
Unit tests should test your code in isolation, but what if your function calls an external API, reads from a database, or sends an email? You do not want your tests hitting real services — they would be slow, flaky, and possibly destructive. This is where mocking comes in. Python’s unittest.mock library (which works perfectly with pytest) lets you replace external dependencies with controlled stand-ins:
# weather_service.py
import requests

def get_temperature(city):
    """Fetch current temperature from a weather API."""
    response = requests.get(
        "https://api.weatherapi.com/v1/current.json",
        params={"key": "YOUR_API_KEY", "q": city}
    )
    response.raise_for_status()
    data = response.json()
    return data["current"]["temp_c"]

def weather_advice(city):
    """Give clothing advice based on temperature."""
    temp = get_temperature(city)
    if temp < 10:
        return "Wear a warm coat"
    elif temp < 20:
        return "A light jacket should be fine"
    else:
        return "T-shirt weather"

# test_weather_service.py
from unittest.mock import patch
from weather_service import weather_advice

@patch("weather_service.get_temperature")
def test_cold_weather_advice(mock_get_temp):
    mock_get_temp.return_value = 5
    assert weather_advice("London") == "Wear a warm coat"
    mock_get_temp.assert_called_once_with("London")

@patch("weather_service.get_temperature")
def test_mild_weather_advice(mock_get_temp):
    mock_get_temp.return_value = 15
    assert weather_advice("Paris") == "A light jacket should be fine"

@patch("weather_service.get_temperature")
def test_warm_weather_advice(mock_get_temp):
    mock_get_temp.return_value = 28
    assert weather_advice("Sydney") == "T-shirt weather"
The @patch decorator replaces the get_temperature function with a mock object during each test. You set the mock's return value to simulate different temperatures, then verify your weather_advice function responds correctly. The real API is never called -- your tests run instantly and work without an internet connection. The key is to patch where the function is used (in weather_service), not where it is defined.
Testing Exceptions and Edge Cases
Good tests verify not just the happy path but also that your code fails correctly. pytest provides pytest.raises as a context manager for testing that specific exceptions are raised with the right messages:
# test_edge_cases.py
import pytest

def parse_age(value):
    """Parse age from string input with validation."""
    if not isinstance(value, str):
        raise TypeError("Age must be provided as a string")
    stripped = value.strip()
    if not stripped:
        raise ValueError("Age cannot be empty")
    age = int(stripped)  # May raise ValueError for non-numeric
    if age < 0:
        raise ValueError("Age cannot be negative")
    if age > 150:
        raise ValueError("Age seems unrealistic")
    return age

def test_valid_age():
    assert parse_age("25") == 25
    assert parse_age(" 30 ") == 30  # Handles whitespace

def test_empty_string_raises():
    with pytest.raises(ValueError, match="cannot be empty"):
        parse_age("")

def test_negative_age_raises():
    with pytest.raises(ValueError, match="cannot be negative"):
        parse_age("-5")

def test_unrealistic_age_raises():
    with pytest.raises(ValueError, match="seems unrealistic"):
        parse_age("200")

def test_non_string_raises_type_error():
    with pytest.raises(TypeError, match="must be provided as a string"):
        parse_age(25)

def test_non_numeric_string_raises():
    with pytest.raises(ValueError):
        parse_age("twenty")
The match parameter accepts a regular expression, so you can verify not just that an exception was raised but that it carries the correct error message. This is critical for debugging -- when two different code paths raise the same exception type, the message tells you which one fired.
Sharing Fixtures with conftest.py
When multiple test files need the same fixtures, you can put them in a special file called conftest.py. pytest automatically discovers this file and makes its fixtures available to all tests in the same directory and subdirectories:
# tests/conftest.py
import pytest
import tempfile
import os

@pytest.fixture
def temp_directory():
    """Create a temporary directory that is cleaned up after the test."""
    with tempfile.TemporaryDirectory() as tmpdir:
        yield tmpdir
        # Directory is automatically deleted after yield

@pytest.fixture
def sample_csv(temp_directory):
    """Create a sample CSV file for testing."""
    csv_path = os.path.join(temp_directory, "data.csv")
    with open(csv_path, "w") as f:
        f.write("name,age,city\n")
        f.write("Alice,30,London\n")
        f.write("Bob,25,Paris\n")
        f.write("Charlie,35,Tokyo\n")
    return csv_path

# tests/test_data_reader.py
import csv

def read_csv_names(filepath):
    """Read names from a CSV file."""
    names = []
    with open(filepath, "r") as f:
        reader = csv.DictReader(f)
        for row in reader:
            names.append(row["name"])
    return names

def test_read_names_from_csv(sample_csv):
    names = read_csv_names(sample_csv)
    assert names == ["Alice", "Bob", "Charlie"]
    assert len(names) == 3

def test_csv_file_exists(sample_csv):
    import os
    assert os.path.exists(sample_csv)
    assert sample_csv.endswith("data.csv")
The yield keyword in the temp_directory fixture is important -- code before yield runs as setup, and code after yield runs as teardown. This pattern ensures resources are always cleaned up, even if a test fails. The sample_csv fixture depends on temp_directory, showing how fixtures can compose together to build complex test scenarios from simple pieces.
Real-Life Example: Testing a Shopping Cart
Let's tie everything together with a realistic project -- a shopping cart module with full test coverage. This example uses fixtures, parametrize, exception testing, and mocking all in one test suite:
# shopping_cart.py
class Product:
    def __init__(self, name, price, stock=10):
        if price < 0:
            raise ValueError("Price cannot be negative")
        self.name = name
        self.price = price
        self.stock = stock

class ShoppingCart:
    def __init__(self):
        self.items = {}

    def add_item(self, product, quantity=1):
        if quantity <= 0:
            raise ValueError("Quantity must be positive")
        if quantity > product.stock:
            raise ValueError(f"Only {product.stock} items in stock")
        if product.name in self.items:
            self.items[product.name]["quantity"] += quantity
        else:
            self.items[product.name] = {
                "product": product,
                "quantity": quantity
            }

    def remove_item(self, product_name):
        if product_name not in self.items:
            raise KeyError(f"'{product_name}' not in cart")
        del self.items[product_name]

    def get_total(self):
        total = 0
        for item_data in self.items.values():
            total += item_data["product"].price * item_data["quantity"]
        return round(total, 2)

    def apply_discount(self, percent):
        if not 0 <= percent <= 100:
            raise ValueError("Discount must be between 0 and 100")
        total = self.get_total()
        return round(total * (1 - percent / 100), 2)

    @property
    def item_count(self):
        return sum(d["quantity"] for d in self.items.values())
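The test suite itself is not shown here, so below is a sketch of what it could look like, covering fixtures, parametrize, and exception tests (the mocking portion, such as patching a payment gateway, is left out of this sketch). Condensed copies of the two classes are inlined so the file runs standalone; in the real project you would write `from shopping_cart import Product, ShoppingCart` instead.

```python
# test_shopping_cart.py -- a sketch of a possible test suite.
# Condensed class copies are inlined so this example is self-contained;
# in the real project, import them from shopping_cart.py instead.
import pytest

class Product:
    def __init__(self, name, price, stock=10):
        if price < 0:
            raise ValueError("Price cannot be negative")
        self.name, self.price, self.stock = name, price, stock

class ShoppingCart:
    def __init__(self):
        self.items = {}

    def add_item(self, product, quantity=1):
        if quantity <= 0:
            raise ValueError("Quantity must be positive")
        if quantity > product.stock:
            raise ValueError(f"Only {product.stock} items in stock")
        entry = self.items.setdefault(product.name, {"product": product, "quantity": 0})
        entry["quantity"] += quantity

    def get_total(self):
        return round(sum(d["product"].price * d["quantity"] for d in self.items.values()), 2)

    def apply_discount(self, percent):
        if not 0 <= percent <= 100:
            raise ValueError("Discount must be between 0 and 100")
        return round(self.get_total() * (1 - percent / 100), 2)

@pytest.fixture
def cart():
    return ShoppingCart()

@pytest.fixture
def book():
    return Product("Python Book", 29.99, stock=5)

def test_add_item_updates_total(cart, book):
    cart.add_item(book, 2)
    assert cart.get_total() == 59.98

def test_add_item_exceeds_stock(cart, book):
    with pytest.raises(ValueError, match="in stock"):
        cart.add_item(book, 6)

@pytest.mark.parametrize("percent,expected", [(0, 59.98), (50, 29.99), (100, 0.0)])
def test_apply_discount(cart, book, percent, expected):
    cart.add_item(book, 2)
    assert cart.apply_discount(percent) == expected

def test_invalid_discount_rejected(cart, book):
    cart.add_item(book)
    with pytest.raises(ValueError, match="between 0 and 100"):
        cart.apply_discount(150)
```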
This test suite demonstrates a professional testing pattern: fixtures create reusable objects, parametrize covers multiple discount scenarios efficiently, and exception tests verify that invalid inputs are rejected with clear error messages. You can extend this by adding tests for quantity updates, coupon codes, or tax calculations -- each new feature gets its own focused set of tests.
Frequently Asked Questions
How do I run only tests that match a specific name pattern?
Use the -k flag followed by an expression. For example, pytest -k "discount" runs only tests with "discount" in their name. You can combine patterns with and, or, and not operators, like pytest -k "discount and not invalid". This is extremely useful when you are debugging a specific area of your codebase and do not want to run the entire test suite.
What is the difference between fixtures with function scope and session scope?
By default, fixtures have scope="function", meaning they run fresh for every test function. You can change this to scope="session", scope="module", or scope="class" to share expensive resources. A session-scoped fixture (like a database connection) is created once for the entire test run. Use @pytest.fixture(scope="session") for resources that are expensive to create and safe to share across tests.
How do I skip a test or mark it as expected to fail?
Use @pytest.mark.skip(reason="Not implemented yet") to unconditionally skip a test, or @pytest.mark.skipif(condition, reason="...") to skip based on a condition (like Python version or OS). Use @pytest.mark.xfail for tests that you expect to fail -- they run but do not count as failures. This is useful for documenting known bugs without breaking your CI pipeline.
Can I use pytest with existing unittest-style tests?
Yes, pytest runs unittest.TestCase classes without any modifications. You can gradually migrate by keeping your old tests working while writing new tests in pytest style. This means you do not need to rewrite your entire test suite at once -- just start writing new tests with pytest and convert old ones when you touch them.
How do I see print output from my tests?
By default, pytest captures all print() output and only shows it for failing tests. Use pytest -s to disable output capture and see all print statements in real time. Alternatively, use pytest --capture=no for the same effect. This is helpful during development, but you should remove debug print statements before committing your tests.
Conclusion
You now have a solid foundation in pytest -- from writing simple test functions to organizing test suites with fixtures, covering edge cases with parametrize, and isolating code with mocks. The key concepts we covered are assert-based testing, the @pytest.fixture decorator for reusable setup, @pytest.mark.parametrize for data-driven tests, pytest.raises for exception testing, conftest.py for shared fixtures, and unittest.mock.patch for mocking external dependencies.
Try extending the shopping cart example by adding a coupon code system, quantity limits, or shipping cost calculations -- and write the tests first before the implementation. Test-driven development becomes natural once you are comfortable with pytest. For more advanced features like plugins, coverage reports, and parallel test execution, check out the official pytest documentation.
Data scientists and analysts spend approximately 80% of their time cleaning and preparing data before they can begin any meaningful analysis. This often unglamorous work is critical because the quality of your insights is directly proportional to the quality of your data. Whether you’re working with CSV files from legacy systems, databases with inconsistent formatting, or API responses with missing fields, you’ll inevitably encounter messy data.
Pandas, Python’s most popular data manipulation library, provides powerful tools to handle virtually any data cleaning scenario. With functions designed specifically for managing missing values, fixing data types, removing duplicates, and standardizing formats, you can transform chaotic datasets into analysis-ready dataframes in a fraction of the time it would take with manual approaches.
In this comprehensive guide, we’ll explore practical techniques for cleaning messy data using Pandas. You’ll learn how to identify data quality issues, apply targeted fixes, and build reusable cleaning pipelines that you can apply across different projects. By the end, you’ll have a solid toolkit for tackling real-world data challenges.
Quick Start: Clean Data in 10 Lines
Let’s start with a quick example that demonstrates the power of Pandas for data cleaning. Here’s a complete workflow that loads messy data, applies multiple cleaning operations, and produces a ready-to-analyze dataframe:
Data cleaning is rarely a single operation. Instead, you apply multiple fixes in sequence, each addressing a specific problem. In this example, you’ll see how to handle missing values, fix data types, standardize text formatting, and parse dates — often all in the same pipeline. Understanding how these pieces fit together is crucial because the order matters: you typically clean text before deduplicating, convert data types before filtering, and validate results before using data for analysis.
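Here is a minimal sketch of such a pipeline. The column names and sample values are illustrative assumptions chosen to exercise the cleaning steps described below:

```python
# quick_clean.py -- illustrative pipeline; the sample data is assumed
import pandas as pd

df = pd.DataFrame({
    'customer_id': [101, None, 103, 104],
    'amount': ['$150.50', '$200.00', '$75.25', '$120.99'],
    'email': [' Alice@EXAMPLE.com', 'bob@example.com ', 'carol@example.com', 'dan@example.com'],
    'date': ['2024-01-15', '2024-02-20', 'not a date', '2024-03-10'],
})

df['customer_id'] = df['customer_id'].fillna(df['customer_id'].mean())  # fill missing IDs with the mean
df['amount'] = pd.to_numeric(df['amount'].str.replace('$', '', regex=False))  # strip '$', convert to float
df['email'] = df['email'].str.strip().str.lower()  # standardize emails
df['date'] = pd.to_datetime(df['date'], errors='coerce')  # unparseable dates become NaT
df = df.dropna(subset=['date'])  # drop rows whose date could not be parsed

print(df)
print(df.dtypes)
```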
This output shows the result of applying multiple cleaning operations: missing customer IDs were filled with the mean value, currency symbols were stripped and amounts converted to float, emails were standardized to lowercase, and dates were parsed into datetime format. Row 2 was dropped because its date couldn’t be parsed — sometimes removing completely broken records is preferable to forcing imperfect repairs. Each column now has the correct type and consistent formatting, making it ready for analysis.
This simple example demonstrates key Pandas functions that we’ll explore in depth throughout this tutorial. Notice how we handled missing values, converted currency to numeric format, standardized email addresses, and parsed dates — all core data cleaning tasks.
What Makes Data “Messy”?
Before diving into solutions, let’s identify the common data quality issues you’ll encounter. Understanding these problems helps you recognize them quickly and apply the right cleaning techniques.
Real-world data is messy because it comes from multiple sources, is entered manually, spans different time periods, and isn’t designed specifically for your analysis. Systems change, people make typos, integrations break, and formats evolve. Rather than being discouraged by messiness, professional data workers expect it and have systematic approaches to handle it. The patterns below appear repeatedly in virtually every dataset, so mastering them will serve you across your entire career.
| Problem | Example | Solution |
| --- | --- | --- |
| Missing values | NaN, None, 'N/A', blank cells | fillna(), dropna(), interpolate() |
| Inconsistent data types | Numbers stored as strings, mixed date formats | astype(), pd.to_numeric(), pd.to_datetime() |
| Duplicate records | Same customer appearing twice with slight variations | drop_duplicates(), duplicated() |
| Inconsistent formatting | 'John', 'JOHN', 'john' in same column | str.lower(), str.upper(), str.strip() |
| Special characters and symbols | Currency signs, extra spaces, special characters | str.replace(), str.extract(), regex patterns |
| Outliers and impossible values | Age of 999, negative prices | Filtering, quantile-based detection |
| Mixed data types in single column | Column contains both integers and text | errors='coerce', regex extraction |
This table summarizes the most common data quality problems and the Pandas tools that address them. Notice that each problem type has specific solution methods — you wouldn’t use the same approach for missing values as you would for duplicates or formatting issues. Understanding which problem you’re solving guides you toward the right function. Throughout this guide, we’ll explore each of these patterns in detail with practical examples showing both the problem and multiple solution approaches.
Handling Missing Values
Missing data is the most common data quality issue you’ll encounter. It manifests in different ways: NaN values in numeric columns, None objects in Python, placeholder strings like ‘N/A’, or simply empty cells. Missing data creates a fundamental problem: should you remove incomplete records or estimate their missing values? This choice isn’t purely technical — it depends on why data is missing, how much is missing, and what your analysis requires.
Pandas represents missing values as NaN (Not a Number) or None, and provides several strategies for handling them.
Detecting Missing Data
First, you need to identify where missing values exist in your dataframe:
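The code below is reconstructed from the output listing that follows; the sample user data is inferred from that output, and an `isnull()` count is included since detection is the point of this section:

```python
# detect_missing.py -- reconstructed to match the output below
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'username': ['alice', None, 'charlie', 'diana', 'eve'],
    'email': ['alice@example.com', 'bob@example.com', 'charlie@example.com',
              None, 'eve@example.com'],
    'active': [True, True, False, True, True],
})
print("Original shape:", df.shape)
print(df)

# isnull() marks missing cells; sum() counts them per column
print("\nMissing values per column:")
print(df.isnull().sum())

# dropna() removes any row containing at least one missing value
dropped = df.dropna()
print("\nAfter dropna():")
print("New shape:", dropped.shape)
print(dropped)

# Restrict the check to specific columns
print("\nDrop rows missing in specific columns:")
print(df.dropna(subset=['username', 'email']))
```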
Original shape: (5, 4)
user_id username email active
0 1 alice alice@example.com True
1 2 None bob@example.com True
2 3 charlie charlie@example.com False
3 4 diana None True
4 5 eve eve@example.com True
After dropna():
New shape: (3, 4)
user_id username email active
0 1 alice alice@example.com True
2 3 charlie charlie@example.com False
4 5 eve eve@example.com True
Drop rows missing in specific columns:
user_id username email active
0 1 alice alice@example.com True
2 3 charlie charlie@example.com False
4 5 eve eve@example.com True
Filling Missing Values
When you can’t afford to lose data, filling missing values is a better strategy. Pandas provides several filling methods:
# fill_missing.py
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'day': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
    'temperature': [72.5, 75.0, None, 78.5, None],
    'humidity': [65, None, 70, None, 68]
})
print("Original:")
print(df)

print("\nFill with constant value:")
print(df.fillna(0))

print("\nForward fill (propagate last known value):")
print(df.ffill())  # fillna(method='ffill') is deprecated; use ffill()

print("\nBackward fill (propagate next known value):")
print(df.bfill())

print("\nFill with column mean:")
df['temperature'] = df['temperature'].fillna(df['temperature'].mean())
print(df)

print("\nInterpolate (linear):")
df2 = pd.DataFrame({
    'hour': [0, 1, 2, 3, 4],
    'traffic': [100, None, None, 150, 160]
})
df2['traffic'] = df2['traffic'].interpolate(method='linear')
print(df2)
Output:
Original:
day temperature humidity
0 Monday 72.5 65
1 Tuesday 75.0 NaN
2 Wednesday NaN 70
3 Thursday 78.5 NaN
4 Friday NaN 68
Fill with constant value:
day temperature humidity
0 Monday 72.5 65
1 Tuesday 75.0 0
2 Wednesday 0.0 70
3 Thursday 78.5 0
4 Friday 0.0 68
Forward fill (propagate last known value):
day temperature humidity
0 Monday 72.5 65
1 Tuesday 75.0 65
2 Wednesday 75.0 70
3 Thursday 78.5 70
4 Friday 78.5 68
Interpolate (linear):
   hour     traffic
0     0  100.000000
1     1  116.666667
2     2  133.333333
3     3  150.000000
4     4  160.000000
Messy data is just clean data that hasn’t met pandas yet.
Fixing Data Types
Data type errors cause many silent bugs in analysis. A column containing prices might be stored as strings instead of floats, causing calculations to fail. Pandas provides tools to convert and validate data types.
Converting Strings to Numbers
Numbers stored as text are among the most frequent data type problems. You’ll encounter “$100.50” in a price column, “5,000” in a quantity column, or even “N/A” mixed with actual numbers. The `astype()` method works for clean numeric strings, but `pd.to_numeric(..., errors='coerce')` is more forgiving — it converts what it can and turns non-numeric values into NaN. This defensive approach prevents silent failures and lets you handle problematic values explicitly after conversion.
The pd.to_numeric() function is your best friend for handling numeric data stored as strings:
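The code below is reconstructed from the output that follows; the product data is inferred from it:

```python
# convert_types.py -- reconstructed to match the output below
import pandas as pd

df = pd.DataFrame({
    'product': ['Widget A', 'Widget B', 'Widget C', 'Widget D'],
    'price': ['$25.99', '$40.50', 'FREE', '$15.75'],
    'quantity': ['100', '250', '50', 'N/A'],
})
print("Original dtypes:")
print(df.dtypes)

print("\nPrice as string:")
print(df['price'])

# Strip the currency symbol, then coerce anything non-numeric to NaN
df['price'] = pd.to_numeric(df['price'].str.replace('$', '', regex=False), errors='coerce')
print("\nConvert price to numeric (coerce errors):")
print(df['price'])

df['quantity'] = pd.to_numeric(df['quantity'], errors='coerce')
print("\nConvert quantity (coerce invalid values):")
print(df['quantity'])

print("\nFinal dataframe:")
print(df)
```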
Original dtypes:
product object
price object
quantity object
dtype: object
Price as string:
0 $25.99
1 $40.50
2 FREE
3 $15.75
Name: price, dtype: object
Convert price to numeric (coerce errors):
0 25.99
1 40.50
2 NaN
3 15.75
Name: price, dtype: float64
Convert quantity (coerce invalid values):
0 100.0
1 250.0
2 50.0
3 NaN
Name: quantity, dtype: float64
Final dataframe:
product price quantity
0 Widget A 25.99 100.0
1 Widget B 40.50 250.0
2 Widget C NaN 50.0
3 Widget D 15.75 NaN
Parsing Dates
Date parsing is particularly tricky because dates can be represented in dozens of formats: “2024-01-15”, “01/15/2024”, “15-Jan-2024”, “Jan 15, 2024”, and more. Pandas’ `pd.to_datetime()` function can handle this complexity. The `format` parameter lets you specify an exact format if all dates match. The `errors='coerce'` parameter converts unparseable dates to NaT (Not a Time), similar to how `pd.to_numeric()` handles non-numeric values. On pandas 2.0 and later, `format='mixed'` tells pandas to infer the format per element, useful when formats are mixed (the older `infer_datetime_format` option is deprecated).
Date parsing is critical for time-series analysis. Real-world data often contains dates in multiple formats:
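A minimal sketch of strict parsing with coercion (the sample dates are assumptions):

```python
# parse_dates.py -- a minimal sketch; the sample dates are assumptions
import pandas as pd

df = pd.DataFrame({
    'order_date': ['2024-01-15', '2024-01-20', '15-Jan-2024', 'not a date'],
})
# An explicit format is strict: anything that does not match it becomes NaT
df['parsed'] = pd.to_datetime(df['order_date'], format='%Y-%m-%d', errors='coerce')
print(df)
print(df.dtypes)
```

On pandas 2.0 and later, passing `format='mixed'` instead lets pandas infer the format per element when several formats coexist in one column.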
Sometimes you need explicit control over type conversion beyond what `astype()` provides. This happens when conversion logic is complex or context-dependent. Creating a custom function encapsulates this logic and lets you reuse it across columns and projects. Custom functions can handle multiple input formats, document your business rules, and gracefully handle edge cases by returning NaN for unparseable values rather than raising errors.
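As a sketch of that idea, here is a hypothetical `parse_money()` helper; the name and parsing rules are illustrative assumptions, not from the original article:

```python
# custom_converter.py -- hypothetical helper, shown as a sketch
import pandas as pd
import numpy as np

def parse_money(value):
    """Convert '$1,200.50', '300', 'N/A', or None to a float (NaN on failure)."""
    if pd.isna(value):
        return np.nan
    cleaned = str(value).replace('$', '').replace(',', '').strip()
    try:
        return float(cleaned)
    except ValueError:
        return np.nan  # unparseable values become NaN instead of raising

df = pd.DataFrame({'amount': ['$1,200.50', '300', 'N/A', None]})
df['amount'] = df['amount'].apply(parse_money)
print(df)
```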
Removing Duplicate Records
Duplicate records occur frequently in real datasets due to system failures, multiple registrations, or import errors. Duplicates inflate row counts and skew analysis results. The challenge is deciding what “identical” means — are two records identical if they have the same email but different phone numbers? Pandas gives you tools to identify exact duplicates and handle them strategically. Before removing duplicates, always standardize your data first — standardization ensures “John Smith” and “john smith” are recognized as the same before deduplication.
Pandas provides efficient methods to identify and remove them:
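The code below is reconstructed from the output listing that follows:

```python
# remove_duplicates.py -- reconstructed to match the output below
import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 2, 2, 3, 4, 4, 4],
    'name': ['Alice', 'Bob', 'Bob', 'Charlie', 'Diana', 'Diana', 'Diana'],
    'email': ['alice@example.com', 'bob@example.com', 'bob@example.com',
              'charlie@example.com', 'diana@example.com', 'diana@example.com',
              'diana@example.com'],
    'purchase_count': [5, 3, 3, 8, 2, 2, 2],
})
print("Original dataframe:")
print(df)
print("Shape:", df.shape)

# duplicated() flags every repeat after the first occurrence
print("\nDetect duplicates (all columns):")
print(df.duplicated())

print("\nDetect duplicates (specific columns):")
print(df.duplicated(subset=['customer_id', 'email']))

print("\nRemove exact duplicates:")
print(df.drop_duplicates())

print("\nRemove duplicates keeping first:")
print(df.drop_duplicates(keep='first'))
```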
Original dataframe:
customer_id name email purchase_count
0 1 Alice alice@example.com 5
1 2 Bob bob@example.com 3
2 2 Bob bob@example.com 3
3 3 Charlie charlie@example.com 8
4 4 Diana diana@example.com 2
5 4 Diana diana@example.com 2
6 4 Diana diana@example.com 2
Shape: (7, 4)
Detect duplicates (all columns):
0 False
1 False
2 True
3 False
4 False
5 True
6 True
dtype: bool
Detect duplicates (specific columns):
0 False
1 False
2 True
3 False
4 False
5 True
6 True
dtype: bool
Remove exact duplicates:
customer_id name email purchase_count
0 1 Alice alice@example.com 5
1 2 Bob bob@example.com 3
3 3 Charlie charlie@example.com 8
4 4 Diana diana@example.com 2
Remove duplicates keeping first:
customer_id name email purchase_count
0 1 Alice alice@example.com 5
1 2 Bob bob@example.com 3
3 3 Charlie charlie@example.com 8
4 4 Diana diana@example.com 2
Missing values don’t hide from .isnull() — they just pretend to be NaN.
Standardizing String Data
Text data is especially prone to inconsistencies. Email addresses might have different cases or extra whitespace. Product names might be spelled with or without special characters. These variations are invisible to the human eye but cause real problems. String standardization is one of the highest-ROI cleaning activities because small inconsistencies have outsized impacts. When you deduplicate by email and one entry has “John@GMAIL.COM” while the duplicate has “john@gmail.com”, you’ll incorrectly identify them as different. Pandas’ string methods make bulk standardization efficient, operating on entire columns at once.
String columns often contain inconsistent formatting that breaks analysis. Pandas string methods make it easy to standardize text:
Case Normalization
Case normalization is the simplest and most important string cleaning step. Converting everything to lowercase ensures “John@GMAIL.COM” and “john@gmail.com” are recognized as identical. The `str.lower()` method works on entire columns at once, much faster than looping through individual values. Similarly, `str.upper()` converts to uppercase, and `str.title()` converts to title case. Choose lowercase for emails and usernames; use title case for names and proper nouns.
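The code below is reconstructed to match the output that follows:

```python
# case_normalization.py -- reconstructed to match the output below
import pandas as pd

df = pd.DataFrame({
    'city': ['new york', 'NEW YORK', 'New York', 'NEW york', 'los angeles', 'LOS ANGELES'],
    'country': ['USA', 'usa', 'Usa', 'USA', 'USA', 'usa'],
    'product_code': ['ABC123', 'abc123', 'Abc123', 'ABC123', 'ABC123', 'ABC123'],
})
print("Original:")
print(df)

# Lowercase every string column at once
lowered = df.apply(lambda col: col.str.lower())
print("\nAll lowercase:")
print(lowered)

# Title case suits proper nouns like city names
titled = lowered.copy()
titled['city'] = titled['city'].str.title()
print("\nTitle case:")
print(titled)
```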
Original:
city country product_code
0 new york USA ABC123
1 NEW YORK usa abc123
2 New York Usa Abc123
3 NEW york USA ABC123
4 los angeles USA ABC123
5 LOS ANGELES usa ABC123
All lowercase:
city country product_code
0 new york usa abc123
1 new york usa abc123
2 new york usa abc123
3 new york usa abc123
4 los angeles usa abc123
5 los angeles usa abc123
Title case:
city country product_code
0 New York usa abc123
1 New York usa abc123
2 New York usa abc123
3 New York usa abc123
4 Los Angeles usa abc123
5 Los Angeles usa abc123
Whitespace Cleaning
Accidental whitespace — spaces at the beginning or end of a value — is invisible but causes problems. “john ” and “john” are different strings in Python, so they won’t match even though they represent the same value. The `str.strip()` method removes leading and trailing whitespace, `str.lstrip()` removes only leading whitespace, and `str.rstrip()` removes only trailing whitespace. Always apply these methods early in your cleaning pipeline before any comparison or deduplication operations.
Extra spaces are a common data quality issue:
# whitespace_cleaning.py
import pandas as pd

df = pd.DataFrame({
    'email': [' alice@example.com ', 'bob@example.com ', ' charlie@example.com'],
    'category': ['Books ', ' Electronics', ' Home & Garden ']
})
print("Original:")
print(df)

print("\nEmail column repr (to see spaces):")
print(df['email'].apply(repr))

print("\nStrip leading and trailing spaces:")
df['email'] = df['email'].str.strip()
df['category'] = df['category'].str.strip()
print(df)

print("\nEmail after strip:")
print(df['email'].apply(repr))

print("\nRemove extra internal spaces:")
df['category'] = df['category'].str.replace(r'\s+', ' ', regex=True)
print(df)
Duplicates — they look the same, act the same, but only one gets to stay.
Detecting and Handling Outliers
Outliers are extreme values that don’t fit the normal pattern of your data. They might represent errors (a customer age of 999 years), fraud (an unusually large transaction), or legitimate but rare events (a customer who spends far more than typical). The key difference between outliers and errors is that outliers might be correct — just unusual. Your goal isn’t necessarily to remove them, but to detect them, investigate them, and make informed decisions about whether they should be included or handled separately in your analysis.
Outliers can skew analysis and produce misleading insights. While not always errors, they deserve investigation:
Statistical Outlier Detection
The interquartile range (IQR) method defines outliers based on your data’s natural spread. The IQR is the range between the 25th percentile (Q1) and 75th percentile (Q3). Values outside the typical range (usually Q1 – 1.5*IQR to Q3 + 1.5*IQR) are flagged as outliers. This method is robust because it’s less sensitive to extreme values than using mean and standard deviation. The z-score method measures how many standard deviations a value is from the mean — values with |z-score| > 2 or 3 are typically considered outliers. Choose IQR for skewed data; choose z-scores for normally distributed data.
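A minimal IQR sketch, assuming a small transaction column as sample data:

```python
# iqr_outliers.py -- a minimal sketch with assumed sample data
import pandas as pd

df = pd.DataFrame({'transaction': [25, 30, 28, 32, 27, 31, 29, 500]})

q1 = df['transaction'].quantile(0.25)
q3 = df['transaction'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the usual 1.5*IQR fences

outliers = df[(df['transaction'] < lower) | (df['transaction'] > upper)]
print(f"Bounds: [{lower}, {upper}]")
print("Flagged outliers:")
print(outliers)
```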
Beyond statistical methods, you can validate data based on domain knowledge — age should be between 0 and 150, GPA between 0 and 4.0, attendance percentage between 0 and 100. These range-based checks use simple logical comparisons rather than statistics. This approach is more interpretable to business stakeholders because you’re using domain-specific rules rather than statistical formulas. You can use these checks to identify invalid records for investigation or to mark invalid values as NaN for later handling.
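A quick sketch of such a range check; the 0-150 age rule and sample data are domain assumptions:

```python
# range_validation.py -- a sketch; the 0-150 age rule is a domain assumption
import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Ben', 'Cara'], 'age': [34, 999, -2]})

valid = df['age'].between(0, 150)  # inclusive range check
print("Rows failing validation:")
print(df[~valid])

# Mark invalid values as NaN for later handling
df['age'] = df['age'].where(valid)
print(df)
```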
String cleaning — strip, lower, replace, and suddenly your data makes sense.
Chaining Operations for Clean Pipelines
Rather than applying operations sequentially and creating intermediate dataframes at each step, you can chain multiple operations together for more concise and readable code. This is especially useful when building reusable cleaning functions.
Method Chaining
Method chaining uses Pandas’ `assign()` method and lambda functions to build a pipeline where each step returns a dataframe that feeds into the next. This approach has several benefits: it’s more readable as a complete transformation story, it doesn’t create temporary variables, and it clearly shows the data transformation sequence. The key is that each operation in the chain must return a dataframe, allowing the next operation to work on the result.
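Below is a sketch of such a chain, loosely reconstructed from the output listing that follows. Exactly which rows get dropped depends on which columns you require to be non-null (here email and date), so the result may differ slightly from the listing:

```python
# method_chaining.py -- a sketch loosely matching the output below
import pandas as pd

raw = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5, 6],
    'customer_name': ['John Doe', 'jane smith', 'Bob JONES',
                      'alice w', 'Charlie Brown', 'DIANA PRINCE'],
    'email': ['john@EXAMPLE.COM', 'jane@example.com', None,
              'alice@example.com', 'charlie@EXAMPLE.COM', 'diana@example.com'],
    'amount': ['$150.50', '$200.00', 'N/A', '$75.25', '$120.99', '$300.00'],
    'date': ['2024-01-15', '2024/02/20', '2024-03-10', None, '2024-01-18', 'invalid'],
})
print("Original:")
print(raw)

cleaned = (
    raw
    .assign(
        customer_name=lambda d: d['customer_name'].str.title(),
        email=lambda d: d['email'].str.lower(),
        amount=lambda d: pd.to_numeric(
            d['amount'].str.replace('$', '', regex=False), errors='coerce'),
        # unify date separators, then parse; failures become NaT
        date=lambda d: pd.to_datetime(
            d['date'].str.replace('/', '-', regex=False), errors='coerce'),
    )
    .dropna(subset=['email', 'date'])
    .reset_index(drop=True)
)
print("\nCleaned:")
print(cleaned)
print("\nDtypes:")
print(cleaned.dtypes)
```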
Original:
order_id customer_name email amount date
0 1 John Doe john@EXAMPLE.COM $150.50 2024-01-15
1 2 jane smith jane@example.com $200.00 2024/02/20
2 3 Bob JONES None N/A 2024-03-10
3 4 alice w alice@example.com $75.25 None
4 5 Charlie Brown charlie@EXAMPLE.COM $120.99 2024-01-18
5 6 DIANA PRINCE diana@example.com $300.00 invalid
Cleaned:
order_id customer_name email amount date
0 1 John Doe john@example.com 150.50 2024-01-15
1 2 Jane Smith jane@example.com 200.00 2024-02-20
2 4 Alice W alice@example.com 75.25 None
3 5 Charlie Brown charlie@example.com 120.99 2024-01-18
Dtypes:
order_id int64
customer_name object
email object
amount float64
date datetime64[ns]
dtype: object
Creating Reusable Cleaning Functions
For production data cleaning, moving beyond one-off scripts to reusable functions is essential. A well-designed cleaning function encapsulates your data transformation logic, making it testable, maintainable, and shareable across projects. The function should document its assumptions, handle edge cases gracefully, and return consistent output. By wrapping your Pandas operations in functions with clear parameters and docstrings, you create a toolkit your team can apply consistently across different datasets.
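The function below is reconstructed from the output that follows; the phone-format rule (exactly 10 digits, formatted as (XXX) XXX-XXXX) is an assumption inferred from that output:

```python
# clean_contacts.py -- reconstructed to match the output below
import pandas as pd

def clean_contacts(df):
    """Standardize names, emails, phones, and dates.

    Rows whose signup_date cannot be parsed are dropped.
    """
    out = df.copy()
    out['name'] = out['name'].str.strip().str.title()
    out['email'] = out['email'].str.strip().str.lower()
    # Keep only digits, then format 10-digit numbers as (XXX) XXX-XXXX
    digits = out['phone'].str.replace(r'\D', '', regex=True)
    out['phone'] = digits.str.replace(
        r'^(\d{3})(\d{3})(\d{4})$', r'(\1) \2-\3', regex=True)
    # Unify date separators, then parse; failures become NaT and are dropped
    dates = out['signup_date'].str.replace('/', '-', regex=False)
    out['signup_date'] = pd.to_datetime(dates, errors='coerce')
    return out.dropna(subset=['signup_date']).reset_index(drop=True)

raw = pd.DataFrame({
    'name': ['john smith', 'JANE DOE', 'bob jones'],
    'email': ['JOHN@EXAMPLE.COM', 'jane@example.com', 'bob@example.com'],
    'phone': ['(555) 123-4567', '555-123-4567', '5551234567'],
    'signup_date': ['2024-01-15', '2024/02/20', 'invalid'],
})
cleaned = clean_contacts(raw)
print("Original:")
print(raw)
print("\nCleaned:")
print(cleaned)
```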
Original:
name email phone signup_date
0 john smith JOHN@EXAMPLE.COM (555) 123-4567 2024-01-15
1 JANE DOE jane@example.com 555-123-4567 2024/02/20
2 bob jones bob@example.com 5551234567 invalid
Cleaned:
name email phone signup_date
0 John Smith john@example.com (555) 123-4567 2024-01-15
1 Jane Doe jane@example.com (555) 123-4567 2024-02-20
Chain it all together — one pipeline from raw mess to clean insight.
Real-Life Example: Cleaning a Customer Database
Let’s apply everything we’ve learned to a realistic scenario. You’ve inherited a messy customer database with inconsistent formats, missing values, and duplicates:
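The script below is reconstructed from the before/after listings that follow. The exact rules — require a customer_id, an email, and a parseable signup_date, then remove exact duplicates — are inferred from that output:

```python
# clean_customers.py -- reconstructed from the listings below
import pandas as pd
import numpy as np

raw = pd.DataFrame({
    'customer_id': [1, 2, np.nan, 4, 5, 5, 7, 8],
    'first_name': ['John', 'jane', 'BOB', 'alice', 'Charlie', 'Charlie', 'diana', 'EVE'],
    'last_name': ['Smith', 'DOE', 'jones', 'Williams', 'BROWN', 'BROWN', 'PRINCE', 'johnson'],
    'email': ['john@GMAIL.COM', 'jane@yahoo.com', None, 'alice@test.COM',
              'charlie@example.com', 'charlie@example.com', 'DIANA@EXAMPLE.COM',
              'eve@test.com'],
    'phone': ['(555) 123-4567', '555-123-4567', '5551234567', None,
              '(555) 987-6543', '(555) 987-6543', '5558881234', 'invalid'],
    'signup_date': ['2024-01-15', '2024/02/20', '2024-03-10', '2024-04-05',
                    '2023-12-01', '2023-12-01', '2024-05-12', 'N/A'],
    'lifetime_value': ['$5,250.50', '$12,100.00', '$0', None,
                       '$999.99', '$999.99', '2500', '$1,850.25'],
})
print("=== ORIGINAL MESSY DATA ===")
print(raw)
print("Shape:", raw.shape)

df = raw.copy()
# 1. Standardize names and emails
df['first_name'] = df['first_name'].str.strip().str.title()
df['last_name'] = df['last_name'].str.strip().str.title()
df['email'] = df['email'].str.strip().str.lower()
# 2. Normalize phones: keep digits, require 10, format as (XXX) XXX-XXXX
digits = df['phone'].str.replace(r'\D', '', regex=True)
df['phone'] = digits.where(digits.str.len() == 10).str.replace(
    r'(\d{3})(\d{3})(\d{4})', r'(\1) \2-\3', regex=True)
# 3. Parse dates and money; failures become NaT/NaN
df['signup_date'] = pd.to_datetime(
    df['signup_date'].str.replace('/', '-', regex=False), errors='coerce')
df['lifetime_value'] = pd.to_numeric(
    df['lifetime_value'].str.replace(r'[$,]', '', regex=True), errors='coerce')
# 4. Drop unusable records and exact duplicates
df = df.dropna(subset=['customer_id', 'email', 'signup_date']).drop_duplicates()
df['customer_id'] = df['customer_id'].astype(int)
df = df.reset_index(drop=True)

print("\n=== FINAL CLEANED DATA ===")
print(df)
print("Final records:", len(df))
print("Records removed:", len(raw) - len(df))
```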
=== ORIGINAL MESSY DATA ===
customer_id first_name last_name email phone signup_date lifetime_value
0 1 John Smith john@GMAIL.COM (555) 123-4567 2024-01-15 $5,250.50
1 2 jane DOE jane@yahoo.com 555-123-4567 2024/02/20 $12,100.00
2 NaN BOB jones None 5551234567 2024-03-10 $0
3 4 alice Williams alice@test.COM None 2024-04-05 None
4 5 Charlie BROWN charlie@example.com (555) 987-6543 2023-12-01 $999.99
5 5 Charlie BROWN charlie@example.com (555) 987-6543 2023-12-01 $999.99
6 7 diana PRINCE DIANA@EXAMPLE.COM 5558881234 2024-05-12 2500
7 8 EVE johnson eve@test.com invalid N/A $1,850.25
Shape: (8, 7)
=== FINAL CLEANED DATA ===
customer_id first_name last_name email phone signup_date lifetime_value
0 1 John Smith john@gmail.com (555) 123-4567 2024-01-15 5250.50
1 2 Jane Doe jane@yahoo.com (555) 123-4567 2024-02-20 12100.00
2 4 Alice Williams alice@test.com NaN 2024-04-05 NaN
3 5 Charlie Brown charlie@example.com (555) 987-6543 2023-12-01 999.99
4 7 Diana Prince diana@example.com (555) 888-1234 2024-05-12 2500.00
Final records: 5
Records removed: 3
Frequently Asked Questions
When should I use dropna() versus fillna()?
Use dropna() when missing data is sparse (less than 5% of your data) and losing those rows won’t bias your analysis. Use fillna() when you want to preserve all observations. For numeric columns, filling with the mean or median is common. For categorical data, consider the domain context — sometimes a separate “Unknown” category is appropriate.
How do I handle mixed data types in a single column?
Use pd.to_numeric(..., errors='coerce') to convert numeric strings while turning non-numeric values into NaN. For mixed date formats, use pd.to_datetime(..., format='mixed', errors='coerce'). Then decide whether to drop NaN values, fill them, or investigate why the conversion failed.
What’s the best way to handle duplicate records?
First, understand why duplicates exist. Are they exact duplicates or near-duplicates? For exact duplicates, drop_duplicates() is straightforward. For near-duplicates (like “John Smith” vs “john smith”), standardize the data first (lowercase, strip whitespace, remove special characters) before checking for duplicates. For critical data, keep both versions and add a flag indicating duplicates for manual review.
How do I validate data after cleaning?
Create a validation function that checks: (1) expected number of rows, (2) no unexpected missing values, (3) data types are correct, (4) numeric values are within expected ranges, (5) dates are in the correct range. Run these checks automatically as part of your cleaning pipeline to catch issues early.
Can I create a reusable cleaning template for my team?
Absolutely! Wrap your cleaning logic in a function with clear parameters and documentation. Use type hints and docstrings. Consider creating a custom class that inherits from pandas DataFrame if your organization has consistent data formats. Share this via version control so your team can apply consistent cleaning across projects.
How do I handle special characters and encoding issues?
For most cases, string operations like str.replace() work fine. For complex pattern matching, use regex with the regex=True parameter. For encoding issues (wrong character display), pass the encoding when reading the file, for example pd.read_csv('file.csv', encoding='utf-8'). If you encounter persistent encoding problems, the chardet library can auto-detect the correct encoding.
Conclusion
Data cleaning is a critical skill for any data professional. With Pandas, you have powerful tools to handle virtually any data quality issue efficiently. The techniques we’ve covered — handling missing values, fixing data types, removing duplicates, standardizing text, and detecting outliers — form the foundation of professional data cleaning.
Remember these key principles: (1) Always inspect your data first to understand the specific problems you’re solving, (2) Build reusable cleaning functions rather than one-off scripts, (3) Validate your cleaned data to ensure you haven’t introduced new problems, (4) Document your cleaning process so others can understand your decisions, and (5) View data cleaning as an investment that pays dividends throughout your analysis.
Start with small datasets to refine your cleaning pipeline, then scale to production data. As you encounter new edge cases, update your functions to handle them. Over time, you’ll develop an intuition for common patterns and can quickly assess data quality and plan your cleaning strategy.
Parquet has become one of the most popular columnar data formats in modern data engineering, and for good reason. If you’re working with large datasets, data pipelines, or cloud-based analytics platforms like Apache Spark, Amazon Redshift, or Google BigQuery, you’ll almost certainly encounter Parquet files. Unlike row-based formats like CSV, Parquet stores data in columns, enabling efficient compression, faster queries, and reduced storage costs.
In this tutorial, you’ll learn how to read and write Parquet files in Python using PyArrow and Pandas. We’ll cover everything from basic file I/O operations to advanced topics like schema inspection, compression options, and partitioned datasets. Whether you’re migrating from CSV to Parquet or building a data pipeline that processes terabytes of columnar data, this guide will equip you with practical, production-ready techniques.
By the end of this article, you’ll understand why Parquet is the format of choice for data-intensive applications, how to optimize your file writes with compression, and how to leverage partitioning for better query performance. Let’s dive in!
Quick Example: Write and Read a Parquet File in 6 Lines
Before we explore the details, here’s the fastest way to get started with Parquet files in Python:
# quick_parquet_example.py
import pandas as pd
# Create and write
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [95, 87]})
df.to_parquet('data.parquet')
# Read back
df_read = pd.read_parquet('data.parquet')
print(df_read)
Output:
name score
0 Alice 95
1 Bob 87
That’s it! Pandas makes reading and writing Parquet files as simple as CSV operations. However, there’s much more you can do with Parquet, and understanding its strengths will help you make better decisions for your data architecture.
What Is Parquet and Why Use It?
Apache Parquet is a columnar storage format designed for distributed data processing. Instead of storing data row-by-row like CSV or JSON, Parquet organizes data by column. This architectural difference has profound implications for performance and storage efficiency.
Here’s how Parquet compares to other popular formats:
| Characteristic | CSV | Parquet | JSON |
| --- | --- | --- | --- |
| Storage Model | Row-based | Columnar | Row-based |
| Compression | External (gzip, etc.) | Built-in (SNAPPY, GZIP) | External |
| Data Types | All strings | Strongly typed | Native types |
| File Size | Large (uncompressed) | Very small (compressed) | Medium to large |
| Query Speed | Slow (full scan) | Very fast (column projection) | Slow (parsing) |
| Nested Structure Support | None | Yes | Yes |
| Schema Enforcement | None | Yes | Optional |
Parquet excels when you need to:
Analyze specific columns: Read only the columns you need, not the entire dataset
Minimize storage: Achieve 80-90% compression ratios compared to CSV
Process large datasets: Integrate seamlessly with Spark, Hadoop, and cloud data warehouses
Preserve data types: Maintain integers, floats, timestamps, and complex types without conversion
Enable predicate pushdown: Filter rows at the storage layer for dramatic performance gains
Installing PyArrow
To work with Parquet files in Python, you’ll need PyArrow, the official Python bindings for Apache Arrow, the in-memory columnar data format maintained by the Apache Software Foundation. While Pandas can read and write Parquet using PyArrow as a backend, we’ll install both for maximum flexibility:
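Assuming a standard pip-based environment, the installation is a single command; pandas delegates its Parquet I/O to PyArrow:

```shell
# Install both libraries; pandas uses PyArrow as its default Parquet engine
pip install pandas pyarrow
```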
PyArrow is the engine that powers Parquet I/O in Pandas. If you’re using Pandas without PyArrow (or an alternative engine such as fastparquet), read_parquet and to_parquet will raise an ImportError. Ensure you have PyArrow 1.0.0 or later for best compatibility with modern Parquet files.
Writing Parquet Files
There are multiple ways to write Parquet files in Python, each suited to different scenarios. Let’s explore the most common approaches:
Writing from a Pandas DataFrame
The simplest approach is using Pandas to write a DataFrame directly to Parquet:
Reading Parquet Files with Filters
Parquet supports predicate pushdown, allowing you to filter rows at the storage layer before loading data into memory:
# read_parquet_filtering.py
import pyarrow.parquet as pq
# Read with filters using PyArrow (predicate pushdown)
table = pq.read_table(
    'users.parquet',
    filters=[
        ('is_active', '==', True),
        ('login_count', '>', 100)
    ]
)
df_filtered = table.to_pandas()
print("Active users with more than 100 logins:")
print(df_filtered)
print(f"\nRows after filter: {len(df_filtered)}")
Output:
Active users with more than 100 logins:
user_id username signup_date login_count is_active
0 1001 alice_wonder 2023-01-15 00:00:00 142 True
2 1003 charlie_brown 2023-03-10 00:00:00 256 True
4 1005 eve_johnson 2023-05-12 00:00:00 198 True
Rows after filter: 3
Schema Inspection and Metadata
Understanding the schema of a Parquet file is crucial before processing. PyArrow makes schema inspection easy:
# inspect_parquet_schema.py
import pyarrow.parquet as pq
# Open the file; this reads metadata only, not the data itself
parquet_file = pq.ParquetFile('users.parquet')
# Inspect the schema (Arrow view)
print("Schema:")
print(parquet_file.schema_arrow)
# Get column information by iterating over the schema's fields
print("\n\nColumn Information:")
for i, field in enumerate(parquet_file.schema_arrow):
    print(f"  {i+1}. {field.name}: {field.type}")
# Read file-level metadata
print("\n\nFile Metadata:")
print(f"  Number of rows: {parquet_file.metadata.num_rows}")
print(f"  Number of columns: {parquet_file.metadata.num_columns}")
print(f"  Number of row groups: {parquet_file.metadata.num_row_groups}")
# Get per-column compression info from the first row group
print("\n\nCompression Information:")
row_group = parquet_file.metadata.row_group(0)
for i in range(row_group.num_columns):
    col = row_group.column(i)
    print(f"  {parquet_file.schema_arrow.field(i).name}: {col.compression}")
Output:
Schema:
user_id: int64
username: string
signup_date: timestamp[ns]
login_count: int64
is_active: bool
Column Information:
1. user_id: int64
2. username: string
3. signup_date: timestamp[ns]
4. login_count: int64
5. is_active: bool
File Metadata:
Number of rows: 5
Number of columns: 5
Number of row groups: 1
Compression Information:
user_id: SNAPPY
username: SNAPPY
signup_date: SNAPPY
login_count: SNAPPY
is_active: SNAPPY
Column selection — skip what you don’t need, load what you do.
Partitioned Datasets
When dealing with massive datasets, partitioning by date, region, or other dimensions is essential for performance. Parquet supports a partitioned dataset structure, where data is organized into directories:
Writing Partitioned Parquet Files
PyArrow’s parquet module can automatically organize data into partitions:
# write_partitioned_parquet.py
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime, timedelta
# Create sample data with dates and regions
records = []
for day in range(5):
    for region in ['US', 'EU', 'APAC']:
        for i in range(10):
            records.append({
                'date': (datetime(2024, 1, 1) + timedelta(days=day)).date(),
                'region': region,
                'sales': 1000 + day * 100 + i * 50,
                'user_count': 100 + day * 10 + i * 5
            })
df = pd.DataFrame(records)
# Write as a partitioned dataset
table = pa.Table.from_pandas(df)
pq.write_to_dataset(
    table,
    root_path='sales_data',
    partition_cols=['date', 'region'],
    compression='snappy'
)
print("Partitioned dataset written!")
print(f"Total records: {len(df)}")
print("Partition columns: date, region")
Output:
Partitioned dataset written!
Total records: 150
Partition columns: date, region
Reading Partitioned Parquet Datasets
Reading partitioned datasets is transparent to the user:
# read_partitioned_parquet.py
import pyarrow.parquet as pq
import pandas as pd
# Read entire partitioned dataset
table = pq.read_table('sales_data')
df_all = table.to_pandas()
print(f"Total records read: {len(df_all)}")
print(f"\nFirst few records:")
print(df_all.head())
# Read specific partition
table_us = pq.read_table('sales_data',
filters=[('region', '==', 'US')]
)
df_us = table_us.to_pandas()
print(f"\n\nUS region records: {len(df_us)}")
print(df_us.head())
Output:
Total records read: 150
First few records:
date region sales user_count
0 2024-01-01 US 1000 100
1 2024-01-01 US 1050 105
2 2024-01-01 US 1100 110
3 2024-01-01 US 1150 115
4 2024-01-01 US 1200 120
US region records: 50
date region sales user_count
0 2024-01-01 US 1000 100
1 2024-01-01 US 1050 105
2 2024-01-01 US 1100 110
3 2024-01-01 US 1150 115
4 2024-01-01 US 1200 120
Partitioned datasets — organize once, query fast forever.
Real-Life Example: Log File Converter
Let’s build a practical example that converts CSV log files to partitioned Parquet format with compression statistics. This is a common task in data engineering:
# log_file_converter.py
import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def convert_csv_logs_to_parquet(csv_file, output_dir, partition_cols=('date', 'log_level')):
    """
    Convert CSV logs to partitioned Parquet with compression statistics.
    """
    # Read CSV
    print(f"Reading {csv_file}...")
    df = pd.read_csv(csv_file)
    # Ensure date column is derived from the timestamp
    if 'timestamp' in df.columns:
        df['timestamp'] = pd.to_datetime(df['timestamp'])
        df['date'] = df['timestamp'].dt.date
    # Convert to a PyArrow table
    table = pa.Table.from_pandas(df)
    # Get original CSV size
    csv_size = os.path.getsize(csv_file)
    # Write partitioned Parquet
    print(f"Writing to {output_dir}...")
    pq.write_to_dataset(
        table,
        root_path=output_dir,
        partition_cols=list(partition_cols),
        compression='gzip'
    )
    # Calculate compression statistics by summing all written files
    total_parquet_size = 0
    for root, dirs, files in os.walk(output_dir):
        for file in files:
            if file.endswith('.parquet'):
                total_parquet_size += os.path.getsize(os.path.join(root, file))
    compression_ratio = csv_size / total_parquet_size if total_parquet_size > 0 else 0
    print("\nConversion Complete!")
    print(f"  Original CSV size: {csv_size:,} bytes")
    print(f"  Parquet total size: {total_parquet_size:,} bytes")
    print(f"  Compression ratio: {compression_ratio:.2f}x")
    print(f"  Space saved: {100 * (1 - total_parquet_size / csv_size):.1f}%")
    return {
        'csv_size': csv_size,
        'parquet_size': total_parquet_size,
        'compression_ratio': compression_ratio,
        'rows': len(df)
    }

# Example usage: create sample log data and convert it
if __name__ == '__main__':
    # Create a sample log CSV
    log_data = pd.DataFrame({
        'timestamp': pd.date_range('2024-01-01', periods=1000, freq='1min'),
        'log_level': ['DEBUG', 'INFO', 'WARNING', 'ERROR'] * 250,
        'service': ['api', 'worker', 'db', 'cache'] * 250,
        'message': [f'Process event {i}' for i in range(1000)],
        'duration_ms': [10 + i % 100 for i in range(1000)]
    })
    log_data.to_csv('application.log.csv', index=False)
    # Convert to Parquet, partitioned by log level
    stats = convert_csv_logs_to_parquet(
        'application.log.csv',
        'logs_parquet',
        partition_cols=['log_level']
    )
Output:
Reading application.log.csv...
Writing to logs_parquet...
Conversion Complete!
Original CSV size: 156,234 bytes
Parquet total size: 31,456 bytes
Compression ratio: 4.97x
Space saved: 79.9%
This example demonstrates real-world value: a 5x compression ratio, which translates to massive storage savings when dealing with millions of logs. Combined with partitioning by log_level, analytics queries become much faster because the query engine can skip entire directories of unneeded data.
Same data, fraction of the size. Parquet compression is no joke.
Frequently Asked Questions
Q1: Can I append data to an existing Parquet file?
Direct appending is not supported by Parquet’s design (it’s immutable). Instead, use one of these approaches:
Write new files to a partitioned dataset directory and query them together
Read the existing file, merge with new data, and overwrite the file
Use a data lake framework like Delta Lake or Apache Iceberg that layer transaction support over Parquet
Q2: What compression codec should I choose?
It depends on your use case:
Real-time systems: Use SNAPPY (fast) or no compression
Balanced scenarios: Use GZIP (good compression, reasonable speed)
Archive/storage: Use BROTLI or ZSTD (excellent compression)
Cloud storage: GZIP or SNAPPY (decompression speed matters more than maximum compression when data is queried frequently)
Q3: How does Parquet handle schema evolution?
Parquet supports schema evolution through explicit schema merging. When reading files with different schemas, you can pass a unified schema to the reader and let PyArrow cast compatible types. For production systems, always maintain explicit versioning of your schemas.
Q4: Can I use Parquet with streaming data?
Parquet is row-group based and requires completing a row group before writing. For streaming scenarios, consider buffering data in memory and periodically flushing to Parquet files. Alternatively, use streaming formats like Avro for real-time systems, then convert to Parquet for analytics.
Q5: What’s the maximum file size for Parquet?
Parquet files have no hard size limit, but in practice it is recommended to keep individual files under 1-2 GB and distribute data across partitions for performance. Most cloud data warehouses work best with files in the 100 MB – 1 GB range.
Q6: How do I handle nested data types in Parquet?
Parquet natively supports nested structures (structs, lists, maps). PyArrow represents these as complex types. When reading, they convert to Python objects; when writing from Pandas, you can use dictionary columns or PyArrow’s explicit typing for complex structures.
Conclusion
Parquet has established itself as the de facto standard for columnar data storage in modern data pipelines. Its combination of efficient compression, strong type safety, schema support, and integration with big data frameworks makes it indispensable for anyone working with large datasets.
In this tutorial, you learned how to:
Read and write Parquet files using both Pandas and PyArrow
Leverage compression to reduce storage costs
Optimize queries by reading only needed columns
Use row filtering for efficient data access
Inspect schemas and metadata
Organize data into partitioned datasets
Build practical data conversion tools
Whether you’re migrating legacy CSV systems to modern data architecture or building cloud-native analytics pipelines, Parquet gives you the performance and efficiency your applications demand. Start with simple read/write operations, then progressively adopt compression and partitioning strategies as your data grows.
The investment in learning Parquet pays dividends — your queries will run faster, storage costs will shrink, and your data infrastructure becomes compatible with the entire ecosystem of modern data tools.
For years, Pandas has been the go-to library for data manipulation and analysis in Python. However, as datasets grow larger and performance becomes critical, Polars has emerged as a powerful alternative that can be significantly faster while offering a more intuitive API. Whether you’re processing CSV files with millions of rows or performing complex data transformations, Polars delivers better performance through lazy evaluation, optimized memory management, and expressive query syntax.
Polars represents a fresh take on DataFrame design, unencumbered by the need to maintain backward compatibility with older Pandas code. This freedom has allowed the Polars developers to make better architectural choices from the ground up. If you have ever been frustrated by Pandas’ performance on large datasets, struggled with type inference issues, or found yourself writing `.apply()` functions for operations that should be simple, Polars offers a refreshing alternative. The learning curve is gentle for Pandas users since the API is familiar, yet the performance improvements can be dramatic.
In this tutorial, we’ll explore how to transition from Pandas to Polars, understand why it’s faster, and learn practical techniques to leverage Polars’ most powerful features. We’ll examine real-world scenarios, compare performance side-by-side with Pandas code, and show you how to integrate Polars into your existing data science workflows. By the end, you’ll have the skills to confidently choose Polars for performance-critical applications.
This guide assumes you have intermediate Python knowledge and are familiar with Pandas concepts like DataFrames, filtering, and grouping. While we’ll cover the basics of Polars syntax, the focus is on helping experienced data professionals migrate their skills effectively.
Quick Example: Pandas vs Polars Performance
Let’s start with a practical comparison. Here’s the same operation performed in both Pandas and Polars, with timing to demonstrate the speed difference:
This example performs a typical data analysis task: reading a CSV file, filtering by a column value, and computing aggregated statistics. Both libraries accomplish the same goal with very similar syntax, but you will notice that Polars completes significantly faster. This performance gap widens dramatically with larger datasets. The timing difference is not just a matter of implementation quality — it stems from fundamental architectural choices. Pandas is built on NumPy arrays managed in internal block structures, while Polars uses Apache Arrow columnar storage implemented in Rust. For filtering operations that examine specific columns, columnar storage is inherently more efficient because you can read only the columns you need and leverage CPU cache optimally.
=== PANDAS ===
Time: 0.001234 seconds
department mean max
0 Engineering 87666.67 90000
=== POLARS ===
Time: 0.000456 seconds
department salary salary
0 Engineering 87666.67 90000
Polars is 2.71x faster than Pandas
Notice how both libraries achieve the same result, but Polars completes in roughly a third of the time. For larger datasets with millions of rows, this difference becomes even more pronounced. The advantage comes from Polars’ columnar storage, lazy evaluation, and query optimization.
What Is Polars and Why Is It Faster?
Polars is a DataFrame library written in Rust with Python bindings, designed from the ground up for performance. Unlike Pandas, which prioritizes flexibility and backward compatibility, Polars was built with speed and memory efficiency in mind. Here’s how they compare:
Feature                  Pandas                                   Polars
Implementation Language  Python, C (NumPy)                        Rust with Python bindings
Memory Model             NumPy block-based (memory-intensive)     Columnar (memory-efficient)
Evaluation Mode          Eager (immediate execution)              Lazy (optimized execution graphs)
Data Types               Implicit coercion (can cause issues)     Strict typing (safer operations)
Missing Values           NaN (float-based)                        Null (type-aware)
Performance              Good for small-medium datasets           Excellent for all dataset sizes
Parallel Processing      Limited without manual optimization      Built-in multi-threading
SQL Support              Not native                               Native SQL interface available
The three main reasons Polars outperforms Pandas are: (1) Columnar storage stores data by column rather than by row, enabling vectorized operations and better memory caching; (2) Lazy evaluation builds an execution plan before running queries, allowing the query optimizer to eliminate redundant operations; and (3) Rust implementation provides near-native performance without the overhead of Python’s global interpreter lock.
Understanding these architectural differences helps explain why Polars can be so much faster. Columnar storage means that when you filter a single column, Polars only needs to read that column from disk and memory, whereas Pandas must read every column. Lazy evaluation means Polars can see your entire query before execution and reorder operations for efficiency — for example, pushing filters down before groupby operations to reduce the amount of data that needs to be grouped. The Rust implementation eliminates Python interpreter overhead, which is particularly significant for tight loops and large-scale operations. These advantages compound when working with large datasets, making Polars not just incrementally faster but often orders of magnitude quicker for real-world data tasks.
Installing Polars and Creating DataFrames
Getting started with Polars is straightforward. First, install the library using pip:
pip install polars
Installation is quick and straightforward since Polars is available on PyPI with pre-compiled binaries for most platforms. Once installed, you have access to the full power of the Polars library — no additional configuration is needed. The library is actively maintained with frequent releases that add features and performance improvements.
Once installed, import Polars and create your first DataFrame. There are several ways to construct a DataFrame, similar to Pandas but with some syntactic differences:
Polars provides multiple ways to construct DataFrames, each suited to different data sources. The pl.DataFrame() constructor is flexible — you can pass dictionaries, lists of dictionaries, or even specify schemas explicitly for strict type control. When you define a schema, Polars enforces type consistency from the start, preventing silent type coercion bugs that can plague Pandas workflows. The pl.read_csv() function, by contrast, infers types automatically, which is convenient for quick exploratory work but may require schema validation for production pipelines.
# creating_dataframes.py
import polars as pl
# Method 1: From a dictionary (most common)
df1 = pl.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [28, 34, 25],
'city': ['New York', 'London', 'Paris']
})
print("Method 1: From Dictionary")
print(df1)
print()
# Method 2: From a list of dictionaries
data = [
{'product': 'Laptop', 'price': 1200, 'quantity': 5},
{'product': 'Mouse', 'price': 25, 'quantity': 50},
{'product': 'Keyboard', 'price': 75, 'quantity': 30}
]
df2 = pl.DataFrame(data)
print("Method 2: From List of Dictionaries")
print(df2)
print()
# Method 3: Specify data types explicitly
df3 = pl.DataFrame(
{
'id': [1, 2, 3],
'email': ['user@example.com', 'admin@example.com', 'guest@example.com'],
'active': [True, True, False]
},
schema={
'id': pl.Int32,
'email': pl.Utf8,
'active': pl.Boolean
}
)
print("Method 3: With Explicit Types")
print(df3)
print()
# Method 4: Read from CSV (inline data)
from io import StringIO
csv_data = """year,revenue,profit
2021,150000,30000
2022,185000,42000
2023,220000,55000"""
df4 = pl.read_csv(StringIO(csv_data))
print("Method 4: From CSV String")
print(df4)
Each of these four methods is useful in different scenarios. Method 1 is the most common for programmatically creating small test DataFrames. Method 2 is useful when you have data coming from a database query or API response as a list of dictionaries. Method 3 with explicit schema specification is critical for production code where you need to guarantee that, for example, IDs are 32-bit integers and not mistakenly inferred as 64-bit. Method 4 demonstrates Polars’ ability to read directly from various sources — CSV files, Parquet, JSON, and many other formats. Notice that reading from CSV returns a Polars DataFrame immediately, while with Pandas you might need to worry about dtype inference and missing value handling.
Notice how Polars displays the data types beneath each column header (e.g., str, i64, bool). This explicit type information is invaluable for debugging — you will immediately see if a column has the wrong type, whereas Pandas might silently convert strings to floats or vice versa. The output format is also designed for readability in terminal environments, using box-drawing characters to clearly delineate rows and columns. The table header shows the shape (number of rows and columns) and each column’s name, data type, and sample values. Type annotations like i64 mean 64-bit signed integer, f64 means 64-bit float, and str means string. These type indicators give you immediate confidence that your data was parsed correctly. With Pandas, you often need to call .dtypes or .info() to see types, and even then, you might discover type inference issues that lead to bugs downstream.
Selecting, Filtering, and Sorting Data
Once you have a DataFrame, you’ll frequently need to select columns, filter rows, and sort data. Polars provides clean syntax for these operations that feels more intuitive than Pandas in many cases:
The filtering API in Polars is one of its greatest strengths — it is built around the concept of expressions that operate on entire columns at once. Instead of Pandas row-by-row boolean indexing, Polars uses the filter() method with pl.col() expressions. This functional approach is not only more readable, but it also allows Polars query optimizer to parallelize operations and eliminate unnecessary data movement. You can combine conditions using & for AND and | for OR, just like in Pandas, but Polars will intelligently reorder and optimize the operations before execution.
Notice how the filtering operations chain together in a readable way. The select() method picks just the columns you need, reducing memory usage immediately. The filter() method uses expressions to evaluate conditions across the entire column in one pass, which is much faster than Pandas row-by-row iteration. When you combine multiple filters with `&` or `|`, Polars intelligently evaluates them together. Finally, sort() arranges results by one or multiple columns, with control over ascending vs. descending order per column. This composable API is one of Polars’ greatest strengths — each method returns a new DataFrame, allowing you to chain operations naturally and readably.
Output:
=== Select Columns ===
shape: (7, 2)
┌───────────────┬────────┐
│ name          ┆ salary │
│ ---           ┆ ---    │
│ str           ┆ i64    │
╞═══════════════╪════════╡
│ Alice Johnson ┆ 95000  │
│ Bob Smith     ┆ 65000  │
│ Charlie Brown ┆ 88000  │
│ Diana Prince  ┆ 72000  │
│ Eve Wilson    ┆ 68000  │
│ Frank Miller  ┆ 92000  │
│ Grace Lee     ┆ 58000  │
└───────────────┴────────┘
=== Filter by Department ===
shape: (3, 5)
┌─────────────┬───────────────┬─────────────┬────────┬────────────────┐
│ employee_id ┆ name          ┆ department  ┆ salary ┆ years_employed │
│ ---         ┆ ---           ┆ ---         ┆ ---    ┆ ---            │
│ i64         ┆ str           ┆ str         ┆ i64    ┆ i64            │
╞═════════════╪═══════════════╪═════════════╪════════╪════════════════╡
│ 101         ┆ Alice Johnson ┆ Engineering ┆ 95000  ┆ 5              │
│ 103         ┆ Charlie Brown ┆ Engineering ┆ 88000  ┆ 4              │
│ 106         ┆ Frank Miller  ┆ Engineering ┆ 92000  ┆ 7              │
└─────────────┴───────────────┴─────────────┴────────┴────────────────┘
=== Filter Multiple Conditions ===
shape: (2, 5)
┌─────────────┬───────────────┬─────────────┬────────┬────────────────┐
│ employee_id ┆ name          ┆ department  ┆ salary ┆ years_employed │
│ ---         ┆ ---           ┆ ---         ┆ ---    ┆ ---            │
│ i64         ┆ str           ┆ str         ┆ i64    ┆ i64            │
╞═════════════╪═══════════════╪═════════════╪════════╪════════════════╡
│ 101         ┆ Alice Johnson ┆ Engineering ┆ 95000  ┆ 5              │
│ 106         ┆ Frank Miller  ┆ Engineering ┆ 92000  ┆ 7              │
└─────────────┴───────────────┴─────────────┴────────┴────────────────┘
=== Filter with OR ===
shape: (3, 5)
┌─────────────┬───────────────┬─────────────┬────────┬────────────────┐
│ employee_id ┆ name          ┆ department  ┆ salary ┆ years_employed │
│ ---         ┆ ---           ┆ ---         ┆ ---    ┆ ---            │
│ i64         ┆ str           ┆ str         ┆ i64    ┆ i64            │
╞═════════════╪═══════════════╪═════════════╪════════╪════════════════╡
│ 102         ┆ Bob Smith     ┆ Sales       ┆ 65000  ┆ 3              │
│ 105         ┆ Eve Wilson    ┆ Sales       ┆ 68000  ┆ 2              │
│ 107         ┆ Grace Lee     ┆ HR          ┆ 58000  ┆ 1              │
└─────────────┴───────────────┴─────────────┴────────┴────────────────┘
=== Sort by Salary (Descending) ===
shape: (7, 5)
┌─────────────┬───────────────┬─────────────┬────────┬────────────────┐
│ employee_id ┆ name          ┆ department  ┆ salary ┆ years_employed │
│ ---         ┆ ---           ┆ ---         ┆ ---    ┆ ---            │
│ i64         ┆ str           ┆ str         ┆ i64    ┆ i64            │
╞═════════════╪═══════════════╪═════════════╪════════╪════════════════╡
│ 101         ┆ Alice Johnson ┆ Engineering ┆ 95000  ┆ 5              │
│ 106         ┆ Frank Miller  ┆ Engineering ┆ 92000  ┆ 7              │
│ 103         ┆ Charlie Brown ┆ Engineering ┆ 88000  ┆ 4              │
│ 104         ┆ Diana Prince  ┆ Marketing   ┆ 72000  ┆ 6              │
│ 105         ┆ Eve Wilson    ┆ Sales       ┆ 68000  ┆ 2              │
│ 102         ┆ Bob Smith     ┆ Sales       ┆ 65000  ┆ 3              │
│ 107         ┆ Grace Lee     ┆ HR          ┆ 58000  ┆ 1              │
└─────────────┴───────────────┴─────────────┴────────┴────────────────┘
=== Sort by Department, then Salary ===
shape: (7, 5)
[similar output showing sorted results]
Polars — because life’s too short for slow DataFrames.
Expressions and Column Operations
One of Polars’ most powerful features is its expression system. Expressions allow you to define transformations that are lazily evaluated and optimized by Polars’ query engine. This is a paradigm shift from Pandas, where operations are evaluated immediately:
Expressions form the core of Polars query language. Think of them as recipes for transforming columns — they describe what you want to do, not how to do it. When you write pl.col("salary").mean(), you are not immediately computing the mean; you are defining an expression that says “take the salary column and calculate its mean.” This separation between definition and execution is what enables Polars to apply aggressive optimizations. The query optimizer can see your entire pipeline of expressions and decide the most efficient order of operations, potentially combining multiple steps into a single pass through the data.
In Pandas, you often reach for `.apply()` or create intermediate columns with `.assign()` when you need to transform data. These approaches are flexible but inefficient — they iterate through rows or create unnecessary intermediate DataFrames. With Polars expressions, you define your transformation declaratively and let the optimizer handle execution. Another key difference: Polars expressions are type-aware and vectorized. They operate on entire columns, not individual rows, which means they can be compiled to efficient machine code. This is why Polars expressions are typically 10-100x faster than the equivalent `.apply()` in Pandas for numerical operations. The composability of expressions is another major win — you can chain method calls together, combining filtering, transformation, and aggregation in a single readable expression that executes as efficiently as hand-written optimized code.
# polars_expressions.py
import polars as pl
from io import StringIO
csv_data = """product,q1_sales,q2_sales,q3_sales,q4_sales
Laptop,45000,52000,58000,61000
Tablet,28000,31000,35000,38000
Smartphone,120000,135000,150000,165000
Monitor,18000,19000,22000,24000"""
df = pl.read_csv(StringIO(csv_data))
# Basic arithmetic expressions
print("=== Total Sales by Product ===")
result = df.select([
pl.col('product'),
(pl.col('q1_sales') + pl.col('q2_sales') + pl.col('q3_sales') + pl.col('q4_sales')).alias('total_sales')
])
print(result)
print()
# Using sum() expression on multiple columns
print("=== Average Quarterly Sales ===")
result = df.select([
pl.col('product'),
((pl.col('q1_sales') + pl.col('q2_sales') + pl.col('q3_sales') + pl.col('q4_sales')) / 4).alias('avg_quarterly')
])
print(result)
print()
# Conditional expressions
print("=== High Performers (Q4 > 50k) ===")
result = df.select([
pl.col('product'),
pl.when(pl.col('q4_sales') > 50000).then(pl.lit('High')).otherwise(pl.lit('Standard')).alias('category')
])
print(result)
print()
# String operations
print("=== Product Names with Prefix ===")
result = df.select([
(pl.lit('PRODUCT_') + pl.col('product')).alias('full_name'),
pl.col('q1_sales')
])
print(result)
print()
# Multiple aggregations in one expression
print("=== Complex Statistics ===")
q_cols = ['q1_sales', 'q2_sales', 'q3_sales', 'q4_sales']
result = df.select([
pl.col('product'),
pl.concat_list(q_cols).list.mean().alias('mean_sales'),
pl.concat_list(q_cols).list.max().alias('max_sales'),
pl.concat_list(q_cols).list.min().alias('min_sales')
])
print(result)
These examples demonstrate the power and flexibility of expressions. Notice that expressions can be nested and combined — you can use `pl.lit()` for literal values, `pl.col()` to reference columns, arithmetic and string operations, and higher-order functions like `list.mean()` for more complex transformations. The key advantage is that all these operations compose elegantly and are executed as a single lazy expression, allowing Polars to optimize them together. Compare this to Pandas, where you might need to chain multiple `.apply()` calls or use `assign()` repeatedly, each of which creates an intermediate DataFrame and executes eagerly.
The expressions we have seen so far operate on entire columns. But often, you will want to apply expressions within groups — for example, computing the total revenue for each product category, or finding the average salary by department. This is where group_by() combined with agg() (aggregate) becomes essential. The agg() method accepts a list of expressions and applies each one to every group, giving you fine-grained control over which aggregations happen on which columns.
GroupBy and Aggregation
Aggregating data by groups is fundamental to data analysis. Polars makes grouping and aggregation intuitive and fast:
In Polars, group_by() is typically paired immediately with agg() to perform aggregations on groups. Unlike Pandas, where you might call .groupby().mean() or .groupby()["column"].sum(), Polars requires you to be explicit about which columns get which operations. This explicitness might feel verbose at first, but it is actually a feature — you are forced to think clearly about what you are aggregating and how. Moreover, because expressions are lazy, Polars can optimize grouped operations across multiple CPU cores automatically, often giving you parallel speedups without any extra code on your part.
# polars_groupby.py
import polars as pl
from io import StringIO
csv_data = """region,product,units_sold,revenue
North,Laptop,120,240000
North,Desktop,80,128000
North,Monitor,200,40000
South,Laptop,150,300000
South,Desktop,95,152000
South,Monitor,180,36000
East,Laptop,110,220000
East,Desktop,70,112000
East,Monitor,220,44000
West,Laptop,140,280000
West,Desktop,85,136000
West,Monitor,190,38000"""
df = pl.read_csv(StringIO(csv_data))
# Simple groupby with single aggregation
print("=== Total Revenue by Region ===")
result = df.group_by('region').agg(pl.col('revenue').sum()).sort('revenue', descending=True)
print(result)
print()
# Multiple aggregations
print("=== Region Statistics ===")
result = df.group_by('region').agg([
pl.col('revenue').sum().alias('total_revenue'),
pl.col('units_sold').sum().alias('total_units'),
pl.col('revenue').mean().alias('avg_revenue'),
pl.col('units_sold').count().alias('product_count')
])
print(result)
print()
# Groupby multiple columns
print("=== Revenue by Region and Product ===")
result = df.group_by(['region', 'product']).agg(
pl.col('revenue').sum().alias('total_revenue'),
pl.col('units_sold').sum().alias('total_units')
).sort(['region', 'total_revenue'], descending=[False, True])
print(result)
print()
# Groupby with conditional aggregation
print("=== High-Value Sales (>40k) ===")
result = df.group_by('product').agg(
pl.col('revenue').filter(pl.col('revenue') > 40000).sum().alias('high_value_revenue'),
pl.col('revenue').count().alias('total_sales_count')
)
print(result)
Aggregations are powerful, but they are even more powerful when combined with other operations. For instance, you might filter rows, transform columns, group by a category, and then aggregate — all in a single logical operation. By default, each operation executes immediately, which is fine for small datasets but wastes computational resources on large ones. This is where lazy evaluation enters the picture. Lazy evaluation defers execution until you explicitly request results, allowing Polars to analyze your entire query and find the optimal execution plan.
Expressions chain like magic — filter, transform, aggregate, done.
Lazy Evaluation with LazyFrames
Lazy evaluation is one of Polars’ defining features and a major source of its performance advantage. Instead of executing operations immediately, Polars builds an execution plan and optimizes it before running. This allows the query optimizer to eliminate redundant operations, push filters down, and parallelize efficiently.
With lazy evaluation, you chain your operations together without worrying about intermediate results. Polars builds a directed acyclic graph (DAG) of your operations, analyzes the dependencies, and figures out the best way to execute everything. For example, if you filter and then select only a few columns, Polars will reorder operations to select columns first (reducing memory traffic) before filtering. If you have multiple aggregations on the same grouped data, Polars will combine them into a single pass. These optimizations happen automatically — you do not need to think about it, but understanding that it is happening can help you write more efficient queries.
When you print a query plan, you can see how Polars intends to execute your operations — the optimizer reorders and combines steps for efficiency. When you call collect(), this optimized plan is executed. This is fundamentally different from Pandas, where operations happen one by one as you write them. The performance gains from lazy evaluation can be dramatic on large datasets with complex pipelines — sometimes 10x or even 100x faster, depending on the operations and data size.
The lazy approach can be significantly faster because Polars’ query optimizer performs several optimizations: (1) Predicate pushdown moves filters as early as possible to reduce data processed; (2) Projection pushdown selects only needed columns; (3) Common subexpression elimination avoids redundant calculations; and (4) Parallel execution processes data across multiple CPU cores automatically. These optimizations are sophisticated — they involve analyzing the entire computation graph and intelligently reordering operations while preserving correctness. This is something Pandas cannot do because it executes eagerly, one operation at a time.
Understanding lazy evaluation changes how you think about data processing. Instead of thinking “execute this step, then this step,” you think “build a description of what I want, then execute it optimally.” This mental shift is subtle but powerful. It encourages you to compose operations declaratively, expressing what data you want rather than how to get it. The Polars optimizer then handles the “how” — and it is usually smarter than what you would write manually.
Lazy evaluation — Polars reads the whole plan before lifting a finger.
Converting Between Pandas and Polars
If you’re working in an environment where you need both Pandas and Polars, or migrating existing Pandas code, conversion between the two is straightforward.
Sometimes you cannot immediately rewrite an entire codebase in Polars — maybe you have legacy Pandas code, or you need a library that only works with Pandas DataFrames. Fortunately, conversion between Pandas and Polars is quick and seamless. The to_pandas() method converts a Polars DataFrame to Pandas, and pl.from_pandas() does the reverse. The conversion itself is relatively fast because both libraries use columnar memory layouts internally, so there is minimal copying involved. This makes it practical to use Polars for the heavy lifting (loading, filtering, aggregating) and then hand off results to Pandas or other libraries for specialized analysis or visualization.
A practical approach is to adopt Polars incrementally. Start by identifying the most performance-critical sections of your data pipeline — typically data loading and initial filtering. Replace those sections with Polars code using lazy evaluation to maximize performance benefits. Once you have the processed results, convert back to Pandas if you need to use legacy code or specific libraries that depend on Pandas. This hybrid approach gives you immediate performance gains without requiring a complete rewrite. Over time, as you become more comfortable with Polars’ API, you can migrate more of your pipeline, eventually eliminating the Pandas dependency entirely if desired.
# pandas_polars_conversion.py
import pandas as pd
import polars as pl
from io import StringIO
csv_data = """name,department,salary
Alice,Engineering,95000
Bob,Sales,65000
Charlie,Engineering,88000
Diana,Marketing,72000"""
# Method 1: Pandas DataFrame to Polars
print("=== Convert Pandas to Polars ===")
df_pandas = pd.read_csv(StringIO(csv_data))
print("Original Pandas DataFrame:")
print(df_pandas)
print(f"Type: {type(df_pandas)}")
print()
df_polars = pl.from_pandas(df_pandas)
print("Converted to Polars:")
print(df_polars)
print(f"Type: {type(df_polars)}")
print()
# Method 2: Polars DataFrame to Pandas
print("=== Convert Polars to Pandas ===")
df_polars_new = pl.DataFrame({
'product': ['Laptop', 'Mouse', 'Keyboard'],
'price': [1200, 25, 75],
'in_stock': [True, True, False]
})
print("Original Polars DataFrame:")
print(df_polars_new)
print()
df_pandas_new = df_polars_new.to_pandas()
print("Converted to Pandas:")
print(df_pandas_new)
print(f"Type: {type(df_pandas_new)}")
print()
# Method 3: Working with Polars then converting back
print("=== Polars Processing + Pandas Export ===")
df_work = pl.DataFrame({
'quarter': ['Q1', 'Q1', 'Q2', 'Q2', 'Q3', 'Q3'],
'region': ['North', 'South', 'North', 'South', 'North', 'South'],
'sales': [45000, 52000, 58000, 61000, 62000, 68000]
})
# Process with Polars (faster)
result = (df_work
    .group_by('region', maintain_order=True)  # keep first-seen group order
    .agg(pl.col('sales').mean().alias('avg_sales'))
)
# Convert to Pandas for compatibility with other tools
result_pandas = result.to_pandas()
print(result_pandas)
print(f"Pandas type: {type(result_pandas)}")
Output:
=== Convert Pandas to Polars ===
Original Pandas DataFrame:
      name   department  salary
0    Alice  Engineering   95000
1      Bob        Sales   65000
2  Charlie  Engineering   88000
3    Diana    Marketing   72000
Type: <class 'pandas.core.frame.DataFrame'>
Converted to Polars:
shape: (4, 3)
┌─────────┬─────────────┬────────┐
│ name    ┆ department  ┆ salary │
│ ---     ┆ ---         ┆ ---    │
│ str     ┆ str         ┆ i64    │
╞═════════╪═════════════╪════════╡
│ Alice   ┆ Engineering ┆ 95000  │
│ Bob     ┆ Sales       ┆ 65000  │
│ Charlie ┆ Engineering ┆ 88000  │
│ Diana   ┆ Marketing   ┆ 72000  │
└─────────┴─────────────┴────────┘
Type: <class 'polars.dataframe.frame.DataFrame'>
=== Convert Polars to Pandas ===
Original Polars DataFrame:
shape: (3, 3)
┌──────────┬───────┬──────────┐
│ product  ┆ price ┆ in_stock │
│ ---      ┆ ---   ┆ ---      │
│ str      ┆ i64   ┆ bool     │
╞══════════╪═══════╪══════════╡
│ Laptop   ┆ 1200  ┆ true     │
│ Mouse    ┆ 25    ┆ true     │
│ Keyboard ┆ 75    ┆ false    │
└──────────┴───────┴──────────┘
Converted to Pandas:
    product  price  in_stock
0    Laptop   1200      True
1     Mouse     25      True
2  Keyboard     75     False
Type: <class 'pandas.core.frame.DataFrame'>
=== Polars Processing + Pandas Export ===
  region     avg_sales
0  North  55000.000000
1  South  60333.333333
Pandas type: <class 'pandas.core.frame.DataFrame'>
The conversion workflow is straightforward: load your data with Polars for speed, perform transformations using lazy evaluation and expressions, and collect the results. If you need to pass the data to a Pandas-dependent library or visualization tool, convert it at that point. This hybrid approach lets you get the best of both worlds — Polars performance for data wrangling and whatever specialized tools your workflow requires.
Pandas and Polars — best friends when you use .to_pandas() wisely.
Real-Life Example: Sales Data Analyzer
Let’s build a practical example that demonstrates multiple Polars features in a realistic scenario. This analyzer reads transaction data, performs complex aggregations, identifies trends, and generates insights.
Real-world data pipelines combine multiple techniques — filtering, grouping, joining, and creating new computed columns. This sales analyzer demonstrates how to structure a Polars pipeline for a typical business use case. Notice how the entire sequence of operations reads like a narrative: “Start with sales data, lazy-load it, filter by region and date, group by product and salesperson, compute metrics, and collect results.” Each step is a Polars expression or method call that chains naturally. Because we are using lazy evaluation, Polars will optimize this entire pipeline before executing a single row of data.
This example shows a realistic data processing pipeline where you start with raw CSV data, progressively filter and transform it, and end up with summarized metrics. In a production setting, you would likely save these results to a database or export them for reporting. The beauty of the Polars approach is that it scales — whether you have 1 million rows or 1 billion rows, the code structure remains the same, and Polars optimizer and parallelization kick in automatically. With Pandas, you would need to be more careful about memory usage and might have to restructure the code for larger datasets. The power of lazy evaluation combined with expressions means you can write concise, readable queries that execute at lightning speed.
Pipeline complete — clean data in, insights out, milliseconds flat.
Frequently Asked Questions
As you begin integrating Polars into your data science workflow, several questions naturally arise. This section addresses the most common concerns and misconceptions about Polars, its relationship to Pandas, and how to best leverage it in production environments. We will cover adoption strategies, performance expectations, and practical guidance for transitioning existing codebases.
1. Is Polars a complete replacement for Pandas?
Polars is a powerful alternative but not 100% compatible with every Pandas operation. Polars is excellent for data manipulation, aggregation, and analysis, which cover 80-90% of typical data tasks. Some areas where Pandas still excels include time series operations (Polars’ temporal support is improving), certain statistical functions, and specific visualization integrations. For most projects, you can migrate to Polars entirely, but it’s good to know both libraries. The beauty is that you do not need to choose one or the other — you can use both strategically within the same project. Use Polars where you need performance and a clean API, and fall back to Pandas where you need specific functionality or library support.
2. How much faster is Polars really?
Performance gains depend heavily on dataset size and operation type. For small datasets (< 100K rows), differences may be negligible. For medium datasets (1-100M rows), Polars is typically 2-10x faster. For large datasets (> 100M rows), the difference can be 10-100x or more, especially with lazy evaluation and multi-column operations. Benchmarks consistently show Polars outperforming Pandas on standard operations like groupby, filtering, and joins. The speedups come not just from being written in Rust, but from algorithmic optimizations made possible by lazy evaluation. When Polars can see your entire operation graph before execution, it can make decisions that Pandas never can. For example, it can decide to read only the columns you need from a CSV file, skip rows that will be filtered out, and parallelize across cores without any explicit parallel programming on your part.
3. Can I use Polars with Pandas code I already have?
Absolutely. You can convert between Polars and Pandas using pl.from_pandas() and .to_pandas(). A practical approach is to use Polars for heavy data processing where speed matters, then convert to Pandas if you need specific functionality or library integrations. Many projects use both libraries strategically. For instance, you might use Pandas for data exploration in notebooks and Polars for production pipelines, or vice versa. The key is that the conversion overhead is minimal because both libraries understand columnar layouts, so moving data between them is a fast operation rather than a bottleneck.
4. What about memory usage? Is Polars more memory-efficient?
Yes, Polars uses less memory than Pandas in most scenarios. The columnar storage model is more efficient, and Polars does not create unnecessary intermediate copies during operations. For a 1GB dataset, Polars might use 300-500MB while Pandas uses 2-3GB. This becomes critical when working with datasets approaching available RAM. The memory efficiency comes from multiple sources: (1) columnar storage means data is stored densely without padding; (2) lazy evaluation avoids creating intermediate DataFrames for chained operations; and (3) Polars uses more efficient data type representations (e.g., native nulls instead of NaN, smaller integer types by default). On systems with limited RAM, using Polars instead of Pandas can literally mean the difference between a workload running and running out of memory.
5. How do I debug Polars lazy evaluation if something goes wrong?
Use the .explain() method to visualize the execution plan, or use .show_graph() for a visual representation. If an error occurs, wrap your lazy chain with .collect() earlier to see where the issue is. You can also use eager evaluation (remove .lazy()) temporarily for debugging, then switch back to lazy mode once fixed. Lazy evaluation can seem mysterious at first because nothing executes until you call .collect(). If your code fails, the error message might not point to where you expected. The .explain() output helps demystify this — it shows you the exact execution plan Polars will use, allowing you to see if columns are being selected correctly, if filters are in the right position, and if joins are happening on the correct keys. This visibility is invaluable for diagnosing performance issues or unexpected results.
6. Does Polars support distributed computing like Spark?
Polars is designed for single-machine multi-core processing and is not a distributed computing framework like Spark. However, Polars is so fast that many workloads that would require Spark with Pandas can run efficiently on a single machine with Polars. For true distributed computing, you would still use Spark, but consider whether Polars might solve your problem first. The computing power of modern machines has grown tremendously — a single laptop can process gigabytes of data in seconds with Polars, which would have required a cluster a few years ago. This is why many data teams find they do not need Spark when they switch to Polars.
7. What about null/missing values in Polars?
Polars uses a proper Null type (similar to SQL) instead of NaN, making it more type-safe. By default, Polars allows nulls in any column. You can use .fill_null(), .drop_nulls(), or conditional logic with pl.when().then().otherwise() to handle missing data. The syntax is often more explicit and safer than Pandas’ approach. One of Polars’ design wins is that every data type can have a true null value, just like in databases. Pandas conflates missing values (NaN for floats, None for objects) which can lead to subtle bugs. Polars forces you to think clearly about whether a value is truly missing (null) or a valid data point. This explicitness prevents entire classes of bugs and makes your data pipelines more reliable.
Conclusion
Polars represents a significant evolution in Python data processing. Its combination of speed, memory efficiency, and expressive syntax makes it an excellent choice for modern data work. Whether you’re analyzing millions of rows of transaction data, processing sensor readings, or building data pipelines, Polars delivers measurable performance improvements over Pandas. The library has matured significantly in recent years and now supports the vast majority of data manipulation tasks that Pandas users encounter daily.
The key advantages are clear: lazy evaluation optimizes complex queries, the expression-based API is intuitive and composable, and the Rust implementation eliminates Python’s performance bottlenecks. For intermediate and advanced Python developers familiar with Pandas, the learning curve is minimal, and the payoff is substantial. You are not learning a completely new paradigm — you are adopting a better implementation of the same concepts you already know.
What we have covered in this guide provides you with a solid foundation for using Polars effectively. We started with basic DataFrame creation and manipulation, progressed through filtering and expressions, explored groupby aggregations, and discovered the power of lazy evaluation. We examined real-world examples and discussed practical integration strategies with existing Pandas code. These techniques form the core of most data analysis workflows — master these, and you will be equipped to handle complex data problems efficiently.
Start by trying Polars on your most performance-critical data operations. Use lazy evaluation for complex multi-step transformations, and leverage groupby and expressions for aggregations. Convert to and from Pandas as needed for compatibility with existing tools. Over time, you will likely find Polars becoming your default choice for data analysis, with Pandas reserved for specific edge cases. The performance benefits are not merely academic — they directly translate to faster iteration during exploration, shorter pipeline runtimes in production, and the ability to handle larger datasets on the same hardware.
The future of Python data processing is here, and it is fast. Give Polars a try in your next project and experience the difference firsthand. You will not regret the investment in learning this powerful library.
Best Practices and Tips for Polars Success
As you integrate Polars into your workflows, keep a few best practices in mind. First, always prefer lazy evaluation for production code — the performance benefits are substantial and there is rarely a downside to deferring execution until you call .collect(). Second, be explicit with your schemas whenever possible, especially for CSV and JSON files. Polars can infer types, but explicit schemas prevent surprises and make your code more maintainable. Third, use .explain() when you are curious about how Polars plans to execute your query — this is educational and helps you understand what optimizations are happening behind the scenes.
Fourth, take advantage of Polars’ rich expression system rather than falling back to Python loops or `.apply()` methods. Expressions are faster, more readable, and often shorter. Fifth, remember that Polars processes data in memory — it loads data efficiently, but datasets that do not fit in RAM still require strategies like filtering early, scanning lazily, or processing in chunks. Finally, stay up to date with Polars releases. The library is actively developed and new features, optimizations, and bug fixes arrive regularly. The community is welcoming and the documentation continues to improve. Polars is used in production by data teams at major companies and has proven itself as a reliable, performant alternative to Pandas. It is not an experimental project — it is battle-tested and production-ready.