Intermediate

You load a CSV with a million rows into pandas, run a few groupby operations, and then — you wait. And wait. The spinner in Jupyter keeps spinning while your CPU fans kick in. If you have ever hit this wall, you know exactly why the Python data community got excited about Polars. Written in Rust and built around the Apache Arrow memory model, Polars processes data 5x to 50x faster than pandas for many common operations — without requiring you to rewrite everything in C++.

Polars is a modern DataFrame library that runs on your laptop or a server with no special setup. You install it with pip, import it, and use an API that feels familiar if you know pandas — but with some important design differences that make your code faster by default. There is no index to manage, lazy evaluation is built in, and operations are automatically parallelized across your CPU cores.

In this article we will cover how Polars works and why it is faster than pandas, how to create and load DataFrames, how to filter, group, and aggregate data, how to use lazy execution for maximum performance, and how to work with real CSV and JSON data. By the end you will be able to replace your slowest pandas scripts with Polars and see immediate speed improvements.

Polars in Python: Quick Example

Here is a complete working example that shows the core Polars workflow — create a DataFrame, filter rows, group by a column, and compute aggregates — all in under 20 lines:

# quick_polars.py
import polars as pl

df = pl.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave", "Eve"],
    "department": ["Eng", "Eng", "HR", "HR", "Eng"],
    "salary": [95000, 88000, 72000, 68000, 102000],
})

result = (
    df.filter(pl.col("salary") > 70000)
    .group_by("department")
    .agg(
        pl.col("salary").mean().alias("avg_salary"),
        pl.col("name").count().alias("headcount"),
    )
    .sort("avg_salary", descending=True)
)

print(result)

Output:

shape: (2, 3)
+------------+------------+-----------+
| department | avg_salary | headcount |
| ---        | ---        | ---       |
| str        | f64        | u32       |
+============+============+===========+
| Eng        | 95000.0    | 3         |
| HR         | 72000.0    | 1         |
+------------+------------+-----------+

Notice how pl.col("salary") is used to reference columns — this is the Polars expression API and it is at the heart of everything. Unlike pandas where you often chain method calls on Series objects, Polars uses expressions that describe what to compute, letting the engine optimize how to compute it. The .alias() call renames the output column, and .sort() orders the final result.

We will go deeper into expressions, lazy frames, and real-world data loading in the sections below.

Polars: a pandas-like API, minus the wait.

What Is Polars and Why Is It Faster Than pandas?

Polars is a DataFrame library written in Rust with Python bindings. It stores data in the Apache Arrow columnar format, which means all values of the same column are laid out contiguously in memory. This makes operations like summing a column or filtering rows extremely cache-friendly: the CPU can stream through millions of contiguous values without constantly jumping around in memory.

pandas, by contrast, was designed in 2008 when multi-core CPUs were less common and the NumPy array model was state of the art. Most pandas operations run on a single thread, much of its Python-level code is constrained by the GIL, and its index system adds overhead to many operations. Polars sidesteps all three problems: it is multi-threaded by default, its Rust core runs outside the GIL, and it has no index.

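+-----------------+---------------------------+----------------------+
| Feature         | pandas                    | Polars               |
+=================+===========================+======================+
| Language core   | Python + NumPy            | Rust + Arrow         |
| Multi-threading | No (GIL-limited)          | Yes (automatic)      |
| Lazy evaluation | No                        | Yes (LazyFrame)      |
| Index           | Required                  | None                 |
| Memory usage    | Higher                    | Lower (Arrow format) |
| API style       | Method chaining on Series | Expression-based     |
+-----------------+---------------------------+----------------------+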

The practical result: for operations on DataFrames over 100,000 rows, Polars is typically 5x to 30x faster, and on some workloads the gap approaches the 50x end of the range. For CSV parsing alone, Polars is often 10x faster than pandas because it uses all available CPU cores in parallel. Let us see how to install and use it.

Installing Polars

Polars is available on PyPI and installs with a single command. It has no dependency on pandas or NumPy, so the install is clean and fast:

# terminal
pip install polars

Output:

Successfully installed polars-0.20.x

If you work with large datasets and want the optional extras (like connecting to cloud storage or reading Parquet files faster), you can install the full extras bundle with pip install "polars[all]". The quotes keep shells such as zsh from interpreting the square brackets. For most tutorial purposes, the base install is all you need.

Creating DataFrames in Polars

Polars DataFrames can be created from Python dictionaries, lists, or loaded from files. The dictionary approach is the most common for creating test data:

# create_dataframes.py
import polars as pl

# From a dictionary -- each key is a column name, value is a list
df = pl.DataFrame({
    "product": ["Widget", "Gadget", "Doohickey", "Thingamajig"],
    "price": [9.99, 24.99, 4.49, 14.99],
    "units_sold": [1500, 340, 2100, 780],
    "in_stock": [True, True, False, True],
})

print(df)
print()
print("Shape:", df.shape)
print("Columns:", df.columns)
print("Dtypes:", df.dtypes)

Output:

shape: (4, 4)
+-------------+-------+------------+----------+
| product     | price | units_sold | in_stock |
| ---         | ---   | ---        | ---      |
| str         | f64   | i64        | bool     |
+=============+=======+============+==========+
| Widget      | 9.99  | 1500       | true     |
| Gadget      | 24.99 | 340        | true     |
| Doohickey   | 4.49  | 2100       | false    |
| Thingamajig | 14.99 | 780        | true     |
+-------------+-------+------------+----------+

Shape: (4, 4)
Columns: ['product', 'price', 'units_sold', 'in_stock']
Dtypes: [String, Float64, Int64, Boolean]

Polars automatically infers types from your data. Strings become String, floats become Float64, integers become Int64, and booleans become Boolean. Unlike pandas, there is no ambiguous “object” dtype for strings — Polars always knows exactly what type each column holds, which is one reason it can optimize operations so aggressively.

Reading CSV and JSON Files

Polars reads CSV files dramatically faster than pandas because it splits the file across CPU cores and parses each chunk in parallel. Here is how to read a CSV, check its schema, and do a quick preview:

# read_csv_polars.py
import csv
import os
import tempfile

import polars as pl

# Create a sample CSV first so the code runs standalone

data = [
    ["date", "city", "temp_c", "humidity"],
    ["2024-01-01", "Sydney", "28.5", "62"],
    ["2024-01-01", "Melbourne", "22.1", "75"],
    ["2024-01-02", "Sydney", "30.2", "58"],
    ["2024-01-02", "Melbourne", "19.8", "80"],
    ["2024-01-03", "Sydney", "26.7", "70"],
]

tmpfile = tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False, newline='')
writer = csv.writer(tmpfile)
writer.writerows(data)
tmpfile.close()

# Read with Polars
df = pl.read_csv(tmpfile.name)
print(df)
print()

# Polars can also infer date columns
df2 = pl.read_csv(tmpfile.name, try_parse_dates=True)
print("Schema with date parsing:", df2.dtypes)

os.unlink(tmpfile.name)

Output:

shape: (5, 4)
+------------+-----------+--------+----------+
| date       | city      | temp_c | humidity |
+------------+-----------+--------+----------+
| 2024-01-01 | Sydney    | 28.5   | 62       |
| 2024-01-01 | Melbourne | 22.1   | 75       |
| 2024-01-02 | Sydney    | 30.2   | 58       |
| 2024-01-02 | Melbourne | 19.8   | 80       |
| 2024-01-03 | Sydney    | 26.7   | 70       |
+------------+-----------+--------+----------+

Schema with date parsing: [Date, String, Float64, Int64]

The try_parse_dates=True argument tells Polars to automatically detect date and datetime columns. This saves you the extra step of converting string columns to dates after loading — a common pain point in pandas workflows. For JSON files, use pl.read_json() or pl.read_ndjson() for newline-delimited JSON.

Eager vs lazy: one runs every line, one waits for you to actually need the answer.

Filtering and Selecting Data

In Polars, filtering and column selection use the expression API with pl.col(). This is the biggest conceptual shift from pandas, but once you get the pattern it is very readable:

# filter_select.py
import polars as pl

df = pl.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave", "Eve", "Frank"],
    "age": [28, 34, 22, 45, 31, 29],
    "score": [88.5, 72.3, 95.1, 61.0, 83.7, 90.2],
    "passed": [True, True, True, False, True, True],
})

# Filter rows where passed is True AND score > 80
# (a Boolean column can be used directly; no need for == True)
high_scorers = df.filter(
    pl.col("passed") & (pl.col("score") > 80)
)
print("High scorers:")
print(high_scorers)
print()

# Select specific columns
names_scores = df.select(["name", "score"])
print("Names and scores:")
print(names_scores)
print()

# Add a computed column
with_grade = df.with_columns(
    pl.when(pl.col("score") >= 90).then(pl.lit("A"))
    .when(pl.col("score") >= 80).then(pl.lit("B"))
    .otherwise(pl.lit("C"))
    .alias("grade")
)
print("With grades:")
print(with_grade)

Output:

High scorers:
shape: (4, 4)
| name  | age | score | passed |
| Alice | 28  | 88.5  | true   |
| Carol | 22  | 95.1  | true   |
| Eve   | 31  | 83.7  | true   |
| Frank | 29  | 90.2  | true   |

Names and scores:
| name  | score |
| Alice | 88.5  |
...

With grades:
| name  | age | score | passed | grade |
| Alice | 28  | 88.5  | true   | B     |
| Bob   | 34  | 72.3  | true   | C     |
| Carol | 22  | 95.1  | true   | A     |
...

The pl.when().then().otherwise() pattern is Polars’ equivalent of a vectorized if-else, similar to numpy.where. The key difference from pandas is that all of these operations are described as expressions and can be computed in parallel — Polars may even reorder them internally for optimal performance.

Group By and Aggregation

Group-by aggregations are where Polars really shines. Multiple aggregations on the same group are computed in parallel, and the expression API lets you write complex aggregations in a single readable call:

# groupby_agg.py
import polars as pl

df = pl.DataFrame({
    "region": ["North", "South", "North", "South", "North", "East"],
    "product": ["Widget", "Widget", "Gadget", "Gadget", "Widget", "Gadget"],
    "revenue": [1200, 980, 450, 620, 1350, 800],
    "units": [120, 98, 45, 62, 135, 80],
})

summary = (
    df.group_by(["region", "product"])
    .agg([
        pl.col("revenue").sum().alias("total_revenue"),
        pl.col("units").sum().alias("total_units"),
        (pl.col("revenue").sum() / pl.col("units").sum()).alias("revenue_per_unit"),
    ])
    .sort(["region", "product"])
)

print(summary)

Output:

shape: (5, 5)
+--------+---------+---------------+-------------+------------------+
| region | product | total_revenue | total_units | revenue_per_unit |
+--------+---------+---------------+-------------+------------------+
| East   | Gadget  | 800           | 80          | 10.0             |
| North  | Gadget  | 450           | 45          | 10.0             |
| North  | Widget  | 2550          | 255         | 10.0             |
| South  | Gadget  | 620           | 62          | 10.0             |
| South  | Widget  | 980           | 98          | 10.0             |
+--------+---------+---------------+-------------+------------------+

You can pass multiple aggregations in a single list to .agg(). Polars computes all of them simultaneously using all available CPU cores. On a 4-core machine running a 10-million-row dataset, this single change from pandas to Polars can take a groupby from 8 seconds down to under 1 second.

Lazy Evaluation with LazyFrame

Polars has an eager mode (what we have been using) and a lazy mode. Lazy frames do not execute operations immediately — they build up a query plan that gets optimized and executed only when you call .collect(). This is where the biggest performance gains live for complex pipelines:

# lazy_frame.py
import random

import polars as pl

# Create a larger sample dataset (seeded so the run is repeatable)
random.seed(42)
n = 100_000
df = pl.DataFrame({
    "user_id": list(range(n)),
    "country": random.choices(["AU", "US", "UK", "CA", "DE"], k=n),
    "spend": [round(random.uniform(5, 500), 2) for _ in range(n)],
    "category": random.choices(["Electronics", "Clothing", "Food", "Books"], k=n),
})

# Build a lazy query -- nothing executes yet
lazy_result = (
    df.lazy()
    .filter(pl.col("spend") > 100)
    .group_by(["country", "category"])
    .agg(
        pl.col("spend").sum().alias("total_spend"),
        pl.col("user_id").count().alias("customers"),
    )
    .sort("total_spend", descending=True)
    .limit(5)
)

# Execute the optimized query
result = lazy_result.collect()
print(result)

Output (illustrative values; your exact totals and counts will differ):

shape: (5, 4)
+---------+-------------+-------------+-----------+
| country | category    | total_spend | customers |
+---------+-------------+-------------+-----------+
| US      | Electronics | 1254832.0   | 5012      |
| AU      | Electronics | 1248901.0   | 4988      |
| CA      | Electronics | 1241234.0   | 4956      |
| UK      | Electronics | 1238123.0   | 4942      |
| US      | Clothing    | 1225891.0   | 4893      |
+---------+-------------+-------------+-----------+

Because the whole pipeline is known before anything runs, the optimizer can push the .filter() down toward the data source (predicate pushdown), so fewer rows ever reach the aggregation. It can also drop columns that are not needed in the final output (projection pushdown). When reading from files, pl.scan_csv() creates a lazy frame that only reads the columns you actually use, making I/O much faster for wide files with many columns.

collect() is not optional. Neither is knowing why it exists.

Real-Life Example: Sales Data Analyzer

Let us build a practical sales analysis tool that loads data, cleans it, and produces a summary report — the kind of script you might actually use to analyze a business dataset:

# sales_analyzer.py
import csv
import os
import tempfile

import polars as pl

# Create a realistic sample sales CSV
rows = [
    ["order_id", "date", "customer", "product", "category", "quantity", "unit_price"],
    ["1001", "2024-01-05", "Acme Corp", "Widget Pro", "Hardware", "50", "29.99"],
    ["1002", "2024-01-07", "Beta Ltd", "Cloud Sub", "Software", "10", "99.00"],
    ["1003", "2024-01-10", "Acme Corp", "Widget Pro", "Hardware", "30", "29.99"],
    ["1004", "2024-01-12", "Gamma Inc", "Gadget Plus", "Hardware", "20", "49.99"],
    ["1005", "2024-01-15", "Beta Ltd", "Widget Pro", "Hardware", "15", "29.99"],
    ["1006", "2024-01-20", "Delta Co", "Cloud Sub", "Software", "5", "99.00"],
    ["1007", "2024-01-22", "Acme Corp", "Gadget Plus", "Hardware", "8", "49.99"],
    ["1008", "2024-01-28", "Gamma Inc", "Cloud Sub", "Software", "12", "99.00"],
]

tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False, newline='')
csv.writer(tmp).writerows(rows)
tmp.close()

# Load and analyze with Polars LazyFrame
result = (
    pl.scan_csv(tmp.name, try_parse_dates=True)
    .with_columns(
        (pl.col("quantity") * pl.col("unit_price")).alias("revenue")
    )
    .group_by(["customer", "category"])
    .agg([
        pl.col("revenue").sum().alias("total_revenue"),
        pl.col("quantity").sum().alias("total_units"),
        pl.col("order_id").count().alias("orders"),
    ])
    .sort("total_revenue", descending=True)
    .collect()
)

print("=== Sales Summary by Customer and Category ===")
print(result)
print()

# Top customer overall
top = result.group_by("customer").agg(
    pl.col("total_revenue").sum().alias("grand_total")
).sort("grand_total", descending=True)

print("=== Top Customers ===")
print(top)

os.unlink(tmp.name)

Output (floats rounded to two decimal places):

=== Sales Summary by Customer and Category ===
shape: (6, 5)
+-----------+----------+---------------+-------------+--------+
| customer  | category | total_revenue | total_units | orders |
+-----------+----------+---------------+-------------+--------+
| Acme Corp | Hardware | 2799.12       | 88          | 3      |
| Gamma Inc | Software | 1188.0        | 12          | 1      |
...

=== Top Customers ===
+-----------+-------------+
| customer  | grand_total |
+-----------+-------------+
| Acme Corp | 2799.12     |
| Gamma Inc | 2187.8      |
...

This script uses pl.scan_csv() for lazy reading, adds a computed revenue column before aggregating, and lets the query planner optimize the whole pipeline before .collect() runs it. The second group-by then runs eagerly on the small collected result, and nothing is ever converted to a pandas DataFrame along the way. You can extend this by replacing the temp file with your actual CSV path and adding more aggregation columns to the .agg() call.

Frequently Asked Questions

Can I replace pandas completely with Polars?

For most data processing tasks — loading, filtering, grouping, aggregating, and joining — yes. Polars covers the same ground pandas does. However, some ML libraries like scikit-learn expect pandas DataFrames or NumPy arrays as input, so you may still need to convert with .to_pandas() at the final step. Polars and pandas can coexist in the same project without issues.

How do I join two DataFrames in Polars?

Use df1.join(df2, on="common_column", how="inner"). Polars supports inner, left, outer, cross, semi, and anti joins (recent releases spell the outer join how="full"). The how parameter accepts these strings directly. For joining on multiple columns, pass a list: on=["col1", "col2"]. Joins run in parallel and are typically much faster than pandas merges on large datasets.

How does Polars handle missing values?

Polars uses null for missing values, not NaN. This distinction matters: NaN is a floating-point concept; null is a general absence-of-value concept that works for any data type. Use pl.col("x").is_null() to check for nulls, .drop_nulls() to remove rows with nulls, and .fill_null(value) to replace them. Polars propagates nulls through computations correctly by default.

When should I use LazyFrame vs DataFrame?

Use LazyFrame (via df.lazy() or pl.scan_csv()) whenever you have a multi-step pipeline with filtering, joining, or aggregation. The query optimizer will find the fastest execution plan. Use eager DataFrame when you are doing exploratory work at the REPL, need to inspect intermediate results, or have a single simple operation. The performance difference is most visible on datasets over 500,000 rows with complex pipelines.

How do I convert between Polars and pandas?

Converting is one line in each direction: pandas_df = polars_df.to_pandas() and polars_df = pl.from_pandas(pandas_df). Both conversions go through the Arrow memory format, so they are fast and memory-efficient, but they need pandas and pyarrow installed alongside Polars. If you are integrating Polars into an existing pandas-heavy codebase, start by swapping just the slow parts (typically the CSV loading and groupby steps) and keep everything else in pandas until you are comfortable.

Conclusion

Polars brings genuine speed improvements to Python data work — not through tricks, but through better fundamentals: Rust’s performance, the Arrow memory model, automatic multi-threading, and a lazy query optimizer. We covered how to install Polars, create and load DataFrames, use the expression API for filtering and selection, run fast group-by aggregations, and use LazyFrame for query-plan optimization. The real-life example showed how these pieces combine into a practical data pipeline.

The best next step is to take one slow pandas script you already have and rewrite the data-loading and groupby sections in Polars. Time both versions by calling time.perf_counter() before and after each one, and run each a few times so a single slow run does not skew the comparison. The improvement is usually obvious and motivating. From there, explore joins and window functions; the official Polars documentation has excellent coverage of both.