Intermediate
You have a 2 GB CSV file and you need to answer one question: which product category generated the most revenue last quarter? The naive approach — load the whole file into a pandas DataFrame, then filter and group — works, but it means reading 2 GB into RAM for a query that touches maybe 10% of the data. There is a better tool for this job. DuckDB is an in-process analytical database that runs SQL queries directly on CSV files, Parquet files, and pandas or Polars DataFrames, using only the memory it actually needs for the query.
DuckDB installs in one command (pip install duckdb), requires no server, no configuration, and no separate process. It embeds directly into your Python program, the same way SQLite does — except DuckDB is built for analytics, not transactions. It uses columnar storage, vectorized query execution, and automatic parallelism across CPU cores. For data analysis workloads, it is typically 5x to 50x faster than SQLite and often faster than pandas for the same aggregations.
This article covers everything you need to get productive with DuckDB in Python: running SQL on files without loading them, querying pandas and Polars DataFrames directly, persisting data to a DuckDB database file, using window functions and CTEs, and understanding when DuckDB beats pandas. By the end you will be able to replace your heaviest data analysis scripts with leaner, faster DuckDB queries.
DuckDB in Python: Quick Example
The simplest DuckDB pattern is querying a CSV file directly — no loading, no DataFrame, just SQL on the file path:
# quick_duckdb.py
import duckdb
# Query a CSV file directly -- DuckDB reads only what the query needs
result = duckdb.sql("""
SELECT
category,
SUM(amount) AS total_revenue,
COUNT(*) AS num_orders
FROM 'sales.csv'
WHERE order_date >= '2024-01-01'
GROUP BY category
ORDER BY total_revenue DESC
LIMIT 5
""").fetchdf()
print(result)
Output:
category total_revenue num_orders
0 Electronics 48200.0 312
1 Clothing 31500.0 489
2 Furniture 27800.0 198
3 Sports 19400.0 267
4 Books 8900.0 401
The file path goes directly into the SQL string inside single quotes. DuckDB reads only the columns and rows it needs, streams them through the query engine, and returns a pandas DataFrame via .fetchdf(). No pd.read_csv(), no memory spike, no waiting for the full file to load. The duckdb.sql() function creates an in-memory connection automatically — if you need to query the same file multiple times in a session, create a persistent connection instead (covered below).
The sections below show how to query DataFrames, persist data, use advanced SQL features, and build real analysis pipelines.
What Is DuckDB and When Should You Use It?
DuckDB is an OLAP (Online Analytical Processing) database. OLAP databases are designed for aggregation queries that scan large amounts of data — think “total sales by region last year” rather than “fetch order #12345.” Traditional row-oriented databases like PostgreSQL or SQLite store each record as a row; OLAP databases like DuckDB store each column together. When a query only touches two of twenty columns, DuckDB reads only those two columns from disk — skipping the other 90% entirely.
The two core design choices that make DuckDB fast are vectorized execution and automatic parallelism. Vectorized execution processes thousands of values at once using SIMD CPU instructions, the same technique that makes NumPy fast. Automatic parallelism splits the work across all available CPU cores without any configuration on your part. Both happen transparently — you just write SQL.
| Tool | Best for | Scales to | Setup |
|---|---|---|---|
| DuckDB | Analytical queries, file scanning, aggregations | Files larger than RAM (streaming) | pip install, no server |
| pandas | Row manipulation, reshaping, machine learning prep | Data that fits in RAM | pip install |
| SQLite | Transactional workloads, small CRUD apps | Millions of rows (row-oriented) | Built into Python |
| PostgreSQL | Multi-user OLTP, production apps | Billions of rows with indexes | Server required |
| Polars | DataFrame transforms, Rust-speed groupby | Data that fits in RAM or lazy | pip install |
Use DuckDB when you need to run SQL aggregations on files that are too large to comfortably load into memory, when you want SQL syntax for complex analytical queries instead of pandas chaining, or when you are working with Parquet files and want efficient columnar reads. Stick with pandas when you need row-level mutations, index-based lookups, or tight integration with scikit-learn.
Querying pandas and Polars DataFrames
DuckDB can query an in-memory pandas or Polars DataFrame directly by name — no loading, no copying, no conversion step needed. You reference the DataFrame variable name inside your SQL string:
# query_dataframe.py
import duckdb
import pandas as pd
orders = pd.DataFrame({
"order_id": [1, 2, 3, 4, 5, 6],
"customer": ["Alice", "Bob", "Alice", "Carol", "Bob", "Alice"],
"product": ["Laptop", "Phone", "Charger", "Tablet", "Laptop", "Phone"],
"amount": [1200, 800, 45, 650, 1150, 820],
"month": ["Jan", "Jan", "Jan", "Feb", "Feb", "Feb"],
})
# Reference the DataFrame by variable name -- no import step needed
result = duckdb.sql("""
SELECT
customer,
SUM(amount) AS total_spent,
COUNT(*) AS orders,
MAX(amount) AS biggest_order
FROM orders
GROUP BY customer
ORDER BY total_spent DESC
""").fetchdf()
print(result)
Output:
customer total_spent orders biggest_order
0 Alice 2065 3 1200
1 Bob 1950 2 1150
2 Carol 650 1 650
DuckDB scans the DataFrame’s Arrow-compatible memory representation directly — there is no serialization step. This means that querying a 1 GB pandas DataFrame with DuckDB typically uses no extra memory for the query itself. If you prefer Polars, the syntax is identical — just pass the Polars DataFrame variable name in the SQL string. DuckDB recognizes both via their shared Apache Arrow backing.
Persistent Connections and Database Files
The duckdb.sql() shortcut creates a new in-memory connection for each call. When you need to run multiple queries in a session, or persist data to disk, create an explicit connection instead:
# persistent_connection.py
import duckdb
# In-memory connection (data gone when Python exits)
con = duckdb.connect()
# Persistent file-based connection (data survives restarts)
# con = duckdb.connect("analytics.duckdb")
# Create a table and insert data
con.execute("""
CREATE TABLE products (
id INTEGER PRIMARY KEY,
name VARCHAR,
category VARCHAR,
price DOUBLE
)
""")
con.executemany(
"INSERT INTO products VALUES (?, ?, ?, ?)",
[
(1, "Laptop Pro", "Electronics", 1299.0),
(2, "Wireless Mouse","Electronics", 49.0),
(3, "Standing Desk", "Furniture", 599.0),
(4, "Notebook Set", "Stationery", 24.0),
(5, "Monitor 27in", "Electronics", 449.0),
]
)
# Query with a parameterized filter
category = "Electronics"
rows = con.execute(
"SELECT name, price FROM products WHERE category = ? ORDER BY price DESC",
[category]
).fetchall()
for name, price in rows:
print(f" {name:<20} ${price:>8.2f}")
con.close()
Output:
Laptop Pro $ 1299.00
Monitor 27in $ 449.00
Wireless Mouse $ 49.00
Use duckdb.connect("filename.duckdb") to write a persistent database file. The file format is compact and portable — you can share it, version-control small ones, or open it in the DuckDB CLI for exploration. Always call con.close() when done, or use DuckDB connections as context managers: with duckdb.connect("analytics.duckdb") as con:.
Querying CSV, JSON, and Parquet Files
DuckDB has built-in readers for CSV, JSON, and Parquet that support glob patterns, remote URLs, and automatic schema inference. You can query multiple files at once with a wildcard:
# query_files.py
import duckdb
import os
# Create sample CSV files to demonstrate multi-file querying
os.makedirs("data", exist_ok=True)
with open("data/sales_jan.csv", "w") as f:
f.write("date,product,region,amount\n")
f.write("2024-01-05,Laptor,North,1200\n")
f.write("2024-01-12,Phone,South,800\n")
f.write("2024-01-20,Tablet,East,650\n")
with open("data/sales_feb.csv", "w") as f:
f.write("date,product,region,amount\n")
f.write("2024-02-03,Laptor,South,1150\n")
f.write("2024-02-14,Charger,North,45\n")
f.write("2024-02-22,Phone,West,820\n")
# Query all monthly files at once using a glob pattern
result = duckdb.sql("""
SELECT
region,
SUM(amount) AS total,
COUNT(*) AS transactions
FROM read_csv_auto('data/sales_*.csv')
GROUP BY region
ORDER BY total DESC
""").fetchdf()
print(result)
# Write the result directly to Parquet (efficient for later queries)
duckdb.sql("""
COPY (
SELECT * FROM read_csv_auto('data/sales_*.csv')
WHERE CAST(amount AS INTEGER) > 100
)
TO 'data/sales_filtered.parquet' (FORMAT PARQUET)
""")
print("\nParquet written:", os.path.getsize("data/sales_filtered.parquet"), "bytes")
Output:
region total transactions
0 North 1245 2
1 South 1950 2
2 East 650 1
3 West 820 1
Parquet written: 3891 bytes
The read_csv_auto() function infers column types automatically. For Parquet files, use read_parquet() or just reference the file path directly in your SQL. The COPY ... TO statement writes query results directly to a file — handy for exporting processed data without building a pandas pipeline.
Window Functions and CTEs
DuckDB supports the full SQL standard including window functions, CTEs (Common Table Expressions), and QUALIFY — features that are awkward or verbose in pandas but natural in SQL:
# window_functions.py
import duckdb
import pandas as pd
sales = pd.DataFrame({
"month": ["Jan","Jan","Jan","Feb","Feb","Feb","Mar","Mar","Mar"],
"product": ["A","B","C","A","B","C","A","B","C"],
"revenue": [1200, 800, 450, 1350, 750, 520, 1100, 900, 480],
})
# Window functions: running total + rank within month
result = duckdb.sql("""
WITH ranked AS (
SELECT
month,
product,
revenue,
SUM(revenue) OVER (PARTITION BY month ORDER BY revenue DESC) AS running_total,
RANK() OVER (PARTITION BY month ORDER BY revenue DESC) AS rank_in_month,
revenue - LAG(revenue) OVER (PARTITION BY product ORDER BY month) AS mom_change
FROM sales
)
SELECT *
FROM ranked
ORDER BY month, rank_in_month
""").fetchdf()
print(result.to_string(index=False))
Output:
month product revenue running_total rank_in_month mom_change
Jan A 1200 1200 1 NaN
Jan B 800 2000 2 NaN
Jan C 450 2450 3 NaN
Feb A 1350 1350 1 150.0
Feb C 520 1870 2 70.0
Feb B 750 2620 3 -50.0
Mar B 900 900 1 150.0
Mar A 1100 2000 2 -250.0
Mar C 480 2480 3 -40.0
The OVER (PARTITION BY ... ORDER BY ...) syntax computes window aggregations — running totals, ranks, and lag values — in a single SQL pass. Doing this in pandas requires multiple groupby().transform() calls and careful merging. The CTE (WITH ranked AS (...)) names an intermediate result and makes the query readable. DuckDB supports arbitrarily deep CTE chains, recursive CTEs, and lateral joins — essentially the full SQL feature set.
Real-Life Example: Sales Analysis Pipeline
The following pipeline reads monthly sales files, runs a multi-step analysis with window functions, and exports the results — all without loading the full dataset into a DataFrame:
# sales_pipeline.py
import duckdb
import os
# Setup: write sample data files
os.makedirs("pipeline_data", exist_ok=True)
for month_num, (month, rows) in enumerate([
("2024-01", [("Laptop","Electronics",1200),("Phone","Electronics",800),
("Desk","Furniture",599),("Chair","Furniture",299),("Pen","Stationery",12)]),
("2024-02", [("Laptop","Electronics",1350),("Phone","Electronics",750),
("Desk","Furniture",620),("Chair","Furniture",310),("Pen","Stationery",15)]),
("2024-03", [("Laptop","Electronics",1100),("Phone","Electronics",900),
("Desk","Furniture",580),("Chair","Furniture",290),("Pen","Stationery",11)]),
], start=1):
with open(f"pipeline_data/sales_{month}.csv", "w") as f:
f.write("month,product,category,revenue\n")
for product, category, revenue in rows:
f.write(f"{month},{product},{category},{revenue}\n")
con = duckdb.connect()
# Step 1: Load all files into a DuckDB view (no copy -- reads lazily)
con.execute("""
CREATE VIEW raw_sales AS
SELECT * FROM read_csv_auto('pipeline_data/sales_*.csv')
""")
# Step 2: Category-level analysis with month-over-month growth
summary = con.execute("""
WITH monthly_cat AS (
SELECT
month,
category,
SUM(revenue) AS cat_revenue
FROM raw_sales
GROUP BY month, category
),
with_growth AS (
SELECT
month,
category,
cat_revenue,
ROUND(
(cat_revenue - LAG(cat_revenue) OVER (
PARTITION BY category ORDER BY month
)) * 100.0 / NULLIF(LAG(cat_revenue) OVER (
PARTITION BY category ORDER BY month
), 0),
1
) AS growth_pct
FROM monthly_cat
)
SELECT * FROM with_growth ORDER BY month, cat_revenue DESC
""").fetchdf()
print("-- Category Revenue with MoM Growth --")
print(summary.to_string(index=False))
# Step 3: Export top performer each month to Parquet
con.execute("""
COPY (
SELECT month, product, revenue
FROM raw_sales
QUALIFY RANK() OVER (PARTITION BY month ORDER BY revenue DESC) = 1
)
TO 'pipeline_data/top_products.parquet' (FORMAT PARQUET)
""")
top = con.execute("SELECT * FROM 'pipeline_data/top_products.parquet'").fetchdf()
print("\n-- Top Product Per Month --")
print(top.to_string(index=False))
con.close()
Output:
-- Category Revenue with MoM Growth --
month category cat_revenue growth_pct
2024-01 Electronics 2000.0 None
2024-01 Furniture 898.0 None
2024-01 Stationery 12.0 None
2024-02 Electronics 2100.0 5.0
2024-02 Furniture 930.0 3.6
2024-02 Stationery 15.0 25.0
2024-03 Electronics 2000.0 -4.8
2024-03 Furniture 870.0 -6.5
2024-03 Stationery 11.0 -26.7
-- Top Product Per Month --
month product revenue
2024-01 Laptop 1200
2024-02 Laptop 1350
2024-03 Laptop 1100
The QUALIFY clause filters window function results — it is the SQL equivalent of filtering a column produced by OVER(), which standard SQL cannot do in a WHERE clause. The view created in step 1 is lazy — DuckDB reads the CSV files only when the view is queried, not when CREATE VIEW runs. This pattern scales to hundreds of files without changing the query at all.
Frequently Asked Questions
How is DuckDB different from SQLite?
SQLite is a row-oriented database optimized for transactional workloads — fast inserts, single-row lookups, and small databases. DuckDB is column-oriented and optimized for analytical queries — fast aggregations, full-table scans, and large files. If you need to store application state, use SQLite. If you need to analyze data, use DuckDB. The two tools are not really competitors; most data pipelines that need both use SQLite for operational data and DuckDB for analysis.
Can DuckDB handle files larger than RAM?
Yes. DuckDB’s streaming execution model reads data in chunks and spills intermediate results to disk when necessary. A query that aggregates a 50 GB CSV file on a machine with 16 GB of RAM will work correctly — it will just take longer as data is streamed through the engine. Enable disk spilling explicitly with con.execute("SET temp_directory='/tmp/duckdb_tmp'") to control where temporary files land.
How do I join a DuckDB query result with a pandas DataFrame?
Reference both in the same SQL query. If you have a pandas DataFrame named dim_products and a CSV file of transactions, you can write FROM transactions JOIN dim_products USING (product_id) — DuckDB resolves variable names from the local scope automatically. The result is a pandas DataFrame that combines both sources without any explicit merge step.
Is DuckDB safe to use from multiple threads?
A single DuckDB connection is not thread-safe. Create one connection per thread, or serialize access with a threading lock. The file-based database format supports multiple readers simultaneously (via separate connections), but only one writer at a time. For high-concurrency workloads, DuckDB is not the right tool — use PostgreSQL instead.
Can DuckDB query files on S3 or cloud storage?
Yes, using the httpfs extension. Install it once with con.install_extension("httpfs"); con.load_extension("httpfs"), then set your AWS credentials and query S3 paths directly: SELECT * FROM read_parquet('s3://my-bucket/data/*.parquet'). DuckDB streams only the row groups it needs from S3, making it efficient for large Parquet datasets stored in cloud object storage.
Conclusion
DuckDB gives Python data pipelines a superpower: running full analytical SQL — window functions, CTEs, aggregations — directly on CSV, Parquet, and JSON files, on pandas DataFrames, and on persistent database files, without a server and without loading data into memory unnecessarily. You have seen how to query files with glob patterns, join DataFrames in SQL, use window functions for running totals and rankings, and build a multi-step analysis pipeline that exports results to Parquet.
The best way to extend the sales pipeline example is to add a Parquet layer: write your cleaned and transformed data as Parquet files after the first query pass, then query those Parquets for all downstream analysis. Parquet’s columnar compression means those files will be 5x to 10x smaller than the original CSVs and query 10x faster. Combine that with DuckDB’s lazy reading and you have a serverless data warehouse that runs entirely in Python.
For the full DuckDB API reference including extensions, remote file access, and the Python relational API, see the official DuckDB Python documentation.