Intermediate

You have a 2 GB CSV file and you need to answer one question: which product category generated the most revenue last quarter? The naive approach — load the whole file into a pandas DataFrame, then filter and group — works, but it means reading 2 GB into RAM for a query that touches maybe 10% of the data. There is a better tool for this job. DuckDB is an in-process analytical database that runs SQL queries directly on CSV files, Parquet files, and pandas or Polars DataFrames, using only the memory it actually needs for the query.

DuckDB installs in one command (pip install duckdb), requires no server, no configuration, and no separate process. It embeds directly into your Python program, the same way SQLite does — except DuckDB is built for analytics, not transactions. It uses columnar storage, vectorized query execution, and automatic parallelism across CPU cores. For data analysis workloads, it is typically 5x to 50x faster than SQLite and often faster than pandas for the same aggregations.

This article covers everything you need to get productive with DuckDB in Python: running SQL on files without loading them, querying pandas and Polars DataFrames directly, persisting data to a DuckDB database file, using window functions and CTEs, and understanding when DuckDB beats pandas. By the end you will be able to replace your heaviest data analysis scripts with leaner, faster DuckDB queries.

DuckDB in Python: Quick Example

The simplest DuckDB pattern is querying a CSV file directly — no loading, no DataFrame, just SQL on the file path:

# quick_duckdb.py
import duckdb

# Query a CSV file directly -- DuckDB reads only what the query needs
result = duckdb.sql("""
    SELECT
        category,
        SUM(amount) AS total_revenue,
        COUNT(*) AS num_orders
    FROM 'sales.csv'
    WHERE order_date >= '2024-01-01'
    GROUP BY category
    ORDER BY total_revenue DESC
    LIMIT 5
""").fetchdf()

print(result)

Output:

     category  total_revenue  num_orders
0  Electronics       48200.0         312
1     Clothing       31500.0         489
2    Furniture       27800.0         198
3       Sports       19400.0         267
4        Books        8900.0         401

The file path goes directly into the SQL string inside single quotes. DuckDB reads only the columns and rows it needs, streams them through the query engine, and returns a pandas DataFrame via .fetchdf(). No pd.read_csv(), no memory spike, no waiting for the full file to load. The duckdb.sql() function creates an in-memory connection automatically — if you need to query the same file multiple times in a session, create a persistent connection instead (covered below).

The sections below show how to query DataFrames, persist data, use advanced SQL features, and build real analysis pipelines.

Cache Katie sprinting beside a SQL query beam slicing through data cylinders
pd.read_csv() is still buffering. DuckDB already returned.

What Is DuckDB and When Should You Use It?

DuckDB is an OLAP (Online Analytical Processing) database. OLAP databases are designed for aggregation queries that scan large amounts of data — think “total sales by region last year” rather than “fetch order #12345.” Traditional row-oriented databases like PostgreSQL or SQLite store each record as a row; OLAP databases like DuckDB store each column together. When a query only touches two of twenty columns, DuckDB reads only those two columns from disk — skipping the other 90% entirely.

The two core design choices that make DuckDB fast are vectorized execution and automatic parallelism. Vectorized execution processes thousands of values at once using SIMD CPU instructions, the same technique that makes NumPy fast. Automatic parallelism splits the work across all available CPU cores without any configuration on your part. Both happen transparently — you just write SQL.

ToolBest forScales toSetup
DuckDBAnalytical queries, file scanning, aggregationsFiles larger than RAM (streaming)pip install, no server
pandasRow manipulation, reshaping, machine learning prepData that fits in RAMpip install
SQLiteTransactional workloads, small CRUD appsMillions of rows (row-oriented)Built into Python
PostgreSQLMulti-user OLTP, production appsBillions of rows with indexesServer required
PolarsDataFrame transforms, Rust-speed groupbyData that fits in RAM or lazypip install

Use DuckDB when you need to run SQL aggregations on files that are too large to comfortably load into memory, when you want SQL syntax for complex analytical queries instead of pandas chaining, or when you are working with Parquet files and want efficient columnar reads. Stick with pandas when you need row-level mutations, index-based lookups, or tight integration with scikit-learn.

Querying pandas and Polars DataFrames

DuckDB can query an in-memory pandas or Polars DataFrame directly by name — no loading, no copying, no conversion step needed. You reference the DataFrame variable name inside your SQL string:

# query_dataframe.py
import duckdb
import pandas as pd

orders = pd.DataFrame({
    "order_id":  [1, 2, 3, 4, 5, 6],
    "customer":  ["Alice", "Bob", "Alice", "Carol", "Bob", "Alice"],
    "product":   ["Laptop", "Phone", "Charger", "Tablet", "Laptop", "Phone"],
    "amount":    [1200, 800, 45, 650, 1150, 820],
    "month":     ["Jan", "Jan", "Jan", "Feb", "Feb", "Feb"],
})

# Reference the DataFrame by variable name -- no import step needed
result = duckdb.sql("""
    SELECT
        customer,
        SUM(amount)          AS total_spent,
        COUNT(*)             AS orders,
        MAX(amount)          AS biggest_order
    FROM orders
    GROUP BY customer
    ORDER BY total_spent DESC
""").fetchdf()

print(result)

Output:

  customer  total_spent  orders  biggest_order
0    Alice         2065       3           1200
1      Bob         1950       2           1150
2    Carol          650       1            650

DuckDB scans the DataFrame’s Arrow-compatible memory representation directly — there is no serialization step. This means that querying a 1 GB pandas DataFrame with DuckDB typically uses no extra memory for the query itself. If you prefer Polars, the syntax is identical — just pass the Polars DataFrame variable name in the SQL string. DuckDB recognizes both via their shared Apache Arrow backing.

Debug Dee using magnifying glass over glowing data columns
GROUP BY without .groupby(). Your pandas muscle memory will adjust.

Persistent Connections and Database Files

The duckdb.sql() shortcut creates a new in-memory connection for each call. When you need to run multiple queries in a session, or persist data to disk, create an explicit connection instead:

# persistent_connection.py
import duckdb

# In-memory connection (data gone when Python exits)
con = duckdb.connect()

# Persistent file-based connection (data survives restarts)
# con = duckdb.connect("analytics.duckdb")

# Create a table and insert data
con.execute("""
    CREATE TABLE products (
        id       INTEGER PRIMARY KEY,
        name     VARCHAR,
        category VARCHAR,
        price    DOUBLE
    )
""")

con.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [
        (1, "Laptop Pro",    "Electronics", 1299.0),
        (2, "Wireless Mouse","Electronics",   49.0),
        (3, "Standing Desk", "Furniture",    599.0),
        (4, "Notebook Set",  "Stationery",   24.0),
        (5, "Monitor 27in",  "Electronics",  449.0),
    ]
)

# Query with a parameterized filter
category = "Electronics"
rows = con.execute(
    "SELECT name, price FROM products WHERE category = ? ORDER BY price DESC",
    [category]
).fetchall()

for name, price in rows:
    print(f"  {name:<20} ${price:>8.2f}")

con.close()

Output:

  Laptop Pro           $ 1299.00
  Monitor 27in         $  449.00
  Wireless Mouse       $   49.00

Use duckdb.connect("filename.duckdb") to write a persistent database file. The file format is compact and portable — you can share it, version-control small ones, or open it in the DuckDB CLI for exploration. Always call con.close() when done, or use DuckDB connections as context managers: with duckdb.connect("analytics.duckdb") as con:.

Querying CSV, JSON, and Parquet Files

DuckDB has built-in readers for CSV, JSON, and Parquet that support glob patterns, remote URLs, and automatic schema inference. You can query multiple files at once with a wildcard:

# query_files.py
import duckdb
import os

# Create sample CSV files to demonstrate multi-file querying
os.makedirs("data", exist_ok=True)

with open("data/sales_jan.csv", "w") as f:
    f.write("date,product,region,amount\n")
    f.write("2024-01-05,Laptor,North,1200\n")
    f.write("2024-01-12,Phone,South,800\n")
    f.write("2024-01-20,Tablet,East,650\n")

with open("data/sales_feb.csv", "w") as f:
    f.write("date,product,region,amount\n")
    f.write("2024-02-03,Laptor,South,1150\n")
    f.write("2024-02-14,Charger,North,45\n")
    f.write("2024-02-22,Phone,West,820\n")

# Query all monthly files at once using a glob pattern
result = duckdb.sql("""
    SELECT
        region,
        SUM(amount)  AS total,
        COUNT(*)     AS transactions
    FROM read_csv_auto('data/sales_*.csv')
    GROUP BY region
    ORDER BY total DESC
""").fetchdf()

print(result)

# Write the result directly to Parquet (efficient for later queries)
duckdb.sql("""
    COPY (
        SELECT * FROM read_csv_auto('data/sales_*.csv')
        WHERE CAST(amount AS INTEGER) > 100
    )
    TO 'data/sales_filtered.parquet' (FORMAT PARQUET)
""")
print("\nParquet written:", os.path.getsize("data/sales_filtered.parquet"), "bytes")

Output:

  region  total  transactions
0  North   1245             2
1  South   1950             2
2   East    650             1
3   West    820             1

Parquet written: 3891 bytes

The read_csv_auto() function infers column types automatically. For Parquet files, use read_parquet() or just reference the file path directly in your SQL. The COPY ... TO statement writes query results directly to a file — handy for exporting processed data without building a pandas pipeline.

API Alice surrounded by CSV JSON Parquet file icons funneling into SQL engine
read_csv_auto() — because schema negotiation is not a morning activity.

Window Functions and CTEs

DuckDB supports the full SQL standard including window functions, CTEs (Common Table Expressions), and QUALIFY — features that are awkward or verbose in pandas but natural in SQL:

# window_functions.py
import duckdb
import pandas as pd

sales = pd.DataFrame({
    "month":    ["Jan","Jan","Jan","Feb","Feb","Feb","Mar","Mar","Mar"],
    "product":  ["A","B","C","A","B","C","A","B","C"],
    "revenue":  [1200, 800, 450, 1350, 750, 520, 1100, 900, 480],
})

# Window functions: running total + rank within month
result = duckdb.sql("""
    WITH ranked AS (
        SELECT
            month,
            product,
            revenue,
            SUM(revenue) OVER (PARTITION BY month ORDER BY revenue DESC) AS running_total,
            RANK()       OVER (PARTITION BY month ORDER BY revenue DESC) AS rank_in_month,
            revenue - LAG(revenue) OVER (PARTITION BY product ORDER BY month) AS mom_change
        FROM sales
    )
    SELECT *
    FROM ranked
    ORDER BY month, rank_in_month
""").fetchdf()

print(result.to_string(index=False))

Output:

 month product  revenue  running_total  rank_in_month  mom_change
   Jan       A     1200           1200              1         NaN
   Jan       B      800           2000              2         NaN
   Jan       C      450           2450              3         NaN
   Feb       A     1350           1350              1       150.0
   Feb       C      520           1870              2        70.0
   Feb       B      750           2620              3       -50.0
   Mar       B      900            900              1       150.0
   Mar       A     1100           2000              2      -250.0
   Mar       C      480           2480              3       -40.0

The OVER (PARTITION BY ... ORDER BY ...) syntax computes window aggregations — running totals, ranks, and lag values — in a single SQL pass. Doing this in pandas requires multiple groupby().transform() calls and careful merging. The CTE (WITH ranked AS (...)) names an intermediate result and makes the query readable. DuckDB supports arbitrarily deep CTE chains, recursive CTEs, and lateral joins — essentially the full SQL feature set.

Real-Life Example: Sales Analysis Pipeline

The following pipeline reads monthly sales files, runs a multi-step analysis with window functions, and exports the results — all without loading the full dataset into a DataFrame:

# sales_pipeline.py
import duckdb
import os

# Setup: write sample data files
os.makedirs("pipeline_data", exist_ok=True)
for month_num, (month, rows) in enumerate([
    ("2024-01", [("Laptop","Electronics",1200),("Phone","Electronics",800),
                 ("Desk","Furniture",599),("Chair","Furniture",299),("Pen","Stationery",12)]),
    ("2024-02", [("Laptop","Electronics",1350),("Phone","Electronics",750),
                 ("Desk","Furniture",620),("Chair","Furniture",310),("Pen","Stationery",15)]),
    ("2024-03", [("Laptop","Electronics",1100),("Phone","Electronics",900),
                 ("Desk","Furniture",580),("Chair","Furniture",290),("Pen","Stationery",11)]),
], start=1):
    with open(f"pipeline_data/sales_{month}.csv", "w") as f:
        f.write("month,product,category,revenue\n")
        for product, category, revenue in rows:
            f.write(f"{month},{product},{category},{revenue}\n")

con = duckdb.connect()

# Step 1: Load all files into a DuckDB view (no copy -- reads lazily)
con.execute("""
    CREATE VIEW raw_sales AS
    SELECT * FROM read_csv_auto('pipeline_data/sales_*.csv')
""")

# Step 2: Category-level analysis with month-over-month growth
summary = con.execute("""
    WITH monthly_cat AS (
        SELECT
            month,
            category,
            SUM(revenue) AS cat_revenue
        FROM raw_sales
        GROUP BY month, category
    ),
    with_growth AS (
        SELECT
            month,
            category,
            cat_revenue,
            ROUND(
                (cat_revenue - LAG(cat_revenue) OVER (
                    PARTITION BY category ORDER BY month
                )) * 100.0 / NULLIF(LAG(cat_revenue) OVER (
                    PARTITION BY category ORDER BY month
                ), 0),
                1
            ) AS growth_pct
        FROM monthly_cat
    )
    SELECT * FROM with_growth ORDER BY month, cat_revenue DESC
""").fetchdf()

print("-- Category Revenue with MoM Growth --")
print(summary.to_string(index=False))

# Step 3: Export top performer each month to Parquet
con.execute("""
    COPY (
        SELECT month, product, revenue
        FROM raw_sales
        QUALIFY RANK() OVER (PARTITION BY month ORDER BY revenue DESC) = 1
    )
    TO 'pipeline_data/top_products.parquet' (FORMAT PARQUET)
""")

top = con.execute("SELECT * FROM 'pipeline_data/top_products.parquet'").fetchdf()
print("\n-- Top Product Per Month --")
print(top.to_string(index=False))

con.close()

Output:

-- Category Revenue with MoM Growth --
      month      category  cat_revenue  growth_pct
 2024-01  Electronics         2000.0        None
 2024-01    Furniture          898.0        None
 2024-01   Stationery           12.0        None
 2024-02  Electronics         2100.0         5.0
 2024-02    Furniture          930.0         3.6
 2024-02   Stationery           15.0        25.0
 2024-03  Electronics         2000.0        -4.8
 2024-03    Furniture          870.0        -6.5
 2024-03   Stationery           11.0       -26.7

-- Top Product Per Month --
      month product  revenue
 2024-01  Laptop     1200
 2024-02  Laptop     1350
 2024-03  Laptop     1100

The QUALIFY clause filters window function results — it is the SQL equivalent of filtering a column produced by OVER(), which standard SQL cannot do in a WHERE clause. The view created in step 1 is lazy — DuckDB reads the CSV files only when the view is queried, not when CREATE VIEW runs. This pattern scales to hundreds of files without changing the query at all.

Sudo Sam at pipeline control panel with data streams flowing into result tables
QUALIFY. The SQL keyword your pandas code deserved.

Frequently Asked Questions

How is DuckDB different from SQLite?

SQLite is a row-oriented database optimized for transactional workloads — fast inserts, single-row lookups, and small databases. DuckDB is column-oriented and optimized for analytical queries — fast aggregations, full-table scans, and large files. If you need to store application state, use SQLite. If you need to analyze data, use DuckDB. The two tools are not really competitors; most data pipelines that need both use SQLite for operational data and DuckDB for analysis.

Can DuckDB handle files larger than RAM?

Yes. DuckDB’s streaming execution model reads data in chunks and spills intermediate results to disk when necessary. A query that aggregates a 50 GB CSV file on a machine with 16 GB of RAM will work correctly — it will just take longer as data is streamed through the engine. Enable disk spilling explicitly with con.execute("SET temp_directory='/tmp/duckdb_tmp'") to control where temporary files land.

How do I join a DuckDB query result with a pandas DataFrame?

Reference both in the same SQL query. If you have a pandas DataFrame named dim_products and a CSV file of transactions, you can write FROM transactions JOIN dim_products USING (product_id) — DuckDB resolves variable names from the local scope automatically. The result is a pandas DataFrame that combines both sources without any explicit merge step.

Is DuckDB safe to use from multiple threads?

A single DuckDB connection is not thread-safe. Create one connection per thread, or serialize access with a threading lock. The file-based database format supports multiple readers simultaneously (via separate connections), but only one writer at a time. For high-concurrency workloads, DuckDB is not the right tool — use PostgreSQL instead.

Can DuckDB query files on S3 or cloud storage?

Yes, using the httpfs extension. Install it once with con.install_extension("httpfs"); con.load_extension("httpfs"), then set your AWS credentials and query S3 paths directly: SELECT * FROM read_parquet('s3://my-bucket/data/*.parquet'). DuckDB streams only the row groups it needs from S3, making it efficient for large Parquet datasets stored in cloud object storage.

Conclusion

DuckDB gives Python data pipelines a superpower: running full analytical SQL — window functions, CTEs, aggregations — directly on CSV, Parquet, and JSON files, on pandas DataFrames, and on persistent database files, without a server and without loading data into memory unnecessarily. You have seen how to query files with glob patterns, join DataFrames in SQL, use window functions for running totals and rankings, and build a multi-step analysis pipeline that exports results to Parquet.

The best way to extend the sales pipeline example is to add a Parquet layer: write your cleaned and transformed data as Parquet files after the first query pass, then query those Parquets for all downstream analysis. Parquet’s columnar compression means those files will be 5x to 10x smaller than the original CSVs and query 10x faster. Combine that with DuckDB’s lazy reading and you have a serverless data warehouse that runs entirely in Python.

For the full DuckDB API reference including extensions, remote file access, and the Python relational API, see the official DuckDB Python documentation.