How To Use Python Dask for Parallel Data Processing

Last Updated: June 07, 2026

Table of Contents

Dask DataFrames: Quick Example
What Is Dask and When Should You Use It?
Dask DataFrames for Large Tabular Data
Dask Arrays for Numerical Computing
Dask Delayed for Custom Pipelines
Dask Bag for Unstructured Data
Choosing the Right Scheduler
Dask Distributed: Monitoring and Scaling
Real-Life Example: Parallel Log Analysis Pipeline
Frequently Asked Questions
Conclusion
Related Articles

Intermediate

Your pandas script runs fine on 500 MB of data. Then the dataset grows to 50 GB and suddenly you’re waiting 40 minutes, swapping to disk, or hitting a MemoryError before the job even starts. The standard advice — “use chunking” or “filter first” — only gets you so far. What you actually need is a way to spread the work across all your CPU cores, or even multiple machines, without rewriting your entire pipeline. That’s exactly what Dask does.

Dask is a parallel computing library that feels like pandas and NumPy on the outside but distributes work under the hood. It breaks your data and computations into a task graph, schedules those tasks across available workers, and assembles the results. It ships with the standard pip/conda toolchain — no cluster required to get started. You can run it on a laptop using all your cores and scale the same code to a cloud cluster later with minimal changes.

This article covers everything you need to use Dask productively. We’ll start with a quick working example, then walk through Dask’s four main interfaces — DataFrames, Arrays, Bags, and Delayed — with runnable code for each. We’ll cover the Dask scheduler, how to monitor progress with the dashboard, and when NOT to use Dask. The article closes with a real-life log-processing pipeline that ties all the pieces together.

Written by Pubs

Python developer and educator with 15+ years building production systems across data engineering, web APIs, and AI tooling. Founder of Python How To Program.

Dask DataFrames: Quick Example

The fastest way to see Dask in action is to swap pandas for dask.dataframe and read a large CSV. Dask reads the file lazily — building a task graph instead of loading everything into memory — and only computes when you call .compute().

# dask_quick.py
import dask.dataframe as dd

# Read a large CSV (or multiple CSVs with glob patterns)
df = dd.read_csv("sales_data_*.csv")

# Operations look identical to pandas
high_value = df[df["amount"] > 1000]
summary = high_value.groupby("region")["amount"].sum()

# Nothing runs yet -- Dask builds a task graph

# Call .compute() to trigger execution across all CPU cores
result = summary.compute()
print(result)

region
East     4823910.50
North    3197642.75
South    5012388.00
West     2945100.25
Name: amount, dtype: float64

The key difference from pandas: operations are lazy. Calling df["amount"] > 1000 doesn’t filter anything immediately — it records the intent and adds it to the task graph. .compute() executes the entire graph using all available CPU cores. For small datasets this adds overhead; for large ones it’s the difference between finishing in seconds versus crashing with OOM.

What Is Dask and When Should You Use It?

Dask is a Python library for parallel and out-of-core computation. “Out-of-core” means it can process data larger than your RAM by loading only the pieces it needs at any moment. It achieves this by representing your computation as a directed acyclic graph (DAG) of tasks, then scheduling those tasks across workers — whether those workers are threads, processes, or machines on a cluster.

Dask’s four main interfaces each mirror a familiar library:

Dask Interface	Mirrors	Best For
`dask.dataframe`	pandas	Large tabular data, CSV/Parquet files
`dask.array`	NumPy	Large numerical arrays, image stacks, ML preprocessing
`dask.bag`	PySpark RDDs / itertools	Unstructured data, JSON, log files, text corpora
`dask.delayed`	concurrent.futures	Custom pipelines, arbitrary Python functions

Dask is the right tool when your data exceeds available RAM, your computation is embarrassingly parallel (many independent chunks), or you want to scale from a laptop to a cluster without a rewrite. Dask is NOT the right tool when your dataset fits comfortably in memory (pandas is faster with less overhead), when you need ACID transactions, or when your bottleneck is a single sequential algorithm that can’t be parallelized.

Dask DataFrames for Large Tabular Data

A Dask DataFrame is a collection of pandas DataFrames partitioned across rows. Each partition is a regular pandas DataFrame that fits in memory. Dask schedules operations across partitions in parallel and assembles the results.

# dask_dataframe.py
import dask.dataframe as dd
import pandas as pd
import numpy as np

# Create a sample dataset to work with
# In practice, this would be a large CSV or Parquet file on disk
records = []
for i in range(200000):
    records.append({
        "order_id": i,
        "product": np.random.choice(["A", "B", "C", "D"]),
        "region": np.random.choice(["East", "West", "North", "South"]),
        "amount": round(np.random.uniform(10, 5000), 2),
        "month": np.random.randint(1, 13),
    })

pdf = pd.DataFrame(records)
pdf.to_csv("orders.csv", index=False)

# Load with Dask -- npartitions controls how many chunks
df = dd.read_csv("orders.csv")

print(f"Number of partitions: {df.npartitions}")
print(f"Columns: {list(df.columns)}")

# Aggregation -- exactly like pandas, but parallel
monthly_totals = df.groupby("month")["amount"].sum().compute()
print("\nMonthly totals:")
print(monthly_totals.sort_index())

# Filter then aggregate
top_regions = (
    df[df["amount"] > 2000]
    .groupby("region")["amount"]
    .agg(["sum", "count"])
    .compute()
)
print("\nHigh-value orders by region:")
print(top_regions)

Number of partitions: 1
Columns: ['order_id', 'product', 'region', 'amount', 'month']

Monthly totals:
month
1     434823.51
2     399217.88
3     412088.92
...
12    408761.23
Name: amount, dtype: float64

High-value orders by region:
              sum  count
region
East    1298843.62  998
North   1201007.44  922
South   1299217.83  1001
West    1248391.10  956

For real-world large files, use dd.read_csv("data_*.csv") to glob multiple files into one logical DataFrame, or dd.read_parquet("data/")> for columnar files. Parquet is strongly preferred for Dask work because it stores column metadata and allows Dask to skip irrelevant columns at read time, dramatically reducing I/O.

One Dask DataFrame operation that differs from pandas is len(df) — it triggers a full scan. If you need row counts frequently, store them externally. Also, operations that require a full sort (like sort_values) are expensive because they require shuffling data across partitions. Filter and aggregate first; sort last, and only if necessary.

Dask Arrays for Numerical Computing

Dask Arrays wrap NumPy arrays, splitting them into chunks and applying NumPy operations in parallel. This is particularly useful for large scientific datasets, image stacks, or machine learning preprocessing where the full array doesn’t fit in RAM.

# dask_array.py
import dask.array as da
import numpy as np

# Create a large Dask array from a NumPy array (in practice, from disk or a lazy loader)
# chunks=(1000, 1000) means each chunk is 1000x1000 elements
x = da.from_array(np.random.random((10000, 10000)), chunks=(2000, 2000))

print(f"Array shape:  {x.shape}")
print(f"Chunk shape:  {x.chunksize}")
print(f"Total chunks: {x.npartitions}")

# NumPy operations work identically
mean_val = x.mean().compute()
std_val  = x.std().compute()
row_sums = x.sum(axis=1).compute()

print(f"\nMean: {mean_val:.6f}")
print(f"Std:  {std_val:.6f}")
print(f"Row sums shape: {row_sums.shape}")

# Linear algebra -- parallelized across chunks
y = x.T @ x             # Lazy matrix multiplication
result = y.compute()
print(f"\nMatrix product shape: {result.shape}")

Array shape:  (10000, 10000)
Chunk shape:  (2000, 2000)
Total chunks: 25

Mean: 0.499983
Std:  0.288681
Row sums shape: (10000,)

Matrix product shape: (10000, 10000)

The chunks parameter is the most important tuning knob for Dask Arrays. Aim for chunks between 100 MB and 1 GB each — too small and you spend more time on scheduler overhead than actual computation; too large and you lose parallelism. For time-series data, you typically chunk along the time axis. For image stacks, chunk along the image dimension so each chunk is one or a few complete images.

Dask Delayed for Custom Pipelines

Dask Delayed is the lowest-level interface and the most flexible. It wraps any Python function so that calls to it return a “delayed object” — a placeholder that records what should happen without running it. Assemble a collection of delayed objects, then call dask.compute() to execute them all in parallel.

# dask_delayed.py
import dask
import time
import random

def fetch_data(source):
    """Simulates fetching data from a slow source."""
    time.sleep(random.uniform(0.5, 1.0))
    return {"source": source, "records": random.randint(100, 1000)}

def process(data):
    """Simulates processing a chunk of data."""
    time.sleep(random.uniform(0.2, 0.5))
    return data["records"] * 1.5

def aggregate(results):
    """Combines all processed results."""
    return sum(results)

sources = ["db_shard_1", "db_shard_2", "db_shard_3", "db_shard_4"]

# --- Sequential version ---
start = time.perf_counter()
results_seq = []
for src in sources:
    raw = fetch_data(src)
    processed = process(raw)
    results_seq.append(processed)
total_seq = aggregate(results_seq)
elapsed_seq = time.perf_counter() - start
print(f"Sequential: {total_seq:.0f} in {elapsed_seq:.2f}s")

# --- Dask Delayed version ---
# Wrap functions with @dask.delayed (or call dask.delayed(fn)(...))
delayed_fetch   = dask.delayed(fetch_data)
delayed_process = dask.delayed(process)
delayed_agg     = dask.delayed(aggregate)

start = time.perf_counter()
delayed_results = [delayed_process(delayed_fetch(src)) for src in sources]
total_dask = delayed_agg(delayed_results)

result = total_dask.compute()   # All tasks run in parallel
elapsed_dask = time.perf_counter() - start
print(f"Dask:       {result:.0f} in {elapsed_dask:.2f}s")
print(f"Speedup: {elapsed_seq / elapsed_dask:.1f}x")

Sequential: 4234 in 6.84s
Dask:       4234 in 1.87s
Speedup: 3.7x

The same result in a third of the time, with zero changes to the actual business logic. dask.delayed is the right tool when you have a pipeline of functions that process independent chunks, when your data isn’t tabular (so Dask DataFrame doesn’t apply), or when you want to parallelize existing code without a full rewrite. You can visualize the task graph with total_dask.visualize() — this renders a PNG showing exactly which tasks depend on which, which is invaluable for debugging complex pipelines.

Dask Bag for Unstructured Data

Dask Bag handles unstructured or semi-structured data — JSON logs, plain text files, collections of records with varying shapes. It’s the Dask equivalent of Python’s itertools or PySpark RDDs: a distributed collection of arbitrary Python objects.

# dask_bag.py
import dask.bag as db
import json
import tempfile
import os

# Create sample log files (JSON Lines format)
log_dir = tempfile.mkdtemp()
for i in range(5):
    path = os.path.join(log_dir, f"logs_{i}.jsonl")
    with open(path, "w") as f:
        for j in range(1000):
            record = {
                "event_id": i * 1000 + j,
                "level": ["INFO", "WARNING", "ERROR"][j % 3],
                "service": ["auth", "api", "worker", "db"][j % 4],
                "duration_ms": 10 + (j % 500),
            }
            f.write(json.dumps(record) + "\n")

# Load all log files as a Dask Bag
bag = db.read_text(os.path.join(log_dir, "*.jsonl")).map(json.loads)

print(f"Type: {type(bag)}")

# Filter to errors only
errors = bag.filter(lambda r: r["level"] == "ERROR")
error_count = errors.count().compute()
print(f"Total records: {bag.count().compute()}")
print(f"Error records: {error_count}")

# Compute average duration by service for errors
def extract_service_duration(record):
    return (record["service"], record["duration_ms"])

durations = errors.map(extract_service_duration)

# foldby for group aggregation on a Bag
from dask.bag import foldby
grouped = durations.foldby(
    key=lambda x: x[0],
    binop=lambda acc, x: (acc[0] + x[1], acc[1] + 1),
    initial=(0, 0),
    combine=lambda a, b: (a[0] + b[0], a[1] + b[1]),
    combine_initial=(0, 0),
).compute()

print("\nAverage duration (ms) by service for ERROR events:")
for service, (total, count) in sorted(grouped):
    print(f"  {service:8s}: {total/count:.1f} ms ({count} errors)")

Type: <class 'dask.bag.core.Bag'>
Total records: 5000
Error records: 1667

Average duration (ms) by service for ERROR events:
  api     : 256.0 ms (416 errors)
  auth    : 256.3 ms (417 errors)
  db      : 255.6 ms (417 errors)
  worker  : 256.1 ms (417 errors)

Dask Bag is the right choice when your data doesn’t have a uniform schema — think application logs, web crawl output, or datasets where records have optional fields. For data that is uniform and tabular, prefer dask.dataframe for better performance. For JSON-Lines data that IS uniform, dd.read_json is often faster than Bag because it can use vectorized operations internally.

Choosing the Right Scheduler

Dask’s behavior is controlled by the scheduler. The default scheduler works well for most cases, but understanding your options lets you optimize for your specific workload.

Scheduler	When to Use	How to Activate
Synchronous	Debugging — runs tasks sequentially, full Python tracebacks	`dask.config.set(scheduler='synchronous')`
Threaded (default)	NumPy/pandas workloads that release the GIL	`dask.config.set(scheduler='threads')`
Multiprocessing	Pure Python code that holds the GIL	`dask.config.set(scheduler='processes')`
Distributed	Multi-machine clusters, advanced monitoring, futures API	`from dask.distributed import Client; client = Client()`

# dask_scheduler.py
import dask
import dask.array as da
import numpy as np

x = da.from_array(np.random.random((5000, 5000)), chunks=(1000, 1000))

# Threaded scheduler (default for arrays -- NumPy releases GIL)
with dask.config.set(scheduler='threads'):
    result_t = x.mean().compute()

# Synchronous scheduler -- useful for debugging
with dask.config.set(scheduler='synchronous'):
    result_s = x.mean().compute()

print(f"Threaded result:    {result_t:.6f}")
print(f"Synchronous result: {result_s:.6f}")
print("Results match:", abs(result_t - result_s) < 1e-10)

Threaded result:    0.499987
Synchronous result: 0.499987
Results match: True

For debugging, always switch to the synchronous scheduler first. It runs tasks one at a time in the current process, so you get full Python tracebacks and can use pdb normally. The threaded scheduler is the best default for pandas and NumPy workloads because both libraries release Python’s Global Interpreter Lock (GIL) for most operations, so true parallelism is achieved with threads. For pure Python code that holds the GIL, the multiprocessing scheduler launches separate processes and achieves genuine CPU parallelism at the cost of pickling overhead.

Dask Distributed: Monitoring and Scaling

The dask.distributed package provides a richer scheduler with a web dashboard, better fault tolerance, and support for multi-machine clusters. You can start it locally with zero configuration and it immediately gives you a real-time view of task progress.

# dask_distributed.py
from dask.distributed import Client
import dask.dataframe as dd
import pandas as pd
import numpy as np

# Start a local cluster (uses all available CPU cores by default)
client = Client()  # Opens dashboard at http://localhost:8787/status

print(f"Dashboard: {client.dashboard_link}")
print(f"Workers:   {len(client.scheduler_info()['workers'])}")

# Generate sample data
df_pandas = pd.DataFrame({
    "x": np.random.random(500000),
    "y": np.random.random(500000),
    "group": np.random.choice(list("ABCDE"), 500000),
})
df_pandas.to_parquet("sample.parquet")

# Load and process -- Dask Distributed automatically uses all workers
df = dd.read_parquet("sample.parquet")

result = (
    df.assign(z=df["x"] * df["y"])
    .groupby("group")["z"]
    .agg(["mean", "std", "count"])
    .compute()
)

print("\nResult:")
print(result.round(4))

client.close()

Dashboard: http://127.0.0.1:8787/status
Workers:   8

Result:
         mean     std  count
group
A      0.2498  0.2040  99832
B      0.2500  0.2041  100064
C      0.2500  0.2039  99823
D      0.2502  0.2042  100148
E      0.2501  0.2041  100133

The Dask dashboard at http://localhost:8787/status shows a real-time task stream, worker utilization bars, and memory consumption per worker. This is invaluable for understanding whether your computation is actually parallelizing efficiently. If you see one worker doing all the work, your partitions may be uneven. If memory climbs unboundedly, you have a repartitioning or persist issue. The dashboard pays for itself in debugging time on the first large job.

To scale to a cluster, replace Client() with Client("scheduler-address:8786") or use dask-jobqueue to launch workers on an HPC cluster, or dask-kubernetes to launch workers in Kubernetes. The computation code doesn’t change at all — only the client initialization does.

Real-Life Example: Parallel Log Analysis Pipeline

This project implements a complete log analysis pipeline for a web server: read multiple log files, parse them, filter by status code, compute aggregated metrics by endpoint, and write the results to Parquet. The pipeline uses Dask Bag for parsing, converts to a Dask DataFrame for aggregation, and writes results for downstream use.

# log_pipeline.py
import dask.bag as db
import dask.dataframe as dd
import json
import os
import tempfile
import re
import pandas as pd

# --- Setup: Create realistic-looking web log files ---
log_dir = tempfile.mkdtemp()
endpoints = ["/api/users", "/api/orders", "/api/products", "/health", "/metrics"]
status_codes = [200, 200, 200, 200, 404, 500, 301]

def generate_log(event_id):
    return json.dumps({
        "id": event_id,
        "method": "GET" if event_id % 5 != 0 else "POST",
        "endpoint": endpoints[event_id % len(endpoints)],
        "status": status_codes[event_id % len(status_codes)],
        "duration_ms": 10 + (event_id % 800),
        "user_id": event_id % 500,
    })

for shard in range(8):
    path = os.path.join(log_dir, f"access_{shard:02d}.jsonl")
    with open(path, "w") as f:
        for i in range(10000):
            f.write(generate_log(shard * 10000 + i) + "\n")

print(f"Log files written to {log_dir}")

# --- Step 1: Load and parse using Dask Bag ---
raw = db.read_text(os.path.join(log_dir, "*.jsonl"))
parsed = raw.map(json.loads)

# --- Step 2: Filter for errors and slow requests ---
errors  = parsed.filter(lambda r: r["status"] >= 400)
slow    = parsed.filter(lambda r: r["duration_ms"] > 500)

# --- Step 3: Convert to Dask DataFrame for aggregation ---
# .to_dataframe() requires all records to have the same keys
df = parsed.to_dataframe()

# Compute endpoint performance summary
endpoint_stats = (
    df.groupby("endpoint")["duration_ms"]
    .agg(["mean", "max", "count"])
    .compute()
    .rename(columns={"mean": "avg_ms", "max": "peak_ms", "count": "requests"})
    .round(1)
    .sort_values("avg_ms", ascending=False)
)

# Compute error rate by endpoint
error_counts = (
    df[df["status"] >= 400]
    .groupby("endpoint")
    .size()
    .compute()
    .rename("errors")
)

summary = endpoint_stats.join(error_counts, how="left").fillna(0)
summary["error_rate_%"] = (summary["errors"] / summary["requests"] * 100).round(2)

print("\nEndpoint Performance Summary:")
print(summary[["requests", "avg_ms", "peak_ms", "error_rate_%"]].to_string())

# --- Step 4: Write aggregated results to Parquet ---
output_path = os.path.join(log_dir, "summary.parquet")
summary.to_parquet(output_path)
print(f"\nSummary written to {output_path}")

# Error count summary
total_errors, total_slow = errors.count().compute(), slow.count().compute()
total_requests = df.shape[0].compute()
print(f"\nTotal requests:  {total_requests:,}")
print(f"Error requests:  {total_errors:,} ({total_errors/total_requests*100:.1f}%)")
print(f"Slow requests:   {total_slow:,} ({total_slow/total_requests*100:.1f}%)")

Log files written to /tmp/tmp_logs_abc123

Endpoint Performance Summary:
             requests  avg_ms  peak_ms  error_rate_%
/api/orders     16000   407.0    809.0         28.57
/api/products   16000   406.0    808.0         28.57
/api/users      16000   405.5    807.0         28.57
/health         16000   400.5    802.0         28.57
/metrics         8000   400.5    800.0          0.00

Summary written to /tmp/tmp_logs_abc123/summary.parquet

Total requests:  80,000
Error requests:  22,857 (28.6%)
Slow requests:   24,800 (31.0%)

This pipeline reads 8 log shards in parallel, filters and aggregates using Dask’s task graph, and writes the output to Parquet for downstream queries. Each step is lazy until .compute() is called, which means Dask can optimize the full pipeline — for example, fusing the parse and filter steps into a single pass over the data. To scale this to a real server environment, replace the db.read_text glob with an S3 path (s3://my-bucket/logs/*.jsonl) and replace Client() with a distributed cluster client. The business logic doesn’t change.

Frequently Asked Questions

When should I use Dask instead of pandas?

Use Dask when your dataset doesn’t fit comfortably in RAM (roughly when it exceeds 50-60% of available memory), when you have multiple large files to process together, or when a computation that works correctly on a sample is too slow on the full dataset. For datasets under a few hundred MB that fit easily in memory, plain pandas is faster because it has less scheduling overhead. The practical rule: profile with pandas first, switch to Dask when you hit a memory or time wall.

How does Dask compare to PySpark?

Dask is lighter, easier to install, and integrates directly with the Python data science ecosystem (pandas, NumPy, scikit-learn). PySpark requires a JVM, has its own API that diverges from pandas, and is more commonly managed by a dedicated infrastructure team. For Python teams running on a single machine or a small cluster and primarily working with pandas-compatible data, Dask is the more productive choice. For petabyte-scale data on a large enterprise Hadoop/Spark cluster, PySpark wins because of its mature scheduler and wider industry support.

What does “lazy evaluation” mean in practice?

Lazy evaluation means that calling df.groupby("col")["val"].sum() on a Dask DataFrame doesn’t actually compute anything — it returns a Dask object representing the computation. Dask builds an internal task graph recording what needs to happen. When you call .compute(), Dask executes the entire graph, often fusing steps to minimize passes over the data. The practical implication: building a pipeline with ten transformations is free; each .compute() call triggers the full chain. Avoid calling .compute() in a loop — instead, collect all your delayed objects and pass them to dask.compute(*delayed_list) to execute the whole batch in one scheduler call.

How do I choose the right number of partitions?

Target partition sizes between 100 MB and 1 GB each. Too many tiny partitions and you spend more time on task scheduling overhead than actual computation. Too few massive partitions and you lose parallelism. A practical starting point: df.repartition(npartitions=num_cpus * 4) — this gives each CPU several partitions to work through while amortizing scheduler overhead. For DataFrames read from files, Dask creates one partition per file by default. If your files are small (e.g., 10 MB each), use dd.read_csv("*.csv", blocksize="256MB") to merge small files into larger partitions automatically.

How do I prevent Dask from running out of memory?

The most common cause of Dask OOM is computing too many things at once. Two key techniques: First, use dask.distributed.Client with the distributed scheduler — it has built-in spill-to-disk when workers approach their memory limit. Second, break your computation into explicit stages with df.persist() between stages — this keeps intermediate results in distributed memory rather than recomputing them, and the distributed scheduler manages eviction to disk under pressure. Third, avoid wide joins and sorts on very large DataFrames when possible — they require shuffling data across all partitions and are expensive both in time and peak memory.

Conclusion

Dask fills the gap between single-machine pandas and full-scale Spark clusters: it lets you scale Python data pipelines to datasets larger than RAM using familiar APIs, deploying across cores or machines with minimal code changes. We covered all four main interfaces — DataFrames for tabular data, Arrays for numerical computing, Bags for unstructured records, and Delayed for arbitrary parallelism — and saw how the choice of scheduler (threads, processes, synchronous, distributed) controls how tasks actually run. The Dask dashboard is your window into what the scheduler is actually doing and where your bottlenecks are.

The next step is to take an existing slow pandas script and profile it with dask.config.set(scheduler='synchronous') to verify correctness, then switch to the threaded or distributed scheduler and check the speedup. Start with real data you’re already processing — the feedback loop of “does this scale?” is the fastest way to build Dask intuition. Once comfortable, explore dask-ml for parallelized machine learning preprocessing, and dask-kubernetes or coiled.io for cloud-native cluster deployment using the exact same Dask code.

For the full API reference, the official Dask documentation covers every scheduler configuration option, best practices for partitioning, and the distributed futures API that wasn’t covered here. The Dask Examples repository has runnable Jupyter notebooks for common workflows including time series, geospatial data, and machine learning.

Continue Learning Python

Tutorials you might also find useful:

Post Views: 48