How To Use Python PyArrow for Columnar Data Processing

Last Updated: June 01, 2026

Intermediate

You have a CSV file with 50 million rows. Loading it into Pandas takes three minutes and uses 8 GB of RAM. Your colleague sends you a Parquet file and asks you to join it with another dataset. Your data pipeline reads from an S3 bucket where every file is a different format. Each of these scenarios has the same underlying solution: stop treating your data as row-based text and start treating it as columnar binary data.

PyArrow is the Python binding for Apache Arrow, a cross-language in-memory data format and file I/O library. It reads Parquet files in seconds, handles files larger than RAM by streaming them in batches, and converts between Pandas DataFrames, Polars frames, and NumPy arrays with zero copies in most cases. Install it with pip install pyarrow. Pandas lists PyArrow as an optional dependency, but using it directly gives you more control over schema, compression, and memory layout.

This article covers creating Arrow tables and arrays, reading and writing Parquet files, filtering and selecting columns during reads to avoid loading unnecessary data, streaming large files in batches, converting between PyArrow and Pandas, and using PyArrow’s compute functions for fast transformations. By the end you will have the tools to handle data files that would overwhelm a naive CSV-based pipeline.

Written by Pubs

Python developer and educator with 15+ years building production systems across data engineering, web APIs, and AI tooling. Founder of Python How To Program — 270+ in-depth tutorials covering the modern Python stack.

PyArrow Parquet: Quick Example

The most common use case for PyArrow is reading and writing Parquet files. Here is the smallest working example — write a table to Parquet and read it back.

# quick_pyarrow.py
import pyarrow as pa
import pyarrow.parquet as pq

# Create a table from Python dicts
table = pa.table({
    "name":   ["Alice", "Bob", "Carol", "David"],
    "age":    [30, 25, 35, 28],
    "salary": [95000.0, 72000.0, 110000.0, 88000.0],
})

print(f"Schema: {table.schema}")
print(f"Rows: {table.num_rows}, Columns: {table.num_columns}")

# Write to Parquet
pq.write_table(table, "employees.parquet", compression="snappy")
print("Wrote employees.parquet")

# Read it back
loaded = pq.read_table("employees.parquet")
print(loaded.to_pydict())

Output:

Schema: name: string
age: int64
salary: double
Rows: 4, Columns: 3
Wrote employees.parquet
{'name': ['Alice', 'Bob', 'Carol', 'David'], 'age': [30, 25, 35, 28], 'salary': [95000.0, 72000.0, 110000.0, 88000.0]}

The pa.table() function infers the schema from the Python types automatically. PyArrow maps Python str to Arrow string, Python int to int64, and Python float to double. Snappy, which is also the default codec for write_table(), is a fast, widely supported compressor that typically reduces Parquet file sizes by 50-70% with negligible CPU overhead. The following sections go deeper into schema control, partial reads, and large-file handling.

What Is PyArrow and Why Use It?

Apache Arrow defines a language-independent in-memory format for columnar data. Where a CSV row stores "Alice,30,95000" as a single string, Arrow stores all names in one contiguous memory buffer, all ages in another, and all salaries in a third. This columnar layout means that reading only the salary column never touches the name data — the disk seeks and memory reads are proportional to the columns you actually need, not to the total row width.

Operation	CSV + pandas	Parquet + PyArrow
Read 50M rows, all columns	3-5 min, 8 GB RAM	15-30 sec, 1-2 GB RAM
Read 50M rows, 2 columns	Same (must parse all)	2-4 sec (skips unused columns)
Filter rows during read	Not possible	Yes, via predicate pushdown
File size (same data)	500 MB (uncompressed)	50-100 MB (Snappy)
Schema enforcement	Guessed at read time	Stored in file metadata

Parquet is the standard file format for data engineering pipelines on platforms like AWS Athena, Google BigQuery, and Apache Spark. If you receive data from any of these systems, it will often be Parquet. If you send data to them, Parquet is what they prefer. PyArrow is the reference Python implementation for reading and writing this format.

Defining and Enforcing Schema

When reading files from multiple sources, auto-inferred types can cause subtle bugs — an integer column that arrives as a string in one file breaks a join against an integer column in another. PyArrow lets you define a schema explicitly and validate incoming data against it.

# schema_control.py
import pyarrow as pa
import pyarrow.parquet as pq

# Define the exact types you expect
schema = pa.schema([
    pa.field("user_id",    pa.int32()),
    pa.field("username",   pa.string()),
    pa.field("email",      pa.string()),
    pa.field("age",        pa.int16()),
    pa.field("score",      pa.float32()),
    pa.field("active",     pa.bool_()),
])

# Create data that matches the schema
data = {
    "user_id":  pa.array([1, 2, 3],          type=pa.int32()),
    "username": pa.array(["alice", "bob", "carol"]),
    "email":    pa.array(["a@x.com", "b@x.com", "c@x.com"]),
    "age":      pa.array([30, 25, 35],        type=pa.int16()),
    "score":    pa.array([0.95, 0.82, 0.99],  type=pa.float32()),
    "active":   pa.array([True, True, False]),
}

table = pa.table(data, schema=schema)
print(f"Schema: {table.schema}")
print(f"Memory: {table.nbytes:,} bytes")

Every column now carries the type you declared, and PyArrow reports the table’s exact in-memory footprint:

Schema: user_id: int32
username: string
email: string
age: int16
score: float
active: bool
Memory: 89 bytes

Frequently Asked Questions

How does PyArrow compare to pandas?

pandas stores data in NumPy arrays organised as a DataFrame; PyArrow stores data as Arrow Tables backed by columnar memory buffers that match the Apache Arrow specification. Arrow is the better choice when you want zero-copy interop with Polars, DuckDB, Spark, or Parquet/Feather files, or when you need string columns that don’t blow up RAM. pandas remains the better choice when you need its full grouping/window/plotting ecosystem. pandas 2+ can use PyArrow as its backend via dtype_backend="pyarrow" to get the best of both.

When should I use PyArrow over CSV?

Whenever the file will be read more than once. Parquet, the columnar format PyArrow reads and writes, is typically 5-10x smaller than the equivalent CSV, lets you read only the columns you select instead of every full row, and preserves dtypes (no more parsing dates from strings). CSV stays useful only as a human-readable interchange format.

Can I stream a PyArrow Table that’s larger than RAM?

Yes. Use the streaming API: pyarrow.ipc.RecordBatchStreamReader or pyarrow.dataset.dataset(…).to_batches(). These yield RecordBatch objects you can process one chunk at a time. For Parquet, pyarrow.parquet.ParquetFile(…).iter_batches(batch_size=100000) does the same job. Never load the whole Table with read_table() if it’s bigger than free RAM.

Does PyArrow support nested data like JSON?

Yes — Arrow has native struct, list, and map types. Convert nested Python dicts via pyarrow.array(list_of_dicts) and Arrow infers the nested schema. The same nested types serialize to Parquet’s nested encoding so you get efficient columnar storage of semi-structured data without flattening.

How do I push Arrow data to a database efficiently?

Use adbc_driver_postgresql (or the equivalent for SQLite, BigQuery, Snowflake) and call cursor.adbc_ingest("table_name", arrow_table, mode="append"). ADBC sends Arrow buffers over the wire without converting through Python objects, which is 5-50x faster than executing INSERT statements row by row.
