Intermediate
You have a CSV file with 50 million rows. Loading it into Pandas takes three minutes and uses 8 GB of RAM. Your colleague sends you a Parquet file and asks you to join it with another dataset. Your data pipeline reads from an S3 bucket where every file is a different format. Each of these scenarios has the same underlying solution: stop treating your data as row-based text and start treating it as columnar binary data.
PyArrow is the Python binding for Apache Arrow, a cross-language in-memory columnar data format with an accompanying file I/O library. It reads Parquet files quickly, can stream files larger than RAM in batches, and converts between Pandas DataFrames, Polars frames, and NumPy arrays, often without copying the underlying buffers. Install it with pip install pyarrow. Pandas can use PyArrow as an optional dependency, but using it directly gives you more control over schema, compression, and memory layout.
This article covers creating Arrow tables and arrays, reading and writing Parquet files, filtering and selecting columns during reads to avoid loading unnecessary data, streaming large files in batches, converting between PyArrow and Pandas, and using PyArrow’s compute functions for fast transformations. By the end you will have the tools to handle data files that would overwhelm a naive CSV-based pipeline.
PyArrow Parquet: Quick Example
The most common use case for PyArrow is reading and writing Parquet files. Here is the smallest working example — write a table to Parquet and read it back.
# quick_pyarrow.py
import pyarrow as pa
import pyarrow.parquet as pq
# Create a table from a dict of column lists
table = pa.table({
    "name": ["Alice", "Bob", "Carol", "David"],
    "age": [30, 25, 35, 28],
    "salary": [95000.0, 72000.0, 110000.0, 88000.0],
})
print(f"Schema: {table.schema}")
print(f"Rows: {table.num_rows}, Columns: {table.num_columns}")
# Write to Parquet
pq.write_table(table, "employees.parquet", compression="snappy")
print("Wrote employees.parquet")
# Read it back
loaded = pq.read_table("employees.parquet")
print(loaded.to_pydict())
Output:
Schema: name: string
age: int64
salary: double
Rows: 4, Columns: 3
Wrote employees.parquet
{'name': ['Alice', 'Bob', 'Carol', 'David'], 'age': [30, 25, 35, 28], 'salary': [95000.0, 72000.0, 110000.0, 88000.0]}
The pa.table() constructor infers the schema from the Python values automatically: Python str maps to Arrow string, int to int64, and float to double. Snappy, the codec passed to write_table() here (it is also PyArrow's default), is fast and widely supported. How much it shrinks a file depends on the data; a 50-70% reduction is typical for repetitive tabular data, at low CPU cost. The following sections go deeper into schema control, partial reads, and large-file handling.
What Is PyArrow and Why Use It?
Apache Arrow defines a language-independent in-memory format for columnar data. Where a CSV row stores "Alice,30,95000" as a single string, Arrow stores all names in one contiguous memory buffer, all ages in another, and all salaries in a third. This columnar layout means that reading only the salary column never touches the name data — the disk seeks and memory reads are proportional to the columns you actually need, not to the total row width.
| Operation | CSV + pandas | Parquet + PyArrow |
|---|---|---|
| Read 50M rows, all columns | 3-5 min, 8 GB RAM | 15-30 sec, 1-2 GB RAM |
| Read 50M rows, 2 columns | Same (must parse all) | 2-4 sec (skips unused columns) |
| Filter rows during read | Not possible | Yes, via predicate pushdown |
| File size (same data) | 500 MB (uncompressed) | 50-100 MB (Snappy) |
| Schema enforcement | Guessed at read time | Stored in file metadata |
Parquet is the standard file format for data engineering pipelines on platforms like AWS Athena, Google BigQuery, and Apache Spark. If you receive data from any of these systems, it will often be Parquet. If you send data to them, Parquet is what they prefer. PyArrow is the reference Python implementation for reading and writing this format.
Defining and Enforcing Schema
When reading files from multiple sources, auto-inferred types can cause subtle bugs — an integer column that arrives as a string in one file breaks a join against an integer column in another. PyArrow lets you define a schema explicitly and validate incoming data against it.
# schema_control.py
import pyarrow as pa
import pyarrow.parquet as pq
# Define the exact types you expect
schema = pa.schema([
    pa.field("user_id", pa.int32()),
    pa.field("username", pa.string()),
    pa.field("email", pa.string()),
    pa.field("age", pa.int16()),
    pa.field("score", pa.float32()),
    pa.field("active", pa.bool_()),
])
# Create data that matches the schema
data = {
    "user_id": pa.array([1, 2, 3], type=pa.int32()),
    "username": pa.array(["alice", "bob", "carol"]),
    "email": pa.array(["a@x.com", "b@x.com", "c@x.com"]),
    "age": pa.array([30, 25, 35], type=pa.int16()),
    "score": pa.array([0.95, 0.82, 0.99], type=pa.float32()),
    "active": pa.array([True, True, False]),
}
table = pa.table(data, schema=schema)
print(f"Schema: {table.schema}")
print(f"Memory: {table.nbytes:,} bytes")