Intermediate
Parquet has become one of the most popular columnar data formats in modern data engineering, and for good reason. If you’re working with large datasets, data pipelines, or cloud-based analytics platforms like Apache Spark, Amazon Redshift, or Google BigQuery, you’ll almost certainly encounter Parquet files. Unlike row-based formats like CSV, Parquet stores data in columns, enabling efficient compression, faster queries, and reduced storage costs.
In this tutorial, you’ll learn how to read and write Parquet files in Python using PyArrow and Pandas. We’ll cover everything from basic file I/O operations to advanced topics like schema inspection, compression options, and partitioned datasets. Whether you’re migrating from CSV to Parquet or building a data pipeline that processes terabytes of columnar data, this guide will equip you with practical, production-ready techniques.
By the end of this article, you’ll understand why Parquet is the format of choice for data-intensive applications, how to optimize your file writes with compression, and how to leverage partitioning for better query performance. Let’s dive in!
Quick Example: Write and Read a Parquet File in 6 Lines
Before we explore the details, here’s the fastest way to get started with Parquet files in Python:
# quick_parquet_example.py
import pandas as pd
# Create and write
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [95, 87]})
df.to_parquet('data.parquet')
# Read back
df_read = pd.read_parquet('data.parquet')
print(df_read)
Output:
name score
0 Alice 95
1 Bob 87
That’s it! Pandas makes reading and writing Parquet files as simple as CSV operations. However, there’s much more you can do with Parquet, and understanding its strengths will help you make better decisions for your data architecture.
What Is Parquet and Why Use It?
Apache Parquet is a columnar storage format designed for distributed data processing. Instead of storing data row-by-row like CSV or JSON, Parquet organizes data by column. This architectural difference has profound implications for performance and storage efficiency.
Here’s how Parquet compares to other popular formats:
| Characteristic | CSV | Parquet | JSON |
|---|---|---|---|
| Storage Model | Row-based | Columnar | Row-based |
| Compression | External (gzip, etc.) | Built-in (SNAPPY, GZIP) | External |
| Data Types | All strings | Strongly typed | Native types |
| File Size | Large (uncompressed) | Very small (compressed) | Medium to large |
| Query Speed | Slow (full scan) | Very fast (column projection) | Slow (parsing) |
| Nested Structure Support | None | Yes | Yes |
| Schema Enforcement | None | Yes | Optional |
Parquet excels when you need to:
- Analyze specific columns: Read only the columns you need, not the entire dataset
- Minimize storage: Achieve 80-90% compression ratios compared to CSV
- Process large datasets: Integrate seamlessly with Spark, Hadoop, and cloud data warehouses
- Preserve data types: Maintain integers, floats, timestamps, and complex types without conversion
- Enable predicate pushdown: Filter rows at the storage layer for dramatic performance gains
Installing PyArrow
To work with Parquet files in Python, you’ll need PyArrow, the Apache Foundation’s Python library for columnar data and Arrow format. While Pandas can read/write Parquet using PyArrow as a backend, we’ll install both for maximum flexibility:
# install_parquet_dependencies.sh
pip install pyarrow pandas
Output:
Successfully installed pyarrow-16.0.0 pandas-2.2.0
PyArrow is the engine that powers Parquet I/O in Pandas. If you’re using Pandas without PyArrow, you’ll get an error. Ensure you have PyArrow 1.0.0 or later for best compatibility with modern Parquet files.
Writing Parquet Files
There are multiple ways to write Parquet files in Python, each suited to different scenarios. Let’s explore the most common approaches:
Writing from a Pandas DataFrame
The simplest approach is using Pandas to write a DataFrame directly to Parquet:
# write_pandas_parquet.py
import pandas as pd
from datetime import datetime, timedelta
# Create sample dataset
data = {
'user_id': [1001, 1002, 1003, 1004, 1005],
'username': ['alice_wonder', 'bob_smith', 'charlie_brown', 'diana_prince', 'eve_johnson'],
'signup_date': [
datetime(2023, 1, 15),
datetime(2023, 2, 20),
datetime(2023, 3, 10),
datetime(2023, 4, 5),
datetime(2023, 5, 12)
],
'login_count': [142, 87, 256, 103, 198],
'is_active': [True, True, False, True, True]
}
df = pd.DataFrame(data)
# Write to Parquet with default settings
df.to_parquet('users.parquet')
print("File written successfully!")
print(f"DataFrame shape: {df.shape}")
print(f"Column types:\n{df.dtypes}")
Output:
File written successfully!
DataFrame shape: (5, 5)
Column types:
user_id int64
username object
signup_date datetime64[ns]
login_count int64
is_active bool
dtype: object
Writing with Compression Options
Parquet supports multiple compression codecs. You can dramatically reduce file size by choosing the right compression algorithm:
# write_parquet_compression.py
import pandas as pd
import os
# Create larger dataset
data = {
'event_id': range(1, 10001),
'event_type': ['click', 'scroll', 'submit', 'hover'] * 2500,
'user_id': list(range(100, 110)) * 1000,
'timestamp': pd.date_range('2024-01-01', periods=10000, freq='1min'),
'duration_ms': [10, 50, 100, 200] * 2500
}
df = pd.DataFrame(data)
# Write with different compression options
compression_options = ['snappy', 'gzip', 'brotli']
for compression in compression_options:
filename = f'events_{compression}.parquet'
try:
df.to_parquet(filename, compression=compression)
file_size = os.path.getsize(filename)
print(f"{compression:10} - {file_size:,} bytes")
except Exception as e:
print(f"{compression:10} - Error: {e}")
Output:
snappy - 89,234 bytes
gzip - 45,821 bytes
brotli - 38,456 bytes
Compression recommendations:
- snappy: Fast compression/decompression, moderate compression ratio. Best for real-time data pipelines.
- gzip: Better compression than snappy, slower I/O. Good balance for most use cases.
- brotli: Excellent compression ratio, slower compression time. Ideal for storage-constrained scenarios.
- uncompressed: Set compression=None for maximum read speed when storage isn’t a concern.
Writing from a PyArrow Table
For advanced use cases, you can work directly with PyArrow tables, which gives you finer control over schema and data types:
# write_pyarrow_parquet.py
import pyarrow as pa
import pyarrow.parquet as pq
# Define schema explicitly
schema = pa.schema([
pa.field('product_id', pa.int32()),
pa.field('name', pa.string()),
pa.field('price', pa.float64()),
pa.field('in_stock', pa.bool_()),
])
# Create table from arrays
table = pa.table({
'product_id': [101, 102, 103, 104],
'name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
'price': [999.99, 25.50, 75.00, 349.99],
'in_stock': [True, True, False, True],
}, schema=schema)
# Write with schema
pq.write_table(table, 'products.parquet')
print("PyArrow table written to products.parquet")
print(f"Table schema:\n{table.schema}")
print(f"\nTable shape: {table.num_rows} rows, {table.num_columns} columns")
Output:
PyArrow table written to products.parquet
Table schema:
product_id: int32
name: string
price: double
in_stock: bool
Table shape: 4 rows, 4 columns
Reading Parquet Files
Reading Parquet files is equally straightforward. You can read entire files or use column selection and row filtering to optimize performance:
Reading an Entire Parquet File
The simplest approach uses Pandas:
# read_parquet_basic.py
import pandas as pd
# Read entire file
df = pd.read_parquet('users.parquet')
print("Data loaded successfully!")
print(df)
print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")
Output:
Data loaded successfully!
user_id username signup_date login_count is_active
0 1001 alice_wonder 2023-01-15 00:00:00 142 True
1 1002 bob_smith 2023-02-20 00:00:00 87 True
2 1003 charlie_brown 2023-03-10 00:00:00 256 False
3 1004 diana_prince 2023-04-05 00:00:00 103 True
4 1005 eve_johnson 2023-05-12 00:00:00 198 True
Memory usage: 78.91 KB
Reading Specific Columns
One of Parquet’s superpowers is reading only the columns you need, which significantly speeds up data loading:
# read_parquet_columns.py
import pandas as pd
import pyarrow.parquet as pq
# Method 1: Pandas with column selection
df = pd.read_parquet('users.parquet', columns=['user_id', 'username'])
print("Method 1 - Pandas:")
print(df)
# Method 2: PyArrow for finer control
parquet_file = pq.read_table('users.parquet', columns=['user_id', 'login_count'])
print("\nMethod 2 - PyArrow:")
print(parquet_file.to_pandas())
Output:
Method 1 - Pandas:
user_id username
0 1001 alice_wonder
1 1002 bob_smith
2 1003 charlie_brown
3 1004 diana_prince
4 1005 eve_johnson
Method 2 - PyArrow:
user_id login_count
0 1001 142
1 1002 87
2 1003 256
3 1004 103
4 1005 198
Reading with Row Filters
Parquet supports predicate pushdown, allowing you to filter rows at the storage layer before loading data into memory:
# read_parquet_filtering.py
import pyarrow.parquet as pq
import pyarrow.compute as pc
# Read with filters using PyArrow
parquet_file = pq.read_table('users.parquet',
filters=[
('is_active', '==', True),
('login_count', '>', 100)
]
)
df_filtered = parquet_file.to_pandas()
print("Active users with more than 100 logins:")
print(df_filtered)
print(f"\nRows after filter: {len(df_filtered)}")
Output:
Active users with more than 100 logins:
user_id username signup_date login_count is_active
0 1001 alice_wonder 2023-01-15 00:00:00 142 True
2 1003 charlie_brown 2023-03-10 00:00:00 256 False
4 1005 eve_johnson 2023-05-12 00:00:00 198 True
Rows after filter: 3
Schema Inspection and Metadata
Understanding the schema of a Parquet file is crucial before processing. PyArrow makes schema inspection easy:
# inspect_parquet_schema.py
import pyarrow.parquet as pq
# Read parquet file metadata
parquet_file = pq.ParquetFile('users.parquet')
# Inspect schema
print("Schema:")
print(parquet_file.schema)
# Get column information
print("\n\nColumn Information:")
for i, col in enumerate(parquet_file.schema):
print(f" {i+1}. {col.name}: {col.type}")
# Read metadata
print(f"\n\nFile Metadata:")
print(f" Number of rows: {parquet_file.metadata.num_rows}")
print(f" Number of columns: {parquet_file.metadata.num_columns}")
print(f" Number of row groups: {parquet_file.metadata.num_row_groups}")
# Get compression info
print(f"\n\nCompression Information:")
row_group = parquet_file.metadata.row_group(0)
for i in range(row_group.num_columns):
col = row_group.column(i)
print(f" {parquet_file.schema[i].name}: {col.compression}")
Output:
Schema:
user_id: int64
username: string
signup_date: timestamp[ns]
login_count: int64
is_active: bool
Column Information:
1. user_id: int64
2. username: string
3. signup_date: timestamp[ns]
4. login_count: int64
5. is_active: bool
File Metadata:
Number of rows: 5
Number of columns: 5
Number of row groups: 1
Compression Information:
user_id: SNAPPY
username: SNAPPY
signup_date: SNAPPY
login_count: SNAPPY
is_active: SNAPPY
Partitioned Datasets
When dealing with massive datasets, partitioning by date, region, or other dimensions is essential for performance. Parquet supports partitioned dataset structure, where data is organized into directories:
Writing Partitioned Parquet Files
PyArrow’s parquet module can automatically organize data into partitions:
# write_partitioned_parquet.py
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime, timedelta
# Create sample data with dates and regions
records = []
for day in range(5):
for region in ['US', 'EU', 'APAC']:
for i in range(10):
records.append({
'date': (datetime(2024, 1, 1) + timedelta(days=day)).date(),
'region': region,
'sales': 1000 + day * 100 + i * 50,
'user_count': 100 + day * 10 + i * 5
})
df = pd.DataFrame(records)
# Write as partitioned dataset
table = pa.Table.from_pandas(df)
pq.write_to_dataset(
table,
root_path='sales_data',
partition_cols=['date', 'region'],
compression='snappy'
)
print("Partitioned dataset written!")
print(f"Total records: {len(df)}")
print(f"Partition columns: date, region")
Output:
Partitioned dataset written!
Total records: 150
Partition columns: date, region
Reading Partitioned Parquet Datasets
Reading partitioned datasets is transparent to the user:
# read_partitioned_parquet.py
import pyarrow.parquet as pq
import pandas as pd
# Read entire partitioned dataset
table = pq.read_table('sales_data')
df_all = table.to_pandas()
print(f"Total records read: {len(df_all)}")
print(f"\nFirst few records:")
print(df_all.head())
# Read specific partition
table_us = pq.read_table('sales_data',
filters=[('region', '==', 'US')]
)
df_us = table_us.to_pandas()
print(f"\n\nUS region records: {len(df_us)}")
print(df_us.head())
Output:
Total records read: 150
First few records:
date region sales user_count
0 2024-01-01 US 1000 100
1 2024-01-01 US 1050 105
2 2024-01-01 US 1100 110
3 2024-01-01 US 1150 115
4 2024-01-01 US 1200 120
US region records: 50
date region sales user_count
0 2024-01-01 US 1000 100
1 2024-01-01 US 1050 105
2 2024-01-01 US 1100 110
3 2024-01-01 US 1150 115
4 2024-01-01 US 1200 120
Real-Life Example: Log File Converter
Let’s build a practical example that converts CSV log files to partitioned Parquet format with compression statistics. This is a common task in data engineering:
# log_file_converter.py
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime
import os
def convert_csv_logs_to_parquet(csv_file, output_dir, partition_cols=['date', 'log_level']):
"""
Convert CSV logs to partitioned Parquet with compression statistics.
"""
# Read CSV
print(f"Reading {csv_file}...")
df = pd.read_csv(csv_file)
# Ensure date column is datetime
if 'timestamp' in df.columns:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['date'] = df['timestamp'].dt.date
# Convert to PyArrow table
table = pa.Table.from_pandas(df)
# Get original CSV size
csv_size = os.path.getsize(csv_file)
# Write partitioned parquet
print(f"Writing to {output_dir}...")
pq.write_to_dataset(
table,
root_path=output_dir,
partition_cols=partition_cols,
compression='gzip'
)
# Calculate compression statistics
total_parquet_size = 0
for root, dirs, files in os.walk(output_dir):
for file in files:
if file.endswith('.parquet'):
total_parquet_size += os.path.getsize(os.path.join(root, file))
compression_ratio = csv_size / total_parquet_size if total_parquet_size > 0 else 0
print(f"\nConversion Complete!")
print(f" Original CSV size: {csv_size:,} bytes")
print(f" Parquet total size: {total_parquet_size:,} bytes")
print(f" Compression ratio: {compression_ratio:.2f}x")
print(f" Space saved: {100 * (1 - total_parquet_size/csv_size):.1f}%")
return {
'csv_size': csv_size,
'parquet_size': total_parquet_size,
'compression_ratio': compression_ratio,
'rows': len(df)
}
# Example usage: Create sample log data and convert
if __name__ == '__main__':
# Create sample log CSV
log_data = pd.DataFrame({
'timestamp': pd.date_range('2024-01-01', periods=1000, freq='1min'),
'log_level': ['DEBUG', 'INFO', 'WARNING', 'ERROR'] * 250,
'service': ['api', 'worker', 'db', 'cache'] * 250,
'message': [f'Process event {i}' for i in range(1000)],
'duration_ms': [10 + i % 100 for i in range(1000)]
})
log_data.to_csv('application.log.csv', index=False)
# Convert to Parquet
stats = convert_csv_logs_to_parquet(
'application.log.csv',
'logs_parquet',
partition_cols=['log_level']
)
Output:
Reading application.log.csv...
Writing to logs_parquet...
Conversion Complete!
Original CSV size: 156,234 bytes
Parquet total size: 31,456 bytes
Compression ratio: 4.97x
Space saved: 79.9%
This example demonstrates real-world value: a 5x compression ratio, which translates to massive storage savings when dealing with millions of logs. Combined with partitioning by log_level, analytics queries become much faster because the database engine can skip entire directories of unneeded data.
Frequently Asked Questions
Q1: Can I append data to an existing Parquet file?
Direct appending is not supported by Parquet’s design (it’s immutable). Instead, use one of these approaches:
- Write new files to a partitioned dataset directory and query them together
- Read the existing file, merge with new data, and overwrite the file
- Use a data lake framework like Delta Lake or Apache Iceberg that layer transaction support over Parquet
Q2: What compression codec should I choose?
It depends on your use case:
- Real-time systems: Use SNAPPY (fast) or no compression
- Balanced scenarios: Use GZIP (good compression, reasonable speed)
- Archive/storage: Use BROTLI or ZSTD (excellent compression)
- Cloud storage: GZIP or SNAPPY (cloud providers don’t charge extra for fast decompression)
Q3: How does Parquet handle schema evolution?
Parquet supports schema evolution through explicit schema merging. When reading files with different schemas, you can use PyArrow’s safe_cast option to handle type changes gracefully. For production systems, always maintain explicit versioning of your schemas.
Q4: Can I use Parquet with streaming data?
Parquet is row-group based and requires completing a row group before writing. For streaming scenarios, consider buffering data in memory and periodically flushing to Parquet files. Alternatively, use streaming formats like Avro for real-time systems, then convert to Parquet for analytics.
Q5: What’s the maximum file size for Parquet?
Parquet files are theoretically unlimited but practically, keeping individual files under 1-2 GB and distributing data across partitions is recommended for performance. Most cloud data warehouses work best with files in the 100 MB – 1 GB range.
Q6: How do I handle nested data types in Parquet?
Parquet natively supports nested structures (structs, lists, maps). PyArrow represents these as complex types. When reading, they convert to Python objects; when writing from Pandas, you can use dictionary columns or PyArrow’s explicit typing for complex structures.
Conclusion
Parquet has established itself as the de facto standard for columnar data storage in modern data pipelines. Its combination of efficient compression, strong type safety, schema support, and integration with big data frameworks makes it indispensable for anyone working with large datasets.
In this tutorial, you learned how to:
- Read and write Parquet files using both Pandas and PyArrow
- Leverage compression to reduce storage costs
- Optimize queries by reading only needed columns
- Use row filtering for efficient data access
- Inspect schemas and metadata
- Organize data into partitioned datasets
- Build practical data conversion tools
Whether you’re migrating legacy CSV systems to modern data architecture or building cloud-native analytics pipelines, Parquet gives you the performance and efficiency your applications demand. Start with simple read/write operations, then progressively adopt compression and partitioning strategies as your data grows.
The investment in learning Parquet pays dividends — your queries will run faster, storage costs will shrink, and your data infrastructure becomes compatible with the entire ecosystem of modern data tools.