
Parquet has become one of the most popular columnar data formats in modern data engineering, and for good reason. If you’re working with large datasets, data pipelines, or analytics platforms like Apache Spark, Amazon Redshift, or Google BigQuery, you’ll almost certainly encounter Parquet files. Unlike row-based formats like CSV, Parquet stores data in columns, enabling efficient compression, faster queries, and reduced storage costs.

In this tutorial, you’ll learn how to read and write Parquet files in Python using PyArrow and Pandas. We’ll cover everything from basic file I/O operations to advanced topics like schema inspection, compression options, and partitioned datasets. Whether you’re migrating from CSV to Parquet or building a data pipeline that processes terabytes of columnar data, this guide will equip you with practical, production-ready techniques.

By the end of this article, you’ll understand why Parquet is the format of choice for data-intensive applications, how to optimize your file writes with compression, and how to leverage partitioning for better query performance. Let’s dive in!

Quick Example: Write and Read a Parquet File in 6 Lines

Before we explore the details, here’s the fastest way to get started with Parquet files in Python:

# quick_parquet_example.py
import pandas as pd

# Create and write
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [95, 87]})
df.to_parquet('data.parquet')

# Read back
df_read = pd.read_parquet('data.parquet')
print(df_read)

Output:

    name  score
0  Alice     95
1    Bob     87

That’s it! Pandas makes reading and writing Parquet files as simple as CSV operations. However, there’s much more you can do with Parquet, and understanding its strengths will help you make better decisions for your data architecture.

What Is Parquet and Why Use It?

Apache Parquet is a columnar storage format designed for distributed data processing. Instead of storing data row-by-row like CSV or JSON, Parquet organizes data by column. This architectural difference has profound implications for performance and storage efficiency.

Here’s how Parquet compares to other popular formats:

Characteristic              CSV                    Parquet                        JSON
Storage Model               Row-based              Columnar                       Row-based
Compression                 External (gzip, etc.)  Built-in (SNAPPY, GZIP)        External
Data Types                  All strings            Strongly typed                 Native types
File Size                   Large (uncompressed)   Very small (compressed)       Medium to large
Query Speed                 Slow (full scan)       Very fast (column projection)  Slow (parsing)
Nested Structure Support    None                   Yes                            Yes
Schema Enforcement          None                   Yes                            Optional

Parquet excels when you need to:

  • Analyze specific columns: Read only the columns you need, not the entire dataset
  • Minimize storage: Parquet files are typically 75-90% smaller than the equivalent CSV
  • Process large datasets: Integrate seamlessly with Spark, Hadoop, and cloud data warehouses
  • Preserve data types: Maintain integers, floats, timestamps, and complex types without conversion
  • Enable predicate pushdown: Filter rows at the storage layer for dramatic performance gains

Installing PyArrow

To work with Parquet files in Python, you’ll need PyArrow, the official Python library for Apache Arrow, which ships with first-class Parquet support. While Pandas can read and write Parquet using PyArrow as a backend, we’ll install both for maximum flexibility:

# install_parquet_dependencies.sh
pip install pyarrow pandas

Output:

Successfully installed pyarrow-16.0.0 pandas-2.2.0

PyArrow is the engine that powers Parquet I/O in Pandas. If neither PyArrow nor fastparquet is installed, Pandas raises an ImportError when you call read_parquet or to_parquet. Any reasonably recent PyArrow release works well with modern Parquet files; keeping it up to date is the safest choice.

Writing Parquet Files

There are multiple ways to write Parquet files in Python, each suited to different scenarios. Let’s explore the most common approaches:

Writing from a Pandas DataFrame

The simplest approach is using Pandas to write a DataFrame directly to Parquet:

# write_pandas_parquet.py
import pandas as pd
from datetime import datetime, timedelta

# Create sample dataset
data = {
    'user_id': [1001, 1002, 1003, 1004, 1005],
    'username': ['alice_wonder', 'bob_smith', 'charlie_brown', 'diana_prince', 'eve_johnson'],
    'signup_date': [
        datetime(2023, 1, 15),
        datetime(2023, 2, 20),
        datetime(2023, 3, 10),
        datetime(2023, 4, 5),
        datetime(2023, 5, 12)
    ],
    'login_count': [142, 87, 256, 103, 198],
    'is_active': [True, True, False, True, True]
}

df = pd.DataFrame(data)

# Write to Parquet with default settings
df.to_parquet('users.parquet')

print("File written successfully!")
print(f"DataFrame shape: {df.shape}")
print(f"Column types:\n{df.dtypes}")

Output:

File written successfully!
DataFrame shape: (5, 5)
Column types:
user_id            int64
username          object
signup_date    datetime64[ns]
login_count        int64
is_active           bool
dtype: object

Writing with Compression Options

Parquet supports multiple compression codecs. You can dramatically reduce file size by choosing the right compression algorithm:

# write_parquet_compression.py
import pandas as pd
import os

# Create larger dataset
data = {
    'event_id': range(1, 10001),
    'event_type': ['click', 'scroll', 'submit', 'hover'] * 2500,
    'user_id': list(range(100, 110)) * 1000,
    'timestamp': pd.date_range('2024-01-01', periods=10000, freq='1min'),
    'duration_ms': [10, 50, 100, 200] * 2500
}

df = pd.DataFrame(data)

# Write with different compression options
compression_options = ['snappy', 'gzip', 'brotli']

for compression in compression_options:
    filename = f'events_{compression}.parquet'
    try:
        df.to_parquet(filename, compression=compression)
        file_size = os.path.getsize(filename)
        print(f"{compression:10} - {file_size:,} bytes")
    except Exception as e:
        print(f"{compression:10} - Error: {e}")

Output:

snappy     - 89,234 bytes
gzip       - 45,821 bytes
brotli     - 38,456 bytes

Compression recommendations:

  • snappy: Fast compression/decompression, moderate compression ratio. Best for real-time data pipelines.
  • gzip: Better compression than snappy, slower I/O. Good balance for most use cases.
  • brotli: Excellent compression ratio, slower compression time. Ideal for storage-constrained scenarios.
  • uncompressed: Set compression=None for maximum read speed when storage isn’t a concern.

Writing from a PyArrow Table

For advanced use cases, you can work directly with PyArrow tables, which gives you finer control over schema and data types:

# write_pyarrow_parquet.py
import pyarrow as pa
import pyarrow.parquet as pq

# Define schema explicitly
schema = pa.schema([
    pa.field('product_id', pa.int32()),
    pa.field('name', pa.string()),
    pa.field('price', pa.float64()),
    pa.field('in_stock', pa.bool_()),
])

# Create table from arrays
table = pa.table({
    'product_id': [101, 102, 103, 104],
    'name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [999.99, 25.50, 75.00, 349.99],
    'in_stock': [True, True, False, True],
}, schema=schema)

# Write with schema
pq.write_table(table, 'products.parquet')

print("PyArrow table written to products.parquet")
print(f"Table schema:\n{table.schema}")
print(f"\nTable shape: {table.num_rows} rows, {table.num_columns} columns")

Output:

PyArrow table written to products.parquet
Table schema:
product_id: int32
name: string
price: double
in_stock: bool

Table shape: 4 rows, 4 columns
Columnar storage — because reading every row when you need one column is madness.

Reading Parquet Files

Reading Parquet files is equally straightforward. You can read entire files or use column selection and row filtering to optimize performance:

Reading an Entire Parquet File

The simplest approach uses Pandas:

# read_parquet_basic.py
import pandas as pd

# Read entire file
df = pd.read_parquet('users.parquet')

print("Data loaded successfully!")
print(df)
print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")

Output:

Data loaded successfully!
   user_id       username         signup_date  login_count  is_active
0     1001   alice_wonder 2023-01-15 00:00:00          142       True
1     1002      bob_smith 2023-02-20 00:00:00           87       True
2     1003  charlie_brown 2023-03-10 00:00:00          256      False
3     1004   diana_prince 2023-04-05 00:00:00          103       True
4     1005    eve_johnson 2023-05-12 00:00:00          198       True

Memory usage: 78.91 KB

Reading Specific Columns

One of Parquet’s superpowers is reading only the columns you need, which significantly speeds up data loading:

# read_parquet_columns.py
import pandas as pd
import pyarrow.parquet as pq

# Method 1: Pandas with column selection
df = pd.read_parquet('users.parquet', columns=['user_id', 'username'])
print("Method 1 - Pandas:")
print(df)

# Method 2: PyArrow for finer control
table = pq.read_table('users.parquet', columns=['user_id', 'login_count'])
print("\nMethod 2 - PyArrow:")
print(table.to_pandas())

Output:

Method 1 - Pandas:
   user_id        username
0     1001   alice_wonder
1     1002     bob_smith
2     1003  charlie_brown
3     1004    diana_prince
4     1005   eve_johnson

Method 2 - PyArrow:
   user_id  login_count
0     1001          142
1     1002           87
2     1003          256
3     1004          103
4     1005          198

Reading with Row Filters

Parquet supports predicate pushdown, allowing you to filter rows at the storage layer before loading data into memory:

# read_parquet_filtering.py
import pyarrow.parquet as pq

# Read with filters using PyArrow.
# Conditions in a single flat list are combined with AND.
table = pq.read_table('users.parquet',
    filters=[
        ('is_active', '==', True),
        ('login_count', '>', 100)
    ]
)

df_filtered = table.to_pandas()
print("Active users with more than 100 logins:")
print(df_filtered)
print(f"\nRows after filter: {len(df_filtered)}")

Output:

Active users with more than 100 logins:
   user_id      username         signup_date  login_count  is_active
0     1001  alice_wonder 2023-01-15 00:00:00          142       True
1     1004  diana_prince 2023-04-05 00:00:00          103       True
2     1005   eve_johnson 2023-05-12 00:00:00          198       True

Rows after filter: 3

Schema Inspection and Metadata

Understanding the schema of a Parquet file is crucial before processing. PyArrow makes schema inspection easy:

# inspect_parquet_schema.py
import pyarrow.parquet as pq

# Read parquet file metadata
parquet_file = pq.ParquetFile('users.parquet')

# Inspect schema (as an Arrow schema, for readable type names)
print("Schema:")
print(parquet_file.schema_arrow)

# Get column information
print("\n\nColumn Information:")
for i, field in enumerate(parquet_file.schema_arrow):
    print(f"  {i+1}. {field.name}: {field.type}")

# Read metadata
print(f"\n\nFile Metadata:")
print(f"  Number of rows: {parquet_file.metadata.num_rows}")
print(f"  Number of columns: {parquet_file.metadata.num_columns}")
print(f"  Number of row groups: {parquet_file.metadata.num_row_groups}")

# Get compression info
print(f"\n\nCompression Information:")
row_group = parquet_file.metadata.row_group(0)
for i in range(row_group.num_columns):
    col = row_group.column(i)
    print(f"  {parquet_file.schema_arrow.field(i).name}: {col.compression}")

Output:

Schema:
user_id: int64
username: string
signup_date: timestamp[ns]
login_count: int64
is_active: bool


Column Information:
  1. user_id: int64
  2. username: string
  3. signup_date: timestamp[ns]
  4. login_count: int64
  5. is_active: bool


File Metadata:
  Number of rows: 5
  Number of columns: 5
  Number of row groups: 1

Compression Information:
  user_id: SNAPPY
  username: SNAPPY
  signup_date: SNAPPY
  login_count: SNAPPY
  is_active: SNAPPY
Column selection — skip what you don’t need, load what you do.

Partitioned Datasets

When dealing with massive datasets, partitioning by date, region, or other dimensions is essential for performance. Parquet supports partitioned dataset structure, where data is organized into directories:

Writing Partitioned Parquet Files

PyArrow’s parquet module can automatically organize data into partitions:

# write_partitioned_parquet.py
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime, timedelta

# Create sample data with dates and regions
records = []
for day in range(5):
    for region in ['US', 'EU', 'APAC']:
        for i in range(10):
            records.append({
                'date': (datetime(2024, 1, 1) + timedelta(days=day)).date(),
                'region': region,
                'sales': 1000 + day * 100 + i * 50,
                'user_count': 100 + day * 10 + i * 5
            })

df = pd.DataFrame(records)

# Write as partitioned dataset
table = pa.Table.from_pandas(df)
pq.write_to_dataset(
    table,
    root_path='sales_data',
    partition_cols=['date', 'region'],
    compression='snappy'
)

print("Partitioned dataset written!")
print(f"Total records: {len(df)}")
print(f"Partition columns: date, region")

Output:

Partitioned dataset written!
Total records: 150
Partition columns: date, region

Reading Partitioned Parquet Datasets

Reading partitioned datasets is transparent to the user:

# read_partitioned_parquet.py
import pyarrow.parquet as pq
import pandas as pd

# Read entire partitioned dataset
table = pq.read_table('sales_data')
df_all = table.to_pandas()

print(f"Total records read: {len(df_all)}")
print(f"\nFirst few records:")
print(df_all.head())

# Read specific partition
table_us = pq.read_table('sales_data',
    filters=[('region', '==', 'US')]
)
df_us = table_us.to_pandas()

print(f"\n\nUS region records: {len(df_us)}")
print(df_us.head())

Output:

Total records read: 150
First few records:
        date region  sales  user_count
0 2024-01-01     US   1000         100
1 2024-01-01     US   1050         105
2 2024-01-01     US   1100         110
3 2024-01-01     US   1150         115
4 2024-01-01     US   1200         120


US region records: 50
      date region  sales  user_count
0 2024-01-01     US   1000         100
1 2024-01-01     US   1050         105
2 2024-01-01     US   1100         110
3 2024-01-01     US   1150         115
4 2024-01-01     US   1200         120
Partitioned datasets — organize once, query fast forever.

Real-Life Example: Log File Converter

Let’s build a practical example that converts CSV log files to partitioned Parquet format with compression statistics. This is a common task in data engineering:

# log_file_converter.py
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime
import os

def convert_csv_logs_to_parquet(csv_file, output_dir, partition_cols=None):
    """
    Convert CSV logs to partitioned Parquet with compression statistics.
    """
    # Avoid a mutable default argument
    if partition_cols is None:
        partition_cols = ['date', 'log_level']

    # Read CSV
    print(f"Reading {csv_file}...")
    df = pd.read_csv(csv_file)

    # Ensure date column is datetime
    if 'timestamp' in df.columns:
        df['timestamp'] = pd.to_datetime(df['timestamp'])
        df['date'] = df['timestamp'].dt.date

    # Convert to PyArrow table
    table = pa.Table.from_pandas(df)

    # Get original CSV size
    csv_size = os.path.getsize(csv_file)

    # Write partitioned parquet
    print(f"Writing to {output_dir}...")
    pq.write_to_dataset(
        table,
        root_path=output_dir,
        partition_cols=partition_cols,
        compression='gzip'
    )

    # Calculate compression statistics
    total_parquet_size = 0
    for root, dirs, files in os.walk(output_dir):
        for file in files:
            if file.endswith('.parquet'):
                total_parquet_size += os.path.getsize(os.path.join(root, file))

    compression_ratio = csv_size / total_parquet_size if total_parquet_size > 0 else 0

    print(f"\nConversion Complete!")
    print(f"  Original CSV size: {csv_size:,} bytes")
    print(f"  Parquet total size: {total_parquet_size:,} bytes")
    print(f"  Compression ratio: {compression_ratio:.2f}x")
    print(f"  Space saved: {100 * (1 - total_parquet_size/csv_size):.1f}%")

    return {
        'csv_size': csv_size,
        'parquet_size': total_parquet_size,
        'compression_ratio': compression_ratio,
        'rows': len(df)
    }


# Example usage: Create sample log data and convert
if __name__ == '__main__':
    # Create sample log CSV
    log_data = pd.DataFrame({
        'timestamp': pd.date_range('2024-01-01', periods=1000, freq='1min'),
        'log_level': ['DEBUG', 'INFO', 'WARNING', 'ERROR'] * 250,
        'service': ['api', 'worker', 'db', 'cache'] * 250,
        'message': [f'Process event {i}' for i in range(1000)],
        'duration_ms': [10 + i % 100 for i in range(1000)]
    })

    log_data.to_csv('application.log.csv', index=False)

    # Convert to Parquet
    stats = convert_csv_logs_to_parquet(
        'application.log.csv',
        'logs_parquet',
        partition_cols=['log_level']
    )

Output:

Reading application.log.csv...
Writing to logs_parquet...

Conversion Complete!
  Original CSV size: 156,234 bytes
  Parquet total size: 31,456 bytes
  Compression ratio: 4.97x
  Space saved: 79.9%

This example demonstrates real-world value: a 5x compression ratio, which translates to massive storage savings when dealing with millions of logs. Combined with partitioning by log_level, analytics queries become much faster because the database engine can skip entire directories of unneeded data.

Same data, fraction of the size. Parquet compression is no joke.

Frequently Asked Questions

Q1: Can I append data to an existing Parquet file?

Direct appending is not supported by Parquet’s design (it’s immutable). Instead, use one of these approaches:

  • Write new files to a partitioned dataset directory and query them together
  • Read the existing file, merge with new data, and overwrite the file
  • Use a data lake table format like Delta Lake or Apache Iceberg, which layers transaction support (including appends) over Parquet

Q2: What compression codec should I choose?

It depends on your use case:

  • Real-time systems: Use SNAPPY (fast) or no compression
  • Balanced scenarios: Use GZIP (good compression, reasonable speed)
  • Archive/storage: Use BROTLI or ZSTD (excellent compression)
  • Cloud storage: GZIP or SNAPPY — both are universally supported by cloud warehouses, and smaller objects reduce storage and transfer costs

Q3: How does Parquet handle schema evolution?

Each Parquet file stores a single, fixed schema; evolution is handled at read time by merging schemas across files. With PyArrow, you can unify the schemas of old and new files and supply the merged schema when reading, so columns missing from older files come back as nulls. For production systems, always maintain explicit versioning of your schemas.

Q4: Can I use Parquet with streaming data?

Parquet is row-group based and requires completing a row group before writing. For streaming scenarios, consider buffering data in memory and periodically flushing to Parquet files. Alternatively, use streaming formats like Avro for real-time systems, then convert to Parquet for analytics.

Q5: What’s the maximum file size for Parquet?

Parquet imposes no hard file-size limit, but in practice keeping individual files under 1-2 GB and distributing data across partitions is recommended for performance. Most cloud data warehouses work best with files in the 100 MB – 1 GB range.

Q6: How do I handle nested data types in Parquet?

Parquet natively supports nested structures (structs, lists, maps). PyArrow represents these as complex types. When reading, they convert to Python objects; when writing from Pandas, you can use dictionary columns or PyArrow’s explicit typing for complex structures.

Conclusion

Parquet has established itself as the de facto standard for columnar data storage in modern data pipelines. Its combination of efficient compression, strong type safety, schema support, and integration with big data frameworks makes it indispensable for anyone working with large datasets.

In this tutorial, you learned how to:

  • Read and write Parquet files using both Pandas and PyArrow
  • Leverage compression to reduce storage costs
  • Optimize queries by reading only needed columns
  • Use row filtering for efficient data access
  • Inspect schemas and metadata
  • Organize data into partitioned datasets
  • Build practical data conversion tools

Whether you’re migrating legacy CSV systems to modern data architecture or building cloud-native analytics pipelines, Parquet gives you the performance and efficiency your applications demand. Start with simple read/write operations, then progressively adopt compression and partitioning strategies as your data grows.

The investment in learning Parquet pays dividends — your queries will run faster, storage costs will shrink, and your data infrastructure becomes compatible with the entire ecosystem of modern data tools.