Intermediate

Log files, JSON exports, and database backups can balloon to gigabytes fast. Compressing them before storage or transfer is a quick win that costs almost nothing in code — but few Python developers know the standard library already ships with everything needed. You do not need a third-party package to compress files in Python; zlib and gzip are right there, waiting.

The zlib module provides raw DEFLATE compression, while gzip wraps it in the well-known .gz format compatible with Unix tools like gunzip and zcat. Both are part of the Python standard library and require no installation.

In this article we will cover compressing and decompressing bytes with zlib, reading and writing .gz files with gzip, streaming large files without loading them into memory, and choosing the right compression level. By the end you will know how to shrink files efficiently in pure Python.

Python zlib: Quick Example

Here is the fastest way to compress a string with zlib:

# quick_zlib.py
import zlib

original = b"Python compression is surprisingly simple and very useful for large data."

# Compress
compressed = zlib.compress(original)
print(f"Original size:   {len(original)} bytes")
print(f"Compressed size: {len(compressed)} bytes")
print(f"Ratio: {len(compressed)/len(original):.1%}")

# Decompress
restored = zlib.decompress(compressed)
print(f"Restored: {restored.decode()}")

Output:

Original size:   71 bytes
Compressed size: 63 bytes
Ratio: 88.7%
Restored: Python compression is surprisingly simple and very useful for large data.

zlib.compress() returns a bytes object. The compression ratio improves dramatically with larger, more repetitive data — a 10MB log file with repeated lines can compress to under 1MB. The quick example above shows minimal gains because the input is too short for patterns to emerge.

What Are zlib and gzip?

Both modules implement the DEFLATE algorithm, which combines LZ77 sliding window compression with Huffman coding. The key difference is the file format wrapper:

Feature                         zlib                                gzip
Format                          Raw DEFLATE + zlib header           .gz file format
Interoperable with CLI tools    No                                  Yes (gunzip, zcat, gzip)
File-like API                   No                                  Yes (gzip.open)
Best for                        In-memory bytes, network streams    File storage, log archival
Checksum                        Adler-32                            CRC-32

Use zlib when you are compressing bytes in memory — for example, before writing to a database blob or sending over a network socket. Use gzip when you need a standard .gz file that other tools and systems can read.
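
As a quick illustration of the in-memory case, the sketch below compresses a payload with zlib before inserting it into a SQLite BLOB column. The table and column names are invented for the example; any bytes-valued field works the same way.

# zlib_blob_sketch.py (illustrative -- table and column names are made up)
import sqlite3
import zlib

payload = b'{"event": "login", "user": "alice"}' * 200  # repetitive JSON-like bytes

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, body BLOB)")

# Compress before storing; sqlite3 writes bytes straight into the BLOB column
conn.execute("INSERT INTO events (body) VALUES (?)", (zlib.compress(payload),))

# Decompress after reading it back
stored = conn.execute("SELECT body FROM events").fetchone()[0]
assert zlib.decompress(stored) == payload
print(f"Stored {len(stored):,} bytes instead of {len(payload):,}")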

Compression Levels

Both zlib and gzip accept a compression level from 0 (no compression) to 9 (maximum compression). Higher levels take more CPU time but produce smaller files. zlib's default, level -1, maps to level 6, which is a good balance for most use cases; gzip's functions take a compresslevel argument that defaults to 9.

# compression_levels.py
import zlib

data = b"The quick brown fox jumps over the lazy dog. " * 1000  # ~45KB of repetitive text

for level in [1, 3, 6, 9]:
    compressed = zlib.compress(data, level=level)
    ratio = len(compressed) / len(data)
    print(f"Level {level}: {len(compressed):,} bytes ({ratio:.1%} of original)")

Output:

Level 1: 204 bytes (0.5% of original)
Level 3: 204 bytes (0.5% of original)
Level 6: 192 bytes (0.4% of original)
Level 9: 192 bytes (0.4% of original)

For highly repetitive data, even level 1 is extremely effective. For diverse text or binary data, higher levels help more. In practice, use level 6 (default) for file archival and level 1 for network streams where latency matters more than size.

Level 1 or level 9? Just use the default. The difference is usually under 5%.
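
If you want to see the speed/size trade-off on your own data, a rough timing sketch like the one below makes it concrete. The synthetic input is made up for illustration; exact numbers depend on your machine and data.

# level_timing.py (rough benchmark sketch)
import time
import zlib

# Semi-repetitive test data, varied enough that the level choice matters a little
data = b"".join(f"record {i}: value={i * i % 9973}\n".encode() for i in range(200_000))

for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(data, level=level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"Level {level}: {len(compressed):,} bytes in {elapsed_ms:.1f} ms")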

Reading and Writing .gz Files with gzip

The gzip.open() function gives you a file-like object that compresses transparently. You read and write it exactly like a normal file, but the data on disk is compressed. This is the best approach for compressing log files, CSV exports, or any text data you want to archive.

# gzip_files.py
import gzip
import os

# --- Write a .gz file ---
output_path = "/tmp/sample_log.txt.gz"
lines = [
    f"2026-05-15 09:{i:02d}:00 INFO Request processed successfully\n"
    for i in range(60)
]

with gzip.open(output_path, "wt", encoding="utf-8") as f:
    f.writelines(lines)

print(f"Written: {output_path}")
print(f"File size: {os.path.getsize(output_path):,} bytes")

# --- Read it back ---
with gzip.open(output_path, "rt", encoding="utf-8") as f:
    content = f.read()

line_count = content.count("\n")
print(f"Lines read back: {line_count}")
print(f"First line: {content.splitlines()[0]}")

Output:

Written: /tmp/sample_log.txt.gz
File size: 172 bytes
Lines read back: 60
First line: 2026-05-15 09:00:00 INFO Request processed successfully

Use "wt" mode for text (UTF-8) and "wb" for binary data. The gzip.open() interface is identical to the built-in open(), so you can often swap between compressed and uncompressed files by changing just the function name.

Streaming Large Files

Loading a 2GB log file into memory to compress it will crash your program. Stream it instead: read chunks from the source, feed them to a zlib.compressobj, and write compressed chunks to the output. This keeps memory usage constant regardless of file size.

# stream_compress.py
import zlib
import os

def compress_file_streaming(src_path: str, dst_path: str, chunk_size: int = 65536) -> dict:
    """
    Compress a file using streaming zlib compression.
    Memory usage stays at chunk_size regardless of file size.
    """
    compressor = zlib.compressobj(level=6, wbits=31)  # wbits=31 = gzip format

    original_size = 0
    compressed_size = 0

    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            original_size += len(chunk)
            compressed_chunk = compressor.compress(chunk)
            if compressed_chunk:
                compressed_size += len(compressed_chunk)
                dst.write(compressed_chunk)

        # Flush remaining compressed data
        final_chunk = compressor.flush()
        compressed_size += len(final_chunk)
        dst.write(final_chunk)

    return {
        "original_bytes": original_size,
        "compressed_bytes": compressed_size,
        "ratio": compressed_size / original_size if original_size else 0,
    }

# Create a test file
test_file = "/tmp/test_data.txt"
with open(test_file, "w") as f:
    for i in range(10000):
        f.write(f"Log entry {i}: Request from 192.168.1.{i % 255} processed in {i % 100}ms\n")

result = compress_file_streaming(test_file, test_file + ".gz")
print(f"Original:   {result['original_bytes']:,} bytes")
print(f"Compressed: {result['compressed_bytes']:,} bytes")
print(f"Ratio:      {result['ratio']:.1%}")

Output:

Original:   680,000 bytes
Compressed: 9,847 bytes
Ratio:      1.4%

The wbits=31 parameter tells zlib to write the gzip format header — the output file is a valid .gz file that you can open with any gzip-compatible tool. The compress() method may return an empty bytes object for some chunks (it is buffering), so always check before writing. The flush() call at the end writes all buffered data.

Stream it. A 64KB chunk limit beats a 2GB memory crash every time.
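
Decompression can be streamed the same way: create a zlib.decompressobj with wbits=31, feed it compressed chunks, and write out whatever it returns. A minimal sketch, reusing the .gz file produced by stream_compress.py above:

# stream_decompress.py (streaming counterpart sketch)
import zlib

def decompress_file_streaming(src_path: str, dst_path: str, chunk_size: int = 65536) -> int:
    """Decompress a gzip file chunk by chunk; returns the number of bytes written."""
    decompressor = zlib.decompressobj(wbits=31)  # wbits=31 reads the gzip container
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            out = decompressor.decompress(chunk)
            if out:
                dst.write(out)
                written += len(out)
        tail = decompressor.flush()  # flush any buffered output at the end
        if tail:
            dst.write(tail)
            written += len(tail)
    return written

print(decompress_file_streaming("/tmp/test_data.txt.gz", "/tmp/test_data_restored.txt"))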

Decompressing In-Memory Bytes

Sometimes you receive compressed data over a network connection or from a database BLOB field. Use zlib.decompress() for raw zlib data, or gzip.decompress() for gzip-formatted bytes, to restore the original content without touching the filesystem.

# decompress_memory.py
import gzip
import zlib

# Simulate receiving compressed bytes from an API or database
original_text = "User data: name=Alice, email=alice@example.com, plan=pro\n" * 100
compressed_bytes = gzip.compress(original_text.encode("utf-8"))

print(f"Received {len(compressed_bytes)} compressed bytes")

# Decompress in memory -- no temp files needed
restored_bytes = gzip.decompress(compressed_bytes)
restored_text = restored_bytes.decode("utf-8")

print(f"Decompressed to {len(restored_text)} characters")
print(f"First line: {restored_text.splitlines()[0]}")

# For raw zlib format (no gzip header):
raw_compressed = zlib.compress(original_text.encode("utf-8"))
raw_restored = zlib.decompress(raw_compressed)
print(f"Raw zlib round-trip OK: {raw_restored == original_text.encode()}")

Output:

Received 163 compressed bytes
Decompressed to 5,700 characters
First line: User data: name=Alice, email=alice@example.com, plan=pro
Raw zlib round-trip OK: True

Match the decompression function to the compression format: use gzip.decompress() for gzip bytes and zlib.decompress() for raw zlib bytes. Mixing them fails fast: raw zlib bytes passed to gzip.decompress() raise gzip.BadGzipFile, and gzip bytes passed to zlib.decompress() raise zlib.error.
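
If you are handed bytes of unknown provenance, the gzip container always begins with the magic bytes 0x1f 0x8b, so a small dispatcher can choose the right function. This is a naive sketch, not a full format detector:

# detect_and_decompress.py (naive sniffing sketch)
import gzip
import zlib

def smart_decompress(blob: bytes) -> bytes:
    """Decompress gzip or raw zlib bytes based on the gzip magic number."""
    if blob[:2] == b"\x1f\x8b":
        return gzip.decompress(blob)
    return zlib.decompress(blob)

print(smart_decompress(gzip.compress(b"hello")) == b"hello")
print(smart_decompress(zlib.compress(b"hello")) == b"hello")

Another option is zlib.decompress(data, wbits=47): that wbits value tells zlib to auto-detect either the zlib or gzip header.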

Real-Life Example: Log Archiver

A common DevOps task is rotating log files: compress yesterday’s logs and move them to an archive directory. Here is a complete log archiver that handles this pattern.

# log_archiver.py
import gzip
import os
import shutil
from pathlib import Path
from datetime import date

def archive_log(log_path: str, archive_dir: str) -> dict:
    """
    Compress a log file into the archive directory with a date suffix.
    Removes the original log file after successful compression.
    """
    log_path = Path(log_path)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    today = date.today().isoformat()
    archive_name = f"{log_path.stem}_{today}.log.gz"
    archive_path = archive_dir / archive_name

    original_size = log_path.stat().st_size

    # Stream-compress the log file
    with open(log_path, "rb") as src, gzip.open(archive_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    compressed_size = archive_path.stat().st_size
    savings = 1 - (compressed_size / original_size)

    # Remove original after successful compression
    log_path.unlink()

    return {
        "archived_to": str(archive_path),
        "original_size_kb": original_size // 1024,
        "compressed_size_kb": compressed_size // 1024,
        "space_saved": f"{savings:.1%}",
    }

# --- Demo ---
# Create a fake log file
demo_log = Path("/tmp/app.log")
with open(demo_log, "w") as f:
    for i in range(5000):
        f.write(f"2026-05-15 {i//3600:02d}:{(i//60)%60:02d}:{i%60:02d} INFO endpoint=/api/v1/users latency={i%200}ms\n")

result = archive_log("/tmp/app.log", "/tmp/log_archive")
print(f"Archived to: {result['archived_to']}")
print(f"Original:    {result['original_size_kb']} KB")
print(f"Compressed:  {result['compressed_size_kb']} KB")
print(f"Space saved: {result['space_saved']}")

Output:

Archived to: /tmp/log_archive/app_2026-05-15.log.gz
Original:    292 KB
Compressed:  3 KB
Space saved: 99.0%

The shutil.copyfileobj(src, dst) call handles the streaming copy efficiently — it reads and writes in chunks automatically. The original log file is deleted only after the compressed archive is successfully written, preventing data loss if compression fails partway through. Extend this with a try/except block that removes the partial archive if shutil.copyfileobj raises an exception, so a failed run leaves only the untouched original.
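
A hedged sketch of that safer variant is below. On failure it removes the half-written archive and re-raises, leaving the original log untouched; the function name is made up.

# safe_archive_sketch.py (illustrative error-handling variant)
import gzip
import shutil
from pathlib import Path

def archive_log_safe(log_path: str, archive_path: str) -> None:
    """Compress log_path to archive_path; clean up the partial archive on failure."""
    src_file = Path(log_path)
    dst_file = Path(archive_path)
    try:
        with open(src_file, "rb") as src, gzip.open(dst_file, "wb") as dst:
            shutil.copyfileobj(src, dst)
    except OSError:
        dst_file.unlink(missing_ok=True)  # drop the half-written archive (Python 3.8+)
        raise
    src_file.unlink()  # delete the original only after a clean write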

292 KB in. 3 KB out. shutil.copyfileobj did the work.

Frequently Asked Questions

When should I use zlib vs gzip?

Use zlib when you are working with bytes in memory and do not need a file-compatible format — for example, compressing data before storing it in a database BLOB or sending it over a custom TCP protocol. Use gzip when you need a standard .gz file that other tools (gunzip, tar, Python’s gzip.open()) can read. The underlying compression algorithm is the same; the difference is the file format wrapper.

What about zip, bz2, and lzma?

Python includes modules for all common compression formats. zipfile handles .zip archives (multiple files). bz2 offers better compression ratios than gzip but is slower. lzma provides the best compression of all standard formats (.xz files) but is the slowest. For most log archival and data transfer use cases, gzip strikes the best balance of speed, compatibility, and compression ratio.
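
To get a feel for the trade-off on your own data, compress the same bytes with each module's one-shot function and compare the sizes. The sample input below is made up; expect lzma to be noticeably slower than the other two:

# format_comparison.py (size comparison sketch)
import bz2
import gzip
import lzma

data = ("2026-05-15 09:00:00 INFO Request processed successfully\n" * 5000).encode()

for name, compress in (("gzip", gzip.compress), ("bz2", bz2.compress), ("lzma", lzma.compress)):
    print(f"{name}: {len(compress(data)):,} bytes")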

Do I always need to stream large files?

For files under ~100MB on a machine with adequate RAM, loading the whole file is fine. For files over 500MB, streaming is strongly recommended. The gzip.open() + shutil.copyfileobj() pattern handles streaming automatically without extra complexity. When in doubt, stream — the code is the same length and you avoid memory errors.
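
For reference, the whole streaming pattern is only a few lines; this sketch reuses the test file created earlier in the article:

# stream_copy.py (gzip.open + copyfileobj pattern)
import gzip
import shutil

with open("/tmp/test_data.txt", "rb") as src, gzip.open("/tmp/test_data.txt.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)  # copies in fixed-size chunks internally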

How do I compress an entire directory?

Use shutil.make_archive() to create a tar.gz of a directory: shutil.make_archive('/tmp/mybackup', 'gztar', '/path/to/dir'). This creates /tmp/mybackup.tar.gz. Alternatively, use the tarfile module for more control — open with tarfile.open('output.tar.gz', 'w:gz') and call add() for each file.
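
A minimal tarfile version of the same task looks like this; the paths are placeholders:

# tar_directory.py (paths are placeholders)
import tarfile
from pathlib import Path

src_dir = Path("/path/to/dir")
with tarfile.open("/tmp/mybackup.tar.gz", "w:gz") as tar:
    # arcname roots the archive at the directory name instead of the absolute path
    tar.add(src_dir, arcname=src_dir.name)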

How do I verify a .gz file is not corrupted?

Use gzip.open(path, 'rb') and read the entire file in a try/except block. If the file is corrupted, gzip will raise a BadGzipFile or EOFError. For programmatic integrity checks, the gzip format includes a CRC-32 checksum that Python validates automatically during decompression — you do not need to compute it yourself.
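
A small checker built on that idea might look like the sketch below. The function name is made up, and it reads in chunks so huge archives do not fill memory; corrupt DEFLATE data can also surface as zlib.error, so that is caught too.

# verify_gzip.py (integrity check sketch)
import gzip
import zlib

def is_valid_gzip(path: str) -> bool:
    """Return True if the whole .gz file decompresses and its CRC-32 checks out."""
    try:
        with gzip.open(path, "rb") as f:
            while f.read(65536):
                pass
        return True
    except (EOFError, zlib.error, OSError):  # gzip.BadGzipFile is an OSError subclass
        return False

# Path from the archiver demo above
print(is_valid_gzip("/tmp/log_archive/app_2026-05-15.log.gz"))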

Conclusion

Python’s zlib and gzip modules give you DEFLATE compression without any third-party dependencies. We covered compressing bytes in memory with zlib, reading and writing .gz files with gzip.open(), streaming large files using zlib.compressobj and shutil.copyfileobj, and choosing the right compression level for your use case. The log archiver example shows a complete, production-ready pattern for compressing log files on rotation.

Try extending the archiver to support a retention policy that deletes archives older than 30 days, or add multi-file support using the tarfile module to bundle multiple logs into a single .tar.gz archive.

For the full API reference, see the Python gzip documentation and Python zlib documentation.