Intermediate

You have trained a machine learning model that took three hours to fit, and now you need to save it so you can load it tomorrow without retraining. Or you have a complex nested data structure — dictionaries containing lists of custom objects — and you need to persist it between script runs. JSON cannot handle custom Python objects, and writing a custom serializer for every class is tedious. This is where pickle comes in.

Python’s pickle module serializes Python objects into a byte stream and deserializes them back. Its companion module shelve builds on pickle to give you a persistent dictionary backed by a file. Both are in the standard library — no installation needed.

In this article, we will start with a quick example of saving and loading objects, then explain how pickle works and its security implications. We will cover pickling custom classes, using shelve as a persistent key-value store, and best practices for safe serialization. We will finish with a real-life project that caches expensive API results.

Python Pickle Quick Example

# quick_pickle.py
import pickle

# Save a complex data structure
data = {
    "users": ["Alice", "Bob", "Charlie"],
    "scores": {"Alice": [95, 87, 92], "Bob": [78, 85], "Charlie": [90]},
    "metadata": {"version": 2, "created": "2026-04-12"}
}

# Serialize to file
with open("data.pkl", "wb") as f:
    pickle.dump(data, f)
print("Saved to data.pkl")

# Load it back
with open("data.pkl", "rb") as f:
    loaded = pickle.load(f)

print(f"Users: {loaded['users']}")
print(f"Alice scores: {loaded['scores']['Alice']}")
print(f"Same data: {data == loaded}")

Output:

Saved to data.pkl
Users: ['Alice', 'Bob', 'Charlie']
Alice scores: [95, 87, 92]
Same data: True

Two function calls: pickle.dump() to save and pickle.load() to restore. The entire nested structure — dict, lists, strings, integers — survives the round trip perfectly. Let us explore how this works and when to use it.

What Is Pickle and When Should You Use It?

Pickle converts Python objects into a byte stream (serialization) and reconstructs them from that byte stream (deserialization). Unlike JSON, which only handles basic types (strings, numbers, lists, dicts), pickle can serialize almost any Python object: custom classes, functions, sets, datetime objects, and complex nested structures.
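To make the difference concrete, here is a small comparison (the record contents are invented for illustration): JSON rejects sets and datetime objects outright, while pickle round-trips them without any custom encoder.

```python
# json_vs_pickle.py
import json
import pickle
from datetime import datetime

record = {"tags": {"python", "serialization"}, "created": datetime(2026, 4, 12)}

# JSON cannot serialize sets or datetime objects
try:
    json.dumps(record)
except TypeError as e:
    print(f"JSON failed: {e}")

# Pickle round-trips both without extra code
restored = pickle.loads(pickle.dumps(record))
print(f"Tags: {sorted(restored['tags'])}")
print(f"Created: {restored['created']}")
```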

Feature               | pickle                     | JSON                  | shelve
----------------------|----------------------------|-----------------------|---------------------------
Python-specific types | Yes (classes, sets, etc.)  | No (basic types only) | Yes (uses pickle)
Human readable        | No (binary)                | Yes                   | No
Cross-language        | No (Python only)           | Yes                   | No
Security              | Unsafe with untrusted data | Safe                  | Unsafe with untrusted data
Key-value access      | No (full load)             | No (full load)        | Yes (dict-like)

Use pickle when: you need to save Python-specific objects (custom classes, ML models, complex structures) and the data stays within your own code. Use JSON when: you need human-readable output or cross-language compatibility. Never unpickle data from untrusted sources — pickle can execute arbitrary code during deserialization.
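To see why that warning matters, here is a deliberately harmless sketch of the attack. The class name Malicious and its payload are invented for illustration, and print stands in for what could be any callable, such as os.system.

```python
# pickle_danger.py
import pickle

class Malicious:
    # __reduce__ tells pickle how to rebuild the object.
    # An attacker can make it call any function with any arguments;
    # we use print here so the demo stays harmless.
    def __reduce__(self):
        return (print, ("Arbitrary call executed during load!",))

evil_bytes = pickle.dumps(Malicious())

# Just *loading* the bytes runs the call — no method is ever invoked explicitly
pickle.loads(evil_bytes)
```

This is why the rule is absolute: the damage happens during deserialization itself, before your code ever inspects the result.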

Pickling objects into jars
pickle.dump() — freeze any Python object. pickle.load() — thaw it back to life.

Pickling Custom Classes

One of pickle’s most powerful features is its ability to serialize custom objects. Your classes work with pickle automatically as long as their attributes are themselves picklable.

# custom_pickle.py
import pickle
from datetime import datetime

class Task:
    def __init__(self, title, priority="medium"):
        self.title = title
        self.priority = priority
        self.created = datetime.now()
        self.completed = False
    
    def complete(self):
        self.completed = True
    
    def __repr__(self):
        status = "done" if self.completed else "pending"
        return f"Task('{self.title}', {self.priority}, {status})"

# Create tasks
tasks = [
    Task("Write tests", "high"),
    Task("Update docs", "low"),
    Task("Fix bug #42", "high"),
]
tasks[0].complete()

# Save
with open("tasks.pkl", "wb") as f:
    pickle.dump(tasks, f)
print(f"Saved {len(tasks)} tasks")

# Load
with open("tasks.pkl", "rb") as f:
    loaded_tasks = pickle.load(f)

for task in loaded_tasks:
    print(f"  {task}")
print(f"\nFirst task created: {loaded_tasks[0].created}")

Output:

Saved 3 tasks
  Task('Write tests', high, done)
  Task('Update docs', low, pending)
  Task('Fix bug #42', high, pending)

First task created: 2026-04-12 09:30:15.123456

Notice that everything was preserved: the string attributes, the boolean completed state, and even the datetime object. Pickle captures the full state of each object, including the state changes we made by calling complete().
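Not every attribute is picklable, though: locks, open files, and network connections are not. One common pattern is to use the __getstate__ and __setstate__ hooks to drop such attributes before pickling and rebuild them on load — a sketch using a hypothetical Counter class:

```python
# unpicklable_attrs.py
import pickle
import threading

class Counter:
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()  # locks cannot be pickled

    def __getstate__(self):
        # Return a copy of the instance dict without the lock
        state = self.__dict__.copy()
        del state["lock"]
        return state

    def __setstate__(self, state):
        # Restore the saved attributes and recreate the lock
        self.__dict__.update(state)
        self.lock = threading.Lock()

c = Counter()
c.value = 7
restored = pickle.loads(pickle.dumps(c))
print(f"Value: {restored.value}")
print(f"Has lock: {hasattr(restored, 'lock')}")
```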

Pickle Protocols and Bytes

# pickle_protocols.py
import pickle

data = {"key": "value", "numbers": [1, 2, 3]}

# Serialize to bytes (not to a file)
pickled_bytes = pickle.dumps(data)
print(f"Pickled size: {len(pickled_bytes)} bytes")
print(f"Bytes preview: {pickled_bytes[:30]}...")

# Deserialize from bytes
restored = pickle.loads(pickled_bytes)
print(f"Restored: {restored}")

# Check available protocols
print(f"\nHighest protocol: {pickle.HIGHEST_PROTOCOL}")
print(f"Default protocol: {pickle.DEFAULT_PROTOCOL}")

# Use highest protocol for best performance
fast_bytes = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
print(f"Highest protocol size: {len(fast_bytes)} bytes")

Output:

Pickled size: 52 bytes
Bytes preview: b'\x80\x05\x95*\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x03key'...
Restored: {'key': 'value', 'numbers': [1, 2, 3]}

Highest protocol: 5
Default protocol: 5
Highest protocol size: 52 bytes

Use pickle.dumps() and pickle.loads() (note the ‘s’) for in-memory serialization to bytes — useful for sending objects over networks or storing in databases. Higher protocol numbers generally produce smaller output and deserialize faster.
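If you are curious what those bytes actually encode, the standard library's pickletools module can disassemble a pickle without loading it — a small sketch:

```python
# inspect_pickle.py
import pickle
import pickletools

data = {"key": "value"}
raw = pickle.dumps(data)

# Print the opcode stream of the pickle — useful for seeing
# what a pickle would do without executing it
pickletools.dis(raw)

# pickletools.optimize strips unused PUT opcodes, which can shrink a pickle
smaller = pickletools.optimize(raw)
print(f"Original: {len(raw)} bytes, optimized: {len(smaller)} bytes")
```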

Pickle protocols and bytes
pickle.dumps() for bytes, pickle.dump() for files. One letter changes everything.

Shelve: A Persistent Dictionary

shelve builds on pickle to give you a dictionary-like interface backed by a file. Instead of loading and saving entire data structures, you read and write individual keys — perfect for caching and simple data stores.

# shelve_basics.py
import shelve

# Open a shelf (creates the file if it does not exist)
with shelve.open("mydata") as db:
    db["users"] = ["Alice", "Bob", "Charlie"]
    db["config"] = {"theme": "dark", "lang": "en"}
    db["counter"] = 42
    print(f"Stored {len(db)} keys")

# Read back individual keys
with shelve.open("mydata") as db:
    print(f"Users: {db['users']}")
    print(f"Config: {db['config']}")
    print(f"Keys: {list(db.keys())}")
    
    # Check if key exists
    print(f"Has 'users': {'users' in db}")
    print(f"Has 'missing': {'missing' in db}")
    
    # Update a value
    db["counter"] = db["counter"] + 1
    print(f"Counter: {db['counter']}")

Output:

Stored 3 keys
Users: ['Alice', 'Bob', 'Charlie']
Config: {'theme': 'dark', 'lang': 'en'}
Keys: ['users', 'config', 'counter']
Has 'users': True
Has 'missing': False
Counter: 43

Shelve works exactly like a dictionary — you use square bracket syntax to get and set values, in to check membership, and keys() to list all keys. The difference is that the data persists on disk between runs. Always use the with statement to ensure the shelf is properly closed and synced.
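One subtlety worth knowing: mutating a stored object in place does not persist, because shelve only writes when a key is assigned. The sketch below shows the pitfall and two fixes (read-modify-reassign, or opening with writeback=True).

```python
# shelve_writeback.py
import shelve

with shelve.open("demo_shelf") as db:
    db["nums"] = [1, 2, 3]
    # This append is silently lost: db["nums"] returns a fresh copy,
    # and shelve never sees the mutation
    db["nums"].append(4)

with shelve.open("demo_shelf") as db:
    after_inplace = db["nums"]
    print(f"After in-place append: {after_inplace}")  # still [1, 2, 3]

# Fix 1: read, modify, reassign
with shelve.open("demo_shelf") as db:
    nums = db["nums"]
    nums.append(4)
    db["nums"] = nums

# Fix 2: writeback=True caches accessed entries in memory
# and writes them back when the shelf is synced or closed
with shelve.open("demo_shelf", writeback=True) as db:
    db["nums"].append(5)

with shelve.open("demo_shelf") as db:
    final_nums = db["nums"]
    print(f"After proper updates: {final_nums}")
```

writeback=True is convenient but costs memory and slower closes, since every accessed entry is cached and rewritten; reassignment is the cheaper habit.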

Real-Life Example: API Response Cache

API response cache with shelve
Cache hit: 0ms. Cache miss: 2,000ms. shelve makes the difference.

Let us build an API response cache that stores results on disk to avoid making the same request twice. This is useful for development, testing, or any scenario where API calls are slow or rate-limited.

# api_cache.py
import shelve
import time
import hashlib
import json

class APICache:
    def __init__(self, cache_file="api_cache", ttl=3600):
        self.cache_file = cache_file
        self.ttl = ttl  # Time to live in seconds
    
    def _make_key(self, url, params=None):
        key_data = url + json.dumps(params or {}, sort_keys=True)
        return hashlib.md5(key_data.encode()).hexdigest()
    
    def get(self, url, params=None):
        key = self._make_key(url, params)
        with shelve.open(self.cache_file) as db:
            if key in db:
                entry = db[key]
                age = time.time() - entry["timestamp"]
                if age < self.ttl:
                    print(f"  Cache HIT for {url} (age: {age:.0f}s)")
                    return entry["data"]
                else:
                    print(f"  Cache EXPIRED for {url} (age: {age:.0f}s)")
        return None
    
    def set(self, url, data, params=None):
        key = self._make_key(url, params)
        with shelve.open(self.cache_file) as db:
            db[key] = {"data": data, "timestamp": time.time(), "url": url}
        print(f"  Cached response for {url}")
    
    def stats(self):
        with shelve.open(self.cache_file) as db:
            total = len(db)
            fresh = sum(1 for k in db if time.time() - db[k]["timestamp"] < self.ttl)
            print(f"  Cache: {total} entries, {fresh} fresh, {total - fresh} expired")

def fetch_with_cache(cache, url, params=None):
    cached = cache.get(url, params)
    if cached is not None:
        return cached
    
    # Simulate API call
    print(f"  Fetching {url}...")
    time.sleep(0.1)  # Simulate network delay
    data = {"url": url, "status": "ok", "results": [1, 2, 3]}
    
    cache.set(url, data, params)
    return data

# Demo
cache = APICache(ttl=60)

print("First requests (cache miss):")
fetch_with_cache(cache, "https://api.example.com/users")
fetch_with_cache(cache, "https://api.example.com/products")

print("\nSecond requests (cache hit):")
fetch_with_cache(cache, "https://api.example.com/users")
fetch_with_cache(cache, "https://api.example.com/products")

print("\nCache statistics:")
cache.stats()

Output:

First requests (cache miss):
  Fetching https://api.example.com/users...
  Cached response for https://api.example.com/users
  Fetching https://api.example.com/products...
  Cached response for https://api.example.com/products

Second requests (cache hit):
  Cache HIT for https://api.example.com/users (age: 0s)
  Cache HIT for https://api.example.com/products (age: 0s)

Cache statistics:
  Cache: 2 entries, 2 fresh, 0 expired

This cache uses shelve for persistent storage, MD5 hashes for unique cache keys, and TTL-based expiration. You could extend it with LRU eviction, cache size limits, or async support for real production use.
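As one example of such an extension, a standalone prune helper that deletes expired entries might look like this (the function name and file layout are hypothetical, matching the entry format used above):

```python
# cache_prune.py
import shelve
import time

def prune(cache_file, ttl):
    """Delete cache entries older than ttl seconds (hypothetical helper)."""
    with shelve.open(cache_file) as db:
        # Collect keys first — deleting while iterating a shelf is unsafe
        expired = [k for k in db if time.time() - db[k]["timestamp"] >= ttl]
        for key in expired:
            del db[key]
    return len(expired)

# Demo: one fresh entry, one already expired
with shelve.open("prune_demo") as db:
    db["fresh"] = {"data": 1, "timestamp": time.time()}
    db["stale"] = {"data": 2, "timestamp": time.time() - 9999}

removed_count = prune("prune_demo", ttl=3600)
with shelve.open("prune_demo") as db:
    remaining = list(db.keys())

print(f"Removed: {removed_count}")
print(f"Remaining: {remaining}")
```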

Pickle security warning
Never unpickle data you did not create. Deserialization runs arbitrary code.

Frequently Asked Questions

Is pickle safe to use?

Pickle is safe when you only load data you created yourself. Never unpickle data from untrusted sources: a malicious pickle file can execute arbitrary code on your machine during deserialization. For data exchange with external systems, use JSON, MessagePack, or Protocol Buffers instead.

What types can pickle handle?

Pickle handles most Python types: basic types (int, float, str, bytes, None), containers (list, dict, set, tuple), custom classes, functions defined at module level, and classes defined at module level. It cannot pickle lambda functions, generators, open file handles, database connections, or thread locks.
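You can verify the boundary yourself — a lambda fails to pickle, while an equivalent module-level function pickles fine (by reference to its name, not by its code):

```python
# unpicklable.py
import pickle

square = lambda x: x * x

try:
    pickle.dumps(square)
    lambda_pickled = True
except Exception as e:
    lambda_pickled = False
    print(f"Cannot pickle lambda: {type(e).__name__}")

def square_fn(x):
    return x * x

# Module-level functions are pickled as a reference and resolved on load
data = pickle.dumps(square_fn)
restored = pickle.loads(data)
print(f"Works after load: {restored(6)}")
```

Note the consequence of pickling by reference: the function must be importable under the same name wherever you unpickle it.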

When should I use shelve instead of pickle?

Use shelve when you need to read and write individual keys without loading the entire dataset into memory. Pickle loads everything at once. Shelve is better for caches, configuration stores, and any scenario where you access data by key rather than as a whole.

How do I handle class changes after pickling?

If you add attributes to a class after pickling objects, implement __getstate__ and __setstate__ to handle migration, or use __reduce__ for full control over how objects are reconstructed. For simple cases, a fallback at read time works: getattr(self, "new_attr", default_value).
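A minimal migration sketch, using a hypothetical Profile class whose theme attribute was added after some objects were already pickled:

```python
# schema_migration.py
import pickle

class Profile:
    def __init__(self, name):
        self.name = name
        self.theme = "light"  # attribute added in a later version

    def __setstate__(self, state):
        # Restore whatever was pickled, then backfill missing attributes
        self.__dict__.update(state)
        if "theme" not in self.__dict__:
            self.theme = "light"

# Simulate an old pickle created before 'theme' existed
old = Profile("Alice")
del old.theme
old_bytes = pickle.dumps(old)

restored = pickle.loads(old_bytes)
print(f"Name: {restored.name}, theme: {restored.theme}")
```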

Is there a size limit for pickle files?

There is no hard limit, but very large pickle files (multi-GB) can cause memory issues since the entire object must fit in RAM during deserialization. For large datasets, consider using pickle with chunked processing, or switch to formats like Parquet or HDF5 that support lazy loading.
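Chunked processing is straightforward because multiple pickles can be concatenated in one file: each pickle.dump() appends a complete record, and repeated pickle.load() calls read them back one at a time until EOFError. A sketch:

```python
# chunked_pickle.py
import pickle

# Write records one at a time — each dump() appends a complete pickle
records = [{"id": i, "payload": list(range(5))} for i in range(3)]
with open("stream.pkl", "wb") as f:
    for record in records:
        pickle.dump(record, f)

# Read them back one at a time until the file is exhausted
loaded = []
with open("stream.pkl", "rb") as f:
    while True:
        try:
            loaded.append(pickle.load(f))
        except EOFError:
            break

print(f"Loaded {len(loaded)} records")
print(f"First: {loaded[0]}")
```

This keeps only one record in memory at a time on both the writing and reading side.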

Conclusion

We covered Python's serialization tools: pickle for saving and loading any Python object, pickle protocols for optimization, shelve for persistent key-value storage, and critical security considerations. The API cache project demonstrated how shelve powers practical caching solutions.

Try building a session manager or a simple document store with shelve. For the complete reference, see the official pickle documentation and shelve documentation.