Beginner
For most serious applications, you will need some form of persistent storage (storage that still exists after your application stops running). For new developers, it can be quite daunting to decide which option to go for. Is a simple flat file enough? When should you use something like a database? Which database should you use? With so many options available, it can be hard to know which way to go.
This is a starting guide that gives an overview of some of the many data storage options available to you and how you can go about deciding between them. One thing to keep in mind is that if your application is planned to scale (or may scale) over time, your underlying data will likely grow with it. It may be quick and easy to implement a file as storage, but as your data grows a relational database might serve you better, even though it takes a little more effort to set up. Let's look at this a bit deeper.

What are the possible ways to store data?
There are many methods of persistent storage that you can use (persistent storage means that your data is not lost after your program finishes running). The simplest approaches are to save the data to a file yourself or to use Python's pickle mechanism, but there are many more. Firstly, I will explain what some of the persistent storage options are:
- File: you store the data in a text-based file in a format such as CSV (comma-separated values), JSON, or similar
- Python pickle: a mechanism that lets you save a Python data structure directly to a file and read it back the next time your program runs. You do this with the standard library module called pickle
- Config files: similar to files and pickles in that the data is stored in a file, but the file is intended to be edited directly by a user
- SQLite database: a database you can run SQL queries against to search for data, but where the data is stored in a single file rather than managed by a separate server
- PostgreSQL (or another SQL-based database): a database service where a separate server program manages the data, and you run SQL queries against it to retrieve data efficiently. SQL-based databases are great for structured, table-like (spreadsheet-like) data; for example, you would search for records by the values of particular columns
- Key-value database (e.g. Redis, one of the most famous): exactly what the name says – a database where you look up a key and it returns a value. The value can be a single item or a set of fields associated with that key. A common use of a key-value database is hash-based data: you have a specific key you want to look up and you get back all the fields associated with it – much like a dictionary in Python, with the benefit that it lives in persistent storage
- Graph database (e.g. Neo4j): a database built for navigating relationships between records. This is rather cumbersome to do in a relational database, where you need many intermediary tables, but becomes straightforward with a graph query language such as Cypher
- Text search database (e.g. Elasticsearch): a purpose-built database for text search that is extremely fast when searching through strings or long text
- Time series database (e.g. InfluxDB): ideal for IoT-style data where each record is stored against a timestamp and you need to query in time blocks. Common operations such as aggregating, searching, and slicing data are available through dedicated query operations
- NoSQL document database (e.g. MongoDB, CouchDB): a database that also runs as a separate service but is designed for "unstructured" (non-table-like) data such as text and images, where you search for records in a free-form way, for example by text strings.
There is no one persistent storage mechanism that fits all. It really depends on your purpose (or "use case") which one works best for you, as there are pros and cons for each. The table below compares the options, and a short code sketch of the simplest ones follows after it.
| Storage option | Setup | Editable outside Python | Volume | Read Speed | Write Speed | Inbuilt Redundancy |
|---|---|---|---|---|---|---|
| File | None – you can create a file in your Python code | Yes – for text-based formats | Small | Slow | Slow | No – manual |
| Python Pickle | None – you can create this in your Python code | No – only in Python | Small | Slow | Slow | No – manual |
| Config File | Optional – you can create a config file beforehand | Yes – you can use any text editor | Small | Slow | Slow | No – manual |
| SQLite Database | None – database created automatically | Not directly – requires an SQLite client tool | Small–Med | Slow–Med | Slow–Med | No – manual |
| Relational SQL Database | Separate installation of server | Yes – through the SQL console or other SQL clients | Large | Fast | Fast | Yes – requires extra setup |
| NoSQL Column Database | Separate installation of server | Yes – through external client | Very large | Very fast | Very fast | Yes – inbuilt |
| Key-Value Database | Separate installation of server | Yes – through external client | Very large | Very fast | Fast–Very fast | Yes – requires extra setup |
| Graph Database | Separate installation of server | Yes – through external client | Large | Med | Med | Yes – requires extra setup |
| Time Series Database | Separate installation of server | Yes – through external client | Very large | Very fast | Fast | Yes – requires extra setup |
| Text Search Database | Separate installation of server | Yes – through external client | Very large | Very fast | Fast | Yes – requires extra setup |
| NoSQL Document DB | Separate installation of server | Yes – through external client | Very large | Very fast | Fast | Yes – requires extra setup |

A big disclaimer here: for some of these ratings, the more accurate answer is "it depends". For example, some relational databases have redundancy built in (such as Oracle RAC enterprise databases), while for others you can set up redundancy yourself as part of your infrastructure. To keep the guidance simple, however, I've made the table more prescriptive. If you would like to dive deeper, please don't rely purely on the table above! Look into the documentation of the particular database product you are considering, or reach out to me and I'm happy to provide some advice.
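To make the simplest of these options concrete, here is a minimal sketch using only the Python standard library. It writes and reads the same record three ways: as a JSON file, as a pickle file, and as an SQLite database file. The file names are arbitrary examples.
# storage_options_sketch.py
import json
import pickle
import sqlite3

record = {"name": "Alice", "score": 42}

# 1. Plain JSON file – human readable and editable outside Python
with open("data.json", "w") as f:
    json.dump(record, f)
with open("data.json") as f:
    print(json.load(f))

# 2. Pickle – stores the Python object directly, but only Python can read it back
with open("data.pkl", "wb") as f:
    pickle.dump(record, f)
with open("data.pkl", "rb") as f:
    print(pickle.load(f))

# 3. SQLite – a real database stored in a single file, queryable with SQL
conn = sqlite3.connect("data.db")
conn.execute("CREATE TABLE IF NOT EXISTS scores (name TEXT, score INTEGER)")
conn.execute("INSERT INTO scores VALUES (?, ?)", (record["name"], record["score"]))
conn.commit()
print(conn.execute("SELECT * FROM scores WHERE name = ?", ("Alice",)).fetchall())
conn.close()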
Summary
As a final note, there are plenty of SaaS-based options for databases and persistent storage popping up, which is exciting. These newer SaaS options (for example Firebase, restdb.io, anvil.works, etc.) are great in that they save you the heavy lifting, but there may still be times when you want to manage your own database. That may be because you want to keep your data yourself, or simply to save costs: if you already have an environment – either on your own laptop or on a virtual machine you pay a fixed price for – managing your own persistent storage may be more cost-effective than paying for another SaaS. However, certainly don't discount the SaaS options altogether, as they will at least take care of things like backups and security updates for you.
How To Use Python orjson for Fast JSON Processing
Intermediate
You have a Python service that parses JSON responses from an API thousands of times per second, and the standard json module is quietly becoming a bottleneck. At low traffic volumes this goes unnoticed, but once you scale up, milliseconds of serialization overhead compound into real latency. If you have ever profiled a Python web service and found json.dumps or json.loads sitting near the top of the flame graph, you already know this pain.
orjson is a fast, correct JSON library for Python written in Rust. It drops into nearly any codebase as a replacement for the standard json module and typically runs 2-10x faster on both serialization and deserialization. It also natively supports types the standard library forces you to handle manually — datetime, UUID, numpy arrays, and dataclasses.
In this article you will learn how to install orjson, serialize and deserialize JSON with it, use its built-in support for Python-native types, benchmark it against the standard library, and integrate it into a real-world FastAPI project. By the end you will have a working understanding of when and why to choose orjson over the alternatives.
orjson Quick Example
Before diving deep, here is a self-contained example that shows the core pattern. orjson is nearly a drop-in replacement for the standard json module, but its dumps() returns bytes rather than str (and loads() accepts both bytes and str).
# quick_example.py
import orjson
from datetime import datetime
data = {
    "name": "Alice",
    "score": 98.6,
    "logged_in": True,
    "joined": datetime(2024, 3, 15, 9, 30, 0),
    "tags": ["python", "backend", "fast"]
}
# Serialize to bytes (not str like the standard json module)
encoded = orjson.dumps(data)
print(encoded)
print(type(encoded))
# Deserialize back to a Python dict
decoded = orjson.loads(encoded)
print(decoded["joined"]) # datetime is serialized as ISO 8601 string
print(type(decoded))
Output:
b'{"name":"Alice","score":98.6,"logged_in":true,"joined":"2024-03-15T09:30:00","tags":["python","backend","fast"]}'
<class 'bytes'>
2024-03-15T09:30:00
<class 'dict'>
Two things stand out right away. First, orjson.dumps() returns bytes, not a string — this is intentional and saves an unnecessary encoding step when writing to network sockets or files. Second, the datetime object is automatically serialized to ISO 8601 format without any extra work, which the standard json module would refuse to handle at all.
What Is orjson and Why Use It?
orjson is a Python JSON library implemented in Rust using the Serde framework. It was created specifically to address the performance limitations of Python’s built-in json module, which is implemented in C but still shows its age when processing large payloads at high throughput.
The key differences between orjson and the standard library are:
| Feature | Standard json | orjson |
|---|---|---|
| Output type of dumps() | str | bytes |
| datetime support | Raises TypeError | Native ISO 8601 |
| UUID support | Raises TypeError | Native string |
| dataclass support | Raises TypeError | Native dict-like |
| numpy array support | Not supported | Native (via OPT_SERIALIZE_NUMPY) |
| Performance (typical) | Baseline | 2-10x faster |
| Strict UTF-8 validation | No | Yes |
The Rust implementation takes advantage of SIMD instructions and a highly optimized Serde-based serialization pipeline. For applications doing heavy JSON processing — API gateways, caching layers, log aggregators — the improvement is measurable and often significant.
Installing orjson
orjson is available on PyPI and installs with a single command:
# install_orjson.sh
pip install orjson
Output:
Collecting orjson
Downloading orjson-3.10.x-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (144 kB)
Successfully installed orjson-3.10.x
orjson ships as a pre-compiled binary for most platforms (Linux, macOS, Windows on x86-64 and ARM), so there is no Rust toolchain required. If you are on a less common platform you may need Rust installed to build from source. Verify the installation with a quick import check:
# verify_install.py
import orjson
print(orjson.__version__)
Output:
3.10.x
Serializing Python Objects with orjson.dumps()
The orjson.dumps() function converts Python objects to JSON bytes. The most important thing to remember is that it always returns bytes, not str. If you need a string, call .decode() on the result.
# serialization_basics.py
import orjson
from datetime import datetime, date
from uuid import UUID
from dataclasses import dataclass
@dataclass
class User:
    id: UUID
    name: str
    created: datetime
    active: bool

user = User(
    id=UUID("12345678-1234-5678-1234-567812345678"),
    name="Bob Smith",
    created=datetime(2025, 1, 10, 14, 30),
    active=True
)
# Serialize the dataclass directly -- no custom encoder needed
result = orjson.dumps(user)
print(result)
# Decode to string if needed
print(result.decode("utf-8"))
Output:
b'{"id":"12345678-1234-5678-1234-567812345678","name":"Bob Smith","created":"2025-01-10T14:30:00","active":true}'
{"id":"12345678-1234-5678-1234-567812345678","name":"Bob Smith","created":"2025-01-10T14:30:00","active":true}
Notice that the UUID, datetime, and dataclass are all handled automatically with zero configuration. With the standard json module, each of these would raise a TypeError: Object of type X is not JSON serializable error, requiring a custom default function.
orjson Options and Flags
orjson supports serialization options passed via the option parameter as bitwise-OR combinations of constants. These let you control formatting, sorting, and type handling:
# orjson_options.py
import orjson
data = {
    "z_key": "last",
    "a_key": "first",
    "count": 42,
    "ratio": 3.14159
}
# Pretty-print with indented output
pretty = orjson.dumps(data, option=orjson.OPT_INDENT_2)
print("Pretty:")
print(pretty.decode())
# Sort keys alphabetically
sorted_output = orjson.dumps(data, option=orjson.OPT_SORT_KEYS)
print("\nSorted keys:")
print(sorted_output.decode())
# Combine options with bitwise OR
both = orjson.dumps(data, option=orjson.OPT_INDENT_2 | orjson.OPT_SORT_KEYS)
print("\nPretty + Sorted:")
print(both.decode())
Output:
Pretty:
{
  "z_key": "last",
  "a_key": "first",
  "count": 42,
  "ratio": 3.14159
}
Sorted keys:
{"a_key":"first","count":42,"ratio":3.14159,"z_key":"last"}
Pretty + Sorted:
{
  "a_key": "first",
  "count": 42,
  "ratio": 3.14159,
  "z_key": "last"
}
The most useful options in practice are OPT_INDENT_2 for human-readable output during debugging, OPT_SORT_KEYS for deterministic output in tests or caches, OPT_NON_STR_KEYS for dicts with integer or float keys, and OPT_UTC_Z to use Z suffix instead of +00:00 for UTC datetimes.
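Here is a short sketch of two of those flags in action: OPT_NON_STR_KEYS for dictionaries with integer keys and OPT_UTC_Z for Z-suffixed UTC timestamps. The byte strings in the comments show the expected output.
# option_flags.py
import orjson
from datetime import datetime, timezone

# Integer keys raise TypeError unless OPT_NON_STR_KEYS is passed;
# with the flag, the keys are serialized as JSON strings
counts = {1: "one", 2: "two"}
print(orjson.dumps(counts, option=orjson.OPT_NON_STR_KEYS))
# b'{"1":"one","2":"two"}'

# A UTC-aware datetime gets a "+00:00" offset by default;
# OPT_UTC_Z switches the suffix to "Z"
event = {"at": datetime(2025, 3, 15, 10, 0, tzinfo=timezone.utc)}
print(orjson.dumps(event))
# b'{"at":"2025-03-15T10:00:00+00:00"}'
print(orjson.dumps(event, option=orjson.OPT_UTC_Z))
# b'{"at":"2025-03-15T10:00:00Z"}'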
Deserializing with orjson.loads()
The orjson.loads() function accepts both bytes and str input and returns Python objects. Unlike the standard library, it performs strict UTF-8 validation on input, which means malformed data fails loudly rather than silently corrupting your data.
# deserialization.py
import orjson
# From bytes (most common in API and network scenarios)
json_bytes = b'{"name": "Charlie", "score": 99.5, "tags": ["fast", "correct"]}'
data = orjson.loads(json_bytes)
print(data)
print(type(data["score"]))
# From string also works
json_str = '{"status": "ok", "count": 1000}'
data2 = orjson.loads(json_str)
print(data2)
# Error handling -- orjson raises JSONDecodeError for invalid input
try:
    orjson.loads(b'{"broken": }')
except orjson.JSONDecodeError as e:
    print(f"Parse error: {e}")
Output:
{'name': 'Charlie', 'score': 99.5, 'tags': ['fast', 'correct']}
<class 'float'>
{'status': 'ok', 'count': 1000}
Parse error: expected value at line 1 column 12
One important detail: orjson.JSONDecodeError is a subclass of json.JSONDecodeError, so any existing except blocks using json.JSONDecodeError will still catch orjson errors without modification. This makes the migration path from the standard library seamless.
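A minimal sketch of what that compatibility means in practice: an except block written against the standard library's exception type still catches orjson's parse errors.
# error_compat.py
import json
import orjson

try:
    orjson.loads(b'not valid json')
except json.JSONDecodeError as e:
    # Caught even though the error was raised by orjson, because
    # orjson.JSONDecodeError subclasses json.JSONDecodeError
    print(f"Parse error caught by standard-library handler: {e}")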
Benchmarking orjson vs Standard json
Let us run a concrete benchmark so you can see the actual performance difference on your hardware. We test serializing and deserializing a moderately complex nested dictionary 100,000 times:
# benchmark_orjson.py
import json
import orjson
import time
from datetime import datetime
# Test data -- similar to a typical API response
sample_data = {
    "users": [
        {"id": i, "name": f"User{i}", "email": f"user{i}@example.com",
         "score": i * 1.5, "active": i % 2 == 0, "tags": ["python", "backend"]}
        for i in range(50)
    ],
    "total": 50,
    "page": 1
}
ITERATIONS = 100_000
# Benchmark json.dumps
start = time.perf_counter()
for _ in range(ITERATIONS):
    json.dumps(sample_data)
json_dumps_time = time.perf_counter() - start

# Benchmark orjson.dumps (returns bytes)
start = time.perf_counter()
for _ in range(ITERATIONS):
    orjson.dumps(sample_data)
orjson_dumps_time = time.perf_counter() - start

# Benchmark json.loads
json_str = json.dumps(sample_data)
start = time.perf_counter()
for _ in range(ITERATIONS):
    json.loads(json_str)
json_loads_time = time.perf_counter() - start

# Benchmark orjson.loads
orjson_bytes = orjson.dumps(sample_data)
start = time.perf_counter()
for _ in range(ITERATIONS):
    orjson.loads(orjson_bytes)
orjson_loads_time = time.perf_counter() - start
print(f"json.dumps: {json_dumps_time:.3f}s")
print(f"orjson.dumps: {orjson_dumps_time:.3f}s ({json_dumps_time/orjson_dumps_time:.1f}x faster)")
print(f"json.loads: {json_loads_time:.3f}s")
print(f"orjson.loads: {orjson_loads_time:.3f}s ({json_loads_time/orjson_loads_time:.1f}x faster)")
Output (typical results on a modern CPU):
json.dumps: 2.841s
orjson.dumps: 0.482s (5.9x faster)
json.loads: 2.103s
orjson.loads: 0.631s (3.3x faster)
Actual speedups vary based on payload size, nesting depth, and hardware, but 3-6x faster on both operations is typical. For a service handling 1,000 requests per second with 100KB payloads each, this translates to substantial CPU savings that compound at scale.
Real-Life Example: FastAPI Response Caching with orjson
Here is a practical example that integrates orjson into a FastAPI application. We use orjson for both serializing API responses and caching them in memory, demonstrating a common production pattern:
# fastapi_orjson_cache.py
"""
FastAPI app with orjson-powered response serialization and in-memory caching.
Run with: uvicorn fastapi_orjson_cache:app --reload
"""
import orjson
from fastapi import FastAPI
from fastapi.responses import Response
from datetime import datetime, timezone
from dataclasses import dataclass, field
from typing import Optional
import hashlib
app = FastAPI()
# Simple in-memory cache using orjson bytes as values
_cache: dict[str, bytes] = {}
@dataclass
class ProductRecord:
    id: int
    name: str
    price: float
    in_stock: bool
    last_updated: datetime
    tags: list[str] = field(default_factory=list)
def get_product_from_db(product_id: int) -> Optional[ProductRecord]:
    """Simulates a database lookup."""
    if product_id > 100:
        return None
    return ProductRecord(
        id=product_id,
        name=f"Product {product_id}",
        price=round(product_id * 9.99, 2),
        in_stock=product_id % 3 != 0,
        last_updated=datetime.now(timezone.utc),
        tags=["electronics", "featured"] if product_id < 50 else ["clearance"]
    )
@app.get("/products/{product_id}")
async def get_product(product_id: int):
cache_key = f"product:{product_id}"
# Check cache first
if cache_key in _cache:
# Return cached bytes directly -- no re-serialization needed
return Response(content=_cache[cache_key], media_type="application/json")
product = get_product_from_db(product_id)
if product is None:
error = orjson.dumps({"error": "Product not found", "id": product_id})
return Response(content=error, media_type="application/json", status_code=404)
# Serialize with orjson -- handles dataclass and datetime natively
encoded = orjson.dumps(product, option=orjson.OPT_INDENT_2)
_cache[cache_key] = encoded
return Response(content=encoded, media_type="application/json")
@app.get("/cache/stats")
async def cache_stats():
stats = {
"cached_keys": len(_cache),
"cache_size_bytes": sum(len(v) for v in _cache.values()),
"timestamp": datetime.now(timezone.utc)
}
return Response(content=orjson.dumps(stats), media_type="application/json")
Example curl output:
$ curl http://localhost:8000/products/42
{
  "id": 42,
  "name": "Product 42",
  "price": 419.58,
  "in_stock": true,
  "last_updated": "2025-03-15T10:22:41.123456+00:00",
  "tags": ["electronics", "featured"]
}
The power here is that the serialized bytes are stored in the cache and served directly as the HTTP response body without deserialization or re-serialization. orjson's native datetime handling means the UTC-aware datetime in last_updated is serialized to a full ISO 8601 string with timezone offset -- exactly what frontend clients expect.
Frequently Asked Questions
Why does orjson return bytes instead of str?
orjson returns bytes because JSON data in Python is almost always immediately encoded to bytes for network transport or file writing. Returning bytes directly avoids an extra .encode("utf-8") step. If you need a string, just call result.decode(). This is a deliberate performance decision -- the bytes representation is the final form that gets sent over the wire.
Is orjson a drop-in replacement for the json module?
Almost, but not completely. The function signatures are similar, but orjson.dumps() returns bytes while json.dumps() returns str. Any code that does f.write(json.dumps(data)) will break because you cannot write bytes to a text-mode file. The fix is either f.write(orjson.dumps(data).decode()) or opening the file in binary mode "wb". The default= parameter also works slightly differently in edge cases.
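As a quick illustration of those two fixes (the file name is just an example):
# write_json_file.py
import orjson

data = {"status": "ok", "count": 3}

# Fix 1: decode the bytes before writing to a text-mode file
with open("out.json", "w") as f:
    f.write(orjson.dumps(data).decode())

# Fix 2: open the file in binary mode and write the bytes directly
with open("out.json", "wb") as f:
    f.write(orjson.dumps(data))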
How do I serialize custom types that orjson doesn't support natively?
Use the default parameter with a callback function, just like the standard library. The function receives the unsupported object and should either return a JSON-serializable value or raise TypeError. For example, to serialize a Decimal you would convert it to a float (or a string) inside the callback, as shown below. orjson's native type support is broad enough that custom default handlers are rarely needed for modern Python code.
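A minimal sketch of a default handler for Decimal, raising TypeError for anything it does not recognize so unsupported types still fail loudly:
# decimal_default.py
import orjson
from decimal import Decimal

def default(obj):
    # Convert Decimal to float for JSON; use str(obj) instead if you
    # need to preserve exact precision
    if isinstance(obj, Decimal):
        return float(obj)
    raise TypeError

print(orjson.dumps({"price": Decimal("19.99")}, default=default))
# b'{"price":19.99}'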
Is orjson thread-safe?
Yes. orjson functions are stateless -- each call to dumps() or loads() is entirely independent. There is no global mutable state, so multiple threads can call orjson simultaneously without any synchronization. This makes it a natural fit for multi-threaded web servers like gunicorn or uvicorn workers.
How does orjson compare to ujson?
Both are faster than the standard library, but orjson is consistently faster than ujson in benchmarks and has better correctness guarantees. ujson has a history of silently dropping or corrupting data in edge cases (very large integers, NaN values, deeply nested structures). orjson prioritizes correctness alongside speed. For production code where data integrity matters, orjson is the better choice.
Conclusion
orjson delivers a simple, high-value upgrade to any Python codebase that does significant JSON processing. The Rust-based implementation provides 3-6x faster serialization and deserialization, native support for datetime, UUID, dataclasses, and numpy arrays, and correct strict UTF-8 validation -- all with an API close enough to the standard library that migration is usually a matter of replacing the import and handling the bytes return type.
Try extending the FastAPI caching example to use Redis as a backend instead of in-memory storage, or add a Cache-Control header to the response based on the product's last_updated timestamp. These are natural next steps that reinforce how orjson fits into production API patterns.
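For the Redis direction, here is a hedged sketch of what that extension could look like, assuming a local Redis server and the redis-py package (pip install redis). The cached values are the orjson bytes themselves, so they can be returned as the response body without re-serialization.
# redis_cache_sketch.py
import orjson
import redis

r = redis.Redis(host="localhost", port=6379)

def get_cached(key: str) -> bytes | None:
    # redis-py returns bytes or None, which matches orjson's output type
    return r.get(key)

def set_cached(key: str, payload: bytes, ttl_seconds: int = 60) -> None:
    # Store the serialized bytes with an expiry instead of caching forever
    r.set(key, payload, ex=ttl_seconds)

encoded = orjson.dumps({"id": 42, "name": "Product 42"})
set_cached("product:42", encoded)
print(get_cached("product:42"))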
For the full API reference and advanced options like OPT_PASSTHROUGH_DATETIME, see the orjson GitHub repository.
Further Reading: For more details, see the Python sqlite3 documentation.
Frequently Asked Questions
What are the main data storage options in Python?
Python supports flat files (text, CSV, JSON), databases (SQLite, PostgreSQL, MySQL), key-value stores (Redis, shelve), pickle serialization, and cloud storage. The best choice depends on data size, structure, and access patterns.
When should I use SQLite vs a full database?
Use SQLite for single-user apps, prototypes, and small-to-medium datasets. Switch to PostgreSQL or MySQL for concurrent multi-user access, complex queries at scale, or production-grade reliability.
How do I save Python objects to disk?
Use pickle for Python-specific serialization, json for interoperable data, shelve for dictionary-like persistent storage, or databases for structured data. For data analysis, pandas can save to CSV, Parquet, or HDF5.
Is JSON or CSV better for storing data?
JSON handles nested, hierarchical data well. CSV is simpler for tabular, flat data. Use JSON for API data and configuration; use CSV for datasets and spreadsheet-compatible exports.
How do I choose between file storage and a database?
Use file storage for simple, single-user scenarios. Use a database when you need querying, indexing, concurrent access, or ACID transactions. SQLite bridges both worlds for simpler applications.