Beginner
Selenium is a useful Python library for extracting web page data, especially from pages that load content with JavaScript. Many of you may have tried to use Selenium but gotten stuck during installation. One key thing to remember is that Selenium runs an actual browser in the background (or foreground if you wish) to query a given website. So a key step is to install the browser driver if you haven’t done so already.
Step 1: Locate the right web driver
Since Selenium drives an actual browser, one of the first decisions you’ll need to make is which driver to use. Generally it won’t matter, but the best browser to use is the one that works best for your target website. For example, if your target website works best under Firefox, then use that.
| Browser | Supported OS | Maintained by | Download | Issue Tracker |
|---|---|---|---|---|
| Chromium/Chrome | Windows/macOS/Linux | Google | Downloads | Issues |
| Firefox | Windows/macOS/Linux | Mozilla | Downloads | Issues |
| Edge | Windows 10 | Microsoft | Downloads | Issues |
| Internet Explorer | Windows | Selenium Project | Downloads | Issues |
| Opera | Windows/macOS/Linux | Opera | Downloads | Issues |
So decide which one, and then go to its download page. For this example we will use Firefox. In the above table, the download link goes to this page: https://github.com/mozilla/geckodriver/releases
You can then click on the latest release:

You can then scroll down to the bottom of the page to see the driver list:

Right click on the .gz file, and then get the URL.

Step 2: Download the web driver
Next go to your Linux terminal and create a directory to store this file:

Next go into that directory, and use wget to download the file by pasting the link you copied above:
wget https://github.com/mozilla/geckodriver/releases/download/v0.29.1/geckodriver-v0.29.1-linux32.tar.gz

Step 3: Extract the downloaded web driver
Next you should see the .gz file when you list the files:

You can then use gzip to decompress the file:
gzip -d geckodriver-v0.29.1-linux32.tar.gz

You can then untar the file to unpack it:
tar -xvf geckodriver-v0.29.1-linux32.tar
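Incidentally, the two-step gzip/tar dance can be collapsed into one command, since tar can decompress gzip on the fly with -z. A small self-contained demo in a scratch directory (the same tar -xzf invocation works directly on the geckodriver .tar.gz you downloaded):

```shell
# Work in a throwaway directory so nothing in the current folder is touched
cd "$(mktemp -d)"

# Create a tiny .tar.gz to stand in for the geckodriver download
echo "demo" > driver.txt
tar -czf bundle.tar.gz driver.txt
rm driver.txt

# -x extract, -z gunzip on the fly, -f archive name: one step instead of two
tar -xzf bundle.tar.gz
cat driver.txt
```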

Step 4: Configure PATH
What you will be left with is a file called “geckodriver”. This is the driver file. You will need to make it available via the PATH, because Selenium looks for the driver executable in the PATH operating system environment variable.
I simply went to the parent directory, then updated the PATH environment variable by taking the existing PATH value ($PATH) and appending the absolute path of the gdriver folder (use an absolute path; a relative entry like gdriver only resolves from the directory where you exported it):
export PATH=$PATH:$(pwd)/gdriver
If you do not do the above, you will get the error:
selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.
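Here is that PATH update as a self-contained sketch. The directory name gdriver is just the example folder from above; substitute wherever you extracted geckodriver. The absolute path matters, because a relative entry only resolves from the directory where you exported it:

```shell
# Assumed location: geckodriver was extracted into ~/gdriver -- adjust as needed
mkdir -p "$HOME/gdriver"

# Append the absolute path to PATH for this shell session
export PATH="$PATH:$HOME/gdriver"

# Confirm the directory is now on PATH (prints the entry if present)
echo "$PATH" | tr ':' '\n' | grep -x "$HOME/gdriver"
```

To make the change permanent, add the export line to your ~/.bashrc (or your shell’s equivalent startup file).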
Step 5: Test running the web driver
That’s it! Now if you run the following code, you should be able to perform a web query with a Firefox driver running in the background:
# main.py
from selenium import webdriver
from selenium.webdriver import FirefoxOptions
opts = FirefoxOptions()
opts.add_argument("--headless")
browser = webdriver.Firefox(options=opts)
# Declare a variable containing the URL that is going to be scraped
URL = 'https://pythonhowtoprogram.com/'
# Web driver going into website
browser.get(URL)
# Printing page title
print(browser.title)
# Close the browser when finished so the background process doesn't linger
browser.quit()
You will notice it takes a few seconds to run the first time. That is because an instance of the browser needs to be loaded, which takes a few seconds. Keep this in mind in case you need faster performance, in which case you may want to use urllib or requests instead.
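As a point of comparison, when the target page renders its HTML without JavaScript you can often skip the browser entirely and parse the response yourself. A minimal sketch using only the standard library — it parses a static HTML snippet here for demonstration, but the same parser works on bytes fetched with urllib.request:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Static snippet standing in for a fetched page
html = "<html><head><title>Python How To Program</title></head><body></body></html>"
parser = TitleParser()
parser.feed(html)
print(parser.title)  # -> Python How To Program
```

This avoids the multi-second browser startup entirely, but of course it cannot see content that only appears after JavaScript runs — that is exactly the case where Selenium earns its keep.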
Next Steps
Now that you know how to install a driver, there are numerous webscraping tutorials we have on offer. You can find them all in our web scraping section: https://pythonhowtoprogram.com/category/web-scraping/
How To Use Python orjson for Fast JSON Processing
Intermediate
You have a Python service that parses JSON responses from an API thousands of times per second, and the standard json module is quietly becoming a bottleneck. At low traffic volumes this goes unnoticed, but once you scale up, milliseconds of serialization overhead compound into real latency. If you have ever profiled a Python web service and found json.dumps or json.loads sitting near the top of the flame graph, you already know this pain.
orjson is a fast, correct JSON library for Python written in Rust. It drops into nearly any codebase as a replacement for the standard json module and typically runs 2-10x faster on both serialization and deserialization. It also natively supports types the standard library forces you to handle manually — datetime, UUID, numpy arrays, and dataclasses.
In this article you will learn how to install orjson, serialize and deserialize JSON with it, use its built-in support for Python-native types, benchmark it against the standard library, and integrate it into a real-world FastAPI project. By the end you will have a working understanding of when and why to choose orjson over the alternatives.
orjson Quick Example
Before diving deep, here is a self-contained example that shows the core pattern. orjson is nearly a drop-in replacement for the standard json module, but returns and accepts bytes instead of str.
# quick_example.py
import orjson
from datetime import datetime
data = {
    "name": "Alice",
    "score": 98.6,
    "logged_in": True,
    "joined": datetime(2024, 3, 15, 9, 30, 0),
    "tags": ["python", "backend", "fast"]
}
# Serialize to bytes (not str like the standard json module)
encoded = orjson.dumps(data)
print(encoded)
print(type(encoded))
# Deserialize back to a Python dict
decoded = orjson.loads(encoded)
print(decoded["joined"]) # datetime is serialized as ISO 8601 string
print(type(decoded))
Output:
b'{"name":"Alice","score":98.6,"logged_in":true,"joined":"2024-03-15T09:30:00","tags":["python","backend","fast"]}'
<class 'bytes'>
2024-03-15T09:30:00
<class 'dict'>
Two things stand out right away. First, orjson.dumps() returns bytes, not a string — this is intentional and saves an unnecessary encoding step when writing to network sockets or files. Second, the datetime object is automatically serialized to ISO 8601 format without any extra work, which the standard json module would refuse to handle at all.
What Is orjson and Why Use It?
orjson is a Python JSON library implemented in Rust using the Serde framework. It was created specifically to address the performance limitations of Python’s built-in json module, which is implemented in C but still shows its age when processing large payloads at high throughput.
The key differences between orjson and the standard library are:
| Feature | Standard json | orjson |
|---|---|---|
| Output type of dumps() | str | bytes |
| datetime support | Raises TypeError | Native ISO 8601 |
| UUID support | Raises TypeError | Native string |
| dataclass support | Raises TypeError | Native dict-like |
| numpy array support | Not supported | Native (optional dep) |
| Performance (typical) | Baseline | 2-10x faster |
| Strict UTF-8 validation | No | Yes |
The Rust implementation takes advantage of SIMD instructions and a highly optimized Serde-based serialization pipeline. For applications doing heavy JSON processing — API gateways, caching layers, log aggregators — the improvement is measurable and often significant.
Installing orjson
orjson is available on PyPI and installs with a single command:
# install_orjson.sh
pip install orjson
Output:
Collecting orjson
Downloading orjson-3.10.x-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (144 kB)
Successfully installed orjson-3.10.x
orjson ships as a pre-compiled binary for most platforms (Linux, macOS, Windows on x86-64 and ARM), so there is no Rust toolchain required. If you are on a less common platform you may need Rust installed to build from source. Verify the installation with a quick import check:
# verify_install.py
import orjson
print(orjson.__version__)
Output:
3.10.x
Serializing Python Objects with orjson.dumps()
The orjson.dumps() function converts Python objects to JSON bytes. The most important thing to remember is that it always returns bytes, not str. If you need a string, call .decode() on the result.
# serialization_basics.py
import orjson
from datetime import datetime, date
from uuid import UUID
from dataclasses import dataclass
@dataclass
class User:
    id: UUID
    name: str
    created: datetime
    active: bool

user = User(
    id=UUID("12345678-1234-5678-1234-567812345678"),
    name="Bob Smith",
    created=datetime(2025, 1, 10, 14, 30),
    active=True
)
# Serialize the dataclass directly -- no custom encoder needed
result = orjson.dumps(user)
print(result)
# Decode to string if needed
print(result.decode("utf-8"))
Output:
b'{"id":"12345678-1234-5678-1234-567812345678","name":"Bob Smith","created":"2025-01-10T14:30:00","active":true}'
{"id":"12345678-1234-5678-1234-567812345678","name":"Bob Smith","created":"2025-01-10T14:30:00","active":true}
Notice that the UUID, datetime, and dataclass are all handled automatically with zero configuration. With the standard json module, each of these would raise TypeError: Object of type X is not JSON serializable, requiring a custom default function.
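For contrast, here is what the same serialization looks like with the standard json module, which needs a hand-written default callback for every non-native type. This is a sketch assuming the same User dataclass shape as above:

```python
import json
from dataclasses import asdict, dataclass, is_dataclass
from datetime import datetime
from uuid import UUID

@dataclass
class User:
    id: UUID
    name: str
    created: datetime

def fallback(obj):
    """Teach the standard library about the types orjson handles natively."""
    if is_dataclass(obj):
        return asdict(obj)  # recurses; nested UUID/datetime hit fallback again
    if isinstance(obj, UUID):
        return str(obj)
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Not serializable: {type(obj)}")

user = User(UUID("12345678-1234-5678-1234-567812345678"), "Bob Smith",
            datetime(2025, 1, 10, 14, 30))
encoded = json.dumps(user, default=fallback)
print(encoded)
```

Every line of that fallback function is boilerplate that orjson makes unnecessary.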
orjson Options and Flags
orjson supports serialization options passed via the option parameter as bitwise-OR combinations of constants. These let you control formatting, sorting, and type handling:
# orjson_options.py
import orjson
data = {
    "z_key": "last",
    "a_key": "first",
    "count": 42,
    "ratio": 3.14159
}
# Pretty-print with indented output
pretty = orjson.dumps(data, option=orjson.OPT_INDENT_2)
print("Pretty:")
print(pretty.decode())
# Sort keys alphabetically
sorted_output = orjson.dumps(data, option=orjson.OPT_SORT_KEYS)
print("\nSorted keys:")
print(sorted_output.decode())
# Combine options with bitwise OR
both = orjson.dumps(data, option=orjson.OPT_INDENT_2 | orjson.OPT_SORT_KEYS)
print("\nPretty + Sorted:")
print(both.decode())
Output:
Pretty:
{
  "z_key": "last",
  "a_key": "first",
  "count": 42,
  "ratio": 3.14159
}

Sorted keys:
{"a_key":"first","count":42,"ratio":3.14159,"z_key":"last"}

Pretty + Sorted:
{
  "a_key": "first",
  "count": 42,
  "ratio": 3.14159,
  "z_key": "last"
}
The most useful options in practice are OPT_INDENT_2 for human-readable output during debugging, OPT_SORT_KEYS for deterministic output in tests or caches, OPT_NON_STR_KEYS for dicts with integer or float keys, and OPT_UTC_Z to use Z suffix instead of +00:00 for UTC datetimes.
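The OPT_NON_STR_KEYS flag deserves a caveat: the standard library silently coerces non-string keys to strings, while orjson refuses unless you pass the flag. A stdlib-only illustration of the silent coercion, and why the round trip does not restore the original key types:

```python
import json

data = {1: "one", 2.5: "half"}

# The standard library quietly turns the int and float keys into strings
encoded = json.dumps(data)
print(encoded)  # {"1": "one", "2.5": "half"}

# After a round trip the keys come back as str, not int/float
decoded = json.loads(encoded)
print(decoded)
```

orjson's stricter default (raising on non-string keys unless OPT_NON_STR_KEYS is set) forces you to notice this lossy conversion instead of discovering it later.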
Deserializing with orjson.loads()
The orjson.loads() function accepts both bytes and str input and returns Python objects. Unlike the standard library, it performs strict UTF-8 validation on input, which means malformed data fails loudly rather than silently corrupting your data.
# deserialization.py
import orjson
# From bytes (most common in API and network scenarios)
json_bytes = b'{"name": "Charlie", "score": 99.5, "tags": ["fast", "correct"]}'
data = orjson.loads(json_bytes)
print(data)
print(type(data["score"]))
# From string also works
json_str = '{"status": "ok", "count": 1000}'
data2 = orjson.loads(json_str)
print(data2)
# Error handling -- orjson raises JSONDecodeError for invalid input
try:
    orjson.loads(b'{"broken": }')
except orjson.JSONDecodeError as e:
    print(f"Parse error: {e}")
Output:
{'name': 'Charlie', 'score': 99.5, 'tags': ['fast', 'correct']}
<class 'float'>
{'status': 'ok', 'count': 1000}
Parse error: expected value at line 1 column 12
One important detail: orjson.JSONDecodeError is a subclass of json.JSONDecodeError, so any existing except blocks using json.JSONDecodeError will still catch orjson errors without modification. This makes the migration path from the standard library seamless.
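That subclass relationship means defensive parsing code can be written once against the standard library's exception and keep working after a switch. A sketch using only the stdlib — swapping json.loads for orjson.loads would leave the except clause untouched:

```python
import json

def safe_parse(raw):
    """Return parsed JSON, or None on invalid input.

    Catching json.JSONDecodeError keeps working after migrating to orjson,
    because orjson.JSONDecodeError subclasses it.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None

print(safe_parse('{"ok": true}'))   # {'ok': True}
print(safe_parse('{"broken": }'))   # None
```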
Benchmarking orjson vs Standard json
Let us run a concrete benchmark so you can see the actual performance difference on your hardware. We test serializing and deserializing a moderately complex nested dictionary 100,000 times:
# benchmark_orjson.py
import json
import orjson
import time
from datetime import datetime
# Test data -- similar to a typical API response
sample_data = {
    "users": [
        {"id": i, "name": f"User{i}", "email": f"user{i}@example.com",
         "score": i * 1.5, "active": i % 2 == 0, "tags": ["python", "backend"]}
        for i in range(50)
    ],
    "total": 50,
    "page": 1
}
ITERATIONS = 100_000
# Benchmark json.dumps
start = time.perf_counter()
for _ in range(ITERATIONS):
    json.dumps(sample_data)
json_dumps_time = time.perf_counter() - start

# Benchmark orjson.dumps (returns bytes)
start = time.perf_counter()
for _ in range(ITERATIONS):
    orjson.dumps(sample_data)
orjson_dumps_time = time.perf_counter() - start

# Benchmark json.loads
json_str = json.dumps(sample_data)
start = time.perf_counter()
for _ in range(ITERATIONS):
    json.loads(json_str)
json_loads_time = time.perf_counter() - start

# Benchmark orjson.loads
orjson_bytes = orjson.dumps(sample_data)
start = time.perf_counter()
for _ in range(ITERATIONS):
    orjson.loads(orjson_bytes)
orjson_loads_time = time.perf_counter() - start
print(f"json.dumps: {json_dumps_time:.3f}s")
print(f"orjson.dumps: {orjson_dumps_time:.3f}s ({json_dumps_time/orjson_dumps_time:.1f}x faster)")
print(f"json.loads: {json_loads_time:.3f}s")
print(f"orjson.loads: {orjson_loads_time:.3f}s ({json_loads_time/orjson_loads_time:.1f}x faster)")
Output (typical results on a modern CPU):
json.dumps: 2.841s
orjson.dumps: 0.482s (5.9x faster)
json.loads: 2.103s
orjson.loads: 0.631s (3.3x faster)
Actual speedups vary based on payload size, nesting depth, and hardware, but 3-6x faster on both operations is typical. For a service handling 1,000 requests per second with 100KB payloads each, this translates to substantial CPU savings that compound at scale.
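If you want less noisy numbers than a hand-rolled loop, the standard library's timeit handles warm-up repeats and lets you take the minimum over runs. The sketch below times only the stdlib side so it runs anywhere; the equivalent orjson calls slot in the same way:

```python
import json
import timeit

# Payload shaped like the benchmark above, but smaller for a quick demo
sample = {"users": [{"id": i, "name": f"User{i}"} for i in range(50)]}
encoded = json.dumps(sample)

# min() over repeats filters out scheduler noise and cache warm-up
dumps_t = min(timeit.repeat(lambda: json.dumps(sample), number=2000, repeat=3))
loads_t = min(timeit.repeat(lambda: json.loads(encoded), number=2000, repeat=3))

print(f"json.dumps: {dumps_t * 1000:.1f} ms per 2000 calls")
print(f"json.loads: {loads_t * 1000:.1f} ms per 2000 calls")
```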
Real-Life Example: FastAPI Response Caching with orjson
Here is a practical example that integrates orjson into a FastAPI application. We use orjson for both serializing API responses and caching them in memory, demonstrating a common production pattern:
# fastapi_orjson_cache.py
"""
FastAPI app with orjson-powered response serialization and in-memory caching.
Run with: uvicorn fastapi_orjson_cache:app --reload
"""
import orjson
from fastapi import FastAPI
from fastapi.responses import Response
from datetime import datetime, timezone
from dataclasses import dataclass, field
from typing import Optional
import hashlib
app = FastAPI()
# Simple in-memory cache using orjson bytes as values
_cache: dict[str, bytes] = {}
@dataclass
class ProductRecord:
    id: int
    name: str
    price: float
    in_stock: bool
    last_updated: datetime
    tags: list[str] = field(default_factory=list)

def get_product_from_db(product_id: int) -> Optional[ProductRecord]:
    """Simulates a database lookup."""
    if product_id > 100:
        return None
    return ProductRecord(
        id=product_id,
        name=f"Product {product_id}",
        price=round(product_id * 9.99, 2),
        in_stock=product_id % 3 != 0,
        last_updated=datetime.now(timezone.utc),
        tags=["electronics", "featured"] if product_id < 50 else ["clearance"]
    )

@app.get("/products/{product_id}")
async def get_product(product_id: int):
    cache_key = f"product:{product_id}"
    # Check cache first
    if cache_key in _cache:
        # Return cached bytes directly -- no re-serialization needed
        return Response(content=_cache[cache_key], media_type="application/json")
    product = get_product_from_db(product_id)
    if product is None:
        error = orjson.dumps({"error": "Product not found", "id": product_id})
        return Response(content=error, media_type="application/json", status_code=404)
    # Serialize with orjson -- handles dataclass and datetime natively
    encoded = orjson.dumps(product, option=orjson.OPT_INDENT_2)
    _cache[cache_key] = encoded
    return Response(content=encoded, media_type="application/json")

@app.get("/cache/stats")
async def cache_stats():
    stats = {
        "cached_keys": len(_cache),
        "cache_size_bytes": sum(len(v) for v in _cache.values()),
        "timestamp": datetime.now(timezone.utc)
    }
    return Response(content=orjson.dumps(stats), media_type="application/json")
Example curl output:
$ curl http://localhost:8000/products/42
{
  "id": 42,
  "name": "Product 42",
  "price": 419.58,
  "in_stock": true,
  "last_updated": "2025-03-15T10:22:41.123456+00:00",
  "tags": [
    "electronics",
    "featured"
  ]
}
The power here is that the serialized bytes are stored in the cache and served directly as the HTTP response body without deserialization or re-serialization. orjson's native datetime handling means the UTC-aware datetime in last_updated is serialized to a full ISO 8601 string with timezone offset -- exactly what frontend clients expect.
Frequently Asked Questions
Why does orjson return bytes instead of str?
orjson returns bytes because JSON data in Python is almost always immediately encoded to bytes for network transport or file writing. Returning bytes directly avoids an extra .encode("utf-8") step. If you need a string, just call result.decode(). This is a deliberate performance decision -- the bytes representation is the final form that gets sent over the wire.
Is orjson a drop-in replacement for the json module?
Almost, but not completely. The function signatures are similar, but orjson.dumps() returns bytes while json.dumps() returns str. Any code that does f.write(json.dumps(data)) will break because you cannot write bytes to a text-mode file. The fix is either f.write(orjson.dumps(data).decode()) or opening the file in binary mode "wb". The default= parameter also works slightly differently in edge cases.
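The binary-mode fix is easy to verify with the standard library alone. The sketch below uses json.dumps(...).encode() to stand in for the bytes that orjson.dumps() returns:

```python
import json
import os
import tempfile

data = {"status": "ok", "count": 3}
payload = json.dumps(data).encode()  # bytes, like orjson.dumps() would return

path = os.path.join(tempfile.mkdtemp(), "out.json")

# Text mode ("w") would raise TypeError on bytes; binary mode accepts them
with open(path, "wb") as f:
    f.write(payload)

# Read back in binary mode and confirm the round trip
with open(path, "rb") as f:
    restored = json.loads(f.read())
print(restored)  # {'status': 'ok', 'count': 3}
```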
How do I serialize custom types that orjson doesn't support natively?
Use the default parameter with a callback function, just like the standard library. The function receives the unsupported object and should return a JSON-serializable value, raising TypeError for anything it does not recognize. For example, to serialize a Decimal, pass a function that returns float(x) when isinstance(x, Decimal) and raises TypeError otherwise. orjson's native type support is broad enough that custom default handlers are rarely needed for modern Python code.
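The callback contract is the same in both libraries, so a stdlib version shows the shape — return a serializable value for types you recognize, raise TypeError for everything else (orjson expects the same from its default= callback):

```python
import json
from decimal import Decimal

def default(obj):
    """Convert Decimal to float; raise TypeError for anything else."""
    if isinstance(obj, Decimal):
        return float(obj)
    raise TypeError(f"Type not serializable: {type(obj)}")

data = {"price": Decimal("19.99")}
print(json.dumps(data, default=default))  # {"price": 19.99}
```

Note that float conversion can lose precision for Decimal values; if exactness matters, return str(obj) instead and parse it back on the consuming side.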
Is orjson thread-safe?
Yes. orjson functions are stateless -- each call to dumps() or loads() is entirely independent. There is no global mutable state, so multiple threads can call orjson simultaneously without any synchronization. This makes it a natural fit for multi-threaded web servers like gunicorn or uvicorn workers.
How does orjson compare to ujson?
Both are faster than the standard library, but orjson is consistently faster than ujson in benchmarks and has better correctness guarantees. ujson has a history of silently dropping or corrupting data in edge cases (very large integers, NaN values, deeply nested structures). orjson prioritizes correctness alongside speed. For production code where data integrity matters, orjson is the better choice.
Conclusion
orjson delivers a simple, high-value upgrade to any Python codebase that does significant JSON processing. The Rust-based implementation provides 3-6x faster serialization and deserialization, native support for datetime, UUID, dataclasses, and numpy arrays, and correct strict UTF-8 validation -- all with an API close enough to the standard library that migration is usually a matter of replacing the import and handling the bytes return type.
Try extending the FastAPI caching example to use Redis as a backend instead of in-memory storage, or add a Cache-Control header to the response based on the product's last_updated timestamp. These are natural next steps that reinforce how orjson fits into production API patterns.
For the full API reference and advanced options like OPT_PASSTHROUGH_DATETIME, see the orjson GitHub repository.
Frequently Asked Questions
What is Selenium WebDriver used for in Python?
Selenium WebDriver is a tool for automating web browser interactions. In Python, it is used for web scraping, automated testing of web applications, form filling, screenshot capture, and any task that requires programmatic control of a web browser.
Which browser drivers work with Selenium in Python?
Selenium supports ChromeDriver (Chrome/Chromium), GeckoDriver (Firefox), EdgeDriver (Microsoft Edge), and SafariDriver (Safari). ChromeDriver and GeckoDriver are the most commonly used for Linux-based automation.
How do I install ChromeDriver on Linux?
Download ChromeDriver from the official site matching your Chrome version, extract it, and place it in your PATH (e.g., /usr/local/bin/). Alternatively, use the webdriver-manager package (pip install webdriver-manager) to handle driver installation automatically.
Why do I get ‘WebDriver not found’ errors?
This typically occurs when the driver executable is not in your system PATH, the driver version does not match your browser version, or the driver file lacks execute permissions. Use chmod +x chromedriver to set permissions and ensure version compatibility.
Can Selenium run without a visible browser window?
Yes. Use headless mode by adding options.add_argument('--headless') to your browser options. This runs the browser in the background without a GUI, which is faster and ideal for servers and CI/CD pipelines.