Beginner
Selenium is a useful Python library for extracting web page data, especially from pages that load their content with JavaScript. Many people who try Selenium get stuck during installation. The key thing to remember is that Selenium runs an actual browser in the background (or in the foreground, if you wish) to query a website, and it controls that browser through a driver. So a key step is to install the driver if you haven’t done so already.
Step 1: Locate the right web driver
Since Selenium drives an actual browser, one of the first decisions you’ll need to make is which browser, and therefore which driver, to use. Generally it won’t matter, but the best browser to use is the one that works best with your target website. For example, if your target website works best under Firefox, then use that.
| Browser | Supported OS | Maintained by | Download | Issue Tracker |
|---|---|---|---|---|
| Chromium/Chrome | Windows/macOS/Linux | Google | Downloads | Issues |
| Firefox | Windows/macOS/Linux | Mozilla | Downloads | Issues |
| Edge | Windows 10 | Microsoft | Downloads | Issues |
| Internet Explorer | Windows | Selenium Project | Downloads | Issues |
| Opera | Windows/macOS/Linux | Opera | Downloads | Issues |
So decide which one to use, and then go to its download page. For this example we will use Firefox. In the above table, the download link goes to this page: https://github.com/mozilla/geckodriver/releases
You can then click on the latest release:

You can then scroll down to the bottom of the page to see the driver list:

Right click on the .gz file, and then get the URL.

Step 2: Download the web driver
Next, go to your Linux terminal and create a directory to store this file (this guide assumes it is named gdriver, the folder added to PATH in Step 4):
mkdir gdriver

Next, go into that directory and use wget to download the file by pasting the link you copied above:
cd gdriver
wget https://github.com/mozilla/geckodriver/releases/download/v0.29.1/geckodriver-v0.29.1-linux32.tar.gz
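If you script the download for several versions, the link can be assembled rather than copied by hand. The sketch below infers the URL pattern from the single example link above, so treat the pattern as an assumption and check the releases page; geckodriver_url is a hypothetical helper name.

```python
def geckodriver_url(version: str, platform: str = "linux64") -> str:
    """Build a release-asset URL that follows the pattern of the example link.

    The pattern is inferred from one release link, not from documentation.
    """
    version = version.lstrip("v")  # accept both "0.29.1" and "v0.29.1"
    base = "https://github.com/mozilla/geckodriver/releases/download"
    return f"{base}/v{version}/geckodriver-v{version}-{platform}.tar.gz"


# Reproduce the link used in the wget command above
print(geckodriver_url("0.29.1", platform="linux32"))
```

Most current Linux machines are 64-bit, so you would likely pass platform="linux64" rather than the linux32 asset used in this walkthrough.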

Step 3: Extract the downloaded web driver
When you list the files, you should see the .tar.gz file:

You can then use gzip to decompress it:
gzip -d geckodriver-v0.29.1-linux32.tar.gz

Finally, unpack the resulting tar archive:
tar -xvf geckodriver-v0.29.1-linux32.tar
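If you would rather do the extraction from Python (for example inside a setup script), the standard library's tarfile module can decompress and unpack in one step. This is a minimal sketch; extract_driver is a hypothetical helper, and you should only extract archives you trust.

```python
import tarfile
from pathlib import Path


def extract_driver(archive: str, dest: str = ".") -> list:
    """Decompress and unpack a .tar.gz archive; return the member names."""
    Path(dest).mkdir(parents=True, exist_ok=True)
    # "r:gz" reads a gzip-compressed tar, replacing the gzip -d + tar -xvf pair
    with tarfile.open(archive, mode="r:gz") as tar:
        names = tar.getnames()
        tar.extractall(path=dest)
    return names


# Example call, assuming the archive downloaded in Step 2 is in the current directory:
# extract_driver("geckodriver-v0.29.1-linux32.tar.gz", dest="gdriver")
```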

Step 4: Configure PATH
What you will be left with is a file called “geckodriver”. This is the driver file. You need to make it available through the PATH environment variable, because Selenium looks for the driver executable in the directories listed there.
I simply went to the parent directory, then updated PATH by taking the existing value ($PATH) and appending the absolute path of the gdriver folder (a relative path would only work while you stay in that directory). The change lasts for the current shell session only:
export PATH="$PATH:$(pwd)/gdriver"
If you do not do the above, you will get the error:
selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.
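Before launching Selenium, you can check whether the driver is actually visible on PATH. The sketch below uses shutil.which from the standard library, which searches PATH for an executable file in much the same way the shell does; find_driver is a hypothetical helper name.

```python
import shutil
from typing import Optional


def find_driver(name: str = "geckodriver", search_path: Optional[str] = None) -> Optional[str]:
    """Return the full path of an executable named `name`, or None if not found.

    search_path defaults to the PATH environment variable.
    """
    return shutil.which(name, path=search_path)


location = find_driver()
if location is None:
    print("geckodriver is not on PATH, so Selenium would raise WebDriverException")
else:
    print(f"Found driver at {location}")
```

If this prints the "not on PATH" message even though the file exists, check that the file has execute permission, since shutil.which skips non-executable files.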
Step 5: Test running the web driver
That’s it! Now if you run the following code, you should be able to query a web page with a Firefox driver running in the background:
# main.py
from selenium import webdriver
from selenium.webdriver import FirefoxOptions
opts = FirefoxOptions()
opts.add_argument("--headless")
browser = webdriver.Firefox(options=opts)
# URL of the page to scrape
URL = 'https://pythonhowtoprogram.com/'
# Load the page in the browser
browser.get(URL)
# Print the page title
print(browser.title)
# Close the browser and end the driver session
browser.quit()
You will notice that it takes a few seconds to run the first time, because a browser instance has to be started. Keep this in mind if you need faster performance; for pages that do not depend on JavaScript, urllib or requests may be a better fit.
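For comparison, here is a rough sketch of reading a page title without a browser, using only the standard library. It parses an HTML string, so it only sees what the server sends and not anything JavaScript adds later; TitleParser and extract_title are illustrative names.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    """Return the stripped <title> text of an HTML document, or '' if absent."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


sample = "<html><head><title> Example Domain </title></head><body></body></html>"
print(extract_title(sample))
```

To use it on a live page you would first fetch the HTML, for example with urllib.request.urlopen(url).read().decode(), and pass the resulting string to extract_title.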
Next Steps
Now that you know how to install a driver, there are numerous webscraping tutorials we have on offer. You can find them all in our web scraping section: https://pythonhowtoprogram.com/category/web-scraping/
How To Use Python Joblib for Parallel Computing and Caching
Intermediate
You have a data processing loop that runs one item at a time — checking each file, scoring each user, training each model configuration. Your machine has eight cores and only one of them is working. The loop that takes twenty minutes could finish in three if you could just split the work across all available processors.
Joblib is a Python library that makes parallel computing and result caching easy to add to existing code. Its Parallel and delayed utilities turn a regular Python loop into a parallel job with one wrapper. Its Memory class caches function results to disk so that the second call with the same arguments returns instantly. Install it with pip install joblib. Scikit-learn uses Joblib internally for its own parallelism, so if you have scikit-learn installed, Joblib is already there.
This article covers parallelising loops with Parallel and delayed, choosing the right backend (loky, threading, multiprocessing), caching expensive computations with Memory, integrating with scikit-learn pipelines, and diagnosing performance with verbosity settings. By the end you will have both parallel execution and disk caching working in a realistic data pipeline.
Joblib Parallel: Quick Example
The quickest way to see Joblib’s effect is to replace a for loop with a Parallel call. The structure is almost identical — the main change is wrapping the function call with delayed().
# quick_joblib.py
import time
from joblib import Parallel, delayed
def slow_square(n: int) -> int:
    """Simulate a slow computation."""
    time.sleep(0.5)
    return n * n
numbers = list(range(8))
# Sequential -- takes 8 * 0.5 = 4 seconds
start = time.perf_counter()
sequential = [slow_square(n) for n in numbers]
seq_time = time.perf_counter() - start
print(f"Sequential: {sequential} in {seq_time:.2f}s")
# Parallel -- uses all available CPU cores
start = time.perf_counter()
parallel = Parallel(n_jobs=-1)(delayed(slow_square)(n) for n in numbers)
par_time = time.perf_counter() - start
print(f"Parallel: {parallel} in {par_time:.2f}s")
print(f"Speedup: {seq_time / par_time:.1f}x")
Output (on an 8-core machine):
Sequential: [0, 1, 4, 9, 16, 25, 36, 49] in 4.01s
Parallel: [0, 1, 4, 9, 16, 25, 36, 49] in 0.56s
Speedup: 7.2x
The n_jobs=-1 argument tells Joblib to use all available CPU cores. n_jobs=4 would use exactly four. The delayed(func)(args) pattern creates a lazy description of the function call without executing it — Joblib collects these descriptions and distributes them across workers. The return values are collected in the same order as the input, so parallel[3] is always the result of slow_square(3) regardless of which worker finished first.
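A small experiment makes both points concrete: delayed() does not run anything by itself, and results come back in input order even when later tasks finish first. This sketch assumes joblib is installed and uses the threading backend so the calls list is shared with the workers; slow_echo is an illustrative function.

```python
import time
from joblib import Parallel, delayed

calls = []  # records which inputs were actually executed


def slow_echo(n: int) -> int:
    """Sleep longer for smaller inputs so that later tasks finish first."""
    calls.append(n)
    time.sleep(0.05 * (4 - n))
    return n


# Building a delayed call only describes the work; the function has not run yet
pending = delayed(slow_echo)(3)
print("calls after delayed():", calls)

# Threads share memory, so `calls` is visible to every worker
results = Parallel(n_jobs=4, backend="threading")(
    delayed(slow_echo)(n) for n in range(4)
)
print("results:", results)
```

Task 3 sleeps the least and finishes first, yet it still lands in the last position of the results list.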
What Is Joblib and When Should You Use It?
Joblib provides two things: easy parallelism through a pool of workers, and persistent disk caching of function results. These two features are independent, so you can use either without the other. By default the parallelism is built on loky (a robust process-pool executor), and you can switch to Python’s threading or the original multiprocessing pool when that suits the workload better.
| Tool | Best for | Overhead |
|---|---|---|
| Joblib Parallel (loky) | CPU-bound tasks, data processing | ~100ms startup |
| Joblib Parallel (threading) | IO-bound tasks, numpy releases GIL | ~5ms startup |
| concurrent.futures | Simple async IO, process pools | ~50ms startup |
| multiprocessing.Pool | CPU-bound, full control needed | ~100ms startup |
| asyncio | High-concurrency network IO | Near zero |
Joblib excels when your loop body is CPU-bound (model training, file parsing, image processing) and each iteration takes at least a few milliseconds — enough to justify the inter-process communication cost. For very fast operations (microsecond loops), parallelism overhead outweighs the benefit. The caching feature is valuable for any function with expensive deterministic computations: feature extraction, data loading, hyperparameter search.
Choosing the Right Backend
Joblib supports three execution backends, each suited to different workloads. Understanding when to use each prevents a common trap: the default process-based backend actually slows down IO-bound work because of serialisation overhead.
# backends.py
import time
import numpy as np
from joblib import Parallel, delayed
def cpu_task(size: int) -> float:
    """CPU-bound: pure Python computation."""
    data = list(range(size))
    return sum(x * x for x in data) / len(data)

def numpy_task(size: int) -> float:
    """Numpy releases the GIL -- threading backend works well here."""
    arr = np.random.rand(size)
    return float(np.sqrt(np.sum(arr ** 2)))

items = [100_000] * 8
# Default loky backend (separate processes, best for pure Python CPU work)
start = time.perf_counter()
results_loky = Parallel(n_jobs=4, backend="loky")(
    delayed(cpu_task)(n) for n in items
)
print(f"loky (CPU work): {time.perf_counter() - start:.2f}s")
# Threading backend (shares memory, good when GIL is released by C extensions)
start = time.perf_counter()
results_thread = Parallel(n_jobs=4, backend="threading")(
    delayed(numpy_task)(n) for n in items
)
print(f"threading (NumPy): {time.perf_counter() - start:.2f}s")
# Sequential for comparison
start = time.perf_counter()
results_seq = [numpy_task(n) for n in items]
print(f"sequential: {time.perf_counter() - start:.2f}s")
Output:
loky (CPU work): 0.48s
threading (NumPy): 0.31s
sequential: 1.12s
The loky backend spawns separate Python processes, each with their own memory space and GIL. This is the right choice for pure Python CPU work because it truly runs in parallel. The threading backend runs in threads within the same process. Because Python’s GIL prevents true parallel execution of pure Python code, threading only helps when the task calls into a C extension that releases the GIL — like NumPy, Pandas, or scikit-learn. The multiprocessing backend is the original process pool; prefer loky unless you have a specific compatibility reason to use it.
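For IO-bound work, threads are usually enough because waiting on the network or disk releases the GIL. The sketch below simulates that with time.sleep and asks for threads through Joblib's prefer="threads" hint instead of naming a backend; fake_download is a stand-in, not a real HTTP call.

```python
import time
from joblib import Parallel, delayed


def fake_download(page: int) -> str:
    """Stand-in for an IO-bound call: sleeping releases the GIL like a network wait."""
    time.sleep(0.1)
    return f"page-{page}"


start = time.perf_counter()
# prefer="threads" is a soft hint: Joblib picks a thread-based backend
pages = Parallel(n_jobs=4, prefer="threads")(
    delayed(fake_download)(p) for p in range(8)
)
elapsed = time.perf_counter() - start
print(pages[:3], f"{elapsed:.2f}s")
```

Eight 0.1-second waits on four threads should take roughly 0.2 seconds instead of 0.8, without the cost of starting worker processes.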
Caching Expensive Results with Memory
Joblib’s Memory class caches a function’s return value to disk, keyed by the function’s source code and its arguments. The second call with the same arguments reads from the cache instead of recomputing. This is useful for data loading, feature extraction, or any expensive deterministic step that you run repeatedly during development.
# caching.py
import time
import numpy as np
from joblib import Memory
# Create a cache directory
cache = Memory("./joblib_cache", verbose=1)
@cache.cache
def load_and_process(filepath: str, scale: float = 1.0) -> np.ndarray:
    """Simulate expensive data loading and processing."""
    print(f" [COMPUTING] Loading {filepath} with scale={scale}")
    time.sleep(2) # Simulate a 2-second load
    data = np.random.rand(1000) * scale
    return data
print("First call (cold cache):")
start = time.perf_counter()
result1 = load_and_process("data/features.npy", scale=2.0)
print(f" Took: {time.perf_counter() - start:.2f}s, mean={result1.mean():.4f}")
print("\nSecond call (cache hit):")
start = time.perf_counter()
result2 = load_and_process("data/features.npy", scale=2.0)
print(f" Took: {time.perf_counter() - start:.4f}s, mean={result2.mean():.4f}")
print("\nDifferent args (cache miss):")
start = time.perf_counter()
result3 = load_and_process("data/features.npy", scale=3.0)
print(f" Took: {time.perf_counter() - start:.2f}s, mean={result3.mean():.4f}")
Output:
First call (cold cache):
[COMPUTING] Loading data/features.npy with scale=2.0
Took: 2.01s, mean=0.9987
Second call (cache hit):
Took: 0.0031s, mean=0.9987
Different args (cache miss):
[COMPUTING] Loading data/features.npy with scale=3.0
Took: 2.01s, mean=1.4991
The cache is stored as pickle files in the directory you specify (uncompressed by default; pass compress=True to Memory to trade CPU time for disk space). It is keyed on the function’s source code and all arguments, so if you change the function body, Joblib invalidates the cache automatically on the next call. To clear the cache manually, call cache.clear() or delete the cache directory. The verbose=1 argument makes Joblib print whether it computed or loaded from cache; set it to 0 to silence this output in production.
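To build intuition for why editing a function or changing an argument causes a cache miss, here is a conceptual sketch of a cache key. It is not Joblib's actual implementation (Joblib has its own hashing that also handles NumPy arrays); cache_key and the two scale functions are purely illustrative.

```python
import hashlib


def cache_key(func, *args, **kwargs) -> str:
    """Derive a key from the function's compiled body plus its arguments."""
    code = func.__code__
    payload = repr((code.co_code, code.co_consts, args, sorted(kwargs.items())))
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


def scale_v1(x, factor=2.0):
    return x * factor


def scale_v2(x, factor=2.0):  # same signature, edited body
    return x * factor + 1


print(cache_key(scale_v1, 3, factor=2.0))  # stable for identical calls
print(cache_key(scale_v1, 3, factor=3.0))  # different argument, different key
print(cache_key(scale_v2, 3, factor=2.0))  # different body, different key
```

Any change to either ingredient produces a new key, which is the same reason the third call in the output above was recomputed.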
Joblib with scikit-learn Pipelines
Scikit-learn uses Joblib internally for all its n_jobs parameters — cross-validation, grid search, random forests, and more all use the same Joblib infrastructure. You can control the backend and number of jobs globally using Joblib’s parallel_backend context manager, or pass n_jobs directly to estimators.
# sklearn_parallel.py
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV
from joblib import parallel_backend
# Generate a sample dataset
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
# Train a random forest using all CPU cores
print("Training RandomForest with n_jobs=-1...")
start = time.perf_counter()
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(X, y)
print(f" Fit time: {time.perf_counter() - start:.2f}s")
# Cross-validation in parallel
start = time.perf_counter()
scores = cross_val_score(rf, X, y, cv=5, n_jobs=-1, scoring="accuracy")
print(f" CV scores: {scores.round(3)}, mean={scores.mean():.3f}, time={time.perf_counter() - start:.2f}s")
# Hyperparameter search -- each combo evaluated in parallel
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10, None],
}
start = time.perf_counter()
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    n_jobs=-1,
    verbose=0,
)
grid.fit(X, y)
elapsed = time.perf_counter() - start
print(f" Best params: {grid.best_params_}, score={grid.best_score_:.3f}, time={elapsed:.2f}s")
Output:
Training RandomForest with n_jobs=-1...
Fit time: 0.31s
CV scores: [0.934 0.931 0.929 0.927 0.932], mean=0.931, time=0.48s
Best params: {'max_depth': None, 'n_estimators': 100}, score=0.931, time=1.24s
The n_jobs=-1 parameter on scikit-learn estimators and model-selection utilities goes directly to Joblib. Setting it uses all available cores for that operation. For nested parallelism (a parallel grid search that itself trains parallel random forests), Joblib automatically avoids over-subscribing the CPU — the inner jobs run sequentially when the outer jobs already fill all cores.
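The parallel_backend context manager mentioned earlier is imported in the script above but never exercised, so here is a minimal sketch of it with plain Joblib (no scikit-learn needed). Any Parallel call inside the with block that does not set its own backend or n_jobs picks up the values from the context; cube is an illustrative function.

```python
from joblib import Parallel, delayed, parallel_backend


def cube(n: int) -> int:
    return n ** 3


# Everything inside the block defaults to 2 threads
with parallel_backend("threading", n_jobs=2):
    results = Parallel()(delayed(cube)(n) for n in range(5))

print(results)
```

Wrapping a scikit-learn call such as grid.fit(X, y) in the same kind of block works the same way. Recent Joblib releases also provide parallel_config, which covers the same use case with a few more options.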
Real-Life Example: Parallel Feature Extraction Pipeline
The following pipeline processes a directory of text files, extracts word-frequency features from each, and caches the results. Combining Parallel with Memory gives you both speed and resilience — if the pipeline is interrupted, the cached results mean you do not repeat work already done.
# feature_pipeline.py
import os
import time
import re
from collections import Counter
from pathlib import Path
from joblib import Parallel, delayed, Memory
cache = Memory("./feature_cache", verbose=0)
# --- Create sample text files ---
SAMPLE_DIR = Path("sample_texts")
SAMPLE_DIR.mkdir(exist_ok=True)
sample_texts = {
    "python.txt": "Python is a high-level programming language. Python emphasises readability.",
    "data.txt": "Data science uses statistics and programming. Data analysis reveals patterns.",
    "web.txt": "Web development creates websites and applications. The web uses HTML CSS JavaScript.",
    "ai.txt": "Artificial intelligence mimics human thinking. Machine learning trains models on data.",
    "cloud.txt": "Cloud computing provides on-demand resources. Cloud services scale automatically.",
}
for fname, text in sample_texts.items():
    (SAMPLE_DIR / fname).write_text(text * 50) # Make files large enough to matter

@cache.cache
def extract_features(filepath: str) -> dict:
    """Extract word frequency features from a text file (cached)."""
    text = Path(filepath).read_text().lower()
    words = re.findall(r'\b[a-z]{3,}\b', text)
    top_words = dict(Counter(words).most_common(10))
    time.sleep(0.3) # Simulate expensive NLP processing
    return {"file": Path(filepath).name, "word_count": len(words), "top_words": top_words}

def run_pipeline(data_dir: Path) -> list[dict]:
    files = [str(f) for f in data_dir.glob("*.txt")]
    print(f"Processing {len(files)} files in parallel...")
    start = time.perf_counter()
    results = Parallel(n_jobs=-1, verbose=10)(
        delayed(extract_features)(f) for f in files
    )
    elapsed = time.perf_counter() - start
    print(f"Done in {elapsed:.2f}s")
    return results

features = run_pipeline(SAMPLE_DIR)
for feat in features:
    top3 = list(feat["top_words"].keys())[:3]
    print(f" {feat['file']:15s} words={feat['word_count']:,} top={top3}")
Output (first run — cold cache):
Processing 5 files in parallel...
[Parallel(n_jobs=-1)]: Done 5 out of 5 | elapsed: 0.4s finished
Done in 0.41s
python.txt words=400 top=['python', 'high', 'level']
data.txt words=500 top=['data', 'science', 'uses']
web.txt words=600 top=['web', 'development', 'creates']
ai.txt words=500 top=['artificial', 'intelligence', 'mimics']
cloud.txt words=450 top=['cloud', 'computing', 'provides']
s from a file or a database, the cache becomes stale when that data changes. You are responsible for clearing the cache when upstream data changes, either by calling memory.clear(), by passing a version argument to the function, or by using a time-based expiry implemented in the function body.
How do I track progress in a long Parallel job?
Set verbose=10 in Parallel() to print progress messages as jobs complete, including elapsed time and, once enough tasks have finished, an estimate of the remaining time (values above 10 report every task). For a progress bar, use the tqdm library: wrap the generator with tqdm(delayed(func)(x) for x in items, total=len(items)). Joblib pulls items from the tqdm-wrapped iterator, so the bar advances as tasks are dispatched to workers rather than as they finish.
Are there memory issues with Joblib on long-running jobs?
When using the loky backend, worker processes are reused across calls, so memory held by workers can accumulate over a long session. Large NumPy arrays passed as arguments are handled separately: max_nbytes (for example max_nbytes="10M") sets the size above which input arrays are memory-mapped to a temporary folder instead of being pickled to each worker; it does not apply to return values. To release worker memory between batches, shut down the reusable executor, for example with get_reusable_executor().shutdown(wait=True) from joblib.externals.loky, and let the next Parallel call start fresh workers.
Conclusion
Joblib makes two of the most common performance problems in data pipelines trivially easy to solve: parallelising embarrassingly parallel loops with Parallel and delayed, and caching expensive deterministic computations with Memory. You have seen how to replace a for loop with a parallel equivalent in four lines, choose the right backend for CPU-bound versus IO-bound work, cache results to disk, and integrate both patterns with scikit-learn.
The natural extension of the feature extraction pipeline is to add a cache validation step that checks file modification timestamps, and to feed the extracted features directly into a scikit-learn pipeline with n_jobs=-1 cross-validation -- so both the feature extraction and the model evaluation run in parallel with full caching.
For the full Joblib reference including memory-mapped arrays, batch processing, and custom backends, see the official Joblib documentation.
Further Reading: For more details, see the Python webbrowser module documentation.
Frequently Asked Questions
What is Selenium WebDriver used for in Python?
Selenium WebDriver is a tool for automating web browser interactions. In Python, it is used for web scraping, automated testing of web applications, form filling, screenshot capture, and any task that requires programmatic control of a web browser.
Which browser drivers work with Selenium in Python?
Selenium supports ChromeDriver (Chrome/Chromium), GeckoDriver (Firefox), EdgeDriver (Microsoft Edge), and SafariDriver (Safari). ChromeDriver and GeckoDriver are the most commonly used for Linux-based automation.
How do I install ChromeDriver on Linux?
Download the ChromeDriver build that matches your Chrome version from the official site, extract it, and place it in a directory on your PATH (e.g., /usr/local/bin/). Alternatively, install the webdriver-manager package (pip install webdriver-manager) to handle driver installation automatically.
Why do I get ‘WebDriver not found’ errors?
This typically occurs when the driver executable is not in your system PATH, the driver version does not match your browser version, or the driver file lacks execute permissions. Use chmod +x chromedriver to set permissions and ensure version compatibility.
Can Selenium run without a visible browser window?
Yes. Use headless mode by adding options.add_argument('--headless') to your browser options. This runs the browser in the background without a GUI, which is faster and ideal for servers and CI/CD pipelines.
Installing the Right Driver Binary
Selenium needs a browser-specific driver binary on the system PATH or pointed to explicitly. The two paths that work on Linux:
Option 1 — Selenium Manager (Selenium 4.6+): The library auto-downloads the right driver. Zero setup beyond installing selenium:
# pip install selenium
from selenium import webdriver
driver = webdriver.Chrome() # auto-downloads chromedriver
driver.get("https://example.com")
print(driver.title)
driver.quit()
Option 2 — webdriver-manager: Explicit installation per session, handy when you need to pin a version:
# pip install webdriver-manager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
Headless Mode for Servers
On a server with no display, you need headless mode (and matching Chrome / Chromium installed). The minimal Chrome install on Ubuntu 22.04 and Debian:
# Install Chrome and the libraries it needs
sudo apt-get update
sudo apt-get install -y wget gnupg
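# apt-key is deprecated on current Ubuntu/Debian; store the key in a dedicated keyring instead
wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | \
sudo gpg --dearmor -o /usr/share/keyrings/google-chrome.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/google-chrome.gpg] http://dl.google.com/linux/chrome/deb/ stable main" | \
sudo tee /etc/apt/sources.list.d/google-chrome.list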
sudo apt-get update
sudo apt-get install -y google-chrome-stable
# Python: enable headless
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
opts = Options()
opts.add_argument("--headless=new") # use the new headless mode (Chrome 109+)
opts.add_argument("--no-sandbox") # required when running as root
opts.add_argument("--disable-dev-shm-usage") # avoid /dev/shm size issues
opts.add_argument("--window-size=1920,1080") # avoid layout-dependent failures
driver = webdriver.Chrome(options=opts)
The --disable-dev-shm-usage flag fixes a notorious crash in Docker containers where the shared-memory partition is too small. --no-sandbox is required when Chrome runs as root (Docker default).
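Since these flags depend on where the script runs, one option is to assemble them in a small helper rather than hard-coding all of them. This is a sketch under the assumptions stated in the paragraph above (root needs --no-sandbox, containers benefit from --disable-dev-shm-usage); build_chrome_args is a hypothetical helper, not part of Selenium.

```python
import os
from typing import List, Optional


def build_chrome_args(in_container: bool = False, as_root: Optional[bool] = None) -> List[str]:
    """Return Chrome command-line flags suited to the current environment."""
    if as_root is None:
        # os.geteuid is not available on Windows, hence the hasattr guard
        as_root = hasattr(os, "geteuid") and os.geteuid() == 0
    args = ["--headless=new", "--window-size=1920,1080"]
    if as_root:
        args.append("--no-sandbox")  # Chrome will not start as root with the sandbox on
    if in_container:
        args.append("--disable-dev-shm-usage")  # work around a small /dev/shm
    return args


print(build_chrome_args(in_container=True, as_root=True))
```

To apply the result, loop over the list and call opts.add_argument(arg) on the Options object before creating the driver.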
Firefox / geckodriver
If Chrome isn’t your target, swap in Firefox. Same pattern, different driver:
sudo apt-get install -y firefox
# Python
from selenium import webdriver
from selenium.webdriver.firefox.options import Options as FFOptions
from selenium.webdriver.firefox.service import Service as FFService
from webdriver_manager.firefox import GeckoDriverManager
opts = FFOptions()
opts.add_argument("--headless")
service = FFService(GeckoDriverManager().install())
driver = webdriver.Firefox(service=service, options=opts)
driver.get("https://example.com")
Docker Setup for Selenium
For CI / production, run Selenium in Docker rather than installing system-wide. The official Selenium images have everything bundled:
# Pull a ready-to-go Chrome stack
docker run -d -p 4444:4444 -p 7900:7900 --shm-size=2g \
selenium/standalone-chrome:latest
# Now connect from any host (no local Chrome needed)
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
opts = Options()
opts.add_argument("--headless=new")
driver = webdriver.Remote(
command_executor="http://localhost:4444/wd/hub",
options=opts,
)
driver.get("https://example.com")
The --shm-size=2g on the container fixes the same shared-memory issue as --disable-dev-shm-usage in the Chrome args. Pick whichever is convenient.
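The container takes a moment to start, so a script that connects immediately after docker run can fail. A common workaround is to poll the server's status endpoint first. The sketch below assumes the Selenium 4 Grid layout, where GET /status returns JSON with a value.ready boolean; grid_is_ready and wait_for_grid are hypothetical helper names.

```python
import json
import time
import urllib.request


def grid_is_ready(status_payload: dict) -> bool:
    """Interpret a /status response body (assumed shape: {"value": {"ready": bool}})."""
    return bool(status_payload.get("value", {}).get("ready", False))


def wait_for_grid(base_url: str = "http://localhost:4444",
                  timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Poll base_url/status until the server reports ready or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/status", timeout=5) as resp:
                if grid_is_ready(json.load(resp)):
                    return True
        except (OSError, ValueError):
            pass  # server not up yet, or the response was not valid JSON
        time.sleep(interval)
    return False
```

Call wait_for_grid() before webdriver.Remote(...) and stop with a clear error message if it returns False.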
Verifying Your Setup
A short smoke test catches most install failures:
# File: test_selenium.py
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
opts = Options()
opts.add_argument("--headless=new")
opts.add_argument("--no-sandbox")
driver = webdriver.Chrome(options=opts)
driver.get("https://www.python.org")
print("Title:", driver.title)
print("URL:", driver.current_url)
driver.quit()
If this runs and prints “Welcome to Python.org” — you’re done. If it fails, the error message tells you exactly what’s missing (driver, browser binary, sandbox flag, etc.).
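One caveat with the smoke test: if driver.get() raises, driver.quit() never runs and the browser process can linger. A small context manager avoids that. This is a generic sketch that works with any object exposing quit(); managed_driver is a hypothetical helper name.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always call quit(), even on errors."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()


# Usage with Selenium (requires a working browser and driver):
# with managed_driver(lambda: webdriver.Chrome(options=opts)) as driver:
#     driver.get("https://www.python.org")
#     print("Title:", driver.title)
```

Taking a factory rather than a ready-made driver means the helper also covers failures that happen right after the browser starts.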
Common Pitfalls
- Mixing Chrome and chromedriver versions. chromedriver must match Chrome’s major version. Selenium Manager and webdriver-manager both handle this; manual installs break with every Chrome update.
- Forgetting --no-sandbox in Docker. Chrome refuses to run as root (the Docker default) without it. Add the flag or run as a non-root user.
- Insufficient /dev/shm. The default 64MB of shared memory in Docker isn’t enough. Use --shm-size=2g or --disable-dev-shm-usage.
- Missing browser binary. chromedriver alone isn’t enough; you also need Chrome itself installed. Same for Firefox + geckodriver.
- Old --headless flag. Chrome’s old headless mode is deprecated in favor of --headless=new (Chrome 109+). The new mode is faster and renders more accurately.
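The version-mismatch pitfall can be checked mechanically by comparing major versions. The sketch below parses version strings such as those printed by google-chrome --version and chromedriver --version; the sample strings are illustrative, and major_version and versions_match are hypothetical helper names.

```python
import re


def major_version(version_output: str) -> int:
    """Extract the major version from text like 'Google Chrome 120.0.6099.109'."""
    match = re.search(r"(\d+)\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def versions_match(browser_output: str, driver_output: str) -> bool:
    """True when the browser and the driver share the same major version."""
    return major_version(browser_output) == major_version(driver_output)


# Illustrative sample strings, not captured from a real machine
print(versions_match("Google Chrome 120.0.6099.109", "ChromeDriver 120.0.6099.71 (abc123)"))
print(versions_match("Google Chrome 121.0.6167.85", "ChromeDriver 120.0.6099.71 (abc123)"))
```

In a real check you would obtain both strings with subprocess.run([...], capture_output=True, text=True) and compare them before starting a test run.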
FAQ
Q: Selenium or Playwright?
A: For new projects, Playwright is faster, has better selectors, and auto-handles waits. Selenium is mature and ubiquitous — if you have existing Selenium tests or need browser support beyond Chrome/Firefox/WebKit, stick with it.
Q: Headless or headful?
A: Headless for CI, scrapers, and any unattended workflow. Headful when developing, because you can see what your code is doing, which makes debugging much faster.
Q: How do I run as a specific browser version?
A: Install that specific version of Chrome / Firefox, then point Selenium at it: options.binary_location = "/path/to/chrome". webdriver-manager can also pin to a version.
Q: Why is the test slow on the first run?
A: The driver download. Subsequent runs use the cached binary. CI systems should cache ~/.wdm (webdriver-manager) and ~/.cache/selenium.
Q: How do I bypass Cloudflare / bot protection?
A: Standard Selenium gets blocked by Cloudflare. Use undetected-chromedriver (better) or Playwright with stealth plugins (best). For aggressive bot detection, you may need to rotate user agents and use residential proxies.
Wrapping Up
Selenium on Linux comes down to three pieces: Python’s selenium package, the browser binary (Chrome or Firefox), and the driver binary (chromedriver or geckodriver). Selenium Manager handles the driver auto-download. --headless=new, --no-sandbox, and --disable-dev-shm-usage are the three flags that make Chrome work reliably in Docker. Get that combination right and Selenium runs cleanly in CI, on servers, and in production scrapers.