Python is famously readable, but famously slow for tight numerical loops. If you have ever benchmarked a pure Python loop against the equivalent C or Fortran code, you know the gap is often 100x or worse. For most web code and business logic that does not matter, but in scientific computing, signal processing, or machine learning pipelines it matters enormously. That is where numba enters: a JIT compiler that translates Python and NumPy functions into fast machine code at runtime, with almost no changes to your source.
Numba is built on LLVM, the same compiler infrastructure that powers Clang and Rust. When you decorate a function with @jit, numba watches the first call, infers the types of the arguments, compiles the function body to native machine code, and caches that compiled version for all subsequent calls. The result is code that often approaches, and sometimes matches, the speed of hand-written C, while your source file still looks like Python.
This article covers the complete numba workflow: installing the library, understanding the @jit and @njit decorators, creating your own NumPy ufuncs with @vectorize, enabling parallel execution with parallel=True, and knowing when numba helps versus hurts. By the end you will have a fully benchmarked Monte Carlo simulation that outruns vectorized NumPy without writing a single line of C.
Numba Quick Example: 100x Faster Loop
The fastest way to understand numba is to see how little you need to change. Here is a pure Python loop computing the sum of squares, and its numba-compiled equivalent:
```python
# quick_numba.py
import time

import numba

# Pure Python version
def sum_squares_python(n):
    total = 0.0
    for i in range(n):
        total += i * i
    return total

# Numba JIT version -- only the decorator changes
@numba.jit(nopython=True)
def sum_squares_numba(n):
    total = 0.0
    for i in range(n):
        total += i * i
    return total

N = 10_000_000

# Warm up numba (first call triggers compilation)
sum_squares_numba(1)

t0 = time.perf_counter()
result_py = sum_squares_python(N)
t1 = time.perf_counter()
result_nb = sum_squares_numba(N)
t2 = time.perf_counter()

print(f"Python: {t1-t0:.3f}s result={result_py:.0f}")
print(f"Numba: {t2-t1:.4f}s result={result_nb:.0f}")
print(f"Speedup: {(t1-t0)/(t2-t1):.0f}x")
```
Output:
```text
Python: 2.841s result=333333328333335000000
Numba: 0.0210s result=333333328333335000000
Speedup: 135x
```
The only change to the function was adding @numba.jit(nopython=True). Numba compiled the loop body to machine code that runs entirely without the Python interpreter — the same values, same logic, 135x faster. The first call is slower because it triggers compilation; every subsequent call runs at full speed.
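The compilation cost itself is easy to observe directly. Here is a minimal sketch timing the first call (which compiles) against the second (which runs the cached machine code); the exact numbers will vary by machine:

```python
# compile_cost.py -- measuring the one-time JIT compilation overhead
import time

import numba

@numba.njit
def add_one(x):
    return x + 1

t0 = time.perf_counter()
add_one(1)  # first call: type inference + LLVM compilation + execution
t1 = time.perf_counter()
add_one(1)  # second call: cached machine code only
t2 = time.perf_counter()

print(f"First call (includes compile): {t1 - t0:.3f}s")
print(f"Second call:                   {t2 - t1:.6f}s")
```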
Installing Numba
Install numba via pip or conda. The conda route is recommended in scientific environments because it handles the LLVM dependency automatically:
```bash
# pip
pip install numba numpy

# conda (preferred for scientific stacks)
conda install numba

# Verify installation
python -c "import numba; print(numba.__version__)"
```
Output:
```text
0.59.1
```
Prebuilt wheels cover most common platforms, so pip normally needs no compiler. If pip does have to build numba from source, the Visual C++ Build Tools must be installed on Windows, while on macOS and Linux the system compiler (clang or gcc) is sufficient. If you see an LLVM-related error, install numba via conda, which bundles the needed LLVM runtime.
| Decorator | Use Case | NumPy Support | Python Objects |
|---|---|---|---|
| `@jit` | General loops, fallback mode | Yes | Limited (object mode) |
| `@njit` / `@jit(nopython=True)` | Loops, math, no Python objects | Yes | No |
| `@vectorize` | Scalar-to-array ufuncs | Yes | No |
| `@guvectorize` | Array-to-array ufuncs | Yes | No |
| `@stencil` | Sliding-window patterns | Yes | No |
| `@cuda.jit` | GPU kernels | Yes | No |
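Most of these decorators get their own section below. @stencil does not, so here is a quick sketch of the idea: you describe the computation for one output element using relative neighbor offsets, and numba applies it across the whole array (out-of-bounds neighbors default to zero):

```python
# stencil_sketch.py -- a three-point moving average via @stencil
import numba
import numpy as np

@numba.stencil
def mean3(a):
    # relative indexing: a[-1] and a[1] are the left and right neighbors
    return (a[-1] + a[0] + a[1]) / 3

x = np.arange(10.0)
print(mean3(x))  # boundary elements fall back to 0 by default
```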
The @jit and @njit Decorators
The @jit decorator is numba's primary entry point. In older numba releases its default mode tried to compile to native code but silently fell back to "object mode" (interpreted Python) when it hit unsupported constructs; that fallback was deprecated in 0.57, and as of 0.59 @jit compiles in nopython mode by default. The @njit shorthand (equivalent to @jit(nopython=True)) makes the strict behavior explicit on every version and raises an error instead of falling back -- this is almost always what you want, because a silent fallback means you are running slow code and thinking it is fast.
```python
# jit_modes.py
import numba
import numpy as np
from numba import float64

# @njit: strict, raises an error if something cannot be compiled
@numba.njit
def dot_product(a, b):
    result = 0.0
    for i in range(len(a)):
        result += a[i] * b[i]
    return result

# Caching: saves the compiled binary to disk so the first call of the
# NEXT run is also fast (avoids recompiling on every script restart)
@numba.njit(cache=True)
def euclidean_distance(a, b):
    total = 0.0
    for i in range(len(a)):
        diff = a[i] - b[i]
        total += diff * diff
    return total ** 0.5

# Eager compilation: specify types upfront so the first call is instant
@numba.njit(float64(float64[:], float64[:]))
def weighted_sum(values, weights):
    total = 0.0
    for i in range(len(values)):
        total += values[i] * weights[i]
    return total

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
w = np.array([0.2, 0.5, 0.3])

print(f"Dot product: {dot_product(a, b):.1f}")
print(f"Euclidean distance: {euclidean_distance(a, b):.4f}")
print(f"Weighted sum: {weighted_sum(a, w):.2f}")
```
Output:
```text
Dot product: 32.0
Euclidean distance: 5.1962
Weighted sum: 2.10
```
Use cache=True in production scripts that run repeatedly — numba saves the compiled binary to __pycache__ and reloads it on subsequent runs, eliminating the compilation delay. Use eager compilation (passing a type signature) when you know the input types in advance and want guaranteed zero overhead on the first call.
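One more detail worth knowing: numba compiles a separate specialization for each distinct combination of argument types, and the dispatcher records them all. A small sketch using the dispatcher's signatures attribute (the exact printed representation may vary by numba version):

```python
# specializations.py -- one compiled version per argument-type combination
import numba
import numpy as np

@numba.njit
def first_plus_last(a):
    return a[0] + a[-1]

first_plus_last(np.array([1.0, 2.0]))              # compiles a float64[:] version
first_plus_last(np.array([1, 2], dtype=np.int64))  # compiles an int64[:] version

print(first_plus_last.signatures)  # two entries, one per specialization
```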
@vectorize: Creating NumPy Ufuncs
NumPy’s built-in operations like np.sin and np.exp are already fast because they are implemented in C as “ufuncs” that broadcast over arrays without Python overhead. The @numba.vectorize decorator lets you create your own ufuncs from pure Python scalar logic:
```python
# vectorize_demo.py
import time

import numba
import numpy as np

# Define the type signatures the ufunc should support
@numba.vectorize(['float64(float64)', 'float32(float32)'])
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

@numba.vectorize(['float64(float64, float64)'])
def clipped_relu(x, threshold):
    if x < 0.0:
        return 0.0
    elif x > threshold:
        return threshold
    return x

arr = np.linspace(-5, 5, 10_000_000, dtype=np.float64)
thresholds = np.full(10_000_000, 3.0)

# Pure NumPy equivalent for comparison
def sigmoid_numpy(x):
    return 1.0 / (1.0 + np.exp(-x))  # already vectorized via numpy

t0 = time.perf_counter()
out_np = sigmoid_numpy(arr)
t1 = time.perf_counter()
out_nb = sigmoid(arr)
t2 = time.perf_counter()

print(f"NumPy sigmoid: {t1-t0:.4f}s")
print(f"Numba sigmoid: {t2-t1:.4f}s")
print(f"Max difference: {np.max(np.abs(out_np - out_nb)):.2e}")
print(f"Clipped relu sample: {clipped_relu(arr[:5], thresholds[:5])}")
```
Output:
```text
NumPy sigmoid: 0.0621s
Numba sigmoid: 0.0184s
Max difference: 0.00e+00
Clipped relu sample: [0. 0. 0. 0. 0.]
```
@vectorize is most useful when your scalar logic does not map cleanly to a single NumPy expression -- branching logic like clipped_relu would need nested np.where calls (or, in this simple case, np.clip) in NumPy but reads naturally in numba. The resulting function behaves exactly like a NumPy ufunc: it supports broadcasting, works on arrays of any shape, and returns an array matching the broadcast shape of the inputs.
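@vectorize also accepts a target keyword; target='parallel' spreads the ufunc across CPU cores, which pays off for expensive per-element math on large arrays (for cheap kernels the threading overhead can erase the gain). A short sketch:

```python
# parallel_ufunc.py -- a multi-threaded ufunc via target='parallel'
import numba
import numpy as np

@numba.vectorize(['float64(float64)'], target='parallel')
def softplus(x):
    # numerically stable softplus: log(1 + e^x)
    return np.log1p(np.exp(-abs(x))) + max(x, 0.0)

arr = np.linspace(-10, 10, 10_000_000)
print(softplus(arr)[:3])
```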
Parallel Execution with parallel=True
Numba can automatically parallelize loops over arrays using all available CPU cores by adding parallel=True and replacing Python’s range with numba.prange:
```python
# parallel_numba.py
import time

import numba
import numpy as np
from numba import prange

@numba.njit(parallel=True)
def parallel_pairwise_distance(X):
    """Compute n x n pairwise Euclidean distance matrix."""
    n = X.shape[0]
    d = X.shape[1]
    result = np.zeros((n, n), dtype=np.float64)
    for i in prange(n):  # prange parallelizes this loop
        for j in range(i, n):
            dist = 0.0
            for k in range(d):
                diff = X[i, k] - X[j, k]
                dist += diff * diff
            dist = dist ** 0.5
            result[i, j] = dist
            result[j, i] = dist
    return result

@numba.njit
def serial_pairwise_distance(X):
    """Same but single-threaded."""
    n = X.shape[0]
    d = X.shape[1]
    result = np.zeros((n, n), dtype=np.float64)
    for i in range(n):
        for j in range(i, n):
            dist = 0.0
            for k in range(d):
                diff = X[i, k] - X[j, k]
                dist += diff * diff
            dist = dist ** 0.5
            result[i, j] = dist
            result[j, i] = dist
    return result

X = np.random.rand(2000, 10)

# Warm up both functions so the timings exclude compilation
parallel_pairwise_distance(X[:10])
serial_pairwise_distance(X[:10])

t0 = time.perf_counter()
r_serial = serial_pairwise_distance(X)
t1 = time.perf_counter()
r_parallel = parallel_pairwise_distance(X)
t2 = time.perf_counter()

print(f"Serial: {t1-t0:.3f}s")
print(f"Parallel: {t2-t1:.3f}s")
print(f"Speedup: {(t1-t0)/(t2-t1):.1f}x (on {numba.get_num_threads()} threads)")
print(f"Results match: {np.allclose(r_serial, r_parallel)}")
```
Output:
```text
Serial: 0.847s
Parallel: 0.122s
Speedup: 6.9x (on 8 threads)
Results match: True
```
Parallel speedup scales with the number of CPU cores. The key constraint is that the loop iterations must be independent -- if iteration i depends on iteration i-1, the loop cannot be parallelized. Using prange signals to numba that you guarantee independence; with regular range and parallel=True, numba still compiles the function (and may auto-parallelize whole-array expressions inside it), but it will not parallelize that explicit loop.
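One useful exception to the independence rule: numba recognizes simple scalar reductions (total += ..., total *= ...) inside a prange loop and parallelizes them safely using per-thread partial results. A minimal sketch:

```python
# prange_reduction.py -- scalar += inside prange is a supported reduction
import numba
import numpy as np
from numba import prange

@numba.njit(parallel=True)
def parallel_sum(x):
    total = 0.0
    for i in prange(len(x)):
        total += x[i]  # recognized as a reduction, not a data race
    return total

x = np.random.rand(1_000_000)
print(np.isclose(parallel_sum(x), x.sum()))  # True, up to float rounding
```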
Real-Life Example: Monte Carlo Pi Estimation
Monte Carlo simulation is a classic use case for numba: a tight inner loop with no dependencies between iterations, pure floating-point math, and a meaningful speedup target. This implementation estimates pi by sampling random points inside a unit square and counting how many fall inside the unit circle:
```python
# monte_carlo_pi.py
import time

import numba
import numpy as np
from numba import prange

@numba.njit(parallel=True, cache=True)
def monte_carlo_pi(n_samples):
    """Estimate pi via Monte Carlo simulation."""
    inside = 0
    for _ in prange(n_samples):
        x = np.random.random()
        y = np.random.random()
        if x * x + y * y <= 1.0:
            inside += 1  # parallel reduction across threads
    return 4.0 * inside / n_samples

def monte_carlo_pi_numpy(n_samples):
    """Same computation using vectorized NumPy."""
    x = np.random.rand(n_samples)
    y = np.random.rand(n_samples)
    inside = np.sum(x * x + y * y <= 1.0)
    return 4.0 * inside / n_samples

N = 100_000_000

# Warm up (compile before timing)
monte_carlo_pi(1000)

t0 = time.perf_counter()
pi_np = monte_carlo_pi_numpy(N)
t1 = time.perf_counter()
pi_nb = monte_carlo_pi(N)
t2 = time.perf_counter()

print(f"NumPy pi={pi_np:.6f} time={t1-t0:.3f}s")
print(f"Numba pi={pi_nb:.6f} time={t2-t1:.3f}s")
print(f"True pi={np.pi:.6f}")
print(f"Speedup: {(t1-t0)/(t2-t1):.1f}x")
```
Output:
```text
NumPy pi=3.141596 time=1.823s
Numba pi=3.141618 time=0.241s
True pi=3.141593
Speedup: 7.6x
```
The numba version splits the samples across threads and never allocates a 100-million-element array. NumPy allocates two 800 MB arrays before doing any math, which puts heavy pressure on memory bandwidth. Numba's loop generates one random number per iteration, tests it immediately, and discards it -- a far more cache-friendly access pattern. For problems larger than the CPU cache, this streaming pattern tends to outperform NumPy's vectorized approach even on a single core.
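If you are stuck with pure NumPy, chunking the work narrows (but does not close) that memory gap. A hedged sketch, where the chunk size is just an illustrative tuning knob:

```python
# chunked NumPy variant: many small allocations instead of two 800 MB ones
import numpy as np

def monte_carlo_pi_chunked(n_samples, chunk=1_000_000):
    inside = 0
    for start in range(0, n_samples, chunk):
        m = min(chunk, n_samples - start)  # last chunk may be shorter
        x = np.random.rand(m)
        y = np.random.rand(m)
        inside += np.count_nonzero(x * x + y * y <= 1.0)
    return 4.0 * inside / n_samples
```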
Frequently Asked Questions
Why is the first numba call slow?
The first call triggers JIT compilation: numba inspects the argument types, generates LLVM IR, and compiles it to machine code. This takes 0.1–2 seconds depending on function complexity. Use cache=True to save the compiled binary to disk, or pass an explicit type signature to compile at import time with @njit(float64(float64[:])). In long-running servers or batch jobs, the warm-up cost amortizes quickly over millions of calls.
Does numba replace NumPy?
No -- numba and NumPy are complementary. NumPy excels at vectorized array operations on large, regular data. Numba excels at loops with complex branching logic, accumulated state (like running totals), or access patterns that are hard to vectorize. Many high-performance scientific libraries use NumPy for data layout and numba for the innermost computation. A common pattern is to pass NumPy arrays into @njit functions and return NumPy arrays from them.
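A sketch of that pattern: an exponentially weighted moving average has loop-carried state (each output depends on the previous one), which defeats clean vectorization but is trivial inside @njit, with NumPy arrays on both sides of the boundary:

```python
# ewma.py -- NumPy owns the data, numba owns the stateful loop
import numba
import numpy as np

@numba.njit
def ewma(x, alpha):
    """Exponentially weighted moving average: each step depends on the last."""
    out = np.empty_like(x)  # allocating NumPy arrays works in nopython mode
    out[0] = x[0]
    for i in range(1, len(x)):
        out[i] = alpha * x[i] + (1.0 - alpha) * out[i - 1]
    return out

print(ewma(np.array([1.0, 2.0, 3.0, 4.0]), 0.5))  # [1. 1.5 2.25 3.125]
```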
What Python constructs does numba support in nopython mode?
Numba supports Python arithmetic, boolean logic, comparisons, while/for loops, if/else, most math functions, and a large subset of NumPy array operations including indexing, slicing, shape, and common ufuncs. It does not support plain Python dictionaries, sets, string formatting, list comprehensions with dynamic types, or arbitrary Python objects (numba.typed.Dict and numba.typed.List are the compiled alternatives for the container cases). When in doubt, decorate with @njit and run -- numba's error messages clearly identify the unsupported construct.
How do I debug a numba-compiled function?
Temporarily remove the @njit decorator and run the function as plain Python -- it will behave identically, just slowly. Once the logic is correct, add the decorator back. You can also use @njit(boundscheck=True) to enable bounds checking (raising IndexError on out-of-bounds array access instead of silently reading or writing past the end of the array) and numba.typed.List for typed lists that work in nopython mode. The func.inspect_types() method shows the inferred type of each variable.
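A quick sketch of the bounds-checking option in action:

```python
# boundscheck_demo.py -- catching out-of-bounds access during debugging
import numba
import numpy as np

@numba.njit(boundscheck=True)
def get(a, i):
    return a[i]  # with boundscheck=True this raises instead of reading garbage

a = np.arange(3.0)
try:
    get(a, 10)
except IndexError as e:
    print("caught:", e)
```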
Can numba run on GPUs?
Yes -- numba supports CUDA GPU kernels via @numba.cuda.jit. You write the kernel as a Python function that operates on individual elements, and CUDA launches it across thousands of threads. You need an NVIDIA GPU with CUDA support and the cudatoolkit package installed. For most numerical work, the parallel CPU mode (parallel=True) delivers sufficient speedup without the complexity of GPU memory management, but for matrix operations on data above ~1 GB the GPU path becomes compelling.
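For orientation, here is a minimal kernel sketch; it assumes an NVIDIA GPU with a working CUDA installation:

```python
# cuda_sketch.py -- a minimal element-wise GPU kernel
import numpy as np
from numba import cuda

@cuda.jit
def scale(x, factor, out):
    i = cuda.grid(1)   # absolute index of this thread in the 1-D grid
    if i < x.size:     # guard: the grid may be larger than the array
        out[i] = x[i] * factor

x = np.arange(1_000_000, dtype=np.float64)
out = np.zeros_like(x)
threads_per_block = 256
blocks = (x.size + threads_per_block - 1) // threads_per_block
scale[blocks, threads_per_block](x, 2.0, out)  # numba copies arrays to/from the device
print(out[:3])  # [0. 2. 4.]
```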
Conclusion
Numba eliminates the traditional trade-off between Python's readability and native code performance. By adding a single decorator to functions with tight loops or heavy floating-point math, you can achieve speedups of 10x to 200x over pure Python and 2x to 10x over NumPy vectorization. In this article we covered the @njit decorator for strict compilation, cache=True for persisting compiled code, @vectorize for custom NumPy ufuncs, and parallel=True with prange for multi-core execution.
The most important rule is to use @njit rather than @jit so that silent fallbacks are impossible. Apply numba to the innermost loop of your computation -- the 20% of code consuming 80% of your runtime -- and leave the rest as plain Python. Profile first with cProfile or line_profiler to confirm where the bottleneck actually is before adding any decorator. When you are ready for the next level, explore @guvectorize for generalized ufuncs, numba.typed.Dict for compiled dictionaries, and @numba.cuda.jit for GPU kernels.