Intermediate
You have two ways to do the same thing in Python — a list comprehension and a for-loop, str.join() and repeated concatenation, sorted() and list.sort() — and you want to know which one is faster. Guessing is unreliable, and a manual time.time() wrapper gives one-shot measurements that vary wildly with OS scheduling. Python’s timeit module was built to solve exactly this problem.
timeit is part of Python’s standard library — nothing to install. It works by running your code snippet many thousands of times in a tight loop, averaging out random OS noise so the result actually reflects the code’s performance. You can use it from the command line for quick checks or from Python code for systematic benchmarks you can embed in your test suite.
This tutorial covers everything you need to benchmark Python accurately: the CLI interface, the timeit.timeit() and timeit.repeat() functions, benchmarking callable functions with setup code, comparing multiple implementations, and avoiding the common gotchas that give misleading results. By the end, you will have a reusable benchmarking harness you can apply to any performance question in your codebase.
Benchmarking Two Approaches: Quick Example
Here is the fastest way to settle a performance debate — comparing list comprehension against map() for squaring numbers:
# quick_timeit.py
import timeit
# Benchmark list comprehension
t1 = timeit.timeit('[x**2 for x in range(1000)]', number=10000)
# Benchmark map()
t2 = timeit.timeit('list(map(lambda x: x**2, range(1000)))', number=10000)
print(f"List comprehension: {t1:.4f}s over 10,000 runs")
print(f"map() equivalent: {t2:.4f}s over 10,000 runs")
print(f"Winner: {'list comprehension' if t1 < t2 else 'map()'} by {abs(t1-t2):.4f}s")
Output:
List comprehension: 0.8231s over 10,000 runs
map() equivalent: 0.9847s over 10,000 runs
Winner: list comprehension by 0.1616s
The number=10000 argument runs the snippet 10,000 times and returns the total elapsed time. Dividing by number gives the per-call cost. Running many repetitions is what makes timeit reliable -- a single execution is too noisy to trust.
What Is timeit and When Should You Use It?
The timeit module provides a simple way to time small bits of Python code with high accuracy. It works by disabling garbage collection during the measurement and running the snippet inside a tight loop, both of which reduce measurement noise significantly compared to a manual time.time() wrapper.
| Tool | Use When | Granularity | Setup Required |
|---|---|---|---|
| timeit | Comparing small snippets, functions | Microseconds | None (stdlib) |
| time.time() | Coarse script timing | Milliseconds | None (stdlib) |
| cProfile | Finding bottlenecks in whole programs | Per-function | None (stdlib) |
| line_profiler | Line-by-line profiling | Per-line | pip install |
Use timeit when you have a specific performance question: "Is approach A faster than approach B?" or "Is this optimisation actually an improvement?" Use cProfile when you need to find where time is being spent across a larger codebase.
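When the question is "where does my program spend its time?" rather than "which snippet is faster?", reach for cProfile instead. A minimal sketch, with my_script.py as a placeholder:
# Profile an entire script and sort the report by cumulative time
python -m cProfile -s cumulative my_script.py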
Using timeit from the Command Line
The fastest way to benchmark a one-liner is the python -m timeit command. It automatically chooses a sensible number of repetitions and reports the best time across multiple runs:
# Run from your terminal (not a Python file)
python -m timeit "'-'.join(str(i) for i in range(100))"
Output:
50000 loops, best of 5: 7.48 usec per loop
The CLI prints the loop count and the best time per loop across five timing runs. "Best of 5" means timeit ran the measurement 5 times and took the minimum -- this filters out noise from OS interrupts and context switches. The loop count itself is calibrated automatically: timeit keeps increasing it until one timing run takes at least 0.2 seconds, which is why fast snippets report high loop counts.
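You can override the automatic choices with command-line flags; the values below are arbitrary, picked only to show the syntax:
# Force 100,000 loops, 7 timing runs, and microsecond output units
python -m timeit -n 100000 -r 7 -u usec "'-'.join(str(i) for i in range(100))"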
For benchmarks that need setup code, pass it with the -s flag (the flag can be repeated for multiple setup statements) and give the statement to time as the final argument:
# Multi-line benchmark with setup
python -m timeit -s "import json; data = {'key': 'value', 'num': 42}" "json.dumps(data)"
Output:
500000 loops, best of 5: 0.612 usec per loop

Using timeit Programmatically
For benchmark scripts you want to commit and rerun, the Python API gives you full control over repetitions, setup code, and result formatting:
# programmatic_timeit.py
import timeit
# timeit.timeit() runs the stmt number times and returns total seconds
result = timeit.timeit(
    stmt="[i**2 for i in range(500)]",
    number=50000
)
print(f"Total for 50,000 runs: {result:.3f}s")
print(f"Per run: {result/50000*1e6:.2f} microseconds")
# Use setup= to import modules or define variables used in stmt
result2 = timeit.timeit(
    stmt="sorted(data)",
    setup="data = list(range(1000, 0, -1))",  # reversed list, set up once
    number=10000
)
print(f"\nsorted() on 1000-item reversed list:")
print(f"Total for 10,000 runs: {result2:.3f}s")
print(f"Per run: {result2/10000*1e6:.2f} microseconds")
Output:
Total for 50,000 runs: 0.834s
Per run: 16.68 microseconds
sorted() on 1000-item reversed list:
Total for 10,000 runs: 0.621s
Per run: 62.10 microseconds
The setup parameter is key -- it runs once before the timing loop starts, so expensive operations like imports or data creation don't pollute your measurements. Anything that must be repeated each iteration goes in stmt; anything that only needs to happen once goes in setup.
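Here is a small sketch of the difference, reusing the reversed-list example from above: the first call times the list construction along with the sort, while the second isolates the sort by moving construction into setup.
# setup_vs_stmt.py -- illustrative sketch of setup work leaking into stmt
import timeit

# Wrong: the list construction is timed on every iteration
mixed = timeit.timeit("sorted(list(range(1000, 0, -1)))", number=10000)

# Right: the data is built once in setup, so only sorted() is timed
isolated = timeit.timeit("sorted(data)", setup="data = list(range(1000, 0, -1))", number=10000)

print(f"construction inside stmt: {mixed:.3f}s")
print(f"construction in setup:    {isolated:.3f}s")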
Using timeit.repeat() for Robust Statistics
timeit.repeat() runs the full timing measurement multiple times, giving you a list of results you can analyse statistically. This is more rigorous than a single run:
# repeat_benchmark.py
import timeit
import statistics
def benchmark(stmt, setup="pass", number=10000, repeat=7):
    """Run a benchmark and return min, mean, stdev."""
    times = timeit.repeat(stmt=stmt, setup=setup, number=number, repeat=repeat)
    per_run = [t / number * 1e6 for t in times]  # convert to microseconds per call
    return {
        "min_us": min(per_run),
        "mean_us": statistics.mean(per_run),
        "stdev_us": statistics.stdev(per_run),
        "runs": number,
        "repeats": repeat,
    }
# Compare two string join approaches
r1 = benchmark("''.join(str(i) for i in range(100))", number=20000)
r2 = benchmark("''.join(map(str, range(100)))", number=20000)
print("Generator expression join:")
print(f" Min: {r1['min_us']:.2f} us Mean: {r1['mean_us']:.2f} us Stdev: {r1['stdev_us']:.2f} us")
print("\nmap(str, ...) join:")
print(f" Min: {r2['min_us']:.2f} us Mean: {r2['mean_us']:.2f} us Stdev: {r2['stdev_us']:.2f} us")
faster = "map(str)" if r2['min_us'] < r1['min_us'] else "generator"
speedup = max(r1['min_us'], r2['min_us']) / min(r1['min_us'], r2['min_us'])
print(f"\nWinner: {faster} ({speedup:.1f}x faster)")
Output:
Generator expression join:
Min: 8.34 us Mean: 8.51 us Stdev: 0.18 us
map(str, ...) join:
Min: 6.92 us Mean: 7.08 us Stdev: 0.14 us
Winner: map(str) (1.2x faster)
The standard deviation tells you how noisy the measurement is. A low stdev means your results are reliable; a high stdev means something is interfering -- background processes, thermal throttling, or GC pressure. The minimum is often reported as the "true" performance because it represents the run with the least OS interference.
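You can turn that rule of thumb into an automated sanity check on top of the benchmark() helper above; the 5% threshold below is an arbitrary choice, not anything timeit defines:
# Flag noisy measurements via the coefficient of variation (stdev / mean)
r = benchmark("sum(range(1000))", number=20000)
cv = r["stdev_us"] / r["mean_us"]
if cv > 0.05:  # 5% is an arbitrary "too noisy" threshold
    print(f"Warning: noisy measurement (CV = {cv:.1%}); close background apps and rerun")
else:
    print(f"Stable measurement (CV = {cv:.1%}), min = {r['min_us']:.2f} us")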

Benchmarking Functions
Timing functions is slightly different from timing string snippets because you need to reference the function object. Wrap the call in a lambda, or pass any zero-argument callable straight to timeit.timeit() or the Timer class:
# function_benchmark.py
import timeit
def approach_a(n):
    """Build a list of squares using a loop."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result

def approach_b(n):
    """Build a list of squares using a comprehension."""
    return [i * i for i in range(n)]

def approach_c(n):
    """Build a list of squares using map."""
    return list(map(lambda x: x * x, range(n)))
N = 1000
# Wrap each call in a lambda so arguments can be passed to the function under test
t_a = timeit.timeit(lambda: approach_a(N), number=10000)
t_b = timeit.timeit(lambda: approach_b(N), number=10000)
t_c = timeit.timeit(lambda: approach_c(N), number=10000)
results = [("loop + append", t_a), ("list comprehension", t_b), ("map + lambda", t_c)]
results.sort(key=lambda x: x[1])
print(f"Benchmarking list-of-squares for n={N} (10,000 runs each):\n")
fastest_time = results[0][1]
for name, t in results:
    speedup = t / fastest_time
    print(f" {name:<22} {t:.4f}s total {t/10000*1e6:.2f} us/call {speedup:.2f}x")
Output:
Benchmarking list-of-squares for n=1000 (10,000 runs each):
 list comprehension     0.4123s total 41.23 us/call 1.00x
 map + lambda           0.5014s total 50.14 us/call 1.22x
 loop + append          0.5892s total 58.92 us/call 1.43x
Wrapping each call in a lambda lets you benchmark a real function with real arguments instead of pasting its code into a string, and it sidesteps the namespace problems that string snippets have (see the FAQ below). The lambda adds a small, constant call overhead to every measurement, but because the overhead is identical for all three approaches the comparison stays fair. When a function takes arguments, pass them inside the lambda so every approach is measured under the same conditions.
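The Timer class mentioned at the start of this section also accepts a callable directly, which lets you reuse one timer object for both a single measurement and a best-of-N repeat. A sketch, continuing from the script above:
# timer_callable.py -- sketch: passing a callable to timeit.Timer
timer = timeit.Timer(lambda: approach_b(N))
total = timer.timeit(number=10000)                # one timing run
best = min(timer.repeat(repeat=5, number=10000))  # best of 5 runs
print(f"single run : {total / 10000 * 1e6:.2f} us/call")
print(f"best of 5  : {best / 10000 * 1e6:.2f} us/call")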
Real-Life Example: Benchmarking a String Formatter
Let us build a systematic benchmark that tests five approaches to string formatting in Python and produces a ranked summary report -- the kind of micro-benchmark you would run before deciding which formatting style to standardize on for a hot code path:
# string_format_benchmark.py
import timeit
import statistics
NAME = "Alice"
AGE = 30
SCORE = 98.6
def bench(fn, number=50000, repeat=5):
    times = timeit.repeat(fn, number=number, repeat=repeat)
    per_us = [t / number * 1e6 for t in times]
    return min(per_us), statistics.mean(per_us)
approaches = {
    "%-formatting": lambda: "%s is %d years old, score: %.1f" % (NAME, AGE, SCORE),
    "str.format()": lambda: "{} is {} years old, score: {:.1f}".format(NAME, AGE, SCORE),
    "f-string": lambda: f"{NAME} is {AGE} years old, score: {SCORE:.1f}",
    "str concat": lambda: NAME + " is " + str(AGE) + " years old, score: " + str(round(SCORE, 1)),
    "Template": None,  # set up below
}
from string import Template
tmpl = Template("$name is $age years old, score: $score")
approaches["Template"] = lambda: tmpl.substitute(name=NAME, age=AGE, score=SCORE)
print(f"{'Approach':<20} {'Min (us)':>10} {'Mean (us)':>10}")
print("-" * 42)
results = []
for name, fn in approaches.items():
    mn, avg = bench(fn)
    results.append((name, mn, avg))
results.sort(key=lambda x: x[1])
for name, mn, avg in results:
    print(f"{name:<20} {mn:>10.3f} {avg:>10.3f}")
Output:
Approach               Min (us)  Mean (us)
------------------------------------------
f-string                  0.082      0.085
%-formatting              0.098      0.101
str.format()              0.124      0.128
str concat                0.143      0.148
Template                  0.621      0.641
F-strings win on every modern Python version, which matches CPython's implementation: f-strings compile to dedicated formatting opcodes (FORMAT_VALUE on older CPython releases) rather than going through a method call, so there is no str.format() dispatch overhead. Template.substitute() is the slowest by a wide margin because it parses the template with a regular expression at runtime. This benchmark gives you concrete numbers to justify your team's style guide choice, not just "f-strings feel faster."
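If you want to check the bytecode claim yourself, the standard library's dis module shows what an f-string compiles to; the exact opcode names differ between CPython versions, so treat the output as an inspection aid:
# Inspect the bytecode an f-string compiles to (x only needs to exist at run time)
import dis

dis.dis(compile('f"{x:.1f} points"', "<example>", "eval"))
# Older CPython releases show a FORMAT_VALUE opcode here; newer ones use other
# dedicated formatting opcodes, but either way there is no str.format() call.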

Frequently Asked Questions
How do I choose the right number of repetitions?
Aim for a total measurement time of 0.1 to 5 seconds. If your snippet takes 1 microsecond per call, use number=1000000. If it takes 1 millisecond, use number=1000. The CLI auto-selects number for you based on a calibration run. For manual selection, start with number=10000 and adjust until the total time is in the 0.2--2s range -- this keeps measurement noise below 1%.
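If you would rather not pick number by hand, Timer.autorange() performs the same calibration the CLI uses; a sketch:
# autorange_example.py -- let timeit pick the loop count
import timeit

timer = timeit.Timer("[i**2 for i in range(500)]")
number, total = timer.autorange()  # increases number until one run takes >= 0.2s
print(f"autorange chose number={number}: {total / number * 1e6:.2f} us per call")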
Why does timeit disable garbage collection?
timeit temporarily disables Python's cyclic garbage collector during measurement because GC pauses can add unpredictable milliseconds to individual runs. This gives more consistent results but means your benchmark does not reflect real-world performance if your code produces a lot of cyclic garbage. If GC behavior matters for your use case, use timeit.Timer directly and call gc.enable() inside your setup string to re-enable it.
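A sketch of that pattern, timing a snippet that allocates many small dicts with the collector on and off (the workload is purely illustrative):
# gc_comparison.py -- measure with and without the cyclic GC running
import timeit

stmt = "[{'key': i} for i in range(1000)]"
with_gc = timeit.timeit(stmt, setup="import gc; gc.enable()", number=10000)
without_gc = timeit.timeit(stmt, number=10000)  # default: GC disabled during timing
print(f"GC enabled : {with_gc:.3f}s")
print(f"GC disabled: {without_gc:.3f}s")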
Why does my benchmark fail with a NameError about my functions?
String snippets run in a minimal namespace that does not include your module's globals. Either pass globals=globals() as a keyword argument to timeit.timeit(), or use the setup parameter to import what you need: setup="from __main__ import my_function". The lambda wrapper approach sidesteps this entirely because the closure captures the variable directly.
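Both fixes look like this in practice; my_function is a made-up example:
# namespace_fixes.py -- two ways to make your own names visible to a string stmt
import timeit

def my_function(n):
    return sum(i * i for i in range(n))

# Option 1: hand timeit your module's namespace
t1 = timeit.timeit("my_function(500)", globals=globals(), number=10000)

# Option 2: import the name in setup (works when this file is run as a script)
t2 = timeit.timeit("my_function(500)", setup="from __main__ import my_function", number=10000)

print(f"globals=globals(): {t1:.3f}s   setup import: {t2:.3f}s")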
Can timeit measure memory usage?
timeit only measures time, not memory. For memory profiling, use the memory_profiler package (pip install memory-profiler) which provides a @profile decorator and line-by-line memory usage reports. For quick peak memory checks, tracemalloc from the standard library can take memory snapshots before and after running code to show allocation deltas.
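A quick sketch of the tracemalloc approach, using a throwaway list comprehension as the workload:
# memory_check.py -- rough allocation check with the stdlib tracemalloc module
import tracemalloc

tracemalloc.start()
data = [i * i for i in range(100_000)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"current: {current / 1024:.0f} KiB, peak: {peak / 1024:.0f} KiB")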
What is the difference between wall time and CPU time?
timeit uses time.perf_counter() which measures wall-clock time -- the elapsed real time including any waits for I/O, locks, or sleep. For CPU-bound code this is equivalent to CPU time. For I/O-bound code (network requests, file reads), wall time includes the wait for the I/O operation to complete, which is not a reflection of code efficiency. For I/O benchmarks, wall time is usually what you want anyway since that is what the user experiences.
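You can see the difference by timing a sleep twice, once with the default wall-clock timer and once with time.process_time passed as the timer argument:
# wall_vs_cpu.py -- the same sleep measured two ways
import time
import timeit

wall = timeit.timeit("time.sleep(0.01)", setup="import time", number=10)
cpu = timeit.timeit("time.sleep(0.01)", setup="import time", number=10, timer=time.process_time)
print(f"wall clock: {wall:.3f}s   CPU time: {cpu:.3f}s")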
Conclusion
Python's timeit module gives you repeatable, noise-resistant microbenchmarks without any external dependencies. In this tutorial you used timeit.timeit() for single measurements, timeit.repeat() for statistical analysis, the CLI for quick command-line comparisons, and lambda wrappers to benchmark real functions with arguments. The string formatter comparison project tied all of these together into a systematic benchmarking harness.
A good next step is to move a helper like benchmark() or bench() into your project's test utilities and call it from your CI pipeline to catch performance regressions before they reach production. Even a loose assertion like assert bench(my_fn)[0] < 100 catches catastrophic slowdowns automatically.
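As a sketch, such a guard could look like this in a pytest-style test; the module name and the 10-microsecond ceiling are placeholders you would adapt to your own project:
# test_perf_regression.py -- hypothetical CI guard; assumes bench() was moved
# into a shared utilities module rather than living in the benchmark script
from benchmark_utils import bench  # hypothetical module

def test_fstring_formatting_not_regressed():
    min_us, _ = bench(lambda: f"Alice is 30 years old, score: {98.6:.1f}", number=20000)
    assert min_us < 10.0  # loose ceiling: only fails on a catastrophic slowdown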
Official documentation: timeit -- Measure execution time of small code snippets.