Advanced
You’ve profiled your Python code and a particular function keeps showing up in the hot path. You’ve tried rewriting it a few different ways and timeit says one version is 30% faster, but you’re not sure why. Or you’re curious what Python is actually doing when you write a list comprehension versus a for loop, or what the difference is between x += 1 and x = x + 1 at the bytecode level. The dis module (short for “disassemble”) lets you answer these questions by showing you the CPython bytecode that any Python function compiles to.
The dis module is part of Python’s standard library and works with any function, method, class, module, or code string. It doesn’t modify your code; it translates the compiled bytecode stored in a code object back into a human-readable instruction listing. You don’t need to be a CPython internals expert to use it: the instruction names are descriptive enough that you can reason about them with basic Python knowledge.
In this tutorial, you’ll learn how to disassemble functions with dis.dis(), read and interpret bytecode output, inspect code objects with __code__ attributes, compare implementations by their bytecode, understand how closures and global lookups work at the bytecode level, and use dis as a performance diagnosis tool. By the end, you’ll have a concrete mental model of what CPython does with your code.
Python dis Module: Quick Example
Here’s a basic disassembly of a simple function to show what the output looks like:
```python
# dis_quick.py
import dis
def add_numbers(a, b):
    result = a + b
    return result
dis.dis(add_numbers)
```
Output:
```
  3           0 RESUME                   0

  4           2 LOAD_FAST                0 (a)
              4 LOAD_FAST                1 (b)
              6 BINARY_OP                0 (+)
             10 STORE_FAST               2 (result)

  5          12 LOAD_FAST                2 (result)
             14 RETURN_VALUE
```
Each row is one bytecode instruction. The columns are: source line number (left), byte offset in the code object, instruction name (opcode), argument, and the argument’s human-readable value in parentheses. LOAD_FAST pushes a local variable onto the evaluation stack. BINARY_OP pops two values, applies the operator (0 = +), and pushes the result. STORE_FAST pops the top of the stack into a local variable. RETURN_VALUE returns the top of the stack to the caller.
What Is CPython Bytecode?
When CPython compiles your Python source code, it produces bytecode — a sequence of fixed-width instructions for the CPython virtual machine (VM). The VM then interprets these instructions in a loop, maintaining an evaluation stack where operands are pushed and popped. Bytecode is stored in .pyc files and in the __code__ attribute of function objects.
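You can see this raw form directly: co_code on any code object is a bytes object, with each instruction unit occupying two bytes (opcode byte, then argument byte) since Python 3.6. A minimal sketch (the `add` function is just for illustration):

```python
import dis

def add(a, b):
    return a + b

raw = add.__code__.co_code       # the raw bytecode as a bytes object
print(type(raw), len(raw))       # bytes, always an even number of bytes
# Decode the first instruction pair by hand: opcode byte -> mnemonic name
print(dis.opname[raw[0]], raw[1])
```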
| Concept | What It Is | Example |
|---|---|---|
| Opcode | A single VM instruction | LOAD_FAST, CALL |
| Offset | Byte position of instruction in code object | 0, 2, 4… |
| Argument | Integer parameter to the opcode | Variable index, constant index |
| Stack | LIFO buffer for operands and results | LOAD pushes, STORE/CALL pops |
| Code object | Compiled bytecode + metadata for a function | func.__code__ |
| co_varnames | Tuple of local variable names | ('a', 'b', 'result') |
Understanding the stack model is the key to reading bytecode. Every LOAD instruction pushes a value; every STORE, CALL, or binary operator pops one or more values. The stack depth at any point tells you how many “in-flight” values exist. The compiler computes the maximum stack depth for each code object at compile time and stores it in co_stacksize, so the VM can allocate the evaluation stack up front.
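You can explore per-instruction stack effects yourself with dis.stack_effect(), which reports the net change in stack depth for an opcode (and argument, where one is required):

```python
import dis

# Net stack-depth change for a given opcode
print(dis.stack_effect(dis.opmap["LOAD_FAST"], 0))   # pushes one value
print(dis.stack_effect(dis.opmap["POP_TOP"]))        # pops one value
```

Summing stack effects along every execution path is essentially how the compiler arrives at co_stacksize.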
Reading the dis Output
Let’s look at a more complex function to practice reading bytecode:
```python
# dis_read.py
import dis
def categorize_score(score):
    if score >= 90:
        return "A"
    elif score >= 80:
        return "B"
    else:
        return "C"
print("=== categorize_score ===")
dis.dis(categorize_score)
# Compare with a threshold-table version
GRADE_THRESHOLDS = [(90, "A"), (80, "B"), (0, "C")]
def categorize_score_v2(score):
    for threshold, grade in GRADE_THRESHOLDS:
        if score >= threshold:
            return grade
print("\n=== categorize_score_v2 ===")
dis.dis(categorize_score_v2)
```
Output:
```
=== categorize_score ===
  3           0 RESUME                   0

  4           2 LOAD_FAST                0 (score)
              4 LOAD_CONST               1 (90)
              6 COMPARE_OP               5 (>=)
             10 POP_JUMP_IF_FALSE       12 (to 36)

  5          12 LOAD_CONST               2 ('A')
             14 RETURN_VALUE

  6          16 LOAD_FAST                0 (score)
...
(abbreviated)

=== categorize_score_v2 ===
 14           0 RESUME                   0

 15           2 LOAD_GLOBAL              0 (GRADE_THRESHOLDS)
             12 GET_ITER
        >>   14 FOR_ITER                28 (to 72)
             18 UNPACK_SEQUENCE          2
             22 STORE_FAST               1 (threshold)
             24 STORE_FAST               2 (grade)
...
```
The POP_JUMP_IF_FALSE instruction is how Python implements if statements in bytecode: it pops the boolean result of the comparison and jumps to the target offset if it’s False, otherwise continues to the next instruction. In v2, the LOAD_GLOBAL instruction for GRADE_THRESHOLDS does a dictionary lookup in the global namespace on every call — that’s slower than LOAD_FAST which reads from a local slot. This is why moving frequently-accessed globals into local variables inside tight loops is a valid micro-optimization.
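You can verify the global-versus-local distinction programmatically. A quick sketch (the function and helper names here are hypothetical, chosen for illustration):

```python
import dis

N = 3

def uses_global():
    return N + 1            # N resolved via LOAD_GLOBAL on every call

def uses_local(n=N):
    return n + 1            # n captured at def time, read via LOAD_FAST

def opnames(fn):
    return [i.opname for i in dis.get_instructions(fn)]

print("LOAD_GLOBAL" in opnames(uses_global))  # True
print("LOAD_GLOBAL" in opnames(uses_local))   # False
```

The default-argument trick works because defaults are evaluated once, when the def statement runs, so the global's value is baked into a local slot.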
Inspecting Code Objects
Every function has a __code__ attribute that exposes the raw code object. The code object contains not just the bytecode bytes but also metadata: constant values, variable names, closure variables, and source information.
```python
# dis_code_object.py
import dis
def make_adder(n):
    """Return a function that adds n to its argument."""
    def adder(x):
        return x + n
    return adder
add5 = make_adder(5)
# Inspect the outer function
outer_code = make_adder.__code__
print("=== make_adder code object ===")
print(f"co_name: {outer_code.co_name}")
print(f"co_varnames: {outer_code.co_varnames}")   # local variables
print(f"co_freevars: {outer_code.co_freevars}")   # closed-over variables
print(f"co_cellvars: {outer_code.co_cellvars}")   # variables used by inner funcs
print(f"co_consts: {outer_code.co_consts}")       # constants (includes inner code)
print(f"co_flags: {outer_code.co_flags:#010b}")
# Inspect the inner adder function
inner_code = add5.__code__
print("\n=== adder code object ===")
print(f"co_name: {inner_code.co_name}")
print(f"co_varnames: {inner_code.co_varnames}")
print(f"co_freevars: {inner_code.co_freevars}")   # 'n' is a free variable
print("\n=== adder bytecode ===")
dis.dis(add5)
```
Output:
```
=== make_adder code object ===
co_name: make_adder
co_varnames: ('n', 'adder')
co_freevars: ()
co_cellvars: ('n',)
co_consts: ('Return a function that adds n to its argument.', <code object adder at 0x...>)
co_flags: 0b00000011

=== adder code object ===
co_name: adder
co_varnames: ('x',)
co_freevars: ('n',)

=== adder bytecode ===
  5           0 COPY_FREE_VARS           1
              2 RESUME                   0

  6           4 LOAD_FAST                0 (x)
              6 LOAD_DEREF               1 (n)
              8 BINARY_OP                0 (+)
             12 RETURN_VALUE
```
LOAD_DEREF is the opcode for accessing a closure variable (a free variable). It reads from a “cell” object that’s shared between the outer and inner function, which is how closures maintain state after the outer function returns. This is slower than LOAD_FAST but necessary for closures. co_cellvars in the outer function lists variables that are “wrapped in cells” for sharing with inner functions; co_freevars in the inner function lists the variables it accesses from the cell.
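The cells themselves are visible on the function object via its __closure__ attribute, which pairs up index-for-index with co_freevars:

```python
def make_adder(n):
    def adder(x):
        return x + n
    return adder

add5 = make_adder(5)
print(add5.__code__.co_freevars)          # ('n',)
print(add5.__closure__[0].cell_contents)  # 5 -- the value LOAD_DEREF reads
print(add5(10))                           # 15
```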
Comparing Implementations
One of the most practical uses of dis is comparing two implementations of the same function. Fewer instructions usually means faster execution (though the relationship isn’t perfect — some instructions are heavier than others).
```python
# dis_compare.py
import dis
# Three ways to join a list of strings
def join_v1(words):
    result = ""
    for word in words:
        result += word + " "
    return result.strip()
def join_v2(words):
    return " ".join(words)
def join_v3(words):
    parts = []
    for word in words:
        parts.append(word)
    return " ".join(parts)
# Count instructions for each
for fn in [join_v1, join_v2, join_v3]:
    instructions = list(dis.get_instructions(fn))
    real_instructions = [i for i in instructions if i.opname != "RESUME"]
    print(f"{fn.__name__}: {len(real_instructions)} instructions")
print("\n=== join_v2 bytecode (winner) ===")
dis.dis(join_v2)
```
Output:
```
join_v1: 18 instructions
join_v2: 5 instructions
join_v3: 17 instructions

=== join_v2 bytecode (winner) ===
  9           0 RESUME                   0

 10           2 LOAD_CONST               1 (' ')
              4 LOAD_ATTR                1 (NULL|self + join)
             14 LOAD_FAST                0 (words)
             16 CALL                     1
             24 RETURN_VALUE
```
join_v2 uses just 5 real instructions: load the string ' ', load its join method, load the words argument, call the method, and return. The other versions loop in Python bytecode, which means executing multiple instructions per element. str.join() does the concatenation in C, which is why it’s significantly faster than loop-based string building for large lists.
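Instruction counts only matter if the implementations are actually equivalent, so it's worth a sanity check before declaring a winner. A quick sketch, with the three functions redefined so the snippet is self-contained:

```python
def join_v1(words):
    result = ""
    for word in words:
        result += word + " "
    return result.strip()

def join_v2(words):
    return " ".join(words)

def join_v3(words):
    parts = []
    for word in words:
        parts.append(word)
    return " ".join(parts)

# All three should agree on typical inputs before we compare bytecode
words = ["to", "be", "or", "not"]
print(join_v1(words) == join_v2(words) == join_v3(words))  # True
```

Note the equivalence is not perfect in every corner case: join_v1 also strips whitespace that was already at the ends of the first and last words, which " ".join() does not.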
Other dis Module Tools
Beyond dis.dis(), the module provides several other useful functions for programmatic bytecode analysis:
```python
# dis_tools.py
import dis
def sample(x, y=10):
    z = x * y
    return z if z > 0 else -z
# get_instructions() -- iterator of Instruction namedtuples
print("=== get_instructions ===")
for instr in dis.get_instructions(sample):
    print(f"  {instr.offset:3d} {instr.opname:<20} {instr.argrepr}")
# dis.code_info() -- compact summary of the code object
print("\n=== code_info ===")
print(dis.code_info(sample))
# dis.show_code() -- prints code_info to stdout
print("\n=== show_code ===")
dis.show_code(sample)
# Disassemble a code string directly
print("\n=== disassemble from string ===")
code_str = "x = [i**2 for i in range(5)]"
dis.dis(compile(code_str, "<string>", "exec"))
```
Output (abridged):
```
=== get_instructions ===
    0 RESUME
    2 LOAD_FAST            x
    4 LOAD_FAST            y
    6 BINARY_OP            *
...

=== code_info ===
Name:              sample
Filename:          dis_tools.py
Argument count:    2
...
Constants: (None, 0)
Variable names: x, y, z
```
dis.get_instructions() returns Instruction namedtuples with fields opname, opcode, arg, argval, argrepr, offset, starts_line, and is_jump_target. This is the programmatic interface for tools that analyze bytecode -- linters, optimizers, and coverage tools all use it.
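For example, a few lines with collections.Counter give you an opcode histogram, which is a handy first look at any function you're auditing (the `sample` function here is just for illustration):

```python
import dis
from collections import Counter

def sample(items):
    total = 0
    for item in items:
        total += item
    return total

# Tally how often each opcode appears in the compiled function
counts = Counter(i.opname for i in dis.get_instructions(sample))
for op, n in counts.most_common():
    print(f"{op:<20} {n}")
```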
Real-Life Example: Bytecode-Based Performance Audit
This audit tool identifies functions in a module that use slow patterns -- global variable access in loops, attribute chaining, or excessive stack operations -- and reports them with instruction counts.
```python
# bytecode_audit.py
import dis

def audit_function(fn):
    """
    Audit a function for common bytecode-level performance patterns.
    Returns a dict of findings.
    """
    instructions = list(dis.get_instructions(fn))
    opnames = [i.opname for i in instructions]
    # Count meaningful instructions (skip RESUME)
    total = sum(1 for op in opnames if op != "RESUME")
    # Detect LOAD_GLOBAL inside a FOR_ITER loop
    global_in_loop = 0
    in_loop = False
    for instr in instructions:
        if instr.opname == "FOR_ITER":
            in_loop = True
        if instr.opname == "RETURN_VALUE":
            in_loop = False
        if in_loop and instr.opname == "LOAD_GLOBAL":
            global_in_loop += 1
    # Count closure variable accesses
    deref_count = opnames.count("LOAD_DEREF") + opnames.count("STORE_DEREF")
    # Count function calls (CALL on 3.11+, CALL_FUNCTION on older versions)
    call_count = opnames.count("CALL") + opnames.count("CALL_FUNCTION")
    return {
        "name": fn.__name__,
        "total_instructions": total,
        "global_in_loop": global_in_loop,
        "deref_accesses": deref_count,
        "function_calls": call_count,
    }

# Test functions with different characteristics
MULTIPLIER = 10  # global variable

def process_list_slow(data):
    """Uses global in loop -- flagged by audit."""
    result = []
    for item in data:
        result.append(item * MULTIPLIER)  # LOAD_GLOBAL each iteration
    return result

def process_list_fast(data, multiplier=MULTIPLIER):
    """Caches global as local default -- not flagged."""
    return [item * multiplier for item in data]  # LOAD_FAST

def nested_closure(x):
    """Creates a closure -- deref accesses flagged."""
    factor = x * 2
    def inner(y):
        return y + factor  # LOAD_DEREF
    return inner

# Run audit
print(f"{'Function':<25} {'Instructions':>14} {'Global-in-loop':>15} {'Deref':>8} {'Calls':>8}")
print("-" * 75)
for fn in [process_list_slow, process_list_fast, nested_closure]:
    r = audit_function(fn)
    flag = " [!]" if r["global_in_loop"] > 0 else ""
    print(f"{r['name']:<25} {r['total_instructions']:>14} {r['global_in_loop']:>15}{flag} "
          f"{r['deref_accesses']:>8} {r['function_calls']:>8}")
```
Output (exact instruction counts vary by CPython version):
```
Function                    Instructions  Global-in-loop    Deref    Calls
---------------------------------------------------------------------------
process_list_slow                     10               1 [!]        0        2
process_list_fast                      6               0            0        0
nested_closure                         6               0            1        0
```
process_list_slow is flagged because it accesses the global MULTIPLIER inside a FOR_ITER loop. Each LOAD_GLOBAL does a dictionary lookup in the global namespace, while LOAD_FAST (used in the fast version) reads from a pre-allocated local slot -- typically 3-5x faster. The fix is simple: assign the global to a local variable before the loop, or use a default argument as process_list_fast does.
Frequently Asked Questions
Does bytecode change between Python versions?
Yes -- CPython's bytecode is not guaranteed to be stable between minor versions. Python 3.11 overhauled the instruction set (for example, CALL_FUNCTION and CALL_METHOD were replaced by CALL), and 3.12 and 3.13 made further changes, including ones for the free-threaded build. The dis module always reflects the bytecode of the running interpreter, so your disassembly output will match what CPython actually executes. Never rely on specific bytecode sequences across Python versions in production code.
Why is LOAD_FAST faster than LOAD_GLOBAL?
LOAD_FAST reads a variable from a fixed-size local variable array using an integer index -- it's essentially an array lookup with no hashing. LOAD_GLOBAL performs a hash table lookup in the module's __dict__, then falls back to builtins.__dict__ if not found there. The hash table lookup involves computing a hash, finding the slot, and handling potential collisions -- much slower for tight loops. The micro-optimization of caching globals as locals (_len = len before a loop) can matter in hot paths that execute millions of times.
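The effect of that caching pattern shows up directly in the static bytecode. A sketch with hypothetical slow/fast helpers (these are static instruction counts in the compiled code, not runtime event counts):

```python
import dis

def slow(items):
    out = []
    for item in items:
        out.append(len(str(item)))    # LOAD_GLOBAL for len and str, inside the loop body
    return out

def fast(items, _len=len, _str=str):
    out = []
    for item in items:
        out.append(_len(_str(item)))  # LOAD_FAST for the cached names
    return out

def count_global_loads(fn):
    return sum(1 for i in dis.get_instructions(fn) if i.opname == "LOAD_GLOBAL")

print(count_global_loads(slow))  # 2
print(count_global_loads(fast))  # 0
```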
Does CPython optimize bytecode?
Yes. CPython applies compile-time optimizations -- constant folding and dead-code elimination -- when it compiles your source. For example, 2 * 3 in source code becomes the constant 6 in bytecode (no runtime multiplication), and if False: ... branches are eliminated entirely. Separately, Python 3.11 introduced the specializing adaptive interpreter (PEP 659), which "quickens" frequently-executed instructions at runtime, replacing them with specialized variants based on the operand types it observes; pass adaptive=True to dis.dis() on 3.11+ to see those specialized forms. By default, dis shows the post-compilation bytecode.
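Constant folding is easy to confirm for yourself: on modern CPython the folded result lands in co_consts, and the original operands don't appear at all (the `folded` function here is just for illustration):

```python
def folded():
    return 60 * 60 * 24   # folded to 86400 at compile time

# The compiler stored only the final constant, not 60 or 24
print(folded.__code__.co_consts)
```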
Can I modify bytecode at runtime?
Technically yes -- you can create a new types.CodeType object with modified bytecode and assign it to a function's __code__. Libraries like codetransformer and bytecode provide higher-level APIs for this. In practice, modifying bytecode is fragile (it breaks across Python versions), slow to implement correctly, and almost always unnecessary -- if you need runtime code transformation, Python's decorator system, ast module, or metaclasses are safer and more maintainable alternatives.
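The simplest (and least fragile) form of this is swapping in a whole code object from another function rather than editing raw bytes. A minimal sketch showing that __code__ is writable:

```python
def greet():
    return "hello"

def shout():
    return "HELLO"

# Replace greet's code object wholesale; the functions must have
# compatible signatures and closures for this to work
greet.__code__ = shout.__code__
print(greet())  # HELLO
```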
How does bytecode relate to PyPy or Cython?
CPython's bytecode is interpreted by the CPython VM -- each instruction is dispatched by a large switch/computed-goto loop written in C. PyPy uses a JIT compiler that compiles frequently-executed code to native machine code at runtime, which is why it can be 5-10x faster for CPU-bound loops. Cython is a different approach: it compiles Python-like code to C extensions that bypass bytecode entirely. The dis module only shows CPython bytecode -- PyPy and Cython have their own internal representations that dis doesn't apply to.
Conclusion
The dis module gives you a window into what CPython actually does with your Python source code. The key tools are dis.dis() for human-readable bytecode output, dis.get_instructions() for programmatic analysis, dis.code_info() for code object metadata, and the __code__ attribute for direct code object inspection. The performance audit above is a practical starting point -- extend it to scan all functions in a module, add checks for attribute chaining inside loops, or integrate it into your CI pipeline as a performance regression detector.
The dis module is most valuable when combined with profiling: use cProfile or line_profiler to find the hot path, then use dis to understand why one implementation is faster than another. Together they give you both the "where" and the "why" of Python performance.
Official documentation: https://docs.python.org/3/library/dis.html