At some point, you’ll need to write code that understands code. Maybe you want to build a custom linter that enforces your team’s specific patterns, or scan a codebase to find every call to a deprecated function, or automatically modernize syntax across hundreds of files. Regular expressions aren’t the right tool — they can’t reason about structure. Python’s built-in ast module is.
The ast module parses Python source code into an Abstract Syntax Tree — a structured, tree-shaped representation of your program that you can inspect, traverse, and transform. It’s the same mechanism Python itself uses when compiling code. The module is part of the standard library, no installation required. If you’ve ever used a linter like flake8 or Ruff, or a type checker like mypy, you’ve already relied on code built on top of ast.
In this article we’ll cover how to parse source code into a tree, walk the tree to find specific patterns, use the NodeVisitor pattern to build a linter, transform code with NodeTransformer, extract docstrings and function signatures, and build a practical code analysis tool. This is advanced territory — we assume you’re comfortable with Python classes and recursion.
AST Quick Example
Parsing a Python string into an AST takes one line. Here’s the simplest possible example — parse a function definition and inspect its structure:
```python
# quick_ast.py
import ast

source = """
def greet(name: str) -> str:
    return "Hello, " + name
"""

tree = ast.parse(source)
print(ast.dump(tree, indent=2))
```
Output (abbreviated):
```
Module(
  body=[
    FunctionDef(
      name='greet',
      args=arguments(
        args=[arg(arg='name', annotation=Name(id='str'))],
        ...
      ),
      body=[
        Return(
          value=BinOp(
            left=Constant(value='Hello, '),
            op=Add(),
            right=Name(id='name')
          )
        )
      ],
      returns=Name(id='str')
    )
  ]
)
```
ast.parse() returns a Module node — the root of the tree. Every element of the source code becomes a node: the function definition is a FunctionDef, the return value is a Return, and the string concatenation is a BinOp (binary operation) with Add as the operator. This tree structure is what makes programmatic analysis possible — instead of searching strings, you’re traversing structured objects.
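Because the tree is just nested Python objects, you can also reach into it with plain attribute access instead of dumping it, which is handy when you only need one piece of information. A minimal sketch against the same greet function:

```python
import ast

tree = ast.parse("def greet(name: str) -> str:\n    return 'Hello, ' + name")

func = tree.body[0]                  # the FunctionDef node
print(func.name)                     # greet
print(func.args.args[0].arg)         # name
print(type(func.body[0]).__name__)   # Return
```

Attribute access like this is fine for one-off scripts; for anything systematic, the traversal tools covered below scale better.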
Understanding AST Node Types
The AST has dozens of node types. You don’t need to memorize them — you learn the ones relevant to what you’re building. Here are the most common ones you’ll encounter when analyzing Python code:
| Node type | What it represents | Key attributes |
|---|---|---|
| Module | Top-level file | body (list of statements) |
| FunctionDef | def statement | name, args, body, decorator_list |
| ClassDef | class statement | name, bases, body |
| Call | Function call | func, args, keywords |
| Import | import statement | names (list of aliases) |
| ImportFrom | from X import Y | module, names |
| Assign | Variable assignment | targets, value |
| Return | return statement | value |
| Constant | Literal value | value (int, str, float, etc.) |
| Name | Variable name reference | id (the name string) |
The best way to discover a node type is to write a small piece of Python that uses the pattern you care about, parse it with ast.parse(), and print the tree with ast.dump(tree, indent=2). Think of it as using the AST to describe itself.
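For instance, to discover what a list comprehension looks like as a tree, parse one and dump it:

```python
import ast

# Parse a snippet containing the pattern you care about...
tree = ast.parse("squares = [n * n for n in range(10)]")
print(ast.dump(tree, indent=2))

# ...and the dump reveals the node types involved: an Assign whose
# value is a ListComp node containing a comprehension generator.
print(type(tree.body[0].value).__name__)  # ListComp
```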
Walking the Tree with ast.walk
For simple analysis tasks, ast.walk() yields every node in the tree (the order is not guaranteed, so don’t rely on it). It’s the simplest way to extract information when you don’t need context — which node is the parent, which is the child.
Finding All Imports
Here’s how to extract every import statement from a Python file:
```python
# find_imports.py
import ast

source = """
import os
import sys
from pathlib import Path
from collections import Counter, defaultdict
import json as js
"""

tree = ast.parse(source)
imports = []

for node in ast.walk(tree):
    if isinstance(node, ast.Import):
        for alias in node.names:
            imports.append(("import", alias.name, alias.asname))
    elif isinstance(node, ast.ImportFrom):
        for alias in node.names:
            imports.append(("from", f"{node.module}.{alias.name}", alias.asname))

for kind, name, alias in imports:
    asname = f" as {alias}" if alias else ""
    print(f"{kind}: {name}{asname}")
```
Output:
```
import: os
import: sys
from: pathlib.Path
from: collections.Counter
from: collections.defaultdict
import: json as js
```
The key pattern here is isinstance(node, ast.Import) — you test each node against a specific type to filter what you care about. ast.walk() handles the recursion for you. The difference between ast.Import (for import X) and ast.ImportFrom (for from X import Y) is important — they have different attributes.
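One more ImportFrom attribute worth knowing: level, which records the number of leading dots in a relative import (0 means an absolute import). Note that for a bare relative import like from . import x, the module attribute is None. A small illustration:

```python
import ast

node = ast.parse("from ..utils import helper").body[0]
print(node.module)  # utils
print(node.level)   # 2 — two leading dots

bare = ast.parse("from . import x").body[0]
print(bare.module)  # None — no module name after the dot
print(bare.level)   # 1
```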
Finding Specific Function Calls
Suppose you want to audit a codebase for calls to eval() or exec() — common security concerns. Here’s how to find them:
```python
# find_eval_calls.py
import ast

source = """
x = eval("1 + 1")
safe_result = int("42")
exec("print('dangerous')")
y = compile("code", "file", "exec")
"""

tree = ast.parse(source)
dangerous_calls = []

for node in ast.walk(tree):
    if isinstance(node, ast.Call):
        # Get the function name (handles simple Name calls)
        if isinstance(node.func, ast.Name):
            func_name = node.func.id
            if func_name in ("eval", "exec", "compile"):
                dangerous_calls.append((func_name, node.lineno))

for name, line in dangerous_calls:
    print(f"Line {line}: Found dangerous call to {name}()")
```
Output:
```
Line 2: Found dangerous call to eval()
Line 4: Found dangerous call to exec()
Line 5: Found dangerous call to compile()
```
This is the skeleton of a security linter. The node.lineno attribute gives you the line number, which you’d use in a real tool to show the user exactly where the issue is. The check for isinstance(node.func, ast.Name) handles direct calls like eval() — for method calls like obj.eval(), the func would be an ast.Attribute node instead, with node.func.attr giving the method name.
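A sketch of a helper that handles both shapes (the call_name function is our own illustration, not part of the ast module):

```python
import ast

def call_name(node):
    """Return the called name for direct calls and attribute calls."""
    if isinstance(node.func, ast.Name):
        return node.func.id    # eval(...)
    if isinstance(node.func, ast.Attribute):
        return node.func.attr  # obj.eval(...)
    return None                # e.g. a call on a subscript: funcs[0](...)

tree = ast.parse("import builtins\nbuiltins.eval('1 + 1')")
calls = [n for n in ast.walk(tree) if isinstance(n, ast.Call)]
print(call_name(calls[0]))  # eval
```

Be aware that matching on the attribute name alone can't tell builtins.eval apart from some_object.eval — resolving what a name actually refers to requires scope analysis beyond what the raw AST gives you.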
Building a Linter with NodeVisitor
For structured analysis where you need context — like knowing which class or function you’re inside — subclass ast.NodeVisitor. You define visit_NodeType() methods that are called automatically as the visitor traverses the tree.
```python
# node_visitor_linter.py
import ast
from dataclasses import dataclass


@dataclass
class LintIssue:
    line: int
    col: int
    code: str
    message: str

    def __str__(self):
        return f"Line {self.line}:{self.col} [{self.code}] {self.message}"


class SimpleLinter(ast.NodeVisitor):
    """A simple custom linter using NodeVisitor."""

    def __init__(self):
        self.issues: list[LintIssue] = []

    def visit_FunctionDef(self, node: ast.FunctionDef):
        # Rule: Functions should have docstrings
        if not (node.body and isinstance(node.body[0], ast.Expr) and
                isinstance(node.body[0].value, ast.Constant)):
            self.issues.append(LintIssue(
                node.lineno, node.col_offset,
                "D001", f"Function '{node.name}' is missing a docstring"
            ))
        # Rule: Function names should be snake_case (no uppercase letters)
        if any(c.isupper() for c in node.name):
            self.issues.append(LintIssue(
                node.lineno, node.col_offset,
                "N001", f"Function name '{node.name}' should be snake_case"
            ))
        # Continue traversing into the function body
        self.generic_visit(node)

    def visit_Call(self, node: ast.Call):
        # Rule: Warn on print() calls in non-test code
        if isinstance(node.func, ast.Name) and node.func.id == "print":
            self.issues.append(LintIssue(
                node.lineno, node.col_offset,
                "D002", "Use logging instead of print()"
            ))
        self.generic_visit(node)


# Test the linter
source = """
def processData(data):
    result = data.strip()
    print("Processing:", result)
    return result

def clean_text(text):
    \"\"\"Clean a text string.\"\"\"
    return text.lower()
"""

tree = ast.parse(source)
linter = SimpleLinter()
linter.visit(tree)

if linter.issues:
    for issue in linter.issues:
        print(issue)
else:
    print("No issues found.")
```
Output:
```
Line 2:0 [D001] Function 'processData' is missing a docstring
Line 2:0 [N001] Function name 'processData' should be snake_case
Line 4:4 [D002] Use logging instead of print()
```
The key detail is self.generic_visit(node) at the end of each visitor method. This tells the visitor to continue traversing into the node’s children. If you forget it, the traversal stops at that node and won’t visit anything inside the function body. This is the most common mistake when starting with NodeVisitor.
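You can see the effect directly by visiting nested functions with and without the descent (this toy visitor is our own illustration):

```python
import ast

class FunctionCounter(ast.NodeVisitor):
    def __init__(self, descend: bool):
        self.descend = descend
        self.names = []

    def visit_FunctionDef(self, node):
        self.names.append(node.name)
        if self.descend:
            self.generic_visit(node)  # keep walking into the body

tree = ast.parse("def outer():\n    def inner():\n        pass")

with_descent = FunctionCounter(descend=True)
with_descent.visit(tree)
print(with_descent.names)  # ['outer', 'inner']

without = FunctionCounter(descend=False)
without.visit(tree)
print(without.names)       # ['outer'] — inner() was never visited
```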
Transforming Code with NodeTransformer
The ast.NodeTransformer class works like NodeVisitor but lets you replace nodes. This is how automated refactoring tools work. You return a modified node (or a different node entirely) to make the substitution:
```python
# node_transformer.py
import ast


class PrintToLoggingTransformer(ast.NodeTransformer):
    """Replace print() calls with logging.info() calls."""

    def __init__(self):
        self.needs_logging_import = False

    def visit_Call(self, node: ast.Call):
        # Only transform direct print() calls
        if isinstance(node.func, ast.Name) and node.func.id == "print":
            self.needs_logging_import = True
            # Build: logging.info(args)
            new_node = ast.Call(
                func=ast.Attribute(
                    value=ast.Name(id="logging", ctx=ast.Load()),
                    attr="info",
                    ctx=ast.Load(),
                ),
                args=node.args,
                keywords=node.keywords,
            )
            return ast.copy_location(new_node, node)
        return node


source = """
def process(data):
    print("Starting process")
    result = data.strip()
    print("Result:", result)
    return result
"""

tree = ast.parse(source)
transformer = PrintToLoggingTransformer()
new_tree = transformer.visit(tree)
ast.fix_missing_locations(new_tree)

# Convert back to source code
print(ast.unparse(new_tree))
```
Output:
```
def process(data):
    logging.info('Starting process')
    result = data.strip()
    logging.info('Result:', result)
    return result
```
Two functions are critical when transforming trees. ast.copy_location(new_node, old_node) copies the line/column information from the original node to the new one — without location data, compiling the tree fails. ast.fix_missing_locations(tree) fills in any remaining missing location info before you use the tree. ast.unparse() converts the modified tree back to Python source code.
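Once locations are filled in, a tree can also be handed straight to the built-in compile() and executed, skipping the round-trip through source text. A sketch:

```python
import ast

tree = ast.parse("result = 2 + 3")
# ...after any NodeTransformer pass, fix locations as usual:
ast.fix_missing_locations(tree)

# A Module tree compiles and runs like normal source code.
code = compile(tree, filename="<ast>", mode="exec")
namespace = {}
exec(code, namespace)
print(namespace["result"])  # 5
```

This is how tools like pytest apply AST rewrites at import time: transform, fix locations, compile, execute.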
Real-Life Example: Codebase Complexity Analyzer
Here’s a practical tool that analyzes a Python module and reports function complexity (number of branches), missing docstrings, and function count per class:
```python
# complexity_analyzer.py
"""Analyze Python source code for complexity metrics using the ast module."""
import ast
from dataclasses import dataclass, field


@dataclass
class FunctionStats:
    name: str
    lineno: int
    has_docstring: bool
    branch_count: int
    arg_count: int


@dataclass
class ModuleStats:
    filename: str
    function_count: int = 0
    class_count: int = 0
    functions: list = field(default_factory=list)


class ComplexityVisitor(ast.NodeVisitor):
    """Visit all function definitions and compute basic metrics."""

    def __init__(self):
        self.stats = []

    def _count_branches(self, node: ast.AST) -> int:
        """Count if/elif/for/while/with/except branches."""
        branch_types = (ast.If, ast.For, ast.While, ast.With,
                        ast.ExceptHandler, ast.comprehension)
        return sum(
            1 for n in ast.walk(node) if isinstance(n, branch_types)
        )

    def _has_docstring(self, node: ast.FunctionDef) -> bool:
        return (
            bool(node.body)
            and isinstance(node.body[0], ast.Expr)
            and isinstance(node.body[0].value, ast.Constant)
            and isinstance(node.body[0].value.value, str)
        )

    def visit_FunctionDef(self, node: ast.FunctionDef):
        self.stats.append(FunctionStats(
            name=node.name,
            lineno=node.lineno,
            has_docstring=self._has_docstring(node),
            branch_count=self._count_branches(node),
            arg_count=len(node.args.args),
        ))
        self.generic_visit(node)

    visit_AsyncFunctionDef = visit_FunctionDef  # same logic for async


def analyze_source(source: str, filename: str = "") -> ModuleStats:
    tree = ast.parse(source, filename=filename)
    visitor = ComplexityVisitor()
    visitor.visit(tree)
    classes = [n for n in ast.walk(tree) if isinstance(n, ast.ClassDef)]
    stats = ModuleStats(
        filename=filename,
        function_count=len(visitor.stats),
        class_count=len(classes),
        functions=visitor.stats,
    )
    return stats


# Demo: analyze a sample module
sample_source = """
class DataProcessor:
    \"\"\"Processes incoming data.\"\"\"
    def __init__(self, config):
        self.config = config
    def process(self, data):
        if not data:
            return None
        results = []
        for item in data:
            if item.get("valid"):
                try:
                    results.append(self._transform(item))
                except ValueError:
                    pass
        return results
    def _transform(self, item):
        return item["value"] * 2
def standalone_helper(x, y, z):
    return x + y + z
"""

report = analyze_source(sample_source, "data_processor.py")
print(f"File: {report.filename}")
print(f"Functions: {report.function_count}, Classes: {report.class_count}")
print()
for fn in report.functions:
    doc = "OK" if fn.has_docstring else "MISSING"
    print(f"  {fn.name}() line {fn.lineno}: branches={fn.branch_count}, "
          f"args={fn.arg_count}, docstring={doc}")
```
Output:
```
File: data_processor.py
Functions: 4, Classes: 1

  __init__() line 4: branches=0, args=2, docstring=MISSING
  process() line 6: branches=4, args=2, docstring=MISSING
  _transform() line 17: branches=0, args=2, docstring=MISSING
  standalone_helper() line 19: branches=0, args=3, docstring=MISSING
```
This is the foundation of a real complexity gate — you could extend it to fail a CI job if any function’s branch count exceeds 10 (branch count plus one approximates cyclomatic complexity), or to generate a Markdown report of which functions need documentation. The visit_AsyncFunctionDef = visit_FunctionDef line is a neat trick: async functions have a different node type but the same structure, so you can reuse the same visitor method.
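As a sketch of that CI gate, standalone and under our own names (too_complex and the threshold of 10 are our choices, not part of the tool above):

```python
import ast
import sys

BRANCH_TYPES = (ast.If, ast.For, ast.While, ast.With, ast.ExceptHandler)

def too_complex(source: str, max_branches: int = 10) -> list:
    """Return names of functions whose branch count exceeds the limit."""
    offenders = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            branches = sum(
                1 for n in ast.walk(node) if isinstance(n, BRANCH_TYPES)
            )
            if branches > max_branches:
                offenders.append(node.name)
    return offenders

# In CI you would read real files and exit non-zero on any offender:
source = "def ok():\n    if True:\n        pass"
if too_complex(source):
    sys.exit(1)
print("complexity gate passed")
```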
Frequently Asked Questions
Is ast.literal_eval() safe for evaluating user input?
Yes — ast.literal_eval() is the safe alternative to eval() for evaluating Python literals (strings, bytes, numbers, tuples, lists, dicts, sets, booleans, and None). It raises a ValueError or SyntaxError if the input contains anything that isn’t a literal, so it can’t execute arbitrary code. One caveat from the official docs: sufficiently large or deeply nested input can still crash the interpreter due to stack depth limits, so it isn’t a defense against malicious resource consumption. Use it any time you need to parse a string like "[1, 2, 3]" or "{'key': 'value'}" into a Python object. Never use plain eval() on data you don’t fully control.
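A short demonstration:

```python
import ast

# Literals parse safely into Python objects...
print(ast.literal_eval("[1, 2, 3]"))         # [1, 2, 3]
print(ast.literal_eval("{'key': 'value'}"))  # {'key': 'value'}

# ...anything else is rejected instead of executed.
try:
    ast.literal_eval("__import__('os').system('echo pwned')")
except ValueError as exc:
    print("rejected:", exc)
```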
How do I convert an AST back to source code?
Use ast.unparse(tree), which was added in Python 3.9. It converts an AST back to a valid Python source string. The output is normalized — it won’t perfectly preserve the original formatting, spacing, or comments (comments are not represented in the AST). For tools that need to preserve the original formatting while making targeted changes, use the LibCST library instead, which works at the Concrete Syntax Tree level and preserves all formatting.
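A quick illustration of the normalization — extra parentheses and spacing are regularized, and the comment disappears because it was never in the tree:

```python
import ast

source = "x = ( 1+2 )   # comment is lost"
print(ast.unparse(ast.parse(source)))  # x = 1 + 2
```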
Does ast parse type comments?
Yes, if you pass type_comments=True to ast.parse(). This handles the older Python 2-style type annotations written as comments: # type: List[str]. For modern Python 3 type annotations (written inline as x: List[str]), the ast module handles them automatically without any flag — they appear in the annotation field of arg nodes and function returns, and as AnnAssign nodes for annotated assignments.
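A small example of reading a type comment off the node it annotates:

```python
import ast

tree = ast.parse("x = []  # type: list[str]", type_comments=True)
assign = tree.body[0]             # the Assign node
print(assign.type_comment)        # list[str] — stored as a plain string
```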
When should I use the ast module vs. regex?
Use ast when you need to understand structure — “find all functions that call X”, “check if a class has a method named Y”, “find all string literals longer than N characters”. Use regex when you need fast text search that doesn’t require structural understanding — “does this file contain the word ‘TODO’”, “find all email addresses in a log file”. The AST is slower to parse and more complex to use, but it’s correct. Regex on code can be fooled by strings, comments, and multiline expressions.
How do I get line numbers from AST nodes?
Most statement and expression nodes have lineno and col_offset attributes that give the line number (1-indexed) and column (0-indexed) of the node. Some nodes (like arguments) don’t have line numbers — in that case, use the parent node’s line number. Python 3.8+ also provides end_lineno and end_col_offset for the end position, which is useful for tools that need to highlight or replace a specific range in the source text.
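The start/end positions pair nicely with ast.get_source_segment() (also Python 3.8+), which slices a node’s exact original text out of the source string:

```python
import ast

source = "total = compute(a, b) + 1"
tree = ast.parse(source)
call = next(n for n in ast.walk(tree) if isinstance(n, ast.Call))

print(call.lineno, call.col_offset)          # 1 8
print(ast.get_source_segment(source, call))  # compute(a, b)
```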
Conclusion
The ast module gives you the ability to write programs that understand Python programs. We covered parsing source into a tree with ast.parse(), finding nodes with ast.walk() and isinstance() checks, building structured linters with ast.NodeVisitor, transforming code with ast.NodeTransformer, and converting back to source with ast.unparse(). These primitives power real tools: every Python linter, formatter, and type checker is built on the same foundation.
To go deeper, try extending the complexity analyzer to output a Markdown report, or build a transformer that automatically adds __slots__ to classes without them. The natural next level of complexity is LibCST for format-preserving transformations, and the tokenize module for even lower-level analysis of individual tokens including comments.
For the complete node reference, see the official ast module documentation.