YAML has become the go-to format for configuration files, infrastructure as code, and data serialization across countless Python projects. Whether you’re working with Docker Compose files, Kubernetes manifests, Ansible playbooks, or custom application configuration, understanding how to parse and create YAML files is an essential skill for any Python developer. In this comprehensive guide, we’ll explore the PyYAML library and walk through practical examples that demonstrate how to read configuration files, generate YAML output, handle complex data structures, and follow security best practices when working with untrusted YAML sources.
YAML, which stands for “YAML Ain’t Markup Language,” was designed with human readability as a primary goal. Unlike JSON’s curly braces and strict syntax, or XML’s verbose tag structure, YAML uses indentation and simple key-value pairs that mirror natural Python data structures. This makes it intuitive for both writing configuration files by hand and parsing them programmatically. Throughout this article, you’ll discover how Python’s PyYAML library bridges the gap between YAML’s readable format and Python’s powerful data manipulation capabilities.
By the end of this tutorial, you’ll be able to confidently read existing YAML files into Python dictionaries and lists, write Python data structures back to YAML format, handle edge cases like multi-document YAML files, leverage advanced features such as anchors and aliases, and most importantly, understand the security implications of YAML parsing. Let’s dive in and master the art of working with YAML in Python.
Quick Example
Before we explore the details, here’s a snapshot of what’s possible with just a few lines of Python:
# basic_example.py
import yaml
# Reading a YAML file
with open('config.yaml', 'r') as file:
    config = yaml.safe_load(file)
print(config['database']['host'])
# Creating and writing YAML
data = {
    'app_name': 'MyApp',
    'version': '1.0',
    'features': ['auth', 'logging']
}
with open('output.yaml', 'w') as file:
    yaml.dump(data, file, default_flow_style=False)
This example demonstrates the two fundamental operations: reading configuration into Python data structures and serializing Python objects back to YAML format. In the sections below, we’ll expand on these concepts and explore advanced scenarios.
What is YAML?
YAML is a human-friendly data serialization language that excels at representing configuration files and structured data. Its design philosophy emphasizes readability, allowing developers to write and maintain configuration files without learning complex syntax rules. The language uses indentation to denote nesting, colons to separate keys from values, and hyphens to represent list items, all of which feel natural to anyone familiar with Python’s syntax.
To understand YAML’s place in the ecosystem, let’s compare it with other popular data formats:
| Feature | YAML | JSON | TOML | INI |
|---|---|---|---|---|
| Human Readable | Excellent | Good | Good | Fair |
| Nested Structures | Native | Native | Native | Limited |
| Comments | Yes | No | Yes | Yes |
| Value Typing | Implicit | Explicit | Mixed | String-based |
| Use Cases | Config, IaC | APIs, Data | Settings | Legacy Apps |
| Parsing Speed | Slower | Fast | Medium | Fast |
YAML’s strength lies in its readability and native support for comments, making it ideal for configuration files that humans regularly edit. JSON, by contrast, excels at machine-to-machine communication due to its strict structure and rapid parsing. TOML offers a middle ground with table-based organization, while INI files, though simple, lack native support for complex nested structures. For Python developers working with configuration files and infrastructure as code, YAML remains the most popular choice.
Installation and Setup
Before you can parse and create YAML files in Python, you need to install the PyYAML library. PyYAML is not part of Python’s standard library, but it’s lightweight and easy to set up. Open your terminal and run the following command:
# setup.sh
pip install pyyaml
Once installed, verify that PyYAML is working correctly by checking its version:
# verify_install.py
import yaml
print(f"PyYAML version: {yaml.__version__}")
Output:
PyYAML version: 6.0
Congratulations! You’re now ready to work with YAML files in Python. The PyYAML library provides a straightforward API that we’ll explore throughout this guide.
Reading YAML Files with safe_load
The most common operation when working with YAML is reading configuration files into Python data structures. The PyYAML library provides several methods for this, but yaml.safe_load() is the recommended approach for security reasons. Unlike yaml.load() with an unsafe loader such as yaml.Loader, which can execute arbitrary Python code embedded in YAML, safe_load() only constructs plain Python objects such as dictionaries, lists, strings, numbers, booleans, and dates, preventing code injection attacks.
Let’s start with a basic example. First, create a YAML file containing application configuration:
# config.yaml
app:
  name: DataProcessor
  version: 2.1.0
  debug: true
database:
  host: localhost
  port: 5432
  name: appdb
  credentials:
    user: admin
    password: secret123
features:
  - authentication
  - logging
  - reporting
Now, parse this YAML file in Python:
# read_yaml_basic.py
import yaml
with open('config.yaml', 'r') as file:
    config = yaml.safe_load(file)
print("Application Name:", config['app']['name'])
print("Database Host:", config['database']['host'])
print("Features:", config['features'])
Output:
Application Name: DataProcessor
Database Host: localhost
Features: ['authentication', 'logging', 'reporting']
Notice how the YAML structure maps directly to Python dictionaries and lists. Nested keys become nested dictionaries, arrays become Python lists, and boolean values are properly recognized. This seamless conversion is one of YAML’s greatest strengths.
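The same mapping applies when the YAML comes from a string rather than a file, which makes it easy to check which Python type each value ends up as. A minimal sketch, using an inline document invented for this illustration:

```python
import yaml

# Inline YAML made up for this example; safe_load accepts a string
# as well as an open file object.
text = """
app:
  name: DataProcessor
  debug: true
  port: 5432
  ratio: 0.75
  tags: [a, b]
  owner: null
"""

config = yaml.safe_load(text)
app = config["app"]

# Print each key with the Python type PyYAML chose for its value.
for key, value in app.items():
    print(f"{key}: {type(value).__name__}")
```

Checking types this way is a quick sanity test before wiring a new configuration file into an application.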
Writing YAML Files with dump
Beyond reading YAML, you often need to generate YAML files from Python data. The yaml.dump() function converts Python objects into YAML format. Let’s create a practical example where we construct a configuration dictionary and write it to a file:
# write_yaml_basic.py
import yaml
config = {
    'app': {
        'name': 'MyService',
        'version': '1.0.0',
        'debug': False
    },
    'database': {
        'host': 'db.example.com',
        'port': 5432
    },
    'cache': {
        'enabled': True,
        'ttl': 3600
    }
}
with open('generated_config.yaml', 'w') as file:
    yaml.dump(config, file, default_flow_style=False)
print("Configuration written to generated_config.yaml")
Output:
Configuration written to generated_config.yaml
Check the contents of the generated file:
# view_generated.py
with open('generated_config.yaml', 'r') as file:
    print(file.read())
Output:
app:
  debug: false
  name: MyService
  version: 1.0.0
cache:
  enabled: true
  ttl: 3600
database:
  host: db.example.com
  port: 5432
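The default_flow_style=False parameter ensures that nested structures are formatted with indentation rather than JSON-like curly braces (this has been the default since PyYAML 5.1, but passing it explicitly keeps the behavior consistent on older versions). This produces more readable configuration files that follow YAML conventions. Notice that the keys in the output are alphabetized: yaml.dump() sorts keys by default, so pass sort_keys=False to keep the dictionary's insertion order. You can also pass allow_unicode=True to write non-ASCII characters as-is instead of escaping them.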
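A short sketch of how these formatting options change the output. The sample data is invented for illustration, and safe_dump is used here as the dumping counterpart of safe_load; calling it without a stream returns the YAML text as a string.

```python
import yaml

# Sample data made up for this example.
data = {"zeta": 1, "alpha": {"name": "café", "ports": [80, 443]}}

# Keys are alphabetized by default.
sorted_text = yaml.safe_dump(data)

# sort_keys=False keeps the dictionary's insertion order.
ordered_text = yaml.safe_dump(data, sort_keys=False)

# allow_unicode=True writes non-ASCII characters instead of escape sequences.
unicode_text = yaml.safe_dump(data, sort_keys=False, allow_unicode=True)

print(sorted_text)
print(ordered_text)
print(unicode_text)
```

Keeping insertion order tends to matter for files that people read top to bottom, where the most important settings usually come first.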
Working with Complex Data Types
YAML supports a rich variety of data types beyond simple strings and numbers. Python’s PyYAML library automatically handles conversion between YAML’s type system and Python’s native types. Understanding these conversions helps you work with complex configurations effectively.
Here’s a comprehensive example demonstrating various data types:
# complex_data_types.py
import yaml
from datetime import datetime
data = {
    'strings': {
        'simple': 'hello',
        'multiline': 'first line\nsecond line',
        'quoted': 'special chars: @#$%'
    },
    'numbers': {
        'integer': 42,
        'float': 3.14,
        'scientific': 1.23e-4,
        'hex': 0xFF,
        'octal': 0o755
    },
    'booleans': {
        'true_value': True,
        'false_value': False,
        'yes': True,
        'no': False
    },
    'null_value': None,
    'lists': {
        'simple': [1, 2, 3],
        'mixed': ['string', 42, True, None]
    },
    'dates': {
        'timestamp': datetime(2026, 4, 5, 14, 30, 0)
    }
}
with open('complex.yaml', 'w') as file:
    yaml.dump(data, file, default_flow_style=False)
with open('complex.yaml', 'r') as file:
    loaded = yaml.safe_load(file)
print(loaded)
Output:
{'booleans': {'false_value': False, 'no': False, 'true_value': True, 'yes': True}, 'dates': {'timestamp': datetime.datetime(2026, 4, 5, 14, 30)}, 'lists': {'mixed': ['string', 42, True, None], 'simple': [1, 2, 3]}, 'null_value': None, 'numbers': {'float': 3.14, 'hex': 255, 'integer': 42, 'octal': 493, 'scientific': 0.000123}, 'strings': {'multiline': 'first line\nsecond line', 'quoted': 'special chars: @#$%', 'simple': 'hello'}}
YAML’s type inference system automatically detects whether a value is a string, number, boolean, or null. This intelligent parsing eliminates the need for explicit type declarations. However, if you need to force a specific type—for instance, treating the string “yes” as text rather than a boolean—you can quote it in the YAML file.
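To make the effect of quoting concrete, the sketch below loads the same values with and without quotes. The keys and values are invented for illustration; the behavior shown relies on PyYAML following YAML 1.1 type rules.

```python
import yaml

# Each pair holds the same characters, once plain and once quoted.
text = """
plain_yes: yes
quoted_yes: "yes"
plain_version: 1.0
quoted_version: "1.0"
plain_zip: 02134
quoted_zip: "02134"
"""

values = yaml.safe_load(text)

# Show the Python value and type chosen for each entry.
for key, value in values.items():
    print(f"{key}: {value!r} ({type(value).__name__})")
```

The quoted entries all come back as strings, while the plain ones are converted to a boolean, a float, and an integer respectively.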
Multi-Document YAML Files
YAML supports storing multiple documents in a single file, separated by three hyphens (---). This is particularly useful when you need to manage multiple configurations or data structures in one file. PyYAML provides yaml.safe_load_all() to iterate through all documents:
# multi_document.yaml
---
name: Configuration A
version: 1.0
settings:
  debug: true
---
name: Configuration B
version: 2.0
settings:
  debug: false
---
name: Configuration C
version: 1.5
settings:
  debug: true
Now load all documents:
# read_multi_yaml.py
import yaml
with open('multi_document.yaml', 'r') as file:
    documents = yaml.safe_load_all(file)
    for i, doc in enumerate(documents, 1):
        print(f"Document {i}:")
        print(f" Name: {doc['name']}")
        print(f" Version: {doc['version']}")
        print()
Output:
Document 1:
Name: Configuration A
Version: 1.0
Document 2:
Name: Configuration B
Version: 2.0
Document 3:
Name: Configuration C
Version: 1.5
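Multi-document YAML is invaluable for scenarios like managing Kubernetes manifests, where multiple resource definitions appear in a single file. The safe_load_all() function returns a generator that parses documents lazily, so you can process them one at a time instead of building a list of every document up front. Because parsing happens during iteration, consume the generator inside the with block, while the file is still open.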
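Writing a multi-document stream works the same way in reverse. The sketch below uses yaml.safe_dump_all() with two made-up documents and then reads them back from the resulting string; list() is used to materialize the lazy generator.

```python
import yaml

# Two sample documents invented for this example.
documents = [
    {"name": "Configuration A", "version": 1.0},
    {"name": "Configuration B", "version": 2.0},
]

# explicit_start=True puts a '---' marker in front of every document,
# including the first one.
text = yaml.safe_dump_all(documents, explicit_start=True, sort_keys=False)
print(text)

# safe_load_all is lazy, so wrap it in list() to read every document now.
loaded = list(yaml.safe_load_all(text))
print(len(loaded))
```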
Anchors and Aliases for Code Reuse
YAML provides a powerful feature called anchors and aliases that allows you to define a value once and reference it multiple times. This reduces duplication and makes configurations easier to maintain. An anchor is created with an ampersand (&), and aliases reference the anchor with an asterisk (*).
# anchors_aliases.yaml
defaults: &default_settings
  timeout: 30
  retries: 3
  cache: true
services:
  api:
    <<: *default_settings
    port: 8000
    name: API Service
  worker:
    <<: *default_settings
    port: 9000
    name: Worker Service
  database:
    <<: *default_settings
    port: 5432
    name: Database
Parse this configuration:
# read_anchors.py
import yaml
with open('anchors_aliases.yaml', 'r') as file:
    config = yaml.safe_load(file)
for service_name, settings in config['services'].items():
    print(f"{service_name}:")
    print(f" Timeout: {settings['timeout']}")
    print(f" Retries: {settings['retries']}")
    print()
Output:
api:
Timeout: 30
Retries: 3
worker:
Timeout: 30
Retries: 3
database:
Timeout: 30
Retries: 3
The merge key (<<) combines the referenced anchor with the current dictionary, allowing service definitions to inherit default settings while still overriding specific values. This pattern significantly reduces repetition in large configuration files.
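Overriding an inherited value is the part of this pattern that is easiest to get wrong, so here is a minimal sketch with an inline document invented for illustration. A key written directly in the mapping takes precedence over the same key pulled in through the merge.

```python
import yaml

text = """
defaults: &default_settings
  timeout: 30
  retries: 3

services:
  api:
    <<: *default_settings
    timeout: 60      # overrides the merged default
  worker:
    <<: *default_settings
"""

config = yaml.safe_load(text)
api = config["services"]["api"]
worker = config["services"]["worker"]

print(api)
print(worker)
```

After loading, each service is an ordinary dictionary; nothing in the result records that the values came from an anchor.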
Safe Loading Practices and Security
When working with YAML files from untrusted sources, security is paramount. Calling yaml.load() with an unsafe loader such as yaml.Loader or yaml.UnsafeLoader is dangerous because Python-specific tags in the YAML can execute arbitrary code. Consider this malicious YAML:
# dangerous.yaml
!!python/object/apply:os.system
args: ['rm -rf /']
Loading this with yaml.load() and yaml.UnsafeLoader would execute the command. Always use yaml.safe_load() instead:
# safe_loading_demo.py
import yaml
# WRONG: Never do this with untrusted YAML
# data = yaml.load(untrusted_yaml, Loader=yaml.FullLoader)
# CORRECT: Use safe_load for security
try:
    with open('config.yaml', 'r') as file:
        data = yaml.safe_load(file)
    print("Safely loaded configuration")
except yaml.YAMLError as e:
    print(f"Error parsing YAML: {e}")
Output:
Safely loaded configuration
Beyond using safe_load(), implement additional security measures: validate configuration schemas to ensure expected structure, restrict file permissions so only authorized users can modify configuration files, and sanitize any user input that gets incorporated into YAML files. For high-security environments, consider using specialized YAML validation libraries or writing custom validation functions.
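To see the protection in action, the sketch below feeds safe_load() a payload shaped like the malicious example, with a harmless echo command substituted as a stand-in. The loader refuses the Python-specific tag and raises an error instead of calling anything.

```python
import yaml

# Harmless stand-in for a malicious payload: the tag asks the loader to
# call os.system, but safe_load does not resolve python-specific tags.
untrusted = """
!!python/object/apply:os.system
args: ['echo this should never run']
"""

try:
    yaml.safe_load(untrusted)
    rejected = False
except yaml.YAMLError as exc:
    # The tag has no constructor in the safe loader, so loading fails.
    rejected = True
    print(f"Rejected: {type(exc).__name__}")
```

Catching yaml.YAMLError here covers both malformed documents and rejected tags, so one handler is usually enough for untrusted input.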
Custom YAML Tags and Constructors
YAML's tag system allows you to extend its functionality with custom types. While safe_load() prevents arbitrary code execution, you can still register custom constructors for specific tags to handle domain-specific data types. This is useful for configurations that require special processing:
# custom_tags.py
import yaml
import os
from pathlib import Path
def env_constructor(loader, node):
    """Custom constructor for !env tag to read environment variables"""
    value = loader.construct_scalar(node)
    # Fall back to a ${NAME} placeholder when the variable is not set
    return os.getenv(value, f'${{{value}}}')
def path_constructor(loader, node):
    """Custom constructor for !path tag to resolve a path to an absolute string"""
    value = loader.construct_scalar(node)
    return str(Path(value).resolve())
# Register constructors
yaml.SafeLoader.add_constructor('!env', env_constructor)
yaml.SafeLoader.add_constructor('!path', path_constructor)
yaml_content = """
database_url: !env DATABASE_URL
log_dir: !path /var/logs
app_name: MyApp
"""
data = yaml.safe_load(yaml_content)
print(data)
Output:
{'database_url': '${DATABASE_URL}', 'log_dir': '/var/logs', 'app_name': 'MyApp'}
Custom tags enable you to handle environment variables, file paths, date strings, and other special formats seamlessly during YAML parsing. This approach keeps your configuration files readable while maintaining type safety and extensibility.
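Registering constructors on yaml.SafeLoader changes the behavior of every safe_load() call in the process. One alternative, sketched below, is to register them on a subclass and pass that subclass to yaml.load(). The class name ConfigLoader and the DEMO_DATABASE_URL variable are hypothetical, chosen only for this example.

```python
import os
import yaml


class ConfigLoader(yaml.SafeLoader):
    """SafeLoader subclass (hypothetical name) that owns the custom tags."""


def env_constructor(loader, node):
    """Resolve '!env NAME' to the value of the environment variable NAME."""
    name = loader.construct_scalar(node)
    return os.getenv(name, "")


# Registered on the subclass only; yaml.SafeLoader itself is left untouched.
ConfigLoader.add_constructor("!env", env_constructor)

# Demo value set here so the example is self-contained.
os.environ["DEMO_DATABASE_URL"] = "postgres://localhost/demo"
text = "database_url: !env DEMO_DATABASE_URL\n"

data = yaml.load(text, Loader=ConfigLoader)
print(data)
```

With this arrangement, a plain yaml.safe_load() call elsewhere in the program still rejects the !env tag, which keeps the extension limited to the code that opted in.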
Real-Life Example: Configuration File Manager
Let's bring everything together with a practical application—a configuration file manager that reads YAML, validates settings, and provides utilities for working with configuration data:
# config_manager.py
import yaml
import os
from pathlib import Path
from typing import Any, Dict, Optional
class ConfigManager:
    """Manages application configuration from YAML files."""

    def __init__(self, config_path: str):
        self.config_path = Path(config_path)
        self.config: Dict[str, Any] = {}
        self.load()

    def load(self) -> None:
        """Load configuration from YAML file."""
        if not self.config_path.exists():
            raise FileNotFoundError(f"Config file not found: {self.config_path}")
        with open(self.config_path, 'r') as file:
            try:
                # An empty file loads as None, so fall back to an empty dict
                self.config = yaml.safe_load(file) or {}
            except yaml.YAMLError as e:
                raise ValueError(f"Invalid YAML: {e}") from e

    def get(self, key: str, default: Any = None) -> Any:
        """Get configuration value using dot notation."""
        keys = key.split('.')
        value = self.config
        for k in keys:
            if isinstance(value, dict):
                value = value.get(k)
                if value is None:
                    return default
            else:
                return default
        return value

    def set(self, key: str, value: Any) -> None:
        """Set configuration value using dot notation."""
        keys = key.split('.')
        config = self.config
        # Create intermediate dictionaries as needed
        for k in keys[:-1]:
            if k not in config:
                config[k] = {}
            config = config[k]
        config[keys[-1]] = value

    def save(self) -> None:
        """Save configuration back to YAML file."""
        with open(self.config_path, 'w') as file:
            yaml.dump(self.config, file, default_flow_style=False)

    def validate_required(self, required_keys: list) -> bool:
        """Check that all required configuration keys exist."""
        for key in required_keys:
            if self.get(key) is None:
                print(f"Missing required configuration: {key}")
                return False
        return True
# Usage example
if __name__ == '__main__':
    # Create sample configuration
    sample_config = {
        'app': {
            'name': 'MyApplication',
            'version': '1.0.0'
        },
        'database': {
            'host': 'localhost',
            'port': 5432,
            'name': 'mydb'
        },
        'server': {
            'host': '0.0.0.0',
            'port': 8000
        }
    }
    # Write sample config
    with open('app_config.yaml', 'w') as f:
        yaml.dump(sample_config, f, default_flow_style=False)
    # Load and use configuration
    config = ConfigManager('app_config.yaml')
    print(f"App: {config.get('app.name')}")
    print(f"Database: {config.get('database.host')}:{config.get('database.port')}")
    print(f"Server: {config.get('server.host')}:{config.get('server.port')}")
    # Modify configuration
    config.set('database.pool_size', 10)
    config.save()
    print("\nConfiguration updated and saved.")
Output:
App: MyApplication
Database: localhost:5432
Server: 0.0.0.0:8000
Configuration updated and saved.
This ConfigManager class is a reasonable starting point for handling YAML configuration files. It supports dot notation for accessing nested values, provides methods for modifying configurations, checks for required settings, and converts YAML parsing errors into a ValueError with a readable message. Note that get() treats a key whose value is null the same as a missing key and returns the default. You can extend this class with additional features like configuration merging, environment variable substitution, or schema validation depending on your application's needs.
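As one possible shape for the configuration merging extension, the sketch below defines a hypothetical deep_merge helper that layers an override dictionary on top of defaults. It is not part of PyYAML or of the class above, and the sample dictionaries stand in for the results of two safe_load() calls.

```python
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where values in override win over base, recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and mismatched types are replaced outright.
            merged[key] = value
    return merged


# Stand-ins for two loaded YAML files (values invented for illustration).
defaults = {"database": {"host": "localhost", "port": 5432}, "debug": False}
local = {"database": {"host": "db.internal"}, "debug": True}

result = deep_merge(defaults, local)
print(result)
```

Lists are replaced rather than concatenated in this sketch; whether that is the right choice depends on how the configuration is meant to be layered.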
Frequently Asked Questions
What's the difference between yaml.load() and yaml.safe_load()?
yaml.load() takes a Loader argument (required since PyYAML 6.0) that decides what can be constructed. With yaml.Loader or yaml.UnsafeLoader it can deserialize arbitrary Python objects, including those that execute code during instantiation, which makes it dangerous with untrusted input. yaml.safe_load() is shorthand for yaml.load() with yaml.SafeLoader: it only constructs plain Python objects (dicts, lists, strings, numbers, booleans, dates) and is safe for use with any YAML source. Always prefer safe_load() unless you have a specific reason to use a less restrictive loader and have full control over the input.
Can I preserve comments when reading and writing YAML?
Standard PyYAML doesn't preserve comments during round-trip operations. If you need to maintain comments, consider using the ruamel.yaml library instead, which is designed specifically for preserving comments, formatting, and other YAML features. However, for most applications, PyYAML's simpler approach is sufficient.
How do I handle very large YAML files efficiently?
For large YAML files, use yaml.safe_load_all() with generators to process documents one at a time rather than loading everything into memory. Additionally, consider using streaming parsers or breaking large files into smaller chunks. PyYAML can handle reasonably sized files, but for massive datasets, you might explore alternative formats like JSON or CSV.
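Another option worth trying for large files is the C-accelerated loader that PyYAML exposes when it was built with the LibYAML bindings. The sketch below falls back to the pure-Python loader when those bindings are missing; the alias FastSafeLoader and the generated sample data are invented for this example.

```python
import yaml

# CSafeLoader only exists when PyYAML was built with the LibYAML bindings,
# so fall back to the pure-Python SafeLoader otherwise.
try:
    from yaml import CSafeLoader as FastSafeLoader
except ImportError:
    from yaml import SafeLoader as FastSafeLoader

# Generate a moderately large document for demonstration purposes.
text = "items:\n" + "".join(f"  - id: {i}\n" for i in range(1000))

data = yaml.load(text, Loader=FastSafeLoader)
print(len(data["items"]))
```

Both loaders apply the same safe construction rules, so switching between them should not change what is allowed to load.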
Why does my value sometimes load as a different type than I expected?
YAML's automatic type detection usually works well, but certain values are ambiguous. For example, PyYAML follows YAML 1.1 rules, so an unquoted ZIP code like 02134 is read as an octal integer rather than as text. To force a string type, quote the value in your YAML file: '02134'. Similarly, yes/no values become booleans unless quoted.
How can I validate YAML against a schema?
PyYAML doesn't include built-in schema validation. For validation, use libraries like jsonschema (which works with YAML since both parse to dictionaries) or pydantic for more sophisticated type checking. After loading YAML with safe_load(), you can validate the resulting Python object against your schema.
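For small projects, a hand-written check can be enough before reaching for a dedicated library. The sketch below is one hypothetical approach: SCHEMA and validate are names invented for this example, mapping dotted keys to the Python type each value is expected to have.

```python
import yaml

# Hypothetical schema for illustration: dotted key -> expected Python type.
SCHEMA = {
    "app.name": str,
    "database.host": str,
    "database.port": int,
}


def validate(config, schema):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for dotted_key, expected_type in schema.items():
        value = config
        for part in dotted_key.split("."):
            if not isinstance(value, dict) or part not in value:
                problems.append(f"missing key: {dotted_key}")
                break
            value = value[part]
        else:
            # The full path exists, so check the type of the final value.
            if not isinstance(value, expected_type):
                problems.append(
                    f"{dotted_key}: expected {expected_type.__name__}, "
                    f"got {type(value).__name__}"
                )
    return problems


config = yaml.safe_load("""
app:
  name: DataProcessor
database:
  host: localhost
  port: "5432"
""")

errors = validate(config, SCHEMA)
print(errors)
```

Here the quoted port loads as a string, so the check reports a type mismatch for database.port. Collecting every problem in a list, rather than stopping at the first, lets a user fix the whole file in one pass.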
Conclusion
Mastering YAML parsing and creation in Python opens doors to working with modern configuration systems, infrastructure as code, and data serialization across countless projects. From reading simple configuration files with yaml.safe_load() to writing complex data structures with yaml.dump(), the PyYAML library covers most practical YAML handling. Remember to always prioritize security by using safe_load(), validate your configurations after loading them, and keep in mind that PyYAML drops comments on a round trip, so consider ruamel.yaml when comments must survive.
As you build more sophisticated applications, you'll find that understanding YAML's features—from anchors and aliases to custom tags and multi-document files—will help you write cleaner, more maintainable configurations. For more advanced techniques and comprehensive documentation, visit the PyYAML Documentation.