Beginner
How To Work With CSV Files in Python Using the csv Module and Pandas
Quick Example (TLDR)
Reading a CSV file with Python is simple. Here’s the fastest way using pandas:
# The quick way: pandas
import pandas as pd
# Read CSV file
df = pd.read_csv('data.csv')
# Access data
print(df.head())
print(df['salary'].mean())
Output:
    name  age  salary
0  Alice   28   65000
1    Bob   34   72000
2  Carol   29   68000
3  David   45   95000
4    Eve   31   71000
74200.0
Understanding CSV Files
CSV stands for “Comma-Separated Values.” It’s the simplest way to store tabular data: each row is a line, columns are separated by commas. Here’s what a CSV file looks like inside:
name,age,salary,department
Alice,28,65000,Engineering
Bob,34,72000,Sales
Carol,29,68000,Engineering
David,45,95000,Management
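If you want to follow along, a few lines of Python will create that sample file (the filename employees.csv is simply the one used in the examples below):

```python
# Create the sample employees.csv used throughout this article
sample = (
    "name,age,salary,department\n"
    "Alice,28,65000,Engineering\n"
    "Bob,34,72000,Sales\n"
    "Carol,29,68000,Engineering\n"
    "David,45,95000,Management\n"
)

with open('employees.csv', 'w') as f:
    f.write(sample)
```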
Reading CSV with the csv Module
Python’s built-in csv module is lightweight and doesn’t require external dependencies:
import csv
# Open and read CSV file
with open('employees.csv', 'r') as file:
    # csv.reader returns an iterator over rows
    csv_reader = csv.reader(file)

    # Get the header row
    headers = next(csv_reader)
    print("Headers:", headers)

    # Process each data row
    for row in csv_reader:
        print(f"Name: {row[0]}, Age: {row[1]}, Salary: {row[2]}")
Output:
Headers: ['name', 'age', 'salary', 'department']
Name: Alice, Age: 28, Salary: 65000
Name: Bob, Age: 34, Salary: 72000
Name: Carol, Age: 29, Salary: 68000
Using DictReader for Cleaner Code
DictReader automatically treats the first row as headers and returns dictionaries instead of lists:
import csv
# DictReader uses first row as keys
with open('employees.csv', 'r') as file:
    dict_reader = csv.DictReader(file)

    for row in dict_reader:
        # Access by column name instead of index
        print(f"{row['name']} earns ${row['salary']} in {row['department']}")
Output:
Alice earns $65000 in Engineering
Bob earns $72000 in Sales
Carol earns $68000 in Engineering
David earns $95000 in Management
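One gotcha worth knowing: both reader and DictReader return every field as a string, even numeric columns, so convert before doing arithmetic. A minimal sketch using an in-memory file (the data mirrors the employees.csv shown earlier):

```python
import csv
from io import StringIO

# In-memory stand-in for employees.csv
data = StringIO(
    "name,age,salary,department\n"
    "Alice,28,65000,Engineering\n"
    "Bob,34,72000,Sales\n"
    "Carol,29,68000,Engineering\n"
    "David,45,95000,Management\n"
)

total = 0
for row in csv.DictReader(data):
    # row['salary'] is the string '65000', not an int
    total += int(row['salary'])

print(f"Total payroll: ${total}")  # Total payroll: $300000
```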
Writing CSV Files
Creating a CSV file is equally straightforward:
import csv
# Data to write
employees = [
    {'name': 'Alice', 'age': 28, 'salary': 65000},
    {'name': 'Bob', 'age': 34, 'salary': 72000},
    {'name': 'Carol', 'age': 29, 'salary': 68000},
]

# Write to CSV; newline='' prevents blank lines between rows on Windows
with open('new_employees.csv', 'w', newline='') as file:
    fieldnames = ['name', 'age', 'salary']
    writer = csv.DictWriter(file, fieldnames=fieldnames)

    # Write header row
    writer.writeheader()
    # Write data rows
    writer.writerows(employees)

print("File written successfully!")
Output:
File written successfully!
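If your data is a list of lists rather than dictionaries, plain csv.writer does the same job (a small sketch; the filename here is arbitrary):

```python
import csv

rows = [
    ['name', 'age', 'salary'],   # header row first
    ['Alice', 28, 65000],
    ['Bob', 34, 72000],
]

# newline='' prevents blank lines between rows on Windows
with open('employees_list.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(rows)
```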
Working with Pandas for Advanced Operations
Pandas makes it much easier to filter, transform, and analyze data:
import pandas as pd
# Read CSV into DataFrame
df = pd.read_csv('employees.csv')
# Basic info about the data
print(f"Total rows: {len(df)}")
print(f"Average salary: ${df['salary'].mean():.2f}")

# Filter data: salaries above 70000
high_earners = df[df['salary'] > 70000]
print("\nHigh earners:")
print(high_earners)

# Group by department
print("\nAverage salary by department:")
print(df.groupby('department')['salary'].mean())
Output:
Total rows: 4
Average salary: $75000.00
High earners:
    name  age  salary  department
1    Bob   34   72000       Sales
3  David   45   95000  Management

Average salary by department:
department
Engineering    66500.0
Management     95000.0
Sales          72000.0
Name: salary, dtype: float64
Handling Large CSV Files Efficiently
For massive files that don’t fit in memory, use chunking with pandas:
import pandas as pd
# Read large file in chunks
chunk_size = 10000
# Process file in batches
for chunk in pd.read_csv('huge_file.csv', chunksize=chunk_size):
    # Process each chunk
    print(f"Processing chunk with {len(chunk)} rows")

    # Do something with the chunk
    high_value = chunk[chunk['amount'] > 1000]
    print(f"Found {len(high_value)} high-value transactions")
Output:
Processing chunk with 10000 rows
Found 2345 high-value transactions
Processing chunk with 10000 rows
Found 2412 high-value transactions
Processing chunk with 5234 rows
Found 1123 high-value transactions
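Printing per chunk is fine for monitoring, but to compute a statistic over the whole file you accumulate across chunks instead. A sketch under stated assumptions: it first builds a small stand-in file (demo_amounts.csv, an invented name) so it runs as-is; with a real huge_file.csv you would skip that step.

```python
import pandas as pd

# Stand-in file so the sketch is runnable; skip this with a real dataset
pd.DataFrame({'amount': range(3000)}).to_csv('demo_amounts.csv', index=False)

total = 0.0
count = 0
for chunk in pd.read_csv('demo_amounts.csv', chunksize=1000):
    # Accumulate running totals instead of holding the whole file in memory
    total += chunk['amount'].sum()
    count += len(chunk)

print(f"Average amount over {count} rows: {total / count:.2f}")
```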
Real-Life Example: Cleaning and Merging Sales Reports
Here’s a practical example of reading, cleaning, and merging sales data from multiple CSV files:
import pandas as pd
# Read sales data from multiple sources
sales_q1 = pd.read_csv('sales_q1.csv')
sales_q2 = pd.read_csv('sales_q2.csv')
# Combine the datasets
all_sales = pd.concat([sales_q1, sales_q2], ignore_index=True)
# Data cleaning: remove duplicates
all_sales = all_sales.drop_duplicates(subset=['order_id'])
print(f"After removing duplicates: {len(all_sales)} records")
# Clean: remove rows with missing values
all_sales = all_sales.dropna(subset=['customer_id', 'amount'])
print(f"After removing null values: {len(all_sales)} records")
# Transform: add new column for commission (5% of amount)
all_sales['commission'] = all_sales['amount'] * 0.05
# Filter: only successful orders (status='completed')
completed_sales = all_sales[all_sales['status'] == 'completed']
# Analysis: sales by region
print("\nSales by region:")
region_totals = completed_sales.groupby('region')['amount'].sum()
print(region_totals)

# Sort by amount and show the top 5
top_sales = completed_sales.nlargest(5, 'amount')
print("\nTop 5 sales:")
print(top_sales[['order_id', 'customer_name', 'amount', 'region']])

# Save cleaned data
all_sales.to_csv('cleaned_sales.csv', index=False)
print("\nCleaned data saved to cleaned_sales.csv")
Output:
After removing duplicates: 1997 records
After removing null values: 1985 records

Sales by region:
region
East     52340.25
North    45230.50
South    38920.75
West     41230.00
Name: amount, dtype: float64

Top 5 sales:
   order_id customer_name   amount region
5    ORD005     Acme Corp  8500.00   East
12   ORD012     TechStart  7200.50  North
18   ORD018      GlobalCo  6950.25   West
24   ORD024  InnovateLabs  6800.00   East
31   ORD031      CloudSys  6550.75  North

Cleaned data saved to cleaned_sales.csv
FAQ
Q: Should I use csv module or pandas?
A: Use csv for simple operations and to avoid dependencies. Use pandas when you need analysis, filtering, or complex transformations. Pandas makes data manipulation much easier and faster to code.
Q: How do I handle CSV files with different delimiters?
A: With the csv module: csv.reader(file, delimiter=';'). With pandas: pd.read_csv('file.csv', sep=';').
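For example, reading semicolon-separated data with the csv module (a minimal sketch using an in-memory string):

```python
import csv
from io import StringIO

# Semicolon-delimited data, common in European locales
data = StringIO("name;age\nAlice;28\nBob;34\n")

reader = csv.reader(data, delimiter=';')
rows = list(reader)
print(rows)  # [['name', 'age'], ['Alice', '28'], ['Bob', '34']]
```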
Q: What if my CSV has special characters or encoding issues?
A: Specify the encoding: pd.read_csv('file.csv', encoding='utf-8') or pd.read_csv('file.csv', encoding='latin-1').
Q: Can I read CSV directly from a URL?
A: Yes! df = pd.read_csv('https://example.com/data.csv') works directly with pandas.
Q: How do I export a pandas DataFrame to different formats?
A: DataFrame has methods for many formats: to_csv(), to_excel(), to_json(), to_html(), and more.
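A quick sketch of writing the same DataFrame to a few of those formats (filenames here are arbitrary; to_excel additionally requires an engine such as openpyxl to be installed, so it is left out):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'salary': [65000, 72000]})

df.to_csv('out.csv', index=False)          # plain CSV
df.to_json('out.json', orient='records')   # list of JSON objects
html_table = df.to_html(index=False)       # HTML <table> as a string

print(html_table.splitlines()[0])
```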
Conclusion
CSV files are everywhere in data work, and Python makes handling them simple. Start with the built-in csv module for basic needs, then graduate to pandas when you need real analysis power. The combination of these tools covers everything from simple data reading to complex transformations.