Beginner

How To Work With CSV Files in Python Using the csv Module and Pandas

Quick Example (TLDR)

Reading a CSV file with Python is simple. Here’s the fastest way using pandas:

# The quick way: pandas
import pandas as pd

# Read CSV file
df = pd.read_csv('data.csv')

# Access data
print(df.head())
print(df['column_name'].mean())

Output:

    name  age  salary
0  Alice   28   65000
1    Bob   34   72000
2  Carol   29   68000
3  David   45   95000
4    Eve   31   71000

Understanding CSV Files

CSV stands for “Comma-Separated Values.” It’s the simplest way to store tabular data: each row is a line, columns are separated by commas. Here’s what a CSV file looks like inside:

name,age,salary,department
Alice,28,65000,Engineering
Bob,34,72000,Sales
Carol,29,68000,Engineering
David,45,95000,Management
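A quick sketch (using an in-memory string, so no file is needed) shows why the format's quoting rule matters: a naive split(',') breaks on a quoted field that contains a comma, while csv.reader parses it correctly:

```python
import csv
import io

# One record whose second field contains a comma, so it is quoted
raw = 'name,quote\nAlice,"Hello, world"\n'

# Naive splitting breaks the quoted field apart
line = raw.splitlines()[1]
print(line.split(','))   # ['Alice', '"Hello', ' world"'] -- wrong

# csv.reader understands the quoting rules
rows = list(csv.reader(io.StringIO(raw)))
print(rows)              # [['name', 'quote'], ['Alice', 'Hello, world']]
```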

Reading CSV with the csv Module

Python’s built-in csv module is lightweight and doesn’t require external dependencies:

import csv

# Open and read CSV file
with open('employees.csv', 'r', newline='') as file:
    # csv.reader returns an iterator over rows
    csv_reader = csv.reader(file)
    
    # Get the header row
    headers = next(csv_reader)
    print("Headers:", headers)
    
    # Process each data row
    for row in csv_reader:
        print(f"Name: {row[0]}, Age: {row[1]}, Salary: {row[2]}")

Output:

Headers: ['name', 'age', 'salary', 'department']
Name: Alice, Age: 28, Salary: 65000
Name: Bob, Age: 34, Salary: 72000
Name: Carol, Age: 29, Salary: 68000
Name: David, Age: 45, Salary: 95000

Using DictReader for Cleaner Code

DictReader automatically treats the first row as headers and returns dictionaries instead of lists:

import csv

# DictReader uses first row as keys
with open('employees.csv', 'r', newline='') as file:
    dict_reader = csv.DictReader(file)
    
    for row in dict_reader:
        # Access by column name instead of index
        print(f"{row['name']} earns ${row['salary']} in {row['department']}")

Output:

Alice earns $65000 in Engineering
Bob earns $72000 in Sales
Carol earns $68000 in Engineering
David earns $95000 in Management

Writing CSV Files

Creating a CSV file is equally straightforward:

import csv

# Data to write
employees = [
    {'name': 'Alice', 'age': 28, 'salary': 65000},
    {'name': 'Bob', 'age': 34, 'salary': 72000},
    {'name': 'Carol', 'age': 29, 'salary': 68000},
]

# Write to CSV
with open('new_employees.csv', 'w', newline='') as file:
    fieldnames = ['name', 'age', 'salary']
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    
    # Write header row
    writer.writeheader()
    
    # Write data rows
    writer.writerows(employees)

print("File written successfully!")

Output:

File written successfully!

Working with Pandas for Advanced Operations

Pandas makes it much easier to filter, transform, and analyze data:

import pandas as pd

# Read CSV into DataFrame
df = pd.read_csv('employees.csv')

# Basic info about the data
print(f"Total rows: {len(df)}")
print(f"Average salary: ${df['salary'].mean():.2f}")

# Filter data: salaries above 70000
high_earners = df[df['salary'] > 70000]
print("\nHigh earners:")
print(high_earners)

# Group by department
print("\nAverage salary by department:")
print(df.groupby('department')['salary'].mean())

Output:

Total rows: 4
Average salary: $75000.00

High earners:
    name  age  salary  department
1    Bob   34   72000       Sales
3  David   45   95000  Management

Average salary by department:
department
Engineering    66500.0
Management     95000.0
Sales          72000.0
Name: salary, dtype: float64

Handling Large CSV Files Efficiently

For massive files that don’t fit in memory, use chunking with pandas:

import pandas as pd

# Read large file in chunks
chunk_size = 10000

# Process file in batches
for chunk in pd.read_csv('huge_file.csv', chunksize=chunk_size):
    # Process each chunk
    print(f"Processing chunk with {len(chunk)} rows")
    
    # Do something with the chunk
    high_value = chunk[chunk['amount'] > 1000]
    print(f"Found {len(high_value)} high-value transactions")

Output:

Processing chunk with 10000 rows
Found 2345 high-value transactions
Processing chunk with 10000 rows
Found 2412 high-value transactions
Processing chunk with 5234 rows
Found 1123 high-value transactions

Real-Life Example: Cleaning and Merging Sales Reports

Here’s a practical example of reading, cleaning, and merging sales data from multiple CSV files:

import pandas as pd

# Read sales data from multiple sources
sales_q1 = pd.read_csv('sales_q1.csv')
sales_q2 = pd.read_csv('sales_q2.csv')

# Combine the datasets
all_sales = pd.concat([sales_q1, sales_q2], ignore_index=True)

# Data cleaning: remove duplicates
all_sales = all_sales.drop_duplicates(subset=['order_id'])
print(f"After removing duplicates: {len(all_sales)} records")

# Clean: remove rows with missing values
all_sales = all_sales.dropna(subset=['customer_id', 'amount'])
print(f"After removing null values: {len(all_sales)} records")

# Transform: add new column for commission (5% of amount)
all_sales['commission'] = all_sales['amount'] * 0.05

# Filter: only successful orders (status='completed')
completed_sales = all_sales[all_sales['status'] == 'completed']

# Analysis: sales by region
print("\nSales by region:")
region_totals = completed_sales.groupby('region')['amount'].sum()
print(region_totals)

# Sort by amount and show the top 5
top_sales = completed_sales.nlargest(5, 'amount')
print("\nTop 5 sales:")
print(top_sales[['order_id', 'customer_name', 'amount', 'region']])

# Save cleaned data
all_sales.to_csv('cleaned_sales.csv', index=False)
print("\nCleaned data saved to cleaned_sales.csv")

Output:

After removing duplicates: 1997 records
After removing null values: 1985 records

Sales by region:
region
East     52340.25
North    45230.50
South    38920.75
West     41230.00
Name: amount, dtype: float64

Top 5 sales:
    order_id customer_name   amount region
5     ORD005     Acme Corp  8500.00   East
12    ORD012     TechStart  7200.50  North
18    ORD018      GlobalCo  6950.25   West
24    ORD024  InnovateLabs  6800.00   East
31    ORD031      CloudSys  6550.75  North

Cleaned data saved to cleaned_sales.csv

FAQ

Q: Should I use csv module or pandas?

A: Use csv for simple operations and to avoid dependencies. Use pandas when you need analysis, filtering, or complex transformations. Pandas makes data manipulation much easier and faster to code.

Q: How do I handle CSV files with different delimiters?

A: With the csv module: csv.reader(file, delimiter=';'). With pandas: pd.read_csv('file.csv', sep=';')
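As a quick sketch, here are both approaches applied to an in-memory semicolon-delimited string (semicolons are common in European locales, where the comma serves as the decimal separator):

```python
import csv
import io

import pandas as pd

# Semicolon-delimited sample, held in memory for illustration
data = "name;age\nAlice;28\nBob;34\n"

# csv module: pass delimiter=
rows = list(csv.reader(io.StringIO(data), delimiter=';'))
print(rows)  # [['name', 'age'], ['Alice', '28'], ['Bob', '34']]

# pandas: pass sep=
df = pd.read_csv(io.StringIO(data), sep=';')
print(df['age'].tolist())  # [28, 34]
```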

Q: What if my CSV has special characters or encoding issues?

A: Specify the encoding: pd.read_csv('file.csv', encoding='utf-8') or pd.read_csv('file.csv', encoding='latin-1')
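As a small illustration with made-up data, here is a latin-1-encoded byte stream read back with the matching encoding:

```python
import io

import pandas as pd

# Bytes encoded as latin-1; decoding them as utf-8 would raise an error
raw = "name,city\nJosé,Málaga\n".encode('latin-1')

df = pd.read_csv(io.BytesIO(raw), encoding='latin-1')
print(df['city'][0])  # Málaga
```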

Q: Can I read CSV directly from a URL?

A: Yes! df = pd.read_csv('https://example.com/data.csv') works directly with pandas.

Q: How do I export a pandas DataFrame to different formats?

A: DataFrame has methods for many formats: to_csv(), to_excel(), to_json(), to_html(), and more.
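For instance, a small sketch with made-up data (note that to_excel additionally needs a writer engine such as openpyxl installed, plus a file path):

```python
import pandas as pd

# A tiny DataFrame to convert
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [28, 34]})

print(df.to_csv(index=False))        # CSV text
print(df.to_json(orient='records'))  # [{"name":"Alice","age":28},{"name":"Bob","age":34}]
print(df.to_html(index=False)[:37])  # start of an HTML <table>
# df.to_excel('out.xlsx') also works, given an engine such as openpyxl
```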

Conclusion

CSV files are everywhere in data work, and Python makes handling them simple. Start with the built-in csv module for basic needs, then graduate to pandas when you need real analysis power. The combination of these tools covers everything from simple data reading to complex transformations.

line.split(',') works until a field contains a comma. Then you learn why csv.reader exists.