Intermediate
Real-world data is rarely clean. You inherit a CSV with mixed-case city names, a column of phone numbers stored as floats, and a “date” field that is half timestamps and half strings like “Jan 5 2024”. Before any analysis or model can run, you need to wrangle that chaos into a consistent shape — and pandas is the tool most Python developers reach for first.
The good news: pandas has a rich toolkit for every dirty-data scenario you will encounter. Whether you are dropping duplicate rows, coercing a column to a proper numeric type, or stripping rogue whitespace from string fields, the operations are concise and composable. You do not need a data-engineering background — you just need to know which methods to reach for.
This tutorial walks you through the most common data-cleaning tasks in pandas, using realistic sample data throughout. We will cover missing values, type conversion, deduplication, string standardization, and outlier filtering. By the end, you will have a repeatable cleaning pipeline you can drop into any project.
Cleaning a DataFrame: Quick Example
Before diving into each technique, here is a self-contained example showing several common cleaning steps applied to a small dataset:
# quick_clean.py
import pandas as pd
import io
raw_csv = """name,age,city,salary
Alice,29,New York ,75000
Bob,,Chicago,82000
alice,29,new york,75000
Charlie,35,Los Angeles,None
Dave,150,Chicago,92000
"""
df = pd.read_csv(io.StringIO(raw_csv))
# 1. Strip whitespace from string columns
df["city"] = df["city"].str.strip()
# 2. Standardize to lowercase, then title-case
df["name"] = df["name"].str.strip().str.title()
df["city"] = df["city"].str.lower().str.title()
# 3. Convert salary to numeric (coerce turns "None" into NaN)
df["salary"] = pd.to_numeric(df["salary"], errors="coerce")
# 4. Fill missing age with median
df["age"] = df["age"].fillna(df["age"].median())
# 5. Remove duplicates
df = df.drop_duplicates()
# 6. Filter out implausible ages
df = df[df["age"].between(0, 120)]
print(df.to_string(index=False))
Output:
name age city salary
Alice 29.0 New York 75000.0
Bob 32.0 Chicago 82000.0
Charlie 35.0 Los Angeles NaN
Six operations, three lines of garbage gone. The duplicate “alice / new york” row was removed, the outlier age of 150 was filtered, and the salary “None” became NaN so downstream math does not crash. The following sections show how each of these steps works in detail.
What Is Messy Data and Why Does It Matter?
Messy data is any data that violates the assumptions your code makes about it. Common violations include:
| Problem | Example | Consequence if Ignored |
|---|---|---|
| Missing values | NaN, empty string, “N/A” | TypeError or silent wrong results in aggregations |
| Wrong type | Salary stored as “75,000” (string) | Math operations fail or produce nonsense |
| Duplicate rows | Same transaction recorded twice | Inflated counts, totals, averages |
| Inconsistent strings | “New York”, “new york”, “NEW YORK” | Group-bys split into three separate groups |
| Outliers / bad values | Age = 999, salary = -1 | Skewed statistics, model poisoning |
None of these problems surface as loud errors — pandas will happily compute a mean that skips NaN by default, but a group-by on “New York” vs “new york” silently splits one city into two groups. Cleaning is a prerequisite to trustworthy analysis, not an optional polish step.
Handling Missing Values
Pandas represents missing data as NaN (Not a Number) for numeric columns and None or pd.NA for others. The first step in any cleaning job is finding out how much is missing and deciding what to do about it.
Finding Missing Values
Use df.isnull().sum() to get a count per column, and df.isnull().mean() to see the proportion:
# check_nulls.py
import pandas as pd
df = pd.DataFrame({
"name": ["Alice", "Bob", None, "Dave"],
"age": [29, None, 35, 41],
"salary": [75000, 82000, None, 92000],
})
print("Null counts:")
print(df.isnull().sum())
print()
print("Null proportions:")
print(df.isnull().mean().round(2))
Output:
Null counts:
name 1
age 1
salary 1
dtype: int64
Null proportions:
name 0.25
age 0.25
salary 0.25
dtype: float64
A column missing 5% of values might be safely imputed; one missing 60% is usually a signal to drop it entirely with df.drop(columns=["column_name"]). Knowing the proportion before acting prevents you from imputing columns that are fundamentally broken.
Filling and Dropping
You have three main choices: fill missing values with a constant or computed value, drop the rows entirely, or leave them for downstream handling. Here is how each looks in practice:
# fill_and_drop.py
import pandas as pd
df = pd.DataFrame({
"name": ["Alice", "Bob", None, "Dave"],
"age": [29.0, None, 35.0, 41.0],
"salary": [75000.0, 82000.0, None, 92000.0],
})
# Fill numeric column with column median
df["age"] = df["age"].fillna(df["age"].median())
# Fill salary with forward fill (carry last known value forward)
df["salary"] = df["salary"].ffill()
# Drop rows where name is still missing
df = df.dropna(subset=["name"])
print(df.to_string(index=False))
Output:
name age salary
Alice 29.0 75000.0
Bob 35.0 82000.0
Dave 41.0 92000.0
The subset argument to dropna is key — without it, you drop any row that has a missing value anywhere, which is often too aggressive. Using subset=["name"] limits the drop to rows where only the name is missing, so rows with a missing age or salary survive for further imputation downstream.
Fixing Data Types
Reading a CSV often results in columns that are object (string) dtype when they should be numeric, datetime, or boolean. Type conversion is one of the highest-impact cleaning steps because it unlocks correct math, sorting, and comparisons.
Converting to Numeric
pd.to_numeric with errors="coerce" is your safest path: it converts what it can and turns unparseable values into NaN rather than raising an exception. This is far safer than .astype(float), which crashes the whole pipeline on a single bad value.
# to_numeric.py
import pandas as pd
df = pd.DataFrame({
"price": ["12.50", " 8.99", "N/A", "100", "free"],
})
df["price"] = pd.to_numeric(df["price"], errors="coerce")
print(df)
print()
print(f"dtype: {df['price'].dtype}")
print(f"NaN count: {df['price'].isnull().sum()}")
Output:
price
0 12.50
1 8.99
2 NaN
3 100.00
4 NaN
dtype: float64
NaN count: 2
“N/A” and “free” were gracefully coerced to NaN. Now you can call .mean(), .sum(), or use the column in a model without it blowing up on row 2. The leading whitespace on “8.99” is also handled automatically during numeric parsing.
Converting to Datetime
Date columns in CSVs are almost always strings. pd.to_datetime handles dozens of formats automatically and also accepts errors="coerce":
# to_datetime.py
import pandas as pd
df = pd.DataFrame({
"event_date": ["2024-01-05", "Jan 5, 2024", "05/01/2024", "not-a-date", "2024-03-15"],
})
df["event_date"] = pd.to_datetime(df["event_date"], errors="coerce")
print(df)
print()
print(f"dtype: {df['event_date'].dtype}")
Output:
event_date
0 2024-01-05
1 2024-01-05
2 2024-01-05
3 NaT
4 2024-03-15
dtype: datetime64[ns]
“not-a-date” becomes NaT (Not a Time), which behaves like NaN for datetime columns. All three valid date formats were parsed into the same datetime64 type, so you can now sort, filter by date range, or extract year/month with df["event_date"].dt.year.
Removing Duplicate Rows
Duplicate rows inflate metrics and corrupt join results. They show up when data is merged from multiple sources, when a form is submitted twice, or when an ETL process ran more than once. Pandas makes deduplication straightforward with drop_duplicates().
# dedup.py
import pandas as pd
df = pd.DataFrame({
"order_id": [101, 102, 101, 103, 102],
"customer": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
"total": [49.99, 89.50, 49.99, 35.00, 89.50],
})
print(f"Rows before: {len(df)}")
# Drop exact duplicate rows
df = df.drop_duplicates()
print(f"Rows after drop_duplicates(): {len(df)}")
# Drop rows with duplicate order_id only (keep first occurrence)
df = df.drop_duplicates(subset=["order_id"], keep="first")
print(f"Rows after dedup on order_id: {len(df)}")
print()
print(df.to_string(index=False))
Output:
Rows before: 5
Rows after drop_duplicates(): 3
Rows after dedup on order_id: 3
order_id customer total
101 Alice 49.99
102 Bob 89.50
103 Charlie 35.00
The subset parameter lets you define what “duplicate” means for your domain. In this case, two rows with the same order_id are a duplicate even if other fields differ — for example, if an order was re-exported with a slightly different timestamp. The keep="first" argument retains the earliest occurrence; use keep="last" to keep the most recent, or keep=False to drop all copies and force manual review.
Standardizing String Data
String inconsistency is the sneakiest data quality problem. Two rows that represent the same thing will never join, group, or deduplicate correctly if one has a trailing space or uses different casing. Pandas string accessor methods (.str.*) make standardization concise and vectorized.
# string_clean.py
import pandas as pd
df = pd.DataFrame({
"city": [" New York ", "new york", "NEW YORK", "Chicago ", " chicago"],
"phone": ["(312) 555-1234", "312.555.5678", "3125559999", "312 555 0001", "312-555-7777"],
})
# Strip whitespace and title-case city
df["city"] = df["city"].str.strip().str.title()
# Normalize phone: keep digits only
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)
print(df.to_string(index=False))
Output:
city phone
New York 3125551234
New York 3125555678
New York 3125559999
Chicago 3125550001
Chicago 3125557777
The chain .str.strip().str.title() handles whitespace and casing in one pass. The regex replace r"\D" means “any non-digit character” — so parentheses, dashes, dots, and spaces all vanish, leaving a clean 10-digit string you can store or compare reliably. All five “New York” variants are now identical, so a groupby("city") will treat them as one.
Filtering Outliers and Bad Values
Outliers are values that fall outside a plausible range for the domain. An age of 999, a negative salary, or a temperature of 5000 degrees Fahrenheit are data entry errors, not real observations. Boolean masks and .between() let you filter them efficiently and expressively.
# filter_outliers.py
import pandas as pd
df = pd.DataFrame({
"employee": ["Alice", "Bob", "Charlie", "Dave", "Eve"],
"age": [29, 999, 35, -5, 41],
"salary": [75000, 82000, 120000, 92000, -500],
})
print("Before filtering:")
print(df.to_string(index=False))
# Keep only rows where age is in [18, 100] and salary >= 0
df_clean = df[df["age"].between(18, 100) & (df["salary"] >= 0)]
print()
print("After filtering:")
print(df_clean.to_string(index=False))
Output:
Before filtering:
employee age salary
Alice 29 75000
Bob 999 82000
Charlie 35 120000
Dave -5 92000
Eve 41 -500
After filtering:
employee age salary
Alice 29 75000
Charlie 35 120000
Eve 41 120000
Three rows removed: Bob had an impossible age of 999, Dave had a negative age, and Eve had a negative salary. The bitwise & operator combines the two boolean masks — both conditions must be true for a row to survive. For more sophisticated outlier detection (IQR-based or z-score), you would replace the hard-coded thresholds with computed bounds, but for domain-aware cleaning, explicit range checks are clearer and more auditable.
Renaming and Normalizing Column Names
Column names inherited from CSV headers often have spaces, inconsistent casing, or long verbose names. Cleaning column names early prevents attribute-access bugs and makes code more readable throughout the rest of the pipeline.
# rename_columns.py
import pandas as pd
# Simulate columns from a legacy Excel export
df = pd.DataFrame(columns=[
"First Name", "Last Name", "Date Of Birth", " Salary (USD) "
])
# Normalize ALL column names at once
df.columns = (
df.columns
.str.strip() # remove edge whitespace
.str.lower() # lowercase everything
.str.replace(r"[\s()]+", "_", regex=True) # spaces and parens to underscores
.str.replace(r"_+", "_", regex=True) # collapse multiple underscores
.str.strip("_") # remove leading/trailing underscores
)
print(df.columns.tolist())
Output:
['first_name', 'last_name', 'date_of_birth', 'salary_usd']
The method chain on df.columns is a one-time investment that pays dividends throughout the rest of the pipeline. With snake_case column names you can use dot-notation access (df.salary_usd), avoid quoting keys with spaces, and write filter expressions that look like real Python rather than string soup.
Real-Life Example: Building a Sales Data Cleaning Pipeline
Let us put all the techniques together into a realistic cleaning function for a monthly sales export. This is the kind of dataset you might receive from a legacy CRM: mixed formats, sparse nulls, and column names straight from a 2009 Excel template.
# clean_sales.py
import pandas as pd
import io
RAW_DATA = (
"Order ID,Customer Name ,Product,Qty,Unit Price,Order Date,Region\n"
"S001, Alice Johnson,Widget A,3,$14.99,2024-01-15,North\n"
"S002,BOB SMITH,Widget B,,8.50,January 18 2024,south\n"
"S001, alice johnson,Widget A,3,$14.99,2024-01-15,North\n" # duplicate
"S003,Charlie Brown,Widget C,2,N/A,2024-02-01,EAST\n"
"S004,Dave Wilson,Widget A,999,-5.00,2024-02-03,West\n" # bad qty and price
"S005,Eve Davis,Widget B,1,8.50,bad-date,North\n" # bad date
)
def clean_sales(raw_csv: str) -> pd.DataFrame:
df = pd.read_csv(io.StringIO(raw_csv))
# 1. Normalize column names
df.columns = (
df.columns.str.strip().str.lower()
.str.replace(r"\s+", "_", regex=True)
)
# 2. Strip whitespace from all string columns
str_cols = df.select_dtypes("object").columns
df[str_cols] = df[str_cols].apply(lambda c: c.str.strip())
# 3. Standardize text fields
df["customer_name"] = df["customer_name"].str.title()
df["region"] = df["region"].str.title()
# 4. Clean and convert numeric columns
df["unit_price"] = pd.to_numeric(
df["unit_price"].astype(str).str.replace(r"[\$,]", "", regex=True),
errors="coerce"
)
df["qty"] = pd.to_numeric(df["qty"], errors="coerce")
# 5. Parse dates (coerce bad values to NaT)
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
# 6. Remove exact duplicates
df = df.drop_duplicates()
# 7. Filter out implausible values
df = df[
df["qty"].between(1, 500) &
(df["unit_price"].fillna(0) >= 0)
]
# 8. Add a computed revenue column
df["revenue"] = (df["qty"] * df["unit_price"]).round(2)
return df.reset_index(drop=True)
result = clean_sales(RAW_DATA)
print(result.to_string(index=False))
print()
print(f"Rows: {len(result)}")
print(f"Columns: {list(result.columns)}")
Output:
order_id customer_name product qty unit_price order_date region revenue
S001 Alice Johnson Widget A 3.0 14.99 2024-01-15 North 44.97
S003 Charlie Brown Widget C 2.0 NaN 2024-02-01 East NaN
S005 Eve Davis Widget B 1.0 8.50 NaT North 8.50
Rows: 3
Columns: ['order_id', 'customer_name', 'product', 'qty', 'unit_price', 'order_date', 'region', 'revenue']
The pipeline removed the duplicate S001 (Alice), filtered out S002 because qty was missing and fell outside the 1-500 range, filtered out S004 (Dave, negative unit price and implausible qty of 999), and standardized every string field. Orders S003 and S005 still appear with partial nulls — their underlying data is recoverable after a follow-up request, so dropping them entirely would be premature. You can extend this function with domain-specific rules, add a logging step that records every drop reason, or persist the clean DataFrame to a database or Parquet file for downstream use.
Frequently Asked Questions
When should I use fillna vs. dropna?
Use fillna when you can make a defensible guess about the true value — the column median for a numeric field, “Unknown” for a categorical field, or a forward-fill for a time series where the last known value is a reasonable proxy. Use dropna when missing data is truly unrecoverable and including it would corrupt your results. As a rule of thumb: if a column is more than 40-50% null, dropping the entire column is usually more honest than imputing most of it with df.drop(columns=["col"]).
What is the difference between astype() and pd.to_numeric()?
.astype(float) raises a ValueError if any value cannot be converted — a single “N/A” in a column of 10,000 numbers will crash your pipeline. pd.to_numeric(errors="coerce") converts what it can and silently turns the rest into NaN. For production data cleaning, always prefer pd.to_numeric with errors="coerce". Reserve .astype() only for columns you have already verified are fully clean, such as after filling all nulls.
How do I detect duplicates on just a few columns, not the entire row?
Pass the subset parameter to drop_duplicates() with a list of the columns that define uniqueness for your domain. For example, df.drop_duplicates(subset=["order_id"]) keeps only the first row for each order ID, even if other columns differ. To inspect duplicates before removing them, use df[df.duplicated(subset=["order_id"], keep=False)] to see all rows involved in a collision, which helps you decide whether to keep first, last, or neither.
Why does df[“col”].str.replace() not update the column in place?
Pandas string operations return a new Series; they do not modify the original in place. You must assign the result back: df["col"] = df["col"].str.replace(...). This is consistent with how all pandas transformations work — they return new objects rather than mutating existing ones, which makes pipelines easier to reason about and debug, and avoids subtle bugs where one piece of code modifies a DataFrame being used elsewhere.
Should I use inplace=True to avoid reassignment?
Generally no. inplace=True does not actually save memory in most cases — pandas still creates an internal copy in many operations before replacing the original. It also breaks method chaining and makes code harder to debug. The modern pandas best practice is to chain operations and reassign: df = df.drop_duplicates().dropna(subset=["name"]). The inplace parameter is considered a legacy interface and may be deprecated in a future pandas version.
How do I verify my cleaned data is actually clean?
After cleaning, always run a quick sanity check: df.dtypes to confirm types, df.isnull().sum() to confirm null counts, df.describe() to spot remaining outliers in numeric ranges, and df.duplicated().sum() to confirm deduplication worked. For production pipelines, consider the pandera library, which lets you define a schema and validate a DataFrame against it with a single function call — raising a clear exception if any rule is violated.
Conclusion
A clean dataset is the foundation every analysis, model, and visualization is built on. In this tutorial we covered the core pandas cleaning toolkit: detecting and filling missing values with isnull, fillna, and dropna; safe type conversion with pd.to_numeric and pd.to_datetime; deduplication with drop_duplicates; string standardization via the .str accessor; and outlier removal with boolean masks and .between(). We also walked through a complete cleaning function that handles a realistic sales CSV end-to-end, removing duplicates, bad values, and type mismatches in a single reusable pipeline.
The real-life example is a starting point — extend it by logging every dropped row to a separate DataFrame so you can audit what was removed and why. For production pipelines where data contracts matter, consider adding a validation step using pandera or Great Expectations to catch regressions automatically when the upstream data format changes.
For deeper reading, the official pandas missing data guide and the working with text data guide are comprehensive references for every method covered here.