Intermediate

Real-world data is rarely clean. You inherit a CSV with mixed-case city names, a column of phone numbers stored as floats, and a “date” field that is half timestamps and half strings like “Jan 5 2024”. Before any analysis or model can run, you need to wrangle that chaos into a consistent shape — and pandas is the tool most Python developers reach for first.

The good news: pandas has a rich toolkit for every dirty-data scenario you will encounter. Whether you are dropping duplicate rows, coercing a column to a proper numeric type, or stripping rogue whitespace from string fields, the operations are concise and composable. You do not need a data-engineering background — you just need to know which methods to reach for.

This tutorial walks you through the most common data-cleaning tasks in pandas, using realistic sample data throughout. We will cover missing values, type conversion, deduplication, string standardization, and outlier filtering. By the end, you will have a repeatable cleaning pipeline you can drop into any project.

Cleaning a DataFrame: Quick Example

Before diving into each technique, here is a self-contained example showing several common cleaning steps applied to a small dataset:

# quick_clean.py
import pandas as pd
import io

raw_csv = """name,age,city,salary
Alice,29,New York ,75000
Bob,,Chicago,82000
alice,29,new york,75000
Charlie,35,Los Angeles,None
Dave,150,Chicago,92000
"""

df = pd.read_csv(io.StringIO(raw_csv))

# 1. Strip whitespace from string columns
df["city"] = df["city"].str.strip()

# 2. Standardize to lowercase, then title-case
df["name"] = df["name"].str.strip().str.title()
df["city"] = df["city"].str.lower().str.title()

# 3. Convert salary to numeric (coerce turns "None" into NaN)
df["salary"] = pd.to_numeric(df["salary"], errors="coerce")

# 4. Fill missing age with median
df["age"] = df["age"].fillna(df["age"].median())

# 5. Remove duplicates
df = df.drop_duplicates()

# 6. Filter out implausible ages
df = df[df["age"].between(0, 120)]

print(df.to_string(index=False))

Output:

      name   age         city   salary
     Alice  29.0     New York  75000.0
       Bob  32.0      Chicago  82000.0
   Charlie  35.0  Los Angeles      NaN

Six operations, three lines of garbage gone. The duplicate “alice / new york” row was removed, the outlier age of 150 was filtered, and the salary “None” became NaN so downstream math does not crash. The following sections show how each of these steps works in detail.

What Is Messy Data and Why Does It Matter?

Messy data is any data that violates the assumptions your code makes about it. Common violations include:

ProblemExampleConsequence if Ignored
Missing valuesNaN, empty string, “N/A”TypeError or silent wrong results in aggregations
Wrong typeSalary stored as “75,000” (string)Math operations fail or produce nonsense
Duplicate rowsSame transaction recorded twiceInflated counts, totals, averages
Inconsistent strings“New York”, “new york”, “NEW YORK”Group-bys split into three separate groups
Outliers / bad valuesAge = 999, salary = -1Skewed statistics, model poisoning

None of these problems surface as loud errors — pandas will happily compute a mean that skips NaN by default, but a group-by on “New York” vs “new york” silently splits one city into two groups. Cleaning is a prerequisite to trustworthy analysis, not an optional polish step.

Confused developer surrounded by mismatched data types and NaN values
Salary stored as a string. Age stored as a float. The CSV laughs.

Handling Missing Values

Pandas represents missing data as NaN (Not a Number) for numeric columns and None or pd.NA for others. The first step in any cleaning job is finding out how much is missing and deciding what to do about it.

Finding Missing Values

Use df.isnull().sum() to get a count per column, and df.isnull().mean() to see the proportion:

# check_nulls.py
import pandas as pd

df = pd.DataFrame({
    "name":   ["Alice", "Bob", None, "Dave"],
    "age":    [29, None, 35, 41],
    "salary": [75000, 82000, None, 92000],
})

print("Null counts:")
print(df.isnull().sum())
print()
print("Null proportions:")
print(df.isnull().mean().round(2))

Output:

Null counts:
name      1
age       1
salary    1
dtype: int64

Null proportions:
name      0.25
age       0.25
salary    0.25
dtype: float64

A column missing 5% of values might be safely imputed; one missing 60% is usually a signal to drop it entirely with df.drop(columns=["column_name"]). Knowing the proportion before acting prevents you from imputing columns that are fundamentally broken.

Filling and Dropping

You have three main choices: fill missing values with a constant or computed value, drop the rows entirely, or leave them for downstream handling. Here is how each looks in practice:

# fill_and_drop.py
import pandas as pd

df = pd.DataFrame({
    "name":   ["Alice", "Bob", None, "Dave"],
    "age":    [29.0, None, 35.0, 41.0],
    "salary": [75000.0, 82000.0, None, 92000.0],
})

# Fill numeric column with column median
df["age"] = df["age"].fillna(df["age"].median())

# Fill salary with forward fill (carry last known value forward)
df["salary"] = df["salary"].ffill()

# Drop rows where name is still missing
df = df.dropna(subset=["name"])

print(df.to_string(index=False))

Output:

  name   age   salary
 Alice  29.0  75000.0
   Bob  35.0  82000.0
  Dave  41.0  92000.0

The subset argument to dropna is key — without it, you drop any row that has a missing value anywhere, which is often too aggressive. Using subset=["name"] limits the drop to rows where only the name is missing, so rows with a missing age or salary survive for further imputation downstream.

Developer carefully handling missing values in a dataset
25% missing. The ‘dropna everything’ instinct is wrong. Ask why first.

Fixing Data Types

Reading a CSV often results in columns that are object (string) dtype when they should be numeric, datetime, or boolean. Type conversion is one of the highest-impact cleaning steps because it unlocks correct math, sorting, and comparisons.

Converting to Numeric

pd.to_numeric with errors="coerce" is your safest path: it converts what it can and turns unparseable values into NaN rather than raising an exception. This is far safer than .astype(float), which crashes the whole pipeline on a single bad value.

# to_numeric.py
import pandas as pd

df = pd.DataFrame({
    "price": ["12.50", "  8.99", "N/A", "100", "free"],
})

df["price"] = pd.to_numeric(df["price"], errors="coerce")

print(df)
print()
print(f"dtype: {df['price'].dtype}")
print(f"NaN count: {df['price'].isnull().sum()}")

Output:

    price
0   12.50
1    8.99
2     NaN
3  100.00
4     NaN
dtype: float64
NaN count: 2

“N/A” and “free” were gracefully coerced to NaN. Now you can call .mean(), .sum(), or use the column in a model without it blowing up on row 2. The leading whitespace on “8.99” is also handled automatically during numeric parsing.

Converting to Datetime

Date columns in CSVs are almost always strings. pd.to_datetime handles dozens of formats automatically and also accepts errors="coerce":

# to_datetime.py
import pandas as pd

df = pd.DataFrame({
    "event_date": ["2024-01-05", "Jan 5, 2024", "05/01/2024", "not-a-date", "2024-03-15"],
})

df["event_date"] = pd.to_datetime(df["event_date"], errors="coerce")

print(df)
print()
print(f"dtype: {df['event_date'].dtype}")

Output:

  event_date
0 2024-01-05
1 2024-01-05
2 2024-01-05
3        NaT
4 2024-03-15
dtype: datetime64[ns]

“not-a-date” becomes NaT (Not a Time), which behaves like NaN for datetime columns. All three valid date formats were parsed into the same datetime64 type, so you can now sort, filter by date range, or extract year/month with df["event_date"].dt.year.

Developer surrounded by date strings in incompatible formats
Three formats, one column. pd.to_datetime handles it. Your regex would not.

Removing Duplicate Rows

Duplicate rows inflate metrics and corrupt join results. They show up when data is merged from multiple sources, when a form is submitted twice, or when an ETL process ran more than once. Pandas makes deduplication straightforward with drop_duplicates().

# dedup.py
import pandas as pd

df = pd.DataFrame({
    "order_id": [101, 102, 101, 103, 102],
    "customer": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "total":    [49.99, 89.50, 49.99, 35.00, 89.50],
})

print(f"Rows before: {len(df)}")

# Drop exact duplicate rows
df = df.drop_duplicates()
print(f"Rows after drop_duplicates(): {len(df)}")

# Drop rows with duplicate order_id only (keep first occurrence)
df = df.drop_duplicates(subset=["order_id"], keep="first")
print(f"Rows after dedup on order_id: {len(df)}")
print()
print(df.to_string(index=False))

Output:

Rows before: 5
Rows after drop_duplicates(): 3
Rows after dedup on order_id: 3
 order_id customer  total
      101    Alice  49.99
      102      Bob  89.50
      103  Charlie  35.00

The subset parameter lets you define what “duplicate” means for your domain. In this case, two rows with the same order_id are a duplicate even if other fields differ — for example, if an order was re-exported with a slightly different timestamp. The keep="first" argument retains the earliest occurrence; use keep="last" to keep the most recent, or keep=False to drop all copies and force manual review.

Standardizing String Data

String inconsistency is the sneakiest data quality problem. Two rows that represent the same thing will never join, group, or deduplicate correctly if one has a trailing space or uses different casing. Pandas string accessor methods (.str.*) make standardization concise and vectorized.

# string_clean.py
import pandas as pd

df = pd.DataFrame({
    "city":  ["  New York ", "new york", "NEW YORK", "Chicago ", " chicago"],
    "phone": ["(312) 555-1234", "312.555.5678", "3125559999", "312 555 0001", "312-555-7777"],
})

# Strip whitespace and title-case city
df["city"] = df["city"].str.strip().str.title()

# Normalize phone: keep digits only
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)

print(df.to_string(index=False))

Output:

      city       phone
  New York  3125551234
  New York  3125555678
  New York  3125559999
   Chicago  3125550001
   Chicago  3125557777

The chain .str.strip().str.title() handles whitespace and casing in one pass. The regex replace r"\D" means “any non-digit character” — so parentheses, dashes, dots, and spaces all vanish, leaving a clean 10-digit string you can store or compare reliably. All five “New York” variants are now identical, so a groupby("city") will treat them as one.

Developer standardizing string data at a whiteboard
str.strip().str.title() — two methods, infinite city-name variants handled.

Filtering Outliers and Bad Values

Outliers are values that fall outside a plausible range for the domain. An age of 999, a negative salary, or a temperature of 5000 degrees Fahrenheit are data entry errors, not real observations. Boolean masks and .between() let you filter them efficiently and expressively.

# filter_outliers.py
import pandas as pd

df = pd.DataFrame({
    "employee": ["Alice", "Bob", "Charlie", "Dave", "Eve"],
    "age":      [29, 999, 35, -5, 41],
    "salary":   [75000, 82000, 120000, 92000, -500],
})

print("Before filtering:")
print(df.to_string(index=False))

# Keep only rows where age is in [18, 100] and salary >= 0
df_clean = df[df["age"].between(18, 100) & (df["salary"] >= 0)]

print()
print("After filtering:")
print(df_clean.to_string(index=False))

Output:

Before filtering:
 employee  age  salary
    Alice   29   75000
      Bob  999   82000
  Charlie   35  120000
     Dave   -5   92000
      Eve   41    -500

After filtering:
 employee  age  salary
    Alice   29   75000
  Charlie   35  120000
      Eve   41  120000

Three rows removed: Bob had an impossible age of 999, Dave had a negative age, and Eve had a negative salary. The bitwise & operator combines the two boolean masks — both conditions must be true for a row to survive. For more sophisticated outlier detection (IQR-based or z-score), you would replace the hard-coded thresholds with computed bounds, but for domain-aware cleaning, explicit range checks are clearer and more auditable.

Renaming and Normalizing Column Names

Column names inherited from CSV headers often have spaces, inconsistent casing, or long verbose names. Cleaning column names early prevents attribute-access bugs and makes code more readable throughout the rest of the pipeline.

# rename_columns.py
import pandas as pd

# Simulate columns from a legacy Excel export
df = pd.DataFrame(columns=[
    "First Name", "Last Name", "Date Of Birth", "  Salary (USD) "
])

# Normalize ALL column names at once
df.columns = (
    df.columns
    .str.strip()                                 # remove edge whitespace
    .str.lower()                                 # lowercase everything
    .str.replace(r"[\s()]+", "_", regex=True)    # spaces and parens to underscores
    .str.replace(r"_+", "_", regex=True)         # collapse multiple underscores
    .str.strip("_")                              # remove leading/trailing underscores
)

print(df.columns.tolist())

Output:

['first_name', 'last_name', 'date_of_birth', 'salary_usd']

The method chain on df.columns is a one-time investment that pays dividends throughout the rest of the pipeline. With snake_case column names you can use dot-notation access (df.salary_usd), avoid quoting keys with spaces, and write filter expressions that look like real Python rather than string soup.

Developer renaming messy column names to clean snake_case
Column names with spaces. Enjoy your df[‘Salary (USD)’]. Forever.

Real-Life Example: Building a Sales Data Cleaning Pipeline

Let us put all the techniques together into a realistic cleaning function for a monthly sales export. This is the kind of dataset you might receive from a legacy CRM: mixed formats, sparse nulls, and column names straight from a 2009 Excel template.

# clean_sales.py
import pandas as pd
import io

RAW_DATA = (
    "Order ID,Customer Name ,Product,Qty,Unit Price,Order Date,Region\n"
    "S001, Alice Johnson,Widget A,3,$14.99,2024-01-15,North\n"
    "S002,BOB SMITH,Widget B,,8.50,January 18 2024,south\n"
    "S001, alice johnson,Widget A,3,$14.99,2024-01-15,North\n"  # duplicate
    "S003,Charlie Brown,Widget C,2,N/A,2024-02-01,EAST\n"
    "S004,Dave Wilson,Widget A,999,-5.00,2024-02-03,West\n"    # bad qty and price
    "S005,Eve Davis,Widget B,1,8.50,bad-date,North\n"          # bad date
)


def clean_sales(raw_csv: str) -> pd.DataFrame:
    df = pd.read_csv(io.StringIO(raw_csv))

    # 1. Normalize column names
    df.columns = (
        df.columns.str.strip().str.lower()
        .str.replace(r"\s+", "_", regex=True)
    )

    # 2. Strip whitespace from all string columns
    str_cols = df.select_dtypes("object").columns
    df[str_cols] = df[str_cols].apply(lambda c: c.str.strip())

    # 3. Standardize text fields
    df["customer_name"] = df["customer_name"].str.title()
    df["region"] = df["region"].str.title()

    # 4. Clean and convert numeric columns
    df["unit_price"] = pd.to_numeric(
        df["unit_price"].astype(str).str.replace(r"[\$,]", "", regex=True),
        errors="coerce"
    )
    df["qty"] = pd.to_numeric(df["qty"], errors="coerce")

    # 5. Parse dates (coerce bad values to NaT)
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

    # 6. Remove exact duplicates
    df = df.drop_duplicates()

    # 7. Filter out implausible values
    df = df[
        df["qty"].between(1, 500) &
        (df["unit_price"].fillna(0) >= 0)
    ]

    # 8. Add a computed revenue column
    df["revenue"] = (df["qty"] * df["unit_price"]).round(2)

    return df.reset_index(drop=True)


result = clean_sales(RAW_DATA)
print(result.to_string(index=False))
print()
print(f"Rows: {len(result)}")
print(f"Columns: {list(result.columns)}")

Output:

  order_id   customer_name    product  qty  unit_price order_date  region  revenue
      S001   Alice Johnson   Widget A  3.0       14.99 2024-01-15   North    44.97
      S003   Charlie Brown   Widget C  2.0         NaN 2024-02-01    East      NaN
      S005       Eve Davis   Widget B  1.0        8.50        NaT   North     8.50

Rows: 3
Columns: ['order_id', 'customer_name', 'product', 'qty', 'unit_price', 'order_date', 'region', 'revenue']

The pipeline removed the duplicate S001 (Alice), filtered out S002 because qty was missing and fell outside the 1-500 range, filtered out S004 (Dave, negative unit price and implausible qty of 999), and standardized every string field. Orders S003 and S005 still appear with partial nulls — their underlying data is recoverable after a follow-up request, so dropping them entirely would be premature. You can extend this function with domain-specific rules, add a logging step that records every drop reason, or persist the clean DataFrame to a database or Parquet file for downstream use.

Frequently Asked Questions

When should I use fillna vs. dropna?

Use fillna when you can make a defensible guess about the true value — the column median for a numeric field, “Unknown” for a categorical field, or a forward-fill for a time series where the last known value is a reasonable proxy. Use dropna when missing data is truly unrecoverable and including it would corrupt your results. As a rule of thumb: if a column is more than 40-50% null, dropping the entire column is usually more honest than imputing most of it with df.drop(columns=["col"]).

What is the difference between astype() and pd.to_numeric()?

.astype(float) raises a ValueError if any value cannot be converted — a single “N/A” in a column of 10,000 numbers will crash your pipeline. pd.to_numeric(errors="coerce") converts what it can and silently turns the rest into NaN. For production data cleaning, always prefer pd.to_numeric with errors="coerce". Reserve .astype() only for columns you have already verified are fully clean, such as after filling all nulls.

How do I detect duplicates on just a few columns, not the entire row?

Pass the subset parameter to drop_duplicates() with a list of the columns that define uniqueness for your domain. For example, df.drop_duplicates(subset=["order_id"]) keeps only the first row for each order ID, even if other columns differ. To inspect duplicates before removing them, use df[df.duplicated(subset=["order_id"], keep=False)] to see all rows involved in a collision, which helps you decide whether to keep first, last, or neither.

Why does df[“col”].str.replace() not update the column in place?

Pandas string operations return a new Series; they do not modify the original in place. You must assign the result back: df["col"] = df["col"].str.replace(...). This is consistent with how all pandas transformations work — they return new objects rather than mutating existing ones, which makes pipelines easier to reason about and debug, and avoids subtle bugs where one piece of code modifies a DataFrame being used elsewhere.

Should I use inplace=True to avoid reassignment?

Generally no. inplace=True does not actually save memory in most cases — pandas still creates an internal copy in many operations before replacing the original. It also breaks method chaining and makes code harder to debug. The modern pandas best practice is to chain operations and reassign: df = df.drop_duplicates().dropna(subset=["name"]). The inplace parameter is considered a legacy interface and may be deprecated in a future pandas version.

How do I verify my cleaned data is actually clean?

After cleaning, always run a quick sanity check: df.dtypes to confirm types, df.isnull().sum() to confirm null counts, df.describe() to spot remaining outliers in numeric ranges, and df.duplicated().sum() to confirm deduplication worked. For production pipelines, consider the pandera library, which lets you define a schema and validate a DataFrame against it with a single function call — raising a clear exception if any rule is violated.

Conclusion

A clean dataset is the foundation every analysis, model, and visualization is built on. In this tutorial we covered the core pandas cleaning toolkit: detecting and filling missing values with isnull, fillna, and dropna; safe type conversion with pd.to_numeric and pd.to_datetime; deduplication with drop_duplicates; string standardization via the .str accessor; and outlier removal with boolean masks and .between(). We also walked through a complete cleaning function that handles a realistic sales CSV end-to-end, removing duplicates, bad values, and type mismatches in a single reusable pipeline.

The real-life example is a starting point — extend it by logging every dropped row to a separate DataFrame so you can audit what was removed and why. For production pipelines where data contracts matter, consider adding a validation step using pandera or Great Expectations to catch regressions automatically when the upstream data format changes.

For deeper reading, the official pandas missing data guide and the working with text data guide are comprehensive references for every method covered here.