Intermediate

You have inherited a folder of Excel files from an accounting department. Half of them have the .xlsx extension from modern Excel, and the other half are old .xls files from the early 2000s. If you try to open a .xls file with openpyxl, you get an error. If you use xlrd 2.x on an .xlsx file, you get a different error. Knowing which library to use for which file format is the first step — everything else follows.

The two key libraries are: xlrd for reading legacy .xls (Excel 97-2003) format, and openpyxl for reading and writing modern .xlsx format. Install both with pip install xlrd openpyxl. For most practical data extraction tasks you will also want to know about pandas, which uses both libraries internally and provides a higher-level API.

In this article we will cover reading .xls files with xlrd, reading .xlsx files with openpyxl, detecting file format and routing automatically, extracting headers and data rows, handling different cell types, and building a unified Excel reader that handles both formats with the same output interface.

Reading Excel Files in Python: Quick Example

Here is the essential pattern for reading each format. Note the different library imports and APIs:

# quick_excel.py
import xlrd      # For .xls files (pip install xlrd)
import openpyxl  # For .xlsx files (pip install openpyxl)

# --- Reading a .xls file with xlrd ---
# xlrd 2.x only supports .xls format (NOT .xlsx)
# workbook = xlrd.open_workbook('legacy_data.xls')
# sheet = workbook.sheet_by_index(0)
# for row_idx in range(sheet.nrows):
#     row = sheet.row_values(row_idx)
#     print(row)

# --- Reading a .xlsx file with openpyxl ---
# Create a sample workbook to demonstrate
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Sales"
ws.append(["Product", "Qty", "Price"])
ws.append(["Widget A", 100, 9.99])
ws.append(["Widget B", 50,  14.99])
wb.save("sample.xlsx")

# Now read it back
wb2 = openpyxl.load_workbook("sample.xlsx")
ws2 = wb2.active
for row in ws2.iter_rows(values_only=True):
    print(row)

Output:

('Product', 'Qty', 'Price')
('Widget A', 100, 9.99)
('Widget B', 50, 14.99)

The critical difference is format support: xlrd 2.0 (released 2020) intentionally dropped .xlsx support due to security concerns with the XML-based format. Always use openpyxl for .xlsx files. If you need to process both formats in one script, you will need both libraries. Want to go deeper? Below we cover each library in detail with cell type handling and a unified reader.

xlrd vs openpyxl: Which Library for Which Format?

The most important thing to understand about Excel reading in Python is the format split. The confusion often arises because older xlrd versions (before 2.0) supported both formats, so many tutorials and Stack Overflow answers use xlrd for .xlsx files. Those answers are outdated — xlrd 2.x will raise a XLRDError if you try to open a .xlsx file.

Library.xls (Excel 97-2003).xlsx (Excel 2007+)Write Support
xlrd 2.xYes (read only)No — raises errorNo
openpyxlNoYes (read + write)Yes
pandasYes (via xlrd)Yes (via openpyxl)Yes (via openpyxl)

For most new projects, openpyxl is the right choice since virtually all modern Excel files are .xlsx. You need xlrd only when processing legacy files from systems that still produce the old binary format — which is more common than you might expect in finance, manufacturing, and government data pipelines.

Reading Legacy .xls Files with xlrd

xlrd provides a workbook/sheet/cell hierarchy for accessing data. Row and column indices are zero-based, and cell values come with type information that you need to handle for dates:

# read_xls.py
import xlrd
from datetime import datetime

# First, create a sample .xls for demonstration
# (normally you would just open an existing file)
import xlwt  # pip install xlwt -- only for creating .xls test files
wb_write = xlwt.Workbook()
ws_write = wb_write.add_sheet('Employees')
headers = ['Name', 'Department', 'Salary', 'Start Date']
data = [
    ['Alice Chen',    'Engineering', 95000, '2021-03-15'],
    ['Bob Martinez',  'Marketing',   72000, '2020-07-01'],
    ['Carol Johnson', 'Finance',     88000, '2019-11-20'],
]
for col, h in enumerate(headers):
    ws_write.write(0, col, h)
for row_idx, row in enumerate(data, 1):
    for col_idx, val in enumerate(row):
        ws_write.write(row_idx, col_idx, val)
wb_write.save('employees.xls')

# Now read the .xls file with xlrd
workbook = xlrd.open_workbook('employees.xls')
print(f'Sheets: {workbook.sheet_names()}')

sheet = workbook.sheet_by_name('Employees')
print(f'Rows: {sheet.nrows}, Cols: {sheet.ncols}')
print()

# Read header row
headers = sheet.row_values(0)
print('Headers:', headers)
print()

# Read data rows
for row_idx in range(1, sheet.nrows):
    row = {}
    for col_idx, header in enumerate(headers):
        cell = sheet.cell(row_idx, col_idx)
        # Check cell type -- xlrd.XL_CELL_DATE = 3
        if cell.ctype == xlrd.XL_CELL_DATE:
            dt = xlrd.xldate_as_datetime(cell.value, workbook.datemode)
            row[header] = dt.strftime('%Y-%m-%d')
        else:
            row[header] = cell.value
    print(row)

Output:

Sheets: ['Employees']
Rows: 4, Cols: 4

Headers: ['Name', 'Department', 'Salary', 'Start Date']

{'Name': 'Alice Chen', 'Department': 'Engineering', 'Salary': 95000.0, 'Start Date': '2021-03-15'}
{'Name': 'Bob Martinez', 'Department': 'Marketing', 'Salary': 72000.0, 'Start Date': '2020-07-01'}
{'Name': 'Carol Johnson', 'Department': 'Finance', 'Salary': 88000.0, 'Start Date': '2019-11-20'}

Date handling is the trickiest part of reading .xls files with xlrd. Excel stores dates as floating-point numbers (days since 1900-01-01), not as Python datetime objects. The xlrd.xldate_as_datetime() function converts this float to a real datetime using the workbook’s date mode (which differs between Windows and Mac Excel). Always check cell.ctype == xlrd.XL_CELL_DATE before attempting date conversion.

Reading Modern .xlsx Files with openpyxl

openpyxl uses a Workbook/Worksheet/Cell model similar to xlrd but with a more Pythonic API. The values_only=True flag is the fastest way to extract data when you do not need cell formatting metadata:

# read_xlsx.py
import openpyxl
from datetime import datetime

# Create a sample .xlsx file with mixed cell types
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Inventory"
ws.append(["SKU", "Product", "Stock", "Price", "Last Updated"])
from datetime import date
ws.append(["A001", "Widget Pro",   500, 29.99, date(2026, 4, 15)])
ws.append(["A002", "Gadget Basic",  75, 9.99,  date(2026, 4, 20)])
ws.append(["A003", "Super Deluxe",  12, 149.99,date(2026, 5, 1)])
wb.save("inventory.xlsx")

# Read it back
wb2 = openpyxl.load_workbook("inventory.xlsx", read_only=True)
ws2 = wb2.active

print(f'Dimensions: {ws2.dimensions}')
print()

# Method 1: iter_rows with values_only
print('--- Using iter_rows(values_only=True) ---')
headers = None
for i, row in enumerate(ws2.iter_rows(values_only=True)):
    if i == 0:
        headers = row
        print('Headers:', headers)
    else:
        record = dict(zip(headers, row))
        print(record)

wb2.close()

# Method 2: read_only=False for cell object access
wb3 = openpyxl.load_workbook("inventory.xlsx")
ws3 = wb3["Inventory"]

print()
print('--- Cell object access (non-readonly) ---')
# Access individual cells by coordinate
print('A1:', ws3['A1'].value)
print('B2:', ws3['B2'].value)
print('E2 (date):', ws3['E2'].value, type(ws3['E2'].value).__name__)

Output:

Dimensions: A1:E4

--- Using iter_rows(values_only=True) ---
Headers: ('SKU', 'Product', 'Stock', 'Price', 'Last Updated')
{'SKU': 'A001', 'Product': 'Widget Pro', 'Stock': 500, 'Price': 29.99, 'Last Updated': datetime.datetime(2026, 4, 15, 0, 0)}
{'SKU': 'A002', 'Product': 'Gadget Basic', 'Stock': 75, 'Price': 9.99, 'Last Updated': datetime.datetime(2026, 4, 20, 0, 0)}
{'SKU': 'A003', 'Product': 'Super Deluxe', 'Stock': 12, 'Price': 149.99, 'Last Updated': datetime.datetime(2026, 5, 1, 0, 0)}

--- Cell object access (non-readonly) ---
A1: SKU
B2: Widget Pro
E2 (date): 2026-04-15 00:00:00 datetime

Unlike xlrd, openpyxl automatically converts date cells to Python datetime objects — no manual conversion needed. The read_only=True mode is faster for large files because it uses an event-based parser that does not load the entire file into memory. Use it when you only need to read data, and disable it (default mode) when you need to access cell formatting or ose the ws[coordinate] syntax.

Real-Life Example: Unified Excel File Reader

Here is a unified reader that auto-detects the file format and uses the right library, returning a consistent list-of-dicts structure regardless of which format the input uses:

# unified_excel_reader.py
import xlrd
import openpyxl
from pathlib import Path
from datetime import datetime

def read_excel_file(filepath, sheet_index=0, header_row=0):
    """
    Read an Excel file (.xls or .xlsx) into a list of dicts.
    Auto-detects format from file extension.
    
    Returns:
        list[dict]: One dict per data row, keyed by header names
    """
    path = Path(filepath)
    ext = path.suffix.lower()
    
    if ext == '.xls':
        return _read_xls(filepath, sheet_index, header_row)
    elif ext in ('.xlsx', '.xlsm', '.xltx', '.xltm'):
        return _read_xlsx(filepath, sheet_index, header_row)
    else:
        raise ValueError(f"Unsupported format: {ext}. Use .xls or .xlsx")

def _read_xls(filepath, sheet_index, header_row):
    """Read a .xls file using xlrd."""
    wb = xlrd.open_workbook(filepath)
    ws = wb.sheet_by_index(sheet_index)
    
    headers = ws.row_values(header_row)
    records = []
    
    for row_idx in range(header_row + 1, ws.nrows):
        record = {}
        for col_idx, header in enumerate(headers):
            cell = ws.cell(row_idx, col_idx)
            if cell.ctype == xlrd.XL_CELL_DATE:
                value = xlrd.xldate_as_datetime(cell.value, wb.datemode)
            elif cell.ctype == xlrd.XL_CELL_EMPTY:
                value = None
            else:
                value = cell.value
            record[header] = value
        records.append(record)
    
    return records

def _read_xlsx(filepath, sheet_index, header_row):
    """Read a .xlsx file using openpyxl."""
    wb = openpyxl.load_workbook(filepath, read_only=True, data_only=True)
    ws_name = wb.sheetnames[sheet_index]
    ws = wb[ws_name]
    
    records = []
    headers = None
    
    for row_idx, row in enumerate(ws.iter_rows(values_only=True)):
        if row_idx == header_row:
            headers = [str(h) if h is not None else f'Col_{i}' 
                      for i, h in enumerate(row)]
            continue
        if headers is None:
            continue
        record = dict(zip(headers, row))
        records.append(record)
    
    wb.close()
    return records

def print_summary(records, filename):
    """Print a summary of the extracted data."""
    print(f"\nFile: {filename}")
    print(f"Records: {len(records)}")
    if records:
        print(f"Columns: {list(records[0].keys())}")
        print("First record:", records[0])
        print("Last record: ", records[-1])

# Test with both formats
# (Using the files created in earlier examples)
xlsx_data = read_excel_file('inventory.xlsx')
print_summary(xlsx_data, 'inventory.xlsx')

# xls_data = read_excel_file('employees.xls')
# print_summary(xls_data, 'employees.xls')

Output:

File: inventory.xlsx
Records: 3
Columns: ['SKU', 'Product', 'Stock', 'Price', 'Last Updated']
First record: {'SKU': 'A001', 'Product': 'Widget Pro', 'Stock': 500, 'Price': 29.99, 'Last Updated': datetime.datetime(2026, 4, 15, 0, 0)}
Last record:  {'SKU': 'A003', 'Product': 'Super Deluxe', 'Stock': 12, 'Price': 149.99, 'Last Updated': datetime.datetime(2026, 5, 1, 0, 0)}

This unified reader pattern is the cleanest way to handle a mixed bag of Excel files without scattering format-detection logic throughout your codebase. Extend it by adding support for a max_rows parameter for large files, a dtype_map for explicit type coercion, or a pandas output mode that returns a DataFrame instead of a list of dicts.

Frequently Asked Questions

Why does xlrd raise an error when I open a .xlsx file?

xlrd version 2.0 (released January 2020) deliberately removed .xlsx support. The maintainer decided the risk of parsing the complex XML-based format was not worth maintaining. If you are getting xlrd.biffh.XLRDError: Excel xlsx file; not supported, you are using xlrd 2.x. Switch to openpyxl for .xlsx files. If you have legacy code that used xlrd for .xlsx, the fix is straightforward: replace xlrd.open_workbook() with openpyxl.load_workbook() and adjust the API calls accordingly.

How do I efficiently read large Excel files without running out of memory?

For large .xlsx files, use openpyxl.load_workbook(filepath, read_only=True). In read-only mode, openpyxl uses an event-based parser that processes rows one at a time without loading the whole file. You can also use pandas pd.read_excel() with the chunksize parameter to process data in chunks. For very large files (100,000+ rows), pandas with openpyxl is typically faster than using openpyxl directly because pandas optimizes the data extraction path.

How do I read cell values instead of formulas in .xlsx files?

Pass data_only=True to openpyxl.load_workbook(). By default, openpyxl reads the formula text (=SUM(A1:A10)) instead of the computed value. With data_only=True, openpyxl reads the cached value that Excel stored when the file was last saved. Note: if the file was never opened in Excel (e.g., programmatically generated), the cache may be empty and you will get None for formula cells.

When should I use pandas instead of xlrd/openpyxl directly?

Use pandas when you need to do any data analysis, filtering, or transformation after reading the file. pd.read_excel('file.xlsx') handles format detection, type inference, header parsing, and multi-sheet reading in one call. Use xlrd or openpyxl directly when you need precise control over cell access, formatting metadata, or when writing production code that should not depend on pandas. For simple ETL pipelines that just extract data and load it elsewhere, the direct library approach has fewer dependencies.

Can xlrd write .xls files?

No — xlrd is read-only. To create .xls files programmatically, use xlwt (pip install xlwt). However, xlwt is also unmaintained and you should avoid creating new .xls files in 2026. If a system still requires .xls output, use xlwt as a stopgap, but the right long-term solution is upgrading that system to accept .xlsx. For .xlsx writing, openpyxl is the standard choice and actively maintained.

Conclusion

Reading Excel files in Python requires knowing which library matches the file format: xlrd for legacy .xls files, openpyxl for modern .xlsx files. We covered the format difference and why xlrd 2.x dropped .xlsx support, reading .xls files with xlrd including date cell handling, reading .xlsx files with openpyxl including read-only mode for performance, and building a unified reader that auto-detects format and returns consistent output.

The unified reader is the practical tool for any data pipeline that processes Excel files from varied sources. From here, you can add pandas integration for data analysis, write Excel output with openpyxl’s workbook creation API, or build a batch processor that walks a directory and converts every Excel file to CSV. The format-handling knowledge transfers directly to any project that touches Excel files.

For writing and formatting .xlsx files, see the openpyxl documentation. For legacy .xls reading details, see the xlrd documentation.