Intermediate

You need data from a website, but there is no API available. Maybe you want to track product prices, collect research data, or aggregate job listings from multiple sources. Web scraping with Python and BeautifulSoup lets you extract structured data from HTML pages quickly and reliably — and in 2026, the fundamentals remain as relevant as ever.

Python’s requests library handles HTTP connections while BeautifulSoup (from the bs4 package) parses the HTML and lets you navigate the document tree with simple, readable methods. Both are pure Python, install in seconds with pip, and work on every platform.

In this tutorial, you will learn how to fetch web pages, parse HTML with BeautifulSoup, extract text and attributes, work with tables, handle multiple pages, and build a complete scraping project. By the end, you will have a reusable scraping toolkit you can adapt to any static website.

Scraping a Web Page: Quick Example

Let us start with a minimal working example that fetches quotes from a practice website and prints them out.

# quick_scrape.py
import requests
from bs4 import BeautifulSoup

url = "https://quotes.toscrape.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

quotes = soup.find_all("span", class_="text")
for quote in quotes[:5]:
    print(quote.text)

Output:

"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."
"It is our choices, Harry, that show what we truly are, far more than our abilities."
"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle."
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."
"Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring."

Three lines do the heavy lifting: requests.get() fetches the page, BeautifulSoup() parses the HTML, and find_all() extracts matching elements. The rest of this tutorial builds on these patterns.

What Is BeautifulSoup and Why Use It?

BeautifulSoup is a Python library that parses HTML and XML documents into a tree structure you can search and navigate with Python code. Think of it like a smart search tool for web pages — instead of working with raw text strings, you work with structured elements that know about their tags, attributes, parents, and children.

The library handles malformed HTML gracefully, which matters because real-world web pages are often messy. Missing closing tags, inconsistent indentation, and mixed encoding are all things BeautifulSoup handles without crashing.
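To see this forgiveness in action, here is a small self-contained sketch that feeds deliberately broken HTML (no closing tags at all) to BeautifulSoup. The markup is made up for illustration:

```python
from bs4 import BeautifulSoup

# Deliberately broken HTML: none of the tags are closed
broken = "<div class='post'><p>Hello <b>world"

soup = BeautifulSoup(broken, "html.parser")

# BeautifulSoup closes the open tags at the end of the document,
# so the tree is still fully navigable
print(soup.find("p").text)  # Hello world
print(soup.find("b").text)  # world
```

A regular-expression approach would have no such safety net: a missing closing tag silently changes what a pattern matches.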

| Feature             | BeautifulSoup       | Regular Expressions | lxml            |
|---------------------|---------------------|---------------------|-----------------|
| Learning curve      | Low                 | High                | Medium          |
| Handles broken HTML | Yes                 | No                  | Partially       |
| Speed               | Moderate            | Fast                | Very fast       |
| CSS selectors       | Yes                 | No                  | Yes             |
| Best for            | Most scraping tasks | Simple patterns     | Large documents |

For most web scraping projects, BeautifulSoup with the html.parser backend gives you the best balance of simplicity and reliability.
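The parser is just the second argument to the BeautifulSoup constructor, so switching backends later is a one-line change. Note that lxml and html5lib are optional third-party installs; only html.parser ships with Python:

```python
from bs4 import BeautifulSoup

html = "<p>Same document, different parser backends</p>"

# html.parser ships with the standard library -- no extra install needed
soup = BeautifulSoup(html, "html.parser")
print(soup.p.text)

# If lxml or html5lib is installed, swap in the parser name:
#   BeautifulSoup(html, "lxml")      # fastest
#   BeautifulSoup(html, "html5lib")  # most browser-like error recovery
```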

Finding Elements: find() and find_all()

The two methods you will use most are find() (returns the first match) and find_all() (returns all matches). Both accept tag names, CSS classes, IDs, and attribute dictionaries.

# finding_elements.py
import requests
from bs4 import BeautifulSoup

url = "https://quotes.toscrape.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

first_quote = soup.find("span", class_="text")
print("First quote:", first_quote.text[:60] + "...")

authors = soup.find_all("small", class_="author")
print("Authors on page:", [a.text for a in authors[:5]])

tags_div = soup.find("div", attrs={"class": "tags"})
if tags_div:
    tag_links = tags_div.find_all("a", class_="tag")
    print("Tags:", [t.text for t in tag_links])

Output:

First quote: "The world as we have created it is a process of our...
Authors on page: ['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe']
Tags: ['change', 'deep-thoughts', 'thinking', 'world']

Notice the defensive pattern with if tags_div: before calling find_all(). Real websites are messy — elements might be missing, classes might change, or content might be empty. Defensive parsing separates a scraper that crashes on page 3 from one that runs reliably across thousands of pages.
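One way to keep that check-before-access pattern from cluttering your code is a small helper. The safe_text function below is a hypothetical name for this tutorial, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup

def safe_text(parent, tag, default="", **kwargs):
    """Return the stripped text of the first matching child, or a default."""
    element = parent.find(tag, **kwargs)
    return element.text.strip() if element else default

# Inline HTML so the example is self-contained
html = '<div class="quote"><span class="text">"Hello"</span></div>'
soup = BeautifulSoup(html, "html.parser")
quote = soup.find("div", class_="quote")

print(safe_text(quote, "span", class_="text"))                        # found
print(safe_text(quote, "small", default="Unknown", class_="author"))  # missing
```

Every extraction goes through one function, so a missing element yields a default value instead of an AttributeError.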

Using CSS Selectors with select()

If you already know CSS, the select() method lets you use CSS selector syntax to find elements. This is often more concise than chaining find() calls.

# css_selectors.py
import requests
from bs4 import BeautifulSoup

url = "https://quotes.toscrape.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

quotes = soup.select("div.quote span.text")
print(f"Found {len(quotes)} quotes using CSS selector")

author_links = soup.select("div.quote span a")
print("Author pages:", [a["href"] for a in author_links[:3]])

first_tag = soup.select_one("a.tag")
print("First tag:", first_tag.text if first_tag else "None found")

Output:

Found 10 quotes using CSS selector
Author pages: ['/author/Albert-Einstein', '/author/J-K-Rowling', '/author/Albert-Einstein']
First tag: change

The select() method returns a list while select_one() returns the first match or None. CSS selectors are particularly useful when elements are deeply nested.
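A few more selector patterns are worth knowing. The sketch below uses a small inline snippet (modeled on the quote markup above) so it runs on its own:

```python
from bs4 import BeautifulSoup

html = """
<div class="quote">
  <a href="/author/Albert-Einstein">bio</a>
  <a class="tag" href="/tag/change/">change</a>
  <a class="tag" href="/tag/world/">world</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print([a.text for a in soup.select("a.tag")])               # class selector
print([a["href"] for a in soup.select('a[href^="/tag"]')])  # attribute prefix
print(soup.select_one("div.quote > a").text)                # direct child
```

Attribute selectors like `[href^="/tag"]` are handy when class names are unreliable but URL patterns are stable.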

Extracting Text, Attributes, and Links

Once you have found an element, you need to extract useful data from it. BeautifulSoup gives you .text for visible text content, .get() for attributes, and .attrs for the full attribute dictionary.

# extracting_data.py
import requests
from bs4 import BeautifulSoup

url = "https://quotes.toscrape.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

quote_div = soup.find("div", class_="quote")
if quote_div:
    quote_text = quote_div.find("span", class_="text")
    print("Quote:", quote_text.text if quote_text else "Unknown")

    author_link = quote_div.find("a")
    href = author_link.get("href", "#") if author_link else "#"
    print("Author page:", href)

    tags_container = quote_div.find("div", class_="tags")
    if tags_container:
        tags = [tag.text for tag in tags_container.find_all("a")]
        print("Tags:", tags)

Output:

Quote: "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."
Author page: /author/Albert-Einstein
Tags: ['change', 'deep-thoughts', 'thinking', 'world']

The .get() method with a default value is safer than direct dictionary access like element["href"] — it returns your default instead of raising a KeyError if the attribute is missing.
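A minimal sketch of the difference, using an inline snippet with one well-formed link and one that is missing its href:

```python
from bs4 import BeautifulSoup

html = '<a class="tag" href="/tag/change/">change</a><a class="tag">broken</a>'
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

print(links[0].get("href", "#"))  # /tag/change/
print(links[1].get("href", "#"))  # "#" instead of a KeyError
print(links[0].attrs)             # full attribute dictionary
```

Note that `class` appears in `.attrs` as a list, because BeautifulSoup treats it as a multi-valued attribute.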

Handling Pagination

Most websites split content across multiple pages. To scrape all pages, find the “next page” link and follow it until there are no more pages.

# pagination.py
import requests
from bs4 import BeautifulSoup
import time

base_url = "https://quotes.toscrape.com"
all_quotes = []
url = base_url + "/page/1/"

while url:
    print(f"Scraping: {url}")
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    for quote_div in soup.find_all("div", class_="quote"):
        text = quote_div.find("span", class_="text")
        author = quote_div.find("small", class_="author")
        if text and author:
            all_quotes.append({"text": text.text, "author": author.text})

    next_btn = soup.find("li", class_="next")
    if next_btn:
        next_link = next_btn.find("a")
        url = base_url + next_link["href"] if next_link else None
        time.sleep(1)  # be polite: pause before the next request
    else:
        url = None

print(f"Total quotes scraped: {len(all_quotes)}")
print(f"Sample: {all_quotes[0]['author']}: {all_quotes[0]['text'][:50]}...")

Output:

Scraping: https://quotes.toscrape.com/page/1/
Scraping: https://quotes.toscrape.com/page/2/
...
Scraping: https://quotes.toscrape.com/page/10/
Total quotes scraped: 100
Sample: Albert Einstein: "The world as we have created it is a process of o...

The time.sleep(1) between requests is essential. Without it, you risk overwhelming the server and getting your IP blocked.
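One way to bundle this politeness into something reusable is a shared Session with a descriptive User-Agent and a built-in delay. The polite_get name and the header value below are illustrative choices for this sketch, not requirements:

```python
import time
import requests

# A shared Session reuses TCP connections and sends the same headers each time
session = requests.Session()
session.headers.update({"User-Agent": "my-scraper/1.0 (contact@example.com)"})

def polite_get(url, delay=1.0, timeout=10):
    """Fetch a URL with a fixed delay and a timeout; raise on HTTP errors."""
    time.sleep(delay)  # wait before each request, not after
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response
```

A descriptive User-Agent with contact information lets site operators reach you instead of blocking you, and the timeout keeps a stalled server from hanging your scraper forever.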

Saving Scraped Data to CSV and JSON

Once you have extracted data, you typically want to save it for analysis. Python’s built-in csv and json modules handle this without extra dependencies.

# save_data.py
import csv
import json

quotes = [
    {"text": "Be yourself; everyone else is already taken.", "author": "Oscar Wilde"},
    {"text": "Two things are infinite: the universe and human stupidity.", "author": "Albert Einstein"},
    {"text": "Be the change that you wish to see in the world.", "author": "Mahatma Gandhi"},
]

with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "author"])
    writer.writeheader()
    writer.writerows(quotes)
print("Saved to quotes.csv")

with open("quotes.json", "w", encoding="utf-8") as f:
    json.dump(quotes, f, indent=2, ensure_ascii=False)
print("Saved to quotes.json")

Output:

Saved to quotes.csv
Saved to quotes.json

Always specify encoding="utf-8" when opening files for scraped data. Web content often contains special characters that will cause errors with the default system encoding.

Real-Life Example: Building a Job Listing Scraper

Let us build a complete scraper that extracts job listings from the Fake Jobs practice site, processes the data, and saves it as both CSV and JSON.

# job_scraper.py
import requests
from bs4 import BeautifulSoup
import csv
import json

def scrape_jobs(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    jobs = []
    cards = soup.find_all("div", class_="card-content")

    for card in cards:
        title_elem = card.find("h2", class_="title")
        company_elem = card.find("h3", class_="company")
        location_elem = card.find("p", class_="location")
        date_elem = card.find("time")

        job = {
            "title": title_elem.text.strip() if title_elem else "Unknown",
            "company": company_elem.text.strip() if company_elem else "Unknown",
            "location": location_elem.text.strip() if location_elem else "Unknown",
            "date_posted": date_elem.text.strip() if date_elem else "Unknown",
        }
        apply_link = card.find("a", string="Apply")
        job["apply_url"] = apply_link["href"] if apply_link else "N/A"
        jobs.append(job)
    return jobs

def save_results(jobs, csv_path, json_path):
    if not jobs:  # an empty list would crash jobs[0].keys() below
        return
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=jobs[0].keys())
        writer.writeheader()
        writer.writerows(jobs)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(jobs, f, indent=2, ensure_ascii=False)

url = "https://realpython.github.io/fake-jobs/"
jobs = scrape_jobs(url)
print(f"Scraped {len(jobs)} job listings")
save_results(jobs, "jobs.csv", "jobs.json")
print("Data saved to jobs.csv and jobs.json")

Output:

Scraped 100 job listings
Data saved to jobs.csv and jobs.json

This scraper uses defensive parsing throughout — every find() result is checked before accessing .text, and default values handle missing elements. You can extend this by adding pagination, filtering by location, or scheduling it to run daily.
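As a sketch of the filtering idea, here is a hypothetical filter_jobs helper that works on the list of dictionaries scrape_jobs returns (the sample data below is made up in the same shape):

```python
def filter_jobs(jobs, location_keyword):
    """Keep only jobs whose location contains the keyword (case-insensitive)."""
    keyword = location_keyword.lower()
    return [job for job in jobs if keyword in job.get("location", "").lower()]

sample = [
    {"title": "Energy engineer", "location": "Stewartbury, AA"},
    {"title": "Legal executive", "location": "Port Ericaburgh, AA"},
    {"title": "Fitness centre manager", "location": "East Seanview, AP"},
]
print(filter_jobs(sample, "port"))  # only the Port Ericaburgh listing
```

Because the scraper already normalizes missing fields to string defaults, the filter never has to worry about None values.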