
You have built a scraper that works perfectly on practice sites, but real-world websites fight back. Rate limiting, CAPTCHAs, IP blocking, and browser fingerprinting are all common defenses designed to stop automated access. Understanding these measures — and the ethical ways to work around them — is essential for any serious scraping project.

Python gives you several tools to handle anti-scraping defenses ethically: rotating User-Agent headers, managing request timing, using sessions to maintain cookies, and respecting robots.txt files. The key principle is to make your scraper behave like a polite human visitor rather than an aggressive bot.

In this tutorial, you will learn which anti-scraping techniques websites commonly use, how to add proper headers and User-Agent strings, manage request timing and rate limits, handle cookies and sessions, check and respect robots.txt, and deal with common blocking scenarios. By the end, you will know how to build scrapers that work reliably without abusing the websites you scrape.

Handling Blocks: Quick Example

Many websites block requests that do not include a User-Agent header. Adding one is the simplest fix for “403 Forbidden” responses.

# quick_headers.py
import requests

url = "https://httpbin.org/headers"

# Default python-requests User-Agent (often blocked on real sites)
response_bare = requests.get(url)
print("Status:", response_bare.status_code)

# With proper headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}
response = requests.get(url, headers=headers)
data = response.json()
print("Sent headers:", list(data["headers"].keys()))

Output:

Status: 200
Sent headers: ['Accept', 'Accept-Encoding', 'Accept-Language', 'Host', 'User-Agent', 'X-Amzn-Trace-Id']

The httpbin.org/headers endpoint echoes back the headers your request sent, which is perfect for verifying that your headers are being sent correctly.

Common Anti-Scraping Measures

Understanding what defenses websites use helps you build scrapers that handle them correctly. Here are the most common anti-scraping measures you will encounter.

| Measure | How It Works | Your Response |
| --- | --- | --- |
| Missing User-Agent block | Rejects requests without browser-like headers | Add realistic headers |
| Rate limiting | Limits requests per time period per IP | Add delays between requests |
| IP blocking | Bans IPs making too many requests | Slow down, use proxies ethically |
| CAPTCHA | Requires human verification | Respect it — do not automate CAPTCHA solving |
| JavaScript rendering | Content loads only via JS | Use Playwright or Selenium |
| robots.txt | Declares which paths bots should avoid | Always respect it |

Respecting robots.txt

Every ethical scraper should check robots.txt before scraping. Python’s built-in urllib.robotparser module handles this for you.

# check_robots.py
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://quotes.toscrape.com/robots.txt")
rp.read()

# Check if we can scrape specific paths
paths = ["/", "/page/2/", "/login", "/admin/"]
for path in paths:
    allowed = rp.can_fetch("*", path)
    print(f"{path}: {'Allowed' if allowed else 'Blocked'}")

# Check crawl delay
delay = rp.crawl_delay("*")
print(f"Crawl delay: {delay or 'Not specified'}")

Output:

/: Allowed
/page/2/: Allowed
/login: Allowed
/admin/: Allowed
Crawl delay: Not specified

The can_fetch() method returns True if the specified user agent is allowed to access the path according to the site’s robots.txt rules. Always check this before scraping any new website.
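If robots.txt does declare a Crawl-delay, you can feed it straight into your sleep logic. A minimal sketch (the choose_delay helper is illustrative, not part of the standard library; the rules are parsed inline so the example runs without a network call):

```python
# crawl_delay_sketch.py
from urllib.robotparser import RobotFileParser

def choose_delay(rp, user_agent="*", default=1.0):
    """Return the site's declared Crawl-delay, or a default pause."""
    declared = rp.crawl_delay(user_agent)
    return float(declared) if declared is not None else default

rp = RobotFileParser()
# Parse rules directly instead of fetching, just for the demo
rp.parse(["User-agent: *", "Crawl-delay: 5"])
print(choose_delay(rp))  # 5.0
```

Passing the result to time.sleep() before each request keeps your scraper within the site's own stated limits.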

Managing Request Timing

The most common reason scrapers get blocked is making too many requests too quickly. Smart timing makes your scraper both polite and resilient.

# rate_limiting.py
import requests
import time
import random

def polite_get(url, session, min_delay=1.0, max_delay=3.0):
    """Make a request with random delay to avoid detection."""
    delay = random.uniform(min_delay, max_delay)
    time.sleep(delay)
    response = session.get(url)
    print(f"[{response.status_code}] {url} (waited {delay:.1f}s)")
    return response

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
})

urls = [
    "https://httpbin.org/get",
    "https://httpbin.org/headers",
    "https://httpbin.org/ip",
]

for url in urls:
    response = polite_get(url, session)

Output:

[200] https://httpbin.org/get (waited 1.7s)
[200] https://httpbin.org/headers (waited 2.3s)
[200] https://httpbin.org/ip (waited 1.2s)

Random delays between 1 and 3 seconds mimic human browsing patterns. Using requests.Session() maintains cookies across requests, which also makes your scraper behave more like a real browser.
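Random delays are one approach; another is to enforce a minimum interval between requests no matter how fast the rest of your code runs. A minimal sketch of such a limiter (the RateLimiter class is illustrative, not a library API):

```python
# rate_limiter.py
import time

class RateLimiter:
    """Ensure at least `interval` seconds pass between calls to wait()."""

    def __init__(self, interval=2.0):
        self.interval = interval
        self.last_call = 0.0

    def wait(self):
        # Sleep only for however much of the interval remains
        elapsed = time.monotonic() - self.last_call
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self.last_call = time.monotonic()

limiter = RateLimiter(interval=2.0)
# Call limiter.wait() before each session.get(); parsing time between
# requests counts toward the interval, so you never over-sleep.
```

Using time.monotonic() rather than time.time() makes the limiter immune to system clock adjustments.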

Implementing Retry Logic

Network errors and temporary blocks happen. Retry logic with exponential backoff handles these gracefully without hammering the server.

# retry_logic.py
import requests
import time
from requests.exceptions import RequestException

def fetch_with_retry(url, max_retries=3, backoff_factor=2):
    """Fetch URL with exponential backoff retry."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    })

    for attempt in range(max_retries):
        try:
            response = session.get(url, timeout=10)
            if response.status_code == 200:
                return response
            elif response.status_code == 429:
                wait = backoff_factor ** attempt
                print(f"Rate limited. Waiting {wait}s before retry...")
                time.sleep(wait)
            else:
                print(f"Got status {response.status_code}")
                return response
        except RequestException as e:
            wait = backoff_factor ** attempt
            print(f"Error: {e}. Retrying in {wait}s...")
            time.sleep(wait)

    print("All retries exhausted")
    return None

result = fetch_with_retry("https://httpbin.org/get")
if result:
    print(f"Success: {result.status_code}")

Output:

Success: 200

Exponential backoff doubles the wait time with each retry (1s, 2s, 4s). This prevents your scraper from making the situation worse when a server is already under load. The timeout=10 parameter prevents hanging on unresponsive servers.
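If you would rather not hand-roll the retry loop, requests can delegate retries to urllib3's built-in Retry class through a mounted HTTPAdapter. A sketch of that approach (exact backoff timings depend on your urllib3 version):

```python
# retry_adapter.py
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=3,                                      # up to 3 retries
    backoff_factor=1,                             # exponential backoff between attempts
    status_forcelist=[429, 500, 502, 503, 504],   # retry on these statuses
    respect_retry_after_header=True,              # honor Retry-After if the server sends it
)

session = requests.Session()
adapter = HTTPAdapter(max_retries=retry)
session.mount("https://", adapter)
session.mount("http://", adapter)

# Every session.get()/post() now retries transparently, e.g.:
# response = session.get("https://httpbin.org/get", timeout=10)
print(session.get_adapter("https://example.com").max_retries.total)  # 3
```

The hand-rolled version gives you per-status logging; the adapter version is less code and also retries on connection errors for free.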

Session and Cookie Management

Using sessions properly is critical. Sessions maintain cookies across requests, which many websites require for normal browsing. Without proper session management, websites may treat each request as a new visitor and trigger anti-bot measures.

# session_management.py
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9",
    "Accept-Language": "en-US,en;q=0.9",
})

# First request sets cookies
response = session.get("https://httpbin.org/cookies/set/session_id/abc123")
print(f"Cookies after first request: {dict(session.cookies)}")

# Subsequent requests send cookies automatically
response = session.get("https://httpbin.org/cookies")
print(f"Server sees cookies: {response.json()['cookies']}")

Output:

Cookies after first request: {'session_id': 'abc123'}
Server sees cookies: {'session_id': 'abc123'}

Always use requests.Session() instead of bare requests.get() for multi-page scraping. It handles cookies, connection pooling, and header persistence automatically.
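For long-running projects you may also want cookies to survive between runs. One simple approach, sketched here, is to pickle the session's cookie jar to disk (the file name is arbitrary):

```python
# save_cookies.py
import pickle
import requests

session = requests.Session()
session.cookies.set("session_id", "abc123", domain="example.com")

# Save the cookie jar to disk at the end of a run...
with open("cookies.pkl", "wb") as f:
    pickle.dump(session.cookies, f)

# ...and restore it at the start of the next run
new_session = requests.Session()
with open("cookies.pkl", "rb") as f:
    new_session.cookies.update(pickle.load(f))

print(dict(new_session.cookies))  # {'session_id': 'abc123'}
```

This lets a resumed scraper pick up where it left off instead of starting a fresh session, which some sites treat as a brand-new visitor.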

Real-Life Example: Building a Resilient Scraper

# resilient_scraper.py
import requests
import time
import random
from urllib.robotparser import RobotFileParser
from bs4 import BeautifulSoup

class ResilientScraper:
    def __init__(self, base_url, min_delay=1.0, max_delay=3.0):
        self.base_url = base_url
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "Accept": "text/html,application/xhtml+xml",
            "Accept-Language": "en-US,en;q=0.9",
        })
        self.robot_parser = RobotFileParser()
        self.robot_parser.set_url(f"{base_url}/robots.txt")
        self.robot_parser.read()

    def can_scrape(self, path):
        return self.robot_parser.can_fetch("*", path)

    def get_page(self, path, max_retries=3):
        if not self.can_scrape(path):
            print(f"Blocked by robots.txt: {path}")
            return None
        url = f"{self.base_url}{path}"
        for attempt in range(max_retries):
            delay = random.uniform(self.min_delay, self.max_delay)
            time.sleep(delay)
            try:
                response = self.session.get(url, timeout=10)
                if response.status_code == 200:
                    return BeautifulSoup(response.text, "html.parser")
                elif response.status_code == 429:
                    wait = 2 ** attempt
                    print(f"Rate limited on {path}. Waiting {wait}s...")
                    time.sleep(wait)
            except requests.RequestException as e:
                print(f"Error on {path}: {e}")
        return None

scraper = ResilientScraper("https://quotes.toscrape.com")
all_quotes = []
for page_num in range(1, 4):
    soup = scraper.get_page(f"/page/{page_num}/")
    if soup:
        quotes = soup.find_all("div", class_="quote")
        for q in quotes:
            text = q.find("span", class_="text")
            author = q.find("small", class_="author")
            if text and author:
                all_quotes.append({"text": text.text, "author": author.text})
        print(f"Page {page_num}: {len(quotes)} quotes")

print(f"Total scraped: {len(all_quotes)} quotes")

Output:

Page 1: 10 quotes
Page 2: 10 quotes
Page 3: 10 quotes
Total scraped: 30 quotes

This ResilientScraper class combines every technique from this tutorial: robots.txt checking, session management, random delays, retry logic with exponential backoff, and defensive parsing.

Frequently Asked Questions

Is it ethical to bypass anti-scraping measures?

Adding headers and managing timing are standard practices that mimic normal browser behavior. However, bypassing CAPTCHAs, breaking authentication, or ignoring robots.txt crosses ethical lines. Always respect the website’s terms of service and only scrape publicly available data.

Should I use proxy rotation?

Proxy rotation can help distribute requests across multiple IPs, but it should not be used to circumvent explicit blocking. If a website is actively blocking you, that is a signal to stop rather than escalate. Proxies are more appropriate for geographic content access.
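If you do use a proxy for legitimate purposes such as geographic content access, requests supports it through a session's proxies mapping. A minimal sketch with placeholder proxy addresses:

```python
# proxy_example.py
import requests

session = requests.Session()
session.proxies.update({
    "http": "http://proxy.example.com:8080",   # placeholder proxy URL
    "https": "http://proxy.example.com:8080",  # HTTPS traffic can route via an HTTP proxy
})

# Every request on this session now goes through the proxy, e.g.:
# response = session.get("https://httpbin.org/ip")
print(session.proxies["https"])
```

Setting proxies on the session rather than per-request keeps the configuration in one place.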

What does HTTP 429 mean?

HTTP 429 means “Too Many Requests.” The server is rate limiting you. The correct response is to slow down with exponential backoff, not to try harder. Check the Retry-After header if present — it tells you exactly how long to wait.
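The Retry-After header can hold either a delay in seconds or an HTTP date. A small helper that handles the common seconds form and falls back to a default otherwise might look like this (the helper and FakeResponse are illustrative):

```python
# retry_after.py
def wait_for_retry_after(response, fallback=5.0):
    """Return how long to wait based on Retry-After, or a fallback."""
    value = response.headers.get("Retry-After")
    if value is None:
        return fallback
    try:
        return float(value)   # plain number of seconds
    except ValueError:
        return fallback       # HTTP-date form; fall back to a default

# Stand-in for a real 429 response, just for the demo:
class FakeResponse:
    headers = {"Retry-After": "30"}

print(wait_for_retry_after(FakeResponse()))  # 30.0
```

Pass the result to time.sleep() before retrying instead of a guessed backoff value.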

Do I need to rotate User-Agent strings?

For most scraping projects, a single realistic User-Agent string is sufficient. Rotating User-Agents is only necessary for high-volume scraping where a single agent string might get flagged. The more important factor is request timing.
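If you do decide to rotate, picking one realistic string at random per session is usually enough. A minimal sketch (the User-Agent strings here are examples):

```python
# ua_rotation.py
import random
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

session = requests.Session()
# One User-Agent per session, not per request -- switching mid-session
# looks less like a real browser, not more
session.headers["User-Agent"] = random.choice(USER_AGENTS)
print(session.headers["User-Agent"])
```

Rotating per session rather than per request keeps each session's fingerprint self-consistent.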

Is web scraping legal?

Web scraping laws vary by jurisdiction. In the US, scraping publicly available data is generally legal, but violating a website's terms of service could have legal consequences. In the EU, GDPR adds restrictions around personal data. When in doubt, consult a legal professional.

Conclusion

Handling anti-scraping measures responsibly is about being a good citizen of the web. We covered checking robots.txt, adding proper headers, managing request timing, implementing retry logic, session management, and building a resilient scraper class that combines all these techniques.

The ResilientScraper class gives you a production-ready foundation. Extend it with logging, database storage, or email alerts for long-running scraping projects.