You have built a scraper that works perfectly on practice sites, but real-world websites fight back. Rate limiting, CAPTCHAs, IP blocking, and browser fingerprinting are all common defenses designed to stop automated access. Understanding these measures — and the ethical ways to work around them — is essential for any serious scraping project.
Python gives you several tools to handle anti-scraping defenses ethically: rotating User-Agent headers, managing request timing, using sessions to maintain cookies, and respecting robots.txt files. The key principle is to make your scraper behave like a polite human visitor rather than an aggressive bot.
In this tutorial, you will learn about common anti-scraping techniques websites use, how to add proper headers and User-Agents, manage request timing and rate limits, handle cookies and sessions, check and respect robots.txt, and deal with common blocking scenarios. By the end, you will know how to build scrapers that work reliably without abusing the websites you scrape.
Handling Blocks: Quick Example
Many websites block requests that do not include a User-Agent header. Adding one is the simplest fix for “403 Forbidden” responses.
# quick_headers.py
import requests
url = "https://httpbin.org/headers"
# Without User-Agent (might get blocked on real sites)
response_bare = requests.get(url)
print("Status:", response_bare.status_code)
# With proper headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}
response = requests.get(url, headers=headers)
data = response.json()
print("Sent headers:", list(data["headers"].keys()))
Output:
Status: 200
Sent headers: ['Accept', 'Accept-Language', 'Host', 'User-Agent', 'X-Amzn-Trace-Id']
The httpbin.org/headers endpoint echoes back the headers it received, which makes it perfect for verifying that your headers arrive as intended.
Common Anti-Scraping Measures
Understanding what defenses websites use helps you build scrapers that handle them correctly. Here are the most common anti-scraping measures you will encounter.
| Measure | How It Works | Your Response |
|---|---|---|
| Missing User-Agent block | Rejects requests without browser-like headers | Add realistic headers |
| Rate limiting | Limits requests per time period per IP | Add delays between requests |
| IP blocking | Bans IPs making too many requests | Slow down, use proxies ethically |
| CAPTCHA | Requires human verification | Respect it — do not automate CAPTCHA solving |
| JavaScript rendering | Content loads only via JS | Use Playwright or Selenium |
| robots.txt | Declares which paths bots should avoid | Always respect it |
Respecting robots.txt
Every ethical scraper should check robots.txt before scraping. Python’s built-in urllib.robotparser module handles this for you.
# check_robots.py
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url("https://quotes.toscrape.com/robots.txt")
rp.read()
# Check if we can scrape specific paths
paths = ["/", "/page/2/", "/login", "/admin/"]
for path in paths:
    allowed = rp.can_fetch("*", path)
    print(f"{path}: {'Allowed' if allowed else 'Blocked'}")
# Check crawl delay
delay = rp.crawl_delay("*")
print(f"Crawl delay: {delay or 'Not specified'}")
Output:
/: Allowed
/page/2/: Allowed
/login: Allowed
/admin/: Allowed
Crawl delay: Not specified
The can_fetch() method returns True if the specified user agent is allowed to access the path according to the site’s robots.txt rules. Always check this before scraping any new website.
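When a site does declare a crawl delay, honor it between requests. As a side benefit, the parser's parse() method accepts robots.txt lines directly, so you can test your rule handling without a network call. A minimal sketch, using a made-up robots.txt body for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed without any network request
rules = """
User-agent: *
Disallow: /admin/
Crawl-delay: 5
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Fall back to a polite default when no crawl delay is declared
delay = rp.crawl_delay("*") or 1.0

print(rp.can_fetch("*", "/page/2/"))       # True
print(rp.can_fetch("*", "/admin/secret"))  # False
print(delay)                               # 5
```

Feeding rules in with parse() is also handy for unit-testing a scraper's robots.txt logic in isolation.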
Managing Request Timing
The most common reason scrapers get blocked is making too many requests too quickly. Smart timing makes your scraper both polite and resilient.
# rate_limiting.py
import requests
import time
import random
def polite_get(url, session, min_delay=1.0, max_delay=3.0):
    """Make a request with random delay to avoid detection."""
    delay = random.uniform(min_delay, max_delay)
    time.sleep(delay)
    response = session.get(url)
    print(f"[{response.status_code}] {url} (waited {delay:.1f}s)")
    return response

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
})
urls = [
"https://httpbin.org/get",
"https://httpbin.org/headers",
"https://httpbin.org/ip",
]
for url in urls:
    response = polite_get(url, session)
Output:
[200] https://httpbin.org/get (waited 1.7s)
[200] https://httpbin.org/headers (waited 2.3s)
[200] https://httpbin.org/ip (waited 1.2s)
Random delays between 1 and 3 seconds mimic human browsing patterns. Using requests.Session() maintains cookies across requests, which also makes your scraper behave more like a real browser.
Implementing Retry Logic
Network errors and temporary blocks happen. Retry logic with exponential backoff handles these gracefully without hammering the server.
# retry_logic.py
import requests
import time
from requests.exceptions import RequestException
def fetch_with_retry(url, max_retries=3, backoff_factor=2):
    """Fetch URL with exponential backoff retry."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    })
    for attempt in range(max_retries):
        try:
            response = session.get(url, timeout=10)
            if response.status_code == 200:
                return response
            elif response.status_code == 429:
                wait = backoff_factor ** attempt
                print(f"Rate limited. Waiting {wait}s before retry...")
                time.sleep(wait)
            else:
                print(f"Got status {response.status_code}")
                return response
        except RequestException as e:
            wait = backoff_factor ** attempt
            print(f"Error: {e}. Retrying in {wait}s...")
            time.sleep(wait)
    print("All retries exhausted")
    return None
result = fetch_with_retry("https://httpbin.org/get")
if result:
    print(f"Success: {result.status_code}")
Output:
Success: 200
Exponential backoff doubles the wait time with each retry (1s, 2s, 4s). This prevents your scraper from making the situation worse when a server is already under load. The timeout=10 parameter prevents hanging on unresponsive servers.
Session and Cookie Management
Using sessions properly is critical. Sessions maintain cookies across requests, which many websites require for normal browsing. Without proper session management, websites may treat each request as a new visitor and trigger anti-bot measures.
# session_management.py
import requests
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9",
    "Accept-Language": "en-US,en;q=0.9",
})
# First request sets cookies
response = session.get("https://httpbin.org/cookies/set/session_id/abc123")
print(f"Cookies after first request: {dict(session.cookies)}")
# Subsequent requests send cookies automatically
response = session.get("https://httpbin.org/cookies")
print(f"Server sees cookies: {response.json()['cookies']}")
Output:
Cookies after first request: {'session_id': 'abc123'}
Server sees cookies: {'session_id': 'abc123'}
Always use requests.Session() instead of bare requests.get() for multi-page scraping. It handles cookies, connection pooling, and header persistence automatically.
Real-Life Example: Building a Resilient Scraper
# resilient_scraper.py
import requests
import time
import random
from urllib.robotparser import RobotFileParser
from bs4 import BeautifulSoup
class ResilientScraper:
    def __init__(self, base_url, min_delay=1.0, max_delay=3.0):
        self.base_url = base_url
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "Accept": "text/html,application/xhtml+xml",
            "Accept-Language": "en-US,en;q=0.9",
        })
        self.robot_parser = RobotFileParser()
        self.robot_parser.set_url(f"{base_url}/robots.txt")
        self.robot_parser.read()

    def can_scrape(self, path):
        return self.robot_parser.can_fetch("*", path)

    def get_page(self, path, max_retries=3):
        if not self.can_scrape(path):
            print(f"Blocked by robots.txt: {path}")
            return None
        url = f"{self.base_url}{path}"
        for attempt in range(max_retries):
            delay = random.uniform(self.min_delay, self.max_delay)
            time.sleep(delay)
            try:
                response = self.session.get(url, timeout=10)
                if response.status_code == 200:
                    return BeautifulSoup(response.text, "html.parser")
                elif response.status_code == 429:
                    wait = 2 ** attempt
                    print(f"Rate limited on {path}. Waiting {wait}s...")
                    time.sleep(wait)
            except Exception as e:
                print(f"Error on {path}: {e}")
        return None
scraper = ResilientScraper("https://quotes.toscrape.com")
all_quotes = []
for page_num in range(1, 4):
    soup = scraper.get_page(f"/page/{page_num}/")
    if soup:
        quotes = soup.find_all("div", class_="quote")
        for q in quotes:
            text = q.find("span", class_="text")
            author = q.find("small", class_="author")
            if text and author:
                all_quotes.append({"text": text.text, "author": author.text})
        print(f"Page {page_num}: {len(quotes)} quotes")
print(f"Total scraped: {len(all_quotes)} quotes")
Output:
Page 1: 10 quotes
Page 2: 10 quotes
Page 3: 10 quotes
Total scraped: 30 quotes
This ResilientScraper class combines every technique from this tutorial: robots.txt checking, session management, random delays, retry logic with exponential backoff, and defensive parsing.
Frequently Asked Questions
Is it ethical to bypass anti-scraping measures?
Adding headers and managing timing are standard practices that mimic normal browser behavior. However, bypassing CAPTCHAs, breaking authentication, or ignoring robots.txt crosses ethical lines. Always respect the website’s terms of service and only scrape publicly available data.
Should I use proxy rotation?
Proxy rotation can help distribute requests across multiple IPs, but it should not be used to circumvent explicit blocking. If a website is actively blocking you, that is a signal to stop rather than escalate. Proxies are more appropriate for geographic content access.
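If you do have a legitimate reason to route traffic through a proxy, requests accepts a proxies mapping per request or per session. A minimal sketch — the proxy address here is a hypothetical placeholder, not a working endpoint:

```python
import requests

# Hypothetical proxy endpoint -- substitute your provider's actual address
proxies = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
}

session = requests.Session()
session.proxies.update(proxies)
# Every request on this session now routes through the configured proxy:
# response = session.get("https://httpbin.org/ip")  # would report the proxy's IP
print(session.proxies["https"])
```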
What does HTTP 429 mean?
HTTP 429 means “Too Many Requests.” The server is rate limiting you. The correct response is to slow down with exponential backoff, not to try harder. Check the Retry-After header if present — it tells you exactly how long to wait.
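A small helper for honoring that header might look like this. Note the sketch only handles the numeric-seconds form; Retry-After can also be an HTTP date, which this version ignores in favor of the default:

```python
def retry_after_seconds(headers, default=1.0):
    """Return how long to wait before retrying a 429 response.

    Uses the Retry-After header when it is a plain number of seconds;
    the header can also carry an HTTP date, which this sketch skips.
    """
    value = headers.get("Retry-After")
    if value and value.strip().isdigit():
        return float(value)
    return default

print(retry_after_seconds({"Retry-After": "30"}))  # 30.0
print(retry_after_seconds({}))                     # 1.0
```

You would call it with response.headers after receiving a 429, then sleep for the returned duration before retrying.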
Do I need to rotate User-Agent strings?
For most scraping projects, a single realistic User-Agent string is sufficient. Rotating User-Agents is only necessary for high-volume scraping where a single agent string might get flagged. The more important factor is request timing.
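If you do reach that high-volume scenario, rotation is simple to sketch: keep a small pool of realistic User-Agent strings and pick one per request. The strings below are illustrative examples, not an exhaustive or authoritative list:

```python
import random

# A small pool of realistic User-Agent strings (illustrative examples)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def random_headers():
    """Pick a User-Agent at random for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

headers = random_headers()
print(headers["User-Agent"])
```

Pass the result to requests.get(url, headers=random_headers()) rather than setting it once on a session, so each request can vary.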
Can I get in legal trouble for scraping?
Web scraping laws vary by jurisdiction. In the US, scraping publicly available data is generally legal, but violating a website’s terms of service could have legal consequences. In the EU, GDPR adds restrictions around personal data. When in doubt, consult a legal professional.
Conclusion
Handling anti-scraping measures responsibly is about being a good citizen of the web. We covered checking robots.txt, adding proper headers, managing request timing, implementing retry logic, session management, and building a resilient scraper class that combines all these techniques.
The ResilientScraper class gives you a production-ready foundation. Extend it with logging, database storage, or email alerts for long-running scraping projects.