Intermediate

Some websites load their content dynamically with JavaScript, which means a simple HTTP request with requests only gets you an empty shell. Playwright solves this by controlling a real browser — Chromium, Firefox, or WebKit — letting you scrape pages that rely on JavaScript rendering, single-page applications, and content loaded behind user interactions.

Microsoft’s Playwright library for Python provides a clean async and sync API for browser automation. It installs its own browser binaries, handles waits automatically, and runs headless by default. Combined with BeautifulSoup for HTML parsing, it gives you the power to scrape virtually any website.

In this tutorial, you will learn how to install Playwright, launch a browser, navigate to pages, wait for dynamic content, extract data from JavaScript-rendered pages, handle clicks and scrolling, and build a complete scraper for dynamic websites.

Scraping a Dynamic Page: Quick Example

Here is a minimal example that scrapes quotes from a page whose quote content is injected by JavaScript, so a plain HTTP request sees none of it.

# quick_playwright.py
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com/js/")
    page.wait_for_selector("div.quote")

    quotes = page.query_selector_all("div.quote span.text")
    for quote in quotes[:5]:
        print(quote.text_content())
    browser.close()

Output:

"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."
"It is our choices, Harry, that show what we truly are, far more than our abilities."
"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle."
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."
"Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring."

The key difference from a plain requests-based approach is that Playwright actually executes the page’s JavaScript before you extract data. The wait_for_selector() call ensures the dynamic content has loaded before scraping.

What Is Playwright and When Do You Need It?

Playwright is a browser automation library that controls real browsers programmatically. Unlike requests + BeautifulSoup, which only sees the static HTML the server returns, Playwright can handle any page a human can see in a browser, including single-page applications built with React, Vue, or Angular.

| Feature | requests + BeautifulSoup | Playwright |
| --- | --- | --- |
| JavaScript rendering | No | Yes |
| Speed | Very fast | Slower (browser overhead) |
| Memory usage | Low | Higher |
| Login/cookies | Manual | Automatic |
| Click/scroll | No | Yes |
| Best for | Static HTML pages | Dynamic/JS-heavy pages |

Use Playwright when the page you need to scrape loads content with JavaScript. If the page works with JavaScript disabled, stick with requests and BeautifulSoup — it will be much faster.
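A quick way to tell which tool you need is to check whether the element you care about is already present in the raw HTML a plain HTTP request returns. A minimal sketch; `selector_present` is an illustrative helper, and the canned snippet below stands in for a real `requests.get(url).text` call.

```python
# detect_js_rendering.py
from bs4 import BeautifulSoup

def selector_present(html: str, css_selector: str) -> bool:
    """True if the selector already matches in the unrendered HTML."""
    return BeautifulSoup(html, "html.parser").select_one(css_selector) is not None

# Stand-in for `requests.get("https://quotes.toscrape.com/js/").text`:
# the page shell arrives, but the quote data does not.
raw_html = "<html><body><div class='container'></div></body></html>"

if selector_present(raw_html, "div.quote"):
    print("Static HTML is enough: use requests + BeautifulSoup")
else:
    print("Content is JS-rendered: reach for Playwright")
```

If the selector does match in the raw response, the faster requests + BeautifulSoup stack will do.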

Installing Playwright for Python

Playwright requires two installation steps: the Python package and the browser binaries.

# install_playwright.sh
pip install playwright
playwright install chromium

Output:

Successfully installed playwright-1.42.0
Downloading Chromium 123.0.6312.4 - 140.2 Mb
Chromium downloaded to /home/user/.cache/ms-playwright/

The playwright install chromium command downloads a specific Chromium build. You can also install firefox or webkit if you need cross-browser testing.

Sync vs Async API

Playwright offers both synchronous and asynchronous APIs. The sync API is simpler for scripts and scraping. The async API is better for high-concurrency applications.

# sync_example.py
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com/js/")
    title = page.title()
    print(f"Page title: {title}")
    browser.close()

Output:

Page title: Quotes to Scrape

# async_example.py
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://quotes.toscrape.com/js/")
        title = await page.title()
        print(f"Page title: {title}")
        await browser.close()

asyncio.run(main())

Output:

Page title: Quotes to Scrape

For most scraping projects, the sync API is perfectly fine. Use async only when you need to scrape multiple pages concurrently.
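When concurrency is worth it, a single browser can render several pages at once. A sketch against the same quotes site; `page_urls`, `scrape_titles`, and the semaphore cap of 2 are illustrative choices, and the run line is left commented because it needs the Chromium binary from `playwright install chromium`.

```python
# async_concurrent.py
import asyncio

def page_urls(base: str, count: int) -> list[str]:
    """Build the paginated URLs to visit."""
    return [f"{base}page/{n}/" for n in range(1, count + 1)]

async def scrape_titles(urls, max_concurrency: int = 2):
    # Imported here so page_urls stays usable without Playwright installed.
    from playwright.async_api import async_playwright

    sem = asyncio.Semaphore(max_concurrency)  # cap pages open at once

    async def one(browser, url):
        async with sem:
            page = await browser.new_page()
            try:
                await page.goto(url)
                return await page.title()
            finally:
                await page.close()

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        try:
            return await asyncio.gather(*(one(browser, u) for u in urls))
        finally:
            await browser.close()

# To run:
# titles = asyncio.run(scrape_titles(page_urls("https://quotes.toscrape.com/js/", 3)))
```

The semaphore keeps memory bounded: without it, gather() would open every page at once.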

Waiting for Dynamic Content

The most common mistake with browser-based scraping is trying to extract data before it has loaded. Playwright provides several waiting strategies to handle this.

# waiting_strategies.py
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com/js/")

    # Wait for a specific element to appear
    page.wait_for_selector("div.quote", timeout=10000)

    # Wait for network to be idle (all AJAX calls done)
    page.wait_for_load_state("networkidle")

    quotes = page.query_selector_all("div.quote")
    print(f"Found {len(quotes)} quotes after waiting")
    browser.close()

Output:

Found 10 quotes after waiting

The wait_for_selector() method pauses execution until the element appears in the DOM. The timeout parameter (in milliseconds) prevents infinite waiting if the element never appears.
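When the element never appears, wait_for_selector() raises playwright.sync_api.TimeoutError. A common pattern is to catch it and retry or skip the page instead of crashing the whole run; `with_retries` below is an illustrative helper, not part of Playwright.

```python
# retry_on_timeout.py
def with_retries(fn, attempts: int = 3, swallow: tuple = (Exception,)):
    """Call fn until it succeeds or attempts run out; return None on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except swallow as exc:
            print(f"Attempt {attempt} failed: {exc}")
    return None

# Usage against a live Playwright `page`:
# from playwright.sync_api import TimeoutError as PWTimeout
#
# def load_quotes():
#     page.reload()
#     page.wait_for_selector("div.quote", timeout=5000)
#     return page.query_selector_all("div.quote")
#
# quotes = with_retries(load_quotes, attempts=3, swallow=(PWTimeout,)) or []
```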

Extracting Data from the Page

Playwright gives you two approaches for extracting data: using Playwright’s built-in selectors, or passing the rendered HTML to BeautifulSoup.

# extract_with_playwright.py
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com/js/")
    page.wait_for_selector("div.quote")

    # Method 1: Playwright selectors
    first_quote = page.query_selector("span.text")
    print("Playwright:", first_quote.text_content()[:60] + "...")

    # Method 2: Pass HTML to BeautifulSoup
    html = page.content()
    soup = BeautifulSoup(html, "html.parser")
    bs_quote = soup.select_one("span.text")
    print("BeautifulSoup:", bs_quote.text[:60] + "...")

    browser.close()

Output:

Playwright: "The world as we have created it is a process of our thinking...
BeautifulSoup: "The world as we have created it is a process of our thinking...

The BeautifulSoup approach is often better for complex parsing because BeautifulSoup has richer navigation methods. Use page.content() to get the fully rendered HTML and parse it with BeautifulSoup.

Real-Life Example: Scraping a JavaScript-Rendered Quote Page

# js_quote_scraper.py
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import json

def scrape_js_quotes():
    all_quotes = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        for page_num in range(1, 11):
            url = f"https://quotes.toscrape.com/js/page/{page_num}/"
            page.goto(url)
            page.wait_for_selector("div.quote", timeout=5000)

            html = page.content()
            soup = BeautifulSoup(html, "html.parser")

            for quote_div in soup.find_all("div", class_="quote"):
                text = quote_div.find("span", class_="text")
                author = quote_div.find("small", class_="author")
                tags_div = quote_div.find("div", class_="tags")
                tags = [t.text for t in tags_div.find_all("a")] if tags_div else []

                all_quotes.append({
                    "text": text.text if text else "Unknown",
                    "author": author.text if author else "Unknown",
                    "tags": tags
                })
            print(f"Page {page_num}: {len(soup.find_all('div', class_='quote'))} quotes")
        browser.close()
    return all_quotes

quotes = scrape_js_quotes()
print(f"Total: {len(quotes)} quotes from JS-rendered pages")
with open("js_quotes.json", "w", encoding="utf-8") as f:
    json.dump(quotes, f, indent=2, ensure_ascii=False)
print("Saved to js_quotes.json")

Output:

Page 1: 10 quotes
Page 2: 10 quotes
...
Page 10: 10 quotes
Total: 100 quotes from JS-rendered pages
Saved to js_quotes.json

This scraper combines Playwright for rendering with BeautifulSoup for parsing. Reusing the same browser instance across pages is significantly faster than launching a new browser for each page.

Frequently Asked Questions

What does headless mode mean?

Headless mode means the browser runs without a visible window. This is the default and preferred mode for scraping because it uses less memory and runs faster. Set headless=False during development to see what the browser is doing.

How does Playwright compare to Selenium?

Playwright is newer and generally faster than Selenium. It has better auto-waiting, built-in support for multiple browser contexts, and a cleaner API. Selenium has a larger community and longer track record. For new scraping projects, Playwright is the recommended choice.

Can websites detect Playwright?

Yes, some websites use bot detection that can identify automated browsers. Playwright provides stealth options, but sophisticated anti-bot systems can still detect automation. Always respect website terms of service and consider whether scraping a particular site is appropriate.

How can I make Playwright scraping faster?

Disable images and CSS loading with route interception to speed up page loads. Reuse browser instances across multiple pages. Use networkidle wait state only when necessary. For multiple pages, use async mode with concurrent page objects.
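The image/CSS tip uses Playwright's page.route() interception: register a handler for all URLs and abort requests you do not need. A sketch; the set of blocked resource types is a judgment call for text-only scraping.

```python
# block_heavy_resources.py
BLOCKED_TYPES = {"image", "stylesheet", "font", "media"}

def should_block(resource_type: str) -> bool:
    """Heavy resources that text scraping does not need."""
    return resource_type in BLOCKED_TYPES

def install_blocker(page):
    """Abort requests for blocked types; let everything else through."""
    page.route(
        "**/*",
        lambda route: route.abort()
        if should_block(route.request.resource_type)
        else route.continue_(),
    )

# Usage:
# page = browser.new_page()
# install_blocker(page)
# page.goto("https://quotes.toscrape.com/js/")  # renders without images/CSS
```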

Does Playwright use a lot of memory?

Yes. A single browser instance typically uses 50-200 MB of RAM. For large scraping jobs, process pages sequentially and close browser contexts when done. Monitor memory usage and restart the browser periodically for long-running scrapers.
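The periodic-restart advice can be sketched as a batching loop: launch a fresh browser per batch so its memory is reclaimed between batches. `batched` and the batch size of 20 are illustrative; running the scraper itself requires installed browser binaries.

```python
# batch_restart.py
def batched(items, size):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def scrape_titles_in_batches(urls, batch_size: int = 20):
    # Imported here so batched() stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    results = []
    with sync_playwright() as p:
        for batch in batched(urls, batch_size):
            browser = p.chromium.launch()    # fresh browser per batch
            context = browser.new_context()  # isolated cookies/cache
            page = context.new_page()
            for url in batch:
                page.goto(url)
                results.append(page.title())
            browser.close()                  # frees the batch's memory
    return results
```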

Conclusion

Playwright opens up scraping possibilities that static HTTP libraries cannot touch. We covered installation, sync and async APIs, waiting for dynamic content, extracting data with both Playwright selectors and BeautifulSoup, and building a complete scraper for JavaScript-rendered pages.

For the full Playwright Python documentation, visit playwright.dev/python.