Web scraping sites in Python can save you a lot of time, and it can be fairly straightforward when the site structure is consistent. But what do you do when the information you need is behind a login page, where you must enter a username and password first? This, too, can be automated fairly easily.
This article uses Selenium to perform the web scraping. It is a little slower than using the requests or urllib modules, but it has one big advantage: Selenium drives a real browser. So if the website server checks whether a request comes from a real person, there is less chance it will be rejected. If you use the requests module, for example, the normal headers a real browser sends (browser version, language, and so on) will not be there by default.
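For contrast, here is a minimal sketch of what the requests route looks like: browser-like headers have to be attached by hand (the header values below are only examples), and even then the traffic can still look less like a real browser than Selenium's does.

```python
import requests

# With requests, browser-like headers must be supplied manually
# (example values only); Selenium sends real ones automatically
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0',
    'Accept-Language': 'en-US,en;q=0.9',
}

response = requests.get('https://pythonhowtoprogram.com/', headers=headers)
print(response.status_code)
```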
Web scraping fundamentals with Python
You can read about the fundamentals of web scraping in our ‘web scraping with selenium’ article. Why Selenium? It runs an actual instance of a browser in the background, so to the receiving website server the traffic looks much more like a real person browsing. This helps to minimise blocks from robot checks.
A simple example of web scraping to start with
Let’s start with a quick scraping example to gather a list of blog titles from our own website: PythonHowToProgram.
In order to use Selenium (see the full guide on Python Selenium here), you need to simulate a browser, and to simulate a browser you need to install a browser driver.
Quick reminder… a web driver will be needed. If you already have one, keep in mind where it is stored; if you do not, download it based on the following table:
Browser | Supported OS | Maintained by | Download | Issue Tracker |
---|---|---|---|---|
Chromium/Chrome | Windows/macOS/Linux | Google | Downloads | Issues |
Firefox | Windows/macOS/Linux | Mozilla | Downloads | Issues |
Edge | Windows 10 | Microsoft | Downloads | Issues |
Internet Explorer | Windows | Selenium Project | Downloads | Issues |
Safari | macOS El Capitan and newer | Apple | Built-in | Issues |
Opera | Windows/macOS/Linux | Opera | Downloads | Issues |
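As an optional sanity check before writing any scraping code, you can confirm that Python can find the driver binary. This sketch assumes Firefox's geckodriver, which the examples below use; substitute the driver name for your browser:

```python
import shutil

# Look for the geckodriver binary on the system PATH
# (assumes Firefox; use 'chromedriver', 'msedgedriver', etc. for other browsers)
driver_path = shutil.which('geckodriver')

if driver_path:
    print(f'Driver found at: {driver_path}')
else:
    print('Driver not on PATH; download it or point Selenium at its location')
```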
Let’s start by creating a main.py file, which will contain all the code needed to list the titles of the Python blog posts.
Once you have the web driver saved on your computer, you are ready to code. As a quick recap of simple web scraping, the following code will list some blog titles from pythonhowtoprogram.com:
# main.py
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service

# Path to the geckodriver executable downloaded earlier
WEBDRIVER_PATH = './geckodriver'

# Web driver declaration (Selenium 4 style)
driver = webdriver.Firefox(service=Service(WEBDRIVER_PATH))

# Website to scrape
URL = 'https://pythonhowtoprogram.com/'

# Navigate to the website
driver.get(URL)

# Collect all article cards into a list
articles = driver.find_elements(By.CSS_SELECTOR, 'article.article-card')

# Iterate over the articles
for article in articles:
    # Get each article's header
    header = article.find_element(By.CSS_SELECTOR, 'header')
    # Get the title link inside the header
    title = header.find_element(By.CSS_SELECTOR, 'a')
    # Print the article title
    print(title.text)

# Close the web driver
driver.quit()
The above code will return a list of blog titles from pythonhowtoprogram.com. In this particular case, the web browser first opens a window and navigates to the website. The site's recent-posts section contains a list of titles; the code stores the matching elements in a list, then visits each one, collecting its title and printing it to the terminal.
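If you would rather the browser window not appear at all, Firefox can also be run headless. A minimal sketch, assuming Selenium 4 and geckodriver:

```python
from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# Ask Firefox to run without opening a visible window
options = webdriver.FirefoxOptions()
options.add_argument('-headless')

driver = webdriver.Firefox(service=Service('./geckodriver'), options=options)
driver.get('https://pythonhowtoprogram.com/')
print(driver.title)
driver.quit()
```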
Once you know how to perform some basic web scraping, you are ready to try websites with a login.
Logging into a website
Assume you are interested in knowing how many chess games you have played, and how many of them you have won and lost, on your account at www.chess.com. In order to achieve this, you can create a file named mychess.py which will contain the following code:
# mychess.py
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service

# Path to the geckodriver executable
WEBDRIVER_PATH = './geckodriver'

# Web driver declaration
driver = webdriver.Firefox(service=Service(WEBDRIVER_PATH))

# Create a payload with the credentials
payload = {'username': '[YOUR_USERNAME]',
           'password': '[YOUR_PASSWORD]'
           }

# Login page of the website
URL = 'https://www.chess.com/login'

# Navigate to the login page
driver.get(URL)

# Store the login page title so we can later check if login succeeded
login_title = driver.title

# Send the credentials and submit the form
driver.find_element(By.ID, 'username').send_keys(payload['username'])
driver.find_element(By.ID, 'password').send_keys(payload['password'])
driver.find_element(By.ID, 'login').click()

# Check login: if the title is unchanged, the login failed
if login_title == driver.title:
    print("Login failed")

# Close the web driver
driver.quit()
As can be seen, the first step is to go to the login page. Next, the username and password fields are located with find_element(By.ID, ...), filled in with the credentials, and the login button is clicked. After that, you can check whether the login succeeded by looking at the page title: if the title (driver.title) has changed after the login attempt, the login was successful.
If you see the "Login failed" message, it means the credentials used were not correct, so you can retry or close the application. In this version of the code, when login fails, execution falls through to driver.quit() and the application finishes.
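One caveat worth flagging: reading driver.title immediately after the click can race against the page navigation, because the new page may not have finished loading. A safer variant of the check (a sketch, not part of the original script) waits explicitly for the title to change:

```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

try:
    # Wait up to 10 seconds for the title to differ from the login page's title
    WebDriverWait(driver, 10).until(lambda d: d.title != login_title)
    print('Login appears successful')
except TimeoutException:
    print('Login failed (title never changed)')
```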
Otherwise, if the login was performed without any error, the title will have changed, meaning that you are now able to collect the data you are looking for. This is achieved with the following code:
# mychess.py
from selenium import webdriver
# Imports needed in order to wait for the content to be ready
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Path to the geckodriver executable
WEBDRIVER_PATH = './geckodriver'

# Web driver declaration
driver = webdriver.Firefox(service=Service(WEBDRIVER_PATH))

# Create a payload with the credentials
payload = {'username': '[YOUR_USERNAME]',
           'password': '[YOUR_PASSWORD]'
           }

# Login page of the website
LOGIN = 'https://www.chess.com/login'

# Navigate to the login page
driver.get(LOGIN)

# Store the login page title so we can later check if login succeeded
login_title = driver.title

# Send the credentials and submit the form
driver.find_element(By.ID, 'username').send_keys(payload['username'])
driver.find_element(By.ID, 'password').send_keys(payload['password'])
driver.find_element(By.ID, 'login').click()

# Check login: if the title is unchanged, the login failed
if login_title == driver.title:
    print("Login failed")
else:
    # Build and open the stats page for this account
    STATS = f"https://www.chess.com/stats/live/rapid/{payload['username']}"
    driver.get(STATS)
    # Wait (up to 10 seconds) for each element containing the information we need
    total = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.game-iconBlock:nth-child(2) > div:nth-child(2)")))
    won = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.tricolor-bar-header:nth-child(1) > span:nth-child(1) > div:nth-child(1)")))
    lost = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.tricolor-bar-header:nth-child(1) > span:nth-child(2) > div:nth-child(1)")))
    # Print all results
    print('Games: ' + total.text)
    print('Won: ' + won.text)
    print('Lost: ' + lost.text)

# Close the web driver
driver.quit()
Once you are logged in with your credentials, the session remains active and the driver displays the site's content, so you can navigate freely. The else branch above does exactly that: it builds the URL of the stats page for your account, navigates to it, and then uses WebDriverWait with expected_conditions so that each count (total games, won, lost) is read only once its element is actually present on the page, which guards against content that loads dynamically.
The data collected will be displayed in the console terminal; the exact numbers will of course vary based on the history of each account.
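A small aside, not part of the original script: rather than hardcoding credentials in the source file, it is usually safer to read them from environment variables. A minimal sketch (CHESS_USERNAME and CHESS_PASSWORD are example variable names):

```python
import os

# Read credentials from environment variables instead of hardcoding them;
# the variable names here are examples, set whichever names you prefer
payload = {'username': os.environ['CHESS_USERNAME'],
           'password': os.environ['CHESS_PASSWORD']}
```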
Conclusion
The power of web scraping with Selenium is nearly endless, and this wonderful tool provides many ways to solve such tasks. In order to extract data behind a login screen, you can take the above steps to simulate the login process. Once the login has been performed, the session stays active and you can run your Selenium extractions normally.
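If you scrape the same site repeatedly, one common trick is to save the session cookies after a successful login and restore them on the next run, skipping the login form entirely. A sketch, under the assumption that the site keeps its session in cookies (some sites invalidate them quickly):

```python
import json

# After a successful login: save the session cookies to disk
with open('cookies.json', 'w') as f:
    json.dump(driver.get_cookies(), f)

# On a later run: open the site first, then restore the cookies
driver.get('https://www.chess.com/')
with open('cookies.json') as f:
    for cookie in json.load(f):
        driver.add_cookie(cookie)
driver.refresh()  # reload so the restored session takes effect
```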
One thing to note is that the method to verify whether you have logged in will vary from site to site. In this example, we used the change in the document title to assess whether login was successful. For other sites, it may be the presence of a user icon, the disappearance of the sign in/sign up buttons, or a range of other indicators.
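For instance, a sketch of checking for a logged-in-only element instead of the title (the '.user-avatar' selector is purely hypothetical; inspect the target site to find a real one):

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

try:
    # '.user-avatar' is a hypothetical selector for an element that only
    # exists once the user is logged in
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '.user-avatar')))
    logged_in = True
except TimeoutException:
    logged_in = False
print('Logged in' if logged_in else 'Not logged in')
```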