Many web pages present rich data sets in table format. Extracting this data can be very useful for further analysis or for saving to your own database. In this article you will learn more about web scraping and how to extract tables from a webpage using Python 3.
Understanding HTML Tables
Every web page shows table data in its own way, but tables are typically built with the <table> structure, which has tags for the header, the rows and the individual cells.
The tags a table will typically contain are: <table>, which wraps the whole table; <tr>, which marks a row; <th>, which marks a header cell; and <td>, which marks a data cell.
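The example table from the original page is not reproduced here, so the snippet below is a minimal illustrative reconstruction; the column names and values are placeholders, but the tag layout and the id="example1" attribute match what the rest of the article relies on.

<table id="example1">
  <tr>
    <th>Name</th>
    <th>Score</th>
  </tr>
  <tr>
    <td>Alpha</td>
    <td>10</td>
  </tr>
  <tr>
    <td>Beta</td>
    <td>20</td>
  </tr>
</table>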
So you can see how the <tr> depicts the rows of the table. The header cells are structured via <th> while the data cells are captured via <td>. This is a simple table, but it is typical of many pages. One thing to note is the "id" attribute on the <table> tag, which will be used to find the relevant tag later on. This attribute is not always available, and if it isn't you'll need to loop through the tables to find the correct one on the page.
One variant of this is that some sites also wrap the header row in a <thead> tag and/or the data rows in a <tbody> tag. In that case the table structure will be as follows:
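Again as an illustrative sketch (same placeholder data as above), the structure with <thead> and <tbody> looks like this:

<table id="example1">
  <thead>
    <tr>
      <th>Name</th>
      <th>Score</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Alpha</td>
      <td>10</td>
    </tr>
    <tr>
      <td>Beta</td>
      <td>20</td>
    </tr>
  </tbody>
</table>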
The visual result is the same for the user, but it is something we have to be mindful of when scraping this type of data.
Extracting Table Data with Python – A Simple Example First
For this first example, we will use the table data from the example above to test extracting data.
Before collecting the data, you can first set up an initial script to make sure you can extract data from a website using urllib3. In case you have not installed the BeautifulSoup library, you can install it using pip or pip3:
pip3 install beautifulsoup4
# table_data.py
import urllib3
from bs4 import BeautifulSoup
#create instance to extract from website
http = urllib3.PoolManager()
# Declare a variable containing the URL that is going to be scraped
URL = 'https://pythonhowtoprogram.com/how-to-extract-table-data-from-webpages-with-example-code-python-3/'
#Extract the data from the website
r = http.request('GET', URL)
#create a map of the site into beautiful soup framework
soup = BeautifulSoup(r.data, 'html.parser')
print("[1]*******")
[ print( item.prettify() ) for item in soup.find_all('table') ]
Output: the prettified HTML of every table found on the page is printed.
In the above code, we first get the URL data from the website using the http.request call. From there, we can start to manipulate the data by accessing the field r.data; however, this is raw HTML where you would have to parse all the tags yourself. You can instead use BeautifulSoup, which helps you easily parse all the data. You first need to create an instance of BeautifulSoup as follows:
soup = BeautifulSoup(r.data, 'html.parser')
Next, we print out the table data using the following:
[ print( item.prettify() ) for item in soup.find_all('table') ]
This is quite a compact line, but there's quite a bit going on in it which I'll break down further. If you're not familiar with this syntax, it is in fact using a Python feature called list comprehension (you can see a full example in our "list comprehension example" article). The above is in fact equivalent to the following:
table_data = soup.find_all('table')
for item in table_data:
    print( item.prettify() )
The [ ] brackets allow the for loop to run within a single line, where the code before the "for" runs for each iteration of the loop.
Now, the above code extracts all the tables. What if we just want to focus on the table above, which has an "id" attribute of "example1" – in other words, it starts with <table id="example1">? We can use that to target just that table.
Extract a Specific Table Using its id Attribute
We can use similar code to the above, but instead filter on id="example1" by using the find_all() function. This returns a list, so as before we use a list comprehension to print out all the items.
# table_data.py
import urllib3
from bs4 import BeautifulSoup
#create instance to extract from website
http = urllib3.PoolManager()
# Declare a variable containing the URL that is going to be scraped
URL = 'https://pythonhowtoprogram.com/how-to-extract-table-data-from-webpages-with-example-code-python-3/'
#Extract the data from the website
r = http.request('GET', URL)
#create a map of the site into beautiful soup framework
soup = BeautifulSoup(r.data, 'html.parser')
print("[2]*******")
[ print( item.prettify() ) for item in soup.find_all(id="example1") ]
Output:
As you can see, just the table we highlighted above is printed. This is incredibly handy when you know exactly which table to extract.
Note that all the tags are still printed as well – this is what prettify() produces. Suppose you just want the data listed out into a CSV file; this can also be done fairly easily.
Extracting Table Data From a Website and Storing it in a CSV File
To extract data from a table into a CSV file, you can use the above approach with Beautiful Soup with minimal extra code. This can be very useful if you're trying to capture historical data from a site that has daily updates, so that you can do some analysis over time.
The following code shows an example of how it can be done from this site:
# table_data.py
import urllib3
from bs4 import BeautifulSoup
from datetime import datetime
#create instance to extract from website
http = urllib3.PoolManager()
# Declare a variable containing the URL that is going to be scraped
URL = 'https://pythonhowtoprogram.com/how-to-extract-table-data-from-webpages-with-example-code-python-3/'
#Extract the data from the website
r = http.request('GET', URL)
#create a map of the site into beautiful soup framework
soup = BeautifulSoup(r.data, 'html.parser')
filename = 'data_file_' + datetime.today().strftime('%Y_%m_%d') + ".txt"
with open(filename, 'w') as f:
    # First print out the column headings
    header = ','.join( [ item.get_text() for item in soup.select("table[id='example1'] > thead > tr > th") ] )
    print(header)
    print(header, file=f)
    # Next print all the data rows
    for item in soup.select("table[id='example1'] > tbody > tr"):
        row_data = ','.join( [ item.get_text() for item in item.select("td") ] )
        print(row_data)
        print(row_data, file=f)
Output:
In the above output, you can see the data from the table at the top of this post being printed in CSV format, and a file called "data_file_2021_05_10.txt" being created, which has the same contents when displayed with the "cat" command.
Code explained:
The code above has a few elements to consider.
We first create a dynamic filename with today's date (note that the datetime library was imported at the top). The reason for the dynamic filename is that you can then run the script on a daily basis to get a daily snapshot of a given website.
filename = 'data_file_' + datetime.today().strftime('%Y_%m_%d') + ".txt"
with open(filename, 'w') as f:
Next, we get all the column headers using the following code and print them as a comma-separated line:
#First print out the column headings
header = ','.join( [ item.get_text() for item in soup.select("table[id='example1'] > thead > tr > th") ] )
print(header)
print(header, file=f)
There’s again, quite a bit going on. The above is in fact equivelant to the following:
#First print out the column headings
column_item_list = soup.select("table[id='example1'] > thead > tr > th")
col_list = []
for item in column_item_list:
    col_list.append( item.get_text() )
header = ','.join( col_list )
print(header)
print(header, file=f)
The first thing to note is soup.select("table[id='example1'] > thead > tr > th"). This tells BeautifulSoup to find the table which has an attribute called "id" equal to "example1", then find its child tag thead, which in turn has a child tr and then the th cells. This notation is called "CSS selector" notation and can be very handy for finding tags in an intuitive way.
If, for example, the thead is not there, you can simply remove that part of the selector. So the selector above is searching for tags like the following:
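The mini example from the original page isn't reproduced here, but based on the "ABC"/"XYZ" values mentioned below it likely looked something along these lines:

<table id="example1">
  <thead>
    <tr>
      <th>ABC</th>
      <th>XYZ</th>
    </tr>
  </thead>
</table>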
So in this mini example, it will capture “ABC” and “XYZ”
The other line you may not be familiar with is header = ','.join( col_list ), which takes col_list and joins all its elements into a single string with a "," as the separator.
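For instance, a quick illustration of join with made-up column names:

col_list = ['Name', 'Score', 'Rank']
header = ','.join(col_list)   # header is now the string 'Name,Score,Rank'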
You’ll also notice that there are two print statements? This is to simply print to the screen and also the file which we opened earlier.
print(header)
print(header, file=f)
The final portion to explain is the following snippet, which you can probably guess already:
#Next print all the data rows
for item in soup.select("table[id='example1'] > tbody > tr"):
    row_data = ','.join( [ item.get_text() for item in item.select("td") ] )
    print(row_data)
    print(row_data, file=f)
This loops through all the rows in the table using the for item in soup.select("table[id='example1'] > tbody > tr") statement, and then, within each row, concatenates all the cells of that row using the following:
row_data = ','.join( [ item.get_text() for item in item.select("td") ] )
This is exactly like the above list comprehension scenario where the cell contents are concatenated with a comma in between.
Extracting Currency Table Data From a Website into a CSV – Final Example
To finish off with a final example, let's look at extracting currency data – something that changes on a daily basis. This will be slightly more complex, but the principles are the same.
On that page there is a table of currency data available. The first thing to do is to make sure the data is actually present in the HTML file, by right clicking on the page and clicking on "View source".
You can then search the source to find the table (the quickest way is to search for one of the currency values).
Why is this necessary? Some sites use JavaScript to load the data – if that's the case, you cannot use urllib3 to extract the data, since it just fetches an HTML file and does not execute any JavaScript. You would instead need to simulate a browser so that the JavaScript can be executed. You can use something like Selenium (see our Selenium article) to do the job. This means that the r = http.request('GET', URL) line would need to change to the Selenium code, but everything else stays the same (email me if you'd like more details).
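As a rough, hedged sketch of that swap (this is not code from the original article, and it assumes Selenium and a Firefox driver are installed):

# hypothetical replacement for the urllib3 request when the page needs JavaScript
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()   # assumes geckodriver is available on your PATH
driver.get(URL)                # URL defined as in the earlier examples
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
# from here on, the soup.select(...) code is unchanged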
So now that the table data is confirmed to be in the page source, you can identify a few of the key pieces: the table has the class "table table-condensed", the header text sits in <span> tags with an aria-hidden attribute under the header cells, and there are similar tags for the data rows. Hence, the code would be as follows:
# table_data.py
import urllib3
from bs4 import BeautifulSoup
from datetime import datetime
#create instance to extract from website
http = urllib3.PoolManager()
# Declare a variable containing the URL that is going to be scraped
URL = 'https://wise.com/gb/currency-converter/currencies/usd-us-dollar'
#Extract the data from the website
r = http.request('GET', URL)
#create a map of the site into beautiful soup framework
soup = BeautifulSoup(r.data, 'html.parser')
filename = 'data_file_' + datetime.today().strftime('%Y_%m_%d') + ".txt"
with open(filename, 'w') as f:
    # First print out the column headings
    header = ','.join( [ item.get_text() for item in soup.select("table[class='table table-condensed'] > thead > tr > th > a > span[aria-hidden]") ] )
    print(header)
    print(header, file=f)
    # Next print all the data rows
    for item in soup.select("table[class='table table-condensed'] > tbody > tr"):
        row_data = ','.join( [ item.get_text() for item in item.select("td > a") ] )
        print(row_data)
        print(row_data, file=f)
The above code is similar to the earlier example in that it saves the data into a file and also prints it to the screen.
Output:
The key part in the code is the selector, which you've seen before. This piece of code finds the content of the field where there's a <span> tag with an attribute called "aria-hidden", under a specific series of parent tags.
header = ','.join( [ item.get_text() for item in soup.select("table[class='table table-condensed'] > thead > tr > th > a > span[aria-hidden]") ] )
This selector maps directly onto that chain of HTML tags in the page source.
Conclusion
There are several packages available for scraping web pages. In this article we used the urllib3 package to get data from websites, but there are other packages available as well. If you want to extract from websites that load content with JavaScript, you can also use the Selenium package instead: "How To Scrape Javascript Websites With Selenium Using Python 3".
With the above examples, you have the tools and skills to extract data from sites with a short snippet of code and to output the results to files. You can also schedule the script with a crontab on Linux or the Task Scheduler on Windows to collect historical data.
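For example, on Linux a crontab entry along the following lines (the paths are placeholders, not from the original article) would run the script every day at 9am:

0 9 * * * /usr/bin/python3 /path/to/table_data.py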
Web scraping sites in Python can save you a lot of time and can be fairly straightforward when the site structure is consistent. However, what do you do when the information you need to scrape is behind a login, where you need to enter a username and password first? This too can be automated fairly easily.
This article uses Selenium to perform the web scraping. It is a little slower than using the requests or urllib modules, but there is one big advantage: Selenium drives a real browser. So if the website server is checking whether the request comes from a real person or not, there is less chance that it will be rejected when you use Selenium. If you use the requests module, for example, the normal headers sent when a real user browses (e.g. browser version, language, etc.) will not be there.
Web scraping fundamentals with python
You can read about the fundamentals of web scraping in our 'web scraping with selenium' article. Why Selenium? It runs an actual instance of a browser in the background, so to the receiving website server it looks much more like a real person. This helps to minimise blocks from robot checks.
A simple example of web scraping to start with
Let’s start with a quick scraping example to gather a list of blog titles from our own website: PythonHowToProgram.
In order to use Selenium (see the full guide on Python Selenium here), you need to simulate a browser, and to simulate a browser you need to install a browser driver.
Quick reminder… a web driver is needed. If you already have one, keep in mind where it is stored; if you do not, download the driver for your browser (for example, geckodriver for Firefox or chromedriver for Chrome).
Let’s start creating a main.py file which will contain all the code needed to list titles of the python blogs.
Once you have the web driver saved on your computer you are ready to code. As a quick recap on how to do a simple web scraping, the following code will list some blog titles from pythonhowtoprogram.com
# main.py
from selenium import webdriver
# Web driver path
WEBDRIVER_PATH = './'
# Web driver declaration
driver = webdriver.Firefox(WEBDRIVER_PATH)
# Website to scrape
URL = 'https://pythonhowtoprogram.com/'
# Web driver navigating to the website
driver.get(URL)
# Getting all articles into a list
articles = driver.find_elements_by_css_selector('article.article-card')
# Iterating through the articles
for article in articles:
    # Getting each article's header
    header = article.find_element_by_css_selector('header')
    # Getting the article title
    title = header.find_element_by_css_selector('a')
    # Printing each article title
    print(title.text)
# close the webdriver
driver.quit()
The above code returns a list of blog titles from pythonhowtoprogram.com. In this particular case, the web browser first opens a window and navigates to the website. The site has a recent posts section containing a list of titles; the code stores these in a list, then goes through each post collecting the title and printing it to the terminal.
Once you know how to perform some basic web scraping, you are ready to try websites with a login.
Login into a website
Assume you want to know how many chess games you have played, and how many were won and lost, on your account at www.chess.com. To achieve this, create a file named mychess.py containing the following code:
# mychess.py
from selenium import webdriver
# Web driver path
WEBDRIVER_PATH = './'
# Web driver declaration
driver = webdriver.Firefox(WEBDRIVER_PATH)
# Create a payload with the credentials
payload = {'username': '[YOUR_USERNAME]',
           'password': '[YOUR_PASSWORD]'
          }
# Website with login
URL = 'https://www.chess.com/login'
# Web driver going into website
driver.get(URL)
# Create a variable so you are able to check if login was successful
login_title = driver.title
# Sending credentials
driver.find_element_by_id('username').send_keys(payload['username'])
driver.find_element_by_id('password').send_keys(payload['password'])
driver.find_element_by_id('login').click()
# Check login
if login_title == driver.title:
    print("Login failed")
As can be seen, the first step is to go to the login page. Next, the username and password fields are located with the find_element_by_id() function and filled in. At this point you can check whether the login was successful by looking at the page title (driver.title): if the title has changed after submitting the form, the login succeeded.
If you see the failure message, it means that the credentials used are not correct, so you can retry or close the application. In this case, if the login fails the code skips straight to quitting the driver and finishes the execution of the application.
Otherwise, if the login went through without any error, the title will have changed, meaning you can now collect the data you are looking for. This is achieved with the following code:
# mychess.py
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Web driver path
WEBDRIVER_PATH = './'
# Web driver declaration
driver = webdriver.Firefox(WEBDRIVER_PATH)
# Create a payload with the credentials
payload = {'username': '[YOUR_USERNAME]',
           'password': '[YOUR_PASSWORD]'
          }
# Website with login
LOGIN = 'https://www.chess.com/login'
# Web driver going into website
driver.get(LOGIN)
# create a variable so you are able to check if login was successful
login_title = driver.title
# Sending credentials
driver.find_element_by_id('username').send_keys(payload['username'])
driver.find_element_by_id('password').send_keys(payload['password'])
driver.find_element_by_id('login').click()
# Check login
if login_title == driver.title:
    print("Login failed")
else:
    STATS = f"https://www.chess.com/stats/live/rapid/{payload['username']}"
    driver.get(STATS)
    total = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.game-iconBlock:nth-child(2) > div:nth-child(2)")))
    won = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.tricolor-bar-header:nth-child(1) > span:nth-child(1) > div:nth-child(1)")))
    lost = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.tricolor-bar-header:nth-child(1) > span:nth-child(2) > div:nth-child(1)")))
    print('Games: ' + total.text)
    print('Won: ' + won.text)
    print('Lost: ' + lost.text)
# close the webdriver
driver.quit()
Once you are logged in with your credentials, the web driver will display the site content and you are free to navigate through it. So let's navigate and get a count of matches played, and of how many were lost and won.
# mychess.py
from selenium import webdriver
# New imports needed, in order to wait for the content to be ready
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Web driver path
WEBDRIVER_PATH = './'
# Web driver declaration
driver = webdriver.Firefox(WEBDRIVER_PATH)
# Create a payload with the credentials
payload = {'username': '[YOUR_USERNAME]',
           'password': '[YOUR_PASSWORD]'
          }
# Website with login
LOGIN = 'https://www.chess.com/login'
# Web driver going into website
driver.get(LOGIN)
# Sending credentials
driver.find_element_by_id('username').send_keys(payload['username'])
driver.find_element_by_id('password').send_keys(payload['password'])
driver.find_element_by_id('login').click()
# Declare the new page
STATS = f"https://www.chess.com/stats/live/rapid/{payload['username']}"
# Navigate into the page
driver.get(STATS)
# Search for each element containing the information needed
total = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.game-iconBlock:nth-child(2) > div:nth-child(2)")))
won = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.tricolor-bar-header:nth-child(1) > span:nth-child(1) > div:nth-child(1)")))
lost = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.tricolor-bar-header:nth-child(1) > span:nth-child(2) > div:nth-child(1)")))
# Print all results
print('Games: ' + total.text)
print('Won: ' + won.text)
print('Lost: ' + lost.text)
# close the webdriver
driver.quit()
The data collected will be displayed at the terminal; the values will of course vary based on the history of each account.
Conclusion
The power of web scraping with Selenium is nearly endless, and this wonderful tool provides many ways to solve your tasks. To extract data behind a login screen, you can take the above steps to simulate the login process. Once the login has been performed, the session remains active and you can run Selenium extractions as normal.
One thing to note is that the method of verifying whether you have logged in will vary from site to site. In this example, we used the change in the document title to assess if login was successful. For other sites, it may be the presence of a user icon, the absence of sign in/sign up buttons, or a range of other indicators.
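As a hedged illustration of the element-presence approach (the .user-avatar selector below is a made-up placeholder, not a real selector from any particular site):

# hypothetical alternative login check: look for an element that only appears when logged in
from selenium.common.exceptions import NoSuchElementException
try:
    driver.find_element_by_css_selector('.user-avatar')  # placeholder selector
    print("Login successful")
except NoSuchElementException:
    print("Login failed")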
Web scraping is a very useful mechanism for extracting data from, or automating actions on, websites. Normally we would use urllib or requests to do this, but things start to fail when websites use JavaScript to render the page rather than static HTML. For many websites the information is stored in static HTML, but for others it is loaded dynamically through JavaScript (e.g. from AJAX calls). The reason may be that the information is constantly changing, or it may be to prevent web scraping! Either way, you need more advanced techniques to scrape the information – this is where the Selenium library can help.
What is web scraping?
To align on terms: web scraping, also known as web harvesting or web data extraction, is data scraping used for extracting data from websites. A web scraping script may access the URL directly using HTTP requests or through simulating a web browser. The second approach is exactly how Selenium works – it drives a web browser. The big advantage of simulating the browsing is that the website is fully rendered, whether it uses JavaScript or static HTML.
What is selenium?
According to the official Selenium web page, it is a suite of tools for automating web browsers, and the project is a member of the Software Freedom Conservancy. Selenium has three sub-projects, each providing different functionality; if you are interested, visit their official website. The scope of this post is limited to the Selenium WebDriver project.
When should you use selenium?
Selenium gives us the tools to perform web scraping, but when should it be used? You would generally use Selenium in the following scenarios:
When the data is loaded dynamically – for example Twitter. What you see in “view source” is different to what you see on the page (The reason is that “view source” just shows the static HTML files. If you want to see under the covers of a dynamic website, right click and “inspect element” instead)
When you need to perform an interactive action in order to display the data on screen – a classic example is infinite scrolling. For some websites, you need to scroll to the bottom of the page, and then more entries will show. What happens behind the scene is that when you scroll to the bottom, javascript code will call the server to load more records on screen.
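As a hedged sketch of how that scrolling interaction can be triggered once a driver is running (driver here is assumed to be an existing Selenium WebDriver instance):

import time

# scroll to the bottom of the page so the site's JavaScript loads more entries
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
# give the newly requested records a moment to load
time.sleep(2)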
So why not use Selenium all the time? It is a bit slower than using requests and urllib, because Selenium runs a full browser, including the overhead that brings with it. There are also a few extra setup steps required to use Selenium, as you can see below.
Once you have the data extracted, you can still use similar approaches to process the data (e.g. using tools such as BeautifulSoup)
Pre-requisites for using selenium
Step 1: Install selenium library
Before starting with a web scraping sample, ensure that all requirements have been met. Selenium requires pip or pip3 to be installed; if you don't have it, you can follow the official guide to install it for your operating system.
Once pip is installed you can proceed with the installation of Selenium using the following command:
pip install selenium
Alternatively, you can download the PyPI source archive (selenium-x.x.x.tar.gz) and install it using setup.py:
python setup.py install
Step 2: Install web driver
Selenium drives an actual browser. It won't reuse your everyday Chrome or Firefox installation directly; instead it needs a "driver", a small executable that controls the browser engine. Selenium supports multiple web browsers, so you may choose which browser to use (read on).
Selenium WebDriver refers to both the language bindings and the implementations of the individual browser controlling code. This is commonly referred to as just a web driver.
The web driver needs to be downloaded, and then it can either be added to the PATH environment variable or referenced with a string containing the path to where it was downloaded. Environment variables are out of the scope of this post, so we are going to use the second option.
From here to the end, the Firefox web driver (geckodriver) is going to be used, but you can choose any supported browser's driver; Firefox is recommended if you want to follow along with this post.
Ok, we’re all set. To begin with, let’s start with a quick staring example to ensure things are all working. Our first example will involving collecting a website title. In order to achieve this goal, we are going to use selenium, assuming it is already installed in your environment, just import webdriver from selenium in a python file as it’s shown in the following.
Running the code below will open a Firefox window, which looks a little different from a regular one, and then print the title of the website to the console – in this case it collects the title from Google.
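The original snippet isn't reproduced here, so the following is a minimal sketch of what it likely looked like (the filename, the google.com URL and the driver path are assumptions consistent with the rest of the post):

# first_example.py (assumed filename)
from selenium import webdriver

# Web driver path
WEBDRIVER_PATH = './'
# Start a Firefox window controlled by Selenium
driver = webdriver.Firefox(WEBDRIVER_PATH)
# Navigate to the site and print its title
driver.get('https://www.google.com')
print(driver.title)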
Note that this was run in the foreground so that you can see what is happening: the window was intentionally left open so you can see that the web driver navigates just like a human would, and you have to close the Firefox window manually. Now that we know that, we can add driver.quit() at the end of the script so the window closes automatically after the job is done. The code now looks like this:
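Again as an assumed sketch, the same example with the cleanup added:

# first_example.py (assumed filename)
from selenium import webdriver

WEBDRIVER_PATH = './'
driver = webdriver.Firefox(WEBDRIVER_PATH)
driver.get('https://www.google.com')
print(driver.title)
# close the browser window automatically once the job is done
driver.quit()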
Now the sample will open the Firefox web driver, do its job and then close the window. With this simple example working, we are ready to go deeper and learn with a more complex sample.
How To Run Selenium in the Background
In case you are running your environment in a console only (for example through PuTTY or another terminal), you may not have access to a GUI. Also, in an automated environment, you will certainly want to run Selenium without the browser window popping up, i.e. in silent or headless mode. This is done by adding an Options object with the --headless argument at the start of the script:
# import web driver's Options
from selenium.webdriver.firefox.options import Options
# initialize the options
firefox_options = Options()
# add the argument headless
firefox_options.add_argument('--headless')
# create the driver, setting the options defined before
WEBDRIVER_PATH = './'
driver = webdriver.Firefox(WEBDRIVER_PATH, options=firefox_options)
The remaining examples will be run with the browser visible so that you can see what is happening, but you can add the above snippet whenever you need headless mode.
Example of Scraping a Dynamic Website in Python With Selenium
Up to here, we have figured out how to scrape data from a static website; with a little time and patience you can now collect data from static pages. Let's now dive a little deeper into the topic and build a script to extract data from a webpage that is dynamically loaded.
Imagine you were asked to collect a list of YouTube videos about "Selenium". We know we are going to gather data from YouTube and that we need the search results for "Selenium", but those results are dynamic and will change all the time.
The first approach is to replicate what we did with Google, but now with YouTube, so a new file needs to be created: yt-scraper.py.
With that we can retrieve and print the YouTube page title, but now we are about to add some magic to the code. Our next step is to find the search box and fill it with the word we are looking for, "Selenium", by simulating a person typing it into the search. This is done using the Keys class:
from selenium.webdriver.common.keys import Keys
The driver.quit() line is commented out temporarily so we are able to see what is happening:
# yt-scraper.py
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

WEBDRIVER_PATH = './'
# initialize the firefox web driver
driver = webdriver.Firefox(WEBDRIVER_PATH)
URL = 'https://www.youtube.com'
driver.get(URL)
print (driver.title)
# create an object which contains the search box, located with an xpath
search_box = driver.find_element_by_xpath('//input[@id="search"]')
# edit the content of the search box, filling it with "Selenium"
search_box.send_keys('Selenium')
# once the search box has content, we can press "Enter" to trigger the search
search_box.send_keys(Keys.ENTER)
#driver.quit()
The YouTube page shows a list of videos from the search as expected!
As you might notice, a new function has been called, named find_element_by_xpath, which could be a bit confusing at the moment because it uses a strange-looking XPath string. Let's learn a little about XPath to understand it better.
What is XPath?
XPath is a syntax for navigating through the structure of an XML or HTML document and finding the location of any element on a webpage via its DOM structure.
In the example above we had '//input[@id="search"]'. This finds all <input> elements which have an attribute called "id" whose value is "search". If you use "inspect element" on the search box on YouTube, you can see there's a tag <input id="search" ... >. That's exactly the element we're targeting with the XPath expression.
There is a great variety of ways to find elements within a page – by id, name, class name, tag name, link text, partial link text, CSS selector or XPath – and it is worth reading the full list of locator strategies if you want to master web scraping.
Looping Through Elements with Selenium
Now that XPath has been explained, we can move on to the next step: listing the videos. So far we have code that opens https://youtube.com, types the word "Selenium" into the search box and hits the Enter key so the search is performed by the YouTube engine, resulting in a list of videos related to Selenium. Let's now list them.
Firstly, right click and "inspect element" on the video section and find the element which marks the start of each video entry: it's a <div> tag with id="dismissable".
We want to grab the title, so within the video element, find the tag that covers the title. Again, right click on the title and "inspect element" – you will find an element with id="video-title", and within that tag you can see the text of the title.
One last thing: remember that we are working over the internet, so sometimes we need to wait for the data to become available. In this case we wait 5 seconds after the search is performed before retrieving the data we are looking for. Keep in mind that the results could vary due to internet speed and device performance.
# yt-scraper.py
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

WEBDRIVER_PATH = './'
driver = webdriver.Firefox(WEBDRIVER_PATH)
URL = 'https://www.youtube.com'
driver.get(URL)
print (driver.title)
search_box = driver.find_element_by_xpath('//input[@id="search"]')
search_box.send_keys('Selenium')
search_box.send_keys(Keys.ENTER)
# waiting for the data to be loaded
time.sleep(5)
# collect the HTML tags which contain each video
videos = driver.find_elements_by_xpath('//*[@id="dismissable"]')
# print how many videos were collected
print (len(videos))
# iterate through all videos
for video in videos:
    # collect each video title
    # note that find_element_by_xpath is called on the video element, not the driver
    title = video.find_element_by_xpath('.//*[@id="video-title"]')
    # print the title collected
    print (title.text)
# close the webdriver
driver.quit()
Once the code is executed you will see a list printed containing the videos collected from YouTube: first the website title, then how many videos were collected and finally the list of video titles.
Waiting for 5 seconds works, but then you have to adjust the delay for each internet speed. There's another mechanism you can use, which is to wait for the actual element to be loaded – you can do this with a try/except block instead.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

def selenium_wait_for_class(browser, classid, waitperiod=5):
    try:
        # Wait for an element with the given id to be present
        WebDriverWait(browser, waitperiod).until(EC.presence_of_element_located((By.XPATH, "//*[@id='" + classid + "']")))
    except TimeoutException:
        return False  # if we hit a timeout then it failed to find it
    return True
So instead of the time.sleep(5), you can then replace the code with:
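A minimal sketch of how that replacement could look (using the 'dismissable' id that the videos were located by above):

# wait up to 5 seconds for the video elements to appear instead of sleeping blindly
if not selenium_wait_for_class(driver, 'dismissable', 5):
    print("Timed out waiting for the video list")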
This will wait up to a maximum of 5 seconds for the videos to load, otherwise it will time out.
Conclusion
With Selenium you can perform an endless variety of tasks, from automation to automated testing – the sky is the limit. You have learned how to scrape data from static and dynamic websites and how to perform JavaScript-driven actions such as sending keys like Enter. You can also look at BeautifulSoup to extract and search the data next.
Testing is an extremely important part of having good, reliable Python code. However, the challenge is that it can be very tiring to manually test your code again and again, and it is also very easy to forget things to test. This is where automated testing can help.
Automated testing is where you write code, independent of your main program, to test your code in a repeatable fashion. You can also run the tests automatically each time your code starts or when it gets deployed to production. Let's learn how to perform automated testing with pytest in Python 3.
What is pytest?
Pytest is a mature full-featured Python testing tool that helps you write better programs. The pytest framework makes it easy to write small tests, yet scales to support complex functional testing for applications and libraries.
With pytest, you create a separate python script with a series of small functions that test your code where each function (called a unit test) is one test case. When this testing script is executed, you can get a report to see which tests succeeded and failed. It’s as simple as that, but can help improve the quality of your code immensely.
Installing pytest
Installing pytest is as simple as running the following command in your command line; just ensure that Python is installed:
pip install -U pytest
Then you can execute the following command to check that it has been installed successfully
pytest --version
Running a simple unit test
Let’s imagine that an application has an adding method, and the developer wants to apply automated unit testing to the method, e.g. ensure that 3+5 = 8. To achieve this, the developer needs a new function that will test the addition function and its results. This testing function is in charge of assert that another function is working and performing a specific task as expected
# addition_test.py
def addition(x, y):
    return x + y

def test_addition():
    assert addition(3, 5) == 8
The assert statement checks whether the condition passed to it is true.
Note that the file containing this code is named addition_test.py and the testing function is named test_addition(). This is very important to keep in mind, because pytest will only execute unit tests within files that match the patterns test_*.py or *_test.py, and the same naming rules apply to the testing functions themselves.
To trigger the automated testing, pytest has to be executed: just open a terminal at the same path as addition_test.py and run pytest. It will find the file automatically.
As the run shows, the test suite passes with a 100% score, meaning that all our tests (in this case, just one) have passed. This means the addition() function is working as intended, but let's change that function a little to see what a failed test looks like.
Assume that somebody has made some changes to the code, but by mistake introduced a little bug, where the addition() function now multiplies the numbers. The code would look something like this:
# addition_test.py
def addition(x, y):
    return x * y

def test_addition():
    assert addition(3, 5) == 8
Here, the assert statement fails and returns false, hence the error. The mistake above is very easy to spot in this sample because the code is so short, but imagine this change inside an extensive commit, or even worse, imagine relying on manual testing performed by humans with no help from software. This is the problem pytest solves: the pytest command will notice the mistake and alert you that something is not working as expected.
Here pytest is telling us that something has changed and the application is no longer working as expected; in fact, it explicitly reports which assert is failing – in this case the addition() function, which receives 3 and 5 as parameters and should return 8 but is returning 15, meaning something is wrong. This gives us the ability to maintain high quality without breaking functionality.
This little sample shows the basics of pytest, but we can perform more complex testing as well – many tests in a single execution and much more. Pytest combined with tools such as Mock is a very powerful automated testing tool.
List of Assert Statements in Python
Above, you saw just the basic assert statement to compare values, but there are other variations you can use – some examples here:
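The original list is not reproduced here, so the following is an illustrative (not exhaustive) sketch of common assert patterns used with pytest:

# test_assert_examples.py
import pytest

def test_assert_variations():
    result = 3 + 5
    assert result == 8                  # equality
    assert result != 0                  # inequality
    assert result > 5                   # comparison
    assert result in [2, 4, 8]          # membership
    assert isinstance(result, int)      # type check
    # asserting that a block of code raises a specific exception
    with pytest.raises(ZeroDivisionError):
        1 / 0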
Now imagine a bigger application with several files and many functions spread across the application. As it gets bigger it gets harder to test; pytest solves this by running as many tests as needed. Let's build a sample with two classes and four files – one test file for each class.
We are going to have two classes: MyMath and MyText. MyMath is in charge of handling additions and multiplications, and MyText is in charge of concatenating and upper-casing words. Individual functions are created to handle each piece of functionality.
Each class has a test file, which will be in charge of asserting each piece of functionality; the __init__() method is tested as well.
# my_math.py
# Class definition
class MyMath():
    # Addition function: receives two integer parameters and returns the resulting value, also an integer
    def addition(self, x: int, y: int) -> int:
        return x + y

    # Multiplication function: receives two integer parameters and returns the resulting value, also an integer
    def multiplication(self, x: int, y: int) -> int:
        return x * y
In the same path, create a testing file named test_math.py:
# test_math.py
from my_math import MyMath

class TestMath():
    # test the __init__() method
    def test_constructor(self):
        # initialize a MyMath object
        mm = MyMath()
        # Test if the mm object is an instance of the MyMath class
        assert isinstance(mm, MyMath)

    def test_addition(self):
        # initialize a MyMath object
        mm = MyMath()
        # assert the addition function's result
        assert mm.addition(3, 5) == 8

    def test_multiplication(self):
        # initialize a MyMath object
        mm = MyMath()
        # assert the multiplication function's result
        assert mm.multiplication(3, 5) == 15
The same process applies to the MyText class: one file holds the functionality and the other is in charge of testing every single method.
# my_text.py
class MyText():
    def __init__(self):
        pass

    # Concatenate two words, with a space between them
    def concatenate(self, word1: str, word2: str) -> str:
        return f'{word1} {word2}'

    # Returns the uppercase of the received string
    def uppercase(self, word1: str) -> str:
        return word1.upper()
Now create a new test file in the same path; remember that file names and function names must match the patterns test_*.py or *_test.py.
# test_text.py
from my_text import MyText

class TestText():
    # test the __init__() method
    def test_constructor(self):
        # initialize a MyText object
        mt = MyText()
        # Test if the mt object is a MyText class instance
        assert isinstance(mt, MyText)

    def test_concatenate(self):
        # initialize a MyText object
        mt = MyText()
        # assert the concatenate function's result
        assert mt.concatenate("hello", "world") == 'hello world'

    def test_uppercase(self):
        # initialize a MyText object
        mt = MyText()
        # assert the uppercase function's result
        assert mt.uppercase('hello') == 'HELLO'
Running the pytest command will collect all files and functions that match the patterns mentioned, meaning it will execute the three functions within each file – six functions in total.
As can be seen, it reports that six items were collected, meaning that six functions were tested, and it then displays which files were tested along with the respective percentages. If you want more information, a -v flag can be added to the command, with the following result.
The -v flag provides detailed information about each function that was tested and the state of each assert, along with the respective percentage. If you just want to test a specific file, simply write the name of the file after the pytest command.
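For example, assuming the files above are in the current directory, something like this should work:

pytest test_math.py -v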
Parametrized test
In the above examples, the test cases were hardcoded for individual numbers. How can you test a range of inputs rather than a single set? By using the pytest.mark.parametrize helper you can easily parametrize the arguments of a test function; this way different scenarios can be tested at once. Here is a sample using the pytest.mark.parametrize decorator:
# test_math.py
import pytest
from my_math import MyMath

class TestMath():
    # test the __init__() method
    def test_constructor(self):
        # initialize a MyMath object
        mm = MyMath()
        assert isinstance(mm, MyMath)

    @pytest.mark.parametrize("num1,num2,expected", [(3, 5, 8), (4, 2, 6), (-4, -1, -5)])
    def test_addition(self, num1, num2, expected):
        # initialize a MyMath object
        mm = MyMath()
        # assert the addition function's result
        assert mm.addition(num1, num2) == expected

    @pytest.mark.parametrize("num1,num2,expected", [(3, 5, 15), (4, 2, 8), (-4, -1, 4)])
    def test_multiplication(self, num1, num2, expected):
        # initialize a MyMath object
        mm = MyMath()
        # assert the multiplication function's result
        assert mm.multiplication(num1, num2) == expected
In the above example, we use the @pytest.mark.parametrize decorator to supply inputs for the listed parameters. For example, in test_multiplication the first tuple for the three variables "num1, num2, expected" is (3, 5, 15), meaning the test function is executed with num1=3, num2=5 and expected=15. This is an easy way to run multiple scenarios while keeping the code readable.
As shown above, the parametrized functions are executed more than once – three different scenarios are tested for each – and in this way the testing gives better assurance.
Conclusions
With all the tools covered here, you can run as many tests as you need and bring incredibly high quality to your application. This way you can verify your application's functionality and be confident that nothing breaks with the changes you make.
For some of your Python applications, a plugin architecture could really help you to extend the functionality of your applications without affecting the core structure of your application. Why would you want to do this? Well, it helps you to separate the core system and allows either yourself or others to extend the functionality of the core system safely and reliably.
One advantage is that when a new plugin is created, you would just need to test the plugin rather than the whole application. The other big advantage is that your application can grow through your community, making it even more appealing. Two classic examples are the plugins for the WordPress blogging platform and the plugins for the Sublime Text editor. In both cases, the plugins enhance the functionality of the core system, but the core system developers did not need to create the plugins themselves.
There are disadvantages too, however, one of the main ones being that you can only extend functionality within the constraints imposed by the plugin placeholder – e.g. if an app allows plugins for formatting text in a GUI, it's unlikely you can create a plugin to play videos.
There are several ways to create a plugin architecture; here we will walk through an approach using importlib.
The Basic Structure Of A Plugin Architecture
At it’s core, a plugin architecture consists of two components: a core system and plug-in modules. The main key design here is to allow adding additional features that are called plugins modules to our core system, providing extensibility, flexibility, and isolation to our application features. This will provide us with the ability to add, remove, and change the behaviour of the application with little or no effect on the core system or other plug-in modules making our code very modular and extensible
The Core System
The core system defines how the application operates and the basic business logic: the workflow, such as how data flows inside the application. The steps inside that workflow, however, are up to the plugin(s). All extending plugins follow that generic flow, providing their own customised implementations without changing the core business logic or the application's workflow.
In addition, the core also contains the common code used (or required to be used) by multiple plugins, as a way to eliminate duplicate and boilerplate code and keep a single structure.
The Plug-in Modules
On the other hand, plug-ins are stand-alone, independent components that contain additional features and custom code intended to enhance or extend the core system. The plugins must follow a particular set of standards or a framework imposed by the core system so that the core system and plugin can communicate effectively. A real-world analogy would be a car engine – only certain engines ("plugins") fit into a Toyota Prius, because they have to follow the specifications of the chassis/car (the "core system").
Keeping each plugin independent is the best approach to take. It is not advisable to have plugins talk to each other unless the core system facilitates that communication in a standardized way. Either way, it is simpler to keep the communication and the dependencies between plug-ins as minimal as possible.
Building a Core System
As mentioned before, we will have a core system and zero or more plugins which add features to our system. So, first of all, we are going to build the core system (we will call this file core.py) to establish the basis on which our plugins are going to work. To get started we create a class called "MyApplication" with a run() method which prints our workflow.
# core.py
class MyApplication:
    def __init__(self, plugins: list = []):
        pass

    # This method will print the workflow of our application
    def run(self):
        print("Starting my application")
        print("-" * 10)
        print("This is my core system")
        print("-" * 10)
        print("Ending my application")
        print()
Now we are going to create the main file, which will import our application and execute the run() method
# main.py
# This is the main file which will initialise and execute the run method of our application
# Importing our application file
from core import MyApplication

if __name__ == "__main__":
    # Initialising our application
    app = MyApplication()
    # Running our application
    app.run()
And finally, we run our main file, the result of which is the following:
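The original screenshot is not included here, but based on the run() method above the output should look something like this:

Starting my application
----------
This is my core system
----------
Ending my application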
Now that we have a simple application which prints its own workflow, we are going to enhance it so that the application supports plugins. To do this, we are going to modify the __init__() and run() methods.
The importlib package
To achieve our next goal, we are going to use importlib, which gives us the power of the import statement as a function. Inside our __init__() method we will be able to dynamically import as many packages as needed – it's these packages that will form our plugins.
# core.py
import importlib

class MyApplication:
    # We are going to receive a list of plugins as a parameter
    def __init__(self, plugins: list = []):
        # Checking if plugins were sent
        if plugins != []:
            # create a list of plugins
            self._plugins = [
                # Import the module and initialise it at the same time
                importlib.import_module(plugin, ".").Plugin() for plugin in plugins
            ]
        else:
            # If no plugins were set we use our default
            self._plugins = [importlib.import_module('default', ".").Plugin()]

    def run(self):
        print("Starting my application")
        print("-" * 10)
        print("This is my core system")
        # This is where the magic happens, and all the plugins are going to be printed
        for plugin in self._plugins:
            print(plugin)
        print("-" * 10)
        print("Ending my application")
        print()
The key line is importlib.import_module, which imports the module named by the first string argument – a file with a ".py" extension in the current directory (the "." second argument). So, for example, the file "default.py" present in the same directory would be imported by calling: importlib.import_module('default', '.')
The second thing to note is the ".Plugin()" appended to the importlib call: importlib.import_module(plugin, ".").Plugin()
This (specifically the trailing brackets in Plugin()) creates an instance of the Plugin class from the imported module and stores it in the internal _plugins list.
We are now ready to create our first plugin, the default one. If we ran the code at this moment it would raise a ModuleNotFoundError exception, because we have not created our plugin yet. So let's do it!
Creating default plugin
Keep in mind that we are going to load all plugins in the same way, so the files have to be named carefully. In this sample we are going to create our "default" plugin, so first of all we create a new file called "default.py" in the same folder as our main.py and core.py.
Once we have the file, we are going to create a class called "Plugin", which contains a method called process. This could also be a static method if you want to call it without instantiating the class. It's important that every new plugin class is named the same way so that these can be loaded dynamically.
# default.py
# Define our default plugin class
class Plugin:
    # Define the process method
    def process(self, num1, num2):
        # Some prints to identify which plugin is being used
        print("This is my default plugin")
        print(f"Numbers are {num1} and {num2}")
At this moment we can run our main.py file, which will print only the plugin object. We should not get any error, because we have now created our default.py plugin. The print statement inside MyApplication.run() prints the Plugin instance itself, which shows that we have successfully imported the plugin.
Let’s now modify just one line in our core.py file so we call the process() method instead of printing the module object
# core.py
import importlib

class MyApplication:
    # We are going to receive a list of plugins as a parameter
    def __init__(self, plugins: list = []):
        # Checking if plugins were sent
        if plugins != []:
            # create a list of plugins
            self._plugins = [
                importlib.import_module(plugin, ".").Plugin() for plugin in plugins
            ]
        else:
            # If no plugins were set we use our default
            self._plugins = [importlib.import_module('default', ".").Plugin()]

    def run(self):
        print("Starting my application")
        print("-" * 10)
        print("This is my core system")
        # Modified loop in order to call the process method
        for plugin in self._plugins:
            plugin.process(5, 3)
        print("-" * 10)
        print("Ending my application")
        print()
# Output
$ py main.py
Starting my application
This is my core system
This is my default plugin
Numbers are 5 and 3
Ending my application
We have successfully created our first plugin, and it is up and running. You can see the statement “This is my default plugin” which comes from the plugin default.py rather than the main.py program.
Let’s create two more plugins which provides multiplication and addition to show the extensibility of plugins can work. Please note, that in the following example, you do not need to make any changes to core.py! You simply need to add plugin files.
Adding new features (plugins)
In order to add these features, we are going to build two new files, one named "addition.py" and the other called "multiplication.py". Just remember that the class within each file has to be named "Plugin" and the method has to be named "process". Once we have these implementations, we can modify or add functionality to the core system without modifying it.
# addition.py file
# Each new plugin class has to be named Plugin
class Plugin:
    # This is the feature offered by this plugin:
    # it prints the result of adding 2 numbers
    def process(self, num1, num2):
        print("This is my addition plugin")
        print(num1 + num2)

# multiplication.py file
# Each new plugin class has to be named Plugin
class Plugin:
    # This is the feature offered by this plugin:
    # it prints the result of multiplying 2 numbers
    def process(self, num1, num2):
        print("This is my multiplication plugin")
        print(num1 * num2)
Before we run our new plugins, we are going to use the sys module, which allows us to receive parameters from the CLI (see our article: How to use argv in python). This way we can indicate which plugins our application should load.
Our directory now has the following files:
+ main.py
+ core.py
+ default.py
+ addition.py
+ multiplication.py
# main.py
# Import the sys library
import sys
from core import MyApplication

if __name__ == "__main__":
    # Initialize our app with the parameters received from the CLI via sys.argv
    # Start from position one, because position 0 will be main.py itself
    app = MyApplication(sys.argv[1:])
    # Run our application
    app.run()
# Call our main.py file
# Send all 3 plugins as parameters
$ py main.py default addition multiplication
Starting my application
This is my core system
This is my default plugin
Numbers are 5 and 3
This is my addition plugin
8
This is my multiplication plugin
15
Ending my application
Conclusion
We started off with two files, main.py and core.py, which provided the core functionality. Without modifying those two files, we can now create new plugin files and invoke that functionality with command line parameters. This provides incredible flexibility and extensibility and can help you expand your applications very easily.
The other thing to be mindful of is to provide strong documentation and examples, so that the community around your application knows exactly how to create new plugins. This makes it easier to enrich your application effectively.