Many rich data sets are available on web pages in table format. Extracting this data can be very useful for further analysis, or even to save into your own database. In this article you will learn more about web scraping and how to extract tables from a webpage using Python 3.
Understanding HTML Tables
Every web page shows table data in its own way; however, tables are typically built with the <table> structure, which has tags for the header, the rows, and the individual cells.
The following table lists the tags that an HTML table can contain, with a description of each to make them easier to understand:
Tag | Description |
---|---|
<table> | Defines a table |
<th> | Defines a header cell in a table |
<tr> | Defines a row in a table |
<td> | Defines a cell in a table |
<caption> | Defines a table caption |
<colgroup> | Specifies a group of one or more columns in a table for formatting |
<col> | Specifies column properties for each column within a <colgroup> element. This is not always used. |
<thead> | Groups the header content in a table |
<tbody> | Groups the body content in a table |
<tfoot> | Groups the footer content in a table |
Although all these tags exist, many web pages don’t use most of them and typically just use <table>, <tr>, and <td> to create a simple table.
Take the following example table:
Name | Email |
---|---|
Charles | charles@pythonhowtoprogram.com |
Lucius | lucius.fox@wayne.com |
Andy | andy.defruen@shawshank.com |
Richard | richard@kimble.net |
What you will find is that the HTML has the following format:
<table id="example1">
<tr>
<th>Name</th> <th>Email</th>
</tr>
<tr>
<td>Charles</td> <td>charles@pythonhowtoprogram.com</td>
</tr>
<tr>
<td>Lucius</td> <td>lucius.fox@wayne.com</td>
</tr>
....
</table>
So you can see how <tr> marks the rows of the table. The header cells are structured via <th>, while the data cells are captured via <td>. This is a simple table, but it is typical of many pages. One thing to note is the “id” attribute on the table tag, which we will use to find the relevant table. This is not always available; if it isn’t, you’ll need to loop through the tables on the page to find the correct one (an example of this is shown later).
One variant of this is that some sites also have the <thead> tag surrounding the header columns and/or the <tbody> tag surrounding the data rows. So the table structure will be as follows:
<table id="example1">
<thead>
<tr>
<th>Name</th> <th>Email</th>
</tr>
</thead>
<tbody>
<tr>
<td>Charles</td><td>charles@pythonhowtoprogram.com</td>
</tr>
<tr>
<td>Lucius</td> <td>lucius.fox@wayne.com</td>
</tr>
....
</tbody>
</table>
The visual result is the same for the user, but it is something to be mindful of when scraping this type of data.
Extracting Table Data with Python – a Simple Example First
For this first example, we will use table data from the above example code in order to test extracting data.
Before collecting the data, you can first set up an initial script to ensure you can extract data from a website using urllib3. In case you have not installed the beautifulsoup4 library (or urllib3), you can install them using pip or pip3:
pip3 install urllib3 beautifulsoup4
# table_data.py
import urllib3
from bs4 import BeautifulSoup

# Create a connection pool manager to fetch pages from the website
http = urllib3.PoolManager()
# Declare a variable containing the URL that is going to be scraped
URL = 'https://pythonhowtoprogram.com/how-to-extract-table-data-from-webpages-with-example-code-python-3/'
# Fetch the raw HTML from the website
r = http.request('GET', URL)
# Parse the raw HTML into the BeautifulSoup framework
soup = BeautifulSoup(r.data, 'html.parser')
print("[1]*******")
[ print( item.prettify() ) for item in soup.find_all('table') ]
Output:
In the above code, we first fetch the page from the website using the http.request call. From there, we can start to manipulate the data by accessing the field r.data; however, it is raw HTML where you would have to parse all the tags yourself. You can instead use BeautifulSoup, which helps you easily parse all the data. You first need to create an instance of BeautifulSoup as follows:
soup = BeautifulSoup(r.data, 'html.parser')
Next, we print out the table data using the following:
[ print( item.prettify() ) for item in soup.find_all('table') ]
This is quite a compact line, but there’s quite a bit going on here, which I’ll break down further. If you’re not familiar with this syntax, it uses a Python feature called a list comprehension (you can see a full example in our “list comprehension example” article). The above is in fact equivalent to the following:
table_data = soup.find_all('table')
for item in table_data:
    print( item.prettify() )
The [ ] brackets run a for loop within a single line, where the code before the “for” runs on each iteration of the loop.
Now, the above code extracts all the tables. What if we just want to focus on the table above, which has an “id” attribute of “example1” – that is, it starts with <table id="example1">? We can use that attribute to target just that table.
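As an aside, if a table has no id (as mentioned earlier), one approach is to loop through all the tables and match on something you know about the content, such as the header text. Here is a minimal sketch, reusing the soup object from the script above – the “Email” header value is just an assumption for illustration:
# Loop through every table on the page and match on known header text
target = None
for table in soup.find_all('table'):
    headers = [ th.get_text(strip=True) for th in table.find_all('th') ]
    if 'Email' in headers:   # assumed header value -- adjust for your page
        target = table
        break

if target is not None:
    print( target.prettify() )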
Extracting Table Data Using a CSS Selector
We can use code similar to the above, but instead filter on id="example1" using the find_all() function. This returns a list, so as before we use a list comprehension to print out all the items.
# table_data.py
import urllib3
from bs4 import BeautifulSoup

# Create a connection pool manager to fetch pages from the website
http = urllib3.PoolManager()
# Declare a variable containing the URL that is going to be scraped
URL = 'https://pythonhowtoprogram.com/how-to-extract-table-data-from-webpages-with-example-code-python-3/'
# Fetch the raw HTML from the website
r = http.request('GET', URL)
# Parse the raw HTML into the BeautifulSoup framework
soup = BeautifulSoup(r.data, 'html.parser')
print("[2]*******")
[ print( item.prettify() ) for item in soup.find_all(id="example1") ]
Output:
As you can see, just the table we highlighted above is printed. This is incredibly handy when you know exactly which table to extract.
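Note that find_all(id="example1") filters on the attribute directly; the equivalent written as an actual CSS selector uses the select() function. A minimal sketch, reusing the soup object from above:
# "#example1" is the CSS selector for the element with id="example1";
# select() always returns a list, so the same loop style applies
for item in soup.select("#example1"):
    print( item.prettify() )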
Note that all the tags are still printed as well – that is what prettify() produces; if you just want the text without the markup, get_text() does that, as the quick sketch below shows. And if you want the data written out to a CSV file, that can also be done fairly easily – the next section covers exactly that.
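A quick sketch of the difference, reusing the soup object from above:
# get_text() returns only the text content, with all tags stripped
for item in soup.find_all(id="example1"):
    print( item.get_text() )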
Extracting Table Data From a Website and Storing It in a CSV File
To extract data from a table into a CSV file, you can use the BeautifulSoup methods shown above with minimal code. This can be very useful if you’re trying to capture historical data from a site that has daily updates, so that you can do some analysis over time.
The following code shows an example of how it can be done from this site:
# table_data.py
import urllib3
from bs4 import BeautifulSoup
from datetime import datetime

# Create a connection pool manager to fetch pages from the website
http = urllib3.PoolManager()
# Declare a variable containing the URL that is going to be scraped
URL = 'https://pythonhowtoprogram.com/how-to-extract-table-data-from-webpages-with-example-code-python-3/'
# Fetch the raw HTML from the website
r = http.request('GET', URL)
# Parse the raw HTML into the BeautifulSoup framework
soup = BeautifulSoup(r.data, 'html.parser')

# Build a dynamic filename based on today's date
filename = 'data_file_' + datetime.today().strftime('%Y_%m_%d') + ".txt"
with open(filename, 'w') as f:
    # First print out the column headings
    header = ','.join( [ item.get_text() for item in soup.select("table[id='example1'] > thead > tr > th") ] )
    print(header)
    print(header, file=f)
    # Next print all the data rows
    for item in soup.select("table[id='example1'] > tbody > tr"):
        row_data = ','.join( [ cell.get_text() for cell in item.select("td") ] )
        print(row_data)
        print(row_data, file=f)
Output:
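Based on the example table at the top of this article, the printed output (and the file contents) should look like this:
Name,Email
Charles,charles@pythonhowtoprogram.com
Lucius,lucius.fox@wayne.com
Andy,andy.defruen@shawshank.com
Richard,richard@kimble.net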
In the above output, you can see the data from the table at the top of this article printed in CSV format, and a file called “data_file_2021_05_10.txt” created which, when displayed with the “cat” command, has the same contents.
Code explained:
The code above has a few elements to consider.
We first create a dynamic filename with today’s date (note that the datetime library was imported at the top). The reason for the dynamic filename is that you can then run this on a daily basis to get a daily snapshot of a given website.
filename = 'data_file_' + datetime.today().strftime('%Y_%m_%d') + ".txt"
with open(filename, 'w') as f:
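As a quick illustration of the strftime format string used here – a minimal standalone sketch:
from datetime import datetime

# %Y = 4-digit year, %m = 2-digit month, %d = 2-digit day
print( datetime(2021, 5, 10).strftime('%Y_%m_%d') )   # prints: 2021_05_10
# so the file created on that day would be: data_file_2021_05_10.txt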
Next, we get all the column headers using the following code and print them as a comma-separated line:
#First print out the column headings
header = ','.join( [ item.get_text() for item in soup.select("table[id='example1'] > thead > tr > th") ] )
print(header)
print(header, file=f)
There is, again, quite a bit going on. The above is in fact equivalent to the following:
#First print out the column headings
column_item_list = soup.select("table[id='example1'] > thead > tr > th")
col_list = []
for item in column_item_list:
    col_list.append( item.get_text() )
header = ','.join( col_list )
print(header)
print(header, file=f)
The first thing to note is soup.select("table[id='example1'] > thead > tr > th"). This tells BeautifulSoup to find the table that has an attribute called “id” equal to “example1”, then find its direct child tag thead, which in turn has children tr and then th. This notation is called “CSS selector” notation and is a very intuitive way to find tags.
If, for example, there is no thead, you can simply remove that part of the selector. With the thead included, the selector is searching for tags like the following:
<table id="example1">
<thead>
<tr>
<th>ABC</th>
</tr>
<tr>
<th>XYZ</th>
</tr>
...
So in this mini example, it will capture “ABC” and “XYZ”.
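To see the selector working in isolation, here is a small self-contained sketch that parses the mini example above directly from a string:
from bs4 import BeautifulSoup

html = """
<table id="example1">
  <thead>
    <tr><th>ABC</th></tr>
    <tr><th>XYZ</th></tr>
  </thead>
</table>
"""
mini_soup = BeautifulSoup(html, 'html.parser')
# The child combinator ">" walks table -> thead -> tr -> th
print( [ th.get_text() for th in mini_soup.select("table[id='example1'] > thead > tr > th") ] )
# prints: ['ABC', 'XYZ']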
The other line you may not be familiar with is header = ','.join( col_list ), which takes col_list and joins all of its elements into a single string with “,” as the separator.
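For example:
col_list = ['Name', 'Email']
header = ','.join(col_list)
print(header)   # prints: Name,Email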
You’ll also notice that there are two print statements. This is to print to the screen and also to the file which we opened earlier.
print(header)
print(header, file=f)
The final portion to explain is the following snippet, which you can probably guess already:
#Next print all the data rows
for item in soup.select("table[id='example1'] > tbody > tr"):
    row_data = ','.join( [ cell.get_text() for cell in item.select("td") ] )
    print(row_data)
    print(row_data, file=f)
This loops through all the rows in the table using the for item in soup.select("table[id='example1'] > tbody > tr") statement, and then, within each row, concatenates all the cells of that row using the following:
row_data = ','.join( [ cell.get_text() for cell in item.select("td") ] )
This is exactly like the earlier list comprehension scenario, where the cell contents are joined with a comma in between.
Extracting Currency Table Data From a Website into a CSV – Final Example
To finish off, let’s look at an example of extracting currency data – something that changes on a daily basis. This will be slightly more complex, but the principles are the same.
This is the website we will look at: https://wise.com/gb/currency-converter/currencies/usd-us-dollar
As you can see, there is a table of currency data available on the page. The first thing to do is to make sure that the data is actually present in the HTML file, by right-clicking on the page and clicking “View Source”.
You can then search the source to find the table (the quickest way is to search for one of the currency values).
Why is this necessary? Well, some sites use JavaScript to load the data. If that’s the case, you cannot use urllib3 to extract the data, since it just fetches the HTML file and does not execute any JavaScript. You would instead need to simulate a browser so that the JavaScript can be executed. You can use something like Selenium (see our Selenium article) to do the job. This means that the r = http.request('GET', URL) line needs to change to the Selenium code, but everything else stays the same (email me if you’d like more details).
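As a rough sketch of that change (assuming Selenium is installed via pip3 install selenium, along with a compatible browser driver; everything from the BeautifulSoup line onwards stays the same):
# Sketch: fetching the page with Selenium so that JavaScript is executed
from selenium import webdriver
from bs4 import BeautifulSoup

URL = 'https://wise.com/gb/currency-converter/currencies/usd-us-dollar'
driver = webdriver.Chrome()   # or webdriver.Firefox(), etc.
driver.get(URL)               # the browser executes any JavaScript on the page
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()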
So now that the table data is available in the source, you can see a few of the key pieces: the header cells sit inside <span aria-hidden> tags within the <thead>, and there are similar tags for the data rows. Hence, the code would be as follows:
# table_data.py
import urllib3
from bs4 import BeautifulSoup
from datetime import datetime

# Create a connection pool manager to fetch pages from the website
http = urllib3.PoolManager()
# Declare a variable containing the URL that is going to be scraped
URL = 'https://wise.com/gb/currency-converter/currencies/usd-us-dollar'
# Fetch the raw HTML from the website
r = http.request('GET', URL)
# Parse the raw HTML into the BeautifulSoup framework
soup = BeautifulSoup(r.data, 'html.parser')

# Build a dynamic filename based on today's date
filename = 'data_file_' + datetime.today().strftime('%Y_%m_%d') + ".txt"
with open(filename, 'w') as f:
    # First print out the column headings
    header = ','.join( [ item.get_text() for item in soup.select("table[class='table table-condensed'] > thead > tr > th > a > span[aria-hidden]") ] )
    print(header)
    print(header, file=f)
    # Next print all the data rows
    for item in soup.select("table[class='table table-condensed'] > tbody > tr"):
        row_data = ','.join( [ cell.get_text() for cell in item.select("td > a") ] )
        print(row_data)
        print(row_data, file=f)
The above code is similar to the earlier example in that it saves the data into a file and also prints it to the screen.
Output:
The key part of the code is the selector, which you’ve seen before. This piece of code finds the content of the field where there’s a <span> tag with an attribute called “aria-hidden”, under a specific series of parent tags.
header = ','.join( [ item.get_text() for item in soup.select("table[class='table table-condensed'] > thead > tr > th > a > span[aria-hidden]") ] )
This selector maps to HTML with roughly the following shape (a sketch reconstructed from the selector itself – the actual page markup will contain more attributes and content):
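<table class="table table-condensed">
  <thead>
    <tr>
      <th>
        <a href="...">
          <span aria-hidden="true">...</span>
        </a>
      </th>
      ...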
Conclusion
There are several packages that can be used to scrape data from webpages. In this article we used the urllib3 package to get data from websites, but there are other packages available as well. If you want to extract from websites that load their data with JavaScript, you can use the Selenium package instead: “How To Scrape Javascript Websites With Selenium Using Python 3”.
With the above examples, you have the tools and skills to extract data from sites with a short snippet of code and to output it to files. You can now schedule this script with a crontab on Linux or the Task Scheduler on Windows to collect that historical data over time.
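For example, a crontab entry along these lines (a sketch – adjust the interpreter and script paths to your own system) would run the script every morning at 9am:
0 9 * * * /usr/bin/python3 /path/to/table_data.py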