Scraping tables

Python’s Scrapy library is an excellent tool for building simple but powerful scrapers. Scraping HTML tables is a common need when pulling text out of pages and, as I’m going to show, it really doesn’t need to be difficult.

The rough idea is to find a table, iterate over each row and then get the text out of each cell.

Sources

I struggled to scrape a table when all I wanted was an array of arrays of values, and then I found this guide on how to scrape tables. On this page I go through the same steps but also offer a quick utility class you can use.

Also, check out my post on debugging Scrapy for a quick and easy way to try this out in your own project.

Table scraper

Here’s the table scraper I’ve put together for my project:

It’s very simple and gives you an array of arrays: one array per row, holding the text of each cell.

How to use the table scraper

Simply select the table you want to scrape, and you can even get it out as a dictionary. In this example, since the page doesn’t identify each table, I had to use an XPath expression to pick out the first table.

Running it (scrapy runspider table-scraping.py) should give results that look something like this:

[Screenshot: running the example scraper, scraping a table with the helper class]

The data we get out of this will look like:

[Screenshot: the scraped table output as a CSV table]

How it works

It is really simple, so I’ll run through the steps here:

  1. First we pull out the root element of the table. This should be your <table> element, probably identified by an ID or class.
  2. Once we have the table, we get all rows by selecting its <tr> tags.
  3. Then, for each row, we extract the text of all cells. Header cells might be marked up with <th> instead of <td>, so to get both, my selector picks either.

And that’s it, there we have our table.

Final words

Though this will get you your table, you will likely want to wrap the data and parse each value individually. You could either post-process each row or, instead of using the ::text selector, pass a different callback or function that picks out the elements you want, for example to pull out links.