Quickly parse websites

Beautiful Soup is a Python library for quickly parsing websites. You can navigate, search, and modify the parse tree, and the library handles encodings for you: incoming documents are converted to Unicode and outgoing documents to UTF-8. It’s powerful and easy to use.
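To make "navigate and search" concrete, here is a minimal sketch with a made-up HTML snippet (the tag names and classes are just for illustration):

```python
from bs4 import BeautifulSoup

# A tiny made-up document to demonstrate navigation and search
html = "<html><body><h1>Products</h1><p class='intro'>Hello</p></body></html>"
soup = BeautifulSoup(html, features="html.parser")

title = soup.h1.get_text()                          # navigate: first <h1> in the tree
intro = soup.find("p", class_="intro").get_text()   # search: by tag name and CSS class
```

`soup.h1` walks the tree directly, while `find` (and `find_all`) searches it; both return tag objects whose text is available via `get_text()`.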

In a recent small side project, I needed to gather a lot of data (about 37,000 products with about 160 attributes each) from a website. The products were listed on the page in a regular HTML list, and each entry contained a link to a detail page. The detail page itself delivered additional information in several different tables.

I first thought about implementing the parser in Excel (because the final data needed to end up there), but since I’m working on a Mac, my options were very limited (only a few add-ins are available on OS X).

I then decided to write a Python script to gather the data and import it into Excel as a UTF-8 CSV. I was certain that I would quickly find an adequate library to parse the HTML pages, and then I stumbled upon Beautiful Soup:

Valuable data that was once locked up in poorly-designed websites is now within your reach. Projects that would have taken hours take only minutes with Beautiful Soup.

That sounded too good to be true, but - hey - let’s give it a try:

import urllib.request
from bs4 import BeautifulSoup

# Download the listing page
conn = urllib.request.urlopen(url)
html = conn.read()

soup = BeautifulSoup(html, features="html.parser")
products = soup.find_all("a", class_="list-group-item")

We start by connecting to the target URL, reading the HTML document, and finding all a elements with the class list-group-item. After that, we can iterate through the list and fetch the corresponding detail page:

for product in products:
    productDetails = product.get('href',None)

    if productDetails is not None:
        conn = urllib.request.urlopen(productDetails)

        innerHtml = conn.read()

        innerSoup = BeautifulSoup(innerHtml, features="html.parser")

On the detail page, we first collect all the tables and then iterate through the cells of each one:

        tbodys = innerSoup.find_all("table")

        for tbody in tbodys:
            tds = tbody.find_all("td")
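From here, the cell texts can go straight into the UTF-8 CSV mentioned above via the standard csv module. A self-contained sketch (the HTML fragment, the filename products.csv, and the one-row-per-table layout are assumptions, not the actual site or script):

```python
import csv
from bs4 import BeautifulSoup

# A stand-in detail-page fragment; the real pages deliver several tables
innerHtml = "<table><tr><td>Name</td><td>Widget</td></tr></table>"
innerSoup = BeautifulSoup(innerHtml, features="html.parser")

# Write one CSV row per table, one column per cell (hypothetical layout)
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for tbody in innerSoup.find_all("table"):
        row = [td.get_text(strip=True) for td in tbody.find_all("td")]
        writer.writerow(row)
```

Opening the file with newline="" and encoding="utf-8" keeps the csv module’s line handling intact and produces a file Excel can import directly.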

I had never used Beautiful Soup before and needed just an hour to implement the final solution. While this scenario isn’t too complicated, I loved how fast and straightforward the solution turned out.