Beautiful Soup is a Python library for quickly parsing HTML and XML documents. You can navigate, search, and modify the parse tree. The library handles encoding detection for you, converting incoming documents to Unicode and outgoing documents to UTF-8. It’s powerful and easy to use.
In a recent small side project, I needed to gather a lot of data (about 37,000 products with roughly 160 attributes each) from a website. The products were listed on the page in a regular HTML list, and each entry contained a link to a details page. The details page itself delivered additional information in several different tables.
I first thought about implementing the parser in Excel (because the final data needed to end up there), but since I work on a Mac my options were very limited (only a few add-ins are available on OS X).
I then decided to write a Python script to gather the data and import it into Excel as a UTF-8 CSV. I was certain I would quickly find an adequate library for parsing the HTML pages, and soon stumbled upon Beautiful Soup:
Valuable data that was once locked up in poorly-designed websites is now within your reach. Projects that would have taken hours take only minutes with Beautiful Soup.
That sounded too good to be true, but - hey - let’s give it a try:
```python
import urllib.request

from bs4 import BeautifulSoup

# Fetch the listing page and parse it
conn = urllib.request.urlopen(url)
html = conn.read()
soup = BeautifulSoup(html, features="html.parser")

# All product links share the class "list-group-item"
products = soup.find_all("a", class_="list-group-item")
```
We start by connecting to the target url, fetching the HTML document, and finding all a elements with the class list-group-item. After that, we can iterate through the list and fetch the corresponding details page:
```python
for product in products:
    # Each entry's href points to the product's details page
    productDetails = product.get('href', None)
    if productDetails is not None:
        conn = urllib.request.urlopen(productDetails)
        innerHtml = conn.read()
        innerSoup = BeautifulSoup(innerHtml, features="html.parser")
```
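One detail worth checking in practice: href values scraped from a listing page are often relative paths rather than full URLs (the loop above assumes they are absolute). If they turn out to be relative, urljoin from the standard library resolves them against the listing page's address. The base_url and path below are placeholder values for illustration:

```python
from urllib.parse import urljoin

# Placeholder values; in the script above, the base would be the
# listing page's URL and the path would come from product.get('href')
base_url = "https://example.com/products"
href = "/products/42"

detail_url = urljoin(base_url, href)
print(detail_url)  # https://example.com/products/42
```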
While working on the details page, we first collect a list of all tables and then iterate over their cells:
```python
# Every table on the details page contributes attribute cells
tables = innerSoup.find_all("table")
for table in tables:
    tds = table.find_all("td")
```
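From there, the cell texts need to reach Excel. The export code isn't shown above, but a minimal sketch with the standard csv module could look like this; the column names and rows are made-up placeholders, and utf-8-sig is used because the BOM helps Excel detect the encoding when opening the file directly:

```python
import csv

# Hypothetical rows; in the real script each row would come from the
# <td> texts gathered above (e.g. via td.get_text(strip=True))
rows = [
    ["SKU-1", "Widget", "9.99"],
    ["SKU-2", "Gadget", "19.99"],
]

# newline="" avoids blank lines on Windows; utf-8-sig adds a BOM so
# Excel recognizes the UTF-8 encoding
with open("products.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name", "price"])
    writer.writerows(rows)
```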
I had never used Beautiful Soup before and needed just an hour to implement the final solution. While this scenario isn’t too complicated, I loved how fast and straightforward the solution turned out.