Thursday, November 30, 2017

Web scraping with Python tutorial, part 1 -- BeautifulSoup

In this tutorial, I will explain how to scrape content from a website using Python. In this first part, we will scrape the content of a static page.

First, you should check whether scraping is really the best way to get the data you want. If there is an API available, that is usually a much more robust way to get the data. But if there is no API, web scraping is a very powerful tool to do the job.

We will use the package BeautifulSoup to handle the HTML. You can install this package with

    pip install beautifulsoup4

and include it in the Python script with
from bs4 import BeautifulSoup
As an example, I will scrape all job advertisements from the HEP job list. We can access a list of all job advertisements with
start_link = "https://inspirehep.net/search?ln=en&cc=Jobs&rg=0"
Here we will use urllib to obtain the raw HTML of this page and feed it into BeautifulSoup
from urllib.request import urlopen

html = urlopen(start_link)
soup = BeautifulSoup(html.read(), "html.parser")
Web scraping relies on components which are outside our control: we rely on the website being accessible (online) and on the HTML structure not having changed too much. For that reason, it is a good idea to write the scraper with very extensive exception handling. The code above can be re-written as
from urllib.error import HTTPError, URLError

def get_soup(link):
    try:
        html = urlopen(link)
    except HTTPError as e:  # The page could not be retrieved (e.g. 404 or 500)
        print("ERROR: The page could not be retrieved!", e)
        return None
    except URLError as e:  # The server itself could not be reached
        print("ERROR: The server could not be found!", e)
        return None
    else:
        return BeautifulSoup(html.read(), "html.parser")
Here we rewrote the whole procedure as a function and catch the exceptions that are raised when the page or the server cannot be found.
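From here on we will use this function to download pages. The soup object used in the following snippets is simply the result of calling it on the start link:
soup = get_soup(start_link)
if soup is None:
    raise SystemExit("ERROR: Could not download the job list")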

Now that we have the HTML code as a BeautifulSoup object, we can access its components in several ways. The most important ones are .find() and .findAll() (also available as .find_all() in newer versions), which return the first matching element or a list of all matching elements, respectively.
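As a small illustration of the difference (the HTML snippet here is made up, not taken from the jobs page):
snippet = """
<div class="record_body"><strong>2017-11-30</strong><a href="/job/1">Job 1</a></div>
<div class="record_body"><strong>2017-11-29</strong><a href="/job/2">Job 2</a></div>
"""
example_soup = BeautifulSoup(snippet, "html.parser")

first_div = example_soup.find("div", {"class": "record_body"})    # only the first match
all_divs = example_soup.findAll("div", {"class": "record_body"})  # list of all matches
print(first_div.strong.text)  # 2017-11-30
print(len(all_divs))          # 2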

However, how do we know which HTML objects to search for? For that, we need to look at the HTML code. A good tool for this is the Chrome inspector (View > Developer > Developer Tools, then click on the inspector icon in the top left corner of the panel). You can move the mouse over an object on the page and directly see the associated HTML element highlighted on the right.
Figure 1: Google Chrome Developer Tools inspector. You can see the HTML on the right with the object highlighted, which corresponds to the part the mouse is currently pointing to on the left.
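If you prefer to stay in Python, you can also get a first impression of the same tag structure by printing part of the prettified HTML:
# Print the first 2000 characters of the nicely indented HTML
print(soup.prettify()[:2000])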
With this, you can find the HTML tags you want to access. Here we want to access the list of job ads and each job ad is embedded in a div tag with class="record_body". So let's find all div objects with that class.
list_of_jobs = soup.findAll("div", {"class": "record_body"})
Now we have a list that contains the div objects for the different jobs on that page. We can loop through it, extract some more information from each job ad, and collect the results in a records list
records = []
for job in list_of_jobs:
    job_info = {'posting_date': '', 'record_link': ''}

    # The posting date is (hopefully) in the first strong tag of the record
    strong_tag = job.find("strong")
    if strong_tag:
        job_info['posting_date'] = strong_tag.text
    else:
        print("WARNING: No posting date found?")

    # The second link contains the institute, followed by a span with the position
    a_tags = job.findAll("a")
    if len(a_tags) > 1:
        job_info['institute'] = a_tags[1].text
        job_info['position'] = a_tags[1].findNext("span").text
    else:
        print("WARNING: No institute found?")

    records.append(job_info)
This website is an example of badly structured HTML, which makes scraping quite difficult. If a site has a clear structure, with plenty of attributes such as ids or classes, we can easily identify the objects we want; here that is not the case. For example, the loop above assumes that the first strong tag contains the posting date. Let's hope the admin does not change the HTML by adding more strong tags, since that would break this scraper. However, I could not find a more reliable way to access the posting date.
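To make that assumption a little less fragile, we can at least verify that the text we grabbed actually parses as a date before trusting it. This is only a sketch; the date formats listed here are assumptions and would need to be adapted to what the page really uses:
from datetime import datetime

def parse_posting_date(text):
    # Try a few candidate formats (assumed, not taken from the page) and
    # return a datetime if one of them matches, otherwise None
    for fmt in ("%Y-%m-%d", "%d %b %Y"):
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None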

Within the same loop, we can also store the link to the detailed job advertisement with
job_info['record_link'] = a_tags[0].attrs['href']
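Depending on how the page is written, the href attribute may be a relative path rather than a full URL; in that case it has to be joined with the base URL before we can pass it to get_soup(). A sketch using urljoin (which leaves the link unchanged if it is already absolute, so it is safe either way):
from urllib.parse import urljoin

# If the href is relative (e.g. "/record/123"), build the absolute URL first
job_info['record_link'] = urljoin("https://inspirehep.net", a_tags[0].attrs['href'])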
So to get even more information about the job ads we can now visit the individual links
import time  # needed for the polite delay between requests

for i, record in enumerate(records):
    time.sleep(1)
    soup = get_soup(record['record_link'])
    if soup is None:
        print('ERROR: No soup object received for record %d' % i)
    else:
        record_details = soup.find("div", {"class": "detailed_record_info"})
        job_description_identifier = record_details.find("strong",
                                                         text="Job description: ")
        if job_description_identifier:
            description = job_description_identifier.parent
            record['description'] = str(description)\
                          .replace('<strong>Job description: </strong><br/>', '')
        else:
            print('WARNING: No description found')
Here we loop over all job ads stored in the records list, obtain the HTML code with the get_soup() function defined above, and scrape the job description. The scraper waits for one second within the loop; otherwise, we could put a heavy load on the server and disrupt the website's performance. Again, the job description has to be found by searching for a heading text rather than HTML attributes, because this website does not offer much structure.

One thing to consider when scraping a site is the robots.txt file, which you can access at the root of the site (e.g. https://inspirehep.net/robots.txt). In this file, the owner of the website specifies which content should not be accessed by scrapers (robots). You don't have to follow what is written in this file, but it is a good idea to look at it to see whether what you want to do goes against the owner's intention. If you look at the file linked above, you will find that while some pages are excluded from robot access, the jobs list is not, so we are good.
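Python's standard library also ships with a parser for robots.txt, so you can check programmatically whether a given URL is allowed; a minimal sketch:
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://inspirehep.net/robots.txt")
robots.read()  # download and parse the robots.txt file
# '*' stands for a generic user agent; prints True if the jobs list may be crawled
print(robots.can_fetch("*", start_link))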

You can get the entire code discussed in this post from my GitHub account. Let me know if you have comments/questions below.
cheers
Florian