First, you should check whether scraping is really the best way to get the data you want. If an API is available, it is usually the more robust option; but if there is no API, web scraping is a powerful tool for the job.
We will use the package BeautifulSoup to handle the HTML. You can install this package with
pip install beautifulsoup4
and import it in your Python script with
from bs4 import BeautifulSoup
We start with the URL of the INSPIRE-HEP jobs listing and parse the page:

from urllib.request import urlopen

start_link = "https://inspirehep.net/search?ln=en&cc=Jobs&rg=0"
html = urlopen(start_link)
soup = BeautifulSoup(html.read(), "html.parser")
This works, but it crashes with an exception if the page cannot be reached. It is safer to wrap the call in a small function that handles the common errors:

from urllib.error import HTTPError, URLError

def get_soup(link):
    try:
        html = urlopen(link)
    except HTTPError as e:
        # The server was reached but returned an error code (e.g. 404 or 500)
        print("ERROR: The page could not be retrieved!", e)
        return None
    except URLError as e:
        # The server itself could not be reached
        print("ERROR: The server could not be found!", e)
        return None
    else:
        return BeautifulSoup(html.read(), "html.parser")
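With this helper in place we can fetch the listing page again, this time safely (a minimal usage sketch reusing the start_link defined above):

soup = get_soup(start_link)
if soup is None:
    raise SystemExit("Could not retrieve the jobs page.")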
Now that we have the HTML code as a BeautifulSoup object, we can access its components in several ways. The most important methods are .find() and .findAll(), which return the first matching tag or a list of all matching tags, respectively.
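As a quick illustration (the tag names here are generic examples, not taken from the INSPIRE page):

first_heading = soup.find("h1")   # first <h1> tag, or None if there is none
all_links = soup.findAll("a")     # list of all <a> tags (possibly empty)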
However, how do we know which HTML objects to search for? For that we need to look at the HTML code. A good tool for this is the Chrome inspector (View > Developer > Developer Tools, then click the inspect icon in the top left corner of the panel). When you hover over an element on the page, you can directly see the associated HTML highlighted on the right (see Figure 1).
Figure 1: Google Chrome Developer Tools inspector. The HTML is shown on the right, with the object highlighted that corresponds to the part the mouse is currently pointing to on the left.
Looking at the HTML of the jobs page, each job posting is contained in a div of class record_body, so we can collect all postings with

list_of_jobs = soup.findAll("div", {"class": "record_body"})
We then loop over the postings and extract the posting date and the institute/position from each one, collecting the results in a list of dictionaries:

records = []
for job in list_of_jobs:
    job_info = {'posting_date': '', 'record_link': ''}
    strong_tag = job.find("strong")
    if strong_tag:
        job_info['posting_date'] = strong_tag.text
    else:
        print("WARNING: No posting date found?")
    a_tags = job.findAll("a")
    if len(a_tags) > 1:
        job_info['institute'] = a_tags[1].text
        job_info['position'] = a_tags[1].findNext("span").text
    else:
        print("WARNING: No institute found?")
Still inside the loop, we can also access the link to the detailed job advertisement and store the finished dictionary:

    job_info['record_link'] = a_tags[0].attrs['href']
    records.append(job_info)
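One caveat the code above glosses over: if the href turns out to be a relative URL, urlopen will not be able to open it later. A small sketch that would make the link absolute, assuming the listing URL as base:

from urllib.parse import urljoin

# Resolve a possibly relative href against the listing page URL
job_info['record_link'] = urljoin(start_link, a_tags[0].attrs['href'])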
Now we can visit each record and extract the detailed job description:

import time

for i, record in enumerate(records):
    time.sleep(1)  # be polite: wait a second between requests
    soup = get_soup(record['record_link'])
    if soup is None:
        print('ERROR: No soup object received for record %d' % i)
    else:
        record_details = soup.find("div", {"class": "detailed_record_info"})
        job_description_identifier = record_details.find("strong", text="Job description: ")
        if job_description_identifier:
            description = job_description_identifier.parent
            record['description'] = str(description)\
                .replace('<strong>Job description: </strong><br/>', '')
        else:
            print('WARNING: No description found')
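At this point the records list holds everything we scraped. Saving the results is not part of the code discussed here, but a minimal sketch could simply dump the list to a JSON file:

import json

with open("jobs.json", "w") as f:
    json.dump(records, f, indent=2)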
One thing to consider when scraping a site is the robots.txt file, which you can access at the root of the site (e.g. https://inspirehep.net/robots.txt). In this file the owner of the website specifies which content should not be accessed by scrapers (robots). Nothing technically forces you to follow what is written there, but it is a good idea to check whether what you want to do goes against the owner's intentions. If you look at the file linked above, you will find that while some parts of the site are excluded from robot access, the jobs page is not, so we are good.
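If you want to check this programmatically, Python's standard library includes urllib.robotparser; here is a minimal sketch (the user agent "*" is a generic placeholder):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://inspirehep.net/robots.txt")
rp.read()
print(rp.can_fetch("*", start_link))  # True if robots.txt allows fetching the jobs page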
You can get the entire code discussed in this post from my GitHub account. Let me know if you have comments/questions below.
Florian