Monday, January 22, 2018

Understanding the self variable in Python classes

If you have ever defined a class in Python, you have probably encountered the self variable. Here is a simple example
class Cat(object):
    def __init__(self, hungry=False):
        self.hungry = hungry

    def should_we_feed_the_cat(self):
        if self.hungry:
            print("Yes")
        else:
           print("No")
This is a simple class describing a cat, which has one attribute, hungry. The __init__() method is called automatically whenever a new object of the class is created, while should_we_feed_the_cat() is a regular method of the object and can be called at any time. This method tests the hungry attribute and prints "Yes" or "No".

Let's initialize a cat
>>> felix = Cat(True)
>>> felix.should_we_feed_the_cat()
Yes
Here we initialized the object with one argument, even though the __init__() method is defined with two arguments. We also tested whether the cat is hungry by calling should_we_feed_the_cat() without any argument, even though it is defined with one. So why is Python not complaining about this?

First, we should clarify the difference between a function and a method. A method is a function which is associated with an object. All the functions we defined within our class above are associated with objects of that class and hence they are methods. However, we can still call them as functions
>>> type(Cat.should_we_feed_the_cat)
<class 'function'>
>>> type(felix.should_we_feed_the_cat)
<class 'method'>
The first call is a function call, while the second is a method call. When a method is called on an object, Python automatically passes the object itself as the first argument. So
Cat.should_we_feed_the_cat(felix)
is the same as
felix.should_we_feed_the_cat()
In the first case we call a function and need to pass in the object manually, while in the second case we call the method and the object is passed automatically.

If we get this wrong and, for example, call the method while also passing in the object explicitly, we get an error message which you have probably seen before (I certainly have)
>>> felix.should_we_feed_the_cat(felix)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: should_we_feed_the_cat() takes 1 positional argument but 2 were given

If you really want to avoid the self variable you can do that by using the @staticmethod decorator
class Cat(object):
    def __init__(self, hungry=False):
        self.hungry = hungry

    @staticmethod
    def other_function():
        print("This function has no self variable")
Note that the self variable is not a reserved keyword. You could easily write the class above naming the first variable in the should_we_feed_the_cat() method anything you want and it would still work
class Cat(object):
    def __init__(whatever, hungry=False):
        whatever.hungry = hungry

    def should_we_feed_the_cat(whatever):
        if whatever.hungry:
            print("Yes")
        else:
            print("No")
However, self is the accepted convention and you should stick to it.

Given the very simple concept captured by the self variable, you might already be thinking about ways to get rid of it and access the class attributes differently. If you have experience with other programming languages like Java, you might prefer a pre-defined keyword instead. However, that is not an option in Python. If you are interested, here is a blog post by Guido van Rossum arguing for keeping the explicit self variable.

I hope that was useful and let me know if you have questions/comments in the comments section below.
cheers
Florian

Wednesday, January 17, 2018

Lambda functions in Python

The lambda operator or lambda function is a way to create small anonymous functions, i.e. functions without a name. This construct can be useful if you need a simple function only once and you want to discard it directly after usage.

The syntax for a lambda function is
lambda arguments: expression
and they can have any number of arguments but only one expression. The expression is evaluated and returned.

Here is an example: This function squares its argument
g = lambda x: x**2
print("g(5) = ", g(5))
which gives
g(5) =  25
The same function can be defined in a conventional way
def f(x): 
   return x**2
print("f(5) = ", f(5))
which gives
f(5) =  25
Both functions g() and f() do the same thing, so lambda functions can operate like any normal function and are basically an alternative to def. Given the example above you might ask what is the purpose of a lambda function if it is just a different way of defining a function?

Lambda functions can come in handy if you do not want to define a separate function. Assume you have a list and you want to find all list elements which are even
a = [2, 3, 6, 7, 9, 10, 122]
print(list(filter(lambda x: x % 2 == 0, a)))
which gives
[2, 6, 10, 122]
filter() is a built-in Python function which expects a function as its first argument and an iterable as its second (in Python 3 it returns an iterator, which is why we wrap the result in list() above). The filter function discards all list elements for which the lambda function returns False and keeps all elements for which it returns True. Wherever a function is expected we can instead provide a lambda function, and for a simple task like the one above you might prefer a lambda function over defining a stand-alone function with def.

Note that many programmers don't like lambda functions, but rather use list comprehension, which they deem more readable. The example above written with a list comprehension would look like this
print([num for num in a if num % 2 == 0])
You see that this looks a bit easier to read since it uses well-known concepts and it does not actually use much more space.

Maybe just to extend on this we should discuss two examples where list comprehension does not work. Let's assume we want to sort a list by the modulus of 10 of the value. We can do that with
print(sorted(a, key=lambda x: x % 10))
which results in
[10, 2, 122, 3, 6, 7, 9]
Another example is where a function can return a lambda function
import numpy as np

def gaussian(height, center_x, center_y, width_x, width_y):
    return lambda x, y: height*np.exp(
                -(((center_x-x)/width_x)**2+((center_y-y)/width_y)**2)/2)
The function gaussian() returns a 2D Gaussian function which expects two parameters as input and returns the value of the 2D Gaussian at these two values.
func = gaussian(1., 0.5, 0.5, 1., 1.)
print(func(0.2, 0.8))
So we first create a Gaussian distribution with the center at $x=0.5$ and $y=0.5$ and then print the value of this Gaussian at $x=0.2$ and $y=0.8$.
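For these parameters the printed value can also be checked by hand, since the expression above simplifies to
\[\mathrm{func}(0.2, 0.8) = \exp\left(-\frac{(0.5-0.2)^2 + (0.5-0.8)^2}{2}\right) = \exp(-0.09) \approx 0.914\]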

And there are many more examples where lambda functions can come in very handy. In any case, it is important to be familiar with the concept of lambda functions, even just to be able to read other people's code.

Besides the filter function, there are also map() and reduce(), which often make use of lambda functions. The map() function applies a function to every element of a list (or, as here, element-wise to several lists); again we wrap the result in list() for Python 3
a = [2, 3, 6, 7, 9, 10, 122]
b = [121, 32, 61, 45, 78, 1, 90]
print(list(map(lambda x, y: x+y, a, b)))
which gives
[123, 35, 67, 52, 87, 11, 212]
Here the lambda function takes two arguments, one from each list. map() works with any number of lists, as long as the function accepts one argument per list and the lists have the same length.

The reduce function is a bit different since it is executed multiple times. The function which needs to be fed into reduce() has to accept two arguments. This function is then called on the first two elements of the list, then with the result of that call and the third element, and so on, until all of the list elements have been processed

from functools import reduce
a = [2, 3, 6, 7, 9, 10, 122]
print(reduce(lambda x,y: x+y, a))
which gives
159
This means that the function is called $n-1$ times if the list contains $n$ elements. The return value of the last call is the result of the reduce() function.
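To make the order of operations explicit, here is a small variation that prints each intermediate step; it is just an illustration of what reduce() does internally
from functools import reduce

a = [2, 3, 6, 7, 9, 10, 122]

def add_verbose(x, y):
    # same as the lambda above, but printing each intermediate result
    print("%d + %d = %d" % (x, y, x + y))
    return x + y

print(reduce(add_verbose, a))
The running sums printed are 5, 11, 18, 27, 37 and finally 159, which is also the value returned by reduce().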

Lambda functions can be used anywhere, but in my case, they rarely appear outside of filter(), map() and reduce(). I hope that was useful, please leave questions/comments below.
cheers
Florian

Wednesday, January 10, 2018

Global variables in Python

Global variables in Python behave a bit differently than in other programming languages. If we define a global variable outside a function and print it inside the function, everything works fine.
x = 5

def f():
    print("x = ", x)
f()
which results in
x =  5
However, if we include a second definition of the same variable within the function, the function loses its access to the global variable
x = 5

def f():
    x = 6
    print("x = ", x)
f()
print("but the global variable has not changed... => ", x)
which gives
x =  6
but the global variable has not changed... =>  5
So if you define the same variable within the function, a new local variable is created, which is lost after the function finishes. The global variable is not affected.

However, if you try to modify the global variable within the function it will result in an error
x = 5

def f():
    x += 1
    print("x = ", x)
f()
which gives
UnboundLocalError: local variable 'x' referenced before assignment
I think this is very confusing. We seem to have read access to the global variable, as the earlier print statement showed, but we cannot modify it. The reason is that Python decides at compile time which names in a function are local: any assignment to x anywhere in the function body makes x local throughout the function, so x += 1 tries to read a local variable that has not been assigned yet.
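A minimal sketch that makes this visible: even a plain read of x fails once an assignment to the same name appears anywhere later in the function
x = 5

def f():
    print("x = ", x)  # raises UnboundLocalError because of the assignment below
    x = 6
f()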

To actually modify the global variable we need the global keyword like this
x = 5

def f():
    global x
    x += 1
    print("x = ", x)
f()
which gives
x =  6
In a language like C++ every variable must be declared explicitly, so it is always clear whether a name refers to a new local variable or to an existing global one. Python does not require explicit variable declarations and therefore assumes that any variable you assign inside a function has function scope, unless you explicitly state (using the global keyword) that you want to work with the global variable. If you only read a variable, Python uses the global one whenever no local variable of that name exists.

Anyway, I hope that was helpful, if you have any questions/comments, please let me know in the comment section below.
cheers
Florian

Wednesday, January 3, 2018

Plotting MCMC chains in Python using getdist

This is a quick introduction to the getdist package by Antony Lewis, which allows visualizing MCMC chains.
from getdist import plots, MCSamples
import numpy as np

def main():
    mean = [0, 0, 0]
    cov = [[1, 0, 0], [0, 100, 0], [0, 0, 8]]
    x1, x2, x3 = np.random.multivariate_normal(mean, cov, 5000).T
    s1 = np.c_[x1, x2, x3]

    mean = [0.2, 0.5, 2.]
    cov = [[0.7, 0.3, 0.1], [0.3, 10, 0.25], [0.1, 0.25, 7]]
    x1, x2, x3 = np.random.multivariate_normal(mean, cov, 5000).T
    s2 = np.c_[x1, x2, x3]

    names = ['x_1', 'x_2', 'x_3']
    samples1 = MCSamples(samples=s1, labels=names, label='sample1')
    samples2 = MCSamples(samples=s2, labels=names, label='sample2')

    g = plots.getSubplotPlotter()
    g.triangle_plot([samples1, samples2], filled=True)
    g.export('triangle_plot.png')
    g.export('triangle_plot.pdf')
    return 

if __name__ == "__main__":
    main()
which results in a triangle plot comparing the two samples (saved as triangle_plot.png and triangle_plot.pdf)

We can also plot the constraints on the individual parameters
def get_constraints(samples):
    for i, mean in enumerate(samples.getMeans()):
        upper = samples.confidence(i, upper=True, limfrac=0.05)
        print("\nupper limit 95 C.L. = %f" % upper)
        lower = samples.confidence(i, upper=False, limfrac=0.05)
        print("lower limit 95 C.L. = %f" % lower)
        print("%s = %f +/- %f +/- %f" % (samples.parLabel(i),\
        mean, mean - samples.confidence(i, limfrac=0.16),\
        mean - samples.confidence(i, limfrac=0.025)) )
    return 
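This helper is not called in the script above; to reproduce the numbers below, we call it on the first sample set, whose input covariance was diagonal with variances 1, 100 and 8:
get_constraints(samples1)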
which in our case looks like this
upper limit 95 C.L. = 1.653261
lower limit 95 C.L. = -1.637202
x_1 = 0.000167 +/- 0.996626 +/- 1.958270

upper limit 95 C.L. = 16.365135
lower limit 95 C.L. = -16.484502
x_2 = -0.010400 +/- 9.874866 +/- 19.682727

upper limit 95 C.L. = 4.637981
lower limit 95 C.L. = -4.656869
x_3 = -0.009280 +/- 2.827152 +/- 5.571246
While in our case we set the variance and covariance of the parameters manually at the beginning to generate the samples, usually one gets an MCMC chain and wants to determine these parameters. To get the covariance matrix between two parameters we have to use the cov() function and provide the index of the parameters
print("cov(x_1, x_3) = ", samples2.cov(pars=[0,2]))
print("cov(x_1, x_2) = ", samples2.cov(pars=[0,1]))
which in our case just reproduces the input values
cov(x_1, x_3) =  [[ 0.69364079  0.10129875]
 [ 0.10129875  7.13148423]]
cov(x_1, x_2) =  [[  0.69364079   0.30451831]
 [  0.30451831  10.05805092]]
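In practice, the chains usually come from files written by an MCMC sampler rather than from arrays generated in the script. getdist can read such chains with loadMCSamples; here is a minimal sketch, assuming chain files with the (hypothetical) root name chains/myrun exist on disk
from getdist import loadMCSamples

samples = loadMCSamples('chains/myrun')
print(samples.getMeans())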
I posted the example code above on GitHub.
best
Florian

Wednesday, December 20, 2017

Building a random forest classifier from scratch in Python

A random forest classifier uses an ensemble of decision trees to classify objects. The predictions of the individual decision trees are counted and the class with the highest number of votes is chosen. The reason for using multiple decision trees is to reduce overfitting, which is often present in a single decision tree.

We can build a random forest from a training sample of size $M$ by following these 4 steps:

1. Randomly select $k \ll M$ samples with replacement from the dataset (bagging).
2. Build a decision tree: Pick a feature randomly from the set of $F$ available features and determine the split points with the highest information gain.
3. Continue picking features randomly until the decision tree reaches a maximum depth or has only leaf nodes.
4. Repeat steps 1-3 $B$ times to build a random forest classifier with $B$ trees.

Step 1: Bagging


Bagging or bootstrapping is a term which describes building sub-samples of a larger sample by picking a number of objects $k$ with replacement.
import numpy as np

k_indices = np.random.choice(len(train), k)  # k indices drawn with replacement
train[k_indices]
Using bootstrapped training samples is a way to decorrelate the decision trees. If the trees are uncorrelated, increasing the number of trees will always lead to a reduction in the variance of the random forest without increasing the bias.

The uncertainty of a random forest with $B$ uncorrelated decision trees $f_b$ can be estimated as the variance between the different trees
\[\sigma^2(x') = \frac{1}{B-1}\sum^{B}_{b=1}\left[f_b(x') - \overline{f}(x')\right]^2\]
The training and test error tend to level off after some number of trees have been fit and can be determined using cross-validation.

Step 2: Build a decision tree


We start with the root node. To fully specify this node we need to split the dataset into two daughter nodes. The way in which the dataset is split at each node is the main aspect that characterizes the tree we are building.

In our case, we determine the split point by the maximum information gain. The information gain is defined as
\[IG = I(D_p) - \frac{N_{left}}{N_p}I(D_{left}) - \frac{N_{right}}{N_p}I(D_{right})\]
and $I(D)$ can be the Gini index or Entropy:
\[
\begin{align}
I^{Gini}(D) &= 1 - \sum^{N_{classes}}_{i}p_i^2\\
I^{Entropy}(D) &= - \sum^{N_{classes}}_{i}p_i\log(p_i)
\end{align}
\]
where $p_i$ is the probability that an element picked at random from that node belongs to class $i$. For the rest of this blog post I will focus on the Gini index.

Let's go through an example. Assume we want to build a decision tree to decide whether a person is female or male based on a set of features, so we have only two classes. Consider a parent node containing 80 people, 40 of each class, and the following two possible split points: split 1 produces a left node with (30, 10) members of the two classes and a right node with (10, 30); split 2 produces a left node with (20, 40) and a right node with (20, 0).

Given these two possibilities, let's calculate the information gain for split number 1:
\[\begin{align}
I^{Gini}(D_{parent}) &= 1 - \left[\left(\frac{40}{80}\right)^2 + \left(\frac{40}{80}\right)^2\right] = 1 - (0.5^2 + 0.5^2) = 0.5\\
I^{Gini}(D_{left}) &= 1 - \left[\left(\frac{30}{40}\right)^2 + \left(\frac{10}{40}\right)^2\right] = 1 - \left[\left(\frac{3}{4}\right)^2 + \left(\frac{1}{4}\right)^2\right] = \frac{3}{8} = 0.375\\
I^{Gini}(D_{right}) &= 1 - \left[\left(\frac{10}{40}\right)^2 + \left(\frac{30}{40}\right)^2\right] = 1 - \left[\left(\frac{1}{4}\right)^2 + \left(\frac{3}{4}\right)^2\right] = \frac{3}{8} = 0.375\\
IG &= 0.5 - \frac{40}{80}\times 0.375 - \frac{40}{80}\times 0.375 = 0.125
\end{align}\]
So the information gain is $0.125$. Let's do the same for split number 2:
\[\begin{align}
I^{Gini}(D_{parent}) &= 0.5 \text{   (as before)}\\
I^{Gini}(D_{left}) &= 1 - \left[\left(\frac{20}{60}\right)^2 + \left(\frac{40}{60}\right)^2\right] = 1 - \left[\left(\frac{1}{3}\right)^2 + \left(\frac{2}{3}\right)^2\right] = 1 - \frac{5}{9} = 0.44\\
I^{Gini}(D_{right}) &= 1 - \left[\left(\frac{20}{20}\right)^2 + \left(\frac{0}{20}\right)^2\right] = 1 - (1 + 0) = 0\\
IG &= 0.5 - \frac{60}{80}\times 0.44 - 0 = 0.17
\end{align}\]
So the second split is preferred. In general, splits which separate the classes well have a large information gain; a complete separation corresponds to the maximum possible information gain, which equals the Gini index of the parent node ($0.5$ here, for two equally populated classes). A node whose daughters are completely separated carries all the available class information, so there is no point in growing the tree any deeper below it.

Here are the functions which calculate the information gain using the Gini index:
def calc_information_gain(groups, list_of_class_ids):
    # count all samples
    Nall = sum([len(group) for group in groups])
    # calculate Gini index of parent node
    all_rows = [row for group in groups for row in group]
    IG = calc_gini(all_rows, list_of_class_ids)
    # calculate Gini index of daughter nodes
    for group in groups:
        IG -= calc_gini(group, list_of_class_ids)*len(group)/Nall
    return IG

def calc_gini(group, list_of_class_ids):
    Ngroup = len(group)
    if Ngroup == 0:
        return 0
    dataset_class_ids = [row[-1] for row in group]
    sum_over_classes = 0.
    for class_id in list_of_class_ids:
        prob = dataset_class_ids.count(class_id)/Ngroup
        sum_over_classes += prob**2
    return 1. - sum_over_classes
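As a quick check, these functions reproduce the information gains of the two splits calculated by hand above; a small sketch, where each row only needs the class label in its last column
# split 1: left node with (30, 10) and right node with (10, 30) members
left = [[0]]*30 + [[1]]*10
right = [[0]]*10 + [[1]]*30
print(calc_information_gain([left, right], [0, 1]))  # 0.125

# split 2: left node with (20, 40) and right node with (20, 0) members
left = [[0]]*20 + [[1]]*40
right = [[0]]*20
print(calc_information_gain([left, right], [0, 1]))  # approximately 0.167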
Note that building a tree like this is highly representative of the training dataset and hence is susceptible to overfitting. This is the reason why random forest classifiers build multiple trees with random subsets of the training dataset.

Splitting the dataset: Splitting a dataset involves iterating over all rows in the dataset, checking if the feature value is below or above the split value and assigning it to the left or right group, respectively.
def split_node(index, value, dataset):
    ''' Split the dataset into two using a feature index and 
    feature value '''
    left = []
    right = []
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return [left, right]
The feature on which to split is chosen randomly when building a random forest. To find the value on which to split, we have to go through all possible splits and evaluate the information gain. This can be a very expensive calculation.

Our approach is to pick all possible values of the feature within the training dataset and evaluate the information gain for all of them. In that case, the cost will scale with the number of rows in the training dataset.

With the Gini function above and the test split function, we now have everything we need to evaluate all possible splits.
def get_split(dataset, index):
    ''' Evaluate all possible splits given the dataset and the index of 
    the feature on which to split '''
    list_of_class_ids = list(set(row[-1] for row in dataset))
    split_value, max_IG, split_groups = 0., -1., None
    for row in dataset:
        groups = split_node(index, row[index], dataset)
        IG = calc_information_gain(groups, list_of_class_ids)
        if IG > max_IG:
            split_value, max_IG, split_groups = row[index], IG, groups
    return {'index': index, 'split_value': split_value, 'groups': split_groups}
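To illustrate what get_split() returns, here is a tiny hypothetical dataset with one feature and the class label in the last column
dataset = [[1.3, 0], [2.7, 0], [3.6, 1], [4.4, 1]]
node = get_split(dataset, 0)
print(node['split_value'])  # 3.6, which separates the two classes completely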
To build the tree we also need to decide how deep it should be. There are two possibilities: we can set a maximum tree depth measured from the root node, or we can set a minimum number of rows in a node as a stopping criterion. Here we will take both as input values for our tree-building function.

Step 3: Building the decision tree


To build the decision tree we need to implement the following steps:

1. Randomly determine the first splitting feature.
2. Split the original dataset according to the first optimal split using the get_split() function.
3. Recursively repeat step 1 and 2 on the daughter nodes until the tree has reached the maximum depth or either left or right group of rows is smaller than the minimum number of rows.

And here is the implementation
def build_tree(train, max_depth, min_size):
    # randomly determine the feature index
    feature_index = int( np.random.random()*(len(train[0]) - 1) )
    root = get_split(train, feature_index)
    split(root, max_depth, min_size, 1)
    return root

def to_terminal(group):
    # Create a terminal node value
    list_of_classes = [row[-1] for row in group]
    return max(set(list_of_classes), key=list_of_classes.count)

def split(node, max_depth, min_size, depth):
    left, right = node['groups']
    del(node['groups'])
    # check for a no split
    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right)
        return
    # check for max depth
    if depth >= max_depth:
        node['left'] = to_terminal(left)
        node['right'] = to_terminal(right)
        return
    # process left child
    if len(left) <= min_size:
        node['left'] = to_terminal(left)
    else:
        feature_index = int( np.random.random()*(len(left[0]) - 1) )
        node['left'] = get_split(left, feature_index)
        split(node['left'], max_depth, min_size, depth+1)
    # process right child
    if len(right) <= min_size:
        node['right'] = to_terminal(right)
    else:
        feature_index = int( np.random.random()*(len(right[0]) - 1) )
        node['right'] = get_split(right, feature_index)
        split(node['right'], max_depth, min_size, depth+1)

Step 4: Building the forest


Building the forest just means building multiple trees, so we call the build_tree() function multiple times with bootstrapped training samples
def build_forest(train, k, N_trees):
    max_depth = 4
    min_size = 2
    forest = []
    for i in range(0, N_trees):
        # bootstrap training dataset
        k_indices = np.random.choice(len(train), k)
        forest.append(build_tree(train[k_indices], max_depth, min_size))
    return forest
And that is it. We now have a random forest classifier in the form of a list of dictionaries. To classify a new object we just need to loop over the decision trees, store each tree's prediction, and take the majority vote as the classification.

Here is an implementation together with a test example:
def main():
    train = np.array([[3.77,4.19,0],
    [4.77,1.169761413,0],
    [-5.,2.81281357,0],
    [3.1,2.61995032,0],
    [3.6,2.209014212,0],
    [1.2,-3.162953546,1],
    [2.3,-3.339047188,1],
    [5.6,0.476683375,1],
    [-1.3,-3.234550982,1],
    [2.1,-3.319983761,1]])
    forest = build_forest(train, k=10, N_trees=100)
    for row in train:
        prediction = make_prediction(forest, row)
        print('truth = %d : prediction = %d' % (row[-1], prediction))
    return

def traverse_tree(node, row):
    if row[node['index']] < node['split_value']:
        if isinstance(node['left'], dict):
            return traverse_tree(node['left'], row)
        else:
            return node['left']
    else:
        if isinstance(node['right'], dict):
            return traverse_tree(node['right'], row)
        else:
            return node['right']

def make_prediction(forest, row):
    list_of_classes = []
    for tree_root in forest:
        list_of_classes.append(traverse_tree(tree_root, row))
    return max(set(list_of_classes), key=list_of_classes.count)
There are some hyper-parameters in the implementation above: What is the correct tree depth or minimum node size? Is it really best to pick one feature randomly, or should we consider multiple features per split? How many trees should the forest contain? All of these questions can be addressed with the dataset itself, for example via cross-validation, and do not need to be fixed in advance. If there is interest, I can go into details in a future blog post.

If you ever want to use a random forest classifier in practice you should, of course, consider the scikit-learn implementation, which reduces the entire code above to 3 lines:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100)
forest.fit(dataset_features, dataset_classes)
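To classify objects with the fitted forest, one more call is all that is needed (using the same variable names as above):
predictions = forest.predict(dataset_features)  # or any array of new feature vectors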
I hope this exercise gave you some insight into what is hidden behind the sklearn black box. The entire code is also available on GitHub.
best
Florian

Useful resources:
Blog post by Jason Brownlee on building decision trees: https://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python

Monday, December 18, 2017

Web scraping with Python tutorial, part 2 -- scrapy.py

This is the second part of this tutorial about web scraping with Python. In the first part we looked at scraping static content from a particular website and we used the syntax of that website to access the desired information. In this part, we will look at web scrapers that potentially could search the entire internet.

It would be quite simple to extend the scraper we already developed to search the entire internet rather than just one website. All we would need to do is collect the links on each page and, after processing one page, follow one of those links. If we build up a list of links this way, it is very unlikely that we will ever run out of them.

Rather than building such a scraper ourselves, we are going to make use of Scrapy, a Python package built exactly for this purpose of large-scale web scraping. Install Scrapy with

    pip install scrapy

and create a new project with

    scrapy startproject email_scraper

Now there should be a folder called email_scraper. Under email_scraper/spiders we have to write our scraping commands. I wrote a short spider called email_spider.py
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import re
# scrapy crawl email_spider
def write_to_file(emails):
    with open("list_of_emails.csv", "a") as myfile:
        for email in emails:
            myfile.write("%s\n" % email)
    # the with statement closes the file automatically
    return

class PureEvil(CrawlSpider):
    name = "email_spider"
    download_delay = 1.0
    allowed_domains = ['stackoverflow.com'] 
    start_urls = ['https://stackoverflow.com/']
    rules = [Rule(
        LinkExtractor(allow=".*"),
        callback='parse_page', 
        follow=True)
    ]
    def parse_page(self, response):
        emails = re.findall(r'[\w\.-]+@[\w\.-]+', response.body.decode('utf-8'))
        if emails:
            write_to_file(emails)
        return
This spider is named email_spider and if you go through the lines of code you can see why: it searches the page source for email addresses and stores them in a CSV file. It is restricted to the domain stackoverflow.com and starts at the main page. It automatically collects the links found on each page and keeps following them until it has crawled the entire allowed domain.
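To start the spider, run the command from the comment at the top of the file from inside the project directory

    scrapy crawl email_spider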

You can see how easy it is to write a scraper with scrapy. Just a few lines of code give us a very powerful tool. Let me know if you have comments/questions below.
cheers
Florian

Thursday, November 30, 2017

Web scraping with Python tutorial, part 1 -- BeautifulSoup

In this tutorial, I will explain how to scrape content from a website using Python. In this first part, we will scrape the content of a static page.

First, you should check whether scraping is really the best way to get the data you want. If there is an API available, that is usually a much more robust way to get the data. But if there is no API, web scraping is a very powerful tool to do the job.

We will use the package BeautifulSoup to handle the HTML. You can install this package with

    pip install beautifulsoup4

and include it in the Python script with
from bs4 import BeautifulSoup
As an example, I will scrape all job advertisements from the HEP job list. We can access a list of all job advertisements with
start_link = "https://inspirehep.net/search?ln=en&cc=Jobs&rg=0"
Here we will use urllib to obtain the raw HTML of this site and feed it into BeautifulSoup
from urllib.request import urlopen

html = urlopen(start_link)
BeautifulSoup(html.read(), "html.parser")
Web scraping relies on components which are outside our control: we rely on the website being accessible (online) and on the HTML structure not having changed too much. For that reason, it is a good idea to write the scraper with very extensive exception handling. The code above can be re-written as
from urllib.error import HTTPError, URLError

def get_soup(link):
    try:
        html = urlopen(link)
    except HTTPError as e:  # the server returned an error (e.g. page not found)
        print("ERROR: Internal server error!", e)
        return None
    except URLError as e:  # the server could not be reached
        print("ERROR: The server could not be found!", e)
        return None
    else:
        return BeautifulSoup(html.read(), "html.parser")
Here we re-wrote the whole procedure as a function and catch exceptions in which the page or the server can not be found.
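With this helper in place, fetching the job list defined by start_link above reduces to (a minimal sketch)
soup = get_soup(start_link)
if soup is None:
    print("ERROR: Could not fetch the job list")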

Now that we have the HTML code as a BeautifulSoup object we can access its components in several ways. The most important are .find() and .findAll(), which return the first matching HTML object or all HTML objects, respectively.

However, how do we know which HTML objects to search for? For that, we need to look at the HTML code. A good tool for this is the Chrome inspector (View > Developer > Developer Tools, then click on the inspector icon at the top left corner of the panel). If you move the mouse over an object on the rendered page, the associated HTML element is highlighted in the Elements panel
Figure 1: Google Chrome Developer tools inspector. You can see the HTML
on the right with the object highlighted, which corresponds to the part
the mouse is currently pointing to on the left.
With this, you can find the HTML tags you want to access. Here we want to access the list of job ads and each job ad is embedded in a div tag with class="record_body". So let's find all div objects with that class.
list_of_jobs = soup.findAll("div", {"class": "record_body"})
Now we have a list, which contains all the div objects for the different jobs on that site. We can loop through that list and extract some more information
for job in list_of_jobs:
    job_info = {'posting_date': '', 'record_link': ''}
    strong_tag = job.find("strong")
    if strong_tag:
        job_info['posting_date'] = strong_tag.text
    else:
        print("WARNING: No posting date found?")

    a_tags = job.findAll("a")
    if a_tags[1]:
        job_info['institute'] = a_tags[1].text
        job_info['position'] = a_tags[1].findNext("span").text
    else:
        print("WARNING: No institute found?")
This website is an example of very badly written HTML, which makes scraping quite difficult. If there is structure in the site, with lots of attributes, like ids or classes, we can easily identify the objects we want. Here this is not the case. For example, the loop above assumes that the first strong tag is the posting date. Let's hope that the admin does not change the HTML by including more strong tags since that would break this scraper. However, I could not find a more reliable way to access the posting date.

We can also access the link to the detailed job advertisement with
job_info['record_link'] = a_tags[0].attrs['href']
So to get even more information about the job ads we can now visit the individual links
import time

for i, record in enumerate(records):
    time.sleep(1)
    soup = get_soup(record['record_link'])
    if soup is None:
        print('ERROR: No soup object received for record %d' % i)
    else:
        record_details = soup.find("div", {"class": "detailed_record_info"})
        job_description_identifier = record_details.find("strong",
                                                         text="Job description: ")
        description = job_description_identifier.parent
        if description:
            record['description'] = str(description)\
                          .replace('<strong>Job description: </strong><br/>', '')
        else:
            print('WARNING: No description found')
Here we loop over all job ads, stored in the records list, obtain the HTML code with the get_soup() function defined above and scrape the job description. The scraper waits for one second within the loop, otherwise we could put some heavy load on the server and disrupt the website performance. Again the job description has to be scraped by searching for title headings rather than HTML attributes because this website does not offer much structure.

One thing to consider when scraping a site is the robots.txt file which you can access on the root of the site (e.g. https://inspirehep.net/robots.txt). In this file, the owner of the website specifies which content should not be accessed by scrapers (robots). You don't have to follow what is written in this file, but it is a good idea to look at it, to see whether what you want to do goes against the owner's intention. If you look at the file linked above you will find that while there are some sites excluded from robot access, the jobs site on HEP is not, so we are good.

You can get the entire code discussed in this post from my GitHub account. Let me know if you have comments/questions below.
cheers
Florian