Wednesday, January 31, 2018

The N+1 problem in relational databases and eager loading options in SQLAlchemy

In the first part of this post, I will explain the N+1 problem in relational databases. The second part will show how one can deal with this problem in Flask-SQLAlchemy.

Assume you have a user table, and each user has bought a number of products, listed in a sales table. Now suppose you want a list of all products bought by each user in the last two days. You could do that by querying the user table to get all users
SELECT * FROM User;
and then, for each user, querying the products they have bought
SELECT * FROM Sales WHERE user_id = ? AND date > two_days_ago;
In other words, you run one SELECT statement for the user table followed by N additional SELECT statements to get the associated products, where N is the total number of users.

Each access to the database has a certain overhead, which means that the procedure above scales quite badly with the number of users.
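To make this concrete, here is a minimal sketch of the N+1 access pattern using Python's built-in sqlite3 module; the toy schema and all names are made up for illustration:

```python
import sqlite3

# In-memory toy database with a user and a sales table (illustrative schema).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (user_id INTEGER, product TEXT);
    INSERT INTO user VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO sales VALUES (1, 'book'), (1, 'pen'), (2, 'mug');
""")

# 1 query for the users ...
users = con.execute("SELECT id, name FROM user").fetchall()

# ... plus N queries, one per user: this loop is the N+1 problem.
products = {}
for user_id, name in users:
    rows = con.execute(
        "SELECT product FROM sales WHERE user_id = ?", (user_id,)
    ).fetchall()
    products[name] = [r[0] for r in rows]

print(products)  # {'alice': ['book', 'pen'], 'bob': ['mug']}
```

With N users, the loop issues N separate SELECT statements, each paying the full round-trip overhead.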

Given that we intend to perform the same query for each user, there should be a more efficient way to do this than N+1 calls to the database. Most object-relational mappers (ORMs), such as SQLAlchemy, give you several tools to deal with this problem. The SQLAlchemy manual discusses this issue here.

However, ORMs tend to hide the N+1 problem. If you wrote raw SQL, you would directly see how many SELECT statements you submit, while in SQLAlchemy it is not obvious when the data is loaded and how often the database is accessed. So it is quite important to understand the loading options in SQLAlchemy and use them appropriately.

Let's build the example discussed above using Flask-SQLAlchemy
sales = db.Table(
    'sales',
    db.Column('product_id', db.Integer, db.ForeignKey('product.id')),
    db.Column('user_id', db.Integer, db.ForeignKey('user.id'))
)

class User(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    ...

    products = db.relationship('Product',
                               secondary=sales)

class Product(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    date = db.Column(db.DateTime)
    ...
Here we have a User and a Product table, and the helper table 'sales' establishes a many-to-many relationship between the two. Through this helper table, the products attribute gives each user object access to all products this user has bought.

We now have several choices for how to implement the products relationship (db.relationship above). The default setting of SQLAlchemy is lazy loading, which means that the related rows in the sales table are not loaded together with the user object. This is a reasonable default, but not what we want here: lazy loading runs straight into the N+1 problem, since getting the products for each user requires another N database calls.

To solve this problem we have to make use of the lazy keyword, which by default is set to 'select', like this
products = db.relationship('Product',
                           secondary=sales,
                           lazy='select')
So let's go through the options we have for this keyword. One very useful option is lazy='dynamic' like this
products = db.relationship('Product',
                           secondary=sales,
                           lazy='dynamic')
which means that user.products actually returns a query object rather than table rows. This gives you a lot of flexibility in how the second database access might look. For example, you could add an additional filter like this
user.products.filter(Product.date > two_days_ago).all()
This can be very useful, e.g. if there are many rows in the products table, the additional filter might make the second database access much quicker. But it does not solve our problem, since it still requires a second database access. To avoid the N+1 problem we need to load the rows in the sales table together with the users.

To do this we have to make use of eager loading, and SQLAlchemy provides three choices for that: joined loading (lazy='joined'), subquery loading (lazy='subquery') and select IN loading (lazy='selectin').

Let's start with subquery loading, since it is often a good choice. Here two SELECT statements are run, one to retrieve all users and one to retrieve all related rows in the sales table. So rather than N+1, we have 2 SELECT statements.

Alternatively, joined loading squeezes everything into one SELECT statement. We save another SELECT statement compared to subquery loading, but the downside is that this can become very slow if the products table is large, since it makes use of a LEFT OUTER JOIN.

Finally, selectin loading is the newest option in SQLAlchemy and the recommended one according to the docs. It loads the related rows with an IN clause covering at most 500 parent primary keys per SELECT statement, so with many users you might end up with several SELECT statements, but in my experience this option works very well.
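To illustrate the idea behind these eager strategies, here is a sketch in plain SQL via the built-in sqlite3 module (the toy schema and names are made up for illustration): instead of one query per user, a second query fetches all related rows at once, roughly what selectin loading does.

```python
import sqlite3

# Illustrative toy schema.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (user_id INTEGER, product TEXT);
    INSERT INTO user VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO sales VALUES (1, 'book'), (1, 'pen'), (2, 'mug');
""")

# Query 1: all users.
users = con.execute("SELECT id, name FROM user").fetchall()

# Query 2: all related sales rows in one go, batched into an IN clause.
ids = [u[0] for u in users]
placeholders = ",".join("?" * len(ids))
rows = con.execute(
    "SELECT user_id, product FROM sales WHERE user_id IN (%s)" % placeholders,
    ids
).fetchall()

# Group the products by user in Python: 2 queries total instead of N+1.
by_id = dict(users)
products = {name: [] for _, name in users}
for user_id, product in rows:
    products[by_id[user_id]].append(product)

print(products)  # {'alice': ['book', 'pen'], 'bob': ['mug']}
```

The grouping step is exactly the work the ORM does for you behind the scenes when you pick an eager loading strategy.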

Note that most of the time you do not want to add this loading option to the mapping, but instead set the loading style at runtime. You can do that with joinedload(), subqueryload(), selectinload() and lazyload(), which can be imported as
from sqlalchemy.orm import joinedload, subqueryload, selectinload, lazyload
and used like 
users = User.query.options(selectinload(User.products)).all()
I hope that was useful. Let me know if you have any questions/comments below.
best
Florian

Monday, January 22, 2018

Understanding the self variable in Python classes

When defining a class in Python you will probably have encountered the self variable. Here is a simple example
class Cat(object):
    def __init__(self, hungry=False):
        self.hungry = hungry

    def should_we_feed_the_cat(self):
        if self.hungry:
            print("Yes")
        else:
            print("No")
This is a simple class representing a cat, which has an attribute hungry. The __init__() method is automatically called whenever a new object of the class is created, while the should_we_feed_the_cat() method belongs to the object and can be called at any time. This method tests the hungry attribute and prints "Yes" or "No".

Let's initialize a cat
>>> felix = Cat(True)
>>> felix.should_we_feed_the_cat()
Yes
Here we initialized the object with one argument, even though the __init__() method is defined with two. We also test whether the cat is hungry by calling should_we_feed_the_cat() without any argument, even though it is defined with one. So why is Python not complaining about this?

First, we should clarify the difference between a function and a method. A method is a function which is associated with an object. So all the functions we defined within our class above are associated with an object of that class, and hence they are methods. However, we can still call methods as functions
>>> type(Cat.should_we_feed_the_cat)
<class 'function'>
>>> type(felix.should_we_feed_the_cat)
<class 'method'>
The first call is a function call while the second is a method call. When a method is called in Python, the object itself is automatically passed as the first argument. So
Cat.should_we_feed_the_cat(felix)
is the same as
felix.should_we_feed_the_cat()
In the first case we call a function and need to pass in the object manually, while in the second case we call the method and the object is passed automatically.
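We can verify this equivalence directly. Here is a small sketch, reusing the Cat class from above (with return instead of print, to make the comparison easy):

```python
class Cat(object):
    def __init__(self, hungry=False):
        self.hungry = hungry

    def should_we_feed_the_cat(self):
        # return instead of print, so the two calls can be compared
        return self.hungry

felix = Cat(True)

# The bound method carries the object in its __self__ attribute ...
assert felix.should_we_feed_the_cat.__self__ is felix

# ... and wraps the plain function stored on the class.
assert felix.should_we_feed_the_cat.__func__ is Cat.should_we_feed_the_cat

# Hence both calls give the same result.
assert Cat.should_we_feed_the_cat(felix) == felix.should_we_feed_the_cat()
```

The __self__ and __func__ attributes make explicit how Python turns the function call with a manual object argument into a method call.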

If we get this wrong and for example call the method by passing in the object, we will get an error message which you probably have seen before (I certainly have)
>>> felix.should_we_feed_the_cat(felix)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: should_we_feed_the_cat() takes 1 positional argument but 2 were given

If you really want to avoid the self variable you can do that by using the @staticmethod decorator
class Cat(object):
    def __init__(self, hungry=False):
        self.hungry = hungry

    @staticmethod
    def other_function():
        print("This function has no self variable")
Note that self is not a reserved keyword. You could write the class above naming the first variable of each method anything you want and it would still work
class Cat(object):
    def __init__(whatever, hungry=False):
        whatever.hungry = hungry

    def should_we_feed_the_cat(whatever):
        if whatever.hungry:
            print("Yes")
        else:
            print("No")
However, self is the accepted convention and you should stick to it.

Given the very simple concept captured by the self variable, you might already be thinking about ways to get rid of it. If you have experience with other programming languages like Java, you might prefer a pre-defined keyword instead. However, that is not an option in Python. If you are interested, here is a blog post by Guido van Rossum arguing for keeping the explicit self variable.

I hope that was useful and let me know if you have questions/comments in the comments section below.
cheers
Florian

Wednesday, January 17, 2018

Lambda functions in Python

The lambda operator or lambda function is a way to create small anonymous functions, i.e. functions without a name. This construct can be useful if you need a simple function only once and you want to discard it directly after usage.

The syntax for a lambda function is
lambda arguments: expression
and they can have any number of arguments but only one expression. The expression is evaluated and returned.

Here is an example: This function squares its argument
g = lambda x: x**2
print("g(5) = ", g(5))
which gives
g(5) =  25
The same function can be defined in a conventional way
def f(x):
    return x**2
print("f(5) = ", f(5))
which gives
f(5) =  25
Both functions g() and f() do the same thing, so lambda functions can operate like any normal function and are basically an alternative to def. Given the example above you might ask what is the purpose of a lambda function if it is just a different way of defining a function?

Lambda functions can come in handy if you do not want to define a separate function. Assume you have a list and you want to find all list elements which are even
a = [2, 3, 6, 7, 9, 10, 122]
print(list(filter(lambda x: x % 2 == 0, a)))
which gives
[2, 6, 10, 122]
filter() is a built-in Python function which expects a function as the first argument and an iterable as the second; in Python 3 it returns an iterator, which is why we wrap the result in list(). The filter function discards all list elements for which the lambda function returns False and keeps all elements for which it returns True. Wherever a function is expected we can instead provide a lambda function, and for a simple task like the one above you might prefer a lambda function to a stand-alone function defined with def.

Note that many programmers don't like lambda functions and instead prefer list comprehensions, which they deem more readable. The example above written as a list comprehension looks like this
print([num for num in a if num % 2 == 0])
You see that this looks a bit easier to read since it uses well-known concepts and it does not actually use much more space.

To extend on this, let's discuss two examples where a list comprehension does not work. Assume we want to sort a list by each value modulo 10. We can do that with
print(sorted(a, key=lambda x: x % 10))
which results in
[10, 2, 122, 3, 6, 7, 9]
Another example is a function that returns a lambda function
import numpy as np

def gaussian(height, center_x, center_y, width_x, width_y):
    return lambda x, y: height*np.exp(
                -(((center_x-x)/width_x)**2+((center_y-y)/width_y)**2)/2)
The function gaussian() returns a 2D Gaussian function which expects two parameters as input and returns the value of the 2D Gaussian at these two values.
func = gaussian(1., 0.5, 0.5, 1., 1.)
print(func(0.2, 0.8))
So we first create a Gaussian distribution with the center at $x=0.5$ and $y=0.5$ and then print the value of this Gaussian at $x=0.2$ and $y=0.8$.
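To see the closure at work without needing numpy, here is a stdlib-only variant using math.exp (purely illustrative, same formula as above):

```python
from math import exp

def gaussian(height, center_x, center_y, width_x, width_y):
    # Same closure idea as above: the returned lambda remembers the
    # parameters it was created with.
    return lambda x, y: height * exp(
        -(((center_x - x) / width_x)**2 + ((center_y - y) / width_y)**2) / 2)

func = gaussian(1., 0.5, 0.5, 1., 1.)

# The center is the maximum ...
assert func(0.5, 0.5) == 1.0

# ... and points mirrored around the center give the same value.
assert abs(func(0.2, 0.8) - func(0.8, 0.2)) < 1e-12
```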

And there are many more examples where lambda functions can come in very handy. In any case, it is important to be familiar with the concept of lambda functions, even just to be able to read other people's code.

Besides the filter function, there are also map() and reduce(), which often make use of lambda functions. The map() function applies a function to all elements of a list and returns the results (as an iterator in Python 3, hence the list() call), like this
a = [2, 3, 6, 7, 9, 10, 122]
b = [121, 32, 61, 45, 78, 1, 90]
print(list(map(lambda x, y: x+y, a, b)))
which gives
[123, 35, 67, 52, 87, 11, 212]
Here I provided two arguments to the lambda function. map() works with any number of lists, as long as the function accepts one argument per list and the lists have the same length.

The reduce function is a bit different, since it calls the given function multiple times. The function fed into reduce() has to accept two arguments. It is first called on the first two elements of the list, then on the result of that call and the third element, and so on, until all list elements have been processed

from functools import reduce
a = [2, 3, 6, 7, 9, 10, 122]
print(reduce(lambda x, y: x+y, a))
which gives
159
This means that the function is called $n-1$ times if the list contains $n$ elements. The return value of the last call is the result of the reduce() function.
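The behaviour of reduce() can be sketched as an explicit loop. Here is a minimal re-implementation (my_reduce is a made-up name) applied to the sum example above:

```python
from functools import reduce

a = [2, 3, 6, 7, 9, 10, 122]

def my_reduce(function, items):
    # Start with the first element, then fold in the rest one by one:
    # the function is called len(items) - 1 times in total.
    result = items[0]
    for item in items[1:]:
        result = function(result, item)
    return result

print(my_reduce(lambda x, y: x + y, a))  # 159
print(reduce(lambda x, y: x + y, a))     # 159, the built-in agrees
```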

Lambda functions can be used anywhere, but in my case, they rarely appear outside of filter(), map() and reduce(). I hope that was useful, please leave questions/comments below.
cheers
Florian

Wednesday, January 10, 2018

Global variables in Python

Global variables in Python behave a bit differently than in other programming languages. If we define a global variable outside a function and print it inside the function, everything works fine.
x = 5

def f():
    print("x = ", x)
f()
which results in
x =  5
However, if we include a second definition of the same variable within the function, the function loses its access to the global variable
x = 5

def f():
    x = 6
    print("x = ", x)
f()
print("but the global variable has not changed... => ", x)
which gives
x =  6
but the global variable has not changed... =>  5
So if you define the same variable within the function, a new local variable is created, which is lost after the function finishes. The global variable is not affected.

However, if you try to modify the global variable within the function it will result in an error
x = 5

def f():
    x += 1
    print("x = ", x)
f()
which gives
UnboundLocalError: local variable 'x' referenced before assignment
I think this is very confusing: one can read the global variable, as the first example showed, but cannot modify it.

To actually modify the global variable we need the global keyword like this
x = 5

def f():
    global x
    x += 1
    print("x = ", x)
f()
which gives
x =  6
In other programming languages like C++, variables must be declared explicitly, so it is always clear whether you are working with the global variable or a new local one. Python has no explicit variable declarations and therefore assumes that any variable you assign inside a function has function scope, unless you explicitly state otherwise with the global keyword. When you only read a variable, however, Python falls back to the global one if no local variable of that name exists.
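As an aside, the global namespace is also accessible as a dictionary via the built-in globals() function, so the example above could be written without the global keyword (the global statement remains the idiomatic way):

```python
x = 5

def f():
    # globals() returns the module's global namespace as a dict,
    # so assigning through it modifies the global x.
    globals()['x'] += 1
    print("x = ", x)

f()  # prints: x =  6
```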

Anyway, I hope that was helpful, if you have any questions/comments, please let me know in the comment section below.
cheers
Florian

Wednesday, January 3, 2018

Plotting MCMC chains in Python using getdist

This is a quick introduction to the getdist package by Antony Lewis, which allows visualizing MCMC chains. Here is a minimal example
from getdist import plots, MCSamples
import numpy as np

def main():
    mean = [0, 0, 0]
    cov = [[1, 0, 0], [0, 100, 0], [0, 0, 8]]
    x1, x2, x3 = np.random.multivariate_normal(mean, cov, 5000).T
    s1 = np.c_[x1, x2, x3]

    mean = [0.2, 0.5, 2.]
    cov = [[0.7, 0.3, 0.1], [0.3, 10, 0.25], [0.1, 0.25, 7]]
    x1, x2, x3 = np.random.multivariate_normal(mean, cov, 5000).T
    s2 = np.c_[x1, x2, x3]

    names = ['x_1', 'x_2', 'x_3']
    samples1 = MCSamples(samples=s1, labels=names, label='sample1')
    samples2 = MCSamples(samples=s2, labels=names, label='sample2')

    g = plots.getSubplotPlotter()
    g.triangle_plot([samples1, samples2], filled=True)
    g.export('triangle_plot.png')
    g.export('triangle_plot.pdf')
    return 

if __name__ == "__main__":
    main()
which results in a triangle plot comparing the two samples (exported as triangle_plot.png).

We can also print the constraints on the individual parameters
def get_constraints(samples):
    for i, mean in enumerate(samples.getMeans()):
        upper = samples.confidence(i, upper=True, limfrac=0.05)
        print("\nupper limit 95 C.L. = %f" % upper)
        lower = samples.confidence(i, upper=False, limfrac=0.05)
        print("lower limit 95 C.L. = %f" % lower)
        print("%s = %f +/- %f +/- %f" % (samples.parLabel(i),
              mean, mean - samples.confidence(i, limfrac=0.16),
              mean - samples.confidence(i, limfrac=0.025)))
    return
which in our case looks like this
upper limit 95 C.L. = 1.653261
lower limit 95 C.L. = -1.637202
x_1 = 0.000167 +/- 0.996626 +/- 1.958270

upper limit 95 C.L. = 16.365135
lower limit 95 C.L. = -16.484502
x_2 = -0.010400 +/- 9.874866 +/- 19.682727

upper limit 95 C.L. = 4.637981
lower limit 95 C.L. = -4.656869
x_3 = -0.009280 +/- 2.827152 +/- 5.571246
While in this example we set the means and the covariance matrix by hand at the beginning to generate the samples, usually one is given an MCMC chain and wants to estimate these quantities. To get the covariance matrix between two parameters, we use the cov() function and provide the indices of the parameters
print("cov(x_1, x_3) = ", samples2.cov(pars=[0,2]))
print("cov(x_1, x_2) = ", samples2.cov(pars=[0,1]))
which in our case just reproduces the input values
cov(x_1, x_3) =  [[ 0.69364079  0.10129875]
 [ 0.10129875  7.13148423]]
cov(x_1, x_2) =  [[  0.69364079   0.30451831]
 [  0.30451831  10.05805092]]
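Under the hood, cov() simply computes the sample covariance of the chain columns. As a sketch of that computation, here is a stdlib-only version applied to two short, made-up columns (sample_cov and the data are purely illustrative):

```python
def sample_cov(xs, ys):
    # Unbiased sample covariance: sum((x - mx)*(y - my)) / (n - 1)
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# Two short, made-up "chains" for illustration.
x1 = [0.1, 0.4, 0.35, 0.8, 0.6]
x2 = [1.0, 2.1, 1.9, 3.2, 2.5]

print(sample_cov(x1, x1))  # variance of x1
print(sample_cov(x1, x2))  # covariance of x1 and x2
```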
I posted the example code above on GitHub.
best
Florian