Wednesday, January 31, 2018

The N+1 problem in relational databases and eager loading options in SQLalchemy

In the first part of this post, I will explain the N+1 problem in relational databases. The second part will show how one can deal with this problem in Flask-SQLalchemy.

Assume you have a user table and each user has bought a number of products, listed in a sales table. Let's assume you want a list of all products bought by users in the last 2 days. You can do that by querying the user table to get all users
SELECT * FROM User;
and then for each user, you query the number of products they have bought
SELECT * FROM Sales WHERE user_id = ? AND date > two_days_ago
In other words, you have one Select statement for the user table followed by N additional Select statements to get the associated products, where N is the total number of users.

Each access to the database has a certain overhead which means that the procedure above will scale quite badly with the number of users.

Give that, for each user we intend to perform the same query, there should be a way to do this more efficiently rather than using N+1 calls to the database. Most object-relational mappers (ORMs) such as SQLalchemy, give you several tools to deal with this problem. The SQLalchemy manual discusses this issue here.

However, I certainly noticed that ORMs hide the N+1 problem. If you would write pure SQL, you would directly see the number of Select statements you submit, while in SQLalchemy, it is not obvious when the data is loaded and how many times the database is accessed. So it is quite important to understand the loading options in SQLalchemy and use them appropriately.

Let's build the example discussed above using Flask-SQLalchemy
class User(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    ...

    products = db.relationship('product',
                               secondary=sales)

class Product(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    date = db.Column(db.DateTime)
    ...

sales = db.Table(
    'sales',
    db.Column('product_id', db.Integer, db.ForeignKey('product.id')),
    db.Column('user_id', db.Integer, db.ForeignKey('user.id'))
)
Here we have a User and a Product table and the helper table 'sales' establishes a many-to-many relationship between the two tables. Using the helper table I added a products attribute to each user object, which allows us to access all products this user bought.

We now have two main choices for how we can implement the products relationship (db.relationship above). The default setting of SQLalchemy is lazy loading, which means that the related rows in the sales table are not loaded together with the user object. This is a reasonable choice for a default setting, but not what we want here. Lazy loading means we will run into the N+1 problem since whenever we want to have the products for each user we have to run another N database calls.

To solve this problem we have to make use of the lazy keyword, which by default is set to 'select' like this
products = db.relationship('product',
                           secondary=sales,
                           lazy='select')
So let's go through the options we have for this keyword. One very useful option is lazy='dynamic' like this
products = db.relationship('product',
                           secondary=sales,
                           lazy='dynamic')
which means that user.products actually returns a query object rather than table rows. This gives you a lot of flexibility how the second database access might look. For example, you could add additional filters like this
user.products.filter_by(Products.date > two_days_ago).all()
This can be very useful e.g. if there are many rows in the products table, the additional filter might make the second database access much quicker. But this is of course not solving our problem since it definitely needs a second database access. To avoid the N+1 problem we need to load the rows in the sales table together with the users.

To do this we have to make use of eager loading and SQLalchemy provides three choices for that: joined loading (lazy='joined'), subquery loading (lazy='subquery') and select IN loading (lazy='selectin').

Let's start with the subquery loading since that one is often a good choice. Here two Select statements are run, one to retrieve all users and one to retrieve all related rows in the sales table. So rather N+1 we have 2 Select statements.

Alternatively, joined loading squeezes everything into one Select statement. So we save another Select statement compared to subquery loading, but the downside is that if the products table is large, this can become very slow since it makes use of an out left join.

Finally, the selectin loading is the newest option in SQLalchemy and the recommended one according to the docs. It only queries 500 rows per Select statement, so if you need to retrieve many products, you might end up with several Select statements, but in my experience, this option works very well.

Note that most of the time you do not want to add this loading option to the mapping, but instead set the loading style at runtime. You can do that with joinedload(), subqueryload(), selectinload() and lazyload(), which can be imported as
from sqlalchemy.orm import subqueryload, joinedload, selectinload
and used like 
users = User.query.options(selectinload(User.products)).all()
I hope that was useful. Let me know if you have any question/comments below.
best
Florian

No comments:

Post a Comment