How to Use Grouping Sets in Django

How We Cut Response Time In Half Using Advanced SQL In Django

Originally published at hakibenita.com on March 10, 2019. For a better reading experience with syntax highlighting read the article on my website.

I recently had the pleasure of optimizing an old dashboard. The solution we came up with required some advanced SQL that Django does not support out of the box. In this article I present the solution, how we got to it, and a word of caution.

The Dashboard

The dashboard is of a sales model. It includes a simple table with metrics grouped by merchants and their devices, and a summary line.

The code to produce the table looks roughly like this:

The code to produce the summary line uses the same metrics, and looks likes this:

Our admin gets a nice dashboard that looks roughly like this:

Image for post
Image for post
A summary line in Django Admin page

On how to create the dashboard above see How to Turn Django Admin Into a Lightweight Dashboard

The Problem

The dashboard was working well for about three years. We got good response times and accurate information. However, as data piled up, performance has degraded to a point where the page became unusable.

To analyze the problem, we inspected the SQL, and timed it. The query to produce the table looks like this:

At worst, this query took about 30s to complete.

The next query executed by the dashboard was used to produce the summary line:

This query took roughly the same time, ~30s. Together, at their worst, the two queries took more than a minute to complete.

Aggregate in Memory

Both queries process the exact same data, the only difference is the GROUP BY key. The first query produces results at the merchant and device level, the second produces the same aggregates for the entire dataset.

The first thought that came to mind was to calculate the summary by aggregating the results in-memory.

The first metric, total, is easy to calculate:

The second metric is the average charged amount. We can’t just sum the average for each device and merchant, we need additional information.

To calculate the average charged amount for all merchants and devices, we need to divide the total charged amount by the number of sales. We already have the number of sales, so we need to add a metric for the total charged amount:

Now that we have both total and total_charged_amount, we can compute avg_charged_amount:

We have one metric left, unique_users. This metric counts the unique number of users that visited each device at each merchant. The same user can visit several devices at different merchants. If we sum unique_users we won't get the correct metric for the entire set.

It’s impossible to accurately compute distinct values from aggregated results, so the solution must be in the database.

Aggregating in the Database

Most SQL implementations provide several useful functions to aggregate data at different levels.

NOTE: Throughout this article I use PostgreSQL. Similar functions are available in Oracle, MySQL and MSSQL. As far as I know, SQLite has no support for the functions I’m about to use.

Let’s start with some data:

The query we used in our dashboard produces the following results:

The query to produce the summary line:

Rollup

The first special GROUP BY expression is ROLLUP. As the name suggest, ROLLUP aggregate at the lowest level and up:

We grouped by two fields, device and merchant, and we got three groups of aggregation:

  1. () all
  2. (device, merchant)
  3. (device)

ROLLUP aggregates "up", so the order of the fields is significant. Let's flip the order of the fields:

This time we got the following groups:

  1. () all
  2. (merchant, device)
  3. (merchant)

Cube

The next group by expression is most likely borrowed from OLAP, which often mention cubes. The CUBE expression aggregates all possible combinations:

The results contain the following groups:

  1. () all
  2. (device, merchant)
  3. (merchant)
  4. (device)

Grouping Sets

Grouping sets allows us to provide the exact groups of aggregation we want. For example, to recreate the results of the ROLLUP above, we can provide the following grouping sets:

Each list of fields inside parentheses in the GROUPING SETS is a group in the result.

Both CUBE and ROLLUP can be implemented using GROUPING SETS. The following table shows the equivalent GROUPING SETS expression for both ROLLUP and CUBE, on two fields a and b:

  • ROLLUP(a, b) = GROUPING SETS ((a, b), (a), ())
  • CUBE(a, b)= GROUPING SETS ((a, b), (a), (b) ())

In our original query we had metrics at the merchant and device level, and we wanted to get a summary line. Using GROUPING SETS, this query will look like this:

The first 6 lines are similar to the original query. The last line is similar to the results from the summary query we used.

Using GROUPING SETS we get the results we need in just one query instead of two.

Using Grouping Sets in Django

Now that we have the query, we need to find a way to use it with Django. Unfortunately, Django still has no support for grouping sets. On top of that, the query is generated by Django Admin, and it includes predicates from list filters and date hierarchy. We couldn’t just use raw SQL.

We need to find a way to modify a given Django QuerySet, and add grouping sets to it.

Since Django has no built-in support for grouping sets, we are forced to manipulate the query. The base query we need to manipulate is the query that Django generates, along with any predicates and annotations added by Django Admin. Eventually, we want to execute the query in the database the same way Django does.

Getting the Query

A nice feature of Django QuerySet is that it provides the generated SQL:

This is a simple query, can we execute it directly in the database?

This looks like something we can work with, let’s dig deeper…

As mentioned before, the QuerySet is generated by Django Admin, and it might include predicates for list filters and date hierarchy. Let’s try to execute a query with a predicate on the sold_at date field:

Looks like Django is unable to execute the query as is. The reason for that is that the text generated by str(qs.query) is just a text representation of the query. Under the hood, Django uses proper bind variables (might also be known as substitution variables) to avoid SQL injection.

Much of the Django ORM QuerySet logic is carried out by an internal class called Query. The class is not documented. The only place to learn about it is in the source. One promising function of Query is sql_with_params. Let's use it on the query above, and see what we get:

The function sql_with_params returns a tuple. The first argument is the SQL query. The second, is a list of parameters to that query.

The keen-eyed might have spotted the placeholder %s in the query text:

This placeholder corresponds to the parameter we got in the second argument. Let’s try to execute the query with the placeholder, and the params:

Great! We are now able to execute a query as Django does. We are ready to manipulate the query.

Manipulating the Query

The query generated by Django includes a simple GROUP BY clause:

We want to replace that with the following group by clause:

This looks like a job for re.

We want catch the grouped by fields between GROUP BY and ORDER BY, and make them the first group in the GROUPING SET expression. Then, we want add the group () for the summary:

Now we can take the modified query, and execute it with the params:

Lo and behold… We got both the results and the summary line in a single query.

Conclusion

Several important issues to consider about this approach:

  • Don’t do this!: This is as bad as it gets. This approach is a nice exercise, and a great opportunity to explore the ORM internals, but the implementation is too fragile. When using an internal, undocumented API, there is no guaranty it wont change unexpectedly in the future. Having said that, we decided to use this approach in one of our internal admin pages. It’s a very specific scenario involving a queryset that’s not used for any user facing features. It helped us cut the page response time exactly in half and we are pleased with the result.
  • Make the sort order deterministic: When using GROUPING SETS (and ROLLUP or CUBE for that matter), you mix more than one level of aggregation in a single query. To be able to fetch the results in a predictable way, it's important to explicitly sort the results. For example, in the query above, to make sure the summary row is the first row, add the following sort order qs.order_by( F('merchant').desc(nulls_last=False) ).

Written by

Full Stack Developer, Team Leader, Independent. More from me at https://hakibenita.com

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store