How “Export to Excel” Almost Killed Our System

And How We Eventually Made It 4 Times Faster

Originally published at https://hakibenita.com on September 17, 2019.
Read this article on my blog with proper syntax highlighting.

A few weeks ago we had some trouble with an “Export to Excel” functionality in one of our systems. In the process of resolving this issue, we made some interesting discoveries and came up with original solutions.

This article is inspired by the actual issue we used to track this incident over a period of two days. We go through the process of identifying the problem, experimenting and benchmarking different solutions until eventually deploying to production.

These are the main takeaways described in this article:

  • Generating xlsx files can consume a significant amount of resources.
  • Under some circumstances, better performance can be gained by not using prefetch_related.
  • pyexcelerate is a fast package for creating simple Excel files.
  • tablib (and django-import-export) can be patched to use pyexcelerate and produce Excel files faster.

A few weeks ago we started getting complaints from users about slow response time from one of our systems. A quick glance at the server metrics showed higher than normal CPU usage. This system is mostly IO intensive, so high CPU usage is not something we experience regularly.

The first thing we did was use htop to identify the worker process that was consuming high CPU. After getting the process identifier (PID) of the process, we used py-spy to get a glance at what it was doing:

This command samples the process 1,000 times per second and provides a top-like view of the results:

After monitoring this view for a minute or two, we had a few insights:

  1. A lot of time is spent fetching data.
  2. A lot of time is spent on some call to prefetch_related.
  3. The problem is in the dashboard, and more specifically in the view that exports data.

With these insights, we moved on to identify the exact view. For that, we turned to the nginx access log:

We managed to identify several endpoints that were taking very long to execute. Some of them finished in just under 60 seconds, others were killed by PostgreSQL after hitting the statement_timeout limit and returned a 500 status code.

At this point we had a pretty good idea where the problem was, but we were still clueless as to why. The next step was to inspect the problematic code and try to reproduce it.

Exporting a QuerySet to Excel

The system is used to report and track violations in public transportation. During an inspection, the inspector documents different types of violations, such as a dirty bus, a bus running late, etc. The models for this system look roughly like this:
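Roughly sketched, they might look like this (the model and field names here are illustrative assumptions, not the real schema):

    from django.db import models

    class ViolationType(models.Model):
        # e.g. "Dirty bus", "Bus running late"
        name = models.CharField(max_length=100)

    class Inspection(models.Model):
        # The real model holds many more fields and relations to other tables.
        inspected_at = models.DateTimeField()
        comment = models.TextField(blank=True)

    class Violation(models.Model):
        inspection = models.ForeignKey(
            Inspection, on_delete=models.CASCADE, related_name='violations')
        violation_type = models.ForeignKey(ViolationType, on_delete=models.PROTECT)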

Every once in a while, a back office user would download the inspection information to Excel for further analysis.

[Image: Users + Excel = Love]

The report includes a lot of information about the inspection, but most importantly, it includes a list of the violation types for each inspection:

Using django-import-export

To produce the Excel report we use a package called django-import-export. Using the package, we define a ModelResource that can produce an Excel file from a queryset:
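A minimal version of such a resource, using the sketched models from above, might look like this:

    from import_export import fields, resources

    class InspectionResource(resources.ModelResource):
        violations = fields.Field()

        class Meta:
            model = Inspection
            # The real resource also exports many fields from related tables.

        def dehydrate_violations(self, inspection):
            # Accessing the related objects here issues a query per inspection.
            return ', '.join(
                violation.violation_type.name
                for violation in inspection.violations.all()
            )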

The query produced by this ModelResource causes an N+1 queries issue, so before we ever deployed it to production we patched it and added prefetch_related:
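The patch amounted to overriding export on the resource, roughly like this (a reconstruction based on the steps listed below):

    from django.db.models import Prefetch
    from import_export import resources

    class InspectionResource(resources.ModelResource):
        # ...fields and dehydrate_violations as before...

        def export(self, queryset=None, *args, **kwargs):
            queryset = queryset if queryset is not None else self.get_queryset()
            # Prefetch the violations, pulling the violation type from its table.
            queryset = queryset.prefetch_related(Prefetch(
                'violations',
                queryset=Violation.objects.select_related('violation_type'),
            ))
            # iterator() can't be combined with prefetch_related, so evaluate the
            # queryset and hand the exporter a list instead of a lazy queryset.
            return super().export(list(queryset), *args, **kwargs)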

To use prefetch_related in a ModelResource we had to make the following changes:

  1. Override export and adjust the query to prefetch the violations using prefetch_related. We used the Prefetch object because we needed to customize the prefetch query, and add the violation type name from a related table.
  2. Evaluate the query and have the export function return a list instead of a queryset. django-import-export uses iterator() to speed up the query. Using iterator(), the ORM uses a cursor to iterate over the data in chunks and reduce memory usage. While this is usually useful, Django is unable to use iterator() with prefetch_related.
  3. Add a custom dehydrate_ function for the violations field that will render a comma-delimited list of violation type names.

NOTE: This is all you need to know about prefetching in Django

The resource was used by the view to produce the Excel report:
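The view was a thin wrapper around the resource, along these lines (the filtering logic is omitted and the response handling is a plausible reconstruction):

    from django.http import HttpResponse

    def inspections_export(request):
        # The real view applies filters taken from the request; omitted here.
        queryset = Inspection.objects.all()

        data = InspectionResource().export(queryset)
        response = HttpResponse(
            data.xlsx,
            content_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
        )
        response['Content-Disposition'] = 'attachment; filename=inspections.xlsx'
        return response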

The view takes a request, applies some filters to the inspections and produces the xlsx file using the ModelResource.

Finding the Best File Format

Before we can start improving the export process, we need to establish a baseline. To get the timings and identify the hot spots in the call stack, we profiled the export function. To identify and time query execution, we turned SQL logging on in the Django settings:
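One way to do that is to send the django.db.backends logger to the console at DEBUG level:

    # settings.py
    LOGGING = {
        'version': 1,
        'disable_existing_loggers': False,
        'handlers': {
            'console': {'class': 'logging.StreamHandler'},
        },
        'loggers': {
            # Logs every SQL statement executed by the ORM, with its duration.
            # Note that Django only emits these logs when DEBUG = True.
            'django.db.backends': {
                'handlers': ['console'],
                'level': 'DEBUG',
            },
        },
    }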

The benchmark looked like this:
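It was essentially a small timing wrapper around the export path, something like:

    import time

    def benchmark(queryset):
        # Time the full export path: query, dehydration and file rendering.
        start = time.monotonic()
        data = InspectionResource().export(queryset)
        content = data.xlsx  # rendering the file is part of what we measure
        elapsed = time.monotonic() - start
        print(f'exported {data.height} rows ({len(content):,} bytes) in {elapsed:.2f}s')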

These were the results of exporting 10,000 rows in xlsx format using prefetch_related:

We ran the benchmark twice to make sure the results were not affected by caching. The function took 40s to complete, of which only 10s (25%) were spent in the database.

At this point, we suspected that the problem might be in the file format. This assumption was supported by the application server’s high CPU usage.

Next, we wanted to try the same benchmark, only instead of xlsx we produced a csv:
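With tablib this is a one-line change in the benchmark, rendering the csv attribute instead of xlsx:

    content = data.csv  # instead of data.xlsx; data.xls works the same way for the old format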

These were the results of exporting 10,000 rows in csv format using prefetch_related:

Wow! That’s a big improvement. This confirmed our suspicion that the actual production of the xlsx was the problem.

Before we moved on, we wanted to check another file format that might be more useful to our users, the old xls format:

These were the results of exporting 10,000 rows in xls format using prefetch_related:

OK, so that’s surprising. I’m not familiar with the internals of the Microsoft Office file formats, but it seems like the old format is only a little bit slower than the csv format, and much faster than the new xlsx format.

These benchmark results brought up an old dilemma. In the past we used to serve users with only csv files, but they complained a lot about trouble opening the files, and about encoding and formatting issues. This is why we decided to produce xlsx in the first place, so at that point, producing xls files seemed like the best solution.

I should tell you right away that using the old xls format was a bad decision, but we didn't know that yet.

Improving the Query

After reducing the overall execution time by half, our next targets were the queries. Two queries are executed to produce the dataset for the export. Before any changes were made, the “main” query took ~2s and the prefetch ~8s to complete.

The “main” query looked like this:

The resource used a lot of data from related tables, and the query joined ~12 tables and had many fields listed in the SELECT clause. The table is one of the main tables in the database so it is heavily indexed, and the lookup tables were relatively small so the query didn’t take long to complete.

The prefetch query looked like this:

This query seems innocent, but in fact, it took ~8s to complete. The execution plan of this query looked like this:

I trimmed the execution plan for brevity, but the Filter line was three or four pages long, filled with IDs. This got us thinking: is it possible that this huge ANY filter is what's causing us trouble?

To answer this question we decided to try implementing the query without prefetch_related, using the new Subquery expression instead.

Using Subquery, the ORM query looked like this:
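Reconstructed against the sketched models from earlier (so the names are assumptions), the code was along these lines:

    from django.contrib.postgres.aggregates import ArrayAgg
    from django.db.models import CharField, OuterRef, Subquery, Value

    # Aggregate the violation type names of each inspection into an array,
    # using a correlated subquery instead of prefetch_related.
    violations_subquery = (
        Violation.objects
        .filter(inspection_id=OuterRef('id'))
        # The dummy annotation prevents grouping by the violation's own columns.
        .annotate(dummy=Value('1', output_field=CharField()))
        .values('dummy')
        .annotate(names=ArrayAgg('violation_type__name'))
        .values('names')
    )

    inspections = Inspection.objects.annotate(violations_csv=Subquery(violations_subquery))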

If you never experimented with Subquery there is a lot to take in here. Before we break it down, this is what the query looks like:

Now, let’s break it down:

  • Subquery is a query expression that can only exist inside another query. In this case, the outer query is inspection.
  • Subquery is used in annotate, so the result of the subquery is stored in another column for each row.
  • We added a dummy annotation to prevent Django from grouping the results. The subquery is executed for each inspection; this is what the filter using OuterRef does. For this reason, we don't need to group by any other column.
  • The subquery must return at most one row, so we group the names into an array using ARRAY_AGG.

After all this hard work, we were keen to see if this was the silver bullet we were waiting for, but in fact, when we executed this on 10,000 rows it choked. To see it through, we executed the export function with only 1,000 rows.

These were the results of exporting 1,000 rows in xls format using subquery:

The query is now crazy slow. I won’t paste the execution plan because there were so many other tables, but PostgreSQL used a nested loop join on the top level of the query to produce the value for this field. Surprisingly, the database did a significantly worse job than the ORM did in this case.

Using an Iterator

Before we completely abandoned this solution, we wanted to check one last thing. We previously mentioned that django-import-export is using iterator() to create a cursor over the results. We also mentioned that using prefetch_related prevents us from using iterator(). Well, we no longer use prefetch_related so we might as well check if using iterator() makes any difference.

These were the results of exporting 1,000 rows in xls format using subquery and iterator:

The iterator made no difference.

Simplifying the Query

In a final attempt to get something out of this expedition, we wanted to see if the complexity of the query prevented PostgreSQL from finding an optimal execution plan. To do that, we could have adjusted the database parameters from_collapse_limit and join_collapse_limit and let PostgreSQL take all the time and resources it needs to find an optimal execution plan, but instead, we decided to strip all fields from the resource besides id and violations.

These were the results of exporting 1,000 rows containing only the id and violations fields in xls format using subquery and iterator:

No change, this is officially a dead end!

Manual Prefetch

After a quick lunch break we decided it was time to pull out the big guns. If Django’s prefetch implementation wasn’t working for us, and PostgreSQL was unable to produce a decent execution plan, we would just have to do it ourselves.

To implement our own “prefetch” we needed to adjust some of the other functions in the resource:
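A sketch of the adjusted resource, once again using the illustrative model names from before:

    from collections import defaultdict

    from import_export import fields, resources

    class InspectionResource(resources.ModelResource):
        violations = fields.Field()

        class Meta:
            model = Inspection

        def export(self, queryset=None, *args, **kwargs):
            queryset = queryset if queryset is not None else self.get_queryset()

            # 1. Build our own "prefetch": inspection id -> list of violation type names.
            self.prefetched_violations = defaultdict(list)
            violation_rows = (
                Violation.objects
                .filter(inspection__in=queryset.select_related(None))
                .values_list('inspection_id', 'violation_type__name')
            )
            for inspection_id, name in violation_rows:
                self.prefetched_violations[inspection_id].append(name)

            # 2. Hand the original, lazy queryset back so the exporter can use iterator().
            return super().export(queryset, *args, **kwargs)

        def dehydrate_violations(self, inspection):
            # 3. The "lookup" part of our manual prefetch.
            return ', '.join(self.prefetched_violations[inspection.id])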

This looks like a lot, but it’s actually not:

  1. We create our own “prefetch related” dict, prefetched_violations:
  • The key is the inspection ID, and the value is an array containing the violation names (violations_csv).
  • To fetch only the relevant violations, we filter using the queryset so only the necessary inspections are included.
  • We execute select_related(None) to remove all previously set select_related tables, and make the ORM remove any unnecessary joins.
  2. We return the original queryset to the export function, which produces the Excel file.
  3. To construct the value for the violations field, we use the prefetched_violations we populated during export. This is the "lookup" part of the prefetch. While using Django's prefetch_related we have access to this value on the instance; when we do it manually, we have to look it up ourselves.
  4. Once again, since we no longer use Django’s prefetch_related, we were able to use an iterator. So, instead of evaluating the query, we return a queryset.

We had already been disappointed once after putting in a lot of effort, so let’s see if this time the hard work paid off.

These were the results of exporting 10,000 rows in xls format using manual prefetch and iterator:

Compared to the 40 seconds we started with, this is an overall 75% improvement. 20s were saved by switching to the xls format, and another 10s by doing the prefetch manually.

We are ready for production!

Trouble in Paradise

Shortly after rolling out the new version to production, we started getting complaints from users who were unable to open the file.

Remember I told you using xls was a bad idea? Well, when users started downloading the xls files they got a nasty message saying the file is corrupt, and that Excel, thank god, managed to salvage some of the data (which is way worse!).

One might ask, “But how come you didn’t catch this in QA?” Well, that’s just another reason we hate working with Excel. When we tested it locally on our Linux desktops using LibreOffice, it worked just fine.

[Image: It works on my machine]

So let’s recap:

  • xlsx is slow and consumes a lot of CPU.
  • xls is not supported by the Excel version used by our users.
  • csv has many encoding and formatting issues, and proved to be unusable in the past.

Using a Different Excel Writer

As always, when all options suck and the future is looking bleak, we turned to Google.

A quick search of “python excel performance” brought up this gist which compares 4 different Excel writers in Python (gotta love the internet!).

These are the benchmark results:

According to the results, there is a big difference between the xlsx libraries.

As mentioned before, we use django-import-export to produce Excel files from Django models and querysets. Under the hood, django-import-export is using the popular tablib package to do the actual export.

Tablib offers export and import capabilities to and from many formats, but it doesn’t do any of the heavy lifting itself. To produce xlsx files, tablib uses the openpyxl package.

A Faster Excel Writer in Python

Looking back at the benchmark results, openpyxl is the slowest among all packages. It looks like by switching to the fastest implementation, pyexcelerate, we might be able to gain a significant improvement in this export process.

The pyexcelerate package looked great from the start. The tagline is just what we needed:

PyExcelerate is a Python library for writing Excel-compatible XLSX spreadsheet files, with an emphasis on speed.

Even the snarky subtitles in the “Usage” section of the README were just what we wanted: fast, faster and fastest!

With such promising benchmarks and README, we had to try it out!

Patching tablib

We already had an entire system built on top of django-import-export and tablib, and we didn't want to start making changes everywhere. So instead, we looked for a way to patch tablib and make it use pyexcelerate instead of openpyxl.

After some digging, we found that tablib uses an internal function called _register_formats to add export and import formats such as csv, xls and xlsx. To get a list of available formats, tablib imports a collection called available from the formats module. The contents of the file formats/__init__.py, where the collection is defined, look like this:
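Paraphrased (and trimmed) from the tablib version we were using, it looks roughly like this:

    from . import _csv as csv
    from . import _json as json
    from . import _xls as xls
    from . import _yaml as yaml
    from . import _tsv as tsv
    from . import _html as html
    from . import _xlsx as xlsx
    # ...a few more formats trimmed for brevity...

    available = (json, xls, yaml, csv, tsv, html, xlsx)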

The interesting part is the contents of the file _xlsx.py. The file defines some functions to export and import from Excel using openpyxl.

To patch tablib, we first need to implement a similar interface to the one in _xlsx.py using pyexcelerate, and then register it in tablib.

Let’s start with implementing _xlsx.py using pyexcelerate:
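Here is a simplified sketch of such a module (the file name fast_xlsx.py and some of the details are ours); it exposes the title, extensions and export_set hooks that tablib looks for in a format module:

    # fast_xlsx.py -- a minimal xlsx export format for tablib, backed by pyexcelerate
    import tempfile

    from pyexcelerate import Workbook

    title = 'xlsx'
    extensions = ('xlsx',)

    def export_set(dataset):
        """Return the given tablib Dataset rendered as xlsx bytes."""
        workbook = Workbook()

        rows = []
        if dataset.headers:
            rows.append(list(dataset.headers))
        rows.extend(list(row) for row in dataset)

        workbook.new_sheet(dataset.title or 'Sheet 1', data=rows)

        # Write to a temporary file and return its contents as bytes.
        with tempfile.NamedTemporaryFile(suffix='.xlsx') as f:
            workbook.save(f.name)
            f.seek(0)
            return f.read()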

This is a simple implementation of the main functions. It lacks some functionality, such as support for multiple sheets, but it was fine for our needs.

Next, we need to make tablib register this file instead of the existing xlsx format. To do that, we created a new file called monkeypatches.py:
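Roughly (module paths and names are ours):

    # monkeypatches.py
    import tablib

    from . import fast_xlsx  # the pyexcelerate-backed module sketched above

    # Append our format so it overrides the built-in xlsx implementation.
    tablib.formats.available += (fast_xlsx,)

    # Depending on the tablib version, the built-in formats may already have been
    # registered at import time; if so, the internal registration has to be re-run,
    # e.g. tablib.core.Dataset._register_formats()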

To apply the patch to tablib, we import our implementation and add it to the available formats list. We then import this file in the module's __init__.py so every time the system starts up, tablib is patched.

Now for the moment of truth: did all this hard work finally pay off?

These were the results of exporting 10,000 rows in xlsx format with pyexcelerate using manual prefetch and iterator:

The hard work definitely paid off! Just so we have an honest comparison, these are the results of exporting 10,000 rows in xlsx format without patching tablib using manual prefetch and iterator:

That’s a 64% improvement compared to the default implementation provided by tablib, and a 75% improvement compared to the 40s we started with.

Results Summary

This is a summary of all the results mentioned in the article:

[Image: Summary of different methods to produce Excel files in Python]

Seifa

We try to study every incident and take actions to prevent similar incidents from happening in the future. During this incident, some of our users did experience slowness for a short period of time; however, the “Export to Excel” functionality did not really kill our system.

Following this incident, there are a few open questions we haven’t had the chance to fully explore yet:

  • Why was the prefetch query so slow? The difference boils down to executing Model.objects.filter(fk__in=[1, 2, 3, ..., 9999, 10000]) vs executing Model.objects.filter(fk__in=OtherModel.objects.filter( ... ).values_list('pk')). When we tried to compare the two in the database, we found no difference, but the built-in prefetch_related was significantly slower. Is it possible that time is being spent generating the query in Python?
  • Can openpyxl3 performance be improved? When I talked to John, the author of the Excel writers benchmark, he mentioned that openpyxl3 can be faster if lxml is installed.
  • Is xlsx really the best format? Can we eliminate some of the problems we had with csv by switching to a different textual format such as tsv?

If you have the answer to any of these questions, feel free to share it with me and I’ll be happy to post the response.

UPDATED: Aug 19, 2019

A reader from lobste.rs ran a quick benchmark to check how much faster openpyxl can get using lxml. These were his results:

This benchmark shows that openpyxl can be made almost twice as fast just by installing lxml. However, pyexcelerate improved the speed by a factor of 3.

Many readers on Reddit and Lobsters suggested that a better approach would be to generate the Excel file on the client side using JavaScript. This is definitely something worth considering when designing a new system, even though I think this approach might be problematic for very large files.

