How “Export to Excel” Almost Killed Our System

And How We Eventually Made It 4 Times Faster

Originally published at https://hakibenita.com on September 17, 2019.
Read this article on my blog with proper syntax highlighting.

A few weeks ago we had some trouble with an “Export to Excel” functionality in one of our systems. In the process of resolving this issue, we made some interesting discoveries and came up with original solutions.

This article is inspired by the actual issue we used to track this incident over a period of two days. We go through the process of identifying the problem, experimenting and benchmarking different solutions until eventually deploying to production.

These are the main takeaways described in this article:

  • Generating xlsx files can consume a significant amount of resources.
  • Under some circumstances, better performance can be gained by not using prefetch_related.
  • pyexcelerate is a fast package for creating simple Excel files.
  • tablib (and django-import-export) can be patched to use pyexcelerate and produce Excel files faster.

A few weeks ago we started getting complaints from users about slow response time from one of our systems. A quick glance at the server metrics showed higher than normal CPU usage. This system is mostly IO intensive, so high CPU usage is not something we experience regularly.

The first thing we did was to identify the worker process that was consuming high CPU using htop. After getting the process identifier (PID) of the process, we used py-spy to get a glance at what it's doing:

$ py-spy -p 8187 -r 1000

This command samples the process 1,000 times per second and provides a top-like view of the results:

Total Samples 17974
GIL: 0.00%, Active: 0.00%, Threads: 1

OwnTime TotalTime Function (filename:line)
0.000s 173.7s get_response (django/core/handlers/base.py:75)
0.000s 173.7s inner (django/core/handlers/exception.py:34)
0.000s 173.7s __call__ (django/utils/deprecation.py:94)
0.000s 173.7s __call__ (django/core/handlers/wsgi.py:141)
0.000s 173.6s view (django/views/generic/base.py:71)
0.000s 173.6s _get_response (django/core/handlers/base.py:113)
0.000s 173.6s dispatch (django/contrib/auth/mixins.py:52)
0.000s 173.6s dispatch (django/contrib/auth/mixins.py:109)
0.000s 173.6s dispatch (django/views/generic/base.py:97)
0.050s 173.6s get (dashboard/views/list_views.py:100)
0.000s 94.69s get_resource_to_export (dashboard/views/list_views.py:70)
0.000s 94.69s export (dashboard/views/list_views.py:73)

0.000s 94.68s export (dashboard/resources.py:215)
0.000s 83.81s __iter__ (django/db/models/query.py:274)
0.040s 82.73s _fetch_all (django/db/models/query.py:1242)
0.000s 78.84s export (dashboard/views/list_views.py:74)
0.000s 70.58s __iter__ (django/db/models/query.py:55)
0.000s 68.98s execute_sql (django/db/models/sql/compiler.py:1100)
68.81s 68.81s _execute (django/db/backends/utils.py:84)
0.000s 68.81s _execute_with_wrappers (django/db/backends/utils.py:76)
0.000s 68.81s execute (django/db/backends/utils.py:67)
0.000s 50.11s save (tablib/packages/openpyxl3/workbook.py:186)
0.000s 50.11s export_set (tablib/formats/_xlsx.py:46)
0.000s 46.41s save (tablib/packages/openpyxl3/writer/excel.py:124)
0.000s 46.41s save_workbook (tablib/packages/openpyxl3/writer/excel.py:141)

0.000s 42.40s _fetch_all (django/db/models/query.py:1244)
0.000s 42.40s _prefetch_related_objects (django/db/models/query.py:771)
0.000s 42.38s prefetch_related_objects (django/db/models/query.py:1625)
0.000s 41.94s prefetch_one_level (django/db/models/query.py:1738)
0.000s 41.25s get_prefetch_queryset (django/db/models/fields/related_descriptors.py:627)
0.000s 32.30s _write_worksheets (tablib/packages/openpyxl3/writer/excel.py:91)

After monitoring this view for a minute or two, we had a few insights:

  1. A lot of time is spent fetching data.
  2. A lot of time is spent on some call to prefetch_related.
  3. The problem is in the dashboard, and more specifically in the view that exports data.

With these insights, we moved on to identify the exact view. We turned to the nginx access log:

$ journalctl -u nginx -r | grep dashboard

We managed to identify several endpoints that were taking very long to execute. Some of them finished in just under 60 seconds; others were killed by PostgreSQL after hitting the statement_timeout limit and returned a 500 status code.
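For reference, this kind of guardrail comes from PostgreSQL's statement_timeout setting. One way to apply it per connection with the PostgreSQL backend is through the database OPTIONS in the Django settings — a sketch with an illustrative 30-second value, not our actual configuration:

```python
# settings.py (sketch; the timeout value is illustrative)
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'app',
        # Abort any statement running longer than 30 seconds (value in ms).
        'OPTIONS': {
            'options': '-c statement_timeout=30000',
        },
    },
}
```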

At this point we had a pretty good idea where the problem was, but we were still clueless as to why. The next step was to inspect the problematic code and try to reproduce the issue.

Exporting a QuerySet to Excel

The system is used to report and track violations in public transportation. During an inspection, the inspector documents different types of violations such as dirty bus, bus running late etc. The models for this system look roughly like this:

class ViolationType(models.Model):
    name = models.TextField()


class Inspection(models.Model):
    id = models.AutoField(primary_key=True)


class Violation(models.Model):
    inspection = models.ForeignKey(Inspection, on_delete=models.PROTECT)
    violation_type = models.ForeignKey(ViolationType, on_delete=models.PROTECT)
    comments = models.TextField()

Every once in a while, a back office user would download the inspection information to Excel for further analysis.

Users + Excel = Love

The report includes a lot of information about the inspection, but most importantly, it includes a list of the violation types for each inspection:

inspection, violations
1, dirty floors | full trash can
2, full trash can | no light | missing signs
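Conceptually, that second column is a grouping problem: collapse flat (inspection, violation) rows, as a join would return them, into one delimited cell per inspection. A toy sketch in plain Python (sample data only, not the real models):

```python
from itertools import groupby
from operator import itemgetter

# Flat (inspection_id, violation_name) rows, ordered by inspection_id
# so groupby can collapse consecutive rows with the same key.
rows = [
    (1, 'dirty floors'), (1, 'full trash can'),
    (2, 'full trash can'), (2, 'no light'), (2, 'missing signs'),
]

report = [
    (inspection_id, ' | '.join(name for _, name in group))
    for inspection_id, group in groupby(rows, key=itemgetter(0))
]
print(report)
# [(1, 'dirty floors | full trash can'), (2, 'full trash can | no light | missing signs')]
```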

Using django-import-export

To produce the Excel report we use a package called django-import-export. Using the package, we define a ModelResource that can produce an Excel file from a queryset:

from import_export import resources, fields, widgets

from .models import Inspection, Violation


class InspectionResource(resources.ModelResource):
    violations = fields.Field(
        widget=widgets.ManyToManyWidget(Violation, field='violation_type')
    )

    class Meta:
        model = Inspection
        fields = (
            'id',
            'violations',
        )

The query produced by this resource causes an N+1 queries problem, so before we ever deployed it to production we patched it and added prefetch_related:

from django.db.models import Prefetch
from import_export import resources, fields, widgets

from .models import Inspection, Violation


class InspectionResource(resources.ModelResource):
    violations = fields.Field(
        widget=widgets.ManyToManyWidget(Violation, field='violation_type')
    )

    class Meta:
        model = Inspection
        fields = (
            'id',
            'violations',
        )

    def export(self, queryset=None):
        queryset = (
            queryset
            .prefetch_related(Prefetch(
                'violations',
                queryset=Violation.objects.select_related('violation_type'),
                to_attr='prefetched_violations',
            ))
        )
        return super().export(list(queryset))

    def dehydrate_violations(self, inspection: Inspection) -> str:
        return '\n'.join(
            v.violation_type.name
            for v in inspection.prefetched_violations
        )

To use prefetch_related in a ModelResource we had to make the following changes:

  1. Override export and adjust the query to prefetch the violations using prefetch_related. We use the Prefetch object because we needed to customize the prefetch query, and add the violation type name from a related table.
  2. Evaluate the query and have the export function return a list instead of a queryset. django-import-export uses iterator() to speed up the query. Using iterator(), the ORM opens a cursor and iterates over the data in chunks to reduce memory use. While this is usually useful, Django is unable to use iterator() with prefetch_related.
  3. Add a custom dehydrate_violations function for the violations field that renders a newline-delimited list of violation type names.

NOTE: This is all you need to know about prefetching in Django
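The N+1 problem that the prefetch avoids can be illustrated without Django at all. A toy "database" with a query counter (all names here are made up for illustration):

```python
# Toy table and a query counter standing in for the database.
violations_by_inspection = {
    1: ['dirty floors', 'full trash can'],
    2: ['full trash can', 'no light'],
}
query_count = 0

def fetch_violations(inspection_id):
    """One query per call - the N+1 pattern."""
    global query_count
    query_count += 1
    return violations_by_inspection.get(inspection_id, [])

def fetch_violations_bulk(inspection_ids):
    """A single query for all inspections - what prefetching does."""
    global query_count
    query_count += 1
    return {i: violations_by_inspection.get(i, []) for i in inspection_ids}

inspections = [1, 2]

# N+1: one query per inspection.
naive = {i: fetch_violations(i) for i in inspections}
assert query_count == len(inspections)

# Prefetch: one query, regardless of the number of inspections.
query_count = 0
prefetched = fetch_violations_bulk(inspections)
assert query_count == 1
assert naive == prefetched
```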

The resource was used by the view to produce the Excel report:

from django.http import HttpRequest, HttpResponse

from .models import Inspection
from .resources import InspectionResource

LIMIT = 10000


def export_to_excel(request: HttpRequest) -> HttpResponse:
    inspections = Inspection.objects.all()
    # Apply some filter on the queryset based on request
    data = InspectionResource().export(inspections[:LIMIT]).xlsx
    response = HttpResponse(data, content_type='text/xlsx')
    response['Content-Disposition'] = 'attachment; filename=export.xlsx'
    return response

The view takes a request, applies some filters on the inspections and produces the xlsx file using the resource.

Finding the Best File Format

Before we can start improving the export process, we need to establish a baseline. To get the timings and identify the hot spots in the call stack we used cProfile. To identify and time query execution we turned SQL logging on in the Django settings:

# settings.py

LOGGING = {
    # ...
    'loggers': {
        'django.db.backends': {
            'level': 'DEBUG',
        },
        # ...
    },
}

The benchmark looked like this:

import cProfile

from .models import Inspection
from .resources import InspectionResource

qs = Inspection.objects.all()[:10000]
cProfile.run('InspectionResource().export(qs).xlsx')

These were the results of exporting 10,000 rows in xlsx format using django-import-export:

56830808 function calls (47142920 primitive calls) in 41.574 seconds
select 5.009
prefetch 8.009

56660555 function calls (47149065 primitive calls) in 39.927 seconds
select 2.356
prefetch 7.991

We ran the benchmark twice to make sure the results were not affected by caches. The export took 40s to complete, and only 10s of it (25%) were spent in the database.
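The same cProfile pattern works on any callable. A self-contained sketch with a toy workload standing in for the export:

```python
import cProfile
import io
import pstats

def workload():
    # Stand-in for the export: some CPU-bound work.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
result = profiler.runcall(workload)

# Print the most expensive functions, sorted by cumulative time.
out = io.StringIO()
stats = pstats.Stats(profiler, stream=out)
stats.sort_stats('cumulative').print_stats(5)
print(out.getvalue())
```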

At this point, we suspected that the problem might be in the file format. This assumption was supported by the application server’s high CPU usage.

Next, we wanted to try the same benchmark, only instead of xlsx we produced a csv:

cmd = 'InspectionResource().export(qs).csv'
cProfile.run(cmd)

These were the results of exporting 10,000 rows in csv format using django-import-export:

9179705 function calls (9107672 primitive calls) in 17.429 seconds
select 1.970
prefetch 8.343

Wow! That’s a big improvement. This confirmed our suspicion that the actual production of the xlsx was the problem.

Before we moved on, we wanted to check another file format that might be more useful to our users, the old xls format:

cmd = 'InspectionResource().export(qs).xls'
cProfile.run(cmd)

These were the results of exporting 10,000 rows in xls format using django-import-export:

16317592 function calls (15745704 primitive calls) in 20.694 seconds 
select 1.922
prefetch 7.976

OK, so that’s surprising. I’m not familiar with the internals of the Microsoft Office file formats, but it seems like the old format is only a little bit slower than the csv format, and much faster than the new xlsx format.

These benchmark results brought up an old dilemma. In the past we used to serve users with only csv files, but they complained a lot about trouble opening the files, and about encoding and formatting issues. This is the reason we decided to produce xlsx in the first place, so at that time, producing xls files seemed like the best solution.

I should tell you already: using the old xls format was a bad decision, but we didn't know that yet.

Improving the Query

After reducing the overall execution time by half, our next targets were the queries. Two queries are executed to produce the dataset for the export. Before any change is made, it took the “main” query ~2s and the prefetch ~8s to complete.

The “main” query looked like this:

SELECT
    "inspection"."id",
    -- around 50 more fields from joined tables
FROM
    "inspection"
    INNER JOIN "auth_user" ON ("inspection"."user_id" = "auth_user"."id")
    -- around 11 more joined tables

The resource used a lot of data from related tables: the query joined ~12 tables and had many fields listed in its SELECT clause. The inspection table is one of the main tables in the database so it is heavily indexed, and the lookup tables were relatively small, so the query didn't take long to complete.

The prefetch query looked like this:

SELECT
    "violation"."id",
    "violation"."inspection_id",
    "violation"."violation_type_id",
    "violation_type"."id",
    "violation_type"."name"
FROM "violation"
INNER JOIN "violation_type" ON (
    "violation"."violation_type_id" = "violation_type"."id"
)
WHERE
    "violation"."inspection_id" = ANY([
        2814, 9330, 8848, 8971, 9372, 9084, 78, 3896,
        -- another 10,000 inspection IDs
        1399, 9348, 914, 8884, 9082, 3356, 2896, 742
    ])

This query seems innocent, but in fact, it took ~8s to complete. The execution plan of this query looked like this:

Nested Loop (cost=1000.28..2040346.39 rows=26741 width=181)
-> Gather (cost=1000.00..2032378.29 rows=26741 width=115)
Workers Planned: 2
-> Parallel Seq Scan on violation (cost=0.00..2028704.19 rows=11142 width=115)
Filter: (inspection_id = ANY ('{2814,9330,....,8926,9153}'::integer[]))
-> Index Scan using violationtype_pkey on violationtype (cost=0.28..0.30 rows=1 width=66)
Index Cond: (id = violation.violation_type_id)

I trimmed the execution plan for brevity, but the Filter line alone was three or four pages long, filled with IDs. This got us thinking: is it possible that this huge ANY filter is what's causing us trouble?

To answer this question we decided to try to implement the query without prefetch_related. Instead, we decided to use the new Subquery expression.

Using Subquery, the query in the ORM looked like this:

from django.db.models import OuterRef, Subquery, Value
from django.contrib.postgres.aggregates import ArrayAgg

inspections = inspections.annotate(violations_csv=Subquery(
    Violation.objects
    # Reference the inspection ID of the outer table, inspection.
    .filter(inspection_id=OuterRef('id'))
    # Prevent Django from adding a group by column.
    .annotate(dummy=Value('1')).values('dummy')
    # Construct an array of violation names.
    .annotate(violations=ArrayAgg('violation_type__name', distinct=True))
    .values('violations')
))

If you've never experimented with Subquery there is a lot to take in here. Before we break it down, this is what the query looks like:

SELECT
    "inspection"."id",
    (
        SELECT
            ARRAY_AGG(DISTINCT U2."name") AS "violations"
        FROM
            "violation" U0
            INNER JOIN "violationtype" U2 ON (U0."violation_type_id" = U2."id")
        WHERE
            U0."inspection_id" = ("inspection"."id")
    ) AS "violations_csv",
    -- around 50 more fields from joined tables
FROM
    "inspection"
    INNER JOIN "auth_user" ON ("inspection"."user_id" = "auth_user"."id")
    -- around 11 more joined tables

Now, let’s break it down:

  • Subquery is a query expression that can only exist inside another query. In this case, the outer query is inspection.
  • Subquery is used in annotate, so the result of the subquery is stored in another column for each row.
  • We added a dummy annotation to prevent Django from grouping the results. The subquery is executed for each inspection; this is what the filter using OuterRef does. For this reason, we don't need to group by any other column.
  • The subquery must return at most one row, so we group the names into an array using ARRAY_AGG.
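The semantics of the correlated subquery — aggregate the matching inner rows for every outer row — can be mimicked in plain Python. This toy version (made-up data) also hints at the cost: the inner scan runs once per inspection, much like a nested-loop plan when no index is usable:

```python
inspections = [1, 2, 3]
violations = [  # (inspection_id, violation_type_name)
    (1, 'dirty floors'), (1, 'full trash can'),
    (2, 'no light'),
]

# One pass over `violations` per inspection:
# O(len(inspections) * len(violations)) overall.
annotated = {
    inspection_id: sorted({name for i, name in violations if i == inspection_id})
    for inspection_id in inspections
}
print(annotated)
# {1: ['dirty floors', 'full trash can'], 2: ['no light'], 3: []}
```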

After all this hard work, we were keen to see if this was the silver bullet we were waiting for, but in fact, when we executed it on 10,000 rows it choked. To see it through, we executed the export function with only 1,000 rows.

These were the results of exporting 1,000 rows in xls format using subquery:

1571053 function calls (1514505 primitive calls) in 60.962 seconds
select 59.917

The query is now crazy slow. I won’t paste the execution plan because there were so many other tables, but PostgreSQL used a nested loop join on the top level of the query to produce the value for this field. Surprisingly, the database did a significantly worse job than the ORM did in this case.

Using an Iterator

Before we completely abandoned this solution, we wanted to check one last thing. We previously mentioned that django-import-export uses iterator() to create a cursor over the results. We also mentioned that using prefetch_related prevents us from using iterator(). Well, we no longer use prefetch_related, so we might as well check if using iterator() makes any difference.

These were the results of exporting 1,000 rows in xls format using subquery and iterator:

1571580 function calls (1514788 primitive calls) in 62.130 seconds 
select 60.618

The iterator made no difference.

Simplifying the Query

In a final attempt to get something out of this expedition, we wanted to see if the complexity of the query prevented PostgreSQL from finding an optimal execution plan. To do that, we could have adjusted the database parameters from_collapse_limit and join_collapse_limit and let PostgreSQL take all the time and resources it needs to find an optimal execution plan, but instead, we decided to strip all other fields from the resource besides id and violations.

These were the results of exporting 1,000 rows containing only the id and violations fields in xls format using subquery and iterator:

6937 function calls (6350 primitive calls) in 57.280 seconds 
select 57.255

No change, this is officially a dead end!

Manual Prefetch

After a quick lunch break we decided it was time to pull out the big guns. If Django's prefetch implementation wasn't working for us, and PostgreSQL was unable to produce a decent execution plan, we would just have to do it ourselves.

To implement our own “prefetch” we needed to adjust some of the other functions in the resource:

from django.contrib.postgres.aggregates import ArrayAgg
from import_export import resources, fields

from .models import Inspection, Violation


class InspectionResource(resources.ModelResource):
    violations = fields.Field()

    class Meta:
        model = Inspection
        fields = (
            'id',
            'violations',
        )

    def export(self, queryset=None):
        # Manually prefetch the violations.
        self.prefetched_violations = dict(
            Violation.objects
            .filter(inspection_id__in=(
                queryset
                # Clean all joins.
                .select_related(None)
                .values_list('pk')
            ))
            # Group by inspection.
            .values('inspection_id')
            .annotate(
                violations_csv=ArrayAgg('violation_type__name'),
            )
            .values_list(
                'inspection_id',
                'violations_csv',
            )
        )

        return super().export(queryset)

    def dehydrate_violations(self, inspection: Inspection) -> str:
        return '\n'.join(self.prefetched_violations.get(inspection.id, []))

This looks like a lot, but it’s actually not:

  1. We create our own “prefetch related” dict, prefetched_violations:
  • The key is the inspection ID, and the value is an array of violation names (violations_csv).
  • To fetch only the relevant violations, we filter using inspection_id__in with only the necessary inspection IDs.
  • We executed select_related(None) to remove all previously set select_related tables, and make the ORM remove any unnecessary joins.
  2. We return the original queryset to the export function which produces the Excel file.
  3. To construct the value for the violations field, we use the prefetched_violations dict we populated during export. This is the "lookup" part of the prefetch. While using Django's prefetch_related we have access to this value on the instance; when we do it manually we have to look it up ourselves.
  4. Once again, since we no longer use Django's prefetch_related we were able to use an iterator. So, instead of evaluating the query, we return a queryset.
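Stripped of the ORM, the lookup part of this manual prefetch is just a dictionary built from (inspection_id, names) pairs, with a default for inspections that have no violations — a minimal sketch with made-up data:

```python
# What the values_list query returns: (inspection_id, aggregated names) pairs.
pairs = [
    (1, ['dirty floors', 'full trash can']),
    (2, ['no light']),
]
prefetched_violations = dict(pairs)

def dehydrate_violations(inspection_id):
    # .get with a default covers inspections without violations.
    return '\n'.join(prefetched_violations.get(inspection_id, []))

print(dehydrate_violations(1))  # dirty floors / full trash can, newline-separated
```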

We were already disappointed after putting in a lot of effort the last time; let's see if this time the hard work paid off.

These were the results of exporting 10,000 rows in xls format using manual prefetch and iterator:

15281887 function calls (14721333 primitive calls) in 11.411s
select 0.833
manual prefetch 0.107

Compared to the 40 seconds we started with, this is an overall 75% improvement. 20s were saved by switching to the xls format, and another 10s by doing the prefetch manually.

We are ready for production!

Trouble in Paradise

Soon after rolling out the new version to production, we started getting complaints from users who were unable to open the file.

Remember I told you using xls was a bad idea? Well, when users started downloading the xls files they got a nasty message saying the file is corrupt, and Excel, thank god, managed to salvage some of the data (which is way worse!).

One might ask, “but how come you didn’t catch this in QA?”. Well, that’s just another reason we hate working with Excel. When we tested it locally on our Linux desktops using LibreOffice, it worked just fine.

It works on my machine

So let’s recap:

  • xlsx is slow and consumes a lot of CPU.
  • xls is not supported by the Excel version used by our users.
  • csv has many encoding and formatting issues, and proved to be unusable in the past.

Using a Different Excel Writer

As always, when all options suck and the future is looking bleak, we turned to Google.

A quick search of “python excel performance” brought up this gist which compares 4 different Excel writers in Python (gotta love the internet!).

These are the benchmark results:

# Source: https://gist.github.com/jmcnamara/ba25c2bf4ba0777065eb

Versions:
python : 2.7.2
openpyxl : 2.2.1
pyexcelerate: 0.6.6
xlsxwriter : 0.7.2
xlwt : 1.0.0

Dimensions:
Rows = 10000
Cols = 50

Times:
pyexcelerate : 10.63
xlwt : 16.93
xlsxwriter (optimised): 20.37
xlsxwriter : 24.24
openpyxl (optimised): 26.63
openpyxl : 35.75

According to the results, there is a big difference between the xlsx libraries.

As mentioned before, we use django-import-export to produce Excel files from Django models and querysets. Under the hood, django-import-export uses the popular tablib package to do the actual export.

Tablib offers export and import capabilities to and from many formats, but it doesn't do any of the heavy lifting itself. To produce xlsx files, tablib uses the package openpyxl.

A Faster Excel Writer in Python

Looking back at the benchmark results, openpyxl is the slowest among all packages. It looks like by switching to the fastest implementation, pyexcelerate, we might be able to gain some significant improvement for the export process.

The pyexcelerate package looked great from the start. The tag line is just what we needed:

PyExcelerate is a Python library for writing Excel-compatible XLSX spreadsheet files, with an emphasis on speed.

Even the snarky subtitles on the “Usage” section in the README were just what we wanted: fast, faster and fastest!

With such promising benchmarks and README, we had to try it out!

Patching tablib

We already had an entire system built on top of django-import-export and tablib, and we didn't want to start making changes everywhere. So instead, we looked for a way to patch tablib and make it use pyexcelerate instead of openpyxl.

After some digging, we found that tablib uses an internal function called _register_formats to add export and import formats such as csv, xls and xlsx. To get a list of available formats, tablib imports a collection called available from the module formats. The contents of the file formats/__init__.py, where the collection is defined, look like this:

# -*- coding: utf-8 -*-

""" Tablib - formats """

from . import _csv as csv
from . import _json as json
from . import _xls as xls
from . import _yaml as yaml
from . import _tsv as tsv
from . import _html as html
from . import _xlsx as xlsx
from . import _ods as ods
from . import _dbf as dbf
from . import _latex as latex
from . import _df as df
from . import _rst as rst
from . import _jira as jira

available = (json, xls, yaml, csv, dbf, tsv, html, jira, latex, xlsx, ods, df, rst)

The interesting part is the contents of the file _xlsx.py. The file defines some functions to export and import from Excel using openpyxl.

To patch tablib, we first need to implement a similar interface to the one in _xlsx.py using pyexcelerate, and then register it in tablib.

Let’s start with implementing _xlsx.py using pyexcelerate:

# fast_xlsx.py
import itertools
from io import BytesIO

from tablib.formats._xlsx import *  # noqa
from pyexcelerate import Workbook


# Override the default xlsx implementation
title = 'xlsx'


def export_set(dataset, freeze_panes=True):
    """Returns XLSX representation of Dataset."""
    title = dataset.title or 'Sheet1'
    wb = Workbook()
    wb.new_sheet(title, data=itertools.chain([dataset.headers], dataset))
    stream = BytesIO()
    wb.save(stream)
    return stream.getvalue()


def export_book(databook, freeze_panes=True):
    """Returns XLSX representation of DataBook."""
    assert len(databook._datasets) == 1
    return export_set(databook._datasets[0], freeze_panes)


def dset_sheet(*args, **kwargs):
    assert False, 'How did you get here?'
This is a simple implementation of the main functions. It lacks some functionality, such as support for multiple sheets, but it was fine for our needs.
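One detail worth noting in export_set: itertools.chain lazily prepends the header row to the data rows, so the writer receives a single iterable of rows without copying the data. In isolation:

```python
import itertools

headers = ['id', 'violations']
data = [[1, 'dirty floors | full trash can'], [2, 'no light']]

# chain() yields the header row first, then the data rows.
rows = list(itertools.chain([headers], data))
print(rows)
# [['id', 'violations'], [1, 'dirty floors | full trash can'], [2, 'no light']]
```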

Next, we need to make tablib register this file instead of the existing xlsx format. To do that, we created a new file called monkeypatches.py:

# monkeypatches.py
import tablib

from . import fast_xlsx

# Override default xlsx format with a faster implementation
# using `pyexcelerate` (export only).
tablib.formats.available += (fast_xlsx, )

To apply the patch to tablib, we import our implementation and add it to the list of available formats. We then import this file in the app module's __init__.py, so every time the system starts up, tablib is patched.
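Appending to the available collection is enough to override the stock xlsx format because registration iterates the formats in order and keys them by title, so a later entry with the same title wins — that is my reading of tablib's behavior, reduced here to a toy registry in plain Python (none of these names are tablib's actual code):

```python
class SlowXlsx:
    title = 'xlsx'

    @staticmethod
    def export_set(dataset):
        return 'slow'


class FastXlsx:
    title = 'xlsx'

    @staticmethod
    def export_set(dataset):
        return 'fast'


available = (SlowXlsx,)   # the stock formats
available += (FastXlsx,)  # our patched format, appended last

# Registration iterates in order; the last format
# with a given title overwrites the earlier one.
registry = {}
for fmt in available:
    registry[fmt.title] = fmt.export_set

print(registry['xlsx'](None))
# fast
```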

Now for the moment of truth: did all this hard work finally pay off?

These were the results of exporting 10,000 rows in xlsx format with pyexcelerate, using manual prefetch and iterator:

13627507 function calls (13566956 primitive calls) in 10.944s
select 0.137
manual prefetch 2.219

The hard work definitely paid off! Just so we have an honest comparison, these are the results of exporting 10,000 rows in xlsx format without patching tablib, using manual prefetch and iterator:

55982358 function calls (46155371 primitive calls) in 29.965s
select 0.137
manual prefetch 1.724

That’s a 64% improvement compared to the default implementation provided by openpyxl, and a 75% improvement compared to the 40s we started with.

Results Summary

This is a summary of all the results mentioned in the article:

Summary of different methods to produce Excel files in Python

Seifa

We try to study every incident and take action to prevent similar incidents from happening in the future. During this incident, some of our users did experience slowness for a short period of time; however, the “Export to Excel” functionality did not really kill our app.

Following this incident, there are a few open questions we haven’t had the chance to fully explore yet:

  • Why was the prefetch query so slow? The difference boils down to executing IN (...) vs executing = ANY(ARRAY[...]). When we tried to compare the two in the database, we found no difference, but the built-in prefetch_related was significantly slower. Is it possible that time is being spent generating the query in Python?
  • Can openpyxl performance be improved? When I talked to John, the author of the Excel writers benchmark, he mentioned that openpyxl can be faster if lxml is installed.
  • Is xlsx really the best format? Can we eliminate some of the problems we had with csv by switching to a different textual format such as tsv?

If you have the answer to any of these questions, feel free to share it with me and I’ll be happy to post the response.

UPDATED: Aug 19, 2019

A reader from lobste.rs ran a quick benchmark to check how much faster openpyxl can get using lxml. These were his results:

This benchmark shows that openpyxl can be made almost twice as fast just by installing lxml. However, pyexcelerate still improved the speed by a factor of 3.

Many readers on Reddit and Lobsters suggested that a better approach would be to generate the Excel file on the client side using JavaScript. This is definitely something worth considering when designing a new system, even though I think this approach might be problematic for very large files.

