12 Common Mistakes and Missed Optimization Opportunities in SQL

Made by Developers and Non-Developers

On the 20th of March I’ll be giving an online session inspired by this article. The session will feature SQL productivity and performance tips, and a walkthrough of common mistakes in SQL.

For more details and registration go to SQL Next Steps: Optimization

Originally published at https://hakibenita.com on November 21, 2019.
Read this article on my blog with proper syntax highlighting.

Most programming languages are designed for professional developers with knowledge of algorithms and data structures. SQL is different.

SQL is used by analysts, data scientists, product managers, designers and many others. These professionals have access to databases, but they don’t always have the intuition and understanding to write efficient queries.

In an effort to make my team write better SQL, I went over reports written by non-developers and code reviews, and gathered common mistakes and missed optimization opportunities in SQL.

Be Careful When Dividing Integers

In PostgreSQL, dividing an integer by an integer results in an integer:
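For example:

```sql
SELECT 100 / 3;   -- 33, not 33.33…
```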

To get the expected result of the division, you need to cast one of the values to float:
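Either a cast or a non-integer literal works:

```sql
SELECT 100::float / 3;   -- 33.333…
SELECT 100 / 3.0;        -- a numeric literal also avoids integer division
```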

Failing to recognize this pitfall might lead to horribly incorrect results.

Guard Against “division by zero” Errors

Zero division is a notorious error in production:
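For example:

```sql
SELECT 1 / 0;
-- ERROR:  division by zero
```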

Division by zero is a logic error, and shouldn’t just be “worked around” but fixed so that you don’t have a zero divisor in the first place. However, there are situations where a zero denominator is possible. One easy way to protect against zero division errors in such cases is to make the entire expression null by setting the denominator to null if it equals zero:
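For example:

```sql
SELECT 1 / NULLIF(0, 0);   -- NULL instead of an error
```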

The function NULLIF returns null if the first argument equals the second argument. In this case, it returns null when the denominator is equal to zero.

When dividing any number by NULL, the result is NULL. To force some value, you can wrap the entire expression with COALESCE and provide a fallback value:
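For example:

```sql
SELECT COALESCE(1 / NULLIF(0, 0), 0);   -- 0 instead of NULL
```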

The function COALESCE is very useful. It accepts any number of arguments, and returns the first value which is not null.

Know the Difference Between UNION and UNION ALL

A classic entry level interview question for developers and DBAs is “what is the difference between UNION and UNION ALL”.

UNION ALL concatenates the results of two or more queries. UNION does the same, but it also eliminates duplicate rows:
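A minimal pair of queries illustrates the difference:

```sql
SELECT 1 AS n UNION ALL SELECT 1;   -- two rows
SELECT 1 AS n UNION     SELECT 1;   -- one row, duplicates removed
```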

You can see in the execution plan that after appending the two queries, the database sorted the results and eliminated duplicate rows.

If you don’t need to eliminate duplicate rows, it’s best to use UNION ALL:
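For example, running EXPLAIN on two hypothetical tables a and b (names are illustrative) shows a plain Append node with no Sort or Unique step:

```sql
EXPLAIN
SELECT n FROM a
UNION ALL
SELECT n FROM b;
```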

The execution plan is much simpler. The results are appended and a sort is not necessary.

Be Careful When Counting Nullable Columns

When using aggregate functions such as COUNT, it's important to understand how they handle null values.

For example, take the following table:
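A minimal reconstruction (table and column names are illustrative):

```sql
CREATE TABLE t (id int);
INSERT INTO t VALUES (1), (NULL);
```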

The id column contains a null value. Counting the id column:
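Assuming the two-row table above:

```sql
SELECT COUNT(id) FROM t;   -- 1
```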

There are two rows in the table, but COUNT returned 1. This is because null values are ignored by COUNT.

To count rows, use COUNT(*):
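COUNT(*) counts rows regardless of null values:

```sql
SELECT COUNT(*) FROM t;   -- 2
```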

This feature can also be useful. For example, if a field called modified contains null if a row was not changed, you can calculate the percentage of changed rows like this:
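A sketch of that calculation, assuming a table named sale with a nullable modified column:

```sql
-- rows where modified is null are excluded from the numerator
SELECT COUNT(modified)::float / COUNT(*) AS pct_modified
FROM sale;
```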

Other aggregate functions, such as SUM, will ignore null values. To demonstrate, SUM a field that contains only null values:
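For example, using an inline VALUES list of nulls:

```sql
SELECT SUM(x) FROM (VALUES (NULL::int), (NULL)) AS v(x);   -- NULL, not 0
```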

These are all documented behaviors, so just be aware!

Be Aware of Timezones

Time zones are always a source of confusion and pitfalls. PostgreSQL does a fair job with timezones, but you still have to pay attention to some things.

A common mistake I see countless times is truncating timestamps without specifying the time zone. Say we want to find out how many sales were made each day:
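A sketch of the mistake, assuming a sale table with a timestamptz column named created:

```sql
SELECT date_trunc('day', created) AS day, COUNT(*)
FROM sale
GROUP BY 1;
```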

Without explicitly setting the time zone, you might get different results, depending on the time zone set by the client application:

If you are not sure what time zone you are working with, you might be doing it wrong.

When truncating a timestamp, convert to the desired time zone first:
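For example, converting to UTC before truncating (the sale table and column are illustrative):

```sql
SELECT date_trunc('day', created AT TIME ZONE 'UTC') AS day, COUNT(*)
FROM sale
GROUP BY 1;
```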

Setting the time zone is usually the responsibility of the client application. For example, to get the time zone used by psql:
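For example:

```sql
SHOW timezone;
```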

To set the time zone in psql:
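For example, setting the session time zone to UTC:

```sql
SET timezone TO 'UTC';
```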

Another important thing to keep in mind is that the time zone of your server can be different from the time zone of your local machine, so queries you run on your local machine might yield different results in production. To avoid mistakes, always explicitly set a time zone.

Timezones in PostgreSQL: To get a complete list of time zone names in PostgreSQL, query the view pg_timezone_names.
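For example:

```sql
SELECT name FROM pg_timezone_names;
```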

Avoid Transformations on Indexed Fields

Using functions on an indexed field might prevent the database from using the index on the field:
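A sketch of the problem, assuming a sale table with an indexed timestamptz column named created; wrapping the column in the timezone function forces a sequential scan:

```sql
SELECT *
FROM sale
WHERE timezone('UTC', created) > '2019-01-01';
```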

The field created is indexed, but because we transformed it with timezone, the index was not used.

One way to utilize the index in this case is to apply the transformation on the right-hand side instead:
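For example, converting the constant instead of the column, so the index on created can be used:

```sql
SELECT *
FROM sale
WHERE created > '2019-01-01 00:00:00'::timestamp AT TIME ZONE 'UTC';
```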

Another common use-case involving dates is filtering a specific period:
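For example, fetching sales from the last three days by adding an interval to the indexed column (illustrative names):

```sql
SELECT *
FROM sale
WHERE created + INTERVAL '3 days' > now();
```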

Like before, the interval arithmetic on the field created prevented the database from utilizing the index. To make the database use the index, apply the transformation on the right-hand side instead of the field:
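The equivalent condition with the arithmetic moved to the constant side:

```sql
SELECT *
FROM sale
WHERE created > now() - INTERVAL '3 days';
```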

Use BETWEEN Only For Inclusive Ranges

A common mistake I see very often is when filtering a date range using BETWEEN:
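A sketch of the mistake (table and column names are illustrative):

```sql
SELECT *
FROM sale
WHERE created BETWEEN '2019-01-01' AND '2020-01-01';
```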

Intuitively, you might think this query is fetching all the sales in 2019, but in fact, it’s fetching all the sales made in 2019 and the first day of 2020. BETWEEN is inclusive, so the query above is equivalent to this query:
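The explicit form of the same condition:

```sql
SELECT *
FROM sale
WHERE created >= '2019-01-01' AND created <= '2020-01-01';
```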

To filter results in 2019 you can either write this:
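One option, which is brittle because it depends on the timestamp precision:

```sql
SELECT *
FROM sale
WHERE created BETWEEN '2019-01-01' AND '2019-12-31 23:59:59.999999';
```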

Or better yet:
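A half-open range sidesteps the precision problem entirely:

```sql
SELECT *
FROM sale
WHERE created >= '2019-01-01' AND created < '2020-01-01';
```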

Using BETWEEN incorrectly might produce overlapping results, for example, counting sales twice in two different periods.

Add “Faux” Predicates

One of the most important components in database implementations, and usually the one that makes one database better than the other, is the query optimizer.

The query optimizer looks at your SQL and generates an execution plan. The execution plan describes how the database is going to access the data necessary to satisfy the query. For example, the optimizer decides whether to use a specific index or not, in what order to execute a join, which table to filter first, and so on.

To generate a good execution plan, the optimizer utilizes metadata and statistics it has on your data. For example, if you apply a filter on a column with a unique constraint, the optimizer knows it can expect exactly one row for each value. In this case, it might conclude that it makes more sense to use an index rather than scan the entire table.

In some circumstances, you have knowledge of your data that the optimizer does not have, or cannot have. You might be able to improve the performance of a query by providing additional information to the optimizer, using what I like to call, a “Faux Predicate”.

Take this query for example:

The query fetches sales that were modified before 2019. There is no index on this field, so the optimizer generates an execution plan to scan the entire table.
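A sketch of that query (table and column names are illustrative):

```sql
SELECT *
FROM sale
WHERE modified < '2019-01-01';
```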

Let’s say you have another field in this table with the time the sale was created. Since it’s not possible for a sale to be modified before it was created, adding a similar condition on the created field won't change the result of the query. However, the optimizer might use this information to generate a better execution plan:
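The same query with the redundant condition added; since a sale is always created before it is modified, the extra predicate cannot change the result, but it lets the optimizer consider the index on created:

```sql
SELECT *
FROM sale
WHERE modified < '2019-01-01'
  AND created < '2019-01-01';   -- the "faux" predicate
```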

After we added the “Faux Predicate” the optimizer decided to use the index on the created field, and the query got much faster! Note that the previous predicate on the modified field is still being evaluated, but it's now being applied on much fewer rows.

A “Faux Predicate” should not change the result of the query. It should only be used to provide more information to the optimizer that can improve the query performance. Keep in mind that the database has to evaluate all the predicates, so adding too many might make a query slower.

Inline CTEs

Before PostgreSQL 12, Common Table Expressions (aka CTE) were materialized. This changed in PostgreSQL 12, where CTEs are no longer materialized and are treated like sub-queries.

In versions prior to PostgreSQL 12, when CTEs are used incorrectly they can cause increased memory usage and degraded performance:
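A sketch of the problematic pattern, assuming a sale table with an indexed created column; before PostgreSQL 12, the CTE is fully materialized before the outer predicate is applied:

```sql
WITH sales AS (
    SELECT * FROM sale
)
SELECT *
FROM sales
WHERE created < '2019-01-02';
```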

The overall cost of the execution plan seems very high. This is because the database first materialized the result of the common table expression, and only then applied the predicate. The database was unable to utilize the index on the field, and the query ended up not very efficient.

For better performance, inline the CTE (or upgrade to PostgreSQL 12 😉):
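The same query rewritten as a sub-query, which the optimizer can flatten so the predicate reaches the index:

```sql
SELECT *
FROM (
    SELECT * FROM sale
) AS sales
WHERE created < '2019-01-02';
```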

For more about CTEs in PostgreSQL and how they affect a query execution plan, check out Be Careful With CTE in PostgreSQL.

Fetch Only What You Need!

Databases are really good at storing and retrieving data. Other applications, not so much. If you fetch data into Excel, SAS, R, Pandas or any other reporting tool, it’s best to fetch only what you need.

For example, you sometimes want to get a sense of the data and you might do this:
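For example (the sale table is illustrative):

```sql
SELECT *
FROM sale;
```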

This query will fetch the entire table and will most likely cause unnecessary load if you only need several rows.

Some client applications will automatically fetch the data in pages or limit the result set, but to be on the safe side, it’s best to set LIMIT yourself:
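For example, capping the result at ten rows:

```sql
SELECT *
FROM sale
LIMIT 10;
```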

Another common case where unnecessary data is fetched from the database is when a user that is less familiar with SQL is fetching data into some other tool such as Excel or Pandas, only to immediately apply some filter or aggregation to it. This can usually be solved by some sort of training.

Reference Column Position in GROUP BY and ORDER BY

A nice feature in PostgreSQL is that columns can be referenced in GROUP BY and ORDER BY by their position in the SELECT clause:
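For example (table and column names are illustrative):

```sql
SELECT first_name || ' ' || last_name AS full_name, COUNT(*)
FROM sale
GROUP BY 1
ORDER BY 2 DESC;
```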

The GROUP BY clause references the expression in the first position in the SELECT clause, full_name. The ORDER BY clause references the second expression, the sales count. By referencing the position we avoided repeating the expression. Aside from saving a few keystrokes, if the expression changes in the future we can edit it in only one place.

I realize this tip can be controversial because the column position in the SELECT clause has no significance and might itself change when the query is edited. However, I found that it improves productivity when writing ad-hoc queries.

Format Your Query

Readability counts. Pick whatever style you and your team feel most comfortable with, and stick with it.

When I got started years ago, I wrote queries like this:
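Roughly this shape (the query itself is illustrative): uppercase keywords, each clause on a single line:

```sql
SELECT id, name, created
FROM sale
WHERE created > '2019-01-01'
ORDER BY created;
```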

I started this way because this is roughly the format Oracle used in their documentation.

Over the years, I encountered many different styles. For example:
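One such style is the right-aligned “river” layout, where keywords line up along a shared margin (illustrative):

```sql
SELECT id,
       name,
       created
  FROM sale
 WHERE created > '2019-01-01'
 ORDER BY created;
```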

I can’t think of a reason anyone would write like this; it’s exhausting to format this manually (but it does look good…)

Nowadays, my team and I use this format:
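A sketch of that kind of layout (illustrative): keywords on their own lines, one expression per line, so each change touches a single line:

```sql
SELECT
    id,
    name,
    created
FROM
    sale
WHERE
    created > '2019-01-01'
ORDER BY
    created;
```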

It’s readable, it’s flexible, and most importantly, it plays very nicely with git diff which makes code reviews easier.


Applying the tips above in our day-to-day helps us sustain a healthy database with very little waste. We found that educating developers and non-developers about how to write better SQL can go a long way.

If you have any more SQL tips I might have missed, let me know and I’ll be happy to add them here.

