Hey Mark, I just ran a quick benchmark on my own and indeed the “University” method was faster and…

2 min readSep 16, 2019

Hey Mark,
I just ran a quick benchmark on my own and indeed the “University” method was faster and consumed less space.

db=# EXPLAIN (ANALYZE ON, BUFFERS ON)
db-# SELECT  
db-#     *
db-# FROM
db-#     sale
db-# WHERE
db-#     (user, created) IN (
db(#         SELECT 
db(#             user, 
db(#             MAX(created)
db(#         FROM
db(#             sale
db(#         GROUP BY
db(#             user
db(#     )
db-# ;Hash Join  (cost=2717.22..4196.43 rows=1 width=75) (actual time=94.962..114.427 rows=38639 loops=1)
   Hash Cond: ((sale.user = sale_1.user) AND (sale.created = (max(sale_1.created))))
   Buffers: shared hit=688 read=691
   ->  Seq Scan on sale  (cost=0.00..1206.83 rows=51883 width=75) (actual time=0.529..4.452 rows=51280 loops=1)
         Buffers: shared hit=687 read=1
   ->  Hash  (cost=2181.09..2181.09 rows=35742 width=12) (actual time=94.198..94.199 rows=38639 loops=1)
         Buckets: 65536  Batches: 1  Memory Usage: 2324kB
         Buffers: shared hit=1 read=687
         ->  HashAggregate  (cost=1466.24..1823.66 rows=35742 width=12) (actual time=79.480..87.120 rows=38639 loops=1)
               Group Key: sale_1.user
               Buffers: shared hit=1 read=687
               ->  Seq Scan on sale sale_1  (cost=0.00..1206.83 rows=51883 width=12) (actual time=0.011..29.354 rows=51280 loops=1)
                     Buffers: shared hit=1 read=687
 Planning Time: 11.441 ms
 Execution Time: 116.234 msdb=# EXPLAIN (ANALYZE ON, BUFFERS ON)
db-# SELECT DISTINCT ON (user)
db-#     *
db-# FROM
db-#     sale
db-# ORDER BY 
db-#     user,
db-#     created DESC;Unique  (cost=7576.54..7835.96 rows=35742 width=75) (actual time=96.598..113.452 rows=38639 loops=1)
   Buffers: shared hit=691, temp read=544 written=546
   ->  Sort  (cost=7576.54..7706.25 rows=51883 width=75) (actual time=96.596..104.272 rows=51280 loops=1)
         Sort Key: user, created DESC
         Sort Method: external merge  Disk: 4352kB
         Buffers: shared hit=691, temp read=544 written=546
         ->  Seq Scan on sale  (cost=0.00..1206.83 rows=51883 width=75) (actual time=0.020..17.146 rows=51280 loops=1)
               Buffers: shared hit=688
 Planning Time: 1.512 ms
 Execution Time: 115.921 ms

I think the main difference is that the “University” method uses a HashAggregate, which is more efficient in term of memory than the Sort used by the DISTINCT ON method.

In other words, the “University” method goes one row at a time and keeps only the max value for each group, while the DISTINCT ON method sort the entire table. Maintaining a hash is faster and smaller than sorting the entire table.

I was unable to use the “array trick” here because the benchmark fetch all the rows. But, if you only need some of the fields the performance and memory usage should be comparable to the “University” approach.

Written by Haki Benita

No responses yet