Why does PostgreSQL prefer MergeJoin to HashJoin?
Making more intensive use of the extended statistics capabilities
Today's post is sparked by a puzzling observation: users, especially those who interact with databases through an abstraction layer such as REST or an ORM library, frequently disable the MergeJoin option across the entire database instance. They justify this by citing numerous instances of performance degradation.
Considering how many interesting execution paths MergeJoin adds to the system by exploiting IncrementalSort or sort orderings derived from an underlying IndexScan, this looks strange: is it yet another case of a skewed cost balance inside the PostgreSQL cost model?
As a developer, I refused to accept such a mysterious belief in evil algorithms and looked into the case. It turned out that the real reason (or at least one quite frequent reason) lies in a typical challenge the optimiser faces: the multi-clause JOIN.
Let's take a glance at the query:
SELECT * FROM a JOIN b ON (a.x=b.x AND a.y=b.y AND a.z=b.z);
In this scenario, the optimiser often unexpectedly selects MergeJoin or, much more rarely, a NestLoop instead of the more efficient HashJoin.
It is a challenge to reproduce it with synthetic data, so this example looks a bit complicated:
CREATE TABLE a AS SELECT
((3*gs) % 300) AS x,
((3*gs+1) % 300) AS y,
((3*gs+2) % 300) AS z
FROM generate_series(1,1e5) AS gs;
CREATE TABLE b AS SELECT
gs % 49 AS x,
gs % 51 AS y,
gs % 73 AS z
FROM generate_series(1,1e5) AS gs;
ANALYZE a,b;
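A quick way to see what makes this data tricky for the planner is to compare per-column and combined distinct counts; a check along these lines (illustrative, not part of the test case itself) does the job for table 'b':
SELECT count(DISTINCT x) AS nd_x,
       count(DISTINCT y) AS nd_y,
       count(DISTINCT z) AS nd_z,
       count(DISTINCT (x, y, z)) AS nd_xyz
FROM b;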
Table 'b' has been created to resemble typical real data: each column contains a small number of distinct values, yet a row is almost unique when the values of all its columns are considered together. Let's execute a single join on three columns:
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT * FROM a,b WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;
Merge Join (actual rows=0 loops=1)
Merge Cond: ((a.x = b.x) AND (a.y = b.y) AND (a.z = b.z))
-> Sort (actual rows=17001 loops=1)
Sort Key: a.x, a.y, a.z
-> Seq Scan on a (actual rows=100000 loops=1)
-> Sort (actual rows=100000 loops=1)
Sort Key: b.x, b.y, b.z
-> Seq Scan on b (actual rows=100000 loops=1)
Execution Time: 843.179 ms
Let's manually disable MergeJoin and see what happens:
SET enable_mergejoin = 'off';
Hash Join (actual rows=0 loops=1)
Hash Cond: ((b.x = a.x) AND (b.y = a.y) AND (b.z = a.z))
-> Seq Scan on b (actual rows=100000 loops=1)
-> Hash (actual rows=100000 loops=1)
-> Seq Scan on a (actual rows=100000 loops=1)
Execution Time: 154.822 ms
HashJoin is much faster, isn't it? Even though the sort order allowed MergeJoin to read only 17001 of the 100,000 outer tuples, it is still about five times slower. So, why has the optimiser chosen the non-optimal variant?
Looking into the details (not shown in the EXPLAIN output above for simplicity), we see that the optimiser correctly predicts the sizes of the inner and outer inputs (it is quite a trivial query, after all), but the total cost of the MergeJoin plan is 20937, while HashJoin's is 419280 - almost twenty times higher! What's going on there? Is it a bug in the cost model? - Not exactly.
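These estimates are easy to inspect: plain EXPLAIN prints cost estimates by default, so running the query with enable_mergejoin switched on and off shows the totals of both candidate plans:
EXPLAIN SELECT * FROM a,b WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;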
Just look into the final_cost_hashjoin() routine. The HashJoin cost formula looks like this:
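Paraphrasing the relevant fragment of costsize.c (variable names abridged), the dominant term charged for probing is roughly
hash_qual_cost.per_tuple * outer_rows * clamp(inner_rows * inner_bucket_size) * 0.5
that is, each outer tuple is expected to be compared, on average, with half of the tuples sitting in its hash bucket.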
In Postgres terms, bucket_size means the number of tuples sharing the same hash value. It is a crucial factor because, to find a match in a bucket, we have to walk through it and compare the incoming tuple with each stored tuple until we either find a match or exhaust the whole bucket.
In our specific example, the bucket size fraction is estimated at roughly 0.015 for relation 'b' and 0.011 for relation 'a', and the number of buckets is estimated at 131000. This means the optimiser predicts that each occupied bucket will contain on the order of 1500 tuples (0.015 × 100000). Scanning a whole bucket linearly for each incoming tuple is really costly - I agree with the optimiser on that choice! The massive cost of the HashJoin node now makes sense. But why has it made the wrong prediction here?
The problem is the estimation of the number of groups in the case of multiple columns. Although the optimiser correctly estimates 49, 51 and 73 distinct values in columns x, y and z respectively, it then takes the maximum of these, i.e. 73, as the estimate of the number of groups over (x,y,z), which is incorrect in most real cases, as shown here. Why does it do that? - Because that is the most skewed, worst-case scenario that can be derived from the available statistics.
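The per-column numbers the planner works with can be verified directly through the standard pg_stats view (just a sanity check):
SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE tablename IN ('a', 'b') AND attname IN ('x', 'y', 'z');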
But the actual number of groups here is:
SELECT count(*) FROM (SELECT * FROM b GROUP BY x,y,z) AS grouped;
count
--------
100000
The number of distinct values over a set of columns can be calculated only with extended statistics. Let's define them:
CREATE STATISTICS a_stx (ndistinct) ON x,y,z FROM a;
CREATE STATISTICS b_stx (ndistinct) ON x,y,z FROM b;
ANALYZE a,b;
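Once the ANALYZE has run, the multi-column distinct estimates become visible in the pg_stats_ext system view (again, merely a sanity check):
SELECT statistics_name, n_distinct
FROM pg_stats_ext
WHERE statistics_name IN ('a_stx', 'b_stx');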
Here, we employ only ndistinct-type statistics because that is enough for our purpose (note that the statistics objects are populated by the ANALYZE above). Unfortunately, the current PostgreSQL core doesn't utilise them for this estimation - let's implement the code and see how it goes:
RESET enable_mergejoin;
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT * FROM a,b WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;
Hash Join (actual rows=0 loops=1)
Hash Cond: ((a.x = b.x) AND (a.y = b.y) AND (a.z = b.z))
-> Seq Scan on a (actual rows=100000 loops=1)
-> Hash (actual rows=100000 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 5604kB
-> Seq Scan on b (actual rows=100000 loops=1)
Execution Time: 88.582 ms
You can see that the optimiser has not only chosen the HashJoin algorithm but has also correctly picked relation 'b' as the inner input to be hashed. As a result, the execution time is almost halved compared to the previous, already good HashJoin plan! This is a consequence of the correct bucket size estimation: 0.00001 for relation 'b' and 0.01 for relation 'a'.
So, as you can see, this approach led to a nearly tenfold speedup in an elementary example. Since real-life queries are typically more complex, executed over huge tables with non-trivial data distributions and involving complex scan filters, DBAs often struggle to identify the optimisation point and end up with a perplexing belief in the disruptive MergeJoin. Hence, extended statistics are potentially becoming a "must-have" feature when dealing with queries that contain two or more join clauses in a single JOIN operator.
But what is actually wrong with simply turning MergeJoin off? Let's add indexes on these tables and try to execute a grouping query:
CREATE INDEX ON a (x,y,z);
CREATE INDEX ON b (x,y,z);
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT a.x,a.y,a.z FROM a,b
WHERE a.x=b.x AND a.y=b.y AND a.z=b.z GROUP BY a.x,a.y,a.z;
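For comparison, one way to obtain the HashJoin timing mentioned below is to disable MergeJoin for the session again and re-run the same query:
SET enable_mergejoin = 'off';
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT a.x,a.y,a.z FROM a,b
WHERE a.x=b.x AND a.y=b.y AND a.z=b.z GROUP BY a.x,a.y,a.z;
RESET enable_mergejoin;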
In this case, using the presorted order of both relations, MergeJoin executes about twice as fast: 44 ms vs 76 ms for HashJoin. So a JOIN operator that provides some ordering is a natural choice in analytical queries - it can yield plans with fewer sort operations, and disabling it narrows the optimiser's search for effective plans.
Hence, we should find a way to estimate the cost of complex clauses more precisely. In the case of multiple clauses, we have only one tool so far - EXTENDED STATISTICS. As a result, it looks promising to invent extensions that manage such statistics automatically - fortunately, we have already published one :). Do you agree with us? Would you like to use this type of statistics?
As usual, you can assess the results by playing with the code on top of the current PostgreSQL master code branch.
THE END.
July 7, Thailand, South Pattaya