As a PostgreSQL developer, I inevitably stumble across weak points in the code. I'm not sure alternative DBMSes handle them much better, but knowing about possible issues can help in the long run.
This time, I encountered quite a subtle issue - incremental sort estimation instability.
Incremental sort was added in 2020 (PostgreSQL 13) and is one of the essential tools in a row of intelligent features - Parameterised NestLoop, Material and Memoize - that let PostgreSQL survive the challenge of big data and analytical queries. It gives the optimiser one more strategy for bypassing massive sort operations: by utilising sort orders derived from index scans, or by switching to a Merge Join, the input of a grouping or ordering operation arrives at least partly presorted.
Such a beneficial tool needs a reasonable cost model: when sorting makes up a significant part of the query work, mistakes here can degrade execution time - maybe not as badly as running a NestLoop over millions of tuples, but unpleasantly enough to look for a way to improve it.
We have already seen how this new optimisation strategy sometimes lets new evil query plans emerge. So, careful estimation matters.
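To see what the feature does in isolation, here is a minimal sketch with a hypothetical table t; the data is illustrative only and the exact plan may vary with settings and statistics:
CREATE TABLE t (a integer, b integer);
CREATE INDEX ON t (a);
INSERT INTO t SELECT i % 100, i FROM generate_series(1, 100000) AS i;
ANALYZE t;
-- The index already delivers rows ordered by a, so ORDER BY a, b only has
-- to sort rows within each group of equal a values:
EXPLAIN SELECT * FROM t ORDER BY a, b LIMIT 10;
On a typical setup this shows an Incremental Sort node on top of the index scan, with 'Presorted Key: a' in the plan.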
To demonstrate the issue, let's create a test table with two columns of interest, x and y, where x contains unique values and y contains a single constant:
CREATE TABLE test (x integer, y integer, z text);
INSERT INTO test (x,y) SELECT x, 1 FROM generate_series(1,1000000) AS x;
CREATE INDEX ON test(x);
CREATE INDEX ON test(y);
VACUUM ANALYZE;
SET max_parallel_workers_per_gather = 0;
So, we have two columns with opposite numbers of distinct values. Don't forget the indexes on each column - they are needed for plans with an Incremental Sort node to be considered. Parallel workers are disabled so that they don't smooth out execution time differences. Now let's execute the following query on a brand new PostgreSQL 17:
EXPLAIN (ANALYZE, TIMING OFF)
SELECT count(*) FROM test t1, test t2
WHERE t1.x=t2.y AND t1.y=t2.x
GROUP BY t1.x,t1.y;
EXPLAIN (1)
===========
GroupAggregate (cost=37824.89..37824.96 rows=1 width=16)
Group Key: t1.y, t1.x
-> Incremental Sort (cost=37824.89..37824.94 rows=2 width=8)
Sort Key: t1.y, t1.x
Presorted Key: t1.y
-> Merge Join (cost=0.85..37824.88 rows=1 width=8)
Merge Cond: (t1.y = t2.x)
Join Filter: (t2.y = t1.x)
-> Index Scan using test_y_idx on test t1
-> Index Scan using test_x_idx on test t2
Execution Time: 1518.249 ms
Personally, I would prefer to sort with the unique column x at the head of the sort list. But set that aside and simply reverse the left and right sides of the expressions in the WHERE condition:
EXPLAIN (ANALYZE, TIMING OFF)
SELECT count(*) FROM test t1, test t2
WHERE t2.y=t1.x AND t2.x=t1.y
GROUP BY t1.x,t1.y;
Nothing special, right? We should get precisely the same plan, shouldn't we? But when we execute it, we see a different plan:
EXPLAIN (2)
===========
GroupAggregate (cost=37824.89..37824.92 rows=1 width=16)
Group Key: t1.x, t1.y
-> Sort (cost=37824.89..37824.90 rows=1 width=8)
Sort Key: t1.x, t1.y
Sort Method: quicksort Memory: 25kB
-> Merge Join (cost=0.85..37824.88 rows=1 width=8)
Merge Cond: (t1.y = t2.x)
Join Filter: (t2.y = t1.x)
-> Index Scan using test_y_idx on test t1
-> Index Scan using test_x_idx on test t2
Execution Time: 1535.216 ms
Where is our incremental sort? Why did the optimiser choose a full sort in this case and change the order of the grouping columns? Let's disable Sort and execute the query again:
SET enable_sort = off;
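For completeness, the query being re-executed is the same one as in case (2):
EXPLAIN (ANALYZE, TIMING OFF)
SELECT count(*) FROM test t1, test t2
WHERE t2.y=t1.x AND t2.x=t1.y
GROUP BY t1.x,t1.y;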
EXPLAIN (3)
===========
GroupAggregate (cost=37824.89..37824.96 rows=1 width=16)
Group Key: t1.x, t1.y
-> Incremental Sort (cost=37824.89..37824.94 rows=2 width=8)
Sort Key: t1.x, t1.y
Presorted Key: t1.x
-> Merge Join (cost=0.85..37824.88 rows=1 width=8)
Merge Cond: (t1.x = t2.y)
Join Filter: (t1.y = t2.x)
-> Index Scan using test_x_idx on test t1
-> Index Scan using test_y_idx on test t2
Execution Time: 601.715 ms
Now we have our Incremental Sort with exactly the same cost as in case (1), but with the order of the grouping columns and the presorted key reversed. Using t1.x as the presorted key is definitely better here because de facto no sorting is needed - the Index Scan already returns tuples in the required order. And since every t1.x value is distinct, t1.y never has to be touched during sorting at all. You can see the result: execution time dropped by more than half!
But we want to understand what is happening here. To force the optimiser to choose t1.y as the presorted key, switch off the GROUP BY list reordering feature and execute the query again:
SET enable_group_by_reordering = off;
EXPLAIN (ANALYZE, TIMING OFF)
SELECT count(*) FROM test t1, test t2
WHERE t2.y=t1.x AND t2.x=t1.y
GROUP BY t1.y,t1.x;
EXPLAIN (4)
===========
GroupAggregate (cost=18912.88..37825.00 rows=1 width=16)
Group Key: t1.y, t1.x
-> Incremental Sort (cost=18912.88..37824.97 rows=2 width=8)
Sort Key: t1.y, t1.x
Presorted Key: t1.y
-> Merge Join (cost=0.85..37824.88 rows=1 width=8)
Merge Cond: (t1.y = t2.x)
Join Filter: (t2.y = t1.x)
-> Index Scan using test_y_idx on test t1
-> Index Scan using test_x_idx on test t2
Execution Time: 1336.453 ms
We ended up with the same query plan and execution time as in case (1). But comparing explains (1) and (4), the costs of the Incremental Sort nodes differ:
EXPLAIN (1), fragment:
-> Incremental Sort (cost=37824.89..37824.94 ...
Sort Key: t1.y, t1.x
Presorted Key: t1.y
-> Merge Join (cost=0.85..37824.88 ...

EXPLAIN (4), fragment:
-> Incremental Sort (cost=18912.88..37824.97 ...
Sort Key: t1.y, t1.x
Presorted Key: t1.y
-> Merge Join (cost=0.85..37824.88 ...
Identical sort lists, presorted keys and underlying query trees, but different startup and total costs! And that is the essence of the issue: by trivially reversing the left and right sides of an expression, we get different estimations of the same query and a significant speedup! Diving into the code (see cost_incremental_sort in the planner's costsize.c), you can find that the source of this instability lies in the logic that chooses which expression to estimate. Given the equivalence class built for t1.y = t2.x, with members {t1.y, t2.x}, the optimiser simply takes the first equivalence member from the list to estimate the number of distinct values. Hence, the choice depends on the text of the SQL query. Take a glance at the statistics on these attributes:
SELECT c.relname, a.attname, s.stadistinct
FROM pg_statistic s JOIN pg_class c ON (s.starelid = c.oid)
JOIN pg_attribute a ON (c.oid = a.attrelid AND s.staattnum = a.attnum)
WHERE c.relname = 'test';
relname | attname | stadistinct
---------+---------+-------------
test | x | -1
test | y | 1
Here, -1 means 100% distinct values (stadistinct is stored as a negative fraction of the row count), so x is effectively unique, while y has only one distinct value. With that data, Postgres estimates the number of distinct values in the presorted key column as either 1 or 1,000,000, depending only on the syntactic order of the sides of the expression.
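Just as a sanity check, the same numbers are exposed through the friendlier pg_stats view and can be verified against the data itself:
SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE tablename = 'test' AND attname IN ('x', 'y');

SELECT count(DISTINCT x) AS x_ndistinct,
       count(DISTINCT y) AS y_ndistinct
FROM test;
-- Expected: n_distinct = -1 for x (every value distinct) and 1 for y,
-- i.e. 1,000,000 distinct x values versus a single distinct y value.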
All in all, we have demonstrated a weakness in the optimiser code that sometimes prevents it from finding the optimal plan. In complex queries and database schemas it can cause 'performance cliffs' that puzzle DBAs a lot, given the issue's subtle and unusual nature. Of course, no optimiser is perfect, but dependence on the query text does look like a bug, doesn't it?
THE END.