Introduction
As usual, this project was prompted by multiple user reports with typical complaints, like 'SQL Server executes this query many times faster' or 'Postgres doesn't pick up my index'. The underlying issue that united these reports was heavy use of VALUES sequences, typically transformed in the query tree into a SEMI JOIN.
I also want to raise one general question: should an open-source DBMS correct user errors? By that I mean optimising a query even before the search for an optimal plan begins - eliminating self-joins and subqueries, simplifying expressions - everything that could also be achieved by proper query tuning. The question is not that simple, since DBAs point out that the cost of query planning in Oracle grows excessively with the complexity of the query text, which is most likely caused, among other things, by its extensive set of optimisation rules.
Now, let's turn our attention to the VALUES construct. Interestingly, it's not just used with the INSERT command but also frequently appears in SELECT queries in the form of a test of inclusion in a set:
SELECT * FROM something WHERE x IN (VALUES (1), (2), ...);
and in the query plan, this syntactic construct is transformed into a SEMI JOIN. To demonstrate the essence of the problem, let's generate a test table with an uneven distribution of data in one of the columns:
CREATE EXTENSION tablefunc;
CREATE TABLE norm_test AS
SELECT abs(r::integer) AS x, 'abc'||r AS payload
FROM normal_rand(1000, 1., 10.) AS r;
CREATE INDEX ON norm_test (x);
ANALYZE norm_test;
Here, the column x of the norm_test table is derived from a normal distribution with a mean of 1 and a standard deviation of 10 [1] (the random value is rounded and taken by absolute value). There are not too many distinct values, and all of them will end up in the MCV statistics. As a result, the number of duplicates can be estimated accurately for each value despite the uneven distribution. We also created an index on this column to make scanning the table cheaper. Now, let's execute the query:
EXPLAIN ANALYZE
SELECT * FROM norm_test WHERE x IN (VALUES (1), (29));
An uncomplicated query, right? The rational way to execute it would be two index scans, one per value. However, in Postgres we get:
Hash Semi Join (cost=0.05..21.36 rows=62) (actual rows=85)
Hash Cond: (norm_test.x = "*VALUES*".column1)
-> Seq Scan on norm_test (rows=1000) (actual rows=1000)
-> Hash (cost=0.03..0.03 rows=2) (actual rows=2)
-> Values Scan on "*VALUES*" (rows=2) (actual rows=2)
Here and onwards, I slightly simplify the explain for clarity.
Hmm, a sequential scan over all the table's tuples when two index scans would have been enough? Let's disable Hash Join and re-run the same EXPLAIN ANALYZE:
SET enable_hashjoin = 'off';
Nested Loop (cost=4.43..25.25 rows=62) (actual rows=85)
-> Unique (rows=2 width=4) (actual rows=2)
-> Sort (rows=2) (actual rows=2)
Sort Key: "*VALUES*".column1
-> Values Scan on "*VALUES*" (rows=2) (actual rows=2)
-> Bitmap Heap Scan on norm_test (rows=31) (actual rows=42)
Recheck Cond: (x = "*VALUES*".column1)
-> Bitmap Index Scan on norm_test_x_idx (rows=31) (actual rows=42)
Index Cond: (x = "*VALUES*".column1)
Now you can see that Postgres has squeezed out the maximum it can: it makes a single pass through the VALUES set and performs an index scan on the table for each outer value. That is much more interesting than the previous option, but it is still not as simple as a plain index scan. Moreover, if you look at the plan more closely, you can see that the optimiser misestimates the cardinality of both the join and the index scan. And what happens if we rewrite the query without VALUES:
EXPLAIN (ANALYSE, TIMING OFF)
SELECT * FROM norm_test WHERE x IN (1, 29);
/*
Bitmap Heap Scan on norm_test (cost=4.81..13.87 rows=85) (actual rows=85)
Recheck Cond: (x = ANY ('{1,29}'::integer[]))
Heap Blocks: exact=8
-> Bitmap Index Scan on norm_test_x_idx (rows=85) (actual rows=85)
Index Cond: (x = ANY ('{1,29}'::integer[]))
*/
As you can see, we got a query plan containing only an index scan, and it is almost twice as cheap. At the same time, because Postgres estimates each value from the set individually and both values are present in the MCV statistics, it predicts the cardinality of this scan exactly.
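By the way, if you want to double-check that both values really sit in the MCV list, the standard pg_stats view shows it; this is just a side check of mine, not part of the original experiment:
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'norm_test' AND attname = 'x';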
So, while not a huge problem in itself (you can always use Hash Join and hash the inner VALUES), using VALUES sequences is a source of dangers:
- The optimiser can choose NestLoop, which can reduce performance with a vast VALUES list (see the sketch after this list).
- All of a sudden, SeqScan can be chosen instead of IndexScan.
- The optimiser makes significant estimation errors when predicting the cardinality of the JOIN operation and the operations beneath it.
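To get a feel for the first danger without pasting thousands of literals by hand, one can generate the query text in psql and run it with \gexec. This is only a sketch of mine, not an experiment from the post, and whether NestLoop actually gets picked depends on your data and settings:
SELECT format('EXPLAIN ANALYZE SELECT * FROM norm_test WHERE x IN (VALUES %s)',
              string_agg(format('(%s)', g), ', '))
FROM generate_series(1, 10000) AS g \gexec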
By the way, why would anyone need to use such expressions at all? My guess is that this is a particular case where an automation layer - an ORM or a REST API - tests the inclusion of an object in a specific set of objects. Since VALUES describes a relational table, and each element of such a list is a table row, we are most likely dealing with cases where each row represents an instance of an application object. Ours is a corner case in which the object is characterised by only one property. If my guess is wrong, please correct me in the comments - maybe someone knows other reasons?
So, passing the 'x IN VALUES' construct to the optimiser is risky. Why not fix the situation by converting this VALUES construct into an array? Then we get a construct like 'x = ANY (ARRAY[...])', a special case of the ScalarArrayOpExpr operation in the Postgres code. It simplifies the query tree, eliminating the unnecessary join. Also, the Postgres cardinality estimation machinery can work with the array inclusion check: if the array is small enough (fewer than 100 elements), it performs a statistical evaluation element by element. In addition, Postgres can optimise the array search by hashing the values (if the memory required for that fits within work_mem) - and everyone will be happy, right?
Well, we decided to try this in our optimisation lab - and, surprisingly, it turned out to be relatively trivial. The first peculiarity we encountered is that the conversion is only possible for operations on scalar values: so far, it is generally impossible to convert an expression of the form '(x,y) IN (VALUES (1,1), (2,2), ...)' so that the result exactly matches the behaviour before the conversion. Why? It is not easy to explain briefly: the reason lies in the design of the comparison operator for the record type, and teaching Postgres to treat such an operator just like the scalar ones would require a significant redesign of the type cache. Secondly, you must remember to check this subquery (yes, VALUES is represented in the query tree as a subquery) for the presence of volatile functions - and that's it: one pass of a query tree mutator, quite similar to [2], replaces VALUES with an array, constifying it if possible. Curiously, the conversion is possible even if VALUES contains parameters, function calls, and complex expressions, like in the example below:
CREATE TEMP TABLE onek (ten int, two real, four real);
PREPARE test (int, numeric, text) AS
SELECT ten FROM onek
WHERE sin(two)*four/($3::real) IN (VALUES (sin($2)), (2), ($1));
EXPLAIN (COSTS OFF) EXECUTE test(1, 2, '3');
/*
Seq Scan on onek
Filter: (((sin((two)::double precision) * four) / '3'::real) = ANY ('{0.9092974268256817,2,1}'::double precision[]))
(2 rows)
*/
The feature is currently being tested. The query tree structure is pretty stable and the dependencies on the core version are minimal, so there is little reason to ever modify the code; it can be used with Postgres down to version 10 and maybe even earlier. As usual, you can play with the library's binaries, compiled in a typical Ubuntu 22 environment - it doesn't have any UI and can be loaded statically or dynamically.
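As a rough sketch of what 'statically or dynamically' means in practice (the library name below is a placeholder; the post does not give the actual file name):
-- 'values_to_array' is a hypothetical library name used only for illustration
LOAD 'values_to_array';        -- dynamic, per-session loading
-- or, to load it into every backend, set in postgresql.conf:
--   shared_preload_libraries = 'values_to_array'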
And now, the actual holy war that I mentioned above. Since we did this as an external library, we had to intercept the planner hook (to simplify the query tree before optimisation), which cost us an additional pass through the query tree. Obviously, most queries in the system will not need this transformation, and this operation will simply add overhead. However, when it works, it can provide a noticeable effect (and from my observations, it does).
Until recently, there was a consensus in the PostgreSQL community [3, 4]: if the problem can be fixed by changing the query itself, then there is no point in complicating the kernel code since this will inevitably lead to increased maintenance costs and (remembering Oracle's experience) will affect the performance of the optimiser itself.
However, watching the core commits, I notice that the community's opinion seems to be drifting. For example, this year they complicated the subquery-to-SEMI JOIN transformation by adding support for correlated subqueries [5]. A little later, they allowed the parent query to receive information about the sort order of a subquery result [6], although previously, to simplify planning, a query and its subqueries were planned independently. It looks like a step towards re-planning subqueries, doesn't it?
And what do you think? Is an open-source project capable of supporting multiple transformation rules that would eliminate the redundancy and complexity that the user introduces, trying to make the query more readable and understandable? And most importantly - is it worth it?
References
[5] Commit 9f13376. Pull-up correlated subqueries.
[6] Commit a65724d. Propagate pathkeys from CTEs up to the outer query.
THE END.
October 2, 2024. Pattaya, Thailand.
I think SOME amount of rewrite by the optimizer is required. We especially found that to be true with exactly this situation in the post. It seems to me that "x IN (list of scalars)", "x = ANY (list of scalars)", and "x IN (VALUES ...)" are all the same thing until you use query parameters. I agree it doesn't need to get carried away (as it sounds like Oracle did). Everyone with the same issue benefits from optimizer improvements instead of having to change their applications, as long as the planning time is capped to some limit.
When I was on a team working with SQL Server queries in an application, I often advised people that when using any of the relational databases, try to keep these ideas in mind:
1. Write the query as WHAT you want to retrieve, not HOW to execute it. If you need to explain how to execute it, whether by contorting the query text or by adding SQL Server query hints, then something is too complicated.
2. Try to keep the query as simple as possible (which does require some understanding of how execution would happen). For example, don't join the same table in again if it joins on the same key! Maybe prefer to write a join instead of a subquery, etc.
3. EXPECT complicated queries to change their execution plan as your data changes.
Until PG14, using IN (VALUES ...) with a long list (and the resulting semi join) was significantly more performant than an = ANY expression with a large list, because that expression was always executed as a linear scan through the list. The patch that made such an expression execute as a hash lookup came out of my experience with an = ANY containing 17k values being horrifically slow (while rewriting it as an explicit join to VALUES was much faster).
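For readers unfamiliar with that rewrite, it looks roughly like this (illustrative only, using the article's test table rather than the commenter's real data):
SELECT t.*
FROM norm_test AS t
JOIN (VALUES (1), (29)) AS v(x) ON t.x = v.x;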