Considering how many interesting execution paths *MergeJoin* opens up for the optimiser - enabling *IncrementalSort* or reusing sort orderings derived from an underlying *IndexScan* - the common advice to simply disable it looks strange: is it just one more case of skewed cost balance inside the PostgreSQL cost model?

As a developer, I refused to accept such a mysterious belief in evil algorithms and investigated the case. It turned out that the real reason (or at least one quite frequent reason) lies in a typical challenge the optimiser faces: the multi-clause JOIN.

Let's take a glance at the query:

`SELECT * FROM a JOIN b ON (a.x=b.x AND a.y=b.y AND a.z=b.z);`

In this scenario, the optimiser often unexpectedly selects *MergeJoin* or, much more rarely, *NestLoop*, instead of the more efficient *HashJoin*.

It is hard to reproduce with synthetic data, so this example looks a bit contrived:

```
CREATE TABLE a AS SELECT
((3*gs) % 300) AS x,
((3*gs+1) % 300) AS y,
((3*gs+2) % 300) AS z
FROM generate_series(1,1e5) AS gs;
CREATE TABLE b AS SELECT
gs % 49 AS x,
gs % 51 AS y,
gs % 73 AS z
FROM generate_series(1,1e5) AS gs;
ANALYZE a,b;
```

Table 'b' is quite typical of real data: each column contains a small number of distinct values, yet a row is almost unique when the values of all its columns are considered together. Let's execute a single join on three columns:
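A quick way to see why table 'b' behaves this way is to recompute its cardinalities outside the database. A minimal Python sketch mirroring the modulo expressions above:

```python
from math import lcm

N = 100_000
rows = [(g % 49, g % 51, g % 73) for g in range(1, N + 1)]

# Per-column cardinalities are small...
per_col = [len({r[i] for r in rows}) for i in range(3)]
print(per_col)            # [49, 51, 73]

# ...but the combination is unique for every row: 49, 51 and 73 are
# pairwise coprime, so a tuple repeats only with period lcm(49, 51, 73).
print(lcm(49, 51, 73))    # 182427 > 100000
print(len(set(rows)))     # 100000 distinct (x,y,z) combinations
```

So per-column statistics alone see at most 73 distinct values, while the composite key is fully unique.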

```
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT * FROM a,b WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;
Merge Join (actual rows=0 loops=1)
Merge Cond: ((a.x = b.x) AND (a.y = b.y) AND (a.z = b.z))
-> Sort (actual rows=17001 loops=1)
Sort Key: a.x, a.y, a.z
-> Seq Scan on a (actual rows=100000 loops=1)
-> Sort (actual rows=100000 loops=1)
Sort Key: b.x, b.y, b.z
-> Seq Scan on b (actual rows=100000 loops=1)
Execution Time: 843.179 ms
```

Let's manually disable MergeJoin and see what happens:

```
SET enable_mergejoin = 'off';
Hash Join (actual rows=0 loops=1)
Hash Cond: ((b.x = a.x) AND (b.y = a.y) AND (b.z = a.z))
-> Seq Scan on b (actual rows=100000 loops=1)
-> Hash (actual rows=100000 loops=1)
-> Seq Scan on a (actual rows=100000 loops=1)
Execution Time: 154.822 ms
```

HashJoin is much faster, isn't it? Even though the sorted outer side allowed MergeJoin to fetch only 17001 tuples out of 100,000, it is still five times slower. So, why has the optimiser chosen the non-optimal variant?

Looking into the details (not shown in the EXPLAIN output above, for simplicity), we see that the optimiser correctly predicts the sizes of the inner and outer sides (it is quite a trivial query, though), but MergeJoin's cost is 20937 versus HashJoin's 419280 - almost twenty times cheaper! What's going on here? Is it a bug in the cost model? Not exactly.

Just look into the final_cost_hashjoin() routine. Simplified, the probe-related part of the HashJoin cost formula looks like this: *cost ≈ cmp_cost · N_outer · N_inner · bucket_size / 2*.

In Postgres terms, *bucket_size* means the expected share of inner tuples falling into a single hash bucket. It is a crucial factor because, to find a match in the bucket, we have to walk through the bucket and compare the incoming tuple with each stored tuple until we find a match or exhaust the whole bucket.

In our specific example, the bucket size is roughly 0.015 for relation *'b'* and 0.011 for relation *'a'*, and the number of buckets is estimated at 131000. This means the optimiser predicts that each bucket will contain on the order of a thousand tuples. Passing through a whole bucket with a linear search for each incoming tuple is really costly - I agree with the optimiser on that choice! The massive cost of the HashJoin node now makes sense. But why did it make the wrong prediction here?
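To get a feel for the numbers, here is a back-of-envelope sketch of the probe term. The `probe_comparisons` helper and the "half a bucket scanned" factor are simplifications of what final_cost_hashjoin() actually computes; the bucketsize figures are the ones quoted above:

```python
# Rough model of the dominant hash-probe term: each outer tuple scans,
# on average, half of the hash bucket it lands in.
def probe_comparisons(outer_rows, inner_rows, bucketsize):
    # expected bucket occupancy = inner_rows * bucketsize; half is scanned
    return outer_rows * (inner_rows * bucketsize) * 0.5

rows = 100_000
bad_estimate = probe_comparisons(rows, rows, 0.015)     # per-column stats only
good_estimate = probe_comparisons(rows, rows, 0.00001)  # with extended stats
print(bad_estimate)                   # ~7.5e7 expected comparisons
print(bad_estimate / good_estimate)   # ~1500x more work predicted
```

With tens of millions of predicted bucket comparisons, it is no surprise the optimiser runs away from HashJoin.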

The problem is the estimation of the number of groups over multiple columns. Having correctly estimated 49, 51 and 73 distinct values in columns x, y and z respectively, the optimiser takes the maximum, i.e. 73 distinct values, as the estimate of the number of groups on *(x,y,z)*. That is incorrect in most real cases, as shown here. Why does it do that? Because, given only per-column statistics, this is the most skewed, worst-case scenario that can be derived from them.

But the actual number of groups here is:

```
SELECT count(*) FROM (SELECT 1 FROM b GROUP BY x,y,z) AS q;
count
--------
100000
```

The number of distinct values on a set of columns can be calculated only with extended statistics. Let's define it:

```
CREATE STATISTICS a_stx (ndistinct) ON x,y,z FROM a;
CREATE STATISTICS b_stx (ndistinct) ON x,y,z FROM b;
```

Here, we employ only ndistinct-type statistics because they are enough for our purpose. Unfortunately, the current PostgreSQL core doesn't utilise them for this estimation - so let's implement the code and see how it goes:

```
RESET enable_mergejoin;
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT * FROM a,b WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;
Hash Join (actual rows=0 loops=1)
Hash Cond: ((a.x = b.x) AND (a.y = b.y) AND (a.z = b.z))
-> Seq Scan on a (actual rows=100000 loops=1)
-> Hash (actual rows=100000 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 5604kB
-> Seq Scan on b (actual rows=100000 loops=1)
Execution Time: 88.582 ms
```

You can see that the optimiser not only chose the *HashJoin* algorithm but also correctly chose relation *'b'* as the inner input to be hashed. In that case, execution is twice as fast as the previous, already good, *HashJoin* plan! This results from the correct bucket size estimation: 0.00001 for relation *'b'* and 0.01 for relation *'a'*.

So, as you can see, this approach led to a nearly tenfold speedup in an elementary example. Since real-life queries are typically more complex - executed over huge tables with non-trivial data distributions and involving complex scan filters - DBAs often struggle to identify optimisation points and end up with a perplexing belief in the disruptive *MergeJoin*. So, extended statistics are potentially becoming a "must-have" feature when dealing with queries that contain two or more join clauses in a single JOIN operator.

But what's wrong with simply turning MergeJoin off for good? Let's add indexes on these tables and try to execute a grouping query:

```
CREATE INDEX ON a (x,y,z);
CREATE INDEX ON b (x,y,z);
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT a.x,a.y,a.z FROM a,b
WHERE a.x=b.x AND a.y=b.y AND a.z=b.z GROUP BY a.x,a.y,a.z;
```

In this case, using the presorted order of both relations, *MergeJoin* executes twice as fast: 44 ms vs 76 ms for HashJoin. So a JOIN operator that provides some order is a natural choice in analytical queries - it can produce plans with fewer sort operations, and disabling it narrows the optimiser's search space for effective plans.

Hence, we should find a way to estimate costs for complex clauses more precisely. In the case of multiple clauses, we have only one tool so far - EXTENDED STATISTICS. As a result, it looks promising to invent extensions that manage such statistics automatically - fortunately, we have already published one :). Do you agree with us? Would you like to use this type of statistics?

As usual, you can assess the results by playing with the code on top of the current PostgreSQL master code branch.

THE END.

*July 7, Thailand, South Pattaya*

The background of this work lies in the long story of the GROUP-BY optimisation feature development: reverted in 2023, it finally made its way into the core in 2024. But only a tiny part of the feature was accepted. Challenged by the review feedback to "embrace a down-to-earth attitude", we tried to analyse the situation before the next attempt.

The current development code can be found in the GitHub branch of the Postgres Professional's public Postgres fork.

In the current implementation of the PostgreSQL optimiser, the complexity of sorting is determined only by the number of incoming tuples. While a logical and reliable solution, such a model does not consider the composite nature of a tuple or the number of duplicates in the input data. Moreover, the hash aggregation cost model already employs the number of columns in a tuple and statistics on the number of duplicates. At the same time, while searching for the optimal plan, the optimiser often has the freedom to choose the order of the columns in the resulting tuple of an operation.

An example is the GROUP-BY operation - one of its execution algorithms requires pre-sorted data. Also, the MergeJoin operator can change the order of expressions in the join clause and optimise the plan by considering different sortings of the left and right inputs. Having the ability to differentiate by cost between various possibilities for such sorting, the optimiser can reasonably choose the order of clauses in the grouping operator and discover more optimal alternatives for the query plan.

After reviewing the ACM, arXiv and IEEE catalogues, we found no work devoted to estimating multi-column sorting. Looking at competitors' behaviour, we found that MySQL, whose code is directly available for study, implements a simple model for estimating the cost of sorting that does not consider the length or number of columns in the tuple. Experiments with MSSQL and Oracle have also shown the absence of such tuning. That is why we commenced this work and ended up with this post.

So, let's demonstrate the problem using a simple degenerate example:

```
CREATE TABLE test (
n integer, t integer, company text, counterpart_company text
);
INSERT INTO test (n,t,company,counterpart_company)
SELECT n, n % 10000 AS t,
CASE n % 3
WHEN 0
THEN 'A Limited Liability Partnership (LLP) "White chamomile"'
WHEN 1
THEN 'A Limited Liability Partnership (LLP) "White swan"'
WHEN 2
THEN 'A Limited Liability Partnership (LLP) "White idea"'
END AS company,
CASE (n + 1) % 3
WHEN 0
THEN 'A Limited Liability Partnership (LLP) "White chamomile"'
WHEN 1
THEN 'A Limited Liability Partnership (LLP) "White swan"'
WHEN 2
THEN 'A Limited Liability Partnership (LLP) "White idea"'
END AS counterpart_company
FROM generate_series(1,1000000) AS gs(n);
VACUUM ANALYZE test;
```

Thanks to Ivan Frolkov, who provided us with this example.

The table contains several columns, each with a different number of duplicates and a different width. Let's see how the specific order of sorting columns impacts execution time:

```
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT n,t,company,counterpart_company FROM test ORDER BY 1,2,3,4;
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT n,t,company,counterpart_company FROM test ORDER BY 4,3,2,1;
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT n,t,company,counterpart_company FROM test ORDER BY 1,3,4,2;
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT n,t,company,counterpart_company FROM test ORDER BY 2,3,4,1;
```

The execution time of these queries in the experiment was:

- ORDER BY 1,2,3,4 - 853 ms;
- ORDER BY 4,3,2,1 - 4701 ms;
- ORDER BY 1,3,4,2 - 854 ms;
- ORDER BY 2,3,4,1 - 5065 ms.

It is noticeable that in options No. 2 and No. 4, where wide columns with many duplicates are sorted first, the cost of the operation increases by around six times. Thus, the sort order can significantly impact the computational effort required to process a query. Hence, by being able to distinguish between such sorting options and choose the optimal one, we gain an additional query optimisation tool. For example, if a GROUP-BY operation sits higher in the query tree, you can significantly reduce the overhead by choosing the optimal order of grouping expressions. It should be noted that to obtain such a large difference, we cheated a little by declaring the scanned table as temporary, eliminating the use of parallel workers and making the execution time gap more evident. Also, we deliberately used ORDER-BY instead of GROUP-BY to focus attention on the issue of sort costing and avoid the overhead of grouping operations.
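The same effect is easy to reproduce outside the DBMS. Here is a hedged Python sketch (the `comparisons` helper is illustrative, not PostgreSQL code) that sorts the same rows with the column order swapped and counts per-column comparator calls:

```python
import random
from functools import cmp_to_key

random.seed(42)
N = 10_000
companies = ['LLP "White chamomile"', 'LLP "White swan"', 'LLP "White idea"']
uniq = list(range(N))
random.shuffle(uniq)
# column 0 is unique; column 1 holds only three long string values
rows = [(u, companies[u % 3]) for u in uniq]

def comparisons(order):
    """Sort `rows` comparing columns in `order`; count per-column compares."""
    calls = [0]
    def cmp(a, b):
        for i in order:
            calls[0] += 1
            if a[i] != b[i]:
                return -1 if a[i] < b[i] else 1
        return 0
    sorted(rows, key=cmp_to_key(cmp))
    return calls[0]

uniq_first = comparisons((0, 1))  # unique column leads: one call per pair
dups_first = comparisons((1, 0))  # duplicate-heavy text column leads
print(uniq_first < dups_first)    # ties in the text column force extra calls
```

With the unique column first, every tuple pair is resolved by a single comparison; with the duplicate-heavy column first, a large share of pairs needs a second one.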

The main problem with altering the optimiser's cost model is adjusting the balance between the costs of various operations. The cost model merely approximates the operation's actual cost and does not consider many minor factors of the exact hardware and software platform. Detailing the cost model of one query tree operator can create unjustified discrimination in favour of one strategy. When modifying the sorting cost model, you should check the cost models of at least the *MergeAppend*, *GatherMerge* and *IncrementalSort* operators. Also, we want to avoid upsetting the balance with hashing methods unless that choice is proven incorrect. Previously, we experimented with changing the formula for the cost of sorting and noticed a shift in the balance between the *Sort*, *MergeAppend*, *GatherMerge* and *IncrementalSort* operators. Also, we ought to remember situations where the optimiser chooses between sorting lower or higher in the query tree, at least:

- sorting each input below the JOIN versus sorting the result of the JOIN;
- performing a MergeJoin, which yields an already-sorted data stream at the output, versus a NestLoop or HashJoin followed by a sort.

The cost of quicksort (see the cost_tuplesort() function) in the current implementation of PostgreSQL is estimated using the formula:

*cost = cmp_cost · T · log2(T)*

where *cmp_cost* is a generalised cost of a comparison operator call (*cpu_operator_cost* in PostgreSQL terms) and *T* is the number of incoming tuples.
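As a sanity check, the T·log2(T) shape is trivial to model; the 0.0025 default for *cmp_cost* below is just an illustrative stand-in for a value derived from *cpu_operator_cost*:

```python
from math import log2

def quicksort_cost(tuples, cmp_cost=0.0025):
    """Classic comparison-sort cost: cmp_cost * T * log2(T).
    cmp_cost stands in for PostgreSQL's generalised comparison cost;
    the 0.0025 default is purely illustrative."""
    return cmp_cost * tuples * log2(tuples)

# Doubling the input grows the cost slightly faster than 2x:
print(quicksort_cost(2_000_000) / quicksort_cost(1_000_000))  # ~2.1
```

Note that the model is blind to how many columns each comparison has to touch, which is exactly the gap discussed next.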

Obviously, this formula does not consider the number of columns in the sorting tuple, the characteristic width of the sorted values, or the number of duplicates in an individual column. At the same time, we know that the comparison operator is called separately for the pair of values in each column. Imagine that the first column contains unique values: in that case, the Sort operation doesn't need to call the comparison operator for subsequent columns at all. This would seem to be an insignificant fact. However, if the next pair of columns being compared contains a text value of considerable length - as in the example above, where the almost unique value '*n*' was followed by the full name of the company - then the order of the sorting columns clearly matters in terms of CPU cycles spent. Thus, when tuple columns have very different numbers of duplicates, minimising the number of comparison calls makes sense, especially if the comparison operator is expensive. For example, the complexity of handling a variable-length bytea type can depend significantly on the storage method: PostgreSQL can use compression or the TOAST technology to optimise the storage of such values. It also makes sense when most of the data is generated within a query - for example, a Cartesian product of tables - in which case access to disk, or even the buffer cache, is not a significant factor. One real-world example where this behaviour could play an important role is grouping purchase analytics by ID, gender, age and status (especially if the latter is in text form). The order "gender, status, age, ID" is worse if the ID is an almost unique field of fixed width.

The first evident step towards complicating the sorting model is to add a multiplier to *cmp_cost* that introduces a dependence on the number of columns *n*, just as hash aggregation does:

*cost = n · cmp_cost · T · log2(T)*

The next complication takes into account the order of the columns in the sorting cost model. In the absence of extended statistics, which is the widespread case today, we have only one factor valid enough to be used - the ndistinct value of the first column in the sorting tuple. Having that statistic, we introduce an estimation of the number of value-comparison operations per tuple pair into the cost model as follows:

*cmp_cost = C_0 · (1 + (n - 1) · (T - M) / (T - 1))*

where *T* is the number of incoming tuples, *T > 1*; *M* is the number of groups (the ndistinct value) predicted by statistics for the first column, *M ≥ 1*; and *C_0* is the cost of a single one-column comparison.

This approach allows us to rank sortings of the same column set depending on the estimated number of duplicates in the first column of the sorting tuple, keeping *cmp_cost* within the range *C_0 … n·C_0*. As a positive outcome, it is a simple step up from the primitive model above and gives the optimiser an additional factor for finding the optimal plan.
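A small model of this interpolation (a sketch consistent with the limiting cases discussed below, not the actual patch code):

```python
def per_tuple_cmp_cost(C0, n, T, M):
    """Per-tuple-pair comparison cost, interpolated linearly by the number
    of groups M in the first sort column (1 <= M <= T, T > 1).
    M = T (unique first column) -> C0:     one comparison resolves the pair.
    M = 1 (a single group)      -> n * C0: all n columns must be compared."""
    return C0 * (1 + (n - 1) * (T - M) / (T - 1))

C0, n, T = 1.0, 4, 1_000_000
print(per_tuple_cmp_cost(C0, n, T, T))  # 1.0: unique first column
print(per_tuple_cmp_cost(C0, n, T, 1))  # 4.0: all four columns touched
```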

For only one group *(M = 1)*, which means all values of the given column are equal to each other, this formula boils down to *cmp_cost = n·C_0*.

In the worst-case scenario, when comparing each pair of tuples, the sorting operator will traverse all the columns and invoke the comparison operator for each pair of values. The uncertainty arises in the second and subsequent columns - we lack statistics there, so we stick to the extreme case. Should we rely on an average case instead, i.e. *n/2*? If we did, we would risk smoothing out the cost difference and reducing the chances of choosing the column order determined by ndistinct - remember, in the current PostgreSQL 17 implementation it has two competitors: the order specified by the ORDER-BY expression and the pathkeys sort order coming from the subquery below. For unique values in the first column *(M = T)*, the formula simplifies to *cmp_cost = C_0*, since transitions to comparison by the second attribute never occur - the values of the first column are sufficient to order any pair of tuples.

The linearity of the formula ensures the same gradient over the entire range of group counts, which can be interpreted as equal sensitivity to changes. If you use, for example, a hyperbola, you can increase sensitivity to changes in the number of groups within a specific range of values. The goal could be, say, to prefer such a strategy only when the first column is (almost) unique according to ndistinct. Without specific ideas about what is better, we choose the linear rule.

How is the above formula transformed if ndistinct is known not for only one column but for the entire prefix of the list of sorted columns? - this can happen, for example, with defined extended statistics.

Let's imagine we have the ndistinct value for a pair of columns *(x_1, x_2)*. Then, if this pair of columns stands at the beginning of the list of sorted values *(x_1, x_2, ..., x_n)*, the formula above can be refined: the number of comparisons that proceed past the second column is governed by the pair's ndistinct, while the first-column factor *f_1* is still calculated using the standard ndistinct statistics of the first column.

It is curious that, since the value of ndistinct does not depend on the order of the columns, we can in this case also estimate the cost of sorting for the alternative sort order *(x_2, x_1, ..., x_n)*. This can be useful for choosing, for example, the optimal ordering of the values in the GROUP-BY list.

Taking into account that in PostgreSQL the presence of extended statistics for a set of *m* columns of a table implies an ndistinct value for any subset of those columns, a more general formula covering *m < n, n ≥ 2* can be derived in the same manner.

Thus, it is possible to refine the estimate of the sorting cost by detailing the statistics for a given table.

We can get many alternative plans with extended statistics, such as a merge join or grouping. For example, having a column *x _{1}* with a considerable ndistinct value and statistics on non-unique

There are many possible sort-order permutations, but the scope of such estimation is too narrow - it is worth considering only a small set of promising orderings, or introducing empirical rules for choosing the best one. The simplest empiric is to evaluate the column order sorted by ndistinct value. Another, less obvious, rule of thumb is to arrange columns according to their average width.

The authors' practical experience shows that, in most cases, this optimisation is a second-order factor, a fine-tuning of the query plan. There is a high probability that the computational cost of finding the best sort order will surpass its effect on query execution time. Putting aside the possibility of caching and query plan reuse, it is necessary to limit the use of this strategy with some reliable empirical criterion. The relative weight of the sorting operation can serve as such an empiric. Suppose the cost of sorting is a significant part of the work (say, 20% of the cost of the underlying subtree). In that case, the higher-level operator can test alternative combinations of the sorted columns. However, this requires adding an extra step to the operator planning procedure: after choosing a sorted path for the current set of pathkeys, analyse the generated path and, at that moment, decide on an additional search for optimal combinations of sorting columns.
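The trigger condition can be sketched in a few lines; `worth_exploring_orders` and the 20% figure are illustrative, taken from the text above, not an actual GUC or PostgreSQL function:

```python
def worth_exploring_orders(sort_cost, input_subtree_cost, threshold=0.2):
    """Empirical trigger: only spend planning time on alternative
    sort-column orders when sorting is a significant share of the work
    below it. The 0.2 threshold is the illustrative figure from the text."""
    return sort_cost >= threshold * input_subtree_cost

print(worth_exploring_orders(300.0, 1000.0))  # sort is 30% of the input cost
print(worth_exploring_orders(100.0, 1000.0))  # only 10%: skip the extra search
```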

Although initially it seemed quite difficult to code, it turned out much simpler on top of the solution explained and implemented earlier. The primitive model of sort cost estimation was implemented in advance to check for broken cost balances.

Curiously, even that model raised some issues. For example, *MergeAppend* also compares tuples incoming from different subplans. Hence, *MergeAppend* + *Sort*-of-siblings is an essential competitor to the *Append* + *Sort*-of-result strategy, and to keep a good balance, we must add the same multiplier there. It's precisely the same with the *GatherMerge* path in the case of parallel plans. The *IncrementalSort* cost model raised the same issues, but that model looks like a mix of compromises so far, and changing it means opening a Pandora's box.

The second step, adding the estimation of the number of duplicates in the first column, was more straightforward because it required only a tiny complication. As an outcome, it triggered a preference for the *Sort* over the *IncrementalSort* strategy, which may be studied later.

The final introduction of the new GROUP-BY ordering strategy includes one pass along the list of grouping clauses. Checking the statistics of each of these columns to find the one with the minimum number of duplicates to put first in the grouping list looks a bit expensive. Still, we hope that many-column grouping is needed mostly in analytic queries, which consume a lot of computational resources by themselves, so the planning improvement would be worth it.

The links to commits above are only a sketch of possible solutions, so look into the development branch instead to see the current state.

Taking the number of columns into account in the sorting cost is physically justified. It brings the detail of this model closer to the model for estimating the cost of hash aggregation, which already considers the number of columns in a row.

The impact of this change on the internal balances of the PostgreSQL cost model can be adjusted:

- by choosing, as the extreme case, not a complete comparison of all columns for each pair of tuples but something more moderate, for example *n/2*;
- by selecting a different interpolation law between the extreme values.

The formula can be extended to the case of extended statistics for a subset of columns from the grouping list. At the same time, the question remains open about the relationship between the ndistinct value and the complexity of the comparison operator: is it possible to determine a sufficiently objective relationship between these parameters? Solving this problem would make it possible to decide on the most efficient sorting order, considering all significant factors.

THE END.

This time, I encountered quite a subtle issue - incremental sort estimation instability.

Incremental sort was added in 2020 and is one of the essential tools, alongside intelligent features like *Parameterised NestLoop*, *Material* and *Memoize*, that allow PostgreSQL to survive the challenge of processing big data and analytical queries. It gives the optimiser one more strategy to bypass massive sort operations by utilising sort orders derived from index scans, or to switch to *MergeJoin* to have at least partly presorted input before some grouping or ordering operation.

As a beneficial tool, it should have a reasonable cost model, because if *Sort* takes a significant part of the query work, mistakes here can cause execution time degradation - maybe not as huge as using *NestLoop* over millions of tuples, but unpleasant enough to look for a way to improve it.

We have already seen how this new optimisation strategy sometimes allows new evil query plans to emerge. So, careful estimation matters.

To demonstrate the current issue, let's create a 'test' table with a couple of columns, x and y, where x contains unique values and y contains a constant:

```
CREATE TABLE test(x integer, y integer,z text);
INSERT INTO test (x,y) SELECT x, 1 FROM generate_series(1,1000000) AS x;
CREATE INDEX ON test(x);
CREATE INDEX ON test(y);
VACUUM ANALYZE;
SET max_parallel_workers_per_gather = 0;
```

So, we have two columns with opposite numbers of distinct values. Don't forget to add indexes on each column so that plans using the *IncrementalSort* node are considered. Disable parallel workers so as not to smooth out execution time differences, and let's execute the following query on brand-new PostgreSQL 17:

```
EXPLAIN (ANALYZE, TIMING OFF)
SELECT count(*) FROM test t1, test t2
WHERE t1.x=t2.y AND t1.y=t2.x
GROUP BY t1.x,t1.y;

EXPLAIN (1)
===========
GroupAggregate (cost=37824.89..37824.96 rows=1 width=16)
Group Key: t1.y, t1.x
-> Incremental Sort (cost=37824.89..37824.94 rows=2 width=8)
Sort Key: t1.y, t1.x
Presorted Key: t1.y
-> Merge Join (cost=0.85..37824.88 rows=1 width=8)
Merge Cond: (t1.y = t2.x)
Join Filter: (t2.y = t1.x)
-> Index Scan using test_y_idx on test t1
-> Index Scan using test_x_idx on test t2
Execution Time: 1518.249 ms
```

Personally, I would prefer sorting with the unique column 'x' at the head of the sort list. But set that aside and just reverse the left and right sides of the expressions in the WHERE condition:

```
EXPLAIN (ANALYZE, TIMING OFF)
SELECT count(*) FROM test t1, test t2
WHERE t2.y=t1.x AND t2.x=t1.y
GROUP BY t1.x,t1.y;
```

Nothing special, right? We should get precisely the same plan, shouldn’t we? But executing that, we see a different plan:

```
EXPLAIN (2)
===========
GroupAggregate (cost=37824.89..37824.92 rows=1 width=16)
Group Key: t1.x, t1.y
-> Sort (cost=37824.89..37824.90 rows=1 width=8)
Sort Key: t1.x, t1.y
Sort Method: quicksort Memory: 25kB
-> Merge Join (cost=0.85..37824.88 rows=1 width=8)
Merge Cond: (t1.y = t2.x)
Join Filter: (t2.y = t1.x)
-> Index Scan using test_y_idx on test t1
-> Index Scan using test_x_idx on test t2
Execution Time: 1535.216 ms
```

Where is our incremental sort? Why did the optimiser choose a full sort in this case and change the order of the grouping columns? Disable *Sort* and execute again:

```
SET enable_sort = off;
EXPLAIN (3)
===========
GroupAggregate (cost=37824.89..37824.96 rows=1 width=16)
Group Key: t1.x, t1.y
-> Incremental Sort (cost=37824.89..37824.94 rows=2 width=8)
Sort Key: t1.x, t1.y
Presorted Key: t1.x
-> Merge Join (cost=0.85..37824.88 rows=1 width=8)
Merge Cond: (t1.x = t2.y)
Join Filter: (t1.y = t2.x)
-> Index Scan using test_x_idx on test
-> Index Scan using test_y_idx on test t2
Execution Time: 601.715 ms
```

Now we have our *Incremental Sort* with the same cost as in case (1), but still with the reversed order of grouping columns and presorted key. Using t1.x as the presorted key here is definitely better because, de facto, we don't need any sorting at all - *Index Scan* returns tuples in presorted order, and since all t1.x values are distinct, we never touch t1.y during sorting. You can see the result - the execution time halved!

But we want to identify what's happening here. To force the optimiser to choose t1.y as the presorted key, switch off the GROUP-BY list reordering feature and execute the query again:

```
SET enable_group_by_reordering = off;
EXPLAIN (ANALYZE, TIMING OFF)
SELECT count(*) FROM test t1, test t2
WHERE t2.y=t1.x AND t2.x=t1.y
GROUP BY t1.y,t1.x;

EXPLAIN (4)
===========
GroupAggregate (cost=18912.88..37825.00 rows=1 width=16)
Group Key: t1.y, t1.x
-> Incremental Sort (cost=18912.88..37824.97 rows=2 width=8)
Sort Key: t1.y, t1.x
Presorted Key: t1.y
-> Merge Join (cost=0.85..37824.88 rows=1 width=8)
Merge Cond: (t1.y = t2.x)
Join Filter: (t2.y = t1.x)
-> Index Scan using test_y_idx on test t1
-> Index Scan using test_x_idx on test t2
Execution Time: 1336.453 ms
```

We ended up with the same query plan and execution time as in case (1). But comparing explains (1) and (4), we see a difference in the costs of the *Incremental Sort* nodes:

```
-> Incremental Sort (cost=37824.89..37824.94 ...
Sort Key: t1.y, t1.x
Presorted Key: t1.y
-> Merge Join (cost=0.85..37824.88 ...
-> Incremental Sort (cost=18912.88..37824.97 ...
Sort Key: t1.y, t1.x
Presorted Key: t1.y
-> Merge Join (cost=0.85..37824.88 ...
```

Identical sort lists, presorted keys and underlying query trees, but different startup and total costs! And that's the essence of the issue: by trivially reversing an expression's left and right sides, we get different estimations of the same query and a significant speedup! Diving into the code, you can find that the source of this instability lies in the logic of choosing the expression to be estimated. Having the equivalence class `t1.y = t2.x` and its members {`t1.y`, `t2.x`}, the optimiser selects just the first equivalence member from the list to estimate the number of distinct values. Hence, this choice depends on the text of the SQL query. Take a glance at the statistics on these attributes:

```
SELECT c.relname, a.attname, s.stadistinct
FROM pg_statistic s JOIN pg_class c ON (s.starelid = c.oid)
JOIN pg_attribute a ON (c.oid = a.attrelid AND s.staattnum = a.attnum)
WHERE c.relname = 'test';
relname | attname | stadistinct
---------+---------+-------------
test | x | -1
test | y | 1
```

Here, -1 means 100% distinct values - i.e., column x is close to unique - while y contains only a single distinct value. Having that data, Postgres estimates the number of distinct values in the presorted key column as either 1 or 1000000, depending only on the syntactic order of the sides of the expression.
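The stadistinct convention and the first-member selection problem can be modelled in a few lines of Python; `ndistinct_from_stadistinct` and the member lists below are illustrative, not optimiser code:

```python
def ndistinct_from_stadistinct(stadistinct, ntuples):
    """pg_statistic convention: a negative stadistinct is a fraction of rows
    (-1 => every row is distinct); a positive value is an absolute count."""
    return -stadistinct * ntuples if stadistinct < 0 else stadistinct

ntuples = 1_000_000
stats = {"t1.x": -1.0, "t2.x": -1.0, "t1.y": 1.0, "t2.y": 1.0}

# The optimiser estimates presorted groups from the FIRST member of the
# equivalence class - and which member comes first depends on the SQL text:
ec_a = ["t1.y", "t2.x"]   # WHERE t1.y = t2.x
ec_b = ["t2.x", "t1.y"]   # WHERE t2.x = t1.y (same predicate, reversed)
est_a = ndistinct_from_stadistinct(stats[ec_a[0]], ntuples)
est_b = ndistinct_from_stadistinct(stats[ec_b[0]], ntuples)
print(est_a, est_b)  # 1.0 vs 1000000.0: same clause, wildly different estimate
```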

In toto, we have demonstrated a weak spot in the optimiser code that sometimes doesn't allow it to find the most optimal plan. Because of the subtle and unusual nature of this bug, complex queries and database schemas can cause 'performance cliffs' that puzzle DBAs a lot. Of course, no optimiser is perfect, but dependence on the query text looks like a bug, doesn't it?

THE END.

While Oracle has undoubtedly introduced many impressive solutions, the transition from Oracle to PostgreSQL has been remarkably smooth. We've seen a significant reduction in migration challenges, with most migrations proceeding without a hitch. We've even developed a client session variables extension to facilitate migration. Admittedly, many of these solutions are under enterprise licence. However, in the PostgreSQL world, it's not uncommon for popular code to find its way into the core, isn't it?

However, the landscape changes when it comes to MSSQL to PostgreSQL migrations. We've encountered some challenges, with clients reporting significant query slowdowns during migration benchmarks. These problematic queries are diverse, and the environments and database sizes vary from gigabytes to terabytes. In some instances, queries have become so unoptimised that even after two weeks of execution they remain unfinished, a stark contrast to the 20 ms execution time in MSSQL. This is not a mere stumbling block but a testament to the technological gap we're dealing with, and it prompted me to analyse these cases.

The first case I want to show looks relatively trivial. Let’s see the schematic:

```
HashAggregate (width=1002) (actual time=4000s rows=2.1E3)
Group Key: t1.x1, t1.x2, t1.x3, t1.x4
-> Nested Loop (width=662) (actual time=500s rows=1.5E9)
-> Seq Scan on t1 (actual time=0.3s rows=2.4E5)
-> Index Scan using t2_idx on t2 (actual time=0.3E-4 rows=11)
Index Cond: t2.x1 = t1.x1 AND t2.x2 = t1.x2
Filter: t2.x3 = t1.x3 AND t2.x4 = t1.x4
Rows Removed by Filter: 0
```

Here, t1 and t2 are two temporary tables containing around 2E5 tuples each.

We have one JOIN over two not-so-huge tables, with an aggregation on top. The join generates a massive number of tuples - more than one billion - and because of that it runs for around nine minutes. But what's really interesting here is the trivial hashed GROUP-BY, which consumes about one hour! Keeping in mind how cheap this operation usually is, that looks weird, and it is the main reason for scrutinising this case: MSSQL executes the whole query in only 300s.

There must be a reason, right? So, let's look into the MSSQL plan schematic:

```
HashAggregate (parallel 8 streams)
  Hash Join
    Index Scan t1
    Index Scan t2
```

Comparing the MSSQL and PostgreSQL plans, we see some differences. At first glance, HashJoin seems to be the better option. But replacing the NestLoop with a HashJoin gives us only a tiny speedup (300s instead of 500s) and doesn't influence the total execution time much. The second difference is parallel execution. Why is it so impactful here? To get some insights, look into the flame graph:

According to this graph, Postgres spends a lot of time generating hash values—the *hashtext()* routine—and comparing strings—the *texteq()* routine.

Having a billion incoming tuples generated by the JOIN operation, the query groups them into 21 thousand groups. That means around 7E4 tuples in each group - lots of duplicates! The second part of this enigma is the type of the columns: all four have text type.

Look into the statistics over these columns:

```
SELECT a.attname, s.stadistinct, s.stanullfrac, s.stawidth
FROM pg_statistic s, pg_attribute a
WHERE s.starelid = 16395 AND
      s.starelid = a.attrelid AND
      s.staattnum = a.attnum AND
      a.attname IN ('x1', 'x2', 'x3', 'x4');

 attname | stadistinct |  stanullfrac  | stawidth
---------+-------------+---------------+----------
 x1      |           7 |             0 |       72
 x2      |        3574 |             0 |       72
 x3      |           6 | 0.00033333333 |       72
 x4      |           3 |             0 |       50
```

Because of the duplicates, you can see that almost every comparison on the first column, 'x1', will need a second comparison on 'x2'. Moreover, each column is quite a long string, and to generate a hash or identify a duplicate, we have to pass through around 300 bytes on average. Remember that hash aggregation makes at least two computational operations for each incoming tuple: hash generation and string comparison. Recalling the billion incoming tuples, it may be a tremendous job and the reason grouping takes so long! In the absence of another obvious way to speed up the grouping operation, my main conjecture for MSSQL's impressive execution time is its utilisation of parallelism.
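The per-tuple arithmetic above can be made concrete with a small sketch (illustrative Python, not PostgreSQL code): every incoming tuple pays for one hash over all key columns, and with few groups almost every tuple also pays for at least one full key comparison against the bucket's stored key.

```python
# A rough sketch of the per-tuple work in a hash aggregate: hash the key
# columns, probe the table, and compare keys on a bucket hit. With many
# duplicates, nearly every probe hits an existing bucket.
def hash_group_count(rows):
    """Group rows (tuples of strings), counting hash and compare operations."""
    table, hashes, compares = {}, 0, 0
    for row in rows:
        hashes += 1                    # one hash over ~300 bytes of text keys
        if row in table:               # probe: at least one full key compare
            compares += 1
            table[row] += 1
        else:
            table[row] = 1
    return table, hashes, compares

# 1000 wide text rows collapsing into 10 groups: 1000 hashes, 990 compares.
rows = [(f"{i % 10:a>72}",) for i in range(1000)]
groups, hashes, compares = hash_group_count(rows)
```

Scaled to a billion tuples with ~72-byte text columns, these two operations per tuple dominate the grouping time.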

Summarising that, I see two advantages: *join algorithm selection* and *parallelism*.

According to the documentation, MSSQL has implemented multithreaded parallelism to speed up groupings. As I can imagine, it is not easy to execute an arbitrary aggregate in parallel mode, but trivial grouping by hash can be done in a highly parallel manner. In our case, when most of the data is generated in memory, it looks ideal for lightweight horizontal scaling.
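The structure of such parallel grouping can be sketched as follows (an assumption about the approach, not MSSQL's actual code): each worker builds a partial hash table over its slice of the input, and a final step merges the partial tables, mirroring the Partial/Finalize HashAggregate split that PostgreSQL itself uses with parallel workers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Sketch of parallel grouping by hash: split the input, build partial hash
# tables concurrently, then merge them in a finalize step.
def parallel_group_count(rows, workers=4):
    chunk = (len(rows) + workers - 1) // workers
    slices = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(Counter, slices))  # partial aggregation
    total = Counter()
    for part in partials:                           # finalize: merge tables
        total.update(part)
    return total

counts = parallel_group_count(["x", "y", "x"] * 1000)
```

The merge step is cheap because it touches only one entry per group per worker, not one per input tuple.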

PostgreSQL has parallel workers for such cases. However, being implemented as separate processes, they are a heavy tool. Postgres usually utilises 2-3 processes in a single query, parallelising the whole query subtree. What's more, the process model and the specifics of the temporary table implementation don't allow the use of parallel workers here.

To estimate how helpful parallel workers could be here, I made the tables persistent and forced many parallel workers. Forcing them is a bit tricky: you must reduce the parallel startup and tuple costs, reduce min_parallel_table_scan_size, and keep the max_parallel_workers GUC in mind. The final configuration looks like this:

```
SET max_parallel_workers = 64;
SET max_parallel_workers_per_gather = 16;
SET parallel_setup_cost = 0.001;
SET parallel_tuple_cost = 0.0001;
SET min_parallel_table_scan_size = 0;
```

And we are getting more comparable numbers on PostgreSQL:

```
Finalize HashAggregate (actual time=416s)
  Group Key: t1.x1, t1.x2, t1.x3, t1.x4
  -> Gather (actual time=416s)
     Workers Launched: 9
     -> Partial HashAggregate (actual time=416s)
        Group Key: t1.x1, t1.x2, t1.x3, t1.x4
        -> Nested Loop (actual time=68s)
           -> Parallel Seq Scan on t1 (actual time=0.08s)
           -> Index Scan using t2_idx on t2 (actual time=0.04s)
              Index Cond: t2.x1 = t1.x1 AND t2.x2 = t1.x2
              Filter: t2.x3 = t1.x3 AND t2.x4 = t1.x4
              Rows Removed by Filter: 0
Execution Time: 416.5s
```

As you can see, the parallel workers technique helps. Its drawback is the rigidity of the process model. We would have to resolve issues with temporary tables and with possible showstoppers lower in the plan (volatile functions in scan filters, for example) that can lead to the rejection of parallel workers. A thread model, used locally in the grouping node, looks more flexible. Perhaps it is worth pondering a custom aggregate implemented in a multi-threaded model. That's a good reason to start a GSoC project next year!

This case also represents one frequent problem: the multi-clause JOIN. Multi-clause means something like this:

`t1.x1 = t2.x1 AND t1.x2 = t2.x2 AND ... AND t1.xN = t2.xN`

PostgreSQL estimates the selectivity of the whole clause by estimating the selectivity of each AND'ed expression separately - let's denote them `s1, s2, ..., sN`.

Having these estimations, it calculates the total number of rows produced by this join as if the clauses were independent:

`rows = |t1| * |t2| * s1 * s2 * ... * sN`

As you can imagine, this formula tends to underestimate the number of tuples produced. So, while providing a good estimation for a single join clause, PostgreSQL underestimates when the data model requires joining tables by many clauses.
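The underestimation effect is easy to see numerically (a simplified sketch of the independence assumption, not the actual code of PostgreSQL's *calc_joinrel_size_estimate*; the distinct-value counts are hypothetical):

```python
# Sketch of the independence assumption: the join size estimate is the cross
# product scaled by the product of per-clause selectivities. For an equality
# clause, selectivity is roughly 1 / max(ndistinct(t1.x), ndistinct(t2.x)).
def estimated_join_rows(n1, n2, selectivities):
    rows = n1 * n2
    for s in selectivities:            # clauses are assumed independent
        rows *= s
    return rows

# Three equality clauses over columns with 50, 50 and 70 distinct values:
est = estimated_join_rows(100_000, 100_000, [1 / 50, 1 / 50, 1 / 70])
```

Each additional clause multiplies the estimate down; with correlated columns the true row count shrinks much more slowly, so the gap between estimate and reality grows with every clause.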

Looking into the MSSQL documentation and research, I found out that there are a lot of statistics the DBMS gathers: statistics on an index definition, on a WHERE clause, and custom-made statistics (an analogue of the Postgres CREATE STATISTICS) with the addition of multiple options. The most interesting options are sampling and the WHERE clause, which allow scanning only part of a table. MSSQL maintains so many statistics that its developers invented complicated methods for updating them asynchronously.

Related to our case, the composite join clause, containing equality expressions over four columns, is transformed (my conjecture) into a comparison between two rows, like this:

`ROW(t1.x1,t1.x2,t1.x3,t1.x4) = ROW(t2.x1,t2.x2,t2.x3,t2.x4)`

and distinct (or histogram) statistics on t1(x1,x2,x3,x4) and t2(x1,x2,x3,x4) as a whole allow MSSQL to estimate the cardinality of the JOIN more precisely. Fortunately, the PostgreSQL community already comprehends the issue and is working out a solution right now.
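The difference between the two estimation styles can be illustrated with the statistics shown earlier (a hypothetical sketch; the composite ndistinct value of 90,000 is an assumption standing in for "rows are almost unique taken together"):

```python
# Per-column estimation multiplies eqjoinsel-style factors; a composite
# ndistinct on the whole ROW(x1..x4) captures the near-uniqueness of rows.
def per_column_estimate(n1, n2, nd1, nd2):
    rows = n1 * n2
    for d1, d2 in zip(nd1, nd2):
        rows /= max(d1, d2)            # one factor per equality clause
    return rows

def composite_estimate(n1, n2, nd1_row, nd2_row):
    return n1 * n2 / max(nd1_row, nd2_row)  # a single row-wise clause

n = 100_000
# stadistinct values from the statistics above: 7, 3574, 6, 3
percol = per_column_estimate(n, n, [7, 3574, 6, 3], [7, 3574, 6, 3])
composite = composite_estimate(n, n, 90_000, 90_000)  # rows almost unique
```

With correlated columns, the per-column product drives the estimate well below what the composite statistics would give, which is exactly the underestimation this section describes.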

THE END.

What I see there is a recurring claim that ORMs divide logic into small queries, causing many round trips. But in my mind, it doesn't matter whether the logic is formulated as a single query or as many smaller ones - it is still an ORM. The SQL-generation core must be hidden under that ORM's hood, right?

The next claim is that SQL is too complex and contains too many keywords. Real applications, being written in object-oriented languages, need something more laconic. ORMs, while providing such a more native interface, have no uniformity. So, reading between the lines, I see a request to implement a kind of DB library for each language, where the implementation would not generate SQL queries but request objects from the database instance directly. That means a new query protocol that parses language-native requests into a query tree.

One more thing: they argue about a declarative approach to schema definition in object-oriented terms with declarative migration.

People tell me: why not implement something like bytecode [4] in PostgreSQL? In that case, we could send a binary representation of the query directly from the client to the instance core without intermediate checks.

Summarising the data I have learned from this project, I see the real issues are:

- Performance degradation because of the multi-query approach.
- Non-uniformity of ORM languages.
- Limited ORM functionality in comparison to full-fledged SQL queries.

In toto, I see that developers want to remove the client-side layer, which translates OO → SQL, and add the OO → parse tree functionality as an extension or directly into the core. This is all about performance concerns.

It makes sense. The main point here is to teach the DBMS to receive object-oriented language statements and translate them directly into a single parse tree. Do we have enough flexibility to allow an extension to fully replace the current core parser with a custom-made one, and would it be safe?

References:

1. EdgeDB: A next-generation graph-relational database. https://github.com/edgedb
2. A solution to the SQL vs. ORM dilemma. https://www.edgedb.com/blog/a-solution-to-the-sql-vs-orm-dilemma
3. Why ORMs are slow (and getting slower). https://www.edgedb.com/blog/why-orms-are-slow-and-getting-slower
4. Why SQLite Uses Bytecode. https://sqlite.org/draft/whybytecode.html

**Attention:** This text is designed like a technical report, explaining the feature in the internal terms of the PostgreSQL code. So, it can be hard for bystanders to read.

The Asymmetric Join (AJ) optimisation strategy introduces a novel approach to joining a partitioned relation (PR) and a non-partitioned relation (NR). Its uniqueness lies in joining each partition with the NR individually and then merging the results using the APPEND operation. It looks like a natural evolution of the partitionwise join technique (PWJ) [4]. Although we haven't seen any mention of this technique before, any links and references are welcome.

This strategy is a complete analogue of query rewriting using UNION ALL:

`SELECT * FROM A JOIN partitioned B ON A.x=B.x; `

to:

```
(SELECT * FROM A JOIN B_1 ON A.x=B_1.x)
UNION ALL
(SELECT * FROM A JOIN B_2 ON A.x=B_2.x);
```
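The effect of this rewrite can be sketched in miniature (an illustrative Python sketch, not optimiser code): the plain relation is joined with each partition independently via a small per-partition hash table, and the per-partition results are appended.

```python
# Sketch of the asymmetric join idea: one child hash join per partition,
# then an APPEND of the child-join results.
def asymmetric_hash_join(a_rows, b_partitions, key=lambda row: row[0]):
    result = []
    for part in b_partitions:                  # one child join per partition
        lookup = {}
        for brow in part:                      # build: small per-partition
            lookup.setdefault(key(brow), []).append(brow)  # hash table
        for arow in a_rows:                    # probe with the plain relation
            for brow in lookup.get(key(arow), []):
                result.append(arow + brow)     # APPEND of child-join results
    return result

a = [(1, "a1"), (2, "a2")]
b_parts = [[(1, "b1")], [(2, "b2"), (3, "b3")]]
joined = asymmetric_hash_join(a, b_parts)
```

Note how each child join builds a hash table over only one partition, which is the source of the memory and data-skew benefits discussed below.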

The majority of the implementation for this strategy can be found in the ‘relnode.c’ and ‘joinrels.c’ files.

This strategy has several benefits:

- It adds one more way to improve the efficiency of Parallel Append.
- It enables an independent choice of join strategy for each partition.
- The Append does not have to combine large relations; most tuples can be sifted out in the child joins. This also helps if the join or the target list has a heavy condition.
- It allows the partitioning condition to be pushed directly into the inner side, improving the filtering at the table scan.
- It reduces the hash table size of each child HashJoin and can ease the data skew problem.
- Additional ways of partition pruning might also be found.
- It suggests a further direction of FDW development towards shippable FOREIGN TABLEs, as an analogy to shippable functions.

The obvious price is that the search space for plans grows.

AJ can be applied when:

- the inner of the AJ is a table or a query subtree that does not contain partitioned tables in its join tree;
- the outer doesn't have lateral references to the inner.

The code is similar to partitionwise_join and consists of changes in three parts of the optimiser code.

The first part of AJ is the initialisation of the partitioning properties of the RelOptInfo structure - *part_scheme* and *part_exprs*. The AJ logic has been added to the end of the *build_joinrel_partition_info* function, and we only reach it if there is no way to apply PWJ. Since the optimiser will not call this function with the reverse order of inner and outer (for the partitionwise method, this makes no sense), for AJ we must immediately check both options of placing the inputs. In the case of AJ, the inner does not have partitioning properties of its own; they are inherited from the outer. Here, the optimiser also sets the *consider_asymmetric_join* flag, which only means that this joinrel can potentially be executed using the PWJ method and serve as an input for PWJ or AJ (only as an outer) at a higher level.

A significant difference introduced into the *build_joinrel_partition_info* routine is caused by the asymmetrical nature of this join technique. Let's say a join is made between two tables P1 and P2, partitioned under the same scheme, and one plain table T. If on the first attempt outer = (P1, P2) and inner = (T), then PWJ cannot be built and AJ will be initialised. Then, on the next attempt, we have outer = (P1, T) and inner = (P2), and the PWJ option cannot be considered. If the order of searching through combinations is different and the optimiser tries to construct *(P1, T) JOIN (P2)* first, then PWJ will be initialised, and subsequent AJs between *(P1, T) JOIN (P2)* or *(P1) JOIN (T, P2)* will be rejected. For example, you can look at the following queries, which are also added to the regression tests in *partition_join.sql*:

```
EXPLAIN (COSTS OFF) -- PWJ on top
SELECT * from prt1 d1, unnest(array[3,4]) n, prt2 d2
WHERE d1.a = n AND d2.b = d1.a;
EXPLAIN (COSTS OFF) -- AJ on top
SELECT * from prt1 d1, prt2 d2, unnest(array[3,4]) n
WHERE d1.a = n AND d2.b = d1.a;
```

Curiously, no matter which combination of inner and outer the optimiser tries to construct the joinrel - the *part_scheme* will be precisely the same pointer, and *part_exprs* must match up to the permutation of elements. This fact makes it possible to change the implementation in the future, eliminating dependency on the order of inner/outer combinations described above.
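The invariant just described can be stated as a tiny sketch (hypothetical helper, not core code; in C, *part_scheme* identity is pointer equality, and *part_exprs* must match as a multiset):

```python
# Sketch of the invariant: for any inner/outer combination producing the
# same joinrel, part_scheme is the very same object and part_exprs matches
# up to a permutation of its elements.
def partition_info_matches(scheme_a, exprs_a, scheme_b, exprs_b):
    return scheme_a is scheme_b and sorted(exprs_a) == sorted(exprs_b)

scheme = object()                      # one shared PartitionScheme instance
same = partition_info_matches(scheme, ["t1.x", "t2.y"],
                              scheme, ["t2.y", "t1.x"])
```

A check of this shape is what would let a future implementation drop the dependency on the inner/outer combination order.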

Also, we still have an open question: for the same joinrel, can some combinations of inner and outer allow AJ while others cannot? To detect that, we added code to *build_join_rel()* which, with assertions enabled, checks that if the *part_scheme* for an already existing joinrel has not been created, it will not appear after executing the *build_joinrel_partition_info* routine either. The mechanism does not consider all possible options but can help identify errors in autotests.

Another consequence of introducing AJ is that we must either save allowed combinations (inner, outer) or recheck the conditions of AJ in the try_asymmetric_partitionwise_join function. Logically, the first is cheaper.

One significant difference between the AJ and PWJ logic is that the *consider_partitionwise_join* flag is set gradually on RelOptInfo at different levels, starting with the lowest one - the partitioned table. Thus, a tree is built in which all elements have the required properties. In the case of AJ, if you follow the same path, you will first need to check a lot (all sorts of tablesample, functionscan, etc.). In addition, there will be overhead even if there are no partitioned tables in the query. Therefore, in this implementation, *build_joinrel_partition_info* checks and sets the AJ properties for the query *subtree*.

Checking all entries and subtree clauses in each combination is too expensive. Might it be better to invent a *safe_for_asymmetric_join* flag? Then, it will only be necessary to check the current *RelOptInfo* and the flags of the underlying ones. It would also be closer to PWJ implementation.

The *try_asymmetric_partitionwise_join* routine is called in *populate_joinrel_with_paths* immediately after *try_partitionwise_join*. Since the caller is not expected to test different orderings of the inputs, we must do this ourselves, given that the jointype may not allow us to swap the sides. This needs to be looked into carefully: logically, we can swap the inner and outer sides by replacing, for example, a LEFT OUTER JOIN with a RIGHT OUTER JOIN. However, whether that is technically simple and reasonable to accomplish is an open question. In any case, the first step should be a demo query showing the limitation of our optimisation resulting from this section of code.

Here, we just double-check on the inner for AJ's correctness at the moment of building child joinrel.

Next, we initialise the *RelOptInfo* fields associated with partitioning: *boundinfo*, *nparts*, *part_rels*. If the fields are already initialised, then we simply check the identity of the partitioning scheme:

`Assert(joinrel->nparts == prel->nparts && joinrel->part_rels != NULL);`

Here, I have a question that needs to be checked (see doubt No. 3).

Next, if everything is okay, we go through each partition from the outer and build a join with the inner. The tricky point here is replacing references to the relid of a partitioned table with the relid of its partition.

Next, we call *populate_joinrel_with_paths* and add paths to the child joinrel's pathlist for the combination of inner/outer.

Here, we collect all living *part_rels* of a given rel (recursively) and build APPEND, thus finishing the construction of partitionwise paths. The changes are minimal since, from the point of view of this function, the partition-related *RelOptInfo* fields for AJ and PWJ are identical. We just need to fix some consistency checks.

An important feature that interacts with AJ is reparameterisation (see the Parameterised NestLoop technique). A parameterised expression can refer to the inner side of some AJ, which can be inside the outer side of the parameterised NestLoop (PNL). Each *path* in the AJ’s child join may refer to the same underlying *path* on the inner side. Because of that, reparameterisation, altering such a path, must make a private copy beforehand.

To process this correctly, we introduced the *is_asymmetric_join* function, which detects when a path is an asymmetric join and triggers the replacement of relids in a flat copy of the path. The AJ criterion is the following: one of the inputs is a partition, and the other is a baserel or joinrel. Here, we may capture some extra cases, but copying instead of changing in place will not make things worse; it just wastes a bit more memory.

It's still not obvious that we are checking the inner subtree entirely and correctly. What if the join clause contains a volatile function?

Also, we still haven't figured out how to expand the root->simple_*** array. Without this, we don't have native PostgreSQL core integration. See [3] for the reason.

The AJ & PWJ scheme is such that AJ, applied many times to the same joinrel with different inner/outer combinations, does not seem to lead to different partitioning schemes. But can one plain relation cause partition pruning differently than another combination of plain and partitioned relations? Could this cause the pruning of different part_rels[i]? And how can this affect the construction of the plan?

Current pgsql-hackers thread.

Current version can be found in this branch of my GitHub repository.

Partition-wise joins: "divide and conquer" for joins between partitioned tables
https://www.postgresql.org/message-id/CAExHW5vOGLD5MUW2tMTYR8pSjcT67%2BRVRyDy99fUSCKsdBELaA%40mail.gmail.com