This post discusses the current sort cost model, identifies its shortcomings, and proposes a new, more comprehensive model that widens the optimiser's search space. Since this post is aimed at developers, we intentionally use terminology derived from the PostgreSQL internals, which may sometimes be unclear to newcomers.
The background of this work is the long story of the GROUP-BY optimisation feature: reverted in 2023, it finally made its way into the core in 2024, although only a tiny part of the feature was accepted. Challenged with the advice to "Embrace a down-to-earth attitude", we have tried to analyse the situation before the next attempt.
The current development code can be found in the GitHub branch of Postgres Professional's public Postgres fork.
Introduction
In the current implementation of the PostgreSQL optimiser, the complexity of sorting is determined only by the number of incoming rows, also known as tuples. While this is a logical and reliable solution, such a model does not consider the composite nature of a tuple or the number of duplicates in the input data. Meanwhile, the hash aggregation cost model already employs the number of columns in a tuple and statistics on the number of duplicates. At the same time, when searching for the optimal plan, the optimiser often has the freedom to choose the order of the columns in the resulting tuple of an operation.
An example is the GROUP-BY operation - one of its execution algorithms requires pre-sorted data. Likewise, the MergeJoin operator can change the order of expressions in the join clause and optimise the plan by considering different sort orders of the left and right inputs. If it could differentiate such sort orders by cost, the optimiser could reasonably choose the order of clauses in the grouping operator and discover better alternatives for the query plan.
After reviewing the ACM, arXiv and IEEE catalogues, we found no work devoted to estimating the cost of multi-column sorting. Looking at the competitors' behaviour, we found that MySQL, whose code is directly available for study, implements a simple sort cost model that does not consider the length or the number of columns in the tuple. Experiments with MSSQL and Oracle also showed no signs of such tuning. That is why we commenced this work, which ended up in this post.
So, let's demonstrate the problem using a simple degenerate example:
CREATE TABLE test (
n integer, t integer, company text, counterpart_company text
);
INSERT INTO test (n,t,company,counterpart_company)
SELECT n, n % 10000 AS t,
CASE n % 3
WHEN 0
THEN 'A Limited Liability Partnership (LLP) "White chamomile"'
WHEN 1
THEN 'A Limited Liability Partnership (LLP) "White swan"'
WHEN 2
THEN 'A Limited Liability Partnership (LLP) "White idea"'
END AS company,
CASE (n + 1) % 3
WHEN 0
THEN 'A Limited Liability Partnership (LLP) "White chamomile"'
WHEN 1
THEN 'A Limited Liability Partnership (LLP) "White swan"'
WHEN 2
THEN 'A Limited Liability Partnership (LLP) "White idea"'
END AS counterpart_company
FROM generate_series(1,1000000) AS gs(n);
VACUUM ANALYZE test;
Thanks to Ivan Frolkov, who provided us with this example.
Here we have a table whose columns differ both in the number of duplicates and in their average width. Let's see how the specific order of the sort columns impacts execution time:
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT n,t,company,counterpart_company FROM test ORDER BY 1,2,3,4;
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT n,t,company,counterpart_company FROM test ORDER BY 4,3,2,1;
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT n,t,company,counterpart_company FROM test ORDER BY 1,3,4,2;
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT n,t,company,counterpart_company FROM test ORDER BY 2,3,4,1;
The execution times of these queries in our experiment were:
1. 853 ms
2. 4701 ms
3. 854 ms
4. 5065 ms
It is noticeable that in options No. 2 and No. 4, where wide columns with a large number of duplicates are sorted first, the cost of the operation increases by around six times. Thus, the sort order can significantly impact the computational effort required to process a query. Hence, being able to distinguish between such sort orders and choose the best one gives us an additional query optimisation tool. For example, if a GROUP-BY operation is higher up the query tree, you can significantly reduce the overhead by choosing the optimal order of grouping expressions. It should be noted that to obtain such a large difference, we cheated a little by declaring the scanned table as temporary, which eliminates the use of parallel workers and makes the execution time gap more evident. Also, we deliberately use ORDER-BY instead of GROUP-BY to focus attention on the cost of sorting itself and avoid the overhead of the grouping operation.
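The column statistics collected by ANALYZE make this pattern easy to see; pg_stats is the standard view for them. If you want to reproduce the serial timings without declaring the table as temporary, disabling parallel workers achieves a similar effect:
SET max_parallel_workers_per_gather = 0;
-- Per-column number of distinct values and average width in bytes
SELECT attname, n_distinct, avg_width
FROM pg_stats
WHERE tablename = 'test';
Here n is (almost) unique, t has about 10,000 distinct values, and the two company columns have only three distinct values each while being much wider.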
Requirements
The main problem with altering the optimiser's cost model is keeping the balance between the costs of various operations. The cost model merely approximates an operation's actual cost and does not account for many minor factors or the exact characteristics of the hardware and software platform. Making the cost model of one query tree operator more detailed can create unjustified bias in favour of one strategy. When modifying the sort cost model, you should check the cost models of at least the MergeAppend, GatherMerge and IncrementalSort operators. We also want to avoid upsetting the balance with hashing methods unless the current choice is proven incorrect. We have previously experimented with changing the formula for calculating the cost of sorting and noticed a shift in the balance between the Sort, MergeAppend, GatherMerge and IncrementalSort operators. We also ought to remember the situations where the optimiser chooses between sorting lower and higher in the query tree, at least:
Sorting each input below the join versus sorting the result of the join.
Performing a MergeJoin, which yields an already sorted data stream at its output, versus a NestLoop or HashJoin followed by an explicit sort (see the example below).
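For instance, on the demo table the optimiser already has to weigh exactly this trade-off; which plan wins depends on the settings and statistics, so treat this merely as an illustration:
EXPLAIN (COSTS OFF)
SELECT t1.n, t2.company FROM test t1 JOIN test t2 ON t1.n = t2.n ORDER BY t1.n;
A MergeJoin here delivers the ORDER BY for free, while a HashJoin or NestLoop plan has to add an explicit Sort on top.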
Proposed modifications
The cost of quick sort (see the cost_tuplesort function) in the current implementation of PostgreSQL is estimated using the formula:
cost = cmp_cost * T * log2(T)
where cmp_cost is a generalised cost of the comparison operator call (derived from cpu_operator_cost in PostgreSQL terms) and T is the number of incoming tuples.
Obviously, this formula does not consider the number of columns in the sorting tuple, the characteristic width of the sorted values, or the number of duplicates in individual columns. At the same time, we know that the comparison operator is called separately for a pair of values in each column. Imagine that the first column contains unique values. In that case, the Sort operation doesn't need to call the comparison operator for the subsequent columns at all. This would seem to be an insignificant detail. However, if we imagine that the next pair of columns being compared contains a text value of considerable length - as in the example above, where the almost unique column n was followed by the full company name - then the order of the sort columns clearly matters in terms of CPU cycles spent on sorting. Thus, when tuple columns have very different numbers of duplicates, minimising the number of comparison calls makes sense, especially if the comparison operator is expensive. For example, the complexity of comparing a variable-length bytea value can significantly depend on the storage method: PostgreSQL can use compression or the TOAST technology to optimise the storage of such values. It also makes sense when most of the data is generated within a query - for example, a Cartesian product of tables - in which case access to disk or even the buffer cache is not a significant factor. One real-world example where this behaviour could play an important role is grouping purchase analytics by ID, gender, age, and status (especially if the latter is in text form). The order "gender, status, age, ID" is worse if the ID is an almost unique fixed-width field.
The first evident step towards a more detailed sorting model is to add a multiplier that introduces a dependence on the number of columns into the cmp_cost value, just as hash aggregation does.
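In symbols (a sketch of this step in our notation rather than the exact patched code), with n columns in the sort key the per-comparison cost is simply scaled by the number of columns:
cost = n * cmp_cost * T * log2(T)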
The next complication is to take the order of columns into account in the sorting cost model. In the absence of extended statistics, which is the widespread case today, we only have one factor reliable enough to be used: the ndistinct value of the first column in the sorting tuple. Having that statistic, we can enter an estimate of the number of value comparisons per tuple into the cost model, driven by two parameters: T, the number of incoming tuples (T > 1), and M, the number of groups (the ndistinct value) predicted by statistics for the first column (M ≥ 1).
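A sketch of such an estimate (our notation; the actual patch may differ in details) is a linear interpolation between the two extreme cases analysed below - all n columns compared when the first column forms a single group, only the first column compared when it is unique:
cmp_cost(M) = C0 * (n - (n - 1) * (M - 1) / (T - 1))
where C0 is the cost of a single comparison operator call and n is the number of sort columns. On the demo table, sorting with the almost unique n first gives M close to T and about one comparison per tuple pair, while sorting with company first gives M = 3 and nearly the full n comparisons per pair.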
This approach allows us to rank sort orders of the same column set depending on the estimated number of duplicates in the first column of the sorting tuple, keeping cmp_cost within the range of values C0, ..., n*C0. As a positive outcome, it is a small step from the primitive model above and gives the optimiser an additional factor for finding the optimal plan.
Analysis
For only one group (M = 1), which means all values of the given column are equal to each other, the formula degenerates to the upper bound cmp_cost = n * C0: in the worst-case scenario, when comparing each pair of tuples, the sorting operator traverses all the columns and invokes the comparison operator for each pair of values. The uncertainty lies in the second and subsequent columns - we lack statistics for them, so we stick to the extreme case. Should we rely on an average case instead, i.e. n/2? If we did, we would risk smoothing out the cost difference and reducing the chances of choosing the column order determined by ndistinct - remember, the current implementation in PostgreSQL 17 has two competing orders: the one specified by the ORDER-BY expression and the pathkeys sort order coming from the subquery below. For unique values in the first column (M = T), the formula simplifies to the lower bound cmp_cost = C0,
since comparisons never proceed to the second attribute: the values of the first column are sufficient to order any pair of tuples. The linearity of the formula ensures the same gradient over the entire range of the number of groups, which can be interpreted as uniform sensitivity to changes. Using, for example, a hyperbola, one could increase the sensitivity to changes in the number of groups within a specific range of values. The goal could be, for instance, to choose such a strategy only when the ndistinct value indicates (almost) unique data. Without specific ideas about what is better, we chose the linear rule.
Model extension
How does the above formula change if ndistinct is known not just for one column but for an entire prefix of the list of sort columns? This can happen, for example, when extended statistics are defined.
Let's imagine we have the ndistinct value for a pair of columns (x1, x2). Then, if this pair of columns stands at the beginning of the list of sort columns (x1, x2, ..., xn), a refined estimate of the sort operation can be built from both the pair's ndistinct and the factor f1 calculated from the standard ndistinct statistics of the first column x1.
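One way to write such a refinement (again a sketch in our notation): let f1 be the probability that a pair of tuples ties on x1, derived from the ndistinct of x1 in the same way as above, and f12 the analogous tie probability for the pair (x1, x2), derived from the extended-statistics ndistinct. Then the expected number of comparisons per tuple pair is
n_cmp = 1 + f1 + (n - 2) * f12
The first column is always compared, the second one only on ties in x1, and the remaining n - 2 columns, for which we have no statistics, only on ties in the (x1, x2) prefix (the worst-case assumption, as before).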
Curiously, since the value of ndistinct does not depend on the order of the columns, in this case we can also estimate the cost of sorting for the alternative sort order (x2, x1, ..., xn). This can be useful, for example, for choosing the optimal ordering of the expression list of a GROUP-BY operator.
Taking into account that in PostgreSQL the presence of extended statistics on a set of m table columns means that an ndistinct value is also available for any subset of it, it is quite helpful to have a more general formula covering the case m < n, n ≥ 2.
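Extending the same sketch (our notation, an assumption rather than the authors' exact formula) to a prefix of m columns covered by extended statistics, with f_1..k denoting the tie probability on the first k columns:
n_cmp = 1 + f_1 + f_1..2 + ... + f_1..(m-1) + (n - m) * f_1..m
Every column inside the covered prefix gets its own tie factor, and the n - m columns beyond it are again charged at the worst case.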
Thus, it is possible to refine the estimate of the sorting cost by detailing the statistics for a given table.
With extended statistics, we can cost many more alternative plans involving sorted input, such as merge joins or grouping. For example, having a column x1 with a considerable ndistinct value and statistics on the non-unique columns x2, ..., xm, we can evaluate what is preferable: putting x1 in the first position or putting x2, ..., xm in front. It is also possible to assess various permutations of the prefix x2, ..., xm.
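For reference, this is how such multi-column ndistinct statistics could be declared on the demo table (the statistics name is arbitrary):
CREATE STATISTICS test_companies_nd (ndistinct)
ON company, counterpart_company FROM test;
ANALYZE test;
After ANALYZE, the planner has an ndistinct estimate for the pair of columns in addition to the per-column estimates.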
There are many options around sort order permutations, but the scope where such estimation pays off is rather narrow - it is worth costing only a small set of promising orderings, or introducing empirical rules for choosing the best sort order. The simplest empiric is to evaluate the ordering where columns are arranged by their ndistinct values. Another, less obvious, rule of thumb is to arrange columns according to their average width.
Limit the scope
The authors' practical experience shows that, in most cases, this optimisation is a second-order factor, a fine-tuning of the query plan. There is a high probability that the computational cost of finding the best sort order will exceed its effect on query execution time. Putting aside the possibility of caching and query plan reuse, it is necessary to limit the use of this strategy with some reliable empirical criterion. The relative weight of the sort operation can serve as such an empiric. Suppose the cost of sorting is a significant part of the work (say, 20% of the cost of the underlying subtree). In that case, the higher-level operator can test alternative combinations of the sort columns. However, this requires adding an extra step to the operator planning procedure: after choosing a sorted path for the current set of pathkeys, the planner analyses the generated path and, at that moment, decides whether to additionally search for a better combination of sort columns.
Implementation
What initially seemed quite difficult to code turns out to be much simpler with the solution explained and implemented earlier. The primitive sort cost model was implemented in advance to check whether the cost balances would break.
Curiously, even that model raised some issues. For example, MergeAppend also compares tuples incoming from different subplans. Hence, MergeAppend over sorted children is an essential competitor for the Append-plus-Sort-of-the-result strategy, and to keep a good balance, we must add the same multiplier there. It is precisely the same with the GatherMerge path in the case of parallel plans. The IncrementalSort cost model raises the same issues, but that model looks like a mix of compromises so far, and changing it means opening a Pandora's box.
The second step, adding the estimate of the number of duplicates in the first column, was more straightforward because it needed only a tiny complication. As an outcome, it shifted the preference from IncrementalSort towards plain Sort, which may need to be studied later.
The final part, the introduction of the new GROUP-BY ordering strategy, adds one pass along the list of grouping clauses. Checking the statistics of each of these columns to find the one with the minimum number of duplicates to use as the first grouping column looks a bit expensive. Still, we hope that groupings over many columns are needed mostly in analytic queries, which consume a lot of computational resources by themselves, so the planning improvement should be worth it.
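As an illustration of the target case, here are two groupings of the demo table that differ only in the order of the clauses; with the proposed strategy the optimiser is expected to pick the sort order led by the almost unique n no matter how the query is written (whether sorting or hashing wins, and the exact timings, depend on the build and settings):
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT count(*) FROM test GROUP BY n, company, counterpart_company;
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT count(*) FROM test GROUP BY company, counterpart_company, n;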
The commits linked above are only a sketch of possible solutions, so look into the development branch to see the current state.
Conclusion
Taking the number of columns into account in the sort cost is physically justified. It brings the level of detail of this model closer to that of the hash aggregation cost model, which already considers the number of columns in the row.
The impact of this change on internal balances of the PostgreSQL cost model can be adjusted:
by choosing as the extreme case not a complete comparison of all columns for each pair of tuples but something more moderate, for example, n/2;
by selecting a different interpolation law between extreme values.
The formula can be extended to the case of extended statistics covering a subset of columns from the grouping list. At the same time, an open question remains about the relationship between the ndistinct value and the cost of the comparison operator: is it possible to determine a sufficiently objective relationship between these parameters? Solving this problem would make it possible to choose the most efficient sort order, considering all significant factors.
THE END.