Could GROUP-BY clause reordering improve performance?
Utilising statistics to optimise GROUP-BY clause order
PostgreSQL users often employ analytical queries that sort and group data by different rules. Optimising these operators can significantly reduce the time and cost of query execution. In this post, I will discuss one such optimisation: choosing the order of columns in the GROUP BY expression.
Postgres can already reshuffle the list of grouped expressions according to the ORDER BY condition to eliminate additional sorting and save computing resources. We went further and implemented an additional strategy of group-by-clause list permutation in a series of patches (the first attempt and the second one) for discussion with the Postgres community, expecting it to be included in the next version of PostgreSQL core. You can also try it in action in the commercial Postgres Pro Enterprise fork.
A short introduction to the issue
To group table data by one or more columns, DBMSes usually use hashing methods (HashAgg) or preliminary sorting of rows (tuples) with subsequent traversal of the sorted set (SortAgg). When sorting incoming tuples by multiple columns, Postgres must call the comparison operator not just once but for each pair of values. For example, to compare a table row ('UserX1', 'Saturday', $100) with a row ('UserX1', 'Monday', $10) and determine the relative order of these rows, we must first compare the first two values and, if they match, move on to the next pair. If the second pair of values (in our example, 'Saturday' and 'Monday') differs, then there is no point in calling the comparison operator for the third element.
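You can see this pairwise comparison behaviour directly in psql using row constructors (the literals below just mimic the rows above; the exact types do not matter for the illustration):
SELECT ROW('UserX1', 'Saturday', 100.0) > ROW('UserX1', 'Monday', 10.0);
/*
 ?column?
----------
 t
*/
-- The first elements are equal, so the second pair decides the result
-- ('Saturday' > 'Monday'); the third pair no longer influences it.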
This is the principle on which the proposed SortAgg operator optimisation mechanism is based. If, when comparing rows, we compare column values with fewer duplicates first (for example, first compare UserID numbers and then days of the week), then we will have to call the comparison operator much less often.
Time for a demo case
How much can minimising the number of comparisons speed up a Sort operation? Let's look at a couple of examples. In the first one, we sort the table by the same fields but in different orders:
CREATE TABLE shopping (
CustomerId bigint, CategoryId bigint, WeekDay text, Total money
);
INSERT INTO shopping (CustomerId, CategoryId, WeekDay, Total)
SELECT random()*1E6, random()*100, 'Day ' || (random()*7)::integer,
random()*1000::money
FROM generate_series(1,1E6) AS gs;
VACUUM ANALYZE shopping;
SET max_parallel_workers_per_gather = 0;
SET work_mem = '256MB';
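Since the whole discussion revolves around per-column distinct-value statistics, it may be helpful to peek at what ANALYZE has gathered for the test table (the exact numbers vary from run to run):
SELECT attname, n_distinct FROM pg_stats WHERE tablename = 'shopping';
-- CustomerId should show by far the most distinct values (negative numbers
-- mean a fraction of the row count), while WeekDay has only a handful.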
EXPLAIN (ANALYZE, TIMING OFF)
SELECT CustomerId, CategoryId, WeekDay, Total
FROM shopping
ORDER BY WeekDay,Total,CategoryId,CustomerId;
EXPLAIN (ANALYZE, TIMING OFF)
SELECT CustomerId, CategoryId, WeekDay, Total
FROM shopping
ORDER BY CustomerId,CategoryId,WeekDay,Total;
The results of executing these queries will be as follows:
Sort (cost=117010.84..119510.84 rows=1000000 width=30)
(actual rows=1000000 loops=1)
Sort Key: weekday, total, categoryid, customerid
Sort Method: quicksort Memory: 71452kB
-> Seq Scan on shopping (actual rows=1000000 loops=1)
Execution Time: 2858.596 ms
Sort (cost=117010.84..119510.84 rows=1000000 width=30)
(actual rows=1000000 loops=1)
Sort Key: customerid, categoryid, weekday, total
Sort Method: quicksort Memory: 71452kB
-> Seq Scan on shopping (actual rows=1000000 loops=1)
Execution Time: 505.775 ms
The second query is executed almost six times faster than the first, although the processed data is identical. This is because the comparison operator was called less often in the second case. The sorted tuple has 4 columns (CustomerId, CategoryId, WeekDay, Total), and Postgres calls the comparison operator separately for each pair of values - a maximum of 4 times. But if the first column in the comparison is CustomerId, then the comparison operator has to be called for the next column far less often than when the WeekDay column comes first.
This example shows that the computational costs of the sorting operation may be quite significant. Even with the "abbreviated keys" optimisation in our pocket, we still have no guarantee of stable execution time for the sort operation. I wonder whether some newly proposed optimisations [1, 2] could significantly narrow this performance gap. Considering that an analytical query may contain multiple sorts and re-sorts (each aggregate may define its individual order of incoming data), such an optimisation will save computing resources.
Note that the values of the cost field of the Sort operator in the EXPLAIN output of the first example are the same. This means that, for the Postgres optimiser, both sorting options are identical.
Since the sort order for GROUP BY or Merge Join does not affect the final result, it can be chosen to minimise the number of comparison operations. In addition, if the table has many indexes, the data can be scanned and sorted in different ways, and the correct choice of the incremental sort option (IncrementalSort) may provide a positive effect.
Now consider a second example. Let's say you want to group your data to calculate the average spend of each customer in a given product category depending on the day of the week:
SET enable_hashagg = 'off';
EXPLAIN (ANALYZE, TIMING OFF)
SELECT CustomerId, CategoryId, WeekDay, avg(Total::numeric)
FROM shopping
GROUP BY WeekDay,CategoryId,CustomerId;
/*
GroupAggregate (actual rows=999370 loops=1)
Group Key: weekday, categoryid, customerid
-> Sort (actual rows=1000000 loops=1)
Sort Key: weekday, categoryid, customerid
Sort Method: quicksort Memory: 71452kB
-> Seq Scan on shopping (actual rows=1000000 loops=1)
Execution Time: 2742.777 ms
*/
To demonstrate the concept explicitly, I have disabled hash aggregation. From a query perspective, the order of the columns in the GROUP BY clause is entirely unimportant. Let's change the order and see the result:
EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF)
SELECT CustomerId, CategoryId, WeekDay, avg(Total::numeric)
FROM shopping
GROUP BY CustomerId,CategoryId,WeekDay;
/*
GroupAggregate (actual rows=999370 loops=1)
Group Key: customerid, categoryid, weekday
-> Sort (actual rows=1000000 loops=1)
Sort Key: customerid, categoryid, weekday
Sort Method: quicksort Memory: 71452kB
-> Seq Scan on shopping (actual rows=1000000 loops=1)
Execution Time: 1840.517 ms
*/
The speedup is less impressive than in the first example but still quite noticeable. What is important is that this transformation is free: we do not need a new index or a complex query tree change like a subquery pull-up. Such a change can be done automatically; the main thing is to teach the Postgres optimiser to distinguish the costs of different combinations of grouping clauses and to consider an additional grouping strategy.
State of the art
In 2023, Postgres learned how to exclude redundant columns from a grouping operation. Redundancy can occur, for example, when there is an equality expression in the query tree:
SELECT sum(total) FROM shopping
WHERE CustomerId=CategoryId AND WeekDay='Monday'
GROUP BY CustomerId,CategoryId,WeekDay;
/*
GroupAggregate
Group Key: customerid
-> Sort
Sort Key: customerid
-> Seq Scan on shopping
Filter: ((customerid = categoryid) AND
(weekday = 'Monday'::text))
*/
In the example above, the values in the CustomerId and CategoryId columns belong to the same equivalence class (the EquivalenceClass structure in the Postgres code), and either column can be excluded from the grouping expression. At the same time, the clause "weekday = 'Monday'" makes explicit grouping by WeekDay unnecessary.
PostgreSQL 17 introduced another strategy: the optimiser can now adjust the order of the grouped columns according to the sort order of the input data. Thus, during planning, Postgres may consider two alternative strategies:
Group the already sorted data, and then re-sort by ORDER BY requirements.
Sort the incoming data by the rules specified by ORDER BY, then perform the grouping.
To demonstrate both options, let's add an index to our table and compare the results of the two queries:
CREATE INDEX ON shopping(CustomerId, weekday);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM shopping WHERE CustomerId < 5000
GROUP BY WeekDay,CustomerId ORDER BY WeekDay,CustomerId;
EXPLAIN (COSTS OFF)
SELECT count(*) FROM shopping WHERE CustomerId < 50000
GROUP BY WeekDay,CustomerId ORDER BY WeekDay,CustomerId;
/*
GroupAggregate
Group Key: weekday, customerid
-> Sort
Sort Key: weekday, customerid
-> Index Only Scan using
shopping_customerid_weekday_idx on shopping
Index Cond: (customerid < 5000)
Sort
Sort Key: weekday, customerid
-> GroupAggregate
Group Key: customerid, weekday
-> Index Only Scan using
shopping_customerid_weekday_idx on shopping
Index Cond: (customerid < 50000)
*/
In the first case, there is little data to be grouped, and it is cheaper to sort the tuples in advance according to the requirements of the ORDER BY operator. In the second case, sorting after grouping is justified: the index scan operator returns the rows in sorted form, and grouping significantly reduces the number of such rows, which makes the subsequent sorting cheaper. Isn't it nice that this additional Postgres strategy lets the planner find interesting variants of query plans? The downside is that it does not use column statistics, which could have helped to optimise example No. 2.
How to employ statistics?
The proposed GROUP-BY column reordering strategy is based on the standard Postgres columnar statistics stored in the pg_statistic table. It is a cost-based strategy: it supplies the optimiser with an alternative path for the Sort operator that minimises the number of comparison operations during sorting. To clarify the basic idea, consider the query with grouping from the example above:
SELECT avg(Total::numeric) FROM shopping
GROUP BY CustomerId,CategoryId,WeekDay;
The case where CustomerId is in the first position of the sort key is more efficient because that column contains the largest number of distinct values (approximately half a million). That means that, on average, only a couple of tuples share each CustomerId value, so it is rare that comparing the CustomerId column does not determine the order and the values from subsequent columns have to be compared. The WeekDay column has no more than seven distinct values. If Postgres sorted by this column first, then to determine the order, the values of subsequent columns would have to be compared with a much higher probability.
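You can verify these numbers directly on the test table (an exact, though slower, alternative to the pg_stats estimates shown earlier):
SELECT count(DISTINCT CustomerId) AS customers,
       count(DISTINCT CategoryId) AS categories,
       count(DISTINCT WeekDay)    AS weekdays
FROM shopping;
-- With the data generated above, customers runs into the hundreds of thousands,
-- categories is about a hundred, and weekdays is below ten.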
Dive into the code
Since the code is very voluminous, we split it into four patches.
The first patch teaches the optimiser to consider EquivalenceClass members during the estimation of the number of groups in the estimate_num_groups() routine. What does it mean? Look at the queries:
EXPLAIN SELECT CustomerId,CategoryId FROM shopping
WHERE CustomerId = CategoryId GROUP BY CustomerId,CategoryId;
EXPLAIN SELECT CustomerId,CategoryId FROM shopping
WHERE CustomerId = CategoryId GROUP BY CategoryId,CustomerId;
These queries are semantically identical: we just rearranged the columns in the grouping list. The equivalence expression levels out the difference in distinct values between CategoryId and CustomerId: after applying the filter, they will contain exactly the same values. But if you EXPLAIN them, you will see different estimations and, as a result, different query plans:
HashAggregate (cost=14073.83..14123.71 rows=4988 width=16)
--and:
Group (cost=13676.18..13715.13 rows=101 width=16)
So, the first patch adds code to estimate_num_groups() that walks through the equivalence class and looks up the ndistinct estimations of its members. The minimum number of distinct values should be the most correct answer. It also introduces caching of the ndistinct estimation inside an EquivalenceMember.
The second patch concerns the formula for calculating the cost of sorting. In the current version of Postgres, sorting is estimated using the formula:

cost = C * N * log2(N)

where:
N - the number of tuples to sort,
C = 2.0 * cpu_operator_cost - a user-defined parameter.
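As a quick sanity check with the default cpu_operator_cost of 0.0025: for N = 1,000,000 this gives 2.0 * 0.0025 * 10^6 * log2(10^6) ≈ 99,658, which, together with the underlying Seq Scan cost of 17,353 (visible in the plans later in this post), reproduces the Sort startup cost of 117,010.84 from the first example.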
This patch introduces the number of columns involved into the Sort estimation formula.
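In its simplest form (assuming the worst case where every comparison has to examine all the columns, which is exactly the "all values are duplicates" assumption discussed below), the estimate roughly becomes:

cost = C * m * N * log2(N)

where m is the number of columns in the sort key.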
The approach seems straightforward and relatively crude. It is designed to be intermediate - to discover how many places employ sort estimation formulas and how many areas will be impacted.
Looking at the regression test changes, you may notice that this change affects the balance among the Sort, IncrementalSort, MergeAppend, GatherMerge, and HashAgg nodes. With this formula, the optimiser favours HashAgg grouping in more situations than before. The HashAgg cost estimation already takes into account the number of columns in the aggregated tuple, while aggregation with preliminary sorting used to be estimated too optimistically in the case of a long list of sorted values. Thus, this patch increases the optimiser's bias towards hashing in grouping operations, especially on small data volumes.
But why such a trivial formula, you might ask? Is it OK to assume that all the values are duplicates? It looks pretty strange, but in my experience, the problem with grouping orders usually arises when a query processes massive numbers of tuples filled with text (or numeric) values containing mostly duplicates. One more excuse for me is that we immediately introduce an improvement of this formula in the next patch. But even with such a simple formula, Postgres is able to distinguish various sort orders.
The third patch reconsiders the formula introduced by the second patch. Here, the distinct statistics cache added by the first patch is employed to estimate the number of distinct values in the first sorted column, and the formula is refined accordingly: the more distinct values the first column contains, the less often the comparison has to proceed to the subsequent columns, and the cheaper the sort is estimated to be.
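A rough sketch of the refined estimate (under a uniformity assumption; the exact expression in the patch may differ): if the first sort column has d distinct values, two tuples collide on it with a probability of about 1/d, so only that fraction of comparisons has to look at the remaining m - 1 columns:

cost ≈ C * N * log2(N) * (1 + (m - 1) / d)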
This approach can be extended when reliable statistics on the joint distribution of columns (EXTENDED STATISTICS) exist. Still, at the moment, we limit ourselves to the first column estimation only because it is sufficient in most cases. With this formula, the optimiser can distinguish the costs of different sorting combinations of columns, which allows us to choose the optimal sorting operator.
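The patches do not rely on extended statistics yet, but for reference, joint ndistinct statistics on the three grouping columns could be collected like this (the statistics object name is arbitrary):
CREATE STATISTICS shopping_ndistinct (ndistinct)
  ON CustomerId, CategoryId, WeekDay FROM shopping;
ANALYZE shopping;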
The fourth patch adds code to the optimiser that permutes the grouped columns to place the column with the maximum ndistinct value in the first position. This GROUP-BY order is added as one more alternative alongside the two strategies discussed above, and the optimiser chooses the best one based on their costs and on the sort order requested by the upper query operator.
What positive outcome have we earned?
Let's look at how this change affects the queries from examples 1 and 2. We start with sorting:
EXPLAIN (ANALYZE, TIMING ON)
SELECT CustomerId, CategoryId, WeekDay, Total
FROM shopping
ORDER BY CustomerId,CategoryId,WeekDay,Total;
EXPLAIN (ANALYZE, TIMING ON)
SELECT CustomerId, CategoryId, WeekDay, Total
FROM shopping
ORDER BY CategoryId,CustomerId,WeekDay,Total;
/*
Sort (cost=191291.64..193791.64) (actual time=350.819..395.024)
Sort Key: customerid, categoryid, weekday, total
-> Seq Scan on shopping (cost=0.00..17353.00) (actual time=0.031..60.262)
Execution Time: 423.583 ms
Sort (cost=266482.66..268982.66) (actual time=653.143..694.736)
Sort Key: categoryid, customerid, weekday, total
-> Seq Scan on shopping (cost=0.00..17353.00) (actual time=0.012..55.073)
Execution Time: 723.005 ms
*/
There are two notable improvements: the overall query cost has changed, and the ratio of the sorting cost to the scanning cost has become more accurate and reflects reality. The difference in plan costs reflects the difference in query execution time. And now the result of query execution with grouping:
SET enable_hashagg = 'off';
EXPLAIN (COSTS OFF)
SELECT CustomerId, CategoryId, WeekDay, avg(Total::numeric)
FROM shopping
GROUP BY WeekDay,CategoryId,CustomerId;
/*
GroupAggregate
Group Key: customerid, weekday, categoryid
-> Sort
Sort Key: customerid, weekday, categoryid
-> Seq Scan on shopping
*/
The optimiser changed the order of the columns and moved the CustomerId column to the beginning of the grouping list. Given the actual distribution of values in the other columns, the CategoryId and WeekDay columns could have been rearranged as well. However, such fine-tuning has little practical meaning and can only be done with sufficient reliability if extended statistics exist for all three fields. Of course, the proposed solution is not ideal: the mathematical model can be adjusted to be more practical (the case when all columns contain only duplicates is rare) and more detailed. We also did not consider the relative cost of the comparison operator itself: comparing text types requires more resources than comparing integer types, right? However, the current version already fulfils the main task - to provide an additional grouping strategy that is qualitatively different from those already available in the Postgres optimiser.
If you have any comments or opinions on the subject, please leave them in the comments below or in the thread on the Postgres community mailing list.
THE END.
November 25th, 2024. Pattaya, Thailand.