Fractional Path Issue in Partitioned Postgres Databases
Continuing to explore weak points of the Postgres planner
While a user tends to notice the positive aspects of a technology, a developer, who usually runs into its limitations, shortcomings and bugs, looks at the product from a completely different perspective. The same thing happened this time: after publishing the results of comparative testing, where Join-Order-Benchmark queries were executed on a database with and without partitions, I couldn't shake the feeling that I had missed something. In my mind, Postgres should build a worse plan with partitions than without them, and this should be not just a bug but a technological limitation. On second thought, I found a weak spot: queries with limits.
When a LIMIT clause is present in an SQL query, the optimiser, unlike in the case of plain tables, immediately faces many questions: How many rows may be extracted from each partition? Will only a single partition be used? If so, which one will it be? (With potential execution-time pruning, that is not apparent at all.)
What if we scan the partitions by index and obtain the result by merging? In that case, it is entirely unclear how to estimate the number of rows to be extracted from each partition and, therefore, which type of partition scan operator to apply. And what if, using partitionwise join, we have an intricate subtree under the Append? Knowledge of the limits should be crucial there, for example, when choosing the join type, shouldn't it?
Intermediate-cost query plans
This pack of questions about planning over partitions led to a compromise in choosing a query plan for the Append's subpaths: to pick the optimal fractional path, two plan options are considered, the minimum-total-cost path and the minimum-startup-cost path. Roughly speaking, the plan will be optimal if the query contains LIMIT 1 or some considerable LIMIT value. But what about the intermediate options? Let's look at specific examples (thanks to Alexander Pyhalov).
DROP TABLE IF EXISTS parted, plain CASCADE;
CREATE TEMP TABLE parted (x integer, y integer, payload text)
  PARTITION BY HASH (payload);
CREATE TEMP TABLE parted_p1 PARTITION OF parted
  FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TEMP TABLE parted_p2 PARTITION OF parted
  FOR VALUES WITH (MODULUS 2, REMAINDER 1);
INSERT INTO parted (x,y,payload)
  SELECT (random()*600)::integer,
         (random()*600)::integer, md5((gs%500)::text)
  FROM generate_series(1,1E5) AS gs;
CREATE TEMP TABLE plain (x numeric, y numeric, payload text);
INSERT INTO plain (x,y,payload) SELECT x,y,payload FROM parted;
CREATE INDEX ON parted(payload);
CREATE INDEX ON plain(payload);
VACUUM ANALYZE;
VACUUM ANALYZE parted;
In this example, we executed VACUUM ANALYZE twice because, by default, statistics on the partitioned table itself are not built: the database-wide run analyses each partition separately. To gather statistics combining data from all partitions, we must explicitly execute ANALYZE with the name of the partitioned table as a parameter.
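You can check which statistics actually exist through the standard pg_stats view (an illustrative query; the exact set of rows depends on what has been analysed so far):
SELECT tablename, count(*) AS stats_rows
FROM pg_stats
WHERE tablename IN ('parted', 'parted_p1', 'parted_p2')
GROUP BY tablename;
Before the explicit VACUUM ANALYZE parted;, one would expect rows only for the partitions, not for 'parted' itself.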
Now, let's see how selection from the partitioned table and from the regular one works on the same data:
EXPLAIN (COSTS OFF)
SELECT * FROM plain p1 JOIN plain p2 USING (payload) LIMIT 100;
EXPLAIN (COSTS OFF)
SELECT * FROM parted p1 JOIN parted p2 USING (payload) LIMIT 100;
/*
Limit
-> Nested Loop
-> Seq Scan on plain p1
-> Memoize
Cache Key: p1.payload
Cache Mode: logical
-> Index Scan using plain_payload_idx on plain p2
Index Cond: (payload = p1.payload)
Limit
-> Merge Join
Merge Cond: (p1.payload = p2.payload)
-> Merge Append
Sort Key: p1.payload
-> Index Scan using parted_p1_payload_idx
-> Index Scan using parted_p2_payload_idx
-> Materialize
-> Merge Append
Sort Key: p2.payload
-> Index Scan using parted_p1_payload_idx
-> Index Scan using parted_p2_payload_idx
*/
The query plans seem optimal: whatever the limit, only the minimum number of rows will be read since, with a helpful index on the join attribute, we already have ordered access to the table rows. The sketch below shows one way to double-check that.
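A minimal verification with EXPLAIN ANALYZE (the concrete counters depend on the generated data, so treat the figures as indicative only):
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF)
SELECT * FROM parted p1 JOIN parted p2 USING (payload) LIMIT 100;
The rows counters of the scan nodes should stay far below the table size of 100,000.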
Now let's prompt the optimiser to build a complex subtree under the Append by enabling partitionwise join:
SET enable_partitionwise_join = 'true';
EXPLAIN (COSTS OFF)
SELECT * FROM parted p1 JOIN parted p2 USING (payload) LIMIT 100;
/*
Limit
-> Append
-> Nested Loop
Join Filter: (p1_1.payload = p2_1.payload)
-> Seq Scan on parted_p1 p1_1
-> Materialize
-> Seq Scan on parted_p1 p2_1
-> Nested Loop
Join Filter: (p1_2.payload = p2_2.payload)
-> Seq Scan on parted_p2 p1_2
-> Materialize
-> Seq Scan on parted_p2 p2_2
*/
Although nothing has changed in the data, an unfortunate plan has been selected. The reason for this degradation is that, when planning an Append, the optimiser chooses the cheapest plan according to the startup_cost criterion, and that is the one containing NestLoop + SeqScan: in terms of launch speed, before any tuple has to be fetched at all, such a plan slightly beats even the obvious NestLoop + IndexScan. This is how current Postgres works, including the dev branch.
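The effect is visible in the cost estimates themselves (an illustrative check; the concrete numbers depend on your data and cost settings):
-- A sequential scan starts returning tuples at zero startup cost:
EXPLAIN SELECT * FROM parted_p1;
-- An ordered index scan pays a small nonzero startup cost first:
EXPLAIN SELECT * FROM parted_p1 ORDER BY payload LIMIT 1;
The first number in the cost pair (startup..total) is 0.00 for the SeqScan, which is why it wins the minimum-startup-cost competition.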
However, this problem can be fixed quite simply by adding the appropriate logic to the optimiser code. Together with Nikita Malakhov and Alexander Pyhalov, we have prepared a patch, which can be found in the current commitfest, that fixes the problem. In its discussion thread, you can also find a gripping remark about revising the startup_cost computation logic of the sequential scan operator; implementing that could likewise alleviate the choice of non-optimal fractional paths in the LIMIT 1 case. With the patch applied, we already get an acceptable query plan:
Limit
-> Append
-> Nested Loop
-> Seq Scan on parted_p1 p1_1
-> Memoize
Cache Key: p1_1.payload
Cache Mode: logical
-> Index Scan using parted_p1_payload_idx
Index Cond: (payload = p1_1.payload)
-> Nested Loop
-> Seq Scan on parted_p2 p1_2
-> Memoize
Cache Key: p1_2.payload
Cache Mode: logical
-> Index Scan using parted_p2_payload_idx
Index Cond: (payload = p1_2.payload)
Now, let's look at the next problem, which does not have a simple solution yet.
Calculated limit
Consider the following query:
EXPLAIN (COSTS OFF)
SELECT * FROM parted p1 JOIN parted p2 USING (payload,y)
ORDER BY payload,y LIMIT 100;
Executing it with the patch mentioned above gives you an optimal plan: a NestLoop with a parameterized index scan that touches only the minimum number of table rows needed to produce the result. However, by simply reducing the limit, we get the original bleak picture back:
EXPLAIN (COSTS OFF)
SELECT * FROM parted p1 JOIN parted p2 USING (payload,y)
ORDER BY payload,y LIMIT 1;
/*
Limit
-> Merge Append
Sort Key: p1.payload, p1.y
-> Merge Join
Merge Cond: ((p1_1.payload = p2_1.payload) AND
(p1_1.y = p2_1.y))
-> Sort
Sort Key: p1_1.payload, p1_1.y
-> Seq Scan on parted_p1 p1_1
-> Sort
Sort Key: p2_1.payload, p2_1.y
-> Seq Scan on parted_p1 p2_1
-> Merge Join
Merge Cond: ((p1_2.payload = p2_2.payload) AND
(p1_2.y = p2_2.y))
-> Sort
Sort Key: p1_2.payload, p1_2.y
-> Seq Scan on parted_p2 p1_2
-> Sort
Sort Key: p2_2.payload, p2_2.y
-> Seq Scan on parted_p2 p2_2
*/
The SeqScan operator again reads all the rows from the tables, and the query becomes tens of times slower, although we only reduced the LIMIT! At the same time, by disabling sequential scans, you can see the fast plan with incremental sorting again, as sketched below.
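One way to reproduce that observation (a sketch: enable_seqscan does not forbid sequential scans, it only penalises their cost, so the outcome still depends on the data):
SET enable_seqscan = 'off';
EXPLAIN (COSTS OFF)
SELECT * FROM parted p1 JOIN parted p2 USING (payload,y)
ORDER BY payload,y LIMIT 1;
RESET enable_seqscan;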
The fundamental problem is that the optimiser knows only the final limit on the number of rows in the query or subquery. Here, at the Append planning stage, the optimiser cannot estimate how many tuples the upper Incremental Sort could request: depending on the data distribution in the 'y' column, anything from a single row to all the rows may be needed from each partition.
Even if we theoretically imagine that we have taught IncrementalSort to calculate the number of groups by the 'payload' column and, based on that, to estimate the maximum number of rows required from each partition, we still could not improve the plan: the planning of the Append operator has already been completed, and the possible options for its execution have already been fixed. After all, we plan the query from the bottom up!
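Curiously, the raw material for such a group estimate is already present in the statistics (an illustrative query; we generated 500 distinct payload values above, and the reported value may be an approximation):
SELECT n_distinct FROM pg_stats
WHERE tablename = 'parted' AND attname = 'payload';
The hard part is not obtaining this number but feeding it into the Append planning, which, as noted, has finished by then.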
To sum up: partitioned tables make the task much harder for the current version of Postgres by limiting the search space of optimal query plans. A switch to partitions should be thoroughly tested, focusing on cases where a limited selection of tuples is required and no noticeable partition pruning happens at the planning stage. Although this area is actively developing and we can expect improvements soon (especially if users report emerging issues more actively), there are still cases where a solution within the existing architecture is not apparent and requires additional R&D.
Do you agree with my conclusions, or did I just write nonsense? Please leave your opinion in the comments.
THE END.
December 9, 2024, Pattaya, Thailand.