Conserving CPU's cycles ...

Revising the Postgres Multi-master Concept

Andrei Lepikhov — Sat, 18 Oct 2025 10:39:42 GMT

One of the ongoing challenges in database management systems (DBMS) is maintaining consistent data across multiple instances (nodes) that can independently accept client connections. If one node fails in such a system, the others must continue to operate without interruption - accepting connections and committing transactions without sacrificing consistency. An analogy for a single DBMS instance might be staying operational despite a RAM failure or intermittent access to multiple processor cores.

In this context, I would like to revisit the discussion about the Postgres-based multi-master problem, including its practical value, feasibility, and the technology stack that needs to be developed to address it. By narrowing the focus of the problem, we may be able to devise a solution that benefits the industry.

I spent several years developing the multi-master extension in the late 2010s until it became clear that the concept of essentially consistent multi-master replication had reached a dead end. Now, after taking a long break from working on replication, changing countries, residency, and companies, I am revisiting the Postgres-based multi-master idea to explore its practical applications.

First, I want to clarify the general use case for multi-master replication and highlight its potential benefits. Evidently, any technology must balance its capabilities against the needs it aims to address. Let's explore this balance within the context of multi-master replication.

Typically, clients consider a multi-master solution when they hit a limit in connection counts for their OLTP workloads. They often have a large number of clients, an N transaction-per-second (TPS) workload, and a single database. They envision a solution that involves adding another identical server, setting up active-active replication, and doubling their workload.

The typical (desired) case that users envision when discussing multi-master

Sometimes, a client has a distributed application that requires a stable connection to the database across different geographic locations. At other times, clients simply desire a reliable automatic failover. A less common request, though still valuable, is to provide an online upgrade or to enable the detachment of a node for maintenance. Additionally, some clients may request an extra master node that has warmed-up caches and a storage state that closely resembles the production environment, allowing it to be used for testing and benchmarking. In summary, there are numerous tasks that can be requested, but what can realistically be achieved?

It's important to remember that active-active replication in PostgreSQL is currently only feasible with logical replication, which implies additional server load from decoding, the walsender, and so on. Network latency arises immediately as well: we need to wait for confirmation from the remote node that the transaction was successfully applied, right? Therefore, the idea of scaling the write load in the general case of 100% replication runs straight into the fact that each server must write not only its own changes but also the changes from the other instances (see the figure above). That isn't a problem for heavy queries with bushy SELECTs, but a multi-master setup is more likely to be used by clients with pure OLTP and very simple DML.

We face a similar challenge when trying to provide each copy of the distributed application with a nearby database instance. If the application's connection to the remote database is weak, then the connection between the two DBMS instances will also be unstable. As a result, waiting for confirmation of a successful transaction commit on the remote node can lead to significant delays.

Complex situations can also occur when a replication update tries to overwrite the same table row that has been updated locally. Such an event can happen because we have no guarantee that the snapshots of the transactions that caused these conflicting changes are consistent across DBMS instances. This raises the question: whose change should be applied, and whose should be rolled back? Does a single row change within the transaction logic need to take into account the competing change for it to be valid?

The autofailover case is relatively straightforward, but something still needs to be done to make such a configuration effective: after all, if all instances can write, then a transaction commit must be accompanied by a supplemental message ensuring that the transaction is written at each instance. Otherwise, it could happen that if node N_x crashes, some of its transactions will be written to the database on node N_y, but will not be committed to (or will be rolled back in) the database of node N_z. So, how do you fix this situation except by sending the entire configuration to recovery?

The concept of multi-master replication can therefore look questionable, particularly for those seeking to accelerate an OLTP workload. So, why would anyone need it? Let's begin by examining the underlying technology: logical replication.

I see two significant advantages to logical replication. First, it enables highly selective data replication, allowing you to pick only specific tables. Additionally, for each table, you can set up filters with replication conditions that let you easily skip individual records or entire transactions right at the outset of the replication process during the decoding phase. This feature provides a highly granular mechanism for selecting specific data that should be synchronised with a remote system.
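As a minimal sketch of such selective replication, assuming PostgreSQL 15 or later (where publications accept row filters) and two hypothetical tables, `customers` and `orders`, that do not come from the original discussion:

```sql
-- Hypothetical schema, for illustration only.
CREATE TABLE customers (
    id      bigint PRIMARY KEY,
    name    text NOT NULL,
    country text NOT NULL
);

CREATE TABLE orders (
    id          bigint PRIMARY KEY,
    customer_id bigint NOT NULL REFERENCES customers (id),
    region      text   NOT NULL,
    amount      numeric(12,2) NOT NULL
);

-- A row filter that is applied to UPDATE and DELETE may only use columns
-- covered by the replica identity; FULL is the simplest way to satisfy that.
ALTER TABLE orders REPLICA IDENTITY FULL;

-- Publish only these two tables and, for orders, only the rows of one region.
-- Rows that fail the filter are skipped on the publisher during decoding.
CREATE PUBLICATION pub_eu
    FOR TABLE customers,
              orders WHERE (region = 'EU');
```

Any table not listed in the publication stays purely local to the node.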

The second notable advantage is the high-level nature of the mechanism. This type of replication occurs at the level of relational algebra, which means you can abstract away the complexities of physical storage.

What amazing benefits come with a high level of abstraction? Imagine the possibilities! You can customise different sets of indexes on synchronised nodes, which significantly reduces DML overhead and allows you to route queries based on where execution can be most effective. For example, you could focus on loading one instance with brief UPDATE/DELETE queries on primary keys, while reserving another instance for larger subqueries or INSERTs that usually don't conflict with updates. You could even mix it up by using a traditional Postgres heap on one instance and a column storage on another! The creativity here knows no bounds when it comes to the potential of your replication protocol.

Now that we have outlined the benefits of logical replication, let's consider a use case that can be effectively implemented using a multi-master configuration.

To begin, we will set aside concerns related to upgrades, maintenance, and failovers. The most apparent use case is for supporting a geo-distributed application. By categorising the data in the database into three types - critically important general data, general data that is changed on only one side, and purely local data - we can leverage the advantages of this setup (see figure).

The case of multi-master replication with data classification

Here, the red rectangle denotes data that must be reliably synchronised between instances. The green and blue denote data that doesn't require immediate synchronisation and should be accessible to the remote instance in read-only mode. The grey denotes purely local data.

By designing the database schema to categorise data by replication method, we can even reduce the database size on a specific instance by avoiding the transmission of local data to remote nodes. Furthermore, plenty of data can be replicated asynchronously in one direction, avoiding the overhead of waiting for the remote end to confirm a commit. Only critical data requires strict synchronisation, using mechanisms such as synchronous commit, 2PC, and at least REPEATABLE READ isolation level, which enormously raises the commit time of such a transaction and increases the risk of rollback due to conflicts.
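With stock PostgreSQL, the difference between strictly synchronised and relaxed data can be approximated per transaction. The sketch below assumes that the remote node's subscription is listed in `synchronous_standby_names` on the local node, and that `account_balance` and `page_view_log` are hypothetical tables; it illustrates the trade-off and is not a full multi-master protocol.

```sql
-- Critical data: COMMIT returns only after the remote node has applied it.
BEGIN ISOLATION LEVEL REPEATABLE READ;
SET LOCAL synchronous_commit = remote_apply;
UPDATE account_balance SET amount = amount - 100 WHERE account_id = 42;
COMMIT;

-- Relaxed data: COMMIT returns after the local WAL flush;
-- the change reaches the remote node asynchronously.
BEGIN;
SET LOCAL synchronous_commit = local;
INSERT INTO page_view_log (account_id, viewed_at) VALUES (42, now());
COMMIT;
```

The first transaction pays at least one network round trip on commit, while the second one does not wait for the remote side at all.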

What is an example of this use case? To be honest, I don't have any experience with customer installations, so I can only imagine how it might work hypothetically. I envision an international company that needs to separate employee data and fiscal metrics on servers located in each country, which seems to be a common requirement these days. For analytical purposes, this data could be made accessible externally, similar to how key values can filter replicated data.

The company's employee table could be divided, replicating names, positions, salaries, and other relevant information across all database instances. Sensitive identifiers, such as social security numbers or passport numbers, could be kept in a separate local table to maintain privacy.
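A possible layout for this split is sketched below. All table and column names are hypothetical; the only point is that the shared table is published and the sensitive one is not.

```sql
-- Shared part: replicated to every instance.
CREATE TABLE employee (
    employee_id bigint PRIMARY KEY,
    full_name   text NOT NULL,
    job_title   text NOT NULL,
    salary      numeric(12,2) NOT NULL
);

-- Sensitive part: stays on the node of the employee's country.
CREATE TABLE employee_private (
    employee_id     bigint PRIMARY KEY REFERENCES employee (employee_id),
    ssn             text,
    passport_number text
);

-- Only the shared table goes into the publication, so the private
-- identifiers never leave the local instance.
CREATE PUBLICATION pub_employee FOR TABLE employee;
```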

In principle, if updates to local or asynchronously replicated data dominate, it may be possible to achieve the desired scalability for writing operations (sounds wicked, but who knows...).

Drawing from my experience in rocket science, I've developed the habit of qualitatively evaluating the effects of the phenomena being studied beforehand. Let's estimate the percentage of the database that can be replicated in active-active mode without potentially degrading performance. For simplicity, let's assume there are two company branches located on different continents, and consider the following configuration options: (1) one server, or (2) two servers operating in multi-master mode, where access will always be local (as illustrated in the figure below).

Let's introduce some notations. Please refer to the figure above for further clarification:

T_l - transaction execution time in a DBMS backend, ms.
T^c_l - network round-trip time between DBMS and local application, ms.
T^c_r - network round-trip time between DBMS and remote application, ms.
T_r - extra time to ensure that the transaction is successfully committed across the DBMS cluster, ms.
X_l - fraction of local connections.
N - fraction of transactions that are OK with asynchronous replication guarantees.

For a single server holding all the connections we have:

Active-active replication has the following formula:
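One way to write both averages down is sketched here. This is a reconstruction from the notation above, not necessarily the exact form used for the figures in this post; T_1 and T_2 are names introduced only for this sketch and denote the mean latency observed by a client with a single server and with two active-active servers, respectively.

```latex
\begin{gather*}
T_1 = T_l + X_l \, T^c_l + (1 - X_l) \, T^c_r, \\
T_2 = T_l + T^c_l + (1 - N) \, T_r, \\
T_2 \le T_1
\iff
N \ge 1 - \frac{(1 - X_l)\,(T^c_r - T^c_l)}{T_r}.
\end{gather*}
```

In the single-server case, a fraction 1 - X_l of clients pays the long round trip T^c_r. In the active-active case, every client connects locally, but a fraction 1 - N of transactions pays the extra confirmation time T_r.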

Now, let's determine appropriate numbers for our formulas. We'll assume a 50% share for local connections (X_l = 0.5). Drawing from my experience living in Asia and connecting to resources in Europe, we can use the following figures as a reference:

In this context, T^c_l and T^c_r are the time of one round-trip. In contrast, confirming a remote commit (T_r) usually requires at least two round-trips: in the 2PC protocol, the PREPARE TRANSACTION command should first be executed, waiting for the changes to be successfully replicated and all the resources necessary for the commit to be reserved, and then the COMMIT PREPARED command should be issued.
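The two phases map onto PostgreSQL's `PREPARE TRANSACTION` and `COMMIT PREPARED` commands. The sketch below runs them on a single node just to show the sequence; the table and the transaction identifier are hypothetical, and `max_prepared_transactions` must be greater than zero.

```sql
BEGIN;
UPDATE account_balance SET amount = amount - 100 WHERE account_id = 42;

-- Phase 1: make the transaction durable and keep its locks,
-- without making its changes visible yet.
PREPARE TRANSACTION 'tx_42_debit';

-- Phase 2: once every participant has confirmed phase 1, finish the commit.
COMMIT PREPARED 'tx_42_debit';

-- If any participant failed in phase 1, the alternative would be:
-- ROLLBACK PREPARED 'tx_42_debit';
```

In a distributed setting, each phase costs at least one network round trip, which is where the estimate of two round trips for T_r comes from.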

Based on these timing considerations, we can calculate the following:

$$N > \frac{230}{300} \approx 76\%.$$

Now, let's imagine that the share of remote connections has grown to 80% (X_l = 0.2):

$$N > \frac{188}{300} \approx 62\%.$$

What if we need to ensure full synchronous 2PC synchronisation of the entire database? Let's do the math:

These numbers indicate a performance loss of approximately 2.5 times, even in the most optimistic scenario. While that is not particularly encouraging, it may be sufficient for some applications.
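The thresholds and the slowdown can be re-derived with a short query. The timings used here (T_l = 5 ms, T^c_l = 10 ms, T^c_r = 150 ms, T_r = 300 ms) are assumptions chosen because they reproduce the 230/300 and 188/300 thresholds and the roughly 2.5-fold slowdown quoted above; the latency model itself is a sketch, spelled out in the comments.

```sql
-- Assumed model (a sketch, see the comments on each column):
--   single server: T_l + X_l * Tc_l + (1 - X_l) * Tc_r
--   active-active: T_l + Tc_l + (1 - N) * T_r
WITH params(t_l, tc_l, tc_r, t_r) AS (
    VALUES (5.0, 10.0, 150.0, 300.0)      -- assumed timings, ms
),
shares(x_l) AS (
    VALUES (0.5), (0.2)                   -- share of local connections
)
SELECT x_l,
       -- smallest N at which active-active is no slower than one server
       round(1 - (1 - x_l) * (tc_r - tc_l) / t_r, 2) AS min_async_fraction,
       -- slowdown when everything is synchronous (N = 0)
       round((t_l + tc_l + t_r)
             / (t_l + x_l * tc_l + (1 - x_l) * tc_r), 2) AS full_sync_slowdown
FROM params CROSS JOIN shares
ORDER BY x_l DESC;
```

Under these assumptions, the break-even fraction comes out at roughly 0.77 for X_l = 0.5 and 0.63 for X_l = 0.2, and the fully synchronous slowdown at roughly 3.7 and 2.5 times.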

This rough calculation suggests that if approximately 25% of DML transactions require remote confirmation, the multi-master system has a chance of keeping the same write performance. If the majority of traffic originates from remote regions, the proportion of reliably replicated data could increase to 40%. However, for a more conservative estimate, let's stick with 25% of synchronous transactions (that is, N = 75%). This approach also eases some of the load on the disk subsystem, locks, and other resources, allowing them to be used for local operations such as VACUUM or read-only queries. There appears to be a grain of truth in that, doesn't it?

On the other hand, replication, even if asynchronous, must be able to keep up with the commit flow. If the total time required for local transaction execution is 15 ms, and the one-way delay to the remote server is 75 ms, then even without waiting for confirmation from the remote side, a queue of changes for replication will still accumulate in a sequential scenario.

25% of the DML in our computation is committed with remote confirmation, leaving 75% to be replicated. 75 ms * 0.75 = 56 ms. To address the disparity between the rate of local commits and the speed of data transfer to the remote server, we must utilise the bandwidth by sending and receiving data on the remote server in parallel (i.e., parallel replication is required). In our rough model, it turns out that four threads are sufficient to transfer changes. Considering the freed-up server resources (resulting from distributing connections between instances), this seems quite realistic.
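One possible reading of this estimate, stated here as an assumption: each local commit takes about 15 ms, while the asynchronous share of a commit costs about 56 ms of transfer time, so the number of parallel streams is the ratio of the two, rounded up.

```sql
SELECT 75 * 0.75            AS transfer_ms_per_commit,   -- one-way delay times the async share
       15                   AS local_commit_ms,          -- local execution time from the text
       ceil(75 * 0.75 / 15) AS parallel_streams_needed;  -- ceil(3.75)
```

The ratio is 3.75, which rounds up to the four threads mentioned above.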

So, the bottom line is that by distributing data geographically in multi-master mode, we can theoretically expect comparable transaction handling speed. This also reduces the number of backends and resource consumption on each server. These resources can be freed up for system processes and various analytics. Let's not forget the ability to optimise indexes, storage, and other attributes of physical data placement independently on each system node. An additional benefit is that in the event of a connection failure, each subnet can be temporarily maintained with the expectation that, upon recovery, a conflict resolution strategy will restore the database's integrity.

It's easy to imagine an application for such a scenario with a complete network breakdown - for example, a database for a hospital network in a region with complex terrain and climate. The medical records database must be shared, but a single client is unlikely to be served by two different hospitals within a short period of time, making conflicts in critical data quite rare.

Taking all of the above into account, a viable multi-master solution should implement the following set of technologies:

Replication sets – to classify data for replication, providing separate synchronous and asynchronous replication.
Replication type dependency detection – check that synchronously replicated tables don't refer to asynchronously replicated tables.
Remote commit confirmation (similar to 2PC).
A distributed consensus protocol for determining a healthy subset of nodes and fencing failed nodes in 3+ configurations.
Parallel replication – parallelising both DML sending and application on the remote side.
Automatic Conflict Resolution.
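Some items on this list can be partially approximated with what PostgreSQL 16 or later already offers. The sketch below assumes two nodes, hypothetical tables `account_balance` and `page_view_log`, and a hypothetical connection string. It covers only replication sets, two-phase decoding, and loop prevention; consensus, fencing, and conflict resolution are not addressed by it.

```sql
-- On node A: two replication sets, expressed as two publications.
CREATE PUBLICATION pub_critical FOR TABLE account_balance;
CREATE PUBLICATION pub_relaxed  FOR TABLE page_view_log;

-- On node B: one subscription per replication set.
-- origin = none skips changes that node A itself received via replication,
-- which prevents loops when node A subscribes to node B in the same way.
CREATE SUBSCRIPTION sub_critical
    CONNECTION 'host=node-a.example dbname=app user=repl'
    PUBLICATION pub_critical
    WITH (two_phase = true, origin = none, copy_data = false);

-- streaming = parallel lets large in-progress transactions be applied by
-- parallel apply workers; it does not parallelise independent small ones.
CREATE SUBSCRIPTION sub_relaxed
    CONNECTION 'host=node-a.example dbname=app user=repl'
    PUBLICATION pub_relaxed
    WITH (streaming = parallel, origin = none, copy_data = false);
```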

Let's not overlook the importance of automatic failover and the capability to hot-swap hardware without any downtime. Currently, the concept of alternative physical storage arrangements still sounds a little wild, so I'm excluding it from our discussion for now. However, if we gain more experience with successful multi-master implementations, this option might eventually become the preferred approach.

That's all for now. This post is meant to spark discussion, so please feel free to share your thoughts in the comments or through any other method you prefer.

THE END.
October 18, 2025, Madrid, Spain.

Extra approach to RTABench Q0 optimisation

Andrei Lepikhov — Thu, 07 Aug 2025 14:19:40 GMT

In the previous post, I explored some nuances of Postgres related to indexes and parallel workers. This text sparked a lively discussion on LinkedIn, during which one commentator (thanks to Ants Aasma) proposed an index that was significantly more efficient than those discussed in the article. However, an automated comparison of EXPLAINs did not clarify the reasons for its superiority, necessitating further investigation.

This index:

CREATE INDEX ON order_events ((event_payload ->> 'terminal'::text),
                              event_type,event_created); -- (1)

At first (purely formal) glance, this index should not be much better than the alternatives:

CREATE INDEX ON order_events (event_created,
                              (event_payload ->> 'terminal'::text),
                              event_type); -- (2)
CREATE INDEX ON order_events (event_created, event_type); -- (3)

However, the observed speedup is significant; in fact, the performance of index (1) surpasses index (2) by more than 50 times and exceeds index (3) by almost 25 times!

The advantages of the proposed index are evident when we consider the logic of the subject area. It is more selective and is less likely to retrieve rows that do not match the filter. For instance, if we first identify all the rows that correspond to a specific terminal, we can then focus on the boundaries of the date range. At this point, all retrieved rows will already meet the filter criteria. Conversely, if we begin by determining the date range, we may encounter numerous rows related to other terminals within that range.

However, when examining the EXPLAIN output, we do not see any distinctive reasons:

-- (1)
->  Index Scan using order_events_expr_event_type_event_created_idx
      (cost=0.57..259038.66 rows=64540 width=72)
      (actual time=0.095..232.855 rows=204053.00 loops=1)
    Index Cond:
      event_payload ->> 'terminal' = ANY ('{Berlin,Hamburg,Munich}') AND
      event_type = ANY ('{Created,Departed,Delivered}') AND
      event_created >= '2024-01-01 00:00:00+00' AND
      event_created < '2024-02-01 00:00:00+00'
    Index Searches: 9
    Buffers: shared hit=204566

-- (2)
->  Index Scan using order_events_event_created_event_type_expr_idx
      (cost=0.57..614892.22 rows=64540 width=72)
      (actual time=0.499..14303.685 rows=204053.00 loops=1)
    Index Cond:
      event_created >= '2024-01-01 00:00:00+00' AND
      event_created < '2024-02-01 00:00:00+00' AND
      event_type = ANY ('{Created,Departed,Delivered}') AND
      event_payload ->> 'terminal' = ANY ('{Berlin,Hamburg,Munich}')
    Index Searches: 1
    Buffers: shared hit=279131
                        
-- (3)

->  Index Scan using idx_3
      (cost=0.57..6979008.62 rows=64540 width=72)
      (actual time=0.238..8777.846 rows=204053.00 loops=1)
    Index Cond:
      event_created >= '2024-01-01 00:00:00+00' AND
      event_created < '2024-02-01 00:00:00+00' AND
      event_type = ANY ('{Created,Departed,Delivered}')
    Filter: event_payload ->> 'terminal' = ANY ('{Berlin,Hamburg,Munich}')
    Rows Removed by Filter: 4292642
    Index Searches: 1
    Buffers: shared hit=4509185

Let's say IndexScan on (3) filters a lot of tuples and is therefore slow. However, even after eliminating 4 million rows, IndexScan on (3) is still twice as fast as IndexScan on (2). At the same time, the only difference between indexes (1) and (2) is the order of the columns.

If we compare scans (1) and (2), the only noticeable difference is a 30% difference in the number of buffer pages hit. But not 50 times! That means the EXPLAIN does not show us where the main work was done; only the cost value signals the superiority of index (1).

However, we live in a world of ORMs and ad-hoc queries, where it is difficult to choose the column order of an index by analysing the meaning of the stored data. That means we need to find out precisely what is happening here and which data is missing for the automated detection of an [un]successful index.

If you look at the optimiser code, it becomes clear in numbers why index (1) is so pleasing: all other things being equal, it is going to go through only 39 out of 1 million index pages. Compare this with index (2), which also contains 1 million pages, and we pass through 73 thousand of them. In terms of index tuples, this is 64.5 thousand versus 14 million. It turns out that the main work is to select a row, extract the appropriate attribute and perform the comparison.
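The gap between what a date-first index has to walk through and what finally matches can be checked directly on the table. This is a sketch that reuses the filter values from the plans above:

```sql
SELECT count(*) AS rows_in_date_range,
       count(*) FILTER (
           WHERE event_type IN ('Created', 'Departed', 'Delivered')
             AND event_payload ->> 'terminal' IN ('Berlin', 'Hamburg', 'Munich')
       ) AS rows_matching_all_conditions
FROM order_events
WHERE event_created >= '2024-01-01'
  AND event_created <  '2024-02-01';
```

The larger the ratio between the two counters, the more index tuples an index that leads with event_created has to visit and reject, whereas an index that leads with the terminal expression can descend almost straight to the matching entries.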

The work performed is not represented in the EXPLAIN output. Additionally, the IndexScan structure of the query plan, which is accessible to the Postgres core and its extensions after the plan has been executed, lacks valuable information necessary for assessing the quality of planning and the sources of execution-time growth. Consequently, developing a method for automatically identifying ineffective indexes and selecting better alternatives appears to be challenging, if not impossible.

There are numerous parameters that the optimiser calculates during the index scan planning. Take a look at the GenericCosts structure, including numIndexTuples, numIndexPages, indexCorrelation, and indexSelectivity. Having all this information available at the end of execution could help detect scanning anomalies and draw the DBA's attention.

Of course, the number of installations, types of load and cases is close to infinity. Hence, extending the core code by continually adding more data from the optimisation stage to the plan does not seem flexible. Besides, sometimes we would like to see alternative paths that lost the battle but may be useful for analysis.

Moreover, since Postgres 18, the core already has a nicely extensible explain, where we may add options, node information, and overall plan information. So, the only step needed is a bridge between the cloud of possible paths and the final plan.
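As an illustration of that extensibility, the `pg_overexplain` contrib module shipped with Postgres 18 adds its own EXPLAIN options. The sketch below assumes the module is installed. It prints extra planner-side details per plan node, but not the GenericCosts values discussed above.

```sql
-- Load the module into the current session.
LOAD 'pg_overexplain';

-- DEBUG is an option contributed by the extension, not by the core.
EXPLAIN (DEBUG, COSTS OFF)
SELECT count(*)
FROM order_events
WHERE event_created >= '2024-01-01'
  AND event_created <  '2024-02-01';
```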

Having this capability would allow extensions to analyse the predictions against the actual outcomes of query execution. Additionally, it would help in making informed decisions for fine-tuning the query planner and developing effective indexing strategies for a table.

Please feel free to share your feedback, whether you agree or disagree with my viewpoint.

THE END.
August 7, 2025, Torrevieja, Spain.

Squeezing out Postgres performance on RTABench Q0

Andrei Lepikhov — Mon, 04 Aug 2025 12:08:18 GMT

I often hear that PostgreSQL is not suitable for analytics, with TPC-H or ClickBench results cited as evidence. Sure, on a straightforward task like scanning 100 million rows on disk and calculating a set of aggregates, you will get stuck on the storage format and parallelisation issues that limit the ability to optimise the DBMS.

In practice, queries tend to be highly selective and do not require processing large numbers of rows. The focus, then, shifts to the order of JOIN operations, caching intermediate results, and minimising sorting operations. In these scenarios, PostgreSQL, with its wide range of query execution strategies, can indeed have an advantage.

I wanted to explore whether Postgres could be improved by thoroughly utilising all available tools, and for this, I chose the RTABench benchmark. RTABench is a relatively recent benchmark that is described as being close to real-world scenarios and highly selective. One of its advantages is that the queries include expressions involving the JSONB type, which can be challenging to process. Additionally, the Postgres results on RTABench have not been awe-inspiring.

Ultimately, I decided to review all of the benchmark queries (fortunately, there aren't many) to identify possible optimisations. However, the very first query, Q0, already had enough nuances to deserve a separate discussion.

My setup isn't the latest: it's a MacBook Pro from 2019 with an Intel processor, so we can't expect impressive or stable performance metrics. Instead, let's concentrate on qualitative characteristics rather than quantitative ones. For this purpose, my hardware setup should be sufficient. You can find the list of non-standard settings for the Postgres instance here.

Now, consider RTABench query Q0, which calculates several aggregates over a relatively small sample from the table:

EXPLAIN (ANALYZE, BUFFERS ON, TIMING ON, SETTINGS ON)
WITH hourly_stats AS (
  SELECT 
    date_trunc('hour', event_created) as hour,
    event_payload->>'terminal' AS terminal,
    count(*) AS event_count
  FROM order_events
  WHERE 
    event_created >= '2024-01-01' AND
    event_created < '2024-02-01'
    AND event_type IN ('Created', 'Departed', 'Delivered')
  GROUP BY hour, terminal
)
SELECT 
  hour,
  terminal,
  event_count,
  AVG(event_count) OVER (
    PARTITION BY terminal
    ORDER BY hour
    ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
  ) AS moving_avg_events
FROM hourly_stats
WHERE terminal IN ('Berlin', 'Hamburg', 'Munich')
ORDER BY terminal, hour;

Phase 0. ‘Default behaviour’

The query does not seem surprising. Let's run a query over the default data schema (EXPLAIN has been cleaned):

WindowAgg  (actual time=21053.119 rows=2232)
  Window: w1 AS (PARTITION BY
   order_events.event_payload ->> 'terminal'
   ORDER BY date_trunc('hour', order_events.event_created))
  Buffers: shared read=3182778
  -> Sort  (actual time=21052.476 rows=2232)
     Sort Key:
      order_events.event_payload ->> 'terminal',
      date_trunc('hour', order_events.event_created)
      Sort Method: quicksort  Memory: 184kB
   -> GroupAggregate  (actual time=21051.875 rows=2232)
       Group Key:
        date_trunc('hour', order_events.event_created),
        order_events.event_payload ->> 'terminal'
     -> Sort (actual time=21037.609..21042.766 rows=204053)
         Sort Key:
          date_trunc('hour', order_events.event_created),
          order_events.event_payload ->> 'terminal'
         Sort Method: quicksort  Memory: 12521kB
       -> Bitmap Heap Scan on order_events
            (actual time=20999.978 rows=204053)
          Recheck Cond: event_type =
           ANY ('{Created,Departed,Delivered}')
          Filter:
           event_created >= '2024-01-01 00:00:00+00' AND
           event_created < '2024-02-01 00:00:00+00' AND
           event_payload ->> 'terminal' =
             ANY ('{Berlin,Hamburg,Munich}')
          Rows Removed by Filter: 57210049
          Heap Blocks: exact=3133832
          Buffers: shared read=3182778
        -> Bitmap Index Scan (actual time=1683.357 rows=57414102)
           Index Cond: event_type = ANY ('{Created,Departed,Delivered}')
           Index Searches: 1
           Buffers: shared read=48946
Execution Time: 21060.564 ms

The execution time is 21 seconds? Really? Seems too slow. Upon examining the EXPLAIN, we realise that the main issue is that the default schema contains only one low-selectivity index, which was used during execution instead of performing a sequential scan. The EXPLAIN also indicates that a significant portion of the work involves collecting identifiers (ctid) of candidate rows, which takes 1.6 seconds. Following that, filtering through these rows results in filtering out 98% of the read rows, which takes 19 seconds.

The first problem was identified quickly: I allocated 8GB for shared_buffers; however, the DBMS limits the amount of buffer space that can be assigned to a single table. The formula for this allocation is quite complex, involving multiple factors, but the NBuffers/MaxBackends ratio applies here. Consequently, with my current settings, PostgreSQL can allocate a maximum of only 2.4GB per table. Therefore, denormalising the entire database into one wide, long table is a bad idea in PostgreSQL, at least for this reason.
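How much of shared_buffers a table actually occupies can be checked with the `pg_buffercache` extension. A minimal sketch, assuming the extension is available and the session has the privileges to read it:

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Count the shared buffers that currently hold pages of order_events.
SELECT count(*) AS buffers,
       pg_size_pretty(count(*) * current_setting('block_size')::bigint) AS cached
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('order_events')
  AND reldatabase = (SELECT oid
                     FROM pg_database
                     WHERE datname = current_database());
```

Running it right after the query shows whether the table's pages stayed in the buffer pool or were evicted along the way.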

Despite the data access pattern in this query plan being somewhat inefficient, let’s first try a straightforward approach to improve the performance by increasing the number of parallel workers:

A very strange graph. Obviously, if most of the work is reading from disk and tuple deforming, one should not expect a significant effect from parallelism. But where does the jump between 6 and 7 parallel processes come from? Looking at the EXPLAINs, it is easy to see: the query plan changed. BitmapScan was used with a small number of workers, and the optimiser picked SeqScan with a larger number.

So maybe SeqScan should have been used on a small number of workers, too? Let's see how the scanning operation is accelerated separately for BitmapScan and SeqScan, and also watch how the costs of the scanning nodes change (in relative values):
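Such a comparison can be set up with the planner's standard switches. The sketch below uses a simplified counting query instead of the full Q0; the worker count is an arbitrary example value to be varied between runs.

```sql
-- Number of workers to test; repeat the whole script for 0, 1, 2, ...
SET max_parallel_workers_per_gather = 4;

-- Variant A: discourage index-based access so that SeqScan is chosen.
SET enable_bitmapscan = off;
SET enable_indexscan  = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM order_events
WHERE event_created >= '2024-01-01'
  AND event_created <  '2024-02-01'
  AND event_type IN ('Created', 'Departed', 'Delivered');
RESET enable_bitmapscan;
RESET enable_indexscan;

-- Variant B: discourage SeqScan so that the index on event_type is used.
SET enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM order_events
WHERE event_created >= '2024-01-01'
  AND event_created <  '2024-02-01'
  AND event_type IN ('Created', 'Departed', 'Delivered');
RESET enable_seqscan;
```

The `enable_*` settings only add a cost penalty rather than forbid a node type, so the resulting plan should still be checked in the output.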

Raw numbers and EXPLAINs can be found here and here.

The thing is that, for a small number of workers, BitmapScan has a lower cost than SeqScan, which in our case does not correlate with execution time. The point here is probably a delicate balance: fewer rows are read, but the executor jumps from the index to the table more often. It is difficult to say more precisely, since EXPLAIN does not show such details of the estimate as the expected number of tuples fetched from disk before filtering, the estimated number of fetched disk pages, or the estimated proportion of pages that will be found in shared_buffers. On the other hand, the cost model assumes better scalability for SeqScan than for BitmapScan, which causes the plan to switch to SeqScan. Considering that, for SeqScan, the change in cost predicts an unreasonably large performance gain, this may result in SeqScan being selected where it should not be. Thus, for now, you should be careful when optimising queries by increasing the number of parallel workers.

Phase 1. ‘Typical optimisation’

Now let's move on and see what a good index will give Postgres. A typical practice is to create an index on the most frequently used highly selective column in filters. For this query, the choice is limited to only one option:

CREATE INDEX idx_1 ON order_events (event_created);

With this index, the optimiser utilises IndexScan to access data, reducing the query execution time (without parallel workers) to 6.5 seconds. Interestingly, the previous query plan accessed buffer pages 3 million times (shared read = 3182778), while in this case, with a more selective scan, it increased to 14 million (shared hit = 14317527). Although there are more trips to the buffer now, in the previous example, each page replaced a previous one in the buffer. In contrast, both the index and the disk pages now fit into the shared buffers, which contributes to the acceleration.

Next, let's explore whether parallel workers will provide additional benefits and examine how the cost model predicts this outcome:

Raw data can be taken here.

Yes, we see some acceleration. The parallelisation effect goes so far that we have to expand the permissible number of workers to 24 to track the impact to the end. And here one of the disadvantages of Postgres showed itself. Although we set all the relevant settings to large values, the optimiser hit a hard-wired limit on the number of workers (10 for this table) on the order_events table. We had to bypass it with the command:

ALTER TABLE order_events SET (parallel_workers = 32);

It is pretty sad that the number of workers implicitly depends on the table size estimate. This can be a problem, for example, in the JOIN operator, which determines the required number of workers based on the number of workers requested on the outer side of the join. It is easy to imagine a situation where a tiny table, too small to justify even one parallel worker, is joined with a very large one. In this case, the entire join tree may remain non-parallelised just because of one small table!
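The size-based estimate can be approximated with a query. The formula below is my reading of the planner's heuristic (one worker once the table reaches `min_parallel_table_scan_size`, plus one more for each further tripling of its size), so treat it as an assumption. It also assumes that setting is non-zero and ignores the cap imposed by `max_parallel_workers_per_gather`.

```sql
SELECT pg_size_pretty(pg_relation_size('order_events')) AS heap_size,
       current_setting('min_parallel_table_scan_size')  AS scan_threshold,
       CASE
           -- Below the threshold, no parallel scan is considered at all.
           WHEN pg_relation_size('order_events')
                < pg_size_bytes(current_setting('min_parallel_table_scan_size'))
           THEN 0
           -- Otherwise: 1 + floor(log3(heap size / threshold)).
           ELSE 1 + floor(log(3,
                    pg_relation_size('order_events')::numeric
                    / pg_size_bytes(current_setting('min_parallel_table_scan_size'))))
       END AS size_based_workers;
```

Setting the `parallel_workers` storage parameter, as done above, replaces this size-based estimate for the table.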

Another fact can be extracted from the above graph - the cost model for Index Scan is very conservative: with the maximum observed speedup of 8, the model did not show even a two-fold speedup. Hence the conclusions: 1) when using indexes, do not be shy about raising worker limits, and 2) Scan nodes that are more sensitive to the number of workers (we observed this, for example, with SeqScan), can unexpectedly trigger a rebuild of the query plan for the worse.

However, the current index is not the ultimate dream. Let's take a closer look at the scanning node:

Index Scan (actual time=6555.122 rows=204053)
  Index Cond: event_created >= '2024-01-01 00:00:00+00' AND
    event_created < '2024-02-01 00:00:00+00'
  Filter: event_type = ANY ('{Created,Departed,Delivered}') AND
    event_payload ->> 'terminal' = ANY ('{Berlin,Hamburg,Munich}')
  Rows Removed by Filter: 14099758
  Index Searches: 1
  Buffers: shared hit=14317527

Lots of pages touched, lots of rows filtered out. Let's see what can be achieved by minimising disk reads.

Phase 2. ‘Reinforced optimisation’

In this section, we will confidently assume the existence of an 'index adviser' that helps analyse and automatically create composite indexes. These indexes optimise data retrieval by minimising the reading of table rows, thereby adapting the entire system to the incoming load.

In this query, we have several options to consider. We will exclude the GIN index because the event_payload column lacks selectivity. This leaves us with two alternative options:

CREATE INDEX idx_2 ON order_events (event_created, event_type)
  INCLUDE (event_payload);
CREATE INDEX idx_3 ON order_events (event_created, event_type);

The idx_2 variant does not require accessing the table at all, while the idx_3 index can cause both IndexScan and BitmapScan. You can find various EXPLAINs of these indexes here. It’s interesting to note that with the previously created idx_1 index, adding idx_2 does not result in switching to the obviously faster IndexOnlyScan. This suggests that when evaluating the cost of index access, the width of the index plays a significant role. The jsonb field likely increases the size of idx_2 considerably.

Consequently, the idx_3 index has proven to be the most optimal in terms of compactness and the number of selected records when using the BitmapScan method. By closely examining the scanning nodes, we can understand the reasons behind this conclusion:

Bitmap Heap Scan (actual time=1286.430 rows=204053)
  Rows Removed by Filter: 4292642
  Heap Blocks: exact=269237
  Buffers: shared hit=313925
  Bitmap Index Scan (actual time=625.170 rows=4496695)
    Index Searches: 1
    Buffers: shared hit=44688

Index Only Scan (actual time=1586.097 rows=204053)
  Rows Removed by Filter: 4292642
  Heap Fetches: 0
  Index Searches: 1
  Buffers: shared hit=2558314

Index Scan (actual time=2847.517 rows=204053)
  Rows Removed by Filter: 4292642
  Index Searches: 1
  Buffers: shared hit=4509185

All three scans return the same number of rows, perform a single pass through the index, and filter out the same number of rows. However, IndexOnlyScan beats IndexScan because it does not visit the table and touches buffer pages almost half as often (2.6 million vs. 4.5 million). BitmapScan touches the buffer even less often (about 300 thousand times): after passing through the index and collecting the TIDs of candidate rows, it visits the heap page by page, touching each potentially useful page only once.

Let's see how parallel workers now help speed up the query for each type of scanning:

It turns out that in the case of BitmapScan, there is little point in using workers. With ample computing resources and low contention between clients, it is worth considering reducing the cost of parallel execution (see the parallel_setup_cost and parallel_tuple_cost parameters) and disabling BitmapScan.

However, the cost model again turned out to be insensitive to the effect of parallelism, and this is the first thing that should be addressed. It was also noted that with 8+ workers, the plan switched back to SeqScan, which increased the execution time from ~1 s to ~21 s. Therefore, in the interests of the experiment, SeqScan had to be manually disabled.

However, even with such a good index, we see that a certain number of rows still have to be filtered out. Let's go all the way and organise selective access to only the relevant data.

Phase 3. ‘Crazy optimisation’

Let's now try to reach the theoretical limit of optimisation of this query. Here, we can imagine having an advanced 'Disk Access Tuner' that analyses various expressions of the SQL query to find combinations of high and low selective filters on the same table, which is a good reason to consider partial indexes. Let's create the following ideal index:

CREATE INDEX idx_5 ON order_events (event_created, event_type)
INCLUDE (event_payload)
WHERE
  event_created >= '2024-01-01' AND event_created < '2024-02-01' AND
  event_type IN ('Created', 'Departed', 'Delivered') AND
  (event_payload ->> 'terminal') = ANY ('{Berlin,Hamburg,Munich}');

The index is designed so that no table access is necessary, and all rows in this index are relevant to the query. Therefore, there is no need to evaluate the filter value, which also helps save CPU time during the execution. The base case (without workers) now executes in 54 ms (as shown in the EXPLAINs here).

In such a straightforward scenario, it's clear that the estimation of the cardinality for the scan operator is made with an error:

Index Only Scan (cost=0.42..4998.13 rows=70210)
                (actual time=34.718 rows=204053.00 loops=1)
    Heap Fetches: 0
    Index Searches: 1
    Buffers: shared hit=110862

Scanning does not rely on a filter; the table is, in fact, static, yet the estimate is still wrong! In this instance, it may not be significant, but if there were a join tree above, an incorrect estimate at a leaf node could lead to a substantial error when selecting a join strategy. Why can't the optimiser refine the selectivity estimate based on the existing indexes? Let's also explore whether scaling is effective here with parallel workers. Due to the increased share of the remaining (non-parallel) portion of the query, we will focus solely on the actual time and estimated cost of the scan nodes themselves:

Here, as in the previous case, it is clear that the IndexOnlyScan scanning effect ends at three workers. But the cost model does not even show this. Why does the cost model reflect the real numbers so poorly? Perhaps it is conservative due to the diversity of hardware parallelism models and, as a result, the different impact of parallelisation on the query execution. Or maybe there is an implicit assumption that there is a neighbouring backend nearby that will compete for the resource? In any case, I would personally like to have an explicit parameter that allows me to configure this effect, given the nature of my system's load.

Conclusion

What lessons can be drawn from this simple experiment?

Parallel workers significantly affect performance, so it's vital to adjust the optimiser's cost model to align with the server's capabilities, increasing the proportion of parallel plans and the number of workers.
The efficiency of parallelisation is highly dependent on the access technique, and preference should be given to IndexScan over BitmapScan, SeqScan and even IndexOnlyScan.
It appears that the cost model for parallelism in PostgreSQL has not been sufficiently polished, potentially leading to side effects such as defaulting to inefficient SeqScan operations.
PostgreSQL lacks the ability to adjust the set of indexes to optimise data access based on the actual workload, which matters all the more given the weak points of its current row storage.
More deeply employing indexes in the planning process may provide worthwhile improvements in cardinality estimations.

THE END.
July 26, 2025, Madrid, Spain.

On Postgres Plan Cache Mode Management

Andrei Lepikhov — Thu, 03 Jul 2025 08:29:57 GMT

Having attended PGConf.DE'2025 and discussed the practice of using Postgres on large databases there, I was surprised to regularly hear the opinion that query planning time is a significant issue. As a developer, it was surprising to learn that this factor can, for example, slow down the decision to move to a partitioned schema, which seems like a logical step once the number of records in a table exceeds 100 million. Well, let's figure it out.

The obvious way out of this situation is to use prepared statements, initially intended for reusing labour-intensive parts such as parse trees and query plans. For more specifics, let's look at a simple table scan with a large number of partitions (see initialisation script):

EXPLAIN (ANALYZE, COSTS OFF, MEMORY, TIMING OFF)
SELECT * FROM test WHERE y = 127;

/*
...
   ->  Seq Scan on l256 test_256
         Filter: (y = 127)
 Planning:
   Buffers: shared hit=1536
   Memory: used=3787kB  allocated=4104kB
 Planning Time: 61.272 ms
 Execution Time: 4.929 ms
*/

In this scenario involving a selection from a table with 256 partitions, my laptop's PostgreSQL took approximately 60 milliseconds for the planning phase and only 5 milliseconds for execution. During the planning process, it allocated 4 MB of RAM and accessed 1,500 data pages. Quite substantial overhead for a production environment! In this case, PostgreSQL has generated a custom plan that is compiled anew each time the query is executed, choosing an execution strategy based on the query parameter values during optimisation. To improve efficiency, let's parameterise this query and store it in the 'Plan Cache' of the backend by executing PREPARE:

PREPARE tst (integer) AS SELECT * FROM test WHERE y = $1;
EXPLAIN (ANALYZE, COSTS OFF, MEMORY, TIMING OFF) EXECUTE tst(127);

/*
...
   ->  Seq Scan on l256 test_256
         Filter: (y = $1)
 Planning:
   Buffers: shared hit=1536
   Memory: used=3772kB  allocated=4120kB
 Planning Time: 59.525 ms
 Execution Time: 5.184 ms
*/

The planning workload remains the same since a custom plan has been used. Let's force the backend to generate and use a 'generic' plan:

SET plan_cache_mode = 'force_generic_plan';
EXPLAIN (ANALYZE, COSTS OFF, MEMORY, TIMING OFF) EXECUTE tst(127);

/*
...
   ->  Seq Scan on l256 test_256
         Filter: (y = $1)
 Planning:
   Memory: used=4kB  allocated=24kB
 Planning Time: 0.272 ms
 Execution Time: 2.810 ms
*/

The first time the query is executed, a generic execution plan is created (we are using forced mode here to keep the example straightforward). This process requires resources nearly equivalent to those needed for building a custom plan. However, when the query is executed again, the generic plan can be quickly retrieved from the cache. As a result, the time spent preparing the query plan drops to about 0.3 ms, allocated memory is only 24 kB, and no data page reads are required. It seems we have a clear benefit!

However, my suggestion to use the PREPARE command has often been met with rejection and scepticism. This is primarily due to the problems that arise with generic plans in practice, particularly regarding their updating (replanning) and switching to a custom plan type. To gain a clearer understanding of how the generic plan mechanism is structured and to explore the root of these issues, I decided to investigate the history of this project. Additionally, I aimed to experiment with a new publication format, such as a mailing list review.

What’s wrong with a generic plan?

Upon examining the Git history of PostgreSQL, it appears that the concept of the plan cache was introduced in 2007 with commit b9527e9. At that time, it was decided that each prepared query in PostgreSQL should be executed exclusively using a generic plan, thereby avoiding unnecessary time spent on rebuilding the plan. Unlike Oracle, SQL Server, DB2 and other peers (see link, link, link, and link), PostgreSQL constructs the generic plan with the 'total uncertainty' concept, without utilising any specific 'reference' parameter value. For instance, in the example mentioned, the constant '127' is set aside during the creation of the generic plan.

Due to its limited ability to estimate scan selectivities, the optimiser often depends on default 'magic' values of certain predefined constants. Consequently, a generic plan is often of lower quality compared to a custom one. Let me provide another example to illustrate this point more clearly (see the reproduction script):

EXPLAIN
SELECT * FROM test_2
WHERE
  start_date > '2025-06-30'::timestamp - '7 days'::interval;
/*
 Index Scan using test_2_start_date_idx on test_2  (rows=739)
   Index Cond: (start_date > '2025-06-23 00:00:00'::timestamp)
*/

PREPARE tst3(timestamp) AS SELECT * FROM test_2
  WHERE start_date > $1 - '7 days'::interval;

EXPLAIN EXECUTE tst3('2025-06-30'::timestamp);
/*
 Seq Scan on test_2  (rows=333333)
   Filter: (start_date > ($1 - '7 days'::interval))
*/

Offhand, here are some key reasons to consider: the lack of a constant in the inequality operator results in a filter estimate of 33%; for range filters, the default value is set at 0.5% of the total number of rows in the table; with the equality operator, using MCV statistics is not possible, so we must rely solely on the ndistinct value. Additionally, in certain situations, it is not feasible to use partial indexes.
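The arithmetic behind these defaults can be sketched in a few lines of Python. The constants mirror the percentages quoted above; the function name generic_row_estimate is hypothetical, and the table size of 1,000,000 rows is an assumption inferred from the rows=333333 estimate in the generic plan above, not a figure taken from the reproduction script.

```python
DEFAULT_INEQ_SEL = 1.0 / 3.0      # inequality with an unknown constant: ~33%
DEFAULT_RANGE_INEQ_SEL = 0.005    # range filter with unknown bounds: 0.5%


def generic_row_estimate(table_rows, clause_kind, ndistinct=None):
    """Row estimate when the parameter value is unknown at planning time."""
    if clause_kind == "inequality":
        selectivity = DEFAULT_INEQ_SEL
    elif clause_kind == "range":
        selectivity = DEFAULT_RANGE_INEQ_SEL
    elif clause_kind == "equality":
        # Without a constant, MCV statistics cannot be consulted,
        # so only the number of distinct values is left.
        if not ndistinct:
            raise ValueError("equality needs an ndistinct estimate")
        selectivity = 1.0 / ndistinct
    else:
        raise ValueError(f"unknown clause kind: {clause_kind}")
    return max(1, round(table_rows * selectivity))


if __name__ == "__main__":
    rows = 1_000_000  # assumed table size
    print(generic_row_estimate(rows, "inequality"))
    print(generic_row_estimate(rows, "range"))
    print(generic_row_estimate(rows, "equality", ndistinct=100))
```

Whatever the actual parameter value is, the inequality estimate stays at a third of the table, which is enough to push the planner from the index scan to a sequential scan.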

Let's turn to the origins

The absence of alternatives resulted in a significant decline in performance and the infrequent use of the generally practical PREPARE/EXECUTE statement construct. In 2011, a discussion began that ultimately led to the e6faf91 commit, which introduced a simple automatic technique for switching between custom and generic plan variants.

This discussion began with the pressing issue that prepared statements were executed exclusively using generic plans (Mark Mielke, link). While these plans were rebuilt each time an invalidation signal was received, such as after executing the ANALYZE or ALTER TABLE commands, the quality of the planning was noticeably inferior.

Several ideas were proposed to address this problem:

Periodically, replan the generic plan (Jeroen Vermeulen, link).
Introduce a threshold for the 'planning/execution time' ratio: if the value is greater than 100, use only the generic plan; if it is less than 0.01, use only the custom plan (Bart Samwel, link). Yeb Havinga opposed this idea (link), arguing that an objective criterion should not contain a 'time' parameter. However, Jeroen Vermeulen and Greg Stark (link) supported it, with the proviso that the difference between planning and execution times should be significant, amounting to orders of magnitude.
Track the standard deviation (stddev) value of various parameters for executing a specific query plan, which will enable estimating the probability of how long the query will take to plan and execute next time (Greg Stark, link).
Build several custom and generic plans, and make a choice based on the cost ratio (Tom Lane, link).
Abandon generic plans altogether, while reducing the cost of replanning by preserving the PlannerInfo optimiser 'cache' and replanning only that part of the jointree / subquery where the parameters are actually used (Yeb Havinga, link).
Use generic plans, but introduce a replanning criterion - whether the parameter value falls within the MCV or not (Robert Haas (link, link), supported by Jeff Davis).

Interestingly, the idea of re-optimisation was already being discussed back then (Richard Huxton, link). At that time, it was more of a dream, but by the 2020s, the code infrastructure had matured enough to allow us to implement a similar concept in a short time (see replan). The approach of detecting, generalising, and caching frequently arriving statements through a simple protocol, which we implemented in sr_plan, is also explicitly described here (Robert Haas, link), along with Yeb Havinga's idea of achieving this through a method similar to the then non-existent queryId (link).

At the same time, in 2011, Simon Riggs introduced the concept of a one-shot plan. The primary idea behind this type of plan is to inform the DBMS that a query plan will be created, executed immediately, and subsequently destroyed upon completion. This approach allows for the application of additional optimisations that are not relevant when there is no connection between the planning and execution phases.

To support this idea, Simon provided an example involving the calculation of stable functions, which would enable more efficient execution of partition pruning. Additionally, Bruce Momjian highlighted another potential optimisation that could be implemented in a one-shot plan: analysing the buffer cache to assess the effectiveness of using a specific index.

Meanwhile, Tom Lane was developing a similar feature, motivated by complaints about regressions in dynamic SQL queries (link, link). However, his approach was different from Simon Riggs' original concept. Tom Lane's idea focused on unifying the mechanisms of SPI, PREPARE, and the extended protocol through the use of a plan cache. As a result, Riggs' original idea did not receive much further development, though it was discussed later on (link, link).

The concept of tracking the planning and execution time of queries did not gain traction due to objections from Tom Lane, who argued against using this time characteristic, as it is inherently unpredictable and can behave inconsistently across different systems.

In 2017, Pavel Stehule raised the need for explicit control over the type of plan selected when invoking the plan cache. This discussion led to the introduction of the plan_cache_mode parameter, which, besides the default auto mode, has two options: force_generic_plan and force_custom_plan. These options force the use of generic and custom plans, respectively.

What stands out to me as a developer is the emphasis on several key concepts from the Postgres core that emerged during these discussions.

Tom Lane pointed out that in the absence of a general solution, we should develop heuristics. Providing users with such solutions through an additional GUC is a poor idea and ultimately a compromise.
Greg Stark and Pavel Stehule emphasised that the predictability of execution is more important than speed.
Tom Lane also noted that the ability to switch between different query plan types is valuable, provided it is controlled on a per-query basis.

Outcomes

Analysing the history of feature creation, the opinions expressed within the community, and the current knowledge base on generic plans' usage experience, I conclude that many of the current problems stem from the following issues:

Unstable Performance. Generic plan performance may vary significantly based on different sets of input parameter values. This suggests a need to switch to a custom plan type. However, PostgreSQL cannot automatically detect and switch plans because it lacks any statistics on the query execution. The current state of the core code enables a straightforward implementation to track various execution parameters, including the average and standard deviation. But before we proceed with a proposal to the community, we must address a fundamental question: should the PostgreSQL core have a feedback system from the executor to the optimiser?
Outdated custom/generic cost proportion. When a plan is invalidated, for instance, due to updated table statistics, the generic plan is rebuilt, and its cost is recalculated. However, this does not happen for the custom plan. Since the custom plan's cost is not recalculated, the value stored in the plan cache may significantly differ from reality due to gradual changes in table contents. This discrepancy can often lead to situations where a generic plan is utilised, even though the efficiency of a custom plan is apparent and could be determined by the planner during replanning.
Inadequate plan costs. A common issue arises when erroneous estimates make the query plan costs irrelevant to the actual workload. Consequently, the choice between custom and generic plans becomes largely a matter of chance.

What can we propose?

After many years of development and testing of the code, can we generate any new ideas? As usual, there are two separate solution designs: one is an in-core part for the community, and the other is an extensible code, which may even include a core patch that could be incorporated into a Postgres fork.

For the core version, we can consider the option of resetting the custom plan statistics on the cached plan, similar to what we do for the generic plan in the event of a plan invalidation call. This would trigger a new plan selection cycle from scratch. This approach is easily justified because statistics form the basis for calculating plan costs. When they change, it's comparable to switching to a different coordinate system, making it necessary to recalculate all costs.
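To illustrate the idea, here is a toy Python model of the counters kept for one prepared statement. It assumes the automatic rule works roughly as follows: the first five executions use custom plans, after which the generic plan wins whenever its cost is below the average custom plan cost. The class CachedStatement, its methods, and the reset_custom_stats flag are hypothetical names for this sketch, and details such as the planning-cost surcharge added to custom plans are omitted.

```python
class CachedStatement:
    """Toy model of plan-type selection for one prepared statement."""

    def __init__(self):
        self.num_custom_plans = 0
        self.total_custom_cost = 0.0
        # -1 means 'unknown'; in this toy, an unknown cost makes the
        # generic plan get tried once the custom plans are counted.
        self.generic_cost = -1.0

    def record_custom_plan(self, cost):
        self.num_custom_plans += 1
        self.total_custom_cost += cost

    def choose_custom_plan(self, mode="auto"):
        if mode == "force_custom_plan":
            return True
        if mode == "force_generic_plan":
            return False
        if self.num_custom_plans < 5:
            return True  # still collecting custom plan costs
        avg_custom_cost = self.total_custom_cost / self.num_custom_plans
        return not (self.generic_cost < avg_custom_cost)

    def invalidate(self, new_generic_cost, reset_custom_stats=False):
        # The generic plan is rebuilt and re-costed on invalidation.
        self.generic_cost = new_generic_cost
        if reset_custom_stats:
            # The proposal: forget the stale custom plan costs as well.
            self.num_custom_plans = 0
            self.total_custom_cost = 0.0


if __name__ == "__main__":
    stmt = CachedStatement()
    for _ in range(5):
        stmt.record_custom_plan(100.0)   # costs under the old statistics
    stmt.invalidate(new_generic_cost=90.0)
    print("custom?", stmt.choose_custom_plan())
    stmt.invalidate(new_generic_cost=90.0, reset_custom_stats=True)
    print("custom after reset?", stmt.choose_custom_plan())
```

Without the reset, the stale average of 100 keeps the generic plan in use; with the reset, the statement goes through a fresh round of custom plans and can discover that they have become much cheaper.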

The second option is somewhat more controversial: we could introduce a new 'referenced' mode for the generic plan creation process. This mode would use current constants as reference values for the planner. While it may not offer any fundamental advantages, it would provide users with a familiar tool for influencing the query plan, especially for those migrating from SQL Server.

As usual, it makes sense to implement an in-core 'plan switching hook' to leverage the plan switching method within an extension.

If we extend our coding options into the enterprise domain, we can explore more sophisticated plan-switching techniques. For instance, we could track statistics on the planning and execution time for each plan, compare their relative weight with cost values, and make decisions about replanning or even forcing a specific type of plan. An even better alternative could be to use a more stable parameter, such as the number of pages read.

To be more objective, you can check the project, which includes a draft for an automated system to manage plan types, as well as another branch outlining a draft for switching between forced modes.

Have you ever faced issues when using generic plans? Does it make sense to develop a comprehensive system for switching plans, or is it enough to implement an extension that enables each specific prepared statement to monitor its state and update it manually using SQL tools like pg_stat_statements?

References

Hackers' mailing lists threads:

Avoiding bad prepared-statement plans, 2010-02
Restructuring plancache.c API, 2010-11
One-Shot Plans, 2011-06
Transient plans versus the SPI API, 2011-08
why do we need two snapshots per query?, 2011-11
dynamic SQL - possible performance regression in 9.2, 2012-12
PoC plpgsql - possibility to force custom or generic plan, 2017-01
The logic behind comparing generic vs. custom plan costs, 2025-03
inefficient/wrong plan cache mode selection for queries with partitioned tables (postgresql 17), 2025-05

Main commits:

b9527e9 - first attempt at the feature's design, 2007-03
e6faf91 - custom plans introduction, 2011-09
94afbd5 - one-shot entries, 2013-01
2aac339 - more sophisticated planning cost model, 2013-09
f7cb284 - plan_cache_mode setting, 2018-07

THE END.
June 29, 2025. Madrid, Spain.

On expressions' reordering in Postgres

Andrei Lepikhov — Tue, 22 Apr 2025 11:08:29 GMT

Today, I would like to discuss additional techniques to speed up query execution. Specifically, I will focus on rearranging conditions in filter expressions, JOINs, HAVING clauses, and similar constructs. The main idea is that if you encounter a negative result in one condition within a series of expressions connected by the AND operator, or a positive result in one of the conditions linked by the OR operator, you can avoid evaluating the remaining conditions. This can save computing resources. Below, I will explain how much execution effort this approach saves and how to implement it effectively.

Occasionally, you may come across queries featuring complex filters similar to the following:

SELECT * FROM table
WHERE
  date > min_date AND
  date < now() - interval '1 day' AND
  value IN Subplan AND
  id = 42;

In practice, simply rearranging the order of conditions in such an expression can speed up query execution, sometimes quite notably. Why? Each individual operation costs little. However, if it is performed for each of millions of table rows, the price becomes palpable, especially once other problems, such as getting the table blocks into shared buffers, have been solved.

This effect is particularly evident on wide tables that contain many variable-length columns. For instance, I often encounter IndexScans that become slow when the field used for additional filtering is located somewhere around the 20th (!) position in a table containing many variable-width columns. Accessing this field requires calculating its offset from the beginning of the row, which takes up processor time and slows down the execution.
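The effect of column position can be illustrated with a toy row format in Python. This is not the Postgres on-disk layout: each value here is simply stored as a one-byte length header followed by its bytes, and the helper names pack_row and read_column are made up for the sketch. The point is only that, with variable-length fields, reaching column N means walking over every field before it.

```python
import struct


def pack_row(values):
    """Encode each value as a 1-byte length header followed by its bytes."""
    out = bytearray()
    for value in values:
        data = value.encode()
        out += struct.pack("B", len(data)) + data
    return bytes(out)


def read_column(row, index):
    """Return (value, headers_inspected) for the column at `index`."""
    offset = 0
    for position in range(index + 1):
        length = row[offset]
        if position == index:
            value = row[offset + 1:offset + 1 + length].decode()
            return value, position + 1
        # The offset of the next field is only known after reading this one.
        offset += 1 + length
    raise IndexError(index)


if __name__ == "__main__":
    row = pack_row([f"value-{i}" * (i % 3 + 1) for i in range(20)])
    print(read_column(row, 0))    # first column: one header inspected
    print(read_column(row, 19))   # 20th column: twenty headers inspected
```

The work grows linearly with the column position, and it is repeated for every row the filter has to examine.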

The PostgreSQL community has already addressed this issue, as observed in the code. In 2002, commit 3779f7f, which was added by T. Lane, reorganised the clauses by positioning all clauses containing subplans at the end of the clause list (see order_qual_clauses). This change was logical because the cost of evaluating a subplan can depend on the parameters passed to it, introducing an additional source of error.

In 2007, this approach evolved with the commit 5a7471c, which established that the sorting of clauses would be performed exclusively in ascending order based on the cost parameter. This logic has remained in place to the present day, except for a minor modification in commit 215b43c, which required controlling the order of expression evaluation in each query plan node due to changes in the Row-Level Security (RLS) code.

Now, let’s take a look at what we have in the upstream as of today:

CREATE TABLE test (
  x integer, y numeric,
  w timestamp DEFAULT CURRENT_TIMESTAMP, z integer);
INSERT INTO test (x,y)
  SELECT gs,gs FROM generate_series(1,1E3) AS gs;
VACUUM ANALYZE test;

EXPLAIN (COSTS ON)
SELECT * FROM test
WHERE
  z > 0 AND
  w > now() AND
  x < (SELECT avg(y)
    FROM generate_series(1,1E2) y WHERE y%2 = x%3) AND
  x NOT IN (SELECT avg(y)
    FROM generate_series(1,1E2) y OFFSET 0) AND
  w IS NOT NULL AND
  x = 42;

Looking into the filter of this SELECT, we see the following sequence of conditions:

Filter: ((w IS NOT NULL) AND (z > 0) AND
         (x = 42) AND (w > now()) AND
         ((x)::numeric = (InitPlan 2).col1) AND
         ((x)::numeric < (SubPlan 1)))

During the execution of the query, they will be calculated in strict sequence from left to right. The operator costs are as follows for reference:

"z > 0" - 0.0025
"w > now()" - 0.005
"x < SubPlan 1" - 2.0225
"x NOT IN SubPlan 2" - 0.005
"w IS NOT NULL" - 0.0
"x = 42" - 0.0025

This order appears quite logical. However, you may be wondering what can be improved here.

There are at least two straightforward opportunities for enhancement. First, you can assign a small cost to the ordinal position of each column involved in the expression. In simple terms, the further a column is to the right in a table row, the more expensive it is to evaluate. The cost should not be excessively high; it merely needs to signal to the optimiser that the expression x=42 is cheaper to evaluate than z>0, assuming all other factors are equal.

You may argue that this is specific to the current Postgres row-based storage. That is true, but this is the type of storage we use most often, isn't it? Moreover, it would make sense for storage to provide its own cost model.

The second standard pattern relates to pairs of expressions with approximately the same cost. For instance, consider x=42 and z<50. Clearly, the second expression is less selective and should be placed in the second position. Since the expression x=42 will be true in fewer cases, there will be less need to evaluate subsequent conditions further down the list.
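The reasoning can be checked with a short Python sketch of the expected per-row cost of an AND chain: each clause is paid for only if all clauses before it were true. Ordering clauses by cost / (1 - selectivity) is the classic textbook rule for this problem; it is shown here as an illustration and is not necessarily what the patch does. The selectivity values and the function names are assumptions made for the example.

```python
def expected_cost(clauses):
    """Expected per-row cost of evaluating AND-ed clauses left to right.

    Each clause is a (name, cost, selectivity) tuple, where selectivity
    is the probability that the clause evaluates to true.
    """
    total = 0.0
    reach_probability = 1.0
    for _name, cost, selectivity in clauses:
        total += reach_probability * cost
        reach_probability *= selectivity
    return total


def order_clauses(clauses):
    """Cheap and highly selective clauses first (rank = cost / (1 - sel))."""
    def rank(clause):
        _name, cost, selectivity = clause
        if selectivity >= 1.0:
            return float("inf")  # never filters anything: evaluate last
        return cost / (1.0 - selectivity)
    return sorted(clauses, key=rank)


if __name__ == "__main__":
    clauses = [
        ("z < 50", 0.0025, 0.5),     # assumed selectivity
        ("x = 42", 0.0025, 0.001),   # assumed selectivity
    ]
    print(expected_cost(clauses))
    print(expected_cost(order_clauses(clauses)))
```

With equal costs, the rule reduces to "most selective first", which is exactly the x=42 before z<50 ordering described above.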

Now, let's assess the potential impact of these optimisations. Is it worth the effort? To illustrate, we can create a table where a pair of columns has the same selectivity but is positioned far apart, while another pair is located next to each other but has different selectivity.

CREATE TEMP TABLE test_2 (x1 numeric, x2 numeric,
  x3 numeric, x4 numeric);
INSERT INTO test_2 (x1,x2,x3,x4)
  SELECT x,(x::integer)%2,(x::integer)%100,x FROM
    (SELECT random()*1E7 FROM generate_series(1,1E7) AS x) AS q(x);
ANALYZE;

Let's examine the performance impact of searching for a value in a relatively "wide" row. Columns x1 and x4 are identical in every way, except that the position of the value in the column x1 is known in advance. In contrast, the position of the value in the column x4 needs to be calculated for each row.

EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF)
SELECT * FROM test_2 WHERE x1 = 42 AND x4 = 42;
EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF)
SELECT * FROM test_2 WHERE x4 = 42 AND x1 = 42;

/*
 Seq Scan on test_2  (actual rows=0.00 loops=1)
   Filter: ((x1 = '42'::numeric) AND (x4 = '42'::numeric))
   Buffers: local read=94357
 Execution Time: 2372.032 ms

 Seq Scan on test_2  (actual rows=0.00 loops=1)
   Filter: ((x4 = '42'::numeric) AND (x1 = '42'::numeric))
   Buffers: local read=94357
 Execution Time: 2413.633 ms
*/

It turns out that, all other factors being equal, even a relatively short tuple can have an effect of about 2-3%. This impact is quite comparable to the typical benefits gained from using Just-In-Time (JIT) compilation. Now, let's consider the influence of selectivity. The columns x1 and x2 are positioned next to each other. The key difference is that the values in x1 are almost unique, whereas x2 contains mostly duplicated values.

EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF)
SELECT * FROM test_2 WHERE x2 = 1 AND x1 = 42;
EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF)
SELECT * FROM test_2 WHERE x1 = 42 AND x2 = 1;
/*
 Seq Scan on test_2  (actual rows=0.00 loops=1)
   Filter: ((x2 = '1'::numeric) AND (x1 = '42'::numeric))
   Buffers: local read=74596
 Execution Time: 2363.903 ms

 Seq Scan on test_2  (actual rows=0.00 loops=1)
   Filter: ((x1 = '42'::numeric) AND (x2 = '1'::numeric))
   Buffers: local read=74596
 Execution Time: 2034.873 ms
*/

It seems we have achieved a speedup of roughly 14% (from about 2364 ms down to about 2035 ms).

It turns out that if we accept that the effect can accumulate throughout the plan tree, which may contain multiple scanning operators as well as joins, each contributing a particular percentage, then this technique is worthwhile to implement overall, especially considering the minimal overhead in the planning phase.

Let's proceed with the implementation and observe its effects. Creating it as an extension doesn't seem practical, as there is currently no hook that allows for operations during the creation of the plan. As for me, the necessity of introducing a create_plan_hook within the create_plan() routine is becoming increasingly evident: We may let extensions transfer some data from the optimisation stage to the plan, as well as do some additional plan enhancements (which may fit a specific load), like proposed here. However, this topic has yet to be discussed within the PostgreSQL community.

If this feature is implemented as a patch, modifications will be needed in two areas of the code: cost_qual_eval(), where the cost of expressions is evaluated, and order_qual_clauses(), which defines the sorting rules for expressions. As usual, the code can be found on GitHub in the designated branch.

Running the aforementioned examples on this branch will demonstrate that the expressions are ordered more optimally, considering column order and selectivity. Additionally, no significant overhead is observed.

Do you think it makes sense to pursue such micro-optimisations, or should we aim for broader improvements? Have you encountered similar issues? Please share your thoughts in the comments.

THE END
April 19, 2025. Madrid, Spain.

Boosting Postgres' EXPLAIN

Andrei Lepikhov — Sat, 12 Apr 2025 11:48:08 GMT

Shortly before the code freeze for PostgreSQL 18, Robert Haas added a feature that allows external modules to provide additional information to the EXPLAIN command.

This was a long-awaited feature for me. For an extension that influences the query planning process, providing users with notes on how the extension has affected the plan makes perfect sense. Instead of merely writing to a log file - access to which is often restricted by security policies - this information may be made available through the EXPLAIN command.

The feature introduced many entities that are not easy to figure out: an EXPLAIN option registration routine (RegisterExtensionExplainOption), an explain extension ID, per plan/node hooks, and an option handler.

The pg_overexplain extension, introduced with this feature to demonstrate how it works, seems a little messy and impractical to me, at least in its current state. So, I decided to find out how flexible this new technique is and demonstrate the opportunities opening up to developers with a more meaningful example. I have modified the freely available pg_index_stats extension and added information about the statistics used in the query planning process.

The STAT parameter was added to the list of EXPLAIN options, accepting Boolean ON/OFF values. If it is enabled, information about the statistics used is inserted at the end of the EXPLAIN: the presence of MCV, histogram, and the number of elements in them, as well as the values of stadistinct, stanullfrac, and stawidth.

You might wonder why this is necessary. After all, doesn't the set of statistics directly stem from the list of expressions in the query? Isn't it possible to identify which statistics were utilised by examining the cost-model code for a particular type of expression?

While it is indeed possible, this approach is not always sufficient. We understand the algorithms, but we typically do not have access to the underlying data. As a result, we cannot accurately determine which specific statistics are present in pg_statistic for a given column, nor can we know what information was available to the backend at the time of estimation.

Let's look at the example below:

CREATE TABLE sc_a(x integer, y text);
INSERT INTO sc_a(x,y) (
  SELECT gs, 'abc' || gs%10 FROM generate_series(1,100) AS gs);
VACUUM ANALYSE sc_a;
LOAD 'pg_index_stats';

EXPLAIN (COSTS OFF, STAT ON)
SELECT * FROM sc_a s1 JOIN sc_a s2 ON true
WHERE s1.x=1 AND s2.y LIKE 'a';

Explain, boosted by the pg_index_stats extension, looks like the following:

 Nested Loop
   ->  Seq Scan on sc_a s1
         Filter: (x = 1)
   ->  Seq Scan on sc_a s2
         Filter: (y ~~ 'a'::text)
 Statistics:
   "s2.y: 1 times, stats: { MCV: 10 values, Correlation,
          ndistinct: 10.0000, nullfrac: 0.0000, width: 5 }
   "s1.x: 1 times, stats: { Histogram: 0 values, Correlation,
          ndistinct: -1.0000, nullfrac: 0.0000, width: 4 }

Here, you can see that the statistics for the s1.x and s2.y columns were used. We can't detect which statistic type was actually used and how often - it is buried too deeply in the core - but we may still detect some issues:

First, we have only ten MCV values for s2.y, and there are no MCV statistics for s1.x at all; the histogram seems to be there, but it is of zero length. No nulls are expected in either column.

Thus, we have some helpful information that can hint at the optimiser's plan selection logic. Considering that a client who cannot provide data can very rarely give a dump of the pg_statistic table, such relatively harmless information can be a helpful aid and reveal possible causes of problems with the query plan selection. As a bare minimum benefit, users often forget to increase the default_statistics_target parameter (sample size) on massive tables, and this information provides a quick insight into that issue.

The extension utilises the get_relation_stats_hook to track the statistics used. It would also be useful to know whether extended statistics are used in planning, but that logic is still buried too deep in the core, and the current set of hooks will not help here.

Finally, I would like to know what applications you see for expanding the EXPLAIN output. Regarding the example above, how harmless is even such limited information really?

THE END

April 12, 2025. Nikola Tesla Airport, Serbia.

Automated Management of Extended Statistics in PostgreSQL

Andrei Lepikhov — Sun, 09 Mar 2025 15:34:39 GMT

Here, I am describing the results of a Postgres extension I developed out of curiosity. This extension focuses on the automatic management of extended statistics for table columns. The idea originated while I was finishing another "smart" query-driven project aimed at enhancing the quality of Postgres query planning. I realised that Postgres is not yet well enough equipped to detect poor query plans and adjust them fully autonomously. Therefore, it might be beneficial to approach the problem from a different angle and create an autonomous, data-driven helper.

What is extended statistics?

The extended statistics tool allows you to tell Postgres that additional statistics should be collected for a particular set of table columns. Why is this necessary? I will try to quickly explain using the example of an open power plant database. For example, the fuel type (primary_fuel) used by a power plant is implicitly associated with the country's name. Therefore, when executing a simple query:

SELECT count(*) FROM power_plants
WHERE country = '' AND primary_fuel = 'Solar';

we see that this number is zero for Norway and 243 for Spain. This is apparent to us since it is defined by latitude, but the DBMS does not know this, and at the query planning stage, it incorrectly estimates the sample (row number): 93 for Norway and 253 for Spain. If the query turns out to be a little more complex and the estimated data are the input for a JOIN operator, this can lead to unfortunate consequences. The extended statistic calculates the joint distribution of values in columns and allows us to detect such dependencies.
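To make the source of such misestimation more tangible, here is a minimal sketch of the independence assumption, written in Python for brevity. The rows are toy data invented for illustration (they are not taken from the power plant database), and the function names are hypothetical. The sketch multiplies the per-column selectivities, as a planner without extended statistics would, and compares the result with the true number of matching rows.

```python
# Toy data, NOT the real power plant database: (country, primary_fuel) pairs.
rows = (
    [("NOR", "Hydro")] * 90 + [("NOR", "Wind")] * 10
    + [("ESP", "Solar")] * 60 + [("ESP", "Hydro")] * 40
)


def independent_estimate(rows, country, fuel):
    """Row estimate that treats the two columns as independent."""
    n = len(rows)
    sel_country = sum(1 for c, _ in rows if c == country) / n
    sel_fuel = sum(1 for _, f in rows if f == fuel) / n
    return n * sel_country * sel_fuel


def actual_count(rows, country, fuel):
    """True number of rows matching both conditions."""
    return sum(1 for r in rows if r == (country, fuel))


for country in ("NOR", "ESP"):
    estimate = independent_estimate(rows, country, "Solar")
    print(country, round(estimate), actual_count(rows, country, "Solar"))
```

On this toy data, both countries receive the same estimate, although one has no solar plants at all and the other has twice as many as estimated. A joint distribution over both columns is what removes this kind of error.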

In fact, there are worse situations in ORMs. In the power plant database example, this could be the joint use of conditions on the country and country_long fields. After reading their description, anyone understands that there is a direct correlation between these fields, and when the ORM groups by both of these fields, we get a significant error:

EXPLAIN (ANALYZE, COSTS ON, TIMING OFF, BUFFERS OFF, SUMMARY OFF)
SELECT country, country_long FROM power_plants
GROUP BY country, country_long;

 HashAggregate
  (rows=3494 width=16) (actual rows=167 loops=1)
   Group Key: country, country_long
   ->  Seq Scan on power_plants
        (rows=34936 width=16) (actual rows=34936 loops=1)

A human would never write such a query, but we live in the era of AI, and automatically generated queries are not uncommon. We will have to deal with this somehow.

And what about extended statistics? It allows us to define three types of statistics on a combination of columns (and/or expressions): Most Common Values (MCV), ndistinct and dependencies. In the case of scanning filters, MCV works best: if the combination of values that a query selects from the table often appears in this table, then the optimiser will get an accurate estimate. If we are looking for a rare combination (as in the case of solar power plants in Norway), having a rough estimate of the sample ntuples/ndistinct, we can refine it by throwing out everything that got into MCV.

In the case of the need to estimate the number of groups (operators GROUP BY, DISTINCT, IncrementalSort, Memoize, Hash Join), the optimiser's decision is very well supported by the ndistinct value per column combination.

Now, to see the impact of extended statistics on the optimiser's row estimation from the table, let's apply extended statistics to our case by running the commands:

CREATE STATISTICS ON country,primary_fuel
 FROM power_plants;
ANALYZE;

You may find that the queries above estimate row numbers much more accurately when selecting and grouping by these two fields. For instance, Norway is estimated to have one power plant, while Spain has 253. Just to be sure, you can verify this result using filters such as country = 'RUS' or country = 'AUT'. Although the table is not very large, the tool seems effective.

However, I rarely see extended statistics being used in practice. One possible reason for this may be the concern that running the ANALYZE command will take a significant amount of time. Yet, I believe the main issue lies in the complexity of diagnostics - specifically, knowing when and where to create these statistics.

Looking for a suitable statistics definition

Is there an empirical rule of thumb for determining where and what statistics to create? I have forged two such rules for myself:

No. 1: By Index Definition. If a DBA takes a risk by creating an index on a specific set of columns, they likely expect the DBMS to receive queries that filter on these columns frequently. Additionally, the execution time of these queries is probably critical, which serves as another reason for improving the quality of query plans. However, there isn't always a significant estimation error for filters on multiple columns, which is a drawback of this empirical approach – statistics may be generated unnecessarily. It's also possible that a point sample of data from the table is what's expected, which may diminish the impact of misestimating on a composite filter – does it really matter whether 1 or 5 rows are returned?

Due to these shortcomings, I developed Method No. 2 using actual query filter templates. In this method, the first step is to identify candidate queries based on two factors: the query's contribution to the database load (which can be measured using the pages-read criterion) and the presence of composite filter conditions in table scans. It would also be beneficial to consider only those instances where the actual cardinality of the table scan operator significantly deviates from the planned value.

This approach is more selective in choosing potential candidates for generating statistics, allowing for a significant reduction in the statistics collected. However, it raises some important questions:

When it comes to creating statistics, approach No. 1 provides a clear moment for generating them - at the time of the index creation. But what about approach No. 2? In this case, you must either rely on a timer to generate statistics, collecting queries in the interim, or manually trigger the command. The absence of a complex query that calculates bonuses at the end of the month (for the previous 29 days) does not mean that we shouldn't execute it within a reasonable timeframe on the thirtieth day. While such a query may contribute only a tiny amount to the overall load, the accountant may not appreciate waiting several hours for the results!

How to Clean Up a Set of Statistics. In the previous approach, we deleted the statistics along with the index. However, this situation is less straightforward now. For instance, if a problematic query suddenly stops occurring - perhaps because the sales season for a popular product has ended - it doesn't mean it won't return in a year. This uncertainty could create potential instability in the DBMS optimiser's operation.

Additionally, it's unclear how much the actual and planned row numbers should differ to be considered significant. Should this difference be two times, ten times, or even a hundred times?

With this in mind, I decided to first write code for the easy-to-implement approach No. 1. At the same time, for approach No. 2, I just plan to develop a recommender tool that, based on data of the pg_stat_statements extension and an analysis of the execution plans of queries, will suggest candidates for creating new statistics.

Extension Description

The concept behind this extension is straightforward (see the repository for details). First, we need a hook to collect the identifiers of objects created in the database, and I have chosen the object_access_hook for this purpose. Next, we need to determine an appropriate time to filter the list of objects, selecting only those that belong to relevant composite indexes. We can efficiently add a new statistics definition to the database using the ProcessUtility_hook, executing our code after a utility command is completed.

Extended statistics of the ndistinct and dependencies types are calculated for all possible combinations of columns. This leads to a rapid increase in computational complexity. For instance, with three columns, the number of ndistinct statistics is 4, and the number of dependencies is 9. However, these numbers rise dramatically with eight columns to 247 ndistinct statistics and 1016 dependencies. It's clear now why the PostgreSQL core strictly limits the number of statistical elements to 8.
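The growth quoted above can be reproduced with a few lines of arithmetic. The Python sketch below rests on an assumed counting rule: one ndistinct item per subset of at least two columns, and one dependency per such subset for each column that can play the role of the implied one. The function names are mine and are not part of the PostgreSQL code.

```python
from math import comb


def ndistinct_items(n):
    """One ndistinct value per subset of at least two of the n columns."""
    return sum(comb(n, k) for k in range(2, n + 1))


def dependency_items(n):
    """For each subset of k >= 2 columns, each of its k columns
    can be the implied (dependent) one."""
    return sum(k * comb(n, k) for k in range(2, n + 1))


for n in (3, 5, 8):
    print(n, ndistinct_items(n), dependency_items(n))
```

For three and eight columns, this rule gives exactly the figures mentioned in the text: 4 and 9, then 247 and 1016. For the five-column default limit of the extension, it gives 26 and 75.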

To prevent excessive load on the database, I introduced a parameter that limits the number of index elements included in the statistics definition (the columns_limit parameter) and another parameter that determines which types of statistics to include in this definition (the stattypes parameter). When these automatic statistics are created, an extra dependency is established on the index serving as the template. Consequently, the associated statistics are removed when the index is deleted.

An open question remains: Is it necessary to create a dependency from the extension to delete all created statistics when DROP EXTENSION is executed? The answer is unclear because the extension may also function as a simple module without requiring a CREATE EXTENSION call, thus potentially impacting all databases within the cluster simultaneously.

To distinguish between automatically generated statistics and those created by users, a comment object that includes the module's name and the statistics name is created. Additionally, we have introduced the functions pg_index_stats_remove and pg_index_stats_rebuild into the extension interface. These functions allow you to delete all statistics and regenerate them, which can be helpful if the data schema was established prior to loading the module or if the database parameters have changed.

A separate issue to address is the reduction of redundant statistics. Given that a database can have many indexes, a procedure has been developed to identify duplicates, aiming to decrease the computational costs of the ANALYZE command (see the pg_index_stats.compactify parameter).

For example, if an index is already defined as t(x1, x2), creating another index as t(x2, x1) would not require the creation of new statistics. A more complex scenario arises when an index t(x2, x1) is created in the presence of another index t(x1, x2, x3). In this case, the most common value (MCV) statistics must be created, as they would not be redundant, but the ndistinct and dependencies statistics can be disregarded.
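The redundancy rule from the two examples above can be sketched as a small decision function. This is only an illustration in Python under my own assumptions, not the code of the extension: the function name and the data layout (a list of column sets with their statistic kinds) are hypothetical.

```python
def needed_stat_types(index_columns, existing_stats,
                      wanted=("mcv", "ndistinct", "dependencies")):
    """Return the statistic types still worth creating for a new index.

    existing_stats is a list of (columns, kinds) pairs describing the
    extended statistics that are already defined on the table.
    """
    cols = frozenset(index_columns)
    needed = set(wanted)
    for stat_cols, kinds in existing_stats:
        stat_cols = frozenset(stat_cols)
        if stat_cols == cols:
            # Same column set in a different order: fully covered.
            needed -= set(kinds)
        elif cols < stat_cols:
            # A wider definition already includes every combination of
            # these columns for ndistinct and dependencies, but not MCV.
            needed -= set(kinds) & {"ndistinct", "dependencies"}
    return sorted(needed)


all_kinds = ("mcv", "ndistinct", "dependencies")
print(needed_stat_types(("x2", "x1"), [(("x1", "x2"), all_kinds)]))
print(needed_stat_types(("x2", "x1"), [(("x1", "x2", "x3"), all_kinds)]))
```

The first call returns an empty list, because the column set is identical; the second one returns only the MCV type, matching the scenario with the wider three-column index.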

Benchmarking

As usual, theory should be validated through practice, and code should be tested on meaningful data. I didn't have access to a ready-made, loaded PostgreSQL instance in either a test or production environment, so I found a stale dump of a database for testing purposes. This particular dump was noteworthy because it contained a large number of tables - about 10,000 - along with roughly three times as many indexes.

Additionally, composite indexes were heavily employed, with around 20,000 indexes containing more than one column. Notably, more than 1,000 of these indexes cover five or more columns. So, this database provides a suitable case for research, although it is unfortunate that no payload is available. The ANALYZE command on this database took 22 seconds to execute. However, when I installed the extension and used the default limit of five columns, the ANALYZE time increased to 55 seconds.

The table with raw data below illustrates the ANALYZE time (in seconds) based on the limit on the number of columns and the types of statistics collected.

It's clear that storing all possible combinations of columns significantly impacts analysis time, mainly when dependencies are involved. Therefore, we can limit our analysis to 3-5 columns in the statistics or consider adopting approach No. 2. I now understand why SQL Server created a separate worker for updating such statistics: this process can be pretty costly. What about reducing redundancy? Let's conduct another experiment:

SET pg_index_stats.columns_limit = 5;
SET pg_index_stats.stattypes = 'mcv, ndistinct, dependencies';
SET pg_index_stats.compactify = 'off';
SELECT pg_index_stats_rebuild();
ANALYZE;

SET pg_index_stats.compactify = 'on';
SELECT pg_index_stats_rebuild();
ANALYZE;

The following two queries are sufficient to check the amount of statistical data generated by the pg_index_stats extension:

-- Total number of stat items
SELECT sum(nelems) FROM (
  SELECT array_length(stxkind,1) AS nelems
  FROM pg_statistic_ext) AS s;

-- Total number of stat items grouped by stat type
SELECT elem, count(elem) FROM (
 SELECT unnest(stxkind) elem FROM pg_statistic_ext
) AS kinds
GROUP BY elem;

The first query shows the total number of extended statistics items in the database, and the second one - a breakdown by type. So, let's see what happens with and without compactifying:

The overall impact is modest: approximately a 15% improvement in processing time and a slightly larger reduction in the number of statistics. However, it does provide some protection against corner cases. Interestingly, the compactifying reduced the number of MCV statistics, suggesting that a significant number of indexes differ only in the order of their columns. Additionally, expression statistics, which we haven't discussed before, are generated automatically by the PostgreSQL core if the definition of extended statistics includes an expression. Although this may not pose a significant issue, it would be beneficial to have the ability to regulate this behaviour.

It's also worth comparing the analysis time to an alternative statistics collector called joinsel, which exists in the enterprise Postgres fork, provided by Postgres Professional LLC. It isn't a direct competitor to extended statistics and works differently. Based on the index definition, it creates a new composite type within the database, which is then used to generate regular statistics stored in pg_statistic. The advantages of joinsel include MCV and a histogram, which allows for evaluating range filters while leveraging standard PostgreSQL clause estimation techniques. However, it does have some drawbacks, such as a lack of dependency statistics and only one ndistinct value for the entire composite type (a limitation that can be addressed).

Now, let's look at how quickly the ANALYZE command is executed with joinsel.

SET enable_compound_index_stats = 'on';
SELECT pg_index_stats_remove();
\timing on
ANALYZE;
Time: 41248.977 ms (00:41.249)

ANALYZE time has increased as expected compared to regular Postgres statistics, but only by a factor of two, which is a reasonable compromise. The main advantage here is that you don't have to worry about the number of columns in the index - the complexity will increase linearly.

Conclusion

The general conclusion regarding Approach No. 1 is that it can be viable, provided we exercise caution and carefully manage the limits.

Additionally, we should enhance the extended statistics in the core. It would be nice to have the possibility of a more significant impact on this tool, allowing us to reduce or expand the volume of generated statistical data.

As for the helper and Approach No. 2, I have decided to postpone it for now. If anyone is enthusiastic and has plenty of free time and patience, feel free to reach out. I would be happy to provide guidance!

THE END.
March 9, 2025, Madrid, Spain.

Does Postgres need an extension of the ANALYZE command?

Andrei Lepikhov — Tue, 04 Feb 2025 02:59:14 GMT

In this post, I would like to discuss the stability of standard Postgres statistics (distinct, MCV, and histogram over a table column) and introduce an idea for one more extension - an alternative to the ANALYZE command.

My interest in this topic began while wrapping up my previous article when I noticed something unusual: the results of executing the same Join Order Benchmark (JOB) query across a series of consecutive runs could differ by several times and even orders of magnitude - both in the value of the execution-time and in pages-read.

This was puzzling, as all variables remained constant - the test script, laptop, settings, and even the weather outside were the same. This prompted me to investigate the cause of these discrepancies.

In my primary activity, which is highly connected to query plan optimisation, I frequently employ JOB to assess the impact of my features on the planner. At a minimum, this practice enables me to identify shortcomings and ensure that there hasn't been any degradation in the quality of the query plans produced by the optimiser. Therefore, benchmark stability is crucial, making the time spent analysing the issue worthwhile. After briefly examining the benchmark methodology, I identified the source of the instability: the ANALYZE command.

In PostgreSQL, statistics are computed using basic techniques like Random Sampling with Reservoir, calculating the number of distinct values (ndistinct), and employing HyperLogLog for streaming statistics - for instance, to compute distinct values in batches during aggregation or to decide whether to use abbreviated keys for optimisation. Given that the nature of statistics calculation is stochastic, fluctuations are expected. However, the test instability raises the following questions: How significant are these variations? How can they be minimised? And what impact do they have on query plans? Most importantly, how can we accurately compare benchmark results when such substantial deviations are present, even in the baseline case?
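To get a feeling for why sample-based statistics fluctuate, here is a minimal sketch of textbook reservoir sampling (Algorithm R) in Python. It is deliberately simpler than what PostgreSQL does in ANALYZE, and the synthetic skewed column is an assumption made only for this illustration. The sketch draws several samples from the same data and counts the distinct values seen in each of them.

```python
import random


def reservoir_sample(stream, k, rng):
    """Algorithm R: keep a uniform random sample of k items from a stream."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds are inclusive
            if j < k:
                sample[j] = item
    return sample


# Synthetic column with a skewed value distribution (illustration only).
data_rng = random.Random(42)
column = [int(data_rng.paretovariate(1.2)) for _ in range(100_000)]

for seed in range(3):
    sample = reservoir_sample(column, 3000, random.Random(seed))
    print(seed, len(set(sample)))
```

Each seed plays the role of one ANALYZE run: the data stays the same, but the number of distinct values observed in the sample, which feeds any ndistinct estimate, may differ from run to run.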

Is it possible to achieve query plan stability?

Well, I thought: the tables are massive, and there are a lot of rows inside - let's just increase the value of the default_statistics_target parameter, and that will solve the problem, right? I gradually increased the sample size for statistics from 100 to 10000, re-running all benchmark queries and recording how the pages-read criterion behaves:

statistics_target = 100

statistics_target = 1000

statistics_target = 10000

Even with the highest level of statistical detail, query plans still change after re-executing the ANALYZE command. While increasing the sample size from 100 to 10,000 may improve something, this does not fundamentally alter the situation. Such significant instabilities call into question the possibility of independently reproducing benchmarks and conducting comparative analyses of results without examining the query plans.

Now, let's dig deeper: what generally fluctuates? Using a simple script, I conducted an experiment to collect scalar statistics (stanullfrac, stawidth, stadistinct) by rebuilding the statistics ten times. The script for this operation might look like this:

CREATE TABLE test_res(expnum integer, oid Oid, relname name,
                      attname name, stadistinct real,
                      stanullfrac real, stawidth integer);
DO $$
DECLARE
    i integer;
BEGIN
  TRUNCATE test_res;
  FOR i IN 0..9 LOOP
    INSERT INTO test_res (expnum,oid,relname,attname,stadistinct,stanullfrac,stawidth)
      SELECT i, c.oid, c.relname, a.attname, s.stadistinct, s.stanullfrac, s.stawidth
      FROM pg_statistic s, pg_class c, pg_attribute a
      WHERE c.oid >= 16385 AND c.oid = a.attrelid AND
      s.starelid = c.oid AND a.attnum = s.staattnum;
    ANALYZE;
  END LOOP;
END; $$;

Afterwards, I analysed the results with a query like the following:

WITH changed AS (
  SELECT
    relname,attname,
    (abs(max - min) / avg * 100)::integer AS res, avg
  FROM (
    SELECT relname,attname,
      max(stadistinct) AS max, min(stadistinct) AS min,
      avg(stadistinct) AS avg
    FROM test_res
    WHERE relname <> 'test_res'
    GROUP BY relname,attname
    ) AS agg
  WHERE max - min > 0 AND abs(max - min) / avg > 0.01
) SELECT relname,attname, res AS "stadistinct dispersion", avg::integer
  FROM changed t
  ORDER BY relname,attname;

Something more mathematically correct could be used here, but for our purposes, such a simple criterion is enough to see the instability of statistics. Using the above scripts, let's see how ndistinct fluctuates for 100, 1000 and 10000 elements of the statistical sample:

Stochastic, right? There's a lot of strange stuff here, but we won't dig too deep, attributing it to the fact that determining the ndistinct value from a small sample does not converge uniformly to the actual value. Yes, the fluctuations subside with the sample size, but 10,000 is already the limit, and the table sizes in this benchmark are not that big - in real life, statistics can remain unstable on much larger tables.

Another observation from the results of this experiment is that statistics on fields with a large number of duplicates suffer the most. In practice, this means that grouping or joining on some harmless "Status" field can cause immense estimation errors inside the optimiser, even if the expression on this field is only part of a long list of expressions in the query.

What exactly is the origin of instability?

However, what exactly causes the query plans to fluctuate in this benchmark? To find out, we need to look at the query plans. One query consistently changes its plan from iteration to iteration - 30c.sql, which makes it convenient to analyse. Comparing the query plans, we can see that HashJoin and the parameterised NestLoop compete with very close estimates (see here and here). Using the "close look" method, I found that the discrepancies in the estimates begin already at the SeqScan stage and then diverge throughout the entire plan:

->  Parallel Seq Scan on public.cast_info
    (cost=0.00..498779.41 rows=518229 width=42)
    Output: id, person_id, movie_id, person_role_id,
            note, nr_order, role_id
    Filter: (cast_info.note = ANY ('{(writer),
          "(head writer)","(written by)",(story),"(story editor)"}'))

->  Parallel Seq Scan on public.cast_info
    (cost=0.00..498779.41 rows=520388 width=42)
    Output: id, person_id, movie_id, person_role_id,
            note, nr_order, role_id
    Filter: (ci.note = ANY ('{(writer),
          "(head writer)","(written by)",(story),"(story editor)"}'))

There is a slight difference in the estimation of the number of rows selected from the table. Let's dig deeper and see why.

It is not easy to compare histograms or MCV, so let's just study our specific query or, more precisely, the problematic scan operator. The estimation of x = ANY (...) occurs by estimating each individual expression x = Ni and then adding up the probabilities. In our case, all five Ni constants are included in the MCV statistics - which means that even the ndistinct value will not be used by Postgres. Thus, the estimation should be as accurate and stable as possible. However, if you dig into the numbers, you can see that after the ANALYZE, the frequency of each of the sample elements changes. For example, for default_statistics_target=10000:

The final estimation changes by about 1%, which is not too much in principle. However, who knows - maybe we didn't hit a horrid probability? In addition, the error accumulates when planning the higher-level operators of the query tree, ultimately changing the plan of this query.
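The summation rule described above for x = ANY (...) can be sketched in a few lines of Python. Everything numeric here is hypothetical: the MCV frequencies for the two ANALYZE runs and the table size are invented for illustration and are not the measured values from the benchmark. The sketch covers only the case discussed in the text, where every constant is present in the MCV list.

```python
def estimate_any_rows(mcv_freqs, constants, reltuples):
    """Sum the MCV frequencies of the listed constants, scale by row count."""
    selectivity = sum(mcv_freqs[c] for c in constants)
    selectivity = min(selectivity, 1.0)  # a selectivity cannot exceed 1
    return round(reltuples * selectivity)


consts = ["(writer)", "(head writer)", "(written by)",
          "(story)", "(story editor)"]

# Hypothetical MCV frequencies after two ANALYZE runs (not measured values).
run_a = dict(zip(consts, [0.0130, 0.0040, 0.0100, 0.0060, 0.0010]))
run_b = dict(zip(consts, [0.0132, 0.0039, 0.0101, 0.0061, 0.0010]))
reltuples = 36_000_000  # hypothetical table size

rows_a = estimate_any_rows(run_a, consts, reltuples)
rows_b = estimate_any_rows(run_b, consts, reltuples)
print(rows_a, rows_b, f"{abs(rows_b - rows_a) / rows_a:.2%}")
```

Shifts in the fourth decimal place of individual frequencies are enough to move the scan estimate by about one per cent, and the operators above the scan inherit this difference.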

In general, this corresponds to the theory that the number of distinct values in a set can be calculated with reasonable accuracy only by analysing almost all the table rows. Moreover, the issue with statistics has already been reported to the community [here and here, for example], and an attempt was made to solve the problem. However, in any case, this runs up against the cost of a large statistical sample and is not universally applicable - and therefore not suitable for the PostgreSQL core.

However, for analytical tasks, where the size and denormalisation of the scheme impose increased requirements on the quality of statistics, one computationally expensive pass through the table's rows may make sense. In addition, the data is loaded rarely in large batches, and statistical collection is not required often. So maybe we just probe the approach involving the extension mechanism?

An extension for standard statistics

How to design it? Let's recall the article by DeWitt1998 - the author suggested computing statistics by attaching the computation to sequential table scans. Postgres has a CustomScan node mechanism that can be inserted into any part of the query plan and implemented with an arbitrary complexity of the operation. Therefore, such an idea is easy to implement. Also, nothing prevents you from using an extension to add a new function to the UI that will go through the entire table and calculate at least ndistinct and MCV with maximum accuracy.

Having standard "lightweight" ndistinct statistics, you can assess how expensive and feasible such an analysis procedure will be in terms of computing resources before deciding to launch it.

Feeding such refined statistics to the optimiser can be implemented by employing two hooks: get_relation_stats_hook and get_index_stats_hook. They allow replacing the standard statistics obtained from pg_statistic with an alternative, as long as it corresponds to the internal Postgres tuple format. The second hook is exciting because it can be used to implement not complete statistics but predicate statistics - that is, statistics based on data selected from a table with a particular filter - an analogue of SQL Server's CREATE STATISTICS ... WHERE ... .

How to store statistics? For the optimiser to be able to use them correctly, it is evident that their storage format must correspond to the format of storing standard statistics. Nothing is complicated about this since the extension can create its tables - so why not just create table pg_statistic_extra?

Additional bonuses from creating such an extension may emerge entirely unexpectedly. For example, while writing this post, I realised that it could help solve the dilemma of statistics on a partitioned table: no one likes to spend resources on it since it duplicates the work of calculating statistics for each individual partition. At the same time, it does not change much over time: there are many partitions, and if the data is well spread out, then the statistics for all data change insignificantly (except, perhaps, for the partition key). In addition, the table can be huge, and things like ndistinct in standard statistics can be far from reality. With the help of the extension, you can implement the creation of detailed statistics by event: attach/detach partition, at the time of massive updates, etc., which can allow you to launch a computationally expensive operation more consciously. Also, knowing that the table changes rarely, it will be possible to radically simplify the statistics calculation by using indexes or implementing other sampling algorithms (for example, this one)...

That's actually the whole roadmap. All that's left is to find a passionate student, and you can apply with a project to GSoC ;). What do you think about such an extension? Write in the comments.

THE END.

February 2, 2025, South Pattaya, Thailand.

Whose optimisation is better?

Andrei Lepikhov — Sat, 18 Jan 2025 15:37:54 GMT

It happened one long and warm Thai evening when I read another paper about the re-optimisation technique in which the authors used Postgres as a base for implementation. Since I had nearly finished with the WIP patch aiming to do the same stuff in the Postgres fork, I immediately began comparing our algorithms using the paper's experimental data as a reference. However, I quickly realised that neither my code nor even the standard Postgres instance bore any resemblance to the paper's figures.

The execution time measurements they provided differed significantly due to unclear details regarding the experimental setup and instance settings. I had often encountered research reports that were almost impossible to reproduce, and the current case led me to discover how we could compare query plans and optimisation effectiveness using dimensionless criteria.

From a practical point of view, the DBMS that produces a higher TPS is more efficient. However, sometimes, we need to design a system that does not yet exist or make a behaviour forecast for loads that have not yet arrived. In this case, we need a parameter to analyse a query plan or compare a pair of plans qualitatively. This post discusses one such parameter - the number of data pages read.

It hardly needs to be said that the 'performance evaluation' section of research is crucial for applied software developers, as it justifies the time spent reading the preceding text. This section must also ensure the repeatability of results and allow for independent analysis. For instance, a similarity theory has been developed in fields like hydrodynamics and heat engineering that enables researchers to present experimental results in dimensionless quantities, such as the Nusselt, Prandtl, and Reynolds numbers. Researchers can reasonably compare the results obtained by reproducing experiments under slightly different conditions.

I have not yet seen anything like this in database systems. The section devoted to testing usually briefly describes the hardware and software parts and graphs. The main parameter under study is the query execution time or TPS (transactions-per-second).

This approach appears to be the only viable method when comparing different DBMSes and making decisions regarding what system to use in production. However, it's important to note that query execution time is influenced by multiple factors, including server settings, caching algorithms, the choice of query plan, and parallelism...

Let's consider the scenario where we are developing a new query optimisation method and want to compare its performance with a previously published method. We have graphs showing query execution times (see, for example, here or there), along with a brief description of our testing platform. However, we encounter discrepancies between our results and those from published studies due to multiple unknown factors. To address this, we need a measurable parameter that can eliminate the influence of other DBMS subsystems, making our analysis more portable and accessible. I believe that developers working, for example, on a new storage system would also appreciate the opportunity to remove the optimiser's impact from their benchmarks.

When attempting to reproduce the experiments described in articles or to compare my method with the one proposed by authors, I often find that the uncertainty of the commonly accepted measurement of execution time is too high to draw conclusive judgments. This measure primarily reflects the efficiency of the code under specific operating conditions rather than the quality of the discovered query plan. Execution time is a highly variable characteristic; even when running the same test consistently on the same machine and instance, there can be a significant variation in execution times.

For instance, I conducted ten consecutive runs of all 113 Join Order Benchmark (JOB) tests and observed a typical spread in execution time of up to 50% on my desktop (see the picture below), even under optimal conditions with all experiment parameters meticulously controlled. This raises a crucial question: how much deviation might an external researcher encounter if they attempt to repeat the experiment, and how should they analyse the results?

Variation in execution times for repeated JOB query executions.

One more concern is how to compare query plans executed with varying numbers of parallel workers. Using multiple workers on a test machine can yield positive results; however, parallelism can sometimes be counterproductive in a production environment with hundreds of competing backends. So, would it not be better to seek a more meaningful criterion for evaluation?

In my specific area of query optimisation, execution time often seems like a redundant metric. It may be more beneficial to adopt a more specific characteristic to compare different optimisation approaches or to assess the impact of a new transformation within the PostgreSQL optimiser. Such a metric should include only factors that the optimiser may consider during the planning process.

From the perspective of a DBMS, the primary operations involve data manipulation. Thus, it would be natural to select the number of operations performed on table rows during query execution, taking into account the number of attributes in each row. Minimising this parameter would indicate the efficiency of the chosen query plan. However, collecting such statistics can be challenging. Therefore, we should aim to identify a slightly less precise but more easily obtainable parameter.

For example, DBAs often use the number of pages read as a parameter. In this context, a page refers to a buffer cache page or a table data block stored on disk. It is unnecessary to differentiate between the pages that fit in the RAM buffer and those on disk, as this distinction provides redundant information that pertains more to the page eviction strategy and disk operation than to the optimal plan identified.

For our purposes, it is sufficient to sum these values mechanically. We also need to consider the pages from the temporary disk cache used by sorting, hashing, and other algorithms for placing rows that did not fit in memory. It is important to note that the same page may need to be counted twice. We access a page once during sequential row scanning to read its tuples. However, when rescanning, such as in an inner NestLoop join, we reread the data and must account for each page again.

PostgreSQL already has the necessary infrastructure for measuring the number of pages read, provided by the pg_stat_statements extension. My approach is as follows: before executing each benchmark query, I run the command SELECT pg_stat_statements_reset() and then retrieve the statistics using the following query:

SELECT
  shared_blks_hit+shared_blks_read+local_blks_hit+local_blks_read+
  temp_blks_read AS blocks, total_exec_time::integer AS exec_time
FROM pg_stat_statements
WHERE query NOT LIKE '%pg_stat_statements_reset%';
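To make the metric concrete, here is a minimal Python sketch of the bookkeeping around that query. It is not the script used for the experiments: the row dictionaries and all numbers in them are invented for illustration, and `pages_read` and `spread` are hypothetical helpers. Only the counter names are those of pg_stat_statements.

```python
# Counters summed by the query above; the names match pg_stat_statements columns.
BLOCK_COUNTERS = (
    "shared_blks_hit", "shared_blks_read",
    "local_blks_hit", "local_blks_read",
    "temp_blks_read",
)

def pages_read(row):
    """Sum the block counters of one pg_stat_statements row (a dict)."""
    return sum(row[name] for name in BLOCK_COUNTERS)

def spread(values):
    """Relative spread of a series: (max - min) / min."""
    lowest, highest = min(values), max(values)
    return (highest - lowest) / lowest

# Hypothetical results of three runs of the same query (made-up numbers).
runs = [
    {"shared_blks_hit": 900, "shared_blks_read": 100, "local_blks_hit": 0,
     "local_blks_read": 0, "temp_blks_read": 50, "exec_time": 120},
    {"shared_blks_hit": 950, "shared_blks_read": 52, "local_blks_hit": 0,
     "local_blks_read": 0, "temp_blks_read": 50, "exec_time": 150},
    {"shared_blks_hit": 1000, "shared_blks_read": 0, "local_blks_hit": 0,
     "local_blks_read": 0, "temp_blks_read": 50, "exec_time": 100},
]

pages = [pages_read(run) for run in runs]
times = [run["exec_time"] for run in runs]
print("pages spread:", spread(pages))
print("time spread: ", spread(times))
```

In this invented data set, the split between cache hits and disk reads differs from run to run, yet the sum barely moves, which is exactly why the two are added together.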

How reliable is this metric? In the same experiment mentioned above, all ten runs of the JOB test demonstrated negligible deviation in the number of pages for each query throughout the iterations:

There is a deviation of only a few pages, and while even a minor discrepancy like this should typically be examined, it seems to be more of an artifact resulting from service operations, such as accessing statistics and interactions among parallel workers.

What can we infer from this indicator? Let's conduct a simple experiment. We will use one test query (10a.sql) and sequentially increase the number of workers involved in processing this query. The graph below illustrates how the query execution time and the number of data pages read change as we adjust the number of workers.

Behaviour of execution time and number of pages with the increasing number of parallel workers

It is evident that while the query execution time may vary, the number of read data pages remains relatively constant. The number of pages only changes once when the number of workers increases from 1 to 2, resulting in a doubling of the read pages. An examination of the EXPLAIN output for these two cases reveals the reason behind this change: with 0 and 1 worker, out of six query joins, three were of the Nested Loop type, and three were Hash Joins. However, with two or more workers, the number of Nested Loop joins increases by one while the number of Hash Joins decreases. Thus, by analysing the number of read pages, we were able to identify a change in the query plan that was not apparent when considering execution time alone.

Now, let's explore the effect of the AQO (Adaptive Query Optimization) extension of the PostgreSQL optimiser on JOB test queries.

We will execute each test query with AQO ten times in 'learn' mode. In this mode, AQO functions as a planner memory, storing the cardinality of each plan node (as well as the number of groups in the corresponding operators) at the end of execution. This information is then used during the planning stage, allowing the optimiser to reject overly optimistic plans. Given that the PostgreSQL optimiser tends to underestimate join cardinalities, this approach appears quite reasonable. The figure below (shown on a logarithmic scale) illustrates how the number of pages read changed relative to the first iteration, during which the optimiser lacks information about the cardinalities of the query plan nodes, the number of distinct values in the columns, etc. By the tenth iteration, almost all queries either improved this metric or remained unchanged. For the unchanged queries, this suggests that either the PostgreSQL optimiser identified the best plan in the space of potential options right away or our technique did not have the desired effect.

There are still six degraded queries remaining, and the number of pages read for these queries has increased compared to the first iteration. It's possible that there weren't enough iterations to effectively filter out non-optimal query plans. Therefore, let's increase the number of execution iterations to 30 and observe the results.

The figure above illustrates that the query plans have converged toward an optimal solution. Notably, two queries (26b and 33b) show an increase in the number of pages read compared to the first iteration. At the same time, their execution time has improved by 15-20%.

The explanations for these observations are as follows: the number of Nested Loops in the query plan has decreased by one, and a Hash Join scans the entire table when constructing its hash table, which increases the number of pages read. At the same time, the parallel Hash Join proves to be more time-efficient, leading to better query execution times. This suggests that the number of pages read is not an absolute criterion for determining query optimality. Still, this criterion can help establish a starting point within a single DBMS, making experiments reproducible across different software and hardware environments. It can also aid in comparing various optimisation methods and identifying effects that may be masked by unstable execution times.

Therefore, it may not be advisable to disregard execution time when publishing benchmark results. However, should the number of pages read be included as well? Ultimately, by providing a graph showing the changes in the number of pages read during query execution, along with a test run script (see above) and a link to the raw data, one can independently reproduce the experiment, calibrate it against the published data, conduct additional studies, or compare it with other methods under similar conditions. Isn’t that convenient?

That's it for today. The primary goal of this post is to highlight the problem of reproducibility of results and to encourage objective analysis of new methods in the field of DBMS. Should we seek additional criteria for evaluating test results? How effective is the criterion of the number of pages read for this purpose? Can this criterion be adapted to compare different yet similar query plans regarding DBMS architecture? Is it possible to normalise this criterion relative to the average number of tuples per page? I welcome any opinions and comments on these questions.

THE END.

January 18th, 2025. Pattaya, Thailand.

Investigating Memoize's Boundaries

Andrei Lepikhov — Fri, 03 Jan 2025 14:01:59 GMT

During the New Year holiday week, I want to glance at one of Postgres' most robust features: the internal caching technique for query trees, also known as memoisation.

Introduced with commit 9eacee2 in 2021, the Memoize node fills the performance gap between HashJoin and parameterised NestLoop: having a couple of big tables, we sometimes need to join only minor row subsets from these tables. In that case, the parameterised NestLoop algorithm does the job much faster than HashJoin. However, the outer size is critical for performance and may cause NestLoop to be rejected just because of massive repetitive scans of inner input.

When predicting multiple duplicates in the outer column that participate as a parameter in the inner side of a join, the optimiser can insert a Memoize node. This node caches the results of the inner query subtree scan for each parameter value and reuses these results if the known value from the outer side reappears later.

This feature is highly beneficial. However, user migration reports indicate that there are still some cases in PostgreSQL where this feature does not apply, leading to significant drops in query performance. In this post, I will compare the caching methods for intermediate results in PostgreSQL and SQL Server.

Memoisation for SEMI/ANTI JOIN

Let me introduce a couple of tables:

DROP TABLE IF EXISTS t1,t2;
CREATE TABLE t1 (x integer);
INSERT INTO t1 (x)
  SELECT value % 10 FROM generate_series(1,1000) AS value;
CREATE TABLE t2 (x integer, y integer);
INSERT INTO t2 (x,y)
  SELECT value, value%100 FROM generate_series(1,100000) AS value;
CREATE INDEX t2_idx ON t2(x,y);
VACUUM ANALYZE t1,t2;

In Postgres, a simple join of these tables prefers parameterised NestLoop with memoisation:

EXPLAIN (COSTS OFF)
SELECT t1.* FROM t1 JOIN t2 ON (t1.x = t2.x);
/*
 Nested Loop
   ->  Seq Scan on t1
   ->  Memoize
         Cache Key: t1.x
         Cache Mode: logical
         ->  Index Scan using t2_idx on t2
               Index Cond: (x = t1.x)
*/

The smaller table t1 contains many duplicates in the column used for the JOIN, while the bigger one t2 contains almost unique values. It also has an index to extract necessary tuples effectively.
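To see why the duplicates in t1.x matter, here is a toy Python model of a parameterised nested loop with and without a cache on the inner side. It is only an illustration of the idea: the dictionary stands in for the index t2_idx, `nested_loop` is a hypothetical helper, and the sketch ignores the memory limit and cache eviction that the real Memoize node has to handle.

```python
from collections import defaultdict

# Same data shape as the tables above.
t1 = [value % 10 for value in range(1, 1001)]               # t1.x
t2 = [(value, value % 100) for value in range(1, 100001)]   # (t2.x, t2.y)

# Stand-in for the index t2_idx: x -> matching rows.
t2_by_x = defaultdict(list)
for x, y in t2:
    t2_by_x[x].append((x, y))

def nested_loop(outer, use_memoize):
    """Return (joined rows, number of inner index scans)."""
    cache = {}
    scans = 0
    result = []
    for x in outer:
        if use_memoize and x in cache:
            inner_rows = cache[x]             # cache hit: no rescan
        else:
            inner_rows = t2_by_x.get(x, [])   # parameterised inner scan
            scans += 1
            if use_memoize:
                cache[x] = inner_rows
        result.extend((x, row) for row in inner_rows)
    return result, scans

plain_rows, plain_scans = nested_loop(t1, use_memoize=False)
memo_rows, memo_scans = nested_loop(t1, use_memoize=True)
print("inner scans without cache:", plain_scans)
print("inner scans with cache:   ", memo_scans)
```

Both variants return the same rows; the cached one only touches the inner side once per distinct value of t1.x.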

Ok, it works for trivial joins. What about more complex forms, like SEMI JOIN? Look at the query:

EXPLAIN (COSTS OFF)
SELECT * FROM t1 WHERE x IN (SELECT y FROM t2 WHERE t1.x = t2.x);

/*
 Nested Loop Semi Join
   ->  Seq Scan on t1
   ->  Index Only Scan using t2_idx on t2
         Index Cond: ((x = t1.x) AND (y = t1.x))
         Filter: (x = y)
*/

Postgres can pull up the subquery and transform it into a join. But it doesn't add a Memoize node in that case. To compare, execute this query in SQL Server (use the OPTION (LOOP JOIN) hint to prevent a hash join):

SQL Server performs a similar optimisation and utilises a Spool node in the inner subtree of a join. This approach allows the results of scans on the inner side of the join to be cached. Interestingly, it doesn't have to cache individual tuples; it only needs to keep track of the existence of the NULL/NOT NULL result.

So, why hasn’t Postgres implemented Memoize for JOIN_SEMI? If you examine the code, you will find that this limit was introduced in the initial commit by David Rowley:

if (!extra->inner_unique && (jointype == JOIN_SEMI ||
                             jointype == JOIN_ANTI))
  return NULL;

In the case of a semi-join, the executor only requires the first tuple from the inner subtree to make its decision. This means that the Memoize cache will contain incomplete results, which is (I suppose) a source of concern for developers. However, the MemoizePath struct already has built-in mechanisms for situations where the inner subtree provably produces a single tuple for each scan.

It appears that much of the groundwork is already in place to implement caching for semi-joins. We need to make minor adjustments to the get_memoize_path function, revise the cost model in the cost_memoize_rescan routine, and inform users about the memoisation mode by adding relevant details to the EXPLAIN output. The code that enables memoisation for semi-joins is relatively concise and can be found in the branch of my GitHub project. With this patch applied, you can see something like this:

EXPLAIN (COSTS OFF)
SELECT x FROM t1 WHERE EXISTS (SELECT x FROM t2 WHERE t2.x=t1.x);

/*
 Nested Loop Semi Join
   ->  Seq Scan on t1
   ->  Memoize
         Cache Key: t1.x
         Cache Mode: logical
         Store Mode: singlerow
         ->  Index Only Scan using t2_idx on t2
               Index Cond: (x = t1.x)
*/

The EXPLAIN parameter 'Store Mode' appears only when the Memoize node works in the 'incomplete' mode. It works the same way for ANTI JOIN cases:

EXPLAIN (COSTS OFF)
SELECT x FROM t1 WHERE NOT EXISTS (SELECT x FROM t2 WHERE t2.x=t1.x);

/*
 Nested Loop Anti Join
   ->  Seq Scan on t1
   ->  Memoize
         Cache Key: t1.x
         Cache Mode: logical
         Store Mode: singlerow
         ->  Index Only Scan using t2_idx on t2
               Index Cond: (x = t1.x)
*/
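The single-row idea behind the plans above can be modelled in a few lines of Python. This is a sketch of the concept only, not the code of the patch: `semi_or_anti_join` is a hypothetical helper, the set stands in for the index lookup on t2.x, and the cache keeps just an existence flag per parameter value instead of tuples.

```python
# Values as in the first example: t1.x has ten distinct values,
# t2.x holds 1..100000.
t1 = [value % 10 for value in range(1, 1001)]
t2_x = set(range(1, 100001))   # stand-in for an index lookup on t2.x

def semi_or_anti_join(outer, anti=False):
    """Return (result rows, number of inner scans) using an existence cache."""
    exists_cache = {}   # parameter value -> True/False, no tuples stored
    scans = 0
    result = []
    for x in outer:
        if x not in exists_cache:
            scans += 1
            exists_cache[x] = x in t2_x   # the first match is enough
        # SEMI keeps rows with a match, ANTI keeps rows without one.
        if exists_cache[x] != anti:
            result.append(x)
    return result, scans

semi_rows, semi_scans = semi_or_anti_join(t1)
anti_rows, anti_scans = semi_or_anti_join(t1, anti=True)
print(len(semi_rows), semi_scans)
print(len(anti_rows), anti_scans)
```

Because only the answer to "does at least one row exist?" is cached, an incomplete scan of the inner side is not a problem for this mode.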

In real-life scenarios, I frequently see on the inner side of SEMI and ANTI joins not trivial table scans but huge subtrees containing join trees, aggregates and sorts. For such queries, avoiding unnecessary rescan calls is crucial. Even more importantly, knowing that only a single tuple is needed from such a subquery may lead the optimiser to choose a better fractional path.

Memoise arbitrary query subtree

Here, I want to discover if the optimiser can use Memoize to cache the result of a bushy query tree. Look at the example:

DROP TABLE IF EXISTS t1,t2,t3;
CREATE TABLE t1 (x numeric PRIMARY KEY, payload text);
CREATE TABLE t2 (x numeric, y numeric);
CREATE TABLE t3 (x numeric, payload text);
INSERT INTO t1 (x, payload)
  (SELECT value, 'long line of text'
   FROM generate_series(1,100000) AS value);
INSERT INTO t2 (x,y)
  (SELECT value % 1000, value % 1000
   FROM generate_series(1,100000) AS value);
INSERT INTO t3 (x, payload)
  (SELECT (value%10), 'long line of text'
   FROM generate_series(1,100000) AS value);
CREATE INDEX t2_idx_x ON t2 (x);
CREATE INDEX t2_idx_y ON t2 (y);
VACUUM ANALYZE t1,t2,t3;

-- Disable any extra optimisations:
SET enable_hashjoin = f;
SET enable_mergejoin = f;
SET enable_material = f;

Now, let's examine the query:

EXPLAIN (COSTS OFF)
SELECT * FROM t3 WHERE x IN (
  SELECT y FROM t2 WHERE x IN (
    SELECT x FROM t1)
);

There are three joining tables. In the absence of a hash join, it would be better to use a parameterised scan. Lots of duplicated values inside Table t3 should trigger the use of the memoisation technique:

 Nested Loop Semi Join
   ->  Seq Scan on t3
   ->  Nested Loop
         ->  Index Scan using t2_idx_y on t2
               Index Cond: (y = t3.x)
         ->  Index Only Scan using t1_pkey on t1
               Index Cond: (x = t2.x)

As you can see in the EXPLAIN above, Postgres can’t insert a Memoize node at the top of NestLoop JOIN. As far as I remember, it has not yet been implemented because it is hard to discover the query subtree and find all the lateral references and parameters that are mandatory for the memoisation technique. At the same time, SQL Server is capable of doing it:

A closer examination of this example revealed an interesting aspect. Although Postgres doesn't have the opportunity to insert Memoize over the join, it might still insert Memoize over trivial scans of tables t1 or t2, thereby avoiding repeated scans. However, it didn't do this because it predicted only one rescan operation when planning the JOIN(t1,t2). In this case, using memoisation seems unnecessary.

One million rescan cycles occur due to the upper-level join, JOIN(t3, JOIN(t1,t2)), but the bottom-up optimiser lacks the insight to identify this valuable data at the level of JOIN(t1,t2). You can observe this behaviour in our test example by populating t3.x with unique data. Interestingly, SQL Server also uses a bottom-up planning strategy and fails to recognise this situation to insert an appropriate Spool node.

Postgres planning extensibility allows passing through the query plan and doing additional work. Should we consider adding a top-down planning cycle after constructing the query plan?

Beyond the NestLoop memoisation

In previous sections, we discussed how memoisation could be enhanced by extending it to other join types and applying it to arbitrary query subtrees. But can we go further? Let me dream for a little bit...

Memoisation is a technique used for caching parameters along with their corresponding results. In many real-world scenarios, I often encounter complex situations where a heavy subplan is evaluated within an expression or a CASE statement for every incoming tuple due to references to upper-query objects.

What if the optimiser could insert a Memoize node at the top of the subplan whenever an external value parameterises it? To illustrate this idea, let me provide an example:

-- Case 1:
EXPLAIN (COSTS OFF)
SELECT oid,relname FROM pg_class c1
WHERE oid = 
  CASE WHEN (c1.oid)::integer%2=0
    THEN (SELECT oid FROM pg_class c2 WHERE c2.relname = c1.relname)
    ELSE
      (SELECT oid FROM pg_class c3 WHERE c3.relname = c1.relname)
  END;

-- Case 2:
EXPLAIN (VERBOSE, COSTS OFF)
SELECT 
  CASE WHEN (c1.oid)::integer%2=0
    THEN (SELECT oid || ' - TRUE' FROM pg_class c2 WHERE c2.relname = c1.relname)
    ELSE
      (SELECT oid || ' - FALSE' FROM pg_class c3 WHERE c3.relname = c1.relname)
  END
FROM pg_class c1;

-- Case 3:
EXPLAIN (COSTS OFF)
SELECT oid FROM pg_class c1
WHERE EXISTS (
  SELECT true FROM pg_class c2 WHERE c1.relname=c2.relname OFFSET 0);

We can't flatten subplan nodes in these examples and must evaluate them repeatedly. To my mind, the optimiser should have a chance to build a plan that looks like the one below:

   SubPlan 1
     Memoize
       Cache Key: ((c1.oid)::integer % 2)
       Cache Mode: logical
     ->  Index Scan using pg_class_relname_nsp_index on pg_class c2
           Index Cond: (relname = c1.relname)

In this case, the Memoize node will re-evaluate the underlying subplan only when a new combination of parameters is provided from the upper query. While we cannot address all the issues that arise with the subplan bubbling up in an expression, we can help mitigate performance cliffs caused by such constructs.

Do you think it makes sense?

THE END.

January 2, 2025, Pattaya, Thailand.

Fractional Path Issue in Partitioned Postgres databases

Andrei Lepikhov — Sun, 15 Dec 2024 22:01:25 GMT

While a user notices the positive aspects of a technology, a developer, who usually encounters limitations, shortcomings or bugs, looks at the product from a completely different perspective. The same thing happened this time: after publishing the comparative testing results, where Join-Order-Benchmark queries were run against a database with and without partitions, I couldn't shake the feeling that I had missed something. To my mind, Postgres should build a worse plan with partitions than without them. And this should not be just a bug but a technological limitation. On second thought, I found a weak spot: queries with limits.

In the presence of a LIMIT clause in the SQL query, unlike the case of plain tables, the optimiser immediately faces many questions: How many rows may be extracted from each partition? Will only a single partition be used? If so, which one? This is not apparent when execution-time pruning is possible.

What if we scan partitions by index, and the result is obtained by merging? In that case, it is entirely unclear how to estimate the number of rows that should be extracted from each partition and, therefore, which type of partition scan operator to apply. And what if, with partitionwise join, we have an intricate subtree under the Append? Knowledge of the limit should be crucial in this case, for example, when choosing the JOIN type.

Interim-cost query plans

Such a pack of questions about planning partitions led to a compromise solution in choosing a query plan for Append's subpaths: for picking the optimal fractional path, two plan options are considered: the minimum total cost and the minimum startup cost paths. Roughly speaking, the plan will be optimal if we have LIMIT 1 or some considerable LIMIT value in the query. But what about intermediate options? Let's look at specific examples (thanks to Alexander Pyhalov).

DROP TABLE IF EXISTS parted,plain CASCADE;
CREATE TEMP TABLE parted (x integer, y integer, payload text)
PARTITION BY HASH (payload);
CREATE TEMP TABLE parted_p1 PARTITION OF parted
  FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TEMP TABLE parted_p2 PARTITION OF parted
  FOR VALUES WITH (MODULUS 2, REMAINDER 1);
INSERT INTO parted (x,y,payload)
  SELECT (random()*600)::integer,
         (random()*600)::integer, md5((gs%500)::text)
  FROM generate_series(1,1E5) AS gs;
CREATE TEMP TABLE plain (x numeric, y numeric, payload text);
INSERT INTO plain (x,y,payload) SELECT x,y,payload FROM parted;
CREATE INDEX ON parted(payload);
CREATE INDEX ON plain(payload);
VACUUM ANALYZE;
VACUUM ANALYZE parted;

In this example, we executed VACUUM ANALYZE twice because, by default, statistics cannot be built on the partitioned table itself; they are built on each partition separately. To gather statistics that combine data from all partitions, we must explicitly execute ANALYZE with the name of the partitioned table as a parameter. Now, let's see how the selection from the partitioned and regular tables works with the same data:

EXPLAIN (COSTS OFF)
SELECT * FROM plain p1 JOIN plain p2 USING (payload) LIMIT 100;
EXPLAIN (COSTS OFF)
SELECT * FROM parted p1 JOIN parted p2 USING (payload) LIMIT 100;

/*
 Limit
   ->  Nested Loop
         ->  Seq Scan on plain p1
         ->  Memoize
               Cache Key: p1.payload
               Cache Mode: logical
               ->  Index Scan using plain_payload_idx on plain p2
                     Index Cond: (payload = p1.payload)

 Limit
   ->  Merge Join
         Merge Cond: (p1.payload = p2.payload)
         ->  Merge Append
               Sort Key: p1.payload
               ->  Index Scan using parted_p1_payload_idx
               ->  Index Scan using parted_p2_payload_idx
         ->  Materialize
               ->  Merge Append
                     Sort Key: p2.payload
                     ->  Index Scan using parted_p1_payload_idx
                     ->  Index Scan using parted_p2_payload_idx
*/

The query plans seem optimal: depending on the limit, only the minimum number of rows will be selected since, with a helpful index on the join attribute, we already have ordered access to the table rows. Now let's prompt the optimiser to build a complex subtree under the Append by enabling partitionwise join:

SET enable_partitionwise_join = 'true';
EXPLAIN (COSTS OFF)
SELECT * FROM parted p1 JOIN parted p2 USING (payload) LIMIT 100;
/*
 Limit
   ->  Append
         ->  Nested Loop
               Join Filter: (p1_1.payload = p2_1.payload)
               ->  Seq Scan on parted_p1 p1_1
               ->  Materialize
                     ->  Seq Scan on parted_p1 p2_1
         ->  Nested Loop
               Join Filter: (p1_2.payload = p2_2.payload)
               ->  Seq Scan on parted_p2 p1_2
               ->  Materialize
                     ->  Seq Scan on parted_p2 p2_2
*/

Although nothing has changed in the data, a poor plan has been selected. The reason for this degradation is that, when planning an Append, the optimiser chooses the cheapest plan according to the startup_cost criterion. And that is the one containing NestLoop + SeqScan: in terms of startup cost, when no table actually has to be scanned to get started, such a plan slightly beats even the obvious NestLoop + IndexScan. This is how the current Postgres works, including the dev branch.
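The compromise described earlier, looking only at the cheapest-startup and the cheapest-total paths, can be illustrated with a toy Python model. Everything here is an assumption made for the sake of the example: the three candidate paths and their costs are invented, the helper names are hypothetical, and the fractional cost is taken as a linear interpolation between startup and total cost.

```python
from collections import namedtuple

Path = namedtuple("Path", "name startup_cost total_cost")

def fractional_cost(path, fraction):
    """Cost of fetching `fraction` of the rows, by linear interpolation."""
    return path.startup_cost + fraction * (path.total_cost - path.startup_cost)

# Hypothetical candidate paths for one Append child (made-up costs).
paths = [
    Path("nestloop + seqscan", 0.0, 1000.0),
    Path("hash join", 500.0, 600.0),
    Path("nestloop + index scan", 10.0, 700.0),
]

def pick_from_two_extremes(paths, fraction):
    """Consider only the cheapest-startup and cheapest-total paths."""
    cheapest_startup = min(paths, key=lambda p: p.startup_cost)
    cheapest_total = min(paths, key=lambda p: p.total_cost)
    return min((cheapest_startup, cheapest_total),
               key=lambda p: fractional_cost(p, fraction))

def pick_from_all(paths, fraction):
    """Consider every candidate path for the given fraction."""
    return min(paths, key=lambda p: fractional_cost(p, fraction))

for fraction in (0.01, 0.3, 1.0):
    print(fraction,
          "| two extremes:", pick_from_two_extremes(paths, fraction).name,
          "| all paths:", pick_from_all(paths, fraction).name)
```

With these made-up numbers, both strategies agree for a tiny fraction and for the full result, while for the intermediate fraction the best path is one that neither extreme represents.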

However, this problem can be fixed quite simply by adding the appropriate logic to the optimiser code. Together with Nikita Malakhov and Alexander Pyhalov, we have prepared a patch that fixes this problem; it can be found in the current commitfest. In its discussion thread, you can find another interesting remark about revising the startup_cost computation logic of the sequential scan operator, which could also alleviate the choice of non-optimal fractional paths in the LIMIT 1 case. With this patch applied, we get an acceptable query plan:

 Limit
   ->  Append
         ->  Nested Loop
               ->  Seq Scan on parted_p1 p1_1
               ->  Memoize
                     Cache Key: p1_1.payload
                     Cache Mode: logical
                     ->  Index Scan using parted_p1_payload_idx
                           Index Cond: (payload = p1_1.payload)
         ->  Nested Loop
               ->  Seq Scan on parted_p2 p1_2
               ->  Memoize
                     Cache Key: p1_2.payload
                     Cache Mode: logical
                     ->  Index Scan using parted_p2_payload_idx
                           Index Cond: (payload = p1_2.payload)

Now, let's look at the next problem, which does not have a simple solution yet.

Calculated limit

Consider the following query:

EXPLAIN (COSTS OFF)
SELECT * FROM parted p1 JOIN parted p2 USING (payload,y)
ORDER BY payload,y LIMIT 100;

Executing it with the patch provided above gives you an optimal plan - it uses NestLoop with a parameterized index scan that will touch only the minimum number of table rows needed to produce the result. However, by simply reducing the limit, we get the original bleak picture:

EXPLAIN (COSTS OFF)
SELECT * FROM parted p1 JOIN parted p2 USING (payload,y)
ORDER BY payload,y LIMIT 1;

/*
Limit
   ->  Merge Append
         Sort Key: p1.payload, p1.y
         ->  Merge Join
               Merge Cond: ((p1_1.payload = p2_1.payload) AND
                            (p1_1.y = p2_1.y))
               ->  Sort
                     Sort Key: p1_1.payload, p1_1.y
                     ->  Seq Scan on parted_p1 p1_1
               ->  Sort
                     Sort Key: p2_1.payload, p2_1.y
                     ->  Seq Scan on parted_p1 p2_1
         ->  Merge Join
               Merge Cond: ((p1_2.payload = p2_2.payload) AND
                            (p1_2.y = p2_2.y))
               ->  Sort
                     Sort Key: p1_2.payload, p1_2.y
                     ->  Seq Scan on parted_p2 p1_2
               ->  Sort
                     Sort Key: p2_2.payload, p2_2.y
                     ->  Seq Scan on parted_p2 p2_2
*/

A SeqScan operator again reads all rows from tables, and the query becomes tens of times slower, although we only reduced the LIMIT! At the same time, by disabling SeqScan, you can see a fast plan and incremental sorting again.

The fundamental problem is that the optimiser only knows the final limit on the number of rows in the query/subquery. In this case, at the Append planning stage, the optimiser cannot estimate how many tuples the upper Incremental Sort could request. Depending on the data distribution in the 'y' column, either a single row or all rows from each partition may be needed.

Even if we imagine that we have taught IncrementalSort to calculate the number of groups by the 'payload' column and, based on this, to estimate the maximum number of rows required from each partition, we still could not improve the plan: by then, the planning of the Append operator has already been completed and the possible options for its execution are fixed. After all, we plan the query from the bottom up!

To sum up: partitioned tables do make the task much more difficult for the current version of Postgres, limiting the search space for optimal query plans. Switching to partitions should be thoroughly tested, focusing on cases where only a limited selection of the tables' tuples is required and there is no noticeable pruning of partitions at the planning stage. Since this area is actively developing, we can expect improvements soon (especially if users report emerging issues more actively). Still, there are cases where the solution within the existing architecture is not apparent and requires additional R&D.

Do you agree with my conclusions, or did I just write nonsense? Please leave your opinion in the comments.

THE END.

December 9, 2024, Pattaya, Thailand.

Could GROUP-BY clause reordering improve performance?

Andrei Lepikhov — Mon, 25 Nov 2024 22:00:58 GMT

PostgreSQL users often employ analytical queries that sort and group data by different rules. Optimising these operators can significantly reduce the time and cost of query execution. In this post, I will discuss one such optimisation: choosing the order of columns in the GROUP BY expression.

Postgres can already reshuffle the list of grouped expressions according to the ORDER BY condition to eliminate additional sorting and save computing resources. We went further and implemented an additional strategy of group-by-clause list permutation in a series of patches (the first attempt and the second one) for discussion with the Postgres community, expecting it to be included in the next version of PostgreSQL core. You can also try it in action in the commercial Postgres Pro Enterprise fork.

A short introduction to the issue

To group table data by one or more columns, DBMSes usually use hashing methods (HashAgg) or preliminary sorting of rows (tuples) with subsequent traversal of the sorted set (SortAgg). When sorting incoming tuples by multiple columns, Postgres must call the comparison operator not just once but for each pair of values. For example, to compare a table row ('UserX1', 'Saturday', $100) with a row ('UserX1', 'Monday', $10) and determine the relative order of these rows, we must first compare the first two values and, if they match, move on to the next pair. If the second pair of values (in our example, 'Saturday' and 'Monday') differs, then there is no point in calling the comparison operator for the third element.

This is the principle on which the proposed SortAgg operator optimisation mechanism is based. If, when comparing rows, we compare column values with fewer duplicates first (for example, first compare UserID numbers and then days of the week), then we will have to call the comparison operator much less often.
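The effect described above can be reproduced outside the database. Below is a minimal Python sketch, not Postgres code: it sorts the same synthetic rows with two column orders and counts how many times a per-column comparison is performed. The data, the column cardinalities, and the function name are assumptions chosen only for illustration.

```python
import random
from functools import cmp_to_key

def count_column_comparisons(rows, column_order):
    """Sort rows by the given column order and count per-column comparisons."""
    calls = 0

    def compare(a, b):
        nonlocal calls
        for col in column_order:
            calls += 1                      # one comparison-operator call per column pair
            if a[col] != b[col]:
                return -1 if a[col] < b[col] else 1
        return 0                            # every column matched

    sorted(rows, key=cmp_to_key(compare))
    return calls

random.seed(42)
# Hypothetical data: column 0 has many distinct values, column 1 has only a few.
rows = [(random.randrange(100_000), random.randrange(7)) for _ in range(10_000)]

many_first = count_column_comparisons(rows, [0, 1])
few_first = count_column_comparisons(rows, [1, 0])
print(many_first, few_first)
```

With the high-cardinality column first, almost every row comparison is decided by a single call; with the low-cardinality column first, many comparisons have to fall through to the second column.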

Time for a demo case

How much can minimising the number of comparisons speed up a Sort operation? Let's look at the examples. In the first example, we sort the table by the same fields but in different orders:

CREATE TABLE shopping (
  CustomerId bigint, CategoryId bigint, WeekDay text, Total money
);
INSERT INTO shopping (CustomerId, CategoryId, WeekDay, Total)
  SELECT random()*1E6, random()*100, 'Day ' || (random()*7)::integer,
    random()*1000::money
  FROM generate_series(1,1E6) AS gs;
VACUUM ANALYZE shopping;

SET max_parallel_workers_per_gather = 0;
SET work_mem = '256MB';

EXPLAIN (ANALYZE, TIMING OFF)
SELECT CustomerId, CategoryId, WeekDay, Total
FROM shopping
ORDER BY WeekDay,Total,CategoryId,CustomerId;

EXPLAIN (ANALYZE, TIMING OFF)
SELECT CustomerId, CategoryId, WeekDay, Total
FROM shopping
ORDER BY CustomerId,CategoryId,WeekDay,Total;

The results of executing these queries will be as follows:

 Sort  (cost=117010.84..119510.84 rows=1000000 width=30)
       (actual rows=1000000 loops=1)
   Sort Key: weekday, total, categoryid, customerid
   Sort Method: quicksort  Memory: 71452kB
   ->  Seq Scan on shopping  (actual rows=1000000 loops=1)
 Execution Time: 2858.596 ms

 Sort  (cost=117010.84..119510.84 rows=1000000 width=30)
       (actual rows=1000000 loops=1)
   Sort Key: customerid, categoryid, weekday, total
   Sort Method: quicksort  Memory: 71452kB
   ->  Seq Scan on shopping  (actual rows=1000000 loops=1)
 Execution Time: 505.775 ms

The second query is executed almost six times faster than the first, although the processed data is identical. This is because the comparison operator was called less often in the second case. The sorted tuple has 4 columns (CustomerId, CategoryId, WeekDay, Total), and Postgres calls the comparison operator separately for each pair of values - a maximum of 4 times. But if the first column in the comparison is CustomerId, then the need to call the comparison operator for the next column will be much lower than when the WeekDay column is the first.

This example shows that the computational costs of the sorting operation may be pretty significant. Even with the “Abbreviated keys” optimisation in our pocket, stable execution time of the sort operation is still not guaranteed. I wonder if some newly proposed optimisations [1, 2] could significantly narrow the performance gap. Considering that an analytical query may contain multiple sorts (each aggregate may define its own order of incoming data), choosing a cheaper column order will save computing resources.

Note that the cost values of the Sort operator are the same in both EXPLAIN outputs of the first example. This means that, for the Postgres optimiser, both sorting options are identical.

Since the sort order for GROUP BY or Merge Join does not affect the final result, it can be chosen to minimise the number of comparison operations. In addition, if the table has many indexes, the data can be scanned and sorted in different ways, and the correct choice of the incremental sort option (IncrementalSort) may provide a positive effect.

Imagine a second example. Let's say you want to group your data to calculate the average spend for each customer in a given product category based on the day of the week:

SET enable_hashagg = 'off';
EXPLAIN (ANALYZE, TIMING OFF)
SELECT CustomerId, CategoryId, WeekDay, avg(Total::numeric)
FROM shopping
GROUP BY WeekDay,CategoryId,CustomerId;

/*
GroupAggregate (actual rows=999370 loops=1)
   Group Key: weekday, categoryid, customerid
   ->  Sort (actual rows=1000000 loops=1)
         Sort Key: weekday, categoryid, customerid
         Sort Method: quicksort  Memory: 71452kB
         ->  Seq Scan on shopping (actual rows=1000000 loops=1)
  Execution Time: 2742.777 ms
 */

To demonstrate the concept explicitly, I have disabled hash aggregation. From a query perspective, the order of the columns in the GROUP BY clause is entirely unimportant. Let's change the order and see the result:

EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF)
SELECT CustomerId, CategoryId, WeekDay, avg(Total::numeric)
FROM shopping
GROUP BY CustomerId,CategoryId,WeekDay;

/*
 GroupAggregate (actual rows=999370 loops=1)
   Group Key: customerid, categoryid, weekday
   ->  Sort (actual rows=1000000 loops=1)
         Sort Key: customerid, categoryid, weekday
         Sort Method: quicksort  Memory: 71452kB
         ->  Seq Scan on shopping (actual rows=1000000 loops=1)
  Execution Time: 1840.517 ms
 */

The speedup is less impressive than in the first example but pretty noticeable overall. What is important is that this transformation is free: we do not need a new index or a complex query tree change, such as a subquery pull-up. Such a change can be done automatically; the main thing is to teach the Postgres optimiser to distinguish the costs of different combinations of grouping clauses and to consider an additional grouping strategy.

State of the art

In 2023, Postgres learned to exclude redundant columns from a grouping operation. Redundancy can occur, for example, when there is an equality expression in the query tree:

SELECT sum(total) FROM shopping
WHERE CustomerId=CategoryId AND WeekDay='Monday'
GROUP BY CustomerId,CategoryId, WeekDay;

/*
 GroupAggregate
   Group Key: customerid
   ->  Sort
         Sort Key: customerid
         ->  Seq Scan on shopping
               Filter: ((customerid = categoryid) AND
                       (weekday = 'Monday'::text))
 */

In the example above, the values in the CustomerId and CategoryId columns belong to the same equivalence class (EquivalenceClass structure in Postgres code), and either column can be excluded from the grouping expression. At the same time, the clause "weekday = 'Monday'" makes explicit grouping by WeekDay unnecessary.

PostgreSQL 17 introduced another strategy: the optimiser can now adjust the order of the grouped columns according to the sort order of the input data. Thus, during planning, Postgres may consider two alternative strategies:

Group the already sorted data, and then re-sort by ORDER BY requirements.
Sort the incoming data by the rules specified by ORDER BY, then perform the grouping.

To demonstrate both options, let's add an index to our table and compare the results of the two queries:

CREATE INDEX ON shopping(CustomerId, weekday);

EXPLAIN (COSTS OFF)
SELECT count(*) FROM shopping WHERE CustomerId < 5000
GROUP BY WeekDay,CustomerId ORDER BY WeekDay,CustomerId;

EXPLAIN (COSTS OFF)
SELECT count(*) FROM shopping WHERE CustomerId < 50000
GROUP BY WeekDay,CustomerId ORDER BY WeekDay,CustomerId;

/*
 GroupAggregate
   Group Key: weekday, customerid
   ->  Sort
         Sort Key: weekday, customerid
         ->  Index Only Scan using
             shopping_customerid_weekday_idx on shopping
               Index Cond: (customerid < 5000)

Sort
   Sort Key: weekday, customerid
   ->  GroupAggregate
         Group Key: customerid, weekday
         ->  Index Only Scan using
             shopping_customerid_weekday_idx on shopping
               Index Cond: (customerid < 50000)
 */

In the first case, there is little data to be grouped, and it is cheaper to sort the tuples in advance according to the requirements of the ORDER BY operator. In the second case, sorting after grouping is justified: the index scan operator will return the rows in sorted form, and grouping will significantly reduce the number of such rows, which makes subsequent sorting cheaper. Isn't it true that the additional Postgres strategy allows you to find exciting variants of query plans? The downside is that it does not use column statistics, which could have helped to optimise example No. 2.

How to employ statistics?

The proposed GROUP-BY columns reordering strategy is based on the standard Postgres columnar statistics stored in the pg_statistic table. It is a cost-based strategy, and it supplies the optimiser with an alternative path for the Sort operator that minimises the number of comparison operations during sorting. To clarify the basic idea, consider the query with grouping from the example above:

SELECT avg(Total::numeric) FROM shopping
GROUP BY CustomerId,CategoryId,WeekDay;

Placing CustomerId in the first position of the sort key is more efficient because this column contains the largest number of distinct values (approximately half a million). That means that, for each tuple, there are only about two other tuples for which comparing the CustomerId column will not determine the order, so that the values from subsequent columns have to be compared. The WeekDay column, in contrast, has only a handful of distinct values. If Postgres sorted by this column first, the values of subsequent columns would have to be compared with a much higher probability.

Dive into the code

Since the code is very voluminous, we split it into four patches.

The first patch teaches the optimiser to consider EquivalenceClass members when estimating the number of groups in the estimate_num_groups() routine. What does it mean? Look at the queries:

EXPLAIN SELECT CustomerId,CategoryId FROM shopping
WHERE CustomerId = CategoryId GROUP BY CustomerId,CategoryId;

EXPLAIN SELECT CustomerId,CategoryId FROM shopping
WHERE CustomerId = CategoryId GROUP BY CategoryId,CustomerId;

These queries are semantically identical: we just rearranged the columns in the grouping list. The equality expression levels out the difference in distinct values between CategoryId and CustomerId: after applying the filter, both columns contain exactly the same values. But if you EXPLAIN them, you will see different estimations and, as a result, different query plans:

HashAggregate  (cost=14073.83..14123.71 rows=4988 width=16)

--and:

Group  (cost=13676.18..13715.13 rows=101 width=16)

So, the first patch adds code to estimate_num_groups() that walks through the equivalence class and looks up the ndistinct estimates of its members. The minimum number of distinct values should be the most accurate answer. The patch also introduces caching of ndistinct values inside an EquivalenceMember.
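The idea can be sketched in a few lines of Python. This is an illustration only, not the patch itself: the class names mirror the Postgres structures, but the fields, the NDISTINCT dictionary (standing in for pg_statistic), and the numbers in it are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EquivalenceMember:
    column: str
    cached_ndistinct: Optional[float] = None   # hypothetical cache slot

@dataclass
class EquivalenceClass:
    members: list = field(default_factory=list)

# Hypothetical per-column statistics, standing in for pg_statistic.
NDISTINCT = {"customerid": 632_000.0, "categoryid": 101.0}

def member_ndistinct(member):
    """Look up ndistinct once and remember it on the member."""
    if member.cached_ndistinct is None:
        member.cached_ndistinct = NDISTINCT[member.column]
    return member.cached_ndistinct

def class_ndistinct(eclass):
    """Equal columns cannot have more groups than the scarcest member."""
    return min(member_ndistinct(m) for m in eclass.members)

ec = EquivalenceClass([EquivalenceMember("customerid"),
                       EquivalenceMember("categoryid")])
print(class_ndistinct(ec))
```

Because the minimum is taken over the whole class, the estimate no longer depends on which of the equal columns happens to come first in the GROUP BY list.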

The second patch concerns the formula for calculating the cost of sorting. In the current version of Postgres, sorting is estimated using the formula:

cost = C * N * log2(N)

where:

N - number of tuples to sort,
C = 2.0*cpu_operator_cost - user-defined parameter.

This patch introduces into the Sort estimation formula the number of columns involved:

The approach seems straightforward and relatively crude. It is designed to be intermediate - to discover how many places employ sort estimation formulas and how many areas will be impacted.

Looking at the regression test changes, you may notice that this change affects the balance among Sort, IncrementalSort, MergeAppend, GatherMerge, and HashAgg nodes. With this formula, the optimiser favours HashAgg grouping in more situations than before. The HashAgg estimation already takes into account the number of columns in the aggregated tuple, whereas aggregation with preliminary sorting has been estimated too optimistically in the case of a long list of sorted values. Thus, this patch increases the optimiser's bias towards hashing in grouping operations, especially on small data volumes.

Why such a trivial formula, you might ask? Is it OK to suppose that all the values are duplicates? It looks pretty strange, but in my experience, the problem with grouping order usually arises when a query processes massive numbers of tuples filled with text (or numeric) values that are largely duplicates. One more excuse is that we immediately improve this formula in the next patch. But even with such a simple formula, Postgres is able to distinguish between various sort orders.

The third patch reconsiders the formula introduced by the second patch. Here, the distinct statistics cache, added by the first patch, is employed to estimate the number of distinct values in the first sorted column, and the formula becomes:

This approach can be extended when reliable statistics on the joint distribution of columns (EXTENDED STATISTICS) exist. Still, at the moment, we limit ourselves to the first column estimation only because it is sufficient in most cases. With this formula, the optimiser can distinguish the costs of different sorting combinations of columns, which allows us to choose the optimal sorting operator.
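To make the difference between the cost models tangible, here is a rough Python sketch. Only the baseline reproduces what the plans above show (C * N * log2(N) with C = 2.0*cpu_operator_cost). The two column-aware functions are hypothetical illustrations of the ideas described in the text, not the exact formulas from the patches, and they assume uniformly distributed values.

```python
import math

CPU_OPERATOR_COST = 0.0025          # PostgreSQL default for cpu_operator_cost

def sort_cost_baseline(n_tuples):
    """C * N * log2(N), with C = 2.0 * cpu_operator_cost."""
    return 2.0 * CPU_OPERATOR_COST * n_tuples * math.log2(n_tuples)

def sort_cost_per_column(n_tuples, n_columns):
    """Hypothetical: pessimistically charge one comparison per sort column."""
    return sort_cost_baseline(n_tuples) * n_columns

def sort_cost_first_column(n_tuples, n_columns, first_ndistinct):
    """Hypothetical: the remaining columns are compared only when the first ties.

    Assumes two tuples tie on the first column with probability 1 / ndistinct.
    """
    tie_probability = 1.0 / max(first_ndistinct, 1.0)
    expected_columns = 1.0 + (n_columns - 1) * tie_probability
    return sort_cost_baseline(n_tuples) * expected_columns

print(round(sort_cost_baseline(1_000_000), 2))
print(sort_cost_first_column(1_000_000, 4, 8) >
      sort_cost_first_column(1_000_000, 4, 500_000))
```

For the 1,000,000-row shopping table, the baseline gives about 99657.84, which is exactly the gap between the Sort startup cost (117010.84) and the Seq Scan cost (17353.00) in the plans of this post. The last function shows why a sort key that starts with a low-cardinality column should be costed higher.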

The fourth patch adds code to the optimiser that permutes grouped columns to place the column with the maximum ndistinct value in the first position. This GROUP-BY order is added as one more candidate, which the optimiser estimates alongside the two alternatives discussed above. It will choose the best one based on their costs and the sorting requested by the upper query operator.
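A minimal sketch of this permutation step, written in Python for brevity. The function name and the ndistinct numbers are assumptions for illustration; the real code operates on pathkeys and grouping clauses inside the planner.

```python
def reorder_group_by(columns, ndistinct):
    """Move the column with the largest ndistinct estimate to the front.

    The relative order of the remaining columns is left untouched.
    """
    if not columns:
        return []
    best = max(columns, key=lambda col: ndistinct.get(col, 1.0))
    return [best] + [col for col in columns if col != best]

# Hypothetical estimates loosely based on the shopping table.
stats = {"weekday": 8.0, "categoryid": 101.0, "customerid": 632_000.0}
print(reorder_group_by(["weekday", "categoryid", "customerid"], stats))
```

Only the first column is moved, which mirrors the decision to rely on the first-column estimate alone. The result is just one more candidate order: the optimiser still has to cost it against the alternatives.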

What positive outcome have we earned?

Look at how this change will affect the queries in our examples 1 and 2. Let's start with sorting:

EXPLAIN (ANALYZE, TIMING ON)
SELECT CustomerId, CategoryId, WeekDay, Total
FROM shopping
ORDER BY CustomerId,CategoryId,WeekDay,Total;

EXPLAIN (ANALYZE, TIMING ON)
SELECT CustomerId, CategoryId, WeekDay, Total
FROM shopping
ORDER BY CategoryId,CustomerId,WeekDay,Total;

/*
Sort  (cost=191291.64..193791.64) (actual time=350.819..395.024)
   Sort Key: customerid, categoryid, weekday, total
   ->  Seq Scan on shopping  (cost=0.00..17353.00) (actual time=0.031..60.262)
 Execution Time: 423.583 ms

Sort  (cost=266482.66..268982.66)
       (actual time=653.143..694.736)
   Sort Key: categoryid, customerid, weekday, total
   ->  Seq Scan on shopping  (cost=0.00..17353.00) (actual time=0.012..55.073)
 Execution Time: 723.005 ms
 */

There are two notable improvements: the overall query cost has changed, and the ratio between the sorting and scanning costs has become more accurate and closer to reality. The difference in plan cost now reflects the difference in query execution time. And now the result of query execution with grouping:

SET enable_hashagg = 'off';
EXPLAIN (COSTS OFF)
SELECT CustomerId, CategoryId, WeekDay, avg(Total::numeric)
FROM shopping
GROUP BY WeekDay,CategoryId,CustomerId;

/*
 GroupAggregate
   Group Key: customerid, weekday, categoryid
   ->  Sort
         Sort Key: customerid, weekday, categoryid
         ->  Seq Scan on shopping
 */

The optimiser changed the order of the columns and moved the CustomerId column to the beginning of the grouping list. Given the actual distribution of values in the other columns, it would also have been possible to rearrange the CategoryId and WeekDay columns. However, such fine-tuning has little practical meaning and can be done with sufficient reliability only if there are extended statistics for all three fields. Of course, the proposed solution is not ideal: the mathematical model can be refined and made more detailed and practical (the case when all columns contain duplicates is rare). We also did not consider the relative cost of the comparison operator itself: comparing text types will require more resources than integer types, right? However, the current version already fulfils the main task - to create an additional grouping strategy that is qualitatively different from those already available in the Postgres optimiser.

If you have any comments or opinions on the subject, please leave them in the comments below or in the thread on the Postgres community mailing list.

THE END.

November 25th, 2024. Pattaya, Thailand.

PostgreSQL 'VALUES -> ANY' transformation

Andrei Lepikhov — Thu, 03 Oct 2024 23:58:58 GMT

Introduction

As usual, this project was prompted by multiple user reports with typical complaints, like 'SQL server executes the query several times faster' or 'Postgres doesn't pick up my index'. The underlying issue that united these reports was the frequent use of VALUES sequences, typically transformed in the query tree into a SEMI JOIN.

I also want to raise one general question: should an open-source DBMS correct user errors? I mean optimising a query even before the search for an optimal plan begins: eliminating self-joins and subqueries, and simplifying expressions - everything that can be achieved by proper query tuning. The question is not that simple, since DBAs point out that the cost of query planning in Oracle grows rapidly with the complexity of the query text, which is most likely caused, among other things, by the extensive range of optimisation rules.

Now, let's turn our attention to the VALUES construct. Interestingly, it's not just used with the INSERT command but also frequently appears in SELECT queries in the form of a test of inclusion in a set:

SELECT * FROM something WHERE x IN (VALUES (1), (2), ...);

and in the query plan, this syntactical construct is transformed into SEMI JOIN. To demonstrate the essence of the problem, let's generate a test table with an uneven distribution of data in one of the columns:

CREATE EXTENSION tablefunc;
CREATE TABLE norm_test AS
  SELECT abs(r::integer) AS x, 'abc'||r AS payload
  FROM normal_rand(1000, 1., 10.) AS r;
CREATE INDEX ON norm_test (x);
ANALYZE norm_test;

Here, the value x of the norm_test table has a normal distribution with a mean of 1 and a standard deviation of 10 [1]. There are not too many distinct values, which will all be included in the MCV statistics. As a result, it will be possible to calculate the number of duplicates accurately for each value despite the uneven distribution. Also, we naturally introduced an index on this column, easing the table’s scanning. Now, let's execute the query:

EXPLAIN ANALYZE
SELECT * FROM norm_test WHERE x IN (VALUES (1), (29));

Uncomplicated query, right? It is rational to execute it with two iterations of index scanning. However, in Postgres, we have:

  Hash Semi Join  (cost=0.05..21.36 rows=62) (actual rows=85)
   Hash Cond: (norm_test.x = "*VALUES*".column1)
   -> Seq Scan on norm_test (rows=1000) (actual rows=1000)
   -> Hash (cost=0.03..0.03 rows=2) (actual rows=2)
         ->  Values Scan on "*VALUES*" (rows=2) (actual rows=2)

Here and onwards, I slightly simplify the explain for clarity.

Hmm, a sequential scan of all the table's tuples when two index scans were enough for us? Let's disable HashJoin and see what happens:

SET enable_hashjoin = 'off';

Nested Loop (cost=4.43..25.25 rows=62) (actual rows=85)
   -> Unique (rows=2 width=4) (actual rows=2)
         -> Sort (rows=2) (actual rows=2)
               Sort Key: "*VALUES*".column1
               ->  Values Scan on "*VALUES*" (rows=2) (actual rows=2)
   -> Bitmap Heap Scan on norm_test (rows=31) (actual rows=42)
         Recheck Cond: (x = "*VALUES*".column1)
         -> Bitmap Index Scan on norm_test_x_idx
            (rows=31) (actual rows=42)
               Index Cond: (x = "*VALUES*".column1)

Now you can see that Postgres has squeezed out the maximum: in one pass through the VALUES set for each outer value, it performs an index scan on the table. It's much more interesting than the previous option. However, it is not as simple as just a regular index scan. In addition, if you look at the query explanation more closely, you can see that the optimiser makes a mistake in predicting the cardinality of the join and index scan. And what happens if you rewrite the query without VALUES:

EXPLAIN (ANALYSE, TIMING OFF)
SELECT * FROM norm_test WHERE x IN (1, 29);

/*
Bitmap Heap Scan on norm_test (cost=4.81..13.87 rows=85) (actual rows=85)
   Recheck Cond: (x = ANY ('{1,29}'::integer[]))
   Heap Blocks: exact=8
   -> Bitmap Index Scan on norm_test_x_idx (rows=85) (actual rows=85)
         Index Cond: (x = ANY ('{1,29}'::integer[]))
*/

As you can see, we got a query plan containing only an index scan that is almost twice as cheap. At the same time, by estimating each value from the set and having both of these values in the MCV statistics, Postgres accurately predicts the cardinality of this scan.
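Why is the array form so much easier to estimate? A simplified Python sketch of element-by-element estimation for 'x = ANY (...)' follows. It is not the Postgres implementation: the MCV frequencies are invented for illustration (they are not taken from a real ANALYZE run), and default_selectivity is a hypothetical fallback for values missing from the MCV list.

```python
def any_equality_rows(table_rows, mcv_frequencies, array_values,
                      default_selectivity=0.005):
    """Estimate the number of rows matching `x = ANY(array_values)`.

    Distinct constants select disjoint sets of rows, so their
    selectivities are simply summed and clamped to 1.0.
    """
    selectivity = 0.0
    for value in set(array_values):                 # duplicates add nothing
        selectivity += mcv_frequencies.get(value, default_selectivity)
    return round(table_rows * min(selectivity, 1.0))

# Hypothetical MCV frequencies for norm_test.x.
mcv = {1: 0.080, 29: 0.005}
print(any_equality_rows(1000, mcv, [1, 29]))
```

Each constant is looked up in the statistics individually, which is what makes the estimate precise when all values are present in the MCV list. With a join against VALUES, the optimiser has no such per-value information.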

So, while not a big problem in itself (you can always use HashJoin and hash the inner VALUES), using VALUES sequences is a source of dangers:

The optimiser can choose NestLoop, which can reduce performance with a vast VALUES list.
All of a sudden, SeqScan can be chosen instead of IndexScan.
The optimiser makes significant estimation errors when predicting the cardinality of a JOIN operation and its underlying operations.

By the way, why would anyone need to use such expressions at all?

I guess this is a particular case where an automation system - an ORM or a REST API - tests the inclusion of an object in a specific set of objects. Since VALUES describes a relational table, and the value of such a list is a table row, we are most likely dealing with cases where each row represents an instance of an object in the application. Our case is a corner case when the object is characterised by only one property. If my guess is wrong, please correct me in the comments - maybe someone knows other reasons?

So, passing the 'x IN VALUES' construct into the optimiser is risky. Why not fix the situation by converting this VALUES construct to an array? Then, we will have a construct like 'x = ANY [...]', a special case of the ScalarArrayOpExpr operation in the Postgres code. It will simplify the query tree, eliminating the appearance of an unnecessary join. Also, the Postgres cardinality evaluation mechanism can work with the array inclusion check operation. If the array is small enough (< 100 elements), it will perform a statistical evaluation element by element. In addition, Postgres can optimise array search by hashing the values (if the memory required for that fits the work_mem value) - and everyone will be happy, right?

Well, we decided to try to do this in our optimisation lab - and surprisingly, it turned out to be relatively trivial. The first peculiarity we encountered is that the conversion is only possible for operations on scalar values: so far, it is generally impossible to convert an expression of the form '(x,y) IN (VALUES (1,1), (2,2), ...)' so that the result exactly matches the state before the conversion. Why? It is not easy to explain briefly: the reason lies in the design of the comparison operator for the record type. To teach Postgres to work with such an operator in exactly the same way as with scalar types, the type cache would need to be significantly redesigned. Secondly, you must remember to check this subquery (yes, VALUES is represented in the query tree as a subquery) for the presence of volatile functions. And that's it: one pass of the query tree mutator, doing a transformation quite similar to [2], replaces VALUES with an array, constifying it if possible. Curiously, the conversion is possible even if VALUES contains parameters, function calls, and complex expressions, like the one below:

CREATE TEMP TABLE onek (ten int, two real, four real);
PREPARE test (int,numeric, text) AS
  SELECT ten FROM onek
  WHERE sin(two)*four/($3::real) IN (VALUES (sin($2)), (2), ($1));
EXPLAIN (COSTS OFF) EXECUTE test(1, 2, '3');
/*
Seq Scan on onek
   Filter: (((sin((two)::double precision) * four) / '3'::real) = ANY ('{0.9092974268256817,2,1}'::double precision[]))
(2 rows)
*/
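The two safety checks described above (scalar rows only, no volatile functions) can be shown on a toy query tree. This Python sketch is only a model of the decision logic: the InValues and AnyArray node types and the volatile flag are hypothetical stand-ins for the real Postgres structures.

```python
from dataclasses import dataclass

@dataclass
class InValues:
    """Toy node for `<left> IN (VALUES (..), (..), ...)`."""
    left: str
    rows: list               # each row is a tuple of column expressions
    volatile: bool = False   # True if any expression calls a volatile function

@dataclass
class AnyArray:
    """Toy node for `<left> = ANY (ARRAY[...])`."""
    left: str
    elements: list

def values_to_any(node):
    """Return an AnyArray when the rewrite is safe, else the node unchanged."""
    if node.volatile:
        return node                       # volatile functions: do not touch
    if any(len(row) != 1 for row in node.rows):
        return node                       # only single-column (scalar) rows qualify
    return AnyArray(node.left, [row[0] for row in node.rows])

print(values_to_any(InValues("x", [(1,), (29,)])))
print(values_to_any(InValues("(x,y)", [(1, 1), (2, 2)])))
```

The first call is rewritten into the array form, while the second keeps the original node because its rows have two columns.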

The feature is currently being tested. The query tree structure is pretty stable, so there is little reason to modify the code between versions; since the dependencies on the core version are minimal, it can be used in Postgres down to version 10 and maybe even earlier. As usual, you can play with the library’s binaries, compiled in a typical Ubuntu 22 environment - it doesn’t have any UI and may be loaded statically or dynamically.

And now, the actual holy war that I mentioned above. Since we did this as an external library, we had to intercept the planner hook (to simplify the query tree before optimisation), which cost us an additional pass through the query tree. Obviously, most queries in the system will not need this transformation, and this operation will simply add overhead. However, when it works, it can provide a noticeable effect (and from my observations, it does).

Until recently, there was a consensus in the PostgreSQL community [3, 4]: if the problem can be fixed by changing the query itself, then there is no point in complicating the kernel code since this will inevitably lead to increased maintenance costs and (remembering Oracle's experience) will affect the performance of the optimiser itself.

However, watching the core commits, I notice that the community's opinion seems to be drifting. For example, this year, they complicated the technology of subquery to SEMI JOIN transformation by adding correlated subqueries [5]. A little later, they allowed the parent query to receive information about the sort order of the subquery result [6], although previously, to simplify planning, the query and its subqueries were planned independently. It looks like a step towards re-planning subqueries, doesn't it?

And what do you think? Is an open-source project capable of supporting multiple transformation rules that would eliminate the redundancy and complexity that the user introduces, trying to make the query more readable and understandable? And most importantly - is it worth it?

References

[1] F.41. tablefunc — functions that return tables
[2] OR-clause support for indexes
[3] Discussion on missing optimizations, 2017
[4] BUG #18643: EXPLAIN estimated rows mismatch, 2024
[5] Commit 9f13376. pull-up correlated subqueries
[6] Commit a65724d. Propagate pathkeys from CTEs up to the outer query

THE END.

October 2, 2024. Pattaya, Thailand.

Postgres query re-optimisation in practice

Andrei Lepikhov — Mon, 19 Aug 2024 01:01:58 GMT

Today's story is about a re-optimisation feature I designed about a year ago for the Postgres Professional fork of PostgreSQL.

Curiously, after finishing the development and testing the solution on different benchmarks, I found out that Michael Stonebraker et al. had already published some research in that area. Moreover, they used the same benchmark - Join Order Benchmark - to support their results. So, their priority is obvious. As an excuse, I would say that my code looks closer to real-life usage, and during the implementation, I ran into and solved many problems that weren’t mentioned in the paper. So, in my opinion, this post may still be helpful.

It is clear that re-optimisation belongs to the class of 'enterprise' features, which means it is not wanted in the community code. So, the code is not published, but you can play with it and repeat the benchmark using the published docker container for the REL_16_STABLE Postgres branch.

Introduction

What was the impetus to begin this work? It was caused by many real cases that may be demonstrated clearly by the Join Order Benchmark. How much performance do you think Postgres loses if you change the number of parallel workers it may employ from one to zero? A twofold regression? What about 10 or 100 times slower?

The black line in the graph below shows the change in execution time of each query between two cases: with parallel workers disabled and with a single parallel worker per gather allowed. For details, see the test script and EXPLAINs, with and without parallel workers.

As you can see, the typical outcome is about a twofold speedup, which is logical when work is divided between two processes. But sometimes we see a 10-fold speedup and even more, up to 500 times. Moreover, queries 14c, 22c, 22d, 25a, 25c, 31a, and 31c only finish their execution in a reasonable time with at least one parallel worker!

If you are hard-bitten enough to replicate this experiment, you'll quickly realise that the main obstacle lies in cardinality underestimation and NestLoop join. The optimiser's tendency to predict only a few tuples on the left and right side of the join and opt for a trivial (non-parameterised) NestLoop leads to a rapid escalation in query execution time, often spiralling towards infinity when multiple NestLoops are involved in a single join tree.

With parallel workers enabled, NestLoop has an alternative, Parallel HashJoin, which is less expensive because of the parallel scan on each join side. Hence, the current case is no more than a game of chance, but it demonstrates our issue: sometimes query execution time goes to the moon, and we can't even get EXPLAIN ANALYSE data to find out what's gone wrong.

In real-world scenarios, users rarely have the pg_query_state extension installed in the production instance, and auto_explain requires the query execution to be completed. Also, disabling NestLoop or MergeJoin reduces the optimiser's ability to find good query plans with parameterised NestLoop, as I have shown in the previous post. So, to find the origin of a specific issue, we at least need something in-core to get an execution state snapshot and, at best, a tool for dynamic replanning that fixes the optimiser's gaffes while remaining transparent to the application.

With these considerations in mind, I began the development.

How does it work?

Skipping the long sequence of false starts and unsuccessful code sketches, the architecture ended up with the schema shown below:

You can see the query execution schema with additional elements needed to implement re-optimisation in PostgreSQL. Yellow-coloured elements are in-core features, and green-coloured elements are subsystems that can be pushed out into an extension.

Decision Maker. First, the DBMS should identify queries that can potentially be re-optimised: it doesn't make sense to employ this heavy machinery for trivial queries or a single grouping. So, using the planner hook, the user can provide a clue and mark a plan as a 'supervised' one. As a result, one custom field was added to the PlannedStmt node to remember that the decision has already been made.

Subtransaction. In the case of a query interruption and before the next planning attempt, Postgres must release all acquired resources: locks, pinned buffers, memory, etc. The only way to do it provably correctly is by employing subtransaction machinery. The "Supervised" query must be executed inside such a subtransaction to revert the whole state before re-optimisation and re-execution.

ExecProcNode Hook. During execution, we have to check the trigger that the user has predefined for the query. This check should run from time to time at a point where the executor is in a consistent state: for example, we shouldn't allow interruptions in the middle of hash table building or sorting. Keep in mind that afterwards Postgres must be able to walk the execution state to find clues for re-optimisation, so this state must be consistent for both the walker and the ROLLBACK code. As far as I can tell, the most reliable place in the code is the ExecProcNode routine.

Trigger. Attached to the ExecProcNode hook, the trigger can be defined by a user, parameterised, and exported as a stored C procedure in an extension's UI. It employs the standard Postgres ERROR exception to interrupt execution with a specific error code that can be processed above by the error handler. The trigger has access to the query's Execution State and can watch any part of the query plan if needed. At the same time, it should be simple enough not to add much overhead for each produced tuple.

Error Handler. So far, the main ServerLoop translates any error coming from the portal to the client. But in the case of re-optimisation, it should catch error signals and, if one was produced by the trigger, launch the Execution State Analyser before aborting the subtransaction and, if needed, restarting query processing.

Execution State Analyser. Being a simple walker over the plan state, it implements a complicated subsystem for gathering instrumentation data for each node. It is a bit tricky because the current core code doesn't accept partial execution. It grabs an actual number of rows, number of groups, and size of data spilled to disk for the sake of hashing or sorting. As a part of an extension, it can be sophisticated, but not much, limited by the current set of planner hooks.

Selectivity Hook. Using data gathered from the partial execution state, an extension should be able to provide the optimiser with recommendations on cardinalities, number of groups, hash table sizes, and even an adequate work_mem value. Like the AQO, this feature strictly depends on these hooks. No such hook exists in core for now, but the selectivity hook, for example, is being discussed and may be committed in the near future.

Selectivity estimator. This is a key subsystem paired with the Execution State Analyser. Its most complicated part is the ability to correctly identify a specific join, scan, grouping, etc., during the early planning stage and match it to the plan node of the finalised plan state. It is also the most invasive technique because Postgres was not architecturally designed for it. Experiments with path signatures in the AQO extension have shown the fragility of such matching. So, in this project, I have chosen a more stable approach based on RelOptInfo signatures. The scope of this post is too limited to explain the idea in detail, but it may be done later if people show an interest in this technique.

Tuple Storage. As you can imagine, re-optimisation and subsequent re-execution are possible only if all results of the query execution are still held inside the backend. However, the Postgres receiver, by default, sends each produced tuple immediately to the client. Because the first message sent out from the instance disables the re-optimisation trigger, it was necessary to invent a tuple storage that delays shipping data to the client for some time (limited by the tuple buffer size) and allows re-optimisation, if needed.
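To make the interplay of the trigger, error handler, subtransaction and tuple storage easier to follow, here is a minimal Python model of the control flow. It is only a sketch, not the patch's code: ReplanSignal, execute, run_supervised and the cost numbers are hypothetical names and values, and "replanning" is reduced to picking the next candidate plan from a list.

```python
class ReplanSignal(Exception):
    """Hypothetical stand-in for the dedicated ERROR code raised by the trigger."""


def execute(plan, tuple_storage, budget):
    """Toy executor: produce plan['rows'] tuples at plan['cost_per_row'] each.

    The trigger is checked between tuples, i.e. only at a consistent point,
    which is the role the ExecProcNode hook plays in the design above.
    """
    spent = 0
    for row in range(plan["rows"]):
        spent += plan["cost_per_row"]
        if spent > budget:
            raise ReplanSignal()
        tuple_storage.append(row)  # buffered, not yet sent to the client


def run_supervised(candidate_plans, budget):
    """Run plans in order until one finishes; return (plan name, rows, attempts)."""
    for attempt, plan in enumerate(candidate_plans, start=1):
        tuple_storage = []
        is_last = attempt == len(candidate_plans)
        try:
            # The last candidate runs without a trigger so the query always ends.
            execute(plan, tuple_storage, float("inf") if is_last else budget)
            return plan["name"], tuple_storage, attempt
        except ReplanSignal:
            # Stand-in for analysing the state and aborting the subtransaction:
            # partial output is dropped, so the client never sees duplicates.
            tuple_storage.clear()
    raise ValueError("no candidate plans given")


plans = [
    {"name": "nested loop", "rows": 5, "cost_per_row": 10},
    {"name": "hash join", "rows": 5, "cost_per_row": 1},
]
result = run_supervised(plans, budget=20)
```

In this toy run the first plan is interrupted after buffering two tuples, the buffer is discarded, and the second plan delivers the complete result on the second attempt.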

Implementation caveats

The relatively simplistic design faced multiple difficulties during development in the sophisticated code of a well-rounded database system like PostgreSQL. The first problem that immediately bubbled up was dynamic query execution, as shown in the picture below.

A query can contain a function call. Such a function, in turn, can contain quite complex logic and execute queries inside its body. Their planning and execution happen in an independent and isolated execution context somewhere in the middle of the execution of the top-level query. So, the feature should identify such recursion and, in case of interruption, process the correct PlannedStmt tree. Moreover, functions can handle exceptions and employ savepoints. Because of that, we should be careful and disable re-optimisation if it happens inside a function call.

At the moment of interruption, some nodes will be in an interim state when they have called ExecProcNode to obtain another tuple. The current Postgres ExecutionEnd walker doesn't process this state correctly, and such a state must be implemented inside instrumentation structures. This change is also profitable for the pg_query_state extension and other tools that make snapshots of the query plan and want careful calculations of each node's cardinality.

How to finalise a partially executed query. It is not apparent, but query execution could involve parallel workers who are independent processes. When we interrupt execution in the primary process, it doesn't mean that workers will stop their work immediately after that. They will work for an arbitrary period of time, and it appears that the task of finalising their work and gathering instrumentation data is not so easy.

One more problem is figuring out when to switch off re-optimisation. If a trigger interrupts query execution, the interruption should be worthwhile. For example, if you set the execution time trigger to one second, but the query can't be executed in less than one minute, repeated replanning would be wasted effort without any meaningful effect, simply because most of the nodes may not have processed a single tuple yet. My quick solution was a trivial check of whether something meaningful has been learned since the last re-optimisation. If re-optimisation doesn't change anything in the plan or doesn't even yield new data from the partially executed state, the trigger conditions are relaxed - for example, by increasing the timeout value or the memory usage limit.

The next problem relates to the signature technique. Being a hash value, it can occasionally match the signature of two totally different nodes. If these plan nodes have highly different cardinalities (for example, one and 1E6), this can cause fluctuations in the cardinality prediction provided by the selectivity estimator. As a trivial solution, I just set a limit to the maximum number of re-optimisations for one query execution, but it does not seem to be the best solution.

A quite trivial but still existing problem is plpgsql information messages, which can produce some accidental output during the execution. To make this output consistent (do not send duplicate messages because of the query execution restart), we need to hold off on their delivery to the client until re-optimisation is possible and the query is not finished yet.

How does it help?

Multiple triggers can be invented: time, cardinality error, memory consumption, temporary file quota, etc. The architecture also allows a user to define custom triggers for specific purposes. In that particular case, we have chosen a variable-time trigger. To make this more practical, we added some flexibility to this trigger. If the statement_timeout value is set, the re-optimiser can increment the time gap (up to statement_timeout) if nothing beneficial has been earned since the last re-optimisation iteration.
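The variable-time trigger can be illustrated with a few lines of Python. This is a sketch under assumptions: the function name and the doubling factor are mine (the post does not say how fast the time gap grows), and only the cap at statement_timeout comes from the description above.

```python
def next_trigger_ms(current_ms, statement_timeout_ms, learned_something, factor=2):
    """Return the time trigger for the next re-optimisation attempt."""
    if learned_something:
        # The last attempt produced new information: keep the trigger as is.
        return current_ms
    # Nothing useful was learned: relax the trigger, capped by statement_timeout.
    return min(current_ms * factor, statement_timeout_ms)


# Worst case: no iteration ever learns anything new.
trigger_ms, schedule = 1_000, []
while trigger_ms < 600_000:
    schedule.append(trigger_ms)
    trigger_ms = next_trigger_ms(trigger_ms, 600_000, learned_something=False)
schedule.append(trigger_ms)
```

Under the assumed doubling, a one-second trigger reaches a ten-minute statement_timeout after about ten fruitless iterations, which bounds the time wasted on repeated partial executions.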

So, before launching this benchmark, I set the initial time trigger to 1 second and statement_timeout to about 10 minutes (see the script for details). The result of the benchmark execution is shown on the graph below (see Google Docs tables for raw data). Here, you can see a black line representing relative execution time (without parallel workers with re-optimisation divided by the case with a single parallel worker, no re-optimisation).

Compared to the previous graph, you can see that the peaks of 50-500 times degradation (and the unfinished executions, too) have been fixed by re-optimisation. The remaining spikes, where some queries run up to 6 times slower, represent poorly planned queries and can be attributed to issues in our re-optimisation logic, which is still a beta version.

The red line on the graph represents the total execution time of each query, including all re-optimisation iterations. From this standpoint, the outcome of the feature employment doesn't look so appealing: Postgres has spent much time in iterative partial executions, which tells us that blind re-optimisation is impractical in real life.

It is a dumb feature, isn't it?

Observing the results of the JOB benchmark, it is evident that in most cases, the total query execution time, including all the re-optimisations, is much higher than just one, maybe non-optimal execution. So, instead of speeding up, we have degradation, haven't we?

It is true. Used alone, this feature has too narrow a use case and doesn't make sense in practice. The only cases I see here are debugging and debriefing. But remember, a few weeks ago, I presented the plan freezing extension to you. Imagine, what if you can unite re-optimisation and plan freezing?

The most questionable part of the freezer is how to identify poorly planned queries and how to force Postgres to build a more optimal plan. It is precisely what the re-optimiser does! As a result, we can create a kind of self-tuning DBMS, which will 'adapt' to changed data and load. When setting triggers and calling the plan freezer after some profitable re-optimisation, Postgres will stick query plans into the cache. Control of the frozen plan's effectiveness can also be implemented by a time trigger, which can be explicitly set for the plan to a value exceeding the initial execution time by, for example, 20%. And now re-optimisation makes sense, doesn't it?

So, the purpose of this work was much broader than just developing a prototype of a re-optimisation feature. I aimed to invent a general approach underpinning query optimisation decisions, correcting mistakes, and eventually conserving CPU cycles ;) that can be at least partially autonomous and does not require vendor lock-in. This approach, as I believe, is doable and workable and can be profitable, especially in cloud configurations. Do you think it is worth a separate startup project?

THE END.

August 18, 2024. Paris, France

P.S.

Links:

Join Order Benchmark repository:
https://github.com/danolivo/jo-bench
Docker container with the re-optimisation patch:
https://hub.docker.com/r/danolivo/reopt
Utility files for the test reproduction:
https://github.com/danolivo/utility/tree/main/job-noworker-issue

Names of the GUCs and routines introduced with re-optimisation:

query_inadequate_execution_time - time trigger (in ms) - will start re-optimisation if the current execution time overreaches this value.
replan_overrun_limit - factor to identify acceptable cardinality prediction error in a plan node until re-optimisation starts.
replan_enable - enable/disable re-optimisation
show_node_sign - show details of re-optimisation in EXPLAIN.
replan_signal(pid) - routine to manually cause re-optimisation in the process

Probing indexes to survive data skew in Postgres

Andrei Lepikhov — Mon, 12 Aug 2024 00:01:35 GMT

This is the story of an unexpected challenge I encountered and a tiny but fearless response to address the Postgres optimiser underestimations caused by data skew, missing statistics or inconsistency between the statistics and the data. The journey began with a user's complaint about query performance, which had quite an unusual anamnesis.

The problem was with only one analytical query executed regularly by the schedule. For one of the involved tables, the query EXPLAIN had indicated a single tuple scan estimation, but the executor ended up fetching four million tuples from the disk. This unexpected turn of events led Postgres to choose parameterised NestLoop + Index Scans on each side of the join, causing the query to execute two orders of magnitude longer than with an optimal query plan. However, after executing the ANALYZE command, estimations became correct, and the query was executed fast enough.

Problem Analysis

The problematic table was a huge one and contained billions of rows. The user would load data in large batches over the weekends and immediately run the troubling query to identify new trends, comparing the fresh data with the existing data. One of the columns in the data was something like the current timestamp, which indicated the time of addition to the database, and it was unique for the whole batch. So, I immediately suspected that the user's data insertion pattern was the reason impacting query performance — something in statistics.

After discovery, I found that the source of errors was the estimation of trivial filters like 'x=N', where N had a massive number of duplicates in the table's column. Right after bulk insertion into the table, this filter was estimated by the stadistinct number. On the ANALYZE execution, this value was detected as a 'most common' value; its selectivity was saved in statistics, and at the subsequent query execution, this filter was estimated precisely by the MCV statistic.

Let's briefly dip into the logic of the equality filter selectivity to understand this behaviour. See the script, generating a table with highly skewed value distribution:

CREATE EXTENSION tablefunc;
CREATE TABLE norm_test AS
  SELECT abs(r::integer) AS val
  FROM normal_rand(1E7::integer, 5.::float8, 300.::float8) AS r;
ANALYZE norm_test;

Let's examine the statistics below for the column 'val'. The green curve shows the actual distribution of values in the column from 1 to 1600; the red dots are the most common values — they cover the top of the graph. The black line shows this column's number of distinct values (943).

Let's execute some simple scan SQL queries:

-- Involve MCV statistics ('5' inside the MCV stat)
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 5;
Gather  (rows=27333 width=4) (rows=26416 loops=1)
  ->  Parallel Seq Scan on norm_test
        Filter: (val = 5)

-- Frequent value but out of MCV
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 10;
Gather  (rows=8614) (actual rows=26583)
   ->  Parallel Seq Scan on norm_test  (rows=3589) (actual rows=8861)
         Filter: (val = 10)

-- Rare value
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 10000;
Gather  (rows=8614 width=4) (rows=0 loops=1)
  ->  Parallel Seq Scan on norm_test
        Filter: (val = 10000)

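As you can see, the best situation is when the value is in the MCV list. Otherwise, Postgres estimates the selectivity of the filter according to the formula:

selectivity = (1 - sum of MCV frequencies - null fraction) / (ndistinct - number of MCVs)

I.e., it subtracts the share of the most common values (and NULLs) from the whole table and divides the remainder evenly among the remaining distinct values (see var_eq_const for details); the row estimate is this selectivity multiplied by the number of tuples. As a result, every value outside the MCV list gets the same prediction. But what if almost all of the values are MCV? Let's check it: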

-- Add frequent values which are most of the data
CREATE TABLE norm_test1 AS SELECT gs % 100 AS val
  FROM generate_series(1,1E7) AS gs;

-- Add some rare values
INSERT INTO norm_test1 (val) SELECT gs
  FROM generate_series(101,105) AS gs;
VACUUM ANALYZE norm_test1;
ALTER TABLE norm_test1 SET (autovacuum_enabled = 'false');

-- Batch insertion of duplicates
INSERT INTO norm_test1 (val) SELECT 100 FROM generate_series(1,1E5);

EXPLAIN ANALYZE SELECT val FROM norm_test1 WHERE val = 100;

Gather  (rows=1) (actual rows=100000)
   ->  Parallel Seq Scan on norm_test1
         Filter: (val = '100'::numeric)

As you can see, we got precisely the estimation described in the issue above. After such a long explanation, how should Postgres handle such a scenario?
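Before answering, the arithmetic behind both estimates can be reproduced with a small model. The Python sketch below is a simplification of the non-MCV branch of var_eq_const, not the actual C code; the function name, its arguments and the sample MCV frequencies are assumptions made for illustration.

```python
def estimate_eq_rows(reltuples, ndistinct, mcv_freqs, null_frac=0.0):
    """Row estimate for 'col = const' when const is not in the MCV list."""
    # Share of the table left after removing MCVs and NULLs, clamped to [0, 1].
    selec = 1.0 - sum(mcv_freqs) - null_frac
    selec = min(max(selec, 0.0), 1.0)

    # Spread that share evenly over the remaining distinct values.
    other_distinct = ndistinct - len(mcv_freqs)
    if other_distinct > 1:
        selec /= other_distinct

    # A non-MCV value should not look more frequent than the rarest MCV.
    if mcv_freqs and selec > min(mcv_freqs):
        selec = min(mcv_freqs)

    # The planner never reports fewer than one row.
    rows = reltuples * selec
    return 1 if rows <= 1.0 else round(rows)


# Assumed statistics: 943 distinct values, 100 MCVs covering about 27% of rows.
moderate_skew = estimate_eq_rows(1e7, 943, [0.0027] * 100)

# Assumed statistics: 100 MCVs covering (almost) the whole table.
heavy_skew = estimate_eq_rows(1e7, 105, [0.01] * 100)
```

With the first set of assumed inputs, the estimate lands in the same ballpark as the one shown for val = 10. With the second, almost nothing is left to distribute among the non-MCV values, so the estimate collapses to a single row, as in the norm_test1 plan.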

Upon investigation, I discovered that Postgres has already implemented a solution: the 'index probing' technique for the inequality operator (like '<' or '>'). This technique employs the histogram to calculate the number of bins that fall into the inequality filter boundaries.

After reviewing the git history of this feature and the discussion, I realised that it suffers from performance issues. So, does it make sense to use the same trick for an equality operator? Would it be suitable for some sophisticated analytical queries? Let's try to implement this feature and assess the overhead afterwards with benchmarks.

Implementation Description

You can see the working implementation in the branch of my GitHub repository. The idea is as follows: If we can't use MCV for an equality expression and some empirical condition detects that distinct estimation can be suspicious, let's try to find an index that covers this column. With such an index, call the AM index_getbitmap routine to estimate the number of tuples that satisfy the condition. Picking the NonVacuumableSnapshot will guarantee an upper-bound estimation.

The index_getbitmap routine collects only the TIDs of tuples, not the tuples themselves. Of course, this estimation process can be time-consuming when many tuples match. A better option could be to make two IndexScan operations - a forward and a backward scan on the target const value - to find the lower and upper bounds and roughly estimate the number of tuples by the number of pages between them. But as far as I can see, the AM interface in Postgres is still not ready to provide the caller with the (page, offset) pair of the first tuple found.

One consideration that relieves the aftereffects of this approach on performance is that calling index_getbitmap pulls index pages from the disk into memory that can be reused during query execution.

The crucial point is the condition under which we involve the index probing approach. There is room for improvisation, but being short on time, I just invented a trivial one: look at the histogram's bounds and see if the value fits within them. If it is outside the histogram's coverage, we assume that the statistics are untrustworthy and probe an index.
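The condition is simple enough to express in a few lines. The Python sketch below only models the decision; the function name and arguments are hypothetical, and treating a missing histogram as "do not probe" is my assumption rather than something the post specifies.

```python
def should_probe_index(const, mcv_values, histogram_bounds):
    """Decide whether the estimate for 'col = const' looks suspicious."""
    if const in mcv_values:
        return False  # MCV statistics already give a precise estimate
    if not histogram_bounds:
        return False  # assumption: without a histogram we cannot judge
    low, high = histogram_bounds[0], histogram_bounds[-1]
    return const < low or const > high


# Illustrative statistics: values 0..99 are MCVs, the histogram covers 101..105.
probe_new_batch = should_probe_index(100, set(range(100)), [101, 103, 105])
probe_known_rare = should_probe_index(103, set(range(100)), [101, 103, 105])
```

A freshly loaded value that the statistics have never seen falls outside the bounds and triggers a probe, while a rare value that the histogram already covers does not.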

Benchmarking

To find the worst case, I employed pgbench, as usual. The benchmarking script looks like the following:

pgbench -i -s 10
psql -c "ALTER TABLE pgbench_accounts
  DROP CONSTRAINT pgbench_accounts_pkey;"
psql -c "ALTER TABLE pgbench_branches
  DROP CONSTRAINT pgbench_branches_pkey;"
psql -c "ALTER TABLE pgbench_tellers
  DROP CONSTRAINT pgbench_tellers_pkey;"

psql -c "CREATE INDEX ON pgbench_accounts(aid);"
psql -c "CREATE INDEX ON pgbench_branches(bid);"
psql -c "CREATE INDEX ON pgbench_tellers(tid);"

pgbench -c 5 -j 5 -T 180 -P 3 -f test_s.pgb

Here, we dropped the unique indexes and created non-unique ones because the optimiser uses unique indexes to return a single-tuple estimation. The query set contained a single SELECT whose constant quite frequently falls outside the histogram boundaries:

\set aid random(-5000000 * :scale, 5000000 * :scale)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

Analysing the results of this benchmark, I observed a 5% overhead on touching histogram statistics and an additional 5-6% on probing indexes. It looks like a huge overhead, but the code is just a simple sketch. Who anticipated an ideal result?

That's it for today. In conclusion, I must emphasise that although this approach worsens performance, it also holds great potential for application in specific, narrow cases. The question is, might we effectively limit its involvement using an empirical formula? May a global or table-only GUC be a lifesaver in that case?

In unlucky situations where data skew makes estimations so bad that they cause a performance slump of an order of magnitude, we are left with no tools except a schema change. In such cases, this feature can be a handy solution.

THE END.

August 11, 2024. Paris, France

Does PostgreSQL respond to the challenge of analytical queries?

Andrei Lepikhov — Mon, 05 Aug 2024 03:00:16 GMT

This post was triggered by Crunchy Data's recent article and the YugabyteDB approach to using Postgres as an almost stateless entry point, which parses incoming analytic queries and splits the work among multiple instances managed by another database system that fits the task of storing and processing massive data volumes and can execute relatively simple queries.

The emergence of foreign data wrappers (FDW) and partitioning features has made these solutions possible. It seems that being compatible with the Postgres infrastructure and its mature parser/planner is valuable for vendors enough to consider implementing such hybrid data management systems.

So, now we face the fact that Postgres is used to do analytics on big data. An essential question immediately emerges: does it have enough tools to process complex queries?

What Are Analytic Queries

It is quite a general term. But after reading the course materials on the subject, I can summarise that analytic queries typically:

involve multiple joins
use aggregates, often mathematical ones
need to process large subsets of table data in a single query
are ad hoc in nature, so it is difficult to predict when they will arrive

So, looking into the Postgres changes, we should discover what has changed in aggregate processing, join ordering and estimation, and table scanning.

What was the rationale?

The technique of using Postgres as a middleware between user and storage has been triggered by the emergence of FDW and partitioning features. Parallel execution doesn't help much with processing foreign tables (partitions). Still, it is beneficial for speeding up the local part of the work.

The basics of these features were introduced in 2010 - 2017. Now, Postgres can push down to a foreign server queries containing scan operations, joins, and orderings. We also have asynchronous append, which allows us to gather data from foreign instances simultaneously. Looking ahead, the community has quite an active discussion on aggregate pushdown.

Partitioning includes pruning techniques (at the planning and execution stages) that restrict query pushdown to only the instances containing the necessary data. One more essential thing - partitionwise join - allows the optimiser to choose a specific way to execute a join for each pair of joining partitions.

FDW/Partitioning technique is not ideal now because it has many shortcomings. For example:

We can prune only partitions, not a query subtree.
We can't declare some table as a 'dictionary' that exists in any instance and join such a table with foreign partitions simultaneously on a remote instance.
The pruning technique often can't remove partitions because it lacks statistical data about the partitions' min/max values.

However, even with these and many other problems, Postgres has hooks and an FDW API that are flexible enough to allow a professional development team to arrange the code according to the project's needs. Partitioning abilities are actively maturing. I see discussions (see, for example, [1, 2, 3, 4]) on enhancing the optimiser to work better with partitions. And I think, soon, we could see more hybrid systems with primary Postgres and some secondary DBMS, chosen according to the purpose.

Regardless, secondary DBMS typically performs low-level preparatory operations with data. Aggregates, complex subqueries, window functions, and other stuff are still executed locally, and the issue is how mature the optimiser is in finding an effective way to process this data after pulling it from the remote side.

After reviewing the code repository, I can confirm that the core developers are actively addressing the challenge of identifying and mitigating bottlenecks in the optimiser.

What is the progress?

To provide a comprehensive overview, please refer to the table below, which outlines my selection of the top commits that have impacted the optimiser since 2010:

Commits in this table can be grouped by the 'feature' key. See my categorisation below.

ProSupport. The initial problem addressed is the estimation of a FunctionScan operation. I briefly mentioned this issue about a month ago. The main problem lies in the optimiser's inability to precisely estimate the cost and cardinality of functions that generate data for the query. In 2019, the community found an elegant and adaptable solution - the 'prosupport' routine concept. This routine can be registered as a function that provides the necessary information to the optimiser and can be stored in the database. This approach allows users or extensions to tune the planning decisions. In 2022 and 2023, these capabilities were extended to window functions. I currently see an attempt to use them with aggregates, which appears to be an important evolution of the technique.

Extended Statistics. People still like and use ORMs and REST APIs despite their apparent inefficiency. To tackle the challenge of bad estimations caused by multi-clause filters, the community introduced extended statistics in 2017 - 2019. It provides three types of statistics: MCV, dependency and distinct, which detect hidden dependencies between columns in a table and improve estimations.

I don't see this feature in wide use: at least, not many reports on its usage are available on the Internet. IMO, this is caused by its opacity, computational cost and the necessity to manually pick the columns or expressions to build the statistics on.

Incremental Sort. It is an excellent idea that introduces a whole new way to execute a query into the optimiser. As an alternative to full sort and further re-sorting of data, it can find a path where the executor would use presorted input (for example, by x1,x2) and sort the data by x3, necessary for the following operation inside the groups of duplicated x1,x2, providing the output sorted by x1,x2,x3. This approach relieves the typical problem of analytic queries, which frequently require sorted output for aggregations on various query levels. It is especially effective in the case of the LIMIT operator - just look at this example:

SET enable_incremental_sort = off;
EXPLAIN (ANALYZE, TIMING OFF)
SELECT * FROM tenk1 ORDER BY unique1, ten LIMIT 100;

RESET enable_incremental_sort;
EXPLAIN (ANALYZE, TIMING OFF)
SELECT * FROM tenk1 ORDER BY unique1, ten LIMIT 100;

Without incremental sort, we must extract all the tuples from the table even when Heap Sort will return only 100 tuples (the explain has been edited to be laconic):

Limit  (cost=827.19..827.44) (actual rows=100 loops=1)
   ->  Sort  (cost=827.19..852.19 rows=10000) (actual rows=100 loops=1)
         Sort Key: unique1, ten
         Sort Method: top-N heapsort  Memory: 112kB
         ->  Seq Scan on tenk1  (cost=0.00..445.00)
             (actual rows=10000 loops=1)
 Execution Time: 5.404 ms

However, incremental sort allows Postgres to employ an index scan and provide partially sorted input to the sort module. Also, it is not necessary to scan the whole table, which is crucial in the case of analytic queries and massive tables:

Limit  (cost=0.46..21.46) (actual rows=100 loops=1)
   ->  Incremental Sort  (cost=0.46..2100.20) (actual rows=100 loops=1)
         Sort Key: unique1, ten
         Presorted Key: unique1
         ->  Index Scan using tenk1_unique1 on tenk1
             (cost=0.29..1650.20) (actual rows=101 loops=1)
 Execution Time: 0.318 ms

Moreover, I didn't find analogues to this node in other database systems.
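The mechanics of the node are easy to model. The following Python sketch is an illustration of the idea, not the executor's algorithm: the function names and the toy data are assumptions. It sorts each group of equal presorted keys separately and, combined with a LIMIT, stops reading the input early.

```python
from itertools import groupby, islice


def incremental_sort(rows, presorted_key, full_key):
    """Yield rows ordered by full_key, given input already ordered by presorted_key."""
    for _, group in groupby(rows, key=presorted_key):
        # Only one group of equal presorted keys is held and sorted at a time.
        yield from sorted(group, key=full_key)


def counting(iterable, counter):
    """Pass rows through while counting how many were read from the input."""
    for item in iterable:
        counter[0] += 1
        yield item


# Toy input: ordered by the first column, unordered by the second within a group.
rows = [(i // 3, -i) for i in range(30)]
read = [0]
top4 = list(islice(
    incremental_sort(counting(rows, read), lambda r: r[0], lambda r: r),
    4,
))
```

The LIMIT-like islice stops the generator after four rows, so only the first two groups (plus one look-ahead row) are ever read, similar to the index scan above that fetched 101 rows instead of 10000.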

Memoize. It is designed to extend the parameterised NestLoop JOIN technique to more use cases. Its task is to cache tuples fetched from the inner input of a NestLoop join. The idea extends the Materialize technique. Imagine that the inner subtree of the join produces too many tuples to cache them all, and the estimated cardinality of the outer subtree is big enough to fear that repeated loops through the inner side will kill performance. Parameterised NestLoop remains the best solution because a parameterised index scan extracts only a tiny subset of tuples from the inner side. If the optimiser also predicts many duplicated values in the outer input, it can insert a Memoize node on top of the inner side to avoid a rescan when the key value has already come from the outer side.

Let me show the effect of Memoize with a simple query:

EXPLAIN ANALYZE
SELECT COUNT(*),AVG(t1.unique1) FROM tenk1 t1
INNER JOIN tenk1 t2 ON t1.unique1 = t2.twenty
WHERE t2.unique1 < 1000;

It is extracted from regression tests, and a description of the tenk1 table can be found there. Disabling Memoize, we get the query plan:

 Aggregate  (cost=448.24..448.25) (rows=1 loops=1)
   ->  Merge Join  (cost=427.65..443.24) (rows=1000 loops=1)
         Merge Cond: (t1.unique1 = t2.twenty)
         ->  Index Only Scan using tenk1_unique1 on tenk1 t1
             (rows=21 loops=1)
         ->  Sort  (cost=427.36..429.86) (rows=1000 loops=1)
               Sort Key: t2.twenty
               ->  Bitmap Heap Scan on tenk1 t2
                   (cost=20.04..377.54) (rows=1000 loops=1)
                     Recheck Cond: (unique1 < 1000)
                     ->  Bitmap Index Scan on tenk1_unique1
                         (cost=0.00..19.79 width=0) (rows=1000 loops=1)
                           Index Cond: (unique1 < 1000)
 Execution Time: 6.512 ms

Here is a good example of an effective MergeJoin: with sorted inputs, the merging algorithm lets the index scan stop after fetching only 21 tuples from the outer side. But what about NestLoop in that case? Could it be competitive? Disable MergeJoin and HashJoin and see the result:

 Aggregate  (cost=815.04..815.05) (rows=1 loops=1)
   ->  Nested Loop  (cost=20.32..810.04) (rows=1000 loops=1)
         ->  Bitmap Heap Scan on tenk1 t2
             (cost=20.04..377.54) (rows=1000 loops=1)
               Recheck Cond: (unique1 < 1000)
               ->  Bitmap Index Scan on tenk1_unique1
                   (cost=0.00..19.79) (rows=1000 loops=1)
                     Index Cond: (unique1 < 1000)
         ->  Index Only Scan using tenk1_unique1 on tenk1 t1
             (cost=0.29..0.42) (rows=1 loops=1000)
               Index Cond: (unique1 = t2.twenty)
 Execution Time: 102.476 ms

Much worse. We get the same 1000 tuples from one side, but running 1000 index scans, each to obtain a single tuple, makes this case slower. It is precisely where caching could help if any of these 1000 loops return the same tuple. Enable Memoize and see what will happen:

 Aggregate  (cost=416.40..416.41) (rows=1 loops=1)
   ->  Nested Loop  (cost=20.33..411.39) (rows=1000 loops=1)
         ->  Bitmap Heap Scan on tenk1 t2
             (cost=20.04..377.54) (rows=1000 loops=1)
               Recheck Cond: (unique1 < 1000)
               ->  Bitmap Index Scan on tenk1_unique1
                   (cost=0.00..19.79) (rows=1000 loops=1)
                     Index Cond: (unique1 < 1000)
         ->  Memoize  (cost=0.30..0.43) (rows=1 loops=1000)
               Cache Key: t2.twenty
               Hits: 980  Misses: 20
               ->  Index Only Scan using tenk1_unique1 on tenk1 t1
                   (cost=0.29..0.42) (rows=1 loops=20)
                     Index Cond: (unique1 = t2.twenty)
 Execution Time: 6.046 ms

The plan stays the same, but the Memoize node in 980 inner rescans returned a cached copy of the tuple instead of looking up the table. It has also provided an effect: you can see that the total plan cost is better than the two previous ones, and the execution time is at least not worse.
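The hit and miss counters in the plan can be reproduced with a toy cache. The Python sketch below is a simplified model: the class and variable names are hypothetical, the outer keys are generated so that only 20 distinct values occur among 1000 rows (mirroring the counters above), and, unlike the real node, the cache here is unbounded and never evicts entries.

```python
class Memoize:
    """Cache the result of a parameterised inner lookup, keyed by the parameter."""

    def __init__(self, lookup):
        self.lookup = lookup
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache[key] = self.lookup(key)  # the expensive inner rescan
        return self.cache[key]


# Inner side: a unique "index" from key to row.
inner_index = {i: ("inner row", i) for i in range(10000)}
# Outer side: 1000 rows carrying only 20 distinct join keys (an assumption).
outer_keys = [i % 20 for i in range(1000)]

memo = Memoize(inner_index.get)
joined = [(key, memo.get(key)) for key in outer_keys]
```

Only the first occurrence of each key reaches the inner lookup; every later occurrence is served from the cache, which is why the index scan in the plan runs 20 loops instead of 1000.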

Pull-up subqueries. In my experience, intricate analytic queries often employ subqueries in expressions. Such a subquery can depend on the data from the wrapping query block (aka correlated subqueries), which leads to complete subquery evaluation each time the expression is called. If the expression is a filter or join clause, the executor will evaluate it on each incoming tuple.

It is a common problem that resolves with query tree transformation rules, which have been researched since the 1980s. A trivial subquery is transformed to InitPlan and evaluated once, and the query uses its materialised output. If the subquery depends on parameters, it can frequently be transformed to SEMI JOIN with lateral references.

Postgres supports the transformation of simple subqueries and, in 2024, added restricted support for correlated subqueries. IMO, development in this area is crucial to speed up analytics, especially auto-generated queries.

Let me demonstrate this technique with the example below:

EXPLAIN (ANALYZE, TIMING OFF, COSTS ON)
SELECT * FROM tenk1 A
WHERE A.hundred IN (SELECT B.hundred FROM tenk2 B WHERE B.unique1 = A.odd);

This query contains one correlated subquery. Turning off the transformation, we get the plan:

 Seq Scan on tenk1 a  (cost=0.00..43420.00) (actual rows=100 loops=1)
   Filter: (ANY (hundred = (SubPlan 1).col1))
   Rows Removed by Filter: 9900
   SubPlan 1
     ->  Index Scan using tenk2_unique1 on tenk2 b
         (cost=0.29..8.30) (actual rows=1 loops=10000)
           Index Cond: (unique1 = a.odd)
 Execution Time: 87.182 ms

After transforming the subquery into a SEMI JOIN, the optimiser finds a better (according to the cost model) plan that executes about four times faster:

 Hash Semi Join  (cost=595.00..1215.00) (actual rows=100 loops=1)
   Hash Cond: ((a.odd = b.unique1) AND (a.hundred = b.hundred))
   ->  Seq Scan on tenk1 a (actual rows=10000 loops=1)
   ->  Hash (actual rows=10000 loops=1)
         ->  Seq Scan on tenk2 b  (actual rows=10000 loops=1)
 Execution Time: 20.722 ms

Even the employment of Index Scan in the subquery doesn't help much without transformation: looping repeatedly on each tuple drastically degrades the performance.
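To make the difference between the two strategies tangible, here is a small Python sketch of both. The table contents and helper names are hypothetical, and NULL semantics are ignored; the point is only that the per-row subquery and the hash semi join return the same rows while doing very different amounts of work.

```python
def filter_with_subplan(a_rows, b_rows):
    """Re-evaluate the correlated subquery for every outer row (the SubPlan way)."""
    out = []
    subplan_runs = 0
    for a in a_rows:
        subplan_runs += 1
        sub = [b["hundred"] for b in b_rows if b["unique1"] == a["odd"]]
        if a["hundred"] in sub:
            out.append(a)
    return out, subplan_runs


def hash_semi_join(a_rows, b_rows):
    """Build a hash set on the inner side once, then probe it once per outer row."""
    built = {(b["unique1"], b["hundred"]) for b in b_rows}
    return [a for a in a_rows if (a["odd"], a["hundred"]) in built]


# Toy data, not the real tenk1/tenk2 contents.
a_rows = [{"unique1": i, "hundred": i % 100, "odd": (i * 7) % 200} for i in range(200)]
b_rows = [{"unique1": i, "hundred": i % 100} for i in range(200)]

subplan_result, subplan_runs = filter_with_subplan(a_rows, b_rows)
semi_join_result = hash_semi_join(a_rows, b_rows)
print(subplan_result == semi_join_result, subplan_runs)
```

The semi join emits each outer row at most once, which is why a set membership test is enough on the probe side.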

I discovered that MS SQL Server includes diverse pull-up transformation techniques for simple and correlated subqueries. The picture is less clear for Oracle, where, according to the documentation, the transformation may be forced by using hints.

ORDER-BY/DISTINCT Aggregates. This improvement is invisible to the user, yet it sometimes reduces execution time drastically. The main idea is to discover the orderings required by the aggregates, find the most common one, and sort incoming data before calculating these aggregates. To understand the effect, look at the difference between the same query executed by PG13 and PG17:

EXPLAIN (ANALYZE, TIMING OFF, COSTS ON)
SELECT sum(unique1 ORDER BY ten), sum(unique1 ORDER BY ten,two)
FROM tenk1 GROUP BY ten;

-- PG13:

/*
GroupAggregate  (cost=1108.97..1209.02) (actual rows=10 loops=1)
  Output: sum(unique1 ORDER BY ten), sum(unique1 ORDER BY ten, two), ten
  Group Key: tenk1.ten
  ->  Sort  (cost=1108.97..1133.95) (actual rows=10000 loops=1)
        Output: ten, unique1, two
        Sort Key: tenk1.ten
        ->  Seq Scan on public.tenk1  (cost=0.00..444.95)
              Output: ten, unique1, two
Execution Time: 116.375 ms
 */

-- PG17:

/*
GroupAggregate  (cost=1109.39..1209.49) (actual rows=10 loops=1)
  Output: sum(unique1 ORDER BY ten), sum(unique1 ORDER BY ten, two), ten
  Group Key: tenk1.ten
  ->  Sort  (cost=1109.39..1134.39) (actual rows=10000 loops=1)
        Output: ten, unique1, two
        Sort Key: tenk1.ten, tenk1.two
        ->  Seq Scan on public.tenk1  (cost=0.00..445.00)
              Output: ten, unique1, two
Execution Time: 12.650 ms
 */

Presorting tuples and eliminating the sort inside each aggregate gives an almost tenfold speedup (116 ms vs 12.65 ms). Curiously, the change in execution time is not reflected in the cost value at all. Does it indicate an area where the optimiser cost model could be improved further?
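The choice of the presort key can be sketched in a few lines of Python. This is a simplified rule of my own for illustration, not the planner's actual algorithm: among the requested aggregate orderings, pick the one that has the most other orderings as its prefix, because a single sort by that key serves all of them.

```python
def choose_presort(agg_orderings):
    """Pick the ordering that serves the most aggregates (each one being its prefix)."""
    best, best_count = (), 0
    for cand in agg_orderings:
        served = sum(1 for o in agg_orderings if tuple(cand[:len(o)]) == tuple(o))
        if served > best_count:
            best, best_count = tuple(cand), served
    return best, best_count


# Orderings requested by the two aggregates in the query above.
orderings = [("ten",), ("ten", "two")]
sort_key, served = choose_presort(orderings)

# Toy rows: one sort by the chosen key is also a valid sort by every served prefix.
rows = [{"unique1": u, "ten": u % 10, "two": u % 2} for u in range(100)]
presorted = sorted(rows, key=lambda r: tuple(r[c] for c in sort_key))
print(sort_key, served)
```

For the example query this picks (ten, two), the same sort key that appears in the PG17 plan.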

Make Vars be outer-join-aware. The last feature, introduced recently, in 2023, is so internal and hidden from the typical user that, I think, only a few people know about it: machinery to detect that incoming data can contain NULL values.

It is worth mentioning because of its great potential. Many queries contain NULL checks. Initially, the optimiser estimated the number of NULL values only by looking at the table statistics. Sometimes, table columns do not contain any NULLs or even have NOT NULL constraints. Still, in a query containing an OUTER JOIN, an expression that refers to such a column may produce NULLs above the join. Such 'generated' NULLs frequently cause wrong estimations, mostly cardinality underestimation, which results in choosing the NestLoop join algorithm.
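A tiny Python sketch shows where such 'generated' NULLs come from. The tables and the left_join helper are hypothetical; the column amount never holds a NULL in its table, yet after the outer join most of its values are NULL.

```python
def left_join(left, right, key):
    """Toy LEFT JOIN: unmatched left rows are extended with a missing right side."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    out = []
    for l in left:
        matches = index.get(l[key])
        if matches:
            for r in matches:
                out.append((l, r))
        else:
            out.append((l, None))   # null-extended row produced by the join itself
    return out


orders = [{"id": 1}, {"id": 2}, {"id": 3}]
payments = [{"id": 1, "amount": 10}]   # 'amount' contains no NULLs in the table

joined = left_join(orders, payments, "id")
null_amounts = sum(1 for _, p in joined if p is None)
print(null_amounts, len(joined))
```

Table statistics alone would report a zero NULL fraction for amount, while a check like "amount IS NULL" above the join matches two of the three rows.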

What more can we do?

Deciding which features we need is difficult because we have to envision the effect each of them can bring. However, by looking at alternatives like MS SQL Server and the GPOrca optimiser, which have some advantages, I can roughly outline the necessary techniques.

First and foremost, it is the further evolution of extended statistics. SQL Server has diverse options for this type of statistics, which are used intensively to estimate scans and joins. It also has some machinery for gathering statistics on the fly, similar to what is described in the [DeWitt1998] paper.

Having points of on-the-fly statistics collection, combined with alternative query subplans and dynamic switching between them right during execution (see Alena Rybakina's WIP report at the September 2024 Postgres Conference), can allow complex queries to survive and be executed in a sane time.

So far, I don't see any activity on the hackers mailing list around developing subquery pull-up techniques, so the community is not pushing this topic. IMO, the main reason is the efficiency issue: although correlated subquery transformation is well described in scientific papers, it can increase execution time in some cases. As a result, the performance and technical aspects of this technique still need to be worked out before any further progress.

Also, the community has discussed possible ways to modify the sort model and improve the sorting and shuffling of group-by-columns. This topic looks interesting to work on in the next development cycle.

In the end, I should state that the progress is obvious. Some new and unique features are being introduced. However, development is still not as fast as people who operate fast-growing data would like. I feel it makes sense to extend the set of hooks to cover (at least) selectivity estimation, subquery or expression tree transformation, and node execution. Maybe we can allow custom statistics as well. This would let the external (non-core) community implement new techniques ahead of the core.

Are you okay with the current state of the PostgreSQL planner and its roadmap?

THE END.

August 4, 2024. Paris, France

Designing a Prototype: Postgres Plan Freezing

Andrei Lepikhov — Mon, 29 Jul 2024 01:00:47 GMT

This story is about a controversial PostgreSQL feature - a query plan freezing extension (see its documentation for details) - and the coding techniques underpinning it. The design process had one peculiarity: I had to invent it from scratch in no more than three months or throw the idea away, and because of that, solutions were devised and implemented on the fly. The time limit led to some impromptu findings, which may be helpful in other projects.

Developers are aware of the plan cache module in Postgres. It enables a backend to store the query plan in memory for prepared statements, extended protocol queries, and SPI calls, thereby saving CPU cycles and potentially preventing unexpected mishaps that result in suboptimal query plans. But what about pinning the plan for an arbitrary query if someone thinks it may be profitable? Can it be useful and implemented without core changes or a massive decline in performance? Could we make such a procedure global, applied to all backends? This is especially important because prepared statements are still limited to the backend where they were created.

Before doing anything, I looked around and found some related projects: pg_shared_plans, pg_plan_guarantee, and pg_plan_advsr. Unfortunately, at that time, they looked like research projects and didn't demonstrate any inspiring ideas on reliably matching cached query plans to incoming queries.

My initial reason to start this project was far from plan caching: at that time, I was designing distributed query execution based on the FDW machinery and the postgres_fdw extension in particular. That project is now known as 'Shardman'. While implementing and benchmarking distributed query execution, I found that the worst issue limiting the speedup of queries that extract a small number of tuples from large distributed tables (distributed OLTP) is repeated query planning on each remote server, even when you know that your tables are distributed uniformly and the plan may be fully identical on each instance. Working on different solutions, I realised that remote-side queries usually have a much simpler structure than the original query and are often similar across various queries (up to different constants). In that case, the most straightforward way was to 'freeze' the plan of the remote-side query and reuse it the next time. What could be simpler to implement, given that a plan cache already exists in the core?

In general, the idea is relatively trivial: write a shared library that employs the planner hook, plus an extension that provides a UI, as shown in the picture:

Our extension is labelled 'sr_plan' in the diagram above, abbreviating the phrase 'save/restore plan'. Being the last module in the chain of planner_hook calls, it can look up the cache of previously stored query plans and, on a positive match, return the stored plan, avoiding the planning process altogether!

So, bravely starting the project, I immediately encountered the first problem: how to match a query to the corresponding plan in the cache? SPI, prepared statements, and the extended protocol use an internal pointer to the plan or a predefined name to identify a plan in the plan cache. That's not our case: for an arbitrary incoming query, the backend has to look into the plan cache and find a query plan that can be correctly used to execute this query. What's more, the initial query string is transformed into an internal representation and passes through several stages until the final plan is built. Look at the picture:

Query transformation steps until the final plan

Here, you can see that one query can be transformed into multiple parse trees, sometimes having nothing in common with the initial query, through rewriting rules (which can be altered before the next time the query comes). In turn, each parse tree can be implemented by multiple query plans. Remember also that the indexes used in the plan are not mentioned in the query or the corresponding parse tree.

Rewriting rules, table names, sets of columns and indexes - all of that can also be altered. Moreover, I predict the user will complain if changing just one whitespace character in the query causes it to stop matching the frozen plan. It's quite a fragile technique, isn't it? Summarising the issues mentioned above, we can't just remember the query string and the corresponding plan to prove that this plan may be used to execute this query.

After spending a couple of days, I realised that the only proof that a specific cached plan can correctly execute the query is the equality of parse trees plus some checking of the indexes mentioned in the plan. Match parse trees? Easy! Just use the in-core routine equal() - that's enough!

Altering database objects is not an issue in this scheme. The plan cache's invalidation machinery guarantees that if some object mentioned in the plan is altered, all plans mentioning it will be marked as 'invalid'.

Matching parse trees instead of query text has one more positive outcome: the internal representation is stable against many changes in the query text, like whitespace or upper/lower case letters. But, as usual, it has a downside: comparing trees is not cheap. Imagine you have frozen 100 query plans in the cache. How much overhead do you get by comparing each incoming query with up to 100 trees? And what if this query is not even in the frozen set?

Fortunately, since Postgres 13, this question has had a quick and terse answer: queryId. This is an in-core feature that generates a hash value for each query tree. This hash is based on most of the query elements, such as tables, expressions and constants. Look at this picture:

Having queryId, we can build a hash table with queryId as the key. Each entry of this hash table contains a pointer to the head of a list of frozen plans with the same queryId. Instead of passing through all 100 frozen plans, we only need to compare against a small fraction. Quick experiments have shown that with queryId generated by the standard JumbleQuery technique, we almost always have only one plan in the class with the same queryId.
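The lookup structure can be sketched in Python as follows. Everything here is a stand-in: toy_query_id replaces the core jumbling code with a CRC over the tree's text form, parse trees are plain tuples, and the class name is hypothetical.

```python
import zlib
from collections import defaultdict


def toy_query_id(parse_tree):
    """Stand-in for the in-core queryId: a cheap hash of the whole tree."""
    return zlib.crc32(repr(parse_tree).encode())


class FrozenPlanCache:
    def __init__(self):
        self._buckets = defaultdict(list)   # queryId -> [(parse_tree, plan), ...]
        self.tree_comparisons = 0

    def freeze(self, parse_tree, plan):
        self._buckets[toy_query_id(parse_tree)].append((parse_tree, plan))

    def lookup(self, parse_tree):
        bucket = self._buckets.get(toy_query_id(parse_tree), ())
        for frozen_tree, plan in bucket:
            self.tree_comparisons += 1      # the expensive equal()-like comparison
            if frozen_tree == parse_tree:
                return plan
        return None


cache = FrozenPlanCache()
for i in range(100):
    cache.freeze(("select", f"table_{i}", ("=", "x", "$1")), f"plan_{i}")

found = cache.lookup(("select", "table_42", ("=", "x", "$1")))
missing = cache.lookup(("select", "not_frozen", ("=", "x", "$1")))
print(found, missing, cache.tree_comparisons)
```

A query outside the frozen set usually lands in an empty bucket, so it pays for one hash computation and no tree comparisons at all.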

Hmm, you might say doubtfully: what about parameterised queries? If we want to use a frozen plan for arbitrary incoming queries, we have to store a parameterised, aka 'generic', plan and employ it to execute incoming queries with constants instead of parameters. How could queryId help us in this case?

This question didn't have an easy answer. Having experience with the autoprepare feature, I remember how many hurdles a developer must overcome if all the constants in the plan are replaced with parameters before freezing. What is less obvious is that one parameterised plan may be effective for one set of constants and terrible for another.

So, we have to know which positions in an expression to treat as parameters. The solution I came up with was trivial: leave the choice to the user (with the hidden hope of inventing an analysis procedure in the future that finds correlations between parameters and the plans built) and divide the freezing procedure into two stages: registration and sticking into the plan cache.

Registration tells our extension that the query with a specific queryId is under control, and each plan generated for this query must be nailed down in the backend's plan cache, overwriting the plan built during the previous execution. In our UI, it looks like a query:

SELECT sr_register_query(query_string [, parameter_type, ...]);

For example:
SELECT sr_register_query('SELECT count(*) FROM a WHERE x = $1');

Using '$N' in the query, you mark the parameterised parts of the incoming query. The optional parameter types let you force the type of each parameter. Registration stores the query text, query tree, and set of parameters (with their positions in the tree) inside the extension memory context.

Registration impacts only the backend where it was registered. Afterwards, you can play locally with any GUCs, hints, or anything else to achieve the desired query plan without fear of influencing the instance's performance. After that, by executing the following query:

SELECT sr_plan_freeze();

you can stick the plan in the local backend, and it will be lazily passed to the plan caches of other backends connected to the same database. Spreading it across the instance's backends is relatively trivial - just employ DSM hash tables and a flag to signal backends to check the consistency of their caches with the shared storage. Serialisation/deserialisation routines can convert both the query tree and the query plan to a string and back. But how do we implement parameterisation?

Easy to say, but hard to solve. First, I changed the queryId generation algorithm to make the hash less precise: it ignores whether a tree node is a parameter or a constant and considers only its data type and position in the query tree. Of course, it means a core patch, but it is only a couple of lines of code. As a result, a parameterised query and a query with constants in place of the parameters have the same queryId, so we can find the frozen plan in the cache.
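The relaxed hashing rule can be illustrated with a toy jumbling function in Python. It is not the core algorithm: trees are nested tuples, the node kinds ("op", "var", "const", "param") are my own notation, and SHA-256 is used merely as a convenient hash.

```python
import hashlib


def jumble(node, out):
    """Collect the hashable parts of a toy tree; Const and Param look identical."""
    kind = node[0]
    if kind in ("const", "param"):
        # Only the data type is recorded; the position is implied by traversal order.
        out.append(("value_slot", node[1]))
    else:
        out.append((kind, node[1]))
        for child in node[2:]:
            jumble(child, out)


def query_id(tree):
    parts = []
    jumble(tree, parts)
    return hashlib.sha256(repr(parts).encode()).hexdigest()[:16]


with_const = ("op", "=", ("var", "x"), ("const", "bigint", 1))
with_param = ("op", "=", ("var", "x"), ("param", "bigint", 1))
other_type = ("op", "=", ("var", "x"), ("const", "text", "1"))

print(query_id(with_const) == query_id(with_param))
print(query_id(with_const) == query_id(other_type))
```

The first comparison holds because both trees have a bigint value slot in the same place; a value of a different type in that slot lands in a different class.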

The second problem is much more severe. Playing with queries after registration, the user will use queries with specific constant values instead of parameters. After matching the queryId, we must prove the identity of the parse trees by calling the equal() routine. But it can't match the incoming query tree with constants against the registered parameterised one without invasive changes to the core logic. With only a month left before the deadline, I discovered an essential design trick: before comparing the query trees, just replace Const nodes with the corresponding Param nodes at the positions of the query tree that the user defined manually on registration. To make this easier to understand, let me illustrate the technique with the following trivial picture:

As you can see, we introduced the abstraction named the 'template query tree' that may be even more extensively used in the future to match queries where the difference is only in database object aliases.
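Here is a minimal Python sketch of the template trick. The node classes are simplified stand-ins for the core's Var, Const, Param and OpExpr nodes, positions are encoded as paths of child indexes, and to_template is a hypothetical helper, not a function of the extension.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Const:
    type: str
    value: object


@dataclass(frozen=True)
class Param:
    type: str
    number: int


@dataclass(frozen=True)
class OpExpr:
    op: str
    args: tuple


def to_template(node, param_positions, values, path=()):
    """Replace Const nodes at registered positions with Params, collecting the constants."""
    if isinstance(node, Const) and path in param_positions:
        number = param_positions[path]
        values[number] = node.value
        return Param(node.type, number)
    if isinstance(node, OpExpr):
        args = tuple(
            to_template(arg, param_positions, values, path + (i,))
            for i, arg in enumerate(node.args)
        )
        return OpExpr(node.op, args)
    return node


# Registered tree for "x = $1" and the position of $1 inside it.
registered = OpExpr("=", (Var("x"), Param("bigint", 1)))
param_positions = {(1,): 1}

# Incoming tree for "x = 1::bigint".
incoming = OpExpr("=", (Var("x"), Const("bigint", 1)))
values = {}
template = to_template(incoming, param_positions, values)
print(template == registered, values)
```

The comparison itself stays an ordinary equality check; the extracted constants are kept aside so they can be handed to the executor later.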

This technique definitely adds some overhead and complexity to the code, but remember, we target sophisticated queries where the optimiser fails to build an appropriate plan. Since a good plan saves disk fetches, we can afford to let the backend spend more CPU cycles. Moreover, the size of the overhead mostly depends on the queryId technique: how good it is at separating queries into classes.

The third question of this technique was even more challenging to answer: having incoming constants and a parameterised plan to execute the query, how do we pass these constants to the executor? PostgreSQL architecture doesn't allow it to be done because specific parameter values are managed at higher levels of execution machinery. Having spent most of the remaining time, I invented one trick: insert at the top of the frozen plan a CustomScan node, which does nothing except alter the set of parameter values in the execution state structure at the beginning of execution. With this approach, EXPLAIN of a frozen query looks like below:

EXPLAIN SELECT count(*) FROM a WHERE x = 1::bigint;

Custom Scan (SRScan)
  Plan is: tracked
  Query ID: -5166001356546372387
  Parameters: $1 = 1
  ->  Aggregate
        ->  Seq Scan on a
              Filter: (x = $1)

As you can see, having such a node has earned us one positive outcome: this node informs the user about the state of the query plan. It can also potentially gather additional statistics and use them later to make decisions about unfreezing.
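The role of the top node can be modelled with a few toy plan nodes in Python. The classes below are illustrative only and share nothing with the executor's API: SRScan does nothing except put the incoming constants into the parameter set before its child runs.

```python
class SeqScan:
    """Toy scan with a filter of the form: column = $param_no."""

    def __init__(self, rows, column, param_no):
        self.rows, self.column, self.param_no = rows, column, param_no

    def execute(self, params):
        return [row for row in self.rows if row[self.column] == params[self.param_no]]


class CountAggregate:
    def __init__(self, child):
        self.child = child

    def execute(self, params):
        return [len(self.child.execute(params))]


class SRScan:
    """Top node: binds the constants of the incoming query, then runs the frozen plan."""

    def __init__(self, child, param_values):
        self.child = child
        self.param_values = dict(param_values)

    def execute(self, params):
        params.update(self.param_values)
        return self.child.execute(params)


table_a = [{"x": 1}, {"x": 2}, {"x": 1}]
frozen_plan = CountAggregate(SeqScan(table_a, "x", 1))   # count(*) ... WHERE x = $1


def run_frozen(constants):
    # One generic plan is reused; only the bound parameter values differ.
    return SRScan(frozen_plan, constants).execute({})


count_for_1 = run_frozen({1: 1})[0]
count_for_2 = run_frozen({1: 2})[0]
print(count_for_1, count_for_2)
```

The plan below the top node never changes between executions, which is exactly what makes it shareable across queries that differ only in constants.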

Afterwards, I ran PostgreSQL with this extension through a series of benchmarks. The most difficult was, of course, pgbench: its queries are so trivial and execute so quickly that our overheads, even the queryId calculation, should stand out here. After manually freezing all its queries, I found that pgbench results improved by around 15-25% on average. Amazing!

One more simple trick - storing frozen plans on disk in a dedicated file in the data directory to survive crashes and reboots - and the prototype is ready for demonstration. But I forgot about real-life cases: DDL, upgrades, and migrations. If an object mentioned in the plan is altered (for example, a column is added to the table), Postgres marks the plan as 'invalid'. But that is impractical for us: we should unfreeze the query only if it is proven that the plan is totally incorrect in the context of this query and database. To be practical, our extension should survive such disasters.

In just a few days, I could invent only an obvious solution called the 'validation procedure'.

Through the validation procedure, the extension checks that the plan can still be applied to a specific query. How does it work? It is relatively trivial: just open a subtransaction (to survive errors) and run the query text through the parsing procedure. This is precisely why we store the registered query text. If the resulting query tree is the same as the stored one, it is a good sign that the objects mentioned in the query still exist. So, we only need to check the consistency of the indexes mentioned in the query plan. That's enough to mark the query plan as frozen and valid.
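The validation steps can be sketched in Python under heavy simplifications: the catalog is a dictionary, toy_parse understands a single query shape, and a try/except block plays the role of the subtransaction. All names are hypothetical.

```python
import re
from dataclasses import dataclass


class CatalogError(Exception):
    pass


@dataclass
class FrozenPlan:
    query_text: str
    parse_tree: tuple
    plan_indexes: tuple


_PATTERN = re.compile(r"SELECT count\(\*\) FROM (\w+) WHERE (\w+) = \$1")


def toy_parse(query_text, catalog):
    """Resolve the table and column against the catalog; fail if they are gone."""
    m = _PATTERN.fullmatch(query_text)
    if m is None:
        raise CatalogError("unsupported query")
    table, column = m.groups()
    try:
        column_type = catalog["tables"][table][column]
    except KeyError:
        raise CatalogError(f"unknown object in {query_text!r}")
    return ("count", table, column, column_type)


def validate(frozen, catalog):
    try:                                    # stands in for the subtransaction
        tree = toy_parse(frozen.query_text, catalog)
    except CatalogError:
        return False
    if tree != frozen.parse_tree:           # the query now means something else
        return False
    return all(ix in catalog["indexes"] for ix in frozen.plan_indexes)


catalog = {"tables": {"a": {"x": "bigint"}}, "indexes": {"a_x_idx"}}
text = "SELECT count(*) FROM a WHERE x = $1"
frozen = FrozenPlan(text, toy_parse(text, catalog), ("a_x_idx",))

catalog["tables"]["a"]["y"] = "text"        # unrelated DDL: a new column
valid_after_add_column = validate(frozen, catalog)

catalog["indexes"].discard("a_x_idx")       # the plan relied on this index
valid_after_drop_index = validate(frozen, catalog)
print(valid_after_add_column, valid_after_drop_index)
```

Adding an unrelated column leaves the plan frozen, while dropping an index the plan relies on fails validation.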

The validation procedure also copes with transaction isolation: some backends may already see the schema changes and 'unfreeze' the query, while others may still reuse the frozen plan for the same query. Moreover, a previously invalidated plan can be validated again on ROLLBACK, and the query can be returned to a frozen state.

What is more interesting is that we can try to pass the frozen query plan to other instances using the validation procedure. The technique looks similar to the above: on a new instance, open a subtransaction, deserialise the query tree and plan, run the parsing procedure on the query text, recalculate queryId (OIDs of objects may be different), and compare the deserialised query tree with the parsed one. If they are identical, check the indexes and probe query execution to ensure nothing essential is broken. Remember, here we need one more structure: an 'oid → object name' translation table to identify the OIDs of database objects in a dump/restored or logically replicated database.

Of course, in the case of an upgrade, this technique is vulnerable: we can get a SEGFAULT during deserialisation because of the difference in ABI. What's worse, the plan may be deserialised correctly, but the execution state could contain some specific data or logic that could be altered in the next Postgres version and incompatible with the plan. So, this technique looks applicable mostly for migrations between the same versions of the binaries rather than for upgrades.

Do we have any options to survive an upgrade? Yes - thanks to Michael Paquier and the development team at NTT for the pg_hint_plan extension. Before the upgrade, we can store each query text with a set of hints, dropping the parse tree and the plan. After the upgrade, we should run each query through the parsing and optimisation procedures, hoping that the hints will direct the optimiser to build the plan we want.

That's all I wanted to tell you about this case. Be brave, think openly, and you could invent new directions for DBMS development! As usual, you can play with the extension using the binary version for Postgres 15.

In the end, I urge you to reflect on this post and discuss in the comments how interesting the idea of plan freezing is. What is the prospective scope of this feature? What do you think about plan validation?

THE END.

July 28, 2024. Thailand, South Pattaya.

Looking for hidden hurdles when Postgres face partitions

Andrei Lepikhov — Mon, 22 Jul 2024 01:31:26 GMT

Preface

This post was initially intended to introduce my 'one more Postgres thing' - a built-in, fully transparent re-optimisation feature, which I'm really proud of. However, during benchmarking, I discovered that partitioning the table causes performance issues that are hard to tackle. So, let's see the origins of these issues and how PostgreSQL struggles with them.

Here, I do quite a simple thing: having the non-trivial benchmark, I just run it over a database with plain tables, do precisely the same thing over the database where all these tables are partitioned by HASH, and watch how it ends up.

When I finished writing the post, I found out that benchmarking data looked a bit boring. So, don’t hesitate to skip the main text and go to the conclusion.

Preliminary runs

The Join-Order Benchmark has some specifics that make it challenging to stabilise execution time through repeating executions. Processing many tables and most of their data makes query execution time intricately dependent on the shared buffer size and its filling. The frequent involvement of parallel workers further complicates the process, with the potential for one worker to start after a long delay, leading to performance slumps.

What’s more, it turned out that the ANALYZE command is not entirely stable on this benchmark's data, and I constantly observe that rebuilding a table statistic causes significant changes in estimations, followed by different query plans.

To manage these complexities, I used pg_prewarm with over 4GB of shared buffers (the whole data size is about 8GB). We still can't pass statistics through the pg_upgrade process, although I feel it will be possible soon. So, I just analysed the tables once after filling them with data. I ran all 113 benchmark queries ten times, discarding the entire first run because of instability [1]. The scripts to create schemas, benchmarking scripts, and data can be found in the repository on GitHub.

To gauge the stability of the execution time, I conducted the benchmark over the schema with plain tables (see reference to the schema script). After a meticulous analysis of the results (see graphs for Postgres 15, 16 and 17), it becomes apparent that the dispersion of execution times of a query primarily falls within the range of -25% to +25%. With a few exceptions, the execution time consistently falls within the -50% to +50% range.

But comparing execution times of 15 with both 16 and 17, we see the constant shift for both newer versions:

Everywhere on the graphs in this post, the term 'execution time difference' means the following (i - the number of a version: 16 or 17):

Despite some instability in the executions, as mentioned earlier, I observed a surprising decrease of about 17% in execution time for both the 16 and 17 versions. This unexpected change prompts the question: what’s happening here? A comparison of their query plans with v.15 might hold the answer.

A thorough check of the query plans has shown that v.15 uses parameterised NestLoop more frequently (388 times vs 376 in v.16 and 381 in v.17; the latter prefers Parallel Hash Join more often for some reason). Along with that, it chooses the Memoize node (39 times vs 32 and 35 in the newer versions) and IndexScan (342 times vs 329 and 332) more often and, as a result, uses SeqScan more rarely. That is the reason for the peaks you see on the graph.

In general, we can only conclude that something has changed in the PostgreSQL planner as well as the executor since v.16. In the planner, it is obviously related to the cost model. In the executor, I suppose something is in parallel execution machinery. These changes influenced execution, but it is hard to say how it could work in another case. I wonder if someone else would pass the same benchmark case and compare the numbers.

But for now, we have a ±25% range we can trust, and I want to see whether partitioning impacts performance and throws the execution time of queries out of this range.

Partitioning (2 partitions case)

Now, create a schema with the same tables and data, but split each table into two partitions by HASH (see schema). Fortunately, in this benchmark, each table has an ID field, so it was easy to choose a partitioning schema.

In my mind, replacing a plain table with one partitioned by HASH into two partitions can influence planning time a bit but shouldn't significantly change execution time: besides the potential benefit of partition pruning, we only add an Append operation to the plan, which has low overhead.

Okay, let’s pass through the benchmark and look into the numbers.

As you can see in the picture below, all three Postgres versions show some performance degradation, around 40% on average. But specific queries degrade by 100% and more. Such a stable shift must have some underlying reasons.

Execution time difference (in %) of queries over partitioned (version 15, 16 and 17) and non-partitioned tables

Comparing the benchmark's results for partitioned and non-partitioned PostgreSQL 17, I see nothing special: all queries executed in parallel, the choice of NestLoop decreased from 381 to 317 cases, and four MergeJoins compared to 1 for a plain-table run. The number of IndexScans has plummeted from 332 to 282. The number of Memoize choices is the same—32, which means PostgreSQL has stable decisions about parameterised joins with cached inner. But something still happens, and we need to discover it.

Let's choose the worst cases and compare their query plans to solve the mystery. I have chosen the following cases:

query 12a.sql - 135% degradation; 1b.sql - 134%; 1d.sql - 114%; 2a.sql - 116%; 33a.sql - 89%; 5a.sql - 972%.

First, I found that Parallel Append consistently takes longer than a single Parallel Seq Scan:

->  Parallel Append (actual time=9.643..155.768 rows=460012 loops=3)
    ->  Parallel Seq Scan on movie_info_idx_p2
        (actual time=0.033..52.699 rows=230217 loops=3)
    ->  Parallel Seq Scan on movie_info_idx_p1
        (actual time=14.431..97.858 rows=344692 loops=2)

->  Parallel Seq Scan on movie_info_idx
    (actual time=0.027..65.034 rows=460012 loops=3)

As you can see, the plain scan is about two times faster. Maybe we could say it is a page buffering problem, but I see it consistently across all the runs. Perhaps the reason lurks in the number of loops: in each Append, I see three loops for the first sibling and only two for the last. It looks like a bug in the utilisation of parallel workers. One more issue here is the long startup time: 9.6 ms vs 0.027 ms for just two partitions!

The second issue is cost estimation. Look at this:

->  Parallel Append  (cost=0.00..35107.72 rows=1933 width=8)
    ->  Parallel Seq Scan on movie_companies_p2 mc_2
        (cost=0.00..17569.53 rows=1301 width=8)
        Filter: ((note ~~ '%(theatrical)%'::text)
                AND (note ~~ '%(France)%'::text))
    ...

->  Parallel Seq Scan on movie_companies mc
    (cost=0.00..35097.06 rows=1677 width=8)
    Filter: ((note ~~ '%(theatrical)%'::text)
                AND (note ~~ '%(France)%'::text))

This additional estimation error propagates and grows through the upper levels of the query plan and (possibly) triggers the third source of degradation: in all the cases I chose to scrutinise, I found that for the top-level JOIN, instead of a parameterised NestLoop, which gradually reduced the number of tuples scanned from the big table, the optimiser chose Parallel Hash Join, forcing a full scan of the inner table. The overestimation, I think, is caused by data skew: summing up estimation errors over many partitions, we get an error that is worse than when estimating once on a plain table.

Partitioning (64 partitions case)

To test these conjectures, let's try to increase the number of partitions enough to highlight the effect but not so big to get stuck in optimisation complexity, which still consumes many CPU and memory resources on thousands of partitions. Using this script, let's generate 64 partitions on each big table and skip partitioning for tiny tables to get closer to reality. See on the graph below how partitioning changed execution time:

Difference between the average execution time for PostgreSQL 15,16 and 17 with 64-hash partitions on big tables and with plain tables

You can see that many queries speed up a bit, but some queries become slower by around 100-500%. Let's check some query plans to find the reason.

After glancing at the plans of five arbitrarily chosen 'good' queries, I can say that the key reason for the speedup is more targeted reads. Although the optimiser couldn't prune partitions in the initial phase, the parameterised NestLoop touched only one specific partition on each rescan, which reduced the scanning effort. It also improved the "buffer hits" statistics, influencing execution time. That's an excellent example of what a DBA expects when splitting a table into partitions. But what about the queries that slowed down?

Performance slump looks worse with 64 partitions. Let's compare a parallel sequential scan of a plain table and an appended one with the same data:

->  Parallel Seq Scan on movie_info
    (actual time=579..1714 rows=23159 loops=3)

->  Parallel Append
    (actual time=8285..9774 rows=23159 loops=3)
    ->  Parallel Seq Scan on movie_info_part_13
        (actual time=8158..8245 rows=1084 loops=1)
    ->  Parallel Seq Scan on movie_info_part_35
        (actual time=0.135..75 rows=1096 loops=1)
    ->  Parallel Seq Scan on movie_info_part_26
        (actual time=8576..8650 rows=1086 loops=1)
    ->  Parallel Seq Scan on movie_info_part_4
        (actual time=0.136..23 rows=332 loops=3)
    ...

As you can see, some partitions that returned about the same number of tuples were scanned 100 times faster than the slower ones! It looks like a parallel execution skew problem, but it may equally well be a bug. And, as in the two-partition case, we see how long it takes to produce the first tuple for some partitions.

Cost estimation is affected as well:

->  Parallel Append  (cost=0.00..271597.22 rows=164209 width=8)
  ->  Parallel Seq Scan on movie_info_part_13
  ->  ...

->  Parallel Seq Scan on movie_info (cost=0.00..241078.30 rows=32517)

It is not a big issue because, in many cases, it provides an even more precise estimation than in the single plain table case. But sometimes it triggers a suboptimal query plan decision, which deserves further investigation.

One sad outcome concerns the selection of a single tuple: with partitions, even with unique indexes on all partitions, the optimiser still predicts no fewer tuples than the number of partitions. Many wrong decisions in these runs came from choosing Parallel Hash Join instead of the optimal, spot-reading NestLoop.
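One plausible mechanism behind this 1-tuple problem can be sketched in a few lines. The sketch assumes that each partition's row estimate is clamped to at least one row before the Append node sums them up; the helper names are illustrative and not taken from the benchmark.

```python
def clamp_row_est(nrows):
    """Assumed behaviour: round the estimate and never go below one row."""
    return max(1.0, float(round(nrows)))


def append_row_estimate(per_partition_estimates):
    """An Append-like node sums the (clamped) estimates of its children."""
    return sum(clamp_row_est(rows) for rows in per_partition_estimates)


n_partitions = 64
# A unique-key lookup matches one tuple in total, so each partition
# "expects" only a small fraction of a row before clamping.
raw_estimates = [1.0 / n_partitions] * n_partitions

print(append_row_estimate(raw_estimates))
```

Under this assumption the total can never drop below the number of partitions, which is enough to push the planner away from a NestLoop that would be ideal for a single-row outer side.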

Note that I didn’t find the Memoize node even once over the appended table - is it a limitation or a game of chance?

Conclusions

First and foremost, this benchmark shows very volatile results in both execution time and query plan stability. It must be used carefully, and always with preliminary tuning.

My second outcome - local partitioning in PostgreSQL looks mature: I didn’t see huge mishaps or overconsumption of resources during planning.

With all that said, we still have unclear issues with load balancing in the case of parallel append, which triggers most performance degradation cases. Also, we have to work on the 1-tuple problem, which leads to choosing HashJoin instead of NestLoop. The last possible issue is statistics usage on partitioned tables.

Also, I didn’t observe any partitionwise joins there. Maybe the Asymmetric Partitionwise Join feature could be of some benefit here?

P.S. All raw data can be found here.

THE END.

July 21, 2024. Thailand, South Pattaya.

In the end, I realised it is more stable to execute one query ten times before switching to another one instead of passing all the queries ten times. However, the essence of the problem still impacts the query execution time much more, so we stay with the current benchmarking script.

How expensive is it to maintain extended statistics?

Andrei Lepikhov — Sun, 14 Jul 2024 23:24:39 GMT

In the previous post, I passionately advocated for integrating extended statistics and, moreover, creating them automatically. But what if it is too computationally demanding to keep statistics fresh?

This time, I will roll up my sleeves, get into the nitty-gritty and shed light on the burden extended statistics put on the digital shoulders of the database instance. Let's set aside the cost of using this type of statistics during planning and focus on one aspect: how much time an ANALYZE command takes to execute.

I understand how boring numbers look sometimes, as well as benchmarks. However, my vast experience in computational physics and analysing long listings full of numbers shows that it can be a fantastic source of inspiration.

So, let's start and create a test table:

DROP TABLE IF EXISTS bench;
CREATE TABLE bench (
  x1 integer, x2 integer, x3 integer, x4 integer,
  x5 integer, x6 integer, x7 integer, x8 integer
) WITH (autovacuum_enabled = false);
INSERT INTO bench (x1,x2,x3,x4,x5,x6,x7,x8) (
  SELECT x%11,x%13,x%17,x%23,x%29,x%31,x%37,x%41
  FROM generate_series(1,1E6) AS x
);

Why eight columns, you might wonder? This deliberate choice is due to the hard limit of extended statistics, STATS_MAX_DIMENSIONS, which allows only eight columns or expressions in the definition clause.

Let me compare the performance of plain statistics with extended ones. I'll consider different variations of 'ndistinct', 'MCV' and 'dependencies' types. Additionally, I'll include a comparison with a statistic type called 'Joinsel', which builds and uses histograms, MCV, and distinct statistics over a predefined set of columns, treating them as a single value of a composite type. It can be found in standard and enterprise variants of the private Postgres Professional fork.

To measure execution time, I use "\timing on". I'm going to observe how much time it takes to build statistics over two, four and eight columns. As a surrogate test for plain statistics, I will create the 'bench' table with two, four and eight columns.

The benchmarking script for measuring the building of extended statistics looks like this:

CREATE STATISTICS stx ON x1,x2,x3,x4 FROM bench;
\timing on
ANALYSE bench;
\timing off
DROP STATISTICS stx; ANALYSE bench; -- cleanup

The benchmarking script for Joinsel creates an index instead:

CREATE INDEX idx ON bench (x1,x2,x3,x4);
\timing on
ANALYSE bench;
\timing off
DROP INDEX idx; ANALYSE bench; -- cleanup

By the way, benchmarking on two different computers, I realised that in the case of extended statistics, the most influential component is the computer's CPU.

So, after passing the tests, we have the following results:

First fact: plain statistics are cheap, but not free. The second one is that the cost of extended statistics depends drastically on the number of columns. Compared to the Joinsel statistic, which generates only one histogram, distinct and MCV for the whole set of columns, it is evident that the main reason for such a computational burden lies in the number of combinations: with four columns, extended statistics produce 11 ndistinct combinations and 28 combinations for the dependency statistic; eight columns spawn 247 and 1016 variants respectively! Also, based on this benchmark, we can conclude that 4-5 columns is about what an instance can afford to maintain.

Maybe five columns are enough, you might argue? And you would be right. But remember, if we want to create statistics automatically, based on index definitions, we implicitly assume that incoming queries address the whole set of indexed columns or a prefix of it; otherwise, the index couldn't help anyway. That means 1) we need only some of the column combinations, namely the prefixes, and 2) a database instance has to survive the case of multiple indexes on one table, because that is quite typical.
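The combination counts quoted above can be reproduced with a small counting model. It assumes that an ndistinct item is built for every column subset of size two or more, and that a dependency item is built for every such subset with each of its columns taking the role of the dependent one. The function names are illustrative only.

```python
from itertools import combinations


def ndistinct_combinations(columns):
    """Every subset of two or more columns gets its own ndistinct item."""
    return [
        subset
        for size in range(2, len(columns) + 1)
        for subset in combinations(columns, size)
    ]


def dependency_combinations(columns):
    """For each subset, every column may act as the dependent one."""
    deps = []
    for subset in ndistinct_combinations(columns):
        for dependent in subset:
            determinants = tuple(c for c in subset if c != dependent)
            deps.append((determinants, dependent))
    return deps


cols4 = [f"x{i}" for i in range(1, 5)]
cols8 = [f"x{i}" for i in range(1, 9)]

for cols in (cols4, cols8):
    print(len(cols), len(ndistinct_combinations(cols)),
          len(dependency_combinations(cols)))
```

In closed form this model gives 2^n - n - 1 ndistinct items and n * 2^(n-1) - n dependency items for n columns, so both grow exponentially with the width of the definition clause.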

In an attempt to find a solution, I added an options section to the definition of extended statistics (see this GitHub branch for the code). Keep in mind that options might be helpful in the future (remember how complex statistics are in MS SQL Server). I introduced an option called 'method', which switches the algorithm that selects column combinations for the ndistinct statistic.

With this option, the benchmark script looks like:

CREATE STATISTICS stx ON x1,x2,x3,x4 FROM bench WITH (method=linear);
\timing on
ANALYSE bench;
\timing off
DROP STATISTICS stx; ANALYSE bench;

I'm not sure the term 'linear' is good here; I just want to show that the number of combinations depends linearly on the number of columns. Previously, the definition clause "x1,x2,x3" generated four ndistinct combinations: (x1,x2), (x1,x3), (x2,x3) and (x1,x2,x3). With the 'linear' option, we have only two: (x1,x2) and (x1,x2,x3).

The same logic for dependencies has not been implemented yet, but you can already see some telling benchmark results:

As you can see, by generating extended statistics only over the set of prefixes of the definition clause, we drastically reduced the computational effort of building these statistics. Moreover, it could raise the current limit from eight to 32 columns, the maximum number of columns in an index.

Hence, I think a necessary step towards wider use of extended statistics is to make them more flexible and introduce options. Do you agree? What do you think about the prospects of this type of statistic?
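The difference between the two selection strategies can be modelled in a few lines. This is only a sketch of the idea described above, not the code from the branch; the function names are made up for illustration.

```python
def linear_combinations(columns):
    """Prefix-only selection: (c1,c2), (c1,c2,c3), ... up to the full list."""
    return [tuple(columns[:size]) for size in range(2, len(columns) + 1)]


def full_combination_count(n_columns):
    """Number of all subsets with two or more columns: 2^n - n - 1."""
    return 2 ** n_columns - n_columns - 1


print(linear_combinations(["x1", "x2", "x3"]))
for n in (3, 8, 32):
    columns = [f"x{i}" for i in range(1, n + 1)]
    print(n, full_combination_count(n), len(linear_combinations(columns)))
```

With prefix-only selection, the number of ndistinct items is n - 1, so even a 32-column definition needs only 31 of them, whereas the exhaustive approach would need billions.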

THE END.

July 14, 2024. Thailand, South Pattaya.

Why PostgreSQL prefers MergeJoin to HashJoin?

Andrei Lepikhov — Mon, 08 Jul 2024 00:01:57 GMT

Today's post is sparked by a puzzling observation: users, especially those who use an abstraction layer like REST or ORM library to interact with databases, frequently disable the MergeJoin option across the entire database instance. They justify this action by citing numerous instances of performance degradation.

Considering how many interesting execution paths MergeJoin adds to the system by exploiting IncrementalSort or sort orderings derived from an underlying IndexScan, this looks strange: is it one more bug of skewed cost balance inside the PostgreSQL cost model?

As a developer, I refused to accept such a mysterious belief in evil algorithms and investigated this case. It turned out that the real reason (or at least one quite frequent reason) lies in a typical challenge the optimiser faces: the multi-clause JOIN.

Let's take a glance at the query:

SELECT * FROM a JOIN b ON (a.x=b.x AND a.y=b.y AND a.z=b.z);

In this scenario, the optimiser often unexpectedly selects MergeJoin or, much more rarely, a NestLoop instead of the more efficient HashJoin.

It is a challenge to reproduce it with synthetic data, so this example looks a bit complicated:

CREATE TABLE a AS SELECT
  ((3*gs) % 300) AS x,
  ((3*gs+1) % 300) AS y,
  ((3*gs+2) % 300) AS z
FROM generate_series(1,1e5) AS gs;

CREATE TABLE b AS SELECT
  gs % 49 AS x,
  gs % 51 AS y,
  gs % 73 AS z
FROM generate_series(1,1e5) AS gs;
ANALYZE a,b;

Table 'b' is built in a way quite typical for real data: each column has a small number of distinct values, yet a row is almost unique when its column values are considered together. Let's execute a single join on three columns:

EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT * FROM a,b WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;

 Merge Join (actual rows=0 loops=1)
   Merge Cond: ((a.x = b.x) AND (a.y = b.y) AND (a.z = b.z))
   ->  Sort (actual rows=17001 loops=1)
         Sort Key: a.x, a.y, a.z
         ->  Seq Scan on a (actual rows=100000 loops=1)
   ->  Sort (actual rows=100000 loops=1)
         Sort Key: b.x, b.y, b.z
         ->  Seq Scan on b (actual rows=100000 loops=1)
 Execution Time: 843.179 ms

Let's manually disable MergeJoin and see what will happen:

SET enable_mergejoin = 'off';

 Hash Join (actual rows=0 loops=1)
   Hash Cond: ((b.x = a.x) AND (b.y = a.y) AND (b.z = a.z))
   ->  Seq Scan on b (actual rows=100000 loops=1)
   ->  Hash (actual rows=100000 loops=1)
         ->  Seq Scan on a (actual rows=100000 loops=1)
 Execution Time: 154.822 ms

HashJoin is much faster, isn't it? Even though the sorted outer input allowed MergeJoin to fetch only 17001 of 100000 tuples, it is still more than five times slower. So why did the optimiser choose the non-optimal variant?

Looking into the details (I omitted the costs from the EXPLAIN output for simplicity), we see that the optimiser correctly predicts the sizes of the inner and outer inputs (it is quite a trivial query, though), but the cost of MergeJoin is 20937 compared to HashJoin's 419280. Almost twenty times more! What's going on there? Is it a bug in the cost model? Not exactly.

Just look into the final_cost_hashjoin() routine. The HashJoin cost formula looks like this:

In Postgres terms, bucket_size reflects the number of tuples with the same hash value, expressed as a fraction of the hashed relation. It is a crucial factor because, to find a match in the bucket, we have to walk through the bucket and compare the incoming tuple with each stored tuple until we find a match or reach the end of the bucket.

In our specific example, the bucket size is estimated at roughly 0.015 for relation 'b' and 0.011 for relation 'a', and the number of buckets at 131000. With 100000 rows in the hashed relation, this means the optimiser predicts that each bucket will contain on the order of one to two thousand tuples. Passing a whole bucket with a linear search on each incoming tuple is really costly. I agree with the optimiser on that choice! The massive cost of the HashJoin node now makes sense. But why has it made the wrong prediction here?

The problem is estimating the number of groups over multiple columns. The optimiser correctly estimates 49, 51 and 73 distinct values in columns x, y and z respectively, but then takes the maximum, i.e. 73, as the estimated number of groups on (x,y,z). That is incorrect in most real cases, as shown here. Why does it do that? Because, knowing nothing about how the columns relate to each other, it assumes the most skewed, worst-case scenario that can be derived from per-column statistics.
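To get a feeling for how strongly the bucket size drives the estimate, here is a simplified model of the bucket-dependent part of the HashJoin run cost. It is a sketch under stated assumptions, not the full final_cost_hashjoin() routine: it assumes each of the three equality clauses costs one cpu_operator_cost (default 0.0025), that on average half of a bucket is visited per probe, and it ignores scan, hashing and output costs entirely.

```python
CPU_OPERATOR_COST = 0.0025  # PostgreSQL default value of cpu_operator_cost


def clamp_row_est(nrows):
    """Round a row estimate and keep it at one row or more."""
    return max(1.0, float(round(nrows)))


def hash_probe_cost(outer_rows, inner_rows, inner_bucket_size, n_clauses):
    """Cost of comparing each outer tuple with the tuples in its bucket."""
    qual_cost_per_tuple = CPU_OPERATOR_COST * n_clauses
    tuples_per_bucket = clamp_row_est(inner_rows * inner_bucket_size)
    # Assumption: on average, half of the bucket is scanned per probe.
    return qual_cost_per_tuple * outer_rows * tuples_per_bucket * 0.5


# Bucket size 0.011: the estimate quoted above for relation 'a'.
skewed = hash_probe_cost(100_000, 100_000, 0.011, n_clauses=3)
# Bucket size 0.00001: every hashed row is expected to be unique.
precise = hash_probe_cost(100_000, 100_000, 0.00001, n_clauses=3)

print(f"skewed={skewed:.0f} precise={precise:.0f}")
```

In this model, the pessimistic bucket size alone yields a cost in the hundreds of thousands, the same order of magnitude as the HashJoin cost quoted above, while a one-tuple bucket makes this term almost negligible.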

But the actual number of groups here is:

SELECT count(*) FROM (SELECT * FROM b GROUP BY x,y,z);
count  
--------
 100000
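The same numbers can be checked outside the database. The snippet below regenerates the rows of table 'b' exactly as the CREATE TABLE statement above defines them and compares per-column distinct counts with the number of distinct (x,y,z) groups; only the variable names are new.

```python
# Rows of table 'b': gs % 49, gs % 51, gs % 73 for gs in 1..100000.
rows_b = [(gs % 49, gs % 51, gs % 73) for gs in range(1, 100_001)]

# Distinct values in each column separately.
per_column = [len({row[i] for row in rows_b}) for i in range(3)]

# Distinct groups over all three columns together.
groups = len(set(rows_b))

# What the "maximum of per-column estimates" approach would report.
max_based_estimate = max(per_column)

print(per_column, max_based_estimate, groups)
```

Because 49, 51 and 73 are pairwise coprime and their product (182427) exceeds the row count, every row forms its own group, so the maximum-based estimate of 73 is off by more than three orders of magnitude.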

The number of distinct values on a set of columns can be calculated only with extended statistics. Let's define it:

CREATE STATISTICS a_stx (ndistinct) ON x,y,z FROM a;
CREATE STATISTICS b_stx (ndistinct) ON x,y,z FROM b;

Here, we employ only ndistinct statistics because they are enough for our purpose. Unfortunately, the current PostgreSQL core doesn't use them for this estimation, so let's implement the code and see how it goes:

RESET enable_mergejoin;
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT * FROM a,b WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;

 Hash Join (actual rows=0 loops=1)
   Hash Cond: ((a.x = b.x) AND (a.y = b.y) AND (a.z = b.z))
   ->  Seq Scan on a (actual rows=100000 loops=1)
   ->  Hash (actual rows=100000 loops=1)
         Buckets: 131072  Batches: 1  Memory Usage: 5604kB
         ->  Seq Scan on b (actual rows=100000 loops=1)
 Execution Time: 88.582 ms

You can see that the optimiser not only chose the HashJoin algorithm but also correctly chose relation 'b' as the inner input to be hashed. This plan runs almost twice as fast as the previous, already good, HashJoin plan! This results from the correct bucket size estimation: 0.00001 for relation 'b' and 0.01 for relation 'a'.

So, as you can see, this approach led to a nearly tenfold speedup in this elementary example. Since real-life queries are typically more complex, are executed over huge tables with non-trivial data distribution and involve complex scan filters, DBAs often struggle to identify optimisation points and end up with a perplexing belief in the disruptive MergeJoin. So, extended statistics are potentially becoming a "must-have" feature when dealing with queries that contain two or more join clauses in a single JOIN operator.

But what's wrong with simply turning MergeJoin off? Let's add indexes on these tables and try to execute the grouping query:

CREATE INDEX ON a (x,y,z);
CREATE INDEX ON b (x,y,z);

EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT a.x,a.y,a.z FROM a,b
WHERE a.x=b.x AND a.y=b.y AND a.z=b.z GROUP BY a.x,a.y,a.z;

In this case, using presorted input from both relations, MergeJoin executes almost twice as fast: 44 ms vs 76 ms for HashJoin. So a JOIN operator that provides some order is a natural choice in analytical queries: it can get by with fewer sort operations, and disabling it narrows the optimiser’s search for effective plans.

Hence, we should find a way to estimate costs for complex clauses more precisely. In the case of many clauses, we have only one tool so far - EXTENDED STATISTICS. As a result, it looks promising to invent extensions to manage such statistics automatically - fortunately, we have already published one :). Do you agree with us? Would you like to use this type of statistic?

As usual, you can assess the results by playing with the code on top of the current PostgreSQL master code branch.

THE END.

July 7, 2024. Thailand, South Pattaya.