Boosting Postgres' EXPLAIN

Add an extra output on employed statistics

Apr 12, 2025

Shortly before the code freeze for PostgreSQL 18, Robert Haas added a feature that allows external modules to provide additional information to the EXPLAIN command.

This was a long-awaited feature for me. For an extension that influences the query planning process, providing users with notes on how the extension has affected the plan makes perfect sense. Instead of merely writing to a log file - access to which is often restricted by security policies - this information may be made available through the EXPLAIN command.

The feature introduced many entities that are not easy to figure out: an EXPLAIN option registration routine (RegisterExtensionExplainOption), an explain extension ID, per plan/node hooks, and an option handler.

The pg_overexplain extension, introduced with this feature to demonstrate how it works, seems a little messy and impractical for me, at least in its current state. So, I decided to find out how flexible this new technique is and demonstrate the opportunities opening up to developers with a more meaningful example. I have modified the freely available pg_index_stats extension and added information about the statistics used in the query planning process.

The STAT parameter was added to the list of EXPLAIN options, accepting Boolean ON/OFF values. If it is enabled, information about the statistics used is inserted at the end of the EXPLAIN: the presence of MCV, histogram, and the number of elements in them, as well as the values of stadistinct, stanullfrac, and stawidth.

You might wonder why this is necessary. After all, doesn't the set of statistics directly stem from the list of expressions in the query? Isn't it possible to identify which statistics were utilised by examining the cost-model code for a particular type of expression?

While it is indeed possible, this approach is not always sufficient. We understand the algorithms, but we typically do not have access to the underlying data. As a result, we cannot accurately determine which specific statistics are present in pg_statistic for a given column, nor can we know what information was available to the backend at the time of estimation.

Let's look at the example below:

CREATE TABLE sc_a(x integer, y text);
INSERT INTO sc_a(x,y) (
  SELECT gs, 'abc' || gs%10 FROM generate_series(1,100) AS gs);
VACUUM ANALYSE sc_a;
LOAD 'pg_index_stats';

EXPLAIN (COSTS OFF, STAT ON)
SELECT * FROM sc_a s1 JOIN sc_a s2 ON true
WHERE s1.x=1 AND s2.y LIKE 'a';

Explain, boosted by the pg_index_stats extension, looks like the following:

 Nested Loop
   ->  Seq Scan on sc_a s1
         Filter: (x = 1)
   ->  Seq Scan on sc_a s2
         Filter: (y ~~ 'a'::text)
 Statistics:
   "s2.y: 1 times, stats: { MCV: 10 values, Correlation,
          ndistinct: 10.0000, nullfrac: 0.0000, width: 5 }
   "s1.x: 1 times, stats: { Histogram: 0 values, Correlation,
          ndistinct: -1.0000, nullfrac: 0.0000, width: 4 }

Here, you can see that the statistics for the s1.x and s2.y columns were used. We can't detect which statistic type was actually used and how often - it is buried too deeply in the core - but we may still detect some issues:

At first, we have only ten MCV values for y, and there are no MCV statistics for the s1.x at all; the histogram seems to be there, but it is of zero length. No nulls in either column are expected.

Thus, we have some helpful information that can suggest the optimiser's plan selection logic. Considering that a client who cannot provide data can very rarely give a dump of the pg_statistic table, such relatively harmless information can be a helpful aid and reveal possible causes of problems with the query plan selection. As a bare minimum benefit, users often forget to increase the statistic_target parameter (sample size) on massive tables, and this information provides a quick insight into that issue.

The extension utilises the get_relation_stats_hook to track the statistics used. It would also be useful to know whether extended statistics are used in planning, but they are too deep in the core yet, and the current set of hooks will not help here.

Finally, I would like to know what applications you see for expanding the EXPLAIN output. Regarding the example above, how harmless is even such limited information really?

THE END

April 12, 2025. Nikola Tesla Airport, Serbia.

Conserving CPU's cycles ...

Discussion about this post