Automated Management of Extended Statistics in PostgreSQL
The history of one more Postgres extension development
Here, I describe the results of a Postgres extension I developed out of curiosity. The extension focuses on the automatic management of extended statistics for table columns. The idea originated while I was finishing another "smart" query-driven project aimed at enhancing the quality of Postgres query planning. I realised that Postgres is not yet equipped for fully autonomous detection and correction of poor query plans. Therefore, it might be beneficial to approach the problem from a different angle and create an autonomous, data-driven helper.
What are extended statistics?
The extended statistics tool allows you to tell Postgres that additional statistics should be collected for a particular set of table columns. Why is this necessary? I will try to explain quickly using the example of an open power plant database. For instance, the fuel type (primary_fuel) used by a power plant is implicitly associated with the country's name. Therefore, when executing a simple query:
SELECT count(*) FROM power_plants
WHERE country = '<XXX>' AND primary_fuel = 'Solar';
we see that this number is zero for Norway and 243 for Spain. This is apparent to us, since it is largely determined by latitude, but the DBMS does not know it, and at the planning stage it misestimates the number of rows: 93 for Norway and 253 for Spain. If the query is a little more complex and the estimated rows feed a JOIN operator, this can lead to unfortunate consequences. Extended statistics capture the joint distribution of values across columns and allow the planner to detect such dependencies.
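Before any extended statistics exist, the misestimation can be observed directly by wrapping the query above in EXPLAIN (a sketch; I assume here that Norway is stored under the ISO code 'NOR', in line with the 'RUS' and 'AUT' filters used later):
EXPLAIN (ANALYZE, COSTS ON, TIMING OFF, BUFFERS OFF, SUMMARY OFF)
SELECT count(*) FROM power_plants
WHERE country = 'NOR' AND primary_fuel = 'Solar';
The Seq Scan line of the plan will show the gap between the estimated and actual row counts quoted above.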
In fact, ORMs can produce even worse situations. In the power plant database example, this could be the joint use of conditions on the country and country_long fields. After reading their descriptions, anyone can see that there is a direct correlation between these fields, and when the ORM groups by both of them, we get a significant estimation error:
EXPLAIN (ANALYZE, COSTS ON, TIMING OFF, BUFFERS OFF, SUMMARY OFF)
SELECT country, country_long FROM power_plants
GROUP BY country, country_long;
HashAggregate (rows=3494 width=16) (actual rows=167 loops=1)
  Group Key: country, country_long
  ->  Seq Scan on power_plants (rows=34936 width=16) (actual rows=34936 loops=1)
A human would never write such a query, but we live in the era of AI, and automatically generated queries are not uncommon. We will have to deal with this somehow.
And what about extended statistics? They allow us to define three types of statistics on a combination of columns and/or expressions: most common values (MCV), ndistinct, and dependencies. For scan filters, MCV works best: if the combination of values a query selects from the table appears in that table frequently, the optimiser gets an accurate estimate. If we are looking for a rare combination (as in the case of solar power plants in Norway), we start from a rough estimate of ntuples/ndistinct and refine it by throwing out everything that got into the MCV list.
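To make that concrete with made-up numbers: if the table holds 34,936 rows and there were, say, 600 distinct (country, primary_fuel) combinations, the naive estimate for any pair would be 34936 / 600 ≈ 58 rows, no matter how common or rare it actually is. With an MCV list, frequent pairs get their measured frequencies, and only the rows not covered by the list are spread over the remaining combinations, so a rare pair like (Norway, Solar) ends up with a much smaller estimate.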
When the optimiser needs to estimate the number of groups (the GROUP BY and DISTINCT operators, IncrementalSort, Memoize, Hash Join), its decision is strongly supported by the ndistinct value for the column combination.
Now, to see how extended statistics affect the optimiser's row estimates, let's apply them to our case by running the following commands:
CREATE STATISTICS ON country,primary_fuel
FROM power_plants;
ANALYZE;
You may find that the queries above estimate row numbers much more accurately when filtering and grouping by these two fields. For instance, Norway is now estimated to have one power plant, while Spain has 253. Just to be sure, you can verify this result using other filters, such as country = 'RUS' or country = 'AUT'. Although the table is not very large, the tool seems effective.
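You can also peek at what has actually been collected through the pg_stats_ext system view (just a quick check; the auto-generated statistics name will differ in your database):
SELECT statistics_name, attnames, kinds, n_distinct, dependencies
FROM pg_stats_ext
WHERE tablename = 'power_plants';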
However, I rarely see extended statistics being used in practice. One possible reason for this may be the concern that running the ANALYZE command will take a significant amount of time. Yet, I believe the main issue lies in the complexity of diagnostics - specifically, knowing when and where to create these statistics.
Looking for a suitable statistics definition
Is there an empirical rule of thumb for determining where and what statistics to create? I have forged two such rules for myself:
No. 1: By Index Definition. If a DBA takes the risk of creating an index on a specific set of columns, they likely expect the DBMS to receive queries that filter on these columns frequently. The execution time of these queries is probably also critical, which is another reason to improve the quality of their query plans. However, filters over multiple columns do not always produce a significant estimation error, which is a drawback of this empirical approach – statistics may be generated unnecessarily. It's also possible that only a small, point-like sample of rows is expected from the table, which diminishes the impact of a misestimation on a composite filter – does it really matter whether 1 or 5 rows are returned?
Due to these shortcomings, I developed Method No. 2 using actual query filter templates. In this method, the first step is to identify candidate queries based on two factors: the query's contribution to the database load (which can be measured using the pages-read criterion) and the presence of composite filter conditions in table scans. It would also be beneficial to consider only those instances where the actual cardinality of the table scan operator significantly deviates from the planned value.
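As a rough sketch of that first filtering step, one can start from pg_stat_statements alone; detecting composite scan filters and cardinality errors would additionally require inspecting the plans of these statements, for example via auto_explain logs:
-- The most I/O-heavy statements as candidate queries
-- (a sketch; column names follow PostgreSQL 13+ pg_stat_statements)
SELECT queryid, calls,
       shared_blks_hit + shared_blks_read AS pages_touched,
       total_exec_time
FROM pg_stat_statements
ORDER BY pages_touched DESC
LIMIT 20;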
This approach is more selective in choosing potential candidates for generating statistics, allowing for a significant reduction in the statistics collected. However, it raises some important questions:
When it comes to creating statistics, approach No. 1 provides a clear moment for generating them - at the time of index creation. But what about approach No. 2? In this case, you must either rely on a timer, collecting candidate queries in the interim, or trigger the generation manually. The fact that a complex query calculating bonuses at the end of the month (for the previous 29 days) has not shown up yet does not mean we shouldn't expect it to finish within a reasonable time on the thirtieth day. While such a query may contribute only a tiny amount to the overall load, the accountant will hardly appreciate waiting several hours for the results!
How to Clean Up a Set of Statistics. In the previous approach, we deleted the statistics along with the index. However, this situation is less straightforward now. For instance, if a problematic query suddenly stops occurring - perhaps because the sales season for a popular product has ended - it doesn't mean it won't return in a year. This uncertainty could create potential instability in the DBMS optimiser's operation.
Additionally, it's unclear how much the actual and planned row numbers should differ to be considered significant. Should this difference be two times, ten times, or even a hundred times?
With this in mind, I decided to first write code for the easy-to-implement approach No. 1. For approach No. 2, I only plan to develop a recommender tool that, based on data from the pg_stat_statements extension and an analysis of query execution plans, will suggest candidates for creating new statistics.
Extension Description
The concept behind this extension is straightforward (see the repository for details). First, we need a hook to collect the identifiers of objects created in the database, and I have chosen the object_access_hook for this purpose. Next, we need to determine an appropriate moment to filter the list of objects, selecting only those that belong to relevant composite indexes. We can efficiently add a new statistics definition to the database using the ProcessUtility_hook, executing our code after the utility command has completed.
Extended statistics of the ndistinct and dependencies types are calculated for all possible combinations of columns, which leads to a rapid increase in computational complexity. For instance, with three columns, the number of ndistinct items is 4 and the number of dependencies is 9. With eight columns, these numbers rise dramatically to 247 ndistinct items and 1016 dependencies. It is now clear why the PostgreSQL core strictly limits the number of elements in a statistics object to 8.
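These counts come from simple combinatorics (my own back-of-the-envelope arithmetic, not code from the extension): for n columns there are 2^n - n - 1 subsets of two or more columns (ndistinct items), and for dependencies every subset of k determinant columns can imply any one of the remaining n - k columns. A quick SQL check:
-- Reproduce the item counts for n = 2..8 columns
SELECT n,
       (2 ^ n - n - 1)::int AS ndistinct_items,
       (SELECT sum(factorial(n) / (factorial(k) * factorial(n - k)) * (n - k))
          FROM generate_series(1, n - 1) AS k)::bigint AS dependency_items
FROM generate_series(2, 8) AS n;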
To prevent excessive load on the database, I introduced a parameter that limits the number of index elements included in the statistics definition (the columns_limit parameter) and another parameter that determines which types of statistics to include in this definition (the stattypes parameter). When these automatic statistics are created, an extra dependency is established on the index serving as the template. Consequently, the associated statistics are removed when the index is deleted.
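So, with the module loaded, creating a composite index should be enough for a matching statistics object to appear. Something like the following lets you observe it (a sketch; the auto-generated statistics name and the set of columns depend on the settings above):
CREATE INDEX ON power_plants (country, primary_fuel);
-- The statistics objects created for the table, with their columns and kinds
SELECT stxname, stxkeys, stxkind
FROM pg_statistic_ext
WHERE stxrelid = 'power_plants'::regclass;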
An open question remains: is it necessary to create a dependency on the extension so that all created statistics are deleted when DROP EXTENSION is executed? The answer is unclear because the extension may also function as a simple module without requiring a CREATE EXTENSION call, thus potentially impacting all databases within the cluster simultaneously.
To distinguish automatically generated statistics from those created by users, a comment object containing the module's name and the statistics name is created. Additionally, the functions pg_index_stats_remove and pg_index_stats_rebuild have been introduced into the extension interface. These functions allow you to delete all statistics and regenerate them, which can be helpful if the data schema was established before the module was loaded or if the database parameters have changed.
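Since the marker is just an object comment, the auto-created statistics can be listed with a query like this (a sketch; I assume the comment contains the module name, so adjust the pattern to whatever the module actually writes):
-- Statistics objects whose comment mentions the module
SELECT stxname, obj_description(oid, 'pg_statistic_ext') AS note
FROM pg_statistic_ext
WHERE obj_description(oid, 'pg_statistic_ext') LIKE '%pg_index_stats%';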
A separate issue to address is the reduction of redundant statistics. Given that a database can have many indexes, a procedure has been developed to identify duplicates, aiming to decrease the computational costs of the ANALYZE command (see the pg_index_stats.compactify parameter).
For example, if an index is already defined on t(x1, x2), creating another index on t(x2, x1) does not require new statistics. A more complex scenario arises when an index on t(x2, x1) is created in the presence of an existing index on t(x1, x2, x3). In this case, the MCV statistics must still be built, since they would not be redundant, while the ndistinct and dependencies items can be disregarded.
Benchmarking
As usual, theory should be validated through practice, and code should be tested on meaningful data. I didn't have access to a ready-made, loaded PostgreSQL instance in either a test or production environment, so I found a stale dump of a database for testing purposes. This particular dump was noteworthy because it contained a large number of tables - about 10,000 - along with roughly three times as many indexes.
Additionally, composite indexes were heavily employed, with around 20,000 indexes containing more than one column. Notably, more than 1,000 of these indexes cover five or more columns. So, this database provides a suitable case for research, although it is unfortunate that no payload is available. The ANALYZE command on this database took 22 seconds to execute. However, when I installed the extension and used the default limit of five columns, the ANALYZE time increased to 55 seconds.
The table with raw data below illustrates the ANALYZE time (in seconds) based on the limit on the number of columns and the types of statistics collected.
It's clear that storing all possible combinations of columns significantly impacts analysis time, especially when dependencies are involved. Therefore, we can limit the statistics to 3-5 columns or consider adopting approach No. 2. I now understand why SQL Server created a separate worker for updating such statistics: this process can be quite costly. What about reducing redundancy? Let's conduct another experiment:
SET pg_index_stats.columns_limit = 5;
SET pg_index_stats.stattypes = 'mcv, ndistinct, dependencies';
SET pg_index_stats.compactify = 'off';
SELECT pg_index_stats_rebuild();
ANALYZE;
SET pg_index_stats.compactify = 'on';
SELECT pg_index_stats_rebuild();
ANALYZE;
The following two queries are sufficient to check the amount of statistical data generated by the pg_index_stats extension:
-- Total number of stat items
SELECT sum(nelems) FROM (
  SELECT array_length(stxkind, 1) AS nelems
  FROM pg_statistic_ext) AS q;
-- Total number of stat items grouped by stat type
SELECT elem, count(elem) FROM (
  SELECT unnest(stxkind) AS elem FROM pg_statistic_ext) AS q
GROUP BY elem;
The first query shows the total number of extended statistics items in the database, and the second one - a breakdown by type. So, let's see what happens with and without compactifying:
The overall impact is modest - approximately a 15% improvement in processing time and slightly more in the volume of statistics. However, it does provide some protection against corner cases. Interestingly, compaction reduced the number of MCV statistics, suggesting that a significant number of indexes differ only in the order of their columns. Additionally, expression statistics, which we haven't discussed before, are generated automatically by the PostgreSQL core whenever the definition of extended statistics includes an expression. Although this may not pose a significant issue, it would be beneficial to be able to regulate this behaviour.
It's also worth comparing the analysis time to an alternative statistics collector called joinsel, which exists in the enterprise Postgres fork provided by Postgres Professional LLC. It isn't a direct competitor to extended statistics and works differently: based on the index definition, it creates a new composite type within the database, which is then used to generate regular statistics stored in pg_statistic. The advantages of joinsel include an MCV list and a histogram, which allows range filters to be evaluated while leveraging standard PostgreSQL clause estimation techniques. However, it has some drawbacks, such as the lack of dependency statistics and only one ndistinct value for the entire composite type (a limitation that can be addressed).
Now, let's look at how quickly the ANALYZE command is executed with joinsel.
SET enable_compound_index_stats = 'on';
SELECT pg_index_stats_remove();
\timing on
ANALYZE;
Time: 41248.977 ms (00:41.249)
ANALYZE time has increased as expected compared to regular Postgres statistics, but only by a factor of two, which is a reasonable compromise. The main advantage here is that you don't have to worry about the number of columns in the index - the complexity will increase linearly.
Conclusion
The general conclusion regarding Approach No. 1 is that it can be viable, provided we exercise caution and carefully manage the limits.
Additionally, the extended statistics machinery in the core should be enhanced. It would be nice to have more control over this tool, allowing us to reduce or expand the volume of generated statistical data.
As for the helper and Approach No. 2, I have decided to postpone it for now. If anyone is enthusiastic and has plenty of free time and patience, feel free to reach out. I would be happy to provide guidance!
THE END.
March 9, 2025, Madrid, Spain.