In the previous post, I passionately advocated for integrating extended statistics and, moreover, creating them automatically. But what if it is too computationally demanding to keep statistics fresh?
This time, I will roll up my sleeves, get into the nitty-gritty and shed light on the burden extended statistics put on the digital shoulders of the database instance. Let's set aside the cost of using this type of statistics during planning and focus on one aspect: how much time an ANALYZE command takes to execute.
I understand how boring numbers and benchmarks can look sometimes. However, my long experience in computational physics and in analysing long listings full of numbers shows that they can be a fantastic source of inspiration.
So, let's start and create a test table:
DROP TABLE IF EXISTS bench;
CREATE TABLE bench (
  x1 integer, x2 integer, x3 integer, x4 integer,
  x5 integer, x6 integer, x7 integer, x8 integer
) WITH (autovacuum_enabled = false);
INSERT INTO bench (x1,x2,x3,x4,x5,x6,x7,x8) (
  SELECT x%11, x%13, x%17, x%23, x%29, x%31, x%37, x%41
  FROM generate_series(1,1E6) AS x
);
Why eight columns, you might wonder? This deliberate choice is due to a hard limit of extended statistics, STATS_MAX_DIMENSIONS, which allows at most eight columns or expressions in a statistics definition clause.
Let me compare the performance of plain statistics with extended ones. I'll consider different variations of the 'ndistinct', 'MCV' and 'dependencies' types. Additionally, I'll include a comparison with a statistics type called 'Joinsel', which builds and uses a histogram, an MCV list and ndistinct statistics over a predefined set of columns, treating them as a single value of a composite type. It can be found in the standard and enterprise variants of the private Postgres Professional fork.
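For reference, in stock PostgreSQL the kinds listed above are selected right in the CREATE STATISTICS clause. A minimal sketch on our test table (the object names here are purely illustrative):
-- Each kind can be requested separately; omitting the kind list builds
-- all applicable kinds at once.
CREATE STATISTICS stx_nd  (ndistinct)    ON x1, x2, x3, x4 FROM bench;
CREATE STATISTICS stx_mcv (mcv)          ON x1, x2, x3, x4 FROM bench;
CREATE STATISTICS stx_dep (dependencies) ON x1, x2, x3, x4 FROM bench;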
To measure execution time, I use psql's \timing on. I'm going to observe how much time it takes to build statistics over two, four and eight columns. As a surrogate test for plain statistics, I simply create the 'bench' table with two, four and eight columns and run ANALYSE on it.
The benchmarking script for measuring the building of extended statistics looks like this:
CREATE STATISTICS stx ON x1,x2,x3,x4 FROM bench;
\timing on
ANALYSE bench;
\timing off
DROP STATISTICS stx; ANALYSE bench; -- cleanup
The benchmarking script for Joinsel boils down to creating an index:
CREATE INDEX idx ON bench (x1,x2,x3,x4);
\timing on
ANALYSE bench;
\timing off
DROP INDEX idx; ANALYSE bench; -- cleanup
By the way, benchmarking on two different computers, I realised that in the case of extended statistics the most influential factor is the computer's CPU.
So, after running the tests, we have the following results:
The first fact: plain statistics are cheap, but not free. The second: the cost of extended statistics depends drastically on the number of columns. Compared to the Joinsel statistic, which generates only one histogram, one ndistinct value and one MCV list for the whole column set, it is evident that the main source of the computational burden is the number of combinations: with four columns, extended statistics produce 11 ndistinct combinations and 28 combinations for the dependencies statistic; 8 columns spawn 247 and 1016 variants respectively! Based on this benchmark, we can also conclude that 4 - 5 columns is about as much as an instance can survive maintaining.
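If you want to double-check those counts, the arithmetic is simple: for n columns there are 2^n - n - 1 ndistinct items (every subset of at least two columns) and n*2^(n-1) - n dependency items (each such subset, times the choice of the dependent column). A throwaway query, purely illustrative and not part of the benchmark, reproduces the numbers above:
-- Combinations ANALYSE has to process for a statistics object on n columns
SELECT n,
       2^n - n - 1     AS ndistinct_items,
       n*2^(n-1) - n   AS dependency_items
FROM (VALUES (2), (4), (8)) AS t(n);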
Maybe five columns are enough, you might argue? And you'd be right. But remember, if we want to make it automatic and based on index definitions, it implicitly means we assume incoming queries address the whole set of columns or a prefix of it; in any other case the index couldn't help anyway. That means 1) we only need some of the column combinations, namely the prefixes, and 2) a database instance has to survive having multiple indexes on one table, because that is quite a typical case.
In an attempt to find a solution, I tried prototyping an options section in the definition of extended statistics (see this GitHub branch for the code); keep in mind that options might be helpful in the future anyway (remember how elaborate statistics are in MS SQL Server). I introduced an option called 'method', which switches the algorithm used to select column combinations for the ndistinct statistic.
With this option, the benchmark script looks like:
CREATE STATISTICS stx ON x1,x2,x3,x4 FROM bench WITH (method=linear);
\timing on
ANALYSE bench;
\timing off
DROP STATISTICS stx; ANALYSE bench;
I'm not sure the term 'linear' is ideal here; I just want to convey that the number of combinations depends linearly on the number of columns. So, where the definition clause "x1,x2,x3" previously generated four ndistinct combinations: (x1,x2), (x1,x3), (x2,x3), (x1,x2,x3), with the 'linear' option we get only two prefix combinations: (x1,x2) and (x1,x2,x3).
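To see which combinations actually end up being built, you can peek at the stored ndistinct coefficients through the pg_stats_ext view; assuming the patched build is installed and the statistics object is still named stx, a query along these lines should show every combination for the default method and only the prefixes for 'linear':
-- Inspect the ndistinct items stored for our statistics object after ANALYSE
SELECT statistics_name, n_distinct
FROM pg_stats_ext
WHERE statistics_name = 'stx';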
The same logic has not been implemented for the dependencies statistic yet, but you can already see some tell-tale benchmark results:
As you can see, by generating extended statistics only over the prefixes of the defining clause, we drastically reduce the computational effort of building these statistics. Moreover, it could let us push the current limit of eight columns up to 32, the same as the maximum number of columns in an index.
Hence, I think a necessary step towards wider use of extended statistics is to make them more flexible and to introduce options. Do you agree? What do you think about the prospects of this type of statistics?
THE END.
July 14, 2024. Thailand, South Pattaya.