Does PostgreSQL respond to the challenge of analytical queries?
A short glance into recent advancements
This post was triggered by Crunchy Data's recent article and by the YugabyteDB approach of using Postgres as an almost stateless entry point: it parses incoming analytic queries and distributes the work among multiple instances managed by another database system, one better suited to storing and processing massive data volumes and executing relatively simple queries.
The emergence of foreign data wrappers (FDW) and partitioning features has made these solutions possible. It seems that compatibility with the Postgres infrastructure and its mature parser/planner is valuable enough for vendors to consider building such hybrid data management systems.
So, now we face the fact that Postgres is used to do analytics on big data. An essential question immediately emerges: does it have enough tools to process complex queries?
What Are Analytic Queries?
It is quite a general term. But after reading the course materials on the subject, I can summarise that analytic queries typically:
involve multiple joins
use aggregates, often mathematical ones
need to process large subsets of table data in a single query
are ad hoc in nature, so it is difficult to predict when they will arrive
So, looking into the Postgres changes, we should discover what has changed in aggregate processing, join ordering and estimation, and table scanning.
What was the rationale?
The technique of using Postgres as middleware between the user and the storage was triggered by the emergence of the FDW and partitioning features. Parallel execution doesn't help much with processing foreign tables (partitions); still, it is beneficial for speeding up the local part of the work.
The basics of these features were introduced in 2010 - 2017. Now, Postgres can push queries containing scan operations, joins, and orderings down to the foreign server. We also have asynchronous append, which allows us to gather data from foreign instances simultaneously. Looking ahead, the community has quite an active discussion on aggregate pushdown.
Partitioning includes pruning techniques (at both the planning and execution stages) that restrict query pushdown to only those instances containing the necessary data. One more essential thing - partitionwise join - allows the optimiser to choose a specific way to execute the join for each pair of joined partitions.
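To make the setup concrete, here is a minimal sketch of the FDW/partitioning combination described above (the server, table and option values are hypothetical; the async_capable option assumes PostgreSQL 14 or later):

CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'shard1.example', dbname 'sales', async_capable 'true');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
  OPTIONS (user 'app', password 'secret');

CREATE TABLE orders (id bigint, created date, amount numeric)
  PARTITION BY RANGE (created);

-- This partition lives on the remote instance; scans, joins and
-- orderings touching only this range can be pushed down to shard1,
-- and partition pruning skips it when the query excludes 2024.
CREATE FOREIGN TABLE orders_2024
  PARTITION OF orders FOR VALUES FROM ('2024-01-01') TO ('2025-01-01')
  SERVER shard1 OPTIONS (table_name 'orders_2024');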
The FDW/partitioning technique is not ideal yet and has its shortcomings. For example:
We can prune only partitions, not a query subtree;
We can't declare a table as a 'dictionary' that exists on every instance, so that joins between such a table and foreign partitions could be executed on the remote instances.
The pruning technique often can't remove partitions because it lacks statistical data about the partitions' min/max values.
However, even with these and many other problems, Postgres has hooks and an FDW API flexible enough to let a professional development team arrange the code according to the project's needs. Partitioning capabilities are actively maturing; I see discussions (see, for example, [1, 2, 3, 4]) on enhancing the optimiser to work better with partitions. And I think we could soon see more hybrid systems with Postgres as the primary engine and some secondary DBMS chosen according to the purpose.
Regardless, the secondary DBMS typically performs low-level preparatory operations on the data. Aggregates, complex subqueries, window functions, and the like are still executed locally, and the issue is how good the optimiser is at finding an effective way to process this data after pulling it from the remote side.
After reviewing the code repository, I can confirm that the core developers are actively addressing the challenge of identifying and mitigating bottlenecks in the optimiser.
What is the progress?
To provide a comprehensive overview, please refer to the table below, which outlines my selection of the top commits that have impacted the optimiser since 2010:
Commits in this table can be grouped by the 'feature' key. See my categorisation below.
ProSupport. The initial problem addressed is the estimation of a FunctionScan operation. I briefly mentioned this issue about a month ago. The main problem lies in the optimiser's inability to precisely estimate the cost and cardinality of functions that generate data for the query. In 2019, the community found an elegant and adaptable solution - the 'prosupport' routine concept. This routine can be registered as a function that provides the necessary information to the optimiser and can be stored in the database. This approach allows users or extensions to tune the planning decisions. In 2022 and 2023, these capabilities were extended to window functions. I currently see an attempt to use them with aggregates, which appears to be an important evolution of the technique.
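To feel the difference a support function makes, here is a minimal sketch (assuming PostgreSQL 12 or later; the my_series function is hypothetical). generate_series() ships with a prosupport routine, so the planner derives the exact row count from constant arguments, while a plain user-defined SRF falls back to the default estimate:

-- The built-in support function gives an exact estimate:
EXPLAIN SELECT * FROM generate_series(1, 42) AS g(x);
-- Function Scan on generate_series g (... rows=42 ...)

-- A user-defined SRF without a support function gets the default
-- estimate of 1000 rows, whatever it actually returns:
CREATE FUNCTION my_series(n int) RETURNS SETOF int
LANGUAGE plpgsql AS $$
BEGIN
  RETURN QUERY SELECT generate_series(1, n);
END $$;

EXPLAIN SELECT * FROM my_series(42) AS m(x);
-- Function Scan on my_series m (... rows=1000 ...)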
Extended Statistics. People still like and use ORMs and REST APIs despite their apparent inefficiency. To tackle the challenge of bad estimations caused by multi-clause filters, the community introduced extended statistics in 2017 - 2019. The feature provides three types of statistics - MCV, dependencies, and ndistinct - which detect hidden dependencies between the columns of a table and improve estimations.
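Building such statistics is a manual step. A minimal sketch, with a hypothetical table where 'city' functionally determines 'zip':

-- Without extended statistics, the planner multiplies the selectivities
-- of the two clauses and underestimates the result.
CREATE STATISTICS addr_city_zip (dependencies, ndistinct, mcv)
  ON city, zip FROM addresses;
ANALYZE addresses;  -- the statistics are collected during ANALYZE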
I haven't seen this feature spread widely: at least, not many reports on its usage are available on the Internet. IMO, this is caused by its opacity, its computational cost, and the need to manually pick the columns or expressions on which to build the statistics.
Incremental Sort. It is an excellent idea that introduces a whole new way of executing a query into the optimiser. As an alternative to a full sort and further re-sorting of the data, it can find a path where the executor takes presorted input (for example, ordered by x1,x2) and, inside each group of duplicated x1,x2 values, sorts the data by x3 as required by the following operation, producing output sorted by x1,x2,x3. This approach relieves a typical problem of analytic queries, which frequently require sorted output for aggregations at various query levels. It is especially effective in combination with the LIMIT operator - just look at this example:
SET enable_incremental_sort = off;
EXPLAIN (ANALYZE, TIMING OFF)
SELECT * FROM tenk1 ORDER BY unique1, ten LIMIT 100;
RESET enable_incremental_sort;
EXPLAIN (ANALYZE, TIMING OFF)
SELECT * FROM tenk1 ORDER BY unique1, ten LIMIT 100;
Without incremental sort, we must extract all the tuples from the table even though the top-N heapsort returns only 100 of them (the explain has been edited for brevity):
Limit (cost=827.19..827.44) (actual rows=100 loops=1)
-> Sort (cost=827.19..852.19 rows=10000) (actual rows=100 loops=1)
Sort Key: unique1, ten
Sort Method: top-N heapsort Memory: 112kB
-> Seq Scan on tenk1 (cost=0.00..445.00)
(actual rows=10000 loops=1)
Execution Time: 5.404 ms
However, incremental sort allows Postgres to employ an index scan and feed partially sorted input to the sort module. Also, it doesn't need to scan the whole table, which is crucial for analytic queries over massive tables:
Limit (cost=0.46..21.46) (actual rows=100 loops=1)
-> Incremental Sort (cost=0.46..2100.20) (actual rows=100 loops=1)
Sort Key: unique1, ten
Presorted Key: unique1
-> Index Scan using tenk1_unique1 on tenk1
(cost=0.29..1650.20) (actual rows=101 loops=1)
Execution Time: 0.318 ms
Moreover, I didn't find analogues to this node in other database systems.
Memoize. It is designed to extend the parameterised NestLoop JOIN technique to more use cases. Its task is to cache tuples fetched from the inner input of a NestLoop join. The idea extends the Materialize technique. Imagine that the cardinality of the inner subtree of the join is too massive to cache it entirely, and the cardinality estimate of the outer subtree is big enough to fear that looping over the inner will kill the performance. Parameterised NestLoop remains the best solution because a parameterised index scan extracts only a tiny subset of tuples from the inner. If the optimiser predicts many duplicated values in the outer, it can insert a Memoize node on top of the inner to avoid rescanning when a key value has already been seen.
Let me show the effect of Memoize with a simple query:
EXPLAIN ANALYZE
SELECT COUNT(*),AVG(t1.unique1) FROM tenk1 t1
INNER JOIN tenk1 t2 ON t1.unique1 = t2.twenty
WHERE t2.unique1 < 1000;
It is extracted from regression tests, and a description of the tenk1 table can be found there. Disabling Memoize, we get the query plan:
Aggregate (cost=448.24..448.25) (rows=1 loops=1)
-> Merge Join (cost=427.65..443.24) (rows=1000 loops=1)
Merge Cond: (t1.unique1 = t2.twenty)
-> Index Only Scan using tenk1_unique1 on tenk1 t1
(rows=21 loops=1)
-> Sort (cost=427.36..429.86) (rows=1000 loops=1)
Sort Key: t2.twenty
-> Bitmap Heap Scan on tenk1 t2
(cost=20.04..377.54) (rows=1000 loops=1)
Recheck Cond: (unique1 < 1000)
-> Bitmap Index Scan on tenk1_unique1
(cost=0.00..19.79 width=0) (rows=1000 loops=1)
Index Cond: (unique1 < 1000)
Execution Time: 6.512 ms
Here is a good example of an effective MergeJoin: both inputs are presorted, and the merge algorithm fetches only 21 tuples from the outer index scan. But what about NestLoop in this case - could it be competitive? Disable MergeJoin and HashJoin and see the result:
Aggregate (cost=815.04..815.05) (rows=1 loops=1)
-> Nested Loop (cost=20.32..810.04) (rows=1000 loops=1)
-> Bitmap Heap Scan on tenk1 t2
(cost=20.04..377.54) (rows=1000 loops=1)
Recheck Cond: (unique1 < 1000)
-> Bitmap Index Scan on tenk1_unique1
(cost=0.00..19.79) (rows=1000 loops=1)
Index Cond: (unique1 < 1000)
-> Index Only Scan using tenk1_unique1 on tenk1 t1
(cost=0.29..0.42) (rows=1 loops=1000)
Index Cond: (unique1 = t2.twenty)
Execution Time: 102.476 ms
Much worse. We still get the same 1000 tuples from one side, but now each of them triggers an index scan fetching a single tuple, which is what degrades this case. It is precisely where caching can help, if some of these 1000 loops look up the same key. Enable Memoize and see what happens:
Aggregate (cost=416.40..416.41) (rows=1 loops=1)
-> Nested Loop (cost=20.33..411.39) (rows=1000 loops=1)
-> Bitmap Heap Scan on tenk1 t2
(cost=20.04..377.54) (rows=1000 loops=1)
Recheck Cond: (unique1 < 1000)
-> Bitmap Index Scan on tenk1_unique1
(cost=0.00..19.79) (rows=1000 loops=1)
Index Cond: (unique1 < 1000)
-> Memoize (cost=0.30..0.43) (rows=1 loops=1000)
Cache Key: t2.twenty
Hits: 980 Misses: 20
-> Index Only Scan using tenk1_unique1 on tenk1 t1
(cost=0.29..0.42) (rows=1 loops=20)
Index Cond: (unique1 = t2.twenty)
Execution Time: 6.046 ms
The plan stays the same, but in 980 of the 1000 inner rescans, the Memoize node returned a cached copy of the tuple instead of probing the index. It also pays off in the numbers: the total plan cost is lower than in both previous plans, and the execution time is at least not worse.
Pull-up subqueries. In my experience, intricate analytic queries often employ subqueries in expressions. Such a subquery can depend on data from the enclosing query block (a correlated subquery), which leads to a complete subquery evaluation each time the expression is evaluated. If the expression is a filter or join clause, the executor evaluates it on each incoming tuple.
It is a common problem, resolved with query tree transformation rules that have been researched since the 1980s. A trivial subquery is transformed into an InitPlan and evaluated once, and the query uses its materialised output. If the subquery depends on parameters, it can frequently be transformed into a SEMI JOIN with lateral references.
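For the trivial (uncorrelated) case, a minimal sketch using the same regression-test tables:

-- The scalar subquery becomes an InitPlan: it is evaluated once, and
-- its single result is reused for every row of tenk1.
EXPLAIN SELECT * FROM tenk1
WHERE unique1 > (SELECT avg(unique1) FROM tenk2);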
Postgres supports the transformation of simple subqueries and, in 2024, added restricted support for correlated subqueries. IMO, development in this area is crucial to speed up analytics, especially auto-generated queries.
Let me demonstrate this technique with the example below:
EXPLAIN (ANALYZE, TIMING OFF, COSTS ON)
SELECT * FROM tenk1 A
WHERE A.hundred IN (SELECT B.hundred FROM tenk2 B WHERE B.unique1 = A.odd);
This query contains one correlated subquery. Turning off the transformation, we get the plan:
Seq Scan on tenk1 a (cost=0.00..43420.00) (actual rows=100 loops=1)
Filter: (ANY (hundred = (SubPlan 1).col1))
Rows Removed by Filter: 9900
SubPlan 1
-> Index Scan using tenk2_unique1 on tenk2 b
(cost=0.29..8.30) (actual rows=1 loops=10000)
Index Cond: (unique1 = a.odd)
Execution Time: 87.182 ms
After transforming the subquery into a SEMI JOIN, the optimiser finds a better (according to the cost model) plan that executes four times faster:
Hash Semi Join (cost=595.00..1215.00) (actual rows=100 loops=1)
Hash Cond: ((a.odd = b.unique1) AND (a.hundred = b.hundred))
-> Seq Scan on tenk1 a (actual rows=10000 loops=1)
-> Hash (actual rows=10000 loops=1)
-> Seq Scan on tenk2 b (actual rows=10000 loops=1)
Execution Time: 20.722 ms
Even the use of an Index Scan in the subquery doesn't help much without the transformation: re-evaluating the subquery for each tuple drastically degrades performance.
I discovered that MS SQL Server includes diverse pull-up transformation techniques for simple and correlated subqueries. The situation is less clear for Oracle, where, as explained in the documentation, the transformation may be forced by using hints.
ORDER-BY/DISTINCT Aggregates. This improvement is invisible to the user, yet it sometimes drastically improves execution time. The main idea is to discover the aggregates' orderings, find the most common one, and sort the incoming data before calculating these aggregates. To understand the effect, look at the difference between the same query executed by PG13 and PG17:
EXPLAIN (ANALYZE, TIMING OFF, COSTS ON)
SELECT sum(unique1 ORDER BY ten), sum(unique1 ORDER BY ten,two)
FROM tenk1 GROUP BY ten;
-- PG13:
/*
GroupAggregate (cost=1108.97..1209.02) (actual rows=10 loops=1)
Output: sum(unique1 ORDER BY ten), sum(unique1 ORDER BY ten, two), ten
Group Key: tenk1.ten
-> Sort (cost=1108.97..1133.95) (actual rows=10000 loops=1)
Output: ten, unique1, two
Sort Key: tenk1.ten
-> Seq Scan on public.tenk1 (cost=0.00..444.95)
Output: ten, unique1, two
Execution Time: 116.375 ms
*/
-- PG17:
/*
GroupAggregate (cost=1109.39..1209.49) (actual rows=10 loops=1)
Output: sum(unique1 ORDER BY ten), sum(unique1 ORDER BY ten, two), ten
Group Key: tenk1.ten
-> Sort (cost=1109.39..1134.39) (actual rows=10000 loops=1)
Output: ten, unique1, two
Sort Key: tenk1.ten, tenk1.two
-> Seq Scan on public.tenk1 (cost=0.00..445.00)
Output: ten, unique1, two
Execution Time: 12.650 ms
*/
Presorting the tuples and eliminating the sorting inside the aggregates yields a tenfold speedup. Curiously, you can note that the change in execution time is not reflected in the cost value. Does this point to an area where the optimiser's cost model could be further improved?
Make Vars be outer-join-aware. The last feature, designed recently, in 2023, is so internal and hidden from the typical user's sight that, I think, only a few people know about it: machinery to detect that incoming data can contain NULL values.
It is worth mentioning because of its potential. Many queries contain NULL checks. Initially, the optimiser estimated the number of NULL values by looking into the table's statistics. Sometimes the columns contain no NULLs at all or even carry NOT NULL constraints. But in a query containing an OUTER JOIN, a field that refers to such a column may still produce NULLs generated by the join itself. Such 'generated' NULLs frequently cause wrong estimations, mostly cardinality underestimation, which results in choosing the NestLoop join algorithm.
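A minimal sketch of such 'generated' NULLs (the table names are hypothetical):

-- Both id columns are declared NOT NULL, yet the LEFT JOIN can still
-- produce NULLs for t2.id, and the filter below relies on exactly
-- those join-generated NULLs (a classic anti-join pattern). Table
-- statistics alone claim the column never contains NULLs.
SELECT t1.id
FROM t1 LEFT JOIN t2 ON t1.id = t2.id
WHERE t2.id IS NULL;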
What more can we do?
It is difficult to estimate which features we need, because we have to envision the effect they could bring. However, by looking at alternatives like MS SQL Server and the GPOrca optimiser, which have some advantages here, I can briefly outline the necessary techniques.
First and foremost is a further evolution of extended statistics. SQL Server has diverse options for this type of statistics, which are used intensively to estimate scans and joins. It also has machinery for gathering statistics on the fly, similar to what is described in the [DeWitt1998] paper.
On-the-fly statistics collection points, combined with alternative query subplans and dynamic switching between them right during execution (watch Alena Rybakina's WIP report at the September 2024 Postgres Conference), could allow complex queries to survive and be executed in some sane time.
So far, I don't see any activity on the hackers mailing list around developing pull-up subquery techniques, so the community has not pushed this topic forward. IMO, the main reason is the efficiency issue: although correlated subquery transformation is well described in scientific papers, it can increase execution time in some cases. As a result, the performance and technical aspects of this technique still need more research before further progress can be made.
Also, the community has discussed possible ways to modify the sort model and improve the sorting and shuffling of GROUP BY columns. This looks like an interesting topic to work on in the next development cycle.
In the end, I should state that the progress is obvious. Some new and unique features are being introduced. However, the speed of development is still not as fast as people who operate fast-growing data would like. I feel it makes sense to extend the set of hooks in (at least) selectivity estimation, subquery or expression tree transformation, and node execution. Maybe we could also allow custom statistics. This would give the outside (non-core) community a way to implement new techniques ahead of the core.
Are you okay with the current state of PostgreSQL planner and its roadmap?
THE END.
August 4, 2024. Paris, France
It's not clear to me what transformations to pull up correlated subqueries you have in mind. Perhaps the best way to get that to happen is to post some proposal on the hackers mailing list, with a description of the benefits and how the planner might decide to apply the transformation. I'd expect that to be the challenging part - costing the alternative transformations.
That was mostly the problem with the GROUP BY reordering - the statistics we have are not sufficient to make reliable decisions, and it got the patch reverted. The incremental sort has somewhat similar challenges, FWIW.