Inventing A Cost Model for PostgreSQL Local Buffers Flush
On the way to parallel temp table scans
In this post, I describe experiments measuring the write-versus-read costs of PostgreSQL's temporary buffers. For the sake of accuracy, the PostgreSQL function set is extended with tools to measure buffer flush operations. The measurements show that writes are approximately 30% slower than reads. Based on these results, I propose a cost estimation formula for the optimiser: flush_cost = 1.30 × dirtied_bufs + 0.01 × allocated_bufs.
Introduction
Temporary tables in PostgreSQL have always been parallel restricted. From my perspective, the reasoning is straightforward: temporary tables exist primarily to compensate for the absence of relational variables, and for performance reasons, they should remain as simple as possible. Since PostgreSQL parallel workers behave like separate backends, they don't have access to the leader process's local state, where temporary tables reside. Supporting parallel operations on temporary tables would significantly increase the complexity of this machinery.
However, we now have at least two working implementations of parallel temporary table support: Postgres Pro and Tantor. There is one more reason: identifying temporary tables within a UTILITY command is an essential step toward auto DDL in logical replication. So maybe it is time to propose such a feature for PostgreSQL core.
After numerous code improvements over the years, AFAICS, only one fundamental problem remains: temporary buffer pages are local to the leader process. If these pages don't match the on-disk table state, parallel workers cannot access the data.
A comment in the code (80558c1) made by Robert Haas in 2015 clarifies the state of the art:
/*
* Currently, parallel workers can't access the leader's temporary
* tables. We could possibly relax this if we wrote all of its
* local buffers at the start of the query and made no changes
* thereafter (maybe we could allow hint bit changes), and if we
* taught the workers to read them. Writing a large number of
* temporary buffers could be expensive, though, and we don't have
* the rest of the necessary infrastructure right now anyway. So
* for now, bail out if we see a temporary table.
*/
The comment hints at a path forward: if we flush the leader's temporary buffers to disk before launching parallel operations, workers can safely read from the shared on-disk state. The concern, however, is cost - would the overhead of writing those buffers outweigh the benefits of parallelism?
On the path to enabling parallel temporary table scans, this cost argument is fundamental and must be addressed first. We can resolve it by providing the optimiser with a proper cost model: the optimiser could then choose between a parallel scan that pays the buffer-flushing overhead and a plain sequential scan without parallel workers. Hence, we are looking for a constant, analogous to DEFAULT_SEQ_PAGE_COST, to estimate the writing overhead. Let's address this question with actual data and measure the cost of flushing temporary buffers. My goal is to determine whether this overhead represents a real barrier to parallel execution or simply an overestimated concern that has kept this optimisation off the table.
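For context, here is a simplified view of how seq_page_cost already enters the planner's arithmetic (cost_seqscan() in costsize.c, ignoring quals and startup cost):

seqscan_cost ≈ seq_page_cost × pages + cpu_tuple_cost × tuples

A write-temp-page cost constant would slot into the same framework as an extra, one-time term charged before a parallel scan of a temporary table may start.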
Benchmarking tools
PostgreSQL currently provides no direct access to local buffers for measurement purposes. To conduct this benchmark, I extended the system with several instrumentation tools and UI functions. The temp-buffers-sandbox branch, based on the current PostgreSQL master, contains all the modifications needed for this work.
The implementation consists of two key commits:
No.1: Statistics infrastructure
This commit introduces two new internal statistics that track local buffer state:
- allocated_localbufs - tracks the total number of local buffers currently allocated in this backend (it can't be more than the temp_buffers value).
- dirtied_localbufs - counts how many local buffer pages are dirty (not yet flushed to disk).
I believe these statistics provide the foundation for the cost model, giving the query optimiser visibility into the current state of temporary buffers before deciding whether to flush them.
No.2: UI functions
This commit adds SQL-callable functions that allow direct manipulation and inspection of local buffers:
- pg_allocated_local_buffers() - returns the count of currently allocated local buffers.
- pg_flush_local_buffers() - explicitly flushes all dirty local buffers to disk.
- pg_read_temp_relation(relname, randomize) - reads all blocks of a temporary table either sequentially or in random order.
- pg_temp_buffers_dirty(relname) - marks all pages of a temporary table as dirty in the buffer pool.
If local buffers are free or not yet allocated, reading a relation's blocks in random order scatters them randomly across the buffer pool. A subsequent flush of these pages to disk therefore simulates a simple 'random-write' mode. Together, these functions enable an almost direct measurement of read and write operations.
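For illustration, the functions are called like any other SQL functions; here is a minimal session sketch (table name, row shape, and size are arbitrary):

-- Create and fill a temp table: all of its pages start dirty in local buffers
CREATE TEMP TABLE test AS
  SELECT i AS id, repeat('x', 64) AS payload
  FROM generate_series(1, 100000) AS i;

SELECT pg_allocated_local_buffers();        -- buffers allocated so far
SELECT pg_flush_local_buffers();            -- write out every dirty local buffer
SELECT pg_read_temp_relation('test', true); -- re-read blocks in random order
SELECT pg_temp_buffers_dirty('test');       -- mark all the table's pages dirty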
Methodology
The complete test bench is available here.
Fortunately, local buffer operations are pretty straightforward: they don't acquire locks, don't require WAL logging, and avoid other costly manipulations. This eliminates concurrency concerns and simplifies the test logic. To build a cost estimation model, we need to measure three things: write speed, read speed, and the overhead of scanning buffers when no I/O is required.
The write-to-read speed ratio lets us derive a write-page-cost parameter from the DEFAULT_SEQ_PAGE_COST value used in core PostgreSQL. The optimiser can use this parameter to estimate the cost of flushing dirty local buffers before parallel operations begin.
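In other words, treating a sequential read as the unit of cost, the parameter we are after is simply:

write_page_cost = (write_time_per_page / read_time_per_page) × DEFAULT_SEQ_PAGE_COST

so, for example, writes measuring 30% slower than reads translate into a multiplier of 1.30.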
Each test iteration follows this algorithm:
Sequential access testing:
1. Create a temp table and fill it with data that fits within the local buffer pool (all pages will be dirty in memory).
2. Call pg_flush_local_buffers() to write all dirty buffers to disk. Measure I/O.
3. Call pg_flush_local_buffers() again to measure the overhead of scanning buffers without an actual flush (dry-write run).
4. Evict the test table's pages by creating a dummy table that fills the entire buffer pool, then drop it.
5. Call pg_read_temp_relation('test', false) to read all blocks sequentially from disk into buffers. Measure I/O.
6. Call pg_read_temp_relation('test', false) again to measure the overhead of scanning buffers without an actual read (dry-read run).
Random access testing:
1. Evict the test table's pages again by creating and dropping a dummy table.
2. Call pg_read_temp_relation('test', true) to read blocks in random order, distributing them randomly across the buffer pool.
3. Call pg_temp_buffers_dirty('test') to mark all table pages as dirty.
4. Call pg_flush_local_buffers() to flush the pages to disk. Since the pages were loaded into buffers in random order, this simulates a random-write pattern.
The sequential part is expressed as a SQL sketch right after this list.
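Put together, one sequential iteration looks roughly like this (a sketch, not the exact bench script; the 'test' and 'dummy' sizes are illustrative and should be chosen to fill the local buffer pool):

-- Step 1: create and fill 'test' so that all its pages are dirty in memory
CREATE TEMP TABLE test AS
  SELECT i AS id, repeat('x', 64) AS payload
  FROM generate_series(1, 100000) AS i;

-- Steps 2-3: measured flush, then dry-write run (nothing left to write)
EXPLAIN (ANALYZE, BUFFERS) SELECT pg_flush_local_buffers();
EXPLAIN (ANALYZE, BUFFERS) SELECT pg_flush_local_buffers();

-- Step 4: evict the pages of 'test' by filling the pool with a dummy table
CREATE TEMP TABLE dummy AS
  SELECT i AS id, repeat('x', 64) AS payload
  FROM generate_series(1, 100000) AS i;
DROP TABLE dummy;

-- Steps 5-6: measured sequential read, then dry-read run (all pages are hits)
EXPLAIN (ANALYZE, BUFFERS) SELECT pg_read_temp_relation('test', false);
EXPLAIN (ANALYZE, BUFFERS) SELECT pg_read_temp_relation('test', false);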
All measurements are captured using EXPLAIN (ANALYZE, BUFFERS), which records execution time in milliseconds and buffer I/O statistics (local read, local written, local hit counts). Planning time is negligible (typically < 0.02ms) and excluded from analysis. While it's possible to avoid EXPLAIN and the Instrumentation overhead entirely, I believe this overhead is minimal and consistent between write and read operations. Using EXPLAIN provides a convenient way to verify execution time and confirm the actual number of blocks affected.
The tests cover buffer pool sizes at powers of 2 from 128 to 262,144 blocks (1 MB to 2 GB), with 30 iterations per size for statistical reliability. Each test allocates 101% of the target block count to accommodate Free Space Map and Visibility Map metadata. Higher buffer counts cause memory swapping and produce unreliable results.
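For example, to pin one run at 16,384 blocks (128 MB), the sizing can be verified with the standard pg_relation_size() function (8192 bytes is the default BLCKSZ):

-- temp_buffers must be set before the session touches its first temp table
SET temp_buffers = '128MB';   -- 16384 blocks of 8 kB

-- ... create and fill 'test' to ~101% of the target block count ...

-- main-fork size of the temp table, in blocks
SELECT pg_relation_size('test', 'main') / 8192 AS blocks;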
Benchmark results
On my laptop, the most stable performance occurs in the 4-512 MB range, while large datasets show higher write overhead and variability.
Scanning without I/O (Dry-Run) is minimal, around 0.002-0.240 ms.
Based on the results, the temp table write cost should account for both sequential overhead and random access patterns. The analysis shows:
Sequential write overhead: Approximately 20% slower than sequential reads.
Random access degradation: Random buffer distribution patterns show an additional 10-24% performance degradation compared to sequential access. This occurs when temporary table pages are scattered across the buffer pool due to interleaved operations or random access patterns.
Recommended cost formula:
DEFAULT_WRITE_TEMP_PAGE_COST = 1.30 × DEFAULT_SEQ_PAGE_COST

This 1.30 multiplier accounts for both the sequential write overhead (~1.20) and the random-access degradation (~0.10), providing a conservative estimate that covers realistic workload scenarios where buffer access patterns may not be purely sequential.
Write cost is relatively close to read cost because no WAL logging is required for temporary tables. I’m uncertain what storage type the current default seq_page_cost targets; my measurements were conducted on an NVMe SSD. Would the relationship differ on HDD? Further investigation into different storage types may be warranted.
Also, the tests indicate that the buffer-scanning overhead can be estimated at approximately 1% of the writing cost. Hence, the whole formula for the preliminary flush of temporary buffers may look like this (with DEFAULT_SEQ_PAGE_COST = 1):
flush_cost = 1.30 × dirtied_localbufs + 0.01 × allocated_localbufs
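As a quick sanity check of the magnitudes involved (the buffer counts are illustrative): for a fully dirty 128 MB temp_buffers pool, dirtied_localbufs = allocated_localbufs = 16384, so

flush_cost = 1.30 × 16384 + 0.01 × 16384 = 21299.2 + 163.84 ≈ 21463

roughly the cost the planner would assign to sequentially reading ~21,500 pages: a one-time toll that parallelism has to pay off.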
What's next?
This benchmark provides the foundational cost model needed to enable parallel query execution on temporary tables in PostgreSQL. The whole implementation requires four key development phases:
1. Add a planner flag to signal when a plan subtree contains operations on temporary objects. This allows the planner to identify when buffer flushing may be required for parallel execution. I hope the existing parallel_safe and consider_parallel flags may be modified to serve this purpose.
2. Implement the buffer flush operation in Gather and GatherMerge nodes before launching parallel workers. This ensures that all dirty temporary table pages are synchronised to disk before workers begin execution.
3. Enable parallel workers to access the leader process's temporary table data from disk. This requires teaching workers how to locate and read temporary table files written by the leader process.
4. Integrate the cost model into the query planner. The planner can then make informed decisions about whether flushing temporary buffers for parallel execution will outperform sequential execution without parallel workers.
Conclusion
A flush of local buffers is approximately 30% slower than a sequential read.
For the optimiser, the default write cost may be hardcoded as 1.30 × DEFAULT_SEQ_PAGE_COST.
In total, the benchmark comprises 360 measurements, 30 iterations per size. Medium datasets (16-512 MB) show a coefficient of variation consistently below 6%, indicating highly stable results. Large datasets (1-2 GB) show higher variability (CV > 150% for writes), requiring careful interpretation.
THE END.
January 05, 2026, Madrid, Spain.


I like the idea, but I think your model lacks consideration for buffering of objects in correlated subqueries (when, for each tuple of the outer relation, we look up a value from the subquery) and all their dependent objects. I suspect there's a significant underestimation of the cost of flushing buffers for objects related to temporary tables - indexes and TOAST tables - and TOAST tables can be substantially larger than the main temporary tables. While we can get information about indexes fairly easily from the RelOptInfo structure, for TOAST tables we can only find out by actually accessing the relation, and we're lucky if it happens to be in the cache (which I suppose is rare). Otherwise, during cost calculation, we'd have to query the system catalog each time, which makes it expensive even for plain select-like queries.
In your current model, there's a lack of knowledge about the actual (or nearly actual) number of temporary buffers used by such dependent temporary relations, including their indexes and TOAST tables. At the moment, PostgreSQL doesn't provide this information in the buffer manager or in any other statistics. Or am I missing something? By the way, have you run tests on large tables (with TOAST elements)?
By the way, wouldn’t it be safer to flush temporary pages during memory allocation for the query plan and its subplans in the InitPlan function? This would allow flushing both the main query plan and any subquery plans.