<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Conserving CPU's cycles ...]]></title><description><![CDATA[Some thoughts and notes caused by code development process]]></description><link>https://danolivo.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!DHDg!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb22ea4c7-73b5-4b9b-aaad-7db704866f94_256x256.png</url><title>Conserving CPU&apos;s cycles ...</title><link>https://danolivo.substack.com</link></image><generator>Substack</generator><lastBuildDate>Mon, 11 May 2026 02:45:08 GMT</lastBuildDate><atom:link href="https://danolivo.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Andrei Lepikhov]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[danolivo@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[danolivo@substack.com]]></itunes:email><itunes:name><![CDATA[Andrei Lepikhov]]></itunes:name></itunes:owner><itunes:author><![CDATA[Andrei Lepikhov]]></itunes:author><googleplay:owner><![CDATA[danolivo@substack.com]]></googleplay:owner><googleplay:email><![CDATA[danolivo@substack.com]]></googleplay:email><googleplay:author><![CDATA[Andrei Lepikhov]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Finding invisible use-after-free bugs in the PostgreSQL planner]]></title><description><![CDATA[A little fuss on dangling pointers]]></description><link>https://danolivo.substack.com/p/finding-invisible-use-after-free</link><guid 
isPermaLink="false">https://danolivo.substack.com/p/finding-invisible-use-after-free</guid><dc:creator><![CDATA[Andrei Lepikhov]]></dc:creator><pubDate>Mon, 20 Apr 2026 19:57:17 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f3c4ae91-8b23-4a24-9267-750186cf814c_960x640.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>On a PostgreSQL build with assertions enabled, run the standard </em><code>make check-world</code><em> suite with a small debugging extension called <a href="https://github.com/danolivo/pg_pathcheck">pg_pathcheck</a> loaded. It will report on pointers to freed memory in the planner's path lists. Such dangling pointers exist even in the core Postgres now. They are harmless today. But the word <strong>today</strong> is what makes this worth writing about.</em></p><h1>A production story</h1><p>This story started in July 2021. At the time, I was finishing a sharding solution built on top of  <code>postgres_fdw</code>. During testing, our engineers sent me an example query that would crash periodically with a <code>SEGFAULT</code>. One look at the plan told me something was very off.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:&quot;51186e2f-4068-4780-b26e-d17b994a422f&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql"> Append
   -&gt;  Nested Loop
         Output: data_1.b
         Join Filter: (g1.a = g2.a)
         ...
   -&gt;  Materialize
         Output: g2.a, data_2.b
         -&gt;  Hash Join
               Output: g2.a, data_2.b
               Hash Cond: (data_2.b = g2.a)
               ...</code></pre></div><p>The first obvious question: how did a <code>Materialize</code> node end up as a direct child of an <code>Append</code>? The second, more interesting one: how can one <code>Append</code> combine two sources with different tuple widths? No wonder the query was crashing &#8212; and to make it worse, the failure was intermittent; sometimes the very same query produced a perfectly reasonable plan.</p><p>On paper, the bug shouldn't have been possible: the optimiser doesn't work that way. A few days of debugging pointed the finger at dangling pointers. While building one of the alternative <code>Append</code> paths, the optimiser adds a cheaper path to a child <code>RelOptInfo</code>&#8217;s <code>pathlist</code> and evicts the one that was there before. But the previously constructed <code>Append</code> still holds a pointer to that now-freed slot. A step or two later, the allocator hands the exact same chunk back out for a new <code>Path</code> higher up the tree, for, say, an enclosing <code>JOIN</code>. The result is a plan that makes no semantic sense at all.</p><h1>Where dangling pointers come from</h1><p>PostgreSQL builds each relation's pathlist incrementally through <code>add_path()</code>. When a newly arrived path dominates an existing one &#8212; cheaper across all relevant dimensions (startup cost, total cost, pathkeys, parallel-safety) &#8212; the dominated path can be released immediately.</p><p>Because the optimiser builds the plan bottom-up, the pathlists of lower operations (scans, for instance) are completed first, and then the pathlists of upper operations (Append and friends) are assembled with references to specific entries in the lower nodes' pathlists.</p><p>This works fine right up until the optimiser, while building a Path for an upper operation, decides that the plan can be improved by adding something to a <em>lower</em> pathlist. 
At that point, a path may be evicted from the lower pathlist &#8212; one that is already referenced from higher up the plan tree. When that happens, we have a dangling pointer.</p><p>There is also an example &#8212; an upper rel evicting a path that a lower rel still references &#8212; in the Postgres core itself. A minimal reproducer looks roughly like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:&quot;a8de4e4a-c900-4240-b707-e1e950729bcd&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">DROP TYPE IF EXISTS insenum CASCADE;
CREATE TYPE insenum AS enum ('L1', 'L2');

EXPLAIN (COSTS OFF)
SELECT enumlabel,
  CASE WHEN enumsortorder &gt; 20 THEN NULL ELSE enumsortorder END AS so
FROM pg_enum
WHERE enumtypid = 'insenum'::regtype
ORDER BY enumsortorder;</code></pre></div><p>With <code>pg_pathcheck</code> loaded, you will see:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:&quot;a5871669-fd0d-4475-913e-8b87fc0c701a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">WARNING:  pg_pathcheck: invalid NodeTag T_SeqScan in pathlist, rel {pg_enum}
DETAIL:  pathlist contents: [0] T_ProjectionPath; [1] T_SeqScan INVALID

 Sort
   Sort Key: enumsortorder
   -&gt;  Seq Scan on pg_enum
         Filter: (enumtypid = '16590'::oid)</code></pre></div><p>What's <a href="https://github.com/danolivo/pg_pathcheck/wiki/Research-on-the-Pattern-A">going on</a>: the path representing the scan of <code>pg_enum</code> gets shared under certain conditions with <code>ordered_rel</code>, the rel that represents the sorted query result. Later, when a new path <code>PP3</code> arrives in <code>ordered_rel</code>, the old shared <code>PP2</code> is evicted and freed by <code>add_path()</code> &#8212; but <code>input_rel-&gt;pathlist</code> still holds a pointer to the freed chunk:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;88f23684-93f4-4f62-a27e-4e4c20cce7ab&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"> input_rel {pg_enum}                    ordered_rel
 &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;                   &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
 &#9474; pathlist:        &#9474;                   &#9474; pathlist:        &#9474;
 &#9474;   [0] &#8594; PP1      &#9474;                   &#9474;   [0] &#8594; PP2 &#9668;&#9472;&#9472;&#9472;&#9472;&#9488;
 &#9474;   [1] &#8594; PP2 &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472; SHARED &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472; &#9496;
 &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;                   &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;


 input_rel {pg_enum}                    ordered_rel
 &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;                   &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
 &#9474; pathlist:        &#9474;                   &#9474; pathlist:        &#9474;
 &#9474;   [0] &#8594; PP1      &#9474;                   &#9474;   [0] &#8594; PP3      &#9474;
 &#9474;   [1] &#8594; ??? &#9668;&#9472;&#9472;&#9472; dangling &#9472;&#9472;&#9472;&#9587;&#9587;&#9587;     &#9474;                  &#9474;
 &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;      pfree'd      &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;
                           chunk</code></pre></div><h1>Why does it still work in practice</h1><p>In vanilla Postgres, this example causes no visible problems, thanks to a subtle invariant: Postgres never walks input_rel's pathlist once ordered_rel is under construction &#8212; it uses direct references to the <code>cheapest_*</code> paths instead. The dangling pointer is created, but nobody dereferences it before the per-query memory context is reset at the end of the statement.</p><p>Extensions, however, often carry their own internal logic and may need to traverse the full pathlist, which can trip the bug. Separately, when building complex plan transformations along the lines of the example above, Postgres forks can produce dangling-pointer situations of their own &#8212; and there is no guarantee that all of them will be found and fixed before shipping to production.</p><p>Features that manipulate the query plan at runtime &#8212; <code>disable_node</code>, <code>pg_plan_advice</code>, <code>pg_hint_plan</code> and the like &#8212; can also accidentally trigger latent optimiser issues and crash the system with a <code>SEGFAULT</code>.</p><h1>A small walker</h1><p>pg_pathcheck is around 600 lines of C. It registers two planner hooks: <code>create_upper_paths_hook</code> to remember the top <code>PlannerInfo</code> and <code>planner_shutdown_hook</code> to do the work.</p><p>The walker visits every <code>Path</code> reachable from the top <code>PlannerInfo</code>. 
That means the <code>upper_rels[]</code> arrays, every <code>simple_rel_array </code>entry with its optional subquery subroot, every join rel collected during dynamic programming, the various parallel <code>RelOptInfos </code>(<code>unique_rel</code>, <code>grouped_rel</code>, <code>part_rels</code>), and &#8212; within compound <code>Path </code>nodes &#8212; every embedded sub-path field (<code>outerjoinpath</code>, <code>innerjoinpath</code>, <code>subpath</code>, <code>subpaths</code>, <code>bitmapqual</code>, &#8230;). A visited-pointer hash keeps the traversal linear.</p><p>At every pointer, two checks run. The first is a NodeTag whitelist: if <code>path-&gt;type</code> is not one of the known Path-family tags, the memory has either been filled with <code>0x7F</code> (freed, not yet reused) or re-allocated as some other kind of node. The second, used for base and join rels, is a parent-match check: <code>path-&gt;parent</code> on a path in <code>rel-&gt;pathlist</code> must equal rel. A mismatch catches same-size-class aliasing &#8212; a freed chunk that has been recycled into another valid Path belonging to a different rel entirely. The tag check passes in that case, but the ownership is wrong.</p><p>When a check fires, the extension emits a report at a configurable elevel (WARNING / ERROR / PANIC, controlled by <code>pg_pathcheck.elevel</code>). The report names the rel, the slot where the stray pointer sits, the full contents of the containing list (with each element annotated by its node kind), and &#8212; via <code>debug_query_string </code>in the hint &#8212; the query that triggered it.</p><h1>The allocator wrinkle</h1><p>PostgreSQL's <code>aset.c</code> uses power-of-two size classes. A <code>Path </code>is 80 bytes, which lands in the 128-byte class. 
So do:</p><ul><li><p><code>ProjectionPath </code>(96), <code>SortPath </code>(88), <code>MaterialPath </code>(88)</p></li><li><p><code>NestPath</code>, <code>AppendPath </code>(112).</p></li><li><p>Among Plan nodes, <code>SeqScan </code>(112), <code>BitmapHeapScan </code>(120), <code>NestLoop </code>(128), <code>Hash </code>(128), <code>Result </code>(128), <code>Gather </code>(128), and roughly a dozen more scans.</p></li></ul><p>When a <code>Path</code> is freed, its slot returns to the 128-byte freelist. The next allocation in that size class, such as a <code>makeNode(&lt;Something&gt;)</code> inside the planner, picks up that exact address. An old dangling pointer that had been invisible during planning now references a perfectly live Path or Plan node, with a valid but wrong NodeTag and with fields shaped for an entirely different kind of object. This is also why <code>CLOBBER_FREED_MEMORY</code> on its own is not enough to detect the problem: the clobber pattern is overwritten by the re-allocation before any walker gets a chance to see it. Out of about 4,000 <a href="https://github.com/danolivo/pg_pathcheck/actions/runs/24673567968">findings </a>in my full-suite run, the number of pointers found carrying the raw <code>0x7F7F7F7F</code> fill was zero. It looks like every freed chunk had already been reused.</p><p>Valgrind catches a use-after-free <em>at the moment of the dereference</em>, and in this case nothing ever dereferences the stale pointer during normal execution. Byte-level tools catch only the dereference, not the dangling reference itself. So a structural walker that verifies the pathlist's semantic invariants (&#8216;this pointer must reference a live Path owned by this rel&#8217;) is the right tool for a use-after-free that the rest of the program is disciplined enough not to trigger.</p><h1>Who may need this</h1><p>Three audiences may find this code useful.</p><p><strong>PostgreSQL core developers</strong>. 
The hackers' threads discuss at least three solution shapes (reference-counted paths, a <code>used</code> flag, and local memory contexts) and have been going back and forth on whether the unwritten contract is actually worth tightening. This <a href="https://github.com/danolivo/pg_pathcheck/actions/runs/24673567968">dataset </a>grounds the discussion.</p><p><strong>Extension authors</strong>, especially those who write custom-scan providers, FDWs, optimisation features, or plan-inspection tooling, benefit from the tool as a sanity check.</p><p><strong>Fork maintainers</strong> have the largest blast radius. Forks tend to modify the planner more aggressively than extensions can, and they ship on schedules that are not always in lockstep with the PostgreSQL master. Running pg_pathcheck against a fork&#8217;s test suite will tell you whether your modifications preserve the invariants that core happens to rely on.</p><h1>Continuous coverage</h1><p>The repository ships a GitHub Actions <a href="https://github.com/danolivo/pg_pathcheck/actions">workflow </a>that runs <code>make -k check-world</code> against a freshly cloned PostgreSQL master with <code>pg_pathcheck </code>enabled. It runs on every push, every pull request, on manual dispatch, and nightly. The artefacts include the full server logs and a deduplicated summary rendered into the step-summary panel. The wiki hosts the raw reports and written analyses of each run.</p><p>If upstream master introduces a new source of dangling pointers, the nightly will flag it the morning after the commit lands. If it closes one, the counts will drop. 
In either direction, the workflow provides a real-time pulse on the contract's state.</p><h1>Running it yourself</h1><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:&quot;1fc69a30-ee0c-457b-899a-7e9bc7cd45d9&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">git clone https://github.com/danolivo/pg_pathcheck
cd pg_pathcheck
USE_PGXS=1 PG_CONFIG=/path/to/install/bin/pg_config make install

echo "shared_preload_libraries = 'pg_pathcheck'" &gt; /tmp/ppc.conf
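# Optional sketch, not part of the original recipe: pin the report level
# explicitly. The setting name pg_pathcheck.elevel and its WARNING / ERROR /
# PANIC levels come from the description above; the exact value spelling and
# setting it through this config file are assumptions. WARNING should keep the
# regression run going so that every finding ends up in the logs.
echo "pg_pathcheck.elevel = 'warning'" | tee -a /tmp/ppc.conf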
TEMP_CONFIG=/tmp/ppc.conf make check-world</code></pre></div><p>After the run, look for the warnings in every <code>tmp_check/log/*.log</code>, <code>log/postmaster.log</code>, <code>results/*.out</code>, and <code>regression.diffs</code>. For interactive debugging, use <code>SET pg_pathcheck.elevel = error</code> (or <code>panic</code>), which makes it easy to correlate a specific finding with a specific test query.</p><p>The extension targets PostgreSQL master specifically: it uses <code>PG_MODULE_MAGIC_EXT</code> and the <code>extension_state</code> slot API, both recent additions. It registers no SQL objects. <code>CREATE EXTENSION pg_pathcheck</code> does nothing useful. All effects are routed through the planner hooks it installs at library load.</p><p>If you try it and find something (on an unmodified master, on a fork, in your extension's test suite), I would like to read about it in the comments.</p><p></p><p>THE END.<br><em>April 20, 2026, Madrid, Spain.</em></p><p></p><p><strong>Disclosure</strong></p><p>Most of pg_pathcheck's code and this post were drafted with the help of a large language model (Claude). 
Every change was reviewed by a human before being committed, but the prose and structure are largely machine-produced.</p>]]></content:encoded></item><item><title><![CDATA[500 Milliseconds on Planning: How PostgreSQL Statistics Slowed Down a Query 20 Times Over]]></title><description><![CDATA[When legacy decisions meet current database realities]]></description><link>https://danolivo.substack.com/p/500-milliseconds-on-planning-how</link><guid isPermaLink="false">https://danolivo.substack.com/p/500-milliseconds-on-planning-how</guid><dc:creator><![CDATA[Andrei Lepikhov]]></dc:creator><pubDate>Wed, 28 Jan 2026 15:25:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!JIXZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb24b03fe-330e-42e9-a48a-04945d3b4e33_1220x878.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>A query executes in just 2 milliseconds, yet its planning phase takes 500 ms. The database is reasonably sized, the query involves 9 tables, and the default_statistics_target is set to only 500. Where does this discrepancy come from?</em></p><p><em>This question was recently <a href="https://www.postgresql.org/message-id/AS4PR02MB874303264FDEF9B160C06DF9EA8EA@AS4PR02MB8743.eurprd02.prod.outlook.com">raised</a> on the pgsql-performance mailing list, and the investigation revealed a somewhat surprising culprit: the column statistics stored in PostgreSQL's pg_statistic table.</em></p><h2>The Context</h2><p>In PostgreSQL, query optimisation relies on various statistical measures, such as MCV, histograms, distinct values, and others - all stored in the pg_statistic table. By default, these statistics are based on samples of up to 100 elements. For larger tables, however, we typically need significantly more samples to ensure reliable estimates. 
A thousand to 5000 elements might not seem like much when representing billions of rows, but this raises an important question: could large statistical arrays, particularly MCVs on variable-sized columns, seriously impact query planning performance, even if query execution itself is nearly instantaneous?</p><h2>Investigating the Problem</h2><p>We're examining a typical auto-generated 1C system <a href="https://github.com/danolivo/conf/blob/main/2026-massive-table-stat/query.sql">query</a>. '1C' is a typical object-relational mapping framework for accounting applications. PostgreSQL version is 17.5. Notably, the default_statistics_target value is set to only 500 elements, even below the recommended value for 1C systems (2500). The query contains 12 joins, but 9 are spread across subplans, and the join search space is limited by three JOINs, which is quite manageable. Looking at the EXPLAIN <a href="https://github.com/danolivo/conf/blob/main/2026-massive-table-stat/query-explain-1c.txt">output</a>, the planner touches only 5 buffer pages during planning - not much.</p><p>Interestingly, the alternative PostgreSQL fork (such forks have become increasingly popular these days) <a href="https://github.com/danolivo/conf/blob/main/2026-massive-table-stat/query-explain-pgpro.txt">executed </a>this query with nearly identical execution plans, and the planning time is considerably shorter - around 80 milliseconds. Let's use this as our control sample.</p><h2>The Hunt for Root Cause</h2><p>The first suspicion was obvious: perhaps the developers expanded the optimiser's search space, and it's simply passing through multiple extra paths. A flamegraph comparison between the <a href="https://github.com/danolivo/conf/blob/main/2026-massive-table-stat/flamegraph-4999-1c.svg">slow</a> planning case and the <a href="https://github.com/danolivo/conf/blob/main/2026-massive-table-stat/flamegraph-4999-pgpro.svg">alternative</a> fork showed remarkably similar patterns. 
Both exhibited search space expansion from features standard in 1C-related PostgreSQL forks (Joinsel and 'Append of IndexScans'), but nothing surprising beyond that.</p><p>However, the detailed analysis of the flamegraph revealed something more telling: a performance bottleneck in the <code>byteaeq()</code> comparison operation, triggered by the cost_index() function's cost estimation and <code>toast_raw_datum_size()</code> calls. The optimiser invokes this repeatedly while evaluating all possible index combinations across various expressions - not just those explicitly mentioned in the query, but also derived ones through 'equivalence classes' created by equality operations.</p><p>The query references just three columns: <code>inforg10621::fld10622rref</code>, <code>inforg10621::fld15131rref</code>, and <code>inforg8199::fld8200_rrref</code>. Yet these are involved in 20 different expressions, 15 of which are join clauses. When you factor in the number of indexes on these tables - eight between the two - it becomes clear that the number of possible combinations can explode. But how can we confirm this suspicion? How many times does the optimiser actually consult table statistics?</p><p>Unfortunately, standard PostgreSQL doesn't provide this information directly. 
So I turned to my own project - <a href="https://github.com/danolivo/pg_index_stats">pg_index_stats</a>, which uses PostgreSQL's internal hooks (<code>relation_stats_hook </code>and <code>get_index_stats_hook</code>) to collect precisely this data and display it in EXPLAIN output.</p><p>Here's what we found (<a href="https://github.com/danolivo/conf/blob/main/2026-massive-table-stat/stat_used-1c.res">1c</a> and <a href="https://github.com/danolivo/conf/blob/main/2026-massive-table-stat/stats_used-pgpro.res">alternative</a>):</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/QUDOf/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b24b03fe-330e-42e9-a48a-04945d3b4e33_1220x878.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8745435d-788b-479d-904b-5a21f790bfce_1220x878.png&quot;,&quot;height&quot;:443,&quot;title&quot;:&quot;Created with Datawrapper&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/QUDOf/1/" width="730" height="443" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>The statistics for four columns are being accessed more than 100 times each. Remarkably, for the <code>fld10622rref </code>column, the optimiser fetches, decompresses, and uses the statistics 217 times! 
While this is less critical for the <code>fld809 </code>column (which has no histogram or MCV due to its nearly unique nature), other columns require repeatedly decompressing substantial arrays. The alternative fork accesses statistics roughly half as often - a significant improvement, though not quite enough to fully explain the planning time difference.</p><h2>Digging Deeper</h2><p>What statistics do we actually have, and in what volume? Comparing statistics dumps from both PostgreSQL versions (<a href="https://github.com/danolivo/conf/blob/main/2026-massive-table-stat/pgstats-1c.txt">here</a> and <a href="https://github.com/danolivo/conf/blob/main/2026-massive-table-stat/pgstats-pgpro.txt">there</a>) shows that our tables indeed contain MCV and histogram arrays with up to 500 elements for several columns. Their uncompressed size reaches tens of kilobytes (compressed, over 2KB), and extracting them requires decompression before use. Surely we don't need to fetch and decompress these large arrays repeatedly?</p><p>After all, PostgreSQL does have caching that should calculate selectivity for a given expression only once &#8230;</p><p>We have two obvious suspects: columns <code>fld10622rref </code>and <code>fld8201rref</code>. Let's test our hypothesis by mechanically zeroing out their statistics and seeing what happens:</p><pre><code>UPDATE pg_statistic
SET stanumbers1 = CASE WHEN stakind1 = 1 THEN NULL ELSE stanumbers1 END,
    stavalues1 = CASE WHEN stakind1 = 1 THEN NULL ELSE stavalues1 END,
    stakind1 = CASE WHEN stakind1 = 1 THEN 0 ELSE stakind1 END,
    stanumbers2 = CASE WHEN stakind2 = 1 THEN NULL ELSE stanumbers2 END,
    stavalues2 = CASE WHEN stakind2 = 1 THEN NULL ELSE stavalues2 END,
    stakind2 = CASE WHEN stakind2 = 1 THEN 0 ELSE stakind2 END
WHERE (starelid = '_inforg10621'::regclass AND staattnum = (
    SELECT attnum FROM pg_attribute
    WHERE (attrelid = '_inforg10621'::regclass AND attname = 'fld10622rref')))
OR (starelid = '_inforg8199'::regclass AND staattnum = (
    SELECT attnum FROM pg_attribute
    WHERE (attrelid = '_inforg8199'::regclass AND attname = '_fld8201rref')));</code></pre><p>The result? EXPLAIN now <a href="https://github.com/danolivo/conf/blob/main/2026-massive-table-stat/query-explain-rm-columnstat.txt">shows</a> planning time at around 30ms:</p><pre><code>Planning: Buffers: shared hit=5 Memory: used=4030kB allocated=4096kB
Planning Time: 31.347 ms
Execution Time: 0.237 ms</code></pre><p>If we delete all statistics entirely with:</p><pre><code>DELETE FROM pg_statistic;</code></pre><p>We get the theoretical minimum planning time for this query:</p><pre><code>Planning: Buffers: shared hit=5 Memory: used=3932kB allocated=4096kB
Planning Time: 18.477 ms
Execution Time: 0.421 ms</code></pre><p>This aligns perfectly with the alternative fork's <a href="https://github.com/danolivo/conf/blob/main/2026-massive-table-stat/query-explain-pgpro-cutstat.txt">planning time</a>.</p><p>But in the current master branch, since commit 057012b, Postgres employs a hashing technique to reduce the N^2 overhead of long MCV array passes. Ok, let's <a href="https://github.com/danolivo/conf/blob/main/2026-massive-table-stat/mcv-patch-17_5.diff">backpatch</a> our case and check the <a href="https://github.com/danolivo/conf/blob/main/2026-massive-table-stat/query-explain-1c-MCVopt.txt">explain</a>:</p><pre><code> Planning:
   Buffers: shared hit=5
   Memory: used=3984kB  allocated=4096kB
 Planning Time: 64.603 ms
 Execution Time: 0.197 ms</code></pre><p>It is definitely better than before, but we still see overhead that may grow with larger statistical arrays and repeated detoasting/decompression attempts.</p><h2>The Verdict</h2><p>Statistics do indeed cause the excessive planning time, but the question remains: is it the overhead of decompressing statistics, or the overhead of repeatedly iterating through long MCV and histogram arrays? The answer is likely both.</p><p>We can indirectly confirm the impact of repeatedly traversing MCV arrays by noting that changing the storage type of columns in <code>pg_statistic </code>from <code>EXTENDED </code>to <code>EXTERNAL</code>, which stores the arrays without compression, produces no measurable difference:</p><pre><code>DELETE FROM pg_statistic;
SET allow_system_table_mods = 'on';
ALTER TABLE pg_statistic ALTER COLUMN stavalues1 SET STORAGE EXTERNAL;
&#8230;
VACUUM ANALYZE;</code></pre><h2>Conclusion and Solutions</h2><p>The root cause is clear: the optimiser's search space expanded due to increased index counts and statistics sizes - both entirely legitimate scenarios that can occur beyond ORM applications. The execution itself remains efficient and doesn't consume significant disk or memory resources, so it doesn't significantly impact neighbouring operations. However, the planning time can become problematic.</p><p>What Can Be Done?</p><p>First approach: Implement a caching system for frequently accessed, extensive statistics. This could even be implemented as an extension (similar to how I collected statistics access patterns in pg_index_stats). The code wouldn't be overly complex - just a standard module allocating a DSM segment for a hash table and decompressed statistics. Additionally, it's worth exploring whether to store MCVs in sorted order (when the data type allows), enabling fast element matching on both sides during JOIN estimation and quick lookup during filter estimation.</p><p>Second approach: You can reduce the statistics target on problematic columns. A target of 0, as below, disables statistics collection for the column at the next ANALYZE; a small positive value keeps shorter arrays instead:</p><pre><code>ALTER TABLE table_name ALTER COLUMN column_name SET STATISTICS 0;</code></pre><p>Of course, the challenge here is detecting the problematic spots (columns, clauses) inside the query. There's no universal answer - you need to run EXPLAIN on suspicious queries with and without statistics, then perform the same analysis I did above. And naturally, report findings to the vendor, because there's always room for improvement!</p><p></p><p>THE END.<br><em>Istanbul, Turkey. 
January 26, 2026.</em></p>]]></content:encoded></item><item><title><![CDATA[Inventing A Cost Model for PostgreSQL Local Buffers Flush]]></title><description><![CDATA[On a way to parallel temp tables scan]]></description><link>https://danolivo.substack.com/p/inventing-a-cost-model-for-postgresql</link><guid isPermaLink="false">https://danolivo.substack.com/p/inventing-a-cost-model-for-postgresql</guid><dc:creator><![CDATA[Andrei Lepikhov]]></dc:creator><pubDate>Mon, 05 Jan 2026 12:39:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!riAX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcb757d-e3e2-48b6-9527-a782c5b260eb_1220x644.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>In this post, I describe experiments on the write-versus-read costs of PostgreSQL's temporary buffers. For the sake of accuracy, the PostgreSQL functions set is extended with tools to measure buffer flush operations. The measurements show that writes are approximately 30% slower than reads. Based on these results, the cost estimation formula for the optimiser has been proposed:</em><br><code>flush_cost = 1.30 &#215; dirtied_bufs + 0.01 &#215; allocated_bufs.</code></p><h2><strong>Introduction</strong></h2><p>Temporary tables in PostgreSQL have always been <a href="https://www.postgresql.org/docs/current/parallel-safety.html">parallel restricted</a>. From my perspective, the reasoning is straightforward: temporary tables exist primarily to compensate for the absence of <a href="https://en.wikipedia.org/wiki/Relvar">relational variables</a>, and for performance reasons, they should remain as simple as possible. Since PostgreSQL parallel workers behave like separate backends, they don't have access to the leader process's local state, where temporary tables reside. 
Supporting parallel operations on temporary tables would significantly increase the complexity of this machinery.</p><p>However, we now have at least two working implementations of parallel temporary table support: <a href="https://postgrespro.com/docs/postgrespro/16/runtime-config-query#GUC-ENABLE-PARALLEL-TEMPTABLES">Postgres Pro</a> and <a href="https://habr.com/ru/companies/tantor/articles/965264/#%D0%9F%D0%B0%D1%80%D0%B0%D0%BB%D0%BB%D0%B5%D0%BB%D0%B8%D0%B7%D0%BC%20VS%20%D0%B2%D1%80%D0%B5%D0%BC%D0%B5%D0%BD%D0%BD%D1%8B%D0%B5%20%D1%82%D0%B0%D0%B1%D0%BB%D0%B8%D1%86%D1%8B">Tantor</a>. One more reason: identification of temporary tables within a UTILITY command is an essential step toward auto DDL in logical replication. So, maybe it is time to propose such a feature for PostgreSQL core.</p><p>After numerous code improvements over the years, AFAICS, only one fundamental problem remains: temporary buffer pages are local to the leader process. If these pages don't match the on-disk table state, parallel workers cannot access the data.</p><p>A comment in the code (80558c1) made by Robert Haas in 2015 clarifies the state of the art:</p><pre><code><code>/*
 * Currently, parallel workers can't access the leader's temporary
 * tables.  We could possibly relax this if we wrote all of its
 * local buffers at the start of the query and made no changes
 * thereafter (maybe we could allow hint bit changes), and if we
 * taught the workers to read them.  Writing a large number of
 * temporary buffers could be expensive, though, and we don't have
 * the rest of the necessary infrastructure right now anyway.  So
 * for now, bail out if we see a temporary table.
 */
</code></code></pre><p>The comment hints at a path forward: if we flush the leader's temporary buffers to disk before launching parallel operations, workers can safely read from the shared on-disk state. The concern, however, is cost - would the overhead of writing those buffers outweigh the benefits of parallelism?</p><p>On the path to enabling parallel temporary table scans, this cost argument is fundamental and must be addressed first. We can resolve this issue by providing the optimiser with a proper cost model. With such a model, it could choose between a parallel scan with buffer flushing overhead and a sequential scan performed without parallel workers. Hence, we are looking for a constant, similar to DEFAULT_SEQ_PAGE_COST, to estimate the writing overhead. Let's address this question with actual data and measure the cost of flushing temporary buffers. My goal is to determine whether this overhead represents a real barrier to parallel execution or simply an overestimated concern that has kept this optimisation off the table.</p><h2><strong>Benchmarking tools</strong></h2><p>PostgreSQL currently provides no direct access to local buffers for measurement purposes. To conduct this benchmark, I extended the system with several instrumentation tools and UI functions. 
The <a href="https://github.com/danolivo/pgdev/tree/temp-buffers-sandbox">temp-buffers-sandbox</a> branch, based on the current PostgreSQL master, contains all the modifications needed for this work.</p><p>The implementation consists of two key commits:</p><p><strong>No.1: Statistics infrastructure</strong></p><p>This commit introduces two new internal statistics that track local buffer state:</p><ul><li><p><code>allocated_localbufs</code> - tracks the total number of local buffers currently allocated in this backend (it can't be more than the <code>temp_buffers</code> value).</p></li><li><p><code>dirtied_localbufs</code> - counts how many local buffer pages are dirty (not flushed to disk).</p></li></ul><p>I believe these statistics can provide the foundation for the cost model, giving the query optimiser visibility into the current state of temporary buffers before deciding whether to flush them.</p><p><strong>No.2: UI functions</strong></p><p>This commit adds SQL-callable functions that allow direct manipulation and inspection of local buffers:</p><ul><li><p><code>pg_allocated_local_buffers()</code> - returns the count of currently allocated local buffers.</p></li><li><p><code>pg_flush_local_buffers()</code> - explicitly flushes all dirty local buffers to disk.</p></li><li><p><code>pg_read_temp_relation(relname, randomize)</code> - reads all blocks of a temporary table either sequentially or in random order.</p></li><li><p><code>pg_temp_buffers_dirty(relname)</code> - marks all pages of a temporary table as dirty in the buffer pool.</p></li></ul><p>If local buffers are free or not allocated yet, reading a relation's blocks in random order scatters them randomly across the buffer pool. Flushing these pages to disk afterwards therefore serves as a simulation of a simple '<code>random-write</code>' mode. 
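</p><p>As an illustration, a session on a server built from the temp-buffers-sandbox branch might exercise these functions as follows. This is only a sketch: the function names and arguments come from the list above, while the table definition, row count, and <code>temp_buffers</code> value are arbitrary assumptions, and none of these functions exist in stock PostgreSQL.</p><pre><code>-- Sketch only: requires a build from the temp-buffers-sandbox branch.
-- Table name, row count, and temp_buffers value are illustrative assumptions.
SET temp_buffers = '64MB';  -- takes effect only before the session first uses a temp table

CREATE TEMP TABLE test AS
  SELECT g AS id, repeat('x', 100) AS payload
  FROM generate_series(1, 100000) AS g;

SELECT pg_allocated_local_buffers();          -- local buffers allocated so far
SELECT pg_flush_local_buffers();              -- write all dirty local buffers to disk
SELECT pg_read_temp_relation('test', false);  -- read every block sequentially
SELECT pg_read_temp_relation('test', true);   -- read every block in random order
SELECT pg_temp_buffers_dirty('test');         -- mark all pages of the table as dirty
SELECT pg_flush_local_buffers();              -- flush them again
</code></pre><p>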
These functions enable almost direct measurement of read and write operations.</p><h2><strong>Methodology</strong></h2><p>The complete test bench is available <a href="https://github.com/danolivo/conf/tree/main/Scripts/temp_buffers_benchmark">here</a>.</p><p>Fortunately, local buffer operations are pretty straightforward: they don't acquire locks, don't require WAL logging, and avoid other costly manipulations. This eliminates concurrency concerns and simplifies the test logic. To build a cost estimation model, we need to measure three things: write speed, read speed, and the overhead of scanning buffers when no I/O is required.</p><p>The ratio of read to write speed will allow us to derive a write-page-cost parameter based on the <a href="https://github.com/postgres/postgres/blob/915711c8a4e60f606a8417ad033cea5385364c07/src/include/optimizer/cost.h#L24">DEFAULT_SEQ_PAGE_COST</a> value used in core PostgreSQL. The optimiser can use this parameter to estimate the cost of flushing dirty local buffers before parallel operations begin.</p><p>Each test iteration follows this algorithm:</p><p><strong>Sequential access testing:</strong></p><ol><li><p>Create a temp table and fill it with data that fits within the local buffer pool (all pages will be dirty in memory).</p></li><li><p>Call <code>pg_flush_local_buffers()</code> to write all dirty buffers to disk. Measure I/O.</p></li><li><p>Call <code>pg_flush_local_buffers()</code> again to measure the overhead of scanning buffers without actual flush (<code>dry-write-run</code>).</p></li><li><p>Evict the test table&#8217;s pages by creating a dummy table that fills the entire buffer pool, then drop it.</p></li><li><p>Call <code>pg_read_temp_relation('test', false)</code> to read all blocks sequentially from disk into buffers. 
Measure I/O.</p></li><li><p>Call <code>pg_read_temp_relation('test', false)</code> again to measure the overhead of scanning buffers without an actual read (<code>dry-read-run</code>).</p></li></ol><p><strong>Random access testing:</strong></p><ol><li><p>Evict the test table's pages again by creating and dropping a dummy table.</p></li><li><p>Call <code>pg_read_temp_relation('test', true)</code> to read blocks in random order, distributing them randomly across the buffer pool.</p></li><li><p>Call <code>pg_temp_buffers_dirty('test')</code> to mark all table pages as dirty.</p></li><li><p>Call <code>pg_flush_local_buffers()</code> to flush pages to disk. Since pages were loaded randomly into buffers, this approximates random write patterns.</p></li></ol><p>All measurements are captured using <code>EXPLAIN (ANALYZE, BUFFERS)</code>, which records execution time in milliseconds and buffer I/O statistics (local read, local written, local hit counts). Planning time is negligible (typically &lt; 0.02 ms) and excluded from analysis. While it's possible to avoid EXPLAIN and its instrumentation overhead entirely, I believe this overhead is minimal and consistent between write and read operations. Using EXPLAIN provides a convenient way to verify execution time and confirm the actual number of blocks affected.</p><p>The tests cover buffer pool sizes at powers of 2 from 128 to 262,144 blocks (1 MB to 2 GB), with 30 iterations per size for statistical reliability. Each test allocates 101% of the target block count to accommodate Free Space Map and Visibility Map metadata. 
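</p><p>Put together, one iteration of the sequential and random access steps listed above might look like the following sketch. It assumes a build from the temp-buffers-sandbox branch and the default 8 kB block size; the row counts and the <code>temp_buffers</code> value are illustrative guesses, chosen so that <code>test</code> fits into the pool and <code>dummy</code> does not, and the scripts in the linked test bench may be organised differently.</p><pre><code>SET temp_buffers = '16MB';  -- 2048 local buffers; set before any temp table is used

-- Sequential access testing
CREATE TEMP TABLE test AS
  SELECT g AS id FROM generate_series(1, 100000) AS g;                   -- 1. fill; pages stay dirty
EXPLAIN (ANALYZE, BUFFERS) SELECT pg_flush_local_buffers();              -- 2. write
EXPLAIN (ANALYZE, BUFFERS) SELECT pg_flush_local_buffers();              -- 3. dry-write-run
CREATE TEMP TABLE dummy AS
  SELECT g AS id FROM generate_series(1, 1000000) AS g;                  -- 4. evict the pages of test
DROP TABLE dummy;
EXPLAIN (ANALYZE, BUFFERS) SELECT pg_read_temp_relation('test', false);  -- 5. sequential read
EXPLAIN (ANALYZE, BUFFERS) SELECT pg_read_temp_relation('test', false);  -- 6. dry-read-run

-- Random access testing
CREATE TEMP TABLE dummy AS
  SELECT g AS id FROM generate_series(1, 1000000) AS g;                  -- 1. evict again
DROP TABLE dummy;
EXPLAIN (ANALYZE, BUFFERS) SELECT pg_read_temp_relation('test', true);   -- 2. random-order read
SELECT pg_temp_buffers_dirty('test');                                    -- 3. mark pages dirty
EXPLAIN (ANALYZE, BUFFERS) SELECT pg_flush_local_buffers();              -- 4. random-write flush
</code></pre><p>The same sequence is repeated for every buffer pool size in the range above. 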
Higher buffer counts cause memory swapping and produce unreliable results.</p><h2><strong>Benchmark results</strong></h2><p>On my laptop, the most stable performance occurs in the 4-512 MB range:</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/TZ1Fc/2/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4bcb757d-e3e2-48b6-9527-a782c5b260eb_1220x644.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9fe75d4a-7409-4cb7-858d-97fd78a16d9c_1220x804.png&quot;,&quot;height&quot;:312,&quot;title&quot;:&quot;Created with Datawrapper&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/TZ1Fc/2/" width="730" height="312" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>Large datasets show higher write overhead and variability:</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/db7mo/2/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3005f854-a4f6-4142-85c9-c006cb8445d5_1220x320.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d697fee-e52e-40f8-b6d0-6852beba2a84_1220x320.png&quot;,&quot;height&quot;:150,&quot;title&quot;:&quot;Created with Datawrapper&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe 
id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/db7mo/2/" width="730" height="150" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>Scanning without I/O (Dry-Run) is minimal, around 0.002-0.240 ms.</p><p>Based on the results, the temp table write cost should account for both sequential overhead and random access patterns. The analysis shows:</p><p><strong>Sequential write overhead</strong>: Approximately 20% slower than sequential reads.</p><p><strong>Random access degradation</strong>: Random buffer distribution patterns show an additional 10-24% performance degradation compared to sequential access. This occurs when temporary table pages are scattered across the buffer pool due to interleaved operations or random access patterns.</p><p><strong>Recommended cost formula</strong>:</p><pre><code><code>DEFAULT_WRITE_TEMP_PAGE_COST = 1.30 &#215; DEFAULT_SEQ_PAGE_COST</code></code></pre><p>This 1.30 multiplier accounts for both the sequential write overhead (~1.20) and the random-access degradation (~0.10), providing a conservative estimate that covers realistic workload scenarios where buffer access patterns may not be purely sequential.</p><p>Write cost is relatively close to read cost because no WAL logging is required for temporary tables. I&#8217;m uncertain what storage type the current default seq_page_cost targets; my measurements were conducted on an NVMe SSD. Would the relationship differ on HDD? 
Further investigation into different storage types may be warranted.</p><p>Also, tests indicate that we can estimate the buffer-scanning overhead at approximately 1% of the writing cost. Hence, the complete formula for the cost of flushing temporary buffers before parallel execution may look like this (<code>DEFAULT_SEQ_PAGE_COST = 1</code>):</p><pre><code><code>flush_cost = 1.30 &#215; dirtied_localbufs + 0.01 &#215; allocated_localbufs</code></code></pre><h2><strong>What&#8217;s next?</strong></h2><p>This benchmark provides the foundational cost model needed to enable parallel query execution on temporary tables in PostgreSQL. The full implementation requires four key development phases:</p><ol><li><p>Add a planner flag to signal when a plan subtree contains operations on temporary objects. This allows the planner to identify when buffer flushing may be required for parallel execution. I hope that the existing <code>parallel_safe</code> and <code>consider_parallel</code> flags can be adapted to serve this purpose.</p></li><li><p>Implement the buffer flush operation in <code>Gather</code> and <code>GatherMerge</code> nodes before launching parallel workers. This ensures that all dirty temporary table pages are synchronised to disk before workers begin execution.</p></li><li><p>Enable parallel workers to access the leader process&#8217;s temporary table data from disk. This requires teaching workers how to locate and read temporary table files written by the leader process.</p></li><li><p>Integrate the cost model into the query planner. 
The planner can then make informed decisions about whether flushing temporary buffers for parallel execution will outperform sequential execution without parallel workers.</p></li></ol><h2><strong>Conclusion</strong></h2><ul><li><p>Flushing a local buffer page is approximately <strong>30% slower</strong> than reading one sequentially.</p></li><li><p>For planning purposes, the default write cost may be hardcoded to 1.3 &#215; DEFAULT_SEQ_PAGE_COST.</p></li><li><p>The results are based on 360 measurements: 12 buffer pool sizes with 30 iterations each. Medium datasets (16-512 MB) show a coefficient of variation consistently below 6%, indicating highly stable results. Large datasets (1-2 GB) show higher variability (CV &gt;150% for writes), requiring careful interpretation.</p></li></ul><p></p><p>THE END.<br><em>January 05, 2026, Madrid, Spain.</em></p>]]></content:encoded></item><item><title><![CDATA[Revising the Postgres Multi-master Concept]]></title><description><![CDATA[Does logical replication have hidden potential?]]></description><link>https://danolivo.substack.com/p/revising-the-postgres-multi-master</link><guid isPermaLink="false">https://danolivo.substack.com/p/revising-the-postgres-multi-master</guid><dc:creator><![CDATA[Andrei Lepikhov]]></dc:creator><pubDate>Sat, 18 Oct 2025 10:39:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Pa-S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab64541f-f448-4738-8118-adf0e6c78d00_719x403.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>One of the ongoing challenges in database management systems (DBMS) is maintaining consistent data across multiple instances (nodes) that can independently accept client connections. If one node fails in such a system, the others must continue to operate without interruption - accepting connections and committing transactions without sacrificing consistency. 
An analogy for a single DBMS instance might be staying operational despite a RAM failure or intermittent access to multiple processor cores.</em></p><p><em>In this context, I would like to revisit the discussion about the Postgres-based multi-master problem, including its practical value, feasibility, and the technology stack that needs to be developed to address it. By narrowing the focus of the problem, we may be able to devise a solution that benefits the industry.</em></p><p>I spent several years developing the multi-master extension in the late 2010s until it became clear that the concept of <em>essentially consistent multi-master</em> replication had reached a dead end. Now, after taking a long break from working on replication, changing countries, residency, and companies, I am revisiting the Postgres-based multi-master idea to explore its practical applications.</p><p>First, I want to clarify the general use case for multi-master replication and highlight its potential benefits. Obviously, any technology must balance its capabilities with the needs it aims to address. Let's explore this balance within the context of multi-master replication.</p><p>Typically, clients consider a multi-master solution when they hit a limit in connection counts for their OLTP workloads. They often have a large number of clients, an <em>N transactions-per-second (TPS)</em> workload, and a single database. 
They envision a solution that involves adding another identical server, setting up active-active replication, and doubling their workload.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Pa-S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab64541f-f448-4738-8118-adf0e6c78d00_719x403.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Pa-S!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab64541f-f448-4738-8118-adf0e6c78d00_719x403.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Pa-S!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab64541f-f448-4738-8118-adf0e6c78d00_719x403.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Pa-S!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab64541f-f448-4738-8118-adf0e6c78d00_719x403.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Pa-S!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab64541f-f448-4738-8118-adf0e6c78d00_719x403.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Pa-S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab64541f-f448-4738-8118-adf0e6c78d00_719x403.jpeg" width="719" height="403" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ab64541f-f448-4738-8118-adf0e6c78d00_719x403.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:403,&quot;width&quot;:719,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52915,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://danolivo.substack.com/i/176358418?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b5113b-f78c-4496-9b7e-ecce77c8143c_960x540.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Pa-S!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab64541f-f448-4738-8118-adf0e6c78d00_719x403.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Pa-S!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab64541f-f448-4738-8118-adf0e6c78d00_719x403.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Pa-S!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab64541f-f448-4738-8118-adf0e6c78d00_719x403.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Pa-S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab64541f-f448-4738-8118-adf0e6c78d00_719x403.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">The typical (desired) setup users envision when discussing multi-master</figcaption></figure></div><p>Sometimes, a client has a distributed application that requires a stable connection to the database across different geographic locations. At other times, clients simply desire a reliable automatic failover. A less common request, though still valuable, is to provide an online upgrade or to enable the detachment of a node for maintenance. Additionally, some clients may request an extra master node that has warmed-up caches and a storage state that closely resembles the production environment, allowing it to be used for testing and benchmarking. 
In summary, there are numerous tasks that can be requested, but what can realistically be achieved?</p><p>It's important to remember that active-active replication in PostgreSQL is currently only feasible with logical replication, which implies additional server load from decoding, the walsender, and so on. Network latency also comes into play immediately - we need to wait for confirmation from the remote node that the transaction was successfully applied, right? Therefore, the idea of scaling the write load for the general case of 100% replication immediately encounters the fact that each server will be required to write not only its own changes, but also changes from other instances (see the figure above). That isn't a problem for massive queries with bushy <code>SELECT</code>s, but a multi-master setup is more likely to be requested by clients with pure OLTP workloads and very simple DML.</p><p>We face a similar challenge when trying to provide each copy of the distributed application with a nearby database instance. If the application's connection to the remote database is weak, then the connection between the two DBMS instances will also be unstable. As a result, waiting for confirmation of a successful transaction commit on the remote node can lead to significant delays.</p><p>Complex situations can also occur when a replicated update tries to overwrite the same table row that has been updated locally. Such an event can happen because we have no guarantee that the snapshots of the transactions that caused these conflicting changes are consistent across DBMS instances. This raises the question: whose change should be applied, and whose should be rolled back? 
Does a single row change within the transaction logic need to take into account the competing change for it to be valid?</p><p>The autofailover case is relatively straightforward, but something still needs to be done to make such a configuration effective: after all, if all instances can write, then a transaction commit must be accompanied by a supplemental message ensuring that the transaction is written at each instance. Otherwise, if node <em>N<sub>x</sub></em> crashes, some of its transactions may be written to the database on node <em>N<sub>y</sub></em>, but not committed to (or rolled back in) the database of node <em>N<sub>z</sub></em>. How do you fix this situation other than by sending the entire configuration into recovery?</p><p>Thus, the concept of multi-master replication is questionable, particularly for those seeking to accelerate an OLTP workload. Why, then, would anyone need it? Let's begin by examining the underlying technology: logical replication.</p><p>I see two significant advantages to logical replication. First, it enables highly selective data replication, allowing you to pick only specific tables. Additionally, for each table, you can set up filters with replication conditions that let you easily skip individual records or entire transactions right at the outset of the replication process during the decoding phase. This feature provides a highly granular mechanism for selecting specific data that should be synchronised with a remote system.</p><p>The second notable advantage is the high-level nature of the mechanism. This type of replication occurs at the level of relational algebra, which means you can abstract away the complexities of physical storage.</p><p>What amazing benefits come with a high level of abstraction? Imagine the possibilities! 
You can customise different sets of indexes on synchronised nodes, which significantly reduces DML overhead and allows you to route queries based on where execution can be most effective. For example, you could focus on loading one instance with brief UPDATE/DELETE queries on primary keys, while reserving another instance for larger subqueries or INSERTs that usually don't conflict with updates. You could even mix it up by using a traditional Postgres heap on one instance and a column storage on another! The creativity here knows no bounds when it comes to the potential of your replication protocol.</p><p>Now that we have outlined the benefits of logical replication, let's consider a use case that can be effectively implemented using a multi-master configuration. </p><p>To begin, we will set aside concerns related to upgrades, maintenance, and failovers. The most apparent use case is for supporting a geo-distributed application. By categorising the data in the database into three types - critically important general data, general data that is changed on only one side, and purely local data - we can leverage the advantages of this setup (see figure).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7jk4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7c8cd72-c813-4259-bea7-63c83ccbf8e1_719x421.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7jk4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7c8cd72-c813-4259-bea7-63c83ccbf8e1_719x421.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!7jk4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7c8cd72-c813-4259-bea7-63c83ccbf8e1_719x421.jpeg 848w, https://substackcdn.com/image/fetch/$s_!7jk4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7c8cd72-c813-4259-bea7-63c83ccbf8e1_719x421.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!7jk4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7c8cd72-c813-4259-bea7-63c83ccbf8e1_719x421.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7jk4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7c8cd72-c813-4259-bea7-63c83ccbf8e1_719x421.jpeg" width="719" height="421" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7c8cd72-c813-4259-bea7-63c83ccbf8e1_719x421.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:421,&quot;width&quot;:719,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:53646,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danolivo.substack.com/i/176358418?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ac863ed-9e1d-4799-a92e-1251b05078d6_960x540.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!7jk4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7c8cd72-c813-4259-bea7-63c83ccbf8e1_719x421.jpeg 424w, https://substackcdn.com/image/fetch/$s_!7jk4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7c8cd72-c813-4259-bea7-63c83ccbf8e1_719x421.jpeg 848w, https://substackcdn.com/image/fetch/$s_!7jk4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7c8cd72-c813-4259-bea7-63c83ccbf8e1_719x421.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!7jk4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7c8cd72-c813-4259-bea7-63c83ccbf8e1_719x421.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The case of multi-master replication with data classification</figcaption></figure></div><p>Here, the red rectangle denotes data that must be reliably synchronised between instances. The green and blue rectangles denote data that doesn't require immediate synchronisation and should be accessible to the remote instance in read-only mode. The grey one denotes purely local data.</p><p>By designing the database schema to categorise data by replication method, we can even reduce the database size on a specific instance by avoiding the transmission of local data to remote nodes. Furthermore, plenty of data can be replicated asynchronously in one direction, avoiding the overhead of waiting for the remote end to confirm a commit. Only critical data requires strict synchronisation, using mechanisms such as synchronous commit, 2PC, and at least the REPEATABLE READ isolation level, which greatly increases the commit time of such a transaction and increases the risk of rollback due to conflicts.</p><p>What is an example of this use case? To be honest, I don't have any experience with customer installations, so I can only imagine how it might work. I envision an international company that needs to separate employee data and fiscal metrics on servers located in each country, which seems to be a common requirement these days. For analytical purposes, this data could be made accessible externally, similar to how key values can filter replicated data.</p><p>The company's employee table could be divided, replicating names, positions, salaries, and other relevant information across all database instances. 
Sensitive identifiers, such as social security numbers or passport numbers, could be kept in a separate local table to maintain privacy.</p><p>In principle, if updates to local or asynchronously replicated data dominate, it may be possible to achieve the desired scalability for writing operations (sounds wicked, but who knows...). </p><p>Drawing from my experience in rocket science, I've developed the habit of qualitatively evaluating the effects of the phenomena being studied beforehand. Let's estimate the percentage of the database that can be replicated in active-active mode without potentially degrading performance. For simplicity, let's assume there are two company branches located on different continents, and consider the following configuration options: (1) one server, or (2) two servers operating in multi-master mode, where access will always be local (as illustrated in the figure below).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LcTI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886f6bd2-b36f-4c1e-9b81-65de668ba21c_526x440.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LcTI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886f6bd2-b36f-4c1e-9b81-65de668ba21c_526x440.jpeg 424w, https://substackcdn.com/image/fetch/$s_!LcTI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886f6bd2-b36f-4c1e-9b81-65de668ba21c_526x440.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!LcTI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886f6bd2-b36f-4c1e-9b81-65de668ba21c_526x440.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!LcTI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886f6bd2-b36f-4c1e-9b81-65de668ba21c_526x440.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LcTI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886f6bd2-b36f-4c1e-9b81-65de668ba21c_526x440.jpeg" width="526" height="440" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/886f6bd2-b36f-4c1e-9b81-65de668ba21c_526x440.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:440,&quot;width&quot;:526,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:32552,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danolivo.substack.com/i/176358418?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76542f9c-6512-4476-9032-d8648008692e_960x540.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LcTI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886f6bd2-b36f-4c1e-9b81-65de668ba21c_526x440.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!LcTI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886f6bd2-b36f-4c1e-9b81-65de668ba21c_526x440.jpeg 848w, https://substackcdn.com/image/fetch/$s_!LcTI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886f6bd2-b36f-4c1e-9b81-65de668ba21c_526x440.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!LcTI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886f6bd2-b36f-4c1e-9b81-65de668ba21c_526x440.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Let's introduce some notation. Please refer to the figure above for further clarification:</p><ul><li><p><em>T<sub>l</sub></em> - transaction execution time in a DBMS backend, ms.</p></li><li><p><em>T<sup>c</sup><sub>l</sub></em> - network round-trip time between DBMS and local application, ms.</p></li><li><p><em>T<sup>c</sup><sub>r</sub></em> - network round-trip time between DBMS and remote application, ms.</p></li><li><p><em>T<sub>r</sub></em> - extra time to ensure that the transaction is successfully committed across the DBMS cluster, ms.</p></li><li><p><em>X<sub>l</sub></em> - fraction of local connections.</p></li><li><p><em>N</em> - fraction of transactions that are OK with asynchronous replication guarantees.</p></li></ul><p>For a single server holding all the connections, the average time to serve one transaction (denoted <em>TPS<sub>1</sub></em>, in ms; lower is better) is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;TPS_1 = (T_l + T^c_l)*X_l + (T_l + T^c_r) * (1 - X_l).&quot;,&quot;id&quot;:&quot;VAEASHLHKY&quot;}" data-component-name="LatexBlockToDOM"></div><p>For active-active replication, the corresponding value is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;TPS_m = (T_l + T^c_l)*N + (T_l + T^c_l + T_r) * (1-N)&quot;,&quot;id&quot;:&quot;DBNKBGCOIU&quot;}" data-component-name="LatexBlockToDOM"></div><p>Now, let's determine appropriate numbers for our formulas. We'll assume a 50% share for local connections (<em>X<sub>l</sub></em> <em>= 0.5</em>). 
Drawing from my experience living in Asia and connecting to resources in Europe, we can use the following figures as a reference:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;T_l = 5\\ ms,\\ T^c_l = 10\\ ms,\\ T^c_r = 150\\ ms\\ and\\ T_r = 300\\ ms.&quot;,&quot;id&quot;:&quot;DXDWEXTDLX&quot;}" data-component-name="LatexBlockToDOM"></div><p>In this context, <em>T<sup>c</sup><sub>l</sub></em> and <em>T<sup>c</sup><sub>r</sub></em> each denote the time of a single round trip. In contrast, confirming a remote commit (<em>T<sub>r</sub></em>) usually requires at least two round trips: in the 2PC protocol, the <code>PREPARE TRANSACTION</code> command must be executed first, waiting until the changes have been replicated and all the resources necessary for the commit have been reserved, and only then is the <code>COMMIT PREPARED</code> command issued.</p><p>Based on these timing considerations, we can calculate the following:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{gather*}\nTPS_1 = (5+10)*0.5+(5+150)*0.5 = 7.5+77.5=85,\\\\\nTPS_m =15*N + 315*(1-N)=315-300*N,\\\\\n315-300*N < 85 => N > \\frac{230}{300}\\approx 76\\%.\n\\end{gather*}&quot;,&quot;id&quot;:&quot;BNBYNVTVGQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Now, let's imagine that the share of remote connections has grown to 80%:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{gather*}\nTPS_1=15*0.2+155*0.8=127,\\\\\nN>188/300\\approx 62\\%.\n\\end{gather*}&quot;,&quot;id&quot;:&quot;FIFMYHOSQG&quot;}" data-component-name="LatexBlockToDOM"></div><p>What if we need fully synchronous 2PC replication of the entire database? 
Let's do the math:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;N = 0, X_l = 0.2, \\frac{TPS_1}{TPS_m} = \\frac{127}{315} = 0.4.&quot;,&quot;id&quot;:&quot;NLIFADKPWF&quot;}" data-component-name="LatexBlockToDOM"></div><p>These numbers indicate a performance loss of approximately 2.5 times, even in the most optimistic scenario. While that is not particularly encouraging, might it be sufficient for some applications?</p><p>This rough calculation suggests that if approximately 25% of DML transactions require remote confirmation, the multi-master system has a chance of keeping the same write performance. If the majority of traffic originates from remote regions, the proportion of reliably replicated data could increase to 40%. However, for a more conservative estimate, let's stick with 25% (that is, <em>N</em> = 75%). This approach also eases some of the load on the disk subsystem, locks, and other resources, allowing them to be used for local operations such as <code>VACUUM</code> or read-only queries. There appears to be a grain of truth in that, doesn't there?</p><p>On the other hand, replication, even if asynchronous, must be able to keep up with the commit flow. If the total time required for local transaction execution is 15 ms, and the one-way delay to the remote server is 75 ms, then even without waiting for confirmation from the remote side, a queue of changes for replication will still accumulate in a sequential scenario.</p><p>25% of the DML in our computation is committed with remote confirmation, leaving 75% to be replicated asynchronously. 75 ms * 0.75 = 56 ms. To address the disparity between the rate of local commits and the speed of data transfer to the remote server, we must utilise the bandwidth by sending and receiving data on the remote server in parallel (i.e., parallel replication is required). In our rough model, it turns out that four threads are sufficient to transfer changes. 
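</p><p>The arithmetic above is easy to re-check in plain SQL. Here is a minimal sketch, assuming the timings quoted earlier (5, 10, 150 and 300 ms). The CTE and column names are made up for illustration, and dividing the 56 ms replication delay by the 15 ms local transaction time is only an assumption about how the four-thread estimate can be derived:</p><pre><code>-- Illustrative only: all timings (ms) are the assumed figures from the text.
WITH params AS (
    SELECT 5.0   AS t_l,   -- transaction execution time in the backend
           10.0  AS t_cl,  -- round trip to a local application
           150.0 AS t_cr,  -- round trip to a remote application
           300.0 AS t_r    -- extra time for a confirmed remote commit
),
model AS (
    SELECT x_l,
           -- single server: weighted average over local and remote clients
           (t_l + t_cl) * x_l + (t_l + t_cr) * (1 - x_l) AS t_single,
           t_l + t_cl       AS t_async,  -- multi-master, asynchronous commit
           t_l + t_cl + t_r AS t_sync    -- multi-master, confirmed commit
    FROM params
    CROSS JOIN (VALUES (0.5), (0.2)) AS v(x_l)
)
SELECT x_l,
       t_single,
       -- N at which t_async * N + t_sync * (1 - N) equals t_single
       round((t_sync - t_single) / (t_sync - t_async), 3) AS break_even_n,
       -- assumed derivation: one-way delay * async share / local time
       ceil(75.0 * 0.75 / t_async) AS replication_threads
FROM model
ORDER BY x_l DESC;</code></pre><p>With these inputs, <code>t_single</code> comes out at 85 ms and 127 ms, as in the formulas above, and the last column rounds 3.75 up to four replication threads.</p><p>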
Considering the freed-up server resources (resulting from distributing connections between instances), this seems quite realistic.</p><p>So, the bottom line is that by distributing data geographically in multi-master mode, we can theoretically expect comparable transaction handling speed. This also reduces the number of backends and resource consumption on each server. These resources can be freed up for system processes and various analytics. Let's not forget the ability to optimise indexes, storage, and other attributes of physical data placement independently on each system node. An additional benefit is that in the event of a connection failure, each subnet can be temporarily maintained with the expectation that, upon recovery, a conflict resolution strategy will restore the database's integrity.</p><p>It's easy to imagine an application for such a scenario with a complete network breakdown - for example, a database for a hospital network in a region with complex terrain and climate. 
The medical records database must be shared, but a single client is unlikely to be served by two different hospitals within a short period of time, making conflicts in critical data quite rare.</p><p>Taking all of the above into account, a viable multi-master solution should implement the following set of technologies:</p><ul><li><p>Replication sets &#8211; to classify data for replication, providing separate synchronous and asynchronous replication.</p></li><li><p>Replication type dependency detection - check that synchronously replicated tables don't refer to asynchronously replicated tables.</p></li><li><p>Remote commit confirmation (similar to 2PC).</p></li><li><p>A distributed consensus protocol for determining a healthy subset of nodes and fencing failed nodes in 3+ configurations.</p></li><li><p>Parallel replication &#8211; parallelising both DML sending and application on the remote side.</p></li><li><p>Automatic Conflict Resolution.</p></li></ul><p>Let's not overlook the importance of automatic failover and the capability to hot-swap hardware without any downtime. Currently, the concept of alternative physical storage arrangements still sounds a little wild, so I'm excluding it from our discussion for now. However, if we gain more experience with successful multi-master implementations, this option might eventually become the preferred approach.</p><p>That's all for now. 
This post is meant to spark discussion, so please feel free to share your thoughts in the comments or through any other method you prefer.</p><p></p><p>THE END.<br><em>October 18, 2025, Madrid, Spain.</em></p><p></p>]]></content:encoded></item><item><title><![CDATA[Extra approach to RTABench Q0 optimisation]]></title><description><![CDATA[Reflecting a feedback]]></description><link>https://danolivo.substack.com/p/extra-approach-to-rtabench-q0-optimisation</link><guid isPermaLink="false">https://danolivo.substack.com/p/extra-approach-to-rtabench-q0-optimisation</guid><dc:creator><![CDATA[Andrei Lepikhov]]></dc:creator><pubDate>Thu, 07 Aug 2025 14:19:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DHDg!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb22ea4c7-73b5-4b9b-aaad-7db704866f94_256x256.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the <a href="https://danolivo.substack.com/p/squeezing-out-postgres-performance">previous post</a>, I explored some nuances of Postgres related to indexes and parallel workers. This text sparked a lively discussion <a href="https://www.linkedin.com/feed/update/urn:li:activity:7358108490323644417/">on LinkedIn</a>, during which one commentator (thanks to <a href="https://www.linkedin.com/in/ants-aasma-5085b1259/overlay/about-this-profile/">Ants Aasma</a>) proposed an index that was significantly more efficient than those discussed in the article. However, an automated comparison of EXPLAINs did not clarify the reasons for its superiority, necessitating further investigation.</p><p>This index:</p><pre><code>CREATE INDEX ON order_events ((event_payload -&gt;&gt; 'terminal'::text),
                              event_type,event_created); -- (1)</code></pre><p>At first (purely formal) glance, this index should not be much better than the alternatives:</p><pre><code>CREATE INDEX ON order_events (event_created,
                              (event_payload -&gt;&gt; 'terminal'::text),
                              event_type); -- (2)
CREATE INDEX ON order_events (event_created, event_type); -- (3)</code></pre><p>However, the observed speedup is significant; in fact, the performance of index (1) surpasses index (2) by more than 50 times and exceeds index (3) by almost 25 times! </p><p>The advantages of the proposed index are evident when we consider the logic of the subject area. It is more selective and is less likely to retrieve rows that do not match the filter. For instance, if we first identify all the rows that correspond to a specific terminal, we can then focus on the boundaries of the date range. At this point, all retrieved rows will already meet the filter criteria. Conversely, if we begin by determining the date range, we may encounter numerous rows related to other terminals within that range.</p><p>However, when examining the EXPLAIN output, we do not see any obvious reason for the difference:</p><pre><code>-- (1)
-&gt;  Index Scan using order_events_expr_event_type_event_created_idx
      (cost=0.57..259038.66 rows=64540 width=72)
      (actual time=0.095..232.855 rows=204053.00 loops=1)
    Index Cond:
      event_payload -&gt;&gt; 'terminal' = ANY ('{Berlin,Hamburg,Munich}') AND
      event_type = ANY ('{Created,Departed,Delivered}') AND
      event_created &gt;= '2024-01-01 00:00:00+00' AND
      event_created &lt; '2024-02-01 00:00:00+00'
    Index Searches: 9
    Buffers: shared hit=204566

-- (2)
-&gt;  Index Scan using order_events_event_created_event_type_expr_idx
      (cost=0.57..614892.22 rows=64540 width=72)
      (actual time=0.499..14303.685 rows=204053.00 loops=1)
    Index Cond:
      event_created &gt;= '2024-01-01 00:00:00+00' AND
      event_created &lt; '2024-02-01 00:00:00+00' AND
      event_type = ANY ('{Created,Departed,Delivered}') AND
      event_payload -&gt;&gt; 'terminal' = ANY ('{Berlin,Hamburg,Munich}')
    Index Searches: 1
    Buffers: shared hit=279131
                        
-- (3)
-&gt;  Index Scan using idx_3
      (cost=0.57..6979008.62 rows=64540 width=72)
      (actual time=0.238..8777.846 rows=204053.00 loops=1)
    Index Cond:
      event_created &gt;= '2024-01-01 00:00:00+00' AND
      event_created &lt; '2024-02-01 00:00:00+00' AND
      event_type = ANY ('{Created,Departed,Delivered}')
    Filter: event_payload -&gt;&gt; 'terminal' = ANY ('{Berlin,Hamburg,Munich}')
    Rows Removed by Filter: 4292642
    Index Searches: 1
    Buffers: shared hit=4509185</code></pre><p>Let's say IndexScan on (3) filters a lot of tuples and is therefore slow. However, even after eliminating 4 million rows, IndexScan on (3) is still twice as fast as IndexScan on (2). At the same time, the only difference between indexes (1) and (2) is the order of the columns.</p><p>If we compare scans (1) and (2), the only noticeable difference is a 30% difference in the number of buffer pages hit. But not 50 times! That means the EXPLAIN does not show us where the main work was done; only the cost value signals the superiority of index (1).</p><p>However, we live in a world of ORMs and ad-hoc queries, where it is difficult to choose the column order of an index by analysing the meaning of the stored data. That means we need to find out precisely what is happening here and what data is missing for the automated detection of an [un]successful index.</p><p>If you look at the optimiser code, the numbers make it clear why index (1) is so pleasing: all other things being equal, the scan is expected to go through only 39 out of 1 million index pages. Compare this with index (2), which also contains 1 million pages, of which the scan passes through 73 thousand. In terms of index tuples, this is 64.5 thousand versus 14 million. It turns out that the main work is to select a row, extract the appropriate attribute and perform the comparison.</p><p>The work performed is not represented in the EXPLAIN output. Additionally, the IndexScan structure of the query plan, which is accessible to the Postgres core and its extensions after the plan has been executed, lacks valuable information necessary for assessing the quality of planning and the sources of execution time growth. Consequently, developing a method for automatically identifying ineffective indexes and selecting more optimal alternatives appears to be challenging, if not impossible.</p><p>There are numerous parameters that the optimiser calculates during the index scan planning. 
Take a look at the <code>GenericCosts</code> structure, including <code>numIndexTuples</code>, <code>numIndexPages</code>, <code>indexCorrelation</code>, and <code>indexSelectivity</code>. Having all this information available at the end of execution could help detect scanning anomalies and draw the DBA's attention to them.</p><p>Of course, the number of installations, types of load and cases is close to infinity. Hence, extending the core code by continually adding more data from the optimisation stage to the plan does not seem flexible. Besides, sometimes we would like to have alternative paths that lost the battle but may be beneficial for analysis.</p><p>Moreover, since Postgres 18, the core already has a nicely extensible EXPLAIN, where we may add options, node information, and overall plan information. So, the only step needed is a bridge between the cloud of possible paths and the final plan.</p><p>Having this capability would allow extensions to analyse the predictions against the actual outcomes of query execution. 
Additionally, it would help in making informed decisions for fine-tuning the query planner and developing effective indexing strategies for a table.</p><p>Please feel free to share your feedback, whether you agree or disagree with my viewpoint.</p><p></p><p>THE END.<br><em>August 7, 2025, Torrevieja, Spain.</em></p>]]></content:encoded></item><item><title><![CDATA[Squeezing out Postgres performance on RTABench Q0]]></title><description><![CDATA[Some doubtful aspects of the optimiser]]></description><link>https://danolivo.substack.com/p/squeezing-out-postgres-performance</link><guid isPermaLink="false">https://danolivo.substack.com/p/squeezing-out-postgres-performance</guid><dc:creator><![CDATA[Andrei Lepikhov]]></dc:creator><pubDate>Mon, 04 Aug 2025 12:08:18 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e595595e-b6b5-4587-b7e2-0141cdb1ae9a_2560x1580.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>I often hear that PostgreSQL is not suitable for solving analytics problems, with TPC-H or ClickBench results cited as evidence. Surely, when handling a straightforward task like sorting through 100 million rows on disk and calculating a set of aggregates, you would get stuck on the storage format and parallelisation issues that limit the ability to optimise the DBMS. </em></p><p><em>In practice, queries tend to be highly selective and do not require processing a large number of rows. The focus, then, shifts to the order of JOIN operations, caching intermediate results, and minimising sorting operations. In these scenarios, PostgreSQL, with its wide range of query execution strategies, can indeed have an advantage.</em></p><p>I wanted to explore whether Postgres could be improved by thoroughly utilising all available tools, and for this, I chose the <a href="https://rtabench.com/">RTABench</a> benchmark. 
RTABench is a relatively recent benchmark that is described as being close to real-world scenarios and highly selective. One of its advantages is that the queries include expressions involving the JSONB type, which can be challenging to process. Additionally, the Postgres results on RTABench have not been awe-inspiring.</p><p>Ultimately, I decided to review all of the benchmark queries (fortunately, there aren't many) to identify possible optimisations. However, query zero alone had enough nuances to deserve a separate discussion.</p><p>My setup isn't the latest: it's a MacBook Pro from 2019 with an Intel processor, so we can't expect impressive or stable performance metrics. Instead, let's concentrate on qualitative characteristics rather than quantitative ones. For this purpose, my hardware setup should be sufficient. You can find the list of non-standard settings for the Postgres instance <a href="https://github.com/danolivo/conf/blob/main/2025-RTABench/postgresql.conf">here</a>. </p><p>Now, consider the <a href="https://github.com/timescale/rtabench/blob/main/postgres/queries/0000_terminal_hourly_stats.sql">zero RTABench query</a>, which involves calculating several aggregates over a relatively small sample from the table: </p><pre><code><code>EXPLAIN (ANALYZE, BUFFERS ON, TIMING ON, SETTINGS ON)
WITH hourly_stats AS (
  SELECT 
    date_trunc('hour', event_created) as hour,
    event_payload-&gt;&gt;'terminal' AS terminal,
    count(*) AS event_count
  FROM order_events
  WHERE 
    event_created &gt;= '2024-01-01' AND
    event_created &lt; '2024-02-01'
    AND event_type IN ('Created', 'Departed', 'Delivered')
  GROUP BY hour, terminal
)
SELECT 
  hour,
  terminal,
  event_count,
  AVG(event_count) OVER (
    PARTITION BY terminal
    ORDER BY hour
    ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
  ) AS moving_avg_events
FROM hourly_stats
WHERE terminal IN ('Berlin', 'Hamburg', 'Munich')
ORDER BY terminal, hour;</code></code></pre><h1>Phase 0. &#8216;Default behaviour&#8217;</h1><p>The query does not seem surprising. Let's run it against the default data schema (the EXPLAIN output has been cleaned up):</p><pre><code>WindowAgg  (actual time=21053.119 rows=2232)
  Window: w1 AS (PARTITION BY
   order_events.event_payload -&gt;&gt; 'terminal'
   ORDER BY date_trunc('hour', order_events.event_created))
  Buffers: shared read=3182778
  -&gt; Sort  (actual time=21052.476 rows=2232)
     Sort Key:
      order_events.event_payload -&gt;&gt; 'terminal',
      date_trunc('hour', order_events.event_created)
      Sort Method: quicksort  Memory: 184kB
   -&gt; GroupAggregate  (actual time=21051.875 rows=2232)
       Group Key:
        date_trunc('hour', order_events.event_created),
        order_events.event_payload -&gt;&gt; 'terminal'
     -&gt; Sort (actual time=21037.609..21042.766 rows=204053)
         Sort Key:
          date_trunc('hour', order_events.event_created),
          order_events.event_payload -&gt;&gt; 'terminal'
         Sort Method: quicksort  Memory: 12521kB
       -&gt; Bitmap Heap Scan on order_events
            (actual time=20999.978 rows=204053)
          Recheck Cond: event_type =
           ANY ('{Created,Departed,Delivered}')
          Filter:
           event_created &gt;= '2024-01-01 00:00:00+00' AND
           event_created &lt; '2024-02-01 00:00:00+00' AND
           event_payload -&gt;&gt; 'terminal' =
             ANY ('{Berlin,Hamburg,Munich}')
          Rows Removed by Filter: 57210049
          Heap Blocks: exact=3133832
          Buffers: shared read=3182778
        -&gt; Bitmap Index Scan (actual time=1683.357 rows=57414102)
           Index Cond: event_type = ANY ('{Created,Departed,Delivered}')
           Index Searches: 1
           Buffers: shared read=48946
Execution Time: 21060.564 ms</code></pre><p>The execution time is 21 seconds? Really? Seems too slow. Upon examining the EXPLAIN, we realise that the main issue is that the default schema contains only one low-selectivity index, which was used during execution instead of performing a sequential scan. The EXPLAIN also indicates that a significant portion of the work involves collecting identifiers (ctid) of candidate rows, which takes 1.6 seconds. Following that, filtering through these rows results in filtering out 98% of the read rows, which takes 19 seconds. </p><p>The first problem was identified quickly: I allocated 8GB for shared_buffers; however, the DBMS limits the amount of buffer space that can be assigned to a single table. The formula for this allocation is quite complex, involving multiple factors, but the NBuffers/MaxBackends ratio applies here. Consequently, with my current settings, PostgreSQL can allocate a maximum of only 2.4GB per table. Therefore, denormalising the entire database into one wide, long table is a bad idea in PostgreSQL, at least for this reason.</p><p>Despite the data access pattern in this query plan being somewhat inefficient, let&#8217;s first try a straightforward approach to improve the performance by increasing the number of parallel workers:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TfwI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa65ef4ab-7026-4573-b716-6753017e2224_678x420.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TfwI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa65ef4ab-7026-4573-b716-6753017e2224_678x420.png 424w, 
https://substackcdn.com/image/fetch/$s_!TfwI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa65ef4ab-7026-4573-b716-6753017e2224_678x420.png 848w, https://substackcdn.com/image/fetch/$s_!TfwI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa65ef4ab-7026-4573-b716-6753017e2224_678x420.png 1272w, https://substackcdn.com/image/fetch/$s_!TfwI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa65ef4ab-7026-4573-b716-6753017e2224_678x420.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TfwI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa65ef4ab-7026-4573-b716-6753017e2224_678x420.png" width="678" height="420" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a65ef4ab-7026-4573-b716-6753017e2224_678x420.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:420,&quot;width&quot;:678,&quot;resizeWidth&quot;:678,&quot;bytes&quot;:16287,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danolivo.substack.com/i/169909509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa65ef4ab-7026-4573-b716-6753017e2224_678x420.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!TfwI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa65ef4ab-7026-4573-b716-6753017e2224_678x420.png 424w, https://substackcdn.com/image/fetch/$s_!TfwI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa65ef4ab-7026-4573-b716-6753017e2224_678x420.png 848w, https://substackcdn.com/image/fetch/$s_!TfwI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa65ef4ab-7026-4573-b716-6753017e2224_678x420.png 1272w, https://substackcdn.com/image/fetch/$s_!TfwI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa65ef4ab-7026-4573-b716-6753017e2224_678x420.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>A very strange graph. Obviously, if most of the work is reading from disk and tuple deforming, one should not expect a significant effect from parallelism. But where does this jump between 6 and 7 parallel processes come from? Looking at the <a href="https://github.com/danolivo/conf/blob/main/2025-RTABench/query-0-workers.md">EXPLAINs</a>, the answer is easy to see: there was a change in the query plan. BitmapScan was used with a small number of workers, and the optimiser picked SeqScan with a larger number.</p><p>So maybe SeqScan should have been used with a small number of workers, too? Let's see how the scanning operation is accelerated separately for BitmapScan and SeqScan, and also watch how the costs of the scanning nodes change (in relative values):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8yVo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F437972f4-6042-4e36-8e16-09b62c96bc8f_600x371.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8yVo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F437972f4-6042-4e36-8e16-09b62c96bc8f_600x371.png 424w, https://substackcdn.com/image/fetch/$s_!8yVo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F437972f4-6042-4e36-8e16-09b62c96bc8f_600x371.png 848w, 
https://substackcdn.com/image/fetch/$s_!8yVo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F437972f4-6042-4e36-8e16-09b62c96bc8f_600x371.png 1272w, https://substackcdn.com/image/fetch/$s_!8yVo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F437972f4-6042-4e36-8e16-09b62c96bc8f_600x371.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8yVo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F437972f4-6042-4e36-8e16-09b62c96bc8f_600x371.png" width="600" height="371" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/437972f4-6042-4e36-8e16-09b62c96bc8f_600x371.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:371,&quot;width&quot;:600,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:23102,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danolivo.substack.com/i/169909509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F437972f4-6042-4e36-8e16-09b62c96bc8f_600x371.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8yVo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F437972f4-6042-4e36-8e16-09b62c96bc8f_600x371.png 424w, https://substackcdn.com/image/fetch/$s_!8yVo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F437972f4-6042-4e36-8e16-09b62c96bc8f_600x371.png 848w, 
https://substackcdn.com/image/fetch/$s_!8yVo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F437972f4-6042-4e36-8e16-09b62c96bc8f_600x371.png 1272w, https://substackcdn.com/image/fetch/$s_!8yVo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F437972f4-6042-4e36-8e16-09b62c96bc8f_600x371.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Raw numbers and EXPLAINs can be found <a 
href="https://github.com/danolivo/conf/blob/main/2025-RTABench/query-0-workers-noseqscan-O3.md">here</a> and <a href="https://github.com/danolivo/conf/blob/main/2025-RTABench/query-0-workers-nobitmapscan-O3.md">here</a>.</p><p>With a small number of workers, BitmapScan has a lower cost than SeqScan, which in our case does not correlate with execution time. The explanation probably lies in a delicate balance: fewer rows are read, but the executor jumps from the index to the table more often. It is difficult to be more precise, since EXPLAIN does not show details of the estimate such as the expected number of tuples fetched from disk before filtering, the estimated number of disk pages fetched, or the estimated proportion of pages that will be found in shared_buffers. On the other hand, the cost model assumes that SeqScan scales better than BitmapScan, which causes the plan to switch to SeqScan. Since the change in SeqScan cost predicts an unreasonably large performance gain, the optimiser may choose SeqScan where it should not. Thus, for now, you should be careful when optimising queries by increasing the number of parallel workers.</p><h1>Phase 1. &#8216;Typical optimisation&#8217;</h1><p>Now let's move on and see what a good index will give Postgres. A typical practice is to create an index on the most frequently used, highly selective column in filters. For this query, the choice is limited to only one option:</p><pre><code>CREATE INDEX idx_1 ON order_events (event_created);</code></pre><p>With this index, the optimiser uses IndexScan to access data, reducing the query execution time (without parallel workers) to 6.5 seconds. Interestingly, the previous query plan accessed buffer pages 3 million times (shared read = 3182778), while in this case, with a more selective scan, the number grew to 14 million (shared hit = 14317527). 
There are more trips to the buffer now, but in the previous example each page read evicted another one from the buffer. In contrast, both the index and the table pages now fit into shared buffers, which contributes to the speedup.</p><p>Next, let's explore whether parallel workers provide additional benefits and examine how well the cost model predicts this outcome:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kN-M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9296053-5e02-4952-a873-7ba55fdcbd52_600x371.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kN-M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9296053-5e02-4952-a873-7ba55fdcbd52_600x371.png 424w, https://substackcdn.com/image/fetch/$s_!kN-M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9296053-5e02-4952-a873-7ba55fdcbd52_600x371.png 848w, https://substackcdn.com/image/fetch/$s_!kN-M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9296053-5e02-4952-a873-7ba55fdcbd52_600x371.png 1272w, https://substackcdn.com/image/fetch/$s_!kN-M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9296053-5e02-4952-a873-7ba55fdcbd52_600x371.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kN-M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9296053-5e02-4952-a873-7ba55fdcbd52_600x371.png" width="600" height="371" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a9296053-5e02-4952-a873-7ba55fdcbd52_600x371.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:371,&quot;width&quot;:600,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:18569,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danolivo.substack.com/i/169909509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9296053-5e02-4952-a873-7ba55fdcbd52_600x371.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kN-M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9296053-5e02-4952-a873-7ba55fdcbd52_600x371.png 424w, https://substackcdn.com/image/fetch/$s_!kN-M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9296053-5e02-4952-a873-7ba55fdcbd52_600x371.png 848w, https://substackcdn.com/image/fetch/$s_!kN-M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9296053-5e02-4952-a873-7ba55fdcbd52_600x371.png 1272w, https://substackcdn.com/image/fetch/$s_!kN-M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9296053-5e02-4952-a873-7ba55fdcbd52_600x371.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Raw data can be found <a href="https://github.com/danolivo/conf/blob/main/2025-RTABench/query-0-optphase-1.md">here</a>.</p><p>Yes, we see some speedup. The parallelisation effect reaches so far that we had to raise the permitted number of workers to 24 to trace the impact to the end. And here one of the disadvantages of Postgres showed itself. Although we set all the relevant knobs to large values, the optimiser hit a hard-wired limit on the number of workers for the <code>order_events</code> table (10 in this case). We had to bypass it with the command:</p><pre><code>ALTER TABLE order_events SET (parallel_workers = 32);</code></pre><p>It is pretty sad that the number of workers implicitly depends on the table size estimate. 
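</p><p>To make that dependence concrete, here is a rough sketch of the size thresholds, as far as I understand the planner's rule: the first worker is considered once the table reaches <code>min_parallel_table_scan_size</code>, and each additional worker requires a table three times larger. The query assumes the default setting of 8MB and ignores the cap imposed by <code>max_parallel_workers_per_gather</code>; the column names are made up for illustration:</p><pre><code>-- Sketch only: approximate table size from which N workers are considered,
-- assuming min_parallel_table_scan_size = 8MB (the default).
SELECT n AS workers,
       pg_size_pretty((8 * 1024 * 1024)::bigint * (3 ^ (n - 1))::bigint)
         AS table_size_from
FROM generate_series(1, 11) AS n;</code></pre><p>Under this rule, a plain table smaller than the first threshold gets no parallel workers at all. 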
This can be a problem, for example, in a JOIN, which derives its number of workers from the number requested on the outer side of the join. It is easy to imagine a join between a tiny table, too small to justify even one parallel worker, and a very large one. In this case, the entire join tree may remain non-parallelised just because of one small table!</p><p>Another fact can be extracted from the graph above: the cost model for Index Scan is very conservative. With a maximum observed speedup of 8x, the model did not predict even a two-fold speedup. Hence the conclusions: 1) when using indexes, do not be shy about raising worker limits, and 2) scan nodes whose cost is more sensitive to the number of workers (we observed this, for example, with SeqScan) can unexpectedly tip the optimiser into a worse query plan.</p><p>However, the current index is not the ultimate dream. Let's take a closer look at the scan node:</p><pre><code>Index Scan (actual time=6555.122 rows=204053)
  Index Cond: event_created &gt;= '2024-01-01 00:00:00+00' AND
    event_created &lt; '2024-02-01 00:00:00+00'
  Filter: event_type = ANY ('{Created,Departed,Delivered}') AND
    event_payload -&gt;&gt; 'terminal' = ANY ('{Berlin,Hamburg,Munich}')
  Rows Removed by Filter: 14099758
  Index Searches: 1
  Buffers: shared hit=14317527</code></pre><p>Lots of pages touched, lots of rows filtered out. Let's see what can be achieved by minimising disk reads.</p><h1>Phase 2. &#8216;Reinforced optimisation&#8217;</h1><p>In this section, we will boldly assume the existence of an 'index adviser' that helps analyse the workload and automatically create composite indexes. These indexes optimise data retrieval by minimising the reading of table rows, thereby adapting the entire system to the incoming load.</p><p>In this query, we have several options to consider. We will exclude the GIN index because the filter on the <code>event_payload</code> column is not selective enough. This leaves us with two alternatives:</p><pre><code>CREATE INDEX idx_2 ON order_events (event_created, event_type)
  INCLUDE (event_payload);
CREATE INDEX idx_3 ON order_events (event_created, event_type);</code></pre><p>The idx_2 variant does not require accessing the table at all, while the idx_3 index can be used by both IndexScan and BitmapScan. You can find various EXPLAINs of these indexes here. It&#8217;s interesting to note that with the previously created idx_1 index in place, adding idx_2 does not make the optimiser switch to the obviously faster IndexOnlyScan. This suggests that the width of the index plays a significant role when the cost of index access is evaluated. The jsonb field likely increases the size of idx_2 considerably.</p><p>Consequently, the idx_3 index, used with BitmapScan, has proven to be the best choice in terms of compactness and the number of records selected. By closely examining the scan nodes, we can understand the reasons behind this conclusion:</p><pre><code>Bitmap Heap Scan (actual time=1286.430 rows=204053)
  Rows Removed by Filter: 4292642
  Heap Blocks: exact=269237
  Buffers: shared hit=313925
  Bitmap Index Scan (actual time=625.170 rows=4496695)
    Index Searches: 1
    Buffers: shared hit=44688

Index Only Scan (actual time=1586.097 rows=204053)
  Rows Removed by Filter: 4292642
  Heap Fetches: 0
  Index Searches: 1
  Buffers: shared hit=2558314

Index Scan (actual time=2847.517 rows=204053)
  Rows Removed by Filter: 4292642
  Index Searches: 1
  Buffers: shared hit=4509185</code></pre><p>All three scans return the same number of rows, perform a single pass through the index, and filter out the same number of rows. However, IndexOnlyScan beats IndexScan because it does not visit the table and touches buffer pages almost half as often (2.6 million vs. 4.5 million). BitmapScan touches the buffer even less often (about 300 thousand times): after walking the index and collecting the TIDs of candidate rows, it visits the heap page by page, touching each potentially useful page only once.</p><p>Let's see how parallel workers now help speed up the query for each type of scan:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1ywE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0273a925-9b19-4497-bfd3-018aaccdb7f1_687x425.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1ywE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0273a925-9b19-4497-bfd3-018aaccdb7f1_687x425.png 424w, https://substackcdn.com/image/fetch/$s_!1ywE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0273a925-9b19-4497-bfd3-018aaccdb7f1_687x425.png 848w, https://substackcdn.com/image/fetch/$s_!1ywE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0273a925-9b19-4497-bfd3-018aaccdb7f1_687x425.png 1272w, https://substackcdn.com/image/fetch/$s_!1ywE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0273a925-9b19-4497-bfd3-018aaccdb7f1_687x425.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1ywE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0273a925-9b19-4497-bfd3-018aaccdb7f1_687x425.png" width="687" height="425" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0273a925-9b19-4497-bfd3-018aaccdb7f1_687x425.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:425,&quot;width&quot;:687,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:31047,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danolivo.substack.com/i/169909509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0273a925-9b19-4497-bfd3-018aaccdb7f1_687x425.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1ywE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0273a925-9b19-4497-bfd3-018aaccdb7f1_687x425.png 424w, https://substackcdn.com/image/fetch/$s_!1ywE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0273a925-9b19-4497-bfd3-018aaccdb7f1_687x425.png 848w, https://substackcdn.com/image/fetch/$s_!1ywE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0273a925-9b19-4497-bfd3-018aaccdb7f1_687x425.png 1272w, https://substackcdn.com/image/fetch/$s_!1ywE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0273a925-9b19-4497-bfd3-018aaccdb7f1_687x425.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>It turns out that with BitmapScan there is little point in using workers. With ample computing resources and low contention between clients, it is worth considering lowering the cost of parallel execution (see the <code>parallel_setup_cost</code> and <code>parallel_tuple_cost</code> parameters) and disabling BitmapScan.</p><p>However, the cost model again turned out to be insensitive to the effect of parallelism, and this is the first thing that should be addressed. 
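</p><p>As a minimal sketch, the settings mentioned above could be tried at the session level like this. The specific values are illustrative assumptions, not the ones used in the experiment:</p><pre><code>-- Hypothetical values for a machine with spare cores and few concurrent clients.
SET parallel_setup_cost = 100;            -- default is 1000
SET parallel_tuple_cost = 0.01;           -- default is 0.1
SET max_parallel_workers_per_gather = 8;  -- default is 2
SET enable_bitmapscan = off;              -- steer the planner away from BitmapScan</code></pre><p>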
We also noticed that with 8 or more workers, the plan switched to SeqScan again, which increased the execution time from ~1 s to ~21 s. Therefore, for the sake of the experiment, SeqScan had to be disabled manually.</p><p>However, even with such a good index, a certain number of rows still have to be filtered out. Let's go all the way and organise selective access to only the relevant data.</p><h1>Phase 3. &#8216;Crazy optimisation&#8217;</h1><p>Let's now try to reach the theoretical limit of optimisation for this query. Here, we can imagine an advanced 'Disk Access Tuner' that analyses the expressions of an SQL query to find combinations of highly and weakly selective filters on the same table, which is a good reason to consider partial indexes. Let's create the following ideal index:</p><pre><code>CREATE INDEX idx_5 ON order_events (event_created, event_type)
INCLUDE (event_payload)
WHERE
  event_created &gt;= '2024-01-01' and event_created &lt; '2024-02-01' AND
  event_type IN ('Created', 'Departed', 'Delivered') AND
  (event_payload -&gt;&gt; 'terminal') = ANY ('{Berlin,Hamburg,Munich}');</code></pre><p>The index is designed so that no table access is necessary, and all rows in this index are relevant to the query. Therefore, there is no need to evaluate the filter at all, which also saves CPU time during execution. The base case (without workers) now executes in 54 ms (as shown in the EXPLAINs <a href="https://github.com/danolivo/conf/blob/main/2025-RTABench/query-0-optphase-3.md">here</a>). </p><p>Even in such a straightforward scenario, the cardinality estimate for the scan node is clearly wrong:</p><pre><code>Index Only Scan (cost=0.42..4998.13 rows=70210)
                (actual time=34.718 rows=204053.00 loops=1)
    Heap Fetches: 0
    Index Searches: 1
    Buffers: shared hit=110862</code></pre><p>Scanning does not rely on a filter; the table is, in fact, static, yet it still makes errors! In this instance, it may not be significant, but if there were a join tree above, an incorrect estimate at a leaf node could lead to a substantial error when selecting a join strategy. Why can't the optimiser refine the selectivity of the sample based on the existing indexes? Let's also explore whether scaling is effective here with parallel workers. Due to the increased share of the remaining (non-parallel) portion of the query, we will focus solely on the actual time and estimated cost of the scan nodes themselves:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q4_f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4565efe8-438a-4433-a3b7-619725136e26_600x371.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q4_f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4565efe8-438a-4433-a3b7-619725136e26_600x371.png 424w, https://substackcdn.com/image/fetch/$s_!Q4_f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4565efe8-438a-4433-a3b7-619725136e26_600x371.png 848w, https://substackcdn.com/image/fetch/$s_!Q4_f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4565efe8-438a-4433-a3b7-619725136e26_600x371.png 1272w, https://substackcdn.com/image/fetch/$s_!Q4_f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4565efe8-438a-4433-a3b7-619725136e26_600x371.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q4_f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4565efe8-438a-4433-a3b7-619725136e26_600x371.png" width="600" height="371" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4565efe8-438a-4433-a3b7-619725136e26_600x371.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:371,&quot;width&quot;:600,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19237,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danolivo.substack.com/i/169909509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4565efe8-438a-4433-a3b7-619725136e26_600x371.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Q4_f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4565efe8-438a-4433-a3b7-619725136e26_600x371.png 424w, https://substackcdn.com/image/fetch/$s_!Q4_f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4565efe8-438a-4433-a3b7-619725136e26_600x371.png 848w, https://substackcdn.com/image/fetch/$s_!Q4_f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4565efe8-438a-4433-a3b7-619725136e26_600x371.png 1272w, https://substackcdn.com/image/fetch/$s_!Q4_f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4565efe8-438a-4433-a3b7-619725136e26_600x371.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Here, as in the previous case, it is clear that the benefit of parallelising IndexOnlyScan ends at three workers. But the cost model does not even show this. Why does the cost model reflect the real numbers so poorly? Perhaps it is conservative because hardware parallelism models are so diverse and, as a result, parallelisation affects query execution differently. Or maybe there is an implicit assumption that a neighbouring backend will compete for the same resources? 
In any case, I would personally like to have an explicit parameter that lets me tune this effect to the nature of my system's load.</p><h1>Conclusion</h1><p>What lessons can be drawn from this simple experiment?</p><ol><li><p>Parallel workers significantly affect performance, so it's vital to adjust the optimiser's cost model to align with the server's capabilities, increasing the proportion of parallel plans and the number of workers.</p></li><li><p>The efficiency of parallelisation is highly dependent on the access technique, and preference should be given to IndexScan over BitmapScan, SeqScan and even IndexOnlyScan.</p></li><li><p>It appears that the cost model for parallelism in PostgreSQL has not been sufficiently polished, potentially leading to side effects such as defaulting to inefficient SeqScan operations.</p></li><li><p>PostgreSQL lacks a way to adjust the set of indexes to the actual workload, which would help offset the weak points of its current row storage.</p></li><li><p>Using indexes more deeply in the planning process may noticeably improve cardinality estimates.</p></li></ol><p>THE END.<br><em>July 26, 2025, Madrid, Spain.</em></p>]]></content:encoded></item><item><title><![CDATA[On Postgres Plan Cache Mode Management]]></title><description><![CDATA[Can the generic plan switch method provide better performance guarantees?]]></description><link>https://danolivo.substack.com/p/on-postgres-plan-cache-mode-management</link><guid isPermaLink="false">https://danolivo.substack.com/p/on-postgres-plan-cache-mode-management</guid><dc:creator><![CDATA[Andrei Lepikhov]]></dc:creator><pubDate>Thu, 03 Jul 2025 08:29:57 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1c45308c-1644-41db-b160-d1262b063522_1024x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Having attended PGConf.DE'2025 and discussed the practice of using Postgres on 
large databases there, I was surprised to regularly hear the opinion that query planning time is a significant issue. As a developer, it was surprising to learn that this factor can, for example, slow down the decision to move to a partitioned schema, which seems like a logical step once the number of records in a table exceeds 100 million. Well, let's figure it out.</em></p><p>The obvious way out of this situation is to use prepared statements, initially intended for reusing labour-intensive parts such as <em>parse trees</em> and <em>query plans</em>. For more specifics, let's look at a simple table scan with a large number of partitions (see <a href="https://github.com/danolivo/conf/blob/main/Scripts/partitioned-tbl-example.sql">initialisation script</a>):</p><pre><code>EXPLAIN (ANALYZE, COSTS OFF, MEMORY, TIMING OFF)
SELECT * FROM test WHERE y = 127;

/*
...
   -&gt;  Seq Scan on l256 test_256
         Filter: (y = 127)
 Planning:
   Buffers: shared hit=1536
   Memory: used=3787kB  allocated=4104kB
 Planning Time: 61.272 ms
 Execution Time: 4.929 ms
*/</code></pre><p>In this scenario involving a selection from a table with 256 partitions, my laptop's PostgreSQL took approximately 60 milliseconds for the planning phase and only 5 milliseconds for execution. During the planning process, it allocated 4 MB of RAM and accessed 1,500 data pages. Quite substantial overhead for a production environment! In this case, PostgreSQL has generated a custom plan that is compiled anew each time the query is executed, choosing an execution strategy based on the query parameter values during optimisation. To improve efficiency, let's parameterise this query and store it in the 'Plan Cache' of the backend by executing PREPARE:</p><pre><code>PREPARE tst (integer) AS SELECT * FROM test WHERE y = $1;
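-- Side note, a sketch rather than part of the original script: the
-- pg_prepared_statements view lists the statements prepared in this session.
-- Assuming PostgreSQL 14 or later, its generic_plans and custom_plans columns
-- count how many times each plan type has been chosen, so re-running this
-- query after the EXECUTE calls should show which counter grows.
SELECT name, generic_plans, custom_plans
FROM pg_prepared_statements WHERE name = 'tst';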
EXPLAIN (ANALYZE, COSTS OFF, MEMORY, TIMING OFF) EXECUTE tst(127);

/*
...
   -&gt;  Seq Scan on l256 test_256
         Filter: (y = $1)
 Planning:
   Buffers: shared hit=1536
   Memory: used=3772kB  allocated=4120kB
 Planning Time: 59.525 ms
 Execution Time: 5.184 ms
*/</code></pre><p>The planning workload remains the same since a custom plan has been used. Let's force the backend to generate and use a 'generic' plan:</p><pre><code>SET plan_cache_mode = 'force_generic_plan';
EXPLAIN (ANALYZE, COSTS OFF, MEMORY, TIMING OFF) EXECUTE tst(127);

/*
...
   -&gt;  Seq Scan on l256 test_256
         Filter: (y = $1)
 Planning:
   Memory: used=4kB  allocated=24kB
 Planning Time: 0.272 ms
 Execution Time: 2.810 ms
*/</code></pre><p>The first time the query is executed, a generic execution plan is created (we are using forced mode here to keep the example straightforward). This process requires resources nearly equivalent to those needed for building a custom plan. However, when the query is executed again, the generic plan can be quickly retrieved from the cache. As a result, the time spent preparing the query plan drops to just 0.2 ms, memory usage is only 24 KB, and no data page reads are required. It seems we have a clear benefit!</p><p>However, my suggestion to use the <code>PREPARE</code> command has often been met with rejection and scepticism. This is primarily due to the <a href="https://www.linkedin.com/feed/update/urn:li:activity:7333171071015100416?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7333171071015100416%2C7333270833563320321%29&amp;replyUrn=urn%3Ali%3Acomment%3A%28activity%3A7333171071015100416%2C7333392449064550400%29&amp;dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287333270833563320321%2Curn%3Ali%3Aactivity%3A7333171071015100416%29&amp;dashReplyUrn=urn%3Ali%3Afsd_comment%3A%287333392449064550400%2Curn%3Ali%3Aactivity%3A7333171071015100416%29">problems</a> that arise with generic plans in practice, particularly regarding their updating (replanning) and switching to a custom plan type. To gain a clearer understanding of how the generic plan mechanism is structured and to explore the root of these issues, I decided to investigate the history of this project. Additionally, I aimed to experiment with a new publication format, such as a mailing list review.</p><h2>What&#8217;s wrong with a generic plan?</h2><p>Upon examining the Git history of PostgreSQL, it appears that the concept of the plan cache was introduced in 2007 with commit b9527e9. At that time, it was decided that each prepared query in PostgreSQL should be executed exclusively using a generic plan, thereby avoiding unnecessary time spent on rebuilding the plan. 
Unlike Oracle, SQL Server, DB2 and other colleagues in the shop (see <a href="https://web.archive.org/web/20080929185556/http://www.db2ude.com/?q=node%2F73">link</a>, <a href="https://oracle-base.com/articles/11g/adaptive-cursor-sharing-11gr1">link</a>, <a href="https://www.mssqltips.com/sqlservertip/7491/prepare-sql-statement-spprepare-spexecute/">link</a>, and <a href="https://learn.microsoft.com/en-us/answers/questions/262485/sql-server-how-to-determine-parameter-sniffing-pro">link</a>), PostgreSQL constructs the generic plan with the 'total uncertainty' concept, without utilising any specific 'reference' parameter value. For instance, in the example mentioned, the constant '127' is set aside during the creation of the generic plan.</p><p>Due to its limited ability to estimate scan selectivities, the optimiser often depends on default 'magic' values of certain predefined constants. Consequently, a generic plan is often of lower quality compared to a custom one. Let me provide another example to illustrate this point more clearly (see the <a href="https://github.com/danolivo/conf/blob/main/Scripts/time-range-example.sql">reproduction script</a>):</p><pre><code>EXPLAIN
SELECT * FROM test_2
WHERE
  start_date &gt; '2025-06-30'::timestamp - '7 days'::interval;
/*
 Index Scan using test_2_start_date_idx on test_2  (rows=739)
   Index Cond: (start_date &gt; '2025-06-23 00:00:00'::timestamp)
*/

PREPARE tst3(timestamp) AS SELECT * FROM test_2
  WHERE start_date &gt; $1 - '7 days'::interval;
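
-- Assumption made explicit so that this snippet stands alone: the generic plan
-- shown in the output (note the $1 in the filter) requires
-- plan_cache_mode = 'force_generic_plan'. Without it, the first executions of a
-- prepared statement get a custom plan with the constant substituted.
SET plan_cache_mode = 'force_generic_plan';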

EXPLAIN EXECUTE tst3('2025-06-30'::timestamp);
/*
 Seq Scan on test_2  (rows=333333)
   Filter: (start_date &gt; ($1 - '7 days'::interval))
*/</code></pre><p>Offhand, here are some key reasons to consider: the lack of a constant in the inequality operator results in a filter estimate of 33%; for range filters, the default value is set at 0.5% of the total number of rows in the table; with the equality operator, using MCV statistics is not possible, so we must rely solely on the ndistinct value. Additionally, in certain situations, it is not feasible to use partial indexes.</p><h2>Let's turn to the origins</h2><p>The absence of alternatives resulted in a significant decline in performance and the infrequent use of the generally practical <code>PREPARE/EXECUTE</code> statement construct. In 2011, a discussion began that ultimately led to the e6faf91 commit, which introduced a simple automatic technique for switching between custom and generic plan variants.</p><p>This discussion began with the pressing issue that prepared statements were executed exclusively using generic plans (Mark Mielke, <a href="https://www.postgresql.org/message-id/4B7176DD.5060305%40mark.mielke.cc">link</a>). While these plans were rebuilt each time an invalidation signal was received, such as after executing the <code>ANALYZE</code> or <code>ALTER TABLE</code> commands, the quality of the planning was noticeably inferior.</p><p>Several ideas were proposed to address this problem:</p><ol><li><p>Periodically, replan the generic plan (Jeroen Vermeulen, <a href="https://www.postgresql.org/message-id/4B715056.8060103%40xs4all.nl">link</a>).</p></li><li><p>Introduce a threshold for the 'planning/execution time' ratio - If the criterion value is greater than 100, then use only the generic plan; if less than 0.01, then only the custom plan. (Bart Samwel <a href="https://www.postgresql.org/message-id/ded01eb21002110409m5b729dffn168061dae0cad213%40mail.gmail.com">link</a>. 
Yeb Havinga opposes (<a href="https://www.postgresql.org/message-id/4B7408EE.5020905%40gmail.com">link</a>) this idea - an objective criterion should not contain the 'time' parameter). However, Jeroen Vermeulen and Greg Stark (<a href="https://www.postgresql.org/message-id/407d949e1002160622l65719aabpf68165681ee8b6be%40mail.gmail.com">link</a>) supported this idea with the clause that the difference between planning and execution times should be significant, amounting to orders of magnitude.</p></li><li><p>Track the standard deviation (stddev) value of various parameters for executing a specific query plan, which will enable estimating the probability of how long the query will take to plan and execute next time (Greg Stark, <a href="https://www.postgresql.org/message-id/407d949e1002160622l65719aabpf68165681ee8b6be%40mail.gmail.com">link</a>).</p></li><li><p>Build several custom and generic plans, and make a choice based on the cost ratio (Tom Lane, <a href="https://www.postgresql.org/message-id/10153.1265905060%40sss.pgh.pa.us">link</a>).</p></li><li><p>Abandon generic plans altogether, while reducing the cost of replanning by preserving the PlannerInfo optimiser 'cache' and replanning only that part of the jointree / subquery where the parameters are actually used (Yeb Havinga, <a href="https://www.postgresql.org/message-id/4B715C23.30101%40gmail.com">link</a>).</p></li><li><p>Use generic plans, but introduce a replanning criterion - whether the parameter value falls within the MCV or not (Robert Haas (<a href="https://www.postgresql.org/message-id/603c8f071002252001n6b2467bal2e6bae26a9a2b79a%40mail.gmail.com">link</a>, <a href="https://www.postgresql.org/message-id/CA%2BTgmobgD_UZRs44cOutY1odNbR0C_HJSxvx_dMREvz-CwuiaQ%40mail.gmail.com">link</a>), supported by Jeff Davis).</p></li></ol><p>Interestingly, the idea of re-optimisation was already being discussed back then (Richard Huxton, <a 
href="https://www.postgresql.org/message-id/4B716828.2030303%40archonet.com">link</a>). At that time, it was more of a dream, but by the 2020s, the code infrastructure had matured enough to allow us to implement a similar concept in a short time (see <a href="https://postgrespro.com/docs/enterprise/16/realtime-query-replanning">replan</a>). The approach of detecting, generalising, and caching frequently arriving statements through a simple protocol, which we implemented in <a href="https://postgrespro.com/docs/enterprise/15/sr-plan">sr_plan</a>, is also explicitly described here (Robert Haas, <a href="https://www.postgresql.org/message-id/CA%2BTgmoZCOaKs1pGOmR8wtpHOw6uKiScswKzm%2B9fNx8w4visoQQ%40mail.gmail.com">link</a>), along with Yeb Havinga's idea of achieving this through a method similar to the then non-existent queryId (<a href="https://www.postgresql.org/message-id/4E3A6278.3050404%40gmail.com">link</a>).</p><p>At the same time, in 2011, Simon Riggs <a href="https://www.postgresql.org/message-id/BANLkTikAN%3Dg1oCC%2BtY72o7FFH0OjF%2BYy%3DA%40mail.gmail.com">introduced</a> the concept of a one-shot plan. The primary idea behind this type of plan is to inform the DBMS that a query plan will be created, executed immediately, and subsequently destroyed upon completion. This approach allows for the application of additional optimisations that are not relevant when there is no connection between the planning and execution phases.</p><p>To support this idea, Simon provided an example involving the calculation of stable functions, which would enable more efficient execution of partition pruning. 
Additionally, Bruce Momjian highlighted another potential optimisation that could be implemented in a one-shot plan: analysing the buffer cache to assess the effectiveness of using a specific index.</p><p>Meanwhile, Tom Lane was developing a <a href="https://www.postgresql.org/message-id/7898.1312214100%40sss.pgh.pa.us">similar</a> feature, motivated by complaints about regressions in <a href="https://www.postgresql.org/docs/current/ecpg-dynamic.html">dynamic SQL</a> queries (<a href="https://www.postgresql.org/message-id/flat/03E840D17E263A48A5766AD576E0423A03DF0A266B%40exch-mbx-111.vmware.com">link</a>, <a href="https://www.postgresql.org/message-id/flat/CAD4%2B%3DqWnGU0qi%2Biq%3DEPh6EGPuUnSCYsGDTgKazizEvrGgjo0Sg%40mail.gmail.com">link</a>). However, his approach was different from Simon Riggs' original concept. Tom Lane's idea focused on unifying the mechanisms of <code>SPI</code>, <code>PREPARE</code>, and the extended protocol through the use of a plan cache. As a result, Riggs' original idea did not receive much further development, though it was discussed later on (<a href="https://www.postgresql.org/message-id/flat/CA%2BTgmoYqKRj9BozjB-%2BtLQgVkSvzPFWBEzRF4PM2xjPOsmFRdw%40mail.gmail.com">link</a>, <a href="https://www.postgresql.org/message-id/flat/CABRT9RC-1wGxZC_Z5mwkdk70fgY2DRX3sLXzdP4voBKuKPZDow%40mail.gmail.com">link</a>).</p><p>The concept of tracking the planning and execution time of queries did not gain traction due to <a href="https://www.postgresql.org/message-id/25760.1357800208%40sss.pgh.pa.us">objections</a> from Tom Lane, who argued against using this time characteristic, as it is inherently unpredictable and can behave inconsistently across different systems.</p><p>In 2017, Pavel Stehule <a href="https://www.postgresql.org/message-id/flat/CAFj8pRAGLaiEm8ur5DWEBo7qHRWTk9HxkuUAz00CZZtJj-LkCA%40mail.gmail.com">raised</a> the need for explicit control over the type of plan selected when invoking the plan cache. 
This discussion led to the introduction of the <code>plan_cache_mode</code> parameter, which, in addition to the default <code>auto</code>, has two forcing options: <code>force_generic_plan</code> and <code>force_custom_plan</code>. These options force the use of generic and custom plans, respectively.</p><p>What stands out to me as a developer is the emphasis on several key concepts from the Postgres core that emerged during these discussions. </p><ol><li><p>Tom Lane <a href="https://www.postgresql.org/message-id/1010.1491427354@sss.pgh.pa.us">pointed out</a> that in the absence of a general solution, we should develop heuristics. Providing users with such solutions through an additional GUC is a poor idea and ultimately a compromise.</p></li><li><p>Greg Stark and Pavel Stehule <a href="https://www.postgresql.org/message-id/407d949e1002160628v14966092iea1bf26e34039344%40mail.gmail.com">emphasised</a> that the predictability of execution is more important than speed.</p></li><li><p>Tom Lane also <a href="https://www.postgresql.org/message-id/28723.1267155635%40sss.pgh.pa.us">noted</a> that the ability to switch between different query plan types is valuable, provided it is controlled on a per-query basis.</p></li></ol><h2>Outcomes</h2><p>Analysing the history of the feature, the opinions expressed within the community, and the accumulated experience of using generic plans, I conclude that many of the current problems stem from the following issues:</p><ol><li><p><em>Unstable Performance.</em> Generic plan performance may vary significantly based on different sets of input parameter values. This suggests a need to switch to a custom plan type. However, PostgreSQL cannot automatically detect and switch plans because it lacks any statistics on the query execution. The current state of the kernel code enables a straightforward implementation to track various execution parameters, including the average and standard deviation. 
But before we proceed with a proposal to the community, we must address a fundamental question: should the PostgreSQL kernel have a feedback system from the executor to the optimiser?</p></li><li><p><em>Outdated custom/generic cost proportion</em>. When a plan is invalidated, for instance, due to updated table statistics, the generic plan is rebuilt, and its cost is recalculated. However, this does not happen for the custom plan. Since the custom plan's cost is not recalculated, the value stored in the plan cache may significantly differ from reality due to gradual changes in table contents. This discrepancy can often lead to situations where a generic plan is utilised, even though the efficiency of a custom plan is apparent and could be determined by the planner during replanning.</p></li><li><p><em>Inadequate plan costs</em>. A common issue arises when erroneous estimates make the query plan costs irrelevant to the actual workload. Consequently, the choice between custom and generic plans becomes largely a matter of chance.</p></li></ol><h2>What can we propose?</h2><p>After many years of development and testing of the code, can we generate any new ideas? As usual, there are two separate solution designs: one is an in-core part for the community, and the other is extension code, which may even include a core patch that could be incorporated into a Postgres fork.</p><p>For the core version, we can consider the option of <em>resetting the custom plan statistics</em> on the cached plan, similar to what we do for the generic plan in the event of a plan invalidation call. This would trigger a new plan selection cycle from scratch. This approach is easily justified because statistics form the basis for calculating plan costs. 
When they change, it's comparable to switching to a different coordinate system, making it necessary to recalculate all costs.</p><p>The second option is somewhat more controversial: we could introduce a new <em>'referenced' mode</em> for the generic plan creation process. This mode would use current constants as reference values for the planner. While it may not offer any fundamental advantages, it would provide users with a familiar tool for influencing the query plan, especially for those migrating from SQL Server.</p><p>As usual, it makes sense to implement an in-core '<em>plan switching hook</em>' to leverage the plan switching method within an extension.</p><p>If we extend our coding options into the enterprise domain, we can explore more sophisticated plan-switching techniques. For instance, we could track statistics on the planning and execution time for each plan, compare their relative weight with cost values, and make decisions about replanning or even forcing a specific type of plan. An even better alternative could be to use a more stable parameter, such as the number of pages read.</p><p>To be more objective, you can check the <a href="https://github.com/danolivo/pg_mentor/tree/auto-mode-only">project</a>, which includes a draft for an automated system to manage plan types, as well as <a href="https://github.com/danolivo/pg_mentor/tree/main">another branch</a> outlining a draft for switching between forced modes.</p><p>Have you ever faced issues when using generic plans? 
Does it make sense to develop a comprehensive system for switching plans, or is it enough to implement an extension that enables each specific prepared statement to monitor its state and update it manually using SQL tools like <code>pg_stat_statements</code>?</p><h2>References</h2><p><strong>Hackers' mailing lists threads:</strong></p><ol><li><p><a href="https://www.postgresql.org/message-id/flat/4B715056.8060103%40xs4all.nl">Avoiding bad prepared-statement plans.</a> , 2010-02</p></li><li><p><a href="https://www.postgresql.org/message-id/flat/22791.1289514094%40sss.pgh.pa.us">Restructuring plancache.c API</a> , 2010-11</p></li><li><p><a href="https://www.postgresql.org/message-id/BANLkTikAN%3Dg1oCC%2BtY72o7FFH0OjF%2BYy%3DA%40mail.gmail.com">One-Shot Plans</a> , 2011-06</p></li><li><p><a href="https://www.postgresql.org/message-id/flat/29216.1312318038%40sss.pgh.pa.us">Transient plans versus the SPI API</a> , 2011-08</p></li><li><p><a href="https://www.postgresql.org/message-id/flat/CA%2BTgmoYqKRj9BozjB-%2BtLQgVkSvzPFWBEzRF4PM2xjPOsmFRdw%40mail.gmail.com">why do we need two snapshots per query?</a> , 2011-11</p></li><li><p><a href="https://www.postgresql.org/message-id/flat/CAFj8pRCKfoz6L82PovLXNK-1JL%3DjzjwaT8e2BD2PwNKm7i7KVg%40mail.gmail.com">dynamic SQL - possible performance regression in 9.2</a> , 2012-12</p></li><li><p><a href="https://www.postgresql.org/message-id/flat/CAFj8pRAGLaiEm8ur5DWEBo7qHRWTk9HxkuUAz00CZZtJj-LkCA%40mail.gmail.com">PoC plpgsql - possibility to force custom or generic plan</a> , 2017-01</p></li><li><p><a href="https://www.postgresql.org/message-id/flat/CADJ%3Dz8vD-HsM_qrNRLn9Lne-EkZtg%3DxmHjJ3RqksOoakzcuOyg%40mail.gmail.com">The logic behind comparing generic vs. 
custom plan costs</a> , 2025-03</p></li><li><p><a href="https://www.postgresql.org/message-id/flat/CAK-MWwS_2UwF7XPy-8XpcVqUjEveDjDQRbXxPyzvJM%2BooWeh9A%40mail.gmail.com">inefficient/wrong plan cache mode selection for queries with partitioned tables (postgresql 17)</a> ,2025-05</p></li></ol><p><strong>Main commits:</strong></p><ul><li><p><code>b9527e9</code> - first attempt to the feature's design, 2007-03</p></li><li><p><code>e6faf91</code> custom plans introduction, 2011-09</p></li><li><p><code>94afbd5</code> - one-shot entries, 2013-01</p></li><li><p><code>2aac339</code> - more sophisticated planning cost model, 2013-09</p></li><li><p><code>f7cb284</code> - plan_cache_mode setting, 2018-07</p></li></ul><p></p><p>THE END.<br><em>June 29, 2025. Madrid, Spain.</em> </p>]]></content:encoded></item><item><title><![CDATA[On expressions' reordering in Postgres]]></title><description><![CDATA[Let's discuss micro-optimisations]]></description><link>https://danolivo.substack.com/p/on-expressions-reordering-in-postgres</link><guid isPermaLink="false">https://danolivo.substack.com/p/on-expressions-reordering-in-postgres</guid><dc:creator><![CDATA[Andrei Lepikhov]]></dc:creator><pubDate>Tue, 22 Apr 2025 11:08:29 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a2fef34c-9b5a-4896-8703-24be5bb813dc_444x298.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Today, I would like to discuss additional techniques to speed up query execution. Specifically, I will focus on rearranging conditions in filter expressions, JOINs, HAVING clauses, and similar constructs. The main idea is that if you encounter a negative result in one condition within a series of expressions connected by the AND operator, or a positive result in one of the conditions linked by the OR operator, you can avoid evaluating the remaining conditions. This can save computing resources. 
Below, I will explain how much execution effort this approach saves and how to implement it effectively.</em></p><p>Occasionally, you may come across queries featuring complex filters similar to the following:</p><pre><code>SELECT * FROM table
WHERE
  date &gt; min_date AND
  date &lt; now() - interval '1 day' AND
  value IN Subplan AND
  id = 42;</code></pre><p>In practice, simply rearranging the order of conditions in such an expression can speed up query execution, sometimes quite notably. Why? Each individual operation costs little. However, if it is performed on each of millions of table rows, the price of the operation becomes palpable, especially once other problems, such as getting the table blocks into shared buffers, have been solved.</p><p>This effect is particularly evident on wide tables that contain many variable-length columns. For instance, I often encounter IndexScans that become slow when the field used for additional filtering is located somewhere around the 20th (!) position in a table containing many variable-width columns. Accessing this field requires calculating its offset from the beginning of the row, which takes up processor time and slows down the execution.</p><p>The PostgreSQL community has already addressed this issue, as observed in the code. In 2002, commit <code>3779f7f</code>, which was added by T. Lane, <a href="https://www.postgresql.org/message-id/flat/20021113062210.GB5460%40wallace.ece.rice.edu">reorganised</a> the clauses by positioning all clauses containing subplans at the end of the clause list (see <code>order_qual_clauses</code>). This change was logical because the cost of evaluating a subplan can depend on the parameters passed to it, introducing an additional source of error.</p><p>In 2007, this approach evolved with the commit <code>5a7471c</code>, which established that the sorting of clauses would be performed exclusively in ascending order based on the cost parameter. 
This logic has remained in place to the present day, except for a minor modification in commit <code>215b43c</code>, which required controlling the order of expression evaluation in each query plan node due to changes in the Row-Level Security (RLS) code.</p><p>Now, let&#8217;s take a look at what we have in the upstream as of today:</p><pre><code>CREATE TABLE test (
  x integer, y numeric,
  w timestamp DEFAULT CURRENT_TIMESTAMP, z integer);
INSERT INTO test (x,y)
  SELECT gs,gs FROM generate_series(1,1E3) AS gs;
VACUUM ANALYZE test;
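
-- A hedged aside, not part of the original script: the per-clause costs quoted
-- after this example are built from cpu_operator_cost multiplied by each
-- function's pg_proc.procost. The function names below are my assumption of
-- what backs the integer comparisons and now(); adjust them for other operators.
SHOW cpu_operator_cost;
SELECT proname, procost FROM pg_proc
WHERE proname IN ('int4gt', 'int4eq', 'now');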

EXPLAIN (COSTS ON)
SELECT * FROM test
WHERE
  z &gt; 0 AND
  w &gt; now() AND
  x &lt; (SELECT avg(y)
    FROM generate_series(1,1E2) y WHERE y%2 = x%3) AND
  x NOT IN (SELECT avg(y)
    FROM generate_series(1,1E2) y OFFSET 0) AND
  w IS NOT NULL AND
  x = 42;</code></pre><p>Looking into the filter of this SELECT, we see the following sequence of conditions:</p><pre><code>Filter: ((w IS NOT NULL) AND (z &gt; 0) AND
         (x = 42) AND (w &gt; now()) AND
         ((x)::numeric = (InitPlan 2).col1) AND
         ((x)::numeric &lt; (SubPlan 1)))</code></pre><p>During the execution of the query, they will be calculated in strict sequence from left to right. The operator costs are as follows for reference:</p><ul><li><p>"<code>z &gt; 0</code>" - 0.0025</p></li><li><p>"<code>w &gt; now()</code>" - 0.005</p></li><li><p>"<code>x &lt; SubPlan 1</code>" - 2.0225</p></li><li><p>"<code>x NOT IN SubPlan 2</code>" - 0.005</p></li><li><p>"<code>w IS NOT NULL</code>" - 0.0</p></li><li><p>"<code>x = 42</code>" - 0.0025</p></li></ul><p>This order appears quite logical. However, you may be wondering what can be improved here.</p><p>There are at least two straightforward opportunities for enhancement. First, you can assign a small cost to the ordinal position of each column involved in the expression. In simple terms, the further a column is to the right in a table row, the more expensive it is to evaluate. The cost should not be excessively high; it merely needs to signal to the optimiser that the expression <code>x=42</code> is cheaper to evaluate than <code>z&gt;0</code>, assuming all other factors are equal.</p><p>You may argue that this is tied to the current Postgres row-based storage. That is true, but this is the storage type used most often, isn't it? Moreover, it would make sense for storage to provide its own cost model.</p><p>The second standard pattern relates to pairs of expressions with approximately the same cost. For instance, consider <code>x=42</code> and <code>z&lt;50</code>. Clearly, the second expression is less selective and should be placed in the second position. Since the expression <code>x=42</code> will be true in fewer cases, there will be less need to evaluate subsequent conditions further down the list.</p><p>Now, let's assess the potential impact of these optimisations. Is it worth the effort? 
To illustrate, we can create a table where a pair of columns has the same selectivity but is positioned far apart, while another pair is located next to each other but has different selectivity.</p><pre><code>CREATE TEMP TABLE test_2 (x1 numeric, x2 numeric,
  x3 numeric, x4 numeric);
INSERT INTO test_2 (x1,x2,x3,x4)
  SELECT x,(x::integer)%2,(x::integer)%100,x FROM
    (SELECT random()*1E7 FROM generate_series(1,1E7) AS x) AS q(x);
ANALYZE;</code></pre><p>Let's examine the performance impact of searching for a value in a relatively "wide" row. Columns <code>x1</code> and <code>x4</code> are identical in every way, except that the position of the value in the column <code>x1</code> is known in advance. In contrast, the position of the value in the column <code>x4</code> needs to be calculated for each row.</p><pre><code>EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF)
SELECT * FROM test_2 WHERE x1 = 42 AND x4 = 42;
EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF)
SELECT * FROM test_2 WHERE x4 = 42 AND x1 = 42;

/*
 Seq Scan on test_2  (actual rows=0.00 loops=1)
   Filter: ((x1 = '42'::numeric) AND (x4 = '42'::numeric))
   Buffers: local read=94357
 Execution Time: 2372.032 ms

 Seq Scan on test_2  (actual rows=0.00 loops=1)
   Filter: ((x4 = '42'::numeric) AND (x1 = '42'::numeric))
   Buffers: local read=94357
 Execution Time: 2413.633 ms
*/</code></pre><p>It turns out that, all other factors being equal, even a relatively short tuple can have an effect of about 2-3%. This impact is quite comparable to the typical benefits gained from using Just-In-Time (JIT) compilation. Now, let's consider the influence of selectivity. The columns <code>x1</code> and <code>x2</code> are positioned next to each other. The key difference is that the values in <code>x1</code> are almost unique, whereas <code>x2</code> contains mostly duplicated values.</p><pre><code>EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF)
SELECT * FROM test_2 WHERE x2 = 1 AND x1 = 42;
EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF)
SELECT * FROM test_2 WHERE x1 = 42 AND x2 = 1;
/*
 Seq Scan on test_2  (actual rows=0.00 loops=1)
   Filter: ((x2 = '1'::numeric) AND (x1 = '42'::numeric))
   Buffers: local read=74596
 Execution Time: 2363.903 ms

 Seq Scan on test_2  (actual rows=0.00 loops=1)
   Filter: ((x1 = '42'::numeric) AND (x2 = '1'::numeric))
   Buffers: local read=74596
 Execution Time: 2034.873 ms
*/</code></pre><p>It seems we have achieved a speedup of approximately 10%.</p><p>It turns out that if we accept that the effect can accumulate throughout the plan tree, which may contain multiple scanning operators as well as joins, each contributing a particular percentage, then this technique is worthwhile to implement overall, especially considering the minimal overhead in the planning phase.</p><p>Let's proceed with the implementation and observe its effects. Creating it as an extension doesn't seem practical, as there is currently no hook that allows for operations during the creation of the plan. As for me, the necessity of introducing a <code>create_plan_hook</code> within the <code>create_plan()</code> routine is becoming increasingly evident: We may let extensions transfer some data from the optimisation stage to the plan, as well as do some additional plan enhancements (which may fit a specific load), like proposed here. However, this topic has yet to be discussed within the PostgreSQL community.</p><p>If this feature is implemented as a patch, modifications will be needed in two areas of the code: <code>cost_qual_eval()</code>, where the cost of expressions is evaluated, and <code>order_qual_clauses()</code>, which defines the sorting rules for expressions. As usual, the code can be found on GitHub in the <a href="https://github.com/danolivo/pgdev/tree/reorder-clauses-by-selectivity">designated branch</a>.</p><p>Running the aforementioned examples on this branch will demonstrate that the expressions are constructed more optimally, considering column order and selectivity. Additionally, no significant overhead is observed.</p><p>Do you think it makes sense to pursue such micro-optimisations, or should we aim for broader improvements? Have you encountered similar issues? Please share your thoughts in the comments.</p><p></p><p>THE END<br><em>April 19, 2025. 
Madrid, Spain.</em></p>]]></content:encoded></item><item><title><![CDATA[Boosting Postgres' EXPLAIN]]></title><description><![CDATA[Add an extra output on employed statistics]]></description><link>https://danolivo.substack.com/p/boosting-postgres-explain</link><guid isPermaLink="false">https://danolivo.substack.com/p/boosting-postgres-explain</guid><dc:creator><![CDATA[Andrei Lepikhov]]></dc:creator><pubDate>Sat, 12 Apr 2025 11:48:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DHDg!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb22ea4c7-73b5-4b9b-aaad-7db704866f94_256x256.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Shortly before the code freeze for PostgreSQL 18, Robert Haas added a feature that allows external modules to provide additional information to the EXPLAIN command.</p><p>This was a long-awaited feature for me. For an extension that influences the query planning process, providing users with notes on how the extension has affected the plan makes perfect sense. Instead of merely writing to a log file - access to which is often restricted by security policies - this information may be made available through the EXPLAIN command.</p><p>The feature introduced many entities that are not easy to figure out: an EXPLAIN option registration routine (RegisterExtensionExplainOption), an explain extension ID, per plan/node hooks, and an option handler.</p><p>The <code>pg_overexplain</code> extension, introduced with this feature to demonstrate how it works, seems a little messy and impractical for me, at least in its current state. So, I decided to find out how flexible this new technique is and demonstrate the opportunities opening up to developers with a more meaningful example. 
I have modified the freely available <a href="https://github.com/danolivo/pg_index_stats">pg_index_stats</a> extension and added information about the statistics used in the query planning process.</p><p>The STAT parameter was added to the list of EXPLAIN options, accepting Boolean ON/OFF values. If it is enabled, information about the statistics used is inserted at the end of the EXPLAIN: the presence of MCV, histogram, and the number of elements in them, as well as the values &#8203;&#8203;of stadistinct, stanullfrac, and stawidth.</p><p>You might wonder why this is necessary. After all, doesn't the set of statistics directly stem from the list of expressions in the query? Isn't it possible to identify which statistics were utilised by examining the cost-model code for a particular type of expression? </p><p>While it is indeed possible, this approach is not always sufficient. We understand the algorithms, but we typically do not have access to the underlying data. As a result, we cannot accurately determine which specific statistics are present in pg_statistic for a given column, nor can we know what information was available to the backend at the time of estimation.</p><p>Let's look at the example below:</p><pre><code>CREATE TABLE sc_a(x integer, y text);
INSERT INTO sc_a(x,y) (
  SELECT gs, 'abc' || gs%10 FROM generate_series(1,100) AS gs);
VACUUM ANALYSE sc_a;
LOAD 'pg_index_stats';

EXPLAIN (COSTS OFF, STAT ON)
SELECT * FROM sc_a s1 JOIN sc_a s2 ON true
WHERE s1.x=1 AND s2.y LIKE 'a';</code></pre><p>The EXPLAIN output, extended by the <code>pg_index_stats</code> extension, looks like this:</p><pre><code> Nested Loop
   -&gt;  Seq Scan on sc_a s1
         Filter: (x = 1)
   -&gt;  Seq Scan on sc_a s2
         Filter: (y ~~ 'a'::text)
 Statistics:
   "s2.y: 1 times, stats: { MCV: 10 values, Correlation,
          ndistinct: 10.0000, nullfrac: 0.0000, width: 5 }
   "s1.x: 1 times, stats: { Histogram: 0 values, Correlation,
          ndistinct: -1.0000, nullfrac: 0.0000, width: 4 }</code></pre><p>Here, you can see that the statistics for the <code>s1.x</code> and <code>s2.y</code> columns were used. We can't tell which statistic type was actually used or how often - that is buried too deeply in the core - but we can still spot some issues:</p><p>First, we have only ten MCV values for <code>y</code>, and there are no MCV statistics for <code>s1.x</code> at all; the histogram seems to be there, but it is of zero length. No nulls are expected in either column.</p><p>Thus, we get some helpful information that hints at the optimiser's plan selection logic. Considering that a client who cannot provide data can only rarely give a dump of the pg_statistic table, such relatively harmless information can be a helpful aid and reveal possible causes of problems with query plan selection. As a bare-minimum benefit: users <a href="https://www.postgresql.org/message-id/CAHgTRfedznOOrDxLhvDCHYhTMDvsbfE4uWCmxBPywcOS-GikXg@mail.gmail.com">often</a> forget to increase the <code>default_statistics_target</code> parameter (which controls the sample size) on massive tables, and this output gives a quick insight into that issue.</p><p>The extension uses the <code>get_relation_stats_hook</code> to track the statistics used. It would also be useful to know whether extended statistics are used in planning, but that logic is still too deep in the core, and the current set of hooks does not help here.</p><p>Finally, I would like to know what applications you see for expanding the EXPLAIN output. Regarding the example above, how harmless is even such limited information really?</p><p></p><p>THE END</p><p><em>April 12, 2025. 
Nikola Tesla Airport, Serbia.</em></p>]]></content:encoded></item><item><title><![CDATA[Automated Management of Extended Statistics in PostgreSQL]]></title><description><![CDATA[The history of one more Postgres extension development]]></description><link>https://danolivo.substack.com/p/automated-management-of-extended</link><guid isPermaLink="false">https://danolivo.substack.com/p/automated-management-of-extended</guid><dc:creator><![CDATA[Andrei Lepikhov]]></dc:creator><pubDate>Sun, 09 Mar 2025 15:34:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d2a695fa-4904-4a62-ae6d-b9f1d77c808c_420x294.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Here, I am describing the results of a Postgres extension I developed out of curiosity. This extension focuses on the automatic management of extended statistics for table columns. The idea originated while I was finishing another "smart" query-driven project aimed at enhancing the quality of Postgres query planning. I realised that Postgres is not yet equipped enough for fully autonomous poor query plan detection and adjustment optimisations. Therefore, it might be beneficial to approach the problem from a different angle and create an autonomous, data-driven helper.</em></p><h2><strong>What is extended statistics?</strong></h2><p>The extended statistics tool allows you to tell Postgres that additional statistics should be collected for a particular set of table columns. Why is this necessary? - I will try to quickly explain using the example of an open <a href="https://datasets.wri.org/dataset/globalpowerplantdatabase">power plant database</a>. For example, the fuel type (<em>primary_fuel</em>) used by a power plant is implicitly associated with the country's name. Therefore, when executing a simple query:</p><pre><code>SELECT count(*) FROM power_plants
WHERE country = '&lt;XXX&gt;' AND primary_fuel = 'Solar';</code></pre><p>we see that this number is zero for Norway and 243 for Spain. This is obvious to us, since it is determined by latitude, but the DBMS does not know this, and at the query planning stage it misestimates the number of rows: 93 for Norway and 253 for Spain. If the query turns out to be a little more complex and the estimate feeds a JOIN operator, this can lead to unfortunate consequences. Extended statistics capture the joint distribution of values in columns and allow us to detect such dependencies.</p><p>In fact, there are worse situations with ORMs. In the power plant database example, this could be the joint use of conditions on the <em>country</em> and <em>country_long</em> fields. After reading their descriptions, anyone understands that there is a direct correlation between these fields, and when the ORM groups by both of them, we get a significant estimation error:</p><pre><code>EXPLAIN (ANALYZE, COSTS ON, TIMING OFF, BUFFERS OFF, SUMMARY OFF)
SELECT country, country_long FROM power_plants
GROUP BY country, country_long;

 HashAggregate
  (rows=3494 width=16) (actual rows=167 loops=1)
   Group Key: country, country_long
   -&gt;  Seq Scan on power_plants
        (rows=34936 width=16) (actual rows=34936 loops=1)</code></pre><p>A human would never write such a query, but we live in the era of AI, and automatically generated queries are not uncommon. We will have to deal with this somehow.</p><p>So what do extended statistics offer? They allow us to define three types of statistics on a combination of columns (and/or expressions): <em>Most Common Values</em> (MCV), <em>distinct</em> and <em>dependencies</em>. For scan filters, MCV works best: if the combination of values that a query selects often appears in the table, the optimiser gets an accurate estimate. If we are looking for a rare combination (as in the case of solar power plants in Norway), we start from the rough estimate <code>ntuples/ndistinct</code> and can refine it by excluding everything that got into the MCV list.</p><p>When the number of groups has to be estimated (operators <code>GROUP BY</code>, <code>DISTINCT</code>, <code>IncrementalSort</code>, <code>Memoize</code>, <code>Hash Join</code>), the optimiser's decision is well supported by the ndistinct value per column combination.</p><p>Now, to see the impact of extended statistics on the optimiser's row estimates, let's apply them to our case by running the commands:</p><pre><code>CREATE STATISTICS ON country,primary_fuel
 FROM power_plants;
ANALYZE;</code></pre><p>You may find that the queries above estimate row numbers much more accurately when selecting and grouping by these two fields. For instance, Norway is estimated to have one power plant, while Spain has 253. Just to be sure, you can verify this result using filters such as <code>country = 'RUS'</code> or <code>country = 'AUT'</code>. Although the table is not very large, the tool seems effective.</p><p>However, I rarely see extended statistics being used in practice. One possible reason for this may be the concern that running the <code>ANALYZE</code> command will take a significant amount of time. Yet, I believe the main issue lies in the complexity of diagnostics - specifically, knowing <em>when</em> and <em>where</em> to create these statistics.</p><h2>Looking for a suitable statistics definition</h2><p>Is there an empirical rule of thumb for determining where and what statistics to create? I have forged two such rules for myself:</p><p><strong>No. 1: By Index Definition.</strong> If a DBA takes a risk by creating an index on a specific set of columns, they likely expect the DBMS to receive queries that filter on these columns frequently. Additionally, the execution time of these queries is probably critical, which serves as another reason for improving the quality of query plans. However, there isn't always a significant estimation error for filters on multiple columns, which is a drawback of this empirical approach &#8211; statistics may be generated unnecessarily. It's also possible that a point sample of data from the table is what's expected, which may diminish the impact of misestimating on a composite filter &#8211; does it really matter whether 1 or 5 rows are returned?</p><p>Due to these shortcomings, I developed <strong>Method No. 2 using actual query filter templates</strong>. 
In this method, the first step is to identify candidate queries based on two factors: the query's contribution to the database load (which can be measured using the <a href="https://danolivo.substack.com/p/whose-optimisation-is-better?r=34q1yy">pages-read</a> criterion) and the presence of composite filter conditions in table scans. It would also be beneficial to consider only those instances where the actual cardinality of the table scan operator significantly deviates from the planned value.</p><p>This approach is more selective in choosing potential candidates for generating statistics, allowing for a significant reduction in the statistics collected. However, it raises some important questions:</p><ol><li><p>When it comes to creating statistics, approach No. 1 provides a clear moment for generating them - at the time of the index creation. But what about approach No. 2? In this case, you must either rely on a timer to generate statistics collecting queries in the interim or manually trigger the command. The absence of a complex query that calculates bonuses at the end of the month (for the previous 29 days) does not mean that we shouldn&#8217;t execute it within a reasonable timeframe on the thirtieth day. While such a query may contribute only a tiny amount to the overall load, the accountant may not appreciate waiting several hours for the results!</p></li><li><p>How to Clean Up a Set of Statistics. In the previous approach, we deleted the statistics along with the index. However, this situation is less straightforward now. For instance, if a problematic query suddenly stops occurring - perhaps because the sales season for a popular product has ended - it doesn't mean it won't return in a year. This uncertainty could create potential instability in the DBMS optimiser's operation.</p></li></ol><p>Additionally, it's unclear how much the actual and planned row numbers should differ to be considered significant. 
Should this difference be two times, ten times, or even a hundred times?</p><p>With this in mind, I decided to first write code for the easy-to-implement approach No. 1. At the same time, for approach No. 2, I just plan to develop a recommender tool that, based on data of the <code>pg_stat_statements</code> extension and an analysis of the execution plans of queries, will suggest candidates for creating new statistics.</p><h2>Extension Description</h2><p>The concept behind this extension is straightforward (see the <a href="https://github.com/danolivo/pg_index_stats">repository</a> for details). First, we need a hook to collect the identifiers of objects created in the database, and I have chosen the <code>object_access_hook</code> for this purpose. Next, we need to determine an appropriate time to filter the list of objects, selecting only those that belong to relevant composite indexes. We can efficiently add a new statistics definition to the database using the <code>ProcessUtility_hook</code>, executing our code after a utility command is completed.</p><p>Extended statistics, which include <em>distinct</em> and <em>dependencies</em> types, are calculated for all possible combinations of columns. This leads to a rapid increase in computational complexity. For instance, with three columns, the number of distinct statistics is 4, and the number of dependencies is 9. However, these numbers rise dramatically with eight columns to 247 distinct statistics and 1016 dependencies. It's clear now why the PostgreSQL core strictly limits the number of statistical elements to 8. </p><p>To prevent excessive load on the database, I introduced a parameter that limits the number of index elements included in the statistics definition (the <code>columns_limit</code> parameter) and another parameter that determines which types of statistics to include in this definition (the <code>stattypes</code> parameter). 
When these automatic statistics are created, an extra dependency is established on the index serving as the template. Consequently, the associated statistics are removed when the index is deleted. </p><p>An open question remains: Is it necessary to create a dependency from the extension to delete all created statistics when <code>DROP EXTENSION</code> is executed? The answer is unclear because the extension may also function as a simple module without requiring a <code>CREATE EXTENSION</code> call, thus potentially impacting all databases within the cluster simultaneously.</p><p>To distinguish between automatically generated statistics and those created by users, a comment object that includes the module's name and the statistics name is created. Additionally, we have introduced the functions <code>pg_index_stats_remove</code> and <code>pg_index_stats_rebuild</code> into the extension interface. These functions allow you to delete all statistics and regenerate them, which can be helpful if the data schema was established prior to loading the module or if the database parameters have changed.</p><p>A separate issue to address is the reduction of redundant statistics. Given that a database can have many indexes, a procedure has been developed to identify duplicates, aiming to decrease the computational costs of the <code>ANALYZE</code> command (see the <code>pg_index_stats.compactify</code> parameter). </p><p>For example, if an index is already defined as <code>t(x1, x2)</code>, creating another index as <code>t(x2, x1)</code> would not require the creation of new statistics. A more complex scenario arises when an index <code>t(x2, x1)</code> is created in the presence of another index <code>t(x1, x2, x3)</code>. 
In this case, the most common value (MCV) statistics must still be created, as they would not be redundant, whereas the <code>ndistinct</code> and <code>dependencies</code> statistics can be skipped.</p><h2>Benchmarking</h2><p>As usual, theory should be validated through practice, and code should be tested on meaningful data. I didn't have access to a ready-made, loaded PostgreSQL instance in either a test or production environment, so I found a stale dump of a database for testing purposes. This particular dump was noteworthy because it contained a large number of tables - about 10,000 - along with roughly three times as many indexes. </p><p>Additionally, composite indexes were heavily employed, with around 20,000 indexes containing more than one column. Notably, more than 1,000 of these indexes cover five or more columns. So, this database provides a suitable case for research, although it is unfortunate that no payload is available. The <code>ANALYZE</code> command on this database took 22 seconds to execute. However, when I installed the extension and used the default limit of five columns, the <code>ANALYZE</code> time increased to 55 seconds. 
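</p><p>The growth of <code>ANALYZE</code> time with the column limit follows from simple combinatorics. As a rough sketch, assuming one <em>ndistinct</em> item per column subset of size two or more and one <em>dependencies</em> item per (subset, dependent column) pair, the query below reproduces the figures quoted earlier (4 and 9 for three columns, 247 and 1016 for eight). The closed-form expressions are an assumption that fits those numbers, not something taken from the core code:</p><pre><code>-- Number of statistic items as a function of the column count n
-- (illustrative formulas, assumed from the figures quoted above):
--   ndistinct:    2^n - n - 1      (subsets of two or more columns)
--   dependencies: n * 2^(n-1) - n  (each subset times its dependent column)
SELECT n AS columns,
       (2 ^ n)::int - n - 1       AS ndistinct_items,
       n * (2 ^ (n - 1))::int - n AS dependency_items
FROM generate_series(2, 8) AS n;</code></pre><p>Under these assumptions, a single five-column index yields 26 ndistinct items and 75 dependencies, which hints at why dependencies tend to dominate the cost.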
</p><p>The table with raw data below illustrates the <code>ANALYZE</code> time (in seconds) based on the limit on the number of columns and the types of statistics collected.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/n5gI7/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4c53920d-bd46-40ff-a1e9-fba20634828e_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:235,&quot;title&quot;:&quot;The ANALYZE time (s) for various values of the columns_limit and stattypes GUCs&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/n5gI7/1/" width="730" height="235" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>It's clear that storing all possible combinations of columns significantly impacts analysis time, mainly when dependencies are involved. Therefore, we can limit our analysis to 3-5 columns in the statistics or consider adopting approach No. 2. I now understand why SQL Server created a separate worker for updating such statistics: this process can be pretty costly. What about reducing redundancy? Let's conduct another experiment:</p><pre><code>SET pg_index_stats.columns_limit = 5;
SET pg_index_stats.stattypes = 'mcv, ndistinct, dependencies';
SET pg_index_stats.compactify = 'off';
SELECT pg_index_stats_rebuild();
ANALYZE;

SET pg_index_stats.compactify = 'on';
SELECT pg_index_stats_rebuild();
ANALYZE;</code></pre><p>The following two queries are sufficient to check the amount of statistical data generated by the pg_index_stats extension:</p><pre><code>-- Total number of stat items
SELECT sum(nelems) FROM (
  SELECT array_length(stxkind,1) AS nelems
  FROM pg_statistic_ext) AS s;

-- Total number of stat items grouped by stat type
SELECT elem, count(elem) FROM (
 SELECT unnest(stxkind) elem FROM pg_statistic_ext
) AS e
GROUP BY elem;</code></pre><p>The first query shows the total number of extended statistics <em>items</em> in the database, and the second one - a breakdown by type. So, let's see what happens with and without compactifying:</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/tlslr/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/63af2938-5ae2-48c2-b316-818c145b595d_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:337,&quot;title&quot;:&quot;[ Insert title here ]&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/tlslr/1/" width="730" height="337" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>The overall impact is modest&#8212;approximately a 15% improvement in processing time and slightly more in the set of statistics. However, it does provide some protection against corner cases. Interestingly, the compactifying reduced the number of MCV statistics, suggesting that a significant number of indexes differ only in the order of their columns. Additionally, expression statistics, which we haven't discussed before, are generated automatically by the PostgreSQL core if the definition of extended statistics includes an expression. 
Although this may not pose a significant issue, it would be beneficial to have the ability to regulate this behaviour.</p><p>It's also worth comparing the analysis time to an alternative statistics collector called <a href="https://postgrespro.com/docs/enterprise/17/runtime-config-query#GUC-ENABLE-COMPOUND-INDEX-STATS">joinsel</a>, which exists in the enterprise Postgres fork, provided by <a href="https://postgrespro.com">Postgres Professional LLC</a>. While it isn't a direct competitor to extended statistics, it works differently. Based on the index definition, it creates a new composite type within the database, which is then used to generate regular statistics stored in pg_statistic. The advantages of joinsel include MCV and a histogram, which allows for evaluating range filters while leveraging standard PostgreSQL clause estimation techniques. However, it does have some drawbacks, such as a lack of dependency statistics and only one ndistinct value for the entire composite type (a limitation that can be addressed).</p><p>Now, let's look at how quickly the ANALYZE command is executed with joinsel.</p><pre><code>SET enable_compound_index_stats = 'on';
SELECT pg_index_stats_remove();
\timing on
ANALYZE;
Time: 41248.977 ms (00:41.249)</code></pre><p><code>ANALYZE</code> time has increased as expected compared to regular Postgres statistics, but only by a factor of about two, which is a reasonable compromise. The main advantage here is that you don't have to worry about the number of columns in the index - the complexity will increase linearly.</p><h2>Conclusion</h2><p>The general conclusion regarding Approach No. 1 is that it can be viable, provided we exercise caution and carefully manage the limits. </p><p>Additionally, we should enhance the extended statistics in the core. It would be nice to have more control over this tool, allowing us to reduce or expand the volume of generated statistical data.</p><p>As for the helper and Approach No. 2, I have decided to postpone it for now. If anyone is enthusiastic and has plenty of free time and patience, feel free to reach out. I would be happy to provide guidance!</p><p></p><p>THE END.<br><em>March 9, 2025, Madrid, Spain.</em></p>]]></content:encoded></item><item><title><![CDATA[Does Postgres need an extension of the ANALYZE command?]]></title><description><![CDATA[One more idea on the Postgres extension]]></description><link>https://danolivo.substack.com/p/does-postgres-need-an-extension-of</link><guid isPermaLink="false">https://danolivo.substack.com/p/does-postgres-need-an-extension-of</guid><dc:creator><![CDATA[Andrei Lepikhov]]></dc:creator><pubDate>Tue, 04 Feb 2025 02:59:14 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f269d076-665f-42a3-a46c-b98770f555d3_780x780.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>In this post, I would like to discuss the stability of standard Postgres statistics (distinct, MCV, and histogram over a table column) and introduce an idea for one more extension - an alternative to the ANALYZE command.</em></p><p><em>My interest in this topic began while wrapping up <a href="https://habr.com/ru/articles/873064/">my 
previous article</a> when I noticed something unusual: the results of executing the same J<a href="https://github.com/danolivo/jo-bench">oin Order Benchmark</a> (JOB) query across a series of consecutive runs could differ by several times and even orders of magnitude - both in the value of the execution-time and in pages-read.</em></p><p><em> This was puzzling, as all variables remained constant - the test script, laptop, settings, and even the weather outside were the same. This prompted me to investigate the cause of these discrepancies&#8230; .</em></p><p>In my primary activity, which is highly connected to query plan optimisation, I frequently employ JOB to assess the impact of my features on the planner. At a minimum, this practice enables me to identify shortcomings and ensure that there hasn't been any degradation in the quality of the query plans produced by the optimiser. Therefore, benchmark stability is crucial, making the time spent analysing the issue worthwhile. After briefly examining the benchmark methodology, I identified the source of the instability: the <code>ANALYZE</code> command.</p><p>In PostgreSQL, statistics are computed using basic techniques like <a href="https://www.cs.umd.edu/~samir/498/vitter.pdf">Random Sampling with Reservoir</a>, calculating the <a href="https://scholar.archive.org/work/wf7zmkqzord4diev2nhfnpoqnu">number of distinct values</a> (ndistinct), and employing <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf">HyperLogLog</a> for streaming statistics - for instance, to compute distinct values in batches during aggregation or to decide whether to use abbreviated keys for optimisation. Given that the nature of statistics calculation is stochastic, fluctuations are expected. However, the test instability raises the following questions: How significant are these variations? How can they be minimised? And what impact do they have on query plans? 
Most importantly, how can we accurately compare benchmark results when such substantial deviations are present, even in the baseline case?</p><p><strong>Is it possible to achieve query plan stability?</strong></p><p>Well, I've thought: the tables are massive, and there are a lot of rows inside - let's just increase the value of the <code>default_statistics_target</code> parameter, and that will solve the problem, right? I was gradually increasing the sample size for statistics from 100 to 10000, re-running all benchmark queries and recording how the pages-read criterion behaves:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nEyd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e28f27b-f9cf-4a38-aea4-af958de7e362_1346x854.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nEyd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e28f27b-f9cf-4a38-aea4-af958de7e362_1346x854.png 424w, https://substackcdn.com/image/fetch/$s_!nEyd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e28f27b-f9cf-4a38-aea4-af958de7e362_1346x854.png 848w, https://substackcdn.com/image/fetch/$s_!nEyd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e28f27b-f9cf-4a38-aea4-af958de7e362_1346x854.png 1272w, https://substackcdn.com/image/fetch/$s_!nEyd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e28f27b-f9cf-4a38-aea4-af958de7e362_1346x854.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!nEyd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e28f27b-f9cf-4a38-aea4-af958de7e362_1346x854.png" width="728" height="461.89598811292717" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e28f27b-f9cf-4a38-aea4-af958de7e362_1346x854.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:854,&quot;width&quot;:1346,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nEyd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e28f27b-f9cf-4a38-aea4-af958de7e362_1346x854.png 424w, https://substackcdn.com/image/fetch/$s_!nEyd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e28f27b-f9cf-4a38-aea4-af958de7e362_1346x854.png 848w, https://substackcdn.com/image/fetch/$s_!nEyd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e28f27b-f9cf-4a38-aea4-af958de7e362_1346x854.png 1272w, https://substackcdn.com/image/fetch/$s_!nEyd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e28f27b-f9cf-4a38-aea4-af958de7e362_1346x854.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">statistics_target = 100</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NP7R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8e09c16-54f4-41d7-ad2b-def7bf9ba137_1288x796.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NP7R!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8e09c16-54f4-41d7-ad2b-def7bf9ba137_1288x796.png 424w, 
https://substackcdn.com/image/fetch/$s_!NP7R!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8e09c16-54f4-41d7-ad2b-def7bf9ba137_1288x796.png 848w, https://substackcdn.com/image/fetch/$s_!NP7R!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8e09c16-54f4-41d7-ad2b-def7bf9ba137_1288x796.png 1272w, https://substackcdn.com/image/fetch/$s_!NP7R!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8e09c16-54f4-41d7-ad2b-def7bf9ba137_1288x796.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NP7R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8e09c16-54f4-41d7-ad2b-def7bf9ba137_1288x796.png" width="1288" height="796" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f8e09c16-54f4-41d7-ad2b-def7bf9ba137_1288x796.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:796,&quot;width&quot;:1288,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NP7R!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8e09c16-54f4-41d7-ad2b-def7bf9ba137_1288x796.png 424w, 
https://substackcdn.com/image/fetch/$s_!NP7R!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8e09c16-54f4-41d7-ad2b-def7bf9ba137_1288x796.png 848w, https://substackcdn.com/image/fetch/$s_!NP7R!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8e09c16-54f4-41d7-ad2b-def7bf9ba137_1288x796.png 1272w, https://substackcdn.com/image/fetch/$s_!NP7R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8e09c16-54f4-41d7-ad2b-def7bf9ba137_1288x796.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" 
y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">statistics_target = 1000</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DwO1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed67946-b236-4ae1-9fea-ef9678cb9597_1200x742.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DwO1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed67946-b236-4ae1-9fea-ef9678cb9597_1200x742.png 424w, https://substackcdn.com/image/fetch/$s_!DwO1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed67946-b236-4ae1-9fea-ef9678cb9597_1200x742.png 848w, https://substackcdn.com/image/fetch/$s_!DwO1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed67946-b236-4ae1-9fea-ef9678cb9597_1200x742.png 1272w, https://substackcdn.com/image/fetch/$s_!DwO1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed67946-b236-4ae1-9fea-ef9678cb9597_1200x742.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DwO1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed67946-b236-4ae1-9fea-ef9678cb9597_1200x742.png" width="1200" height="742" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ed67946-b236-4ae1-9fea-ef9678cb9597_1200x742.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:742,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DwO1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed67946-b236-4ae1-9fea-ef9678cb9597_1200x742.png 424w, https://substackcdn.com/image/fetch/$s_!DwO1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed67946-b236-4ae1-9fea-ef9678cb9597_1200x742.png 848w, https://substackcdn.com/image/fetch/$s_!DwO1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed67946-b236-4ae1-9fea-ef9678cb9597_1200x742.png 1272w, https://substackcdn.com/image/fetch/$s_!DwO1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed67946-b236-4ae1-9fea-ef9678cb9597_1200x742.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">statistics_target = 10000</figcaption></figure></div><p>Even with the highest level of statistical detail, query plans still change after re-running the <code>ANALYZE</code> command. Raising the statistics target from 100 to 10,000 may help somewhat, but it does not fundamentally change the situation. Such significant instability calls into question whether benchmarks can be reproduced independently and their results compared without examining the query plans.</p><p>Now, let's dig deeper: what exactly fluctuates? Using a simple script, I ran an experiment that collects scalar statistics (stanullfrac, stawidth, stadistinct) while rebuilding the statistics ten times. The script for this operation might look like this:</p><pre><code>CREATE TABLE test_res(expnum integer, oid Oid, relname name,
                      attname name, stadistinct real,
                      stanullfrac real, stawidth integer);
DO $$
DECLARE
    i integer;
BEGIN
  TRUNCATE test_res;
  FOR i IN 0..9 LOOP
    INSERT INTO test_res (expnum,oid,relname,attname,stadistinct,stanullfrac,stawidth)
      SELECT i, c.oid, c.relname, a.attname, s.stadistinct, s.stanullfrac, s.stawidth
      FROM pg_statistic s, pg_class c, pg_attribute a
      WHERE c.oid &gt;= 16385 AND c.oid = a.attrelid AND
      s.starelid = c.oid AND a.attnum = s.staattnum;
    ANALYZE;
  END LOOP;
END; $$</code></pre><p>Afterwards, I analysed the results with a query like the following:</p><pre><code>WITH changed AS (
  SELECT
    relname,attname,
    (abs(max - min) / avg * 100)::integer AS res, avg
  FROM (
    SELECT relname,attname,
      max(stadistinct) AS max, min(stadistinct) AS min,
      avg(stadistinct) AS avg
    FROM test_res
    WHERE relname &lt;&gt; 'test_res'
    GROUP BY relname,attname
    ) AS agg
  WHERE max - min &gt; 0 AND abs(max - min) / avg &gt; 0.01
) SELECT relname,attname, res AS "stadistinct dispersion", avg::integer
  FROM changed t
  ORDER BY relname,attname;</code></pre><p>Something more mathematically correct could be used here, but for our purposes, such a simple criterion is enough to see the instability of statistics. Using the above scripts, let's see how ndistinct fluctuates for 100, 1000 and 10000 elements of the statistical sample:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2Wzo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52603808-2f1a-4bbc-99f1-d4c845572ece_1498x1386.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2Wzo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52603808-2f1a-4bbc-99f1-d4c845572ece_1498x1386.heic 424w, https://substackcdn.com/image/fetch/$s_!2Wzo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52603808-2f1a-4bbc-99f1-d4c845572ece_1498x1386.heic 848w, https://substackcdn.com/image/fetch/$s_!2Wzo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52603808-2f1a-4bbc-99f1-d4c845572ece_1498x1386.heic 1272w, https://substackcdn.com/image/fetch/$s_!2Wzo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52603808-2f1a-4bbc-99f1-d4c845572ece_1498x1386.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2Wzo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52603808-2f1a-4bbc-99f1-d4c845572ece_1498x1386.heic" width="1456" height="1347" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/52603808-2f1a-4bbc-99f1-d4c845572ece_1498x1386.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1347,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:59089,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2Wzo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52603808-2f1a-4bbc-99f1-d4c845572ece_1498x1386.heic 424w, https://substackcdn.com/image/fetch/$s_!2Wzo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52603808-2f1a-4bbc-99f1-d4c845572ece_1498x1386.heic 848w, https://substackcdn.com/image/fetch/$s_!2Wzo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52603808-2f1a-4bbc-99f1-d4c845572ece_1498x1386.heic 1272w, https://substackcdn.com/image/fetch/$s_!2Wzo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52603808-2f1a-4bbc-99f1-d4c845572ece_1498x1386.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Stochastic, right? There's a lot of strange stuff here, but we won't dig too deep, attributing it to the fact that an ndistinct estimate built from a small sample does not converge uniformly to the actual value. Yes, the fluctuations subside as the sample size grows, but 10,000 is already the limit, and the tables in this benchmark are not that big - in real life, statistics can remain unstable on much larger tables.</p><p>Another observation from this experiment is that statistics on columns with a large number of duplicates suffer the most. In practice, this means that grouping or joining on some harmless "Status" field can cause immense estimation errors inside the optimiser, even if the expression on this field is only part of a long list of expressions in the query.</p><p><strong>What exactly is the origin of instability?</strong></p><p>So what exactly causes the query plans to fluctuate in this benchmark? 
To find out, we need to look at the query plans. One query consistently changes its plan from iteration to iteration - <code>30c.sql</code>, which makes it convenient to analyse. Comparing the query plans, we can see that <code>HashJoin</code> and the parameterised <code>NestLoop</code> compete with very close cost estimates (see <a href="https://github.com/danolivo/conf/blob/main/Benches/job-analyze-fluctuate/30c-explain-1.txt">here</a> and <a href="https://github.com/danolivo/conf/blob/main/Benches/job-analyze-fluctuate/30c-explain-2.txt">here</a>). Using the "close look" method, I found that the discrepancies in the estimates begin already at the <code>SeqScan</code> stage and then spread through the entire plan:</p><pre><code>-&gt;  Parallel Seq Scan on public.cast_info
    (cost=0.00..498779.41 rows=<strong>518229</strong> width=42)
    Output: id, person_id, movie_id, person_role_id,
            note, nr_order, role_id
    Filter: (cast_info.note = ANY ('{(writer),
          "(head writer)","(written by)",(story),"(story editor)"}'))

-&gt;  Parallel Seq Scan on public.cast_info
    (cost=0.00..498779.41 rows=<strong>520388</strong> width=42)
    Output: id, person_id, movie_id, person_role_id,
            note, nr_order, role_id
    Filter: (ci.note = ANY ('{(writer),
          "(head writer)","(written by)",(story),"(story editor)"}'))</code></pre><p>There is a slight difference in the estimation of the number of rows selected from the table. Let's dig deeper and see why.</p><p>It is not easy to compare histograms or MCV, so let's just study our specific query or, more precisely, the problematic scan operator. The estimation of <code>x = ANY (...) </code>occurs by estimating each individual expression <code>x = Ni</code> and then adding up the probabilities. In our case, all five <code>Ni</code> constants are included in the MCV statistics - which means that even the ndistinct value will not be used by Postgres. Thus, the estimation should be as accurate and stable as possible. However, if you dig into the numbers, you can see that after the <code>ANALYZE</code>, the frequency of each of the sample elements changes. For example, for <code>default_statistics_target=10000</code>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AAeD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b0e18a-77fa-4212-bf45-945ee9c48bed_1488x572.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AAeD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b0e18a-77fa-4212-bf45-945ee9c48bed_1488x572.heic 424w, https://substackcdn.com/image/fetch/$s_!AAeD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b0e18a-77fa-4212-bf45-945ee9c48bed_1488x572.heic 848w, 
https://substackcdn.com/image/fetch/$s_!AAeD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b0e18a-77fa-4212-bf45-945ee9c48bed_1488x572.heic 1272w, https://substackcdn.com/image/fetch/$s_!AAeD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b0e18a-77fa-4212-bf45-945ee9c48bed_1488x572.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AAeD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b0e18a-77fa-4212-bf45-945ee9c48bed_1488x572.heic" width="1456" height="560" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37b0e18a-77fa-4212-bf45-945ee9c48bed_1488x572.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:560,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:32728,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AAeD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b0e18a-77fa-4212-bf45-945ee9c48bed_1488x572.heic 424w, https://substackcdn.com/image/fetch/$s_!AAeD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b0e18a-77fa-4212-bf45-945ee9c48bed_1488x572.heic 848w, 
https://substackcdn.com/image/fetch/$s_!AAeD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b0e18a-77fa-4212-bf45-945ee9c48bed_1488x572.heic 1272w, https://substackcdn.com/image/fetch/$s_!AAeD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b0e18a-77fa-4212-bf45-945ee9c48bed_1488x572.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The final estimation changes by about 1%, which is not too much in principle. 
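</p><p>The per-constant frequencies behind this estimate can be inspected directly in the <code>pg_stats</code> view. The query below is only a sketch: it assumes the column in question is <code>public.cast_info.note</code>, as in the plan fragments above, and that all five constants are present in the MCV list. Running it after each <code>ANALYZE</code> shows how the individual frequencies and their sum drift:</p><pre><code>-- Sketch: MCV frequencies of the five constants from the filter.
-- most_common_vals has type anyarray, hence the cast through text.
SELECT v.val, v.freq,
       sum(v.freq) OVER () AS total_selectivity
FROM pg_stats s,
     unnest(s.most_common_vals::text::text[],
            s.most_common_freqs) AS v(val, freq)
WHERE s.schemaname = 'public'
  AND s.tablename = 'cast_info'
  AND s.attname = 'note'
  AND v.val IN ('(writer)', '(head writer)', '(written by)',
                '(story)', '(story editor)');</code></pre><p>Here <code>total_selectivity</code> is the plain sum of the frequencies, which is how the estimate for the whole <code>= ANY</code> clause was described above. Keep in mind that the <code>rows</code> value shown for a <code>Parallel Seq Scan</code> is a per-worker figure, so it will be smaller than this selectivity multiplied by the table's row count.</p><p>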
However, who knows - maybe we were just lucky not to hit a much worse case? In addition, the error accumulates when planning the higher-level operators of the query tree, ultimately changing the plan of this query.</p><p>In general, this corresponds to the <a href="https://dl.acm.org/doi/10.1145/335168.335230">theory</a> that the number of distinct values in a set can be estimated with reasonable accuracy only by analysing almost all the table rows. Moreover, the issue with statistics has already been reported to the community [<a href="https://www.postgresql.org/message-id/4338f834-dee9-2eb8-0577-10abe9d39e2d%40postgrespro.ru">here</a> and <a href="https://www.postgresql.org/message-id/flat/200504191209.05181.josh%40agliodbs.com">here</a>, for example] and an <a href="https://www.postgresql.org/message-id/5588D644.1000909@2ndquadrant.com">attempt</a> to solve the problem was made. However, in any case, this runs into the cost of a large statistical sample and is not universally applicable - and therefore not suitable for the PostgreSQL core.</p><p>However, for analytical tasks, where the size and denormalisation of the schema impose increased requirements on the quality of statistics, one computationally expensive pass through the table's rows may make sense. In addition, the data is loaded rarely and in large batches, so statistics do not need to be collected often. So maybe we should simply try this approach using the extension mechanism?</p><p><strong>An extension for standard statistics</strong></p><p>How to design it? Let's recall the article by <a href="https://dl.acm.org/doi/10.1145/276305.276315">DeWitt1998</a> - the author suggested computing statistics by attaching the computation to sequential table scans. Postgres has the <code>CustomScan</code> node mechanism: such a node can be inserted into any part of the query plan and can implement an operation of arbitrary complexity. Therefore, such an idea is easy to implement. 
Also, nothing prevents an extension from adding a new function to the UI that goes through the entire table and calculates at least ndistinct and MCV with maximum accuracy.</p><p>With the standard "lightweight" ndistinct estimate at hand, you can assess how expensive and feasible such an analysis will be in terms of computing resources before deciding to launch it.</p><p>Feeding such refined statistics to the optimiser can be implemented using two hooks: <code>get_relation_stats_hook</code> and <code>get_index_stats_hook</code>. They allow you to replace the standard statistics obtained from <code>pg_statistic</code> with an alternative, as long as it corresponds to the internal Postgres tuple format. The second hook is exciting because it can be used to implement not complete statistics but predicate statistics - that is, statistics based on data selected from a table with a particular filter - an analogue of SQL Server's <code>CREATE STATISTICS ... WHERE ...</code>.</p><p>How to store statistics? For the optimiser to be able to use them correctly, it is evident that their storage format must correspond to the format of standard statistics. There is nothing complicated about this since the extension can create its own tables - so why not just create a table <code>pg_statistic_extra</code>?</p><p>Additional bonuses from creating such an extension may emerge entirely unexpectedly. For example, while writing this post, I realised that it could help solve the dilemma of statistics on a partitioned table: no one likes to spend resources on it since it duplicates the work of calculating statistics for each individual partition. At the same time, it does not change much over time: there are many partitions, and if the data is well spread out, then the statistics for all data change insignificantly (except, perhaps, for the partition key). 
In addition, the table can be huge, and things like ndistinct in standard statistics can be far from reality. With the help of the extension, you can trigger the collection of detailed statistics on events such as attaching or detaching a partition or a massive update, which lets you launch a computationally expensive operation more deliberately. Also, knowing that the table changes rarely, it will be possible to radically simplify the statistics calculation by using indexes or implementing other sampling algorithms (for example, <a href="https://15799.courses.cs.cmu.edu/spring2025/papers/12-costmodel/p287-chaudhuri.pdf">this one</a>)...</p><p>That's actually the whole roadmap. All that's left is to find a passionate student, and you can apply with a project to GSoC ;). What do you think about such an extension? Write in the comments.</p><p>THE END.</p><p><em>February 2, 2025, South Pattaya, Thailand.</em></p>]]></content:encoded></item><item><title><![CDATA[Whose optimisation is better?]]></title><description><![CDATA[How to compare the quality of SQL query plans in PostgreSQL]]></description><link>https://danolivo.substack.com/p/whose-optimisation-is-better</link><guid isPermaLink="false">https://danolivo.substack.com/p/whose-optimisation-is-better</guid><dc:creator><![CDATA[Andrei Lepikhov]]></dc:creator><pubDate>Sat, 18 Jan 2025 15:37:54 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7bd51315-c541-42d6-aab5-07e30d59a407_420x315.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>It happened one long and warm Thai evening when I read another paper about the <a href="https://arxiv.org/abs/1902.08291">re-optimisation technique</a> in which the authors used Postgres as a base for implementation. Since I had nearly finished a WIP patch aiming to do the same thing in the Postgres fork, I immediately began comparing our algorithms using the paper's experimental data as a reference. 
However, I quickly realised that neither my code nor even a standard Postgres instance produced results resembling the paper's figures.</em></p><p>The execution time measurements they provided differed significantly from mine because details of the experimental setup and instance settings were unclear. I had often encountered research reports that were almost impossible to reproduce, and this case led me to explore how we could compare query plans and optimisation effectiveness using dimensionless criteria.</p><p>From a practical point of view, the DBMS that produces a higher TPS is more efficient. However, sometimes we need to design a system that does not yet exist or forecast behaviour for loads that have not yet arrived. In this case, we need a parameter to analyse a query plan or compare a pair of plans qualitatively. This post discusses one such parameter - <em>the number of data pages read</em>.</p><p>It hardly needs to be said that the 'performance evaluation' section of research is crucial for applied software developers, as it justifies the time spent reading the preceding text. This section must also ensure the repeatability of results and allow for independent analysis. For instance, a similarity theory has been developed in fields like hydrodynamics and heat engineering that enables researchers to present experimental results in dimensionless quantities, such as the Nusselt, Prandtl, and Reynolds numbers. This lets researchers reasonably compare the results obtained by reproducing experiments under slightly different conditions.</p><p>I have not yet seen anything like this in database systems. The section devoted to testing usually gives a brief description of the hardware and software and presents graphs. The main parameter under study is the query execution time or TPS (transactions-per-second).</p><p>This approach appears to be the only viable method when comparing different DBMSes and making decisions regarding which system to use in production. 
However, it's important to note that query execution time is influenced by multiple factors, including server settings, caching algorithms, the choice of query plan, and parallelism...</p><p>Let's consider the scenario where we are developing a new query optimisation method and want to compare its performance with a previously published method. We have graphs showing query execution times (see, for example, <a href="https://arxiv.org/abs/1902.08291">here</a> or <a href="https://www.vldb.org/pvldb/vol16/p2962-zhang.pdf">there</a>), along with a brief description of the testing platform. However, we encounter discrepancies between our results and those from published studies due to multiple unknown factors. To address this, we need a measurable parameter that can eliminate the influence of other DBMS subsystems, making our analysis more portable and accessible. I believe that developers working, for example, on a new storage system would also appreciate the opportunity to remove the optimiser's impact from their benchmarks.</p><p>When attempting to reproduce the experiments described in articles or to compare my method with the one proposed by the authors, I often find that the uncertainty of the commonly accepted measurement of execution time is too high to draw firm conclusions. This measure primarily reflects the efficiency of the code under specific operating conditions rather than the quality of the discovered query plan. 
Execution time is a highly variable characteristic; even when running the same test consistently on the same machine and instance, there can be a significant variation in execution times.</p><p>For instance, I've conducted <a href="https://github.com/danolivo/conf/blob/main/Benches/pages-fetched-criteria/job_stats1.sh">ten consecutive runs</a> of all 113 <a href="https://github.com/danolivo/jo-bench">Join Order Benchmark</a> (JOB) tests, and I've observed a typical spread in execution time of up to 50% on my desktop (see the picture below), even under optimal conditions with all experiment parameters meticulously controlled. This raises a crucial question: how much deviation might an external researcher encounter if they attempt to repeat the experiment, and how should they analyse the results?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qta2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52d1e22-7d76-4eab-95b9-5b24898e3e66_1324x819.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qta2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52d1e22-7d76-4eab-95b9-5b24898e3e66_1324x819.png 424w, https://substackcdn.com/image/fetch/$s_!qta2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52d1e22-7d76-4eab-95b9-5b24898e3e66_1324x819.png 848w, https://substackcdn.com/image/fetch/$s_!qta2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52d1e22-7d76-4eab-95b9-5b24898e3e66_1324x819.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qta2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52d1e22-7d76-4eab-95b9-5b24898e3e66_1324x819.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qta2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52d1e22-7d76-4eab-95b9-5b24898e3e66_1324x819.png" width="1324" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f52d1e22-7d76-4eab-95b9-5b24898e3e66_1324x819.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1324,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qta2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52d1e22-7d76-4eab-95b9-5b24898e3e66_1324x819.png 424w, https://substackcdn.com/image/fetch/$s_!qta2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52d1e22-7d76-4eab-95b9-5b24898e3e66_1324x819.png 848w, https://substackcdn.com/image/fetch/$s_!qta2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52d1e22-7d76-4eab-95b9-5b24898e3e66_1324x819.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qta2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52d1e22-7d76-4eab-95b9-5b24898e3e66_1324x819.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Variation in execution times for repeated JOB query executions.</figcaption></figure></div><p>One more concern is how to compare query plans executed with varying numbers of parallel workers. 
Using multiple workers on a test machine can yield positive results; however, parallelism can sometimes be counterproductive in a production environment with hundreds of competing backends. Would it not be better, then, to seek a more meaningful criterion for evaluation?</p><p>In my specific area of query optimisation, execution time often seems like an overly broad metric. It may be more beneficial to adopt a more specific characteristic to compare different optimisation approaches or to assess the impact of a new transformation within the PostgreSQL optimiser. Such a metric should include only factors that the optimiser may consider during the planning process.</p><p>From the perspective of a DBMS, the primary operations involve data manipulation. Thus, it would be natural to select the number of operations performed on table rows during query execution, taking into account the number of attributes in each row. A lower value of this parameter would indicate a more efficient query plan. However, collecting such statistics can be challenging. Therefore, we should aim to identify a slightly less precise but more easily obtainable parameter.</p><p>For example, DBAs often use the number of pages read as a parameter. In this context, a page refers to a buffer cache page or a table data block stored on disk. It is unnecessary to differentiate between the pages that fit in the RAM buffer and those on disk, as this distinction provides redundant information that pertains more to the page eviction strategy and disk operation than to the plan chosen.</p><p>For our purposes, it is sufficient to sum these values mechanically. We also need to consider the pages from the temporary disk cache used by sorting, hashing, and other algorithms for placing rows that did not fit in memory. It is important to note that the same page may need to be counted more than once. We access a page once during sequential row scanning to read its tuples. 
However, when rescanning, such as on the inner side of a NestLoop join, we reread the data and must account for each page again. PostgreSQL already has the necessary infrastructure for measuring the number of pages read, provided by the <code>pg_stat_statements</code> extension. My approach is as follows: before executing each benchmark query, I run the command <code>SELECT pg_stat_statements_reset()</code> and, once the query has finished, retrieve the statistics using the following query:</p><pre><code>SELECT
  shared_blks_hit + shared_blks_read +
  local_blks_hit + local_blks_read +
  temp_blks_read AS blocks,
  total_exec_time::integer AS exec_time
FROM pg_stat_statements
WHERE query NOT LIKE '%pg_stat_statements_reset%';</code></pre><p>How reliable is this metric? In the same experiment mentioned above, all ten runs of the JOB test demonstrated negligible deviation in the number of pages for each query throughout the iterations:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s9Kc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5167f437-b05f-49e6-9765-3935b8751462_1200x742.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s9Kc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5167f437-b05f-49e6-9765-3935b8751462_1200x742.png 424w, https://substackcdn.com/image/fetch/$s_!s9Kc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5167f437-b05f-49e6-9765-3935b8751462_1200x742.png 848w, https://substackcdn.com/image/fetch/$s_!s9Kc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5167f437-b05f-49e6-9765-3935b8751462_1200x742.png 1272w, https://substackcdn.com/image/fetch/$s_!s9Kc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5167f437-b05f-49e6-9765-3935b8751462_1200x742.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s9Kc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5167f437-b05f-49e6-9765-3935b8751462_1200x742.png" width="1200" height="742" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5167f437-b05f-49e6-9765-3935b8751462_1200x742.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:742,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s9Kc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5167f437-b05f-49e6-9765-3935b8751462_1200x742.png 424w, https://substackcdn.com/image/fetch/$s_!s9Kc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5167f437-b05f-49e6-9765-3935b8751462_1200x742.png 848w, https://substackcdn.com/image/fetch/$s_!s9Kc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5167f437-b05f-49e6-9765-3935b8751462_1200x742.png 1272w, https://substackcdn.com/image/fetch/$s_!s9Kc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5167f437-b05f-49e6-9765-3935b8751462_1200x742.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There is a deviation of only a few pages, and while even a minor discrepancy like this should typically be examined, it seems to be more of an artifact resulting from service operations, such as accessing statistics and interactions among parallel workers. What can we infer from this indicator? Let's conduct a simple experiment. We will use one test query (10a.sql) and sequentially <a href="https://github.com/danolivo/conf/blob/main/Benches/pages-fetched-criteria/workers_explain.sh">increase the number of workers</a> involved in processing this query. 
The graph below illustrates how the query execution time and the number of data pages read change as we adjust the number of workers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OAeL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa9b3225-0acf-4698-a5e9-a8ada65d746c_1200x742.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OAeL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa9b3225-0acf-4698-a5e9-a8ada65d746c_1200x742.png 424w, https://substackcdn.com/image/fetch/$s_!OAeL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa9b3225-0acf-4698-a5e9-a8ada65d746c_1200x742.png 848w, https://substackcdn.com/image/fetch/$s_!OAeL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa9b3225-0acf-4698-a5e9-a8ada65d746c_1200x742.png 1272w, https://substackcdn.com/image/fetch/$s_!OAeL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa9b3225-0acf-4698-a5e9-a8ada65d746c_1200x742.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OAeL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa9b3225-0acf-4698-a5e9-a8ada65d746c_1200x742.png" width="1200" height="742" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aa9b3225-0acf-4698-a5e9-a8ada65d746c_1200x742.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:742,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OAeL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa9b3225-0acf-4698-a5e9-a8ada65d746c_1200x742.png 424w, https://substackcdn.com/image/fetch/$s_!OAeL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa9b3225-0acf-4698-a5e9-a8ada65d746c_1200x742.png 848w, https://substackcdn.com/image/fetch/$s_!OAeL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa9b3225-0acf-4698-a5e9-a8ada65d746c_1200x742.png 1272w, https://substackcdn.com/image/fetch/$s_!OAeL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa9b3225-0acf-4698-a5e9-a8ada65d746c_1200x742.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Behaviour of execution time and number of pages with the increasing number of  parallel workers</figcaption></figure></div><p>It is evident that while the query execution time may vary, the number of read data pages remains relatively constant. The number of pages only changes once when the number of workers increases from 1 to 2, resulting in a doubling of the read pages. An examination of the EXPLAIN output for these two cases reveals the reason behind this change: with 0 and 1 worker, out of six query joins, three were of the Nested Loop type, and three were Hash Joins. However, with two or more workers, the number of Nested Loop joins increases by one while the number of Hash Joins decreases. Thus, by analysing the number of read pages, we were able to identify a change in the query plan that was not apparent when considering execution time alone. 
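</p><p>A minimal sketch of how such a plan change can be spotted by hand, assuming two hypothetical tables (<code>demo_a</code> and <code>demo_b</code>, which are not part of JOB) and the standard <code>max_parallel_workers_per_gather</code> setting. Whether the plan actually changes depends on the data, the settings, and the machine, so treat it as an illustration rather than a reproduction of the experiment above:</p><pre><code>-- Hypothetical tables, used only for illustration (not part of JOB).
DROP TABLE IF EXISTS demo_a, demo_b;
CREATE TABLE demo_a AS
  SELECT g AS id, g % 1000 AS grp FROM generate_series(1, 1000000) AS g;
CREATE TABLE demo_b AS
  SELECT g AS id, g % 1000 AS grp FROM generate_series(1, 100000) AS g;
ANALYZE demo_a, demo_b;

-- Run the same query with one worker per Gather node ...
SET max_parallel_workers_per_gather = 1;
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF, SUMMARY OFF)
SELECT count(*) FROM demo_a a JOIN demo_b b ON a.id = b.id;

-- ... and then with two, keeping everything else the same.
SET max_parallel_workers_per_gather = 2;
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF, SUMMARY OFF)
SELECT count(*) FROM demo_a a JOIN demo_b b ON a.id = b.id;

-- Restore the session default.
RESET max_parallel_workers_per_gather;
</code></pre><p>Comparing the join node types and the <code>Buffers:</code> lines of the two outputs shows whether a difference in the number of pages read comes from a different plan.</p><p>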
Now, let's explore the effect of the <a href="https://github.com/postgrespro/aqo/tree/stable17">AQO</a> (Adaptive Query Optimization) extension for the PostgreSQL optimiser on JOB test queries.</p><p>We will <a href="https://github.com/danolivo/conf/blob/main/Benches/pages-fetched-criteria/job_stats_aqo.sh">execute</a> each test query with AQO ten times in 'learn' mode. In this mode, AQO functions as a planner memory, storing the cardinality of each plan node (as well as the number of groups in the corresponding operators) at the end of execution. This information is then used during the planning stage, allowing the optimiser to reject overly optimistic plans. Given that the PostgreSQL optimiser tends to underestimate join cardinalities, this approach appears quite reasonable. The figure below (shown on a logarithmic scale) illustrates how the number of pages read changed relative to the first iteration, during which the optimiser lacked information about the cardinalities of the query plan nodes, the number of distinct values in the columns, etc. By the tenth iteration, almost all queries either improved this metric or remained unchanged. 
This suggests that the PostgreSQL optimiser may have quickly identified the best plan in the space of potential options or that our technique did not have the desired effect in these cases.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8Itp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fd02e9a-bd5a-4c4d-8e68-be55b1a6f181_1350x833.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8Itp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fd02e9a-bd5a-4c4d-8e68-be55b1a6f181_1350x833.png 424w, https://substackcdn.com/image/fetch/$s_!8Itp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fd02e9a-bd5a-4c4d-8e68-be55b1a6f181_1350x833.png 848w, https://substackcdn.com/image/fetch/$s_!8Itp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fd02e9a-bd5a-4c4d-8e68-be55b1a6f181_1350x833.png 1272w, https://substackcdn.com/image/fetch/$s_!8Itp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fd02e9a-bd5a-4c4d-8e68-be55b1a6f181_1350x833.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8Itp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fd02e9a-bd5a-4c4d-8e68-be55b1a6f181_1350x833.png" width="1350" height="833" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4fd02e9a-bd5a-4c4d-8e68-be55b1a6f181_1350x833.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:833,&quot;width&quot;:1350,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8Itp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fd02e9a-bd5a-4c4d-8e68-be55b1a6f181_1350x833.png 424w, https://substackcdn.com/image/fetch/$s_!8Itp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fd02e9a-bd5a-4c4d-8e68-be55b1a6f181_1350x833.png 848w, https://substackcdn.com/image/fetch/$s_!8Itp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fd02e9a-bd5a-4c4d-8e68-be55b1a6f181_1350x833.png 1272w, https://substackcdn.com/image/fetch/$s_!8Itp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fd02e9a-bd5a-4c4d-8e68-be55b1a6f181_1350x833.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There are still six degraded queries remaining, and the number of pages read for these queries has increased compared to the first iteration. It's possible that there weren't enough iterations to effectively filter out non-optimal query plans. 
Therefore, let's increase the number of execution iterations to 30 and observe the results.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3GAd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c0abd5b-b3c8-4edd-b866-4ab97c2b8385_1200x742.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3GAd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c0abd5b-b3c8-4edd-b866-4ab97c2b8385_1200x742.png 424w, https://substackcdn.com/image/fetch/$s_!3GAd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c0abd5b-b3c8-4edd-b866-4ab97c2b8385_1200x742.png 848w, https://substackcdn.com/image/fetch/$s_!3GAd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c0abd5b-b3c8-4edd-b866-4ab97c2b8385_1200x742.png 1272w, https://substackcdn.com/image/fetch/$s_!3GAd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c0abd5b-b3c8-4edd-b866-4ab97c2b8385_1200x742.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3GAd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c0abd5b-b3c8-4edd-b866-4ab97c2b8385_1200x742.png" width="1200" height="742" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5c0abd5b-b3c8-4edd-b866-4ab97c2b8385_1200x742.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:742,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3GAd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c0abd5b-b3c8-4edd-b866-4ab97c2b8385_1200x742.png 424w, https://substackcdn.com/image/fetch/$s_!3GAd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c0abd5b-b3c8-4edd-b866-4ab97c2b8385_1200x742.png 848w, https://substackcdn.com/image/fetch/$s_!3GAd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c0abd5b-b3c8-4edd-b866-4ab97c2b8385_1200x742.png 1272w, https://substackcdn.com/image/fetch/$s_!3GAd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c0abd5b-b3c8-4edd-b866-4ab97c2b8385_1200x742.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The figure above illustrates that the query plans have converged toward an optimal solution. Notably, two queries (26b and 33b) show an increase in the number of pages read compared to the zero iteration. Additionally, the query execution time has improved by 15-20%.</p><p>The explanations for these observations are as follows: the number of Nested Loops in the query plan has decreased by one, and the Hash Join, when constructing a hash table, scans the entire table and consequently increases the number of pages read. In contrast, the parallel Hash Join proves to be more time-efficient, leading to better query execution times. This suggests that the number of pages read is not an absolute criterion for determining query optimality. This criterion can help establish a starting point within a single DBMS, allowing for reproduced experiments in different software and hardware environments. 
It can also aid in comparing various optimisation methods and identifying effects that may be masked by unstable execution times.</p><p>Therefore, it may not be advisable to disregard execution time when publishing benchmark results. However, should the number of pages read be included as well? Ultimately, when a publication provides a graph showing the changes in the number of pages read during query execution, along with a test run script (see the links above) and a <a href="https://docs.google.com/spreadsheets/d/1-7aI2JHZ7famDsKRgxxKqZ8ZPKYf_poCHUU7wNs3Ays/edit?usp=sharing">link</a> to the raw data, readers can independently reproduce the experiment, calibrate it against the published data, conduct additional studies, or compare it with other methods under similar conditions. Isn&#8217;t that convenient?</p><p>That's it for today. The primary goal of this post is to highlight the problem of reproducibility of results and to encourage objective analysis of new methods in the field of DBMS. Should we seek additional criteria for evaluating test results? How effective is the criterion of the number of pages read for this purpose? Can this criterion be adapted to compare query plans across different DBMSes with similar architectures? Is it possible to normalise this criterion relative to the average number of tuples per page? I welcome any opinions and comments on these questions.</p><p>THE END.</p><p><em>January 18th, 2025. 
Pattaya, Thailand.</em></p>]]></content:encoded></item><item><title><![CDATA[Investigating Memoize's Boundaries]]></title><description><![CDATA[Compare Postgres & SQL Server query plans]]></description><link>https://danolivo.substack.com/p/investigating-memoizes-boundaries</link><guid isPermaLink="false">https://danolivo.substack.com/p/investigating-memoizes-boundaries</guid><dc:creator><![CDATA[Andrei Lepikhov]]></dc:creator><pubDate>Fri, 03 Jan 2025 14:01:59 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e04f5fff-cbb2-4a59-a75e-b3a39d585eb4_469x545.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>During the New Year holiday week, I want to glance at one of Postgres' most robust features: the internal caching technique for query trees, also known as <em>memoisation</em>.</p><p>Introduced with commit 9eacee2 in 2021, the Memoize node fills the performance gap between HashJoin and parameterised NestLoop: given a couple of big tables, we sometimes need to join only small row subsets from these tables. In that case, the parameterised NestLoop algorithm does the job much faster than HashJoin. However, the size of the outer input is critical for performance and may cause NestLoop to be rejected just because of massive repeated scans of the inner input.</p><p>When the optimiser predicts many duplicates in the outer column that serves as a parameter on the inner side of a join, it can insert a Memoize node. This node caches the results of the inner query subtree scan for each parameter value and reuses these results if a known value from the outer side reappears later.</p><p>This feature is highly beneficial. However, user migration reports indicate that there are still some cases in PostgreSQL where this feature does not apply, leading to significant increases in query execution time. 
In this post, I will compare the caching methods for intermediate results in PostgreSQL and SQL Server.</p><h1>Memoisation for SEMI/ANTI JOIN</h1><p>Let me introduce a couple of tables:</p><pre><code>DROP TABLE IF EXISTS t1,t2;
CREATE TABLE t1 (x integer);
INSERT INTO t1 (x)
  SELECT value % 10 FROM generate_series(1,1000) AS value;
CREATE TABLE t2 (x integer, y integer);
INSERT INTO t2 (x,y)
  SELECT value, value%100 FROM generate_series(1,100000) AS value;
CREATE INDEX t2_idx ON t2(x,y);
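-- Added sanity check (a sketch, not part of the original script): confirm the
-- data shape that the join examples rely on. t1.x is filled with value % 10,
-- so many duplicates are expected there, while t2.x is filled straight from
-- generate_series and should hold one row per value.
SELECT count(*) AS t1_rows, count(DISTINCT x) AS t1_distinct_x FROM t1;
SELECT count(*) AS t2_rows, count(DISTINCT x) AS t2_distinct_x FROM t2;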
VACUUM ANALYZE t1,t2;</code></pre><p>In Postgres, a simple join of these tables prefers a parameterised NestLoop with memoisation:</p><pre><code>EXPLAIN (COSTS OFF)
SELECT t1.* FROM t1 JOIN t2 ON (t1.x = t2.x);
/*
 Nested Loop
   -&gt;  Seq Scan on t1
   -&gt;  Memoize
         Cache Key: t1.x
         Cache Mode: logical
         -&gt;  Index Scan using t2_idx on t2
               Index Cond: (x = t1.x)
*/</code></pre><p>The smaller table <code>t1</code> contains many duplicates in the column used for the <code>JOIN</code>, while the bigger one <code>t2</code> contains almost unique values. It also has an index to extract necessary tuples effectively.</p><p>Ok, it works for trivial joins. What about more complex forms, like SEMI JOIN? Look at the query:</p><pre><code>EXPLAIN (COSTS OFF)
SELECT * FROM t1 WHERE x IN (SELECT y FROM t2 WHERE t1.x = t2.x);

/*
 Nested Loop Semi Join
   -&gt;  Seq Scan on t1
   -&gt;  Index Only Scan using t2_idx on t2
         Index Cond: ((x = t1.x) AND (y = t1.x))
         Filter: (x = y)
*/</code></pre><p>Postgres can <em>pull-up</em> the subquery and transform it into a join. But it doesn't add a Memoize node in that case. To compare, execute this query in SQL Server (Use the <code>OPTION (LOOP JOIN)</code> hint to prevent hash join):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_h8A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F499ca635-d93c-4e99-a0f5-e26c43071aa4_1294x498.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_h8A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F499ca635-d93c-4e99-a0f5-e26c43071aa4_1294x498.heic 424w, https://substackcdn.com/image/fetch/$s_!_h8A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F499ca635-d93c-4e99-a0f5-e26c43071aa4_1294x498.heic 848w, https://substackcdn.com/image/fetch/$s_!_h8A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F499ca635-d93c-4e99-a0f5-e26c43071aa4_1294x498.heic 1272w, https://substackcdn.com/image/fetch/$s_!_h8A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F499ca635-d93c-4e99-a0f5-e26c43071aa4_1294x498.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_h8A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F499ca635-d93c-4e99-a0f5-e26c43071aa4_1294x498.heic" width="1294" height="498" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/499ca635-d93c-4e99-a0f5-e26c43071aa4_1294x498.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:498,&quot;width&quot;:1294,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:33867,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_h8A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F499ca635-d93c-4e99-a0f5-e26c43071aa4_1294x498.heic 424w, https://substackcdn.com/image/fetch/$s_!_h8A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F499ca635-d93c-4e99-a0f5-e26c43071aa4_1294x498.heic 848w, https://substackcdn.com/image/fetch/$s_!_h8A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F499ca635-d93c-4e99-a0f5-e26c43071aa4_1294x498.heic 1272w, https://substackcdn.com/image/fetch/$s_!_h8A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F499ca635-d93c-4e99-a0f5-e26c43071aa4_1294x498.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>SQL Server performs a similar optimisation and utilises a Spool node in the inner subtree of a join. This approach allows the results of scans on the inner side of the join to be cached. Interestingly, it doesn't have to cache individual tuples; it only needs to keep track of the existence of the NULL/NOT NULL result.</p><p>So, why hasn&#8217;t Postgres implemented Memoize for <code>JOIN_SEMI</code>? If you examine the code, you will find that this limit was introduced in an initial commit by David Rowley.</p><pre><code>if (!extra-&gt;inner_unique &amp;&amp; (jointype == JOIN_SEMI ||
                             jointype == JOIN_ANTI))
  return NULL;</code></pre><p>In the case of a semi-join, the executor only requires the first tuple from the inner subtree to make its decision. This means that the Memoize cache would contain incomplete results, which, I suppose, was the developers' concern. However, the <code>MemoizePath</code> struct already has a built-in mechanism (the <code>singlerow</code> flag) for situations where the inner subtree provably produces a single tuple for each scan.</p><p>It appears that much of the groundwork is already in place to implement caching for semi-joins. We need to make minor adjustments to the <code>get_memoize_path</code> function, revise the cost model in the <code>cost_memoize_rescan</code> routine, and inform users about the memoisation mode by adding relevant details to the EXPLAIN output. The code that enables memoisation for semi-joins is relatively concise and can be found in a <a href="https://github.com/danolivo/pgdev/tree/memoize-semi-join">branch</a> of my GitHub project. With this patch applied, you can see something like this:</p><pre><code>EXPLAIN (COSTS OFF)
SELECT x FROM t1 WHERE EXISTS (SELECT x FROM t2 WHERE t2.x=t1.x);

/*
 Nested Loop Semi Join
   -&gt;  Seq Scan on t1
   -&gt;  Memoize
         Cache Key: t1.x
         Cache Mode: logical
         <strong>Store Mode: singlerow</strong>
         -&gt;  Index Only Scan using t2_idx on t2
               Index Cond: (x = t1.x)
*/</code></pre><p>The EXPLAIN parameter 'Store Mode' appears only when the Memoize node works in the 'incomplete' mode. It works the same way for ANTI JOIN cases:</p><pre><code>EXPLAIN (COSTS OFF)
SELECT x FROM t1 WHERE NOT EXISTS (SELECT x FROM t2 WHERE t2.x=t1.x);

/*
 Nested Loop Anti Join
   -&gt;  Seq Scan on t1
   -&gt;  Memoize
         Cache Key: t1.x
         Cache Mode: logical
         <strong>Store Mode: singlerow</strong>
         -&gt;  Index Only Scan using t2_idx on t2
               Index Cond: (x = t1.x)
*/</code></pre><p>In real-life scenarios, what I frequently see on the inner side of SEMI and ANTI joins is not a trivial table scan but a huge subtree containing joins, aggregates and sorts. For such queries, avoiding unnecessary rescan calls is crucial. Even more importantly, knowing that only a single tuple is needed from such a subquery may lead the optimiser to choose a better <em>fractional</em> path.</p><h1>Memoise arbitrary query subtree</h1><p>Here, I want to find out whether the optimiser can use Memoize to cache the result of a bushy query tree. Look at the example:</p><pre><code>DROP TABLE IF EXISTS t1,t2,t3;
CREATE TABLE t1 (x numeric PRIMARY KEY, payload text);
CREATE TABLE t2 (x numeric, y numeric);
CREATE TABLE t3 (x numeric, payload text);
INSERT INTO t1 (x, payload)
  (SELECT value, 'long line of text'
   FROM generate_series(1,100000) AS value);
INSERT INTO t2 (x,y)
  (SELECT value % 1000, value % 1000
   FROM generate_series(1,100000) AS value);
INSERT INTO t3 (x, payload)
  (SELECT (value%10), 'long line of text'
   FROM generate_series(1,100000) AS value);
CREATE INDEX t2_idx_x ON t2 (x);
CREATE INDEX t2_idx_y ON t2 (y);
VACUUM ANALYZE t1,t2,t3;
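-- Added check (a sketch, not part of the original script): after ANALYZE, the
-- planner's own view of duplicates can be read from the standard pg_stats view.
-- A small positive n_distinct means few distinct values (many duplicates);
-- a negative value is a fraction of the row count, with -1 meaning all unique.
SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE tablename IN ('t1', 't2', 't3') AND attname IN ('x', 'y')
ORDER BY tablename, attname;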

-- Disable any extra optimisations:
SET enable_hashjoin = f;
SET enable_mergejoin = f;
SET enable_material = f;</code></pre><p>Now, let's examine the query:</p><pre><code>EXPLAIN (COSTS OFF)
SELECT * FROM t3 WHERE x IN (
  SELECT y FROM t2 WHERE x IN (
    SELECT x FROM t1)
);</code></pre><p>Three tables are joined here. In the absence of a hash join, it would be better to use a parameterised scan. Lots of duplicated values in table <code>t3</code> should trigger the use of the memoisation technique:</p><pre><code> Nested Loop Semi Join
   -&gt;  Seq Scan on t3
   -&gt;  Nested Loop
         -&gt;  Index Scan using t2_idx_y on t2
               Index Cond: (y = t3.x)
         -&gt;  Index Only Scan using t1_pkey on t1
               Index Cond: (x = t2.x)</code></pre><p>As you can see in the EXPLAIN above, Postgres can&#8217;t insert a Memoize node at the top of NestLoop JOIN. As far as I remember, it has not yet been implemented because it is hard to discover the query subtree and find all the lateral references and parameters that are mandatory for the memoisation technique. At the same time, SQL Server is capable of doing it:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cMAh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb034132d-0aee-476c-b5c1-3612df73768c_1772x760.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cMAh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb034132d-0aee-476c-b5c1-3612df73768c_1772x760.heic 424w, https://substackcdn.com/image/fetch/$s_!cMAh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb034132d-0aee-476c-b5c1-3612df73768c_1772x760.heic 848w, https://substackcdn.com/image/fetch/$s_!cMAh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb034132d-0aee-476c-b5c1-3612df73768c_1772x760.heic 1272w, https://substackcdn.com/image/fetch/$s_!cMAh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb034132d-0aee-476c-b5c1-3612df73768c_1772x760.heic 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!cMAh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb034132d-0aee-476c-b5c1-3612df73768c_1772x760.heic" width="1456" height="624" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b034132d-0aee-476c-b5c1-3612df73768c_1772x760.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:624,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:46684,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cMAh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb034132d-0aee-476c-b5c1-3612df73768c_1772x760.heic 424w, https://substackcdn.com/image/fetch/$s_!cMAh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb034132d-0aee-476c-b5c1-3612df73768c_1772x760.heic 848w, https://substackcdn.com/image/fetch/$s_!cMAh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb034132d-0aee-476c-b5c1-3612df73768c_1772x760.heic 1272w, https://substackcdn.com/image/fetch/$s_!cMAh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb034132d-0aee-476c-b5c1-3612df73768c_1772x760.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>A closer examination of this example revealed an interesting aspect. Although Postgres doesn't have the opportunity to insert Memoize over the join, it might still insert Memoize over trivial scans of tables <code>t1</code> or <code>t2</code>, thereby avoiding repeated scans. However, it didn't do this because it predicted only one rescan operation when planning the <code>JOIN(t1,t2)</code>. In this case, using memoisation seems unnecessary. </p><p>One million rescan cycles occur due to the upper-level join, <code>JOIN(t3, JOIN(t1,t2))</code>, but the bottom-up optimiser lacks the insight to identify this valuable data at the level of <code>JOIN(t1,t2)</code>. You can observe this behaviour in our test example by populating <code>t3.x</code> with unique data. 
Interestingly, SQL Server also uses a bottom-up planning strategy and fails to recognise this situation and insert an appropriate Spool node. </p><p>Postgres' planner extensibility allows an extension to walk the finished query plan and do additional work. Should we consider adding a top-down planning cycle after constructing the query plan?</p><h1>Beyond the NestLoop memoisation</h1><p>In previous sections, we discussed how memoisation could be enhanced by extending it to other join types and applying it to arbitrary query subtrees. But can we go further? Let me dream a little bit... </p><p>Memoisation is a technique used for caching parameters along with their corresponding results. In real-world scenarios, I often encounter complex situations where a heavy subplan is evaluated within an expression or a CASE expression for every incoming tuple due to references to upper-query objects. </p><p>What if the optimiser could insert a Memoize node at the top of the subplan whenever an external value parameterises it? To illustrate this idea, here are a few examples:</p><pre><code>-- Case 1:
EXPLAIN (COSTS OFF)
SELECT oid,relname FROM pg_class c1
WHERE oid = 
  CASE WHEN (c1.oid)::integer%2=0
    THEN (SELECT oid FROM pg_class c2 WHERE c2.relname = c1.relname)
    ELSE
      (SELECT oid FROM pg_class c3 WHERE c3.relname = c1.relname)
  END;

-- Case 2:
EXPLAIN (VERBOSE, COSTS OFF)
SELECT 
  CASE WHEN (c1.oid)::integer%2=0
    THEN (SELECT oid || ' - TRUE' FROM pg_class c2 WHERE c2.relname = c1.relname)
    ELSE
      (SELECT oid || ' - FALSE' FROM pg_class c3 WHERE c3.relname = c1.relname)
  END
FROM pg_class c1;

-- Case 3:
EXPLAIN (COSTS OFF)
SELECT oid FROM pg_class c1
WHERE EXISTS (
  SELECT true FROM pg_class c2 WHERE c1.relname=c2.relname OFFSET 0);</code></pre><p>We can't flatten subplan nodes in these examples and must evaluate them repeatedly. In my mind, the optimiser should have a chance to build a plan that looks like the one below:</p><pre><code>   SubPlan 1
     <strong>Memoize
       Cache Key: c1.relname
       Cache Mode: logical</strong>
     -&gt;  Index Scan using pg_class_relname_nsp_index on pg_class c2
           Index Cond: (relname = c1.relname)</code></pre><p>In this case, the Memoize node will re-evaluate the underlying subplan only when a new combination of parameters is provided from the upper query. While we cannot address all the issues that arise with the subplan bubbling up in an expression, we can help mitigate performance cliffs caused by such constructs.</p><p>Do you think it makes sense?</p><p></p><p>THE END.</p><p><em>January 2, 2025, Pattaya, Thailand.</em></p>]]></content:encoded></item><item><title><![CDATA[Fractional Path Issue in Partitioned Postgres databases]]></title><description><![CDATA[Continuing discovery on Postgres planner weak points]]></description><link>https://danolivo.substack.com/p/fractional-path-issue-in-partitioned</link><guid isPermaLink="false">https://danolivo.substack.com/p/fractional-path-issue-in-partitioned</guid><dc:creator><![CDATA[Andrei Lepikhov]]></dc:creator><pubDate>Sun, 15 Dec 2024 22:01:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9608ee3f-ba6e-4563-9a9b-41c7de9d374e_420x310.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>While the user notices the positive aspects of technology, a developer, usually encountering limitations, shortcomings or bugs, watches the product from a completely different perspective. The same thing happened this time: after the <a href="https://open.substack.com/pub/danolivo/p/looking-for-hidden-hurdles-when-postgres?r=34q1yy&amp;utm_campaign=post&amp;utm_medium=web">publication</a> of the comparative testing results, where <a href="https://github.com/gregrahn/join-order-benchmark">Join-Order-Benchmark</a> queries were run on a database with and without partitions, I couldn't push away the feeling that I had missed something. In my mind, Postgres should build a worse plan with partitions than without them. And this should not be just a bug but a technological limitation. 
On second thought, I found a weak spot: queries with limits.</em></p><p>When the SQL query contains a <code>LIMIT</code> clause and the table is partitioned, unlike the case of plain tables, the optimiser immediately faces many questions: How many rows may be extracted from each partition? Will only a single partition be used? If so, which one? The answer is not apparent when partitions may be pruned at execution time.</p><p>What if we scan partitions by index, and the result is obtained by merging? In that case, it is entirely unclear how to estimate the number of rows that should be extracted from each partition and, therefore, which type of scan operator to apply to it. And what if, with partitionwise join, we have an intricate subtree under the <code>Append</code>? Knowledge of the limit should be crucial in this case, for example, when choosing the JOIN type.</p><h1>Interim-cost query plans</h1><p>This pack of questions about planning partitions led to a compromise in choosing a query plan for <code>Append</code>'s subpaths: when picking the optimal <em>fractional path</em>, only two options are considered, the path with the minimum total cost and the path with the minimum startup cost. Roughly speaking, the plan will be optimal if we have <code>LIMIT 1</code> or some considerable <code>LIMIT</code> value in the query. But what about intermediate options? Let's look at specific examples (thanks to Alexander Pyhalov).</p><pre><code>DROP TABLE IF EXISTS parted,plain CASCADE;
CREATE TEMP TABLE parted (x integer, y integer, payload text)
PARTITION BY HASH (payload);
CREATE TEMP TABLE parted_p1 PARTITION OF parted
  FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TEMP TABLE parted_p2 PARTITION OF parted
  FOR VALUES WITH (MODULUS 2, REMAINDER 1);
INSERT INTO parted (x,y,payload)
  SELECT (random()*600)::integer,
         (random()*600)::integer, md5((gs%500)::text)
  FROM generate_series(1,1E5) AS gs;
CREATE TEMP TABLE plain (x numeric, y numeric, payload text);
INSERT INTO plain (x,y,payload) SELECT x,y,payload FROM parted;
CREATE INDEX ON parted(payload);
CREATE INDEX ON plain(payload);
VACUUM ANALYZE;
VACUUM ANALYZE parted;</code></pre><p>In this example, we executed <code>VACUUM ANALYZE</code> twice because, by default, statistics on the partitioned table itself cannot be built: they are built on each partition separately. To gather statistics that combine data from all partitions, we must explicitly execute <code>ANALYZE</code> with the name of that table as a parameter. Now, let's see how selection from the partitioned and the regular table works with the same data:</p><pre><code>EXPLAIN (COSTS OFF)
SELECT * FROM plain p1 JOIN plain p2 USING (payload) LIMIT 100;
EXPLAIN (COSTS OFF)
SELECT * FROM parted p1 JOIN parted p2 USING (payload) LIMIT 100;

/*
 Limit
   -&gt;  Nested Loop
         -&gt;  Seq Scan on plain p1
         -&gt;  Memoize
               Cache Key: p1.payload
               Cache Mode: logical
               -&gt;  Index Scan using plain_payload_idx on plain p2
                     Index Cond: (payload = p1.payload)

 Limit
   -&gt;  Merge Join
         Merge Cond: (p1.payload = p2.payload)
         -&gt;  Merge Append
               Sort Key: p1.payload
               -&gt;  Index Scan using parted_p1_payload_idx
               -&gt;  Index Scan using parted_p2_payload_idx
         -&gt;  Materialize
               -&gt;  Merge Append
                     Sort Key: p2.payload
                     -&gt;  Index Scan using parted_p1_payload_idx
                     -&gt;  Index Scan using parted_p2_payload_idx
*/</code></pre><p>The query plans seem optimal: depending on the limit, only the minimum number of rows will be selected since, with a helpful index on the join attribute, we have already ordered access to the table rows. Now let's prompt the optimiser to build a complex subtree under the append by enabling <em>partitionwise join</em>:</p><pre><code>SET enable_partitionwise_join = 'true';
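-- Added note (a sketch, not part of the original script):
-- enable_partitionwise_join is off by default, and SET changes it for the
-- current session only. SHOW confirms the value the planner will see.
SHOW enable_partitionwise_join;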
EXPLAIN (COSTS OFF)
SELECT * FROM parted p1 JOIN parted p2 USING (payload) LIMIT 100;
/*
 Limit
   -&gt;  Append
         -&gt;  Nested Loop
               Join Filter: (p1_1.payload = p2_1.payload)
               -&gt;  Seq Scan on parted_p1 p1_1
               -&gt;  Materialize
                     -&gt;  Seq Scan on parted_p1 p2_1
         -&gt;  Nested Loop
               Join Filter: (p1_2.payload = p2_2.payload)
               -&gt;  Seq Scan on parted_p2 p1_2
               -&gt;  Materialize
                     -&gt;  Seq Scan on parted_p2 p2_2
*/</code></pre><p>Although nothing has changed in the data, a poor plan has been selected. The reason for this degradation is that, when planning an <code>Append</code>, the optimiser chooses the cheapest subpath according to the <code>startup_cost</code> criterion. That subpath is the one containing <code>NestLoop + SeqScan</code>: in terms of launch speed, when the cost of actually scanning the tables is left out, such a plan slightly wins even over the obvious <code>NestLoop + IndexScan</code>. This is how the current Postgres works, including the dev branch.</p><p>However, this problem can be fixed quite simply by adding the appropriate logic to the optimiser code. Together with Nikita Malakhov and Alexander Pyhalov, I have prepared a patch that fixes this problem; it can be found in the <a href="https://commitfest.postgresql.org/51/5361/">current commitfest</a>. In the <a href="https://www.postgresql.org/message-id/flat/CAN-LCVPxnWB39CUBTgOQ9O7Dd8DrA_tpT1EY3LNVnUuvAX1NjA%40mail.gmail.comv">thread</a> with its discussion, you can find another interesting <a href="https://www.postgresql.org/message-id/87frouqlgn.fsf%40163.com">remark</a> about revising the <code>startup_cost</code> computation logic of the sequential scan operator, the implementation of which could also alleviate the choice of non-optimal fractional paths for the case with <code>LIMIT 1</code>. With this patch applied, we get an acceptable query plan:</p><pre><code> Limit
   -&gt;  Append
         -&gt;  Nested Loop
               -&gt;  Seq Scan on parted_p1 p1_1
               -&gt;  Memoize
                     Cache Key: p1_1.payload
                     Cache Mode: logical
                     -&gt;  Index Scan using parted_p1_payload_idx
                           Index Cond: (payload = p1_1.payload)
         -&gt;  Nested Loop
               -&gt;  Seq Scan on parted_p2 p1_2
               -&gt;  Memoize
                     Cache Key: p1_2.payload
                     Cache Mode: logical
                     -&gt;  Index Scan using parted_p2_payload_idx
                           Index Cond: (payload = p1_2.payload)</code></pre><p>Now, let's look at the next problem, which does not have a simple solution yet.</p><h1>Calculated limit</h1><p>Consider the following query:</p><pre><code>EXPLAIN (COSTS OFF)
SELECT * FROM parted p1 JOIN parted p2 USING (payload,y)
ORDER BY payload,y LIMIT 100;</code></pre><p>Executing it with the patch provided above gives you an optimal plan: it uses <code>NestLoop</code> with a parameterised index scan that will touch only the minimum number of table rows needed to produce the result. However, by simply reducing the limit, we get the original bleak picture:</p><pre><code>EXPLAIN (COSTS OFF)
SELECT * FROM parted p1 JOIN parted p2 USING (payload,y)
ORDER BY payload,y LIMIT 1;

/*
Limit
   -&gt;  Merge Append
         Sort Key: p1.payload, p1.y
         -&gt;  Merge Join
               Merge Cond: ((p1_1.payload = p2_1.payload) AND
                            (p1_1.y = p2_1.y))
               -&gt;  Sort
                     Sort Key: p1_1.payload, p1_1.y
                     -&gt;  Seq Scan on parted_p1 p1_1
               -&gt;  Sort
                     Sort Key: p2_1.payload, p2_1.y
                     -&gt;  Seq Scan on parted_p1 p2_1
         -&gt;  Merge Join
               Merge Cond: ((p1_2.payload = p2_2.payload) AND
                            (p1_2.y = p2_2.y))
               -&gt;  Sort
                     Sort Key: p1_2.payload, p1_2.y
                     -&gt;  Seq Scan on parted_p2 p1_2
               -&gt;  Sort
                     Sort Key: p2_2.payload, p2_2.y
                     -&gt;  Seq Scan on parted_p2 p2_2
*/</code></pre><p>A <code>SeqScan</code> operator again reads all rows from the tables, and the query becomes tens of times slower, although we only reduced the <code>LIMIT</code>! At the same time, by disabling <code>SeqScan</code>, you can see a fast plan and incremental sorting again.</p><p>The fundamental problem is that the optimiser only knows the final limit on the number of rows in the query/subquery. In this case, at the <code>Append</code> planning stage, the optimiser cannot estimate how many tuples the upper <code>Incremental Sort</code> could request. As a result, only one row or all rows from each partition may be needed, depending on the data distribution in the '<code>y</code>' column.</p><p>Even if we theoretically imagine that we have taught <code>IncrementalSort</code> to calculate the number of groups by the '<code>payload</code>' column and, based on this, estimate the maximum required number of rows in each partition, we could not improve the plan estimation: the planning of the <code>Append</code> operator has already been completed, and the possible options for its execution have already been fixed. After all, we are planning the query from the bottom up!</p><p><strong>To sum it up</strong>. Partitioned tables do make the task much more difficult for the current version of Postgres, limiting the search space for optimal query plans. Switching to partitions should be thoroughly tested, focusing on cases where some limited selection of tables' tuples is required and there is no noticeable pruning of partitions at the planning stage. Since this area is actively developing, we can expect improvements soon (especially if users report emerging issues more actively). Still, there are cases where the solution within the existing architecture is not apparent and requires additional R&amp;D.</p><p>Do you agree with my conclusions, or did I just write nonsense? 
Please leave your opinion in the comments.</p><div class="poll-embed" data-attrs="{&quot;id&quot;:246139}" data-component-name="PollToDOM"></div><p></p><p>THE END.</p><p><em>December 9, 2024, Pattaya, Thailand.</em></p>]]></content:encoded></item><item><title><![CDATA[Could GROUP-BY clause reordering improve performance?]]></title><description><![CDATA[Utilising statistics to optimise GROUP-BY]]></description><link>https://danolivo.substack.com/p/could-group-by-clause-reordering</link><guid isPermaLink="false">https://danolivo.substack.com/p/could-group-by-clause-reordering</guid><dc:creator><![CDATA[Andrei Lepikhov]]></dc:creator><pubDate>Mon, 25 Nov 2024 22:00:58 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/091bf6fd-201f-417d-9dfa-9766fd602e42_420x300.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>PostgreSQL users often employ analytical queries that sort and group data by different rules. Optimising these operators can significantly reduce the time and cost of query execution. In this post, I will discuss one such optimisation: choosing the order of columns in the GROUP BY expression.</p><p>Postgres can already reshuffle the list of grouped expressions according to the ORDER BY condition to eliminate additional sorting and save computing resources. We went further and implemented an additional strategy of group-by-clause list permutation in a series of patches (the <a href="https://www.postgresql.org/message-id/flat/7c79e6a5-8597-74e8-0671-1c39d124c9d6%40sigaev.ru">first attempt</a> and the <a href="https://www.postgresql.org/message-id/flat/8742aaa8-9519-4a1f-91bd-364aec65f5cf%40gmail.com">second one</a>) for discussion with the Postgres community, expecting it to be included in the next version of PostgreSQL core. 
You can also try it in action in the commercial <a href="https://postgrespro.com/products/postgrespro/enterprise">Postgres Pro Enterprise</a> fork.</p><p><strong>A short introduction to the issue</strong></p><p>To group table data by one or more columns, DBMSes usually use hashing methods (<code>HashAgg</code>) or preliminary sorting of rows (<code>tuples</code>) with subsequent traversal of the sorted set (<code>SortAgg</code>). When sorting incoming tuples by multiple columns, Postgres may have to call the comparison operator not once per pair of tuples but once for each pair of column values. For example, to compare a table row <code>('UserX1', 'Saturday', $100)</code> with a row <code>('UserX1', 'Monday', $10)</code> and determine their relative order, we must compare the first pair of values and, if they match, move on to the next pair. If the second pair of values (in our example, 'Saturday' and 'Monday') differs, then there is no point in calling the comparison operator for the third element.</p><p>This is the principle on which the proposed <code>SortAgg</code> operator optimisation mechanism is based. If, when comparing rows, we compare the columns with fewer duplicates first (for example, first compare <code>UserID</code> numbers and then days of the week), then we will have to call the comparison operator much less often.</p><p><strong>Time for a demo case</strong></p><p>How much can minimising the number of comparisons speed up a Sort operation? Let's look at the examples. In the first example, we sort the table by the same fields but in different orders:</p><pre><code>CREATE TABLE shopping (
  CustomerId bigint, CategoryId bigint, WeekDay text, Total money
);
INSERT INTO shopping (CustomerId, CategoryId, WeekDay, Total)
  SELECT random()*1E6, random()*100, 'Day ' || (random()*7)::integer,
    random()*1000::money
  FROM generate_series(1,1E6) AS gs;
VACUUM ANALYZE shopping;
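-- Optional check, added here as an illustration (not part of the original
-- test script): the per-column distinct-value estimates collected by ANALYZE
-- can be inspected through the standard pg_stats view. A negative n_distinct
-- is a fraction of the row count rather than an absolute number.
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'shopping';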

SET max_parallel_workers_per_gather = 0;
SET work_mem = '256MB';

EXPLAIN (ANALYZE, TIMING OFF)
SELECT CustomerId, CategoryId, WeekDay, Total
FROM shopping
ORDER BY WeekDay,Total,CategoryId,CustomerId;

EXPLAIN (ANALYZE, TIMING OFF)
SELECT CustomerId, CategoryId, WeekDay, Total
FROM shopping
ORDER BY CustomerId,CategoryId,WeekDay,Total;</code></pre><p>The results of executing these queries will be as follows:</p><pre><code> Sort  (cost=117010.84..119510.84 rows=1000000 width=30)
       (actual rows=1000000 loops=1)
   Sort Key: weekday, total, categoryid, customerid
   Sort Method: quicksort  Memory: 71452kB
   -&gt;  Seq Scan on shopping  (actual rows=1000000 loops=1)
 Execution Time: 2858.596 ms

 Sort  (cost=117010.84..119510.84 rows=1000000 width=30)
       (actual rows=1000000 loops=1)
   Sort Key: customerid, categoryid, weekday, total
   Sort Method: quicksort  Memory: 71452kB
   -&gt;  Seq Scan on shopping  (actual rows=1000000 loops=1)
 Execution Time: 505.775 ms</code></pre><p>The second query is executed almost six times faster than the first, although the processed data is identical. This is because the comparison operator was called less often in the second case. The sorted tuple has 4 columns (<code>CustomerId, CategoryId, WeekDay, Total</code>), and Postgres calls the comparison operator separately for each pair of values - a maximum of 4 times. But if the first column in the comparison is <code>CustomerId</code>, then the need to call the comparison operator for the next column arises much less often than when the <code>WeekDay</code> column is the first.</p><p>This example shows that the computational costs of the sorting operation may be pretty significant. Even with the &#8220;Abbreviated keys&#8221; optimisation at hand, we are still not guaranteed stable execution time in the sort operation. I wonder if some newly proposed optimisations [<a href="https://www.postgresql.org/message-id/flat/CO6PR11MB5620E3878444C023A7C8CA9C95222%40CO6PR11MB5620.namprd11.prod.outlook.com">1</a>, <a href="https://www.postgresql.org/message-id/flat/PH7P220MB1533DA211DF219996760CBB7D9EB2@PH7P220MB1533.NAMP220.PROD.OUTLOOK.COM">2</a>] could significantly narrow the performance gap. Considering that an analytical query may contain several sorts (each aggregate may define its own order of incoming data), choosing a cheaper column order can save noticeable computing resources.</p><p>Note that the values of the <code>cost</code> field of the Sort operator in the EXPLAIN output are the same for both queries of the first example. This means that, for the Postgres optimiser, both sorting options are identical.</p><p>Since the sort order for GROUP BY or Merge Join does not affect the final result, it can be chosen to minimise the number of comparison operations. 
In addition, if the table has many indexes, the data can be scanned and sorted in different ways, and the correct choice of the incremental sort option (IncrementalSort) may provide a positive effect.</p><p>Imagine a second example. Let's say you want to group your data to calculate the average spend for each customer in a given product category based on the day of the week:</p><pre><code>SET enable_hashagg = 'off';
EXPLAIN (ANALYZE, TIMING OFF)
SELECT CustomerId, CategoryId, WeekDay, avg(Total::numeric)
FROM shopping
GROUP BY WeekDay,CategoryId,CustomerId;

/*
GroupAggregate (actual rows=999370 loops=1)
   Group Key: weekday, categoryid, customerid
   -&gt;  Sort (actual rows=1000000 loops=1)
         Sort Key: weekday, categoryid, customerid
         Sort Method: quicksort  Memory: 71452kB
         -&gt;  Seq Scan on shopping (actual rows=1000000 loops=1)
  Execution Time: 2742.777 ms
 */</code></pre><p>To demonstrate the concept explicitly, I have disabled hash aggregation. From a query perspective, the order of the columns in the GROUP BY clause is entirely unimportant. Let's change the order and see the result:</p><pre><code>EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF)
SELECT CustomerId, CategoryId, WeekDay, avg(Total::numeric)
FROM shopping
GROUP BY CustomerId,CategoryId,WeekDay;

/*
 GroupAggregate (actual rows=999370 loops=1)
   Group Key: customerid, categoryid, weekday
   -&gt;  Sort (actual rows=1000000 loops=1)
         Sort Key: customerid, categoryid, weekday
         Sort Method: quicksort  Memory: 71452kB
         -&gt;  Seq Scan on shopping (actual rows=1000000 loops=1)
  Execution Time: 1840.517 ms
 */</code></pre><p>The speedup is less impressive than in the first example but pretty noticeable overall. What is important is that this transformation is free: we do not need a new index or a complex query tree change, such as a <a href="https://www.postgresql.org/message-id/flat/CAKU4AWoZksNZ4VR-fLTdwmiR91WU8qViDBNQKNwY%3D7iyo%2BuV0w%40mail.gmail.com">subquery pull-up</a>. Such a change can be done automatically; the main thing is to teach the Postgres optimiser to distinguish the costs of different combinations of grouping clauses and to consider an additional grouping strategy.</p><p><strong>State of the art</strong></p><p>In 2023, Postgres learned to exclude redundant columns from a grouping operation. Redundancy can occur, for example, when there is an equality expression in the query tree:</p><pre><code>SELECT sum(total) FROM shopping
WHERE CustomerId=CategoryId AND WeekDay='Monday'
GROUP BY CustomerId,CategoryId, WeekDay;

/*
 GroupAggregate
   Group Key: customerid
   -&gt;  Sort
         Sort Key: customerid
         -&gt;  Seq Scan on shopping
               Filter: ((customerid = categoryid) AND
                       (weekday = 'Monday'::text))
 */</code></pre><p>In the example above, the values in the <code>CustomerId</code> and <code>CategoryId</code> columns belong to the same equivalence class (<code>EquivalenceClass</code> structure in Postgres code), and either column can be excluded from the grouping expression. At the same time, the clause "<code>weekday = 'Monday'</code>" makes explicit grouping by <code>WeekDay</code> unnecessary.</p><p>PostgreSQL 17 introduced another strategy: the optimiser can now adjust the order of the grouped columns according to the sort order of the input data. Thus, during planning, Postgres may consider two alternative strategies:</p><ol><li><p>Group the already sorted data, and then re-sort by ORDER BY requirements.</p></li><li><p>Sort the incoming data by the rules specified by ORDER BY, then perform the grouping.</p></li></ol><p>To demonstrate both options, let's add an index to our table and compare the results of the two queries:</p><pre><code>CREATE INDEX ON shopping(CustomerId, weekday);
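-- Side note, added for illustration (not part of the original script):
-- in PostgreSQL 17 this reordering is controlled by the
-- enable_group_by_reordering setting, which is on by default.
SHOW enable_group_by_reordering;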

EXPLAIN (COSTS OFF)
SELECT count(*) FROM shopping WHERE CustomerId &lt; 5000
GROUP BY WeekDay,CustomerId ORDER BY WeekDay,CustomerId;

EXPLAIN (COSTS OFF)
SELECT count(*) FROM shopping WHERE CustomerId &lt; 50000
GROUP BY WeekDay,CustomerId ORDER BY WeekDay,CustomerId;

/*
 GroupAggregate
   Group Key: weekday, customerid
   -&gt;  Sort
         Sort Key: weekday, customerid
         -&gt;  Index Only Scan using
             shopping_customerid_weekday_idx on shopping
               Index Cond: (customerid &lt; 5000)

Sort
   Sort Key: weekday, customerid
   -&gt;  GroupAggregate
         Group Key: customerid, weekday
         -&gt;  Index Only Scan using
             shopping_customerid_weekday_idx on shopping
               Index Cond: (customerid &lt; 50000)
 */</code></pre><p>In the first case, there is little data to be grouped, and it is cheaper to sort the tuples in advance according to the requirements of the ORDER BY operator. In the second case, sorting after grouping is justified: the index scan operator will return the rows in sorted form, and grouping will significantly reduce the number of such rows, which makes subsequent sorting cheaper. Isn't it true that the additional Postgres strategy allows you to find exciting variants of query plans? The downside is that it does not use column statistics, which could have helped to optimise example No. 2.</p><p><strong>How to employ statistics?</strong></p><p>The proposed GROUP-BY columns reordering strategy is based on the standard Postgres columnar statistics stored in the pg_statistic table. It is a cost-based strategy, and it supplies the optimiser with an alternative path for the Sort operator that minimises the number of comparison operations during sorting. To clarify the basic idea, consider the query with grouping from the example above:</p><pre><code>SELECT avg(Total::numeric) FROM shopping
GROUP BY CustomerId,CategoryId,WeekDay;</code></pre><p>The case where <code>CustomerId</code> is in the first position of the sort key is more efficient because this column contains the largest number of distinct values (approximately half a million). That means that, for each tuple, there are on average only about two other tuples for which comparing the <code>CustomerId</code> column alone will not determine the order, so that the values from subsequent columns have to be compared. The <code>WeekDay</code> column has no more than eight distinct values. If Postgres sorted by this column first, then the values of subsequent columns would have to be compared with a much higher probability.</p><p><strong>Dive into the code</strong></p><p>Since the code is very voluminous, we split it into four patches.</p><p><strong>The <a href="https://www.postgresql.org/message-id/flat/ba0edc53-4b1f-4c67-92d1-29aeddb36a18%40gmail.com">first patch</a></strong> teaches the optimiser to consider <code>EquivalenceClass</code> members when estimating the number of groups in the <code>estimate_num_groups()</code> routine. What does it mean? Look at the queries:</p><pre><code>EXPLAIN SELECT CustomerId,CategoryId FROM shopping
WHERE CustomerId = CategoryId GROUP BY CustomerId,CategoryId;

EXPLAIN SELECT CustomerId,CategoryId FROM shopping
WHERE CustomerId = CategoryId GROUP BY CategoryId,CustomerId;</code></pre><p>These queries are semantically identical: we just rearranged the columns in the grouping list. The equivalence expression levels out the difference in distinct values between <code>CategoryId</code> and <code>CustomerId</code>: after applying the filter, both columns will contain exactly the same values. But if you EXPLAIN these queries, you will see different estimations and, as a result, different query plans:</p><pre><code>HashAggregate  (cost=14073.83..14123.71 rows=4988 width=16)

--and:

Group  (cost=13676.18..13715.13 rows=101 width=16)</code></pre><p>So, the first patch adds code to <code>estimate_num_groups</code> that passes through the equivalence class and looks up the <code>ndistinct</code> estimations of its members. The minimum number of distinct values <a href="https://www.postgresql.org/message-id/CAApHDvp7%2B0_XYVz%2B%2BAvodGcX9CSd%2BbiQ7wvcrvJtTvHdXS_JgQ%40mail.gmail.com">should be the most correct answer</a>. Also, it introduces caching of the distinct estimations inside an <code>EquivalenceMember</code>.</p><p><strong>The second patch</strong> concerns the formula for calculating the cost of sorting. In the current version of Postgres, sorting is estimated using the formula:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;cost = C \\cdot N \\cdot log_2(N),&quot;,&quot;id&quot;:&quot;OPACXRZLBQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>where:</p><p><code>N</code> - number of tuples to sort,<br><code>C = 2.0*cpu_operator_cost</code> - a factor based on the user-defined <code>cpu_operator_cost</code> parameter.</p><p>This patch introduces into the Sort estimation formula the number of columns involved:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;cost = \\left( 1.0+ncols \\right) \\cdot N \\cdot log_2\\left(N\\right)&quot;,&quot;id&quot;:&quot;LLXWAPJVRJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>The approach seems straightforward and relatively crude. It is designed to be <em>intermediate</em> - to discover how many places employ sort estimation formulas and how many areas will be impacted.</p><p>Looking at the regression test changes, you may notice that this change affects the balance among <code>Sort</code>, <code>IncrementalSort</code>, <code>MergeAppend</code>, <code>GatherMerge</code>, and <code>HashAgg</code> nodes. With this formula, the optimiser favours using <code>HashAgg</code> grouping in more situations than before. 
HashAgg already takes into account the number of columns in the aggregated tuple. At the same time, aggregation with preliminary sorting has been estimated too optimistically in the case of a long list of sorted values. Thus, this patch increases the optimiser's bias towards hashing in grouping operations, especially on small data volumes.</p><p>But why is it such a trivial formula, you might ask me? Is it OK to suppose all the values are duplicates? It looks pretty strange, but in my experience, the problem with grouping orders is usually raised when a query processes massive numbers of tuples filled with text values (or numerics) that consist largely of duplicates. One more excuse for me is that we immediately introduced an improvement of this formula in the next patch. But even with such a simple formula, Postgres is ready to distinguish various sortings. </p><p><strong>The third patch</strong> reconsiders the formula introduced by the second patch. Here, the distinct statistics cache, added by the first patch, is employed to estimate the number of distinct values in the first sorted column, and the formula becomes:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;cost = \\left[ 1.0 + 1.0 + \\left( \\frac{N-ndistinct}{N-1} \\right) \\cdot\n\\left( ncols - 1 \\right) \\right] \\cdot N \\cdot \\log_2\\left(N\\right).&quot;,&quot;id&quot;:&quot;JKKTKRCEDM&quot;}" data-component-name="LatexBlockToDOM"></div><p>When every value in the first column is a duplicate (<code>ndistinct = 1</code>), this expression reduces to the formula of the second patch; when all values are distinct (<code>ndistinct = N</code>), the cost falls back to the single-column case.</p><p>This approach can be extended when reliable statistics on the joint distribution of columns (EXTENDED STATISTICS) exist. Still, at the moment, we limit ourselves to the first column estimation only because it is sufficient in most cases. 
With this formula, the optimiser can distinguish the costs of different sorting combinations of columns, which allows us to choose the optimal sorting operator.</p><p><strong>The fourth patch</strong> adds code to the optimiser that permutes grouped columns to place the column with the maximum ndistinct value in the first position. This GROUP-BY order is offered to the optimiser as one more candidate alongside the two alternatives discussed above. The optimiser will choose the best one based on their costs and the sorting requested by the upper query operator.</p><p><strong>What have we gained?</strong></p><p>Look at how this change will affect the queries in our examples 1 and 2. Let's start with sorting:</p><pre><code>EXPLAIN (ANALYZE, TIMING ON)
SELECT CustomerId, CategoryId, WeekDay, Total
FROM shopping
ORDER BY CustomerId,CategoryId,WeekDay,Total;

EXPLAIN (ANALYZE, TIMING ON)
SELECT CustomerId, CategoryId, WeekDay, Total
FROM shopping
ORDER BY CategoryId,CustomerId,WeekDay,Total;

/*
Sort  (cost=191291.64..193791.64) (actual time=350.819..395.024)
   Sort Key: customerid, categoryid, weekday, total
   -&gt;  Seq Scan on shopping  (cost=0.00..17353.00) (actual time=0.031..60.262)
 Execution Time: 423.583 ms

Sort  (cost=266482.66..268982.66)
       (actual time=653.143..694.736)
   Sort Key: categoryid, customerid, weekday, total
   -&gt;  Seq Scan on shopping  (cost=0.00..17353.00) (actual time=0.012..55.073)
 Execution Time: 723.005 ms
 */</code></pre><p>There are two notable improvements: The overall query cost has changed, and the sorting and scanning cost ratio has become more accurate and reflects reality. The difference in plan cost reflects the difference in query execution time. And now the result of query execution with grouping:</p><pre><code>SET enable_hashagg = 'off';
EXPLAIN (COSTS OFF)
SELECT CustomerId, CategoryId, WeekDay, avg(Total::numeric)
FROM shopping
GROUP BY WeekDay,CategoryId,CustomerId;

/*
 GroupAggregate
   Group Key: customerid, weekday, categoryid
   -&gt;  Sort
         Sort Key: customerid, weekday, categoryid
         -&gt;  Seq Scan on shopping
 */</code></pre><p>The optimiser changed the order of the columns and moved the <code>CustomerId</code> column to the beginning of the grouping list. Given the actual distribution of values in the other columns, it would also be possible to rearrange the <code>CategoryId</code> and <code>WeekDay</code> columns. However, such fine-tuning has little practical meaning and can be done with sufficient reliability only if there are extended statistics for all three fields. Of course, the proposed solution is not ideal: the mathematical model can be made more detailed and more practical (the case when all columns contain only duplicates is rare). We also did not consider the relative cost of the comparison operator itself: comparing text types will require more resources than integer types, right? However, the current version already fulfils the main task - to create an additional grouping strategy that is qualitatively different from those already available in the Postgres optimiser.</p><p><em>If you have any comments or opinions on this subject, please leave them in the comments below or in the <a href="https://www.postgresql.org/message-id/flat/8742aaa8-9519-4a1f-91bd-364aec65f5cf%40gmail.com">thread</a> on the Postgres community mailing list.</em></p><p>THE END.</p><p><em>November 25th, 2024. 
Pattaya, Thailand.</em></p>]]></content:encoded></item><item><title><![CDATA[PostgreSQL 'VALUES -> ANY' transformation]]></title><description><![CDATA[Should a DBMS mend query structure ?]]></description><link>https://danolivo.substack.com/p/postgresql-values-any-transformation</link><guid isPermaLink="false">https://danolivo.substack.com/p/postgresql-values-any-transformation</guid><dc:creator><![CDATA[Andrei Lepikhov]]></dc:creator><pubDate>Thu, 03 Oct 2024 23:58:58 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/46363121-390d-472c-9af9-f5157472d3e9_1134x578.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Introduction</h2><p><em>As usual, this project was prompted by multiple user reports with typical complaints, like 'SQL server executes the query times faster' or 'Postgres doesn't pick up my index'. The underlying issue that united these reports was frequently used VALUES sequences, typically transformed in the query tree into a </em><code>SEMI JOIN</code><em>.</em></p><p>I also want to raise one general question: Should an open-source DBMS correct user errors? I mean optimising a query even before the search for an optimal plan begins, eliminating self-joins, subqueries, and simplifying expressions - everything that can be achieved by proper query tuning. The question is not that simple since DBAs point out that the cost of query planning in Oracle grows steeply with the complexity of the query text, which is most likely caused, among other things, by the extensive range of optimisation rules.</p><p>Now, let's turn our attention to the <code>VALUES</code> construct. 
Interestingly, it's not just used with the <code>INSERT</code> command but also frequently appears in <code>SELECT</code> queries in the form of a test of inclusion in a set:</p><pre><code>SELECT * FROM something WHERE x IN (VALUES (1), (2), ...);</code></pre><p>In the query plan, this syntactic construct is transformed into a SEMI JOIN. To demonstrate the essence of the problem, let's generate a test table with an uneven distribution of data in one of the columns:</p><pre><code>CREATE EXTENSION tablefunc;
CREATE TABLE norm_test AS
  SELECT abs(r::integer) AS x, 'abc'||r AS payload
  FROM normal_rand(1000, 1., 10.) AS r;
CREATE INDEX ON norm_test (x);
ANALYZE norm_test;</code></pre><p>Here, the column <code>x</code> of the <code>norm_test</code> table is generated from a normal distribution with a mean of 1 and a standard deviation of 10 [1]. There are not too many distinct values, so all of them will be included in the MCV statistics. As a result, it will be possible to calculate the number of duplicates accurately for each value despite the uneven distribution. Also, we naturally created an index on this column to make scanning the table easier. Now, let's execute the query:</p><pre><code>EXPLAIN ANALYZE
SELECT * FROM norm_test WHERE x IN (VALUES (1), (29));</code></pre><p>Uncomplicated query, right? It is rational to execute it with two iterations of index scanning. However, in Postgres, we have:</p><pre><code>  Hash Semi Join  (cost=0.05..21.36 rows=62) (actual rows=85)
&nbsp; &nbsp;Hash Cond: (norm_test.x&nbsp;= "*VALUES*".column1)
&nbsp; &nbsp;-&gt; Seq Scan&nbsp;on&nbsp;norm_test (rows=1000) (actual rows=1000)
&nbsp; &nbsp;-&gt; Hash (cost=0.03..0.03 rows=2) (actual rows=2)
         -&gt;  Values Scan on "*VALUES*" (rows=2) (actual rows=2)</code></pre><p>Here and onwards, I slightly simplify the EXPLAIN output for clarity.</p><p>Hmm, a sequential scan of all the table's tuples when two index scans were enough for us? Let's disable <code>HashJoin</code> and see what happens:</p><pre><code>SET enable_hashjoin = 'off';
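-- Presumably the same query as above is re-run at this point; it is repeated
-- here for convenience (an addition for illustration, not in the original):
EXPLAIN ANALYZE
SELECT * FROM norm_test WHERE x IN (VALUES (1), (29));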

Nested Loop (cost=4.43..25.25 rows=62) (actual rows=85)
&nbsp; &nbsp;-&gt; Unique (rows=2 width=4) (actual rows=2)
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;-&gt; Sort (rows=2) (actual rows=2)
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Sort Key: "*VALUES*".column1
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;-&gt;&nbsp;&nbsp;Values&nbsp;Scan&nbsp;on&nbsp;"*VALUES*" (rows=2) (actual rows=2)
&nbsp; &nbsp;-&gt; Bitmap Heap Scan&nbsp;on&nbsp;norm_test (rows=31) (actual rows=42)
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Recheck Cond: (x = "*VALUES*".column1)
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;-&gt; Bitmap Index Scan&nbsp;on&nbsp;norm_test_x_idx
            (rows=31) (actual rows=42)
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Index Cond: (x = "*VALUES*".column1)</code></pre><p>Now you can see that Postgres has squeezed out the maximum: in one pass through the <code>VALUES</code> set for each outer value, it performs an index scan on the table. It's much more interesting than the previous option. However, it is not as simple as just a regular index scan. In addition, if you look at the query explanation more closely, you can see that the optimiser makes a mistake in predicting the cardinality of the join and index scan. And what happens if you rewrite the query without <code>VALUES</code>:</p><pre><code>EXPLAIN (ANALYSE, TIMING OFF)
SELECT * FROM norm_test WHERE x IN (1, 29);

/*
Bitmap Heap Scan on norm_test (cost=4.81..13.87 rows=85) (actual rows=85)
&nbsp; &nbsp;Recheck Cond: (x = ANY ('{1,29}'::integer[]))
&nbsp; &nbsp;Heap Blocks: exact=8
&nbsp; &nbsp;-&gt; Bitmap Index Scan on norm_test_x_idx (rows=85) (actual rows=85)
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Index Cond: (x = ANY ('{1,29}'::integer[]))
*/</code></pre><p>As you can see, we got a query plan containing only an index scan that is almost twice as cheap. At the same time, by estimating each value from the set and having both of these values &#8203;&#8203;in the MCV statistics, Postgres accurately predicts the cardinality of this scan.</p><p>So, being not a big problem in itself (you can always use <code>HashJoin</code> and hash the inner's <code>VALUES</code>), using <code>VALUES</code> sequences is a source of dangers:</p><ul><li><p>The optimiser can choose <code>NestLoop</code>, which can reduce performance with a vast VALUES list.</p></li><li><p>All of a sudden, <code>SeqScan</code> can be chosen instead of <code>IndexScan</code>.</p></li><li><p>The optimiser makes significant estimation errors when predicting the cardinality of a JOIN operation and its underlying operations.</p></li></ul><p>By the way, why would anyone need to use such expressions at all?</p><p>I guess this is a particular case when the automation system - ORM or Rest API tests the inclusion of an object into a specific set of objects. Since <code>VALUES</code> describes a relational table, and the value of such a list is a table row, we are most likely dealing with cases where each row represents an instance of an object in the application. Our case is a corner case when the object is characterised by only one property. If my guess is wrong, please correct me in the comments - maybe someone knows other reasons?</p><p>So, passing the '<code>x IN VALUES</code>' construct into the optimiser is risky. Why not fix the situation by converting this <code>VALUES</code> construct to an array? Then, we will have a construct like '<code>x = ANY [...]</code>', a special case of the <code>ScalarArrayOpExpr</code> operation in the Postgres code. It will simplify the query tree, eliminating the appearance of an unnecessary join. Also, the Postgres cardinality evaluation mechanism can work with the array inclusion check operation. 
If the array is small enough (&lt; 100 elements), it will perform a statistical evaluation element by element. In addition, Postgres can optimise array search by hashing the values (if the memory required for that fits the work_mem value) - and everyone will be happy, right?</p><p>Well, we decided to try to do this in our optimisation lab - and surprisingly, it turned out to be relatively trivial. The first peculiarity we encountered is that the conversion is only possible for operations on scalar values: so far, it is generally impossible to convert an expression of the form '<code>(x,y) IN (VALUES (1,1), (2,2), ...)</code>' so that the result exactly matches the state before the conversion. Why? It is not easy to explain. The reason lies in the design of the comparison operator for the record type: to teach Postgres to work with such an operator in exactly the same way as with scalar types, the type cache needs to be significantly redesigned. Secondly, you must remember to check this subquery (yes, <code>VALUES</code> is represented in the query tree as a subquery) for the presence of volatile functions. And that's it: one pass of the query tree mutator, doing a transformation quite similar to [2], replaces <code>VALUES</code> with an array, constifying it if possible. Curiously, the conversion is possible even if <code>VALUES</code> contains parameters, function calls, and complex expressions, like the one below:</p><pre><code>CREATE TEMP TABLE onek (ten int, two real, four real);
PREPARE test (int,numeric, text) AS
  SELECT ten FROM onek
  WHERE sin(two)*four/($3::real) IN (VALUES (sin($2)), (2), ($1));
EXPLAIN (COSTS OFF) EXECUTE test(1, 2, '3');
/*
Seq Scan on&nbsp;onek
&nbsp; &nbsp;Filter: (((sin((two)::double precision) * four) / '3'::real) = ANY ('{0.9092974268256817,2,1}'::double precision[]))
(2 rows)
*/</code></pre><p>The feature is currently being tested. The query tree structure is pretty stable, so the code depends very little on the core version and needs no changes between releases; it can be used with Postgres down to version 10 and maybe even earlier. As usual, you can play with the library&#8217;s <a href="https://github.com/danolivo/conf/blob/main/VALUES-to-ANY/pgpro_planner.so">binaries</a>, compiled in a typical Ubuntu 22 environment - it doesn&#8217;t have any UI and may be loaded statically or dynamically.</p><p>And now, the actual holy war that I mentioned above. Since we did this as an external library, we had to intercept the planner hook (to simplify the query tree before optimisation), which cost us an additional pass through the query tree. Obviously, most queries in the system will not need this transformation, and this operation will simply add overhead. However, when it works, it can provide a noticeable effect (and from my observations, it does).</p><p>Until recently, there was a consensus in the PostgreSQL community [3, 4]: if the problem can be fixed by changing the query itself, then there is no point in complicating the core code since this will inevitably lead to increased maintenance costs and (remembering Oracle's experience) will affect the performance of the optimiser itself.</p><p>However, watching the core commits, I notice that the community's opinion seems to be drifting. For example, this year, they extended the subquery to <code>SEMI JOIN</code> transformation to cover correlated subqueries [5]. A little later, they allowed the parent query to receive information about the sort order of the subquery result [6], although previously, to simplify planning, the query and its subqueries were planned independently. It looks like a step towards re-planning subqueries, doesn't it?</p><p>And what do you think? 
Is an open-source project capable of supporting multiple transformation rules that would eliminate the redundancy and complexity that the user introduces, trying to make the query more readable and understandable? And most importantly - is it worth it?</p><p><strong>References</strong></p><ol><li><p><a href="https://www.postgresql.org/docs/current/tablefunc.html#TABLEFUNC-FUNCTIONS-NORMAL-RAND">F.41.&nbsp;tablefunc&nbsp;&#8212; functions that return tables</a></p></li><li><p><a href="https://www.postgresql.org/message-id/flat/567ED6CA.2040504%40sigaev.ru">OR-clause support for indexes</a></p></li><li><p><a href="https://www.postgresql.org/message-id/flat/CAMjNa7cC4X9YR-vAJS-jSYCajhRDvJQnN7m2sLH1wLh-_Z2bsw%40mail.gmail.com">Discussion on missing optimizations</a>, 2017</p></li><li><p><a href="https://www.postgresql.org/message-id/flat/18643-8d455145acd8243e%40postgresql.org">BUG #18643: EXPLAIN estimated rows mismatch</a>, 2024</p></li><li><p>Commit 9f13376. Pull up correlated subqueries</p></li><li><p>Commit a65724d. Propagate pathkeys from CTEs up to the outer query</p></li></ol><p>THE END.</p><p><em>October 2, 2024. 
Pattaya, Thailand.</em></p>]]></content:encoded></item><item><title><![CDATA[Postgres query re-optimisation in practice]]></title><description><![CDATA[on PostgreSQL built-in reoptimisation]]></description><link>https://danolivo.substack.com/p/postgres-query-re-optimisation-in</link><guid isPermaLink="false">https://danolivo.substack.com/p/postgres-query-re-optimisation-in</guid><dc:creator><![CDATA[Andrei Lepikhov]]></dc:creator><pubDate>Mon, 19 Aug 2024 01:01:58 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2ea689be-5e21-4942-b00b-e72be75857d0_64x64.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Today's story is about a re-optimisation feature I designed about a year ago for the Postgres Professional fork of PostgreSQL.</em></p><p><em>Curiously, after finishing the development and testing the solution on different benchmarks, I found out that Michael Stonebraker et al. had <a href="https://arxiv.org/pdf/1902.08291">already published</a> some research in that area. Moreover, they used the same benchmark, the <a href="https://github.com/gregrahn/join-order-benchmark">Join Order Benchmark</a>, to support their results. So, the priority is clearly theirs. In my defence, I would say that my code looks closer to real-life usage, and during the implementation, I ran into and solved many problems that weren&#8217;t mentioned in the paper. So, in my opinion, this post may still be helpful.</em></p><p><em>It is clear that re-optimisation belongs to the class of 'enterprise' features, which means it is not wanted in the community code. So, the code is not published, but you can play with it and repeat the benchmark using the published <a href="https://hub.docker.com/r/danolivo/reopt">docker container</a> for the REL_16_STABLE Postgres branch.</em></p><h2>Introduction</h2><p>What was the impetus to begin this work? 
It was caused by many real cases that may be demonstrated clearly by the Join Order Benchmark. How much performance do you think Postgres loses if you change its preference of employing parallel workers from one to zero? Two times regression? What about 10 or 100 times slower?</p><p>The black line in the graph below shows the change in execution time of each query between two cases: with parallel workers disabled and with a single parallel worker per gather allowed. For details, see the <a href="https://github.com/danolivo/utility/blob/main/job-noworker-issue/job_test">test script</a> and EXPLAINs, <a href="https://github.com/danolivo/utility/blob/main/job-noworker-issue/21a-explain-1w.txt">with</a> and <a href="https://github.com/danolivo/utility/blob/main/job-noworker-issue/21a-explain-0w.txt">without</a> parallel workers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!60AR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54af6ef5-5822-490f-a6e5-f91385b1fc68_600x371.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!60AR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54af6ef5-5822-490f-a6e5-f91385b1fc68_600x371.heic 424w, https://substackcdn.com/image/fetch/$s_!60AR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54af6ef5-5822-490f-a6e5-f91385b1fc68_600x371.heic 848w, https://substackcdn.com/image/fetch/$s_!60AR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54af6ef5-5822-490f-a6e5-f91385b1fc68_600x371.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!60AR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54af6ef5-5822-490f-a6e5-f91385b1fc68_600x371.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!60AR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54af6ef5-5822-490f-a6e5-f91385b1fc68_600x371.heic" width="600" height="371" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54af6ef5-5822-490f-a6e5-f91385b1fc68_600x371.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:371,&quot;width&quot;:600,&quot;resizeWidth&quot;:600,&quot;bytes&quot;:13596,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!60AR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54af6ef5-5822-490f-a6e5-f91385b1fc68_600x371.heic 424w, https://substackcdn.com/image/fetch/$s_!60AR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54af6ef5-5822-490f-a6e5-f91385b1fc68_600x371.heic 848w, https://substackcdn.com/image/fetch/$s_!60AR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54af6ef5-5822-490f-a6e5-f91385b1fc68_600x371.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!60AR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54af6ef5-5822-490f-a6e5-f91385b1fc68_600x371.heic 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>As you can see, the typical outcome is about a twofold speedup, which is logical when work is divided between two processes. But sometimes we see a tenfold speedup and even more, up to 500 times. 
Moreover, queries 14c, 22c, 22d, 25a, 25c, 31a, and 31c finish in a reasonable time only with at least one parallel worker!</p><p>If you are hard-bitten enough to replicate this experiment, you'll quickly realise that the main obstacle lies in cardinality underestimation combined with the NestLoop join. The optimiser's tendency to predict only a few tuples on the left and right sides of the join and opt for a trivial (non-parameterised) NestLoop leads to a rapid escalation in query execution time, often spiralling towards infinity when multiple NestLoops are involved in a single join tree.</p><p>With parallel workers enabled, NestLoop has an alternative, Parallel HashJoin, which is less expensive because of the parallel scan on each join side. Hence, the current case is no more than a game of chance, but it demonstrates our issue: sometimes query execution time goes to the moon, and we can't even get EXPLAIN ANALYSE data to find out what's gone wrong.</p><p>In real-world scenarios, users rarely have the pg_query_state extension installed in the production instance, and auto_explain requires the query execution to be completed. Also, disabling NestLoop or MergeJoin reduces the optimiser's ability to find good query plans with parameterised NestLoop, as I have shown in a previous <a href="https://danolivo.substack.com/p/looking-for-hidden-hurdles-when-postgres?r=34q1yy">post</a>. 
So, to find the origin of a specific issue, we need at least something in-core to take an execution state snapshot and, at best, a tool for dynamic replanning that fixes the optimiser's gaffes while staying transparent to the application.</p><p>With these considerations in mind, I began the development.</p><h2>How does it work?</h2><p>Skipping the lengthy sequence of false starts and unsuccessful code sketches, the architecture ended up as the schema shown below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!M1vk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1690e043-2db5-40e8-a5fd-7ae493016121_1280x720.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!M1vk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1690e043-2db5-40e8-a5fd-7ae493016121_1280x720.heic 424w, https://substackcdn.com/image/fetch/$s_!M1vk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1690e043-2db5-40e8-a5fd-7ae493016121_1280x720.heic 848w, https://substackcdn.com/image/fetch/$s_!M1vk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1690e043-2db5-40e8-a5fd-7ae493016121_1280x720.heic 1272w, https://substackcdn.com/image/fetch/$s_!M1vk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1690e043-2db5-40e8-a5fd-7ae493016121_1280x720.heic 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!M1vk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1690e043-2db5-40e8-a5fd-7ae493016121_1280x720.heic" width="1280" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1690e043-2db5-40e8-a5fd-7ae493016121_1280x720.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:44259,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!M1vk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1690e043-2db5-40e8-a5fd-7ae493016121_1280x720.heic 424w, https://substackcdn.com/image/fetch/$s_!M1vk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1690e043-2db5-40e8-a5fd-7ae493016121_1280x720.heic 848w, https://substackcdn.com/image/fetch/$s_!M1vk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1690e043-2db5-40e8-a5fd-7ae493016121_1280x720.heic 1272w, https://substackcdn.com/image/fetch/$s_!M1vk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1690e043-2db5-40e8-a5fd-7ae493016121_1280x720.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>You can see the query execution schema with the additional elements needed to implement re-optimisation in PostgreSQL. Yellow elements are in-core features, and green elements are subsystems that can be moved out into an extension.</p><p><strong>Decision Maker</strong>. First, the DBMS should identify queries that can potentially be re-optimised: it doesn't make sense to employ this heavy machinery for trivial queries or a single grouping. So, using the planner hook, the user can provide a clue and mark a plan as a 'supervised' one. To that end, one custom field was added to the PlannedStmt node to remember that this decision has been made.</p><p><strong>Subtransaction</strong>. 
If a query is interrupted, Postgres must release all acquired resources (locks, pinned buffers, memory, etc.) before the next planning attempt. The only way to do it provably correctly is by employing the subtransaction machinery. The 'supervised' query must be executed inside such a subtransaction so that the whole state can be reverted before re-optimisation and re-execution.</p><p><strong>ExecProcNode Hook</strong>. During the execution, we have to check a <em>trigger</em> that the user has predefined for the query. This check should be done from time to time at a place where the executor is in a consistent state: for example, we shouldn't allow interruptions in the middle of hash table building or sorting. Keep in mind that afterwards Postgres has to walk the execution state to find clues for re-optimisation, so this state must be consistent for both the walker and the ROLLBACK code. As I realised, the most reliable place in the code is the ExecProcNode routine.</p><p><strong>Trigger</strong>. Hooked into ExecProcNode, the trigger can be defined by a user, parameterised, and exported as a stored C procedure in the extension's UI. It employs the standard Postgres ERROR exception to interrupt execution with a specific error code that can be processed further up by the <em>error handler</em>. The trigger has access to the query's Execution State and can watch any part of the query plan if needed. At the same time, it should be simple enough not to add much overhead for each produced tuple.</p><p><strong>Error Handler</strong>. Currently, the main ServerLoop passes any error coming from the portal on to the client. But in the case of re-optimisation, it should catch errors and, if an error was raised by the trigger, launch the <em>Execution State Analyser</em> before aborting the subtransaction and restarting the query processing, if needed.</p><p><strong>Execution State Analyser</strong>. 
Although it is just a walker over the plan state, it implements a complicated subsystem for gathering <em>instrumentation</em> data for each node. It is a bit tricky because the current core code doesn't expect partial execution. It grabs the actual number of rows, the number of groups, and the size of data spilled to disk for hashing or sorting. As a part of an extension, it can be made more sophisticated, but only up to the limits of the current set of planner hooks.</p><p><strong>Selectivity Hook</strong>. Using data gathered from the partial execution state, an extension should be able to provide the optimiser with recommendations on cardinalities, number of groups, hash table sizes, and even an adequate <em>work_mem</em> value. Like <a href="https://github.com/postgrespro/aqo">AQO</a>, this feature strictly depends on such hooks. None of these hooks exists in the core for now, but the selectivity hook, for example, <a href="https://www.postgresql.org/message-id/flat/c8c0ff31-3a8a-7562-bbd3-78b2ec65f16c%40enterprisedb.com">is being discussed</a> and may be committed in the near future.</p><p><strong>Selectivity estimator</strong>. This is a key subsystem paired with the Execution State Analyser. The most complicated part of this system is correctly identifying a specific join, scan, grouping, etc. during the early planning stage and matching it to the plan node of the finalised plan state. It is the most complicated and invasive technique because the Postgres architecture does not provide for it. Experiments with <em>path signatures</em> in the AQO extension have shown the fragility of such matching. So, in this project, I have chosen a more stable approach based on <em>RelOptInfo</em> <em>signatures</em>. The scope of this post is too limited to explain the idea in detail, but I may do so later if people show an interest in this technique.</p><p><strong>Tuple Storage</strong>. 
As you can imagine, re-optimisation and subsequent re-execution are possible only if all results of the query execution are still held inside the backend. However, the Postgres receiver, by default, sends each produced tuple to the client immediately. Because the first message sent out from the instance disables the re-optimisation trigger, it was necessary to invent a tuple storage that delays the shipment of data to the client for some time (limited by the tuple buffer size) and leaves room for re-optimisation, if needed.</p><h2>Implementation caveats</h2><p>The relatively simple design faced multiple difficulties during development in the sophisticated code of a mature database system like PostgreSQL. The first problem that immediately bubbled up was <em><strong>dynamic query execution</strong></em>, as shown in the picture below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sV0z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfbd45cc-881c-4f14-bfa4-0a654fe26c46_1007x597.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sV0z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfbd45cc-881c-4f14-bfa4-0a654fe26c46_1007x597.heic 424w, https://substackcdn.com/image/fetch/$s_!sV0z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfbd45cc-881c-4f14-bfa4-0a654fe26c46_1007x597.heic 848w, https://substackcdn.com/image/fetch/$s_!sV0z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfbd45cc-881c-4f14-bfa4-0a654fe26c46_1007x597.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!sV0z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfbd45cc-881c-4f14-bfa4-0a654fe26c46_1007x597.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sV0z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfbd45cc-881c-4f14-bfa4-0a654fe26c46_1007x597.heic" width="1007" height="597" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cfbd45cc-881c-4f14-bfa4-0a654fe26c46_1007x597.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:597,&quot;width&quot;:1007,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:28772,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sV0z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfbd45cc-881c-4f14-bfa4-0a654fe26c46_1007x597.heic 424w, https://substackcdn.com/image/fetch/$s_!sV0z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfbd45cc-881c-4f14-bfa4-0a654fe26c46_1007x597.heic 848w, https://substackcdn.com/image/fetch/$s_!sV0z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfbd45cc-881c-4f14-bfa4-0a654fe26c46_1007x597.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!sV0z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfbd45cc-881c-4f14-bfa4-0a654fe26c46_1007x597.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>A query can contain a function call. Such a function, in turn, can contain quite complex logic and execute queries inside the body. Their planning and execution happen in an independent and isolated execution context somewhere in the middle of the execution of a top-level query. 
So, the feature should identify such a recursion and, in case of interruption, process the correct <em>PlannedStmt</em> tree. Moreover, functions can handle exceptions and employ savepoints. Because of that, we should be careful and disable re-optimisation if the interruption happens inside a function call.</p><p>At the moment of interruption, some nodes will be in an <em><strong>interim state</strong></em>: they have called ExecProcNode to obtain another tuple and are still waiting for it. The current Postgres ExecutionEnd walker doesn't process this state correctly, so the state must be tracked inside the instrumentation structures. This change would also benefit the pg_query_state extension and other tools that take snapshots of the query plan and want an accurate calculation of each node's cardinality.</p><p><em><strong>How to finalise a partially executed query</strong></em>. It is not obvious, but query execution could involve parallel workers, which are independent processes. When we interrupt execution in the primary process, the workers don't stop immediately. They keep working for an arbitrary period of time, and finalising their work and gathering their instrumentation data turns out to be not so easy.</p><p>One more problem is figuring out when to <em><strong>switch off re-optimisation</strong></em>. A trigger should interrupt query execution only when it makes sense. For example, if you set the execution time trigger to one second, but the query can't be executed in less than one minute, time could be wasted on repeated replanning without any meaningful effect, just because most of the nodes may not even have processed a single tuple. My quick solution was a trivial check of whether something meaningful has been learned since the last re-optimisation. 
If re-optimisation doesn't change anything in the plan, or doesn't even gain new data from the partially executed state, the trigger conditions may be relaxed, for example by increasing the timeout value or the memory usage limit.</p><p>The next problem relates to the <em><strong>signature technique</strong></em>. Being a hash value, a signature can occasionally be the same for two totally different nodes. If these plan nodes have highly different cardinalities (for example, one and 1E6), this can cause fluctuations in the cardinality prediction provided by the selectivity estimator. As a trivial solution, I just set a limit on the maximum number of re-optimisations for one query execution, but it does not seem to be the best solution.</p><p>A quite trivial but still existing problem is plpgsql <em><strong>information messages</strong></em>, which can produce some accidental output during the execution. To make this output consistent (that is, to avoid sending duplicate messages because of a query execution restart), we need to hold off their delivery to the client for as long as re-optimisation is still possible and the query has not finished.</p><h2>How does it help?</h2><p>Multiple triggers can be invented: time, cardinality error, memory consumption, temporary file quota, etc. The architecture also allows a user to define custom triggers for specific purposes. In this particular case, we have chosen a variable-time trigger. To make it more practical, we added some flexibility: if the statement_timeout value is set, the re-optimiser can increase the time gap (up to statement_timeout) if nothing beneficial has been learned since the last re-optimisation iteration.</p><p>So, before launching this benchmark, I set the initial time trigger to 1 second and statement_timeout to about 10 minutes (see the <a href="https://github.com/danolivo/utility/blob/main/job-noworker-issue/job_test_reoptimise">script</a> for details). 
The result of the benchmark execution is shown on the graph below (see <a href="https://docs.google.com/spreadsheets/d/1MmYkd-fRI8tKUD0-6ltBCZfYSSDQap8ky_Nmk1Bn-lk/edit?usp=sharing">Google Docs tables</a> for raw data). Here, the black line represents relative execution time: the time without parallel workers but with re-optimisation, divided by the time with a single parallel worker and no re-optimisation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cysk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8be4b9a-d00a-45cd-8465-e5302deb0b31_691x493.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cysk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8be4b9a-d00a-45cd-8465-e5302deb0b31_691x493.heic 424w, https://substackcdn.com/image/fetch/$s_!cysk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8be4b9a-d00a-45cd-8465-e5302deb0b31_691x493.heic 848w, https://substackcdn.com/image/fetch/$s_!cysk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8be4b9a-d00a-45cd-8465-e5302deb0b31_691x493.heic 1272w, https://substackcdn.com/image/fetch/$s_!cysk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8be4b9a-d00a-45cd-8465-e5302deb0b31_691x493.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cysk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8be4b9a-d00a-45cd-8465-e5302deb0b31_691x493.heic" 
width="691" height="493" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a8be4b9a-d00a-45cd-8465-e5302deb0b31_691x493.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:493,&quot;width&quot;:691,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:22898,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cysk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8be4b9a-d00a-45cd-8465-e5302deb0b31_691x493.heic 424w, https://substackcdn.com/image/fetch/$s_!cysk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8be4b9a-d00a-45cd-8465-e5302deb0b31_691x493.heic 848w, https://substackcdn.com/image/fetch/$s_!cysk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8be4b9a-d00a-45cd-8465-e5302deb0b31_691x493.heic 1272w, https://substackcdn.com/image/fetch/$s_!cysk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8be4b9a-d00a-45cd-8465-e5302deb0b31_691x493.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Compared to the previous graph, you can see that the peaks of 50-500 times degradation (and the unfinished executions, too) have been fixed by re-optimisation. Only a few spikes remain, where queries run up to 6 times slower; these poorly planned queries can be attributed to issues in our re-optimisation logic, which is still a beta version.</p><p>The red line on the graph represents the total execution time of each query, including all re-optimisation iterations. From this standpoint, the outcome of employing the feature doesn't look so appealing: Postgres has spent much time in iterative partial executions, which tells us that blind re-optimisation is impractical in real life. </p><h2>It is a dumb feature, isn't it?</h2><p>Observing the results of the JOB benchmark, it is evident that in most cases, the total query execution time, including all the re-optimisations, is much higher than that of a single, possibly non-optimal, execution. 
So, instead of a speed-up, we get degradation, don't we?</p><p>It is true. Used alone, this feature has too narrow a use case and doesn't make sense in practice. The only applications I see here are debugging and debriefing. But remember, a few weeks ago, I presented the <a href="https://danolivo.substack.com/p/designing-a-prototype-postgres-plan?r=34q1yy">plan freezing extension</a> to you. What if we combine re-optimisation with plan freezing?</p><p>The most questionable part of the freezer is how to identify poorly planned queries and how to force Postgres to build a better plan. That is precisely what the re-optimiser does! As a result, we can create a kind of self-tuning DBMS that 'adapts' to changed data and load. By setting triggers and calling the plan freezer after a profitable re-optimisation, we make Postgres pin the improved query plans in the cache. The effectiveness of a frozen plan can also be controlled by a time trigger, explicitly set for the plan to a value exceeding the initial execution time by, for example, 20%. And now re-optimisation makes sense, doesn't it?</p><p>So, the purpose of this work was much broader than just developing a prototype of a re-optimisation feature. I aimed to invent a general approach to underpinning query optimisation decisions, correcting mistakes, and eventually conserving CPU cycles ;) that can be at least partially autonomous and does not require vendor lock-in. This approach, I believe, is doable and workable and can be profitable, especially in cloud configurations. Do you think it is worth a separate startup project?</p><p>THE END.</p><p><em>August 18, 2024. 
Paris, France</em></p><p><em>P.S.</em></p><p><em>Links:</em></p><ol><li><p>Join Order Benchmark repository:<br>https://github.com/danolivo/jo-bench</p></li><li><p>Docker container with the re-optimisation patch: https://hub.docker.com/r/danolivo/reopt</p></li><li><p>Utility files for the test reproduction:<br>https://github.com/danolivo/utility/tree/main/job-noworker-issue</p></li></ol><p>Names of the GUCs and routines introduced with re-optimisation:</p><ul><li><p><em>query_inadequate_execution_time</em> - time trigger (in ms); re-optimisation starts if the current execution time exceeds this value.</p></li><li><p><em>replan_overrun_limit</em> - factor defining the acceptable cardinality prediction error in a plan node before re-optimisation starts.</p></li><li><p><em>replan_enable</em> - enable/disable re-optimisation.</p></li><li><p><em>show_node_sign</em> - show details of re-optimisation in EXPLAIN.</p></li><li><p><em>replan_signal</em>(pid) - routine to manually trigger re-optimisation in the given process.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Probing indexes to survive data skew in Postgres]]></title><description><![CDATA[An attempt to respond to the data skew issue in Postgres planner]]></description><link>https://danolivo.substack.com/p/probing-indexes-to-survive-data-skew</link><guid isPermaLink="false">https://danolivo.substack.com/p/probing-indexes-to-survive-data-skew</guid><dc:creator><![CDATA[Andrei Lepikhov]]></dc:creator><pubDate>Mon, 12 Aug 2024 00:01:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!udci!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6df72dc1-34ff-4f3e-b121-61f5b8ab0fcc_500x500.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is the story of an unexpected challenge I encountered and a tiny but fearless response to address the Postgres optimiser underestimations caused by a data skew, 
missing statistics or inconsistency between statistics and the data. The journey began with a user's complaint about query performance, which had quite an unusual history.</p><p>The problem was with only one analytical query executed regularly on a schedule. For one of the involved tables, EXPLAIN showed an estimate of a single tuple for the scan, but the executor ended up fetching four million tuples from the disk. This underestimation led Postgres to choose a parameterised NestLoop with Index Scans on each side of the join, causing the query to run two orders of magnitude longer than with an optimal query plan. However, after executing the ANALYZE command, the estimations became correct, and the query ran fast enough.</p><h2>Problem Analysis</h2><p>The problematic table was huge and contained billions of rows. The user would load data in large batches over the weekends and immediately run the troublesome query to identify new trends, comparing the fresh data with the existing data. One of the columns was something like the current timestamp, which indicated the time of addition to the database, and its value was the same for the whole batch. So, I immediately suspected that the user's data insertion pattern was affecting query performance, most likely through the statistics.</p><p>After some investigation, I found that the source of errors was the estimation of trivial filters like 'x=N', where N had a massive number of duplicates in the table's column. Right after bulk insertion into the table, this filter was estimated using the <em>stadistinct</em> number. 
After ANALYZE, this value was detected as a 'most common' value; its <em>selectivity</em> was saved in the statistics, and on subsequent query executions, this filter was estimated precisely using the MCV statistic.</p><p>Let's briefly dip into the logic of equality filter selectivity to understand this behaviour. The following script generates a table with a highly skewed value distribution:</p><pre><code>CREATE EXTENSION tablefunc;
CREATE TABLE norm_test AS
  SELECT abs(r::integer) AS val
  FROM normal_rand(1E7::integer, 5.::float8, 300.::float8) AS r;
ANALYZE norm_test;</code></pre><p>Let's examine the statistics below for the column 'val'. The green curve shows the actual distribution of values in the column from 1 to 1600; the red dots are the most common values &#8212; they cover the top of the graph. The black line shows this column's number of distinct values (943).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!udci!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6df72dc1-34ff-4f3e-b121-61f5b8ab0fcc_500x500.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!udci!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6df72dc1-34ff-4f3e-b121-61f5b8ab0fcc_500x500.heic 424w, https://substackcdn.com/image/fetch/$s_!udci!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6df72dc1-34ff-4f3e-b121-61f5b8ab0fcc_500x500.heic 848w, https://substackcdn.com/image/fetch/$s_!udci!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6df72dc1-34ff-4f3e-b121-61f5b8ab0fcc_500x500.heic 1272w, https://substackcdn.com/image/fetch/$s_!udci!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6df72dc1-34ff-4f3e-b121-61f5b8ab0fcc_500x500.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!udci!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6df72dc1-34ff-4f3e-b121-61f5b8ab0fcc_500x500.heic" width="500" height="500" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6df72dc1-34ff-4f3e-b121-61f5b8ab0fcc_500x500.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:27756,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!udci!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6df72dc1-34ff-4f3e-b121-61f5b8ab0fcc_500x500.heic 424w, https://substackcdn.com/image/fetch/$s_!udci!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6df72dc1-34ff-4f3e-b121-61f5b8ab0fcc_500x500.heic 848w, https://substackcdn.com/image/fetch/$s_!udci!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6df72dc1-34ff-4f3e-b121-61f5b8ab0fcc_500x500.heic 1272w, https://substackcdn.com/image/fetch/$s_!udci!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6df72dc1-34ff-4f3e-b121-61f5b8ab0fcc_500x500.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let's execute some simple scan SQL queries:</p><pre><code>-- Involve MCV statistics ('5' inside the MCV stat)
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 5;
Gather  (rows=27333 width=4) (rows=26416 loops=1)
  -&gt;  Parallel Seq Scan on norm_test
        Filter: (val = 5)
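
-- A sketch, not part of the original post: list what ANALYZE stored for the
-- column in the standard pg_stats view, to see whether a constant such as 5
-- is among the most common values.
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'norm_test' AND attname = 'val';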

-- Frequent value but out of MCV
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 10;
Gather  (rows=8614) (actual rows=26583)
   -&gt;  Parallel Seq Scan on norm_test  (rows=3589) (actual rows=8861)
         Filter: (val = 10)
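
-- A sketch, not part of the original post: reproduce by hand the estimate the
-- planner gives to any value outside the MCV list. Assumes n_distinct is
-- stored as a positive count, the column has no NULLs, and only one table
-- named norm_test exists. Compare the result with the rows=8614 estimate above.
SELECT c.reltuples
         * (1 - (SELECT sum(f) FROM unnest(s.most_common_freqs) AS f))
         / (s.n_distinct - cardinality(s.most_common_freqs)) AS est_rows
FROM pg_stats AS s
JOIN pg_class AS c ON c.relname = s.tablename
WHERE s.tablename = 'norm_test' AND s.attname = 'val';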

-- Rare value
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 10000;
Gather  (rows=8614 width=4) (rows=0 loops=1)
  -&gt;  Parallel Seq Scan on norm_test
        Filter: (val = 10000)</code></pre><p>As you can see, the best case is when the value is in the MCV list. Otherwise, Postgres estimates the cardinality of the filter according to the formula:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;s=\\frac{(ntuples-\\sum{MCV})}{stadistinct - size(MCV)}&quot;,&quot;id&quot;:&quot;JJASNTEYOV&quot;}" data-component-name="LatexBlockToDOM"></div><p>I.e., it subtracts the tuples covered by the most common values from the total number of tuples and divides the remainder by the number of remaining distinct values (see <a href="https://github.com/postgres/postgres/blob/8c3548613d7e2a7c010360e6f55fa6db849eeef9/src/backend/utils/adt/selfuncs.c#L295">var_eq_const</a> for details). The result is a single prediction shared by every value outside the MCV list. But what if almost all of the values are in the MCV list? Let's check it:</p><pre><code>-- Add frequent values which are most of the data
CREATE TABLE norm_test1 AS SELECT gs % 100 AS val
  FROM generate_series(1,1E7) AS gs;

-- Add some rare values
INSERT INTO norm_test1 (val) SELECT gs
  FROM generate_series(101,105) AS gs;
VACUUM ANALYZE norm_test1;
ALTER TABLE norm_test1 SET (autovacuum_enabled = 'false');
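
-- A sketch, not part of the original post: check how much of the table the
-- MCV list covers after ANALYZE. When mcv_total_freq is close to 1 and
-- mcv_size is close to n_distinct, almost nothing is left for other values.
SELECT n_distinct,
       cardinality(most_common_freqs) AS mcv_size,
       (SELECT sum(f) FROM unnest(most_common_freqs) AS f) AS mcv_total_freq
FROM pg_stats
WHERE tablename = 'norm_test1' AND attname = 'val';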

-- Batch insertion of duplicates
INSERT INTO norm_test1 (val) SELECT 100 FROM generate_series(1,1E5);
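
-- A sketch, not part of the original post: the cumulative statistics view
-- shows how many rows changed since the last ANALYZE, i.e. how stale the
-- planner statistics are (the counters may lag a little).
SELECT n_live_tup, n_mod_since_analyze, last_analyze
FROM pg_stat_user_tables
WHERE relname = 'norm_test1';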

EXPLAIN ANALYZE SELECT val FROM norm_test1 WHERE val = 100;

Gather  (rows=1) (actual rows=100000)
   -&gt;  Parallel Seq Scan on norm_test1
         Filter: (val = '100'::numeric)</code></pre><p>As you can see, we got precisely the estimation described in the issue above. After such a long explanation, how should Postgres handle such a scenario?</p><p>Upon investigation, I discovered that Postgres has already <a href="https://github.com/postgres/postgres/blob/3dcb09de7bb21c75d4df48263561af324fd099a4/src/backend/utils/adt/selfuncs.c#L6088">implemented</a> a solution: the 'index probing' technique for inequality operators (like '&lt;' or '&gt;'). Estimation of such operators employs the histogram to calculate the number of bins that fall within the inequality filter boundaries.</p><p>After reviewing the git history of this feature and the discussion, I realised that it suffers from <a href="https://www.postgresql.org/message-id/flat/CAKZiRmznOwi0oaV%3D4PHOCM4ygcH4MgSvt8%3D5cu_vNCfc8FSUug%40mail.gmail.com">performance issues</a>. So, does it make sense to use the same trick for an equality operator? Would it be suitable for some sophisticated analytical queries? Let's try to implement this feature and assess the overhead afterwards with benchmarks.</p><h2>Implementation Description</h2><p>You can see the working implementation in the <a href="https://github.com/danolivo/pgdev/tree/estimate-by-index">branch</a> of my GitHub repository. The idea is as follows: if we can't use MCV for an equality expression and some empirical condition indicates that the distinct-based estimation is suspicious, let's try to find an index that covers this column. With such an index, call the AM index_getbitmap routine to count the tuples that satisfy the condition. Picking the <em>NonVacuumableSnapshot</em> will guarantee an upper-bound estimation.</p><p>The index_getbitmap routine collects only the TIDs of tuples, not the tuples themselves. Of course, this estimation process can be time-consuming when many tuples match. 
A better option could be to make two IndexScan operations - a forward and a backward scan on the target constant value - to find the lower and upper bounds and roughly estimate the number of tuples by the number of pages between them. But as far as I can see, the AM interface in Postgres is still not ready to provide the caller with the (<em>page</em>, <em>offset</em>) pair of the first tuple found.</p><p>One consideration that softens the performance impact of this approach is that calling index_getbitmap pulls index pages from the disk into memory, where they can be reused during query execution.</p><p>The crucial point is the condition under which we involve the index probing approach. There is room for improvisation, but being short on time, I just invented a trivial one: look at the histogram's bounds and see if the value fits within them. If it is outside the histogram's coverage, we assume that the statistics can't be trusted and probe an index.</p><h2>Benchmarking</h2><p>To find the worst case, I employed pgbench, as usual. The benchmarking script looks like the following:</p><pre><code>pgbench -i -s 10
psql -c "ALTER TABLE pgbench_accounts
  DROP CONSTRAINT pgbench_accounts_pkey;"
psql -c "ALTER TABLE pgbench_branches
  DROP CONSTRAINT pgbench_branches_pkey;"
psql -c "ALTER TABLE pgbench_tellers
  DROP CONSTRAINT pgbench_tellers_pkey;"

psql -c "CREATE INDEX ON pgbench_accounts(aid);"
psql -c "CREATE INDEX ON pgbench_branches(bid);"
psql -c "CREATE INDEX ON pgbench_tellers(tid);"

pgbench -c 5 -j 5 -T 180 -P 3 -f test_s.pgb</code></pre><p>Here, we dropped the unique indexes and created non-unique ones because the optimiser uses unique indexes to return a single-tuple estimation. The query set contained a single SELECT whose constant quite frequently falls outside the histogram boundaries:</p><pre><code>\set aid random(-5000000 * :scale, 5000000 * :scale)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;</code></pre><p>Analysing the results of this benchmark, I observed a 5% overhead on touching histogram statistics and an additional 5-6% on probing indexes. It looks like a huge overhead, but the code is just a simple sketch. Who anticipated an ideal result?</p><p>That's it for today. In conclusion, I must emphasise that although this approach worsens performance in the general case, it holds great potential for application in specific, narrow cases. The question is, can we effectively limit its involvement using an empirical formula? Could a global or per-table GUC be a lifesaver in that case?</p><p>In unlucky situations where data skew makes estimations so bad that they cause a performance slump of an order of magnitude, we are left with no tools except a schema change. In such cases, this feature can be a handy solution.</p><p>THE END.</p><p><em>August 11, 2024. Paris, France</em></p>]]></content:encoded></item><item><title><![CDATA[Does PostgreSQL respond to the challenge of analytical queries?]]></title><description><![CDATA[A short glance into recent advancements]]></description><link>https://danolivo.substack.com/p/does-postgresql-respond-to-the-challenge</link><guid isPermaLink="false">https://danolivo.substack.com/p/does-postgresql-respond-to-the-challenge</guid><dc:creator><![CDATA[Andrei Lepikhov]]></dc:creator><pubDate>Mon, 05 Aug 2024 03:00:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9a3cfea2-eef3-43a0-9ffe-b2d20b3c68e4_608x608.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This post was triggered by Crunchy Data's <a href="https://thenewstack.io/unleashing-postgres-for-analytics-with-duckdb-integration/">recent article</a> and the YugabyteDB approach to using Postgres as an almost stateless entry point, which parses incoming analytic queries and splits the work among multiple instances managed by another database system that fits the task of 
storing and processing massive data volumes and can execute relatively simple queries.</p><p>The emergence of foreign data wrappers (FDW) and partitioning features has made these solutions possible. It seems that being compatible with the Postgres infrastructure and its mature parser/planner is valuable enough for vendors to consider implementing such hybrid data management systems.</p><p>So, now we face the fact that Postgres is used to do analytics on big data. An essential question immediately emerges: does it have enough tools to process complex queries?</p><h2>What Are Analytic Queries</h2><p>It is quite a general term. But after reading the course materials on the subject, I can summarise that analytic queries typically:</p><ul><li><p>involve multiple joins</p></li><li><p>use aggregates, often mathematical ones</p></li><li><p>need to process large subsets of table data in a single query</p></li><li><p>are ad hoc in nature, so it is difficult to predict when they will arrive</p></li></ul><p>So, looking into the Postgres changes, we should discover what has changed in aggregate processing, join ordering and estimation, and table scanning.</p><h2>What was the rationale?</h2><p>The technique of using Postgres as middleware between the user and the storage was triggered by the emergence of FDW and partitioning features. Parallel execution doesn't help much with processing foreign tables (partitions). Still, it is beneficial for speeding up the local part of the work.</p><p>The basics of these features were introduced in 2010 - 2017. Now, Postgres can push down to a foreign server queries containing scan operations, joins, and orderings. We also have asynchronous append, which allows us to gather data from foreign instances simultaneously. 
Looking ahead, the community has quite an active <a href="https://www.postgresql.org/message-id/flat/cf744a8ee4d47bdabe1da9174d4f3dc9%40postgrespro.ru">discussion</a> on aggregate pushdown.</p><p>Partitioning includes pruning techniques (at the planning and execution stages) that restrict query pushdown to only those instances that contain the necessary data. One more essential thing - partitionwise join - allows the optimiser to choose a specific way to execute a join for each pair of joining partitions.</p><p>The FDW/partitioning technique is not ideal yet because it has many shortcomings. For example:</p><ul><li><p>We can prune only partitions, not a query subtree;</p></li><li><p>We can't declare a table as a 'dictionary' that exists on every instance and join such a table with foreign partitions directly on a remote instance;</p></li><li><p>The pruning technique often can't remove partitions because it lacks statistical data about the partitions' min/max values.</p></li></ul><p>However, even with these and many other problems, Postgres has hooks and an FDW API that are flexible enough to allow a professional development team to arrange the code according to the project's needs. Partitioning abilities are actively maturing. I see discussions (see, for example, [<a href="https://www.postgresql.org/message-id/flat/CAOP8fzaVL_2SCJayLL9kj5pCA46PJOXXjuei6-3aFUV45j4LJQ%40mail.gmail.com">1</a>, <a href="https://www.postgresql.org/message-id/flat/CAJ2pMkZNCgoUKSE%2B_5LthD%2BKbXKvq6h2hQN8Esxpxd%2Bcxmgomg%40mail.gmail.com">2</a>, <a href="https://www.postgresql.org/message-id/flat/CAExHW5tHqEf3ASVqvFFcghYGPfpy7o3xnvhHwBGbJFMRH8KjNw%40mail.gmail.com">3</a>, <a href="https://www.postgresql.org/message-id/flat/CAExHW5tUcVsBkq9qT%3DL5vYz4e-cwQNw%3DKAGJrtSyzOp3F%3DXacA%40mail.gmail.com">4</a>]) on enhancing the optimiser to work better with partitions. 
And I think, soon, we could see more hybrid systems with primary Postgres and some secondary DBMS, chosen according to the purpose.</p><p>Regardless, secondary DBMS typically performs low-level preparatory operations with data. Aggregates, complex subqueries, window functions, and other stuff are still executed locally, and the issue is how mature the optimiser is in finding an effective way to process this data after pulling it from the remote side.</p><p>After reviewing the code repository, I can confirm that the core developers are actively addressing the challenge of identifying and mitigating bottlenecks in the optimiser. </p><h2>What is the progress?</h2><p>To provide a comprehensive overview, please refer to the table below, which outlines my selection of the top commits that have impacted the optimiser since 2010:</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/d27tY/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62b7eb38-42f8-4b5b-a742-fbe58a75a563_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:1064,&quot;title&quot;:&quot;[ Optimisation-related commits ]&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/d27tY/1/" width="730" height="1064" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>Commits in this table can be grouped by the 'feature' key. See my categorisation below. </p><p><strong>ProSupport</strong>. 
The first problem addressed is the estimation of a FunctionScan operation. I briefly <a href="https://www.linkedin.com/posts/avlepikhov_be-careful-with-selecting-functions-in-postgresql-activity-7208299350316261376-vgxw?utm_source=share&amp;utm_medium=member_desktop">mentioned</a> this issue about a month ago. The main problem lies in the optimiser's inability to precisely estimate the cost and cardinality of functions that generate data for the query. In 2019, the community found an elegant and adaptable solution - the '<a href="https://www.postgresql.org/docs/current/xfunc-optimization.html">prosupport</a>' routine concept. Such a routine is registered for a function in the database and provides the optimiser with the necessary information about it. This approach allows users or extensions to tune the planning decisions. In 2022 and 2023, these capabilities were extended to window functions. I currently see an <a href="https://www.postgresql.org/message-id/flat/CAEze2Wg-%2BEV4HdbQiut7X3KQd39xwmrpV4CeCmoJFFjH8cGdhw%40mail.gmail.com">attempt</a> to use them with aggregates, which appears to be an important evolution of the technique.</p><p><strong>Extended Statistics</strong>. People still like and use ORMs and REST APIs despite their apparent inefficiency. To tackle the challenge of bad estimations caused by multi-clause filters, the community introduced extended statistics in 2017 - 2019. It provides three types of statistics: MCV, functional dependencies and ndistinct, which capture hidden dependencies between columns in a table and improve estimations.</p><p>I don't see this feature being widely adopted: at least, not many reports on its usage are available on the Internet. IMO, this is caused by its opacity, computational cost and the necessity to manually identify the columns or expressions on which to build the statistics.</p><p><strong>Incremental Sort</strong>. It is an excellent idea that introduces a whole new way to execute a query into the optimiser. 
As an alternative to a full sort or re-sorting of data, the optimiser can find a path where the executor takes input presorted by, for example, x1,x2 and sorts by x3 only within each group of equal x1,x2 values, producing output sorted by x1,x2,x3 as required by the following operation. This approach relieves a typical problem of analytic queries, which frequently require sorted output for aggregations on various query levels. It is especially effective in combination with LIMIT - just look at this example:</p><pre><code>SET enable_incremental_sort = off;
EXPLAIN (ANALYZE, TIMING OFF)
SELECT * FROM tenk1 ORDER BY unique1, ten LIMIT 100;

RESET enable_incremental_sort;
EXPLAIN (ANALYZE, TIMING OFF)
SELECT * FROM tenk1 ORDER BY unique1, ten LIMIT 100;</code></pre><p>Without incremental sort, we must read all the tuples from the table even though the top-N heapsort returns only 100 of them (the EXPLAIN output has been trimmed for brevity):</p><pre><code>Limit  (cost=827.19..827.44) (actual rows=100 loops=1)
   -&gt;  Sort  (cost=827.19..852.19 rows=10000) (actual rows=100 loops=1)
         Sort Key: unique1, ten
         Sort Method: top-N heapsort  Memory: 112kB
         -&gt;  Seq Scan on tenk1  (cost=0.00..445.00)
             (actual rows=10000 loops=1)
 Execution Time: 5.404 ms</code></pre><p>However, incremental sort allows Postgres to employ an index scan and feed partially sorted input to the sort node. It also no longer needs to scan the whole table, which is crucial in the case of analytic queries and massive tables:</p><pre><code>Limit  (cost=0.46..21.46) (actual rows=100 loops=1)
   -&gt;  Incremental Sort  (cost=0.46..2100.20) (actual rows=100 loops=1)
         Sort Key: unique1, ten
         Presorted Key: unique1
         -&gt;  Index Scan using tenk1_unique1 on tenk1
             (cost=0.29..1650.20) (actual rows=101 loops=1)
 Execution Time: 0.318 ms</code></pre><p>Moreover, I didn't find analogues to this node in other database systems.</p><p><strong>Memoize</strong>. It is designed to extend the parameterised NestLoop JOIN technique to more use cases. Its task is to cache tuples fetched from the inner input of a NestLoop join. The idea extends the Materialize technique. Imagine that the inner subtree of the join is too large to cache entirely, and the estimated cardinality of the outer subtree is large enough that repeated scans of the inner could kill performance. A parameterised NestLoop remains the best solution because a parameterised index scan extracts only a tiny subset of tuples from the inner. If the optimiser also predicts many duplicate values in the outer, it can insert a Memoize node on top of the inner to avoid rescanning when the same key value has already come from the outer.</p><p>Let me show the effect of Memoize with a simple query:</p><pre><code>EXPLAIN ANALYZE
SELECT COUNT(*),AVG(t1.unique1) FROM tenk1 t1
INNER JOIN tenk1 t2 ON t1.unique1 = t2.twenty
WHERE t2.unique1 &lt; 1000;</code></pre><p>The query is taken from the regression tests, where a description of the tenk1 table can also be found. With Memoize disabled (the enable_memoize setting), we get the following query plan:</p><pre><code> Aggregate  (cost=448.24..448.25) (rows=1 loops=1)
   -&gt;  Merge Join  (cost=427.65..443.24) (rows=1000 loops=1)
         Merge Cond: (t1.unique1 = t2.twenty)
         -&gt;  Index Only Scan using tenk1_unique1 on tenk1 t1
             (rows=21 loops=1)
         -&gt;  Sort  (cost=427.36..429.86) (rows=1000 loops=1)
               Sort Key: t2.twenty
               -&gt;  Bitmap Heap Scan on tenk1 t2
                   (cost=20.04..377.54) (rows=1000 loops=1)
                     Recheck Cond: (unique1 &lt; 1000)
                     -&gt;  Bitmap Index Scan on tenk1_unique1
                         (cost=0.00..19.79 width=0) (rows=1000 loops=1)
                           Index Cond: (unique1 &lt; 1000)
 Execution Time: 6.512 ms</code></pre><p>This is a good example of an effective MergeJoin: with both inputs sorted, the merge algorithm needs to fetch only 21 tuples from the outer index scan. But what about NestLoop in that case? Could it be competitive? Disable MergeJoin and HashJoin and see the result:</p><pre><code> Aggregate  (cost=815.04..815.05) (rows=1 loops=1)
   -&gt;  Nested Loop  (cost=20.32..810.04) (rows=1000 loops=1)
         -&gt;  Bitmap Heap Scan on tenk1 t2
             (cost=20.04..377.54) (rows=1000 loops=1)
               Recheck Cond: (unique1 &lt; 1000)
               -&gt;  Bitmap Index Scan on tenk1_unique1
                   (cost=0.00..19.79) (rows=1000 loops=1)
                     Index Cond: (unique1 &lt; 1000)
         -&gt;  Index Only Scan using tenk1_unique1 on tenk1 t1
             (cost=0.29..0.42) (rows=1 loops=1000)
               Index Cond: (unique1 = t2.twenty)
 Execution Time: 102.476 ms</code></pre><p>Much worse. The outer side still produces the same 1000 tuples, but now each of them triggers an index scan that returns a single tuple, 1000 scans in total. This is precisely where caching could help if many of these 1000 loops return the same tuple. Enable Memoize and see what happens:</p><pre><code> Aggregate  (cost=416.40..416.41) (rows=1 loops=1)
   -&gt;  Nested Loop  (cost=20.33..411.39) (rows=1000 loops=1)
         -&gt;  Bitmap Heap Scan on tenk1 t2
             (cost=20.04..377.54) (rows=1000 loops=1)
               Recheck Cond: (unique1 &lt; 1000)
               -&gt;  Bitmap Index Scan on tenk1_unique1
                   (cost=0.00..19.79) (rows=1000 loops=1)
                     Index Cond: (unique1 &lt; 1000)
         -&gt;  Memoize  (cost=0.30..0.43) (rows=1 loops=1000)
               Cache Key: t2.twenty
               Hits: 980  Misses: 20
               -&gt;  Index Only Scan using tenk1_unique1 on tenk1 t1
                   (cost=0.29..0.42) (rows=1 loops=20)
                     Index Cond: (unique1 = t2.twenty)
 Execution Time: 6.046 ms</code></pre><p>The plan shape stays the same, but in 980 of the inner rescans the Memoize node returned a cached copy of the tuple instead of scanning the index again. The effect is visible: the total plan cost is lower than in the two previous plans, and the execution time is at least no worse.</p><p><strong>Pull-up subqueries</strong>. In my experience, intricate analytic queries often employ subqueries in expressions. Such a subquery can depend on data from the wrapping query block (a correlated subquery), which leads to a complete subquery evaluation each time the expression is called. If the expression is a filter or join clause, the executor will evaluate it on each incoming tuple.</p><p>It is a common problem that is solved with query tree transformation rules, which have been researched since the 1980s. A trivial subquery is transformed to an InitPlan and evaluated once, and the query uses its materialised output. If the subquery depends on parameters, it can frequently be transformed to a SEMI JOIN with lateral references.</p><p>Postgres supports the transformation of simple subqueries and, in 2024, added restricted support for correlated subqueries. IMO, development in this area is crucial to speed up analytics, especially auto-generated queries.</p><p>Let me demonstrate this technique with the example below:</p><pre><code>EXPLAIN (ANALYZE, TIMING OFF, COSTS ON)
SELECT * FROM tenk1 A
WHERE A.hundred IN (SELECT B.hundred FROM tenk2 B WHERE B.unique1 = A.odd);</code></pre><p>This query contains one correlated subquery. Turning off the transformation, we get the plan:</p><pre><code> Seq Scan on tenk1 a  (cost=0.00..43420.00) (actual rows=100 loops=1)
   Filter: (ANY (hundred = (SubPlan 1).col1))
   Rows Removed by Filter: 9900
   SubPlan 1
     -&gt;  Index Scan using tenk2_unique1 on tenk2 b
         (cost=0.29..8.30) (actual rows=1 loops=10000)
           Index Cond: (unique1 = a.odd)
 Execution Time: 87.182 ms</code></pre><p>After transforming the subquery to a SEMI JOIN, the optimiser finds a better (according to the cost model) plan that executes about four times faster:</p><pre><code> Hash Semi Join  (cost=595.00..1215.00) (actual rows=100 loops=1)
   Hash Cond: ((a.odd = b.unique1) AND (a.hundred = b.hundred))
   -&gt;  Seq Scan on tenk1 a (actual rows=10000 loops=1)
   -&gt;  Hash (actual rows=10000 loops=1)
         -&gt;  Seq Scan on tenk2 b  (actual rows=10000 loops=1)
 Execution Time: 20.722 ms</code></pre><p>Even the use of an Index Scan in the subquery doesn't help much without the transformation: looping over the subquery for each tuple drastically degrades performance.</p><p>I discovered that MS SQL Server includes diverse pull-up transformation techniques for simple and correlated subqueries. The picture is less clear for Oracle, where, as the documentation explains, the transformation may be forced by using hints.</p><p><strong>ORDER-BY/DISTINCT Aggregates</strong>. This improvement is almost invisible to the user, yet it sometimes drastically reduces execution time. The main idea is to discover aggregate orderings, find the most common ones, and sort incoming data before calculating these aggregates. To understand the effect, look at the difference between the same query executed by PG13 and PG17:</p><pre><code>EXPLAIN (ANALYZE, TIMING OFF, COSTS ON)
SELECT sum(unique1 ORDER BY ten), sum(unique1 ORDER BY ten,two)
FROM tenk1 GROUP BY ten;
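
-- Hedged aside, not part of the original measurements: on PG16 and later,
-- the presorting visible in the PG17 plan below is controlled by the
-- enable_presorted_aggregate setting (on by default). Assuming such a server,
-- both behaviours can be compared by re-running the EXPLAIN above after each
-- of the following statements:
SET enable_presorted_aggregate = off;  -- each ordered aggregate sorts its own input, as before PG16
RESET enable_presorted_aggregate;      -- back to the default, presorted input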

-- PG13:

/*
GroupAggregate  (cost=1108.97..1209.02) (actual rows=10 loops=1)
  Output: sum(unique1 ORDER BY ten), sum(unique1 ORDER BY ten, two), ten
  Group Key: tenk1.ten
  -&gt;  Sort  (cost=1108.97..1133.95) (actual rows=10000 loops=1)
        Output: ten, unique1, two
        Sort Key: tenk1.ten
        -&gt;  Seq Scan on public.tenk1  (cost=0.00..444.95)
              Output: ten, unique1, two
Execution Time: 116.375 ms
 */

-- PG17:

/*
GroupAggregate  (cost=1109.39..1209.49) (actual rows=10 loops=1)
  Output: sum(unique1 ORDER BY ten), sum(unique1 ORDER BY ten, two), ten
  Group Key: tenk1.ten
  -&gt;  Sort  (cost=1109.39..1134.39) (actual rows=10000 loops=1)
        Output: ten, unique1, two
        Sort Key: tenk1.ten, tenk1.two
        -&gt;  Seq Scan on public.tenk1  (cost=0.00..445.00)
              Output: ten, unique1, two
Execution Time: 12.650 ms
 */</code></pre><p>Presorting tuples and eliminating internal aggregate sorting give a nearly tenfold speedup. Curiously, the cost value barely changes even though the execution time does. Does that indicate room for further improvement of the optimiser's cost model?</p><p><strong>Make Vars be outer-join-aware</strong>. The last feature, designed recently, in 2023, is so internal and hidden from the typical user that, I think, only a few people know about it: machinery to detect that incoming data can contain NULL values.</p><p>It is worth mentioning because of its great potential. Many queries contain NULL checks. Initially, the optimiser estimated the number of null values by looking at the table statistics. Sometimes, table columns do not contain any NULLs or even have NOT NULL constraints. But still, in a query containing an OUTER JOIN, a field that takes such a column as its source may produce nulls. Such 'generated' nulls frequently cause wrong estimations, mostly cardinality underestimation, which results in choosing the NestLoop join algorithm.</p><h2>What more can we do?</h2><p>Deciding what we need is difficult because we have to envision the effect it would bring. However, by looking at alternatives like MS SQL Server and the GPOrca optimiser, which have some advantages, I can briefly outline the necessary techniques.</p><p>First and foremost, it is a further evolution of extended statistics. SQL Server has diverse options for this type of statistics, which is used intensively to estimate scans or joins. 
SQL Server also gathers some statistics on the fly, similar to the approach described in the [<a href="https://www.cs.cmu.edu/~natassa/courses/15-721/papers/reopt.pdf">DeWitt1998</a>] paper.</p><p>Combining on-the-fly statistics collection points with alternative query subplans and dynamic switching between them during execution (watch for Alena Rybakina's WIP report at the September 2024 Postgres Conference) could allow complex queries to survive and finish in a sane time.</p><p>So far, I don't see any activity on the hackers mailing list around developing pull-up subquery techniques, so the community is not pushing this topic. IMO, the main reason is the efficiency issue: although correlated subquery transformation is well-described in scientific papers, it can increase execution time in some cases. As a result, this technique's performance and technical aspects still need to be revised before any further progress.</p><p>Also, the community has discussed possible ways to modify the sort model and improve the sorting and shuffling of group-by columns. This topic looks interesting to work on in the next development cycle.</p><p>In the end, I should state that the progress is obvious. Some new and unique features are being introduced. However, development is still not as fast as people who operate fast-growing data would like. I feel it makes sense to extend the set of hooks in (at least) selectivity estimation, subquery or expression tree transformation, and node execution. Maybe we can allow custom statistics. This would let the outside (non-core) community implement new techniques in advance.</p><p>Are you okay with the current state of the PostgreSQL planner and its roadmap?</p><p>THE END.</p><p>August 4, 2024. 
Paris, France</p>]]></content:encoded></item><item><title><![CDATA[Designing a Prototype: Postgres Plan Freezing]]></title><description><![CDATA[The story of one extension]]></description><link>https://danolivo.substack.com/p/designing-a-prototype-postgres-plan</link><guid isPermaLink="false">https://danolivo.substack.com/p/designing-a-prototype-postgres-plan</guid><dc:creator><![CDATA[Andrei Lepikhov]]></dc:creator><pubDate>Mon, 29 Jul 2024 01:00:47 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/aaceb621-c113-4c38-be46-32a99124ba14_1753x1240.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This story is about a controversial PostgreSQL feature - a query plan freezing extension (see its <a href="https://postgrespro.com/docs/enterprise/16/sr-plan">documentation</a> for details) and the coding techniques underpinning it. The design process had one peculiarity: I had to invent it from scratch in no more than three months or throw the idea away, and because of that, solutions were devised and implemented on the fly. The time limit led to accidental, impromptu findings, which can be helpful in other projects.</em></p><p>Developers are aware of the plan cache module in Postgres. It enables a backend to store the query plan in memory for prepared statements, extended protocol queries, and SPI calls, thereby saving CPU cycles and potentially preventing unexpected misfortunes that result in suboptimal query plans. But what about pinning the plan for an arbitrary query if someone thinks it may pay off? Can it be useful and implemented without core changes and a massive decline in performance? Could we make such a procedure global, applied to all backends? 
This is especially important because prepared statements are still limited to the backend where they were created.</p><p>Before doing anything, I looked around and found some related projects: <a href="https://github.com/rjuju/pg_shared_plans">pg_shared_plans</a>, <a href="https://github.com/DrPostgres/pg_plan_guarantee">pg_plan_guarantee</a>, and <a href="https://github.com/ossc-db/pg_plan_advsr">pg_plan_advsr</a>. Unfortunately, at that time, they looked like research projects and didn't demonstrate any inspiring ideas on reliably matching cached query plans to incoming queries.</p><p>My initial reason for starting this project was far from plan caching: at that time, I was designing distributed query execution based on the FDW machinery and the postgres_fdw extension in particular. That project is now known as '<a href="https://postgrespro.ru/docs/shardman/14/?lang=en">Shardman</a>'. While implementing and benchmarking distributed query execution, I found that the worst issue limiting the speed-up of queries that extract a small number of tuples from large distributed tables (distributed OLTP) is repeated query planning on each remote server, even when you know that your tables are distributed uniformly and the plan may be identical on each instance. Working on different solutions, I realised that remote-side queries usually have a much more trivial structure than the original query and are often similar across various queries (up to different constants). In that case, the most straightforward way was to 'freeze' the plan of the remote-side query and reuse it the next time. 
What could be simpler to implement than that, with a plan cache already in the core?</p><p>In general, the idea is relatively trivial: invent a shared library that employs the planner hook, plus an extension that provides a UI, as shown in the picture:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rfe9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6060ccb5-6ff8-46a7-ba5c-abdbe0a8b2bf_592x532.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rfe9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6060ccb5-6ff8-46a7-ba5c-abdbe0a8b2bf_592x532.heic 424w, https://substackcdn.com/image/fetch/$s_!rfe9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6060ccb5-6ff8-46a7-ba5c-abdbe0a8b2bf_592x532.heic 848w, https://substackcdn.com/image/fetch/$s_!rfe9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6060ccb5-6ff8-46a7-ba5c-abdbe0a8b2bf_592x532.heic 1272w, https://substackcdn.com/image/fetch/$s_!rfe9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6060ccb5-6ff8-46a7-ba5c-abdbe0a8b2bf_592x532.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rfe9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6060ccb5-6ff8-46a7-ba5c-abdbe0a8b2bf_592x532.heic" width="360" height="323.5135135135135" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6060ccb5-6ff8-46a7-ba5c-abdbe0a8b2bf_592x532.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:532,&quot;width&quot;:592,&quot;resizeWidth&quot;:360,&quot;bytes&quot;:24910,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rfe9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6060ccb5-6ff8-46a7-ba5c-abdbe0a8b2bf_592x532.heic 424w, https://substackcdn.com/image/fetch/$s_!rfe9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6060ccb5-6ff8-46a7-ba5c-abdbe0a8b2bf_592x532.heic 848w, https://substackcdn.com/image/fetch/$s_!rfe9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6060ccb5-6ff8-46a7-ba5c-abdbe0a8b2bf_592x532.heic 1272w, https://substackcdn.com/image/fetch/$s_!rfe9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6060ccb5-6ff8-46a7-ba5c-abdbe0a8b2bf_592x532.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Our extension is labelled 'sr_plan' in the schema above, an abbreviation of 'save/restore plan'. Being the last module in the chain of <em>planner_hook</em> calls, it can look up the cache of previously stored query plans and, on a positive match, return the stored plan, avoiding the planning process altogether!</p><p>So, bravely starting the project, I immediately encountered the first problem: how to match the query to the corresponding plan in the cache? The SPI, prepared statements, and the extended protocol use an internal pointer to the plan or a predefined name to identify a plan in the plan cache. That&#8217;s not our case: for an arbitrary incoming query, the backend has to look at the plan cache and find a query plan that can be correctly used to execute this query. What's more, the initial query string is transformed into an internal representation and passes through several stages until the final plan is built. 
Look at the picture:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WA6k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2002e64-c092-4911-b5cd-67ab2bbe1aa9_411x358.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WA6k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2002e64-c092-4911-b5cd-67ab2bbe1aa9_411x358.heic 424w, https://substackcdn.com/image/fetch/$s_!WA6k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2002e64-c092-4911-b5cd-67ab2bbe1aa9_411x358.heic 848w, https://substackcdn.com/image/fetch/$s_!WA6k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2002e64-c092-4911-b5cd-67ab2bbe1aa9_411x358.heic 1272w, https://substackcdn.com/image/fetch/$s_!WA6k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2002e64-c092-4911-b5cd-67ab2bbe1aa9_411x358.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WA6k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2002e64-c092-4911-b5cd-67ab2bbe1aa9_411x358.heic" width="411" height="358" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2002e64-c092-4911-b5cd-67ab2bbe1aa9_411x358.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:358,&quot;width&quot;:411,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:22174,&quot;alt&quot;:&quot;Query 
planning process&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Query planning process" title="Query planning process" srcset="https://substackcdn.com/image/fetch/$s_!WA6k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2002e64-c092-4911-b5cd-67ab2bbe1aa9_411x358.heic 424w, https://substackcdn.com/image/fetch/$s_!WA6k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2002e64-c092-4911-b5cd-67ab2bbe1aa9_411x358.heic 848w, https://substackcdn.com/image/fetch/$s_!WA6k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2002e64-c092-4911-b5cd-67ab2bbe1aa9_411x358.heic 1272w, https://substackcdn.com/image/fetch/$s_!WA6k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2002e64-c092-4911-b5cd-67ab2bbe1aa9_411x358.heic 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 
2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Query transformation steps until the final plan</figcaption></figure></div><p>Here, you can see that rewriting rules (which can be altered before the next time the query arrives) can transform one query into multiple parse trees, sometimes having nothing in common with the initial query. In turn, each parse tree can be implemented by multiple query plans&#8230; Remember that indexes used in the plan are not mentioned in the query or the corresponding parse tree.</p><p>Rewriting rules, table names, sets of columns and indexes - all that stuff can also be altered. Moreover, I predict the user will complain if changing just one whitespace character in the query causes it to stop matching the frozen plan. It's quite an erratic technique, isn't it? Summarising the issues mentioned above, we can't just remember the query string and the corresponding plan to prove that this plan may be used to execute this query.</p><p>After spending a couple of days, I realised that the only proof that a specific cached plan may correctly execute the query is the equality of parse trees plus some checking of the indexes mentioned in the plan. Match parse trees? Easy! Just use the in-core routine equal() - that's enough!</p><p>Altering database objects is not an issue in this scheme. 
The plan cache's invalidation machinery guarantees that if some object mentioned in the plan is altered, all plans mentioning it will be marked as 'invalid'.</p><p>Matching parse trees instead of query text has one more positive outcome: the internal representation is insensitive to many changes in the query text, like whitespace or upper/lower case letters. But as usual, it has some negatives: comparing trees is not so cheap. Imagine you have frozen 100 query plans in the cache. How much overhead do you get by comparing each incoming query around 100 times? And what if this query is not even in the frozen set?</p><p>Fortunately, since Postgres 13, this question has had a quick and terse answer: <em>queryId</em>. This is an in-core feature that generates a hash value for each query tree. This hash is based on most of the query elements, such as tables, expressions and constants. Look at this picture:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ygwv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa256f01f-35e2-4e53-824c-b5233307c37d_893x375.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ygwv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa256f01f-35e2-4e53-824c-b5233307c37d_893x375.heic 424w, https://substackcdn.com/image/fetch/$s_!ygwv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa256f01f-35e2-4e53-824c-b5233307c37d_893x375.heic 848w, https://substackcdn.com/image/fetch/$s_!ygwv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa256f01f-35e2-4e53-824c-b5233307c37d_893x375.heic 
1272w, https://substackcdn.com/image/fetch/$s_!ygwv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa256f01f-35e2-4e53-824c-b5233307c37d_893x375.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ygwv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa256f01f-35e2-4e53-824c-b5233307c37d_893x375.heic" width="893" height="375" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a256f01f-35e2-4e53-824c-b5233307c37d_893x375.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:375,&quot;width&quot;:893,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19001,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ygwv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa256f01f-35e2-4e53-824c-b5233307c37d_893x375.heic 424w, https://substackcdn.com/image/fetch/$s_!ygwv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa256f01f-35e2-4e53-824c-b5233307c37d_893x375.heic 848w, https://substackcdn.com/image/fetch/$s_!ygwv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa256f01f-35e2-4e53-824c-b5233307c37d_893x375.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!ygwv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa256f01f-35e2-4e53-824c-b5233307c37d_893x375.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Having queryId, we can build a hash table with queryId as the key. Each entry of this hash table contains a pointer to the head of a list of frozen plans with the same queryId. Instead of passing through all 100 frozen plans, we only need to compare against a small fraction of them. 
Quick experiments have shown that with queryId generated by the standard JumbleQuery technique, we almost always have only one plan in the class with the same queryId.</p><p>Hmm, you might say doubtfully: what about parameterised queries? If we want to use a frozen plan for arbitrary incoming queries, we have to store a parameterised, aka 'generic', plan and employ it to execute incoming queries with constants instead of parameters. How could queryId help us in this case?</p><p>This question didn't have an easy answer. Having experience with the <a href="https://postgrespro.com/docs/enterprise/16/autoprepare#AUTOPREPARE">autoprepare</a> feature, I remember how many hurdles a developer must overcome if all the constants in the plan are replaced with parameters before freezing. What is less obvious is that one parameterised plan may be effective for one set of constants and terrible for another.</p><p>So, we have to know which position in an expression to treat as a parameter. The solution I invented here was trivial: give the choice to the user (with the hidden hope of inventing an analysis procedure in the future to find correlations between parameters and the plans built) and divide the freezing procedure into two stages: registration and sticking into the plan cache.</p><p>Registration tells our extension that the query with a specific queryId is under control, and each plan generated for this query must be nailed down in the backend's plan cache, overwriting the plan built during the previous execution. In our UI, it looks like a query:</p><pre><code>SELECT sr_register_query(query_string [, parameter_type, ...]);

For example:
SELECT sr_register_query('SELECT count(*) FROM a WHERE x = $1');</code></pre><p>Using '$N' in the query, you point out the parameterised parts of the incoming query. The parameter type lets you force the type of each parameter. Registration stores the query text, query tree, and set of parameters (with their positions in the tree) inside the extension memory context. </p><p>Registration impacts only the backend where it was performed. Afterwards, you can play locally with any GUCs, hints, or anything else to achieve the desired query plan without fear of influencing the instance's performance. After that, by executing the following query:</p><pre><code>SELECT sr_plan_freeze();</code></pre><p>you can stick the plan in the local backend, and it will be lazily passed to the plan caches of other backends registered in the same database. Spreading across the instance's backends is relatively trivial: just employ DSM hash tables and a flag to signal backends to check the consistency of their caches with the shared storage. Serialisation/deserialisation routines can convert a query tree, as well as a query plan, to a string and back. But how do we implement parameterisation?</p><p>Easy to say, but hard to do. At first, I changed the queryId generation algorithm to relax the precision of the hash by ignoring the fact that a tree node is a parameter and considering only its data type and position in the query tree. Of course, it means a core patch, but it is only a couple of lines of code. As a result, a parameterised query and a query with constants in place of the parameters have the same queryId, so we can find the frozen plan in the cache.</p><p>The second problem is much more severe. Playing with queries after registration, the user will use queries with specific constant values instead of parameters. After matching queryId, we must prove the identity of the parse trees by calling the <em>equal()</em> routine. 
But <em>equal()</em> can't match the incoming query tree, which contains constants, against the registered parameterised one without invasive changes to the core logic. With only a month left before the deadline, I discovered an essential design trick: before the query tree comparison, just replace Const nodes with the corresponding Param nodes at the positions of the query tree that the user defined manually on registration. To make the text a bit easier to understand, let me illustrate this technique with the following simple picture:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r-Pi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ffbe300-98b7-4e5b-b22a-d9709f5f055e_703x492.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r-Pi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ffbe300-98b7-4e5b-b22a-d9709f5f055e_703x492.heic 424w, https://substackcdn.com/image/fetch/$s_!r-Pi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ffbe300-98b7-4e5b-b22a-d9709f5f055e_703x492.heic 848w, https://substackcdn.com/image/fetch/$s_!r-Pi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ffbe300-98b7-4e5b-b22a-d9709f5f055e_703x492.heic 1272w, https://substackcdn.com/image/fetch/$s_!r-Pi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ffbe300-98b7-4e5b-b22a-d9709f5f055e_703x492.heic 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!r-Pi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ffbe300-98b7-4e5b-b22a-d9709f5f055e_703x492.heic" width="703" height="492" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ffbe300-98b7-4e5b-b22a-d9709f5f055e_703x492.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:492,&quot;width&quot;:703,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:26494,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!r-Pi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ffbe300-98b7-4e5b-b22a-d9709f5f055e_703x492.heic 424w, https://substackcdn.com/image/fetch/$s_!r-Pi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ffbe300-98b7-4e5b-b22a-d9709f5f055e_703x492.heic 848w, https://substackcdn.com/image/fetch/$s_!r-Pi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ffbe300-98b7-4e5b-b22a-d9709f5f055e_703x492.heic 1272w, https://substackcdn.com/image/fetch/$s_!r-Pi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ffbe300-98b7-4e5b-b22a-d9709f5f055e_703x492.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>As you can see, we introduced an abstraction named the 'template query tree', which may be used even more extensively in the future to match queries that differ only in database object aliases.</p><p>This technique definitely adds some overhead and complexity to the code, but remember, we are targeting sophisticated queries where the optimiser fails to build an appropriate plan. Since a good plan saves disk fetches, we can afford to let the backend spend more CPU cycles. 
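</p><p>The Const-to-Param substitution behind the template query tree can be sketched in a few lines of Python. This is a toy model under loud assumptions: nested tuples stand in for PostgreSQL parse tree nodes, a path of child indexes stands in for the registered position of a parameter, and the <code>to_template</code> helper is hypothetical rather than part of the extension.</p><pre><code># Toy node shapes (assumptions, not PostgreSQL structures):
#   ("Var", name), ("Const", type, value), ("Param", type, number),
#   ("Op", name, left, right)


def to_template(node, positions, values, path=()):
    """Copy the tree, turning Const nodes at registered positions into Params."""
    kind = node[0]
    if kind == "Const" and path in positions:
        number = positions[path]
        values[number] = node[2]  # remember the constant for execution
        return ("Param", node[1], number)
    if kind == "Op":
        left = to_template(node[2], positions, values, path + (0,))
        right = to_template(node[3], positions, values, path + (1,))
        return ("Op", node[1], left, right)
    return node


# Registered query: x = $1 AND y = 5, with $1 located at path (0, 1).
registered = ("Op", "AND",
              ("Op", "=", ("Var", "x"), ("Param", "bigint", 1)),
              ("Op", "=", ("Var", "y"), ("Const", "bigint", 5)))
positions = {(0, 1): 1}

# Incoming query: x = 42 AND y = 5.
incoming = ("Op", "AND",
            ("Op", "=", ("Var", "x"), ("Const", "bigint", 42)),
            ("Op", "=", ("Var", "y"), ("Const", "bigint", 5)))

values = {}
matches = to_template(incoming, positions, values) == registered
print(matches, values)
</code></pre><p>Constants outside the registered positions, such as <code>y = 5</code> here, stay as they are, so a query with <code>y = 6</code> would not match this template. The collected <code>values</code> are the constants that later have to reach the executor as parameter values.</p><p>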
Moreover, the size of the overhead mostly depends on the queryId technique: how well it separates queries into classes.</p><p>The third question raised by this technique was even more challenging to answer: having incoming constants and a parameterised plan to execute the query, how do we pass these constants to the executor? The PostgreSQL architecture doesn't allow this directly because specific parameter values are managed at higher levels of the execution machinery. Having spent most of the remaining time on it, I invented one trick: insert a CustomScan node at the top of the frozen plan, which does nothing except alter the set of parameter values in the execution state structure at the beginning of execution. With this approach, the EXPLAIN of a frozen query looks like this:</p><pre><code>EXPLAIN SELECT count(*) FROM a WHERE x = 1::bigint;

Custom Scan (SRScan)
  Plan is: tracked
  Query ID: -5166001356546372387
  Parameters: $1 = 1
  -&gt;  Aggregate
        -&gt;  Seq Scan on a
              Filter: (x = $1)</code></pre><p>As you can see, having such a node has earned us one positive outcome: it informs the user about the state of the query plan. It can also potentially gather additional statistics and use them later to make decisions about unfreezing.</p><p>Afterwards, I ran PostgreSQL with this extension through a series of benchmarks. The most difficult was, of course, pgbench: its queries are so trivial and execute so quickly that our overhead, even the queryId calculation, should stand out. After manually freezing all its queries, I found that pgbench results improved by around 15-25% on average. Amazing!</p><p>Add one more simple trick, storing frozen plans on disk in a dedicated file in the data directory so that they survive crashes and reboots, and the prototype is ready for demonstration. But I forgot about real-life cases: DDL, upgrades, and migrations. If an object mentioned in the plan is altered (for example, a column is added to the table), Postgres marks the plan as 'invalid'. But that is impractical for us: we should unfreeze the query only if it is proven that the plan is totally incorrect in the context of this query and database. To be practical, our extension should survive such disasters.</p><p>In just a few days, I could invent only an obvious solution called the 'validation procedure'.</p><p>Through the validation procedure, the extension checks that the plan can still be applied to a specific query. How does it work? It is relatively simple: just open a subtransaction (to survive errors) and run the query text through the parsing procedure. This is precisely why we store the registered query text. If the resulting query tree is the same as the stored one, it is a good sign that the objects mentioned in the query still exist. So, we only need to check the consistency of the indexes mentioned in the query plan. 
That's enough to mark the query plan as frozen and valid.</p><p>The validation procedure also copes with transaction isolation: some backends can already see the schema changes and 'unfreeze' the query, while others may reuse the frozen plan for the same query. Moreover, a previously invalid plan can be validated again on ROLLBACK, and the query can be returned to a frozen state.</p><p>What is more interesting is that we can try to pass the frozen query plan to other instances using the validation procedure. The technique looks similar to the above: on a new instance, open a subtransaction, deserialise the query tree and plan, execute the parsing procedure for the query text, recalculate queryId (OIDs of objects may be different), and compare the deserialised query tree with the parsed one. If they are identical, check the indexes and probe query execution to ensure nothing was broken. Remember, here we need one more structure: an 'oid &#8594; object name' translation table to identify the OIDs of database objects in a dumped and restored or logically replicated database.</p><p>Of course, in the case of an upgrade, this technique is vulnerable: we can get a SEGFAULT during deserialisation because of a difference in ABI. What's worse, the plan may be deserialised correctly, but the execution state could contain some specific data or logic that was altered in the next Postgres version and is incompatible with the plan. So, this technique looks applicable mostly to migrations between the same versions of the binaries rather than to upgrades.</p><p>Do we have any options to survive an upgrade? Yes, thanks to Michael Paquier and the development team at NTT for inventing the <a href="https://github.com/ossc-db/pg_hint_plan">pg_hint_plan</a> extension. Before the upgrade, we can store each query text with a set of hints, dropping the parse tree and the plan. 
After the upgrade, we should run each query through the parsing and optimisation procedures again, in the hope that the hints will direct the optimiser to build the plan we want.</p><p>That's all I wanted to tell you about this case. Be brave, think openly, and you could invent new directions for DBMS development! As usual, you can play with the extension using the <a href="https://github.com/danolivo/conf/tree/main/2023-PGDay-Israel/bin">binary version</a> for Postgres 15.</p><p>In the end, I urge you to reflect on this post and discuss in the comments how interesting the idea of plan freezing is. What is the prospective scope for this feature? What do you think about plan validation?</p><p>THE END.</p><p>July 28, 2024. Thailand, South Pattaya.</p>]]></content:encoded></item></channel></rss>