Skip to content

Add online shared_buffers resize without server restart#13

Open
NikolayS wants to merge 1630 commits intomasterfrom
claude/add-dsm-registry-helpers-Jg0Hy
Open

Add online shared_buffers resize without server restart#13
NikolayS wants to merge 1630 commits intomasterfrom
claude/add-dsm-registry-helpers-Jg0Hy

Conversation

@NikolayS
Copy link
Copy Markdown
Owner

@NikolayS NikolayS commented Feb 9, 2026

Summary

Allow shared_buffers to be changed at runtime via ALTER SYSTEM + pg_reload_conf() (SIGHUP) without requiring a PostgreSQL restart, when a new GUC max_shared_buffers is set at startup to reserve virtual address space.

  • New max_shared_buffers GUC (PGC_POSTMASTER) reserves VA space at startup via MAP_SHARED|MAP_ANONYMOUS|MAP_NORESERVE, enabling online resize up to that limit with zero steady-state overhead
  • Changed shared_buffers from PGC_POSTMASTER to PGC_SIGHUP with check/assign hooks for runtime validation and resize requests
  • Grow path: commits memory incrementally, initializes new buffer descriptors, updates NBuffers atomically, signals children via SIGHUP
  • Shrink path: reduces NBuffers immediately, bgwriter asynchronously drains condemned buffers (flush dirty, evict unpinned), then decommits memory via MADV_REMOVE
  • SIGHUP-based postmaster/child coordination (postmaster lacks PGPROC, so ProcSignalBarrier cannot be used)
  • Buffer lookup hash table pre-sized for MaxNBuffers to avoid rehashing on grow

Changes

  • buf_resize.c (new, 864 lines): Core resize implementation — VA reservation, incremental commit/decommit, grow/shrink logic, bgwriter drain, GUC hooks
  • buf_resize.h (new): Shared memory control structure (BufPoolResizeCtl), function declarations
  • buf_init.c: Dual-path initialization — uses pre-reserved memory when max_shared_buffers is configured, traditional ShmemInitStruct otherwise
  • postmaster.c: Calls ExecuteBufferPoolResize() after ProcessConfigFile() before signaling children
  • bgwriter.c: Calls BufPoolDrainCondemnedBuffers() after BgBufferSync() each cycle
  • freelist.c: Hash table pre-sized for MaxNBuffers when configured
  • ipci.c: Calls BufferPoolReserveMemory() and BufPoolResizeShmemInit() during shared memory setup
  • guc_parameters.dat: shared_buffers changed to PGC_SIGHUP with hooks; new max_shared_buffers entry
  • globals.c / miscadmin.h: New globals SharedBuffersGUC, MaxNBuffers

Design highlights

  • Zero steady-state overhead: Base pointers (BufferDescriptors, BufferBlocks, etc.) never move; NBuffers remains a plain int
  • Concurrency safety: Memory barriers pair atomic current_buffers writes with descriptor initialization visibility; bgwriter validates drain range under spinlock before decommitting to prevent races with concurrent grow
  • Crash recovery: Standard WAL replay; buffer pool starts at configured shared_buffers on restart
  • Portability: Requires fork() for MAP_SHARED inheritance; EXEC_BACKEND (Windows) emits FATAL if max_shared_buffers is set; MADV_REMOVE with MADV_DONTNEED fallback; MADV_POPULATE_WRITE with manual page-touch fallback for kernels < 5.14
  • Includes design document with prior art analysis (MySQL/InnoDB, Oracle SGA, SQL Server) and 15 edge cases analyzed

Test plan

  • Basic grow/shrink cycles (32MB → 64MB → 128MB → 48MB → 96MB)
  • Boundary conditions (128kB minimum, exceed max rejection)
  • No-op resize (same value)
  • Data integrity under concurrent resize (10k rows, 0 corruption)
  • Heavy pgbench load (12 clients) + 15 resize cycles — 0 failed transactions
  • Rapid oscillation (30 cycles, 200ms delay)
  • Checkpoint during resize — data intact
  • Multiple long-running sessions during resize
  • Crash recovery (immediate shutdown + restart) — data intact
  • Extreme stress: 16 clients, 128kB↔128MB, 50 cycles — 23k+ txns, 0 failures

https://claude.ai/code/session_01BvQguZ27Xd1dpgKCfsuFos

hlinnaka and others added 30 commits December 9, 2025 13:53
This eliminates MultiXactOffset wraparound and the 2^32 limit on the
total number of multixid members. Multixids are still limited to 2^31,
but this is a nice improvement because 'members' can grow much faster
than the number of multixids. On such systems, you can now run longer
before hitting hard limits or triggering anti-wraparound vacuums.

Not having to deal with MultiXactOffset wraparound also simplifies the
code and removes some gnarly corner cases.

We no longer need to perform emergency anti-wraparound freezing
because of running out of 'members' space, so the offset stop limit is
gone. But you might still not want 'members' to consume huge amounts
of disk space. For that reason, I kept the logic for lowering vacuum's
multixid freezing cutoff if a large amount of 'members' space is
used. The thresholds for that are roughly the same as the "safe" and
"danger" thresholds used before, 2 billion transactions and 4 billion
transactions. This keeps the behavior for the freeze cutoff roughly
the same as before. It might make sense to make this smarter or
configurable, now that the threshold is only needed to manage disk
usage, but that's left for the future.

Add code to pg_upgrade to convert multitransactions from the old to
the new format, rewriting the pg_multixact SLRU files. Because
pg_upgrade now rewrites the files, we can get rid of some hacks we had
put in place to deal with old bugs and upgraded clusters. Bump catalog
version for the pg_multixact/offsets format change.

Author: Maxim Orlov <orlovmg@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Discussion: https://www.postgresql.org/message-id/CACG%3DezaWg7_nt-8ey4aKv2w9LcuLthHknwCawmBgEeTnJrJTcw@mail.gmail.com
POSIX has for a long time defined the "j" length modifier for
printf conversions as meaning the size of intmax_t or uintmax_t.
We got away without supporting that so far, because we were not
using intmax_t anywhere.  However, commit e6be843 re-introduced
upstream's use of intmax_t and PRIdMAX into zic.c.  It emerges
that on some platforms (at least FreeBSD and macOS), <inttypes.h>
defines PRIdMAX as "jd", so that snprintf.c falls over if that is
used.  (We hadn't noticed yet because it would only be apparent
if bad data is fed to zic, resulting in an error report, and even
then the only visible symptom is a missing line number in the
error message.)

We could revert that decision from our copy of zic.c, but
on the whole it seems better to update snprintf.c to support
this standard modifier.  There might well be extensions,
now or in future, that expect it to work.

I did this in the lazy man's way of translating "j" to either
"l" or "ll" depending on a compile-time sizeof() check, just
as was done long ago to support "z" for size_t.  One could
imagine promoting intmax_t to have full support in snprintf.c,
for example converting fmtint()'s value argument and internal
arithmetic to use [u]intmax_t not [unsigned] long long.  But
that'd be more work and I'm hesitant to do it anyway: if there
are any platforms out there where intmax_t is actually wider
than "long long", this would doubtless result in a noticeable
speed penalty to snprintf().  Let's not go there until we have
positive evidence that there's a reason to, and some way to
measure what size of penalty we're taking.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3210703.1765236740@sss.pgh.pa.us
As in commit 59d6c03, use <function> rather than <structname> in
the <title> to be consistent with how other functions in this
module are documented.

Oversights in commits dcf7e16 and 9ccc049.

Author: Noboru Saito <noborusai@gmail.com>
Discussion: https://postgr.es/m/CAAM3qn%2B7KraFkCyoJCHq6m%3DurxcoHPEPryuyYeg%3DQ0EjJxjdTA%40mail.gmail.com
Backpatch-through: 18
The new columns, mode and started_by, indicate the vacuum
mode ('normal', 'aggressive', or 'failsafe') and the initiator of the
vacuum ('manual', 'autovacuum', or 'autovacuum_wraparound'),
respectively. This allows users and monitoring tools to better
understand VACUUM behavior.

Bump catalog version.

Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Yu Wang <wangyu_runtime@163.com>
Discussion: https://postgr.es/m/CAOzEurQcOY-OBL_ouEVfEaFqe_md3vB5pXjR_m6L71Dcp1JKCQ@mail.gmail.com
The new column, started_by, indicates the initiator of the
analyze ('manual' or 'autovacuum'), helping users and monitoring tools
to better understand ANALYZE behavior.

Bump catalog version.

Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Yu Wang <wangyu_runtime@163.com>
Discussion: https://postgr.es/m/CAA5RZ0suoicwxFeK_eDkUrzF7s0BVTaE7M%2BehCpYcCk5wiECpw%40mail.gmail.com
Presently, the "echo" and "quiet" variables are carted around to
various functions, which is a bit tedious.  To simplify things,
this commit moves them into the vacuumingOptions struct and removes
the related function parameters.  While at it, remove some
redundant initialization code in vacuumdb's main() function.

This is preparatory work for a follow-up commit that will add a
--dry-run option to vacuumdb.

Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CADkLM%3DckHkX7Of5SrK7g0LokPUwJ%3Dkk8JU1GXGF5pZ1eBVr0%3DQ%40mail.gmail.com
This commit refactors the code for marking a ParallelSlot as idle
to a new static inline function.  This can be used to mark a slot
that was obtained via ParallelSlotGetIdle() but that we don't
intend to actually use for a query as idle again.

This is preparatory work for a follow-up commit that will add a
--dry-run option to vacuumdb.

Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com
Discussion: https://postgr.es/m/CADkLM%3DckHkX7Of5SrK7g0LokPUwJ%3Dkk8JU1GXGF5pZ1eBVr0%3DQ%40mail.gmail.com
This new option instructs vacuumdb to print, but not execute, the
VACUUM and ANALYZE commands that would've been sent to the server.

Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/CADkLM%3DckHkX7Of5SrK7g0LokPUwJ%3Dkk8JU1GXGF5pZ1eBVr0%3DQ%40mail.gmail.com
PostgreSQL's src/port/open.c has always set bInheritHandle = TRUE
when opening files on Windows, making all file descriptors inheritable
by child processes.  This meant the O_CLOEXEC flag, added to many call
sites by commit 1da569c (v16), was silently ignored.

The original commit included a comment suggesting that our open()
replacement doesn't create inheritable handles, but it was a mis-
understanding of the code path.  In practice, the code was creating
inheritable handles in all cases.

This hasn't caused widespread problems because most child processes
(archive_command, COPY PROGRAM, etc.) operate on file paths passed as
arguments rather than inherited file descriptors.  Even if a child
wanted to use an inherited handle, it would need to learn the numeric
handle value, which isn't passed through our IPC mechanisms.

Nonetheless, the current behavior is wrong.  It violates documented
O_CLOEXEC semantics, contradicts our own code comments, and makes
PostgreSQL behave differently on Windows than on Unix.  It also creates
potential issues with future code or security auditing tools.

To fix, define O_CLOEXEC to _O_NOINHERIT in master, previously used by
O_DSYNC.  We use different values in the back branches to preserve
existing values.  In pgwin32_open_handle() we set bInheritHandle
according to whether O_CLOEXEC is specified, for the same atomic
semantics as POSIX in multi-threaded programs that create processes.

Backpatch-through: 16
Author: Bryan Green <dbryan.green@gmail.com>
Co-authored-by: Thomas Munro <thomas.munro@gmail.com> (minor adjustments)
Discussion: https://postgr.es/m/e2b16375-7430-4053-bda3-5d2194ff1880%40gmail.com
The idea is to encourage more the use of these new routines across the
tree, as these offer stronger type safety guarantees than palloc().
This batch of changes includes most of the trivial changes suggested by
the author for src/backend/.

A total of 334 files are updated here.  Among these files, 48 of them
have their build change slightly; these are caused by line number
changes as the new allocation formulas are simpler, shaving around 100
lines of code in total.

Similar work has been done in 0c3c5c3 and 31d3847.

Author: David Geier <geidav.pg@gmail.com>
Discussion: https://postgr.es/m/ad0748d4-3080-436e-b0bc-ac8f86a3466a@gmail.com
A comment in tuplesort.c was claiming that the code was defining
INITIAL_MEMTUPSIZE so that it *does not* exceed
ALLOCSET_SEPARATE_THRESHOLD, but the code actually ensures that we
purposefully *do* exceed ALLOCSET_SEPARATE_THRESHOLD for the initial
allocation of the tuples array, as per reasons detailed in the
commentary of grow_memtuples().

Also, there's not much need to repeat the mention about
ALLOCSET_SEPARATE_THRESHOLD in each location where INITIAL_MEMTUPSIZE is
used, so remove those comments.

Author: ChangAo Chen <cca5507@qq.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: David G. Johnston <david.g.johnston@gmail.com>
Discussion: https://postgr.es/m/tencent_6FA14F85D6B5B5291532D6789E07F4765C08%40qq.com
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/CABPTF7V8CbOXGePqrad6EH3Om7DRhNiO3C0rQ-62UuT7RdU-GQ@mail.gmail.com
The warning was showing up in the early stages of the meson build, when
the contents of Makefile.global is generated based on the configuration
of meson for PGXS.

NM is added to pgxs_empty.  This declaration is only used internally for
the libpq sanity check, so there is no point in exposing it in PGXS.

Oversight in 4a8e6f4.

Reported-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/4423e01f-1e52-4f47-a6ca-05cc8081c888@eisentraut.org
REGRESS has forgotten about the test nbtree_half_dead_pages, and a
.gitignore was missing from the module.

Oversights in c085aab for REGRESS and 1e4e578 for the missing
.gitignore.

Discussion: https://postgr.es/m/aTipJA1Y1zVSmH3H@paquier.xyz
Buildfarm members skimmer and crake have reported that pg_upgrade
running from v18 fails due to the changes of d52c24b, with the
expectations that the objects removed in the test module
injection_points should still be present post upgrades, but the test
module does not have them anymore.

The origin of the issue is that the following test modules depend on
injection_points, but they do not drop the extension once the tests
finish, leaving its traces in the dumps used for the upgrades:
- gin, down to v17
- typcache, down to v18
- nbtree, HEAD-only
Test modules have no upgrade requirements, as they are used only for..
Tests, so there is no point in keeping them around.

An alternative solution would be to drop the databases created by these
modules in AdjustUpgrade.pm, but the solution of this commit to drop the
extension is simpler.  Note that there would be a catch if using a
solution based on AdjustUpgrade.pm as the database name used for the
test runs differs between configure and meson:
- configure relies on USE_MODULE_DB for the database name unicity, that
would build a database name based on the *first* entry of REGRESS, that
lists all the SQL tests.
- meson relies on a "name" field.

For example, for the test module "gin", the regression database is named
"regression_gin" under meson, while it is more complex for configure, as
of "contrib_regression_gin_incomplete_splits".  So a AdjustUpgrade.pm
would need a set of DROP DATABASE IF EXISTS to solve this issue, to cope
with each build system.

The failure has been caused by d52c24b, and the problem can happen
with upgrade dumps from v17 and v18 to HEAD.  This problem is not
currently reachable in the back-branches, but it could be possible that
a future change in injection_points in stable branches invalidates this
theory, so this commit is applied down to v17 in the test modules that
matter.

Per discussion with Tom Lane and Heikki Linnakangas.

Discussion: https://postgr.es/m/2899652.1765167313@sss.pgh.pa.us
Backpatch-through: 17
pthread_exit() is added to the list of symbols allowed when building
libpq.  This has been reported as possible when libpq is statically
linked to libcrypto, where pthread_exit() could be called.

Reported-by: Torsten Rupp <torsten.rupp@gmx.net>
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/19095-6d8256d0c37d4be2@postgresql.org
These functions took a ResourceOwner argument, but only checked if it
was NULL, and then used CurrentResourceOwner for the actual work.
Surely the intention was to use the passed-in resource owner. All
current callers passed CurrentResourceOwner or NULL, so this has no
consequences at the moment, but it's an accident waiting to happen for
future caller and extensions.

Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAEze2Whnfv8VuRZaohE-Af+GxBA1SNfD_rXfm84Jv-958UCcJA@mail.gmail.com
Backpatch-through: 17
This function gets the list of relations associated with the
publication but the comment said the opposite.

Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/CANhcyEV3C_CGBeDtjvKjALDJDMH-Uuc9BWfSd=eck8SCXnE=fQ@mail.gmail.com
Phase I vacuum gives the page a once-over after pruning and freezing to
check that the values of all_visible and all_frozen agree with the
result of heap_page_is_all_visible(). This is meant to keep the logic in
phase I for determining visibility in sync with the logic in phase III.

Rewrite the assertion to avoid an Assert(false).

Suggested by Andres Freund.

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/mhf4vkmh3j57zx7vuxp4jagtdzwhu3573pgfpmnjwqa6i6yj5y%40sy4ymcdtdklo
The comment above heap_xlog_visible() about the critical integrity
requirement for PD_ALL_VISIBLE and the visibility map should also be in
heap_xlog_prune_freeze() where we set PD_ALL_VISIBLE.

Oversight in add323d

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
1. The test initially focuses on the "parent" table, then switches to
the "part" table, and goes back to the "parent" table. That seems a
little weird, so move the tests around so that all the commands on the
"parent" table are done first, followed by the "part" table.

2. ALTER TABLE ALTER COLUMN SET EXPRESSION was not tested, so add
that.

Author: jian he <jian.universality@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/CACJufxFDi7fnwB-8xXd_ExML7-7pKbTaK4j46AJ=4-14DXvtVg@mail.gmail.com
The test seemed to incorrectly think that query_safe() takes an
argument that describes what the query does, similar to e.g.
command_ok(). Until commit bd8d9c9 the extra arguments were
harmless and were just ignored, but when commit bd8d9c9 introduced
a new optional argument to query_safe(), the extra arguments started
clashing with that, causing the test to fail.

Backpatch to v17, that's the oldest branch where the test exists. The
extra arguments didn't cause any trouble on the older branches, but
they were clearly bogus anyway.
It's only useful for an ILIKE optimization for the libc provider using
a single-byte encoding and a non-C locale, but it creates significant
internal complexity.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com
True if character has multiple case forms. Will be a useful
multibyte-aware replacement for char_is_cased().

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com
Always return TIDs in descending order when returning groups of TIDs
from an nbtree posting list tuple during nbtree backwards scans.  This
makes backwards scans tend to require fewer buffer hits, since the scan
is less likely to repeatedly pin and unpin the same heap page/buffer
(we'll get exactly as many buffer hits as we get with a similar forwards
scan case).

Commit 0d861bb, which added nbtree deduplication, originally did things
this way to avoid interfering with _bt_killitems's approach to setting
LP_DEAD bits on posting list tuples.  _bt_killitems makes a soft
assumption that it can always iterate through posting lists in ascending
TID order, finding corresponding killItems[]/so->currPos.items[] entries
in that same order.  This worked out because of the prior _bt_readpage
backwards scan behavior.  If we just changed the backwards scan posting
list logic in _bt_readpage, without altering _bt_killitems itself, it
would break its soft assumption.

Avoid that problem by sorting the so->killedItems[] array at the start
of _bt_killitems.  That way the order that dead items are saved in from
btgettuple can't matter; so->killedItems[] will always be in the same
order as so->currPos.items[] in the end.  Since so->currPos.items[] is
now always in leaf page order, regardless of the scan direction used
within _bt_readpage, and since so->killedItems[] is always in that same
order, the _bt_killitems loop can continue to make a uniform assumption
about everything being in page order.  In fact, sorting like this makes
the previous soft assumption about item order into a hard invariant.

Also deduplicate the so->killedItems[] array after it is sorted.  That
way there's no risk of the _bt_killitems loop becoming confused by a
duplicate dead item/TID.  This was possible in cases that involved a
scrollable cursor that encountered the same dead TID more than once
(within the same leaf page/so->currPos context).  This doesn't come up
very much in practice, but it seems best to be as consistent as possible
about how and when _bt_killitems will LP_DEAD-mark index tuples.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Mircea Cadariu <cadariu.mircea@gmail.com>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=Wut2pKvbW-u3hJ_LXwsYeiXHiW8oN1GfbKPavcGo8Ow@mail.gmail.com
Although clang claims to be compatible with gcc's printf format
archetypes, this appears to be a falsehood: it likes __syslog__
(which gcc does not, on most platforms) and doesn't accept
gnu_printf.  This means that if you try to use gcc with clang++
or clang with g++, you get compiler warnings when compiling
printf-like calls in our C++ code.  This has been true for quite
awhile, but it's gotten more annoying with the recent appearance
of several buildfarm members that are configured like this.

To fix, run separate probes for the format archetype to use with the
C and C++ compilers, and conditionally define PG_PRINTF_ATTRIBUTE
depending on __cplusplus.

(We could alternatively insist that you not mix-and-match C and
C++ compilers; but if the case works otherwise, this is a poor
reason to insist on that.)

No back-patch for now, but we may want to do that if this
patch survives buildfarm testing.

Discussion: https://postgr.es/m/986485.1764825548@sss.pgh.pa.us
An array of LLVMBasicBlockRef is allocated with the size used for an
element being "LLVMBasicBlockRef *" rather than "LLVMBasicBlockRef".
LLVMBasicBlockRef is a type that refers to a pointer, so this did not
directly cause a problem because both should have the same size, still
it is incorrect.

This issue has been spotted while reviewing a different patch, and
exists since 2a0faed, so backpatch all the way down.

Discussion: https://postgr.es/m/CA+hUKGLngd9cKHtTUuUdEo2eWEgUcZ_EQRbP55MigV2t_zTReg@mail.gmail.com
Backpatch-through: 14
MasaoFujii and others added 27 commits January 7, 2026 13:58
GUC is a commonly used term but previously appeared only
in the acronym documentation. This commit adds glossary and
documentation index entries for GUC to make it easier to find
and understand.

Author: Robert Treat <rob@xzilla.net>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CABV9wwPQnkeo_G6-orMGnHPK9SXGVWm7ajJPzsbE6944tDx=hQ@mail.gmail.com
Followup to 44f4951.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CAHGQGwH_UfN96vcvLGA%3DYro%2Bo6qCn0nEgEGoviwzEiLTHtt2Pw%40mail.gmail.com
Backpatch-through: 18
Introduced in b139bd3.

Reported-by: Man Zeng <zengman@halodbtech.com>
Discussion: https://postgr.es/m/tencent_121C1BB152CAF3195C99D56C@qq.com
During catcache searches, the most-recently searched entries are kept at
the head of the list to speed up subsequent searches, keeping the
"freshest" entries at its beginning.  A rehash of the catcache was doing
the opposite: fresh entries were moved to the tail of the newly-created
buckets, causing a rehash to slow down a bit.

When a rehash is done, this commit switches the code to use
dlist_push_tail() instead of dlist_push_head(), so as fresh entries are
kept at the head of the lists, not their tail.

Author: ChangAo Chen <cca5507@qq.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/tencent_9EA10D8512B5FE29E7323F780A0749768708@qq.com
Commit c7eab0e changed the default password_encryption setting to
'scram-sha-256', so update the example for creating a user with an
assigned password.

In addition, commit 08951a7 added new options that in turn pass
default tokens NOBYPASSRLS and NOREPLICATION to the CREATE ROLE
command, so fix this omission as well for v16 and later.

Reported-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/cff1ea60-c67d-4320-9e33-094637c2c4fb%40iki.fi
Backpatch-through: 14
…up()"

This reverts commit f30848c.  Due to
buildfarm failures related to recovery conflicts while executing the
WAIT FOR command.  It appears that WAIT FOR still holds xmin for a while.

Reported-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2BL0OQR8dQtsNBSaA3FNNyQeFabdeRVzurm0b7Xtp592g%40mail.gmail.com
The USER and IN GROUP clauses of CREATE ROLE are deprecated, and
commit 8e78f0a removed them from the CREATE ROLE syntax syntax
synopsis in the docs. However, previously CREATE USER and
CREATE GROUP docs still listed these clauses.

Since CREATE USER is equivalent to CREATE ROLE ... WITH LOGIN and
CREATE GROUP is equivalent to CREATE ROLE, their documented syntax
synopsis should match CREATE ROLE to avoid confusion.

Therefore this commit removes the deprecated USER and IN GROUP
clauses from the CREATE USER and CREATE GROUP syntax synopsis
in the docs.

Author: Japin Li <japinli@hotmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/MEAPR01MB3031C30E72EF16CFC08C8565B687A@MEAPR01MB3031.ausprd01.prod.outlook.com
Fix comments that incorrectly described transformations performed by the
"Avoid extra index searches through preprocessing" mechanism introduced
by commit b3f1a13.

Author: Yugo Nagata <nagata@sraoss.co.jp>
Reviewed-By: Chao Li <li.evan.chao@gmail.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/20251230190145.c3c88c5eb0f88b136adda92f@sraoss.co.jp
Backpatch-through: 18
This commit does the following to get tests passing for
MSVC/AArch64:

* Implements spin_delay() with an ISB instruction (like we do for
gcc/clang on AArch64).

* Sets USE_ARMV8_CRC32C unconditionally.  Vendor-supported versions
of Windows for AArch64 require at least ARMv8.1, which is where CRC
extension support became mandatory.

* Implements S_UNLOCK() with _InterlockedExchange().  The existing
implementation for MSVC uses _ReadWriteBarrier() (a compiler
barrier), which is insufficient for this purpose on non-TSO
architectures.

There are likely other changes required to take full advantage of
the hardware (e.g., atomics/arch-arm.h, simd.h,
pg_popcount_aarch64.c), but those can be dealt with later.

Author: Niyas Sait <niyas.sait@linaro.org>
Co-authored-by: Greg Burd <greg@burd.me>
Co-authored-by: Dave Cramer <davecramer@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Tested-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/A6152C7C-F5E3-4958-8F8E-7692D259FF2F%40greg.burd.me
Discussion: https://postgr.es/m/CAFPTBD-74%2BAEuN9n7caJ0YUnW5A0r-KBX8rYoEJWqFPgLKpzdg%40mail.gmail.com
The isolation tests for INSERT ON CONFLICT behavior during CREATE INDEX
CONCURRENTLY and REINDEX CONCURRENTLY (added by bc32a12,
2bc7e88, and 90eae92) were disabled in 77038d6 due to
persistent CI flakiness, after several attempts at stabilization.

This commit removes them and introduces a TAP test in test_misc module
(010_index_concurrently_upsert.pl) that covers the same scenarios.  This
new test should hopefully be more stable while providing assurance that
the fixes in all those commits (plus 81f7211) continue to work.

Author: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reported-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/ccssrhafzbp3a3beju3ptyc56a7gbfimj4vwkbokoldofckrc7@bso37rxskjtf
Discussion: https://postgr.es/m/CANtu0ogv+6wqRzPK241jik4U95s1pW3MCZ3rX5ZqbFdUysz7Qw@mail.gmail.com
Discussion: https://postgr.es/m/202512112014.icpomgc37zx4@alvherre.pgsql
This commit adds tab completion support for the keywords "pstdin" and
"pstdout" in the \copy command. "pstdin" is now suggested after FROM,
and "pstdout" is suggested after TO, alongside filenames and other
keywords.

Author: Yugo Nagata <nagata@sraoss.co.jp>
Reviewed-by: Srinath Reddy Sadipiralla <srinath2133@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/20251231183953.95132e171e43abd5e9b78084@sraoss.co.jp
…tion

When repurposing a standby to a logical replica, pg_createsubscriber
uses for the new replica a set of configuration parameters saved into
postgresql.auto.conf, to force recovery patterns when the physical
replica is promoted.

While not wrong in practice, this approach can cause issues when forcing
again recovery on a logical replica or its base backup as the recovery
parameters are not reset on the target server once pg_createsubscriber
is done with the node.

This commit aims at improving the situation, by changing the way
recovery parameters are saved on the target node.  Instead of writing
all the configuration to postgresql.auto.conf, this file now uses an
include_if_exists, that points to a pg_createsubscriber.conf.  This new
file contains all the recovery configuration, and is renamed to
pg_createsubscriber.conf.disabled when pg_createsubscriber exits.  This
approach resets the recovery parameters, and offers the benefit to keep
a trace of the setup used when the target node got promoted, for
debugging purposes.  If pg_createsubscriber.conf cannot be renamed
(unlikely scenario), a warning is issued to inform users that a manual
intervention may be required to reset this configuration.

This commit includes a test case to demonstrate the problematic case: a
standby node created from a base backup of what was the target node of
pg_createsubscriber does not get confused when started.  If removing
this new logic, the test fails with the standby not able to start due
to an incorrect recovery target setup, where the startup process fails
quickly with a FATAL.

I have provided the design idea for the patch, that Alyona has written
(with some code adjustments from me).  This could be considered as a
bug, but after discussion this is put into the bucket for improvements.
Redesigning pg_createsubscriber would not be acceptable in the stable
branches anyway.

Author: Alyona Vinter <dlaaren8@gmail.com>
Reviewed-by: Ilyasov Ian <ianilyasov@outlook.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Andrey Rudometov <unlimitedhikari@gmail.com>
Discussion: https://postgr.es/m/CAGWv16K6L6Pzm99i1KiXLjFWx2bUS3DVsR6yV87-YR9QO7xb3A@mail.gmail.com
Remove all configure checks and workarounds for strnlen() missing.  It
is required by POSIX 2008.

Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/98ce805c-6103-421b-adc3-fcf8f3dddbe3%40eisentraut.org
rindex() has been removed from POSIX 2008.  Replace the one remaining
use with the equivalent and more standard strrchr().

Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/98ce805c-6103-421b-adc3-fcf8f3dddbe3%40eisentraut.org
Author: "Dewei Dai" <daidewei1970@163.com>
Author: zengman <zengman@halodbtech.com>
Author: Zhiyuan Su <suzhiyuan_pg@126.com>
Discussion: https://postgr.es/m/2026010719201902382410@163.com
Discussion: https://postgr.es/m/tencent_4DC563C83443A4B1082D2BFF@qq.com
Discussion: https://postgr.es/m/44656d72.2a63.19b9b92b0a3.Coremail.suzhiyuan_pg@126.com
The only user-visible change is the fix in the "malformed
pg_dependencies" error detail. That one is new in commit e1405aa,
so no backpatching required.
btree_gist's gist_inet_ops and gist_cidr_ops opclasses are
fundamentally broken: they rely on an approximate representation of
the inet values and hence sometimes miss rows they should return.
We want to eventually get rid of them altogether, but as the first
step on that journey, we should mark them not-opcdefault.

To do that, roll up the preceding deltas since 1.2 into a new base
script btree_gist--1.9.sql.  This will allow installing 1.9 without
going through a transient situation where gist_inet_ops and
gist_cidr_ops are marked as opcdefault; trying to create them that
way will fail if there's already a matching default opclass in the
core system.  Additionally provide btree_gist--1.8--1.9.sql, so
that a database that's been pg_upgraded from an older version can
be migrated to 1.9.

I noted along the way that commit 57e3c51 had missed marking the
gist_bool_ops support functions as PARALLEL SAFE.  While that probably
has little harmful effect (since AFAIK we don't check that when
calling index support functions), this seems like a good time to make
things consistent.

Readers will also note that I removed the former habit of installing
some opclass operators/functions with ALTER OPERATOR FAMILY, instead
just rolling them all into the CREATE OPERATOR CLASS steps.  The
comment in btree_gist--1.2.sql that it's necessary to use ALTER for
pg_upgrade reproducibility has been obsolete since we invented the
amadjustmembers infrastructure.  Nowadays, gistadjustmembers will
force all operators and non-required support functions to have "soft"
opfamily dependencies, regardless of whether they are installed by
CREATE or ALTER.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/2483812.1754072263@sss.pgh.pa.us
This patch completes the transition to making inet_ops be default for
inet/cidr columns, rather than btree_gist's opclasses.  Once we do
that, though, pg_upgrade has a big problem.  A dump from an older
version will see btree_gist's opclasses as being default, so it will
not mention the opclass explicitly in CREATE INDEX commands, which
would cause the restore to create the indexes using inet_ops.  Since
that's not compatible with what's actually in the files, havoc would
ensue.

This isn't readily fixable, because the CREATE INDEX command strings
are built by the older server's pg_get_indexdef() function; pg_dump
hasn't nearly enough knowledge to modify those strings successfully.
Even if we cared to put in the work to make that happen in pg_dump,
it would be counterproductive because the end goal here is to get
people off of these opclasses.  Allowing such indexes to persist
through pg_upgrade wouldn't advance that goal.

Therefore, this patch just adds code to pg_upgrade to detect indexes
that would be problematic and refuse to upgrade.

There's another issue too: even without any indexes to worry about,
pg_dump in binary-upgrade mode will reproduce the "CREATE OPERATOR
CLASS ... DEFAULT" commands for btree_gist's opclasses, and those
will fail because now we have a built-in opclass that provides a
conflicting default.  We could ask users to drop the btree_gist
extension altogether before upgrading, but that would carry very
severe penalties.  It would affect perfectly-valid indexes for other
data types, and it would drop operators that might be relied on in
views or other database objects.  Instead, put a hack in DefineOpClass
to ignore the DEFAULT clauses for these opclasses when in
binary-upgrade mode.  This will result in installing a version of
btree_gist that isn't quite the version it claims to be, but that can
be fixed by issuing ALTER EXTENSION UPDATE afterwards.

Since we don't apply that hack when not in binary-upgrade mode,
it is now impossible to install any version of btree_gist
less than 1.9 via CREATE EXTENSION.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/2483812.1754072263@sss.pgh.pa.us
In the wake of the previous commit, this script will fail
if executed via CREATE EXTENSION, so it's useless.  Remove it,
but keep the delta scripts, allowing old (even very old)
versions of the btree_gist SQL objects to be upgraded to 1.9
via ALTER EXTENSION UPDATE after a pg_upgrade.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/2483812.1754072263@sss.pgh.pa.us
Comprehensive design proposal for making shared_buffers dynamically
adjustable via SIGHUP without requiring a PostgreSQL restart. Covers:

- Detailed analysis of all NBuffers-dependent data structures and code paths
- Cross-system prior art (MySQL/InnoDB, Oracle SGA, SQL Server, MariaDB)
- Phased implementation: virtual address reservation, online grow, online
  shrink, dynamic hash table resizing
- ProcSignalBarrier-based coordination protocol
- 15 edge/corner cases analyzed (concurrent resize, crash recovery, pinned
  condemned buffers, huge pages, AIO interactions, etc.)
- Portability layer for Linux, FreeBSD, macOS, Windows
- Performance impact analysis with zero steady-state overhead goal
- Testing strategy covering unit, concurrency, crash recovery, and stress
- References to Dmitry Dolgov's active pgsql-hackers RFC patch series

https://claude.ai/code/session_01BvQguZ27Xd1dpgKCfsuFos
Add infrastructure for dynamically resizing the shared buffer pool
via SIGHUP, without requiring a PostgreSQL restart. This implements
Phases 1-3 from the design document.

Key changes:

New GUC: max_shared_buffers (PGC_POSTMASTER)
  When set > shared_buffers, reserves virtual address space at startup
  for buffer pool arrays, enabling online resize up to that limit.
  Default 0 means same as shared_buffers (no online resize, preserving
  current behavior).

Changed GUC: shared_buffers (PGC_POSTMASTER -> PGC_SIGHUP)
  Now dynamically adjustable via ALTER SYSTEM + pg_reload_conf() when
  max_shared_buffers is configured. Check/assign hooks validate limits
  and initiate resize requests.

Buffer pool memory management (buf_resize.c):
  - BufferPoolReserveMemory(): reserves VA space via MAP_SHARED|MAP_NORESERVE
  - Grow: commits memory, initializes new descriptors, emits barrier
  - Shrink: drains condemned buffers (flush dirty, wait for unpins),
    emits barrier, decommits memory via MADV_DONTNEED
  - Zero steady-state overhead: base pointers never change, NBuffers
    remains a plain int updated via ProcSignalBarrier

Coordination protocol:
  - New PROCSIGNAL_BARRIER_BUFFER_POOL_RESIZE barrier type
  - ProcessBarrierBufferPoolResize() updates backend-local NBuffers
  - Serialized resize operations with progress tracking

Hash table pre-sizing:
  - When max_shared_buffers > shared_buffers, the buffer lookup hash
    table is pre-sized for MaxNBuffers, avoiding rehashing on grow

Separate memory allocation path:
  - When max_shared_buffers is configured, buffer arrays are allocated
    from separately-mapped memory regions instead of main shmem segment
  - BufferManagerShmemSize() excludes array sizes from main shmem
  - Global pointers (BufferDescriptors, BufferBlocks, etc.) remain
    stable across resize operations

https://claude.ai/code/session_01BvQguZ27Xd1dpgKCfsuFos
Major fixes to make the online buffer pool resize actually work:

1. GUC variable indirection: The GUC variable for shared_buffers was
   pointing directly at NBuffers, which caused the GUC mechanism to
   overwrite NBuffers on SIGHUP before any actual resize occurred.
   Introduce SharedBuffersGUC as the GUC target variable. NBuffers is
   now only updated by the resize code (or at startup).

2. Postmaster/child coordination: The assign hook now properly
   distinguishes between postmaster (requests resize) and child
   processes (read current_buffers from shared memory). Replaced
   ProcSignalBarrier approach with SIGHUP-based coordination since
   the postmaster lacks a PGPROC for ConditionVariable waits.

3. ExecuteBufferPoolResize in postmaster loop: Added call to
   process_pm_reload_request() after ProcessConfigFile() but before
   SignalChildren(), ensuring the resize completes before children
   update their NBuffers.

4. MADV_POPULATE_WRITE fallback: Handle older kernels (pre-5.14) that
   don't support MADV_POPULATE_WRITE by falling back to manual page
   touching for memory commit.

Tested: grow (32MB->64MB->128MB), shrink (128MB->48MB), grow-back
(48MB->96MB), exceed-max rejection, resize under concurrent pgbench
load (zero failed transactions), clean shutdown/restart persistence.

https://claude.ai/code/session_01BvQguZ27Xd1dpgKCfsuFos
…writer drain

Address code review findings across all severity levels:

CRITICAL:
- Add pg_read_barrier() in child processes after reading current_buffers
  atomic to pair with pg_write_barrier() in GrowBufferPool/ShrinkBufferPool,
  ensuring descriptor initialization is visible on ARM/POWER architectures
- Add rollback logic to BufferPoolCommitMemory() for partial
  MADV_POPULATE_WRITE failures (release already-committed pages via
  MADV_DONTNEED when a later array fails)

HIGH:
- Replace bare arithmetic with mul_size()/add_size() throughout
  BufferPoolReserveMemory, BufferPoolCommitMemory, BufferPoolDecommitMemory
  to detect integer overflow on 32-bit systems
- Fix stale comment claiming MAP_PRIVATE when code uses MAP_SHARED
- Fix header comment claiming BufResizeLock (LWLock) when struct uses
  mutex spinlock
- Update file header comment to describe SIGHUP-based coordination
  instead of removed ProcSignalBarrier approach

MEDIUM:
- Remove unused includes (postmaster/bgwriter.h, storage/ipc.h,
  storage/pg_shmem.h, storage/lwlock.h)
- Add comment on magic number 16 in Assert (matches GUC minimum)

Move buffer eviction from postmaster to bgwriter:
- ShrinkBufferPool now only updates NBuffers and records condemned range
- New BufPoolDrainCondemnedBuffers() runs in bgwriter main loop with
  full backend infrastructure (ResourceOwner, private refcounts)
- Fixes SIGSEGV crash when EvictUnpinnedBuffer was called from postmaster

https://claude.ai/code/session_01BvQguZ27Xd1dpgKCfsuFos
…sues

Round 2 code review fixes addressing findings from 5 focused review agents.

CRITICAL fixes:
- Bgwriter stale decommit race: BufPoolDrainCondemnedBuffers now validates
  that status is still BUF_RESIZE_DRAINING and drain_from/drain_to match
  cached values before decommitting. Prevents MADV_REMOVE from destroying
  pages that a concurrent grow has reclaimed for active buffers.
- GrowBufferPool active descriptor race: Skip reinitialization of descriptors
  that still have BM_TAG_VALID set from a cancelled shrink, as backends may
  hold active pins. Such buffers are naturally reused by clock sweep.
- EXEC_BACKEND guard: Emit FATAL on platforms using EXEC_BACKEND (Windows)
  when max_shared_buffers is configured, since fork()-based MAP_SHARED
  inheritance is required.

HIGH fixes:
- Switch decommit to MADV_REMOVE for buffer blocks (always page-aligned).
  MADV_DONTNEED on MAP_SHARED|MAP_ANONYMOUS only zaps PTEs without
  releasing shmem-backed pages. MADV_REMOVE punches holes in the shmem
  backing store. Falls back to MADV_DONTNEED if MADV_REMOVE unavailable.
- Make BufferPoolCommitMemory incremental: only commit [old, new) range
  instead of [0, new). Avoids re-touching live pages and ensures rollback
  on partial failure only affects the new range.

MEDIUM fixes:
- Remove dead BUF_RESIZE_COMPLETING enum value (never set anywhere)
- Fix remaining overcount in drain: only count buffers that fail eviction,
  not all valid buffers before attempting eviction

https://claude.ai/code/session_01BvQguZ27Xd1dpgKCfsuFos
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.