Summary
posix_fini removes a brick from the shared disk_space_check thread using a direct diskxl_count decrement instead of the safe removal protocol (delete_posix_diskxl). This causes two bugs:
-
Use-after-free (brick multiplexing + init failure): The brick's posix_diskxl entry stays on the shared iteration list after posix_fini frees priv. The disk thread continues visiting the freed brick's entry, dereferencing priv->base_path, priv->disk_reserve, and priv->disk_unit_percent.
-
Reference count double-decrement (all multi-brick shutdowns): When both posix_notify(PARENT_DOWN) and posix_fini run (normal operational shutdown), ctx->diskxl_count is decremented twice per brick. The uint32_t underflow to UINT32_MAX prevents a double pthread_join by arithmetic coincidence, but the ref-count is structurally wrong.
How to reproduce
Bug 1: UAF on init failure in brick mux
- Enable brick multiplexing:
gluster volume set all cluster.brick-multiplex on
- Start a volume with 2+ bricks on the same node
- Cause one brick's
posix_init to fail after posix_spawn_disk_space_check_thread succeeds but before init completes (e.g., remove the brick directory's volume-id xattr, or make the health check test directory unwritable)
- The process continues serving other bricks. The disk_space_check thread visits the failed brick's freed
priv pointer.
The race window is narrow (between the failed brick registering with the disk thread and the cleanup thread's PARENT_DOWN delivery), but the UAF is structurally present.
Bug 2: double-decrement on normal shutdown
- Run any multi-brick glusterfsd (with or without brick mux)
- Stop the volume:
gluster volume stop <vol>
posix_notify(PARENT_DOWN) calls delete_posix_diskxl → decrements diskxl_count
posix_fini directly decrements diskxl_count again
- With 2 bricks:
diskxl_count goes 2→1 (PARENT_DOWN) →0 (fini, joins thread) → underflow to UINT32_MAX (brick 2's PARENT_DOWN)
No crash occurs because UINT32_MAX != 0, so the count == 0 join check doesn't trigger again. The thread was already joined by brick 1's fini.
Root cause and history
Commit 3f93be77e1 (Mohit Agrawal, 2020-11-09, PR #1595, fixes #1482) refactored the disk_space_check thread from per-brick to per-process to solve a real scaling problem: with brick mux at 250 bricks, 250 threads were spawned just for disk space checks.
Before the refactor, the thread was per-brick (priv->disk_space_check), and posix_fini cleaned it up the same way as all per-brick threads — set a flag, cancel, join:
if (priv->disk_space_check) {
priv->disk_space_check_active = _gf_false;
(void)gf_thread_cleanup_xint(priv->disk_space_check);
priv->disk_space_check = 0;
}
Simple and correct — no shared state, no ref-counting, no iteration list.
The refactor introduced two code paths for removing a brick from the now-shared thread:
Safe path (delete_posix_diskxl):
- Sets
pxl->detach_notify = true
- Waits for
pxl->is_use == false (disk thread signals when done with this brick)
- Removes pxl from
ctx->diskth_xl list
- Sets
priv->pxl = NULL (idempotent guard for subsequent calls)
- Decrements
diskxl_count
- Frees pxl struct
- Joins thread if last brick
Called by: posix_notify(PARENT_DOWN) (line 213), posix_reconfigure (line 426).
Shortcut in posix_fini (lines 1301-1312):
- Directly decrements
diskxl_count
- Joins thread if last brick
- Does NOT: set
detach_notify, wait for is_use, remove from list, set priv->pxl = NULL, free pxl
The author correctly used the safe function for paths where the process stays alive (PARENT_DOWN, reconfigure), and the shortcut for posix_fini — the terminal cleanup path. The implicit assumption was that by the time posix_fini runs, PARENT_DOWN has already safely removed the brick from the list, so the full handshake is unnecessary overhead and fini only needs to handle the ref-count for the shared join.
This assumption holds for normal operational shutdown (Path A) and brick mux detach (Path C), where PARENT_DOWN always precedes fini. It does not hold for init-failure (Path B), where posix_fini runs without a prior PARENT_DOWN, and the brick's pxl entry stays on the iteration list while priv gets freed.
The idempotent guard (priv->pxl == NULL check at line 132) that makes delete_posix_diskxl safe to call from both paths was already present in the function the author wrote — it just wasn't used from fini.
Fix
Replace the 12-line unsafe block in posix_fini with a single call to the existing safe function:
- pthread_mutex_lock(&ctx->xl_lock);
- {
- count = --ctx->diskxl_count;
- if (count == 0)
- pthread_cond_signal(&ctx->xl_cond);
- }
- pthread_mutex_unlock(&ctx->xl_lock);
-
- if (count == 0) {
- pthread_join(ctx->disk_space_check, NULL);
- ctx->disk_space_check = 0;
- }
+ delete_posix_diskxl(priv, ctx);
delete_posix_diskxl is already idempotent: it checks priv->pxl at entry (line 132) and returns immediately if NULL. On paths where PARENT_DOWN already ran, priv->pxl was set to NULL (line 142), so fini's call is a safe no-op. This eliminates both the UAF (brick removed from list before priv freed) and the double-decrement (single decrement point per brick).
Version range and history
Affected: all releases since GlusterFS 8 (November 2020, commit 3f93be77e1).
Notably, the same author (Mohit Agrawal) had previously fixed the identical bug class for the janitor thread in f138d3fa22 (2019-07-16, "posix: In brick_mux brick is crashed while start/stop volume in loop"):
Brick is crashed in janitor task at the time of accessing priv. If posix priv is cleaned up before call janitor task then janitor task is crashed.
That fix introduced the janitor_task_stop flag and the PARENT_DOWN drain protocol — the exact safe-removal pattern that delete_posix_diskxl implements for the disk thread. When the disk thread was refactored to per-process 16 months later (3f93be77e1), the safe function was created but not used from posix_fini, re-introducing the pattern that the janitor fix had resolved.
This is part of a recurring pattern in brick mux: shared resources that work fine per-brick become UAF hazards when refactored to per-process, because posix_fini's teardown was designed for the per-brick model. Related upstream history:
No prior upstream report exists for the disk_space_check UAF or double-decrement specifically.
Summary
posix_finiremoves a brick from the shareddisk_space_checkthread using a directdiskxl_countdecrement instead of the safe removal protocol (delete_posix_diskxl). This causes two bugs:Use-after-free (brick multiplexing + init failure): The brick's
posix_diskxlentry stays on the shared iteration list afterposix_finifreespriv. The disk thread continues visiting the freed brick's entry, dereferencingpriv->base_path,priv->disk_reserve, andpriv->disk_unit_percent.Reference count double-decrement (all multi-brick shutdowns): When both
posix_notify(PARENT_DOWN)andposix_finirun (normal operational shutdown),ctx->diskxl_countis decremented twice per brick. Theuint32_tunderflow toUINT32_MAXprevents a doublepthread_joinby arithmetic coincidence, but the ref-count is structurally wrong.How to reproduce
Bug 1: UAF on init failure in brick mux
gluster volume set all cluster.brick-multiplex onposix_initto fail afterposix_spawn_disk_space_check_threadsucceeds but before init completes (e.g., remove the brick directory's volume-id xattr, or make the health check test directory unwritable)privpointer.The race window is narrow (between the failed brick registering with the disk thread and the cleanup thread's
PARENT_DOWNdelivery), but the UAF is structurally present.Bug 2: double-decrement on normal shutdown
gluster volume stop <vol>posix_notify(PARENT_DOWN)callsdelete_posix_diskxl→ decrementsdiskxl_countposix_finidirectly decrementsdiskxl_countagaindiskxl_countgoes 2→1 (PARENT_DOWN) →0 (fini, joins thread) → underflow toUINT32_MAX(brick 2's PARENT_DOWN)No crash occurs because
UINT32_MAX != 0, so thecount == 0join check doesn't trigger again. The thread was already joined by brick 1's fini.Root cause and history
Commit
3f93be77e1(Mohit Agrawal, 2020-11-09, PR #1595, fixes #1482) refactored the disk_space_check thread from per-brick to per-process to solve a real scaling problem: with brick mux at 250 bricks, 250 threads were spawned just for disk space checks.Before the refactor, the thread was per-brick (
priv->disk_space_check), andposix_finicleaned it up the same way as all per-brick threads — set a flag, cancel, join:Simple and correct — no shared state, no ref-counting, no iteration list.
The refactor introduced two code paths for removing a brick from the now-shared thread:
Safe path (
delete_posix_diskxl):pxl->detach_notify = truepxl->is_use == false(disk thread signals when done with this brick)ctx->diskth_xllistpriv->pxl = NULL(idempotent guard for subsequent calls)diskxl_countCalled by:
posix_notify(PARENT_DOWN)(line 213),posix_reconfigure(line 426).Shortcut in
posix_fini(lines 1301-1312):diskxl_countdetach_notify, wait foris_use, remove from list, setpriv->pxl = NULL, free pxlThe author correctly used the safe function for paths where the process stays alive (PARENT_DOWN, reconfigure), and the shortcut for
posix_fini— the terminal cleanup path. The implicit assumption was that by the timeposix_finiruns, PARENT_DOWN has already safely removed the brick from the list, so the full handshake is unnecessary overhead and fini only needs to handle the ref-count for the shared join.This assumption holds for normal operational shutdown (Path A) and brick mux detach (Path C), where PARENT_DOWN always precedes fini. It does not hold for init-failure (Path B), where
posix_finiruns without a prior PARENT_DOWN, and the brick's pxl entry stays on the iteration list whileprivgets freed.The idempotent guard (
priv->pxl == NULLcheck at line 132) that makesdelete_posix_diskxlsafe to call from both paths was already present in the function the author wrote — it just wasn't used from fini.Fix
Replace the 12-line unsafe block in
posix_finiwith a single call to the existing safe function:delete_posix_diskxlis already idempotent: it checkspriv->pxlat entry (line 132) and returns immediately if NULL. On paths wherePARENT_DOWNalready ran,priv->pxlwas set to NULL (line 142), so fini's call is a safe no-op. This eliminates both the UAF (brick removed from list beforeprivfreed) and the double-decrement (single decrement point per brick).Version range and history
Affected: all releases since GlusterFS 8 (November 2020, commit
3f93be77e1).Notably, the same author (Mohit Agrawal) had previously fixed the identical bug class for the janitor thread in
f138d3fa22(2019-07-16, "posix: In brick_mux brick is crashed while start/stop volume in loop"):That fix introduced the
janitor_task_stopflag and the PARENT_DOWN drain protocol — the exact safe-removal pattern thatdelete_posix_diskxlimplements for the disk thread. When the disk thread was refactored to per-process 16 months later (3f93be77e1), the safe function was created but not used fromposix_fini, re-introducing the pattern that the janitor fix had resolved.This is part of a recurring pattern in brick mux: shared resources that work fine per-brick become UAF hazards when refactored to per-process, because
posix_fini's teardown was designed for the per-brick model. Related upstream history:e589d8de66(2020-03-04): Mohit fixes multiple brick mux teardown leaks (xlator, local_pool, rpc, inode)04413e71aa(2023-05-11): Xavi Hernandez fixes two more brick mux crash bugs (auth module + stale pointer)No prior upstream report exists for the disk_space_check UAF or double-decrement specifically.