Skip to content

posix: disk_space_check thread unsafe removal in posix_fini #4684

@ThalesBarretto

Description

@ThalesBarretto

Summary

posix_fini removes a brick from the shared disk_space_check thread using a direct diskxl_count decrement instead of the safe removal protocol (delete_posix_diskxl). This causes two bugs:

  1. Use-after-free (brick multiplexing + init failure): The brick's posix_diskxl entry stays on the shared iteration list after posix_fini frees priv. The disk thread continues visiting the freed brick's entry, dereferencing priv->base_path, priv->disk_reserve, and priv->disk_unit_percent.

  2. Reference count double-decrement (all multi-brick shutdowns): When both posix_notify(PARENT_DOWN) and posix_fini run (normal operational shutdown), ctx->diskxl_count is decremented twice per brick. The uint32_t underflow to UINT32_MAX prevents a double pthread_join by arithmetic coincidence, but the ref-count is structurally wrong.

How to reproduce

Bug 1: UAF on init failure in brick mux

  1. Enable brick multiplexing: gluster volume set all cluster.brick-multiplex on
  2. Start a volume with 2+ bricks on the same node
  3. Cause one brick's posix_init to fail after posix_spawn_disk_space_check_thread succeeds but before init completes (e.g., remove the brick directory's volume-id xattr, or make the health check test directory unwritable)
  4. The process continues serving other bricks. The disk_space_check thread visits the failed brick's freed priv pointer.

The race window is narrow (between the failed brick registering with the disk thread and the cleanup thread's PARENT_DOWN delivery), but the UAF is structurally present.

Bug 2: double-decrement on normal shutdown

  1. Run any multi-brick glusterfsd (with or without brick mux)
  2. Stop the volume: gluster volume stop <vol>
  3. posix_notify(PARENT_DOWN) calls delete_posix_diskxl → decrements diskxl_count
  4. posix_fini directly decrements diskxl_count again
  5. With 2 bricks: diskxl_count goes 2→1 (PARENT_DOWN) →0 (fini, joins thread) → underflow to UINT32_MAX (brick 2's PARENT_DOWN)

No crash occurs because UINT32_MAX != 0, so the count == 0 join check doesn't trigger again. The thread was already joined by brick 1's fini.

Root cause and history

Commit 3f93be77e1 (Mohit Agrawal, 2020-11-09, PR #1595, fixes #1482) refactored the disk_space_check thread from per-brick to per-process to solve a real scaling problem: with brick mux at 250 bricks, 250 threads were spawned just for disk space checks.

Before the refactor, the thread was per-brick (priv->disk_space_check), and posix_fini cleaned it up the same way as all per-brick threads — set a flag, cancel, join:

if (priv->disk_space_check) {
    priv->disk_space_check_active = _gf_false;
    (void)gf_thread_cleanup_xint(priv->disk_space_check);
    priv->disk_space_check = 0;
}

Simple and correct — no shared state, no ref-counting, no iteration list.

The refactor introduced two code paths for removing a brick from the now-shared thread:

Safe path (delete_posix_diskxl):

  • Sets pxl->detach_notify = true
  • Waits for pxl->is_use == false (disk thread signals when done with this brick)
  • Removes pxl from ctx->diskth_xl list
  • Sets priv->pxl = NULL (idempotent guard for subsequent calls)
  • Decrements diskxl_count
  • Frees pxl struct
  • Joins thread if last brick

Called by: posix_notify(PARENT_DOWN) (line 213), posix_reconfigure (line 426).

Shortcut in posix_fini (lines 1301-1312):

  • Directly decrements diskxl_count
  • Joins thread if last brick
  • Does NOT: set detach_notify, wait for is_use, remove from list, set priv->pxl = NULL, free pxl

The author correctly used the safe function for paths where the process stays alive (PARENT_DOWN, reconfigure), and the shortcut for posix_fini — the terminal cleanup path. The implicit assumption was that by the time posix_fini runs, PARENT_DOWN has already safely removed the brick from the list, so the full handshake is unnecessary overhead and fini only needs to handle the ref-count for the shared join.

This assumption holds for normal operational shutdown (Path A) and brick mux detach (Path C), where PARENT_DOWN always precedes fini. It does not hold for init-failure (Path B), where posix_fini runs without a prior PARENT_DOWN, and the brick's pxl entry stays on the iteration list while priv gets freed.

The idempotent guard (priv->pxl == NULL check at line 132) that makes delete_posix_diskxl safe to call from both paths was already present in the function the author wrote — it just wasn't used from fini.

Fix

Replace the 12-line unsafe block in posix_fini with a single call to the existing safe function:

-    pthread_mutex_lock(&ctx->xl_lock);
-    {
-        count = --ctx->diskxl_count;
-        if (count == 0)
-            pthread_cond_signal(&ctx->xl_cond);
-    }
-    pthread_mutex_unlock(&ctx->xl_lock);
-
-    if (count == 0) {
-        pthread_join(ctx->disk_space_check, NULL);
-        ctx->disk_space_check = 0;
-    }
+    delete_posix_diskxl(priv, ctx);

delete_posix_diskxl is already idempotent: it checks priv->pxl at entry (line 132) and returns immediately if NULL. On paths where PARENT_DOWN already ran, priv->pxl was set to NULL (line 142), so fini's call is a safe no-op. This eliminates both the UAF (brick removed from list before priv freed) and the double-decrement (single decrement point per brick).

Version range and history

Affected: all releases since GlusterFS 8 (November 2020, commit 3f93be77e1).

Notably, the same author (Mohit Agrawal) had previously fixed the identical bug class for the janitor thread in f138d3fa22 (2019-07-16, "posix: In brick_mux brick is crashed while start/stop volume in loop"):

Brick is crashed in janitor task at the time of accessing priv. If posix priv is cleaned up before call janitor task then janitor task is crashed.

That fix introduced the janitor_task_stop flag and the PARENT_DOWN drain protocol — the exact safe-removal pattern that delete_posix_diskxl implements for the disk thread. When the disk thread was refactored to per-process 16 months later (3f93be77e1), the safe function was created but not used from posix_fini, re-introducing the pattern that the janitor fix had resolved.

This is part of a recurring pattern in brick mux: shared resources that work fine per-brick become UAF hazards when refactored to per-process, because posix_fini's teardown was designed for the per-brick model. Related upstream history:

No prior upstream report exists for the disk_space_check UAF or double-decrement specifically.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions