Skip to content

gfapi: glfs_fini does not close open file descriptors (no openfds drain) #4668

@ThalesBarretto

Description

@ThalesBarretto

Summary

glfs_fini() does not iterate fs->openfds to close file descriptors before tearing down the xlator graph. If a consumer (QEMU, NFS-Ganesha, custom gfapi program) calls glfs_fini() with open fds, the library proceeds to inode_table_destroy_all which force-frees inodes and destroys fd_mem_pool while fd_t objects still reference them. The result is dangling pointers, leaked glfs_fd_t objects, and unbounded memory growth on the brick.

This is a companion issue to #4666, which covers the concurrent open race (Bug A). This issue covers the pre-existing fd leak (Bug B). They are independent bugs with independent fixes. #4667 (the shutting_down gate) fixes Bug A but does not address this issue.

The requirement to close all fds before calling glfs_fini() is not documented in glfs.h — it is a silent contract.

Relationship to #4666

#4666 (Bug A) This issue (Bug B)
What New fd opened during fini Pre-existing fd not closed before fini
Race? Yes — concurrent threads No — ordering/contract violation
Fix shutting_down gate (#4667) Drain fs->openfds in fini
TLA+ invariant NoStaleFds NoLeakedFds

Root Cause

pub_glfs_fini (api/src/glfs.c:1243) never touches fs->openfds. The list is initialized in glfs_new_fs (line 754), populated by glfs_fd_bind (line 728), and depopulated only by individual glfs_fd_destroy calls (line 682) triggered by glfs_closeglfs_mark_glfd_for_deletionGF_REF_PUT. If the consumer never calls glfs_close, the glfs_fd_t objects stay on the list with refcount=1 indefinitely.

An exhaustive grep of openfds in api/src/ confirms this — the only iteration is __glfs_migrate_openfds (glfs-resolve.c:920), which migrates fds during graph switch, not teardown.

Destruction Sequence

When glfs_fini proceeds with open fds on the list:

pub_glfs_fini()
  │
  ├─ drain loop exits (call_pool->cnt == 0)
  │   "leaked frames may exist, we ignore" (line 1303)
  │
  ├─ inode_table_destroy_all (line 1373)
  │   ├─ WARNING: "active inode with refcount found" (inode.c:1892)
  │   ├─ __inode_ref_reduce_by_n(trav, 0)  ← force ref to 0 (line 1898)
  │   └─ mem_pool_destroy(fd_mem_pool)     ← ALL fd_t freed (line 1908)
  │       ╔═══════════════════════════════════════════════╗
  │       ║  fd_t is FREED. glfs_fd_t->fd is now dangling ║
  │       ╚═══════════════════════════════════════════════╝
  │
  ├─ glusterfs_graph_deactivate (line 1380)
  │   └─ xlator fini() on each xlator → private freed
  │
  ├─ RPC disconnect → server sees RPCSVC_EVENT_DISCONNECT
  │   └─ server_connection_cleanup walks fd table (server-helpers.c:301)
  │      Server-side resources cleaned up, but client damage already done
  │
  └─ glfs_free_from_ctx (line 1414)
      glfs_fd_t objects LEAKED (refcount=1, nobody calls GF_REF_PUT)

The inode_table_destroy function explicitly acknowledges this at inode.c:1835-1852:

"Ideally at this point in time, there should be no inodes with refs remaining. But there are quite a few chances where the inodes leak... force free the inodes by changing the ref to 0. The problem with this is that any reference to inode after this calling this function will lead to a crash."

The developers chose "approach 2" (force-free despite refs) knowing it would crash if anything still held a reference. Open fds hold such references.

TLA+ Formal Proof

A TLA+ model of the gfapi fd lifecycle composed with the session lifecycle proves the invariant violation. The key counterexample (5 steps, 52 states explored):

State 1: session=ACTIVE      fd1=UNUSED   leaked={}
State 2: session=ACTIVE      fd1=OPEN     leaked={}      ← GlfsOpen(fd1)
State 3: session=FINI_CALLED  fd1=OPEN     leaked={}      ← GlfsFiniStart
State 4: session=DRAINING     fd1=OPEN     leaked={}      ← GlfsFiniDrain
State 5: session=DESTROYED    fd1=LEAKED   leaked={fd1}   ← SessionDestroyed
                                           ^^^^^^^^^^^^^^^^
                                           NoLeakedFds VIOLATED

No drain action exists between states 4 and 5 to close fd1. With an openfds drain action added, the model checker verifies all invariants hold for the common case (147 states, depth 11).

Notably, the shutting_down gate from #4667 does not change this trace — fd1 was opened before fini started. The gate prevents stale fds (opens after fini), not leaked fds (opens before fini).

Verified safety boundary of the drain fix

A refined TLA+ model (gfapi_fd_drain_actual.tla) captures the sequential code ordering, the leaked-frames edge case (async op holds extra GF_REF_GET that outlives the 10s drain timeout), and the fd_unref+NULL mitigation. With the mitigation, ALL invariants HOLD:

Invariant Result Meaning
DrainedImpliesAllClosed HOLDS The drain always empties fs->openfds before inode_table_destroy_all
LeakedOnlyFromHeldRefs HOLDS Drain never misses an fd; only extra refs (leaked frames) could cause issues
NoLeakedFds HOLDS The fd_unref+NULL mitigation neutralizes leaked frames: inode ref released early, fd pointer NULLed so eventual destroy is a clean GF_FREE
CompositionSafety HOLDS No leaked fds, no stale fds after session destroy

How the fd_unref+NULL mitigation works: The drain calls fd_unref(glfd->fd) to release the fd_t's inode ref while the inode table is still alive, then sets glfd->fd = NULL. If a leaked frame's eventual GF_REF_PUT later triggers glfs_fd_destroy, it skips fd_unref (fd is NULL) and does a clean GF_FREE(glfd). No use-after-free, no dangling pointer.

TLC: 129 states generated, 56 distinct, depth 9. Model checking completed, no error found.

The "leaked frames" problem itself (line 1303, in-flight STACK_WIND operations outliving the 10s timeout) is a separate pre-existing issue (2013) not addressed by this fix. See #404.

Evidence

  • Source: pub_glfs_fini never references openfds (exhaustive grep, 10 hits in api/src/, none in the fini path)
  • Source: inode_table_destroy force-frees at inode.c:1898 and destroys fd_mem_pool at inode.c:1908
  • Source: glfs.h:311 documents "call after all operations finished" but never mentions closing fds
  • Test suite: Two test files comment out glfs_fini with the note "glfs fini path is racy and crashes the program" (afr-lock-heal-basic.c:183, afr-lock-heal-advanced.c:228 — Ravishankar N, 3b90883ce7, 2019)
  • Production: Samba's vfs_glusterfs leaked 13,000 fds/day/brick with persistent SMB2 connections (Samba Bugzilla #16043, fixed consumer-side by MR EC statvfs should return minimum total inodes in the disperse group in statfs output #4474)

History

This problem dates to the first glfs_fini implementation:

Date Who What Commit/Issue
2013-08 Anand Avati "implement a minimal glfs_fini()" — "further cleanups underway" 35b0978646
2015-02 Poornima G inode_table_destroy added with "approach 2" (force-free despite refs) 5e12c658d6
2015-02 Soumya Koduri cleanup_started added for upcalls — checked in 2 of 30+ entry points, never generalized 42e19645b1
2018 Xavi Hernandez #404 "Implement proper cleanup sequence" — closed wontfix #404
2019-10 Ravishankar N Comments out glfs_fini in tests: "racy and crashes" 3b90883ce7
2025 Pranith Kumar #4527 adds cleanup_started to one more path (still incomplete) #4527
2026-04 This issue First formal specification and proposed systemic fix

The "further cleanups underway" from 2013 never materialized. Each subsequent developer encountered the problem and applied a local workaround rather than a systemic fix.

Related Issues

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions