You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
glfs_fini() does not iterate fs->openfds to close file descriptors before tearing down the xlator graph. If a consumer (QEMU, NFS-Ganesha, custom gfapi program) calls glfs_fini() with open fds, the library proceeds to inode_table_destroy_all which force-frees inodes and destroys fd_mem_pool while fd_t objects still reference them. The result is dangling pointers, leaked glfs_fd_t objects, and unbounded memory growth on the brick.
This is a companion issue to #4666, which covers the concurrent open race (Bug A). This issue covers the pre-existing fd leak (Bug B). They are independent bugs with independent fixes. #4667 (the shutting_down gate) fixes Bug A but does not address this issue.
The requirement to close all fds before calling glfs_fini() is not documented in glfs.h — it is a silent contract.
pub_glfs_fini (api/src/glfs.c:1243) never touches fs->openfds. The list is initialized in glfs_new_fs (line 754), populated by glfs_fd_bind (line 728), and depopulated only by individual glfs_fd_destroy calls (line 682) triggered by glfs_close → glfs_mark_glfd_for_deletion → GF_REF_PUT. If the consumer never calls glfs_close, the glfs_fd_t objects stay on the list with refcount=1 indefinitely.
An exhaustive grep of openfds in api/src/ confirms this — the only iteration is __glfs_migrate_openfds (glfs-resolve.c:920), which migrates fds during graph switch, not teardown.
Destruction Sequence
When glfs_fini proceeds with open fds on the list:
pub_glfs_fini()
│
├─ drain loop exits (call_pool->cnt == 0)
│ "leaked frames may exist, we ignore" (line 1303)
│
├─ inode_table_destroy_all (line 1373)
│ ├─ WARNING: "active inode with refcount found" (inode.c:1892)
│ ├─ __inode_ref_reduce_by_n(trav, 0) ← force ref to 0 (line 1898)
│ └─ mem_pool_destroy(fd_mem_pool) ← ALL fd_t freed (line 1908)
│ ╔═══════════════════════════════════════════════╗
│ ║ fd_t is FREED. glfs_fd_t->fd is now dangling ║
│ ╚═══════════════════════════════════════════════╝
│
├─ glusterfs_graph_deactivate (line 1380)
│ └─ xlator fini() on each xlator → private freed
│
├─ RPC disconnect → server sees RPCSVC_EVENT_DISCONNECT
│ └─ server_connection_cleanup walks fd table (server-helpers.c:301)
│ Server-side resources cleaned up, but client damage already done
│
└─ glfs_free_from_ctx (line 1414)
glfs_fd_t objects LEAKED (refcount=1, nobody calls GF_REF_PUT)
The inode_table_destroy function explicitly acknowledges this at inode.c:1835-1852:
"Ideally at this point in time, there should be no inodes with refs remaining. But there are quite a few chances where the inodes leak... force free the inodes by changing the ref to 0. The problem with this is that any reference to inode after this calling this function will lead to a crash."
The developers chose "approach 2" (force-free despite refs) knowing it would crash if anything still held a reference. Open fds hold such references.
TLA+ Formal Proof
A TLA+ model of the gfapi fd lifecycle composed with the session lifecycle proves the invariant violation. The key counterexample (5 steps, 52 states explored):
State 1: session=ACTIVE fd1=UNUSED leaked={}
State 2: session=ACTIVE fd1=OPEN leaked={} ← GlfsOpen(fd1)
State 3: session=FINI_CALLED fd1=OPEN leaked={} ← GlfsFiniStart
State 4: session=DRAINING fd1=OPEN leaked={} ← GlfsFiniDrain
State 5: session=DESTROYED fd1=LEAKED leaked={fd1} ← SessionDestroyed
^^^^^^^^^^^^^^^^
NoLeakedFds VIOLATED
No drain action exists between states 4 and 5 to close fd1. With an openfds drain action added, the model checker verifies all invariants hold for the common case (147 states, depth 11).
Notably, the shutting_down gate from #4667 does not change this trace — fd1 was opened before fini started. The gate prevents stale fds (opens after fini), not leaked fds (opens before fini).
Verified safety boundary of the drain fix
A refined TLA+ model (gfapi_fd_drain_actual.tla) captures the sequential code ordering, the leaked-frames edge case (async op holds extra GF_REF_GET that outlives the 10s drain timeout), and the fd_unref+NULL mitigation. With the mitigation, ALL invariants HOLD:
Invariant
Result
Meaning
DrainedImpliesAllClosed
HOLDS
The drain always empties fs->openfds before inode_table_destroy_all
LeakedOnlyFromHeldRefs
HOLDS
Drain never misses an fd; only extra refs (leaked frames) could cause issues
NoLeakedFds
HOLDS
The fd_unref+NULL mitigation neutralizes leaked frames: inode ref released early, fd pointer NULLed so eventual destroy is a clean GF_FREE
CompositionSafety
HOLDS
No leaked fds, no stale fds after session destroy
How the fd_unref+NULL mitigation works: The drain calls fd_unref(glfd->fd) to release the fd_t's inode ref while the inode table is still alive, then sets glfd->fd = NULL. If a leaked frame's eventual GF_REF_PUT later triggers glfs_fd_destroy, it skips fd_unref (fd is NULL) and does a clean GF_FREE(glfd). No use-after-free, no dangling pointer.
TLC: 129 states generated, 56 distinct, depth 9. Model checking completed, no error found.
The "leaked frames" problem itself (line 1303, in-flight STACK_WIND operations outliving the 10s timeout) is a separate pre-existing issue (2013) not addressed by this fix. See #404.
Evidence
Source:pub_glfs_fini never references openfds (exhaustive grep, 10 hits in api/src/, none in the fini path)
Source:inode_table_destroy force-frees at inode.c:1898 and destroys fd_mem_pool at inode.c:1908
Source:glfs.h:311 documents "call after all operations finished" but never mentions closing fds
Test suite: Two test files comment out glfs_fini with the note "glfs fini path is racy and crashes the program" (afr-lock-heal-basic.c:183, afr-lock-heal-advanced.c:228 — Ravishankar N, 3b90883ce7, 2019)
First formal specification and proposed systemic fix
The "further cleanups underway" from 2013 never materialized. Each subsequent developer encountered the problem and applied a local workaround rather than a systemic fix.
Summary
glfs_fini()does not iteratefs->openfdsto close file descriptors before tearing down the xlator graph. If a consumer (QEMU, NFS-Ganesha, custom gfapi program) callsglfs_fini()with open fds, the library proceeds toinode_table_destroy_allwhich force-frees inodes and destroysfd_mem_poolwhilefd_tobjects still reference them. The result is dangling pointers, leakedglfs_fd_tobjects, and unbounded memory growth on the brick.This is a companion issue to #4666, which covers the concurrent open race (Bug A). This issue covers the pre-existing fd leak (Bug B). They are independent bugs with independent fixes. #4667 (the
shutting_downgate) fixes Bug A but does not address this issue.The requirement to close all fds before calling
glfs_fini()is not documented inglfs.h— it is a silent contract.Relationship to #4666
shutting_downgate (#4667)fs->openfdsin finiNoStaleFdsNoLeakedFdsRoot Cause
pub_glfs_fini(api/src/glfs.c:1243) never touchesfs->openfds. The list is initialized inglfs_new_fs(line 754), populated byglfs_fd_bind(line 728), and depopulated only by individualglfs_fd_destroycalls (line 682) triggered byglfs_close→glfs_mark_glfd_for_deletion→GF_REF_PUT. If the consumer never callsglfs_close, theglfs_fd_tobjects stay on the list with refcount=1 indefinitely.An exhaustive grep of
openfdsinapi/src/confirms this — the only iteration is__glfs_migrate_openfds(glfs-resolve.c:920), which migrates fds during graph switch, not teardown.Destruction Sequence
When
glfs_finiproceeds with open fds on the list:The
inode_table_destroyfunction explicitly acknowledges this atinode.c:1835-1852:The developers chose "approach 2" (force-free despite refs) knowing it would crash if anything still held a reference. Open fds hold such references.
TLA+ Formal Proof
A TLA+ model of the gfapi fd lifecycle composed with the session lifecycle proves the invariant violation. The key counterexample (5 steps, 52 states explored):
No drain action exists between states 4 and 5 to close fd1. With an
openfdsdrain action added, the model checker verifies all invariants hold for the common case (147 states, depth 11).Notably, the
shutting_downgate from #4667 does not change this trace — fd1 was opened before fini started. The gate prevents stale fds (opens after fini), not leaked fds (opens before fini).Verified safety boundary of the drain fix
A refined TLA+ model (
gfapi_fd_drain_actual.tla) captures the sequential code ordering, the leaked-frames edge case (async op holds extraGF_REF_GETthat outlives the 10s drain timeout), and the fd_unref+NULL mitigation. With the mitigation, ALL invariants HOLD:DrainedImpliesAllClosedfs->openfdsbeforeinode_table_destroy_allLeakedOnlyFromHeldRefsNoLeakedFdsCompositionSafetyHow the fd_unref+NULL mitigation works: The drain calls
fd_unref(glfd->fd)to release the fd_t's inode ref while the inode table is still alive, then setsglfd->fd = NULL. If a leaked frame's eventualGF_REF_PUTlater triggersglfs_fd_destroy, it skipsfd_unref(fd is NULL) and does a cleanGF_FREE(glfd). No use-after-free, no dangling pointer.TLC: 129 states generated, 56 distinct, depth 9. Model checking completed, no error found.
The "leaked frames" problem itself (line 1303, in-flight STACK_WIND operations outliving the 10s timeout) is a separate pre-existing issue (2013) not addressed by this fix. See #404.
Evidence
pub_glfs_fininever referencesopenfds(exhaustive grep, 10 hits inapi/src/, none in the fini path)inode_table_destroyforce-frees atinode.c:1898and destroysfd_mem_poolatinode.c:1908glfs.h:311documents "call after all operations finished" but never mentions closing fdsglfs_finiwith the note "glfs fini path is racy and crashes the program" (afr-lock-heal-basic.c:183,afr-lock-heal-advanced.c:228— Ravishankar N,3b90883ce7, 2019)History
This problem dates to the first
glfs_finiimplementation:35b09786465e12c658d6cleanup_startedadded for upcalls — checked in 2 of 30+ entry points, never generalized42e19645b1glfs_finiin tests: "racy and crashes"3b90883ce7cleanup_startedto one more path (still incomplete)The "further cleanups underway" from 2013 never materialized. Each subsequent developer encountered the problem and applied a local workaround rather than a systemic fix.
Related Issues
afr_notify()due to access after free inafr_has_quorum#4644 —afr_notifynull deref during fini (OPEN)shutting_downgate / Fix A (DRAFT PR, fixes gfapi: use-after-free race between glfs_open and glfs_fini (no shutdown gate) #4666 only)