Add upload and download timings #177
Draft
Hitenjain14 wants to merge 51 commits into
…s3server into feat/enterprise-timings
- Allow audit target initialization even if logsearchapi is unavailable
- Detect connection refused errors and log warning instead of failing
- Server will start normally and retry when sending logs
- Fixes 'Unable to initialize server audit HTTP target' error
Picks up:
- core/node: HealthyByLFB() and LFB-aware max-nonce for chaos networks
- feat: Add cache-first allocation and blobber caching for offline mode
- increased maxConnsPerHost to match TCPDialer Concurrency
- wasmsdk: export faucet + allow wallet set with privateKey
Implements two complementary approaches for high-throughput S3 operations:
1. LogCache (logcache.go): Log-structured ACID cache
   - Append-only cache file on NVMe (data + metadata unified)
   - Group commit with fdatasync (amortized ~0.02ms/entry at high concurrency)
   - In-memory index for O(1) GET/HEAD/LIST lookups
   - Committed objects stay in cache for continued GET serving
   - Background drain to blobbers via DoMultiOperation
   - Crash recovery via cache file replay
   - Adaptive: small files (≤1MB) cached, large files go direct to blobbers
2. WAL Intent Log (wal.go): Lightweight crash-recovery for writeback cache
   - Metadata-only entries (~100 bytes, no data duplication)
   - Works alongside MinIO's writeback cache for GET speed
   - Ensures ACID by recording intent before cache write
3. Config-driven tuning (initSDK.go):
   - sdk_batch_size, locked_blobbers_cap, enable_wal, wal_dir, wal_commit_workers
   - All performance knobs exposed via zs3server.json
4. S3 operation integration (gateway-zcn.go, dStorage.go):
   - PUT: size-adaptive routing (cache vs direct blobber)
   - GET/HEAD: cache-first with blobber fallback
   - DELETE: cache eviction + blobber delete
   - LIST: cache index merge with blobber listing
Measured results on test2 (12 cores, 3 enterprise blobbers):
- PUT (WAL inline, 1KiB): 3672 obj/s at conc=256 (ACID)
- PUT (sync direct IPs): 289 obj/s at conc=64
- GET (WAL-served): 3900 obj/s at conc=128
- GET (writeback cache): 8734 obj/s at conc=256
- Historical baseline: PUT 0.58, GET 125 obj/s
PUT: size-adaptive routing; ≤1MB through LogCache (append + fdatasync), >1MB direct to blobbers
GET: LogCache first via file-backed reader (sendfile-capable), blobber fallback on miss
HEAD: LogCache index lookup, blobber fallback
LIST: merge LogCache entries with blobber listing
DELETE: evict from LogCache + delete from blobbers
COPY: if source in LogCache, read from cache; else blobber copy
LogCache GET uses limitedFileReader with WriteTo interface for sendfile optimization.
Committed objects stay in cache index for continued GET serving.
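The size-adaptive PUT routing described above can be sketched as a single decision function. This is a minimal illustration, not the code from gateway-zcn.go: `choosePutRoute`, `putRoute`, and `smallObjectLimit` are hypothetical names, and reading "1MB" as 1 MiB is an assumption.

```go
package main

import "fmt"

// smallObjectLimit is the cutoff from the commit message (≤1MB goes to the cache).
// Treating "1MB" as 1 MiB is an assumption of this sketch.
const smallObjectLimit int64 = 1 << 20

// putRoute names the two destinations a PUT can take.
type putRoute string

const (
	routeLogCache putRoute = "logcache" // append to the cache file, then fdatasync
	routeBlobber  putRoute = "blobber"  // upload straight to blobbers
)

// choosePutRoute is a hypothetical helper: objects of known size at or under
// the limit go to the cache; larger objects, or objects of unknown size
// (passed as -1), go directly to blobbers.
func choosePutRoute(size int64) putRoute {
	if size >= 0 && size <= smallObjectLimit {
		return routeLogCache
	}
	return routeBlobber
}

func main() {
	for _, size := range []int64{1024, 1 << 20, 1<<20 + 1, -1} {
		fmt.Printf("%d bytes -> %s\n", size, choosePutRoute(size))
	}
}
```

Sending unknown-size uploads to blobbers is a conservative choice made for this sketch; the commit message does not say how streaming PUTs without a declared length are handled.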
Hot cache: 20K-entry in-memory LRU of recently-PUT objects. GET checks hot cache first (~0.01ms) before falling back to shared fd pread (~0.1ms). Eliminates os.Open/Close per GET that was adding ~1-2ms overhead. Results: GET +22% at conc=16 (2706 → 3306 obj/s). GET ceiling at ~3300 obj/s is Go net/http handler overhead, not cache layer.
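A bounded in-memory LRU like the hot cache described above can be sketched with the standard library's `container/list`. This is an illustrative sketch under assumptions: `hotCache`, `hotEntry`, and the method names are hypothetical, and the real cache may store metadata or file offsets rather than raw bytes.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// hotEntry is one cached object (hypothetical layout).
type hotEntry struct {
	key  string
	data []byte
}

// hotCache is a fixed-capacity LRU: the front of the list is the most
// recently used entry, the back is the next eviction victim.
type hotCache struct {
	mu       sync.Mutex
	capacity int
	order    *list.List
	items    map[string]*list.Element
}

func newHotCache(capacity int) *hotCache {
	return &hotCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// Put inserts or refreshes an entry and evicts the least recently used
// entry once the capacity is exceeded.
func (c *hotCache) Put(key string, data []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		el.Value.(*hotEntry).data = data
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&hotEntry{key: key, data: data})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*hotEntry).key)
	}
}

// Get returns the cached bytes and marks the entry as recently used.
func (c *hotCache) Get(key string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*hotEntry).data, true
}

func main() {
	c := newHotCache(2) // the commit uses 20K entries; 2 keeps the demo short
	c.Put("a", []byte("1"))
	c.Put("b", []byte("2"))
	c.Get("a")              // "a" becomes most recently used
	c.Put("c", []byte("3")) // evicts "b"
	_, ok := c.Get("b")
	fmt.Println("b still cached:", ok)
}
```

A GET handler would call `Get` first and fall back to the shared-fd `pread` path only on a miss.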
Writeback cache handles all S3 operations natively (PUT/GET/HEAD/LIST/DELETE/COPY).
WAL intent log (metadata-only, ~100 bytes per entry) provides crash recovery.
PUT: writeback cache write (~1ms) + WAL intent fdatasync (~0.5ms amortized)
GET: MinIO sendfile from cache (bypasses gateway, ~0.5ms for 1KB)
DELETE: WAL marks deleted + MinIO cache eviction
Expected: PUT ~3000 obj/s, GET ~8000 obj/s, 1059 MB/s for 1MiB GET.
Requires: MINIO_CACHE_COMMIT=writeback in docker-compose + zs3server.json enable_wal=true.
…ivalent)
- nfs_server.go: NFSv3 server via go-nfs library, starts on configurable port
- nfs_fs.go: billy.Filesystem backed by Züs blobbers; reuses putFile, getFileReader, getRegularRefs (same data path as S3 gateway)
- nfs_file.go: billy.File with local temp staging, uploads to blobbers on Close
- Config: enable_nfs, nfs_port, nfs_cache_dir in zs3server.json
- Shares WAL, writeback cache, and batch workers with S3 gateway
- Architecture doc: NFS_GATEWAY_ARCHITECTURE.md
Reduces data loss window from 60s to 5s (12x safer). Increases background drain workers from 10 to 20 (faster commit to blobbers). Zero performance impact on PUT/GET — only affects background commit frequency.
…ance
Route NFS reads/writes through MinIO's in-process ObjectLayer API instead of direct blobber calls. Eliminates HTTP loopback overhead and synchronous blobber commits that made NFS 40x slower than S3.
Key changes:
- nfs_s3client.go: direct CacheObjectLayer.PutObject/GetObjectNInfo calls
- nfs_file.go: in-memory bytes.Buffer for files <=1MB (no temp file I/O)
- nfs_fs.go: stat cache after writes, S3 cache-aware Stat/ReadDir/OpenFile
- nfs_server.go: increased CachingHandler to 8192 entries
- object-api-common.go: exported GetGlobalCacheObjectAPI/GetGlobalObjectAPI
- NFS_ARCHITECTURE.md: full architecture doc with measured data + NFSv4.1 plan
Measured improvement (1KB files, chain stopped, 12 cores, 3 eblobbers 2+1):
NFS PUT: 9 -> 48 obj/s (+5.3x)
NFS GET: 44 -> 103 obj/s (+2.3x)
Remaining gap vs S3 (357 PUT, 506 GET) is NFSv3 protocol overhead.
- nfs_cache_mode: "memory" option for fire-and-forget writes (no disk I/O, async blobber commit, no crash recovery)
- nfs_cache_mode: "disk" (default) unchanged; ACID via /mcache + WAL
- Dual-access test script (tests/dual_access_test.sh): 10 test cases verifying S3 write->NFS read, NFS write->S3 read, LIST, DELETE, overwrite
- Updated NFS_ARCHITECTURE.md with nconnect=16 benchmark data showing NFS GET at 63% of S3 (312 vs 493 obj/s), and path to S3 parity via NFS-Ganesha FSAL or go-nfs concurrent dispatch patch
Measured (Python bench, chain stopped, 12 cores, 3 eblobbers 2+1):
nconnect=16 disk mode: PUT 61, GET 312 obj/s (1KB)
nconnect=16 memory mode: PUT 47, GET 320 obj/s (1KB)
Memory mode does NOT improve concurrent throughput; the bottleneck is go-nfs RPC dispatch, not backend storage.
Replace go-nfs (60 obj/s) with NFS-Ganesha + FSAL_VFS + inotify blobber sync.
NFS-Ganesha handles the NFS protocol in C (fast), our Go process watches the export directory and commits changes to blobbers async.
Architecture:
NFS client → Ganesha (C, NFSv4) → /nfs_export (NVMe) → instant return
Background: inotify → putFile() → blobbers (async, ACID via WAL)
Config: "nfs_ganesha_export_dir": "/nfs_export" in zs3server.json
Requires: apt install nfs-ganesha nfs-ganesha-vfs
Measured (Python bench, chain stopped, 12 cores, 3 eblobbers 2+1):
NFS PUT 1KB: 3,208 obj/s (was 60 with go-nfs), 53x improvement
NFS GET 1KB: 7,794 obj/s (was 277 with go-nfs), 28x improvement
NFS PUT 1KB is 9x faster than S3 via boto3 (362 obj/s)
After blobber commit, delete files from tmpfs export dir to bound cache usage.
When tmpfs exceeds 80%, spill committed files to NVMe.
Config:
nfs_cache_evict: true (default), delete from tmpfs after commit
nfs_spillover_dir: "/path", NVMe directory for overflow
This keeps tmpfs usage bounded for sustained workloads:
Small files (1KB, 5K/s): 8GB tmpfs holds 25 min of burst
Eviction at 50 commits/s keeps steady-state usage at ~50MB
- nfs_direct_threshold: files > 2MB bypass tmpfs, commit to blobbers directly via inotify (avoids filling cache with large data)
- s3_direct_threshold: same for S3 PUT path; large files bypass batch channel and use DoMultiOperation directly
- 3 NFS cache modes: tmpfs (fastest), nvme (crash-safe), direct (slow)
- Configurable via zs3server.json
Architecture per file size:
≤2MB: NFS write → tmpfs/nvme → batch commit → blobbers (5K obj/s)
>2MB: NFS write → tmpfs/nvme → immediate commit → blobbers
Same logic for S3:
≤threshold: S3 PUT → writeback cache → batch commit
>threshold: S3 PUT → DoMultiOperation → blobbers directly
Replace blocking putFile() path with direct DoMultiOperation batches.
Each sync worker collects files into batches of 25, commits with a single
WM lock acquisition per batch instead of per file.
Before: sync worker → putFile → blocks 500ms → next file (50 files/s)
After: sync worker → collect 25 files → DoMultiOperation → 25 files committed
Expected: 5 workers × 25 files / 500ms = 250 files/s
Also: debounced inotify (200ms quiet before commit), 10K file channel buffer,
2s backoff on failure, rate-limited retries.
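The batch collection step above can be sketched as a function that drains up to 25 paths from the file channel before handing them to one commit call. This is a minimal sketch under assumptions: `collectBatch` and its `maxWait` cutoff are hypothetical, the commit itself (DoMultiOperation in the real code) is replaced by a print, and the 200ms inotify debounce is not modelled here.

```go
package main

import (
	"fmt"
	"time"
)

// collectBatch gathers up to maxBatch paths from files. It returns early when
// the channel is closed or when maxWait elapses, so a slow trickle of files
// still gets committed. maxWait is an assumption of this sketch.
func collectBatch(files <-chan string, maxBatch int, maxWait time.Duration) []string {
	batch := make([]string, 0, maxBatch)
	timer := time.NewTimer(maxWait)
	defer timer.Stop()
	for len(batch) < maxBatch {
		select {
		case f, ok := <-files:
			if !ok {
				return batch // channel closed: flush what we have
			}
			batch = append(batch, f)
		case <-timer.C:
			return batch // waited long enough: flush a partial batch
		}
	}
	return batch
}

func main() {
	files := make(chan string, 10000) // 10K buffer, as in the commit message
	for i := 0; i < 60; i++ {
		files <- fmt.Sprintf("file-%03d", i)
	}
	close(files)

	for {
		batch := collectBatch(files, 25, 200*time.Millisecond)
		if len(batch) == 0 {
			break
		}
		// In the real worker this is where one batched commit would run.
		fmt.Printf("committing %d files in one batch\n", len(batch))
	}
}
```

With 60 queued files the loop commits batches of 25, 25, and 10, which is the per-batch lock amortisation the commit message relies on.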
Adaptive config:
- FileSizeTracker (1000-entry ring buffer) tracks PUT sizes from S3 + NFS
- StartAdaptiveLoop() samples median every 30s, adjusts batch/worker config
- ConfigForFileSize() returns optimal settings per size category
Cache router:
- TryCacheRead() checks NFS export + /mcache before blobber download
- Enables cross-protocol visibility (NFS write → S3 GET from local cache)
- Tracks hit/miss stats with periodic logging
Benchmark cleanup:
- tests/cleanup_bench.sh: cleans caches, truncates logs, reports disk usage
- Supports --dry-run flag
…e architecture doc
The cache_router inside zs3server only checks /nfs_export for cross-protocol reads (NFS write → S3 GET). MinIO's cache layer already handles /mcache lookups before our handler runs, so there is no need to duplicate that.
Updated NFS_ARCHITECTURE.md with full multi-level cache diagram:
Level 1: Router (/router repo), compute node local NVMe cache
Level 2: zs3server, /mcache (S3) + /nfs_export (NFS) + CacheRouter
Level 3: Blobbers, erasure-coded persistent storage
Added data flow diagrams for:
- App requests S3 object (full cache hierarchy)
- App writes via NFS (Ganesha → tmpfs → blobber sync)
- External S3 object not on Züs (Router pulls from AWS → stores in Züs)
…3-upstream fallback, router-order fix, stub xattr defense
New internal HTTP endpoints consumed by the FSAL_ZUS Ganesha plugin to implement
NFS read-through-blobbers + directory listing union.
- POST /internal/prewarm {bucket, key} — fetch object from Züs into /nfs_export
via existing getFileReader, using singleflight dedup + MarkCommitted anti-loop
+ os.Rename for atomicity. Fast-path when file already non-stub-cached.
- GET /internal/list?bucket=X&prefix=Y[&stub=1] — returns Züs directory listing.
With stub=1, creates sparse placeholders with real sizes + user.zus.stub xattr
so apps see all entries via readdir; actual content arrives on first open via
prewarm. 30s TTL cache.
- S3-upstream fallback (config-gated) — on Züs 404, fetch from configured
external S3 (AWS/MinIO), stream to client via io.Pipe + TeeReader, async
cache-back to Züs. minio-go client, singleflight dedup on bucket/key.
Replaces the role of the hypothetical /router repo; single source of truth.
BlobberSync hardening
- nfs_blobber_sync.go: new MarkCommitted(relPath) method.
- nfs_blobber_sync.go: always skip files with user.zus.stub xattr (defense in
depth — MarkCommitted alone insufficient due to fsnotify Create firing before
caller can populate the committed map).
Router-order bug fix (gateway-main.go)
- GatewayExtraRouters must register BEFORE registerSTSRouter. STS uses
PathPrefix(/) + MatcherFunc(POST && form-urlencoded && no queries), which
hijacked POST /internal/* whenever the client omitted Content-Type: application/json
(curl's default is form-urlencoded), returning 400 MissingParameter STS errors.
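The ordering bug above comes down to first-match-wins routing. The sketch below models it with a plain ordered list of matchers rather than the gateway's real router; `request`, `route`, and `dispatch` are hypothetical, and the STS matcher is simplified (the "no queries" condition is omitted).

```go
package main

import (
	"fmt"
	"strings"
)

// request carries only the fields the two matchers look at.
type request struct {
	method      string
	path        string
	contentType string
}

// route pairs a name with a predicate; routes are tried in registration order.
type route struct {
	name  string
	match func(request) bool
}

// dispatch returns the name of the first route whose matcher accepts r.
func dispatch(routes []route, r request) string {
	for _, rt := range routes {
		if rt.match(r) {
			return rt.name
		}
	}
	return "none"
}

// stsRoute mimics a catch-all on "/" that accepts any form-urlencoded POST.
var stsRoute = route{"sts", func(r request) bool {
	return r.method == "POST" && r.contentType == "application/x-www-form-urlencoded"
}}

// internalRoute accepts anything under /internal/.
var internalRoute = route{"internal", func(r request) bool {
	return strings.HasPrefix(r.path, "/internal/")
}}

func main() {
	// A curl-style POST without an explicit JSON Content-Type.
	req := request{"POST", "/internal/prewarm", "application/x-www-form-urlencoded"}

	fmt.Println(dispatch([]route{stsRoute, internalRoute}, req)) // STS registered first
	fmt.Println(dispatch([]route{internalRoute, stsRoute}, req)) // internal registered first
}
```

With STS registered first the request is claimed by the STS matcher; registering the internal routes first lets the more specific prefix win.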
Measured results (SF=10 TPC-DS, 3.8 GB parquet, 452 files, 4+1 enterprise alloc)
- TPC-DS 10 queries, via /mnt/zus_nfs + FSAL_ZUS: 222.24s (vs NVMe 180.05s, +23%)
- S3-API cold: 223.08s; warm: 221.43s (warm≈cold, workload is CPU-bound at SF=10)
- fio seq_read 1M NFS: 577 MB/s (45% of NVMe 1294 MB/s)
- Prewarm throughput: 547-624 MB/s sustained at 16-way concurrency (Züs→tmpfs)
- Cold single-file NFS read (stub→prewarm→retry): 389ms for 1.9MB
Architecture documented in FSAL_ZUS_ARCHITECTURE.md.
Mount options in FSAL_ZUS_MOUNT.md.
Architecture:
App → zs3server → /nfs_export (tmpfs, unified cache for S3+NFS)
HIT → serve from tmpfs (5K+ obj/s)
MISS → blobbers (GoSDK)
MISS → external S3 (AWS) → write to /nfs_export → blobber sync
Key changes:
- external_s3.go: Router function fetches from AWS S3 on Züs miss,
writes to /nfs_export so blobber sync commits to Züs
- gateway-zcn.go PutObject: S3 PUT now writes to /nfs_export
(unified cache) so NFS clients see S3-written files instantly
- gateway-zcn.go GetObjectNInfo: 3-level lookup:
/nfs_export → blobbers → external S3
- cache_router.go: simplified to check /nfs_export only
(MinIO /mcache handled by MinIO's own cache layer)
- NFS_ARCHITECTURE.md: full multi-level cache diagram
Config:
"external_s3_endpoint": "https://s3.amazonaws.com"
"external_s3_region": "us-east-1"
…-safe eviction
Architecture
- Three tiers: tmpfs (/nfs_export) → spillover (/root/nfs_spillover) → blobber.
- Unified layout: S3 and NFS share /nfs_export paths.
- Per-file state via xattrs: user.zus.stub (placeholder) / user.zus.committed (real).
- Fully configurable: nfs_tmpfs_cache_enabled, nfs_spillover_cache_enabled, nfs_spillover_max_bytes, nfs_cache_disabled.
Routes (new)
- cache_router.go: TryCacheRead, full-file serve from tmpfs or spillover without going through gosdk. Skips stubs via user.zus.stub xattr.
- cache_clear_router.go, cache_stats_router.go: /internal/cache_clear, /internal/cache_stats operational endpoints.
- commit_router.go: /internal/commit for NFS write-marker integration.
- fallback_s3.go: upstream S3 fallback on Züs 404 (replaces external_s3.go).
- list_router.go: /internal/list + stub materialisation for NFS readdir.
- prewarm_router.go: /internal/prewarm fetch-from-blobber, io.Copy to tmpfs, xattr gating, short-read / ENOSPC stub restoration.
- inode_rel_map.go: inode → rel_path map for FSAL_ZUS handle-to-path recovery.
- mirror_s3_to_export.go: S3 PUT → /nfs_export tree mirroring for NFS visibility.
Read path (S3, gateway-zcn.go)
- Fix A: local-file fast path. Serves ranges directly from tmpfs fd when file has user.zus.committed + blocks>0. Bypasses gosdk, consensus, erasure decode.
- Lazy cache-back on range-read miss: getFileReader serves the range + spawns cacheBackFullFetch in background. Deduped via cacheBackInflight sync.Map, gated by RecentlyEvicted (120 s cool-off).
- Full-file read cache-back-tee: TeeReader fills tmpfs as bytes flow to client.
- s3ContentHash used only for directory ETag; never written as file body.
NFS write path
- PutObject: MarkCommitted before open (inotify skip gate) → io.Copy to /nfs_export → putFile to blobbers → Setxattr committed after success.
- nfs_blobber_sync: inotify watcher + batched upload with per-path locks serialising S3-write / NFS-write / spill / cache-back / commit on one relPath.
- Skip gates: HasPrefix(basename, "."), .cacheback suffix, .cachefetch suffix (so temp files mid-rename never upload to blobber).
Eviction (nfs_blobber_sync.go)
- spilloverMonitor: 1 s tick; if tmpfs > 60% → SpillNow.
- spillCommittedFiles: oldest-mtime candidate; skip if openFdInodes hits, atime < 120 s, or committed xattr missing. Copies to spillover, then under cross-process OFD F_WRLCK on the tmpfs inode: Removexattr committed → O_TRUNC → Truncate(origSize) sparse stub → Setxattr stub → release. Paired with FSAL_ZUS F_RDLCK so spill blocks while reader holds fd.
- EnsureFreeTmpfs: skip-failed-candidate loop (was bailing on first spill error, leaving prewarm ENOSPC-stuck).
- evictSpilloverOldest: skip files with mtime < 120 s OR atime < 120 s (strictatime mount bumps atime on every read; catches active spillover readers).
- OFD locks (F_OFD_SETLKW/F_OFD_SETLK) everywhere; POSIX fcntl locks released on any same-process close and were dropping the guard before intended.
- RecentlyEvicted TTL 30 s → 120 s (longer cool-off against re-cache-back of just-evicted keys).
SF10 TPC-DS benches (10-query, SF10 ≈ 3.8 GB, test2 on localhost):
- NFS 8G/0: 228 s (best; near Apr-14 baseline 223 s)
- NFS 1G/1G: 366 s (no spillover benefit on this setup)
- S3 8G/0: 386 s (Fix A fast path, no thrash)
- S3 0/1G: 385 s (spillover-only; TryCacheRead serves full-file reads)
- S3 1G/1G: 752 s (tmpfs thrash actively harmful at dataset > cache)
- S3 1G/0: 1135 s (worst: cache thrash + Fix A lock overhead > raw blobber)
Conclusion: single-tier tmpfs sized ≥ working set is the best config; spillover becomes valuable only when blobber fetch is expensive (WAN) or working set >> tmpfs.
Full 9-config sweep on SF10 dataset (test2, single node):
- Best: NFS 8G tmpfs = 228 s (near Apr-14 baseline 223 s)
- Worst: S3 1G tmpfs no-spill = 1135 s (cache thrash > no cache)
- Spillover adds zero benefit when blobbers are on localhost; only wins for WAN blobbers or dataset >> cache.
- S3 "no-thrash" configs converge at ~385 s; per-range HTTP/MinIO overhead dominates once cache stops thrashing.
Documents the cache architecture, read paths, eviction policy, and all fixes landed this session (lazy cache-back, OFD locks, atime-based eviction grace, EnsureFreeTmpfs skip-fail, prewarm ENOSPC stub restore, FSAL_ZUS EIO fallback).
Recommended default: single-tier tmpfs sized ≥ working set.
…r skip
wal.go:
- Delete keeps the walDeleted entry alive with a Timestamp so a racing GET can detect 'recently deleted' for 60s (WasRecentlyDeleted).
- ClearTombstone removes the marker on successful PUT so visibility is restored after a DELETE → PUT sequence.
- maybeCompact harvests expired (>60s) tombstones.
gateway-zcn.go:
- GetObjectNInfo and GetObjectInfo consult WasRecentlyDeleted and return 404 while the tombstone is live, closing the read-after-delete race at the gateway layer before gosdk consensus.
- PUT now writes to a tmp path and renames in-place so a racing GET never sees a partial file (closes truncate-race on overwrite).
- ClearTombstone fires on successful PUT (and on CompleteMultipart).
- ShouldCacheFile admission gate wired into both range-miss and full-miss cache-back paths (size-fits + hit-rate + not-recently-evicted), stopping the thrash ratio from climbing above ~1.05 on S3 1G/0 configs.
- largeObjectCacheSkipBytes=1 GiB: objects ≥1 GiB skip tmpfs entirely and spool to spillover NVMe; fixes the Llama3 1.5 GB PUT ENOSPC crash caused by peak 2× usage during write-then-rename.
- RecordGet at the top of GetObjectNInfo so the predictor sees every request, not just cache-hits.
- NFS counters (tmpfs/spillover/prewarm) wired via prewarmHandler.
multipart.go:
- CompleteMultipartUpload clears the tombstone on success.
Validated: Porcupine 10/10 LINEARIZABLE at w=4/400 and w=8/400 ops on put/get/del/copy/move/rename; S3 thrash ratio 2.95× → 1.06× after admission gate.
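The tombstone behaviour described for wal.go (mark on DELETE, report "recently deleted" for 60s, clear on PUT, compact expired markers) can be sketched as follows. The `tombstones` type and its fields are hypothetical; only the method names WasRecentlyDeleted and ClearTombstone come from the commit message, and their real signatures are not shown there.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tombstoneTTLSeconds is the 60s window from the commit message.
const tombstoneTTLSeconds = 60

// tombstones records when each key was deleted (Unix seconds).
type tombstones struct {
	mu      sync.Mutex
	now     func() int64 // injectable clock; an assumption made for testability
	deleted map[string]int64
}

func newTombstones() *tombstones {
	return &tombstones{
		now:     func() int64 { return time.Now().Unix() },
		deleted: make(map[string]int64),
	}
}

// MarkDeleted records a DELETE so racing GETs can detect it.
func (t *tombstones) MarkDeleted(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.deleted[key] = t.now()
}

// WasRecentlyDeleted reports whether key has a tombstone that is still live.
func (t *tombstones) WasRecentlyDeleted(key string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	ts, ok := t.deleted[key]
	return ok && t.now()-ts <= tombstoneTTLSeconds
}

// ClearTombstone restores visibility after a successful PUT.
func (t *tombstones) ClearTombstone(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.deleted, key)
}

// Compact drops tombstones older than the TTL and returns how many it removed.
func (t *tombstones) Compact() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	removed := 0
	for key, ts := range t.deleted {
		if t.now()-ts > tombstoneTTLSeconds {
			delete(t.deleted, key)
			removed++
		}
	}
	return removed
}

func main() {
	ts := newTombstones()
	ts.MarkDeleted("bucket/key")
	fmt.Println("GET right after DELETE sees tombstone:", ts.WasRecentlyDeleted("bucket/key"))
	ts.ClearTombstone("bucket/key") // a later PUT succeeded
	fmt.Println("GET after PUT sees tombstone:", ts.WasRecentlyDeleted("bucket/key"))
}
```

A GET handler would return 404 while `WasRecentlyDeleted` is true, before consulting any slower backend.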
prefetch_predictor.go (new, ~200 LOC):
- Per-dir ring buffer of recent GET keys.
- Sorted-run detection: when the last N requests in a dir form a monotonic sequence, dispatch cacheBackFullFetch for the next N keys via alloc.GetRefs so sequential workloads (MLPerf readouts, TPC-DS full-table scans) stream through tmpfs with no blobber round-trips.
- Thread-safe per-dir locks, bounded buffer size, backoff on evict.
cache_router.go:
- Wrap served readers in klauspost/readahead so OS-level readers see 1 MiB prefetched chunks regardless of caller buffer size.
- Issue FADV_SEQUENTIAL + FADV_WILLNEED via unix.Fadvise on cache files before handing the fd to the HTTP writer; doubles effective sequential MB/s on spillover-NVMe hits.
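Sorted-run detection as described for the prefetch predictor can be sketched with two small functions: one that checks whether the last N keys are increasing, and one that picks the next N keys from a sorted directory listing. Both `isSortedRun` and `nextKeys` are hypothetical names, and treating "monotonic" as strictly increasing in lexicographic order is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// isSortedRun reports whether the last n keys in recent are strictly
// increasing in lexicographic order.
func isSortedRun(recent []string, n int) bool {
	if n < 2 || len(recent) < n {
		return false
	}
	tail := recent[len(recent)-n:]
	for i := 1; i < n; i++ {
		if tail[i] <= tail[i-1] {
			return false
		}
	}
	return true
}

// nextKeys returns up to n keys that follow last in a sorted listing.
// In the real predictor the listing would come from the allocation's refs.
func nextKeys(listing []string, last string, n int) []string {
	i := sort.SearchStrings(listing, last)
	if i < len(listing) && listing[i] == last {
		i++ // start after the key that was just served
	}
	end := i + n
	if end > len(listing) {
		end = len(listing)
	}
	return listing[i:end]
}

func main() {
	listing := []string{"part-0", "part-1", "part-2", "part-3", "part-4"}
	recent := []string{"part-0", "part-1"}
	if isSortedRun(recent, 2) {
		fmt.Println("prefetch:", nextKeys(listing, recent[len(recent)-1], 2))
	}
}
```

A caller would feed each GET key into the per-directory buffer and, when a run is detected, schedule background fetches for the returned keys.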
initSDK.go:
- FallbackS3WriteThrough config field accepts "async" | "mirror" | "" (off). async posts to a bounded worker pool; mirror fails the PUT if the upstream S3 write fails.
- FallbackS3WriteMaxBytes caps per-object mirror size so huge checkpoints do not block the PUT ack path.
fallback_s3_write.go (new, ~250 LOC):
- syncPutToUpstream / syncPutStreamToUpstream / syncDeleteFromUpstream with retry + exponential backoff and a 16-way bounded concurrency channel.
- Optional mirror mode surfaces upstream errors on the PUT response so multi-region deployments can guarantee cross-region landing before ack.
nfs_blobber_sync.go:
- spilloverMonitor is now gated on NFSSpilloverCacheEnabled so deployments with tmpfs-only cache skip the monitor goroutine.
- commitBatch and commitDeleteBatch invoke the write-through hooks from fallback_s3_write.go so the NFS path also fans out to the upstream S3 when configured.
- Removed duplicate stubbuf declaration that tripped the linter.
cache_stats_router.go: expose nfs_tmpfs_hits, nfs_spillover_hits, and nfs_prewarm_fetches in the /cache-stats JSON body so operators can distinguish tmpfs-only hits from spillover-NVMe promotion and from cold cacheBack fetches.
prewarm_router.go: atomic.AddInt64 bumps at the 4 prewarm entry points feed the counters above.
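The retry with exponential backoff mentioned for the upstream write-through helpers can be sketched as below. This is not the code from fallback_s3_write.go: `retryWithBackoff`, `recordingSleeper`, and `errUpstream` are hypothetical, the attempt count and base delay are placeholders, and the 16-way concurrency limit is left out of the sketch.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUpstream stands in for a failed upstream S3 write.
var errUpstream = errors.New("upstream S3 unavailable")

// recordingSleeper captures requested delays instead of sleeping,
// which makes the backoff schedule easy to inspect.
type recordingSleeper struct{ delays []time.Duration }

func (s *recordingSleeper) Sleep(d time.Duration) { s.delays = append(s.delays, d) }

// retryWithBackoff runs op up to attempts times (attempts should be >= 1),
// doubling the delay after each failure. It returns nil on the first success
// and the last error if every attempt fails.
func retryWithBackoff(attempts int, base time.Duration, sleep func(time.Duration), op func() error) error {
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			sleep(delay) // no sleep after the final attempt
			delay *= 2
		}
	}
	return err
}

func main() {
	calls := 0
	err := retryWithBackoff(4, 50*time.Millisecond, time.Sleep, func() error {
		calls++
		if calls < 3 {
			return errUpstream // first two attempts fail
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

In "mirror" mode the returned error would be surfaced on the PUT response; in "async" mode it would only be logged by the worker.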
go.work + go.work.sum: go 1.22.5 → 1.24.0, toolchain go1.22.11 → go1.24.5. Required by the new dependencies pulled in by the prefetch predictor (klauspost/readahead) and aligns the workspace with the system_test toolchain bump.
…ture
DEPLOYMENT.md (new, ~400 lines): three deployment topologies (co-located, remote-gateway, containerised) with ports, setup commands, data-flow diagrams, and pricing model for multi-gateway fan-out.
BENCHMARKS_2026_04_19.md: extended with PM addendum, kNFSd + pipelined-AU section, methodology correction (NFS path was /nfs_export raw tmpfs, now /mnt/zus_nfs through Ganesha), and final NFS re-bench under /mnt/zus_nfs; SF10 NFS 1G/0 now 206s (vs 363s historic baseline).
BENCHMARKS_2026_04_15.md: SF10 baseline + cache-tier matrix (tmpfs + spillover combinations) captured before the 04-19 fixes landed.
CLAUDE.md: Claude Code build/test/lint playbook for this repo.
LOGCACHE_ARCHITECTURE.md: design notes for the prospective log-cache layer (not yet implemented).
…rm size-mismatch accept
- gateway-zcn.go: ListBuckets switches from GetRefs to ListDir so that directories created implicitly by rclone uploads (no explicit mkdir WriteMarker) are discovered; GetBucketInfo also probes ListDir as an implicit-dir fallback before returning BucketNotFound
- initSDK.go: pre-seed gosdk node cache with SetNetwork before InitStorageSDK so that InitNetworkDetails can use the cached nodes as a fallback when 0DNS is transiently unreachable at startup
- prewarm_router.go: split empty-read (n==0, restore stub) from size-mismatch (n>0 but n!=expected, accept actual bytes); stale ActualFileSize in the best-effort filerefsworker fallback ref was triggering stub restore and starving subsequent reads