perf: Reduce Thrift+Nimble decoding CPU time by 32% by reusing ZSTD_DCtx across streams in Nimble deserialization (#639)
Closed
sdruzkin wants to merge 1 commit into facebookincubator:main
Conversation
sdruzkin added a commit to sdruzkin/nimble that referenced this pull request on Apr 7, 2026.
sdruzkin added further commits to sdruzkin/nimble referencing this pull request on Apr 7, Apr 9, and Apr 23, 2026, each carrying the same summary as the pull request description below.
This pull request has been merged in 2e2721b.
Summary:
ZSTD_decompress() (one-shot API) internally allocates and frees a
ZSTD_DCtx (~100-200KB) on every call. In the Nimble legacy
deserialization path, this happens once per stream per batch,
creating a hot alloc/free cycle through jemalloc's large extent
allocator.
This diff adds a persistent ZSTD_DCtx owned by nimble::Deserializer
and threads it through StreamDataReader and StreamData to use
ZSTD_decompressDCtx() instead. The change is backward-compatible:
all new parameters default to nullptr, falling back to the original
ZSTD_decompress() when no context is provided.
Per the zstd manual (https://facebook.github.io/zstd/zstd_manual.html#Chapter9), reusing a ZSTD_DCtx across decompression calls is supported and safe.
Benchmark results on aidn_v2
CPU time went from 40 CPU seconds to 27 CPU seconds.
Reviewed By: harsharastogi
Differential Revision: D99456594