perf: Reduce Thrift+Nimble decoding CPU time by 32% by reusing ZSTD_DCtx across streams in Nimble deserialization (#639)
Closed
sdruzkin wants to merge 1 commit into facebookincubator:main
Conversation
sdruzkin added a commit to sdruzkin/nimble that referenced this pull request on Apr 7, 2026.
sdruzkin added further commits to sdruzkin/nimble referencing this pull request on Apr 7, Apr 9, and Apr 23, 2026, each carrying the same summary as the pull request description below.
This pull request has been merged in 2e2721b.
Summary:
ZSTD_decompress() (one-shot API) internally allocates and frees a
ZSTD_DCtx (~100-200KB) on every call. In the Nimble legacy
deserialization path, this happens once per stream per batch,
creating a hot alloc/free cycle through jemalloc's large extent
allocator.
This diff adds a persistent ZSTD_DCtx owned by nimble::Deserializer
and threads it through StreamDataReader and StreamData to use
ZSTD_decompressDCtx() instead. The change is backward-compatible:
all new parameters default to nullptr, falling back to the original
ZSTD_decompress() when no context is provided.
Per the zstd manual (https://facebook.github.io/zstd/zstd_manual.html#Chapter9), reusing a ZSTD_DCtx across decompression calls is supported and safe.
Benchmark results on aidn_v2
CPU time went from 40 CPU seconds to 27 CPU seconds.
Reviewed By: harsharastogi
Differential Revision: D99456594