perf: Reduce Thrift+Nimble decoding CPU time by 32% by reusing ZSTD_DCtx across streams in Nimble deserialization (#639)

Closed
sdruzkin wants to merge 1 commit into facebookincubator:main from sdruzkin:export-D99456594

Conversation

@sdruzkin
Contributor

@sdruzkin sdruzkin commented Apr 4, 2026

Summary:

ZSTD_decompress() (one-shot API) internally allocates and frees a
ZSTD_DCtx (~100-200KB) on every call. In the Nimble legacy
deserialization path, this happens once per stream per batch,
creating a hot alloc/free cycle through jemalloc's large extent
allocator.

This diff adds a persistent ZSTD_DCtx owned by nimble::Deserializer
and threads it through StreamDataReader and StreamData to use
ZSTD_decompressDCtx() instead. The change is backward-compatible:
all new parameters default to nullptr, falling back to the original
ZSTD_decompress() when no context is provided.

Reusing ZSTD_DCtx should be safe according to https://facebook.github.io/zstd/zstd_manual.html#Chapter9

Benchmark results on aidn_v2

CPU time dropped from 40 to 27 CPU-seconds, a roughly 32% reduction.

Reviewed By: harsharastogi

Differential Revision: D99456594

@meta-cla Bot added the CLA Signed label (managed by the Meta Open Source bot) Apr 4, 2026
@meta-codesync

meta-codesync Bot commented Apr 4, 2026

@sdruzkin has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99456594.

@sdruzkin sdruzkin changed the title Reduce Thrift+Nimble decoding CPU time by 32% by reusing ZSTD_DCtx across streams in Nimble deserialization perf: Reduce Thrift+Nimble decoding CPU time by 32% by reusing ZSTD_DCtx across streams in Nimble deserialization Apr 6, 2026
@meta-codesync meta-codesync Bot changed the title perf: Reduce Thrift+Nimble decoding CPU time by 32% by reusing ZSTD_DCtx across streams in Nimble deserialization perf: Reduce Thrift+Nimble decoding CPU time by 32% by reusing ZSTD_DCtx across streams in Nimble deserialization (#639) Apr 7, 2026
sdruzkin added a commit to sdruzkin/nimble that referenced this pull request Apr 7, 2026
…Ctx across streams in Nimble deserialization (facebookincubator#639)

sdruzkin added a commit to sdruzkin/nimble that referenced this pull request Apr 7, 2026
…Ctx across streams in Nimble deserialization (facebookincubator#639)

sdruzkin added a commit to sdruzkin/nimble that referenced this pull request Apr 7, 2026
…Ctx across streams in Nimble deserialization (facebookincubator#639)

sdruzkin added a commit to sdruzkin/nimble that referenced this pull request Apr 7, 2026
…Ctx across streams in Nimble deserialization (facebookincubator#639)

@sdruzkin sdruzkin force-pushed the export-D99456594 branch 2 times, most recently from f528b52 to 8ae4d83 Compare April 9, 2026 19:00
sdruzkin added a commit to sdruzkin/nimble that referenced this pull request Apr 9, 2026
…Ctx across streams in Nimble deserialization (facebookincubator#639)

sdruzkin added a commit to sdruzkin/nimble that referenced this pull request Apr 9, 2026
…Ctx across streams in Nimble deserialization (facebookincubator#639)

@sdruzkin sdruzkin force-pushed the export-D99456594 branch 2 times, most recently from 35fa0c9 to 3713cdd Compare April 9, 2026 22:29
sdruzkin added a commit to sdruzkin/nimble that referenced this pull request Apr 9, 2026
…Ctx across streams in Nimble deserialization (facebookincubator#639)

sdruzkin added a commit to sdruzkin/nimble that referenced this pull request Apr 9, 2026
…Ctx across streams in Nimble deserialization (facebookincubator#639)

sdruzkin added a commit to sdruzkin/nimble that referenced this pull request Apr 23, 2026
…Ctx across streams in Nimble deserialization (facebookincubator#639)

sdruzkin added a commit to sdruzkin/nimble that referenced this pull request Apr 23, 2026
…Ctx across streams in Nimble deserialization (facebookincubator#639)

sdruzkin added a commit to sdruzkin/nimble that referenced this pull request Apr 23, 2026
…Ctx across streams in Nimble deserialization (facebookincubator#639)

sdruzkin added a commit to sdruzkin/nimble that referenced this pull request Apr 23, 2026
…Ctx across streams in Nimble deserialization (facebookincubator#639)

@meta-codesync

meta-codesync Bot commented Apr 25, 2026

This pull request has been merged in 2e2721b.


Labels

CLA Signed This label is managed by the Meta Open Source bot. fb-exported Merged meta-exported
