
fix: disable lockedSynchronizers in dumpAllThreads to avoid ZGC safepoint heap scan #16195

Open
eddieran wants to merge 2 commits into apache:3.3 from eddieran:fix/jvmutil-disable-locked-synchronizers-heap-scan


eddieran commented Apr 7, 2026

What is the purpose of the change

Fixes #16194

JVMUtil.jstack() calls ThreadMXBean.dumpAllThreads(true, true). The lockedSynchronizers=true parameter forces the JVM to scan the entire Java heap at a safepoint to find all AbstractOwnableSynchronizer instances. On ZGC with large heaps, this causes catastrophic safepoint pauses (36–39 seconds measured on a 65GB heap with ~1950 threads) that freeze the entire application.

This PR changes lockedSynchronizers from true to false, eliminating the heap scan.
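As a minimal sketch of the call shape after this change (not the actual `JVMUtil` source, which is not reproduced in this description), the class and method names below are hypothetical; only `ThreadMXBean.dumpAllThreads(boolean, boolean)` and `ThreadInfo` come from the JDK:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical illustration class; not part of Dubbo.
class ThreadDumpSketch {

    /** Dumps all threads with monitor info but without the j.u.c synchronizer lookup. */
    static String dumpWithoutSynchronizers() {
        ThreadMXBean threadMxBean = ManagementFactory.getThreadMXBean();
        // lockedMonitors = true, lockedSynchronizers = false
        ThreadInfo[] infos = threadMxBean.dumpAllThreads(true, false);
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : infos) {
            // Note: ThreadInfo.toString() prints only a limited number of stack frames;
            // a real formatter would walk getStackTrace() itself.
            sb.append(info.toString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpWithoutSynchronizers());
    }
}
```

The only behavioral difference from the previous call is the second argument; the first argument still requests locked-monitor information.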

Root Cause

On ZGC, HeapInspection::find_instances_at_safepoint() iterates the entire heap, and every object reference must pass through ZGC's load barrier (color bit check → relocate → forwarding table → remap). On our 65GB heap, this resulted in a ~37-second safepoint pause. For comparison, normal ZGC safepoint operations (Mark Start, Mark End, Relocate Start) complete in 0.1–0.8ms.

The OpenJDK community already fixed this on the tooling side (JDK-8324066: "clhsdb jstack should not scan for j.u.c locks by default"), but the programmatic API (ThreadMXBean.dumpAllThreads) has no such protection.

Production Impact

When AbortPolicyWithReport fires on ZGC + large heap:

  1. dumpAllThreads(true, true) → 37s full application freeze
  2. Once the pause ends, queued requests immediately exhaust the pool again, triggering further dumps → cascading freezes
  3. Observed: 4 consecutive dumps → ~150s near-total service unavailability

Brief changelog

  • JVMUtil.jstack(): Change dumpAllThreads(true, true) to dumpAllThreads(true, false)

What is lost

Only the "Locked synchronizers" section at the bottom of each thread's dump is lost, i.e., ownership of java.util.concurrent.locks ownable synchronizers such as ReentrantLock and the write lock of ReentrantReadWriteLock. All other diagnostic info is retained:

| Information | Retained? |
| --- | --- |
| Thread name, ID, state | Yes |
| Full stack traces | Yes |
| synchronized block contention (BLOCKED on ...) | Yes |
| synchronized monitor ownership (- locked ...) | Yes |
| Waiting/parking state | Yes |
| Lock owner for BLOCKED threads | Yes |
| j.u.c.locks ownership (ReentrantLock, etc.) | No |
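The retained/lost split in the table can be checked with a small sketch. It starts a helper thread that holds both a `synchronized` monitor and a `ReentrantLock`, then reads that thread's `ThreadInfo`. The class name, thread name, and method are hypothetical and exist only for illustration:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration class; not part of Dubbo.
class RetainedInfoSketch {

    /** Returns {lockedMonitorCount, lockedSynchronizerCount} for a thread holding one of each. */
    static int[] countLocks(boolean lockedSynchronizers) {
        final Object monitor = new Object();
        final ReentrantLock lock = new ReentrantLock();
        final CountDownLatch held = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);

        Thread holder = new Thread(() -> {
            synchronized (monitor) {          // monitor ownership
                lock.lock();                  // j.u.c ownable synchronizer
                try {
                    held.countDown();
                    release.await();          // stay parked while the dump is taken
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    lock.unlock();
                }
            }
        }, "holder-sketch");
        holder.setDaemon(true);
        holder.start();

        try {
            held.await();
            ThreadMXBean bean = ManagementFactory.getThreadMXBean();
            for (ThreadInfo info : bean.dumpAllThreads(true, lockedSynchronizers)) {
                if (info.getThreadId() == holder.getId()) {
                    return new int[] {
                        info.getLockedMonitors().length,
                        info.getLockedSynchronizers().length
                    };
                }
            }
            return new int[] {-1, -1};        // helper thread not found in the dump
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown();              // let the helper thread exit
        }
    }

    public static void main(String[] args) {
        int[] without = countLocks(false);
        int[] with = countLocks(true);
        System.out.println("lockedSynchronizers=false: monitors=" + without[0] + ", synchronizers=" + without[1]);
        System.out.println("lockedSynchronizers=true:  monitors=" + with[0] + ", synchronizers=" + with[1]);
    }
}
```

On a HotSpot JVM one would expect the monitor count to stay the same in both runs and the synchronizer count to be zero only when the flag is `false`.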

Verifying this change

Existing tests pass: all tests in AbortPolicyWithReportTest mock the jstack() method, so the parameter change does not affect them.

The fix can be verified by:

  1. Deploy a Dubbo app with ZGC + large heap (≥32GB)
  2. Exhaust the thread pool to trigger AbortPolicyWithReport
  3. Observe safepoint duration in GC logs — should drop from ~37s to <100ms
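For a quick local comparison before a full deployment, the two call variants can be timed from inside the JVM. This is a rough sketch with hypothetical names. Wall-clock time around the call only approximates the safepoint pause, so the GC/safepoint log (for example with `-Xlog:safepoint` on JDK 9+) remains the authoritative measurement, and the difference is likely to be small on a small heap:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical illustration class; not part of Dubbo.
class DumpTimingSketch {

    /** Returns the elapsed wall-clock nanoseconds for a single dumpAllThreads call. */
    static long timeDumpNanos(boolean lockedSynchronizers) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long start = System.nanoTime();
        bean.dumpAllThreads(true, lockedSynchronizers);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Run with the collector and heap size under test, e.g. -XX:+UseZGC -Xmx<size>.
        long withScan = timeDumpNanos(true);
        long withoutScan = timeDumpNanos(false);
        System.out.printf("lockedSynchronizers=true:  %.3f ms%n", withScan / 1_000_000.0);
        System.out.printf("lockedSynchronizers=false: %.3f ms%n", withoutScan / 1_000_000.0);
    }
}
```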

eddieran added 2 commits April 7, 2026 14:09
… heap scan (apache#16194)

`ThreadMXBean.dumpAllThreads(true, true)` with lockedSynchronizers=true
forces the JVM to scan the entire heap at a safepoint to find all
AbstractOwnableSynchronizer instances. On ZGC with large heaps (65GB+),
this causes ~37-second safepoint pauses that freeze all application
threads, leading to cascading thread pool exhaustion.

Change lockedSynchronizers from true to false. This retains locked
monitor information (derived from thread stacks, cheap) but skips the
expensive heap scan. Only java.util.concurrent.locks ownership info
is lost from the thread dump output.

Fixes apache#16194
Context is documented in the issue and PR description.

Fixes apache#16194

codecov-commenter commented Apr 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.75%. Comparing base (69084bd) to head (b12b3cc).

Additional details and impacted files
@@             Coverage Diff              @@
##                3.3   #16195      +/-   ##
============================================
- Coverage     60.80%   60.75%   -0.05%     
+ Complexity    11756    11750       -6     
============================================
  Files          1953     1953              
  Lines         89118    89118              
  Branches      13444    13444              
============================================
- Hits          54188    54145      -43     
- Misses        29368    29397      +29     
- Partials       5562     5576      +14     
| Flag | Coverage Δ |
| --- | --- |
| integration-tests-java21 | 32.15% <0.00%> (-0.01%) ⬇️ |
| integration-tests-java8 | 32.23% <0.00%> (-0.09%) ⬇️ |
| samples-tests-java21 | 32.21% <0.00%> (+0.07%) ⬆️ |
| samples-tests-java8 | 29.70% <0.00%> (-0.06%) ⬇️ |
| unit-tests-java11 | 59.02% <100.00%> (-0.01%) ⬇️ |
| unit-tests-java17 | 58.52% <100.00%> (+<0.01%) ⬆️ |
| unit-tests-java21 | 58.49% <100.00%> (-0.02%) ⬇️ |
| unit-tests-java25 | 58.44% <100.00%> (-0.02%) ⬇️ |
| unit-tests-java8 | 59.04% <100.00%> (+0.01%) ⬆️ |


Development

Successfully merging this pull request may close these issues.

[Bug] JVMUtil.jstack() causes ~37s safepoint pause on ZGC with large heaps due to dumpAllThreads(true, true)

2 participants