Skip to content

Copy-on-Write AnnotatedTypeMirror prototype#1837

Draft
wmdietl wants to merge 13 commits into
masterfrom
cow-prototype
Draft

Copy-on-Write AnnotatedTypeMirror prototype#1837
wmdietl wants to merge 13 commits into
masterfrom
cow-prototype

Conversation

@wmdietl

@wmdietl wmdietl commented Jun 23, 2026

Copy link
Copy Markdown
Member

This experiment is already described in the performance-notes.md file:

  **copy-on-write PROTOTYPED and characterized** (branch
  `cow-prototype`, an un-merged experiment — a PR was opened but will not be merged): solves the
  soundness blocker (Guava-validated), allocation −4.8%, but **no
  configuration is wall-positive** (all-caches +5.3%, elementTypeCache-only +1.8%/noise) — a memory/GC
  win only, not a wall win. See the narrative's "Copy-on-write ATMs — PROTOTYPED" entry.

Opening just for completeness, will be abandoned.

wmdietl and others added 13 commits June 22, 2026 18:32
Document the measured-flat result for migrating annotation-name dispatch
from value-based String comparison (AnnotationUtils.annotationName) to
javac Name identity comparison (annotationNameAsName + ==), framework-wide.

The fully-realized form (CFAM caches its Name, annotationNameAsName has a
CFAM fast path, isSupportedQualifier uses a companion Set<Name>) is within
the run-to-run noise band on both allocation and wall clock across nullness
and Value workloads. Root cause: CheckerFrameworkAnnotationMirror already
caches the decoded interned name, so the toString()/intern() the change
targets is not on the hot path. A ceiling profile on an annotation-saturated
workload puts all name-handling frames at 0 samples; the hot leaf that does
appear (IdentityHashMap.get -> getQualifierKind) is itself already rejected.

Also add the "measure the ceiling on an adversarial workload" habit and a
do-not-propose entry to the cf-performance skill.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…caches

Replace the placeholder "scope-bounded expression-type cache" direction
(#2 under the architectural getAnnotatedType item) with the measured
result. Instrumented addComputedTypeAnnotations(Tree, ATM), splitting it
at the useFlow boundary and recording per-tree redundancy plus the unsound
fraction on all-systems (120 files) and a loop-heavy workload.

Both forms are unsound. Caching the pre-flow (2a) type is 25% (all-systems)
to 38% (loops) unstable because tree annotators transitively query
flow-refined subexpression types. A phase-scoped full cache, restricted to
the post-flow checking traversal (analysis.isRunning() false, frozen
flowResult), is still 26% to 59% unstable: the same expression tree yields
different types across repeated queries even after flow completes. The flow
step itself is cheap, so the cost lives in the 2a computation whose result
is the unstable thing; the stable trees are the cheap-to-recompute ones.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Extend the rejected getAnnotatedType expression-cache direction with the
leaf-subset measurement. The hypothesis was that subexpression-free leaves
(identifiers, literals) have a flow-independent prefix and so could be
cached even where compound trees cannot. Instrumented, split by kind,
checking-phase only.

Rejected: in the checking phase the pre-flow leaf type is still unstable
22% (identifiers) / 40% (literals) on all-systems. The cause is context-
dependence in which hierarchy's annotations are applied -- the same `1`
literal returns `int` vs `@Initialized int` (nullness runs with the
Initialization checker) -- so a tree-keyed cache would serve an
under-annotated type. The cost is immaterial anyway: safely-cacheable
checking-phase leaf 2a is ~1.5% of compile. getAnnotatedType is keyed by
call context, not tree identity.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
My two earlier entries rejected the pre-flow ("split") and leaf-subset
getAnnotatedType caches as "25-59% unstable." That was a measurement
artifact: the per-tree signature used AnnotatedTypeMirror.toString(),
which conflates cross-hierarchy completeness (int vs @initialized int =
the Initialization annotation absent vs present) with real within-hierarchy
disagreement, and the comparison indexed getTopAnnotations() (an unstable-
order Set) positionally, misaligning hierarchies across calls.

Re-measured per hierarchy (keyed by top-annotation name; present-disagree
and completeness counted separately): value leaves are 0% unstable / 0%
incomplete over ~70k (all-systems) and ~208k (loops) repeats; compound
expressions ~0.46% / 0%; type-name identifiers ~0.7%. The pre-flow type is
stable; flow-dependence is confined to step 2b, as the split assumed. So
the split cache is a sound candidate (value still pending a built-cache
A/B). Reclassify accordingly and record the per-hierarchy measurement
method in the skill so the toString trap is not repeated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rement

My prior correction reached the right conclusion (the pre-flow split cache
is a sound candidate) but blamed the wrong cause: "unstable getTopAnnotations
iteration order." That is wrong. getTopAnnotations returns a build-once,
ArrayList-backed AnnotationMirrorSet over a TreeMap<QualifierKind,...> --
already a stable, deterministic order.

The actual confound was the instrumentation: a static per-Tree map shared
across the FOUR sub-factories a nullness compile runs (NullnessNoInit,
Initialization, InitializationFieldAccess, KeyFor), so it compared one type
system's snapshot of a tree against another's. Re-measured with a per-factory
(instance) map: value leaves are still 0% disagree / 0% completeness
(58,447 / 130,894 repeats); compound 0.69% / 0. The candidate stands; the
explanation is now correct. Skill updated with both measurement traps
(toString completeness conflation; static Tree-keyed state across factories).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ject it

Resolve the "candidate, pending A/B" status of the pre-flow split cache by
actually building it and measuring. A working per-factory cache (frozen
pre-flow types, populated in addComputedTypeAnnotations, served by an
overridden getAnnotatedType that re-applies only the flow step, toggled by
-Dcf.preflowcache) was implemented and tested.

Rejected for two independent reasons:
- Unsound across checkers. IndexTest fails (PredecrementTest: an extra
  array.access.unsafe.low). UpperBound/LowerBound/Optional/Lock/Signedness
  override addComputedTypeAnnotations to add flow-dependent annotations after
  super; intercepting at getAnnotatedType bypasses that subclass logic on a
  hit. A nullness-only diagnostic diff was clean, which is why nullness
  validation missed it.
- Flat where sound. Deterministic A/B (nullness): all-systems alloc -1.6% /
  wall ~0; loops alloc +2.0% (worse) / wall -4.7%. The per-hit deepCopy+freeze
  overhead cancels the saved 2a work; the projected 4-15% did not appear.

Update the skill: stability != cacheable; a getAnnotatedType cache must
compose with subclass addComputedTypeAnnotations overrides; validate on the
override-checkers (Index/Lock/Optional/Signedness), not just nullness.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…index

Move the pre-flow split cache from "Open, low-value" to "Closed, do not
re-propose" now that it has been built and rejected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-2 cost)

Profiled the Nullness Checker on the Guava `guava` module (625 files, 7,814
JFR samples) -- the first hotspot trace of a large heavy-generics codebase
(Guava is the 994-work-unit case used only for correctness/budget so far).

Leaf self-time generalizes: same flat profile as checkNullness, no CF leaf
above 3.85% (IdentityHashMap lookups, ATM traversal, apiComplete, defaulting,
AnnotatedTypeCopier, createType). New fact: type-argument inference is ~23%
inclusive (methodFromUse 24.6% -> inferTypeArgs 23.0% -> reduceOneStep 14.1%),
where it is negligible on all-systems/checkNullness -- but 50% of that is
getAnnotatedType and 28% is the already-shipped #1829 incorporation, so it
reinforces the Short list rather than opening a new lever. Note Guava (or
jspecify-conformance) as the venue for any future inference work.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Editorial pass for review: rewrite the pre-flow split-cache entry and the
companion skill lesson to present the final understanding coherently
(built & rejected; stability != cacheable; the two measurement traps;
getTopAnnotations is already stable), dropping the incremental
"prior revision of this entry" framing. No new findings; content-equivalent.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Prototype of copy-on-write that unblocks the immutability allocation win
(return shared frozen cache masters instead of deepCopy()), under -Dcf.cow.

Design: the 6 post-pipeline caches return cowCopy() (a non-frozen shallow copy
sharing the master's frozen children) instead of deepCopy(); the ~13 child
accessors and the three fixupBoundAnnotations lazily unshare a frozen child of
a non-frozen parent (cowChild/cowChildren); a per-node cowDirty flag keeps the
COW scan off the hot path for non-cache types.

Validated: ElementTypeCacheWildcardBound, all-systems 269 (identical
diagnostics), full Guava nullness build (BUILD SUCCESS, 0 crashes).
Measured (all-systems 269): allocation -4.8%, wall +5.3%. Correctness-complete
but wall-negative; off by default.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… win

Two wall optimizations attempted on the COW prototype:
1. Gate every child accessor with cowActive() (= COW && cowDirty) so non-cache
   types skip the cowChild call and field write. Removed the field-write cost
   but did not move wall (the cost was not the write).
2. Flip only elementTypeCache (read-only-majority consumers) instead of all six
   caches, reverting the full-walk caches (element/fromMember/fromExpression/
   fromType/methodAsMemberOf) to deepCopy.

Results (all-systems 269): all-six COW = alloc -4.8% / wall +5.3%;
elementTypeCache-only = alloc -0.2% / wall +1.8% (noise). No configuration is
wall-positive. Structural reasons: the copier is already cheap (~1-2%), so
eliminating it cannot beat the global per-accessor cowActive() tax; and the
cache consumers fully walk the result (defaulting/annotators), so piecemeal
per-child cowCopy is slower than one batched deepCopy and the read-only-skip
benefit never materializes. COW is a memory/GC-pressure optimization; for wall
clock, the existing deepCopy is already optimal.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Restore all six cache flips on top of the cowActive() gating, so this is the
max-allocation-reduction reference (alloc -4.8%).

Tested the "downstream / GC-bound timing win" hypothesis on all-systems 269
across a heap sweep. The wall penalty shrinks as the heap tightens (-Xmx512m
+8.1%, 320m +0.5%, 256m +0.1%), BUT a clean GC measurement at 256m shows COW
does NOT reduce GC: 0.77s vs 0.76s pauses, 225 collections both. So the
gap-closing is measurement variance, not GC savings -- the -4.8% allocation
does not convert to fewer collections. CF compilation is CPU-bound (~96%
on-CPU, GC <=4% even at scale per the notes), so the allocation reduction has
no wall payoff (GC ceiling ~4%, COW captures none of it) and cannot offset the
+5% CPU tax. No single-compile or downstream timing win from COW.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@wmdietl wmdietl marked this pull request as draft June 23, 2026 20:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant