Add persistent dependency inference cache for incremental --changed-dependents #23228
jasonwbarnett wants to merge 6 commits into pantsbuild:main
Implement IncrementalDependents subsystem that persists the forward dependency graph to disk. When enabled via --incremental-dependents-enabled, only targets whose BUILD files or source files have changed (based on mtime+size fingerprinting) need their dependencies re-resolved. This dramatically reduces wall time for --changed-dependents=transitive in large monorepos by avoiding redundant dependency inference on unchanged targets across pantsd restarts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…arse

Address.parse() fails on bare spec strings like "src/python/foo.py:bar" because it expects "//" prefix. Instead, build a spec→Address lookup dict from AllUnexpandedTargets for O(1) resolution of cached dep specs. Also simplify CachedEntry to store deps as spec strings directly rather than structured JSON tuples, and remove now-unused serialization helpers.

Results: 52927-target monorepo
- Cold cache: 3m12s (same as before, writes 29MB cache)
- Warm cache: 38s (dep graph in 1.6s, 52927 targets from cache)
- 5x speedup on warm cache, 100% identical output

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mtime-based fingerprinting fails across machines because git clone sets all file mtimes to the checkout timestamp, making the cache useless on CI agents. SHA-256 content hashing costs only ~5 seconds more for 18K files but makes the cache fully portable.

Benchmark (52,927 targets):
- Cold cache: 3m22s (writes cache)
- Warm cache: 43s (sha256 fingerprints, 100% cache hits)
- Cross-machine: cache is portable via S3 (1.3MB compressed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These .claude/worktrees/ entries were accidentally staged by git add -A and are not part of the persistent dep cache changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
159893c to 18fdfb0
- Unit tests for CachedEntry, save/load roundtrip, JSON edge cases
- Unit tests for SHA-256 file hashing
- Unit tests for compute_source_fingerprint (BUILD changes, source changes, stability)
- Integration tests verifying incremental mode matches standard mode for direct deps, transitive deps, empty inputs, and special-cased deps
- Fix missing Address import in incremental_dependents.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Could you elaborate (here or in an issue) on your setup? Of the 53k targets, what is the rough breakdown by types? I've most often seen --changed-since used with a filter to select on an "uncommon" type, such as "deploy all the helm stuff" or "publish all the docker images". From #23224 I take it you are filtering on a common type (like python_sources), is that correct? I know you have looked at this from a few different angles, does performance get worse with:
If I wanted to make a case like yours -- or even more pathological! -- what would I need?
- Replace IncrementalDependents subsystem with PANTS_INCREMENTAL_DEPENDENTS env var to avoid "No such options scope" errors in tests that use dependents rules without registering the subsystem
- Add release notes entry to docs/notes/2.32.x.md
- Fix unused import (textwrap) and formatting issues caught by CI linters
- All tests pass: dependents_test, incremental_dependents_test, py_constraints_test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
I'm going to write up a comprehensive explanation of how I arrived at the conclusion and all of the supporting evidence. Give me a couple of hours to get it done.
Performance Investigation:
| Target Type | Count | % of Total |
|---|---|---|
| `python_source` | 20,850 | 39.4% |
| `file` | 15,422 | 29.1% |
| `resource` | 8,620 | 16.3% |
| `python_test` | 2,945 | 5.6% |
| `python_sources` (generators) | 1,767 | 3.3% |
| `python_requirement` | 1,263 | 2.4% |
| `python_tests` (generators) | 693 | 1.3% |
| `shell_source` | 303 | 0.6% |
| `docker_image` | 92 | 0.2% |
| Other (resources, distributions, etc.) | 972 | 1.8% |
| Total | 52,927 | 100% |
Benchmark Results
All times are wall-clock elapsed seconds. Pants version 2.32.0.dev7 from source unless noted.
Test 1: The Core Bottleneck — --changed-dependents vs not
| Command | Time | Output |
|---|---|---|
| `filter` (no dependents) | 37s | 0 targets |
| `filter --changed-dependents=direct` | 2m53s | 0 targets |
| `filter --changed-dependents=transitive` | 2m40s | 440 targets |
Key finding: Adding --changed-dependents=direct jumps from 37s to 2m53s — a 4.7x increase — even when the result is 0 additional targets. The entire cost is building the full reverse dependency graph via map_addresses_to_dependents().
Test 2: The --changed-since Depth Does NOT Matter
| Range | Changed Files | Time | Output |
|---|---|---|---|
| `HEAD~1` | 9 files | 2m49s | 0 targets |
| `HEAD~3` | 53 files | 2m40s | 440 targets |
| `HEAD~10` | 79 files | 2m56s | 811 targets |
Times are within noise. The depth of --changed-since is irrelevant — the bottleneck is always map_addresses_to_dependents() which processes all 53K targets regardless of how many files changed.
Test 3: The Filter Type Does NOT Matter
| Filter | Time | Output |
|---|---|---|
| `--filter-target-type=+python_test` | 2m40s | 440 targets |
| `--filter-target-type=+docker_image` | 2m38s | 20 targets |
Same cost whether finding 440 test targets or 20 Docker targets. The filter is applied AFTER the full dependency graph is built.
Test 4: dependents Goal Shows the Same Bottleneck
| Command | Time | Output |
|---|---|---|
| `dependents --transitive <single-file>` | 2m50s | 1,601 dependents |
| `dependencies <single-file>` | 38s | 10 dependencies |
| `list ::` | 27s | 52,927 targets |
Computing forward dependencies for a single target: 38 seconds.
Computing reverse dependents for a single target: 2m50s (requires building the full reverse graph for ALL 53K targets).
Test 5: Warm pantsd Does NOT Help
| Run | Time |
|---|---|
| Cold pantsd | 2m44s |
| Warm pantsd (identical command) | 2m39s |
| Warm pantsd (different range) | 2m51s |
Warm pantsd provides essentially zero benefit for this operation. The map_addresses_to_dependents rule is recomputed on every invocation because it depends on AllUnexpandedTargets, which the Pants source describes as "relatively expensive to compute and frequently invalidated".
Test 6: Pre-built Binary vs From-Source
| Version | Time |
|---|---|
| Pants 2.30.0 (pre-built binary) | 2m56s |
| Pants 2.32.0.dev7 (from source) | 2m40s |
No meaningful difference. The bottleneck is the same in both versions.
Test 7: Work Unit Timing (from -linfo logs)
"Map all targets to their dependents" — reported as a long-running task at:
60.2s elapsed
90.1s elapsed
120.0s elapsed
This single rule (map_addresses_to_dependents) accounts for ~120 seconds out of ~160 seconds of total execution (75% of wall time).
Root Cause Analysis
What map_addresses_to_dependents() Does
```python
@rule(desc="Map all targets to their dependents")
async def map_addresses_to_dependents(all_targets: AllUnexpandedTargets) -> AddressToDependents:
    dependencies_per_target = await concurrently(
        resolve_dependencies(
            DependenciesRequest(tgt.get(Dependencies), ...)
        )
        for tgt in all_targets  # ALL 52,927 targets
    )
    # Invert the forward deps to build the reverse map
    address_to_dependents = defaultdict(set)
    for tgt, dependencies in zip(all_targets, dependencies_per_target):
        for dependency in dependencies:
            address_to_dependents[dependency].add(tgt.address)
    return AddressToDependents(...)
```

This rule:
1. Resolves `AllUnexpandedTargets`: every target in the repository (52,927)
2. For each target, calls `resolve_dependencies()`, which includes:
   - Parsing the target's BUILD file for explicit dependencies
   - Running dependency inference (Python import parsing, Docker COPY analysis, Shell source detection)
   - Resolving inferred module names to target addresses
3. Inverts the forward dependency graph into a reverse mapping
Step 2 is the expensive part. Python import inference uses a Rust-based tree-sitter parser (fast per-file), but the per-target overhead of the rule engine — resolving imports to target addresses via the module mapper, handling ambiguity, validating results — adds up at 53K scale.
Why Warm pantsd Doesn't Help
map_addresses_to_dependents takes AllUnexpandedTargets as its sole input. AllUnexpandedTargets is a rule that scans the entire filesystem for BUILD files and resolves all targets. The Pants engine's InvalidationWatcher (inotify-based) detects any filesystem change and invalidates AllUnexpandedTargets, which cascades to invalidate AddressToDependents.
Even without actual file changes, the engine must re-verify that all BUILD files are unchanged, re-hash target definitions, and confirm the cached result is still valid. At 53K targets, this verification itself is non-trivial.
Why the Filter Doesn't Help
The filter (--filter-target-type=+python_test, --tag="-integration") is applied after map_addresses_to_dependents completes. The full reverse graph for all 53K targets is built first, then the result is filtered down.
This is a deliberate design choice (see pantsbuild/pants#15544): filtering before building the graph would cause missed dependents when a filtered-out target is an intermediate link in the dependency chain.
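The missed-dependents problem can be illustrated with a minimal sketch that uses plain dicts and strings rather than Pants types. The helper names (`invert`, `transitive_dependents`, `is_test`) and the three-target graph are hypothetical and exist only for this example; they are not Pants APIs.

```python
from collections import defaultdict, deque


def invert(forward: dict[str, set[str]]) -> dict[str, set[str]]:
    """Turn a target -> dependencies map into a target -> dependents map."""
    reverse: dict[str, set[str]] = defaultdict(set)
    for target, deps in forward.items():
        for dep in deps:
            reverse[dep].add(target)
    return dict(reverse)


def transitive_dependents(reverse: dict[str, set[str]], roots: set[str]) -> set[str]:
    """Breadth-first walk over the reverse graph, excluding the roots themselves."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        current = queue.popleft()
        for dependent in reverse.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen - roots


def is_test(target: str) -> bool:
    # Hypothetical stand-in for a target-type filter.
    return target.startswith("tests/")


# Hypothetical graph: a test depends on a library, which depends on the changed file.
forward = {
    "tests/test_app.py": {"src/lib.py"},
    "src/lib.py": {"src/util.py"},
    "src/util.py": set(),
}
changed = {"src/util.py"}

# Filter after traversal: the test is reached through the intermediate library.
after = {t for t in transitive_dependents(invert(forward), changed) if is_test(t)}

# Filter before traversal: dropping src/lib.py removes the only path to the test.
pruned = {t: deps for t, deps in forward.items() if is_test(t)}
before = transitive_dependents(invert(pruned), changed)
```

Here `after` contains the test target while `before` is empty, because `src/lib.py` is the intermediate link and gets filtered out before the walk.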
Conditions to Reproduce
To reproduce this performance characteristic, you need:
- Many targets (>30K, ideally >50K). The cost scales roughly linearly with target count.
- `--changed-dependents=direct` or `--changed-dependents=transitive`. Without this flag, the operation is fast (~30-40s) because it only finds owners of changed files, not their dependents.
- Any amount of changed files: even 0 changed files triggers the full graph build if `--changed-dependents` is set.
The target type distribution doesn't matter much. file and resource targets (which make up 45% of our targets) have trivial dependency inference, but they still contribute to the 53K targets that map_addresses_to_dependents must process.
Synthetic Reproduction
To create a synthetic test case:
```bash
# Create 50K targets in a fresh repo
mkdir big-repo && cd big-repo
pants init

# Add Python backend and interpreter constraints
cat > pants.toml <<'EOF'
[GLOBAL]
pants_version = "2.31.0"
backend_packages = ["pants.backend.python"]

[python]
interpreter_constraints = ["==3.11.*"]
EOF

# Ignore pants cache so git diff doesn't explode
echo '/.pants.*' > .gitignore

for i in $(seq 1 500); do
  mkdir -p "pkg${i}"
  for j in $(seq 1 100); do
    echo "x = $j" > "pkg${i}/file${j}.py"
  done
  echo 'python_sources()' > "pkg${i}/BUILD.pants"
done

git init && git add . && git commit -m "init"

# Modify an existing file
sed -i 's/x = 1/x = 999/' pkg1/file1.py
git add . && git commit -m "change"

time pants --changed-since=HEAD~ --changed-dependents=transitive list
```

Summary
The performance issue is real, reproducible, and caused by map_addresses_to_dependents() resolving dependencies for ALL targets in the repo whenever --changed-dependents is used. The cost is O(N) where N = total target count, regardless of:
- How many files changed
- What target type is being filtered for
- Whether pantsd is warm or cold
- The depth of the git history
At 53K targets, this costs ~2m40s. The rule engine's in-memory caching doesn't help because AllUnexpandedTargets is invalidated on every invocation.
Thanks, this is very helpful. I'll do some analysis, but I'm out for the rest of the week and may not be able to post anything before then. A few clarifying questions:
Summary
Adds an opt-in persistent disk cache for the dependency graph computed by `map_addresses_to_dependents()`. When enabled via `--incremental-dependents-enabled`, the forward dependency graph is serialized to `~/.cache/pants/incremental_dep_graph_v2.json` after each run and loaded on the next run. Only targets whose source files have changed (by SHA-256 content hash) need their dependencies re-resolved.

This dramatically reduces wall time for `--changed-dependents=transitive` in large repos with many targets.

Motivation

In a monorepo with ~53K targets, `pants --changed-since=HEAD~3 --changed-dependents=transitive filter` takes ~3.5 minutes because `map_addresses_to_dependents()` calls `resolve_dependencies()` for every target, even when pantsd is warm. The rule engine's in-memory memoization is invalidated by any filesystem change, and the `AllUnexpandedTargets` → `AddressToDependents` cascade forces full recomputation each time.

The persistent cache breaks this cycle: even on a cold pantsd start (fresh CI agent), previously computed dependency edges are reused for unchanged targets.
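The reuse of cached edges for unchanged targets can be sketched roughly as follows. This is not the PR's implementation: `resolve_incrementally` and the `resolve` callback are hypothetical names, and the only thing taken from the PR is the entry shape (`fingerprint` plus `deps`) shown in the cache format under Design.

```python
from typing import Callable


def resolve_incrementally(
    fingerprints: dict[str, str],
    cache: dict[str, dict],
    resolve: Callable[[str], list[str]],
) -> tuple[dict[str, list[str]], list[str]]:
    """Return (forward deps for every target, targets that had to be re-resolved).

    `fingerprints` maps each current target spec to its content fingerprint.
    `cache` maps target specs to {"fingerprint": ..., "deps": [...]} and is
    updated in place so it can be persisted afterwards.
    """
    forward: dict[str, list[str]] = {}
    misses: list[str] = []
    for target, fingerprint in fingerprints.items():
        entry = cache.get(target)
        if entry is not None and entry.get("fingerprint") == fingerprint:
            # Cache hit: reuse the stored edges without running inference.
            forward[target] = list(entry["deps"])
        else:
            # Cache miss or stale entry: fall back to the expensive resolver.
            misses.append(target)
            forward[target] = resolve(target)
            cache[target] = {"fingerprint": fingerprint, "deps": forward[target]}
    # Drop entries for targets that no longer exist in the repo.
    for stale in set(cache) - set(fingerprints):
        del cache[stale]
    return forward, misses


def fake_resolve(target: str) -> list[str]:
    # Hypothetical stand-in for full dependency resolution.
    return ["c:lib"]


cache = {
    "a:lib": {"fingerprint": "f1", "deps": ["b:lib"]},
    "b:lib": {"fingerprint": "old", "deps": []},
    "gone:lib": {"fingerprint": "f9", "deps": []},
}
forward, misses = resolve_incrementally({"a:lib": "f1", "b:lib": "new"}, cache, fake_resolve)
```

In this example only `b:lib` is re-resolved; `a:lib` is served from the cache and the entry for the deleted `gone:lib` is pruned.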
Results
Tested on a monorepo with 52,927 targets:
Design
New subsystem: `--incremental-dependents-enabled`

Opt-in flag. When disabled (default), behavior is completely unchanged.

Cache format

JSON file at `~/.cache/pants/incremental_dep_graph_v2.json`:

```json
{
  "version": 2,
  "buildroot": "/path/to/repo",
  "entries": {
    "src/python/foo/bar.py:lib": {
      "fingerprint": "<sha256>",
      "deps": ["src/python/baz/qux.py:lib", "3rdparty/python:requests"]
    }
  }
}
```

Fingerprinting
Each target's cache key is SHA-256 of:
This is ~1 second for 18K files and is fully portable across machines.
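A content-based fingerprint of this kind might look like the sketch below. It assumes the inputs are a target's BUILD file and its source files, which is a guess based on the unit-test descriptions in the commit messages. `sha256_file` and `target_fingerprint` are hypothetical names, not the functions in the PR. Paths are hashed relative to the build root so the result does not depend on where the repo is checked out.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 16) -> str:
    """Hash a file's contents in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def target_fingerprint(root: Path, build_file: str, sources: list[str]) -> str:
    """Combine relative paths and content hashes into one stable digest.

    Assumption: the BUILD file and the source files are the only inputs.
    Sources are sorted so the result does not depend on listing order.
    """
    digest = hashlib.sha256()
    for rel_path in [build_file, *sorted(sources)]:
        digest.update(rel_path.encode())
        digest.update(b"\0")
        digest.update(sha256_file(root / rel_path).encode())
        digest.update(b"\0")
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pkg").mkdir()
    (root / "pkg/BUILD").write_text("python_sources()\n")
    (root / "pkg/a.py").write_text("x = 1\n")
    (root / "pkg/b.py").write_text("y = 2\n")

    first = target_fingerprint(root, "pkg/BUILD", ["pkg/a.py", "pkg/b.py"])
    reordered = target_fingerprint(root, "pkg/BUILD", ["pkg/b.py", "pkg/a.py"])

    # Editing a source file must change the fingerprint.
    (root / "pkg/a.py").write_text("x = 999\n")
    changed = target_fingerprint(root, "pkg/BUILD", ["pkg/a.py", "pkg/b.py"])
```

Unlike mtime+size, this digest is identical on a fresh `git clone` as long as the file contents match, which is what makes the cache shareable between machines.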
Safety
- … `resolve_dependencies()` as normal
- … `.tmp`, then `os.replace`)

Files changed

- `src/python/pants/backend/project_info/dependents.py`: Modified `map_addresses_to_dependents()` to use incremental mode when enabled
- `src/python/pants/backend/project_info/incremental_dependents.py`: New: cache persistence, fingerprinting, `IncrementalDependents` subsystem

CI usage
The cache can be shared across ephemeral CI agents via S3:
🤖 Generated with Claude Code