Weisfeiler-Lehman canonicalization cache, by Claude#105#107
Open
Weisfeiler-Lehman canonicalization cache, by Claude#105#107
Conversation
… lookup The component cache in Ganak stores sub-problem counts keyed by a fingerprint derived from the sorted list of absolute variable IDs and long clause IDs that make up each component. This means that two structurally isomorphic components -- identical graph topology, same clause structure, same independence/non- independence classification of variables, but different variable numberings -- are treated as entirely distinct cache entries, causing guaranteed misses even when the same structural sub-problem has already been solved. Analysis of benchmark data across 1600 instances showed that the average component is seen only ~2 times during a full run, and cache hits tend to cluster on very small components (22-64 vars avg) while misses dominate on larger ones (150-548 vars avg). The root cause is not cache pollution or eviction policy, but rather the inability of the fingerprint to identify structural equivalence across different variable numberings. This commit implements Weisfeiler-Lehman (WL) canonical component hashing, a graph-coloring technique that produces a structure-invariant fingerprint for each component. For each component with nVars <= wl_canonize_threshold (default 25), we compute a canonical ordering of the variables and express every clause -- both long (3+ literal) and binary -- as a sorted list of canonical variable indices (0..nVars-1) rather than absolute variable IDs. The canonical ordering is determined by: (1) an initial color per variable encoding its long-clause degree, binary-clause degree, and whether it belongs to the independent support; (2) one round of WL refinement where each variable's new hash incorporates the sorted multiset of its long-clause neighbours' initial colors; (3) a stable sort on (wl1_hash, init_color, original_var_id). The resulting sorted list of canonical clauses, plus nVars, is hashed via chibihash64 to produce the cache key. Two isomorphic components produce identical sorted_canon_clauses and therefore hit the same cache bucket. The implementation is structured as follows. A new CanonInfo struct (src/comp_types/canon_info.hpp) carries the canonical variable ordering, the sorted list of canonical clauses, the structural hash, and a valid flag. CompAnalyzer::compute_canon_info() (src/comp_analyzer.cpp) builds this in O(n * d) time where n is the component size and d is average clause degree, using holder.begin_long / holder.begin_bin with orig_size to access all irredundant clauses. The CanonInfo is computed immediately after make_comp_from_archetype() in CompManager::record_remaining_comps_for() and threaded through CompCacheIF::create_new_comp() -> CompCache<T>::create_new_comp() -> CacheableComp<T> constructor -> T::set_comp(). Both cache entry types are updated: HashedComp::set_comp() substitutes canon->hash for the raw-data chibihash when a valid CanonInfo is present; DiffPackedComp::set_comp() stores the canonical clause data (same encoding used to derive the hash) in its packed array and sets an is_canonical_ flag so that equals() correctly compares canonical entries against each other using memcmp on the structure-invariant data rather than the absolute-ID packed representation. The threshold (wl_canonize_threshold, default 25) is intentionally conservative for this initial implementation. WL canonicalization is O(n * d) in compute time and involves heap allocation for the canonical clause vectors, so applying it to very large components would hurt throughput more than it helps the cache. The threshold can be tuned via the config. Setting it to 0 disables the feature entirely, restoring prior behavior. A known limitation: the 1-round WL refinement does not distinguish all non-isomorphic graphs (it can produce collisions for regular graphs), so the hash remains probabilistic in character -- a structural hash collision yields a false cache hit in HashedComp mode, with the same probability guarantees as before. In DiffPackedComp mode the is_canonical_ flag and memcmp on canonical data provide exact equality. All existing tests (ctest, 20-instance fuzz suite with --exact) pass without modification. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Track three counters for the WL canonicalization feature introduced in the previous commit: wl_canon_computed (components where CanonInfo was valid and the canonical hash was used), wl_canon_skipped (components too large for the threshold or threshold==0), and wl_canon_hits (cache hits that occurred on a canonically-keyed lookup). The counters are incremented in record_remaining_comps_for() immediately after compute_canon_info() and after find_comp_and_incorporate_cnt() returns. They are printed in DataAndStatistics::print_short() alongside the existing cache stats, showing computed/skipped/hits counts and the canonical hit rate percentage. Add --wlcanon CLI flag in main.cpp wired to conf.wl_canonize_threshold, so the feature can be disabled (--wlcanon 0) or the threshold tuned from the command line without recompiling. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
In comp_analyzer.cpp (compute_canon_info), add three VERBOSE_DEBUG_DO blocks:
1. After computing initial WL colors: prints nVars, nLongCls, and for each
variable its id, long-clause degree, binary-clause degree, and is_indep flag.
2. After the WL refinement round: prints the per-variable wl1 hash in hex,
useful for diagnosing color collisions between non-isomorphic components.
3. After computing the final structural hash: prints the hash in hex and the
full canonical clause list as (i,j,...) tuples, so the canonical form can
be inspected and compared across two runs on isomorphic components.
In comp_management.cpp (record_remaining_comps_for), add two blocks:
- When CanonInfo is valid: prints the component size, structural hash, number
of canonical clauses, and the canonical variable ordering (original var ids).
- When skipped (too large): prints the component size and the configured
threshold, so it is clear why canonicalization was bypassed.
- On cache hit: extends the existing hit message with wl_canonical=yes/no so
it is immediately visible whether the hit was served via canonical hashing.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lity Two structurally isomorphic components can have different projected model counts if their variables differ in membership in the independent support (projection set). The previous implementation only included the clause structure (canonical literal indices) in the canonical hash and equality data, but omitted whether each canonical position belongs to the projection set. Symptom: the CHECK_COUNT assertion fired with correct=8, actual=16 (2× error) on projected model counting formulas when --wlcanon is enabled. Root cause: a component with all non-projection variables (projected count=1) matched a cached component with the same clause structure but projection variables (projected count=2), returning the wrong count. Fix: populate CanonInfo::canon_is_indep[i] (0 or 1 for each canonical position), then append this to the hdata array used for both the hash (chibihash64) and the DiffPackedComp equality comparison (memcmp). Two components now compare equal only when their clause structures AND their per-position is_indep profiles both match. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WL canonical hashing is sound only when two structurally isomorphic components are guaranteed to have the same count. For weighted model counting (modes 1-6), each variable carries an individual weight, so two components with the same clause topology but different per-variable weights have different weighted counts and must NOT be treated as cache-equivalent. counter->weighted() returns false only for FGenMpz (mode 0, plain integer/projected counting). All other field types -- rational (1), complex (2/6), polynomial/GF (3), parity (4), mod-prime (5) -- return true. Adding an early return for weighted() makes compute_canon_info a no-op (returns invalid CanonInfo) for those modes, so no WL cache entries are created and no unsound hits occur. WL canonicalization provides essentially no benefit for weighted modes anyway: two different components would need the same clause structure AND the same per-variable weight at every corresponding canonical position -- a near-zero probability in practice. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.