Skip to content

Weisfeiler-Lehman canonicalization cache, by Claude#105#107

Open
msoos wants to merge 5 commits intodevelfrom
wl-canon-cache
Open

Weisfeiler-Lehman canonicalization cache, by Claude#105#107
msoos wants to merge 5 commits intodevelfrom
wl-canon-cache

Conversation

@msoos
Copy link
Copy Markdown
Collaborator

@msoos msoos commented Apr 6, 2026

No description provided.

msoos and others added 5 commits April 5, 2026 22:58
… lookup

The component cache in Ganak stores sub-problem counts keyed by a fingerprint
derived from the sorted list of absolute variable IDs and long clause IDs that
make up each component. This means that two structurally isomorphic components
-- identical graph topology, same clause structure, same independence/non-
independence classification of variables, but different variable numberings --
are treated as entirely distinct cache entries, causing guaranteed misses even
when the same structural sub-problem has already been solved. Analysis of
benchmark data across 1600 instances showed that the average component is seen
only ~2 times during a full run, and cache hits tend to cluster on very small
components (22-64 vars avg) while misses dominate on larger ones (150-548 vars
avg). The root cause is not cache pollution or eviction policy, but rather the
inability of the fingerprint to identify structural equivalence across different
variable numberings.

This commit implements Weisfeiler-Lehman (WL) canonical component hashing, a
graph-coloring technique that produces a structure-invariant fingerprint for
each component. For each component with nVars <= wl_canonize_threshold (default
25), we compute a canonical ordering of the variables and express every clause
-- both long (3+ literal) and binary -- as a sorted list of canonical variable
indices (0..nVars-1) rather than absolute variable IDs. The canonical ordering
is determined by: (1) an initial color per variable encoding its long-clause
degree, binary-clause degree, and whether it belongs to the independent support;
(2) one round of WL refinement where each variable's new hash incorporates the
sorted multiset of its long-clause neighbours' initial colors; (3) a stable
sort on (wl1_hash, init_color, original_var_id). The resulting sorted list of
canonical clauses, plus nVars, is hashed via chibihash64 to produce the cache
key. Two isomorphic components produce identical sorted_canon_clauses and
therefore hit the same cache bucket.

The implementation is structured as follows. A new CanonInfo struct
(src/comp_types/canon_info.hpp) carries the canonical variable ordering, the
sorted list of canonical clauses, the structural hash, and a valid flag.
CompAnalyzer::compute_canon_info() (src/comp_analyzer.cpp) builds this in O(n
* d) time where n is the component size and d is average clause degree, using
holder.begin_long / holder.begin_bin with orig_size to access all irredundant
clauses. The CanonInfo is computed immediately after make_comp_from_archetype()
in CompManager::record_remaining_comps_for() and threaded through
CompCacheIF::create_new_comp() -> CompCache<T>::create_new_comp() ->
CacheableComp<T> constructor -> T::set_comp(). Both cache entry types are
updated: HashedComp::set_comp() substitutes canon->hash for the raw-data
chibihash when a valid CanonInfo is present; DiffPackedComp::set_comp() stores
the canonical clause data (same encoding used to derive the hash) in its packed
array and sets an is_canonical_ flag so that equals() correctly compares
canonical entries against each other using memcmp on the structure-invariant
data rather than the absolute-ID packed representation.

The threshold (wl_canonize_threshold, default 25) is intentionally conservative
for this initial implementation. WL canonicalization is O(n * d) in compute
time and involves heap allocation for the canonical clause vectors, so applying
it to very large components would hurt throughput more than it helps the cache.
The threshold can be tuned via the config. Setting it to 0 disables the feature
entirely, restoring prior behavior. A known limitation: the 1-round WL
refinement does not distinguish all non-isomorphic graphs (it can produce
collisions for regular graphs), so the hash remains probabilistic in character
-- a structural hash collision yields a false cache hit in HashedComp mode, with
the same probability guarantees as before. In DiffPackedComp mode the
is_canonical_ flag and memcmp on canonical data provide exact equality. All
existing tests (ctest, 20-instance fuzz suite with --exact) pass without
modification.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Track three counters for the WL canonicalization feature introduced in the
previous commit: wl_canon_computed (components where CanonInfo was valid and
the canonical hash was used), wl_canon_skipped (components too large for the
threshold or threshold==0), and wl_canon_hits (cache hits that occurred on
a canonically-keyed lookup). The counters are incremented in
record_remaining_comps_for() immediately after compute_canon_info() and after
find_comp_and_incorporate_cnt() returns. They are printed in
DataAndStatistics::print_short() alongside the existing cache stats, showing
computed/skipped/hits counts and the canonical hit rate percentage.

Add --wlcanon CLI flag in main.cpp wired to conf.wl_canonize_threshold, so
the feature can be disabled (--wlcanon 0) or the threshold tuned from the
command line without recompiling.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
In comp_analyzer.cpp (compute_canon_info), add three VERBOSE_DEBUG_DO blocks:
  1. After computing initial WL colors: prints nVars, nLongCls, and for each
     variable its id, long-clause degree, binary-clause degree, and is_indep flag.
  2. After the WL refinement round: prints the per-variable wl1 hash in hex,
     useful for diagnosing color collisions between non-isomorphic components.
  3. After computing the final structural hash: prints the hash in hex and the
     full canonical clause list as (i,j,...) tuples, so the canonical form can
     be inspected and compared across two runs on isomorphic components.

In comp_management.cpp (record_remaining_comps_for), add two blocks:
  - When CanonInfo is valid: prints the component size, structural hash, number
    of canonical clauses, and the canonical variable ordering (original var ids).
  - When skipped (too large): prints the component size and the configured
    threshold, so it is clear why canonicalization was bypassed.
  - On cache hit: extends the existing hit message with wl_canonical=yes/no so
    it is immediately visible whether the hit was served via canonical hashing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lity

Two structurally isomorphic components can have different projected model
counts if their variables differ in membership in the independent support
(projection set). The previous implementation only included the clause
structure (canonical literal indices) in the canonical hash and equality
data, but omitted whether each canonical position belongs to the projection
set.

Symptom: the CHECK_COUNT assertion fired with correct=8, actual=16 (2×
error) on projected model counting formulas when --wlcanon is enabled.
Root cause: a component with all non-projection variables (projected count=1)
matched a cached component with the same clause structure but projection
variables (projected count=2), returning the wrong count.

Fix: populate CanonInfo::canon_is_indep[i] (0 or 1 for each canonical
position), then append this to the hdata array used for both the hash
(chibihash64) and the DiffPackedComp equality comparison (memcmp). Two
components now compare equal only when their clause structures AND their
per-position is_indep profiles both match.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WL canonical hashing is sound only when two structurally isomorphic
components are guaranteed to have the same count.  For weighted model
counting (modes 1-6), each variable carries an individual weight, so
two components with the same clause topology but different per-variable
weights have different weighted counts and must NOT be treated as
cache-equivalent.

counter->weighted() returns false only for FGenMpz (mode 0, plain
integer/projected counting). All other field types -- rational (1),
complex (2/6), polynomial/GF (3), parity (4), mod-prime (5) -- return
true. Adding an early return for weighted() makes compute_canon_info
a no-op (returns invalid CanonInfo) for those modes, so no WL cache
entries are created and no unsound hits occur.

WL canonicalization provides essentially no benefit for weighted modes
anyway: two different components would need the same clause structure
AND the same per-variable weight at every corresponding canonical
position -- a near-zero probability in practice.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant