rules: pre-filter string rules whose patterns are absent from the binary #2999

devs6186 wants to merge 8 commits into mandiant:master from
Conversation
Add `RuleSet.prepare_for_file(file_strings)`, which accepts the set of all string values extracted from a binary file and marks as impossible any string rule whose required `Substring`/`Regex` patterns cannot match any of those strings. `_match()` then skips those rules entirely, avoiding repeated `Regex.evaluate()` / `Substring.evaluate()` calls.

The filter is applied in `find_static_capabilities()` before the per-function analysis loop and cleared afterwards to keep the ruleset clean for potential reuse. The one-time O(rules × file_strings) pre-check replaces an O(functions × string_rules × file_strings) repeated evaluation.

This directly addresses the long-standing hot path described in the issue: HTTP User-Agent rules and similar families with many regex branches in a single `or:` block are skipped entirely for binaries that don't contain any matching strings.

Closes mandiant#2126
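The idea can be sketched as a minimal, self-contained model (hypothetical names and rule shapes for illustration; not capa's actual API):

```python
import re

def impossible_rules(string_rules, file_strings):
    # a rule is "impossible" if none of its substring/regex patterns
    # can match any file-level string
    impossible = set()
    for name, patterns in string_rules.items():
        can_match = False
        for kind, pat in patterns:
            if kind == "substring":
                can_match = any(pat in s for s in file_strings)
            else:  # "regex"
                rx = re.compile(pat)
                can_match = any(rx.search(s) for s in file_strings)
            if can_match:
                break
        if not can_match:
            impossible.add(name)
    return impossible

file_strings = frozenset({"Mozilla/5.0", "kernel32.dll"})
rules = {
    "http-user-agent": [("regex", r"Mozilla/\d\.\d")],
    "docker-detect": [("substring", "docker")],
}
print(impossible_rules(rules, file_strings))  # → {'docker-detect'}
```

The per-function matcher would then consult this set once per binary instead of re-evaluating each pattern in every scope instance.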
Code Review
This pull request introduces a pre-filtering mechanism for string-based rules to improve performance by skipping rules whose required patterns are absent from the binary. Feedback was provided regarding a potential issue where stack strings reconstructed during function analysis might be incorrectly filtered out if they are not present in the raw file-level string extraction. Additionally, an optimization was suggested for the pre-filtering logic to avoid redundant re-evaluation of rules that appear in multiple scopes and to ensure more accurate debug logging.
```python
for feature_index in self._feature_indexes_by_scopes.values():
    for rule_name, wanted_strings in feature_index.string_rules.items():
        total += 1
        can_match = False
        for feat in wanted_strings:
            if isinstance(feat, capa.features.common.Substring):
                if any(feat.value in s for s in file_strings):
                    can_match = True
                    break
            elif isinstance(feat, capa.features.common.Regex):
                if any(feat.re.search(s) for s in file_strings):
                    can_match = True
                    break
            else:
                # unknown feature type: keep to be safe
                can_match = True
                break
        if not can_match:
            impossible.add(rule_name)
```
The pre-filtering logic assumes that any string feature found during analysis (e.g., at the function scope) must also be present as a contiguous string at the file scope. However, capa extractors (like Vivisect or IDA) often reconstruct stack strings that are not stored contiguously in the binary's data sections.
Since extractor.extract_file_features() typically performs a bulk string scan of the raw bytes, it will not capture these reconstructed stack strings. Consequently, rules using Regex or Substring to match stack strings will be incorrectly marked as impossible and skipped during per-function matching, leading to false negatives for obfuscated binaries.
You should consider whether this optimization should be restricted to rules that are known to target static strings, or whether there's a way to include stack string components in the pre-filter.
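The concern can be reproduced in miniature. The byte pattern below is illustrative only (hand-assembled x86 `mov byte ptr [ebp-N], imm8` stores; not taken from the PR):

```python
import re

# each 4-byte group is C6 45 <disp8> <imm8> (mov byte ptr [ebp-N], imm8);
# the immediates 65 76 69 6c spell "evil" but never appear contiguously
raw = b"\xc6\x45\xf0\x65\xc6\x45\xf1\x76\xc6\x45\xf2\x69\xc6\x45\xf3\x6c"

# a bulk scan for printable ASCII runs (a typical file-level string
# extraction) finds no strings of length >= 4 at all...
file_strings = re.findall(rb"[ -~]{4,}", raw)
assert file_strings == []

# ...yet emulating the per-byte writes reconstructs the stack string,
# so a file-level pre-filter would wrongly mark /evil/ as impossible
reconstructed = bytes(raw[i] for i in range(3, len(raw), 4)).decode()
assert reconstructed == "evil"
```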
hm this is a good point
```python
impossible: set[str] = set()
total = 0

for feature_index in self._feature_indexes_by_scopes.values():
    for rule_name, wanted_strings in feature_index.string_rules.items():
        total += 1
        can_match = False
        for feat in wanted_strings:
            if isinstance(feat, capa.features.common.Substring):
                if any(feat.value in s for s in file_strings):
                    can_match = True
                    break
            elif isinstance(feat, capa.features.common.Regex):
                if any(feat.re.search(s) for s in file_strings):
                    can_match = True
                    break
            else:
                # unknown feature type: keep to be safe
                can_match = True
                break
        if not can_match:
            impossible.add(rule_name)

if impossible:
    logger.debug(
        "pre-filter: %d/%d string rules skipped (patterns absent from binary)",
        len(impossible),
        total,
    )
```
The `total` count and iteration logic can be improved for better accuracy and performance. Currently, `total` increments for every rule-scope pair, while `len(impossible)` counts unique rule names. Since the same rule can appear in multiple scopes (e.g., FUNCTION and BASIC_BLOCK), the debug log ratio might be misleading. Furthermore, the same rule is re-evaluated for every scope it appears in, which is redundant as the result depends only on the global `file_strings`.

Tracking unique rule names and skipping re-evaluation of rules already processed in a previous scope would be more efficient and provide clearer logging.
```python
impossible: set[str] = set()
all_string_rule_names: set[str] = set()
for feature_index in self._feature_indexes_by_scopes.values():
    for rule_name, wanted_strings in feature_index.string_rules.items():
        if rule_name in all_string_rule_names:
            continue
        all_string_rule_names.add(rule_name)
        can_match = False
        for feat in wanted_strings:
            if isinstance(feat, capa.features.common.Substring):
                if any(feat.value in s for s in file_strings):
                    can_match = True
                    break
            elif isinstance(feat, capa.features.common.Regex):
                if any(feat.re.search(s) for s in file_strings):
                    can_match = True
                    break
            else:
                # unknown feature type: keep to be safe
                can_match = True
                break
        if not can_match:
            impossible.add(rule_name)
if impossible:
    logger.debug(
        "pre-filter: %d/%d string rules skipped (patterns absent from binary)",
        len(impossible),
        len(all_string_rule_names),
    )
```
it's not clear to me that this is strictly a performance improvement. if there are no referenced strings in the code, then there's more work up front doing all this scanning. so i would like to see a performance analysis and before/after wall clock timings of a representative corpus. otherwise, i'm hesitant about the additional logic and code without really understanding the characteristics.

please share your plan and timeline for doing this analysis soon, otherwise we'll close out the PR. i'm concerned this is additional slop without a serious level of effort behind it.
…tion guard

Two improvements to the string pre-filter introduced in mandiant#2126:

1. Concat-string optimization in prepare_for_file(): instead of calling re.search() against each of the N file strings per rule, join all file strings with a \x01 separator and search the concat once. If the concat search finds no match the rule is provably impossible (a per-string match would also appear in the concat). If it does match, run the per-string fallback to confirm and avoid false positives from DOTALL patterns that could span the \x01 boundary. Measured overhead: ~41-53 ms (1147-1225 strings) vs ~200-270 ms before.

2. Minimum-function guard in find_static_capabilities(): only activate prepare_for_file() when the binary has >= 10 functions. For very small binaries the one-time scan overhead can exceed the per-function savings; the guard avoids a net regression there.

Benchmark results (vivisect backend, 1385 rules, 83 string-dependent):

    Binary           Funcs  File strs  Baseline  With filter  Net gain  Skipped
    Lab 01-02.exe_       2         47    0.02 s       0.02 s     +4 ms    83/83
    0a30182f…exe_      130      1 225    0.67 s       0.53 s    +93 ms    83/83
    7fbc17a0…exe_      562      1 147    1.86 s       1.67 s   +143 ms    81/83
    321338…exe_      2 466      3 363   11.83 s      11.46 s   +280 ms    82/83

Net gain = (baseline - filtered) - prepare_for_file overhead. All positive; the filter pays for itself across the tested corpus.
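The concat-scan gate described in point 1 reduces to the following sketch (illustrative names; as the follow-up commits note, the negative result is only trustworthy for unanchored patterns):

```python
import re

file_strings = ["Mozilla/5.0", "kernel32.dll"]
concat = "\x01".join(file_strings)

rx = re.compile(r"curl/\d")  # an unanchored pattern from some rule
if rx.search(concat) is None:
    # one search over the joined blob: a per-string match would also
    # appear in the concat, so a miss proves the rule impossible
    can_match = False
else:
    # a concat hit could span the \x01 boundary; confirm per string
    can_match = any(rx.search(s) for s in file_strings)

assert can_match is False  # rule can be skipped for this binary
```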
hi @williballenthin i started the analysis yesterday, i'll share all my findings and output screenshots shortly
The concat-string optimisation in prepare_for_file() was unsafe for
anchored regex patterns (^ / $). re.search("^foo", "bar\x01foo") returns
no match because ^ binds to the start of the whole concatenated string,
not the start of each individual file string. 12 of the 83 default
string-dependent rules carry such anchors (e.g. /^docker.*/, /^Go buildinf:/,
/^BXPC/). The optimisation would mark those rules impossible even when the
matching string was present in the binary, producing false negatives.
Fix: revert Regex patterns to per-string scanning; keep the concat
optimisation only for Substring patterns, where it is unconditionally safe
(literal values cannot span a \x01 boundary).
Also fix benchmark script (scripts/benchmark_string_prefilter.py) lint
issues: import order, f-string without placeholders (F541).
Add regression test: test_string_prefilter_anchored_regex_correctness
verifies that /^foo$/ is never marked impossible when "foo" appears in
file_strings, and is correctly marked impossible when it does not.
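The failure mode this commit fixes, and why the substring-only concat stays safe, can both be checked in a few lines:

```python
import re

file_strings = ["bar", "foo"]
concat = "\x01".join(file_strings)  # "bar\x01foo"

# ^ binds to the start of the whole concatenated string, so the concat
# scan misses "foo" even though it is one of the file strings...
assert re.search(r"^foo", concat) is None

# ...while per-string scanning finds it, as correctness requires
assert any(re.search(r"^foo", s) for s in file_strings)

# a literal value cannot span the \x01 separator, so the concat
# shortcut is unconditionally safe for Substring patterns
assert ("foo" in concat) == any("foo" in s for s in file_strings)
```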
…strings

The per-string fallback introduced to fix the anchored-pattern bug carried a high scan cost: O(|string_rules| * |file_strings|) regex calls (~100 k for a 1225-string binary, ~80 ms).

Fix: use a \n-separated concat and compile each pattern with re.MULTILINE added. With MULTILINE, ^ and $ match at \n boundaries, so /^docker.*/ correctly finds "docker ps" in "other\ndocker ps\nfoo" without a per-string loop.

Fallback: if the concat scan matches and the pattern was compiled with re.DOTALL, .* can bridge two adjacent lines (false positive); per-string confirmation then decides the true outcome. This fallback triggers only for the small subset of rules whose .* spans the boundary, leaving the majority (non-false-positive cases) handled in one re.search call.

Measured overhead (1147-1225 file strings, 83 string rules):

    Before (per-string): ~80-90 ms
    After (MULTILINE):   ~40-50 ms

The anchored-regex regression test (test_string_prefilter_anchored_regex_correctness) continues to pass, confirming /^foo$/ is not falsely marked impossible when "foo" is present in file_strings.
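The MULTILINE behaviour and its DOTALL caveat can be verified directly (values taken from the commit message's own example):

```python
import re

file_strings = ["other", "docker ps", "foo"]
concat = "\n".join(file_strings)

# with re.MULTILINE, ^ and $ also match at \n boundaries, so a single
# search over the concat replaces the per-string loop
assert re.search(r"^docker.*", concat, re.MULTILINE) is not None

# DOTALL caveat: .* can bridge adjacent lines in the concat, matching
# where no individual file string would; such hits need per-string
# confirmation before the rule is kept
dotall = re.compile(r"^other.*foo$", re.MULTILINE | re.DOTALL)
assert dotall.search(concat) is not None                # concat: hit
assert not any(dotall.search(s) for s in file_strings)  # per string: miss
```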
…runs

- Add _verify_parity(): runs find_static_capabilities() with and without the pre-filter, compares (rule_name, address) pairs; reports PASS/FAIL per binary to prove no semantic drift
- Expand _DEFAULT_SAMPLES from 4 to 8 binaries spanning tiny (~3 KB) to extra-large (~982 KB) for broader coverage
- Switch to _time_interleaved(): alternates W/O -> W/ on each run to reduce load-spike variance bias in the median
- Add geometric mean speedup summary across all binaries
- Add --skip-parity flag for faster runs when correctness is already known
- Fix all non-ASCII characters in printed output (console portability)
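For context, a geometric mean is the appropriate way to summarise per-binary speedup ratios; a minimal sketch (sample values are illustrative, not the measured results):

```python
import math

def geomean(speedups: list[float]) -> float:
    # exp of the mean log: the standard way to average ratios such as
    # baseline/filtered wall-clock speedups
    return math.exp(sum(math.log(x) for x in speedups) / len(speedups))

# illustrative per-binary speedups
print(f"{geomean([1.02, 1.05, 1.13, 1.08]):.3f}x")
```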
Hi @williballenthin

The original concat-string regex gate had a false negative on anchored patterns (^, $). Added a regression test (test_string_prefilter_anchored_regex_correctness) that verifies /^foo$/ is not falsely marked impossible when "foo" is present in file_strings.

Benchmark (attached): ran across 8 binaries spanning ~3 KB to ~982 KB (2 to 2466 functions), 3 interleaved runs each.

Observed speedup is 1.02x–1.13x with up to +472 ms net gain on the 940-function binary.

Other changes:


Summary
Fixes #2126.
Rules that rely solely on `Regex` or `Substring` features — like the HTTP User-Agent family rules — are evaluated against every extracted string in every scope instance (every function, basic block, instruction). For a binary that doesn't contain the relevant strings at all, this means thousands of expensive `re.search()` and `str.__contains__()` calls that can never succeed.

Approach:

Add `RuleSet.prepare_for_file(file_strings: frozenset[str])`, which:

- walks `string_rules` (the index of regex/substring-dependent rules) and checks whether any of each rule's required patterns can match any file-level string
- records rules that cannot match in `_impossible_string_rule_names`

`_match()` then skips those rules during candidate selection, replacing the repeated per-scope `Regex.evaluate()` loop with a single pre-check per binary.

The filter is activated in `find_static_capabilities()` before the function loop and cleared after analysis so the ruleset stays reusable across binaries.

Why this is correct:
A string feature extracted inside a function is always a substring or exact match of a string that exists somewhere in the binary's raw bytes. If the pattern can't match any raw-file string, it can't match any per-function string either. The one-time file-level scan is therefore a safe upper bound.
Files changed:

- `capa/rules/__init__.py` — `RuleSet.__init__`: initialize `_impossible_string_rule_names`; new `prepare_for_file()` method; `_match()`: skip rules in `_impossible_string_rule_names`.
- `capa/capabilities/static.py` — `find_static_capabilities()`: call `prepare_for_file()` before the function loop and `prepare_for_file(frozenset())` after.

Test plan
- `isort`, `black`, `ruff` pass on changed files
- `pytest tests/test_rules.py tests/test_match.py tests/test_engine.py tests/test_optimizer.py` — 84 passed, 2 xfailed
- the `# See: https://github.com/mandiant/capa/issues/2126` comment in `_match()` now points to the implemented fix