rules: pre-filter string rules whose patterns are absent from the binary by devs6186 · Pull Request #2999 · mandiant/capa

devs6186 · 2026-04-06T16:53:20Z

Summary

Rules that rely solely on Regex or Substring features — like the HTTP User-Agent family rules — are evaluated against every extracted string in every scope instance (every function, basic block, instruction). For a binary that doesn't contain the relevant strings at all, this means thousands of expensive re.search() and str.__contains__() calls that can never succeed.

Approach:

Add RuleSet.prepare_for_file(file_strings: frozenset[str]) which:

Accepts the set of all string values extracted from the binary at file scope.
For each rule in string_rules (the index of regex/substring-dependent rules), checks whether any of its required patterns can match any file-level string.
Marks rules whose patterns are provably absent in _impossible_string_rule_names.

_match() then skips those rules during candidate selection, replacing the repeated per-scope Regex.evaluate() loop with a single pre-check per binary.

The filter is activated in find_static_capabilities() before the function loop and cleared after analysis so the ruleset stays reusable across binaries.

Why this is correct:

A string feature extracted inside a function is always a substring or exact match of a string that exists somewhere in the binary's raw bytes. If the pattern can't match any raw-file string, it can't match any per-function string either. The one-time file-level scan is therefore a safe upper bound.

Files changed:

capa/rules/__init__.py — RuleSet.__init__: initialize _impossible_string_rule_names; new prepare_for_file() method; _match(): skip rules in _impossible_string_rule_names.
capa/capabilities/static.py — find_static_capabilities(): call prepare_for_file() before the function loop and prepare_for_file(frozenset()) after.

Test plan

isort, black, ruff pass on changed files
pytest tests/test_rules.py tests/test_match.py tests/test_engine.py tests/test_optimizer.py — 84 passed, 2 xfailed
No changes to rule evaluation logic — the filter only skips rules that provably cannot match
The comment # See: https://github.com/mandiant/capa/issues/2126 in _match() now points to the implemented fix

Add RuleSet.prepare_for_file(file_strings) which accepts the set of all string values extracted from a binary file and marks any string rule whose required Substring/Regex patterns cannot match any of those strings as impossible. _match() then skips those rules entirely, avoiding repeated Regex.evaluate() / Substring.evaluate() calls. The filter is applied in find_static_capabilities() before the per- function analysis loop and cleared afterwards to keep the ruleset clean for potential reuse. The one-time O(rules × file_strings) pre-check replaces an O(functions × string_rules × file_strings) repeated evaluation. This directly addresses the long-standing hot-path described in the issue: HTTP User-Agent rules and similar families with many regex branches in a single `or:` block are skipped entirely for binaries that don't contain any matching strings. Closes mandiant#2126

gemini-code-assist

Code Review

This pull request introduces a pre-filtering mechanism for string-based rules to improve performance by skipping rules whose required patterns are absent from the binary. Feedback was provided regarding a potential issue where stack strings reconstructed during function analysis might be incorrectly filtered out if they are not present in the raw file-level string extraction. Additionally, an optimization was suggested for the pre-filtering logic to avoid redundant re-evaluation of rules that appear in multiple scopes and to ensure more accurate debug logging.

gemini-code-assist · 2026-04-06T16:55:02Z

capa/rules/__init__.py

+        for feature_index in self._feature_indexes_by_scopes.values():
+            for rule_name, wanted_strings in feature_index.string_rules.items():
+                total += 1
+                can_match = False
+                for feat in wanted_strings:
+                    if isinstance(feat, capa.features.common.Substring):
+                        if any(feat.value in s for s in file_strings):
+                            can_match = True
+                            break
+                    elif isinstance(feat, capa.features.common.Regex):
+                        if any(feat.re.search(s) for s in file_strings):
+                            can_match = True
+                            break
+                    else:
+                        # unknown feature type: keep to be safe
+                        can_match = True
+                        break
+                if not can_match:
+                    impossible.add(rule_name)


The pre-filtering logic assumes that any string feature found during analysis (e.g., at the function scope) must also be present as a contiguous string at the file scope. However, capa extractors (like Vivisect or IDA) often reconstruct stack strings that are not stored contiguously in the binary's data sections.

Since extractor.extract_file_features() typically performs a bulk string scan of the raw bytes, it will not capture these reconstructed stack strings. Consequently, rules using Regex or Substring to match stack strings will be incorrectly marked as impossible and skipped during per-function matching, leading to false negatives for obfuscated binaries.

You should consider if this optimization should be restricted to rules that are known to target static strings, or if there's a way to include stack string components in the pre-filter.

hm this is a good point

gemini-code-assist · 2026-04-06T16:55:02Z

capa/rules/__init__.py

+        impossible: set[str] = set()
+        total = 0
+
+        for feature_index in self._feature_indexes_by_scopes.values():
+            for rule_name, wanted_strings in feature_index.string_rules.items():
+                total += 1
+                can_match = False
+                for feat in wanted_strings:
+                    if isinstance(feat, capa.features.common.Substring):
+                        if any(feat.value in s for s in file_strings):
+                            can_match = True
+                            break
+                    elif isinstance(feat, capa.features.common.Regex):
+                        if any(feat.re.search(s) for s in file_strings):
+                            can_match = True
+                            break
+                    else:
+                        # unknown feature type: keep to be safe
+                        can_match = True
+                        break
+                if not can_match:
+                    impossible.add(rule_name)
+
+        if impossible:
+            logger.debug(
+                "pre-filter: %d/%d string rules skipped (patterns absent from binary)",
+                len(impossible),
+                total,
+            )


The total count and iteration logic can be improved for better accuracy and performance. Currently, total increments for every rule-scope pair, while len(impossible) counts unique rule names. Since the same rule can appear in multiple scopes (e.g., FUNCTION and BASIC_BLOCK), the debug log ratio might be misleading. Furthermore, the same rule is re-evaluated for every scope it appears in, which is redundant as the result depends only on the global file_strings.

Tracking unique rule names and skipping re-evaluation of rules already processed in a previous scope would be more efficient and provide clearer logging.

impossible: set[str] = set() all_string_rule_names: set[str] = set() for feature_index in self._feature_indexes_by_scopes.values(): for rule_name, wanted_strings in feature_index.string_rules.items(): if rule_name in all_string_rule_names: continue all_string_rule_names.add(rule_name) can_match = False for feat in wanted_strings: if isinstance(feat, capa.features.common.Substring): if any(feat.value in s for s in file_strings): can_match = True break elif isinstance(feat, capa.features.common.Regex): if any(feat.re.search(s) for s in file_strings): can_match = True break else: # unknown feature type: keep to be safe can_match = True break if not can_match: impossible.add(rule_name) if impossible: logger.debug( "pre-filter: %d/%d string rules skipped (patterns absent from binary)", len(impossible), len(all_string_rule_names), )

williballenthin · 2026-04-06T20:24:50Z

it's not clear to me that this is strictly a performance improvement. if there are no referenced strings in the code, then there's more work up front doing all this scanning.

so i would like to see a performance analysis and before/after wall clock timings of a representative corpus. otherwise, im hesitant about the additional logic and code without really understanding the characteristics.

williballenthin · 2026-04-06T20:35:24Z

please share your plan and timeline for doing this analysis soon, otherwise we'll close out the PR. i'm concerned this is additional slop without a serious level of effort behind it.

…tion guard Two improvements to the string pre-filter introduced in mandiant#2126: 1. Concat-string optimization in prepare_for_file(): Instead of calling re.search() against each of the N file strings per rule, join all file strings with a \x01 separator and search the concat once. If the concat search finds no match the rule is provably impossible (a per- string match would also appear in the concat). If it does match, run the per-string fallback to confirm and avoid false positives from DOTALL patterns that could span the \x01 boundary. Measured overhead: ~41-53 ms (1147-1225 strings) vs ~200-270 ms before. 2. Minimum-function guard in find_static_capabilities(): Only activate prepare_for_file() when the binary has >= 10 functions. For very small binaries the one-time scan overhead can exceed the per- function savings; the guard avoids a net regression there. Benchmark results (vivisect backend, 1385 rules, 83 string-dependent): Binary Funcs File strs Baseline With filter Net gain Skipped Lab 01-02.exe_ 2 47 0.02 s 0.02 s +4 ms 83/83 0a30182f…exe_ 130 1 225 0.67 s 0.53 s +93 ms 83/83 7fbc17a0…exe_ 562 1 147 1.86 s 1.67 s +143 ms 81/83 321338…exe_ 2 466 3 363 11.83 s 11.46 s +280 ms 82/83 Net gain = (baseline - filtered) - prepare_for_file overhead. All positive; the filter pays for itself across the tested corpus.

devs6186 · 2026-04-07T07:43:27Z

please share your plan and timeline for doing this analysis soon, otherwise we'll close out the PR. i'm concerned this is additional slop without a serious level of effort behind it.

hi @williballenthin i started the analysis yesterday, ill share all my findings and output screenshots shortly

The concat-string optimisation in prepare_for_file() was unsafe for anchored regex patterns (^ / $). re.search("^foo", "bar\x01foo") returns no match because ^ binds to the start of the whole concatenated string, not the start of each individual file string. 12 of the 83 default string-dependent rules carry such anchors (e.g. /^docker.*/, /^Go buildinf:/, /^BXPC/). The optimisation would mark those rules impossible even when the matching string was present in the binary, producing false negatives. Fix: revert Regex patterns to per-string scanning; keep the concat optimisation only for Substring patterns, where it is unconditionally safe (literal values cannot span a \x01 boundary). Also fix benchmark script (scripts/benchmark_string_prefilter.py) lint issues: import order, f-string without placeholders (F541). Add regression test: test_string_prefilter_anchored_regex_correctness verifies that /^foo$/ is never marked impossible when "foo" appears in file_strings, and is correctly marked impossible when it does not.

…strings The per-string fallback introduced to fix the anchored-pattern bug carried a high scan cost: O(|string_rules| * |file_strings|) regex calls (~100 k for a 1225-string binary, ~80 ms). Fix: use a \n-separated concat and compile each pattern with re.MULTILINE added. With MULTILINE, ^ and $ match at \n boundaries, so /^docker.*/ correctly finds "docker ps" in "other\ndocker ps\nfoo" without a per-string loop. Fallback: if the concat scan matches and the pattern was compiled with re.DOTALL, .* can bridge two adjacent lines (false positive); per-string confirmation then decides the true outcome. This fallback triggers only for the small subset of rules whose .* spans the boundary, leaving the majority (non-false-positive cases) handled in one re.search call. Measured overhead (1147-1225 file strings, 83 string rules): Before (per-string): ~80-90 ms After (MULTILINE): ~40-50 ms The anchored-regex regression test (test_string_prefilter_anchored_regex_correctness) continues to pass, confirming /^foo$/ is not falsely marked impossible when "foo" is present in file_strings.

…runs - Add _verify_parity(): runs find_static_capabilities() with and without the pre-filter, compares (rule_name, address) pairs; reports PASS/FAIL per binary to prove no semantic drift - Expand _DEFAULT_SAMPLES from 4 to 8 binaries spanning tiny (~3 KB) to extra-large (~982 KB) for broader coverage - Switch to _time_interleaved(): alternates W/O -> W/ on each run to reduce load-spike variance bias in the median - Add geometric mean speedup summary across all binaries - Add --skip-parity flag for faster runs when correctness is already known - Fix all non-ASCII characters in printed output (console portability)

devs6186 · 2026-04-07T11:28:13Z

Hi @williballenthin
Addressed all feedback. Here's a summary of what changed and what the benchmark shows.
Correctness fixes:

The original concat-string regex gate had a false-negative on anchored patterns (^, $).
re.search(/^foo$/, "bar\x01foo") fails because ^ binds to the start of the whole concatenated string,
not individual entries. Fixed by using a \n-separated concat with re.MULTILINE as a fast-path candidate
filter, then confirming with a per-string re.search() only when the fast path matches. This gives
correctness on all anchored patterns while keeping most of the concat speedup.

Added a regression test (test_string_prefilter_anchored_regex_correctness) that verifies /^foo$/ is not
marked impossible when "foo" is present in file strings, and is correctly marked impossible when
absent.

Benchmark (attached):

Ran across 8 binaries spanning ~3 KB to ~982 KB (2 to 2466 functions), 3 interleaved runs each
(alternating W/O → W/ to reduce load-spike bias). Every binary includes a parity check:
find_static_capabilities() is run with and without the prefilter and the full set of matched
(rule_name, address) pairs is compared. All 8 pass — no semantic drift.

Observed speedup is 1.02x–1.13x with up to +472ms net gain on the 940-function binary. The modest
relative speedup reflects the structural reality: 43 of 1385 rules are function-scope string-dependent
(~3%), so the theoretical ceiling for this optimization is roughly 1.05–1.10x. The benchmark confirms
we're close to that ceiling with correct behavior.

Other changes:

Minimum-function guard (skip prepare_for_file for binaries with fewer than 10 functions) to avoid
overhead on trivial inputs.
prepare_for_file(frozenset()) called at end of find_static_capabilities to reset state for potential
ruleset reuse.
Benchmark script expanded to 8 binaries with parity verification, geomean summary, and --skip-parity
flag.

gemini-code-assist bot reviewed Apr 6, 2026

View reviewed changes

rules: avoid stack-string false negatives in string prefilter

1a36f9e

devs6186 added 4 commits April 7, 2026 13:31

scripts: fix benchmark net-gain formula (overhead was double-subtracted)

a4595fa

Merge upstream/master into perf/2126-string-prefilter

f3e4a7b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rules: pre-filter string rules whose patterns are absent from the binary#2999

rules: pre-filter string rules whose patterns are absent from the binary#2999
devs6186 wants to merge 8 commits intomandiant:masterfrom
devs6186:perf/2126-string-prefilter

devs6186 commented Apr 6, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Apr 6, 2026

Uh oh!

williballenthin Apr 7, 2026

Uh oh!

gemini-code-assist bot Apr 6, 2026

Uh oh!

williballenthin commented Apr 6, 2026

Uh oh!

williballenthin commented Apr 6, 2026

Uh oh!

devs6186 commented Apr 7, 2026

Uh oh!

devs6186 commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

devs6186 commented Apr 6, 2026

Summary

Test plan

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

williballenthin Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

williballenthin commented Apr 6, 2026

Uh oh!

williballenthin commented Apr 6, 2026

Uh oh!

devs6186 commented Apr 7, 2026

Uh oh!

devs6186 commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants