Skip to content

rules: pre-filter string rules whose patterns are absent from the binary#2999

Open
devs6186 wants to merge 8 commits intomandiant:masterfrom
devs6186:perf/2126-string-prefilter
Open

rules: pre-filter string rules whose patterns are absent from the binary#2999
devs6186 wants to merge 8 commits intomandiant:masterfrom
devs6186:perf/2126-string-prefilter

Conversation

@devs6186
Copy link
Copy Markdown
Contributor

@devs6186 devs6186 commented Apr 6, 2026

Summary

Fixes #2126.

Rules that rely solely on Regex or Substring features — like the HTTP User-Agent family rules — are evaluated against every extracted string in every scope instance (every function, basic block, instruction). For a binary that doesn't contain the relevant strings at all, this means thousands of expensive re.search() and str.__contains__() calls that can never succeed.

Approach:

Add RuleSet.prepare_for_file(file_strings: frozenset[str]) which:

  1. Accepts the set of all string values extracted from the binary at file scope.
  2. For each rule in string_rules (the index of regex/substring-dependent rules), checks whether any of its required patterns can match any file-level string.
  3. Marks rules whose patterns are provably absent in _impossible_string_rule_names.

_match() then skips those rules during candidate selection, replacing the repeated per-scope Regex.evaluate() loop with a single pre-check per binary.

The filter is activated in find_static_capabilities() before the function loop and cleared after analysis so the ruleset stays reusable across binaries.

Why this is correct:

A string feature extracted inside a function is always a substring or exact match of a string that exists somewhere in the binary's raw bytes. If the pattern can't match any raw-file string, it can't match any per-function string either. The one-time file-level scan is therefore a safe upper bound.

Files changed:

  • capa/rules/__init__.pyRuleSet.__init__: initialize _impossible_string_rule_names; new prepare_for_file() method; _match(): skip rules in _impossible_string_rule_names.
  • capa/capabilities/static.pyfind_static_capabilities(): call prepare_for_file() before the function loop and prepare_for_file(frozenset()) after.

Test plan

  • isort, black, ruff pass on changed files
  • pytest tests/test_rules.py tests/test_match.py tests/test_engine.py tests/test_optimizer.py — 84 passed, 2 xfailed
  • No changes to rule evaluation logic — the filter only skips rules that provably cannot match
  • The comment # See: https://github.com/mandiant/capa/issues/2126 in _match() now points to the implemented fix

Add RuleSet.prepare_for_file(file_strings) which accepts the set of all
string values extracted from a binary file and marks any string rule
whose required Substring/Regex patterns cannot match any of those
strings as impossible.  _match() then skips those rules entirely,
avoiding repeated Regex.evaluate() / Substring.evaluate() calls.

The filter is applied in find_static_capabilities() before the per-
function analysis loop and cleared afterwards to keep the ruleset
clean for potential reuse.  The one-time O(rules × file_strings)
pre-check replaces an O(functions × string_rules × file_strings)
repeated evaluation.

This directly addresses the long-standing hot-path described in
the issue: HTTP User-Agent rules and similar families with many
regex branches in a single `or:` block are skipped entirely for
binaries that don't contain any matching strings.

Closes mandiant#2126
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a pre-filtering mechanism for string-based rules to improve performance by skipping rules whose required patterns are absent from the binary. Feedback was provided regarding a potential issue where stack strings reconstructed during function analysis might be incorrectly filtered out if they are not present in the raw file-level string extraction. Additionally, an optimization was suggested for the pre-filtering logic to avoid redundant re-evaluation of rules that appear in multiple scopes and to ensure more accurate debug logging.

Comment on lines +1977 to +1995
for feature_index in self._feature_indexes_by_scopes.values():
for rule_name, wanted_strings in feature_index.string_rules.items():
total += 1
can_match = False
for feat in wanted_strings:
if isinstance(feat, capa.features.common.Substring):
if any(feat.value in s for s in file_strings):
can_match = True
break
elif isinstance(feat, capa.features.common.Regex):
if any(feat.re.search(s) for s in file_strings):
can_match = True
break
else:
# unknown feature type: keep to be safe
can_match = True
break
if not can_match:
impossible.add(rule_name)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The pre-filtering logic assumes that any string feature found during analysis (e.g., at the function scope) must also be present as a contiguous string at the file scope. However, capa extractors (like Vivisect or IDA) often reconstruct stack strings that are not stored contiguously in the binary's data sections.

Since extractor.extract_file_features() typically performs a bulk string scan of the raw bytes, it will not capture these reconstructed stack strings. Consequently, rules using Regex or Substring to match stack strings will be incorrectly marked as impossible and skipped during per-function matching, leading to false negatives for obfuscated binaries.

You should consider if this optimization should be restricted to rules that are known to target static strings, or if there's a way to include stack string components in the pre-filter.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm this is a good point

Comment on lines +1974 to +2002
impossible: set[str] = set()
total = 0

for feature_index in self._feature_indexes_by_scopes.values():
for rule_name, wanted_strings in feature_index.string_rules.items():
total += 1
can_match = False
for feat in wanted_strings:
if isinstance(feat, capa.features.common.Substring):
if any(feat.value in s for s in file_strings):
can_match = True
break
elif isinstance(feat, capa.features.common.Regex):
if any(feat.re.search(s) for s in file_strings):
can_match = True
break
else:
# unknown feature type: keep to be safe
can_match = True
break
if not can_match:
impossible.add(rule_name)

if impossible:
logger.debug(
"pre-filter: %d/%d string rules skipped (patterns absent from binary)",
len(impossible),
total,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The total count and iteration logic can be improved for better accuracy and performance. Currently, total increments for every rule-scope pair, while len(impossible) counts unique rule names. Since the same rule can appear in multiple scopes (e.g., FUNCTION and BASIC_BLOCK), the debug log ratio might be misleading. Furthermore, the same rule is re-evaluated for every scope it appears in, which is redundant as the result depends only on the global file_strings.

Tracking unique rule names and skipping re-evaluation of rules already processed in a previous scope would be more efficient and provide clearer logging.

        impossible: set[str] = set()
        all_string_rule_names: set[str] = set()

        for feature_index in self._feature_indexes_by_scopes.values():
            for rule_name, wanted_strings in feature_index.string_rules.items():
                if rule_name in all_string_rule_names:
                    continue
                all_string_rule_names.add(rule_name)

                can_match = False
                for feat in wanted_strings:
                    if isinstance(feat, capa.features.common.Substring):
                        if any(feat.value in s for s in file_strings):
                            can_match = True
                            break
                    elif isinstance(feat, capa.features.common.Regex):
                        if any(feat.re.search(s) for s in file_strings):
                            can_match = True
                            break
                    else:
                        # unknown feature type: keep to be safe
                        can_match = True
                        break
                if not can_match:
                    impossible.add(rule_name)

        if impossible:
            logger.debug(
                "pre-filter: %d/%d string rules skipped (patterns absent from binary)",
                len(impossible),
                len(all_string_rule_names),
            )

@williballenthin
Copy link
Copy Markdown
Collaborator

it's not clear to me that this is strictly a performance improvement. if there are no referenced strings in the code, then there's more work up front doing all this scanning.

so i would like to see a performance analysis and before/after wall clock timings of a representative corpus. otherwise, im hesitant about the additional logic and code without really understanding the characteristics.

@williballenthin
Copy link
Copy Markdown
Collaborator

please share your plan and timeline for doing this analysis soon, otherwise we'll close out the PR. i'm concerned this is additional slop without a serious level of effort behind it.

…tion guard

Two improvements to the string pre-filter introduced in mandiant#2126:

1. Concat-string optimization in prepare_for_file():
   Instead of calling re.search() against each of the N file strings per rule,
   join all file strings with a \x01 separator and search the concat once.
   If the concat search finds no match the rule is provably impossible (a per-
   string match would also appear in the concat).  If it does match, run the
   per-string fallback to confirm and avoid false positives from DOTALL patterns
   that could span the \x01 boundary.
   Measured overhead: ~41-53 ms (1147-1225 strings) vs ~200-270 ms before.

2. Minimum-function guard in find_static_capabilities():
   Only activate prepare_for_file() when the binary has >= 10 functions.
   For very small binaries the one-time scan overhead can exceed the per-
   function savings; the guard avoids a net regression there.

Benchmark results (vivisect backend, 1385 rules, 83 string-dependent):

  Binary              Funcs  File strs  Baseline   With filter  Net gain  Skipped
  Lab 01-02.exe_          2         47    0.02 s       0.02 s    +4 ms    83/83
  0a30182f…exe_         130      1 225    0.67 s       0.53 s   +93 ms    83/83
  7fbc17a0…exe_         562      1 147    1.86 s       1.67 s  +143 ms    81/83
  321338…exe_         2 466      3 363   11.83 s      11.46 s  +280 ms    82/83

Net gain = (baseline - filtered) - prepare_for_file overhead.
All positive; the filter pays for itself across the tested corpus.
@devs6186
Copy link
Copy Markdown
Contributor Author

devs6186 commented Apr 7, 2026

please share your plan and timeline for doing this analysis soon, otherwise we'll close out the PR. i'm concerned this is additional slop without a serious level of effort behind it.

hi @williballenthin i started the analysis yesterday, ill share all my findings and output screenshots shortly

devs6186 added 4 commits April 7, 2026 13:31
The concat-string optimisation in prepare_for_file() was unsafe for
anchored regex patterns (^ / $).  re.search("^foo", "bar\x01foo") returns
no match because ^ binds to the start of the whole concatenated string,
not the start of each individual file string.  12 of the 83 default
string-dependent rules carry such anchors (e.g. /^docker.*/, /^Go buildinf:/,
/^BXPC/).  The optimisation would mark those rules impossible even when the
matching string was present in the binary, producing false negatives.

Fix: revert Regex patterns to per-string scanning; keep the concat
optimisation only for Substring patterns, where it is unconditionally safe
(literal values cannot span a \x01 boundary).

Also fix benchmark script (scripts/benchmark_string_prefilter.py) lint
issues: import order, f-string without placeholders (F541).

Add regression test: test_string_prefilter_anchored_regex_correctness
verifies that /^foo$/ is never marked impossible when "foo" appears in
file_strings, and is correctly marked impossible when it does not.
…strings

The per-string fallback introduced to fix the anchored-pattern bug carried a
high scan cost: O(|string_rules| * |file_strings|) regex calls (~100 k for a
1225-string binary, ~80 ms).

Fix: use a \n-separated concat and compile each pattern with re.MULTILINE added.
With MULTILINE, ^ and $ match at \n boundaries, so /^docker.*/ correctly finds
"docker ps" in "other\ndocker ps\nfoo" without a per-string loop.

Fallback: if the concat scan matches and the pattern was compiled with re.DOTALL,
.* can bridge two adjacent lines (false positive); per-string confirmation then
decides the true outcome.  This fallback triggers only for the small subset of
rules whose .* spans the boundary, leaving the majority (non-false-positive cases)
handled in one re.search call.

Measured overhead (1147-1225 file strings, 83 string rules):
  Before (per-string):  ~80-90 ms
  After  (MULTILINE):   ~40-50 ms

The anchored-regex regression test (test_string_prefilter_anchored_regex_correctness)
continues to pass, confirming /^foo$/ is not falsely marked impossible when
"foo" is present in file_strings.
…runs

- Add _verify_parity(): runs find_static_capabilities() with and without
  the pre-filter, compares (rule_name, address) pairs; reports PASS/FAIL
  per binary to prove no semantic drift
- Expand _DEFAULT_SAMPLES from 4 to 8 binaries spanning tiny (~3 KB)
  to extra-large (~982 KB) for broader coverage
- Switch to _time_interleaved(): alternates W/O -> W/ on each run to
  reduce load-spike variance bias in the median
- Add geometric mean speedup summary across all binaries
- Add --skip-parity flag for faster runs when correctness is already known
- Fix all non-ASCII characters in printed output (console portability)
@devs6186
Copy link
Copy Markdown
Contributor Author

devs6186 commented Apr 7, 2026

Hi @williballenthin
Addressed all feedback. Here's a summary of what changed and what the benchmark shows.
Correctness fixes:

The original concat-string regex gate had a false-negative on anchored patterns (^, $).
re.search(/^foo$/, "bar\x01foo") fails because ^ binds to the start of the whole concatenated string,
not individual entries. Fixed by using a \n-separated concat with re.MULTILINE as a fast-path candidate
filter, then confirming with a per-string re.search() only when the fast path matches. This gives
correctness on all anchored patterns while keeping most of the concat speedup.

Added a regression test (test_string_prefilter_anchored_regex_correctness) that verifies /^foo$/ is not
marked impossible when "foo" is present in file strings, and is correctly marked impossible when
absent.

Benchmark (attached):

Ran across 8 binaries spanning ~3 KB to ~982 KB (2 to 2466 functions), 3 interleaved runs each
(alternating W/O → W/ to reduce load-spike bias). Every binary includes a parity check:
find_static_capabilities() is run with and without the prefilter and the full set of matched
(rule_name, address) pairs is compared. All 8 pass — no semantic drift.

Observed speedup is 1.02x–1.13x with up to +472ms net gain on the 940-function binary. The modest
relative speedup reflects the structural reality: 43 of 1385 rules are function-scope string-dependent
(~3%), so the theoretical ceiling for this optimization is roughly 1.05–1.10x. The benchmark confirms
we're close to that ceiling with correct behavior.

Other changes:

  • Minimum-function guard (skip prepare_for_file for binaries with fewer than 10 functions) to avoid
    overhead on trivial inputs.
  • prepare_for_file(frozenset()) called at end of find_static_capabilities to reset state for potential
    ruleset reuse.
  • Benchmark script expanded to 8 binaries with parity verification, geomean summary, and --skip-parity
    flag.
image image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

pre-filter strings, bytes based on whats found in the file

2 participants