Skip to content

tests: add data-driven test fixtures for rule matcher#2987

Open
williballenthin wants to merge 1 commit intomasterfrom
capa-matcher-fixtures
Open

tests: add data-driven test fixtures for rule matcher#2987
williballenthin wants to merge 1 commit intomasterfrom
capa-matcher-fixtures

Conversation

@williballenthin
Copy link
Copy Markdown
Collaborator

@williballenthin williballenthin commented Apr 2, 2026

closes #2985

Checklist

  • No CHANGELOG update needed
  • No new tests needed
  • No documentation update needed
  • This submission includes AI-generated code and I have provided details in the description.

Copy link
Copy Markdown

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased) section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a data-driven testing framework for the capa rule matcher, enabling the definition of test cases in YAML using a specialized DSL for synthetic features. The changes include a new documentation guide, a parser for both static and dynamic features, and a set of initial test fixtures. Review feedback identifies a bug in the _parse_number_literal function where hexadecimal values containing the letter 'e' are incorrectly parsed as floats, and notes that the instruction line parsing logic is fragile when feature text contains colons.

Comment on lines +577 to +580
def _parse_number_literal(value: str) -> int | float:
if any(character in value for character in ".eE"):
return float(value)
return int(value, 0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation of _parse_number_literal incorrectly attempts to parse hexadecimal numbers containing 'e' or 'E' (e.g., 0x1e, 0xDEADBEEF) as floats, which will raise a ValueError. Hexadecimal literals should always be parsed as integers.

Suggested change
def _parse_number_literal(value: str) -> int | float:
if any(character in value for character in ".eE"):
return float(value)
return int(value, 0)
def _parse_number_literal(value: str) -> int | float:
if "0x" in value.lower():
return int(value, 0)
if any(character in value for character in ".eE"):
return float(value)
return int(value, 0)

Comment on lines +149 to +162
parts = rest.split(": ", 2)
if len(parts) == 2:
instruction_address = _parse_static_address(parts[0])
feature_text = parts[1]
elif len(parts) == 3:
function_address = _parse_static_address(parts[0])
if function_address != self.current_function:
raise ValueError(
f"instruction line changed function without a function header: {line}"
)
instruction_address = _parse_static_address(parts[1])
feature_text = parts[2]
else:
raise ValueError(f"unsupported instruction feature line: {line}")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for splitting instruction lines is fragile when the feature text contains : (e.g., string(key: value)). In such cases, split(": ", 2) will produce 3 parts, and the code will incorrectly attempt to parse the first part of the feature as an instruction address. A more robust approach is to check if the potential second address part is actually a valid address before treating it as the 3-part form.

        addr1_text, sep1, remainder = rest.partition(": ")
        if not sep1:
            raise ValueError(f"unsupported instruction feature line: {line}")

        addr2_text, sep2, feature_text = remainder.partition(": ")
        if sep2:
            try:
                # Try to parse the second part as an address to distinguish between
                # <func-addr>: <insn-addr>: <feature> and <insn-addr>: <feature-with-colon>
                instruction_address = _parse_static_address(addr2_text)
                function_address = _parse_static_address(addr1_text)
                if function_address != self.current_function:
                    raise ValueError(
                        f"instruction line changed function without a function header: {line}"
                    )
            except ValueError:
                # Second part is not an address, so it must be part of the feature
                instruction_address = _parse_static_address(addr1_text)
                feature_text = remainder
        else:
            instruction_address = _parse_static_address(addr1_text)
            feature_text = remainder

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test fixture files should be moved under:

tests/fixtures/static/dynamic/*
tests/fixtures/matcher/dynamic/*

@@ -0,0 +1,36 @@
- name: dynamic-span-window
description: span-of-calls matching honors the configured window size
flavor: dynamic
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe flavor comes from the fixture file location?

Comment on lines +9 to +11
scopes:
static: unsupported
dynamic: span of calls
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice if we didn't have to specify this

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at least the unsupported ones

span size: 2
rules:
- meta:
name: span-resolve-add-veh
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need "description" here, too, to explain why the test exists

expect:
matches:
span-resolve-add-veh:
- [call, 2456, 3052, 3064, 11]
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is painful. maybe we can assert that the call IDs are unique within the test features/events, and then we can just refer to the call IDs, not to the process/thread/etc.

- name: scope-boundary
description: function scope aggregates across basic blocks, but basic block scope does not
flavor: static
base address: 0x401000
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't seem necessary in the common case

@williballenthin williballenthin force-pushed the capa-matcher-fixtures branch from e28743f to beae5ef Compare April 2, 2026 14:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

create data-driven rule matching test fixtures

1 participant