
loader: fix recycled TID overwriting matched calls in dynamic layout #2882

Open
devs6186 wants to merge 4 commits into mandiant:master from devs6186:fix/2619-recycled-tid-dynamic-layout

Conversation

@devs6186 (Contributor)

closes #2619

Root cause

When an OS recycles a thread ID (TID) within the same process, compute_dynamic_layout() iterated the same ThreadAddress twice. On the second encounter it executed:

calls_by_thread[t.address] = []   # overwrites the first instance's calls

The matched calls from the first thread instance were still present in matched_calls (added by the rule engine), but had been erased from the layout, causing _get_call_name() to raise:

ValueError: ('name not found for call', ...)

Additionally, if both the first and the second instance had matched calls, t.address was appended to threads_by_process[p.address] twice, creating a duplicate thread entry in the layout.
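The data loss is ordinary dict-overwrite behavior and can be reproduced without any capa types (a minimal sketch with illustrative pid/tid values):

```python
# The same (pid, tid) key is visited twice because the OS recycled the TID.
addr = (100, 42)  # illustrative stand-in for capa's ThreadAddress

calls_by_thread = {}
calls_by_thread[addr] = []               # first lifecycle of the thread
calls_by_thread[addr].append("call-0")   # matched call recorded in the layout

calls_by_thread[addr] = []               # second encounter: recycled TID
assert calls_by_thread[addr] == []       # the matched call has been erased
```

The call is still present in matched_calls, so the renderer later looks up a name that the layout no longer contains.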

Fix

Two targeted changes in compute_dynamic_layout() (capa/loader.py):

  1. Replace the overwriting assignment with setdefault() so calls from all instances of a recycled TID accumulate under the same key:

    # before
    calls_by_thread[t.address] = []
    # after
    calls_by_thread.setdefault(t.address, [])
  2. Guard the thread-registration block with a membership check so the thread address is only added to matched_threads / threads_by_process once, regardless of how many times the TID is recycled:

    # before
    if calls_by_thread[t.address]:
        matched_threads.add(t.address)
        threads_by_process[p.address].append(t.address)
    # after
    if calls_by_thread[t.address] and t.address not in matched_threads:
        matched_threads.add(t.address)
        threads_by_process[p.address].append(t.address)

This is the source-level fix requested by the maintainer: the data is now consistent before it reaches the rendering stage, so no exception handling in the renderer is needed.

Test

tests/test_loader.py::test_compute_dynamic_layout_recycled_tid constructs a minimal mock DynamicFeatureExtractor that yields the same ThreadAddress twice (simulating TID recycling), with a matched call only in the first instance. It asserts that compute_dynamic_layout() returns a layout containing exactly one thread entry with the matched call intact.
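The shape of that test can be sketched against a toy stand-in for compute_dynamic_layout() (hypothetical names and simplified data; the real test exercises capa's actual loader and extractor types):

```python
from collections import defaultdict

def compute_layout(thread_events, matched_calls):
    """Toy model of the fixed compute_dynamic_layout(): thread_events yields
    (pid, tid) addresses in encounter order; matched_calls maps a thread
    address to the calls the rule engine matched in that lifecycle."""
    calls_by_thread = {}
    matched_threads = set()
    threads_by_process = defaultdict(list)
    for pid, tid in thread_events:
        addr = (pid, tid)
        calls_by_thread.setdefault(addr, [])          # fix 1: accumulate
        calls_by_thread[addr].extend(matched_calls.get(addr, []))
        matched_calls.pop(addr, None)                 # consume once
        if calls_by_thread[addr] and addr not in matched_threads:  # fix 2
            matched_threads.add(addr)
            threads_by_process[pid].append(addr)
    return calls_by_thread, threads_by_process

# Same (pid, tid) seen twice; only the first lifecycle has a matched call.
calls, layout = compute_layout(
    thread_events=[(100, 42), (100, 42)],
    matched_calls={(100, 42): ["call-7"]},
)
assert calls[(100, 42)] == ["call-7"]   # matched call survives the repeat
assert layout[100] == [(100, 42)]       # exactly one thread entry
```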

Checklist

  • CHANGELOG update included
  • New test included (tests/test_loader.py)
  • Documentation update included (doc/usage.md)
  • This submission includes AI-generated code and I have provided details in the description.

@gemini-code-assist (bot)

Summary of Changes

Hello @devs6186, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug in the dynamic analysis loader where the recycling of thread IDs by the operating system could lead to data loss and runtime errors. Specifically, it prevents the overwriting of matched calls associated with a recycled TID and ensures that such TIDs are not registered multiple times in the layout. These changes enhance the robustness and accuracy of dynamic analysis results, particularly in environments where TIDs are frequently reused. Additionally, the pull request includes a new test to cover this specific scenario and updates the documentation to provide a clearer overview of capa's output consumption methods.

Highlights

  • Dynamic Layout Bug Fix: Resolved an issue where recycled Thread IDs (TIDs) could overwrite previously matched calls in the dynamic layout, leading to ValueError during rendering. The fix ensures that calls from all instances of a recycled TID are accumulated correctly.
  • Duplicate Thread Entry Prevention: Implemented a check to prevent duplicate thread entries in the dynamic layout when an OS recycles a TID. This ensures each unique thread address is registered only once, maintaining data consistency.
  • New Test Case: Added a dedicated test case (test_compute_dynamic_layout_recycled_tid) to validate the fix for recycled TIDs, simulating the scenario and asserting correct behavior.
  • Documentation Update: Updated the usage documentation to include a table comparing various methods for consuming capa output, such as CLI, IDA Pro, Ghidra, Binary Ninja, dynamic sandboxes, and the web UI.


Changelog
  • CHANGELOG.md
    • Added bug fix entry for recycled TID issue.
    • Added documentation entry for capa output consumption table.
  • capa/loader.py
    • Fixed recycled TID overwriting matched calls in dynamic layout.
    • Prevented duplicate thread entries for recycled TIDs.
    • Reformatted several function calls and import statements for improved readability.
  • doc/usage.md
    • Added a table outlining different ways to consume capa output.
  • tests/test_loader.py
    • Added a new test case to verify the fix for recycled TIDs in dynamic layout.

@gemini-code-assist (bot) left a comment

Code Review

This pull request addresses a bug where recycled thread IDs in dynamic analysis could lead to overwritten data and subsequent errors. The fix correctly uses setdefault to accumulate calls from all instances of a recycled TID and adds a check to prevent duplicate thread registration. A comprehensive test case has been added to validate the fix. The overall implementation is correct and of high quality. The PR also includes significant code formatting changes, which improve readability. For future PRs, it would be beneficial to separate such large-scale formatting changes from functional bug fixes to make reviewing easier. The documentation updates are also a welcome addition.

devs6186 force-pushed the fix/2619-recycled-tid-dynamic-layout branch from 72dbf97 to 29bb083 on February 24, 2026 at 20:24
@mike-hunhoff (Collaborator) left a comment

Hi @devs6186, thanks for looking into the data overwriting issue. While this fix prevents matched calls from being lost when a TID is recycled, it doesn't quite address the requirement for unique tracking of distinct thread and process lifecycles. By accumulating all calls under the same ThreadAddress, the layout still fuses separate executions into a single entry, which doesn't solve the core problem described in #2361.

Because we want to avoid merging partial solutions for this component, we'd like to see a more comprehensive fix that addresses the uniqueness and de-duplication problem entirely for both thread and process lifecycles (for example, by incorporating sequence IDs or timestamps into the Address classes).

If you're interested in fully addressing these requirements, we'd love to see those updates in this PR. Alternatively, if you don't have the time or interest to take on the broader scope right now, let us know and we can close this for now.

Collaborator left a review comment:

This file contains many format-based changes. Please revert these if they are not necessary to pass linting checks.

@devs6186 (Contributor, Author)


I appreciate your review. I will look into it more broadly; I want to solve the problem completely instead of just submitting a PR. Give me a day or two maximum and I'll come back with a better solution. Thank you!

devs6186 force-pushed the fix/2619-recycled-tid-dynamic-layout branch 2 times, most recently from 944f89c to 90ba1b0 on March 2, 2026 at 14:23
…D lifecycles

Adds an optional `id` field to `ProcessAddress` and `ThreadAddress` that
sandbox backends can populate with a sandbox-specific unique identifier
(e.g. VMRay monitor_id, or a sequential counter for CAPE). When set, this
field becomes part of equality/hashing so that two process or thread
instances that share the same OS-assigned PID/TID are treated as distinct
addresses throughout capa's pipeline.

This comprehensively fixes the ValueError crash in render (mandiant#2619) by solving
the root uniqueness problem described in mandiant#2361: rather than merging recycled
lifecycles into a single entry, each instance now gets its own identity.

Changes:
- address.py: add optional `id` to ProcessAddress and ThreadAddress; update
  __eq__, __hash__, __lt__, __repr__ accordingly; backward-compatible (id=None
  by default)
- freeze/__init__.py: extend from_capa/to_capa to encode/decode the new id
  fields using extended tuple lengths; old 2/3/4-element tuples still decoded
  correctly for backward compatibility
- vmray/extractor.py: pass monitor_id as id to both ProcessAddress and
  ThreadAddress so each VMRay monitor instance is uniquely tracked
- cape/file.py: detect PID reuse via two-pass counting and assign sequential
  ids; processes with unique PIDs keep id=None (no behavior change)
- render/verbose.py: add _format_process_fields / _format_thread_fields helpers
  that include the id in rendered output when present
- tests/test_address_uniqueness.py: 35 unit tests covering identity, hashing,
  sorting, freeze roundtrip (incl. backward compat), and compute_dynamic_layout
  behavior for both recycled TIDs and recycled PIDs
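The identity change at the heart of this commit can be modeled with a small frozen dataclass (a simplified sketch; capa's real classes implement __eq__, __hash__, and __lt__ by hand, and the dataclass flags stand in for that logic here):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True, order=True)
class ThreadAddress:
    """Simplified model: with frozen=True and order=True, equality,
    hashing, and ordering include every field, so two lifecycles of a
    recycled TID with different ids are distinct addresses."""
    pid: int
    tid: int
    id: Optional[int] = None  # sandbox-specific instance identifier

first = ThreadAddress(pid=100, tid=42, id=0)
recycled = ThreadAddress(pid=100, tid=42, id=1)
assert first != recycled            # distinct identities
assert first < recycled             # stable ordering between instances
assert len({first, recycled}) == 2  # both survive as dict/set keys
assert ThreadAddress(100, 42) == ThreadAddress(100, 42)  # id=None default
```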
devs6186 force-pushed the fix/2619-recycled-tid-dynamic-layout branch from 90ba1b0 to e4aa79a on March 2, 2026 at 14:25
devs6186 marked this pull request as ready for review on March 2, 2026 at 14:26
@mike-hunhoff (Collaborator) left a comment

thanks @devs6186. I started a pass and quickly realized that there appears to be a lot of code added to maintain backwards compatibility. This change is going to be considered a breaking change requiring a major release, so let's not over-complicate the code to maintain backwards compatibility. Also, please ensure that process and thread uniqueness is addressed for all supported sandboxes. Presently, it looks like you only have process and thread covered for VMRay and process covered for CAPE.

- CAPE file.py: single pass with sequential IDs (no two-pass)
- CAPE process.py: add thread uniqueness with sequential IDs
- Drakvuf helpers.py: assign id=0 to process/thread addresses
- freeze: always include id in tuples, remove backwards-compat branching
- freeze: revert format-only changes in loads_static/loads_dynamic/main
- verbose.py: remove over-engineered pid-id display logic
- tests: update for simplified API, remove backwards-compat tests
@devs6186 (Contributor, Author)

thanks @mike-hunhoff, really appreciate the detailed review. pushed the changes in 2585abf addressing everything:

dropped all the backwards-compat complexity:
freeze format now always uses the full tuple with the id field included. removed all the conditional len() checks and the "if not zero then None" logic. since this is a breaking change anyway, the code is much simpler now.

CAPE processes - single pass:
replaced the two-pass count-then-assign approach with a straightforward single pass. every process gets a sequential id starting from 0, no special-casing.

CAPE threads:
added thread uniqueness handling in cape/process.py. each thread now gets a sequential id to handle recycled TIDs within the same process.

Drakvuf:
updated drakvuf/helpers.py to pass id=0 for all process and thread addresses so the id field is populated across all supported sandboxes.

reverted format-only changes:
reverted all the unnecessary line wrapping in freeze/__init__.py (loads_static, loads_dynamic, main) and the DynamicCallAddress.__eq__ reformatting in address.py.

sandbox coverage now:

  • VMRay: process + thread (uses monitor_id)
  • CAPE: process + thread (sequential ids)
  • Drakvuf: process + thread (id=0)

@devs6186 (Contributor, Author)

Hi @mike-hunhoff

I hope you're doing well, I know this is a busy period and I genuinely appreciate the time you've already put into the project.

I wanted to send a brief follow-up on Discussion #99. The GSoC application deadline is approaching and I want to make sure I'm heading in the right direction before I finalize my proposal. I'm not asking for an exhaustive review; even a quick "this looks promising / you're off-track on X" would be enormously helpful at this stage.

Thank you

@mike-hunhoff (Collaborator)


@devs6186 please keep all GSoC-related correspondence in the flare-gsoc discussions so we can better track them.

@mike-hunhoff (Collaborator)

Thanks for working on this! This is definitely the right approach to solve the recycled PID/TID issue.

I have a few suggestions to make this cleaner and more complete before we merge.

1. Rename id to instance_id

The name id is a bit too generic and might be confused with OS PIDs/TIDs. I think instance_id is much clearer because it explicitly says this distinguishes instances of the same PID/TID. If instance_id is too verbose, consider uid?

2. Serialization in freeze (Flat Tuples vs Alternatives)

I don't love how we are passing around 6 values for calls now in freeze (e.g. (ppid, pid, tid, call_id, process_id, thread_id)). It's very positional and easy to mess up.

Since we are breaking backward compatibility here anyway, we should consider a few alternatives for how we serialize these complex dynamic addresses. Here are the options I see:

Option A: Nested Dictionaries

{
  "type": "call",
  "call_id": 5,
  "thread": {
    "type": "thread",
    "tid": 42,
    "instance_id": 1,
    "process": {
      "type": "process",
      "pid": 200,
      "instance_id": 1,
      "parent": { ... }
    }
  }
}
  • Pros: Highly readable, self-documenting, and consistent with the rest of the freeze format (which uses Pydantic/JSON objects heavily).
  • Cons: Lots of repeated keys. Even though capa compresses the freeze file with zlib, this will cause memory bloat when loading and parsing the raw JSON.

Option B: Nested Tuples

Instead of a flat tuple, we use nested tuples to reflect the hierarchy: ((((parent_process), pid, instance_id), tid, instance_id), call_id).

  • Pros: Much more compact than dictionaries, but still preserves the structural relationships. Much easier to unpack cleanly in Python than a flat 6-tuple.
  • Cons: Not as human-readable in the raw JSON as a dictionary.

Option C: Keep Flat Tuples (Current PR)

  • Pros: Most compact representation.
  • Cons: Hard to read, maintain, and easy to get the position of values wrong.

3. Track the parent process uniquely

Right now we only track the parent via ppid (the raw OS PID). But wouldn't we also need the unique ID for the parent process if it was also recycled?

If we go with Option A or B for serialization, we can easily reference the full ProcessAddress of the parent, solving this edge case too.

What do you think about these options? @devs6186 @williballenthin @mr-tz

@devs6186 (Contributor, Author)

devs6186 commented Apr 8, 2026

Thanks for the suggestions @mike-hunhoff.

I ruled out Option C first. The old 6-item tuple (ppid, pid, tid, call_id, process_id, thread_id) is fully positional, so it is easy to mix fields up and hard to read without counting positions. That was the main readability issue you called out, and I agree with that.

I also considered Option A. It is the easiest to read in raw JSON, but I do not think it is the right tradeoff for the freeze format. Freeze stores zlib(json(...)) in memory, and dynamic call addresses can appear in very large numbers. With nested dicts, we would repeat keys like type, pid, and instance_id for every address entry, which adds real memory and parsing overhead. That matches your earlier concern about repeated keys causing memory bloat when loading the raw JSON.

So I went with Option B.
I pushed an update and rechecked the dynamic address path end to end. The new shape is:

process = (parent_tuple | None, pid, instance_id)
thread = (process_tuple, tid, instance_id)
call = (thread_tuple, call_id)

This keeps the hierarchy visible in the serialized form without the repeated-key cost of nested dicts.
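Under that shape, encoding and decoding stay structural rather than positional; a sketch with hypothetical helper names (not the actual freeze code):

```python
from typing import Optional

def freeze_process(pid: int, instance_id: int,
                   parent: Optional[tuple]) -> tuple:
    # process = (parent_tuple | None, pid, instance_id)
    return (parent, pid, instance_id)

def freeze_thread(process: tuple, tid: int, instance_id: int) -> tuple:
    # thread = (process_tuple, tid, instance_id)
    return (process, tid, instance_id)

def freeze_call(thread: tuple, call_id: int) -> tuple:
    # call = (thread_tuple, call_id)
    return (thread, call_id)

# Unpacking follows the nesting, so field mix-ups are harder than with a
# flat 6-tuple: the hierarchy is spelled out by the structure itself.
grandparent = freeze_process(pid=4, instance_id=0, parent=None)
parent = freeze_process(pid=200, instance_id=1, parent=grandparent)
call = freeze_call(freeze_thread(parent, tid=42, instance_id=1), call_id=5)

(thread, call_id) = call
(process, tid, thread_instance) = thread
(parent_tuple, pid, proc_instance) = process
assert (pid, tid, call_id) == (200, 42, 5)
assert parent_tuple == (None, 4, 0)  # full ancestry preserved
```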

I also changed parent tracking so ProcessAddress now stores parent: Optional[ProcessAddress] instead of only ppid: int. That means a recycled parent process can still be identified correctly. In the extractors, VMRay resolves parents through origin_monitor_id, CAPE links parents while walking the process list and uses a skeleton ProcessAddress(pid=ppid) when the parent is not present in the report, and Drakvuf creates an explicit parent node from ppid.

One more issue came up while validating this: equality and hashing were already using the full parent address, but sorting was still only using the parent PID. That meant two children under recycled parent instances could still sort ambiguously. I fixed that as well and added regression tests for recycled parent process and thread cases.
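The sorting fix can be illustrated with a recursive key that folds in the whole parent chain (a sketch with assumed field names, not capa's actual __lt__):

```python
from typing import Optional

class Proc:
    def __init__(self, pid: int, instance_id: int,
                 parent: "Optional[Proc]" = None):
        self.pid = pid
        self.instance_id = instance_id
        self.parent = parent

def sort_key(p: "Optional[Proc]") -> tuple:
    """Recursively include the full parent chain, so children of two
    recycled parent instances (same ppid, different instance_id) no
    longer sort ambiguously."""
    if p is None:
        return ()
    return (sort_key(p.parent), p.pid, p.instance_id)

parent_a = Proc(pid=200, instance_id=0)
parent_b = Proc(pid=200, instance_id=1)   # recycled PID, new lifecycle
child_a = Proc(pid=300, instance_id=0, parent=parent_a)
child_b = Proc(pid=300, instance_id=0, parent=parent_b)

# A ppid-only key cannot tell these children apart; the recursive key can.
assert sort_key(child_a) != sort_key(child_b)
assert sorted([child_b, child_a], key=sort_key)[0] is child_a
```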

I reran the checks on the changed surface: black, ruff, and the dynamic address, freeze, proto, and extractor tests. Current result is 155 passed, 1 xfailed.

The xfail is the proto roundtrip for dynamic addresses. That is expected for now because the current proto schema still stores a flat (ppid, pid) view and cannot rebuild multi-generation parent ancestry. Fixing that will need a follow-up change to the proto format.

I would appreciate your thoughts on both the approach and the implementation here as well: @williballenthin @mr-tz

devs6186 requested a review from mike-hunhoff on April 8, 2026 at 17:23

Development

Successfully merging this pull request may close these issues.

dynamic: render: ValueError "name not found for call"
