address: add optional id field for unique tracking of recycled PID/TID lifecycles #2889

Closed
devs6186 wants to merge 1 commit into mandiant:master from devs6186:fix/2619-address-uniqueness-v2

@devs6186 devs6186 commented Mar 2, 2026

Closes #2619
Addresses #2361

Summary

This is a comprehensive fix that addresses the root uniqueness problem described in #2361, which in turn eliminates the ValueError crash in #2619.

The previous approach (PR #2882) prevented data loss by merging calls from recycled TIDs under the same ThreadAddress. The maintainer correctly noted that this still fuses separate lifecycle instances into a single entry, which doesn't solve the core problem.

This PR solves the problem at the identity level by adding an optional id field to ProcessAddress and ThreadAddress:

# Two thread instances with the same OS TID but different sandbox IDs
# are now genuinely distinct addresses throughout capa's pipeline
thread1 = ThreadAddress(process=p, tid=42, id=10)  # first lifecycle
thread2 = ThreadAddress(process=p, tid=42, id=20)  # recycled TID
assert thread1 != thread2  # distinct identities, separate layout entries

Changes

capa/features/address.py

  • ProcessAddress: add optional id: Optional[int] field; update __eq__, __hash__, __lt__, __repr__; id=None by default (fully backward-compatible)
  • ThreadAddress: same treatment
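
For reviewers who want to see the identity semantics in isolation, here is a minimal sketch. It is not the code in capa/features/address.py: the class below is a simplified stand-in, the `_key` helper and the `__repr__` format are assumptions made for illustration, and the `-1` sort sentinel follows the behavior described in the review comments on this PR.

```python
from typing import Optional


class ProcessAddress:
    """Simplified stand-in for capa's ProcessAddress (illustration only)."""

    def __init__(self, pid: int, ppid: int = 0, id: Optional[int] = None):
        assert ppid >= 0
        assert pid > 0
        self.ppid = ppid
        self.pid = pid
        self.id = id  # None when the backend supplies no unique identifier

    def _key(self):
        # Hypothetical helper: map None to -1 so addresses without an id sort first.
        return (self.ppid, self.pid, -1 if self.id is None else self.id)

    def __eq__(self, other):
        if not isinstance(other, ProcessAddress):
            return NotImplemented
        return (self.ppid, self.pid, self.id) == (other.ppid, other.pid, other.id)

    def __hash__(self):
        return hash((self.ppid, self.pid, self.id))

    def __lt__(self, other):
        return self._key() < other._key()

    def __repr__(self):
        # Assumed format; the real __repr__ may differ.
        suffix = "" if self.id is None else f", id: {self.id}"
        return f"process(pid: {self.pid}{suffix})"


first = ProcessAddress(pid=1234, ppid=4, id=1)     # first lifecycle
recycled = ProcessAddress(pid=1234, ppid=4, id=2)  # same PID, later lifecycle
legacy = ProcessAddress(pid=1234, ppid=4)          # backend without ids

assert first != recycled
assert len({first, recycled, legacy}) == 3
assert sorted([recycled, first, legacy]) == [legacy, first, recycled]
```

Because `id` takes part in both `__eq__` and `__hash__`, the two lifecycles occupy separate dict and set slots, while callers that never pass `id` compare exactly as before.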

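capa/features/extractors/vmray/extractor.py

  • Pass monitor_id as id to both ProcessAddress and ThreadAddress so each VMRay monitor instance is uniquely tracked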

capa/features/extractors/cape/file.py

  • Two-pass detection of PID reuse; assign sequential id values (1, 2, …) only when a (ppid, pid) pair appears more than once; unique PIDs keep id=None (no behavior change for normal reports)
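
The two-pass idea can be sketched roughly as follows. This is an illustration rather than the code in cape/file.py: `assign_process_ids` is a hypothetical helper, the input is assumed to be a plain list of (ppid, pid) pairs in report order, and the counter is assumed to restart at 1 for each reused pair.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

Pair = Tuple[int, int]


def assign_process_ids(processes: Iterable[Pair]) -> List[Tuple[int, int, Optional[int]]]:
    """Return (ppid, pid, id) triples; id stays None unless the pair repeats."""
    ordered = list(processes)

    # Pass 1: count how often each (ppid, pid) pair occurs in the report.
    counts = Counter(ordered)

    # Pass 2: hand out 1, 2, ... per reused pair, in report order.
    next_id: Dict[Pair, int] = {}
    result: List[Tuple[int, int, Optional[int]]] = []
    for key in ordered:
        if counts[key] > 1:
            next_id[key] = next_id.get(key, 0) + 1
            result.append((*key, next_id[key]))
        else:
            result.append((*key, None))
    return result


report = [(4, 100), (4, 200), (4, 100), (8, 300)]
assert assign_process_ids(report) == [
    (4, 100, 1),
    (4, 200, None),
    (4, 100, 2),
    (8, 300, None),
]
```

Counting first is what lets the first occurrence of a reused pair receive id=1 instead of None, since that decision cannot be made in a single forward pass.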

capa/features/freeze/__init__.py

  • from_capa: encodes id fields in extended tuples — PROCESS uses 3-tuple (ppid, pid, id), THREAD uses 5-tuple (ppid, pid, tid, process_id, thread_id), CALL uses 6-tuple
  • to_capa: decodes by tuple length — old 2/3/4-element tuples (existing freeze files) still decode correctly with id=None
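
A rough sketch of length-based decoding, under stated assumptions: the function names are hypothetical, the real from_capa/to_capa work on capa's freeze address model rather than bare tuples, and whether from_capa emits the short tuple when id is None is an assumption. Treating 0 as "no id" in the 5-tuple follows the serialization behavior described in the review on this PR.

```python
from typing import Optional, Tuple


def encode_process(ppid: int, pid: int, id: Optional[int] = None) -> Tuple[int, ...]:
    # Assumption: addresses without an id keep the legacy 2-tuple shape.
    if id is None:
        return (ppid, pid)
    return (ppid, pid, id)


def decode_process(value: Tuple[int, ...]) -> Tuple[int, int, Optional[int]]:
    if len(value) == 2:  # legacy freeze file
        ppid, pid = value
        return ppid, pid, None
    if len(value) == 3:  # extended form
        ppid, pid, id = value
        return ppid, pid, id
    raise ValueError(f"unexpected PROCESS tuple length: {len(value)}")


def decode_thread(value: Tuple[int, ...]) -> Tuple[int, int, int, Optional[int], Optional[int]]:
    if len(value) == 3:  # legacy (ppid, pid, tid)
        ppid, pid, tid = value
        return ppid, pid, tid, None, None
    if len(value) == 5:  # (ppid, pid, tid, process_id, thread_id)
        ppid, pid, tid, process_id, thread_id = value
        # 0 is read back as "no id", per the sentinel described in review.
        return ppid, pid, tid, process_id or None, thread_id or None
    raise ValueError(f"unexpected THREAD tuple length: {len(value)}")


assert decode_process(encode_process(4, 100)) == (4, 100, None)
assert decode_process(encode_process(4, 100, 2)) == (4, 100, 2)
assert decode_thread((4, 100, 42)) == (4, 100, 42, None, None)
assert decode_thread((4, 100, 42, 0, 20)) == (4, 100, 42, None, 20)
```

Note that a legacy THREAD tuple and an extended PROCESS tuple both have three elements, so the length check only works once the address type is already known.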

capa/render/verbose.py

  • _format_process_fields(): renders pid:X normally, adds ,id:Y when present
  • _format_thread_fields(): renders pid:X,tid:Y normally, adds ,id:Z when present
  • Existing render functions (render_process, render_thread, render_span_of_calls, render_call) now use these helpers
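
The output format described above can be illustrated with the sketch below. The functions are stand-ins for the PR's `_format_process_fields` and `_format_thread_fields`; their signatures here (plain integers instead of address objects) are an assumption.

```python
from typing import Optional


def format_process_fields(pid: int, id: Optional[int] = None) -> str:
    """Render 'pid:X', appending ',id:Y' only when an id is present."""
    fields = f"pid:{pid}"
    if id is not None:
        fields += f",id:{id}"
    return fields


def format_thread_fields(pid: int, tid: int, id: Optional[int] = None) -> str:
    """Render 'pid:X,tid:Y', appending ',id:Z' only when an id is present."""
    fields = f"pid:{pid},tid:{tid}"
    if id is not None:
        fields += f",id:{id}"
    return fields


assert format_process_fields(1234) == "pid:1234"
assert format_process_fields(1234, id=2) == "pid:1234,id:2"
assert format_thread_fields(1234, 42) == "pid:1234,tid:42"
assert format_thread_fields(1234, 42, id=20) == "pid:1234,tid:42,id:20"
```

Reports without ids render exactly as before, so existing verbose output should stay unchanged for backends that leave id=None.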

tests/test_address_uniqueness.py (new file, 35 tests)

  • TestProcessAddressUniqueness: equality, hashing, sorting, dict/set behavior, repr
  • TestThreadAddressUniqueness: same, plus propagation from recycled process id
  • TestCallAddressWithUniqueThreads: calls in distinct thread instances are distinct
  • TestFreezeRoundtrip: roundtrip for all combinations of id/no-id; backward compat for old tuples
  • TestComputeDynamicLayoutRecycledTid: both thread instances appear separately with their own calls
  • TestComputeDynamicLayoutRecycledPid: both process instances appear separately

Backward Compatibility

  • Existing code that creates ProcessAddress(pid=..., ppid=...) or ThreadAddress(process=..., tid=...) continues to work unchanged — id defaults to None
  • Freeze files written by older versions of capa are still loadable — the decoder branches on tuple length
  • Backends that don't provide a unique id (DRAKVUF, etc.) are unaffected; they'll continue working exactly as before, with id=None on all addresses

Note on formatting

Format-only changes have been removed from this PR. The only line-wrapping changes present are those directly caused by the addition of new id=... arguments.

Checklist

  • Tests added for all scenarios (address identity, freeze roundtrip, layout computation)
  • Backward compatibility maintained for existing freeze files and extractors
  • No format-only changes included
  • CHANGELOG updated

…D lifecycles

Adds an optional `id` field to `ProcessAddress` and `ThreadAddress` that
sandbox backends can populate with a sandbox-specific unique identifier
(e.g. VMRay monitor_id, or a sequential counter for CAPE). When set, this
field becomes part of equality/hashing so that two process or thread
instances that share the same OS-assigned PID/TID are treated as distinct
addresses throughout capa's pipeline.

This comprehensively fixes the ValueError crash in render (mandiant#2619) by solving
the root uniqueness problem described in mandiant#2361: rather than merging recycled
lifecycles into a single entry, each instance now gets its own identity.

Changes:
- address.py: add optional `id` to ProcessAddress and ThreadAddress; update
  __eq__, __hash__, __lt__, __repr__ accordingly; backward-compatible (id=None
  by default)
- freeze/__init__.py: extend from_capa/to_capa to encode/decode the new id
  fields using extended tuple lengths; old 2/3/4-element tuples still decoded
  correctly for backward compatibility
- vmray/extractor.py: pass monitor_id as id to both ProcessAddress and
  ThreadAddress so each VMRay monitor instance is uniquely tracked
- cape/file.py: detect PID reuse via two-pass counting and assign sequential
  ids; processes with unique PIDs keep id=None (no behavior change)
- render/verbose.py: add _format_process_fields / _format_thread_fields helpers
  that include the id in rendered output when present
- tests/test_address_uniqueness.py: 35 unit tests covering identity, hashing,
  sorting, freeze roundtrip (incl. backward compat), and compute_dynamic_layout
  behavior for both recycled TIDs and recycled PIDs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@google-cla

google-cla bot commented Mar 2, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a comprehensive fix for the problem of non-unique process and thread identifiers in dynamic analysis traces, which previously led to data inconsistencies and crashes. By introducing an optional id field to ProcessAddress and ThreadAddress, the system can now accurately distinguish between different lifecycle instances that might share the same OS-assigned IDs. This ensures that each unique process or thread is correctly tracked throughout capa's analysis pipeline, improving the reliability and accuracy of dynamic analysis results, all while maintaining full backward compatibility with existing data and configurations.

Highlights

  • Unique Tracking for Recycled PIDs/TIDs: Introduced an optional id field to ProcessAddress and ThreadAddress classes to uniquely track distinct lifecycle instances even when OS-assigned PIDs/TIDs are reused. This resolves issues with data loss and ValueError crashes caused by non-unique identifiers.
  • Address Class Enhancements: Updated ProcessAddress and ThreadAddress in capa/features/address.py to include the new id field, modifying their __init__, __repr__, __hash__, __eq__, and __lt__ methods for proper handling of uniqueness and sorting.
  • Extractor Integration: Integrated the new id field into dynamic extractors: VMRay's monitor_id is now used as the id for processes and threads, and CAPE's extractor implements a two-pass detection to assign sequential id values for reused PIDs.
  • Freeze File Compatibility: Modified the freeze serialization/deserialization logic to support the new id fields in PROCESS, THREAD, and CALL address tuples, ensuring full backward compatibility with existing freeze files and older capa versions.
  • Verbose Rendering Updates: Updated verbose rendering functions to display the optional id field for processes and threads, providing more detailed output when unique IDs are present.
  • Comprehensive Testing: Added a new test file (tests/test_address_uniqueness.py) with 35 new tests covering various scenarios, including address identity, hashing, sorting, freeze roundtrip, and dynamic layout computation for recycled PIDs/TIDs.

@devs6186 devs6186 closed this Mar 2, 2026

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This is a comprehensive and well-executed pull request that addresses the core issue of PID/TID recycling by introducing an optional id field. The changes are thoughtfully implemented across the address representation, extractors, serialization format, and rendering, with excellent test coverage to ensure correctness and backward compatibility.

My main feedback is to add assertions to enforce that process and thread ids are positive integers. This would make the design more robust by solidifying assumptions made in the sorting and serialization logic, preventing potential bugs with future or different backend implementations.

Overall, this is a great improvement to capa's dynamic analysis capabilities.

Comment on lines +65 to +70
def __init__(self, pid: int, ppid: int = 0, id: Optional[int] = None):
    assert ppid >= 0
    assert pid > 0
    self.ppid = ppid
    self.pid = pid
    self.id = id

high

The current implementation has two potential issues related to the id field:

  1. Sorting: The __lt__ method uses -1 as a sentinel for None, which assumes all valid IDs are non-negative.
  2. Serialization: The freeze logic uses 0 as a sentinel for None, which would incorrectly handle a legitimate ID of 0 (it would be deserialized as None).

While current extractors seem to use positive IDs, this is not enforced. A future backend could provide id=0 or negative IDs, triggering these bugs.

To make the implementation more robust and prevent these potential data loss and sorting issues, I suggest enforcing that id must be a positive integer if it is not None. A similar assertion should be added to ThreadAddress.

Suggested change
-def __init__(self, pid: int, ppid: int = 0, id: Optional[int] = None):
-    assert ppid >= 0
-    assert pid > 0
-    self.ppid = ppid
-    self.pid = pid
-    self.id = id
+def __init__(self, pid: int, ppid: int = 0, id: Optional[int] = None):
+    assert ppid >= 0
+    assert pid > 0
+    if id is not None:
+        assert id > 0, "ProcessAddress id must be a positive integer"
+    self.ppid = ppid
+    self.pid = pid
+    self.id = id
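
To make the two sentinel pitfalls concrete, here is a small standalone sketch. The helper names are hypothetical and not taken from the PR; they only model the behavior described in this comment (0 standing for None when freezing, -1 standing for None when sorting).

```python
from typing import Optional


def to_frozen(id: Optional[int]) -> int:
    # Models the freeze behavior described above: None is stored as 0.
    return 0 if id is None else id


def from_frozen(value: int) -> Optional[int]:
    return None if value == 0 else value


def sort_key(id: Optional[int]) -> int:
    # Models the __lt__ behavior described above: None is compared as -1.
    return -1 if id is None else id


# Positive ids and None survive a roundtrip.
assert from_frozen(to_frozen(7)) == 7
assert from_frozen(to_frozen(None)) is None

# A legitimate id of 0 comes back as None.
assert from_frozen(to_frozen(0)) is None

# A negative id sorts ahead of None, so "no id" is no longer ordered first.
assert sorted([3, None, -5], key=sort_key) == [-5, None, 3]
```

Restricting ids to positive integers, as suggested above, rules out both cases without changing the sentinels.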

Comment on lines +108 to +112
def __init__(self, process: ProcessAddress, tid: int, id: Optional[int] = None):
    assert tid >= 0
    self.process = process
    self.tid = tid
    self.id = id

high

For the same reasons outlined in the comment on ProcessAddress.__init__ (potential sorting and serialization issues), I recommend adding an assertion here to ensure that thread id values are positive integers if provided. This will make the overall design more robust.

Suggested change
-def __init__(self, process: ProcessAddress, tid: int, id: Optional[int] = None):
-    assert tid >= 0
-    self.process = process
-    self.tid = tid
-    self.id = id
+def __init__(self, process: ProcessAddress, tid: int, id: Optional[int] = None):
+    assert tid >= 0
+    if id is not None:
+        assert id > 0, "ThreadAddress id must be a positive integer"
+    self.process = process
+    self.tid = tid
+    self.id = id

@devs6186 devs6186 deleted the fix/2619-address-uniqueness-v2 branch March 10, 2026 10:22
Development

Successfully merging this pull request may close these issues.

dynamic: render: ValueError "name not found for call"
