address: add optional id field for unique tracking of recycled PID/TID lifecycles by devs6186 · Pull Request #2889 · mandiant/capa

devs6186 · 2026-03-02T14:18:30Z

Closes #2619
Addresses #2361

Summary

This is a comprehensive fix that addresses the root uniqueness problem described in #2361, which in turn eliminates the ValueError crash in #2619.

The previous approach (PR #2882) prevented data loss by merging calls from recycled TIDs under the same ThreadAddress. The maintainer correctly noted that this still fuses separate lifecycle instances into a single entry, which doesn't solve the core problem.

This PR solves the problem at the identity level by adding an optional id field to ProcessAddress and ThreadAddress:

from capa.features.address import ProcessAddress, ThreadAddress

p = ProcessAddress(pid=100, ppid=1)  # illustrative parent process

# Two thread instances with the same OS TID but different sandbox IDs
# are now genuinely distinct addresses throughout capa's pipeline
thread1 = ThreadAddress(process=p, tid=42, id=10)  # first lifecycle
thread2 = ThreadAddress(process=p, tid=42, id=20)  # recycled TID
assert thread1 != thread2  # distinct identities, separate layout entries

Changes

`capa/features/address.py`

ProcessAddress: add optional id: Optional[int] field; update __eq__, __hash__, __lt__, __repr__; id=None by default (fully backward-compatible)
ThreadAddress: same treatment
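
The effect of folding an optional id into equality, hashing, and ordering can be sketched as follows. This is a minimal illustration, not the code from this PR: `ProcessAddressSketch` and its `_key` helper are hypothetical names, and the ordering here places `None` before any real id with a boolean flag rather than a numeric sentinel.

```python
from functools import total_ordering
from typing import Optional


@total_ordering
class ProcessAddressSketch:
    """Hypothetical stand-in for illustration; not capa's ProcessAddress."""

    def __init__(self, pid: int, ppid: int = 0, id: Optional[int] = None):
        assert ppid >= 0
        assert pid > 0
        self.ppid = ppid
        self.pid = pid
        self.id = id

    def _key(self):
        # The boolean flag sorts id=None before any real id, so None is
        # never compared against an int directly.
        return (self.ppid, self.pid, self.id is not None, 0 if self.id is None else self.id)

    def __eq__(self, other):
        if not isinstance(other, ProcessAddressSketch):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        if not isinstance(other, ProcessAddressSketch):
            return NotImplemented
        return self._key() < other._key()

    def __hash__(self):
        return hash(self._key())

    def __repr__(self):
        suffix = "" if self.id is None else f", id: {self.id}"
        return f"process(ppid: {self.ppid}, pid: {self.pid}{suffix})"


first = ProcessAddressSketch(pid=42, ppid=1, id=1)   # first lifecycle
second = ProcessAddressSketch(pid=42, ppid=1, id=2)  # recycled PID
legacy = ProcessAddressSketch(pid=42, ppid=1)        # no id supplied
```

With this key, `first`, `second`, and `legacy` occupy three separate slots in a dict or set, while two addresses built without an id still compare equal as before.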

`capa/features/extractors/vmray/extractor.py`

Pass monitor_id as id to both ProcessAddress and ThreadAddress; VMRay's monitor IDs are exactly the kind of sandbox-specific unique identifier envisioned in #2361 ("Use more fields to address dynamic address processes")

`capa/features/extractors/cape/file.py`

Two-pass detection of PID reuse; assign sequential id values (1, 2, …) only when a (ppid, pid) pair appears more than once; unique PIDs keep id=None (no behavior change for normal reports)
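
A two-pass scheme of this kind might look like the sketch below. It rests on assumptions: the function name `assign_process_ids` is hypothetical, the input is reduced to a plain list of (ppid, pid) pairs, and the counter restarts at 1 for each reused pair (the description above does not say whether numbering is per pair or global).

```python
from collections import Counter
from typing import List, Optional, Tuple


def assign_process_ids(
    pairs: List[Tuple[int, int]],
) -> List[Tuple[int, int, Optional[int]]]:
    """Hypothetical helper: attach an id only to (ppid, pid) pairs that repeat."""
    # Pass 1: count how often each (ppid, pid) pair occurs in the report.
    counts = Counter(pairs)

    # Pass 2: hand out 1, 2, ... to each occurrence of a repeated pair;
    # pairs that occur once keep id=None.
    seen: Counter = Counter()
    out: List[Tuple[int, int, Optional[int]]] = []
    for ppid, pid in pairs:
        if counts[(ppid, pid)] > 1:
            seen[(ppid, pid)] += 1
            out.append((ppid, pid, seen[(ppid, pid)]))
        else:
            out.append((ppid, pid, None))
    return out


# PID 100 under parent 4 appears twice (reused); PID 200 appears once.
result = assign_process_ids([(4, 100), (4, 200), (4, 100)])
```

The first pass is what lets unique PIDs stay at `id=None`: a single pass could not know, when it meets the first occurrence, whether the pair will show up again.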

`capa/features/freeze/__init__.py`

from_capa: encodes id fields in extended tuples — PROCESS uses 3-tuple (ppid, pid, id), THREAD uses 5-tuple (ppid, pid, tid, process_id, thread_id), CALL uses 6-tuple
to_capa: decodes by tuple length — old 2/3/4-element tuples (existing freeze files) still decode correctly with id=None
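
Decoding by tuple length can be illustrated for the PROCESS case alone. The sketch below is a simplification under stated assumptions: `encode_process` and `decode_process` are hypothetical names, and an absent id is written as the short 2-tuple rather than with any placeholder value, which may differ from what the PR actually does.

```python
from typing import Optional, Tuple


def encode_process(ppid: int, pid: int, id: Optional[int] = None) -> tuple:
    """Hypothetical encoder: extend the tuple only when an id is present."""
    if id is None:
        return (ppid, pid)      # same shape as older freeze files
    return (ppid, pid, id)      # extended shape carrying the id


def decode_process(value: tuple) -> Tuple[int, int, Optional[int]]:
    """Hypothetical decoder: branch on tuple length."""
    if len(value) == 2:
        ppid, pid = value
        return ppid, pid, None  # legacy tuple, no id recorded
    if len(value) == 3:
        ppid, pid, id = value
        return ppid, pid, id
    raise ValueError(f"unexpected PROCESS tuple length: {len(value)}")


old_style = decode_process((1, 42))
new_style = decode_process(encode_process(1, 42, id=7))
```

Because the length alone tells the decoder which layout it is reading, files written before the id field existed need no version flag to load.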

`capa/render/verbose.py`

_format_process_fields(): renders pid:X normally, adds ,id:Y when present
_format_thread_fields(): renders pid:X,tid:Y normally, adds ,id:Z when present
Existing render functions (render_process, render_thread, render_span_of_calls, render_call) now use these helpers
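
The rendering rule described above can be sketched with two small functions. Their names mirror the helpers listed here, but the signatures (plain integers rather than address objects) are an assumption made for illustration.

```python
from typing import Optional


def format_process_fields(pid: int, id: Optional[int] = None) -> str:
    """Hypothetical signature: render pid, plus id only when one is set."""
    text = f"pid:{pid}"
    if id is not None:
        text += f",id:{id}"
    return text


def format_thread_fields(pid: int, tid: int, id: Optional[int] = None) -> str:
    """Hypothetical signature: render pid and tid, plus id only when one is set."""
    text = f"pid:{pid},tid:{tid}"
    if id is not None:
        text += f",id:{id}"
    return text


plain = format_thread_fields(pid=42, tid=7)
with_id = format_thread_fields(pid=42, tid=7, id=3)
```

Output for addresses without an id is unchanged, so existing verbose output stays stable for backends that never set one.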

`tests/test_address_uniqueness.py` (new file, 35 tests)

TestProcessAddressUniqueness: equality, hashing, sorting, dict/set behavior, repr
TestThreadAddressUniqueness: same, plus propagation from recycled process id
TestCallAddressWithUniqueThreads: calls in distinct thread instances are distinct
TestFreezeRoundtrip: roundtrip for all combinations of id/no-id; backward compat for old tuples
TestComputeDynamicLayoutRecycledTid: both thread instances appear separately with their own calls
TestComputeDynamicLayoutRecycledPid: both process instances appear separately

Backward Compatibility

Existing code that creates ProcessAddress(pid=..., ppid=...) or ThreadAddress(process=..., tid=...) continues to work unchanged — id defaults to None
Freeze files written by older versions of capa are still loadable — the decoder branches on tuple length
Backends that don't provide a unique id (DRAKVUF, etc.) are unaffected; they'll continue working exactly as before, with id=None on all addresses

Note on formatting

Format-only changes have been removed from this PR. The only line-wrapping changes present are those directly caused by the addition of new id=... arguments.

Checklist

Tests added for all scenarios (address identity, freeze roundtrip, layout computation)
Backward compatibility maintained for existing freeze files and extractors
No format-only changes included
CHANGELOG updated

…D lifecycles

Adds an optional `id` field to `ProcessAddress` and `ThreadAddress` that sandbox backends can populate with a sandbox-specific unique identifier (e.g. VMRay monitor_id, or a sequential counter for CAPE). When set, this field becomes part of equality/hashing so that two process or thread instances that share the same OS-assigned PID/TID are treated as distinct addresses throughout capa's pipeline.

This comprehensively fixes the ValueError crash in render (mandiant#2619) by solving the root uniqueness problem described in mandiant#2361: rather than merging recycled lifecycles into a single entry, each instance now gets its own identity.

Changes:
- address.py: add optional `id` to ProcessAddress and ThreadAddress; update __eq__, __hash__, __lt__, __repr__ accordingly; backward-compatible (id=None by default)
- freeze/__init__.py: extend from_capa/to_capa to encode/decode the new id fields using extended tuple lengths; old 2/3/4-element tuples still decoded correctly for backward compatibility
- vmray/extractor.py: pass monitor_id as id to both ProcessAddress and ThreadAddress so each VMRay monitor instance is uniquely tracked
- cape/file.py: detect PID reuse via two-pass counting and assign sequential ids; processes with unique PIDs keep id=None (no behavior change)
- render/verbose.py: add _format_process_fields / _format_thread_fields helpers that include the id in rendered output when present
- tests/test_address_uniqueness.py: 35 unit tests covering identity, hashing, sorting, freeze roundtrip (incl. backward compat), and compute_dynamic_layout behavior for both recycled TIDs and recycled PIDs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

google-cla · 2026-03-02T14:18:49Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

gemini-code-assist · 2026-03-02T14:18:54Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a comprehensive fix for the problem of non-unique process and thread identifiers in dynamic analysis traces, which previously led to data inconsistencies and crashes. By introducing an optional id field to ProcessAddress and ThreadAddress, the system can now accurately distinguish between different lifecycle instances that might share the same OS-assigned IDs. This ensures that each unique process or thread is correctly tracked throughout capa's analysis pipeline, improving the reliability and accuracy of dynamic analysis results, all while maintaining full backward compatibility with existing data and configurations.

Highlights

Unique Tracking for Recycled PIDs/TIDs: Introduced an optional id field to ProcessAddress and ThreadAddress classes to uniquely track distinct lifecycle instances even when OS-assigned PIDs/TIDs are reused. This resolves issues with data loss and ValueError crashes caused by non-unique identifiers.
Address Class Enhancements: Updated ProcessAddress and ThreadAddress in capa/features/address.py to include the new id field, modifying their __init__, __repr__, __hash__, __eq__, and __lt__ methods for proper handling of uniqueness and sorting.
Extractor Integration: Integrated the new id field into dynamic extractors: VMRay's monitor_id is now used as the id for processes and threads, and CAPE's extractor implements a two-pass detection to assign sequential id values for reused PIDs.
Freeze File Compatibility: Modified the freeze serialization/deserialization logic to support the new id fields in PROCESS, THREAD, and CALL address tuples, ensuring full backward compatibility with existing freeze files and older capa versions.
Verbose Rendering Updates: Updated verbose rendering functions to display the optional id field for processes and threads, providing more detailed output when unique IDs are present.
Comprehensive Testing: Added a new test file (tests/test_address_uniqueness.py) with 35 new tests covering various scenarios, including address identity, hashing, sorting, freeze roundtrip, and dynamic layout computation for recycled PIDs/TIDs.

Changelog

CHANGELOG.md
- address: add optional id field to ProcessAddress/ThreadAddress for unique tracking of recycled PID/TID lifecycles @devs6186 #2619 (dynamic: render: ValueError "name not found for call")

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This is a comprehensive and well-executed pull request that addresses the core issue of PID/TID recycling by introducing an optional id field. The changes are thoughtfully implemented across the address representation, extractors, serialization format, and rendering, with excellent test coverage to ensure correctness and backward compatibility.

My main feedback is to add assertions to enforce that process and thread ids are positive integers. This would make the design more robust by solidifying assumptions made in the sorting and serialization logic, preventing potential bugs with future or different backend implementations.

Overall, this is a great improvement to capa's dynamic analysis capabilities.

gemini-code-assist · 2026-03-02T14:23:05Z

capa/features/address.py

+    def __init__(self, pid: int, ppid: int = 0, id: Optional[int] = None):
        assert ppid >= 0
        assert pid > 0
        self.ppid = ppid
        self.pid = pid
+        self.id = id


The current implementation has two potential issues related to the id field:

Sorting: The __lt__ method uses -1 as a sentinel for None, which assumes all valid IDs are non-negative.

Serialization: The freeze logic uses 0 as a sentinel for None, which would incorrectly handle a legitimate ID of 0 (it would be deserialized as None).

While current extractors seem to use positive IDs, this is not enforced. A future backend could provide id=0 or negative IDs, triggering these bugs.
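
Both failure modes are easy to reproduce in isolation. The sketch below assumes the sentinel conventions described above (0 for None when serializing, -1 for None when sorting); the function names are hypothetical and this is not the code under review.

```python
from typing import Optional


def encode_id(id: Optional[int]) -> int:
    # Serialization sentinel: None is written as 0.
    return 0 if id is None else id


def decode_id(raw: int) -> Optional[int]:
    # A stored 0 is always read back as None.
    return None if raw == 0 else raw


def sort_key(id: Optional[int]) -> int:
    # Sorting sentinel: None is ranked as -1.
    return -1 if id is None else id


# A legitimate id of 0 does not survive a roundtrip.
roundtrip_zero = decode_id(encode_id(0))
# A positive id is unaffected.
roundtrip_seven = decode_id(encode_id(7))
# A negative id sorts ahead of "no id", breaking the intended order.
order = sorted([None, -5, 3], key=sort_key)
```

Here `roundtrip_zero` comes back as `None`, and `order` places `-5` before `None`. Restricting ids to positive integers rules out both cases.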

To make the implementation more robust and prevent these potential data loss and sorting issues, I suggest enforcing that id must be a positive integer if it is not None. A similar assertion should be added to ThreadAddress.

Suggested change

-    def __init__(self, pid: int, ppid: int = 0, id: Optional[int] = None):
-        assert ppid >= 0
-        assert pid > 0
-        self.ppid = ppid
-        self.pid = pid
-        self.id = id
+    def __init__(self, pid: int, ppid: int = 0, id: Optional[int] = None):
+        assert ppid >= 0
+        assert pid > 0
+        if id is not None:
+            assert id > 0, "ProcessAddress id must be a positive integer"
+        self.ppid = ppid
+        self.pid = pid
+        self.id = id

gemini-code-assist · 2026-03-02T14:23:05Z

capa/features/address.py

+    def __init__(self, process: ProcessAddress, tid: int, id: Optional[int] = None):
        assert tid >= 0
        self.process = process
        self.tid = tid
+        self.id = id


For the same reasons outlined in the comment on ProcessAddress.__init__ (potential sorting and serialization issues), I recommend adding an assertion here to ensure that thread id values are positive integers if provided. This will make the overall design more robust.

Suggested change

-    def __init__(self, process: ProcessAddress, tid: int, id: Optional[int] = None):
-        assert tid >= 0
-        self.process = process
-        self.tid = tid
-        self.id = id
+    def __init__(self, process: ProcessAddress, tid: int, id: Optional[int] = None):
+        assert tid >= 0
+        if id is not None:
+            assert id > 0, "ThreadAddress id must be a positive integer"
+        self.process = process
+        self.tid = tid
+        self.id = id

devs6186 closed this Mar 2, 2026

gemini-code-assist bot reviewed Mar 2, 2026

devs6186 deleted the fix/2619-address-uniqueness-v2 branch March 10, 2026 10:22

devs6186 wants to merge 1 commit into mandiant:master from devs6186:fix/2619-address-uniqueness-v2
