address: add optional id field for unique tracking of recycled PID/TID lifecycles#2889
address: add optional id field for unique tracking of recycled PID/TID lifecycles#2889devs6186 wants to merge 1 commit intomandiant:masterfrom
Conversation
…D lifecycles Adds an optional `id` field to `ProcessAddress` and `ThreadAddress` that sandbox backends can populate with a sandbox-specific unique identifier (e.g. VMRay monitor_id, or a sequential counter for CAPE). When set, this field becomes part of equality/hashing so that two process or thread instances that share the same OS-assigned PID/TID are treated as distinct addresses throughout capa's pipeline. This comprehensively fixes the ValueError crash in render (mandiant#2619) by solving the root uniqueness problem described in mandiant#2361: rather than merging recycled lifecycles into a single entry, each instance now gets its own identity. Changes: - address.py: add optional `id` to ProcessAddress and ThreadAddress; update __eq__, __hash__, __lt__, __repr__ accordingly; backward-compatible (id=None by default) - freeze/__init__.py: extend from_capa/to_capa to encode/decode the new id fields using extended tuple lengths; old 2/3/4-element tuples still decoded correctly for backward compatibility - vmray/extractor.py: pass monitor_id as id to both ProcessAddress and ThreadAddress so each VMRay monitor instance is uniquely tracked - cape/file.py: detect PID reuse via two-pass counting and assign sequential ids; processes with unique PIDs keep id=None (no behavior change) - render/verbose.py: add _format_process_fields / _format_thread_fields helpers that include the id in rendered output when present - tests/test_address_uniqueness.py: 35 unit tests covering identity, hashing, sorting, freeze roundtrip (incl. backward compat), and compute_dynamic_layout behavior for both recycled TIDs and recycled PIDs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request delivers a comprehensive fix for the problem of non-unique process and thread identifiers in dynamic analysis traces, which previously led to data inconsistencies and crashes. By introducing an optional Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This is a comprehensive and well-executed pull request that addresses the core issue of PID/TID recycling by introducing an optional id field. The changes are thoughtfully implemented across the address representation, extractors, serialization format, and rendering, with excellent test coverage to ensure correctness and backward compatibility.
My main feedback is to add assertions to enforce that process and thread ids are positive integers. This would make the design more robust by solidifying assumptions made in the sorting and serialization logic, preventing potential bugs with future or different backend implementations.
Overall, this is a great improvement to capa's dynamic analysis capabilities.
| def __init__(self, pid: int, ppid: int = 0, id: Optional[int] = None): | ||
| assert ppid >= 0 | ||
| assert pid > 0 | ||
| self.ppid = ppid | ||
| self.pid = pid | ||
| self.id = id |
There was a problem hiding this comment.
The current implementation has two potential issues related to the id field:
- Sorting: The
__lt__method uses-1as a sentinel forNone, which assumes all valid IDs are non-negative. - Serialization: The
freezelogic uses0as a sentinel forNone, which would incorrectly handle a legitimate ID of0(it would be deserialized asNone).
While current extractors seem to use positive IDs, this is not enforced. A future backend could provide id=0 or negative IDs, triggering these bugs.
To make the implementation more robust and prevent these potential data loss and sorting issues, I suggest enforcing that id must be a positive integer if it is not None. A similar assertion should be added to ThreadAddress.
| def __init__(self, pid: int, ppid: int = 0, id: Optional[int] = None): | |
| assert ppid >= 0 | |
| assert pid > 0 | |
| self.ppid = ppid | |
| self.pid = pid | |
| self.id = id | |
| def __init__(self, pid: int, ppid: int = 0, id: Optional[int] = None): | |
| assert ppid >= 0 | |
| assert pid > 0 | |
| if id is not None: | |
| assert id > 0, "ProcessAddress id must be a positive integer" | |
| self.ppid = ppid | |
| self.pid = pid | |
| self.id = id |
| def __init__(self, process: ProcessAddress, tid: int, id: Optional[int] = None): | ||
| assert tid >= 0 | ||
| self.process = process | ||
| self.tid = tid | ||
| self.id = id |
There was a problem hiding this comment.
For the same reasons outlined in the comment on ProcessAddress.__init__ (potential sorting and serialization issues), I recommend adding an assertion here to ensure that thread id values are positive integers if provided. This will make the overall design more robust.
| def __init__(self, process: ProcessAddress, tid: int, id: Optional[int] = None): | |
| assert tid >= 0 | |
| self.process = process | |
| self.tid = tid | |
| self.id = id | |
| def __init__(self, process: ProcessAddress, tid: int, id: Optional[int] = None): | |
| assert tid >= 0 | |
| if id is not None: | |
| assert id > 0, "ThreadAddress id must be a positive integer" | |
| self.process = process | |
| self.tid = tid | |
| self.id = id |
Closes #2619
Addresses #2361
Summary
This is a comprehensive fix that addresses the root uniqueness problem described in #2361, which in turn eliminates the
ValueErrorcrash in #2619.The previous approach (PR #2882) prevented data loss by merging calls from recycled TIDs under the same
ThreadAddress. The maintainer correctly noted that this still fuses separate lifecycle instances into a single entry, which doesn't solve the core problem.This PR solves the problem at the identity level by adding an optional
idfield toProcessAddressandThreadAddress:Changes
capa/features/address.pyProcessAddress: add optionalid: Optional[int]field; update__eq__,__hash__,__lt__,__repr__;id=Noneby default (fully backward-compatible)ThreadAddress: same treatmentcapa/features/extractors/vmray/extractor.pymonitor_idasidto bothProcessAddressandThreadAddress— VMRay's monitor IDs are exactly the kind of sandbox-specific unique identifier envisioned in Use more fields to address dynamic address processes #2361capa/features/extractors/cape/file.pyidvalues (1, 2, …) only when a(ppid, pid)pair appears more than once; unique PIDs keepid=None(no behavior change for normal reports)capa/features/freeze/__init__.pyfrom_capa: encodesidfields in extended tuples — PROCESS uses 3-tuple(ppid, pid, id), THREAD uses 5-tuple(ppid, pid, tid, process_id, thread_id), CALL uses 6-tupleto_capa: decodes by tuple length — old 2/3/4-element tuples (existing freeze files) still decode correctly withid=Nonecapa/render/verbose.py_format_process_fields(): renderspid:Xnormally, adds,id:Ywhen present_format_thread_fields(): renderspid:X,tid:Ynormally, adds,id:Zwhen presentrender_process,render_thread,render_span_of_calls,render_call) now use these helperstests/test_address_uniqueness.py(new file, 35 tests)TestProcessAddressUniqueness: equality, hashing, sorting, dict/set behavior, reprTestThreadAddressUniqueness: same, plus propagation from recycled process idTestCallAddressWithUniqueThreads: calls in distinct thread instances are distinctTestFreezeRoundtrip: roundtrip for all combinations of id/no-id; backward compat for old tuplesTestComputeDynamicLayoutRecycledTid: both thread instances appear separately with their own callsTestComputeDynamicLayoutRecycledPid: both process instances appear separatelyBackward Compatibility
ProcessAddress(pid=..., ppid=...)orThreadAddress(process=..., tid=...)continues to work unchanged —iddefaults toNoneid=Noneon all addressesNote on formatting
Format-only changes have been removed from this PR. The only line-wrapping changes present are those directly caused by the addition of new
id=...arguments.Checklist