Skip to content

fix(dynamic): handle PID/TID reuse in address identity#2933

Closed
Vib3ss wants to merge 3 commits intomandiant:masterfrom
Vib3ss:issue-2619-dynamic-address-identity
Closed

fix(dynamic): handle PID/TID reuse in address identity#2933
Vib3ss wants to merge 3 commits intomandiant:masterfrom
Vib3ss:issue-2619-dynamic-address-identity

Conversation

@Vib3ss
Copy link
Copy Markdown

@Vib3ss Vib3ss commented Mar 16, 2026

Implemented a fix for #2619 by preventing PID/TID reuse collisions in dynamic analysis.
Process/thread identities now support optional lifecycle IDs, and dynamic layout aggregation no longer drops matched calls when a thread address is reused.

The change is backward-compatible across freeze/proto/render address handling and includes new regression tests (6 passed) plus related dynamic/proto/render tests (50 passed).

  • No CHANGELOG update needed

Closes #2619

@google-cla
Copy link
Copy Markdown

google-cla bot commented Mar 16, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the robustness of dynamic analysis by addressing the challenge of PID/TID reuse. It introduces optional lifecycle identifiers to process and thread addresses, ensuring that each instance is uniquely identifiable throughout its existence. This enhancement prevents data loss and aggregation issues that previously occurred when the operating system recycled process or thread identifiers. The changes are integrated across the address definition, data extraction, serialization, and rendering components, maintaining backward compatibility and including new tests to verify the fix.

Highlights

  • Enhanced Address Identity: Introduced an optional id field to ProcessAddress and ThreadAddress classes to uniquely identify processes and threads, even when PIDs or TIDs are reused. This ensures distinct identities across their lifecycles.
  • Updated Address Comparison and Hashing: Modified the __hash__, __eq__, and __lt__ methods for ProcessAddress and ThreadAddress to incorporate the new id field, guaranteeing correct identity and ordering when ids are present.
  • Improved Dynamic Analysis Extraction: Updated CAPE and VMRay extractors to assign unique lifecycle ids to processes and threads, preventing collisions during dynamic analysis when PIDs/TIDs are recycled by the operating system.
  • Backward-Compatible Serialization/Deserialization: Adjusted the freeze and proto modules to handle the new optional id fields in addresses, ensuring backward compatibility for existing serialized data while supporting the enhanced address structures.
  • Fix for Dynamic Layout Aggregation: Resolved an issue in capa/loader.py where matched calls were incorrectly dropped during dynamic layout aggregation when a thread address was reused, ensuring all relevant calls are preserved.
  • New Regression Tests: Added comprehensive regression tests to validate the correct behavior of address identity with lifecycle ids, round-trip serialization, and the fix for recycled TIDs in dynamic layout aggregation.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • capa/features/address.py
    • Added Optional type hint.
    • Introduced an optional id field to ProcessAddress and ThreadAddress constructors.
    • Updated __repr__, __hash__, __eq__, and __lt__ methods to include the id for identity.
  • capa/features/extractors/cape/file.py
    • Modified process extraction to assign unique ids to ProcessAddress instances when PID/PPID reuse is detected, preventing collisions.
  • capa/features/extractors/cape/process.py
    • Updated thread extraction to assign unique ids to ThreadAddress instances when TID reuse is detected.
  • capa/features/extractors/vmray/extractor.py
    • Integrated monitor_id into ProcessAddress and ThreadAddress creation for VMRay extractions to ensure unique identification.
  • capa/features/freeze/init.py
    • Extended from_capa and to_capa methods to handle the new optional id fields in ProcessAddress, ThreadAddress, and DynamicCallAddress for serialization and deserialization.
  • capa/loader.py
    • Corrected the dynamic layout aggregation logic to properly handle thread address reuse by using setdefault and checking matched_threads to prevent dropping matched calls.
  • capa/render/proto/init.py
    • Updated addr_to_pb2 function to parse and serialize ProcessAddress, ThreadAddress, and DynamicCallAddress with their new optional id fields for protobuf rendering.
  • capa/render/verbose.py
    • Enhanced format_address, render_process, render_thread, and render_span_of_calls functions to display the optional id fields in verbose output.
  • tests/test_address_uniqueness.py
    • Added a new test file containing regression tests to verify the correct behavior of ProcessAddress and ThreadAddress identity with ids, freeze/thaw roundtrips, and the fix for recycled TIDs in dynamic layout aggregation.
Activity
  • The pull request implements a fix for issue dynamic: render: ValueError "name not found for call" #2619, addressing PID/TID reuse collisions.
  • New regression tests (6 passed) were added to validate the changes.
  • Related dynamic, proto, and render tests (50 passed) were executed to ensure compatibility and correctness.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased) section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses PID/TID reuse by introducing optional lifecycle IDs to ProcessAddress and ThreadAddress. The changes are well-integrated across feature extraction, serialization (freeze), and rendering, ensuring backward compatibility. The addition of regression tests is also a great inclusion. I have a couple of suggestions to improve consistency in logging and output formatting.

Comment on lines +38 to 44
seq: dict[int, int] = {}
for tid in threads:
seq[tid] = seq.get(tid, 0) + 1
thread_id = seq[tid] - 1 if counts[tid] > 1 else None

address: ThreadAddress = ThreadAddress(process=ph.address, tid=tid, id=thread_id)
yield ThreadHandle(address=address, inner={})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This function handles TID reuse by assigning a lifecycle ID, which is great. However, unlike the corresponding get_processes function which warns about PID/PPID reuse, there's no warning here for TID reuse. For consistency and better diagnostics, consider adding a warning when a TID is reused within a process.

    seq: dict[int, int] = {}
    warned_tids: set[int] = set()
    for tid in threads:
        if counts[tid] > 1 and tid not in warned_tids:
            logger.warning("TID reuse detected for tid %d in process %s", tid, ph.address)
            warned_tids.add(tid)

        seq[tid] = seq.get(tid, 0) + 1
        thread_id = seq[tid] - 1 if counts[tid] > 1 else None

        address: ThreadAddress = ThreadAddress(process=ph.address, tid=tid, id=thread_id)
        yield ThreadHandle(address=address, inner={})

Comment on lines +68 to +111
if len(address.value) == 2:
ppid, pid = address.value
assert isinstance(ppid, int)
assert isinstance(pid, int)
return f"process{{pid:{pid}}}"
ppid, pid, process_id = address.value
assert isinstance(ppid, int)
assert isinstance(pid, int)
assert isinstance(process_id, int)
if process_id >= 0:
return f"process{{pid:{pid},id:{process_id}}}"
return f"process{{pid:{pid}}}"
elif address.type == frz.AddressType.THREAD:
assert isinstance(address.value, tuple)
ppid, pid, tid = address.value
if len(address.value) == 3:
ppid, pid, tid = address.value
assert isinstance(ppid, int)
assert isinstance(pid, int)
assert isinstance(tid, int)
return f"process{{pid:{pid},tid:{tid}}}"
ppid, pid, tid, process_id, thread_id = address.value
assert isinstance(ppid, int)
assert isinstance(pid, int)
assert isinstance(tid, int)
return f"process{{pid:{pid},tid:{tid}}}"
assert isinstance(process_id, int)
assert isinstance(thread_id, int)
s = f"process{{pid:{pid},tid:{tid}"
if process_id >= 0:
s += f",pid_id:{process_id}"
if thread_id >= 0:
s += f",id:{thread_id}"
return s + "}"
elif address.type == frz.AddressType.CALL:
assert isinstance(address.value, tuple)
ppid, pid, tid, id_ = address.value
return f"process{{pid:{pid},tid:{tid},call:{id_}}}"
if len(address.value) == 4:
ppid, pid, tid, id_ = address.value
return f"process{{pid:{pid},tid:{tid},call:{id_}}}"
ppid, pid, tid, id_, process_id, thread_id = address.value
s = f"process{{pid:{pid},tid:{tid}"
if process_id >= 0:
s += f",pid_id:{process_id}"
if thread_id >= 0:
s += f",id:{thread_id}"
return s + f",call:{id_}}}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The keys used for formatting process and thread lifecycle IDs are inconsistent and can be ambiguous.

  • For PROCESS addresses, the process lifecycle ID is keyed as id.
  • For THREAD and CALL addresses, the process lifecycle ID is keyed as pid_id, while the thread lifecycle ID is keyed as id.

This is inconsistent (id vs pid_id for the same concept) and ambiguous (id can mean process or thread lifecycle ID depending on context).

To improve clarity and consistency, I suggest using p_id for the process lifecycle ID and t_id for the thread lifecycle ID across all address types.

        if len(address.value) == 2:
            ppid, pid = address.value
            assert isinstance(ppid, int)
            assert isinstance(pid, int)
            return f"process{{pid:{pid}}}"
        ppid, pid, process_id = address.value
        assert isinstance(ppid, int)
        assert isinstance(pid, int)
        assert isinstance(process_id, int)
        if process_id >= 0:
            return f"process{{pid:{pid},p_id:{process_id}}}"
        return f"process{{pid:{pid}}}"
    elif address.type == frz.AddressType.THREAD:
        assert isinstance(address.value, tuple)
        if len(address.value) == 3:
            ppid, pid, tid = address.value
            assert isinstance(ppid, int)
            assert isinstance(pid, int)
            assert isinstance(tid, int)
            return f"process{{pid:{pid},tid:{tid}}}"
        ppid, pid, tid, process_id, thread_id = address.value
        assert isinstance(ppid, int)
        assert isinstance(pid, int)
        assert isinstance(tid, int)
        assert isinstance(process_id, int)
        assert isinstance(thread_id, int)
        s = f"process{{pid:{pid},tid:{tid}"
        if process_id >= 0:
            s += f",p_id:{process_id}"
        if thread_id >= 0:
            s += f",t_id:{thread_id}"
        return s + "}"
    elif address.type == frz.AddressType.CALL:
        assert isinstance(address.value, tuple)
        if len(address.value) == 4:
            ppid, pid, tid, id_ = address.value
            return f"process{{pid:{pid},tid:{tid},call:{id_}}}"
        ppid, pid, tid, id_, process_id, thread_id = address.value
        s = f"process{{pid:{pid},tid:{tid}"
        if process_id >= 0:
            s += f",p_id:{process_id}"
        if thread_id >= 0:
            s += f",t_id:{thread_id}"
        return s + f",call:{id_}}}"

@github-actions github-actions bot dismissed their stale review March 18, 2026 14:53

CHANGELOG updated or no update needed, thanks! 😄

@mike-hunhoff
Copy link
Copy Markdown
Collaborator

dup of #2882

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

dynamic: render: ValueError "name not found for call"

2 participants