perf: ClamAV parallel prefetch, shared pebble pool, PE checksum skip, YARA init guard by wmetcalf · Pull Request #3018 · kevoreilly/CAPEv2

wmetcalf · 2026-05-12T16:16:26Z

Summary

Four independent processing-pipeline performance improvements observed under profiling on heavy tasks:

ClamAV parallel prefetch (clamav.py, CAPE.py): Fan out ALLMATCHSCAN calls via ThreadPoolExecutor before the sequential per-file loop. ClamAV was measured at ~58% of CAPE module wall-clock on tasks with 10–20 dropped/extracted files; parallel prefetch cuts that from sum(per-file scan times) to slowest single scan. Adds a per-task LRU cache to avoid rescanning the same binary twice within one analysis.
Shared pebble pool (file_extra_info.py): Replace the with pebble.ProcessPool(...) as pool: pattern that was creating and tearing down a worker pool for every file with a module-level singleton (get_extractor_pool()). Pool startup was ~85 s on a 70-file task.
PE checksum skip (parse_pe.py): get_actual_checksum() returns None when the PE's reported checksum is 0, skipping the O(file-size) Python loop recompute. No loss of signal — a zero reported checksum means no comparison can be made anyway.
pkcs7 extraction fix (parse_pe.py): backend.load_der_pkcs7_certificates() was removed from the Backend class in cryptography ≥ 40.x. The call silently failed inside suppress(Exception), leaving digital_signers empty for every signed PE. Switched to pkcs7.load_der_pkcs7_certificates(). Also adds a fallback in file_extra_info.py to re-run cert extraction on reprocess if digital_signers was previously populated as empty.
YARA init guard (objects.py, process.py): File.init_yara() now short-circuits on repeated calls; init_worker() loads YARA rules eagerly at worker startup so the first task in each worker doesn't pay the load cost.

Test plan

Submit a task with 10+ dropped files and confirm CAPE module time drops
Verify digital_signers is non-empty for signed PEs after the pkcs7 fix
Confirm process.py -r <id> reprocess populates digital_signers for previously-empty entries
Check that actual_checksum is absent (not 0x00000000) for PEs with checksum field = 0

🤖 Generated with Claude Code

… YARA init guard - clamav.py: Add prefetch_clamav() using ThreadPoolExecutor to fan out ALLMATCHSCAN calls in parallel before the sequential per-file loop. Reduces ClamAV wall-clock from sum(scans) to slowest single scan. Adds module-level _CLAMAV_CACHE + _CACHE_LOCK for per-task caching. - CAPE.py: Call prefetch_clamav() with all target + dropped + procdump paths before the per-file processing loop; clear cache on exit. - file_extra_info.py: Share a module-level pebble.ProcessPool singleton (get_extractor_pool()) instead of creating/destroying one per file. Also re-run PortableExecutable() when pe.digital_signers is empty so previously-analysed files get the cert data filled in on reprocess. - parse_pe.py: get_actual_checksum() returns None when the PE reports checksum=0, skipping the expensive O(file_size) recompute. Fix backend.load_der_pkcs7_certificates → pkcs7.load_der_pkcs7_certificates for cryptography ≥ 40.x (old method removed from Backend). - objects.py: Add idempotency guard to File.init_yara() so repeated calls in multi-worker environments don't reload rules unnecessarily. - process.py: Eagerly call File.init_yara() in init_worker() so YARA rules are loaded once at worker startup rather than on first task.

gemini-code-assist

Code Review

This pull request implements several performance optimizations for CAPE processing, most notably parallelizing ClamAV scans via a new caching mechanism and reusing a process-wide worker pool for file extraction to minimize subprocess overhead. It also reduces processing time by skipping PE checksum recomputations when the reported checksum is zero and ensures YARA rules are eagerly compiled in worker processes. Review feedback identifies a high-severity risk of deadlocks in the worker logging initialization and suggests a more efficient check for digital signers to avoid redundant parsing of unsigned files.

gemini-code-assist · 2026-05-12T16:18:19Z

+    # Fix for open file handles on rotated logs in workers
+    for h in log.handlers[:]:
+        if isinstance(h, logging.FileHandler):
+            h.close()
+        log.removeHandler(h)


Re-introducing calls to h.close() and log.removeHandler(h) in the worker initializer risks a deadlock. If the parent process was holding the logging lock (e.g., another thread was logging) during the fork(), the lock will be permanently held in the child process. Attempting to acquire it via these methods will cause the worker to hang.

Furthermore, the subsequent calls to log.addHandler() (lines 207, 225, 234) also attempt to acquire this lock. The previous implementation used direct list operations like log.handlers.clear() and log.handlers.append() specifically to bypass the lock and avoid deadlocks in forked workers. Given that workers are short-lived (maxtasksperchild), the file descriptor leak is a safer trade-off than the risk of deadlocked workers.

Suggested change

# Fix for open file handles on rotated logs in workers

for h in log.handlers[:]:

if isinstance(h, logging.FileHandler):

h.close()

log.removeHandler(h)

# Avoid fork deadlock: use direct list ops instead of

# handler.close()/removeHandler()/addHandler() which acquire locks.

# Inherited FDs are intentionally leaked: closing them via os.close()

# frees the fd number, but the old Python stream still references it;

# when GC finalizes that stream it may close a new handler's fd.

# Workers are short-lived (max_tasks) so the leak is harmless.

log.handlers.clear()

gemini-code-assist · 2026-05-12T16:18:20Z

+        elif not data_dictionary["pe"].get("digital_signers"):
+            with PortableExecutable(file_path) as pe:
+                data_dictionary["pe"]["digital_signers"] = pe.get_digital_signers(pe.pe)
+                data_dictionary["pe"]["guest_signers"] = pe.get_guest_digital_signers(task_id)


This check will trigger for every unsigned PE file during reprocessing, as digital_signers will be an empty list. This causes the file to be re-opened and re-parsed, which is a significant performance hit for tasks with many unsigned payloads.

Since the cryptography bug is now fixed for new analyses, consider if this 'fallback' for old records is worth the performance cost on all unsigned files. If it must be kept, it would be more efficient to check if the PE has a security directory entry (e.g., via pe.OPTIONAL_HEADER.DATA_DIRECTORY[pefile.DIRECTORY_ENTRY['IMAGE_DIRECTORY_ENTRY_SECURITY']].Size > 0) before performing a full re-extraction.

Copilot

Pull request overview

This PR introduces several performance-oriented optimizations across CAPE’s processing pipeline (ClamAV scanning, extractor execution, PE parsing, and YARA initialization) to reduce per-task wall-clock time under heavy file workloads.

Changes:

Add parallel ClamAV prefetch + a per-process scan result cache to avoid repeated socket scans.
Reuse a shared pebble.ProcessPool for file extractors instead of creating a new pool per file.
Skip expensive PE checksum recomputation when the reported checksum is zero, and update PKCS7 certificate extraction to a supported cryptography API.
Make YARA initialization idempotent and eagerly compile rules at worker startup.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
`utils/process.py`	Adjust worker logger handler reset and eagerly initializes YARA rules in each worker.
`modules/processing/CAPE.py`	Adds ClamAV parallel prefetch orchestration before per-file processing.
`lib/cuckoo/common/objects.py`	Adds an idempotency guard (and `force`) to YARA initialization.
`lib/cuckoo/common/integrations/parse_pe.py`	Skips checksum recompute for zero reported checksum; fixes PKCS7 signer extraction call.
`lib/cuckoo/common/integrations/file_extra_info.py`	Reuses a shared pebble pool for extractors; adds signer re-extraction fallback.
`lib/cuckoo/common/integrations/clamav.py`	Implements threaded batch scanning and a module-level cache for ClamAV results.

+    # Fix for open file handles on rotated logs in workers
+    for h in log.handlers[:]:
+        if isinstance(h, logging.FileHandler):
+            h.close()
+        log.removeHandler(h)

+    # Restore Console Handler
    ch = ConsoleHandler()
    ch.setFormatter(FORMATTER)
-    log.handlers.append(ch)
+    log.addHandler(ch)


+        passed (used after a yara-rule update). Without this guard, repeated
+        callers (e.g. some integration paths that didn't go through the
+        get_yara() wrapper) re-compiled all six categories on every call."""
+
+        if cls.yara_initialized and not force:
+            return


    def get_reported_checksum(self, pe: pefile.PE) -> str:
        """Get checksum from optional header
        @return: checksum or None.
        """
        return f"0x{pe.OPTIONAL_HEADER.CheckSum:08x}" if pe else None

    def get_actual_checksum(self, pe: pefile.PE) -> str:
        """Get calculated checksum of PE
-        @return: checksum or None.
+        @return: checksum string, or None if unavailable / not computed.


+        The recomputed value is only consumed by the
+        `static_pe_anomaly` signature, which only compares it against
+        the embedded `reported_checksum` when that field is non-zero
+        (`if reported and reported != actual` — see
+        modules/signatures/all/static_pe_anomaly.py). When the PE has
+        no embedded checksum (compilers commonly omit it; almost every
+        dropper/packer leaves it 0), the recompute result would never
+        be consulted — pure throwaway work.
+
+        Skip the expensive recompute in that case and return None so
+        the field is honestly absent rather than fabricated. The
+        signature already gracefully handles missing keys via the
+        `if reported and ...` guard, and downstream consumers can
+        distinguish "not computed" from "computed and zero". Saves
+        ~40s on a 19-payload analysis with reported=0 binaries.


+# Module-level cache for per-task `prefetch_clamav` results. Keyed by
+# absolute file path. Each value is the same list of match-strings that
+# `get_clamav` would otherwise compute. Populated by `prefetch_clamav`
+# and consumed transparently by `get_clamav`. Cleared at task boundary
+# via `clear_clamav_cache` to avoid leaking results across analyses on
+# a long-lived worker process.
+_CACHE_LOCK = threading.Lock()
+_CLAMAV_CACHE = {}
+
+


        if "pe" not in data_dictionary:
            with PortableExecutable(file_path) as pe:
                data_dictionary["pe"] = pe.run(task_id)
+        elif not data_dictionary["pe"].get("digital_signers"):
+            with PortableExecutable(file_path) as pe:
+                data_dictionary["pe"]["digital_signers"] = pe.get_digital_signers(pe.pe)
+                data_dictionary["pe"]["guest_signers"] = pe.get_guest_digital_signers(task_id)


…traction - process.py: Replace h.close()/removeHandler()/addHandler() calls with direct log.handlers[:] = [] and log.handlers.append() to avoid acquiring logging locks in child processes after fork() (deadlock risk if parent held the lock at fork time). - file_extra_info.py: Only re-run PortableExecutable cert extraction when DigiSig.json confirms the file is signed (aux_signers non-empty). Without this guard, every unsigned PE would trigger a redundant re-parse on reprocess.

Setting it inside the HAVE_YARA try block marked init complete after the first successful category (binaries), so any concurrent or subsequent init_yara() call would short-circuit before CAPE was reached — KeyError: CAPE in get_yara. Also fixes YARA_X path which never set yara_initialized at all.

… covers both

wmetcalf · 2026-05-15T13:32:09Z

Fixed test_get_yaras failure (3b7ca00)

Root cause: File.yara_initialized = True was set inside the while True / HAVE_YARA try block — after the first successful category compile (binaries). Any subsequent init_yara() call would hit the guard and return before reaching CAPE, leaving File.yara_rules["CAPE"] unset → KeyError: 'CAPE' in get_yara().

Kevin's fix commit added the flag to the YARA_X path but also inside the per-category loop, same bug.

Fix: single File.yara_initialized = True moved to after the outer for category in categories: loop, covering both YARA and YARA_X paths. Both in-loop occurrences removed. Pushed in 3b7ca00.

wmetcalf · 2026-05-15T18:02:03Z

Hi @kevoreilly — the test_get_yaras failure is fixed in the latest push (3b7ca00). Root cause was File.yara_initialized = True being set inside the per-category loop after the first successful compile, so any subsequent init_yara() call short-circuited before reaching CAPE. Moved to after the outer loop so all six categories are guaranteed loaded before the guard can fire. Also cleaned up your YARA_X fix commit (68ccb06) which had the same in-loop placement — the single after-loop assignment covers both paths.

Should be ready to merge now if you're happy with the rest of the changes.

wmetcalf · 2026-05-16T03:08:02Z

@kevoreilly to clarify what this PR does and why the diff looks the way it does:

The original bug (still in your current master): File.yara_initialized = True is set inside the while True / HAVE_YARA try block — it fires after the first category (binaries) compiles successfully, before CAPE, macro, monitor etc. are processed. Any second call to init_yara() (e.g. from a test) hits the guard and returns early with CAPE missing from yara_rules → KeyError: 'CAPE'.

Your commit 68ccb06 added File.yara_initialized = True to the YARA_X path — but also inside the per-category loop, same problem.

This PR's fix: moves the single File.yara_initialized = True to after the outer for category in categories: loop completes, so it only fires once all six categories are loaded. That one line covers both HAVE_YARA and HAVE_YARA_X paths. Your in-loop YARA_X addition becomes redundant and is removed — that's the -1 you're seeing.

Net result vs your master:

+ guard at top of init_yara()
- File.yara_initialized = True from inside HAVE_YARA try block (the original bug)
+ File.yara_initialized = True after the outer for loop (the fix)
- File.yara_initialized = True from inside HAVE_YARA_X block (your partial fix, now redundant)

If tests are still failing on your side let me know what the output is and I'll take a look.

kevoreilly · 2026-05-16T07:48:55Z

The tests are still failing, link from above to failed tests is https://github.com/kevoreilly/CAPEv2/actions/runs/25920160083/job/76186792607?pr=3018

Copilot AI review requested due to automatic review settings May 12, 2026 16:16

Copilot started reviewing on behalf of wmetcalf May 12, 2026 16:17 View session

gemini-code-assist Bot reviewed May 12, 2026

View reviewed changes

wmetcalf mentioned this pull request May 12, 2026

feat: PE cert signatures, headless browser stealer, COM office martian, ransomware_message fix CAPESandbox/community#571

Merged

Copilot AI reviewed May 12, 2026

View reviewed changes

wmetcalf and others added 4 commits May 12, 2026 11:38

Set yara_initialized in HAVE_YARA_X path

68ccb06

fix: remove in-loop yara_initialized set from YARA_X path; after-loop…

3b7ca00

… covers both

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: ClamAV parallel prefetch, shared pebble pool, PE checksum skip, YARA init guard#3018

perf: ClamAV parallel prefetch, shared pebble pool, PE checksum skip, YARA init guard#3018
wmetcalf wants to merge 5 commits into
kevoreilly:masterfrom
wmetcalf:perf/clamav-yara-pe-improvements

wmetcalf commented May 12, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 12, 2026

Uh oh!

gemini-code-assist Bot May 12, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

wmetcalf commented May 15, 2026

Uh oh!

wmetcalf commented May 15, 2026

Uh oh!

wmetcalf commented May 16, 2026

Uh oh!

kevoreilly commented May 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

-    # Fix for open file handles on rotated logs in workers
-    for h in log.handlers[:]:
-        if isinstance(h, logging.FileHandler):
-            h.close()
-        log.removeHandler(h)
+    # Avoid fork deadlock: use direct list ops instead of
+    # handler.close()/removeHandler()/addHandler() which acquire locks.
+    # Inherited FDs are intentionally leaked: closing them via os.close()
+    # frees the fd number, but the old Python stream still references it;
+    # when GC finalizes that stream it may close a new handler's fd.
+    # Workers are short-lived (max_tasks) so the leak is harmless.
+    log.handlers.clear()

Conversation

wmetcalf commented May 12, 2026

Summary

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

wmetcalf commented May 15, 2026

Uh oh!

wmetcalf commented May 15, 2026

Uh oh!

wmetcalf commented May 16, 2026

Uh oh!

kevoreilly commented May 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants