ingester: cache list of files, speedup on large queue #1850

Merged
nuclearcat merged 1 commit into kernelci:main from nuclearcat:ingester-speedup
Apr 13, 2026

@nuclearcat
Member

Before: every 5-second cycle calls os.scandir() on the 2M-entry directory. Each scan enumerates every entry via readdir(), which is extremely slow on a flat directory that large and can take 20-30 seconds.

After:

  1. scandir() runs once and caches all .json entries.
  2. Each cycle pops a chunk of up to INGEST_CYCLE_BATCH_SIZE (default 50,000) files from the cache and processes them.
  3. A re-scan happens only when the cache is fully drained.
  4. On a scandir error, the cache is cleared, which forces a re-scan on the next cycle.
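The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code from monitor_submissions.py: the `SpoolCache` class, its method names, and the `BATCH_SIZE` stand-in are hypothetical and exist only for this example.

```python
import os
from collections import deque

BATCH_SIZE = 50_000  # stand-in for INGEST_CYCLE_BATCH_SIZE (assumption)


class SpoolCache:
    """Hypothetical helper: caches .json paths and hands them out in chunks."""

    def __init__(self, spool_dir, batch_size=BATCH_SIZE):
        self.spool_dir = spool_dir
        self.batch_size = batch_size
        self._pending = deque()

    def _rescan(self):
        # Step 1: enumerate the directory once and keep only .json files.
        try:
            with os.scandir(self.spool_dir) as entries:
                self._pending = deque(
                    e.path
                    for e in entries
                    if e.is_file() and e.name.endswith(".json")
                )
        except OSError:
            # Step 4: on a scan error, leave the cache empty so the
            # next cycle scans again.
            self._pending.clear()

    def next_batch(self):
        # Step 3: re-scan only when the cache is fully drained.
        if not self._pending:
            self._rescan()
        # Step 2: pop up to batch_size paths for this cycle.
        count = min(self.batch_size, len(self._pending))
        return [self._pending.popleft() for _ in range(count)]
```

A monitoring loop would call `next_batch()` once per cycle and hand the returned paths to the ingestion step. Note that files arriving after a scan are not seen until the cache drains and the next scan runs.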

With 2M files: one scan instead of ~40 scans (2M files / 50K per chunk = 40 cycles served from a single scan). If each scandir of 2M entries takes ~30 seconds, that saves ~20 minutes of pure directory enumeration overhead.
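The estimate can be checked with a few lines of arithmetic. The function name `scan_savings` is hypothetical, and the 30-second figure is the upper end of the range quoted above, not a measurement.

```python
import math


def scan_savings(total_files, batch_size, scan_seconds):
    """Return (cycles, minutes saved) when one scan serves every cycle."""
    cycles = math.ceil(total_files / batch_size)
    # The first cycle still has to scan, so only the remaining ones save time.
    scans_avoided = cycles - 1
    return cycles, scans_avoided * scan_seconds / 60


cycles, saved_minutes = scan_savings(2_000_000, 50_000, 30)
print(cycles, saved_minutes)
```

With these inputs the result is 40 cycles and 19.5 minutes saved, which matches the "~20 minutes" figure.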

INGEST_CYCLE_BATCH_SIZE (default 50,000) is configurable via an environment variable if you want to tune the chunk size. This change also fixes a latent bug in the old code: json_files retained stale data from the previous iteration if scandir raised an exception, because the except: pass branch did not reset it.
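The stale-data bug is easy to reproduce in isolation. The two functions below are a hypothetical reconstruction of the pattern described, not the actual old or new code; `list_dir` stands in for whatever performs the scandir call.

```python
def scan_old(list_dir, json_files):
    """Pre-fix shape: a failed scan silently reuses the previous list."""
    try:
        json_files = list_dir()
    except OSError:
        pass  # json_files still holds the previous cycle's entries
    return json_files


def scan_fixed(list_dir, json_files):
    """Post-fix shape: a failed scan yields an empty list."""
    try:
        return list_dir()
    except OSError:
        return []


def failing_scan():
    raise OSError("directory temporarily unavailable")


previous = ["a.json", "b.json"]
print(scan_old(failing_scan, previous))    # stale entries leak through
print(scan_fixed(failing_scan, previous))  # nothing to process this cycle
```

With the old shape, a transient scan failure would cause already-processed paths to be handed to the ingester again.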

Copilot AI left a comment

Pull request overview

Optimizes the ingester’s spool monitoring loop to avoid repeatedly scanning extremely large flat directories, improving throughput when the queue contains millions of files.

Changes:

  • Add a per-process cache of discovered .json spool files and process them in chunks per cycle.
  • Introduce INGEST_CYCLE_BATCH_SIZE (env-configurable) to control how many cached files are processed each loop iteration.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| backend/kernelCI_app/management/commands/monitor_submissions.py | Cache scandir results and process cached entries in batches to reduce repeated directory enumeration. |
| backend/kernelCI_app/constants/ingester.py | Add INGEST_CYCLE_BATCH_SIZE constant sourced from environment with basic parsing fallback. |
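An environment-sourced constant with a parsing fallback might look like the sketch below. The helper name `read_batch_size` is hypothetical, and rejecting non-positive values is an assumption of this example rather than something the PR states.

```python
import os

DEFAULT_BATCH_SIZE = 50_000


def read_batch_size(environ=os.environ):
    """Parse INGEST_CYCLE_BATCH_SIZE, falling back to the default."""
    raw = environ.get("INGEST_CYCLE_BATCH_SIZE", "")
    try:
        value = int(raw)
    except ValueError:
        # Unset, empty, or non-numeric: use the default.
        return DEFAULT_BATCH_SIZE
    # Assumption: zero or negative sizes are treated as invalid.
    return value if value > 0 else DEFAULT_BATCH_SIZE


INGEST_CYCLE_BATCH_SIZE = read_batch_size()
```

Passing the environment as a parameter keeps the parser testable without mutating `os.environ`.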

@nuclearcat nuclearcat force-pushed the ingester-speedup branch 4 times, most recently from 936f5b6 to 1adc362 Compare April 13, 2026 11:12
@nuclearcat nuclearcat requested a review from Copilot April 13, 2026 11:16
Copilot AI left a comment

Pull request overview

Optimizes the submissions ingester loop to avoid repeatedly scanning extremely large spool directories by caching scan results and processing files in configurable-sized chunks per cycle.

Changes:

  • Add INGEST_CYCLE_BATCH_SIZE to control how many cached files are processed per monitoring cycle.
  • Refactor monitor_submissions to scan the spool directory only when the cache is depleted and process cached paths in batches.
  • Switch ingestion/performance-test plumbing from os.DirEntry objects to plain path strings.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| backend/kernelCI_app/tests/performanceTests/test_ingest_perf.py | Updates perf tests to pass file path strings instead of DirEntry objects. |
| backend/kernelCI_app/management/commands/monitor_submissions.py | Adds cached scanning + per-cycle batch processing; refactors Prometheus setup and scan handling. |
| backend/kernelCI_app/management/commands/helpers/kcidbng_ingester.py | Updates ingestion API to accept list[str] and builds metadata using basename/getsize. |
| backend/kernelCI_app/constants/ingester.py | Introduces configurable INGEST_CYCLE_BATCH_SIZE env var with default of 50,000. |
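Once the cache holds plain path strings instead of os.DirEntry objects, the name and size have to come from os.path rather than from the entry itself. A rough sketch of that step follows; the function name `build_file_metadata`, the dictionary keys, and the skip-on-missing behaviour are all assumptions of this example, not the actual kcidbng_ingester.py code.

```python
import os


def build_file_metadata(paths: list[str]) -> list[dict]:
    """Build per-file metadata from path strings (hypothetical shape)."""
    metadata = []
    for path in paths:
        try:
            size = os.path.getsize(path)
        except OSError:
            # Assumption: a cached path may have vanished since the scan.
            continue
        metadata.append(
            {"path": path, "name": os.path.basename(path), "size": size}
        )
    return metadata
```

Caching strings rather than DirEntry objects means each size lookup costs a stat call at processing time, but it keeps the cache cheap to hold in memory across many cycles.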

@MarceloRobert MarceloRobert added enhancement New feature or request Ingester The issue relates to the ingester tool, including the command itself and related functions. labels Apr 13, 2026
@nuclearcat nuclearcat requested a review from Copilot April 13, 2026 11:34
Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.


Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.


@nuclearcat nuclearcat added this pull request to the merge queue Apr 13, 2026
Merged via the queue into kernelci:main with commit c3ffaf1 Apr 13, 2026
11 checks passed
