Use OnceRetentionStrategy for single-use agents #213

Open
rhuddleston wants to merge 3 commits into jenkinsci:master from Quiq:once-retention-strategy

rhuddleston commented Apr 20, 2026

Use OnceRetentionStrategy for single-use Nomad agents

The current hardcoded CloudRetentionStrategy only checks a worker's total idle time. When idleTerminationInMinutes is set to 0, this creates a race condition that frequently terminates booting containers before they can connect. Conversely, setting it to a value greater than 0 leaves zombie containers idling after their build has finished, until the idle timeout expires.
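
To make the two failure modes concrete, here is a minimal, dependency-free sketch of an idle-time-only check. It is a simplified model and not the Jenkins API: the class `IdleOnlyRetentionModel` and its method are hypothetical names, and the comparison is assumed to be a plain "idle longer than the timeout" test.

```java
/** Simplified, hypothetical model of an idle-time-only retention check (not the Jenkins API). */
final class IdleOnlyRetentionModel {
    private final long idleTimeoutMillis;

    IdleOnlyRetentionModel(int idleTerminationInMinutes) {
        this.idleTimeoutMillis = idleTerminationInMinutes * 60_000L;
    }

    /**
     * Decides whether a worker should be terminated.
     * The check cannot tell "still booting" apart from "build finished":
     * both simply look like an idle worker.
     */
    boolean shouldTerminate(boolean isIdle, long idleMillis) {
        return isIdle && idleMillis > idleTimeoutMillis;
    }

    public static void main(String[] args) {
        // Timeout 0: a worker idle for 1 ms while booting is already eligible for termination.
        System.out.println(new IdleOnlyRetentionModel(0).shouldTerminate(true, 1L));
        // Timeout 5: a worker idle for 1 minute after its build finished is kept running.
        System.out.println(new IdleOnlyRetentionModel(5).shouldTerminate(true, 60_000L));
    }
}
```

In this model, no single timeout value fixes both cases, because the only input is how long the worker has been idle.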

This PR introduces a custom NomadOnceRetentionStrategy for single-use (non-reusable) Nomad workers. This strategy solves both problems: it grants a generous timeout during initial boot (respecting idleTerminationInMinutes) and terminates the worker as soon as its build completes.
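
The behaviour described above can be sketched roughly as follows. This is a dependency-free model under stated assumptions, not the code in this PR: `OnceRetentionModel`, `onBuildCompleted`, and `check` are hypothetical names standing in for whatever hooks the real strategy uses.

```java
/** Hypothetical, dependency-free model of a once-retention strategy (not the PR's actual code). */
final class OnceRetentionModel {
    private final long bootTimeoutMillis;
    private boolean buildCompleted; // set once the worker's single build has finished
    private boolean terminated;     // guards against terminating the worker twice

    OnceRetentionModel(int idleTerminationInMinutes) {
        this.bootTimeoutMillis = idleTerminationInMinutes * 60_000L;
    }

    /** Called when the worker's build finishes: terminate without waiting for any idle timeout. */
    void onBuildCompleted() {
        buildCompleted = true;
        terminate();
    }

    /** Periodic check: only reaps a worker that never finished a build within the boot timeout. */
    void check(long idleMillis) {
        if (!buildCompleted && idleMillis > bootTimeoutMillis) {
            terminate();
        }
    }

    private void terminate() {
        if (terminated) {
            return; // already terminated, nothing to do
        }
        terminated = true;
    }

    boolean isTerminated() {
        return terminated;
    }

    public static void main(String[] args) {
        OnceRetentionModel worker = new OnceRetentionModel(5);
        worker.check(60_000L);      // still inside the 5-minute boot window
        System.out.println(worker.isTerminated());
        worker.onBuildCompleted();  // build done: terminate right away
        System.out.println(worker.isTerminated());
    }
}
```

The point of the design is that termination is driven by two separate signals: the idle timeout only protects against workers that never pick up a build, while build completion triggers cleanup directly.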

Rather than importing OnceRetentionStrategy from the durable-task plugin, this PR implements a custom version. The durable-task version explicitly forbids nodes from implementing EphemeralNode. Because existing Jenkins environments have saved their NomadCloud workers as EphemeralNode objects, removing the interface causes Jenkins to crash on boot due to XStream deserialization failures. A custom strategy lets us fix the zombie container issue safely while maintaining backward compatibility with existing saved Jenkins configurations.

Testing done

  • Automated Tests: Added NomadWorkerTest.java to verify the correct retention strategy is applied based on the "Reusable" flag. Verified that the existing test suite continues to pass (mvn test) without compilation errors or regressions.
  • Manual Testing:
    1. Built the .hpi artifact locally and installed it on a live Jenkins server.
    2. Configured a Nomad cloud with a Worker Template where "Reusable" is unchecked and "Idle termination time" is set to 5.
    3. Triggered a job. Verified that the container booted successfully and was terminated immediately upon job completion, without leaving a zombie slot behind.
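
As an illustration of what the automated test checks, the selection rule can be modelled as below. This is a sketch only: `RetentionSelectionModel` and its `Strategy` enum are hypothetical, and it assumes that reusable workers keep the existing idle-time-based strategy, which the description implies but does not state outright.

```java
/** Hypothetical model of choosing a retention strategy from the "Reusable" flag. */
final class RetentionSelectionModel {
    /** Stand-ins for the two strategies; the real classes are not reproduced here. */
    enum Strategy { IDLE_ONLY, ONCE }

    /** Non-reusable (single-use) workers get the once strategy; reusable ones are assumed to keep the idle-only one. */
    static Strategy strategyFor(boolean reusable) {
        return reusable ? Strategy.IDLE_ONLY : Strategy.ONCE;
    }

    public static void main(String[] args) {
        System.out.println(strategyFor(false)); // single-use worker
        System.out.println(strategyFor(true));  // reusable worker
    }
}
```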

Submitter checklist

  • [x] Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • [x] Ensure that the pull request title represents the desired changelog entry
  • [x] Please describe what you did
  • [ ] Link to relevant issues in GitHub or Jira
  • [ ] Link to relevant pull requests, esp. upstream and downstream changes
  • [x] Ensure you have provided tests that demonstrate the feature works or the issue is fixed

Fixes zombie containers and premature termination during boot by routing non-reusable Nomad workers to a custom NomadOnceRetentionStrategy.
Maintains EphemeralNode implementation to prevent XStream deserialization crashes on existing configurations.
rhuddleston force-pushed the once-retention-strategy branch from 5c70672 to 5d51b8e on April 20, 2026 05:49
@roman-vynar

Perfect, thanks!

@rhuddleston
Author

@multani is this something you can approve?

@roman-vynar

With this fix, our Nomad workers are very fast, with no unnecessary delays due to a race condition.
