Use OnceRetentionStrategy for single-use agents #213
Open
rhuddleston wants to merge 3 commits into
Fixes zombie containers and premature termination during boot by routing non-reusable Nomad workers to a custom NomadOnceRetentionStrategy. Maintains EphemeralNode implementation to prevent XStream deserialization crashes on existing configurations.
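To illustrate the two failure modes, here is a minimal, self-contained sketch of the decision logic. It is not the code from this PR and does not use the Jenkins API: the class `RetentionSketch`, the `Decision` enum, and both method names are hypothetical, and the idle-only rule is only assumed to approximate what the summary describes.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical model of the two retention policies; not the plugin's real classes.
final class RetentionSketch {
    enum Decision { KEEP, TERMINATE }

    // Idle-only policy (assumed shape): looks at idle time and nothing else.
    // A worker that is still booting counts as idle, so a limit of 0 kills it,
    // while a limit above 0 keeps a finished worker alive until the limit passes.
    static Decision idleOnly(long idleMillis, int idleTerminationInMinutes) {
        long limit = TimeUnit.MINUTES.toMillis(idleTerminationInMinutes);
        return idleMillis > limit ? Decision.TERMINATE : Decision.KEEP;
    }

    // "Once" policy (assumed shape): a completed build ends the worker immediately;
    // the idle limit only acts as a boot timeout before any build has run.
    static Decision once(boolean buildCompleted, long idleMillis, int idleTerminationInMinutes) {
        if (buildCompleted) {
            return Decision.TERMINATE;
        }
        long limit = TimeUnit.MINUTES.toMillis(idleTerminationInMinutes);
        return idleMillis > limit ? Decision.TERMINATE : Decision.KEEP;
    }

    public static void main(String[] args) {
        // Booting worker, idle for 5 seconds, limit 0: the idle-only policy terminates it.
        System.out.println(idleOnly(5_000L, 0));
        // Finished worker, idle for 1 minute, limit 10: the idle-only policy keeps it.
        System.out.println(idleOnly(60_000L, 10));
        // Same finished worker under the once policy: terminated right away.
        System.out.println(once(true, 60_000L, 10));
        // Booting worker under the once policy with a 10-minute limit: kept.
        System.out.println(once(false, 5_000L, 10));
    }
}
```

In this model the once policy separates "has not run a build yet" from "has finished its build", which is the distinction an idle-time check alone cannot make.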
5c70672 to 5d51b8e
Perfect, thanks!
Author
@multani is this something you can approve?
With this fix, our Nomad workers start quickly, with no unnecessary delays caused by the race condition.
Use OnceRetentionStrategy for single-use Nomad agents
The current hardcoded CloudRetentionStrategy only checks a worker's total idle time. When idleTerminationInMinutes is set to 0, this creates a race condition that frequently terminates booting containers before they can connect. Conversely, setting it to a value greater than 0 leaves zombie containers idling after their build has finished.

This PR introduces a custom NomadOnceRetentionStrategy for single-use (non-reusable) Nomad workers. This strategy addresses both problems: it grants a generous timeout during initial boot (respecting idleTerminationInMinutes) and terminates the worker as soon as its build completes.

Rather than importing OnceRetentionStrategy from the durable-task plugin, this PR implements a custom version. The durable-task version explicitly forbids nodes from implementing EphemeralNode. Because existing Jenkins environments have saved their NomadCloud workers as EphemeralNode objects, removing the interface causes Jenkins to crash on boot due to XStream deserialization failures. A custom strategy lets us fix the zombie container issue while maintaining backward compatibility with existing saved Jenkins configurations.

Testing done
NomadWorkerTest.java to verify the correct retention strategy is applied based on the "Reusable" flag. Verified that the existing test suite continues to pass (mvn test) without compilation errors or regressions.
.hpi artifact locally and installed it on a live Jenkins server.

Submitter checklist