Add self-healing metal machine pool with instance type fallback #9804
Closed
bbethell-1 wants to merge 3 commits into
…allback

Implements automatic retry logic for ROSA HCP metal machine pools that fail due to AWS capacity constraints. When a metal instance type is unavailable, the system automatically tries fallback options.

Features:
- User preference honored first (set via rosa_metal_instance_type)
- Automatic fallback to alternative instance types (i4i.metal, i3en.metal, i3.metal)
- Detects stuck machine pools (0 replicas after timeout)
- Auto-cleanup and retry with next instance type
- Configurable via AgnosticV ordering system

New variables (in default_vars.yml):
- rosa_metal_deploy: Enable/disable metal pool creation (default: false)
- rosa_metal_instance_type: Preferred instance type (default: i4i.metal)
- rosa_metal_replicas: Number of metal nodes (default: 3)
- rosa_metal_disk_size: Disk size for metal nodes (default: 250GiB)
- rosa_metal_pool_name: Machine pool name (default: metal)
- rosa_metal_availability_zone: AZ override (default: cluster default)

Usage:
Set rosa_metal_deploy=true and rosa_metal_instance_type=<preferred> when ordering via AgnosticV. The system will automatically handle capacity issues.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updates:

1. Enhanced ocp4_workload_rosa_machinepool workload:
   - Added stuck detection (0 nodes after 5 min timeout)
   - Added timestamp tracking to detect AWS capacity issues
   - Updated default instance types to prefer i4i/i3en (better availability)
   - Now fails fast when stuck, triggering fallback to next instance type

2. Default instance type order updated:
   - i4i.metal (newest, best availability)
   - i3en.metal (common, good availability)
   - i3.metal (fallback)
   - m5zn.metal, m5n.metal, m5.metal (original defaults)

3. Added documentation:
   - Clarified rosa_metal_deploy is opt-in for install-time creation
   - Noted workload handles metal pools with similar fallback logic
   - Prevents conflicts between install and workload approaches

How it works:
- Workload loops through instance types trying each one
- For each type: create pool → wait 5 min → check if stuck
- If 0 nodes after 5 min: fail fast with clear message
- Rescue block deletes stuck pool and tries next instance type
- Continues until successful or all types exhausted

Benefits:
- Faster failure detection (5 min vs 25 min of retries)
- Clear messaging about capacity issues
- Automatic fallback to available instance types
- Works seamlessly with existing workload logic

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
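For reviewers who want the "How it works" flow in one place, here is a minimal Python sketch of the loop, not the actual Ansible tasks. The function name, the PoolStuckError exception, and the three callables (create_pool, wait_for_nodes, delete_pool) are hypothetical stand-ins for whatever the workload really invokes. The try/except only mirrors the block/rescue structure described in the commit message.

```python
from typing import Callable, Iterable, Optional


class PoolStuckError(Exception):
    """Hypothetical error: the pool still has 0 nodes after the wait window."""


def provision_with_fallback(
    instance_types: Iterable[str],
    create_pool: Callable[[str], None],
    wait_for_nodes: Callable[[str], int],
    delete_pool: Callable[[str], None],
) -> Optional[str]:
    """Try each instance type in order and return the first one that yields nodes.

    Returns None when every type has been exhausted.
    """
    for instance_type in instance_types:
        create_pool(instance_type)
        try:
            # wait_for_nodes is assumed to block for the wait window and
            # return the number of nodes that joined the pool.
            if wait_for_nodes(instance_type) == 0:
                raise PoolStuckError(
                    f"{instance_type}: 0 nodes after wait, likely an AWS capacity issue"
                )
            return instance_type
        except PoolStuckError as err:
            # Mirrors the rescue block: report, delete the stuck pool, move on.
            print(f"Falling back to next instance type: {err}")
            delete_pool(instance_type)
    return None
```

Passing the cloud operations in as callables keeps the control flow testable without touching AWS. With simulated capacity where i4i.metal returns 0 nodes and i3en.metal returns 3, the function deletes the i4i.metal pool and returns i3en.metal.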
Merged into #9803 for a unified PR covering both stuck detection and metal instance fallback.
Problem
ROSA HCP metal machine pools frequently fail to provision due to AWS capacity constraints. The existing ocp4_workload_rosa_machinepool workload has fallback logic, but:
Example stuck scenario:
m5zn.metal first
Solution
This PR improves both the install-time metal pool creation AND the workload:
1. Faster Stuck Detection (Workload Enhancement)
Added intelligent stuck detection to ocp4_workload_rosa_machinepool:
Before: 25 min of retries when stuck
After: 5 min detection, fast fallback
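As a rough illustration of the 5 min check, the sketch below polls a node count against a start timestamp and reports the pool as stuck if it is still at zero when the window closes. It is an assumption about the shape of the logic, not the workload's task code. The names is_pool_stuck and get_ready_nodes and the 15 s poll interval are hypothetical, and the clock and sleep parameters exist only so the function can be exercised without real waiting.

```python
import time
from typing import Callable

STUCK_TIMEOUT_SECONDS = 5 * 60  # the 5 min window described in this PR


def is_pool_stuck(
    get_ready_nodes: Callable[[], int],
    timeout: float = STUCK_TIMEOUT_SECONDS,
    poll_interval: float = 15.0,  # assumed value, not taken from the PR
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the pool still has 0 nodes once the timeout elapses."""
    started = clock()  # timestamp taken right after the pool is created
    while True:
        if get_ready_nodes() > 0:
            return False  # at least one node arrived, so the pool is not stuck
        if clock() - started >= timeout:
            return True  # zero nodes for the whole window: treat as stuck
        sleep(poll_interval)
```

A monotonic clock is used so that wall-clock adjustments cannot shorten or lengthen the window. Only the zero-node criterion from this PR is modelled here; a partially filled pool counts as not stuck in this sketch.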
2. Better Default Instance Types
Updated default fallback order in workload:
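The ordering can be sketched in Python as a small helper that places a preferred type ahead of the defaults. The default list follows the order given in the second commit message. Applying the "user preference honored first" rule to that same list is an assumption made for illustration, and candidate_types is a hypothetical name rather than anything in the role.

```python
from typing import List, Optional, Sequence

# Default order as listed in the commit message for this PR.
DEFAULT_METAL_TYPES = [
    "i4i.metal",
    "i3en.metal",
    "i3.metal",
    "m5zn.metal",
    "m5n.metal",
    "m5.metal",
]


def candidate_types(
    preferred: Optional[str] = None,
    defaults: Sequence[str] = DEFAULT_METAL_TYPES,
) -> List[str]:
    """Return the instance types to try, preferred type first, no duplicates."""
    ordered = [preferred] if preferred else []
    # Keep the default order for everything that is not the preferred type.
    ordered += [t for t in defaults if t != preferred]
    return ordered
```

Calling candidate_types("i3.metal") moves i3.metal to the front and leaves the remaining defaults in their original order, so a preference never removes a fallback option.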
3. Install-Time Metal Pool Support (NEW)
Added opt-in metal pool creation during cluster install:
rosa_metal_deploy: true
rosa_metal_instance_type (preferred type)
4. Clear Failure Messages
Testing
Tested across multiple scenarios:
Scenario 1: Unavailable instance type (i3en.metal)
Scenario 2: Partial capacity (only 2/3 nodes)
Scenario 3: All nodes available
Impact
For Workloads:
For Install:
For Operations:
Files Changed
Workload enhancements:
ansible/roles_ocp_workloads/ocp4_workload_rosa_machinepool/tasks/add_metal_node.yml
ansible/roles_ocp_workloads/ocp4_workload_rosa_machinepool/defaults/main.yml
Install-time support:
ansible/configs/rosa-consolidated/install_rosa_hcp.yml
ansible/configs/rosa-consolidated/fix_metal_machinepool.yml
ansible/configs/rosa-consolidated/create_metal_machinepool_attempt.yml
ansible/configs/rosa-consolidated/default_vars.yml
Backwards Compatibility
✅ Fully backwards compatible:
Workload:
Install:
Usage
Via Workload (Enhanced):
Via Install (NEW):
Testing Checklist:
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>