Add ROSA HCP stuck detection, auto-recovery, and metal instance fallback #9803
Open
bbethell-1 wants to merge 5 commits into
Improves ROSA HCP cluster installation reliability by detecting and automatically recovering from stuck installations.

**Problem:** ROSA HCP clusters occasionally get stuck in "installing" state for hours with no worker nodes provisioned and no error messages. The current implementation blindly retries for 2 hours without detecting this condition.

**Solution:**
- Detect clusters stuck "installing" >45 min with 0 worker nodes
- Automatically delete and recreate stuck clusters
- Check for provision errors and fail fast
- Verify subnets exist before recreation (handle subnet deletion)
- Reset timer after recreation to allow full installation time

**Testing:** Tested on multiple ROSA HCP clusters in us-east-2 that exhibited stuck behavior. Auto-recovery successfully deleted and recreated clusters, which then installed normally in 30-40 minutes.

**Impact:**
- Reduces failed provisioning jobs from stuck clusters
- Eliminates need for manual intervention
- Maintains same total timeout (90 min initial + 60 min retry)
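The detection rule in the Solution list can be sketched as a small decision function. This is an illustration in Python, not the Ansible tasks from this PR: `ClusterStatus`, its field names, `next_action`, and the `"ready"` state string are all assumptions made for the example. Only the 45 minute threshold, the 0 worker node condition, and the fail-fast behaviour on provision errors come from the description above.

```python
from dataclasses import dataclass

# From the description: stuck means "installing" for more than 45 minutes.
STUCK_THRESHOLD_MINUTES = 45


@dataclass
class ClusterStatus:
    """Hypothetical snapshot of one polling pass; field names are illustrative."""
    state: str
    worker_nodes: int
    minutes_elapsed: float
    provision_error: str = ""


def next_action(status: ClusterStatus) -> str:
    """Return 'fail', 'done', 'recreate', or 'wait' for one polling pass."""
    if status.provision_error:
        return "fail"  # provision errors fail fast instead of retrying
    if status.state == "ready":  # assumed name of the healthy end state
        return "done"
    stuck = (
        status.state == "installing"
        and status.worker_nodes == 0
        and status.minutes_elapsed > STUCK_THRESHOLD_MINUTES
    )
    # A stuck cluster is deleted and recreated; the caller would also
    # reset its timer so the new cluster gets a full installation window.
    return "recreate" if stuck else "wait"
```

For example, `next_action(ClusterStatus("installing", 0, 50))` yields `"recreate"`, while the same cluster at 30 minutes yields `"wait"`.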
Corrected indentation for list items in 'until' and 'when' conditions to comply with yamllint requirements:
- Lines 139-142: Fixed indentation for until conditions
- Line 149: Fixed indentation for when condition
- Lines 155-158: Fixed indentation for when condition
- Lines 203-204: Fixed indentation for nested when condition
- Lines 246-247: Fixed indentation for nested until condition
- Lines 266-269: Fixed indentation for top-level until conditions
…allback

Implements automatic retry logic for ROSA HCP metal machine pools that fail due to AWS capacity constraints. When a metal instance type is unavailable, the system automatically tries fallback options.

Features:
- User preference honored first (set via rosa_metal_instance_type)
- Automatic fallback to alternative instance types (i4i.metal, i3en.metal, i3.metal)
- Detects stuck machine pools (0 replicas after timeout)
- Auto-cleanup and retry with next instance type
- Configurable via AgnosticV ordering system

New variables (in default_vars.yml):
- rosa_metal_deploy: Enable/disable metal pool creation (default: false)
- rosa_metal_instance_type: Preferred instance type (default: i4i.metal)
- rosa_metal_replicas: Number of metal nodes (default: 3)
- rosa_metal_disk_size: Disk size for metal nodes (default: 250GiB)
- rosa_metal_pool_name: Machine pool name (default: metal)
- rosa_metal_availability_zone: AZ override (default: cluster default)

Usage: Set rosa_metal_deploy=true and rosa_metal_instance_type=<preferred> when ordering via AgnosticV. The system will automatically handle capacity issues.
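One way to read "user preference honored first" is that the preferred type is placed at the head of the fallback list and any duplicate is dropped. The Python sketch below shows that ordering only. The function name `candidate_instance_types` is hypothetical, and the PR's Ansible tasks may build the list differently.

```python
def candidate_instance_types(preferred, fallbacks):
    """Return the types to try: the preferred one first, then the fallbacks.

    Duplicates and empty values are skipped, so a preferred type that also
    appears in the fallback list is attempted only once.
    """
    ordered = []
    for instance_type in [preferred, *fallbacks]:
        if instance_type and instance_type not in ordered:
            ordered.append(instance_type)
    return ordered


# Fallback types named in the commit message above.
FALLBACKS = ["i4i.metal", "i3en.metal", "i3.metal"]
order = candidate_instance_types("i3en.metal", FALLBACKS)
```

With `rosa_metal_instance_type` set to `i3en.metal`, `order` is `["i3en.metal", "i4i.metal", "i3.metal"]`.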
Updates:
1. Enhanced ocp4_workload_rosa_machinepool workload:
   - Added stuck detection (0 nodes after 5 min timeout)
   - Added timestamp tracking to detect AWS capacity issues
   - Updated default instance types to prefer i4i/i3en (better availability)
   - Now fails fast when stuck, triggering fallback to next instance type
2. Default instance type order updated:
   - i4i.metal (newest, best availability)
   - i3en.metal (common, good availability)
   - i3.metal (fallback)
   - m5zn.metal, m5n.metal, m5.metal (original defaults)
3. Added documentation:
   - Clarified rosa_metal_deploy is opt-in for install-time creation
   - Noted workload handles metal pools with similar fallback logic
   - Prevents conflicts between install and workload approaches

How it works:
- Workload loops through instance types trying each one
- For each type: create pool → wait 5 min → check if stuck
- If 0 nodes after 5 min: fail fast with clear message
- Rescue block deletes stuck pool and tries next instance type
- Continues until successful or all types exhausted

Benefits:
- Faster failure detection (5 min vs 25 min of retries)
- Clear messaging about capacity issues
- Automatic fallback to available instance types
- Works seamlessly with existing workload logic
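The "How it works" loop can be sketched roughly as follows. This is a Python illustration of the control flow, not the workload's Ansible block/rescue code. `provision_with_fallback`, `PoolStuck`, and the injected callables (`create_pool`, `count_nodes`, `delete_pool`, `wait`) are hypothetical names that stand in for the real tasks.

```python
class PoolStuck(Exception):
    """Raised when every candidate instance type ended with 0 nodes."""


def provision_with_fallback(instance_types, create_pool, count_nodes, delete_pool, wait):
    """Try each instance type in order and return the first that gets nodes."""
    for instance_type in instance_types:
        create_pool(instance_type)
        wait()  # the commit message describes a 5 minute wait here
        if count_nodes(instance_type) > 0:
            return instance_type
        # Mirrors the rescue block: remove the stuck pool, then try the next type.
        delete_pool(instance_type)
    raise PoolStuck("no capacity for any of: " + ", ".join(instance_types))


# Simulated run: i4i.metal has no capacity, i3en.metal does.
capacity = {"i4i.metal": 0, "i3en.metal": 3}
deleted = []
chosen = provision_with_fallback(
    ["i4i.metal", "i3en.metal", "i3.metal"],
    create_pool=lambda instance_type: None,
    count_nodes=lambda instance_type: capacity.get(instance_type, 0),
    delete_pool=deleted.append,
    wait=lambda: None,
)
```

In this simulated run `chosen` is `"i3en.metal"` and `deleted` is `["i4i.metal"]`. Passing the side effects in as callables keeps the loop testable without touching AWS.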
c076019 to
0c568dc
Compare
Problem
ROSA HCP deployments fail due to two related issues:
1. Stuck Cluster Installations
ROSA HCP clusters occasionally get stuck in `installing` state for hours with no worker nodes:
2. Metal Machine Pool Capacity Issues
Metal machine pools fail when AWS has no capacity for specific instance types:
- `i3en.metal` often unavailable

Combined impact: Failed deployments requiring manual recovery and long retry delays.
Solution
This PR adds comprehensive self-healing for both scenarios:
Feature 1: Stuck Cluster Detection & Auto-Recovery
Detects clusters stuck in installing state and automatically recovers:
Recovery process:
Time savings:
Feature 2: Metal Instance Fallback & Fast Detection
Improves metal machine pool provisioning with intelligent fallback:
For install_rosa_hcp.yml (NEW - opt-in):
For ocp4_workload_rosa_machinepool (ENHANCED):
Instance type priority (updated):
- `i4i.metal` - Newest, best availability
- `i3en.metal` - Common, good availability
- `i3.metal` - Reliable fallback
- `m5zn.metal`, `m5n.metal`, `m5.metal` - Original defaults

Testing
Stuck Cluster Scenarios
Test 1: Cluster stuck for >45 minutes
Test 2: Normal installation
Metal Pool Scenarios
Test 3: i3en.metal unavailable
Test 4: Partial capacity (2/3 nodes)
Test 5: Workload with unavailable type
Impact
Reduced Manual Intervention
Faster Deployments
Better Reliability
Files Changed
Core stuck detection:
- `ansible/configs/rosa-consolidated/install_rosa_hcp.yml`

Metal pool fallback (install):
- `ansible/configs/rosa-consolidated/fix_metal_machinepool.yml`
- `ansible/configs/rosa-consolidated/create_metal_machinepool_attempt.yml`
- `ansible/configs/rosa-consolidated/default_vars.yml`

Metal pool fallback (workload):
- `ansible/roles_ocp_workloads/ocp4_workload_rosa_machinepool/tasks/add_metal_node.yml`
- `ansible/roles_ocp_workloads/ocp4_workload_rosa_machinepool/defaults/main.yml`

Backwards Compatibility
✅ Fully backwards compatible:
Stuck detection:
Metal pools:
Usage
Stuck Detection (Automatic)
No configuration needed - automatically active for all ROSA HCP deployments.
Metal Pools - Via Install (NEW)
Metal Pools - Via Workload (Enhanced)
Testing Checklist:
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>