feat(pipeline): support multiple subscriptions per region in swiftv2 long-running tests#4334
Conversation
…iptions Replace flat subscriptionId/location parameters with a scenarios matrix so the same pipeline runs eastus2euap in both the NSM feature-flagged subscription (37deca37) and a baseline subscription (9b8218f9). - Add scenarios array parameter with subscriptionId, location, label, enableAccelnet, and longrunningZones per entry - Introduce stageLabel in sub-templates to generate unique stage names when the same region appears under different subscriptions - Make accelnet gating data-driven via enableAccelnet flag instead of hard-coded centraluseuap checks
…d fix filename - Rename longrunningZones to longRunningPodZones for clarity - Rename longrunning-pod-tests-stage.yaml to long-running-pod-tests-stage.yaml for consistent hyphenation with other template files
…cripts Scripts and inline YAML az commands were not specifying the target subscription, causing multi-subscription scenarios to operate against the wrong subscription (service connection default). Add az account set --subscription at the top of each script that receives a subscriptionId parameter. For pipeline YAML inline commands, add --subscription explicitly since they have no script-level context. Scripts updated: create_aks, create_pe, create_nsg, create_peerings, deploy_linuxbyon, deploy_accelnetbyon, verify_infrastructure. YAML templates updated: infrastructure-setup-stage, datapath-tests-stage, long-running-pipeline-template, long-running-pod-tests-stage.
…ounts Add existence checks before resource creation so scripts can safely re-run after a partial failure without hitting 'already exists' errors. Storage accounts now use discover-then-reuse: check for existing sa1*/sa2* in the RG before creating new ones. Downstream stages always discover storage accounts from the RG instead of relying on pipeline output variables, eliminating the StorageAccount1/2 variable passthrough chain. Scripts made idempotent: - create_aks: skip cluster/nodepool if already provisioned - create_vnets: skip subnet if it already exists - create_peerings: skip peering if it already exists - create_pe: skip DNS zone/links/endpoint if they exist, discover storage account instead of requiring it as a parameter - create_nsg: add create_nsg_rule_if_missing helper - create_storage: discover-then-reuse existing accounts - deploy_linuxbyon: skip VMSS if exists, helm upgrade --install Pipeline templates simplified: - Remove storageAccount1/2 parameters and stageDependencies expressions - Datapath stage always discovers storage accounts from the RG - Remove CreateStorageAccounts task name and output variables
…isions Tests can exceed 2 hours, so increase the lease duration from 180 to 300 minutes to prevent concurrent runs from acquiring the lease while datapath tests are still running.
Remove the VerifyInfrastructure and VerifyAccelnetBYON jobs that gated provisioning behind infraExists checks. Since all scripts are now idempotent (skip existing resources), running them unconditionally is simpler and self-healing — a partial failure on a previous run is automatically repaired on the next run. - Delete verify_infrastructure.sh (no longer needed) - Remove VerifyAccelnetBYON job (deploy_accelnetbyon.sh checks internally) - EnsureNodeLabels now depends on CreateCluster directly - All provisioning jobs always run, relying on idempotent scripts
Check for serviceAssociationLinks on the subnet via az rest before calling the expensive subnet delegator. If a SAL is already present, skip the delegation entirely.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR refactors the swiftv2 long-running Azure DevOps pipeline to run independent infra/test/cleanup stage chains per region × subscription “scenario”, enabling parallel validation across multiple subscriptions for the same region while making provisioning scripts idempotent and explicitly subscription-scoped.
Changes:
- Replaces single
subscriptionId/region mapping with ascenariosmatrix (label + subscriptionId + location + feature flags + zones) and updates stage naming to avoid collisions. - Adds explicit subscription targeting (
az account set/--subscription) across provisioning and test stages; increases lease TTL to 300 minutes. - Removes the infrastructure verify gate and shifts to idempotent “discover-or-create” behavior for infra resources, including storage account discovery.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| .pipelines/swiftv2-long-running/template/longrunning-pod-tests-stage.yaml | Adds stageLabel for scenario-based stage naming and pins AKS credential fetch to the scenario subscription. |
| .pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml | Refactors pipeline orchestration from per-location to per-scenario, updates lease TTL, and rewires dependencies by scenario label. |
| .pipelines/swiftv2-long-running/template/infrastructure-setup-stage.yaml | Removes verification gate and makes infra setup always-run + idempotent; gates accelnet BYON by per-scenario flag. |
| .pipelines/swiftv2-long-running/template/datapath-tests-stage.yaml | Switches stage naming to scenario label and discovers storage accounts at runtime from RG. |
| .pipelines/swiftv2-long-running/scripts/verify_infrastructure.sh | Deleted verification script as part of removing the verify gate. |
| .pipelines/swiftv2-long-running/scripts/deploy_linuxbyon.sh | Makes BYON VMSS/Helm actions more idempotent and explicitly sets subscription context. |
| .pipelines/swiftv2-long-running/scripts/deploy_accelnetbyon.sh | Adds optional subscription arg and sets subscription context explicitly. |
| .pipelines/swiftv2-long-running/scripts/create_vnets.sh | Skips existing subnets and avoids subnet delegation work when a Service Association Link already exists. |
| .pipelines/swiftv2-long-running/scripts/create_storage.sh | Discover-or-create storage accounts by prefix instead of per-run random names; removes stage output variable pass-through. |
| .pipelines/swiftv2-long-running/scripts/create_peerings.sh | Adds subscription context and makes peering creation idempotent. |
| .pipelines/swiftv2-long-running/scripts/create_pe.sh | Makes PE/DNS steps idempotent and discovers sa1* automatically when not provided. |
| .pipelines/swiftv2-long-running/scripts/create_nsg.sh | Adds subscription context and makes NSG rule creation idempotent. |
| .pipelines/swiftv2-long-running/scripts/create_aks.sh | Adds subscription context and attempts idempotent AKS + nodepool creation. |
| .pipelines/swiftv2-long-running/pipeline.yaml | Introduces the scenarios matrix parameter (multi-subscription + per-scenario zones). |
| .pipelines/swiftv2-long-running/LONGRUNNING-TESTS.md | Updates docs to mention scenarios, but still contains some stale references. |
Comments suppressed due to low confidence (1)
.pipelines/swiftv2-long-running/scripts/create_nsg.sh:1
- When the rule already exists,
create_nsg_rule_if_missingreturns success and the trailing&& echo ... createdstill prints, which makes logs misleading (it was skipped, not created). Consider moving the success message intocreate_nsg_rule_if_missingwith distinct 'created' vs 'skipped' messaging, or have the function indicate whether it created anything.
#!/usr/bin/env bash
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The field is .serviceAssociationLinks[0].name (e.g. 'SubnetDelegator'), not .serviceAssociationLinks[0].properties.link which doesn't exist. This caused the check to always return empty, falling through to the expensive delegation call every time.
…ned roles Remove manage_storage_rbac.sh and all calls to it. Storage Blob Data Contributor is now pre-assigned at the subscription level instead of being assigned/revoked per-run by the pipeline. - Delete manage_storage_rbac.sh - Remove RBAC assign call from create_storage.sh - Remove 'Assign RBAC' and 'Remove RBAC' tasks from datapath stage - Document required RBAC prerequisites in LONGRUNNING-TESTS.md - Sort storage account discovery queries for determinism - Simplify enableAccelnetBYON condition expression
9baf369 to
c25556c
Compare
…ported subscriptions The linux.bicep template in Networking-Aquarius hardcodes a managed identity that doesn't exist in the eastus2euap_baseline subscription. Gate the DeployLinuxBYON job and swiftv2-linux-byon datapath tests behind a new enableLinuxBYON flag, mirroring the existing enableAccelnet pattern. - Add enableLinuxBYON to each scenario in pipeline.yaml - Thread parameter through long-running-pipeline-template and infrastructure-setup-stage - Gate DeployLinuxBYON job with condition - Skip swiftv2-linux-byon workload when enableLinuxBYON is false - Update ReleaseLease dependsOn for all flag combinations
c25556c to
cb786fc
Compare
|
/azp run Azure Container Networking PR |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…est jobs - Replace parameters.location with parameters.stageLabel in all artifact names to prevent collisions when multiple subscriptions share a region - Add az account set --subscription to all AzureCLI@2 test jobs so Go code inherits the correct subscription context - Add stageLabel parameter to metrics-setup-steps.yaml
Move get_ssh_public_key call inside create_linux_vmss after the existence check so the script no longer fails when the Key Vault secret is inaccessible but the VMSS already exists.
Azure DevOps coerces booleans to strings when object parameters are passed through 'extends'. The string "false" is truthy in template expressions, causing skip conditions to never fire. This resulted in swiftv2-linux-byon and accelnet stages running for the baseline scenario where they should be skipped. Use eq(scenario.enableLinuxBYON, true) and eq(scenario.enableAccelnet, true) instead of bare boolean references in compile-time expressions.
NRMS DINE policy (NRMS-VNET-DINE-NSG-v016) periodically reconciles NRMS-managed NSGs and removes custom security rules. This causes the NSGBlocked_S1toS2 and NSGBlocked_S2toS1 tests to fail because deny rules created during infra setup are wiped before tests run. Add an 'Ensure NSG Deny Rules' step in the ConnectivityTests job that idempotently re-applies deny rules via create_nsg.sh immediately before the Go test executes, minimizing the window for NRMS reconciliation.
…ests Add enableDatapathLinux parameter to scenario matrix to gate the swiftv2-linux datapath test stage. Set to false for eastus2euap_baseline to disable the stage while NRMS DINE policy issue is investigated. Update ReleaseLease dependsOn logic to handle scenarios where no datapath stages run (falls back to depending only on AcquireLease).
This reverts commit d16bd67.
…eploymentAndWait DeleteDeploymentAndWait only waited for pods to be gone but not the Deployment resource itself, causing 409 AlreadyExists when the rotating pod test immediately recreated with the same name.
… nodepool setup When AKS node auto-repair marks a node with the remediator.kubernetes.azure.com/unschedulable taint, the node is effectively dead for scheduling. Add a check in ensure_zone_nodepools.sh that detects this taint on npz* nodes, maps them back to their VMSS instance, and deletes the instance so the VMSS provisions a replacement.
The blanket kubectl label commands were wrongly applied to every node during cluster creation. Node labeling is already handled correctly elsewhere in the pipeline.
The blanket kubectl label commands were wrongly applied to every node during cluster creation. Node labeling is already handled correctly elsewhere in the pipeline.
WaitForMTPNCCleanupK8s checks ALL MTPNCs in a namespace, which always times out in the rotating test where 5 other deployments share the namespace. Add WaitForMTPNCCleanupByDeploymentK8s that filters MTPNCs by deployment pod name prefix, and use it in deleteRotatingDeployment with an increased timeout of 180s.
|
/azp run Azure Container Networking PR |
|
Azure Pipelines successfully started running 1 pipeline(s). |
DNCRC will not reconcile different PodNetworks with the same network and subnet combination. Replace per-zone, per-test-type PNs (pn-rotating-z<N>, pn-alwayson-z<N>) with a single region-scoped PN (pn-longrunning-<buildID>) shared by all zones and both rotating and always-on test scenarios. PodNetworkInstances remain namespace-scoped and zone-suffixed, all pointing to the shared PN.
|
/azp run Azure Container Networking PR |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…locks
Replace the ${{ each workload }} loop with compound boolean gate with
three separate per-workload template inclusions, each gated by a single
simple ${{ if eq(scenario.flag, true) }} condition.
This addresses PR feedback about the complex nested boolean condition
being hard to parse.
Add lock_resource_groups.sh that applies CanNotDelete locks and deletion_due_time, gc_scenario, and gc_skip tags to the main resource group and each AKS managed (MC_) resource group. Called from a new LockResourceGroups job in infrastructure-setup-stage after cluster creation.
…ed lookup - Add swiftv2-longrun-role tag to storage accounts for unique filtering - Export storage account names as ADO output variables from create_storage.sh - Pass storage account names explicitly to create_pe.sh and datapath stages - Remove duplicate DiscoverStorageAccounts task from datapath-tests-stage - Reduce 7+ redundant az storage account list CLI calls to 1 per run
|
/azp run Azure Container Networking PR |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Summary
Enable the swiftv2 long-running pipeline to run independent stage chains per region×subscription pair, so the same region (e.g.
eastus2euap) can have clusters in different Azure subscriptions tested in parallel. Includes numerous pipeline fixes discovered during rollout.Changes
Core multi-subscription support
Multi-subscription scenarios matrix — Refactor the pipeline from a single
subscriptionIdto ascenarioslist where each entry carries its ownsubscriptionId,location,label, and feature flags (enableLinuxBYON,enableDatapathLinux). Each scenario gets its own full stage chain (infra → tests → cleanup).Subscription context in all scripts — Add
az account set --subscriptionto every provisioning script so commands target the correct subscription instead of relying on the CLI default.Rename
longrunningZones→longRunningPodZones— Align parameter naming with its actual purpose; fix stage template filename to match.Infra reliability
Idempotent provisioning — All infra scripts now check for existing resources before creating (AKS clusters, subnets, peerings, storage accounts, PEs, NSGs, VMSS). Re-runs after partial failures self-heal without duplicating resources.
Storage account discover-then-reuse — Replace random-name-per-run storage creation with prefix-based discovery (
sa1*/sa2*). Downstream stages always discover accounts from the RG, eliminating thestageDependenciesvariable passthrough chain.Remove verify gate — Delete
verify_infrastructure.shand theVerifyInfrastructure/VerifyAccelnetBYONjobs. Since scripts are idempotent, they always run and skip existing resources.Skip subnet delegation when SAL exists — Check
serviceAssociationLinksviaaz restbefore calling the expensive subnet delegator. Fix jq path for the SAL check.Increase lease TTL to 300 min — Prevent concurrent datapath test collisions when test runs exceed 2 hours.
Remove runtime RBAC management — Delete
manage_storage_rbac.sh; require roles to be pre-assigned rather than granted at pipeline time.Pipeline fixes discovered during rollout
Feature flags for per-scenario control — Add
enableLinuxBYONandenableDatapathLinuxflags to skip Linux BYON provisioning or Linux datapath tests for subscriptions/regions that don't support them.Use
stageLabelfor artifact names — Fix artifact name collisions across scenarios by incorporating the stage label; set subscription context in test jobs.Defer SSH key fetch — Only fetch the SSH key from Key Vault when VMSS creation is actually needed, avoiding unnecessary failures when BYON is disabled.
Boolean parameter handling — Use
eq(x, true)instead of bare string comparisons for boolean checks in ADO object parameters.Remove incorrect node labeling from
create_aks.sh— Stop labeling system pool nodes with workload labels that should only be set on BYON nodes.Test code fixes
Wait for Deployment deletion before recreating —
DeleteDeploymentAndWaitink8s_helpers.gonow polls until the Deployment object is fully deleted before returning, preventingAlreadyExistserrors on recreation.Auto-remediate remediator-unschedulable taint —
ensure_zone_nodepools.shdetects and removes theremediator=unschedulable:NoScheduletaint on zone pool nodes that blocks pod scheduling.Scope MTPNC cleanup to deployment — In the rotating pod test, scope the MTPNC cleanup wait to only the specific deployment's pods rather than the entire namespace.
Single PodNetwork per region — Replace per-zone, per-test-type PodNetworks (
pn-rotating-z<N>,pn-alwayson-z<N>) with a single region-scoped PN (pn-longrunning-<buildID>) shared by all zones and both test scenarios. DNCRC will not reconcile different PNs with the same network and subnet combination. All PodNetworkInstances now reference the shared PN.