Skip to content

feat(pipeline): support multiple subscriptions per region in swiftv2 long-running tests#4334

Merged
isaac-dasan merged 26 commits intomasterfrom
isaac/swiftv2/longrunnning-acntest
Apr 22, 2026
Merged

feat(pipeline): support multiple subscriptions per region in swiftv2 long-running tests#4334
isaac-dasan merged 26 commits intomasterfrom
isaac/swiftv2/longrunnning-acntest

Conversation

@isaac-dasan
Copy link
Copy Markdown
Member

@isaac-dasan isaac-dasan commented Apr 8, 2026

Summary

Enable the swiftv2 long-running pipeline to run independent stage chains per region×subscription pair, so the same region (e.g. eastus2euap) can have clusters in different Azure subscriptions tested in parallel. Includes numerous pipeline fixes discovered during rollout.

Changes

Core multi-subscription support

  1. Multi-subscription scenarios matrix — Refactor the pipeline from a single subscriptionId to a scenarios list where each entry carries its own subscriptionId, location, label, and feature flags (enableLinuxBYON, enableDatapathLinux). Each scenario gets its own full stage chain (infra → tests → cleanup).

  2. Subscription context in all scripts — Add az account set --subscription to every provisioning script so commands target the correct subscription instead of relying on the CLI default.

  3. Rename longrunningZoneslongRunningPodZones — Align parameter naming with its actual purpose; fix stage template filename to match.

Infra reliability

  1. Idempotent provisioning — All infra scripts now check for existing resources before creating (AKS clusters, subnets, peerings, storage accounts, PEs, NSGs, VMSS). Re-runs after partial failures self-heal without duplicating resources.

  2. Storage account discover-then-reuse — Replace random-name-per-run storage creation with prefix-based discovery (sa1*/sa2*). Downstream stages always discover accounts from the RG, eliminating the stageDependencies variable passthrough chain.

  3. Remove verify gate — Delete verify_infrastructure.sh and the VerifyInfrastructure/VerifyAccelnetBYON jobs. Since scripts are idempotent, they always run and skip existing resources.

  4. Skip subnet delegation when SAL exists — Check serviceAssociationLinks via az rest before calling the expensive subnet delegator. Fix jq path for the SAL check.

  5. Increase lease TTL to 300 min — Prevent concurrent datapath test collisions when test runs exceed 2 hours.

  6. Remove runtime RBAC management — Delete manage_storage_rbac.sh; require roles to be pre-assigned rather than granted at pipeline time.

Pipeline fixes discovered during rollout

  1. Feature flags for per-scenario control — Add enableLinuxBYON and enableDatapathLinux flags to skip Linux BYON provisioning or Linux datapath tests for subscriptions/regions that don't support them.

  2. Use stageLabel for artifact names — Fix artifact name collisions across scenarios by incorporating the stage label; set subscription context in test jobs.

  3. Defer SSH key fetch — Only fetch the SSH key from Key Vault when VMSS creation is actually needed, avoiding unnecessary failures when BYON is disabled.

  4. Boolean parameter handling — Use eq(x, true) instead of bare string comparisons for boolean checks in ADO object parameters.

  5. Remove incorrect node labeling from create_aks.sh — Stop labeling system pool nodes with workload labels that should only be set on BYON nodes.

Test code fixes

  1. Wait for Deployment deletion before recreatingDeleteDeploymentAndWait in k8s_helpers.go now polls until the Deployment object is fully deleted before returning, preventing AlreadyExists errors on recreation.

  2. Auto-remediate remediator-unschedulable taintensure_zone_nodepools.sh detects and removes the remediator=unschedulable:NoSchedule taint on zone pool nodes that blocks pod scheduling.

  3. Scope MTPNC cleanup to deployment — In the rotating pod test, scope the MTPNC cleanup wait to only the specific deployment's pods rather than the entire namespace.

  4. Single PodNetwork per region — Replace per-zone, per-test-type PodNetworks (pn-rotating-z<N>, pn-alwayson-z<N>) with a single region-scoped PN (pn-longrunning-<buildID>) shared by all zones and both test scenarios. DNCRC will not reconcile different PNs with the same network and subnet combination. All PodNetworkInstances now reference the shared PN.

…iptions

Replace flat subscriptionId/location parameters with a scenarios matrix
so the same pipeline runs eastus2euap in both the NSM feature-flagged
subscription (37deca37) and a baseline subscription (9b8218f9).

- Add scenarios array parameter with subscriptionId, location, label,
  enableAccelnet, and longrunningZones per entry
- Introduce stageLabel in sub-templates to generate unique stage names
  when the same region appears under different subscriptions
- Make accelnet gating data-driven via enableAccelnet flag instead of
  hard-coded centraluseuap checks
…d fix filename

- Rename longrunningZones to longRunningPodZones for clarity
- Rename longrunning-pod-tests-stage.yaml to long-running-pod-tests-stage.yaml
  for consistent hyphenation with other template files
…cripts

Scripts and inline YAML az commands were not specifying the target
subscription, causing multi-subscription scenarios to operate against
the wrong subscription (service connection default).

Add az account set --subscription at the top of each script that
receives a subscriptionId parameter. For pipeline YAML inline commands,
add --subscription explicitly since they have no script-level context.

Scripts updated: create_aks, create_pe, create_nsg, create_peerings,
deploy_linuxbyon, deploy_accelnetbyon, verify_infrastructure.
YAML templates updated: infrastructure-setup-stage, datapath-tests-stage,
long-running-pipeline-template, long-running-pod-tests-stage.
…ounts

Add existence checks before resource creation so scripts can safely
re-run after a partial failure without hitting 'already exists' errors.

Storage accounts now use discover-then-reuse: check for existing sa1*/sa2*
in the RG before creating new ones. Downstream stages always discover
storage accounts from the RG instead of relying on pipeline output
variables, eliminating the StorageAccount1/2 variable passthrough chain.

Scripts made idempotent:
- create_aks: skip cluster/nodepool if already provisioned
- create_vnets: skip subnet if it already exists
- create_peerings: skip peering if it already exists
- create_pe: skip DNS zone/links/endpoint if they exist, discover
  storage account instead of requiring it as a parameter
- create_nsg: add create_nsg_rule_if_missing helper
- create_storage: discover-then-reuse existing accounts
- deploy_linuxbyon: skip VMSS if exists, helm upgrade --install

Pipeline templates simplified:
- Remove storageAccount1/2 parameters and stageDependencies expressions
- Datapath stage always discovers storage accounts from the RG
- Remove CreateStorageAccounts task name and output variables
…isions

Tests can exceed 2 hours, so increase the lease duration from 180 to
300 minutes to prevent concurrent runs from acquiring the lease while
datapath tests are still running.
Remove the VerifyInfrastructure and VerifyAccelnetBYON jobs that gated
provisioning behind infraExists checks. Since all scripts are now
idempotent (skip existing resources), running them unconditionally is
simpler and self-healing — a partial failure on a previous run is
automatically repaired on the next run.

- Delete verify_infrastructure.sh (no longer needed)
- Remove VerifyAccelnetBYON job (deploy_accelnetbyon.sh checks internally)
- EnsureNodeLabels now depends on CreateCluster directly
- All provisioning jobs always run, relying on idempotent scripts
Check for serviceAssociationLinks on the subnet via az rest before
calling the expensive subnet delegator. If a SAL is already present,
skip the delegation entirely.
@isaac-dasan isaac-dasan requested a review from a team as a code owner April 8, 2026 22:40
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR refactors the swiftv2 long-running Azure DevOps pipeline to run independent infra/test/cleanup stage chains per region × subscription “scenario”, enabling parallel validation across multiple subscriptions for the same region while making provisioning scripts idempotent and explicitly subscription-scoped.

Changes:

  • Replaces single subscriptionId/region mapping with a scenarios matrix (label + subscriptionId + location + feature flags + zones) and updates stage naming to avoid collisions.
  • Adds explicit subscription targeting (az account set / --subscription) across provisioning and test stages; increases lease TTL to 300 minutes.
  • Removes the infrastructure verify gate and shifts to idempotent “discover-or-create” behavior for infra resources, including storage account discovery.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
.pipelines/swiftv2-long-running/template/longrunning-pod-tests-stage.yaml Adds stageLabel for scenario-based stage naming and pins AKS credential fetch to the scenario subscription.
.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml Refactors pipeline orchestration from per-location to per-scenario, updates lease TTL, and rewires dependencies by scenario label.
.pipelines/swiftv2-long-running/template/infrastructure-setup-stage.yaml Removes verification gate and makes infra setup always-run + idempotent; gates accelnet BYON by per-scenario flag.
.pipelines/swiftv2-long-running/template/datapath-tests-stage.yaml Switches stage naming to scenario label and discovers storage accounts at runtime from RG.
.pipelines/swiftv2-long-running/scripts/verify_infrastructure.sh Deleted verification script as part of removing the verify gate.
.pipelines/swiftv2-long-running/scripts/deploy_linuxbyon.sh Makes BYON VMSS/Helm actions more idempotent and explicitly sets subscription context.
.pipelines/swiftv2-long-running/scripts/deploy_accelnetbyon.sh Adds optional subscription arg and sets subscription context explicitly.
.pipelines/swiftv2-long-running/scripts/create_vnets.sh Skips existing subnets and avoids subnet delegation work when a Service Association Link already exists.
.pipelines/swiftv2-long-running/scripts/create_storage.sh Discover-or-create storage accounts by prefix instead of per-run random names; removes stage output variable pass-through.
.pipelines/swiftv2-long-running/scripts/create_peerings.sh Adds subscription context and makes peering creation idempotent.
.pipelines/swiftv2-long-running/scripts/create_pe.sh Makes PE/DNS steps idempotent and discovers sa1* automatically when not provided.
.pipelines/swiftv2-long-running/scripts/create_nsg.sh Adds subscription context and makes NSG rule creation idempotent.
.pipelines/swiftv2-long-running/scripts/create_aks.sh Adds subscription context and attempts idempotent AKS + nodepool creation.
.pipelines/swiftv2-long-running/pipeline.yaml Introduces the scenarios matrix parameter (multi-subscription + per-scenario zones).
.pipelines/swiftv2-long-running/LONGRUNNING-TESTS.md Updates docs to mention scenarios, but still contains some stale references.
Comments suppressed due to low confidence (1)

.pipelines/swiftv2-long-running/scripts/create_nsg.sh:1

  • When the rule already exists, create_nsg_rule_if_missing returns success and the trailing && echo ... created still prints, which makes logs misleading (it was skipped, not created). Consider moving the success message into create_nsg_rule_if_missing with distinct 'created' vs 'skipped' messaging, or have the function indicate whether it created anything.
#!/usr/bin/env bash

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread .pipelines/swiftv2-long-running/template/infrastructure-setup-stage.yaml Outdated
Comment thread .pipelines/swiftv2-long-running/scripts/create_vnets.sh
Comment thread .pipelines/swiftv2-long-running/scripts/create_storage.sh Outdated
Comment thread .pipelines/swiftv2-long-running/template/datapath-tests-stage.yaml Outdated
Comment thread .pipelines/swiftv2-long-running/scripts/create_aks.sh
Comment thread .pipelines/swiftv2-long-running/LONGRUNNING-TESTS.md Outdated
Comment thread .pipelines/swiftv2-long-running/LONGRUNNING-TESTS.md Outdated
The field is .serviceAssociationLinks[0].name (e.g. 'SubnetDelegator'),
not .serviceAssociationLinks[0].properties.link which doesn't exist.
This caused the check to always return empty, falling through to
the expensive delegation call every time.
…ned roles

Remove manage_storage_rbac.sh and all calls to it. Storage Blob Data
Contributor is now pre-assigned at the subscription level instead of
being assigned/revoked per-run by the pipeline.

- Delete manage_storage_rbac.sh
- Remove RBAC assign call from create_storage.sh
- Remove 'Assign RBAC' and 'Remove RBAC' tasks from datapath stage
- Document required RBAC prerequisites in LONGRUNNING-TESTS.md
- Sort storage account discovery queries for determinism
- Simplify enableAccelnetBYON condition expression
@isaac-dasan isaac-dasan force-pushed the isaac/swiftv2/longrunnning-acntest branch from 9baf369 to c25556c Compare April 9, 2026 03:04
…ported subscriptions

The linux.bicep template in Networking-Aquarius hardcodes a managed
identity that doesn't exist in the eastus2euap_baseline subscription.
Gate the DeployLinuxBYON job and swiftv2-linux-byon datapath tests
behind a new enableLinuxBYON flag, mirroring the existing enableAccelnet
pattern.

- Add enableLinuxBYON to each scenario in pipeline.yaml
- Thread parameter through long-running-pipeline-template and
  infrastructure-setup-stage
- Gate DeployLinuxBYON job with condition
- Skip swiftv2-linux-byon workload when enableLinuxBYON is false
- Update ReleaseLease dependsOn for all flag combinations
@isaac-dasan isaac-dasan force-pushed the isaac/swiftv2/longrunnning-acntest branch from c25556c to cb786fc Compare April 9, 2026 03:19
@isaac-dasan
Copy link
Copy Markdown
Member Author

/azp run Azure Container Networking PR

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

…est jobs

- Replace parameters.location with parameters.stageLabel in all artifact
  names to prevent collisions when multiple subscriptions share a region
- Add az account set --subscription to all AzureCLI@2 test jobs so Go
  code inherits the correct subscription context
- Add stageLabel parameter to metrics-setup-steps.yaml
Move get_ssh_public_key call inside create_linux_vmss after the
existence check so the script no longer fails when the Key Vault
secret is inaccessible but the VMSS already exists.
Azure DevOps coerces booleans to strings when object parameters are
passed through 'extends'. The string "false" is truthy in template
expressions, causing skip conditions to never fire. This resulted in
swiftv2-linux-byon and accelnet stages running for the baseline
scenario where they should be skipped.

Use eq(scenario.enableLinuxBYON, true) and eq(scenario.enableAccelnet,
true) instead of bare boolean references in compile-time expressions.
NRMS DINE policy (NRMS-VNET-DINE-NSG-v016) periodically reconciles
NRMS-managed NSGs and removes custom security rules. This causes the
NSGBlocked_S1toS2 and NSGBlocked_S2toS1 tests to fail because deny
rules created during infra setup are wiped before tests run.

Add an 'Ensure NSG Deny Rules' step in the ConnectivityTests job that
idempotently re-applies deny rules via create_nsg.sh immediately before
the Go test executes, minimizing the window for NRMS reconciliation.
…ests

Add enableDatapathLinux parameter to scenario matrix to gate the
swiftv2-linux datapath test stage. Set to false for eastus2euap_baseline
to disable the stage while NRMS DINE policy issue is investigated.

Update ReleaseLease dependsOn logic to handle scenarios where no
datapath stages run (falls back to depending only on AcquireLease).
…eploymentAndWait

DeleteDeploymentAndWait only waited for pods to be gone but not the
Deployment resource itself, causing 409 AlreadyExists when the rotating
pod test immediately recreated with the same name.
… nodepool setup

When AKS node auto-repair marks a node with the
remediator.kubernetes.azure.com/unschedulable taint, the node is
effectively dead for scheduling. Add a check in ensure_zone_nodepools.sh
that detects this taint on npz* nodes, maps them back to their VMSS
instance, and deletes the instance so the VMSS provisions a replacement.
The blanket kubectl label commands were wrongly applied to every node
during cluster creation. Node labeling is already handled correctly
elsewhere in the pipeline.
The blanket kubectl label commands were wrongly applied to every node
during cluster creation. Node labeling is already handled correctly
elsewhere in the pipeline.
WaitForMTPNCCleanupK8s checks ALL MTPNCs in a namespace, which always
times out in the rotating test where 5 other deployments share the
namespace. Add WaitForMTPNCCleanupByDeploymentK8s that filters MTPNCs
by deployment pod name prefix, and use it in deleteRotatingDeployment
with an increased timeout of 180s.
@isaac-dasan
Copy link
Copy Markdown
Member Author

/azp run Azure Container Networking PR

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

DNCRC will not reconcile different PodNetworks with the same network
and subnet combination. Replace per-zone, per-test-type PNs
(pn-rotating-z<N>, pn-alwayson-z<N>) with a single region-scoped PN
(pn-longrunning-<buildID>) shared by all zones and both rotating and
always-on test scenarios. PodNetworkInstances remain namespace-scoped
and zone-suffixed, all pointing to the shared PN.
@isaac-dasan
Copy link
Copy Markdown
Member Author

/azp run Azure Container Networking PR

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread .pipelines/swiftv2-long-running/scripts/create_aks.sh
Comment thread .pipelines/swiftv2-long-running/scripts/create_aks.sh
Comment thread .pipelines/swiftv2-long-running/scripts/create_storage.sh Outdated
Comment thread .pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml Outdated
…locks

Replace the ${{ each workload }} loop with compound boolean gate with
three separate per-workload template inclusions, each gated by a single
simple ${{ if eq(scenario.flag, true) }} condition.

This addresses PR feedback about the complex nested boolean condition
being hard to parse.
Add lock_resource_groups.sh that applies CanNotDelete locks and
deletion_due_time, gc_scenario, and gc_skip tags to the main resource
group and each AKS managed (MC_) resource group. Called from a new
LockResourceGroups job in infrastructure-setup-stage after cluster
creation.
…ed lookup

- Add swiftv2-longrun-role tag to storage accounts for unique filtering
- Export storage account names as ADO output variables from create_storage.sh
- Pass storage account names explicitly to create_pe.sh and datapath stages
- Remove duplicate DiscoverStorageAccounts task from datapath-tests-stage
- Reduce 7+ redundant az storage account list CLI calls to 1 per run
@isaac-dasan
Copy link
Copy Markdown
Member Author

/azp run Azure Container Networking PR

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@isaac-dasan isaac-dasan enabled auto-merge April 16, 2026 23:49
@isaac-dasan isaac-dasan added this pull request to the merge queue Apr 21, 2026
Merged via the queue into master with commit aeee496 Apr 22, 2026
31 of 33 checks passed
@isaac-dasan isaac-dasan deleted the isaac/swiftv2/longrunnning-acntest branch April 22, 2026 00:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants