[XL] Add Python Crank Scheduling tool#2106
Conversation
…he new scheduling tool.
…generating the yml pipelines.
…added machine_groups for the base azure configuration.
Pull Request Overview
This PR introduces a comprehensive Python-based crank scheduling tool to automate CI pipeline generation and optimize machine/scenario allocation across multiple performance testing machines.
- Adds a complete Python crank scheduler with sophisticated machine allocation algorithms and multi-YAML generation capabilities
- Updates existing CI configurations to use the new machine group system and multi-capability machine definitions
- Replaces manual YAML matrix files with JSON-based configuration and automated pipeline generation
Reviewed Changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/crank-scheduler/*.py | Core scheduler implementation with machine allocation, runtime estimation, and template generation |
| scripts/crank-scheduler/requirements.txt | Python dependencies for the scheduler |
| scripts/crank-scheduler/*.md | Documentation and configuration guides |
| build/benchmarks_ci*.json | Updated machine configurations with new capability-based structure and machine groups |
| build/benchmarks*.yml | Updated pipeline files generated by the new scheduler |
| build/benchmarks.template.liquid | Updated template comments to reflect new generation process |
…e unused requirements and code, and added some new entries to .gitignore.
Force-pushed from c79428e to a9d28b0.
…. Also improved the scheduler to better handle role-priority based profile selection.
```python
epilog="""
Examples:
  # Generate schedule from JSON files
  python main.py --config config.json --format table
```
--format is used in the examples, but we don't seem to have a --config argument anywhere.
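One minimal way to make the epilog example line up with the parser — a hypothetical sketch using the flag names from the example, not the PR's actual argument definitions:

```python
import argparse

# Hypothetical parser registering both flags the epilog example uses.
parser = argparse.ArgumentParser(
    epilog="Example: python main.py --config config.json --format table")
parser.add_argument("--config", required=True,
                    help="Path to the combined JSON configuration")
parser.add_argument("--format", default="table", choices=["table", "json"],
                    help="How to print the generated schedule")

args = parser.parse_args(["--config", "config.json", "--format", "table"])
```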
```python
machines_by_type = {}
for machine in machines:
    # Get primary machine type (lowest priority capability)
```
nit: I assume by lowest here we mean lowest number, which would actually be higher priority.
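To illustrate the reviewer's point — priority 1 is the *most* preferred role, so "lowest priority capability" really means the lowest priority *number* — here is a standalone sketch (function name and data shapes are assumptions, not the PR's code):

```python
def primary_machine_type(capabilities: dict) -> str:
    """Return the machine type whose capability carries the lowest
    priority number, i.e. the machine's most-preferred role."""
    return min(capabilities, key=lambda t: capabilities[t]["priority"])

# A machine that prefers the SUT role but can also serve as load:
caps = {"load": {"priority": 2}, "sut": {"priority": 1}}
# primary_machine_type(caps) picks "sut", since priority 1 beats priority 2.
```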
Pull request overview
Copilot reviewed 23 out of 24 changed files in this pull request and generated 7 comments.
```json
{
  "name": "performance-test-scenario",
  "scenario_type": 2,
  "estimated_runtime": 45.0,
  "target_machines": ["machine-1", "machine-2"]
}
```

#### Scenario Properties

- **name**: Scenario identifier
- **scenario_type**: Number of machines required (1=SUT only, 2=SUT+Load, 3=SUT+Load+DB)
This example and the scenario property list document a scenario_type field, but DataLoader.load_combined_configuration currently reads the scenario type from the type key and example_complete_features.json also uses "type". As written, a config that follows this README and uses scenario_type will cause a ScenarioType lookup failure; please either adjust the loader to accept scenario_type or update the docs/examples to use the actual key (type) so JSON configs are valid.
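One way to implement the "accept both keys" option this comment suggests — a hedged sketch, not the actual `DataLoader` code:

```python
def read_scenario_type(entry: dict) -> int:
    """Accept the documented 'scenario_type' key as well as the 'type'
    key the loader currently reads, preferring the documented one."""
    for key in ("scenario_type", "type"):
        if key in entry:
            return int(entry[key])
    raise KeyError(
        f"scenario {entry.get('name', '<unnamed>')!r} is missing a type key")
```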
```json
{
  "name": "Simple Single Machine Test",
  "template": "simple-single.yml",
  "scenario_type": 1,
  "target_machines": ["single-type-machine", "multi-type-machine"],
  "estimated_runtime": 10.0,
  "description": "Basic single machine scenario with default profiles"
}
```

**Result:** Uses default profiles for all machines

### 2. Custom Profile Selection

```json
{
  "name": "Triple Machine Test with Custom Profiles",
  "template": "triple-custom.yml",
  "scenario_type": 3,
  "target_machines": ["multi-type-machine"],
  "estimated_runtime": 45.0,
  "profile_overrides": {
    "multi-type-machine": {
      "sut": "multi-sut-high-cpu",
      "load": "multi-load-high-throughput",
      "db": "multi-db-memory-optimized"
    }
  }
}
```

**Result:** Uses specific custom profiles for each machine type

### 3. Mixed Profile Usage

```json
{
  "name": "Mixed Profile Scenario",
  "template": "mixed-profiles.yml",
  "scenario_type": 2,
  "target_machines": ["single-type-machine", "multi-type-machine"],
  "profile_overrides": {
    "multi-type-machine": {
      "sut": "multi-sut-low-memory"
    }
  }
}
```

**Result:**

- `single-type-machine`: Uses default profile
- `multi-type-machine` SUT: Uses custom profile
- `multi-type-machine` LOAD: Uses default profile

## Configuration Properties Explained

### Machine Properties

| Property | Required | Description |
| --- | --- | --- |
| `name` | ✅ | Unique machine identifier |
| `capabilities` | ✅ | Dict of machine types this machine can fulfill |
| `preferred_partners` | ❌ | List of preferred machines for other roles |

### Capability Properties

| Property | Required | Description |
| --- | --- | --- |
| `machine_type` | ✅ | Key: "sut", "load", or "db" |
| `priority` | ✅ | 1=preferred, 2=secondary, 3=fallback |
| `profiles` | ✅ | List of available profile names |
| `default_profile` | ❌ | Which profile to use by default (defaults to first profile in list) |

### Scenario Properties

| Property | Required | Description |
| --- | --- | --- |
| `name` | ✅ | Scenario identifier |
| `template` | ✅ | YAML template file |
| `scenario_type` | ✅ | 1=single, 2=dual, 3=triple machine |
| `target_machines` | ✅ | List of machines to run on |
| `estimated_runtime` | ❌ | Runtime in minutes |
| `description` | ❌ | Human-readable description |
| `profile_overrides` | ❌ | Custom profile overrides |
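The `default_profile` fallback described in the Capability Properties table can be expressed in a couple of lines (a sketch of the documented behavior; the real resolution logic lives in the scheduler):

```python
def resolve_profile(capability: dict) -> str:
    """Per the table above: use default_profile when present, otherwise
    fall back to the first entry in the profiles list."""
    return capability.get("default_profile") or capability["profiles"][0]

# With no default_profile set, the first listed profile wins:
# resolve_profile({"profiles": ["multi-sut-a", "multi-sut-b"]}) -> "multi-sut-a"
```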
These scenario examples and the "Scenario Properties" table document a scenario_type field, but the scheduler code reads the type from a type key in the JSON (and example_complete_features.json uses "type"). Using scenario_type as shown here will break loading; please align the docs with the implementation (or update the loader to accept scenario_type) so configuration authors can rely on the documented shape.
| Property | Required | Description |
| --- | --- | --- |
| `name` | ✅ | Unique machine identifier |
| `capabilities` | ✅ | Dict of machine types this machine can fulfill |
| `preferred_partners` | ❌ | List of preferred machines for other roles |
The machine configuration docs list name, capabilities, and preferred_partners, but do not mention the new machine_group field that is now used by the scheduler for group-based compatibility (see Machine.machine_group in models.py and the updated build/benchmarks_ci*.json files). To make the new grouping behavior discoverable and configurable, please extend this table (and the surrounding text) to describe the machine_group field and how it interacts with enforce_machine_groups in metadata.
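As a sketch of how the requested documentation might describe the interaction — the logic below is hypothetical, inferred only from the field names in the comment (`machine_group`, `enforce_machine_groups`), not taken from `models.py`:

```python
def machines_compatible(a: dict, b: dict, enforce_machine_groups: bool) -> bool:
    """When group enforcement is on, only pair machines that share a
    machine_group; ungrouped machines are assumed compatible with any.
    (Assumed semantics -- the real rule in the scheduler may differ.)"""
    if not enforce_machine_groups:
        return True
    group_a, group_b = a.get("machine_group"), b.get("machine_group")
    return group_a is None or group_b is None or group_a == group_b
```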
```yaml
# - Update this file with the result of the template generation
# - The file benchmarks*.json defines how each pipeline set of jobs is run in parallel
# - Update the associated benchmarks*.json file with machine and scenario updates
# - Install python and install the requirements for the crank-scheduler in benchmarks/scripts/crank-scheduler/requirements.txt
```
The instructions here reference benchmarks/scripts/crank-scheduler/requirements.txt, but in this repo the requirements file lives at scripts/crank-scheduler/requirements.txt (and the example command below already uses ./scripts/crank-scheduler/main.py). To prevent confusion when following these steps, consider updating this path (and any similarly generated headers in the CI YAML files) to match the actual directory layout.
```diff
-# - Install python and install the requirements for the crank-scheduler in benchmarks/scripts/crank-scheduler/requirements.txt
+# - Install python and install the requirements for the crank-scheduler in scripts/crank-scheduler/requirements.txt
```
```python
def process_yaml_generation(args, partial_schedules: List[PartialSchedule], config: CombinedConfiguration) -> list:
    """
    Unified flow for YAML generation (single or multi)

    Returns:
        bool: True if YAML files were generated, False otherwise
```
The docstring for process_yaml_generation states that the function returns a bool, but the implementation actually returns a list of dictionaries describing the generated YAML files. Please update the docstring (and/or add a return type annotation) to reflect the real return type so callers know what to expect.
```diff
-def process_yaml_generation(args, partial_schedules: List[PartialSchedule], config: CombinedConfiguration) -> list:
+def process_yaml_generation(args, partial_schedules: List[PartialSchedule], config: CombinedConfiguration) -> List[dict]:
     """
     Unified flow for YAML generation (single or multi)

     Returns:
-        bool: True if YAML files were generated, False otherwise
+        List[dict]: List of metadata dictionaries for each generated YAML file
```
```python
schedule_times = ScheduleOperations.generate_schedule_times(
    config, len(partial_schedules))
```
The CLI overrides for --target-yamls and --schedule-offset are applied here after partial_schedules have already been computed in main.py, and generate_schedule_times uses the overridden target_yaml_count instead of the actual len(partial_schedules). This can lead to mismatches (e.g., some partial schedules never get a YAML file when target_yamls is reduced, or extra offset times are generated and then dropped when target_yamls is increased). To avoid silently skipping work, apply the overrides before splitting the schedule (or re-split after updating yaml_generation) so that both partial_schedules and schedule_times are derived from the same effective target_yaml_count.
```diff
-schedule_times = ScheduleOperations.generate_schedule_times(
-    config, len(partial_schedules))
+# Ensure the YAML generation config's target count matches the actual
+# number of partial schedules so that we don't silently drop or omit work.
+effective_count = len(partial_schedules)
+if config.metadata.yaml_generation is not None:
+    config.metadata.yaml_generation.target_yaml_count = effective_count
+schedule_times = ScheduleOperations.generate_schedule_times(
+    config, effective_count)
```
```python
try:
    partner_index = preferred_partners.index(machine.name)
    score += 0.01 * (partner_index + 1)  # 0.01, 0.02, 0.03, ...
except ValueError:
```
'except' clause does nothing but pass and there is no explanatory comment.
```diff
 except ValueError:
+    # Machine not found in preferred_partners; skip partner bias adjustment.
```
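For context, the scoring fragment above boils down to a small bias function like this standalone sketch (0.01 per position in the partner list; the function name is an assumption):

```python
def partner_bias(preferred_partners: list, machine_name: str) -> float:
    """Mirror the snippet: the first preferred partner contributes 0.01,
    the second 0.02, and so on; unlisted machines contribute nothing."""
    try:
        return 0.01 * (preferred_partners.index(machine_name) + 1)
    except ValueError:
        # Machine not found in preferred_partners; no bias adjustment.
        return 0.0
```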
Simplified alternative to PR aspnet#2106's full crank-scheduler. Uses a pod model where machines are fixed groups (SUT + load + DB) instead of individual machines with capability scoring and preferred partners.

Key simplifications:
- Pods define fixed machine groupings (no role priority/scoring)
- Shared machines between pods handled via collision detection
- Same greedy longest-job-first bin-packing algorithm
- Same Liquid template YAML generation
- ~570 lines vs ~2000 lines in the full scheduler

Includes:
- scripts/pod-scheduler/ (5 Python files + README)
- build/benchmarks_ci_pods.json (pod-based config for CI benchmarks)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add pod-based crank scheduler prototype

  Simplified alternative to PR #2106's full crank-scheduler. Uses a pod model where machines are fixed groups (SUT + load + DB) instead of individual machines with capability scoring and preferred partners.

  Key simplifications:
  - Pods define fixed machine groupings (no role priority/scoring)
  - Shared machines between pods handled via collision detection
  - Same greedy longest-job-first bin-packing algorithm
  - Same Liquid template YAML generation
  - ~570 lines vs ~2000 lines in the full scheduler

  Includes:
  - scripts/pod-scheduler/ (5 Python files + README)
  - build/benchmarks_ci_pods.json (pod-based config for CI benchmarks)

* Add azure, azure-eastus2, and cobalt pod configs

  Pod-based configurations for all three additional CI environments:
  - benchmarks_ci_azure_pods.json: 6 pods, 14 runs (matches main)
  - benchmarks_ci_azure_eastus2_pods.json: 2 pods, 12 runs (matches main)
  - benchmarks_ci_cobalt_pods.json: 4 pods, 44 runs (matches main)

  Notable pod patterns:
  - Azure IDNA pods cross-use each other as load machines
  - Cobalt hosted has 28-core variant pods sharing physical machines with full-core pods (handled by collision detection)
  - Azure eastus2 pods share load/db, serialized automatically

  Also fixes unicode bar chars for Windows compatibility.

* Update azure pod config: merge eastus2, keep IDNA on linux loads

  Reflects main branch changes from PR #2166:
  - Merged cobalt-cloud-lin pods (eastus2) into azure config
  - Removed separate benchmarks_ci_azure_eastus2_pods.json
  - Kept IDNA pod load profiles on linux machines (load jobs require linux), reverting the main branch profile change
  - Added cobalt-cloud-lin-azl3-dual pod for type-2 scenarios (uses cobalt-cloud-lin-db as load instead of client)
  - Total runs: 26 (matches main azure pipeline)

* Regenerate pipeline YAMLs from pod-scheduler configs

  Generated via:
  - python ./scripts/pod-scheduler/main.py --config ./build/benchmarks_ci_pods.json --template ./build/benchmarks.template.liquid --yaml-output ./build
  - python ./scripts/pod-scheduler/main.py --config ./build/benchmarks_ci_azure_pods.json --template ./build/benchmarks.template.liquid --yaml-output ./build --base-name benchmarks-ci-azure
  - python ./scripts/pod-scheduler/main.py --config ./build/benchmarks_ci_cobalt_pods.json --template ./build/benchmarks.template.liquid --yaml-output ./build --base-name benchmarks-ci-cobalt

* Cap timeoutInMinutes at 240 (max 2x old 120 default)

  Formula is now max(120, min(240, 2 * estimated_runtime)). This prevents scenarios with long runtimes (e.g. Proxies at 150min) from setting unreasonably high timeouts compared to previous values. Resulting timeouts: 120 (default), 140 (Grpc), 180 (PGO/Containers), 240 (Proxies)

* Address review feedback

  - Fix 4 incorrect template filenames in benchmarks_ci_pods.json: crossgen-scenarios -> crossgen2-scenarios, custom-proxies-scenarios -> proxies-custom-scenarios, single-file-scenarios -> singlefile-scenarios, websockets-scenarios -> websocket-scenarios
  - Fix machine utilization calculation bug (was inflating totals for machines not in current stage)
  - Remove unused imports (sys, Any, Dict, json, Pod)
  - Remove dead render_with_liquid function and --template CLI arg
  - Add guard against empty queues (ZeroDivisionError)
  - Update README and docstrings to reflect removed template arg

  Code:
  - Validate cron schedules at load time and raise on unsupported hour fields instead of silently no-op'ing the offset for split YAMLs
  - Add optional 'timeout' override per scenario; fall back to the runtime-derived formula when absent
  - Move pipeline plumbing (pool, service-bus connection/namespace) into JSON metadata.pipeline with the previous hardcoded values as defaults
  - Strict validation of duplicate pods, duplicate scenario.pods entries, empty queues; default scheduler to fail-fast on unknown/invalid pod references with a --lenient opt-out
  - Stricter job-id sanitization (handles '.', '/', parens, leading digits, unicode) and explicit duplicate detection in generated YAML
  - Replace id(stage) bookkeeping in split_schedule with explicit indices; add stable name tie-breaker to create_schedule for deterministic output
  - Use Run.job_name in the generator instead of duplicating the regex
  - Drop stale '--template' arg from generated YAML headers and README

  Tests:
  - 41 unit + snapshot tests covering models, config loader, scheduler, generator, and YAML parity with the committed *_pods.json configs

  Cleanup:
  - Revert benchmarks.template.liquid and benchmarks_ci_azure.json to main; the deleted crank-scheduler does not consume them
  - Regenerate all four pipeline YAMLs against the new generator

* Remove unused benchmarks.template.liquid

  The Liquid template was only consumed by the deleted crank-scheduler. The pod-scheduler renders pipeline YAML directly via Python, and grep confirms no other script, pipeline, or build step reads this file.

* Remove orphaned benchmarks.yml and benchmarks.matrix.0[12].yml

  These were artifacts of the old hand-driven matrix.yml -> json -> Liquid template -> benchmarks.yml workflow. Their only inbound references were stale documentation comments cross-pointing between each other; nothing in the repo (no script, no pipeline) consumed them.

* Document pod-scheduler flow across READMEs and YAML headers

  - Generated YAML headers now embed the exact regen command (with the source config and base name) and a pointer to scripts/pod-scheduler/README.md, so each file documents how to reproduce itself
  - New build/README.md maps each *_pods.json config to the YAML it produces, lists the hand-maintained scenario templates, and explains the typical edit/regenerate workflow
  - Top-level README.md gains a 'Continuous benchmarking pipelines' section linking to the pod-scheduler and build/ docs
  - pod-scheduler README's Quick Start now uses repo-root-relative commands and points at the snapshot tests for verification
  - Tests cover the new _format_source_path helper and the snapshot test passes the source config so headers stay verified

* Remove orphaned crank-scheduler JSON configs

  benchmarks_ci.json, benchmarks_ci_azure.json, and benchmarks_ci_cobalt.json used the old 'machines + capabilities' format consumed by the deleted crank-scheduler. Their replacements (benchmarks_ci_pods.json, benchmarks_ci_azure_pods.json, benchmarks_ci_cobalt_pods.json) drive the pod-scheduler. grep finds zero inbound references for any of the three across scripts, pipelines, docs, and tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Parker Bibus <parker.bibus@microsoft.com>
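The "greedy longest-job-first bin-packing" that both schedulers share can be sketched as follows (an illustration of the general longest-processing-time technique, not the repo's exact code; scenario names and runtimes are made up):

```python
def pack_longest_first(runtimes: dict, n_bins: int) -> list:
    """Greedy longest-processing-time packing: sort jobs by estimated
    runtime (descending) and always drop the next job into the bin
    with the smallest current total."""
    bins = [{"total": 0.0, "jobs": []} for _ in range(n_bins)]
    for name, minutes in sorted(runtimes.items(), key=lambda kv: -kv[1]):
        target = min(bins, key=lambda b: b["total"])
        target["jobs"].append(name)
        target["total"] += minutes
    return bins

# Example: three scenarios split over two pipeline YAMLs balance to 50/50.
bins = pack_longest_first({"proxies": 50.0, "grpc": 30.0, "json": 20.0}, 2)
```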
Closing in favor of: #2167
In order to simplify the work needed when updating the scenarios we run, and to minimize the chance of error, this adds a Python script used to generate a CI schedule from a single configuration file. Most of the recently added and updated pipeline flows already used this flow, but this update also adds a machine_group option to ensure machines only use other machines at similar perf levels for the load and db roles.
Changes include the addition of the crank-scheduler, running the configurations through the scheduler one more time with the updated benchmarks.template.liquid, updating benchmarks.template.liquid to include the new steps to run, and adding the machine_group configuration option where applicable.