[XL] Add Python Crank Scheduling tool#2106
Conversation
…he new scheduling tool.
…generating the yml pipelines.
…added machine_groups for the base azure configuration.
Pull Request Overview
This PR introduces a comprehensive Python-based crank scheduling tool to automate CI pipeline generation and optimize machine/scenario allocation across multiple performance testing machines.
- Adds a complete Python crank scheduler with sophisticated machine allocation algorithms and multi-YAML generation capabilities
- Updates existing CI configurations to use the new machine group system and multi-capability machine definitions
- Replaces manual YAML matrix files with JSON-based configuration and automated pipeline generation
Reviewed Changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/crank-scheduler/*.py | Core scheduler implementation with machine allocation, runtime estimation, and template generation |
| scripts/crank-scheduler/requirements.txt | Python dependencies for the scheduler |
| scripts/crank-scheduler/*.md | Documentation and configuration guides |
| build/benchmarks_ci*.json | Updated machine configurations with new capability-based structure and machine groups |
| build/benchmarks*.yml | Updated pipeline files generated by the new scheduler |
| build/benchmarks.template.liquid | Updated template comments to reflect new generation process |
…e unused requirements and code, and added some new entries to .gitignore.
Force-pushed from c79428e to a9d28b0.
…. Also improved the scheduler to better handle role-priority based profile selection.
```python
epilog="""
Examples:
  # Generate schedule from JSON files
  python main.py --config config.json --format table
```
--format is used in the examples, but we don't seem to have a --config argument anywhere.
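One minimal way to make the epilog example line up with the parser — a hypothetical sketch using the flag names from the example, not the PR's actual argument definitions:

```python
import argparse

# Hypothetical parser registering both flags the epilog example uses.
parser = argparse.ArgumentParser(
    epilog="Example: python main.py --config config.json --format table")
parser.add_argument("--config", required=True,
                    help="Path to the combined JSON configuration")
parser.add_argument("--format", default="table", choices=["table", "json"],
                    help="How to print the generated schedule")

args = parser.parse_args(["--config", "config.json", "--format", "table"])
```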
```python
machines_by_type = {}
for machine in machines:
    # Get primary machine type (lowest priority capability)
```
nit: I assume by lowest here we mean lowest number, which would actually be higher priority.
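To illustrate the reviewer's point — priority 1 is the *most* preferred role, so "lowest priority capability" really means the lowest priority *number* — here is a standalone sketch (function name and data shapes are assumptions, not the PR's code):

```python
def primary_machine_type(capabilities: dict) -> str:
    """Return the machine type whose capability carries the lowest
    priority number, i.e. the machine's most-preferred role."""
    return min(capabilities, key=lambda t: capabilities[t]["priority"])

# A machine that prefers the SUT role but can also serve as load:
caps = {"load": {"priority": 2}, "sut": {"priority": 1}}
# primary_machine_type(caps) picks "sut", since priority 1 beats priority 2.
```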
Pull request overview
Copilot reviewed 23 out of 24 changed files in this pull request and generated 7 comments.
```json
{
  "name": "performance-test-scenario",
  "scenario_type": 2,
  "estimated_runtime": 45.0,
  "target_machines": ["machine-1", "machine-2"]
}
```

#### Scenario Properties

- **name**: Scenario identifier
- **scenario_type**: Number of machines required (1=SUT only, 2=SUT+Load, 3=SUT+Load+DB)
This example and the scenario property list document a scenario_type field, but DataLoader.load_combined_configuration currently reads the scenario type from the type key and example_complete_features.json also uses "type". As written, a config that follows this README and uses scenario_type will cause a ScenarioType lookup failure; please either adjust the loader to accept scenario_type or update the docs/examples to use the actual key (type) so JSON configs are valid.
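One way to implement the "accept both keys" option this comment suggests — a hedged sketch, not the actual `DataLoader` code:

```python
def read_scenario_type(entry: dict) -> int:
    """Accept the documented 'scenario_type' key as well as the 'type'
    key the loader currently reads, preferring the documented one."""
    for key in ("scenario_type", "type"):
        if key in entry:
            return int(entry[key])
    raise KeyError(
        f"scenario {entry.get('name', '<unnamed>')!r} is missing a type key")
```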
```json
{
  "name": "Simple Single Machine Test",
  "template": "simple-single.yml",
  "scenario_type": 1,
  "target_machines": ["single-type-machine", "multi-type-machine"],
  "estimated_runtime": 10.0,
  "description": "Basic single machine scenario with default profiles"
}
```

**Result:** Uses default profiles for all machines

### 2. Custom Profile Selection

```json
{
  "name": "Triple Machine Test with Custom Profiles",
  "template": "triple-custom.yml",
  "scenario_type": 3,
  "target_machines": ["multi-type-machine"],
  "estimated_runtime": 45.0,
  "profile_overrides": {
    "multi-type-machine": {
      "sut": "multi-sut-high-cpu",
      "load": "multi-load-high-throughput",
      "db": "multi-db-memory-optimized"
    }
  }
}
```

**Result:** Uses specific custom profiles for each machine type

### 3. Mixed Profile Usage

```json
{
  "name": "Mixed Profile Scenario",
  "template": "mixed-profiles.yml",
  "scenario_type": 2,
  "target_machines": ["single-type-machine", "multi-type-machine"],
  "profile_overrides": {
    "multi-type-machine": {
      "sut": "multi-sut-low-memory"
    }
  }
}
```

**Result:**

- `single-type-machine`: Uses default profile
- `multi-type-machine` SUT: Uses custom profile
- `multi-type-machine` LOAD: Uses default profile

## Configuration Properties Explained

### Machine Properties

| Property | Required | Description |
| --- | --- | --- |
| `name` | ✅ | Unique machine identifier |
| `capabilities` | ✅ | Dict of machine types this machine can fulfill |
| `preferred_partners` | ❌ | List of preferred machines for other roles |

### Capability Properties

| Property | Required | Description |
| --- | --- | --- |
| `machine_type` | ✅ | Key: "sut", "load", or "db" |
| `priority` | ✅ | 1=preferred, 2=secondary, 3=fallback |
| `profiles` | ✅ | List of available profile names |
| `default_profile` | ❌ | Which profile to use by default (defaults to first profile in list) |

### Scenario Properties

| Property | Required | Description |
| --- | --- | --- |
| `name` | ✅ | Scenario identifier |
| `template` | ✅ | YAML template file |
| `scenario_type` | ✅ | 1=single, 2=dual, 3=triple machine |
| `target_machines` | ✅ | List of machines to run on |
| `estimated_runtime` | ❌ | Runtime in minutes |
| `description` | ❌ | Human-readable description |
| `profile_overrides` | ❌ | Custom profile overrides |
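The `default_profile` fallback described in the Capability Properties table can be expressed in a couple of lines (a sketch of the documented behavior; the real resolution logic lives in the scheduler):

```python
def resolve_profile(capability: dict) -> str:
    """Per the table above: use default_profile when present, otherwise
    fall back to the first entry in the profiles list."""
    return capability.get("default_profile") or capability["profiles"][0]

# With no default_profile set, the first listed profile wins:
# resolve_profile({"profiles": ["multi-sut-a", "multi-sut-b"]}) -> "multi-sut-a"
```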
These scenario examples and the "Scenario Properties" table document a scenario_type field, but the scheduler code reads the type from a type key in the JSON (and example_complete_features.json uses "type"). Using scenario_type as shown here will break loading; please align the docs with the implementation (or update the loader to accept scenario_type) so configuration authors can rely on the documented shape.
| Property | Required | Description |
| --- | --- | --- |
| `name` | ✅ | Unique machine identifier |
| `capabilities` | ✅ | Dict of machine types this machine can fulfill |
| `preferred_partners` | ❌ | List of preferred machines for other roles |
The machine configuration docs list name, capabilities, and preferred_partners, but do not mention the new machine_group field that is now used by the scheduler for group-based compatibility (see Machine.machine_group in models.py and the updated build/benchmarks_ci*.json files). To make the new grouping behavior discoverable and configurable, please extend this table (and the surrounding text) to describe the machine_group field and how it interacts with enforce_machine_groups in metadata.
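As a sketch of how the requested documentation might describe the interaction — the logic below is hypothetical, inferred only from the field names in the comment (`machine_group`, `enforce_machine_groups`), not taken from `models.py`:

```python
def machines_compatible(a: dict, b: dict, enforce_machine_groups: bool) -> bool:
    """When group enforcement is on, only pair machines that share a
    machine_group; ungrouped machines are assumed compatible with any.
    (Assumed semantics -- the real rule in the scheduler may differ.)"""
    if not enforce_machine_groups:
        return True
    group_a, group_b = a.get("machine_group"), b.get("machine_group")
    return group_a is None or group_b is None or group_a == group_b
```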
```yaml
# - Update this file with the result of the template generation
# - The file benchmarks*.json defines how each pipeline set of jobs is run in parallel
# - Update the associated benchmarks*.json file with machine and scenario updates
# - Install python and install the requirements for the crank-scheduler in benchmarks/scripts/crank-scheduler/requirements.txt
```
The instructions here reference benchmarks/scripts/crank-scheduler/requirements.txt, but in this repo the requirements file lives at scripts/crank-scheduler/requirements.txt (and the example command below already uses ./scripts/crank-scheduler/main.py). To prevent confusion when following these steps, consider updating this path (and any similarly generated headers in the CI YAML files) to match the actual directory layout.
```diff
-# - Install python and install the requirements for the crank-scheduler in benchmarks/scripts/crank-scheduler/requirements.txt
+# - Install python and install the requirements for the crank-scheduler in scripts/crank-scheduler/requirements.txt
```
```python
def process_yaml_generation(args, partial_schedules: List[PartialSchedule], config: CombinedConfiguration) -> list:
    """
    Unified flow for YAML generation (single or multi)

    Returns:
        bool: True if YAML files were generated, False otherwise
```
The docstring for process_yaml_generation states that the function returns a bool, but the implementation actually returns a list of dictionaries describing the generated YAML files. Please update the docstring (and/or add a return type annotation) to reflect the real return type so callers know what to expect.
```diff
-def process_yaml_generation(args, partial_schedules: List[PartialSchedule], config: CombinedConfiguration) -> list:
+def process_yaml_generation(args, partial_schedules: List[PartialSchedule], config: CombinedConfiguration) -> List[dict]:
     """
     Unified flow for YAML generation (single or multi)

     Returns:
-        bool: True if YAML files were generated, False otherwise
+        List[dict]: List of metadata dictionaries for each generated YAML file
```
```python
schedule_times = ScheduleOperations.generate_schedule_times(
    config, len(partial_schedules))
```
The CLI overrides for --target-yamls and --schedule-offset are applied here after partial_schedules have already been computed in main.py, and generate_schedule_times uses the overridden target_yaml_count instead of the actual len(partial_schedules). This can lead to mismatches (e.g., some partial schedules never get a YAML file when target_yamls is reduced, or extra offset times are generated and then dropped when target_yamls is increased). To avoid silently skipping work, apply the overrides before splitting the schedule (or re-split after updating yaml_generation) so that both partial_schedules and schedule_times are derived from the same effective target_yaml_count.
```diff
-schedule_times = ScheduleOperations.generate_schedule_times(
-    config, len(partial_schedules))
+# Ensure the YAML generation config's target count matches the actual
+# number of partial schedules so that we don't silently drop or omit work.
+effective_count = len(partial_schedules)
+if config.metadata.yaml_generation is not None:
+    config.metadata.yaml_generation.target_yaml_count = effective_count
+schedule_times = ScheduleOperations.generate_schedule_times(
+    config, effective_count)
```
```python
try:
    partner_index = preferred_partners.index(machine.name)
    score += 0.01 * (partner_index + 1)  # 0.01, 0.02, 0.03, ...
except ValueError:
```
'except' clause does nothing but pass and there is no explanatory comment.
```diff
 except ValueError:
+    # Machine not found in preferred_partners; skip partner bias adjustment.
```
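For context, the scoring fragment above boils down to a small bias function like this standalone sketch (0.01 per position in the partner list; the function name is an assumption):

```python
def partner_bias(preferred_partners: list, machine_name: str) -> float:
    """Mirror the snippet: the first preferred partner contributes 0.01,
    the second 0.02, and so on; unlisted machines contribute nothing."""
    try:
        return 0.01 * (preferred_partners.index(machine_name) + 1)
    except ValueError:
        # Machine not found in preferred_partners; no bias adjustment.
        return 0.0
```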
Simplified alternative to PR aspnet#2106's full crank-scheduler. Uses a pod model where machines are fixed groups (SUT + load + DB) instead of individual machines with capability scoring and preferred partners.

Key simplifications:
- Pods define fixed machine groupings (no role priority/scoring)
- Shared machines between pods handled via collision detection
- Same greedy longest-job-first bin-packing algorithm
- Same Liquid template YAML generation
- ~570 lines vs ~2000 lines in the full scheduler

Includes:
- scripts/pod-scheduler/ (5 Python files + README)
- build/benchmarks_ci_pods.json (pod-based config for CI benchmarks)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add pod-based crank scheduler prototype

  Simplified alternative to PR #2106's full crank-scheduler. Uses a pod model where machines are fixed groups (SUT + load + DB) instead of individual machines with capability scoring and preferred partners.

  Key simplifications:
  - Pods define fixed machine groupings (no role priority/scoring)
  - Shared machines between pods handled via collision detection
  - Same greedy longest-job-first bin-packing algorithm
  - Same Liquid template YAML generation
  - ~570 lines vs ~2000 lines in the full scheduler

  Includes:
  - scripts/pod-scheduler/ (5 Python files + README)
  - build/benchmarks_ci_pods.json (pod-based config for CI benchmarks)

* Add azure, azure-eastus2, and cobalt pod configs

  Pod-based configurations for all three additional CI environments:
  - benchmarks_ci_azure_pods.json: 6 pods, 14 runs (matches main)
  - benchmarks_ci_azure_eastus2_pods.json: 2 pods, 12 runs (matches main)
  - benchmarks_ci_cobalt_pods.json: 4 pods, 44 runs (matches main)

  Notable pod patterns:
  - Azure IDNA pods cross-use each other as load machines
  - Cobalt hosted has 28-core variant pods sharing physical machines with full-core pods (handled by collision detection)
  - Azure eastus2 pods share load/db, serialized automatically

  Also fixes unicode bar chars for Windows compatibility.

* Update azure pod config: merge eastus2, keep IDNA on linux loads

  Reflects main branch changes from PR #2166:
  - Merged cobalt-cloud-lin pods (eastus2) into azure config
  - Removed separate benchmarks_ci_azure_eastus2_pods.json
  - Kept IDNA pod load profiles on linux machines (load jobs require linux), reverting the main branch profile change
  - Added cobalt-cloud-lin-azl3-dual pod for type-2 scenarios (uses cobalt-cloud-lin-db as load instead of client)
  - Total runs: 26 (matches main azure pipeline)

* Regenerate pipeline YAMLs from pod-scheduler configs

  Generated via:
  - python ./scripts/pod-scheduler/main.py --config ./build/benchmarks_ci_pods.json --template ./build/benchmarks.template.liquid --yaml-output ./build
  - python ./scripts/pod-scheduler/main.py --config ./build/benchmarks_ci_azure_pods.json --template ./build/benchmarks.template.liquid --yaml-output ./build --base-name benchmarks-ci-azure
  - python ./scripts/pod-scheduler/main.py --config ./build/benchmarks_ci_cobalt_pods.json --template ./build/benchmarks.template.liquid --yaml-output ./build --base-name benchmarks-ci-cobalt

* Cap timeoutInMinutes at 240 (max 2x old 120 default)

  Formula is now max(120, min(240, 2 * estimated_runtime)). This prevents scenarios with long runtimes (e.g. Proxies at 150min) from setting unreasonably high timeouts compared to previous values. Resulting timeouts: 120 (default), 140 (Grpc), 180 (PGO/Containers), 240 (Proxies)

* Address review feedback

  - Fix 4 incorrect template filenames in benchmarks_ci_pods.json: crossgen-scenarios -> crossgen2-scenarios, custom-proxies-scenarios -> proxies-custom-scenarios, single-file-scenarios -> singlefile-scenarios, websockets-scenarios -> websocket-scenarios
  - Fix machine utilization calculation bug (was inflating totals for machines not in current stage)
  - Remove unused imports (sys, Any, Dict, json, Pod)
  - Remove dead render_with_liquid function and --template CLI arg
  - Add guard against empty queues (ZeroDivisionError)
  - Update README and docstrings to reflect removed template arg

  Code:
  - Validate cron schedules at load time and raise on unsupported hour fields instead of silently no-op'ing the offset for split YAMLs
  - Add optional 'timeout' override per scenario; fall back to the runtime-derived formula when absent
  - Move pipeline plumbing (pool, service-bus connection/namespace) into JSON metadata.pipeline with the previous hardcoded values as defaults
  - Strict validation of duplicate pods, duplicate scenario.pods entries, empty queues; default scheduler to fail-fast on unknown/invalid pod references with a --lenient opt-out
  - Stricter job-id sanitization (handles '.', '/', parens, leading digits, unicode) and explicit duplicate detection in generated YAML
  - Replace id(stage) bookkeeping in split_schedule with explicit indices; add stable name tie-breaker to create_schedule for deterministic output
  - Use Run.job_name in the generator instead of duplicating the regex
  - Drop stale '--template' arg from generated YAML headers and README

  Tests:
  - 41 unit + snapshot tests covering models, config loader, scheduler, generator, and YAML parity with the committed *_pods.json configs

  Cleanup:
  - Revert benchmarks.template.liquid and benchmarks_ci_azure.json to main; the deleted crank-scheduler does not consume them
  - Regenerate all four pipeline YAMLs against the new generator

* Remove unused benchmarks.template.liquid

  The Liquid template was only consumed by the deleted crank-scheduler. The pod-scheduler renders pipeline YAML directly via Python, and grep confirms no other script, pipeline, or build step reads this file.

* Remove orphaned benchmarks.yml and benchmarks.matrix.0[12].yml

  These were artifacts of the old hand-driven matrix.yml -> json -> Liquid template -> benchmarks.yml workflow. Their only inbound references were stale documentation comments cross-pointing between each other; nothing in the repo (no script, no pipeline) consumed them.

* Document pod-scheduler flow across READMEs and YAML headers

  - Generated YAML headers now embed the exact regen command (with the source config and base name) and a pointer to scripts/pod-scheduler/README.md, so each file documents how to reproduce itself
  - New build/README.md maps each *_pods.json config to the YAML it produces, lists the hand-maintained scenario templates, and explains the typical edit/regenerate workflow
  - Top-level README.md gains a 'Continuous benchmarking pipelines' section linking to the pod-scheduler and build/ docs
  - pod-scheduler README's Quick Start now uses repo-root-relative commands and points at the snapshot tests for verification
  - Tests cover the new _format_source_path helper and the snapshot test passes the source config so headers stay verified

* Remove orphaned crank-scheduler JSON configs

  benchmarks_ci.json, benchmarks_ci_azure.json, and benchmarks_ci_cobalt.json used the old 'machines + capabilities' format consumed by the deleted crank-scheduler. Their replacements (benchmarks_ci_pods.json, benchmarks_ci_azure_pods.json, benchmarks_ci_cobalt_pods.json) drive the pod-scheduler. grep finds zero inbound references for any of the three across scripts, pipelines, docs, and tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Parker Bibus <parker.bibus@microsoft.com>
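The "greedy longest-job-first bin-packing" that both schedulers share can be sketched as follows (an illustration of the general longest-processing-time technique, not the repo's exact code; scenario names and runtimes are made up):

```python
def pack_longest_first(runtimes: dict, n_bins: int) -> list:
    """Greedy longest-processing-time packing: sort jobs by estimated
    runtime (descending) and always drop the next job into the bin
    with the smallest current total."""
    bins = [{"total": 0.0, "jobs": []} for _ in range(n_bins)]
    for name, minutes in sorted(runtimes.items(), key=lambda kv: -kv[1]):
        target = min(bins, key=lambda b: b["total"])
        target["jobs"].append(name)
        target["total"] += minutes
    return bins

# Example: three scenarios split over two pipeline YAMLs balance to 50/50.
bins = pack_longest_first({"proxies": 50.0, "grpc": 30.0, "json": 20.0}, 2)
```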
Closing in favor of: #2167
In order to simplify the work needed when updating the scenarios we run, and to minimize the chance of error, this adds a Python script used to generate a CI schedule from a single configuration file. Most of the recently added and updated pipeline flows already used this flow, but this update also adds a machine_group option to ensure machines only use other machines at similar perf levels for the load and db roles.
Changes include the addition of the crank-scheduler, running the configurations through the scheduler one more time with the updated benchmarks.template.liquid, updating benchmarks.template.liquid to include the new steps to run, and adding the machine_group configuration option where applicable.