feat(ci): parallelize reference tests via bucketing #5599
Open
jkrumbiegel wants to merge 34 commits into master from
Collaborator
Benchmark Results
SHA: 15a8ac8db170a9190368bbc27a6276dc2ea26b8a
Warning: These results are subject to substantial noise because GitHub's CI runs on shared machines that are not ideally suited for benchmarking.
Split reference tests into N buckets (default 4) that run in parallel CI jobs. Each bucket runs mod1(i, N) == j of the @reference_test invocations, controlled by REFTEST_BUCKET and REFTEST_NBUCKETS env vars. Local runs are unaffected (all tests run when env vars are unset). The consolidation job now merges bucket artifacts per backend before combining across backends, and correctly computes missing_files.txt using the new skipped_names.txt output.
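The interleaved selection rule described above can be sketched in a few lines. This is a Python model of the logic rather than the actual Julia macro code; `runs_in_this_bucket` is a hypothetical helper name, and only the `REFTEST_BUCKET` / `REFTEST_NBUCKETS` variable names and the `mod1(i, N) == j` rule come from the commit message.

```python
import os

def mod1(i: int, n: int) -> int:
    """Python equivalent of Julia's mod1: result lies in 1..n instead of 0..n-1."""
    return (i - 1) % n + 1

def runs_in_this_bucket(i: int, environ=os.environ) -> bool:
    """Decide whether the i-th (1-based) reference test runs in this job.

    Hypothetical helper: models the rule mod1(i, N) == j from the commit text.
    """
    bucket = environ.get("REFTEST_BUCKET")
    nbuckets = environ.get("REFTEST_NBUCKETS")
    if bucket is None or nbuckets is None:
        return True  # local run: env vars unset, so every test runs
    return mod1(i, int(nbuckets)) == int(bucket)

# Example: which of 12 tests land in bucket 4 of 4?
env = {"REFTEST_BUCKET": "4", "REFTEST_NBUCKETS": "4"}
selected = [i for i in range(1, 13) if runs_in_this_bucket(i, env)]
print(selected)  # [4, 8, 12]
```

Using `mod1` rather than a plain modulo keeps both test indices and bucket numbers 1-based, so bucket N receives the multiples of N instead of a bucket numbered 0.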
- html_widgets resize_to tests: add explicit WGLMakie.activate!() calls so they don't depend on px_per_unit/scalefactor state left by earlier tests (which may be skipped under bucketing)
- deletion tests: replace display(f) with colorbuffer(f) + getscreen() since the inline display path doesn't reliably wait for the WGLMakie session to be established
Adapt the bucket parallelization to the refactored CI that uses reusable workflows (_cairomakie.yml, _glmakie.yml, _wglmakie.yml) and a consolidation step in _reference-tests.yml.
3a1583b to
55abaf8
Compare
Use a custom cache-name with include-matrix: false so all 4 buckets of the same backend and Julia version share one cache entry. Without this, each bucket gets its own cache key (including the bucket number), causing every bucket to precompile independently (~11 min each). On the first run buckets still precompile in parallel (unavoidable), but on subsequent runs all buckets hit the shared cache immediately.
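To illustrate why dropping the bucket number from the key matters, here is a small Python model. The key layout and the `cache_key` function are assumptions made up for this sketch; they do not reproduce the real key format used by julia-actions/cache.

```python
def cache_key(backend: str, julia_version: str, bucket: int, include_bucket: bool) -> str:
    """Build an illustrative cache key (hypothetical format, not the action's real one)."""
    parts = ["julia-cache", backend, julia_version]
    if include_bucket:
        parts.append(f"bucket{bucket}")  # models the matrix value leaking into the key
    return "-".join(parts)

# Four buckets of the same backend and Julia version.
per_bucket = {cache_key("WGLMakie", "1.10", b, True) for b in range(1, 5)}
shared = {cache_key("WGLMakie", "1.10", b, False) for b in range(1, 5)}

print(len(per_bucket))  # 4 distinct entries: every bucket precompiles on its own
print(len(shared))      # 1 entry: all buckets restore the same cache
```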
Contiguous chunks (tests 1-95 in bucket 1, 96-190 in bucket 2, etc.) mean each bucket touches fewer distinct code paths than interleaved mod1(i, N) assignment, reducing JIT/compilation overhead. A dry-run counting pass determines the total number of tests before the real run. Since the test files are already compiled from the counting pass, the second include is essentially free.
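A minimal Python sketch of contiguous assignment follows. The total of 380 tests is an assumption inferred from the "1-95, 96-190" example with four buckets, and `bucket_of` is a hypothetical name, not a function from the PR.

```python
import math

def bucket_of(i: int, total: int, nbuckets: int) -> int:
    """Map the i-th (1-based) test to a contiguous bucket in 1..nbuckets.

    Indices beyond the counted total are clamped into the last bucket,
    so an undercount only makes the last bucket slightly larger.
    """
    chunk = math.ceil(total / nbuckets)
    return min((i - 1) // chunk + 1, nbuckets)

TOTAL, N = 380, 4  # assumed total; gives chunks of 95 tests
print(bucket_of(95, TOTAL, N))   # 1
print(bucket_of(96, TOTAL, N))   # 2
print(bucket_of(380, TOTAL, N))  # 4
print(bucket_of(385, TOTAL, N))  # 4 (index past the counted total is clamped)
```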
Member
Cool! Do we make sure that we reuse compilation?
Each backend now has a setup job that installs packages and runs Pkg.precompile() before the bucket jobs start. Bucket jobs restore the shared cache, so precompilation is already done. Bucket counts adjusted to match backend speed:
- CairoMakie: 2 buckets (faster backend)
- GLMakie: 2 buckets (faster backend)
- WGLMakie: 4 buckets (slowest, ~1h unbucketed)
This reduces total runner-hours while keeping wall-clock time low, which matters with limited org runner capacity.
Use Pkg.test with PRECOMPILE_ONLY=true in the setup job instead of Pkg.precompile(). This ensures the test environment is resolved and precompiled with exactly the same settings and dependencies as the real test run. The test scripts exit(0) immediately when the flag is set, after Pkg.test has done its precompilation work.
The julia-actions/cache post step runs too late to be available to dependent jobs in the same run. Instead, upload compiled/, packages/, and artifacts/ as an artifact from the setup job. Bucket jobs download and extract it before running tests. The julia-actions/cache is kept on the setup job for cross-run caching (subsequent pushes to the same PR).
The bucket jobs were re-running pkg"dev ..." which regenerated the Manifest with different state hashes, partially invalidating the compiled pkgimages from the setup job and causing segfaults. Fix: include monorepo/Manifest.toml in the depot artifact and skip the install step in bucket jobs entirely, so the environment is byte-identical to the setup job.
Mixing ~/.julia/ paths and workspace paths in one artifact caused
upload-artifact to use /home/runner/ as LCA, mangling the Manifest
path on download. Split into two artifacts per backend:
- julia-depot: ~/.julia/{compiled,packages,artifacts,scratchspaces}
downloaded directly to ~/.julia
- manifest: monorepo/Manifest.toml downloaded to monorepo/
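The path mangling can be reproduced with a plain common-ancestor computation. This Python sketch only models the least-common-ancestor step; the workspace path is an assumed example, not the repository's real checkout location.

```python
import posixpath

# Depot directories from the commit message, expanded for the runner user.
depot = [
    "/home/runner/.julia/compiled",
    "/home/runner/.julia/packages",
    "/home/runner/.julia/artifacts",
]
# Assumed workspace location of the Manifest, for illustration only.
manifest = "/home/runner/work/repo/repo/monorepo/Manifest.toml"

def least_common_ancestor(paths):
    """Return the deepest directory shared by all paths."""
    return posixpath.commonpath(paths)

lca_depot_only = least_common_ancestor(depot)
lca_mixed = least_common_ancestor(depot + [manifest])

print(lca_depot_only)  # /home/runner/.julia
print(lca_mixed)       # /home/runner
# Relative to the mixed root, the Manifest carries the whole workspace prefix:
print(posixpath.relpath(manifest, lca_mixed))  # work/repo/repo/monorepo/Manifest.toml
```

With separate artifacts, each one has a root that matches its download target, so no extra path prefix is stored.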
GitHub Actions artifacts don't preserve file permissions (executables lose +x) and Julia's compiled pkgimages segfault when loaded on a different runner instance. Both issues make the setup-job-with-artifact approach unviable for sharing precompilation state. Go back to the simpler design: shared cache keys across buckets of the same backend+version. First run precompiles in parallel (same wall-clock), subsequent runs all hit cache. Also remove the now-unused PRECOMPILE_ONLY guards from runtests.
upload-artifact uses zip which strips Unix permissions (executable bits), breaking compiled pkgimages and binaries like Electron. Fix: tar the depot ourselves before uploading, preserving all permissions. Bucket jobs extract with tar, getting a byte-identical depot including correct executable bits on .so files and binaries.
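The permission-preserving property of tar can be checked with a short round trip. This is a Python sketch on a POSIX system using a throwaway file; it uses an uncompressed tar and the hypothetical helper `roundtrip_mode`, and does not reproduce the workflow's actual tar invocation.

```python
import os
import stat
import tarfile
import tempfile

def roundtrip_mode(mode: int) -> int:
    """Pack a file with the given mode into a tar, extract it, and return the new mode."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src")
        out = os.path.join(tmp, "out")
        os.makedirs(src)
        os.makedirs(out)
        binary = os.path.join(src, "fake_binary")  # stand-in for an executable
        with open(binary, "w") as f:
            f.write("#!/bin/sh\n")
        os.chmod(binary, mode)
        archive = os.path.join(tmp, "depot.tar")
        with tarfile.open(archive, "w") as tar:
            tar.add(binary, arcname="fake_binary")
        # Use the "data" extraction filter where this Python version supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        with tarfile.open(archive, "r") as tar:
            tar.extractall(out, **kwargs)
        return stat.S_IMODE(os.stat(os.path.join(out, "fake_binary")).st_mode)

mode_after = roundtrip_mode(0o755)
print(oct(mode_after))  # executable bits survive the tar round trip
```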
The dry-run counting pass (include files with COUNT_ONLY=true)
caused method override warnings from double-including the test
files. Replace with a simple regex count of @reference_test
occurrences, recursively following include("...") directives.
If the count is slightly off (e.g. a test uses a loop), the last
bucket just gets a few more tests — no correctness issue.
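A regex count that follows include directives might look like the following Python sketch. The patterns and the function name are assumptions; in particular, only literal `include("...")` calls are followed, and the demo files are made up.

```python
import re
import tempfile
from pathlib import Path

REFTEST = re.compile(r"@reference_test\b")
INCLUDE = re.compile(r'include\("([^"]+)"\)')  # literal string includes only

def count_reference_tests(path: Path, seen=None) -> int:
    """Count @reference_test occurrences in path and, recursively, in included files."""
    seen = set() if seen is None else seen
    resolved = path.resolve()
    if resolved in seen:  # guard against include cycles
        return 0
    seen.add(resolved)
    text = path.read_text()
    count = len(REFTEST.findall(text))
    for rel in INCLUDE.findall(text):
        child = path.parent / rel
        if child.is_file():
            count += count_reference_tests(child, seen)
    return count

# Demo with two throwaway files standing in for the test suite.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "runtests.jl").write_text('include("a.jl")\n@reference_test "x" begin end\n')
    (root / "a.jl").write_text('@reference_test "y" begin end\n@reference_test "z" begin end\n')
    total = count_reference_tests(root / "runtests.jl")

print(total)  # 3
```

A test generated inside a loop appears once in the source but runs several times, which is the undercount case the commit message mentions.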
Member
Author
It should compile once and then copy those files to the bucket jobs as an artifact. The compilation step is also cached, so the next run should reuse it.
The resize_to EScreenshot tests measure DOM bounding boxes to determine
screenshot size. If the Electron window starts at a different size
(depending on what previous tests left behind), the layout engine
produces different bounding boxes, leading to screenshots of the wrong
size ("images don't have the same size, difference will be Inf").
Reset the window to 1200x900 before each display() call so the layout
starts from a deterministic state.
capturePage() captures at the device pixel ratio, which can differ between CI runs depending on Electron window state. Resize the captured image to the logical content size (win_size) before saving, so screenshots have deterministic pixel dimensions regardless of DPR.
… calls set_theme!() in @reference_test resets CURRENT_DEFAULT_THEME completely, including WGLMakie screen config to px_per_unit=1, scalefactor=1. So the resize_to tests' references were generated with those defaults, not with px_per_unit=2 from the preceding test's activate! call. Remove the activate!(px_per_unit=2) calls from resize_to tests - they were incorrectly overriding what set_theme! establishes. Also revert the snapshot_figure window reset and image resize changes which were workarounds for this misdiagnosis.
Change from running on every PR to only when a collaborator comments /benchmark on a PR. This saves CI resources for PRs that don't need benchmarking. Also triple the sample count (20 → 60) since benchmarks now run on demand rather than on every push.
Set JULIA_PKG_PRECOMPILE_AUTO=0 on the install step so pkg"dev ..." doesn't trigger precompilation. Precompilation happens once during Pkg.test() which resolves the full test environment including test deps. Without this, dev triggers a precompile pass against the monorepo env, then Pkg.test resolves the test env and precompiles again.
Backend workflows now take julia-version as an input instead of using a version matrix. _reference-tests.yml calls each backend twice (once per version). This means Julia 1.10 bucket jobs start as soon as the 1.10 setup finishes, without waiting for the Julia 1 setup (or vice versa).
c98ef33 to
d2b9642
Compare
On Julia 1.10, coverage=true causes Pkg.test to spawn the subprocess with use_pkgimages=false. The setup job was precompiling without coverage (use_pkgimages=true), so the bucket jobs rejected all cached .ji files due to the flag mismatch. Pass coverage=true in the setup job's Pkg.test call so precompilation uses identical flags. Also removes the JULIA_DEBUG=loading diagnostic from the previous commit.
Coverage was configured with everything disabled (.codecov.yml has project: false, patch: false, annotations: false). Removing it lets Julia 1.10 use pkgimages (coverage=true forces use_pkgimages=false on 1.10), significantly speeding up package loading in bucket jobs.
GitHub Actions runners are heterogeneous (mix of AMD Zen3, Intel
Haswell, etc.). Pkgimages compiled for znver3 are rejected on Intel
runners ("runtime-disabled features"). Use the same multi-target
string as Julia's official binaries so pkgimages work on any x86_64
runner.
Also removes remaining coverage=true from WGLMakie/GLMakie bucket
jobs that were missed in the previous commit.
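For readers unfamiliar with the format, the multi-target string quoted in this PR's summary is a semicolon-separated list of targets, each followed by comma-separated feature flags. The Python sketch below only splits the string to make that structure visible; `parse_cpu_targets` is a hypothetical helper and says nothing about how Julia interprets each flag.

```python
def parse_cpu_targets(spec: str):
    """Split a multi-target string into (target_name, [flags]) pairs."""
    targets = []
    for entry in spec.split(";"):
        name, *flags = entry.split(",")
        targets.append((name, flags))
    return targets

SPEC = "generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1)"
parsed = parse_cpu_targets(SPEC)
for name, flags in parsed:
    print(name, flags)
# generic []
# sandybridge ['-xsaveopt', 'clone_all']
# haswell ['-rdrnd', 'base(1)']
```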
Replace the three near-identical _cairomakie.yml, _glmakie.yml, _wglmakie.yml with a single _backend-tests.yml that takes inputs for backend name, Julia version, bucket count, system packages, and shell. The bucket list is auto-generated from nbuckets via seq|jq in the setup job, eliminating the redundant bucket-list input. The shell input allows callers to wrap test execution in xvfb-run for display-dependent backends.
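The bucket list derived from nbuckets is just the integers 1..N serialized as a compact JSON array, which a job matrix can then consume. A Python equivalent of that seq|jq step, with the hypothetical function name `bucket_list`, could be:

```python
import json

def bucket_list(nbuckets: int) -> str:
    """Return the JSON array [1, ..., nbuckets] in compact form."""
    if nbuckets < 1:
        raise ValueError("nbuckets must be at least 1")
    return json.dumps(list(range(1, nbuckets + 1)), separators=(",", ":"))

print(bucket_list(4))  # [1,2,3,4]
print(bucket_list(2))  # [1,2]
```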
Summary
Parallelize reference tests by splitting them into buckets that run on separate CI machines, with a setup job that precompiles once per backend+version and shares the depot via artifact.
Pipeline per backend+version:
- REFTEST_BUCKET/REFTEST_NBUCKETS env vars in the @reference_test macro. Uses contiguous chunks (tests 1-N in bucket 1, N+1-2N in bucket 2, etc.) so each bucket touches fewer distinct code paths, reducing JIT overhead. Total count is obtained via a regex scan of @reference_test occurrences in the included files.
- Pkg.test with PRECOMPILE_ONLY=true env var (test scripts exit(0) immediately). This resolves the full test environment and precompiles everything with the same flags the bucket jobs will use.
- ~/.julia/{compiled,packages,artifacts,scratchspaces} + monorepo/Manifest.toml is packed into a zstd tarball and uploaded. Bucket jobs download, extract, and go straight to Pkg.test (no pkg"dev", no Pkg.update()). This keeps the environment byte-identical between setup and buckets so pkgimages aren't invalidated.
- Multi-target string (generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1)) so pkgimages work across GitHub's heterogeneous runner fleet.
- _backend-tests.yml replaces the three near-identical backend workflows. Backends differ only in backend name, nbuckets, system-packages, and test-prefix (for xvfb-run wrapping).

Bucket counts: CairoMakie: 2, GLMakie: 2, WGLMakie: 4 (WGLMakie is the slowest).
Other changes:
- Removed coverage=true from test runs: there was no active codecov integration (.codecov.yml had everything disabled), and on Julia 1.10 coverage=true forces use_pkgimages=false, preventing pkgimage reuse.
- Set JULIA_PKG_PRECOMPILE_AUTO=0 around the install step so pkg"dev" doesn't kick off a stray precompile pass before Pkg.test resolves the real test env.
- Benchmarks now run only when a collaborator comments /benchmark on a PR (instead of running on every push). Sample count bumped 20→60.

Wall-clock impact (critical path: slowest backend+version job):
Speedup ~2.4x on first run. Subsequent runs on the same PR benefit from the julia-actions/cache restore in the setup job.

Checks
- ReferenceImages artifact produced with same structure as before (downstream reference_images_site.yml unchanged)
- /benchmark comment