feat(ci): parallelize reference tests via bucketing by jkrumbiegel · Pull Request #5599 · MakieOrg/Makie.jl

jkrumbiegel · 2026-04-16T13:06:59Z

Summary

Parallelize reference tests by splitting them into buckets that run on separate CI machines, with a setup job that precompiles once per backend+version and shares the depot via artifact.

Pipeline per backend+version:

Setup (install + precompile) → bucket 1 ┐
                              → bucket 2 ├→ consolidation
                              → bucket N ┘

Bucketing controlled by REFTEST_BUCKET / REFTEST_NBUCKETS env vars in the @reference_test macro. Uses contiguous chunks (tests 1-N in bucket 1, N+1-2N in bucket 2, etc.) so each bucket touches fewer distinct code paths, reducing JIT overhead. Total count is obtained via a regex scan of @reference_test occurrences in the included files.
Setup job runs Pkg.test with PRECOMPILE_ONLY=true env var (test scripts exit(0) immediately). This resolves the full test environment and precompiles everything with the same flags the bucket jobs will use.
Depot transfer via artifact: ~/.julia/{compiled,packages,artifacts,scratchspaces} + monorepo/Manifest.toml are packed into a zstd-compressed tarball and uploaded. Bucket jobs download, extract, and go straight to Pkg.test (no pkg"dev", no Pkg.update()). This keeps the environment byte-identical between setup and buckets so pkgimages aren't invalidated.
JULIA_CPU_TARGET set to the portable multi-target string Julia itself uses (generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1)) so pkgimages work across GitHub's heterogeneous runner fleet.
Per-version independence: each Julia version runs as a separate reusable-workflow call, so Julia 1.10 buckets don't wait for Julia 1.12 setup (and vice versa).
Single parameterized workflow _backend-tests.yml replaces the three near-identical backend workflows. Backends differ only in backend name, nbuckets, system-packages, and test-prefix (for xvfb-run wrapping).

Bucket counts: CairoMakie: 2, GLMakie: 2, WGLMakie: 4 (WGLMakie is the slowest).

Other changes:

Made a few WGLMakie reference tests independent of execution order (they previously relied on state left by preceding tests).
Dropped coverage=true from test runs: there was no active codecov integration (.codecov.yml had everything disabled), and on Julia 1.10 coverage=true forces use_pkgimages=false, preventing pkgimage reuse.
Set JULIA_PKG_PRECOMPILE_AUTO=0 around the install step so pkg"dev" doesn't kick off a stray precompile pass before Pkg.test resolves the real test env.
The compilation benchmark workflow is now triggered by a collaborator posting /benchmark on a PR (instead of running on every push). Sample count bumped 20→60.

Wall-clock impact (critical path: slowest backend+version job):

| | Unbucketed (master) | Bucketed, first run (no cache) |
|---|---|---|
| Wall-clock | ~65m (WGLMakie Julia 1) | ~27m (GLMakie Julia 1.10) |

Speedup ~2.4x on first run. Subsequent runs on the same PR benefit from the julia-actions/cache restore in the setup job.

Checks

CI green on all backends × versions × buckets
Consolidated ReferenceImages artifact produced with same structure as before (downstream reference_images_site.yml unchanged)
Local runs unaffected when env vars are unset
Benchmark workflow tested via /benchmark comment

MakieBot · 2026-04-16T14:06:39Z

Benchmark Results

SHA: 15a8ac8db170a9190368bbc27a6276dc2ea26b8a

Warning

These results are subject to substantial noise because GitHub's CI runs on shared machines that are not ideally suited for benchmarking.

Split reference tests into N buckets (default 4) that run in parallel CI jobs. Bucket j runs the @reference_test invocations whose index i satisfies mod1(i, N) == j, controlled by the REFTEST_BUCKET and REFTEST_NBUCKETS env vars. Local runs are unaffected (all tests run when the env vars are unset). The consolidation job now merges bucket artifacts per backend before combining across backends, and correctly computes missing_files.txt using the new skipped_names.txt output.
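The interleaved assignment described above can be sketched in a few lines. This is a minimal Python illustration, not the macro's actual Julia code: `mod1` mirrors Julia's function of the same name, and `runs_in_this_bucket` is a hypothetical helper name introduced here.

```python
import os


def mod1(i: int, n: int) -> int:
    """1-based modulo, mirroring Julia's mod1: the result lies in 1..n."""
    return (i - 1) % n + 1


def runs_in_this_bucket(i: int, environ=os.environ) -> bool:
    """Decide whether the i-th test (1-based) runs in the current job."""
    bucket = environ.get("REFTEST_BUCKET")
    nbuckets = environ.get("REFTEST_NBUCKETS")
    if bucket is None or nbuckets is None:
        return True  # local run: env vars unset, so every test runs
    return mod1(i, int(nbuckets)) == int(bucket)


env = {"REFTEST_BUCKET": "2", "REFTEST_NBUCKETS": "4"}
selected = [i for i in range(1, 11) if runs_in_this_bucket(i, env)]
print(selected)  # [2, 6, 10]
```

With this scheme neighbouring tests land in different buckets, so every bucket ends up exercising most code paths.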

- html_widgets resize_to tests: add explicit WGLMakie.activate!() calls so they don't depend on px_per_unit/scalefactor state left by earlier tests (which may be skipped under bucketing)
- deletion tests: replace display(f) with colorbuffer(f) + getscreen() since the inline display path doesn't reliably wait for the WGLMakie session to be established

Adapt the bucket parallelization to the refactored CI that uses reusable workflows (_cairomakie.yml, _glmakie.yml, _wglmakie.yml) and a consolidation step in _reference-tests.yml.

Use a custom cache-name with include-matrix: false so all 4 buckets of the same backend and Julia version share one cache entry. Without this, each bucket gets its own cache key (including the bucket number), causing every bucket to precompile independently (~11 min each). On the first run buckets still precompile in parallel (unavoidable), but on subsequent runs all buckets hit the shared cache immediately.

Contiguous chunks (tests 1-95 in bucket 1, 96-190 in bucket 2, etc.) mean each bucket touches fewer distinct code paths than interleaved mod1(i, N) assignment, reducing JIT/compilation overhead. A dry-run counting pass determines the total number of tests before the real run. Since the test files are already compiled from the counting pass, the second include is essentially free.
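As a rough sketch of the contiguous scheme, the bucket for a test index can be computed from the total count and the bucket count. The Python below is illustrative only; `chunk_bucket` is a hypothetical name, and the total of 380 is an assumed figure chosen so that chunks come out at 95 tests, matching the example above.

```python
import math


def chunk_bucket(i: int, total: int, nbuckets: int) -> int:
    """Bucket (1-based) for the i-th test (1-based) under contiguous chunking."""
    chunk = math.ceil(total / nbuckets)
    # Clamp so that indices beyond the counted total fall into the last bucket.
    return min((i - 1) // chunk + 1, nbuckets)


total, nbuckets = 380, 4  # assumed numbers, giving chunks of 95 tests
print(chunk_bucket(95, total, nbuckets))   # 1
print(chunk_bucket(96, total, nbuckets))   # 2
print(chunk_bucket(190, total, nbuckets))  # 2
print(chunk_bucket(400, total, nbuckets))  # 4
```

The clamp means an undercounted total does not drop tests: anything past the expected range simply runs in the final bucket.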

SimonDanisch · 2026-04-17T09:30:55Z

Cool! Do we make sure, that we re-use compilation?

Each backend now has a setup job that installs packages and runs Pkg.precompile() before the bucket jobs start. Bucket jobs restore the shared cache, so precompilation is already done. Bucket counts adjusted to match backend speed:
- CairoMakie: 2 buckets (faster backend)
- GLMakie: 2 buckets (faster backend)
- WGLMakie: 4 buckets (slowest, ~1h unbucketed)

This reduces total runner-hours while keeping wall-clock time low, which matters with limited org runner capacity.

Use Pkg.test with PRECOMPILE_ONLY=true in the setup job instead of Pkg.precompile(). This ensures the test environment is resolved and precompiled with exactly the same settings and dependencies as the real test run. The test scripts exit(0) immediately when the flag is set, after Pkg.test has done its precompilation work.

The julia-actions/cache post step runs too late to be available to dependent jobs in the same run. Instead, upload compiled/, packages/, and artifacts/ as an artifact from the setup job. Bucket jobs download and extract it before running tests. The julia-actions/cache is kept on the setup job for cross-run caching (subsequent pushes to the same PR).

The bucket jobs were re-running pkg"dev ..." which regenerated the Manifest with different state hashes, partially invalidating the compiled pkgimages from the setup job and causing segfaults. Fix: include monorepo/Manifest.toml in the depot artifact and skip the install step in bucket jobs entirely, so the environment is byte-identical to the setup job.

Mixing ~/.julia/ paths and workspace paths in one artifact caused upload-artifact to use /home/runner/ as LCA, mangling the Manifest path on download. Split into two artifacts per backend:
- julia-depot: ~/.julia/{compiled,packages,artifacts,scratchspaces} downloaded directly to ~/.julia
- manifest: monorepo/Manifest.toml downloaded to monorepo/
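The least-common-ancestor (LCA) effect is easy to reproduce with plain path arithmetic. The sketch below assumes a typical runner workspace path for the Manifest; the real checkout location may differ, and this only models the path logic, not upload-artifact itself.

```python
import posixpath

depot_paths = [
    "/home/runner/.julia/compiled",
    "/home/runner/.julia/packages",
    "/home/runner/.julia/artifacts",
    "/home/runner/.julia/scratchspaces",
]
# Assumed workspace location, used for illustration only.
manifest = "/home/runner/work/Makie.jl/Makie.jl/monorepo/Manifest.toml"

# One artifact with both kinds of paths: the common root is the home directory.
mixed_root = posixpath.commonpath(depot_paths + [manifest])
# Depot paths alone share a much tighter root.
depot_root = posixpath.commonpath(depot_paths)

print(mixed_root)  # /home/runner
print(depot_root)  # /home/runner/.julia
# Inside the mixed artifact the Manifest carries the whole workspace prefix.
print(posixpath.relpath(manifest, mixed_root))
# work/Makie.jl/Makie.jl/monorepo/Manifest.toml
```

Keeping the two groups in separate artifacts gives each one a root that matches its download target.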

GitHub Actions artifacts don't preserve file permissions (executables lose +x) and Julia's compiled pkgimages segfault when loaded on a different runner instance. Both issues make the setup-job-with-artifact approach unviable for sharing precompilation state. Go back to the simpler design: shared cache keys across buckets of the same backend+version. First run precompiles in parallel (same wall-clock), subsequent runs all hit cache. Also remove the now-unused PRECOMPILE_ONLY guards from runtests.

upload-artifact uses zip which strips Unix permissions (executable bits), breaking compiled pkgimages and binaries like Electron. Fix: tar the depot ourselves before uploading, preserving all permissions. Bucket jobs extract with tar, getting a byte-identical depot including correct executable bits on .so files and binaries.
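A small round trip shows why a tar archive keeps the depot usable. This is a Python sketch using the standard `tarfile` module with gzip compression as a stand-in (the workflow itself uses the tar CLI, later with zstd); the file name `libexample.so` and the helper names are hypothetical.

```python
import os
import stat
import tarfile
import tempfile


def pack(src_dir: str, tar_path: str) -> None:
    """Archive the contents of src_dir; tar headers record each file's mode."""
    with tarfile.open(tar_path, "w:gz") as tar:
        for entry in sorted(os.listdir(src_dir)):
            tar.add(os.path.join(src_dir, entry), arcname=entry)


def unpack(tar_path: str, dest_dir: str) -> None:
    """Extract the archive, restoring the recorded file modes."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(tar_path, "r:gz") as tar:
        tar.extractall(dest_dir)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "depot")
    os.makedirs(os.path.join(src, "compiled"))
    lib = os.path.join(src, "compiled", "libexample.so")  # hypothetical file
    with open(lib, "wb") as f:
        f.write(b"\x7fELF")
    os.chmod(lib, 0o755)  # mark as executable, like a real shared library

    tar_path = os.path.join(tmp, "depot.tar.gz")
    pack(src, tar_path)
    unpack(tar_path, os.path.join(tmp, "restored"))

    restored = os.path.join(tmp, "restored", "compiled", "libexample.so")
    exec_bit_kept = bool(os.stat(restored).st_mode & stat.S_IXUSR)
    print(exec_bit_kept)  # True on POSIX systems
```

Because the mode bits travel inside the tarball, the outer artifact container no longer matters for permissions.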

The dry-run counting pass (include files with COUNT_ONLY=true) caused method override warnings from double-including the test files. Replace with a simple regex count of @reference_test occurrences, recursively following include("...") directives. If the count is slightly off (e.g. a test uses a loop), the last bucket just gets a few more tests — no correctness issue.
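The regex-based count might look roughly like the following. It is a Python sketch of the idea rather than the PR's Julia implementation: the two patterns and the function name `count_reference_tests` are assumptions, and there is no protection against include cycles.

```python
import re
import tempfile
from pathlib import Path

TEST_RE = re.compile(r"@reference_test\b")
INCLUDE_RE = re.compile(r'include\("([^"]+)"\)')


def count_reference_tests(path: Path) -> int:
    """Count @reference_test occurrences in path and in every file it includes."""
    text = path.read_text()
    count = len(TEST_RE.findall(text))
    for rel in INCLUDE_RE.findall(text):
        child = path.parent / rel  # includes resolve relative to the including file
        if child.is_file():
            count += count_reference_tests(child)
    return count


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "main.jl").write_text(
        'include("a.jl")\n@reference_test "one" begin end\n'
    )
    (root / "a.jl").write_text(
        '@reference_test "two" begin end\n@reference_test "three" begin end\n'
    )
    total = count_reference_tests(root / "main.jl")
    print(total)  # 3
```

A purely textual scan counts source occurrences, not executions, which is why a test defined inside a loop makes the count slightly low.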

jkrumbiegel · 2026-04-17T11:37:14Z

> Cool! Do we make sure, that we re-use compilation?

It should compile once and then copy those files to the bucket jobs as an artifact. The compilation step is also cached so in the next run it should reuse that.

The resize_to EScreenshot tests measure DOM bounding boxes to determine screenshot size. If the Electron window starts at a different size (depending on what previous tests left behind), the layout engine produces different bounding boxes, leading to screenshots of the wrong size ("images don't have the same size, difference will be Inf"). Reset the window to 1200x900 before each display() call so the layout starts from a deterministic state.

capturePage() captures at the device pixel ratio, which can differ between CI runs depending on Electron window state. Resize the captured image to the logical content size (win_size) before saving, so screenshots have deterministic pixel dimensions regardless of DPR.

… calls set_theme!() in @reference_test resets CURRENT_DEFAULT_THEME completely, including WGLMakie screen config to px_per_unit=1, scalefactor=1. So the resize_to tests' references were generated with those defaults, not with px_per_unit=2 from the preceding test's activate! call. Remove the activate!(px_per_unit=2) calls from resize_to tests - they were incorrectly overriding what set_theme! establishes. Also revert the snapshot_figure window reset and image resize changes which were workarounds for this misdiagnosis.

Change from running on every PR to only when a collaborator comments /benchmark on a PR. This saves CI resources for PRs that don't need benchmarking. Also triple the sample count (20 → 60) since benchmarks now run on demand rather than on every push.

Set JULIA_PKG_PRECOMPILE_AUTO=0 on the install step so pkg"dev ..." doesn't trigger precompilation. Precompilation happens once during Pkg.test() which resolves the full test environment including test deps. Without this, dev triggers a precompile pass against the monorepo env, then Pkg.test resolves the test env and precompiles again.

Backend workflows now take julia-version as an input instead of using a version matrix. _reference-tests.yml calls each backend twice (once per version). This means Julia 1.10 bucket jobs start as soon as the 1.10 setup finishes, without waiting for the Julia 1 setup (or vice versa).

On Julia 1.10, coverage=true causes Pkg.test to spawn the subprocess with use_pkgimages=false. The setup job was precompiling without coverage (use_pkgimages=true), so the bucket jobs rejected all cached .ji files due to the flag mismatch. Pass coverage=true in the setup job's Pkg.test call so precompilation uses identical flags. Also removes the JULIA_DEBUG=loading diagnostic from the previous commit.

Coverage was configured with everything disabled (.codecov.yml has project: false, patch: false, annotations: false). Removing it lets Julia 1.10 use pkgimages (coverage=true forces use_pkgimages=false on 1.10), significantly speeding up package loading in bucket jobs.

GitHub Actions runners are heterogeneous (mix of AMD Zen3, Intel Haswell, etc.). Pkgimages compiled for znver3 are rejected on Intel runners ("runtime-disabled features"). Use the same multi-target string as Julia's official binaries so pkgimages work on any x86_64 runner. Also removes remaining coverage=true from WGLMakie/GLMakie bucket jobs that were missed in the previous commit.
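To make the structure of the multi-target string from the summary easier to read, it can be split into its parts. The Python below is only a textual breakdown; `parse_cpu_target` is a hypothetical helper and does not reproduce how Julia interprets the individual flags.

```python
def parse_cpu_target(spec: str):
    """Split a JULIA_CPU_TARGET string into targets and their modifiers."""
    targets = []
    for part in spec.split(";"):      # ';' separates the individual targets
        name, *features = part.split(",")  # ',' separates name and modifiers
        targets.append({"name": name, "features": features})
    return targets


spec = "generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1)"
for target in parse_cpu_target(spec):
    print(target["name"], target["features"])
# generic []
# sandybridge ['-xsaveopt', 'clone_all']
# haswell ['-rdrnd', 'base(1)']
```

The string therefore names three targets, with `generic` as the baseline that any x86_64 runner can load.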

Replace the three near-identical _cairomakie.yml, _glmakie.yml, _wglmakie.yml with a single _backend-tests.yml that takes inputs for backend name, Julia version, bucket count, system packages, and shell. The bucket list is auto-generated from nbuckets via seq|jq in the setup job, eliminating the redundant bucket-list input. The shell input allows callers to wrap test execution in xvfb-run for display-dependent backends.
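The bucket list that feeds the job matrix is just a compact JSON array of 1..nbuckets. The exact seq|jq invocation is not shown in this PR text, so the Python below is an equivalent sketch of the output shape; `bucket_list` is a hypothetical name.

```python
import json


def bucket_list(nbuckets: int) -> str:
    """Return a compact JSON array of bucket numbers, e.g. for a job matrix."""
    # Compact separators keep the value on one line without spaces.
    return json.dumps(list(range(1, nbuckets + 1)), separators=(",", ":"))


print(bucket_list(4))  # [1,2,3,4]
print(bucket_list(2))  # [1,2]
```

Deriving the list from nbuckets keeps the two from drifting apart, which is the point of dropping the separate bucket-list input.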

github-project-automation Bot added this to PR review Apr 16, 2026

github-project-automation Bot moved this to Work in progress in PR review Apr 16, 2026

jkrumbiegel added 3 commits April 17, 2026 10:25

ci: apply bucketing to new reusable workflow structure

55abaf8

jkrumbiegel force-pushed the jk/parallel-reftests branch from 3a1583b to 55abaf8 Compare April 17, 2026 08:28

jkrumbiegel added the skip-changelog Skips changelog enforcer label Apr 17, 2026

jkrumbiegel added 2 commits April 17, 2026 10:57

jkrumbiegel added 9 commits April 17, 2026 11:35

fix(ci): create ~/.julia before restoring depot artifact

297d176

jkrumbiegel added 9 commits April 17, 2026 14:34

ci: compress depot tarball with zstd to speed up upload/download

2692ac8

fix(ci): create monorepo/ dir before extracting Manifest

d932aaa

ci: disable auto-precompile in compute-pipeline and makie workflows too

36ef14e

jkrumbiegel force-pushed the jk/parallel-reftests branch from c98ef33 to d2b9642 Compare April 18, 2026 17:35

jkrumbiegel and others added 11 commits April 18, 2026 19:58

debug: add JULIA_DEBUG=loading to CairoMakie bucket 1 to diagnose 1.10 recompilation

77e6aae

debug: add JULIA_DEBUG=loading to all setup jobs

f5e13ae

Merge branch 'master' into jk/parallel-reftests

d53028f

fix(ci): replace shell input with test-prefix (shell doesn't support expressions)

97d0a18

fix(ci): use compact jq output for bucket list

573e609

ci: remove JULIA_DEBUG=loading from setup jobs

a2f0076

ci: add comments about pkgimage cache mismatches and coverage

23ac427

jkrumbiegel marked this pull request as ready for review April 20, 2026 11:28

jkrumbiegel wants to merge 34 commits into master from jk/parallel-reftests

3 participants
