feat(ci): parallelize reference tests via bucketing #5599

Open
jkrumbiegel wants to merge 34 commits into master from jk/parallel-reftests

Conversation

@jkrumbiegel
Member

@jkrumbiegel jkrumbiegel commented Apr 16, 2026

Summary

Parallelize reference tests by splitting them into buckets that run on separate CI machines, with a setup job that precompiles once per backend+version and shares the depot via artifact.

Pipeline per backend+version:

Setup (install + precompile) → bucket 1 ┐
                              → bucket 2 ├→ consolidation
                              → bucket N ┘
  • Bucketing is controlled by the REFTEST_BUCKET / REFTEST_NBUCKETS env vars in the @reference_test macro. It uses contiguous chunks of k tests each (tests 1 to k in bucket 1, k+1 to 2k in bucket 2, etc.) so each bucket touches fewer distinct code paths, reducing JIT overhead. The total count is obtained via a regex scan of @reference_test occurrences in the included files.
  • Setup job runs Pkg.test with PRECOMPILE_ONLY=true env var (test scripts exit(0) immediately). This resolves the full test environment and precompiles everything with the same flags the bucket jobs will use.
  • Depot transfer via artifact: ~/.julia/{compiled,packages,artifacts,scratchspaces} + monorepo/Manifest.toml are packed into a zstd tarball and uploaded. Bucket jobs download, extract, and go straight to Pkg.test (no pkg"dev", no Pkg.update()). This keeps the environment byte-identical between setup and buckets so pkgimages aren't invalidated.
  • JULIA_CPU_TARGET set to the portable multi-target string Julia itself uses (generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1)) so pkgimages work across GitHub's heterogeneous runner fleet.
  • Per-version independence: each Julia version runs as a separate reusable-workflow call, so Julia 1.10 buckets don't wait for Julia 1.12 setup (and vice versa).
  • Single parameterized workflow _backend-tests.yml replaces the three near-identical backend workflows. Backends differ only in backend name, nbuckets, system-packages, and test-prefix (for xvfb-run wrapping).
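
The bucketing rule in the first bullet can be sketched roughly as follows. This is a Python illustration, not the macro's actual Julia code: the function name `should_run`, the ceiling-based chunk size, and the total of 380 tests used in the usage note are assumptions. Only the env var names and the "all tests run when unset" behaviour come from the PR.

```python
import math
import os


def should_run(test_index, total_tests, env=None):
    """Return True if the 1-based test_index belongs to the current bucket."""
    env = os.environ if env is None else env
    bucket = env.get("REFTEST_BUCKET")
    nbuckets = env.get("REFTEST_NBUCKETS")
    if bucket is None or nbuckets is None:
        return True  # local run: env vars unset, so every test runs
    bucket, nbuckets = int(bucket), int(nbuckets)
    # Assumed chunk size: ceiling division, never smaller than 1.
    chunk = max(1, math.ceil(total_tests / nbuckets))
    # Clamp so tests beyond the estimated total fall into the last bucket.
    assigned = min((test_index - 1) // chunk + 1, nbuckets)
    return assigned == bucket
```

With an assumed total of 380 tests and 4 buckets, test 95 lands in bucket 1 and test 96 in bucket 2. An undercounted total only makes the last bucket slightly larger.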

Bucket counts: CairoMakie: 2, GLMakie: 2, WGLMakie: 4 (WGLMakie is the slowest).

Other changes:

  • Made a few WGLMakie reference tests independent of execution order (they previously relied on state left by preceding tests).
  • Dropped coverage=true from test runs: there was no active codecov integration (.codecov.yml had everything disabled), and on Julia 1.10 coverage=true forces use_pkgimages=false, preventing pkgimage reuse.
  • Set JULIA_PKG_PRECOMPILE_AUTO=0 around the install step so pkg"dev" doesn't kick off a stray precompile pass before Pkg.test resolves the real test env.
  • The compilation benchmark workflow is now triggered by a collaborator posting /benchmark on a PR (instead of running on every push). Sample count bumped 20→60.

Wall-clock impact (critical path: slowest backend+version job):

| | Unbucketed (master) | Bucketed, first run (no cache) |
|---|---|---|
| Wall-clock | ~65m (WGLMakie Julia 1) | ~27m (GLMakie Julia 1.10) |

Speedup ~2.4x on first run. Subsequent runs on the same PR benefit from the julia-actions/cache restore in the setup job.

Checks

  • CI green on all backends × versions × buckets
  • Consolidated ReferenceImages artifact produced with same structure as before (downstream reference_images_site.yml unchanged)
  • Local runs unaffected when env vars are unset
  • Benchmark workflow tested via /benchmark comment

@github-project-automation github-project-automation Bot moved this to Work in progress in PR review Apr 16, 2026
@MakieBot
Collaborator

MakieBot commented Apr 16, 2026

Benchmark Results

SHA: 15a8ac8db170a9190368bbc27a6276dc2ea26b8a

Warning

These results are subject to substantial noise because GitHub's CI runs on shared machines that are not ideally suited for benchmarking.

GLMakie
CairoMakie
WGLMakie

Split reference tests into N buckets (default 4) that run in parallel
CI jobs. Each bucket runs mod1(i, N) == j of the @reference_test
invocations, controlled by REFTEST_BUCKET and REFTEST_NBUCKETS env vars.
Local runs are unaffected (all tests run when env vars are unset).

The consolidation job now merges bucket artifacts per backend before
combining across backends, and correctly computes missing_files.txt
using the new skipped_names.txt output.
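
For readers unfamiliar with Julia's `mod1`, the round-robin assignment described in this commit can be sketched in Python. The helper names are hypothetical; only the `mod1(i, N) == j` rule is taken from the commit message.

```python
def mod1(i, n):
    """Python equivalent of Julia's mod1: the result lies in 1..n, not 0..n-1."""
    return (i - 1) % n + 1


def bucket_members(total_tests, nbuckets, bucket):
    """1-based indices of the tests one bucket runs under round-robin assignment."""
    return [i for i in range(1, total_tests + 1) if mod1(i, nbuckets) == bucket]
```

With 10 tests and 4 buckets, bucket 1 runs tests 1, 5, and 9, so neighbouring tests end up on different machines.
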
- html_widgets resize_to tests: add explicit WGLMakie.activate!() calls
  so they don't depend on px_per_unit/scalefactor state left by earlier
  tests (which may be skipped under bucketing)
- deletion tests: replace display(f) with colorbuffer(f) + getscreen()
  since the inline display path doesn't reliably wait for the WGLMakie
  session to be established

Adapt the bucket parallelization to the refactored CI that uses
reusable workflows (_cairomakie.yml, _glmakie.yml, _wglmakie.yml)
and a consolidation step in _reference-tests.yml.
@jkrumbiegel jkrumbiegel force-pushed the jk/parallel-reftests branch from 3a1583b to 55abaf8 Compare April 17, 2026 08:28
@jkrumbiegel jkrumbiegel added the skip-changelog Skips changelog enforcer label Apr 17, 2026
Use a custom cache-name with include-matrix: false so all 4 buckets
of the same backend and Julia version share one cache entry. Without
this, each bucket gets its own cache key (including the bucket number),
causing every bucket to precompile independently (~11 min each).

On the first run buckets still precompile in parallel (unavoidable),
but on subsequent runs all buckets hit the shared cache immediately.
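
Why the bucket number in the key defeats sharing can be illustrated with a toy key builder. The key layout below is hypothetical and is not the format julia-actions/cache actually produces; the cache name is also made up for illustration.

```python
def cache_key(cache_name, matrix, include_matrix):
    """Build a toy cache key; matrix values are appended only when requested."""
    parts = [cache_name]
    if include_matrix:
        parts += [f"{name}={matrix[name]}" for name in sorted(matrix)]
    return ";".join(parts)


def distinct_keys(cache_name, nbuckets, include_matrix):
    """Number of different cache entries the bucket jobs of one backend would use."""
    keys = {
        cache_key(cache_name, {"bucket": b}, include_matrix)
        for b in range(1, nbuckets + 1)
    }
    return len(keys)
```

With the matrix included, four buckets produce four keys. With it excluded, they all resolve to the same entry.
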
Contiguous chunks (tests 1-95 in bucket 1, 96-190 in bucket 2, etc.)
mean each bucket touches fewer distinct code paths than interleaved
mod1(i, N) assignment, reducing JIT/compilation overhead.

A dry-run counting pass determines the total number of tests before
the real run. Since the test files are already compiled from the
counting pass, the second include is essentially free.
@SimonDanisch
Member

Cool! Do we make sure that we reuse compilation?

Each backend now has a setup job that installs packages and runs
Pkg.precompile() before the bucket jobs start. Bucket jobs restore
the shared cache, so precompilation is already done.

Bucket counts adjusted to match backend speed:
- CairoMakie: 2 buckets (faster backend)
- GLMakie: 2 buckets (faster backend)
- WGLMakie: 4 buckets (slowest, ~1h unbucketed)

This reduces total runner-hours while keeping wall-clock time low,
which matters with limited org runner capacity.

Use Pkg.test with PRECOMPILE_ONLY=true in the setup job instead of
Pkg.precompile(). This ensures the test environment is resolved and
precompiled with exactly the same settings and dependencies as the
real test run. The test scripts exit(0) immediately when the flag
is set, after Pkg.test has done its precompilation work.

The julia-actions/cache post step runs too late to be available
to dependent jobs in the same run. Instead, upload compiled/,
packages/, and artifacts/ as an artifact from the setup job.
Bucket jobs download and extract it before running tests.

The julia-actions/cache is kept on the setup job for cross-run
caching (subsequent pushes to the same PR).

The bucket jobs were re-running pkg"dev ..." which regenerated the
Manifest with different state hashes, partially invalidating the
compiled pkgimages from the setup job and causing segfaults.

Fix: include monorepo/Manifest.toml in the depot artifact and skip
the install step in bucket jobs entirely, so the environment is
byte-identical to the setup job.

Mixing ~/.julia/ paths and workspace paths in one artifact caused
upload-artifact to use /home/runner/ as LCA, mangling the Manifest
path on download. Split into two artifacts per backend:
- julia-depot: ~/.julia/{compiled,packages,artifacts,scratchspaces}
  downloaded directly to ~/.julia
- manifest: monorepo/Manifest.toml downloaded to monorepo/
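
The least-common-ancestor effect can be reproduced with a small path calculation. This is only a sketch of the rooting rule; the workspace path in the usage note is an assumed example, not taken from the workflow.

```python
import posixpath


def artifact_layout(paths):
    """Return the common root of the paths and each path relative to that root."""
    root = posixpath.commonpath(paths)
    return root, [posixpath.relpath(p, root) for p in paths]
```

Combining `/home/runner/.julia/compiled` with an assumed workspace path such as `/home/runner/work/Makie.jl/Makie.jl/monorepo/Manifest.toml` gives the root `/home/runner`, so the Manifest is stored under `work/Makie.jl/...` rather than `monorepo/`. Uploading only depot directories keeps the root at `/home/runner/.julia`.
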
GitHub Actions artifacts don't preserve file permissions (executables
lose +x) and Julia's compiled pkgimages segfault when loaded on a
different runner instance. Both issues make the setup-job-with-artifact
approach unviable for sharing precompilation state.

Go back to the simpler design: shared cache keys across buckets of
the same backend+version. First run precompiles in parallel (same
wall-clock), subsequent runs all hit cache.

Also remove the now-unused PRECOMPILE_ONLY guards from runtests.

upload-artifact uses zip which strips Unix permissions (executable
bits), breaking compiled pkgimages and binaries like Electron.

Fix: tar the depot ourselves before uploading, preserving all
permissions. Bucket jobs extract with tar, getting a byte-identical
depot including correct executable bits on .so files and binaries.
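
That a tar round trip keeps the executable bit can be checked with a short script. The PR packs a zstd tarball in the workflow; this sketch uses an uncompressed tar from Python's standard library instead, and the file name `tool` is a hypothetical stand-in for a depot binary.

```python
import os
import stat
import tarfile
import tempfile


def roundtrip_mode(mode=0o755):
    """Pack one file into a tar archive, extract it elsewhere, return its mode bits."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src")
        dst = os.path.join(tmp, "dst")
        os.makedirs(src)
        os.makedirs(dst)
        binary = os.path.join(src, "tool")  # hypothetical executable
        with open(binary, "w") as fh:
            fh.write("#!/bin/sh\n")
        os.chmod(binary, mode)

        archive = os.path.join(tmp, "depot.tar")
        with tarfile.open(archive, "w") as tar:
            tar.add(binary, arcname="tool")

        # Pass an extraction filter where the running Python supports one;
        # the "data" filter keeps the owner-executable bit on regular files.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        with tarfile.open(archive) as tar:
            tar.extractall(dst, **kwargs)
        return stat.S_IMODE(os.stat(os.path.join(dst, "tool")).st_mode)
```

Calling `roundtrip_mode()` on a POSIX system returns a mode that still has the owner-executable bit set, which is the property the zip-based artifact upload loses.
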
The dry-run counting pass (include files with COUNT_ONLY=true)
caused method override warnings from double-including the test
files. Replace with a simple regex count of @reference_test
occurrences, recursively following include("...") directives.

If the count is slightly off (e.g. a test uses a loop), the last
bucket just gets a few more tests — no correctness issue.
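
A regex count that follows include directives might look like the sketch below. The patterns and function name are assumptions rather than the PR's Julia code, and `read_file` is a hypothetical callback so the sketch can run against in-memory sources.

```python
import posixpath
import re

REFTEST_RE = re.compile(r"@reference_test\b")
INCLUDE_RE = re.compile(r'include\("([^"]+)"\)')


def count_reference_tests(path, read_file, seen=None):
    """Count @reference_test occurrences in path and in every file it includes."""
    seen = set() if seen is None else seen
    if path in seen:
        return 0  # guard against counting a file twice
    seen.add(path)
    source = read_file(path)
    count = len(REFTEST_RE.findall(source))
    base = posixpath.dirname(path)
    for rel in INCLUDE_RE.findall(source):
        # Includes are resolved relative to the including file.
        child = posixpath.normpath(posixpath.join(base, rel))
        count += count_reference_tests(child, read_file, seen)
    return count
```

A test macro inside a loop is counted once by this scan even though it runs several times, which is the "slightly off" case mentioned above.
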
@jkrumbiegel
Member Author

jkrumbiegel commented Apr 17, 2026

Cool! Do we make sure that we reuse compilation?

It should compile once and then copy those files to the bucket jobs as an artifact. The compilation step is also cached so in the next run it should reuse that.

The resize_to EScreenshot tests measure DOM bounding boxes to determine
screenshot size. If the Electron window starts at a different size
(depending on what previous tests left behind), the layout engine
produces different bounding boxes, leading to screenshots of the wrong
size ("images don't have the same size, difference will be Inf").

Reset the window to 1200x900 before each display() call so the layout
starts from a deterministic state.

capturePage() captures at the device pixel ratio, which can differ
between CI runs depending on Electron window state. Resize the
captured image to the logical content size (win_size) before saving,
so screenshots have deterministic pixel dimensions regardless of DPR.

… calls

set_theme!() in @reference_test resets CURRENT_DEFAULT_THEME completely,
including WGLMakie screen config to px_per_unit=1, scalefactor=1. So the
resize_to tests' references were generated with those defaults, not with
px_per_unit=2 from the preceding test's activate! call.

Remove the activate!(px_per_unit=2) calls from resize_to tests - they
were incorrectly overriding what set_theme! establishes. Also revert
the snapshot_figure window reset and image resize changes which were
workarounds for this misdiagnosis.

Change from running on every PR to only when a collaborator comments
/benchmark on a PR. This saves CI resources for PRs that don't need
benchmarking.

Also triple the sample count (20 → 60) since benchmarks now run on
demand rather than on every push.
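
The trigger condition could be expressed as a predicate over the comment event payload. This is a sketch under assumptions: the allow-list of author associations and the function name are hypothetical, and the actual workflow expresses its condition in YAML, which is not shown here.

```python
# Assumed allow-list; the workflow's real permission check is not shown in the PR text.
ALLOWED_ASSOCIATIONS = {"OWNER", "MEMBER", "COLLABORATOR"}


def should_run_benchmark(event):
    """Decide whether an issue_comment event should start the benchmark job."""
    issue = event.get("issue", {})
    comment = event.get("comment", {})
    # Comments on pull requests carry a "pull_request" entry on the issue object.
    is_pull_request = "pull_request" in issue
    words = comment.get("body", "").split()
    is_command = words[:1] == ["/benchmark"]
    is_trusted = comment.get("author_association") in ALLOWED_ASSOCIATIONS
    return is_pull_request and is_command and is_trusted
```

Matching on the first word avoids firing on comments that merely mention `/benchmark` mid-sentence.
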
Set JULIA_PKG_PRECOMPILE_AUTO=0 on the install step so pkg"dev ..."
doesn't trigger precompilation. Precompilation happens once during
Pkg.test() which resolves the full test environment including test
deps. Without this, dev triggers a precompile pass against the
monorepo env, then Pkg.test resolves the test env and precompiles
again.

Backend workflows now take julia-version as an input instead of using
a version matrix. _reference-tests.yml calls each backend twice (once
per version). This means Julia 1.10 bucket jobs start as soon as the
1.10 setup finishes, without waiting for the Julia 1 setup (or vice
versa).
@jkrumbiegel jkrumbiegel force-pushed the jk/parallel-reftests branch from c98ef33 to d2b9642 Compare April 18, 2026 17:35
jkrumbiegel and others added 11 commits April 18, 2026 19:58
On Julia 1.10, coverage=true causes Pkg.test to spawn the subprocess
with use_pkgimages=false. The setup job was precompiling without
coverage (use_pkgimages=true), so the bucket jobs rejected all cached
.ji files due to the flag mismatch. Pass coverage=true in the setup
job's Pkg.test call so precompilation uses identical flags.

Also removes the JULIA_DEBUG=loading diagnostic from the previous
commit.

Coverage was configured with everything disabled (.codecov.yml has
project: false, patch: false, annotations: false). Removing it lets
Julia 1.10 use pkgimages (coverage=true forces use_pkgimages=false
on 1.10), significantly speeding up package loading in bucket jobs.

GitHub Actions runners are heterogeneous (mix of AMD Zen3, Intel
Haswell, etc.). Pkgimages compiled for znver3 are rejected on Intel
runners ("runtime-disabled features"). Use the same multi-target
string as Julia's official binaries so pkgimages work on any x86_64
runner.

Also removes remaining coverage=true from WGLMakie/GLMakie bucket
jobs that were missed in the previous commit.
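
The structure of the multi-target string from the PR description is easier to see when split apart. The parser below is a simplified sketch that only separates targets and their comma-separated flags; it does not interpret what the flags mean to Julia.

```python
def parse_cpu_target(spec):
    """Split a JULIA_CPU_TARGET string into one entry per target."""
    targets = []
    for entry in spec.split(";"):
        cpu, *flags = entry.split(",")
        targets.append({"cpu": cpu, "flags": flags})
    return targets


SPEC = "generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1)"
TARGETS = parse_cpu_target(SPEC)
```

The string describes three targets (`generic`, `sandybridge`, `haswell`), so a pkgimage built with it carries code that older and newer x86_64 runners can both load.
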
Replace the three near-identical _cairomakie.yml, _glmakie.yml,
_wglmakie.yml with a single _backend-tests.yml that takes inputs for
backend name, Julia version, bucket count, system packages, and shell.

The bucket list is auto-generated from nbuckets via seq|jq in the
setup job, eliminating the redundant bucket-list input.

The shell input allows callers to wrap test execution in xvfb-run
for display-dependent backends.
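
Generating the bucket list from a single count amounts to emitting a JSON array of 1..N for the job matrix. The workflow does this with seq and jq; the Python function below is an equivalent sketch with a hypothetical name, not the workflow's actual step.

```python
import json


def bucket_matrix(nbuckets):
    """Return a compact JSON array of bucket numbers, e.g. for a job matrix."""
    return json.dumps(list(range(1, nbuckets + 1)), separators=(",", ":"))
```

For the WGLMakie count of 4 this yields `[1,2,3,4]`, so callers only need to pass nbuckets.
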
@jkrumbiegel jkrumbiegel marked this pull request as ready for review April 20, 2026 11:28

Labels

skip-changelog Skips changelog enforcer

Projects

Status: Work in progress


3 participants