GPU offloading preconditioner#4953

Merged
connorjward merged 18 commits into firedrakeproject:main from dsroberts:dsroberts/offload-pc
Apr 30, 2026

Conversation

@dsroberts
Contributor

Description

First pass at firedrake-configure for GPU-enabled PETSc builds, as well as a corresponding GitHub Actions workflow. An optional --gpu-arch flag has been added which adds all the necessary configuration to go from a fresh Ubuntu installation to a working Firedrake build, (mostly) following the existing build instructions. OffloadPC itself is more or less unchanged from @Olender's work for now, but with a few extra checks for whether GPU offload is available. I expect it will take a few iterations to get this going under CI, so once that's up and running reliably we can look at further development of the OffloadPC functionality and testing.

I've tried to put this together in a way that can be expanded upon fairly easily. We're interested in ROCm/HIP as well as CUDA, so there are a couple of dictionaries that, for now, have only a cuda key and will be expanded in time. I also thought that OffloadPC should be a no-op with a warning if there are any issues around GPUs, rather than crashing out. I think this is the right choice for workflow portability and heterogeneous systems, but it would be just as easy to raise exceptions there.

Comment thread firedrake/preconditioners/offload.py Outdated
Comment thread firedrake/preconditioners/offload.py Outdated
Comment thread firedrake/preconditioners/offload.py Outdated
Comment thread firedrake/preconditioners/offload.py Outdated
Comment thread firedrake/preconditioners/offload.py Outdated
Comment thread firedrake/preconditioners/offload.py Outdated
Comment thread tests/firedrake/offload/test_poisson_offloading_pc.py
Comment thread tests/firedrake/offload/test_poisson_offloading_pc.py Outdated
Comment thread tests/firedrake/offload/test_poisson_offloading_pc.py Outdated
Comment thread tests/firedrake/offload/test_poisson_offloading_pc.py Outdated
Comment thread tests/firedrake/offload/test_poisson_offloading_pc.py Outdated
Comment thread tests/firedrake/offload/test_poisson_offloading_pc.py Outdated
@pbrubeck
Contributor

Thanks for this, it looks really promising. I have some suggestions about the interface.

It seems to me that the current implementation might be abusing AssembledPC, but I believe it'd be possible to add code to AssembledPC to convert the matrix and vectors back and forth when assembled_mat_type: "aijcusparse" is set. Hopefully, having a single class makes the logic clearer and prevents undesired reassembly of the operators.

Once AssembledPC supports aijcusparse, we can implement a separate class as a shortcut, so users do not need to set the assembled_mat_type option.
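A rough sketch of the solver options this interface would enable. This is hypothetical: assembled_mat_type is the option proposed above, not existing Firedrake API, and the inner Jacobi PC is purely illustrative.

```python
# Hypothetical solver options for the interface proposed above.
# "assembled_mat_type" is the suggested new option, not existing
# Firedrake API; the inner Jacobi PC is just an illustration.
parameters = {
    "ksp_type": "cg",
    "pc_type": "python",
    "pc_python_type": "firedrake.AssembledPC",
    "assembled_mat_type": "aijcusparse",  # proposed: convert operator to device format
    "assembled_pc_type": "jacobi",        # inner PC would then act on the device matrix
}
```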

Comment thread .github/workflows/core.yml Outdated
Comment thread .github/workflows/core.yml Outdated
Comment thread .github/workflows/core.yml
Comment thread firedrake/preconditioners/offload.py Outdated
Comment thread firedrake/preconditioners/offload.py Outdated
Comment thread firedrake/preconditioners/offload.py
Comment thread .github/workflows/core.yml Outdated
Comment thread scripts/firedrake-configure Outdated
Comment thread scripts/firedrake-configure Outdated
Comment thread scripts/firedrake-configure Outdated
Comment thread .github/workflows/core.yml Outdated
@dsroberts dsroberts marked this pull request as draft March 10, 2026 22:56
@dsroberts
Contributor Author

I should have marked this as a draft from the start; apologies for missing that. @pbrubeck, thanks for the detailed feedback. I like your suggested approach of having AssembledPC perform the copy to device memory when asked. In my mind, that's more intuitive than having a separate 'preconditioner'. If we could do it by adding a parameter to AssembledPC, the user could then decide what to do when a GPU is unavailable, rather than us guessing as we do now. E.g.

    parameters = {
        "ksp_type": "preonly",
        "pc_type": "python",
        "pc_python_type": "firedrake.AssembledPC",
        ...
        "assembled_pc_offload": "try"
    }

Here offload can take the values never (do not offload, even if available; the default), try (attempt to offload; warn and continue if not possible) or always (attempt to offload and raise an exception if it fails). I'd let the implementation pick the device matrix type, as I want to expand this to AMD devices as well (mat_type=aijhipsparse); we have hardware available to test this. Anyway, that all depends on getting the build and test procedure working. There is still much to do there.
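The three-valued option described above could be dispatched roughly like this. A sketch only: resolve_offload, gpu_available and OffloadUnavailableError are hypothetical names, not part of the PR.

```python
# Hypothetical dispatch for the proposed "offload" option values.
# All names here are illustrative, not Firedrake or PR code.
import warnings


class OffloadUnavailableError(RuntimeError):
    """Raised when offload is required unconditionally but no GPU is usable."""


def resolve_offload(mode: str, gpu_available: bool) -> bool:
    """Return True if the operator should be copied to device memory."""
    if mode == "never":  # default: stay on the host
        return False
    if mode == "try":    # best effort: warn and fall back to host
        if not gpu_available:
            warnings.warn("GPU offload requested but unavailable; running on host")
            return False
        return True
    if mode == "always":  # hard requirement: fail loudly
        if not gpu_available:
            raise OffloadUnavailableError("GPU offload required but unavailable")
        return True
    raise ValueError(f"Unknown offload mode: {mode!r}")
```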

@dsroberts dsroberts added the gpu For special runner label Mar 11, 2026
@dsroberts
Contributor Author

The issue in the CI seems to indicate that the host that the GPU tests attempted to run on did not have Nvidia drivers installed. Is it possible there are runners with the gpu tag that don't have GPUs, or drivers are missing or something?

@connorjward
Contributor

> The issue in the CI seems to indicate that the host that the GPU tests attempted to run on did not have Nvidia drivers installed. Is it possible there are runners with the gpu tag that don't have GPUs, or drivers are missing or something?

I've passed this on so it should get investigated and fixed shortly.

@Olender
Contributor

Olender commented Mar 12, 2026

> The issue in the CI seems to indicate that the host that the GPU tests attempted to run on did not have Nvidia drivers installed. Is it possible there are runners with the gpu tag that don't have GPUs, or drivers are missing or something?

Have you tried setting up PETSc inside the Nvidia-provided Cuda Docker image locally and running the tests, instead of on your local PC installation, just to check? I attempted this before (at the time it was PETSc installed inside a clean cuda:12.9 Ubuntu Docker image from Nvidia, though I can check my Dockerfile to confirm if needed), and I couldn’t get PETSc to work properly in that setup.

When passing the --download-mpi flag to PETSc and then checking ompi_info, I saw the following MPI extensions: affinity, cuda, ftmpi, rocm, shortfloat. It seemed odd that both rocm and cuda appeared, especially since this was inside the CUDA container with no other prior installations.

Could you check whether you’re seeing the same extensions on your side?

Note: this was on an older PETSc version.

@dsroberts
Contributor Author

Thanks for the info @Olender. My testing setup is a little more complicated by necessity, as I don't have access to a system that I can both run Docker on and that has Nvidia hardware. I'm testing the build set-up from the base Ubuntu image in Docker, as in the longer term we're trying to augment the current installation instructions to include GPU builds. For that reason, I'm cross-compiling: I download the full CUDA software stack into the container (stuck at 12.9 until this PETSc issue is resolved) and build PETSc, though I can't run make check due to lack of hardware. In the meantime, for actual code development and testing, I'm using a bare-metal build on an HPC system we have access to.

In modifying firedrake-configure, I'm hoping that we can stick with apt-provided OpenMPI, as this minimises the differences between the GPU and non-GPU installation steps. It won't be 'GPU-aware', though I can't imagine a lot of people installing Firedrake locally need GPU-direct RDMA. If that doesn't work, I'd probably want to bring in a pre-configured OpenMPI via HPC-X rather than have PETSc build it; that would be a last resort if neither of the previous options works. On the HPC system, the sysadmins build OpenMPI with around 60 flags passed to ./configure; I trust that it is configured exactly as it needs to be for that system, and there is no way that a PETSc-built OpenMPI will match it in terms of stability and performance.

@dsroberts dsroberts force-pushed the dsroberts/offload-pc branch from 013d6f0 to 4e8ca5b on March 13, 2026 00:09
Contributor

@connorjward connorjward left a comment


I think this is getting better. The CI error now appears to be a more mundane missing apt-get update.
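For reference, the usual shape of this fix is to refresh the package index in the workflow step before installing. An illustrative GitHub Actions fragment; the step name and package are placeholders, not the actual contents of core.yml:

```yaml
- name: Install system dependencies
  run: |
    # Without `apt-get update`, fresh runners have stale or empty
    # package indexes and `apt-get install` cannot find packages.
    sudo apt-get update
    sudo apt-get install -y build-essential
```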

Comment thread .github/workflows/core.yml Outdated
Comment thread .github/workflows/core.yml
@dsroberts dsroberts force-pushed the dsroberts/offload-pc branch 4 times, most recently from 33bf483 to 04d12d9 on March 16, 2026 01:57
@dsroberts
Contributor Author

> I think this is getting better. The CI error now appears to be a more mundane missing apt-get update.

Yes, it's looking good now. After a few iterations we have a working build. Now I can work on integrating a GPU test into firedrake-check.

@dsroberts dsroberts force-pushed the dsroberts/offload-pc branch from 69c58e7 to decca0a on March 16, 2026 06:58
@pbrubeck
Contributor

> This preconditioner is only going to work with already-assembled matrices until we have one-form assembly on GPUs.

> For sure, that was more of a follow-up to @pbrubeck's comment from last month

> Is this because A cannot live in the GPU if it is matfree? Do we benefit much from A being matfree if P is not matfree?

> There are subtleties here I'm not across yet. Especially around fieldsplit where there are solves nested in other operations. I don't know how offload would be managed in that case, or even if it can be done.

I commented on the distinction between A and P because I saw one of the use cases with -offload_pc_type ksp. There we are expecting A and P to be used on the GPU; it makes sense to do a single copy to device when these two are the same object (the most common scenario), but we should be prepared for the other case as well.

@connorjward
Contributor

> So maybe pyproject.toml gets firedrake[cuda] and firedrake[hip] optional dependency sets, but then you have the conditional import problem.

Yeah this is the approach I was imagining. We already have to play conditional import games for things like JAX and PyTorch. I'm not overly concerned about this.

> It's one thing to initialise a device; it's another to ensure that GPU offloading has actually happened. Maybe I'm being overly cautious, but I don't think it's safe to just assume that if we found a GPU, it was used in the computation. For example, the SOR preconditioner has absolutely no GPU support, but is happy to take an aijcusparse matrix and ignore the device buffer in favour of doing all work on the CPU. I would want a test to make sure that doesn't happen in future on the cases we know offload properly today. The only way I know for sure to get that information is from -log_view.

Fair enough. That is challenging. You could probably achieve this in a reasonable way if you added a step to the GPU CI very much like this one in petsctools:

    python tests/firedrake/offload/test_poisson_offloading_pc.py -log_view | grep ...

Note that I'm running the test file as a script and therefore relying on the if __name__ == "__main__" block.

@connorjward
Contributor

@pbrubeck and @dsroberts I want to make sure that we're not being too perfectionist here. This is brand new functionality and doesn't have to be absolutely perfect in all circumstances from the outset. Once this is merged we have 6 months until it is in a release, and as such we can change the API as much as we like until then.

@dsroberts dsroberts force-pushed the dsroberts/offload-pc branch from 0bc7bc3 to ff2e343 on April 28, 2026 05:48
@dsroberts
Contributor Author

@connorjward you make a good point. I think what we have now is a good starting point and a clear idea of what work will come next. Since G-ADOPT is my priority, the next round of GPU work from me will be working out how to offload after a fieldsplit has been performed, with HIP/ROCm to come after that. Also, please ensure @Olender is listed as a co-author in the merge commit; this work borrows a lot from #4166, and @Olender absolutely deserves credit for that in the next round of change logs.

@dsroberts dsroberts marked this pull request as ready for review April 28, 2026 07:17
Contributor

@connorjward connorjward left a comment


Sorry it took me a couple of days to get to this. I think once these very minor things are addressed I am happy to merge.

Comment thread tests/firedrake/offload/test_poisson_offloading_pc.py Outdated
Comment thread firedrake/preconditioners/offload.py Outdated
Comment thread firedrake/preconditioners/offload.py Outdated
    prefix = pc.getOptionsPrefix() or ""
    options_prefix = prefix + self._prefix

    self.device_mat = device_matrix_type(pc.comm.rank == 0)
Contributor


I think this touches on a bigger open question about how we should do logging and warnings, and whether we want to force them to be collective. I am happy to continue to pass the rank here for the moment.

That said, I think making warn a keyword-only argument here is a good idea. Just add *, to the argument list.
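The keyword-only change amounts to inserting a bare `*` before `warn` in the signature. A minimal illustration; the function name and body are hypothetical, not the PR's code:

```python
import warnings


def check_offload_available(*, warn: bool = True) -> bool:
    """Return whether GPU offload is available; optionally warn if not.

    The bare `*` makes `warn` keyword-only: callers must write
    check_offload_available(warn=False) and cannot pass it positionally.
    """
    available = False  # placeholder for a real device query
    if not available and warn:
        warnings.warn("GPU offload unavailable; continuing on host")
    return available
```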

Comment thread firedrake/preconditioners/offload.py Outdated
Comment thread firedrake/utils.py Outdated
Comment thread firedrake/utils.py Outdated
Comment thread tests/firedrake/offload/test_poisson_offloading_pc.py Outdated
dsroberts and others added 5 commits April 30, 2026 10:19
Co-authored-by: Connor Ward <c.ward20@imperial.ac.uk>
Co-authored-by: Connor Ward <c.ward20@imperial.ac.uk>
Co-authored-by: Connor Ward <c.ward20@imperial.ac.uk>
Co-authored-by: Connor Ward <c.ward20@imperial.ac.uk>
…est for offload preconditioner. Add single-process tests to GPU github action.
@dsroberts dsroberts force-pushed the dsroberts/offload-pc branch from 70d7f93 to 47a19de on April 30, 2026 03:31
@dsroberts dsroberts force-pushed the dsroberts/offload-pc branch from d5cb43f to 1c49242 on April 30, 2026 06:48
@dsroberts
Contributor Author

dsroberts commented Apr 30, 2026

@connorjward I know what you mean regarding logging; I don't think there is a nice, one-size-fits-all solution. This was an easy one, as it's a collectively consistent function that will never change throughout a job run, hence using pc.comm.rank and @cache as output control mechanisms.
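The once-only output control described here can be sketched with functools.cache: because the argument is collectively consistent and never changes during a run, the cached call fires at most once per process, and only on rank 0. Names are illustrative, not the PR's code:

```python
from functools import cache
import warnings


@cache
def warn_offload_unavailable(is_rank_zero: bool) -> None:
    """Warn, once per process and only on rank 0, that GPU offload is unavailable.

    @cache memoises on the (collectively consistent) argument, so
    repeated calls with the same value are no-ops after the first.
    """
    if is_rank_zero:
        warnings.warn("GPU offload unavailable; OffloadPC will be a no-op")
```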

Thanks for pointing out that test. I've attempted to add the nproc=1 tests with -m skipslepc to the GPU CI. All CPU-only cases should still work with a GPU-enabled installation, and it isn't going to be feasible to put every GPU test into firedrake-check, so I think it's a worthwhile thing to add. It's in a "DROP BEFORE MERGE" commit for now; if the test that's running as I write this comment fails, it's probably best to drop it, and I'll come back to it in another PR, as it's coming up to the end of the working day for me.

ETA: I guess that should be -m 'not skipslepc'. On further reflection, I do think that's out of scope for this PR. Feel free to purge that last commit and review what's there. I'll open another PR for merging the offload testing framework into the existing pytest setup.

Comment thread firedrake/preconditioners/offload.py
Contributor

@connorjward connorjward left a comment


I am happy with this. Thank you so much for doing this! It's only a small PC but the surrounding work is very involved. This will be very useful for later GPU work.

@connorjward connorjward merged commit 8d92204 into firedrakeproject:main Apr 30, 2026
7 of 8 checks passed
@connorjward connorjward mentioned this pull request Apr 30, 2026