bugfix: reuse explicit GPU IDs cyclically across MPI ranks #1348
Open
hilaolu wants to merge 1 commit into
RELION currently handles explicit GPU-ID mappings inconsistently across MPI programs. Some paths, such as ml_optimiser_mpi.cpp, already support cyclic reuse of shorter GPU-ID lists. For example, --gpu "0:1" can be expanded over more MPI ranks by assigning ranks modulo the number of provided GPU-ID groups. Other paths assume a 1:1 mapping between GPU-ID groups and MPI ranks. In relion_autopick_mpi and relion_find_amyloid_mpi, this assumption can index past the end of the parsed GPU-ID list when there are more ranks than groups.
On a single-GPU system, running neural-network/Topaz autopicking with multiple MPI ranks and --gpu "0" can therefore segfault or fail with an uninformative textToInteger error.
This PR centralizes GPU-ID rank mapping in a helper in args.{h,cpp}.
The helper validates parsed GPU IDs and reuses explicit GPU-ID groups cyclically when there are more MPI ranks than GPU-ID groups. This makes mappings like:
0
0:1
0:1:2:3
work naturally with larger MPI rank counts, while preserving existing fully explicit mappings such as:
0:1:0:1
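The cyclic reuse described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the names splitGpuGroups and gpuGroupForRank are hypothetical, and the sketch assumes that ':' separates per-rank GPU-ID groups and that worker ranks are indexed from zero.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not RELION's actual API): split a --gpu string such as
// "0:1" into one GPU-ID group per ':'-separated field.
// Note: an empty input yields a single empty group.
std::vector<std::string> splitGpuGroups(const std::string& gpu_ids)
{
    std::vector<std::string> groups;
    std::size_t start = 0;
    while (true)
    {
        const std::size_t colon = gpu_ids.find(':', start);
        if (colon == std::string::npos)
        {
            groups.push_back(gpu_ids.substr(start));
            break;
        }
        groups.push_back(gpu_ids.substr(start, colon - start));
        start = colon + 1;
    }
    return groups;
}

// Hypothetical helper: pick the group for a zero-based worker index,
// wrapping around when there are more workers than groups.
std::string gpuGroupForRank(const std::vector<std::string>& groups,
                            std::size_t worker_index)
{
    if (groups.empty())
        throw std::invalid_argument("no GPU-ID groups given");
    return groups[worker_index % groups.size()];
}
```

Under these assumptions, "0" maps every worker to GPU 0, "0:1" alternates between GPUs 0 and 1, and a fully explicit "0:1:0:1" with four workers gives the same assignment as before, because the modulo is a no-op when the index is smaller than the number of groups.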
It also replaces direct std::isdigit(*gpu_ids.begin()) checks with a helper that safely treats empty GPU-ID strings as automatic mapping. This brings autopicking, amyloid tracing, MotionCor, and AreTomo GPU assignment behavior in line with the existing cyclic behavior in ml_optimiser_mpi.cpp.
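For reference, a safe replacement for the std::isdigit(*gpu_ids.begin()) pattern might look like the sketch below. The function name hasExplicitGpuIds is hypothetical and may differ from the helper in this PR. The point is that dereferencing begin() on an empty std::string is undefined behavior, so the empty case is checked first.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (name is an assumption, not taken from the PR):
// returns true when the --gpu string starts with a digit, i.e. the user
// supplied explicit GPU IDs. An empty string means automatic mapping.
bool hasExplicitGpuIds(const std::string& gpu_ids)
{
    if (gpu_ids.empty())
        return false; // nothing to inspect: fall back to automatic mapping

    // Cast to unsigned char: passing a negative char to std::isdigit
    // is undefined behavior.
    return std::isdigit(static_cast<unsigned char>(gpu_ids.front())) != 0;
}
```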
It works for my case (single GPU). More testing or further modifications are welcome.