bugfix: reuse explicit GPU IDs cyclically across MPI ranks by hilaolu · Pull Request #1348 · 3dem/relion

hilaolu · 2026-06-15T17:55:24Z

RELION currently handles explicit GPU-ID mappings inconsistently across MPI programs. Some paths, such as ml_optimiser_mpi.cpp, already support cyclic reuse of shorter GPU-ID lists. For example, --gpu "0:1" can be expanded over more MPI ranks by assigning ranks modulo the number of provided GPU-ID groups. Other paths assume a 1:1 mapping between GPU-ID groups and MPI ranks. In relion_autopick_mpi and relion_find_amyloid_mpi, this can index past the parsed GPU-ID list:

  allThreadIDs[node->rank][0]

On a single-GPU system, running neural-network/Topaz autopicking with multiple MPI ranks and --gpu "0" can therefore segfault or fail with an uninformative textToInteger error:

  [nixos:430042:0:430042] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
  [nixos:430046:0:430046] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x21)
  [nixos:430047:0:430047] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1de088db)
  [nixos:430043:0:430043] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
  [nixos:430040:0:430040] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x71)
  [nixos:430050:0:430050] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
  [nixos:430052:0:430052] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x61)
  in: /build/source/src/strings.cpp, line 251
  ERROR:
  Error in textToInteger
  [nixos:430054:0:430054] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
  [nixos:430045:0:430045] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
  --------------------------------------------------------------------------
  MPI_ABORT was invoked on rank 12 in communicator MPI_COMM_WORLD
    Proc: [[10062,1],12]
    Errorcode: 1

  NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
  You may or may not see output from other processes, depending on
  exactly when Open MPI kills them.
  --------------------------------------------------------------------------
  [nixos:430053:0:430053] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
  [nixos:430049:0:430049] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
  [nixos:430044:0:430044] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
  [nixos:430048:0:430048] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
  --------------------------------------------------------------------------
  prterun has exited due to process rank 12 with PID 0 on node nixos calling
  "abort". This may have caused other processes in the application to be
  terminated by signals sent by prterun (as reported here).
  --------------------------------------------------------------------------

This PR centralizes GPU-ID rank mapping in args.{h,cpp}:

  getDeviceIDsForRank(allThreadIDs, rank)

The helper validates parsed GPU IDs and reuses explicit GPU-ID groups cyclically when there are more MPI ranks than GPU-ID groups. This makes mappings like:

  0
  0:1
  0:1:2:3

work naturally with larger MPI rank counts, while preserving existing fully explicit mappings such as:

  0:1:0:1
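
For reviewers, the cyclic reuse described above can be sketched roughly as follows. This is not the code in this PR: parseDeviceGroups and deviceIDsForRank are hypothetical stand-ins, the ':' (group) and ',' (ID within a group) separators are assumed from the examples, and rank is treated as a zero-based index without any leader-rank offset.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical types and names for illustration only; the helper added in
// args.{h,cpp} may have a different signature and stricter validation.
using DeviceGroups = std::vector<std::vector<std::string>>;

// Split e.g. "0,1:2" into {{"0","1"},{"2"}}.
// Assumption: ':' separates per-rank groups, ',' separates IDs inside a group.
DeviceGroups parseDeviceGroups(const std::string &spec)
{
    DeviceGroups groups;
    if (spec.empty())
        return groups; // empty spec: no explicit mapping

    groups.emplace_back();
    std::string current;
    for (char c : spec)
    {
        if (c == ':' || c == ',')
        {
            groups.back().push_back(current);
            current.clear();
            if (c == ':')
                groups.emplace_back(); // start the next group
        }
        else
        {
            current += c;
        }
    }
    groups.back().push_back(current);
    return groups;
}

// Pick the group for a rank, wrapping around when there are more ranks
// than groups, instead of indexing groups[rank] directly.
const std::vector<std::string> &deviceIDsForRank(const DeviceGroups &groups, int rank)
{
    assert(rank >= 0);
    if (groups.empty())
        throw std::invalid_argument("no explicit GPU-ID groups given");
    return groups[static_cast<std::size_t>(rank) % groups.size()];
}
```

With this sketch, "0" maps every rank to GPU 0, "0:1" alternates between GPU 0 and GPU 1, and a fully explicit list such as "0:1:0:1" is unchanged as long as the rank count does not exceed the number of groups.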

The PR also replaces direct std::isdigit(*gpu_ids.begin()) checks with a helper that safely treats an empty GPU-ID string as a request for automatic mapping. Together, these changes bring GPU assignment in autopicking, amyloid tracing, MotionCor, and AreTomo in line with the existing cyclic behavior in ml_optimiser_mpi.cpp.
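
A minimal sketch of such a guarded check, for illustration only. The name hasExplicitDeviceIDs is hypothetical and is not taken from this PR; the point is simply to test for emptiness before looking at the first character.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper name; the function added by the PR may differ.
// Returns true when the string starts with a digit (explicit GPU IDs),
// false when it is empty or non-numeric (treated here as automatic mapping).
bool hasExplicitDeviceIDs(const std::string &gpu_ids)
{
    // Check emptiness first: dereferencing begin() of an empty string
    // is formally undefined behaviour.
    if (gpu_ids.empty())
        return false;

    // Cast to unsigned char: std::isdigit is undefined for negative char values.
    return std::isdigit(static_cast<unsigned char>(gpu_ids.front())) != 0;
}
```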

It works for my case (single GPU). More testing and further modifications are welcome.

bugfix: reuse explicit GPU IDs cyclically across MPI ranks

a5929dc

hilaolu wants to merge 1 commit into 3dem:ver5.1 from hilaolu:ver5.1
