bugfix: reuse explicit GPU IDs cyclically across MPI ranks #1348

Open
hilaolu wants to merge 1 commit into 3dem:ver5.1 from hilaolu:ver5.1

@hilaolu hilaolu commented Jun 15, 2026

RELION currently handles explicit GPU-ID mappings inconsistently across MPI programs. Some paths, such as ml_optimiser_mpi.cpp, already support cyclic reuse of shorter GPU-ID lists. For example, --gpu "0:1" can be expanded over more MPI ranks by assigning ranks modulo the number of provided GPU-ID groups. Other paths assume a 1:1 mapping between GPU-ID groups and MPI ranks. In relion_autopick_mpi and relion_find_amyloid_mpi, this can index past the parsed GPU-ID list:

  allThreadIDs[node->rank][0]

On a single-GPU system, running neural-network/Topaz autopicking with multiple MPI ranks and --gpu "0" can therefore segfault or fail with an uninformative textToInteger error.

  [nixos:430042:0:430042] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
  [nixos:430046:0:430046] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x21)
  [nixos:430047:0:430047] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1de088db)
  [nixos:430043:0:430043] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
  [nixos:430040:0:430040] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x71)
  [nixos:430050:0:430050] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
  [nixos:430052:0:430052] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x61)
  in: /build/source/src/strings.cpp, line 251
  ERROR:
  Error in textToInteger
  [nixos:430054:0:430054] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
  [nixos:430045:0:430045] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
  --------------------------------------------------------------------------
  MPI_ABORT was invoked on rank 12 in communicator MPI_COMM_WORLD
    Proc: [[10062,1],12]
    Errorcode: 1

  NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
  You may or may not see output from other processes, depending on
  exactly when Open MPI kills them.
  --------------------------------------------------------------------------
  [nixos:430053:0:430053] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
  [nixos:430049:0:430049] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
  [nixos:430044:0:430044] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
  [nixos:430048:0:430048] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
  --------------------------------------------------------------------------
  prterun has exited due to process rank 12 with PID 0 on node nixos calling
  "abort". This may have caused other processes in the application to be
  terminated by signals sent by prterun (as reported here).
  --------------------------------------------------------------------------
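The out-of-range lookup behind this crash can be shown in isolation. The snippet below is a minimal sketch, not RELION code: the function name `directLookupIsOutOfRange` is hypothetical, and it assumes the parsed GPU-ID list is a vector of per-rank string vectors. It uses bounds-checked `at()` so the bad index surfaces as an exception instead of undefined behaviour.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper for illustration only: reports whether a direct
// allThreadIDs[rank][0] lookup would read past the parsed GPU-ID list.
bool directLookupIsOutOfRange(
    const std::vector<std::vector<std::string>> &allThreadIDs, int rank)
{
    try
    {
        // at() throws std::out_of_range where operator[] would be
        // undefined behaviour (the segfaults in the log above).
        (void)allThreadIDs.at(static_cast<std::size_t>(rank)).at(0);
        return false;
    }
    catch (const std::out_of_range &)
    {
        return true;
    }
}
```

With a single parsed group (as for --gpu "0"), the lookup is fine for rank 0 and out of range for every higher rank, such as the rank 12 that called MPI_ABORT above.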

This PR centralizes GPU-ID rank mapping in args.{h,cpp}:

  getDeviceIDsForRank(allThreadIDs, rank)

The helper validates parsed GPU IDs and reuses explicit GPU-ID groups cyclically when there are more MPI ranks than GPU-ID groups. This makes mappings like:

0
0:1
0:1:2:3

work naturally with larger MPI rank counts, while preserving existing fully explicit mappings such as:

0:1:0:1
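The cyclic mapping described above can be sketched roughly as follows. This is not the code from the PR: `splitDeviceGroups` and `deviceIdsForRankSketch` are hypothetical names, the sketch assumes that ':' separates per-rank groups and ',' separates per-thread IDs within a group, and it assumes the rank is used directly as the index (a program that reserves rank 0 as a leader would need an offset).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

using DeviceGroups = std::vector<std::vector<std::string>>;

// Hypothetical parser: "0,1:2,3" -> {{"0","1"},{"2","3"}}.
// Assumes ':' separates rank groups and ',' separates thread IDs.
DeviceGroups splitDeviceGroups(const std::string &spec)
{
    DeviceGroups groups(1, std::vector<std::string>(1));
    for (char c : spec)
    {
        if (c == ':')
            groups.push_back(std::vector<std::string>(1));
        else if (c == ',')
            groups.back().push_back(std::string());
        else
            groups.back().back().push_back(c);
    }
    return groups;
}

// Hypothetical stand-in for getDeviceIDsForRank(allThreadIDs, rank):
// validates the selected group and wraps the rank around the group count.
const std::vector<std::string> &deviceIdsForRankSketch(
    const DeviceGroups &groups, int rank)
{
    if (groups.empty())
        throw std::invalid_argument("no GPU-ID groups were given");
    if (rank < 0)
        throw std::invalid_argument("rank must be non-negative");

    // Cyclic reuse: rank N maps to group N modulo the number of groups.
    const std::vector<std::string> &ids =
        groups[static_cast<std::size_t>(rank) % groups.size()];

    for (const std::string &id : ids)
    {
        // Reject empty or non-numeric IDs with a readable message.
        if (id.empty() || id.find_first_not_of("0123456789") != std::string::npos)
            throw std::invalid_argument("invalid GPU ID: '" + id + "'");
    }
    return ids;
}
```

Under these assumptions, "0:1" gives ranks 0, 2, 4 the ID 0 and ranks 1, 3, 5 the ID 1, while a fully explicit "0:1:0:1" with four ranks yields the same assignment as before because the modulo never wraps.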

It also replaces direct std::isdigit(*gpu_ids.begin()) checks with a helper that safely treats empty GPU-ID strings as automatic mapping. This brings autopicking, amyloid tracing, MotionCor, and AreTomo GPU assignment behavior in line with the existing cyclic behavior in ml_optimiser_mpi.cpp.
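A guarded version of the isdigit check might look like the sketch below. The name `hasExplicitDeviceIdsSketch` is hypothetical and the PR's actual helper may differ; the point is only that the emptiness test comes before the first character is read, since dereferencing `begin()` on an empty string is not valid.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: true when the --gpu string starts with a digit,
// i.e. the user supplied explicit IDs. An empty string means automatic
// mapping and is never dereferenced.
bool hasExplicitDeviceIdsSketch(const std::string &gpu_ids)
{
    if (gpu_ids.empty())
        return false;
    // Cast to unsigned char: passing a negative char to std::isdigit
    // is undefined behaviour.
    return std::isdigit(static_cast<unsigned char>(gpu_ids.front())) != 0;
}
```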

It works for my case (a single GPU). Further testing or modifications are welcome.
