Fixes bug preventing early stopping from exiting the training loop when not running in distributed mode (issue 1384) by richardtomsett · Pull Request #1502 · ACEsuit/mace

richardtomsett · 2026-06-16T21:49:17Z

Issue 1384 describes a bug where early stopping fails to stop the training loop when distributed training is not enabled. The bug appears to have been introduced in an earlier PR, which updated mace/tools/train.py so that all ranks exit the training loop correctly when patience is exceeded. In addressing that issue, the earlier change meant the training loop never correctly checks whether patience has been exceeded when not running in distributed mode.

This is a minimal fix that ensures patience being exceeded exits the loop in serial mode, and adds a test.
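
For readers unfamiliar with the failure mode, here is a minimal, self-contained sketch of the control flow involved. It is not the code in mace/tools/train.py: the names `should_stop`, `train`, and `broadcast` are hypothetical, and the `broadcast` callable only stands in for whatever collective the distributed path uses to share the stop decision across ranks. The point it illustrates is that the serial path must act on the local patience check directly, rather than relying on a step that only runs in distributed mode.

```python
from typing import Callable, Optional, Sequence


def should_stop(
    patience_exceeded: bool,
    distributed: bool,
    broadcast: Optional[Callable[[bool], bool]] = None,
) -> bool:
    """Decide whether the training loop should exit (illustrative only)."""
    if distributed:
        if broadcast is None:
            raise ValueError("distributed mode needs a broadcast callable")
        # Hypothetical stand-in for sharing one rank's decision with all ranks,
        # so that every rank leaves the loop on the same epoch.
        return broadcast(patience_exceeded)
    # Serial mode: there is nothing to synchronise, so use the local result.
    return patience_exceeded


def train(
    valid_losses: Sequence[float],
    patience: int,
    distributed: bool = False,
    broadcast: Optional[Callable[[bool], bool]] = None,
) -> int:
    """Walk over precomputed validation losses; return the epochs executed."""
    lowest_loss = float("inf")
    patience_counter = 0
    epochs_run = 0
    for loss in valid_losses:
        epochs_run += 1
        if loss >= lowest_loss:
            patience_counter += 1  # no improvement this epoch
        else:
            lowest_loss = loss
            patience_counter = 0
        if should_stop(patience_counter >= patience, distributed, broadcast):
            break
    return epochs_run


if __name__ == "__main__":
    # A flat loss with patience=1 stops on the second epoch in serial mode.
    print(train([0.114] * 10000, patience=1))
```

With a constant validation loss and `patience=1`, the sketch stops on the second epoch, which mirrors the behaviour this fix is meant to restore in serial mode.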

I've tested this locally on my laptop with a small training run with --max_num_epochs=10000 and --patience=1, and the training loop correctly exits due to patience being exceeded:

2026-06-16 22:47:40.107 INFO: ===========TRAINING===========
2026-06-16 22:47:40.107 INFO: Started training, reporting errors on validation set
2026-06-16 22:47:40.107 INFO: Loss metrics on validation set
/redacted/path/info/mace/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py:752: UserWarning: 'pin_memory' argument is set as true but not supported on MPS now, device pinned memory won't be used.
  super().__init__(loader)
2026-06-16 22:47:40.200 INFO: Initial: head: Default, loss=0.11416074, RMSE_E_per_atom=  337.88 meV, RMSE_F=    0.00 meV / A
2026-06-16 22:47:40.898 INFO: Epoch 0: head: Default, loss=0.11416074, RMSE_E_per_atom=  337.88 meV, RMSE_F=    0.00 meV / A
2026-06-16 22:47:41.364 INFO: Epoch 1: head: Default, loss=0.11416074, RMSE_E_per_atom=  337.88 meV, RMSE_F=    0.00 meV / A
2026-06-16 22:47:41.365 INFO: Stopping optimization after 1 epochs without improvement
2026-06-16 22:47:41.365 INFO: Training complete

NB: I have not tested this in distributed training mode on a machine with a GPU.

richardtomsett added 2 commits June 16, 2026 22:14

fix: fixes the bug preventing early stopping from exiting the training loop when not running in distributed mode (issue 1384), and adds a regression test

1890bc0

Reduce max_num_epochs in test from 100 to 10

238c9e2

richardtomsett wants to merge 2 commits into ACEsuit:main from richardtomsett:fix/non-distributed-early-stopping
