Fixes bug preventing early stopping from exiting the training loop when not running in distributed mode (issue 1384) #1502

Open
richardtomsett wants to merge 2 commits into ACEsuit:main from richardtomsett:fix/non-distributed-early-stopping

@richardtomsett

Issue 1384 describes a bug where early stopping fails to stop the training loop when distributed training is not enabled. The bug appears to have been introduced by an earlier PR, which updated mace/tools/train.py so that all ranks exit the training loop correctly when patience is exceeded. In fixing that, the earlier PR introduced a regression: when not running in distributed mode, the training loop never correctly checks whether patience has been exceeded.

This PR is a minimal fix: it ensures the training loop exits in serial (non-distributed) mode once patience is exceeded, and adds a regression test.
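For readers unfamiliar with the failure mode, here is a minimal sketch of the intended control flow. It is not the code in mace/tools/train.py: every name below (`patience_exceeded`, `run_training`, `sync_flag`) is hypothetical, and the `>=` comparison is an assumption about how patience is counted. The point is only that the serial branch must return the local stop decision itself, while the distributed branch shares it across ranks first.

```python
from typing import Callable, Optional, Sequence


def patience_exceeded(
    epochs_without_improvement: int,
    patience: int,
    distributed: bool,
    sync_flag: Optional[Callable[[bool], bool]] = None,
) -> bool:
    """Decide whether the training loop should exit (illustrative only)."""
    stop = epochs_without_improvement >= patience
    if distributed:
        # Hypothetical hook: in a real run this would broadcast or reduce the
        # flag so that every rank leaves the loop on the same epoch.
        if sync_flag is None:
            raise ValueError("sync_flag is required in distributed mode")
        return sync_flag(stop)
    # Serial mode: nothing to synchronise, so the local decision is final.
    return stop


def run_training(
    val_losses: Sequence[float],
    patience: int,
    distributed: bool = False,
    sync_flag: Optional[Callable[[bool], bool]] = None,
) -> int:
    """Walk through per-epoch validation losses; return the last epoch run."""
    best = float("inf")
    stale = 0
    last_epoch = -1
    for epoch, loss in enumerate(val_losses):
        last_epoch = epoch
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
        if patience_exceeded(stale, patience, distributed, sync_flag):
            break
    return last_epoch
```

With a flat validation loss and `patience=1`, `run_training([0.5, 0.5, 0.5, 0.5], patience=1)` stops at epoch 1 rather than running through every epoch.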

I've tested this locally on my laptop with a small training run with --max_num_epochs=10000 and --patience=1, and the training loop correctly exits due to patience being exceeded:

2026-06-16 22:47:40.107 INFO: ===========TRAINING===========
2026-06-16 22:47:40.107 INFO: Started training, reporting errors on validation set
2026-06-16 22:47:40.107 INFO: Loss metrics on validation set
/redacted/path/info/mace/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py:752: UserWarning: 'pin_memory' argument is set as true but not supported on MPS now, device pinned memory won't be used.
  super().__init__(loader)
2026-06-16 22:47:40.200 INFO: Initial: head: Default, loss=0.11416074, RMSE_E_per_atom=  337.88 meV, RMSE_F=    0.00 meV / A
2026-06-16 22:47:40.898 INFO: Epoch 0: head: Default, loss=0.11416074, RMSE_E_per_atom=  337.88 meV, RMSE_F=    0.00 meV / A
2026-06-16 22:47:41.364 INFO: Epoch 1: head: Default, loss=0.11416074, RMSE_E_per_atom=  337.88 meV, RMSE_F=    0.00 meV / A
2026-06-16 22:47:41.365 INFO: Stopping optimization after 1 epochs without improvement
2026-06-16 22:47:41.365 INFO: Training complete

NB I have not tested this in distributed training mode on a machine with a GPU.

…g loop when not running in distributed mode (issue 1384), and adds a regression test