Skip to content

Test on MacOS and remove progress bar from example#159

Open
Yoshanuikabundi wants to merge 17 commits into
mainfrom
notebook_timeout
Open

Test on MacOS and remove progress bar from example#159
Yoshanuikabundi wants to merge 17 commits into
mainfrom
notebook_timeout

Conversation

@Yoshanuikabundi

@Yoshanuikabundi Yoshanuikabundi commented Nov 12, 2024

Copy link
Copy Markdown
Contributor

Fixes #133

Changes made in this Pull Request:

  • Adds MacOS back to examples testing matrix
  • Reduces amount of output sent during fit

The timeout that's being triggered is not the cell timeout (default about 30 minutes), but a hard coded one for the notebook kernel to transmit the cell's output to nbval. My hypothesis is that the fit takes serious time and the progress bar is constantly sending updates to the output, producing more output than can be processed in 5 seconds. Platform differences may reflect that the MacOS runners take more time (and therefore produce more progress bar updates), or something weird to do with how outputs are transmitted.

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@Yoshanuikabundi

Copy link
Copy Markdown
Contributor Author

Oops didn't mean to reformat the workflow yaml :/ will fix once I know if this worked

@codecov-commenter

codecov-commenter commented Nov 12, 2024

Copy link
Copy Markdown

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.53%. Comparing base (5b4b1d4) to head (a487f72).

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread .github/workflows/examples-ci.yaml Outdated

fail-fast: false
matrix:
os: [macOS-12, ubuntu-latest]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

macos-12 is being removed next month 😞

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is macos-latest appropriate?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think so, it's no longer intel but AmberTools doesn't work on macOS-13. Thanks so much for looking at this!

@Yoshanuikabundi

Copy link
Copy Markdown
Contributor Author

Hey look it seems to have worked! I've switched to macos-latest as Matt advises macos-12 is being removed, but the examples CI had already passed so I think I've correctly diagnosed and fixed the issue.

@Yoshanuikabundi Yoshanuikabundi marked this pull request as ready for review November 12, 2024 05:00
@Yoshanuikabundi

Copy link
Copy Markdown
Contributor Author

Ok looks like there are fun new errors to deal with. Not sure I have any insight on these ones, I've tried a few things and if this one doesn't work I'm out of ideas. If the current round of tests fail, I can revert to macos-12 and leave this for the future, or else I'm totally happy for you to hack on this branch :)

@lilyminium

lilyminium commented Nov 12, 2024

Copy link
Copy Markdown
Collaborator

Thanks so much for fixing the issue @Yoshanuikabundi, you are a wizard!!

RuntimeError: MPS backend out of memory (MPS allocated: 8.00 MB, other allocations: 16.00 KB, max allowed: 7.93 GB). Tried to allocate 256 bytes on shared pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).

What an exotic new error I've never seen before, even when Mac examples was running, haha. It looks like this? actions/runner-images#9918 So not easily fixable with until macos-15 is out, and with macos-13 out of the ambertools picture, downgrading to macos-12 seems the only way for CI to pass (until macos-12 disappears.)

Edit: unless there's a way to tell torch not to use mps?

Comment thread .github/workflows/examples-ci.yaml Outdated
mattwthompson and others added 9 commits January 30, 2025 08:21
* Add GPU-enabled CI

* Update file reference

* Remove other CI

* Bump runner version

* Try allocating 10 GB

* Debug

* Debug

* Sync CUDA environment with DGL environment

* Debug Torch/CUDA interaction

* Try adding `pytorch-gpu`

* Debug

* tmp add print and trial test

* check dgl

* add torchdata package

* add other torch packages required

* Try bumping to newer DGL channel targeting PyTorch 2.1

* Add back `pytorch-gpu`?

* Revert "tmp add print and trial test"

This reverts commit 6c5f42c.

* Revert more temporary changes, fix coverage

* Syntax

* Debug

* Debug

* Fix

---------

Co-authored-by: Lily Wang <lily@mdanalysis.org>
* add weights_only=False

* Revert "add weights_only=False"

This reverts commit fb71d58.

* Revert "Revert "add weights_only=False""

This reverts commit 6e19fb4.
* Constrain torchdata<=0.10.0

* Also update examples environment
@mattwthompson

Copy link
Copy Markdown
Member

Test failure(s) appear to be genuine. This is with DGL 2.3.0, PyTorch 2.3.1, Python 3.11

Click me!

________ examples/train-gnn-notebook/train-gnn-notebook.ipynb::Cell 10 _________
[gw0] darwin -- Python 3.11.14 /Users/runner/micromamba/envs/openff-nagl-test/bin/python
Notebook cell execution failed
Cell 10: Cell execution caused an exception

Input:
from pytorch_lightning import Trainer

trainer = Trainer(max_epochs=200)

trainer.progress_bar_callback.disable()
trainer.checkpoint_callback.monitor = "val/loss"

trainer.fit(
training_model,
datamodule=data_module
)

Traceback:


KeyError Traceback (most recent call last)
Cell In[1], line 8
5 trainer.progress_bar_callback.disable()
6 trainer.checkpoint_callback.monitor = "val/loss"
----> 8 trainer.fit(
9 training_model,
10 datamodule=data_module
11 )

File ~/micromamba/envs/openff-nagl-test/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py:584, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path, weights_only)
582 self.training = True
583 self.should_stop = False
--> 584 call._call_and_handle_interrupt(
585 self,
586 self._fit_impl,
587 model,
588 train_dataloaders,
589 val_dataloaders,
590 datamodule,
591 ckpt_path,
File ~/micromamba/envs/openff-nagl-test/lib/python3.11/site-packages/lightning_utilities/core/apply_func.py:98, in _apply_to_collection_slow(data, dtype, function, wrong_dtype, include_none, allow_frozen, *args, **kwargs)
86 def _apply_to_collection_slow(
87 data: Any,
88 dtype: Union[type, Any, tuple[Union[type, Any]]],
(...) 95 ) -> Any:
96 # Breaking condition
97 if isinstance(data, dtype) and (wrong_dtype is None or not isinstance(data, wrong_dtype)):
---> 98 return function(data,*args,**kwargs)
100 elem_type = type(data)
102 # Recursively apply to collection items

File ~/micromamba/envs/openff-nagl-test/lib/python3.11/site-packages/lightning_fabric/utilities/apply_func.py:104, in move_data_to_device..batch_to(data)
102 if isinstance(data, Tensor) and isinstance(device, torch.device) and device.type not in _BLOCKING_DEVICE_TYPES:
103 kwargs["non_blocking"] = True
--> 104 data_output = data.to(device,**kwargs)
105 if data_output is not None:
106 return data_output

File ~/micromamba/envs/openff-nagl-test/lib/python3.11/site-packages/openff/nagl/molecule/_dgl/batch.py:14, in DGLMoleculeBatch.to(self, device)
13 def to(self, device: str):
---> 14 graph = self.graph.to(device)
15 return type(self)(
16 graph, n_representations=self.n_representations, n_atoms=self.n_atoms
17 )

File ~/micromamba/envs/openff-nagl-test/lib/python3.11/site-packages/dgl/heterograph.py:5714, in DGLGraph.to(self, device, **kwargs)
5711 ret = copy.copy(self)
5713 # 1. Copy graph structure
-> 5714 ret._graph = self._graph.copy_to(utils.to_dgl_context(device))
5716 # 2. Copy features
5717 # TODO(minjie): handle initializer
5718 new_nframes = []

File ~/micromamba/envs/openff-nagl-test/lib/python3.11/site-packages/dgl/utils/internal.py:585, in to_dgl_context(ctx)
583 def to_dgl_context(ctx):
584 """Convert a backend context to DGLContext"""
--> 585 device_type = nd.DGLContext.STR2MASK[F.device_type(ctx)]
586 device_id = F.device_id(ctx)
587 return nd.DGLContext(device_type, device_id)

KeyError: 'mps'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Mac CI won't execute electric field output

4 participants