Skip to content

Enhancement: Better Metrics#362

Draft
sfc-gh-jrasley wants to merge 4 commits into
mainfrom
mwyatt/better-metrics
Draft

Enhancement: Better Metrics#362
sfc-gh-jrasley wants to merge 4 commits into
mainfrom
mwyatt/better-metrics

Conversation

@sfc-gh-jrasley
Copy link
Copy Markdown
Collaborator

No description provided.

@sfc-gh-mwyatt sfc-gh-mwyatt changed the title Mwyatt/better metrics Enhancement: Better Metrics Mar 12, 2026
@sfc-gh-mwyatt sfc-gh-mwyatt marked this pull request as draft March 12, 2026 21:26
Copy link
Copy Markdown
Collaborator

@sfc-gh-sbekman sfc-gh-sbekman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will have another closer read when you feel it's ready, but overall looks great - excellent work, Mike.

Left a few small suggestions.

Comment thread arctic_training/debug.py
Comment on lines -153 to -154
# this will lead to wrong peak reports if `see_mem_usage` is also used during the run,
# as it resets the peak counter and there is only one counter
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why have you removed this warning?

def _gather_object(value: Union[float, int, list], world_size: int) -> List[float]:
"""All-gather a scalar or list across ranks, returning a flat list."""
output: list = [None] * world_size
torch.distributed.all_gather_object(output, value)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, if we call this many times this method is much slower than tensor gather. I wonder if it actually adds up to a non-insignificant overhead. I think it'd be ok if we were to call it once on a dict or some such, otherwise gather would be many times faster.

self.register("iter_time", reduce="mean", fmt=human_format_secs, display_name="iter time", accumulate=True)
self.register("iter_tflops", derive=_derive_tflops("iter_time"), fmt=".1f", display_name="iter tflops")
self.register("mem_ma", reduce="mean", fmt=lambda v: f"{v:.2f} GB", display_name="MA")
self.register("mem_max_ma", reduce="mean", fmt=lambda v: f"{v:.2f} GB", display_name="Max_MA")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max_ma has to be max.

probably the same for mem_nv

not sure about mem_ma - I think it should be max as well.

I suppose to debug memory multi-iteration metrics are simply wrong no matter how this is configured - but since we care for max, probably max for all 3 is the most sensible multi-iteration choice.


Args:
name: Key used with ``record()``.
reduce: ``"mean"`` or ``"sum"`` — how to reduce across GAS micro-steps.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you call it GAS micro-steps? it can be gas=1 and log_internal=10

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the docs include the default value for each?

wandb: Whether to include in wandb logs.
accumulate: If ``True``, ``report()`` aggregates all values since
the previous report. If ``False`` (default), only the latest
GAS cycle's values are used.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above - doesn't have to be GAS>1

Comment thread arctic_training/debug.py
f"NV {round(nv_mem / 2**30, 2):0.2f} GB",
]
)
ma_gb = round(get_accelerator().memory_allocated() / 2**30, 2)
Copy link
Copy Markdown
Collaborator

@sfc-gh-sbekman sfc-gh-sbekman Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mike, while at it, let's please fix my sloppiness - use s/_gb/_gib/ and s/GB/GiB/ later in the metrics registry. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants