Enhancement: Better Metrics #362
Conversation
sfc-gh-sbekman left a comment:
I will have another closer read when you feel it's ready, but overall looks great - excellent work, Mike.
Left a few small suggestions.
```python
# this will lead to wrong peak reports if `see_mem_usage` is also used during the run,
# as it resets the peak counter and there is only one counter
```
why have you removed this warning?
```python
def _gather_object(value: Union[float, int, list], world_size: int) -> List[float]:
    """All-gather a scalar or list across ranks, returning a flat list."""
    output: list = [None] * world_size
    torch.distributed.all_gather_object(output, value)
```
btw, if we call this many times, this method is much slower than a tensor gather. I wonder whether the overhead actually adds up to something significant. It'd be fine if we called it once on a dict or some such; otherwise a tensor gather would be many times faster.
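A minimal sketch of the batching idea suggested above (the `pack_metrics`/`unpack_metrics` helpers and the fixed key ordering are hypothetical, not part of this PR): flatten the metrics dict into one value list in a deterministic key order, so a single tensor `all_gather` can replace many per-metric `all_gather_object` calls.

```python
from typing import Dict, List


def pack_metrics(metrics: Dict[str, float]) -> List[float]:
    """Flatten a metrics dict into a list using sorted-key order,
    so every rank places each metric at the same position."""
    return [metrics[k] for k in sorted(metrics)]


def unpack_metrics(keys: List[str], values: List[float]) -> Dict[str, float]:
    """Rebuild the dict after a gather; uses the same sorted-key order."""
    return dict(zip(sorted(keys), values))


# With torch.distributed initialized, the single gather would look like:
#   flat = torch.tensor(pack_metrics(local_metrics), device=device)
#   gathered = [torch.empty_like(flat) for _ in range(world_size)]
#   torch.distributed.all_gather(gathered, flat)
# i.e. one tensor collective instead of one all_gather_object per metric.

local = {"iter_time": 0.5, "mem_ma": 12.0}
packed = pack_metrics(local)
restored = unpack_metrics(list(local), packed)
```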
```python
self.register("iter_time", reduce="mean", fmt=human_format_secs, display_name="iter time", accumulate=True)
self.register("iter_tflops", derive=_derive_tflops("iter_time"), fmt=".1f", display_name="iter tflops")
self.register("mem_ma", reduce="mean", fmt=lambda v: f"{v:.2f} GB", display_name="MA")
self.register("mem_max_ma", reduce="mean", fmt=lambda v: f"{v:.2f} GB", display_name="Max_MA")
```
max_ma has to be max.
Probably the same for mem_nv.
Not sure about mem_ma - I think it should be max as well.
I suppose that for debugging memory, multi-iteration metrics are simply wrong no matter how this is configured - but since we care about the max, max for all three is probably the most sensible multi-iteration choice.
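A toy illustration of why `mean` is the wrong reduction for a peak counter (the numbers are made up): a single memory spike vanishes in the average, yet it is exactly what a peak metric should report.

```python
# Hypothetical per-iteration peak allocated memory, in GiB.
peaks_gib = [10.1, 10.2, 22.7, 10.3]

# reduce="mean" smooths the spike away...
mean_peak = sum(peaks_gib) / len(peaks_gib)  # 13.325 - looks comfortably safe

# ...while reduce="max" keeps the value that matters for OOM headroom.
max_peak = max(peaks_gib)                    # 22.7
```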
```python
    Args:
        name: Key used with ``record()``.
        reduce: ``"mean"`` or ``"sum"`` — how to reduce across GAS micro-steps.
```
why do you call it GAS micro-steps? it can be gas=1 and log_interval=10
should the docs include the default value for each?
```python
        wandb: Whether to include in wandb logs.
        accumulate: If ``True``, ``report()`` aggregates all values since
            the previous report. If ``False`` (default), only the latest
            GAS cycle's values are used.
```
same as above - doesn't have to be GAS>1
```python
        f"NV {round(nv_mem / 2**30, 2):0.2f} GB",
    ]
)
ma_gb = round(get_accelerator().memory_allocated() / 2**30, 2)
```
Mike, while at it, let's please fix my sloppiness - use s/_gb/_gib/ and s/GB/GiB/ later in the metrics registry. Thank you!
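The distinction being requested, sketched with a small helper (the `fmt_gib` name is made up for illustration): `2**30` bytes is one GiB, while a GB is `10**9` bytes, so labeling a `2**30`-based value "GB" misstates it by roughly 7%.

```python
GIB = 2**30  # 1 GiB = 1,073,741,824 bytes (binary unit, matches the 2**30 divisor in the code)
GB = 10**9   # 1 GB  = 1,000,000,000 bytes (decimal unit)


def fmt_gib(num_bytes: int) -> str:
    """Format a byte count in gibibytes with the correct unit label."""
    return f"{num_bytes / GIB:.2f} GiB"


# The same 8 * 2**30 bytes under each label:
eight = 8 * GIB
print(fmt_gib(eight))          # 8.00 GiB
print(f"{eight / GB:.2f} GB")  # 8.59 GB - why mixing the units misleads
```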