Monitor and Optimize your large-scale model training

13 points | roanakb | 2 years ago | trainy.ai

2 comments

krawfy|2 years ago

This is really cool! When we were trying to launch the GSPMD feature for PyTorch/XLA at Google, one of our biggest bottlenecks was network overhead, but we didn't really have any robust tools to dig into it and perform root cause analysis. I'm loving the tools I see come out of Trainy.

roanakb|2 years ago

Thanks! Let me know if there are any features you'd like to see added.