Yes, nsys is very helpful, especially when looking at perf issues. It’s often the case that bugs present like in this blog though - you just notice that training curves have regressed somehow - so even with good tooling it can be hard to figure out where to start looking in these very complex systems. Only gets worse if the symptoms only show up when running for a long time and at scale in a cluster.
jebarker|4 months ago