sureshvoz | 2 months ago
In distributed training of LLMs, the bottleneck is no longer just disk I/O or CPU cycles; it's the "straggler problem" during collective communication (like All-Reduce). When you're running on 400Gbps+ RoCE (RDMA over Converged Ethernet) networks, the network "wire time" is often lower than the clock jitter on a standard Linux kernel.
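The straggler effect above can be sketched in a few lines (all numbers below are hypothetical, purely for illustration): in a synchronous collective like All-Reduce, no rank can complete until the slowest rank arrives, so one laggard sets the step time for the entire group.

```python
# Hypothetical per-rank compute times (ms) before reaching the All-Reduce.
# Rank 4 is a straggler; everyone else is tightly clustered around 10 ms.
per_rank_ready_ms = [10.0, 10.1, 9.9, 10.0, 31.5, 10.2, 10.0, 9.8]

# A synchronous collective cannot start (or finish) until the last rank
# arrives, so the effective step time is the max, not the mean.
effective_step_ms = max(per_rank_ready_ms)
mean_step_ms = sum(per_rank_ready_ms) / len(per_rank_ready_ms)

print(effective_step_ms)  # 31.5
print(round(mean_step_ms, 2))  # 12.69
```

The gap between the mean and the max is the idle time every other GPU spends waiting, which is exactly what fleet-wide profiling needs accurate clocks to attribute.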
If your clocks are skewed by even 2-3 milliseconds, your telemetry becomes essentially useless. It looks like packets are arriving before they were sent, or worse, your profiling tools can't accurately pinpoint which GPU is stalling the rest of the 16,384-node fleet. We've reached a point where microsecond-accurate clocks aren't just a requirement for HFT firms; they're becoming the baseline for anyone trying to keep hundreds of millions of dollars' worth of NVIDIA GPUs from idling while they wait for a collective sync.
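A minimal sketch of why skewed clocks produce "packets arriving before they were sent" (the function and all timestamps here are hypothetical, not from any real telemetry stack): one-way delay is computed by subtracting the sender's timestamp from the receiver's, so a few milliseconds of receiver skew swamps a wire time measured in microseconds.

```python
def apparent_one_way_delay_us(send_ts_us: float, recv_ts_us: float,
                              recv_skew_us: float) -> float:
    """One-way delay as seen in telemetry when the receiver's clock is
    offset from the sender's by recv_skew_us (negative = running behind)."""
    return (recv_ts_us + recv_skew_us) - send_ts_us

# On a 400Gbps fabric, true wire time might be single-digit microseconds.
true_delay_us = 5.0
send_ts = 1_000_000.0
recv_ts = send_ts + true_delay_us

# Perfectly synced clocks: telemetry matches reality.
print(apparent_one_way_delay_us(send_ts, recv_ts, 0.0))  # 5.0

# Receiver's clock 2 ms behind: the packet appears to arrive
# ~2 ms *before* it was sent -- a negative delay.
print(apparent_one_way_delay_us(send_ts, recv_ts, -2000.0))  # -1995.0
```

With a 5 µs wire time, even sub-millisecond skew makes the measurement off by orders of magnitude, which is why microsecond-level sync (e.g. PTP-class, rather than plain NTP) comes up in this context.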
perryizgr8 | 2 months ago