sureshvoz | 2 months ago
In distributed training of LLMs, the bottleneck is no longer just disk I/O or CPU cycles; it's the "straggler problem" during collective communication (like All-Reduce). When you're running on 400Gbps+ RoCE (RDMA over Converged Ethernet) networks, the network "wire time" is often lower than the clock jitter on a standard Linux kernel.
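The straggler effect above can be sketched in a few lines (all numbers below are hypothetical, purely for illustration): in a synchronous collective like All-Reduce, no rank can complete until the slowest rank arrives, so one laggard sets the step time for the entire group.

```python
# Hypothetical per-rank compute times (ms) before reaching the All-Reduce.
# Rank 4 is a straggler; everyone else is tightly clustered around 10 ms.
per_rank_ready_ms = [10.0, 10.1, 9.9, 10.0, 31.5, 10.2, 10.0, 9.8]

# A synchronous collective cannot start (or finish) until the last rank
# arrives, so the effective step time is the max, not the mean.
effective_step_ms = max(per_rank_ready_ms)
mean_step_ms = sum(per_rank_ready_ms) / len(per_rank_ready_ms)

print(effective_step_ms)  # 31.5
print(round(mean_step_ms, 2))  # 12.69
```

The gap between the mean and the max is the idle time every other GPU spends waiting, which is exactly what fleet-wide profiling needs accurate clocks to attribute.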
If your clocks are skewed by even 2-3 milliseconds, your telemetry becomes essentially useless. It looks like packets are arriving before they were sent, or worse, your profiling tools can't accurately pinpoint which GPU is stalling the rest of the 16,384-node fleet. We've reached a point where microsecond-accurate clocks aren't just a requirement for HFT firms; they're becoming the baseline for anyone trying to keep hundreds of millions of dollars' worth of NVIDIA GPUs from idling while they wait for a collective sync.
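A minimal sketch of why skewed clocks produce "packets arriving before they were sent" (the function and all timestamps here are hypothetical, not from any real telemetry stack): one-way delay is computed by subtracting the sender's timestamp from the receiver's, so a few milliseconds of receiver skew swamps a wire time measured in microseconds.

```python
def apparent_one_way_delay_us(send_ts_us: float, recv_ts_us: float,
                              recv_skew_us: float) -> float:
    """One-way delay as seen in telemetry when the receiver's clock is
    offset from the sender's by recv_skew_us (negative = running behind)."""
    return (recv_ts_us + recv_skew_us) - send_ts_us

# On a 400Gbps fabric, true wire time might be single-digit microseconds.
true_delay_us = 5.0
send_ts = 1_000_000.0
recv_ts = send_ts + true_delay_us

# Perfectly synced clocks: telemetry matches reality.
print(apparent_one_way_delay_us(send_ts, recv_ts, 0.0))  # 5.0

# Receiver's clock 2 ms behind: the packet appears to arrive
# ~2 ms *before* it was sent -- a negative delay.
print(apparent_one_way_delay_us(send_ts, recv_ts, -2000.0))  # -1995.0
```

With a 5 µs wire time, even sub-millisecond skew makes the measurement off by orders of magnitude, which is why microsecond-level sync (e.g. PTP-class, rather than plain NTP) comes up in this context.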
perryizgr8 | 2 months ago