top | item 41375706


xtacy|1 year ago

It's also a bit odd that they do not implement congestion control. Congestion control is fundamental unless you only have point-to-point data transfers, which is rarely the case. All-reduce operations during training require N-to-1 data transfers. In these scenarios the sender needs to control its transfer rate so as not to overwhelm not just the receiver but also the network... if this is not done, it will cause congestion collapse (https://en.wikipedia.org/wiki/Network_congestion#:~:text=ser...).
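To make the point concrete, here is a minimal sketch (not anything from TTPoE itself) of the classic AIMD (additive-increase / multiplicative-decrease) rate controller that TCP-style congestion control is built on: each sender probes upward while the network is clear and backs off sharply on loss, which is what keeps N-to-1 incast traffic from collapsing. The class name and parameters are illustrative assumptions.

```python
# Hypothetical sketch, not Tesla's protocol: an AIMD sender-side
# rate controller driven by ACK/loss feedback.

class AimdRateController:
    """Adjust a sender's transmit rate from delivery feedback."""

    def __init__(self, rate_gbps=1.0, increase=0.1, decrease=0.5,
                 max_rate_gbps=400.0):
        self.rate = rate_gbps      # current sending rate
        self.increase = increase   # additive step on clean delivery
        self.decrease = decrease   # multiplicative cut on loss
        self.max_rate = max_rate_gbps

    def on_ack(self):
        # No congestion signal: probe for more bandwidth additively.
        self.rate = min(self.rate + self.increase, self.max_rate)

    def on_loss(self):
        # Loss implies a congested queue somewhere: back off
        # multiplicatively so N senders converging on one receiver
        # collectively drain it instead of collapsing throughput.
        self.rate *= self.decrease
```

The multiplicative decrease is the part that matters for incast: with N senders each halving on loss, aggregate offered load falls fast enough for the shared bottleneck to recover.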



kiratp|1 year ago

Current public SOTA seems to be “no congestion control”

> We proceeded without DCQCN for our 400G deployments. At this time, we have had over a year of experience with just PFC for flow control, without any other transport-level congestion control. We have observed stable performance and lack of persistent congestion for training collectives.

https://engineering.fb.com/2024/08/05/data-center-engineerin...

_zoltan_|1 year ago

PFC is congestion control.

jcims|1 year ago

I probably shouldn't be commenting because I don't have any experience at this level, but given it's a closed system where they control supply and demand, it seems they could manage away most congestion issues with scheduling/orchestration. The protocol still has primitive flow control, and it seems like you could create something akin to a virtual sliding window just by instrumenting the retransmits.

But now I am curious what the distribution of observed window sizes is in the wild.

Edit: I'd bet the simpler protocol is more vulnerable to various spoofing attacks though.

Edit2: Lol I hope the frame IDs are for illustrative purposes only - https://chipsandcheese.com/2024/08/27/teslas-ttpoe-at-hot-ch...
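The "virtual sliding window from retransmits" idea above can be sketched very simply: shrink an advisory window whenever a retransmit is observed, and grow it again on clean delivery. This is a hypothetical illustration of the comment's suggestion, not anything in TTPoE; all names and constants are made up.

```python
# Hypothetical sketch: an advisory congestion window inferred purely
# from instrumenting retransmits, as the comment proposes.

class VirtualWindow:
    def __init__(self, window=16, min_window=1, max_window=1024):
        self.window = window            # segments allowed in flight
        self.min_window = min_window
        self.max_window = max_window

    def on_delivered(self):
        # Clean delivery: widen the window by one segment.
        self.window = min(self.window + 1, self.max_window)

    def on_retransmit(self):
        # A retransmit is the only congestion signal we have here,
        # so treat it like loss and halve the window.
        self.window = max(self.window // 2, self.min_window)
```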

xtacy|1 year ago

In principle, with perfect knowledge of the flows at any given instant, you can assign credits/rates of transmission to each flow to prevent congestion. But in practice this is nuanced to build, and there are various tradeoffs to consider: what happens if the flows are so short that coordinating with a centralised scheduler incurs a latency overhead comparable to the flow duration? There's been research demonstrating that one can strike a sweet spot, but I don't think it's practical, nor has it really been deployed in the wild. And of course, this scheduler has to be made reliable, as it's a single point of failure.
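The centralised idea described above can be sketched as follows: a scheduler with global knowledge assigns each flow a rate such that no link's assigned rates exceed its capacity. This toy version just splits each bottleneck link's capacity equally among the flows crossing it (a crude stand-in for max-min fairness); it is an assumption-laden illustration and ignores the coordination-latency tradeoff the comment raises.

```python
# Hypothetical sketch: a centralised scheduler handing out per-flow
# rates so each bottleneck link's capacity is never oversubscribed.

from collections import defaultdict

def assign_rates(flows, link_capacity_gbps):
    """flows: mapping flow_id -> bottleneck link id.
    Returns flow_id -> assigned rate: an equal share of the
    capacity of that flow's bottleneck link."""
    per_link = defaultdict(list)
    for flow_id, link in flows.items():
        per_link[link].append(flow_id)
    rates = {}
    for link, members in per_link.items():
        share = link_capacity_gbps / len(members)
        for flow_id in members:
            rates[flow_id] = share
    return rates
```

For example, with two flows sharing one 400G link and a third flow alone on another, the shared flows would each be assigned 200G and the lone flow the full 400G.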

Such ideas are, however, worth revisiting when the workload is unique enough (in this case, it is) and the performance gains are big enough...

pantalaimon|1 year ago

> I'd bet the simpler protocol is more vulnerable to various spoofing attacks though.

This is a protocol between compute nodes in a data center; it's layer 2, so there is no way to reach it over the internet.