xtacy|1 year ago
It's also a bit odd that they do not implement congestion control. Congestion control is fundamental unless you only have point-to-point data transfers, which is rarely the case. The all-reduce operation during training requires N-to-1 data transfers. In these scenarios each sender needs to control its transfer rate so as not to overwhelm the receiver or the network. If this is not done, it will cause congestion collapse (https://en.wikipedia.org/wiki/Network_congestion#:~:text=ser...).
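To make the N-to-1 point concrete, here is a toy sketch of incast at a single switch port. It is not a model of TTPoE or any real fabric: the function name, the tick-based queue, and all the numbers are assumptions chosen purely for illustration.

```python
def simulate_incast(n_senders, pkts_per_sender, link_rate, buffer_size, ticks):
    """Toy model: N senders share one egress port with a finite buffer.

    All parameters are hypothetical and in packets per tick.
    Returns the total number of packets dropped at the port.
    """
    queue = 0
    dropped = 0
    for _ in range(ticks):
        # Every sender transmits at its own rate, ignoring the others.
        queue += n_senders * pkts_per_sender
        # Anything that does not fit in the buffer is dropped.
        if queue > buffer_size:
            dropped += queue - buffer_size
            queue = buffer_size
        # The port drains at most link_rate packets per tick.
        queue = max(0, queue - link_rate)
    return dropped


# 8 senders each sending at full line rate into one receiver port.
uncontrolled = simulate_incast(8, 8, link_rate=8, buffer_size=64, ticks=100)
# The same senders, each limited to a 1/8 share of the receiver's link.
rate_limited = simulate_incast(8, 1, link_rate=8, buffer_size=64, ticks=100)
print(uncontrolled, rate_limited)
```

In this sketch the uncontrolled case fills the buffer within two ticks and drops packets on every tick after that, while the rate-limited case never builds a standing queue. Real congestion control has to discover that fair share dynamically, which is the hard part.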
kiratp|1 year ago
> We proceeded without DCQCN for our 400G deployments. At this time, we have had over a year of experience with just PFC for flow control, without any other transport-level congestion control. We have observed stable performance and lack of persistent congestion for training collectives.
https://engineering.fb.com/2024/08/05/data-center-engineerin...
_zoltan_|1 year ago
jcims|1 year ago
But now I am curious what the distribution of observed window sizes is in the wild.
Edit: I'd bet the simpler protocol is more vulnerable to various spoofing attacks though.
Edit2: Lol I hope the frame IDs are for illustrative purposes only - https://chipsandcheese.com/2024/08/27/teslas-ttpoe-at-hot-ch...
xtacy|1 year ago
Such ideas are, however, worth revisiting when the workload is unique enough (in this case, it is), and the performance gains are big enough...
pantalaimon|1 year ago
This is a protocol between compute nodes in a data center. It's layer 2, so there is no way to reach it over the internet.