
ClaireGz | 5 days ago

This is super helpful — most writeups skip over the actual communication steps, so seeing the All-to-All flow laid out makes it much clearer.

Curious from your experiments: at 1M+ context, does communication start dominating vs compute?

I keep seeing cases where bigger context windows are technically possible but don’t translate into better results unless the context is very structured, so I wonder where the real scaling limit ends up being in practice.


DARSHANFOFADIYA | 5 days ago

As we scale to 1M+ context length (inference), the biggest bottleneck is memory, and to tackle that at scale we pay the price of communication overhead. Fortunately, the GPUs smartly fetch data for the next step while the previous step is computing, masking the communication overhead and keeping responses at that scale feeling realistic.
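That fetch-ahead trick is just double buffering. Here is a minimal toy sketch of the idea in plain Python, with a thread standing in for the communication stream and `fetch`/`compute` as illustrative stand-ins (not any real framework's API): while step i computes, the transfer for step i+1 is already in flight.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(step):
    # stand-in for the communication phase (e.g. an All-to-All / KV fetch)
    time.sleep(0.01)
    return f"data-{step}"

def compute(data):
    # stand-in for the compute phase on one chunk
    time.sleep(0.01)
    return f"out({data})"

def pipeline(num_steps):
    """Overlap: launch the fetch for step i+1 before computing step i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        next_data = io.submit(fetch, 0)  # prefetch step 0
        for step in range(num_steps):
            data = next_data.result()    # wait only for this step's data
            if step + 1 < num_steps:
                # kick off the next transfer; it runs during compute below
                next_data = io.submit(fetch, step + 1)
            results.append(compute(data))
    return results

print(pipeline(3))  # ['out(data-0)', 'out(data-1)', 'out(data-2)']
```

On a real GPU the same shape is done with separate CUDA streams and async copies rather than threads, and the masking only works while the transfer time stays under the compute time per step; once communication takes longer, it stops hiding and starts dominating.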

The quality degradation as context length increases is a whole other science problem.