top | item 39750114

(no title)

30x is the type of number that when you see it in a generational improvement, you should ignore it as marketing fluff.

discuss

azeirah|1 year ago

From how I understood it, it means they optimised the entire stack from CUDA to the networking interconnects specifically for data centers, meaning you get 30x more inference per dollar for a datacenter. This is probably not fluff, but it's only relevant for a very very specific use-case, ie enterprises with the money to buy a stack to serve thousands of users with LLMs.

It doesn't matter for anyone who's not microsoft, aws or openai or similar.

misterdabb|1 year ago

It's a weird graph... It's specifically tokens per GPU but the x-axis is "interactivity per second", so the y-axis is including Blackwell being twice the size and also the increase from fp8 -> fp4, note it will needs to be counted multiple time as half as much data is needed to be going through the networks as well.

acchow|1 year ago

They showed 30x was for FP4. Who is using FP4 in practice?