top | item 40764612

(no title)

That's called uOP cache, which Intel has been using since Sandy Bridge (and AMD but I don't remember on top of my head since when). But that's more transistors for the cache and its control mechanism.

discuss

jart|1 year ago

It's definitely better than what NVIDIA does, inventing an entirely new ISA each year. If the hardware isn't paying the cost for a frontend, then it shovels the burden onto software. There's a reason every AI app has to bundle a 500mb matrix multiplication library in each download, and it's because GPUs force you to compile your code ten times for the last ten years of ISAs.

saagarjha|1 year ago

Part of it is that, but part of it is that people pay for getting from 95% optimal to 99% optimal, and doing that is actually a lot of work. If you peek inside the matrix multiplication library you'll note that it's not just "we have the best algorithm for the last 7 GPU microarchitectures" but also 7 implementations for the latest architecture because that's just how you need to be to go fast. Kind of like how if you take an uninformed look at glibc memcpy you'll see there is an AVX2 path and a ERMS path but also it will switch between algorithms based on the size of the input. You can easily go "yeah my SSE2 code is tiny and gets decent performance" but if you stop there you're leaving something on the table, and with GPUs it's this but even more extreme.

camel-cdr|1 year ago

Using the uops directy as the isa would be a bad idea for code density. In RISC-V land, vendors tend to target standard extensions/profiles, but when they hardware is capable of other operations they often expose those through custom extensions.

kiratp|1 year ago

IMO if the trade off is cheaper, faster hardware iteration then Nvidia’s strategy makes a lot of sense.