lepicz | 4 months ago

i did a bit of dev on ps3 and i remember there was a small memory on the chip, like 256k, that was accessible to the programmer.

i always found this very appealing, having a blazing fast memory under programmer control so i wonder: why don't we have that on other cpus?

flohofwoe|4 months ago

> why don't we have that on other cpus

Pure speculation from my side, but I'd think that the advantages over traditional big register banks and on-chip caches are not that great, especially when you're writing 'cache-aware code'. You also need to consider that the PS3 was full of design compromises to keep cost down, e.g. there simply might not have been enough die space for a cache controller for each SPU, or the die space was more valuable when spent on a few more kilobytes of static scratch memory instead of the cache logic.

Also, AFAIK some GPU architectures have something similar in the form of per-core static scratch space. That's where restrictions come from such as uniform data per shader invocation being at most 64 KBytes on some GPU architectures, etc...

fredoralive|4 months ago

It's kinda a neat idea for a fixed target CPU like on a games console, but for a general purpose CPU range you generally don't want to reveal too much behind the curtain like that. What if you did a new model with a bigger scratchpad? Would existing software just ignore it? Or a budget model with less? Do you have a crash, or just a slow fallback? The system where the CPU magically makes cache work is better when you're in a situation where the CPU and software problem aren't fixed.
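One way software could cope with varying scratchpad sizes is to size its working batches at run time instead of hard-coding a figure like 256K. A minimal sketch in C, assuming the platform can report its scratchpad capacity somehow; `pick_tile_bytes` and its parameters are hypothetical and no real hardware API is implied:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: choose how many bytes to process per batch, given
 * the scratchpad capacity reported by the platform (0 = no scratchpad).
 * Half of the capacity is left free so a second buffer can be filled
 * while the first one is being processed. */
size_t pick_tile_bytes(size_t scratchpad_bytes, size_t item_bytes)
{
    assert(item_bytes > 0);
    if (scratchpad_bytes == 0)
        return 0;                             /* no scratchpad at all */
    size_t half = scratchpad_bytes / 2;       /* room for double buffering */
    return (half / item_bytes) * item_bytes;  /* round down to whole items */
}
```

A caller would treat a result of 0 (no scratchpad, or not even one item fits) as the signal to take the ordinary cached path, which is the "slow fallback" case rather than a crash.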

trelane|4 months ago

This is the SPU's local store. It's mentioned in the article. More details at https://en.wikipedia.org/wiki/Cell_(processor) Apparently it could in theory go to 4GiB.

"The local store does not operate like a conventional CPU cache since it is neither transparent to software nor does it contain hardware structures that predict which data to load."

I think the general term for this is scratchpad memory. https://en.wikipedia.org/wiki/Scratchpad_memory

This kind of indicates the problem with it. When switching tasks, each local store would have to be put into main RAM and the new task's local stores pulled back out. This would make switching tasks increasingly expensive. I believe the PS3 (and maybe all cell processors) dealt with this by not having tasks switch on the SPUs.

izacus|4 months ago

We call it "cache" don't we these days? And they've become massive - Apple M series and AMX Strix series have 24/32MB of L3 cache.

This is where a lot of their performance comes from.

protimewaster|4 months ago

Is the cache on M series and Strix under control of the programmer? I was under the impression those were traditional caches and thus automatically handled by the hardware.

izacus|4 months ago

*AMD, obviously I meant AMD, not AMX :)

bitwize|4 months ago

The TI-99/4A had 256 BYTES (128 words) of static RAM available to the CPU. All accesses to the 16K of main memory had to be done through the video chip. This made a lot of things on the TI-99/4A slow, but there were occasional bits of brilliance where you see a tiny bit of the system it could've been. Thanks to the fast SRAM and 16-bit CPU, the smooth scrolling in Parsec was done entirely in software; the TMS9918A video chip lacks scroll registers entirely.

ack_complete|4 months ago

Some do: some ARM-based devices have tightly coupled memory (TCM). The RP2040 in the original Raspberry Pi Pico also has a 4K bank for each core intended for stack and per-core variables, though it is not limited to access only by that core.

The main disadvantage of such dedicated memory is inefficient usage compared to using that same amount of fast local memory to cache _all_ of main memory.

corysama|4 months ago

“Shared mem” in CUDA and compute shaders works the same way. And, geometry shaders are very reminiscent of PS2 VU1 programming, being essentially compute shaders that can output directly to the rasterizer without going through DRAM.

MaxBarraclough|4 months ago

Sounds a little like the 10MB of EDRAM on the Xbox 360, although I think it was only accessible by the GPU.

https://en.wikipedia.org/wiki/Xbox_360_technical_specificati...

scraft|4 months ago

On the PS2 there was a very small memory area, called the scratchpad, that was very quick to access. The rough idea on the PS2 was to DMA data in and out of the scratchpad, and then do work on the data, without creating contention with everything else going on at the same time.
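The "DMA in, work, DMA out" pattern can be sketched in portable C. This is only an illustration under assumptions: `memcpy` stands in for the DMA transfers, a static array stands in for the scratchpad, and `SCRATCH_FLOATS` and `scale_via_scratchpad` are made-up names rather than anything from the PS2 SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SCRATCH_FLOATS 1024   /* assumed size for the sketch, not a real figure */

static float scratchpad[SCRATCH_FLOATS];   /* stands in for fast on-chip memory */

/* Scale every element of `data` by `k`, staging the data through the
 * scratchpad one chunk at a time. memcpy stands in for the DMA transfers. */
void scale_via_scratchpad(float *data, size_t count, float k)
{
    assert(data != NULL || count == 0);
    for (size_t done = 0; done < count; ) {
        size_t n = count - done;
        if (n > SCRATCH_FLOATS)
            n = SCRATCH_FLOATS;

        memcpy(scratchpad, data + done, n * sizeof(float));  /* "DMA in" */
        for (size_t i = 0; i < n; i++)
            scratchpad[i] *= k;                              /* work on the fast copy */
        memcpy(data + done, scratchpad, n * sizeof(float));  /* "DMA out" */

        done += n;
    }
}
```

Unlike real DMA, the copies here are synchronous, so the sketch shows the data flow but not the overlap of transfer and compute that made the approach worthwhile.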

In general most developers struggled to do much with it; it was just too small (combined with the fiddliness of using it).

PS2 programmers were very used to thinking in this way as it's how the rendering had to be done. There are a couple of vector units, and one of them is connected to the GPU, so the general structure most developers followed was to have 4 buffers in the VU memory (I think it only had 16KB of memory or something pretty small), but essentially in parallel you'd have:

1. New data being DMAd in from main memory to VU memory (into say buffer 1/4).
2. Previous data in buffer 3/4 being transformed, lit, coloured, etc and output into buffer 4/4.
3. Data from buffer 2/4 being sent/rendered by the GPU.

Then once the above had finished it would flip, so you'd alternate like:

Data in: B1 (main memory to VU)
Data out: B2 (VU to GPU)
Data process from: B3 (VU processing)
Data process to: B4 (VU processing)

Data in: B3
Data out: B4
Data process from: B1
Data process to: B2
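The flip between those two layouts can be written down as a tiny C sketch. The struct and function names are made up for illustration, and buffers B1..B4 are numbered 0..3:

```c
#include <assert.h>

/* Which of the four VU buffers (0..3 for B1..B4) plays each role. */
struct vu_roles {
    int dma_in;        /* being filled from main memory */
    int gpu_out;       /* being sent to the GPU */
    int process_from;  /* input of the transform step */
    int process_to;    /* output of the transform step */
};

/* Roles for a given batch number: even batches use the first layout
 * listed above, odd batches use the flipped one. */
struct vu_roles roles_for_batch(unsigned batch)
{
    struct vu_roles r;
    if (batch % 2 == 0)
        r = (struct vu_roles){ .dma_in = 0, .gpu_out = 1,
                               .process_from = 2, .process_to = 3 };
    else
        r = (struct vu_roles){ .dma_in = 2, .gpu_out = 3,
                               .process_from = 0, .process_to = 1 };

    /* Sanity check: no buffer is used for two roles at the same time. */
    assert(r.dma_in != r.gpu_out && r.dma_in != r.process_from &&
           r.dma_in != r.process_to && r.gpu_out != r.process_from &&
           r.gpu_out != r.process_to && r.process_from != r.process_to);
    return r;
}
```

After each flip, the buffer that was just filled by DMA becomes the processing input, and the buffer that just received processed output becomes the one sent to the GPU.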

The VU has two pipelines running in parallel (float and integer), and every instruction takes an exact number of cycles to process. If you read a result before it is ready you stall the pipeline, so you had to painstakingly interleave and order your instructions to process three verts at a time and be very clever about register pressure etc.

There is obviously some clever syncing logic to allow all of this to work, allowing the DMA to wait until the VU kicks off the next GPU batch etc.

It was complex to get your head around, set up all the moving parts and debug when it goes wrong. When it goes wrong it pretty much just hangs, so you had to write a lot of validators. On PS2 you basically spend the frame building up a huge DMA list, and then at the end of the frame kick it off and it renders everything, so the DMA will transfer VU programs to the VU, upload data to the VU, wait for it to process and upload next batch, at the end upload next program, upload settings to GPU registers, basically everything. Once that DMA is kicked off no more CPU code is involved in rendering the frame, so you have a MB or so of pure memory transfer instructions firing off, if any of them are wrong you are in a world of pain.

Then throw in, just to keep things interesting, the fact that anything you write to memory is likely stuck in caches, and DMA doesn't see caches, so extra care has to be taken to make sure caches are flushed before using DMA.
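The hazard is easy to show with a toy model in C. Everything here is hypothetical (a single write-back cache line, made-up function names) and is not how any real SDK exposes cache maintenance; it only illustrates why the writeback has to come before the DMA.

```c
#include <assert.h>
#include <string.h>

#define LINE_BYTES 64

/* Toy model: `ram` is what a DMA engine sees, `cache_line` is what the
 * CPU sees. One cache line covers the whole of this tiny "memory". */
struct toy_mem {
    unsigned char ram[LINE_BYTES];
    unsigned char cache_line[LINE_BYTES];
    int dirty;   /* cache_line holds data not yet written back to ram */
};

/* CPU store: lands in the cache only (write-back behaviour). */
void cpu_write(struct toy_mem *m, int offset, unsigned char value)
{
    assert(offset >= 0 && offset < LINE_BYTES);
    m->cache_line[offset] = value;
    m->dirty = 1;
}

/* Write the dirty line back to RAM, as a cache flush/writeback would. */
void cache_writeback(struct toy_mem *m)
{
    if (m->dirty) {
        memcpy(m->ram, m->cache_line, LINE_BYTES);
        m->dirty = 0;
    }
}

/* DMA read: bypasses the cache and reads RAM directly. */
unsigned char dma_read(const struct toy_mem *m, int offset)
{
    assert(offset >= 0 && offset < LINE_BYTES);
    return m->ram[offset];
}
```

In this model, a `dma_read` issued right after `cpu_write` returns the stale RAM contents; only after `cache_writeback` does it see the value the CPU wrote.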

It was a magical, horrible, wonderful, painful, joyous, impossible, satisfying, sickening, amazing time.

otabdeveloper4|4 months ago

> why don't we have that on other cpus?

We do, it's called "cache" or "registers".

maximilianburke|4 months ago

It's definitely not registers; the SPEs had 128 128-bit registers each.

In some ways it's like cache, it has the latency of L1 cache (6 cycles), but it's fully deterministic in terms of access.

lepicz|4 months ago

as a programmer you have (almost) no control over cache. that's not what i meant.

registers ok, but i want at least one megabyte of them :)