top | item 41318339

xoranth | 1 year ago

General question for gamedevs here: how useful is SIMD now that we have compute shaders on the GPU? What workloads still require SIMD, and why would you choose one over the other?

h0l0cube | 1 year ago

Physics specifically benefits from CPU processing. Efficient rendering pipelines are typically one-way (CPU -> GPU), whereas the results of physics calculations are depended on by both the game logic and the rendering, and it's much simpler (and probably more efficient) to keep that computation on the CPU. The exception could be UMA architectures like the Apple M-series and the PS4, where memory transport isn't a limiting factor – though memory/cache invalidation might be an issue?

eigenspace | 1 year ago

Even with UMA architectures where you eliminate the memory transport costs, it still costs a ton of time to actually launch a GPU kernel from the CPU.

dxuh | 1 year ago

With graphics you mostly prepare everything you want to render and then transfer all of it to the GPU. Physics lends itself fairly well to GPU acceleration too (compared to other things), but simply preparing something, transferring it to the GPU, and being done is not enough. You need to at least get the results back, even just to render them, but likely also to have gameplay depend on them.

With graphics programming, the expensive part is often the communication between the CPU and the GPU, and you try to avoid synchronization (especially with the old graphics APIs), so transferring there and back is expensive. Also, physics code is full of branches, while graphics code usually is not. GPUs (or really any wide vectorization) don't like branches much, and if you do only certain parts of the physics simulation on the GPU, you need to transfer there and back (and synchronize) even more.

I'm just a hobby gamedev, and I know people have done physics on the GPU (PhysX), but the things I mentioned sound like big hurdles to me.
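To illustrate the branch problem, here's a rough Python sketch (purely illustrative, not from any engine): wide hardware can't take a different branch per lane, so data-dependent control flow has to be rewritten as a masked blend in which both sides of the branch are effectively paid for:

```python
# Rough sketch of why branches hurt wide execution. Both functions
# clamp values to 1.0 and produce identical results; the second is the
# shape SIMD lanes / GPU threads need: turn the predicate into a 0/1
# mask and blend, with no per-element control flow.

def clamp_branchy(xs):
    # Scalar style: a data-dependent branch per element.
    out = []
    for x in xs:
        if x > 1.0:
            out.append(1.0)
        else:
            out.append(x)
    return out

def clamp_masked(xs):
    # Vector style: predicate becomes a mask, branch becomes a blend.
    out = []
    for x in xs:
        m = float(x > 1.0)                 # per-lane predicate as 0.0/1.0
        out.append(m * 1.0 + (1.0 - m) * x)  # select: both sides "computed"
    return out
```

On real hardware this is what select/blend instructions do, and the cost is that you evaluate both paths for every element, which is part of why branch-heavy physics code vectorizes so much worse than shading code.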

EDIT: One more big thing: at least for AAA games, you want to keep the GPU doing graphics so the game looks good. You rarely have GPU cycles to spare.

eigenspace | 1 year ago

I'm not a gamedev, but I do a lot of numerical work. GPUs are great, but they're no replacement for SIMD.

For example, I just made a little benchmark on my desktop summing 256 random Float32 numbers: doing it serially took around 152 nanoseconds, whereas doing it with SIMD took just 10 nanoseconds. Doing the exact same thing on my GPU took 20 microseconds, so 2000x slower:

    julia> using CUDA, SIMD, BenchmarkTools

    julia> function vsum(::Type{Vec{N, T}}, v::Vector{T}) where {N, T}
               s = Vec{N, T}(0)
               lane = VecRange{N}(0)
               for i ∈ 1:N:length(v)
                   s += v[lane + i]
               end
               sum(s)
           end;

    julia> let L = 256
               print("Serial benchmark:  "); @btime vsum(Vec{1, Float32}, v)  setup=(v=rand(Float32, $L))
               print("SIMD benchmark:    "); @btime vsum(Vec{16, Float32}, v) setup=(v=rand(Float32, $L))
               print("GPU benchmark:     "); @btime sum(v)                    setup=(v=CUDA.rand($L))
           end;
    Serial benchmark:    152.239 ns (0 allocations: 0 bytes)
    SIMD benchmark:      10.359 ns (0 allocations: 0 bytes)
    GPU benchmark:       19.917 μs (56 allocations: 1.47 KiB)

The reason for that is simply that it just takes that long to send data back and forth to the GPU and launch a kernel. Almost none of that time was actually spent doing the computation. E.g. here's what that benchmark looks like if instead I have 256^2 numbers:

    julia> let L = 256^2
               print("Serial benchmark:  "); @btime vsum(Vec{1, Float32}, v)  setup=(v=rand(Float32, $L))
               print("SIMD benchmark:    "); @btime vsum(Vec{16, Float32}, v) setup=(v=rand(Float32, $L))
               print("GPU benchmark:     "); @btime sum(v)                    setup=(v=CUDA.rand($L))
           end;
    Serial benchmark:    42.370 μs (0 allocations: 0 bytes)
    SIMD benchmark:      2.669 μs (0 allocations: 0 bytes)
    GPU benchmark:       27.592 μs (112 allocations: 2.97 KiB)

so we're now at the point where the GPU is faster than serial, but still slower than SIMD. If we go up to 256^3 numbers, now we're able to see a convincing advantage for the GPU:

    julia> let L = 256^3
               print("Serial benchmark:  "); @btime vsum(Vec{1, Float32}, v)  setup=(v=rand(Float32, $L))
               print("SIMD benchmark:    "); @btime vsum(Vec{16, Float32}, v) setup=(v=rand(Float32, $L))
               print("GPU benchmark:     "); @btime sum(v)                    setup=(v=CUDA.rand($L))
           end;
    Serial benchmark:    11.024 ms (0 allocations: 0 bytes)
    SIMD benchmark:      2.061 ms (0 allocations: 0 bytes)
    GPU benchmark:       353.119 μs (113 allocations: 2.98 KiB)

So the lesson here is that GPUs are only worth it if you actually have enough data to saturate the GPU, but otherwise you're way better off using SIMD.
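Those three runs roughly fit a toy fixed-overhead model. A small Python sketch (all constants eyeballed from the timings above, purely illustrative):

```python
# Toy cost model for the benchmarks above:
#   t_simd(n) = n / simd_rate
#   t_gpu(n)  = launch_overhead + n / gpu_rate
# The GPU only wins once n is large enough to amortize the launch overhead.
# All constants are eyeballed from the posted timings, not measured here.

launch_overhead_ns = 20_000.0  # ~20 us launch + sync, per the 256-element run
simd_rate = 25.0               # elements/ns, from 65536 elements in ~2.67 us
gpu_rate = 50.0                # elements/ns once saturated, from the 256^3 run

def t_simd(n):
    return n / simd_rate

def t_gpu(n):
    return launch_overhead_ns + n / gpu_rate

# Break-even: launch_overhead = n/simd_rate - n/gpu_rate
break_even = launch_overhead_ns / (1 / simd_rate - 1 / gpu_rate)
print(f"GPU starts winning around n ≈ {break_even:,.0f} elements")
```

With those rough numbers the break-even lands around a million elements, which is consistent with the runs above: at 256^2 (65k elements) SIMD still wins, and by 256^3 (16.7M elements) the GPU wins comfortably.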

GPUs are also just generally a lot more limiting than SIMD in many other ways.

xoranth | 1 year ago

Thank you for your reply!

> GPUs are also just generally a lot more limiting than SIMD in many other ways.

What do you mean? (Besides things like CUDA being available only on Nvidia, and fragmentation issues.)