As far as I know, GPUs execute code quickly as long as all threads in a warp follow the same execution path; branching within a warp degrades performance. But running exactly the same code across multiple coroutines seems practically impossible to me. So can good performance be reached with such an approach at all?
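To make the concern concrete, here is a rough sketch of the usual simplified SIMT cost model: when threads in one warp take different branches, the warp runs each distinct branch in turn with the other lanes masked off. The branch labels, the `path_cost` numbers, and the `warp_cost` helper are all made up for illustration; real hardware has more nuance than this.

```python
WARP_SIZE = 32  # threads per warp on current NVIDIA GPUs


def warp_cost(paths, path_cost):
    """Cost (arbitrary time units) of one warp under a simplified SIMT model.

    paths: the branch label taken by each thread in the warp.
    path_cost: cost of executing each branch once.
    Divergent branches run one after another with inactive lanes masked,
    so the warp pays for every distinct path any of its threads takes.
    """
    return sum(path_cost[p] for p in set(paths))


# Hypothetical branches, each costing 10 units.
path_cost = {"a": 10, "b": 10, "c": 10, "d": 10}

uniform = ["a"] * WARP_SIZE                          # no divergence
two_way = ["a", "b"] * (WARP_SIZE // 2)              # two paths in one warp
four_way = ["a", "b", "c", "d"] * (WARP_SIZE // 4)   # four paths in one warp

for name, paths in [("uniform", uniform), ("two-way", two_way), ("four-way", four_way)]:
    print(name, warp_cost(paths, path_cost))
# uniform 10
# two-way 20
# four-way 40
```

In this model the slowdown scales with the number of distinct paths inside a warp, not with the number of threads, which is why coroutines that all sit at different points look like the worst case.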
zozbot234|11 days ago
(Beyond that, "executing the same code" on multiple instances of a single coroutine ought to be possible at least some of the time, on an opportunistic basis.)
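One possible reading of "opportunistic" is sketched below: a scheduler groups coroutine instances by the point at which they will resume, so each warp-sized batch runs a single path. This is only an assumption about how it could work, not a description of any real runtime; `schedule_by_resume_point` and the `"read"`/`"write"` labels are hypothetical, and whether a GPU runtime can repack work this freely is a separate question.

```python
from collections import defaultdict

WARP_SIZE = 32


def schedule_by_resume_point(resume_points, warp_size=WARP_SIZE):
    """Pack coroutine instances into batches that share one resume point.

    resume_points[i] is the (hypothetical) label of the suspension point
    where coroutine instance i will continue.
    Returns a list of (label, [instance ids]) batches of at most warp_size.
    """
    buckets = defaultdict(list)
    for instance_id, label in enumerate(resume_points):
        buckets[label].append(instance_id)

    warps = []
    for label, ids in buckets.items():
        for start in range(0, len(ids), warp_size):
            warps.append((label, ids[start:start + warp_size]))
    return warps


# 64 instances of one coroutine: 40 waiting at "read", 24 at "write".
resume_points = ["read"] * 40 + ["write"] * 24
warps = schedule_by_resume_point(resume_points)
for label, ids in warps:
    print(label, len(ids))
# read 32
# read 8
# write 24
```

Every batch here is free of divergence, but two of the three are only partly filled, so the trade-off shifts from divergence to occupancy.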
ablob|12 days ago