touisteur | 1 month ago

I thought so too, but no. The workload is either an iterative small-matrix multiplication kernel on tensor cores, or pure (generative) compute with a very late reduction and a very small working set. Nsight Compute says everything stays in L1 or the register file, with no spilling, and that I am compute bound with good ILP. I can't find a way to get more than a 10% gain for the 300 W difference. Hence my question: has anyone done better, how, and how reliable does the hardware stay?
