akssri | 5 years ago

- The function takes 29.4 ms in CuPy and 427 ms in NumPy. Happy?

- Broadcasting semantics plus division take care of the outer-product normalization. That is two L1 (BLAS Level 1 style) operations in the size of the matrix and the input (roughly xSCAL).

Pedantry is still not an argument.
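
For readers following along, the broadcasting-plus-division normalization mentioned above can be sketched roughly as follows. This is not the implementation being timed in this thread; it is a minimal NumPy illustration, assuming the matrix is a covariance matrix and the vector holds its standard deviations. The name `normalize_by_outer` is hypothetical.

```python
import numpy as np

def normalize_by_outer(c, d):
    """Return c[i, j] / (d[i] * d[j]) without building outer(d, d).

    Hypothetical helper for illustration only.
    """
    out = c / d[:, None]   # broadcast over columns: row i is divided by d[i]
    out /= d[None, :]      # broadcast over rows: column j is divided by d[j]
    return out

# Assumed example data: 4 variables, 1000 observations each.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1000))

c = np.cov(x)              # covariance matrix, shape (4, 4)
d = np.sqrt(np.diag(c))    # per-variable standard deviations
r = normalize_by_outer(c, d)
```

Each of the two divisions is a single scaling pass over the matrix, which is why it compares to an xSCAL-style operation. CuPy mirrors NumPy's broadcasting rules, so the same sketch should presumably work with `cupy` arrays, though that is not verified here.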

dragandj|5 years ago

Thank you so much, that's phenomenal news for me! (I can make Neanderthal code run in 23 ms (GPU) and 3XX ms (CPU) when I implement it the way NumPy/CuPy/PyTorch do, sans float64 conversion, of course.) You saved me from having to fiddle with Python, which I don't particularly enjoy. Thanks again!

Could you please post your implementation of this function here, so I can try it on my machine and compare it to Neanderthal?