top | item 44140626

unignorant | 9 months ago

Yeah, it seems likely the underlying task here (one reasoning step away) was: replace as many fp32 operations as possible in this kernel with fp16. I'm not sure exactly how challenging a port like that is, but intuitively it seems a bit less impressive.

Maybe this intuition is wrong, but if so it would be great for the work to address it explicitly!
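For what it's worth, one concrete reason a straight fp32 -> fp16 rewrite isn't purely mechanical is accumulator precision. A quick host-side illustration with numpy (not taken from the kernels in question, just a generic example):

```python
import numpy as np

# Sequentially sum 4096 copies of 0.1 (true sum: 409.6). The fp32
# accumulator stays close; the fp16 accumulator stalls once the running
# sum reaches a range where 0.1 is below half the fp16 spacing, so each
# add rounds back down to the previous value.
x32 = np.full(4096, 0.1, dtype=np.float32)
x16 = x32.astype(np.float16)

sum32 = np.float32(0.0)
for v in x32:
    sum32 = np.float32(sum32 + v)

sum16 = np.float16(0.0)
for v in x16:
    sum16 = np.float16(sum16 + v)

print(float(sum32))  # close to 409.6
print(float(sum16))  # stalls well below the true sum
```

This is exactly why real fp16 ports usually keep reductions (softmax sums, layernorm statistics, matmul accumulators) in fp32 and only move storage and elementwise math to fp16.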

AlotOfReading | 9 months ago

It only seems to have done that in a couple of places, like the MatMul. The softmax kernel (https://github.com/ScalingIntelligence/good-kernels/blob/mai...) seems to be entirely bog-standard, and the layernorm kernels are only slightly more interesting.

ekelsen|8 months ago

I looked at the softmax kernel and the cast that it does from a float* to a float4* is extremely brittle -- it's trivial to break by offsetting the input slightly.

A kernel meant for a standard library very likely could not employ a trick that relies on the alignment of input pointers -- certainly not without a fallback.
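To make the failure mode concrete, here's a host-side numpy sketch (the helper name is made up) of the assumption behind a float* -> float4* reinterpret: a float4 load needs a 16-byte-aligned address, and a view offset by one float can never satisfy that if the base does.

```python
import numpy as np

def aligned_for_float4(arr):
    # The check a robust kernel would make before taking the vectorized
    # float4 path (falling back to scalar float loads otherwise):
    # the data pointer must be 16-byte aligned.
    return arr.ctypes.data % 16 == 0

buf = np.zeros(8, dtype=np.float32)
shifted = buf[1:]  # same buffer, offset by one 4-byte float

# The two views differ by 4 bytes, so at most one of them can be
# 16-byte aligned -- which is why slightly offsetting the input breaks
# a kernel that blindly reinterprets float* as float4*.
print(aligned_for_float4(buf), aligned_for_float4(shifted))
```

On a GPU the misaligned vector load isn't just slow, it's undefined behavior, hence the point about needing a scalar fallback path.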