why_only_15|5 months ago
This trick is very useful on Nvidia GPUs for computing mins and maxes in some cases, e.g. atomic mins (u32 has better support than f32) or warp-wide mins with `redux.sync` (which only supports u32, not f32).
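The float-as-unsigned-int reinterpretation this relies on can be sketched in Rust. This is my illustration of the standard order-preserving mapping, not code from the thread; a plain `u32` min stands in for the u32-only GPU primitives (atomic min, `redux.sync`) mentioned above:

```rust
// Sketch: an order-preserving f32 -> u32 mapping. Negative floats get
// all bits flipped; non-negative floats get only the sign bit flipped.
// Unsigned integer comparison on the results then matches the usual
// float order, which is what makes u32-only primitives usable for f32.
fn f32_to_ordered_u32(x: f32) -> u32 {
    let bits = x.to_bits();
    if bits & 0x8000_0000 != 0 {
        !bits               // negative: flip everything
    } else {
        bits | 0x8000_0000  // non-negative: flip the sign bit
    }
}

fn ordered_u32_to_f32(bits: u32) -> f32 {
    if bits & 0x8000_0000 != 0 {
        f32::from_bits(bits & 0x7FFF_FFFF) // was non-negative
    } else {
        f32::from_bits(!bits)              // was negative
    }
}

fn main() {
    let xs = [3.5f32, -1.0, 0.0, -7.25, 2.0];
    // A u32 min (think: atomicMin on unsigned) over the mapped values...
    let m = xs.iter().map(|&x| f32_to_ordered_u32(x)).min().unwrap();
    // ...recovers the float min after mapping back.
    println!("{}", ordered_u32_to_f32(m)); // -7.25
}
```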
teo_zero|5 months ago
I'd hazard a guess that it would be the same: the compiler would produce a loop from `.iter()`, expose the loop index via `.enumerate()`, and keep track of that index in `.min_by()`. I suppose the closure would be inlined, maybe even along with the comparisons.
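For concreteness, a minimal sketch (my code, not from the thread) of the iterator chain described above:

```rust
// Sketch of the iterator-based argmin under discussion: .iter() walks
// the slice, .enumerate() exposes the index, .min_by() tracks the
// minimum. Iterator::min_by returns the first of equal minima, so
// ties resolve to the lowest index.
fn argmin(xs: &[u32]) -> Option<usize> {
    xs.iter()
        .enumerate()
        .min_by(|&(_, a), &(_, b)| a.cmp(b))
        .map(|(i, _)| i)
}

fn main() {
    println!("{:?}", argmin(&[5, 2, 9, 1, 7])); // Some(3)
}
```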
nine_k|5 months ago
I wonder whether that could be made faster using AVX instructions; they can find the minimum value among several u32 values, but not immediately its index.
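One standard way around the missing-index problem is two passes, which keeps the reduction itself free of index bookkeeping. A scalar Rust sketch of that shape (my code; the first pass is the part SIMD handles well, left here to the autovectorizer):

```rust
// Sketch of a two-pass argmin: pass 1 computes only the minimum
// (no index tracking, so it vectorizes well); pass 2 rescans for
// the first position holding that value.
fn argmin_two_pass(xs: &[u32]) -> Option<usize> {
    let min = *xs.iter().min()?;       // pass 1: vector-friendly reduction
    xs.iter().position(|&x| x == min)  // pass 2: cheap linear rescan
}

fn main() {
    println!("{:?}", argmin_two_pass(&[5, 2, 9, 1, 7])); // Some(3)
}
```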
TheDudeMan|5 months ago
Another speedup here would be using SIMD, although it would be interesting to look at the assembly to see whether it was auto-vectorized already.
meisel|5 months ago
This reminds me of a trick to sort floats faster, even when they include negatives, NaNs, and infinities: map each float to a sortable integer version of itself so they can be compared as ints (the precise mapping depends on how you want to order things like NaN). The one-time conversion is fast and pays for itself over the O(n log n) comparisons. Then, after sorting, map them back.
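The mapping the comment leaves open can be made concrete. The sketch below picks the IEEE total-order-style mapping (negative NaNs first, positive NaNs last, matching `f32::total_cmp`); that choice is mine, not something the comment prescribes:

```rust
// Sketch of the sort-floats-as-ints trick: convert each f32 to a u32
// whose unsigned order matches the desired float order, sort the
// integers, then convert back.
fn to_key(x: f32) -> u32 {
    let b = x.to_bits();
    if b >> 31 == 1 { !b } else { b | 0x8000_0000 }
}

fn from_key(b: u32) -> f32 {
    if b >> 31 == 1 { f32::from_bits(b & 0x7FFF_FFFF) } else { f32::from_bits(!b) }
}

fn sort_floats(xs: &mut Vec<f32>) {
    // One-time O(n) conversion...
    let mut keys: Vec<u32> = xs.iter().map(|&x| to_key(x)).collect();
    // ...so all O(n log n) comparisons are plain integer compares...
    keys.sort_unstable();
    // ...then one O(n) conversion back.
    *xs = keys.into_iter().map(from_key).collect();
}

fn main() {
    let mut v = vec![2.0f32, -1.5, f32::INFINITY, 0.0, f32::NEG_INFINITY];
    sort_floats(&mut v);
    println!("{:?}", v); // [-inf, -1.5, 0.0, 2.0, inf]
}
```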