Agree. Applying a normal Python for loop to a NumPy array to do simple math is just pure nonsense.
akasakahakada|2 years ago
Just tested how it goes without the compilation nonsense:
```
import numpy as np

a = np.random.random(int(1e6))

%timeit np.average(a)        # contiguous: all 1e6 elements
%timeit np.average(a[::16])  # strided view: every 16th element
```
And my result is that no matter how non-contiguous the memory access (here I take every 16th element, like they did; I also tested strides of 2, 4, 8, and 16), we are doing fewer operations, so it always ends up faster. By contrast, their compiled SIMD code is 10-20X slower in the non-contiguous case.
And for a larger array that is 16X the size of the contiguous one, where we take only 1/16 of its elements, the result is about 10X slower, as shown in the article. But I suspect that is simply because you now have a 16X larger array to load from memory, which is slow in itself.
```
b = np.random.random(int(16e6))

%timeit np.average(b[::16])  # 1e6 elements, spread over 16X the memory
```
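One way to see the footprint difference directly (a quick check of my own, not from the article): `b[::16]` has exactly as many elements as `a`, but the read has to walk across 16X as much memory:

```
import numpy as np

a = np.random.random(int(1e6))
b = np.random.random(int(16e6))

print(a.size, b[::16].size)        # same element count
print(a.nbytes // 2**20, "MiB")    # memory the contiguous read touches
print(b.nbytes // 2**20, "MiB")    # memory the strided read is spread across
print(b[::16].strides)             # 128-byte step between consecutive reads
```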
The conclusion is that people should use NumPy the right way. It is really hard to beat pure NumPy speed.
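Outside IPython, the same comparison can be reproduced with the stdlib `timeit` module. A minimal sketch (array size from the thread; exact timings are machine-dependent), with the naive Python loop included for contrast:

```
import numpy as np
from timeit import timeit

a = np.random.random(int(1e6))

def loop_average(arr):
    # naive pure-Python loop: one interpreted iteration per element
    total = 0.0
    for x in arr:
        total += x
    return total / len(arr)

n = 20
t_loop = timeit(lambda: loop_average(a), number=1)
t_full = timeit(lambda: np.average(a), number=n) / n
t_strided = timeit(lambda: np.average(a[::16]), number=n) / n

print(f"pure-Python loop    : {t_loop * 1e3:9.3f} ms")
print(f"np.average(a)       : {t_full * 1e3:9.3f} ms")
print(f"np.average(a[::16]) : {t_strided * 1e3:9.3f} ms")
```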
nerdponx|2 years ago
But that's precisely what makes this a good exercise: you can see how far you are able to close the gap between the naive looping implementation and the optimized array implementation.
Elucalidavah|2 years ago
But that's not the function in the article. The article implements `(a + b) / 2`.
And, on my system, the simple `return (arr1 + arr2) / 2` takes 1.2 ms, while `average_arrays_4` takes 0.74 ms.
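For context, a harness along these lines reproduces that comparison. `average_arrays_4` is the article's compiled version, so the preallocated-output variant below is only my stand-in for an "optimized" implementation, not the article's code:

```
import numpy as np
from timeit import timeit

arr1 = np.random.random(int(1e6))
arr2 = np.random.random(int(1e6))
out = np.empty_like(arr1)

def average_simple(a, b):
    # allocates temporaries for a + b and for the division result
    return (a + b) / 2

def average_prealloc(a, b, out):
    # reuses one output buffer; no per-call allocation
    np.add(a, b, out=out)
    out /= 2
    return out

n = 200
t_simple = timeit(lambda: average_simple(arr1, arr2), number=n) / n
t_prealloc = timeit(lambda: average_prealloc(arr1, arr2, out), number=n) / n
print(f"(arr1 + arr2) / 2 : {t_simple * 1e3:.3f} ms")
print(f"preallocated out  : {t_prealloc * 1e3:.3f} ms")
```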
thatsit|2 years ago
I can only imagine that this is already baked into NumPy now.