saynsedit | 9 years ago
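#include <stdlib.h>
#include <stdint.h>
#include <string.h>  /* memcpy */

/* Sum nwords 64-bit words starting at p; p does not need to be aligned. */
uint64_t sum (char *p, size_t nwords)
{
    uint64_t res = 0;
    size_t i;
    for (i = 0; i < nwords; i++) {
        uint64_t tmp;
        /* memcpy sidesteps the undefined behavior of a misaligned uint64_t load */
        memcpy(&tmp, &p[i * sizeof(tmp)], sizeof(tmp));
        res += tmp;
    }
    return res;
}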
qb45|9 years ago
Deal breaker: your memcpy call relies on a sufficiently smart compiler to turn it into a normal unaligned load on x86, and it seems to prevent GCC autovectorization. In this case OP actually didn't want vectorization, but in general such workarounds can confuse compilers and produce worse code.
saynsedit|9 years ago
Vectorization is generally not applicable here, since most implementations (though not all) require aligned memory. In any case, benchmarking is more appropriate than armchair optimizing.
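For anyone who wants to benchmark rather than guess, here is a rough sketch of a timing harness. It is not from the thread: `sum_words` is a restatement of the memcpy loop that assumes the length is a word count, and `time_sum` is a hypothetical helper that times repeated passes starting at a chosen byte offset, so an aligned run (offset 0) can be compared with a misaligned one (offset 1).

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Memcpy-based summation loop; assumes nwords counts 64-bit words. */
static uint64_t sum_words(const char *p, size_t nwords)
{
    uint64_t res = 0;
    for (size_t i = 0; i < nwords; i++) {
        uint64_t tmp;
        /* Copy 8 bytes into an aligned temporary instead of casting p. */
        memcpy(&tmp, p + i * sizeof tmp, sizeof tmp);
        res += tmp;
    }
    return res;
}

/* Hypothetical helper: run `reps` passes over nwords words starting at
 * byte offset `off` in buf. Returns elapsed CPU seconds and stores the
 * accumulated sum in *out so the result is observable to the caller. */
static double time_sum(const char *buf, size_t off, size_t nwords,
                       int reps, uint64_t *out)
{
    uint64_t acc = 0;
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        acc += sum_words(buf + off, nwords);
    clock_t t1 = clock();
    *out = acc;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

To use it, allocate a buffer of at least `off + nwords * 8` bytes, call `time_sum` once with `off = 0` and once with `off = 1`, and compare the two times. An optimizing compiler may hoist or merge the repeated passes, so treat the numbers with suspicion and check the generated assembly before drawing conclusions.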