nortiero | 7 years ago | on: Unity Engine ToS change makes cloud-based SpatialOS games illegal
nortiero's comments
nortiero | 7 years ago | on: How Did Indonesia and Malaysia Become Majority-Muslim?
nortiero | 7 years ago | on: Treating violence as a public health problem has produced great results
nortiero | 7 years ago | on: No Man’s Sky developer Sean Murray: ‘It was as bad as things can get’
nortiero | 7 years ago | on: Ask HN: What's the best documentation you've ever read?
nortiero | 7 years ago | on: Arizona Uber crash driver was 'watching TV'
nortiero | 7 years ago | on: Intel CEO resigns after relationship with employee
nortiero | 8 years ago | on: Starbucks to Close All U.S. Stores for Racial-Bias Education
edit: gratitude, not appreciation, silly me.
nortiero | 8 years ago | on: Starbucks to Close All U.S. Stores for Racial-Bias Education
nortiero | 8 years ago | on: Chuck Moore, Extreme Programmer
nortiero | 8 years ago | on: Ask HN: How do you deal with overconfident and mediocre individuals?
nortiero | 9 years ago | on: Memory bandwidth
But as soon as the wheels start turning, they spit out a lot of bytes, for sure!
nortiero | 9 years ago | on: Memory bandwidth
Here is the small program I wrote. It first allocates an array, then writes into it a randomly placed linked list that touches every cell (e.g. start->|4|6|2|5|3|end, so it goes to cell 1, then 4, then 5, then 3, 2, 6 and done).
Then it reads the same array from start to finish, just to sum its contents. This is fast and reaches 6-8 GB/s, depending on unrelated memory pressure from video and OS tasks. This is the advertised speed. So I don't think it's a swap issue (and my SSD is way faster :-). I've also tried with smaller and bigger arrays. Just ensure the array stays in real memory.
Moving around in a big array is very hard on the memory controller:
- no prefetch possible;
- row and column changes at (almost) every access, so commands have to be issued to burst terminate, close row, close bank, precharge maybe, open bank, activate row, fetch address, and who knows what else;
- no advantage from interleave;
- no advantage from a fat 64 or 128 bit bus.
It is the combined time needed for each and every memory read (due to randomization) that causes performance to drop.
This is the code I've been using. I've modified it to work on Unices (OSX has an issue with the timeval struct), but I have not tested it on them. It compiles, but may need to be altered a bit. Launch it with ./a.out <array_size> <random seed>. You will notice that, as soon as the array no longer fits in the caches, all hell breaks loose. Works with gcc or clang.
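The original listing did not survive in this copy of the thread. What follows is only a minimal sketch of the same idea, not the author's program: it assumes a single-cycle random permutation (Sattolo's algorithm) instead of a terminated list, times with the standard `clock()`, and uses hypothetical names throughout (`xorshift64`, `build_chain`, `scan`, `chase`, `run_benchmark`).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: small xorshift64 generator, so the shuffle does not
   depend on the platform's RAND_MAX. The state must be nonzero. */
uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Fill next[0..n-1] with a random single-cycle permutation (Sattolo's
   algorithm): following next[] from cell 0 visits every cell exactly once. */
void build_chain(uint32_t *next, size_t n, uint64_t seed)
{
    uint64_t state = seed ? seed : 1;   /* xorshift needs a nonzero state */
    assert(n <= UINT32_MAX);            /* indices are stored as uint32_t */
    for (size_t i = 0; i < n; i++)
        next[i] = (uint32_t)i;
    for (size_t i = n; i-- > 1; ) {     /* i runs from n-1 down to 1 */
        size_t j = (size_t)(xorshift64(&state) % i);   /* 0 <= j < i */
        uint32_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Sequential pass: sum the cells in address order. */
uint64_t scan(const uint32_t *next, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += next[i];
    return sum;
}

/* Random pass: each load depends on the previous one, so the hardware
   cannot prefetch or overlap the accesses. */
uint64_t chase(const uint32_t *next, size_t n)
{
    uint64_t sum = 0;
    uint32_t p = 0;
    for (size_t i = 0; i < n; i++) {
        p = next[p];
        sum += p;
    }
    return sum;
}

/* Time both passes over the same array and print seconds and MB/s. */
void run_benchmark(size_t n, uint64_t seed)
{
    uint32_t *next = malloc(n * sizeof *next);
    if (next == NULL) {
        perror("malloc");
        return;
    }
    build_chain(next, n, seed);
    double mb = (double)n * sizeof *next / 1e6;

    clock_t t0 = clock();
    uint64_t s1 = scan(next, n);
    double seq = (double)(clock() - t0) / CLOCKS_PER_SEC;

    t0 = clock();
    uint64_t s2 = chase(next, n);
    double rnd = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* Printing the sums keeps the compiler from discarding the loops. */
    printf("sums: %llu %llu\n", (unsigned long long)s1, (unsigned long long)s2);
    printf("sequential: %.3f s", seq);
    if (seq > 0.0) printf(" (%.0f MB/s)", mb / seq);
    printf("\nrandom:     %.3f s", rnd);
    if (rnd > 0.0) printf(" (%.0f MB/s)", mb / rnd);
    printf("\n");
    free(next);
}
```

To try it, call `run_benchmark(100000000, 1)` from a `main` of your own and compile with optimizations on. Both passes read every cell exactly once, so the two sums must match; only the order of the accesses differs.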
nortiero | 9 years ago | on: Memory bandwidth
As an example, on my 2011 MacBook Air with ~10 GB/s of maximum RAM bandwidth, random access to memory can take 100 times longer than sequential access.
This is in C, with full optimizations and using a very low overhead read loop.
Using the same metrics of the author:
best case: ~ 3 bytes per cycle
(around 6 Gigabyte per second of available bandwidth)
worst case: ~ 0.024 bytes per cycle (every scheduler, prefetcher and already-open column mostly defeated)
Note that the worst case takes 10 seconds (!) to read and sum, in random order, all the cells of an array of 100,000,000 4-byte integers, exactly once. The main loop is light enough not to influence the test.
That's about 40 megabytes per second out of 6,000 available.
What can I say... CPU designers are truly wizards!
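The arithmetic behind these figures can be checked with a few lines of C. This is only a sketch; the function names `throughput_mb_s` and `slowdown` are made up for illustration, and the inputs are the numbers quoted in this comment.

```c
#include <stdio.h>

/* Throughput in megabytes per second for `cells` reads of `bytes_per_cell`
   bytes each, completed in `seconds`. */
double throughput_mb_s(double cells, double bytes_per_cell, double seconds)
{
    return cells * bytes_per_cell / seconds / 1e6;
}

/* How many times slower the slow case is than the fast one. */
double slowdown(double fast_mb_s, double slow_mb_s)
{
    return fast_mb_s / slow_mb_s;
}

/* Print the comparison using the figures from the comment above. */
void print_comparison(void)
{
    double worst = throughput_mb_s(100e6, 4.0, 10.0);   /* random pass */
    double best = 6000.0;                               /* ~6 GB/s sequential */
    printf("worst case: %.0f MB/s\n", worst);
    printf("ratio:      %.0fx\n", slowdown(best, worst));
}
```

With 100,000,000 cells of 4 bytes read in 10 seconds, the first function gives 40 MB/s, which is the figure quoted above.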
nortiero | 9 years ago | on: Changes I would make to Go
nortiero | 9 years ago | on: GCC optimized code gives strange floating point results
floating-point calculations can be executed with a wider range (§5.1.2.3.12);
assignments and casts are under obligation to wipe out that extra precision (same paragraph);
I am no language lawyer, but... given the issue of program observable effects (§5.1.2.3.12) and the implied "as if" rule that governs optimizations, how could equality possibly be stable over time?
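A small C sketch of the situation described here, under stated assumptions: the function names are hypothetical, and whether the comparison ever fails depends on the compiler, its flags and the target (it can only differ where intermediate results carry excess precision, which `FLT_EVAL_METHOD` from `<float.h>` reports).

```c
#include <float.h>
#include <stdio.h>

/* Describe the standard FLT_EVAL_METHOD values from <float.h>. */
const char *eval_method_description(int method)
{
    switch (method) {
    case 0:  return "operations evaluated in the range and precision of their type";
    case 1:  return "float and double evaluated as double";
    case 2:  return "everything evaluated as long double";
    default: return "indeterminable or implementation-defined";
    }
}

/* Compare a quotient that was stored in a double against the same quotient
   recomputed inside the comparison. The assignment must round to double;
   the recomputed value may still carry excess precision. */
int stored_equals_recomputed(double x, double y)
{
    volatile double stored = x / y;   /* volatile forces a real store */
    return stored == x / y;
}

/* Print the evaluation method and the result for 1.0 / 3.0. */
void report(void)
{
    printf("FLT_EVAL_METHOD = %d (%s)\n", (int)FLT_EVAL_METHOD,
           eval_method_description((int)FLT_EVAL_METHOD));
    printf("1.0/3.0 stored == recomputed: %d\n",
           stored_equals_recomputed(1.0, 3.0));
}
```

A quotient such as 1.0 / 2.0 is exact at any precision, so it always compares equal; 1.0 / 3.0 is the interesting case, because its rounded double differs from its extended-precision value.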
nortiero | 9 years ago | on: GCC optimized code gives strange floating point results
On the other hand, IEEE 754 allows extended precision to be used in place of double and of course requires that any chosen precision be kept, or else.
But the C Standard mandates IEEE 754, too (Annex F, for implementations that define __STDC_IEC_559__)!
It seems to me that modern C deliberately chose to ignore this kind of mismatch in the name of (substantial) performance gains. K&R, good or bad, was way simpler!