This is Core 2, so it still has the P6's microarchitectural limitation that it can only read two (or three?) values from the register files each cycle. But it's a 4-way processor, so it can potentially need up to 8 operands. If the other operands are on the bypass networks, it's fine. If not, the CPU stalls. My guess would be that the MOV has the effect of keeping an operand on the bypass network long enough to avoid a stall. I'm totally guessing, but that might explain why it's so sensitive to instruction ordering.
Sorry, this explanation is almost surely incorrect. How long something is available on the bypass network is determined by how long the instruction that produces the value takes to "exit" the pipeline. I can't imagine any scenario where a consumer instruction causes a producer instruction (i.e., an instruction "ahead" of it) to stall. Note this would be a dangerous design point because of the risk of deadlocks.
What's the source for your claim that the Core uarch's register file is underdesigned in comparison to the dispatch width? I'd be extremely surprised if this were the case. Last time I looked at the data, about 50-70% of the reads go to the register file not the bypass network.
Could it have something to do with a stall during register renaming?
r8d/r9d are used "a lot", so it may have something to do with dependencies between steps, especially from the end of the loop back to the beginning (if I understood correctly).
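If a loop-carried dependency chain is the bottleneck, the core's width stops mattering: throughput is capped by whichever is larger, uops per iteration divided by issue width, or the latency of the chain carried into the next iteration. A toy sketch of that bound, using assumed round-number figures (not real Core 2 latencies):

```python
def min_cycles_per_iter(uops, width, carried_latency):
    """Lower bound on cycles per loop iteration: the core can issue at
    most `width` uops per cycle, but it also can't start iteration i+1's
    dependency chain before iteration i's chain finishes."""
    return max(uops / width, float(carried_latency))

# 12 uops on a 4-wide core would ideally take 3 cycles/iteration,
# but a 5-cycle loop-carried chain caps it at 5:
print(min_cycles_per_iter(12, 4, 5))  # -> 5.0
print(min_cycles_per_iter(12, 4, 1))  # -> 3.0
```

If extra MOVs shortened or broke such a chain (by keeping a value live elsewhere), that would show up as a real speedup despite the added instructions.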
Sorry for the off-topic discussion, but I would like to give some attention to FreePascal. In my opinion, it is a great piece of software, and it is so underrated. If you find C/C++ too error-prone or hard to learn, Object Pascal is a very good alternative. FreePascal is a multi-platform, Delphi-compatible Object Pascal compiler; it generates pretty well-optimized native code for multiple architectures (including ARM) and has plenty of libraries. If you haven't already, please give Lazarus[1] a try; it's a nice RAD IDE (very similar to Borland Delphi 7) shipped with the FreePascal compiler.
I love Pascal, but once you move beyond the Turbo Pascal features of Free Pascal and start trying to use the Lazarus IDE and the Delphi look-alike features, the going gets really rough. The documentation is terrible. There's a lot there, but it's all fragmented and half-baked.
They need to freeze development and go look at the documentation for python, perl, racket or a whole host of other languages.
Does this have anything whatsoever to do with the posted topic? There must be some connection, and I'm twisting my brain trying to figure out what it is.
If I may also use this opportunity to advertise a Pascal-like language, as it is also severely underrated. The language's name is Nimrod [1]. It compiles to C, so it is also very portable; it uses whitespace to delimit blocks, is statically typed with generics, and has great support for metaprogramming. I am developing an IDE for it called Aporia [2], written in Nimrod; it definitely doesn't have as many features as Lazarus, but it works well as a simple editor. Other Nimrod projects can be found in the nimrod-code organisation on GitHub [3].
I highly recommend the Stanford EE380 video "Things CPU Architects Need To Think About". Even though it is from 2004, much of the material is still relevant. Bob Colwell notes that they had similar unexpected throughput slowdowns when implementing the P6 and NetBurst processor cores, and discovered that adding random delays cleared them out. The causes of these throughput hiccups could be traced back, sometimes hundreds or thousands of instructions, but were extremely difficult to address. They also found that manufacturing assembly lines had similar problems that were likewise addressed by adding delays.
It's such a good talk. There's even a point where he almost says "we're sorry" regarding the Pentium 4, at a time when a lot of people still believed Intel's marketing that it was "the best." I understood him, as the application I worked on at the time was significantly faster on the Pentium M than on the Pentium 4.
There are figures on page 6 that answer the OP's question directly. And there's also this money quote: "Building an accurate model for modern processors is virtually impossible."
Very interesting indeed. In particular:
"we find that most performance results stay flat, in particular for the Nop Killer pass (NOPKILL), which leads to the conclusion that most alignment directives are not helping"
The code keeps using the high registers for no discernible reason, instead of sticking with EAX through EDI; this means that each instruction is 1 byte longer (a REX prefix) than it could be.
This has consequences to the instruction fetching and decoding circuitry, where (in the Core 2) you can only read 16 instruction bytes per cycle (or 6 instructions). It is possible that the extraneous MOV instructions are just resulting in a better instruction alignment.
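As a rough illustration of that alignment effect, here is a Python sketch of how a small filler instruction can shift a loop head relative to 16-byte fetch-window boundaries. The instruction sizes are made up for the example; only the 16-byte fetch limit comes from the discussion above:

```python
FETCH_WINDOW = 16  # Core 2 front end reads 16 instruction bytes per cycle

def windows_touched(start_offset, sizes):
    """How many 16-byte fetch windows a run of instructions of the given
    byte sizes touches when the first one starts at start_offset."""
    end = start_offset + sum(sizes)
    return (end - 1) // FETCH_WINDOW - start_offset // FETCH_WINDOW + 1

loop_body = [3, 3, 4, 2, 2]  # 14 bytes of hypothetical loop instructions

# With the loop head at offset 13, the body straddles a window boundary:
print(windows_touched(13, loop_body))      # -> 2
# A 3-byte filler MOV before the loop (e.g. a REX-prefixed reg-reg move
# like mov r8d, r9d) pushes the head to offset 16, so the body now fits
# in a single fetch window:
print(windows_touched(13 + 3, loop_body))  # -> 1
```

Under this reading, which MOV you insert and where it sits both matter only through the byte offsets they produce, which would be consistent with the observed sensitivity to ordering.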
> The code keeps using high registers for no discernible reason
The function[1] inside of which the assembly is located declares a lot of variables as well. I don't know how well Free Pascal does register allocation, but perhaps this avoids clobbering the registers it prefers. Alignment seems like a likely candidate, but isn't likely to explain why the exact ordering of the extra ops makes a difference.
Would help to see the assembly for the whole function.
The answer can be one of a billion reasons: dependency chain breaking, register renaming deficiencies, alias breaking, etc.
First step needs to be the hardware performance counters. Usually they will tell you something interesting, but leave you scratching your head.
Then you find a friend at Intel, who laughs and explains some wildly esoteric detail of the Core 2 processor that causes this.
This is why scheduling and register allocation models in compilers don't bother with ILP-based modeling for x86 processors. Even if you modeled the externally published architecture perfectly, including port restrictions, decode lengths, branch stalls, cycle counts, and register requirements, the processor just does whatever it wants internally anyway, because they don't really give you all the details.
In my experience the timing jitter on a general-purpose PC - the typical difference between runs of the same code on the same data - is in the ballpark of a couple of percent (and no, back-to-back runs aren't fully independent of each other so you can't average it out over a couple of dozen runs).
The claimed timing difference was less than one percent so while I certainly can't say it wasn't real, I'm not seeing any evidence that it was.
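One mitigation, and the standard advice in Python's own timeit documentation: interference from the OS and other processes only ever adds time, so the minimum over repeated runs is a much less noisy estimate than the mean. A minimal sketch:

```python
import time

def best_of(fn, repeats=20):
    """Return the minimum wall-clock time of fn() over several runs.
    Scheduler interrupts, frequency scaling, cache pollution, etc. only
    ever add time, so the minimum is the most repeatable estimate."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

# Differences smaller than the run-to-run jitter you observe this way
# (often a percent or two on a desktop) shouldn't be trusted.
t = best_of(lambda: sum(range(10_000)))
```

Even with min-of-N, a sub-1% difference needs many independent trials (ideally interleaved A/B runs) before it can be called real.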
>randomly inserting nop instructions in programs can easily increase performance by 5% or more, and no, compilers cannot easily exploit this.
Could someone please help me understand why this isn't exploitable? Surely one could just tack a module onto the end of the compiler that brute-forces this? Or use something like a genetic algorithm to trial-and-error it.
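It is exploitable in principle; the catch is cost and fragility. Every candidate padding requires a rebuild plus a noisy benchmark run, and the winning layout is specific to one binary on one microarchitecture. A toy random search over NOP placements, where `measure` is a hypothetical stand-in for "reassemble with NOPs at these sites and benchmark":

```python
import random

def search_nop_padding(n_sites, measure, iters=300, seed=1):
    """Random search over which of n_sites instruction boundaries get a
    NOP. `measure` maps a 0/1 padding tuple to a runtime; a real one
    would rebuild the binary and time it, which is slow, noisy, and has
    to be redone per target CPU -- the reason compilers skip this."""
    rng = random.Random(seed)
    best = tuple([0] * n_sites)
    best_time = measure(best)
    for _ in range(iters):
        cand = tuple(rng.randint(0, 1) for _ in range(n_sites))
        t = measure(cand)
        if t < best_time:
            best, best_time = cand, t
    return best, best_time

# Toy stand-in cost: pretend a NOP at site 3 removes an alignment stall,
# while every extra NOP costs a little fetch bandwidth.
def toy_measure(pads):
    return 100.0 - 5.0 * pads[3] + 0.1 * sum(pads)

best, t = search_nop_padding(8, toy_measure)
print(best[3], t < 100.0)
```

Superoptimizer-style searches like this do exist as research tools; they just don't survive contact with a default compiler pipeline, where the measurement budget and per-CPU tuning are prohibitive.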
[1] - http://lazarus.freepascal.org
[1] - http://nimrod-code.org
[2] - https://github.com/nimrod-code/aporia
[3] - https://github.com/nimrod-code/
http://www.stanford.edu/class/ee380/ay0304.html (18 Feb 2004, site appears to be having issues at the moment)
Here's an alternative link, that actually works: http://lang.stanford.edu/courses/ee380/2003-2004/040218-ee38...
MAO - an Extensible Micro-Architectural Optimizer - http://static.googleusercontent.com/external_content/untrust...
[1] https://github.com/tangentstorm/coinops/blob/junkops/sha256_...
Here's a full explanation (with references and links to the diagnostic tools): http://stackoverflow.com/a/17906589/1001643