top | item 6114530

Why would useless MOV instructions speed up a tight loop in x86_64 assembly?

120 points | nkurz | 12 years ago | stackoverflow.com

41 comments

[+] rayiner|12 years ago|reply
This is Core 2, so it still has the P6's microarchitectural limitation that it can only read two (or three?) values from the register file each cycle. But it's a 4-way processor, so it can potentially need up to 8 operands. If the other operands are on the bypass networks, it's fine. If not, then the CPU stalls. My guess would be that the MOV has the effect of keeping an operand on the bypass network long enough to avoid a stall. I'm totally guessing, though, but that might explain why it's so sensitive to instruction ordering.
[+] microarchitect|12 years ago|reply
Sorry, this explanation is almost surely incorrect. How long something is available on the bypass network is determined by how long the instruction that produces the value takes to "exit" the pipeline. I can't imagine any scenario where a consumer instruction causes a producer instruction (i.e., an instruction "ahead" of it) to stall. Note this would be a dangerous design point because of the risk of deadlocks.

What's the source for your claim that the Core uarch's register file is underdesigned in comparison to the dispatch width? I'd be extremely surprised if this were the case. Last time I looked at the data, about 50-70% of the reads go to the register file not the bypass network.

[+] Filligree|12 years ago|reply
Sorry, what's a bypass network in this context?
[+] raverbashing|12 years ago|reply
Could it have something to do with stalling while register renaming?

r8d/r9d are used "a lot" so it may have something to do with dependencies between steps, especially from end of the loop to the beginning (if I understood correctly)

[+] conductor|12 years ago|reply
Sorry for the off-topic discussion, but I would like to give some attention to FreePascal. In my opinion, it is a great piece of software, and is so underrated. If you find C/C++ too error-prone or hard to learn, Object Pascal is a very good alternative. FreePascal is a multi-platform, Delphi-compatible Object Pascal compiler; it can generate pretty optimized native code for multiple architectures (including ARM) and has plenty of libraries. If you haven't already, please give Lazarus[1] a try; it's a nice RAD IDE (very similar to Borland Delphi 7) shipped with the FreePascal compiler.

[1] - http://lazarus.freepascal.org

[+] claystu|12 years ago|reply
I love Pascal, but once you move beyond the Turbo Pascal features of Free Pascal and start trying to use the Lazarus IDE and the Delphi look-alike features, the going gets really rough. The documentation is terrible. There's a lot there, but it's all fragmented and half-baked.

They need to freeze development and go look at the documentation for Python, Perl, Racket, or a whole host of other languages.

[+] sliverstorm|12 years ago|reply
Does this have anything whatsoever to do with the posted topic? There must be some connection, and I'm twisting my brain trying to figure out what it is.
[+] dom96|12 years ago|reply
If I may also use this opportunity to advertise a Pascal-like language, as it is also severely underrated. The language's name is Nimrod [1], it compiles to C so it is also very portable, uses whitespace to delimit blocks, is statically typed with generics and has great support for metaprogramming. I am developing an IDE for it called Aporia [2], it is written in Nimrod, it definitely doesn't have as many features as Lazarus but it works well as a simple editor. Other nimrod projects can be found in the nimrod-code organisation on github[3].

[1] - http://nimrod-code.org

[2] - https://github.com/nimrod-code/aporia

[3] - https://github.com/nimrod-code/

[+] rogerbinns|12 years ago|reply
I highly recommend the Stanford EE380 video "Things CPU Architects Need To Think About". Even though it is from 2004, much of the material is still relevant. Bob Colwell notes that they hit similar unexpected throughput slowdowns when implementing the P6 and NetBurst processor cores, and discovered that adding random delays cleared them out. The causes of the throughput hiccups could sometimes be traced back hundreds or thousands of instructions, but were extremely difficult to address. They also found that manufacturing assembly lines had similar problems, which were likewise addressed by adding delay.

http://www.stanford.edu/class/ee380/ay0304.html (18 Feb 2004, site appears to be having issues at the moment)

[+] acqq|12 years ago|reply
It's such a good talk. There's even a point where he almost says "we're sorry" regarding the Pentium 4, at a time when a lot of people still believed Intel's marketing that it was "the best." I understood him, as the application I worked on at the time was significantly faster on the Pentium M than on the Pentium 4.
[+] incision|12 years ago|reply
Interesting, relevant paper mentioned in response to the linked SO question:

MAO - an Extensible Micro-Architectural Optimizer - http://static.googleusercontent.com/external_content/untrust...

[+] kilowatt|12 years ago|reply
There are figures on page 6 that answer the OP's question directly. And there's also this money quote: "Building an accurate model for modern processors is virtually impossible."
[+] PuercoPop|12 years ago|reply
Very interesting indeed. In particular: "we find that most performance results stay flat, in particular for the Nop Killer pass (NOPKILL), which leads to the conclusion that most alignment directives are not helping"
[+] pbsd|12 years ago|reply
The code keeps using high registers for no discernible reason, instead of sticking with EAX--EDI; this means that each instruction is 1 byte longer than it could be.

This has consequences for the instruction fetching and decoding circuitry, where (in the Core 2) you can only read 16 instruction bytes per cycle (or 6 instructions). It is possible that the extraneous MOV instructions are just resulting in better instruction alignment.
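To make the length difference concrete, here's a minimal sketch with hand-assembled byte sequences (written out per the Intel manual's `89 /r` encoding of register-to-register MOV, not produced by any assembler):

```python
# Hand-assembled x86_64 encodings of "mov r/m32, r32" (opcode 0x89).
# Low registers need no prefix; touching r8d-r15d requires a REX
# prefix (here REX.RB = 0x45), costing one extra byte per instruction.

mov_eax_ecx = bytes([0x89, 0xC8])        # mov eax, ecx -> 2 bytes
mov_r8d_r9d = bytes([0x45, 0x89, 0xC8])  # mov r8d, r9d -> 3 bytes

print(len(mov_eax_ecx))  # 2
print(len(mov_r8d_r9d))  # 3
```

With Core 2's 16-byte fetch window, a loop body built from 3-byte MOVs fits roughly 5 such instructions per fetch instead of 8, so the choice of registers shifts where instructions fall relative to fetch-line boundaries.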

[+] nkurz|12 years ago|reply
> The code keeps using high registers for no discernible reason

The function[1] inside of which the assembly is located declares a lot of variables as well. I don't know how well Free Pascal does register allocation, but perhaps this avoids clobbering the registers it prefers. Alignment seems like a likely candidate, but isn't likely to explain why the exact ordering of the extra ops makes a difference.

It would help to see the assembly for the whole function.

[1] https://github.com/tangentstorm/coinops/blob/junkops/sha256_...

[+] mistercow|12 years ago|reply
I'm not sure that I'm buying that a 0.7% time difference is anything other than noise after only 25 runs.
[+] DannyBee|12 years ago|reply
The answer can be one of a billion reasons: dependency chain breaking, register renaming deficiencies, alias breaking, etc.

First step needs to be the hardware performance counters. Usually it will tell you something interesting, but leave you scratching your head.

Then you find a friend at intel, who laughs, and explains some wildly esoteric detail of the core2 processor that causes this.

This is why scheduling and register allocation models in compilers don't bother with ILP based modeling for x86 processors. Even if you modeled the externally published architecture perfectly, including port restrictions, decode lengths, branch stalls, cycle counts, register requirements, etc, the processor just does whatever it wants internally anyway, because they don't really give you all the details.

[+] rwallace|12 years ago|reply
In my experience the timing jitter on a general-purpose PC - the typical difference between runs of the same code on the same data - is in the ballpark of a couple of percent (and no, back-to-back runs aren't fully independent of each other so you can't average it out over a couple of dozen runs).

The claimed timing difference was less than one percent so while I certainly can't say it wasn't real, I'm not seeing any evidence that it was.
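A rough way to see this jitter for yourself (a quick sketch, not a rigorous benchmark; the workload is a stand-in for the loop under test) is to time the same fixed work repeatedly and compare the spread to the claimed 0.7% effect:

```python
import time

def workload():
    # Stand-in for the loop being benchmarked; any fixed CPU-bound
    # work will do for observing run-to-run variation.
    total = 0
    for i in range(200_000):
        total += i * i
    return total

samples = []
for _ in range(25):
    start = time.perf_counter()
    workload()
    samples.append(time.perf_counter() - start)

spread = (max(samples) - min(samples)) / min(samples)
print(f"run-to-run spread over 25 runs: {spread:.1%}")
```

On a typical desktop the spread is often a few percent, i.e., comparable to or larger than the effect being measured, which is the parent's point about 25 runs not being enough.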

[+] Havoc|12 years ago|reply
From the stackoverflow page:

>randomly inserting nop instructions in programs can easily increase performance by 5% or more, and no, compilers cannot easily exploit this.

Could someone please help me understand why this isn't exploitable? Surely one could just tack a module onto the end of the compiler that brute-forces this? Or something like a genetic algorithm to trial-and-error it.
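The brute-force idea can at least be sketched. Everything below is hypothetical: `benchmark` is a deterministic mock standing in for "insert NOPs at these slots, assemble, run, and time the loop", which in reality is the expensive, noisy step.

```python
import random

LOOP_LENGTH = 20  # hypothetical number of instruction slots in the loop

def benchmark(nop_positions):
    # Mock cost function so the sketch is reproducible; a real version
    # would assemble the loop with NOPs at these slots and time it.
    return 100.0 - sum(1.0 / (1 + p) for p in nop_positions) \
           + 0.1 * len(nop_positions)

rng = random.Random(42)
best_cost, best_nops = benchmark(frozenset()), frozenset()

for _ in range(500):
    # Random search: try up to three NOPs at random slots.
    candidate = frozenset(rng.sample(range(LOOP_LENGTH), rng.randint(0, 3)))
    cost = benchmark(candidate)
    if cost < best_cost:
        best_cost, best_nops = cost, candidate

print(f"best NOP placement: {sorted(best_nops)} (cost {best_cost:.3f})")
```

The catch is the fitness function: with run-to-run jitter of a percent or more (see the comments above), each candidate would need many timed runs before a sub-1% effect is distinguishable from noise, and the result would be specific to one microarchitecture anyway.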

[+] ryen|12 years ago|reply
Shouldn't something like this be measured using #CPU ops per CPU-time instead of just 'seconds'?