top | item 6114530

Why would useless MOV instructions speed up a tight loop in x86_64 assembly?

120 points | nkurz | 12 years ago | stackoverflow.com

41 comments

[+] rayiner|12 years ago|reply
This is Core 2, so it still has the P6's microarchitectural limitation that it can only read two (or three?) values from the register file each cycle. But it's a 4-way processor, so it can potentially need up to 8 operands. If the other operands are on the bypass networks, it's fine. If not, then the CPU stalls. My guess would be that the MOV has the effect of keeping an operand on the bypass network long enough to avoid a stall. I'm totally guessing, though, but that might explain why it's so sensitive to instruction ordering.
[+] microarchitect|12 years ago|reply
Sorry, this explanation is almost surely incorrect. How long something is available on the bypass network is determined by how long the instruction that produces the value takes to "exit" the pipeline. I can't imagine any scenario where a consumer instruction causes a producer instruction (i.e., an instruction "ahead" of it) to stall. Note this would be a dangerous design point because of the risk of deadlocks.

What's the source for your claim that the Core uarch's register file is underdesigned in comparison to the dispatch width? I'd be extremely surprised if this were the case. Last time I looked at the data, about 50-70% of the reads go to the register file not the bypass network.

[+] Filligree|12 years ago|reply
Sorry, what's a bypass network in this context?
[+] raverbashing|12 years ago|reply
Could it have something to do with stalling while register renaming?

r8d/r9d are used "a lot" so it may have something to do with dependencies between steps, especially from end of the loop to the beginning (if I understood correctly)

[+] conductor|12 years ago|reply
Sorry for the off-topic discussion, but I would like to give some attention to FreePascal. In my opinion, it is a great piece of software, and is so underrated. If you find C/C++ too error-prone or hard to learn, Object Pascal is a very good alternative. FreePascal is a multi-platform, Delphi-compatible Object Pascal compiler; it can generate pretty optimized native code for multiple architectures (including ARM) and has plenty of libraries. If you haven't already, please give Lazarus[1] a try; it's a nice RAD IDE (very similar to Borland Delphi 7) shipped with the FreePascal compiler.

[1] - http://lazarus.freepascal.org

[+] claystu|12 years ago|reply
I love Pascal, but once you move beyond the Turbo Pascal features of Free Pascal and start trying to use the Lazarus IDE and the Delphi look-alike features, the going gets really rough. The documentation is terrible. There's a lot there, but it's all fragmented and half-baked.

They need to freeze development and go look at the documentation for Python, Perl, Racket, or a whole host of other languages.

[+] sliverstorm|12 years ago|reply
Does this have anything whatsoever to do with the posted topic? There must be some connection, and I'm twisting my brain trying to figure out what it is.
[+] dom96|12 years ago|reply
If I may also use this opportunity to advertise a Pascal-like language, as it is also severely underrated. The language's name is Nimrod [1], it compiles to C so it is also very portable, uses whitespace to delimit blocks, is statically typed with generics and has great support for metaprogramming. I am developing an IDE for it called Aporia [2], it is written in Nimrod, it definitely doesn't have as many features as Lazarus but it works well as a simple editor. Other nimrod projects can be found in the nimrod-code organisation on github[3].

[1] - http://nimrod-code.org

[2] - https://github.com/nimrod-code/aporia

[3] - https://github.com/nimrod-code/

[+] rogerbinns|12 years ago|reply
I highly recommend the Stanford EE380 video "Things CPU Architects Need To Think About". Even though it is from 2004, much of the material is still relevant. Bob Colwell notes that they hit similar unexpected throughput slowdowns when implementing the P6 and NetBurst processor cores, and discovered that adding random delays cleared them out. The causes of the throughput hiccups could sometimes be traced back hundreds or thousands of instructions, but were extremely difficult to address. They also found that manufacturing assembly lines had similar problems, which were likewise addressed by adding delay.

http://www.stanford.edu/class/ee380/ay0304.html (18 Feb 2004, site appears to be having issues at the moment)

[+] acqq|12 years ago|reply
It's such a good talk. There's even a point where he almost says "we're sorry" regarding the Pentium 4, at a time when a lot of people still believed Intel's marketing that it was "the best." I understood him, as the application I worked on at the time was significantly faster on the Pentium M than on the Pentium 4.
[+] incision|12 years ago|reply
Interesting, relevant paper mentioned in response to the linked SO question:

MAO - an Extensible Micro-Architectural Optimizer - http://static.googleusercontent.com/external_content/untrust...

[+] kilowatt|12 years ago|reply
There are figures on page 6 that answer the OP's question directly. And there's also this money quote: "Building an accurate model for modern processors is virtually impossible."
[+] PuercoPop|12 years ago|reply
Very interesting indeed. In particular: "we find that most performance results stay flat, in particular for the Nop Killer pass (NOPKILL), which leads to the conclusion that most alignment directives are not helping"
[+] pbsd|12 years ago|reply
The code keeps using high registers for no discernible reason, instead of sticking with EAX--EDI; this means that each instruction is 1 byte longer than it could be.

This has consequences for the instruction fetching and decoding circuitry, where (in the Core 2) you can only read 16 instruction bytes per cycle (or 6 instructions). It is possible that the extraneous MOV instructions are just resulting in better instruction alignment.
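To make the length difference concrete, here's a minimal sketch with hand-assembled byte sequences (written out per the Intel manual's `89 /r` encoding of register-to-register MOV, not produced by any assembler):

```python
# Hand-assembled x86_64 encodings of "mov r/m32, r32" (opcode 0x89).
# Low registers need no prefix; touching r8d-r15d requires a REX
# prefix (here REX.RB = 0x45), costing one extra byte per instruction.

mov_eax_ecx = bytes([0x89, 0xC8])        # mov eax, ecx -> 2 bytes
mov_r8d_r9d = bytes([0x45, 0x89, 0xC8])  # mov r8d, r9d -> 3 bytes

print(len(mov_eax_ecx))  # 2
print(len(mov_r8d_r9d))  # 3
```

With Core 2's 16-byte fetch window, a loop body built from 3-byte MOVs fits roughly 5 such instructions per fetch instead of 8, so the choice of registers shifts where instructions fall relative to fetch-line boundaries.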

[+] nkurz|12 years ago|reply
> The code keeps using high registers for no discernible reason

The function[1] inside of which the assembly is located declares a lot of variables as well. I don't know how well Free Pascal does register allocation, but perhaps this avoids clobbering the registers it prefers. Alignment seems like a likely candidate, but isn't likely to explain why the exact ordering of the extra ops makes a difference.

It would help to see the assembly for the whole function.

[1] https://github.com/tangentstorm/coinops/blob/junkops/sha256_...

[+] mistercow|12 years ago|reply
I'm not sure that I'm buying that a 0.7% time difference is anything other than noise after only 25 runs.
[+] DannyBee|12 years ago|reply
The answer can be one of a billion reasons: dependency chain breaking, register renaming deficiencies, alias breaking, etc.

First step needs to be the hardware performance counters. Usually it will tell you something interesting, but leave you scratching your head.

Then you find a friend at intel, who laughs, and explains some wildly esoteric detail of the core2 processor that causes this.

This is why scheduling and register allocation models in compilers don't bother with ILP based modeling for x86 processors. Even if you modeled the externally published architecture perfectly, including port restrictions, decode lengths, branch stalls, cycle counts, register requirements, etc, the processor just does whatever it wants internally anyway, because they don't really give you all the details.

[+] rwallace|12 years ago|reply
In my experience the timing jitter on a general-purpose PC - the typical difference between runs of the same code on the same data - is in the ballpark of a couple of percent (and no, back-to-back runs aren't fully independent of each other so you can't average it out over a couple of dozen runs).

The claimed timing difference was less than one percent so while I certainly can't say it wasn't real, I'm not seeing any evidence that it was.
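A rough way to see this jitter for yourself (a quick sketch, not a rigorous benchmark; the workload is a stand-in for the loop under test) is to time the same fixed work repeatedly and compare the spread to the claimed 0.7% effect:

```python
import time

def workload():
    # Stand-in for the loop being benchmarked; any fixed CPU-bound
    # work will do for observing run-to-run variation.
    total = 0
    for i in range(200_000):
        total += i * i
    return total

samples = []
for _ in range(25):
    start = time.perf_counter()
    workload()
    samples.append(time.perf_counter() - start)

spread = (max(samples) - min(samples)) / min(samples)
print(f"run-to-run spread over 25 runs: {spread:.1%}")
```

On a typical desktop the spread is often a few percent, i.e., comparable to or larger than the effect being measured, which is the parent's point about 25 runs not being enough.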

[+] Havoc|12 years ago|reply
From the stackoverflow page:

>randomly inserting nop instructions in programs can easily increase performance by 5% or more, and no, compilers cannot easily exploit this.

Could someone please help me understand why this isn't exploitable? Surely one could just tack a module onto the end of the compiler that brute-forces this? Or something like a genetic algorithm to trial-and-error it.
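The brute-force idea can at least be sketched. Everything below is hypothetical: `benchmark` is a deterministic mock standing in for "insert NOPs at these slots, assemble, run, and time the loop", which in reality is the expensive, noisy step.

```python
import random

LOOP_LENGTH = 20  # hypothetical number of instruction slots in the loop

def benchmark(nop_positions):
    # Mock cost function so the sketch is reproducible; a real version
    # would assemble the loop with NOPs at these slots and time it.
    return 100.0 - sum(1.0 / (1 + p) for p in nop_positions) \
           + 0.1 * len(nop_positions)

rng = random.Random(42)
best_cost, best_nops = benchmark(frozenset()), frozenset()

for _ in range(500):
    # Random search: try up to three NOPs at random slots.
    candidate = frozenset(rng.sample(range(LOOP_LENGTH), rng.randint(0, 3)))
    cost = benchmark(candidate)
    if cost < best_cost:
        best_cost, best_nops = cost, candidate

print(f"best NOP placement: {sorted(best_nops)} (cost {best_cost:.3f})")
```

The catch is the fitness function: with run-to-run jitter of a percent or more (see the comments above), each candidate would need many timed runs before a sub-1% effect is distinguishable from noise, and the result would be specific to one microarchitecture anyway.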

[+] ryen|12 years ago|reply
Shouldn't something like this be measured using #CPU ops per CPU-time instead of just 'seconds'?