Instructions per cycle: AMD Zen 2 versus Intel

[+] NohatCoder|6 years ago|reply

In case anyone is not aware: This is a very small sample of microbenchmarks. When benchmarking very simple tasks like these performance tend to vary wildly between architectures.

For instance instructions are assigned to one of a handful of ports when executed, certain instructions may only be assigned to certain ports, what ports an instruction may be assigned to differ between architectures. If an inner loop use only a few different instructions one architecture may be unlucky in that most of the instructions need the same ports, and so it can execute fewer instruction overall.

For real benchmarking use lots of different complicated jobs. It is not perfect, but it is the best way we have of comparing different processors head to head.

[+] mrb|6 years ago|reply

Indeed. Back in 1999 the AMD K7 was a full 3 times faster than Intel on microbenchmarks measuring the performance of ROR/ROL instructions, because the throughput per clock of these rotate instructions was exactly 3 times higher than on Intel. Obviously this did not mean that AMD was 3 times faster than Intel.

Picking 1 or 2 random microbenchmarks like the blog post author did is not useful to categorize overall performance across all real-world workloads. If he had picked different ones, they might have shown AMD twice faster than Intel.

[+] ncmncm|6 years ago|reply

The author appears to be benchmarking the specific operations that bottleneck their json parsing library when running on an Intel chip, which seems reasonable, on its face. It can fail if the library is limited by a different set of operations, on the different machine. But that is unlikely if the specific operations tested are slower.

[+] yifanlu|6 years ago|reply

Assuming both Intel and AMD implement performance monitors the same (i.e. same notion of instructions executed, which may be hard to measure with speculative execution), the comparison is still flawed because it doesn’t matter if Intel can do more instruction per cycle if AMD can produce more cycles in a span of wall time.

> However, it is not clear whether these reports are genuinely based on measures of instruction per cycle. Rather it appears that they are measures of the amount of work done per unit of time normalized by processor frequency.

That’s precisely why nobody really uses IPC as a way to compare processors. “How much work done per unit of time” is a much better measurement and I guess for historical reasons, people conflate it with IPC.

But real textbook IPC is useless for comparison.

[+] shaklee3|6 years ago|reply

I think it would have been useful if the author benchmarked the actual time taken to parse a large json file, and did a sanity check to make sure the time difference made sense with ipc/clock factored in.

[+] black_puppydog|6 years ago|reply

> But real textbook IPC is useless for comparison.

It's useful for comparing architectures and the implementation thereof, to gauge the potential of one line of processors over the other.

I agree that for the customer it's not the right thing to be looking for.

[+] Someone|6 years ago|reply

I don’t see how it is flawed. The article doesn’t discuss whether the AMD CPU is faster than the Intel CPU, it discusses the claim "that the most recent AMD processors surpass Intel in terms of instructions per cycle” (https://www.guru3d.com/articles_pages/amd_ryzen_7_3800x_revi...)

And IPC, IMO, is a better measurement for a chip’s design than pure speed, as it removes the “but how good a process do you have access to” from the equation.

[+] MrSauna|6 years ago|reply

Some more realistic single core workloads at same frequencies: 3900x vs 9900k https://hothardware.com/reviews/amd-ryzen-9-3900x-vs-core-i9...

[+] unknown|6 years ago|reply

[deleted]

[+] BeeOnRope|6 years ago|reply

I'm this case the frequencies are similar and so wall clock time reflects the IPC difference (also, the two CPUs take the same code path, so the I is the same in this case, which isn't always true).

[+] endorphone|6 years ago|reply

"the comparison is still flawed because it doesn’t matter if Intel can do more instruction per cycle if AMD can produce more cycles in a span of wall time."

The reason Intel had the "per core" superiority crown for years is that it had a better IPC performance due to design efficiency. Both manufacturers are pushing against the same frequency ceiling, so if you went AMD you had to significantly increase the core count to catch up, and could never match the still important single-thread performance.

We know from large scale, comprehensive benchmarks that AMD has massively picked up the pace and is neck and neck with Intel. At the same processor speed it matches the best Intel processors.

But yeah, this article is just terrible. Not just tiny, minuscule, extremely myopic benchmarks, but then a gross over-reach with conclusions. And in the way that ignorance begets ignorance, the fact that it's trending on a couple of social news sites means that now Google is surfacing it as canonical information when it's just a junk, extremely lazy analysis.

[+] jonstewart|6 years ago|reply

It’s depressing how many comments here are quick to dismiss the benchmarking/article. Yes, yes, memory bandwidth, I/O, and cache hierarchies are all important, but Daniel Lemire is one of the top people in the world when it comes to optimizing algorithms for modern CPUs. Do you like search engines? Lemire has made them significantly faster. He is often able to take code/algorithms that already seem fast, and make them much faster. He’s recently branched out beyond search engine core algorithms into some aspects of string processing (base64, UTF-8 validation, JSON parsing).

In this blog post, he’s paying attention to IPC because he’s typically working with inner loops where the data’s being delivered from RAM to L1 as efficiently as possible.

[+] BeeOnRope|6 years ago|reply

I have plenty of respect for Daniel (and you can even find me below in this discussion defending some aspects of this test), but I too find some fault with this article.

The main problem I have is that the claim in dispute seems to be that Zen 2 has comparable (perhaps slightly higher) IPC to Skylake, and then Daniel picks out two benchmarks and shows that Skylake has higher IPC than Zen 2... proving what exactly?

Contradicting people who said that Zen 2 had a higher IPC on every benchmark? Yes, those people were wrong, but it's easy to prove a point if you pick an argument almost no one was making it in the first place.

In the same (second) benchmark that he selected the "basic_decoder" sub-benchmark, but there is also another benchmark "bogus" which tests the empty function calling time, and this case I measure a reversed scenario: Intel at IPC 2.25 and AMD at 3.43. So should we now say that Intel IPC is "quite poor"?

[+] reitzensteinm|6 years ago|reply

The second example is just a benchmark of tzcnt, added in BMI1. It's a very specific and very bizarre benchmark to do when you could just look up the reciprocal throughput (unfortunately Zen 2 has not yet been added).

https://www.agner.org/optimize/instruction_tables.pdf

Edit: This is wrong as BeeOnRope points out below.

The first is SIMD heavy, so Zen 2 mostly closing the gap with Intel in one of the areas where Zen 1 was very weak is a good thing.

[+] BeeOnRope|6 years ago|reply

Zen2 is on uops.info, it's 2L0.5T on Zen, 3L1T on Intel, so slight theoretical edge for AMD (2 vs 1 uops tho).

That said, I don't agree it's a tzcnt benchmark - there are about 9 instructions only one of which is tzcnt. I'm not sure why Zen2 is worse here.

[+] eyegor|6 years ago|reply

I think the only real way to compare IPC is to actually talk to the architects. Trying to write microbenchmarks is a fools errand when you aren't aware of how the cpu processes the instructions you give it. Are you actually stressing the fpu, or is the cpu speculatively executing and then branch predicting the workload (common for micro loops)? If it is, is that what you meant to test? Are you trying to compare like for like (in which case you have to write assembly), or are you trying to write performance benchmarks (and then the only meaningful metric is cpu time)?

This is an interesting idea, but I'm not sure how you could derive meaning from comparing two vastly different architectures at such a high level.

[+] alecmg|6 years ago|reply

Useless, strictly academic interest.

There is more than execution ports in design of processors. Not every task can be SIMD optimized to extent of approaching theoretical IPC limits, most will be bottlenecked by memory access or even IO.

I prefer the "fake" but real-world IPC. Same clocks, same real world task, measure time to finish.

[+] Erwin|6 years ago|reply

I think this was more of a response to the linked benchmark at guru3d which said:

> Instructions per cycle (IPC)

> For many people, this is the holy grail of CPU measurements in terms of how fast an architecture per core really is.

Based on his work with simdjson, professor Lemire seems to be quite aware of microbenchmarks being problematic. But general articles out here and on HN are proclaiming Intel is doomed and can never recover, due to mitigations/lack of cores/lack of chiplets. Those concerns have yet to be reflected in the stock price.

[+] Symmetry|6 years ago|reply

Omar Bradley once said “Amateurs talk strategy. Professionals talk logistics.” I'd say that in CPU design amateurs talk about execution resources but professionals talk about cache hierarchies. But that's too awkward to make a good quip.

[+] fluffything|6 years ago|reply

I think recommending people to to prefer {insert your favourite benchmark here} is very bad advice, and disproving your claim that Lemire's benchmarks are useless because YOU don't care about them is as simple as showing that they are useful for Lemire, which is something this post shows.

If you care enough about a particular CPU to do benchmarks, you should benchmark what YOU care about.

Lemire's job is to improve the implementation of particular algorithms to make optimal use of the hardware. Knowing the different theoretical hardware limits tells you how good an implementation is doing along different axes, and benchmarking those limits is a critical part of doing Lemire's job correctly.

You probably have a different use case for computers than Lemire, and it is therefore completely reasonable for you to care about different benchmarks.

[+] zippie|6 years ago|reply

IPC microbenchmarks do not properly reflect the complex workloads running on post Zen2 microarchitecture. Zen2 upends microarchitecture schematics enough to warrant a different metric.

IPC MB’s, in my experience, tend to benchmark best case scenarios and that is probably the exception rather than the rule for application workloads in modern MA’s. Case in point, microbenchmarks showed significant improvements in IPC for Zen2 in lieu of Skylake yet for the application workload (CPU data bound), Skylake held up neck and neck.

The more appropriate benchmarking metric for post-Zen2 processors is CPI [0].

[0] https://john.e-wilkes.com/papers/2013-EuroSys-CPI2.pdf

[+] mmrezaie|6 years ago|reply

But isn't CPI is just reverse of IPC, and CPI just makes the IPC score being bounded between 0..1?

[+] chucklenorris|6 years ago|reply

Heh, I'm curious if he used the mitigations for all the side channel flaws for the intel processors.

[+] BeeOnRope|6 years ago|reply

The mitigations don't affect CPU bound benchmarks [1] which don't call into the kernel or use specific user-space mitigations, so it won't matter here.

[1] There are some rare exceptions, such as https://travisdowns.github.io/blog/2019/03/19/random-writes-... , but it is unlikely to matter here.

[+] _ph_|6 years ago|reply

While only being part of the performance equation, analyzing IPC can be quite interesting in understanding the design of the processor and how performance might be achieved.

One thing itches me with the presented comparison: it is running very few benchmarks generated with the same compiler. For a thorough IPC analysis, shouldn't the tests rather being programmed in assembly to exclude any influence by the compiler choice? Also probably a wider range of algorithms should be checked, as IPC on modern processors depends less on how many cycles a certain instruction takes (you should be able to find that in the manuals), but how well multiple components of the processor can be utilized at the same time. Which extremely depends on the actual program to be run.

[+] tempguy9999|6 years ago|reply

I'm rather surprised at the claim that "but it might easily execute 7 billion instructions per second on a single core". I'd even question it except the author's an expert.

If you can keep it fed then ok but one cache miss to main mem, either instruction or data, will allow the instruction buffers to completely empty and stay empty for quite a long time. I don't think you can control placement to reasonably assure cache hits always for anything but the most trivial code, am I missing something?

Also if you could keep a consistent throughput like this I wonder if thermal throttling might have to kick in. I mean you're doing a lot of work...

[+] touisteur|6 years ago|reply

I can't find it back but in a recent article I read that it was useful to have an idea of the upper-boundary abilities of an arch+algorithm, so that you 'know' what you're aiming for, but it might not be attainable practically without huge human or decades of superoptimizer effort... Yes if your algorithm reaches for cold data, you'll get hit. Can you get around that? Do you really need to hit the cache when you're computing the seven-billionth decimal of pi or factoring numbers ? This work is quite interesting, if only for compilers or superoptimizers.

[+] Const-me|6 years ago|reply

I wonder how reliable are these Linux syscalls?

Found this http://manpages.ubuntu.com/manpages/trusty/man2/perf_event_o... and that article doesn't instill much confidence in the reliability of these counters. Comment for CPU_CYCLES says "Be wary of what happens during CPU frequency scaling", comment for INSTRUCTIONS says "these can be affected by various issues, most notably hardware interrupt counts", BRANCH_INSTRUCTIONS says "Prior to Linux 2.6.34, this used the wrong event on AMD processors" and so on.

If I wanted to measure what OP was measuring, I would disable frequency scaling (probably doable on overclocker-targeted motherboards, also search finds some utilities which claim to do that, both windows and linux ones), measure time, then divide by frequency.

[+] amluto|6 years ago|reply

CPU_CYCLES counts cycles. This means that the time per cycle varies with frequency. If you're trying to see how many cycles something that fits in L1 takes, CPU_CYCLES is the right thing to measure.

[+] nabla9|6 years ago|reply

In more comprehensive single thread benchmarks (single thread POV Ray) Intel can still beat Zen 2 architecture sometimes. This test seems to indicate the reason why.

[+] qxnqd|6 years ago|reply

ITT: AMD apologists.

Sorry guys but Intel is still king of single core performance. But that's not a problem because I'm sure by 2050 most desktop applications and games will correctly make use of many cores, then AMD will reign

[+] tempguy9999|6 years ago|reply

Worth responding to blatant troll to point out it's not about performance but performance by price for 99% of uses.

[+] ncmncm|6 years ago|reply

By 2050 we might well be more concerned with which rocks can be slung farther.

Assuming civilization will survive until then, given current political trends, is rash.

74 comments