It's strange the extent to which programming interview questions reflect an 80s view of the cost of operations, particularly the overabundance of linked list and binary tree questions. Cache misses ain't free and memory scans are relatively cheap after you do the initial lookup.
Yep! The big thing people don't realize is that memory operations are the biggest cost in terms of both time and energy.
While it takes ~100 picojoules to do a double-precision floating point operation on an Ivy Bridge Intel processor, it takes 4200 picojoules to move those 64 bits from DRAM to your registers. Most people assume that the huge power usage comes from moving data off the chip, but the reality (and a surprising fact to most people) is that over 60% (~2500 picojoules) of the energy spent moving the data is consumed by the on-chip cache hierarchy. That doesn't mean the SRAM caches themselves, but all the additional logic that makes the hierarchy hardware-managed (TLBs, etc.) and gives you functionality like virtual memory translation and cache coherency.
Getting rid of all of that cruft that has been added since the 80s to make programmers' lives easier would actually reduce power consumption and latency significantly... My startup is working on that problem by removing all of that additional logic from the hardware and instead having it managed at compile time. The best thing, though, would be having programmers really think about locality when writing their programs.
And it's not reasonable to abandon trees and graphs and everything just for the sake of cache locality. Algorithm first, CPU optimization second. Especially because you can control allocation very easily with something like an object pool, which will minimize cache misses.
I think most such interview questions are informed by CS curricula, which have also changed only modestly since 1980 (some new languages and tools). The elaborate pointer-chasing structures available then have basically sufficed and are in many instances less useful now.
What interested me in the design of the Mill CPU is how it throws out the usual design of machine language.
I'm not talking about assembly language, and I know the difference, BTW.
In the name of software compatibility, we're still trying to program CPUs using machine language that wouldn't be so strange to a programmer from the 1980's. Sure, there's more registers, and some fun new stuff, but it isn't all that different.
Except that in the 1980's, the CPU actually implemented those instructions. These days, it is all a lie, especially with regards to things like register sets and aliasing. Yes, of course, logically, what the programmer wanted to happen does, but today even programming at assembly level, you are far, far removed from what the CPU is actually doing.

Edit: Here's the website: http://millcomputing.com/docs/
You say this as if it is a bad thing (or am I misinterpreting here?), but compatibility is enormously valuable. That's why the strategy of choosing compatibility over cleanliness of architecture is so widespread in successful complex systems - ISAs, OSs, the Web, programming languages, etc etc.
It's hard to love the resulting complexities, but remaining compatible really is almost always the right thing to do.
It is worth noting that many of the Mill features are also a "lie" (for example, the belt is still just registers and register renaming). It's just that the Mill uses lies designed for modern processor technology.

Now I'm curious. How does it?
This is a wonderful post that no-one will care about. This may be the only post.
Today, programmers are more interested in the rate at which they can turn out "Just Works" code. These kinds of details are far, far too down in the weeds for continuous-development artists.
All: I had a negative reaction to this comment too at first, but on reflection it reads less like a jab and more like a rueful lament about mainstream programming culture. The fact that mgrennan loved the article and was so perfectly wrong about it not being appreciated here suggests that he or she just hasn't realized yet how much passion this community shares for the craft.
It's depressing to feel like you're the only one who cares, and when one has felt like that for a long time, curmudgeonly biases develop. So mgrennan, please get to know your fellow HNers, who love this stuff. And HN, let's be charitable to mgrennan, who may have been mistaken but whose heart is probably in the right place.
Two responses. First, once we find "Just Works" components that we like, they can be optimized. Second, we can use inspiration from articles like this to describe "Just Works" approaches and patterns that "go with the grain" of what Earth's real fabrication capabilities are actually producing.
For example, I found the discussion of how cores coordinate access to main memory on a shared bus to be quite fascinating; an easy insight there was that our programming patterns should support hard data partitions (not so much shared main memory as parallel main memory). One naive way to get there is to use N processes, where N is something like the number of cores on the machine, and one of them serves as a message router. Something like what `httpd` does.
I really wouldn't mind if someone who knows more about the JVM implementation could talk about how and why the JVM threading model is better than native processes, for example, especially in light of memory contention.
> This is a wonderful post that no-one will care about. This may be the only post.
I think you are underestimating the crowd here. Last time it was posted it got quite a few responses: https://news.ycombinator.com/item?id=8873250 (already a while back, but might be interesting for reference/to bring topics up again)

By stating that nobody will care, you're just encouraging the problem.
Highly technical posts on HN have a tendency to accrue a lot of upvotes before a discussion develops. I attribute this (perhaps optimistically) to people actually taking the time to read the post. When something is so technically dense, it can also be difficult for people to add anything to it; hence it takes a while before a real discussion starts.
I hate it when blog posts don't include the date. Judging by the linked question this blog post must be at most a few months old, but there was nothing on the page that would tell me that. One of the most important questions is whether the information in the article is still applicable... In this case it is, but it would be nice if readers knew it. Not to mention 10 years from now when somebody stumbles across this writeup. </rant>

EDIT: Nice article though. :)
It's not a static decision though - the memory accesses can still be reordered when %ebx != %esp, though of course this only ends up visible where there are multiple CPUs involved.
For example, consider the case above and assume that the initial conditions are:
(%esp) == 0
(%ebx) == 0
Now imagine we have a second CPU executing simultaneously, with the same %ebx and %esp as the first CPU, but executing this:
mov $1, (%ebx)
mov (%esp), %eax
Now if there was no reordering, either one or both CPUs must end with %eax == 1. However, the hoisting of loads above earlier stores means that you can actually end up with both CPUs having %eax == 0 after this executes.
http://www.cavium.com/OCTEON-III_CN7XXX.html

Intel, IBM, mainframes, and embedded SoCs are all taking the same approach to a degree: combining 1-N general-purpose cores with dedicated hardware for performance-critical stuff, or just stuff that shouldn't add overhead. The Octeon line is an extreme example, with them adding accelerators till they hit around 500. The most modern variant is the "semi-custom" business of Intel and AMD, which is making more of it happen for those with the money.
This is peripheral to an improvement in computers known as network-on-a-chip. This plus extra layers of functionality in silicon lets the companies easily do stuff like that. The next step is incorporating FPGA logic in the processors. We already see it in the embedded scene. Just wait till Intel uses Altera technology in Xeons. SGI's Altix machines with FPGAs using NUMA were already quite powerful. Imagine the same benefit of no remote-memory access with the FPGA logic working side-by-side with CPU software. Will be badass.
This is a question from someone rather ignorant, so please don't hit me: Why didn't the Bulldozer affect the programmers? It (or Piledriver) seems to be doing quite well in applications with good threading.
In general all x86 chips are carefully designed to fulfill the abstraction that is the x86 ISA so to the programmer it shouldn't matter whether their code runs on an Atom or a Bulldozer or an i7.
I don't think "good at executing bad code" applies for embedded CPUs. Of course, good code is still more the province of the compiler (with PGO for example).
DON'T USE FUCKING AT&T ASSEMBLY SYNTAX.

Literally everyone uses Intel syntax, except in those situations where they are forced to use AT&T syntax (inline assembly in C on Unix, or somehow your box doesn't have NASM). Using AT&T syntax for examples just confuses people. Write assembler the right way. Destination, source. Come on.
The default output format of HotSpot is AT&T. The default output format of GCC is AT&T. The default output format of Clang is AT&T. Tools like the universal compiler output viewer https://gcc.godbolt.org use AT&T by default.
Intel isn't the universally accepted format you think it is. I'm a professional VM researcher and I use AT&T more often than Intel. In fact I most often see Intel when reading Intel documentation.
You're shouting about something no more empirical than tabs vs spaces, and even then I think your side is actually in the minority.
Not only are the all caps and profanity uncalled for here, but the unfounded assertion that "everyone" uses "Intel syntax" is wrong. The Solaris assembler, as one example, uses the "AT&T syntax". This is also reflected in older UNIX APIs such as bcopy, which use src, dest.
Personally, I always preferred src,dest over dest,src but as long as the language is consistent I don't care.