It's strange the extent to which programming interview questions reflect an 80s view of the cost of operations, particularly the overabundance of linked list and binary tree questions. Cache misses ain't free and memory scans are relatively cheap after you do the initial lookup.
Yep! The big thing people don't realize is that memory operations are the biggest cost in terms of both time and energy.
While it takes ~100 picojoules to do a double-precision floating point operation on an Ivy Bridge Intel processor, it takes 4200 picojoules to move those 64 bits from DRAM to your registers. Most people assume that the huge power usage comes from moving data off the chip, but the reality (and a surprising fact to most people) is that over 60% (~2500 picojoules) of the energy spent moving the data is consumed by the on-chip cache hierarchy. That doesn't mean the SRAM caches themselves, but all the additional logic that makes the hierarchy hardware-managed (TLBs, etc.) and gives you functionality like virtual memory translation and cache coherency.
Getting rid of all of that cruft that has been added since the 80s to make programmers' lives easier would actually reduce power consumption and latency significantly... My startup is working on that problem by removing all of that additional logic from the hardware and instead having it managed at compile time. The best thing, though, would be having programmers really think about locality when writing their programs.
And it's not reasonable to abandon trees and graphs and everything just for the sake of cache locality. Algorithm first, CPU optimization second. Especially because you can control allocation very easily with something like an object pool, which will minimize cache misses.
I think most such interview questions are informed by CS curricula, which have also changed only modestly since 1980 (some new languages and tools). The elaborate pointer-chasing structures available then have basically sufficed and are in many instances less useful now.
What interested me in the design of the Mill CPU is how it throws out the usual design of machine language.
I'm not talking about assembly language, and I know the difference, BTW.
In the name of software compatibility, we're still trying to program CPUs using machine language that wouldn't be so strange to a programmer from the 1980's. Sure, there's more registers, and some fun new stuff, but it isn't all that different.
Except that in the 1980's, the CPU actually implemented those instructions. These days, it is all a lie, especially with regards to things like register sets and aliasing. Yes, of course, logically, what the programmer wanted to happen does, but today even programming at assembly level, you are far, far removed from what the CPU is actually doing.

Edit: Here's the website: http://millcomputing.com/docs/
You say this as if it is a bad thing (or am I misinterpreting here?), but compatibility is enormously valuable. That's why the strategy of choosing compatibility over cleanliness of architecture is so widespread in successful complex systems - ISAs, OSs, the Web, programming languages, etc etc.
It's hard to love the resulting complexities, but remaining compatible really is almost always the right thing to do.
It is worth noting that many of the Mill features are also a "lie" (for example, the belt is still just registers and register renaming). It's just that the Mill uses lies designed for modern processor technology.

Now I'm curious. How does it?
This is a wonderful post that no-one will care about. This may be the only post.
Today, programmers are more interested in the rate at which they can turn out "Just Works" code. These kinds of details are far, far too down in the weeds for continuous-development artists.
All: I had a negative reaction to this comment too at first, but on reflection it reads less like a jab and more like a rueful lament about mainstream programming culture. The fact that mgrennan loved the article and was so perfectly wrong about it not being appreciated here suggests that he or she just hasn't realized yet how much passion this community shares for the craft.
It's depressing to feel like you're the only one who cares, and when one has felt like that for a long time, curmudgeonly biases develop. So mgrennan, please get to know your fellow HNers, who love this stuff. And HN, let's be charitable to mgrennan, who may have been mistaken but whose heart is probably in the right place.
Two responses. First, once we find "Just Works" components that we like, they can be optimized. Second, we can use inspiration from articles like this to describe "Just Works" approaches and patterns that "go with the grain" of what Earth's real fabrication capabilities are actually producing.
For example, I found the discussion of how cores coordinate access to main memory on a shared bus to be quite fascinating; an easy insight there was that our programming patterns should support hard data partitions (not so much shared main memory as parallel main memory). One naive way to get there is to use N processes, where N is something like the number of cores on the machine, and one of them serves as a message router. Something like what `httpd` does.
I really wouldn't mind if someone who knows more about the JVM implementation could talk about how and why the JVM threading model is better than native processes, for example, especially in light of memory contention.
> This is a wonderful post that no-one will care about. This may be the only post.
I think you are underestimating the crowd here. Last time it was posted it got quite a few responses: https://news.ycombinator.com/item?id=8873250 (already a while back, but might be interesting for reference/to bring topics up again)

By stating that nobody will care, you're just encouraging the problem.
Highly technical posts on HN have a tendency to accrue a lot of upvotes before a discussion develops. I attribute this (perhaps optimistically) to people actually taking the time to read the post. When something is so technically dense, it can also be difficult for people to add anything to it; hence it takes a while before a real discussion starts.
I hate it when blog posts don't include the date. Judging by the linked question this blog post must be at most a few months old, but there was nothing on the page that would tell me that. One of the most important questions is whether the information in the article is still applicable... In this case it is, but it would be nice if readers knew it. Not to mention 10 years from now when somebody stumbles across this writeup. </rant>

EDIT: Nice article though. :)
It's not a static decision though - the memory accesses can still be reordered when %ebx != %esp, though of course this only ends up visible where there are multiple CPUs involved.
For example, consider the case above and assume that the initial conditions are:
(%esp) == 0
(%ebx) == 0
Now imagine we have a second CPU executing simultaneously, with the same %ebx and %esp as the first CPU, but executing this:
mov $1, (%ebx)
mov (%esp), %eax
Now if there was no reordering, either one or both CPUs must end with %eax == 1. However, the hoisting of loads above earlier stores means that you can actually end up with both CPUs having %eax == 0 after this executes.
http://www.cavium.com/OCTEON-III_CN7XXX.html

Intel, IBM, mainframes, and embedded SoCs are all taking the same approach to a degree: combining 1-N general-purpose cores with dedicated hardware for performance-critical stuff, or just stuff that shouldn't add overhead. The Octeon line is an extreme example, with them adding accelerators till they hit around 500. The most modern variant is the "semi-custom" business of Intel and AMD, which is making more of it happen for those with the money.
This is peripheral to an improvement in computers known as network-on-a-chip. This plus extra layers of functionality in silicon lets the companies easily do stuff like that. The next step is incorporating FPGA logic in the processors. We already see it in the embedded scene. Just wait till Intel uses Altera technology in Xeons. SGI's Altix machines with FPGAs using NUMA were already quite powerful. Imagine the same benefit of no remote-memory access with the FPGA logic working side-by-side with CPU software. Will be badass.
This is a question from someone rather ignorant, so please don't hit me: Why didn't the Bulldozer affect the programmers? It (or Piledriver) seems to be doing quite well in applications with good threading.
In general all x86 chips are carefully designed to fulfill the abstraction that is the x86 ISA so to the programmer it shouldn't matter whether their code runs on an Atom or a Bulldozer or an i7.
I don't think "good at executing bad code" applies for embedded CPUs. Of course, good code is still more the province of the compiler (with PGO for example).
DON'T USE FUCKING AT&T ASSEMBLY SYNTAX.

Literally everyone uses Intel syntax, except in those situations where they are forced to use AT&T syntax (inline assembly in C on Unix, or somehow your box doesn't have NASM). Using AT&T syntax for examples just confuses people. Write assembler the right way. Destination, source. Come on.
The default output format of HotSpot is AT&T. The default output format of GCC is AT&T. The default output format of Clang is AT&T. Tools like the universal compiler output viewer https://gcc.godbolt.org use AT&T by default.
Intel isn't the universally accepted format you think it is. I'm a professional VM researcher and I use AT&T more often than Intel. In fact I most often see Intel when reading Intel documentation.
You're shouting about something no more empirical than tabs vs spaces, and even then I think your side is actually in the minority.
Not only are the all caps and profanity uncalled for here, but the unfounded assertion that "everyone" uses "Intel syntax" is wrong. The Solaris assembler, as one example, uses the "AT&T syntax". This is also reflected in older UNIX APIs such as bcopy, which use src, dest.
Personally, I always preferred src,dest over dest,src but as long as the language is consistent I don't care.