universal_sinc's comments

universal_sinc | 2 years ago | on: AVX10/128 is a silly idea

This article greatly underestimates the value of keeping 128b vector performance high. Most code never gets recompiled, or isn't compiled with the appropriate flags. There is significant overhead involved in supporting 1x512b, 2x256b, or 4x128b operations per cycle with the same datapaths, forwarding network, and register files. Until 128b vector performance can actually be deprecated, this tension incentivizes narrow implementations.

universal_sinc | 2 years ago | on: Arm’s Neoverse V2

The idea is to write a C++ model that produces cycle-accurate outputs for the branch predictor, core pipeline, queues, memory latency, cache hierarchy, prefetch behaviour, etc. Transistor-level accuracy isn't needed as long as the resulting cycle timings are identical or near identical. The improvement in workload runtime compared to a Verilog simulation comes precisely because they aren't trying to model every transistor, just the important parameters that affect performance.

Let's take a simple example: Instead of modeling a 64-bit adder in all its gory transistor level detail, you can just have the model return the correct data after 1 "cycle" or whatever your ALU latency is. As long as that cycle latency is the same as the real hardware, you'll get an accurate performance number.

What's particularly useful about these models is they enable much easier and faster state space exploration to see how a circuit would perform, well before going ahead with the Verilog implementation, which relatively speaking can take circuit designers ages. "How much faster would my CPU be if it had a 20% larger register file" can be answered in a day or two before getting a circuit designer to go try and implement such a thing.

If you want an open source example, take a look at the gem5 project (https://www.gem5.org). It's not quite as sophisticated as the proprietary models used in industry, but it's widely used in academia and open source hardware design and is a great place to start.

universal_sinc | 2 years ago | on: Arm’s Neoverse V2

Absolutely! Chip designers have several tools to do this.

First, they create detailed software models (usually in C++) of their chips to estimate performance as closely as they can before laying out a single transistor. These models can run code just like a real hardware device, albeit slowly.

Once the chip is designed, Verilog simulators can generate the exact logical output of the circuit, which can be used to measure performance on a workload. However, this method is even slower than the first!

For larger workloads and higher speed, they use extraordinarily expensive FPGA-based platforms called emulators. These allow circuits to be run at speeds in the MHz range before ever being sent to a fab. Booting an OS, running a complex multicore workload with shared memory: they can measure almost any workload. But this method isn't available until late in the design phase, and the boxes themselves are too expensive to deploy very widely.

The software models are the most useful for estimating performance, as long as they are written early and well :)

universal_sinc | 3 years ago | on: Fastest-ever logic gates could make computers a million times faster

Even 0.1ns is way slow. A modern silicon CMOS gate switches in under 10ps, which is how we can fit 25+ gate delays in a single cycle at >3GHz. Everyone should remember that CPU frequency is not the same as the frequency at which a single gate can switch. Also keep in mind we are mostly wire limited anyway: the resistivity of copper at <50nm line widths is quite unlike its bulk resistivity, and scales super-linearly, which effectively prevents us from shrinking the wires any further.

universal_sinc | 5 years ago | on: 100-GHz Single-Flux-Quantum Bit-Serial Adder Based on 10-KA/Cm2 Niobium Process

Just so everyone is aware of the scales involved in modern circuits: an entire CPU core, like one in the Snapdragon 865, might only take up an area of ~3mm^2, measuring just ~1.75mm across. 3mm gets you across the CPU and back again. We think in nanometers and picoseconds down here. (And signals travel much slower than the speed of light in our tiny copper wires; remember t = RC.)

universal_sinc | 5 years ago | on: Intel outsources Core i3 to TSMC's 5nm process

It's not as bad as you think. From a high level: modern synthesis tools turn your RTL code (written in an HDL, or Hardware Description Language) into gates, and then map them to a library of "standard cells". These foundry-specific cells are physical layouts for AND, OR, and XOR gates, flip-flops, etc. Once the code is mapped to these cells, it is run through a place-and-route tool, which lays out all the mapped standard cells onto a plane and then wires them together in 3D, following a set of design rules from whatever foundry you are using. Finally, after verifying the physical properties of the output design, you ship it to your foundry using an industry-standard format called "GDS2", which is basically a series of 2D layers for turning into actual lithography masks. Doing this process (commonly called "RTL to GDS2") is non-trivial, but it could be done to target a new foundry in <6 months. Now, Intel is known to use some custom layout methods rather than the synthesized flow I've described, but that's pretty out of vogue and is a vestige of their early days.

universal_sinc | 5 years ago | on: AMD Zen 3 Ryzen Deep Dive Review

The importance of AMD's use of chiplets cannot be overstated. Especially in the server space, it allows them to achieve far better yields on a monster L3 cache than Intel or any other competitor. Along with the modernized Zen UArch, this means they can achieve the same perf metrics at significantly lower cost than Intel.

universal_sinc | 5 years ago | on: AMD Zen 3/Ryzen 5000 announcement [video]

The way modern CPU design works is that the development team maintains a cycle-accurate software model of the CPU. Changes can be made to the pipeline and memory system, those changes simulated against a real workload, and results reported without ever laying out a single transistor. These models are part of the secret sauce to figuring out how to eke out 19% IPC. Yes, the simulations are slow. But you can get a lot of value out of 50k or 100k simulated clock cycles.