item 3241704

Wow: Intel unveils 1 teraflop chip with 50-plus cores

101 points | jhack | 14 years ago | seattletimes.nwsource.com

39 comments

[+] mrb | 14 years ago
"Wow?" This is actually disappointingly low raw TFLOPS performance.

Intel's Knights Ferry GPGPU ASIC is not yet available, but it is already outperformed by two-year-old chips from AMD and Nvidia, both of whom have been selling GPU ASICs breaking the 1 TFLOPS barrier (single precision) for over two years now. The AMD Radeon HD 5870 and HD 6970 both reach 2.7 TFLOPS, and AMD makes a dual-ASIC PCIe card, the HD 6990, reaching 5.1 TFLOPS. Nvidia's mid-level GTX 275 (1.01 TFLOPS) was released in April 2009.

In fact, Knights Ferry evolved from the Larrabee GPU project, whose performance disappointed Intel so much that they decided to forgo the GPU market (where it was clearly not going to be competitive) and remain focused only on the GPGPU market.

The one strong advantage of Knights Ferry is not performance but x86 compatibility, which should in theory make it easy to port programs to. One would still have to rewrite the app to use the LRBni instruction set (512-bit registers) to fully exploit the computing performance; otherwise one would be limited to a quarter of its potential with SSE (128-bit registers).

Another relative advantage of Knights Ferry is that each of its 50+ cores should be able to execute its own instruction every clock cycle (50+ unique instructions chip-wide), making it very flexible. (Compare the HD 6970, whose 384 "cores", or VLIW units, can execute only 24 unique instructions: the ASIC is organized as 24 SIMD engines of 16 VLIW units each, and the 16 VLIW units in a SIMD engine execute the same instruction in 16 different thread contexts, for a total of 384 threads.)
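The arithmetic behind that comparison can be sketched out quickly (the core and SIMD-engine counts are the only inputs, and they come from this comment):

```python
# Unique instructions in flight per clock, per the figures above.

# Knights Ferry: 50+ real cores, each fetching its own instruction stream.
mic_cores = 50
mic_unique_instructions = mic_cores          # one independent stream per core

# Radeon HD 6970: 24 SIMD engines of 16 VLIW units each; all 16 units in
# an engine execute the same instruction, just in different thread contexts.
simd_engines = 24
vliw_units_per_engine = 16
gpu_vliw_units = simd_engines * vliw_units_per_engine  # the 384 "cores"
gpu_unique_instructions = simd_engines                 # only 24

print(gpu_vliw_units, gpu_unique_instructions, mic_unique_instructions)
# 384 24 50
```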

Edit: my bad, it looks like Intel claims 1 TFLOPS in double precision, which would put it on the level of upcoming AMD chips (the HD 7970 is rumored to provide 4.1 SP TFLOPS or 1.0 DP TFLOPS in early 2012).

[+] stuntprogrammer | 14 years ago
A few problems in your comment:

1) The article text is wrong (and doesn't match the pics of the slides). The chip demonstrated today is Knights Corner which is a new part, not the older Knights Ferry SDV.

2) When counting flops we need to distinguish between single-precision and double-precision flops. Your comparison isn't valid -- Knights Corner was shown sustaining over 1 TF on a double-precision code. Nvidia's most recent flagship GPU has a theoretical peak of 515 GF/s but sustains less than 225 GF/s on the same DGEMM operation. Knights Corner is sustaining 4-5x that, which implies its theoretical peak is higher again. AMD's GPUs also cannot touch this with a single chip. Their dual-chip 6990 has what looks like the same theoretical peak but far lower practical performance, due to being more of a graphics part than a compute part (e.g. look at the cache structures).
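A quick sanity check on the 4-5x claim, using only the numbers quoted in this comment (515 GF/s peak, under 225 GF/s sustained, over 1 TF sustained):

```python
# Double-precision DGEMM: theoretical peak vs. sustained, per the comment.
nvidia_peak_gf = 515        # GF/s, theoretical peak
nvidia_sustained_gf = 225   # GF/s, sustained (quoted as an upper bound)
kc_sustained_gf = 1000      # GF/s, Knights Corner sustained (>1 TF)

efficiency = nvidia_sustained_gf / nvidia_peak_gf   # fraction of peak reached
speedup = kc_sustained_gf / nvidia_sustained_gf     # KC vs. the GPU, sustained

print(f"GPU efficiency: {efficiency:.0%}, KC speedup: {speedup:.1f}x")
# GPU efficiency: 44%, KC speedup: 4.4x
```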

You are correct that these are real cores, each with a wide vector unit. If we wanted the equivalent of GPU "cores" we should multiply out by the vector width per core.

[+] anigbrowl | 14 years ago
I guess this is partly about stealing AMD's thunder following their recent release of the 16-core Bulldozer chips. Not that the technologies compare directly, but most non-geeks are just going to see 50 cores >> 16 and buy Intel again next time.
[+] modeless | 14 years ago
I see questions about why this is better than a GPU for anything. Two main things:

1. The double-precision floating point performance is a lot better.

2. Unlike GPUs which have baroque memory access restrictions and many performance cliffs, this is a much more familiar SMP architecture with a unified coherent cache hierarchy.

[+] Klinky | 14 years ago
1. It's estimated to come out a year later than AMD's Radeon part, which will boast similar double-precision floating-point performance. Both could be delayed, but if Southern Islands comes out on time and Knights Corner slips, AMD or Nvidia might have another part out offering even higher DPFP performance by the time KC sees the light of day.

2. I am not sold that modeling the cores after x86/SMP means special care won't be needed to feed the Intel MIC architecture properly. I'd like to see some real world numbers on purchasable hardware.

[+] joshu | 14 years ago
Heh. Article is nearly incoherent:

> If you're building a new system and want to future-proof it, the Knights Ferry chip uses a double PCI Express slot. Chrysos said the systems are also likely to run alongside a few Xeon processors.

[+] rbanffy | 14 years ago
The memory bus must be saying "Great. Another 50 mouths to feed".

You have to design your program very carefully if you don't want the cores to starve.

[+] jwatte | 14 years ago
So it doesn't run the general x64 system architecture? Then how is this different from GPGPU? I thought NVIDIA broke a teraflop on a dual-slot card a while back (dunno if it was a single GPU). Slot-based coprocessors have always been a very niche kind of thing.

Basically, if I can't hook it up to my SSD array and also my GPU, then it's not a "real" computer -- like the reporter was talking about a laptop. And if I can't rent it by the hour from Amazon, then it's not really a good investment (Amazon already has GPU instances.)

Or, you know, maybe this time it will work, when every co-processor platform before has failed...

[+] r3demon | 14 years ago
The AMD Radeon HD 6990 already has over 1 TFLOPS of double-precision performance, and there's no problem buying it. Intel is too late.
[+] phamilton | 14 years ago
Yes there is no problem buying it, but have you ever tried programming in OpenCL? Complexity aside, GPGPU hits a big bottleneck when dealing with large datasets. There just isn't enough memory available on the GPU, and transfers to and from the device are costly.
[+] rayiner | 14 years ago
Far more limited architecture.
[+] cultureulterior | 14 years ago
I'll be very interested to see how this does with raytracing.
[+] nextparadigms | 14 years ago
This was to be expected. In a classic disruptive innovation fashion, Intel is starting to move upmarket, where the profits are higher, and in a few years they'll be leaving the mobile and notebook/PC market to ARM.
[+] ck2 | 14 years ago
These aren't x86 cores, are they?

I mean, 50 Atom cores would be downright silly.

50 i3 cores, well then you might have something.

[+] rayiner | 14 years ago
Yes, these are x86 cores. Actually quite a bit like the Pentium (P55C) with a 512-bit vector unit bolted on.
[+] suivix | 14 years ago
What is the significance of this over standard GPUs that can already do over a teraflop?

See table: http://en.wikipedia.org/wiki/Northern_Islands_(GPU_family)

[+] stuntprogrammer | 14 years ago
The AMD 6990 hits ~1.37 TF double precision, but uses 2 GPU chips to do it, whereas this is a single chip at that perf level.

It is difficult to get good performance out of GPUs for a very wide range of highly parallel programs. Effectively, you are programming a part that is primarily a graphics part, since that is where the volume is, with just enough compute compromises to try to grow that market. MIC is designed to be a compute processor from the get-go. How about this for a difference: it can boot Linux all on its own! You can ssh into it and run programs. You can even run 'reverse offload' programs that call out to code on the CPU! Try doing any of that with a GPU.

BTW, this MIC chip has a large number of cores (50+), and these are real cores; they're not doing the GPU marketing trick of counting SIMD lanes as "cores". You could multiply 50+ * 16 to get the equivalent number of GPU "cores". Each core is cache coherent, with a decent memory hierarchy designed for compute. There's no graphics tax here.
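Back-of-the-envelope arithmetic for those figures (the 16 SP lanes follow from the 512-bit vectors; the FMA assumption and the derived clock are illustrative guesses, not disclosed specs):

```python
# GPU-style "core" count: real cores times SIMD lanes per core.
real_cores = 50
sp_lanes_per_core = 16          # 512-bit vectors / 32-bit single-precision lanes
gpu_style_cores = real_cores * sp_lanes_per_core
print(gpu_style_cores)          # 800 "cores" in GPU marketing terms

# Clock needed to sustain 1 DP TFLOPS, assuming 8 DP lanes per vector and
# one fused multiply-add (2 flops) per lane per cycle -- an assumption.
dp_lanes_per_core = 8           # 512-bit vectors / 64-bit doubles
flops_per_lane_per_cycle = 2    # multiply + add
required_ghz = 1000 / (real_cores * dp_lanes_per_core * flops_per_lane_per_cycle)
print(required_ghz)             # 1.25 GHz -- plausible for such a part
```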

I have much more expectation that Intel can leverage their massive process advantage to keep MIC ahead on compute performance each generation. It'll be a relief to have compute parts rather than repurposed GPUs.

[+] phamilton | 14 years ago
x86 instruction set. GPUs are a pain to program and port applications to.

If I recall correctly, this is somewhat of a spinoff of the Larrabee chip - http://en.wikipedia.org/wiki/Larrabee_(microarchitecture)

You can see in the Wikipedia page the benefits of Larrabee over traditional GPUs. I believe the new chip was designed to be even more flexible and similar to modern processors.

[+] eliben | 14 years ago
It's x86, a normal Intel CPU. It can even run Linux. Drastically different from what a GPU is.
[+] ciderpunx | 14 years ago
I should probably get one of these for my laptop.