item 36135166

bsoft16385 | 2 years ago

That reminds me of a story about SiCortex.

My university (University of Colorado Boulder) bought one of a very few SiCortex systems ever sold. As an undergraduate I competed at SC07 and SC08 in the Cluster Challenge competition.

Our coach was a CU faculty member, Doug, who was also responsible for the SiCortex box we bought. At SC08 he told us that another team was competing on a SiCortex box. We knew they would win the LINPACK part of the challenge, but we didn't know that LINPACK was basically the only thing they managed to get working.

We had also heard rumors that SiCortex was in financial trouble at the time. Walking the show floor at SC08, we came across the huge SiCortex booth, which held ten or so machines of different sizes (I believe the smallest was a 64-core workstation and the largest a ~5,000-core whole-rack system).

I remarked to Doug that SiCortex didn't look to be in such bad shape.

Doug turned to me and said, "25% of the machines SiCortex has ever made are in that booth".

The SiCortex idea was like VLIW: on paper the numbers look great, and on highly optimized synthetic benchmarks it looks good, but on real-world code you find out how hard it is to get good performance.

notacoward|2 years ago

Sounds about right. The processors were six-core 500 MHz single-issue parts, pretty damn slow even by the standards of the time (2006-2008), beefed up with some extra floating-point hardware and some relatively very fast interconnect stuff. The key is that there were a lot of processors: 972 in the big machine. So you really, really had to rely on parallelism to get any kind of performance, and a lot of code doesn't parallelize well at all. Even in the HPC space, a surprising number of codes just aren't designed to scale past about 64 nodes.
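A back-of-envelope Amdahl's-law sketch makes the point. The per-core slowdown factor below is a hypothetical ratio for illustration, not a measured SiCortex figure:

```python
# Illustrative Amdahl's-law sketch: why a machine built from many
# slow cores needs a near-perfectly-parallel workload to compete
# with a smaller number of fast cores.

def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Ideal speedup over one core, given the parallelizable fraction."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)

# Hypothetical assumption: each 500 MHz single-issue core runs ~6x
# slower than a contemporary commodity core on scalar code.
SLOW_CORE_PENALTY = 6.0

for p in (0.50, 0.95, 0.99, 0.999):
    # Speedup of the 972-core machine relative to ONE fast core.
    effective = amdahl_speedup(p, 972) / SLOW_CORE_PENALTY
    print(f"parallel fraction {p:5.3f}: ~{effective:6.1f}x one fast core")
```

With a 50% parallel fraction the whole 972-core machine lands well below a single fast core; only near 99.9% parallelism does the core count pay off, which is exactly why mostly-serial codes (and compilers) made it look slow.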

Also, the machine positively sucked at linear integer code, like, say, compilers or OS kernels. One of the first things customers would do, naturally, was compile. Bad first impression.

It was also nearly impossible to get a Lustre MDS for a thousand-node cluster (which is what the biggest machine was) to run for any length of time without falling over, because Lustre was designed around the assumption that the MDS would be bigger and beefier than anything else, protected by "poor man's flow control" in the form of a relatively slow network. In our case the MDS was exactly the same hardware as every other node, and completely unprotected, because the interconnect was the fastest part of the system by quite a margin. That was my nightmare for those two years. I've heard that Lustre has since added some flow control ("network request scheduler"), but I was never able to benefit from it. PVFS2 worked better, and Gluster (which I worked on for nearly a decade afterward) would probably have been better still, because it's more fully distributed and less CPU-hungry.

The reason I mention all this is that there's an important lesson: building a system with a very unusual set of performance characteristics is a terrible idea business-wise, because people won't be able to realize its potential. Not even in a fairly specialized market. They'll just think it's slow. Unless it's truly bespoke, literally a one-off or close to it, nobody will want it.

P.S. I actually had to visit CU-Boulder to debug something on that machine, with the aforementioned Doug. It became one of my favorite "war stories" from a 30-year career, but this has gone on long enough so I'll skip it.

lproven|2 years ago

Hey, some of us are listening, and would love to hear that story!