(no title)
bsoft16385 | 2 years ago
My university (University of Colorado Boulder) bought one of a very few SiCortex systems ever sold. As an undergraduate I competed at SC07 and SC08 in the Cluster Challenge competition.
Our coach was a CU faculty member, Doug, who was also responsible for the SiCortex box we bought. At SC08 he told us that one of the other teams was competing with a SiCortex box. We knew they would win the LINPACK part of the challenge; what we didn't know was that LINPACK was basically the only thing they managed to get working.
We had also heard rumors that SiCortex was in financial trouble at the time. When we were walking the show floor at SC08, we came across the huge SiCortex booth, which had 10 or so machines of different sizes (I believe the smallest was a 64-core workstation and the largest was a ~5000-core whole-rack system).
I remarked to Doug that SiCortex didn't look to be in such bad shape.
Doug turned to me and said, "25% of the machines SiCortex has ever made are in that booth".
The SiCortex idea was like VLIW. On paper the numbers look great. On highly optimized synthetic benchmarks it looks good. On real world code you find out how hard it is to get good performance.
notacoward | 2 years ago
Also the machine positively sucked for linear integer code - like, say, compilers or OS kernels. One of the first things customers would do, naturally, was compile. Bad first impression.

It was also nearly impossible to get a Lustre MDS for a thousand-node cluster (which is what the biggest machine was) to run for any length of time without falling over, because Lustre was designed around the assumption that the MDS would be bigger and beefier than anything else and would have "poor man's flow control" in the form of a relatively slow network. In our case the MDS was exactly the same as every other node, and completely unprotected, because the interconnect was the fastest part of the system by quite a margin. That was my nightmare for those two years.

I've heard that Lustre has since added some flow control ("network request scheduler") but I was never able to benefit from that. PVFS2 worked better, and Gluster (which I worked on for nearly a decade afterward) would probably have been better still because it's more fully distributed and less CPU-hungry.
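A toy queue model (not Lustre's actual mechanism, and the rates are made up) illustrates the failure mode: when clients can submit requests faster than the metadata server retires them, the server's backlog grows without bound, while capping in-flight requests per client restores the back-pressure a slow network used to provide implicitly.

```python
def peak_backlog(arrival_rate, service_rate, steps, max_in_flight=None):
    """Track the peak queue length at a server over `steps` time ticks.

    max_in_flight=None models the unprotected case (fast interconnect,
    no client-side window); an integer models a bounded request window,
    roughly what a "network request scheduler" provides.
    """
    backlog = peak = 0
    for _ in range(steps):
        offered = arrival_rate
        if max_in_flight is not None:
            # Client-side window: never more than max_in_flight outstanding.
            offered = min(offered, max(0, max_in_flight - backlog))
        backlog += offered
        peak = max(peak, backlog)
        # Server retires at most service_rate requests per tick.
        backlog = max(0, backlog - service_rate)
    return peak

# Unprotected: backlog grows linearly until something falls over.
print(peak_backlog(10, 2, 100))                   # 802
# Bounded window: backlog never exceeds the window size.
print(peak_backlog(10, 2, 100, max_in_flight=16)) # 16
```

The numbers aren't meaningful; the shape is: unbounded growth versus a hard cap.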
The reason I mention all this is that there's an important lesson: building a system with a very unusual set of performance characteristics is a terrible idea business-wise, because people won't be able to realize its potential. Not even in a fairly specialized market. They'll just think it's slow. Unless it's truly bespoke, literally a one-off or close to it, nobody will want it.
P.S. I actually had to visit CU-Boulder to debug something on that machine, with the aforementioned Doug. It became one of my favorite "war stories" from a 30-year career, but this has gone on long enough so I'll skip it.
lproven | 2 years ago