stuntprogrammer's comments

stuntprogrammer | 10 years ago | on: Inside Pascal: Nvidia's Newest Computing Platform

Technically, yes, a teraflop is a teraflop, and is directly comparable. It just means you can do an awful lot of floating point operations per second. But many systems are sensitive to memory size, memory bandwidth, and, when distributed, communication costs (i.e. the latency/bandwidth of the interconnect between machines).

The benchmark is essentially bottlenecked on FP64 matrix multiplies. If that's what you need to do, then sure, it's indicative.

Some machine learning workloads are also bottlenecked on matrix multiply, but don't need FP64 precision; FP16 is enough. That fits a bigger model in a given memory size, makes better use of memory bandwidth, and, with the right hardware support, delivers extremely high rates, as on Pascal.
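To make the memory argument concrete, here's a toy calculation; the 100M-parameter model size is just an assumption for illustration:

```python
# Footprint of a hypothetical 100M-parameter model at each precision.
# Half the bytes per element also means twice the elements moved per
# unit of memory bandwidth.
params = 100_000_000

for name, nbytes in [("FP64", 8), ("FP32", 4), ("FP16", 2)]:
    gib = params * nbytes / 2**30
    print(f"{name}: {gib:.2f} GiB")
```

Same parameter count, a quarter of the memory going from FP64 to FP16.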

Personally, I find the memory system on Pascal more interesting than the raw flops rate. Also, the use of NVLink to link multiple GPUs.

stuntprogrammer | 10 years ago | on: Inside Pascal: Nvidia's Newest Computing Platform

Pascal won't be cheap, so compare it against a top-end Xeon E5 v4 (a similar price range): it's about 7x the theoretical FP64 performance per socket, assuming the Xeon does 2.2 GHz × 22 cores × 8 FP64 AVX lanes × 2 for FMA. Similar story at FP32. The GPU wins even bigger at FP16, however.
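Back-of-envelope version of that ratio. The P100 figures (1792 FP64 units, ~1.48 GHz boost clock) are taken from Nvidia's public specs; treat all the numbers here as assumptions:

```python
# Theoretical peak FP64, per socket/chip.
xeon_e5_v4 = 2.2e9 * 22 * 8 * 2   # clock * cores * FP64 AVX lanes * FMA
p100       = 1.48e9 * 1792 * 2    # boost clock * FP64 units * FMA

print(f"Xeon: {xeon_e5_v4/1e12:.2f} TF, P100: {p100/1e12:.2f} TF, "
      f"ratio ~{p100/xeon_e5_v4:.1f}x")
```

That lands around 0.77 TF vs 5.3 TF, i.e. roughly the 7x figure.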

For historical performance, just pick one of the machines that did X teraflops. E.g. the first teraflop computer, ASCI Red, used ~7,000 200 MHz Pentium Pro chips around 1996.

stuntprogrammer | 10 years ago | on: Ask HN: Can you short later stage startups?

Essentially, no, at least not in the usual sense.

The "best" options are to either 1) short ETFs that have a suitable concentration in some subsector of the industry you think will be affected by falling unicorns, esp. leveraged versions of the funds or 2) short investors heavily concentrated in private unicorn investments such as NASDAQ:GSVC.

stuntprogrammer | 10 years ago | on: Upthere, a cloud storage service, wants to make file syncing a thing of the past

I strongly agree with you re 'conservation of distributed systems problems'. I'm being vague because I know what they're doing, but I'm limited in what I can comment on.

Their public comments are that they are running their own servers and not reselling other storage.

Personal opinion: no one can do (c) because of (a). That is, any possible (c) must tackle the fundamental hardness in the distributed systems problem, and that is what we agree (a) is doing. Using immutable objects as in S3 just shifts the problem elsewhere; it reduces the problem, but doesn't solve it.

stuntprogrammer | 10 years ago | on: Upthere, a cloud storage service, wants to make file syncing a thing of the past

I agree with many things you say. Take AWS S3 as an example: it reduces the problem by making objects immutable. Internally, they can cache aggressively without scaling fine-grained consistency on the objects as such. As an S3 user I can aggressively cache on the client too.

Now we've reduced the problem to consistency on the metadata structure which aggregates objects for the user. There are "well known" ways to do this for traditional trees. Other well-known options include an always-ask-the-server approach (i.e. always talk to a server for metadata, perhaps with local result caching), and so on.
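A toy sketch of that reduction, with all names hypothetical: immutable content-addressed objects are trivially cacheable anywhere, and the remaining consistency work collapses into one versioned metadata root updated by compare-and-swap.

```python
import hashlib

class ObjectStore:
    """Immutable, content-addressed objects plus one mutable metadata root."""

    def __init__(self):
        self.objects = {}  # hash -> bytes; write-once, so any cache is always valid
        self.root = None   # points at the current metadata tree
        self.version = 0

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.objects[key] = data  # idempotent: same bytes always yield same key
        return key

    def get(self, key: str) -> bytes:
        return self.objects[key]

    def swap_root(self, new_root: str, expected_version: int) -> bool:
        # All the distributed-systems hardness now lives here: concurrent
        # writers race on this one compare-and-swap.
        if self.version != expected_version:
            return False
        self.root, self.version = new_root, self.version + 1
        return True

store = ObjectStore()
k = store.put(b"file contents v1")
assert store.swap_root(k, expected_version=0)
# A second writer using a stale version loses the race:
assert not store.swap_root(store.put(b"file contents v2"), expected_version=0)
```

The object side never needs invalidation; the hard part is whatever protocol sits behind `swap_root` once it's distributed.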

stuntprogrammer | 11 years ago | on: Amazon buys secretive chip maker Annapurna Labs for $350M

That's why the NDA is so frustrating; we can't talk about the fun features and demos. I know the AMD and Cavium parts rather well too. The AMD one doesn't support what I have in mind. The Cavium one does on paper, but even from public info alone the cores can be seen to be underwhelming: typical network-processor style, which struggles with other workloads. Unfortunately for them, they picked up a couple of the less clued-in Calxeda execs who don't understand SW or workloads very well.

stuntprogrammer | 11 years ago | on: Amazon buys secretive chip maker Annapurna Labs for $350M

Unfortunately, the only public information I recall seeing was the use of a quad Cortex-A15-based SoC with integrated dual 10G in a NAS box.

However, the real capabilities of the product line were far more interesting, with very cool demos up and running with awesome metrics. The server possibilities are huge... especially if you provide opaque optimized services in a cloud to user workloads running on the x86 side.

stuntprogrammer | 11 years ago | on: Software optimization resources

These are invaluable references -- I hope many people here are already familiar with them, and that new readers come to appreciate them.

It's a shame that nothing quite so comprehensive exists for IO, as network and storage accesses, patterns and quirks are often more of a bottleneck than CPU, for many applications.

stuntprogrammer | 11 years ago | on: Nvidia's new mobile superchip

For your 128 GF, I take it you are assuming a 1:4 ratio with FP32. That would be unexpected. In contrast to AMD parts, the recent NV parts based on Maxwell2 (such as the GM204 in the GTX 980) are 1:32 for FP64:FP32. I would expect that ratio to hold in this part, since the SMs are likely highly similar, giving the ~16 GF number I used.
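Working the ~16 GF figure back from the marketing numbers, under the stated assumptions (FP16 packed 2:1 over FP32, Maxwell2-style 1:32 FP64:FP32):

```python
# Derive the FP32 and FP64 rates from the ~1 TF FP16 headline number.
fp16 = 1.0e12
fp32 = fp16 / 2    # FP16 packed 2:1 over FP32 -> ~500 GF
fp64 = fp32 / 32   # Maxwell2-style 1:32 FP64:FP32 -> ~16 GF

print(f"FP32 ~{fp32/1e9:.0f} GF, FP64 ~{fp64/1e9:.1f} GF")
```

That's ~15.6 GF, i.e. the ~16 GF number, rather than the 128 GF a 1:4 ratio would imply.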

I haven't seen the automotive workloads, but they strike me as repetitive and regular enough that a Denver style part would do ok. That said, I'm not shocked by the use of A57+A53.

(Disclaimer: I worked on an early version of Denver on the code-morphing software, and at an ARM server vendor on an A57-based SoC.)

stuntprogrammer | 11 years ago | on: Nvidia's new mobile superchip

In this case, I believe it's 1 TF of FP16, or ~500 GFLOPS FP32. You're likely looking at ~16 GF FP64.

I've also heard, though unconfirmed, that on the CPU side it's quad A57 + quad A53 rather than Denver derivatives.

stuntprogrammer | 11 years ago | on: Errplane (YC W13) Snags $8.1M for Open-Source InfluxDB Time Database

It's been a while since I've been an insider, and these comments are purely from an outside perspective, interacting with such users.

Vertica is seeing use for more historical stuff, and where the time series queries are pretty simple. Informix time series is doing ok, and has better support for rich queries, but isn't really playing realtime. MemSQL has the realtime perf (hi guys!) but needs to beef up on expressiveness. SAP HANA could do it, but not seeing major uptake there.

Still seeing lots of ad hoc solutions, and the expected experimentation with the usual hadoop menagerie (spark is helping make that practical).

The sensor stuff gets interesting at scale. Individual sources may not be producing data that quickly, but in aggregate it can be entertaining volume. Esp. when it comes to mobile things, and correlations become interesting to look at.

Deep thoughts need to wait for the coffee to kick in.

I suspect we'll see a lot of reinvention of technology to cope with these problems; perhaps even open source..

stuntprogrammer | 11 years ago | on: Errplane (YC W13) Snags $8.1M for Open-Source InfluxDB Time Database

Disclaimers: I was CTO@Kx for a while, but also like InfluxDB :-)

It is definitely more finance oriented, though Kx are making moves towards other application areas. Another difference I'd highlight is that Kx concentrate on the core database itself, esp. performance and expressiveness of the query language, and leave things like GUIs and admin add-on tools to partners (like First Derivatives and AquaQ).

kdb does just fine with metrics and sensor data. Personally, I would argue that it's weaker on string handling though, which can hurt in certain use cases.

I doubt it'll go open source any time soon. However, it being around a long time is something salescritters can spin to wonderful effect re stability, support, etc etc. ;-)

I think there are fine application areas in finance that you should consider -- the many areas where the core problem isn't juggling TBs of market data ticks coming off the exchanges.

stuntprogrammer | 11 years ago | on: A Conversation with Arthur Whitney (2009)

No, I wouldn't go back. That said, it has made it rather difficult to fit comfortably into standard settings. I've been successful so far, but taking Arthur's lessons and applying them across domains (to software, and to cluster/system/SoC architecture in my case) has been viewed as rather unorthodox.

I find that especially in the valley, adherence to buzzwords and fashion of the day is a little too common for my taste now.

stuntprogrammer | 11 years ago | on: A Conversation with Arthur Whitney (2009)

Yes, he is that good. I worked directly with him for a few years and it deeply changed my long-term approach. The side effect is that it became harder to deal with the "normals" ;-) Seriously though, it made me very impatient with the sorry state of the "state of the art" in the valley.

stuntprogrammer | 14 years ago | on: Wow: Intel unveils 1 teraflop chip with 50-plus cores

Is there a cost? Of course. But arguably it's in the noise on these chips. Knights Ferry and Corner use a scalar x86 core derived from the P54C. How many transistors was that? About 3.3 million. By contrast, Nvidia's 16-SM Fermi is a 3-billion-transistor design. (No, Fermi doesn't have 512 cores; that's a marketing number based on declaring each SIMD lane a "cuda core". If we do the same trick with MIC, we take 50+ cores × 16 lanes and claim 800 cores.)
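The core-counting trick, spelled out as arithmetic:

```python
# Fermi: 16 SMs x 32 SIMD lanes each, marketed as 512 "cuda cores".
fermi_cuda_cores = 16 * 32
# Same trick applied to MIC: 50 scalar cores x 16-wide vector unit.
mic_marketing_cores = 50 * 16

print(fermi_cuda_cores, mic_marketing_cores)
```

Counted consistently, it's 16 cores vs 50+ cores; counted by SIMD lane, it's 512 vs 800.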

How can we resolve this dissonance? Easy -- ignoring the fixed-function and graphics-only parts of Fermi, most of the transistors go into the caches, the floating point units, and the interconnect. These are places MIC will also spend billions of transistors, but they carry no legacy dead weight from x86 history: the FPU is 16 wide and by definition needs a new ISA. The cost of the scalar cores will not be remotely dominant.

I'm not sure why you are concerned about the pin count on the processor, except perhaps if you are complaining about changing socket designs, which is a different argument. The i7-2600 fits in an LGA 1155 socket (i.e. 1155 pins), whereas Fermi used a 1981-pin design on the compute SKUs. The Sandy Bridge CPU design is a fine one. The GPU is rapidly improving (e.g. Ivy Bridge should be significantly better, and will be a 1.4-billion-transistor design on the same 22nm as Knights Corner).

stuntprogrammer | 14 years ago | on: Wow: Intel unveils 1 teraflop chip with 50-plus cores

We should distinguish between designing for a benchmark and designing for a set of workloads. Everyone chooses representative workloads they care about and evaluates design choices on a variety of metrics by simulating execution of parts of those workloads.

Linpack is a common go-to number because, for all its flaws, it's a widely quoted number, e.g. it's used for the TOP500 ranking. It tends to let the CPU crank away without stressing the interconnect, and is widely viewed as an upper bound on the machine's performance. In the E5 case it'll be particularly helped by the move to AVX-enabled cores, and will take more advantage of that than general workloads will. Realistic HPC workloads stress a lot more of the machine beyond the CPU -- interconnect performance in particular.
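A quick way to see how close a given box gets to its theoretical peak on the matrix-multiply kernel underlying Linpack. The peak figure below is a made-up example (a 3 GHz quad-core with AVX but no FMA, 8 FP64 flops/cycle); substitute your own machine's numbers:

```python
import time
import numpy as np

def dgemm_gflops(n=2048):
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    a @ b  # warm up BLAS and its thread pool
    t0 = time.perf_counter()
    a @ b
    dt = time.perf_counter() - t0
    return 2 * n**3 / dt / 1e9  # an n x n DGEMM does ~2*n^3 flops

# Hypothetical peak: GHz * cores * FP64 flops/cycle.
peak = 3.0 * 4 * 8
measured = dgemm_gflops()
print(f"{measured:.0f} GFLOPS, {measured / peak:.0%} of assumed peak")
```

A well-tuned BLAS typically lands a large fraction of peak here, which is exactly why DGEMM/Linpack is the flattering benchmark.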

People like to dump on x86, but it's not that bad. There are plenty of features no one really uses that we still carry around, but those often end up microcoded and don't gunk up the rest of the core. The big issue is decoder power and performance: x86 decode is complex. On the flip side, the code density is pretty good, and that is important. Secondly, Intel and others have added various improvements that help avoid the downsides, e.g. caching of decode, post-decode loop buffers, uop caches, etc. Plus the new ISA extensions are much kinder.

stuntprogrammer | 14 years ago | on: Wow: Intel unveils 1 teraflop chip with 50-plus cores

Sustaining 1TF on DGEMM was explicitly mentioned by Intel in the presentation/briefing.

It's also mentioned in the press release:

http://newsroom.intel.com/community/intel_newsroom/blog/2011...

"The first presentation of the first silicon of “Knights Corner” co-processor showed that Intel architecture is capable of delivering more than 1 TFLOPs of double precision floating point performance (as measured by the Double-precision, General Matrix-Matrix multiplication benchmark -- DGEMM). This was the first demonstration of a single processing chip capable of achieving such a performance level."

Does it mean much? It means something to me, and is a great first step for those of us running compute intensive codes. They really wouldn't get far if they designed the chip only around being able to do this.

As I mentioned elsewhere in the thread, the article text is incorrect. The chip we're discussing is Knights Corner, not Knights Ferry. The latter has been in early users' hands for quite some time now, and I've spent plenty of time hacking on it. Knights Corner is the new chip that is working its way to production via the usual process, with ship-for-revenue in 2012.

The 2018 target is for an exascale machine, not shipment of initial MIC devices. TACC have already announced they'll be building out a 10-petaflop MIC-based system next year, to go operational by 2013.

Yes, I'm comparing a chip that has not shipped. But given the perf advantage, the tools and productivity advantage, and the multiyear process advantage Intel is sustaining, this is not a chip to be ignored. Knights Corner is shipping on 22nm. Other vendors have notoriously had difficulty on previous processes, depend on fabs like TSMC who are doing 28nm for them, and will be later to 14nm, etc.
