stuntprogrammer's comments
stuntprogrammer | 10 years ago | on: Inside Pascal: Nvidia's Newest Computing Platform
For historical performance, just pick one of the machines that did x teraflops. E.g. the first teraflop machine, circa 1996, used ~6000 200MHz Pentium Pro chips.
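The back-of-envelope arithmetic is straightforward. Taking the comment's figures at face value (the 1 FLOP/cycle/chip rate is an assumption for illustration, not a spec):

```python
# Rough peak for the machine described above.
# Chip count and clock are the comment's figures; the
# flops-per-cycle rate is an assumption, not a spec.
chips = 6000
clock_hz = 200e6          # 200 MHz Pentium Pro
flops_per_cycle = 1

peak_flops = chips * clock_hz * flops_per_cycle
print(f"{peak_flops / 1e12:.1f} TFLOPS")  # 1.2 TFLOPS peak
```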
stuntprogrammer | 10 years ago | on: Ask HN: Can you short later stage startups?
The "best" options are to either 1) short ETFs that have a suitable concentration in some subsector of the industry you think will be affected by falling unicorns, esp. leveraged versions of the funds or 2) short investors heavily concentrated in private unicorn investments such as NASDAQ:GSVC.
stuntprogrammer | 10 years ago | on: Upthere, a cloud storage service, wants to make file syncing a thing of the past
Their public comments are that they are running their own servers and not reselling other storage.
Personal opinion: no one can do (c) because of (a). That is, any possible (c) must tackle the fundamental hardness of the distributed-systems problem, which is what we agree (a) is doing. Using immutable objects as in S3 just shifts the problem elsewhere; it reduces it, but doesn't solve it.
stuntprogrammer | 10 years ago | on: Upthere, a cloud storage service, wants to make file syncing a thing of the past
Now we've reduced the problem to consistency on the metadata structure that aggregates objects for the user. There are "well known" ways to do this for traditional trees. Other well-known options include an all-search-based approach (i.e. always talk to a server for metadata, perhaps with local result caching), and so on.
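To make the reduction concrete, here's a toy sketch (mine, not Upthere's or S3's actual design; `ObjectStore`, `MetadataRoot`, and `update` are hypothetical names): immutable content-addressed blobs never need coordination, so all the consistency concentrates in a compare-and-swap on a single mutable root pointer.

```python
import hashlib
import json

class ObjectStore:
    """Immutable, content-addressed blobs (the S3-like layer)."""
    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data  # idempotent: same bytes, same key
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

class MetadataRoot:
    """The one mutable cell: consistency reduces to a CAS on it."""
    def __init__(self, store: ObjectStore):
        self.store = store
        self.root = store.put(json.dumps({}).encode())

    def update(self, expected_root: str, name: str, blob_key: str):
        if self.root != expected_root:
            return None  # lost a race; caller re-reads and retries
        tree = json.loads(self.store.get(self.root))
        tree[name] = blob_key
        self.root = self.store.put(
            json.dumps(tree, sort_keys=True).encode())
        return self.root
```

A writer uploads its blob first (safe, since blobs are immutable), then races on the root update; a `None` return means re-read the new root, re-apply, and retry.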
stuntprogrammer | 10 years ago | on: Upthere, a cloud storage service, wants to make file syncing a thing of the past
stuntprogrammer | 11 years ago | on: Low Latency Performance Tuning for Red Hat Enterprise Linux 7
(I've built things for a few Chicago/New York HFT shops.)
stuntprogrammer | 11 years ago | on: Amazon buys secretive chip maker Annapurna Labs for $350M
stuntprogrammer | 11 years ago | on: Amazon buys secretive chip maker Annapurna Labs for $350M
However, the real capabilities of the product line were far more interesting, with very cool demos up and running and impressive metrics. The server possibilities are huge... especially if you provide opaque, optimized services in a cloud to user workloads running on the x86 side.
stuntprogrammer | 11 years ago | on: Software optimization resources
It's a shame that nothing quite so comprehensive exists for IO, as network and storage accesses, patterns and quirks are often more of a bottleneck than CPU, for many applications.
stuntprogrammer | 11 years ago | on: Nvidia's new mobile superchip
I haven't seen the automotive workloads, but they strike me as repetitive and regular enough that a Denver-style part would do OK. That said, I'm not shocked by the use of A57+A53.
(Disclaimer: I worked on an early version of Denver on the code-morphing software, and at an ARM server vendor on an A57-based SoC.)
stuntprogrammer | 11 years ago | on: Nvidia's new mobile superchip
I've also heard, though unconfirmed, that on the CPU side it's quad A57 + quad A53 rather than Denver derivatives.
stuntprogrammer | 11 years ago | on: Errplane (YC W13) Snags $8.1M for Open-Source InfluxDB Time Database
Vertica is seeing use for more historical stuff, and where the time series queries are pretty simple. Informix time series is doing ok, and has better support for rich queries, but isn't really playing realtime. MemSQL has the realtime perf (hi guys!) but needs to beef up on expressiveness. SAP HANA could do it, but not seeing major uptake there.
Still seeing lots of ad hoc solutions, and the expected experimentation with the usual hadoop menagerie (spark is helping make that practical).
The sensor stuff gets interesting at scale. Individual sources may not be producing data that quickly, but in aggregate it can be entertaining volume. Esp. when it comes to mobile things, and correlations become interesting to look at.
Deep thoughts need to wait for the coffee to kick in.
I suspect we'll see a lot of reinvention of technology to cope with these problems; perhaps even open source.
stuntprogrammer | 11 years ago | on: Errplane (YC W13) Snags $8.1M for Open-Source InfluxDB Time Database
It is definitely more finance oriented, though Kx are making moves towards other application areas. Another difference I'd highlight is that Kx concentrate on the core database itself, esp. performance and expressiveness of the query language, and leave things like GUIs and admin add-on tools to partners (like First Derivatives and AquaQ).
kdb does just fine with metrics and sensor data. Personally, though, I'd argue it's weaker on string handling, which can hurt in certain use cases.
I doubt it'll go open source any time soon. However, it being around a long time is something salescritters can spin to wonderful effect re stability, support, etc etc. ;-)
I think there are fine application areas in finance that you should consider -- just look at the many areas where the core problem isn't juggling TBs of market-data ticks coming off the exchanges.
stuntprogrammer | 11 years ago | on: A Conversation with Arthur Whitney (2009)
stuntprogrammer | 11 years ago | on: A Conversation with Arthur Whitney (2009)
I find that especially in the valley, adherence to buzzwords and fashion of the day is a little too common for my taste now.
stuntprogrammer | 11 years ago | on: A Conversation with Arthur Whitney (2009)
stuntprogrammer | 14 years ago | on: Wow: Intel unveils 1 teraflop chip with 50-plus cores
How can we resolve this dissonance? Easy -- ignoring the fixed-function and graphics-only parts of Fermi, most of the transistors go to the caches, the floating-point units, and the interconnect. These are places MIC will also spend billions of transistors, but they're not carrying legacy dead weight from x86 history -- the FPU is 16 wide and by definition must have a new ISA. The cost of the scalar cores will not be remotely dominant.
I'm not sure why you're concerned about the pin count on the processor, except perhaps if you're complaining about changing socket designs, which is a different argument. The i7 2600 fits in an LGA 1155 socket (i.e. 1155 pins), whereas Fermi used a 1981-pin design on the compute SKUs. The Sandy Bridge CPU design is a fine one, and the GPU is rapidly improving (e.g. Ivy Bridge should be significantly better, and will be a 1.4-billion-transistor design in the same 22nm as Knights Corner).
stuntprogrammer | 14 years ago | on: Wow: Intel unveils 1 teraflop chip with 50-plus cores
Linpack is a common go-to number because, for all its flaws, it's widely quoted -- e.g. it's the basis of the Top500 ranking. It tends to let the CPU crank away without stressing the interconnect, and is widely viewed as an upper bound on a machine's performance. In the E5 case it'll be particularly helped by the move to AVX-enabled cores, and will take more advantage of that than general workloads will. Realistic HPC workloads stress a lot more of the machine than the CPU -- interconnect performance in particular.
People like to dump on x86, but it's not that bad. There are plenty of features no one really uses that we still carry around, but those often end up microcoded and don't gunk up the rest of the core. The big issue is decoder power and performance: x86 decode is complex. On the flip side, the code density is pretty good, and that matters. Intel and others have also added various improvements that help avoid the downsides, e.g. caching of decode, post-decode loop buffers, uop caches, etc. Plus the new ISA extensions are much kinder.
stuntprogrammer | 14 years ago | on: Wow: Intel unveils 1 teraflop chip with 50-plus cores
It's also mentioned in the press release:
http://newsroom.intel.com/community/intel_newsroom/blog/2011...
"The first presentation of the first silicon of “Knights Corner” co-processor showed that Intel architecture is capable of delivering more than 1 TFLOPs of double precision floating point performance (as measured by the Double-precision, General Matrix-Matrix multiplication benchmark -- DGEMM). This was the first demonstration of a single processing chip capable of achieving such a performance level."
Does it mean much? It means something to me, and it's a great first step for those of us running compute-intensive codes. They really wouldn't get far if they designed the chip only around being able to do this.
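The arithmetic behind a quoted DGEMM number is simple: an n-by-n FP64 matrix multiply costs roughly 2n³ floating-point operations, so the rate is just that divided by wall time. A minimal sketch of measuring it yourself (NumPy on a CPU, not the Knights Corner demo, obviously):

```python
import time
import numpy as np

# An n-by-n FP64 matrix multiply is ~2*n**3 floating-point ops;
# sustained rate = work / wall time.
n = 1024
a = np.random.rand(n, n)
b = np.random.rand(n, n)

t0 = time.perf_counter()
c = a @ b  # dispatches to the BLAS DGEMM routine
elapsed = time.perf_counter() - t0

gflops = 2 * n**3 / elapsed / 1e9
print(f"{gflops:.1f} GFLOPS sustained")
```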
As I mentioned elsewhere in the thread, the article text is incorrect. The chip we're discussing is Knights Corner, not Knights Ferry. The latter has been in early user hands for quite some time now, and I've spent plenty of time hacking on it. Knights Corner is the new chip that is working its way to production via the usual process, with ship-for-revenue in 2012.
The 2018 target is for an exascale machine, not shipment of initial MIC devices. TACC have already announced they'll be building out a 10 petaflop MIC based system next year to go operational by 2013.
Yes, I'm comparing a chip that hasn't shipped, but given the perf advantage, the tools and productivity advantage, and the multiyear process advantage Intel is sustaining, this is not a chip to be ignored. Knights Corner is shipping on 22nm. Other vendors have notoriously had difficulty on previous processes, depend on fabs like TSMC who are doing 28nm for them, and will be later to 14nm, etc.
The benchmark is essentially bottlenecked on FP64 matrix multiplies. If that's what you need to do, then sure, it's indicative.
Some machine learning workloads are also bottlenecked on matrix multiply, but don't need FP64 precision. They can use FP16. Fits a bigger model in a given memory size, makes better use of memory bandwidth, and given the right hardware support, you can get extremely high rates as on Pascal.
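The memory-footprint point is easy to see: the same matrix in FP16 takes a quarter of the bytes of FP64. (CPU-side NumPy, just to show the sizes; the throughput win needs hardware FP16 paths, as on Pascal.)

```python
import numpy as np

# Same matrix, two precisions: FP16 holds it in a quarter
# of the bytes of FP64.
n = 4096
a64 = np.ones((n, n), dtype=np.float64)
a16 = a64.astype(np.float16)

print(a64.nbytes // 2**20, "MiB")  # 128 MiB
print(a16.nbytes // 2**20, "MiB")  # 32 MiB
```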
Personally, I find the memory system on Pascal more interesting than the raw flops rate. Also the use of NVLink to connect multiple GPUs.