top | item 25597663

jfrisby | 5 years ago

Well, for starters, it's simply untrue that every company will (barring bankruptcy, of course) eventually need extreme scale, as is the implicit assumption that the up-front cost of implementing such scalability is necessarily worthwhile. That may be true in Startuplandia, but the industry is a _lot_ bigger than the world of Silicon Valley-style startups.

For example: My current company is a small lifestyle biz. Sure, it's a "tech company", has an event-processing pipeline and the like, but if we were 100x more successful than our wildest ambitions, we still wouldn't need anywhere near 25.6k IDs/sec/server. And, given the objectives of the company, it would make more sense to turn customers away than to grow the team to accommodate that sort of demand.

The simple fact is that in a great many situations extreme scale isn't needed. And, either way, the cost of implementing extreme scalability can inflict its own harm. Highly scalable systems usually come with more operational complexity, and steeper learning curves. When you have a small team, this can impair -- or even cripple -- product development. If your company hasn't found product/market fit yet, this attitude of sacrificing the present for an imagined future can materially reduce the likelihood of that future coming to pass. Of course, the opposite is sometimes true as well. Companies have, in fact, failed because they found product market fit but took so many shortcuts they couldn't adapt and grow into their success. But the point is that determining how much to invest in future-proofing is a complex and nuanced problem not amenable to sweeping generalizations.

Now, this clearly isn't such an extreme case, but frankly Sonyflake (the original, not the Rust implementation) seems to be operationally simpler than Snowflake while offering perfectly reasonable tradeoffs. The Rust implementation might prove entirely useful to any number of organizations _based on their needs_. If they have a Rust codebase with a single process per machine this could easily be a simpler and more robust option than Sonyflake.
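For context on where that 25.6k IDs/sec/server figure comes from, here's a minimal sketch of a Sonyflake-style ID layout in Go (the bit widths -- 39 bits of time in 10 ms units, 8 sequence bits, 16 machine-ID bits -- are from Sonyflake's README; `composeID` itself is a hypothetical helper, not Sonyflake's actual API):

```go
package main

import "fmt"

// Sonyflake-style layout: 39 bits elapsed time (in 10 ms units),
// 8 bits sequence, 16 bits machine ID in the low bits.
// (Bit widths per the Sonyflake README; composeID is a hypothetical helper.)
const (
	sequenceBits = 8
	machineBits  = 16
)

// composeID packs the three fields into one uint64:
// time in the high bits, machine ID in the low bits.
func composeID(elapsed10ms uint64, seq, machine uint16) uint64 {
	return elapsed10ms<<(sequenceBits+machineBits) |
		uint64(seq)<<machineBits |
		uint64(machine)
}

func main() {
	// 8 sequence bits allow 2^8 = 256 IDs per 10 ms tick,
	// i.e. 256 * 100 = 25,600 IDs/sec/machine -- the figure above.
	fmt.Println((1 << sequenceBits) * 100) // 25600

	// Example: tick 1, sequence 2, machine 3.
	fmt.Println(composeID(1, 2, 3))
}
```

The tradeoff versus Snowflake's layout (41-bit millisecond timestamp, 12 sequence bits) is a lower per-machine rate in exchange for a longer ID lifespan and more machine IDs -- which is exactly the "perfectly reasonable tradeoffs" point.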

The kind of arrogant dismissiveness based on one's own personal (and highly specialized) experience that's shown in the downvoted comment tends to leave a bad taste in peoples' mouths. Thus the downvoting.

secondcoming | 5 years ago

> And, either way, the cost of implementing extreme scalability can inflict its own harm. Highly scalable systems usually come with more operational complexity, and steeper learning curves

I think you're overestimating the cost of future-proofing an architecture for extreme scale (and 26k/sec/server isn't actually that 'extreme'). And instead of downvoting people who've walked that walk, people might realise this by engaging with them instead.

Also, I didn't read any 'arrogant dismissiveness' in the post. Each to their own! Happy New Year!

jfrisby | 5 years ago

As I touched on, the cost can vary a lot depending on the particulars of the situation.

For example, one company I came into was using Redshift and fighting a lot of problems with it. The problems stemmed from a combination of not knowing how to use it effectively (e.g. batching writes), it being the wrong tool for the job (they were using it for a combination of OLAP and OLTP(!!) workloads), and so forth. The long and short is that for both workloads, a basic RDS Postgres instance -- or a pair, one for OLAP, one for OLTP -- would've been considerably cheaper (they'd had to scale up a couple notches from the minimum cluster size/type because of performance), avoided correctness problems (e.g. UNIQUE constraints actually doing the thing), been several orders of magnitude more performant at the scale they were at, etc. They simply didn't understand the tool and tried to apply it anyway. They basically thought Redshift was like any RDBMS, but faster "because it stores data in a columnar format."

Had they understood it, designing a data pipeline that could make use of such a tool would have required considerably more engineering person-hours than the pipeline they actually built.

Obviously this is a terribly extreme example, but the learning curve for tools -- including the time/effort needed to discover unknown unknowns -- is a cost that must be factored in.

And, even if your org has the expertise and experience already, more-scalable solutions often have a lot more moving parts than simpler solutions. Another organization I was at wanted to set up a data pipeline. They decided to go with Kafka (self-managed because of quirks/limitations of AWS' offering at the time), and Avro. After 2 months that _team_ (4 sr + 1 jr engineers, IIRC) had accomplished depressingly little. Both in terms of functionality, and performance. Even considering only the _devops_ workload of Terraforming the setup and management of the various pieces (Avro server, Zookeeper cluster, Kafka cluster, IAM objects to control access between things...), it was a vastly more complicated project than the pipeline it was meant to replace (SQS-based). Yes, that's a bit of an apples-to-oranges comparison but the project's goal was to replace SQS with Kafka for both future throughput needs and desired capabilities (replaying old data into other targets / playing incoming data into multiple targets without coupling the receiving code).

By the time I left, that project: 1. Had not shipped. 2. Was not feature-complete. 3. Was still experiencing correctness issues. 4. Had Terraform code the CTO considered to be of unacceptably poor quality. 5. Had monitoring and observability that were, let's say, a "work in progress." Some of that is for sure the Second System Effect, but it is not at all clear to me that they would have been better off if they'd gone with Kafka from day 1.

Given that we could pretty easily extract another 2 orders of magnitude of throughput out of SQS, there's a real discussion to be had about whether a better approach might've been to write a Go consumer that consumed data more efficiently and shunted it to multiple destinations -- including S3 to allow for replay. That would've been a 1-2 week project for a single engineer. Kafka is 100% the right tool for the job _beyond a certain scale_ (both of throughput and DAG complexity), but the company was something like 4 years in when I got there, and had been using SQS productively for quite some time.
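The shape of that 1-2 week project is roughly the following (a hypothetical sketch only; real code would poll SQS and write the archive sink to S3):

```go
package main

import "fmt"

// fanOut delivers every message to each sink, decoupling the source
// from its consumers. One sink archives raw messages (in real life,
// to S3) so old data can be replayed into new targets later.
// All names here are hypothetical.
func fanOut(msgs []string, sinks []func(string)) {
	for _, m := range msgs {
		for _, sink := range sinks {
			sink(m)
		}
	}
}

func main() {
	var archive []string // stand-in for the S3 replay bucket
	processed := 0       // stand-in for the live processing target

	sinks := []func(string){
		func(m string) { archive = append(archive, m) },
		func(m string) { processed++ },
	}
	fanOut([]string{"a", "b", "c"}, sinks)
	fmt.Println(len(archive), processed) // 3 3

	// Replay: feed the archive through a new sink without
	// touching the original source.
	var replayed []string
	fanOut(archive, []func(string){
		func(m string) { replayed = append(replayed, m) },
	})
	fmt.Println(len(replayed)) // 3
}
```

That covers both desired capabilities from the Kafka project -- replaying old data into other targets, and playing incoming data into multiple targets without coupling the receiving code -- with far fewer moving parts.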

And no, 26k/sec/server isn't especially huge. I was referring to the fact that the downvoted commenter was making sweeping generalizations. Sweeping generalizations tend to shut discussion down, not prompt more nuanced discussion. Other threads on this post have seen very interesting and productive discussions emerge, but note that the downvoted commenter's post hasn't really drawn anything other than people being sucked into the very discussion we're having now. It's counter-productive.