
jfrisby | 5 years ago

As I touched on, the cost can vary a lot depending on the particulars of the situation.

For example, one company I came into was using Redshift and fighting a lot of problems with it. The problems stemmed from a combination of not knowing how to use it effectively (e.g. batching writes), it being the wrong tool for the job (they were using it for a combination of OLAP and OLTP(!!) workloads), and so forth. The long and short is that for both workloads, a basic RDS Postgres instance -- or a pair, one for OLAP, one for OLTP -- would've been considerably cheaper (they'd had to scale up a couple notches from the minimum cluster size/type because of performance), avoided correctness problems (e.g. UNIQUE constraints actually doing the thing), been several orders of magnitude more performant at the scale they were at, etc. They simply didn't understand the tool and tried to apply it anyway. They basically thought Redshift was like any RDBMS, but faster "because it stores data in a columnar format."
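To make the "batching writes" point concrete, here is a minimal sketch (names and sizes are my own, not that team's code): instead of issuing one insert per row, accumulate rows and flush them in chunks, so each batch goes out as a single multi-row INSERT or, better for Redshift, a COPY from staged files.

```python
def batches(rows, size):
    """Group an incoming stream of rows into fixed-size batches.

    Each yielded batch would be written in one round trip instead of
    one round trip per row.
    """
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# e.g. 10 rows with a batch size of 4 -> batches of 4, 4, and 2
chunks = list(batches(range(10), 4))
```

The batch size itself is a tuning knob; the point is simply that the write path amortizes per-request overhead across many rows.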

Had they understood it, designing a data pipeline that could make use of such a tool would have required considerably more engineering person-hours than the pipeline they actually built.

Obviously this is a terribly extreme example, but the learning curve for tools -- including the time/effort needed to discover unknown unknowns -- is a cost that must be factored in.

And, even if your org has the expertise and experience already, more-scalable solutions often have a lot more moving parts than simpler solutions. Another organization I was at wanted to set up a data pipeline. They decided to go with Kafka (self-managed because of quirks/limitations of AWS' offering at the time), and Avro. After 2 months that _team_ (4 sr + 1 jr engineers, IIRC) had accomplished depressingly little. Both in terms of functionality, and performance. Even considering only the _devops_ workload of Terraforming the setup and management of the various pieces (Avro server, Zookeeper cluster, Kafka cluster, IAM objects to control access between things...), it was a vastly more complicated project than the pipeline it was meant to replace (SQS-based). Yes, that's a bit of an apples-to-oranges comparison but the project's goal was to replace SQS with Kafka for both future throughput needs and desired capabilities (replaying old data into other targets / playing incoming data into multiple targets without coupling the receiving code).

By the time I left, that project: 1. Had not shipped. 2. Was not feature-complete. 3. Was still experiencing correctness issues. 4. Had Terraform code the CTO considered to be of unacceptably poor quality. 5. Had monitoring and observability that were, let's say, a "work in progress." Some of that is for sure the Second System Effect, but it is not at all clear to me that they would have been better off if they'd gone with Kafka from day 1.

Given that we could pretty easily extract another 2 orders of magnitude of throughput out of SQS, there's a real discussion to be had about whether a better approach might've been a Go consumer that consumed data more efficiently and shunted it to multiple destinations -- including S3 to allow for replay. That would've been a 1-2 week project for a single engineer. Kafka is 100% the right tool for the job _beyond a certain scale_ (both of throughput and DAG complexity), but the company was something like 4 years in when I got there, and had been using SQS productively for quite some time.
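The fan-out idea above can be sketched in a few lines. This is a hypothetical illustration, not the actual proposal: in-memory lists stand in for the real sinks (an S3 replay archive, downstream consumers), and the message source stands in for an SQS receive loop.

```python
def fan_out(messages, sinks):
    """Deliver each message to every sink; returns the count delivered.

    Delivering to an archive sink alongside the live targets is what
    makes later replay possible.
    """
    count = 0
    for msg in messages:
        for sink in sinks:
            sink(msg)
        count += 1
    return count

def replay(archived, sink):
    """Replaying old data into a new target is just another pass over
    the archive -- no coupling to the original receiving code."""
    for msg in archived:
        sink(msg)

# Hypothetical stand-ins for real destinations:
archive = []   # plays the role of the S3 replay bucket
target = []    # plays the role of a downstream consumer

delivered = fan_out(["a", "b", "c"], [archive.append, target.append])

# Later, a new target can be backfilled from the archive alone:
new_target = []
replay(archive, new_target.append)
```

The design point is that replay and multi-target delivery fall out of one decision (always write to an archive), without needing a log-structured broker at this scale.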

And no, 26k/sec/server isn't especially huge. I was referring to the fact that the downvoted commenter was making sweeping generalizations. Sweeping generalizations tend to shut discussion down, not prompt more nuanced discussion. Other threads on this post have seen very interesting and productive discussions emerge, but note that the downvoted commenter's post hasn't really drawn anything other than people being sucked into the very discussion we're having now. It's counter-productive.
