top | item 29398561

(no title)

rmb938 | 4 years ago

Maybe I missed it but are you able to talk about how many messages a second, partition count and average message size?

I run a few hundred Kafka clusters with message counts per second in the tens of millions for some clusters, a few thousand partitions, message sizes around 7kb with gzip compression, and have never needed the amount of CPU and network/disk throughput mentioned. With node counts range between ~10-25. Most of my clusters reaching those speeds at most average around 7Gbps of disk throughput per broker.

I have recently started running Kafka in GCP with their balanced ssd disks capping out at 1.2Gbps I'm not seeing much of a performance impact. It requires a few more brokers to reach the same throughput but not having any of the performance and scaling issues mentioned in this post.

My brokers are sized a bit differently than mentioned in the post as well, low amount of CPU (maximum 20ish cores) but much more memory around 248GB for my larger clusters. So maybe that has to do with it? Maybe the broker sizes that were chosen are not ideal for the workload?

Maybe I've been lucky in my setups but I would like to know a bit more. Having been running Kafka since the 0.10 days and now on 2.6 for all my clusters this type of performance problem seems a bit puzzling.

discuss

lizthegrey|4 years ago

1.5M messages/sec, average message size 1kb pre compression, 300 bytes post compression/batching.

the problem was that we were really really disk limited before for keeping the 48 hour window of data, having to keep everything on NVMe or EBS was astoundingly expensive.

but yeah, we run it all off 6 brokers now.

throwdbaaway|4 years ago

If I understand correctly, there were:

- issues with tail latency and cost when using gp2

- issues with generally bad performance when using st1

- issues with reliability when using gp3 (as an early adopter of aws "GA" product)

- issues with insufficient disk space when using local-attached nvme

- issues with confluent licensing cost

And tiered storage solves all of that.

The thing is, I have not seen kafka struggling with disk performance when running on gcp pd-ssd. Perhaps even pd-balanced would do the trick, as indicated by rmb938's comment. I am glad that you guys finally landed on a boring solution now, but things have been rather boring for years with another cloud provider. Perhaps there is no material impact from the high tail latency when using gp2, and you just needed a better contract negotiator? Surely the tail latency would be worse now whenever data need to be pulled from S3?

camel_gopher|4 years ago

Did you look at any of the other solutions such as fq? 300 bytes is a solidly small size. I’m guessing Kafka has gotten faster since this doc was published, but might be worth investigating. https://github.com/neophenix/StateOfTheMQ/blob/master/state_...