top | item 39189267

darkbatman | 2 years ago

We are actually trying something similar, possibly Kinesis + ClickHouse or Kafka + ClickHouse. Currently Kinesis seems easier to deal with, but there is no good integration or sink connector available to process records at scale from Kinesis into ClickHouse. Did you ever run into similar problems, where you had to process records at huge scale to insert into ClickHouse without much delay?

One more thing: Kinesis can deliver duplicates (at-least-once semantics), while Kafka supports exactly-once delivery.
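For what it's worth, Kinesis duplicates can be filtered client-side before the insert by deduplicating on each record's sequence number. A minimal sketch under that assumption (the record shape mimics Kinesis `GetRecords` output; the in-memory seen-set is a simplification and would need to be bounded or persisted in a real pipeline):

```python
# Client-side dedup sketch for at-least-once Kinesis delivery.
# Assumes each record carries a unique SequenceNumber, as Kinesis records do.

def dedupe(records, seen):
    """Return only records whose sequence number has not been seen before."""
    fresh = []
    for rec in records:
        seq = rec["SequenceNumber"]
        if seq not in seen:
            seen.add(seq)
            fresh.append(rec)
    return fresh

seen = set()
batch1 = [{"SequenceNumber": "1", "Data": b"a"},
          {"SequenceNumber": "2", "Data": b"b"}]
# Redelivery: record "2" shows up again alongside a new record "3".
batch2 = [{"SequenceNumber": "2", "Data": b"b"},
          {"SequenceNumber": "3", "Data": b"c"}]

print(len(dedupe(batch1, seen)))  # 2
print(len(dedupe(batch2, seen)))  # 1
```

Alternatively, ClickHouse's ReplacingMergeTree engine can deduplicate on the storage side, though only eventually (at merge time), not at insert time.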

ashug | 2 years ago

I'm not familiar with Kinesis's sink APIs, but yes I'd imagine you'll have to write your own connector from scratch.

To answer your question, though, no: in the Kafka connector, the frequency of inserts into ClickHouse is configurable relatively independently of the batch size, so you don't need massive scale for real-time CH inserts. To save you a couple hours, here's an example config for the connector:

  # Snippet from connect-distributed.properties

  # Max bytes per batch: 1 GB
  fetch.max.bytes=1000000000
  consumer.fetch.max.bytes=1000000000
  max.partition.fetch.bytes=1000000000
  consumer.max.partition.fetch.bytes=1000000000

  # Max age per batch: 2 seconds
  fetch.max.wait.ms=2000
  consumer.fetch.max.wait.ms=2000

  # Max records per batch: 1 million
  max.poll.records=1000000
  consumer.max.poll.records=1000000

  # Min bytes per batch: 500 MB
  fetch.min.bytes=500000000
  consumer.fetch.min.bytes=500000000

You also might need to increase `message.max.bytes` on the broker/cluster side.
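The batch-boundary logic those settings describe (flush when the first of max bytes, max records, or max age is hit) can be sketched independently of Kafka. The threshold names below just mirror the config above; the `flush` callback is a stand-in for the actual ClickHouse insert, and this is a simplified illustration, not the connector's real implementation:

```python
import time

class Batcher:
    """Buffer records and flush when any bound is crossed: bytes, count, or age.

    Loosely mirrors fetch.max.bytes / max.poll.records / fetch.max.wait.ms
    semantics (a sketch; the real connector's poll loop also fires the
    time-based flush even when no new records arrive).
    """

    def __init__(self, flush, max_bytes=1_000_000_000,
                 max_records=1_000_000, max_wait_s=2.0, clock=time.monotonic):
        self.flush = flush
        self.max_bytes = max_bytes
        self.max_records = max_records
        self.max_wait_s = max_wait_s
        self.clock = clock
        self._buf, self._bytes, self._started = [], 0, None

    def add(self, record: bytes):
        if self._started is None:
            self._started = self.clock()
        self._buf.append(record)
        self._bytes += len(record)
        self._maybe_flush()

    def _maybe_flush(self):
        if (self._bytes >= self.max_bytes
                or len(self._buf) >= self.max_records
                or self.clock() - self._started >= self.max_wait_s):
            self.flush(self._buf)
            self._buf, self._bytes, self._started = [], 0, None

flushed = []
b = Batcher(flushed.append, max_bytes=10, max_records=3, max_wait_s=60)
for rec in [b"aaaa", b"bbbb", b"cccc"]:  # 12 bytes total crosses max_bytes
    b.add(rec)
print(len(flushed))  # 1
```

The point is that the count and byte bounds are set high enough that, at moderate volume, the age bound (2 seconds above) is what actually triggers inserts, which is why insert latency stays low without huge throughput.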

If you're still deciding, I'd recommend Kafka over Kinesis because (1) it's open source, so you have more options, e.g. self-hosting, Confluent, or AWS MSK, and (2) it has a much bigger community, meaning better support, more StackOverflow answers, a plug-and-play CH Kafka connector, etc.

darkbatman | 2 years ago

Thanks, these configs are helpful.