top | item 29192232

(no title)

wlrd | 4 years ago

what about the latency of data warehouses? how do you get around that?

discuss

Good callout. Sometimes, I joke that warehouse ingestion latency is the bane of my existence, but it's improving...

Our average customer runs Hightouch syncs roughly every hour, but we can actually run syncs up to every minute! HT has a lot of optimizations like only sending changes to destinations instead of all data every run.

On the warehouse side, we're seeing a lot of improvements. BigQuery has streaming insert APIs [0] implemented with a parallel database on the backend that's joined at read time. Combined with timestamp partitioned tables (sortable) and our in-warehouse diff'ing, you can actually create a streaming pipeline in Hightouch. Some companies like JetBlue are doing cool stuff with lambda views on top of Snowflake [1]. Our power users at Hightouch are running syncs as fast as every minute.

For wider context, we find 90%+ of business use cases to be just fine in batch. It's amazing to see how many people are still replacing... manual CSV workflows... with Hightouch :)

That said, there are some use cases for truly real-time workflows (e.g. a post-checkout email), and for that, customers either implement outside of Hightouch or lately, we've been fiddling around with letting customers plug directly into streams like Kafka, Kinesis, PubSub - though they lose the power of SQL aggregations _for now_.

Streaming SQL databases like Materialize [2] will fix this fundamentally, and Hightouch can connect to them. Email hello@hightouch.io if you want to try any of the new stuff!

[0]: https://cloud.google.com/bigquery/docs/write-api [1]: https://discourse.getdbt.com/t/how-to-create-near-real-time-... [2]: https://materialize.com/