top | item 16868028

Giving meaning to 100B analytics events a day with Kafka, Dataflow and BigQuery

173 points | benjamindavy | 8 years ago | medium.com | reply

68 comments

[+] jwilliams|8 years ago|reply
I've been working with all these tools for a while now (and a bunch more along the way).

It's not the orthodox cloud-thinking, but you're often best off processing at the point of creation (or ingestion). Normalize the data as much as you can, (probably) compress[1], and send as close to the target as possible.

If you grab the data, send elsewhere, transform <repeat>... it all gets slow and expensive pretty quickly. Also a headache to manage failures.

This is especially true if you're using BigQuery. Stage your near-raw data into BigQuery and then use its muscle as much as you can. A classic example here might be de-duplicating data. A painful prospect for many distributed systems, but pretty easy on the BigQuery side.
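As a toy illustration of the "dedupe in the warehouse" idea: in BigQuery you'd typically do this with something like ROW_NUMBER() OVER (PARTITION BY ...) and keep row 1. A minimal Python sketch of the same keep-latest-per-key logic (field names are illustrative, not from the article):

```python
# Keep the newest record per event_id, mirroring what a
# ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY ts DESC) query
# filtered to row 1 would do in a warehouse.
def dedupe_latest(events):
    """events: iterable of dicts with 'event_id' and 'ts' keys."""
    latest = {}
    for e in events:
        cur = latest.get(e["event_id"])
        if cur is None or e["ts"] > cur["ts"]:
            latest[e["event_id"]] = e
    return list(latest.values())

events = [
    {"event_id": "a", "ts": 1, "value": 10},
    {"event_id": "a", "ts": 2, "value": 11},  # duplicate; newer wins
    {"event_id": "b", "ts": 1, "value": 20},
]
deduped = dedupe_latest(events)
```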

This is all especially true for time-series data. With BigQuery time partitions you can keep the queries fast (and the costs reasonable).

It also limits the range of technologies and languages you need to wrangle.

1: Choosing your data format and compression approach can make a huge difference.
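A quick way to see the footnote's point: serialize some events and measure the compressed size. This is only a stdlib sketch for measuring the trade-off; real ratios depend entirely on your data and format choice (JSON vs. Avro vs. Parquet, gzip vs. snappy, etc.):

```python
import gzip
import json

# Same events, serialized as JSON, then gzip-compressed. Repetitive
# analytics payloads tend to compress very well.
events = [{"user": f"u{i % 100}", "page": "/home", "ms": i} for i in range(10_000)]
raw = json.dumps(events).encode()
packed = gzip.compress(raw)
ratio = len(raw) / len(packed)  # >1 means compression helped
```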

[+] lima|8 years ago|reply
I built a similar pipeline using Kafka and ClickHouse - it's amazing how easy it is nowadays to ingest and analyze billions of events a day using standard tools.

ClickHouse can even ingest directly from Kafka (courtesy of Cloudflare - http://github.com/vavrusa contributed it).

[+] SiempreViernes|8 years ago|reply
Do you learn anything useful about humanity tho?
[+] ishi|8 years ago|reply
Can you elaborate a bit about the servers you used (hardware, cluster size, etc.)?
[+] ryanworl|8 years ago|reply
For anyone who wants to move this much data out of AWS and into another cloud provider, Kinesis Streams does not charge for bandwidth out. If you do the math, it works out to a very large savings over doing public internet data transfers if you are moving terabytes per day. The compute costs to copy data from Kafka to Kinesis should be minimal since you’re essentially just operating a pipe and not doing much actual compute, and this can operate easily on spot instances.
[+] nh2|8 years ago|reply
Can you elaborate on that? I haven't been able to conclude that from what it says on the pricing page [1]:

> Data transfer is free. AWS does not charge for data transfer from your data producers to Amazon Kinesis Data Streams, or from Amazon Kinesis Data Streams to your Amazon Kinesis Applications.

That sounds like it's only free to Amazon Kinesis Applications (== inside AWS).

And on [2] it says:

> If you use Amazon EC2 for running your Amazon Kinesis Applications, you will be charged for Amazon EC2 resources in addition to Amazon Kinesis Data Streams costs.

So that sounds like you will eventually pay the normal egress cost of EC2.

[1]: https://aws.amazon.com/kinesis/data-streams/pricing/ [2]: https://aws.amazon.com/kinesis/data-streams/faqs/

[+] maxnevermind|8 years ago|reply
"In digital advertising, ..." Stopped reading right there. Maybe it's a great article, though. I started working on this BigData thing not that long ago, but I've now realized that 50% of the job is advertising. I'm not a fan of it at all; I want to like the end product, or at least be neutral about it.
[+] EdwardDiego|8 years ago|reply
I work for a digital advertising company, handling the terabytes of data we produce each day - and we're not using it to track people or build profiles, we're using it to monitor our system and record important events. Several thousand publishers monetise their sites through us, so I'm happily neutral about the industry.
[+] GrandNewbien|8 years ago|reply
Could you elaborate? There are plenty of big data tasks and roles that aren't even closely related to marketing.
[+] davidbrent|8 years ago|reply
Very interesting read, but as more of an analyst, I kept waiting for the ‘meaning.’ Maybe I missed it, but for anyone else who may be considering reading this, it is more about ‘giving business structure’ to analytic events.

Good article nonetheless.

[+] throwaway66666|8 years ago|reply
You'd be surprised how good modern hardware is. We are being hit with between 1.5 and 2 million requests per minute (steady traffic, no spikes except increased usage on the weekends), and our analytics solution runs on just 2 main servers and costs $3k per month (total associated costs except human labour, which is 1 engineer working on it part time now).

We talked with a famous analytics company and they gave us a quote of $1 million yearly to work with us. So how did we get it down from $1m to $35k?

We pretty much do what the article suggests (roll-ups section). We compute data hourly. Alongside the hourly computation we also dump some extra data that can be used to compute numbers for the day (e.g. unique user ids for the day, from unique user ids per hour); then from the day we get per week, and then per month. We also have fewer moving pieces (our stack is way more traditional), and we manage our own hardware (key for keeping costs down).
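The "unique user ids per hour → per day" trick works because exact unique counts don't add up across hours, but the underlying id sets do. A minimal sketch of that roll-up (data is made up):

```python
# Daily uniques can't be summed from hourly counts (users overlap),
# so store the hourly id *sets* alongside the counts and union upward.
hourly_uniques = {
    0: {"u1", "u2"},
    1: {"u2", "u3"},
    2: {"u1", "u4"},
}
daily_uniques = set().union(*hourly_uniques.values())
daily_count = len(daily_uniques)  # 4, not 2 + 2 + 2
```

The same union composes days into weeks and weeks into months, which is why the extra hourly dump is enough to derive the coarser windows.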

When you get data back from the system you only hit the pre-computed cache; no query from the dashboards touches the main system. We only allow queries over a 30-minute window to run on the live system - to ensure no crazy load builds up on top of it - and we use that mostly to catch anomalies in the real-time data. (Our parsing time is good too: between 10 seconds and 1 minute, compared to the 2-30 minutes the article gives.)

However, this is the "You are alive but you're not living" angst of analytics. All the data is there, but you cannot freely prod it for answers and patterns. If you want an answer about past data, you need to go through an overly complex process of raising a new cluster and ingesting old backups, multiple times, then waiting for a few days. It gets relatively expensive and slow, and at times it will demoralize you and make you back off from getting the answers you need. You could try keeping a smaller cluster that only gets a percentage of the data (e.g. only 2%) for finding trends, drawing heatmaps, etc., and that one could run in realtime - but your CEO will say that's a stupid idea to your face, and it's realtime-all or nothing.

You might say that's a situation you can live with given the absolutely insane cost savings. But then the company goes on a nice retreat for a select elite few that easily costs 20k, or runs an over-the-top kitsch open party/recruiting event that costs 80k, and you get dragged into a meeting on a Monday morning and confronted: why did the "3k per month, 3 billion requests per day" system cost 6k this month? (Because we had multiple clusters computing historical data in parallel for the past 6 months, which you asked for.) You just get bitter you didn't give the analytics company the 1 million they asked for and be done with it.

[+] buremba|8 years ago|reply
If you're pre-calculating the metrics and dropping the raw data from your systems, you can't actually get the benefits of ad-hoc systems. You can't ask a new question of your existing analytics data, and you need to do custom development every time you need to see a new metric.
[+] manigandham|8 years ago|reply
How much data are you dealing with? For that price range, you can probably just get MemSQL or Clickhouse and handle real-time queries across all of the data.
[+] kasey_junk|8 years ago|reply
For those bad at arithmetic, that's ~33k rps.
[+] xstartup|8 years ago|reply
Here is an example pipeline that supports loading data from an unbounded source into BigQuery in batches using load jobs (avoiding BigQuery's streaming-insert cost).

See: https://zero-master.github.io/posts/pub-sub-bigquery-beam/
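The linked post does this with Beam; the underlying idea is just micro-batching. As a toy, self-contained sketch (the flush callback stands in for "write a file and kick off a load job" - names are illustrative):

```python
# Buffer incoming events and hand off a whole batch once it reaches a
# size threshold, instead of paying per-row streaming-insert costs.
class BatchBuffer:
    def __init__(self, batch_size, flush):
        self.batch_size = batch_size
        self.flush = flush  # called with a full batch, e.g. to start a load job
        self.buf = []

    def add(self, event):
        self.buf.append(event)
        if len(self.buf) >= self.batch_size:
            self.flush(self.buf)
            self.buf = []

batches = []
buf = BatchBuffer(batch_size=3, flush=batches.append)
for i in range(7):
    buf.add({"n": i})
# two full batches flushed; one event still buffered awaiting the next flush
```

A real pipeline would also flush on a timer so a slow trickle of events doesn't sit in the buffer indefinitely.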

[+] AWebOfBrown|8 years ago|reply
Are you the author? OT but I'm amazed the author charged merely 100 euro for implementing that solution for the subject startup, even if they're cash-strapped. I'm not familiar with BigQuery, but I'm curious what a normal rate for solving that issue would look like.
[+] manigandham|8 years ago|reply
You can also skip pub/sub and/or use it to write files to cloud storage, then load from there by using a cloud function that will trigger the load job when a new object is created.
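A hedged sketch of that Cloud Function pattern. The real function would call google-cloud-bigquery's `Client.load_table_from_uri` on the new object; here the client is injected (and faked) so the wiring is self-contained, and the `bucket`/`name` fields mirror the GCS event payload:

```python
# Factory for a GCS-triggered handler that starts a BigQuery load job
# for each newly created object. Table and object names are illustrative.
def make_on_new_object(client, table_id):
    def handler(event):
        # `event` is the GCS notification payload: bucket + object name.
        uri = f"gs://{event['bucket']}/{event['name']}"
        client.load_table_from_uri(uri, table_id)
    return handler

class FakeClient:
    """Stand-in for google.cloud.bigquery.Client, recording load calls."""
    def __init__(self):
        self.loads = []
    def load_table_from_uri(self, uri, table_id):
        self.loads.append((uri, table_id))

client = FakeClient()
handler = make_on_new_object(client, "dataset.events")
handler({"bucket": "my-bucket", "name": "events/2018-04-20.avro"})
```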
[+] cobookman|8 years ago|reply
FYI, Dataflow's BigQueryIO does not use streaming inserts. Instead it batches the data into many load jobs.
[+] asavinov|8 years ago|reply
> Giving meaning to ...

Ingesting such amounts of data is a challenge indeed. But problems will become much more complicated if it is necessary to perform complex analysis during data ingestion. Such analysis (not simply event pre-processing) can arise because of the following reasons:

* It may be physically impossible to store this volume of events - for example, when you collect them from devices and sensors

* It is necessary to make faster decisions, e.g., in mission critical applications

* It can be more efficient to do some analytics before storing data (as opposed to first storing data persistently and then loading it again for analysis)

Such analysis can be done by conventional tools like Spark Streaming (micro-batch processing) or Kafka Streams (works only with Kafka). One novel approach is implemented in Bistro Streams [0] (I am the author). It is intended for general-purpose data processing, including both batch and stream analytics, but it radically differs from MapReduce, SQL and other set-oriented data processing frameworks: it represents data via functions and processes data via column operations rather than having only set operations.

[0] Bistro: https://github.com/asavinov/bistro
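To make the "column operations instead of set operations" contrast concrete, here is an illustrative sketch (explicitly NOT Bistro's actual API): a new column is defined as a function of existing columns, rather than producing a new record set.

```python
# Columnar table as a dict of equal-length lists; a derived column is
# computed element-wise from source columns, no new record set produced.
table = {"price": [10.0, 20.0, 5.0], "qty": [2, 1, 4]}

def add_column(tbl, name, fn, *src):
    """Attach column `name` = fn applied row-wise over source columns."""
    tbl[name] = [fn(*vals) for vals in zip(*(tbl[c] for c in src))]

add_column(table, "total", lambda p, q: p * q, "price", "qty")
```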

[+] buremba|8 years ago|reply
Sounds like they over-engineered the solution. If you have an ad-hoc use case, BigQuery is great, but it's quite expensive. If you just need to pre-calculate metrics using SQL, Athena / Prestodb / Clickhouse / Redshift Spectrum might be much easier and more cost-efficient.
[+] vgt|8 years ago|reply
BigQuery PM here. I'd love to genuinely understand why you have that impression.

BigQuery's on-demand model charges you EXACTLY for what you consume. Meaning, your resource efficiency is 100% [0].

By contrast, typical "cluster pricing" technologies require you to pay for 100% of your cluster uptime. In private data centers, it's difficult to get above 30% average efficiency.

BigQuery also takes care of all software, security, and hardware maintenance, including reprocessing data in our storage system for maximum performance and scaling your BigQuery "cluster" for you.[1]

BigQuery has a perpetual free tier of 10GB of data stored and 1TB of data processed per month.

Finally, BigQuery is the only technology we're aware of whose logical storage system doesn't charge you for loads - meaning we don't compromise your query capacity, nor do we bill you for loads.

[0] https://cloud.google.com/blog/big-data/2016/02/visualizing-t...

[1] https://cloud.google.com/blog/big-data/2016/08/google-bigque...

[+] djhworld|8 years ago|reply
The downside to Athena/PrestoDB is you need to get the data into an optimal format/file size to get the best performance (e.g. Parquet or other columnar format)

This is fine for batch workloads where 'real time', i.e. latency of < 5 minutes, ideally < 1 minute, isn't much of a concern.

But it adds complication if you need to make data available quickly, because you end up adopting some hybrid approach where you have 'recent' data in a database and everything else in S3 - meaning your query layer has added complexity.

I believe this is how BigQuery works with streaming inserts, where recent streamed data is actually stored in BigTable, and asynchronously copied into Capacitor over time (this might be outdated information)

So while BigQuery seems expensive, the other solutions have a lot of other costs that you need to factor in!

[+] manigandham|8 years ago|reply
Agreed. The pricing is definitely problematic for constant use at smaller companies. We had a single query that read 200GB of data once an hour, and this ends up costing around $700/month. That's for 1 dashboard chart. Add up to 10 charts and we're looking at $7k/month just to run a single company report updated hourly.
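The arithmetic behind that figure, assuming BigQuery's on-demand rate of the time (roughly $5 per TB scanned - an assumption for illustration, not a number from the comment):

```python
# Hourly refresh of one chart scanning 200 GB per run, at ~$5/TB scanned.
PRICE_PER_TB = 5.0
gb_per_query = 200
queries_per_month = 24 * 30  # hourly refresh for a 30-day month
monthly_cost = gb_per_query / 1000 * PRICE_PER_TB * queries_per_month
# ~$720/month for the single chart, matching the ballpark quoted above;
# 10 such charts would land around $7.2k/month.
```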

BigQuery has high throughput but also very high latency, and the shared tenancy means unpredictable query times, with the same SQL taking 10 seconds or 60. They are working on both, but it'll be a while before any changes land. DML MERGE and DDL statements are now in beta, though, which resolves other big obstacles in automation.

It's a great system, but at this point I'd only recommend it for occasional but heavy queries that need to scan massive datasets with complex joins, or for some ETL uses. Perhaps if they charged for compressed storage and scans it would be different, but right now Snowflake Data is a much more usable system day to day.

[+] legendofneo|8 years ago|reply
I agree with this statement if you consider BigQuery's on-demand pricing, but at a given scale the $40,000/month flat rate becomes competitive.
[+] hinkley|8 years ago|reply
I see titles like this and my first thought is “more people need to watch Real Genius or watch it again.”

“What do you think a phase conjugate tracking system is for, Kent?”

Great. You made a system to track a billion people a day. You’re murdering privacy and then bragging about it. And bragging about it during a giant shitstorm caused by Facebook. The fuck is wrong with you?

[+] Arzh|8 years ago|reply
I want to know how people use my website, fuck me right?