
Busting myths about BigQuery

108 points | tzury | 8 years ago | cloud.google.com

42 comments

[+] tzury|8 years ago|reply
We at Reblaze [0] push to BQ nearly 3.5 Billion records every day.

We use it as our infrastructure for big-data analysis and we get the results at the speed of light.

Every 60 seconds, nearly 1K queries run, analyzing user and IP behavior.

Our entire ML is based on it, and there are half a dozen applications in our platform that make use of BQ - detecting scrapers, analyzing user input, defeating DDoS attacks, and more.

The built-in Math and Window functions (in the legacy SQL variant [1]) are super handy, the DataLab [2] integration is easy, and, last but not least, Google DataStudio [3] lets us generate interactive reports in literally minutes. All of these have made our choice (3+ years now) a great one.

BQ replaced 2,160,000 monthly core-hours (250 instances * 12 cores each, always on) and petabytes of storage, and costs us several grand per month.
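The core-hour arithmetic checks out (a quick sanity check, assuming a 30-day month):

```python
# 250 always-on instances, 12 cores each, for a 30-day month.
instances = 250
cores_per_instance = 12
hours_per_month = 24 * 30  # assumed 30-day month

core_hours = instances * cores_per_instance * hours_per_month
print(core_hours)  # 2160000
```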

This is perhaps one of the greatest hidden gems available today in the cloud sphere, and I recommend everyone give it a try.

A very good friend of mine replaced MixPanel with BigQuery and saved nearly a quarter of a million dollars a year since [4].

--

[0] https://blog.reblaze.com/how-to-stop-a-ddos-attack-in-under-...

[1] https://cloud.google.com/bigquery/docs/reference/legacy-sql

[2] https://cloud.google.com/datalab/

[3] https://datastudio.google.com

[4] https://blog.doit-intl.com/replacing-mixpanel-with-bigquery-...

[+] pentium10|8 years ago|reply
We are using BQ for 4 years now. Since the beginning with dual-writing to diferent engines. We did an architecture that routes data to Redis, Riak, Elasticsearch, Mysql, BigQuery. We often have 500+ lines of queries in BigQuery writen in standard SQL that runs our reporting system. The average query time we have is 10 seconds usually scanning at least a terabyte. We truly love the ability to write in Javascript the UDFs, and our marketing team loves to combine their Google Sheets (mapping data) with a real BQ query. We heavily use the 'dryrun" feature of BQ to see without running the query how much resources and costs will generate. We even implemented a quota for users, based on this feature, wich works great, and only when they do a mistake in a query targetting alerts them at a certain treshold. Afer a couple of years we naturally started to consolidate all our data into BigQuery. Since it's Standard SQL all our team members can start using and reporting + tooling works quite well. DataStudio is another service which we started to use more and more as well.
[+] indogooner|8 years ago|reply
How do you handle updates in BigQuery when your queries require the latest version of records? Do you use views? If so, does that affect query latency too much?
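For context, the usual append-only workaround is to keep every version of a row and select the newest one per key at query time, e.g. via a view using `ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC)`. In Python terms (a sketch of the pattern, not necessarily what any commenter here does):

```python
# Append-only store: every update is a new row. A "latest version" view
# keeps only the newest row per key - the same idea as a BigQuery view
# with ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC).
def latest_per_key(rows):
    latest = {}
    for row in rows:
        key = row["id"]
        if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
            latest[key] = row
    return list(latest.values())

rows = [
    {"id": 1, "updated_at": 1, "value": "a"},
    {"id": 1, "updated_at": 2, "value": "b"},  # newer version of id 1
    {"id": 2, "updated_at": 1, "value": "c"},
]
print(latest_per_key(rows))
```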
[+] bzillins|8 years ago|reply
As long as you turn a PG table into a BQ dataset with enough partitions (BQ tables) that queries don't touch a disproportionate amount of unnecessary data, you use date partitioning, and you can live with a minimum ~1.5 second query time, the service is truly amazing. My only frustrations are that the majority of streaming service degradations are never noted on the status dashboard, and that you can get really odd error states, like headers stating that you should be getting JSON while the body contains an HTML error message.
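Date partitioning in miniature: data is split into per-day tables (BigQuery's `table_YYYYMMDD` suffix convention), and a range query only scans the matching partitions. A toy sketch of the pruning idea, not from the comment itself:

```python
from datetime import date, timedelta

def partitions_for_range(start: date, end: date):
    """Return the YYYYMMDD partition suffixes a date-range query would scan.

    Everything outside the range is pruned, so query cost scales with the
    range, not with the full table.
    """
    days = (end - start).days + 1
    return [(start + timedelta(d)).strftime("%Y%m%d") for d in range(days)]

print(partitions_for_range(date(2017, 12, 1), date(2017, 12, 3)))
# ['20171201', '20171202', '20171203']
```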
[+] scrollaway|8 years ago|reply
Tangential: Anyone here used both BigQuery and Redshift?

Here's the primary reason that is currently keeping me on Redshift:

Redshift is basically Postgres at the query layer, which is insanely cool: all the Postgres tooling works with it, and all the Postgres expertise I have carries over. It feels like a lot less of a "lock-in".

Anyone care to chime in?

[+] nothrabannosir|8 years ago|reply
I have used both. My biggest aha moment with BQ was realising it isn't "SQL to query a database" but "map reduce as SQL". Redshift really feels like a DB: I'm always worrying about the runtime complexity of my SQL, the number of nodes in my fleet, how my joins work; it just feels like one big DB. BQ is less like that. I'm thinking in terms of the memory complexity of my operation: N, log N, etc., always having as many nodes available as the job can be parallelised to. You're writing a map reduce job on your data, and it... it feels like the more appropriate boundary for abstraction.

This is a bit hand-wavy, but there is really no better way to get a feel for it than to try it. Think of it less as a DB and more as a map reduce cluster :) if that helps.
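The map-reduce framing can be made concrete: a SQL `GROUP BY` aggregation is a map step, a shuffle, and a reduce. A toy sketch of the mental model, not BigQuery internals:

```python
from collections import defaultdict

# SELECT key, SUM(value) ... GROUP BY key, as an explicit map/shuffle/reduce.
def group_by_sum(rows):
    mapped = [(row["key"], row["value"]) for row in rows]   # map: emit (k, v)
    shuffled = defaultdict(list)
    for k, v in mapped:                                     # shuffle: group by key
        shuffled[k].append(v)
    return {k: sum(vs) for k, vs in shuffled.items()}       # reduce: aggregate

print(group_by_sum([{"key": "a", "value": 1},
                    {"key": "b", "value": 2},
                    {"key": "a", "value": 3}]))  # {'a': 4, 'b': 2}
```

Each stage parallelises trivially across keys, which is why "how many nodes?" stops being the question you ask.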

[+] sturgill|8 years ago|reply
We dual write event data to BigQuery and Redshift. BigQuery is incredibly cheap at the storage level, but can be pricey to run queries against large datasets. Personally I am not a fan of paying per-query.

Historically BigQuery data was immutable. You could delete a table, but you couldn't modify or delete a single row. That support was added recently, but there are daily limits imposed on those actions.

I'm a big fan of immutable data, up until you find that someone made a mistake somewhere. Naturally you can make modifications in ETL, but if you do that long enough your ETL job is a mess of time-based if statements.

We originally moved to also support Redshift because of BigQuery not supporting row level updates at that time. Someone on the content site would forget to put an & between URL params and our source data would be busted. Like I said, immutable data is great as long as you never make mistakes...

I'm also not a huge fan of nested records. If you use BigQuery, do yourself a favor and make sure you have a uuid field (it's a good practice anyway). When you get data back from the client it can be a pain to piece records back together (nested records come back as single rows for the most part). This makes certain queries really fast, but it can be a pain to work with in aggregate.
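The uuid advice can be sketched as follows: when nested/repeated records come back flattened into one row per child, a stable per-record uuid lets you stitch them back together. Hypothetical field names, not the BigQuery client API:

```python
import uuid

def flatten(record):
    """One logical record with a repeated 'items' field -> one row per item."""
    return [{"record_id": record["record_id"], "item": item}
            for item in record["items"]]

def reassemble(flat_rows):
    """Group flattened rows back into logical records, keyed by uuid."""
    records = {}
    for row in flat_rows:
        rec = records.setdefault(row["record_id"],
                                 {"record_id": row["record_id"], "items": []})
        rec["items"].append(row["item"])
    return list(records.values())

rec = {"record_id": str(uuid.uuid4()), "items": ["a", "b"]}
assert reassemble(flatten(rec)) == [rec]  # round-trips via the uuid
```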

My preferred setup these days is to dual write to Redshift and BigQuery. We run almost all analysis off of Redshift, but have a natural backup of data stored elsewhere. And like I said, BigQuery storage rates are dirt cheap. And if our Redshift cluster goes down for any reason, we can hop over to BigQuery to find what we need (and backfill any missing data points).

[+] manigandham|8 years ago|reply
It’s still SQL at the end of the day so not a big deal, BigQuery has a newer Standard SQL dialect now too.

It is nice to be able to reuse existing tooling for Redshift, but we're willing to make the trade for BigQuery since the no-ops model is so much nicer. It's basically magical how well it works.

[+] elvinyung|8 years ago|reply
I don't know why more things aren't PostgreSQL protocol compatible. It seems like a relatively easy way to make a system accessible to a huge ecosystem.

CockroachDB is the only other well-known project I know of that does this (but of course, it's not really an analytic database right now).

[+] nerdponx|8 years ago|reply
Is Azure SQL Data Warehouse comparable to either of these? It's distributed on the backend, and more or less standard T-SQL on the front end.
[+] IU84|8 years ago|reply
I guess - if you consider 30-year-old legacy technology insanely cool... Redshift is a shared-nothing architecture, so you'll always face scalability and concurrency issues. Snowflake is a brand-new MPP shared-data architecture, and it can scale compute and storage separately. Blows Redshift away.
[+] djhworld|8 years ago|reply
If you rely on JDBC support, BQ has a JDBC driver but it's written by a third party
[+] lima|8 years ago|reply
I recently discovered ClickHouse [1], Yandex's recently open-sourced BigQuery equivalent (someone on HN pointed me to it!).

If you're looking for an OLAP database running on your own infrastructure, make sure to give it a try.

[1]: https://clickhouse.yandex/

[+] nemothekid|8 years ago|reply
Anything you run on your own infrastructure I wouldn't really call a BigQuery equivalent. As far as performance goes, if you attacked the problem with the same number of CPU cores, I think you could get similar performance with Presto/Dremel on top of any fast data store.

What I think is different about BigQuery is just the on-demand parallelism. This means you pay per query instead of per instance which is confusing at first, but then a lot of the operational knobs simply go away (for better or for worse).

[+] tzury|8 years ago|reply
Thanks for the link. Seems interesting. BQ is serverless, though.
[+] hbarka|8 years ago|reply
Inside Google, isn’t F1 and Plx preferred over BQ?
[+] ddorian43|8 years ago|reply
See "oltp vs olap"
[+] jnordwick|8 years ago|reply
So Google ads are now HN material?

This is literally just Google putting out a press release touting one of their products. It isn't unbiased and doesn't attempt to be complete, but people here seem to think Google's opinion of its own technology is beyond reproach.

Think of something like this being on a Microsoft or Oracle website. No way it would be voted up.

[+] bloudermilk|8 years ago|reply
It’s a shame you’re getting downvoted, since you’re right about the content. The list of “truths” isn’t anything more than a feature list. I got to #4 before coming to the comments, where I found the top three comments were all well-written positive reviews of the product. The whole submission screams of a well-executed marketing play.
[+] bdcravens|8 years ago|reply
How is that any different than a blog post from Jeff Barr?