Databricks response to Snowflake's accusation of lacking integrity

217 points| rxin | 4 years ago |databricks.com

156 comments

drej|4 years ago

What I find hilarious is that companies argue who can query 100 TB faster and try to sell this to people. I've been on the receiving end of offers by both of the companies in question and used both platforms (and sadly migrated some data jobs to them).

While they can crunch large datasets, they are laughably slow for the datasets most people have. So while I did propose we use these solutions for our big-ish data projects, management kept pushing for us to migrate our tiny datasets (tens of gigabytes or smaller) and the perf expectedly tanked compared to our other solutions (Postgres, Redshift, pandas etc.), never mind the immense costs to migrate everything and train everyone up.

Yes, these are very good products. But PLEASE, for the love of god, don't migrate to them unless you know you need them (and by 'need' I don't mean pimping your resume).

sanketsarang|4 years ago

I did work on making a database myself, and I must say that querying 100TB fast, let alone storing 100TB of data, is a real problem. Some companies (very few) don't have much choice but to use a DB that works on 100TB. If you do have small data, then you have a lot of options. But if your data is large, then you have very few options. So it is correct to be competing on how fast a DB can query 100TB of data; while at the same time being slow if you have just 10GB of data. Some databases are designed only for large data, and should not be used if your data is small.

tshanmu|4 years ago

Resume driven development FTW!

StephenJGL|4 years ago

Very true. You have to understand the actual capabilities and your actual requirements. We work with petabyte size datasets and BigQuery is hard to beat. Our other reporting systems are still all in MySQL though.

autokad|4 years ago

It's my experience that if it's just tens of GBs, then use 'normal' solutions. If it's TBs, then Spark is great for that. Note that I have only used Databricks and Spark, not Snowflake.

scapecast|4 years ago

The irony here is that what Databricks is doing to Snowflake is exactly what Snowflake did to AWS and Redshift.

Same playbook - show that you’re better in a key metric that’s easy to understand (performance) to get the attention, but then pitch the paradigm change.

In Snowflake’s case, that was separation of storage and compute.

In Databricks’ case, it’s the Lakehouse Architecture.

I think the reason Snowflake is so nervous is that they know they can’t win this game.

falaki|4 years ago

To be fair Apache Spark, which started long before either company existed, was built on the assumption that compute and storage should be separate. Unlike Hadoop, Spark did not come with any storage system and could read from any source.

doppelganger1|4 years ago

SF spreads a lot of FUD saying that DB can’t perform, and it was true. DB then went out and hired a lot of engineering talent with a diverse background and has been investing a lot of money in being a best in class SQL offering, so what do you do? You do something to get people’s attention. They’re saying, “hey, we have great performance too, you should also look at us for your SQL workloads.”

ignoramous|4 years ago

> I think the reason why Snowflake is so nervous because they know they can’t win this game.

Isn't Databricks' delta.io, which their Data Lakehouse product builds on top of, open source? Snowflake could take the best parts from it and run with them?

glogla|4 years ago

In what way is lakehouse architecture beneficial over something like Snowflake or BigQuery?

I understand the appeal over having lake and warehouse as separate components, but with those native cloud warehouses, you can already do everything a lake does.

avip|4 years ago

I've used both products in production. Both are good++.

The blog wars seem extremely ridiculous to me. I don't recall ever choosing one over another based on how fast it runs on some imaginary arbitrary dataset.

paxys|4 years ago

Manufactured rivalries can be a great thing for business. We have been debating Coke vs Pepsi, Nike vs Reebok, McDonald's vs Burger King for decades now while these companies laugh all the way to the bank.

kartoonhero|4 years ago

It's not ridiculous at all. This is the coming of age for a brand new data architecture.

One of the biggest FUDs for a data lake architecture is performance - and this benchmark should put that concern to rest.

CactusOnFire|4 years ago

It was inevitable.

Both Databricks and Snowflake have inflated marketing budgets, and marketing feels they have to "beat" the other one or they'll lose the market.

inetknght|4 years ago

Snowflake accuses other companies of lacking integrity?

I really wish I could block all of Snowflake's domain from my inbox. Sadly, Google encourages spammers to just create a new email address. So I get a few emails each month from Snowflake who ask me to try their products. I've never done business with them and there's no unsubscribe link.

Fuck Snowflake for thinking it has any room to talk about integrity.

doppelganger1|4 years ago

What I find comical is they accuse Databricks of lacking integrity but they don’t actually call out anything except that their benchmark was faster than what Databricks did in Snowflake. Databricks then reruns the benchmark and says the only reason that Snowflake’s was faster was because of the built-in dataset they used. Databricks was able to match Snowflake’s numbers using it, but when they loaded the actual dataset, it was much slower, which is how a proper TPC benchmark is supposed to happen. They then said that Databricks’ blog doesn’t match the TPC results, but when I looked at them, they do match. I guess Snowflake just expects people to take arguments at face value. Then I saw someone on LinkedIn complaining that Databricks must have used some beta version. I didn’t see a beta version being used, but that kind of goes out the window when Databricks follows up and then posts that they matched Snowflake when they used their built-in TPC dataset.

This is funny and interesting to watch but also a distraction I feel. Amazon says it best when they say, “Leaders start with the customer and work backwards. They work vigorously to earn and keep customer trust. Although leaders pay attention to competitors, they obsess over customers.”

boublepop|4 years ago

Snowflake must be kicking themselves hard now for letting a story that was “Databricks is a viable alternative” turn into “Snowflake has absolutely no integrity and will fling mud even while they are gaming the statistics”

Really can’t see what they can do now short of “bending” to Databricks and entering the competition. And naturally it’s no longer enough that they show comparable performance. They have to hit their gamed stats somehow, otherwise any news, even if they beat Databricks, will be reported as “see, we told you they were cheating”.

bloodyplonker22|4 years ago

Databricks is trying to punch up at the market leader. Every decent marketer knows that you should never do the opposite and punch down.

djbusby|4 years ago

I'm crap at marketing and know the only-punch-up rule.

aliswe|4 years ago

what differences in size (or height) are we talking about?

jchw|4 years ago

Before the Snowflake blog post, I did not know what Snowflake or Databricks were. I can only imagine that this rivalry is great for both of them, even if Databricks is somewhat on the advantage end, at least from a tactical standpoint; I admit though that they seem to be a bit unnecessarily defensive considering the position they're in with the exchange.

In general though, I'm still not complaining. It's interesting to see a dispute like this unfold.

qaq|4 years ago

Snowflake is the $120B market cap darling of cloud data warehouses. I doubt obscurity is a problem they are trying to solve.

AdamProut|4 years ago

I would say that TPC-DS and TPC-H are really table stakes benchmarks for data warehouses at this point in time (maybe they weren't 10 years ago). How to build a database that does well on them is well documented in the literature now[1][2][3][4] (maybe a few other papers). It's not easy to build such a database, but it's "just" hard work and many companies have the $$ necessary to do that work. There isn't any magic or technical moat in the results for Databricks (or Snowflake, or Redshift, etc.).

I think Databricks is overly enthusiastic about their results as they have been trying to be competitive with cloud DWs on these benchmarks for a number of years now. They have finally caught up (by building Delta Lake and their Photon query engine, which implement a number of standard DW features).

  [1] http://www.vldb.org/pvldb/vol13/p1206-dreseler.pdf
  [2] https://stratos.seas.harvard.edu/files/stratos/files/columnstoresfntdbs.pdf
  [3] https://web.stanford.edu/class/cs245/readings/c-store.pdf
  [4] http://sites.computer.org/debull/A12mar/vectorwise.pdf

thrtlvlmidnight|4 years ago

I agree with everything above. The main advantage the newer data warehouses have over the legacy on-prem incumbents is that they had the chance to build from scratch having learned from all of the challenges that the original players encountered.

The public pissing contest is entertaining while also being silly and slightly cringe, but I think it's a nice story for Databricks nonetheless. They now have a performant SQL-based analytics engine that can credibly compete with the best DWs in the market today, and it's just one part of their overall platform.

The sense I get is that Snowflake wants the conversation to be "no matter what you do, you need a data warehouse, and we're the best in the business at that." Databricks' Lakehouse approach is a fundamental challenge to that, and if they're getting this kind of performance from their analytics engine against the market-leading data warehouses today, that's a big momentum shift in their favour.

redwood|4 years ago

As much as I love seeing competition in the space and am enjoying my popcorn, I really don't understand what Databricks is doing here: this feels like a childish food fight rather than an obsession with the customer...

saj1th|4 years ago

:) That is a good question. Why spend eng cycles to submit results to the TPC council - why not just focus on customers?

I believe the co-founders have addressed this in the blog.

> Our goal was to dispel the myth that Data Lakehouse cannot have best-in-class price and performance. Rather than making our own benchmarks, we sought the truth and participated in the official TPC benchmark.

I'm sure anybody seriously looking at evaluating data platforms would want to look at things holistically. There are different dimensions like open ecosystem, support for machine learning, performance etc. And different teams evaluating these platforms would stack rank them in different orders.

These blogs, I believe, show that Databricks is a viable choice for customers when performance is a top priority (along with other dimensions). That IMO is customer obsession.

kf6nux|4 years ago

I'd say helping customers spot fraud* is serving the customers' interests.

* I haven't executed the test suite, but fraud seems likely.

jjoonathan|4 years ago

All publicity is good publicity.

Both participants in a fight can win by implicitly excluding their real competitors.

glogla|4 years ago

Yes, the tone of those blogposts, the likelihood of fake benchmarks submitted on someone else's behalf and especially the deluge of new accounts supporting them makes me want to trust Databricks even less than the PoC my company ran with them last year and spending time with their terrible, terrible salespeople.

EDIT: I forgot lying about how open they are when all their interesting technologies (like the new sql engine and the good parts of delta) are proprietary.

vgt|4 years ago

I think Snowflake cultivates a very careful public image, but in private their salespeople use... how do you say... aggressive techniques. Databricks is addressing the source of market confusion head-on.

cai22r|4 years ago

[deleted]

benjaminwootton|4 years ago

I’ve been following this and it’s kind of embarrassing to watch.

I love working with Databricks and Snowflake. They both knock it out of the park for their respective use case. They’re amazing products.

It makes no sense to fall out about this though.

For a 100TB dataset with a funky calculation, Spark will trounce Snowflake. For a 1 row dataset, Snowflake will return before the spark job has been serialised.

imslowbutnice|4 years ago

What are you talking about? Spark isn't even used, and TPC-DS is not a funky calculation at all. It's supposed to be a collection of typical data-warehouse-type queries. I'm not really sure what funky means, but why would Spark trounce Snowflake on a "funky" calculation at all? Do you mean an ML algorithm, and are you implying that TPC-DS has anything close to an ML algorithm? And why would Snowflake perform better on returning one row? They use columnar storage.

nojvek|4 years ago

Why would Spark trounce Snowflake? What makes it inherently so much faster at 100TB jobs?

Also what kind of queries are we talking about?

__MatrixMan__|4 years ago

Instead of blog posts written by experts in app A based on their experience with app B, I wish there were a platform for this kind of comparison.

Some objective third party sets the goal and then each company submits automation (selenium?) that configures their own app to achieve the goal. Entrants are scored by:

- time

- storage

- compute

- config complexity

No need to waste time making your opponent look bad, just focus on making yourself look good, and do it on a level playing field.

rxin|4 years ago

Isn’t that what the official TPC does?

falaki|4 years ago

That is exactly the role of tpc.org.

renewiltord|4 years ago

If you want some information like this quick, you're gonna have to pay to run it.

michaelhartm|4 years ago

Data Wars: Snowflake vs Databricks (0 - 2)?

drawturkey|4 years ago

Snowflake has way more revenue, is worth 3 times more than Databricks and is growing faster. I'd say Snowflake is still in the lead. Plus, just look at Snowflake's customer list. It's a "who's who", Databricks is a "Who's that?".

naattee|4 years ago

Snowflake should just pony up and do a TPC-DS audited benchmark.

maslam|4 years ago

Everyone wins when data platforms submit audited benchmarks...

boringg|4 years ago

And how soon is the S-1 for Databricks dropping?

Normal_gaussian|4 years ago

So, alternatives?

Aside from the Azure/GCP/AWS internal offerings, I know about Snowflake and Firebolt; Databricks is new to me.

glogla|4 years ago

Redshift is pretty terrible, stay away. AWS is even worse at delivering promises than Databricks and that's saying something.

I heard Google BigQuery is good. It is completely SaaS (like AWS Athena that works).

Unicorns often run their own stack and you could replicate that, if you have the appetite. Netflix and Apple run Trino + Spark on k8s + Iceberg. Uber used their own Hudi thing, not sure if they still do.

ethbr0|4 years ago

https://en.m.wikipedia.org/wiki/Databricks

"Databricks is an enterprise software company founded by the creators of Apache Spark. [...] Databricks develops a web-based platform for working with Spark, that provides automated cluster management and IPython-style notebooks."

tyingq|4 years ago

Oracle and Teradata still have data warehouse pitches ;)

kofejnik|4 years ago

maybe clickhouse?

funstuff007|4 years ago

I guess if anyone suggests "sampling" the data in a meeting these days, they get their head blown off.

xiaodai|4 years ago

Spark compares itself to Hadoop only on the front page. I wonder how Spark compares to Firebolt.

uvdn7|4 years ago

Now I see that getting rid of the DeWitt clause is indeed great. Kudos to both companies.

1cvmask|4 years ago

This reminds me of the old performance ads of Oracle where they would show you how everything ran better on Oracle. They used to put those ads at airports, business lounges and the back cover of newspapers and magazines read by non-technical executives like the FT and Economist.

Everyone technical knew they would game every environment to come out with superior results. I suppose it worked, as the top executives buy big system software and ignore the IT crowd who could easily point out the flaws in the methodology of the "studies".

Breakdown of one of those example ads:

https://db2news.wordpress.com/2011/06/08/a-closer-examinatio...

initplus|4 years ago

A key part of the Oracle strategy is making it a breach of license to publish any benchmarking data. No performance data about Oracle's database is allowed to be published without their approval, which means no negative results are published.

supercanuck|4 years ago

Similar to how SAP is still showing growth even though their core product (ERP Financials) hasn't changed much.

rdxm|4 years ago

[deleted]

falaki|4 years ago

tl;dr: The data warehouse company used a pre-baked TPC-DS dataset and claimed they have similar performance to Databricks. Turns out if you use the official TPC-DS data generation scripts, you get much worse performance.

slownews45|4 years ago

Even worse, they claimed to have similar performance to Databricks AND claimed databricks "lacked integrity". WOW, talk about chutzpah!

tyingq|4 years ago

I read the original post, the Snowflake response, and this. From that I gather that both of them aren't being completely honest or fair when making comparisons. A fair amount of truth, but also some clever wording and omission on both their parts. Which is not surprising or particularly new in this space :)

arnon|4 years ago

That's altering the methods - and generally considered a violation of the validity of the results.

dreyfan|4 years ago

Databricks is rapidly approaching an IPO, trying to justify their valuation with their overpriced in-memory Hadoop.

kartoonhero|4 years ago

Databricks is way more than Hadoop or Spark. A great analogy: Spark is a great engine, but you need to design and build all of the other subsystems.

Databricks is an F1 car - everything is built out. You get in and drive - FAST.

hello_moto|4 years ago

Serious question: Databricks, Snowflake, Dremio. All these "Data" platform companies => which one do you have for your Data Lake and Data Warehouse solution?

I'm sick and tired of these companies Snake Oiling the Data industry by offering "the easiest" platform to satisfy your Data Lake + Warehouse solution only to fall hard whenever you hook it up with your production data (big dataset).

PS: Anyone selling Data Lakehouse (Data Lake + Warehouse as one platform) is on meth.

kartoonhero|4 years ago

Please read up on Lakehouse.

Data Lake + Merge support + DW performance is now possible.

That is the game changer.