What I find hilarious is that companies argue who can query 100 TB faster and try to sell this to people. I've been on the receiving end of offers by both of the companies in question and used both platforms (and sadly migrated some data jobs to them).
While they can crunch large datasets, they are laughably slow for the datasets most people have. So while I did propose we use these solutions for our big-ish data projects, management kept pushing for us to migrate our tiny datasets (tens of gigabytes or smaller) and the perf expectedly tanked compared to our other solutions (Postgres, Redshift, pandas etc.), never mind the immense costs to migrate everything and train everyone up.
Yes, these are very good products. But PLEASE, for the love of god, don't migrate to them unless you know you need them (and by 'need' I don't mean pimping your resume).
I did work on making a database myself, and I must say that querying 100TB fast, let alone storing 100TB of data, is a real problem. Some companies (very few) don't have much choice but to use a DB that works on 100TB. If you do have small data, then you have a lot of options. But if your data is large, then you have very few options. So it is correct to be competing on how fast a DB can query 100TB of data; while at the same time being slow if you have just 10GB of data. Some databases are designed only for large data, and should not be used if your data is small.
Very true. You have to understand the actual capabilities and your actual requirements. We work with petabyte size datasets and BigQuery is hard to beat. Our other reporting systems are still all in MySQL though.
In my experience, if it's just tens of GBs, then use 'normal' solutions. If it's TBs, then Spark is great for that. Note that I have only used Databricks and Spark, not Snowflake.
The irony here is that what Databricks is doing to Snowflake is exactly what Snowflake did to AWS and Redshift.
Same playbook - show that you’re better in a key metric that’s easy to understand (performance) to get the attention, but then pitch the paradigm change.
In Snowflake’s case, that was separation of storage and compute.
In Databricks’ case, it’s the Lakehouse Architecture.
I think the reason why Snowflake is so nervous is that they know they can’t win this game.
To be fair Apache Spark, which started long before either company existed, was built on the assumption that compute and storage should be separate. Unlike Hadoop, Spark did not come with any storage system and could read from any source.
SF spreads a lot of FUD saying that DB can’t perform, and it was true. DB then went out and hired a lot of engineering talent with a diverse background and has been investing a lot of money in being a best-in-class SQL offering, so what do you do? You do something to get people’s attention. They’re saying, “hey, we have great performance too, you should also look at us for your SQL workloads.”
> I think the reason why Snowflake is so nervous is that they know they can’t win this game.
Isn't Databricks' delta.io, which their Data Lakehouse product builds on top of, open source? Snowflake could take the best parts from it and run with them?
In what way is lakehouse architecture beneficial over something like Snowflake or BigQuery?
I understand the appeal over having lake and warehouse as separate components, but with those native cloud warehouses, you can already do everything a lake does.
I've used both products in production. Both are good++.
The blog wars seem extremely ridiculous to me. I don't recall ever choosing one over another based on how fast it runs on some imaginary arbitrary dataset.
Manufactured rivalries can be a great thing for business. We have been debating Coke vs Pepsi, Nike vs Reebok, McDonald's vs Burger King for decades now while these companies laugh all the way to the bank.
Snowflake accuses other companies of lacking integrity?
I really wish I could block all of Snowflake's domain from my inbox. Sadly, Google encourages spammers to just create a new email address. So I get a few emails each month from Snowflake who ask me to try their products. I've never done business with them and there's no unsubscribe link.
Fuck Snowflake for thinking it has any room to talk about integrity.
What I find comical is they accuse Databricks of lacking integrity but they don’t actually call out anything except that their benchmark was faster than what Databricks did in Snowflake. Databricks then reruns the benchmark and says the only reason that Snowflake’s was faster was because of the built-in dataset they used. Databricks was able to match Snowflake’s numbers using it, but when they loaded the actual dataset, which is how a proper TPC benchmark is supposed to be run, it was much slower. They then said that Databricks’ blog doesn’t match the TPC results, but when I looked at them, they do match. I guess Snowflake just expects people to take arguments at face value. Then I saw someone on LinkedIn complaining that Databricks must have used some beta version. I didn’t see a beta version being used, but that kind of goes out the window when Databricks follows up and then posts that they matched Snowflake when they used their built-in TPC dataset.
This is funny and interesting to watch but also a distraction I feel. Amazon says it best when they say, “Leaders start with the customer and work backwards. They work vigorously to earn and keep customer trust. Although leaders pay attention to competitors, they obsess over customers.”
Snowflake must be kicking themselves hard now for letting a story that was “Databricks is a viable alternative” turn into “Snowflake has absolutely no integrity and will fling mud even while they are gaming the statistics”
Really can’t see what they can do now short of “bending” to Databricks and entering the competition. And naturally it’s no longer enough that they show comparable performance. They have to hit their gamed stats somehow, otherwise any news, even if they beat Databricks, will be reported as “see, we told you they were cheating.”
Before the Snowflake blog post, I did not know what Snowflake or Databricks were. I can only imagine that this rivalry is great for both of them, even if Databricks is somewhat on the advantage end, at least from a tactical standpoint; I admit though that they seem to be a bit unnecessarily defensive considering the position they're in with the exchange.
In general though, I'm still not complaining. It's interesting to see a dispute like this unfold.
I would say that TPC-DS and TPC-H are really table stakes benchmarks for data warehouses at this point in time (maybe they weren't 10 years ago). How to build a database that does well on them is well documented in the literature now[1][2][3][4] (maybe a few other papers). It's not easy to build such a database, but it's "just" hard work and many companies have the $$ necessary to do that work. There isn't any magic or technical moat in the results for Databricks (or Snowflake, or Redshift, etc.).
I think Databricks is overly enthusiastic about their results as they have been trying to be competitive with cloud DWs on these benchmarks for a number of years now. They have finally caught up (by building Delta Lake and their Photon query engine, which implement a number of standard DW features).
I agree with everything above. The main advantage the newer data warehouses have over the legacy on-prem incumbents is that they had the chance to build from scratch having learned from all of the challenges that the original players encountered.
The public pissing contest is entertaining while also being silly and slightly cringe, but I think it's a nice story for Databricks nonetheless. They now have a performant SQL-based analytics engine that can credibly compete with the best DWs in the market today, and it's just one part of their overall platform.
The sense I get is that Snowflake wants the conversation to be "no matter what you do, you need a data warehouse, and we're the best in the business at that." Databricks' Lakehouse approach is a fundamental challenge to that, and if they're getting this kind of performance from their analytics engine against the market-leading data warehouses today, that's a big momentum shift in their favour.
As much as I love seeing competition in the space and am enjoying my popcorn, I really don't understand what Databricks is doing here: this feels like a childish foodfight rather than an obsession with the customer...
:) That is a good question. Why spend eng cycles to submit results to the TPC council - why not just focus on customers?
I believe the co-founders have addressed this in the blog.
> Our goal was to dispel the myth that Data Lakehouse cannot have best-in-class price and performance. Rather than making our own benchmarks, we sought the truth and participated in the official TPC benchmark.
I'm sure anybody seriously looking at evaluating data platforms would want to look at things holistically. There are different dimensions like open ecosystem, support for machine learning, performance etc. And different teams evaluating these platforms would stack rank them in different orders.
These blogs, I believe, show that Databricks is a viable choice for customers when performance is a top priority (along with other dimensions). That IMO is customer obsession.
Yes, the tone of those blogposts, the likelihood of fake benchmarks submitted on someone else's behalf and especially the deluge of new accounts supporting them makes me want to trust Databricks even less than the PoC my company ran with them last year and spending time with their terrible, terrible salespeople.
EDIT: I forgot to mention lying about how open they are when all their interesting technologies (like the new SQL engine and the good parts of Delta) are proprietary.
I think Snowflake cultivates a very careful public image, but in private their salespeople use... how do you say... aggressive techniques. Databricks is addressing the source of market confusion head-on.
I’ve been following this and it’s kind of embarrassing to watch.
I love working with Databricks and Snowflake. They both knock it out of the park for their respective use case. They’re amazing products.
It makes no sense to fall out about this though.
For a 100TB dataset with a funky calculation, Spark will trounce Snowflake. For a 1 row dataset, Snowflake will return before the Spark job has been serialised.
What are you talking about? Spark isn't even used, and TPC-DS is not a funky calculation at all. It's supposed to be a collection of typical data warehouse queries. I'm not really sure what "funky" means, but why would Spark trounce Snowflake on a "funky" calculation at all? Do you mean an ML algorithm, and are you implying that TPC-DS has anything close to an ML algorithm? And why would Snowflake perform better at returning one row? It is a columnar store.
Instead of blog posts written by experts in app A based on their experience with app B, I wish there were a platform for this kind of comparison.
Some objective third party sets the goal and then each company submits automation (selenium?) that configures their own app to achieve the goal. Entrants are scored by:
- time
- storage
- compute
- config complexity
No need to waste time making your opponent look bad, just focus on making yourself look good, and do it on a level playing field.
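A rough sketch of how such a scoreboard might rank entrants, for what it's worth. Everything here is an assumption for illustration: the `Entry` fields and units, the idea of counting config steps as a proxy for config complexity, and the choice to rank by summing per-metric ranks rather than weighting raw numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One vendor's submission. All fields are hypothetical measurements."""
    name: str
    time_s: float         # wall-clock time to reach the goal
    storage_gb: float     # storage consumed
    compute_hours: float  # compute consumed
    config_steps: int     # crude proxy for config complexity


# Lower is better for every metric in this sketch.
METRICS = ("time_s", "storage_gb", "compute_hours", "config_steps")


def rank_entries(entries):
    """Return entrant names ordered best to worst.

    For each metric, an entrant's rank is the number of entrants that did
    strictly better on it, so ties share a rank. Ranks are summed across
    metrics; ties on the total are broken by name.
    """
    totals = {}
    for entry in entries:
        total = 0
        for metric in METRICS:
            own = getattr(entry, metric)
            total += sum(1 for other in entries if getattr(other, metric) < own)
        totals[entry.name] = total
    return sorted(totals, key=lambda name: (totals[name], name))
```

Summing ranks keeps one metric with a huge numeric range (say, seconds) from drowning out the others, but it also throws away how big the gaps are, so a real platform would probably want to publish the raw numbers alongside the ranking.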
Snowflake has way more revenue, is worth 3 times more than Databricks and is growing faster. I'd say Snowflake is still in the lead. Plus, just look at Snowflake's customer list. It's a "who's who", Databricks is a "Who's that?".
Redshift is pretty terrible, stay away. AWS is even worse at delivering promises than Databricks and that's saying something.
I heard Google BigQuery is good. It is completely SaaS (like AWS Athena that works).
Unicorns often run their own stack and you could replicate that, if you have the appetite. Netflix and Apple run Trino + Spark on k8s + Iceberg. Uber used their own Hudi thing, not sure if they still do.
"Databricks is an enterprise software company founded by the creators of Apache Spark. [...] Databricks develops a web-based platform for working with Spark, that provides automated cluster management and IPython-style notebooks."
This reminds me of the old performance ads of Oracle where they would show you how everything ran better on Oracle. They used to put those ads at airports, business lounges and the back cover of newspapers and magazines read by non-technical executives like the FT and Economist.
Everyone technical knew they would game every environment to come out with superior results. I suppose it worked, as the top executives buy big system software and ignore the IT crowd who could easily point out the flaws in the methodology of the "studies".
A key part of the Oracle strategy is making it a breach of license to publish any benchmarking data. No performance data about Oracle's database is allowed to be published without their approval, which means no negative results are published.
tl;dr: The data warehouse company used a pre-baked TPC-DS dataset and claimed they have similar performance to Databricks. Turns out if you use the official TPC-DS data generation scripts, you get much worse performance.
I read the original post, the Snowflake response, and this. From that I gather that both of them aren't being completely honest or fair when making comparisons. A fair amount of truth, but also some clever wording and omission on both their parts. Which is not surprising or particularly new in this space :)
Serious question: Databricks, Snowflake, Dremio. All these "Data" platform companies => which one do you have for your Data Lake and Data Warehouse solution?
I'm sick and tired of these companies Snake Oiling the Data industry by offering "the easiest" platform to satisfy your Data Lake + Warehouse solution only to fall hard whenever you hook it up with your production data (big dataset).
PS: Anyone selling Data Lakehouse (Data Lake + Warehouse as one platform) is on meth.
gnabgib|4 years ago
drej|4 years ago
sanketsarang|4 years ago
tshanmu|4 years ago
StephenJGL|4 years ago
autokad|4 years ago
scapecast|4 years ago
falaki|4 years ago
doppelganger1|4 years ago
ignoramous|4 years ago
glogla|4 years ago
avip|4 years ago
paxys|4 years ago
kartoonhero|4 years ago
One of the biggest FUDs for a data lake architecture is performance - and this benchmark should put that concern to rest.
CactusOnFire|4 years ago
Both Databricks and Snowflake have inflated marketing budgets, and marketing feels they have to "beat" the other one or they'll lose the market.
inetknght|4 years ago
doppelganger1|4 years ago
boublepop|4 years ago
bloodyplonker22|4 years ago
djbusby|4 years ago
aliswe|4 years ago
jchw|4 years ago
qaq|4 years ago
AdamProut|4 years ago
thrtlvlmidnight|4 years ago
redwood|4 years ago
saj1th|4 years ago
kf6nux|4 years ago
* I haven't executed the test suite, but fraud seems likely.
jjoonathan|4 years ago
Both participants in a fight can win by implicitly excluding their real competitors.
glogla|4 years ago
vgt|4 years ago
s_barrow1|4 years ago
[deleted]
cai22r|4 years ago
[deleted]
benjaminwootton|4 years ago
imslowbutnice|4 years ago
nojvek|4 years ago
Also what kind of queries are we talking about?
__MatrixMan__|4 years ago
rxin|4 years ago
falaki|4 years ago
renewiltord|4 years ago
michaelhartm|4 years ago
drawturkey|4 years ago
naattee|4 years ago
maslam|4 years ago
boringg|4 years ago
Normal_gaussian|4 years ago
Aside from the Azure/GCP/AWS internal offerings, I know about Snowflake and Firebolt; Databricks is new to me.
glogla|4 years ago
ethbr0|4 years ago
"Databricks is an enterprise software company founded by the creators of Apache Spark. [...] Databricks develops a web-based platform for working with Spark, that provides automated cluster management and IPython-style notebooks."
tyingq|4 years ago
kofejnik|4 years ago
funstuff007|4 years ago
xiaodai|4 years ago
uvdn7|4 years ago
1cvmask|4 years ago
Breakdown of one of those example ads:
https://db2news.wordpress.com/2011/06/08/a-closer-examinatio...
initplus|4 years ago
belter|4 years ago
The "Unbreakable" Marketing Campaign:
https://www.oreilly.com/library/view/the-oracle-hackers/9780...
https://www.zdnet.com/article/invincible-oracle-not-so-secur...
supercanuck|4 years ago
dautkhanov|4 years ago
[deleted]
rdxm|4 years ago
[deleted]
falaki|4 years ago
slownews45|4 years ago
tyingq|4 years ago
arnon|4 years ago
imslowbutnice|4 years ago
[deleted]
xiaodai|4 years ago
dreyfan|4 years ago
kartoonhero|4 years ago
Databricks is an F1 car - everything is built out. You get in and drive - FAST.
hello_moto|4 years ago
kartoonhero|4 years ago
Data Lake + Merge support + DW performance is now possible.
That is the game changer.