buttaphingas's comments

buttaphingas | 3 years ago | on: Why is Snowflake so expensive

And to get the very best price for those clusters you'd need to commit to the CSP for three years!

Would love to know the TCO trade-off between procuring, securing and deploying on your own clusters vs having them managed via SaaS.

buttaphingas | 3 years ago | on: Why is Snowflake so expensive

It's all around the ethos of ease of use. Snowflake does a lot of smarts in the background so that you don't have the overhead of managing indexes. And not just indexes, there is just less human intervention required overall compared to something like Teradata or even a modern lakehouse.

That said, they've kind of introduced indexing with the Search Optimization Service, which is like an index across the whole table for fast lookups, but even that is automatically maintained on your behalf.
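For illustration, enabling it is a single statement (table and column names here are hypothetical); Snowflake then builds and maintains the search access path in the background:

```sql
-- Enable search optimization on a table; from then on Snowflake
-- maintains the underlying search access path automatically.
ALTER TABLE orders ADD SEARCH OPTIMIZATION;

-- Selective point lookups like this can then avoid full scans:
SELECT * FROM orders WHERE order_id = 'ORD-12345';
```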

buttaphingas | 4 years ago | on: Databricks response to Snowflake's accusation of lacking integrity

They're still lacking things in the SQL space. For example, Databricks say they're ACID compliant, but it's only on a single-table basis. Snowflake offers multi-table ACID consistency, which is something that you would expect by default in the data warehousing world. If I'm loading, say, 10 tables in parallel, I want to be able to roll back or commit the complete set of transactions in order to maintain data consistency. I'm sure you could work around this limitation, but it would feel like a hack, especially if you're coming from a traditional DWH world (Teradata, Netezza etc.).
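To make the multi-table case concrete, here's a minimal sketch (table names hypothetical) of the all-or-nothing load I mean — standard Snowflake transaction syntax:

```sql
-- Load related tables atomically: either both inserts become
-- visible together, or neither does.
BEGIN;
INSERT INTO dim_customer SELECT * FROM staging_customer;
INSERT INTO fact_orders  SELECT * FROM staging_orders;
COMMIT;
-- On any failure mid-way, ROLLBACK; undoes the whole set,
-- so readers never see a half-loaded star schema.
```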

Snowflake now offers Scala, Java and Python support, so it would seem their capabilities are converging even more, but both with their own strengths due to their respective histories.

buttaphingas | 4 years ago | on: Databricks response to Snowflake's accusation of lacking integrity

But you do still have to secure the S3 buckets, right? And I guess also secure the infrastructure you have to deploy in order to run Databricks. Plus then configure for cross-AZ failover etc. So you get flexibility, but I would think at the cost of much more human labor to get it up and running.

Snowflake uses the Arrow data format with their drivers, so is plenty fast enough when retrieving data in general. But it would be way less efficient if a data scientist just does a SELECT * to bring everything back from a table to load into a notebook.

Snowflake has had Scala support since earlier in the year, along with Java UDFs, and also just announced Python support - not a Python connector, but executing Python code directly on the Snowflake platform. Not GA yet though.
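As an aside, the Java UDF support means you can push code to the data rather than pull data out. A hedged sketch of what that looks like, to the best of my recollection of the syntax (function and class names are made up):

```sql
-- Define an in-database Java UDF; Snowflake compiles and runs it
-- on its own compute, no separate cluster to manage.
CREATE OR REPLACE FUNCTION echo_varchar(x VARCHAR)
RETURNS VARCHAR
LANGUAGE JAVA
HANDLER = 'MyClass.echoVarchar'
AS
$$
class MyClass {
    public static String echoVarchar(String x) {
        return x;
    }
}
$$;

SELECT echo_varchar('hello');
```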

buttaphingas | 4 years ago | on: Databricks response to Snowflake's accusation of lacking integrity

I actually see them as variations on the same architecture. Databricks keeps their metadata in files, Snowflake keeps theirs in a database, but they both, ultimately, are querying data stored in a columnar format on blob store (and, to be fair, Snowflake have been doing that with ACID-compliant SQL for a lot longer than Databricks). So using SQL over blob at high performance has been around for a while.

Databricks say their solution is better because it's open (though they keep the optimizations you need to run this at scale to themselves, i.e. it is ultimately proprietary). Snowflake says theirs is better because it's a fully managed service, meaning no infrastructure to procure or manage, and it's fully HA across multiple data centers by default etc.

Databricks push 'open' but really still want you to use their proprietary tech for first transforming into something usable (Parquet/Delta) and then querying with Photon/SQL, though you can also use other tech. With Snowflake you can just ingest and query, but it has to be through their engine.

Customers should do their own validation and see which one fits their needs best.

buttaphingas | 4 years ago | on: Databricks response to Snowflake's accusation of lacking integrity

Delta is open source, but Databricks keeps optimizations for themselves as proprietary. I'm not sure why it would be any better than Snowflake's solution, which is automatically deployed across multiple AZs as a fully HA system and gives full ACID transaction compliance across any number of tables (not just per-table).

buttaphingas | 4 years ago | on: Snowflake’s response to Databricks’ TPC-DS post

The "failover and failback for business continuity" is specifically for cross-region/cloud, i.e. this is something you explicitly have to do. tbh I've never used it, as I guess this would be only for very large accounts. But all editions have automatic failover between AZs out-of-the-box.

[Edit] Highly Available would be a better description per region, as that's out of the box with no configuration. e.g. if a node dies, your cluster will automatically heal and resubmit your query. If there's an entire AZ outage, your query should be resubmitted in another AZ. I think this is why failover/back is called out separately, as that's not automatic, incurs additional costs etc. Here's a link with an explanation: www.snowflake.com/blog/how-to-make-data-protection-and-high-availability-for-analytics-fast-and-easy

I didn't know DB did MVs, masking etc., so yes, that makes sense. Maybe a better idea would be to have a minimum offering comparison, and then a maximum offering comparison (with multi-AZ failover, masking feature costs etc. included) - the reality for a customer would be somewhere between those extremes.

buttaphingas | 4 years ago | on: Snowflake’s response to Databricks’ TPC-DS post

I've used Snowflake for the past few years, and it's worth pointing out that when it comes to overall cost, there's a lot you get with Snowflake for free. For example, they have HA across 3 AZs out of the box, included in the price and with no configuration required.

If I'm reading what Databricks published correctly, it seems that they've only used 1 driver node for this benchmark, in other words it's a dev setup. If they want to compare apples-to-apples then they should configure, and price, a multi-AZ HA set-up.

I'm not sure if this is still applicable to Photon, however - can anyone confirm?

buttaphingas | 4 years ago | on: Snowflake’s response to Databricks’ TPC-DS post

This is incorrect. Every edition of Snowflake is deployed across multiple availability zones with automatic failover in the case of failure or AZ outage. This is included in the price and requires no configuration by the customer. Cross-cloud/region failover requires the top edition and a few lines of SQL to configure (plus cloud egress costs for data replication).
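To give a sense of scale for "a few lines of SQL": my understanding of the failover-group setup is roughly the below (account, database and group names are all hypothetical — check Snowflake's replication docs for the exact form):

```sql
-- On the primary account: replicate a database to a second
-- account in another region/cloud on a schedule.
CREATE FAILOVER GROUP my_fg
  OBJECT_TYPES = DATABASES
  ALLOWED_DATABASES = sales_db
  ALLOWED_ACCOUNTS = myorg.dr_account
  REPLICATION_SCHEDULE = '10 MINUTE';

-- During an outage, run on the secondary account to promote it:
-- ALTER FAILOVER GROUP my_fg PRIMARY;
```

The egress costs mentioned above come from that scheduled cross-region replication, which is why this is opt-in rather than on by default.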

The higher editions of Snowflake include features like materialised views, dynamic data masking, BYOK, PCI & HIPAA compliance etc., none of which are required for the benchmark.
