(no title)
somurzakov | 5 years ago
Distributed ML is tough to train because you get very little control over the training loop. I personally prefer single-server training even on large datasets, or switching to online learning algorithms that train, run inference, and retrain at the same time.
As for Snowflake, I haven't heard of people using it to train ML models, but Snowflake is a killer managed distributed DWH that you don't have to tinker with and tune.
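To make the online-learning idea concrete, here is a minimal sketch of a predict-then-update loop on a single machine. Everything in it is an assumption for illustration: the `OnlineLogisticRegression` class, the synthetic stream, and the learning rate are made up, and a real setup would more likely use an existing library.

```python
import math
import random


class OnlineLogisticRegression:
    """Hypothetical logistic regression updated one example at a time (SGD)."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        # Gradient of the log loss for a single example.
        err = self.predict_proba(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


def synthetic_stream(n, seed=0):
    """Made-up data stream: label is 1 when the two features sum to > 0."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, (1 if x[0] + x[1] > 0 else 0)


def run_prequential(n=5000):
    """Score each example before training on it, then update the model."""
    model = OnlineLogisticRegression(n_features=2)
    correct = 0
    for x, y in synthetic_stream(n):
        pred = 1 if model.predict_proba(x) >= 0.5 else 0  # inference first
        correct += int(pred == y)
        model.learn_one(x, y)  # then retrain on the same example
    return correct / n


if __name__ == "__main__":
    print(f"prequential accuracy: {run_prequential():.3f}")
```

The point of the sketch is that the whole train/inference/retrain cycle sits in one plain loop, which is the kind of control that is hard to keep once training is spread across a cluster.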
mrslave|5 years ago
How do Snowflake (and Redshift, mentioned above) compare with CitusDB? I really like the PostgreSQL experience offered by Citus. I've been bitten by too many commercial databases where the sales brochure promises the product does X, Y, and Z, only to discover later that you can't do any of them together because reasons.
disgruntledphd2|5 years ago
But Spark is super cool and actually has algorithms which complete in a reasonable time frame on hardware I can get access to.
Like, I understand that the SQL portion is pretty commoditised (though even there, the SparkSQL Python and R APIs are super nice), but I'm not aware of any other frameworks for doing distributed training of ML models.
Have all the hipsters moved to GPUs or something? /s
> Snowflake is a killer managed distributed DWH that you don't have to tinker with and tune
It's so very expensive though, and their pricing model is frustrating (why the hell do I need tickets?).
That being said, tuning Spark/Presto or any of the non-managed alternatives is no fun either, so I wonder if it's the right tradeoff.
One thing I really, really like about Spark is the ability to write Python/R/Scala code to solve the problems that cannot be usefully expressed in SQL.
All the replies to my original comment seem to forget that, or maybe Snowflake has such functionality and I'm unaware of it.
marcinzm|5 years ago
TensorFlow, PyTorch (not sure if Ray is needed) and MXNet all support distributed training across CPUs/GPUs in a single machine or multiple machines. So does XGBoost if you don't want deep learning. You can then run them with Kubeflow or on whatever platform your SaaS provider has (GCP AI Platform, AWS SageMaker, etc.).
edit:
>All the replies to my original comment seem to forget that, or maybe Snowflake has such functionality and I'm unaware of it.
Snowflake has support for custom JavaScript UDFs and a lot of built-in features (you can do absurd things with window functions). I also found it much faster than Spark.
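For anyone who hasn't used window functions, here is a rough sketch of a per-customer running total. It runs against SQLite through Python's built-in `sqlite3` module purely as a stand-in (it assumes a bundled SQLite of 3.25 or newer); the `orders` table and its columns are made up, and this is not Snowflake-specific syntax, although the `OVER (PARTITION BY ... ORDER BY ...)` clause is standard SQL.

```python
import sqlite3


def running_totals(rows):
    """Return (customer, day, amount, running_total) rows, ordered by customer and day."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical table used only for this illustration.
        conn.execute("CREATE TABLE orders (customer TEXT, day INTEGER, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
        query = """
            SELECT customer, day, amount,
                   SUM(amount) OVER (
                       PARTITION BY customer ORDER BY day
                       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                   ) AS running_total
            FROM orders
            ORDER BY customer, day
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("a", 1, 10.0), ("a", 2, 5.0), ("b", 1, 7.0), ("a", 3, 2.5)]
    for row in running_totals(sample):
        print(row)
```

The running total is computed without a self-join or any procedural code, which is the sort of thing that otherwise pushes people out of SQL and into Python or Scala.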