top | item 27032848

(no title)

timita | 4 years ago

> Am I supposed to use... excel to do that?

This is gratuitous. You have a clear bias, granted, because it seems your domain is so specific, only a procedural language will do. But it seems you are unfamiliar with modern SQL tools. Some of the obvious ones that come to mind: Metabase[0] for visualisation or Apache MADlib[1] for in-database statistics and machine learning.

[0] https://github.com/metabase/metabase [1] http://madlib.apache.org/

discuss

order

jplr8922|4 years ago

I humbly disagree. Its not that ''my domain is so specific'', but that data analysis in itself is a discovery process which is not straightforward. If you want to explore the jungle, you need a machete.

As for Metabase of MADlib, you are right ; I was not aware of these new tools. They look great, and I'm certain they can help a lot of people. However you assume that they are available! Not all IT departments are open to the idea of buying new software, and if the suggestion comes from an outsider it will be perceived as an insult (been there, done that many many time). And when they refuse, now what? You go back to the usual procedural languages (R, Python, Julia, etc) which are free and don't require the perpetual oversight of some DBA who thinks that all you need is an AVG(X) and GROUP BY since kurtosis is domain specific anyway.

I've meet some Excel-VBA users who couldn't care less about pro devs or decorators since ''they can already do everything by themselves''. Same thing with the SQL only, Python only, Tableau only or wathever-only crowd.

dr_kiszonka|4 years ago

I usually use Python and R for analysis. However, when dealing with larger datasets, e.g., 0.5 - 2 PB, I have to rely on SQL/BigQuery because I can't get Python and R to deal such workloads in reasonable time. I tried Dask, but I couldn't resolve a few bugs it had at the time.

If you were to find outliers in a 1 PB table, what tools would you use?