robertkoss | 5 months ago
Especially when considering testability and composability, using a DataFrame API inside regular languages like Python is far superior IMO.
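To make the testability and composability point concrete, here is a minimal sketch, assuming pandas as the DataFrame library (the thread does not name one). The column names, the function names, and the sample data are all made up for illustration. Each step is a plain function, so it can be chained with `pipe` and checked against a tiny hand-built frame.

```python
import pandas as pd


def only_completed(orders: pd.DataFrame) -> pd.DataFrame:
    """Keep only the rows whose status is 'completed'."""
    return orders[orders["status"] == "completed"]


def revenue_per_customer(orders: pd.DataFrame) -> pd.DataFrame:
    """Sum the amount column per customer and call the result 'revenue'."""
    return (
        orders.groupby("customer", as_index=False)["amount"]
        .sum()
        .rename(columns={"amount": "revenue"})
    )


def completed_revenue(orders: pd.DataFrame) -> pd.DataFrame:
    """Compose the two steps above; each one stays testable on its own."""
    return orders.pipe(only_completed).pipe(revenue_per_customer)


# Hypothetical test data: just objects, no database or cluster needed.
sample = pd.DataFrame(
    {
        "customer": ["a", "a", "b", "b"],
        "status": ["completed", "cancelled", "completed", "completed"],
        "amount": [10.0, 99.0, 5.0, 7.0],
    }
)

result = completed_revenue(sample)
```

The same functions can be reused in other pipelines or dropped into a pytest file as-is, which is the part that is awkward to do with one large SQL string.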
gigatexal|5 months ago
Why is the dataframe approach getting hate when you’re talking about runtime details?
I get that folks understand the almost conversational aspect of SQL better than that of the DataFrame API, but the other points make no difference.
If you’re a competent dev/data person and are productive with the DataFrame API, then yay. Also, setting up and creating test data and such is all objects and functions after all; if anything it’s better than the horribad experience of ORMs.
drej|5 months ago
Sure, Python code is more testable and composable (and I do love that). Have I seen _any_ analysts write tests or compose their queries? I'm not saying these people don't exist, but I have yet to bump into any.
robertkoss|5 months ago
If you use Athena you still have to worry about shuffling and joining; it is just hidden. It is Trino / Presto under the hood, and if you click explain you can see the execution plan, which is essentially the same as looking into the Spark UI.
Who cares about JVM versions nowadays? No one is hosting Spark themselves.
Literally every tool now supports DataFrame AND SQL APIs, and to me there is no reason to pick up SQL if you are familiar with a little bit of Python.
ritchie46|5 months ago
Cluster configuration is optional if you want this control. Anyhow, this doesn't have much to do with the query API, be it SQL or DataFrame.
ayhanfuat|5 months ago
riku_iki|5 months ago
I think this part (query optimization) is in general not solved/solvable, and it is sometimes/often (depending on the domain) necessary to dig into the details to make a data transformation work.
mr_toad|5 months ago
That said the last Python code I wrote as a data engineer was to run tests on an SQL database, because the equivalent in SQL would have been tens of thousands of lines of wallpaper code.
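As a rough illustration of that kind of test harness (not the commenter's actual code), here is a sketch that assumes an in-memory SQLite database and a made-up `NOT_NULL` rule table. A short Python loop generates one check query per table and column, instead of each check being written out by hand in SQL.

```python
import sqlite3

# Hypothetical schema and data, only here to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'ann'), (2, NULL);
    INSERT INTO orders VALUES (10, 1), (11, 2);
    """
)

# Hypothetical rule set: columns that must never contain NULL.
NOT_NULL = {
    "customers": ["id", "name"],
    "orders": ["id", "customer_id"],
}


def null_violations(connection, rules):
    """Return (table, column, null_count) for every rule that fails."""
    failures = []
    for table, columns in rules.items():
        for column in columns:
            # Identifiers come from the trusted dict above, never from user input.
            query = f'SELECT COUNT(*) FROM "{table}" WHERE "{column}" IS NULL'
            count = connection.execute(query).fetchone()[0]
            if count:
                failures.append((table, column, count))
    return failures


failures = null_violations(conn, NOT_NULL)
```

With hundreds of tables the rule dict grows but the loop stays the same, which is where the savings over hand-written SQL come from.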
gigatexal|5 months ago