(no title)
jochem9 | 6 months ago
Python is dynamically typed, which you can patch a bit with type hints, but it's still easy to ship incompatible types to production, leading to runtime type errors in prod. Its interpreted nature also makes it very slow.
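To illustrate the point (function and variable names here are invented): type hints document intent, but Python does not enforce them at runtime, so a wrongly typed value can sail through to production unnoticed.

```python
def total_price(quantity: int, unit_price: float) -> float:
    """Hints say int and float, but Python won't check them at runtime."""
    return quantity * unit_price

# A caller passing a string is not rejected: "3" * 2 repeats the string,
# so we silently get "33" (a str) instead of a number.
result = total_price("3", 2)
print(result)  # prints 33, but it's a string
```

A static checker such as mypy would flag the bad call before deployment, which is the "patch" the comment refers to.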
SQL is pretty much impossible to unit test, yet you often end up with logic that you want to test, e.g. to optimize a query.
For SQL I don't have a solution. It's a 50-year-old language that lacks a lot of features you would expect. It's also the de facto standard for database access.
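One partial workaround is to exercise query logic against an in-memory database from the test suite. A sketch using Python's stdlib sqlite3 (the table, columns, and data are made up for illustration):

```python
import sqlite3

def top_regions(conn):
    # Query under test: total order amount per region, highest first.
    return conn.execute(
        "SELECT region, SUM(amount) AS total "
        "FROM orders GROUP BY region ORDER BY total DESC"
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 10.0), (2, "US", 5.0), (3, "EU", 2.5)],
)

assert top_regions(conn) == [("EU", 12.5), ("US", 5.0)]
```

The obvious caveat is that SQLite's dialect differs from production engines, so this only covers logic expressible in both.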
For Python I would say that we should start adopting statically typed compiled languages. Rust has Polars as a dataframe package, but the language itself isn't that easy to pick up. Go is very easy to learn, but has no serious dataframe package, so you end up doing a lot of that work yourself in goroutines. Maybe there are better options out there.
orochimaaru | 6 months ago
In general, the choice of language isn't that important: if you're using Spark, your dataframe schema defines the structure whether you're writing Python or not.
Most folks confuse pandas with "data engineering". It's not. Most data engineering is Spark.
physicles | 6 months ago
- input data should be pseudorandom, so the chance of a test being “accidentally correct” is minimized
- you need a way to verify only part of the result set. Or, at the very least, a way to write tests so that if you add a column to the result set, your test doesn’t automatically break
In addition, I added CSV exports so you can verify the results by hand, and hot-reload for queries with CTEs: if you change a .sql file, it will immediately rerun each CTE incrementally and show you which ones' output changed.
greekorich | 6 months ago
SQL is the most beautiful, expressive, get stuff done language I've used.
It is perfect for whatever data engineering is defined as.
antupis | 6 months ago