top | item 45071440


jochem9 | 6 months ago

One thing that I don't see mentioned but that does bug me: data engineers often use a lot of Python and SQL, even the ones that have heavily adopted software engineering best practices. Yet neither language is great for this.

Python is dynamically typed, which you can patch a bit with type hints, but it's still easy to go to production with incompatible types, leading to runtime errors in prod. Its interpreted nature also makes it very slow.
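To illustrate the type-hint point, here is a minimal sketch (the function `total_cents` and its inputs are made up for illustration). The annotation says `int`, but nothing checks it at runtime, so the bad value only fails when the offending line actually executes:

```python
def total_cents(amounts: list[int]) -> int:
    # The hints document intent, but CPython does not enforce them at runtime.
    total = 0
    for amount in amounts:
        total += amount
    return total


def run_example() -> str:
    # A string sneaks in, e.g. from an untyped CSV column (hypothetical input).
    try:
        total_cents([100, "250"])  # a static checker such as mypy would flag this call
    except TypeError as exc:
        return f"TypeError: {exc}"
    return "no error"


if __name__ == "__main__":
    print(run_example())
```

A static type checker run in CI catches this particular case, but only if the data entering the function is itself typed.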

SQL is pretty much impossible to unit test, yet you will often end up with logic that you want to test, e.g. when you rewrite a query to optimize it and need to know it still returns the same results.
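For what it's worth, one partial workaround is to run both versions of a query against a small in-memory fixture and compare the rows. The sketch below assumes SQLite via Python's standard `sqlite3` module; the tables, columns, and both queries are hypothetical. It only helps for SQL that SQLite can run, so dialect-specific queries still need the real engine:

```python
import sqlite3

# Hypothetical "before" and "after" versions of the same query.
ORIGINAL_QUERY = """
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    WHERE customer_id IN (SELECT id FROM customers WHERE active = 1)
    GROUP BY customer_id
"""

REWRITTEN_QUERY = """
    SELECT o.customer_id, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    WHERE c.active = 1
    GROUP BY o.customer_id
"""


def make_fixture() -> sqlite3.Connection:
    # Small hand-written fixture in an in-memory database.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, active INTEGER);
        CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
        INSERT INTO customers VALUES (1, 1), (2, 0), (3, 1);
        INSERT INTO orders VALUES (1, 10), (1, 5), (2, 7), (3, 2);
    """)
    return conn


def rows(conn: sqlite3.Connection, query: str) -> list:
    # Sort so the comparison does not depend on row order.
    return sorted(conn.execute(query).fetchall())


def queries_agree() -> bool:
    conn = make_fixture()
    try:
        return rows(conn, ORIGINAL_QUERY) == rows(conn, REWRITTEN_QUERY)
    finally:
        conn.close()
```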

For SQL I don't have a solution. It's a 50-year-old language that lacks a lot of features you would expect. It's also the de facto standard for database access.

For Python I would say that we should start adopting statically typed, compiled languages. Rust has polars as a dataframe package, but the language itself isn't that easy to pick up. Go is very easy to learn, but has no serious dataframe package, so you end up doing a lot of that work yourself in goroutines. Maybe there are better options out there.


orochimaaru|6 months ago

If you’re using some variety of Spark for your data engineering then Scala is an option too.

In general, choice of language isn’t important. Again, if you’re using Spark, the DataFrame schema defines the structure of your data, whether you write Python or not.

Most folks confuse pandas with “data engineering”. They're not the same thing. Most data engineering is Spark.

rovr138|6 months ago

In Spark, don't PySpark and SQL both still get translated to Scala?

sbrother|6 months ago

When I was most recently at Google (2021-ish) my team owned a bunch of SQL pipelines that had fairly effective SQL tests. Not my favorite thing to work on, but it was a productive way to transform data. There are lots of open source versions of the same idea, but I have yet to see them accompanied by ergonomic testing. Any recommendations or pointers to open source SQL testing frameworks?

physicles|6 months ago

Could you describe what made those tests effective? I just wrote some tools to write concise tests for some analytics queries, and some principles I stumbled on are:

- input data should be pseudorandom, so the chance of a test being “accidentally correct” is minimized

- you need a way to verify only part of the result set. Or, at the very least, a way to write tests so that if you add a column to the result set, your test doesn’t automatically break

In addition, I added CSV exports so you can verify the results by hand, and hot-reload for queries with CTEs — if you change a .sql file then it will immediately rerun each CTE incrementally and show you which ones’ output changed.
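The two principles above can be sketched roughly as follows. This is not the tooling described in the comment, just an illustration under assumptions: SQLite through Python's `sqlite3` module, a made-up `events` table, and hypothetical helper names. The input is seeded pseudorandom data, the expected values are computed independently in Python, and the assertion projects only the columns under test, so adding a column to the query does not break it:

```python
import random
import sqlite3

# Hypothetical query under test. The "events" count column is deliberately
# not checked below, to show partial verification of the result set.
QUERY = """
    SELECT user_id, COUNT(*) AS events, SUM(duration) AS total_duration
    FROM events
    GROUP BY user_id
"""


def make_events(seed: int, n: int = 200) -> list:
    # Seeded generator: pseudorandom input, but reproducible on failure.
    rng = random.Random(seed)
    return [(rng.randint(1, 5), rng.randint(1, 1000)) for _ in range(n)]


def run_query(events: list) -> list:
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # lets us read columns by name
    conn.execute("CREATE TABLE events (user_id INTEGER, duration INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", events)
    result = [dict(row) for row in conn.execute(QUERY)]
    conn.close()
    return result


def project(rows: list, columns: list) -> list:
    # Keep only the columns the test cares about, in a stable order.
    return sorted(tuple(row[c] for c in columns) for row in rows)


def expected_totals(events: list) -> list:
    # Independent reference implementation in plain Python.
    totals = {}
    for user_id, duration in events:
        totals[user_id] = totals.get(user_id, 0) + duration
    return sorted(totals.items())


def check(seed: int = 42) -> bool:
    events = make_events(seed)
    actual = project(run_query(events), ["user_id", "total_duration"])
    return actual == expected_totals(events)
```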

greekorich|6 months ago

I've been a professional Java dev for a decade. I've written a little Python, Clojure, lots of JS/TS/Node.

SQL is the most beautiful, expressive, get stuff done language I've used.

It is perfect for whatever data engineering is defined as.

antupis|6 months ago

SQL is beautiful when it works, but when it doesn’t you end up with some abomination, e.g. if you need some kind of dynamic query.
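A small sketch of what a dynamic query tends to look like in practice, assuming the query is assembled in Python (the `users` table, the filter names, and `build_query` are all hypothetical). Values go through `?` placeholders, and column names are checked against an allow-list because they cannot be parameterized:

```python
ALLOWED_FILTERS = {"country", "status"}  # hypothetical filterable columns


def build_query(filters: dict) -> tuple:
    clauses = []
    params = []
    # Sort for a deterministic clause order, which keeps the output testable.
    for column, value in sorted(filters.items()):
        if column not in ALLOWED_FILTERS:
            raise ValueError(f"unsupported filter: {column}")
        clauses.append(f"{column} = ?")  # value is bound later, never inlined
        params.append(value)
    sql = "SELECT id FROM users"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY id", params
```

Even this small case mixes string concatenation with SQL, which is where the readability usually starts to go.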