(no title)
prions | 3 years ago
Speaking from my DE experience at Spotify and previously in startup land, the biggest challenge is the slow and distant feedback loop. The vast majority of data pipelines don't run on your machine and don't behave like they do on a local machine. They run as massively distributed processes and their state is opaque to the developer.
Validating the correctness of a large scale data pipeline can be incredibly difficult as the successful operation of a pipeline doesn't conclusively determine whether the data is actually correct for the end user. People working seriously in this space understand that traditional practices here like unit testing only go so far. And integration testing really needs to work at scale with easily recyclable infrastructure (and data) to not be a massive drag on developer productivity. Even getting the correct kind of data to be fed into a test can be very difficult if the ops/infra of the org isn't designed for it.
The best data tooling isn't going to look exactly like traditional swe tooling. Tools that vastly reduce the feedback loop of developing (and debugging) distributed pipelines running in the cloud and also provide means of validating the output on meaningful data is where tooling should be going. Trying to shoehorn traditional SWE best practices will really only take off once that kind of developer experience is realized.
mywittyname|3 years ago
I'm glad to see someone calling this out because the comment here are a sea of "data engineering needs more unit tests." Reliably getting data into a database is rarely where I've experienced issues. That's the easy part.
This is the biggest opportunity in this space, IMHO, since validation and data completeness/accuracy is where I spend the bulk of my work. Something that can analyze datasets and provide some sort of ongoing monitoring for confidence on the completeness and accuracy of the data would be great. These tools seem to exist mainly in the network security realm, but I'm sure they could be generalized to the DE space. When I can't leverage a second system for validation, I will generally run some rudimentary statistics to check to see if the volume and types of data I'm getting is similar to what is expected.
abrazensunset|3 years ago
They all have various strengths and weaknesses with respect to anomaly detection, schema change alerts, rules-based approaches, sampled diffs on PRs, incident management, tracking lineage for impact analysis, and providing usage/performance monitoring.
Datafold, Metaplane, Validio, Monte Carlo, Bigeye
Great Expectations has always been an open source standby as well and is being turned into a product.
snidane|3 years ago
robertlagrant|3 years ago
The key problem is that more you validate incoming data, the more you can demonstrate correctness, but then the more often data coming in will be rejected, and you will be paged out of hours :)
conkeisterdoor|3 years ago
I've never been in a SWE role before, but am related to and have known a number of them, and have a general sense of what being a SWE entails. That disclaimer out of the way, it's my gut feeling that a DE typically does more "hacky" kind of coding than a SWE. Whereas SWEs have much more clearly established standards for how to do certain things.
My first modules were a hot nasty mess. I've been refactoring and refining them over the past 1.5 years so they're more effective, efficient, and easier to maintain. But they've always just worked, and that has been good enough for my employer.
I have one 1600 line module solely dedicated to validating a set of invoices from a single source. It took me months of trial and error to get that monster working reliably.
alexpetralia|3 years ago
azurezyq|3 years ago
prions|3 years ago
Some pain points:
- Does it take forever to spin up infra to run a single test?
- Is grabbing test data a manual process? This can be a huge pain especially if the test data is binary like avro or parquet. Test inputs and results should be human friendly
- Does setting up a testing environment require filling out tons of yaml files and manual steps?
- Things built at the wrong level of abstraction! This always irks me to experience. Keep your abstractions clean between which tools in your data stack do what. When people start inlining task-specific logic at the DAG level in airflow, or let their individual tasks figure out triggering or scheduling decisions is when things just become confusing.
Right now my workflow allows me to run a prod job (google cloud dataflow) from my local machine. It consumes prod data and writes to a test-prefixed path. With unit tests on the scala code + successful run of the dataflow job + validation and metrics thrown on the prod job I can feel pretty comfortable with the correctness of the pipeline.
oa335|3 years ago