mastazi|6 months ago

KaiserPro|6 months ago

The plus side is that it's not obviously async, which is nice; the downside is that the syntax is less fun (you need to explicitly link forwards rather than backwards).

thomasingalls|6 months ago

What do people do to curate/version/transform their raw datasets these days? I'm vaguely aware of the "chuck it all into S3" strategy for hanging onto raw data, and related strategies where instead of S3 it's a database of some flavor. What are folks doing for record-keeping of what today's raw data contains vs. tomorrow's?

And the next step: a curated dataset has time-bound provenance. What are folks doing to keep track of the transformation/cleaning steps that make the raw data useful at the time it's processed? Does this fall under the purview of Metaflow, or is it different tooling?

Or maybe my assumptions are off base! Curious what other teams are doing with their datasets.

patcon|6 months ago

I'm exploring Kedro and Kedro-Viz lately, in case that's in the vicinity of your question. It ties most closely with MLflow for artifacts, but storing locally works fine too.
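On the record-keeping question above ("what does today's raw data contain vs. tomorrow's"), one tool-agnostic approach is to write a content-hash manifest per raw-data snapshot and diff manifests between dates. The sketch below is plain Python; the function names and manifest layout are hypothetical, not any particular framework's API:

```python
import hashlib


def build_manifest(files: dict[str, bytes], snapshot_date: str) -> dict:
    """Record what a raw-data snapshot contains: a content hash per file name."""
    return {
        "snapshot_date": snapshot_date,
        "files": {name: hashlib.sha256(blob).hexdigest() for name, blob in files.items()},
    }


def diff_manifests(old: dict, new: dict) -> dict:
    """Answer 'what changed between yesterday's raw data and today's?'"""
    old_f, new_f = old["files"], new["files"]
    return {
        "added": sorted(new_f.keys() - old_f.keys()),
        "removed": sorted(old_f.keys() - new_f.keys()),
        "changed": sorted(n for n in old_f.keys() & new_f.keys() if old_f[n] != new_f[n]),
    }


# Two hypothetical snapshots of a raw bucket/prefix:
today = build_manifest({"events.csv": b"a,b\n1,2\n"}, "2024-01-01")
tomorrow = build_manifest(
    {"events.csv": b"a,b\n1,2\n3,4\n", "users.csv": b"id\n7\n"}, "2024-01-02"
)
print(diff_manifests(today, tomorrow))
# {'added': ['users.csv'], 'removed': [], 'changed': ['events.csv']}
```

Storing one such manifest next to each dated prefix (in S3 or a database) gives a cheap audit trail; tools like Kedro's versioned datasets or Metaflow's artifact tracking formalize the same idea.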
ghilston|6 months ago
Hey Savin, it's been a while since we chatted. I hope things are going well ;)

For those unaware, Savin is one of the co-creators of Metaflow.
marksimi|6 months ago
Seems like Metaflow is comparatively lightweight, a bit more tightly integrated with AWS, less end-to-end, and a bit more agile.