top | item 40726664


rdeboo | 1 year ago

I work as a Data Engineer, and in my country Azure is pretty big; as a consequence, their Data Factory service has become a common choice for enterprises. It's a GUI-based ETL tool, and architects prefer it since it is a managed cloud service and supposedly easy to use.

In practice you lose all the benefits of abstraction, unit testing, proper CI/CD, etc. I haven't met an engineer who likes the service. Some projects have resorted to writing code-generation tools, so that they can take config files and programmatically generate the JSON serialization of the pipelines that you're supposed to develop by clicking and dragging.
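A minimal sketch of that code-generation approach. The config shape and activity fields here are illustrative, simplified from the JSON Data Factory stores per pipeline, not the exact ADF schema:

```python
import json

def copy_activity(name, source_ds, sink_ds):
    """Build one Copy activity in roughly the shape ADF serializes (simplified)."""
    return {
        "name": name,
        "type": "Copy",
        "inputs": [{"referenceName": source_ds, "type": "DatasetReference"}],
        "outputs": [{"referenceName": sink_ds, "type": "DatasetReference"}],
    }

def pipeline_from_config(config):
    """Turn a small config dict into a pipeline JSON document."""
    return {
        "name": config["name"],
        "properties": {
            "activities": [
                copy_activity(t["name"], t["source"], t["sink"])
                for t in config["tables"]
            ]
        },
    }

# Hypothetical config: one pipeline, one Copy activity per table.
config = {
    "name": "ingest_sales",
    "tables": [
        {"name": "copy_orders", "source": "sql_orders", "sink": "lake_orders"},
        {"name": "copy_items", "source": "sql_items", "sink": "lake_items"},
    ],
}
print(json.dumps(pipeline_from_config(config), indent=2))
```

The point of the pattern is that the config file, not the drag-and-drop canvas, becomes the reviewable, diffable source of truth; the generated JSON is just a build artifact.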


larodi|1 year ago

While a visual representation of ETLs can be a great help in understanding the data flow, engineers tend to eventually start using commands, whether in VS Code, a Cisco IOS console, or the local shell. Applications that support scripted automation and expose a command line tend to be well respected; a good example is AutoCAD, which has had a prompt from day one, many years ago. That prompt is still there and is used by architects and the like.

This graph-based visual programming somehow fails to deliver on speed of development. The mouse has 2 buttons, the keyboard approx. 100. Not to mention that LLMs work on the language/code level, and are expected to stay that way for a while. We don't have universal means to express things visually, except for the graph notation of edges/vertices. But even then there is no universal knowledge: people don't usually disambiguate between sequence diagrams, BPMN, and state diagrams. These are all graphs, right, but not the same semantically.

I'd rather go for a standardized ETL language à la Markdown, and only then get to appreciate the GUI.
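A toy illustration of what such a text-first ETL notation might look like. The syntax and the `parse_etl` helper are entirely hypothetical, just to show how little machinery a line-based spec needs:

```python
def parse_etl(spec):
    """Parse a tiny hypothetical 'step: target' line format into a step list."""
    steps = []
    for line in spec.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments, like Markdown ignores chrome
        kind, _, rest = line.partition(":")
        steps.append({"step": kind.strip(), "target": rest.strip()})
    return steps

spec = """
# nightly sales load
extract: sql://orders
transform: drop_nulls(order_id)
load: lake://orders
"""
steps = parse_etl(spec)
```

Like Markdown, the spec stays readable as plain text, diffs cleanly in version control, and a GUI could still render it as a flow graph on top.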

what-the-grump|1 year ago

>In practice you lose all the benefits of abstraction, unit testing, proper CI/CD, etc.

Why? We are pretty deep into the ecosystem.

Abstraction -> the only thing Data Factory does not allow is referencing a previous activity as a variable, which makes sense if you don't want to let your customer blow up your product. Parametrize all you want.

Unit testing -> test each activity, pipeline, and flow all you want, and resume from where it broke. Clone the entire thing into a test data factory, then deploy it once ready.

CI/CD -> the first step it nags you about is setting up CI/CD. If you want to get fancy, you set up a dev environment and deploy that to production after testing and sign-off.

Abstracting ETL only works when you remember, or still have on staff, the people who abstracted that ETL process. Data Factory 'could' be visual but does not let you pull the same level of nonsense that SSIS would.

For example, we call Data Factory via API: the pipeline is fully abstracted, it does one thing, but its inputs and outputs are controlled by the request.
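For reference, that API-driven style goes through ADF's "Create Run" REST endpoint, which accepts pipeline parameters in the request body. A sketch of building such a call; the URL shape and api-version follow the Azure management API, but the subscription, resource group, factory, pipeline, and parameter names are placeholders:

```python
# Sketch of an ADF "Create Run" request. The endpoint path and api-version
# follow Azure's management API; all resource names below are placeholders.
BASE = "https://management.azure.com"

def build_run_request(subscription, resource_group, factory, pipeline, parameters):
    """Return the URL and JSON body for a createRun call."""
    url = (
        f"{BASE}/subscriptions/{subscription}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.DataFactory/factories/{factory}"
        f"/pipelines/{pipeline}/createRun?api-version=2018-06-01"
    )
    return url, parameters  # the request body is just the parameter dict

url, body = build_run_request(
    "my-sub", "my-rg", "my-factory", "ingest_sales",
    {"sourceTable": "orders", "sinkPath": "lake/orders"},
)
# POST `body` to `url` with an OAuth bearer token to start the run.
```

Because the caller supplies the parameters, the same generic pipeline can serve many sources and sinks, which is the abstraction the comment describes.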