Low-code ETL tools (Informatica, AppWorx, Talend, Pentaho, SSIS) were the original services for ELT/ETL. A lot of progress was made toward ETL-as-code, starting with Airflow/Luigi. Going back to low code seems backwards at this point.
(I have used all of the above tools in my 15+ year career. ETL-as-code was a huge industry shift.)
It's fascinating that somebody sees these things as "original". I used to maintain a 40-year-old ETL codebase in SAS. IMHO the original low-code movement was the 4GL movement from the 70s (and, according to Wikipedia, the rapid application development movement from the 90s, things like Visual Basic).
I think fundamentally, the problem is that any DSL lives on a design tradeoff spectrum between domain-specificity and programming-generality. There is only so much savings that a generic ETL tool will give you, because it doesn't want to be domain specific. On the other hand, the more domain specific a tool is, the more limited it will be in its capacity to handle other domains.
No amount of graphical or textual interface is gonna save you from this fundamental design tradeoff (it's a kind of complexity tradeoff: things inside the domain are easy, but things outside the domain are hard). But then, you might as well handle this tradeoff in a decent library/framework for a general programming language, which is, effectively, a "return to code".
Another way to look at it is that we want limited programming languages, because the implicit assumptions make things easier for the reader of the code (the person who wants to understand). However, limitations make things harder for the writer of the code (the person who wants to solve some problem).
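That "return to code" can be lightweight. As a rough sketch (every name here is invented for illustration), a handful of composable functions in Python already gives you the boxes of a visual tool, with the whole language still in reach for anything outside the mini-language:

```python
# Sketch of an ETL "DSL" as a plain Python library: each step is just a
# function from rows to rows, and a pipeline is their composition. Anything
# the step vocabulary can't express falls back to ordinary Python.
from functools import reduce
from typing import Callable, Iterable

Step = Callable[[Iterable[dict]], Iterable[dict]]

def pipeline(*steps: Step) -> Step:
    """Compose steps left to right into a single step."""
    return lambda rows: reduce(lambda acc, step: step(acc), steps, rows)

def rename(old: str, new: str) -> Step:
    """Rename a column in every row."""
    return lambda rows: (
        {**{k: v for k, v in r.items() if k != old}, new: r[old]} for r in rows
    )

def where(pred: Callable[[dict], bool]) -> Step:
    """Keep only rows matching the predicate."""
    return lambda rows: (r for r in rows if pred(r))

etl = pipeline(
    rename("amt", "amount"),
    where(lambda r: r["amount"] > 0),
)

print(list(etl([{"amt": 5}, {"amt": -1}])))  # [{'amount': 5}]
```

The point is not this particular toy, just that a general language hosts the domain-specific part without fencing you in.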
I work as a Data Engineer, and in my country Azure is pretty big; as a consequence, their Data Factory service has become a common choice for enterprises. It's a GUI-based ETL tool, and architects prefer it since it is a managed cloud service and is supposedly easy to use.
In practice you lose all the benefits of abstraction, unit testing, proper CI/CD, etc. I haven't met an engineer that likes the service. Some projects have resorted to writing code generation tools, so that they can take config files and programmatically generate the JSON serialization of the pipelines that you're supposed to develop by clicking and dragging.
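A toy sketch of that codegen idea (the config schema and JSON fields below are invented for illustration, not Data Factory's actual pipeline format):

```python
# Toy generator: turn a simple config dict into the kind of pipeline-JSON
# document a drag-and-drop tool would otherwise have you build by hand.
# The schema here is made up for illustration only.
import json

def generate_pipeline(config: dict) -> str:
    activities = []
    for i, step in enumerate(config["steps"]):
        activities.append({
            "name": step["name"],
            "type": step["type"],
            "inputs": step.get("inputs", []),
            # Chain each activity to the previous one so they run in order.
            "dependsOn": [config["steps"][i - 1]["name"]] if i else [],
        })
    return json.dumps({"name": config["name"], "activities": activities}, indent=2)

config = {
    "name": "daily_sales",
    "steps": [
        {"name": "copy_raw", "type": "Copy", "inputs": ["sales_src"]},
        {"name": "transform", "type": "DataFlow"},
    ],
}
print(generate_pipeline(config))
```

Once the pipelines live in config like this, they diff, review, and template like any other code.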
While a visual representation of ETLs can be a great help in understanding the data flow, engineers tend to eventually start using commands, whether in VS Code, Cisco IOS, or a local shell. Applications open to scripted automation and equipped with a command line tend to be well respected. A good example is AutoCAD, which has had a prompt from day one, many years ago; that prompt is still there and is used by architects and the like.
This graph-based visual programming somehow fails to deliver on speed of development. The mouse has 2 buttons, the keyboard approximately 100. Not to mention that LLMs work at the language/code level, and are expected to stay there for a while. We don't have universal means to express things visually, except for the graph notation of edges/vertices. But even then there is no universal understanding: people don't usually disambiguate between a sequence diagram, BPMN, and a state diagram. These are all graphs, sure, but not semantically the same.
I'd rather go for a standardized ETL language a la Markdown, and only then get to appreciate the GUI.
> In practice you lose all the benefits of abstraction, unit testing, proper CI/CD, etc.
Why? We are pretty deep into the ecosystem.
Abstraction -> the only thing Data Factory does not let you do is reference a previous activity as a variable, which makes sense if you don't want to let your customers blow up your product. Parametrize all you want.
Unit testing -> test each activity, pipeline, and flow all you want, and resume from where it broke. Clone the entire thing into a test data factory, then deploy that once ready.
CI/CD -> the first thing it nags you about is setting up CI/CD. If you want to get fancy, you set up a dev environment and deploy that to production after testing and sign-off.
Abstracting ETL only works when you remember, or still have on staff, the people who abstracted that ETL process. Data Factory 'could' be visual but does not let you pull the same level of nonsense that SSIS would.
For example, we call Data Factory via API; the pipeline is fully abstracted and does one thing, but its inputs and outputs are controlled by the request.
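For illustration, triggering such a parametrized pipeline run looks roughly like this. The `createRun` endpoint follows the shape of Azure's public management API, but the subscription, factory, and parameter names below are placeholders, and authentication is omitted:

```python
# Sketch of kicking off a parametrized Data Factory pipeline over REST.
# All identifiers below are placeholders; send the result with any HTTP
# client, supplying a bearer token for Azure management scope.
import json

API_VERSION = "2018-06-01"

def create_run_request(sub: str, rg: str, factory: str, pipeline: str,
                       parameters: dict) -> tuple[str, str]:
    """Return (url, json_body) for a pipeline createRun call."""
    url = (
        f"https://management.azure.com/subscriptions/{sub}"
        f"/resourceGroups/{rg}/providers/Microsoft.DataFactory"
        f"/factories/{factory}/pipelines/{pipeline}"
        f"/createRun?api-version={API_VERSION}"
    )
    return url, json.dumps(parameters)

url, body = create_run_request(
    "sub-id", "my-rg", "my-factory", "ingest_orders",
    {"sourcePath": "landing/2024-06-01", "targetTable": "orders"},
)
# POST `body` to `url` to start the run; the response carries a runId to poll.
```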
Once those custom codegen tools become established and popular, someone will have the idea that if they slap a GUI on the front, it'll be much easier for non-engineers in the business to pick up and use.
You see this a lot. A new generation comes in to relearn all the old lessons. It’s also happening on the frontend, from “server-side rendering is bad, it’s slow” to the latest frontend frameworks and thought leaders saying “you should be using SSR” (server-side rendering).
ETL is hot again with machine learning. Companies have massive amounts of data they need to get in shape for models. The promise of a GUI anyone can use with little experience (low costs) means these tools are gaining traction again for the same lessons to be learnt from the last generation who moved on/out the industry.
But that’s the point: it’s just a promise. I have had a similar career to the top commenter, and it turns out that low-code approaches only work when people who understand code use them. Be that Python or SQL, real use cases are not the toy examples shown in all the typical introductions to these tools.
That said, if this thing is customizable enough, a good data engineer can prepare canned steps that fit the general structure of the customer data process and it may have its place.
What is your estimate of the time required to begin building on the new (old) foundation, discover all the problems after a large investment of effort, and then declare to all that it’s the wrong path?
My personal pain is Tableau trying to move up the stack into data orchestration and compute. The code is completely inaccessible.
I share the same enthusiasm for ETL-as-code, but as a heavy user of SQL Server 2005 (SSIS) at the time, the main advantages for me were that there was a closed and concise ecosystem around the technology, with less fragmentation and fewer pieces that could fail. It was way simpler to find people to solve issues on the platform, and less time was needed to think about the technology itself, leaving more time to think about the problem.
When I transitioned to ETL-as-code 12 years ago, I felt for the first time what it is like to be a programmer/developer; until then, the technology had literally been a means to an end. To be honest, I did not think about the tech at all.
Just to give an idea: I was responsible for 5 instances of SQL Server, and together with one other person we ran the whole end-to-end data operation (i.e. ETL via SSIS + OLAP cubes via SSAS + reporting via SSRS), yet today I struggle to reach the same level of scale with fewer than 7 people.
Fast forward to 2024 and the inverse is true: I spend way more time thinking about the tools and their specificities than about the problem I am solving at the end of the day.
I'm not saying that ETL-as-code is bad (actually there's a lot of good stuff out there), but for folks in this kind of ecosystem (e.g. Azure), the juice of homemade code, and all the stuff that comes with it, is definitely not worth the squeeze.
No it’s not. Try and see what banks, retail, manufacturing, various large enterprises still use. They need scale, observability, modularity, and maintainability.
I work in manufacturing (a large industrial plant) and the data processes we have are honestly not great. Mostly it is because there is a heap of legacy systems and not a lot of commonality between our data sources: we have a hideous mashup of Oracle, DB2, Microsoft SQL Server, etc., and different versions of each. There's also more bespoke industry stuff, like time-series historians and SCADA systems/PLCs (ABB, Citect, etc.), to complicate the process.
From my experience SQL is basically the lowest common denominator that everything speaks, and even then the Oracle SQL dialect is subtly different from Microsoft's, for example; things are just different enough to introduce frustrations.
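One common way to paper over that drift is a small helper per dialect quirk. A sketch covering just the row-limit syntax (table and column names are made up):

```python
# Illustration of hiding one SQL dialect difference behind a helper:
# SQL Server uses TOP, while Oracle (12c+) and DB2 accept FETCH FIRST.
def top_n_query(dialect: str, table: str, order_col: str, n: int) -> str:
    base = f"FROM {table} ORDER BY {order_col} DESC"
    if dialect == "tsql":             # Microsoft SQL Server
        return f"SELECT TOP {n} * {base}"
    if dialect in ("oracle", "db2"):  # both support the FETCH FIRST clause
        return f"SELECT * {base} FETCH FIRST {n} ROWS ONLY"
    raise ValueError(f"unknown dialect: {dialect}")

print(top_n_query("tsql", "events", "created_at", 10))
print(top_n_query("oracle", "events", "created_at", 10))
```

Multiply this by date arithmetic, string functions, and pagination, and the "lowest common denominator" gets thin fast.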
There has been a movement in the last couple of years to hoist everything into a common "data lake", but my understanding is that ingestion into this lake is not a simple process by any means: it requires batch processes with demanding compute resources and is slow (i.e. takes many hours and runs overnight).
I would also add that in my 25 years in the industry I have never actually come across an ETL workflow that was complex enough that it required people working it in code. Those opinions seem to occur before proper analysis happens.
Thanks for your comment! I do believe it depends on who you ask and ultimately both will co-exist. I also think low-code solutions democratize access to ETL development offering a significant productivity advantage for smaller teams. With Amphi, I'm trying to avoid the common pitfalls of other low-code ETL tools, such as scalability issues, inflexibility, and vendor lock-in, while embracing the advantages of modern ETL-as-code:
- Pipelines are defined as JSON files (git workflow available)
- Generates non-proprietary Python code: This means the pipelines can be deployed anywhere, such as AWS Lambda, EC2, on-premises, or Databricks.
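To illustrate the JSON-pipelines idea in general terms (the step vocabulary below is invented for illustration and is not Amphi's actual schema), a spec like this can be executed by a few lines of plain Python:

```python
# Toy illustration of a pipeline stored as JSON and executed as plain Python.
# The "op" vocabulary here is made up; a real tool would also emit the code.
import json

STEPS = {
    # Drop rows where the given column is missing.
    "filter": lambda rows, arg: [r for r in rows if r.get(arg["column"]) is not None],
    # Keep only the listed columns.
    "select": lambda rows, arg: [{k: r[k] for k in arg["columns"]} for r in rows],
}

def run(spec_json: str, rows: list[dict]) -> list[dict]:
    for step in json.loads(spec_json)["steps"]:
        rows = STEPS[step["op"]](rows, step["args"])
    return rows

spec = json.dumps({"steps": [
    {"op": "filter", "args": {"column": "email"}},
    {"op": "select", "args": {"columns": ["email"]}},
]})
print(run(spec, [{"email": "a@b.c", "x": 1}, {"email": None}]))
```

Because the definition is plain JSON, it diffs cleanly in git, which is what makes the "git workflow" claim plausible.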
I would love your advice. What tool would you recommend for straightforward ETLs as a single developer? Think of tasks like ETL-ing data from Production to Test or Local, or quickly combining data from 2 databases to answer some business question.
Six years ago I used Pentaho for this, and it worked really well. It was easy and quick, though maintenance was sometimes hard and it felt very dated: the JavaScript version was ancient, and while I could find a lot of questions answered online, they were usually 5-10 years old. I am wondering whether I should use something like Amphi for my next simple ETLs.
I've gotten some quick wins with Benthos (now RedPanda Connect) but I agree it's an unsolved problem as there are typically gotchas.
If you can get a true CDC stream from the database to analytics, that would be ideal, but when that isn't available you spend 100x more time trying to bodge together an equivalent batch/retry system.
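When CDC isn't available, the usual fallback is a high-watermark batch loop. A bare-bones sketch, with the state handling and column names purely illustrative:

```python
# Minimal high-watermark incremental extract: remember the largest
# modification timestamp seen, and pull only newer rows on the next run.
# The source function and "updated_at" column are illustrative.
def incremental_extract(fetch_since, state: dict) -> list[dict]:
    """fetch_since(ts) must return rows with updated_at > ts."""
    watermark = state.get("watermark", 0)
    rows = fetch_since(watermark)
    if rows:
        # Advance the watermark only after a successful batch, so a failed
        # run is naturally retried from the same point.
        state["watermark"] = max(r["updated_at"] for r in rows)
    return rows

source = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]
fetch = lambda ts: [r for r in source if r["updated_at"] > ts]

state = {}
print(incremental_extract(fetch, state))  # both rows; watermark becomes 20
print(incremental_extract(fetch, state))  # [] until something changes upstream
```

Even this toy hides the hard parts (late-arriving updates, deletes, clock skew), which is where the "100x more time" goes.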
I also want to know that. The BI team where I work still uses Pentaho. It's buggy and ugly, but it gets the job done most of the time. A few of them know a little Python, so a tool like Amphi could be the next stage.
Agreed. Well designed Airflow Operators (or taskflow) are basically the same level of effort as creating a box with a UI (honestly, easier IMHO), but the ability to go into code is important for every non-trivial pipeline.
I built a solid career replacing no-code-ETL tools with Airflow.
You’re missing the point of the benefits of solutions like these, and of the original set of tools of the Informatica kind. Those tools come with limitations and constraints; like a box of Legos, you can build a very powerful pipeline without having to wire up a lot of redundant code as you pass data frames between validation stages. Tools like Airflow/Spark are great for what they are, but they don’t come with guidelines or best practices for reusable code at scale; your team has to establish that early on.
You can open a pretty complicated, large DAG and right away you’ll understand the data flow and processing steps. If you were to do the same in code, it becomes a lot harder unless you adhere to good modular design practices.
This is also why common game engines and 3D rendering tools come with a UI for flow-driven scripting. It’s intuitive and much easier to organize.
I used to be an Informatica ETL dev, briefly. I laugh at the description of it as a low-code solution, as we used it as a simple frontend for staggering amounts of hand-coded SQL. I'm not sure I ever used Informatica as it was intended.
The debate about GUI vs code solutions has raged for decades, generating more heat than light. I've tried to give a summary of the pros and cons of the 2 approaches here: https://successfulsoftware.net/2024/01/16/visual-vs-text-bas...
TLDR: Which is best depends on the situation and the people involved. Given the wide range of both situations and people, I think there is plenty of room for both approaches.
shoo|1 year ago
this is reminiscent of https://mikehadlow.blogspot.com/2012/05/configuration-comple...
stoperaticless|1 year ago
> They need scale, observability, modularity, and maintainability.
Seems orthogonal to the code-vs-GUI dimension.
gchamonlive|1 year ago
Tldr: as with dashboards, self-serve ETLs don't work outside very specific use cases or very simple pipelines.
The reason for that, as with BI tools, is that the complexity is not in manipulating data; there are tons of frameworks to do that efficiently.
The issue is interpreting data and its semantics, and evolving data pools to meet the business's needs.