(no title)
davinchia | 4 years ago
I think some of the points made here about ETL scripts being just 'ETL scripts' are very relevant. Definitely been on the other side of the table arguing for a quick 3-hour script.
Having written plenty of ETL scripts - in Java with Hadoop/Spark, Python with Airflow, and pure Bash - that later morphed into tech-debt monsters, I think many people underestimate how quickly these can snowball into proper products with actual requirements.
Unless one is extremely confident an ETL script will remain a non-critical, nice-to-have part of the stack, I believe evaluating and adopting a good ETL framework - especially one with pre-built integrations - is a good case of 'sharpening the axe before cutting the tree' and well worth the time.
We've been very careful to minimise Airbyte's learning curve. Starting up Airbyte is as easy as checking out the git repo and running 'docker compose up'. A UI lets users select, configure, and schedule jobs from a list of 120+ supported connectors. It's not uncommon to see users up and running with Airbyte within tens of minutes.
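For the curious, the startup flow described above looks roughly like this (the repo URL is the real one linked below; the exact compose invocation and port may differ by version, so treat this as a sketch):

```shell
# Clone the repo and launch the stack locally (assumes Docker is installed)
git clone https://github.com/airbytehq/airbyte.git
cd airbyte
docker compose up
# then open the UI in a browser, e.g. http://localhost:8000 (default port may vary)
```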
If a connector is not supported, we offer a Python CDK that lets anyone develop their own connectors in a matter of hours. We are committed to supporting community-contributed connectors, so there is no worry about contributions going to waste.
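To give a flavour of what a connector does, here is a schematic sketch of the underlying idea: an Airbyte connector is ultimately a program that emits JSON messages (records) for each row it reads from a source. Note this is NOT the actual CDK API - the function names and the "users" stream here are purely illustrative:

```python
import json
import time

def record_message(stream, data):
    """Build an Airbyte-protocol-style RECORD message.

    Schematic only; the real CDK wraps this in typed classes.
    """
    return {
        "type": "RECORD",
        "record": {
            "stream": stream,
            "data": data,
            "emitted_at": int(time.time() * 1000),  # epoch millis
        },
    }

def read(rows):
    """A toy 'read' loop: emit one message per source row to stdout."""
    for row in rows:
        print(json.dumps(record_message("users", row)))

if __name__ == "__main__":
    read([{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}])
```

The real CDK handles the plumbing (spec, connection checks, schema discovery, incremental state) so that connector authors mostly write the per-stream read logic.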
Everything is open source, so anyone is free to dive as deep as they need or want to.
We also build in the open and have single-digit hour Slack response time on weekdays. Do check us out - https://github.com/airbytehq/airbyte!
0xbadcafebee|4 years ago
But that's probably the time it took to write it, right? 80% of the cost of software is in maintenance, so there's another 12 hours worth of maintenance left to account for. If you know you're going to spend 15 hours on it, then you might as well use a system you know will cost less to extend or scale over time.
"We've been very careful to minimise Airbyte's learning curve."
That's good for quickly onboarding new customers, but not necessarily for the system to be scalable or extensible.
"Starting up Airbyte is as easy as checking out the git repo and running 'docker compose up'."
I'm always curious about this. Docker Compose doesn't run on more than a single host, unless you're using it with the AWS ECS integration (and maybe Swarm?). So sure, the developer can "get it running" quickly to look at it, but to actually deploy something to production they'll have to rewrite the docker-compose setup in something else. If you provide them with Terraform modules or a Helm chart, that would get them into production faster. And maybe even a canned CI/CD pipeline in a container so they can start iterating on it immediately. It's more work for your company, but it reduces the friction for developers to get to production, and enables businesses to start using your product in production immediately, which I think is a pretty big differentiator of business value.
davinchia|4 years ago
"If you know you're going to spend 15 hours on it, then you might as well use a system you know will cost less to extend or scale over time."
I wish younger me realised that earlier :)
"And maybe even a canned CI/CD pipeline in a container so they can start iterating on it immediately."
Definitely! Although a good number of users are surprisingly happy with their Airbyte instances on a single node.
We do have a Kubernetes offering for those looking to scale Airbyte beyond a single node. We also have Kustomize/Helm deploys for this, though I'll be the first to admit that the Helm charts are mostly community-maintained and can be improved. This is one of our (my) top priorities going into the next quarter.
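A Helm-based install along these lines would typically look like the following. The repo URL, chart name, and namespace here are assumptions - check the Airbyte docs for the current values:

```shell
# Hypothetical Helm install; chart/repo names are assumed, verify against the docs
helm repo add airbyte https://airbytehq.github.io/helm-charts
helm repo update
helm install airbyte airbyte/airbyte --namespace airbyte --create-namespace
```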
davinchia|4 years ago
My experience with Dataflow is 1.5 years old, so things might have changed, but I found it to be more of a unified, simplified Hadoop/Spark framework. It unifies the batch/streaming concepts but is still pretty low-level.
Within ELT, or ETL, Airflow/Dataflow can fulfill all three components.
Airbyte focuses just on EL (though we have basic T functionality around normalisation). Our intention is to leave T to the warehouse, since warehouses like Redshift/Snowflake/BigQuery are extremely powerful these days, and tools like dbt give users more flexibility to recombine and consume the raw data than a specialised ETL pipeline.
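Concretely, in that division of labour Airbyte lands raw tables in the warehouse and the T step is just SQL run by the warehouse, e.g. a dbt model. A dbt model is essentially a SELECT statement in a file; the table and column names below are made up for illustration:

```sql
-- models/active_users.sql (illustrative dbt model; names are hypothetical)
select
    id,
    lower(email) as email,   -- light cleanup done in-warehouse, not in the pipeline
    created_at
from raw.users               -- raw table loaded by the EL pipeline
where deleted_at is null
```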
In summary, I would say Airbyte is a specialised subset of Airflow/Dataflow, and it's possible to use Airbyte with either tool, though I'd guide someone towards dbt.