Launch HN: BuildFlow (YC W23) – The FastAPI of data pipelines
104 points | calebtv | 3 years ago
The problem we're trying to solve is simple: building data pipelines can be a real pain. You often need to deal with complex frameworks, manage external cloud resources, and wire everything together into a single deployment (you're probably drowning in YAML by this point in the dev cycle). This can be a burden on both data scientists and engineering teams.
"Data pipeline" is a broad term, but we generally mean any kind of processing that happens outside of the user-facing path: things like processing file uploads, syncing data to a data warehouse, or ingesting data from IoT devices.
BuildFlow, our open-source framework, lets you build a data pipeline by simply attaching a decorator to a Python function. All you need to do is describe where your input is coming from and where your output should be written, and BuildFlow handles the rest. No configuration outside of the code is required. See our docs for some examples: https://www.buildflow.dev/docs/intro.
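Roughly, a pipeline ends up looking like the sketch below. The decorator and connector names here are simplified for illustration rather than copied verbatim from the API, so treat it as a sketch and use the docs above for runnable examples:

    # Illustrative sketch: decorator and connector names are simplified,
    # not necessarily the exact BuildFlow API (see the docs for real examples).
    import buildflow

    app = buildflow.Flow()

    @app.processor(
        source=buildflow.PubSubSubscription("projects/my-project/subscriptions/my-sub"),
        sink=buildflow.BigQueryTable("my-project.my_dataset.my_table"),
    )
    def process(element: dict) -> dict:
        # Plain Python logic; the runtime handles reads, writes, and scaling.
        element["processed"] = True
        return element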
When you attach the decorator to your function, the BuildFlow runtime creates the cloud resources you reference, spins up replicas of your processor, and wires up everything needed to efficiently scale out the reads from your source and the writes to your sink. This lets you focus on writing logic rather than interacting with your external dependencies.
BuildFlow aims to hide as much complexity as possible in the sources and sinks so that your processing logic can remain simple. The framework provides generic I/O connectors for popular cloud services and storage systems, in addition to "use-case-driven" I/O connectors that chain together the multiple I/O steps common use cases require. An example use-case-driven source that chains together GCS Pub/Sub notifications and fetching GCS blobs can be seen here: https://www.buildflow.dev/docs/io-connectors/gcs_notificatio...
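From the user's side, swapping in a chained source looks about like this (same illustrative naming as the sketch above; the linked doc has the real connector):

    # Use-case-driven source (name approximate): under the hood it subscribes
    # to GCS Pub/Sub notifications, then fetches each changed blob for you.
    @app.processor(
        source=buildflow.GCSFileChangeStream(bucket_name="my-upload-bucket"),
        sink=buildflow.BigQueryTable("my-project.my_dataset.uploads"),
    )
    def handle_upload(file_event) -> dict:
        # The source has already fetched the blob, so the logic stays simple.
        return {"name": file_event.file_name, "size_bytes": len(file_event.blob)}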
BuildFlow was inspired by our time at Verily (Google Life Sciences) where we designed an internal platform to help data scientists build and deploy ML infra / data pipelines using Apache Beam. Using a complex framework was a burden on our data science team because they had to learn a whole new paradigm to write their Python code in, and our engineering team was left with the operational load of helping folks learn Apache Beam while also managing / deploying production pipelines. From this pain, BuildFlow was born.
Our design is based around two observations we made from that experience:
(1) The hardest thing to get right is I/O. Most of the pain is in efficiently fanning out I/O to workers, concurrently reading and processing input data, catching schema mismatches before runtime, and configuring cloud resources. BuildFlow attempts to abstract away all of these bits; a toy sketch of the schema check follows after (2).
(2) Most use cases are large scale but not (overly) complex. Existing frameworks give you scalability and a complicated programming model that supports every use case under the sun. BuildFlow provides the same scalability but focuses on common use cases so that the API can remain lightweight & easy to use.
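To illustrate the "schema mismatches before runtime" point from (1): if output schemas are declared as Python dataclasses (the pattern our schema-validation docs show: https://www.buildflow.dev/docs/schema-validation#examples), the framework can compare the declared fields against the sink's schema when the pipeline is built, not when the first element flows through. A standalone toy version of that check (not BuildFlow code):

    from dataclasses import dataclass, fields

    @dataclass
    class OutputRow:
        user_id: str
        score: float

    # Columns the sink (say, a BigQuery table) is known to expect.
    SINK_SCHEMA = {"user_id": str, "score": float}

    def check_schema(output_type: type) -> None:
        """Fail at pipeline-build time, not mid-run, on a schema mismatch."""
        declared = {f.name: f.type for f in fields(output_type)}
        if declared != SINK_SCHEMA:
            raise TypeError(f"schema mismatch: {declared} != {SINK_SCHEMA}")

    check_schema(OutputRow)  # passes; a renamed or missing field raises here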
BuildFlow is open source, but we offer a managed cloud offering that lets you easily deploy your pipelines to the cloud. We provide a CLI that deploys your pipeline to a managed Kubernetes cluster, and you can opt in to letting us manage your resources / Terraform as well. Ultimately this will feed into our VS Code extension, which will let users visually build their data pipelines directly from VS Code (see https://launchflow.com for a preview). The extension will be free to use and will come packaged with a bunch of nice-to-haves (code generation, fuzzing, tracing, and arcade games (yep!), just to name a few in the works).
Our managed offering is still in private beta, but we're hoping to release our CLI in the next couple of weeks. Pricing for this service is still being ironed out, but we expect it to be based on usage.
We'd love for you to try BuildFlow and hear any feedback you have. You can get started right away by installing the Python package: pip install buildflow. Check out our docs (https://buildflow.dev/docs/intro) and GitHub (https://github.com/launchflow/buildflow) for examples of how to use the API.
This project is very new, so we'd love to gather some specific feedback from you, the community. How do you feel about a framework managing your cloud resources? We're considering adding a module that would let BuildFlow create / manage your Terraform for you (Terraform state would be dumped to disk). What are some common I/O operations you find yourself rewriting? What are some operational tasks that require you to leave your code editor? We'd like to bring as many tasks as possible into BuildFlow and our VS Code extension so you can avoid context switches.
vosper|3 years ago
I'm about to build a pipeline that needs to pass thousands of docs a minute through a variety of enrichments (ML models, third-party APIs, etc.) and then dump the final enriched doc into ES.
There are so many pipeline products and workflow engines and MLOps solutions that I'm very confused about what technologies I should be looking at. I think something looks good (Temporal), but then I read it's not really for large volumes of streaming data. Or I look at Flink, which can handle massive volumes, but it doesn't seem as easy to wire up as other options. I think Dagster looks nice but can't find any answer (even in their Slack) about what kind of volumes it can handle...
TankeJosh|3 years ago
BuildFlow can run a simple PubSub -> light processing -> BigQuery pipeline at about 5-7k messages / second on a 4-core VM (tested on GCP's n1-standard-4 machines). Thousands of docs a minute works out to only tens of messages a second, so for your case you might be able to get away with running on a single machine with 4-8 cores.
I’d be happy to connect outside of HN if you’d like me to dig into your use case more! You can reach me at josh@launchflow.com
edit: You can also reach out on our Discord: https://discordapp.com/invite/wz7fjHyrCA
lysecret|3 years ago
calebtv|3 years ago
1. We're definitely more of a generic streaming framework. But I could see ML being one of those use cases as well.
Why Ray? One of our main drivers was how "pythonic" Ray feels, and that was a core principle we wanted in our framework. Most of my prior experience has been working with Beam, and Beam is great, but it's kind of a whole new paradigm you have to learn. Another thing I really like about Ray is how easy it is to run locally on your machine and get some real processing power. You can easily have Ray use all of your cores and actually see how things scale without having to deploy to a cluster (there's a quick Ray-only illustration at the end of this comment). I could probably go on and on haha, but those are the first two that come to mind.
2. We really want to support a bunch of frameworks / resources. We mainly chose BQ and Pub/Sub because of our prior experience. We have some GitHub issues open to support other resources across multiple clouds, and feel free to file issues if you'd like to see support for other things! With BuildFlow, we deploy the resources to a project you own, so you are free to edit them as you see fit. BuildFlow won't touch already-created resources beyond making sure it can access them. We don't really want to bake environment-specific logic into BuildFlow; I think that's probably best handled with command-line arguments to a BuildFlow pipeline. But happy to hear other thoughts here!
3. I'm not sure I understand what you mean by "glue", so apologies if this doesn't answer your question. The BuildFlow code gets deployed with your pipeline, so it doesn't need to run remotely at all. If you were deploying this to a single VM, you could just execute the Python file on the VM and things would be running. We don't have great support for multi-stage pipelines at the moment; what you can do today is chain processors together with a Pub/Sub feed (rough sketch at the end of this comment). But we do really want to support chaining processors together directly.
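To show the local-scaling point about Ray concretely (this is plain Ray, not BuildFlow code):

    import ray

    ray.init()  # by default Ray uses all local CPU cores

    @ray.remote
    def enrich(doc: dict) -> dict:
        # Stand-in for real per-document work.
        doc["enriched"] = True
        return doc

    docs = [{"id": i} for i in range(1000)]
    # Tasks fan out across every core on your machine -- no cluster needed.
    results = ray.get([enrich.remote(d) for d in docs])
    print(len(results))  # 1000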
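And a rough sketch of the Pub/Sub chaining workaround from 3, reusing the same illustrative connector names as the sketches in the post (not exact API):

    # Stage 1 sinks to a Pub/Sub topic; stage 2 sources from a subscription
    # on that topic. Connector names are illustrative, as in the post above.
    @app.processor(
        source=buildflow.PubSubSubscription("projects/p/subscriptions/raw"),
        sink=buildflow.PubSubTopic("projects/p/topics/stage-one-out"),
    )
    def stage_one(element: dict) -> dict:
        element["stage_one_done"] = True
        return element

    @app.processor(
        source=buildflow.PubSubSubscription("projects/p/subscriptions/stage-one-out"),
        sink=buildflow.BigQueryTable("my-project.my_dataset.final"),
    )
    def stage_two(element: dict) -> dict:
        element["stage_two_done"] = True
        return element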
brap|3 years ago
Just out of curiosity: it seems like the process function you define has to run remotely on workers. How does it get serialized? Are there limitations on the process function due to serialization?
calebtv|3 years ago
I think the most common limitation will be ensuring that your output is serializable. Typically, returning Python dictionaries or dataclasses is fine.
But if you have a specific limitation in mind, let me know; happy to dive into it!
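A quick standalone illustration of what serializes cleanly and what doesn't (using pickle as a stand-in for whatever serializer the workers actually use):

    import pickle
    from dataclasses import dataclass

    @dataclass
    class Output:
        user_id: str
        score: float

    # Dicts and dataclasses of plain values serialize fine.
    pickle.dumps({"user_id": "u1", "score": 0.9})
    pickle.dumps(Output(user_id="u1", score=0.9))

    # Return values that hold live resources (open files, sockets, DB
    # connections) are the usual failure mode:
    handle = open("example.log", "w")
    try:
        pickle.dumps(handle)
    except TypeError as err:
        print(err)  # cannot pickle '_io.TextIOWrapper' object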