item 35439747

Longwelwind | 2 years ago

I don't know what tools you are using but this can be achieved with Airflow on k8s, for example:

* Add a GPU resource requirement on one of your steps

* Add an auto-scaler that adds GPU nodes to your cluster based on the GPU resource demand.
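The two steps above hinge on one mechanism: a pod that requests the extended resource `nvidia.com/gpu` can't be scheduled on CPU-only nodes, so the cluster autoscaler sees it pending and adds a node from a GPU node pool. A minimal sketch of such a pod spec, as plain Python dicts standing in for what an Airflow task would submit (image name and helper are illustrative, not from any particular setup):

```python
# Sketch of the pod spec a GPU training step would submit.
# Requesting "nvidia.com/gpu" makes the pod unschedulable on
# CPU-only nodes; a cluster autoscaler watching for pending pods
# then scales up the GPU node pool to satisfy it.
def gpu_training_pod_spec(image, gpus=1):
    return {
        "containers": [
            {
                "name": "train",
                "image": image,
                "resources": {
                    # GPUs go under limits only: Kubernetes treats
                    # extended resources as limit == request.
                    "limits": {"nvidia.com/gpu": str(gpus)},
                },
            }
        ],
        "restartPolicy": "Never",
    }

spec = gpu_training_pod_spec("registry.example.com/trainer:latest", gpus=2)
```

In a real Airflow deployment this spec would be expressed through the KubernetesPodOperator rather than built by hand, but the GPU limit is the load-bearing part either way.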

After having written the above, I realize that it might sound like that famous HN comment about how you can /easily/ re-create Dropbox yourself, which might actually prove your point that there is a need for ML-specific tools for the training part.


thundergolfer | 2 years ago

Having to set up and run Airflow on K8s is a hell of a prerequisite step to getting cost-efficient and fast access to GPU training.

Airflow is also absolutely not built for that purpose. It's ~10yr old Hadoop-era technology.

__MatrixMan__ | 2 years ago

As for getting airflow on k8s in the first place, the apache airflow helm chart pretty much just handles things, doesn't it? It might be a pain to manage many deployments for many teams, but going from 0 to 1 isn't so bad.

As for configuring the kubernetes pod operator to ask for pods with GPUs, it exposes the k8s python API in the dag definition. I haven't done it myself, but I think that it's not really airflow that's going to be a pain there. Getting the pod spec right is gonna have to happen whatever does the orchestration.
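For a sense of what "getting the pod spec right" tends to involve beyond the resource limit: GPU node pools are commonly tainted so ordinary workloads don't land on expensive nodes, which means the pod also needs a matching toleration. A hedged sketch in plain Python (the `nvidia.com/gpu` taint key is a common convention, not universal, and the helper name is made up for illustration):

```python
import copy


def add_gpu_scheduling(pod_spec, gpus=1, taint_key="nvidia.com/gpu"):
    """Return a copy of pod_spec that requests GPUs and tolerates
    the taint commonly placed on GPU node pools."""
    spec = copy.deepcopy(pod_spec)
    container = spec["containers"][0]
    # Extended resources like GPUs are declared under limits.
    container.setdefault("resources", {}).setdefault("limits", {})[
        "nvidia.com/gpu"
    ] = str(gpus)
    # Without this toleration, the pod stays Pending even after the
    # autoscaler brings up a tainted GPU node.
    spec.setdefault("tolerations", []).append(
        {"key": taint_key, "operator": "Exists", "effect": "NoSchedule"}
    )
    return spec


base = {"containers": [{"name": "train", "image": "trainer:latest"}]}
spec = add_gpu_scheduling(base, gpus=1)
```

Whether this dict is handed to Airflow's pod operator or to any other orchestrator, the same spec details have to be in place, which is the commenter's point.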

(Full disclosure: my employer offers airflow as a service)

Starlord2048 | 2 years ago

I agree with you that there is still room for improvement when it comes to the efficiency and effectiveness of training orchestration tools. It's true that setting up and spinning down GPU instances can be challenging, and optimizing the use of these resources is essential given their cost.

mountainriver | 2 years ago

Yup, you're just waiting 10 minutes to add a GPU node, nothing to see here