top | item 37164864

Show HN: Shadeform – Single Platform and API for Provisioning GPUs

62 points| edgoode | 2 years ago |shadeform.ai

Hi HN, we are Ed, Zach, and Ronald, creators of Shadeform (https://www.shadeform.ai/), a GPU marketplace to see live availability and prices across the GPU market, as well as to deploy and reserve on-demand instances. We have aggregated 8+ GPU providers into a single platform and API, so you can easily provision instances like A100s and H100s where they are available.

From our experience working at AWS and Azure, we believe that cloud could evolve from all-encompassing hyperscalers (AWS, Azure, GCP) to specialized clouds for high-performance use cases. After the launch of ChatGPT, we noticed GPU capacity thinning across major providers and emerging GPU and HPC clouds, so we decided it was the right time to build a single interface for IaaS across clouds.

With the explosion of Llama 2 and open source models, we are seeing individuals, startups, and organizations struggling to access A100s and H100s for model fine-tuning, training, and inference.

This encouraged us to help everyone access compute and increase flexibility with their cloud infra. Right now, we’ve built a platform that allows users to find GPU availability and launch instances from a unified platform. Our long term goal is to build a hardwareless GPU cloud where you can leverage managed ML services to train and infer in different clouds, reducing vendor lock-in.

We shipped a few features to help teams access GPUs today:

- a “single plane of glass” for GPU availability and prices;

- a “single control plane” for provisioning GPUs in any cloud through our platform and API;

- a reservation system that monitors real time availability and launches GPUs as soon as they become available.

Next up, we’re building multi-cloud load balanced inference, streamlining self hosting open source models, and more.

You can try our platform at https://platform.shadeform.ai. You can provision instances in your accounts by adding your cloud credentials and api keys, or you can leverage “ShadeCloud” and provision GPUs in our accounts. If you deploy in your account, it is free. If you deploy in our accounts, we charge a 5% platform fee.

We’d love your feedback on how we’re approaching this problem. What do you think?

24 comments

thecupisblue|2 years ago

First off, the color and the font of the hero look so neat together. Just giving straight up simple, professional but modern vibes. Good job whoever picked it!

Now, regarding the product - this is amazing. From both the perspective of saving time and money digging through providers to the part that actually I find the most impacting - the simplification of the AWS console mess to a niche use case. While I understand GPU's are the hot thing now and there is a scramble for a single Flop, if you ever decide to pivot, I'd gladly pay more money each month to use such a simplified niche AWS/Generic cloud console.

Can't wait to have a chance to play with this more, keep up the good work and good luck!

Cholical|2 years ago

Thank you for the kind words! I'm Ronald, one of the cofounders of Shadeform.

Simplifying provisioning instances in AWS is definitely one of our goals! With our current AWS integration, we are setting up a VPC networking stack so all users have to do is worry about picking their instance. We also hope to integrate more cloud features and managed services that will make this a fully-fledged cross cloud console.

alando46|2 years ago

Problem for our use case is saving on gpus is pointless if we have to keep paying egress fees for our 250 TB training dataset.

The single interface for any cloud GPU is cool, but hard to imagine it taking off without some additional features.

I think for lots of shops the hardest part isn't the compute but moving the data around. Ie for us, we use s3, some lustre caching and spot instance nodegroups. We are a deep learning research team that spends roughly 40-50k/month on aws compute for training jobs. I imagine this is somewhat mid tier, maybe more than some but certainly far less than others.

For inference, data egress costs could be less of an issue, but your service would really need to be almost invisible. It probably would be pretty complicated for a number of reasons, but if you could design a "virtual on-demand nodegroup"™ that I could add to my existing clusters and then map to whatever k8s stuff I want, that would probably be useful. I would need to be able to auto deploy a base image to the machine and then provision the node and register with my cluster.

Just some unorganized thoughts. Good luck and have fun.

marcopicentini|2 years ago

What are your use case that justify a cost 40-50k/month just for training?

edgoode|2 years ago

Here are two demos of provisioning and reserving GPUs through our platform:

Provisioning: https://www.youtube.com/watch?v=7WyKPMS80Pk

Reservations: https://www.youtube.com/watch?v=Ab5GmfMYWKA

doctorpangloss|2 years ago

- SSH access isn’t super useful. If I have to author a bootstrapping script for my system it’s too much friction.

- the people who thrive at this use orchestration, like Slurm or Kubernetes. So the nodes I buy should join automatically to my orchestration control plane.

- people who don’t use orchestration or don’t own their orchestration will not run big jobs or be repeat customers. It doesn’t make sense to use nonstandard orchestration. I understand that it is something that people do, but it’s dumb.

- so basically I would pay for a ClusterAutoscaler across clouds. I would even pay a 5% fee for it automatically choosing the cheapest of the fungible nodes. I am basically describing Karpenter for multiple clouds. Then at least the whole offering makes sense from a sophisticated person’s POV: your Karpenter clone can see eg a Ray CRD and size the nodes, giving me a firm hourly rate or even upfront price to approve.

- I wouldn’t pay that fee to use your control plane, I don’t want to use a startup’s control plane or scheduler.

- I’m not sure why the emphasis on GPU availability or blah blah blah. Either AWS/GCE/AKS grants you quota or it doesn’t. Your thing ought to delegate and automate the quota requests, maybe you even have an account manager at every major cloud for that to bundle it all.

- as you probably have noticed, the off brand clouds play lots of games with their supposed inventory. They don’t have any expertise running applications or doing networking, they are ex crypto miners. I understand that they offer a headline price that is attractive but for an LLM training job, they “vast”ly overpromise their “core” offering.

- if you really want to save people money on GPUs, buy a bunch of servers and rack them and sell a lower hourly rate.

edgoode|2 years ago

Thank you for the feedback. We're still early in this and are planning on moving in some of the directions you mentioned.

- We agree that moving towards 'Karpenter for multiple clouds' would be more valuable for some use cases and hope to support that feature soon.

- We do help customers with one-off quota requests, and it is a feature we want to bake into our platform on top of aggregating demand in our accounts. Many companies with AWS/GCE/AKS quota still cannot reliably get on-demand instances due to capacity shortages.

unknown|2 years ago

[deleted]

lucasfcosta|2 years ago

Congrats on the launch!

My co-founder and I always joke that there are only two hair-on-fire problems in 2023 and they can be summarised in 6 letters: GPU & PMF.

Really love what you're building.

edgoode|2 years ago

Appreciate it! We're hoping we can help solve one while achieving the other.

nextaccountic|2 years ago

What's PMF?

mike_d|2 years ago

Be super careful inserting yourself as a reseller of GPUs (ShadeCloud).

You'll quickly find that your platforms primary use is to turn stolen credit cards into cryptominers.

latchkey|2 years ago

crypto doesn't use gpus to mine any more. after ETH switched to PoS, the whole gpu mining world was decimated (thankfully). even on a free tier, you'd be lucky to get a few dollars a day, for a whole lot of upfront work.

that said, i agree that you do have to be careful reselling anything... people will find nefarious uses, it just isn't mining anymore.

Takennickname|2 years ago

Surprisingly little engagement with this post. I'm not in the market but can people who use gpus but didn't find their offering attractive explain why?

fmxexpress|2 years ago

The pricing or lack there of is a huge turn off. We use Replicate and Runpod. On Replicate everyone just shares the GPU cloud. On Runpod I had to bundle 8 together into my own cloud but I can't see that this service solves my problems (queueing image generation and restarting jacked stable diffusion nodes).

marcopicentini|2 years ago

It’s like Cloud66 but with the GPU in the headline, isn’t? What’s difference with Cloud66?

71a54xd|2 years ago

Any plans to add providers like TensorDock or Vast?

edgoode|2 years ago

We're working on adding those as well.