
Building your own deep learning computer is 10x cheaper than AWS

437 points | walterbell | 7 years ago | medium.com

264 comments

[+] dagw|7 years ago|reply
You're forgetting the cost of fighting IT in a bureaucratic corporation to get them to let you buy/run non-standard hardware

Much easier to spend huge amounts of money on Azure/AWS and politely tell them it's their own fucking fault when they complain about the costs. (What, me? No, I'm not bitter, why do you ask?)

[+] unbearded|7 years ago|reply
Unfortunately, the fight goes even further than that when you go against the cloud.

Last week I was at an event with the CTOs of many of the hottest startups in America. It was shocking how much money is wasted on the cloud because of inefficiencies, and how little they care about how much it costs.

I guess since they are not wasting their own money, they can always come up with the same excuse: developers are more expensive than infrastructure. Well... that argument starts to fall apart very quickly when a company spends six figures every month on AWS.

I'm at the other extreme. I run my company's stuff on ten $300 servers I bought on eBay in 2012 and put inside a soundproof rack in my office in NJ, with a 300 Mbps FiOS connection and Cloudflare as a proxy/CDN. The servers run Proxmox for the private cloud and Ceph for storage. They all have SSDs and some have Optane storage. In 6 years, there have been only 3 outages that weren't my fault. All at the cost of office rent ($1000) + FiOS ($359) + Cloudflare + S3 for images and backups.

With my infrastructure, I can run 6k requests per minute on the main Rails app (+ Scala backend) with a 40ms response time with plenty of resources to spare.

[+] nabla9|7 years ago|reply
That's a very passive-aggressive way to deal with it. If that's your only option, you really are a cubicle slave in corporate hell.

In my opinion it's better to escalate upwards with proposals and not back down easily. You just have to frame it correctly and use the right names and terms.

* Usually big companies understand the concept of "a lab" that has infrastructure managed outside corporate IT. Once you've fought the hard fight, you get your own corner, are left alone to do your job, and can gradually grow it into other things.

* Asking forgiveness works even for large companies. Sometimes someone is not happy and you 'are in trouble' but not really. You just have to have a personality that does not care if somebody gets mad at you sometimes.

[+] pjc50|7 years ago|reply
This is absolutely the right answer as to how AWS got so big in the first place. Capex and IT are huge pain points. Starting something up on the free tier isn't. Once something's running and providing value, spending money on it becomes a "necessity" and the obstructionism goes away.
[+] mlthoughts2018|7 years ago|reply
In most companies, AWS just becomes a new front-end to the same old IT bureaucracy, and dev teams are still disallowed from creating their own instances or EMR clusters or setting up new products like their own Redshift instance or ECR deployment solution.
[+] w8rbt|7 years ago|reply
Yep. No one tells me no anymore, and I can write Lambdas to replace cron jobs, use RDS to replace DBs, use S3 and Glacier to replace storage, etc. Fargate is awesome too. No gatekeepers or bureaucracy, just code and git repos. That's why AWS is so awesome.
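
For example (the bucket, prefix and task here are made up, just to show the shape of it), a cron job like "delete yesterday's temp files at 3am" becomes a tiny Lambda you trigger with a CloudWatch Events / EventBridge schedule instead of a crontab entry:

    # handler.py - rough sketch of a cron-replacement Lambda (bucket/prefix are placeholders)
    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        # same job the old `0 3 * * *` cron entry did, minus the server it ran on
        resp = s3.list_objects_v2(Bucket="my-temp-bucket", Prefix="tmp/")
        for obj in resp.get("Contents", []):
            s3.delete_object(Bucket="my-temp-bucket", Key=obj["Key"])
        return {"deleted": len(resp.get("Contents", []))}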

And I can show exactly how much everything costs, as well as work to reduce those costs.

AWS added to a small, skilled dev team is a huge multiplier.

[+] kamaal|7 years ago|reply
>>politely tell them it's their own ... fault when they complain about the costs.

They don't care. Neither do their bosses. Not even the CFO or the CEO.

The only people who will care are the investors. And sometimes not even them; they will likely just sell and walk.

The only people likely to care are activist investors with large blocks of shares; those types have vested interests in these things.

Part of the reason there is so much waste everywhere is that organizations are not humans, and most people in authority face no real stake or long-term consequences for their decisions. This is true everywhere: religious organizations, companies, governments, etc. Everywhere.

[+] poulsbohemian|7 years ago|reply
It could be very interesting, when we have another economic downturn, to see if attitudes on this change. It certainly seems more cost-effective to run one's own technical operations rather than offloading onto AWS / Google / Microsoft.
[+] VHRanger|7 years ago|reply
They seriously can't buy a graphics card and slap it in the PCIe slot?
[+] chaosbutters314|7 years ago|reply
We just go around them and argue after the fact. It is a pain no matter what. But we actually do both, internal custom builds and cloud computing, so we're set either way.
[+] ConcernedCoder|7 years ago|reply
Innovation being sidelined in favor of the status quo and 'cover-my-ass' risk management highlights exactly the reason I won't work for a bureaucratic corporation...
[+] mandeepj|7 years ago|reply
You always have at least two options, and this case is not extreme -

1. You can blow through any amount if you like to.

2. Or, you can figure out what you are trying to do, then learn how to do it better. There is a cheaper way to run in the cloud too - https://twitter.com/troyhunt/status/968407559102058496

[+] cameldrv|7 years ago|reply
Great post. As someone who's done a few of these, I'll make a few observations:

1. Even if your DL model is running on GPUs, you'll run into things that are CPU-bound. You'll even run into things that are not multithreaded and are CPU-bound. It's valuable to get a CPU that has good single-core performance. (A quick way to check for this is sketched at the end of this comment.)

2. For DL applications, NVMe is overkill. Your models are not going to be able to saturate a SATA SSD, and with the money you save, you can get a bigger one, and/or a spinning drive to go with it. You'll quickly find yourself running out of space with a 1TB drive.

3. 64 GB of RAM is overkill for a single-GPU server. RAM has gone up a lot in price, and you can get by with 32 GB without issue, especially if you have fewer than 4 GPUs.

4. The case, power supply, motherboard, and RAM are all a lot more expensive for a properly configured 4-GPU system. It makes no sense to buy all of this supporting hardware and then only buy one GPU. Buy a smaller PSU, less RAM, and a smaller case, and buy two GPUs from the outset.

5. Get a fast internet connection. You'll be downloading big datasets, and it is frustrating to wait half a day for something to download before you can get started.

6. Don't underestimate the time it will take to get all of this working. Between the physical assembly, getting Linux installed, and the numerous inevitable problems you'll run into, budget several days to a week before you're training a model.
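
To make point 1 concrete, here's the kind of quick check I mean: time the input pipeline by itself and see whether images/sec keeps climbing as you add workers. The dataset path and batch counts below are placeholders, not anything from the article.

    # rough sketch: is the input pipeline (CPU) the bottleneck?
    import time
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    tfm = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])
    ds = datasets.ImageFolder("/data/train", transform=tfm)  # placeholder path

    for workers in (2, 4, 8):
        loader = DataLoader(ds, batch_size=256, num_workers=workers, pin_memory=True)
        n, start = 0, time.time()
        for i, (x, y) in enumerate(loader):
            n += x.size(0)
            if i == 50:  # a fixed handful of batches is enough to compare
                break
        print(f"{workers} workers: {n / (time.time() - start):.0f} images/sec")

    # if throughput keeps rising as you add workers, you're CPU-bound, and
    # single-core speed / core count matters more than the GPU does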

[+] leon_sbt|7 years ago|reply
I ran into this exact build-vs-rent issue about 2 years ago. Ultimately I chose build.

Here are my thoughts/background:

Background: doing small-scale training/fine-tuning on datasets for small-time commercial applications. I find renting top-shelf VM/GPU combos on the cloud to be psychologically draining. Did I forget to shut off my $5-an-hour VM during my weekend camping trip? I hate it when I ask myself questions like that.

I would rather spend the $2k upfront and ride the depreciation curve, than have the "constant" VM stress. Keep in mind, this is for a single instance, personal/commercial use rig.

I feel that DL compute decisions aren't black/white and should be approached in stages.

Stage 0: If you do full-time computer work at a constant location, you should try to own a fast computing rig, DL or not. Having a brutally quick computer makes doing work much less fatiguing. Plus it opens the door to experimenting with CAD/CAE/VFX/photogrammetry/video editing. (4.5 GHz i7-8700K + 32 GB RAM + SSD)

Stage 1: Get a single 11/12 GB GPU, a 1080 Ti or Titan X (some models straight up won't fit on smaller cards). Now you can go on GitHub and play with random models and not feel guilty about spending money on a VM for it.

Stage 2: Get a 2nd GPU. It makes writing/debugging multi-GPU code much easier/smoother (rough sketch at the end of this comment).

Stage 3: If you need more than 2 GPUs for compute, write/debug the code locally on your 2-GPU rig, then beam it up to the cloud for 2+ GPU training. Use preemptible instances if possible for cost reasons.

Stage 4: You notice your cloud bill is getting pretty high ($1k+/month) and you never need more than 8x GPUs for anything you're doing. Start the build for DL runbox #2: SSH/container workloads only, no GUI, no local dev. Basically server-grade hardware with 8x GPUs.

Stage 5: I'm not sure, don't listen to me :)
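
Rough sketch of what I mean by Stages 2/3 (the model and data below are stand-ins, not a real workload): write the loop against DistributedDataParallel once, and the same script runs on the local 2-GPU box with `torchrun --nproc_per_node=2 train.py` and on a bigger cloud instance by bumping --nproc_per_node.

    # train.py - minimal DistributedDataParallel skeleton (stand-in model/data)
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        dist.init_process_group("nccl")        # env:// rendezvous is set up by torchrun
        rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(rank)

        model = DDP(torch.nn.Linear(512, 10).cuda(rank), device_ids=[rank])
        opt = torch.optim.SGD(model.parameters(), lr=0.01)

        for step in range(100):
            x = torch.randn(64, 512).cuda(rank)            # fake batch
            y = torch.randint(0, 10, (64,)).cuda(rank)
            loss = torch.nn.functional.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

    if __name__ == "__main__":
        main()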

[+] kwillets|7 years ago|reply
I'm thinking of starting a prepaid cloud service -- once your $20 is gone it shuts everything off.
[+] chaosbutters|7 years ago|reply
Upgrade to the TR 2990WX and home CAD/CAE/VFX/photogrammetry/video editing becomes super nice.
[+] anaxag0ras|7 years ago|reply
[+] ovi256|7 years ago|reply
Only in non-North American datacenters. In NA, Nvidia can enforce their driver license, which prohibits use of consumer GPUs in datacenters.

A nice advantage of non-consumer GPUs is their bigger RAM size. Consumer GPUs, even the newest 2080 Ti, have only 11 GB. Datacenter GPUs have 16 GB or 32 GB (V100). This is important for very big models. Even if the model itself fits, small memory size forces you to reduce batch size. Small batch size forces you to use a smaller learning rate and acts as a regularizer.
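
If you're stuck on a consumer card, one partial workaround is gradient accumulation: you trade extra steps for a bigger effective batch. It costs time rather than memory, and it doesn't help if the model itself doesn't fit. The model and numbers below are placeholders, just to show the pattern:

    # rough sketch: accumulate gradients over k small batches before stepping,
    # so an 11 GB card can approximate the update of a batch k times larger
    import torch

    model = torch.nn.Linear(512, 10).cuda()       # stand-in model
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    accum_steps = 4                                # effective batch = 4 x per-step batch

    opt.zero_grad()
    for i in range(100):
        x = torch.randn(32, 512, device="cuda")    # small batch that fits in memory
        y = torch.randint(0, 10, (32,), device="cuda")
        loss = torch.nn.functional.cross_entropy(model(x), y) / accum_steps
        loss.backward()                            # gradients add up across iterations
        if (i + 1) % accum_steps == 0:
            opt.step()
            opt.zero_grad()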

[+] StreamBright|7 years ago|reply
If you do not care about the operational stability of your clusters, then OVH is a great option.
[+] psergeant|7 years ago|reply
> Nvidia contractually prohibits the use of GeForce and Titan cards in datacenters. So Amazon and other providers have to use the $8,500 datacenter version of the GPUs, and they have to charge a lot for renting it.
[+] ofrzeta|7 years ago|reply
I wondered how it is possible that they can restrict the use of hardware after it is bought. After reading the article I learned that the key is the license to the drivers.

So I guess in theory it would be possible for AWS to develop their own (or enhance open source) drivers. On the other hand they would spoil the business relationship with Nvidia and have to do without any discounts.

[+] screye|7 years ago|reply
Is Nvidia doing this only because they are a monopoly in the space and are practically extorting here?

Or are there genuine costs associated with datacenter GPU models?

[+] tfolbrecht|7 years ago|reply
Curious about the MTBF (mean time between failures) of a GeForce/Titan-series GPU under continuous utilization in datacenter conditions vs. a desktop computer with intermittent usage. I don't want to believe Nvidia is just out to stiff cloud providers. Maybe it's to protect themselves from warranty abuse?
[+] bnastic|7 years ago|reply
Hasn't stopped Hetzner from offering 1080s in their servers.
[+] Shorel|7 years ago|reply
I wonder if the success AMD is having right now in the CPU space can in the future extend to the GPU space.

It would be awesome. I also wonder if in this case it is an issue of hardware or only something related to the drivers/API.

Can they make something similar/backwards-compatible with CUDA but cheaper/better?

[+] dkobran|7 years ago|reply
NVIDIA is attempting to separate enterprise/datacenter and consumer chips to justify the cost disparity. Specifically, they're introducing memory, precision, etc. limits which have major performance implications for GeForce, and there's also the EULA which has been mentioned here. That said, everything on AWS comes at a premium, as they're making the case that on-demand scale outweighs the pain of management/CapEx. This premium is especially noticeable with more expensive gear like GPUs. At Paperspace (https://paperspace.com), we're doing everything we can to bring down the cost of cloud and, in particular, the cost of delivering a GPU. Not all cloud providers are the same :)

Disclosure: I work on Paperspace

[+] julvo|7 years ago|reply
For me, one of the main reasons for building a personal deep learning box was to have fixed upfront cost instead of variable cost for each experiment. Not an entirely rational argument, but I find having fixed cost instead of variable cost promotes experimentation.
[+] oneshot908|7 years ago|reply
It's been this way since day 1. NVLink remains the only real Tesla differentiator (although mini NVLink is available on the new Turing consumer GPUs, so WTFever). But because none of the DL frameworks support intra-layer model parallelism, all of the networks we see tend to run efficiently in data parallel; doing anything else would make them communication-limited, so data scientists end up building networks that aren't, chicken-and-egg style.

I continue to be boggled that Alex Krizhevsky's One Weird Trick never made it to TensorFlow or anywhere else:

https://arxiv.org/abs/1404.5997

I also suspect that's why so many thought leaders consider ImageNet to be solved, when what's really solved is ImageNet-1K. That leaves ~21K more outputs on the softmax of the output layer for ImageNet-22K, which, to my knowledge, is still not solved. A 22,000-wide output fed by a 4096-wide embedding is 90M+ parameters (22,000 × 4,096 ≈ 90 million, almost 4x as many parameters as the entire ResNet-50 network).

All that said, while it will always be cheaper to buy your ML peeps $10K quad-GPU workstations and upgrade their consumer GPUs whenever a brand new shiny becomes available, be aware NVIDIA is very passive-aggressive about this, following some strange magical thinking that it's OK for academics but not OK for business. My own biased take is that it's the right solution for anyone doing research, and the cloud is the right solution for scaling it up for production. Silly me.

[+] LeonM|7 years ago|reply
Your own hardware is always cheaper to buy than using a cloud service, but keeping it running 24/7 involves substantial costs. Sure, if you run a solo operation, you can just get up during the night to nurse your server, but at some point that no longer makes sense to do.

Somewhere along the way we forgot about this, and it's now perfectly normal to run a blog on a 3-VM Kubernetes cluster on GKE, costing 140 EUR/month.

[+] vidarh|7 years ago|reply
I used to manage hardware in several datacentres, and I'd usually visit the data centres a couple of times a year. Other than that we used a couple of hours of "remote hands" services from the datacentre operator. Overall our hosting costs were about 30% of what the same capacity would have cost on AWS. Once a year I'd get a "why aren't we using AWS" e-mail from my boss, update our costing spreadsheets and tell him I'd happily move if he was ok with the costs, as it would have been more convenient for me personally, and every year the gap came out just too ridiculously huge to justify.

In the end we started migrating to Hetzner, as they finally had servers that came close enough to be worth offloading that work to someone else. Notably, Hetzner only reached cost parity for us; AWS was still just as ridiculously overpriced.

There are certainly scenarios where using AWS is worth it for the convenience or functionality. I use AWS for my current work for the convenience, for example. And AWS can often be cheaper than buying hardware. But I've never seen a case where AWS was the cheapest option, or even one of the cheapest, even when factoring in everything, unless you can use the free tier.

AWS is great if you can justify the cost, though.

[+] mmt|7 years ago|reply
> keeping it running 24/7 involves substantial costs

So does using a cloud service. It's not actually obvious, conceptually, but very little of the admin overhead has to do with the "own hardware" aspect of running it, especially if one excludes anything that has a direct analog at a cloud service.

There certainly exist services that abstract away more of this, but that comes in exchange for higher cost and lower top performance, and it doesn't scale (in terms of cost).

> Sure, if you run a solo operation, you can just get up during the night to nurse your server, but at some point that no longer makes sense to do.

I'd actually argue the reverse. My experience is that the own-hardware portion took at most a quarter of my time, and that remained constant up to several hundred servers. It's much cheaper per unit of infrastructure the more units you have.

The tools and procedures that allow that kind of efficiency were the prerequisite for cloud services to exist.

[+] NightlyDev|7 years ago|reply
It's not expensive at all to keep hardware running compared to cloud hosting. Hardware is really stable and usually runs for years without any issues. You can even use colocation and remote hands; it isn't expensive.
[+] xfitm3|7 years ago|reply
I think the real answer to this question is the unhelpful one: it depends.

I ran a detailed cost analysis of tier-3 on-prem vs AWS about 7 years ago. I included the cost of maintaining servers, support staff salaries, rent, insurance, employee dwell time, etc., and on-prem was still cheaper. Maybe it's different now.

We put significant thought into being cheap. I think constraint can breed innovation.

[+] scosman|7 years ago|reply
Missing one important point: ML workflows are super chunky. Some days we want to train 10 models in parallel, each on a server with 8 or 16 GPUs. Most days we're building datasets or evaluating work, and need zero.

When it comes to inference, sometimes you wanna ramp up thousands of boxes for a backfill, sometimes just a few to keep up with streaming load.

Trying to do either of these on in-house hardware would require buying way too much hardware which would sit idle most of the time, or seriously hamper our workflow/productivity.

[+] Jack000|7 years ago|reply
On the other hand, this comparison accounts for the full cost of the rig, while a realistic comparison should consider the marginal cost. Most of us need a PC anyway, and if you're a gamer the marginal cost is pretty close to zero.
[+] bogomipz|7 years ago|reply
>"Nvidia contractually prohibits the use of GeForce and Titan cards in datacenters. So Amazon and other providers have to use the $8,500 datacenter version of the GPUs, and they have to charge a lot for renting it."

I wonder if someone might provide some clarification on this. Is this to say that it only applies if a reseller buys directly from Nvidia and is compelled by some agreement they signed with Nvidia? How else would it be legal for Nvidia to dictate how and where someone is allowed to use their product? Thanks.

[+] uryga|7 years ago|reply
Another comment in this thread said that it's due to the license on Nvidia's drivers. So technically you can use the hardware in a datacenter, just not with the official drivers. Unfortunately it seems that the open-source drivers aren't usable for most datacenter purposes, so this effectively limits how you can use the hardware (at least in North America, where they can enforce it).
[+] maxehmookau|7 years ago|reply
While in sheer dollar amount this post is probably correct, it doesn't really scale.

At scale, you need more than just hardware. It's maintenance, racks, cooling, security, fire suppression etc. Oh, and the cost of replacing the GPUs when they die.

At full price, yes, cloud GPUs on AWS aren't cheap. But spot instances, where you bid on unused capacity at potentially a 90% saving in some regions/AZs, make cloud servers a much more attractive prospect for ML tasks that can be split over multiple machines.
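
Bidding for that unused capacity is only a few lines of boto3 — something like this, where the AMI ID, key name, price and region are made-up placeholders:

    # rough sketch: request a single GPU spot instance
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    resp = ec2.request_spot_instances(
        SpotPrice="0.90",                        # max $/hour you're willing to pay
        InstanceCount=1,
        LaunchSpecification={
            "ImageId": "ami-0123456789abcdef0",  # placeholder deep learning AMI
            "InstanceType": "p3.2xlarge",
            "KeyName": "my-key",                 # placeholder key pair
        },
    )
    print(resp["SpotInstanceRequests"][0]["SpotInstanceRequestId"])

The catch, of course, is that the instance can be reclaimed at any time, so the workload needs to checkpoint and resume.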

I think this post is comparing one physical machine to a fleet of virtualised ones, and that's not really a fair comparison.

Also, the post refers to cloud storage at $0.10/GB/month, which is incorrect. AWS HDD storage is $0.025/GB/month, and S3 storage is $0.023/GB/month, which is arguably more suited to storing large datasets.

[+] kakwa_|7 years ago|reply
And the same can in fact be said of pretty much any AWS service.

The equivalent of an i3.metal is probably around $30,000 to $40,000 from Dell or HP, and probably half that if self-assembled (e.g. a Supermicro server). An AWS i3.metal will cost around $43,000 annually, which is more than the acquisition cost of a server that will probably last around 5 years.

But once you start taking into account all the logistics, additional skills, people and processes needed to maintain a rack in a DC, plus the additional equipment (network gear, KVMs, etc.), the cost win is far less evident, and it also generally adds delays when product requirements change.

Fronting the capital can be an issue for many companies, especially the smaller ones, and for the bigger ones, repurposing hardware bought for a failed project/experiment is not always straightforward.

[+] montenegrohugo|7 years ago|reply
I really want to believe this. Of course the numbers given depend on very frequent use of your machine, but still. One would imagine that a datacenter built at scale, from which you only rent when you're actually training a model, would be much cheaper, but the reality appears to be otherwise.

So where does the money go?

Three places:

- AWS/Google/Whoever-you're-renting-from obviously get a cut

- Inefficiencies in the process (there are lots of engineers and DB administrators and technicians and other people who have to get paid in the middle).

- Thirdly, and this is what most surprised me, NVIDIA takes a big cut. Apparently the 1080 Ti and similar cards are consumer-only, whilst datacenters & cloud providers have to buy the Tesla line of cards, with the corresponding B-to-B support and price tag ($3k-$8k per card). [1]

So, given these three money-gobbling middlemen, it does seem to kinda make sense to shell out $3,000 for your own machine, if you are serious about ML.

Some small additional upsides are that you get a blazing fast personal PC and can probably run Crysis 3 on it.

[1]https://www.cnbc.com/2017/12/27/nvidia-limits-data-center-us...

[+] gnufx|7 years ago|reply
By coincidence I just posted https://news.ycombinator.com/item?id=18066472 about the expense of AWS et al for NASA's HPC work. (Deep learning, "big data" et al are, or should be, basically using HPC and general research computing techniques, although the NASA load seems mostly engineering simulation.)
[+] gnur|7 years ago|reply
> Even when you shut your machine down, you still have to pay storage for the machine at $0.10 per GB per month, so I got charged a couple hundred dollars / month just to keep my data around.

Curious how that squares with putting only a single 1 TB SSD in the machine, since at $0.10/GB/month a couple hundred dollars a month implies a couple of terabytes of storage.

[+] w8rbt|7 years ago|reply
I have seen sysadmins stand up a bunch of EC2s in AWS and install Postgres and Docker on them (because the devs said they need a DB and a Docker server). They don't get the services model (use RDS and ECS). Sysadmins have to change. Orgs can't afford this cost, nor can they be slowed down by this 1990s mindset.

Standing up a bunch of EC2s in AWS is just a horrible idea, and an expensive one as well. It also moves all of the on-prem problems (patching, backups, access, sysadmins as gatekeepers, etc.) to the cloud. It's absolutely the wrong way to use AWS.

So stop sysadmins from doing that as soon as you notice. Teach them about the services and how, when used properly, they are a real multiplier that frees everyone up to do other, more important things rather than babysitting hundreds of servers.

[+] maaark|7 years ago|reply
> There’s one 1080 Ti GPU to start (you can just as easily use the new 2080 Ti for Machine Learning at $500 more — just be careful to get one with a blower fan design)

I don't believe there are any blower-style 20-series cards. The reference cards use a dual-fan design.

[+] _Wintermute|7 years ago|reply
The ASUS Turbo models are blower-style; no idea how hot they'll run, though.
[+] mullen|7 years ago|reply
Try spot instances, you'll save a ton of money.