top | item 23845020

Building a $5k ML Workstation with Titan RTX and Ryzen Threadripper [video]

128 points | jeffheaton | 5 years ago | youtube.com

157 comments

[+] gameswithgo|5 years ago|reply
If you go with air cooling on a Threadripper, I suggest going with a Noctua cooler instead of the Dark Rock. be quiet! extended the size of the Dark Rock's heat plate to match the TR CPU size, but they didn't cover it with heat pipes; Noctua did. Cooling performance really suffers on the 3990X because there are chiplets at the edge of the CPU. On the 32 and 24 core models it may not matter so much.

See: https://www.kitguru.net/components/cooling/luke-hill/threadr...

On non-Threadripper CPUs I actually like the Dark Rock better. Cooling is the same as Noctua, but it looks cooler and was quieter for me.

[+] switchbak|5 years ago|reply
I went with the U14S for my 24 core TR, thinking I was crazy due to AMD's recommendation for robust water cooling.

I was worried at first when running heavy multicore benchmarks because the heat spiked so quickly. Turns out my workload scales quite poorly (boo), so I'm rarely pushing the temp envelope at all.

I did notice that a two fan setup on this was pretty noisy though, too much to bear sitting next to, so I threw it in the garage and ran some cables. Nice for summer temps and no AC in the house too!

I'm plenty happy with it now, even if the Noctua doesn't quite fit my case.

[+] wincy|5 years ago|reply
Also consider the new IceGiant ProSiphon Elite [0]. Mine is arriving in September (due to coronavirus-related delays), but initial tests of prototypes by LTT [1] and others showed better cooling than AIO water cooling. Also, since it uses a dielectric fluid, there's no risk of a leak frying your expensive computer. It's $169 MSRP, but based on what I've seen it seems worth it. I'm not associated with them in any way; I just think it looks like a cool new product!

[0] https://www.icegiantcooling.com/

[1] https://m.youtube.com/watch?v=M13dWRL9qkc

[+] bob1029|5 years ago|reply
Noctua is an automatic default for me now. I've got the NH-U14S on my 2950X and a NH-D15 on my 1800X. Never have any problems with these. Easy to install and maintain. Will probably reuse both when I upgrade my CPUs.
[+] logjammin|5 years ago|reply
I think this is good advice in general -- you can't really ever go wrong with Noctua.

I built a Threadripper workstation last year but went with liquid cooling. However, I put 3 Noctua fans on the radiator and haven't looked back. Terrific company.

[+] trzeci|5 years ago|reply
I'll add that the Dark Rock Pro TR4 is pretty bulky. I had a problem with an Asus Zenith Extreme (note it's the older generation, for the 2950X): the heatsink covers my PCIe #1 slot, so I can't put a graphics card there.
[+] keeganpoppen|5 years ago|reply
Yep, I went with Noctua with the dual fans in a push/pull configuration on a 3960X, and it's worked great so far. Except that the two fans aren't quite the same color, sigh.
[+] bicknyers|5 years ago|reply
Also, if you go air cooling and are confident in your abilities, consider delidding to drop temps further (5 to 20 °C). If you go air cooling I assume it's on the basis of long-term stability, so don't use liquid metal either. Also invest in a nice PSU (Gold minimum), with your peak load pulling only 75% of the rated max wattage.

Edit: Like most things, look at components' real-world testing figures (in this case, wattage) rather than TDP when planning.
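To make the 75% rule concrete, here's a quick sizing sketch (the component wattages are illustrative round numbers, not measured figures for any particular part):

```python
# Rough PSU sizing under the "peak load <= 75% of rated wattage" rule.
peak_draw = {
    "cpu": 280,   # e.g. a Threadripper at full load (illustrative)
    "gpu": 320,   # one high-end card at peak (illustrative)
    "rest": 100,  # board, drives, fans, plus headroom for spikes
}

total = sum(peak_draw.values())  # 700 W peak
required = total / 0.75          # rated wattage so peak is only 75% of it

# Round up to the next common PSU size
sizes = [650, 750, 850, 1000, 1200, 1600]
choice = next(s for s in sizes if s >= required)
print(f"{total} W peak -> buy a {choice} W PSU")  # 700 W peak -> buy a 1000 W PSU
```

Note this uses real-world peak draw, per the edit above, not the TDP printed on the box.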

[+] sabalaba|5 years ago|reply
Good choice on the 24 GB Titan RTX (so you can do at least batch size = 1 for BERT-Large). Not sure if that's the reason it was chosen, though, to be honest. If you only want to do convnets, you would do better with NVLink'd 2080 Tis.

Secondly, I would suggest that you guys not use Windows, but instead Ubuntu 18.04 or 20.04 LTS, and just install Lambda Stack (https://lambdalabs.com/lambda-stack-deep-learning-software). It's a Debian PPA that we maintain at Lambda to keep all of your deep learning software (NVIDIA drivers, CUDA, cuDNN, TensorFlow, and PyTorch) up to date with just apt. It's free!

[+] mastazi|5 years ago|reply
Interesting! Is Lambda Stack going to work on 20.04? The link mentions only 16.04 and 18.04.
[+] icelancer|5 years ago|reply
Been very happy with Lambda Stack at our company!
[+] mushufasa|5 years ago|reply
How does that compare to the Pop!_OS NVIDIA drivers, default on their downstream-from-Ubuntu distro?
[+] neilv|5 years ago|reply
If you only have $1K or less to spend, and you don't already have a sufficient PC that you can upgrade with a big GPU...

A non-Threadripper Ryzen, a big GPU, and a big PSU in a big case will go most of the way for most people, and leave you with an easy incremental upgrade path for bigger GPUs (or maybe add a second GPU).

Slightly dated info for my current ML server, which is nicely quiet in my living room, thanks to Noctua: https://www.neilvandyke.org/machine-learning/

(Side note that's not on that page: I like to use older ThinkPads with transplanted vintage keyboards for my workstations, so I needed to make a separate box for the GPU. But life would be easier, with a lot less juggling complexity, if I simply had the big GPU in my laptop instead.)

[+] disgruntledphd2|5 years ago|reply
I recently bought a P73 thinkpad specced out like this, and it's great.

However, putting a GPU and lots of RAM into a laptop makes it very, very heavy, so it's worth thinking about whether that's acceptable for you.

[+] CoolGuySteve|5 years ago|reply
If you do plan on buying a GPU you should wait for the Ampere-based 3000 series to come out sometime in the next few months.

It's a process shrink so the performance gain per dollar should be comparable to the 900 -> 1000 series transition.

[+] juped|5 years ago|reply
Threadripper has its own socket type, so I'd go with a cheaper or older one of those. Though I think third-gen Threadripper is another socket entirely (sTRX4 rather than TR4).
[+] m0zg|5 years ago|reply
Here's my recommendation (I've built several such machines for my own use):

1. Go with a 1600W PSU from EVGA or Corsair. Other brands are hit or miss if you ever need very high current on the rails. This will manifest as your machine suddenly powering off when all 4 GPUs are hit with data at once (as is typical at the start of an epoch).

2. Use a mobo with evenly spaced GPU slots, such as the ASRock TRX40 Creator. That way you can eventually install 4 GPUs and use that 1600W PSU. You also get 10GbE for distributed training, which is nice.

3. Don't waste money on a Titan RTX; get 2x 2080 Tis instead. Then after a while get two more. Buy blower cards, which blow hot air _out_ of the case.

4. Use an extension cable to install the SSD, and do not install it under a GPU - it'll die eventually due to overheating.

5. Air cooling is fine.

6. If you have more than 2 GPUs, learn how to adjust fan speeds on the GPUs. Crank them to 85-100% while training to prevent throttling.
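For point 6, on Linux with NVIDIA's proprietary driver, fan speeds can be forced with nvidia-settings. A sketch (assumes a running X session; manual fan control must first be unlocked via the Coolbits option, and the gpu/fan indices vary by system):

```shell
# One-time setup: unlock manual fan control (writes Coolbits into
# xorg.conf; restart X afterwards):
sudo nvidia-xconfig --cool-bits=4

# Then force ~90% fan speed on GPU 0 / fan 0 while training:
nvidia-settings -a "[gpu:0]/GPUFanControlState=1" \
                -a "[fan:0]/GPUTargetFanSpeed=90"
```

With multiple cards, repeat for each `[gpu:N]`/`[fan:N]` pair, and set `GPUFanControlState=0` to hand control back to the driver.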

[+] brian_herman__|5 years ago|reply
Here is their list:

PCPartPicker Part List: https://pcpartpicker.com/list/Jhyzcq

CPU: AMD Threadripper 3960X 3.8 GHz 24-Core Processor ($1348.00 @ Amazon)

CPU Cooler: be quiet! Dark Rock Pro TR4 59.5 CFM CPU Cooler ($89.90 @ Amazon)

Motherboard: MSI TRX40 PRO WIFI ATX sTRX4 Motherboard ($389.99 @ B&H)

Memory: Corsair Vengeance RGB Pro 64 GB (4 x 16 GB) DDR4-3200 CL16 Memory ($329.99 @ Amazon)

Storage: Sabrent Rocket 4.0 2 TB M.2-2280 NVME Solid State Drive ($399.98 @ Amazon)

Video Card: NVIDIA TITAN RTX 24 GB Video Card ($2499.99 @ Newegg)

Case: Corsair Crystal 570X RGB ATX Mid Tower Case ($179.99 @ B&H)

Power Supply: Corsair RMx 1000 W 80+ Gold Certified Fully Modular ATX Power Supply ($204.99 @ Best Buy)

Case Fan: Corsair LL120RGB LED 43.25 CFM 120 mm Fans 3-Pack ($120.99 @ Best Buy)

Total: $5563.82

Prices include shipping, taxes, and discounts when available

Generated by PCPartPicker 2020-07-15 11:13 EDT-0400

[+] Datenstrom|5 years ago|reply
I don't know if they are still running the deal but Nvidia was offering $500 off the Titan RTXs if you sign up for their developer program.

Edit:

Note that they can't be used in multi-GPU builds because they (purposefully) do not have a blower configuration. Unless you can source 2080 Ti blowers, which have the same layout, or do a water cooling build, you'll get thermal throttling.

[+] p1esk|5 years ago|reply
You could spend half as much on every single one of the listed components with zero impact on your ML productivity. $330 for 64 GB of RAM, really?
[+] paol|5 years ago|reply
It's worth noting that if your ML work is entirely CUDA based (as often happens), you likely won't benefit from a Threadripper CPU. Downgrading to a Ryzen 9 or even 7 will reduce costs by a good bit. The savings can be pocketed or put toward a second Titan RTX + NVLink (48 GB of usable VRAM).
[+] _5659|5 years ago|reply
I'm a bit concerned that the build uses a Gold certified power supply unit.

Even for cheaper, non-ML workstation builds I would only use Platinum and nothing less. I've been told Titanium is excessive, but I leave these things on for a while, and power is expensive.

For the DIY enthusiast or the WFH researcher, the amount of heat involved can also mean considerable cooling/utility cost, which varies substantially by floor of a building. Air cooling this many GPUs, as I've done in the past, is probably not good but not that bad - it definitely means I pay a lot for A/C in the summer but almost nothing in the winter.

I think Smerity even said he heated his small bedroom through the San Francisco winter off of one GPU while researching YOLO.

Point: these things get hot, and they require a lot of electricity. You should care about a good PSU even for smaller builds. The energy cost for my 6-GPU rig ran me about 1/3 of my total rent for a small apartment - and that's electricity before my A/C bill, which was separate and also substantial. My landlord hates me because I initially talked him into including it with my rent.

All in all, it still makes sense to keep investing in local workstations and on-premises builds. No security concerns about a cloud, no futzing around with integrated notebooks, you own it and you control it, and the up-front price point is extremely attractive compared to base rates for cloud computing, even on specialized hardware like a TPU.

The numbers I come up with for batch jobs still show a gap of several thousand USD most of the time - and then there's how much time it takes, and how likely their service is to break.

So kudos to the person who put in the effort to put this together and share it. Any and all effort toward making ML/DS affordable and DIY raises the tide for all boats.

Question to the audience: Does anyone build GPU rigs like this for cryptocurrency anymore? I was only able to build a workstation once the price for GPU cards crashed.

[+] svnpenn|5 years ago|reply
What do people use ML for these days? I do computer programming, and I have done some work with video encoding, but this just seems like a huge investment money-wise. So I am curious what use it is.

For my needs, the most intensive things I do are compiling some large programs and encoding some large videos, and you can get a computer for that for like $800.

[+] proverbialbunny|5 years ago|reply
ML is typically used to find correlations in data. If something happens over and over again, there is a high chance it will happen again, and an algorithm that has identified the correlation can spot when it is about to. This enables what is called predictive analytics.

This can be as simple as identifying when a customer will end their service with a business: there may be a pattern in how previous customers behaved before they left, so you predict when new customers are about to leave and give them a coupon or similar right before they otherwise would. This problem is called customer churn.
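To make the churn example concrete, here's a minimal sketch: a one-feature logistic regression fit by plain gradient descent, with every number invented for illustration:

```python
import math

# (days since the customer's last login, did they churn?) -- toy data
data = [(1, 0), (2, 0), (3, 0), (5, 0), (8, 0),
        (20, 1), (25, 1), (30, 1), (40, 1), (45, 1)]

w, b = 0.0, 0.0   # logistic regression parameters
lr = 0.01         # learning rate

def predict(x):
    """Probability of churn for a customer idle for x days."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

for _ in range(5000):                 # plain batch gradient descent
    gw = gb = 0.0
    for x, y in data:
        err = predict(x) - y          # gradient of the log loss
        gw += err * x
        gb += err
    w -= lr * gw / len(data)
    b -= lr * gb / len(data)

# A customer idle for 35 days should look far riskier than one idle for 2.
print(predict(2), predict(35))
```

Real versions (scikit-learn, XGBoost, etc.) differ mainly in scale - many features, regularization, proper train/test splits - but the idea of "learn the pattern that preceded past churn, then score current customers" is the same.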

It can be as complex as identifying ahead of time when hardware will fail - or even "bio-ware". For example, I did a project that predicted, with a high accuracy rate, when people were falling into depression before they themselves could tell. I also predicted other future medical issues ahead of time, like the probability that an elderly person will fall over within the next handful of days.

On the business side there are a lot of use cases for ML, but it falls more into analytics than engineering, as it's about predictive insight.

[+] aunty_helen|5 years ago|reply
Here's an example for what I'm using it for: https://news.ycombinator.com/item?id=23608360

I explained the technical details in the sub comment.

I was looking at buying one of these Titan cards a few weeks back, but then NVIDIA announced that the next-gen GPUs were coming out, so I've decided to wait until they refresh the two-year-old Titan line instead of paying full price for an almost out-of-date card.

When training models for object detection, the current algorithm we're using isn't focused on memory efficiency, so the 8 GB card we currently use to train models can't process images at the correct resolution. We have to downscale by about half to get them to fit. With the Titan RTX you get 24 GB, which is enough.
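As a back-of-envelope illustration of why resolution matters so much (the shapes below are invented, not measurements of any particular detector): activation memory for a conv feature map grows with the square of the image side, so halving resolution roughly quarters it.

```python
def feature_map_bytes(batch, channels, height, width, bytes_per_val=4):
    # float32 activations: N * C * H * W * 4 bytes
    return batch * channels * height * width * bytes_per_val

full = feature_map_bytes(batch=8, channels=256, height=1024, width=1024)
half = feature_map_bytes(batch=8, channels=256, height=512, width=512)
print(full / 2**30, "GiB vs", half / 2**30, "GiB")  # 8.0 GiB vs 2.0 GiB
```

And that's a single layer; a deep network keeps many such maps alive at once for backprop, which is how an 8 GB card runs out of room.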

On another note, the Titan cards aren't the same as the normal GeForce cards. NVIDIA has gone to great lengths to ensure product differentiation, so they can charge power users with business budgets more than people sitting at home playing games. One of the good things about the Titan cards is that they have a dual memory controller, so you can write and read at the same time, which improves your fill rate.

[+] darknoon|5 years ago|reply
Now is a particularly bad time to build a rig, since new NVIDIA cards are launching in a couple months. The value of a used 2080Ti (Turing) will tank, because Ampere cards will be available with similar performance for half the price.
[+] CarbyAu|5 years ago|reply
Agreed. I need to update my gaming rig. Waiting for: the next Ryzen generation; the next round of GPUs from both vendors (although ML folks will likely stay with NVIDIA, of course); and, with luck, a better PCIe 4 SSD will be out by then too.

I really wouldn't build one now unless I had to.

[+] a2h|5 years ago|reply
Interesting video, thanks for sharing. Just curious whether you have one with tests or benchmarks for the completed build and/or temps at high load? Would be cool to see :)
[+] dodo6502|5 years ago|reply
I think that tape-like piece you removed from the SSD compartment is actually the thermal pad that makes contact between the SSD and the MSI heat-sink cover, so you may actually want that!
[+] highfrequency|5 years ago|reply
Thanks for the video! Could you comment on the differences between the Titan RTX and the V100? I am a bit confused because the V100 is significantly more expensive ($7k on Amazon even for the 16GB version) and has a slower clock speed, yet it is the standard in ML research papers. I see that it has ~10% more CUDA cores, but it doesn't seem like this would warrant a 3x price increase.
[+] p1esk|5 years ago|reply
2x 2080 Tis would be faster than a Titan RTX, provide about the same amount of memory, and be cheaper.
[+] SloopJon|5 years ago|reply
If NVIDIA gave me a Titan RTX for free, I would use it too.
[+] Sholmesy|5 years ago|reply
Lots of drawbacks with this approach:

- More heat

- More power consumption

- More noise

- The GPU memory isn't addressable as a single unit
[+] zmmmmm|5 years ago|reply
I am curious about the opposite end of the spectrum. What is the smallest and cheapest self-contained setup that can be a serviceable development box for someone doing ML/AI work? It does not need to run the production load, but it has to be capable enough for local development that is still representative.

So far the best I have identified is an Intel NUC8 plus an NVIDIA GPU via Thunderbolt. But it is still at least $1000 by the time you have it all together.

NB: I know lots of people will say to just do it in the cloud, but I work in a setting where much of my data cannot be put in the cloud, and where the funding cost structure allows for fixed capital expenditure but not variable cloud costs.

[+] plasticchris|5 years ago|reply
Just buy a case, motherboard, cpu, GPU, ram, psu, and build it. At the extreme low end you can buy a refurb Dell tower and drop in a new GPU.
[+] p1esk|5 years ago|reply
This entirely depends on the specific ML work you want to do. Smallest and cheapest could be something like Raspberry Pi or Jetson Nano.

By the way, a $5k ML workstation is still on the cheaper end of the spectrum. An 8x A100 machine will set you back at least $100k. And even that won't be enough to finetune GPT-3.

[+] fomine3|5 years ago|reply
Buy a used ATX tower desktop PC from the Skylake generation (or buy new parts for a Ryzen 3500 build), buy a GPU (a 2070 SUPER for budget/perf?), buy a new 750W PSU, and put the parts together.

A GPU via Thunderbolt looks like the most expensive way.

[+] Jestar342|5 years ago|reply
With the size of air coolers these days, and how they all have integrated heat pipes, I'm beginning to wonder if we've blurred the distinction with liquid coolers.

Holy moly is that a big heatsink.

[+] andrewon|5 years ago|reply
When he said training on Google Colab took one day and on his computer took 20 minutes, did he compare against the Google Colab CPU? The difference seems too large.
[+] colordrops|5 years ago|reply
Being unfamiliar with ML work, when does it make sense to build one of these vs spinning up some instances on AWS or gcloud?
[+] bob1029|5 years ago|reply
I think it really depends on how much you care about ML and how performant you actually need it to be. If you are a hobbyist or prototyping something speculatively for work, perhaps a cloud instance is prudent. If ML is your life's work, I'd probably consider throwing down for a proper rig so you don't get killed on cloud hosting fees.
[+] mikece|5 years ago|reply
Does anyone measure how long it would take such a workstation to pay for itself (including some nominal operational cost for electricity) compared to simply doing ML on AWS/Azure/GCP? Such a metric could be a useful way to compare machines.
[+] CoolGuySteve|5 years ago|reply
A comparable workstation costs about a month of on-demand EC2 time or 3 months of spot instance time.

AWS GPU instances are really expensive.

The most cost effective imo is to build a workstation for development and then deploy to AWS spot if you need a cluster.

If you can't use a workstation for whatever reason, then use the new AWS feature to "stop" spot instances and use the spot instance as your workstation while being conscious of the high hourly cost and shutting it down when you're not working.
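As a rough sketch of the arithmetic behind that estimate (the hourly rates are hypothetical placeholders - plug in current prices for your instance type and region):

```python
build_cost = 5500.0      # roughly the PCPartPicker total above
on_demand_per_hr = 7.0   # hypothetical multi-GPU on-demand rate
spot_per_hr = 2.5        # hypothetical spot rate

hours_per_month = 24 * 30

def months_to_break_even(rate_per_hr):
    """Months of 24/7 cloud use that would cost as much as the build."""
    return build_cost / (rate_per_hr * hours_per_month)

print(round(months_to_break_even(on_demand_per_hr), 1))  # ~1.1 months
print(round(months_to_break_even(spot_per_hr), 1))       # ~3.1 months
```

Of course 24/7 utilization flatters the workstation; if you only train a few hours a day, the cloud side of the comparison looks considerably better.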

[+] mpfundstein|5 years ago|reply
I have a Threadripper 1920X with 2x 2080 Tis.

When running cpuburn I get around 65 C Tdie, and with gpuburn the upper card gets to around 86 and the lower one to 81.

Right now I have a water cooler for the CPU, 3 intake fans (bottom and back), and 2 exhaust fans through the water cooler radiator on top. I was wondering what temperatures I should aim for and what an optimal fan configuration is. I have a couple of fans lying around.

The case is a Lian Li O11 Air and the mobo is a Taichi X399.

Anyone have any tips?

Also, I would want to use SLI, but then I would have to remove the fans on the GPUs. Do I then need to water cool the GPUs, or what is the solution?

If any of the modding pros here can help, that would be awesome :-)

[+] potiuper|5 years ago|reply
Please fix title Tiitan typo.