
DeepSeek-V3 Technical Report

132 points | signa11 | 11 months ago | arxiv.org | reply

34 comments

[+] Centigonal|11 months ago|reply
The GPU-hours stat here allows us to back out some interesting figures around electricity usage and carbon emissions if we make a few assumptions.

2,788,000 GPU-hours * 350W TDP of H800 = 975,800,000 GPU Watt-hours

975,800,000 GPU Wh * (1.2 to account for non-GPU hardware) * (1.3 PUE [1]) = 1,522,248,000 Total Wh, or 1,522,248 kWh to train DeepSeek-V3

(1,522,248 kWh) * (0.582 kg CO2eq/kWh in China [2]) = 885,948 kg CO2 equivalents to train DeepSeek-V3

A typical US passenger vehicle emits about 4.6 metric tons of CO2 per year. [3]

885,948 kg CO2 per DeepSeek / 4,600 kg CO2 per car = 192.6 cars per DeepSeek

So, the final training run for DeepSeek-V3 emitted about as much greenhouse gas as running roughly 193 additional cars on the road for a year.

I also did some more math and found that this training run used about as much electricity as 141 US households would use over the course of a year. [4]

[1] https://enviliance.com/regions/east-asia/cn/report_10060

[2] https://ourworldindata.org/grapher/carbon-intensity-electric...

[3] https://www.epa.gov/greenvehicles/greenhouse-gas-emissions-t...

[4] divided total kWh by the value here: https://www.eia.gov/tools/faqs/faq.php?id=97&t=3
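
For anyone who wants to poke at the assumptions, here is the same back-of-the-envelope calculation as a small Python sketch. The overhead factor, PUE, grid intensity, and car figure are the ones cited above; the ~10,800 kWh/year household figure is my reading of the EIA link in [4], so treat it as approximate.

    # Rough estimate of DeepSeek-V3 training energy and emissions (same assumptions as above).
    GPU_HOURS = 2_788_000            # reported H800 GPU-hours for the final training run
    GPU_TDP_W = 350                  # H800 TDP in watts
    NON_GPU_OVERHEAD = 1.2           # assumed factor for CPUs, NICs, cooling, etc.
    PUE = 1.3                        # assumed data-center power usage effectiveness [1]
    GRID_KG_CO2_PER_KWH = 0.582      # carbon intensity of electricity in China [2]
    CAR_KG_CO2_PER_YEAR = 4_600      # typical US passenger vehicle [3]
    HOUSEHOLD_KWH_PER_YEAR = 10_800  # approximate US household annual use, per [4]

    gpu_wh = GPU_HOURS * GPU_TDP_W                       # watt-hours drawn by the GPUs alone
    total_kwh = gpu_wh * NON_GPU_OVERHEAD * PUE / 1000   # whole-cluster energy in kWh
    co2_kg = total_kwh * GRID_KG_CO2_PER_KWH             # kg CO2eq

    print(f"{total_kwh:,.0f} kWh")                                       # ~1,522,248 kWh
    print(f"{co2_kg:,.0f} kg CO2eq")                                     # ~885,948 kg
    print(f"{co2_kg / CAR_KG_CO2_PER_YEAR:.0f} car-years")               # ~193
    print(f"{total_kwh / HOUSEHOLD_KWH_PER_YEAR:.0f} household-years")   # ~141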

[+] hugs|11 months ago|reply
the nice thing about ai's energy usage is that no one complains about bitcoin's energy usage anymore. (i'm kidding, people still complain.)
[+] pogue|11 months ago|reply
Are the stats from training ChatGPT, Claude or other models public? It would be interesting to see a comparison to them.
[+] skummetmaelk|11 months ago|reply
The fact that you can unironically put the "only" modifier on a training time of 2.8 million GPU hours is nuts.
[+] chvid|11 months ago|reply
If they have a cluster with 2,000 H800 GPUs (which is what they have stated in public), training would take 2,800,000 / (2,000 × 24 × 30) ≈ 2 months.

A cluster of 2,000 GPUs is what a second-tier AI lab has access to. And it shows that you can play in the state-of-the-art LLM game with some capital and a lot of brains.
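
As a quick sanity check on that estimate (a trivial sketch, assuming all 2,000 GPUs stay busy around the clock and using the reported 2,788,000 GPU-hour figure):

    GPU_HOURS = 2_788_000   # reported GPU-hours for pre-training
    CLUSTER_GPUS = 2_000    # cluster size assumed in the parent comment
    HOURS_PER_DAY = 24

    days = GPU_HOURS / (CLUSTER_GPUS * HOURS_PER_DAY)
    print(f"{days:.0f} days, ~{days / 30:.1f} months")   # ~58 days, ~1.9 months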

[+] andai|11 months ago|reply
Can someone put this into perspective? I'm finding heterogeneous data on other models, i.e. number of tokens, number of GPUs used, cost, etc. It's hard to compare it all.
[+] danielhanchen|11 months ago|reply
Re DeepSeek-V3-0324: I made some 2.7-bit dynamic quants (230 GB in size) for those interested in running them locally via llama.cpp! Tutorial on getting and running them: https://docs.unsloth.ai/basics/tutorial-how-to-run-deepseek-...
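
Not part of the linked tutorial, but roughly, fetching the GGUF shards from Hugging Face before pointing llama.cpp at them looks like the sketch below. The repo id and filename pattern are illustrative guesses on my part; the Unsloth docs above have the exact names.

    # Sketch: download the ~2.7-bit dynamic quant shards with huggingface_hub.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="unsloth/DeepSeek-V3-0324-GGUF",   # hypothetical repo id, check the docs
        allow_patterns=["*UD-Q2_K_XL*"],           # hypothetical pattern for the 2.7-bit quant
        local_dir="DeepSeek-V3-0324-GGUF",
    )
    # Then run llama.cpp (llama-cli or llama-server) against the first shard in local_dir.
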
[+] behohippy|11 months ago|reply
These articles are gold, thank you. I used your Gemma one from a few weeks back to get Gemma 3 performing properly. I know you guys are all GPU, but do you do any testing on CPU/GPU mixes? I'd like to see the prompt processing (pp) and tokens/s on a pure 12-channel EPYC, and the same with a 24 GB GPU accelerating the prompt processing.
[+] kristjansson|11 months ago|reply
Hasn't been updated for the -0324 release unfortunately, and diff-pdf shows only a few small additions (and consequent layout shift) for the updated arxiv version on Feb 18.
[+] gdiamos|11 months ago|reply
Nice to see a return to open source in models and training systems.
[+] benob|11 months ago|reply
I like that they give advice to hardware manufacturers:
- offload communication to a dedicated co-processor
- implement decent precision for accumulating FP8 operations
- finer-grained quantization
...
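
For readers unfamiliar with the last point: finer-grained quantization means giving each small block of a tensor its own scale instead of one scale per tensor, so a single outlier only degrades its own block. A minimal NumPy sketch of the idea, using symmetric int8 as a stand-in for FP8 (NumPy has no FP8 type) and an illustrative block size of 128:

    import numpy as np

    def blockwise_quantize(x, block=128):
        # One scale per `block` values; int8 stands in for FP8 here.
        pad = (-len(x)) % block
        xp = np.pad(x, (0, pad)).reshape(-1, block)             # split into blocks
        scales = np.abs(xp).max(axis=1, keepdims=True) / 127    # one scale per block
        scales[scales == 0] = 1.0                               # avoid divide-by-zero
        q = np.clip(np.round(xp / scales), -127, 127).astype(np.int8)
        return q, scales

    def dequantize(q, scales, n):
        return (q.astype(np.float32) * scales).reshape(-1)[:n]

    x = np.random.randn(1000).astype(np.float32)
    x[3] = 50.0                        # an outlier only hurts its own 128-value block
    q, s = blockwise_quantize(x)
    print(f"max abs error: {np.abs(dequantize(q, s, len(x)) - x).max():.4f}")

The report's suggestion, as the comment summarizes, is that hardware should support this kind of per-block scaling (and higher-precision FP8 accumulation) natively instead of leaving it to software.
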
[+] system2|11 months ago|reply

[deleted]

[+] 0x008|11 months ago|reply
This model is open source and beats all proprietary models in benchmarks. How is this stagnant?
[+] litbear2022|11 months ago|reply
Yeah! Just steal the new Boeing 6th-gen stealth fighter from slides.
[+] nurettin|11 months ago|reply
You mean invent something new, publish the entire process and watch everyone rename and implement it next week like <think> blocks?
[+] hulitu|11 months ago|reply
> OpenAI is making announcements

That's what they are good at. /s