In the introduction of the paper it says: "Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours
for its full training. In addition, its training process is remarkably stable. Throughout the entire
training process, we did not experience any irrecoverable loss spikes or perform any rollbacks." They have indeed a very strong infra team.
unknown|1 year ago
[deleted]
ComputerGuru|1 year ago