Ask HN: Is LLM training infra still broken enough to build a company around?
3 points | harsh020 | 5 days ago
Instead of working on the model itself, we spent days dealing with:

- CUDA version mismatches
- Driver / PyTorch conflicts
- OOM crashes when scaling to multi-GPU
- Broken or outdated open-source training scripts
- Gluing together tracking + eval + deployment manually
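For concreteness, this is roughly the kind of preflight check that burns time before any real training starts (illustrative sketch only; the memory threshold is made up, and real setups need driver and toolkit checks too):

    # Rough sketch of an environment sanity check before launching a run.
    # Purely illustrative; the 20 GiB threshold is an assumption, not a real requirement.
    import torch

    def check_environment(min_gib: float = 20.0) -> None:
        print(f"PyTorch {torch.__version__}, built against CUDA {torch.version.cuda}")
        if not torch.cuda.is_available():
            raise RuntimeError("CUDA not available -- driver / toolkit mismatch?")
        for i in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(i)
            total_gib = props.total_memory / 1024**3
            print(f"GPU {i}: {props.name}, {total_gib:.1f} GiB total")
            if total_gib < min_gib:
                print("  warning: likely OOM for multi-GPU fine-tuning at this scale")

    check_environment()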
It felt like we were rebuilding the same orchestration layer every team probably rebuilds.

- Cloud providers give raw GPUs.
- MLOps tools give experiment tracking.
- Open-source gives training scripts.
But the end-to-end workflow (dataset → fine-tune → monitor → evaluate → deploy → retrain) still feels stitched together.
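Concretely, the stitched-together version today looks something like this (a minimal LoRA sketch on the transformers + peft stack; model name, hyperparameters, and the toy dataset are all placeholders, and dataset prep, tracking, eval, and deployment glue are omitted):

    # Minimal LoRA fine-tuning sketch -- everything here (model name, hyperparameters,
    # toy dataset) is a placeholder; tracking / eval / deploy glue not shown.
    from datasets import Dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    model_name = "mistralai/Mistral-7B-v0.1"  # placeholder base model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token  # many base models ship without a pad token
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

    # Wrap the base model with LoRA adapters instead of doing a full fine-tune
    lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                          target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    model = get_peft_model(model, lora_cfg)

    # Toy dataset standing in for the real data pipeline
    texts = ["example training document one", "example training document two"]
    train_dataset = Dataset.from_dict(dict(tokenizer(texts)))

    args = TrainingArguments(output_dir="lora-out", per_device_train_batch_size=1,
                             gradient_accumulation_steps=8, num_train_epochs=1,
                             learning_rate=2e-4, logging_steps=10)
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset,
                      data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
    trainer.train()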
We’re exploring building an opinionated platform that lets you:

1. Select a base model (e.g. Llama/Mistral-style open models)
2. Upload or connect datasets
3. Choose an infra tier
4. Launch LoRA or full fine-tuning
5. Monitor loss + cost in real time
6. Run built-in eval
7. Deploy with one click
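To make that concrete, here is a hypothetical sketch of the client-side API we have in mind. Every class, method, and parameter name below is invented for illustration; nothing about the final design is settled:

    # Hypothetical client-side sketch of the workflow above.
    # None of these names exist yet; this only illustrates the intended abstraction level.
    from dataclasses import dataclass

    @dataclass
    class FineTuneJob:
        base_model: str
        dataset_path: str
        infra_tier: str
        method: str = "lora"

        def launch(self) -> str:
            # In a real platform this would provision GPUs, resolve CUDA/driver/PyTorch
            # versions, shard across devices, and stream loss + cost metrics back.
            print(f"Launching {self.method} fine-tune of {self.base_model} "
                  f"on {self.infra_tier} with data from {self.dataset_path}")
            return "job-0001"  # placeholder job id

    job = FineTuneJob(base_model="llama-style-7b",
                      dataset_path="s3://my-bucket/train.jsonl",
                      infra_tier="2x80GB")
    job_id = job.launch()
    # Imagined follow-on steps: monitor(job_id), evaluate(job_id), deploy(job_id)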
Basically: abstract away the CUDA + orchestration layer.
Before we go too deep, I’d love honest feedback:

- Is this still a painful problem at your company?
- Would serious AI teams use this, or do larger companies just build infra in-house?
- Is this doomed to be a hobbyist tool?
- Where would the real wedge be: training, evaluation, or continuous retraining?
We’ve launched a simple landing page and started building, but we’re still early and trying to validate whether this is a real infra gap or just our own frustration.
Would appreciate blunt feedback.
genxy | 5 days ago
This shouldn't take days, and CC can already set up all of this with whatever level of rigor you need.
Your business will get replaced with a prompt.