Smaller clusters are common because the right size depends on what you are trying to accomplish. It's the same problem as in any distributed system: going wider with a map only pays off if you can perform a timely reduce. With checkpointing, you run into things like network storage being too slow. You might also want to work with many models, and what you need for production (inference) is different from what you need in development (training).
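To make the map/reduce point concrete, here is a toy cost model. It is only a sketch: the function names, the linear reduce cost, and the numbers are all made up for illustration and are not measurements from any real cluster.

```python
# Toy model (hypothetical numbers): the map phase splits evenly across
# workers, while the reduce phase gets more expensive as workers are added.

def step_time(workers, work=100.0, reduce_cost_per_worker=0.5):
    """Estimated time for one step on `workers` machines."""
    map_time = work / workers                             # shrinks as you go wider
    reduce_time = reduce_cost_per_worker * (workers - 1)  # assumed linear growth
    return map_time + reduce_time


def best_cluster_size(max_workers, **kwargs):
    """Worker count in 1..max_workers with the lowest estimated step time."""
    return min(range(1, max_workers + 1), key=lambda n: step_time(n, **kwargs))


if __name__ == "__main__":
    for n in (1, 8, 14, 64):
        print(n, round(step_time(n), 2))
    print("best:", best_cluster_size(64))
```

Under these assumptions the step time stops improving somewhere in the low teens of workers and gets worse after that, because the reduce term dominates. A real reduce (for example a tree or ring all-reduce) scales differently, but the shape of the trade-off is the same.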