You are probably right that there is a massive disparity between the two re: compute. That said, it is conceivably possible to use a larger number of weaker chips rather than fewer bigger ones and come out ahead per unit time. Also, given that their strategy seems to be doing as much on-device as possible, they are targeting smaller models, so they likely have less of a packing problem with smaller chips.

So toward the goal of producing a strong model, it could go either way. For smaller models especially, data seems to be much more important than compute, as per the "Textbooks Are All You Need" paper (https://arxiv.org/abs/2306.11644), particularly if you are not looking for a fully generalist model.
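To make the "more weak chips can beat fewer strong ones" point concrete, here is a back-of-envelope sketch. All numbers are hypothetical, chosen only to illustrate the tradeoff, and it deliberately ignores interconnect and scaling overheads, which matter a lot in practice:

```python
# Hypothetical per-chip throughput figures (TFLOPs) -- illustrative only.
weak_chip_tflops = 100
strong_chip_tflops = 1000

# Hypothetical fleet sizes.
n_weak = 16
n_strong = 1

# Aggregate throughput per unit time, assuming perfect scaling.
weak_total = n_weak * weak_chip_tflops        # 16 * 100 = 1600
strong_total = n_strong * strong_chip_tflops  # 1 * 1000 = 1000

# Under these assumed numbers, the larger fleet of weaker chips wins --
# but only if each model shard fits on a weak chip (small models ease
# this packing constraint) and communication overhead stays modest.
print(weak_total, strong_total, weak_total > strong_total)
```

The real question is whether the scaling-efficiency and interconnect penalties eat the aggregate advantage, which is exactly where targeting smaller models helps.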