"All models are wrong, some are useful". You could also look at them as a highly compressed PGM representing every possible conversation you could have in every language the model knows that has been trimmed to only the highly probable paths through training. In that view word prediction is really navigating nodes and this would explain their ability to enforce global constraints as well as they do :) There isn't going to be one "correct" way to view them and different perspectives might offer different insights.