(no title)
dauhak | 11 months ago
This is such a weird misconception I keep seeing - the fact that the training loss is minimising cross-entropy (i.e. maximising the probability of the correct token) doesn't mean the model can't do "real" thinking. If circuitry that does "real" thinking is the best solution SGD finds, then that's obviously what it will learn.
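To make the "minimising CE = maximising prob of correct token" part concrete, here's a minimal sketch in plain Python. The logits and the helper names (`softmax`, `cross_entropy`) are made up for illustration, not taken from any particular framework. With a single correct token, the loss is just the negative log of the probability assigned to it, so it only constrains the output distribution and says nothing about what computation produces the logits.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    # With a one-hot target, cross-entropy reduces to -log p(correct token).
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical logits over a 3-token vocabulary; token 1 is the correct one.
target = 1
logits_before = [1.0, 2.0, 0.5]
logits_after = [1.0, 3.0, 0.5]  # same model "after" raising the correct token's logit

loss_before = cross_entropy(logits_before, target)
loss_after = cross_entropy(logits_after, target)

print(f"p(correct) before: {softmax(logits_before)[target]:.3f}, loss: {loss_before:.3f}")
print(f"p(correct) after:  {softmax(logits_after)[target]:.3f}, loss: {loss_after:.3f}")
```

The loss drops exactly when the probability of the correct token goes up. How the network gets there internally is left entirely to the optimiser.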