(no title)
alew1
|
1 year ago
"Temperature" doesn't make sense unless your model is predicting a distribution. You can't "temperature sample" a calculator, for instance. The output of the LLM is a predictive distribution over the next token; this is the formulation you will see in every paper on LLMs. It's true that you can do various things with that distribution other than sampling it: you can compute its entropy, you can find its mode (argmax), etc., but the type signature of the LLM itself is `prompt -> probability distribution over next tokens`.
wyager|1 year ago
Zero temperature => fully deterministic
The neuron activation levels do not inherently form or represent a probability distribution. That's something we've slapped on after the fact
alew1|1 year ago
But I wouldn't call the probabilistic interpretation "after the fact." The entire training procedure that generated the LM weights (the pre-training as well as the RLHF post-training) is formulated based on the understanding that the LM predicts p(x_t | x_1, ..., x_{t-1}). For example, pretraining maximizes the log probability of the training data, and RLHF typically maximizes an objective that combines "expected reward [under the LLM's output probability distribution]" with "KL divergence between the pretraining distribution and the RLHF'd distribution" (a probabilistic quantity).
apstroll|1 year ago