top | item 42685170

(no title)

alew1 | 1 year ago

"Temperature" doesn't make sense unless your model is predicting a distribution. You can't "temperature sample" a calculator, for instance. The output of the LLM is a predictive distribution over the next token; this is the formulation you will see in every paper on LLMs. It's true that you can do various things with that distribution other than sampling it: you can compute its entropy, you can find its mode (argmax), etc., but the type signature of the LLM itself is `prompt -> probability distribution over next tokens`.


wyager|1 year ago

The temperature in LLMs is a parameter of a normalization step (the softmax) that determines how neuron activation levels get mapped to odds ratios.

Zero temperature => fully deterministic

The neuron activation levels do not inherently form or represent a probability distribution. That's something we've slapped on after the fact.
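For reference, a minimal sketch of the mapping being discussed, assuming the usual softmax with the activations (logits) divided by the temperature. The logit values are made up, and `softmax_with_temperature` is a hypothetical helper name:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Map raw activations (logits) to probabilities, scaled by temperature."""
    if temperature <= 0:
        # T = 0 is usually handled as a special case (take the argmax);
        # the formula itself is only defined for T > 0.
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # toy activations, not from a real model
for t in (1.0, 0.5, 0.05):
    print(t, [round(p, 4) for p in softmax_with_temperature(logits, t)])
```

As the temperature shrinks toward zero, nearly all of the mass moves onto the largest activation, which is the sense in which zero temperature is deterministic.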

alew1|1 year ago

Any interpretation (including interpreting the inputs to the neural net as a "prompt") is "slapped on" in some sense—at some level, it's all just numbers being added, multiplied, and so on.

But I wouldn't call the probabilistic interpretation "after the fact." The entire training procedure that generated the LM weights (the pre-training as well as the RLHF post-training) is formulated based on the understanding that the LM predicts p(x_t | x_1, ..., x_{t-1}). For example, pretraining maximizes the log probability of the training data, and RLHF typically maximizes an objective that combines "expected reward [under the LLM's output probability distribution]" with "KL divergence between the pretraining distribution and the RLHF'd distribution" (a probabilistic quantity).
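Written out, the two objectives look roughly like this. The notation is an assumption for illustration: $\theta$ for the weights, $r$ for the reward model, $\beta$ for the KL weight, and $p_{\text{ref}}$ for the pretrained model. The KL direction shown is the commonly used one, and exact formulations vary between papers:

```latex
% Pretraining: maximize the log probability of the training sequence
\max_{\theta} \; \sum_{t=1}^{T} \log p_{\theta}(x_t \mid x_1, \ldots, x_{t-1})

% RLHF (typical form): expected reward minus a KL penalty against the pretrained model
\max_{\theta} \; \mathbb{E}_{y \sim p_{\theta}(\cdot \mid x)} \big[ r(x, y) \big]
  \; - \; \beta \, \mathrm{KL}\big( p_{\theta}(\cdot \mid x) \,\|\, p_{\text{ref}}(\cdot \mid x) \big)
```

Both expressions are only meaningful if $p_{\theta}$ is a probability distribution: the first takes its logarithm, the second takes an expectation and a KL divergence under it.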

apstroll|1 year ago

Under a cross-entropy loss the output activations do absolutely represent a probability distribution, since that is what we're modeling.
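A small sketch of why, assuming the standard setup where the loss is the negative log of the softmax probability assigned to the correct token. The logits are toy values and the function names are hypothetical:

```python
import math

def log_softmax(logits):
    """Log-probabilities implied by the output activations."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def cross_entropy(logits, target_index):
    """Negative log-probability the model assigns to the correct token."""
    return -log_softmax(logits)[target_index]

# Equal activations over 4 tokens: the loss is log(4), i.e. a uniform guess.
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], 0)
# Raising the correct token's activation lowers the loss.
confident_loss = cross_entropy([5.0, 0.0, 0.0, 0.0], 0)
print(uniform_loss, confident_loss)
```

The loss only ever sees the activations through `log_softmax`, so training pushes them to encode log-probabilities of the next token.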