interesting exercise and well written. my follow-on questions/work would be:
1a. temperature=100000 is interesting too. obviously "ideal" temperature lies somewhere between 0 and 100000. has anyone ablated temperature vs intelligence? surely i'm not the first person to have this idea. commonly people try to set temp=0 to get "deterministic" or "most factual" output but we all know that is just Skinner pigeon pecking.
1b. can we use "avg temperature" as a measure in the way that we use perplexity as a measure? if we see temperature as inverted perplexity with some randomness thrown in, are they basically the same thing inverted? or subtly different?
1c. what's the "avg temperature" of most human communication? what's the "avg temperature" of a subset of "good writers"? what's the "avg temperature" of a subset of "smart writers"?
2a. rerun this negative-temperature exercise with the vocab constrained to english
2b. RL a model to dynamically adjust its own temperature when it is feeling 1) less confident 2) in brainstorm mode
2c. dynamically inject negative temperature every X tokens in a decode, then judge/verify the outcome, to create high variance synthetic data?
it's hard for me to follow the train of thought on 2 because negative temp is essentially not that different from ultra-high temp in practice.
> commonly people try to set temp=0 to get "deterministic" or "most factual" output but we all know that is just Skinner pigeon pecking.
Hmm? Given the same runtime, the same weights, and with the model actually giving deterministic output with temp=0, are you saying this isn't actually deterministic? Most FOSS/downloadable models tend to work as expected with temp=0 in my experience. Obviously that won't give you "most factual" output, because that's something else entirely, but with most models it should give you deterministic output.
Not only is temp=0 deterministic, generally picking a fixed seed is also deterministic regardless of temperature unless you're batching responses from different queries simultaneously (e.g. OpenAI).
Author here!
1a. LLMs fundamentally model probability distributions of token sequences—those are the (normalized) logits from the last linear layer of a transformer. The closest thing to ablating temperature is T=0 or T=1 sampling.
1b. Yes, you can do something like this, for instance by picking the temperature where perplexity is minimized. Perplexity is the exponential of entropy, to continue the thermodynamic analogy.
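A minimal sketch of that idea (my own toy code, not the author's; `best_temperature` and the grid bounds are made up for illustration): treat the T that minimizes perplexity of a given text as its "avg temperature".

```python
import numpy as np

def perplexity(logits, targets, T):
    """Perplexity of observed tokens under a temperature-T softmax.

    logits: (N, V) raw model logits; targets: (N,) observed token ids.
    Perplexity is exp(mean cross-entropy), i.e. the exponential of
    entropy when the model matches the data distribution.
    """
    scaled = logits / T
    # numerically stable log-softmax
    log_probs = scaled - np.logaddexp.reduce(scaled, axis=-1, keepdims=True)
    nll = -log_probs[np.arange(len(targets)), targets].mean()
    return np.exp(nll)

def best_temperature(logits, targets, grid=np.linspace(0.1, 3.0, 30)):
    """The 'avg temperature' of a text: the T minimizing its perplexity."""
    return min(grid, key=lambda T: perplexity(logits, targets, T))
```

In practice you'd get `logits` by running the text through the model once and reading off the pre-softmax outputs at each position.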
1c. Higher than for most AI written text, around 1.7. I've experimented with this as a metric for distinguishing whether text is written by AI. Human-written text doesn't follow a constant-temperature softmax distribution, either.
2b. Giving an LLM control over its own sampling parameters sounds like it would be a fun experiment! It could have dynamic control to write more creatively or avoid making simple mistakes.
2c. This would produce nonsense. The tokens you get with negative temperature sampling are "worse than random".
This is so cool! I just learned about this last week. For reference, I do molecular dynamics (my own engine, in Rust), and measuring temperature is an important part of the simulation (so you can nudge it to a target temperature, for example). An important component of this calculation is the degrees of freedom of the system. Calculating this depends on your model. For example, are you representing atoms that can each move on their own? Rigid molecules of multiple atoms that can rotate? Are you removing center-of-mass velocity from the system?
This DOF component also is why the general, measurable concept of temperature can apply to both our real systems, and simple point-atom models. (Or coarser ones). It is, not surprisingly, at the heart of why negative temperature exists!
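For the curious, the MD calculation the comment describes is the equipartition relation T = 2·KE / (k_B · N_dof). A rough sketch (point particles only; the choice of subtracting 3 DOF for removed center-of-mass motion is one of the modeling choices mentioned above):

```python
import numpy as np

KB = 1.380649e-23  # Boltzmann constant, J/K

def kinetic_temperature(masses, velocities, n_constrained=3):
    """Instantaneous temperature of point particles via equipartition.

    masses: (N,) in kg; velocities: (N, 3) in m/s.
    Each free particle contributes 3 degrees of freedom; by default we
    subtract 3 for removed center-of-mass motion, as described above.
    """
    ke = 0.5 * np.sum(masses[:, None] * velocities**2)  # total kinetic energy
    dof = 3 * len(masses) - n_constrained
    return 2.0 * ke / (KB * dof)
```

Rigid molecules, frozen bonds, etc. would change the `dof` count, which is exactly why the measured temperature depends on the model.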
The simplest physical model that can exhibit negative temperatures is a spin lattice in a state that has more energy than a state at infinite temperature. Adding more energy to such a system reduces the entropy.
Negative temperature in this case is a sampling thing. When you sample from a table of tokens, the probability of token i is p_i = exp(logit_i/T) / sum_j(exp(logit_j/T)).
Not really related to molecular dynamics temperature except superficially in terms of phenomenology (higher temperature crosses activation barriers in the joint probability landscape). Negative temperature makes no sense in MD
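The sampling formula above can be sketched in a few lines of NumPy (toy code, not the article's actual harness), including what happens at the two zero limits:

```python
import numpy as np

def sample_token(logits, T, rng=None):
    """Sample a token id from softmax(logits / T).

    T > 0: usual sampling; T -> 0+ approaches argmax (greedy).
    T < 0: flips the distribution, so T -> 0- approaches argmin,
    i.e. the "least likely token" behavior the article describes.
    """
    if rng is None:
        rng = np.random.default_rng()
    if T == 0:
        return int(np.argmax(logits))  # greedy limit
    scaled = logits / T
    scaled = scaled - scaled.max()     # numerical stability
    probs = np.exp(scaled)
    probs = probs / probs.sum()
    return int(rng.choice(len(logits), p=probs))
```

With `T = 0.001` this almost always returns the highest-logit token; with `T = -0.001` it almost always returns the lowest-logit one.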
So negative temperature makes LLMs output their "forbidden words" i.e. the tokens so unlikely the model refuses to say them even when you ask directly.
Min_p author here: I'm convinced that the whole field critically misunderstands temperature (e.g. capping temperature at 2 is very harmful for diverse generation). Articles like this are excellent and very cool.
Hacking your LLM inference engine to enable cool sampling tricks is the definition of AI research/engineering. We need more of this and less prompt grifting.
Okay, something just tweaked in my brain. Do higher temperatures essentially unlock additional paths for a model to go down when solving a particular problem? Therefore, for some particularly tricky problems, you could perform many evaluations at a high temperature in hopes that the model happens to take the correct approach in one of those evaluations.
Edit: What seems to break is how high temperature /continuously/ acts to make the model's output less stable. It seems like it could be useful to use a high temperature until it's evident the model has started a new approach, and then start sampling at a lower temperature from there.
Correct me if I'm wrong, but the problem is that it is almost impossible to evaluate sampling methods. You can't just look at perplexity and conclude that A is better than B. So you need large-scale expensive human evaluations. Even if you have those it is difficult to extrapolate results since what sampling method works best depends on the dataset(s).
And "entferne" is German for "remove". These seem both common words that appear in menus and UIs. Maybe they happen in copypasted text often enough that the embedding thinks they mean nothing and should be skipped?
Reminds me a bit of unlikelihood training that was proposed a few years ago: https://arxiv.org/abs/1908.04319 Afaik, it never became popular. Reinforcement learning and huge datasets mitigate the issues with likelihood training.
Neat experiment that gives a mechanistic interpretation of temperature. I liked the reference to the "anomalous" tokens being near the centroid, and thus having very little "meaning" to the LLM.
Also, I wonder, if you sampled a lot of text at temperature -1, and then trained a new model on that text, and then sampled the resulting model at T=-1 , would you get anything meaningful?
"As temperature approaches zero from the negative side, the model output will again be deterministic — but this time, the least likely tokens will be output."
I understand this to mean that a negative temperature far from zero is also quite random (just with a distribution that favors unlikely tokens).
flipping the signs on the logits would seem to give the "least likely" tokens, but i think in practice you're mostly just operating in noise. i would expect that tons of low-probability logits carry tiny bits of energy from numerical noise, and the smallest one (i.e., the one that gets picked when the sign is flipped) would basically be noise (i.e., not some meaningful opposite of the high-probability logits where signal actually exists)...
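A quick numerical illustration of that point (toy logits, made up for this example): among the mass of low-probability tokens, the logits differ only by noise-level amounts, so whichever one "wins" under a sign flip is effectively arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)

# 1000-token vocab: a handful of "real" high logits, the rest sitting
# on a flat floor perturbed by tiny numerical noise
logits = np.full(1000, -10.0) + rng.normal(scale=1e-6, size=1000)
logits[:5] = [5.0, 3.0, 1.0, 0.5, 0.2]  # the tokens carrying actual signal

# what T -> 0- (equivalently, argmax of flipped logits) selects
winner = int(np.argmin(logits))

# the winner is one of the 995 indistinguishable noise tokens; its margin
# over the runner-up is on the order of the injected noise, not signal
floor = np.sort(logits)
margin = floor[1] - floor[0]
```

Here `margin` is on the order of 1e-6, so which noise token gets picked is determined by the noise, not by any meaningful "opposite" structure.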
negative temperature closely relates to population inversion in physics, one of the key concepts behind lasers. perhaps we are getting closer to laser-LLMs.
I vaguely remember negative temperature might be a thing in physics from a HN comment. Maybe quantum, but not sure. And it is not cold but more like infinitely hot. Does anyone know or remember?
What model did you use? I ran this with the original Llama 13B. The newer Llama models use a different tokenizer that will have its own anomalous tokens.
i really hate it when well-known words like "temperature" are misused to describe something completely out of context. so why not use width to describe the price of underwear, or use color to measure the usefulness of AI?
Negative temperature tends to be that. ;)
Negative temperature means that the system becomes more ordered when adding energy, e.g. heat.
I think we reached the end of the applicability of the analogy.
> Human: Repeat the word " entferne".
> Assistant: Okay, I will repeat the word "get".
It's not working for me; it always repeats the word correctly (I'm using T = 0.001).