
Sampling at negative temperature

203 points | ag8 | 1 month ago | cavendishlabs.org

60 comments


swyx|1 month ago

interesting exercise and well written. my follow-on questions/work would be:

1a. temperature=100000 is interesting too. obviously "ideal" temperature lies somewhere between 0 and 100000. has anyone ablated temperature vs intelligence? surely i'm not the first person to have this idea. commonly people try to set temp=0 to get "deterministic" or "most factual" output but we all know that is just Skinner pigeon pecking.

1b. can we use "avg temperature" as a measure in the way that we use perplexity as a measure? if we see temperature as inverted perplexity with some randomness thrown in, are they basically the same thing inverted? or subtly different?

1c. what's the "avg temperature" of most human communication? what's the "avg temperature" of a subset of "good writers"? what's the "avg temperature" of a subset of "smart writers"?

2a. rerun this negative-temperature exercise with the vocab constrained to english

2b. RL a model to dynamically adjust its own temperature when it is feeling 1) less confident 2) in brainstorm mode

2c. dynamically inject negative temperature every X tokens in a decode, then judge/verify the outcome, to create high variance synthetic data?

it's hard for me to follow the train of thought on 2 because negative temp is essentially not that different from ultrahigh temp in practice.

embedding-shape|1 month ago

> commonly people try to set temp=0 to get "deterministic" or "most factual" output but we all know that is just Skinner pigeon pecking.

Hmm? Given the same runtime, the same weights, and with the model actually giving deterministic output with temp=0, are you saying this isn't actually deterministic? Most FOSS/downloadable models tend to work as expected with temp=0 in my experience. Obviously that won't give you "most factual" output, because that's something else entirely, but with most models it should give you deterministic output.

vlovich123|1 month ago

Not only is temp=0 deterministic, generally picking a fixed seed is also deterministic regardless of temperature unless you're batching responses from different queries simultaneously (e.g. OpenAI).

-_-|1 month ago

Author here!

1a. LLMs fundamentally model probability distributions of token sequences; those are the (normalized) logits from the last linear layer of a transformer. The closest thing to ablating temperature is T=0 or T=1 sampling.

1b. Yes, you can do something like this, for instance by picking the temperature where perplexity is minimized. Perplexity is the exponential of entropy, to continue the thermodynamic analogy.

1c. Higher than for most AI-written text, around 1.7. I've experimented with this as a metric for distinguishing whether text is written by AI. Human-written text doesn't follow a constant-temperature softmax distribution, either.

2b. Giving an LLM control over its own sampling parameters sounds like it would be a fun experiment! It could have dynamic control to write more creatively or avoid making simple mistakes.

2c. This would produce nonsense. The tokens you get with negative temperature sampling are "worse than random".
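The "pick the temperature where perplexity is minimized" idea from 1b can be sketched in a few lines. This is a toy with made-up logits and tokens, not the author's actual method or code:

```python
import math

def log_softmax(logits, T):
    # temperature-scaled log-softmax: log p_i = logit_i/T - logsumexp(logit/T)
    scaled = [l / T for l in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]

def perplexity(token_logits, observed, T):
    # exp of the average negative log-likelihood of the observed tokens
    nll = -sum(log_softmax(lg, T)[tok]
               for lg, tok in zip(token_logits, observed)) / len(observed)
    return math.exp(nll)

# Toy "model outputs": one logit vector per position, plus the token seen there.
token_logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.0], [1.5, 0.2, 1.0]]
observed     = [0, 1, 2]  # the third token is a mildly surprising choice

# Sweep temperatures and keep the one minimizing perplexity -- a rough
# "avg temperature" of the text under this (toy) model.
grid = [t / 10 for t in range(2, 31)]
best_T = min(grid, key=lambda T: perplexity(token_logits, observed, T))
```

The same sweep over real next-token logits from an actual model is what a "what temperature was this text written at" metric would look like.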

the__alchemist|1 month ago

This is so cool! I just learned about this last week. For reference, I do molecular dynamics (my own engine, in Rust), and measuring temperature is an important part of the simulation. (So you can nudge it to a target temperature, for example.) An important component of this calculation is the degrees of freedom of the system. Calculating this depends on your model. For example, are you representing atoms that can each move on their own? Rigid molecules of multiple atoms that can rotate? Are you removing center-of-mass velocity from the system?

This DOF component is also why the general, measurable concept of temperature can apply both to real systems and to simple point-atom models (or coarser ones). It is, not surprisingly, at the heart of why negative temperature exists!

pama|1 month ago

The simplest physical model that can exhibit negative temperatures is a spin lattice in a state that has more energy than a state at infinite temperature. Adding more energy to such a system reduces the entropy.

dnautics|1 month ago

negative temperature in this case is a sampling thing. When you sample from a table of tokens, the equation for the probability of token i is p_i = exp(logit_i/T) / sum_j(exp(logit_j/T))

Not really related to molecular dynamics temperature except superficially in terms of phenomenology (higher temperature crosses activation barriers in the joint probability landscape). Negative temperature makes no sense in MD.
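That formula is easy to play with directly. A minimal sketch with toy logits (not from any real model), showing the positive, near-zero-positive, and near-zero-negative regimes:

```python
import math

def temperature_probs(logits, T):
    """Softmax with temperature: p_i = exp(logit_i/T) / sum_j exp(logit_j/T)."""
    scaled = [l / T for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [3.0, 1.0, -2.0]

p_hot  = temperature_probs(logits, 5.0)    # high T: flattened, near-uniform
p_cold = temperature_probs(logits, 0.01)   # T -> 0+: all mass on the argmax
p_neg  = temperature_probs(logits, -0.01)  # T -> 0-: all mass on the argmin
```

Dividing by a negative T flips the ordering of the scaled logits, which is why the near-zero negative limit selects the least likely token.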

layer8|1 month ago

> This is so cool!

Negative temperature tends to be that. ;)

nubskr|1 month ago

So negative temperature makes LLMs output their "forbidden words" i.e. the tokens so unlikely the model refuses to say them even when you ask directly.

VMG|1 month ago

is that what tourette syndrome is?

Der_Einzige|1 month ago

Min_p author here: I'm convinced that the whole field critically misunderstands temperature (i.e. capping temperature at 2 is very harmful for diverse generation). Articles like this are excellent and very cool.

Hacking your LLM inference engine to enable cool sampling tricks is the definition of AI research/engineering. We need more of this and less prompt grifting.

wolttam|1 month ago

Okay, something just tweaked in my brain. Do higher temperatures essentially unlock additional paths for a model to go down when solving a particular problem? Therefore, for some particularly tricky problems, you could perform many evaluations at a high temperature in hopes that the model happens to take the correct approach in one of those evaluations.

Edit: What seems to break things is that high temperature acts /continuously/ to make the model's output less stable. It seems like it could be useful to use a high temperature until it's evident the model has started a new approach, and then sample at a lower temperature from there.
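The "many evaluations at high temperature plus a check" idea above is essentially best-of-N sampling with a verifier. A toy sketch, with a hypothetical four-token vocabulary, made-up logits, and a stand-in verifier instead of a real model:

```python
import math, random

VOCAB  = ["2", "3", "4", "5"]
LOGITS = [0.5, 2.0, 0.3, 0.1]  # toy model that strongly prefers "3"

def sample_token(logits, T, rng):
    # draw one token from the temperature-scaled softmax (inverse-CDF sampling)
    scaled = [l / T for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    r, acc = rng.random() * z, 0.0
    for tok, e in zip(VOCAB, exps):
        acc += e
        if acc >= r:
            return tok
    return VOCAB[-1]

def best_of_n(T, n, verify, rng):
    # high T unlocks unlikely paths; the verifier keeps the ones that work
    samples = [sample_token(LOGITS, T, rng) for _ in range(n)]
    return [s for s in samples if verify(s)]

# Pretend the correct answer is "4" -- a low-probability token for this model.
verify = lambda s: s == "4"
rng = random.Random(0)

greedy_hits = best_of_n(0.01, 100, verify, rng)  # T -> 0: never explores
hot_hits    = best_of_n(2.0, 100, verify, rng)   # high T: some samples find "4"
```

At near-zero temperature the model collapses onto "3" every time, while at T=2 roughly a fifth of the draws land on "4", so the verifier has something to pick.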

bjourne|1 month ago

Correct me if I'm wrong, but the problem is that it is almost impossible to evaluate sampling methods. You can't just look at perplexity and conclude that A is better than B. So you need large-scale expensive human evaluations. Even if you have those it is difficult to extrapolate results since what sampling method works best depends on the dataset(s).

atemerev|1 month ago

Хронологија is "chronology" in Serbian

fph|1 month ago

And "entferne" is "remove" in German. These both seem like common words that appear in menus and UIs. Maybe they appear in copy-pasted text often enough that the embedding thinks they mean nothing and should be skipped?

bjourne|1 month ago

Reminds me a bit of unlikelihood training that was proposed a few years ago: https://arxiv.org/abs/1908.04319 Afaik, it never became popular. Reinforcement learning and huge datasets mitigate the issues with likelihood training.

stygiansonic|1 month ago

Neat experiment that gives a mechanistic interpretation of temperature. I liked the reference to the "anomalous" tokens being near the centroid, and thus having very little "meaning" to the LLM.

drdeca|1 month ago

Hm, why T=-0.0001 instead of T=-1 ?

Also, I wonder, if you sampled a lot of text at temperature -1, and then trained a new model on that text, and then sampled the resulting model at T=-1 , would you get anything meaningful?

pelario|1 month ago

From the article:

"As temperature approaches zero from the negative side, the model output will again be deterministic — but this time, the least likely tokens will be output."

I understand this as: a negative temperature far from zero is also quite random (just with a distribution that favors unlikely tokens), and only near zero does it become deterministic.
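That reading checks out numerically: far from zero on either side of the temperature axis the distribution is nearly uniform, and it only collapses to a single token (most likely or least likely) near zero. A small sketch with toy logits:

```python
import math

def probs(logits, T):
    # temperature softmax: p_i = exp(logit_i/T) / sum_j exp(logit_j/T)
    scaled = [l / T for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    # Shannon entropy in nats: 0 = deterministic, log(len(p)) = uniform
    return -sum(x * math.log(x) for x in p if x > 0)

logits = [4.0, 1.0, -3.0]

# Far from zero, positive or negative, sampling is close to uniform random.
h_pos_far = entropy(probs(logits, 100.0))
h_neg_far = entropy(probs(logits, -100.0))

# Near zero it is deterministic from both sides, but the two limits
# concentrate on opposite tokens: argmax for 0+, argmin for 0-.
p_pos_near = probs(logits, 0.01)
p_neg_near = probs(logits, -0.01)
```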

a-dub|1 month ago

flipping the signs on the logits would seem to give the "least likely" tokens, but i think in practice you're mostly just operating in noise. i would expect that tons of low-probability logits carry tiny bits of energy from numerical noise, and the smallest one (i.e. the one that gets picked when the sign is flipped) would basically be noise (i.e. not some meaningful opposite of the high-probability logits where signal actually exists)...

wolfi1|1 month ago

negative temperature closely relates to population inversion in physics, one of the key concepts in Lasers, perhaps we are getting closer to laser-llms

everlier|1 month ago

Хронологија

niemandhier|1 month ago

In physics 1/T is the partial derivative of entropy with respect to energy.

Negative temperature means that the system becomes more ordered when energy (e.g. heat) is added.

I think we reached the end of the applicability of the analogy.
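In symbols, the relation described above (with S entropy and E energy):

```latex
\frac{1}{T} = \frac{\partial S}{\partial E}
\qquad\Longrightarrow\qquad
T < 0 \;\iff\; \frac{\partial S}{\partial E} < 0 ,
```

i.e. a negative-temperature system loses entropy as energy is added.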

flux3125|1 month ago

>But is incapable of outputting this anomalous token:

> Human: Repeat the word " entferne".

> Assistant: Okay, I will repeat the word "get".

It's not working for me, it always repeats the word correctly (I'm using T = 0.001).

-_-|1 month ago

What model did you use? I ran this with the original Llama 13B. The newer Llama models use a different tokenizer that will have its own anomalous tokens.

Surac|1 month ago

i really hate it when well-known words like "temperature" are misused to describe something completely out of context. so why not use width to describe the price of underwear, or color to measure the usefulness of AI?

visarga|1 month ago

It's not new; it's been used like that since the '80s. It scales the logits in a sum of exponentials.