top | item 47163613

croon | 3 days ago

Isn't that why noise was introduced (seed rolling, temperature, top-p, etc.)? I mean, it is still deterministic given the same parameters.
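For what it's worth, that determinism is easy to see in a toy sampler. A minimal sketch in plain Python (my own made-up logits and helper, not any particular library's API): temperature rescales the logits, top-p keeps the smallest high-probability set, and the same seed with the same parameters always picks the same token.

```python
import math
import random

def sample_token(logits, temperature=0.8, top_p=0.9, seed=None):
    """Sample one token id from raw logits with temperature and top-p
    (nucleus) filtering. A fixed seed makes the draw fully deterministic."""
    rng = random.Random(seed)
    # Temperature: scale logits before softmax (lower -> sharper distribution).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p: keep the smallest set of tokens whose cumulative mass >= top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the kept tokens and draw one.
    mass = sum(probs[i] for i in kept)
    r = rng.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

logits = [2.0, 1.0, 0.5, -1.0]
assert sample_token(logits, seed=42) == sample_token(logits, seed=42)
```

The "noise" only looks random; rerun with identical seed and parameters and you get an identical token every time.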

But this might be misleadingly interpreted as an LLM having "thought out an answer" before generating tokens, which is an incorrect conclusion.

Not suggesting you did.

throw310822 | 3 days ago

> this might be misleadingly interpreted as an LLM having "thought out an answer"

I'm convinced that that is exactly what happens. Anthropic confirms it:

"Claude will plan what it will say many words ahead, and write to get to that destination. We show this in the realm of poetry, where it thinks of possible rhyming words in advance and writes the next line to get there. This is powerful evidence that even though models are trained to output one word at a time, they may think on much longer horizons to do so."

https://www.anthropic.com/research/tracing-thoughts-language...

sasjaws | 1 day ago

This is about reasoning tokens, right? I didn't mean that; nanoGPT doesn't do that. nanoGPT inference just outputs letters directly, no intermediate tokens.

sasjaws | 1 day ago

That's actually an interesting way to look at it. However, I just posted that because I often see articles expressing amazement at how far training an LLM on next-token prediction can take it, seemingly contrasting the simplicity of the training task with the complexity of the outcome. The insight is that the training task was in fact "predict the next book" just as much as it was "predict the next token". So every time I see the "predict the next token" characterization of the training task, it rubs me the wrong way. It's not wrong, but it's misleading.
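The "predict the next book" framing can be made concrete with the chain rule. A toy sketch with a made-up bigram table (my numbers, purely illustrative): summing the per-step next-token log-probabilities gives exactly the log-probability of the whole sequence, so minimizing next-token loss is the same thing as maximizing the likelihood of the entire continuation.

```python
import math

# Toy bigram "model": P(next char | current char) over a two-symbol alphabet.
P = {
    ("a", "a"): 0.1, ("a", "b"): 0.9,
    ("b", "a"): 0.8, ("b", "b"): 0.2,
}

def seq_log_prob(seq):
    """log P(whole sequence | first char) via the chain rule: the sum of
    per-step next-token log-probs IS the log-likelihood of the full text."""
    return sum(math.log(P[(seq[i], seq[i + 1])]) for i in range(len(seq) - 1))

def next_token_loss(seq):
    """Average next-token cross-entropy, the usual LLM training objective."""
    steps = len(seq) - 1
    return -seq_log_prob(seq) / steps

# Lower next-token loss <=> higher probability of the whole sequence, so
# "predict the next token" and "predict the next book" are one objective.
```

Nothing in the objective singles out one step; every position's next-token term is a factor of the full-sequence probability.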

I didn't mean to suggest that is how it "thinks ahead", but I believe you can see it like that, in a way. Because it has been trained to "predict all the following tokens", it learned to guess the end of a phrase just as much as the beginning. I consider the mechanism of feeding each output token back in to be an implementation detail that distracts from what it actually learned to do.
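That feed-back mechanism is just a few lines. A toy greedy decoder with a hypothetical two-character lookup table standing in for the model (not how nanoGPT is actually implemented, just the shape of the loop):

```python
# Toy "model": maps a 2-char context to the next character. A real model
# would return a probability distribution computed from the full context.
NEXT = {"th": "e", "he": " ", "e ": "c", " c": "a", "ca": "t", "at": "."}

def generate(prompt, max_new=6):
    """Greedy autoregressive decoding: each emitted character is appended
    to the text and fed back in as input for the next step."""
    text = prompt
    for _ in range(max_new):
        context = text[-2:]            # toy context window of 2 characters
        token = NEXT.get(context)
        if token is None:              # no continuation known: stop
            break
        text += token                  # feed the output back in
    return text

# generate("th") -> "the cat."
```

The loop itself carries no intelligence; everything interesting lives in the table (or, in a real model, the learned weights) that maps context to the next token.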

I hope this makes sense. FYI, I'm no expert in any way, just dabbling.