top | item 42556236


Klathmon | 1 year ago

So is the big improvement here simply skipping the unembedding/embedding step for internal thoughts? Or is it mainly in the training methods to teach the CoT and how to switch between "latent thought" and text output?

It's really interesting that a fixed number of "latent thoughts" performed as well as a binary classifier for deciding when to stop thinking! I didn't expect that at all; the way OpenAI talks about CoT, it seems the ability to let it "keep thinking" lets them continually score higher on benchmarks while throwing eye-watering amounts of compute at inference.

Crye | 1 year ago

It mentioned not penalizing/rewarding the model for its thoughts, only rewarding the answer after the thought. I'm curious how backpropagation works then.

lovasoa | 1 year ago

The researchers leverage existing language Chain-of-Thought data, where each sample consists of a question, reasoning steps, and the final answer. At stage 0, the model does not generate any thought tokens and is just trained to yield the reasoning traces and correct answers for the Chain-of-Thought samples. At each subsequent stage, one more reasoning step is removed from the sample and thought tokens are added in its place. In the illustration above, a single thought token is added per stage in place of a single reasoning step, but this ratio is controlled by a hyperparameter 'c'.
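The staged curriculum described above can be sketched in a few lines. This is a hypothetical illustration of the data construction only (the `<thought>` placeholder, function name, and sample format are my assumptions, not the paper's exact tokens):

```python
# Hypothetical sketch of a Coconut-style stage curriculum: at stage k,
# the first k reasoning steps are replaced by k*c thought-token slots.
THOUGHT = "<thought>"  # placeholder for a continuous-thought position

def build_stage_sample(question, steps, answer, stage, c=1):
    """Replace the first `stage` reasoning steps with stage*c thought tokens."""
    thoughts = [THOUGHT] * (stage * c)  # latent-thought positions
    remaining = steps[stage:]           # reasoning steps still kept as text
    return [question] + thoughts + remaining + [answer]

question, steps, answer = "Q: 2+3*4?", ["3*4=12", "2+12=14"], "A: 14"

# Stage 0: plain CoT training, no thought tokens
print(build_stage_sample(question, steps, answer, stage=0))
# Stage 2 with c=1: both reasoning steps replaced by two thought tokens
print(build_stage_sample(question, steps, answer, stage=2))
```

During training, the loss would then be applied only to the remaining text tokens (reasoning and answer), not to the `<thought>` positions.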

yorwba|1 year ago

The tokens of the answer depend on the preceding continuous thought vectors, which you can backprop through in the usual way.
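A toy scalar example (not the actual model, just an illustration of the principle) shows why a loss on the answer alone still trains the thought step: the continuous thought is an intermediate value in one differentiable graph, so the chain rule carries the answer's gradient back through it. The `tanh` layers and shared weight `w` here are my own simplification:

```python
import math

def forward(w, x):
    h1 = math.tanh(w * x)   # "continuous thought": hidden state fed back directly
    h2 = math.tanh(w * h1)  # next step consumes the thought vector, yields answer
    return h1, h2

def loss_and_grad(w, x, target):
    h1, h2 = forward(w, x)
    loss = 0.5 * (h2 - target) ** 2  # loss only on the final answer
    # Backprop: the gradient reaches w through both steps, via the thought h1.
    dh2 = h2 - target
    dpre2 = dh2 * (1 - h2 ** 2)      # through the second tanh
    dw = dpre2 * h1                  # direct use of w in the answer step
    dh1 = dpre2 * w                  # gradient flowing INTO the thought
    dpre1 = dh1 * (1 - h1 ** 2)      # through the first tanh
    dw += dpre1 * x                  # use of w in the thought step
    return loss, dw

# Finite-difference check that the analytic gradient is correct
w, x, t = 0.7, 0.9, 0.3
_, grad = loss_and_grad(w, x, t)
eps = 1e-6
lp, _ = loss_and_grad(w + eps, x, t)
lm, _ = loss_and_grad(w - eps, x, t)
print(abs(grad - (lp - lm) / (2 * eps)) < 1e-6)
```

In a real framework the whole thing lives in one autograd graph, so no manual chain rule is needed; the point is just that the thought positions need no loss of their own to receive gradients.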