kcorbitt | 10 months ago
By fine-tuning in this context I assume you mean "supervised fine-tuning", or SFT. SFT trains a model to produce a specific string of output tokens, given an input. With SFT, if you were trying to train an assistant to solve math problems using a code interpreter, you might train it on a dataset that looks like:
input: 'What is 934+1208'
output: `print(934+1208)`
input: 'how many "r"s in strawberry'
output: `print(len([l for l in "strawberry" if l == 'r']))`
etc, etc.

RL, on the other hand, means training a model not to produce one concrete string of output tokens, but rather to produce an output that maximizes some reward function (you get to decide on the reward).
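The SFT objective above can be sketched as a plain next-token cross-entropy over the target output string. This is a toy, framework-free version (the function name and toy vocabulary are illustrative, not any particular library's API):

```python
import math

def sft_loss(logits_per_step, target_tokens):
    """Average cross-entropy over the target tokens: SFT pushes the model
    toward exactly one output string, one token at a time."""
    total = 0.0
    for logits, target in zip(logits_per_step, target_tokens):
        # -log p(target | prefix) under a softmax over the vocabulary
        log_z = math.log(sum(math.exp(l) for l in logits))
        total += log_z - logits[target]
    return total / len(target_tokens)
```

With a two-token vocabulary, logits of `[10.0, 0.0]` for target token 0 give a loss near zero, while uniform logits give `log(2)`.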
For the example above, you might create the following dataset for RL training:
input: 'What is 934+1208'
ground_truth: 2142
input: 'how many "r"s in strawberry'
ground_truth: 3
You would then train the model to write python code that produces the ground_truth output. Your training code would take the model's output, run the python it produced, and then check whether the output matches the expected ground_truth. Importantly, this doesn't require you to actually write the code that solves the problem (technically, you don't even have to know whether it's solvable!). Over time, the training loop makes the model more likely to produce outputs that earn high rewards, which hopefully means it gets better at producing valid and applicable python.

This is useful in lots of domains where it's easier to check an answer than to produce it. In the blog post[1] linked above, we train the agent to effectively use keyword search to find the correct emails in an inbox. As the model trainer, I didn't actually know the right strategy for choosing keywords that would most quickly find the relevant email, but through training with RL, the model was able to figure it out on its own!
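That check (run the model's python, compare stdout to ground_truth) can be sketched in a few lines. This is a toy version: real training code would sandbox the execution, and the function name here is illustrative:

```python
import io
import contextlib

def reward(model_output: str, ground_truth) -> float:
    """Execute the code the model wrote, capture its stdout, and return
    1.0 if it matches ground_truth, else 0.0."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(model_output, {})  # NOTE: no sandboxing here; don't run untrusted code like this
    except Exception:
        return 0.0  # code that crashes (or doesn't parse) earns no reward
    return 1.0 if buf.getvalue().strip() == str(ground_truth) else 0.0
```

For the examples above, `reward('print(934+1208)', 2142)` returns 1.0, while code that prints the wrong number or fails to run returns 0.0. The RL algorithm's job is then to shift the model toward the sampled outputs that scored well.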
[1]: https://openpipe.ai/blog/art-e-mail-agent?refresh=1746030513...
someguy101010 | 10 months ago