top | item 35258153

callesgg | 2 years ago

Now, use this library to "bootstrap the smarts of LLaMA from its own smartness" like this:

1. Ask it things. Let it answer.

2. Ask it to find errors in its own answer and to correct them.

3. Use the original prompt and the corrected output as training data.

This should, with each iteration, make the model less and less likely to output statements that are self-contradictory or obviously wrong, until the model can no longer spot its own faults.
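A minimal sketch of the three steps above as a data-collection loop. The names `build_self_correction_dataset`, `generate`, and `toy_model` are hypothetical and not part of any library; `generate` stands in for whatever call runs the model, and the actual fine-tuning step is left out.

```python
from typing import Callable, List, Tuple


def build_self_correction_dataset(
    generate: Callable[[str], str],  # hypothetical: any prompt -> text function
    prompts: List[str],
) -> List[Tuple[str, str]]:
    """Collect (original prompt, self-corrected answer) pairs."""
    dataset = []
    for prompt in prompts:
        # Step 1: ask the model and let it answer.
        answer = generate(prompt)
        # Step 2: ask the same model to find and fix errors in that answer.
        critique_prompt = (
            f"Question: {prompt}\nAnswer: {answer}\n"
            "Find any errors in the answer and write a corrected answer."
        )
        corrected = generate(critique_prompt)
        # Step 3: keep the original prompt with the corrected output.
        dataset.append((prompt, corrected))
    return dataset


def toy_model(text: str) -> str:
    """Stub model for illustration: wrong at first, right when asked to review."""
    return "4" if text.startswith("Question:") else "5"


pairs = build_self_correction_dataset(toy_model, ["What is 2 + 2?"])
```

With the stub model, `pairs` holds the original question paired with the corrected answer rather than the first attempt. In a real run the pairs would be fed to a fine-tuning job and the loop repeated with the updated model.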

Drakim|2 years ago

I recall reading that when training AlphaZero they would start by pitting it against itself, playing millions of games in a few days, which worked great because there is an external metric (who wins the chess game) that is objectively a good measure to train towards.

But if you let an AI's approval be the metric, things turn a lot more fuzzy and subjective. The goal is no longer "to write a good answer without error" but "to write an answer that is approved by the AI". Those are very different goals, and as you keep using it you'll get a bigger and bigger divergence, until eventually the AI is just answering complete garbage nonsense that precisely hits certain sweet spots in the grading AI.
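A toy illustration of that divergence, under made-up assumptions: `grader_score` is a hypothetical flawed grader that rewards confident-sounding words, and `true_score` is the ground truth the grader cannot see. Neither models a real system.

```python
def grader_score(answer: str) -> int:
    """Hypothetical grading AI: counts confident-sounding words."""
    words = answer.split()
    return sum(words.count(w) for w in ("clearly", "certainly"))


def true_score(answer: str) -> int:
    """Ground truth for "What is 2 + 2?": 1 if the answer contains 4, else 0."""
    return 1 if "4" in answer.split() else 0


candidates = [
    "4",
    "clearly 5",
    "clearly certainly clearly 5",
]

# Selecting (or training) on the grader's approval picks the answer that
# best hits the grader's sweet spot, not the one that is actually right.
best = max(candidates, key=grader_score)
```

Here `best` is the most confident-sounding candidate, and its `true_score` is zero, while the plain correct answer gets no approval from the grader at all.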

This divergence of the target vs the actual human goal is a pretty interesting problem in AI safety research. I love the example where an AI trained to stay alive as long as possible in Tetris realized that pausing the game was the best strategy.

aqme28|2 years ago

You’re describing a GAN basically.

But yeah, you’re going to need an objective metric or human input, otherwise the system is going to diverge in strange ways.

newswasboring|2 years ago

I honestly think I might do this experiment, just to see what comes out. I know it will be utter garbage, but it will probably be interesting utter garbage.

Dwedit|2 years ago

That wasn't an AI, that was a "make the numbers go up" (lexicographic ordering) system with TAS rewinding for short-term brute-forcing.

jkeisling|2 years ago

For those skeptical of the above comment, this technique absolutely works and powers production-grade models like Anthropic’s Claude. There’s plenty of literature on this, but here are a couple of papers that might be helpful for people doing their own training:

- Constitutional AI: by Anthropic, an “RLAIF” technique that builds the preference model for “finding errors” from a set of around 70 “principles” the AI uses to check its own output, rather than from human feedback as in ChatGPT. This technique taught the Claude bot to avoid harmful output with few to no manual harmfulness labels! https://arxiv.org/abs/2212.08073 Not sure if there’s a HuggingFace implementation with LoRA / PEFT yet like there is for regular RLHF, so somebody may still need to implement this for Llama.

- Self-Instruct: Creates artificial training data for instruction tuning from an untuned base model, starting from a tiny seed of prompts, and filters out the bad ones before fine-tuning. Manages to approach InstructGPT performance with only ~100 human labels. https://arxiv.org/abs/2212.10560
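A rough sketch of the "filter out the bad ones" step for generated instructions. This is not the paper's code: the function name `filter_novel` is hypothetical, the 0.7 threshold is an assumption for illustration, and Python's `difflib.SequenceMatcher` is swapped in as a simple stand-in for whatever similarity measure a real pipeline would use.

```python
from difflib import SequenceMatcher
from typing import List


def filter_novel(
    pool: List[str],
    candidates: List[str],
    threshold: float = 0.7,  # assumed cutoff, for illustration only
) -> List[str]:
    """Keep generated instructions that are non-empty and not near-duplicates."""
    kept = list(pool)
    for cand in candidates:
        if not cand.strip():
            continue  # drop empty generations
        is_novel = all(
            SequenceMatcher(None, cand.lower(), seen.lower()).ratio() < threshold
            for seen in kept
        )
        if is_novel:
            kept.append(cand)
    # Return only the newly accepted instructions, not the seed pool.
    return kept[len(pool):]


seed = ["Translate the sentence into French."]
generated = [
    "Translate the sentence into French!",  # near-duplicate of the seed
    "Write a haiku about autumn.",
    "",
]
accepted = filter_novel(seed, generated)
```

Only the haiku instruction survives here; the near-duplicate and the empty string are dropped before anything reaches fine-tuning.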

jointpdf|2 years ago

Or it will twist itself into a giant hairball of contorted logic, like GPT-3.5 does when I (a human) encourage it to explain its errors.

8jy89hui|2 years ago

You should try using a larger model like llama-35b or even GPT-3 for the feedback. That way you might be able to condense knowledge from these really big models into a smaller model

tysam_and|2 years ago

This is a cool idea in theory and I think could be useful in certain kinds of circumstances, but this particular instantiation would likely go into a bad bias spiral.

This is somewhat similar to how GANs try to learn the density of the underlying data, but here you do not have the underlying data as a reference, if that makes sense. It's sort of like filling a mattress with helium instead of air. Sure, the mattress will be lighter, but that does not mean you will float on it, if that makes any sense at all.

Hope that helps as a cogent answer to this question.