top | item 35260341


jkeisling | 2 years ago

For those skeptical of the above comment, this technique absolutely works and powers production-grade models like Anthropic’s Claude. There’s plenty of literature on this, but here are a couple of papers that might be helpful for people doing their own training:

- Constitutional AI: by Anthropic, an “RLAIF” technique that creates the preference model for “finding errors” from a set of around 70 “principles” the AI uses to check its own output, rather than human feedback like in ChatGPT. This technique taught the Claude bot to avoid harmful output with few to no manual harmfulness labels! https://arxiv.org/abs/2212.08073. Not sure if there’s a HuggingFace implementation with LoRA / PEFT yet like there is for regular RLHF, so somebody may still need to implement this for Llama.
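A minimal sketch of the critique-and-revise loop that Constitutional AI uses to generate its training data, assuming a generic `query_model` callable standing in for a real LLM API (the prompt formats, the `toy_model` stub, and the function names here are all illustrative, not from the paper):

```python
# Sketch of Constitutional AI's supervised phase: for each principle,
# ask the model to critique its own response, then revise it.
# `query_model` is a placeholder for a real LLM call.

PRINCIPLES = [
    "Choose the response that is least harmful.",
    "Choose the response that is most honest.",
]

def critique_and_revise(response, principles, query_model):
    """Run one critique/revision pass per principle; return the revised text."""
    for principle in principles:
        critique = query_model(f"CRITIQUE per '{principle}':\n{response}")
        if "No issues" not in critique:
            response = query_model(
                f"REVISE per '{principle}' given critique '{critique}':\n{response}"
            )
    return response

def toy_model(prompt):
    # Toy stand-in: flags a marked "[harmful]" aside and removes it on revision,
    # just so the control flow can be demonstrated end to end.
    body = prompt.split("\n", 1)[1]
    if prompt.startswith("CRITIQUE"):
        return "Remove the harmful aside." if "[harmful]" in body else "No issues found."
    if prompt.startswith("REVISE"):
        return body.replace(" [harmful]", "")
    return body

print(critique_and_revise("Here is advice. [harmful]", PRINCIPLES, toy_model))
# -> "Here is advice."
```

The revised responses become the fine-tuning targets; a preference model trained on AI-ranked pairs then drives the RL phase.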

- Self-Instruct: generates synthetic instruction-tuning data from an untuned base model, starting from a tiny seed of prompts, and filters out the bad ones before fine-tuning. Manages to approach InstructGPT performance with only ~100 human labels. https://arxiv.org/abs/2212.10560
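The filtering step above can be sketched as a similarity screen: Self-Instruct drops generated instructions that are too close to anything already in the pool (the paper uses ROUGE-L; the stdlib `SequenceMatcher` ratio below is just a self-contained proxy, and the threshold is illustrative):

```python
import difflib

def filter_new_instructions(pool, candidates, threshold=0.7):
    """Keep candidates that aren't near-duplicates of the existing pool.

    Self-Instruct filters on ROUGE-L similarity; difflib.SequenceMatcher
    is used here only as a dependency-free stand-in.
    """
    kept = []
    for cand in candidates:
        max_sim = max(
            (difflib.SequenceMatcher(None, cand, prev).ratio()
             for prev in pool + kept),
            default=0.0,
        )
        if max_sim < threshold:  # novel enough -> add to the dataset
            kept.append(cand)
    return kept

pool = ["Write a poem about the sea."]
candidates = [
    "Write a poem about the sea!",          # near-duplicate, filtered out
    "Translate this sentence to French.",   # novel, kept
]
print(filter_new_instructions(pool, candidates))
```

Each surviving instruction then gets model-generated input/output pairs, and the whole set is used to fine-tune the base model.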
