Thanks for writing this up. Rather than zeroing out the loss for the prompt tokens, did you also try a weighted loss with Axolotl? At one point, Microsoft's GPT-3 docs suggested this was beneficial when the responses are short (like you have with "Cut in."). Domain adaptation over subreddits/forums before finetuning may help as well.
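To sketch what I mean: instead of a hard 0/1 mask on prompt tokens, multiply each token's cross-entropy by a small weight when it belongs to the prompt. This is a minimal illustration in plain numpy (the function name and `prompt_weight` value are just for the example, not anything Axolotl exposes under that name):

```python
import numpy as np

def weighted_token_loss(token_losses, is_prompt, prompt_weight=0.1):
    """Down-weight loss on prompt tokens instead of zeroing it out.

    token_losses:  per-token cross-entropy values, shape (seq_len,)
    is_prompt:     boolean mask, True where the token belongs to the prompt
    prompt_weight: multiplier for prompt-token loss (1.0 = plain LM loss,
                   0.0 = fully masked prompt, i.e. what zeroing out does)
    """
    token_losses = np.asarray(token_losses, dtype=float)
    weights = np.where(is_prompt, prompt_weight, 1.0)
    # Normalize by total weight so a short response isn't drowned out
    # by a long prompt.
    return float((token_losses * weights).sum() / weights.sum())
```

With a one-token response like "Cut in.", a fully masked prompt leaves very few tokens contributing gradient per example; a small nonzero `prompt_weight` keeps some signal from the prompt while still emphasizing the response.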
dmakian|2 years ago
This is really smart, I didn't think about this! Will add it to my list of things to try, great idea!
> Domain adaptation over subreddits/forums before finetuning may help as well.
I was thinking about this too (along with transcribing draft YouTube videos); I'd definitely be curious how much this helps.
float-trip|2 years ago
Also, why QLoRA rather than a full finetune? Using LambdaLabs, it'd cost roughly the same as your quote, and cheaper, I think, if you're willing to gamble with fp8: https://github.com/mosaicml/llm-foundry/tree/main/scripts/tr.... There are fewer hyperparameters to tune as well.