IntrepidPig | 3 months ago
Also, I wonder if it could be a side effect of all the supposed alignment efforts that go into training. If you train on a bunch of negative reinforcement samples where the model says something like “sorry I can’t do that”, maybe that also pushes the model to say things like “sure I’ll do that” in positive cases?
Disclaimer that I am just yapping