bloep|3 years ago
What I had in mind was something like a reward model trained to detect longer outputs that have a very high similarity to training examples. Something similar has been done to keep LLMs from using toxic language. You'd simply backprop through that model, as in GANs. And no, it does not completely contradict the overall training objective, because the criterion would be long verbatim copies; it would not affect shorter copies of sound fragments and the like, which you would want a music model to produce in order for it to sound realistic and natural.
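To make the criterion concrete, here is a minimal pure-Python sketch of the "long verbatim copy" signal such a reward model could be trained on. All names (`longest_verbatim_run`, `copy_penalty`, `min_run`) are illustrative, and a real system would use a learned, differentiable detector rather than exact matching:

```python
def longest_verbatim_run(gen, train):
    # Length of the longest contiguous token span shared by the two sequences.
    best = 0
    for i in range(len(gen)):
        for j in range(len(train)):
            k = 0
            while (i + k < len(gen) and j + k < len(train)
                   and gen[i + k] == train[j + k]):
                k += 1
            best = max(best, k)
    return best

def copy_penalty(gen, corpus, min_run=8):
    # Penalize only long verbatim copies; short shared fragments
    # (the ones you want the model to reuse so it sounds natural)
    # fall below the threshold and incur no penalty.
    worst = max(longest_verbatim_run(gen, t) for t in corpus)
    return max(0, worst - min_run)

corpus = ["the quick brown fox jumps over the lazy dog".split()]
gen = "a quick brown fox jumps over the lazy cat".split()
print(copy_penalty(gen, corpus, min_run=4))  # shared 7-word run -> penalty 3
```

In the GAN-style setup, this hard threshold would be replaced by a smooth learned score so the gradient can flow back into the generator.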
p1esk|3 years ago
bloep|3 years ago
Another similar possibility might be to do more RL with this data, e.g. using upside-down RL. One can possibly steer this with user feedback as well.
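For readers unfamiliar with upside-down RL: the idea is to turn RL into supervised learning by conditioning the policy on a desired return and training it on what was actually achieved. A toy sketch on a hypothetical one-step bandit (all names illustrative; user feedback would play the role of the reward):

```python
import random
from collections import defaultdict

random.seed(0)

def env_step(action):
    # Toy deterministic bandit: action 0 -> reward 0, action 1 -> reward 1.
    return action

# Collect experience with a random behavior policy.
experience = []
for _ in range(100):
    a = random.randint(0, 1)
    experience.append((a, env_step(a)))

# "Train" by relabeling: for each achieved return, tally which action led to it.
counts = defaultdict(lambda: defaultdict(int))
for action, achieved_return in experience:
    counts[achieved_return][action] += 1

def policy(desired_return):
    # Command the policy with a desired return; it picks the action
    # most often associated with that return in the experience.
    return max(counts[desired_return], key=counts[desired_return].get)

print(policy(1))  # commanding high return selects action 1
print(policy(0))  # commanding low return selects action 0
```

In a real model the counting table would be a neural network, and "steering with user feedback" would mean commanding returns derived from that feedback.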