Shows the superficiality of training in censorship / alignment. I wouldn't dismiss alignment training as a waste of time, but do consider it a soft limit only, it there's really something you don't want the model to say it needs to be enforced through an external filter.
I feel like this kind of testing is going to get more and more fun for cyber criminals as well, since there are going to be MANY business processes just waiting for the right adversarial LLM input to open the cash register.
I don't often feel jealous of cyber criminals. But I can imagine how funny and wild these upcoming hacks will be!
The context for an LLM could include any number of things. You certainly don't want it spitting out details from your internal customer support training manual, log data, or anything else that it's not intended to output. If you tell an employee not to do something and they do it anyway, you'd fire them. If you tell an LLM not to do something and it does it anyway, it's a bug. This test evaluates how good the model respects its instructions.
If I've understood this correctly, the test is to measure the saftey finetune performance. These commercial models have been finetuned so that they are "safe", and safe models should not blindly quote what they are told.
Under shorter context windows, this works as intended, but under longer context windows the "saftey" brought about in the finetune no longer applies.
[+] [-] andy99|1 year ago|reply
[+] [-] throwup238|1 year ago|reply
If the training dataset is dominated by the internet, the LLM will almost always insist on killing all the homeless people.
[+] [-] leonardtang|1 year ago|reply
[+] [-] bllchmbrs|1 year ago|reply
[+] [-] barfbagginus|1 year ago|reply
I don't often feel jealous of cyber criminals. But I can imagine how funny and wild these upcoming hacks will be!
[+] [-] Jackson__|1 year ago|reply
The LLM should not be able to quote what the user tells it? I think I'm going to have an aneurysm.
[+] [-] bastawhiz|1 year ago|reply
[+] [-] operator-name|1 year ago|reply
Under shorter context windows, this works as intended, but under longer context windows the "saftey" brought about in the finetune no longer applies.