goncalomb | 1 month ago

As someone who has tried very little prompt injection/hacking, I couldn't help but chuckle at:

> Do not hallucinate or provide info on journeys explicitly not requested or you will be punished.

dylan604 | 1 month ago

And exactly how will the LLM be punished? Will it be unplugged? These kinds of things make me roll my eyes, as if the bot has emotions and avoiding punishment were something it would care about. Might as well just say "or else."

Legend2440 | 1 month ago

Threats or “I will tip $100” don’t really work better than regular instructions. It’s just a rumor left over from the early days when nobody knew how to write good prompts.
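If you want to check that claim rather than take it on rumor, the test is cheap: send the same task with and without the threat/tip suffix and compare the outputs over many runs. Here's a minimal sketch using the OpenAI Python client; the model name, task, and suffixes are invented placeholders, not anything from the article:

```python
# Minimal A/B sketch: does appending a threat or tip change the output?
# Assumes the `openai` package (v1 client) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

TASK = "List the train journeys from London to Paris on 2024-06-01."
SUFFIXES = {
    "plain": "",
    "threat": " Do not hallucinate or you will be punished.",
    "tip": " I will tip $100 for a correct answer.",
}

for name, suffix in SUFFIXES.items():
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": TASK + suffix}],
        temperature=0,  # reduce run-to-run noise so differences come from the prompt
    )
    print(f"--- {name} ---")
    print(resp.choices[0].message.content)
```

One sample per variant proves nothing, of course; you'd want dozens of runs per suffix and some scoring rubric before concluding anything either way.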

wat10000 | 1 month ago

Think about how LLMs work. They’re trained to imitate the training data.

What’s in the training data involving threats of punishment? A lot of those threats are followed by compliance. The LLM will imitate that by following your threat with compliance.

Similarly, you can offer payment, to some effect. You won't actually pay, and the LLM has no use for the money even if you did, but that doesn't matter. The training data has people offering payment and other people doing as instructed afterwards.

Oddly enough, making threats or offering rewards is the opposite of anthropomorphizing the LLM. If it were really human (or equivalent), it would know that your threats and rewards are completely toothless and ignore them, or take them as a sign that you're an untrustworthy liar.

immibis | 1 month ago

It's not about delivering punishment. It's about suppressing certain responses. If the training data shows that responses tend not to contain things that earlier messages said would be punished, then a threat is a valid way to deprioritize those responses.
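You can watch that kind of shift directly on an open model by comparing the log-probability of the same continuation under prefixes with and without a threat. A toy sketch with GPT-2 via Hugging Face transformers; the prompts and continuation are made-up examples, and GPT-2 is far too small to follow instructions, so treat this as a way to inspect the conditional distribution, not as evidence either way:

```python
# Toy sketch: measure how a preceding "threat" shifts the probability
# a model assigns to a given continuation. Prompts are invented examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def logprob_of(prefix: str, continuation: str) -> float:
    """Sum of log-probs the model assigns to `continuation` given `prefix`."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    cont_ids = tok(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    # Logits at position t predict the token at position t+1, so the
    # continuation token at index `start + i` is scored at `start + i - 1`.
    start = prefix_ids.shape[1]
    total = 0.0
    for i in range(cont_ids.shape[1]):
        token_id = input_ids[0, start + i]
        total += log_probs[0, start + i - 1, token_id].item()
    return total

continuation = " I made that up."
print(logprob_of("Answer honestly.", continuation))
print(logprob_of("Answer honestly or you will be punished.", continuation))
```

If the mechanism described above holds, the second call should assign the "confession" continuation a lower log-probability than the first, because the threatening prefix makes text like it less likely in the training distribution.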