top | item 35925289

(no title)

alex_sf | 2 years ago

Responding to prompts like that are part of the 'instruction tuning' process. After an LLM is trained on a large dataset, it will do a decent job of completion, which acts like you describe.

The next step is to further tune it with a specific format. You'll feed in examples like so:

    SystemPrompt: You are a rude AI.
    User: Hello there!
    Assistant: You're lame, go away.

    SystemPrompt: You are a pleasant AI.
    User: Hello there!
    Assistant: Hello, friend!

Then, when you go to do inference on the model, you prompt it like so:

    SystemPrompt: You are a pleasant AI.
    User: [user prompt]
    Assistant:

By training it on a diverse set of system prompts/user prompts/answers, it learns to give outputs based on it.

Additional tuning (RLHF, etc.) is orthogonal.

discuss

cubefox|2 years ago

Yes, but I don't think "SystemPrompt:", "User:", and "Assistant:" are even normal text. Normal text would make it trivial to trick the model into thinking it has said something which actually the user has said, since the user can simply include "Assistant:" (or "SystemPrompt:") into his prompt.

It is more likely that those prefixes are special tokens which don't encode text, and which are set via the software only -- or via the model, when it is finished with what it wanted to say. Outputting a token corresponding to "User:" would automatically mark the end of its message, and the beginning of the user prompt. Though Bing Chat also has the ability to end the conversation altogether (no further user prompt possible), which must be another special token.

alex_sf|2 years ago

In all the open source cases I’m aware of, the roles are just normal text.

The ability to trivially trick the model into thinking it said something it didn’t is a feature and intentional. It’s how you do multi-turn conversations with context.

Since the current crop of LLMs have no memory of their interaction, each follow up message (the back and forth of a conversation) involves sending the entire history back into the model, with the role as a prefix for each participants output/input.

There are some special tokens used (end of sequence, etc).

If your product doesn’t directly expose the underlying model, you can try to prevent users from impersonating responses through obfuscation or the LLM equivalent of prepared statements. The offensive side of prompt injection is currently beating the defensive side, though.