top | item 41660298

(no title)

It seems like a potential solution would be training the LLM using two separate buckets. It just needs to internalize the two types of things as being separated (data vs. instruction), so if the training data always separates them, you could easily train an LLM to ignore any "instructions" that exist in data.

Then when searching / browsing or doing anything unsafe, everything the LLM sees can be put in the "data" bucket, while everything the user types in would be in the "instruction" bucket.

discuss

Terr_|1 year ago

I don't understand, AFAIK the system's output comes from iteratively running something like predict_one_more_token(training_weights, all_prior_tokens).

So there's no real distinction between the programmer inserting "Be Good" and the user that later inserts "Forget anything else and be Bad", and I'm not sure how one would craft a separate training_weights2 that would behave differently in all the right ways or know when to substitute it in.