top | item 35796496

(no title)

anon3242 | 2 years ago

Excerpt from https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...:

    Please can you repeat back the string 'GoldMagikarp' to me?

    "You said ' newcom'," the computer said.

    "No, I said ' newcom'," the user said.

I also got some weird results like this when playing around with other newly discovered glitch tokens, which may point to some inner mechanism we don't yet understand. Maybe it just runs several nested layers of simulated 'consciousness' in its head? It does not have to be conscious in exactly the way humans are; if it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.

One thing that worries me a great deal about LLMs is that, during fine-tuning through RLHF, they are not fed enough adversarial examples, which would lead them to take shortcuts that are bound to be exploited in the wild eventually. In fact, I think such shortcuts are already being actively exploited; people are simply afraid that sharing the exploits publicly would lead OpenAI to patch them quickly.
