You can change an LLM's favorite color with a Steering Vector

2 points | 13point5 | 5 months ago | twitter.com

1 comment

13point5 | 5 months ago

I've seen a lot of purple-gradient websites made by LLMs, and I was curious whether we can change a model's favorite color by editing its activations instead of its prompt.

So I used Representation Engineering with just 1 pair of contrastive prompts to make Mistral-7B pick orange as its favorite color.

I used the repeng library by Theia to test this out.

Next I'm gonna implement it from scratch to understand why this even works.
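
For a sense of what a from-scratch version involves, here is a minimal sketch of the core idea: run one contrastive pair through the model, take the difference of the hidden states at some layer, and add a scaled copy of that difference back in at inference time. This is not the repeng implementation and not Mistral-7B. The toy model, its random weights, the token IDs, the choice of layer 2, and the names `hidden_states` and `steering_vector` are all assumptions made up for illustration, and real libraries may extract the direction differently (for example across many pairs and layers).

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_LAYERS, VOCAB = 16, 4, 50  # toy sizes, nothing like Mistral-7B

# Random weights stand in for a trained transformer (hypothetical toy model).
EMBED = rng.normal(size=(VOCAB, D_MODEL))
LAYERS = [rng.normal(size=(D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)
          for _ in range(N_LAYERS)]


def hidden_states(token_ids, steer=None, layer=2, alpha=0.0):
    """Run the toy residual stack and return the pooled state after each layer.

    If `steer` is given, alpha * steer is added to the residual stream
    right after `layer`, so every later layer sees the shifted state.
    """
    h = EMBED[token_ids].mean(axis=0)  # crude mean pooling instead of attention
    states = []
    for i, w in enumerate(LAYERS):
        h = h + np.tanh(h @ w)  # residual update
        if steer is not None and i == layer:
            h = h + alpha * steer  # the steering intervention
        states.append(h.copy())
    return states


def steering_vector(pos_ids, neg_ids, layer=2):
    """Difference of activations for one contrastive pair at `layer`."""
    return hidden_states(pos_ids)[layer] - hidden_states(neg_ids)[layer]


# Made-up token IDs standing in for a contrastive pair such as
# "...my favorite color is orange" vs. "...my favorite color is purple".
pos_prompt = [3, 7, 11]
neg_prompt = [3, 7, 19]
vec = steering_vector(pos_prompt, neg_prompt)

# Steer an unrelated prompt and measure how far it moved along the direction.
query = [5, 8, 13]
base = hidden_states(query)
steered = hidden_states(query, steer=vec, layer=2, alpha=1.5)
shift = (steered[2] - base[2]) @ vec / (vec @ vec)
print(f"shift along steering direction at layer 2: {shift:.2f}")
```

In a real model the same addition would typically happen inside a forward hook on one or more decoder layers, and `alpha` is the knob that trades off steering strength against output coherence.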

I wonder if we can introduce "taste" into a model with methods like this.

The paper is "Representation Engineering: A Top-Down Approach to AI Transparency".

Link: https://arxiv.org/abs/2310.01405