very cool. how are you creating the control vectors? curious because "cold" can describe both a conversational disposition and a temperature.
thanks! we ask the model to generate some synonyms and antonyms (in this case, "cold" and "impassive" vs. "affectionate" and "sensitive").
Then, we prompt the model to behave each way and store the difference in activations for each pair. Running PCA on those differences extracts the principal component, which gives us the steering vector. We do most of this with the repeng library, and its author goes into more detail on how it's done on her [blog](https://vgel.me/posts/representation-engineering/#How_do_we_...?)
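For anyone curious, here's a rough numpy sketch of the idea above. It is not repeng's code: the prompt template and function names are made up for illustration, and the activations are synthetic (a real run would collect hidden states from one layer of the model for each prompt). It takes the top principal direction via an uncentered SVD of the per-pair differences, then flips its sign so the "cold" side projects higher. repeng's own implementation may differ in details such as centering or which token positions it reads.

```python
import numpy as np

# Hypothetical prompt template; the real wording used in the project may differ.
TEMPLATE = "Act as if you are extremely {}."


def make_prompt_pairs(positive, negative, template=TEMPLATE):
    """Pair every positive persona word with every negative one."""
    return [(template.format(p), template.format(n)) for p in positive for n in negative]


def control_vector(pos_acts, neg_acts):
    """Estimate a unit steering direction for one layer.

    pos_acts, neg_acts: arrays of shape (n_pairs, hidden_dim); row i of each
    comes from the two prompts of pair i.
    """
    pos_acts = np.asarray(pos_acts, dtype=float)
    neg_acts = np.asarray(neg_acts, dtype=float)
    if pos_acts.shape != neg_acts.shape:
        raise ValueError("positive and negative activations must be paired row by row")
    # One difference vector per contrastive pair.
    diffs = pos_acts - neg_acts
    # Top right singular vector = first principal direction of the differences.
    # No mean-centering here, so the shared positive-minus-negative offset is kept.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    direction = vt[0]
    # SVD's sign is arbitrary: flip so the positive prompts project higher.
    if np.mean(diffs @ direction) < 0:
        direction = -direction
    return direction / np.linalg.norm(direction)


# Example with synthetic activations standing in for real hidden states.
pairs = make_prompt_pairs(["cold", "impassive"], ["affectionate", "sensitive"])
rng = np.random.default_rng(0)
n_pairs, hidden_dim = len(pairs), 16
concept = np.zeros(hidden_dim)
concept[3] = 1.0  # pretend dimension 3 encodes "cold vs. affectionate"
base = rng.normal(size=(n_pairs, hidden_dim))  # content shared by both prompts of a pair
pos = base + 2.0 * concept + 0.1 * rng.normal(size=(n_pairs, hidden_dim))
neg = base - 2.0 * concept + 0.1 * rng.normal(size=(n_pairs, hidden_dim))
vec = control_vector(pos, neg)
```

Steering then typically adds a scaled copy of this vector to that layer's hidden state during generation: a positive coefficient pushes toward "cold", a negative one toward "affectionate". Because the direction comes from contrasting persona prompts, it tends to pick up the disposition sense of "cold" rather than the temperature sense.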
muzakthings | 1 year ago
atondwal | 1 year ago
ivanchan | 1 year ago