tkanarsky | 2 years ago
And how would you quantify alignedness? Rating outputs for a given input falls prey to the first problem. Analyzing activations as they trickle through the model is analytically intractable, and training a "polygraph" model on the activations of your network raises its own alignment issues (how can you be sure the polygraph isn't lying to you?).
I'm ready to eat my words, but I think perfect alignment is infeasible. The best we can hope to do is curate training data and hope the caged bird won't sing.
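The "polygraph" idea above is essentially a probe trained on hidden activations. A minimal sketch of that setup, using synthetic stand-ins for the activations and labels (nothing here comes from a real model; the data, dimensions, and label rule are all invented for illustration):

```python
# Hypothetical sketch of an activation "polygraph": train a linear probe on a
# network's hidden activations to predict some behavioral label. The
# activations X and labels y below are synthetic stand-ins, not real model data.
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are 64-dimensional hidden activations for 200 examples,
# where the label correlates with one direction in activation space.
n, d = 200, 64
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.1 * rng.normal(size=n) > 0).astype(float)

# Logistic-regression probe fit by plain gradient descent.
w = np.zeros(d)
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))  # predicted probability of label 1
    w -= 0.1 * X.T @ (p - y) / n    # gradient step on the log-loss

accuracy = ((1 / (1 + np.exp(-(X @ w))) > 0.5) == y).mean()
print(f"probe accuracy on training data: {accuracy:.2f}")
```

Of course, a high probe accuracy only shows the probe found *some* correlate in the activations; it doesn't resolve the regress the comment points out, since you now need to trust the probe itself.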
ftxbro | 2 years ago
It's when it no longer says problematic words: bad ones related to race, gender, hierarchical power relations, or different abledness.