tkanarsky | 2 years ago
And how would you quantify alignedness? Rating outputs for a given input falls prey to the first problem. Analyzing activations as they trickle through the model is analytically intractable, and training a "polygraph" model on the activations of your network raises its own alignment issues (how can you be sure the polygraph isn't lying to you?).
I'm ready to eat my words, but I think perfect alignment is infeasible. The best we can hope to do is curate training data and hope the caged bird won't sing.
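The "polygraph" idea above is essentially a probe trained on hidden activations. A minimal sketch of that setup, using synthetic stand-ins for the activations and labels (nothing here comes from a real model; the data, dimensions, and label rule are all invented for illustration):

```python
# Hypothetical sketch of an activation "polygraph": train a linear probe on a
# network's hidden activations to predict some behavioral label. The
# activations X and labels y below are synthetic stand-ins, not real model data.
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are 64-dimensional hidden activations for 200 examples,
# where the label correlates with one direction in activation space.
n, d = 200, 64
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.1 * rng.normal(size=n) > 0).astype(float)

# Logistic-regression probe fit by plain gradient descent.
w = np.zeros(d)
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))  # predicted probability of label 1
    w -= 0.1 * X.T @ (p - y) / n    # gradient step on the log-loss

accuracy = ((1 / (1 + np.exp(-(X @ w))) > 0.5) == y).mean()
print(f"probe accuracy on training data: {accuracy:.2f}")
```

Of course, a high probe accuracy only shows the probe found *some* correlate in the activations; it doesn't resolve the regress the comment points out, since you now need to trust the probe itself.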
ftxbro | 2 years ago
It's when it no longer says problematic words: bad ones related to race, gender, hierarchical power relations, or different abledness.