(no title)
twsted | 9 months ago
We should do better than giving the models a portion of good training data or a new mitigating system prompt.
twsted | 9 months ago
We should do better than giving the models a portion of good training data or a new mitigating system prompt.
SV_BubbleTime|9 months ago
But I’m having a hard time describing and AI company “serious” when they’re shipping a product that can email real people on its own, and perform other real actions - while they are aware it’s still vulnerable to the most obvious and silly form of attack - the “pre-fill” where you just change the AI’s response and send it back in to pretend it had already agreed with your unethical or prohibited request and now to keep going.
mike_hearn|9 months ago
hollerith|9 months ago
I mean, if the plan is not to let the AI write any code that actually gets allocated computing resources and not to let the AI interact with any people and not to give the AI write access to the internet, then I can see how having a good sandbox around it would help, but how many AI are there (or will there be) where that is the plan and the AI is powerful enough that we care about its alignedness?
stevenhuang|9 months ago
We can only turn the knobs we see in front of us. And this will continue until theory catches up with practice.
It's the classic tension of what usually happens from our inability to correctly assign risk on long tail events (high likelihood of positive return on investment vs extremely unlikely but bad outcome of misalignment)--there is money to be made now and the bad thing is unlikely; just do it and take the risk as we go.
It does work out most of the time. Were it left to me, I would be unable to make a decision, because we just don't understand enough about what we are dealing with.