top | item 44086875

(no title)

twsted | 9 months ago

I know that Anthropic is one of the most serious company working on the problem of the alignment, but the current approaches seem extremely naive.

We should do better than giving the models a portion of good training data or a new mitigating system prompt.

discuss

SV_BubbleTime|9 months ago

I am aware in relative terms you are correct about Anthropic.

But I’m having a hard time describing and AI company “serious” when they’re shipping a product that can email real people on its own, and perform other real actions - while they are aware it’s still vulnerable to the most obvious and silly form of attack - the “pre-fill” where you just change the AI’s response and send it back in to pretend it had already agreed with your unethical or prohibited request and now to keep going.

mike_hearn|9 months ago

The solution here is ultimately going to be a mix of training and, equally importantly, hard sandboxing. The AI companies need to do what Google did when they started Chrome and buy up a company or some people who have deep expertise in sandbox design.

hollerith|9 months ago

I'm confused: can you explain how the sandbox helps?

I mean, if the plan is not to let the AI write any code that actually gets allocated computing resources and not to let the AI interact with any people and not to give the AI write access to the internet, then I can see how having a good sandbox around it would help, but how many AI are there (or will there be) where that is the plan and the AI is powerful enough that we care about its alignedness?

stevenhuang|9 months ago

You are right, but the field is moving too fast and so it is forced to at least try to confront the problem with the limited tools and understanding available.

We can only turn the knobs we see in front of us. And this will continue until theory catches up with practice.

It's the classic tension of what usually happens from our inability to correctly assign risk on long tail events (high likelihood of positive return on investment vs extremely unlikely but bad outcome of misalignment)--there is money to be made now and the bad thing is unlikely; just do it and take the risk as we go.

It does work out most of the time. Were it left to me, I would be unable to make a decision, because we just don't understand enough about what we are dealing with.