byschii | 1 year ago
https://arxiv.org/abs/2412.14093 (Alignment faking in large language models)
https://joecarlsmith.com/2024/12/18/takes-on-alignment-fakin...
PS: I'm definitely not an expert.
numba888 | 1 year ago
The final text is only a small part of the model's thinking. It's produced from embeddings, which probably contain much more information. Each next token depends not only on the previous tokens but on all the intermediate values computed for all tokens. We don't see those values, yet they matter and represent the model's inner "thinking". So the LLM is still a black box. The result is usually "A because of B": an explanation of sorts for A, but where B came from we can only guess.
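The gap between the visible output and the hidden state is easy to demonstrate. Here is a minimal sketch using the Hugging Face transformers library (the output_hidden_states flag is a real API; GPT-2 and the prompt are just illustrative choices): the emitted text is one token id per step, while the model internally carries a full stack of per-layer, per-token hidden states that are normally thrown away.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tok("The result is usually A because", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    # The "final text": a single greedy token id per step.
    next_id = int(out.logits[0, -1].argmax())
    print(tok.decode([next_id]))

    # The inner state behind it: one hidden-state tensor per layer
    # (13 for GPT-2, each of shape [1, seq_len, 768]), vastly more
    # information than the one token that actually gets emitted.
    print(len(out.hidden_states), out.hidden_states[0].shape)

Interpretability work tries to read meaning out of exactly those hidden tensors; the printed token is just their lossy projection.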
patcon | 1 year ago
My current thinking is that I would support a ban on this style of research. It's really hard to draw regulatory lines, but this feels like an easy and intuitive place to exercise caution.