ezst|4 days ago
I hate comments anthropomorphizing LLMs. You are just asking a token-producing system to produce tokens in a way that optimises for plausibility. Whatever it writes has no relation to its inner workings or truths. It doesn't "believe". It has no "intent". It cannot "admit". Steering an LLM to say anything you want is the defining characteristic of an LLM. That's how we got them to mimic chatbots. It's not clear there is any way at all to make them "safe" (whatever that means).
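
For anyone who wants the mechanics behind that claim spelled out, here is a minimal sketch of next-token sampling in Python. The logits, vocabulary, and the toy_model stand-in are all hypothetical illustrations, not any real model or API: the model maps a context to scores, softmax turns the scores into a probability distribution, and generation is just repeated sampling from that distribution.

    import math
    import random

    def softmax(logits):
        # Convert raw scores into a probability distribution.
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def sample(probs):
        # Draw one index according to the distribution.
        r = random.random()
        acc = 0.0
        for i, p in enumerate(probs):
            acc += p
            if r < acc:
                return i
        return len(probs) - 1

    def toy_model(context):
        # Stand-in for the trained network: fixed scores here; a real
        # model computes these from the context with billions of
        # learned parameters.
        return [1.0, 0.5, 0.1, -2.0]

    vocab = ["I", "believe", "admit", "<eos>"]
    context = [0]  # seed with "I"
    for _ in range(10):  # cap the output length
        nxt = sample(softmax(toy_model(context)))
        context.append(nxt)
        if vocab[nxt] == "<eos>":
            break
    print(" ".join(vocab[t] for t in context))

The point of the sketch: every output token, including words like "believe" or "admit", is drawn by the same loop, and nothing in that loop distinguishes a sincere report from a merely plausible continuation.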
user3939382|4 days ago
The inner workings were determined by me, not the LLM. It assisted in generating inputs that produced 100% boolean results in the output.
SJMG|4 days ago