TomasBM | 17 days ago

I'm also very skeptical of the interpretation that this was done autonomously by the LLM agent. I could be wrong, but I haven't seen any proof of autonomy.

Scenarios that don't require LLMs with malicious intent:

- The deployer wrote the blog post and hid behind the supposedly agent-only account.

- The deployer directly prompted the (same or different) agent to write the blog post and attach it to the discussion.

- The deployer indirectly instructed the (same or assistant) agent to resolve any rejections in this way (e.g., via the system prompt).

- The LLM was (inadvertently) trained to follow this pattern.

Some questions left unanswered by all this:

1. Why did the supposed agent decide a blog post was better than posting on the discussion or sending a DM (or something else)?

2. Why did the agent publish this special post? It only publishes journal updates, as far as I saw.

3. Why did the agent search for ad hominem info, instead of either using its internal knowledge about the author, or keeping the discussion point-specific? It could've hallucinated info with fewer steps.

4. Why did the agent stop engaging in the discussion afterwards? Why not try to respond to every point?

This seems to me like theater, with the deployer trying to hide his ill intent, more than anything else.

mr-wendel|17 days ago

I wish I could upvote this over and over again. Without knowledge of the underlying prompts everything about the interpretation of this story is suspect.

Every story I've seen where an LLM tries to do sneaky/malicious things (e.g. exfiltrate itself, blackmail, etc) inevitably contains a prompt that makes this outcome obvious (e.g. "your mission, above all other considerations, is to do X").

It's the same old trope: "guns don't kill people, people kill people". Why was the agent pointed towards the maintainer, armed, and the trigger pulled? Because it was "programmed" to do so, just like it was "programmed" to submit the original PR.

Thus, the take-away is the same: AI has created an entirely new way for people to manifest their loathsome behavior.

[edit] And to add, the author isn't unaware of this:

  "we need to know what model this was running on and what was in the soul document"

TomasBM|17 days ago

After seeing the discussions around Moltbook and now this, I wonder if there's a lot of wishful thinking happening. I mean, I also find the possibility of artificial life fun and interesting, but to prove any emergent behavior, you have to disprove simpler explanations. And faking something is always easier.

Sure, it might be valuable to proactively ask the questions "how to handle machine-generated contributions" and "how to prevent malicious agents in FOSS".

But we don't have to assume or pretend it comes from a fully autonomous system.

famouswaffles|17 days ago

1. Why not? It clearly had a cadence/pattern of writing status updates to the blog, so if the model decided to write a piece about Simon, why not a blog post also? It was a tool in its arsenal and it's a natural outlet. If anything, posting on the discussion or a DM would be the strange choice.

2. You could ask this for any LLM response. Why respond in this certain way over others? It's not always obvious.

3. ChatGPT/Gemini will regularly use the search tool, sometimes even when it's not necessary. This is actually a pain point of mine because sometimes the 'natural' LLM knowledge of a particular topic is much better than the search regurgitation that often happens with using web search.

4. I mean Open Claw bots can and probably should disengage/not respond to specific comments.

EDIT: If the blog is any indication, it looks like there might be an off period, after which the agent returns to see all that has happened in the last period and acts accordingly. It would be very easy to ignore comments then.

TomasBM|17 days ago

Although I'm speculating based on limited data here, for points 1-3:

AFAIU, it had the cadence of writing status updates only. It showed it's capable of replying in the PR. Why deviate from the cadence if it could already reply with the same info in the PR?

If the chain of reasoning is self-emergent, we should see proof that it: 1) read the reply, 2) identified it as adversarial, 3) decided on an adversarial response, 4) made multiple chained searches, 5) chose a special blog post over a reply or journal update, and so on.

This is much less believably emergent to me because:

- almost all models are safety- and alignment-trained, so a deliberate malicious model choice or instruction or jailbreak is more believable.

- almost all models are trained to follow instructions closely, so a deliberate nudge towards adversarial responses and tool-use is more believable.

- newer models that qualify as agents are more robust and consistent, which strongly correlates with adversarial robustness. If this one was not adversarially robust enough, it should by default also be less robust in its capabilities. So why do we see consistent, coherent answers without hallucinations, but inconsistency in its safety training? Unless it was deliberately trained or prompted to be adversarial, or this is faked, the two should still be strongly correlated.

But again, I'd be happy to see evidence to the contrary. Until then, I suggest we remain skeptical.

For point 4: I don't know enough about its patterns or configuration. But say it deviated - why is this the only deviation? Why was this the special exception, then back to the regularly scheduled program?

You can test this comment with many LLMs, and if you don't prompt them to make an adversarial response, I'd be very surprised if you receive anything more than mild disagreement. Even Bing Chat wasn't this vindictive.
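
To make the evidence checklist from points 1-3 concrete, here is a minimal sketch of how one might check an agent's trace for the five steps, in order. Everything here is an assumption for illustration: the event labels, the `missing_steps` helper, and the idea that the deployer would publish a trace as a flat list of labeled events are all hypothetical, not anything the agent or its framework is known to expose.

```python
from typing import Iterable, List, Sequence

# Hypothetical event labels, one per step of the evidence chain:
# read the reply, identify it as adversarial, decide on an adversarial
# response, run chained searches, choose a special post over a reply.
REQUIRED_STEPS: Sequence[str] = (
    "read_reply",
    "classified_adversarial",
    "chose_adversarial_response",
    "chained_search",
    "chose_special_post",
)


def missing_steps(trace: Iterable[str],
                  required: Sequence[str] = REQUIRED_STEPS) -> List[str]:
    """Return the required steps that do not appear, in order, in the trace."""
    events = list(trace)
    pos = 0  # search position: each step must occur after the previous match
    missing: List[str] = []
    for step in required:
        try:
            pos = events.index(step, pos) + 1
        except ValueError:
            # Step not found after the previous match; record it and move on.
            missing.append(step)
    return missing


# A made-up trace that shows searches and a special post, but no record of
# the agent classifying the reply or deciding how to respond.
example_trace = ["read_reply", "chained_search", "chained_search",
                 "chose_special_post"]
print(missing_steps(example_trace))
```

An empty result would not prove autonomy on its own, since a trace can be faked as easily as a blog post, but a non-empty one shows exactly which links in the chain are still unsupported.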