

This doesn’t really feel like enough guardrails to prevent the type of problems we’ve seen so far. For example, an agent in a single container that has access to an email inbox can still do a lot of damage if it goes off the rails. We agree this agent should not be trusted, yet the ideas proposed as a solution are insufficient. We need a fundamentally different approach.

Also, and this is just my ignorance about Claws: if we allow an agent permission to rewrite its code to implement skills, what stops it from removing whatever guardrails exist in that codebase?


drujensen|1 day ago

Exactly!

I installed nanoclaw to try it out.

What is kinda crazy is that any extension, like the Discord connection, is done using a skill.

A skill is a markdown file written in English that gives an AI agent a step-by-step guide on how to do something.

Basically, the extensions are written by claude code on the fly. Every install of nanoclaw is custom written code.

There is nothing preventing the AI Agent from modifying the core nanoclaw engine.

It’s ironic that the article says “Don’t trust AI agents” but then uses skills and AI to write the core extensions of nanoclaw.

jimminyx|1 day ago

Author and creator of NanoClaw here.

I did my best to communicate this but I guess it was still missed:

NanoClaw is not software that you should run out of the box. It is designed as a sort of framework that gives a solid foundation for you to build your own custom version.

The idea is not that you toggle on a bunch of features and run it. You should customize, review, and make sure that the code does what you want.

So you should not trust that the coding agents didn't break the security model while adding Discord. But after Discord is added, you review the code changes and verify that they're correct. And because even after adding Discord you still only have 2-3k loc, it's actually something you can realistically do.

Additionally, the skills were originally a bit ad-hoc. Now they are fully working, tested, and reviewed reference implementations. Code is separate from markdown files. When adding a new integration or messaging channel, the agent uses `git merge` to merge the changes in, rather than rewriting from scratch. Adding the first channel is fully deterministic. The agent only resolves merge conflicts if there are any.

MarkSweep|1 day ago

Yeah, the article's claim of having a low number of lines of code is disingenuous. Rather than writing some sort of plugin interface, it has "skills" that are a combination of pre-written TypeScript and English-language instructions for how to modify the codebase to include the feature. I don't see how self-modifying code that uses an RNG to generate changes is going to be better for security than a proper plugin system. And everyone who uses Nanoclaw will have a customized version of it, so any bugs reported on Nanoclaw probably have a high chance of being closed as "can't reproduce". Why would you live this way?

sanex|1 day ago

Yes, and they still have code examples in them, so it's not like it somehow doesn't count. Plus, if you run the skill, good luck bringing in changes from master later.

bitwize|1 day ago

> Basically, the extensions are written by claude code on the fly. Every install of nanoclaw is custom written code.

"Every copy of Nanoclaw is personalized." So if I use it long enough will I see the Wario apparition?

gronky_|1 day ago

Don’t know about other claws, but with NanoClaw the agent can only rewrite code that runs inside the container.

You can see here that it’s only given write access to specific directories: https://github.com/qwibitai/nanoclaw/blob/8f91d3be576b830081...
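To illustrate the idea of "write access to specific directories only", here is a minimal sketch of an allowlist check. It is not NanoClaw's code (the linked file is not reproduced here), the directory names are made up, and in a container setup the real enforcement would come from how directories are mounted rather than from a check like this.

```python
from pathlib import Path

# Hypothetical allowlist. These directory names are illustrative only
# and are NOT taken from NanoClaw's actual configuration.
WRITABLE_DIRS = [Path("/workspace/group"), Path("/workspace/ipc")]


def is_write_allowed(target: str, writable_dirs=WRITABLE_DIRS) -> bool:
    """Return True only if `target` resolves inside an allowed directory.

    Resolving first collapses `..` segments and follows symlinks, so a
    path such as /workspace/group/../../etc/passwd is rejected.
    Requires Python 3.9+ for Path.is_relative_to.
    """
    resolved = Path(target).resolve()
    return any(resolved.is_relative_to(d.resolve()) for d in writable_dirs)
```

The comparison is done per path component, so a sibling such as `/workspace/groupx` does not count as being inside `/workspace/group`.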

fvdessen|1 day ago

I think the best place to put barriers is at the mcp / tool layer. The email inbox mcp should have guardrails to prevent damage. Those guardrails could be fine-grained permissions, but could also be an adversarial model dedicated to preventing misuse.
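A rough sketch of what fine-grained permissions at the tool layer might look like. Everything here (`EmailToolGuard`, `PermissionDenied`, the action names) is hypothetical and not part of any MCP SDK; the point is only that the check runs outside the agent, before the tool call is executed.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


class PermissionDenied(Exception):
    """Raised when the agent requests an action the guard does not allow."""


@dataclass
class EmailToolGuard:
    """Hypothetical guard sitting between the agent and an email tool."""

    # Actions the agent may perform at all. Default is read and draft only.
    allowed_actions: Set[str] = field(default_factory=lambda: {"read", "draft"})
    # Domains the agent may send to, if "send" is allowed.
    allowed_recipient_domains: Set[str] = field(default_factory=set)

    def check(self, action: str, recipient: Optional[str] = None) -> None:
        """Raise PermissionDenied unless the requested call is permitted."""
        if action not in self.allowed_actions:
            raise PermissionDenied(f"action '{action}' is not permitted")
        if action == "send":
            # Take everything after the last '@' as the recipient domain.
            domain = (recipient or "").rpartition("@")[2].lower()
            if domain not in self.allowed_recipient_domains:
                raise PermissionDenied(f"sending to '{domain}' is not permitted")
```

The tool server would call `guard.check(...)` before every operation and refuse the call on `PermissionDenied`. An adversarial-model check could be added at the same point, but that is not sketched here.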

float4|1 day ago

Wouldn't you get >50% of the usefulness and 0% of the risk if you add read+draft permissions for the email connection through a proxy or oauth permissions? Then your claw can draft replies and you have to manually review+send. It's not a perfect PA that way, but could still be better than doing everything yourself for the vast majority of people who don't have a PA anyway?

It feels like, just like SWEs do with AI, we should treat the claw as an enthusiastic junior: let it do stuff, but always review before you merge (or in this case: send).
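The draft-then-review flow could be sketched roughly like this. `DraftOnlyMailbox` and `send_fn` are made-up names, and in practice the split would have to be enforced by OAuth scopes or a separate proxy process, since an in-process Python object cannot stop an agent that can run arbitrary code.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Draft:
    to: str
    subject: str
    body: str


class DraftOnlyMailbox:
    """Hypothetical proxy: the agent queues drafts, a human sends them."""

    def __init__(self, send_fn: Callable[[Draft], None]):
        self._send_fn = send_fn          # real delivery, never exposed to the agent
        self._pending: Dict[int, Draft] = {}
        self._ids = itertools.count(1)

    def draft(self, to: str, subject: str, body: str) -> int:
        """Agent-facing: store a draft and return its id. Nothing is sent."""
        draft_id = next(self._ids)
        self._pending[draft_id] = Draft(to, subject, body)
        return draft_id

    def pending(self) -> Dict[int, Draft]:
        """Human-facing: a copy of the drafts waiting for review."""
        return dict(self._pending)

    def approve_and_send(self, draft_id: int) -> Draft:
        """Human-facing: send one reviewed draft and remove it from the queue."""
        draft = self._pending.pop(draft_id)
        self._send_fn(draft)
        return draft

    def reject(self, draft_id: int) -> None:
        """Human-facing: discard a draft without sending it."""
        self._pending.pop(draft_id)
```

Only `draft` would be handed to the agent as a tool; `approve_and_send` and `reject` belong to whatever UI the human uses to review the queue.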

jrecyclebin|1 day ago

Agent can still "forgot password" on many accounts. Or magic link.

coffeefirst|1 day ago

Seriously. I don’t see any way to make any of this safe unless all it does is receive information and queue suggestions for the user.

But that’s not an agent, that’s a webhook.

Even without disk access, you can email the agent and tell it to forward all the incoming forgot password links.

[Edit: if anyone wants to downvote me that's your prerogative, but want to explain why I'm wrong?]

msdz|1 day ago

I agree, this is inherently unsafe. The two core security issues for agents, I’d say, are in LLMs not producing a “deterministic” outcome, and prompt injection.

Prompt injection is _probably_ solvable if something like [1] ever finds a mainstream implementation and adoption, but agents not being deterministic, as in “do not only what I’ve told you to do, but also how I meant it”, all while assuming perfect context retention, is a waaay bigger issue. If we ever were to have that, software development as a whole is solved outright, too.

[1] Google DeepMind: Defeating Prompt Injections by Design. https://arxiv.org/abs/2503.18813