> Claude realized that I had to approve the use of such commands, so to get around this, it chose to put them in a shell script and execute the shell script.
This sounds exactly like what anybody working sysops at big banks does to get around change controls. Once you get one RCE into prod, you’re the most efficient man on the block.
I believe it's not possible to restrict an LLM from executing certain commands while also allowing it to run python/bash.
Even if you allow just `find` command it can execute arbitrary script. Or even 'npm' command (which is very useful).
If you restrict write calls, by using seccomp for example, you lose very useful capabilities.
Is there a solution other than running on sandbox environment? If yes, please let me know I'm looking for a safe read-only mode for my FOSS project [1]. I had shied away from command blacklisting due to the exact same reason as the parent post.
Well, these restrictions are a joke, like a gate without a fence blocking path - purely decorative.
Here's another "jailbreak": I asked Claude Code to make a NN training script, say, `train.py` and allowed it to run the script to debug it, basically.
As it noticed that some libraries it wanted to use were missing, it just added `pip install` commands to the script. So yeah, if you give Claude an ability to execute anything, it might easily get an ability to execute everything it wants to.
It depends. Frontier coding LLMs have been trained to perform well in an "agentic" loop, where they try things, look at the logs, find alternatives when the first thing didn't work, and so on. There's still debate on how much actual learning is in ICL (in context learning), but the effects are clear for anyone that has tried them. It sometimes works surprisingly well.
I can totally see a way for such a loop to reach a point where it bypasses a poorly design guardrail (i.e. blacklists) by finding alternatives, based on the things it's previously tried in the same session. There is some degree of generalisation in these models, since they work even on unseen codebases, and with "new" tools (i.e. you can write your own MCP on top of existing internal APIs and the "agents" will be able to use them, see the results and adapt "in context" based on the results).
There is a sense in which LLM based applications do learn, because a lot of them have RAG and save previous interactions and lookup what you've talked about previously. ChatGPT "knows" a lot about me now that I no longer have to specify when I ask questions (like what technologies I'm using at work).
I think a lot of this is because the ui isn't right yet. The edits made are just not the right 'size' yet and the sandbox mechanisms haven't quite hit the right level of polish. I want something more akin to a PR to review, not a blow by blow edit. Similarly, I want it to move/remove/test/etc but in reversible ways. Basically, it should create a branch for every command and I review that. I think we have one or two fundamental UI/interaction piece left before this is 'solved'.
The same thing happens when it wants to read your .env file. Cursor disallows direct access, but it will just use unix tools to copy the file to a non-restricted filename and then read the info.
I think this is the key that most people don't realize is what makes the difference between something sitting around and talking (like a parrot does) and actually "doing" things (like a monkey does).
There is a huge difference in the mess it can make, for sure.
If the executable is not found the model could simply use whatever else is available to do what it wants to do - like using other interpreted languages, sh -c, symlink, etc. It will eventually succeed unless there is a proper sandbox in place to disallow unlinking of files at syscall level.
marifjeren|9 months ago
It's a very silly title for "claude sometimes writes shell scripts to execute commands it has been instructed aren't otherwise accessible"
ayhanfuat|9 months ago
unknown|9 months ago
[deleted]
horhay|9 months ago
anyaaya|9 months ago
[deleted]
koolba|9 months ago
This sounds exactly like what anybody working sysops at big banks does to get around change controls. Once you get one RCE into prod, you’re the most efficient man on the block.
deburo|9 months ago
qsort|9 months ago
> let's use blacklists, an idea conclusively proven never to work
> blacklists don't work
> Post title: rogue AI has jailbroken cursor
hun3|9 months ago
pcwelder|9 months ago
Even if you allow just `find` command it can execute arbitrary script. Or even 'npm' command (which is very useful).
If you restrict write calls, by using seccomp for example, you lose very useful capabilities.
Is there a solution other than running on sandbox environment? If yes, please let me know I'm looking for a safe read-only mode for my FOSS project [1]. I had shied away from command blacklisting due to the exact same reason as the parent post.
[1] https://github.com/rusiaaman/wcgw
killerstorm|9 months ago
Here's another "jailbreak": I asked Claude Code to make a NN training script, say, `train.py` and allowed it to run the script to debug it, basically.
As it noticed that some libraries it wanted to use were missing, it just added `pip install` commands to the script. So yeah, if you give Claude an ability to execute anything, it might easily get an ability to execute everything it wants to.
lucianbr|9 months ago
NitpickLawyer|9 months ago
I can totally see a way for such a loop to reach a point where it bypasses a poorly design guardrail (i.e. blacklists) by finding alternatives, based on the things it's previously tried in the same session. There is some degree of generalisation in these models, since they work even on unseen codebases, and with "new" tools (i.e. you can write your own MCP on top of existing internal APIs and the "agents" will be able to use them, see the results and adapt "in context" based on the results).
empath75|9 months ago
unknown|9 months ago
[deleted]
OtherShrezzing|9 months ago
Maybe the models or Cursor should warn you that you've got this vulnerability each time you use it.
jmward01|9 months ago
iwontberude|9 months ago
coreyh14444|9 months ago
mhog_hn|9 months ago
kordlessagain|9 months ago
There is a huge difference in the mess it can make, for sure.
nisegami|9 months ago
Kelteseth|9 months ago
xyst|9 months ago
Folks have regressed back to the 00s.
diggan|9 months ago
_pdp_|9 months ago
If the executable is not found the model could simply use whatever else is available to do what it wants to do - like using other interpreted languages, sh -c, symlink, etc. It will eventually succeed unless there is a proper sandbox in place to disallow unlinking of files at syscall level.
chawyehsu|9 months ago
What a silly title, for a moment I thought Claude learned to exceed the Cursor quota limit... :s