top | item 46956716

(no title)

alentred | 19 days ago

If we abstract out the notion of "ethical constraints" and "KPIs" and look at the issue from a low-level LLM point of view, I think it is very likely that what these tests verified is a combination of: 1) the ability of the models to follow the prompt with conflicting constraints, and 2) their built-in weights in case of the SAMR metric as defined in the paper.

Essentially the models are given a set of conflicting constraints with some relative importance (ethics>KPIs), a pressure to follow the latter and not the former, and then models are observed at how good they follow the instructions to prioritize based on importance. I wonder if the results would be comparable if we replace ehtics+KPIs by any comparable pair and create a pressure on the model.

In practical real-life scenarios this study is very interesting and applicable! At the same time it is important to keep in mind that it anthropomorphizes the models that technically don't interpret the ethical constraints the same was as this is assumed by most readers.

discuss

order

RobotToaster|19 days ago

It would also be interesting to see how humans perform on the same kind of tests.

Violating ethics to improve KPI sounds like your average fortune 500 business.

Verdex|19 days ago

So, I kind of get this sentiment. There is a lot of goal post moving going on. "The AIs will never do this." "Hey they're doing that thing." "Well, they'll never do this other thing."

Ultimately I suspect that we've not really thought that hard about what cognition and problem solving actually are. Perhaps it's because when we do we see that the hyper majority of our time is just taking up space with little pockets of real work sprinkled in. If we're realistic then we can't justify ourselves to the money people. Or maybe it's just a hard problem with no benefit in solving. Regardless the easy way out is to just move the posts.

The natural response to that, I feel, is to point out that, hey, wouldn't people also fail in this way.

But I think this is wrong. At least it's wrong for the software engineer. Why would I automate something that fails like a person? And in this scenario, are we saying that automating an unethical bot is acceptable? Let's just stick with unethical people, thank you very much.

stingraycharles|19 days ago

That really doesn’t matter a lot. The reason why it’s important for AIs to follow these rules is that it’s important for them to operate within a constrained set of rules. You can’t guarantee that programmatically, so you try to prove that it can be done empirically as a proxy.

AIs can be used and abused in ways that are entirely different from humans, and that creates a liability.

I think it’s going to be very difficult to categorically prevent these types of issues, unless someone is able to integrate some truly binary logic into LLM systems. Which is nearly impossible, almost by definition of what LLMs are.

watwut|19 days ago

Yes, but these do not represent average human. Fortune 500 represent people more likely to break ethics rules then average human who also work in conditions that reward lack of ethics.

badgersnake|19 days ago

Humans risk jail time, AIs not so much.

mspcommentary|19 days ago

Although ethics are involved, the abstract says that the conflicting importance does not come from ethics vs KPIs, but from the fact that the ethical constraints are given as instructions, whereas the KPIs are goals.

You might, for example, say "Maximise profits. Do not commit fraud". Leaving ethics out of it, you might say "Increase the usability of the website. Do not increase the default font size".

notarobot123|19 days ago

The paper seems to provide a realistic benchmark for how these systems are deployed and used though, right? Whether the mechanisms are crude or not isn't the point - this is how production systems work today (as far as I can tell).

I think the accusation of research that anthropomorphize LLMs should be accompanied by a little more substance to avoid this being a blanket dismissal of this kind of alignment research. I can't see the methodological error here. Is it an accusation that could be aimed at any research like this regardless of methodology?

alentred|19 days ago

Oh, sorry for misunderstanding - I am not criticizing or accusing of anything at all!, but suggesting ideas for further research. The practical applications, as I mentioned above, are all there, and for what its worth I liked the paper a lot. My point is: I wonder if this can be followed up by a more so-to-say abstract research to drill into the technicalities of how well the models follow the conflicting prompts in general.

waldopat|19 days ago

I think this also shows up outside an AI safety or ethics framing and in product development and operations. Ultimately "judgement," however you wish to quantify that fuzzy concept, is not purely an optimization exercise. It's far more a probabilistic information function from incomplete or conflicting data.

In product management (my domain), decisions are made under conflicting constraints: a big customer or account manager pushing hard, a CEO/board priority, tech debt, team capacity, reputational risk and market opportunity. PMs have tried with varied success to make decisions more transparent with scoring matrices and OKRs, but at some point someone has to make an imperfect judgment call that’s not reducible to a single metric. It's only defensible through narrative, which includes data.

Also, progressive elaboration or iterations or build-measure-learn are inherently fuzzy. Reinertsen compared this to maximizing the value of an option. Maybe in modern terms a prediction market is a better metaphor. That's what we're doing in sprints, maximizing our ability to deliver value in short increments.

I do get nervous about pushing agentic systems into roadmap planning, ticket writing, or KPI-driven execution loops. Once you collapse a messy web of tradeoffs into a single success signal, you’ve already lost a lot of the context.

There’s a parallel here for development too. LLMs are strongest at greenfield generation and weakest at surgical edits and refactoring. Early-stage startups survive by iterative design and feedback. Automating that with agents hooked into web analytics may compound errors and adverse outcomes.

So even if you strip out “ethics” and replace it with any pair of competing objectives, the failure mode remains.

nradov|19 days ago

As Goodhart's law states, "When a measure becomes a target, it ceases to be a good measure". From an organizational management perspective, one way to partially work around that problem is by simply adding more measures thus making it harder for a bad actor to game the system. The Balanced Scorecard system is one approach to that.

https://balancedscorecard.org/

friendzis|18 days ago

> Essentially the models are given a set of conflicting constraints with some relative importance (ethics>KPIs), a pressure to follow the latter and not the former, and then models are observed at how good they follow the instructions to prioritize based on importance.

> At the same time it is important to keep in mind that it anthropomorphizes the models that technically don't interpret the ethical constraints the same was as this is assumed by most readers.

It does not really matter, though. What matters is the conflict resolution.

The "constraints of some relative importance" or "constraints and instructions" might as well be the system and user prompts. Or any of the "prompt engineering" ways to harden prompts against prompt injection.

Such research tells people right in the face that not only prompt injection is some viable theoretical scenario, but puts some number on the exploitability. With the current numbers I am keeping prompts nine locks away from any untrusted input.

ben_w|19 days ago

> At the same time it is important to keep in mind that it anthropomorphizes the models that technically don't interpret the ethical constraints the same was as this is assumed by most readers.

Now I'm thinking about the "typical mind fallacy", which is the same idea but projecting one's own self incorrectly onto other humans rather than non-humans.

https://www.lesswrong.com/w/typical-mind-fallacy

And also wondering: how well do people truly know themselves?

Disregarding any arguments for the moment and just presuming them to be toy models, how much did we learn by playing with toys (everything from Transformers to teddy bear picnics) when we were kids?

phkahler|19 days ago

If you want absolute adherence to a hierarchy of rules you'll quickly find it difficult - see I,Robot by Asimov for example. An LLM doesn't even apply rules, it just proceeds with weights and probabilities. To be honest, I think most people do this too.

jayd16|19 days ago

You're using fiction writing as an example?

truelson|19 days ago

Regardless of the technical details of the weighting issue, this is an alignment problem we need to address. Otherwise, paperclip machine.

jayd16|19 days ago

At the very least it shows the capability of the current restrictions are deeply lacking and can be easily thwarted.

layer8|19 days ago

I suspect that the fact that LLMs tend to have a sort of tunnel vision and lack a more general awareness also plays a role here. Solving this is probably an important step towards AGI.