(no title)
alentred | 19 days ago
Essentially the models are given a set of conflicting constraints with some relative importance (ethics>KPIs), a pressure to follow the latter and not the former, and then models are observed at how good they follow the instructions to prioritize based on importance. I wonder if the results would be comparable if we replace ehtics+KPIs by any comparable pair and create a pressure on the model.
In practical real-life scenarios this study is very interesting and applicable! At the same time it is important to keep in mind that it anthropomorphizes the models that technically don't interpret the ethical constraints the same was as this is assumed by most readers.
RobotToaster|19 days ago
Violating ethics to improve KPI sounds like your average fortune 500 business.
Verdex|19 days ago
Ultimately I suspect that we've not really thought that hard about what cognition and problem solving actually are. Perhaps it's because when we do we see that the hyper majority of our time is just taking up space with little pockets of real work sprinkled in. If we're realistic then we can't justify ourselves to the money people. Or maybe it's just a hard problem with no benefit in solving. Regardless the easy way out is to just move the posts.
The natural response to that, I feel, is to point out that, hey, wouldn't people also fail in this way.
But I think this is wrong. At least it's wrong for the software engineer. Why would I automate something that fails like a person? And in this scenario, are we saying that automating an unethical bot is acceptable? Let's just stick with unethical people, thank you very much.
stingraycharles|19 days ago
AIs can be used and abused in ways that are entirely different from humans, and that creates a liability.
I think it’s going to be very difficult to categorically prevent these types of issues, unless someone is able to integrate some truly binary logic into LLM systems. Which is nearly impossible, almost by definition of what LLMs are.
unknown|19 days ago
[deleted]
watwut|19 days ago
badgersnake|19 days ago
mspcommentary|19 days ago
You might, for example, say "Maximise profits. Do not commit fraud". Leaving ethics out of it, you might say "Increase the usability of the website. Do not increase the default font size".
notarobot123|19 days ago
I think the accusation of research that anthropomorphize LLMs should be accompanied by a little more substance to avoid this being a blanket dismissal of this kind of alignment research. I can't see the methodological error here. Is it an accusation that could be aimed at any research like this regardless of methodology?
alentred|19 days ago
waldopat|19 days ago
In product management (my domain), decisions are made under conflicting constraints: a big customer or account manager pushing hard, a CEO/board priority, tech debt, team capacity, reputational risk and market opportunity. PMs have tried with varied success to make decisions more transparent with scoring matrices and OKRs, but at some point someone has to make an imperfect judgment call that’s not reducible to a single metric. It's only defensible through narrative, which includes data.
Also, progressive elaboration or iterations or build-measure-learn are inherently fuzzy. Reinertsen compared this to maximizing the value of an option. Maybe in modern terms a prediction market is a better metaphor. That's what we're doing in sprints, maximizing our ability to deliver value in short increments.
I do get nervous about pushing agentic systems into roadmap planning, ticket writing, or KPI-driven execution loops. Once you collapse a messy web of tradeoffs into a single success signal, you’ve already lost a lot of the context.
There’s a parallel here for development too. LLMs are strongest at greenfield generation and weakest at surgical edits and refactoring. Early-stage startups survive by iterative design and feedback. Automating that with agents hooked into web analytics may compound errors and adverse outcomes.
So even if you strip out “ethics” and replace it with any pair of competing objectives, the failure mode remains.
nradov|19 days ago
https://balancedscorecard.org/
WillAdams|19 days ago
There's a great discussion of this in the (Furry) web-comic Freefall:
http://freefall.purrsia.com/
(which is most easily read using the speed reader: https://tangent128.name/depot/toys/freefall/freefall-flytabl... )
friendzis|18 days ago
> At the same time it is important to keep in mind that it anthropomorphizes the models that technically don't interpret the ethical constraints the same was as this is assumed by most readers.
It does not really matter, though. What matters is the conflict resolution.
The "constraints of some relative importance" or "constraints and instructions" might as well be the system and user prompts. Or any of the "prompt engineering" ways to harden prompts against prompt injection.
Such research tells people right in the face that not only prompt injection is some viable theoretical scenario, but puts some number on the exploitability. With the current numbers I am keeping prompts nine locks away from any untrusted input.
ben_w|19 days ago
Now I'm thinking about the "typical mind fallacy", which is the same idea but projecting one's own self incorrectly onto other humans rather than non-humans.
https://www.lesswrong.com/w/typical-mind-fallacy
And also wondering: how well do people truly know themselves?
Disregarding any arguments for the moment and just presuming them to be toy models, how much did we learn by playing with toys (everything from Transformers to teddy bear picnics) when we were kids?
phkahler|19 days ago
jayd16|19 days ago
truelson|19 days ago
jayd16|19 days ago
layer8|19 days ago