top | item 47125968

(no title)

dmk | 8 days ago

So I guess the key takeaway is basically that the better Claude gets at producing polished output, the less users bother questioning it. They found that artifact conversations have lower rates of fact-checking and reasoning challenges across the board. That's kind of an uncomfortable loop for a company selling increasingly capable models.

discuss

Terr_|8 days ago

> the less users bother questioning it

This makes me think of checklists. We have decades of experience in uncountable areas showing that checklists reminding users to question the universe improve outcomes: Is the chemical mixture at the temperature indicated by the chart? Did you get confirmation from Air Traffic Control? Are you about to amputate the correct limb? Is this really the file you want to permanently erase?

Yet our human brains are usually primed to skip steps, take shortcuts, and see what we expect rather than what's really there. It's surprisingly hard to keep doing the work both consistently and to notice deviations.

> lower rates of fact-checking and reasoning challenges

Now here we are with LLMs, geared to produce a flood of superficially-plausible output which strikes at our weak-point, the ability to do intentional review in a deep and sustained way. We've automated the stuff that wasn't as-hard and putting an even greater amount of pressure on the remaining bottleneck.

Rather than the old definition involving customer interaction and ads, I fear the new "attention economy" is going to be managing the scarce resource of human inspection and validation.

jimbokun|8 days ago

Sounds like having a strong checklist of steps to take for every pull request will be crucial for creating reliable and correct software when AIs write most of the code.

But the temptation to short change this step when it becomes the bottleneck for shipping code will become immense.

boplicity|8 days ago

> So I guess the key takeaway is basically that the better Claude gets at producing polished output, the less users bother questioning it.

This is exactly what I worry about when I use AI tools to generate code. Even if I check it, and it seems to work, it's easy to think, "oh, I'm done." However, I'll (often) later find obvious logical errors that make all of the code suspect. I don't bother, most of the time though.

I'm starting to group code in my head by code I've thoroughly thought about, and "suspect" code that, while it seems to work, is inherently not trustworthy.

Florin_Andrei|8 days ago

I think we're still at the stage where model performance largely depends on:

- how many data sources it has access to

- the quality of your prompts

So, if prompting quality decreases, so does model performance.

dmk|8 days ago

Sure, but the study is saying something slightly different, it's not that people write bad prompts for artifacts, they actually write better ones (more specific, more examples, clearer goals,...). They just stop evaluating the result. So the input quality goes up but the quality control goes down.

jimbokun|8 days ago

Seems like it’s impossible for output to be good if the prompt is bad. Unless the AI is ignoring the literal instructions and just guessing “what you really want” which would be bad in a different way.

candiddevmike|8 days ago

What does prompting quality even mean, empirically? I feel like the LLM providers could/should provide prompt scoring as some kind of metric and provide hints to users on ways they can improve (possibly including ways the LLM is specifically trained to act for a given prompt).