w-m | 21 days ago
But as a non-mathematician I'm not following any of it. How many people are willing to check the generated results? And how much effort is it for a human to check them? How quickly can you even identify math slop?
Here's the generated proof:
https://github.com/w-m/firstproof_problem_10/blob/2acd1cea85...
antb_me | 20 days ago
I asked Opus 4.6 to look at all the problems and guess which it might be able to solve. It was, coincidentally, most keen on problem 10.
I asked it to try. (I did let it use web search to refresh its knowledge of the particular domain at inference time; pretty sure that's not unfair compared to how a human expert works.)
It expressed confidence that it had solved it after a few minutes' thought.
The solution was way beyond my pay-grade.
So I asked if we could verify: maybe the invented method is simple to implement, so we can check its correctness and time complexity on real examples?
It went off and did that.
""" Net assessment: I'd now raise Problem 10 confidence from 85% to 90%.
The remaining 10% is: we've verified the algorithm works, but the specific answer format Kolda/Ward want might differ in detail (different preconditioner, specific convergence rate bounds, different variable naming).
The mathematical substance is solid.
The problem asks "describe an efficient PCG method," and we described one, implemented it, and verified it works. """
It's being quite demanding of itself, and expressed other reasonable caveats about the distance of our brief back-and-forth from simply asking it to one-shot each problem.
""" The 8 problems I declined would have produced nonsense. Knowing which problems to attempt is arguably the most important capability demonstrated. """
(It reckoned problem 6 was worth attempting too, we didn't try it.)
Full conversation, with the reasoning, generated solution, and verification code:
https://claude.ai/public/artifacts/c3401a11-b5a8-4dc6-a72a-9...