top | item 42401128

tomatovole | 1 year ago

Is there a metric I can look at in engine evaluations to determine when a situation is "risky" for white or black (e.g., the situation above) even if it looks equal with perfect play?

I've always been interested in understanding situations where this is the case (and the opposite, where the engine favours one side but it seems to require a long, hard-to-find sequence of moves).

Playing out the top lines helps if equality requires perfect play from one side.

jawarner | 1 year ago

You can measure the sharpness of the position, as in section 2.3 ("Complexity of a position") of this paper. They find their metric correlates with human performance.

https://en.chessbase.com/news/2006/world_champions2006.pdf

somenameforme | 1 year ago

I think this is something a bit different. That sort of assessment is going to find humans perform poorly in extremely sharp positions with lots of complicated lines that are difficult to evaluate. And that is certainly true. A tactical position that a computer can 'solve' in a few seconds can easily be missed by even very strong humans.

But the position Ding was in was neither sharp nor complex. A good analog to the position there is the rook + bishop v rook endgame. With perfect play that is, in most cases, a draw - and there are even formalized drawing techniques in any endgame text. But in practice it's really quite difficult, to the point that even grandmasters regularly lose it.

In those positions, on most moves, almost any move is a draw. But the side with the bishop does have ways to inch up the pressure, and so the difficulty is making sure you recognize when you finally reach one of those moves where you actually need to deal with a concrete threat. The position Ding forced was very similar.

On most moves, almost anything led to a draw - until it didn't. Gukesh had all sorts of ways to try to prod at Ding's position and make progress - prodding Ding's bishop, penetrating with his king, maneuvering his bishop to a stronger diagonal, cutting off Ding's king, and of course eventually pushing one of the pawns. He was going to be able to play for hours, just constantly prodding, while Ding would have to stay 100% alert to when a critical threat emerged.

And this is all why Ding lost. His final mistake looks (and was) elementary, and he noticed it immediately after moving - but the reason he made that mistake is that he was thinking about how to parry the countless other dangerous threats, and he simply missed one. This is why nearly everybody was shocked that Ding went for this endgame. It's just so dangerous in practical play, even if the computer can easily show you a zillion ways to draw it.

jquery | 1 year ago

Nice paper. I’d like it if someone re-ran the numbers using modern chess engines… the engine they used is exceedingly weak by modern standards.

qq66 | 1 year ago

Making a computer play like a 1300-rated human is harder than making a computer beat Magnus Carlsen.

SpaceManNabs | 1 year ago

This is really interesting, because I ran into a Pokémon bot the other day where its training led to a calibrated 50% win rate at all levels of play on Pokémon Showdown. It was a complete accident.

dorgo | 1 year ago

Take the computer which beats Magnus and restrict it so it never makes the best move in a position. Expand this to the N best moves as needed to reach a 1300 rating.
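As a rough illustration of that handicapping idea (the function name and the centipawn scores are made up; a real implementation would pull its candidate list from an engine's MultiPV output):

```python
# Hypothetical sketch: weaken an engine by discarding its top-N moves.
# `multipv` is a list of (move, centipawn_score) pairs sorted best-first,
# as an engine running in MultiPV mode would report them.

def weakened_choice(multipv, skip_best=1):
    """Return the strongest move after discarding the top `skip_best` lines.

    Falls back to the last available move when the list is short
    (e.g. forced positions with only one legal move).
    """
    if not multipv:
        raise ValueError("no candidate moves")
    index = min(skip_best, len(multipv) - 1)
    return multipv[index][0]

# Made-up example scores:
lines = [("Nf3", 35), ("d4", 30), ("e4", 28), ("c4", 25)]
print(weakened_choice(lines, skip_best=1))  # -> d4
```

In practice N would presumably have to be tuned against rated human games, since engines handicapped this way tend to fail in characteristically non-human ways.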

nilslindemann | 1 year ago

The metric is to play the position against Stockfish. If you draw it again and again, it is trivial, otherwise, not so simple :-)

paulddraper | 1 year ago

You can evaluate on lower depth/time.

But even that isn't a good proxy.

Humans cannot out-FLOP a computer, so they need to use patterns (like an LLM). To get the human perspective, the engine would need to do something similar.
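One crude way to operationalise the lower-depth idea (the function, threshold, and scores are all illustrative assumptions, not any engine's API):

```python
# Hypothetical sketch: if a shallow evaluation disagrees strongly with a
# deep one, the position's point likely requires concrete calculation
# rather than pattern recognition -- a rough proxy for human difficulty.

def looks_tricky(shallow_eval, deep_eval, threshold=100):
    """True when shallow and deep evals (centipawns) differ by more
    than `threshold`."""
    return abs(shallow_eval - deep_eval) > threshold

print(looks_tricky(30, 250))  # -> True: deep search finds something hidden
print(looks_tricky(30, 60))   # -> False: the surface eval holds up
```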

lxgr | 1 year ago

There are several neural network based engines these days, including one that does exclusively what you describe (i.e. "patterns only", no calculation at all) [1], and one that's trained on human games [2].

Even Stockfish uses a neural network these days by default for its positional evaluation, but it's relatively simple/lightweight in comparison to these, and it gains its strength from being used as part of deep search, rather than using a powerful/heavy neural network in a shallow tree search.

[1] https://arxiv.org/html/2402.04494v1

[2] https://www.maiachess.com/

Leary | 1 year ago

fernandopj | 1 year ago

This is great, but I think that % is about the "correctness" of the move, not how likely it is to be played next.

scott_w | 1 year ago

Not really, because it’s relative to the level of the player. What’s a blunder to a master might only be an inaccuracy to a beginner, and the same applies at higher levels of play. I’ve watched GothamChess say “I’ve no idea why <INSERT GM> made this move, but it’s the only move,” and then Hikaru Nakamura will rattle off a weird 8-move sequence to explain why it’s a major advantage despite no pieces being lost. Stockfish is a level above even Magnus if given enough depth.

umanwizard | 1 year ago

> Stockfish is a level above even Magnus if given enough depth.

"a level" and "if given enough depth" are both underselling it. Stockfish running on a cheap phone with equal time for each side will beat Magnus 100 games in a row.

esfandia | 1 year ago

Maybe the difference between the eval of the best move vs the next one(s)? An "only move" situation would be more risky than when you have a choice between many good moves.

fernandopj | 1 year ago

That's it exactly. Engines will often show you at least three lines, each with its evaluation, and you can often gauge the difficulty just from that delta from the 1st to the 2nd best move. With some practical chess experience you can also "feel" how natural or esoteric the best move is.
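That first-to-second-move delta can be sketched in a few lines (the function and the made-up centipawn scores are illustrative, not taken from any particular engine):

```python
# Hypothetical sketch: sharpness as the mean eval drop from the best
# move to each alternative. `scores` are centipawn evaluations of the
# candidate moves, best first, as MultiPV output would list them.

def sharpness(scores):
    """Large values suggest an "only move" position; values near zero
    suggest many playable choices. Returns 0.0 with no alternatives."""
    if len(scores) < 2:
        return 0.0
    best = scores[0]
    drops = [best - s for s in scores[1:]]
    return sum(drops) / len(drops)

print(sharpness([20, 15, 12]))      # -> 6.5 (calm: many decent moves)
print(sharpness([50, -180, -250]))  # -> 265.0 (sharp: one move stands out)
```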

In the WCC match between Caruana and Carlsen, there was one difficult endgame where Carlsen (the champion) made a move that engines flagged as a "blunder" because it allowed a theoretical checkmate in something like 36(!) moves, but no commentator took it seriously, as there was "no way" a human would be able to spot the chance and calculate it correctly under the clock.

kllrnohj | 1 year ago

Not necessarily. If that "only move" is obvious, then it's not really risky. Like if a queen trade is offered and the opponent accepts, then typically the "only move" that doesn't massively lose is to capture back. But that's extremely obvious, and doesn't represent a sharp or complex position.

EGreg | 1 year ago

Yes, it’s called Monte Carlo Tree Search (MCTS, used by AlphaZero), as opposed to alpha-beta search (which is what classical chess engines use).
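For reference, the move-selection rule at the heart of classic MCTS is the standard UCB1 formula; a minimal sketch (the names and the exploration constant are illustrative):

```python
import math

def ucb1(total_value, visits, parent_visits, c=1.4):
    """UCB1 score for picking which child move to explore next:
    average value so far, plus an exploration bonus that shrinks
    as the move gets visited more often."""
    if visits == 0:
        return float("inf")  # unvisited moves are tried first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

print(ucb1(5, 10, 100, c=0))  # -> 0.5 (pure exploitation: average value)
```

(AlphaZero itself uses a PUCT variant guided by a neural network prior rather than plain UCB1.)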

elcomet | 1 year ago

Those are tree search techniques, not metrics for assessing the "human" complexity of a line. They could be used for this purpose, but out of the box they just give you a winning probability.

hilux | 1 year ago

Not really – that's the point, engines, for all their awesomeness, just do not know how to assess the likelihood of "human" mistakes.