Sunday, July 14, 2019

How strong are computer chess players, and how far are they from perfect chess?

Playing a perfect game of chess appears to be an intractable problem.  Certainly the total number of possible positions is far too large for a direct tabulation of the best move for each possible position to be feasible.  The number of possible chess games is far larger than the number of subatomic particles in the observable universe, and even the number of distinct legal positions, while much smaller, is still astronomically large.

To be sure, almost all of these positions are extremely unlikely to appear in any reasonable game of chess, much less a perfectly played one, so you should be able to ignore them.  All you would really need to know in order to play perfectly is what the best first move is, what the best second move is for every reply, and so forth.  Since most possible replies are not good moves, this might thin out enough that it would be possible to store everything in a large database and/or write general rules that will cover large numbers of possible positions.  In other words, it might (or might not) be feasible to write down a perfect strategy if we could find it.  But we're nowhere close to finding it.

Nonetheless, there has been a lot of progress in the decades since Deep Blue beat Kasparov.  It's now quite clear that computers can play better than the best humans, to the point that it's a bit hard to say exactly how much better computers are.  There are rating numbers that imply that, say, Stockfish would beat Magnus Carlsen X% of the time, but they probably shouldn't be taken as anything more than estimates.  We can say that X is probably somewhere in the 90s, but that's about it.

Chess rating systems are generally derived from the Elo system (named after Arpad Elo), which tries to quantify playing strength as a single number based on players' records against each other.  Two equally-rated players should have roughly equal numbers of wins and losses against each other, plus however many draws.  As the rating difference increases, the stronger player should win more and more often.

Ratings are recalculated in light of actual results.  If two equally-rated players draw, nothing will change, but if a low-rated player draws against a higher-rated one, the higher-rated player will lose points and the lower-rated player will win points.  Likewise, the winner of a game will gain points and the loser will lose points, but you get more points for beating a higher-rated player and you lose more for losing to a lower-rated player.
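Concretely, the standard Elo machinery boils down to two small formulas: an expected score based on the rating difference, and an update proportional to how far the actual result deviates from it.  Here's a minimal sketch in Python; the K-factor of 20 and the 400-point scale are just conventional choices, and real rating bodies tune both.

    def expected_score(rating_a, rating_b):
        """Expected score (wins plus half of draws) for player A against player B."""
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

    def update(rating_a, rating_b, score_a, k=20):
        """New ratings after one game; score_a is 1 for a win, 0.5 for a draw, 0 for a loss."""
        e_a = expected_score(rating_a, rating_b)
        return (rating_a + k * (score_a - e_a),
                rating_b + k * ((1 - score_a) - (1 - e_a)))

    # Two equally-rated players who draw: nothing changes.
    print(update(2000, 2000, 0.5))   # (2000.0, 2000.0)
    # A 1900 who draws against a 2100 gains points at the 2100's expense.
    print(update(1900, 2100, 0.5))   # roughly (1905.2, 2094.8)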

Over time, this will tend to give a good picture of who's better, and how much better.  If the parameters of the rating formula are tuned well, the difference in rating points will give a pretty good prediction of winning percentages.  It's interesting in itself that reducing chess skill to a single number on a linear scale works as well as it does -- see this post for more than you probably want to read about that.  The point here, though, is that to get useful ratings you need a pool of players all playing against each other.

You don't need a full round-robin of thousands of players, of course, but the results you do have need to be reasonably interconnected.  If you're an average club player, you probably won't be playing many games against the world's top ten, but the strongest players in your club may well have played people who've played them, and in any case there will be multiple reasonably short paths connecting you to them.

To make this work, everyone, even at the top, needs to have players of comparable rating level to play against.  This is important because rating differences are more meaningful between relatively evenly-matched players.  If player A beats player B thirteen times, draws four times and loses three times out of twenty games, then player A is probably about 200 points stronger than player B.  If it's nineteen wins and one draw, we know player A is several hundred points stronger, but we can't say much more than that with confidence.  If it's 20-0, the formula by itself can't pin down a difference at all -- a perfect score is consistent with any sufficiently large gap -- so rating calculators typically cap the implied difference at something like 800 points, but it could really be any number from there on up.
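To see where numbers like these come from, you can run a match score back through the same logistic curve.  This is just the formula above turned around; the example records are the ones from the previous paragraph.

    import math

    def implied_difference(wins, draws, losses):
        """Rating difference implied by a match result (a performance difference)."""
        score = (wins + 0.5 * draws) / (wins + draws + losses)
        if score in (0.0, 1.0):
            return float("inf") if score == 1.0 else float("-inf")
        return 400.0 * math.log10(score / (1.0 - score))

    print(implied_difference(13, 4, 3))   # ~191 points -- roughly the "200 points stronger" case
    print(implied_difference(19, 1, 0))   # ~636 points -- "several hundred", but with little precision
    print(implied_difference(20, 0, 0))   # inf -- a perfect score doesn't pin the difference down at all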

If you're a club player, playing a 20-game match against Magnus Carlsen will only tell you something about your rating if you manage to draw at least one game.  For most of us, that's not particularly likely, and for more or less the same reason, that match isn't likely to happen at all.  In real life, people only have so much time, and professional players are only going to play seriously against other professionals.

The same holds true for humans and computers.   People, including grandmasters, do play computers a lot, but generally not under controlled match conditions.  Since even the best human player is likely to be on the wrong end of a 19-1-0 blowout with a top computer player in a tournament setting, there's little reason to pay a top human player to take time out of their schedule for an exhausting and demoralizing match that won't tell us much more than we already know. 

Probably a better approach would be to dial back the parameters on a few well-known engines, with a mix of alpha-beta (AB) and neural-network (NN) players, to produce a set of standard players at, say, 200 rating-point intervals -- 2800, 3000, 3200 ... .  It would probably be much easier to get human players to play against the "near human-level" players, knowing they might be able to win bragging rights over their friends.  Some of the top human players should even be able to beat the 2800-rated players.  The human-level computer players could thus be calibrated against actual humans, and those could be calibrated against the stronger computer players.

Calibrating the top computer players against the standard computer players would take a lot of games, but that's not a problem for computers.  In the last computer tournament I watched, there were hundreds of games, with dozens between each pair in the round-robin phase, and more among the finalists.  Outside of tournaments, a group called CCRL (the Computer Chess Rating Lists) plays computer players against each other under standard conditions specifically to rate them.

If only because of the sheer number of human players, we can be reasonably confident that humans are accurately rated with respect to each other, and because of the sheer number of games played between computers, we can be similarly sure that their ratings reflect their relative strengths.  If computer tournaments included a few human-calibrated players, then we would have a better idea just how strong the strongest engines are compared to humans.

As it is, the two sets of ratings may well be drifting apart.  In the last computer tournament I watched, the bottom player -- which, to be sure, lost quite heavily -- had a rating in the 2900s.  Carlsen is currently in the high 2800s.  In theory, Carlsen would probably lose to that computer player, but would probably manage a few draws and quite possibly a win or two.  In practice, that match is probably not going to happen.

While it would be interesting to get more detail on ratings across humans and computers, it doesn't really change the story of "computers can beat humans easily" and by itself it doesn't shed much light on the limits of how well chess can be played.  An Elo-style rating system doesn't have a built-in maximum rating.  There is, in theory, a best-possible player, but we don't know how strong that player is or, put another way, we don't know how close the current best engines are to playing perfect chess or where they would rate with respect to a perfect player.

It is interesting, though, that stronger players, both human and computer, tend to draw more against each other than weaker players.  Amateur-level games are often decisive because one player or the other blunders (or, more accurately, blunders more than the other player does).  Higher-level human players blunder less, and chess engines don't blunder at all, or at least the notion of "blunder" for an engine is more on the order of "didn't realize that having an enemy pawn that far advanced on the kingside could lead to a devastating attack".  The finals of the last computer tournament I watched consisted almost entirely of draws.

While it's entirely possible that perfectly-played chess is a win for white (or, who knows, a win for black), and the top engines just aren't finding the right moves, it seems more likely to me that perfectly played chess is a draw and engines are basically getting better at not losing, while being able to ruthlessly exploit imperfect play when it does happen.

If this is the case then it's quite possible that ratings will top out at a point where decisive games become exceedingly rare.  A plausible match result might be 1-0 with 999 draws.  If rules required a certain margin of victory, deciding a match might essentially be a matter of flipping coins until there are that many more heads than tails or tails than heads.  The "law of large numbers" doesn't forbid this, it just says that you'll need a lot of coin flips to get there.
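As a rough illustration, here's a tiny simulation of a "win by N" rule with made-up numbers -- a 2% decisive-game rate and a required margin of two wins.  On average such a match takes about the margin squared divided by the decisive rate, or around 200 games, to settle.

    import random

    def games_until_margin(p_decisive=0.02, margin=2):
        """Play games until one side leads by `margin` decisive results."""
        lead, games = 0, 0
        while abs(lead) < margin:
            games += 1
            if random.random() < p_decisive:                 # the rare decisive game...
                lead += 1 if random.random() < 0.5 else -1   # ...is essentially a coin flip
        return games

    random.seed(0)
    trials = [games_until_margin() for _ in range(1000)]
    print(sum(trials) / len(trials))   # about margin**2 / p_decisive = 200 games on average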

We could well get to a point where a match between top players ends in a score of 1439-1429 with 50,726 draws or something like that.  At that point writing chess engines becomes similar to cranking out more and more digits of pi -- interesting to a small group of people, and useful in stress-testing systems and tools, but no longer of general interest, even among geeks.

The players would probably still not be playing perfect chess, but it seems plausible that they would nearly hold their own against perfect play.  As draws become more common, ratings should start to bunch together as a result.  If player A's record against player B is 12-0 with 123,765 draws, their ratings will be essentially identical, even if A is playing perfectly and B is only almost-perfect.
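Plugging that hypothetical record into the same implied-difference formula as before shows just how little it separates the two players:

    import math

    wins, draws, losses = 12, 123_765, 0
    score = (wins + 0.5 * draws) / (wins + draws + losses)
    print(400.0 * math.log10(score / (1.0 - score)))   # ~0.03 rating points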

Working backward from this, if top ratings are in fact starting to bunch together, you could extrapolate toward a theoretical maximum rating that they're approaching.  A player's actual rating is then an estimate of how often they would lose, as opposed to drawing, against perfect play.  I wouldn't be surprised if someone's already done a regression based on a scenario like this.

3 comments:

  1. Re: para 2: Would it make sense to start with all the mates and work backwards?

  2. I don't think that would work. It's easy to find a path back from a checkmate to the initial position. Just find a game that ends in checkmate and run it backwards. However, that doesn't tell you whether the winning position can be forced from the initial position. Absent some clever proof, the only way to do that is to trace forward from the initial position.

    In any case, tracing backward from known wins gets intractable quickly, even if you limit the starting points to elementary endgame checkmate positions (king and queen vs. king). Current state of the art is a seven-piece tablebase, meaning that every position with seven or fewer pieces, counting the kings, is known to be either a win for white, a win for black or a draw (either because it traces back to a stalemate or because it never shows up in any of the backward searches from checkmates, meaning each side can avoid losing indefinitely). As I understand it, eight pieces is a ways off yet.

    It's entirely possible that the optimal strategy for chess is to force a position where the only way for your opponent to avoid mate is to repeat moves. In that case, chess is a draw, but good luck finding that magic position, which might have any legal number of pieces, and proving it's forced from the initial position.

  3. This looks more like a chess variant, so "perfect chess" in terms of "a game that's like chess, but perfect (by some measure)". This post is more about how far we are from perfect play in the traditional game of chess, whatever we might think of it as a game.

    (And, re-reading, it's really more about the limitations of rating systems, and the conjecture that perfect chess is a draw since higher ratings -- whatever they actually mean -- seem to correlate with more draws)
