Sunday, April 14, 2019

Notes on watching (a chunk of) a computer chess tournament

A while ago I started a post trying to untangle what kind of a problem it is to play chess well, based on how well various kinds of players -- human, "traditional" computer chess engines and neural-network-based chess engines -- did against each other.  I still think it's an interesting question, but in the meantime I actually watched a bunch of computer chess, and that was even more interesting.  In particular, I looked in on a fair bit of chess.com's "CCC 7: Blitz Bonanza".  Here are some rough impressions.

The Blitz Bonanza is one part of a long-running event comprising thousands of individual games in several different formats (the nice thing about chess engines is that they don't get tired, so it's practical to have two of them play each other dozens of times in the course of a few days).  The "Deep Dive", with longer time controls and slightly different rules, is going on now and will continue for several weeks, and there may be more after that.  (Much of this is from memory, because I can only seem to find current results on chess.com, and the summaries on other sites don't always contain the exact information I'm looking for.)

In the Blitz Bonanza, each player starts with five minutes on the clock, plus two seconds added per move made.  There is no "opening book", meaning that the players must calculate every move from scratch.  This leads to the rather odd sight of a clock ticking down near or past the four-minute mark on move one, even though the move finally played is pretty much inevitably one everyone plays anyway (1. d4 and 1. e4 predominate).
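The clock arithmetic is simple enough to sketch.  This is just an illustration of the increment rule described above, not the tournament's actual timekeeping code:

```python
def clock_after_move(remaining_s: float, think_s: float, increment_s: float = 2.0) -> float:
    """Time left after one move under an increment clock: spend thinking
    time, then add the per-move increment (2 seconds in the Blitz Bonanza)."""
    return remaining_s - think_s + increment_s

# Starting from the 5-minute base, a 70-second think on move one leaves
# the clock well under the 4-minute mark:
print(clock_after_move(300.0, 70.0))  # 232.0
```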

Play continues either to checkmate by one side (though the broadcast enters a "get it over with" or "bullet" mode if both engines evaluate the current position as highly favorable to one or the other side) or to a draw by stalemate, threefold repetition, fifty moves without a pawn move or capture, or "adjudication", meaning that the position is in a standard database of positions with six or fewer pieces on the board that cannot be won if both sides play perfectly (this includes "insufficient force" scenarios like king and bishop against king).

This is rather different from human matches, where a player will typically resign if it becomes clear there's no way to avoid losing, and players typically agree to a draw when it's clear that neither of them has a realistic shot at winning unless the other one blunders.  It's sometimes fun to watch an actual checkmate.  The draws can be interesting in their own way.  More on that later.

Scoring was the usual one point for a win, no points for a loss and half a point to each player for a draw.
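In code form (trivial, but it makes the totals below easy to check; the example numbers are hypothetical):

```python
def points(wins: int, losses: int, draws: int) -> float:
    """Standard chess scoring: 1 for a win, 0 for a loss, 1/2 for a draw."""
    return wins * 1.0 + losses * 0.0 + draws * 0.5

# e.g. a hypothetical 100-game head-to-head with 12 wins, 8 losses, 80 draws:
print(points(12, 8, 80))  # 52.0
```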

The original field consisted of

  • The premier "traditional" (alpha/beta or AB) engine, Stockfish
  • Three neural network-based (NN) engines (Leela Chess Zero, aka LC0, Leelenstein and Antifish)
  • Everyone else

In the first round, each of these played each of the others several times, half of the time as white and half as black.  The top four would progress to the next round.  When the smoke cleared, the top four were LC0, the other two neural network engines, and Stockfish, which rounded out the field in fourth.

In the second round, each of the four played each of the others 100 times (50 as white and 50 as black), for a total of 600 games.  This went on around the clock for several days.  The final result had LC0 in first and Stockfish close behind, with both well ahead of Leelenstein, which was itself well ahead of Antifish.  LC0 had a small lead over Stockfish in their head-to-head games.
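The round-robin arithmetic is easy to verify with a quick sketch (the player names are from the text; everything else is illustrative):

```python
from itertools import combinations

def schedule(players, games_per_pair=100):
    """Every unordered pair of players meets games_per_pair times,
    half with each color."""
    return {pair: games_per_pair for pair in combinations(players, 2)}

s = schedule(["LC0", "Stockfish", "Leelenstein", "Antifish"])
print(len(s), sum(s.values()))  # 6 pairings, 600 games
# Each player is in 3 of the 6 pairings, so each plays 300 games --
# hence 300 possible points per player.
```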

Looking at this with my somewhat defocused engineer's eyes, I'd say "LC0 and Stockfish were about equal, both were clearly better than Leelenstein, and Antifish was clearly the worst of the lot."  I'd consider LC0 and Stockfish essentially equal not only because of the small difference in score (5.5 points out of a possible 300), but because the margin between the two was steady for most of the play.  LC0 won some early games against Stockfish head-to-head, but after that each scored several wins against the other.  This looks much more like a statistical fluke than a real difference in strength.  The results in CCC 6, with slightly different rules and time controls, bear this out.  Stockfish beat LC0, but not by much, and, if I recall correctly, not head-to-head.
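One crude way to sanity-check the "statistical fluke" reading: treat each of the 300 games as independent, assume some draw rate (the 80% figure below is my guess, not a tournament statistic), and ask how many standard deviations a 5.5-point margin is.

```python
import math

def margin_sigmas(margin: float, n_games: int, draw_rate: float) -> float:
    """Very rough z-score for a score margin.  Crude assumptions: games are
    independent, decisive games split evenly, draws contribute no variance,
    so per-game variance is (1 - draw_rate) * 0.25."""
    sigma = math.sqrt(n_games * (1 - draw_rate) * 0.25)
    return margin / sigma

print(margin_sigmas(5.5, 300, 0.80))  # about 1.4 -- well under two sigma
```

Under those (admittedly made-up) assumptions, a 5.5-point gap is comfortably within noise, which fits the "essentially equal" reading.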

As to my long-running theme of "dumb is smarter", the "zero" in LC0 means that it was trained purely by playing against itself, without any external hints.  Antifish, which finished last and had a losing record against each of the others, including Stockfish, was trained specifically to exploit weaknesses in Stockfish.  If we take LC0's "no extra information to bias the results" approach as one kind of "dumb" and Stockfish's "bash out lots of positions using fairly straightforward (but highly-tuned) rules" as another, it's pretty clear that "dumb" dominated.

One other thing jumped out: Neural networks are not good at endgames, or at least they can be spectacularly bad at some endgames.

Endgames are interesting here because there are few pieces on the board and it's often possible to spell out an explicit plan for winning.  For example, if you have a rook and a king against a rook, you can use your rook and king in concert to pin the other king against the edge of the board and then checkmate with the rook.  In other cases, you want to promote a pawn and this hinges on getting your king to the right square and keeping the other king from stopping you.
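To make the "pin the king to the edge" idea concrete, here's a toy checker for whether a king-and-rook-versus-king position is already checkmate.  Squares are (file, rank) pairs from 0 to 7; this is my own illustrative sketch, assuming a legal position with the kings not adjacent -- not anyone's engine code:

```python
def king_moves(sq):
    """Squares a king on sq could step to (board edges respected)."""
    f, r = sq
    return {(f + df, r + dr)
            for df in (-1, 0, 1) for dr in (-1, 0, 1)
            if (df, dr) != (0, 0) and 0 <= f + df < 8 and 0 <= r + dr < 8}

def rook_rays(sq, blockers):
    """Squares a rook on sq attacks, stopping at (and including) blockers."""
    f, r = sq
    attacked = set()
    for df, dr in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        x, y = f + df, r + dr
        while 0 <= x < 8 and 0 <= y < 8:
            attacked.add((x, y))
            if (x, y) in blockers:
                break
            x, y = x + df, y + dr
    return attacked

def is_krk_mate(wk, wr, bk):
    """Black to move: is the lone black king checkmated?"""
    if bk not in rook_rays(wr, {wk, bk}):
        return False                      # not even in check
    # Once the king steps aside it no longer blocks the rook's line,
    # so recompute the rook's reach without it (the "x-ray" squares).
    danger = king_moves(wk) | rook_rays(wr, {wk})
    for sq in king_moves(bk):
        if sq == wr:
            if wr not in king_moves(wk):
                return False              # undefended rook: just capture it
        elif sq not in danger:
            return False                  # an escape square exists
    return True

# White king e6, rook h8, black king e8: the classic edge checkmate.
print(is_krk_mate((4, 5), (7, 7), (4, 7)))  # True
```

The kings standing in opposition (e6 versus e8) is exactly the "rook and king in concert" pattern: the white king seals the escape squares one rank in from the edge while the rook delivers the check along the edge itself.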

Over and over I saw neural network engines flail about aimlessly before hitting the 50-move limit.  To be clear, there are a few endgames that can take more than 50 moves to win, but even in these you can see a clear progression.  To make up an example: after a series of ten checks, the opposing king is now one square closer to the corner where it will eventually be checkmated.  That isn't what was happening.  In one game I saw, the AB engine showed mate in 12, while the NN engine saw its winning chances as quite good.  The game continued

AB: Oh, now you've got mate in 14
NN: Yeah this looks good
AB: I don't see a mate, but I think you're going to promote that pawn (or whatever) and win
NN: Yeah this looks good
AB: Dang, mate in 14 again
NN: Yeah this looks good
AB: Mate in 13
NN: Yeah this looks good
AB: I don't see a mate any more, but ...
NN: Yeah this looks good

From the NN's point of view, it was going from one position with good winning chances to another.  But it wasn't actually winning.  When the game eventually ended in a draw, the NN gave away a half point that the AB engine or a good human player would have gotten, given the same position.


I was a little (but not a lot) surprised by this.  It doesn't seem like the problem is beyond the capability of neural networks.  They've discovered plenty of conventional tactics -- forks, skewers and such, or ladders in the game of go -- and they are clearly able to encode at least some of the usual winning patterns.  Give a neural network a king and queen against a lone king and it will find a checkmate.  On the other hand, give it a queen, a pawn and a rook against a king and it will quite likely forcibly give away two of the three and win with what's left.  To a mathematician it makes sense (reducing to a previously-solved case), but to most chess players it looks like lunacy, if not deliberate trolling.

I found this behavior particularly interesting because neural networks are supposed to have more of that special sauce we call intelligence or understanding.  We can clearly say that Stockfish has no deep understanding of chess in the way a human grandmaster does -- the kind that lets a player intuitively say "White's pieces are active.  Go for a kingside attack" -- but at least it plays plausible moves, and often ones humans would miss.  NN engines often find wild moves that a human wouldn't consider and that an AB engine might avoid because they don't show a clear win, leading us to feel that the NN has some deeper understanding of chess.  I think that's a valid feeling, but there are cases where it completely falls apart.

Watching a neural network botch an endgame just looks like farce.  I'm perfectly comfortable saying that a program "understands" a concept in chess, or "has a concept of" pawn structure or whatever.  It doesn't mean that the program has some reflective layer thinking "Oh, that's a strong pawn chain, I like that", but that it behaves in such a way that, one way or another, it must recognize good and bad structures.

In cases like these, though, it's abundantly clear that the network has no concept, in any meaningful literal or metaphorical sense of the word, of how to win the position.  Nor is it going to be able to figure one out over the board, because the network only learns during the training phase (that's not necessarily a requirement, but it's how these engines work).

This doesn't seem like a permanent situation.  For example, you could have an NN engine train on endgame positions, and it would likely discover ways to win outright and ways to steer toward a win in positions where it now flails.  This is pretty much what humans do -- you learn how to promote pawns as one skill, how to checkmate with a rook and king as another, and so forth.

What's at least a bit surprising is that an engine can be frighteningly strong in most of the game and completely clueless in a few key spots.  This has to say something about intelligence and learning, or about what kind of a problem chess-playing is technically, or about all of the above, but I'm not quite sure what.
