Tuesday, May 21, 2019

Computer chess engines fall in line

[This post came out of trying to figure out what computer chess results tell us about what kind of problem chess-playing is.  It goes into quite a bit of detail about cycles where player A beats player B beats player C beats player A and why they don't seem to happen.  Shortly after I posted this, the Computer Chess Championship "Deep Dive" round finished up, and there was a slight anomaly: Stockfish won the tournament, but second-place LC0 beat it head to head, by the narrow margin of 4-3-43, implying an Elo difference of 8 points.  I don't think this affects the post greatly, but it does reinforce the point that NN engines are less consistent than AB engines.  Stockfish didn't lose a game to 4th-place Houdini while LC0 both won and lost.  I'll update this slightly when chess.com makes the full results available --D.H. May 2019]

One thing jumps out from watching the ongoing computer chess tournaments that is so ordinary-seeming that it's easy to dismiss entirely: the rankings from a preliminary round with a couple dozen contestants are strikingly linear.  If you look at the cross-table, which shows everyone's results against everyone else's, wins in green, losses in red and draws in gray, not only are the scores in order down the table, but for the most part all the red is above the diagonal and all the green below.  Stronger engines very consistently beat weaker engines, especially considering that at least some of the apparent upsets were due to system crashes and such.

I've seen this happen twice now, once in the "Blitz Bonanza" (5 minutes plus two seconds per move) and once in the Deep Dive (15 minutes plus 5 seconds per move).  The Deep Dive preliminary, which finished just a few days ago, was played in "escalation" format, meaning that each new player plays everyone who has already played (twice as white and twice as black).  After a few opponents, you can be pretty confident about where a new player will land -- above whoever it has beaten head-to-head, below whoever it has lost to, and close to anyone it drew all four games with.

It shouldn't be a total shock that stronger players beat weaker players, or even that they beat them very consistently.  My point is more that it doesn't have to be that way.  While there are only two basic styles of engine*, each dev team has its own philosophy and each engine evaluates positions differently.  In the case of neural net-based engines (generally referred to as NN), positions are evaluated in a radically different way.

It seems at least plausible that, as with human players, engines might tend to do better or worse than otherwise expected against particular styles of play.  An engine that attacks aggressively might lose to one that's tuned for careful defending, but beat a middle-of-the-road approach that can outplay the careful defender positionally.  This would create a cycle of attacker beats middle-of-road beats defender beats attacker, but this sort of situation doesn't seem to happen in practice.
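To make the hypothetical concrete, here's a toy cross-table (numbers invented purely for illustration) showing what such a cycle would look like.  No way of ordering the three players puts all the wins on one side, which is exactly the kind of thing the real cross-tables conspicuously don't show:

    # Hypothetical head-to-head expected scores for the cycle described above.
    # Each entry is the fraction of points the row player scores against that opponent.
    results = {
        "attacker": {"middle":   0.60, "defender": 0.40},
        "middle":   {"defender": 0.60, "attacker": 0.40},
        "defender": {"attacker": 0.60, "middle":   0.40},
    }

    # Every player scores exactly 50% overall, so no ranking of the three is
    # consistent with all of the head-to-head results: sort the cross-table
    # however you like, and some of the "green" always ends up on the wrong side.
    for name, opponents in results.items():
        print(name, sum(opponents.values()) / len(opponents))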

Without going back and checking the actual numbers, my impression is that AB engines -- the conventional style, which does an exhaustive search as far ahead as it can manage, using fairly understandable hand-coded rules to evaluate positions quickly -- win more by being able to look further ahead than by having cleverer evaluation rules.  That is, evaluation speed is a good predictor of who will win.

Clearly if two engines have identical evaluation rules, but one implements them more efficiently, the faster engine should win because it can see further ahead.  The faster engine will see everything the slower engine sees, but it can also see which of two seemingly equal moves leads to a better result.  It will therefore be able to steer toward winning positions and the slower engine won't see the danger until it's too late.
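As a rough sketch of what that search looks like (simplified Python, using the python-chess library just for move generation; the evaluation function is deliberately left as a parameter, since the point is that depth, not the evaluation itself, is what varies):

    import chess

    def alphabeta(board, depth, evaluate, alpha=-float("inf"), beta=float("inf")):
        """Depth-limited alpha-beta search (negamax form).  Scores are from the
        point of view of the side to move; `evaluate` is whatever static,
        hand-coded evaluation the engine uses."""
        if depth == 0 or board.is_game_over():
            return evaluate(board)
        best = -float("inf")
        for move in board.legal_moves:
            board.push(move)
            score = -alphabeta(board, depth - 1, evaluate, -beta, -alpha)
            board.pop()
            best = max(best, score)
            alpha = max(alpha, score)
            if alpha >= beta:   # prune: the opponent already has a better option
                break
        return best

    # Two engines with the identical `evaluate` differ only in how large a
    # `depth` they can afford under the clock; the faster one sees everything
    # the slower one sees, plus one more layer of consequences.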

Even with each team using its own particular evaluation rules, something like this still happens.  Typically when two AB engines play, there will be a period where the weaker engine shows a more favorable evaluation for itself than the stronger one does, e.g., the weaker engine is playing white and it sees a half-pawn advantage for white while the stronger engine sees a small advantage for black.

The weaker engine thinks everything's going fine while the stronger engine knows that it has the edge.  Eventually the weaker engine's evaluation starts to drop as it realizes things are no longer as good as they looked, and the stronger engine pushes its small advantage into an overwhelming one.

Again, it isn't a foregone conclusion that a faster engine will beat a slower one with a different evaluation function, but it's what we see in practice.  My suspicion is that evaluation functions hit diminishing returns after a certain point.  You've accounted for material, you've got some concept of tactical considerations like pins and hanging pieces, and some handling of positional factors like space and pawn structure.  Further improvements are certainly possible, in the form of slightly better evaluation functions, slightly faster evaluation, or both, but the low-hanging fruit was picked long ago.
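For a sense of what that picked-over fruit looks like, here's a toy evaluation along those lines (again using python-chess purely for the board mechanics; real evaluation functions are far more elaborate and heavily tuned).  It could be plugged straight into the search sketch above:

    import chess

    PIECE_VALUES = {chess.PAWN: 100, chess.KNIGHT: 320, chess.BISHOP: 330,
                    chess.ROOK: 500, chess.QUEEN: 900, chess.KING: 0}

    def evaluate(board):
        """Toy evaluation in centipawns from the side to move's point of view:
        material plus a crude mobility term.  Every extra term here gets paid
        for on millions of positions per second, which is where the
        diminishing returns come from."""
        material = 0
        for piece in board.piece_map().values():
            value = PIECE_VALUES[piece.piece_type]
            material += value if piece.color == board.turn else -value
        mobility = sum(1 for _ in board.legal_moves)   # cheap positional proxy
        return material + mobility

    print(evaluate(chess.Board()))   # starting position: 0 material + 20 legal moves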

Adding a detailed model of, say, pawn structure, is likely just to slow the evaluation down without providing enough benefit to make up for the decrease in depth.  This is the "dumb is smarter" principle that seems to have held sway for the past few decades.

Stockfish -- currently the clear champion among AB engines -- demonstrates this nicely with its "fishtest".  As I understand it, developers try out changes to the program to see if their locally patched version consistently beats the current released version.  Only winning changes are brought into the next release.  While Stockfish has improved significantly as a result of fishtest, progress has been incremental, and the end result is that Stockfish can evaluate positions very quickly.  I'm not 100% sure it's the fastest, but it's certainly high up there.
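A rough sketch of the gate, as I understand it (the real fishtest farms out tens of thousands of games to volunteers and uses a sequential statistical test -- SPRT -- rather than a fixed-size match, so take this as the idea rather than the mechanism; the thresholds are made up for illustration):

    import math

    def patch_passes(wins, losses, draws, min_elo_gain=1.0):
        """Accept a patch only if its self-play results against the current
        master version imply a clear Elo gain."""
        games = wins + losses + draws
        score = (wins + 0.5 * draws) / games
        elo_gain = 400 * math.log10(score / (1 - score))
        return elo_gain >= min_elo_gain

    # e.g. 2,100 wins, 2,000 losses, 15,900 draws over 20,000 games is only a
    # 50.25% score, but that's still roughly +1.7 Elo -- enough to merge.
    print(patch_passes(2100, 2000, 15900))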

I'm sure everyone, open source or not, playtests potential improvements (and the training phase of NN engines takes this to an extreme), but crowdsourcing this ought to keep development from fixating on any particular person's or small group's idea of what must be right.  Again, fewer assumptions means better results.


Bringing NN engines into the picture doesn't seem to change this linear ranking.  NN engines are at the other extreme of the smart vs. fast tradeoff.  For a given number of layers and so forth, a position generally takes just as long to evaluate regardless of the exact parameters of the net being used, so the only difference in play will be due to the net itself.  Evaluation speed is not a factor, since it doesn't change.

This doesn't necessarily rule out A-beats-B-beats-C-beats-A cycles; in fact, it seems like they might be more likely, since different nets will put different weights on different factors -- assuming there's more than one way to be good at chess.  So far, though, there don't seem to be any.  In the tournament results I've seen so far, there are three strong NN engines: Leela Chess Zero (generally known as LC0), Leelenstein and Antifish (a modified Leela trained specifically to beat Stockfish).  In the Blitz Bonanza, LC0 had winning records against the other two, and Leelenstein had a winning record against Antifish.  Antifish does not seem to do better than the others against Stockfish; if anything it does slightly worse.

There are only two NN engines in the Deep Dive finals (rules now prohibit more than one variant of the same engine reaching the finals, so Antifish is out).  Obviously there will be no cycle between them. For what it's worth, LC0 is currently slightly ahead of Leelenstein, but they're dead even head-to-head.

There is one clear deviation from the orderly world of AB engines.  Since NN engines evaluate positions so slowly, they take a random sample of positions when looking ahead, rather than searching exhaustively like AB engines.  I believe the sample is weighted towards moves the net considers likely to be good, but in any case this introduces some inconsistency.  NN engines occasionally lose to weak AB engines, since not all good-looking moves turn out actually to be good, while strong AB engines almost never do.  Against strong AB engines, NN engines can find good moves that the AB opponent overlooks because they don't lead to good results quickly enough for the exhaustive search to turn them up, but they can also play good-looking moves that turn out to be unsound if you search deeply enough.
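Roughly, the weighting works like the AlphaZero-style selection rule sketched below (a simplified illustration; the real LC0 search has more moving parts and different constants).  The net's policy output acts as a prior over moves, so plausible-looking moves soak up most of the visits:

    import math
    from dataclasses import dataclass

    @dataclass
    class Child:
        move: str      # e.g. "e2e4"
        prior: float   # net's policy probability for this move
        visits: int    # how many times the search has tried it so far
        value: float   # average evaluation of positions reached through it

    def select_child(children, c_puct=1.5):
        """Pick the next move to explore, AlphaZero/PUCT-style (simplified):
        high-prior moves get visited far more often -- this is the weighted
        sample described above."""
        total_visits = sum(c.visits for c in children)
        def puct(c):
            return c.value + c_puct * c.prior * math.sqrt(total_visits) / (1 + c.visits)
        return max(children, key=puct)

    # A move the net dislikes (low prior) is explored only rarely, even if it
    # would actually refute the line -- which is how an occasional
    # plausible-looking but unsound move slips through.
    kids = [Child("e2e4", prior=0.45, visits=120, value=0.10),
            Child("d2d4", prior=0.40, visits=100, value=0.09),
            Child("a2a4", prior=0.01, visits=2,   value=0.05)]
    print(select_child(kids).move)   # -> e2e4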

The net effect is that strong NN engines don't do as well against weak AB engines as you'd expect from how they do against strong ones.  In the current Deep Dive finals, Stockfish is (at this writing) noticeably ahead of LC0, but head-to-head they're nearly even (3 wins to 2 with 41 draws, implying an Elo difference of 8 points; by contrast, human world champion Carlsen is currently rated about 57 points ahead of second-ranked Caruana) [and in fact, Stockfish won the finals but lost to LC0 head-to-head].
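(For reference, that Elo figure is just the standard conversion from expected score to rating difference, the same arithmetic as in the fishtest sketch above:)

    import math

    wins, losses, draws = 3, 2, 41                # head-to-head at this writing
    score = (wins + 0.5 * draws) / (wins + losses + draws)
    print(400 * math.log10(score / (1 - score)))  # ~7.6, i.e. about 8 Elo points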

The difference is that Stockfish has done better against the weaker opposition.  Stockfish is 9-0 against 4th-place Houdini (with a whole bunch of draws), while LC0 is 12-3 against Houdini (with correspondingly fewer draws).  Stockfish also does better against the other NN engine, 3rd-place Leelenstein.

So there don't seem to be any cycles in the ranking of chess engines, and the (small) difference between top AB engines and top NN engines looks to be due to consistency.  What does this tell us about chess-playing as a problem?  There are several possible conclusions:
  • Chess skill really is a linear ranking, at least once you get to the level of computer engines.  If A beats B and B beats C, then A beats C.
  • A bit stronger: There is one true way to play chess and stronger engines approximate this better.
  • On the other hand: AB engines are linearly ranked because only depth/evaluation speed really matters while NN engines are linearly ranked because only the quality of the net doing the evaluation matters.  It's not surprising, though not inevitable, that combining two linear rankings gives another linear ranking.
  • Also, NN engines are all playing on or near the level of the current AB champion, so they're really only being compared to one style of play since there's only one really competitive opponent.
My feeling is that we don't really have enough to know for sure, but most likely we'll continue to see improvement in playing strength and that the linear ranking will hold.  Next year's champion will be able to beat any other opponent head-to-head, the one after that will be able to beat it and everyone else, and so forth, regardless of whether the next champion is AB, NN or some hybrid of the two.  There are noticeable differences between how AB engines and NN engines play (I'd like to get into the details in a future post), but be that as it may, stronger is stronger.

* Technically, there are two choices on two dimensions: Alpha/Beta tree search vs. Monte Carlo search, and neural-net based evaluation vs. hand-coded rules.  However, NN engines always use Monte Carlo, since it takes them roughly a thousand times longer to evaluate a given position and they therefore can't search exhaustively.  Conventional engines, on the other hand, generally use AB search because they can.  There have been fairly successful experiments in using Monte Carlo with hand-coded rules; those tend to be called hybrids.
