Sunday, July 14, 2019

Computer chess: where now?


In an effort to wrap up this thread, at least for a while, here's an attempt to recap some of the major points and to conjecture about what might be next, and what this might tell us about chess, and intelligence in general.

Since playing perfect chess appears intractable, we have a classic tradeoff: give up on perfect play in order to fit our actual strategy into our limited resources.  Chess theory, combined with practice to develop pattern recognition and calculation, is optimized for the human brain.  Chess theory is just shorthand for "patterns and rules that players have discovered over the centuries."  These days, that very much includes discoveries by computer players.

For computers (at least the Von Neumann-style processors currently in commercial use), there are now two main options:

  • exhaustive search to a limited depth, combined with alpha/beta pruning to avoid exploring moves that can't be as good as moves already considered, using an explicit set of rules to evaluate whether a position is good or bad (known as AB, after the alpha/beta part; sketched below)
  • a completely opaque neural net evaluation of positions combined with a limited randomized search of possible variations (known as NN, for the neural net part), though you'll also see mention of MCTS (Monte Carlo Tree Search), referring to the randomized search.
There are some hybrid approaches that combine explicit rules with random search, but one way or another there has to be a tradeoff of performance for limited resources.
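To make the AB side concrete, here's a minimal sketch of a depth-limited search with alpha/beta pruning.  The position methods (legal_moves, make_move, is_terminal, evaluate) are placeholders assumed for illustration, not any particular engine's API; real engines add move ordering, transposition tables, quiescence search and a great deal more.

    # Minimal depth-limited alpha/beta search in negamax form.
    # The position object and its methods are assumed for illustration only.
    def alphabeta(position, depth, alpha=float("-inf"), beta=float("inf")):
        if depth == 0 or position.is_terminal():
            return position.evaluate()            # static evaluation at the horizon
        best = float("-inf")
        for move in position.legal_moves():
            # Score the reply from the opponent's point of view and negate it.
            score = -alphabeta(position.make_move(move), depth - 1, -beta, -alpha)
            best = max(best, score)
            alpha = max(alpha, score)
            if alpha >= beta:
                # The opponent already has a better option elsewhere,
                # so nothing below this node can matter: prune.
                break
        return best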

It's probably worth repeating that NN engines still consider far more possible continuations than humans can.  The rough ratio is dozens (for humans) to hundreds of thousands (for NN engines) to billions (for AB engines).  We can assume that those last two numbers are going to increase over time.  Moore's law may (or may not) be tailing off, but there will still be improvements in software and bigger piles of hardware in the future.  There could also be breakthroughs like some sort of quantum chess computer that can examine vastly larger numbers of possible positions, in which case all bets are off.

It's interesting to explore what human, AB and NN players do and don't have in common.  One common factor is the importance of what's traditionally called "positional" play, that is, taking into account factors beyond tactical concerns about winning material or forcing a quick checkmate.  Tactics are still important, but as the level of play becomes higher it becomes more and more important to consider factors that influence the longer-term course of the game, factors like pawn structure, space, initiative and so forth.

Positional play is interesting from a computing standpoint because it's not the primary objective of the game -- that would be checkmate -- or even the secondary objective -- generally considered to be winning material to make it easier to checkmate, or using the threat of winning material to force the opponent into a lost position.  The chess elements that have been developed over the centuries aren't readily obvious consequences of the rules.  They are human constructs aimed at making an intractable problem tractable.  Heuristics, in other words.  In broad terms, all three kinds of players are using the same approach -- rules of thumb plus calculation -- just in different mixes and with different rules of thumb.

It seems significant, then, that computers have, in effect, rediscovered positional factors in their play, even without having them explicitly programmed in.  Deep Blue beat Kasparov, by Kasparov's own assessment, by outplaying Kasparov positionally.  The surprise was that it did this with only basic knowledge of what makes a position good or bad and that the rest emerged from looking at large numbers of possible positions, much further ahead than a human could look.

Similarly, in their training phases, NN engines like Alpha Zero learn to play good moves without any obvious tactical benefit -- and indeed some that can look like blunders or at least unnecessarily risky at first glance -- without any cues beyond what happened to win in training games.  They do seem to produce more than their share of "wild" positions, and "unorthodox" moves, but even then their play can generally be described in terms like "an unusually aggressive pawn push to initiate a kingside attack" or "a positional sacrifice to obtain more active pieces", and not (except in endgames, it seems) "no idea ... looks like it picked that move at random".

Maybe that just means that we have enough terms for things going on in chess that one of them is bound to apply, but even when engines play moves that flout previously accepted theory, such cases are probably the memorable exception.  In a great many positions each of the three types of player would find the others' preferred moves perfectly reasonable.  They might well disagree over which exact move is best, but in most cases their evaluations will be largely similar.  Typically they will all rule out the great majority of possible moves as inferior, and they will largely agree on which moves are inferior.   For that matter, AB engines and human grandmasters have been known to flout the same rules.  It's not just NN.

All this is to say that if three radically different approaches to chess-playing can reach a reasonable consensus on almost all possible moves and their ranking agrees reasonably well -- though by no means perfectly -- with accepted chess theory, then most likely there is something to accepted chess theory.  Call it an emergent property of the rules.

So where do we go from here?  Can we combine the strengths of the three approaches -- roughly speaking, high-level planning from humans, calculation from AB engines and positional assessment from NN engines?

It might seem almost self-evident that if we knew the full details of how grandmasters see the board then computers would be able to do even better, but it's not like it hasn't been tried.  Chess engine developers have been working with top-level human players for decades, with good results, but not always with the best results.  Stockfish, which currently rules the AB world, benefits from a large distributed team making continual tweaks to the algorithm, with only the best making it into the next release.  Contributions can come from anyone.  I'm sure a lot of contributors are also strong chess players, but it's not a prerequisite.

Alpha Zero, of course, dispenses with outside expertise entirely.  Other NN engines try to incorporate outside knowledge, but the ones that don't, notably Alpha Zero and its open-source cousin LC0, seem to do best.

Humans clearly do things differently from computers, but as I've said previously, it's quite possible that this is because these things work for humans but not for computers.  My guess is that trying to model the way humans formulate plans when playing chess is not going to lead to stronger chess engines.  At this point, both kinds of engines have their own equivalent of planning, namely calculating possible continuations en masse.

Even NN engines, which focus mainly on evaluating particular positions, benefit from this.  While NN engines may look downright clueless in endgames (even when they win), they look considerably less clueless at longer time controls, that is, when they are doing more calculation.  Looking far enough ahead is indistinguishable from planning, even if that's not how we humans do it.

The ideal combination of current AB and NN approaches would evaluate positions as well as NNs and as fast as ABs.  While the original Alpha Zero work was done in a closed environment and (to my knowledge) hasn't been directly replicated by outside researchers, it's generally understood that Alpha Zero benefitted from using tensor processing chips that sped up its evaluation (one reason the Stockfish team argued that the Alpha Zero match results weren't a fair comparison), and that for that reason it would likely beat LC0, which runs on stock hardware.  That is, NN with faster, and thus more, calculation beats NN without it.

On the other hand, one big reason that NN engines got so much attention was that they didn't just win, but they did it in a noticeably different style.  The classic example is the "fawn pawn" (apparently a mishearing of "thorn pawn"), which is an advanced pawn, typically on the sixth rank, that can't be attacked by the defender's pawns, for example a black pawn on h3 when white has castled kingside and played g3.
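To make that concrete, here's a bare-bones, made-up position (kings and pawns only) showing the shape of a thorn pawn, using the python-chess package; it's not taken from any engine game.

    # A stripped-down "thorn pawn" illustration (position invented for this post).
    # White has castled kingside and played g3; a black pawn has landed on h3.
    import chess  # the python-chess package

    board = chess.Board("6k1/5ppp/8/8/8/6Pp/5P1P/6K1 w - - 0 1")
    print(board)
    # The g3 pawn can only capture toward f4 or h4, and the h2 pawn is blocked,
    # so no white pawn can ever remove the thorn on h3 next to white's king.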

A strong human player would likely look at such a position and think "Yeah ... that doesn't look good for white ... ", but NN players demonstrated just how bad it is, and therefore why it can be worth playing for, even at the cost of giving up material, exposing your own king, or pushing a pawn that might have to be supported by pieces instead of other pawns.  The NN games also gave some insight as to when to push such a pawn.

It's not out of the question, in fact I'd conjecture that it's likely, that AB engines can incorporate some of this knowledge directly into their evaluation functions.  The resulting code wouldn't have to capture everything the NN's network does, just enough to avoid losing to such attacks, by skewing just enough toward positions that avoid them.

More generally, an AB evaluation function doesn't need to capture everything an NN's does.  Some of that behavior will be due to overfitting or plain old random noise.  Nor do we have to understand exactly how the evaluation works.  There's a very direct way of seeing what an NN's evaluation function is doing: play games against it and see what it does.

Another option that comes to mind would be to use neural networks not for playing chess directly, but for tweaking the parameters of traditional chess engines.  I'm not sure this is a great idea, though.  One major lesson of both types of engine, in their own ways, is that it's risky to build in assumptions -- not necessarily doomed to failure, but risky.  If you're tweaking parameters in a traditional engine -- e.g., how much a rook on an open file is worth in relation to a pawn -- you're tuning weights in a system with a bunch of specialized, non-linear nodes instead of tuning weights in a (relatively) straightforward multilinear system.  It's not clear that this would work better, though it might.

Looking at it from a more NN perspective, you could try to speed up evaluation by noticing this or that pattern of weights and replacing the generalized tensor calculations with ad-hoc functions that happen to work for those particular weights.  For example if the weights are "sparse", that is, most of them are zero, you can write out an explicit formula that combines the terms with non-zero weights.  In theory, you might come up with something that resembles some of the customary rules of thumb.  Maybe weights involving a particular kind of piece tend to emphasize empty squares that the piece could move to (open lines, mobility), or pairs of squares the piece could attack if it moved a certain way, and which hold enemy pieces (forks).
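As a toy illustration of that kind of specialization (made up for this post, and nothing like how a real engine is implemented): if one output's weights happen to be mostly zero, the generic dot product can be replaced by an explicit formula over just the non-zero terms, which starts to look like a hand-written rule of thumb.

    import numpy as np

    # Hypothetical weights for one output over 8 input features; most are zero.
    weights = np.array([0.0, 0.0, 1.7, 0.0, 0.0, -0.9, 0.0, 0.4])

    def generic_eval(features):
        # What the net "really" does: a full dot product over all the features.
        return float(np.dot(weights, features))

    def specialized_eval(features):
        # An equivalent ad-hoc formula that only touches the non-zero terms.
        return 1.7 * features[2] - 0.9 * features[5] + 0.4 * features[7]

    features = np.random.rand(8)
    assert abs(generic_eval(features) - specialized_eval(features)) < 1e-9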

If all we know is that neural nets are combining basic features of a chess position according to some magic combination of weights, then all we can say is "there's something that current explicit evaluation functions don't seem to be capturing."  If that's how it stays, then we might expect neural networks to outperform engines like Stockfish in the long run.  They're already at least comparable, and they haven't been around for nearly as long.

It's quite possible, though, that a neural network's evaluation function is capturing a handful of quantifiable factors that current explicit evaluation functions aren't currently capturing.  In that case, an engine like Stockfish with an updated evaluation function should have the upper hand.  It could choose promising positions as well as, or about as well as, a neural net, but it would examine them much more efficiently.

It's also not clear how much room is left for neural networks to improve.  Playing winning chess is just plain hard.  Training a network for twice as long generally doesn't produce twice as good a result.  It may be that neural-net evaluation of individual positions is about as good as it's going to get.

For that matter, though, neither does looking at twice as many moves, or even twice as deep (which should roughly square the number of moves examined), make an AB engine twice as strong.  Currently AB engines can examine billions of positions and look around twenty plies ahead (twenty levels of move/countermove, or ten full moves) in a typical midgame position.  Looking at trillions of positions would mean looking more like thirty plies ahead.  That ought to yield some sort of improvement, but how much?  Most high-level games are draws in any case.
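The arithmetic behind those ply counts is just logarithms: with a roughly constant effective branching factor, the reachable depth grows with the log of the node budget.  The branching factor below is a guess for illustration, not a measurement of any particular engine.

    import math

    def reachable_depth(nodes, branching_factor):
        # Depth d such that branching_factor ** d is about the number of nodes searched.
        return math.log(nodes) / math.log(branching_factor)

    # Illustrative only: an effective branching factor of ~2.8 after pruning.
    for budget in (1e9, 1e12):
        print(f"{budget:.0e} nodes -> about {reachable_depth(budget, 2.8):.0f} plies")
    # A thousand times more nodes buys only six or seven extra plies at this branching factor.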

It's interesting that AB and NN engines appear to be more or less evenly matched.  This may be evidence that we're reaching the limits of what computer chess players can do.



In the discussion above, I talk mostly about evaluation functions, weights on numbers and so forth, and not so much about "planning" or "understanding" or "insight".  This was a deliberate choice.  Chess engines are nothing more or less than computer programs.  It's fascinating that they can clearly display aspects that we associate with intelligence, while clearly lacking others.  Neural networks don't change this, but they do seem to capture another chunk of what brains do, namely fuzzy pattern matching.


Much discussion of developments like AlphaZero and similar neural network-based programs focuses on what we're learning about algorithms, and in particular how to harness "deep neural networks" (that is, multi-stage neural networks) to solve problems that are widely agreed to require intelligence.

That's fair enough, but in that process we're also learning about the problems themselves.  If intelligence is the ability to solve certain kinds of problems, then solving problems by machine tells us something about what it means to be intelligent.  While it's natural to want to move the goal posts and try to narrow "intelligence" down so it continues to mean "behavior we can't emulate reasonably well with machines," as I argued in the first of the previous posts, that's probably a losing game.  The winning game, I think, is gaining a better understanding of various capabilities that we tend to throw under the blanket term "intelligence".


How far are we from perfect chess?

Playing a perfect game of chess appears to be an intractable problem.  Certainly the total number of possible positions is far too large for a direct tabulation of best move for each possible position to be feasible.  There are far more possible chess positions than subatomic particles in the observable universe.

To be sure, almost all of these positions are extremely unlikely to appear in any reasonable game of chess, much less a perfectly played one.  All you would really need to know is what the best first move is, what the second move is for every reply, and so forth.  Since most possible replies are not good moves, this might thin out enough that it would be possible to store everything in a large database and/or write general rules that will cover large numbers of possible positions.  In other words, it might (or might not) be feasible to write down a perfect strategy if we could find it.  But we're nowhere close to finding it.

Nonetheless, there has been a lot of progress in the decades since Deep Blue beat Kasparov.  It's now quite clear that computers can play better than the best humans, to the point that it's a bit hard to say exactly how much better computers are.  There are rating numbers that imply that, say, Stockfish would beat Magnus Carlsen X% of the time, but they probably shouldn't be taken as anything more than an estimate.  We can say that X is probably somewhere in the 90s, but that's about it.

Chess rating systems are generally derived from the Elo system (named after Arpad Elo), which tries to quantify playing strength as a single number based on players' records against each other.  Two equally-rated players should have roughly equal numbers of wins and losses against each other, plus however many draws.  As the rating difference increases, the stronger player should win more and more often.

Ratings are recalculated in light of actual results.  If two equally-rated players draw, nothing will change, but if a low-rated player draws against a higher-rated one, the higher-rated player will lose points and the lower-rated player will win points.  Likewise, the winner of a game will gain points and the loser will lose points, but you get more points for beating a higher-rated player and you lose more for losing to a lower-rated player.
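For the curious, the basic Elo bookkeeping fits in a few lines.  This is the textbook version; actual federations differ in K-factors, floors and rounding.

    def expected_score(rating_a, rating_b):
        # Expected score of player A against player B (1 = win, 0.5 = draw, 0 = loss).
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

    def updated_rating(rating, expected, actual, k=20):
        # Move the rating toward the actual result; K controls how fast it moves.
        return rating + k * (actual - expected)

    # A 200-point favorite is expected to score about 0.76 per game,
    # so a draw costs the favorite points and gains them for the underdog.
    e = expected_score(1800, 1600)
    print(updated_rating(1800, e, 0.5), updated_rating(1600, 1.0 - e, 0.5))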

Over time, this will tend to give a good picture of who's better and how much better, and if the parameters of the rating formula are tuned well, it will give a pretty good prediction of winning percentages.  It's interesting in itself that reducing chess skill to a single number on a linear scale works as well as it does -- see this post for more than you probably want to read about that.  The point here, though, is that to get useful ratings you need a pool of players all playing against each other. 

You don't need a full round-robin of thousands of players, of course, but things need to be reasonably interconnected.  If you're an average club player, you probably won't be playing many games against the world's top ten, but the strongest players in your club may well have played people who've played them, and in any case there will be multiple reasonably short paths connecting you to them.

This isn't the case with humans and computers.  People, including grandmasters, do play computers a lot, but generally not under controlled match conditions.  To get useful ratings, players being rated should be reasonably well matched.  If player A beats player B nineteen times and draws once out of twenty games, we know player A is far stronger, but we can't say with confidence how much stronger.

Since that's a realistic scenario for a top computer player vs. a top human player, there's little reason to pay a top player to take time out of their schedule for an exhausting and demoralizing match that won't tell us much more than we already know.  Probably the best approach would be to dial back the parameters on a few well-known engines, with a mix of AB and NN, to produce a set of standard players that could play humans at, say, 200 rating-point intervals.  It would probably be much easier to get human players to play against the "human-level" players, knowing they might be able to win bragging rights over their friends.  The human-level computer players could thus be calibrated against actual humans.

Calibrating the top computer players against them would take a lot of games, but that's not a problem for computers.  In the last computer tournament I watched, the bottom player -- which, to be sure, lost quite heavily -- had a rating in the 2900s.  Carlsen is currently in the high 2800s.  Again, these are essentially two separate rating systems, but for historical reasons they're probably fairly close, so it should be possible to link the two systems if anyone really wants to.

While it would be interesting to get more detail on ratings across humans and computers, it doesn't really change the story of "computers can beat humans easily" and by itself it doesn't shed much light on the limits of how well chess can be played.  An Elo-style rating system doesn't have a built-in maximum rating.  There is, in theory, a best-possible player, but we don't know how strong that player is or, put another way, we don't know how close the current best engines are to playing perfect chess.

It is interesting, though, that stronger players tend to draw more against each other than weaker players.  Amateur-level games are often decisive because one player or the other blunders (or, more accurately, blunders more than the other player does).  Higher-level players blunder less, and chess engines don't blunder at all, or at least the notion of "blunder" for an engine is more on the order of "didn't realize that having an enemy pawn that far advanced on the kingside could lead to a devastating attack".  The finals of the last computer tournament I watched consisted almost entirely of draws.

While it's entirely possible that perfectly-played chess is a win for white (or, who knows, a win for black), and the top engines just aren't finding the right moves, it seems more likely to me that perfectly played chess is a draw and engines are basically getting better at not losing, while being able to ruthlessly exploit imperfect play when it does happen.

If this is the case then it's quite possible that ratings will top out at a point where decisive games become exceedingly rare.  A plausible match result might be 1-0 with 999 draws.  If rules required a certain margin of victory, that might essentially be a matter of flipping coins until there are that many more heads than tails or tails than heads.  The "law of large numbers" doesn't forbid this, it just says that you'll need a lot of coin flips to get there.

We could well get to a point where the championship ends in a score of 1439-1429 with 50,726 draws or something like that.  At that point writing chess engines becomes similar to cranking out more and more digits of pi -- interesting to a small group of people, and useful in stress-testing systems and tools, but no longer of general interest, even among geeks.

The players would still not be playing perfect chess, but most likely they would nearly hold their own against perfect play.  If player A's record against player B is 12-0 with 123,765 draws, their ratings will be essentially identical, even if A is playing perfectly and B is only almost-perfect.  I wouldn't be surprised if someone's already done a regression based on a scenario like this and calculated a maximum rating from it.
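That intuition is easy to check by inverting the Elo expectation formula (the same one sketched earlier): a 12-0 result buried in 123,765 draws implies a rating gap of a few hundredths of a point.

    import math

    def implied_elo_gap(wins, losses, draws):
        # Convert a match score into the rating difference that would predict it.
        games = wins + losses + draws
        score = (wins + 0.5 * draws) / games
        return 400.0 * math.log10(score / (1.0 - score))

    print(implied_elo_gap(12, 0, 123_765))   # roughly 0.03 rating points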

Sunday, June 23, 2019

Grandmasters don't play twelve-dimensional chess

A couple of caveats before I start:

First, in case it's not clear by now, let me say explicitly that I'm not a grandmaster, or any kind of master, or even close to being any kind of chess master.  I have, however, watched a few play, browsed through master-level games and read articles and interviews on the subject, not to mention losing quite a few 5-minute games to master-level players and occasionally asking them to explain where I went wrong.

Second, while grandmaster is a title with very specific qualifications, reserved for the best of the best, I'm using it here as a stand-in for any strong player with the same general mindset and skills that I describe here, regardless of how well they actually do in tournament play.

With that out of the way, imagine a critical point in a cheesy movie:

"Welcome to my trap.  I knew you'd come here, just as I know that you will now try to use your secret radio transmitter to summon help, which is why I have taken the precaution of jamming all radio frequencies.  Don't even think about trying to cut off power to the jamming equipment.  It's encased in an unobtanium vault with battery power for the next three months.  Checkmate."

"Ah, my worthy opponent, I did not anticipate that.  But I did not need to.  There is no secret radio transmitter, only this aluminum box of curiously strong mints that I knew your scanning equipment would register as one.  No, I will not be needing a radio today.  I need only wait here and do nothing, while the new head of security you hired comes to my rescue.  Next time be more careful who you use for background checks.  You never know who they might be working for.  Don't bother summoning your bodyguards.  They're already safely locked away where they will do no harm to anyone."

"But ... but ... my plan was perfect.  I accounted for every last detail!"

Characters like this may be casually referred to as chess masters, but as far as I can tell this is not really how actual grandmasters play.

While it's true that many games are won or lost by one player catching something the other missed, it's not necessarily from thinking one move farther ahead, or from calculating every last possibility, but from recognizing the importance of occupying a key square, or opening or closing some particular line of attack, or knowing which rook to move onto a particular file.  This, in turn, generally comes down to who has the better plan.

There will certainly be points in a game where a grandmaster will pause and think deeply about moves, countermoves, counter-countermoves and so forth, and they can do this better than most players, but that's probably not the real distinguishing factor in their play.  Likewise, while tactics are certainly important in top-level games, the better tactician may or may not end up the winner.

One good illustration of this is the simultaneous exhibition, where a professional typically takes on a few dozen amateurs, moving from one to the next for each move (with occasional pauses for a longer exchange) and often winning every game.  If human chess were a matter of sheer calculation then winning every game would mean one person keeping all the possible continuations of 50 different games in mind, or being able to re-calculate them in the few seconds they usually spend at each board.

But of course this is not what's going on.  The pro is relying on experience, often looking for quick tactical wins by exploiting typical amateur mistakes.  They're still planning, just not in detail, maybe something on the order of "That knight is lined up with the king.  I can pin it, then bring my own knight to this square, put my queen on this diagonal and either attack the king directly or just win that rook if they don't notice I have a fork there.  Am I leaving any pieces hanging?  No.  Is my king exposed? No?  Great. On to the next one ..."

A human grandmaster can do this because they see the board not as a collection of pieces on squares but in a more structured way, something like a collection of patterns and possibilities.  Even if the full details of how this is done are elusive, there are hints, like the well-known experiments in which a grandmaster can reconstruct a position after a quick glance while a beginning player or a non-player takes much longer and makes many more mistakes.

Someone without much experience might think "there was a rook on this square, a pawn on this square ... or was it a bishop?" and so forth.  A grandmaster might think something more like "both sides have castled kingside, white's queen bishop is fianchettoed ..." or "this started out as such-and-such opening" or maybe even "this is like that game I lost to so-and-so, and that rook is where I should have put it".

When it comes to playing over the board, a strong player will know many more possibilities to check for than a weaker one, and many more possible hazards to avoid.  One site I ran across gives a checklist of more than two dozen things to check for on every move, just to play a strong amateur game.  I have no doubt that this makes for stronger play, but I personally don't have the patience for this, or the experience for it to have become second nature, so it should come as no surprise that I'm not a strong player.

The most important factor, though, is planning.  If you just play likely-looking moves against a grandmaster, but without a clear plan, you'll almost certainly find yourself in deep trouble before long because they did have a plan.  That innocuous bishop move, that weird-looking pawn push and the shift of a rook from a reasonable-looking square to a different reasonable-looking square were all setting up the attack that's now making your life miserable.  If only you could move your king out of the way -- and that's what that bishop move was really about.

As I mentioned in a previous post, AB engines really do play the kind of insanely calculating game that we associate with the cliche chessmaster character, while NN engines do the same sort of pattern recognition that allows a grandmaster to size up a situation without having to do a lot of calculation, but neither is doing the sort of large-scale planning that a human grandmaster is.  It's also worth reiterating that, while even early chess engines were able to out-calculate human players, it took decades before they could consistently win games against them.

In my view, it would be a mistake to think that the planning grandmasters do is some sort of humans-only ability that computers could never emulate.  There's no reason a computer couldn't represent and execute a plan on the order of "Put your bishop on this diagonal, push these pawns and go for a kingside attack involving the bishop, rook and queen," without exhaustively checking every move and countermove.  It just hasn't proven to be an approach that computers can use to win games.

Wednesday, May 22, 2019

What do chess engines understand about chess?

[This post assumes a familiarity with terminology from previous posts.  Rather than try to gloss everything, I'm going to punt and respectfully ask the confused reader to at least skim through the previous few posts]

It's clear from watching a bunch of games that the two main types of chess engines have different styles, though not all that different.  It's not as though anyone suddenly decided 1. g4 was a great opening.  In broad strokes:

AB engines play very precisely and their evaluations of positions are fairly conservative.  It's quite common for an NN engine to give a large advantage to one side in a position that an AB engine rates as fairly even.  The ratings eventually come in line, but the convergence can happen in either direction.

If the NN engine is right, the AB eval comes up to meet it as it exploits an advantage that the AB engine missed.  If it's wrong, the NN eval comes down and the result is often a draw, either because the NN engine failed to play precisely enough to convert a winning advantage, or the winning advantage was never really there to begin with.  It would be interesting to track down statistics on this, and any interesting exceptions, but I don't have easy access to the data.

When NN engines first hit the scene, the initial take was that neural nets had captured some subtlety of positional understanding that the AB engines, with their relatively simple hard-coded evaluation rules couldn't fathom, even given the ability to look many moves ahead.  This is mostly accurate, I think, but incomplete.

In the Blitz Bonanza, there were any number of NN wins over AB engines, at least the weaker ones, that started with the advance variation of the French Defense (1. e4 e6 2. d4 d5, with white soon playing e5).  This opens up a lot of space for white on the kingside and tends to push the black pieces over to the queenside -- or maybe "lure" is a better word, since there are plausible things for them to do there, including attacking white's pawn center.  The problem is that white can now launch a fairly potent kingside attack and black doesn't have time to get enough pieces over to defend.

Apparently none of this looks too bad to an AB engine.  Yes, white has an advanced pawn in the center, but there are chances to counterattack and weaken that pawn.  Once the white attack really gets going, often aggressively pushing the g and h pawns, its king may start to look correspondingly weak -- except black doesn't have any pieces well-located to attack it.  A typical AB engine's eval function doesn't try to account for that -- looking ahead it will either find an attack or it won't.

Probably this disconnect of white seeing good chances and black finding plusses and minuses to the same moves continues until black can see actual tactical threats coming, at which point it's toast since its pieces are now bottled up on the wrong side of the board.  The natural explanation is that the NN engine understands that there are good attacking prospects on the kingside and black's pieces are poorly placed to defend.

What's actually going on is probably a bit different.  In its training phase, the NN found that positions with, say, more black pieces on the queenside, lots of open squares on the kingside, and pawns pushed up near the black king tend to lead to wins.  A human playing in the same style would probably say "My plan was to open up some space, get black's pieces bottled up on the queenside and launch a kingside attack".  But here there is no plan.  White is playing moves that tended to work well in similar situations in training.

As I mentioned in a previous post, this lack of a plan becomes blindingly obvious in endgames.  Many endgames have an almost mathematical feel to them.  White is trying to hold a draw.  Black is up a bishop for two pawns.  Picking off one of those pawns would allow one of black's own pawns to promote unimpeded.  But white's pawns are all on the wrong color square for the bishop to attack and they're blocked against black's pawns (and vice versa), so all white has to do is keep black's king away.

Conversely, though, if black can draw white's king away from where it's blocking black's king, it can sneak its own king through and pick up the undefended pawns.  And there just happens to be a tactical threat that will accomplish that ...

A human player can figure positions like this out logically and win or hold the draw, as the case may be, even against perfect play.  An AB engine can often look ahead far enough to find the winning variation.  An NN engine might or might not stumble on the right move.

Likewise, it's common for an NN engine to see a large advantage in a position that's clearly a draw.  Suppose in the previous scenario there's a way for white to safely shuffle its king back and forth between two squares and keep the black king from getting through.  Quite likely the AB engine will evaluate the position as dead even, and the human kibitzers will agree. The NN engine, if it's playing black in this example, will continue to see a sizable advantage almost up to the 50-move limit, even though all it's doing is rearranging its pieces in order to avoid threefold repetition.

If understanding chess means having a feel for what kind of positions have good or bad prospects without trying to calculate every possible variation, that is, playing like a human, then NN engines are clearly doing something similar.

On the other hand, if understanding chess means being able to find the right moves to win or draw a difficult endgame, that is, playing like a human, then AB engines clearly have the upper hand.

But if understanding chess means being able to reason about why an endgame is won, lost or drawn, or being able to put together a general plan of attack without knowing in advance exactly what moves need to happen in what order, that is, playing like a human, then neither type of engine is even trying.

What's chess to a computer?

It's now quite clear that computer chess is going to eat this blog for a while, so I'm in the process of unpacking several points that have come up in my earlier attempts to just make a few quick points and move on.  I'm also coming to realize that, since I'm having even more trouble than usual organizing this into a coherent whole, I'll probably end up with N posts worth of material spread across N+1 posts, if not more.  Apologies.  This particular post is an excursion into computer science land to look at computer chess from that point of view, rather than trying to understand it as chess.

Chess, fundamentally, falls into a large class of problems called tree searches, in which a number of interconnected structures are explored by proceeding from a starting point to those that are connected directly to it, and those connected directly to the first set, and so forth.  There are lots of different ways to do this, depending mostly on what order you do the visiting in.

Examples of tree searches, besides finding good chess moves, include figuring out the best way to route packets in a network, solving a puzzle where odd-shaped blocks have to be arranged to form a cube, finding the shortest path through a maze and following reflections and refractions to find what color a particular pixel in an image will be (ray-tracing).

Each point in a tree search is called a node.  In chess, the nodes are chess positions (including where the pieces are, which side is to move and a few other bits such as whether each side's king has moved).  At any point in a tree search you are either at a leaf node, meaning you have no further choices to make -- in chess, you have arrived at a win, loss or draw, or more often simply decided not to explore any deeper -- or not, in which case you have one or more child nodes to choose from.  In chess, the children of a position are all the positions that result from legal moves in the current position.

Since there is no element of chance in chess, there is one and only one game tree that starts at the initial position (the root node) and includes every possible position resulting from every series of legal moves.  This means that in theory it's possible to know whether any position is a win for white, a win for black or a draw assuming each side plays perfectly.  That includes the initial position, so in theory there is an optimal strategy for each player and all perfectly-played games of chess have the same result.

The problem is that the game tree for chess is huge.  There are way, way more legal chess positions than fundamental particles in the known universe.  No computer will ever be able to examine more than a minuscule fraction of them.  That doesn't necessarily mean that chess will never be "solved", in the sense of proving what the optimal strategies are for each player and what the result will be with perfect play.  In theory there could be a way to rule out all but a tiny fraction of possible positions and exhaustively search what's left.  In practice, we're nowhere near being able to do that.

Instead, we're left with approximate approaches.  Either exhaustively search a tiny, tiny fraction of the possibilities, which is what AB engines do, or find some sort of approximate measure of what constitutes a "good" position and "just play good moves", using a partial search of the possible continuations to double-check that there aren't any obvious blunders.  This is what humans and NN engines do.



One thing that makes chess particularly interesting as a search problem is that it's not totally random, but it's not totally orderly either.

Imagine a search tree as a series of branching passageways.  At the end of every passageway, after however many branches, is a card reading "black wins", "white wins" or "draw".  If all you know is that the cards are there then you might as well pick randomly at each junction.  On the other hand, if you have a map of the entire complex of passageways and which card is at the end of each of them, you can find your way perfectly.  Depending on where the cards are, you might or might not be able to win against a perfect player, but if your opponent makes a mistake you'll know exactly how to take advantage.

In chess there are far too many passages to map out, but there are markings on the walls.  For example, there are several types of endgame positions which even a human can always play perfectly.  If you reach a position where you have a king and a rook and your opponent only has a king (and it's not a stalemate to start out with), you can win by following a simple recipe.

There are also databases of positions ("tablebases") with known outcomes, found by exhaustively searching backwards from all possible checkmate positions.  Currently every endgame with seven or fewer pieces (including the kings) is known to be either a win for white, win for black or a draw. If it shows up by working backward from one of the possible checkmates, it's a win.  If it didn't, it's a draw (the process of working backwards from known positions is itself a tree search).
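The backward search itself is easy to sketch in the abstract.  Here's a toy version over an arbitrary game graph, using a simple take-1-or-2 subtraction game instead of chess and ignoring refinements like depth-to-mate and the 50-move rule; real tablebase generators are vastly more engineered, but the win/loss propagation is the same idea.

    def solve(positions, children, is_terminal_loss):
        """Label each position win/loss/draw for the side to move by working
        backwards from terminal losses, tablebase-style."""
        result = {p: "loss" for p in positions if is_terminal_loss(p)}
        changed = True
        while changed:
            changed = False
            for p in positions:
                if p in result:
                    continue
                kids = children(p)
                if any(result.get(c) == "loss" for c in kids):
                    result[p] = "win"     # some move leaves the opponent lost
                    changed = True
                elif kids and all(result.get(c) == "win" for c in kids):
                    result[p] = "loss"    # every move hands the opponent a win
                    changed = True
        # Anything never labelled can't be forced either way: call it a draw.
        return {p: result.get(p, "draw") for p in positions}

    # Toy game: a pile of n stones, take 1 or 2 per turn; a player with no move loses.
    labels = solve(range(11),
                   children=lambda n: [n - t for t in (1, 2) if n - t >= 0],
                   is_terminal_loss=lambda n: n == 0)
    print(labels)   # multiples of 3 are losses for the side to move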

There are a few kinds of positions that are well-known to be immediate wins for one side or the other, regardless of what other pieces are on the board.  The classic example is a back-rank mate, where one side's king is blocked between the edge of the board and its own pawns and the other side has major pieces able to give check.  It's simple to calculate whether the defending side can capture all the pieces that can give check or can safely interpose (block).  If not, it's game over.  My understanding is that chess engines don't bother to special-case these, since they're typically looking several moves ahead anyway.

And then there's a lot of gray area.  If one side gains a material advantage, it will generally win, eventually, but this is far from an ironclad rule.  There are plenty of cases where one side will give up material (sacrifice) in order to gain an advantage.  This ranges from spectacular combinations where one side gives up a queen, or more, in order to get a quick checkmate, to "positional sacrifices" where one side gives up something small, generally a pawn, in order to get an initiative.  Whether a positional sacrifice will be worth it is usually not clear-cut, though in some well-known cases (for example, the Queen's gambit), it's generally accepted that the other side is better off not taking the material and concentrating on other factors instead.

There are also temporary sacrifices where one side gives up material in order to quickly win back more.  If you're playing against a strong player and it looks like they've left a piece hanging, be very careful before taking it.

In short, which side has more material is a clue to which side will win, and it's often a good idea to try to win material, but it's not a guarantee.  This goes equally well for other generally accepted measures of how well one is doing.  Maybe it's worth it to isolate or double one of your pawns in order to gain an initiative, but often it's not.  Games between human masters often turn on whose estimation of which small trade-offs are correct.

From a computing point of view, since it's impossible to follow all the passageways, we have to rely on the markings on the wall, that is, the characteristics of particular positions.  An AB engine will look exhaustively at all the passages that it explores (subject to the "alpha/beta pruning" that gives them their name, which skips moves that can't possibly lead to better results than are already known).  But in most positions it will have to give up its search before it finds a definitive result.  In that case it has to rely on its evaluation function.  Basically it's saying "I've looked ten branches ahead and if I make this move the worst that can happen is I'll get to a position worth X".

That's much better than picking randomly, but it only works well if you explore quite deeply.  A good human player can easily beat a computer that's only searching three moves (six plies) ahead because the human can easily find something that looks good in the short term but turns bad a few moves later.  Once an AB engine is looking dozens of moves ahead, though, there are many fewer short-term traps that don't pan out until late enough for the search to miss them.

This is a lot like the problem of local maxima in hill-climbing algorithms.  If you're trying to get to the highest point in a landscape, but it's so dark or foggy you can only see a short distance around you, your best bet could well be to go uphill until you can't any more, even though that fails if you end up atop a small hill that's not actually the high point.  The further you can see, the better chance you have of finding your way to the true high point, even if you sometimes have to go downhill for a while to get on the right path.
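Here's a tiny sketch of that trap on a made-up one-dimensional "landscape" (nothing to do with chess specifically): a climber that can only see one step either way stops on the small hill, while one that can see a few steps further escapes it.

    # A made-up landscape with a small local hill at x=2 and the true peak at x=8.
    heights = [0, 3, 5, 4, 2, 4, 6, 8, 9, 7]

    def climb(start, sight):
        """Greedy hill-climbing: repeatedly step toward the highest point
        visible within `sight` steps; stop when nothing visible is higher."""
        x = start
        while True:
            lo, hi = max(0, x - sight), min(len(heights) - 1, x + sight)
            best = max(range(lo, hi + 1), key=lambda i: heights[i])
            if heights[best] <= heights[x]:
                return x
            # Move one step toward the best point in view.
            x += 1 if best > x else -1

    print(climb(0, sight=1))   # stuck on the local hill at x=2
    print(climb(0, sight=4))   # sees past the dip and reaches the peak at x=8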

Humans and neural networks, however, are doing something slightly different.  They can't look very far, but they're able to read the landscape better.  It's kind of like being on a mountainside in the fog and being able to look at the vegetation, or smell the wind, or examine the types of rock in the area or whatever, and be able to say that the peak is more likely this way than that way.

This wouldn't work if there weren't some sort of correlation between the features of a position and how likely it is to ultimately lead to a win.  It's not a property of tree-searching algorithms.  It's a property of chess as a tree-searching problem.  It's probably a property of any chess-like game that humans find interesting, because that's probably what we find interesting about chess: there's some order to it, in that some positions are clearly better than others, but there's some mystery as well, since we often can't tell for sure.

Tuesday, May 21, 2019

Computer chess engines fall in line

[This post came out of trying to figure out what computer chess results tell us about what kind of problem chess-playing is.  It goes into quite a bit of detail about cycles where player A beats player B beats player C beats player A and why they don't seem to happen.  Shortly after I posted this, the Computer Chess Championship "Deep Dive" round finished up, and there was a slight anomaly: Stockfish won the tournament, but second-place LC0 beat it head to head, by the narrow margin of 4-3-43, implying an Elo difference of 8 points.  I don't think this affects the post greatly, but it does reinforce the point that NN engines are less consistent than AB engines.  Stockfish didn't lose a game to 4th-place Houdini while LC0 both won and lost.  I'll update this slightly when chess.com makes the full results available --D.H. May 2019]

One thing jumps out from watching the ongoing computer chess tournaments that is so ordinary-seeming that it's easy to dismiss entirely:  The rankings from a preliminary round with a couple dozen contestants are strikingly linear.  If you look at the cross-table, which shows everyone's results against everyone else's, wins in green, losses in red and draws in gray, not only do everyone's scores improve down the table, but for the most part all the red is above and all the green below.  Stronger engines very consistently beat weaker engines, especially considering that at least some of the apparent upsets were due to system crashes and such.

I've seen this happen twice now, once for the "Blitz Bonanza" (5 minutes plus two seconds per move) and the Deep Dive (15 minutes plus 5 seconds per move).  The Deep Dive preliminary, which just finished a few days ago, was played in "escalation" format, meaning that each new player plays everyone who's already played (twice as white and twice as black).  After a few opponents, you can be pretty confident about where a new player will land -- above whoever it's beaten head-to-head, below whoever it's lost to, and close to anyone it drew all four games with.

It shouldn't be a total shock that stronger players beat weaker players, or even that they beat them very consistently.  My point is more that it doesn't have to be that way.  While there are only two basic styles of engine*, each dev team has its own philosophy and each one evaluates positions differently.  In the case of neural net-based engines (generally referred to as NN), the evaluation of positions is done radically differently.

It seems at least plausible that, as with human players, engines might tend to do better or worse than otherwise expected against particular styles of play.  An engine that attacks aggressively might lose to one that's tuned for careful defending, but beat a middle-of-the-road approach that can outplay the careful defender positionally.  This would create a cycle of attacker beats middle-of-road beats defender beats attacker, but this sort of situation doesn't seem to happen in practice.

Without going back and checking the actual numbers, my impression is that AB engines -- the conventional style which does an exhaustive search as far ahead as it can manage, using fairly understandable hand-coded rules to evaluate positions quickly -- win more by being able to look further ahead than by having cleverer evaluation rules.  That is, evaluation speed is a good predictor of who will win.

Clearly if two engines have identical evaluation rules, but one implements them more efficiently, the faster engine should win because it can see further ahead.  The faster engine will see everything the slower engine sees, but it can also see which of two seemingly equal moves leads to a better result.  It will therefore be able to steer toward winning positions and the slower engine won't see the danger until it's too late.

Even with each team using its own particular evaluation rules, something like this still happens.  Typically when two AB engines play, there will be a period where the weaker engine shows a more favorable evaluation for itself than the stronger one does, e.g., the weaker engine is playing white and it sees a half-pawn advantage for white while the stronger engine sees a small advantage for black.

The weaker engine thinks everything's going fine while the stronger engine knows that it has the edge.  Eventually the weaker engine's evaluation starts to drop as it realizes things are no longer as good as they looked, and the stronger engine pushes its small advantage into an overwhelming one.

Again, it isn't a foregone conclusion that a faster engine will beat a slower one with a different evaluation function, but it's what we see in practice.  My suspicion is that evaluation functions hit diminishing returns after a certain point.  You've accounted for material, you've got some concept of tactical considerations like pins and hanging pieces, and a bit on positional factors like space and pawn structure.  Further improvements are certainly possible, in the form of slightly better evaluation functions, slightly faster evaluation, or both, but the low-hanging fruit was picked long ago.

Adding a detailed model of, say, pawn structure, is likely just to slow the evaluation down without providing enough benefit to make up for the decrease in depth.  This is the "dumb is smarter" principle that seems to have held sway for the past few decades.

Stockfish -- currently the clear champion among AB engines -- demonstrates this nicely with its "fishtest".  As I understand it, developers try out changes to the program to see if their locally patched version consistently beats the current released version.  Only winning changes are brought into the next release.  While Stockfish has improved significantly as a result of fishtest, progress has been incremental, and the end result is that Stockfish can evaluate positions very quickly.  I'm not 100% sure it's the fastest, but it's certainly high up there.

I'm sure everyone, open source or not, playtests potential improvements (and the training phase of NN engines takes this to an extreme), but crowdsourcing this ought to keep development from fixating on some particular person's or small group of people's idea of what must be right.  Again, fewer assumptions means better results.


Bringing NN engines into the picture doesn't seem to change this linear ranking.   NN engines are on the other extreme of the smart vs. fast tradeoff.  For a given number of layers and so forth a position generally takes just as long to evaluate regardless of the exact parameters of the net being used, so the only difference in play will be due to the net itself.  Evaluation speed is not a factor since it doesn't change.

This doesn't necessarily rule out A-beats-B-beats-C-beats-A cycles; in fact, it seems like they might be more likely, since different nets will put different weights on different factors -- assuming there's more than one way to be good at chess.  So far, though, there don't seem to be any.  In the tournament results I've seen so far, there are three strong NN engines: Leela Chess Zero (generally known as LC0), Leelenstein and Antifish (a modified Leela trained specifically to beat Stockfish).  In the Blitz Bonanza, LC0 had winning records against the other two and Leelenstein had a winning record against Antifish.  Antifish does not seem to do better than the others against Stockfish.  If anything it does slightly worse.

There are only two NN engines in the Deep Dive finals (rules now prohibit more than one variant of the same engine reaching the finals, so Antifish is out).  Obviously there will be no cycle between them. For what it's worth, LC0 is currently slightly ahead of Leelenstein, but they're dead even head-to-head.

There is one clear deviation from the orderly world of AB engines.  Since NN engines evaluate so slowly, they take a random sample of positions when looking ahead, rather than searching exhaustively like AB engines.  I believe the sample is weighted towards moves the net considers likely to be good, but in any case this introduces some inconsistency.  NN engines occasionally lose to weak AB engines, since not all good-looking moves turn out actually to be good, while strong AB engines almost never do.  Against strong AB engines, NN engines can find good moves that the AB opponent overlooks because they don't lead to good results quickly enough for the exhaustive search to turn them up, but they can also play good-looking moves that turn out to be unsound if you search well enough.
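A cartoon of that weighted sampling, with made-up move names and priors; real NN engines like LC0 use a PUCT-style rule that also factors in visit counts and backed-up values, so take this only as a sketch of why low-prior moves can get missed.

    import random

    # Made-up priors the "net" assigns to four candidate moves in some position.
    priors = {"Qh5+": 0.55, "Nf3": 0.30, "a4": 0.10, "Rxe6": 0.05}

    def sample_visits(priors, budget):
        # Spend a limited budget of playouts roughly in proportion to the priors.
        moves, weights = zip(*priors.items())
        visits = {m: 0 for m in moves}
        for _ in range(budget):
            visits[random.choices(moves, weights=weights)[0]] += 1
        return visits

    print(sample_visits(priors, budget=100))
    # With only 100 playouts, the low-prior move -- which might be the one sound
    # tactical shot -- may get just a handful of visits, or none at all.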

The net effect is that strong NN engines don't do as well against weak AB engines as you'd expect from how they do against strong engines.  In the current Deep Dive finals, Stockfish is (at this writing) noticeably ahead of LC0, but head-to-head they're nearly even (3 wins to 2 with 41 draws, implying an ELO difference of 8 points; by contrast, human world champion Carlsen is currently rated about 57 points ahead of second-ranked Caruana) [and in fact, Stockfish won the finals but lost to LC0 head-to-head].

The difference is that Stockfish has done better against the weaker opposition.  Stockfish is 9-0 against 4th-place Houdini (with a whole bunch of draws), while LC0 is 12-3 against Houdini (with correspondingly fewer draws).  Stockfish also does better against the other NN engine, 3rd-place Leelenstein.

So there don't seem to be any cycles in the ranking of chess engines, and the (small) difference between top AB engines and top NN engines looks to be due to consistency.  What does this tell us about chess-playing as a problem?  There are several possible conclusions:
  • Chess skill really is a linear ranking, at least once you get to the level of computer engines.  If A beats B and B beats C, then A beats C.
  • A bit stronger: There is one true way to play chess and stronger engines approximate this better.
  • On the other hand: AB engines are linearly ranked because only depth/evaluation speed really matters while NN engines are linearly ranked because only the quality of the net doing the evaluation matters.  It's not surprising, though not inevitable, that combining two linear rankings gives another linear ranking.
  • Also, NN engines are all playing on or near the level of the current AB champion, so they're really only being compared to one style of play since there's only one really competitive opponent.
My feeling is that we don't really have enough to know for sure, but most likely we'll continue to see improvement in playing strength and that the linear ranking will hold.  Next year's champion will be able to beat any other opponent head-to-head, the one after that will be able to beat it and everyone else, and so forth, regardless of whether the next champion is AB, NN or some hybrid of the two.  There are noticeable differences between how AB engines and NN engines play (I'd like to get into the details in a future post), but be that as it may, stronger is stronger.

* Technically, there are two choices on two dimensions: Alpha/Beta tree search vs. Monte Carlo and neural-net based evaluation vs. hand-coded rules.  However, NN engines always use Monte Carlo since it takes them roughly a thousand times longer to evaluate a particular position, so they can't search exhaustively.  On the other hand, conventional engines generally use AB searches because they can.  There have been fairly successful experiments in using Monte Carlo with hand-coded rules.  Those tend to be called hybrid.

Saturday, April 27, 2019

Notes on the EHT black hole image

Following on to the recent post on the other blog about the imaging of a black hole in a nearby galaxy, here are my lightly cleaned up notes from watching Katie Bouman's lecture on how the images were captured using the Event Horizon Telescope (EHT), which is actually a collection of several radio telescopes whose individual data sets are combined using very sophisticated image processing techniques to emulate a single Earth-sized telescope.

I have no great insights to offer, but there are some pretty interesting details in the original lecture.  I've thrown in some commentary here and there, without taking any real care to distinguish it from the source material. The full lecture is on Caltech's YouTube channel.  As always, any errors or distortions are mine.



Technically, the image gathered by the telescopes working in combination is in terms of spatial frequency, basically how quickly brightness changes from light to dark as you scan across the actual image.  For example, a square divided into a white rectangle and a black rectangle has a lower spatial frequency than one divided into thin black and white stripes.  It's possible to reconstruct a visual image from information about the spatial frequencies, and the more information you have, the better the reconstruction.  This sort of reconstruction is widely used in many fields, including astronomy, medical imaging and the image compression that lets us stream video over the internet.
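
If you'd like to see the rectangles-versus-stripes example in code, here's a toy numpy sketch (my own illustration, not anything from the lecture).  It builds the two images and uses a 2D Fourier transform to show that the striped image's power sits at much higher spatial frequencies:

    import numpy as np

    n = 64
    halves = np.zeros((n, n)); halves[:, n // 2:] = 1.0   # one big light/dark edge
    stripes = np.zeros((n, n)); stripes[:, ::2] = 1.0     # an edge at every column

    def mean_spatial_frequency(img):
        # Power spectrum from the 2D FFT, with the constant (zero-frequency) term removed.
        spectrum = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
        spectrum[n // 2, n // 2] = 0.0
        freqs = np.fft.fftshift(np.fft.fftfreq(n))
        fy, fx = np.meshgrid(freqs, freqs, indexing="ij")
        radius = np.hypot(fx, fy)                  # distance from center = spatial frequency
        return (radius * spectrum).sum() / spectrum.sum()

    print(mean_spatial_frequency(halves))    # low: brightness changes once per scan line
    print(mean_spatial_frequency(stripes))   # high: brightness changes at every pixel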

The exact spatial frequencies being measured depend on the distance (baseline) between the telescopes.  When the telescopes are spread across distant locations on the Earth, the imaging technique is called Very Long Baseline Interferometry (VLBI).  Interferometry refers to measuring how the signals from different telescopes interfere with each other, that is, whether they reinforce each other or cancel each other out when combined.  Very Long Baseline refers to, um ...

In this case there were eight radio telescopes at six different sites, taking data together on four different nights, scanning for a few minutes at a stretch.  This provides 21 different pairs of telescopes, which would mean 21 different baselines, that is, 21 different spatial frequencies to combine, except that the baselines change as the Earth rotates.  This is good, because it provides samples of more frequencies.  There is at least some talk of adding telescopes in low Earth orbit at some later stage.

As mentioned in the other post, each scope collected hundreds of terabytes (hundreds of trillions of bytes), enough data that it was easier to fly the drives with the data to the site where it would be processed, rather than try to send it over the internet (and the South Pole data had to be flown out in any case).

The actual signal, coming from about 50 million light-years away, is quite weak and noisy.  There is a lot of matter at the center of a galaxy, radiating at some frequencies and blocking others.  The radio band was chosen to minimize these effects, but there is still noise to contend with, along with noise from a variety of other sources, including atmospheric noise and imperfections in the telescopes themselves.

The various telescopes are looking through different amounts of atmosphere, depending on whether the black hole is overhead from their location or on the horizon, and this also affects how long it takes the signal to arrive at each telescope.  It takes about 20 milliseconds longer for the signal to reach a scope that has the black hole on the horizon than it does for one where it's overhead.  For a 1GHz signal, that means a difference of 20 million cycles -- and again, this is varying as the Earth rotates.
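
The arithmetic behind those numbers is simple enough to spell out.  Treating the extra path length for the horizon-view telescope as roughly one Earth radius (my simplification, but it gives the right order of magnitude):

    # Extra travel time for a telescope that sees the source on the horizon,
    # relative to one that sees it overhead.
    EARTH_RADIUS_M = 6.371e6
    SPEED_OF_LIGHT_M_S = 2.998e8

    delay_s = EARTH_RADIUS_M / SPEED_OF_LIGHT_M_S
    print(delay_s * 1e3)   # ~21 milliseconds
    print(delay_s * 1e9)   # ~2.1e7 cycles of a 1 GHz signal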

All this means there is a lot of calibration to be done to make sure you're lining up the right data from the various telescopes and accounting for all the various sources of noise and distortion.  In all, there were hundreds of thousands of calibration parameters to work with, though the team ended up using ten thousand or so.

The reconstructed image is meant to represent what an Earth-sized telescope would see, if one could be built, but there are actually infinitely many such images that would match up with the limited data that was collected from a few scattered points on Earth.   Fortunately, having multiple telescopes means you can cross-check and isolate errors.  There are also "closure quantities" that should be the same regardless of the exact details of calibration (I didn't really understand the details of closure quantities). 

The team used two different methods of calibration, and they split into four subteams, two for each method.  The subteams worked independently for several weeks to reconstruct an image from the raw data.  Bouman's own work involved getting the human-tuned method (CLEAN) to work without human tuning.

In addition to the work by the subteams, there was a lot of work done up front to try to make sure that the image processing algorithms they were tuning were robust.  Much of this involved taking test images, calculating what signals the telescopes would get and then reconstructing the test images from the calculated signals.  They did this not only with images of discs and rings that one would typically expect in astronomical images, but also with a variety of non-astronomical images.

The team actively tried to break the image processing algorithms, for example by tuning the parameters to favor a solid disc rather than a ring, and verifying that a ring still came out as a ring.  Python fans will be glad to know that at least some of the code involved was written in Python.

The public image is a combination of the images from the subteams.  Since it's meant to represent only what we can be confident about regarding the actual black hole, it's deliberately a bit blurry.  Even still, there are some significant results [I haven't yet re-checked the details here, so what follows may be garbled]:
  • There is, in fact, a ring as would be expected from a supermassive black hole.
  • The same ring was observable night after night.
  • Previously there had been two separate estimates of the mass of the black hole, one from the motions of stars around it (stellar dynamics) and one from the appearance of the gas flowing into the accretion disc.  The size of the ring in the image supports the first estimate.
  • One portion of the ring is brighter than the rest, due to the doppler shift of matter orbiting around the black hole.  This is to be expected, and the position of the bright spot matches up with other observations.
  • The position of the bright spot is shifting over time, consistent with a rotation period on the order of weeks.  Since the black hole in the Milky Way (Sgr A*) is much smaller, its rotation period is expected to be on the order of minutes.  Since the observation scans are also on the order of minutes, the image processing for Sgr A* will have to take this into account.
It's also possible to analyze the raw data directly as a cross-check.  A disk of such-and-such size should result in such-and-such radio signals arriving at the various telescopes, independent of any reconstructions produced from that data.  The raw signals are consistent with a ring of the size found in the visual reconstructions.  This is not a surprise, given how the reconstructions were done, but still good to know.

Despite the whole effort taking years, requiring expensive radio telescopes at widely dispersed locations and employing dozens of people at various stages, the movie Interstellar still cost a lot more to make.  Its image of a black hole may be prettier (and, as I understand it, one of the parts of the movie where the science was actually plausible), but the EHT image was both cheaper and more informative.

Again, really good stuff.

Sunday, April 14, 2019

Notes on watching (a chunk of) a computer chess tournament

A while ago I started a post trying to untangle what kind of a problem it is to play chess well, based on how well various kinds of players -- human, conventional computer chess engines and neural-network-based chess engines -- did against each other.  I still think it's an interesting question, but in the mean time I actually watched a bunch of computer chess and that was even more interesting.  In particular, I looked in on a fair bit of chess.com's "CCC 7: Blitz Bonanza".  Here are some rough impressions.

The Blitz Bonanza is one part of a long-running event comprising thousands of individual games in several different formats (the nice thing about chess engines is that they don't get tired, so it's practical to have two of them play each other dozens of times in the course of a few days).  The "Deep Dive", with longer time controls and slightly different rules, is going on now and will continue for several weeks, and there may be more after that (much of this is from memory, because I can only seem to find current results on chess.com and the summaries on other sites don't always contain the exact information I'm looking for).

In the Blitz Bonanza, each player has five minutes of time to start with, plus two seconds per move made.  There is no "opening book", meaning that the players must calculate every move.  This leads to the rather odd sight of a clock ticking down near or past the 4-minute mark on move one, even though move one is pretty much inevitably something everyone plays anyway (1. d4 and 1. e4 predominate).

Play continues either to checkmate by one side (though it enters a "get it over with" or "bullet" mode if both engines evaluate the current position as highly favorable to one side or the other), or to a draw by stalemate, threefold repetition or 50 moves without a pawn move or capture, or to "adjudication".  Adjudication means either that the position is in a standard database of positions with six or fewer pieces on the board that cannot be won if both sides play perfectly (this includes "insufficient force" scenarios like king and bishop against king), or that both engines have evaluated the game as a dead draw for a certain number of moves.
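
For the programmatically inclined, here's a rough sketch of those stopping conditions using the python-chess library.  The piece-count cutoff and the "dead draw for a while" threshold are placeholders of my own, not chess.com's actual adjudication settings:

    import chess

    def game_over(board: chess.Board, recent_evals) -> bool:
        """Rough version of the tournament's stopping rules (thresholds are made up)."""
        # The ordinary rules of chess: checkmate, stalemate, insufficient force,
        # threefold repetition, 50 moves without a pawn move or capture.
        if board.is_checkmate() or board.is_stalemate() or board.is_insufficient_material():
            return True
        if board.can_claim_threefold_repetition() or board.can_claim_fifty_moves():
            return True
        # Adjudication stand-ins: few enough pieces that a six-man tablebase could
        # decide the result, or both engines have been calling the position a dead
        # draw for a while (recent_evals holds their recent scores, in pawns).
        if len(board.piece_map()) <= 6:
            return True
        if len(recent_evals) >= 20 and all(abs(e) < 0.1 for e in recent_evals[-20:]):
            return True
        return False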

This is rather different from human matches, where a player will typically resign if it becomes clear there's no way to avoid losing, and players typically agree to a draw when it's clear that neither of them has a realistic shot at winning unless the other one blunders.  It's sometimes fun to watch an actual checkmate.  The draws can be interesting in their own way.  More on that later.

Scoring was the usual one point for a win, no points for a loss and half a point to each player for a draw.

The original field consisted of
  • The premier conventional (alpha/beta or AB) engine, Stockfish
  • Three neural network-based (NN) engines (Leela Chess Zero, aka LC0, Leelenstein and Antifish)
  • Everyone else
In the first round, each of these played each of the others several times, half of the time as white and half as black.  The top four would progress to the next round.  When the smoke cleared, the top four were LC0, the other two neural network engines, and Stockfish rounding out the field in fourth.

In the second round, each of the four played each of the others 100 times (50 as white and 50 as black), for a total of 600 games.  This went on around the clock for several days.   The final result had LC0 in first, Stockfish close behind, well ahead of Leelenstein, which was itself well ahead of Antifish.  LC0 had a small lead over Stockfish in their head-to-head games.

Looking at this with my somewhat defocused engineer's eyes, I'd say "LC0 and Stockfish were about equal, both were clearly stronger than Leelenstein, and Antifish was clearly the weakest of the lot."  I'd consider LC0 and Stockfish essentially equal not only because of the small difference in score (5.5 points out of a possible 300), but because the margin between the two was steady for most of the play.  LC0 won some early games against Stockfish head-to-head, but after that each scored several wins against the other.  The early lead thus looks more like a statistical fluke than a real difference in strength [I'm not sure a real statistician would buy this analysis -D.H May 2019].  The results in CCC 6, with slightly different rules and time controls, bear this out.  Stockfish beat LC0, but not by much, and, if I recall correctly, not head-to-head.

As to my long-running theme of "dumb is smarter", the "zero" in LC0 means that it was trained purely by playing against itself, without any external hints.  Antifish, which finished last and had a losing record against each of the others, including Stockfish, was trained specifically to exploit weaknesses in Stockfish.  If we take LC0's "no extra information to bias the results" approach as one kind of "dumb" and Stockfish's "bash out lots of positions using fairly straightforward (but highly-tuned) rules" as another, it's pretty clear that "dumb" dominated.

One other thing jumped out: Neural networks are not good at endgames, or at least they can be spectacularly bad at some endgames.

Endgames are interesting here because there are few pieces on the board and it's often possible to spell out an explicit plan for winning.  For example, if you have a rook and a king against a lone king, you can use your rook and king in concert to drive the other king to the edge of the board and then checkmate with the rook.  In other cases, you want to promote a pawn, and this hinges on getting your king to the right square and keeping the other king from stopping you.

Over and over I saw neural network engines flail about aimlessly before hitting the 50-move limit.  To be clear, there are a few endgames that can take more than 50 moves to win, but even in those you can see a clear progression.  To make up an example, after a series of ten checks the opposing king is now one square closer to the corner where it will eventually be checkmated.  That isn't what was happening.  In one game I saw, the AB engine showed mate in 12, while the NN engine saw its winning chances as quite good.  The game continued:

AB: Oh, now you've got mate in 14
NN: Yeah this looks good
AB: I don't see a mate, but I think you're going to promote that pawn (or whatever) and win, so my rating is still highly in your favor
NN: Yeah this looks good
AB: Dang, mate in 14 again
NN: Yeah this looks good
AB: Mate in 13
NN: Yeah this looks good
AB: I don't see a mate any more, but ...
NN: Yeah this looks good

From the NN's point of view, it was going from one position with good winning chances to another.  But it wasn't actually winning.  When the game eventually ended in a draw, the NN gave away a half point that the AB engine or a good human player would have gotten, given the same position.


I was a little (but not a lot) surprised by this.  It doesn't seem like the problem is beyond the capability of neural networks.  They've discovered plenty of conventional tactics -- forks, skewers and such, or ladders in the game of go -- and they are clearly able to encode at least some of the usual winning patterns.

Give a neural network a king and queen against a queen and it will find a checkmate more or less like anyone else would.  On the other hand, give it a queen, a pawn and a rook against a king and it will quite likely forcibly give away two of the three and win with what's left.  To a mathematician it makes sense (reducing to a previously-solved case), but to most chess players it looks like lunacy, if not deliberate trolling [In one fairly jaw-dropping example from the Deep Dive finals, LC0 had an overwhelming material advantage in an endgame and gave away pieces to reduce it to bishop, knight and king against lone king, one of the harder endgames to win.  It went ahead and won anyway --D.H May 2019].  

I found this behavior particularly interesting because neural networks are supposed to have more of that special sauce we call intelligence or understanding.  We can clearly say that Stockfish has no deep understanding of chess in the way a human grandmaster does -- the grandmaster can intuitively say "Your pieces are active and the king is exposed.  Go for a mating attack" -- but at least it plays plausible moves.  NN engines often find wild moves that a human wouldn't consider and that an AB engine might avoid because they don't show a clear win, which leads us to feel that the NN has some deeper understanding of chess.  I think that's a valid feeling, but there are cases where it completely falls apart.

Watching a neural network botch an endgame just looks like farce.  I'm perfectly comfortable saying that a program "understands" a concept in chess, or "has a concept of" pawn structure or whatever.  It doesn't mean that the program has some reflective layer thinking "Oh, that's a strong pawn chain, I like that", but that it behaves in such a way that, one way or another, it must recognize good and bad structures.

In cases like these, though, it's abundantly clear that the network has no concept, in any meaningful literal or metaphorical sense of the word, of how to win the position.  Nor is it going to be able to figure one out over the board, because the network only learns during the training phase (that's not necessarily a requirement, but it's how these engines work).

This doesn't seem like a permanent situation.  For example, you could have an NN engine train on endgame positions, and it would likely discover ways to win outright and ways to steer toward a win in positions where it now flails.  This is pretty much what humans do -- you learn how to promote pawns as one skill, how to checkmate with a rook and king as another, and so forth.

What's at least a bit surprising is that an engine can be frighteningly strong in most of the game and completely clueless in a few key spots.  This has to say something about intelligence and learning, or about what kind of a problem chess-playing is technically, or about all of the above, but I'm not quite sure what.

Sunday, December 30, 2018

Computer chess: Dumber and smarter at the same time?

[As usual, I've added inline comments for non-trivial corrections to the original text, but this one has way more than most.  I've come up to speed a bit on the current state of computer chess, so that's reflected here.  The result is not the most readable narrative, but I'm not going to try to rewrite it --D.H. Apr 2019]

One of the long-running tags on the other blog is "dumb is smarter".  The idea is that, at least in the world of software, it's perilous to build a lot of specialized knowledge into a system.  Specialized knowledge means specialized blind spots -- any biases or incorrect assumptions in the specialization carry over to the behavior of the system itself.  In cases like spam filtering, for example, a skilled adversary can exploit these assumptions to beat the system.

I will probably follow up on the other blog at some point on how valid the examples of this that I cited really were, and to what extent recent developments in machine learning have changed the picture (spoiler alert: at least some).  Here I'd like to focus on one particular area: chess.  Since this post discusses DeepMind, part of the Alphabet family, I should probably be clear that what follows is taken from public sources I've read.


For quite a while and up until recently, the most successful computer chess programs have relied mainly on brute force, bashing out millions and millions of possible continuations from a given position and evaluating them according to fairly simple rules [in actual tournament games, it's not unusual to see over a billion nodes evaluated for a single move].  There is a lot of sophistication in the programming, in areas like making good use of multiple processors, avoiding duplicate work, representing positions in ways that make good use of memory, generating possible moves efficiently and deciding how to allocate limited (though large) processing resources to a search tree that literally grows exponentially with depth, but from the point of view of a human chess player, there's nothing particularly deep going on, just lots and lots of calculation.
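
As one small example of the "representing positions in ways that make good use of memory" point, most conventional engines use bitboards: the squares occupied by a group of pieces are packed into a single 64-bit integer, so questions about whole armies become a handful of bitwise operations.  A minimal sketch, using the common a1 = bit 0 through h8 = bit 63 numbering:

    # Squares attacked by a set of white pawns, with the whole pawn army held
    # in one 64-bit integer (one bit per square).
    FILE_A = 0x0101010101010101
    FILE_H = 0x8080808080808080
    ALL_SQUARES = 0xFFFFFFFFFFFFFFFF

    def white_pawn_attacks(pawns: int) -> int:
        up_left  = (pawns & ~FILE_A) << 7   # captures toward the a-file
        up_right = (pawns & ~FILE_H) << 9   # captures toward the h-file
        return (up_left | up_right) & ALL_SQUARES

    # The eight white pawns on their starting squares (a2..h2) attack all of rank 3.
    start_rank = 0x000000000000FF00
    print(hex(white_pawn_attacks(start_rank)))   # 0xff0000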

For a long time, a good human player could outplay a good computer program by avoiding tactical blunders and playing for positional advantages that could eventually be turned into a winning position that even perfect tactical play couldn't save.  If the human lost, it would generally be by missing a tactical trick that the computer was able to find by pure calculation.  In any case, the human was looking at no more than dozens of possible continuations and, for the most part, calculating a few moves ahead, while the computer was looking at vastly more positions and exploring many more of them in much greater depth than a person typically would.

The sort of positional play that could beat a computer -- having a feel for pawn structure, initiative and such -- comes reasonably naturally to people, but it's not easily expressed in terms of an evaluation function.  The evaluation function is a key component of a computer chess engine that reduces a position to a number that determines whether one position should get more attention than another.  Since there are millions of positions to evaluate, the evaluation function has to be fast.  A typical evaluation function incorporates a variety of rules of the kind you'll find in beginning chess books -- who has more material, who has more threats against the other's pieces, whose pieces have more squares (particularly central squares) to move to, are any pieces pinned or hanging, and so forth.

There's quite a bit of tuning to be done here -- is it more important to have a potential threat against the opponent's queen or to control two squares in the center? -- but once the tuning parameters are set, they're set, at least until the next release.  The computer isn't making a subtle judgment based on experience.  It's adding up numbers based on positions and alignments of pieces.
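
To make that concrete, here's a toy evaluation function along those lines, written against the python-chess library: material counted with the textbook piece values, plus a crude mobility bonus.  A real engine's evaluation has far more terms and far more careful tuning; this is just the shape of the thing:

    import chess

    PIECE_VALUES = {chess.PAWN: 100, chess.KNIGHT: 300, chess.BISHOP: 300,
                    chess.ROOK: 500, chess.QUEEN: 900, chess.KING: 0}

    def evaluate(board: chess.Board) -> int:
        """Score in centipawns; positive favors White, negative favors Black."""
        score = 0
        for piece in board.piece_map().values():
            value = PIECE_VALUES[piece.piece_type]
            score += value if piece.color == chess.WHITE else -value
        # Crude mobility term: a couple of points per legal move for the side to move.
        mobility = 2 * board.legal_moves.count()
        score += mobility if board.turn == chess.WHITE else -mobility
        return score

    print(evaluate(chess.Board()))   # 40: material is even, White to move has 20 legal moves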

It's not that no one thought of writing a more subtle evaluation function, one that would allow the program to look at fewer, but better, positions.  It's just that it never seemed to work as well.  Put a program that looks at basic factors but looks at scads and scads of positions against one that tries to distill the experience of human masters and only look at a few good moves, and the brute force approach has typically won.  The prime example here would be Stockfish, but I'm counting engines such as earlier versions of Komodo as brute force since, as I understand it, they use the same alpha/beta search technique and examine large numbers of positions.  I'm having trouble finding exact numbers on that, though [If you look at the stats on chess.com's computer chess tournaments, it's very easy to tell who's who.  For a 15|5 time control, for example, you either see on the order of a billion nodes evaluated per move or on the order of a hundred thousand.]
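
For completeness, here's the other half of the brute-force recipe: a bare-bones alpha/beta search in negamax form, sitting on top of an evaluation function like the toy evaluate() sketched above.  Everything that makes real engines fast -- move ordering, transposition tables, quiescence search, iterative deepening -- is left out:

    import chess

    def negamax(board: chess.Board, depth: int, alpha: float, beta: float) -> float:
        """Score the position from the point of view of the side to move."""
        if depth == 0 or board.is_game_over():
            # A real engine would score checkmate and stalemate specially here.
            sign = 1 if board.turn == chess.WHITE else -1
            return sign * evaluate(board)          # the toy evaluate() from above
        best = -float("inf")
        for move in board.legal_moves:
            board.push(move)
            best = max(best, -negamax(board, depth - 1, -beta, -alpha))
            board.pop()
            alpha = max(alpha, best)
            if alpha >= beta:                      # opponent already has a better option: prune
                break
        return best

    print(negamax(chess.Board(), 3, -float("inf"), float("inf")))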

[Houdini is another interesting example.  Its evaluation function puts more weight on "positional" factors such as active pieces and space, and it tries to be highly selective about which moves it devotes the most resources to.  This is, explicitly, an attempt to emulate the behavior of human players.  So it's not quite correct to say that programs that try to emulate humans have done worse than ones that just bash out continuations.  Houdini has done quite well.

However, from a software point of view, these programs are all largely similar.  There is an evaluation function that's explicitly coded as rules you can read, and this is used to decide how much attention to pay to what part of the search tree.

AlphaZero, by contrast, uses machine learning (aka neural network training, aka deep neural networks) to build an evaluation function that's totally opaque when considered purely as code.  There are techniques for examining what a neural network is doing, but they're not going to reduce to rules like "a bishop is worth three times as much as a pawn".  It also uses a Monte Carlo approach to examine the search tree probabilistically, which is a somewhat different way to use the evaluation function to guide the search.  As I understand it, this is not the usual practice for other engines, though it's certainly possible to incorporate a random element into a conventional chess engine.  Komodo MC comes to mind.

In short, the narrative of "purely mechanical programs beat ones that try to emulate humans" is not really right, but Alpha Zero still represents a radically different approach, and one that is in some sense structurally more similar to what things-with-brains do.  --D.H. Feb 2019]

This situation held for years.  Computer hardware got faster, chess programs got more finely tuned and their ratings improved, but human grandmasters could still beat them by exploiting their lack of strategic understanding.  Or at least that was the typical explanation:  Humans had a certain je ne sais quoi that computers could clearly never capture, not that the most successful ones were particularly trying to.  Occasionally you'd hear a stronger version: computers could never beat humans since they were programmed by humans and could never know more than their programmers, or some such, but clearly there's a hole in that logic somewhere ...


Then, in 1997, Deep Blue (an IBM project not directly related to DeepMind) beat world champion Garry Kasparov in a rematch, Kasparov having won the first match between the two.  It wasn't just that Deep Blue won the games.  It outplayed Kasparov positionally, just as human masters had been outplaying computers, but without employing anything you could point at and call strategic understanding.  It just looked at lots and lots of positions in a fairly straightforward way.

This isn't as totally surprising as it might seem.  The ultimate goal in chess is to checkmate the opponent.  In practice, that usually requires winning a material advantage, opening up lines of attack, promoting a pawn in the endgame or some other tactical consideration.  Getting a positional advantage is just setting things up to make that more likely.  Playing a simple tactical game but looking twenty plies (half-moves) ahead turns out to be quite a bit like playing a strategic game. [In fact, computer-computer chess games are almost entirely positional, since neither player is going to fall for a simple tactical trap.  That's not to say tactics don't matter.  For example I've seen any number of positions where a piece looked to be hanging, but couldn't be taken due to some tactical consideration.  What I haven't seen is a game won quickly by way of tactics.]


Human-computer matches aren't that common.  At first, the contests were too one-sided.  There was no challenge in a human grandmaster playing a computer.  Then, as computers became stronger, few grandmasters wanted to risk being the first to lose to a computer (and Kasparov is to be commended for taking up the challenge anyway).  Now, once again, the contests are too one-sided.  Human players use computers for practice and analysis, but everyone loses to them head-to-head, even with odds given (the computer plays without some combination of pieces and pawns).

At this writing, current world champion Magnus Carlsen, by several measures the strongest player ever or at the very least in the top handful, stands less than a 2% chance of beating the current strongest computer program head to head.  With the question of "will computers ever beat the best human players at chess?" firmly settled, human players can now be compared on the basis of how often they play what the computer would play.  Carlsen finds the computer move over 90% of the time, but it doesn't take many misses to lose a game against such a strong player.


And now comes AlphaZero.

The "alpha" is carried over from its predecessor, AlphaGo which, by studying games between human players of the game of go and using machine learning to construct a neural network for evaluating positions, was able to beat human champion Lee Sedol.  This was particularly significant because a typical position in go has many more possible continuations than a typical chess position, making the "bash out millions of continuations" approach impractical.  Given that computer chess had only fairly recently reached the level of top human players and go programs hadn't been particularly close, it had seemed like a good bet that humans would continue to dominate go for quite a while yet.

AlphaGo, and machine learning approaches in general, use what could be regarded as a much more human approach, not surprising since they're based on an abstraction of animal nervous systems and brains rather than the classic Turing/Von Neumann model of computing, notwithstanding that they're ultimately still using the same computing model as everyone else.  That is, they run on ordinary computer hardware, though often with a specialized "tensor processor" for handling the math.

However, the algorithm for evaluating positions is completely different.  There's still an evaluation function, but it amounts to "run the positions of the pieces through this assemblage of weighted sums and nonlinear functions that the training stage churned out".  Unlike a conventional chess engine, there's nothing you can really point at and say "this is how it knows about pins" or "this is how it counts up material".

AlphaGo looks at many fewer positions [by a factor of around 10,000] than a conventional chess engine would, and it looks at them probabilistically, that is, it uses its evaluation function to decide how likely it is that a particular continuation is worth looking at, then randomly chooses the actual positions to look at based on that.  It's still looking at many more positions than a human would, but many fewer than a pure brute-force approach would.
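
The selection rule the AlphaZero family actually uses (usually called PUCT) is a bit more structured than pure random sampling, but the flavor is the same: the network's prior steers the search, and barely-explored moves get a bonus.  A stripped-down sketch with made-up numbers:

    import math

    def select_move(prior, visits, values, c_puct=1.5):
        """Pick which move to explore next, PUCT-style (heavily simplified)."""
        total = sum(visits.values()) + 1
        def score(move):
            q = values[move] / visits[move] if visits[move] else 0.0          # average result so far
            u = c_puct * prior[move] * math.sqrt(total) / (1 + visits[move])  # exploration bonus
            return q + u
        return max(prior, key=score)

    # Toy numbers: the net likes e4 and d4 best, but a barely-explored c4 that
    # has scored well so far gets picked next thanks to the exploration term.
    prior  = {"e4": 0.5, "d4": 0.4, "c4": 0.1}
    visits = {"e4": 30,  "d4": 20,  "c4": 1}
    values = {"e4": 16.0, "d4": 11.0, "c4": 0.8}
    print(select_move(prior, visits, values))   # c4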


The "zero" is because the training stage doesn't use any outside knowledge of the game.  Unlike AlphaGo, it doesn't look at games played by humans.  It plays against itself and accumulates knowledge of what works and what doesn't.  Very roughly speaking, AlphaZero in its training stage plays different versions of its network against each other, adjusts the parameters based on what did well and what did badly, and repeats until the results stabilize.

AlphaZero does this knowing only the rules of the game, that is, what a position looks like, what moves are possible, and how to tell if the game is won, lost, drawn (tied) or not over yet.  This approach can be applied to a wide range of games, so far go, chess and shogi (a chess-like game which originated in Japan).  In all three cases AlphaZero achieved results clearly better than the (previous) best computer players after a modest number of hours of training (though the Stockfish team makes a good case that AlphaZero had a hardware advantage and wasn't playing against the strongest configuration).  [Recent results indicate that LC0, the strongest neural net based engine, and Stockfish, the strongest conventional engine, are very evenly matched, but LC0 doesn't have the benefit of a tensor processor to speed up its evaluation --D.H. May 2019]

Notably, AlphaZero beat AlphaGo 60 games to 40.


In one sense, AlphaZero is an outstanding example of Dumb is Smarter, particularly in beating AlphaGo, which used nearly the same approach, but trained from human games.  AlphaZero's style of play has been widely described as unique.  Its go version has found opening ideas that had lain undiscovered for centuries.  Its chess version plays sacrifices (moves that give up material in hopes of a winning attack) that conventional chess engines pass up because they can't prove that they're sound.  Being unbiased by exposure to human games or a human-developed evaluation function, AlphaZero can find moves that other programs would never play, and it turns out these moves are often good enough to win, even against chess engines that never make tactical mistakes.

On the other hand, AlphaZero specifically avoids sheer brute force.  Rather than look at lots and lots of positions using a relatively simple evaluation function, it looks at many fewer, using a much more sophisticated evaluation function to sharply limit the number of positions it examines.  This is the same approach that had been tried in the past with limited success, but with two key differences:  The evaluation function is expressed as a neural network rather than a set of explicit rules, and that neural network is trained without any human input, based solely on what works in practice games against AlphaZero itself.


The Dumb is Smarter tag on Field Notes takes "dumb" to mean "no special sauce" and "smarter" to mean "gets better results".  The "smarter" part is clear.  The "dumb" part is more interesting.  There's clearly no special sauce in the training stage.  AlphaZero uses a standard machine learning approach to produce a standard neural network.

On the other hand, if you consider the program itself without knowing anything about the training stage, you have a generic engine, albeit one with a somewhat unusual randomized search algorithm, and an evaluation function that no one understands in detail.  It's all special sauce -- a set of opaque, magical parameters that somehow give the search algorithm the ability to find the right set of variations to explore.

I think it's this opaqueness that gives neural networks their particularly uncanny flavor (uncanny, etymologically, roughly means "unknowable").  The basic approach of taking some inputs and crunching some numbers on them to get an output is unequivocally dumb.  As I said above, "It's adding up numbers based on positions and alignments of pieces."  Except that those numbers are enabling it to literally make "a subtle judgment based on experience", a judgment we have no real choice but to call "smart".


Progress marches on.  At least one of the previous generation of chess engines (Komodo) has incorporated ideas from AlphaZero [Leela has open-sourced the neural network approach wholesale].  It looks like the resulting hybrid isn't dominant, at least not yet, but it does play very strong chess, and plays in a completely different, more human, style from the conventional version.  That's interesting all by itself.

Saturday, October 6, 2018

Debugging servers and bodies

For quite a while I've wanted to write up some thoughts about the nature of cause and effect.  In actually trying to do so, I realized two things: First, I don't know that much about the main streams of thought on this, which go back millennia, and second, that this was material for several posts.  Rather than step back and formulate an overarching structure for a series, or deep-dive into the philosophical literature, I decided to just start somewhere and call that Part I, with the intention of coming back to the subject, probably with unrelated posts in between, from time to time.  And then I realized that there was no need to call it Part anything.  I could just introduce a new tag, cause and effect, and apply it where necessary.  So here we go ...


I press keys on the keyboard and words appear on the screen.  It seems pretty clear that one is the cause of the other.  I take a cough suppressant and my cough subsides.  Quite likely it subsided because of the medicine, but maybe I was getting better anyway.  I run an ad online and sales go up.  But they've been going up for months.  Did they go up more because of the ad?  (half of your advertising budget is wasted; the trick is to know which half).  I wear a lucky shirt and my team wins.  I may want to think the two are related, but realistically it's just a bit of fun.

I spend a lot of my time debugging, trying to figure out why something isn't working the way it was expected to, whether it's why the TV isn't working at home or why a server crashed at work.  If I'm trying to figure out what's wrong with something electronic at home, I generally just turn things off and on until everything's back in a sane state.  That's usually fine at home, but not such a good idea at work.  Even if restarting the server fixes the problem, you still want to know why, so next time you won't have to restart it.

How do you tell if one thing caused another?  Debugging is much older than computing.  For example, the practice of debugging human health, that is, medicine, developed a useful paradigm in 1884, known as Koch's postulates after Robert Koch, for determining whether a particular microorganism causes a particular disease:
  1. The microorganism must be found in abundance in all organisms suffering from the disease, but should not be found in healthy organisms.
  2. The microorganism must be isolated from a diseased organism and grown in pure culture.
  3. The cultured microorganism should cause disease when introduced into a healthy organism.
  4. The microorganism must be reisolated from the inoculated, diseased experimental host and identified as being identical to the original specific causative agent.
Koch actually abandoned the "should not be found" part of postulate 1 after discovering that some people could carry a disease without showing symptoms.

Abstracting a little bit with an eye toward applying these postulates to an ailing server, one might say
  1. The conditions you think are causing the problem should be present in affected servers but ideally not in healthy ones.  For example, the unhealthy servers are all deployed in sector A and the healthy ones aren't or, more commonly, the unhealthy ones are running one version of the code while the healthy ones are running the previous version.
  2. This is a bit tricky, but I think it boils down to: There has to be a well-defined way to induce the conditions you think are causing the problem.  For example, you can start a new instance of a server in sector A or update a healthy server to the new version you think is buggy.  It isn't always immediately obvious how to do this.  For example, if you think that the problem is some sort of bad request coming from random parts of the internet, you'll probably have to search through logs to find requests that correspond with problems.
  3. If you trigger the conditions, the problem occurs.
  4. Again, this doesn't apply directly, but in this context I think it means double-checking for evidence that the conditions you thought were triggered really were triggered.  When you brought up the test server and it fell over, was it actually in sector A?  When you sent the query-of-death you found in the logs, did the server that fell over actually log receiving it just before it fell over?
The kind of double-checking in postulate 4 is crucial in real debugging.  It's very common, for example, to think you restarted a server with a new setting that should cause or fix a given problem, only to find that you restarted a different instance, or accidentally restarted it with the old configuration instead of the new.  For example, as I was writing this paragraph I realized that the command I thought would send a problem request to an unwell server I'd been debugging had actually failed, explaining why I saw no evidence of the request having been handled.

There's also a distinction, in both medicine and software, between fixing the problem at hand -- curing the patient or getting the service back online -- and pinning down exactly what happened.  In my business a common course of action is to roll the production servers back to the last configuration that was known to work, then use a test setup to try to reproduce the problem without impacting the production system.  The ultimate goal is a "red test" that fails with the buggy code and then passes ("goes green") with the fix.

In medicine, as I understand it, the work of isolating causes and developing vaccines and drugs similarly goes on in a laboratory environment until everyone is quite certain that the proposed treatment will be safe, and hopefully effective, in real patients.  In the mean time, doctors mostly do their best with known treatments.

While Koch's postulates are fairly famous, the kind of thing you remember from high school biology years later, they're not actually what modern medicine goes by, just like modern economists don't consult The Wealth of Nations, influential though it was.  One more modern approach can be found in Hill's criteria, a set of nine criteria for determining if a given cause is responsible for a given effect, but there are many other, more recent paradigms.

Notably, Hill's criteria and its modern cousins are not nearly so crisp as Koch's postulates.  The very name "postulate" suggests that you can obtain a rigorous proof, while "criteria" suggests something more indirect: if you don't meet the criteria, then you don't have cause and effect.  The criteria themselves are of the form "more of this suggests causality is more likely", and the end result is an idea of the probability that something is causing something else.

As in many other areas, switching from a yes/no answer to a probability solves a lot of problems, particularly the problem of gray areas where there are reasons to say either yes or no.  It does so, however, at the cost of being able to say for certain "X caused Y".  In my world, you very often can say with confidence "The tweaks we made to parsing in change number 271828 caused the server to reject these requests", but in my world we have a high degree of control of the system.  I can roll the server back to just before and after change number 271828 and run it in a test environment where I can control exactly what data it's trying to parse (or just write a "unit test" that exercises the problem code directly without spinning up a server).
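
As a concrete, entirely made-up illustration of that last point, here's what such a red test might look like.  parse_request stands in for whatever the change broke; the test reproduces the problem directly, failing against the buggy code and passing once the parser accepts such requests again:

    import unittest

    def parse_request(raw: str) -> dict:
        """Deliberately buggy stand-in for the parsing code touched by the change."""
        key, _, value = raw.partition("=")
        if not value:                      # the "tweak": empty values are now rejected
            raise ValueError("missing value")
        return {key: value}

    class ParseRegressionTest(unittest.TestCase):
        def test_accepts_empty_value(self):
            # Red against the buggy parser above; goes green once the fix lands.
            self.assertEqual(parse_request("foo="), {"foo": ""})

    if __name__ == "__main__":
        unittest.main()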

In the field of medicine, and much of the scientific world, however, that's generally not the case.  If we're trying to determine whether eating carrots for years causes freckles, we can't really make people eat carrots for years and count their freckles every week for the duration.  Medicine doesn't, and shouldn't, have the same level of control over patients as I have over a server.  That is, there's less control over the possible causes.

There's typically also less certainty about the effects.  Lots of people get freckles whether or not they eat carrots, so you need more subtle statistical techniques to see if there's even a correlation between carrot eating and freckle getting.  This sort of thing is a major reason that it's generally not a good idea to pin too much on any single medical study, even if it's careful with the data and its interpretation.

Nonetheless, medicine advances.  Some of this is because research has pinned down some causes and effects to a good degree of certainty.  There's no doubt that vaccines were effective against smallpox and continue to be effective against other diseases, albeit not always perfectly.  There's no doubt that antibiotics can be effective against bacterial infections, or that some bacteria have evolved defenses against them.  There's no realistic doubt that smoking causes a number of ill effects, up to and including lung cancer and emphysema.

But medicine is useful even in the absence of certainty.  If there's an 80% chance you've got condition X and the treatment is 90% effective with a 5% chance of major side effects, and condition X is 95% fatal if left untreated, you should probably go for the treatment.  If the treatment is 5% effective with a 90% chance of major side effects, and condition X is almost never fatal, you probably don't want to.  You don't need absolute certainty to make a decision like this, or even to know the exact causes and effects involved.
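
For what it's worth, the arithmetic behind that first scenario is just a couple of multiplications, under the crude simplifying assumptions that "effective" means the condition goes away entirely and that side effects are a separate risk:

    def risk_of_death(p_condition, p_treatment_works, p_fatal_untreated, treat):
        """Chance of dying from condition X under a deliberately crude model."""
        p_condition_remains = p_condition * ((1 - p_treatment_works) if treat else 1.0)
        return p_condition_remains * p_fatal_untreated

    # 80% chance of condition X, treatment 90% effective, 95% fatal untreated.
    print(risk_of_death(0.80, 0.90, 0.95, treat=True))    # ~0.076, plus the 5% risk of side effects
    print(risk_of_death(0.80, 0.90, 0.95, treat=False))   # ~0.76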