Intermittent Conjecture: July 2019

Saturday, July 27, 2019

Do neural networks have a point of view?

As someone once said, figures don't lie, but liars do figure.

In other words, just because something's supported by cold numbers doesn't mean it's true. It's always good to ask where the numbers came from. By the same token, though, you shouldn't distrust anything with numbers behind it, just because numbers can be misused. The breakdown is more or less:

If you hear "up" or "down" or "a lot" or anything that implies numbers, but they're aren't any numbers behind it, you really don't know if it's true or not, or whether it's significant.
If you hear "up X%" or "down Y%" or -- apparently a popular choice -- "up a whopping Z%" and you don't know where the numbers came from, you still don't really know if it's true or not. Even if they are correct, you don't know whether they're significant.
If you hear "up X%, according to so-and-so", then the numbers are as good as so-and-so's methodology. If you hear "down Y%, vs. Z% for last quarter", you at least have a basis for comparison, assuming you otherwise trust the numbers.
In all, it's a bit of a pain to figure all this out. Even trained scientists get it wrong more than we might think (I don't have numbers on this and I'm not saying it happens a lot, but it's not zero).
No one has time to do all the checking for more than a small subset of things we might be interested in, so to a large extent we have to trust other people to be careful. This largely comes down to reputation, and there are a number of cognitive biases in the way of evaluating that objectively.
But at least we can try to ignore blatantly bad data, and try to cross-check independent sources (and check that they're actually independent), and come up with a rough, provisional picture of what's really going on. If you do this continually over time the story should be pretty consistent, and then you can worry about confirmation bias.
(Also, don't put much stock in "record high" numbers or "up (a whopping) 12 places in the rankings", but that's a different post).

I'm not saying we're in some sort of epistemological nightmare, where no one has any idea what's true and what's not, just that objectivity is more a goal to aim towards rather than something we can generally expect to achieve.

So what does any of this amateur philosophizing have to do with neural networks?

Computers have long been associated with objectivity. The stramwan idea that "it came from a computer" is the same as "it's objectively true" probably never really had any great support, but a different form, I think, has quite a bit of currency, even to the point of becoming an implicit assumption. Namely, that computers evaluate objectively.

"Garbage in, garbage out," goes the old saying, meaning a computed result is only as good as the input it's given. If you say the high temperature in Buenos Aires was 150 degrees Celsius yesterday and -190 Celsius today, a computer can duly tell you the average high was -20 Celsius and the overall high was 150 Celsius, but that doesn't mean that Buenos Aires has been having, shall we say, unusual weather lately. It just means that you gave garbage data to a perfectly good program.

The implication is that if you give a program good data, it will give you a good result. That's certainly true for something simple, like calculating averages and extremes. It's less certain when you have some sort of complicated, non-linear model with a bunch of inputs, some of which affect the output more than others. This is why modeling weather takes a lot of work. There are potential issues with the math behind the model (does it converge under reasonable conditions?), the realization of that model on a computer (are we properly accounting for rounding error?) the particular settings of the parameters (how well does it predict weather that we already know happened?). There are plenty of other factors. This is just scratching the surface.

A neural network is exactly a complicated, non-linear model with a bunch of inputs, but without the special attention paid to the particulars. There is some general assurance that the tensor calculations that relate the input to the output are implemented accurately, but the real validation comes from treating the whole thing as a black box and seeing what outputs it produces from test inputs. There are well-established techniques for ensuring this is done carefully, for example using different datasets for training the network and for testing how well the network really performs, but at the end of the day the network is only as good as the data it was given.

This is similar to "Garbage in, Garbage out," but with a slightly different wrinkle. A neural net trained on perfectly accurate data and given perfectly accurate input can still produce bad results, if the context of the training data is too different from that of the input it was asked to evaluate.

If I'm developing a neural network for assessing home values, and I train and test it on real estate in the San Francisco Bay area, it's not necessarily going to do well evaluating prices in Toronto or Albuquerque. It might, because it might do a good job of taking values of surrounding properties into account and adjusting for some areas being more expensive than others, but there's no guarantee. Even if there is some sort of adjustment going on, it might be thrown off by any number of factors, whether housing density, the local range of variation among homes or whatever else.

The network, in effect, has a point of view based on what we might as well call its experience. This is a very human, subjective way to put it, but I think it's entirely appropriate here. Neural networks are specifically aimed at simulating the way actual brains work, and one feature of actual brains is that their point of view depends to a significant degree on the experience they've had. To the extent that neural networks successfully mimic this, their evaluations are, in a meaningful way, subjective.

There have been some widely-reported examples of neural networks making egregiously bad evaluations, and this is more or less why. It's not (to my knowledge) typically because the developers are acting in bad faith, but because they failed to assemble a suitably broad set of data for training and testing. This gave the net, in effect, a biased point of view.

This same sort of mistake can and does occur in ordinary research with no neural networks involved. A favorite example of mine is drawing conclusions about exoplanets based on the ones we've detected so far. These skew heavily toward large, fast-moving planets, because for various reasons those are much easier to detect. A neural network trained on currently known exoplanets would have the same skew built in (unless the developers were very careful, and quite likely even then), but you don't need a neural network to fall prey to this sort of sampling bias. From my limited sample, authors of papers at least try to take it into account, authors of magazine articles less so and headline writers hardly at all.

Sunday, July 14, 2019

Computer chess: where now?

In an effort to wrap up this thread, at least for a while, here's an attempt to recap some of the major points and to conjecture about what might be next, and what this might tell us about chess, and intelligence in general.

Since playing perfect chess appears intractable, we have a classic tradeoff: give up on perfect play in order to fit our actual strategy into our limited resources. Chess theory, combined with practice to develop pattern recognition and calculation, is optimized for the human brain. Chess theory is just shorthand for "patterns and rules that players have discovered over the centuries." These days, that very much includes discoveries by computer players.

For computers (at least the Von Neumann-style processors currently in commercial use), there are now two main options:

exhaustive search to a limited depth combined with alpha/beta pruning to avoid exploring moves that can't be as good as moves already considered, using an explicit set of rules to evaluate whether a position is good or bad (known as AB, after the alpha/beta part)
a completely opaque neural net evaluation of positions combined with a limited randomized search of possible variations (known as NN, for the neural net part), though you'll also see mention of MCTS (Monte Carlo Tree Search), referring to the randomized search.

There are some hybrid approaches that combine explicit rules with random search, but one way or another there has to be a tradeoff of performance for limited resources.

It's probably worth repeating that NN engines still consider far more possible continuations than humans can. The numbers for human, NN and AB are roughly dozens, hundreds of thousands and billions respectively. We can assume that those last two numbers are going to increase over time, and the first one isn't. Moore's law may (or may not) be tailing off, but there will still be improvements in software and bigger piles of hardware in the future. There could also be breakthroughs like some sort of quantum chess computer that can examine vastly larger numbers of possible positions, in which case all bets are off.

It's interesting explore what human, AB and NN players do and don't have in common. One common factor is the importance of what's traditionally called "positional" play, that is, taking into account factors beyond tactical concerns about winning material or forcing a quick checkmate. Tactics are still important, but as the level of play becomes higher it becomes more and more important to consider factors that influence the longer-term course of the game, factors like pawn structure, space, initiative and so forth.

Positional play is interesting from a computing standpoint because it's not the primary objective of the game -- that would be checkmate -- or even the secondary objective -- generally considered to be winning material to make it easier to checkmate, or using the threat of winning material to force the opponent into a weaker position. The positional elements that have been developed over the centuries aren't readily obvious consequences of the rules. They are human constructs aimed at making an intractable problem tractable. Heuristics, in other words. In broad terms, all three kinds of players are using the same approach -- rules of thumb plus calculation -- just in different mixes and with different rules of thumb.

It seems significant, then, that computers have, in effect, rediscovered positional factors in their play, even without having them explicitly programmed in. Deep Blue beat Kasparov, by Kasparov's own assessment, by outplaying Kasparov positionally. The surprise was that it did this with only basic knowledge of what makes a position good or bad and that the rest emerged from looking at large numbers of possible positions, much further ahead than a human could look.

Similarly, in their training phases, NN engines like Alpha Zero learn to play good moves without any obvious tactical benefit -- and indeed some that can look like blunders or at least unnecessarily risky at first glance -- without any cues beyond what happened to win in training games. They do seem to produce more than their share of "wild" positions, and "unorthodox" moves, but even then their play can generally be described in terms like "an unusually aggressive pawn push to initiate a kingside attack" or "a positional sacrifice to obtain more active pieces", and not (except in endgames, it seems) "no idea ... looks like it picked that move at random".

Maybe that just means that we have enough terms for things going on in chess that at least some of them are bound to apply to any particular move or position, but even when engines play moves that flout previously accepted theory, such cases are probably the memorable exception. In a great many positions each of the three types of player would find the others' preferred moves perfectly reasonable. They might well disagree over which exact move is best, but in most cases their evaluations will be largely similar. Typically they will all rule out the great majority of possible moves as inferior, and they will largely agree on which moves are inferior. For that matter, AB engines and human grandmasters have been known to flout the same rules. It's not just NN that can throw a curve ball from time to time.

All this is to say that if three radically different approaches to chess-playing can reach a reasonable consensus in discarding almost all possible moves and that their ranking among the best moves agrees reasonably well -- though by no means perfectly -- with accepted chess theory, then most likely there is something to accepted chess theory. Call it an emergent property of the rules.

So where do we go from here? Can we combine the strengths of the three approaches, roughly speaking high-level planning for humans, calculation for AB engines and positional assessment for NN engines? This next bit is more speculation than the nice summary I was hoping for, but here goes:

It might seem almost self-evident that if we knew the full details of how grandmasters see the board then computers would be able to do even better, but it's not like that hasn't been tried. Chess engine developers have been working with top-level human players for decades, with good results, but not always with the best results. Stockfish, which currently rules the AB world, benefits from a large distributed team making continual tweaks to the algorithm, with only the best making it into the next release. Contributions can come from anyone. I'm sure a lot of contributors are also strong chess players, but it's not a prerequisite.

Alpha Zero, of course, dispenses with outside expertise entirely. Other NN engines try to incorporate outside knowledge, but the ones that don't, notably Alpha Zero and its open-source cousin LC0, seem to do best.

Humans clearly do things differently from computers, but as I've said previously, it's quite possible that this is because these things work for humans but not necessarily for computers. My guess is that trying to model the way humans formulate plans when playing chess is not going to lead to stronger chess engines. At this point, both kinds of engines have a reasonable substitute for human-style conscious planning, namely calculating possible continuations en masse. That's a good fit for computers, and there's no strong reason to think that trying to emulate humans instead would be a better fit.

Even NN engines, which focus mainly on evaluating particular positions, benefit from calculation. While NN engines may look downright clueless in endgames (even when they win), they look considerably less clueless at looser time controls, that is, when they are doing more calculation. Looking far enough ahead appears indistinguishable from planning, even if that's not how we humans do it.

While the original Alpha Zero work was done in a closed environment and (to my knowledge) hasn't been directly replicated by outside researchers, it's generally understood that Alpha Zero benefitted from using tensor processing chips that sped up its evaluation (one reason the Stockfish team argued that the Alpha Zero match results weren't a fair comparison), and that for that reason it would likely beat LC0, which runs on stock hardware. Again, NN with faster, and thus more, calculation beats NN without it.

On the other hand, one big reason that NN engines got so much attention was that they didn't just win, but they did it in a noticeably different style. The classic example is the "fawn pawn" (apparently a mishearing of "thorn pawn"), which is an advanced pawn, typically on the sixth rank, that can't be attacked by the defender's pawn, for example a black pawn on h3 when white has castled kingside and played g3.

A strong human player would likely look at such a position and think "Yeah ... that doesn't look good for white ... ", but NN players demonstrated just how bad it is, and therefore why it can be worth playing for even if you have to give up material or it doesn't look good to expose your own king and push a pawn that might have to be supported by pieces instead of other pawns. The NN games also gave some insight as to when to push such a pawn.

The ideal combination of current AB and NN approaches would evaluate positions as well as NNs but as fast as ABs. It's not out of the question, in fact I'd conjecture that it's likely, that AB engines can incorporate some of the knowledge that NNs have turned up directly into their evaluation functions. The resulting code wouldn't have to capture everything the NN's network does, just enough to avoid losing to strategies like fawn pawn attacks, by skewing just enough toward positions that avoid them.

AB engine developers should probably be careful with this, though. Some of a NN's unorthodox behavior represents what we might as well call new insight, but some of it will be due to overfitting or plain old random noise. Nor does an AB developer have to understand exactly how the NN's evaluation works. There's a very direct way of seeing what an NN's evaluation function is doing: play games against it and see what it does.

Because an NN is mostly evaluating individual positions, its play gives a good view into how it does that. A hypothetical perfect player might make totally surprising moves because it can see every variation, but that's not how NNs work. They play surprising moves because the evaluation function gives those moves good ratings.

Another option that comes to mind would be to use neural networks not for playing chess directly, but for tweaking the parameters of traditional chess engines. I'm not sure this is a great idea, though. One major lesson of both types of engine, in their own ways, is that it's risky to build in assumptions -- not necessarily doomed to failure, but risky.

If you're tweaking parameters in an AB engine -- e.g., how much a rook on an open file is worth in relation to a pawn -- you're tuning weights in a system with a bunch of specialized, not-necessarily-linear nodes. In a standard NN you're tuning weights in a (relatively) straightforward multilinear system. It's not clear that an NN built for tweaking weights in a multilinear system would do a good job tweaking the weights on an AB's specialized evaluation function. It might, but it might not.

Looking at it from a more NN perspective, you could try to speed up evaluation by noticing this or that pattern of weights and replacing the generalized tensor calculations with ad-hoc functions that happen to work for those particular weights. For example if the weights are "sparse", that is, most of them are zero, you can write out an explicit formula that combines the terms with non-zero weights.

In theory, you might come up with something that resembles some of the customary rules of thumb. Maybe weights involving a particular kind of piece tend to emphasize empty squares that the piece could move to (open lines, mobility), or pairs of squares the piece could attack if it moved a certain way, and which hold enemy pieces (forks). Again, it's not clear that this would help. It might just get in the way of further improvements, if it turns out that the zero weights that made things sparse really shouldn't have been zero.

If all we know is that neural nets are combining basic features of a chess position according to some magic combination of weights, then all we can say is "there's something that current explicit evaluation functions don't seem to be capturing." If that's how it stays, then we might expect neural networks to outperform engines like Stockfish in the long run. They're already at least comparable, and they haven't been around for nearly as long.

It's quite possible, though, that a neural network's evaluation function is capturing a handful of quantifiable factors that current explicit evaluation functions aren't capturing. In that case, an engine like Stockfish with an updated evaluation function informed by NN behavior should have the upper hand. It could choose promising positions as well as, or about as well as, a neural net, but it would examine them much more efficiently.

It's not clear how much room is left for neural networks to improve. Playing winning chess is just plain hard. Training a network for twice as long generally doesn't produce twice as good a result. It may be that network-based evaluation of individual positions is about as good as it's going to get.

For that matter, though, neither does looking at twice as many moves, or even twice as deep (which should roughly square the number of moves examined), make an AB engine twice as strong. Currently AB engines can examine billions of positions and around twenty plies deep (twenty levels of move/countermove, or ten full moves) in a typical midgame position. Looking at trillions of positions would mean looking more like thirty plies ahead. That ought to yield some sort of improvement, but how much? Most high-level games are draws in any case. As players get stronger, it becomes harder to measure their strength, at least insofar as it takes more games to get enough decisive results.

It's interesting that AB and NN engines appear to be more or less evenly matched. This may be evidence that we're reaching the limits of what computer chess players can do. Or it may not.

In the discussion above, I talk mostly about evaluation functions, weights on numbers and so forth, and not so much about "planning" or "understanding" or "insight". This was a deliberate choice. Chess engines are nothing more or less than computer programs. It's fascinating that they can clearly display some aspects that we associate with intelligence, while clearly lacking others. Neural networks don't fundamentally change this -- they're still just doing calculations -- but they do seem to capture another chunk of what brains do, namely fuzzy pattern matching.

Much discussion of developments like AlphaZero and similar neural network-based programs focuses on what we're learning about algorithms, and in particular how to harness "deep neural networks" (that is, multi-stage neural networks) to solve problems that are widely agreed to require intelligence.

That's fair enough, but in that process we're also learning about the problems themselves. If intelligence is the ability to solve certain kinds of problems, then solving problems by machine tells us something about what it means to be intelligent. While it's natural to want to move the goal posts and try to narrow "intelligence" down so it continues to mean "behavior we can't emulate reasonably well with machines," I argue in the first of the previous posts, that's probably a losing game. The winning game, I think, is gaining a better understanding of various capabilities that we tend to throw under the blanket term "intelligence".

How strong are computer chess players, and how far are they from perfect chess?

Playing a perfect game of chess appears to be an intractable problem. Certainly the total number of possible positions is far too large for a direct tabulation of the best move for each possible position to be feasible. There are far more possible chess positions than subatomic particles in the observable universe.

To be sure, almost all of these positions are extremely unlikely to appear in any reasonable game of chess, much less a perfectly played one, so you should be able to ignore them. All you would really need to know in order to play perfectly is what the best first move is, what the best second move is for every reply, and so forth. Since most possible replies are not good moves, this might thin out enough that it would be possible to store everything in a large database and/or write general rules that will cover large numbers of possible positions. In other words, it might (or might not) be feasible to write down a perfect strategy if we could find it. But we're nowhere close to finding it.

Nonetheless, there has been a lot of progress in the decades since Deep Blue beat Kasparov. It's now quite clear that computers can play better than the best humans, to the point that it's a bit hard to say exactly how much better computers are. There are rating numbers that imply that, say, Stockfish would beat Magnus Carlsen X% of the time, but they probably shouldn't be taken as anything more than an estimate. We can say that X is probably somewhere in the 90s, but that's about it.

Chess rating systems are generally derived from the Elo system (named after Arpad Elo), which tries to quantify playing strength as a single number based on players' records against each other. Two equally-rated players should have roughy equal numbers of wins and losses against each other, plus however many draws. As the rating difference increases, the stronger player should win more and more often.

Ratings are recalculated in light of actual results. If two equally-rated players draw, nothing will change, but if a low-rated player draws against a higher-rated one, the higher-rated player will lose points and the lower-rated player will win points. Likewise, the winner of a game will gain points and the loser will lose points, but you get more points for beating a higher-rated player and you lose more for losing to a lower-rated player.

Over time, this will tend to give a good picture of who's better, and how much better. If the parameters of the rating formula are tuned well, the difference in rating points will give a pretty good prediction of winning percentages. It's interesting in itself that reducing chess skill to a single number on a linear scale works as well as it does -- see this post for more than you probably want to read about that. The point here, though, is that to get useful ratings you need a pool of players all playing against each other.

You don't need a full round-robin of thousands of players, of course, but the results you do have need to be reasonably interconnected. If you're an average club player, you probably won't be playing many games against the world's top ten, but the strongest players in your club may well have played people who've played them, and in any case there will be multiple reasonably short paths connecting you to them.

To make this work, everyone, even at the top, needs to have players of comparable rating level to play against. This is important because rating differences are more meaningful between relatively evenly-matched players. If player A beats player B twelve times, draws five times and loses four times out of twenty games, then player A is probably about 200 points stronger than player B. If it's nineteen wins and one draw, we know player A is several hundred points stronger, but we can't say much more than that with confidence. If it's 20-0, the formula says that A is 800 points stronger, but it could really be any number from 800 on up.

If you're a club player, playing a 20-game match against Magnus Carlsen will only tell you something about your rating if you manage to at least draw at least one game. For most of us, that's not particularly likely, and for more or less the same reason, that tournament isn't likely to happen at all. In real life, people only have so much time, and professional players are only going to play seriously against other professionals.

The same holds true for humans and computers. People, including grandmasters, do play computers a lot, but generally not under controlled match conditions. Since even the best human player is likely to be on the wrong end of a 19-1-0 blowout with a top computer player in a tournament setting, there's little reason to pay a top human player to take time out of their schedule for an exhausting and demoralizing match that won't tell us much more than we already know.

Probably a better approach would be to dial back the parameters on a few well-known engines, with a mix of AB and NN, to produce a set of standard players at, say, 200 rating-point intervals -- 2800, 3000, 3200 ... . It would probably be much easier to get human players to play against the "near human-level" players, knowing they might be able to win bragging rights over their friends. Some of the top human players should even be able to beat the 2800-rated players. The human-level computer players could thus be calibrated against actual humans, and those could be calibrated against the stronger computer players.

Calibrating the top computer players against the standard computer players would take a lot of games, but that's not a problem for computers. In the last computer tournament I watched, there were hundreds of games, with dozens between each pair in the round-robin phase, and more among the finalists. Outside of tournaments, a group called CCRL plays computer players against each other under standard conditions specifically to rate them.

If only because of the sheer amount of human players, we can be reasonably confident that human players are reasonably accurately rated with respect to each other, and because of the sheer number of games played between computers, we can be similarly sure that their ratings reflect their relative strengths. If computer tournaments included a few human-calibrated players, then we would have a better idea just how strong the strongest players were, compared to humans.

As it is, the two sets of ratings may well be drifting apart. In the last computer tournament I watched, the bottom player -- which, to be sure, lost quite heavily -- had a rating in the 2900s. Carlsen is currently in the high 2800s. In theory, Carlsen would probably lose to that computer player, but would probably manage a few draws and quite possibly a win or two. In practice, that match is probably not going to happen.

While it would be interesting to get more detail on ratings across humans and computers, it doesn't really change the story of "computers can beat humans easily" and by itself it doesn't shed much light on the limits of how well chess can be played. An Elo-style rating system doesn't have a built-in maximum rating. There is, in theory, a best-possible player, but we don't know how strong that player is or, put another way, we don't know how close the current best engines are to playing perfect chess or where they would rate with respect to a perfect player.

It is interesting, though, that stronger players, both human and computer, tend to draw more against each other than weaker players. Amateur-level games are often decisive because one player or the other blunders (or, more accurately, blunders more than the other player does). Higher-level human players blunder less, and chess engines don't blunder at all, or at least the notion of "blunder" for an engine is more on the order of "didn't realize that having an enemy pawn that far advanced on the kingside could lead to a devastating attack". The finals of the last computer tournament I watched consisted almost entirely of draws.

While it's entirely possible that perfectly-played chess is a win for white (or, who knows, a win for black), and the top engines just aren't finding the right moves, it seems more likely to me that perfectly played chess is a draw and engines are basically getting better at not losing, while being able to ruthlessly exploit imperfect play when it does happen.

If this is the case then it's quite possible that ratings will top out at a point where decisive games become exceedingly rare. A plausible match result might be 1-0 with 999 draws. If rules required a certain margin of victory, that might essentially be a matter of flipping coins until there are are that many more heads than tails or tails than heads. The "law of large numbers" doesn't forbid this, it just says that you'll need a lot of coin flips to get there.

We could well get to a point where a match between top players ends in a score of 1439-1429 with 50,726 draws or something like that. At that point writing chess engines becomes similar to cranking out more and more digits of pi -- interesting to a small group of people, and useful in stress-testing systems and tools, but no longer of general interest, even among geeks.

The players would probably still not be playing perfect chess, but it seems plausible that they would nearly hold their own against perfect play. As draws become more common, ratings should start to bunch together as a result. If player A's record against player B is 12-0 with 123,765 draws, their ratings will be essentially identical, even if A is playing perfectly and B is only almost-perfect.

Working backward from this, if top ratings are in fact starting to bunch together, you could extrapolate toward a theoretical maximum rating that they're approaching. A player's actual rating is then an estimate of how often they would lose, as opposed to drawing, against perfect play. I wouldn't be surprised if someone's already done a regression based on a scenario like this.