Tuesday, October 29, 2019

More on context, tool use and such

In the previous post I claimed that (to paraphrase myself fairly loosely) whether we consider behaviors that technically look like "learning", "planning", "tool use" or such to really be those things has a lot to do with context.  A specially designed robot that can turn a door handle and open the door is different from something that sees a door handle slightly out of reach, sees a stick on the ground, bends the end of the stick so it can grab the door handle and proceeds to open the door by using the stick to turn the handle and then to poke the door open.  In both cases a tool is being used to open a door, but we have a much easier time calling the second case "tool use".  The robot door-opener is unlikely to exhibit tool use in the second case.

With that in mind, it's interesting that the team that produced the hide-and-seek AI demo is busily at work on using their engine to play a Massively Multiplayer Online video game.  They argue at length, and persuasively, that this is a much harder problem than chess or go.  While the classic board games may seem harder to the average person than mere video games, from a computing perspective MMOs are night-and-day harder in pretty much every dimension:
  • You need much more information to describe the state of the game at any particular point (the state space is much larger).  A chess or go position can be described in well under 100 bytes.  To describe everything that's going on at a given moment in an MMO takes more like 100,000 bytes (about 20,000 "mostly floating point" numbers)
  • There are many more choices at any given point (the action space is much larger).  A typical chess position has a few dozen possible moves.  A typical go position may have a couple hundred.  In a typical MMO, a player may have around a thousand possible actions at a particular point, out of a total repertoire of more than 10,000.
  • There are many more decisions to make, in this case running at 30 frames per second for around 45 minutes, or around 80,000 "ticks" in all.  The AI only observes every fourth tick, so it "only" has to deal with 20,000 decision points.  At any given point, an action might be trivial or might be very important strategically.  Chess games are typically a few dozen moves long.  A go game generally takes fewer than 200 (though the longest possible go game is considerably longer).  While some moves are more important than others in board games, each requires a similar amount and type of calculation.
  • Players have complete information about the state of a chess or go game.  In MMOs, players can only see a small part of the overall universe.  Figuring out what an unseen opponent is up to and otherwise making inferences from incomplete data is a key part of the game.
Considered as a context, an MMO is, more or less by design, much more like the kind of environment that we have to plan, learn and use tools in every day.  Chess and go, by contrast, are highly abstract, limited worlds.  As a consequence, it's much easier to say that something that looks like it's planning and using tools in an MMO really is planning and using tools in a meaningful sense.

It doesn't mean that the AI is doing so the same way we do, or at least may think we do, but that's for a different post.

Tool use, planning and AI

A recent story in MIT Technology Review carries the headline AI learned to use tools after nearly 500 million games of hide and seek, and the subhead OpenAI’s agents evolved to exhibit complex behaviors, suggesting a promising approach for developing more sophisticated artificial intelligence.  This article, along with several others, is based on a blog post on OpenAI's site.  While the article is a good summary of the blog post, the blog post is just as readable while going into somewhat more depth and technical detail.  Both the article and the blog post are well worth reading, but as always the original source should take precedence.

There is, as they say, quite a bit to unpack here, and before I'm done this may well turn into another Topic That Ate My Blog.  At the moment, I'm interested in two questions:
  • What does this work say about learning and intelligence in general?
  • To what extent or in what sense do terms like "tool use" and "planning" describe what's going on here?
My answers to both questions changed significantly between reading the summary article and reading the original blog post.

As always, lurking behind stories like this are questions of definition, in particular, what do we mean by "learning", "planning" and "tool use"?  There have been many, many attempts to pin these down, but I think for the most part definitions fall into two main categories, which I'll call internal and external here.  Each has its advantages and drawbacks.

By internal definition I mean an attempt to formalize the sort of "I know it when I do it" kind of feeling that a word like learning might trigger.  If I learn something, I had some level of knowledge before, even if that level was zero, and after learning I could rattle off a new fact or demonstrate a new skill.  I can say "today I learned that Madagascar is larger than Iceland" or "today I learned how to bake a soufflĂ©".

If I talk about planning, I can say "here's my plan for world domination" (like I'd actually tell you about the robot army assembling itself at ... I've said too much) or "here's my plan for cleaning the house".  If I'm using a tool, I can say "I'm going to tighten up this drawer handle with a Phillips screwdriver", and so forth.  The common thread here is a conscious understanding of something particular going on -- something learned, a plan, a tool used for a specific purpose.

This all probably seems like common sense, and I'd say it is.  Unfortunately, common sense is not that helpful when digging into the foundations of cognition, or, perhaps, of anything else interesting.  We don't currently know how to ask a non-human animal to explain its thinking.  Neither do we have a particularly good handle on how a trained neural network is arriving at the result it does.  There may well be something encoded in the networks that control the hiders and seekers in the simulation, which we could point at and call "intent", but my understanding is we don't currently have a well-developed method for finding such things (though there has been progress).

If we can't ask what an experimental subject is thinking, then we're left with externally visible behavior.  We define learning and such in terms of patterns of behavior.  For example, if we define success at a task by some numerical measure, say winning percentage at hide and seek, we can say that learning is happening when behavior changes and the winning percentage increases in a way that can't be attributed to chance (in the hide-and-seek simulation, the percentage would tilt one way or another as each side learned new strategy, but this doesn't change the basic argument).

This turns learning into a pure numerical optimization problem: find the weights on the neurons that produce the best winning percentage.  Neural-network training algorithms are literally doing just such an optimization.  Networks in the training phase are certainly learning by this definition, but not in the sense that we learn by studying a text or going to a lecture.  I suspect that most machine learning researchers are fine with that, and might also argue that studying and lectures are not a large part of how we learn overall, just the part we're most conscious of as learning per se.
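
To make the "learning is just optimization" point concrete, here's a toy sketch in Python.  The win_rate function is a made-up stand-in for "play a batch of games with these weights and report the winning percentage"; real training uses gradient methods over millions of weights, but the shape of the loop is the same: tweak the weights, keep the tweak if the measured score improves.

```python
import random

def win_rate(weights):
    """Stand-in for 'play a batch of games with these weights and
    measure the winning percentage'.  Here it's just a made-up
    function with a peak at weights (0.3, -0.7)."""
    a, b = weights
    return 1.0 / (1.0 + (a - 0.3) ** 2 + (b + 0.7) ** 2)

def learn(steps=1000, step_size=0.1):
    weights = [0.0, 0.0]          # start with arbitrary weights
    best = win_rate(weights)
    for _ in range(steps):
        # propose a small random tweak to the weights
        candidate = [w + random.uniform(-step_size, step_size) for w in weights]
        score = win_rate(candidate)
        if score > best:          # keep the tweak only if the measured
            weights, best = candidate, score    # winning percentage improves
    return weights, best

if __name__ == "__main__":
    w, score = learn()
    print("learned weights:", w, "win rate:", round(score, 3))
```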

This tension between our common understanding of learning and the workings of things that can certainly appear to be learning goes right to why an external definition (more or less what we call an operational definition) can feel so unsatisfying.  Sure, the networks look like they're learning, but how do we know they're really learning?

The simplest answer to that is that we don't.  If we define learning as optimizing a numerical value, then pretty much anything that does that is learning.  If we define learning as "doing things that look to us like learning", then what matters is the task, not the mechanism.  Learning to play flawless tic-tac-toe might be explained away as "just optimizing a network" while learning to use a ramp to peer over the wall of a fort built by a group of hiders sure looks an awful lot like the kind of learning we do -- even though the underlying mechanism is essentially the same.

I think the same reasoning applies to tool use: Whether we call it tool use or not depends on how complex the behavior appears to be, not on the simple use of an object to perform a task.  I remember reading about primates using a stick to dig for termites being described as tool use and thinking "yeah, but not really".  But why not, exactly?  A fireplace poker is a tool.  A barge pole is a tool.  Why not a termite stick?  The only difference, really, is the context in which they are used.  Tending a fire or guiding a barge happens in the midst of several other tools and the actions we take with them, however simple in the case of a fireplace and andirons.  It's probably this sense of the tool use being part of a larger, orchestrated context that makes our tool use seem different.  By that logic, tool use is really just a proxy for being able to understand larger, multi-part systems.

In my view this all reinforces the point that "planning", "tool use" and such are not binary concepts.  There's no one point at which something goes from "not using tools" to "using tools", or if there is, the dividing line has to be fairly arbitrary and therefore not particularly useful.  If "planning" and "tool use" are proxies for "behaving like us in contexts where we consider ourselves to be planning and using tools", then what matters is the behavior and the context.  In the case at hand, our hiders and seekers are behaving a lot like we would, and doing it in a context that we would certainly say requires planning and intelligence.

As far as internal and external definitions go, it seems we're looking for contexts where our internal notions apply well.  In such contexts we have much less trouble saying that behavior that fits an external definition of "tool use", "planning", "learning" or whatever is compatible with those notions.

Saturday, July 27, 2019

Do neural networks have a point of view?

As someone once said, figures don't lie, but liars do figure.

In other words, just because something's supported by cold numbers doesn't mean it's true.  It's always good to ask where the numbers came from.  By the same token, though, you shouldn't distrust anything with numbers behind it, just because numbers can be misused.  The breakdown is more or less:
  • If you hear "up" or "down" or "a lot" or anything that implies numbers, but they're aren't any numbers behind it, you really don't know if it's true or not, or whether it's significant.
  • If you hear "up X%" or "down Y%" or -- apparently a popular choice -- "up a whopping Z%" and you don't know where the numbers came from, you still don't really know if it's true or not.  Even if they are correct, you don't know whether they're significant.
  • If you hear "up X%, according to so-and-so", then the numbers are as good as so-and-so's methodology.  If you hear "down Y%, vs. Z% for last quarter", you at least have a basis for comparison, assuming you otherwise trust the numbers.
  • In all, it's a bit of a pain to figure all this out.  Even trained scientists get it wrong more than we might think (I don't have numbers on this and I'm not saying it happens a lot, but it's not zero).
  • No one has time to do all the checking for more than a small subset of things we might be interested in, so to a large extent we have to trust other people to be careful.  This largely comes down to reputation, and there are a number of cognitive biases in the way of evaluating that objectively.
  • But at least we can try to ignore blatantly bad data, and try to cross-check independent sources (and check that they're actually independent), and come up with a rough, provisional picture of what's really going on.  If you do this continually over time the story should be pretty consistent, and then you can worry about confirmation bias.
  • (Also, don't put much stock in "record high" numbers or "up (a whopping) 12 places in the rankings", but that's a different post).
I'm not saying we're in some sort of epistemological nightmare, where no one has any idea what's true and what's not, just that objectivity is more a goal to aim towards rather than something we can generally expect to achieve.


So what does any of this amateur philosophizing have to do with neural networks?

Computers have long been associated with objectivity.  The strawman idea that "it came from a computer" is the same as "it's objectively true" probably never really had any great support, but a different form, I think, has quite a bit of currency, even to the point of becoming an implicit assumption.  Namely, that computers evaluate objectively.

"Garbage in, garbage out," goes the old saying, meaning a computed result is only as good as the input it's given.  If you say the high temperature in Buenos Aires was 150 degrees Celsius yesterday and -190 Celsius today, a computer can duly tell you the average high was -20 Celsius and the overall high was 150 Celsius, but that doesn't mean that Buenos Aires has been having, shall we say, unusual weather lately.  It just means that you gave garbage data to a perfectly good program.

The implication is that if you give a program good data, it will give you a good result.  That's certainly true for something simple, like calculating averages and extremes.  It's less certain when you have some sort of complicated, non-linear model with a bunch of inputs, some of which affect the output more than others.  This is why modeling weather takes a lot of work.  There are potential issues with the math behind the model (does it converge under reasonable conditions?), the realization of that model on a computer (are we properly accounting for rounding error?) and the particular settings of the parameters (how well does it predict weather that we already know happened?).  There are plenty of other factors.  This is just scratching the surface.

A neural network is exactly a complicated, non-linear model with a bunch of inputs, but without the special attention paid to the particulars.  There is some general assurance that the tensor calculations that relate the input to the output are implemented accurately, but the real validation comes from treating the whole thing as a black box and seeing what outputs it produces from test inputs.  There are well-established techniques for ensuring this is done carefully, for example using different datasets for training the network and for testing how well the network really performs, but at the end of the day the network is only as good as the data it was given.
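
As a small illustration of the "different datasets for training and testing" point, here's a minimal holdout-split sketch in plain Python.  The train_model and accuracy calls in the usage comment are hypothetical placeholders for whatever network and metric you're actually using; the point is just that the numbers you report come from data the network never saw during training.

```python
import random

def holdout_split(examples, test_fraction=0.2, seed=42):
    """Shuffle the examples and split them into a training set and a
    held-out test set that the network never sees during training."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1.0 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical usage -- train_model() and accuracy() stand in for
# whatever training procedure and metric you actually use:
#
#   train_set, test_set = holdout_split(all_labeled_examples)
#   model = train_model(train_set)
#   print("held-out accuracy:", accuracy(model, test_set))
```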

This is similar to "Garbage in, Garbage out," but with a slightly different wrinkle.  A neural net trained on perfectly accurate data and given perfectly accurate input can still produce bad results, if the context of the training data is too different from that of the input it was asked to evaluate.

If I'm developing a neural network for assessing home values, and I train and test it on real estate in the San Francisco Bay area, it's not necessarily going to do well evaluating prices in Toronto or Albuquerque.  It might, because it might do a good job of taking values of surrounding properties into account and adjusting for some areas being more expensive than others, but there's no guarantee.  Even if there is some sort of adjustment going on, it might be thrown off by any number of factors, whether housing density, the local range of variation among homes or whatever else.

The network, in effect, has a point of view based on what we might as well call its experience.  This is a very human, subjective way to put it, but I think it's entirely appropriate here.  Neural networks are specifically aimed at simulating the way actual brains work, and one feature of actual brains is that their point of view depends to a significant degree on the experience they've had.  To the extent that neural networks successfully mimic this, their evaluations are, in a meaningful way, subjective.

There have been some widely-reported examples of neural networks making egregiously bad evaluations, and this is more or less why.  It's not (to my knowledge) typically because the developers are acting in bad faith, but because they failed to assemble a suitably broad set of data for training and testing.  This gave the net, in effect, a biased point of view.


This same sort of mistake can and does occur in ordinary research with no neural networks involved.  A favorite example of mine is drawing conclusions about exoplanets based on the ones we've detected so far.  These skew heavily toward large, fast-moving planets, because for various reasons those are much easier to detect.  A neural network trained on currently known exoplanets would have the same skew built in (unless the developers were very careful, and quite likely even then), but you don't need a neural network to fall prey to this sort of sampling bias.  From my limited sample, authors of papers at least try to take it into account, authors of magazine articles less so and headline writers hardly at all.

Sunday, July 14, 2019

Computer chess: where now?


In an effort to wrap up this thread, at least for a while, here's an attempt to recap some of the major points and to conjecture about what might be next, and what this might tell us about chess, and intelligence in general.

Since playing perfect chess appears intractable, we have a classic tradeoff: give up on perfect play in order to fit our actual strategy into our limited resources.  Chess theory, combined with practice to develop pattern recognition and calculation, is optimized for the human brain.  Chess theory is just shorthand for "patterns and rules that players have discovered over the centuries."  These days, that very much includes discoveries by computer players.

For computers (at least the Von Neumann-style processors currently in commercial use), there are now two main options:
  • exhaustive search to a limited depth combined with alpha/beta pruning to avoid exploring moves that can't be as good as moves already considered, using an explicit set of rules to evaluate whether a position is good or bad (known as AB, after the alpha/beta part)
  • a completely opaque neural net evaluation of positions combined with a limited randomized search of possible variations (known as NN, for the neural net part), though you'll also see mention of MCTS (Monte Carlo Tree Search), referring to the randomized search.
There are some hybrid approaches that combine explicit rules with random search, but one way or another there has to be a tradeoff of performance for limited resources.
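
For anyone who hasn't seen it, here's a bare-bones sketch of the alpha/beta idea in Python.  The evaluate and children functions are hypothetical placeholders for an engine's evaluation function and move generator (real engines add move ordering, transposition tables and much more); the pruning happens wherever a branch is abandoned because the opponent already has a better option elsewhere.

```python
def alphabeta(node, depth, alpha, beta, maximizing, evaluate, children):
    """Minimal alpha/beta search sketch.
    evaluate(node) -> score from the maximizing side's point of view
    children(node) -> positions reachable in one move (empty if game over)
    """
    kids = children(node)
    if depth == 0 or not kids:
        return evaluate(node)          # leaf: fall back on the evaluation function
    if maximizing:
        best = float("-inf")
        for child in kids:
            best = max(best, alphabeta(child, depth - 1, alpha, beta, False,
                                       evaluate, children))
            alpha = max(alpha, best)
            if alpha >= beta:          # the opponent already has a better option
                break                  # elsewhere, so this branch can't matter
        return best
    else:
        best = float("inf")
        for child in kids:
            best = min(best, alphabeta(child, depth - 1, alpha, beta, True,
                                       evaluate, children))
            beta = min(beta, best)
            if alpha >= beta:
                break
        return best
```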

It's probably worth repeating that NN engines still consider far more possible continuations than humans can.  The numbers for human, NN and AB are roughly dozens, hundreds of thousands and billions respectively.  We can assume that those last two numbers are going to increase over time, and the first one isn't.  Moore's law may (or may not) be tailing off, but there will still be improvements in software and bigger piles of hardware in the future.  There could also be breakthroughs like some sort of quantum chess computer that can examine vastly larger numbers of possible positions, in which case all bets are off.

It's interesting to explore what human, AB and NN players do and don't have in common.  One common factor is the importance of what's traditionally called "positional" play, that is, taking into account factors beyond tactical concerns about winning material or forcing a quick checkmate.  Tactics are still important, but as the level of play becomes higher it becomes more and more important to consider factors that influence the longer-term course of the game, factors like pawn structure, space, initiative and so forth.

Positional play is interesting from a computing standpoint because it's not the primary objective of the game -- that would be checkmate -- or even the secondary objective -- generally considered to be winning material to make it easier to checkmate, or using the threat of winning material to force the opponent into a weaker position.  The positional elements that have been developed over the centuries aren't readily obvious consequences of the rules.  They are human constructs aimed at making an intractable problem tractable.  Heuristics, in other words.  In broad terms, all three kinds of players are using the same approach -- rules of thumb plus calculation -- just in different mixes and with different rules of thumb.

It seems significant, then, that computers have, in effect, rediscovered positional factors in their play, even without having them explicitly programmed in.  Deep Blue beat Kasparov, by Kasparov's own assessment, by outplaying Kasparov positionally.  The surprise was that it did this with only basic knowledge of what makes a position good or bad and that the rest emerged from looking at large numbers of possible positions, much further ahead than a human could look.

Similarly, in their training phases, NN engines like Alpha Zero learn to play good moves without any obvious tactical benefit -- and indeed some that can look like blunders or at least unnecessarily risky at first glance -- without any cues beyond what happened to win in training games.  They do seem to produce more than their share of "wild" positions, and "unorthodox" moves, but even then their play can generally be described in terms like "an unusually aggressive pawn push to initiate a kingside attack" or "a positional sacrifice to obtain more active pieces", and not (except in endgames, it seems) "no idea ... looks like it picked that move at random".

Maybe that just means that we have enough terms for things going on in chess that at least some of them are bound to apply to any particular move or position, but even when engines play moves that flout previously accepted theory, such cases are probably the memorable exception.  In a great many positions each of the three types of player would find the others' preferred moves perfectly reasonable.  They might well disagree over which exact move is best, but in most cases their evaluations will be largely similar.  Typically they will all rule out the great majority of possible moves as inferior, and they will largely agree on which moves are inferior.   For that matter, AB engines and human grandmasters have been known to flout the same rules.  It's not just NN that can throw a curve ball from time to time.

All this is to say that if three radically different approaches to chess-playing can reach a reasonable consensus in discarding almost all possible moves and that their ranking among the best moves agrees reasonably well -- though by no means perfectly -- with accepted chess theory, then most likely there is something to accepted chess theory.  Call it an emergent property of the rules.


So where do we go from here?  Can we combine the strengths of the three approaches, roughly speaking high-level planning for humans, calculation for AB engines and positional assessment for NN engines?  This next bit is more speculation than the nice summary I was hoping for, but here goes:

It might seem almost self-evident that if we knew the full details of how grandmasters see the board then computers would be able to do even better, but it's not like that hasn't been tried.  Chess engine developers have been working with top-level human players for decades, with good results, but not always with the best results.  Stockfish, which currently rules the AB world, benefits from a large distributed team making continual tweaks to the algorithm, with only the best making it into the next release.  Contributions can come from anyone.  I'm sure a lot of contributors are also strong chess players, but it's not a prerequisite.

Alpha Zero, of course, dispenses with outside expertise entirely.  Other NN engines try to incorporate outside knowledge, but the ones that don't, notably Alpha Zero and its open-source cousin LC0, seem to do best.

Humans clearly do things differently from computers, but as I've said previously, it's quite possible that this is because these things work for humans but not necessarily for computers.  My guess is that trying to model the way humans formulate plans when playing chess is not going to lead to stronger chess engines.  At this point, both kinds of engines have a reasonable substitute for human-style conscious planning, namely calculating possible continuations en masse.  That's a good fit for computers, and there's no strong reason to think that trying to emulate humans instead would be a better fit.

Even NN engines, which focus mainly on evaluating particular positions, benefit from calculation.  While NN engines may look downright clueless in endgames (even when they win), they look considerably less clueless at looser time controls, that is, when they are doing more calculation.  Looking far enough ahead appears indistinguishable from planning, even if that's not how we humans do it.

While the original Alpha Zero work was done in a closed environment and (to my knowledge) hasn't been directly replicated by outside researchers, it's generally understood that Alpha Zero benefitted from using tensor processing chips that sped up its evaluation (one reason the Stockfish team argued that the Alpha Zero match results weren't a fair comparison), and that for that reason it would likely beat LC0, which runs on stock hardware.  Again, NN with faster, and thus more, calculation beats NN without it.

On the other hand, one big reason that NN engines got so much attention was that they didn't just win, but they did it in a noticeably different style.  The classic example is the "fawn pawn" (apparently a mishearing of "thorn pawn"), which is an advanced pawn, typically on the sixth rank, that can't be attacked by the defender's pawns, for example a black pawn on h3 when white has castled kingside and played g3.

A strong human player would likely look at such a position and think "Yeah ... that doesn't look good for white ... ", but NN players demonstrated just how bad it is, and therefore why it can be worth playing for even if it means giving up material, exposing your own king, or pushing a pawn that might have to be supported by pieces instead of other pawns.  The NN games also gave some insight as to when to push such a pawn.

The ideal combination of current AB and NN approaches would evaluate positions as well as NNs but as fast as ABs.  It's not out of the question, in fact I'd conjecture  that it's likely, that AB engines can incorporate some of the knowledge that NNs have turned up directly into their evaluation functions.  The resulting code wouldn't have to capture everything the NN's network does, just enough to avoid losing to strategies like fawn pawn attacks, by skewing just enough toward positions that avoid them.

Some of an NN's unorthodox behavior represents what we might as well call new insight, but some of it will be due to overfitting or plain old random noise.  Nor does an AB developer have to understand exactly how the NN's evaluation works.  There's a very direct way of seeing what an NN's evaluation function is doing: play games against it and see what it does.

Because an NN is mostly evaluating individual positions, its play gives a good view into how it does that.  A hypothetical perfect player might make totally surprising moves because it can see every variation, but that's not how NNs work.  They play surprising moves because the evaluation function gives those moves good ratings.

Another option that comes to mind would be to use neural networks not for playing chess directly, but for tweaking the parameters of traditional chess engines.  I'm not sure this is a great idea, though.  One major lesson of both types of engine, in their own ways, is that it's risky to build in assumptions -- not necessarily doomed to failure, but risky.

If you're tweaking parameters in an AB engine -- e.g., how much a rook on an open file is worth in relation to a pawn -- you're tuning weights in a system with a bunch of specialized, non-linear nodes.   In a standard NN you're tuning weights in a (relatively) straightforward multilinear system.  It's not clear that an NN built for tweaking weights in a multilinear system would do a good job tweaking the weights on an AB's specialized evaluation function.  It might, but it might not.

Looking at it from a more NN perspective, you could try to speed up evaluation by noticing this or that pattern of weights and replacing the generalized tensor calculations with ad-hoc functions that happen to work for those particular weights.  For example if the weights are "sparse", that is, most of them are zero, you can write out an explicit formula that combines the terms with non-zero weights.
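
Here's a toy version of that idea in Python.  This is purely illustrative (a real network layer is a large tensor operation, not a ten-element list), but it shows how a generic weighted sum can collapse into an explicit formula over just the non-zero terms.

```python
# Generic evaluation term: a weighted sum over all input features.
def dense_score(features, weights):
    return sum(f * w for f, w in zip(features, weights))

# If most weights are zero, precompute which ones matter...
weights = [0.0, 0.0, 1.5, 0.0, 0.0, 0.0, -2.0, 0.0, 0.0, 0.25]
nonzero = [(i, w) for i, w in enumerate(weights) if w != 0.0]

# ...and replace the general formula with an explicit one that only
# touches those terms (here, features 2, 6 and 9).
def sparse_score(features):
    return sum(features[i] * w for i, w in nonzero)
```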

In theory, you might come up with something that resembles some of the customary rules of thumb.  Maybe weights involving a particular kind of piece tend to emphasize empty squares that the piece could move to (open lines, mobility), or pairs of squares the piece could attack if it moved a certain way, and which hold enemy pieces (forks).  Again, it's not clear that this would help.  It might just get in the way of further improvements, if it turns out that the zero weights that made things sparse really shouldn't have been zero.

If all we know is that neural nets are combining basic features of a chess position according to some magic combination of weights, then all we can say is "there's something that current explicit evaluation functions don't seem to be capturing."  If that's how it stays, then we might expect neural networks to outperform engines like Stockfish in the long run.  They're already at least comparable, and they haven't been around for nearly as long.

It's quite possible, though, that a neural network's evaluation function is capturing a handful of quantifiable factors that current explicit evaluation functions aren't capturing.  In that case, an engine like Stockfish with an updated evaluation function informed by NN behavior should have the upper hand.  It could choose promising positions as well as, or about as well as, a neural net, but it would examine them much more efficiently.

It's also not clear how much room is left for neural networks to improve.  Playing winning chess is just plain hard.  Training a network for twice as long generally doesn't produce twice as good a result.  It may be that network-based evaluation of individual positions is about as good as it's going to get.

For that matter, though, neither does looking at twice as many moves, or even twice as deep (which should roughly square the number of moves examined), make an AB engine twice as strong.  Currently AB engines can examine billions of positions and around twenty plies deep (twenty levels of move/countermove, or ten full moves) in a typical midgame position.  Looking at trillions of positions would mean looking more like thirty plies ahead.  That ought to yield some sort of improvement, but how much?  Most high-level games are draws in any case.
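
As a rough back-of-the-envelope check, using the figures just quoted and assuming the effective branching factor stays about the same as the search gets deeper:

```python
import math

positions_now = 1e9      # "billions of positions"
depth_now = 20           # "around twenty plies deep"

# Effective branching factor implied by those figures.
b = positions_now ** (1.0 / depth_now)          # about 2.8

# Depth reachable with a thousand times more positions.
positions_future = 1e12
depth_future = math.log(positions_future) / math.log(b)

print(f"effective branching factor ~ {b:.1f}")
print(f"depth at a trillion positions ~ {depth_future:.0f} plies")  # roughly 27
```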

It's interesting that AB and NN engines appear to be more or less evenly matched.  This may be evidence that we're reaching the limits of what computer chess players can do.  Or it may not.


In the discussion above, I talk mostly about evaluation functions, weights on numbers and so forth, and not so much about "planning" or "understanding" or "insight".  This was a deliberate choice.  Chess engines are nothing more or less than computer programs.  It's fascinating that they can clearly display some aspects that we associate with intelligence, while clearly lacking others.  Neural networks don't fundamentally change this -- they're still just doing calculations -- but they do seem to capture another chunk of what brains do, namely fuzzy pattern matching.

Much discussion of developments like AlphaZero and similar neural network-based programs focuses on what we're learning about algorithms, and in particular how to harness "deep neural networks" (that is, multi-stage neural networks) to solve problems that are widely agreed to require intelligence.

That's fair enough, but in that process we're also learning about the problems themselves.  If intelligence is the ability to solve certain kinds of problems, then solving problems by machine tells us something about what it means to be intelligent.  While it's natural to want to move the goal posts and try to narrow "intelligence" down so it continues to mean "behavior we can't emulate reasonably well with machines," as I argued in the first of the previous posts, that's probably a losing game.  The winning game, I think, is gaining a better understanding of various capabilities that we tend to throw under the blanket term "intelligence".

How strong are computer chess players, and how far are they from perfect chess?

Playing a perfect game of chess appears to be an intractable problem.  Certainly the total number of possible positions is far too large for a direct tabulation of the best move for each possible position to be feasible, and the number of possible games is far larger than the number of subatomic particles in the observable universe.

To be sure, almost all of these positions are extremely unlikely to appear in any reasonable game of chess, much less a perfectly played one, so you should be able to ignore them.  All you would really need to know in order to play perfectly is what the best first move is, what the best second move is for every reply, and so forth.  Since most possible replies are not good moves, this might thin out enough that it would be possible to store everything in a large database and/or write general rules that will cover large numbers of possible positions.  In other words, it might (or might not) be feasible to write down a perfect strategy if we could find it.  But we're nowhere close to finding it.

Nonetheless, there has been a lot of progress in the decades since Deep Blue beat Kasparov.  It's now quite clear that computers can play better than the best humans, to the point that it's a bit hard to say exactly how much better computers are.  There are rating numbers that imply that, say, Stockfish would beat Magnus Carlsen X% of the time, but they probably shouldn't be taken as anything more than an estimate.  We can say that X is probably somewhere in the 90s, but that's about it.

Chess rating systems are generally derived from the Elo system (named after Arpad Elo), which tries to quantify playing strength as a single number based on players' records against each other.  Two equally-rated players should have roughly equal numbers of wins and losses against each other, plus however many draws.  As the rating difference increases, the stronger player should win more and more often.

Ratings are recalculated in light of actual results.  If two equally-rated players draw, nothing will change, but if a low-rated player draws against a higher-rated one, the higher-rated player will lose points and the lower-rated player will win points.  Likewise, the winner of a game will gain points and the loser will lose points, but you get more points for beating a higher-rated player and you lose more for losing to a lower-rated player.
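
For the curious, the arithmetic behind this is straightforward.  Here's a sketch of the standard Elo update in Python; the exact constants vary from one rating body to another, and K = 32 is just a common choice for illustration.

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B under the Elo model (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating_a, rating_b, score_a, k=32):
    """Return A's and B's new ratings given A's actual score
    (1 = win, 0.5 = draw, 0 = loss)."""
    e_a = expected_score(rating_a, rating_b)
    rating_a += k * (score_a - e_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - e_a))
    return rating_a, rating_b

# A 2400 player drawing a 2200 player loses points; the 2200 player gains them.
print(update(2400, 2200, 0.5))   # roughly (2391.7, 2208.3)
```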

Over time, this will tend to give a good picture of who's better, and how much better.  If the parameters of the rating formula are tuned well, the difference in rating points will give a pretty good prediction of winning percentages.  It's interesting in itself that reducing chess skill to a single number on a linear scale works as well as it does -- see this post for more than you probably want to read about that.  The point here, though, is that to get useful ratings you need a pool of players all playing against each other.

You don't need a full round-robin of thousands of players, of course, but the results you do have need to be reasonably interconnected.  If you're an average club player, you probably won't be playing many games against the world's top ten, but the strongest players in your club may well have played people who've played them, and in any case there will be multiple reasonably short paths connecting you to them.

To make this work, everyone, even at the top, needs to have players of comparable rating level to play against.  This is important because rating differences are more meaningful between relatively evenly-matched players.  If player A beats player B twelve times, draws five times and loses four times out of twenty-one games, then player A is probably about 140 points stronger than player B.  If it's nineteen wins and one draw, we know player A is several hundred points stronger, but we can't say much more than that with confidence.  If it's 20-0, the formula says that A is 800 points stronger, but it could really be any number from 800 on up.

If you're a club player, playing a 20-game match against Magnus Carlsen will only tell you something about your rating if you manage to draw at least one game.  For most of us, that's not particularly likely, and for more or less the same reason, that match isn't likely to happen at all.  In real life, people only have so much time, and professional players are only going to play seriously against other professionals.

The same holds true for humans and computers.   People, including grandmasters, do play computers a lot, but generally not under controlled match conditions.  Since even the best human player is likely to be on the wrong end of a 19-1-0 blowout with a top computer player in a tournament setting, there's little reason to pay a top human player to take time out of their schedule for an exhausting and demoralizing match that won't tell us much more than we already know. 

Probably a better approach would be to dial back the parameters on a few well-known engines, with a mix of AB and NN, to produce a set of standard players at, say, 200 rating-point intervals -- 2800, 3000, 3200 ... .  It would probably be much easier to get human players to play against the "near human-level" players, knowing they might be able to win bragging rights over their friends.  Some of the top human players should even be able to beat the 2800-rated players.  The human-level computer players could thus be calibrated against actual humans, and those could be calibrated against the stronger computer players.

Calibrating the top computer players against the standard computer players would take a lot of games, but that's not a problem for computers.  In the last computer tournament I watched, there were hundreds of games, with dozens between each pair in the round-robin phase, and more among the finalists.  Outside of tournaments, a group called CCRL plays computer players against each other under standard conditions specifically to rate them.

If only because of the sheer number of human players, we can be reasonably confident that human players are accurately rated with respect to each other, and because of the sheer number of games played between computers, we can be similarly sure that their ratings reflect their relative strengths.  If computer tournaments included a few human-calibrated players, then we would have a better idea just how strong the strongest players were, compared to humans.

As it is, the two sets of ratings may well be drifting apart.  In the last computer tournament I watched, the bottom player -- which, to be sure, lost quite heavily -- had a rating in the 2900s.  Carlsen is currently in the high 2800s.  In theory, Carlsen would probably lose to that computer player, but would probably manage a few draws and quite possibly a win or two.  In practice, that match is probably not going to happen.

While it would be interesting to get more detail on ratings across humans and computers, it doesn't really change the story of "computers can beat humans easily" and by itself it doesn't shed much light on the limits of how well chess can be played.  An Elo-style rating system doesn't have a built-in maximum rating.  There is, in theory, a best-possible player, but we don't know how strong that player is or, put another way, we don't know how close the current best engines are to playing perfect chess or where they would rate with respect to a perfect player.

It is interesting, though, that stronger players, both human and computer, tend to draw more against each other than weaker players.  Amateur-level games are often decisive because one player or the other blunders (or, more accurately, blunders more than the other player does).  Higher-level human players blunder less, and chess engines don't blunder at all, or at least the notion of "blunder" for an engine is more on the order of "didn't realize that having an enemy pawn that far advanced on the kingside could lead to a devastating attack".  The finals of the last computer tournament I watched consisted almost entirely of draws.

While it's entirely possible that perfectly-played chess is a win for white (or, who knows, a win for black), and the top engines just aren't finding the right moves, it seems more likely to me that perfectly played chess is a draw and engines are basically getting better at not losing, while being able to ruthlessly exploit imperfect play when it does happen.

If this is the case then it's quite possible that ratings will top out at a point where decisive games become exceedingly rare.  A plausible match result might be 1-0 with 999 draws.  If rules required a certain margin of victory, that might essentially be a matter of flipping coins until there are that many more heads than tails or tails than heads.  The "law of large numbers" doesn't forbid this, it just says that you'll need a lot of coin flips to get there.

We could well get to a point where a match between top players ends in a score of 1439-1429 with 50,726 draws or something like that.  At that point writing chess engines becomes similar to cranking out more and more digits of pi -- interesting to a small group of people, and useful in stress-testing systems and tools, but no longer of general interest, even among geeks.

The players would probably still not be playing perfect chess, but it seems plausible that they would nearly hold their own against perfect play.  As draws become more common, ratings should start to bunch together as a result.  If player A's record against player B is 12-0 with 123,765 draws, their ratings will be essentially identical, even if A is playing perfectly and B is only almost-perfect.
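
To put a number on that: under the standard Elo formula, the rating difference implied by a match score is 400 times log10(p/(1-p)), where p is the fraction of points scored with draws counting as half.  For a record of 12 wins and 123,765 draws, p is just barely above one half, so the implied gap is a tiny fraction of a point:

```python
import math

wins, losses, draws = 12, 0, 123_765
games = wins + losses + draws
p = (wins + 0.5 * draws) / games                 # fraction of points scored

diff = 400 * math.log10(p / (1 - p))             # Elo difference implied by p
print(f"score fraction = {p:.6f}, implied rating gap = {diff:.3f} points")
# about 0.03 rating points -- effectively identical ratings
```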

Working backward from this, if top ratings are in fact starting to bunch together, you could extrapolate toward a theoretical maximum rating that  they're approaching.  A player's actual rating is then an estimate of how often they would lose, as opposed to drawing, against perfect play.  I wouldn't be surprised if someone's already done a regression based on a scenario like this.

Sunday, June 23, 2019

Grandmasters don't play twelve-dimensional chess

A couple of caveats before I start:

First, in case it's not clear by now, let me say explicitly that I'm not a grandmaster, or any kind of master, or even close to being any kind of chess master.  I have, however, watched a few play, browsed through master-level games and read articles and interviews on the subject, not to mention losing quite a few 5-minute games to master-level players and occasionally asking them to explain where I went wrong.

Second, while grandmaster is a title with very specific qualifications, reserved for the best of the best, I'm using it here as a stand-in for any strong player with the same general mindset and skills that I describe here, regardless of how well they actually do in tournament play.

With that out of the way, imagine a critical point in a cheesy movie:

"Welcome to my trap.  I knew you'd come here, just as I know that you will now try to use your secret radio transmitter to summon help, which is why I have taken the precaution of jamming all radio frequencies.  Don't even think about trying to cut off power to the jamming equipment.  It's encased in an unobtanium vault with battery power for the next three months.  Checkmate."

"Ah, my worthy opponent, I did not anticipate that.  But I did not need to.  There is no secret radio transmitter, only this aluminum box of curiously strong mints that I knew your scanning equipment would register as one.  No, I will not be needing a radio today.  I need only wait here and do nothing, while the new head of security you hired comes to my rescue.  Next time be more careful who you use for background checks.  You never know who they might be working for.  Don't bother summoning your bodyguards.  They're already safely locked away where they will do no harm to anyone."

"But ... but ... my plan was perfect.  I accounted for every last detail!"

Characters like this may be casually referred to as chess masters, but as far as I can tell this is not really how actual grandmasters play.

While it's true that many games are won or lost by one player catching something the other missed, it's not necessarily from thinking one move farther ahead, or from calculating every last possibility, but more typically from recognizing the importance of occupying a key square, or from opening or closing some particular line of attack, or knowing which rook to move onto a particular file.  This, in turn, generally comes down to who has the better plan.

There will certainly be points in a game where a grandmaster will pause and think deeply about moves, countermoves, counter-countermoves and so forth, and they can do this better than most players, but that's probably not the real distinguishing factor in their play.  Likewise, while tactics are certainly important in top-level games, the better tactician may or may not end up the winner.

One good illustration of this is the simultaneous exhibition, where a professional typically takes on a few dozen amateurs, moving from one to the next for each move (with occasional pauses for a longer exchange) and often winning every game.  If human chess were a matter of sheer calculation then winning every game would mean one person keeping all the possible continuations of 50 different games in mind, or being able to re-calculate them in the few seconds they usually spend at each board.

But of course this is not what's going on.  The pro is relying on experience, often looking for quick tactical wins by exploiting typical amateur mistakes.  They're still planning, just not in detail, maybe something on the order of "That knight is lined up with the king.  I can pin it, then bring my own knight to this square, put my queen on this diagonal and either attack the king directly or just win that rook if they don't notice I have a fork there.  Am I leaving any pieces hanging?  No.  Is my king exposed? No?  Great. On to the next one ..."

A human grandmaster can do this because they see the board not as a collection of pieces on squares but in a more structured way, something like a collection of patterns and possibilities.  Even if the full details of how this is done are elusive, there are hints, like the well-known experiments in which a grandmaster can reconstruct a position after a quick glance while a beginning player or a non-player takes much longer and makes many more mistakes.

Someone without much experience might think "there was a rook on this square, a pawn on this square ... or was it a bishop?" and so forth.  A grandmaster might think something more like "both sides have castled kingside, white's queen bishop is fianchettoed ..." or "this started out as such-and-such opening" or maybe even "this is like that game I lost to so-and-so but that rook is where I should have put it"

When it comes to playing over the board, a strong player will know many more possibilities to check for than a weaker one, and many more possible hazards to avoid.  One site I ran across gives a checklist of more than two dozen things to check for on every move, just to play a strong amateur game.  I have no doubt that this makes for stronger play, but I personally don't have the patience for this, or the experience for it to have become second nature, so it should come as no surprise that I'm not a strong player.

The most important factor, though, is planning.  If you just play likely-looking moves against a grandmaster, but without a clear plan, you'll almost certainly find yourself in deep trouble before long because they did have a plan.  That innocuous bishop move, that weird-looking pawn push and the shift of a rook from a reasonable-looking square to a different reasonable-looking square were all setting up the attack that's now making your life miserable.  If only you could move your king out of the way -- and that's what that bishop move was really about.

As I mentioned in a previous post, AB engines really do play the kind of insanely calculating game that we associate with the cliche chessmaster character, while NN engines do the same sort of pattern recognition that allows a grandmaster to size up a situation without having to do a lot of calculation, but neither is doing the sort of large-scale planning that a human grandmaster is.  It's also worth reiterating that, while even early chess engines were able to out-calculate human players, it took decades before they could consistently win games against them.

All that said, I think it would be a mistake to think that the planning grandmasters do is some sort of humans-only ability that computers could never emulate.  There's no reason a computer couldn't represent and execute a plan on the order of "Put your bishop on this diagonal, push these pawns and go for a kingside attack involving the bishop, rook and queen," without exhaustively checking every move and countermove.  It just hasn't proven to be an approach that computers can use to win games.

Wednesday, May 22, 2019

What do chess engines understand about chess?

[This post assumes a familiarity with terminology from previous posts.  Rather than try to gloss everything, I'm going to punt and respectfully ask the confused reader to at least skim through the previous few posts]

It's clear from watching a bunch of games that the two main types of chess engines have different styles, though not all that different.  It's not as though anyone suddenly decided 1. g4 was a great opening.  In broad strokes:

AB engines play very precisely and their evaluations of positions are fairly conservative.  It's quite common for an NN engine to give a large advantage to one side in a position that an AB engine rates as fairly even.  The ratings eventually come in line, but the convergence can happen in either direction.

If the NN engine is right, the AB eval comes up to meet it as the NN exploits an advantage that the AB engine missed.  If it's wrong, the NN eval comes down and the result is often a draw, either because the NN engine failed to play precisely enough to convert a winning advantage, or the winning advantage was never really there to begin with.  It would be interesting to track down statistics on this, and any interesting exceptions, but I don't have easy access to the data.

When NN engines first hit the scene, the initial take was that neural nets had captured some subtlety of positional understanding that the AB engines, with their relatively simple hard-coded evaluation rules couldn't fathom, even given the ability to look many moves ahead.  This is mostly accurate, I think, but incomplete.

In the Blitz Bonanza, there were any number of NN wins over AB engines, at least the weaker ones, that started with the advance variation of the French Defense (1. e4 e6 2. d4 d5, followed soon after by e5 for white).  This opens up a lot of space for white on the kingside and tends to push the black pieces over to the queenside -- or maybe "lure" is a better word, since there are plausible things for them to do there, including attacking white's pawn center.

Apparently none of this looks too bad to an AB engine.  Yes, white has an advanced pawn in the center, but there are chances to counterattack and weaken that pawn.  Once the white attack really gets going, often aggressively pushing the g and h pawns, its king may start to look correspondingly weak -- except black doesn't have any pieces well-located to attack it.  A typical AB engine's eval function doesn't try to account for that -- looking ahead it will either find an attack or it won't.

Probably this disconnect of white seeing good chances and black finding plusses and minuses to the same moves continues until black can see actual tactical threats coming, at which point it's toast since its pieces are now bottled up on the wrong side of the board.  The natural explanation is that the NN engine understands that there are good attacking prospects on the kingside and black's pieces are poorly placed to defend.

What's actually going on is probably a bit different.  In its training phase, the code that trains the NN found that positions with, say, more black pieces on the queenside, lots of open squares on the kingside, and pawns pushed up near the black king tend to lead to wins.  It didn't specifically note this as such, but those were some of the positions which -- in oversimplified terms -- caused some of the weights that favored those kinds of positions to be increased.

A human playing in the same style would probably say "My plan was to open up some space, get black's pieces bottled up on the queenside and launch a kingside attack".  But here there is no plan.  White is playing moves that tended to work well in similar situations in training.
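
Here's a heavily simplified cartoon of that kind of training signal, in Python.  This is not how the real systems work in detail (they use deep networks and a lot of reinforcement-learning machinery, and the features here are made up), but it shows the basic mechanism: weights drift toward whatever features of a position kept showing up in games that were eventually won.

```python
import math

def predict(weights, features):
    """Squash a weighted sum of position features into a 0..1 'winning chance'."""
    s = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-s))

def train_step(weights, features, outcome, lr=0.1):
    """Nudge the weights so positions like this one are scored closer to
    the game's actual outcome (1 = eventual win, 0 = eventual loss)."""
    p = predict(weights, features)
    return [w + lr * (outcome - p) * f for w, f in zip(weights, features)]

# Made-up features: [pieces on the queenside, open kingside squares,
# pawns advanced near the enemy king].  Positions with these features
# that came from won games pull the corresponding weights upward.
weights = [0.0, 0.0, 0.0]
training_games = [([3, 5, 2], 1), ([1, 1, 0], 0), ([4, 6, 3], 1)]
for features, outcome in training_games:
    weights = train_step(weights, features, outcome)
print(weights)
```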

As I mentioned in a previous post, this lack of a plan becomes blindingly obvious in endgames.  Many endgames have an almost mathematical feel to them.  White is trying to hold a draw.  Black is up a bishop for two pawns.  Picking off one of those pawns would allow one of black's own pawns to promote unimpeded.  But white's pawns are all on the wrong color square for the bishop to attack and they're blocked against black's pawns (and vice versa), so all white has to do is keep black's king away.

Conversely, though, if black can draw white's king away from where it's blocking black's king, it can sneak its own king through and pick up the undefended pawns.  And there just happens to be a tactical threat that will accomplish that ...

A human player can figure positions like this out logically and win or hold the draw, as the case may be, even against perfect play.  An AB engine can often look ahead far enough to find the winning variation.  An NN engine might or might not stumble on the right move.

Likewise, it's common for an NN engine to see a large advantage in a position that's clearly a draw.  Suppose in the previous scenario there's a way for white to safely shuffle its king back and forth between two squares and keep the black king from getting through.  Quite likely the AB engine will evaluate the position as dead even, and the human kibitzers will agree. The NN engine, if it's playing black in this example, will continue to see a sizable advantage almost up to the 50-move limit, even though all it's doing is rearranging its pieces in order to avoid threefold repetition.

If understanding chess means having a feel for what kind of positions have good or bad prospects without trying to calculate every possible variation, that is, playing like a human, then NN engines are clearly doing something similar.

On the other hand, if understanding chess means being able to find the right moves to win or draw a difficult endgame, that is, playing like a human, then AB engines clearly have the upper hand.

But if understanding chess means being able to reason about why an endgame is won, lost or drawn, or being able to put together a general plan of attack without knowing in advance exactly what moves need to happen in what order, that is, playing like a human, then neither type of engine is even trying.

What's chess to a computer?

It's now quite clear that computer chess is going to eat this blog for a while, so I'm in the process of unpacking several points that have come up in my earlier attempts to just make a few quick points and move on.  I'm also coming to realize that, since I'm having even more trouble than usual organizing this into a coherent whole, I'll probably end up with N posts worth of material spread across N+1 posts, if not more.  Apologies.  This particular post is an excursion into computer science land to look at computer chess from that point of view, rather than trying to understand it as chess.

Chess, fundamentally, falls into a large class of problems called tree searches, in which a collection of interconnected points is explored by proceeding from a starting point to the points connected directly to it, then to those connected directly to that first set, and so forth.  There are lots of different ways to do this, depending mostly on what order you do the visiting in.

Examples of tree searches, besides finding good chess moves, include figuring out the best way to route packets in a network, solving a puzzle where odd-shaped blocks have to be arranged to form a cube, finding the shortest path through a maze and following reflections and refractions to find what color a particular pixel in an image will be (ray-tracing).  There are many others.

Each point in a tree search is called a node.  In chess, the nodes are chess positions (including where the pieces are, which side is to move and a few other bits such as whether each side's king has moved).  At any point in a tree search you are either at a leaf node, meaning you have no further choices to make -- in chess, you have arrived at a win, loss or draw, or more often simply decided not to explore any deeper -- or not, in which case you have one or more child nodes to choose from.  In chess, the children of a position are all the positions that result from legal moves in the current position.
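In code, the whole family of tree searches boils down to a few lines.  This is just a generic sketch -- children(), is_leaf() and visit() are placeholders for whatever the particular problem supplies (legal chess moves, maze corridors, packet routes) -- but it's the skeleton that everything below elaborates on.

def depth_first_search(node, children, is_leaf, visit):
    """Generic tree search: do something at a node, then recurse into each child.

    children(node) returns the nodes reachable in one step, is_leaf(node) says
    whether there's anything further to explore, and visit(node) is whatever
    work we want to do at each node.  All three come from the problem at hand.
    """
    visit(node)
    if is_leaf(node):
        return
    for child in children(node):
        depth_first_search(child, children, is_leaf, visit)

Swapping the recursion for an explicit queue turns this depth-first order into breadth-first; most of the interesting variations are just different answers to "which node do I look at next?"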

Since there is no element of chance in chess, there is one and only one game tree that starts at the initial position (the root node) and includes every possible position resulting from every series of legal moves.  This means that in theory it's possible to know whether any position is a win for white, a win for black or a draw, assuming each side plays perfectly.  That includes the initial position, so in theory there is an optimal strategy for each player and all perfectly-played games of chess have the same result.

The problem is that the game tree for chess is incomprehensibly huge.  There are way, way more legal chess positions than fundamental particles in the known universe.  No computer will ever be able to examine more than a minuscule fraction of them.  That doesn't necessarily mean that chess will never be "solved", in the sense of proving what the optimal strategies are for each player and what the result will be with perfect play.  In theory there could be a way to rule out all but a tiny fraction of possible positions and exhaustively search what's left.  In practice, we're nowhere near being able to do that.

Instead, we're left with approximate approaches.  Either exhaustively search a tiny, tiny fraction of the possibilities, which is what AB engines do, or find some sort of approximate measure of what constitutes a "good" position and "just play good moves", using a partial search of the possible continuations to double-check that there aren't any obvious blunders.  This is what humans and NN engines do.



One thing that makes chess particularly interesting as a search problem is that it's not totally random, but it's not totally orderly either.

Imagine a search tree as a series of branching passageways.  At the end of every passageway, after however many branches, is a card reading "black wins", "white wins" or "draw".  If all you know is that the cards are there then you might as well pick randomly at each junction.  On the other hand, if you have a map of the entire complex of passageways and which card is at the end of each of them, you can find your way perfectly.  Depending on where the cards are, you might or might not be able to win against a perfect player, but if your opponent makes a mistake you'll know exactly how to take advantage.

In chess there are far too many passages to map out, but there are markings on the walls.  For example, there are several types of endgame positions which even a human can always play perfectly.  If you reach a position where you have a king and a rook and your opponent only has a king (and it's not a stalemate to start out with), you can win by following a pretty simple recipe.

There are also databases of positions ("tablebases") with known outcomes, found by exhaustively searching backwards from all possible checkmate positions.  Currently every endgame with seven or fewer pieces (including the kings) is known to be either a win for white, a win for black or a draw.  If a position shows up while working backward from one of the possible checkmates, it's a win for one side; if it doesn't, it's a draw (the process of working backwards from known positions is itself a tree search).
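The backwards search that builds a tablebase looks roughly like the sketch below.  It's only a sketch: all_positions, successors() and is_checkmate() are hypothetical placeholders, real generators track depth-to-mate and use heavy indexing and compression, and I'm glossing over details like the 50-move rule and the fact that captures lead into smaller, separately solved tablebases.

from collections import deque

def retrograde_solve(all_positions, successors, is_checkmate):
    """Label each position WIN or LOSS for the side to move; anything left
    unlabeled (stalemates and other dead ends) counts as a draw."""
    label = {}
    # Build the predecessor map by running the move generator forwards once.
    preds = {p: [] for p in all_positions}
    for p in all_positions:
        for q in successors(p):
            preds.setdefault(q, []).append(p)

    queue = deque()
    for p in all_positions:
        if is_checkmate(p):
            label[p] = "LOSS"      # the side to move has been checkmated
            queue.append(p)

    while queue:
        q = queue.popleft()
        for p in preds[q]:
            if p in label:
                continue
            if label[q] == "LOSS":
                # p has a move into a position lost for the opponent, so p is won.
                label[p] = "WIN"
                queue.append(p)
            elif all(label.get(r) == "WIN" for r in successors(p)):
                # Every move from p runs into a position won for the opponent.
                label[p] = "LOSS"
                queue.append(p)
    return label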

There are a few kinds of positions that are well-known to be immediate wins for one side or the other, regardless of what other pieces are on the board.  The classic example is a back-rank mate, where one side's king is blocked between the edge of the board and its own pawns and the other side has major pieces able to give check.  It's simple to calculate whether the defending side can capture all the pieces that can give check or can safely interpose (block).  If not, it's game over.  My understanding is that chess engines don't bother to special-case these, since they're typically looking several moves ahead anyway.

And then there's a lot of gray area.  If one side gains a material advantage, it will generally win, eventually, but this is far from an ironclad rule.  There are plenty of cases where one side will give up material (sacrifice) in order to gain an advantage.  This ranges from spectacular combinations where one side gives up a queen, or more, in order to get a quick checkmate, to "positional sacrifices" where one side gives up something small, generally a pawn, in order to get an initiative.  Whether a positional sacrifice will be worth it is usually not clear-cut, though in some well-known cases (for example, the Queen's gambit), it's generally agreed that the other side is better off not taking the material and should concentrate on other factors instead.

There are also temporary sacrifices where one side gives up material in order to quickly win back more.  If you're playing against a strong player and it looks like they've left a piece hanging, be very careful before taking it.

In short, which side has more material is a clue to which side will win, and it's often a good idea to try to win material, but it's not a guarantee.  This goes equally well for other generally accepted measures of how well one is doing.  Maybe it's worth it to isolate or double one of your pawns in order to gain an initiative, but often it's not.  Games between human masters often turn on whose estimations of which small trade-offs are correct.

From a computing point of view, since it's impossible to follow all the passageways, we have to rely on the markings on the wall, that is, the characteristics of particular positions.  An AB engine will look exhaustively at all the passages that it explores (subject to the "alpha/beta pruning" that gives them their name, which skips moves that can't possibly lead to better results than are already known).  But in most positions it will have to give up its search before it finds a definitive result.  In that case it has to rely on its evaluation function.  Basically it's saying "I've looked ten branches ahead and if I make this move the worst that can happen is I'll get to a position worth X".
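Here's roughly what that looks like as code -- a bare-bones, depth-limited alpha-beta search in the "negamax" style.  legal_moves(), make_move() and evaluate() are placeholders for what a real engine supplies, and real engines pile on move ordering, transposition tables, quiescence search and much more, so treat this as a sketch of the idea rather than anything like Stockfish.

def alpha_beta(position, depth, alpha, beta, legal_moves, make_move, evaluate):
    """Return the best score for the side to move, searching `depth` plies ahead.

    evaluate(position) is the static evaluation from the point of view of the
    side to move; alpha and beta are the pruning bounds.
    """
    moves = legal_moves(position)
    if depth == 0 or not moves:
        # A real engine would special-case checkmate and stalemate here.
        return evaluate(position)
    best = float("-inf")
    for move in moves:
        child = make_move(position, move)
        # Negamax: the child's score from the opponent's viewpoint, negated,
        # with the bounds flipped and negated as well.
        score = -alpha_beta(child, depth - 1, -beta, -alpha,
                            legal_moves, make_move, evaluate)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # the opponent would never allow this line: prune the rest
    return best

The "if alpha >= beta: break" line is the pruning that gives AB engines their name: once this branch is known to be at least as good for the side to move as something the opponent could already steer toward elsewhere, the opponent will never allow it, so the remaining moves can be skipped.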

Relying on an evaluation at a fixed depth is much better than picking randomly, but it only works well if you explore quite deeply.  A good human player can easily beat a computer that's only searching three moves (six plies) ahead, because the human can find something that looks good in the short term but turns bad a few moves later.  Once an AB engine is looking dozens of moves ahead, though, there are far fewer traps whose refutation lies deep enough for the search to miss it.

This is a lot like the problem of local maxima in hill-climbing algorithms.  If you're trying to get to the highest point in a landscape, but it's so dark or foggy you can only see a short distance around you, your best bet could well be to go uphill until you can't any more, even though that fails if you end up atop a small hill that's not actually the high point.  The further you can see, the better chance you have of finding your way to the true high point, even if you sometimes have to go downhill for a while to get on the right path.
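In code, the foggy-hillside version of this looks something like the sketch below.  The landscape is made up, but it shows the failure mode: start near the wrong bump and greedy climbing never finds the real peak.  Letting the climber examine points further away plays roughly the same role as a deeper search.

def hill_climb(x, height, step=0.1):
    """Greedily step uphill until no neighboring step improves the height."""
    while True:
        best = max((x - step, x, x + step), key=height)
        if best == x:
            return x              # a peak -- but possibly only a local one
        x = best

def landscape(x):
    """A made-up terrain: a small foothill near x = 1, the true peak near x = 4."""
    return -0.1 * (x - 1) ** 2 + 1 if x < 2.5 else -0.1 * (x - 4) ** 2 + 3

print(hill_climb(0.0, landscape))   # stops on the foothill around x = 1
print(hill_climb(3.0, landscape))   # finds the higher peak around x = 4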

Humans and neural networks, however, are doing something slightly different.  They can't look very far, but they're able to read the landscape better.  It's kind of like being on a mountainside in the fog and being able to look at the vegetation, or smell the wind, or examine the types of rock in the area or whatever, and be able to say that the peak is more likely this way than that way.

This wouldn't work if there weren't some sort of correlation between the features of a position and how likely it is to ultimately lead to a win.  It's not a property of tree-searching algorithms.  It's a property of chess as a tree-searching problem.  It's probably a property of any chess-like game that humans find interesting, because that's probably what we find interesting about chess: there's some order to it, in that some positions are clearly better than others, but there's some mystery as well, since we often can't tell for sure which option is better.

Tuesday, May 21, 2019

Computer chess engines fall in line

[This post came out of trying to figure out what computer chess results tell us about what kind of problem chess-playing is.  It goes into quite a bit of detail about cycles where player A beats player B beats player C beats player A and why they don't seem to happen.  Shortly after I posted this, the Computer Chess Championship "Deep Dive" round finished up, and there was a slight anomaly: Stockfish won the tournament, but second-place LC0 beat it head to head, by the narrow margin of 4-3-43, implying an Elo difference of 8 points.  I don't think this affects the post greatly, but it does reinforce the point that NN engines are less consistent than AB engines.  Stockfish didn't lose a game to 4th-place Houdini while LC0 both won and lost.  I'll update this slightly when chess.com makes the full results available --D.H. May 2019]

One thing jumps out from watching the ongoing computer chess tournaments that is so ordinary-seeming that it's easy to dismiss entirely:  The rankings from a preliminary round with a couple dozen contestants are strikingly linear.  If you look at the cross-table, which shows everyone's results against everyone else's, wins in green, losses in red and draws in gray, not only do everyone's scores improve down the table, but for the most part all the red is above and all the green below.  Stronger engines very consistently beat weaker engines, especially considering that at least some of the apparent upsets were due to system crashes and such.

I've seen this happen twice now, once for the "Blitz Bonanza" (5 minutes plus two seconds per move) and once for the Deep Dive (15 minutes plus 5 seconds per move).  The Deep Dive preliminary, which just finished a few days ago, was played in "escalation" format, meaning that each new player plays everyone who's already played (twice as white and twice as black).  After a few opponents, you can be pretty confident about where a new player will land -- above whoever it's beaten head-to-head, below whoever it's lost to, and close to anyone it drew all four games with.

It shouldn't be a total shock that stronger players beat weaker players, or even that they beat them very consistently.  My point is more that it doesn't have to be that way.  While there are only two basic styles of engine*, each dev team has its own philosophy and each one evaluates positions differently.  In the case of neural net-based engines (generally referred to as NN), the evaluation of positions is done radically differently.

It seems at least plausible that, as with human players, engines might tend to do better or worse than otherwise expected against particular styles of play.  An engine that attacks aggressively might lose to one that's tuned for careful defending, but beat a middle-of-the-road approach that can outplay the careful defender positionally.  This would create a cycle of attacker beats middle-of-road beats defender beats attacker, but this sort of situation doesn't seem to happen in practice.

Without going back and checking the actual numbers, my impression is that AB engines -- the conventional style, which does an exhaustive search as far ahead as it can manage, using fairly understandable hand-coded rules to evaluate positions quickly -- win more by being able to look further ahead than by having cleverer evaluation rules.  That is, evaluation speed is a good predictor of who will win.

Clearly if two engines have identical evaluation rules, but one implements them more efficiently, the faster engine should win because it can see further ahead.  The faster engine will see everything the slower engine sees, but it can also see which of two seemingly equal moves leads to a better result.  It will therefore be able to steer toward winning positions and the slower engine won't see the danger until it's too late.

Even with each team using its own particular evaluation rules, something like this still happens.  Typically when two AB engines play, there will be a period where the weaker engine shows a more favorable evaluation for itself than the stronger one does, e.g., the weaker engine is playing white and it sees a half-pawn advantage for white while the stronger engine sees a small advantage for black.

The weaker engine thinks everything's going fine while the stronger engine knows that it has the edge.  Eventually the weaker engine's evaluation starts to drop as it realizes things are no longer as good as they looked, and the stronger engine pushes its small advantage into an overwhelming one.

Again, it isn't a foregone conclusion that a faster engine will beat a slower one with a different evaluation function, but it's what we see in practice.  My suspicion is that evaluation functions hit diminishing returns after a certain point.  You've accounted for material, you've got some concept of tactical considerations like pins and hanging pieces, and a bit of a handle on positional factors like space and pawn structure.  Further improvements are certainly possible, in the form of slightly better evaluation functions, slightly faster evaluation, or both, but the low-hanging fruit was picked long ago.

Adding a detailed model of, say, pawn structure, is likely just to slow the evaluation down without providing enough benefit to make up for the decrease in depth.  This is the "dumb is smarter" principle that seems to have held sway for the past few decades.

Stockfish -- currently the clear champion among AB engines -- demonstrates this nicely with its "fishtest".  As I understand it, developers try out changes to the program to see if their locally patched version consistently beats the current released version.  Only winning changes are brought into the next release.  While Stockfish has improved significantly as a result of fishtest, progress has been incremental, and the end result is that Stockfish can evaluate positions very quickly.  I'm not 100% sure it's the fastest, but it's certainly high up there.
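In spirit (and only in spirit -- the real fishtest is a distributed framework with proper sequential statistics behind it, as I understand it), the gate looks something like this, with play_game() as a hypothetical stand-in for actually running two engines against each other:

def play_match(candidate, baseline, games, play_game):
    """Play `games` games, alternating colors; return the candidate's score.

    play_game(white, black) is a hypothetical stand-in that returns 1, 0.5
    or 0 from white's point of view.
    """
    score = 0.0
    for i in range(games):
        if i % 2 == 0:
            score += play_game(candidate, baseline)
        else:
            score += 1.0 - play_game(baseline, candidate)
    return score

def accept_patch(candidate, baseline, play_game, games=1000, margin=0.52):
    """Keep the change only if it scores clearly better than the current release."""
    score = play_match(candidate, baseline, games, play_game)
    return score / games >= margin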

I'm sure everyone, open source or not, playtests potential improvements (and the training phase of NN engines takes this to an extreme), but crowdsourcing this ought to keep development from fixating on any particular person's or small group's idea of what must be right.  Again, fewer assumptions means better results.


Bringing NN engines into the picture doesn't seem to change this linear ranking.  NN engines are at the other extreme of the smart vs. fast tradeoff.  For a given number of layers and so forth, a position generally takes just as long to evaluate regardless of the exact parameters of the net being used, so the only difference in play will be due to the net itself.  Evaluation speed is not a factor, since it doesn't change.

This doesn't necessarily rule out A-beats-B-beats-C-beats-A cycles; in fact, it seems like they might be more likely, since different nets will put different weights on different factors -- assuming there's more than one way to be good at chess.  So far, though, there don't seem to be any.  In the tournament results I've seen so far, there are three strong NN engines: Leela Chess Zero (generally known as LC0), Leelenstein and Antifish (a modified Leela trained specifically to beat Stockfish).  In the Blitz Bonanza, LC0 had winning records against the other two and Leelenstein had a winning record against Antifish.  Antifish does not seem to do better than the others against Stockfish.  If anything, it does slightly worse.

There are only two NN engines in the Deep Dive finals (rules now prohibit more than one variant of the same engine reaching the finals, so Antifish is out).  Obviously there will be no cycle between them. For what it's worth, LC0 is currently slightly ahead of Leelenstein, but they're dead even head-to-head.

There is one clear deviation from the orderly world of AB engines.  Since NN engines evaluate so slowly, they take a random sample of positions when looking ahead, rather than searching exhaustively like AB engines.  I believe the sample is weighted towards moves the net considers likely to be good, but in any case this introduces some inconsistency.  NN engines occasionally lose to weak AB engines, since not all good-looking moves turn out actually to be good, while strong AB engines almost never do.  Against strong AB engines, NN engines can find good moves that the AB opponent overlooks because they don't lead to good results quickly enough for the exhaustive search to turn them up, but they can also play good-looking moves that turn out to be unsound if you search well enough.
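A drastically simplified sketch of that kind of weighted sampling is below.  To be clear, this is not what LC0 actually does -- real NN engines grow a proper search tree and reuse statistics between visits -- but it shows why sampling guided by the net's own preferences can be both strong and inconsistent: a refutation the net considers unlikely may simply never get sampled.  policy() and value() are hypothetical placeholders for the net's outputs.

import random

def sampled_lookahead(position, legal_moves, make_move, policy, value,
                      samples=100, depth=8):
    """Choose a move by sampling short continuations, weighted toward moves the
    net likes (policy), and scoring the endpoints with the net's evaluation
    (value, assumed to be from the root player's point of view)."""
    root_moves = legal_moves(position)
    totals = {m: 0.0 for m in root_moves}
    counts = {m: 0 for m in root_moves}
    for _ in range(samples):
        weights = policy(position)
        first = random.choices(root_moves, weights=[weights[m] for m in root_moves])[0]
        current = make_move(position, first)
        for _ in range(depth - 1):          # keep sampling, again by the net's weights
            moves = legal_moves(current)
            if not moves:
                break
            w = policy(current)
            current = make_move(current, random.choices(moves, weights=[w[m] for m in moves])[0])
        totals[first] += value(current)
        counts[first] += 1
    # Prefer the root move whose sampled continuations scored best on average.
    return max(root_moves, key=lambda m: totals[m] / counts[m] if counts[m] else float("-inf"))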

The net effect is that strong NN engines don't do as well against weak AB engines as you'd expect from how they do against strong engines.  In the current Deep Dive finals, Stockfish is (at this writing) noticeably ahead of LC0, but head-to-head they're nearly even (3 wins to 2 with 41 draws, implying an Elo difference of 8 points; by contrast, human world champion Carlsen is currently rated about 57 points ahead of second-ranked Caruana) [and in fact, Stockfish won the finals but lost to LC0 head-to-head].
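For what it's worth, the Elo arithmetic behind "implying an Elo difference of 8 points" is just the standard logistic model turned around: a given rating gap predicts an expected score, so an observed score implies a rating gap.

import math

def implied_elo_difference(wins, losses, draws):
    """Invert the usual Elo formula: a player rated d points higher is expected
    to score 1 / (1 + 10 ** (-d / 400)) per game, so an observed score fraction
    p implies d = 400 * log10(p / (1 - p))."""
    games = wins + losses + draws
    p = (wins + 0.5 * draws) / games
    return 400 * math.log10(p / (1 - p))

print(round(implied_elo_difference(3, 2, 41), 1))   # roughly 7.6 -- call it 8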

The difference in the overall standings comes from Stockfish doing better against the weaker opposition.  Stockfish is 9-0 against 4th-place Houdini (with a whole bunch of draws), while LC0 is 12-3 against Houdini (with correspondingly fewer draws).  Stockfish also does better against the other NN engine, 3rd-place Leelenstein.

So there don't seem to be any cycles in the ranking of chess engines, and the (small) difference between top AB engines and top NN engines looks to be due to consistency.  What does this tell us about chess-playing as a problem?  There are several possible conclusions:
  • Chess skill really is a linear ranking, at least once you get to the level of computer engines.  If A beats B and B beats C, then A beats C.
  • A bit stronger: There is one true way to play chess and stronger engines approximate this better.
  • On the other hand: AB engines are linearly ranked because only depth/evaluation speed really matters while NN engines are linearly ranked because only the quality of the net doing the evaluation matters.  It's not surprising, though not inevitable, that combining two linear rankings gives another linear ranking.
  • Also, NN engines are all playing on or near the level of the current AB champion, so they're really only being compared to one style of play since there's only one really competitive opponent.
My feeling is that we don't really have enough to know for sure, but most likely we'll continue to see improvement in playing strength and that the linear ranking will hold.  Next year's champion will be able to beat any other opponent head-to-head, the one after that will be able to beat it and everyone else, and so forth, regardless of whether the next champion is AB, NN or some hybrid of the two.  There are noticeable differences between how AB engines and NN engines play (I'd like to get into the details in a future post), but be that as it may, stronger is stronger.

* Technically, there are two choices on two dimensions: Alpha/Beta tree search vs. Monte Carlo and neural-net based evaluation vs. hand-coded rules.  However, NN engines always use Monte Carlo since it takes them roughly a thousand times longer to evaluate a particular position, so they can't search exhaustively.  On the other hand, conventional engines generally use AB searches because they can.  There have been fairly successful experiments in using Monte Carlo with hand-coded rules.  Those tend to be called hybrid.

Saturday, April 27, 2019

Notes on the EHT black hole image

Following on to the recent post on the other blog about the imaging of a black hole in a nearby galaxy, here are my lightly cleaned up notes from watching Katie Bouman's lecture on how the images were captured using the Event Horizon Telescope (EHT), which is actually a collection of several radio telescopes whose individual data sets are combined using very sophisticated image processing techniques to emulate a single Earth-sized telescope.

I have no great insights to offer, but there are some pretty interesting details in the original lecture.  I've thrown in some commentary here and there, without taking any real care to distinguish it from the source material. The full lecture is on Caltech's YouTube channel.  As always, any errors or distortions are mine.



Technically, the image gathered by the telescopes working in combination is in spatial frequency, basically how quickly the image changes from bright to dark as you scan across it.  For example, a square divided into a white rectangle and a black rectangle has a lower spatial frequency than one divided into thin black and white stripes.  It's possible to reconstruct a visual image from information about the spatial frequencies, and the more information you have, the better the reconstruction.  This sort of reconstruction is widely used in many fields, including astronomy, medical imaging and the image compression that lets us stream video over the internet.
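Here's a toy illustration of that idea using numpy's Fourier transforms -- nothing to do with the actual EHT pipeline, which is vastly more sophisticated, but it shows the basic tradeoff: keep only a small patch of the spatial frequencies and the reconstructed picture is still recognizably a ring, just blurrier.

import numpy as np

# A toy "image": a bright ring on a dark background.
n = 128
y, x = np.mgrid[-1:1:n*1j, -1:1:n*1j]
r = np.hypot(x, y)
image = ((r > 0.4) & (r < 0.6)).astype(float)

# The 2-D Fourier transform holds the image's spatial frequencies.
freqs = np.fft.fftshift(np.fft.fft2(image))

# Pretend we could only measure the lowest 10% or so of spatial frequencies
# (crudely analogous to having only a handful of baselines).
mask = np.zeros_like(freqs)
k, c = n // 10, n // 2
mask[c - k:c + k, c - k:c + k] = 1

# Reconstructing from the incomplete measurements gives a blurry ring.
blurry = np.abs(np.fft.ifft2(np.fft.ifftshift(freqs * mask)))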

The exact spatial frequencies being measured depend on the distance (baseline) between the telescopes.  When the telescopes are spread across distant locations on the Earth, the image technique is called Very Long Baseline Interferometry (VLBI).  Interferometry refers to measuring how the signals from different telescopes interfere with each other, that is, whether they reinforce each other or cancel each other out when combined.  Very Long Baseline refers to, um ...

In this case there were eight radio telescopes at six different sites, taking data together on four different nights, scanning for a few minutes at a stretch.  This provides 21 different pairs of telescopes, which would mean 21 different baselines, that is, 21 different spatial frequencies to combine, except that the baselines change as the Earth rotates.  This is good, because it provides samples of more frequencies.  There is at least some talk of adding telescopes in low Earth orbit at some later stage, which would not only provide more raw data but more varied baselines, since satellites move faster than the Earth rotates.

As mentioned in the other post, each scope collected hundreds of terabytes (hundreds of trillions of bytes), enough data that it was easier to fly the drives with the data to the site where it would be processed, rather than try to send it over the internet (and the South Pole data had to be flown out in any case).

The actual signal, coming from about 50 million light-years away, is quite weak, and noisy.  There is a lot of matter at the center of a galaxy, radiating at various frequencies and blocking various frequencies.  The radio band was chosen to minimize these effects, but there is still noise to contend with, along with noise from a variety of other sources, including atmospheric noise and imperfections in the telescopes themselves.

The various telescopes are looking through different amounts of atmosphere, depending on whether the black hole is overhead from their location or on the horizon, and this also affects how long it takes the signal to arrive at each telescope.  It takes about 20 milliseconds longer for the signal to reach a scope that has the black hole on the horizon than it does for one where it's overhead.  For a 1GHz signal, that means a difference of 20 million cycles -- and again, this is varying as the Earth rotates.

All this means there is a lot of calibration to be done to make sure you're lining up the right data from the various telescopes and accounting for all the various sources of noise and distortion.  In all, there were hundreds of thousands of calibration parameters to work with, though the team ended up using (only) ten thousand or so.

The reconstructed image is meant to represent what an Earth-sized telescope would see, if one could be built, but there are actually infinitely many such images that would match up with the limited data that was collected from a few scattered points on Earth.

Fortunately, having multiple telescopes means you can cross-check and isolate errors.  There are also "closure quantities" that should be the same regardless of the exact details of calibration.  I didn't really understand the details of closure quantities, but the idea of looking for quantities that don't depend on hard-to-control parameters is a powerful one.  It's sort of similar to checking arithmetic by adding up digits.

The team used two different methods of calibration, and they split into four subteams, two for each method.  The subteams worked independently for several weeks to reconstruct an image from the raw data.  Bouman's own work involved getting the human-tuned method (CLEAN) to work without human tuning.

In addition to the work by the subteams, there was a lot of work done up front to try to make sure that the image processing algorithms they were tuning were robust.  Much of this involved taking test images, calculating what signals the telescopes would get and then reconstructing the test images from the calculated signals.  They did this not only with images of discs and rings that one would typically expect in astronomical images, but also with a variety of non-astronomical images.

The team actively tried to break the image processing algorithms, for example by tuning the parameters to favor a solid disc rather than a ring, and verifying that a ring still came out as a ring.  Python fans will be glad to know that at least some of the code involved was written in Python.

The publicly distributed image is a combination of the images from the subteams.  Since it's meant to represent only what we can be confident about regarding the actual black hole, it's deliberately a bit blurry.  Even still, there are some significant results [I haven't yet re-checked the details here, so what follows may be garbled]:
  • There is, in fact, a ring as would be expected from a supermassive black hole.
  • The same ring was observable night after night.
  • Previously there had been two separate estimates of the mass of the black hole, one from the motions of stars around it (stellar dynamics) and one from the appearance of the gas flowing into the accretion disc.  The size of the ring in the image supports the first mass estimate.
  • One portion of the ring is brighter than the rest, due to the doppler shift of matter orbiting around the black hole.  This is to be expected, and the position of the bright spot matches up with other observations.
  • The position of the bright spot is shifting over time, consistent with a rotation period on the order of weeks (I think this implies that the black hole's axis of rotation is not on the line of sight).  Since the black hole in the Milky Way (Sgr A*) is much smaller, its rotation period is expected to be on the order of minutes.  Since the observation scans are also on the order of minutes, the image processing for Sgr A* will have to take this into account.
It's also possible to analyze the raw data directly as a cross-check.  A disk of such-and-such size should result in such-and-such radio signals arriving at the various telescopes, independent of any reconstructions produced from that data.  The raw signals are consistent with a ring of the size found in the visual reconstructions.  This is not a surprise, given how the reconstructions were done, but still good to know.

Despite the whole effort taking years and requiring expensive radio telescopes at widely dispersed locations, and employing dozens of people at various stages, the movie Interstellar still cost a lot more to make.  While its image of a black hole may be prettier (and, as I understand it, one of the parts of the movie where the science was actually plausible), the EHT image was both cheaper and more informative.

Again, really good stuff.

Sunday, April 14, 2019

Notes on watching (a chunk of) a computer chess tournament

A while ago I started a post trying to untangle what kind of a problem it is to play chess well, based on how well various kinds of players -- human, conventional computer chess engines and neural-network-based chess engines -- did against each other.  I still think it's an interesting question, but in the meantime I actually watched a bunch of computer chess and that was even more interesting.  In particular, I looked in on a fair bit of chess.com's "CCC 7: Blitz Bonanza".  Here are some rough impressions.

The Blitz Bonanza is one part of a long-running event comprising thousands of individual games in several different formats (the nice thing about chess engines is they don't get tired, so it's practical to have two of them play each other dozens of times in the course of a few days).  The "Deep Dive", with longer time controls and slightly different rules, is going on now and will be for several weeks.  There may be more after that (much of this is from memory because I can only seem to find current results on chess.com and the summaries on other sites don't always contain the exact information I'm looking for).

In the Blitz Bonanza, each player has five minutes of time to start with, plus two seconds per move made.  There is no "opening book", meaning that the players must calculate every move.  This leads to the rather odd sight of a clock ticking down near or past the 4-minute mark on move one, which is pretty much inevitably something everyone plays anyway (1. d4 and 1. e4 predominate).

Play continues either to checkmate by one side (though it enters a "get it over with" or "bullet" mode if both engines evaluate the current position as highly favorable to one side), or to a draw by stalemate, threefold repetition, 50 moves without a pawn move or capture, or "adjudication".  Adjudication means either that the position is in a standard database of positions with six or fewer pieces on the board that cannot be won if both sides play perfectly (this includes "insufficient force" scenarios like king and bishop against king), or that both engines have evaluated the game as a dead draw for a certain number of moves.

This is rather different from human matches, where a player will typically resign if it becomes clear there's no way to avoid losing, and players typically agree to a draw when it's clear that neither of them has a realistic shot at winning unless the other one blunders.  It's sometimes fun to watch an actual checkmate.  The draws can be interesting in their own way.  More on that later.

Scoring was the usual one point for a win, no points for a loss and half a point to each player for a draw.

The original field consisted of
  • The premier conventional (alpha/beta or AB) engine, Stockfish
  • Three neural network-based (NN) engines (Leela Chess Zero, aka LC0, Leelenstein and Antifish)
  • Everyone else
In the first round, each of these played each of the others several times, half of the time as white and half as black.  The top four would progress to the next round.  When the smoke cleared, the top four were LC0, the other two neural network engines, and Stockfish rounding out the field in fourth.

In the second round, each of the four played each of the others 100 times (50 as white and 50 as black), for a total of 600 games.  This went on around the clock for several days.   The final result had LC0 in first, Stockfish close behind, well ahead of Leelenstein, which was itself well ahead of Antifish.  LC0 had a small lead over Stockfish in their head-to-head games.

Looking at this with my somewhat defocused engineer's eyes, I'd say "LC0 and Stockfish were about equal, both were clearly stronger than Leelenstein, and Antifish was clearly the weakest of the lot."  I'd consider LC0 and Stockfish essentially equal not only because of the small difference in score (5.5 points out of a possible 300), but because the margin between the two was steady for most of the play.  LC0 won some early games against Stockfish head-to-head, but after that each scored several wins against the other.  The early lead thus looks more like a statistical fluke than a real difference in strength [I'm not sure a real statistician would buy this analysis -D.H May 2019].  The results of CCC 6, with slightly different rules and time controls, bear this out.  Stockfish beat LC0, but not by much, and, if I recall correctly, not head-to-head.
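As a sanity check on the "statistical fluke" hand-waving, here's a crude simulation: two exactly equal players, a made-up draw rate of about 80%, and we ask how often luck alone produces a lead of a few points over a 100-game head-to-head match.  The parameters are guesses, and a real statistician would do this properly, but it gives a feel for how little a small margin means when most games are drawn.

import random

def simulated_margin(games=100, draw_rate=0.8):
    """Final score margin between two equally strong players over `games` games."""
    margin = 0.0
    for _ in range(games):
        r = random.random()
        if r < draw_rate:
            continue                          # draws don't move the margin
        margin += 1 if r < draw_rate + (1 - draw_rate) / 2 else -1
    return margin

trials = 10_000
big = sum(abs(simulated_margin()) >= 3 for _ in range(trials))
print(f"lead of 3+ points by pure luck: {big / trials:.0%} of the time")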

As to my long-running theme of "dumb is smarter", the "zero" in LC0 means that it was trained purely by playing against itself, without any external hints.  Antifish, which finished last and had a losing record against each of the others, including Stockfish, was trained specifically to exploit weaknesses in Stockfish.  If we take LC0's "no extra information to bias the results" approach as one kind of "dumb" and Stockfish's "bash out lots of positions using fairly straightforward (but highly-tuned) rules" as another, it's pretty clear that "dumb" dominated.

One other thing jumped out: Neural networks are not good at endgames, or at least they can be spectacularly bad at some endgames.

Endgames are interesting here because there are few pieces on the board and it's often possible to spell out an explicit plan for winning.  For example, if you have a rook and a king against a lone king, you can use your rook and king in concert to pin the other king against the edge of the board and then checkmate with the rook.  In other cases, you want to promote a pawn and this hinges on getting your king to the right square and keeping the other king from stopping you.

Over and over I saw neural network engines flail about aimlessly before hitting the 50-move limit.  To be clear, there are a few endgames that can take more than 50 moves to win, but even in these you can see a clear progression.  To make up an example, after a series of ten checks the opposing king is now one square closer to the corner where it will eventually be checkmated.  This isn't what was happening.  In one game I saw, the AB engine showed mate in 12, while the NN engine saw its winning chances as quite good.  The game continued

AB: Oh, now you've got mate in 14
NN: Yeah this looks good
AB: I don't see a mate, but I think you're going to promote that pawn (or whatever) and win, so my rating is still highly in your favor
NN: Yeah this looks good
AB: Dang, mate in 14 again
NN: Yeah this looks good
AB: Mate in 13
NN: Yeah this looks good
AB: I don't see a mate any more, but ...
NN: Yeah this looks good

From the NN's point of view, it was going from one position with good winning chances to another.  But it wasn't actually winning.  When the game eventually ended in a draw, the NN gave away a half point that the AB engine or a good human player would have gotten, given the same position.


I was a little (but not a lot) surprised by this.  It doesn't seem like the problem is beyond the capability of neural networks.  They've discovered plenty of conventional tactics -- forks, skewers and such, or ladders in the game of go -- and they are clearly able to encode at least some of the usual winning patterns.

Give a neural network a king and queen against a queen and it will find a checkmate more or less like anyone else would.  On the other hand, give it a queen, a pawn and a rook against a king and it will quite likely forcibly give away two of the three and win with what's left.  To a mathematician it makes sense (reducing to a previously-solved case), but to most chess players it looks like lunacy, if not deliberate trolling [In one fairly jaw-dropping example from the Deep Dive finals, LC0 had an overwhelming material advantage in an endgame and gave away pieces to reduce it to bishop, knight and king against lone king, one of the harder endgames to win.  It went ahead and won anyway --D.H May 2019].  

I found this behavior particularly interesting because neural networks are supposed to have more of that special sauce we call intelligence or understanding.  We can clearly say that Stockfish has no deep understanding of chess the way that a human grandmaster can intuitively say "Your pieces are active and the king is exposed.  Go for a mating attack," but at least it plays plausible moves.  NN engines often find wild moves that a human wouldn't consider and an AB engine might avoid because it doesn't show a clear win, leading us to feel they have some deeper understanding of chess.  I think that's a valid feeling, but there are cases where it completely falls apart.

Watching a neural network botch an endgame just looks like farce.  I'm perfectly comfortable saying that a program "understands" a concept in chess, or "has a concept of" pawn structure or whatever.  It doesn't mean that the program has some reflective layer thinking "Oh, that's a strong pawn chain, I like that", but that it behaves in such a way that, one way or another, it must recognize good and bad structures.

In cases like these endgames, though, it's abundantly clear that the network has no concept, in any meaningful literal or metaphorical sense of the word, of how to win the position.  Nor is it going to be able to figure one out over the board, because the network only learns during the training phase (that's not necessarily a requirement, but it's how these engines work).

This doesn't seem like a permanent situation.  For example, you could have an NN engine train on endgame positions, and it would likely discover ways to win outright and ways to steer toward a win in positions where it now flails.  This is pretty much what humans do -- you learn how to promote pawns as one skill, how to checkmate with a rook and king as another, and so forth.

What's at least a bit surprising is that an engine can be frighteningly strong in most of the game and completely clueless in a few key spots.  This has to say something about intelligence and learning, or about what kind of a problem chess-playing is technically, or about all of the above, but I'm not quite sure what.