Friday, January 10, 2020

Is the piano a percussion instrument?

Well, is the piano a percussion instrument?

This is one of those questions that can easily devolve into "Well technically" ... "Oh yeah, well actually" and so forth.  I'm not aware of an official designator of instrument categories, but more to the point I'm not interested in a right or wrong answer here.  I'm interested in why the question should be tricky in the first place.

The answer I learned from high school orchestra or thereabouts was "Yes, it's a percussion instrument, because the strings are hit by hammers."  The answer I personally find more convincing is "No, because it's a piano, duh."

OK, maybe that's not particularly convincing.  Maybe a better way to phrase it would be "No, it's a keyboard instrument.  Keyboard instruments are their own class, separate from strings, woodwinds, brass and percussion."  By this reasoning, the pipe organ is a keyboard instrument, not a wind instrument, the harpsichord is a keyboard instrument, not a string instrument, and a synthesizer is a keyboard instrument, assuming it has a keyboard (not all do).

The intuition behind this is that being played by way of a keyboard is more relevant than the exact method for producing the sounds.  Even though a marimba, xylophone, vibraphone or glockenspiel has an arrangement of things to hit that looks a lot like a keyboard, the fact that you're limited to mallets in two hands has a big effect on what you can play.  Likewise, a harpsichord and a guitar or banjo produce somewhat similar sounds, but fretting a one or more of a few strings is different from pressing one or more of dozens of keys.

It's a lot easier to play a four-part fugue on a harpsichord than a marimba, and a seven-note chord is going to present real problems on a five-string banjo.  Different means of playing make different things easy and hard, and that affects what actually gets played.

At this point, I could put forth a thesis that how you play an instrument is more important in classifying it than how the sounds are ultimately produced and be done with it, but that's not what got me typing in the first place.  To be clear, I like the thesis.  It's easier to play a saxophone if you know how to play a clarinet, easier to play a viola or even a guitar if you can play violin, and so forth.  What got me thinking, though, was the idea of how any classification on the order of string/woodwind/brass/percussion or keyboard/bow/plectrum/mallet/etc. tends to break down on contact with real objects to classify.

For example, there are lots of ways to produce sound from a violin.  There are several different "ordinary" ways to bow, but you can also bounce the wooden part of the bow on the strings, or pluck the strings (with either hand).  Independently of what you do with the bow, you can put a mute on the bridge to get a kind of ethereal, spooky sound.  You can rest a finger lightly on the string to get a "harmonic" with a purer tone (and generally higher pitch) than if you pressed the string to the fingerboard.  Beyond all that, you can tap on the body of the violin with your finger, or the stick of the bow or the end of the bow.  You could even tap the violin on something else, or use its strings as a bow for another instrument.

Does tapping on a violin make it a percussion instrument?  I'd say it is when you're tapping on it, otherwise not.  But if you ask, "Is the violin a percussion instrument," I'd say "no" (or, if I'm feeling cagy, "not normally").

How about an electric guitar?  Obviously, it's a string instrument, except there's more to playing an electric guitar than picking and fretting the strings.  The effects and the amp make a big difference.  It's probably best to think of electric guitar plus amp and effects as both a string instrument and an electronic instrument, both in its construction and in how you play it.  The guitar, amp and effects together are one instrument -- that's certainly how guitarists tend to see it, and they can spend quite a bit of time telling you the details of their rigs.

There are plenty of other examples to pick from -- a morsing, a glass harp, a musical saw, a theremin ... if you had to pick, you could probably call a morsing or even a glass harp a percussion instrument -- I mean, if a piano is, why not?  A musical saw would be, um, a string instrument?  A theremin would be ... I don't know, let's say brass because there are metal parts?

But why pick?  Clearly the four sections of an orchestra work fine for the instruments they were originally intended to classify, and they provide useful information in that context.  If you're putting together an orchestra, you can expect a percussionist to handle the bass drum, snare drum and tympani but not a trumpet, oboe or cello.  If you're composing for orchestra, you should know that wind players need to breathe and that a string instrument can play more than one note at a time, but only within fairly strict limits.  In neither case do you really care that someone might consider a piano a percussion instrument.  For the purposes of hiring players and composing music, a piano is a keyboard instrument.

If your purpose is to classify instruments by common properties, there are much better systems.  Wikipedia likes the Hornbostel Sachs classification, which takes into account what produces the sound, how the sound is produced, the general form of the instrument and other factors.  For my money, it does a pretty good job of putting similar instruments together while making meaningful distinctions among them.  For example (based on this 2011 revision of the classification):
  • violin 321.322-71 (Box lute sounded by a bow)
  • cello 321.322-71 (Same)
  • guitar 321.322-5 or -6 (Box lute sounded by bare fingers(5) or plectrum(6))
  • French horn 423.232.12 (Valved horn with narrow bore and long air column)
  • oboe 422.112-71 (Reedpipe with double reeds and conical bore, with keys)
  • bass drum: 211.212.12 (Individual double-skin cylindrical drums, both heads played)
  • piano 314.122-4-8 (Box zither sounded by hammers, with keyboard)
  • harpsichord  314.122-6-8 (Box zither sounded by plectrum, with keyboard)
  • morsing 121.2 (plucked idiophone with frame, using mouth cavity as resonator)
  • glass harp 133.2 (set of friction idiophones)
  • musical saw 151  (metal sheet played by friction)
  • theremin 531.1 (Analogue synthesizers and other electronic instruments with electronic valve/vacuum tube based devices generating and/or processing electric sound signals)
There's certainly room for discussion here.  Playing a cello is significantly different from playing a violin -- the notes are much farther apart on the longer strings, the cello is held vertical, making the bowing much different, and as a consequence of both, the bow is much bigger and held differently.  Clearly the analogue synthesizer section could stand to be a bit more detailed, and there's clearly some latitude within these (Wikipedia has a musical saw as 132.22 (idiophone with direct friction).

It's also interesting that a guitar is counted as a slightly different instrument depending on whether it's played with bare fingers or a plectrum, but that fits pretty well with common usage.  Fingerpicking and flatpicking require noticeably different skills and many guitarists specialize in one or the other.  The only sticking point is that a lot of fingerstyle guitarists use fingerpicks, at least when playing a steel-string acoustic ...

Nonetheless, I'd still say Hornbostel-Sachs does a decent job of classifying musical instruments.  Given the classification number, you have a pretty good idea of what form the instrument might take, who might be able to play it and, in many if not all cases, how it might sound.  There are even provisions for compound instruments like electric guitar plus effects, though I don't know how well-developed or effective those are.

The string/woodwind/brass/percussion system also provides a decent idea of form, sound and who might play, within the context of a classical orchestra, but if you're familiar with the classical orchestra you should already know what a french horn or oboe sounds like.

Which leads back to the underlying question of purpose.  Classification systems, by nature, are systems that we impose on the world for our own purposes.  A wide-ranging and detailed system like Hornbostel-Sachs is meant to be useful to people studying musical instruments in general, for example to compare instrumentation in folk music across the world's cultures.

There are a lot more local variations of the bass drum or box lute family than theremin variants -- or even musical saw variants -- so even if we knew nothing else we might have an objective reason to think that drums and box lutes are older, and we might use the number of varieties in particular places to guess where an instrument originated (places of origin, in general, tend to have more variants).  Or there might be an unexpected correlation between latitude and the prevalence of this or that kind of instrument, and so forth.  Having a detailed classification system based on objective properties allows researchers to explore questions like this in a reasonably rigorous way.

The classification of instruments in the orchestra is more useful in the day-to-day running of an orchestra ("string section will rehearse tomorrow, full orchestra on Wednesday") and in writing classical music.  Smaller ensembles, for example, tend to fall within a particular section (string quartet, brass quintet) or provide a cross-section in order to provide a variety of timbral possibilities (the Brandenburg concertos use a harpsichord and a string section with various combinations of brass and woodwinds -- strictly speaking the harpsichord can be replaced by other instruments when it's acting as a basso continuo).

Both systems are useful for their own purposes, neither covers every possible instrument completely and unambiguously (though Hornbostel-Sachs comes fairly close) and neither is inherently "correct".   As far as I can tell, this is all true of any interesting classification system, and probably most uninteresting ones as well.

No one seems to care much whether a pipe organ or harpsichord is a percussion instrument.   I'm not sure why.  Both have been used in orchestral works together with the usual string/woodwind/brass/percussion sections.

Tuesday, October 29, 2019

More on context, tool use and such

In the previous post I claimed that (to paraphrase myself fairly loosely) whether we consider behaviors that technically look like "learning", "planning", "tool use" or such to really be those things has a lot to do with context.  A specially designed robot that can turn a door handle and open the door is different from something that sees a door handle slightly out of reach, sees a stick on the ground, bends the end of the stick so it can grab the door handle and proceeds to open the door by using the stick to turn the handle and then to poke the door open.  In both cases a tool is being used to open a door, but we have a much easier time calling the second case "tool use".  The robot door-opener is unlikely to exhibit tool use in both contexts.

With that in mind, it's interesting that the team that produced the hide-and-seek AI demo is busily at work on using their engine to play a Massively Multiplayer Online video game.  They argue at length, and persuasively, that this is a much harder problem than chess or go.  While the classic board games may seem harder to the average person than mere video games, from a computing perspective MMOs are night-and-day harder in pretty much every dimension:

  • You need much more information to describe the state of the game at any particular point (the state space is much larger).  A chess or go position can be described in well under 100 bytes.  To describe everything that's going on at a given moment in an MMO takes more like 100,000 bytes (about 20,000 "mostly floating point" numbers)
  • There are many more choices at any given point (the action space is much larger).  A typical chess position has a few dozen possible moves.  A typical go position may have a couple hundred.  In a typical MMO, a player may have around a thousand possible actions at a particular point, out of a total repertoire of more than 10,000.
  • There are many more decisions to make, in this case running at 30 frames per second for around 45 minutes, or around 80,000 "ticks" in all.  The AI only observes every fourth tick, so it "only" has to deal with 20,000 decision points.  At any given point, an action might be trivial or might be very important strategically.  Chess games are typically a few dozen moves long.  A go game generally takes fewer than 200 (though the longest possible go game is considerably longer).  While some moves are more important than others in board games, each requires a similar amount and type of calculation.
  • Players have complete information about the state of a chess or go game.  In MMOs, players can only see a small part of the overall universe.  Figuring out what an unseen opponent is up to and otherwise making inferences from incomplete data is a key part of the game.
Considered as a context, an MMO is, more or less by design, much more like the kind of environment that we have to plan, learn and use tools in every day.  Chess and go, by contrast, are highly abstract, limited worlds.  As a consequence, it's much easier to say that something that looks like it's planning and using tools in an MMO really is planning and using tools in a meaningful sense.

It doesn't mean that the AI is doing so the same way we do, or at may think we do, but that's for a different post.

Tool use, planning and AI

A recent story in MIT Technology Review carries the headline AI learned to use tools after nearly 500 million games of hide and seek, and the subhead OpenAI’s agents evolved to exhibit complex behaviors, suggesting a promising approach for developing more sophisticated artificial intelligence.  This article, along with several others, is based on a blog post on OpenAI's site.  While the article is a good summary of the blog post, the blog post is just as readable while going into somewhat more depth and technical detail.  Both the article and the blog post are well worth reading, but as always the original source should take precedence.

There is, as they say, quite a bit to unpack here, and before I'm done this may well turn into another Topic That Ate My Blog.  At the moment, I'm interested in two questions:
  • What does this work say about learning and intelligence in general?
  • To what extent or in what sense do terms like "tool use" and "planning" describe what's going on here?
My answers to both questions changed significantly between reading the summary article and reading the original blog post.

As always, lurking behind stories like this are questions of definition, in particular, what do we mean by "learning", "planning" and "tool use"?  There have been many, many attempts to pin these down, but I think for the most part definitions fall into two main categories, which I'll call internal and external here.  Each has its advantages and drawbacks.

By internal definition I mean an attempt to formalize the sort of "I know it when I do it" kind of feeling that a word like learning might trigger.  If I learn something, I had some level of knowledge before, even if that level was zero, and after learning I could rattle off a new fact or demonstrate a new skill.  I can say "today I learned that Madagascar is larger than Iceland" or "today I learned how to bake a soufflĂ©".  If I talk about planning, I can say "here's my plan for world domination" (like I'd actually tell you about the robot army assembling itself at ... I've said too much) or "here's my plan for cleaning the house".  If I'm using a tool, I can say "I'm going to tighten up this drawer handle with a Philips screwdriver", and so forth.  The common thread is here is a conscious understanding of something particular going on -- something learned, a plan, a tool used for a specific purpose.

This all probably seems like common sense, and I'd say it is.  Unfortunately, common sense is not that helpful when digging into the foundations of cognition, or, perhaps, of anything else interesting.  We don't currently know how to ask a non-human animal to explain its thinking.  Neither do we have a particularly good handle on how a trained neural network is arriving at the result it does.  There may well be something encoded in the networks that control the hiders and seekers in the simulation, which we could point at and call "intent", but my understanding is we don't currently have a well-developed method for finding such things (though there has been progress).

If we can't ask what an experimental subject is thinking, then we're left with externally visible behavior.  We define learning and such in terms of patterns of behavior.  For example, if we define success at a task by some numerical measure, say winning percentage at hide and seek, we can say that learning is happening when behavior changes and the winning percentage increases in a way that can't be attributed to chance (in the hide-and-seek simulation, the percentage would tilt one way or another as each side learned new strategy, but this doesn't change the basic argument).

This turns learning into a pure numerical optimization problem: find the weights on the neurons that produce the best winning percentage.  Neural-network training algorithms are literally doing just such an optimization.  Networks in the training phase are certainly learning, by definition, but certainly not in the sense that we learn by studying a text or going to a lecture.  I suspect that most machine learning researchers are fine with that, and might also argue that studying and lectures are not a large part of how we learn overall, just the part we're most conscious of as learning per se.

This tension between our common understanding of learning and the workings of things that can certainly appear to be learning goes right to why an external definition (more or less what we call an operational definition) can feel so unsatisfying.  Sure, the networks look like they're learning, but how do we know they're really learning?

The simplest answer to that is that we don't.  If we define learning as optimizing a numerical value, then pretty much anything that does that is learning.  If we define learning as "doing things that look to us like learning", then what matters is the task, not the mechanism.  Learning to play flawless tic-tac-toe might be explained away as "just optimizing a network" while learning to use a ramp to peer over the wall of a fort built by a group of hiders sure looks an awful lot like the kind of learning we do -- even though the underlying mechanism is essentially the same.

I think the same reasoning applies to tool use: Whether we call it tool use or not depends on how complex the behavior appears to be, not on the simple use of an object to perform a task.  I remember reading about primates using a stick to dig termites as tool use and thinking "yeah, but not really".  But why not, exactly?  A fireplace poker is a tool.  A barge pole is a tool.  Why not a termite stick?  The only difference, really, is the context in which they are used.  Tending a fire or guiding a barge happen in the midst of several other tools and actions with them, however simple in the case of a fireplace and andirons.  It's probably this sense of the tool use being part of a larger, orchestrated context that makes our tool use seem different.  By that logic, tool use is really just a proxy for being able to understand larger, multi-part systems.

In my view this all reinforces the point that "planning", "tool use" and such are not binary concepts.  There's no one point at which something goes from "not using tools" to "using tools", or if there is, the dividing line has to be fairly arbitrary and therefore not particularly useful.  If "planning" and "tool use" are proxies for "behaving like us in contexts where we consider ourselves to be planning and using tools", then what matters is the behavior and the context.  In the case at hand, our hiders and seekers are behaving a lot like we would in a context that we would certainly say requires planning and intelligence.

As far as internal and external definitions, it seems we're looking for contexts where our internal notions seem sensible.  In such contexts we have much less trouble saying that behavior that fits an external definition of "tool use", "planning", "learning" or whatever is compatible with those notions.

Saturday, July 27, 2019

Do neural networks have a point of view?

As someone once said, figures don't lie, but liars do figure.

In other words, just because something's supported by cold numbers doesn't mean it's true.  It's always good to ask where the numbers came from.  By the same token, though, you shouldn't distrust anything with numbers behind it, just because numbers can be misused.  The breakdown is more or less:
  • If you hear "up" or "down" or "a lot" or anything that implies numbers, but they're aren't any numbers behind it, you really don't know if it's true or not, or whether it's significant.
  • If you hear "up X%" or "down Y%" or -- apparently a popular choice -- "up a whopping Z%" and you don't know where the numbers came from, you still don't really know if it's true or not.  Even if they are correct, you don't know whether they're significant.
  • If you hear "up X%, according to so-and-so", then the numbers are as good as so-and-so's methodology.  If you hear "down Y%, vs. Z% for last quarter", you at least have a basis for comparison, assuming you otherwise trust the numbers.
  • In all, it's a bit of a pain to figure all this out.  Even trained scientists get it wrong more than we might think (I don't have numbers on this and I'm not saying it happens a lot, but it's not zero).
  • No one has time to do all the checking for more than a small subset of things we might be interested in, so to a large extent we have to trust other people to be careful.  This largely comes down to reputation, and there are a number of cognitive biases in the way of evaluating that objectively.
  • But at least we can try to ignore blatantly bad data, and try to cross-check independent sources (and check that they're actually independent), and come up with a rough, provisional picture of what's really going on.  If you do this continually over time the story should be pretty consistent, and then you can worry about confirmation bias.
  • (Also, don't put much stock in "record high" numbers or "up (a whopping) 12 places in the rankings", but that's a different post).
I'm not saying we're in some sort of epistemological nightmare, where no one has any idea what's true and what's not, just that objectivity is more a goal to aim towards rather than something we can generally expect to achieve.

So what does any of this amateur philosophizing have to do with neural networks?

Computers have long been associated with objectivity.  The stramwan idea that "it came from a computer" is the same as "it's objectively true" probably never really had any great support, but a different form, I think, has quite a bit of currency, even to the point of becoming an implicit assumption.  Namely, that computers evaluate objectively.

"Garbage in, garbage out," goes the old saying, meaning a computed result is only as good as the input it's given.  If you say the high temperature in Buenos Aires was 150 degrees Celsius yesterday and -190 Celsius today, a computer can duly tell you the average high was -20 Celsius and the overall high was 150 Celsius, but that doesn't mean that Buenos Aires has been having, shall we say, unusual weather lately.  It just means that you gave garbage data to a perfectly good program.

The implication is that if you give a program good data, it will give you a good result.  That's certainly true for something simple, like calculating averages and extremes.  It's less certain when you have some sort of complicated, non-linear model with a bunch of inputs, some of which affect the output more than others.  This is why modeling weather takes a lot of work.  There are potential issues with the math behind the model (does it converge under reasonable conditions?), the realization of that model on a computer (are we properly accounting for rounding error?) the particular settings of the parameters (how well does it predict weather that we already know happened?).  There are plenty of other factors.  This is just scratching the surface.

A neural network is exactly a complicated, non-linear model with a bunch of inputs, but without the special attention paid to the particulars.  There is some general assurance that the tensor calculations that relate the input to the output are implemented accurately, but the real validation comes from treating the whole thing as a black box and seeing what outputs it produces from test inputs.  There are well-established techniques for ensuring this is done carefully, for example using different datasets for training the network and for testing how well the network really performs, but at the end of the day the network is only as good as the data it was given.

This is similar to "Garbage in, Garbage out," but with a slightly different wrinkle.  A neural net trained on perfectly accurate data and given perfectly accurate input can still produce bad results, if the context of the training data is too different from that of the input it was asked to evaluate.

If I'm developing a neural network for assessing home values, and I train and test it on real estate in the San Francisco Bay area, it's not necessarily going to do well evaluating prices in Toronto or Albuquerque.  It might, because it might do a good job of taking values of surrounding properties into account and adjusting for some areas being more expensive than others, but there's no guarantee.  Even if there is some sort of adjustment going on, it might be thrown off by any number of factors, whether housing density, the local range of variation among homes or whatever else.

The network, in effect, has a point of view based on what we might as well call its experience.  This is a very human, subjective way to put it, but I think it's entirely appropriate here.  Neural networks are specifically aimed at simulating the way actual brains work, and one feature of actual brains is that their point of view depends to a significant degree on the experience they've had.  To the extent that neural networks successfully mimic this, their evaluations are, in a meaningful way, subjective.

There have been some widely-reported examples of neural networks making egregiously bad evaluations, and this is more or less why.  It's not (to my knowledge) typically because the developers are acting in bad faith, but because they failed to assemble a suitably broad set of data for training and testing.  This gave the net, in effect, a biased point of view.

This same sort of mistake can and does occur in ordinary research with no neural networks involved.  A favorite example of mine is drawing conclusions about exoplanets based on the ones we've detected so far.  These skew heavily toward large, fast-moving planets, because for various reasons those are much easier to detect.  A neural network trained on currently known exoplanets would have the same skew built in (unless the developers were very careful, and quite likely even then), but you don't need a neural network to fall prey to this sort of sampling bias.  From my limited sample, authors of papers at least try to take it into account, authors of magazine articles less so and headline writers hardly at all.

Sunday, July 14, 2019

Computer chess: where now?

In an effort to wrap up this thread, at least for a while, here's an attempt to recap some of the major points and to conjecture about what might be next, and what this might tell us about chess, and intelligence in general.

Since playing perfect chess appears intractable, we have a classic tradeoff: give up on perfect play in order to fit our actual strategy into our limited resources.  Chess theory, combined with practice to develop pattern recognition and calculation, is optimized for the human brain.  Chess theory is just shorthand for "patterns and rules that players have discovered over the centuries."  These days, that very much includes discoveries by computer players.

For computers (at least the Von Neumann-style processors currently in commercial use), there are now two main options:

  • exhaustive search to a limited depth combined with alpha/beta pruning to avoid exploring moves that can't be as good as moves already considered, using an explicit set of rules to evaluate whether a position is good or bad (known as AB, after the alpha/beta part)
  • a completely opaque neural net evaluation of positions combined with a limited randomized search of possible variations (known as NN, for the neural net part), though you'll also see mention of MCTS (Monte Carlo Tree Search), referring to the randomized search.
There are some hybrid approaches that combine explicit rules with random search, but one way or another there has to be a tradeoff of performance for limited resources.

It's probably worth repeating that NN engines still consider far more possible continuations than humans can.  The ratio for human to NN to AB is roughly dozens to hundreds of thousands to billions.  We can assume that those last two numbers are going to increase over time.  Moore's law may (or may not) be tailing off, but there will still be improvements in software and bigger piles of hardware in the future.  There could also be breakthroughs like some sort of quantum chess computer that can examine vastly larger numbers of possible positions, in which case all bets are off.

It's interesting explore what human, AB and NB players do and don't have in common.  One common factor is the importance of what's traditionally called "positional" play, that is, taking into account factors beyond tactical concerns about winning material or forcing a quick checkmate.  Tactics are still important, but as the level of play becomes higher it becomes more and more important to consider factors that influence the longer-term course of the game, factors like pawn structure, space, initiative and so forth.

Positional play is interesting from a computing standpoint because it's not the primary objective of the game -- that would be checkmate -- or even the secondary objective -- generally considered to be winning material to make it easier to checkmate, or using the threat of winning material to force the opponent into a lost position.  The positional elements that have been developed over the centuries aren't readily obvious consequences of the rules.  They are human constructs aimed at making an intractable problem tractable.  Heuristics, in other words.  In broad terms, all three kinds of players are using the same approach -- rules of thumb plus calculation -- just in different mixes and with different rules of thumb.

It seems significant, then, that computers have, in effect, rediscovered positional factors in their play, even without having them explicitly programmed in.  Deep Blue beat Kasparov, by Kasparov's own assessment, by outplaying Kasparov positionally.  The surprise was that it did this with only basic knowledge of what makes a position good or bad and that the rest emerged from looking at large numbers of possible positions, much further ahead than a human could look.

Similarly, in their training phases, NN engines like Alpha Zero learn to play good moves without any obvious tactical benefit -- and indeed some that can look like blunders or at least unnecessarily risky at first glance -- without any cues beyond what happened to win in training games.  They do seem to produce more than their share of "wild" positions, and "unorthodox" moves, but even then their play can generally be described in terms like "an unusually aggressive pawn push to initiate a kingside attack" or "a positional sacrifice to obtain more active pieces", and not (except in endgames, it seems) "no idea ... looks like it picked that move at random".

Maybe that just means that we have enough terms for things going on in chess that at least some of them are bound to apply to any particular move or position, but even when engines play moves that flout previously accepted theory, such cases are probably the memorable exception.  In a great many positions each of the three types of player would find the others' preferred moves perfectly reasonable.  They might well disagree over which exact move is best, but in most cases their evaluations will be largely similar.  Typically they will all rule out the great majority of possible moves as inferior, and they will largely agree on which moves are inferior.   For that matter, AB engines and human grandmasters have been known to flout the same rules.  It's not just NN that can throw a curve ball from time to time.

All this is to say that if three radically different approaches to chess-playing can reach a reasonable consensus in discarding almost all possible moves and that their ranking among the best moves agrees reasonably well -- though by no means perfectly -- with accepted chess theory, then most likely there is something to accepted chess theory.  Call it an emergent property of the rules.

So where do we go from here?  Can we combine the strengths of the three approaches, roughly speaking high-level planning for humans, calculation for AB engines and positional assessment for NN engines?  This next bit is more speculation than the nice summary I was hoping for, but here goes:

It might seem almost self-evident that if we knew the full details of how grandmasters see the board then computers would be able to do even better, but it's not like it hasn't been tried.  Chess engine developers have been working with top-level human players for decades, with good results, but not always with the best results.  Stockfish, which currently rules the AB world, benefits from a large distributed team making continual tweaks to the algorithm, with only the best making it into the next release.  Contributions can come from anyone.  I'm sure a lot of contributors are also strong chess players, but it's not a prerequisite.

Alpha Zero, of course, dispenses with outside expertise entirely.  Other NN engines try to incorporate outside knowledge, but the ones that don't, notably Alpha Zero and its open-source cousin LC0, seem to do best.

Humans clearly do things differently from computers, but as I've said previously, it's quite possible that this is because these things work for humans but not for computers.  My guess is that trying to model the way humans formulate plans when playing chess is not going to lead to stronger chess engines.  At this point, both kinds of engines have a reasonable substitute for human-style conscious planning, namely calculating possible continuations en masse.

Even NN engines, which focus mainly on evaluating particular positions, benefit from this.  While NN engines may look downright clueless in endgames (even when they win), they look considerably less clueless at looser time controls, that is, when they are doing more calculation.  Looking far enough ahead appears indistinguishable from planning, even if that's not how we humans do it.

The ideal combination of current AB and NN approaches would evaluate positions as well as NNs and as fast as ABs.  While the original Alpha Zero work was done in a closed environment and (to my knowledge) hasn't been directly replicated by outside researchers, it's generally understood that Alpha Zero benefitted from using tensor processing chips that sped up its evaluation (one reason the Stockfish team argued that the Alpha Zero match results weren't a fair comparison), and that for that reason it would likely beat LC0, which runs on stock hardware.  That is, NN with faster, and thus more, calculation beats NN without it.

On the other hand, one big reason that NN engines got so much attention was that they didn't just win, but they did it in a noticeably different style.  The classic example is the "fawn pawn" (apparently a mishearing of "thorn pawn"), which is an advanced pawn, typically on the sixth rank, that can't be attacked by the defender's pawn, for example a black pawn on h3 when white has castled kingside and played ... g3.

A strong human player would likely look at such a position and think "Yeah ... that doesn't look good for white ... ", but NN players demonstrated just how bad it is, and therefore why it can be worth playing for even if you have to give up material or it doesn't look good to expose your own king and push a pawn that might have to be supported by pieces instead of other pawns.  The NN games also gave some insight as to when to push such a pawn.

It's not out of the question, in fact I'd conjecture  that it's likely, that AB engines can incorporate some of this knowledge directly into their evaluation functions.  The resulting code wouldn't have to capture everything the NN's network does, just enough to avoid losing to such attacks, by skewing just enough toward positions that avoid them.

More generally, an AB evaluation function doesn't need to capture everything an NN's does.  Some of that behavior will be due to overfitting or plain old random noise.  Nor do we have to understand exactly how the evaluation works.  There's a very direct way of seeing what an NN's evaluation function is doing: play games against it and see what it does.

Another option that comes to mind would be to use neural networks not for playing chess directly, but for tweaking the parameters of traditional chess engines.  I'm not sure this is a great idea, though.  One major lesson of both types of engine, in their own ways, is that it's risky to build in assumptions -- not necessarily doomed to failure, but risky.  If you're tweaking parameters in a traditional engine -- e.g., how much a rook on an open file is worth in relation to a pawn -- you're tuning weights in a system with a bunch of specialized, non-linear nodes while in a standard NN you're tuning weights in a (relatively) straightforward multilinear system.  It's not clear that the first option would work better.  It might, but it might not.

Looking at it from a more NN perspective, you could try to speed up evaluation by noticing this or that pattern of weights and replacing the generalized tensor calculations with ad-hoc functions that happen to work for those particular weights.  For example if the weights are "sparse", that is, most of them are zero, you can write out an explicit formula that combines the terms with non-zero weights.  In theory, you might come up with something that resembles some of the customary rules of thumb.  Maybe weights involving a particular kind of piece tend to emphasize empty squares that the piece could move to (open lines, mobility), or pairs of squares the piece could attack if it moved a certain way, and which hold enemy pieces (forks).

If all we know is that neural nets are combining basic features of a chess position according to some magic combination of weights, then all we can say is "there's something that current explicit evaluation functions don't seem to be capturing."  If that's how it stays, then we might expect neural networks to outperform engines like Stockfish in the long run.  They're already at least comparable, and they haven't been around for nearly as long.

It's quite possible, though, that a neural network's evaluation function is capturing a handful of quantifiable factors that current explicit evaluation functions aren't currently capturing.  In that case, an engine like Stockfish with an updated evaluation function should have the upper hand.  It could choose promising positions as well as, or about as well as, a neural net, but it would examine them much more efficiently.

It's also not clear how much room is left for neural networks to improve.  Playing winning chess is just plain hard.  Training a network for twice as long generally doesn't produce twice as good a result.  It may be that network of evaluation of individual positions is about as good as it's going to get.

For that matter, though, neither does looking at twice as many moves, or even twice as deep (which should roughly square the number of moves examined), make an AB engine twice as strong.  Currently AB engines can examine billions of positions and around twenty plies deep (twenty levels of move/countermove, or ten full moves) in a typical midgame position.  Looking at trillions of positions would mean looking more like thirty plies ahead.  That ought to yield some sort of improvement, but how much?  Most high-level games are draws in any case.

It's interesting that AB and NN engines appear to be more or less evenly matched.  This may be evidence that we're reaching the limits of what computer chess players can do.  Or it may not.

In the discussion above, I talk mostly about evaluation functions, weights on numbers and so forth, and not so much about "planning" or "understanding" or "insight".  This was a deliberate choice.  Chess engines are nothing more or less than computer programs.  It's fascinating that they can clearly display some aspects that we associate with intelligence, while clearly lacking others.  Neural networks don't fundamentally change this -- they're still just doing calculations -- but they do seem to capture another chunk of what brains do, namely fuzzy pattern matching.

Much discussion of developments like AlphaZero and similar neural network-based programs focuses on what we're learning about algorithms, and in particular how to harness "deep neural networks" (that is, multi-stage neural networks) to solve problems that are widely agreed to require intelligence.

That's fair enough, but in that process we're also learning about the problems themselves.  If intelligence is the ability to solve certain kinds of problems, then solving problems by machine tells us something about what it means to be intelligent.  While it's natural to want to move the goal posts and try to narrow "intelligence" down so it continues to mean "behavior we can't emulate reasonably well with machines," I argue in the first of the previous posts, that's probably a losing game.  The winning game, I think, is gaining a better understanding of various capabilities that we tend to throw under the blanket term "intelligence".

How far are we from perfect chess?

Playing a perfect game of chess appears to be an intractable problem.  Certainly the total number of possible positions is far too large for a direct tabulation of best move for each possible position to be feasible.  There are far more possible chess positions than subatomic particles in the observable universe.

To be sure, almost all of these positions are extremely unlikely to appear in any reasonable game of chess, much less a perfectly played one.  All you would really need to know is what the best first move is, what the second move is for every reply, and so forth.  Since most possible replies are not good moves, this might thin out enough that it would be possible to store everything in a large database and/or write general rules that will cover large numbers of possible positions.  In other words, it might (or might not) be feasible to write down a perfect strategy if we could find it.  But we're nowhere close to finding it.

Nonetheless, there has been a lot of progress in the decades since Deep Blue beat Kasparov.  It's now quite clear that computers can play better than the best humans, to the point that it's a bit hard to say exactly how much better computers are.  There are rating numbers that imply that, say, Stockfish would beat Magnus Carlsen X% of the time, but they probably shouldn't be taken as anything more than an estimate.  We can say that X is probably somewhere in the 90s, but that's about it.

Chess rating systems are generally derived from the Elo system (named after Arpad Elo), which tries to quantify playing strength as a single number based on players' records against each other.  Two equally-rated players should have roughy equal numbers of wins and losses against each other, plus however many draws.  As the rating difference increases, the stronger player should win more and more often.

Ratings are recalculated in light of actual results.  If two equally-rated players draw, nothing will change, but if a low-rated player draws against a higher-rated one, the higher-rated player will lose points and the lower-rated player will win points.  Likewise, the winner of a game will gain points and the loser will lose points, but you get more points for beating a higher-rated player and you lose more for losing to a lower-rated player.

Over time, this will tend to give a good picture of who's better and how much better, and if the parameters of the rating formula are tuned well, it will give a pretty good prediction of winning percentages.  It's interesting in itself that reducing chess skill to a single number on a linear scale works as well as it does -- see this post for more than you probably want to read about that.  The point here, though, is that to get useful ratings you need a pool of players all playing against each other. 

You don't need a full round-robin of thousands of players, of course, but things need to be reasonably interconnected.  If you're an average club player, you probably won't be playing many games against the world's top ten, but the strongest players in your club may well have played people who've played them, and in any case there will be multiple reasonably short paths connecting you to them.

This isn't the case with humans and computers.  People, including grandmasters, do play computers a lot, but generally not under controlled match conditions.  To get useful ratings, players being rated should be reasonably well matched.  If player A beats player B nineteen times and draws once out of twenty games, we know player A is far stronger, but we can't say with confidence how much stronger.

Since that's a realistic scenario for a top computer player vs. a top human player, there's little reason to pay a top player to take time out of there schedule for an exhausting and demoralizing match that won't tell us much more than we already know.  Probably the best approach would be to dial back the parameters on a few well-known engines, with a mix of AB and NN, to produce a set of standard players that could play humans at, say, 200 rating-point intervals.  It would probably be much easier to get human players to play against the "human-level" players, knowing they might be able to win bragging rights over their friends.  The human-level computer players could thus be calibrated against actual humans.

Calibrating the top computer players against them would take a lot of games, but that's not a problem for computers.  In the last computer tournament I watched, the bottom player -- which, to be sure, lost quite heavily -- had a rating in the 2900s.  Carlsen is currently in the high 2800s.  Again, these are essentially two separate rating systems, but for historical reasons they're probably fairly close, so it should be possible to link the two systems if anyone really wants to.

While it would be interesting to get more detail on ratings across humans and computers, it doesn't really change the story of "computers can beat humans easily" and by itself it doesn't shed much light on the limits of how well chess can be played.  An Elo-style rating system doesn't have a built-in maximum rating.  There is, in theory, a best-possible player, but we don't know how strong that player is or, put another way, we don't know how close the current best engines are to playing perfect chess.

It is interesting, though, that stronger players tend to draw more against each other than weaker players.  Amateur-level games are often decisive because one player or the other blunders (or, more accurately, blunders more than the other player does).  Higher-level players blunder less, and chess engines don't blunder at all, or at least the notion of "blunder" for an engine is more on the order of "didn't realize that having an enemy pawn that far advanced on the kingside could lead to a devastating attack".  The finals of the last computer tournament I watched consisted almost entirely of draws.

While it's entirely possible that perfectly-played chess is a win for white (or, who knows, a win for black), and the top engines just aren't finding the right moves, it seems more likely to me that perfectly played chess is a draw and engines are basically getting better at not losing, while being able to ruthlessly exploit imperfect play when it does happen.

If this is the case then it's quite possible that ratings will top out at a point where decisive games become exceedingly rare.  A plausible match result might be 1-0 with 999 draws.  If rules required a certain margin of victory, that might essentially be a matter of flipping coins until there are are that many more heads than tails or tails than heads.  The "law of large numbers" doesn't forbid this, it just says that you'll need a lot of coin flips to get there.

We could well get to a point where the championship wends in a score of 1439-1429 with 50,726 draws or something like that.  At that point writing chess engines becomes similar to cranking out more and more digits of pi -- interesting to a small group of people, and useful in stress-testing systems and tools, but no longer of general interest, even among geeks.

The players would still not be playing perfect chess, but most likely they would nearly hold their own against perfect play.  If player A's record against player B is 12-0 with 123,765 draws, their ratings will be essentially identical, even if A is playing perfectly and B is only almost-perfect.  I wouldn't be surprised if someone's already done a regression based on a scenario like this and calculated a maximum rating from it.

Sunday, June 23, 2019

Grandmasters don't play twelve-dimensional chess

A couple of caveats before I start:

First, in case it's not clear by now, let me say explicitly that I'm not a grandmaster, or any kind of master, or even close to being any kind of chess master.  I have, however, watched a few play, browsed through master-level games and read articles and interviews on the subject, not to mention losing quite a few 5-minute games to master-level players and occasionally asking them to explain where I went wrong.

Second, while grandmaster is a title with very specific qualifications, reserved for the best of the best, I'm using it here as a stand-in for any strong player with the same general mindset and skills that I describe here, regardless of how well they actually do in tournament play.

With that out of the way, imagine a critical point in a cheesy movie:

"Welcome to my trap.  I knew you'd come here, just as I know that you will now try to use your secret radio transmitter to summon help, which is why I have taken the precaution of jamming all radio frequencies.  Don't even think about trying to cut off power to the jamming equipment.  It's encased in an unobtanium vault with battery power for the next three months.  Checkmate."

"Ah, my worthy opponent, I did not anticipate that.  But I did not need to.  There is no secret radio transmitter, only this aluminum box of curiously strong mints that I knew your scanning equipment would register as one.  No, I will not be needing a radio today.  I need only wait here and do nothing, while the new head of security you hired comes to my rescue.  Next time be more careful who you use for background checks.  You never know who they might be working for.  Don't bother summoning your bodyguards.  They're already safely locked away where they will do no harm to anyone."

"But ... but ... my plan was perfect.  I accounted for every last detail!"

Characters like this may be casually referred to as chess masters, but as far as I can tell this is not really how actual grandmasters play.

While it's true that many games are won or lost by one player catching something the other missed, it's not necessarily from thinking one move farther ahead, or from calculating every last possibility, but more typically from recognizing the importance of occupying a key square, or from opening or closing some particular line of attack, or knowing which rook to move onto a particular file.  This, in turn, generally comes down to who has the better plan.

There will certainly be points in a game where a grandmaster will pause and think deeply about moves, countermoves, counter-countermoves and so forth, and they can do this better than most players, but that's probably not the real distinguishing factor in their play.  Likewise, while tactics are certainly important in top-level games, the better tactician may or may not end up the winner.

One good illustration of this is the simultaneous exhibition, where a professional typically takes on a few dozen amateurs, moving from one to the next for each move (with occasional pauses for a longer exchange) and often winning every game.  If human chess were a matter of sheer calculation then winning every game would mean one person keeping all the possible continuations of 50 different games in mind, or being able to re-calculate them in the few seconds they usually spend at each board.

But of course this is not what's going on.  The pro is relying on experience, often looking for quick tactical wins by exploiting typical amateur mistakes.  They're still planning, just not in detail, maybe something on the order of "That knight is lined up with the king.  I can pin it, then bring my own knight to this square, put my queen on this diagonal and either attack the king directly or just win that rook if they don't notice I have a fork there.  Am I leaving any pieces hanging?  No.  Is my king exposed? No?  Great. On to the next one ..."

A human grandmaster can do this because they see the board not as a collection of pieces on squares but in a more structured way, something like a collection of patterns and possibilities.  Even if the full details of how this is done are elusive, there are hints, like the well-known experiments in which a grandmaster can reconstruct a position after a quick glance while a beginning player or a non-player takes much longer and makes many more mistakes.

Someone without much experience might think "there was a rook on this square, a pawn on this square ... or was it a bishop?" and so forth.  A grandmaster might think something more like "both sides have castled kingside, white's queen bishop is fianchettoed ..." or "this started out as such-and-such opening" or maybe even "this is like that game I lost to so-and-so and that rook is where I should have put it"

When it comes to playing over the board, a strong player will know many more possibilities to check for than a weaker one, and many more possible hazards to avoid.  One site I ran across gives a checklist of more than two dozen things to check for on every move, just to play a strong amateur game.  I have no doubt that this makes for stronger play, but I personally don't have the patience for this, or the experience for it to have become second nature, so it should come as no surprise that I'm not a strong player.

The most important factor, though, is planning.  If you just play likely-looking moves against a grandmaster, but without a clear plan, you'll almost certainly find yourself in deep trouble before long because they did have a plan.  That innocuous bishop move, that weird-looking pawn push and the shift of a rook from a reasonable-looking square to a different reasonable-looking square were all setting up the attack that's now making your life miserable.  If only you could move your king out of the way -- and that's what that bishop move was really about.

As I mentioned in a previous post, AB engines really do play the kind of insanely calculating game that we associate with the cliche chessmaster character, while NN engines do the same sort of pattern recognition that allows a grandmaster to size up a situation without having to do a lot of calculation, but neither is doing the sort of large-scale planning that a human grandmaster is.  It's also worth reiterating that, while even early chess engines were able to out-calculate human players, it took decades before they could consistently win games against them.

All that said, I think it would be a mistake to think that the planning grandmasters do is some sort of humans-only ability that computers could never emulate.  There's no reason a computer couldn't represent and execute a plan on the order of "Put your bishop on this diagonal, push these pawns and go for a kingside attack involving the bishop, rook and queen," without exhaustively checking every move and countermove.  It just hasn't proven to be an approach that computers can use to win games.

Wednesday, May 22, 2019

What do chess engines understand about chess?

[This post assumes a familiarity with terminology from previous posts.  Rather than try to gloss everything, I'm going to punt and respectfully ask the confused reader to at least skim through the previous few posts]

It's clear from watching a bunch of games that the two main types of chess engines have different styles, though not all that different.  It's not as though anyone suddenly decided 1. g4 was a great opening.  In broad strokes:

AB engines play very precisely and their evaluations of positions are fairly conservative.  It's quite common for an NN engine to give a large advantage to one side in a position that an AB engine rates as fairly even.  The ratings eventually come in line, but the convergence can happen in either direction.

If the NN engine is right, the AB eval comes up to meet it as the NN exploits an advantage that the AB engine missed.  If it's wrong, the NN eval comes down and the result is often a draw, either because the NN engine failed to play precisely enough to convert a winning advantage, or the winning advantage was never really there to begin with.  It would be interesting to track down statistics on this, and any interesting exceptions, but I don't have easy access to the data.

When NN engines first hit the scene, the initial take was that neural nets had captured some subtlety of positional understanding that the AB engines, with their relatively simple hard-coded evaluation rules couldn't fathom, even given the ability to look many moves ahead.  This is mostly accurate, I think, but incomplete.

In the Blitz Bonanza, there were any number of NN wins over AB engines, at least the weaker ones, that started with the advance variation of the French Defense (1. e4 e3 2. d4 d4 followed soon after by white e5).  This opens up a lot of space for white on the kingside and tends to push the black pieces over to the queenside -- or maybe "lure" is a better word, since there are plausible things for them to do there, including attacking white's pawn center.  The problem is that white can now launch a fairly potent kingside attack and black doesn't have time or space to get enough pieces over to defend.

Apparently none of this looks too bad to an AB engine.  Yes, white has an advanced pawn in the center, but there are chances to counterattack and weaken that pawn.  Once the white attack really gets going, often aggressively pushing the g and h pawns, its king may start to look correspondingly weak -- except black doesn't have any pieces well-located to attack it.  A typical AB engine's eval function doesn't try to account for that -- looking ahead it will either find an attack or it won't.

Probably this disconnect of white seeing good chances and black finding plusses and minuses to the same moves continues until black can see actual tactical threats coming, at which point it's toast since its pieces are now bottled up on the wrong side of the board.  The natural explanation is that the NN engine understands that there are good attacking prospects on the kingside and blacks pieces are poorly placed to defend.

What's actually going on is probably a bit different.  In its training phase, the code that trains the NN found that positions with, say, more black pieces on the queenside, lots of open squares on the kingside, and pawns pushed up near the black king tend to lead to wins.  It didn't specifically note this as such, but those were some of the positions which -- in oversimplified terms -- caused some of the weights that favored those kinds of positions to be increased.

A human playing in the same style would probably say "My plan was to open up some space, get black's pieces bottled up on the queenside and launch a kingside attack".  But here there is no plan.  White is playing moves that tended to work well in similar situations in training.

As I mentioned in a previous post, this lack of a plan becomes blindingly obvious in endgames.  Many endgames have an almost mathematical feel to them.  White is trying to hold a draw.  Black is up a bishop for two pawns.  Picking off one of those pawns would allow one of black's own pawns to promote unimpeded.  But white's pawns are all on the wrong color square for the bishop to attack and they're blocked against black's pawns (and vice versa), so all white has to do is keep black's king away.

Conversely, though, if black can draw white's king away from where it's blocking black's king, it can sneak its own king through and pick up the undefended pawns.  And there just happens to be a tactical threat that will accomplish that ...

A human player can figure positions like this out logically and win or hold the draw, as the case may be, even against perfect play.  An AB engine can often look ahead far enough to find the winning variation.  An NN engine might or might not stumble on the right move.

Likewise, it's common for an NN engine to see a large advantage in a position that's clearly a draw.  Suppose in the previous scenario there's a way for white to safely shuffle its king back and forth between two squares and keep the black king from getting through.  Quite likely the AB engine will evaluate the position as dead even, and the human kibitzers will agree. The NN engine, if it's playing black in this example, will continue to see a sizable advantage almost up to the 50-move limit, even though all it's doing is rearranging its pieces in order to avoid threefold repetition.

If understanding chess means having a feel for what kind of positions have good or bad prospects without trying to calculate every possible variation, that is, playing like a human, then NN engines are clearly doing something similar.

On the other hand, if understanding chess means being able to find the right moves to win or draw a difficult endgame, that is, playing like a human, then AB engines clearly have the upper hand.

But if understanding chess means being able to reason about why an endgame is won, lost or drawn, or being able to put together a general plan of attack without knowing in advance exactly what moves need to happen in what order, that is, playing like a human, then neither type of engine is even trying.

What's chess to a computer?

It's now quite clear that computer chess is going to eat this blog for a while, so I'm in the process of unpacking several points that have come up in my earlier attempts to just make a few quick points and move on.  I'm also coming to realize that, since I'm having even more trouble than usual organizing this into a coherent whole, I'll probably end up with N posts worth of material spread across N+1 posts, if not more.  Apologies.  This particular post is an excursion into computer science land to look at computer chess from that point of view, rather than trying to understand it as chess.

Chess, fundamentally, falls into a large class of problems called tree searches, in which a number of interconnected structures are explored by proceeding from a starting point to those that are connected directly to it, and those connected directly to the first set, and so forth.  There are lots of different ways to do this, depending mostly on what order you do the visiting in.

Examples of tree searches, besides finding good chess moves, include figuring out the best way to route packets in a network, solving a puzzle where odd-shaped blocks have to be arranged to form a cube, finding the shortest path through a maze and following reflections and refractions to find what color a particular pixel in an image will be (ray-tracing).  There are many others.

Each point in a tree search is called a node.  In chess, the nodes are chess positions (including where the pieces are, which side is to move and a few other bits such as whether each side's king has moved).  At any point in a tree search you are either at a leaf node, meaning you have no further choices to make -- in chess, you have arrived at a win, loss or draw, or more often simply decided not to explore any deeper -- or not, in which case you have one or more child nodes to choose from.  In chess, the children of a position are all the positions that result from legal moves in the current position.

Since there is no element of chance in chess, there is one and only one game tree that starts at the initial position (the root node) and includes every possible position resulting from every series of legal moves.  This means that in theory there it's possible to know whether any position is a win for white, a win for black or a draw, assuming each side plays perfectly.  That includes the initial position, so in theory there is an optimal strategy for each player and all perfectly-played games of chess have the same result.

The problem is that the game tree for chess is incomprehensibly huge.  There are way, way more legal chess positions than fundamental particles in the known universe.  No computer will ever be able to examine more than a minuscule fraction of them.  That doesn't necessarily mean that chess will never be "solved", in the sense of proving what the optimal strategies are for each player and what the result will be with perfect play.  In theory there could be a way to rule out all but a tiny fraction of possible positions and exhaustively search what's left.  In practice, we're nowhere near being able to do that.

Instead, we're left with approximate approaches.  Either exhaustively search a tiny, tiny fraction of the possibilities, which is what AB engines do, or find some sort of approximate measure of what constitutes a "good" position and "just play good moves", using a partial search of the possible continuations to double-check that there aren't any obvious blunders.  This is what humans and NN engines do.

One thing that makes chess particularly interesting as a search problem is that it's not totally random, but it's not totally orderly either.

Imagine a search tree as a series of branching passageways.  At the end of every passageway, after however many branches, is a card reading "black wins", "white wins" or "draw".  If all you know is that the cards are there then you might as well pick randomly at each junction.  On the other hand, if you have a map of the entire complex of passageways and which card is at the end of each of them, you can find your way perfectly.  Depending on where the cards are, you might or might not be able to win against a perfect player, but if your opponent makes a mistake you'll know exactly how to take advantage.

In chess there are far too many passages to map out, but there are markings on the walls.  For example, there are several types of endgame positions which even a human can always play perfectly.  If you reach a position where you have a king and a rook and your opponent only has a king (and it's not a stalemate to start out with), you can win by following a pretty simple recipe.

There are also databases of positions ("tablebases") with known outcomes, found by exhaustively searching backwards from all possible checkmate positions.  Currently every endgame with seven or fewer pieces (including the kings) is known to be either a win for white, win for black or a draw. If it shows up by working backward from one of the possible checkmates, it's a win.  If it didn't, it's a draw (the process of working backwards from known positions is itself a tree search).

There are a few kinds of positions that are well-known to be immediate wins for one side or the other, regardless of what other pieces are on the board.  The classic example is a back-rank mate, where one side's king is blocked between the edge of the board and its own pawns and the other side has major pieces able to give check.  It's simple to calculate whether the defending side can capture all the pieces that can give check or can safely interpose (block).  If not, it's game over.  My understanding is that chess engines don't bother to special-case these, since they're typically looking several moves ahead anyway.

And then there's a lot of gray area.  If one side gains a material advantage, it will generally win, eventually, but this is far from an ironclad rule.  There are plenty of cases where one side will give up material (sacrifice) in order to gain an advantage.  This ranges from spectacular combinations where one side gives up a queen, or more, in order to get a quick checkmate, to "positional sacrifices" where one side gives up something small, generally a pawn, in order to get an initiative.  Whether a positional sacrifice will be worth it is usually not clear-cut, though in some well-known cases (for example, the Queen's gambit), it's generally agreed that the other side is better off not taking the material and should concentrate on other factors instead.

There are also temporary sacrifices where one side gives up material in order to quickly win back more.  If you're playing against a strong player and it looks like they've left a piece hanging, be very careful before taking it.

In short, which side has more material is a clue to which side will win, and it's often a good idea to try to win material, but it's not a guarantee.  This goes equally well for other generally accepted measures of how well one is doing.  Maybe it's worth it to isolate or double one of your pawns in order to gain an initiative, but often it's not.  Games between human masters often turn on whose estimations of which small trade-offs are correct.

From a computing point of view, since it's impossible to follow all the passageways, we have to rely on the markings on the wall, that is, the characteristics of particular positions.  An AB engine will look exhaustively at all the passages that it explores (subject to the "alpha/beta pruning" that gives them their name, which skips moves that can't possibly lead to better results than are already known).  But in most positions it will have to give up its search before it finds a definitive result.  In that case it has to rely on its evaluation function.  Basically it's saying "I've looked ten branches ahead and if I make this move the worst that can happen is I'll get to a position worth X".

That's much better than picking randomly, but it only works well if you explore quite deeply.  A good human player can easily beat a computer that's only searching three moves (six plies) ahead because the human can easily find something that looks good in the short term but turns bad a few moves later.  Once an AB engine is looking dozens of moves ahead, though, there are many fewer short-term traps that don't pan out until late enough for the search to miss them.

This is a lot like the problem of local minima in hill-climbing algorithms.  If you're trying to get to the highest point in a landscape, but it's so dark or foggy you can only see a short distance around you, your best bet could well be to go uphill until you can't any more, even though that fails if you end up atop a small hill that's not actually the high point.  The further you can see, the better chance you have of finding your way to the true high point, even if you sometimes have to go downhill for a while to get on the right path.

Humans and neural networks, however, are doing something slightly different.  They can't look very far, but they're able to read the landscape better.  It's kind of like being on a mountainside in the fog and being able to look at the vegetation, or smell the wind, or examine the types of rock in the area or whatever, and be able to say that the peak is more likely this way than that way.

This wouldn't work if there weren't some sort of correlation between the features of a position and how likely it is to ultimately lead to a win.  It's not a property of tree-searching algorithms.  It's a property of chess as a tree-searching problem.  It's probably a property of any chess-like game that humans find interesting, because that's probably what we find interesting about chess: there's some order to it, in that some positions are clearly better than others, but there's some mystery as well, since we often can't tell for sure which option is better.

Tuesday, May 21, 2019

Computer chess engines fall in line

[This post came out of trying to figure out what computer chess results tell us about what kind of problem chess-playing is.  It goes into quite a bit of detail about cycles where player A beats player B beats player C beats player A and why they don't seem to happen.  Shortly after I posted this, the Computer Chess Championship "Deep Dive" round finished up, and there was a slight anomaly: Stockfish won the tournament, but second-place LC0 beat it head to head, by the narrow margin of 4-3-43, implying an Elo difference of 8 points.  I don't think this affects the post greatly, but it does reinforce the point that NN engines are less consistent than AB engines.  Stockfish didn't lose a game to 4th-place Houdini while LC0 both won and lost.  I'll update this slightly when makes the full results available --D.H. May 2019]

One thing jumps out from watching the ongoing computer chess tournaments that is so ordinary-seeming that it's easy to dismiss entirely:  The rankings from a preliminary round with a couple dozen contestants, are strikingly linear.  If you look at the cross-table, which shows everyone's results against everyone else's, wins in green, losses in red and draws in gray, not only do everyone's scores improve down the table, but for the most part all the red is above and all the green below.  Stronger engines very consistently beat weaker engines, especially considering that at least some of the apparent upsets were due to system crashes and such.

I've seen this happen twice now, once for the "Blitz Bonanza" (5 minutes plus two seconds per move) and the Deep Dive (15 minutes plus 5 seconds per move).  The Deep Dive preliminary, which just finished a few days ago,  was played in "escalation" format, meaning that each new player plays everyone who's already played (twice as white and twice as black).  After a few opponents, you can be pretty confident about where a new player will land -- above whoever whoever it's beat head-to-head, below whoever it's lost to, and close to anyone it drew all four games with.

It shouldn't be a total shock that stronger players beat weaker players, or even that they beat them very consistently.  My point is more that it doesn't have to be that way.  While there are only two basic styles of engine*, each dev team has its own philosophy and each one evaluates positions differently.  In the case of neural net-based engines (generally referred to as NN), the evaluation of positions is done radically differently.

It seems at least plausible that, as with human players, engines might tend to do better or worse than otherwise expected against particular styles of play.  An engine that attacks aggressively might lose to one that's tuned for careful defending, but beat a middle-of-the-road approach that can outplay the careful defender positionally.  This would create a cycle of attacker beats middle-of-road beats defender beats attacker, but this sort of situation doesn't seem to happen in practice.

Without going back and checking the actual numbers, my impression is that AB engines -- the conventional style which does an exhaustive search a far ahead as it can manage, using fairly understandable hand-coded rules to evaluate positions quickly -- win more by being able to look further ahead than by having cleverer evaluation rules.  That is, evaluation speed is a good predictor of who will win.

Clearly if two engines have identical evaluation rules, but one implements them more efficiently, the faster engine should win because it can see further ahead.  The faster engine will see everything the slower engine sees, but it can also see which of two seemingly equal moves leads to a better result.  It will therefore be able to steer toward winning positions and the slower engine won't see the danger until it's too late.

Even with each team using its own particular evaluation rules, something like this still happens.  Typically when two AB engines play, there will be a period where the weaker engine shows a more favorable evaluation for itself than the stronger one does, e.g., the weaker engine is playing white and it sees a half-pawn advantage for white while the stronger engine sees a small advantage for black.

The weaker engine thinks everything's going fine while the stronger engine knows that it has the edge.  Eventually the weaker engine's evaluation starts to drop as it realizes things are no longer as good as they looked, and the stronger engine pushes its small advantage into an overwhelming one.

Again, it isn't a foregone conclusion that a faster engine will beat a slower one with a different evaluation function, but it's what we see in practice.  My suspicion is that evaluation functions hit diminishing returns after certain point.  You've accounted for material, you've got some concept of tactical considerations like pins and hanging pieces, and a bit on positional factors like space and pawn structure.   Further improvements are certainly possible, in the form of slightly better evaluation functions, slightly faster evaluation, or both, but the low-hanging fruit was picked long ago.

Adding a detailed model of, say, pawn structure, is likely just to slow the evaluation down without providing enough benefit to make up for the decrease in depth.  This is the "dumb is smarter" principle that seems to have held sway for the past few decades.

Stockfish -- currently the clear champion among AB engines -- demonstrates this nicely with its "fishtest".  As I understand it, developers try out changes to the program to see if their locally patched version consistently beats the current released version.  Only winning changes are brought into the next release.  While Stockfish has improved significantly as a result of fishtest, progress has been incremental, and the end result is that Stockfish can evaluate positions very quickly.  I'm not 100% sure it's the fastest, but it's certainly high up there.

I'm sure everyone, open source or not, playtests potential improvements (and the training phase of NN engines take this to an extreme), but crowdsourcing this ought to keep development from fixating on some particular person or small group of people's idea of what must be right.  Again, fewer assumptions means better results.

Bringing NN engines into the picture doesn't seem to change this linear ranking.   NN engines are on the other extreme of the smart vs. fast tradeoff.  For a given number of layers and so forth a position generally takes just as long to evaluate regardless of the exact parameters of the net being used, so the only difference in play will be due to the net itself.  Evaluation speed is not a factor since it doesn't change.

This doesn't necessarily rule out A-beats-B-beats-C-beats-A cycles, in fact, it seems like they might be more likely since different nets will put different weights on different factors -- assuming there's more than one way to be good at chess.  So far, though, there don't seem to be any.  In the tournament results I've seen so far, there are three strong NN engines: Leela Chess Zero (generally known as LC0) Leelenstein and Antifish (a modified Leela trained specifically to beat Stockfish).  In the Blitz Bonanza, LC0 had winning records against the other two and Leelenstein had a winning record against Antifish.  Antifish does not seem to do better than the others against Stockfish.  If anything it does slightly worse.

There are only two NN engines in the Deep Dive finals (rules now prohibit more than one variant of the same engine reaching the finals, so Antifish is out).  Obviously there will be no cycle between them. For what it's worth, LC0 is currently slightly ahead of Leelenstein, but they're dead even head-to-head.

There is one clear deviation from the orderly world of AB engines.  Since NN engines evaluate so slowly, they take a random sample of positions when looking ahead, rather than searching exhaustively like AB engines.  I believe the sample is weighted towards moves the net considers likely to be good, but in any case this introduces some inconsistency.  NN engines occasionally lose to weak AB engines since not all good-looking moves turn out actually to be good, while strong AB engines almost never do.  Against strong AB engines, NN engines can find good moves that the AB opponent overlooks because it doesn't lead to good results quickly enough for the exhaustive search to turn it up, but they can also play good-looking moves that turn out to be unsound if you search well enough.

The net effect is that strong NN engines don't do as well against weak AB engines as you'd expect from how they do against strong engines.  In the current Deep Dive finals, Stockfish is (at this writing) noticeably ahead of LC0, but head-to-head they're nearly even (3 wins to 2 with 41 draws, implying an ELO difference of 8 points; by contrast, human world champion Carlsen is currently rated about 57 points ahead of second-ranked Caruana) [and in fact, Stockfish won the finals but lost to LC0 head-to-head].

The difference is that Stockfish has done better against the weaker opposition.  Stockfish is 9-0 against 4th-place Houdini (with a whole bunch of draws), while LC0 is 12-3 against Houdini (with correspondingly fewer draws).  Stockfish also does better against the other NN engine, 3rd-place Leelenstein.

So there don't seem to be any cycles in the ranking of chess engines, and the (small) difference between top AB engines and top NN engines looks to be due to consistency.  What does this tell us about chess-playing as a problem?  There are several possible conclusions:
  • Chess skill really is a linear ranking, at least once you get to the level of computer engines.  If A beats B and B beats C, then A beats C.
  • A bit stronger: There is one true way to play chess and stronger engines approximate this better.
  • On the other hand: AB engines are linearly ranked because only depth/evaluation speed really matters while NN engines are linearly ranked because only the quality of the net doing the evaluation matters.  It's not surprising, though not inevitable, that combining two linear rankings gives another linear ranking.
  • Also, NN engines are all playing on or near the level of the current AB champion, so they're really only being compared to one style of play since there's only one really competitive opponent.
My feeling is that we don't really have enough to know for sure, but most likely we'll continue to see improvement in playing strength and that the linear ranking will hold.  Next year's champion will be able to beat any other opponent head-to-head, the one after that will be able to beat it and everyone else, and so forth, regardless of whether the next champion is AB, NN or some hybrid of the two.  There are noticeable differences between how AB engines and NN engines play (I'd like to get into the details in a future post), but be that as it may, stronger is stronger.

* Technically, there are two choices on two dimensions: Alpha/Beta tree search vs. Monte Carlo and neural-net based evaluation vs. hand-coded rules.  However, NN engines always use Monte Carlo since it takes them roughly a thousand times longer to evaluate a particular position, so they can't search exhaustively.  On the other hand, conventional engines generally use AB searches because they can.  There have been fairly successful experiments in using Monte Carlo with hand-coded rules.  Those tend to be called hybrid.