Tuesday, March 10, 2020

Memory makes you smarter

Another sidebar working up to talking about the hide-and-seek demo.

Few words express more exasperation than "I just told you that!", and -- fairly or not -- there are few things that can lower someone's opinion of another person's cognitive function faster than not remembering simple things.

Ironically for systems that can remember much more data much more permanently and accurately than we ever could, computers often seem to remember very little.  For example, I just tried a couple of online AI chatbots, including one that claimed to have passed a Turing test.  The conversations went something like this:
Me: How are you?
Bot: I'm good.
Me: That's great to hear.  My name is Fred.  My cousin went to the store the other day and bought some soup.
<a bit of typical AI bot chat, pattern-matching what I said and parroting it back, trying stock phrases etc.>
Me: By the way, I just forgot my own name.  What was it?
<some dodge, though one did note that it was a bit silly to forget one's own name>
Me: Do you remember what my cousin bought the other day?
<some other dodge with nothing to do with what I said>
The bots are not even trying to remember the conversation, even in the rudimentary sense of scanning back over the previous text.  They appear to have little to no memory of anything before the last thing the human typed.

Conversely, web pages suddenly got a lot smarter when sites started using cookies to remember state between visits and again when browsers started to be able to remember things you'd typed in previously.  There's absolutely nothing anyone would call AI going on, but it still makes the difference between "dumb computer" and "not so bad".

When I say "memory" here, I mean the memory of things that happen while the program is running.  Chess engines often incorporate "opening books" of positions that have occurred in previous games, so they can play the first few moves of a typical game without doing any calculation.  Neural networks go through a training phase (whether guided by humans or not).  One way or another, that training data is incorporated into the weightings that determine the networks behavior.

In some sense, those are both a form of memory -- they certainly consume storage on the underlying hardware -- but they're both baked in beforehand.  A chess engine in a tournament is not updating its opening book.  As I understand it, neural network-based chess engines don't update their weights while playing in a tournament, but can do so between rounds (but if you're winning handily, how much do you really want to learn from your opponents' play?).

Likewise, a face recognizer will have been trained on some particular set of faces and non-faces before being set loose on your photo collection.  For better or worse, its choices are not going to change until the next release (unless there's randomization involved).

Chess engines do use memory to their advantage in one way: they tend to remember a "cache" of positions they've already evaluated in determining previous moves.  If you play a response that the engine has already evaluated in detail, it will have a head start in calculating its next move.  This is standard in AB engines, at least (though it may be turned off during tournaments).  I'm not sure how much it applies for NN engines.   To the extent it does apply, I'd say this absolutely counts as "memory makes you smarter".

Overall, though, examples of what we would typically call memory seem to be fairly rare in AI applications.  Most current applications can be framed as processing a particular state of the world without reference to what happened before.  Recognizing a face is just recognizing a face.

Getting a robot moving on a slippery surface is similar, as I understand it.  You take a number of inputs regarding the position and velocity of the various members and whatever visual input you have, and from that calculate what signals to send to the actuators.  There's (probably?) a buffer remembering a small number of seconds worth of inputs, but beyond that, what's past is past (in fact, there's some evidence that what we perceive as "the present" is basically a buffer of what happened in the past few seconds).

Translating speech to text works well enough a word or phrase at a time, even if remembering more context might (or might not) help with sorting out homonyms and such.   In any case, translators that I'm familiar with clearly aren't gathering context from previous sentences.  It's not even clear they can remember all of the current sentence.

One of the most interesting things about the hide-and-seek demo is that its agents are capable of some sort of more sophisticated memory.  In particular, they can be taught some notion of object permanence, usually defined as the ability to remember that objects exist even when you can't see them directly, as when something is moved behind an opaque barrier.  In purely behavioral terms, you might analyze it as the ability to change behavior in response to objects that aren't directly visible, and the hide-and-seek agents can definitely do that.  Exactly how they do that and what that might imply is what I'm really trying to get to here ...

Sunday, March 1, 2020

Intelligence and intelligence

I've been meaning for quite a while to come back to the hide-and-seek AI demo, but while mulling that over I realized something about a distinction I'd made in the first post.  I wanted to mention that brief(-ish-)ly in its own post, since it's not directly related to what I wanted to say about the demo itself.

In the original post, I distinguished between internal notions of intelligence, concerning what processes are behind the observed behavior, and external notions which focus on the behavior itself (note to self: find out what terms actual cogsci/AI researchers use -- or may structural and functional would be better?).

Internal definitions on the order of "Something is intelligent if it's capable of learning and dealing with abstract concepts" seem satisfying, even self-evident, until you try to pin down exactly what is meant by "learning" or "abstract concept".  External definitions are, by construction, more objective and measurable, but often seem to call things "intelligent" that we would prefer not to call intelligent at all, or call intelligent in a very limited sense.

The classic example would be chess (transcribing speech and recognizing faces would be others).  For quite a while humans could beat computers at chess, even though even early computers could calculate many more positions than a human, and the assumption was that humans had something -- abstract reasoning, planning, pattern recognition, whatever -- that computers did not have and might never have.  Therefore, humans would always win until computers could reason abstractly, plan, recognize patterns or whatever else it was that only humans could do. In other words, chess clearly required "real intelligence".

Then Deep Blue beat Kasparov through sheer calculation, playing a "positional" style that only humans were supposed to be able to play.  Clearly a machine could beat even the best human players at chess without having anything one could remotely call "learning" or "abstract concepts".  As a corollary, top-notch chess-playing is not a behavior that can be used to define the kind of intelligence we're really interested in.

This is true even with the advent of Alpha Zero and similar neural-network driven engines*. Even if we say, for the sake of the argument, that neural networks are intelligent like we are, the original point still holds.  Clearly unintelligent things can play top-notch chess, so "plays top-notch chess" does not imply "intelligent like we are".  If neural networks are intelligent like we are, it won't be because they can play chess, but for other reasons.

The hide-and-seek demo is exciting because on the one hand, it's entirely behavior based.  The agents are trained on the very simple criterion of whether any hiders are visible to the seekers.  On the other hand, though, the agents can develop capabilities, particularly object permanence, that have been recognized as hallmarks of intelligence since before there were computers (there's a longer discussion behind this, which is exactly what I want to get to in the next post on the topic).

In other words, we have a nice, objective external definition that matches up well with internal definitions.  Something that can
  • Start with only basic knowledge and capabilities (in this case some simple rules about movement and objects in the simulated environment)
  • Develop new behaviors in a competition against agents with the same capabilities
is pretty clearly intelligent in some meaningful sense, even if it doesn't seem as intelligent as us.

If we want to be more precise about "develop new behaviors", we could either single out particular behaviors, like fort building or ramp jumping, or just require that the new agent we're trying to test starts out by losing heavily to the agents from this demo but learns to beat them, or at least play competitively.

This says nothing about what mechanisms such an agent is using, or how it learns.  This means we might run into a situation like chess, maybe some future quantum computer that can simultaneously try out all a huge variety of possible strategies, that beats the game without actually appearing intelligent.  Even then, we would learn something interesting.

For now, though, the hide-and-seek demo seems like a significant step forward, both in defining what intelligence might be and in producing it artificially.

* I've discussed Alpha Zero and chess engines in general at length elsewhere in this blog.  My current take is that the ability of neural networks to play moves that appear "creative" to us and to beat purely calculation based (AB) engines doesn't imply intelligence, and that the ability to learn the game from nothing, while impressive, doesn't imply anything like what we think of as human intelligence, even though it's been applied to a number of different abstract games.  That isn't a statement about neural networks in general, just about these particular networks being applied to the specific problem of chess and chess-like games.  There's a lot of interesting work yet to be done with neural networks in general.

Sunday, February 23, 2020

What good is half a language?

True Wit is Nature to advantage dress'd
What oft was thought, but ne'er so well express'd
-- Alexander Pope

How did humans come to have language?

There is, to put it mildly, a lot we don't know about this.  Apart from the traditional explanations from various cultures, which are interesting in their own right, fields including evolutionary biology, cognitive science and linguistics have had various things to say about the question, so why shouldn't random bloggers?

In what follows, please remember that the title of this blog is Intermittent Conjecture.  I'm not an expert in any of those three fields, though I've had an amateur interest in all three for years and years.  Real research requires careful gathering of evidence and checking of sources, detailed knowledge of the existing literature, extensive review and in general lots of time and effort.  I can confidently state that none of those went into this post, and anything in here should be weighed accordingly.  Also, I'm not claiming any original insight.  Most likely, all the points here have already been made, and better made, by someone else already.

With that said ...

In order to talk about how humans came to have language, the first question to address is what does it mean to have language at all.  Language is so pervasive in human existence that it's surprisingly hard to step back and come up with an objective definition that captures the important features of language and doesn't directly or indirectly amount to "It's that thing people do when they talk (or sign, or write, or ...) in order to communicate information."

We want to delimit, at least roughly, something that includes all the ways we use language, but excludes other activities, including things that we sometimes call "language", but that we somehow know aren't "really" language, say body language, the language of flowers or, ideally, even computer languages, which deliberately share a number of features with human natural languages.

Since language is often considered something unique to humans, or even something that makes us human, it might be tempting to actively try to exclude various ways that other animals communicate, but it seems better to me just to try to pin down what we mean by human language and let the chips fall where they may when it comes to other species.

For me, some of the interesting features of language are
  • It can communicate complex, arbitrary structures from one mind to another, however imperfectly.
  • It is robust in the face of noise and imperfection (think of shouting in a loud music venue or talking with someone struggling with a second language).
  • It tolerates ambiguity, meaning that (unlike in computer languages and other formal systems) ambiguity doesn't bring a conversation to a halt.  In some cases it's even a useful feature.
  • Any given language provides multiple ways to express the same basic facts, each with its own particular connotations and emphasis.
  • Different languages often express the same basic facts in very different ways.
  • Related to these, language is fluid across time and populations.  Usage changes over time and varies across populations.
  • It can be communicated by a variety of means, notably speech, signing and writing.
  • From an evolutionary point of view, it has survival value.
I'd call these functional properties, meaning that they relate mainly to what language does without saying anything concrete about how it does it.  Structurally (from here on I'll tend to focus on spoken/written language, with the understanding that it's not the whole story),
  • Language is linear.
That is, whatever the medium, words are produced and received one at a time, though there can be a number of "side channels" such as pitch and emphasis, facial expressions and hand gestures.
  • The mapping between a word and its meaning is largely arbitrary (though you can generally trace a pretty elaborate history involving similar words with similar meanings).
  • Vocabulary is extensible.
We can coin words for new concepts.  This is true only for certain kinds of words, but where it can happen it happens easily.
  • Meaning is also extensible
We can apply existing words in new senses and again this happens easily.
  • The forms used adjust to social conditions.
You speak differently with your peers after work than you would to your boss at work, or to your parents as a child, or to your prospective in-laws, and so forth
  • The forms used adjust to particular needs of the conversation, for example which details you want to emphasize (or obscure).
  • Some concepts seem to more tightly coupled to the structure of a particular language than others.
For example, when something happened or will happen in relation to when it is spoken of is generally part of the grammar, or marked by a small, closed set of words, or both.
  • On the other hand, there is wide variety in exactly how such things are expressed.
Different languages emphasize different distinctions.  For example, some languages don't specially mark singular/plural, or past/present, though of course they can still express that there was more than one of something or that something happened yesterday rather than today.  Different languages use different devices to convey basic information like when something happened or what belongs to whom.
  • Syntax, in the form of word order and inflection (changing the forms of words, as with changing dog to dogs or bark to barked or barking), collectively seem to matter in all languages, but the exact way in which they matter, and the degree to which each matters, seem to be unique to any given language.  Even closely related languages generally differ in the exact details.
There are plenty of other features that could each merit a separate post, such as honorifics (Mr. Hull) and diminutives (Davey), or how accent and vocabulary are such devastatingly effective in-group markers, or how metaphors work, or what determines when and how we choose to move words around to focus on a topic, or why some languages build up long words that equate to whole sentences of short words in other languages, or why in some languages directional words like to and of take on grammatical meaning, or why different languages break down verb tenses in different ways, or can use different words for numbers depending on what's being counted, and so on and so on ...

Many of these features of language have to do with the interplay between cognition -- how we think -- and language -- how we express thoughts.  The development of cognition must have been both a driver and a limiting factor in the development of language, but we are almost certainly still in the very early stages of understanding this relationship.

For example, languages generally seem to have a way of nesting one clause inside another, as in The fence that went around the house that was blue was red.  How would this arise?  In order to understand such a sentence, we need some way of setting aside The fence while we deal with that went around the house that was blue and then connecting was red with it in order to understand that the fence is red and the house is blue.  To a compugeek, this means something like a stack, a data structure for storing and retrieving things such that the last thing stored is the first thing retrieved.

Cognitively, handling such a sentence is like veering of a path on some side trip and returning to pick up where you left off, or setting aside a task to handle some interruption and then returning to the original task.  Neither of these abilities is anywhere near unique to humans, so they must older than humanity, even though we are the only animals that we know of that seem to use them in communication.

These cognitive abilities are also completely separate from a large number of individual adaptations of our vocal apparatus, which do seem to be unique to us, notably fine control of breathing and of the position of the tongue and shape of the mouth.  While these adaptations are essential to our being able to speak as fluently as we do, they don't have anything to do with what kinds of sentences we can express, just how well we can do so using spoken words.  Sign languages get along perfectly well without them.

In other words, it's quite possible we were able to conceive of structures like "I saw that the lion that killed the wildebeest went around behind that hill over there" without being able to put them into words, and that ability only came along later.  There's certainly no shortage, even in modern humans, of things that are easy to think but hard to express (I'd give a few examples, but ...).  The question here, then, is not "How did we develop the ability to think in nested clauses?" but "How did we come to use the grammatical structures we now see in languages to communicate such thoughts?"

There's a lot to evolution, and it has to be right up there with quantum mechanics as far as scientific theories that are easy to oversimplify, draw unwarranted conclusions from or get outright wrong, so this next bit is even less precise than what I've already said.  For example, I'm completely skirting around major issues of population genetics -- how a gene spreads, or doesn't, in a population, whether it's useful or not.

Let's try to consider vocabulary in an evolutionary context.  I pick vocabulary to start with because it's clearly distinct from grammar.  Indeed one of the useful features of a grammar is that you can plug an arbitrary set of words into it.  Conversely, one requirement for developing language as we know it is the ability to learn and use a large and expandable vocabulary.  Without that, and regardless of the grammatical apparatus, we do not account for the way people actually use language.

Suppose some animal has the ability to make one kind of call when a it spots particular predator and a different call for another predator, in such a way that is conspecifics (animals of the same species) can understand and react appropriately.  That's two calls (three if you count not making any call) and it's easy to see how that could be useful in not getting eaten.  Again, this is far from unique to us (see here, and search for "vervets", for example).

Now suppose some particular animal is born with the ability to make a third call for some other hazard, say a large branch falling (this is more than a bit contrived, but bear with me).  A large branch falls, the animal cries out ... and no one does anything.  The ability to make new calls isn't particularly useful without the ability to understand new calls.  But suppose that nobody did anything because they didn't know what the new call meant, but they were able to connect "that oddball over there made a funny noise" with "a big branch fell".  The next time a big branch falls and our three-call-making friend cries out, everyone looks out and scatters to safety.  Progress.

I'm more than a bit skeptical that the ability to make three calls rather than two would arise by a lucky mutation, but I think there are still two valid points here:

First, the ability to comprehend probably runs ahead of the ability to express, and certainly new ways to express are much less likely to catch on if no one understands what they mean.  Moreover, comprehension is useful in and of itself.  Whether or not my species is able to make calls that signal specific scenarios, being able to understand other species' calls is very useful, as is the ability to match up new calls with their meanings from context and examples.

In other words, the ability to understand a large vocabulary is liable to develop even without the ability to express a large vocabulary.  For a real-life example, at least some domestic dogs can understand many more human words than (as far as anyone can tell) they can produce distinct barks and similar sounds, and certainly more human words than they can themselves produce.

Second, this appears to be a very common pattern in evolution.  Abilities that are useful in one context (distinguishing the different calls of animals around you) become useful in other contexts (developing a system of specialized calls within your own species).  The general pattern is known as exaptation (or cooption, or formerly and more confusingly as pre-adaptation).

Let's suppose that the local population of some species can potentially understand, say, dozens of distinct calls (whether their own or those of other species), but its ability to produce distinct calls is limited.  If some individual comes along with the gift of being able to produce more distinct calls, then that will probably increase that individual's chances of surviving -- because its conspecifics will learn the new calls and so increase everyone's chance of survival -- and at least potentially its chances of reproducing, if only because there will be more potential mates around if fewer of them get eaten. 

If that particular individual fails to survive and reproduce, the conditions are still good for some other individual to come along with the ability to produce a bigger vocabulary, perhaps through some entirely different mechanism.  This is important, because if there is more than one way to develop an ability, there can potentially be more ways to inherit it once it is established (I'm pretty sure, but I don't know if an actual biologist would agree).

If the community as a whole develops the tendency to find larger vocabularies attractive, so much the better, though the math starts to get hairy at this point.  Sexual selection is a pretty good way of driving traits to extremes -- think peacocks and male walruses -- so it's quite plausible that a species that starts to develop larger and larger vocabularies of calls could take this quite far, past the point of immediate usefulness.  You then have a population with a large vocabulary ready for an environment where it makes more of a difference.

In short, even some ability to produce distinct calls for different situations is useful, and it's no surprise many animals have it.  The ability to produce a large and expandable variety of distinct calls for different situations also looks useful, but also seems harder to evolve, considering that it's fairly rare.  Taking this a step further, we appear to be unique in our ability to produce and distinguish thousands of distinct vocabulary items, though as always there's quite a bit we still don't know about communication in other species.

It's clear that other animals can distinguish, and in some cases produce, non-trivial vocabularies, even if it's not particularly common.  How do you get from there to our as-far-as-we-know-unique abilities?  I think the answer is "a piece at a time".

In order to find a (very hypothetical) evolutionary pathway from an extensible collection of specialized calls to what we call language today, we want to find a series of small steps that each add something useful to what's already there without requiring major restructuring.  Some of those, in no strict order except where logically necessary, might be:
  • The ability to refer to a class of things without reference to a particular instance
This is one aspect of what one might call "abstract concepts".  As such, it doesn't require any new linguistic machinery beyond the ability to make and distinguish a large set of calls (which I'll call words from here on out), but it does require a cognitive shift.  The speaker has to be able to think of, say, wolf as a class of things rather than a particular wolf trying to sneak up.  The listener has to realize that someone saying "wolf" may not be referring to a wolf currently sneaking up on them. Instead, if the speaker is pointing to a set of tracks it might mean "a wolf went here", or if pointing in a particular direction, maybe "wolves come from over there".

This may seem completely natural to us, but it's not clear who, if anyone else besides us, can do this.   Lots of animals can distinguish different types of things, but being able to classify is different from being aware that classes exist.  An apple-sorting machine can sort big from small without understanding "big" or "small".  I say "it's not clear" because devising an experiment to tell if something does or doesn't understand some aspect of abstraction is difficult, in no small part because there's a lot of room for interpretation of the results.
  • The ability to designate a quality such as "big" or "red" without reference to any particular thing with that quality.
This is similar to the previous item, but for adjectives rather than nouns.  From a language standpoint it's important because it implies that you can mix and match qualities and things (adjectives and nouns).  A tree can be big, a wolf can be big and a wolf can be gray without needing a separate notion of "big tree", "big wolf" and "gray wolf".  An adjective is a predicate that applies to something rather than standing alone as a noun does.

As I understand it, the widely-recognized stages of language development in humans are babbling, single words, two-word sentences and "all hell breaks loose".  A brain that can handle nouns and predicates is ready for two-word sentences consisting of a predicate and something it applies to.  This is a very significant step in communication and it appears to be quite rare, but linguistically it's nearly trivial.  A grammar to describe it has one rule and no recursion (rules that refer, directly or indirectly, to themselves).

As a practical matter, producing a two-word sentence means signifying a predicate and an object that it applies to (called an argument).  Understanding it means understanding the predicate, understanding the argument and, crucially, understanding that the predicate applies to the argument.  If you can distinguish predicates from objects, order doesn't even matter.  "Big wolf!" is just as good as "Wolf big!" or even a panicked sequence of "Wolf wolf big wolf big big wolf!" (which, to be fair, would require recursion to describe in a phrase-structure grammar).

From a functional point of view, the limiting factor to communicating such concepts is not grammar but the ability to form and understand the concepts in the first place.

Where do we go from predicate/argument sentences to something resembling what we now call language?  Some possible next steps might be
  • Predicates with more than one argument.
The important part here is that you need a way to distinguish the arguments.  In wolf big, you know that big is the predicate and wolf is the argument and that's all you need, but in see rabbit wolf, where see is the predicate and rabbit and wolf are arguments, how do we tell if the wolf sees the rabbit or the rabbit sees the wolf?  There are two solutions, given that you're limited to putting words together in some particular order

Either the order of words matters, so see rabbit wolf means one thing and see wolf rabbit means the other, or there's a way of marking words according to what role they play, so for example see wolf-at rabbit means the rabbit sees the wolf and see wolf rabbit-at means the wolf sees the rabbit.  There are lots of possible variations, and the two approaches can be combined.  Actual languages do both, in a wide variety of ways.

From a linguistic point of view, word order and inflection (ways of marking words) are the elements of syntax, which (roughly speaking) provides structure on top of a raw stream of words.  Languages apply syntax in a number of ways, allowing us to put together complex sentences such as this one, but you need the same basic tools even for simple three-word sentences.  Turning that around, if you can solve the problem of distinguishing the meaning of a predicate and two arguments, you have a significant portion of the machinery needed for more complex sentences.
  • Pronouns, that is, a way to designate a placeholder for something without saying exactly what that something is, and connect it with a specific meaning separately.
Cognitively, pronouns imply some form of memory beyond the scope of a simple sentence. Linguistically, their key property is that their meaning can be redefined on the fly.  A noun like wolf might refer to different specific wolves at different times, but it will always refer to some wolf.  A pronoun like it is much less restrained.  It could refer to any noun, depending on context.

Pronouns allow for more compact sentences, which is useful in itself since you don't have to repeat some long descriptive phrase every time you want to say something new about, say, the big red house across the street with the oak tree in the yard.  You can just say that house or just it if the context is clear enough.

More than this, though, by equating two things in separate sentences they allow linear sequences of words to describe non-linear structures, for example I see a wolf and it sees me.  By contrast, in I see a wolf and a wolf sees me it's not clear whether it's the same wolf and we don't necessarily have the circular structure of two things seeing each other.
  • The ability to stack up arbitrarily many predicates: big dogbig red dogbig red hairy dog, etc.
I left this for last because it leads into a bit of a rabbit hole concerning the role of nesting and recursion in language.  A common analysis of phrases like big red hairy dog uses a recursive set of rules like

a noun phrase can be a noun by itself, or
a noun phrase can be an adjective followed by a noun phrase

This is much simpler than a full definition of noun phrase, and it's not the only way to analyze noun phrases, but it shows the recursive pattern that's generally used in such an analysis.  The second definition of noun phrase refers to noun phrase recursively.  The noun phrase on the right-hand side will be smaller, since it has one less adjective, so there's no infinite regress.  The example, big red hairy dog breaks down to big modifying red hairy dog, which breaks down to red modifying hairy dog, which breaks down to hairy modifying dog, and dog is a noun phrase by itself.  In all there are four noun phrases, one by the first rule and three by the second.

On the other hand, if you can conceive of a dog being big, red and hairy at the same time, you can just as well express this with two-word sentences and a pronoun:  dog big. it red. it hairy.  The same construction could even make sense without the pronouns: dog big. red. hairy.  Here a listener might naturally assume that red and hairy have to apply to something, and the last thing we were talking about was a dog, so the dog must be red and hairy as well as big.

This is not particularly different from someone saying I saw the movie about the duck.  Didn't like it, where the second sentence clearly means I didn't like it and you could even just say Didn't like and still be clearly understood, even if Didn't like by itself sounds a bit odd.

From a grammatical standpoint (at least for a constituency grammar) these all seem quite different.  In big red hairy dog, there's presumed to be a nested structure of noun phrases.  In dog big.  it red. it hairy you have three sentences with a simple noun-verb structure and in dog big. red. hairy. you have one two-word sentence and two fragments that aren't even sentences.

However, from the point of view of "I have some notion of predicates and arguments, and multiple predicates can apply to the same argument, now how do I put that in words?", they seem pretty similar.  In all three cases you say the argument and the predicates that apply to it and the listener understands that the predicates apply to the argument because that's what predicates do.

I started this post with the idea of exploring how language as we now know it could develop from simpler pieces such as those we can see in other animals.  The title is a nod to the question of What good is half an eye? regarding the evolution of complex eyes such as we see in several lineages, including our own and (in a different form) in cephalopods.  In that case, it turns out that there are several intermediate forms which provide an advantage even though they're not what we would call fully-formed eyes, and it's not hard to trace a plausible pathway from basic light-sensitive "eye spots" to what we and many other animals have.

The case of language seems similar.  I think the key points are
  • Cognition is crucial.  You can't express what you can't conceive of.
  • The ability to understand almost certainly runs ahead of the ability to express.
  • There are plausibly a number intermediate stages between simple calls and complex language (again, I don't claim to have identified the actual steps precisely or completely).
  • Full grammar, in the sense of nested structures described by recursive rules, may not be a particularly crucial step.
  • A purely grammatical analysis may even obscure the picture, both by failing to make distinctions (as with the jump from "this wolf right there" to "wolf") and by drawing distinctions that aren't particularly relevant (as with the various forms of big red hairy dog).

Friday, January 10, 2020

Is the piano a percussion instrument?

Well, is the piano a percussion instrument?

This is one of those questions that can easily devolve into "Well technically" ... "Oh yeah, well actually" and so forth.  I'm not aware of an official designator of instrument categories, but more to the point I'm not interested in a right or wrong answer here.  I'm interested in why the question should be tricky in the first place.

The answer I learned from high school orchestra or thereabouts was "Yes, it's a percussion instrument, because the strings are hit by hammers."  The answer I personally find more convincing is "No, because it's a piano, duh."

OK, maybe that's not particularly convincing.  Maybe a better way to phrase it would be "No, it's a keyboard instrument.  Keyboard instruments are their own class, separate from strings, woodwinds, brass and percussion."  By this reasoning, the pipe organ is a keyboard instrument, not a wind instrument, the harpsichord is a keyboard instrument, not a string instrument, and a synthesizer is a keyboard instrument, assuming it has a keyboard (not all do).

The intuition behind this is that being played by way of a keyboard is more relevant than the exact method for producing the sounds.  Even though a marimba, xylophone, vibraphone or glockenspiel has an arrangement of things to hit that looks a lot like a keyboard, the fact that you're limited to mallets in two hands has a big effect on what you can play.  Likewise, a harpsichord and a guitar or banjo produce somewhat similar sounds, but fretting a one or more of a few strings is different from pressing one or more of dozens of keys.

It's a lot easier to play a four-part fugue on a harpsichord than a marimba, and a seven-note chord is going to present real problems on a five-string banjo.  Different means of playing make different things easy and hard, and that affects what actually gets played.

At this point, I could put forth a thesis that how you play an instrument is more important in classifying it than how the sounds are ultimately produced and be done with it, but that's not what got me typing in the first place.  To be clear, I like the thesis.  It's easier to play a saxophone if you know how to play a clarinet, easier to play a viola or even a guitar if you can play violin, and so forth.  What got me thinking, though, was the idea of how any classification on the order of string/woodwind/brass/percussion or keyboard/bow/plectrum/mallet/etc. tends to break down on contact with real objects to classify.

For example, there are lots of ways to produce sound from a violin.  There are several different "ordinary" ways to bow, but you can also bounce the wooden part of the bow on the strings, or pluck the strings (with either hand).  Independently of what you do with the bow, you can put a mute on the bridge to get a kind of ethereal, spooky sound.  You can rest a finger lightly on the string to get a "harmonic" with a purer tone (and generally higher pitch) than if you pressed the string to the fingerboard.  Beyond all that, you can tap on the body of the violin with your finger, or the stick of the bow or the end of the bow.  You could even tap the violin on something else, or use its strings as a bow for another instrument.

Does tapping on a violin make it a percussion instrument?  I'd say it is when you're tapping on it, otherwise not.  But if you ask, "Is the violin a percussion instrument," I'd say "no" (or, if I'm feeling cagy, "not normally").

How about an electric guitar?  Obviously, it's a string instrument, except there's more to playing an electric guitar than picking and fretting the strings.  The effects and the amp make a big difference.  It's probably best to think of electric guitar plus amp and effects as both a string instrument and an electronic instrument, both in its construction and in how you play it.  The guitar, amp and effects together are one instrument -- that's certainly how guitarists tend to see it, and they can spend quite a bit of time telling you the details of their rigs.

There are plenty of other examples to pick from -- a morsing, a glass harp, a musical saw, a theremin ... if you had to pick, you could probably call a morsing or even a glass harp a percussion instrument -- I mean, if a piano is, why not?  A musical saw would be, um, a string instrument?  A theremin would be ... I don't know, let's say brass because there are metal parts?

But why pick?  Clearly the four sections of an orchestra work fine for the instruments they were originally intended to classify, and they provide useful information in that context.  If you're putting together an orchestra, you can expect a percussionist to handle the bass drum, snare drum and tympani but not a trumpet, oboe or cello.  If you're composing for orchestra, you should know that wind players need to breathe and that a string instrument can play more than one note at a time, but only within fairly strict limits.  In neither case do you really care that someone might consider a piano a percussion instrument.  For the purposes of hiring players and composing music, a piano is a keyboard instrument.

If your purpose is to classify instruments by common properties, there are much better systems.  Wikipedia likes the Hornbostel Sachs classification, which takes into account what produces the sound, how the sound is produced, the general form of the instrument and other factors.  For my money, it does a pretty good job of putting similar instruments together while making meaningful distinctions among them.  For example (based on this 2011 revision of the classification):
  • violin 321.322-71 (Box lute sounded by a bow)
  • cello 321.322-71 (Same)
  • guitar 321.322-5 or -6 (Box lute sounded by bare fingers(5) or plectrum(6))
  • French horn 423.232.12 (Valved horn with narrow bore and long air column)
  • oboe 422.112-71 (Reedpipe with double reeds and conical bore, with keys)
  • bass drum: 211.212.12 (Individual double-skin cylindrical drums, both heads played)
  • piano 314.122-4-8 (Box zither sounded by hammers, with keyboard)
  • harpsichord  314.122-6-8 (Box zither sounded by plectrum, with keyboard)
  • morsing 121.2 (plucked idiophone with frame, using mouth cavity as resonator)
  • glass harp 133.2 (set of friction idiophones)
  • musical saw 151  (metal sheet played by friction)
  • theremin 531.1 (Analogue synthesizers and other electronic instruments with electronic valve/vacuum tube based devices generating and/or processing electric sound signals)
There's certainly room for discussion here.  Playing a cello is significantly different from playing a violin -- the notes are much farther apart on the longer strings, the cello is held vertical, making the bowing much different, and as a consequence of both, the bow is much bigger and held differently.  Clearly the analogue synthesizer section could stand to be a bit more detailed, and there's clearly some latitude within these (Wikipedia has a musical saw as 132.22 (idiophone with direct friction).

It's also interesting that a guitar is counted as a slightly different instrument depending on whether it's played with bare fingers or a plectrum, but that fits pretty well with common usage.  Fingerpicking and flatpicking require noticeably different skills and many guitarists specialize in one or the other.  The only sticking point is that a lot of fingerstyle guitarists use fingerpicks, at least when playing a steel-string acoustic ...

Nonetheless, I'd still say Hornbostel-Sachs does a decent job of classifying musical instruments.  Given the classification number, you have a pretty good idea of what form the instrument might take, who might be able to play it and, in many if not all cases, how it might sound.  There are even provisions for compound instruments like electric guitar plus effects, though I don't know how well-developed or effective those are.

The string/woodwind/brass/percussion system also provides a decent idea of form, sound and who might play, within the context of a classical orchestra, but if you're familiar with the classical orchestra you should already know what a french horn or oboe sounds like.

Which leads back to the underlying question of purpose.  Classification systems, by nature, are systems that we impose on the world for our own purposes.  A wide-ranging and detailed system like Hornbostel-Sachs is meant to be useful to people studying musical instruments in general, for example to compare instrumentation in folk music across the world's cultures.

There are a lot more local variations of the bass drum or box lute family than theremin variants -- or even musical saw variants -- so even if we knew nothing else we might have an objective reason to think that drums and box lutes are older, and we might use the number of varieties in particular places to guess where an instrument originated (places of origin, in general, tend to have more variants).  Or there might be an unexpected correlation between latitude and the prevalence of this or that kind of instrument, and so forth.  Having a detailed classification system based on objective properties allows researchers to explore questions like this in a reasonably rigorous way.

The classification of instruments in the orchestra is more useful in the day-to-day running of an orchestra ("string section will rehearse tomorrow, full orchestra on Wednesday") and in writing classical music.  Smaller ensembles, for example, tend to fall within a particular section (string quartet, brass quintet) or provide a cross-section in order to provide a variety of timbral possibilities (the Brandenburg concertos use a harpsichord and a string section with various combinations of brass and woodwinds -- strictly speaking the harpsichord can be replaced by other instruments when it's acting as a basso continuo).

Both systems are useful for their own purposes, neither covers every possible instrument completely and unambiguously (though Hornbostel-Sachs comes fairly close) and neither is inherently "correct".   As far as I can tell, this is all true of any interesting classification system, and probably most uninteresting ones as well.

No one seems to care much whether a pipe organ or harpsichord is a percussion instrument.   I'm not sure why.  Both have been used in orchestral works together with the usual string/woodwind/brass/percussion sections.

Tuesday, October 29, 2019

More on context, tool use and such

In the previous post I claimed that (to paraphrase myself fairly loosely) whether we consider behaviors that technically look like "learning", "planning", "tool use" or such to really be those things has a lot to do with context.  A specially designed robot that can turn a door handle and open the door is different from something that sees a door handle slightly out of reach, sees a stick on the ground, bends the end of the stick so it can grab the door handle and proceeds to open the door by using the stick to turn the handle and then to poke the door open.  In both cases a tool is being used to open a door, but we have a much easier time calling the second case "tool use".  The robot door-opener is unlikely to exhibit tool use in the second case.

With that in mind, it's interesting that the team that produced the hide-and-seek AI demo is busily at work on using their engine to play a Massively Multiplayer Online video game.  They argue at length, and persuasively, that this is a much harder problem than chess or go.  While the classic board games may seem harder to the average person than mere video games, from a computing perspective MMOs are night-and-day harder in pretty much every dimension:
  • You need much more information to describe the state of the game at any particular point (the state space is much larger).  A chess or go position can be described in well under 100 bytes.  To describe everything that's going on at a given moment in an MMO takes more like 100,000 bytes (about 20,000 "mostly floating point" numbers)
  • There are many more choices at any given point (the action space is much larger).  A typical chess position has a few dozen possible moves.  A typical go position may have a couple hundred.  In a typical MMO, a player may have around a thousand possible actions at a particular point, out of a total repertoire of more than 10,000.
  • There are many more decisions to make, in this case running at 30 frames per second for around 45 minutes, or around 80,000 "ticks" in all.  The AI only observes every fourth tick, so it "only" has to deal with 20,000 decision points.  At any given point, an action might be trivial or might be very important strategically.  Chess games are typically a few dozen moves long.  A go game generally takes fewer than 200 (though the longest possible go game is considerably longer).  While some moves are more important than others in board games, each requires a similar amount and type of calculation.
  • Players have complete information about the state of a chess or go game.  In MMOs, players can only see a small part of the overall universe.  Figuring out what an unseen opponent is up to and otherwise making inferences from incomplete data is a key part of the game.
Considered as a context, an MMO is, more or less by design, much more like the kind of environment that we have to plan, learn and use tools in every day.  Chess and go, by contrast, are highly abstract, limited worlds.  As a consequence, it's much easier to say that something that looks like it's planning and using tools in an MMO really is planning and using tools in a meaningful sense.

It doesn't mean that the AI is doing so the same way we do, or at least may think we do, but that's for a different post.

Tool use, planning and AI

A recent story in MIT Technology Review carries the headline AI learned to use tools after nearly 500 million games of hide and seek, and the subhead OpenAI’s agents evolved to exhibit complex behaviors, suggesting a promising approach for developing more sophisticated artificial intelligence.  This article, along with several others, is based on a blog post on OpenAI's site.  While the article is a good summary of the blog post, the blog post is just as readable while going into somewhat more depth and technical detail.  Both the article and the blog post are well worth reading, but as always the original source should take precedence.

There is, as they say, quite a bit to unpack here, and before I'm done this may well turn into another Topic That Ate My Blog.  At the moment, I'm interested in two questions:
  • What does this work say about learning and intelligence in general?
  • To what extent or in what sense do terms like "tool use" and "planning" describe what's going on here?
My answers to both questions changed significantly between reading the summary article and reading the original blog post.

As always, lurking behind stories like this are questions of definition, in particular, what do we mean by "learning", "planning" and "tool use"?  There have been many, many attempts to pin these down, but I think for the most part definitions fall into two main categories, which I'll call internal and external here.  Each has its advantages and drawbacks.

By internal definition I mean an attempt to formalize the sort of "I know it when I do it" kind of feeling that a word like learning might trigger.  If I learn something, I had some level of knowledge before, even if that level was zero, and after learning I could rattle off a new fact or demonstrate a new skill.  I can say "today I learned that Madagascar is larger than Iceland" or "today I learned how to bake a soufflĂ©".

If I talk about planning, I can say "here's my plan for world domination" (like I'd actually tell you about the robot army assembling itself at ... I've said too much) or "here's my plan for cleaning the house".  If I'm using a tool, I can say "I'm going to tighten up this drawer handle with a Philips screwdriver", and so forth.  The common thread is here is a conscious understanding of something particular going on -- something learned, a plan, a tool used for a specific purpose.

This all probably seems like common sense, and I'd say it is.  Unfortunately, common sense is not that helpful when digging into the foundations of cognition, or, perhaps, of anything else interesting.  We don't currently know how to ask a non-human animal to explain its thinking.  Neither do we have a particularly good handle on how a trained neural network is arriving at the result it does.  There may well be something encoded in the networks that control the hiders and seekers in the simulation, which we could point at and call "intent", but my understanding is we don't currently have a well-developed method for finding such things (though there has been progress).

If we can't ask what an experimental subject is thinking, then we're left with externally visible behavior.  We define learning and such in terms of patterns of behavior.  For example, if we define success at a task by some numerical measure, say winning percentage at hide and seek, we can say that learning is happening when behavior changes and the winning percentage increases in a way that can't be attributed to chance (in the hide-and-seek simulation, the percentage would tilt one way or another as each side learned new strategy, but this doesn't change the basic argument).

This turns learning into a pure numerical optimization problem: find the weights on the neurons that produce the best winning percentage.  Neural-network training algorithms are literally doing just such an optimization.  Networks in the training phase are certainly learning, by definition, but certainly not in the sense that we learn by studying a text or going to a lecture.  I suspect that most machine learning researchers are fine with that, and might also argue that studying and lectures are not a large part of how we learn overall, just the part we're most conscious of as learning per se.

This tension between our common understanding of learning and the workings of things that can certainly appear to be learning goes right to why an external definition (more or less what we call an operational definition) can feel so unsatisfying.  Sure, the networks look like they're learning, but how do we know they're really learning?

The simplest answer to that is that we don't.  If we define learning as optimizing a numerical value, then pretty much anything that does that is learning.  If we define learning as "doing things that look to us like learning", then what matters is the task, not the mechanism.  Learning to play flawless tic-tac-toe might be explained away as "just optimizing a network" while learning to use a ramp to peer over the wall of a fort built by a group of hiders sure looks an awful lot like the kind of learning we do -- even though the underlying mechanism is essentially the same.

I think the same reasoning applies to tool use: Whether we call it tool use or not depends on how complex the behavior appears to be, not on the simple use of an object to perform a task.  I remember reading about primates using a stick to dig termites as tool use and thinking "yeah, but not really".  But why not, exactly?  A fireplace poker is a tool.  A barge pole is a tool.  Why not a termite stick?  The only difference, really, is the context in which they are used.  Tending a fire or guiding a barge happen in the midst of several other tools and actions with them, however simple in the case of a fireplace and andirons.  It's probably this sense of the tool use being part of a larger, orchestrated context that makes our tool use seem different.  By that logic, tool use is really just a proxy for being able to understand larger, multi-part systems.

In my view this all reinforces the point that "planning", "tool use" and such are not binary concepts.  There's no one point at which something goes from "not using tools" to "using tools", or if there is, the dividing line has to be fairly arbitrary and therefore not particularly useful.  If "planning" and "tool use" are proxies for "behaving like us in contexts where we consider ourselves to be planning and using tools", then what matters is the behavior and the context.  In the case at hand, our hiders and seekers are behaving a lot like we would, and doing it in a context that we would certainly say requires planning and intelligence.

As far as internal and external definitions, it seems we're looking for contexts where our internal notions seem to apply well.  In such contexts we have much less trouble saying that behavior that fits an external definition of "tool use", "planning", "learning" or whatever is compatible with those notions.

Saturday, July 27, 2019

Do neural networks have a point of view?

As someone once said, figures don't lie, but liars do figure.

In other words, just because something's supported by cold numbers doesn't mean it's true.  It's always good to ask where the numbers came from.  By the same token, though, you shouldn't distrust anything with numbers behind it, just because numbers can be misused.  The breakdown is more or less:
  • If you hear "up" or "down" or "a lot" or anything that implies numbers, but they're aren't any numbers behind it, you really don't know if it's true or not, or whether it's significant.
  • If you hear "up X%" or "down Y%" or -- apparently a popular choice -- "up a whopping Z%" and you don't know where the numbers came from, you still don't really know if it's true or not.  Even if they are correct, you don't know whether they're significant.
  • If you hear "up X%, according to so-and-so", then the numbers are as good as so-and-so's methodology.  If you hear "down Y%, vs. Z% for last quarter", you at least have a basis for comparison, assuming you otherwise trust the numbers.
  • In all, it's a bit of a pain to figure all this out.  Even trained scientists get it wrong more than we might think (I don't have numbers on this and I'm not saying it happens a lot, but it's not zero).
  • No one has time to do all the checking for more than a small subset of things we might be interested in, so to a large extent we have to trust other people to be careful.  This largely comes down to reputation, and there are a number of cognitive biases in the way of evaluating that objectively.
  • But at least we can try to ignore blatantly bad data, and try to cross-check independent sources (and check that they're actually independent), and come up with a rough, provisional picture of what's really going on.  If you do this continually over time the story should be pretty consistent, and then you can worry about confirmation bias.
  • (Also, don't put much stock in "record high" numbers or "up (a whopping) 12 places in the rankings", but that's a different post).
I'm not saying we're in some sort of epistemological nightmare, where no one has any idea what's true and what's not, just that objectivity is more a goal to aim towards rather than something we can generally expect to achieve.

So what does any of this amateur philosophizing have to do with neural networks?

Computers have long been associated with objectivity.  The stramwan idea that "it came from a computer" is the same as "it's objectively true" probably never really had any great support, but a different form, I think, has quite a bit of currency, even to the point of becoming an implicit assumption.  Namely, that computers evaluate objectively.

"Garbage in, garbage out," goes the old saying, meaning a computed result is only as good as the input it's given.  If you say the high temperature in Buenos Aires was 150 degrees Celsius yesterday and -190 Celsius today, a computer can duly tell you the average high was -20 Celsius and the overall high was 150 Celsius, but that doesn't mean that Buenos Aires has been having, shall we say, unusual weather lately.  It just means that you gave garbage data to a perfectly good program.

The implication is that if you give a program good data, it will give you a good result.  That's certainly true for something simple, like calculating averages and extremes.  It's less certain when you have some sort of complicated, non-linear model with a bunch of inputs, some of which affect the output more than others.  This is why modeling weather takes a lot of work.  There are potential issues with the math behind the model (does it converge under reasonable conditions?), the realization of that model on a computer (are we properly accounting for rounding error?) the particular settings of the parameters (how well does it predict weather that we already know happened?).  There are plenty of other factors.  This is just scratching the surface.

A neural network is exactly a complicated, non-linear model with a bunch of inputs, but without the special attention paid to the particulars.  There is some general assurance that the tensor calculations that relate the input to the output are implemented accurately, but the real validation comes from treating the whole thing as a black box and seeing what outputs it produces from test inputs.  There are well-established techniques for ensuring this is done carefully, for example using different datasets for training the network and for testing how well the network really performs, but at the end of the day the network is only as good as the data it was given.

This is similar to "Garbage in, Garbage out," but with a slightly different wrinkle.  A neural net trained on perfectly accurate data and given perfectly accurate input can still produce bad results, if the context of the training data is too different from that of the input it was asked to evaluate.

If I'm developing a neural network for assessing home values, and I train and test it on real estate in the San Francisco Bay area, it's not necessarily going to do well evaluating prices in Toronto or Albuquerque.  It might, because it might do a good job of taking values of surrounding properties into account and adjusting for some areas being more expensive than others, but there's no guarantee.  Even if there is some sort of adjustment going on, it might be thrown off by any number of factors, whether housing density, the local range of variation among homes or whatever else.

The network, in effect, has a point of view based on what we might as well call its experience.  This is a very human, subjective way to put it, but I think it's entirely appropriate here.  Neural networks are specifically aimed at simulating the way actual brains work, and one feature of actual brains is that their point of view depends to a significant degree on the experience they've had.  To the extent that neural networks successfully mimic this, their evaluations are, in a meaningful way, subjective.

There have been some widely-reported examples of neural networks making egregiously bad evaluations, and this is more or less why.  It's not (to my knowledge) typically because the developers are acting in bad faith, but because they failed to assemble a suitably broad set of data for training and testing.  This gave the net, in effect, a biased point of view.

This same sort of mistake can and does occur in ordinary research with no neural networks involved.  A favorite example of mine is drawing conclusions about exoplanets based on the ones we've detected so far.  These skew heavily toward large, fast-moving planets, because for various reasons those are much easier to detect.  A neural network trained on currently known exoplanets would have the same skew built in (unless the developers were very careful, and quite likely even then), but you don't need a neural network to fall prey to this sort of sampling bias.  From my limited sample, authors of papers at least try to take it into account, authors of magazine articles less so and headline writers hardly at all.

Sunday, July 14, 2019

Computer chess: where now?

In an effort to wrap up this thread, at least for a while, here's an attempt to recap some of the major points and to conjecture about what might be next, and what this might tell us about chess, and intelligence in general.

Since playing perfect chess appears intractable, we have a classic tradeoff: give up on perfect play in order to fit our actual strategy into our limited resources.  Chess theory, combined with practice to develop pattern recognition and calculation, is optimized for the human brain.  Chess theory is just shorthand for "patterns and rules that players have discovered over the centuries."  These days, that very much includes discoveries by computer players.

For computers (at least the Von Neumann-style processors currently in commercial use), there are now two main options:

  • exhaustive search to a limited depth combined with alpha/beta pruning to avoid exploring moves that can't be as good as moves already considered, using an explicit set of rules to evaluate whether a position is good or bad (known as AB, after the alpha/beta part)
  • a completely opaque neural net evaluation of positions combined with a limited randomized search of possible variations (known as NN, for the neural net part), though you'll also see mention of MCTS (Monte Carlo Tree Search), referring to the randomized search.
There are some hybrid approaches that combine explicit rules with random search, but one way or another there has to be a tradeoff of performance for limited resources.

It's probably worth repeating that NN engines still consider far more possible continuations than humans can.  The ratio for human to NN to AB is roughly dozens to hundreds of thousands to billions.  We can assume that those last two numbers are going to increase over time.  Moore's law may (or may not) be tailing off, but there will still be improvements in software and bigger piles of hardware in the future.  There could also be breakthroughs like some sort of quantum chess computer that can examine vastly larger numbers of possible positions, in which case all bets are off.

It's interesting explore what human, AB and NB players do and don't have in common.  One common factor is the importance of what's traditionally called "positional" play, that is, taking into account factors beyond tactical concerns about winning material or forcing a quick checkmate.  Tactics are still important, but as the level of play becomes higher it becomes more and more important to consider factors that influence the longer-term course of the game, factors like pawn structure, space, initiative and so forth.

Positional play is interesting from a computing standpoint because it's not the primary objective of the game -- that would be checkmate -- or even the secondary objective -- generally considered to be winning material to make it easier to checkmate, or using the threat of winning material to force the opponent into a lost position.  The positional elements that have been developed over the centuries aren't readily obvious consequences of the rules.  They are human constructs aimed at making an intractable problem tractable.  Heuristics, in other words.  In broad terms, all three kinds of players are using the same approach -- rules of thumb plus calculation -- just in different mixes and with different rules of thumb.

It seems significant, then, that computers have, in effect, rediscovered positional factors in their play, even without having them explicitly programmed in.  Deep Blue beat Kasparov, by Kasparov's own assessment, by outplaying Kasparov positionally.  The surprise was that it did this with only basic knowledge of what makes a position good or bad and that the rest emerged from looking at large numbers of possible positions, much further ahead than a human could look.

Similarly, in their training phases, NN engines like Alpha Zero learn to play good moves without any obvious tactical benefit -- and indeed some that can look like blunders or at least unnecessarily risky at first glance -- without any cues beyond what happened to win in training games.  They do seem to produce more than their share of "wild" positions, and "unorthodox" moves, but even then their play can generally be described in terms like "an unusually aggressive pawn push to initiate a kingside attack" or "a positional sacrifice to obtain more active pieces", and not (except in endgames, it seems) "no idea ... looks like it picked that move at random".

Maybe that just means that we have enough terms for things going on in chess that at least some of them are bound to apply to any particular move or position, but even when engines play moves that flout previously accepted theory, such cases are probably the memorable exception.  In a great many positions each of the three types of player would find the others' preferred moves perfectly reasonable.  They might well disagree over which exact move is best, but in most cases their evaluations will be largely similar.  Typically they will all rule out the great majority of possible moves as inferior, and they will largely agree on which moves are inferior.   For that matter, AB engines and human grandmasters have been known to flout the same rules.  It's not just NN that can throw a curve ball from time to time.

All this is to say that if three radically different approaches to chess-playing can reach a reasonable consensus in discarding almost all possible moves and that their ranking among the best moves agrees reasonably well -- though by no means perfectly -- with accepted chess theory, then most likely there is something to accepted chess theory.  Call it an emergent property of the rules.

So where do we go from here?  Can we combine the strengths of the three approaches, roughly speaking high-level planning for humans, calculation for AB engines and positional assessment for NN engines?  This next bit is more speculation than the nice summary I was hoping for, but here goes:

It might seem almost self-evident that if we knew the full details of how grandmasters see the board then computers would be able to do even better, but it's not like it hasn't been tried.  Chess engine developers have been working with top-level human players for decades, with good results, but not always with the best results.  Stockfish, which currently rules the AB world, benefits from a large distributed team making continual tweaks to the algorithm, with only the best making it into the next release.  Contributions can come from anyone.  I'm sure a lot of contributors are also strong chess players, but it's not a prerequisite.

Alpha Zero, of course, dispenses with outside expertise entirely.  Other NN engines try to incorporate outside knowledge, but the ones that don't, notably Alpha Zero and its open-source cousin LC0, seem to do best.

Humans clearly do things differently from computers, but as I've said previously, it's quite possible that this is because these things work for humans but not for computers.  My guess is that trying to model the way humans formulate plans when playing chess is not going to lead to stronger chess engines.  At this point, both kinds of engines have a reasonable substitute for human-style conscious planning, namely calculating possible continuations en masse.

Even NN engines, which focus mainly on evaluating particular positions, benefit from this.  While NN engines may look downright clueless in endgames (even when they win), they look considerably less clueless at looser time controls, that is, when they are doing more calculation.  Looking far enough ahead appears indistinguishable from planning, even if that's not how we humans do it.

The ideal combination of current AB and NN approaches would evaluate positions as well as NNs and as fast as ABs.  While the original Alpha Zero work was done in a closed environment and (to my knowledge) hasn't been directly replicated by outside researchers, it's generally understood that Alpha Zero benefitted from using tensor processing chips that sped up its evaluation (one reason the Stockfish team argued that the Alpha Zero match results weren't a fair comparison), and that for that reason it would likely beat LC0, which runs on stock hardware.  That is, NN with faster, and thus more, calculation beats NN without it.

On the other hand, one big reason that NN engines got so much attention was that they didn't just win, but they did it in a noticeably different style.  The classic example is the "fawn pawn" (apparently a mishearing of "thorn pawn"), which is an advanced pawn, typically on the sixth rank, that can't be attacked by the defender's pawn, for example a black pawn on h3 when white has castled kingside and played ... g3.

A strong human player would likely look at such a position and think "Yeah ... that doesn't look good for white ... ", but NN players demonstrated just how bad it is, and therefore why it can be worth playing for even if you have to give up material or it doesn't look good to expose your own king and push a pawn that might have to be supported by pieces instead of other pawns.  The NN games also gave some insight as to when to push such a pawn.

It's not out of the question, in fact I'd conjecture  that it's likely, that AB engines can incorporate some of this knowledge directly into their evaluation functions.  The resulting code wouldn't have to capture everything the NN's network does, just enough to avoid losing to such attacks, by skewing just enough toward positions that avoid them.

More generally, an AB evaluation function doesn't need to capture everything an NN's does.  Some of that behavior will be due to overfitting or plain old random noise.  Nor do we have to understand exactly how the evaluation works.  There's a very direct way of seeing what an NN's evaluation function is doing: play games against it and see what it does.

Another option that comes to mind would be to use neural networks not for playing chess directly, but for tweaking the parameters of traditional chess engines.  I'm not sure this is a great idea, though.  One major lesson of both types of engine, in their own ways, is that it's risky to build in assumptions -- not necessarily doomed to failure, but risky.  If you're tweaking parameters in a traditional engine -- e.g., how much a rook on an open file is worth in relation to a pawn -- you're tuning weights in a system with a bunch of specialized, non-linear nodes while in a standard NN you're tuning weights in a (relatively) straightforward multilinear system.  It's not clear that the first option would work better.  It might, but it might not.

Looking at it from a more NN perspective, you could try to speed up evaluation by noticing this or that pattern of weights and replacing the generalized tensor calculations with ad-hoc functions that happen to work for those particular weights.  For example if the weights are "sparse", that is, most of them are zero, you can write out an explicit formula that combines the terms with non-zero weights.  In theory, you might come up with something that resembles some of the customary rules of thumb.  Maybe weights involving a particular kind of piece tend to emphasize empty squares that the piece could move to (open lines, mobility), or pairs of squares the piece could attack if it moved a certain way, and which hold enemy pieces (forks).

If all we know is that neural nets are combining basic features of a chess position according to some magic combination of weights, then all we can say is "there's something that current explicit evaluation functions don't seem to be capturing."  If that's how it stays, then we might expect neural networks to outperform engines like Stockfish in the long run.  They're already at least comparable, and they haven't been around for nearly as long.

It's quite possible, though, that a neural network's evaluation function is capturing a handful of quantifiable factors that current explicit evaluation functions aren't currently capturing.  In that case, an engine like Stockfish with an updated evaluation function should have the upper hand.  It could choose promising positions as well as, or about as well as, a neural net, but it would examine them much more efficiently.

It's also not clear how much room is left for neural networks to improve.  Playing winning chess is just plain hard.  Training a network for twice as long generally doesn't produce twice as good a result.  It may be that network of evaluation of individual positions is about as good as it's going to get.

For that matter, though, neither does looking at twice as many moves, or even twice as deep (which should roughly square the number of moves examined), make an AB engine twice as strong.  Currently AB engines can examine billions of positions and around twenty plies deep (twenty levels of move/countermove, or ten full moves) in a typical midgame position.  Looking at trillions of positions would mean looking more like thirty plies ahead.  That ought to yield some sort of improvement, but how much?  Most high-level games are draws in any case.

It's interesting that AB and NN engines appear to be more or less evenly matched.  This may be evidence that we're reaching the limits of what computer chess players can do.  Or it may not.

In the discussion above, I talk mostly about evaluation functions, weights on numbers and so forth, and not so much about "planning" or "understanding" or "insight".  This was a deliberate choice.  Chess engines are nothing more or less than computer programs.  It's fascinating that they can clearly display some aspects that we associate with intelligence, while clearly lacking others.  Neural networks don't fundamentally change this -- they're still just doing calculations -- but they do seem to capture another chunk of what brains do, namely fuzzy pattern matching.

Much discussion of developments like AlphaZero and similar neural network-based programs focuses on what we're learning about algorithms, and in particular how to harness "deep neural networks" (that is, multi-stage neural networks) to solve problems that are widely agreed to require intelligence.

That's fair enough, but in that process we're also learning about the problems themselves.  If intelligence is the ability to solve certain kinds of problems, then solving problems by machine tells us something about what it means to be intelligent.  While it's natural to want to move the goal posts and try to narrow "intelligence" down so it continues to mean "behavior we can't emulate reasonably well with machines," I argue in the first of the previous posts, that's probably a losing game.  The winning game, I think, is gaining a better understanding of various capabilities that we tend to throw under the blanket term "intelligence".

How far are we from perfect chess?

Playing a perfect game of chess appears to be an intractable problem.  Certainly the total number of possible positions is far too large for a direct tabulation of best move for each possible position to be feasible.  There are far more possible chess positions than subatomic particles in the observable universe.

To be sure, almost all of these positions are extremely unlikely to appear in any reasonable game of chess, much less a perfectly played one.  All you would really need to know is what the best first move is, what the second move is for every reply, and so forth.  Since most possible replies are not good moves, this might thin out enough that it would be possible to store everything in a large database and/or write general rules that will cover large numbers of possible positions.  In other words, it might (or might not) be feasible to write down a perfect strategy if we could find it.  But we're nowhere close to finding it.

Nonetheless, there has been a lot of progress in the decades since Deep Blue beat Kasparov.  It's now quite clear that computers can play better than the best humans, to the point that it's a bit hard to say exactly how much better computers are.  There are rating numbers that imply that, say, Stockfish would beat Magnus Carlsen X% of the time, but they probably shouldn't be taken as anything more than an estimate.  We can say that X is probably somewhere in the 90s, but that's about it.

Chess rating systems are generally derived from the Elo system (named after Arpad Elo), which tries to quantify playing strength as a single number based on players' records against each other.  Two equally-rated players should have roughy equal numbers of wins and losses against each other, plus however many draws.  As the rating difference increases, the stronger player should win more and more often.

Ratings are recalculated in light of actual results.  If two equally-rated players draw, nothing will change, but if a low-rated player draws against a higher-rated one, the higher-rated player will lose points and the lower-rated player will win points.  Likewise, the winner of a game will gain points and the loser will lose points, but you get more points for beating a higher-rated player and you lose more for losing to a lower-rated player.

Over time, this will tend to give a good picture of who's better, and how much better.  If the parameters of the rating formula are tuned well, it will give a pretty good prediction of winning percentages.  It's interesting in itself that reducing chess skill to a single number on a linear scale works as well as it does -- see this post for more than you probably want to read about that.  The point here, though, is that to get useful ratings you need a pool of players all playing against each other.

You don't need a full round-robin of thousands of players, of course, but things need to be reasonably interconnected.  If you're an average club player, you probably won't be playing many games against the world's top ten, but the strongest players in your club may well have played people who've played them, and in any case there will be multiple reasonably short paths connecting you to them.

This isn't the case with humans and computers.  People, including grandmasters, do play computers a lot, but generally not under controlled match conditions.  To get useful ratings, players being rated should be reasonably well matched.  If player A beats player B nineteen times and draws once out of twenty games, we know player A is far stronger, but we can't say with confidence how much stronger.

Since that's a realistic outcome for a top computer player vs. a top human player, there's little reason to pay a top player to take time out of there schedule for an exhausting and demoralizing match that won't tell us much more than we already know.  Probably the best approach would be to dial back the parameters on a few well-known engines, with a mix of AB and NN, to produce a set of standard players that could play humans at, say, 200 rating-point intervals.  It would probably be much easier to get human players to play against the "human-level" players, knowing they might be able to win bragging rights over their friends.  The human-level computer players could thus be calibrated against actual humans.

Calibrating the top computer players against them would take a lot of games, but that's not a problem for computers.  In the last computer tournament I watched, the bottom player -- which, to be sure, lost quite heavily -- had a rating in the 2900s.  Carlsen is currently in the high 2800s.  Again, these are essentially two separate rating systems, but for historical reasons they're probably fairly close, so it should be possible to link the two systems if anyone really wants to.

While it would be interesting to get more detail on ratings across humans and computers, it doesn't really change the story of "computers can beat humans easily" and by itself it doesn't shed much light on the limits of how well chess can be played.  An Elo-style rating system doesn't have a built-in maximum rating.  There is, in theory, a best-possible player, but we don't know how strong that player is or, put another way, we don't know how close the current best engines are to playing perfect chess.

It is interesting, though, that stronger players tend to draw more against each other than weaker players.  Amateur-level games are often decisive because one player or the other blunders (or, more accurately, blunders more than the other player does).  Higher-level players blunder less, and chess engines don't blunder at all, or at least the notion of "blunder" for an engine is more on the order of "didn't realize that having an enemy pawn that far advanced on the kingside could lead to a devastating attack".  The finals of the last computer tournament I watched consisted almost entirely of draws.

While it's entirely possible that perfectly-played chess is a win for white (or, who knows, a win for black), and the top engines just aren't finding the right moves, it seems more likely to me that perfectly played chess is a draw and engines are basically getting better at not losing, while being able to ruthlessly exploit imperfect play when it does happen.

If this is the case then it's quite possible that ratings will top out at a point where decisive games become exceedingly rare.  A plausible match result might be 1-0 with 999 draws.  If rules required a certain margin of victory, that might essentially be a matter of flipping coins until there are are that many more heads than tails or tails than heads.  The "law of large numbers" doesn't forbid this, it just says that you'll need a lot of coin flips to get there.

We could well get to a point where the championship wends in a score of 1439-1429 with 50,726 draws or something like that.  At that point writing chess engines becomes similar to cranking out more and more digits of pi -- interesting to a small group of people, and useful in stress-testing systems and tools, but no longer of general interest, even among geeks.

The players would still not be playing perfect chess, but most likely they would nearly hold their own against perfect play.  If player A's record against player B is 12-0 with 123,765 draws, their ratings will be essentially identical, even if A is playing perfectly and B is only almost-perfect.  I wouldn't be surprised if someone's already done a regression based on a scenario like this and calculated a maximum rating from it.

Sunday, June 23, 2019

Grandmasters don't play twelve-dimensional chess

A couple of caveats before I start:

First, in case it's not clear by now, let me say explicitly that I'm not a grandmaster, or any kind of master, or even close to being any kind of chess master.  I have, however, watched a few play, browsed through master-level games and read articles and interviews on the subject, not to mention losing quite a few 5-minute games to master-level players and occasionally asking them to explain where I went wrong.

Second, while grandmaster is a title with very specific qualifications, reserved for the best of the best, I'm using it here as a stand-in for any strong player with the same general mindset and skills that I describe here, regardless of how well they actually do in tournament play.

With that out of the way, imagine a critical point in a cheesy movie:

"Welcome to my trap.  I knew you'd come here, just as I know that you will now try to use your secret radio transmitter to summon help, which is why I have taken the precaution of jamming all radio frequencies.  Don't even think about trying to cut off power to the jamming equipment.  It's encased in an unobtanium vault with battery power for the next three months.  Checkmate."

"Ah, my worthy opponent, I did not anticipate that.  But I did not need to.  There is no secret radio transmitter, only this aluminum box of curiously strong mints that I knew your scanning equipment would register as one.  No, I will not be needing a radio today.  I need only wait here and do nothing, while the new head of security you hired comes to my rescue.  Next time be more careful who you use for background checks.  You never know who they might be working for.  Don't bother summoning your bodyguards.  They're already safely locked away where they will do no harm to anyone."

"But ... but ... my plan was perfect.  I accounted for every last detail!"

Characters like this may be casually referred to as chess masters, but as far as I can tell this is not really how actual grandmasters play.

While it's true that many games are won or lost by one player catching something the other missed, it's not necessarily from thinking one move farther ahead, or from calculating every last possibility, but more typically from recognizing the importance of occupying a key square, or from opening or closing some particular line of attack, or knowing which rook to move onto a particular file.  This, in turn, generally comes down to who has the better plan.

There will certainly be points in a game where a grandmaster will pause and think deeply about moves, countermoves, counter-countermoves and so forth, and they can do this better than most players, but that's probably not the real distinguishing factor in their play.  Likewise, while tactics are certainly important in top-level games, the better tactician may or may not end up the winner.

One good illustration of this is the simultaneous exhibition, where a professional typically takes on a few dozen amateurs, moving from one to the next for each move (with occasional pauses for a longer exchange) and often winning every game.  If human chess were a matter of sheer calculation then winning every game would mean one person keeping all the possible continuations of 50 different games in mind, or being able to re-calculate them in the few seconds they usually spend at each board.

But of course this is not what's going on.  The pro is relying on experience, often looking for quick tactical wins by exploiting typical amateur mistakes.  They're still planning, just not in detail, maybe something on the order of "That knight is lined up with the king.  I can pin it, then bring my own knight to this square, put my queen on this diagonal and either attack the king directly or just win that rook if they don't notice I have a fork there.  Am I leaving any pieces hanging?  No.  Is my king exposed? No?  Great. On to the next one ..."

A human grandmaster can do this because they see the board not as a collection of pieces on squares but in a more structured way, something like a collection of patterns and possibilities.  Even if the full details of how this is done are elusive, there are hints, like the well-known experiments in which a grandmaster can reconstruct a position after a quick glance while a beginning player or a non-player takes much longer and makes many more mistakes.

Someone without much experience might think "there was a rook on this square, a pawn on this square ... or was it a bishop?" and so forth.  A grandmaster might think something more like "both sides have castled kingside, white's queen bishop is fianchettoed ..." or "this started out as such-and-such opening" or maybe even "this is like that game I lost to so-and-so and that rook is where I should have put it"

When it comes to playing over the board, a strong player will know many more possibilities to check for than a weaker one, and many more possible hazards to avoid.  One site I ran across gives a checklist of more than two dozen things to check for on every move, just to play a strong amateur game.  I have no doubt that this makes for stronger play, but I personally don't have the patience for this, or the experience for it to have become second nature, so it should come as no surprise that I'm not a strong player.

The most important factor, though, is planning.  If you just play likely-looking moves against a grandmaster, but without a clear plan, you'll almost certainly find yourself in deep trouble before long because they did have a plan.  That innocuous bishop move, that weird-looking pawn push and the shift of a rook from a reasonable-looking square to a different reasonable-looking square were all setting up the attack that's now making your life miserable.  If only you could move your king out of the way -- and that's what that bishop move was really about.

As I mentioned in a previous post, AB engines really do play the kind of insanely calculating game that we associate with the cliche chessmaster character, while NN engines do the same sort of pattern recognition that allows a grandmaster to size up a situation without having to do a lot of calculation, but neither is doing the sort of large-scale planning that a human grandmaster is.  It's also worth reiterating that, while even early chess engines were able to out-calculate human players, it took decades before they could consistently win games against them.

All that said, I think it would be a mistake to think that the planning grandmasters do is some sort of humans-only ability that computers could never emulate.  There's no reason a computer couldn't represent and execute a plan on the order of "Put your bishop on this diagonal, push these pawns and go for a kingside attack involving the bishop, rook and queen," without exhaustively checking every move and countermove.  It just hasn't proven to be an approach that computers can use to win games.