Tuesday, June 17, 2014

Reading University and Mr. Turing's test

The BBC reports that a chatbot called Eugene Goostman has passed the Turing test, marking the first time in history that this has happened.  Or at least, that's what you'd gather from the headline (granted, it doesn't give the name of the chatbot).  The BBC, not having taken complete leave of its senses, explains that this is the claim of the team at the University of Reading that ran the test, and then goes on to cast a little well-deserved doubt on the idea that anything historic is going on.


So what's a Turing test?

Sixty-four years ago, Alan Turing published Computing Machinery and Intelligence, in which he posed a simple question: Can machines think?  He immediately dismissed the question as essentially meaningless and proposed an alternative: Can a machine be built which could fool a person into thinking that it (the machine) was a person?

In Turing's setup, which he called "the imitation game", there would be a judge who would communicate with two players, each claiming to be a person.  The judge and players would communicate via a "teleprinter" or other intermediary, so that it would not be possible to point at one of the players and say "That one's a machine, duh".  Turing goes into quite a bit of detail on points like this that we would take for granted now.  Your favorite instant messaging system is good enough for the task.  On the internet, nobody knows you're a dog.

Later in the paper Turing makes a pretty audacious claim, considering it was made in 1950 and the supercomputers of the time had somewhere on the order of 16K of memory.  In case you've forgotten how to count that low, that's 1/32 of a megabyte, or about a millionth of the capacity of a low-end smartphone.  Turing's prediction:
I believe that in about fifty years' time [that is, by around the year 2000] it will be possible to programme computers, with a storage capacity of about 10⁹ [bits], to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning.
10⁹ bits is about 128 megabytes, not an unusual amount of RAM for a computer in the 2000s and a remarkably good prediction for someone writing in 1950.  Keep in mind that Turing wrote this well before Moore formulated Moore's law, itself a good source of misinterpretations.

Turing was a brilliant scientist.  He helped lay the groundwork for what we now call computer science, played a key role in pwning the German Enigma machine during World War II, and thought deeply about the question of intelligence and how it related to computing machinery.  However, he got this particular prediction spectacularly wrong.

It didn't take fifty years to beat the imitation game.  It took more like fifteen.

In the mid 1960s, Joseph Weizenbaum of MIT wrote ELIZA, which purported to be a Rogerian psychotherapist.  You can play with a version of it here.  To be clear, this program wasn't actually trying to do psychotherapy.  It was more like a parody of a "nondirective" therapist whose goal is to stay out of the way and let the patient do all the meaningful talking.  Was it able to fool anyone?  Yes indeed.  So much so that it inspired Weizenbaum, after seeing people confide their deepest secrets to the program, to write a book about the limitations of computers and artificial intelligence.

ELIZA neatly dodges the difficulties that Turing was trying to present to the developer by making the human do all the thinking.  Say "The kids at school don't like me" and ELIZA won't respond with "I know what you mean.  At my school there was this bully named ..." and give you a chance to probe for things only an actual human who had been to school would know.  It will respond with something like "Why do you think the kids at school don't like you?"  It's a perfectly reasonable response, but it reveals absolutely zilch about what the machine knows about the world.

That's fortunate, because the machine knows absolutely zilch about the world.  It's just taking what you type in, doing some simple pattern matching, and spitting back something based, in a fairly simple way, on whatever patterns it found.  This works great for a while, but you don't have to wander very far to see the man behind the curtain.  Answer "Because I am." to one of its "Why are you ...?" questions, and it is liable to answer "Do you enjoy being?", because it saw "I am X" and tried to respond "Do you enjoy being X?"  Except there is no X in this case.
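To make the mechanism concrete, here is a minimal sketch in Python of ELIZA-style pattern matching.  The rules are hypothetical stand-ins, not Weizenbaum's actual script, but they reproduce both the plausible-sounding reflection and the "Do you enjoy being?" seam:

    import re

    # Hand-written templates that echo fragments of the input back.
    # There is no model of what any of the words mean.
    RULES = [
        (re.compile(r"(.+) don't like me\b", re.IGNORECASE),
         "Why do you think {0} don't like you?"),
        (re.compile(r"\bI am\b\s*(.*?)[.?!]*$", re.IGNORECASE),
         "Do you enjoy being {0}?"),
    ]

    def respond(utterance):
        for pattern, template in RULES:
            match = pattern.search(utterance)
            if match:
                # Fill the template with whatever fragment was captured.
                return template.format(*(g.lower() for g in match.groups()))
        # Nothing matched: fall back to a content-free prompt.
        return "Please go on."

    print(respond("The kids at school don't like me"))
    # -> Why do you think the kids at school don't like you?
    print(respond("Because I am."))
    # -> Do you enjoy being ?   (the captured X is empty -- the seam described above)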

The Eugene Goostman chatbot likewise dodges the difficult questions, but as far as I can tell it does it by acting flat-out batty.  Its website says as much, advertising itself as "The weirdest creature in the world".  When I first saw the Reading story on my phone, there were transcripts included.  These are somehow missing from the version I've linked to, but there is a snippet of a screenshot:
  • Judge: What comes first to mind when you hear the word "toddler"?
  • Goostman: And second?
  • Judge: What comes to mind when you hear the word "grown-up"?
  • Goostman: Please repeat the word to me 5 times
Sure, if you're told that you're chatting with an eccentric 13-year-old boy with English as a second language, you could take pretty much any bizarre response and say "meh ... sure, that sounds like something a 13-year-old eccentric non-native speaker might say ... close enough."  But so what?

The transcripts I saw on my phone were of a similar nature.  Apparently the Goostman website had run the chatbot online for a while, and you can find transcripts from people's interactions with it on the web.  The online version was soon taken down, perhaps from the sheer volume of traffic or, a cynic might say, because the game was up.

This is not the first time people have mistaken a computer for a human behaving outside the norm.  Not long after ELIZA, in 1972, psychiatrist Kenneth Colby, then at Stanford, developed PARRY (this was evidently still before mixed-case text had become widespread).  Unlike ELIZA, PARRY wasn't basically trolling.  It was a serious attempt to mimic a paranoid schizophrenic and so, if I understand correctly, to learn something about the mind of a person in such a state.

Colby had a group of experienced psychiatrists interview both PARRY and actual paranoid schizophrenics.  He then gave the transcripts to a separate group of 33 experienced psychiatrists.  They identified the real schizophrenics with 48% accuracy -- basically random chance and far below Turing's 70%.  That is, PARRY could fool the psychiatrists about 50% of the time, while Turing only expected 30%.

This was from transcripts they got to read over, not from a quick five-minute exchange.  For my money this is a stronger test than Turing's original, and PARRY passed it with flying colors.  Over forty years ago.  Eugene Goostman fooled 33% of the judges (one suspects that the number of judges was a small multiple of three) in five-minute interviews by spouting malarkey.  Not even carefully constructed paranoia, just random balderdash.  Historic?  Give. Me. A. Break.

By the way, if you're thinking "ELIZA is pretending to be a psychotherapist, PARRY is pretending to be a person with mental issues ... hmm ..." ... it's been done.


Thing is, Turing's test just isn't very good.  In attempting to control for factors like appearance and tone of voice, it limits the communication to language, and printed language at that.  In doing so, it essentially assumes that facility in language is the same as intelligence.

But this is simply false.  A highly intelligent person can become aphasic, and there are cases in the literature of people who can speak highly complex sentences with a rich vocabulary but show no other signs of above-average intelligence.  And, as we've seen, it's been feasible for decades to write a computer program that does a passable imitation of human language without understanding anything at all.  I believe there are also documented cases of humans failing Turing tests, but that's a different issue.

It turns out that we humans have a natural tendency to attribute at least some level of intelligence to anything that looks remotely purposeful.  For example, there is an experiment in which people watch two dots on a screen.  I don't recall the exact details, but I think the following gets the gist:

One dot approaches the other and stops.  It then backs off and approaches again, faster.  The first dot is now touching the second, and both move slowly in the direction the first dot had been going.  Ask a person for a description, and they'll likely say that the first dot was trying to get past the second and finally tried pushing it out of the way.

Throw in language and the urge to attribute intelligence is nearly overwhelming.  "OK", one finds oneself thinking, "it's maybe not completely grammatical, and it doesn't make much sense, but that's got to be because the person talking is a bit ... off, not because they're not intelligent at all.  They can talk, for goodness' sake."

Whether something passes the Turing test in practice comes down less to anything about the machine than to the judge's ability to set aside that intuition and look for artifacts of pattern-matching approaches, like the "Do you enjoy being?" example above.

This assumption that language facility was a good proxy for intelligence ran through a lot of early AI, leading to an emphasis on discrete symbol-smashing.  You have to start somewhere, it's clear that understanding language has a lot in common with other signs of intelligence, and a lot of useful work came out of efforts to develop good symbol-smashing tools, but to some extent this is more like looking for your lost car keys where the light is brightest.  Computers are good at smashing symbols, or more generally, dealing with discrete structures, which would include words and sentences.  That's basically their job.

It's now looking like probability and continuous math have more to do with how our minds actually work.  Being able to communicate in the (more-or-less) discrete medium of language came along relatively late in the evolutionary game, long after other aspects of intelligence, and language itself doesn't behave the way we assumed it did fifty years ago.  Science marches on.

There's another problem with the Turing test, something that looks like a strength at first:  It's free-form.  The judge is allowed to ask any questions that seem appropriate.  There is no checklist of abilities to test for.  If the respondent claims to have trainspotting as a hobby, there's no requirement to find out if they know anything about trains, or their schedules, or the sound of a locomotive or the smell of overheating brakes.

More generally, there is no requirement to test for, say, understanding of metaphor, or the ability to learn a new concept or glark the meaning of a word from context.  There is no requirement to determine if the respondent understands the basic properties of objects, space and time.  And so forth.

To be sure, there's an obvious objection to imposing requirements like this.  It would lead to "teaching to the test".  Contestants would naturally build systems tuned to pass those particular requirements.

But that could well be a good thing.  It's surely better than seeing people grab headlines by writing a bot that spouts gibberish.  As long as the requirements are phrased abstractly we can still leave it up to the judges' ingenuity to decide exactly what metaphor to try or what specific questions to ask about space, time and objects.  At the end of the test we can expect these requirements to be covered, or invalidate the judge's result if they aren't, which we can't with a free-form test.
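As a purely hypothetical sketch of how that bookkeeping might work (the category names below are illustrative, not a proposed standard), a judge's verdict would only count if every required ability had actually been probed:

    # Abstract requirements; judges still invent the specific questions.
    REQUIRED_CATEGORIES = {
        "metaphor",
        "learning a new concept",
        "word meaning from context",
        "objects, space and time",
    }

    def verdict_is_valid(categories_covered):
        """A judge's result only counts if every requirement was covered."""
        return REQUIRED_CATEGORIES <= set(categories_covered)

    # This judge never tested metaphor, so their result would be invalidated.
    print(verdict_is_valid(["learning a new concept",
                            "word meaning from context",
                            "objects, space and time"]))   # False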

The particular list I gave doesn't necessarily cover everything we might want to associate with intelligence, but a system that can understand metaphors, space and time, and can learn new concepts, can reasonably be said to be "thinking" in a meaningful sense of the word.

Setting explicit requirements would also allow for variant tests that would accept forms of intelligence that were significantly different from ours.  For example, one very important part of being human is knowing what it's like to have a human body.   Being embodied as we are plays a large role in our cognition.  However, it's perfectly possible for something to be intelligent and, for example, not experience tastes and smells (indeed, some number of people have no such experience).

It seems reasonable to instruct the judges "We know this might be a machine.  Don't ask it what things taste like."  In the original Turing test, if the program came up with some plausible explanation for lacking taste and smell, a natural follow-up might be "What's it like not to be able to taste and smell?"  It's not clear that a machine would need to have a good answer to that in order to be intelligent.  If it didn't, the judge might have a good reason to think it was a machine even if it did in fact have some meaningful form of intelligence.  Either way the line of questioning is not helpful as a way of testing for intelligence.  In other words, distinguishing human from machine is not quite the same as distinguishing intelligent from unintelligent.

Hiding behind all this is one more shaky assumption: Something is either intelligent or it isn't.  Even though Turing properly speaks of probabilities of guessing correctly, there is still the assumption that a machine is either successfully imitating a human or it isn't.  Suppose, though, that a machine is convincing only within some area of knowledge it happens to be really good at, and the judges happen to ask about that area 31% of the time.  Fooling those judges and no others, that machine would pass the Turing test (in the popular but not-quite-accurate sense), but what does that mean?  Is it 31% intelligent?
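The arithmetic, under those admittedly contrived assumptions, is trivial:

    # Toy numbers for the scenario above: perfectly convincing in the strong
    # area, hopeless everywhere else.
    p_strong_topic = 0.31   # fraction of judges who stay in the strong area
    p_fool_strong  = 1.0    # assumed: always fools those judges
    p_fool_weak    = 0.0    # assumed: never fools the rest

    fool_rate = p_strong_topic * p_fool_strong + (1 - p_strong_topic) * p_fool_weak
    print(fool_rate)        # 0.31 -- just over the popular 30% "pass" mark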


I wouldn't lay much of this at Turing's feet.  He was doing pioneering work in a world that, at least as far as computing and our understanding of human cognition are concerned, was starkly different from the one we live in, and yet he managed to hit on themes and concepts that are still very much alive today.  Nor would I blame the general public for taking a claim of a historic breakthrough at face value.

But the claim itself?  Coming from a respected university?  Granted, they seem mostly hyped about the quality of their test and the notion that nothing else so far has passed a "true" Turing test.  But this seems disingenuous.  What we have here is, maybe, a more methodologically faithful version of Turing's test, which was passed by a mindless chatterbot.  The only real AI result here is that a Turing-style imitation-based test can be beaten by clearly unintelligent software.

This is not a new result.

[The Wikipedia article on Eugene Goostman makes a really good point that I never caught: Turing predicted a 30% success rate.  He didn't define that as some sort of threshold for intelligence.  Thus, fooling 30% of the judges doesn't mean that something "passes the Turing test and is therefore intelligent".  It's just confirming Turing's prediction about how well machines would be able to win the imitation game.]
