Thursday, November 9, 2017

syl·lab·i·fi·ca·tion

[Author's note: When I started this, I thought it was going to touch on deep questions of language and cognition.  It ended up kinda meandering around some random bits of computer word-processing.  This happens sometimes.  I'm posting it anyway since, well, it's already written.  --D.H.]

Newspaper and magazine articles are traditionally typeset in narrow, justified columns. "Justified" here means that every line is the same width (unlike, say, with most blog posts).  If the words aren't big enough to fill out a line, the typesetter will widen the spaces to make up the difference.  If the words are a bit too long, the typesetter might move the last word to the next line and then add space to what's left.

Originally, a typesetter was a person who physically inserted pieces of lead type into a form.  Later, it was a person operating a Linotype™ or similar machine to do the same thing.  These days it's mostly done by software.

Technically, laying out a paragraph to minimize the amount of extra space is not trivial, but it is certainly feasible, the kind of thing that would make a good undergraduate programming exercise.  Several algorithms are available.  They may not always produce results as nice as an experienced human typesetter's, but they do well enough for most purposes.
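
For the curious, here's a toy sketch in Python of the simplest greedy approach: pack words onto a line until the next one won't fit, then pad the gaps with extra spaces.  Real typesetting engines are considerably smarter than this (TeX, for instance, optimizes over the whole paragraph), but it shows the shape of the problem.

    # A toy greedy line-breaker and justifier.  A sketch only, nothing like
    # what production typesetting engines actually do.

    def justify(words, width):
        lines, current = [], []
        for word in words:
            # Would adding this word (plus a space) overflow the line?
            if current and len(' '.join(current + [word])) > width:
                lines.append(pad(current, width))
                current = []
            current.append(word)
        if current:
            lines.append(' '.join(current))   # the last line stays ragged
        return lines

    def pad(words, width):
        # Distribute the leftover space as evenly as possible between words.
        if len(words) == 1:
            return words[0]
        gaps = len(words) - 1
        spare = width - sum(len(w) for w in words)
        line = ''
        for i, word in enumerate(words[:-1]):
            line += word + ' ' * (spare // gaps + (1 if i < spare % gaps else 0))
        return line + words[-1]

    text = "Newspaper and magazine articles are traditionally typeset in narrow justified columns"
    for line in justify(text.split(), 30):
        print(repr(line))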

One option for getting better line breaks and better-looking paragraphs is to hyphenate.  If your layout looks funny because you've got floccinaucinihilipilification in the middle of a line, you might try breaking it up as, say, floccinaucinihili-
pilification.  It will probably be easier to lay out those two pieces rather than trying to make room for one large one.

You can't just throw a hyphen in anywhere.  There's a strong tendency to read whatever comes before and after the hyphen as independent units, so you don't want to break at wee-
knights or pre-
aches.

In many languages, probably most, this isn't a big problem.  For example, Spanish has an official set of rules that gives a clear hyphenation for any word (actually there are several of these, depending on what country you're in).  It's hard for English, though, for the same reason that spelling is hard for English -- English spelling is historical, not phonetic, and has so far resisted attempts at standardization and fonetissizing.

So instead we have the usual suspects, particularly style guides produced by various academic and media organizations.  This leads to statements like this one from the Chicago Manual of Style:
Chicago favors a system of word division based on pronunciation and more or less demonstrated by the recommendations in Webster’s tenth.
The FAQ list that quote comes from has a few interesting cases, though I'm not sure that "How should I hyphenate Josephine Bellver's last name?" actually qualifies as a frequently asked question.  The one that interests me here concerns whether it should be "bio-logy" or "biol-ogy".  CMOS opts for "biol-ogy", going by pronunciation rather than etymology.

Which makes sense, in that consistently going by pronunciation probably makes reading easiest.  But it's also a bit ironic, in that English spelling is all about etymology over pronunciation.

Either approach is hard for computers to cope with, since both require specific knowledge that isn't directly evident from the text.  It's common to encode that knowledge as lists of rules, which computers do handle reasonably well, but the problem with lists of rules for English is that they never, ever work.  For example, it's going to be hard to come up with a purely rule-based approach that divides "bark-ing" but also "bar-keeper".
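
To make that concrete, here's a deliberately contrived Python example (mine, not anyone's actual rule set): a hyphenator with a couple of plausible-looking rules that handles "barking" nicely and then confidently gets "barkeeper" wrong.

    # A contrived illustration of why simple rules break down.  Any rule keyed
    # on the letters b-a-r-k is going to get one of these two words wrong.

    def naive_hyphenate(word):
        # Rule 1: split off a trailing "ing".
        if word.endswith('ing') and len(word) > 5:
            return word[:-3] + '-' + word[-3:]
        # Rule 2: split after "bark" (fine for "barking", wrong for "barkeeper").
        if word.startswith('bark'):
            return 'bark-' + word[4:]
        return word

    print(naive_hyphenate('barking'))    # bark-ing   (correct)
    print(naive_hyphenate('barkeeper'))  # bark-eeper (should be bar-keeper)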

This is why style guides tend to fall back on looser guidance like "divide the syllables as they're pronounced".  Except -- whose pronunciation?  When I was a kid I didn't pronounce an l in "also" or an n in "government" (I've since absorbed both of those from my surroundings).  I'm pretty sure most American speakers don't pronounce a t in "often".  So how do you hyphenate those according to pronunciation?


Fortunately, computers don't have to figure this out.  A hyphenation dictionary for 100,000 words will cost somewhere around a megabyte, depending on how hard you try to compress it.  That's nothing in modern environments where a minimal "Hello world" program can run into megabytes all by itself (it doesn't have to, but it's very easy to eat a few megabytes on a trivial program without anyone noticing).

But what if the hyphenator runs across some new coinage or personal name that doesn't appear in the dictionary -- say, because whoever put the dictionary together didn't know about Josephine Bellver?  One option is just not to try to hyphenate those.  A refinement of that would be to allow the author to explicitly add a hyphen.  This should be the special "optional hyphen" character (Unicode calls it the soft hyphen), so that you don't get hyphens showing up in the middle of lines if you later edit the text.  That way if you invent a really long neologism, it doesn't have to mess up your formatting.
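
A minimal sketch of that arrangement in Python (the three-entry dictionary here is obviously a made-up stand-in for a real one with on the order of 100,000 entries): look the word up, fall back to any soft hyphens the author typed, and otherwise leave the word alone.

    # Dictionary-based hyphenation with a "leave it alone" fallback.

    SOFT_HYPHEN = '\u00ad'   # only becomes visible if the word actually breaks

    HYPHENATION = {
        'biology':   ['biol', 'ogy'],
        'barking':   ['bark', 'ing'],
        'barkeeper': ['bar', 'keeper'],
    }

    def break_points(word):
        if word.lower() in HYPHENATION:
            return HYPHENATION[word.lower()]
        if SOFT_HYPHEN in word:
            return word.split(SOFT_HYPHEN)   # the author marked the breaks
        return [word]                        # don't know, don't guess

    print(break_points('biology'))       # ['biol', 'ogy']
    print(break_points('floccinaucinihili' + SOFT_HYPHEN + 'pilification'))
    print(break_points('Bellver'))       # ['Bellver'] -- not hyphenated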

If there's a point to any of this, it's that computers don't have to follow specific rules, except in the sense that anything a computer does follows specific rules.  While it might be natural for a compugeek to try to come up with the perfect hyphenation algorithm, the better engineering solution is probably to treat every known word as a special case and offer a fallback (or just punt) when that fails.

This wasn't always the right tradeoff.  Memory used to be expensive, and a tightly coded algorithm will be much smaller than a dictionary.  But even then, there were tricks to be employed.  One of my all-time favorite hacks compressed a spelling dictionary down to a small bitmap that didn't even try to represent the actual words.  I'd include a link, but the only reference I know for it, Programming Pearls by Jon Bentley, isn't online.
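
As best I recall, the trick amounts to what we'd now call a Bloom filter: hash each word to a few positions in a bitmap and set those bits, accepting a small chance of false positives in exchange for a huge saving in space.  Here's a rough Python sketch of the idea (my reconstruction, not the book's code):

    # Bloom-filter-style membership test: can say "definitely not in the
    # dictionary" or "almost certainly in the dictionary" without storing
    # the words themselves.

    import hashlib

    BITMAP_BITS = 1 << 20      # one megabit, about 128 KB
    NUM_HASHES = 4

    def positions(word):
        for i in range(NUM_HASHES):
            digest = hashlib.sha256(f'{i}:{word}'.encode()).hexdigest()
            yield int(digest, 16) % BITMAP_BITS

    def build(words):
        bitmap = bytearray(BITMAP_BITS // 8)
        for word in words:
            for p in positions(word):
                bitmap[p // 8] |= 1 << (p % 8)
        return bitmap

    def probably_contains(bitmap, word):
        return all(bitmap[p // 8] & (1 << (p % 8)) for p in positions(word))

    bitmap = build(['typesetter', 'hyphen', 'syllable'])
    print(probably_contains(bitmap, 'hyphen'))   # True
    print(probably_contains(bitmap, 'hyphne'))   # almost certainly False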

Saturday, November 4, 2017

Surrender, puny humans!

A while ago, DeepMind's AlphaGo beat human champion Lee Sedol at the game of go.  This wasn't just another case of machines beating humans at games of skill.

Granted, from a mathematical point of view it was nothing special.  Games like go, chess, checkers/draughts and tic-tac-toe can in theory be "solved" by simply bashing out all the possible combinations of moves and seeing which ones lead to wins for which players.

Naturally the technical definition of "games like go, etc." is a bit, well, technical, but the most important stipulations are
  • perfect information -- each player has the same knowledge of the game as the others
  • no random elements
That leaves out card games like poker and bridge (imperfect information, random element) and Parcheesi (random element) and which-hand-did-I-hide-the-stone-in (imperfect information), but it includes most board games (Reversi, Connect 4, Pente, that game where you draw lines to make squares on a field of dots, etc. -- please note that most of these are trademarked).

From a practical point of view, there is sort of a pecking order:
  • Tic-tac-toe is so simple that you can write down the best strategy on a piece of paper.  Most people grow bored of it quickly since the cat always wins (that is, nobody wins and the game is a draw) if everyone plays correctly, and pretty much everyone can.
  • Games like ghost or Connect 4 have been "strongly solved", meaning that there's a known algorithm for determining whether a given position is a win, loss or draw for the player whose turn it is.  Typically the winning strategy is fairly complex, in some cases too complex for a human to reasonably memorize.  A human will have no chance of doing better than a computer for such games (unless the computer is programmed to make mistakes), but might be able to do as well.
  • Checkers is too complex for humans to play perfectly, but it has been "weakly solved".  This means it's been proved that with perfect play the result is always a draw, but not all legal positions have been analyzed, and there is currently nothing that can always tell you whether a particular position is a win for either side or a draw.  In other words, for a weakly solved game we can answer win/loss/draw for the initial position, and typically many others, but not for an arbitrary position.
  • Chess has not been solved, even in theory, but computer chess players that bash out large numbers of sequences of moves can consistently beat even the best human players.
In most cases the important factor in determining where a game fits in this order is the "branching factor", which is the average number of moves available at any given point.  In tic-tac-toe, there are nine first moves, eight second moves, and so on, and since the board is symmetrical there are effectively even fewer.  In many positions there's really only one (win with three-in-a-row or block your opponent from doing that).
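
To make "bashing out all the possible combinations of moves" concrete, here's a bare-bones Python sketch that exhaustively solves tic-tac-toe by negamax search.  It's feasible only because the game tree is tiny (well under a million nodes); the same brute force is hopeless for chess or go.

    # Exhaustive search for tic-tac-toe: from any position, try every legal
    # move and every reply, all the way to the end of the game.

    LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

    def winner(board):
        for a, b, c in LINES:
            if board[a] != ' ' and board[a] == board[b] == board[c]:
                return board[a]
        return None

    def value(board, player):
        """+1 if the player to move can force a win, -1 a loss, 0 a draw."""
        w = winner(board)
        if w:
            return 1 if w == player else -1
        moves = [i for i, sq in enumerate(board) if sq == ' ']
        if not moves:
            return 0                      # board full: the cat wins
        opponent = 'O' if player == 'X' else 'X'
        best = -1
        for m in moves:
            board[m] = player
            best = max(best, -value(board, opponent))
            board[m] = ' '
        return best

    print(value([' '] * 9, 'X'))   # 0: perfect play from the start is a draw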

In Connect 4, there are at most seven legal moves in any position, one per column.  In checkers there can be a dozen or so.  In chess, around 35 on average is typical.  As with tic-tac-toe there are positions where there is only one legal move, or only one that makes sense, but those are relatively rare in most games.

In go, there are typically more than a hundred different possible moves, and go positions tend not to be symmetrical.  Most of the time a reasonably strong human player will only be looking at a small portion of the possible moves.  In order to have any hope of analyzing a situation, a computer has to be able to narrow down the possibilities by a similar amount.  But to beat a human, it has to be able to find plays that a human will miss.
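
Some back-of-envelope arithmetic shows why narrowing down matters, using commonly cited ballpark figures (about 35 moves per position over roughly 80 half-moves for chess, versus about 250 moves over roughly 150 for go):

    # Rough game-tree sizes, branching_factor ** game_length.  The exact
    # figures are debatable; the size of the gap is the point.
    print(f"chess: 35**80   has {len(str(35**80))} digits")     # about 124 digits
    print(f"go:    250**150 has {len(str(250**150))} digits")   # about 360 digits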

I've seen go described as a more "strategic" game, one that humans can develop a "feel" for that computers can't emulate, but that's not entirely true.  Tactics can be quite important.  Much of the strategy revolves around deciding which tactical battles to pursue and which to leave for later or abandon entirely.  At least, that's my understanding.  I'm not really a go player.

AlphaGo, and programs like it, solved the narrowing-down problem by doing what humans do: collecting advice from strong human players and studying games played by them.  Historically this has meant a programmer working with an expert player to formulate rules that computers can interpret, along with people combing through games to glean more rules.

As I understand it (and I don't know anything more about DeepMind or AlphaGo than the public), AlphaGo used machine learning techniques to automate this process, but the source material was still games played by human players.  [Re-reading this in light of a more recent post, I see I left out a significant point: AlphaGo (and AlphaZero) encode their evaluation of positions -- their understanding of the game -- as neural networks rather than explicit rules.  While a competent coder could look at the code for explicit rules and figure out what they were doing, no one really knows how to decode what a neural network is doing, at least not to the same level of detail -- D.H. Jan 2019]

The latest iteration (AlphaGo Zero, of course) dispenses with human input.  Rather than studying human games, it plays against itself, notes what works and what doesn't, and tries again after incorporating that new knowledge.  Since it's running on a pretty hefty pile of hardware, it can do this over and over again very quickly.
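
None of what follows is DeepMind's actual method -- AlphaGo Zero couples deep neural networks with Monte Carlo tree search -- but the overall shape of "play yourself, keep what worked" can be shown with a toy: tabular self-play value learning for tic-tac-toe, in Python.

    # A toy version of learning by self-play: play many games against
    # yourself, then nudge the estimated value of every position you saw
    # toward the eventual result.  A lookup table stands in (very crudely)
    # for AlphaGo Zero's neural network.

    import random

    values = {}                  # position string -> estimated value for X
    EPSILON, ALPHA = 0.1, 0.2    # exploration rate, learning rate
    LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

    def winner(board):
        for a, b, c in LINES:
            if board[a] != ' ' and board[a] == board[b] == board[c]:
                return board[a]
        return None

    def choose_move(board, player):
        moves = [i for i, sq in enumerate(board) if sq == ' ']
        if random.random() < EPSILON:
            return random.choice(moves)             # explore
        def score(m):
            after = board[:m] + player + board[m+1:]
            v = values.get(after, 0.0)              # value from X's point of view
            return v if player == 'X' else -v
        return max(moves, key=score)                # exploit what's been learned

    def self_play_game():
        board, player, history = ' ' * 9, 'X', []
        while winner(board) is None and ' ' in board:
            m = choose_move(board, player)
            board = board[:m] + player + board[m+1:]
            history.append(board)
            player = 'O' if player == 'X' else 'X'
        w = winner(board)
        outcome = 1.0 if w == 'X' else -1.0 if w == 'O' else 0.0
        for pos in history:                         # note what worked and what didn't
            old = values.get(pos, 0.0)
            values[pos] = old + ALPHA * (outcome - old)

    for _ in range(20000):
        self_play_game()
    print(len(values), "positions seen; learned value of X opening in the center:",
          round(values.get(' ' * 4 + 'X' + ' ' * 4, 0.0), 2))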

This approach worked out rather well.  AlphaGo Zero can beat the AlphaGo that beat Lee Sedol, making it presumably the strongest go player in the world.  [and it has since done the same thing with chess and shogi, though its superiority in chess is not clear-cut.  See the link above for more details.  -- D.H. Jan 2019]

On the one hand, this is not particularly surprising.  It's a classic example of what I call "dumb is smarter" on the other blog, where a relatively straightforward approach without a lot of built-in assumptions can outperform a carefully crafted system with lots of specialized knowledge baked in.  This doesn't mean that dumb is necessarily smartest, only that it often performs better than one might expect, because the downside to specialized knowledge is specialized blind spots.

On the other hand, this is all undeniably spooky.  An AI system with no baked-in knowledge of human thought is able, with remarkably little effort, to outperform even the very best of us at a problem that had long been held up as something unreachable by AI, something that only human judgement could deal with effectively.  If computers can beat us at being human, starting essentially from scratch (bear in mind that the hardware all this runs on is largely designed and built by machine these days), then what, exactly, are we meat bags doing here?

So let's step back and look at the actual problem being solved: given a position on a go board, find the move that is most likely to lead to capturing the most stones and territory at the end of the game.

Put that way, this is a perfectly well-posed optimization problem of the sort that we've been using computers to solve for decades.  Generations, really, at this point.  Granted, one particular technique -- bashing out all possible continuations from a given position -- is clearly not well suited to it, but so what?  Finding the optimum shape for an airplane wing -- or at least a better one -- isn't well suited to that approach either, but we've made good progress on it anyway using different kinds of algorithms.

So "chess-style algorithms suck at go, therefore go is inherently hard" was a bad argument from the get-go.

From what I've seen in the press, even taking potential hype with a grain of salt, AlphaGo Zero is literally rewriting the books on go, having found opening moves that have escaped human notice for centuries.  But that doesn't mean this is an inherently hard problem.  Humans failing to find something they've been looking for over a span of centuries means it's a hard problem for humans.

We humans are just really bad at predicting what kinds of problems are inherently hard, which I'd argue is the same as being hard to solve by machine*.  Not so long ago the stereotype of a genius was someone who "knew every word in the dictionary" or "could multiply ten-digit numbers immediately", both of which actually turned out to be pretty easy to solve by machine.

Once it was clear that some "genius" problems were easy for machines, attention turned to things that were easy for people but hard for machines.  There have been plenty of those -- walking, recognizing faces, translating between speech and text, finding the best move on the go board.  Those held out for quite a long time as "things machines will never be able to do", but the tide has been turning on them as well thanks, I think, to two main developments:
  • We can now build piles of hardware that have, in a meaningful sense, more processing power than human brains.
  • With these new piles of hardware, techniques that looked promising in the past but never really performed are now able to perform well, the main example being neural network-style algorithms.
At this point, I'm convinced that trying to come up with ever fuzzier and more human things that only human brains will ever be able to do is a losing bet.  Maybe not now, but in the long run.  I will not be surprised at all if I live to see, say
  • Real time speech translation that does as well as a human interpreter.
  • Something that can write a Petrarchan sonnet on a topic of choice, say the futility of chasing perfection, that an experienced and impartial reviewer would describe as "moving", "profound" and "original".
  • Something that could read a novel and write a convincing essay on it comparing the story to specific experiences in the real world, and answer questions about it in a way that left no choice but to say that in some meaningful sense the thing "understood" what it read.
  • Something that it would be hard to argue didn't have emotions -- though the argument would certainly be made.
[On the other hand, I also won't be shocked if these don't totally pan out in the next few decades --D.H. Feb 2019]

These all shade into Turing test territory.  I've argued that, despite Alan Turing's genius and influence, the Turing test is not necessarily a great test of whatever we mean by intelligence, and in particular it's easy to game because people are predisposed to assume intelligence.  I've also argued that "the Singularity" is an ill-defined concept, but that's really a different thread.  Nevertheless, I expect that, sooner or later, we will be able to build things that pass a Turing test with no trickery, in a sense that most people can agree on.

And that's OK.

Or at least, we're going to have to figure out how to be OK with it.  Stopping it from happening doesn't seem like a realistic option.

This puts us firmly in the territory of I, Robot and other science fiction, both of that era and more recent (the modern Westworld reboot comes to mind), which is one reason I chose the cheesy title I did.  Machines can already do a lot of things better than we can, and the list will only grow over time.  At the moment we still have a lot of influence over how that happens, but that influence will almost certainly decrease over time (the idea behind the Singularity is that this will happen suddenly, in fact nearly instantaneously, once the conditions are right).

The question now is how to make best use of what influence we still have while we still have it.  I don't really have any good, sharp answers to that, but I'm pretty sure it's the right question.


* There's a very well-developed field, complexity theory, dealing with what kinds of problems are hard or easy for various models of computing in an absolute, quantifiable sense.  This is largely distinct from the question of what kinds of games or other tasks computers should be good at, or at least better than humans at, though some games give good examples of various complexity classes.  One interesting result is that it's often easy (in a certain technical sense) to produce good-enough approximate solutions to problems that are provably very hard to solve exactly.  Another interesting result is that it can be relatively tricky to find hard examples of problems that are known to be hard in general.
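
For instance (my example, not something from the post above): finding a minimum vertex cover of a graph is NP-hard, but a simple greedy trick -- repeatedly take both endpoints of any edge that isn't covered yet -- is guaranteed to produce a cover at most twice the optimal size.  In Python:

    # Classic 2-approximation for minimum vertex cover.  Finding the smallest
    # cover is NP-hard; this greedy version is provably within a factor of two.

    def approx_vertex_cover(edges):
        cover = set()
        for u, v in edges:
            if u not in cover and v not in cover:
                cover.update((u, v))
        return cover

    edges = [('a','b'), ('b','c'), ('c','d'), ('d','a'), ('a','c')]
    print(approx_vertex_cover(edges))   # a valid cover, at most twice the minimum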