Intermittent Conjecture

Tuesday, May 21, 2019

Computer chess engines fall in line

[This post came out of trying to figure out what computer chess results tell us about what kind of problem chess-playing is. It goes into quite a bit of detail about cycles where player A beats player B beats player C beats player A and why they don't seem to happen. Shortly after I posted this, the Computer Chess Championship "Deep Dive" round finished up, and there was a slight anomaly: Stockfish won the tournament, but second-place LC0 beat it head to head, by the narrow margin of 4-3-43, implying an Elo difference of 8 points. I don't think this affects the post greatly, but it does reinforce the point that NN engines are less consistent than AB engines. Stockfish didn't lose a game to 4th-place Houdini while LC0 both won and lost. I'll update this slightly when chess.com makes the full results available --D.H. May 2019]

One thing jumps out from watching the ongoing computer chess tournaments that is so ordinary-seeming that it's easy to dismiss entirely: The rankings from a preliminary round with a couple dozen contestants, are strikingly linear. If you look at the cross-table, which shows everyone's results against everyone else's, wins in green, losses in red and draws in gray, not only do everyone's scores improve down the table, but for the most part all the red is above and all the green below. Stronger engines very consistently beat weaker engines, especially considering that at least some of the apparent upsets were due to system crashes and such.

I've seen this happen twice now, once for the "Blitz Bonanza" (5 minutes plus two seconds per move) and the Deep Dive (15 minutes plus 5 seconds per move). The Deep Dive preliminary, which just finished a few days ago, was played in "escalation" format, meaning that each new player plays everyone who's already played (twice as white and twice as black). After a few opponents, you can be pretty confident about where a new player will land -- above whoever whoever it's beat head-to-head, below whoever it's lost to, and close to anyone it drew all four games with.

It shouldn't be a total shock that stronger players beat weaker players, or even that they beat them very consistently. My point is more that it doesn't have to be that way. While there are only two basic styles of engine*, each dev team has its own philosophy and each one evaluates positions differently. In the case of neural net-based engines (generally referred to as NN), the evaluation of positions is done radically differently.

It seems at least plausible that, as with human players, engines might tend to do better or worse than otherwise expected against particular styles of play. An engine that attacks aggressively might lose to one that's tuned for careful defending, but beat a middle-of-the-road approach that can outplay the careful defender positionally. This would create a cycle of attacker beats middle-of-road beats defender beats attacker, but this sort of situation doesn't seem to happen in practice.

Without going back and checking the actual numbers, my impression is that AB engines -- the conventional style which does an exhaustive search a far ahead as it can manage, using fairly understandable hand-coded rules to evaluate positions quickly -- win more by being able to look further ahead than by having cleverer evaluation rules. That is, evaluation speed is a good predictor of who will win.

Clearly if two engines have identical evaluation rules, but one implements them more efficiently, the faster engine should win because it can see further ahead. The faster engine will see everything the slower engine sees, but it can also see which of two seemingly equal moves leads to a better result. It will therefore be able to steer toward winning positions and the slower engine won't see the danger until it's too late.

Even with each team using its own particular evaluation rules, something like this still happens. Typically when two AB engines play, there will be a period where the weaker engine shows a more favorable evaluation for itself than the stronger one does, e.g., the weaker engine is playing white and it sees a half-pawn advantage for white while the stronger engine sees a small advantage for black.

The weaker engine thinks everything's going fine while the stronger engine knows that it has the edge. Eventually the weaker engine's evaluation starts to drop as it realizes things are no longer as good as they looked, and the stronger engine pushes its small advantage into an overwhelming one.

Again, it isn't a foregone conclusion that a faster engine will beat a slower one with a different evaluation function, but it's what we see in practice. My suspicion is that evaluation functions hit diminishing returns after certain point. You've accounted for material, you've got some concept of tactical considerations like pins and hanging pieces, and a bit on positional factors like space and pawn structure. Further improvements are certainly possible, in the form of slightly better evaluation functions, slightly faster evaluation, or both, but the low-hanging fruit was picked long ago.

Adding a detailed model of, say, pawn structure, is likely just to slow the evaluation down without providing enough benefit to make up for the decrease in depth. This is the "dumb is smarter" principle that seems to have held sway for the past few decades.

Stockfish -- currently the clear champion among AB engines -- demonstrates this nicely with its "fishtest". As I understand it, developers try out changes to the program to see if their locally patched version consistently beats the current released version. Only winning changes are brought into the next release. While Stockfish has improved significantly as a result of fishtest, progress has been incremental, and the end result is that Stockfish can evaluate positions very quickly. I'm not 100% sure it's the fastest, but it's certainly high up there.

I'm sure everyone, open source or not, playtests potential improvements (and the training phase of NN engines take this to an extreme), but crowdsourcing this ought to keep development from fixating on some particular person or small group of people's idea of what must be right. Again, fewer assumptions means better results.

Bringing NN engines into the picture doesn't seem to change this linear ranking. NN engines are on the other extreme of the smart vs. fast tradeoff. For a given number of layers and so forth a position generally takes just as long to evaluate regardless of the exact parameters of the net being used, so the only difference in play will be due to the net itself. Evaluation speed is not a factor since it doesn't change.

This doesn't necessarily rule out A-beats-B-beats-C-beats-A cycles, in fact, it seems like they might be more likely since different nets will put different weights on different factors -- assuming there's more than one way to be good at chess. So far, though, there don't seem to be any. In the tournament results I've seen so far, there are three strong NN engines: Leela Chess Zero (generally known as LC0) Leelenstein and Antifish (a modified Leela trained specifically to beat Stockfish). In the Blitz Bonanza, LC0 had winning records against the other two and Leelenstein had a winning record against Antifish. Antifish does not seem to do better than the others against Stockfish. If anything it does slightly worse.

There are only two NN engines in the Deep Dive finals (rules now prohibit more than one variant of the same engine reaching the finals, so Antifish is out). Obviously there will be no cycle between them. For what it's worth, LC0 is currently slightly ahead of Leelenstein, but they're dead even head-to-head.

There is one clear deviation from the orderly world of AB engines. Since NN engines evaluate so slowly, they take a random sample of positions when looking ahead, rather than searching exhaustively like AB engines. I believe the sample is weighted towards moves the net considers likely to be good, but in any case this introduces some inconsistency. NN engines occasionally lose to weak AB engines since not all good-looking moves turn out actually to be good, while strong AB engines almost never do. Against strong AB engines, NN engines can find good moves that the AB opponent overlooks because it doesn't lead to good results quickly enough for the exhaustive search to turn it up, but they can also play good-looking moves that turn out to be unsound if you search well enough.

The net effect is that strong NN engines don't do as well against weak AB engines as you'd expect from how they do against strong engines. In the current Deep Dive finals, Stockfish is (at this writing) noticeably ahead of LC0, but head-to-head they're nearly even (3 wins to 2 with 41 draws, implying an ELO difference of 8 points; by contrast, human world champion Carlsen is currently rated about 57 points ahead of second-ranked Caruana) [and in fact, Stockfish won the finals but lost to LC0 head-to-head].

The difference is that Stockfish has done better against the weaker opposition. Stockfish is 9-0 against 4th-place Houdini (with a whole bunch of draws), while LC0 is 12-3 against Houdini (with correspondingly fewer draws). Stockfish also does better against the other NN engine, 3rd-place Leelenstein.

So there don't seem to be any cycles in the ranking of chess engines, and the (small) difference between top AB engines and top NN engines looks to be due to consistency. What does this tell us about chess-playing as a problem? There are several possible conclusions:

Chess skill really is a linear ranking, at least once you get to the level of computer engines. If A beats B and B beats C, then A beats C.
A bit stronger: There is one true way to play chess and stronger engines approximate this better.
On the other hand: AB engines are linearly ranked because only depth/evaluation speed really matters while NN engines are linearly ranked because only the quality of the net doing the evaluation matters. It's not surprising, though not inevitable, that combining two linear rankings gives another linear ranking.
Also, NN engines are all playing on or near the level of the current AB champion, so they're really only being compared to one style of play since there's only one really competitive opponent.

My feeling is that we don't really have enough to know for sure, but most likely we'll continue to see improvement in playing strength and that the linear ranking will hold. Next year's champion will be able to beat any other opponent head-to-head, the one after that will be able to beat it and everyone else, and so forth, regardless of whether the next champion is AB, NN or some hybrid of the two. There are noticeable differences between how AB engines and NN engines play (I'd like to get into the details in a future post), but be that as it may, stronger is stronger.

* Technically, there are two choices on two dimensions: Alpha/Beta tree search vs. Monte Carlo and neural-net based evaluation vs. hand-coded rules. However, NN engines always use Monte Carlo since it takes them roughly a thousand times longer to evaluate a particular position, so they can't search exhaustively. On the other hand, conventional engines generally use AB searches because they can. There have been fairly successful experiments in using Monte Carlo with hand-coded rules. Those tend to be called hybrid.

Saturday, April 27, 2019

Notes on the EHT black hole image

Following on to the recent post on the other blog about the imaging of a black hole in a nearby galaxy, here are my lightly cleaned up notes from watching Katie Bouman's lecture on how the images were captured using the Event Horizon Telescope (EHT), which is actually a collection of several radio telescopes whose individual data sets are combined using very sophisticated image processing techniques to emulate a single Earth-sized telescope.

I have no great insights to offer, but there are some pretty interesting details in the original lecture. I've thrown in some commentary here and there, without taking any real care to distinguish it from the source material. The full lecture is on Caltech's YouTube channel. As always, any errors or distortions are mine.

Technically, the image gathered by the telescopes working in combination is in spatial frequency, basically how quickly bright changes to dark as you scan over the actual image. For example, a square divided into a white rectangle and a black rectangle has a lower spatial frequency than one divided into thin black and white stripes. It's possible to reconstruct a visual image from information about the spatial frequency, and the more information you have, the better the reconstruction. This sort of reconstruction is widely used in many fields, including astronomy, medical imaging and the image compression that lets us stream video over the internet.

The exact spatial frequencies being measured depend on the distance (baseline) between the telescopes. When the telescopes are spread across distant locations on the Earth, the image technique is called Very Long Baseline Interferometry (VLBI). Interferometry refers to measuring how the signals from different telescopes interfere with each other, that is, whether they reinforce each other or cancel each other out when combined. Very Long Baseline refers to, um ...

In this case there were eight radio telescopes at six different sites, taking data together on four different nights, scanning for a few minutes at a stretch. This provides 21 different pairs of telescopes, which would mean 21 different baselines, that is, 21 different spatial frequencies to combine, except that the baselines change as the Earth rotates. This is good, because it provides samples of more frequencies. There is at least some talk of adding telescopes in low Earth orbit at some later stage, which would not only provide more raw data but more varied baselines, since satellites move faster than the Earth rotates.

As mentioned in the other post, each scope collected hundreds of terabytes (hundreds of trillions of bytes), enough data that it was easier to fly the drives with the data to the site where it would be processed, rather than try to send it over the internet (and the South Pole data had to be flown out in any case).

The actual signal, coming from about 50 million light-years away, is quite weak, and noisy. There is a lot of matter at the center of a galaxy, radiating at various frequencies and blocking various frequencies. The radio band was chosen to minimize these effects, but there is still noise to contend with, along with noise from a variety of other sources, including atmospheric noise and imperfections in the telescopes themselves.

The various telescopes are looking through different amounts of atmosphere, depending on whether the black hole is overhead from their location or on the horizon, and this also affects how long it takes the signal to arrive at each telescope. It takes about 20 milliseconds longer for the signal to reach a scope that has the black hole on the horizon that it does for one where it's overhead. For a 1GHz signal, that means a difference of 20 million cycles -- and again, this is varying as the earth rotates.

All this means there is a lot of calibration to be done to make sure you're lining up the right data from the various telescopes and accounting for all the various sources of noise and distortion. In all, there were hundreds of thousands of calibration parameters to work with, though the team ended up using (only) ten thousand or so.

The reconstructed image is meant to represent what an Earth-sized telescope would see, if one could be built, but there are actually infinitely many such images that would match up with the limited data that was collected from a few scattered points on Earth.

Fortunately, having multiple telescopes means you can cross-check and isolate errors. There are also "closure quantities" that should be the same regardless of the exact details of calibration. I didn't really understand the details of closure quantities, but the idea of looking for quantities that don't depend on hard-to-control parameters is a powerful one. It's sort of similar to checking arithmetic by adding up digits.

The team used two different methods of calibration, and they split into four subteams, two for each method. The subteams worked independently for several weeks to reconstruct an image from the raw data. Bouman's own work involved getting the human-tuned method (CLEAN) to work without human tuning.

In addition to the work by the subteams, there was a lot of work done up front to try to make sure that the image processing algorithms they were tuning were robust. Much of this involved taking test images, calculating what signals the telescopes would get and then reconstructing the test images from the calculated signals. They did this not only with images of discs and rings that one would typically expect in astronomical images, but also with a variety of non-astronomical images.

The team actively tried to break the image processing algorithms, for example by tuning the parameters to favor a solid disc rather than a ring, and verifying that a ring still came out as a ring. Python fans will be glad to know that at least some of the code involved was written in Python.

The publicly distributed image is a combination of the images from the subteams. Since it's meant to represent only what we can be confident about regarding the actual black hole, it's deliberately a bit blurry. Even still, there are some significant results [I haven't yet re-checked the details here, so what follows may be garbled]:

There is, in fact, a ring as would be expected from a supermassive black hole.
The same ring was observable night after night.
Previously there had been two separate estimates of the mass of the black hole, one from the motions of stars around it (stellar dynamics) and one from the appearance of the gas flowing into the accretion disc. The size of the ring in the image supports the first mass estimate.
One portion of the ring is brighter than the rest, due to the doppler shift of matter orbiting around the black hole. This is to be expected, and the position of the bright spot matches up with other observations.
The position of the bright spot is shifting over time, consistent with a rotation period on the order of weeks (I think this implies that the black hole's axis of rotation is not on the line of sight). Since the black hole in the Milky Way (Sgr A*) is much smaller, its rotation period is expected to be on the order of minutes. Since the observation scans are also on the order of minutes, the image processing for Sgr A* will have to take this into account.

It's also possible to analyze the raw data directly as a cross-check. A disk of such-and-such size should result in such-and-such radio signals arriving at the various telescopes, independent of any reconstructions produced from that data. The raw signals are consistent with a ring of the size found in the visual reconstructions. This is not a surprise, given how the reconstructions were done, but still good to know.

Despite the whole effort taking years and requiring expensive radio telescopes at widely dispersed locations, and employing dozens of people at various stages, the movie Interstellar still cost a lot more to make. While its image of a black hole may be prettier (and, as I understand it, one of the parts of the movie where the science was actually plausible), the EHT image was both cheaper and more informative.

Again, really good stuff.

Sunday, April 14, 2019

Notes on watching (a chunk of) a computer chess tournament

A while ago I started a post trying to untangle what kind of a problem it is to play chess well, based on how well various kinds of players -- human, conventional computer chess engines and neural-network-based chess engines -- did against each other. I still think it's an interesting question, but in the mean time I actually watched a bunch of computer chess and that was even more interesting. In particular, I looked in on a fair bit of chess.com's "CCC 7: Blitz Bonanza". Here are some rough impressions.

The Blitz Bonanza is one part of a long-running event comprising thousands of individual games in several different formats (the nice thing about chess engines is they don't get tired, so it's practical to have two of them play each other dozens of times in the course of a few days). The "Deep Dive", with longer time controls and slightly different rules, is going on now and will be for several weeks. There may be more after that (much of this is from memory because I can only seem to find current results on chess.com and the summaries on other sites don't always contain the exact information I'm looking for).

In the Blitz Bonanza, each player has five minutes of time to start with, plus two seconds per move made. There is no "opening book", meaning that the players must calculate every move. This leads to the rather odd sight of a clock ticking down near or past the 4-minute mark on move one, which is pretty much inevitably something everyone plays anyway (1. d4 and 1. e4 predominate).

Play continues either to checkmate by one side (though it enters a "get it over with" or "bullet" mode if both engines evaluate the current position as highly favorable to one or the other side), to a draw by stalemate, threefold repetition, 50 moves without a pawn move or capture, or "adjudication", meaning that the position is in a standard database of positions with six or fewer pieces on the board that cannot be won if both sides play perfectly (this includes "insufficient force" scenarios like king and bishop against king), or both engines evaluated the game as a dead draw for a certain number of moves.

This is rather different from human matches, where a player will typically resign if it becomes clear there's no way to avoid losing, and players typically agree to a draw when it's clear that neither of them has a realistic shot at winning unless the other one blunders. It's sometimes fun to watch an actual checkmate. The draws can be interesting in their own way. More on that later.

Scoring was the usual one point for a win, no points for a loss and half a point to each player for a draw

The original field consisted of

The premier conventional (alpha/beta or AB) engine, Stockfish
Three neural network-based (NN) engines (Leela Chess Zero, aka LC0, Leelenstein and Antifish)
Everyone else

In the first round, each of these played each of the others several times, half of the time as white and half as black. The top four would progress to the next round. When the smoke cleared, the top four were LC0, the other two neural network engines, and Stockfish rounding out the field in fourth.

In the second round, each of the four played each of the others 100 times (50 as white and 50 as black), for a total of 600 games. This went on around the clock for several days. The final result had LC0 in first, Stockfish close behind, well ahead of Leelenstein, which was itself well ahead of Antifish. LC0 had a small lead over Stockfish in their head-to-head games.

Looking at this with my somewhat defocused engineer's eyes, I'd say "LC0 and Stockfish were about equal, both were clearly stronger than Leelenstein, and Antifish was clearly the weakest of the lot." I'd consider LC0 and Stockfish essentially equal not only because of the small difference in score (5.5 points out of a possible 300), but because the margin between the two was steady for most of the play. LC0 won some early games against Stockfish head-to-head, but after that each scored several wins against the other. The early lead thus looks more like a statistical fluke than a real difference in strength [I'm not sure a real statistician would buy this analysis -D.H May 2019]. The results of CCC 6, with slightly different rules and time controls, bear this out. Stockfish beat LC0, but not by much, and, if I recall correctly, not head-to-head.

As to my long-running theme of "dumb is smarter", the "zero" in LC0 means that it was trained purely by playing against itself, without any external hints. Antifish, which finished last and had a losing record against each of the others, including Stockfish, was trained specifically to exploit weaknesses in Stockfish. If we take LC0's "no extra information to bias the results" approach as one kind of "dumb" and Stockfish's "bash out lots of positions using fairly straightforward (but highly-tuned) rules" as another, it's pretty clear that "dumb" dominated.

One other thing jumped out: Neural networks are not good at endgames, or at least they can be spectacularly bad at some endgames.

Endgames are interesting here because there are few pieces on the board and it's often possible to spell out an explicit plan for winning. For example, if you have a rook and a king against a rook, you can use your rook and king in concert to pin the other king against the edge of the board and then checkmate with the rook. In other cases, you want to promote a pawn and this hinges on getting your king to the right square and keeping the other king from stopping you.

Over and over I saw neural network engines flail about aimlessly before hitting the 50 move limit. To be clear, there are a few endgames that can take more than 50 moves to win, but even in these you can see a clear progression. To make up an example, after a series of ten checks the opposing king is now one square closer to the corner where it will eventually be checkmated. This isn't what was happening. In one game I saw, the AB engine showed mate in 12, while the NN engine saw its winning chances as quite good. The game continued

AB: Oh, now you've got mate in 14

NN: Yeah this looks good

AB: I don't see a mate, but I think you're going to promote that pawn (or whatever) and win, so my rating is still highly in your favor

NN: Yeah this looks good

AB: Dang, mate in 14 again

NN: Yeah this looks good

AB: Mate in 13

NN: Yeah this looks good

AB: I don't see a mate any more, but ...

NN: Yeah this looks good

From the NN's point of view, it was going from one position with good winning chances to another. But it wasn't actually winning. When the game eventually ended in a draw, the NN gave away a half point that the AB engine or a good human player would have gotten, given the same position.

I was a little (but not a lot) surprised by this. It doesn't seem like the problem is beyond the capability of neural networks. They've discovered plenty of conventional tactics -- forks, skewers and such, or ladders in the game of go -- and they are clearly able to encode at least some of the usual winning patterns.

Give a neural network a king and queen against a queen and it will find a checkmate more or less like anyone else would. On the other hand, give it a queen, a pawn and a rook against a king and it will quite likely forcibly give away two of the three and win with what's left. To a mathematician it makes sense (reducing to a previously-solved case), but to most chess players it looks like lunacy, if not deliberate trolling [In one fairly jaw-dropping example from the Deep Dive finals, LC0 had an overwhelming material advantage in an endgame and gave away pieces to reduce it to bishop, knight and king against lone king, one of the harder endgames to win. It went ahead and won anyway --D.H May 2019].

I found this behavior particularly interesting because neural networks are supposed to have more of that special sauce we call intelligence or understanding. We can clearly say that Stockfish has no deep understanding of chess the way that a human grandmaster can intuitively say "Your pieces are active and the king is exposed. Go for a mating attack," but at least it plays plausible moves. NN engines often find wild moves that a human wouldn't consider and an AB might avoid because it doesn't show a clear win, leading us to feel it has some deeper understanding of chess. I think that's a valid feeling, but there are cases where it completely falls apart.

Watching a neural network botch an endgame just looks like farce. I'm perfectly comfortable saying that a program "understands" a concept in chess, or "has a concept of" pawn structure or whatever. It doesn't mean that the program has some reflective layer thinking "Oh, that's a strong pawn chain, I like that", but that it behaves in such a way that, one way or another, it must recognize good and bad structures.

In cases like these endgames, though, it's abundantly clear that the network has no concept, in any meaningful literal or metaphorical sense of the word, of how to win the position. Nor is it going to be able to figure one out over the board, because the network only learns during the training phase (that's not necessarily a requirement, but it's how these engines work).

This doesn't seem like a permanent situation. For example, you could have an NN engine train on endgame positions, and it would likely discover ways to win outright and ways to steer toward a win in positions where it now flails. This is pretty much what humans do -- you learn how to promote pawns as one skill, how to checkmate with a rook and king as another, and so forth.

What's at least a bit surprising is that an engine can be frighteningly strong in most of the game and completely clueless in a few key spots. This has to say something about intelligence and learning, or about what kind of a problem chess-playing is technically, or about all of the above, but I'm not quite sure what.

Sunday, December 30, 2018

Computer chess: Dumber and smarter at the same time?

[As usual, I've added inline comments for non-trivial corrections to the original text, but this one has way more than most. I've come up to speed a bit on the current state of computer chess, so that's reflected here. The result is not the most readable narrative, but I'm not going to try to rewrite it --D.H. Apr 2019]

One of the long-running tags on the other blog is "dumb is smarter". The idea is that, at least in the world of software, it's perilous to build a lot of specialized knowledge into a system. Specialized knowledge means specialized blind spots -- any biases or incorrect assumptions in the specialization carry over to the behavior of the system itself. In cases like spam filtering, for example, a skilled adversary can exploit these assumptions to beat the system.

I will probably follow up on the other blog at some point on how valid the web.examples of this that I cited really were, and to what extent recent developments in machine learning have changed the picture (spoiler alert: at least some). Here I'd like to focus on one particular area: chess. Since this post includes Deep Mind, part of the Alphabet family, I should probably be clear that what follows is taken from public sources I've read.

For quite a while and up until recently, the most successful computer chess programs have relied mainly on brute force, bashing out millions and millions of possible continuations from a given position and evaluating them according to fairly simple rules [in actual tournament games, it's not unusual to see over a billion nodes evaluated for a single move]. There is a lot of sophistication in the programming, in areas like making good use of multiple processors, avoiding duplicate work, representing positions in ways that make good use of memory, generating possible moves efficiently and deciding how to allocate limited (though large) processing resources to a search tree that literally grows exponentially with depth, but from the point of view of a human chess player, there's nothing particularly deep going on, just lots and lots of calculation.

For a long time, a good human player could outplay a good computer program by avoiding tactical blunders and playing for positional advantages that could eventually be turned into a winning position that even perfect tactical play couldn't save. If the human lost, it would generally be by missing a tactical trick that the computer was able to find by pure calculation. In any case, the human was looking at no more than dozens of possible continuations and, for the most part, calculating a few moves ahead, while the computer was looking at vastly more positions and exploring many more of them in much greater depth than a person typically would.

The sort of positional play that could beat a computer -- having a feel for pawn structure, initiative and such -- comes reasonably naturally to people, but it's not easily expressed in terms of an evaluation function. The evaluation function is a key component of a computer chess engine that reduces a position to a number that determines whether one position should get more attention than another. Since there are millions of positions to evaluate, the evaluation function has to be fast. A typical evaluation function incorporates a variety of rules of the kind you'll find in beginning chess books -- who has more material, who has more threats against the other's pieces, whose pieces have more squares (particularly central squares) to move to, are any pieces pinned or hanging, and so forth.

There's quite a bit of tuning to be done here -- is it more important to have a potential threat against the opponent's queen or to control two squares in the center? -- but once the tuning parameters are set, they're set, at least until the next release. The computer isn't making a subtle judgment based on experience. It's adding up numbers based on positions and alignments of pieces.

It's not that no one thought of writing a more subtle evaluation function, one that would allow the program to look at fewer, but better positions. It's just that it never seemed to work as well. Put a program that looks at basic factors but looks at scads and scads of positions against one that tries to distill the experience of human masters and only look at a few good moves, and the brute force approach has typically won. The prime example here would be Stockfish, but I'm counting, engines such as earlier versions of Komodo as brute force since, as I understand it, they use the same alpha/beta search technique and examine large numbers of positions. I'm having trouble finding exact numbers on that, though [If you look at the stats on chess.com's computer chess tournaments, it's very easy to tell who's who. For a 15|5 time control, for example, you either see on the order of a billion nodes evaluated per move or on the order of a hundred thousand.]

[Houdini is another interesting example. Its evaluation function puts more weight on "positional" factors such as active pieces and space, and it tries to be highly selective in which moves it devotes the most resources too. This is, explicitly, trying to emulate the behavior of human players. So it's not quite correct to say that programs that try to emulate humans have done worse than ones that just bash out continuations. Houdini has done quite well.

However, from a software point of view, these programs are all largely similar. There is an evaluation function that's explicitly coded as rules you can read, and this is used to decide how much attention to pay to what part of the search tree.

Alpha zero, by contrast, uses machine learning (aka neural network training, aka deep neural networks) to build an evaluation function that's totally opaque when considered purely as code. There are techniques to examine what a neural network is doing, but they're not going to reduce to rules like "a bishop is worth three times as much as a pawn". It also uses a Monte Carlo approach to examine the search tree probabilistically, which is a somewhat different way to use the evaluation function to guide the search. As I understand it, this is not the usual practice for other engines, though it's certainly possible to incorporate a random element into a conventional chess engine. Komodo MC comes to mind.

In short, the narrative of "purely mechanical programs beat ones that try to emulate humans" is not really right, but Alpha Zero still represents a radically different approach, and one that is in some sense structurally more similar to what things-with-brains do. --D.H. Feb 2019]

This situation held for years. Computer hardware got faster, chess programs got more finely tuned and their ratings improved, but human grandmasters could still beat them by exploiting their lack of strategic understanding. Or at least that was the typical explanation: Humans had a certain je ne sais quoi that computers could clearly never capture, not that the most successful ones were particularly trying to. Occasionally you'd hear a stronger version: computers could never beat humans since they were programmed by humans and could never know more than their programmers, or some such, but clearly there's a hole in that logic somewhere ...

Then, in 1997, Deep Blue (an IBM project not directly related to Deep Mind) beat world champion Garry Kasparov in a rematch, Kasparov having won the first match between the two. It wasn't just that Deep Blue won the games. It outplayed Kasparov positionally just like human masters had been outplaying computers, but without employing anything you could point at and call strategic understanding. It just looked at lots and lots of positions in a fairly straightforward way.

This isn't as totally surprising as it might seem. The ultimate goal in chess is to checkmate the opponent. In practice, that usually requires winning a material advantage, opening up lines of attack, promoting a pawn in the endgame or some other tactical consideration. Getting a positional advantage is just setting things up to make that more likely. Playing a simple tactical game but looking twenty plies (half-moves) ahead turns out to be quite a bit like playing a strategic game. [In fact, computer-computer chess games are almost entirely positional, since neither player is going to fall for a simple tactical trap. That's not to say tactics don't matter. For example I've seen any number of positions where a piece looked to be hanging, but couldn't be taken due to some tactical consideration. What I haven't seen is a game won quickly by way of tactics.]

Human-computer matches aren't that common. At first, the contests were too one-sided. There was no challenge in a human grandmaster playing a computer. Then, as computers became stronger, few grandmasters wanted to risk being the first to lose to a computer (and Kasparov is to be commended for taking up the challenge anyway). Now, once again, the contests are too one-sided. Human players use computers for practice and analysis, but everyone loses to them head-to-head, even with odds given (the computer plays without some combination of pieces and pawns).

At this writing, current world champion Magnus Carlsen, by several measures the strongest player ever or at the very least in the top handful, stands less than a 2% chance of beating the current strongest computer program head to head. With the question of "will computers ever beat the best human players at chess?" firmly settled, human players can now be compared on the basis of how often they play what the computer would play. Carlsen finds the computer move over 90% of the time, but it doesn't take many misses to lose a game against such a strong player.

And now comes AlphaZero.

The "alpha" is carried over from its predecessor, AlphaGo which, by studying games between human players of the game of go and using machine learning to construct a neural network for evaluating positions, was able to beat human champion Lee Sedol. This was particularly significant because a typical position in go has many more possible continuations than a typical chess position, making the "bash out millions of continuations" approach impractical. Given that computer chess had only fairly recently reached the level of top human players and go programs hadn't been particularly close, it had seemed like a good bet that humans would continue to dominate go for quite a while yet.

AlphaGo, and machine learning approaches in general, use what could be regarded as a much more human approach, not surprising since they're based on an abstraction of animal nervous systems and brains rather than the classic Turing/Von Neumann model of computing, notwithstanding that they're ultimately still using the same computing model as everyone else. That is, they run on ordinary computer hardware, though often with a specialized "tensor processor" for handling the math.

However, the algorithm for evaluating positions is completely different. There's still an evaluation function, but it's "run the positions of the pieces through this assemblage of multilinear functions that the training stage churned out". Unlike a conventional chess engine, there's nothing you can really point at and say "this is how it knows about pins" or "this is how it counts up material".

AlphaGo looks at many fewer positions [by a factor of around 10,000] than a conventional chess engine would, and it looks at them probabilistically, that is, it uses its evaluation function to decide how likely it is that a particular continuation is worth looking at, then randomly chooses the actual positions to look at based on that. It's still looking at many more positions than a human would, but many fewer than a pure brute-force approach would.

The "zero" is because the training stage doesn't use any outside knowledge of the game. Unlike AlphaGo, it doesn't look at games played by humans. It plays against itself and accumulates knowledge of what works and what doesn't. Very roughly speaking, AlphaZero in its training stage plays different versions of its network against each other, adjusts the parameters based on what did well and what did badly, and repeats until the results stabilize.

AlphaZero does this knowing only the rules of the game, that is, what a position looks like, what moves are possible, and how to tell if the game is won, lost, drawn (tied) or not over yet. This approach can be applied to a wide range of games, so far go, chess and shogi (a chess-like game which originated in Japan). In all three cases AlphaZero achieved results clearly better than the (previous) best computer players after a modest number of hours of training (though the Stockfish team makes a good case that AlphaZero had a hardware advantage and wasn't playing against the strongest configuration). [Recent results indicate that LC0, the strongest neural net based engine, and Stockfish, the strongest conventional engine, are very evenly matched, but LC0 doesn't have the benefit of a tensor processor to speed up its evaluation --D.H. May 2019]

Notably, AlphaZero beat AlphaGo 60 games to 40.

In one sense, AlphaZero is an outstanding example of Dumb is Smarter, particularly in beating AlphaGo, which used nearly the same approach, but trained from human games. AlphaZero's style of play has been widely described as unique. Its go version has found opening ideas that had lain undiscovered for centuries. Its chess version plays sacrifices (moves that give up material in hopes of a winning attack) that conventional chess engines pass up because they can't prove that they're sound. Being unbiased by exposure to human games or a human-developed evaluation function, AlphaZero can find moves that other programs would never play, and it turns out these moves are often good enough to win, even against chess engines that never make tactical mistakes.

On the other hand, AlphaZero specifically avoids sheer brute force. Rather than look at lots and lots of positions using a relatively simple evaluation function, it looks at many fewer, using a much more sophisticated evaluation function to sharply limit the number of positions it examines. This is the same approach that had been tried in the past with limited success, but with two key differences: The evaluation function is expressed as a neural network rather than a set of explicit rules, and that neural network is trained without any human input, based solely on what works in practice games against AlphaZero itself.

The Dumb is Smarter tag on Field Notes takes "dumb" to mean "no special sauce" and "smarter" to mean "gets better results". The "smarter" part is clear. The "dumb" part is more interesting. There's clearly no special sauce in the training stage. AlphaZero uses a standard machine learning approach to produce a standard neural network.

On the other hand, if you consider the program itself without knowing anything about the training stage, you have a generic engine, albeit one with a somewhat unusual randomized search algorithm, and an evaluation function that no one understands in detail. It's all special sauce -- a set of opaque, magical parameters that somehow give the search algorithm the ability to find the right set of variations to explore.

I think it's this opaqueness that gives neural networks their particularly uncanny flavor (uncanny, etymologically, roughly means "unknowable"). The basic approach of taking some inputs and crunching some numbers on them to get an output is unequivocally dumb. As I said above, "It's adding up numbers based on positions and alignments of pieces." Except that those numbers are enabling it to literally make "a subtle judgment based on experience", a judgment we have no real choice but to call "smart".

Progress marches on. At least one of the previous generation of chess engines (Komodo) has incorporated ideas from AlphaZero [Leela has open-sourced the neural network approach wholesale]. It looks like the resulting hybrid isn't dominant, at least not yet, but it does play very strong chess, and plays in a completely different, more human, style from the conventional version. That's interesting all by itself.

Saturday, October 6, 2018

Debugging servers and bodies

For quite a while I've wanted to write up some thoughts about the nature of cause and effect. In actually trying to do so, I realized two things: First, I don't know that much about the main streams of thought on this, which go back millennia, and second, that this was material for several posts. Rather than step back and formulate an overarching structure for a series, or deep dive into the philosophical literature, I decided to just start somewhere and call that Part I, with the intention of coming back to the subject, probably with unrelated posts in between, from time to time. And then I realized that there was no need to call it Part anything. I could just introduce a new tag, cause and effect and apply it where necessary. So here we go ...

I press keys on the keyboard and words appear on the screen. It seems pretty clear that one is the cause of the other. I take a cough suppressant and my cough subsides. Quite likely it subsided because of the medicine, but maybe I was getting better anyway. I run an ad online and sales go up. But they've been going up for months. Did they go up more because of the ad? (half of your advertising budget is wasted; the trick is to know which half). I wear a lucky shirt and my team wins. I may want to think the two are related, but realistically it's just a bit of fun.

I spend a lot of my time debugging, trying to figure out why something isn't working the way it was expected to, whether it's why the TV isn't working at home or why a server crashed at work. If I'm trying to figure out what's wrong with something electronic at home, I generally just turn things off and on until everything's back in a sane state. That's usually fine at home, but not such a good idea at work. Even if restarting the server fixes the problem, you still want to know why, so next time you won't have to restart it.

How do you tell if one thing caused another? Debugging is much older than computing. For example, the practice of debugging human health, that is, medicine, developed a useful paradigm in 1884, known as Koch's postulates after Robert Koch, for determining whether a particular microorganism causes a particular disease:

The microorganism must be found in abundance in all organisms suffering from the disease, but should not be found in healthy organisms.
The microorganism must be isolated from a diseased organism and grown in pure culture.
The cultured microorganism should cause disease when introduced into a healthy organism.
The microorganism must be reisolated from the inoculated, diseased experimental host and identified as being identical to the original specific causative agent.

Koch actually abandoned the "should not be found" part of postulate 1 after discovering that some people could carry a disease without showing symptoms.

Abstracting a little bit with an eye toward applying these postulates to an ailing server, one might say

The conditions you think are causing the problem should be present in affected servers but ideally not in healthy ones. For example, the unhealthy servers are all deployed in sector A and the healthy ones aren't or, more commonly, the unhealthy ones are running one version of the code while the healthy ones are running the previous version.
This is a bit tricky, but I think it boils down to: There has to be a well-defined way to induce the conditions you think are causing the problem. For example, you can start a new instance of a server in sector A or update a healthy server to the new version you think is buggy. It isn't always immediately obvious how to do this. For example, if you think that the problem is some sort of bad request coming from random parts of the internet, you'll probably have to search through logs to find requests that correspond with problems.
If you trigger the conditions, the problem occurs.
Again doesn't apply directly, but in this context I think it means double-checking for evidence that the conditions you thought were triggered really were triggered. When you brought up the test server and it fell over, was it actually in sector A? When you sent the query-of-death you found in the logs, did the server that fell over actually log receiving it just before it fell over?

The kind of double-checking in postulate 4 is crucial in real debugging. It's very common, for example, to think you restarted a server with a new setting that should cause or fix a given problem, only to find that you restarted a different instance, or accidentally restarted it with the old configuration instead of the new. For example, as I was writing this paragraph I realized that the command I thought would send a problem request to an unwell server I'd been debugging had actually failed, explaining why I saw no evidence of the request having been handled.

There's also a distinction, in both medicine and software, between fixing the problem at hand -- curing the patient or getting the service back online -- and pinning down exactly what happened. In my business a common course of action is to roll the production servers back to the last configuration that was known to work, then use a test setup to try to reproduce the problem without impacting the production system. The ultimate goal is a "red test" that fails with the buggy code and then passes ("goes green") with the fix.

In medicine, as I understand it, the work of isolating causes and developing vaccines and drugs similarly goes on in a laboratory environment until everyone is quite certain that the proposed treatment will be safe, and hopefully effective, in real patients. In the mean time, doctors mostly do their best with known treatments.

While Koch's postulates are fairly famous, the kind of thing you remember from high school biology years later, they're not actually what modern medicine goes by, just like modern economists don't consult The Wealth of Nations, influential though it was. One more modern approach can be found in Hill's criteria, a set of nine criteria for determining if a given cause is responsible for a given effect, but there are many other, more recent paradigms.

Notably, Hill's criteria and its modern cousins are not nearly so crisp as Koch's postulates. The very name "postulate" suggests that you can obtain a rigorous proof, while "criteria" suggests something more indirect: if you don't meet the criteria, then you don't have cause and effect. The criteria themselves are of the form "more of this suggests causality is more likely", and the end result is an idea of the probability that something is causing something else.

As in many other areas, switching from a yes/no answer to a probability solves a lot of problems, particularly the problem of gray areas where there are reasons to say either yes or no. It does so, however, at the cost of being able to say for certain "X caused Y". In my world, you very often can say with confidence "The tweaks we made to parsing in change number 271828 caused the server to reject these requests", but in my world we have a high degree of control of the system. I can roll the server back to just before and after change number 271828 and run it in a test environment where I can control exactly what data it's trying to parse (or just write a "unit test" that exercises the problem code directly without spinning up a server).

In the field of medicine, and much of the scientific world, however, that's generally not the case. If we're trying to determine whether eating carrots for years causes freckles, we can't really make people eat carrots for years and count their freckles every week for the duration. Medicine doesn't, and shouldn't, have the same level of control over patients as I have over a server. That is, there's less control over the possible causes.

There's typically also less certainty about the effects. Lots of people get freckles whether or not they eat carrots, so you need more subtle statistical techniques to see if there's even a correlation between carrot eating and freckle getting. This sort of thing is a major reason that it's generally not a good idea to pin too much on any single medical study, even if it's careful with the data and its interpretation.

Nonetheless, medicine advances. Some of this is because research has pinned down some causes and effects to a good degree of certainty. There's no doubt that vaccines were effective against smallpox and continue to be effective against other diseases, albeit not always perfectly. There's no doubt that antibiotics can be effective against bacterial infections, or that some bacteria have evolved defenses against them. There's no realistic doubt that smoking causes a number of ill effects, up to and including lung cancer and emphysema.

But medicine is useful even in the absence of certainty. If there's an 80% chance you've got condition X and the treatment is 90% effective with a 5% chance of major side effects, and condition X is 95% fatal if left untreated, you should probably go for the treatment. If the treatment is 5% effective with a 90% chance of major side effects, and condition X is almost never fatal, you probably don't want to. You don't need absolute certainty to make a decision like this, or even to know the exact causes and effects involved.

Sunday, September 9, 2018

When is a space elevator not a space elevator?

Canadian space and defense company Thoth Technology has recently announced plans for a "Space Elevator". The idea is to use pressurized Kevlar to build a tower 15 kilometers high (nearly twice as high as Mount Everest), complete with hotel and observation decks, and to launch and land space planes using a rooftop runway. It's an ambitious project, to say the least, but I wouldn't call it a space elevator -- thus the they-said-it-I-didn't quotation marks.

The term space elevator usually refers to a structure built around a very long cable with one end attached to a counterweight somewhere beyond about 36,000 kilometers above the surface of the Earth. The cable remains taut due to centrifugal force*. Vehicles called climbers move up and down the structure, delivering payloads into space.

We're not even close to being able to build such a thing. The main reason is that the cable would be under far more tension than any commercially available material can withstand (it's not theoretically impossible, just not practical with current technology). Assuming we clear that hurdle, actually building the thing would still be a major piece of engineering.

How would a space elevator work in practice? Since a space elevator cable is pulled taut by its counterweight, and it's rotating in sync with the Earth's rotation, the higher up you go, the faster that bit of the elevator is moving relative to the ground, since the higher you go the more distance you have to cover every 24 hours. As a climber goes up the elevator, it will feel a slight force pushing it in the direction the cable is moving (technically a Coriolis force). This is the force of the cable keeping the climber moving with it, ever so slightly faster for each meter you move farther out from Earth.

In order to put something into orbit it's not enough simply to lift it out of the atmosphere. If you took a payload up a space elevator to an altitude of 200km, a typical distance for a low Earth orbit, and let it go, it would fall back to the ground. At that height, the payload would start with a forward motion relative to the ground of around 50 km/h -- a typical speed limit for residential streets -- and lose most or all of it to wind resistance on the way down. To go into orbit, it would need a forward motion of more like 28,000 km/h. As Randall Munroe says, getting to space is easy. The problem is staying there.

To go faster you have to go higher, but fortunately you also need less speed to go into orbit as you get further from the Earth -- the Moon orbits at less than 4000 km/h, for example. By the time a climber got to about 36,000 kilometers, not much less distance than going around the world, it would finally have enough speed (about 11,000 km/h at that point) that a payload would stay in orbit if released.

Orbits at this distance are geosynchronous (geosync for short), meaning the orbiting object is in sync with the Earth's rotation. If they're also at the equator, which a space elevator would have to be, they're geostationary, meaning they always stay over the same point on the equator. Otherwise they will appear to move north and south over the course of the day, but stay at roughly the same longitude.

From the point of view of someone on the elevator at geosync, the payload would just appear to stay where it was, at least for a bit. In real life, it would tend to drift away over time due to factors like the gravity of the Moon and the Earth's not being a perfect sphere. For basically the same reason, an object on the cable at this point would experience zero g (again neglecting secondary effects). Points below would experience a pull toward Earth, however slight, and points above would experience a pull away from Earth.

Where does that leave us with Thoth's structure?

A 15km tower is nowhere near high enough to function as a space elevator. The advantage to launching from there is not the difference in speed between the ground and the top, which would amount to a moderate walking pace, but being above about 90% of the atmosphere. That's certainly helpful, but not enough to neglect the atmosphere entirely. To do that, you would need to be somewhere above the Kármán line, somewhat arbitrarily defined as 100km, though the Wikipedia article asserts that 160km is the lowest point at which you can complete an orbit without further propulsion.

In other words, Thoth's structure doesn't even reach into space, or even a tenth of the way for the purposes of launching things into orbit.

While Thoth's structure might be still useful in launching thanks to bypassing much of the atmosphere, landing, on the other hand, seems a bit of a stretch. If you're leaving low Earth orbit, you'll have to get rid of that 28,000 km/h somehow. You could use rockets, the same as you used to get into orbit, but that means more fuel -- not just twice as much but a bunch more, because the extra fuel is just dead weight on the way up and you'll need more fuel to compensate for that. And more fuel for that extra fuel, and so forth.

This is why actual reentry uses the Earth's atmosphere to turn kinetic energy (energy of motion) into heat. If you're going to do that, you might as well go all the way to the ground, where you can have a nice big runway, safety crews and other amenities if you're a spaceplane, or at least your choice of an ocean or a large expanse of open ground to aim for if you're not.

Of course, if you have an actual space elevator, you can re-attach to the elevator at geosync and let the climber spill your kinetic energy gradually on the way down, neglecting the not-so-small matter of moving from your current orbit back to the elevator, matching speed, not just location. But that's not what Thoth is proposing.

So the whole "space elevator" thing is marketing hype. Is there anything left after you account for that? Well, maybe. It depends on the numbers.

The Really Tall Tower idea still has some value, I think. Actually building it would turn up all kinds of interesting issues in building tall structures, from how to build a stable structure about 20 times taller than the current record holder to the logistics of supporting a crew well above the Everest death zone to even just getting materials to a construction site 15km up in the air. At the end, though, you'd at least have a unique hotel property and, depending on how much load that inflated Kevlar can take, potentially a whole lot of residential and office space, though most of it will need to be pressurized.

Do you have a space elevator? No.

Do you have a compelling value proposition for someone wanting to put things in orbit in a world where ground-launched rockets are pretty much the only game in town? I really don't know enough to say, but my guess is that it would be better if the business model didn't depend on that happening.

(*) If you prefer, feel free to recast this in terms of centripetal force. You'll get the same vectors, just with different names.

Saturday, July 28, 2018

The woods are dark and full of terrors

I wanted to "circle back" on a comment I'd made on the "Dark Forest" hypothesis, which is basically the idea that we don't see signs of alien life because everyone's hiding for fear of everyone else. This will probably be the last I want to say about the "Fermi Paradox", at least until the next time I feel like posting about it ...

I haven't read the books the Dark Forest hypothesis comes from [in March 2024, Netflix release 3 Body Problem, based on the first book in the series, and, um ... I have questions] but my understanding is that the hypothesis is based on the observation that when a relatively technologically developed society on Earth has made contact with a less developed society, the results have generally not been pretty. "Technology" in this context particularly means "military technology." The safest assumption is that this isn't unique to our own planet and species, but a consequence of universal factors such as competition for resources.

If you're a civilization at the point of being able to explore the stars, you're probably aware of this first hand from your own history, and the next obvious observation is that you're just at the beginning of the process of exploring the stars. Is it really prudent to assume that there's no one out there more advanced?

Now put yourself in the place of that hypothetical more advanced civilization. They've just detected signs of intelligent life on your world. You are now either a threat to them, or a potential conquest, or both. Maybe you shouldn't be so eager to advertise your presence.

But you don't actually see anyone out there, so there's nothing to worry about, right? Not so fast. Everyone else out there is probably applying the same logic. They might be hunkering down quietly, or they might already be on their way, quietly, in order to get the jump on you, but either way you certainly shouldn't assume that not detecting anyone is good news.

Follow this through and, assuming that intelligent life in general isn't too rare at any given point in time, you get a galaxy dotted with technological civilizations, each doing its best to avoid detection, detect everyone else and, ideally, neutralize any threats that may be out there. Kind of like a Hunger Games scenario set in the middle of a dark forest.

This all seems disturbingly plausible, at least until you take scale into account.

There are two broad classes of scenarios: Either faster-than-light travel is possible, or it's not. If anyone's figured out how to travel faster than light, then all bets are off. The procedure in that case seems pretty simple: Send probes to as many star systems as you can. Have them start off in the outer reaches, unlikely to be detected, scanning for planets, then scanning for life on those planets. If you find anything that looks plunderable, send back word and bring in the troops. Conquer. Build more probes. Repeat.

This doesn't require listening for radio waves as a sign of civilization. Put a telescope and a camera on an asteroid with a suitable orbit and take pictures as it swings by your planet of choice. Or whatever. The main point is if there's anyone out there with that level of technology, our fate is sealed one way or another.

On the other hand, if the speed of light really is a hard and fast limit, then economics will play a significant role. Traveling interstellar distances takes a huge amount of energy and not a little time (from the home planet's point of view -- less for the travelers, particularly if they manage to get near light speed). By contrast, in the period of exploration and conquest from the late 1400s to the late 1700s it was not difficult to build a seaworthy ship and oceans could be crossed in weeks or months using available energy from the wind. The brutal fact is that discovering and exploiting new territories on Earth at that time was economically profitable for the people doing the exploiting.

If your aim is to discover and exploit resources in other star systems, then you have to ask what they might have that you can't obtain on your home system using the very large amount of energy you would have to use to get to the other system. The only sensible answer I can come up with is advanced technology, which assumes that your target is more advanced than you are, in which case you might want to rethink.

Even if your aim is just to conquer other worlds for the evulz or out of some mostly-instinctive drive, you're fighting an extremely uphill battle. Suppose you're attacking a planet 10 light-years away. Messages from the home planet will take 10 years to reach your expeditionary force, and any reply will take another 10 years, so they're effectively on their own.

You detected radio transmissions from your target ten years ago. It takes you at least 10 years to reach their planet (probably quite a bit longer, but let's take the best case, for you at least). They're at least 20 years more advanced than when the signal that led you to plot this invasion left their planet.

You've somehow managed to assemble and send a force of thousands, or tens of thousands, or a million. You're still outnumbered by -- well, you don't really know until you get there, do you? -- but hundreds to one at the least and more likely millions to one. You'd better have a crushing technological advantage.

I could come up with scenarios that might work. Maybe you're able to threaten with truly devastating weapons that the locals have no way to counter. The locals treat with you and agree to become your loyal minions.

Now what?

Unless your goal was just the accomplishment of being able to threaten another species from afar, you'll want to make some sort of physical contact. Presumably you land your population on the planet and colonize, assuming the planet is habitable to you and the local microbes don't see you as an interesting host environment/lunch (or maybe you've mastered the art of fighting microbes, even completely unfamiliar ones).

You're now on unfamiliar territory to which you're not well adapted, outnumbered at least a hundred to one by intelligent and extremely resentful beings that would love to steal whatever technology you're using to maintain your position. Help is twenty years away, counting from the time you send your distress call, and if you're in a position to need it, is the home planet really going to want to send another wave out? By the time they get there, the locals will have had another twenty years to prepare since you sent your distress call, this time with access to at least some of your technology.

I'm always at least a little skeptical of the idea that other civilizations will think like we do. Granted, it doesn't seem too unreasonable to assume that anyone who gets to the point that we would call them "technological" is capable of doing the same kind of cost/benefit analyses that we do. On the other hand, it also seems reasonable to assume that they have the same sort of cognitive biases and blind spots that we do.

The "soft" sciences are a lot about how to model the aggregate behavior of not-completely rational individuals. There's been some progress, but there's an awful lot we don't know even about our own species, which we have pretty good access to. When it comes to hypothetical aliens, I don't see how we can say anything close to "surely they will do thus-and-such", even if there are practical limits on how bonkers you can be and still develop technology on a large scale.

In the context of the Dark Forest, the question is not so much how likely it is that alien species are actually a danger to us, but how likely is it that an alien species would think they were in danger from another alien species (maybe us) and act on that by actively going dark.

Our own case suggests that's not very likely. There may be quite a few people who think that an alien invasion is a serious threat (or for that matter, that one has already happened), or who think that it's unlikely but catastrophic enough if it did happen that we should be prepared. That doesn't seem to have stopped us from spewing radio waves into the universe anyway. Maybe we're the fools and everyone else is smarter, but imagine the level of coordination it would take to keep the entire population of a planet from ever doing anything that would reveal their presence. This seems like a lot to ask, even if the threat of invasion seems likely, which, if you buy the analysis above, it's probably not.

Overall, it seems unlikely that every single technological civilization out there would conclude that staying dark was worth the trouble. At most, I think, there would be fewer detectable civilizations, than there would have been otherwise, but I still think that as far as explaining why we haven't heard from anyone, it's more likely that whatever civilizations there are, have been or will be out there are too far away for our present methods to detect (and may always be), and that the window of opportunity for detecting them is either long past or far in the future.

Saturday, July 21, 2018

Fermi on the Fermi paradox

One of the pleasures of life on the modern web is that if you have a question about, say, the history of the Fermi paradox, there's a good chance you can find something on it. In this case, it didn't take long (once I thought to look) to turn up E. M. Jones's "Where is Everybody?" an Account of Fermi's Question.

The article includes letters from Emil Konopinski, Edward Teller and Herbert York, who were all at lunch with Enrico Fermi at Los Alamos National Laboratory some time in the early 1950s when Fermi asked his question. Fermi was wondering specifically about the possibility that somewhere in the galaxy some civilization had developed a viable form of interstellar travel and had gone on to explore the whole galaxy, and therefore our little blue dot out on one of the spiral arms.

Fermi and Teller threw a bunch of arguments at each other, arriving at a variety of probabilities. Fermi eventually concluded that probably interstellar travel just wasn't worth the effort or perhaps no civilization had survived long enough to get to that stage (I'd throw in the possibility that they came by millions of years ago, decided nothing special was going on and left -- or won't come by for a few million years yet).

Along the way Fermi, very much in the spirit of "How many piano tuners are there in Chicago?" broke the problem down into a series of sub-problems such as "the probability of earthlike planets, the probability of life given an earthlike planet" and so forth. Very much something Fermi would have done, (indeed, this sort of exercise goes by the name "Fermi estimation") and very similar to what we now call the Drake equation.

In other words, Fermi and company anticipated much of the subsequent discussion on the subject over lunch more than fifty years ago and then went on to other topics (and presumably coffee). There's been quite a bit of new data on the subject, particularly the recent discovery that there are in fact lots of planets outside our solar system, but the theoretical framework hasn't changed much at all.

What's a Fermi paradox?

So far, we haven't detected strong, unambiguous signs of extraterrestrial intelligence. Does that mean there isn't any?

The usual line of attack for answering this question is the Drake equation [but see the next post for a bit on its origins --D.H Oct 2018], which breaks the question of "How many intelligent civilizations are there in our galaxy?" down into a series of factors that can then be estimated and combined into an overall estimate.

Let's take a simpler approach here.

The probability of detecting extraterrestrial intelligence given our efforts so far is the product of:

The probability it exists
The probability that what we've done so far would detect it, given that it exists

(For any math geeks out there, this is just the definition of conditional probability)

Various takes on the Fermi paradox (why haven't we seen anyone, if we're pretty sure they're out there?) address these two factors

Maybe intelligent life is just a very rare accident. As far as we can tell, Earth itself has lacked intelligent life for almost all of its history (one could argue it still does, so feel free to substitute "detectable" for "intelligent").
Maybe intelligent life is hard to detect for most of the time it's around (See this post for an argument to that effect and this one for a bit on the distinction between "intelligent" and "detectable"). A particularly interesting take on this is the "dark forest" hypothesis, that intelligent civilizations soon figure out that being detectable is dangerous and deliberately go dark, hoping never to be seen again. I mean to take this one on in a bit, but not here.
One significant factor when it comes to detecting signs of anything, intelligent or otherwise: as far as we know detectability drops with the square of distance, that is, twice as far away means four times harder to detect. Stars are far away. Other galaxies are really far away.
Maybe intelligent life is apt to destroy itself soon after it develops, so it's not going to be detectable for very long and chances are we won't have been looking when they were there . This is a popular theme in the literature. I've talked about it here and here.
Maybe the timing is just wrong. Planetary time scales are very long. Maybe we're one of the earlier ones and life won't develop on nearby planets for another million or billion years (basically low probability of detection again, but also an invitation to be more rigorous about the role of timing)

At first blush, the logic of the Fermi paradox seems airtight: Aliens are out there. We'd see them if they were out there. We haven't seen them. QED. But we're not doing a mathematical proof here. We're dealing in probabilities (also math, but a different kind). We're not trying to explain a mathematically impossible result. We're trying to determine how likely it is that our observations are compatible with life being out there.

I was going to go into a longish excursion into Bayesian inference here, but ended up realizing I'm not very adept at it (note to self: get better at Bayesian inference). So in the spirit of keeping it at least somewhat simple, let's look at a little badly-formatted table with, granted, a bunch of symbols that might not be familiar:

	We see life (S)	We don't see life (¬S)
Life exists (L)	P(L ∧ S)	P(L ∧ ¬S)	P(L)
No life (¬L)	P(¬L ∧ S)	P(¬L ∧ ¬S)	P(¬L)
	P(S)	P(¬S)	100%

P is for probability. P(L) is the probability that there's intelligent life out there we could hope to detect as such, at all. P(S) is the probability that we see evidence strong enough that the scientific community (whatever we mean by that, exactly) agrees that intelligent life is out there. The ¬ symbol means "not" and the ∧ symbol means "and". The rows sum to the right, so

P(L ∧ S) + P(L ∧ ¬S) = P(L) (the probability life exists is the probability that life exists and we see it plus the probability it exists and we don't see it)
P(S + ¬S) = 100% (either we see life or we don't see it)

Likewise the columns sum downward. Also "and" means multiply (as long as the two probabilities are independent; they are here, since we allow for false positives), so P(L ∧ S) = P(L)×P(S). This all puts restrictions on what numbers you can fill in. Basically you can pick any three and those determine the rest.

Suppose you think it's likely that life exists, and you think that it's likely that we'll see it if it's there. That means you think P(L) is close to 100% and P(L ∧ S) is a little smaller but also close to 100% (see conditional probability for more details) . You get to pick one more. It actually turns out not to matter that much, since we've already decided that life is both likely and likely to be detected. One choice would be P(¬L ∧ S), the chance of a "false positive", that is, the chance that there's no life out there but we think we see it anyway. Again, in this scenario we're assuming false positives should be unlikely overall, but choosing exactly how unlikely locks in the rest of the numbers.

It's probably worth calling out one point that kept coming up while I was putting this post together: The chances of finding signs of life depend on how much we've looked and how we've done it. A lot of SETI has centered around radio waves, and in particular radio waves in a fairly narrow range of frequencies. There are perfectly defensible reasons for this approach, but that doesn't mean that any actual ETs out there are broadcasting on those frequencies. In any case we're only looking at a small portion of the sky at any given moment, our current radio dishes can only see a dozen or two light years out and there's a lot of radio noise from our own technological society to filter out.

I could model this as a further conditional probability, but it's probably best just to keep in mind that P(S) is the probability of having detected life after everything we've done so far, and so includes the possibility that we haven't really done much so far.

To make all this concrete, let's take an optimistic scenario: Suppose you think there's a 90% chance that life is out there and a 95% chance we'll see it if it's out there. If there's no chance of a false positive, then there's an 85.5% chance that we'll see signs of life and so a 14.5% chance we won't (as is presently the case, at least as far as the scientific community is concerned). If you think there's a 50% chance of a false positive, then there's a 90.5% chance we'll see signs of life, including the 5% chance it's not out there but we see it anyway. That means a 9.5% chance of not seeing it, whether or not it's actually there.

This doesn't seem particularly paradoxical to me. We think life is likely. We think we're likely to spot it. So far we haven't. By the assumptions above, there's about a 10% chance of that outcome. You generally need 99.99994% certainty to publish a physics paper, that is, a 0.00006% chance of being wrong. A 9.5% chance isn't even close to that

Only if you're extremely optimistic and you think that it's overwhelmingly likely that detectable intelligent life is out there, and that we've done everything possible to detect it do we see a paradox in the sense that our present situation seems very unlikely. But when I say "overwhelmingly likely" I mean really overwhelmingly likely. For example, even if you think both are 99% likely, then there's still about a 1-2% chance of not seeing evidence of life, depending on how likely you think false positives are. If, on the other hand, you think it's unlikely that we could detect intelligent life even if it is out there, there's nothing like a paradox at all.

My personal guess is that we tend to overestimate the second of the two bullet points at the beginning. There are good reasons to think that life on other planets is hard to detect, and our efforts so far have been limited. In this view, the probability that detectably intelligent life is out there right now is fairly low, even if the chance of intelligent life being out there somewhere in the galaxy is very high and the chance of it being out there somewhere in the observable universe is near certain.

As I've argued before, there aren't a huge number of habitable planets close enough that we could hope to detect intelligent life on them, and there's a good chance that we're looking at the wrong time in the history of those planets -- either intelligent life hasn't developed yet or it has but for one reason or another it's gone dark.

Finding out that there are potentially habitable worlds in our own solar system is exciting, but probably doesn't change the picture that much. There could well be a technological civilizations in the oceans of Enceladus, but proving that based on what molecules we see puffing out of vents on the surface many kilometers above said ocean seems like a longshot.

With that in mind, let's put some concrete numbers behind a less optimistic scenario. If there's a 10% chance of detectable intelligent life (as opposed to intelligent life we don't currently know how to detect), and there's a 5% chance we'd have detected it based on what we've done so far and a 1% chance of a false positive (that is, of the scientific community agreeing that life is out there when in fact it's not), then it's 98.6% likely we wouldn't have seen clear signs of life by now. That seems fine.

While I'm conjecturing intermittently here, my own wild guess is that it's quite likely that some kind of detectable life is out there, something that, while we couldn't unequivocally say it was intelligent, would make enough of an impact on its home world that we could hope to say "that particular set of signatures is almost certainly due to something we would call life". I'd also guess that it's pretty likely that in the next, say, 20 or 50 or 100 years we would have searched enough places with enough instrumentation to be pretty confident of finding something if it's there. And it's reasonably likely that we'd get a false positive in the form of something that people would be convinced it was a sign of life when there in fact wasn't -- maybe we'd figure out our mistake in another 20 or 50 or 100 years.

Let's say life of some sort is 90% likely, there's a 95% chance of finding it in the next 100 years if it's there and a 50% chance of mistakenly finding life when it's not there, that is, a 50% chance that at some point over those 100 years we mistakenly convince ourselves we've found life and later turn out to be wrong. Who knows? False positives are based on the idea that there's no detectable life out there, which is another question mark. But let's go with it.

I actually just ran those numbers a few paragraphs ago and came up with a 9.5% chance of not finding anything, even with those fairly favorable odds.

All in all, I'd say we're quite a ways from any sort of paradoxical result.

One final thought occurs to me: The phrase "Fermi paradox" has been in the lexicon for quite a while, long enough to have taken on a meaning of its own. Fermi himself, being one of the great physicists, was quite comfortable with uncertainty and approximation, so much so that the kind of "How many piano tuners are there in Chicago?" questions given to interview candidates are meant to be solved by "Fermi estimation".

I should go back and get Fermi's own take on the "Fermi paradox". My guess was he wasn't too bothered by it and probably put it down to some combination of "we haven't really looked" and "maybe they're not out there".

If I find out I'll let you know.

[As noted above, I did in fact come across something --D.H Oct 2018]

Friday, July 6, 2018

Are we alone in the face of uncertainty?

I keep seeing articles on the Drake equation and the Fermi Paradox on my news feed, and since I tend to click through and read them, I keep getting more of them. And since I find at least some of the ideas interesting, I keep blogging about them. So there will probably be a few more posts on this topic. Here's one.

One of the key features of the Drake equation is how little we know, even now, about most of the factors. Along these lines, a recent (preprint) paper by Anders Sandberg, Eric Drexler and Toby Ord claims to "dissolve" the Fermi Paradox (with so many other stars out there why haven't we heard from them?), claiming to find "a substantial ex ante probability of there being no other intelligent life in our observable universe".

As far as I can make out, "ex ante" (from before) means something like "before we gather any further evidence by trying to look for life". In other words, there's no particular reason to believe there should be other intelligent life in the universe, so we shouldn't be surprised that we haven't found any.

I'm not completely confident that I understand the analysis correctly, but to the extent I do, I believe it goes like this (you can probably skip the bullet points if math makes your head hurt -- honestly, some of this makes my head hurt):

We have very little knowledge of the some of the factors in the Drake equation, particularly fl (probability of life on a planet that might support life) fi (probability of a planet with life developing intelligent life) and L (the length of time a civilization produces a detectable signal)
Estimates of those range over orders of magnitude.

Estimates for L range from 50 years to a billion or even 10 billion years.
The authors do some modeling and come up with a range of uncertainty of 50 orders of magnitude for fl. That is, it might be close to 1 (that is, close to 100% certain), or it might be more like 1 in 100,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000. Likewise they take fi to range over three orders of magnitude, from near 1 to 1 in 1,000.

Rather than assigning a single number to every term, as most authors do, it makes more sense to assign a probability distribution. That is, instead of saying "the probability of life arising on a suitable planet is 90%", or 0.01% or whatever, assign probability for each possible value (the actual math is a bit more subtle, but that should do for our purposes). Maybe the most likely probability of life developing intelligence is 1 in 20, but there's a possibility, though not as likely, that it's actually 1 in 10 or 1 in 100, so take that into account with a probability distribution..
(bear in mind that the numbers were looking at are themselves probabilities, so we're assigning a probability that the probability is a given number -- this is the part that makes my head hurt a bit)
Since we're looking very wide ranges of values, a reasonable distribution is the "log normal" distribution -- basically "the number of digits fits a bell curve".
These distributions have very long tails, meaning that if, say, 1 in a thousand is a likely value for the chance of life evolving into intelligent life, then (depending on the exact parameters) 1 in a million may be reasonably likely, 1 in a billion not too unlikely and 1 in trillion is not out of the question.
The factors in the Drake equation multiply, following the rules of probability, so it's quite possible that the aggregate result is very small.

For example if it's reasonably likely that fl is 1 in a trillion and fi is 1 in a million, then we can't ignore the chance that the product of the two is 1 in a quintillion.
Numbers like that would make it unlikely that there is any life in our galaxy's few hundred billion stars and that ours just happened to get lucky.

Putting it all together, they estimate that there's a significant chance that we're alone in the observable universe.

I'm not sure how much of this I buy.

There are two levels of probability here. The terms in the Drake equation represent what has actually happened in the universe. An omniscient observer that knew the entire history of every planet in the universe (and exactly what was meant by "life" and "intelligent") could count the number of planets, the number that had developed life and so forth and calculate the exact values of each factor in the equation.

The probability distributions in the paper, as I understand it, represent our ignorance of these numbers. For all we know, the portion of "habitable" planets with intelligent life is near 100%, or near 1 in a quintillion or even lower. If that's the case, then the paper is exploring to what extent our current knowledge is compatible with there being no other life in the universe. The conclusion is that the two are fairly compatible -- if you start with what (very little) we know about the likelihood of life and so forth, there's a decent chance that the low estimates are right, or even too optimistic, and there's no one but us.

Why? Because low probabilities are more plausible than we think, and multiplying probabilities increases that effect. Again, the math is a bit subtle, but if you have a long chain of contingencies, any one of them failing breaks the whole chain. If you have several unlikely links in the chain, the chances of the chain breaking are even better.

The conclusion -- that for all we know life might be extremely rare -- seems fine. It's the methodology that makes me a bit queasy.

I've always found the Drake equation a bit long-winded. Yes, the probability of intelligent life evolving on a planet is the probability of life evolving at all multiplied by the probability of life evolving into intelligent life, but does that really help?

On the one hand, it seems reasonable to separate the two. As far as we know it took billions of years to go from one to the other, so clearly they're two different things.

But we don't really know the extent of our uncertainty about these things. If you ask for an estimate of any quantity like this, or do your own estimate based on various factors, you'll likely* end up with something in the wide range of values people consider plausible enough to publish (I'm hoping to say more on this theme in a future post). No one is going to say "zero ... absolutely no chance" in a published paper, so it's a matter of deriving a plausible really small number consistent given our near-complete ignorance of the real number -- no matter what that particular number represents or how many other numbers it's going to be combined with.

You could almost certainly fit the results of surveying several good-faith attempts into a log-normal distribution. Log-normal distributions are everywhere, particularly where the normal normal distribution doesn't fit because the quantity being measured has something exponential about it -- say, you're multiplying probabilities or talking about orders of magnitude.

If the question is "what is the probability of intelligent life evolving on a habitable planet?" without any hints as to how to calculate it, that is, one not-very-well-determined number rather than two, then the published estimates, using various methodologies, should range from a small fraction to fairly close to certainty depending on the assumptions used by the particular authors. You could then plug these into a log normal distribution and get some representation of our uncertainty about the overall question, regardless of how it's broken down.

You could just as well ask "What is the probability of any self-replicating system arising on a habitable planet?", "What is the probability of a self-replicating system evolving into cellular life?" "What is the probability of cellular life evolving into multicellular life?" and so forth, that is, breaking the problem down into several not-very-well-determined numbers. My strong suspicion is that the distribution for any one of those sub-parts will look a lot like the distribution for the one-question version, or the parts of the two-question version, because they're basically the same kind of guess as any answer to the overall question. The difference is just in how many guesses your methodology requires you to make.

In particular, I seriously doubt that anyone is going to cross-check that pulling together several estimates is going to yield the same distribution, even approximately, as what's implied by a single overall estimate. Rather, the more pieces you break the problem into, the more likely really small numbers become, as seen in the paper.

I think this is consistent with the view that the paper is quantifying our uncertainty. If the methodology for estimating the number of civilizations requires you to break your estimate into pieces, each itself with high uncertainty, you'll get an overall estimate with very high uncertainty. The conclusion "we're likely to be alone" will lie within that extremely broad range, and may even take up a sizable chunk of it. But again, I think this says much more about our uncertainty than about the actual answer.

I suspect that if you surveyed estimates of how likely intelligent life is using any and all methodologies*, the distribution would imply that we're not likely to be alone, even if intelligent life is very rare. If you could find estimates of fine-grained questions like "what is the probability of multicellular life given cellular life?" you might well get a distribution that implied we're an incredibly unlikely fluke and really shouldn't be here at all. In other words, I don't think the approach taken in the paper is likely to be robust in the face of differing methodologies. If it's not, it's hard to draw any conclusions from it about the actual likelihood of life.

I'm not even sure, though, how feasible it would be to survey a broad sample of methodologies. The Drake formulation dominates discussion, and that itself says something. What estimates are available to survey depends on what methods people tend to use, and that in turn depends on what's likely to get published. It's not like anyone somehow compiled a set of possible ways to estimate the likelihood of intelligent life and prospective authors each picked one at random.

The more I ponder this, the more I'm convinced that the paper is a statement about the Drake equation and our uncertainty in calculating the left hand side from the right. It doesn't "dissolve" the Fermi paradox so much as demonstrate that we don't really know if there's a paradox or not. The gist of the paradox is "If intelligent life is so likely, why haven't we heard from anyone?", but we really have no clear idea how likely intelligent life is.

* So I'm talking about probabilities of probabilities about probabilities?