Saturday, April 27, 2019

Notes on the EHT black hole image

Following on to the recent post on the other blog about the imaging of a black hole in a nearby galaxy, here are my lightly cleaned up notes from watching Katie Bouman's lecture on how the images were captured using the Event Horizon Telescope (EHT), which is actually a collection of several radio telescopes whose individual data sets are combined using very sophisticated image processing techniques to emulate a single Earth-sized telescope.

I have no great insights to offer, but there are some pretty interesting details in the original lecture.  I've thrown in some commentary here and there, without taking any real care to distinguish it from the source material. The full lecture is on Caltech's YouTube channel.  As always, any errors or distortions are mine.



Technically, the image gathered by the telescopes working in combination is in spatial frequency, basically how quickly bright changes to dark as you scan over the actual image.  For example, a square divided into a white rectangle and a black rectangle has a lower spatial frequency than one divided into thin black and white stripes.  It's possible to reconstruct a visual image from information about the spatial frequency, and the more information you have, the better the reconstruction.  This sort of reconstruction is widely used in many fields, including astronomy, medical imaging and the image compression that lets us stream video over the internet.

The exact spatial frequencies being measured depend on the distance (baseline) between the telescopes.  When the telescopes are spread across distant locations on the Earth, the image technique is called Very Long Baseline Interferometry (VLBI).  Interferometry refers to measuring how the signals from different telescopes interfere with each other, that is, whether they reinforce each other or cancel each other out when combined.  Very Long Baseline refers to, um ...

In this case there were eight radio telescopes at six different sites, taking data together on four different nights, scanning for a few minutes at a stretch.  This provides 21 different pairs of telescopes, which would mean 21 different baselines, that is, 21 different spatial frequencies to combine, except that the baselines change as the Earth rotates.  This is good, because it provides samples of more frequencies.  There is at least some talk of adding telescopes in low Earth orbit at some later stage, which would not only provide more raw data but more varied baselines, since satellites move faster than the Earth rotates.

As mentioned in the other post, each scope collected hundreds of terabytes (hundreds of trillions of bytes), enough data that it was easier to fly the drives with the data to the site where it would be processed, rather than try to send it over the internet (and the South Pole data had to be flown out in any case).

The actual signal, coming from about 50 million light-years away, is quite weak, and noisy.  There is a lot of matter at the center of a galaxy, radiating at various frequencies and blocking various frequencies.  The radio band was chosen to minimize these effects, but there is still noise to contend with, along with noise from a variety of other sources, including atmospheric noise and imperfections in the telescopes themselves.

The various telescopes are looking through different amounts of atmosphere, depending on whether the black hole is overhead from their location or on the horizon, and this also affects how long it takes the signal to arrive at each telescope.  It takes about 20 milliseconds longer for the signal to reach a scope that has the black hole on the horizon that it does for one where it's overhead.  For a 1GHz signal, that means a difference of 20 million cycles -- and again, this is varying as the earth rotates.

All this means there is a lot of calibration to be done to make sure you're lining up the right data from the various telescopes and accounting for all the various sources of noise and distortion.  In all, there were hundreds of thousands of calibration parameters to work with, though the team ended up using (only) ten thousand or so.

The reconstructed image is meant to represent what an Earth-sized telescope would see, if one could be built, but there are actually infinitely many such images that would match up with the limited data that was collected from a few scattered points on Earth.

Fortunately, having multiple telescopes means you can cross-check and isolate errors.  There are also "closure quantities" that should be the same regardless of the exact details of calibration.  I didn't really understand the details of closure quantities, but the idea of looking for quantities that don't depend on hard-to-control parameters is a powerful one.  It's sort of similar to checking arithmetic by adding up digits.

The team used two different methods of calibration, and they split into four subteams, two for each method.  The subteams worked independently for several weeks to reconstruct an image from the raw data.  Bouman's own work involved getting the human-tuned method (CLEAN) to work without human tuning.

In addition to the work by the subteams, there was a lot of work done up front to try to make sure that the image processing algorithms they were tuning were robust.  Much of this involved taking test images, calculating what signals the telescopes would get and then reconstructing the test images from the calculated signals.  They did this not only with images of discs and rings that one would typically expect in astronomical images, but also with a variety of non-astronomical images.

The team actively tried to break the image processing algorithms, for example by tuning the parameters to favor a solid disc rather than a ring, and verifying that a ring still came out as a ring.  Python fans will be glad to know that at least some of the code involved was written in Python.

The publicly distributed image is a combination of the images from the subteams.  Since it's meant to represent only what we can be confident about regarding the actual black hole, it's deliberately a bit blurry.  Even still, there are some significant results [I haven't yet re-checked the details here, so what follows may be garbled]:
  • There is, in fact, a ring as would be expected from a supermassive black hole.
  • The same ring was observable night after night.
  • Previously there had been two separate estimates of the mass of the black hole, one from the motions of stars around it (stellar dynamics) and one from the appearance of the gas flowing into the accretion disc.  The size of the ring in the image supports the first mass estimate.
  • One portion of the ring is brighter than the rest, due to the doppler shift of matter orbiting around the black hole.  This is to be expected, and the position of the bright spot matches up with other observations.
  • The position of the bright spot is shifting over time, consistent with a rotation period on the order of weeks (I think this implies that the black hole's axis of rotation is not on the line of sight).  Since the black hole in the Milky Way (Sgr A*) is much smaller, its rotation period is expected to be on the order of minutes.  Since the observation scans are also on the order of minutes, the image processing for Sgr A* will have to take this into account.
It's also possible to analyze the raw data directly as a cross-check.  A disk of such-and-such size should result in such-and-such radio signals arriving at the various telescopes, independent of any reconstructions produced from that data.  The raw signals are consistent with a ring of the size found in the visual reconstructions.  This is not a surprise, given how the reconstructions were done, but still good to know.

Despite the whole effort taking years and requiring expensive radio telescopes at widely dispersed locations, and employing dozens of people at various stages, the movie Interstellar still cost a lot more to make.  While its image of a black hole may be prettier (and, as I understand it, one of the parts of the movie where the science was actually plausible), the EHT image was both cheaper and more informative.

Again, really good stuff.

Sunday, April 14, 2019

Notes on watching (a chunk of) a computer chess tournament

A while ago I started a post trying to untangle what kind of a problem it is to play chess well, based on how well various kinds of players -- human, conventional computer chess engines and neural-network-based chess engines -- did against each other.  I still think it's an interesting question, but in the mean time I actually watched a bunch of computer chess and that was even more interesting.  In particular, I looked in on a fair bit of chess.com's "CCC 7: Blitz Bonanza".  Here are some rough impressions.

The Blitz Bonanza is one part of a long-running event comprising thousands of individual games in several different formats (the nice thing about chess engines is they don't get tired, so it's practical to have two of them play each other dozens of times in the course of a few days).  The "Deep Dive", with longer time controls and slightly different rules, is going on now and will be for several weeks.  There may be more after that (much of this is from memory because I can only seem to find current results on chess.com and the summaries on other sites don't always contain the exact information I'm looking for).

In the Blitz Bonanza, each player has five minutes of time to start with, plus two seconds per move made.  There is no "opening book", meaning that the players must calculate every move.  This leads to the rather odd sight of a clock ticking down near or past the 4-minute mark on move one, which is pretty much inevitably something everyone plays anyway (1. d4 and 1. e4 predominate).

Play continues either to checkmate by one side (though it enters a "get it over with" or "bullet" mode if both engines evaluate the current position as highly favorable to one or the other side), to a draw by stalemate, threefold repetition, 50 moves without a pawn move or capture, or "adjudication", meaning that the position is in a standard database of positions with six or fewer pieces on the board that cannot be won if both sides play perfectly (this includes "insufficient force" scenarios like king and bishop against king), or both engines evaluated the game as a dead draw for a certain number of moves.

This is rather different from human matches, where a player will typically resign if it becomes clear there's no way to avoid losing, and players typically agree to a draw when it's clear that neither of them has a realistic shot at winning unless the other one blunders.  It's sometimes fun to watch an actual checkmate.  The draws can be interesting in their own way.  More on that later.

Scoring was the usual one point for a win, no points for a loss and half a point to each player for a draw

The original field consisted of
  • The premier conventional (alpha/beta or AB) engine, Stockfish
  • Three neural network-based (NN) engines (Leela Chess Zero, aka LC0, Leelenstein and Antifish)
  • Everyone else
In the first round, each of these played each of the others several times, half of the time as white and half as black.  The top four would progress to the next round.  When the smoke cleared, the top four were LC0, the other two neural network engines, and Stockfish rounding out the field in fourth.

In the second round, each of the four played each of the others 100 times (50 as white and 50 as black), for a total of 600 games.  This went on around the clock for several days.   The final result had LC0 in first, Stockfish close behind, well ahead of Leelenstein, which was itself well ahead of Antifish.  LC0 had a small lead over Stockfish in their head-to-head games.

Looking at this with my somewhat defocused engineer's eyes, I'd say "LC0 and Stockfish were about equal, both were clearly stronger than Leelenstein, and Antifish was clearly the weakest of the lot."  I'd consider LC0 and Stockfish essentially equal not only because of the small difference in score (5.5 points out of a possible 300), but because the margin between the two was steady for most of the play.  LC0 won some early games against Stockfish head-to-head, but after that each scored several wins against the other.  The early lead thus looks more like a statistical fluke than a real difference in strength [I'm not sure a real statistician would buy this analysis -D.H May 2019].  The results of CCC 6, with slightly different rules and time controls, bear this out.  Stockfish beat LC0, but not by much, and, if I recall correctly, not head-to-head.

As to my long-running theme of "dumb is smarter", the "zero" in LC0 means that it was trained purely by playing against itself, without any external hints.  Antifish, which finished last and had a losing record against each of the others, including Stockfish, was trained specifically to exploit weaknesses in Stockfish.  If we take LC0's "no extra information to bias the results" approach as one kind of "dumb" and Stockfish's "bash out lots of positions using fairly straightforward (but highly-tuned) rules" as another, it's pretty clear that "dumb" dominated.

One other thing jumped out: Neural networks are not good at endgames, or at least they can be spectacularly bad at some endgames.

Endgames are interesting here because there are few pieces on the board and it's often possible to spell out an explicit plan for winning.  For example, if you have a rook and a king against a rook, you can use your rook and king in concert to pin the other king against the edge of the board and then checkmate with the rook.  In other cases, you want to promote a pawn and this hinges on getting your king to the right square and keeping the other king from stopping you.

Over and over I saw neural network engines flail about aimlessly before hitting the 50 move limit.  To be clear, there are a few endgames that can take more than 50 moves to win, but even in these you can see a clear progression.  To make up an example, after a series of ten checks the opposing king is now one square closer to the corner where it will eventually be checkmated.  This isn't what was happening.  In one game I saw, the AB engine showed mate in 12, while the NN engine saw its winning chances as quite good.  The game continued

AB: Oh, now you've got mate in 14
NN: Yeah this looks good
AB: I don't see a mate, but I think you're going to promote that pawn (or whatever) and win, so my rating is still highly in your favor
NN: Yeah this looks good
AB: Dang, mate in 14 again
NN: Yeah this looks good
AB: Mate in 13
NN: Yeah this looks good
AB: I don't see a mate any more, but ...
NN: Yeah this looks good

From the NN's point of view, it was going from one position with good winning chances to another.  But it wasn't actually winning.  When the game eventually ended in a draw, the NN gave away a half point that the AB engine or a good human player would have gotten, given the same position.


I was a little (but not a lot) surprised by this.  It doesn't seem like the problem is beyond the capability of neural networks.  They've discovered plenty of conventional tactics -- forks, skewers and such, or ladders in the game of go -- and they are clearly able to encode at least some of the usual winning patterns.

Give a neural network a king and queen against a queen and it will find a checkmate more or less like anyone else would.  On the other hand, give it a queen, a pawn and a rook against a king and it will quite likely forcibly give away two of the three and win with what's left.  To a mathematician it makes sense (reducing to a previously-solved case), but to most chess players it looks like lunacy, if not deliberate trolling [In one fairly jaw-dropping example from the Deep Dive finals, LC0 had an overwhelming material advantage in an endgame and gave away pieces to reduce it to bishop, knight and king against lone king, one of the harder endgames to win.  It went ahead and won anyway --D.H May 2019].  

I found this behavior particularly interesting because neural networks are supposed to have more of that special sauce we call intelligence or understanding.  We can clearly say that Stockfish has no deep understanding of chess the way that a human grandmaster can intuitively say "Your pieces are active and the king is exposed.  Go for a mating attack," but at least it plays plausible moves.  NN engines often find wild moves that a human wouldn't consider and an AB might avoid because it doesn't show a clear win, leading us to feel it has some deeper understanding of chess.  I think that's a valid feeling, but there are cases where it completely falls apart.

Watching a neural network botch an endgame just looks like farce.  I'm perfectly comfortable saying that a program "understands" a concept in chess, or "has a concept of" pawn structure or whatever.  It doesn't mean that the program has some reflective layer thinking "Oh, that's a strong pawn chain, I like that", but that it behaves in such a way that, one way or another, it must recognize good and bad structures.

In cases like these endgames, though, it's abundantly clear that the network has no concept, in any meaningful literal or metaphorical sense of the word, of how to win the position.  Nor is it going to be able to figure one out over the board, because the network only learns during the training phase (that's not necessarily a requirement, but it's how these engines work).

This doesn't seem like a permanent situation.  For example, you could have an NN engine train on endgame positions, and it would likely discover ways to win outright and ways to steer toward a win in positions where it now flails.  This is pretty much what humans do -- you learn how to promote pawns as one skill, how to checkmate with a rook and king as another, and so forth.

What's at least a bit surprising is that an engine can be frighteningly strong in most of the game and completely clueless in a few key spots.  This has to say something about intelligence and learning, or about what kind of a problem chess-playing is technically, or about all of the above, but I'm not quite sure what.