When I last looked in on computer chess, AlphaZero had recently made waves by beating Stockfish after spending nine hours training by playing games against itself with no outside interference. As I understand it, the configuration Stockfish was running wasn't its strongest, but the result was still impressive: a chess engine that looked at relatively few positions but used a neural network to evaluate them (an "NN" engine) beat a system that looked at billions using a hand-tuned, human-written algorithm (an "AB" engine). Soon an open-source engine based on AlphaZero, Leela Chess Zero (LC0), was doing impressively well in tournaments.
The hallmark of NN engines was that they would play wild-looking moves that neither a human chess master nor an engine like Stockfish would have played at the time. Such moves looked risky or even downright reckless, but often turned out to lead to a crushing advantage, all because similar-looking moves had led to good results in training games.
At this writing, LC0 is still doing quite well in tournaments, but not quite as well as Stockfish, which consistently beats it. So AB wins, right?
Well, not quite. At the heart of an AB engine is the evaluation function, which takes a position on the board and returns a number saying how good the position is. The rest of the engine is dedicated to searching the tree of possible moves, replies to those moves, replies to the replies and so on, typically a few dozen levels deep, to find the move that leads to the best position reachable against the opponent's best play.
There is a whole lot of software engineering behind making this search as efficient as possible, including a technique called alpha-beta pruning that gave rise to the "AB" designation. The principle behind alpha-beta pruning is simple: stop examining a move as soon as you find one reply that makes it worse for you than a move you already have, since the opponent can always choose at least that reply. Even so, my brain gets completely befuddled when I try to follow the code (see the sketch below), probably because the rule is applied recursively for both sides, so the meaning of "better" flips each time the search switches sides.
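For the curious, here's roughly what that recursion looks like in Python. This is a minimal sketch, not any real engine's code; legal_moves, make_move, undo_move and evaluate are hypothetical helpers.

```python
# A minimal negamax search with alpha-beta pruning, just to show the shape.
# legal_moves, make_move, undo_move and evaluate are hypothetical helpers,
# not any real engine's API; evaluate scores from the side to move's view.
def negamax(position, depth, alpha, beta):
    if depth == 0:
        return evaluate(position)
    best = -float("inf")
    for move in legal_moves(position):
        make_move(position, move)
        # The negated score and the swapped, negated bounds are what flip
        # the meaning of "better" each time the search switches sides.
        score = -negamax(position, depth - 1, -beta, -alpha)
        undo_move(position, move)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # prune: the opponent already has a better option elsewhere
    return best
```

The single `break` line is the entire pruning trick; everything else is plain minimax with the two sides' viewpoints folded into one function by negation.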
Until recently, evaluation functions had been carefully crafted to extract features from a position, like how much material each side had, which pieces had good or bad mobility, how each side's pawns were structured and so forth, and to combine those using carefully selected rules to arrive at a final evaluation. A significant part of this is figuring out how much weight to assign to each feature in which circumstances. Essentially, this means answering questions like "Is it better to have an extra pawn, or better mobility and pawn structure?". The actual answer is "It depends. We need a rule for deciding how much weight to give each of those factors." A sketch of the general shape appears below.
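To make that concrete, here's a toy sketch of what such a hand-crafted evaluation might look like. The feature extractors and the weights are invented for illustration; real engines used far more features and far more careful tuning, often varying the weights by game phase.

```python
# An illustrative hand-crafted evaluation: extract features, combine with weights.
# material_balance, mobility_difference and pawn_structure_score are hypothetical,
# as are the weight values themselves.
WEIGHTS = {"material": 1.0, "mobility": 0.1, "pawn_structure": 0.3}

def evaluate(position):
    features = {
        "material": material_balance(position),            # pawn = 1, knight = 3, ...
        "mobility": mobility_difference(position),         # our move count minus theirs
        "pawn_structure": pawn_structure_score(position),  # doubled, isolated, passed pawns
    }
    return sum(WEIGHTS[name] * features[name] for name in WEIGHTS)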
One of the reasons for Stockfish's success is its well-designed test framework for evaluating new code, including new evaluation functions. Different versions of the engine are systematically played against each other, and only changes that win make it into the next version.
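The real framework (the Stockfish project calls it Fishtest, if I recall correctly) runs sequential statistical tests over tens of thousands of games. A much cruder toy version of the gatekeeping idea might look like this, with play_game as a hypothetical stand-in:

```python
# Crude sketch of the gatekeeping idea: play many games between the current
# engine and a candidate change, and only accept the change if it scores
# clearly better than 50%. play_game is a hypothetical stand-in returning
# 1.0 for a candidate win, 0.5 for a draw, and 0.0 for a loss.
def accept_change(candidate, baseline, games=1000, threshold=0.52):
    score = 0.0
    for i in range(games):
        # Alternate colors so neither version gets a first-move advantage.
        score += play_game(candidate, baseline, candidate_is_white=(i % 2 == 0))
    return score / games >= threshold
```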
Extracting features and carefully tuning various parameters that determine how to combine them certainly seems like what I previously called "ML-friendly problems", and it didn't take too long for someone to try that out. The result was the NNUE (an "efficiently updatable neural network"), which takes the positions of the pieces, with special attention given to the kings, and produces a numerical evaluation. The NNUE was good enough in testing to find its way into the official release, where it remains to this day.
So NN wins, right?
Well, not quite. A pure NN engine like LC0 is applying a large and slow neural net to a relatively small number of positions. It doesn't look ahead very far. In principle, an NN engine might look at only the positions after each possible move in the current position, typically a couple dozen. In practice, they look at hundreds of thousands, which is far more than a human player could, but still far fewer than an AB engine does. The power of an NN engine comes from the weightings in its neural net, which in turn come from playing large numbers of training games.
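For a sense of how an NN engine decides which of those relatively few positions to look at next, here's a sketch of the selection rule AlphaZero-style engines use (the "PUCT" formula). The node fields here are my assumptions, not LC0's actual data structures; the point is that the network's policy prior steers the search toward promising moves, which is why so few positions need to be visited.

```python
import math

C_PUCT = 1.5  # exploration constant; real engines tune this

def select_child(node):
    # Pick the child maximizing Q + U: exploitation (average value so far)
    # plus an exploration bonus weighted by the network's prior for the move.
    total_visits = sum(child.visits for child in node.children)
    def puct(child):
        q = child.value_sum / child.visits if child.visits else 0.0
        u = C_PUCT * child.prior * math.sqrt(total_visits) / (1 + child.visits)
        return q + u
    return max(node.children, key=puct)
```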
By comparison, the NNUE is tiny. Here's a picture of its weightings for one particular release. I haven't run down the exact details, but as I understand it, the NNUE takes tens of thousands of sparse binary inputs, each encoding a piece placement relative to one of the kings, feeds them through four small layers, and a typical parameter file runs to a few dozen megabytes. LC0's network is much larger, though still tiny compared to the ChatGPTs of the world (which don't even really know the rules of chess, as this fairly sharply worded piece argues).
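If I have the design right, the back-of-envelope arithmetic for the input size of the original feature set (called HalfKP) goes something like this; the exact count varies a bit between versions:

```python
# One binary input per (our king square, non-king piece, piece square)
# combination, per side. Rough arithmetic only.
king_squares = 64
piece_planes = 10  # 5 piece types (kings excluded) x 2 colors
piece_squares = 64
print(king_squares * piece_planes * piece_squares)  # 40,960 inputs per side
```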
If that's all too vague for you (it is for me), the NNUE code runs on a standard CPU and can do hundreds of millions of evaluations per second, while LC0 prefers running its network on a GPU and does tens of thousands of evaluations per second.
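Part of the reason for that CPU speed is the "efficiently updatable" part of the name. As I understand it, the output of the large first layer (the "accumulator") is carried along from position to position and patched up incrementally when a move is made, since a move only flips a handful of inputs. Something like this sketch, where W1 is the first-layer weight matrix with one row per input feature:

```python
import numpy as np

# Instead of recomputing the first layer from scratch, add and subtract
# only the weight rows for the few features a move changed.
def update_accumulator(acc, removed_features, added_features, W1):
    for f in removed_features:
        acc -= W1[f]  # feature that disappeared (e.g. the moved piece's old square)
    for f in added_features:
        acc += W1[f]  # feature that appeared (the piece's new square)
    return acc
```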
By looking at orders of magnitude more positions than LC0, Stockfish is in effect trusting its neural network much, much less than an NN engine does, relying instead on very deep searches to determine which move to play. Put another way, its verdict on a move is the aggregate of billions of simplistic evaluations rather than a few hundred thousand sophisticated ones: Stockfish looks at many, many positions quickly, while an NN engine looks at far fewer positions much more carefully.
The NNUE is essentially automating the process of extracting features from a position and deciding how to combine them. There's nothing particularly mysterious going on. Its evaluations are similar to those produced by the older code, though different enough to lead to better outcomes when fed into the AB algorithm.
Even in the case of NN engines, the neural net isn't doing all the work. It's still running inside a search framework of "look at the possible moves, look at the replies to each move, and so on" (alpha-beta pruning for Stockfish, a variant of Monte Carlo tree search for LC0). That framework wasn't created by a neural net. It was developed by humans decades ago. No LLM has written code for a successful chess engine. Within the framework that actual chess engines are built on, it turns out that a bit of neural network-based code can be helpful, but past a certain quite small amount, it doesn't seem to help.