When I last looked in on computer chess, AlphaZero had recently made waves by beating Stockfish after spending nine hours training by playing games against itself with no outside interference. As I understand it, the configuration Stockfish was running wasn't its strongest, but the result was still impressive: a chess engine that looked at relatively few positions but used a neural network to evaluate them (an "NN" engine) beat a system that looked at billions using a hand-tuned, human-written algorithm (an "AB" engine). Soon an open-source engine based on AlphaZero, Leela Chess Zero (LC0), was doing impressively well in tournaments.
The hallmark of NN engines was that they would play wild-looking moves that neither a human chess master nor an engine like Stockfish would have played at the time. Such moves looked risky or even downright reckless, but often turned out to lead to a crushing advantage, all because similar-looking moves had led to good results in training games.
At this writing, LC0 is still doing quite well in tournaments, but not quite as well as Stockfish, which consistently beats it. So AB wins, right?
Well, not quite. At the heart of an AB engine is the evaluation function, which takes a position on the board and returns a number saying how good the position is. The rest of the engine is dedicated to searching the tree of possible moves, replies to those moves, replies to the replies and so on, typically a few dozen levels deep, to find the move that leads to the best position reachable against the opponent's best play.
There is a whole lot of software engineering behind making this search as efficient as possible, including a technique called alpha-beta pruning that gave rise to the "AB" designation. The principle behind alpha-beta pruning is simple: stop examining a move as soon as you find one reply that makes it worse for you than a move you already have, since the opponent can always choose at least that reply. Even so, my brain gets completely befuddled when I try to follow the code (see the sketch below), probably because the rule is applied recursively for both sides, so the meaning of "better" flips each time the search switches sides.
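For the curious, here's roughly what that recursion looks like in Python. This is a minimal sketch, not any real engine's code; legal_moves, make_move, undo_move and evaluate are hypothetical helpers.

```python
# A minimal negamax search with alpha-beta pruning, just to show the shape.
# legal_moves, make_move, undo_move and evaluate are hypothetical helpers,
# not any real engine's API; evaluate scores from the side to move's view.
def negamax(position, depth, alpha, beta):
    if depth == 0:
        return evaluate(position)
    best = -float("inf")
    for move in legal_moves(position):
        make_move(position, move)
        # The negated score and the swapped, negated bounds are what flip
        # the meaning of "better" each time the search switches sides.
        score = -negamax(position, depth - 1, -beta, -alpha)
        undo_move(position, move)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # prune: the opponent already has a better option elsewhere
    return best
```

The single `break` line is the entire pruning trick; everything else is plain minimax with the two sides' viewpoints folded into one function by negation.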
Until recently, evaluation functions had been carefully crafted to extract features from a position, like how much material each side had, which pieces had good or bad mobility, how each side's pawns were structured and so forth, and to combine those using carefully selected rules to arrive at a final evaluation. A significant part of this is figuring out how much weight to assign to each feature in which circumstances. Essentially, this means answering questions like "Is it better to have an extra pawn, or better mobility and pawn structure?". The actual answer is "It depends. We need a rule for deciding how much weight to give each of those factors." A sketch of the general shape appears below.
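To make that concrete, here's a toy sketch of what such a hand-crafted evaluation might look like. The feature extractors and the weights are invented for illustration; real engines used far more features and far more careful tuning, often varying the weights by game phase.

```python
# An illustrative hand-crafted evaluation: extract features, combine with weights.
# material_balance, mobility_difference and pawn_structure_score are hypothetical,
# as are the weight values themselves.
WEIGHTS = {"material": 1.0, "mobility": 0.1, "pawn_structure": 0.3}

def evaluate(position):
    features = {
        "material": material_balance(position),            # pawn = 1, knight = 3, ...
        "mobility": mobility_difference(position),         # our move count minus theirs
        "pawn_structure": pawn_structure_score(position),  # doubled, isolated, passed pawns
    }
    return sum(WEIGHTS[name] * features[name] for name in WEIGHTS)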
One of the reasons for Stockfish's success is its well-designed test framework for evaluating new code, including new evaluation functions. Different versions of the engine are systematically played against each other, and only changes that win make it into the next version.
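The real framework (the Stockfish project calls it Fishtest, if I recall correctly) runs sequential statistical tests over tens of thousands of games. A much cruder toy version of the gatekeeping idea might look like this, with play_game as a hypothetical stand-in:

```python
# Crude sketch of the gatekeeping idea: play many games between the current
# engine and a candidate change, and only accept the change if it scores
# clearly better than 50%. play_game is a hypothetical stand-in returning
# 1.0 for a candidate win, 0.5 for a draw, and 0.0 for a loss.
def accept_change(candidate, baseline, games=1000, threshold=0.52):
    score = 0.0
    for i in range(games):
        # Alternate colors so neither version gets a first-move advantage.
        score += play_game(candidate, baseline, candidate_is_white=(i % 2 == 0))
    return score / games >= threshold
```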
Extracting features and carefully tuning various parameters that determine how to combine them certainly seems like what I previously called "ML-friendly problems", and it didn't take too long for someone to try that out. The result was the NNUE (an "efficiently updatable neural network"), which takes the positions of the pieces, with special attention given to the kings, and produces a numerical evaluation. The NNUE was good enough in testing to find its way into the official release, where it remains to this day.
So NN wins, right?
Well, not quite. A pure NN engine like LC0 is applying a large and slow neural net to a relatively small number of positions. It doesn't look ahead very far. In principle, an NN engine might look at only the positions after each possible move in the current position, typically a couple dozen. In practice, they look at hundreds of thousands, which is far more than a human player could, but still far fewer than an AB engine does. The power of an NN engine comes from the weightings in its neural net, which in turn come from playing large numbers of training games.
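For a sense of how an NN engine decides which of those relatively few positions to look at next, here's a sketch of the selection rule AlphaZero-style engines use (the "PUCT" formula). The node fields here are my assumptions, not LC0's actual data structures; the point is that the network's policy prior steers the search toward promising moves, which is why so few positions need to be visited.

```python
import math

C_PUCT = 1.5  # exploration constant; real engines tune this

def select_child(node):
    # Pick the child maximizing Q + U: exploitation (average value so far)
    # plus an exploration bonus weighted by the network's prior for the move.
    total_visits = sum(child.visits for child in node.children)
    def puct(child):
        q = child.value_sum / child.visits if child.visits else 0.0
        u = C_PUCT * child.prior * math.sqrt(total_visits) / (1 + child.visits)
        return q + u
    return max(node.children, key=puct)
```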
By comparison, the NNUE is tiny. Here's a picture of its weightings for one particular release. I haven't run down the exact details, but as I understand it, the NNUE takes tens of thousands of sparse binary inputs, each encoding a piece placement relative to one of the kings, feeds them through four small layers, and a typical parameter file runs to a few dozen megabytes. LC0's network is much larger, though still tiny compared to the ChatGPTs of the world (which don't even really know the rules of chess, as this fairly sharply worded piece argues).
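If I have the design right, the back-of-envelope arithmetic for the input size of the original feature set (called HalfKP) goes something like this; the exact count varies a bit between versions:

```python
# One binary input per (our king square, non-king piece, piece square)
# combination, per side. Rough arithmetic only.
king_squares = 64
piece_planes = 10  # 5 piece types (kings excluded) x 2 colors
piece_squares = 64
print(king_squares * piece_planes * piece_squares)  # 40,960 inputs per side
```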
If that's all too vague for you (it is for me), the NNUE code runs on a standard CPU and can do hundreds of millions of evaluations per second, while LC0 prefers running its network on a GPU and does tens of thousands of evaluations per second.
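Part of the reason for that CPU speed is the "efficiently updatable" part of the name. As I understand it, the output of the large first layer (the "accumulator") is carried along from position to position and patched up incrementally when a move is made, since a move only flips a handful of inputs. Something like this sketch, where W1 is the first-layer weight matrix with one row per input feature:

```python
import numpy as np

# Instead of recomputing the first layer from scratch, add and subtract
# only the weight rows for the few features a move changed.
def update_accumulator(acc, removed_features, added_features, W1):
    for f in removed_features:
        acc -= W1[f]  # feature that disappeared (e.g. the moved piece's old square)
    for f in added_features:
        acc += W1[f]  # feature that appeared (the piece's new square)
    return acc
```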
By looking at orders of magnitude more positions than LC0, Stockfish is in effect trusting its neural network much, much less than an NN engine does, relying instead on very deep searches to determine which move to play. Put another way, its verdict on a move is the aggregate of billions of simplistic evaluations rather than a few hundred thousand sophisticated ones: Stockfish looks at many, many positions quickly, while an NN engine looks at far fewer positions much more carefully.
The NNUE is essentially automating the process of extracting features from a position and deciding how to combine them. There's nothing particularly mysterious going on. Its evaluations are similar to those produced by the older code, though different enough to lead to better outcomes when fed into the AB algorithm.
Even in the case of NN engines, the neural net isn't doing all the work. It's still running inside a search framework of "look at the possible moves, look at the replies to each move, and so on" (alpha-beta pruning for Stockfish, a variant of Monte Carlo tree search for LC0). That framework wasn't created by a neural net. It was developed by humans decades ago. No LLM has written code for a successful chess engine. Within the framework that actual chess engines are built on, it turns out that a bit of neural network-based code can be helpful, but past a certain quite small amount, it doesn't seem to help.