Tuesday, September 10, 2024

Tying up a few loose ends about models and memory

Most of the time when I write a post, I finish it up before going on to the next one.  Sometimes I'll keep a draft around if something else comes up before I have something that feels ready to publish, and sometimes weeks or even months can pass between then and actually publishing, but I still prefer to publish the current post before starting a new one.

However, a while ago ... nearly five years ago, it looks like ... I ran across an article on a demo by OpenAI in which agents played games of hide-and-seek in virtual environments.  Over the course of hundreds of millions of games, the hiders and seekers developed strategies and counter-strategies, including some that the authors of the article called "tool use".

I've put that in "scare quotes" ("scare quotes" around "scare quotes" because what's scary about noting that it was someone else who said something?), but I don't really have a problem with calling something like moving objects around in a world, real or virtual, to get an advantage "tool use" (those are use/mention quotes, if anyone's keeping score).

As usual, though, I'm more interested in the implications than the terminology, and this seemed like another example of trying to extrapolate from "sure, we can use terms like tool use and planning here with a straight face" to "AI systems are about to develop whatever it is we think is special about our intelligence, which means they might be about to take over the world."

Writing that brought a thought to mind that seems worth noting: To whatever extent we've taken over the world, it's taken us on the order of 70,000 years to get here, depending on how you count.  In that light, it seems a bit odd to conclude that anything else with intelligence similar to ours will be able to be running the place overnight, especially if we know they're coming.

But I'm digressing from what was already a digression.  In the process of putting together several posts prompted by that article, and still being in that process when ChatGPT happened, I ended up pondering some questions that didn't quite make it into other posts, at least not in the form that they originally occurred to me.

So here we are:

First, I was most intrigued by the idea that the hide-and-seek agents seemed to have object permanence, that is, the ability to remember that something exists even when you can't see it or otherwise perceive it directly.

This is famously a milestone in human development.  As with many if not most cognitive abilities, understanding of object permanence has evolved over time, and there is no singular point at which babies normally "acquire object permanence" (call those whatever kind of quotes you like).

Newborn babies do not appear to have object permanence, but in their first year or two they pass several milestones, including what the Wikipedia article I linked to calls "Coordination of secondary circular reactions", which among other things means "the child is now able to retrieve an object when its concealment is observed" (straight-up "this is what the article said" quotes there, and I think I'll stop this game now).

The hide-and-seek agents seem to have similar abilities, particularly being able to return to the site of an object they've discovered or to track how many objects have been moved out of sight to the left versus to the right.  There are two interesting questions here:

  • Do the hide-and-seek agents have the same object permanence capabilities as humans?
  • Do the hide-and-seek agents have object permanence in the same way as humans?
I'm making the same distinction here that I have in previous posts.  The first question can be answered directly: Put together an experiment that requires remembering where objects were or which way they've gone and see if the agents perform similarly to humans.

The second is more difficult to answer, because it can't be answered directly.  Instead, we have to form a theory about how humans are able to track the existence of unseen objects, and then test whether that theory is consistent with what humans actually do, and then, once there is a way of testing whether someone or something has that particular mechanism, try the same tests on the hide-and-seek agents.  Assuming that all goes well, you still don't have an airtight case, but you have reason to believe that the agents are doing similar things to what humans do when demonstrating object permanence (in some particular set of senses).

There's actually a third question: Are the hide-and-seek agents experiencing objects and events in their world the same way we experience objects and events in our world?  I would call that a philosophical question, probably unknowable in some fundamental sense.  That's not to say that there's no point in exploring it, or exploring whether or not such things are knowable, just that at this point we're far outside the realm of verifiable experiments -- unless some clever philosopher is able to devise an experiment that will give us a meaningful answer.

The interesting part here is that we have a pretty good idea how agents such as the hide-and-seek agents are able to have capabilities like object permanence.  In broad strokes, a hide-and-seek agent is consuming a stream of inputs analogous to our own sensory inputs such as sight and sound.  In particular (quoting from the OpenAI blog post):
  • The agents can see objects in their line of sight and within a frontal cone.
  • The agents can sense distance to objects, walls, and other agents around them using a lidar-like sensor.
At any given time step, the agents are given a summary of what is visible at what distance at that point (rather than, say, getting an image and having to deduce from the pixels what objects are where), or at least I believe this is what the blog post means by "Agents use an entity-centric state-based representation of the world".  From this, each agent produces a stream of actions: move, grab an object, or lock an object (which prevents other agents from moving it).
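
To make the setup a bit more concrete, here's a minimal sketch of what one timestep of that entity-centric input and the handful of available actions might look like.  This is my guess at the general shape of the data, not OpenAI's actual code; every name and field here is made up for illustration.

```python
# Hypothetical shapes for one timestep of an agent's observation and its
# possible actions -- illustration only, not the real environment's API.
from dataclasses import dataclass
from enum import Enum, auto
from typing import List

@dataclass
class EntityObservation:
    kind: str          # "box", "ramp", "wall", "agent", ...
    distance: float    # how far away it is
    bearing: float     # angle relative to the direction the agent is facing

@dataclass
class TimestepObservation:
    visible_entities: List[EntityObservation]  # entities within the frontal cone
    lidar_distances: List[float]               # lidar-like range readings around the agent

class Action(Enum):
    MOVE = auto()      # move or turn
    GRAB = auto()      # grab an object
    LOCK = auto()      # lock an object so other agents can't move it
```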

In between the stream of inputs and the actions taken at a particular timestep is a neural network which is trained to extract the important parts from the input stream and turn them into actions.  This neural network is trained based on the results of millions of simulated games of hide-and-seek, but it's static for any particular game.  In some sense, it's encoding a memory of what happened in all the games it's been trained on -- producing this particular stream of actions in response to this particular stream of input resulted in success or failure, times many millions -- but it's not encoding anything about the current game.

Just going by the blog post, I can't tell exactly what sort of memory the agents do have, but from the context of how transformer-based models work, it is presumably a memory of the input stream, either from the beginning of the current game or over a certain window.  That is, at any particular timestep, the agent can use not only what it can sense at that time step, but also what it has sensed at previous time steps.

This makes object permanence a little less mysterious.  If an agent sensed a box dead ahead and ten units away, then it turned 90 degrees to the right and went three units forward, it's not too surprising for it to act as though there is now a box three units behind it and ten units to the left.

The key here is "act as though".  In the same situation, a person would have some sort of mental image of a box in a particular location.  The only thing that the hide-and-seek agent is explicitly remembering about the current game is what it's sensed so far.

Presumably, there is something in the neural net that turns "I saw a box at this distance" followed by "I moved in such-and-such a way" into a signal deeper in the net that in some sense means "there is a box at this location", in some sort of robust and general way so that it can encode box locations in general, not just any particular example.  Even deeper layers can then use this representation of the world to work out what kinds of actions will have the greatest chance of success.  This is probably not exactly what's going on but ... something along those lines.
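
To make that less abstract, here's the arithmetic the box example above involves, written out as plain dead reckoning.  This is just an illustration of the geometry, not a claim about how the agents' networks actually represent any of it.

```python
# Dead reckoning for the box example: "box dead ahead, ten units away",
# then turn 90 degrees right and move three units forward.
import math

# World frame: the agent starts at the origin, facing the +y direction.
agent_pos = (0.0, 0.0)
agent_heading = math.pi / 2            # angle of the agent's facing direction
box_pos = (0.0, 10.0)                  # seen dead ahead, ten units away

# Turn 90 degrees to the right, then move three units forward.
agent_heading -= math.pi / 2           # now facing +x
forward = (math.cos(agent_heading), math.sin(agent_heading))
agent_pos = (agent_pos[0] + 3 * forward[0], agent_pos[1] + 3 * forward[1])

# Where is the remembered box relative to the agent now?
offset = (box_pos[0] - agent_pos[0], box_pos[1] - agent_pos[1])
left = (-math.sin(agent_heading), math.cos(agent_heading))   # the agent's "left" direction
ahead_amount = offset[0] * forward[0] + offset[1] * forward[1]
left_amount = offset[0] * left[0] + offset[1] * left[1]

print(f"{ahead_amount:+.1f} units ahead, {left_amount:+.1f} units to the left")
# -3.0 units ahead (that is, three behind), +10.0 units to the left
```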

Is it possible that humans do something similar when remembering locations of objects?  It's possible, but people don't always seem to have sequences of events in mind when remembering where objects are.  It can be helpful to remember things like "I came downstairs with my keys and then I was talking to you and I think I left the keys on the table", but it doesn't seem to be necessary.  If I tell you that I left the keys on the table in a room of a house you've never been to, you can still find the keys.  If all I remember is that I left the keys on the table, but I'm not exactly sure how that came to be, I can still find them.

In other words, we seem to form mental images of places and the objects in them.  While one way to form such an image is by experiencing moving through a place and observing objects in it, it's not the only way, and we can still deal with our mental map of places and things in them even after the original sequence of experiences is long forgotten.

We appear to remember things after doing significant processing and throwing away the input that led to the memories (or at least separating our memory of what happened from the memory of what's where).  The way that transformer-based models handle sequences of events is not only different from what we appear to do, it's deliberately different.

Bear in mind that I'm not an expert here.  I've done a bit of training on the basics of neural net-based ML and I've read up a bit on transformers and related architectures, so I think what follows is basically accurate, but I'm sure an actual expert would have some notes and corrections. 

One definition before we dive in: token is the general term for an item in a stream of input, representing something on the order of a word of text or the description of what an agent senses, after it's been boiled down to a vector of numbers by a procedure that varies depending on the particular kind of input.
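
As a toy example of what "boiled down to a vector of numbers" might look like for text -- real tokenizers and embeddings are considerably more involved, and the vocabulary and vectors below are made up:

```python
# Toy word-level tokenization: each word becomes a fixed-length vector.
import numpy as np

vocab = {"I": 0, "don't": 1, "think": 2, "that": 3}
embedding_dim = 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

def tokenize(text):
    """Turn a string into a list of token vectors (toy version)."""
    return [embedding_table[vocab[word]] for word in text.split()]

tokens = tokenize("I don't think that")
print(len(tokens), tokens[0].shape)    # 4 tokens, each a vector of 4 numbers
```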

The problem of attention -- how heavily to weight different tokens in a stream of input -- has been an active area of research, and transformers handle it differently from earlier types of models.  The previous generation of models used Recurrent Neural Networks (RNNs), which did something more like maintaining a short-term memory of what's going on.  Each input token is processed by a net to produce two sets of signals: output signals that say what to do at that particular point, and hidden state signals that are fed back as inputs when processing the next input token.

In some sense, the hidden state signals represent the state of the model's memory at that point.  Giving a token extra attention means boosting its signal in the hidden state that will be used in processing the next token, and indirectly in processing the tokens after that.
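
Here's a minimal sketch of that loop -- one step of a vanilla RNN in NumPy, with random weights standing in for a trained model:

```python
# One vanilla RNN step: the hidden state is the only thing carried forward.
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 4, 8, 3
W_in = rng.normal(size=(hidden_dim, input_dim))
W_hidden = rng.normal(size=(hidden_dim, hidden_dim))
W_out = rng.normal(size=(output_dim, hidden_dim))

def rnn_step(token, hidden):
    """Produce an output signal and a new hidden state from one input token."""
    new_hidden = np.tanh(W_in @ token + W_hidden @ hidden)
    output = W_out @ new_hidden
    return output, new_hidden

hidden = np.zeros(hidden_dim)                   # the "memory" starts out empty
for token in rng.normal(size=(5, input_dim)):   # a stream of five dummy tokens
    output, hidden = rnn_step(token, hidden)    # the hidden state is fed back in
```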

This approach had two problems: First, because the inputs to the net depend on the hidden state outputs from previous tokens, you have to compute one token at a time, which means you can't just throw more hardware at processing more tokens.  More hardware might make each individual step faster, but only up to the limits of current hardware.  It's going to take 10,000 steps to process 10,000 tokens, no matter what.

Second, since everything that's come before is boiled down into a single set of hidden state signals, the longer ago an input token was processed, the less influence it can have on the final result (this is closely related to the "vanishing gradient problem" that makes these nets hard to train).  Even if a token has a large influence on the hidden state when it's processed, that influence tends to get washed out as more tokens are processed.
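
As an oversimplified numeric illustration (a real RNN doesn't literally multiply by a fixed constant, but the effect is similar): if each new token scales down the old contents of the hidden state even a little, a token from a hundred steps ago contributes almost nothing.

```python
# If each step keeps only 90% of the old hidden state's contribution,
# a token from 100 steps ago is all but gone.
influence = 1.0
for _ in range(100):
    influence *= 0.9
print(influence)   # about 2.7e-05
```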

Unfortunately, events that happened long ago can be more important than ones that happened more recently.  Imagine someone saying "I don't think that ..." followed by a long, overly-detailed explanation of what they don't think.  The "not" in "don't" may well be more important than the fourth bullet point in the middle.

Even though an RNN works roughly the way our brains seem to work, receiving inputs one at a time and maintaining some sort of memory of what's happened, models based purely on hidden state don't perform very well, probably because our own memories do more than just maintain a set of feedback signals.  There have been attempts to use more sophisticated forms of memory in RNNs, particularly "Long Short-Term Memory" (LSTM).  This works better than plain hidden state alone, and it was the state of the art before transformers came along.

Transformers take a completely different approach.  At each step, they take as input the entire stream of tokens so far.  At timestep 1, the model's output is based on what's happening then.  At timestep 2, it's based on what happened at timestep 1 and what's happening at timestep 2, and so on.  If you only give the model "this happened at timestep 1 and this happened at timestep 2", it should produce the same results whether or not it was ever asked to produce a result for timestep 1.

Processing an input stream at one point does not affect how it will process an input stream at any other point.  The only remembering going on is remembering the whole of the input stream.  This means that any token in the input stream can be given as much importance as any other.

A transformer consists of two parts.  The first digests the entire input stream and picks out the important parts, and it can do this in multiple ways.  One "head" in a language-processing model might weight tokens based on which words are next to each other.  Another might pay attention to verbs and their objects.  Input tokens are tagged with their position in the stream, so a transformer trained to work on text could weight "I don't think that ..." in early positions as being important, or look for some types of words close to other types of words.

Whatever comes out of that stage goes into another network that decides what output to actually produce (this network actually consists of multiple stages, and the whole attention-and-other-processing setup can be repeated and stacked up, but that's the basic structure).
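
Here's a minimal single-head version of that attention stage in NumPy, just to make "score every token against every other token, then mix" concrete.  Real transformers use many heads, learned weights, more careful positional encodings and stacked layers; everything below is a random-weight toy.

```python
# Toy single-head self-attention over a short stream of token vectors.
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 6, 8
W_query, W_key, W_value = (rng.normal(size=(dim, dim)) for _ in range(3))

tokens = rng.normal(size=(seq_len, dim))            # the input stream so far
positions = np.arange(seq_len)[:, None] / seq_len   # crude positional tag
tokens = tokens + positions                         # so early vs. late tokens look different

Q, K, V = tokens @ W_query, tokens @ W_key, tokens @ W_value
scores = Q @ K.T / np.sqrt(dim)     # one score for every pair of tokens (seq_len x seq_len)

# Causal mask: a token may only attend to itself and earlier tokens.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

# Softmax each row into attention weights, then take a weighted mix of the values.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attended = weights @ V              # this is what gets handed to the later stages
```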

A transformer-based model does this at every timestep, which means that the first input token is processed at every timestep, the second one is processed at every timestep but the first, and so forth.  This means that handling twice as long a stream of input will require approximately four times as much processing, three times as long will require nine times as much, and so on.  Technically, the amount of processing required grows quadratically with the size of the input.

For similar reasons, the attention stage itself grows quadratically in the size of the input it can handle, at least without some sort of optimization.  In this sense, a transformer is less efficient than an RNN, since it will use more computing resources to handle the same stream.
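
The counting behind that quadratic growth, spelled out: if at each of n timesteps the model reprocesses every token seen so far, the total work over the whole stream is 1 + 2 + ... + n = n(n+1)/2, which grows like n squared.

```python
# Total token-processings over a stream of n timesteps, if every timestep
# reprocesses the whole history so far.
for n in (1_000, 2_000, 3_000):
    print(n, n * (n + 1) // 2)
# 1000 500500
# 2000 2001000   (roughly 4x the first)
# 3000 4501500   (roughly 9x the first)
```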

Crucially, though, this can all be done by "feed-forward" networks that don't have feedback loops.  If you want to be able to process a longer stream of input tokens, you'll need a larger network for the attention stage, and probably more for the later stages as well since there will probably be more output from the attention stage, but you can make both of those bigger by throwing more hardware at them.  

Processing twice as big an input stream requires more hardware, but it doesn't take twice as much "wall time" (time on the clock on the wall), even if it takes four times as much CPU time (total time spent by all the processors).  Being able to handle a long stream of input quickly is what enables networks to incorporate what happened in the whole history of a stream when deciding what to output.


Transformer-based models, which currently give the best results, don't process events in the world the same way we do.  They don't carry any memory from one input token to the next (that is, from timestep to timestep).  Instead, they are handed everything that has happened up to the current time, and figure out what to do based on that.

This produces the same kind of effects as our memories do, including the effect of object permanence.  In our case, if we see a ball roll behind a wall, we remember that there's a ball behind the wall (assuming nothing else happens).  In a transformer-based hide-and-seek model, an agent will likely behave differently for an input stream that includes a ball moving behind a wall than for one that doesn't, so the model acts like it remembers that there's a ball behind the wall.

It looks like humans are doing something the hide-and-seek agents don't do when dealing with a world of objects, namely maintaining a mental map of the world, even though the agents can produce similar results to what we can.  Again, this shouldn't be too surprising.  Chess engines are capable of "positional play" and other behaviors that were once thought to be unique to humans even though they clearly use different mechanisms.  Chatbots can produce descriptions of seeing, smelling and tasting things that they've clearly never seen, smelled or tasted, and so forth.

Are we "safe" (definitely scare quotes) since these agents aren't forming mental images in the same way we appear to?  Wouldn't that mean that they lack the "true understanding" that we have, or some other quality unique to us, and therefore they won't be able to outsmart us?  I would say don't bet on it.  Chess engines may not have the same sense of positional factors as humans, but they still play much stronger chess.

So are we doomed, then?  I wouldn't bet on that either, for reasons I go into in this post and elsewhere.

The one thing that seems clear is that human memory of the world doesn't work the same way as it does for the hide-and-seek agents, or for AIs built on similar principles.  In both cases there appears to be some sort of processing of a stream of sense input into a model of what's where.  The difference seems to be that the memory is happening at a different stage and has a completely different structure.
