Saturday, April 5, 2025

ML-friendly problems and unexpected consequences

This started out as "one more point before I go" in the previous post, but it grew enough while I was getting that one ready to publish that it seemed like it should have its own post.


Where machine learning systems like LLMs do unexpectedly well, like in mimicking our use of language, it might not be because they've developed unanticipated special abilities. Maybe ML being good at generating convincing text says as much about the problem of generating convincing text as it does about the ML doing it.

The current generation of chatbots makes it pretty clear that producing language that's hard to distinguish from what a person would produce isn't actually that hard a problem, if you have a general pattern-matcher (and a lot of training text and computing power). In that case, the hard part, that people have spent decades trying to perfect and staggering amounts of compute power implementing, is the general pattern-matcher itself.

We tend to look at ML systems as problem solvers, and fair enough, but we can also look at current ML technology as a problem classifier. That is, you can sort problems according to whether ML is good at them. From that point of view, producing convincing text, recognizing faces, spotting tumors in radiological images, producing realistic (though still somewhat funny-looking) images and videos, spotting supernovas in astronomical images, predicting how proteins will fold and many other problems are all examples of pattern-matching that a general ML-driven pattern-matcher can solve as well as, or even better than, our own naturally evolved neural networks can.

For want of a better term, I'll call these ML-friendly problems. In the previous post, I argued that understanding the structure of natural languages is a separate problem from understanding what meaning natural language is conveying. Pretty clearly, understanding the structure of natural languages is an ML-friendly problem. If you buy that understanding meaning is a distinct problem, I would argue that we don't know one way or another whether it's ML-friendly, partly, I would further argue, because we don't know nearly as much about what that problem involves.


From about 150 years ago into the early 20th century, logicians made a series of discoveries about what we call reasoning and developed formal systems to describe it. This came out of a school of thought, dating back to Leibniz (and as usual, much farther and wider if you look for it), holding that if we could capture rules describing how reasoning worked, we could use those rules to remove all uncertainty from any kind of thought.

Leibniz envisioned a world where, "when there are disputes among persons, we can simply say: Let us calculate, without further ado, to see who is right". That grand vision failed, of course, both because, as Gödel and others discovered, formal logic has inescapable limitations, but also because formal reasoning captures only a small portion of what our minds actually do and how we reason about the world.

Nonetheless, it succeeded in a different sense. The work of early 20th-century logicians was essential to the development of computing in the mid-20th century. For example, LISP -- for my money one of the two most influential programming languages ever, along with ALGOL -- was based directly on Church's lambda calculus. I run across and/or use Java lambda expressions on a near-daily basis. For another example, Turing's paper on the halting problem used the same proof technique of diagonalization that Gödel borrowed from Cantor to prove incompleteness, and not by accident.


Current ML technology captures another, probably larger, chunk of what naturally-evolved minds do. Just as formal logic broke open a set of problems in mathematics, ML has broken open a set of problems in computing. Just as formal logic didn't solve quite as wide a range of problems as people thought it might, ML might not solve quite the range of problems people today think it might, but just as formal logic also led to significant advances in other ways, so might ML.


Embedding and meaning

In a previous post entitled Experiences, mechanisms, behaviors and LLMs, I discussed a couple of strawman arguments for the idea that an LLM isn't doing anything particularly intelligent: that it's "just manipulating text" and it's "just doing calculations".

The main argument was that "just" is doing an awful lot of work there. Yes, an LLM is "just" calculating and manipulating text, but it's not "just" doing so in the same way as an early system like ELIZA, which just turned one sentence template into another, or even a 90s-era Markov chain, which just generated text based on which words appeared directly after which others, and how often, in a sample text.

In both of those cases, we can point at particular pieces of code or data and say "those are the templates it's using", or "there's the table of probabilities" and explain directly what's going on. Since we can point at the exact calculations going on, and the data driving them, and we understand how those work, it's easy to say that the earlier systems aren't understanding text the way we do.

We can't do that with an LLM, even if an LLM generating text is doing the same general thing as a simple Markov chain. We can say "here's the code that's smashing tensors to produce output text from input text", and we understand the overall strategy, but the data feeding that strategy is far beyond our understanding. Unlike the earlier systems, there's way, way too much of it. It's structured, but that structure is much too complex to fit in a human brain, at least as a matter of conscious thought. Nonetheless, the actual behavior shows some sort of understanding of the text, without having to stretch the meaning of the word "understanding".

In the earlier post, I also said that even if an LLM encodes a lot about how words are used and in which contexts -- which it clearly does -- the LLM doesn't know the referents of those words -- it doesn't know what it means for water to be wet or what it feels like to be thirsty -- and so it doesn't understand text in the same sense we do.

This feels similar to appeals like "but a machine can't have feelings", which I generally find fairly weak, but that wasn't quite the argument I was trying to make. While cleaning up a different old post (I no longer remember which one), I ran across a reference that sharpens the picture by looking more closely at the calculations/manipulations an LLM is actually doing.

I think the first post I mentioned, on experiences and so forth, puts a pretty solid floor under what sort of understanding an LLM has of text, namely that it encodes some sort of understanding of how sentences are structured and how words (and somewhat larger units) associate with each other. Here, I hope to put a ceiling over that understanding by showing more precisely in what way LLMs don't understand the meaning of text in the way that we do.

Taking these together, we can roughly say that LLMs understand the structure of text but not the meaning, but the understanding of structure is deep enough that an LLM can extract information from a large body of text that's meaningful to us.

In much of what follows, I'm making use of an article in Quanta Magazine that discusses how LLMs do embeddings, that is, how they turn a text (or other input) into a list of vectors to feed into the tensor-smashing machine. It matches up well with papers I've read and a course I've taken, and I found it well-written, so I'd recommend it even if you don't read any further here.


Despite the name, a Large Language Model doesn't process language directly. The core of an LLM drives the processing of a list of tokens. A token is a small piece of the actual input -- a word, or part of a word -- and what the model actually works with is each token's vector: an ordered list of numbers of a given length.

To totally make up an example, if vectors are three numbers long, and a maps to (1.2, 3.0, -7.5), list maps to (6.4, -3.2, 1.6), of maps to (27.5, 9.8, 2.0),  and vectors maps to (0.7, 0.3, 6.8), then a list of vectors maps to [(1.2, 3.0, -7.5), (6.4, -3.2, 1.6), (27.5, 9.8, 2.0), (0.7, 0.3, 6.8)].

Here I'm using parentheses for vectors, which in this case always have three numbers, and square brackets for lists, which can have any length (including zero for the empty list, []). In practice, the vectors will have many more than three components. Thousands is typical. The list of vectors encoding a text will be however long the text is.
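To make the made-up example concrete, here's a minimal sketch in Python; the three-component vectors and the little word-to-vector table are invented for illustration, not taken from any real model:

```python
# A toy "embedding": a hand-made table from words to short vectors.
# Real embeddings have thousands of components and are learned, not typed in.
toy_embedding = {
    "a":       (1.2, 3.0, -7.5),
    "list":    (6.4, -3.2, 1.6),
    "of":      (27.5, 9.8, 2.0),
    "vectors": (0.7, 0.3, 6.8),
}

def embed(text):
    """Turn a whitespace-separated text into a list of vectors, one per word."""
    return [toy_embedding[word] for word in text.split()]

print(embed("a list of vectors"))
# [(1.2, 3.0, -7.5), (6.4, -3.2, 1.6), (27.5, 9.8, 2.0), (0.7, 0.3, 6.8)]
```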

The particular mapping from tokens to vectors is called the embedding*. The overall idea is to encode similarities along various dimensions. There are (practically) infinitely many ways to do this mapping. Over time this has evolved from a mostly-manual process, to an automated process using hand-written code, to the current state of the art, which uses machine learning techniques on large bodies of text. The first two approaches are pretty easy to understand.

An ML-produced embedding, on the other hand, is a mass of numbers created during a training phase. This mass of numbers drives a generic algorithm that turns words into large vectors. While the numbers themselves don't really lend themselves to easy analysis, people have noticed interesting patterns in the results of applying embedding.

Because the model-building phase is looking at streams of text, it's not surprising that the embedding itself captures information about what words appear in what contexts in that text. For example, in typical training corpora, dog and cat appear much more often in contexts like my pet ___ than, say, chair does. They are also likely to occur in conjunction with terms like paw and fur, while other words won't, and so forth.

While we don't really understand exactly how the embedding-building stage of training an LLM extracts relations like this, the article in Quanta gives the example that in one particular embedding the vector for king minus the one for man plus the one for woman is approximately equal to the one for queen (you add or subtract vectors component by component, so (1.2, 3.0, -7.5) + (6.4, -3.2, 1.6) = (7.6, -0.2, -5.9) and so on).
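Here's a rough sketch of what that arithmetic looks like in code. The four-component vectors are invented so that the relation works out exactly; in a real embedding the vectors have thousands of components and the relation only holds approximately:

```python
import numpy as np

# Invented vectors, purely for illustration.
vec = {
    "king":  np.array([0.8, 0.9, 0.1, 0.2]),
    "man":   np.array([0.7, 0.1, 0.1, 0.2]),
    "woman": np.array([0.7, 0.1, 0.9, 0.2]),
    "queen": np.array([0.8, 0.9, 0.9, 0.2]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land near queen (component-by-component arithmetic).
target = vec["king"] - vec["man"] + vec["woman"]
closest = max(vec, key=lambda w: cosine(target, vec[w]))
print(closest)  # 'queen'
```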

It's long been known that use in similar contexts correlates with similarity in meaning. But we're talking about implied similarities in meaning here, not actual meanings.  You can know an analogy like cat : fur :: person : hair without knowing anything about what a cat is, or a person, or fur or hair.

That may seem odd from our perspective. A person would solve a problem like cat : fur :: person : ? by thinking about cats and people, and what about a person is similar to fur for a cat, because we're embodied in the world and we have experience of hair, cats, fur and so forth. Odd as it might seem to know that cat : fur :: person : hair without knowing what any of those things is, that's essentially what's going on with an LLM. It understands relations between words, based on how they appear in a mass of training text, but that's all it understands.


But what, exactly, is the difference between understanding how a word relates to other words and understanding what it means? There are schools of thought that claim there is no difference. The meaning of a word is how it relates to other words. If you believe that, then there's a strong argument that an LLM understands words the same way we do, and about as well as we do.

Personally, I don't think that's all there is to it. The words we use to express our reality are not our reality. For one thing, we can also use the same words to express completely different realities. We can use words in new ways, and the meaning of words can and does shift over time. There are experiences in our own reality that defy expression in words.

Words are something we use to convey meaning, but they aren't that meaning. Meaning ultimately comes from actual experiences in the real world. The way words relate to each other clearly captures something about what they actually mean -- quite a bit of it, by the looks of things -- but just as clearly it doesn't capture everything.

I have no trouble saying that the embeddings that current LLMs use encode something significant about how words relate to each other, and that the combination of the embedding and the LLM itself has a human-level understanding of how language works. That's not nothing. It's something that sets current LLMs apart from anything before them, and it's an interesting result. For one thing, it goes a long way toward clarifying what's understanding of the world and what's just understanding of how language works.

If an LLM is good at it, then it's something about how language works. If an LLM isn't good at it, then it's probably something about the world itself. I'll have a bit more to say about that in the next (shorter) post.

Because LLMs know about language, but not what it represents in the real world, we shouldn't be surprised that LLMs hallucinate, and we shouldn't expect them to stop hallucinating just because they're trained on larger and larger corpora of text.


The earlier post distinguished among behavior, mechanism and experience. An LLM is capable of linguistic behavior very similar to a person's.

The mechanism of an LLM may, or may not, be similar to ours as far as language processing goes. We may well learn rules, like the way we use the word the in relation to nouns, in a way that's similar to training an LLM. Whether that's the case or not, an LLM, by design, lacks a mechanism for tying words to anything in the real world. This probably accounts for much of the difference between what we would say and what an LLM would say.

All of this is separate from subjective experience.  One could imagine a robot that builds up a store of interactions with the world, processes them into some more abstract representation and associates words with them. But even if that is more similar to what we do in terms of mechanism, it says nothing about what the robot might or might not be experiencing subjectively, even if it becomes harder to rule out the possibility that the robot is experiencing the world as we do.


* Wikipedia seems to think it's only an embedding if it's done using feature learning, but that seems overly strict. Mathematically, an embedding is just a map from one domain into another that preserves some structure of interest, no matter how the map is produced.

Thursday, March 27, 2025

Losing my marbles over entropy

In a previous post on Entropy, I offered a garbled notion of "statistical symmetry." I'm currently reading Carlo Rovelli's The Order of Time, and chapter two laid out the idea that I was grasping at concisely, clearly and -- because Rovelli is an actual physicist -- correctly.

What follows is a fairly long and rambling discussion of the same toy system as the previous post, of five marbles in a square box with 25 compartments. It does eventually circle back to the idea of symmetry, but it's really more of a brain dump of me trying to make sure I've got the concepts right. If that sounds interesting, feel free to dive in. Otherwise, you may want to skip this one.


In the earlier post, I described a box split into 25 little compartments with marbles in five of the compartments. If you start with, say, all the marbles on one row (originally I said on one diagonal, but that just made things a bit messier) and give the box a good shake, the odds that the marbles all end up in the same row that they started in are low, about one in 50,000 for this small example. So far, so good.

But this is really true for any starting configuration -- if there are twenty-five compartments in a five-by-five grid, numbered from left to right then top to bottom, and the marbles start out in, say, compartments 2, 7,  8, 20 and 24, the odds that they'll still be in those compartments after you shake the box are exactly the same, about one in 50,000.

On the one hand, it seems  like going from five marbles in a row to five marbles in whatever random positions they end up in is making the box more disordered. On the other hand, if you just look at the positions of the individual marbles, you've gone from a set of five numbers from 1 to 25 ... to a set of numbers from 1 to 25, possibly the one you started with. Nothing special has happened.

This is why the technical definition of entropy doesn't mention "disorder". The actual definition of entropy is in terms of microstates and macrostates. A microstate is a particular configuration of the individual components of a system, in this case, the positions of the marbles in the compartments. A macrostate is a collection of microstates that we consider to be equivalent in some sense.

Let's say there are two macrostates: call any microstate with all five marbles in the same row lined-up, and any other microstate scattered. In all there are 53,130 microstates (25 choose 5). Of those, five have all the marbles in a row (one for each row), and the other 53,125 don't. That is, there are five microstates in the lined-up macrostate and 53,125 in the scattered macrostate.

The entropy of a macrostate is related to the number of microstates consistent with that macrostate (for more context, see the earlier post on entropy, which I put a lot more care into). Specifically, it is the logarithm of the number of such states, multiplied by a factor called the Boltzmann constant to make the units come out right and to scale the numbers down, because in real systems the numbers are ridiculously large (though not as large as some of these numbers), and even their logarithms are quite large. Boltzmann's constant is 1.380649×10⁻²³ Joules per Kelvin.

The natural logarithm of 5 is about 1.6 and the natural logarithm of 53,125 is about 10.9. Multiplying by Boltzmann's constant doesn't change their relative size: The scattered macrostate has about 6.8 times the entropy of the lined-up macrostate.
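Here's that arithmetic as a small Python check; the only outside ingredient is Boltzmann's constant, everything else is just counting:

```python
from math import comb, log

k_B = 1.380649e-23            # Boltzmann's constant, in Joules per Kelvin

total     = comb(25, 5)       # 53,130 ways to place 5 marbles in 25 compartments
lined_up  = 5                 # one lined-up microstate per row
scattered = total - lined_up  # 53,125 scattered microstates

print(round(log(lined_up), 1), round(log(scattered), 1))         # 1.6 10.9
print(round((k_B * log(scattered)) / (k_B * log(lined_up)), 1))  # 6.8
```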

If you start with the marbles in the low-entropy lined-up macrostate and give the box a good shake, 10,625 times out of 10,626 you'll end up in the higher-entropy scattered macrostate. Five marbles in 25 compartments is a tiny system, considering that there are somewhere around 33,000,000,000,000,000,000,000 (about 3×10²²) molecules in a milliliter of water. In any real system, except cases like very low-temperature systems with handfuls of particles, the differences in entropy are large enough that "10,625 times out of 10,626" turns into "always" for all intents and purposes.


This distinction between microstates and macrostates gives a rigorous basis for the intuition that going from lined-up marbles to scattered-wherever marbles is a significant change, while going from one particular scattered state to another isn't.

In both cases, the marbles are going from one microstate to another, possibly but very rarely the one they started in. In the first case, the marbles go from one macrostate to another. In the second, they don't. Macrostate changes are, by definition, the ones we consider significant, in this case, between lined-up and scattered. Because of how we've defined the macrostates, the first change is significant and the second isn't.


Let's slice this a bit more finely and consider a scenario where only part of a system can change at any given time. Suppose you don't shake up the box entirely. Instead, you take out one marble and put it back in a random position, including, possibly, the one it came from. In that case, the chance of going from lined-up to scattered is 20 in 21, since out of the 21 positions the marble can end up in, only one, its original position, has the marbles all lined up, and in any case it doesn't matter which marble you choose.

What about the other way around? Of the 53,125 microstates in the scattered macrostate, only 500 have four of the five marbles in one row. For any microstate, there are 105 different ways to take one marble out and replace it: Five marbles times 21 empty places to put it, including the place it came from.

For the 500 microstates with four marbles in a row, only one of those 105 possibilities will result in all five marbles in a row: Remove the lone marble that's not in the row and put it in the only empty place in the row of four. For the other 52,625 microstates in the scattered macrostate, there's no way at all to end up with five marbles lined up by moving only one marble.

So there are 500 cases where the scattered macrostate becomes lined-up, 500*104 cases where it might but doesn't, and 52,625*105 cases where it couldn't possibly. In all, that means that the odds are about 11,155 to one against scattered becoming lined-up by removing and replacing one marble randomly.
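As a sanity check on those counts, here's a short calculation under the same assumptions (every microstate of the scattered macrostate equally likely, and every marble-and-empty-slot choice equally likely):

```python
from math import comb

total       = comb(25, 5)          # 53,130 microstates in all
scattered   = total - 5            # 53,125 in the scattered macrostate
four_in_row = 5 * comb(5, 4) * 20  # 500: pick the row, pick 4 of its 5 slots,
                                   # put the fifth marble in one of the 20 cells
                                   # outside that row

moves_per_microstate = 5 * 21      # 105: pick a marble, pick an empty slot
                                   # (including the one it came from)

ways_total     = scattered * moves_per_microstate
ways_to_lineup = four_in_row * 1   # only one repair move works per such microstate

print(ways_to_lineup, ways_total)                 # 500 5578125
print(round(ways_total / ways_to_lineup - 1, 2))  # 11155.25, i.e. about 11,155 to one against
```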

Suppose that the marbles are lined up at some starting time, and every time the clock ticks, one marble gets removed and replaced randomly. After one clock tick, there is a 100 in 105 (that is, 20 in 21) chance that the marbles will be in the high-entropy scattered state. How about after two ticks? How about if we let the clock run indefinitely -- what portion of the time will the system spend in the lined-up macrostate?

There are tools to answer questions like this, particularly Markov chains and stochastic matrices (that's the same Markov chain that can generate random text that resembles an input text). I'll spare you the details, but the answer requires defining a few more macrostates, one for each way to represent the number five as the sum of whole numbers: [5], [4, 1], [3, 2], [3, 1, 1], [2, 2, 1], [2, 1, 1, 1] and [1, 1, 1, 1, 1].

The macrostate [5] comprises all microstates with five marbles in one row, the macrostate [4, 1] comprises all microstates with four marbles in one row and one in another row, the macrostate [2, 2, 1] comprises all microstates with two marbles in one row, two marbles in another row and one marble in a third one, and so forth.

Here's a summary:

Macrostate      Microstates    Entropy
[5]                       5        1.6
[4,1]                   500        6.2
[3,2]                 2,000        7.6
[3,1,1]               7,500        8.9
[2,2,1]              15,000        9.6
[2,1,1,1]            25,000       10.1
[1,1,1,1,1]           3,125        8.0

The Entropy column is the natural logarithm of the Microstates column, without multiplying by Boltzmann's constant. Again, this is just to give a basis for comparison. For example, [2,1,1,1] is the highest-entropy macrostate, and [2,2,1] has six times the entropy of [5].
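If you'd rather not trust my counting, the table can be reproduced by brute force: enumerate all 53,130 microstates and classify each one by how its five marbles are spread across the rows. A sketch:

```python
from itertools import combinations
from collections import Counter
from math import log

counts = Counter()
for cells in combinations(range(25), 5):          # all 53,130 microstates
    rows = Counter(cell // 5 for cell in cells)   # marbles per occupied row
    partition = tuple(sorted(rows.values(), reverse=True))  # e.g. (3, 2)
    counts[partition] += 1

for partition, n in sorted(counts.items(), key=lambda item: item[1]):
    print(list(partition), n, round(log(n), 1))
# [5] 5 1.6
# [4, 1] 500 6.2
# [3, 2] 2000 7.6
# [1, 1, 1, 1, 1] 3125 8.0
# [3, 1, 1] 7500 8.9
# [2, 2, 1] 15000 9.6
# [2, 1, 1, 1] 25000 10.1
```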

It's straightforward, but tedious, to count the number of ways one macrostate can transition to another. For example, of the 105 transitions for [3,2], 4 end up in [4,1], 26 end up back in [3,2] (not always by putting the removed marble back where it was), 30 end up in [3, 1, 1] and 45 end up in [2, 2, 1]. Putting all this into a matrix and taking the matrix to the 10th power (enough to see where this is converging) gives

Macrostate      % time    % microstates
[5]             0.0094       0.0094
[4,1]           0.94         0.94
[3,2]           3.8          3.8
[3,1,1]        14           14
[2,2,1]        28           28
[2,1,1,1]      47           47
[1,1,1,1,1]     5.9          5.9

The second column is the result of the tedious matrix calculations. The third column is just the size of the macrostate as the portion of the total number of microstates. For example, there are 500 microstates in [4,1], which is 0.94% of the total, which is also the portion of the time that the matrix calculation says the system will spend in [4, 1]. Technically, this means the system is ergodic, which means I didn't have to bother with the matrix and counting all the different transitions.
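For anyone who wants to check the matrix calculation without doing it by hand, here's a sketch that builds the 7-by-7 macrostate transition matrix from one representative microstate per macrostate (by symmetry, every microstate of a macrostate has the same transition probabilities to each macrostate) and raises it to a high power:

```python
import numpy as np
from collections import Counter

def macro(cells):
    """Row-occupancy partition of a microstate, e.g. (3, 2)."""
    return tuple(sorted(Counter(c // 5 for c in cells).values(), reverse=True))

# One representative microstate per macrostate (cells numbered 0-24, row = cell // 5).
reps = {
    (5,):            (0, 1, 2, 3, 4),
    (4, 1):          (0, 1, 2, 3, 5),
    (3, 2):          (0, 1, 2, 5, 6),
    (3, 1, 1):       (0, 1, 2, 5, 10),
    (2, 2, 1):       (0, 1, 5, 6, 10),
    (2, 1, 1, 1):    (0, 1, 5, 10, 15),
    (1, 1, 1, 1, 1): (0, 5, 10, 15, 20),
}
order = list(reps)

# Transition matrix for "remove one marble and put it back at random".
P = np.zeros((7, 7))
for i, cells in enumerate(reps.values()):
    for marble in cells:
        rest = [c for c in cells if c != marble]
        for slot in range(25):
            if slot not in rest:                    # 21 empty slots, incl. the old one
                j = order.index(macro(rest + [slot]))
                P[i, j] += 1 / 105                  # 5 marbles * 21 slots = 105 moves

stationary = np.linalg.matrix_power(P, 1000)[0]     # any row, once it has converged
for m, p in zip(order, stationary):
    print(list(m), f"{100 * p:.2g}%")
# [5] 0.0094%  [4, 1] 0.94%  [3, 2] 3.8%  [3, 1, 1] 14%
# [2, 2, 1] 28%  [2, 1, 1, 1] 47%  [1, 1, 1, 1, 1] 5.9%
```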

Even in this toy example, the system will spend very little of its time in the low-entropy lined-up state [5], and if it ever does end up there, it won't stay there for long.


Given some basic assumptions, a system that evolves over time, transitioning from microstate to microstate, will spend the same amount of time in any given microstate (as usual, that's not quite right technically), which means that the time spent in each macrostate is proportional to its size. Higher-entropy states are larger than lower-entropy states, and because entropy is a logarithm, they're actually a lot larger.

For example, the odds of an entropy decrease of one millionth of a Joule per Kelvin are about one in e^(10¹⁷). That's a number with somewhere around 40 quadrillion digits. To a mathematician, the odds still aren't zero, but to anyone else they would be.
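That digit count comes straight from the definitions: an entropy drop of ΔS has probability on the order of e^(-ΔS/k), and the number of decimal digits of e^x is about x/ln 10. A quick back-of-the-envelope check:

```python
from math import log

k_B     = 1.380649e-23     # Boltzmann's constant, Joules per Kelvin
delta_S = 1e-6             # a millionth of a Joule per Kelvin

exponent = delta_S / k_B   # the odds are about one in e**exponent
print(f"{exponent:.1e}")             # 7.2e+16, i.e. roughly 10^17
print(f"{exponent / log(10):.1e}")   # 3.1e+16 decimal digits; rounding the exponent
                                     # up to 10^17 gives the ~40 quadrillion above
```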

For all but the tiniest, coldest systems, the chance of entropy decreasing even by a measurable amount is not just small, but incomprehensibly small. The only systems where the number of microstates isn't incomprehensibly huge are small collections of particles near absolute zero.

I'm pretty sure I've read about experiments where such a system can go from a higher-entropy state to a very slightly lower-entropy state and vice versa, though I haven't had any luck tracking them down. Even if no one's ever done it, such a system wouldn't violate any laws of thermodynamics, because the laws of thermodynamics are statistical (and there's also the question of definition over whether such a system is in equilibrium).

So you're saying ... there's a chance? Yes, but actually no, in any but the tiniest, coldest systems. Any decrease in entropy that could actually occur in the real world and persist long enough to be measured would be in the vicinity of 10⁻²³ Joules per Kelvin, which is much, much too small to be measured except under very special circumstances.

For example, if you have 1.43 grams of pure oxygen in a one-liter container at standard temperature and pressure, it's very unlikely that you know any of the variables involved -- the mass of the oxygen, its purity, the size of the container, the temperature or the pressure, to even one part in a billion. Detecting changes 100,000,000,000,000 times smaller than that is not going to happen.



But none of that is what got me started on this post. What got me started was that the earlier post tried to define some sort of notion of "statistical symmetry", which isn't really a thing, and what got me started on that was my coming to understand that higher-entropy states are more symmetrical. That in turn was jarring because entropy is usually taken as a synonym for disorder, and symmetry is usually taken as a synonym for order.

Part of the resolution of that paradox is that entropy is a measure of uncertainty, not disorder. The earlier post got that right, but evidently that hasn't stopped me from hammering on the point for dozens more paragraphs and a couple of tables in this one, using a slightly different marbles-in-compartments example.

The other part is that more symmetry doesn't really mean more order, at least not in the way that we usually think about it.

From a mathematical point of view, a symmetry of an object is something you can do to it that doesn't change some aspect of the object that you're interested in. For example, if something has mirror symmetry, that means that it looks the same in the mirror as it does ordinarily.

It matters where you put the mirror. The letter W looks the same if you put a mirror vertically down the middle of it -- it has one axis of symmetry. The letter X looks the same if you put the mirror vertically in the middle, but it also looks the same if you put it horizontally in the middle -- it has two axes of symmetry.

Another way to say this is that if you could draw a vertical line through the middle of the W and rotate the W out of the page around that line, and kept going for 180 degrees until the W was back in the page, but flipped over, it would still look the same. If you chose some other line, it would look different (even if you picked a different vertical line, it would end up in a different place). That is, if you do something to the W -- rotate it around the vertical line through the middle -- it ends up looking the same. The aspect you care about here is how the W looks.

To put it somewhat more rigorously: if f is the particular mapping that takes each point to its mirror image across the axis, then f takes the set of points in the W to the exact same set of points. Any point on the axis maps to itself, and any point off the axis maps to its mirror image, which is also part of the W. The map f is defined for every point on the plane and it moves all of them except for the axis. The aspect we care about, which f doesn't change, is whether a particular point is in the W.

If you look at all the things you can do to an object without changing the aspect you care about, you have a mathematical group. For a W, there are two things you can do: leave it alone and flip it over. For an X, you have four options: leave it alone, flip it around the vertical axis, flip it around the horizontal axis, or do both. Leaving an object alone is called the identity transformation, and it's always considered a symmetry, because math. An asymmetrical object has only that symmetry (its symmetry group is trivial).
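Here's a small sketch of that idea in code: represent a figure as a set of points, list the candidate transformations, and keep the ones that map the set onto itself. The five-point "W" and "X" below are crude stand-ins invented just for the check:

```python
# Crude point-set stand-ins for the letters, centered on the origin.
W = {(-2, 1), (-1, -1), (0, 1), (1, -1), (2, 1)}   # zig-zag: mirror-symmetric left-right only
X = {(-1, 1), (1, 1), (0, 0), (-1, -1), (1, -1)}   # four corners plus center

candidates = {
    "identity":        lambda p: p,
    "flip vertical":   lambda p: (-p[0], p[1]),    # mirror across the vertical axis
    "flip horizontal": lambda p: (p[0], -p[1]),    # mirror across the horizontal axis
    "rotate 180":      lambda p: (-p[0], -p[1]),   # both flips at once
}

def symmetries(shape):
    """The candidate transformations that leave the point set unchanged."""
    return [name for name, f in candidates.items()
            if {f(p) for p in shape} == shape]

print(symmetries(W))  # ['identity', 'flip vertical']
print(symmetries(X))  # all four
```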

In normal speech, saying something is symmetrical usually means it has the same symmetry group as a W -- half of it is a mirror image of the other half. Technically, it has bilateral symmetry. In some sense, though, an X is more symmetrical, since its symmetry group is larger, and a hexagon, which has 12 elements in its symmetry group, is more symmetrical yet.

A figure with 19 sides, each of which is the same lopsided squiggle, would have a symmetry group with 19 elements (rotate by 1/19 of a full circle, 2/19 ... 18/19, and also don't rotate at all). That would make it more symmetrical than a hexagon, and quite a bit more symmetrical than a W, but if you asked people which was most symmetrical, they would probably put the 19-sided squigglegon last of the three.

Our visual system is mostly trained to recognize bilateral symmetry. Except for special situations like reflections in a pond, pretty much everything in nature with bilateral symmetry is an animal, which is pretty useful information when it comes to eating and not being eaten. We also recognize rotational symmetry, which includes flowers and some sea creatures, also useful information.

It would make sense, then, that in day to day life, "more symmetrical" generally means "closer to bilateral symmetry". If a house has an equal number of windows at the same level on either side of the front door, we think of it as symmetrical,  even though the windows may not be exactly the same, the door itself probably has a doorknob on one side or the other and so forth, so it's not quite exactly symmetrical. We'd still say it's pretty symmetrical, even though from a mathematical point of view it either has bilateral symmetry or it doesn't (and in the real world, nothing we can see is perfectly symmetrical).

That should go some way toward explaining why, along with so many other things, symmetry doesn't necessarily mean the same thing in its mathematical sense as it does ordinarily. The mathematical definition includes things that we don't necessarily think of as symmetry.

Continuing with shapes and their symmetries, you can think of each shape as a macrostate. You can  associate a microstate with each mapping (technically, in this case, any rigid transformation of the plane) that leaves the shape unchanged. The macrostate W has two microstates: one for the identity transformation, which leaves the plane unchanged, and one for the mirror transformation around the W's axis.

The X macrostate has four microstates, one for the identity, one for the flip around the vertical axis, one for the flip around the horizontal axis, and one for flipping around one axis and then the other (in this case, it doesn't matter what order you do it in). The X macrostate has a larger symmetry group, which is the same as saying it has more entropy.

In this context, a symmetry is something you can do to the microstate without changing the macrostate. A larger symmetry group -- more symmetry -- means more microstates for the same macrostate, which means more entropy, and vice-versa. They're two ways of looking at the same thing.

In the case of the marbles in a box, a symmetry is any way of switching the positions of the marbles, including not switching them around at all. Technically, this is a permutation group.

For any given microstate,  some of the possible permutations just switch the marbles around in their places (for example, switching the first two marbles in a lined-up row), and some of them will move marbles to different compartments. For a microstate of the lined-up macrostate [5], there are many fewer permutations that leave the marbles in the same macrostate (all in one row, though not necessarily the same row) than there are for [2, 1, 1, 1]. Even though five marbles in a row looks more symmetrical, since it happens to have bilateral visual symmetry, it's actually a much less symmetrical macrostate than [2, 1, 1, 1], even though most of its microstates will just look like a jumble.


In the real world, distributing marbles in boxes is really distributing energy among particles, generally a very large number of them. Real particles can be in many different states, many more than the marble/no marble states in the toy example, and different states can have the same energy, which makes the math a bit more complicated. Switching marbles around is really exchanging energy among particles, and there are all sorts of intricacies about how that happens.

Nonetheless, the same basic principles hold: Entropy is a measure of the number of microstates for a given macrostate, and a system will evolve toward the highest-entropy macrostate available -- that is, toward equilibrium -- and stay there, simply because the probability of anything else happening is essentially zero.

And yeah, symmetry doesn't necessarily mean what you think it might.