Sunday, August 21, 2011

And then it became self-aware

Something in our mind likes magic thresholds -- crisp, clear dividing lines, to one side of which is X and to the other side not-X.  The world has other notions.  Accepting this takes continual effort.

When I was first learning how logic gates worked, my mathematical mind was enchanted by the clean symbolism of boolean logic, its Ands, Ors and Nots dancing their beautifully symmetrical algebraic ballet, its truth tables laying out precisely how the various operators combined True and False.
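Those truth tables are easy enough to reproduce.  Here's a quick sketch in Python that prints them out (the layout is mine, not anything standard):

```python
# Print the truth tables for And, Or and Not.
AND = lambda a, b: a and b
OR = lambda a, b: a or b
NOT = lambda a: not a

print("a     b     | And   Or")
for a in (False, True):
    for b in (False, True):
        print(f"{a!s:5} {b!s:5} | {AND(a, b)!s:5} {OR(a, b)!s:5}")

print("a     | Not")
for a in (False, True):
    print(f"{a!s:5} | {NOT(a)!s:5}")
```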

I would spend hours poring over component catalogs, drawing circuit diagrams full of gates and lines and little circles representing Not.  I had some notion of how those gates broke down into individual transistors, a transistor being an idealized beast that modulated perfect high and low voltages with other perfect high and low voltages.

And then I started looking at the technical specs more closely.  With growing discomfort I came to realize that there simply is no perfect step function from low to high.  The transition in the middle might be more or less exponential, but it is not perfectly vertical.  As I struggled to understand flip-flops and latches, I puzzled over metastable states and propagation delays.  Those weren't on the pretty circuit diagrams, were they?
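A toy model makes the point.  The logistic curve below is just an illustrative stand-in for a real inverter's transfer curve -- the supply voltage, midpoint and gain here are made-up numbers, not from any datasheet -- but it shows the essential shape: smooth and S-shaped, never a perfect step.

```python
import math

def inverter_out(v_in, v_dd=5.0, v_mid=2.5, gain=4.0):
    """Toy logistic model of an inverter: the output swings from high
    to low as the input rises, but the transition is smooth, not a
    perfectly vertical step.  Parameters are illustrative only."""
    return v_dd / (1.0 + math.exp(gain * (v_in - v_mid)))

for v in (0.0, 1.0, 2.0, 2.5, 3.0, 4.0, 5.0):
    print(f"in = {v:.1f} V  ->  out = {inverter_out(v):.3f} V")
```

However fast you make the transition, the output near the midpoint is neither a clean high nor a clean low -- which is exactly where metastability lives.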

There came a time when my eye could no longer filter out the symbols for resistors and capacitors sprinkled among the transistors -- and then there were those stowaway analog components like operational amplifiers skulking around, daring to use the same transistors as the digital circuits.  What had happened to my digital world?  When you got down to it, it was all analog at heart.

I could cite several other cases of learning that simple on/off distinctions generally don't hold up to close scrutiny, but one more will suffice.  From time to time, sometimes in classrooms but usually not, I would try to learn to draw, something I'm still not at all good at.  Along the way, studying shading, I learned the old saw that there are no lines in nature.  Where one might draw a line in a sketch or cartoon, there was actually a sharp, but not perfectly sharp, change in shading.  It was the eye that inferred a line, the same eye that could therefore accept a line drawing as realistic even when, objectively, it was anything but.


Understanding of intelligence, whether natural or artificial, can suffer from the same tendency to create lines where none exist.  It's tempting to try to come up with a clear, crisp definition of intelligence, but intelligence is not a binary attribute.  There are many different ways to be intelligent, some of which can manifest to significantly varying degrees.  Cognitive science has identified scores of intelligent behaviors, from counting to recognizing faces to remembering a path and far beyond.

Most notions of intelligence require the ability to learn, but what's learning?  The best answer I know is that there are many kinds of learning, just as there are many aspects to intelligence -- and there is quite likely no simple relationship between the two.

Which brings me to the title.  A recurring motif in science fiction and its cousins is the notion of a machine becoming self-aware and therefore, by a commonly-accepted notion of intelligence, intelligent. This magical moment brings us spine-tinglingly near the very engines of creation, to say nothing of providing an infinitely more formidable opponent for Our Hero.  That's fine for plot purposes, but just as there are many kinds of learning and intelligence, there must be many sorts of awareness, self- or otherwise.

For example, many things with eyes react to other things with eyes watching them, in some cases even playing it to their advantage.  Without trying to put together a nice crisp definition of awareness -- after all my whole point here is that such definitions never stand up to a good round of "But what about ..?" -- I will posit that a bird watching you watch it is in some sense aware of you.

Statements like that can cause a certain discomfort among human readers because we all agree, quite possibly correctly, that a bird is not aware of the world in the same way we are.  If awareness is a binary attribute then, perforce, birds must not have it, because we do have awareness and birds don't have the same awareness we have.  QED.  Unfortunately, as airtight as that logic may be, it doesn't really tell us much.  We already knew birds weren't humans.

If, however, we allow that there may be many kinds of awareness, we can make fairly concrete assertions, in fact more detailed and meaningfully testable assertions, without getting backed into logical corners.  For example, if we assert that there is such a thing as watching -- actively behaving so as to keep something in sight, say -- and there is also awareness of being watched -- leaving aside what exactly that might comprise -- we can assert that both we and birds have those capabilities without saying that we apprehend the world the same way birds do.

There are many sorts of awareness that we share with birds and many other kinds of animal.  For example, many animals can recognize individuals, reacting differently depending on whether the other party is a stranger or familiar.  Both we and birds can be aware of where things are hidden, and in fact some species of bird appear to be much better at that than we are.  Both we and they can find our way from point A to point B and back and remember new routes that we find.

This is leaving aside a host of simple capabilities that seem too trivial to note until one realizes that not every living thing has them:  For example, knowing that some things are safe to eat and some aren't, that some animals are liable to attack you and some aren't, that there are objects in the world and we can manipulate them, that things dropped tend to fall, and so forth.

So how do we differ from birds in awareness?  For one thing, birds probably have some sorts of awareness that we lack.  Migratory birds appear to be aware of the strength and orientation of the Earth's magnetic field, and flying birds in general must surely have a richer awareness of three-dimensional space than we do.

Likewise, of course, we must surely be aware of things that birds aren't aware of, but once we get done congratulating ourselves on being such vastly more sophisticated creatures, what would those things be?

A bird may be aware of the local magnetic field, but I'll boldly assert here that it isn't aware that said field is caused by electric currents in the Earth's outer core.  Fine, but just what is it here that we have that they lack that allows us to be aware of such things?  If you want to say "abstract concepts", bear in mind that at least some birds can count and appear to distinguish "same" from "different".  Also bear in mind that not every human is aware of such things (I had to look up the part about it being the outer core), so we're probably grasping at some sort of abstract awareness of cause and effect.  I'm not denying that there's something there, but we do have to be careful trying to define what it is.  Just saying "it's abstract" doesn't really help. [I revisit this theme later in this post -- D.H.]


Here's a stab at something more like what our hypothetical AI villain would have to be able to grasp in order to become the dangerously-aware creature we'll pay ten bucks [Maybe streaming? -- D.H. Aug 2024] to see:
Last week, John met Martha at a party on a boat on Lake Michigan.  It turned out that they had grown up within a mile of each other, but never known it.
From that short paragraph, you now know not only where John and Martha met, and when, and that they grew up close together without knowing of each other, but also that I know that John and Martha know that fact, though they hadn't known it until last week, and I know that you now know that, and ... well, you get the drift.  This is the sort of awareness that seems, if not completely unique to humans, rare in the animal world.  It's the sort of awareness that can make one a cunning adversary.  If you don't know that I know you're sneaking up on me, I may well have a crucial tactical advantage.

But is it self-awareness?  There is a famous experiment in which an animal is given access to a mirror.  All animals tend to react to the animal in the mirror as a different animal initially -- this includes humans who haven't seen a mirror before (assuming they react at all).  Some animals, however, will eventually start to behave differently, for example by poking at a spot painted on their forehead or positioning the mirror or themselves in order to see places they can't ordinarily see.

Animals that can do so include humans, bonobos, chimpanzees and orangutans, but also bottlenose dolphins, orcas and Eurasian magpies.  On the other hand most animals, including ones much more closely related to these animals than they are to each other, don't seem to be able to make the same leap.  Nor, for that matter, can humans less than about eighteen months old.

We may as well call mirror-test awareness self-awareness, but clearly passing the mirror test doesn't necessarily mean being able to make the kind of I-know-you-know inference described above.  It's also at least logically possible to reason sophisticatedly about who knows what without being able to pass the mirror test.  In short, just as there are many kinds of awareness, there are most likely many kinds of self-awareness.

What we're really looking for here goes by the name "Theory of Mind", which is a good topic for another post ...

Friday, June 10, 2011

The non-metric mind

I grew up in the US using English units.  I know my height in feet and inches, my weight in pounds, the distance to various places in miles, the area of my house in square feet, the area of my grandparents' property in acres, the capacity of my car's gas tank in gallons, the temperature in degrees Fahrenheit and so forth.

I've visited, and even lived in, places where it would be centimeters, kilos, kilometers, square meters, hectares, liters and Celsius, but never really got to the point where it felt natural to use metric units.  If I hear it's 86 degrees out, I know it's warm.  If I hear it's 30 on a summer day, I have to remind myself it's not below freezing and then think "30 ... that's warm, right? ... that's what, 80? 90?"
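The conversion I'm fumbling through in my head is simple enough on paper; it's doing it reflexively that's hard.  For the record, the two one-liners:

```python
def f_to_c(f):
    """Fahrenheit to Celsius."""
    return (f - 32) * 5 / 9

def c_to_f(c):
    """Celsius to Fahrenheit."""
    return c * 9 / 5 + 32

print(f_to_c(86))   # 30.0 -- a warm summer day either way
print(c_to_f(30))   # 86.0
```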

Non-metric units are still in use here and there outside the US, to be sure.  Even the English still use some English units, posting speed limits in miles per hour, quoting weights in stone (14 pounds) and quaffing beer by the pint (officially 568ml). All the same, the US is widely recognized as the world's least metricated nation.

Officially, this isn't supposed to have happened.  The US signed the Treaty of the Meter in 1875 and re-defined traditional measures such as the ounce and gallon in terms of metric units in 1893.  Then, for about a hundred years, the metric system was known to exist but largely ignored.

In 1975 Congress passed the Metric Conversion Act, thereby establishing the US Metric Board.  The board was abolished as part of a round of spending cuts in 1982, so we tried again in 1988 with the Omnibus Trade and Competitiveness Act, which among other things required the federal government to go metric by 1992.  For all that the federal government is supposed to intrude into every aspect of Americans' lives and dictate the smallest details of behavior, I can't say I have any idea as to whether it actually did.

The benefits of the metric system are well known, or at least widely touted.  Instead of a hodgepodge of arcane conversions from, say, teaspoons to tablespoons to ounces to cups to pints to quarts to gallons, you have (to continue the example) just liters, optionally with one of a standard set of prefixes should the numbers accumulate too many zeroes (in the case of cooking units, milliliters are fairly common).
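To see just how arcane the chain is, here's the whole ladder from teaspoons to gallons spelled out (US units; the British versions differ):

```python
# US cooking volumes, each defined in terms of the next one up.
TSP_PER_TBSP = 3
TBSP_PER_FLOZ = 2
FLOZ_PER_CUP = 8
CUPS_PER_PINT = 2
PINTS_PER_QUART = 2
QUARTS_PER_GALLON = 4

tsp_per_gallon = (TSP_PER_TBSP * TBSP_PER_FLOZ * FLOZ_PER_CUP *
                  CUPS_PER_PINT * PINTS_PER_QUART * QUARTS_PER_GALLON)
print(tsp_per_gallon, "teaspoons per gallon")   # 768

# The metric equivalent is a single prefix: 1000 milliliters per liter.
```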

Moreover, even if the metric system were based on multiples of random prime numbers rather than uniformly using base ten, it is the system that most of the world uses, giving a strong incentive for anyone interested in trading with the rest of the world to use it.  So why do we persist in going our own way? I don't know, but I can conjecture, can't I?

Two conjectures come to mind: The first is that standardization is by far the more pressing reason to use metric units, and that the US does just that when it matters.  US chemists and physicists do not insist on using Fahrenheit degrees or measuring liquids in gills and minims.  They use the same units as everyone else.  A US mechanic fixing a car and faced with a 13mm bolt head reaches for a 13mm wrench.  For all that the US was supposed to have been faced with a crisis in competitiveness unless the metric system was made mandatory, market forces seem to have sorted this one out.

As far as I can tell the remaining differences in units matter mostly as an annoyance to travelers, and for better or worse, Americans in the aggregate don't spend much time traveling to foreign countries.  Even for those who do, metric units are just one more item on a long list of things to get used to: different languages, cultural customs, food, currency, traffic signs, line voltage and frequency, electrical socket designs, light switches, etc., etc..

It's also worth asking whether ease of conversion is all that important.  Advocating a single system of units, metric or otherwise, assumes that a single system is appreciably more convenient than having multiple systems.  This is a hypothesis to be verified, not an axiom.  In practice, people seem to tolerate quirks in measurement systems remarkably well.  The traditional profusion of units of measurement arose naturally, after all, which leads me to my second conjecture:  The hodgepodge of different units is in fact a reflection of how we think about measurement, even in ostensibly metricated environments.

For example, if I'm buying soda in the US, I can buy a two-liter bottle without caring that the gas in my car is measured in gallons.  Drinking soda and using gasoline are completely different experiences.  I'm not going to drink the gasoline or pour soda in my gas tank.  Two liters of soda is a lot of soda to drink. 17 gallons (about 64l) is way more soda than I even want to think about drinking.  Two liters of gas will just about get me to work in the morning.

From a practical point of view, soda could be sold by the ngogn and gasoline by the firkin so long as the numbers didn't get too out of hand (2 liters is about 172 ngogn; 17 gallons is about 1.6 firkins).  In fact, there are two prevalent units of soda in the US: 2-liter bottles and 12-ounce cans or bottles, generally packaged in multiples of six.  As it happens, a six-pack of 12-ounce cans is about two liters, but that's not exact and it doesn't matter much whether it is.
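The numbers in that paragraph are easy enough to check -- the conversion factors below are the standard US fluid ounce and US gallon:

```python
ML_PER_US_FLOZ = 29.5735   # one US fluid ounce, in milliliters
L_PER_US_GALLON = 3.78541  # one US gallon, in liters

six_pack_ml = 6 * 12 * ML_PER_US_FLOZ
print(round(six_pack_ml), "ml")   # 2129 -- close to, but not exactly, 2 liters

tank_l = 17 * L_PER_US_GALLON
print(round(tank_l), "liters")    # 64
```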

We learn to associate measures with the physical world on a case by case basis.  You learn how far a mile or kilometer is by traveling.  You learn how much a pound or kilogram is by handling things by the pound or kilo*.  Cognitively, there's not a lot of overlap.  I really don't need to know that a gallon of milk weighs about 8.3 pounds.  It weighs as much as a gallon of milk.  If I'm in the dairy business, I care how many gallons of milk I can load on my truck, but that's just another piece of specialized knowledge.

In general, there is either a natural or conventional unit for many things we deal with, and, because different things have different properties, that unit will vary.  It would be of little use to require, say, perfume to be sold only in liter sizes or copper wire to be packaged in meter lengths.  Instead, perfumers have developed standard-sized bottles and wire comes in standard spools.  Whether these happen to measure a round number of ounces or liters or yards or meters is not particularly important.


Not only is it not a problem to use different units for different things, ad hoc units seem ubiquitous, once you look at the actual unit and not the number on the package.  Even in counting, where the unit is essentially the thing being counted, distinctions can be seen.  Many languages use different words for counting different sorts of things.  For example, in Japanese you would use bu in counting, say, copies of a newspaper but dai in counting, say, cars or bicycles.  English has signs of this as well.  Driving through the midwestern US, you might say you see five cows out in a field, but the owner of that field will almost certainly call them five head of cattle.




The metric system was designed for scientific use, and it's there that it really comes into its own.  In the physical sciences, there actually can be a need to deal with different orders of magnitude, for example, so it's very good to be able to shift decimal points instead of trying to figure out how many feet are in 1000 inches (83' 4").  In everyday life, shifting decimal points is not so important.  If you're doubling a recipe, it's actually not so bad to be using 1/2 teaspoons and 1/4 cups.
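Both of those little calculations are mechanical enough; the point is only that base ten lets you skip them entirely:

```python
from fractions import Fraction

# 1000 inches, expressed in feet and inches.
feet, inches = divmod(1000, 12)
print(f"{feet}' {inches}\"")             # 83' 4"

# Doubling a recipe in traditional units isn't so bad either.
print(2 * Fraction(1, 2), "teaspoon(s)")   # 1
print(2 * Fraction(1, 4), "cup(s)")        # 1/2
```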

Even in the sciences, though, idiosyncratic units find their way in:
  • How far away is Alpha Centauri?  To an astronomer, about 1.3 parsecs, not about 40Pm (petameters, not afternoon).  The parsec itself was originally defined in terms of the Astronomical Unit (about 150Gm) and the arc second (1/3600 of a degree, or about 4.8 microradians).
  • If you want to make sodium chloride (salt), you'll need about 23g of sodium and 35g of chlorine to make 58g of salt (a fume hood and other equipment will probably be a good idea).  Sodium atoms are less massive than chlorine atoms.  If you used the same mass of each you'd have sodium left over.  Chemists use a gram mole to represent the mass of Avogadro's number (about 600,000,000,000,000,000,000,000) of a given atom or molecule to account for this.  One gram mole of sodium plus one gram mole of chlorine makes one gram mole of salt.
  • In theoretical particle physics, Planck units (or "God's units") set five fundamental physical constants to 1, which simplifies a number of equations.  For example, E = m when c is 1.  I'm not sure how often they're used, but they do turn up from time to time.
  • I previously remarked on compugeeks measuring data in units of K/kilo- (1024), M/mega- (1048576) and so forth.  That doesn't mean that a compugeek will think a kilowatt is 1024 watts (well, maybe, depending on how hardcore the geek).  It's only data that is measured this way.  The convention that K,M,G,T etc. refer to powers of two flows directly from addresses being represented in binary [Some folks prefer to say things like "Mebi" and use abbreviations like MiB in order to explicitly call out the distinction between a million and 1,048,576.  This can be a very good idea in some particular situations, but in everyday speech even geeks tend to ignore the difference and just say K, M, G, T etc.  --D.H.].
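The drift between the decimal and power-of-two conventions grows with each prefix, which is part of why the distinction occasionally matters:

```python
# Decimal (SI) prefixes versus the power-of-two prefixes used for data.
for exp, prefix in ((1, "K"), (2, "M"), (3, "G"), (4, "T")):
    decimal = 1000 ** exp
    binary = 1024 ** exp
    drift = (binary / decimal - 1) * 100
    print(f"{prefix}: {decimal:>16,} vs {binary:>16,} ({drift:.1f}% larger)")
```

By the time you get to terabytes, the binary version is about ten percent bigger -- a gap disk manufacturers and operating systems have famously disagreed over.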
Even the plain vanilla metric system offers choices.  Strictly speaking, the liter is redundant.  We could just use cubic meters.  In practice, we choose the unit that fits best:
  • Liters and cubic meters both act like basic units of volume.
  • Square meters and hectares both act like basic units of area.
  • Grams and kilos both act like basic units of mass.
That's leaving aside square and cubic centimeters.  Which units you use will depend on what you're doing.  A recipe might call for 15ml of oil but an Olympic swimming pool will hold about 2500 cubic meters (2.5Ml) of water**.  Housing space is measured in square meters, but land is measured in hectares.


Finally, there is one area of common measurement that has universally and persistently resisted metrication: time.  Everybody uses days, years and some notion of months.  Hours comprising sixty minutes of sixty seconds are approximately as widespread as writing.  Even thoroughly metricated places use kilometers per hour instead of meters per second.  Outside very specialized contexts, long periods of time are measured in years, though which exact definition of the year may depend on context and most definitions vary over time.

There have been various efforts to "rationalize" time measurement, but none even close to successful.  The natural units are just too strong.  Only when dealing with very short periods of time, outside the realm of everyday experience, do we use "correct" units and talk about microseconds and such.


Measurement is not an abstraction.  It is a concrete action dealing with the physical world.  The experience of measurement depends on what is being measured, and our mental representations reflect this, making distinctions that appear illogical from an abstract point of view.




* No discussion of pounds and kilos would be complete without a pedantic comment that pounds measure weight, that is, force, while kilos measure mass.  In everyday life, we deal in force.  It takes careful observation to realize that there is a difference (for example, a diver at neutral buoyancy has little weight but just as much mass as on dry land).  Thus the pedantic distinction.  A kilo of something will normally weigh about 9.8 Newtons, which is what a scale "should" typically read if you put a kilo of something on it.

** I originally left out "of water" here.  After all, an Olympic pool could just as well hold 2500 cubic meters of beer, or silly putty, or whatever.  But it's hard to think of a container without its expected contents.  Which more or less goes to prove my point.

Friday, May 20, 2011

Data, metaphorically

Hmm ... haven't been in here in a while.  Everything still looks OK, just a bit musty.  Let's open up the curtains, blow the dust off the bookends.  Ah ... better.

Now where was I?

In the previous post, I tried to find an ordinary mass noun that behaved like data in its mass noun form, but without great success.  I'm not going to try to fix that here.  In fact, I'm going to try to explain why the effort failed, and to do that I want to explore how data behaves, metaphorically.  But first a bit about metaphor.

Metaphor generally connotes figurative speech used for poetic effect, whether well
That time of year thou mayst in me behold
When yellow leaves, or none, or few, do hang
Upon those boughs which shake against the cold,
Bare ruined choirs, where late the sweet birds sang.
or perhaps not so well
Head down into the storm they went, pressing barehanded to their chests an unshielded sense of peril.
Um ... right.

That's all well and good, but it's not the whole story.  Even the definition your English teacher gave was probably more like "A comparison made by referring to one thing as another."*  That's closer to Aristotle's definition and to the etymology from the Greek for "carry over", and in my view it's an apt one.  There is a strong case to be made that metaphor in this sense is not merely a figure of speech reserved for flowery poetry and purple prose, but rather a fundamental aspect of how we think, whether we put those thoughts into words or not.

Lakoff and Johnson, for example, make this case in Metaphors We Live By, which pulls together dozens of examples of particular metaphors and shows how, taken together, they imply underlying mental metaphors.  Far from grinding away at a desk in English class to produce a figure of speech that will survive the dreaded red pen, we effortlessly produce metaphors -- in Aristotle's sense -- in nearly every sentence.  These metaphors, on the order of "more is up/less is down" and "anger is a hot liquid" (it can boil over, you can get rid of it by blowing off steam, it behaves as a mass noun ...), are so pervasive we don't even see them as metaphors unless we look -- at which point we see them everywhere.

(To get the flavor, go over that last paragraph.  Clearly "grinding away" is metaphoric, but so is "see" in "see them as metaphors", "pulls together" and even "in" in "in nearly every sentence".  Well, a sentence isn't really a container or bounded space, is it?)

It's perfectly normal, indeed probably universal, to have more than one metaphoric view of a concept, and the different views don't have to be consistent.  For example, we can view ourselves as moving through time ("Let's just get through today.") or ourselves as stationary with time moving past ("What's coming next week?") depending on what works best at the moment.

So, from this point of view, what is data (in the computing sense)?
  • It's a fluid.  It can flow or otherwise move from place to place.  It can leak.  It can fill up space.  It can also be compressed, but generally it acts more like a liquid than a gas.  If your data isn't flowing fast enough, you need a bigger pipe.
  • It's made up of discrete parts, ultimately bits.  It can be partitioned into chunks of uniform or varying size.  You can change parts independently, but only down to the bit level.
  • It's something of value. It can be secured, tampered with, stolen, bought, sold or given away.
  • It's a form of text.  It can be written, read, erased and copied.
  • You can search through it, organize it and make it universally accessible and useful ... wait, where did that come from?
I'm sure that with a little more thought I could come up with several more metaphors for data, but I think that's enough to make two points: First, that data, like very many other concepts, can be described by several internally consistent metaphors, and second, that because these metaphors, as with those for other concepts, aren't always consistent with each other, there's no one concrete noun that could serve as a universal metaphor for data.  In other words, trying to fit water or stone or gravel or rice to data as a whole was doomed to failure from the beginning.

That's to be expected, I suppose.  One definition of equality is that two things are the same if they can stand in for each other in all circumstances.  If it looks like rice, tastes like rice and is generally like rice in every observable way, then we may as well say it is rice.  Which leads me to a different definition of metaphor that I don't like nearly as much as the one I used:  A comparison of two unlike things that have something in common.

That's fine as far as it goes, and in particular the things in a metaphor do have to be unlike, but it implies that the things being compared are interchangeable.  They aren't.  One thing is being explained by referring to it as the other.  Moreover, the thing being explained is always more abstract than the thing being referred to.  In the first data example, something very abstract (data) is being referred to as something more concrete (a fluid, for example).


As always, the definition you choose makes a difference.  Seeing metaphor as a comparison between two unlike things with something in common provides a formula for incoherent images (What do you mean "The stop sign was a fire truck." isn't a good metaphor?  They're unlike but they're both red!).  Seeing metaphor as one thing carried over to stand in for another -- the original metaphor for metaphor -- opens up a vast and surprising new world.






* It took me a while to find a definition I liked.  This one is courtesy of Gideon Burton's Silva Rhetoricae.

Thursday, April 7, 2011

Data and rice

It may seem like this blog is turning into yet another "my usage can whip your usage" column, but bear with me for one more post.

In computing contexts (where I spend most of my time), you'll almost invariably see "The data is ..." as opposed to "The data are ...".  The naive analysis of this is that data is plural (singular datum) and it is therefore simply incorrect to say "The data is ...".  Computer geeks simply don't know any better.

A more reasonable analysis is that data here is a mass noun, like water or gravel.  Mass nouns are measured as opposed to counted.  You ask how much water, as opposed to how many eggs.  There is no such thing as a water, except in special cases like "I ordered a water," meaning a glass or similar serving, or "Bubblyfritz is a water that really refreshes," meaning a kind of water.  Mass nouns act singular.  You say "This water is salty," not "These water are salty."

Fair enough, but in contexts outside computing people often use data as a plural:  "These data support my theory," or "We don't have many data to work with here."  A naive explanation is that people who say such things are just stuffier than computer geeks, who are notoriously playful in their use of language.  A more reasonable explanation is that such speakers are using data as an ordinary (count) noun, albeit one with an irregular plural carried over from Latin.  Consistent with this, people also use the singular datum in various ways, including forms like "This datum doesn't fit with the other data."

Why do different groups adopt different usages, each perfectly defensible?  No doubt culture plays a part.  One's use of data indicates whether one is a grammatical ignoramus who doesn't realize that data is a plural form or an uptight pedant who insists on applying arbitrary rules from dead languages.  However, I think that, even allowing for this effect, there is another reason to use data as either singular or plural depending on circumstances.

We computer geeks typically deal with lots and lots of data.  The more the better.  We eat terabytes for breakfast and gigabytes for a light snack.  Further, the individual bits generally don't carry any particular significance.  The fourth bit of the ASCII or Unicode representation of the 'T' at the beginning of this sentence didn't come from some physical measurement, but from an arbitrary encoding.  We also process data differently.  One doesn't generally take the mean or standard deviation of the bytes in a blog post or audio clip.

In short, our data is a different beast from a statistician's or biologist's.  It really only makes sense when considered in aggregate.  Metaphorically it acts much like a substance.  We speak of storing data, or moving it from place to place.  We wonder how much space we have left for storing data.  We even speak of compressing data, some of which might be lost in the process, and of "memory leaks" filling up available heap space.  In short, computer data acts like a mass noun.

Conversely, individual statistical or scientific data are significant.  If I measure the temperature today, that's a datum (but more commonly data point -- I'll come back to that).  If I measure the temperature again tomorrow, that's another datum.  Once I've accumulated a data set [hmm ... not "datum set"?], I try to derive some aggregate measure from that, but the key word here is "derive".  The individual data are the source of truth.

To get a better measure, I may throw out particular data.  I may present the data sorted or grouped in various ways to make particular points.  I might note some property of a particular datum, or I might call attention to the source of some subset of the data as opposed to the others.  In all these cases the individual data have their own identity and it's perfectly logical to refer to them collectively in the plural and to an individual one in the singular.



I mentioned rice in the title, didn't I?

While casting about for more ordinary, physical analogs to computer data, I started looking for mass nouns that fit the part.  I didn't like water because computer data is ultimately discrete.  A terabyte is a lot of bytes (somewhat over a trillion*), but still a finite number of distinct bytes. For practical purposes water is infinitely divisible.

What about rock or stone?  They can certainly behave as mass nouns.  You can order a ton of rock or fill your pickup truck with stone.  But you can also use the same word (not a variant form) in the singular.  You can say "a rock" or "a stone".  Not even a compugeek would normally say "a data".

I had been looking for an aggregate of individual pieces which are so numerous that we treat the aggregate as a substance.  You can certainly gather rocks together until somewhere along the line you've shifted from "some rocks" to "some rock".  But on the other hand, rock and stone can be treated as substances themselves.  You can refer to a chunk of rock or a statue made of stone.  In fact, you could argue that we have two mass nouns called "stone": stone as a substance, and an aggregate of stones, which are usually made of stone in the first sense (but even if the individual stones were made of some clever compound of plastic, chances are a ton of them would still be called "stone").  Computer data doesn't have that extra level.

I had originally called this post Data and gravel, but computer data is made up of individual parts of uniform size (ultimately bits) while gravel is visibly irregular.  Sand is closer, and was my second try.  Grains of sand may be irregular, but they're small and there are so many that you don't really notice.  Ultimately, though, something like rice seems closest.  The parts are small, numerous and visually uniform.

If you have an aggregate of smaller objects that you treat collectively as a mass noun, you may still need to refer to the individual pieces from time to time.  There are several ways of doing this
  • In cases like brick, rock or stone, the name of the substance, meaning a small chunk of that substance (a brick, a rock, a stone).
  • In most cases, "a ____ of ...", where the blank may be filled in by something specialized (a kernel of corn, a grain of sand) or, failing that, the generic piece (a piece of gravel).
  • In some but not all cases this can be turned around (a sand grain, a corn kernel, but a gravel piece sounds a bit odd to me).
This may help explain why people like to say "data point".  If you have enough data to do meaningful statistics, it's easy (and even useful) to start thinking of it in the aggregate.  Working back from that, you can get a point of data, or more usually a data point.

Finally, why would we choose data as the form for the mass noun, rather than the singular datum, by analogy with rock and stone?  First, data is simply used a lot more.  Second, it's data, not datum, that is used in contexts that work for both count nouns and mass nouns ("I can't interpret the data", "Could you send me your data?").  Further, mass nouns like brick and stone only seem to occur in the pattern mentioned above of substance ➝ small chunk of that substance ➝ aggregation of said small chunks.



*  Yes, I've heard of tebibyte but I've never, ever heard it used seriously in real life.  Yes, it's an IEC standard.  Yes, IEEE officially says that a terabyte is exactly a trillion bytes as opposed to 2^40 = 1,099,511,627,776.  No one outside the standards committees cares, and I doubt even they care all that much most of the time.  In theory, it matters whether your disk holds exactly a trillion bytes or close to 10% more.  In practice, either your disk is nearly full or it isn't.  When it fills up, you buy more.  [I've since seen notations like TiB in the wild, but it adds little, as far as I can tell.  If it says TB, you still know it means the power of two and not an even trillion.  If you see TiB, you know it means the same thing, but whoever said TiB instead of TB wants you to know they know the difference --D.H.]

Standards are great, and if a standards body wants to, say, limit files to 4,294,967,296 bytes they should either say "files shall be no larger than 4 gibibytes" or be clear that "GB" means 2^30 bytes.  Or they can just say "file sizes are 32 bits".  The rest of us will continue to blithely use the "wrong" units.

That said, perhaps the distinction is becoming more important as the numbers get larger.  Where this all started, with 2^10 = 1,024 being practically 1000, the error is only 2.4%.  At the megabyte level, the difference is 4.9% and at the gigabyte level, 7.4%.  Once we get into peta- and exa- territory, the errors are 12.6% and 15.3%, harder and harder to ignore.  Even then, manufacturers, who one would think might stand to gain by saying 1.1TB instead of 1TB, seem content to say 1TB anyway.  No harm, no foul.
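The arithmetic behind those percentages is easy to check.  Here's a quick sketch in Python (nothing assumed beyond the standard prefix definitions), comparing each binary prefix 2^(10k) against its decimal counterpart 10^(3k):

```python
# Relative error from treating the binary prefix 2**(10*k)
# as if it were the decimal prefix 10**(3*k), kilo- through exa-.
prefixes = ["kilo", "mega", "giga", "tera", "peta", "exa"]

for k, name in enumerate(prefixes, start=1):
    binary = 2 ** (10 * k)   # 1,024 for kilo; 1,048,576 for mega; ...
    decimal = 10 ** (3 * k)  # 1,000 for kilo; 1,000,000 for mega; ...
    error = (binary / decimal - 1) * 100
    print(f"{name}: {error:.1f}%")
# kilo: 2.4%, mega: 4.9%, giga: 7.4%, tera: 10.0%, peta: 12.6%, exa: 15.3%
```

The gap widens by roughly two and a half percentage points with each prefix, which is why it's easy to shrug off at the kilobyte level and harder to ignore at exabyte scale.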

Friday, March 11, 2011

Koala bears

Admit it: You're thinking "But koalas aren't bears, they're marsupial!"

Fair enough.  This is, after all, the response we've all had drilled into our heads since grade school.

But why should it matter whether a koala is a marsupial and not an ursid?  Lots of things we call bears aren't members of family Ursidae.  For example:
  • Teddy bears
  • Chicago Bears football players
  • Statues of bears
  • Final exams
  • Goldilocks' Three Bears
It would generally sound silly to point out that teddy bears aren't really bears, or that bears don't actually talk, eat porridge and live in houses, or that "That test was a bear!" is just an expression, so why the urge to point out that koalas aren't in Ursidae?

Given the regularity with which it is pointed out that koalas "aren't really bears", it hardly adds much to point it out again.  There is, however, a fairly plausible explanation based not on some fundamental need for taxonomic accuracy, but on normal rotten human nature: It serves to make known that, yes, you went to school and you, too, know that koalas are marsupial (or at least "not real bears" if "marsupial" escapes you at the moment).  You thus mark yourself as belonging to the "in" group (albeit not a particularly exclusive one) of Those Who Know Koalas Aren't Closely Related To Those Other Animals We Call "Bears".

Behind this is a more general notion: The "technical" definition is the "right" one and anything else is "incorrect" or, more cynically, "If I had it drummed into my head, so should you".

Again, fair enough.  I have no doubt that such balding-ape behavior is at work here, but what triggers it? Once more, why do we not feel compelled to mark ourselves as belonging to Those Who Know That The Chicago Bears Are Not Really Hairy Carnivores (well, actually ...)?

It seems this sort of behavior only comes out in borderline cases, where there is some chance that the listener isn't one of Those Who Know.  Koalas look and act a fair bit like ursids.  It's perfectly understandable that a European encountering a koala would think "That's a funny-looking bear," and so they did.  But now we know better, or at least Some of Us do.

It has been said that academic disputes are bitter precisely because the stakes are so low.  Just so, quibbles over usage are most heated precisely when they are inherently least consequential, that is, when the distinction in question makes little difference.  If I say "koala bear", you won't think I'm talking about a squid.  You'll know exactly which animal I'm talking about, but feel a strong urge to whisper "He doesn't know it's not really a bear" to the first Person in the Know that you can find.

The perfect in-group marker, evidently, is content-free.



One might be tempted to think that when it was discovered that koalas weren't actually placental at all and thus were not on the same taxonomic branch as the Ursidae, all educated persons began calling them by their right name and "koala bear" fell into immediate disuse.  Not so.  Search Google Books for "marsupial koala bear" in quotes and you'll find at least three books, clearly written by trained biologists.  Why would a biologist say "koala bear"?  Why not?  From a biological point of view common names carry no particular weight.  If you want to be clear and unambiguous, you say Phascolarctos cinereus (or P. cinereus for short).


This sort of thing seems to happen a fair bit -- those who would presumably know best tend to be more casual in their usage than those who wish to appeal to them for authority.  It has been said, for example, that the term "tide" can only be properly applied to phenomena due to gravitational gradients, centuries of usage before and after Newton's gravitational explanation of the tides notwithstanding.  This does not, however, keep atmospheric scientists from studying "atmospheric tides".

These small daily fluctuations in air pressure, due to heating from the sun and in no significant way related to the moon, are nicely analogous to the usual oceanic tides.  Three possible explanations for this presumably "incorrect" name come to mind:
  • Whoever coined the term "atmospheric tide" mistakenly thought they were caused by the moon's gravity.
  • Whoever coined the term was unaware of the rule that anything called a "tide" or "tidal" must have a gravitational cause.
  • No such rule exists.
I'm going with the last of these.

Saturday, January 8, 2011

You

Change is a part of language.  Of all the ways to justify a pedantic claim that one's pet usage is "correct" and Kids These Days are borderline illiterate, the appeal to history is one of the weakest.  OK, so originally decimate (or rather, decimare) referred to executing a randomly selected tenth of an insubordinate legion.  That meaning hasn't been current in English for decades or centuries, if indeed it ever was.  In today's English, decimate means "destroy almost totally", because that's how people use and understand the word.

(Not that I don't inwardly cringe when I see the growing use of the possessive marker 's for the plural marker, as for example "Employee's only".)

One change that appears to be gaining acceptance is the use of they as a gender-neutral substitute for he or she, handily filling a gap that seems to have grown more noticeable over time.  It has the advantage of being a real word, as opposed to any of several newly-minted words that have been proposed for the purpose, so people already know how to pronounce it and what verb form to use with it.  Of course, everyone knows that they is actually plural, and so using it as a singular is incorrect and thus liable to lead upcoming generations horribly astray if not destroy their verbal capacity entirely.

But of course, everyone is forgetting about you.

Just like they, you is syntactically plural.  In particular it takes the plural form of its verb, as in you are.  Nonetheless, it is used for both singular and plural.

English is unusual in this respect, at least as far as European languages go.  Most European languages use separate forms for singular and plural in the second person, as indeed English used to.  The singular form is almost always tu (as in the Romance languages) or some variant (e.g., German du).  There is more variation in the plural form, though in the Romance languages it's consistently a derivative of Latin vos.

So why would English not distinguish, but instead use the plural form for both singular and plural? Well ...

European languages don't just distinguish singular you from plural.  They generally also distinguish familiar from formal.  For example, in Dutch, a shopkeeper or bank teller will generally address a new customer as U, but two friends or family members will address each other as jij (pronounced "yiy" to rhyme with "sigh", or "yuh", depending on whether it's stressed).  There are two main patterns for this:
  1. The formal comes from the third person, as with German Sie or Spanish Usted (from Vuestra merced, literally "your mercy", more loosely "your grace").
  2. The formal comes from the plural, as with French vous.
European languages also tend to distinguish case, as English still does with I vs. me, he vs. him and she vs. her.  English has lost almost all of its other case distinctions and seems intent on losing the rest.  The who vs. whom distinction is essentially gone, and even he/she vs. him/her only seems to matter in simpler contexts.  In my own experience, most people will say either me and him, if they think no one's looking, or he and I if they think they need to use "proper grammar", regardless of the actual case involved.

What does all this have to do with you?

English used to use a perfectly ordinary European system: singular thou (cognate with tu) and plural ye, with the plural for the formal.  In the accusative case (direct object of a sentence), the forms are thee and you.  So
  • Shall I compare thee to a summer's day?
  • Thou art more lovely and more temperate.
  • (I had to go back to Chaucer for clear examples of you as opposed to ye):  But first I pray you, of your courtesy/That ye ascribe it not to my villainy
I don't know the actual order of the first two events, but over time
  • The ye/you case distinction was lost in favor of you
  • The you form became the (singular and plural) formal as well as the plural familiar
  • The familiar/formal distinction was lost, again in favor of you (to the extent that to modern ears the familiar thou tends to sound "formal")
So, behind that simple word you, one-size-fits-all-numbers-and-cases, lurks an elaborate structure of case, number and familiar/formal distinctions, most of which is now long forgotten.


And if only it were that simple.  There are plenty of wrinkles in the basic pattern of "tu, some plural form, either plural or third person for formal (both singular and plural)":
  • German uses third-person (Sie) while its close cousin English used the plural (you)
  • Dutch (about as close as you can get to both German and English) uses different forms (jij and U) both cognate with the English plural you (and German accusative plural euch), for the singular familiar and formal respectively.
  • Spanish uses a third-person formal, but the Vuestra in Vuestra merced implies that it had previously used the plural vos.
  • Spanish distinguishes between formal singular and plural (Usted and Ustedes -- Vuestras mercedes), while most European languages only distinguish singular and plural in the familiar form
  • French uses the plural vous for the formal -- actually, French is relatively straightforward in that respect.
  • Italian has both (Lei, a third-person form, is more widely used, but some dialects retain the older voi).
  • English has several unofficial plural forms (y'all, you guys, youse, you-uns etc.), which leave the once-plural you as a purely singular form

Monday, November 29, 2010

Copernicus and revolutions

Before Copernicus, everyone thought that the earth was the center of the universe.  Then Copernicus, in De Revolutionibus Orbium Coelestium, said that planets, including the earth, revolved around the sun.  Thus did science triumph over tradition and superstition.

Well, that's the Short Attention Span Theater version.  It has the advantage of being short and memorable, and the disadvantage of not being particularly near reality.  Yes, Copernicus did write De Revolutionibus, and yes, it did have the earth revolving around the sun, but heliocentric theories go back at least to Aristarchus of Samos, and it took another 200 years after Copernicus for the idea to take really firm hold in the scientific community (a somewhat anachronistic term in itself, but never mind).

Much has been made of controversy with the Church over the theory, but that came later.  Copernicus published the book with the aid of his friend Bishop Giese and dedicated it to Pope Paul III.  Nor does De Revolutionibus usher in a fully-formed modern view of the universe.  Copernicus postulates eight celestial spheres, with the fixed stars in the outermost, each planet moving in a perfect circle.

Copernicus does not present new data that can't be made to fit with the older geocentric view.  He reanalyzes centuries of observations that had been explained by a fairly complex system of cycles and epicycles, explaining them by a somewhat simpler system of cycles and epicycles.  The infamous epicycles are still needed because the planets don't actually move in perfect circles.



Copernicus's work is often considered important because it regards the earth not as the center of the universe, that is, as a special, distinguished place, but as a part of it, a planet on an equal standing with the other planets.  This notion that we do not occupy a special place is central to modern cosmology, extending even to the notion that the particular universe we occupy is not necessarily special, despite possessing such apparently unlikely features as solid matter and cosmologists.  In this view of science as the great dethroner of humanity, Darwin delivered a final insult by arguing that we are not even special among animals, but rather Just Another Ape.

There is merit in this view, even though (or maybe because) the notion that we are special creatures in a special place remains quite popular.  However, the Copernican shift can also be seen as one of a long line of cases where a simple and reasonable assumption turned out not to be true.  For example
  • That the earth is not flat but an enormously large globe (enormous on a human scale, at least)
  • That the earth revolves around the sun and not the other way around
  • That the other planets are not points of light, but worlds at least somewhat like ours, most with their own moons
  • That the stars are not points of light, but suns like our own
  • That the Milky Way is a vast collection of stars, of which our own sun and the stars we see at night are part
  • That "spiral nebulae" also consist of large numbers of stars and are indeed galaxies like our own.
  • That the planets do not move in perfectly circular orbits, but ellipses
  • That there are many, many objects in the solar system that are not planets (in the famous case of Pluto, something we had considered a planet looks to be better described as something else)
  • That not everything in the solar system moves like a planet; for example, some objects move from an inner orbit to an outer one and back over time.
  • That what appear to be single stars are often systems of two or more stars
  • That the fixed stars are actually not fixed, but moving
  • That stars are not eternal, but are born and die
In many of these cases, but not all, the new view does make our position less special.  The driving force behind these shifts, however, isn't a desire to make humanity less special, but a desire to find simple, coherent explanations that fit the facts.  It's striking that many, but again not all, of the shifts listed above are toward a more uniform view -- the earth is of a kind with the other planets, the sun with other stars, the Milky Way with other galaxies.  Such a shift makes our place less special, but more as a side effect than as a specific aim. 

The most uniform explanation is not always right, though.  Stars like our sun are relatively rare.  Most stars are significantly larger or smaller, hotter or cooler.  Single-star systems are a majority, but stars in multiple systems constitute a significant minority.  Of the planets in our solar system, only one has significant liquid surface water.  There are many more Kuiper Belt objects than proper planets.  Even neglecting KBOs, asteroids, comets and such, there are many more moons in the solar system than planets.  Putting it all together, a rocky planet with liquid surface water orbiting a single star is almost certainly fairly rare, even if planets in general are abundant.



In the early 20th century it was discovered that distant objects are moving away from us, and the more distant the object, the faster it is moving.  The effect is uniform in all directions, within the limits of measurement [once you subtract out the dipole anisotropy -- like pretty much everything else, we are moving slightly, relative to the general expansion of the universe -- but that's a subtle effect and wasn't discovered until much later -- D.H.].  The conclusion is obvious: We are at the center of the universe, a unique spot whence everything else recedes.  This conclusion was rejected in favor of one which does not require us to sit in a special place: The entire universe is expanding and, other factors being equal, everything is moving away from everything else.  Again, this is the more uniform view, and evidence has borne it out over the decades.

How might I add that to the list above?
  • That the universe is not static, but expanding?
  • That our solar system is not the center of the universe, but just another part of it?
The first, I think, more closely reflects the actual development of thought.  The second fits the Copernican revolution model, but only by setting up a strawman.  By the time the Hubble expansion was discovered, it was already a given that our place is not special, so much so that what might have been taken as game-changing new evidence of our special place was quickly interpreted as just the opposite.


Pitting rational science against irrational human egocentricity makes a good story, but there's a more mundane reading to be found:  Science likes uniformity, and it likes uniformity so much that a nicely uniform explanation of known facts will eventually push aside our natural, egocentric concepts.

Put briefly and in retrospect, many of the shifts listed above seem blindingly obvious.  Why assume that the sun is different from other stars?  Why assume that ours is the only galaxy?  But this forgets the flip side of uniformity:  The most natural assumption is that things which appear different really are different. 

The earth appears to us as a huge surface with many features.  The other planets appear to us as tiny lights in the sky.  The sun is a blindingly bright ball.  The stars are more tiny lights in the sky.  The Milky Way is a huge swath through the sky.  Spiral nebulae are tiny, in almost all cases much too small to be seen by the naked eye.  And, in the most famous case, planets really do appear to move around us in the sky, with occasional backtracking.  We're down here, they're up there.  The geocentric view, however egocentric it might be, was also the most natural and prudent until a more compelling story came along.

Tuesday, November 16, 2010

Navigating underground

I've always loved subways/undergrounds.  Even packed cheek-to-jowl into an un-air-conditioned Circle or District line car in the middle of the (then) hottest summer on record, in a suit, I still loved the posters and ads, the station architecture and decor, the endless parade of passengers, the nearly endless escalators, the tabloid news stands, the surprising variety of little shops tucked away ... even the names of the stations, the sound of the wheels and the brakes, the generally indecipherable announcements and the sheer urban gothiness of the tunnels themselves.  Sort of like an old-fashioned carnival funhouse ride but way, way cooler.

But there is another, more practical reason that I love subway systems: They make navigating a strange city nearly foolproof.  You only need to know two things: What stop your destination is at, and how to get to and from the system.  If you're just seeing the major sites, both of those are generally dead easy: the names of the stops are invariably listed in the guide book, and guess what -- stations tend to be built right by major landmarks.  Even if you're not visiting a major landmark, chances are whoever you're visiting will tell you the name of their station and the same technique will work.

All you really have to do is follow the greatly-simplified system map, make the right transfers and avoid Baker Street.  Unless you're on a tight schedule, you can essentially treat the whole system as a single point.  Your route is Point A -- subway -- Point B.

Such conceptual simplicity is so handy that one can spend months in many cities without learning more than the bare rudiments of the above-ground layout.  This is not entirely a good thing.  Apart from missing the richness of sights to be encountered by straying into this side-street or that arcade (but not that one; the less said about it the better), there are surprisingly many cases where it would be faster just to walk.

An underground transit system has a cognitive character all of its own.  Traveling above ground, you can generally see where you're going and gauge turns and distances reasonably well.  Underground, after several twists and turns of stairways and corridors, lurching starts and stops, and a few subtle or not-so-subtle bends, I personally find I might as well be playing pin-the-tail-on-the-donkey.

And yet the human brain, adapted to navigating outdoors and on foot, seems to cope reasonably well with the time-passes-and-then-you're-elsewhere nature of subway riding, even when the mental map of the territory above is largely blank.  A mental map developed solely from underground transit will have significant distortions, of course, but these don't seem to hurt much.  Once the real landscape becomes familiar, this more accurate view tends to supplant the earlier one (at least in my experience) and the below-ground journey starts to make a bit more sense.

The brain is used to meshing different sets of information, so perhaps this isn't surprising, but I get the definite feeling that more is going on here beneath the surface (so to speak) than one might think.

Thursday, October 21, 2010

Parts of speech

We all learned about nouns, verbs and their friends in elementary school.  A typical list is
  • noun
  • pronoun
  • adjective
  • verb
  • adverb
  • preposition
  • conjunction
  • interjection

These are generally useful categories, but if you're really trying to figure out a language, you have to slice a little finer.  For example there are
  • Transitive verbs (ones that take an object, like hit)
  • Intransitive verbs (ones that don't take an object, at least not typically, like sit)
  • Modal auxiliary verbs (like can, could, may and might, which some dialects can stack up into lovely constructions like might could and may can)
  • Phrasal verbs, like get up
  • Countable nouns, like tree
  • Uncountable (or mass) nouns like water
  • Pluralia tantum (always-plural nouns -- singular plurale tantum) like scissors
  • Comparable adjectives, like tall
  • Uncomparable adjectives, like dead, or NP-complete
  • Determiners, including articles like the or an, but also adjectives like some, any, all, this or that
  • Comparators, like more, most, less and least
Such distinctions go some way towards predicting what words can and can't be used together.  For example, you don't normally use comparators like more with uncomparable adjectives:
  • Smith is more famous than Jones.
  • *Graph isomorphism is more NP-complete than 3-Sat.
(The * at the beginning indicates something that wouldn't normally be said.  I'm fudging with "wouldn't normally be said" instead of "incorrect" or "ungrammatical" as it is notoriously easy in general to invent contexts in which a given construct would make sense.)

This being language, the boundaries aren't perfectly crisp.  Mass nouns don't generally appear as plurals, but there are a few exceptions, for example
  • When referring to some standard serving, as in I ordered three waters.
  • When referring to different types of a given substance, as in She preferred the wines of Bordeaux.
Some nouns can work both ways, for example
  • Hand me a brick.
  • We need five tons of brick.
And let's not even get into whether it's OK to say "more unique" even though unique is supposed to be an absolute and therefore uncomparable.

The word that got me thinking of all this was summit, as a verb meaning "to reach the summit of".  As the "of" would suggest, this verb is generally transitive -- it takes an object, as in Apa Sherpa summited Everest for the twentieth time (which he actually did, last May).  However, the object is often omitted, as in Apa Sherpa summited for the twentieth time.  In contexts where this would be said, it would be abundantly clear that Everest was the peak in question.  In particular, it doesn't matter how many other peaks he might have climbed how many times.

So is summit then acting as an intransitive verb, or a transitive one with an implied object?  I tend towards the latter, as would most grammarians, I believe.  But what about more common cases like sing?  In I sang, there is no implication that I sang any particular song, so one would think sing is acting intransitively.  But I must have been singing something.  Is it really acting transitively, but with an implied, unspecified object?  At some point, such qualifications cease to pull their own weight.  As the man said, volleyball is technically racketless team ping-pong, played with an inflated ball and raised net while standing on the table, but what does that buy us?

What interests me here is how grammar, which is by definition pure syntax, seems unable to stay cleanly separated from semantics.  For example, some mass nouns resist the plural
  • * I would like three neutroniums.
  • * He was a connoisseur of neutroniums.
In the first example, one does not serve neutronium.  In the second, there is only one kind of neutronium.  How would we detect such errors?  I would think the process is something like
  • In a construction like three neutroniums, if the object is a substance, we expect it to mean a particular sort of container full of the substance.
  • But that doesn't make sense in the case of neutronium.
In that view, the syntax is fine and the error is semantic.  Mass nouns, then, are syntactically nouns, but ones whose plural forms have particular semantic features.  Similarly, whether a verb is used transitively or intransitively is a syntactic distinction, but whether there is an implied object is a semantic concern.

Except that "object" is a syntactic concept.  One way of reconciling this is to posit that the syntactic form Apa Sherpa summited, for example, is somehow transformed into the form Apa Sherpa summited Everest, with Everest as the object.  The choice of "transformed" here deliberately suggests transformational grammar, though I'm not sure that's completely appropriate.

Another would be to posit that the form Apa Sherpa summited gets transformed into some internal structure, in which the concept represented by summited requires something acting in the semantic role of "thing which is summited", which we may as well call an "object", albeit with some risk of confusion.  This putative internal structure would be describable in words, for example Apa Sherpa summited, or He summited, or Apa Sherpa summited Everest, or Everest was summited by Apa Sherpa and so on, but it would be an essentially different structure from any of those sentences.  As I very dimly understand it, this is more along the lines of cognitive grammar.

Thursday, October 7, 2010

How much do we know?

The question here is not how much does humanity know collectively, or how much do we know about some given topic compared to how much we don't, or what portion of things can we reasonably say we "know" as opposed to believing or being "fairly sure" or such.  Those are all interesting questions, but what I'm after here is more literal.  How much does a typical human being know, by some objective measure?

To get the flavor of the question, it has been estimated that the average high-school graduate knows about 40,000 vocabulary items, or listemes.  A listeme is a word, word part or collection of words that you have to memorize in order to understand, as opposed to something you can understand by breaking it into parts you already know. For example
  • There are two listemes in "listemes": listeme itself and the plural marker -s.  If you understand both of those, you can understand their combination [Or three: list, -eme and -s, if you're a linguist and familiar with morpheme, phoneme and such -- see below -- D.H.].
  • Typical acronyms and such are listemes: USA or LOL, for example, even though the parts they stand for are well known, because you have to know which words the letters stand for.
  • Idioms are listemes.  Knowing flying and saucer is not enough to know flying saucer.
  • Proper names are listemes.  You have to learn that Muskegon is a city and that Michael Jordan is a former NBA player, even if you already know that Michael and Jordan are names.
  • To some extent, different senses of words count as different listemes.  Knowing that you can eat off a plate doesn't tell you how to plate something in gold or what it means for a batter to step up to the plate.
  • Listemes are somewhat subjective.  Someone well-versed in Latin might see intermittent or conjecture as made up of simpler parts, while for most of us they're one listeme each, and of course different languages have largely different listemes.
Each listeme binds a largely arbitrary sign to a meaning.  At a bare minimum, then, our typical high school grad knows 40,000 items, however much knowledge an item might represent.  Now, I make no pretense of knowing how the mind really represents such things, but the title of this blog is Intermittent Conjecture, so it seems that by a miraculous coincidence I've left myself room to speculate.

I would guess that typical listemes are associated with bundles of memories and their relations to other memories.   For example, plate might perhaps conjure up images of typical dinner plates and memories of eating and setting the table and such; images of plated items one may have encountered or a representation of the plating process; images from a baseball game with a batter in stance or a runner sliding into home.

Similarly to how words may be defined with other words, these bundles of images will typically overlap.  A memory of a dinner plate may include an image of a table, or of eating, making "something you put on the table" or "something you eat off of" natural, if incomplete, answers to "What's a plate?"

I've used "memory" and "image" fairly interchangeably here, but I suspect that the images that concepts are built on are nothing like fully detailed pictures or movies.  Rather, they're highly abstracted, with only the relevant features retained.

By this line of speculation, those 40,000 listemes might represent 400,000 or 1,000,000 or more images, grouped into concepts and with arbitrary signs attached.  There is much, much more to the picture, of course, but again we're just trying to get a rough estimate of what's in a typical brain.

Words are only one window into the contents of the mind.  We also know things we can't easily put into words, which is one reason I had wanted to talk about different kinds of knowledge and formal vs informal education.  We learn to walk instinctively, and so it's much harder to characterize what sort of things one must "know" in order to walk, yet if we can learn something, there must be some kind of knowledge involved.  Likewise for other skills like skiing or playing the trumpet, which we learn consciously and in many cases formally, but without necessarily learning a lot of vocabulary in the process.

We can also make associations unconsciously and non-verbally.  When the pioneer Lucky Bill in the post I linked to above looks off and sees bad weather brewing in the clouds, he probably doesn't have words for what he's sensing, but it's definitely something he's learned and knows, just as he knows how to let his horse know it's time to go.  This knowledge may well be built on the same sort of memories and images that we pin language onto, but it's not readily accessible to language.

If we take a mental image -- an abstracted memory -- as the basic unit of knowledge, with images grouped into concepts which may or may not have language attached, then it seems plausible that an adult human could have millions or tens of millions of such images.  We must also allow some capacity for storing the relations among the various images, concepts, signs and so forth, but such "metadata" tends to be much smaller than the data it helps organize (see this post on the other blog for more on that).

Being a compugeek, the handiest objective measure of information I have is the byte.  Leaving aside that images may differ widely in size and taking an image to be on the order of a megabyte -- a completely wild guess which may well be off by orders of magnitude -- that would put our mental storage capacity on the order of terabytes or dozens of terabytes.
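Being a compugeek, I can't resist spelling out the arithmetic.  Here's the estimate above in a few lines of Python -- keeping in mind that every number here is a wild guess from the preceding paragraphs, not data, and any of them could be off by orders of magnitude:

```python
# Back-of-envelope estimate of human mental storage.
# All inputs are the wild guesses from the text, not measurements.

images = 10_000_000        # "tens of millions" of abstracted mental images
bytes_per_image = 10**6    # on the order of a megabyte each (pure guess)

total_bytes = images * bytes_per_image
terabytes = total_bytes / 10**12

print(f"{terabytes:.0f} TB")   # → 10 TB
```

Swap in a million images instead of ten million and you get a single terabyte; push the per-image size up or down and the answer swings accordingly.  That's the sense in which "terabytes or dozens of terabytes" is the right order of magnitude, and no more than that.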

Until fairly recently, that was a lot of storage, but these days it's not a staggering figure.  As far as putting together something of the same order as a human brain, we may just now be reaching a necessary, but not necessarily sufficient, technological milestone.

I'm happy to learn that the wild stab in the dark given above turns out to be reasonably in line with other wild stabs in the dark.  See for example this Google Answers page (I didn't have a lot of luck tracing this back to the literature, but since it's all guesswork I'm not going to worry about it [and Google Answers itself disappeared a while ago]).