Thursday, April 7, 2011

Data and rice

It may seem like this blog is turning into yet another "my usage can whip your usage" column, but bear with me for one more post.

In computing contexts (where I spend most of my time), you'll almost invariably see "The data is ..." as opposed to "The data are ...".  The naive analysis of this is that data is plural (singular datum) and it is therefore simply incorrect to say "The data is ...".  Computer geeks simply don't know any better.

A more reasonable analysis is that data here is a mass noun, like water or gravel.  Mass nouns are measured as opposed to counted.  You ask how much water, as opposed to how many eggs.  There is no such thing as a water, except in special cases like "I ordered a water," meaning a glass or similar serving, or "Bubblyfritz is a water that really refreshes," meaning a kind of water.  Mass nouns act singular.  You say "This water is salty," not "These water are salty."

Fair enough, but in contexts outside computing people often use data as a plural:  "These data support my theory," or "We don't have many data to work with here."  A naive explanation is that people who say such things are just stuffier than computer geeks, who are notoriously playful in their use of language.  A more reasonable explanation is that such speakers are using data as an ordinary (count) noun, albeit one with an irregular plural carried over from Latin.  Consistent with this, people also use the singular datum in various ways, including forms like "This datum doesn't fit with the other data."

Why do different groups adopt different usages, each perfectly defensible?  No doubt culture plays a part.  One's use of data indicates whether one is a grammatical ignoramus who doesn't realize that data is a plural form or an uptight pedant who insists on applying arbitrary rules from dead languages.  However, I think that, even allowing for this effect, there is another reason to use data as either singular or plural depending on circumstances.

We computer geeks typically deal with lots and lots of data.  The more the better.  We eat terabytes for breakfast and gigabytes for a light snack.  Further, the individual bits generally don't carry any particular significance.  The fourth bit of the ASCII or Unicode representation of the 'T' at the beginning of this sentence didn't come from some physical measurement, but from an arbitrary encoding.  We also process data differently.  One doesn't generally take the mean or standard deviation of the bytes in a blog post or audio clip.
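To make the point about arbitrary encodings concrete, here is a minimal Python sketch. The helper names `bits_of` and `byte_stats` are made up for illustration, and the text is assumed to be UTF-8 encoded. It shows the bit pattern behind 'T' and computes a mean and standard deviation over the bytes of a sentence. The numbers come out fine, but they describe the encoding, not anything measured.

```python
import statistics


def bits_of(ch):
    """Return the 8-bit binary string for a single ASCII character."""
    return format(ord(ch), "08b")


def byte_stats(text):
    """Mean and population standard deviation of the UTF-8 bytes of text.

    Perfectly computable, and almost never meaningful.
    """
    data = list(text.encode("utf-8"))  # list of ints in the range 0..255
    return statistics.mean(data), statistics.pstdev(data)


if __name__ == "__main__":
    print(bits_of("T"))
    print(byte_stats("The fourth bit of the ASCII representation"))
```

Which bit counts as "the fourth" depends on which end you start counting from, which rather underlines how arbitrary the whole thing is.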

In short, our data is a different beast from a statistician's or biologist's.  It really only makes sense when considered in aggregate.  Metaphorically it acts much like a substance.  We speak of storing data, or moving it from place to place.  We wonder how much space we have left for storing data.  We even speak of compressing data, some of which might be lost in the process, and of "memory leaks" filling up available heap space.  In short, computer data acts like a mass noun.

Conversely, individual statistical or scientific data are significant.  If I measure the temperature today, that's a datum (but more commonly data point -- I'll come back to that).  If I measure the temperature again tomorrow, that's another datum.  Once I've accumulated a data set [hmm ... not "datum set"?], I try to derive some aggregate measure from that, but the key word here is "derive".  The individual data are the source of truth.

To get a better measure, I may throw out particular data.  I may present the data sorted or grouped in various ways to make particular points.  I might note some property of a particular datum, or I might call attention to the source of some subset of the data as opposed to the others.  In all these cases the individual data have their own identity and it's perfectly logical to refer to them collectively in the plural and to an individual one in the singular.



I mentioned rice in the title, didn't I?

While casting about for more ordinary, physical analogs to computer data, I started looking for mass nouns that fit the part.  I didn't like water because computer data is ultimately discrete.  A terabyte is a lot of bytes (somewhat over a trillion*), but still a finite number of distinct bytes. For practical purposes water is infinitely divisible.

What about rock or stone?  They can certainly behave as mass nouns.  You can order a ton of rock or fill your pickup truck with stone.  But you can also use the same word (not a variant form) in the singular.  You can say "a rock" or "a stone".  Not even a compugeek would normally say "a data".

I had been looking for an aggregate of individual pieces which are so numerous that we treat the aggregate as a substance.  You can certainly gather rocks together until somewhere along the line you've shifted from "some rocks" to "some rock".  But on the other hand, rock and stone can be treated as substances themselves.  You can refer to a chunk of rock or a statue made of stone.  In fact, you could argue that we have two mass nouns called "stone": stone as a substance, and an aggregate of stones, which are usually made of stone in the first sense (but even if the individual stones were made of some clever compound of plastic, chances are a ton of them would still be called "stone").  Computer data doesn't have that extra level.

I had originally called this post Data and gravel, but computer data is made up of individual parts of uniform size (ultimately bits) while gravel is visibly irregular.  Sand is closer, and was my second try.  Grains of sand may be irregular, but they're small and there are so many that you don't really notice.  Ultimately, though, something like rice seems closest.  The parts are small, numerous and visually uniform.

If you have an aggregate of smaller objects that you treat collectively as a mass noun, you may still need to refer to the individual pieces from time to time.  There are several ways of doing this:
  • In cases like brick, rock or stone, the name of the substance itself, used to mean a small chunk of that substance (a brick, a rock, a stone).
  • In most cases, "a ____ of ...", where the blank may be filled in by something specialized (a kernel of corn, a grain of sand) or, failing that, the generic piece (a piece of gravel).
  • In some but not all cases this can be turned around (a sand grain, a corn kernel, but a gravel piece sounds a bit odd to me).
This may help explain why people like to say "data point".  If you have enough data to do meaningful statistics, it's easy (and even useful) to start thinking of it in the aggregate.  Working back from that, you can get a point of data, or more usually a data point.

Finally, why would we choose data as the form for the mass noun, rather than the singular datum, by analogy with rock and stone?  First, data is simply used a lot more.  Second, it's data, not datum, that is used in contexts that work for both count nouns and mass nouns ("I can't interpret the data", "Could you send me your data?").  Further, mass nouns like brick and stone only seem to occur in the pattern mentioned above of substance ➝ small chunk of that substance ➝ aggregation of said small chunks.



*  Yes, I've heard of tebibyte but I've never, ever heard it used seriously in real life.  Yes, it's an IEC standard.  Yes, IEEE officially says that a terabyte is exactly a trillion bytes as opposed to 2^40 = 1,099,511,627,776.  No one outside the standards committees cares, and I doubt even they care all that much most of the time.  In theory, it matters whether your disk holds exactly a trillion bytes or close to 10% more.  In practice, either your disk is nearly full or it isn't.  When it fills up, you buy more.  [I've since seen notations like TiB in the wild, but they add little, as far as I can tell.  If it says TB, you still know it means the power of two and not an even trillion.  If you see TiB, you know it means the same thing, but whoever said TiB instead of TB wants you to know they know the difference --D.H.]

Standards are great, and if a standards body wants to, say, limit files to 4,294,967,296 bytes they should either say "files shall be no larger than 4 gibibytes" or be clear that "GB" means 2^30 bytes.  Or they can just say "file sizes are 32 bits".  The rest of us will continue to blithely use the "wrong" units.

That said, perhaps the distinction is becoming more important as the numbers get larger.  Where this all started, with 2^10 = 1,024 being practically 1,000, the error is only 2.4%.  At the megabyte level, the difference is 4.9% and at the gigabyte level, 7.4%.  Once we get into peta- and exa- territory, the errors are 13% and 15%, harder and harder to ignore.  Even then, manufacturers, who one would think might stand to gain by saying 1.1TB instead of 1TB, seem content to say 1TB anyway.  No harm, no foul.
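For anyone who wants to check the arithmetic, here is a small Python sketch. The function name `binary_excess_percent` and the `PREFIXES` list are just labels chosen for this example. It assumes the "binary" reading of each prefix is 2^(10n) and the "decimal" reading is 10^(3n), with n = 1 for kilo, 2 for mega and so on, and reports how far the first overshoots the second.

```python
# Prefix names in order; position n (starting at 1) means 10**(3*n) vs 2**(10*n).
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa"]


def binary_excess_percent(n):
    """Percent by which 2**(10*n) exceeds 10**(3*n)."""
    return (2 ** (10 * n) / 10 ** (3 * n) - 1) * 100


if __name__ == "__main__":
    for n, name in enumerate(PREFIXES, start=1):
        # One decimal place is plenty for this kind of argument.
        print(f"{name:>5}: {binary_excess_percent(n):5.1f}%")
```

The kilo row comes out at 2.4% and the tera row at roughly 10%, matching the "close to 10% more" figure for a terabyte-sized disk.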