Thursday, December 7, 2017

Where should I file this, and do I care?

I used to love to browse the card catalog at the local library (so yep ... geek).  This wasn't just for the books, but for the way they were organized.  The local library, along with my middle and high school libraries, used the Dewey Decimal Classification (or "Dewey Decimal System" as I remember it actually being called).

This was, to my eyes, a beautiful way of sorting books.  The world was divided into ten categories, each given a range of a hundred numbers, from 000-099 for "information and general works" (now also including computer science) to 900-999 for history and geography.  Within those ranges, subjects were further divided by number.  Wikipedia gives a good example:
500 Natural sciences and mathematics
  510 Mathematics
    516 Geometry
      516.3 Analytic geometries
        516.37 Metric differential geometries
          516.375 Finsler geometry
Finsler geometry is pretty specialized (a Finsler manifold is a differentiable manifold whose metric has particular properties -- I had to look that up).  Clearly you could keep adding digits as long as you like, slicing ever finer, though in practice there are never more than a few (maybe just three?) after the decimal point.
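To make the "keep adding digits" idea concrete, here is a minimal Python sketch that takes a Dewey-style number and lists the chain of ever-narrower classes it sits in. The function name `dewey_ancestors` is made up for illustration, and the sketch only assumes the pattern visible in the example above (three digits, then optional digits after the decimal point). It knows nothing about the official schedules or what the numbers mean.

```python
def dewey_ancestors(number: str) -> list[str]:
    """Return the classes containing `number`, broadest first.

    Hypothetical helper: it follows only the digit pattern shown in the
    example above, not the real Dewey schedules.
    """
    whole, _, fraction = number.partition(".")
    whole = whole.zfill(3)  # "4" is treated as "004"
    chain: list[str] = []

    # Main class, division, section: keep 1, 2, then 3 leading digits
    # and pad the rest with zeros (5 -> 500, 51 -> 510, 516 -> 516).
    for i in range(1, 4):
        label = whole[:i].ljust(3, "0")
        if not chain or chain[-1] != label:  # skip repeats such as 500, 500
            chain.append(label)

    # Each digit after the decimal point slices the subject finer.
    for i in range(1, len(fraction) + 1):
        chain.append(f"{whole}.{fraction[:i]}")

    return chain


print(dewey_ancestors("516.375"))
# ['500', '510', '516', '516.3', '516.37', '516.375']
```

The containment is all in the prefixes: every class in the chain is just a shorter version of the number below it, which is why a stray book can be reshelved by reading its number left to right.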

With the Dewey classification in place, you could walk into libraries around the country, indeed around the world, and quickly figure out where, say, the books on gardening, medieval musical instruments or truck repair were located.  If you or the librarian found a book lying around, you could quickly put it back in its proper place on the shelves.  If you found a book you liked, you could find other books on related topics near it, whether on the shelves or in the card catalog (what's that, Grandpa?).

On top of that, the field of library science, in which the Dewey classification and others like it* play a central role, is one of the precursors of computer science as we know it.  This is true at several levels, from the very idea of organizing large amounts of information (and making it universally accessible and useful), to the idea of using an index that can easily be modified as new items are added.

There's one other very significant aspect of library classification systems like Dewey: hierarchy.

It's almost too obvious to mention, but in the Dewey Classification, and others like it, the world is organized into high-level categories (natural sciences and mathematics), which contain smaller, more specific categories (mathematics), and so on down to the bottom level (Finsler geometry).  There are lots and lots of systems like this -- federal/state/local government in the US and similar systems elsewhere; domain/kingdom/phylum/class/order/family/genus/species in taxonomy; supercluster/galaxy cluster/galaxy/star system/star in astronomy; top-level-domain/domain/.../host and so forth.

Strictly speaking, this sort of structure is a containment hierarchy, where higher levels contain lower levels.  There are other sorts of hierarchies, for example primary/secondary/tertiary colors.  However, containment hierarchies are the most prominent kind.  Even hierarchies such as rank generally have containment associated with them -- if a colonel reports to a general, then that general is ultimately in command of the colonel's units (and presumably others).  The term hierarchy itself comes from the Greek for "rule of a high priest".  One of the most notable examples, of course, is the hierarchy of the Catholic church.

Containment hierarchies organize the world into units that our minds seem pretty good at comprehending, which is probably why we're willing to overlook a major drawback: containment hierarchies almost always leak.

There are some possible exceptions.  One that comes to mind is the hierarchy of molecule/atom/subatomic particle/quark implied by the Standard Model.  Molecules are always composed of atoms and atoms of subatomic particles.  Of the subatomic particles in an atom, electrons (as far as we know) are elementary, having no simpler parts, while protons and neutrons are composed of quarks which (as far as we know) are also elementary.

Even here there are some wrinkles.  There are other elementary particles besides electrons and quarks that are not parts of atoms.  Electrons, protons and neutrons can all exist independently of atoms.  Some elements can exist without forming molecules.  Electrons in some types of molecule may not belong to particular atoms.  Even defining which atoms belong to which molecules can get tricky.

Perhaps a better example would be the classification of the types of elementary particles.  All (known) particles are unambiguously quarks, leptons, gauge bosons or scalar bosons.  Leptons and quarks are subdivided into generations, again with no room for ambiguity.  There are similar hierarchies in mathematics and other fields.

For most hierarchies, though, you have more than a bit of a mess.  Cities cross state lines, and while the different parts administratively belong to separate states, there will typically be citywide organizations, some with meaningful authority, that span both.  Defining species and other taxonomic groups is notoriously contentious**.  One of the key points of Darwin's Origin is that you can't always find a satisfactory boundary -- the whole point of Origin is to explain why we so often can.

In astronomy, the designations of supercluster, galaxy cluster, galaxy and star system can all become murky or even arbitrary when several are interacting -- is that one merged galaxy, or two galaxies in the process of merging?  The distinction between star and planet can be troublesome as well, so it may not always be clear whether you have a planet orbiting a star or two companion stars.

On the internet, the distinction in notation between nested domains and hosts is clear, but the same (physical or virtual) computer can have multiple identities, even in different domains, and multiple computers can share the same host identity.  On the internet, what matters is which packets you respond to (and no one knows you're a dog).

And, of course, organization charts, arguably the prototypical example of a containment hierarchy, are in real life more what you'd call guidelines.  Beyond "dotted-line reports" and such, most real work crosses team boundaries and if everyone waited for every decision to percolate up and down the chain of command appropriately, nothing would get done (I've seen this attempted.  It did not go well).


So why group things into hierarchies anyway?

Again, there's clearly something about our minds that finds them natural.  In the early days of PCs, some of the prominent players originally started out storing files in one "flat" space.  If a floppy disk typically only held a handful of files, or even a few dozen, there was no harm in just listing them all out.  It didn't take long, however, until that got unwieldy.  People wanted to group related files together and, just as importantly, keep unrelated files separate.  Before long, all the major players had ported the concept of a "directory" or "folder" from the earlier "mainframe" operating systems -- which had themselves gone through roughly the same evolution.

Since computer scientists love nothing more than recursion, folders themselves could contain folders, and so on as far as you liked.  Somehow it didn't seem to bother anyone that this couldn't possibly work in a physical folder in a physical file drawer.

This all brought a new problem -- how to put things into folders.  There are at least two varieties of this problem (hmm ... problem subdivided into varieties ...).

For various reasons, some files needed to appear in multiple folders in identical form.  This is a problem not only for space reasons, but because you'd really like a change in a common file to show up everywhere instead of having to make the same change in an unknown number of copies.   This led to the rediscovery of "shortcuts" and "symbolic links", again already part of older operating systems, which allowed you to show the same physical file under multiple folders at the same time.
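As a rough illustration of the symbolic link idea, here is a small Python sketch using the standard library's `os.symlink`. The folder and file names (`folder_a`, `folder_b`, `notes.txt`) and the function name `demo_shared_file` are invented for the example, and it assumes a system where creating symlinks is allowed (on Windows this can require extra privileges).

```python
import os
import tempfile


def demo_shared_file() -> tuple[str, str]:
    """Show one file under two folders and edit it through the link.

    All names here are hypothetical; everything happens in a
    throwaway temporary directory.
    """
    with tempfile.TemporaryDirectory() as root:
        folder_a = os.path.join(root, "folder_a")
        folder_b = os.path.join(root, "folder_b")
        os.mkdir(folder_a)
        os.mkdir(folder_b)

        original = os.path.join(folder_a, "notes.txt")
        link = os.path.join(folder_b, "notes.txt")

        with open(original, "w") as f:
            f.write("draft")

        # The link is a second name for the same file, not a copy.
        os.symlink(original, link)

        # Writing through the link changes the one underlying file.
        with open(link, "w") as f:
            f.write("final")

        with open(original) as f:
            via_original = f.read()
        with open(link) as f:
            via_link = f.read()

    return via_original, via_link


print(demo_shared_file())  # ('final', 'final')
```

There is only ever one set of contents, so a change made under either folder shows up under both, which is exactly what a pile of copies can't give you.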

When it comes to organizing human-readable information, there's a different problem -- it's not always clear what folder to put things in.  Does a personal financial document go in the "personal" folder or the "financial" folder?  This problem leads us right back to ontology (the study of, among other things, how to categorize things) and library science.  Library science has always had to deal with this problem as well.  Does a book on the history of mathematics go under history (900s) or mathematics (510s)?

There are always cases where you just have to decide one way or another, and then try to make the same arbitrary decision consistently in the future, hoping that some previously-unseen common thread will emerge that can then be codified into a rule.


The upshot, I think, is that hierarchies are a useful tool for organizing things for the convenience of human minds, not a property of the universe itself (except, arguably, in cases such as subatomic particles as discussed above).  As with any tool, there are costs and benefits to its use and it's best to weigh them before charging ahead.  Imposing a hierarchy that doesn't fit well isn't just wasted effort.   It can actively obscure what's really going on.

Interestingly enough, I now work for a company that takes an entirely different approach to organizing knowledge.  Don't worry about where something should be, or what grouping it should be in.  Just search for what's in it.

This has been remarkably successful.  It may be hard to remember, but for a while there was a brisk business in manually curating and categorizing information.  It's still done, of course, because it's still a useful exercise in some contexts, but it's no longer the primary way we find information on the web.  Now we just search.
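For a feel of what "just search" means mechanically, here is a toy Python sketch of an inverted index, a table from each word to the documents containing it. This is a textbook simplification and makes no claim about how any real search engine works. The function names and the sample documents are invented for the example.

```python
from collections import defaultdict


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase word to the names of documents containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[word].add(name)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical documents; note that nothing is filed in any folder.
docs = {
    "taxes.txt": "personal financial records",
    "budget.txt": "household financial plan",
    "diary.txt": "personal notes",
}
index = build_index(docs)

print(sorted(search(index, "financial")))           # ['budget.txt', 'taxes.txt']
print(sorted(search(index, "personal financial")))  # ['taxes.txt']
```

The document that is both personal and financial never has to pick a home. It simply turns up under either word, or both.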

OK, time to hit Publish.  Oh wait ... what labels should I put on this post?



* Dewey isn't the only game in town, just the one most widely used in US primary and secondary schools.  The local university library uses the Library of Congress Classification, which uses letters and numbers in a way that made my brain melt, not so much for looking more complex, I think, as for not being Dewey.

** My understanding is that the idea of a clade -- all (living) organisms descended from a given ancestor -- has come to be at least as important as the traditional taxonomic groupings, at least in some contexts, but I'm not a biologist.