Thursday, November 9, 2017

syl·lab·i·fi·ca·tion

[Author's note: When I started this, I thought it was going to touch on deep questions of language and cognition.  It ended up kinda meandering around some random bits of computer word-processing.  This happens sometimes.  I'm posting it anyway since, well, it's already written.  --D.H.]

Newspaper and magazine articles are traditionally typeset in narrow, justified columns. "Justified" here means that every line is the same width (unlike, say, with most blog posts).  If the words aren't big enough to fill out a line, the typesetter will widen the spaces to fill it out.  If the words are a bit too long, the typesetter might move the last word to the next line and then add space to what's left.
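
For the curious, here is a minimal sketch of the "widen the spaces" step. The function name `justify` and the policy of handing leftover spaces to the leftmost gaps are my own assumptions, not anything a real typesetting system is known to do, and it assumes the words already fit in the given width.

```python
def justify(words, width):
    """Pad the gaps between words so the line is exactly `width` characters.

    Assumes the words, separated by single spaces, already fit in `width`.
    """
    if len(words) == 1:
        return words[0]  # a lone word has no gaps to widen
    gaps = len(words) - 1
    spaces = width - sum(len(w) for w in words)
    # Every gap gets `base` spaces; the first `extra` gaps get one more.
    base, extra = divmod(spaces, gaps)
    pieces = []
    for i, word in enumerate(words[:-1]):
        pieces.append(word + " " * (base + (1 if i < extra else 0)))
    pieces.append(words[-1])
    return "".join(pieces)


print(justify(["the", "quick", "fox"], 16))
```

A human typesetter would likely spread the extra space more evenly than "leftmost gaps first", but the idea is the same.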

Originally, a typesetter was a person who physically inserted pieces of lead type into a form.  Later, it was a person operating a Linotype™ or similar machine to do the same thing.  These days it's mostly done by software.

Technically, laying out a paragraph to minimize the amount of extra space is not trivial but certainly feasible; it's the kind of thing that would make a good undergraduate programming exercise.  Several algorithms are available.  They may not always produce results as nice as those of an experienced human typesetter, but they do well enough for most purposes.
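
As an illustration of the undergraduate-exercise version, here is a sketch of one well-known approach: dynamic programming that minimizes the sum of squared leftover space on every line but the last. The function name `break_lines` and the squared-slack cost are assumptions chosen for the example; production systems use more elaborate cost models.

```python
def break_lines(words, width):
    """Split words into lines, minimizing squared leftover space per line.

    The last line is free (it may be as short as it likes). A word longer
    than `width` is placed alone on its own line.
    """
    n = len(words)
    best = [float("inf")] * (n + 1)  # best[i]: minimal cost of laying out words[i:]
    next_break = [n] * (n + 1)       # next_break[i]: index where the line starting at i ends
    best[n] = 0
    for i in range(n - 1, -1, -1):
        length = -1  # cancels the space counted before the first word
        for j in range(i, n):
            length += len(words[j]) + 1
            if length > width and j > i:
                break  # words[i..j] no longer fit on one line
            slack = max(width - length, 0)
            cost = 0 if j == n - 1 else slack ** 2
            if cost + best[j + 1] < best[i]:
                best[i] = cost + best[j + 1]
                next_break[i] = j + 1
    lines, i = [], 0
    while i < n:
        j = next_break[i]
        lines.append(" ".join(words[i:j]))
        i = j
    return lines


print(break_lines("aaa bb cc ddddd".split(), 6))
```

In this example a greedy "fill each line as full as possible" approach would produce "aaa bb" / "cc" / "ddddd", leaving a big gap on the second line, while the version above prefers "aaa" / "bb cc" / "ddddd".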

One option for getting better line breaks and better-looking paragraphs is to hyphenate.  If your layout looks funny because you've got floccinaucinihilipilification in the middle of a line, you might try breaking it up as, say floccinaucinihili-
pilification.  It will probably be easier to lay out those two pieces rather than trying to make room for one large one.

You can't just throw a hyphen in anywhere.  There's a strong tendency to read whatever comes before and after the hyphen as independent units, so you don't want to break at wee-
knights or pre-
aches.

In many languages, probably most, this isn't a big problem.  For example, Spanish has an official set of rules that gives a clear hyphenation for any word (actually there are several of these, depending on what country you're in).  It's hard for English, though, for the same reason that spelling is hard for English -- English spelling is historical, not phonetic, and has so far resisted attempts at standardisation (or standardization) and fonetissizing.

So instead we have the usual suspects, particularly style guides produced by various academic and media organizations.  This leads to statements like this one from the Chicago Manual of Style:
"Chicago favors a system of word division based on pronunciation and more or less demonstrated by the recommendations in Webster’s tenth."
The FAQ list that this quote comes from has a few interesting cases, though I'm not sure that "How should I hyphenate Josephine Bellver's last name?" actually qualifies as a frequently asked question.  The one that interests me here concerns whether it should be "bio-logy" or "biol-ogy".  CMOS opts for "biol-ogy", going by pronunciation rather than etymology.

Which makes sense, in that consistently going by pronunciation probably makes reading easiest.  But it's also a bit ironic, in that English spelling is all about etymology over pronunciation.

Either approach is hard for computers to cope with, since they both require specific knowledge that's not directly evident from the text.  It's common to teach lists of rules, which computers do deal with reasonably well, but the problem with lists of rules for English is that they never, ever work.  For example, it's going to be hard to come up with a purely rule-based approach that divides "bark-ing" but also "bar-keeper".
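
To make that concrete, here is a toy rule of the kind one might be tempted to write. Both the rule ("break between the first two adjacent consonants that follow a vowel") and the name `naive_break` are made up for illustration; no style guide recommends this.

```python
VOWELS = set("aeiou")


def naive_break(word):
    """Toy rule: break between the first two adjacent consonants after a vowel."""
    for i in range(1, len(word) - 1):
        if (word[i - 1] in VOWELS
                and word[i] not in VOWELS
                and word[i + 1] not in VOWELS):
            return word[:i + 1] + "-" + word[i + 1:]
    return word  # no such consonant pair: leave the word alone


print(naive_break("barkeeper"))  # bar-keeper, as desired
print(naive_break("barking"))    # bar-king, not the bark-ing we wanted
```

The rule sees the same letters "ark" in both words and has no way to know that one contains "bark" and the other "bar". You can patch it with a suffix rule for "-ing", and then the next pair of words will break the patch.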

This is why style guides tend to fall back on looser guidance like "divide the syllables as they're pronounced".  Except -- whose pronunciation?  When I was a kid I didn't pronounce an l in also or an n in government (I've since absorbed both of those from my surroundings).  I'm pretty sure most American speakers don't pronounce a t in often.  So how do you hyphenate those according to pronunciation?


Fortunately, computers don't have to figure this out.  A hyphenation dictionary for 100,000 words will cost somewhere around a megabyte, depending on how hard you try to compress it.  That's nothing in modern environments where a minimal "Hello world" program can run into megabytes all by itself (it doesn't have to, but it's very easy to eat a few megabytes on a trivial program without anyone noticing).

But what if the hyphenator runs across some new coinage or personal name that doesn't appear in the dictionary -- for example, whoever put the dictionary together didn't know about Josephine Bellver?  One option is just not to try to hyphenate those.  A refinement of that would be to allow the author to explicitly add a hyphen.  This should be the special "optional hyphen" character, so that you don't get hyphens showing up in the middle of lines if you later edit the text.  That way if you invent a really long neologism, it doesn't have to mess up your formatting.
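
In Unicode the optional hyphen is U+00AD (SOFT HYPHEN). Here is a rough sketch of how a layout routine might use it, assuming the author has marked the allowed break points by hand. The helper names and the particular break points in the example word are my own choices for illustration.

```python
SOFT_HYPHEN = "\u00ad"  # Unicode SOFT HYPHEN: invisible unless a line breaks there


def strip_soft_hyphens(text):
    """What the reader sees when no line break falls inside the word."""
    return text.replace(SOFT_HYPHEN, "")


def split_to_fit(word, room):
    """Break `word` at the last soft hyphen whose head fits in `room` columns.

    The head includes a visible hyphen. Returns (head, tail), or
    (None, word) if no marked break point fits.
    """
    pieces = word.split(SOFT_HYPHEN)
    for k in range(len(pieces) - 1, 0, -1):
        head = "".join(pieces[:k])
        if len(head) + 1 <= room:  # +1 for the visible hyphen
            return head + "-", SOFT_HYPHEN.join(pieces[k:])
    return None, word


# Break points chosen by hand for this example.
word = SOFT_HYPHEN.join(["flocci", "nauci", "nihili", "pili", "fication"])
head, tail = split_to_fit(word, 20)
print(head)                      # floccinaucinihili-
print(strip_soft_hyphens(tail))  # pilification
```

Note that the tail keeps its remaining soft hyphens, so if the text is edited and the paragraph reflows, the word can be broken somewhere else or not at all.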

If there's a point to any of this, it's that computers don't have to follow specific rules, except in the sense that anything a computer does follows specific rules.  While it might be natural for a compugeek to try to come up with the perfect hyphenation algorithm, the better engineering solution is probably to treat every known word as a special case and offer a fallback (or just punt) when that fails.
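
A sketch of that "every word is a special case" design might look like the following. The tiny table stands in for a real 100,000-word dictionary, its entries use only the break points mentioned above, and the `overrides` parameter and the "Bell-ver" break are hypothetical, standing in for whatever the author supplies.

```python
# Stand-in for a real hyphenation dictionary; entries are illustrative only.
HYPHENATION = {
    "barking": "bark-ing",
    "barkeeper": "bar-keeper",
    "biology": "biol-ogy",
}


def hyphenate(word, overrides=None):
    """Return the pieces a word may be broken into.

    Author-supplied overrides win, then the dictionary; unknown words are
    returned whole (the "just punt" option). Dictionary pieces come back in
    lowercase; a real implementation would preserve the original capitalization.
    """
    key = word.lower()
    if overrides and key in overrides:
        return overrides[key].split("-")
    if key in HYPHENATION:
        return HYPHENATION[key].split("-")
    return [word]


print(hyphenate("barking"))                           # ['bark', 'ing']
print(hyphenate("Bellver"))                           # ['Bellver']
print(hyphenate("Bellver", {"bellver": "Bell-ver"}))  # ['Bell', 'ver']
```

There is no cleverness here at all, which is rather the point.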

This wasn't always the right tradeoff.  Memory used to be expensive, and a tightly coded algorithm will be much smaller than a dictionary.  But even then, there are tricks to be employed.  One of my all-time favorite hacks compressed a spelling dictionary down to a small bitmap that didn't even try to represent the actual words.  I'd include a link, but the only reference I know for it, Programming Pearls by Jon Bentley, isn't online.
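
To give the flavor of the idea, here is a Bloom-filter-style sketch: hash each word to a few bit positions and store only the bits. This is my own reconstruction of the general technique and may well differ in its details from the hack described in the book; the class name, the bitmap size, and the use of SHA-256 are all assumptions.

```python
import hashlib


class WordBitmap:
    """Remember a set of words as bits only; the words themselves are not stored.

    Lookups can give false positives (a misspelling that happens to hit
    set bits) but never false negatives.
    """

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, word):
        # Derive n_hashes bit positions by salting the word with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, word):
        for p in self._positions(word):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, word):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(word))


known = WordBitmap()
for w in ["barking", "barkeeper", "biology"]:
    known.add(w)
print("barking" in known)  # True
print(len(known.bits))     # 8192 bytes, however many words are added
```

The tradeoff is that the bitmap can only answer "probably a word" or "definitely not a word", which is fine for a spelling checker but would not be enough to recover hyphenation points.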

1 comment:

  1. This one bothers linguists, too. It's really pretty easy to write a set of rules that will specify all and only the phonologically possible syllables of a language, at least of the languages I've any experience with, but phonologists will argue that the syllable is not a necessary entity for describing all the phonological words. Other phonologists say, so what-- it exists, so you have to describe it. It's also fairly straightforward to argue that the syllable exists as a perceptual unit, or something equally psychological. But we usually want to hyphenate at morpheme boundaries, no matter what CMOS might think, and these routinely ignore syllable boundaries.
    Incidentally, the k of "barkeep" will be more strongly aspirated than that of "barking," not, I think, because of the secondary stress on "-keep".
