Wednesday, August 26, 2015

Grammars ... not so fast

In the previous post, I made the rather bold statement that the effort to describe natural languages with formal grammars had been "without great success".  I probably could have put that better.

I was referring at the time to the effort to define a "universal grammar" that would capture the essence of linguistic competence, that is, people's knowledge of language, independent of performance, that is, what people actually say.  A grammar, in that sense, is a set of rules that will produce all, and only, the sentences that speakers of a given language know how to produce.

In that sense, the search for grammars for natural languages has not had great success.


However, there is another line of inquiry that has had notable success, namely the effort to parse natural language, that is, to take something that someone actually said or wrote and give an account of how the words relate to each other that matches well with what actual people would say about the same sentence.

For example, in Colorless green ideas sleep furiously, we would say that colorless and green are describing ideas, ideas are what's sleeping, and furiously modifies sleep.  And so would any number of parsers that have been written over the years.

This is an important step towards actually understanding language, whatever that may mean, and it indicates that, despite what I said previously, there may be such a thing as syntax independent of semantics.  We can generally handle a sentence like Colorless green ideas sleep furiously the same as Big roan horses run quickly, without worrying about what it might mean for an idea to be both green and colorless or for it to sleep.


So what gives?  There are at least two major differences between the search for a universal grammar and the work on parsers.

First, the two efforts have significantly different goals.  It's not entirely clear just what a universal grammar might be, which was one of the main points of the previous post, not to mention many more knowledgeable critiques.  If universal grammar is a theory, what facts is it being asked to explain?

As far as I understand it, the idea is to explain why people say and comprehend what they do, by producing a grammar that enumerates what sentences they are and aren't competent to say, keeping in mind that people may make mistakes.

However, narrowing this down to "all and only" the "right" sentences for a given language has proved difficult, particularly since it's hard to say where the line between "competence" and "performance" lies.  If someone says or understands something that a proposed universal grammar doesn't generate, that is, something outside their supposed "competence", what does that mean?  If competence is our knowledge of language, then is the person somehow saying something they don't know how to say?

The work on parsing has very well-defined goals.  Typically, algorithms are evaluated by running a large corpus of text through them and comparing the results to what people came up with.  As always, one can argue with the methodology, and in particular there is a danger of "teaching to the test", but at least it's very clear what a parser is supposed to do and whether a particular parser is doing it.  Research in parsing is not trying to characterize "knowledge", but rather to replicate particular behavior, in this case how people will mark up a given set of sentences.

Second, the two efforts tend to use different tools.  Much work on universal grammar has focused on phrase-structure grammars, transformational grammars and in general (to the small extent I understand the "minimalist program") on nested structures: A sentence consists of a verb part and noun parts, a verb part includes a main verb and modifying clauses, which may in turn consist of smaller parts, etc.

While there are certainly parsing approaches based on this work, including some fairly successful ones, several successful natural language parsers are based on dependency grammars, which focus on the relations among words rather than a nesting of sentence parts.  Where a phrase-structure grammar would say that colorless green ideas is a noun phrase and so forth, a dependency grammar would say that colorless and green depend on ideas (in the role of adjectives).  Dependency grammars can trace their roots back to some of the earliest known work in grammar hundreds of years ago, but for whatever reason they seemed to fall out of favor for much of the late 20th century.

Leaving aside some interesting questions about issues like "non-projective dependencies" (where the lines from the dependent words to the words they depend on cross when the sentence is laid out in its written or spoken order), it's easy to build parsers for dependency grammars using basically the same technique (shift-reduce parsing) as compilers for computer languages use.  These parsers tend to be quite fast, and about as accurate as parsers based on phrase-structure grammars.
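To make the shift-reduce connection concrete, here is a bare-bones sketch of the "arc-standard" style of transition-based dependency parsing.  In a real parser a trained model decides which transition to take at each step; here the transition sequence for Colorless green ideas sleep furiously is simply supplied by hand, so the code only shows the mechanics (the function and variable names are my own, made up for illustration).

  # A stack, a buffer of unread words, and three kinds of moves:
  #   SHIFT      push the next word onto the stack
  #   LEFT-ARC   the top of the stack is the head of the word under it
  #   RIGHT-ARC  the word under the top is the head of the top
  def parse(words, transitions):
      stack, buffer, arcs = ["ROOT"], list(words), []
      for action in transitions:
          if action == "SHIFT":
              stack.append(buffer.pop(0))
          elif action == "LEFT-ARC":
              head, dependent = stack[-1], stack.pop(-2)
              arcs.append((head, dependent))
          elif action == "RIGHT-ARC":
              dependent, head = stack.pop(), stack[-1]
              arcs.append((head, dependent))
      return arcs

  sentence = "colorless green ideas sleep furiously".split()
  actions = ["SHIFT", "SHIFT", "SHIFT", "LEFT-ARC", "LEFT-ARC",
             "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"]
  for head, dependent in parse(sentence, actions):
      print(dependent, "<-", head)
  # green <- ideas, colorless <- ideas, ideas <- sleep,
  # furiously <- sleep, sleep <- ROOT

The arcs that come out are exactly the dependencies described above: colorless and green depend on ideas, ideas and furiously depend on sleep, and sleep depends on an artificial ROOT.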


In short, there is a lot of interesting and successful work going on concerning the syntax of natural languages, just not in the direction (universal grammar and such) that I was referring to in the previous post.

Tuesday, August 18, 2015

What, if anything, is a grammar?

This is going to be one of the more technical posts.  As always, I hope it will be worth wading through the details.

There are hundreds, if not thousands, of computer languages.  You may have heard of ones like Java, C#, C, C++, ECMAScript (aka JavaScript, sort of), XML, HTML (and various other *ML, not to mention ML, which is totally different), Scheme, Python, Ruby, LISP, COBOL, FORTRAN etc., but there are many, many others.  They all have one thing in common: they all have precise grammars in a certain technical sense: A set of rules for generating all, and only, the legal programs in the given language.

This is true even if no one ever wrote the grammar down.  If you can actually use a language, there is an implementation that takes code and tells a computer to follow its instructions.  That implementation will accept some set of inputs and reject some other set of inputs, meaning it encodes the grammar of the language.

Fine print, because I can't help myself: I claim that if you don't have at least a formal grammar or an implementation, you don't really have a language.  You could define a language that, say, accepts all possible inputs, even if it doesn't do anything useful with them.  I'd go math major on you then and claim that the set of "other code" it rejects is empty, but a set nonetheless.  I'd also argue that if two different compilers are nominally for the same language but behave differently, not that that would ever happen, you have two variants of the same language, or technically, two different grammars.

Formal grammars are great in the world of programming languages.  They tell implementers what to implement.  They tell programmers what's legal and what's not, so you don't get into (as many) fights with the compiler.  Writing a formal grammar helps designers flush out issues that they might not have thought of.  There are dissertations written on programming language grammars.  There are even tools that can take a grammar and produce a "parser", which is code that will tell you if a particular piece of text is legal, and if it is, how it breaks down into the parts the grammar specifies.

Here's a classic example of a formal grammar:
  • expr ::= term | term '+' term
  • term ::= factor | factor '*' factor
  • factor ::= variable | number | '(' expr ')'
In English, this would be
  • an expression is either a term by itself or a term, a plus sign, and another term
  • a term is either a factor by itself or a factor, a multiplication sign ('*') and another factor
  • a factor is either a variable, a number or an expression in parentheses
Strictly speaking, you'd have to give precise definitions of variable and number, but you get the idea.

Let's look at the expression (x + 4)*(6*y + 3).  Applying the grammar, we get
  • the whole thing is a term
  • that term is the product of two factors, namely (x + 4) and (6*y + 3)
  • the first factor is an expression, namely x+4, in parentheses
  • that expression is the sum of two terms, namely x and 4
  • x is a factor by itself, namely the variable x
  • 4 is a factor by itself, namely the number 4
  • 6*y + 3 is the sum of two terms, namely 6*y and 3
  • and so forth
One thing worth noting is that, because terms are defined as the product of factors, 6*y + 3 is automatically interpreted the way it should be, with the multiplication happening first.  If the first two rules had been switched, it would have been interpreted as having the addition happening first.  This sort of thing makes grammars particularly useful for designing computer languages, as they give precise rules for resolving ambiguities.  For the same reason they also make it slightly tricky to make sure they're defining what you think they're defining, but there are known tools and techniques for dealing with that, at least when it comes to computer languages.
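For the curious, here is roughly what a hand-written parser for this little grammar looks like, with one function per rule.  This is only a sketch under the grammar exactly as written above (so at most one '+' per expression and one '*' per term), with none of the error handling a real compiler would need.

  import re

  def tokenize(text):
      # numbers, variable names, and the symbols + * ( )
      return re.findall(r"\d+|[A-Za-z_]\w*|[+*()]", text)

  class Parser:
      def __init__(self, text):
          self.tokens = tokenize(text)
          self.pos = 0

      def peek(self):
          return self.tokens[self.pos] if self.pos < len(self.tokens) else None

      def take(self):
          token = self.peek()
          self.pos += 1
          return token

      def expr(self):          # expr ::= term | term '+' term
          left = self.term()
          if self.peek() == "+":
              self.take()
              return ("+", left, self.term())
          return left

      def term(self):          # term ::= factor | factor '*' factor
          left = self.factor()
          if self.peek() == "*":
              self.take()
              return ("*", left, self.factor())
          return left

      def factor(self):        # factor ::= variable | number | '(' expr ')'
          token = self.take()
          if token == "(":
              inside = self.expr()
              assert self.take() == ")", "expected ')'"
              return inside
          return token

  print(Parser("(x + 4)*(6*y + 3)").expr())
  # ('*', ('+', 'x', '4'), ('+', ('*', '6', 'y'), '3'))

The shape of the output mirrors the analysis above: the outermost operation is the multiplication of the two parenthesized factors, and inside the second one the multiplication of 6 and y happens before the addition of 3.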


The grammar I gave above is technically a particular kind of grammar, namely a context-free grammar, which is a particular kind of phrase-structure grammar.  There are several other kinds.  One way to break them down is the Chomsky hierarchy, which defines a series of types of grammar, each more powerful than the last in the sense that it can recognize more kinds of legal input (from regular grammars at the bottom, through context-free and context-sensitive grammars, to unrestricted grammars at the top).  A context-free grammar is somewhere in the middle.  It can define most of the rules for real programming languages, but typically you'll need to add rules like "you have to say what kind of variable x is before you can use it" that won't fit in the grammar itself.

At the top of the Chomsky hierarchy is a kind of grammar that is as powerful as a Turing machine, meaning that if you can't specify a language with such a grammar, no computer can compute whether or not a given sentence is legal according to the grammar.  From a computing/AI point of view (and even from a theoretical point of view), it sure would be nice if real languages could be defined by grammars somewhere in the Chomsky hierarchy.  As a result, decades of research have been put into finding such grammars for natural languages.

Without great success.

At first glance, it seems like a context-free grammar, or something like it, would be great for the job.  The formal grammars in the Chomsky hierarchy are basically more rigorous versions of the grammars that were taught, literally, in grammar schools for centuries.  Here's a sketch of a simple formal grammar for English:
  • S ::= NP VP (a sentence is a noun phrase followed by a verb phrase)
  • NP ::= N | Adj NP (a noun phrase is a noun or an adjective followed by a noun phrase)
  • VP ::= V | VP Adv (a verb phrase is a verb or a verb phrase followed by an adverb)
Put this together with a list of nouns, verbs, adjectives and adverbs, and voila: English sentences.  For example, if ideas is a noun, colorless and green are adjectives, sleep is a verb and furiously is an adverb, then we can analyze Colorless green ideas sleep furiously with this grammar just like we could analyze (x + 4)*(6*y + 3) above. 
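Just to show there's no magic involved, a brute-force recognizer for this three-rule grammar fits in a few lines.  This is a sketch only: the little lexicon is made up, and a real parser would memoize spans rather than recompute them (and would return the structure, not just yes or no).

  # The three rules above, plus a tiny lexicon standing in for the word lists.
  LEXICON = {"colorless": "Adj", "green": "Adj", "ideas": "N",
             "sleep": "V", "furiously": "Adv"}
  RULES = {
      "S":  [["NP", "VP"]],
      "NP": [["N"], ["Adj", "NP"]],
      "VP": [["V"], ["VP", "Adv"]],
  }

  def derives(symbol, words, i, j):
      """Can `symbol` generate words[i:j]?"""
      if symbol not in RULES:                    # part-of-speech tag: match one word
          return j - i == 1 and LEXICON.get(words[i]) == symbol
      for alternative in RULES[symbol]:
          if len(alternative) == 1:
              if derives(alternative[0], words, i, j):
                  return True
          else:                                  # two symbols: try every split point
              left, right = alternative
              if any(derives(left, words, i, k) and derives(right, words, k, j)
                     for k in range(i + 1, j)):
                  return True
      return False

  sentence = "colorless green ideas sleep furiously".split()
  print(derives("S", sentence, 0, len(sentence)))   # True
  print(derives("S", "sleep ideas".split(), 0, 2))  # False: no verb-first rule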

Granted, this isn't a very good grammar as it stands.  It doesn't know how to say A colorless green idea sleeps furiously or I think that colorless green ideas sleep furiously or any of a number of other perfectly good sentences, but it's not hard to imagine adding rules to cover cases like these.

Likewise, there are already well-studied notions of subject-verb agreement, dependent clauses and such.  Translating them into formal rules doesn't seem like too big a step.  If we can just add all the rules, we should be able to build a grammar that will describe "all and only" the grammatical sentences in English, or whatever other language we're analyzing.

Except, what do we mean by "grammatical"?

In the world of formal grammar, "grammatical" has a clear meaning: If the rules will generate a sentence, it's grammatical.  Otherwise it's not.

When it comes to people actually using language, however, things aren't quite so clear.  What's OK and what's not depends on context, whom you're talking to, who's talking and all manner of other factors.  Since people use language to communicate in the real world, over noisy channels and often with some uncertainty over what the speaker is trying to say and particularly over how the listener will interpret it, it's not surprising that we allow a certain amount of leeway.

You may have been taught that constructs like ain't gonna and it don't are ungrammatical, and you may not say them yourself, but if someone says them to you, you'll still know what they mean.  If a foreign-born speaker says something that makes more sense in their language than yours, something like I will now wash myself the hands, you can still be pretty certain what they're trying to say.  Even if someone says something that seems like nonsense, say That's a really call three, but there's a lot of noise, you'll probably assume they meant something else, maybe That's a really tall tree, if they're pointing at a tall tree.

In short, there probably isn't any such thing as "grammatical" vs. "ungrammatical" in practical use, particularly once you get away from the notion that "grammatical" means "like I was taught in school" -- and as far as that goes, different schools teach different rules and there is plenty of room for interpretation in any particular set of rules.  To capture such uncertainty, linguistics recognizes a distinction between competence (one's knowledge of language) and performance (how people actually use language), though not all linguists draw a sharp line between the two.

When there is variation among individuals and over time, science tends to take a statistical view.  It's often possible to say very precisely what's going on statistically even if it's impossible to tell what will happen in any particular instance.  When there is communication in the presence of noise and uncertainty, information theory can be a useful tool.  It doesn't seem outrageous that a grammar aimed at describing real language would take a statistical approach.  Whatever approach it takes, it should at least provide a general account of what its analyses mean in information-theoretic terms.

[I didn't elaborate on this in the original post, but as I understand it, Chomsky has strongly disputed both of these notions.  Peter Norvig quotes Chomsky as saying "But it must be recognized that the notion of probability of a sentence is an entirely useless one, under any known interpretation of this term."  This is certainly forceful, but not convincing, even after following the link to the original context of the quote, which contains such interesting assertions as "On empirical grounds, the probability of my producing some given sentence of English [...] is indistinguishable from the probability of my producing some given sentence of Japanese.  Introduction of the notion of "probability relative to a situation" changes nothing."  This must be missing something, somewhere.  If I'm a native English speaker and my hair is on fire, it's much more likely that I'm about to say "My hair is on fire!" than "閑けさや 岩にしみいる 蝉の声" or even "My, what lovely weather we're having".

The entire article by Norvig is well worth reading.  But I digress.]
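For what it's worth, the usual way to make "the probability of a sentence" concrete is an n-gram language model: count how often words follow one another in some corpus, then multiply the resulting conditional probabilities.  The sketch below uses a made-up three-sentence corpus and essentially no smoothing, so it's only meant to show what kind of object such a probability is, not to defend any particular model.

  from collections import Counter

  corpus = [
      "my hair is on fire".split(),
      "my what lovely weather we are having".split(),
      "the horse raced past the barn".split(),
  ]

  unigrams, bigrams = Counter(), Counter()
  for sent in corpus:
      padded = ["<s>"] + sent + ["</s>"]
      unigrams.update(padded[:-1])
      bigrams.update(zip(padded[:-1], padded[1:]))

  def probability(sentence):
      """Crude maximum-likelihood bigram probability of a sentence."""
      p = 1.0
      padded = ["<s>"] + sentence.split() + ["</s>"]
      for prev, word in zip(padded[:-1], padded[1:]):
          p *= bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
      return p

  print(probability("my hair is on fire"))   # about 0.33
  print(probability("fire on is hair my"))   # 0.0

Even relative to this tiny corpus, My hair is on fire comes out far more probable than a scrambling of the same words, which is really all the notion needs in order to be useful.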

Even if you buy into the idea of precise, hard-and-fast rules describing a language, there are several well-known reasons to think that there is more to the picture than what formal grammars describe.  For example, consider these two sentences:
  • The horse raced past the barn fell.
  • The letter sent to the president disappeared.
Formally, these sentences are essentially identical, but chances are you had more trouble understanding the first one, because raced can be either transitive, as in I raced the horse past the barn (in the sense of "I was riding the horse in a race"), or intransitive, as in The horse raced past the barn (the horse was moving quickly as it passed the barn -- there's also "I was in a race against the horse", but never mind).  On reading The horse raced, it's natural to think of the second sense, yet the whole sentence only makes sense if you take the first, transitive sense of raced as a passive participle: the horse that was raced past the barn.

Sentences like The horse raced past the barn fell are called "garden path" sentences, as they "lead you down the garden path" to an incorrect analysis and you then have to backtrack to figure them out.

A purely formal grammar describes only the structure of sentences, not any mechanism for parsing them, that is, recovering the structure of the sentence from its words.  There's good experimental data, though, to suggest that we parse The horse raced past the barn fell differently from The letter sent to the president disappeared.

A purely formal grammar doesn't place any practical limits on the sentence that it generates.  From a formal point of view,  The man whom the woman whom the dog that the cat that walked by the fence that was painted blue scratched barked at waved to was tall is a perfectly good sentence, and would still be a perfectly good sentence if we replaced the fence that was painted blue with the fence that went around the yard that surrounded the house that was painted blue, giving The man whom the woman whom the dog that the cat that walked by the fence that went around the yard that surrounded the house that was painted blue scratched barked at waved to was tall.

You could indeed diagram such a sentence, breaking it down into its parts, but I doubt very many people would say they understood it if it were spoken to them at a normal rate.  If you asked "Was it the house or the fence that was painted blue?", you might not get an answer better than guessing (I'm sure such experiments have been done, but I don't have any data handy).  Again, the mechanism used for understanding the sentence places limits on what people will say in real life.

In fact, while most languages, if not all, allow for dependent clauses like that was painted blue, they're used relatively rarely.  Nested clauses like the previous example are even rarer.  I recall recently standing in a fast food restaurant looking at the menu, signs, ads, warnings and so forth and realizing that a large portion of what I saw wasn't even composed of complete sentences, much less complex sentences with dependent clauses: Free side of fries with milkshake ... no climbing outside of play structure ... now hiring.  Only when I looked at the fine-print signs on the drink machine and elsewhere did I see mostly "grammatical" text, and even that was in legalese and clearly meant to be easily ignored.

In short, there are a number of practical difficulties in applying the concepts of formal grammar to actual language.  In the broadest sense, there isn't any theoretical difficulty.  Formal grammars can describe anything computable, and there's no reason to believe that natural language processing is uncomputable, even if we haven't had great success with this or that particular approach.  The problems come in when you try to apply that theory to actual language use.


Traditionally, the study of linguistics has been split into several sub-disciplines, including
  • phonology, the study of how we use sounds in language, for example why we variously use the sounds -s, -z and -ez to represent plurals and similar endings in English (e.g., cats, beds and dishes)
  • morphology, the study of the forms of words, for example how words is composed of word and a plural marker -s
  • syntax, the study of how words fit together in sentences.  Formal grammars are a tool for analyzing syntax
  • semantics, the study of how the meaning of a sentence can be derived from its syntax and morphology
  • pragmatics, the study of how language is actually used -- are we telling someone to do something? asking for information? announcing our presence? shouting curses at the universe?
It's not a given that actual language respects these boundaries.  Fundamentally, we use language pragmatically, sometimes to convey fine shades of meaning but sometimes in very basic ways.  We tolerate significant variation in syntax and morphology.  It's somewhat remarkable that we recognize a distinction between words and parts of words at all.

In fact, it's somewhat remarkable that we recognize discrete words.  On a sonogram of normal speech, it's not at all obvious where words start and end.  Clearly our speech-processing system knows something that our visual system -- which is itself capable of many remarkable things -- can't pick out of a sonogram.

Nonetheless, we can in fact recognize words and parts of words.  We know that walk, walks, walking and walked are all forms of the same word and that -s, -ing, and -ed are word-parts that we can use on other words.  If someone says to you I just merped, you can easily reply What do you mean, "merp"? What's merping?  Who merps?
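As a toy illustration of how mechanical the easy cases are, here is a deliberately naive suffix-splitter: English only, suffixes only, and easily fooled (running comes out as runn plus -ing).  It's just meant to show the flavor of treating -s, -ing and -ed as separable parts, not to do real morphological analysis.

  SUFFIXES = ["ing", "ed", "s"]

  def split_word(word):
      """Split off a known suffix, if the remaining stem isn't too short."""
      for suffix in SUFFIXES:
          if word.endswith(suffix) and len(word) > len(suffix) + 1:
              return word[:-len(suffix)], "-" + suffix
      return word, None

  for word in ["walks", "walking", "walked", "merped", "walk"]:
      print(word, split_word(word))
  # walks ('walk', '-s'), walking ('walk', '-ing'), walked ('walk', '-ed'),
  # merped ('merp', '-ed'), walk ('walk', None)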

This ability to break down flowing speech into words and words (spoken or written) into parts appears to be universal across languages, so it's reasonable to say "Morphology is a thing" and take it as a starting point.  Syntax, then, is the study of how we put morphological pieces together, and a grammar is a set of rules for describing that process.

The trickier question is where to stop.  Consider these two sentences:
  • I eat pasta with cheese
  • I eat pasta with a fork
Certainly these mean different things.  In the first sentence, with modifies pasta, as in Do they have pasta with cheese on the menu here? or It was the pasta with cheese that set the tone for the evening.  In the second sentence, with modifies eat, as in I know how to eat with a fork or Eating with a fork is a true sign of sophistication.  The only difference is what follows with.  The indefinite article a doesn't provide any real clue.  You can also say I eat pasta with a fine Chianti.

The real difference is semantic, and you don't know which case you have until the very last word of the sentence.  A fork is an instrument used for eating.  Cheese isn't.  If the job of syntax is to say what words relate to what, it would appear that it needs some knowledge of semantics to do its job.

There are several possible ways around problems like this:
  • Introduce new categories of word to capture whatever semantic information is needed in understanding the syntax.  Along with being a noun, fork can also be an instrument or even an eating instrument.  We can then add syntactic rules to the effect that with <instrument> modifies verbs while with <non-instrument> modifies nouns (there's a small sketch of this after the list).  Unfortunately, people don't seem to respect these categories very well.  If someone has just described a method of folding a slice of cheese and using it to scoop pasta, then cheese is an eating instrument.  If we're describing someone with bizarre eating habits, a fork could be just one more topping for one's pasta.
  • Pick one or the other option as the syntactic structure, and leave it up to semantics to do with it as it will.  In this case:
    • Say that, syntactically, with modifies pasta, or more generally the noun, but that this analysis may be modified by the semantic interpretation -- if with is associated with an instrument, it's really modifying the verb.
    • Say that, syntactically, with modifies eat, or more generally the verb, and interpret eat with cheese as meaning something like "the pasta is covered with cheese when you eat it".  This may not be as weird as it sounds.  What else do you eat with cheese? is a perfectly good question, though you can argue that in this case the object of eat is what, and with cheese modifies what.
  • Make syntax indeterminate.  Instead of saying that with ... modifies eat or pasta, say that syntactically it modifies "eat or pasta, depending on the semantic interpretation".  This provides a somewhat cleaner separation between syntax and semantics, since the semantics only has to decide what's being modified without knowing the exact structure of the sentence it was in.
  • Say that with modifies eat pasta as a unit, and leave it up to semantics to slice things more finely.  This is much the same as the previous option.  Technically it gives a definite syntactic interpretation, but only by punting on the seemingly syntactic question of what modifies what.
  • Say that the distinction between syntax and semantics is artificial, and in real life we use both kinds of structure -- how words are arranged and what they mean -- simultaneously to make sense of language.  And while we're at it, the distinction between semantics and pragmatics may turn out not to be that sharp either.
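Here, as promised, is roughly what the first option looks like in code, with a completely made-up lexicon of "instruments" (the names are mine, invented for illustration).  It makes the idea concrete, but as the cheese-scooping example shows, the real problem is that the lexicon isn't actually fixed.

  INSTRUMENTS = {"fork", "spoon", "chopsticks"}   # hypothetical word category

  def attach_with(verb, noun_object, with_noun):
      """Decide what the 'with' phrase modifies, by the crude rule above."""
      head = verb if with_noun in INSTRUMENTS else noun_object
      return "'with " + with_noun + "' modifies '" + head + "'"

  print(attach_with("eat", "pasta", "fork"))    # 'with fork' modifies 'eat'
  print(attach_with("eat", "pasta", "cheese"))  # 'with cheese' modifies 'pasta'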
There are two reasons one might want to separate syntax from semantics (and likewise for the other distinctions):
  • They're fundamentally separate.  Perhaps one portion of the brain's speech circuitry handles syntax and another semantics.
  • On the other hand, separating the two may just be a convenient way of describing the world.  One set of phenomena is more easily described with tools like grammars, while another set needs some other sort of tools.  Perhaps they're all aspects of the same thing deep down, but this is the best explanation we have so far.
As much as we've learned about language, it's striking, though not necessarily surprising, that fundamental questions like "What is the proper role of syntax as opposed to semantics?" remain open.  However, it seems hard to answer the question "What, if anything, is a grammar?" without answering the deeper question.

[If you're finding yourself asking "But what about all the work being done in parsing natural languages?" or "But what about dependency grammars?", please see the followup post above.]