Wednesday, August 26, 2015

Grammars ... not so fast

In the previous post, I made the rather bold statement that the effort to describe natural languages with formal grammars had been "without great success".  I probably could have put that better.

I was referring at the time to the effort to define a "universal grammar" that would capture the essence of linguistic competence (people's knowledge of language), independent of performance (what people actually say).  A grammar, in that sense, is a set of rules that will produce all, and only, the sentences that speakers of a given language know how to produce.

In that sense, the search for grammars for natural languages has not had great success.


However, there is another line of inquiry that has had notable success, namely the effort to parse natural language: to take something that someone actually said or wrote and give an account of how its words relate to each other, one that matches well with what actual people would say about the same sentence.

For example, in Colorless green ideas sleep furiously, we would say that colorless and green are describing ideas, ideas are what's sleeping, and furiously modifies sleep.  And so would any number of parsers that have been written over the years.
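As a concrete illustration, here is a minimal sketch using spaCy, one off-the-shelf dependency parser (the model name and the exact relation labels below are illustrative and may vary by version):

    import spacy

    nlp = spacy.load("en_core_web_sm")          # a small English model
    doc = nlp("Colorless green ideas sleep furiously")

    for token in doc:
        # each word, its grammatical relation, and the word it attaches to
        print(token.text, token.dep_, token.head.text)

Typically the parser reports that colorless and green attach to ideas as modifiers, that ideas is the subject of sleep, and that furiously modifies sleep, which is just what a person would say.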

This is an important step towards actually understanding language, whatever that may mean, and it indicates that, despite what I said previously, there may be such a thing as syntax independent of semantics.  We can generally handle a sentence like Colorless green ideas sleep furiously the same way we handle Big roan horses run quickly, without worrying about what it might mean for an idea to be both green and colorless, or for it to sleep.


So what gives?  There are at least two major differences between the search for a universal grammar and the work on parsers.

First, the two efforts have significantly different goals.  It's not entirely clear just what a universal grammar might be, which was one of the main points of the previous post, and of many more knowledgeable critiques.  If universal grammar is a theory, what facts is it being asked to explain?

As far as I understand it, the idea is to explain why people say and comprehend what they do, by producing a grammar that enumerates what sentences they are and aren't competent to say, keeping in mind that people may make mistakes.

However, narrowing this down to "all and only" the "right" sentences for a given language has proved difficult, particularly since it's hard to say where the line between "competence" and "performance" lies.  If someone says or understands something that a proposed universal grammar doesn't generate, that is, something outside their supposed "competence", what does that mean?  If competence is our knowledge of language, is the person somehow saying something they don't know how to say?

The work on parsing has very well-defined goals.  Typically, algorithms are evaluated by running a large corpus of text through them and comparing the results to what human annotators came up with.  As always, one can argue with the methodology, and in particular there is a danger of "teaching to the test", but at least it's very clear what a parser is supposed to do and whether a particular parser is doing it.  Research in parsing is not trying to characterize "knowledge", but rather to replicate particular behavior, in this case how people will mark up a given set of sentences.
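One common scoring convention is unlabeled attachment score: the fraction of words whose predicted head matches the human annotation.  Here is a minimal sketch (the list-of-heads format is an assumption made for illustration, not any particular treebank's format):

    def unlabeled_attachment_score(gold_heads, predicted_heads):
        """Fraction of words whose predicted head matches the gold (human) head."""
        correct = sum(1 for g, p in zip(gold_heads, predicted_heads) if g == p)
        return correct / len(gold_heads)

    # Heads given by word position, 0 meaning "root of the sentence":
    # "Colorless green ideas sleep furiously"
    gold      = [3, 3, 4, 0, 4]   # the human annotation
    predicted = [3, 3, 4, 0, 3]   # a parser that attached "furiously" to "ideas"
    print(unlabeled_attachment_score(gold, predicted))   # 0.8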

Second, the two efforts tend to use different tools.  Much work on universal grammar has focused on phrase-structure grammars, transformational grammars, and in general (to the small extent I understand the "minimalist program") on nested structures: A sentence consists of a verb part and noun parts, a verb part includes a main verb and modifying clauses, which may in turn consist of smaller parts, and so on.

While there are certainly parsing approaches based on this work, including some fairly successful ones, several successful natural language parsers are based on dependency grammars, which focus on the relations among words rather than a nesting of sentence parts.  Where a phrase-structure grammar would say that colorless green ideas is a noun phrase and so forth, a dependency grammar would say that colorless and green depend on ideas (in the role of adjectives).  Dependency grammars can trace their roots back to some of the earliest known work in grammar hundreds of years ago, but for whatever reason they seemed to fall out of favor for much of the late 20th century.
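To make that contrast concrete, here is a rough sketch of the two views of the example sentence, written as plain data structures (the category and relation names are informal, not any particular treebank's labels):

    # Phrase-structure view: nested constituents
    phrase_structure = (
        "S",
        ("NP", ("ADJ", "colorless"), ("ADJ", "green"), ("N", "ideas")),
        ("VP", ("V", "sleep"), ("ADV", "furiously")),
    )

    # Dependency view: each word points at the word it depends on
    dependencies = {
        "colorless": ("ideas", "adjective modifier"),
        "green":     ("ideas", "adjective modifier"),
        "ideas":     ("sleep", "subject"),
        "furiously": ("sleep", "adverbial modifier"),
        "sleep":     (None,    "root"),
    }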

Leaving aside some interesting questions about issues like "non-projective dependencies" (where the lines from the dependent words to the words they depend on cross when the sentence is laid out in its written or spoken order), it's easy to build parsers for dependency grammars using basically the same technique (shift-reduce parsing) that compilers for computer languages use.  These parsers tend to be quite fast, and about as accurate as parsers based on phrase-structure grammars.
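To give the flavor of shift-reduce parsing, here is a minimal sketch of the "arc-standard" transition system, driven here by already-known (gold) heads rather than by a trained statistical model, which is where the real work in a practical parser lies.  The function name and input format are assumptions made for illustration, and it only handles projective trees, in keeping with the caveat above.

    def shift_reduce_parse(n_words, gold_heads):
        """Arc-standard shift-reduce parse of a projective sentence.

        n_words: number of words, numbered 1..n_words (0 stands for the root).
        gold_heads: dict mapping each word to its head (0 for the sentence's root word).
        Returns the set of (head, dependent) arcs the transitions build up.
        """
        stack = [0]                                  # start with the root marker
        buffer = list(range(1, n_words + 1))         # words in sentence order
        arcs = set()

        def still_needed_as_head(i):
            # True if some word still in the buffer depends on word i
            return any(gold_heads[j] == i for j in buffer)

        while buffer or len(stack) > 1:
            if len(stack) >= 2:
                top, below = stack[-1], stack[-2]
                if below != 0 and gold_heads[below] == top:
                    arcs.add((top, below))           # LEFT-ARC: top governs the word below it
                    stack.pop(-2)
                    continue
                if gold_heads[top] == below and not still_needed_as_head(top):
                    arcs.add((below, top))           # RIGHT-ARC: word below governs the top
                    stack.pop()
                    continue
            if not buffer:
                raise ValueError("not a projective tree")
            stack.append(buffer.pop(0))              # SHIFT: read the next word

        return arcs

    # "Colorless green ideas sleep furiously": words 1..5, "sleep" (word 4) is the root
    heads = {1: 3, 2: 3, 3: 4, 4: 0, 5: 4}
    print(shift_reduce_parse(5, heads))
    # the five arcs: ideas governs colorless and green, sleep governs ideas and
    # furiously, and the root marker governs sleep

Each word is pushed and popped exactly once, which is why parsers built this way run in time linear in the length of the sentence; the part this sketch leaves out is the scoring model that decides, at each step, which transition to take.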


In short, there is a lot of interesting and successful work going on concerning the syntax of natural languages, just not in the direction (universal grammar and such) that I was referring to in the previous post.
