
On Chomsky and the Two Cultures of Statistical Learning

313 points | EdiX | 15 years ago | norvig.com

107 comments

[+] brockf|15 years ago|reply
Chomsky's one-paragraph quote at the beginning of this article is clearer and more thoughtful than the rest of the piece. I feel the author is missing the point.

In the case of language, observing and reporting statistical probabilities in written/spoken language output does very little to explain the cognitive systems used in acquiring and using language. Even one statistical anomaly serves to show that statistical learning is NOT the entire picture when it comes to language development.

There was another article on HN a while back with a great quote from Chomsky that illustrates what I feel is his main point here: "Fooling people into mistaking a submarine for a whale doesn't show that submarines really swim; nor does it fail to establish the fact". Creating a computer that can produce millions of grammatical utterances does little to show that we understand language systems. Now, if a computer could - like humans - learn to produce infinite, novel, contextual, and meaningful grammatical utterances, that's a different story. But that story will take a lot more than statistical learning to write.

[+] moultano|15 years ago|reply
Chomsky is just appealing to our own biases. We don't want to be statistical approximation machines, so that makes it easy to dismiss attempts to mimic us with statistical approximation machines.

However, the preponderance of evidence* so far suggests that we are just statistical processing machines, which is why Chomsky seems so far off the mark.

*We know that various layers in the visual and auditory systems basically just compute ICA, and we know that the brain is incredibly plastic: large areas can be removed and the remainder will compensate. That makes it seem likely that all neurons compute something like ICA (or at least something that degrades to ICA when confronted with visual or auditory input).
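
(If you want to see what "computing ICA" means in practice, here's a minimal sketch using scikit-learn's FastICA on toy mixed signals - toy data, not neural data:)

    # What "computing ICA" looks like: unmixing two mixed toy signals.
    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.RandomState(0)
    t = np.linspace(0, 8, 2000)
    s1 = np.sin(2 * t)                          # source 1: sinusoid
    s2 = np.sign(np.sin(3 * t))                 # source 2: square wave
    S = np.c_[s1, s2] + 0.05 * rng.randn(2000, 2)

    A = np.array([[1.0, 0.5], [0.5, 1.0]])      # mixing matrix (the "cocktail party")
    X = S @ A.T                                 # what the "sensors" observe

    ica = FastICA(n_components=2, random_state=0)
    S_est = ica.fit_transform(X)                # recovered sources, up to scale/order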

[+] norvig|15 years ago|reply
Thanks for the comment, brockf. I'm sorry the essay didn't make sense to you. Let me try again on a few points.

Are you saying that one statistical error in a probabilistic model makes the entire model wrong? Then you'd have to say that one logical error in a categorical model makes it equally wrong. And manifestly, there are many logical errors in all grammars. So I'm not sure what your point is here.

I'm interested to know: I quoted Chomsky: "That's a notion of [scientific] success that's very novel. I don't know of anything like it in the history of science." Do you agree with him? If so, do you judge all the Science and Cell articles as not being about accurately modeling the world and only about providing insight? Or do you think Chomsky meant something else by that?

I understand that there are two goals, accurately representing the world and finding satisfactorily simple explanations. I think Chomsky has gone too far in ignoring the first, but I acknowledge that both are part of science. I further think that statistical/probabilistic models of language are better for both goals. This is obvious to me after working on the problem for 30 years, so maybe that makes it hard for me to explain why. I think Manning, Pereira, Abney, and Lappin/Shieber do a good job of it. Also, I don't see how a system that successfully learns language could be anything other than statistical and probabilistic. I agree it is a long way away ...

-Peter Norvig

[+] losvedir|15 years ago|reply
>In the case of language, observing and reporting statistical probabilities in written/spoken language output does very little to explain the cognitive systems used in acquiring and using language.

Unless, of course, those cognitive systems are nothing more than some statistical probabilistic mechanism. I don't know anything about the field, but the article was interesting to me in that it seemed to at least partly argue that. I know, for me at least, I'll frequently produce a sentence and then repeat it to myself a few times to see if it "sounds right." Now, I don't know what is happening to determine that, but perhaps I'm comparing it to some statistical probabilistic model I have in my head?

> Even one statistical anomaly serves to show that statistical learning is NOT the entire picture when it comes to language development.

1) Does it? Maybe it just shows that the specific statistical probabilistic model in question is wrong. Consider, as Chomsky did, a model that predicts zero probability for a novel sentence. Clearly, as you say, one anomalous novel sentence is all it takes to disprove such a model. But what about other models which can handle them - say, with smoothing (see the sketch after these two points)? The "anomaly" may not be an anomaly anymore.

2) Do you have some anomaly in mind which shows statistical probabilistic models don't work?
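
To make 1) concrete, here's a toy bigram model with add-one smoothing - invented two-sentence corpus, nothing like a real language model - where a novel sentence gets a small but nonzero probability:

    # Toy bigram model with add-one (Laplace) smoothing: a novel sentence
    # gets a small but nonzero probability instead of zero.
    from collections import Counter

    corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
    vocab = {w for sent in corpus for w in sent} | {"<s>", "</s>"}
    bigrams, unigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        for a, b in zip(toks, toks[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1

    def prob(a, b):
        # unseen bigrams get 1 / (count(a) + |V|) rather than 0
        return (bigrams[(a, b)] + 1) / (unigrams[a] + len(vocab))

    def sentence_prob(sent):
        toks = ["<s>"] + sent + ["</s>"]
        p = 1.0
        for a, b in zip(toks, toks[1:]):
            p *= prob(a, b)
        return p

    print(sentence_prob(["the", "dog", "sleeps"]))   # novel, yet nonzero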

-----

The article was very interesting to me, but I don't know anything about the field. I guess my main question boils down to: Is it possible that language acquisition and production is nothing more inside our heads than a simple statistical probabilistic model?

[+] PaulHoule|15 years ago|reply
There's no doubt that Noam Chomsky founded a paradigm of academic activity. Linguists can generate an unlimited number of papers and monographs by finding problems and proposing intellectually convincing solutions.

From an engineering standpoint, however, Chomsky's view of grammar has been remarkably barren when it comes to machine processing of natural language. It's made a major contribution to artificial languages but despite a lot of effort it hasn't added much performance to what can be done with statistical methods.

I'd agree that a hidden Markov model that does POS tagging with high accuracy doesn't provide an intellectually satisfying model for "how language works", but you don't need to have a model for "how language works" in order to use it.
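
(For the curious, such a tagger is mostly a few lines of dynamic programming. A toy Viterbi decoder with invented probabilities, just to show the shape of the thing:)

    # Toy HMM part-of-speech tagger via the Viterbi algorithm.
    # All transition/emission probabilities are invented for illustration.
    states = ["DET", "NOUN", "VERB"]
    start = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
    trans = {"DET":  {"DET": 0.0, "NOUN": 0.9, "VERB": 0.1},
             "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
             "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1}}
    emit = {"DET":  {"the": 0.9},
            "NOUN": {"dog": 0.8, "barks": 0.2},
            "VERB": {"dog": 0.1, "barks": 0.9}}

    def viterbi(words):
        # best[s] = (probability of best path ending in state s, that path)
        best = {s: (start[s] * emit[s].get(words[0], 0.0), [s]) for s in states}
        for w in words[1:]:
            best = {s: max(((p * trans[prev][s] * emit[s].get(w, 0.0), path + [s])
                            for prev, (p, path) in best.items()),
                           key=lambda x: x[0])
                    for s in states}
        return max(best.values(), key=lambda x: x[0])[1]

    print(viterbi(["the", "dog", "barks"]))   # -> ['DET', 'NOUN', 'VERB']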

[+] lkozma|15 years ago|reply
I think Norvig acknowledges the point you are making here, namely that the statistical approach does not explain the cognitive systems behind language. However (if I understand correctly), he implies that those systems might be too complex to be adequately explained, let alone emulated, and that we can achieve more by treating them as black boxes and analyzing their outputs, i.e., language as it is used.

"if a computer could - like humans - learn to produce infinite, novel, contextual, and meaningful grammatical utterances"

To perfectly achieve this goal, you might have to simulate 4 billion years of evolution under the same conditions as it happened on Earth, and a few thousand years of cultural evolution as it led to our languages and our cultural context. Language is incredibly complex and changing, many of its details might be incidental, i.e. results of random events, so it seems unreasonable to pretend that we can deduce it all from some elegant first principles. At least that is my reading of Norvig's argument.

[+] equark|15 years ago|reply
You provide no evidence for the last statement: "that story will take a lot more than statistical learning to write."

The existing evidence overwhelmingly suggests that a computer that can "learn to produce infinite, novel, contextual, and meaningful grammatical utterances" will be based on probabilistic models. In fact it's hard to imagine how it could possibly be otherwise.

The computer is observing noisy sensory input and is trying to make inferences about how to communicate with some future reader. Mathematically, there is only one way to write this problem: probability. It's true that the learned model may have amazing structure to it, but this will almost certainly be learned via probabilistic models rather than being hand-coded by some future Chomsky.
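
To make that concrete, here is the whole problem in miniature - the numbers are invented, but the shape is exactly Bayes' rule: pick the message m that maximizes P(m) * P(signal | m):

    # The listener's problem in miniature: pick the message m that
    # maximizes P(m) * P(signal | m). All numbers invented.
    priors = {"I scream": 0.4, "ice cream": 0.6}        # language model P(m)
    likelihoods = {"I scream": 0.5, "ice cream": 0.5}   # acoustic model P(signal | m)

    posterior = {m: priors[m] * likelihoods[m] for m in priors}
    z = sum(posterior.values())
    posterior = {m: p / z for m, p in posterior.items()}

    print(max(posterior, key=posterior.get))            # -> ice cream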

That fact does not imply we will understand language systems or the human mind. The Chomsky route may be better suited for that task.

[+] scott_s|15 years ago|reply
Consider catching a ball. We know how to design a robot that will catch one: it will have hardware for moving an "arm" and a "hand" to do the catching, as well as computer hardware and some software for the logic. The software will solve differential equations to predict where the ball will be, and when to move the "arm" and "hand" to the correct spot to catch it.

No one, as far as I know, argues that humans actually solve differential equations in their head when they catch a ball. They just... catch it. Perhaps with some failed attempts along the way, but as a part of growing up, we learned basic eye-hand coordination.

The notion that syntax and grammar as we have formalized them exist in our brains is the same as saying that differential equations exist in our brains. I find it much more likely that we innately have rough models for syntax, grammar, mechanical movement, and object trajectories, but that it takes significant trial and error to tune those models to the point of competence. These models have to be at least partly statistical - otherwise we wouldn't need to learn anything - and while our formalisms may be nice approximations of what we do in our brains, I see no reason why they have to be exactly it.
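
To illustrate what I mean by a rough, tuned model - a toy sketch where we "catch the ball" by fitting a curve to noisy observations instead of solving the equations of motion symbolically (grossly simplified 1-D physics, obviously):

    # "Catching" a ball by curve-fitting noisy observations rather than
    # solving the equations of motion symbolically. Toy 1-D physics.
    import numpy as np

    rng = np.random.RandomState(1)
    t_obs = np.linspace(0.0, 0.5, 6)               # watch the first half-second
    y_true = 20 * t_obs - 4.9 * t_obs ** 2         # true height (unknown to us)
    y_obs = y_true + 0.05 * rng.randn(t_obs.size)  # noisy "retinal" measurements

    coeffs = np.polyfit(t_obs, y_obs, 2)           # fit a quadratic to what we saw
    t_land = max(r.real for r in np.roots(coeffs) if r.real > 0)
    print(round(t_land, 2))                        # near the true 20/4.9 ~ 4.08 s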

[+] amouat|15 years ago|reply
I don't think Norvig was arguing that Chomsky was completely wrong in what he said, more that statistical models are a hell of a lot more important than Chomsky implies.

Looking at the statistics and evidence is of great importance in trying to form models and answers to the "why" questions. Although mimicking a bee dance may not mean we understand it, it does provide a basis for founding and comparing theories.

[+] sitkack|15 years ago|reply
When is the pretending so good that it ceases to be pretending? How much Mocking does the Mocking Bird need to partake in before it is the Creating Mocking Bird?

What Norvig didn't seem to get was the difference between understanding and a highly sophisticated pretender: gut-level vs. self-aware intelligence. Both are valid forms of intelligence, but only one is a valid form of understanding.

I think statistical methods are a form of intelligence that is highly mechanical and could never achieve human-level cognition (e.g. fart jokes). But I could be wrong; usually am more than half the time.

[+] anigbrowl|15 years ago|reply
This [Chomsky, not you] is just a warmed-over restatement of Searle's Chinese Room argument against AI. And it's a bullshit argument, for a reason I can state in two words: Turing test.
[+] madamepsychosis|15 years ago|reply
Who's to say that the human brain doesn't learn by statistical analysis? The human brain collects data, forms hypotheses, and tests them - just like a statistical machine.
[+] andrewcooke|15 years ago|reply
just because you only understand one side of an argument doesn't mean everyone else is an idiot.

in some sense this is the same argument as searle's chinese room. that's sufficiently well known and debated that it's fair to say that neither side can be dismissed as simply "missing the point".

[+] Jun8|15 years ago|reply
This is not a new debate. Within linguistics there has been a continuous push against statistical NLP models. Read the introduction of Manning's book; even he seems to be defensive about NLP.

Chomsky is a colossus; his achievements are well known. However, in many disciplines it comes to pass that the pioneers who paved the way in time become the very impediment to new ideas. His emphasis on semantics has warped the minds of many generations of researchers (as have some of his other ideas, like universal grammar).

I experienced this first hand: my advisor, Prof. Raskin, a great researcher on semantics, nevertheless thought that statistical approaches were not the way to go. Sadly, in many linguistics departments people are just not equipped with the statistical tools necessary to have a basic understanding of what's being done in the NLP field. So NLP is generally taught under CS, EE, or CompE.

[+] adavies42|15 years ago|reply
i saw someone once compare chomsky to freud, as a foundational figure whose discipline can't/couldn't progress during his lifetime.
[+] foldr|15 years ago|reply
>His emphasis on semantics

What are you talking about? Chomsky has always been highly critical of formal semantics.

[+] christianpbrink|15 years ago|reply
"If Chomsky had focused on the other side, interpretation, as Claude Shannon did, he may have changed his tune. In interpretation (such as speech recognition) the listener receives a noisy, ambiguous signal and needs to decide which of many possible intended messages is most likely. Thus, it is obvious that this is inherently a probabilistic problem, as was recognized early on by all researchers in speech recognition..."

This is the money shot, especially since speakers are aware of the interpretive activity of listeners, and effective speakers play constantly on the ambiguities in their statements - structural (i.e. grammatical) ambiguities as well as semantic ambiguities. Listeners in turn are aware of speakers' awareness of this. There is, effectively, an infinity of mutual awarenesses of structural ambiguities in any instance of communication.

I think most technologists and (especially) businesspeople see this intuitively. I think many academics do not. Not sure how to articulate what I mean but I think I am saying something non-trivial about academics and their perspective on language.

[+] bluekeybox|15 years ago|reply
> I think I am saying something non-trivial about academics and their perspective on language.

I became convinced that there is a strain of thought, especially pervasive in academia, which holds that knowledge/meaning is something irreducible and almost mystical. It probably has to do with the fact that people who fetishize knowledge as something incredibly worthwhile for its own sake end up being overrepresented in academia. Those who are a bit more cynical/nihilistic tend to go into finance or start their own companies.

The old advice "do not make any gods to be alongside me" is still relevant except for the "alongside me" part, which probably only has any meaning if you consider yourself religious. I have a feeling that many academicians, especially the old-school ones, idolize knowledge to the extent of ascribing to it god-like powers even if said knowledge has little relevance for anything practical.

[+] CWuestefeld|15 years ago|reply
Server's down. Here's a cached link: http://webcache.googleusercontent.com/search?q=cache:http%3A...

EDIT: stop giving me upvotes. I've got 11 points now for nothing more than a link. I don't deserve them. Stupid hidden points...

[+] jng|15 years ago|reply
People upvote so that the link to the cached copy will be at the top.
[+] norvig|15 years ago|reply
Sorry about the intermittent access. My hosting service provides me with sufficient bandwidth, but only provides a version of Apache that forks a new process for every GET, and thus runs out of processes and denies access to a portion of visitors when I get slashdotted/redditted/hacker-newsified. If anyone can suggest a more reasonable hosting service, let me know. -Peter Norvig
[+] PaulHoule|15 years ago|reply
It's funny. Lately I've been working with NLP systems, and in the last few years a few really good part-of-speech taggers have appeared that are about 99% accurate. All the ones I know of are based on hidden Markov models, which would definitely disappoint Chomsky.

Part of the trouble with Chomsky's view is that real language doesn't draw a clear line between syntax and semantics. Even though an HMM doesn't correctly model the nested structures that are common in natural language, it makes up for it by encoding semantic information.
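
(If anyone wants to play with one, NLTK ships an HMM tagger you can train on the Brown corpus in a few lines - assuming nltk is installed and the corpus data downloaded:)

    # An HMM tagger trained on the Brown corpus with NLTK.
    # Assumes: pip install nltk, then nltk.download('brown') beforehand.
    from nltk.corpus import brown
    from nltk.tag import hmm

    train = brown.tagged_sents(categories="news")[:3000]
    tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train)
    print(tagger.tag("the jury deliberated for hours".split()))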

[+] sharmajai|15 years ago|reply
Another trouble is that human beings are innately probabilistic when it comes to language. A sentence written or spoken by humans does not have to be grammatically correct to convey its meaning, and does not always follow the strict rules that Chomsky talks about.

It's not the language that defines how we communicate; it's how we communicate that defines the language.

But I also disagree with Peter when he says the why is not important. It is this why, the understanding of the matter, that separates us from machines like Watson, since our sole purpose in life is not to win at a game but to play and enjoy the game, and most importantly to "reuse the understanding" gained in some other facet of life, a feat that I believe no machine is capable of.

[+] foldr|15 years ago|reply
>It's funny. Lately I've been working with NLP systems, and in the last few years a few really good part-of-speech taggers have appeared that are about 99% accurate. All the ones I know of are based on hidden Markov models, which would definitely disappoint Chomsky.

No, it wouldn't disappoint him at all. In fact, one of his earliest works in linguistics discussed how transition probabilities could be used for chunking and categorization. (See http://www.tilburguniversity.edu/research/institutes-and-res... ) It's not as if Chomsky ever presented part-of-speech tagging as a poverty-of-the-stimulus argument.

[+] wccrawford|15 years ago|reply
"O'Reilly is correct that these questions can only be addressed by mythmaking, religion or philosophy, not by science."

... My jaw is on the floor. It drives me nuts when people go from 'We can't explain that yet' to 'The only explanation is God.'

The tides are incredibly complex when you insist on 'why' all the way back to the beginning of the universe. Everything is!

[+] torstein|15 years ago|reply
>He doesn't care how the tides work, tell him why they work. Why is the moon at the right distance to provide a gentle tide, and exert a stabilizing effect on earth's axis of rotation, thus protecting life here? Why does gravity work the way it does? Why does anything at all exist rather than not exist? O'Reilly is correct that these questions can only be addressed by mythmaking, religion or philosophy, not by science.

Science doesn't really aim to answer the 'why' questions, but rather the 'how' questions. The scientific method boils down to falsifying hypotheses, and that's a lot easier with 'how does the tide work?' than with 'why does the tide work (the way it does)?'.

Science can't say anything about 'Why does anything at all exist rather than not exist?', because there is no way to test any of the answers. So it's left to mythology, religion or philosophy to answer.

[+] mbateman|15 years ago|reply
I interpret this sort of question not as asking for a further step in a causal chain, but rather as demanding a teleological explanation where none is available.

While I disagree with almost everything Chomsky says about everything, and I think the comparison was meant to be somewhat sympathetic, it's really unfair to propose an affinity between Chomsky and O'Reilly in this manner. What the hell.

Equally unfair is Norvig calling Chomsky a mystic for his invocation of Plato. Chomsky is a rationalist, not a mystic.

[+] jimbokun|15 years ago|reply
"... My jaw is on the floor. It drives me nuts when people go from 'We can't explain that yet' to 'The only explanation is God.'"

Philosophy does not imply God.

And it drives me nuts when people don't see that many important questions cannot even be expressed within the framework of science.

[+] pygy_|15 years ago|reply
That anything at all exists has to be accepted as a given.

"God" (in its various versions and revisions around the globe) is just an anthropomorphized version of the previous proposition.

[+] Wuzzy|15 years ago|reply
What science often does is respond to a "why" question by analyzing the phenomenon and presenting its causes in lower-level terms. But, from a certain viewpoint, that is not a satisfactory answer.

Take physics, for example. It can tell you why some objects behave the way they do by telling you there are certain particles, interacting forces, etc. In this way you can explain, say, the photoelectric effect.

But it isn't really an answer to the "why" question, is it? It just pushes the question one level lower. Why are there such and such particles and forces? Why the constants? The very nature of these answers is descriptive. It is a description of how the world works, not why it works that way.

Maybe asking "why" in this ultimate manner is an ill-posed question - but that's not the issue here. It just doesn't seem that science in its current form, unlike religion or philosophy, could ever even attempt to answer it.

Don't get me wrong, I'm strongly atheistic myself, but there are some inherent limitations of scientific exploration and clarification with respect to the answers it can provide.

[+] paganel|15 years ago|reply
> My jaw is on the floor. It drives me nuts when people go from 'We can't explain that yet' to 'The only explanation is God.'

I don't think anyone mentioned God, and, to be fair, the problem of what exactly constitutes reality and what would be the best ways to imitate it is quite complex. We, as a species, have been trying to find the rational answer to this problem for at least 2,500 years (since the pre-Socratics), but as far as I know we haven't come to any definitive answer, we don't even know if there is such an answer.

[+] lurker19|15 years ago|reply
"God" did not appear in the quotation.

Some questions are metaphysical, not because they are complex, but because they are ill-posed and not subject to falsifiable experimentation or observation.

BTW, a lot of real-world phenomena, like some large events of history or sociology or macroeconomics, fall into the same category of scientific unapproachability, due to practical limitations of our civilization and any plausible future civilization.

[+] T_S_|15 years ago|reply
The handshake example was illuminating. Three "equivalent" theories:

Theory A: a closed-form formula.

Theory B: an "algorithm". Still a function.

Theory C: a memoized function (constant time!).

According to the article, "nobody" likes C, especially the article's Chomsky straw man. If one had a procedure to convert C to A, this whole issue would become hairsplitting. Such a procedure would aim to convert a memoized function back into a form that uses more symbols from a mathematical language. A good criterion of success would be the description length of the resulting procedure in the preferred language. One reason this could be useful to science is that once you identify a value that is useful in many theories, it becomes part of the language; making it available for the next problem may speed up the search for a "good" description of the next phenomenon. Identical procedures that appeared in various algorithms might acquire a special name. One such value might be called "pi", another "foldr", and so on.
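
To make the three theories concrete (using the n-person handshake count as the phenomenon):

    # The three "theories" of the n-person handshake count, side by side.
    from functools import lru_cache

    def theory_a(n):
        # Theory A: closed-form formula.
        return n * (n - 1) // 2

    def theory_b(n):
        # Theory B: an algorithm; still a function.
        return sum(range(n))

    @lru_cache(maxsize=None)
    def theory_c(n):
        # Theory C: a memoized function; a table lookup after the first call.
        return theory_b(n)

    assert theory_a(10) == theory_b(10) == theory_c(10) == 45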

Of course there may be many good descriptions, just as there are many languages. Also, the example could be extended to statistical modeling situations by adding room for error terms in the suitability criteria.

So if you have a general procedure to convert a table into a definition, you can make money and do science at the same time!

[+] stcredzero|15 years ago|reply
My conclusion is that 100% of these articles are more about "accurately modeling the world" than they are about "providing insight," although they all have some theoretical insight component as well.

Before you can figure out why, you have to make sure you can accurately characterize the what. So a lot of science is focused on coming up with a descriptive tool, like an ad hoc curve fit, before the underlying principles are discovered.

I think Chomsky is afraid that statistical models will cause people to stop looking for the underlying principles.

[+] sethg|15 years ago|reply
This essay made me think: Lojban (http://www.lojban.org/tiki/la+lojban.+mo), among constructed languages, is the categorical language par excellence. Every word has a well-defined range of meaning; the grammar can be parsed by the same kinds of parsers used for programming languages; potential sources of ambiguity, like plural references, associativity of modifiers, and negation, have been rigorously (or tediously, depending on how you roll) nailed down.

Can there be such a thing as a conlang that demonstrates the ideal statistical grammar and semantics? (“All the words in this list are 60% likely to be used as nouns and 40% likely to be used as verbs....” But in the absence of a pre-existing linguistic community, how could you get students of the language to use them in the right proportions?)
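
(The spec itself is at least easy to write down as a sampling procedure - invented words, and the 60/40 split from above. Checking conformance is the easy part; getting humans to conform is the hard part:)

    # A statistical conlang spec as a sampling procedure: each word should
    # surface as a noun 60% of the time, a verb 40%. Words are invented.
    import random
    from collections import Counter

    rng = random.Random(0)
    spec = {"blork": 0.6, "flim": 0.6}           # P(noun) per word

    usage = Counter()                            # simulate a speech community
    for _ in range(10000):
        w = rng.choice(list(spec))
        usage[(w, "N" if rng.random() < spec[w] else "V")] += 1

    for w in spec:
        n, v = usage[(w, "N")], usage[(w, "V")]
        print(w, round(n / (n + v), 3))          # hovers around 0.6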

[+] double-z|15 years ago|reply
The commentary has nothing to do with what Chomsky proposed. The author defines success as "being successful at accomplishing a task". That has nothing to do with science. Full stop.
[+] macmac|15 years ago|reply
Are Norvig's comments on "I before E except after C" really valid? Why would one use a corpus, and not a dictionary, to analyze the rule? It appears to me that "CIE" (P(CIE) = 0.0014) is more common than "CEI" (P(CEI) = 0.0005) because the words that contain the "exception" "CIE" are used more frequently in the corpus than the words that follow the rule with "CEI". Once you know the limited set of exceptions (in the dictionary sense), the rule appears to preserve its relevance.
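
The type/token distinction is easy to check yourself; a sketch that counts both ways, given any word list and any corpus file (both paths are placeholders):

    # "I before E except after C": type counts (each dictionary word once)
    # versus token counts (words weighted by corpus frequency).
    # Both file paths are placeholders; use whatever you have.
    import re
    from collections import Counter

    def counts(words):
        c = Counter()
        for w in words:
            for pat in ("cie", "cei"):
                c[pat] += w.count(pat)
        return c

    with open("/usr/share/dict/words") as f:     # any word list
        print("types: ", counts(f.read().lower().split()))

    with open("corpus.txt") as f:                # any running text
        print("tokens:", counts(re.findall(r"[a-z]+", f.read().lower())))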
[+] noahlt|15 years ago|reply
Strangely appropriate is today's XKCD: http://xkcd.com/904/
[+] kenjackson|15 years ago|reply
Hmm... I never thought of it that way: that sports are a weighted random number generator, but the various weights are unknown, and the commentators are discussing theories as to what the weights are and how they're derived. (Although the cartoon seems to be saying the narratives are just about the numbers generated, which is more cynical, and frankly less interesting.)
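
(That's just parameter estimation: watch outcomes from a fixed but hidden "weight" and update a Beta posterior - toy numbers:)

    # Sports as a weighted RNG: commentators inferring the hidden "weight"
    # from observed outcomes, via a Beta-Bernoulli update. Toy numbers.
    import random

    rng = random.Random(42)
    true_weight = 0.65                           # hidden from the commentators
    wins = sum(rng.random() < true_weight for _ in range(50))
    losses = 50 - wins

    # Beta(1, 1) prior; posterior mean after the season:
    estimate = (1 + wins) / (2 + wins + losses)
    print(round(estimate, 2))                    # the best available "theory"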
[+] niels_olson|15 years ago|reply
This whole theory vs observation argument exists at the very pinnacle of human thought, expressed in the Copenhagen interpretation. If you want to contribute to the human understanding of this, you'll have to beat Bohr and the uncertainty principle.
[+] Create|15 years ago|reply
Fourier was there first.
[+] borism|15 years ago|reply
> And while it may seem crass and anti-intellectual to consider a financial measure of success

Why are the other metrics Norvig provides, like articles published or prevalence in practical applications, considered more intellectual?

And besides, I don't think "accurately modeling the world" is the end of it. Classical Newtonian mechanics correctly describes 99% of our activities in the real world and was considered the pinnacle of scientific achievement for several centuries. Yet we know today that it's just a special case of general relativity and quantum mechanics.