item 35082108

gabelschlager | 3 years ago

Well, Chomsky already dismissed corpus-based linguistics in the 90s and 2000s, because a corpus (a large collection of text documents, e.g., newspaper articles, blog posts, literature, or everything mixed together) is never a good enough approximation of the true underlying distribution of all words/constructs in a language. For example, a newspaper-based corpus might have frequent occurrences of city names or names of politicians, whereas these might not occur that often in real everyday speech, because many people don't actually talk about those politicians all day long. Or, alternatively, names of small cities might have a frequency of 0.

Naturally, he will, and does, also dismiss anything that occurred in the ML field in the past decade.

But I agree with the article. Dealing with language only in a theoretical/mathematical way, without even trying to evaluate your theories against real data, is just not very efficient, and it ignores that language models do seem to work to some degree.


blululu|3 years ago

This is a bit lateral, but there is a parallel in how Marvin Minsky will most likely be best remembered for dismissing neural networks (a one-layer perceptron can't even handle an XOR!). We are now sufficiently removed from his heyday that I can't really recall anything he did besides the book Perceptrons with Seymour Papert (who went on to do some very interesting work in education). There is a chart out there about ML progress that makes a conjecture about how small the gap is between what we would consider the smartest and dumbest levels of human intelligence (in the grand scheme of information processing systems). It is a purely qualitative, vibes sort of chart, but it is not unreasonable that even the smartest tenured professors at MIT might not be that much beyond the rest of us.
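The XOR quip above is easy to demonstrate. A small sketch of my own (not from Perceptrons itself): the classic perceptron learning rule converges on a linearly separable problem like AND, but no single threshold unit can ever classify XOR correctly, so training stalls below 100% accuracy.

```python
def train_perceptron(samples, epochs=100):
    """Train a single threshold unit with the perceptron rule.

    Returns (weights, bias, accuracy on the training samples).
    """
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - pred
            w[0] += err * x1
            w[1] += err * x2
            b += err
    correct = sum(
        (1 if w[0] * x1 + w[1] * x2 + b > 0 else 0) == t
        for (x1, x2), t in samples
    )
    return w, b, correct / len(samples)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

_, _, and_acc = train_perceptron(AND)
_, _, xor_acc = train_perceptron(XOR)
print(and_acc)  # 1.0 -- AND is linearly separable
print(xor_acc)  # below 1.0 -- no line separates XOR's classes
```

Adding one hidden layer fixes this, which is exactly the multi-layer case the Perceptrons critique did not cover.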

masswerk|3 years ago

This dismissal of Minsky misses that Minsky actually had extensive experience with neural nets (starting in the 1950s, with neural nets in hardware) and was, around 1960, probably the most experienced person in the field. Also, in Jan. 1961, he published “Steps Toward Artificial Intelligence” [0], where we not only find a description of gradient descent (then "hill climbing"; compare sect. B in “Steps”, as it was still measured toward a success parameter rather than against an error function), but also a summary of experiences with it. (Also, the eventual reversal of success into a quantifiable error function may provide some answer to the question of success in statistical models.)

[0] Minsky, Marvin, “Steps Toward Artificial Intelligence”, Proceedings of the IRE, Volume: 49, Issue: 1, Jan. 1961: https://courses.csail.mit.edu/6.803/pdf/steps.pdf
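The "success parameter vs. error function" distinction is mostly a sign flip. A toy sketch of my own (the 1-D success score is hypothetical, not from Minsky's paper): climbing a success surface and descending the corresponding error surface error(w) = -success(w) produce the same updates and the same optimum.

```python
def success(w):
    # Hypothetical 1-D success score, peaked at w = 3.
    return -(w - 3.0) ** 2

def hill_climb(w, step=0.1, iters=1000):
    """Move uphill on the success score, via a central-difference gradient."""
    for _ in range(iters):
        grad = (success(w + 1e-6) - success(w - 1e-6)) / 2e-6
        w += step * grad          # ascend success
    return w

def gradient_descent(w, step=0.1, iters=1000):
    """Descend the error surface error(w) = -success(w)."""
    for _ in range(iters):
        grad = (-success(w + 1e-6) + success(w - 1e-6)) / 2e-6
        w -= step * grad          # descend error: the same update as above
    return w

print(round(hill_climb(0.0), 3))        # 3.0
print(round(gradient_descent(0.0), 3))  # 3.0
```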

cma|3 years ago

Take the amount of language a blind six-year-old has been exposed to. It is nothing like the scale of these corpora, yet they can develop a rich use of language.

With current models, if you increased the parameter count but gave them a similar amount of data, they would overfit.

aaroninsf|3 years ago

For me this is a prototypical example of compounded cognitive error colliding with Dunning-Kruger.

We (all of us) are very bad at non-linear reasoning and at reasoning with orders of magnitude, and (by extension) we have no valid intuition about emergent behaviors/properties in complex systems.

In the case of scaled ML this is quite obvious in hindsight. There are many now-classic anecdotes about even those devising contemporary-scale LLMs being surprised and unsettled by what even their first versions were capable of.

As we work away at the optimizations, architectural features, and expediencies that render certain classes of complex problem solving tractable by our ML, we would do well to intentionally filter for further emergent behavior.

Whatever specific claims or notions any member holds, right or wrong, the LessWrong folks are at least taking this seriously...

aaroninsf|3 years ago

"To some degree" is quite an understatement. :)

My own hobby horse of late is that, independent of their tethering to information about reality available through sensorium and testing, LLMs are already doing more than building models of language qua language. Write-up someone pointed me at: https://thegradient.pub/othello/

bmitc|3 years ago

How is it an understatement? And what do people mean by language models working well? From what I can tell, these language models are able to form correct grammar surprisingly well. However, the content is quite poor and often devoid of any understanding.

masswerk|3 years ago

"seem to work to some degree" appears much like a second-order argument to this debate… ;-)