item 17265593

Getting Started in Natural Language Processing

200 points | wordvector | 7 years ago | monkeylearn.com

37 comments

[+] caiobegotti | 7 years ago
As a linguist and software engineer I can't imagine someone doing serious NLP without ever having studied [concrete] syntax trees and such. It is easy to impress people with some tokenization, but it's n-grams that are really useful in the real world, as is understanding syntax trees and all the interconnections possible inside them so you can NLP the shit out of real world text/speech, instead of simple examples with a tagger (and a carefully crafted, demo-ready training set of tagged data, like Apple's). This is a good summary article with very good links nonetheless.
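For readers new to the term: extracting n-grams is just sliding a window of n tokens over a sentence. A minimal sketch (the naive whitespace tokenization here is deliberately simplistic, exactly the kind of thing real pipelines have to do better):

```python
# Minimal n-gram extraction over a whitespace-tokenized sentence.
# Real pipelines need a much more careful tokenizer than str.split().

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumps".split()
print(ngrams(tokens, 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]
```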
[+] hcorreasuarez | 7 years ago
As a linguist, I see what you mean. However, I disagree on a couple of things. First, I wouldn't say n-grams and "understanding syntax trees and all the interconnections possible inside them" will make your NLP skills or results any better. Understanding n-grams and syntax trees should be quite easy for linguists, but that won't do the trick. It turns out that tagging, lemmatizing, WSD (word sense disambiguation), parsing, and many other fundamental tasks in NLP are not that "simple" after all. Of course, you can use libraries to do all of those tasks (and this is pretty simple, indeed), but most libraries will end up making flagrant mistakes when texts, either spoken or written, get too complex. Second, real world texts are complex by default, so for someone to "NLP the shit out of real world text/speech", they will have to find creative solutions to improve the tools, and that can get pretty difficult. Once again, understanding n-grams and syntax trees is not enough. There's one thing I totally agree with you on: tokenization, tagging, and the like might not be that impressive. Nonetheless, creative solutions to the problems underlying those tasks are, in fact, rare and impressive.
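A concrete illustration of why even tokenization is "not that simple": a naive whitespace split glues punctuation to words, and even a hand-rolled regex only gets part of the way there. The example sentence and the regex are made up for illustration:

```python
# Naive whitespace tokenization vs. a slightly better regex tokenizer.
# Neither is correct in general; that is the point.
import re

sentence = "Mr. O'Neill didn't say \"no,\" did he?"

naive = sentence.split()
print(naive)   # '"no,"' and 'he?' stay glued to their punctuation

# Words (optionally with an internal apostrophe) or single punctuation marks:
better = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)
print(better)  # splits punctuation off, but e.g. 'Mr.' becomes 'Mr' + '.'
```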
[+] riku_iki | 7 years ago
> as is understanding syntax trees and all the interconnections possible inside them so you can NLP the shit out of real world text/speech

But do modern DL approaches (e.g. SQuAD question-answering models, translation models) defy this approach? They train DL models on labeled data without knowing anything about syntax trees, and let the NN do all the magic.

[+] hiker512 | 7 years ago
Sorry, but for a lot of situations n-grams just don't scale. There are just way too many combinations. There are a lot of people dealing with GBs and TBs of text.
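The blow-up is easy to see: over a vocabulary of size V there are V**n possible n-grams, and even the number of *distinct* n-grams actually observed in a text grows quickly with n. A toy sketch (the sample text is made up):

```python
# Count distinct vs. total n-grams for growing n on a tiny toy text.
text = ("the cat sat on the mat and the dog sat on the rug "
        "and the cat saw the dog").split()

for n in range(1, 5):
    grams = [tuple(text[i:i + n]) for i in range(len(text) - n + 1)]
    print(n, len(set(grams)), "distinct of", len(grams), "total")
```

On real GB- or TB-scale corpora this is why n-gram tables are pruned or hashed rather than stored exactly.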
[+] asdsa5325 | 7 years ago
Nowadays, deep learning powers most of NLP.
[+] mindcrime | 7 years ago
Obligatory:

Every time I fire a linguist, the performance of our speech recognition system goes up. -- Fred Jelinek

[+] imh | 7 years ago
I really love that this getting started guide is "do lots of studying and practice, here are the canonical textbooks, papers, conferences, tools, and problems" instead of "spend a few hours on this superficial toy problem." I'd love to see more guides like this.
[+] stared | 7 years ago
I strongly disagree.

It's easy to list a lot of books and papers (and drown newcomers in them) without pointing to actual step-by-step starting points. Sure, doing superficial problems is only the first step (and it's foolish to think that it is the last step). Yet, you can read all the books in the world, but unless you are able to prove theorems or write code, you know less than someone who wrote a small script to predict names.
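A sketch of the kind of "small script to predict names" this comment has in mind: the classic starter exercise of guessing gender from a name's last letter (popularized by the NLTK book). The heuristic and example names here are made up for illustration, not a real model:

```python
# Toy name-based prediction: a crude last-letter gender heuristic.
# A real version would train a classifier on a labeled name corpus.

def guess_gender(name):
    """Very crude heuristic: names ending in a/e/i lean female."""
    return "female" if name[-1].lower() in "aei" else "male"

for name in ["Anna", "Marie", "John", "Peter"]:
    print(name, "->", guess_gender(name))
```

Writing and evaluating even something this small teaches more about data, features, and error analysis than reading a textbook chapter does.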

Additionally, it's weird that they recommend NLTK (no, please not) and SpaCy (cool and very useful, but high-level), but not Gensim or PyTorch (or at least Keras). As a side note, PyTorch has readable implementations of classical ML techniques, such as word2vec (see https://adoni.github.io/2017/11/08/word2vec-pytorch/).

There are some good recommendations linked there (I really like "Speech and Language Processing" by Dan Jurafsky and James H. Martin, https://web.stanford.edu/~jurafsky/slp3/, and recommended it myself in http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html).

[+] alexott | 7 years ago
Besides the NLP course by Jurafsky, the course "Introduction to Natural Language Processing" by D. Radev is quite good; it covers some topics the Jurafsky course doesn't.
[+] billybolton | 7 years ago

[deleted]

[+] rpedela | 7 years ago
Given that several NLP algorithms achieve >90% accuracy, and many more achieve >80% accuracy, how do you come to the conclusion that "all ideas in NLP are garbage"?
[+] freehunter | 7 years ago
It's really easy to say experts are wrong. It's a lot harder to prove it. So prove it. If you're coming in here to run down the current experts, I expect you to have a better solution. If not, I expect you to delete this comment as it adds absolutely nothing to the conversation.