item 31421232

Show HN: Natural Language Processing Demystified (Part One)

166 points | mothcamp | 3 years ago | nlpdemystified.org | reply

Hi HN:

I published part one of my free NLP course. The course is intended to help anyone who knows Python and a bit of math go from the very basics all the way to today's mainstream models and frameworks.

I strive to balance theory and practice and so every module consists of detailed explanations and slides along with a Colab notebook (in most modules) putting the theory into practice.

In part one, we cover text preprocessing, how to turn text into numbers, and multiple ways to classify and search text using "classical" approaches. And along the way, we'll pick up useful bits on how to use tools such as spaCy and scikit-learn.

No registration required: https://www.nlpdemystified.org/
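To give a flavor of the "classical" approaches part one covers, here is a minimal sketch (not taken from the course notebooks) of turning text into numbers with TF-IDF and classifying it with scikit-learn; the toy reviews and labels are invented for illustration:

```python
# A tiny "classical" text-classification pipeline: TF-IDF turns each text
# into a weighted term vector, and logistic regression learns a linear
# decision boundary over those vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "the movie was wonderful and moving",
    "a fantastic, heartfelt film",
    "the plot was dull and the acting wooden",
    "a boring, lifeless movie",
]
train_labels = ["pos", "pos", "neg", "neg"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

print(clf.predict(["what a wonderful film"])[0])  # prints "pos"
```

The same vectorizer output also supports searching text, e.g. by cosine similarity between the query vector and document vectors.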

42 comments

[+] jll29|3 years ago|reply
NLP researcher here. It's great to see so many offerings of courses and tutorials, and NLP has made a lot of progress, in terms of both its science and its reusable software artifacts (libraries & notebooks, standalone tools).

But what saddens me is that too many people dive into NLP without first trying to understand language & linguistics. For example, you can run a part-of-speech (POS) tagger in three lines of Python, but you will still not know much about what parts of speech are, which languages have which ones, or what function they serve in linguistic theory or practical applications.

What are the advantages of using the C7 tagset over the C5 or PENN tagsets?

Why is AT sometimes called DET?

etc.

I recommend people spend a bit of time to read an(y) introduction to linguistics textbook before diving into NLP; the second investment will then be worth so much more.

[+] mywaifuismeta|3 years ago|reply
I'm generally not a fan of these kinds of high-level tutorials that tell you "use X library to get Y result" - it's just not good for learning. And any content that tries to sell you on learning ML/NLP/etc. in a few weeks is exactly that. I understand people want to make money by targeting a large audience, but it makes me sad to see the vast, vast majority of practitioners blindly applying libraries without any understanding of ML (or NLP).

I don't think you necessarily need a linguistics background for NLP, but I think you need either a strong linguistics OR ML background so that you know what's going on under the hood and can make connections. Anyone can call into Hugging Face; you don't need a course for that.

[+] amitport|3 years ago|reply
NLP is a vast field nowadays; you can solve a research problem with a novel transformer architecture (for example) without knowing anything about linguistics. There is plenty of room to go around (the same goes for vision: you don't need a classical vision background as much as you used to).

(also an NLP researcher. Knows nothing about linguistics)

[+] screye|3 years ago|reply
It makes sense to completely disregard language when looking at modern NLP solutions. In some sense, 'hand engineering' anything is looked down upon.

Transformers and scaling laws have made it such that the only thing that truly matters is your ability to build a model that can scale computationally and parametrically. The second is figuring out how to make more data usable within such a hungry model's encoding.

Look at the authors of the last 20 seminal papers in NLP: almost none of them have a strong background in linguistics. Vision went through a similar period of forced obsolescence during the 2012-2016 AlexNet -> VGG -> Inception -> ResNet transition.

It is unfortunate. But time is limited, and most researchers can only spare enough of it to learn a few new things. Unfortunately for linguistics, it does not rank that high.

[+] adamsmith143|3 years ago|reply
I'm not at all sympathetic to this viewpoint. The Deep Learning revolution has shown us time and time again that Deep Learning experts universally outperform subject-matter experts on modelling performance. I am almost 100% certain that the teams building the big Transformers which are now by far the best NLP models (OpenAI, Meta, Google Brain, DeepMind, etc.) are made up not of linguistics experts but of Deep Learning experts.
[+] philophyse|3 years ago|reply
In your opinion, would George Yule's The Study of Language be a good introduction to linguistics? Or is there any other book that you would recommend to someone who has little knowledge of the field, but a lot of interest?
[+] true_religion|3 years ago|reply
I don’t know if anyone wants to dive into NLP as much as they just want to solve their problem at hand.

You are right that a lack of fundamental knowledge is problematic, especially since tools let you produce a greater quantity of solutions and therefore also a greater quantity of mistakes.

However, at least the problem is still being solved.

For example, a few months ago I wanted to organize my media collection by tagging files with artist names. I had a list of artist names, but it wasn’t comprehensive, so I wired a bunch of Python NLP libraries together to automatically pull proper nouns out of filenames, recognize English names, then annotate the files.

I know almost nothing about parts of speech or anything else, so I made mistakes. About 10% of the results in the first run were errors, but after tuning that was down to about 1%, which was good enough to run over the entire media library.

If not for the tools, I would have never been able to finish that chore in a single day. To me, it was worth it despite my amateur mistakes.

I view the library just like any other tool: a screwdriver, a hammer, a wrench. I’m not a plumber, a carpenter, or an NLP researcher, but I still want to use tools to fix my leaky faucets, remount my leaning cabinet doors, and organize my media collection as weekend projects.

[+] LunaSea|3 years ago|reply
Is this still true in an era where most NLP problems use language models as a solution?
[+] xtiansimon|3 years ago|reply
“I recommend people spend a bit of time to read an(y) introduction to linguistics textbook…”

Linguistics is a broad area of study. Can you be more specific? Such as grammar and syntax?

[+] ad404b8a372f2b9|3 years ago|reply
"Every time I fire a linguist, the performance of the speech recognizer goes up"

- Frederick Jelinek

[+] vb234|3 years ago|reply
Could you recommend a good introduction to NLP book?
[+] meristem|3 years ago|reply
Do you have specific book suggestions?
[+] PainfullyNormal|3 years ago|reply
> I recommend people spend a bit of time to read an(y) introduction to linguistics textbook

Do you have a favorite you can recommend?

[+] irln|3 years ago|reply
The interface is great. Did you create the front-end/back-end from scratch?
[+] jasfi|3 years ago|reply
I'm working on extracting facts from sentences, see https://lxagi.com.

Which are the toughest NLP problems you know of that aren't being solved satisfactorily?

[+] Der_Einzige|3 years ago|reply
Queryable, word-level, extractive summarization with grammatical correctness. AKA: what a human does when they are "highlighting" a document.

Think extractive QA, but where the answer size is configurable and the answer can potentially be multiple spans, which need not be contiguous.
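As a deliberately naive baseline (far short of the word-level, multi-span task described above, and not anyone's actual system), one could rank candidate spans, here whole sentences, by TF-IDF similarity to the query; the document and query are invented:

```python
# Query-based extractive "highlighting": score each sentence against the
# query by TF-IDF cosine similarity and return the best match.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The model is trained on a large web corpus.",
    "Training takes four days on eight GPUs.",
    "The architecture uses twelve transformer layers.",
    "Evaluation covers three summarization benchmarks.",
]
query = "how long does training take"

vec = TfidfVectorizer()
matrix = vec.fit_transform(sentences + [query])
# Similarity of each sentence to the query (the last row of the matrix).
sims = cosine_similarity(matrix[:-1], matrix[-1]).ravel()

best = sentences[sims.argmax()]
print(best)  # prints "Training takes four days on eight GPUs."
```

Sentence-level retrieval like this ignores grammaticality, span boundaries, and non-contiguity, which is precisely why the full problem remains open.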

If you've got a solution, I'd love to see it - and you could even beat the baselines for the only dataset that exists for it: https://paperswithcode.com/sota/extractive-document-summariz...

[+] riku_iki|3 years ago|reply
Actually, the problem you are working on doesn't look satisfactorily solved yet either :-)
[+] airstrike|3 years ago|reply
Getting an invalid HTTPS certificate