What is an easy way to classify a word?

3 points| hakann | 12 years ago

I want to write a program that can classify its input. For example, if the input is Japan, it should output country; reddit --> website; Ferrari --> car. For some words this is as easy as looking at the first sentence of the Wikipedia article for that word. But I want to get more specific, such as Ferrari --> car --> sports car, or reddit --> website --> website with user-generated content. I am sure people have done this in NLP; a pointer to where I should start would be very much appreciated.
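To make the desired output concrete, here is a minimal Python sketch of the coarse-to-fine chain described above. The `PARENT` and `MOST_SPECIFIC` tables and the `category_chain` function are hypothetical and hand-written for illustration only; a real system would have to learn or look up these relations rather than hard-code them.

```python
# Hypothetical, hand-written taxonomy: each category maps to its broader parent.
PARENT = {
    "sports car": "car",
    "website with user generated content": "website",
}

# Hypothetical lookup from a word to its most specific known category.
MOST_SPECIFIC = {
    "ferrari": "sports car",
    "reddit": "website with user generated content",
    "japan": "country",
}


def category_chain(word):
    """Return categories from most general to most specific, or [] if unknown."""
    category = MOST_SPECIFIC.get(word.lower())
    chain = []
    # Walk upward from the most specific category to the most general one.
    while category is not None:
        chain.append(category)
        category = PARENT.get(category)
    chain.reverse()
    return chain


print(" --> ".join(["Ferrari"] + category_chain("Ferrari")))
# prints: Ferrari --> car --> sports car
```

The hard part of the problem is filling in those two tables automatically, which is where the statistical approaches come in.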

6 comments

yannyu|12 years ago

I'm not sure if this is what you're looking for, but most often classification/entity extraction is done by using statistical models/machine learning.

You would take a corpus of documents, manually "tag" them with the entities you're looking for, and designate that as your training corpus. You would then run that through a machine learning toolkit (such as Apache OpenNLP, https://opennlp.apache.org/), and use the resulting model to process text and identify the entities it was trained on.
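The tag, train, and apply workflow described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the six hand-tagged sentences in `TRAINING` are made up, the model is a toy bag-of-words Naive Bayes classifier with add-one smoothing rather than anything OpenNLP-specific, and the names `tokenize`, `train`, and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-tagged training corpus: (text, label) pairs.
TRAINING = [
    ("Japan is an island country in East Asia", "country"),
    ("France is a country in Western Europe", "country"),
    ("Reddit is a website where users post links", "website"),
    ("Wikipedia is a website anyone can edit", "website"),
    ("Ferrari is a car maker known for sports cars", "car"),
    ("Porsche builds fast sports car models", "car"),
]


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def train(examples):
    """Count labels and per-label word frequencies from the tagged corpus."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest Naive Bayes log-probability."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)
print(classify(model, "Germany is a country in Europe"))  # prints: country
```

Note that this classifies a word by the text around it, not by the word alone: "Germany" never appears in the training data, but its context does. A real toolkit adds better tokenization, richer features, and far more training data.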

hakann|12 years ago

Yes, I expect to use statistical models/ML. I am more comfortable with Python rather than Java so I will look into the NLTK first. Thank you for your response!

sp332|12 years ago

I think the technical term is "tagging". Check out the Natural Language Toolkit. http://www.nltk.org/

hakann|12 years ago

Thank you! This is exactly what I am looking for.