top | item 7400263

What is an easy way to classify a word?

3 points| hakann | 12 years ago

I want to write a program that can classify its input. For example, if the input is Japan, it should output country. For reddit --> website. Ferrari --> car. For some words this is as easy as looking at the first sentence of the word's Wikipedia article. But I want to get more specific, such as Ferrari --> car --> sports car, or reddit --> website --> website with user-generated content. I am sure people have done this in NLP; a pointer to where I should start would be very much appreciated.

6 comments

yannyu|12 years ago

I'm not sure if this is what you're looking for, but classification and entity extraction are most often done with statistical models or machine learning.

You would take a corpus of documents, manually "tag" them with the entities you're looking for, and designate that as your training corpus. You would then run that through a machine learning toolkit (such as Apache OpenNLP, https://opennlp.apache.org/) to train a model, and then use the resulting model to process new text and identify the entities it was trained on.
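
A minimal sketch of that tag, train, classify loop, in plain Python rather than OpenNLP. Everything here is an assumption for illustration: the six hand-tagged sentences, the labels, and the choice of a bag-of-words naive Bayes model with add-one smoothing. A real training corpus would need far more tagged text.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-tagged training corpus: (sentence, label of the entity described).
TRAINING = [
    ("Japan is a country in East Asia", "country"),
    ("France is a country in Europe with a large population", "country"),
    ("reddit is a website where users post links", "website"),
    ("Wikipedia is a website anyone can edit online", "website"),
    ("Ferrari builds a fast car with a powerful engine", "car"),
    ("Porsche makes a sports car with a rear engine", "car"),
]


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def train(examples):
    """Count labels and per-label word frequencies from the tagged corpus."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)
print(classify(model, "Brazil is a large country in South America"))
print(classify(model, "a website where users share photos online"))
```

Note that this classifies a sentence about the word, not the bare word itself; the model only knows what it saw in context during training.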

hakann|12 years ago

Yes, I expect to use statistical models/ML. I am more comfortable with Python than Java, so I will look into NLTK first. Thank you for your response!

msantos|12 years ago

If you're looking for a fairly extensive pre-compiled database rather than getting into NLP, check out DBpedia and Freebase.

http://en.wikipedia.org/wiki/DBpedia

http://en.wikipedia.org/wiki/Freebase

zharkov|12 years ago

Not sure if it's exactly what you're looking for, but you might want to check out WordNet, a lexical database. You can first try it online to see if it fits your needs: http://wordnetweb.princeton.edu/perl/webwn
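
WordNet links each noun to more general terms (hypernyms), which is the kind of chain the question asks for. A rough sketch of walking such a chain, using a tiny hand-written dictionary as a stand-in: the `HYPERNYMS` table and `category_chain` helper below are hypothetical, and these entries are not taken from WordNet itself.

```python
# Hypothetical toy taxonomy: each term maps to its more general parent
# ("is a kind of"). A lexical database stores links like these at scale.
HYPERNYMS = {
    "ferrari": "sports car",
    "sports car": "car",
    "car": "motor vehicle",
    "motor vehicle": "vehicle",
    "reddit": "website with user generated content",
    "website with user generated content": "website",
    "japan": "country",
}


def category_chain(term, taxonomy=HYPERNYMS):
    """Return [term, parent, grandparent, ...] by following hypernym links."""
    chain = [term.lower()]
    seen = set(chain)
    while chain[-1] in taxonomy:
        parent = taxonomy[chain[-1]]
        if parent in seen:  # guard against accidental cycles in the data
            break
        chain.append(parent)
        seen.add(parent)
    return chain


print(" --> ".join(category_chain("Ferrari")))
print(" --> ".join(category_chain("reddit")))
```

The chain runs from specific to general; reverse the list (keeping the input word first) to get the Ferrari --> car --> sports car ordering from the question.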

sp332|12 years ago

I think the technical term is "tagging". Check out the Natural Language Toolkit. http://www.nltk.org/

hakann|12 years ago

Thank you! This is exactly what I am looking for.