Translate IDF to "how uncommon is this word in the corpus?"
TF-IDF solves an important problem and it's good to know about.
gibrown|10 years ago
https://issues.apache.org/jira/browse/LUCENE-6789
https://en.wikipedia.org/wiki/Okapi_BM25
In the very limited test cases where I've compared them, it hasn't mattered much, but others' results are pretty compelling.
https://www.elastic.co/blog/found-bm-vs-lucene-default-simil...
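For anyone who wants to see what BM25 actually computes, here is a rough sketch of the textbook scoring formula as described on the Wikipedia page linked above. It is not Lucene's code: the function name `bm25_scores` is made up for illustration, `k1=1.2` and `b=0.75` are just commonly quoted defaults, and the `+ 1.0` inside the log is one common variant that keeps IDF non-negative.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against the query terms."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a larger IDF; the +1.0 keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates (k1) and is length-normalized (b).
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores

docs = [
    ["cat", "sat"],
    ["dog", "sat"],
    ["cat", "cat", "cat", "nap"],
]
scores = bm25_scores(["cat"], docs)
print(scores)
```

The main practical difference from plain TF-IDF is the saturation: repeating a term more times helps less and less, and long documents are penalized relative to the average length.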
meeper16|10 years ago
http://52.11.1.7/TuataraSum/example_context_control-ml2.html
rohwer|10 years ago
TF-IDF is acronym soup, but mathematically simple: IDF is a per-term weight that scales a term's frequency. And in the comparison (cosine similarity), the numerator is the document overlap score (the dot product of the two weighted term vectors) and the denominator is the product of the two documents' vector lengths (each one the square root of that document's summed squared weights). For more, Stanford's natural language processing course is the bee's knees: https://class.coursera.org/nlp/lecture/preview
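To make that concrete, here is a minimal sketch in plain Python, assuming documents are already tokenized and using the basic `log(N / df)` form of IDF. The names `tfidf_vectors` and `cosine` and the toy documents are made up for illustration; real libraries usually add smoothing and normalization on top of this.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    # IDF: "how uncommon is this word in the corpus?"
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in docs
    ]

def cosine(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "stock markets fell sharply today".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # share some terms, so above zero
print(cosine(vecs[0], vecs[2]))  # share no terms, so zero
```

Note that with this unsmoothed IDF, a word appearing in every document gets weight zero and drops out of the comparison entirely.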
nathell|10 years ago
However, in some applications, such as Latent Semantic Analysis (LSA) and its generalizations, there are alternatives such as log-entropy [1] that I've found to work better in practice.
[1]: http://link.springer.com/article/10.3758%2FBF03203370#page-1
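For comparison with TF-IDF, a rough sketch of log-entropy weighting as it is usually formulated: a local weight of `log(1 + tf)` times a global weight of `1 + sum(p * log p) / log(n_docs)`, where `p` is the share of a term's total occurrences that falls in each document. This is my reading of the common formulation, so check it against [1] before relying on it; the function name `log_entropy_vectors` is made up for illustration.

```python
import math
from collections import Counter

def log_entropy_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    if n_docs < 2:
        raise ValueError("need at least two documents (log(1) is zero)")
    counts = [Counter(doc) for doc in docs]
    # Total occurrences of each term across the whole corpus.
    global_freq = Counter()
    for c in counts:
        global_freq.update(c)
    global_weight = {}
    for term, gf in global_freq.items():
        entropy_sum = 0.0
        for c in counts:
            tf = c.get(term, 0)
            if tf:
                p = tf / gf
                entropy_sum += p * math.log(p)
        # 1.0 for a term confined to one document,
        # 0.0 for a term spread evenly over all documents.
        global_weight[term] = 1.0 + entropy_sum / math.log(n_docs)
    return [
        {term: math.log(1 + tf) * global_weight[term] for term, tf in c.items()}
        for c in counts
    ]

vecs = log_entropy_vectors([["a", "b"], ["a", "c"]])
print(vecs)
```

Unlike IDF, the global weight here looks at how evenly a term is spread, not just how many documents contain it.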
rhema|10 years ago
meeper16|10 years ago
Yahoo Paid $30 Million in Cash for 18 Months of Young Summly http://allthingsd.com/20130325/yahoo-paid-30-million-in-cash...
Google Buys Wavii For North Of $30 Million http://techcrunch.com/2013/04/23/google-buys-wavii-for-north...
yannyu|10 years ago
https://lucene.apache.org/
http://lucene.apache.org/solr/
https://www.elastic.co/
wyldfire|10 years ago
EDIT: according to SO, yes: http://stackoverflow.com/a/2009546/489590
dangerlibrary|10 years ago
"Wait a minute. Strike that. Reverse it. Thank you."
TF-IDF is old, and very cool. N-gram-based extensions of it are a bit newer, but are likely implemented in almost exactly the same way. N-grams just require a lot more compute power because the number of distinct terms grows much faster than with a plain ol' bag of words.
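A small sketch of why the n-gram version is "almost exactly the same": the only change is what counts as a term, and the counting (and any TF-IDF weighting on top) stays as it was. The helper names `ngrams` and `ngram_terms` and the sample sentence are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token windows, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_terms(tokens, max_n):
    """Unigrams up to max_n-grams, pooled into one list of terms."""
    terms = []
    for n in range(1, max_n + 1):
        terms.extend(ngrams(tokens, n))
    return terms

tokens = "the quick brown fox jumps over the lazy dog and the quick brown cat".split()

bag_of_words = Counter(ngram_terms(tokens, 1))   # unigrams only
with_bigrams = Counter(ngram_terms(tokens, 2))   # unigrams plus bigrams

# Number of distinct terms the downstream weighting has to track.
print(len(bag_of_words), len(with_bigrams))
```

Even on one sentence the term count roughly doubles once bigrams are added; on a real corpus the gap is far larger, which is where the extra compute and memory go.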
languagehacker|10 years ago