(no title)
myffical | 15 years ago
It might be interesting to compare your word counts against the word counts from a general-purpose corpus, then pick out the words that appear more often by a statistically significant margin. Something like Amazon's Statistically Improbable Phrases algorithm.
waldrews | 15 years ago
P * ln(P/Q)
where P is the frequency of the word in the narrow corpus (HN titles)
and Q is the frequency of the word in the general-purpose corpus
(The formula breaks if Q is ever zero. That won't happen if the broader corpus includes the narrower one, as it should, but as a practical matter, just set Q := (1-a)*Q + a*P for a small positive a to simulate merging the smaller corpus into the larger.)
http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_diverg...
Anybody with more time than I have at the moment want to code this up?
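For what it's worth, here is a rough Python sketch of the scoring described above: each word gets P * ln(P/Q), with Q smoothed as Q := (1-a)*Q + a*P so it is never zero. The function names (kl_word_scores, top_words), the default a=0.01, and the toy word lists are all assumptions made up for illustration, not anything from the thread. It also assumes both corpora are already tokenized into lists of words.

```python
import math
from collections import Counter


def kl_word_scores(narrow_words, general_words, a=0.01):
    """Score each word in the narrow corpus by P * ln(P / Q).

    P is the word's relative frequency in the narrow corpus and Q its
    relative frequency in the general corpus. Q is smoothed as
    (1 - a) * Q + a * P, which mimics merging a little of the narrow
    corpus into the general one. The value of `a` is an assumption.
    """
    if not 0 < a <= 1:
        raise ValueError("a must be in (0, 1]")
    narrow = Counter(narrow_words)
    general = Counter(general_words)
    narrow_total = sum(narrow.values())
    general_total = sum(general.values())

    scores = {}
    for word, count in narrow.items():
        p = count / narrow_total
        q = general[word] / general_total if general_total else 0.0
        q = (1 - a) * q + a * p  # smoothing keeps q strictly positive
        scores[word] = p * math.log(p / q)
    return scores


def top_words(scores, n=10):
    """Return the n words with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    titles = "show hn my startup launch startup".split()
    general = "the cat sat on my mat the launch".split()
    scores = kl_word_scores(titles, general)
    for word in top_words(scores, 3):
        print(word, round(scores[word], 3))
```

Words that are over-represented in the narrow corpus get positive scores, words that are equally common in both score about zero, and words that are rarer in the narrow corpus go negative. Note this ranks by contribution to the divergence rather than testing statistical significance, so very rare words may still need a minimum-count cutoff.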