top | item 1600036

myffical | 15 years ago

You need to massage your data to get more meaningful results.

It might be interesting to compare your word counts with the word counts from a general-purpose word corpus, then pick out words that appear more frequently by a statistically significant amount. Something like Amazon's statistically improbable phrases algorithm.

waldrews | 15 years ago

I'd suggest, as a simple heuristic for ranking words for improbability/relevance, contribution to K-L divergence from the frequencies in the general-purpose word corpus:

P * ln(P / Q)

where P is the frequency of the word in the narrow corpus (HN titles)

and Q is the frequency of the word in the general-purpose corpus

(The formula doesn't work if Q is ever zero; this won't happen if the broader corpus includes the narrower one, as it should, but as a practical fix, just set Q := (1-a)*Q + a*P for some small positive a to simulate merging the smaller corpus into the larger.)

http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_diverg...

Anybody with more time than I have at the moment want to code this up?
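A minimal sketch of the heuristic above, assuming whitespace-tokenized corpora and the suggested smoothing; the function name and `alpha` parameter are my own choices, not from the comment:

```python
import math
from collections import Counter

def kl_contribution_ranking(narrow_tokens, broad_tokens, alpha=1e-3):
    """Rank words by their contribution P * ln(P / Q) to the K-L
    divergence of the narrow corpus (e.g. HN titles) from a
    general-purpose corpus."""
    p_counts = Counter(narrow_tokens)
    q_counts = Counter(broad_tokens)
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())

    scores = {}
    for word, n in p_counts.items():
        p = n / p_total
        q = q_counts.get(word, 0) / q_total
        # Smoothing from the comment: Q := (1-a)*Q + a*P, which keeps
        # the log defined even when the word is absent from the broad corpus.
        q = (1 - alpha) * q + alpha * p
        scores[word] = p * math.log(p / q)
    # Highest contribution first = most "improbable"/relevant words.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Words unique to the narrow corpus get a large score (their smoothed Q is tiny), while words that are common everywhere score near zero, which matches the intent of surfacing corpus-specific vocabulary.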