True. But credit where credit is due. Very cool analysis for a throwaway blog post specifically manufactured to garner karma.
Only thing I'll add as a data critique: the negative factors are reported as things to avoid. But, in fact, all of the reported-on titles actually made it onto the Hacker News front page (1). There are an awful lot of submissions that never make it that far. In fact, the significance of the findings indicates that those terms make it onto the front page A LOT (2). I don't think the negatively correlated terms should necessarily be viewed as failures, just as less successful. My own suspicion is that those titles do draw eyeballs, but someone using titles like those is also likely to be kind of a bad writer, preventing those stories from getting upvotes. It would be very hard to prove a correlation between quality of title and quality of writing, though.
(1) I believe. Hard to tell from the post.
(2) Otherwise there wouldn't be enough data for them to be significant.
They're not exactly in rank-space. They discretize to a binary variable (whether or not an article made it into the top 20), then use logistic regression to model that, so the coefficients are in the log-odds space of that indicator.
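For anyone who wants the log-odds point spelled out, here is a minimal Python sketch. The cutoff of 20 is the one mentioned in this thread; the function names are made up for illustration and are not taken from the article's code.

```python
import math

def binarize_rank(rank, k=20):
    """Response variable: 1 if the story reached the top k, else 0."""
    return 1 if rank <= k else 0

def log_odds_to_probability(z):
    """Invert the logit: turn a summed log-odds score into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def odds_multiplier(coef):
    """A coefficient c multiplies the odds of reaching the top k by exp(c)."""
    return math.exp(coef)

print(binarize_rank(7), binarize_rank(45))
print(log_odds_to_probability(0.0))
print(odds_multiplier(1.0))
```

So a coefficient is not "ranks gained": it shifts the log-odds of the top-20 indicator, and only the sum of all active coefficients (plus an intercept) maps back to a probability.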
Very nice, but the analysis seems to assume that HN rank is determined by the headline and not by the content. (More precisely: for the analysis to give useful guidance to would-be HN headline writers, it needs not to be the case that content features correlated with headline features make a big difference to HN rank.)
My proposal for a good headline according to the numbers in this article: "Showing why impossible future controversy survived the problem could hire data". Score: 1.3 (could) + 1.2 (problem) + 1.3 (survived the) + 1.0 (controversy) + 0.9 (impossible) + 0.7 (why ___ future) - 3.3 (11 words) + 2.6 (showing) + 0.5 (hire) + 1.9 (data [END]) = 8.1. For comparison, "Why showing the future is essential to acquiring data" gets 1.4 (essential) + 0.7 (why ___ future) - 2.7 (9 words) + 2.6 (showing) + 1.7 (acquiring) + 1.9 (data) = 5.6 -- except that it doesn't really get the points for "essential" (not at start) or "why ___ future" (two words in between) or "acquiring" (not in second place, word isn't quite right). Of course my headline has the little drawback of being total nonsense.
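The arithmetic above can be checked with a few lines of Python. The weights are copied from the comment itself (not re-derived from the article), and `score` is just an illustrative helper name.

```python
# Per-feature weights exactly as quoted in the comment above.
nonsense_headline = [
    ("could", 1.3), ("problem", 1.2), ("survived the", 1.3),
    ("controversy", 1.0), ("impossible", 0.9), ("why ___ future", 0.7),
    ("11 words", -3.3), ("showing", 2.6), ("hire", 0.5), ("data [END]", 1.9),
]
comparison_headline = [
    ("essential", 1.4), ("why ___ future", 0.7), ("9 words", -2.7),
    ("showing", 2.6), ("acquiring", 1.7), ("data", 1.9),
]

def score(features):
    """Sum the weights, rounded to one decimal like the quoted totals."""
    return round(sum(weight for _, weight in features), 1)

print(score(nonsense_headline))    # 8.1
print(score(comparison_headline))  # 5.6
```

Both totals match the ones quoted, before discounting the features the second headline doesn't really earn.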
Great -- I'm hoping L1-regularized logistic regression will become the standard first thing to try in these quick-n-dirty "predict response variable from text" experiments. That's our approach too. (I assume this is L1 or similar since you mention regularization causing feature selection.)
[[ Edit: deleted question about what 'k' is for the discretized 1{ rank <= k } response. It's mentioned in the article ]]
yeah pretty strong l1--most features were 0. we binarized rank on I_{rank<=20}. it turns out there are tons of articles beyond the first page that stay low forever. check out the interactive viz vad made: http://hn.metamx.com (warning 2.6MB compressed js ahead)
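A rough sketch of why a strong L1 penalty leaves most coefficients at exactly 0. This is plain proximal gradient descent with soft-thresholding in pure Python, not necessarily the solver used for the article, and the four-row dataset, the two-word vocabulary, `lam`, and `step` are all invented for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(w, t):
    # Proximal step for the L1 penalty: shrink toward 0, clip small values to 0.
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def fit_l1_logistic(X, y, lam=0.1, step=0.5, iters=500):
    """Minimise mean logistic loss + lam * sum(|w|); the intercept is unpenalised."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
        b -= step * grad_b
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b

# Toy bag-of-words rows over a hypothetical vocabulary ["showing", "the"].
# y = 1 means "reached the top 20". Only the first word carries signal.
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
weights, bias = fit_l1_logistic(X, y)
print(weights)
```

The informative word ends up with a positive weight, while the uninformative one never moves off 0.0 because its gradient stays below the threshold. That is the "regularization doing feature selection" effect.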
I thought this problem was only with Digg, but I've experienced the same with my submissions. It's funny that people judge content by headlines; we need a better way.
Find someone who reads Hacker News [1] and blogs. Subscribe to their RSS feed. There's a trick to [1], no doubt. But for most people the time savings far outweigh the occasional mismatch between your interests and the delegate's.
[1] For the same types of articles you do, ideally.
Curious behavior arises from HN's feed inside Google Reader with "Sort by Magic" turned on. It seems to keep the good stuff towards the top, but anything really spammy and sensational occasionally gets the top place (so watch out for anything hitting #1 in there suddenly). You also tend to miss some of the more obscure goodies, which arguably I miss from time to time anyway. Still, it is a curiously different ranking, probably mostly driven by Google Reader "likes" and sharing.
[+] [-] vnorby|15 years ago|reply
Why showing the future is essential to acquiring data
Noted.
[+] [-] sesqu|15 years ago|reply
I have a feeling that headline wouldn't do all that well, but it does seem to be in keeping with Google's culture.
[+] [-] agscala|15 years ago|reply
From the article, what's the difference between "data |" and "data -" ?
[+] [-] HardyLeung|15 years ago|reply
[+] [-] joshu|15 years ago|reply
Also, no clue if the factors you pulled out are orthogonal.
[+] [-] rauljara|15 years ago|reply
[+] [-] brendano|15 years ago|reply
[+] [-] pge|15 years ago|reply
[+] [-] HardyLeung|15 years ago|reply
"Essential Lessons Showing How to Hack Hacker News with Data Visualization"
and it will be ranked #1 in no tim... never mind :D
[+] [-] DanielRibeiro|15 years ago|reply
[+] [-] vorbby|15 years ago|reply
This headline uses none of the hacks described in the article, yet it is ranking quite well.
Perhaps people should focus on letting the content speak for itself rather than using tricks like this?
[+] [-] derefr|15 years ago|reply
[+] [-] vlokshin|15 years ago|reply
[+] [-] gjm11|15 years ago|reply
[+] [-] brendano|15 years ago|reply
[+] [-] joeraii|15 years ago|reply
[+] [-] powdahound|15 years ago|reply
[+] [-] kingsidharth|15 years ago|reply
[+] [-] roc|15 years ago|reply
[+] [-] th0ma5|15 years ago|reply
[+] [-] unknown|15 years ago|reply
[deleted]
[+] [-] bzupnick|15 years ago|reply