HN is in the same cluster as 2ch, not Techcrunch, on Twitter

197 points | rabidsnail | 10 years ago | hella.cheap

46 comments

[+] bhouston|10 years ago|reply
2D projections of complex multidimensional data are extremely unreliable when it comes to what adjacency means. Most adjacencies, especially, are artifacts of the chosen projection method.
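One way to see the effect described above is to check how often a point's nearest neighbor survives a projection. The sketch below is only an illustration: the points are synthetic Gaussian noise, the projection is a crude "keep the first two coordinates" (not the method used in the linked post), and the helper names are hypothetical.

```python
import math
import random


def nearest_neighbor(points, i):
    """Return the index of the point closest to points[i] (Euclidean distance)."""
    best, best_dist = None, math.inf
    for j, q in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], q)
        if d < best_dist:
            best, best_dist = j, d
    return best


def neighbor_agreement(points, projected):
    """Fraction of points whose nearest neighbor is the same before and after projection."""
    same = sum(
        nearest_neighbor(points, i) == nearest_neighbor(projected, i)
        for i in range(len(points))
    )
    return same / len(points)


rng = random.Random(0)
# 200 structureless points in 50 dimensions (synthetic, not the blog post's data).
points = [[rng.gauss(0, 1) for _ in range(50)] for _ in range(200)]
# Crude linear projection: keep only the first two coordinates.
projected = [p[:2] for p in points]

print(f"nearest neighbors preserved: {neighbor_agreement(points, projected):.0%}")
```

On data with no real structure, most points that look adjacent in the 2D picture are not nearest neighbors in the original space. Data with strong clusters would fare better.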
[+] daniel-levin|10 years ago|reply
This comment got me thinking: in some applications, Euclidean distance between feature vectors acts as a good proxy for adjacency/similarity. For such applications, an isometry from R^n to R^2 or R^3 should in principle preserve the meaning of adjacency. A quick Google yields techniques for quasi-isometric and isometric dimensionality reduction [0, 1]. This should mitigate artefacts of adjacency, or non-adjacency, as it were. In other words, you might be able to actually pull off good 2D projections of high-dimensional data and still see meaningful relationships.

[0] https://en.wikipedia.org/wiki/Isomap

[1] https://www.aaai.org/Papers/AAAI/2007/AAAI07-083.pdf

[+] rabidsnail|10 years ago|reply
For small distances, yes. If you plot a 2D projection of a dataset that doesn't have much structure, you're going to be reading patterns into white noise (though this data has some pretty clear clusters, which are probably real). If I were doing something other than writing a fun blog post, I would have done cluster analysis with something like DBSCAN.
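For readers unfamiliar with DBSCAN, here is a minimal pure-Python sketch of the algorithm. It is a teaching version under stated assumptions, not what the post's author ran: the `eps` and `min_pts` values are arbitrary, the sample points are made up, and a real analysis would more likely use a library implementation.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def region(i):
        # All points within eps of points[i], including i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep expanding
    return labels


# Hypothetical data: two tight blobs and one far-away outlier.
blob_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
blob_b = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
outlier = [(2.5, 9.0)]
labels = dbscan(blob_a + blob_b + outlier, eps=0.3, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The appeal over eyeballing a scatter plot is that clusters are defined by density in the original feature space, and points that belong nowhere are reported as noise rather than forced into a group.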
[+] personjerry|10 years ago|reply
I wonder if I could post a randomly generated graph, label it with HN-interested labels arbitrarily, and get a serious talk started on HN about nonexistent correlations.
[+] hapless|10 years ago|reply
TechCrunch reports on us. It is journalism for the spectators. The Twitter cluster of people sharing TC links is TC's audience, not participants in TC's subject matter.

Why in blue hell would anyone on HN be sharing TC links? Intuitively it seems more likely that people who share HN links are discussing these matters directly.

[+] bitbckt|10 years ago|reply
Interesting parallel observation: when I worked for a regional newspaper some years ago, we rolled out products for the same demo as "mommy blog Twitter". We saw the same sort of isolated behavior - visitors to "mommy blog content" almost never strayed onto our mainstream products.

The same sorts of products delivered to "puppy and kitty" people didn't have the same effect, though the level of vitriol in the comments was similar.

[+] madaxe_again|10 years ago|reply
Ditto. Launched (well, we built - client project) a social network for moms nearly a decade ago, and they were Not Interested in anything outside of the core offering - even recipes, which you would have thought would be interesting, weren't - until they rebranded along the lines of "recipes for moms", which changed that interaction overnight.

Some demographics choose tighter filter bubbles for themselves than others, and moms are likely up there, as the single most important thing to mothers tends to be being a mother - it becomes an all-encompassing identity for many.

[+] hkmurakami|10 years ago|reply
Considering nicovideo is anti-establishment media (it's owned by Kadokawa, which is an underdog media company with strong subculture roots) and that 2ch "summary sites" double as news sources for the anti-establishment these days, the association seems apt.
[+] newobj|10 years ago|reply
This is amazing, one of my favorite articles on HN ever.

I'm really curious what the heck that "eye" is in the bottom right space of the clusters. Some cluster so radically orthogonal to any other content it has an order of magnitude more distance in differentiation?

[+] stephenboyd|10 years ago|reply
This is cool. How many sampled tweets did HN links appear in? How many sampled tweets did you have overall?

I'm curious if a sampling error could explain why an English website like HN would get placed with the Japanese language sites. StackOverflow isn't placed by any related sites either.

If the weird results aren't from sampling artifacts, my best guess is that a lot of spambots must be linking to multiple legit sites regardless of relevance.

[+] brownbat|10 years ago|reply
I really hope someday we get spambots that start off by trying to make useful contributions. Then later, after building a following, start advertising scams.

I'm confident that, given the right incentives, spam kings could discover conversational AI before any lab.

[+] swerling|10 years ago|reply
This is fantastic. Feature request: drag a rectangle over a group of dots, and see them as a text list of websites. As is, it's hard to see all the sites that are in a dense dot cluster.
[+] TazeTSchnitzel|10 years ago|reply
Quran quotes being grouped with archive.org might be explained by the Internet Archive frequently being used to host Islamist materials.
[+] runn1ng|10 years ago|reply
Just today I wondered why so few journalists are picking up on the fact that ISIS is using archive.org almost exclusively for uploading their beheading and other PR videos.
[+] wodenokoto|10 years ago|reply
> Japanese social media twitter (which I'm labelling as "2ch", though it's not just 2ch) is almost completely distinct from what I'm calling "upstanding japanese twitter" (links to mainstream news sites like news24)

I have no idea what the point of the headline is after reading the above part of the post.

[+] Ezhik|10 years ago|reply
That's interesting. Never would've made the connection myself, although now that I think about it, some of the most fascinating discussions I've read on HN involved Japanese work culture.
[+] ChuckMcM|10 years ago|reply
This is some fascinating analysis. And like the author, I am amazed that Twitter doesn't crack down harder on their spambots.
[+] n0us|10 years ago|reply
I've wondered that as well. I'm not "active" on Twitter but I log on occasionally to see if there are any interesting tweets in my feed. Every time I log on I have a new follower from penny stocks twitter, get rich quick schemes, and various other fake profiles. This seems to stay stable at around 20 fake followers as old ones get erased and new ones follow.

It seems like amateurs are more capable of detecting spam than the entire company, but I sometimes wonder if they just know about it and leave the spam bots alone because once they crack down, new ones will just pop up. Or if they keep them around at a tolerable level that doesn't drive real users away but still allows them to publish a higher "user count".

[+] jonesb6|10 years ago|reply
Well it's whack-a-mole isn't it? Take down one spam network and another crops up with an entirely different methodology and signature. If I was managing a large social network that suffered from bots I would whack until I came across an opponent that did the least possible damage, then weaken it through things like shadow bans etc to the point where it won't die but will operate with the bare minimum amount of damage to the network.
[+] jerrickhoang|10 years ago|reply
I think a more interesting problem is how you can differentiate a spambot from a 'non-spam' bot. I've seen some bots that are really creative and fun on Twitter. I guess it's not really hard to add it to a spam detection ML model.
[+] Rayearth|10 years ago|reply
So HN is close to nico (Japanese youtube) and pixiv (Japanese-centric art and fanart site)? Interesting.
[+] forrestthewoods|10 years ago|reply
What are all of the other twitters? There is so much undocumented space! I want to know what it all is!
[+] simcop2387|10 years ago|reply
Is the regex search in the demo not working for anyone else? (Tested both Chrome and Firefox on Win7.)
[+] rabidsnail|10 years ago|reply
There's no UI for if there are no matches; it just does nothing. Try searching for \.com or something.

Edit: I patched it so it displays an alert if there are no matches.
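The behavior described above (report something when the regex matches nothing, instead of silently doing nothing) can be sketched as follows. This is not the demo's actual code, which is not shown in the thread; the function name, the message wording, and the sample site list are all hypothetical.

```python
import re


def search_sites(pattern, sites):
    """Return (matches, message); message is None when something matched."""
    try:
        regex = re.compile(pattern)
    except re.error as exc:
        # A malformed pattern also deserves feedback rather than silence.
        return [], f"Invalid regex: {exc}"
    matches = [s for s in sites if regex.search(s)]
    if not matches:
        return [], f"No sites match /{pattern}/"
    return matches, None


sites = ["news.ycombinator.com", "techcrunch.com", "hella.cheap"]  # hypothetical sample
print(search_sites(r"\.com", sites))
print(search_sites(r"\.org$", sites))
```

Returning the message alongside the results leaves it to the caller to decide how to surface it, for example as an alert.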

[+] kitwalker12|10 years ago|reply
(Update) see rabidsnail's suggestion

not working for me on Chrome or Safari either

[+] gohrt|10 years ago|reply
Why does the hella.cheap site have an SSL cert with an unknown authority?
[+] tokenizerrr|10 years ago|reply
It has a COMODO certificate. If you see otherwise, you might be getting MITM'd.
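To check for yourself which authority issued the certificate you are being served, something like the following sketch works with Python's standard `ssl` module. The helper names are hypothetical, and the issuer you see today may differ from what was served when this thread was written.

```python
import socket
import ssl


def issuer_fields(cert):
    """Flatten the 'issuer' entry of an ssl.getpeercert() dict into a plain dict."""
    return {key: value for rdn in cert.get("issuer", ()) for key, value in rdn}


def fetch_issuer(host, port=443, timeout=5.0):
    """Connect to host and return the issuer fields of the certificate it presents.

    Uses the system trust store, so an untrusted chain raises
    ssl.SSLCertVerificationError instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return issuer_fields(tls.getpeercert())
```

Calling `fetch_issuer("hella.cheap").get("organizationName")` from two different networks and comparing the results is a quick way to spot an intercepting proxy: a verification error or an unexpected issuer on only one of them is a red flag.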