Deep Learning on Title and Content Features to Tackle Clickbait

[+] grabcocque|9 years ago|reply

The problem is, we don't have a clear definition of what clickbait is.

noun, informal: (on the Internet) content whose main purpose is to attract attention and encourage visitors to click on a link to a particular web page.

But that's basically everything on the web.

[+] BooglyWoo|9 years ago|reply

I'm not sure that quite addresses the problem here.

After all there is no clear definition of 'what dogs look like' (in the sense of a collection of logical rules), but deep learning models excel at detecting them, when provided with enough positive examples.

If it's possible for humans to agree on whether a given article is clickbait or not, we should be able to put together an adequate dataset for training a system to classify them too. From the linked article I am unable to discern how the training dataset was labelled.

In other words, the fact that 'clickbait' is a nebulous concept shouldn't preclude machine learning from being able to detect it.
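To make the labelling idea concrete, here is a minimal sketch, assuming several human annotators each mark a headline as clickbait (1) or not (0). The function name `majority_label` and the 0.7 agreement threshold are illustrative assumptions; the linked article does not say how its dataset was labelled.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[int], min_agreement: float = 0.7) -> Optional[int]:
    """Return the majority label, or None when annotators disagree too much."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    # Keep the example only if a clear majority of annotators agree.
    return label if count / len(votes) >= min_agreement else None
```

Headlines that come back as `None` would simply be left out of the training set, so the model only sees cases where humans actually agree.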

[+] trevyn|9 years ago|reply

The core of the definition is subtly wrapped in "main purpose" -- once the attention is attracted and the link is clicked, the clickbait's job is done. So the content of the article will be lower quality and less intellectually satisfying than non-clickbait articles.

For example, if you charted "interest on clicking this link text" vs "satisfaction with article after reading", I think clickbait would be clearly in the high interest vs low satisfaction quadrant.
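A rough sketch of that chart as code, assuming both scores have somehow been measured and scaled to the range 0 to 1. The function name, the 0.5 threshold, and the labels for the other three quadrants are my own assumptions, not part of the comment above.

```python
def quadrant(interest: float, satisfaction: float, threshold: float = 0.5) -> str:
    """Place an article on the interest-vs-satisfaction chart (scores in [0, 1])."""
    high_interest = interest >= threshold
    high_satisfaction = satisfaction >= threshold
    if high_interest and not high_satisfaction:
        return "clickbait candidate"
    if high_interest and high_satisfaction:
        return "delivers on the headline"
    if high_satisfaction:
        return "undersold"
    return "ignored"
```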

[+] imanewsman|9 years ago|reply

In 2014, Jon Stewart offered an interesting definition of clickbait:

"I scroll around, but when I look at the internet, I feel the same as when I’m walking through Coney Island. It’s like carnival barkers, and they all sit out there and go, 'Come on in here and see a three-legged man!' So you walk in and it’s a guy with a crutch."

The thing is, he was talking about BuzzFeed when he said that, and that is not what BuzzFeed does at all. BuzzFeed's editor wrote about the distinction here, and it's the most insightful article I've read on the topic:

https://www.buzzfeed.com/bensmith/why-buzzfeed-doesnt-do-cli...

People tend to consider things like lists clickbait, even though those articles usually deliver exactly what the headline suggests. (If you click on "23 photos of kittens that are just too adorable," that is what you will get.) But because it's an article that was made specifically to get traffic, people incorrectly call it clickbait.

And it often goes even further than that. On Reddit and Hacker News, commenters constantly call articles clickbait. Sometimes it's true, and there's a sensational headline that leads to a bullshit story. But just as often, the story delivers on what the headline promises, but commenters call it clickbait because the headline is slightly hyperbolic, snappy, or just plain well-written.

[+] hyperpape|9 years ago|reply

I would define clickbait as articles which intentionally try to disguise what you'll get out of reading them. The information is banal, but the headline makes it out to be revolutionary or shocking.

You might disagree with the details of the formulation, but I think there's pretty broad agreement that something similar is going on with clickbait.

[+] richdougherty|9 years ago|reply

I guess it's an inherently fuzzy concept, so quite a good fit for machine learning.

But my definition of clickbait is any link I follow where I feel like I've been tricked into the click. The link looked interesting, but I feel regret once I see the actual content.

[+] Cthulhu_|9 years ago|reply

A definition would need a bit more fleshing out, mostly about the (lack of) actual content: a long-winded page (not just text) that eventually leads to a core which could be summarised in one line, or even in the article title itself (like 'peanut butter is made out of peanuts' instead of "you'll NEVER guess this ONE SECRET peanut butter ingredient!").

[+] visarga|9 years ago|reply

I'd like a system to filter out fluff threads on reddit. It would reject easy-consumption content such as images, gifs and short vids, or anything shorter than 60 seconds; also, low quality comments (short, aggressive, memes, etc).

Reddit is a gold-mine of interesting content, but it is flooded with fluff and garbage to the point where it becomes a problem to find the good parts.

I'm wondering why they don't use more machine learning magic on the site. There are multiple machine learning papers based off the reddit comment corpus.

[+] arkitaip|9 years ago|reply

Vanilla Reddit is almost garbage because of how default subreddit posts take over your front page.

What you need to do is to unsub from all default subreddits, subscribe to niche ones you like and use Reddit Enhancement Suite (RES) [1] to contain the default subreddits to what RES calls the Dashboard (basically a page where you can add lots of subreddits as individual widgets).

[1] https://redditenhancementsuite.com/

[+] make3|9 years ago|reply

your best bet is to filter out meme subs

[+] baxuz|9 years ago|reply

You could add any title that's formulated as an imperative. "You won't believe..." "Guess which..." "You should..."

Also titles that are formulated as a simple subject - predicate - object sentence: "XY considered anti-pattern" "Trump is right" "Hitler did nothing wrong" "Drunk girl shows tits" "Homeopathy is the future of medicine"

Same works if formulated as a question: "Is Trump right?" "Has Hitler done nothing wrong?" "Is homeopathy the future of medicine?"

Bonus points for exclamation marks, pound signs and uppercase words.
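A minimal sketch of these rules as a scoring function, assuming one point per trope. The name `trope_score` and the opener list are illustrative; the opener list contains only the phrases quoted above, and the subject-predicate-object rule is left out because it would need a real parser.

```python
import re

# Openers quoted in the comment above; the list is illustrative, not exhaustive.
OPENERS = ("you won't believe", "guess which", "you should")


def trope_score(title: str) -> int:
    """Count surface-level clickbait tropes in a title (higher = more suspicious)."""
    lowered = title.lower().strip()
    score = 0
    if lowered.startswith(OPENERS):
        score += 1
    if lowered.endswith("?"):
        score += 1
    # One point each for exclamation marks and pound signs.
    score += title.count("!") + title.count("#")
    # One point per fully uppercase word of two or more letters.
    score += len(re.findall(r"\b[A-Z]{2,}\b", title))
    return score
```

Note that the uppercase rule also fires on ordinary acronyms such as "US", so a score like this is at best a weak signal that would need a tuned cut-off.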

[+] minimaxir|9 years ago|reply

I wrote the original article visualizing clickbait from scraped Facebook data: http://minimaxir.com/2016/08/clickbait-cluster/

Yes, there are obvious tropes of clickbait. Facebook, however, is cracking down on them, so there's been some brinkmanship over "how do I get people to click articles without following the tropes?"

From the visualization in my article, you can see there is a spatial blend between sources like the NYT and BuzzFeed when subjects like kids and Pokemon are brought up.

[+] abhisvnit|9 years ago|reply

The point the article is trying to make is that clickbait can't be classified using titles alone. The content of the webpage also plays a big role :)

[+] mtgx|9 years ago|reply

Not all clickbait headlines are written like that.

For instance: "Russia hacked US power grid" doesn't have any of those, and yet it was a completely clickbait/sensationalist/borderline fake news headline from WashPost. How is AI going to deal with those?

https://theintercept.com/2016/12/31/russia-hysteria-infects-...

[+] joosters|9 years ago|reply

Simple filter for tech articles:

$clickbait = /Deep/;

[+] jj12345|9 years ago|reply

Thanks for the nice, condensed article. Generating features from BeautifulSoup isn't something that I've considered before.

I'm still going through Yoshua Bengio's new book on DL, but if anyone is free to comment: what are the justifications for the general architecture? Why use LSTMs with the GloVe embeddings?
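For anyone curious what "features from page HTML" can look like, here is a small sketch. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without extra packages, and the particular features (link, image and paragraph counts, text length) are my own guesses, not the feature set from the article.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count opening tags and the total length of text nodes in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: Counter = Counter()
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1

    def handle_data(self, data):
        # Text inside <script> and <style> is counted too; this is only a sketch.
        self.text_chars += len(data.strip())


def page_features(html: str) -> dict:
    """Turn raw HTML into a small dictionary of numerical features."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return {
        "n_links": parser.tags["a"],
        "n_images": parser.tags["img"],
        "n_paragraphs": parser.tags["p"],
        "text_chars": parser.text_chars,
    }
```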

[+] volker48|9 years ago|reply

Seems like everyone uses the GloVe embeddings for any text-based DL project.

[+] hikkigaya|9 years ago|reply

All I see is that the author uses deep learning to distinguish posts published by Buzzfeed, clickhole, upworthy and stopclickbaitofficial vs. the other pages?

[+] samirahmed|9 years ago|reply

yes - i am skeptical about how generalizable the final model is, given that lots of the features (numerical and text) are closely linked to the same domain.
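One way to probe this would be to hold out whole sources when evaluating, so the model is scored on pages it never saw during training. A minimal sketch, assuming each example is a (source page, headline, label) tuple; the function name `split_by_source` and that tuple layout are hypothetical and not taken from the linked code.

```python
from typing import List, Sequence, Tuple

Example = Tuple[str, str, int]  # (source page, headline, label)


def split_by_source(
    examples: Sequence[Example], held_out: Sequence[str]
) -> Tuple[List[Example], List[Example]]:
    """Put every example from a held-out source in the test set."""
    held = set(held_out)
    train = [ex for ex in examples if ex[0] not in held]
    test = [ex for ex in examples if ex[0] in held]
    return train, test
```

If accuracy drops sharply on the held-out sources, the model has probably learned to recognise publishers more than clickbait.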

[+] abhisvnit|9 years ago|reply

code available here: https://github.com/abhishekkrthakur/clickbaits_revisited

[+] empath75|9 years ago|reply

I'm not sure how he can define what clickbait is and what's not.

The NY Times isn't immune to publishing clickbait, and BuzzFeed sometimes posts really solid journalism.

[+] matrix2596|9 years ago|reply

Crowdsourcing is an amazing thing to do.

29 comments