Reddit Will License Its Data to Train LLMs, We Made a FF Extension to Replace

[+] hubraumhugo|1 year ago|reply

Since the blackouts last year and the recent IPO, it feels like astroturfing and spam have increased, while quality contributions have decreased. All usage metrics are up according to Reddit's IPO filings, but it feels like engagement is actually down, or at least lower quality. Many niche subs feel like ghost towns now.

Is this just my subjective impression or do you feel the same?

[+] theyeenzbeanz|1 year ago|reply

Spam increased exponentially after the 3rd party kill switch. There were plenty of options for bots to add to your mod list to help combat spam, repost accounts, and more.

They existed before the whole 3rd party fallout, but it does not even come close to how bad it is now.

[+] herculity275|1 year ago|reply

IME the niche subs are still doing fine, it's the popular ones that are getting astroturfed and Eternal September'ed into oblivion. I do expect the LLM bots to eventually render the platform unusable, they would need to implement a very aggressive personhood verification policy to prevent that.

[+] nabla9|1 year ago|reply

It's partly because Google started preferring more Reddit content.

It's more valuable to astroturf and spam on reddit than ever before.

[+] SkyPuncher|1 year ago|reply

Browsing /r/all is a cesspool now. It feels like the same 30 topics with random OnlyFans wannabes and seemingly outrageous relationship stories/advice. It used to be such a great way to find and explore new topics.

Niche sub-reddits largely seem to either (1) be growing so large they become bland or (2) be dying and moving to other platforms.

I basically only enjoy the live sporting game threads now. Even then, it's a pretty shallow level of enjoyment.

[+] jsheard|1 year ago|reply

Twitter feels the same way, they are claiming to have more active users than ever but that probably includes the horde of LLM bots and MY PUSSY IN BIO spam.

[+] FrustratedMonky|1 year ago|reply

I have to believe that with LLMs improving, and posts from Reddit used for training data, a large number of current 'new' posts are bots.

I just read an article about marketing companies using bots that post 'pretty good comments, that slightly agree with you but mention the product'.

[+] vintagedave|1 year ago|reply

Subjectively, I feel the same. That said I don't use it much any more -- when Apollo closed I stopped participating on the site.

I saw lots of comments shortly after the blackouts that 'felt' AI-generated, and when I rarely go there now from search results, using the awful new site, I see little content of value.

[+] SleepilyLimping|1 year ago|reply

The incentive to contribute is based on the potential return of social currency (prestige, togetherness, etc). If it's evident that you won't generate enough currency to outweigh enriching Reddit, why bother?

[+] jimmySixDOF|1 year ago|reply

The API rugpull was a real setback for content. If they had followed through with their claims at the time to allow paid access, that could have worked, but they never rolled anything out; it was just a ruse.

[+] littlecranky67|1 year ago|reply

Maybe it is just me, but when reddit pops up in my search results (and since I am using Kagi it quite often ranks near the top) the threads are mostly useless for helping me solve an issue or extract information. They are often outdated and littered with personal opinions and anecdotes. Compared to Q/A sites like StackExchange, the quality of information - at least for me - is very poor. Which is fine, since reddit claims to be a social network, too.

[+] cacois|1 year ago|reply

My experience has been the opposite in the last few years. I've found myself filtering Google/DuckDuckGo results specifically for reddit, because I was finding better answers to technical questions. Anecdotal, of course, and it does seem to be getting worse (less successful for me) over the last 6 months.

[+] spongeb00b|1 year ago|reply

50/50 - I’ve been surprised by the number of Reddit threads that have been more useful than other results. Even if it’s been a discussion that doesn’t give me a solution but helps me shape what I’m trying to find.

Although it would be nice if unanswered posts didn’t rank so highly.

[+] ses1984|1 year ago|reply

It’s not good for q&a, it’s pretty good for discussion and reviews, though.

[+] ryukoposting|1 year ago|reply

Sometimes I just want the opinion of an actual human being. It's hard to find that online anymore, without affiliate links and/or $COMPANY_NAME deleting negative remarks.

[+] Zambyte|1 year ago|reply

Do you want Reddit results to remain high on Kagi? You can just downrank or block it if you want to see it less.

[+] input_sh|1 year ago|reply

It's definitely not just you.

Reddit didn't reach this point because they're good, but because it's the least shitty option right now. (God I wish that wasn't the case.) I'm not saying there's no astroturfing going around there -- there absolutely is -- but it's still the only "mainstream" website where I'm confident I can find some dissenting opinions about a product that are written by actual human beings.

[+] unknown|1 year ago|reply

[deleted]

[+] addandsubtract|1 year ago|reply

I always add a filter to limit results within the last month or year, depending on the topic.

[+] SilverBirch|1 year ago|reply

Yeah, I had exactly this problem with Google the other day, I googled something and saw the first result was a reddit post and the short summary under the link was like "Yes I've seen this problem BUT WHAT YOU REALLY SHOULD BE DOING" and I was optimistic that someone had some good guidance. Guess what I should be doing? Uninstalling windows and running Linux... That answer somehow had made it into the Google summary despite being downvoted on actual reddit.

[+] z_open|1 year ago|reply

I don't think reddit lets you do this in any more than a superficial way. I think reddit keeps the old edits internally so it won't harm the LLM. There were reports after the last protest of reddit reverting mass edits.

[+] ziml77|1 year ago|reply

So basically this won't affect the LLM training but will still remove useful information and answers to questions? Wonderful...

[+] Havoc|1 year ago|reply

Wouldn’t be surprised if Reddit ends up either banning people for this or limiting edits on historic comments.

They already restore user comments against their will (and hilariously that’s also against their own reddiquette see extract below)

https://www.reddit.com/r/privacy/comments/14dcxy4/reddit_res...

> Repost deleted/removed information. Remember that comment someone just deleted because it had personal information in it or was a picture of gore? Resist the urge to repost it. It doesn't matter what the content was. If it was deleted/removed, it should stay deleted/removed.

[+] mkl|1 year ago|reply

So Reddit will help companies make perfect Reddit spambots and poison their own communities? Seems a bit shortsighted.

[+] Phiwise_|1 year ago|reply

As I understand it, reddit as it has been has never not lost money. What, exactly, makes switching from a burn pit business model to one that actually makes money qualify as "a bit shortsighted"? They've been doing this for two decades already. How does going from X-ten(?) billion cat photo comments to Y-ten billion open opportunities worth more than the cost of waiting yet more decades to actually make money?

[+] FrustratedMonky|1 year ago|reply

It's happening to the entire internet. A lot of content generated in the last few months is AI, some pretty good, but not great, all kind of on the 'crappy' side. The 'crappy' feedback loop into training data is going to be a real problem.

Wonder if internet will migrate back to each person having their own blog that they can control.

[+] Havoc|1 year ago|reply

They just went through IPO. Doesn’t get much more short term focused than that

[+] 23B1|1 year ago|reply

This is the end of Web 2.0. There will be a blip on signal/noise ratio (which wasn't that great to begin with, 99.9% of UGC is trash anyway) as procedurally-generated content floods sites with even more nonsense – and then once they become unusable (reddit already is), the next crop will pop up.

I'm long on people with great taste, trendsetters and commentators, editors, and curators. They'll be the vanguard of this next iteration of the internet.

[+] xdennis|1 year ago|reply

Just so we're clear, this is using reverse psychology, right? They do want you to replace your comments with copyrighted text.

I assume the wording is because of legal ramifications. I wonder if such a defense works in court.

Personally, I think doing this is pointless. LLMs already use copyrighted works, so this isn't helping at all. The only way to tank Reddit is to add meaningless text which would make LLMs worse.

[+] CaptainFever|1 year ago|reply

I know, right. The last thing I'd expect a self-proclaimed anti-capitalist to be defending is intellectual property, especially one of a corporation (NYT).

If Reddit keeps a copy of the data edits, this move also just serves to hamper open source models who can only train on scraped data, while those with enough money can buy the full dataset with history.

What I mean is, I agree and I think this plugin will do the opposite of what the authors expect.

[+] gorbachev|1 year ago|reply

It would potentially muddle the context, though.

If every conversation about any topic has responses copying unrelated New York Times articles, what are the chances LLMs trained on that data will hallucinate even worse than before?

[+] helpfulContrib|1 year ago|reply

[deleted]

[+] batch12|1 year ago|reply

Slightly off topic, but since the HN site and API are open to all, it'd be silly to assume our comments aren't also part of several datasets used to train LLMs.

[+] floor_|1 year ago|reply

Are these people even sure the comment is even deleted on the backend where I assume the data will be taken from? I feel like they'll be pissing upwind and en-shit-ifying the site that will only harm users and not the data harvesting. If anything you want the public facing stuff there and free to scrape by any average Joe.

[+] float-trip|1 year ago|reply

Reddit's caches are set up to only ever return the last 1,000 of anything. So for example - you can't scroll past 1k items on /new, and if you save more than 1k posts then you'll have to unsave some to retrieve the others.

If this extension only edits comments, it'll only touch the most recent 1k. You would need to retrieve the older ones with a Pushshift replacement like this: https://pullpush.io/. But that also shows how ineffective this is. We still have public reddit archives (like Pullpush and https://github.com/ArthurHeitmann/arctic_shift) which contain comments as they were originally posted. This isn't gonna be a problem for Google.

[+] K0balt|1 year ago|reply

I may hook a plugin for this into my local 11b LLM so that I could have it summarise my comments in the third person, in a David Attenborough documentary style. I love the idea of 60k plus DA summarisations and attributions of naturalistic motivations for my comments.

I stopped using Reddit when they banned 3rd party apps, after 16 years and nearly 6000 hours on the platform, including over 2800 hours writing content on their site.

More than happy to burn it all down, for the simple fact that their app sucks so bad that it’s unusable and they banned the app that I was comfortable with.

So, I will be replacing roughly $100k of written value (at half the rate I am normally paid for my writing work) with at least that much in negative value AI generated stupidity. F@$k those guys.

I intend to be an object lesson in abusing your top performing users.

[+] GaggiX|1 year ago|reply

I don't want to ruin the party, but I find it hard to believe that this would have any tangible effect.

[+] WithinReason|1 year ago|reply

It's easily detected and reversed. Or the new comment removed from the DB that's sold and the old one included

[+] simion314|1 year ago|reply

I hate the fact they allow bots and trolls to make tons of accounts and tons of spam/troll posts daily. It would be trivial to fix this partially by putting limits on what a user can post per minute and per day, and by making it harder to create new accounts and start spamming.

I suspect they make more money from allowing bots and trolls than they would from doing the work of fixing this problem.
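
For illustration only, a minimal sketch of the per-minute and per-day posting limit described above. The class name `PostRateLimiter`, the default limits, and the in-memory storage are all assumptions made for the example; nothing here reflects how Reddit actually throttles accounts.

```python
from collections import defaultdict, deque

SECONDS_PER_MINUTE = 60
SECONDS_PER_DAY = 86_400


class PostRateLimiter:
    """Hypothetical sliding-window limiter for posts per user.

    A post is accepted only if the user has fewer than `per_minute`
    accepted posts in the last 60 seconds and fewer than `per_day`
    accepted posts in the last 24 hours.
    """

    def __init__(self, per_minute=3, per_day=100):
        self.per_minute = per_minute
        self.per_day = per_day
        # user -> timestamps (seconds) of accepted posts, oldest first
        self._posts = defaultdict(deque)

    def allow(self, user, now):
        posts = self._posts[user]
        # Forget posts that have fallen out of the 24-hour window.
        while posts and now - posts[0] >= SECONDS_PER_DAY:
            posts.popleft()
        if len(posts) >= self.per_day:
            return False
        # Count accepted posts inside the last minute.
        recent = sum(1 for t in posts if now - t < SECONDS_PER_MINUTE)
        if recent >= self.per_minute:
            return False
        posts.append(now)
        return True
```

A call such as `PostRateLimiter(per_minute=2, per_day=3).allow("some_user", now)` returns `True` when the post may go through. A limit like this only slows down a single account, which is why the comment pairs it with making account creation harder.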

[+] donatj|1 year ago|reply

Ruin what little good is left of the internet in the name of.. slowing the inevitable destruction of the internet? I can’t get behind this.

[+] tinyhouse|1 year ago|reply

What's wrong with people? Reddit has great content. I often use Google to search it for info. Why ruin it? LLMs are also very useful and we all benefit from them.

Reddit is a company with expenses so they need to make money somehow. You don't have to use it if you don't want your content in LLM training data.

[+] batch12|1 year ago|reply

Yes, poor Reddit, just trying to feed their family.

[+] hoseja|1 year ago|reply

Make it so it replaces the comment with some AI slop that's not easily filterable but utterly useless.

[+] gorbachev|1 year ago|reply

I think the best way to sabotage LLMs trained on Reddit data would be to post something on topic, but straightup wrong, in some other way misleading or with subtle inaccuracies that would cause LLMs to produce bad results in ways that are hard to detect.

Use proven information warfare tactics.

[+] globular-toast|1 year ago|reply

Why do so many people, even web developers, think anyone lets you do `UPDATE` or `DELETE` in their databases?! They let you do `INSERT`. That's it. You can insert a new edit and you can insert a delete. They don't actually delete or overwrite anything.
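
A minimal sketch of the append-only pattern described above, using an in-memory SQLite database. The `comment_events` table, its columns, and the helper functions are hypothetical and chosen for illustration; this is not a description of Reddit's actual schema.

```python
import sqlite3


def open_store():
    """Create an in-memory store where every change is a new row."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE comment_events ("
        " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " comment_id TEXT NOT NULL,"
        " kind TEXT NOT NULL CHECK (kind IN ('create', 'edit', 'delete')),"
        " body TEXT)"
    )
    return db


def record(db, comment_id, kind, body=None):
    """Append an event. The application never issues UPDATE or DELETE."""
    db.execute(
        "INSERT INTO comment_events (comment_id, kind, body) VALUES (?, ?, ?)",
        (comment_id, kind, body),
    )


def visible_body(db, comment_id):
    """What the public site would show: the latest event wins."""
    row = db.execute(
        "SELECT kind, body FROM comment_events"
        " WHERE comment_id = ? ORDER BY seq DESC LIMIT 1",
        (comment_id,),
    ).fetchone()
    if row is None or row[0] == "delete":
        return None
    return row[1]


def history(db, comment_id):
    """Every version ever written, oldest first, as (kind, body) tuples."""
    return db.execute(
        "SELECT kind, body FROM comment_events"
        " WHERE comment_id = ? ORDER BY seq",
        (comment_id,),
    ).fetchall()
```

Under this design, overwriting a comment and then deleting it changes what `visible_body` returns, while `history` still yields the original text as its first entry, which is the point the comment is making about mass edits.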

[+] batch12|1 year ago|reply

Another flavor of this would let the user submit their comment and it'd suggest a semantically similar excerpt from "non-"copyrighted text. That'd address the edit reversion dilemma.

[+] upget_tiding|1 year ago|reply

> It seems your Javascript is turned off. Maybe you'd prefer the RSS feed?

It seems somewhat ironic that a website called the luddite would require me to enable javascript on their site in order to read it.

[+] Barrin92|1 year ago|reply

Luddite opposition to tech was very specific, they weren't just generic technophobes or "debloated based internet minimalists", they opposed labor automating machinery that shifted power to capital. Javascript is just a web scripting language.

[+] arbol|1 year ago|reply

At least they have an RSS feed! That's quite uncommon these days

[+] anArbitraryOne|1 year ago|reply

The reason I started leaving more comments on Reddit is precisely because it is going to be LLM training data. My wit is going to be part of our AI overlords

94 comments