
Pulling my site from Google over AI training

53 points | headalgorithm | 2 years ago | tracydurnell.com | reply

96 comments

[+] theonlybutlet|2 years ago|reply
I can't understand the outrage. In practice absolutely nothing has changed.

It is reading and learning. A person would read and learn.

This has no bearing on plagiarism or copyright. A work can be considered plagiarized, or to breach copyright, even if the author has never seen or come across the copyrighted/published work.

This is no different. I can write some code and use it, subconsciously referencing a work.

If I don't check my written work and put it out there, someone might have a claim against me. If I don't check the machine generated work and put it out there, someone might have a claim against me.

OpenAI, Meta, et al. are providing the model, basically a regression model or tool. I'm adding the variables or secret sauce that makes it output that set of data in that specific order, not them. It'd be like suing Parker for making the pen.

[+] blargey|2 years ago|reply
A human cannot learn from and re/produce work they view at the speed, volume, and scale that “AI” does, nor can that human be infinitely replicated and farmed out. When describing this gap in capabilities, or the consequences of “learning”, “orders of magnitude” would be a comical understatement.

Existing conventions around “learning” are built on assumptions of human scale, and the expected consequences thereof.

I can’t understand why one would expect people to go “oh it’s technically ‘learning’ I guess I’ll ignore all the consequences that weren’t present when it was just humans”.

[+] jeffgreco|2 years ago|reply
My pitch:

A search engine that exclusively indexes noindex sites (you can use other sites while spidering) and builds an LLM from the results.

[+] JohnFen|2 years ago|reply
I rather suspect that this is already being done.
[+] bratao|2 years ago|reply
I had a (terrible) idea to create a search index built on DMCA notices.
[+] CatWChainsaw|2 years ago|reply
Comments are bound to be spicy on this one. I always love it when techbros say that AI learning and human learning are exactly the same, because reading one thing at a time at a biological pace and remembering takeaway ideas rather than verbatim passages is obviously exactly the same thing as processing millions of inputs at once and still being able to regurgitate sources so perfectly that verbatim copyrighted content can be spit out of an LLM that doesn't 'contain' its training material. It's even better when they get so butthurt at being called out that they have a nice little rage-cry.
[+] jbreckmckye|2 years ago|reply
They only believe it because they think their own labour isn't under threat. Think.
[+] zo1|2 years ago|reply
Bit of a related rant.

Just today I googled (and duck duck go'd?) alternatives for Discord (because reasons). The entire search results page was "X top alternatives to Discord." It was all blog-posty kind of stuff with an "author".

And like 90% of it was written under Indian- and African-sounding names. These were clearly "content farms" with low-paid labour and bad grammar, or just authors with nothing better to do than write Yet Another Blog Post about Top Discord Alternatives. Sure, they weren't generated, but the fact that a human was involved in creating something crappy doesn't make it better or unique.

What I was actually looking for was unique content. Either an actual curated list of alternatives (NOT a blog post they update every year). Or an extract from a book where someone posted fiction about a fictional Discord user who meets aliens. Or comments in a forum, or a link to a song-lyrics website for a Weird Al parody song about Discord, a website dedicated to expounding the virtues of cutting the Discord cord, a link to a PDF where someone saved an IRC chat server's logs about a person switching from Discord to IRC, or an "IRC-MF do you speak it" crass website, or something. Anything but a damn content blog post by some third-world content creator or hipster-blog-poster from the first world.

What I got was garbage. Human-level garbage. Garbage that across hundreds of thousands of websites basically took a piece of content and expanded it with every known combination of words, sentences, and mini-stories and pasted it on a stupid blog post with an author.

And this garbage is what this AI is training on so we can have content farms make more copies of itself with more variations and in different languages now, all so we can pay Google et al attention-coins to magically sift through all that garbage and present us with something a little less garbage-y for us to consume.

[+] kccqzy|2 years ago|reply
Don't forget about also pulling your site from Bing! It would be naive to somehow trust that Microsoft won't use your site for AI training.
[+] talldatethrow|2 years ago|reply
And Yandex!

Side note: I have friends who crawled a massive amount of the internet over several months for their own purposes. At this point it's probably impossible to exclude your site, since tons of other people probably link to it if it's of any value.

[+] mkl95|2 years ago|reply
Can you pollute their data with hidden elements, or do they only scrape visible stuff?
[+] jeffbee|2 years ago|reply
If Google thinks a site is serving different content to googlebot vs. real users, it will stop returning that site on the SERP, because that is a malware distribution technique, among other reasons.
[+] jxramos|2 years ago|reply
> I’m going to start by pulling my websites out of Google search, then work on adding my sites to directories. Maybe I’ll even join a webring

I'm curious; this is the first time I've heard of a webring, and I'd like to learn more about these alternate discovery routes. Does anyone have concrete experience or recommendations to share?

[+] ModernMech|2 years ago|reply
Stop trying to make the rest of us feel old lol.
[+] chomp|2 years ago|reply
The author should update their robots.txt for Googlebot as well. It's not clear whether noindex means "notrain" too. The entire webpage has to be read in and parsed for Google to extract that meta tag, whereas robots.txt should stop the crawler before it ever reaches the rest of your site.
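For reference, the meta tag in question is just <meta name="robots" content="noindex"> in the page head, and a robots.txt rule barring Googlebot from the whole site would look roughly like this (a minimal sketch; adjust the Disallow paths to whatever you actually want off-limits):

    # Bar Google's crawler from the entire site
    User-agent: Googlebot
    Disallow: /

Ordering matters, though: once Googlebot is blocked by robots.txt it can no longer fetch the pages to see the noindex tag.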
[+] DanHulton|2 years ago|reply
The author mentioned she was going to block Googlebot, only not yet, in order to make sure it can crawl the site again to get the 'noindex' instructions.
[+] barbariangrunge|2 years ago|reply
As a substack author, with “permission is not granted to use any portion of this to train an ai” at the bottom of most of my posts, it’s bullshit that you have to do this sort of thing, and that it will almost certainly not work

This must be illegal, but how are all the little bloggers going to oppose it?

[+] tornato7|2 years ago|reply
Why do you think it would be illegal? You can state "permission is not granted to X" on anything you want, but that doesn't mean the law is on your side. Regular rules of copyright still apply.

P.S. Permission is not granted to downvote my comment!

[+] dwheeler|2 years ago|reply
It's only illegal if a law makes it illegal.

It's not clear to me that it should be illegal.

[+] JohnFen|2 years ago|reply
What's right or wrong, and what's legal or illegal, are two different things. There are plenty of right things that are illegal and wrong things that are legal.
[+] arsome|2 years ago|reply
I don't see why it would be illegal; an AI reading it should be no different from anyone else reading it.
[+] txcvdsfsgh|2 years ago|reply
This whole AI scraping argument is so silly to me. If you don't want people downloading and processing your content, then don't post it on the public internet?
[+] moribvndvs|2 years ago|reply
They literally outline their reasoning in a link[0]. There's a significant gap between offering information for someone to read for free (where I can see the author and choose to respect their terms, if they have any), and a huge tech company aggregating that data, assimilating it into a model, and using it in a product they will profit from[1]. They are exploiting a gray area regarding digital rights, copyright, etc.

[0] https://tracydurnell.com/2023/07/07/the-next-big-theft/

[1] https://www.tumblr.com/nedroidcomics/41879001445/the-interne...

[+] JohnFen|2 years ago|reply
It's not about not wanting people to access the content, it's about not wanting AI bots to do so. But yes, I agree, the only realistic defense we have is to remove it from the publicly-accessible web.
[+] fortyseven|2 years ago|reply
Honestly, if you're so terrified now of what happens to your information once you post it publicly, just cut the cord. It was ALWAYS like this.
[+] LordOfRiverRun|2 years ago|reply
Why the preference not to have Google train their AI using your website content?
[+] bryzaguy|2 years ago|reply
One reason is the same as for authors who don't want their books trained on and actors who don't want their likeness used for training. If that content is valuable, the training allows Google to realize that value with no return to the creator.
[+] JohnFen|2 years ago|reply
I bet the reasons vary from person to person. My reason is because I think that these AI systems pose too great of a risk to society, and I want to make sure that I'm not helping them in any way.
[+] zer0w1re|2 years ago|reply
Ah, yes. More complaining about freely posting content publicly on the internet and then being upset when it's used in a way you don't want. I'm sure foreign companies and even governments are doing something similar, what will U.S. laws do to stop that?
[+] fortyseven|2 years ago|reply
jots down "CRAWL AND SCRAPE A WEB RING"

Got it.

[+] tivert|2 years ago|reply
> Blocking bots that collect training data for AIs (and more)

> In addition, I created a robots.txt file to tell “law abiding” bots what they’re not allowed to look at. I ought to have done this before but kind of assumed it came with my WordPress install (Nope.)

> I specifically want to deter my website being used for training LLMs, so I blocked Common Crawl.

Instead of blocking, it would be neater to present an alternative version to the crawlers (like many paywalled sites already do for SEO) that's full of dynamically generated LLM garbage. That'll help the LLMs poison themselves.
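A rough sketch of what that could look like (Flask here just for illustration; CCBot is Common Crawl's crawler, but the rest of the bot list and the decoy generator are placeholders, not a vetted setup):

    # Sketch: serve dynamically generated filler to known AI-training
    # crawlers and the real page to everyone else. AI_CRAWLERS and
    # generate_decoy() are stand-ins; swap in whatever you actually want.
    from flask import Flask, request

    app = Flask(__name__)

    AI_CRAWLERS = ("CCBot", "GPTBot")  # user-agent substrings to match

    def generate_decoy() -> str:
        # Stand-in for LLM-generated garbage, regenerated per request
        return "<html><body>" + ("lorem ipsum " * 200) + "</body></html>"

    def real_page() -> str:
        return "<html><body>The actual post.</body></html>"

    @app.route("/")
    def index():
        ua = request.headers.get("User-Agent", "")
        if any(bot in ua for bot in AI_CRAWLERS):
            return generate_decoy()
        return real_page()

    if __name__ == "__main__":
        app.run()

Anything that lies about its user agent sails right past this, of course, and as noted elsewhere in the thread, serving crawlers different content than users is exactly the kind of cloaking that can get a site de-indexed.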

[+] JohnFen|2 years ago|reply
I was thinking about jamming along these lines, but the problem is that it's a game of whack-a-mole -- you have to keep up on what bots are active (robots.txt doesn't really help here, and focusing on Common Crawl is insufficient).

My websites have been closed to the public since shortly after the release of ChatGPT, but I've been considering opening them up again, sort of. The not-logged-in experience would be full of dynamically generated LLM poison as you suggest -- for everybody, rather than trying to single out crawlers -- and you'd have to log in to get to the real contents of the site.

[+] amf12|2 years ago|reply
> Instead of blocking, it would be neater to present an alternative version to the crawlers

IIUC, if a site presents a different view of its content to the crawler than to users, the site can get de-indexed.

[+] TheCaptain4815|2 years ago|reply
An interesting thought is government or foreign actors training a generative AI. They won't abide by any noindex tag and will scrape anything and everything.
[+] JohnFen|2 years ago|reply
It doesn't have to be government or foreign actors. Abiding by the contents of robots.txt isn't required of anybody at all. It's merely a social convention.
[+] pixelgeek|2 years ago|reply
Helpful links. I will be doing the same this evening.
[+] tasubotadas|2 years ago|reply

[deleted]

[+] tasubotadas|2 years ago|reply
"Ironically, the uniformity of the copies of Gutenberg’s Bible led many superstitious people of the time to equate printing with Satan because it seemed to be magical. Printers’ apprentices became known as the "printer’s devil." In Paris, Fust [a typographer] was charged as a witch. Although he escaped the Inquisition, other printers did not." (The Unsung Heroes, a History of Print by Dr. Jerry Waite 2001)"