cjf101 | 1 year ago

There was a bunch of reporting on how AI companies and researchers were using tools that ignored robots.txt. It's a "polite request" that these companies had a strong incentive to ignore, so they did. That incentive is still there, so it is likely that some of them will continue to do so.

Ukv|1 year ago

CommonCrawl[0] and the companies training models I'm aware of[1][2][3] all respect robots.txt for their crawling.

If we're thinking of the same reporting, it was based on a claim by TollBit (a content licensing startup), which was in turn based on the fact that "Perplexity had a feature where a user could prompt a specific URL within the answer engine to summarize it". Tools acting as a user agent (like archive.today, a webpage-to-PDF site, or a translation site) aren't crawlers and aren't what robots.txt is designed for, but either way the feature is disabled now.

[0]: https://commoncrawl.org/faq

[1]: https://platform.openai.com/docs/bots

[2]: https://support.anthropic.com/en/articles/8896518-does-anthr...

[3]: https://blog.google/technology/ai/an-update-on-web-publisher...
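For illustration, here is a minimal sketch of what "respecting robots.txt" looks like on the crawler's side, using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URLs, the `ExampleBot` name, and the helper names are made up for this example; `GPTBot` is only used as a sample user-agent token. The point is that the check runs inside the crawler, so compliance is voluntary.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would fetch this from
# https://<site>/robots.txt before requesting any other page.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# A well-behaved crawler asks before every request and skips disallowed URLs.
for agent, url in [
    ("GPTBot", "https://example.com/article"),
    ("ExampleBot", "https://example.com/article"),
    ("ExampleBot", "https://example.com/private/notes"),
]:
    verdict = "allowed" if may_fetch(parser, agent, url) else "disallowed"
    print(f"{agent} -> {url}: {verdict}")
```

Nothing on the server enforces these rules; a client that never calls something like `may_fetch` will still get the page unless the site blocks it some other way.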

cjf101|1 year ago

These policies are much clearer than they were when I last looked, which is good. On the other hand, Perplexity appeared to ignore robots.txt as part of a search-enhanced retrieval scheme, at least as recently as June of this year. The article title is pretty unkind, but the test they used pretty clearly shows what was going on.

https://www.wired.com/story/perplexity-is-a-bullshit-machine...

It takes this sort of critical scrutiny; otherwise, mechanisms like robots.txt do get ignored, whether willfully or mistakenly.

FrustratedMonky|1 year ago

Robots.txt is a suggestion. As is reporting on using it.

The companies that are ignoring robots.txt are also probably the companies not advertising that they are ignoring robots.txt.