cjf101|1 year ago
There was a bunch of reporting on how AI companies and researchers were using tools that ignored robots.txt. It's a "polite request" that these companies had a strong incentive to ignore, so they did. That incentive is still there, so it is likely that some of them will continue to do so.
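To make the "polite request" point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot name `ExampleAIBot`, the `example.com` URLs, and the `is_allowed` helper are made up for illustration. The thing to notice is that the check runs entirely on the crawler's side: nothing on the server enforces it, so a crawler that never calls it just fetches the page anyway.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one made-up AI crawler entirely,
# and blocks /private/ for everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    This is a voluntary, client-side check. A crawler that skips it
    is not stopped by anything in the protocol itself.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleAIBot", "https://example.com/article"),
        ("SomeOtherBot", "https://example.com/article"),
        ("SomeOtherBot", "https://example.com/private/notes"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

A well-behaved crawler would call something like `is_allowed` before every request and skip the URL on `False`; the incentive problem is that skipping the call costs nothing.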
Ukv|1 year ago
If we're thinking of the same reporting, it was based on a claim by TollBit (a content licensing startup), which was in turn based on the fact that "Perplexity had a feature where a user could prompt a specific URL within the answer engine to summarize it". Tools that act as a user agent on a person's behalf (like archive.today, a webpage-to-PDF site, or a translation site) aren't crawlers and aren't what robots.txt is designed for, but either way the feature is disabled now.
[0]: https://commoncrawl.org/faq
[1]: https://platform.openai.com/docs/bots
[2]: https://support.anthropic.com/en/articles/8896518-does-anthr...
[3]: https://blog.google/technology/ai/an-update-on-web-publisher...
cjf101|1 year ago
https://www.wired.com/story/perplexity-is-a-bullshit-machine...
It takes this sort of critical scrutiny; otherwise, mechanisms like robots.txt do get ignored, whether willfully or mistakenly.
FrustratedMonky|1 year ago
The companies that are ignoring robots.txt are probably also the ones not advertising that they ignore it.