top | item 43447471

(no title)

felixfbecker | 11 months ago

OpenAI and Anthropic respect robots.txt afaik

discuss

order

Ukv|11 months ago

To add anecdotally based on logging on my portfolio site, all major US players (OpenAI, Google, Anthropic, Meta, CommonCrawl) appeared to respect robots.txt as they claim to do (can't say the same of Alibaba).

Sometimes I do still get requests with their useragents, but generally from implausible IPs (residential IPs, or "Google-Extended" from an AWS range, or same IP claiming to be multiple different bots, ...) - never from the bots' actual published IP addresses (which I did see before adding robots.txt) - which makes me believe it's some third party either intentionally trolling or using the larger players as cover for their own bots.

dharmab|11 months ago

Using residential IPs is standard operating procedure for companies that rely on collecting information via web scraping. You can rent residential egress IPs. Sometimes this is done in a (kind of) legit way by companies that actually subscribe to residential ISPs. Mostly it's done by malware hijacking consumer devices.

VladVladikoff|11 months ago

Noooooope! They completely ignore crawl frequency in my experience. Bing too. Only Google seems to obey it.