item 37158285

k_vi | 2 years ago

User-agent: GPTBot
Disallow: /
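For readers who want to see how crawlers that honor robots.txt interpret this directive, here is a small sketch using Python's standard-library `urllib.robotparser`. The example.com URL is purely illustrative:

```python
from urllib import robotparser

# The directive quoted above: block GPTBot from the whole site.
robots_txt = """\
User-agent: GPTBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# GPTBot is denied everywhere; user agents with no matching
# rule default to "allowed" under the Robots Exclusion Protocol.
print(rp.can_fetch("GPTBot", "https://example.com/"))        # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/"))  # True
```

Note that robots.txt is purely advisory: this only controls crawlers that choose to parse and respect it.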


prmoustache | 2 years ago

IMO you shouldn't have to keep track of which crawler bots exist and maintain a deny list. It should be the opposite: only expressly allowed crawlers should be able to crawl content, by maintaining allow lists.

wraptile | 2 years ago

You have been able to do the opposite since the inception of robots.txt:

User-agent: *
Disallow: /

and then whitelist Googlebot and whatnot. Most of the web is already configured this way. Just check the robots.txt of any major website, e.g. https://twitter.com/robots.txt
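The deny-by-default pattern described above can be checked the same way with `urllib.robotparser`; the file contents and URLs here are an illustrative sketch, not taken from any real site. An empty `Disallow:` line means "allow everything" for that agent:

```python
from urllib import robotparser

# Deny-by-default policy that allowlists only Googlebot.
robots_txt = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Googlebot matches its own entry (empty Disallow = allow all);
# every other agent falls through to the wildcard block.
print(rp.can_fetch("Googlebot", "https://example.com/page"))  # True
print(rp.can_fetch("GPTBot", "https://example.com/page"))     # False
```

The blank line between the two records matters: robots.txt groups rules into records separated by blank lines, and a crawler obeys only the most specific record that names it.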

jamilton | 2 years ago

That was my gut reaction too, but presumably unless it becomes regulated, at least some competitors to OpenAI won't respect any robots.txt and thus any open content might be training data.

gorbachev | 2 years ago

User-Agent: <new technology category>Bot