top | item 47060387

(no title)

I also wonder; it's a normal scraper mechanism doing the scraping, right? Not necessarily an LLM in the first place so the wholesale data-sucking isn't going "read" the file even if it IS accessed?

Or is this file meant to be "read" by an LLM long after the entire site has been scraped?

discuss

hamdingers|11 days ago

Yes. It's a basic scraper that fetches the document, parses it for URLs using regex, then fetches all those, repeat forever.

I've done honeypot tests with links in html comments, links in javascript comments, routes that only appear in robots.txt, etc. All of them get hit.

efreak|11 days ago

What about scripted transformations? Or just add a simple timestamp to the query and only allow it to be used up to a week later? (Whether it works without the parameter could be tested too)

dumbfounder|11 days ago

We need to update robots.txt for the LLM world, help them find things more efficiently (or not at all I guess). Provide specs for actions that can be taken. Etc.

reconnecting|11 days ago

Absolutely.

I assume that there are data brokers, or AI companies themselves, that are constantly scraping the entire internet through non-AI crawlers and then processing data in some way to use it in the learning process. But even through this process, there are no significant requests for LLMs.txt to consider that someone actually uses it.

olivia-banks|11 days ago

I assume this might be changing. Anecdotally, from what I've read here, I think we're starting to see headless browsers driven by LLMs for the purposes of scraping (to get around some of the content blocks we're seeing). Perhaps this is a solution to a problem that won't work now, but in the future, maybe.

giancarlostoro|11 days ago

I think it depends. LLMs now can look up things on the fly to bypass the whole "this model was last updated in December 2025" issue of having dated information. I've literally told Claude before to look up something after it accused me of making up fake news.