(no title)
demetris | 15 days ago
Also, I always wonder about Common Crawl:
Is there is something wrong with it? Is it badly designed? What is it that all the trainers cannot find there so they need to crawl our sites over and over again for the exact same stuff, each on its own?
ccgreg|15 days ago
The folks who crawl more appear to mostly be folks who are doing grounding or RAG, and also AI companies who think that they can build a better foundational model by going big. We recommend that all of these folks respect robots.txt and rate limits.
demetris|15 days ago
> The folks who crawl more appear to mostly be folks who are doing grounding or RAG, and also AI companies who think that they can build a better foundational model by going big.
But how can they aspire to do any of that if they cannot build a basic bot?
My case, which I know is the same for many people:
My content is updated infrequently. Common Crawl must have all of it. I do not block Common Crawl, and I see it (the genuine one from the published ranges; not the fakes) visiting frequently. Yet the LLM bots hit the same URLs all the time, multiple times a day.
I plan to start blocking more of them, even the User and Search variants. The situation is becoming absurd.