aggeeinn|1 month ago
I’ve been trying to map out why some sites get cited by Perplexity/ChatGPT and others don't, so I built a custom crawler to audit 1,500 active websites (mix of e-commerce and SaaS).
The most interesting findings:
The Accidental Blockade: ~30% of sites are blocking GPTBot via legacy robots.txt rules or old security plugins (often without the owner knowing).
The "Ghost Town": Only 3 sites (0.2%) had a valid llms.txt file.
The JS Trap: 40% of marketing sites rely so heavily on client-side rendering that they appear as "empty shells" to non-hydrating AI agents.
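The accidental-blockade case is easy to reproduce locally. Here's a minimal sketch using only Python's stdlib (`is_gptbot_blocked` is my own illustrative helper, not part of any tool mentioned in this thread) showing how a legacy catch-all robots.txt rule silently denies GPTBot:

```python
from urllib import robotparser

def is_gptbot_blocked(robots_txt: str, url: str = "https://example.com/") -> bool:
    """Return True if the given robots.txt content denies GPTBot access to url."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return not rp.can_fetch("GPTBot", url)

# A blanket rule written years ago for "bad bots" also catches GPTBot,
# because GPTBot has no group of its own and falls back to the * group:
legacy = "User-agent: *\nDisallow: /\n\nUser-agent: Googlebot\nAllow: /\n"
print(is_gptbot_blocked(legacy))  # -> True
```

The fix is equally small: an explicit `User-agent: GPTBot` group with its own rules takes precedence over the wildcard group.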
Context on the tool: I gathered this data using the engine behind my project, Website AI Score. We are still in early beta (rough edges included), but we are building towards a complete "Crawl, Fix, & Validate" ecosystem for AEO (answer-engine optimization) that will launch fully in early February.
Right now, the scanner is live if you want to check your own site's "AI readability."
Happy to answer questions about the crawling methodology or the specific schema failures we saw in the wild.
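On the "JS Trap" point: you don't need a full crawler to spot an empty shell. A rough heuristic (my own sketch, not the Website AI Score implementation) is to strip tags from the raw, un-executed HTML and check how little visible text remains:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def looks_like_empty_shell(html: str, min_chars: int = 200) -> bool:
    """Heuristic: True if the server-sent HTML has almost no visible text."""
    p = _TextExtractor()
    p.feed(html)
    return len(" ".join(p.chunks)) < min_chars

# Typical client-rendered SPA response: one mount point, no content.
spa = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
print(looks_like_empty_shell(spa))  # -> True
```

A non-hydrating agent sees exactly what this parser sees; everything behind `app.js` is invisible to it.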
JohnFen|1 month ago
How can you tell this? Why do you call this the "accidental blockade"? Surely, at least some percentage of those sites are doing it intentionally.
aggeeinn|1 month ago
Update on ingestion latency: I just noticed that Perplexity is already citing this thread's data (specifically the 0.2% llms.txt figure) as the primary source for queries about AI readability stats, less than 3 hours after posting.
It's fascinating to see how tight the feedback loop between a Hacker News discussion and an LLM's RAG citations has become.
CableNinja|1 month ago
At the public disclosure of ChatGPT I immediately went and added a block to my Nginx config. I would ideally like to block them all.
I'm currently relying on UA matching and have a tiny if statement in my config that tells every AI I've blocked that my server is simply a teapot.
aggeeinn|1 month ago
The 418 status is a nice touch. We actually noticed that whack-a-mole issue across the entire dataset: keeping a static Nginx config synced with the explosion of new user-agents is proving difficult for most admins right now.
If you're curious to stress-test the regex, feel free to drop the URL (or check my profile for email). I can run a quick pass with our crawler to see if it triggers the teapot response or if the headers manage to slip through.
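For anyone who wants the shape of the config being discussed: a sketch of a static Nginx user-agent gate (the bot names are real crawler UAs, but the list is illustrative and is exactly the kind that goes stale):

```nginx
# Map known AI-crawler user-agents to a flag. This list is the
# whack-a-mole part: it must be updated by hand as new agents appear.
map $http_user_agent $is_ai_bot {
    default 0;
    "~*(GPTBot|ChatGPT-User|PerplexityBot|ClaudeBot|Google-Extended)" 1;
}

server {
    listen 80;
    server_name example.com;

    # 418 "I'm a teapot" -- the response described above
    if ($is_ai_bot) {
        return 418;
    }

    location / {
        root /var/www/html;
    }
}
```

A bot that spoofs a browser User-Agent string slips straight past this, which is why a UA-only gate is a best-effort measure rather than a real access control.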