
aggeeinn | 1 month ago

OP here.

I've been trying to map out why some sites get cited by Perplexity/ChatGPT and others don't, so I built a custom crawler to audit 1,500 active websites (a mix of e-commerce and SaaS).

The most interesting findings:

The Accidental Blockade: ~30% of sites are blocking GPTBot via legacy robots.txt rules or old security plugins (often without the owner knowing).

The "Ghost Town": Only 3 sites (0.2%) had a valid llms.txt file.

The JS Trap: 40% of marketing sites rely so heavily on client-side rendering that they appear as "empty shells" to non-hydrating AI agents.
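The "accidental blockade" finding is easy to reproduce for a single site. A minimal sketch using Python's standard-library `urllib.robotparser` (the robots.txt body below is a hypothetical example; a real audit would fetch each site's live `/robots.txt`):

```python
from urllib import robotparser

# Hypothetical robots.txt illustrating an explicit GPTBot block
# alongside a generic rule for all other crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# can_fetch() applies the most specific matching User-agent group.
print(rp.can_fetch("GPTBot", "https://example.com/pricing"))   # False: GPTBot group disallows everything
print(rp.can_fetch("Mozilla", "https://example.com/pricing"))  # True: generic group only blocks /admin/
```

Running this against a list of domains (fetching each robots.txt first) is roughly how you'd tally the ~30% figure yourself.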

Context on the tool: I gathered this data using the engine for my project, Website AI Score. We are still in early beta (rough edges included), but we are building towards a complete "Crawl, Fix, & Validate" ecosystem for AEO that will launch fully in early February.

Right now, the scanner is live if you want to check your own site's "AI readability."

Happy to answer questions about the crawling methodology or the specific schema failures we saw in the wild.


JohnFen | 1 month ago

> (often without the owner knowing)

How can you tell this? Why do you call this the "accidental blockade"? Surely, at least some percentage of those sites are doing it intentionally.

aggeeinn | 1 month ago

Fair question. We distinguish them based on the specificity of the rule. If a robots.txt file explicitly names GPTBot or CCBot, we count that as intentional. The accidental group consists of sites using generic User-agent: * disallows (often left over from staging) or legacy security plugins that block unknown user agents by default. We spot-checked a sample of these owners, and most were completely unaware that their 5-year-old config was actively blocking modern AI agents.
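That classification heuristic can be sketched in a few lines. This is an illustrative reimplementation, not the project's actual code; the function name and the `AI_BOTS` set are my own, and a production version would also need to handle multi-agent groups and Allow rules:

```python
# Bots whose explicit mention we treat as an intentional block.
AI_BOTS = {"gptbot", "ccbot"}

def classify_block(robots_txt: str) -> str:
    """Return 'intentional', 'accidental', or 'open' for a robots.txt body."""
    groups: dict[str, list[str]] = {}  # user-agent (lowercased) -> Disallow paths
    current: list[str] = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            current = groups.setdefault(value.lower(), [])
        elif field == "disallow" and value:
            current.append(value)
    # A rule that names an AI bot directly is counted as intentional.
    if any(groups.get(bot) for bot in AI_BOTS):
        return "intentional"
    # A blanket disallow-everything (often staging leftovers) catches
    # AI agents as a side effect: the "accidental" bucket.
    if groups.get("*") == ["/"]:
        return "accidental"
    return "open"

print(classify_block("User-agent: GPTBot\nDisallow: /\n"))  # intentional
print(classify_block("User-agent: *\nDisallow: /\n"))       # accidental
```

The key design point is that intent is inferred purely from rule specificity, which is why the spot-check of owners mattered as a sanity check on the heuristic.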