
rhet0rica | 6 months ago

I have two related stories.

Googlebot has been playing a multiple-choice flash card game on my site for months—the page picks a random question and gives you five options to choose from. Each URL contains all of the state from the last click: the option you chose, the correct answer, and the five buttons. Naturally, Google wants to crawl all the buttons, meaning the search tree has a branch factor of five and a search space of about 5000^7 possible pages. Adding a robots.txt entry failed to fix this—now the page checks the user agent and tells Googlebot specifically to fuck off with a 403. Weeks later, I'm still seeing occasional hits. Worst of all, it's pretty heavy-duty—the flash cards are for learning words, and the page generator sometimes sprinkles in items that look similar to the correct answer (i.e., they have a low edit distance).
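For anyone wanting to do the same, a minimal sketch of that user-agent check as WSGI middleware—the commenter's actual stack and matching rule are unknown, so the substring match here is an assumption:

```python
def block_googlebot(app):
    """Wrap a WSGI app; answer 403 to anything identifying as Googlebot."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if "Googlebot" in ua:  # hypothetical rule; match whatever UA you see in logs
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware
```

Note that well-behaved crawlers honor robots.txt and never need this; the 403 is the fallback for ones that keep coming back anyway.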

On the other hand, there was a... thing crawling a search page on a separate site, but doing so in the most ass-brained way possible: different IP addresses, all with fake user agents from real clients, fetching search results from a database retrieval form with default options. (You really expect me to believe that someone on Symbian is fetching only page 6000 of all blog posts for the lowest user ID in the database?) The worst part about this one is that the URLs frequently had mangled query strings, like someone had tried to use substring functions to swap out the page number and gotten it wrong 30 times, resulting in Markov-like gibberish. The only way to get this foul customer to go away was to automatically ban any IP that used the search form incorrectly. So far I have banned 111,153 unique addresses.
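The "ban any IP that used the search form incorrectly" idea can be sketched as a query-string validator—the parameter name and the digits-only rule here are hypothetical stand-ins for whatever the real form expects:

```python
from urllib.parse import parse_qs

banned = set()  # grows over time; the commenter reports 111,153 entries so far

def search_request_ok(ip, query_string):
    """Validate a search-form query string; ban the IP if it's mangled.
    Hypothetical sketch: real parameter names and rules will differ."""
    params = parse_qs(query_string)
    page = params.get("page", ["1"])[0]
    if not page.isdigit():  # catches substring-mangled values like "6a00"
        banned.add(ip)
        return False
    return True
```

A human using the form through the browser can't produce a malformed query string, so the rule only ever fires on clients constructing URLs by hand—exactly the population being targeted.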

robots.txt wasn't adequate to stop this madness, but I can't say I miss Ahrefs or DotBot trying to gather valuable SEO information about my constructed languages.
