A short history of web bots and bot detection techniques

rShergold|8 months ago

Back in the early 2000s lots of websites had an unauthenticated "guestbook" feature where visitors could leave a message. As soon as Google and PageRank became a thing, bots would drive by and leave links to the website they were promoting. The idea was to increase the number of backlinks and thus improve your Google rank.

The fix to this was shockingly simple. Add an input box with a standard name like "title" and then hide it with CSS. The bots would always provide a value for every input. If you saw a value for your hidden input you returned 200 but never added the post to your website.
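A minimal sketch of that honeypot check, in Python. The field names other than "title", the form markup, and the `Guestbook` class are assumptions made for illustration, not taken from any particular guestbook software:

```python
from dataclasses import dataclass, field

# Hypothetical form markup: "title" is the honeypot, hidden from humans with CSS.
FORM_HTML = """
<form method="post" action="/guestbook">
  <input name="name">
  <textarea name="message"></textarea>
  <input name="title" style="display:none" tabindex="-1" autocomplete="off">
  <button type="submit">Sign</button>
</form>
"""

HONEYPOT_FIELD = "title"  # assumed name; any plausible-looking name works


@dataclass
class Guestbook:
    """In-memory stand-in for wherever the posts are really stored."""
    posts: list = field(default_factory=list)

    def submit(self, form: dict) -> int:
        """Handle one submission and return the HTTP status code to send."""
        # A human never sees the hidden input, so it arrives empty.
        # A naive bot fills in every input, so a value here marks it as spam.
        if (form.get(HONEYPOT_FIELD) or "").strip():
            return 200  # look successful, but silently drop the post
        self.posts.append({
            "name": form.get("name", ""),
            "message": form.get("message", ""),
        })
        return 200


if __name__ == "__main__":
    book = Guestbook()
    book.submit({"name": "Ann", "message": "Nice site!", "title": ""})
    book.submit({"name": "bot", "message": "cheap links", "title": "cheap links"})
    print(len(book.posts))
```

Returning 200 in both cases is the point: the bot gets no signal that its post was discarded, so it has no reason to adapt.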

semolino|8 months ago

I implemented this very technique last year after getting some crypto spam on the guestbook of my personal website. It works like a charm.

alexpotato|8 months ago

This is bringing me back to running my own site back in the day.

osigurdson|8 months ago

I needed a new GitHub account the other day. The "are you human" tests were so hard that I almost gave up. I think a new way to do this will be needed soon.

bobbiechen|8 months ago

Great high-level overview. One of the challenges of learning about bot detection is that it's adversarial, and revealing info about your techniques can help the attackers evade you.

I do work on a bot detection product, and I've seen some group chats where crackers are sharing notes about how they're evading detection tools. The more unnerving part is that the public groups are less serious, and there are certainly better private groups aiming at anything with a good financial reward.

ahmedhawas123|8 months ago

I'm curious about how this world will evolve in the era of AI agents/MCP. It is not entirely unlikely that AI agents will have access to limited wallets etc. to facilitate a broader set of use cases. In that case, a one-shot solution to bot vs. human may not make sense, and corporations may need a more nuanced split: human / bot-we-like / bot-we-don't-like. This would especially be the case for unofficial MCP servers that use technologies like headless browsing to support an API.

m3047|8 months ago

I'm not sure I understand the mental model you're basing your inferences on, but my model leads to a far different outcome:

If you've got a good enough bot and it's pre-qualified to spend money, then it can use the special "register as a bot" API and provide personal information and whatever else I want in order to understand that there is a "real human" behind the curtain. A credit card alone is not enough; they can be (trivially) stolen. The way I see it, using agentic bots will ultimately require you to provide more personal details than an actual human would.

alexpotato|8 months ago

"Robots spending money" has been going on since the 1980s in algorithmic trading.

notjoemama|8 months ago

Maybe I missed it, but I didn't see a mention of the permanent token cell network providers inject into client requests. Knowing what these are and mocking them is another thing a bot might do to impersonate a real device.

laurent_du|8 months ago

Does anyone know of a good reference on the topic of fingerprinting?

keysdev|8 months ago

https://github.com/gautamkrishnar/nothing-private

I recently used the DDG browser and it just can't get some sites to clear! I tried the flame button, but I was still logged in after clearing browser data a few times.

Are companies resorting to this kind of tactic to keep users remembered? It's a major booking site for lodging!!!

Qubes OS seems more and more attractive.

ghxst|8 months ago

https://abrahamjuliot.github.io/creepjs/ https://github.com/abrahamjuliot/creepjs

Usually my go-to. The readme, source code and GitHub issues are a great source of information, and the website itself is useful to test against.

edit:

For anything related to network fingerprinting, especially censorship, I usually browse https://github.com/net4people/bbs/issues

ape4|8 months ago

I liked the depiction of different TCP SYN packets ;)

irico|8 months ago

How do systems like OpenAI Operator bypass bot protection for the entire web?

yellow_lead|8 months ago

> Orchestraion frameworks

Small typo here
