I'm not very familiar with the concept of tarpitting. How do they get the bot to run CPU-intensive code? By passing in extra JavaScript? Can this affect a bot that doesn't run any JS?
I've been running many non-JS crawlers for the past few years, and a few pages kept driving my servers' CPU load so high that they ground to a halt. When I dug into the source, I saw that the HTML was a convoluted mess of tables inside tables inside tables inside more tables, which made it incredibly time-consuming and CPU-intensive for my DOM parser to parse (I was using Nokogiri, a Ruby gem, at the time). So Cloudflare could be serving these kinds of "fake" pages to bad bots.
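To make the nested-table trick concrete, here is a rough sketch in Python (not the Ruby/Nokogiri setup from the comment). The `nested_tables` generator and the `max_table_depth` check are hypothetical names of my own, and nothing here is confirmed to be what Cloudflare serves. It builds a page of tables inside tables, then measures the nesting with the standard library's streaming `html.parser`, which a crawler could use to bail out before handing the page to a full DOM parser.

```python
from html.parser import HTMLParser


def nested_tables(depth: int) -> str:
    """Build a hypothetical bait page: `depth` tables nested inside one another."""
    opening = "<table><tr><td>" * depth
    closing = "</td></tr></table>" * depth
    return f"<html><body>{opening}x{closing}</body></html>"


class TableDepthMeter(HTMLParser):
    """Records the deepest <table> nesting seen while streaming through the markup."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def max_table_depth(html: str) -> int:
    """Return the maximum <table> nesting depth without building a DOM tree."""
    meter = TableDepthMeter()
    meter.feed(html)
    meter.close()
    return meter.max_depth
```

A crawler could call `max_table_depth(body)` first and skip any page above some threshold (say 50, an arbitrary number) instead of parsing it fully.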
They could also be doing things like serving fake streaming audio that never ends, or anything that might make it seem like the web page is just a huge page that needs time to load.
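From the crawler's side, the usual defence against a body that never ends is a hard byte budget. A minimal sketch, assuming the response arrives as an iterable of byte chunks; `endless_body` and `read_capped` are made-up names for illustration, not any library's API:

```python
import itertools
from typing import Iterable, Iterator, Tuple


def endless_body(chunk: bytes = b"\x00" * 1024) -> Iterator[bytes]:
    """Hypothetical tarpit body: yields the same chunk forever."""
    return itertools.repeat(chunk)


def read_capped(chunks: Iterable[bytes], max_bytes: int) -> Tuple[bytes, bool]:
    """Read chunks until the stream ends or the byte budget is used up.

    Returns (data, truncated); truncated is True when the cap cut the
    stream short rather than the stream ending on its own.
    """
    buf = bytearray()
    for chunk in chunks:
        room = max_bytes - len(buf)
        if len(chunk) > room:
            # Take only what still fits in the budget, then stop reading.
            buf += chunk[:room]
            return bytes(buf), True
        buf += chunk
    return bytes(buf), False
```

A real crawler would pair this with a wall-clock timeout, since a byte cap alone does nothing against a stream that is slow rather than large.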
One way is to use a special, very slow TCP handler. Imagine a TCP stack that only lets through one 10-byte packet every 10 seconds. This wastes a connection slot on the target machine.
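You don't need a custom TCP stack to try the idea; an application-level approximation gets the same effect. A rough sketch using Python's `asyncio` streams, where `drip`, `handle`, `serve`, the port and the payload are all my own assumptions. It writes the response 10 bytes at a time with a 10-second pause between writes, so each bot connection stays open for as long as the bot is willing to wait.

```python
import asyncio


async def drip(writer, payload: bytes, chunk_size: int = 10, delay: float = 10.0) -> int:
    """Write payload in chunk_size pieces, sleeping `delay` seconds after each one.

    Returns the number of chunks written.
    """
    sent = 0
    for i in range(0, len(payload), chunk_size):
        writer.write(payload[i:i + chunk_size])
        await writer.drain()
        sent += 1
        await asyncio.sleep(delay)
    return sent


async def handle(reader, writer) -> None:
    """Per-connection handler: ignore the request and drip out a padded response."""
    # Assumed payload: a valid header followed by filler markup.
    payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n" + b"<p>loading</p>" * 1000
    try:
        await drip(writer, payload)
    except ConnectionError:
        pass  # The client gave up; nothing more to do.
    finally:
        writer.close()


async def serve(host: str = "127.0.0.1", port: int = 8080) -> None:
    """Run the tarpit until cancelled (host and port are placeholders)."""
    server = await asyncio.start_server(handle, host, port)
    async with server:
        await server.serve_forever()
```

Start it with `asyncio.run(serve())`. Because each connection spends nearly all its time in `asyncio.sleep`, the tarpit side stays cheap while the client keeps a socket and a worker tied up.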
Usually via JavaScript. Many credential-stuffing and similar bots need to run headless browsers these days to do their job. The folks at Kasada (https://www.kasada.io) have talked over the years, at a high level, about some of the approaches they've taken; there should be a few conference presentations on YouTube. They don't get into the finer detail though, as I assume there's a large amount of secret sauce in what they do too.
I'm not sure what they do for the non-JS use case. They sit in the request path like a CDN though so maybe they just return an error or deliberately slow response times?
Was also curious about this step. I assume they're not going to reveal the nitty-gritty details for fear of botters coding around them, but I am curious how you can "make them use more CPU" while crawling a website.
AznHisoka|6 years ago
opwieurposiu|6 years ago
https://en.wikipedia.org/wiki/Slowloris_(computer_security)
glenngillen|6 years ago
nodja|6 years ago
s09dfhks|6 years ago
jsnell|6 years ago
Steamspy.com seems to trigger one of these basically every time when loaded with a fresh cookie.
nielsbot|6 years ago
aussieguy1234|6 years ago