"There’s ways to get around TLS signatures but it’s much harder and requires a lot more legwork to get working"
I wouldn't call it "much harder". All you need to bypass the signature is to choose random ciphers (list at https://curl.se/docs/ssl-ciphers.html) and you mash them up in a random order separated by colons in curl's --ciphers option. If you pick 15 different ciphers in a random order, there are over a trillion signatures possible, which he couldn't block. For example this works:
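For scale: 15 ciphers can be ordered in 15! ≈ 1.3 trillion ways. The commenter's actual command isn't shown; a minimal sketch of the idea, assuming a curl built against OpenSSL (whose cipher names are the ones on that list), might look like:

```shell
# Shuffle a handful of OpenSSL cipher names and join them with ':'
# for curl's --ciphers option. The subset below is illustrative; any
# valid names from https://curl.se/docs/ssl-ciphers.html work.
ciphers=$(printf '%s\n' \
  ECDHE-ECDSA-AES128-GCM-SHA256 \
  ECDHE-RSA-AES128-GCM-SHA256 \
  ECDHE-ECDSA-AES256-GCM-SHA384 \
  ECDHE-RSA-AES256-GCM-SHA384 \
  ECDHE-ECDSA-CHACHA20-POLY1305 \
  ECDHE-RSA-CHACHA20-POLY1305 \
  AES128-GCM-SHA256 \
  AES256-GCM-SHA384 \
  | shuf | paste -sd: -)
echo "$ciphers"

# Every run yields a new ordering, hence a new TLS signature, e.g.:
# curl --ciphers "$ciphers" https://example.com/
```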
> NOTE: Due to many WAFs employing JavaScript-level fingerprinting of web browsers, thermoptic also exposes hooks to utilize the browser for key steps of the scraping process. See this section for more information on this.
This reminds me of how Stripe does user tracking for fraud detection https://mtlynch.io/stripe-update/ I wonder if thermoptic could handle that.
I'm aware of curl-impersonate https://github.com/lwthiker/curl-impersonate, which works around these kinds of things (and makes working with Cloudflare much nicer), but serious scrapers use Chrome plus a USB keyboard/mouse gadget that you can ssh into, so there's literally no evidence of mechanical means.
Also: if you serve some Anubis code without actually running the Anubis script in the page, you'll still get some answers back, so there's at least one Anubis simulator running on the Internet that doesn't bother to actually run the JavaScript it's given.
Also also: 26M requests daily is only 300 requests per second and Apache could handle that easily over 15 years ago. Why worry about something as small as that?
He does use it (I verified it from curl on a recent Linux distro). But he probably blocked only some fingerprints. And the fingerprint depends on the exact OpenSSL and curl versions, as different version combinations will send different TLS ciphers and extensions.
From what I have seen, it is hard to tell what "serious scrapers" use. They use many things: some use this, some don't. That is what I have learned reading about web scraping on Reddit. Nobody says things like that out loud.
There are many tools; see the links below.
Personally I think running Selenium can be a bottleneck: it does not play nice, processes sometimes break, the system sometimes even needs a restart because of blocked things, it can be a memory hog, etc. That is my experience.
To be able to scale, I think you have to have your own implementation. Serious scrapers complain about people using Selenium or its derivatives as noobs who will come back asking why page X does not work with their scraping setup.
Claude was scraping my cgit at around 12 requests per second, but in bursts here or there. My VPS could easily handle this, even being a free tier e2-micro on Google Cloud/Compute Engine, but they used almost 10GB of my egress bandwidth in just a few days, and ended up pushing me over the free tier.
Granted it wasn't a whole lot of money spent, but why waste money and resources so "claude" can scrape the same cgit repo over and over again?
> Also also: 26M requests daily is only 300 requests per second and Apache could handle that easily over 15 years ago. Why worry about something as small as that?
That doesn't matter, does it? Those 26 million requests could be going to actual users instead and 300 requests per second is non-trivial if the requests require backend activity. Before you know it you're spending most of your infra money on keeping other people's bots alive.
The git repository seems to only contain the build of the website, with no source code.
The author is probably using git to push the content to the hosting server as an rsync alternative, but there does not seem to be much leaked information, apart from the URL of the private repository.
> with tools like Anubis being largely ineffective
To the contrary - if someone "bypasses" Anubis by setting the user agent to Googlebot (or curl), it means it's effective. Every Anubis installation I've been involved with so far explicitly allowed curl. If you think it's counterproductive, you probably just don't understand why it's there in the first place.
There are also HTTP fingerprints. I believe it's named after akamai or something.
All of it is fairly easy to fake. JavaScript is the only thing that poses any challenge, and the challenge it poses is in how you do it with minimal performance impact. The simple truth is that a motivated adversary can interrogate and match every single minor behavior of the browser to be bit-perfect, and there is nothing anyone can do about it, except for TPM attestation, which also requires a fully jailed OS environment in order to control the data flow to the TPM.
Even the attestation pathway can probably be defeated, either through the mandated(?) accessibility controls or going for more extreme measures. And putting the devices to work in a farm.
This is exactly right, and it's why I believe we need to solve this problem in the human domain, with laws and accountability. We need new copyrights that cover serving content on the web and give authors control over who gets to access that content, WITHOUT requiring locked-down operating systems or browser monopolies.
Indeed, I named it after akamai because they wrote a whitepaper for it.
I think I first used akamai_fingerprint on https://tls.peet.ws, where you can see all your fingerprints!
Blocking on ja3/ja4 signals to folks exactly what you are up to. This is why bad actors doing ja3 randomization became a thing in the last few years and made ja3 matching useless.
Imo use ja3/ja4 as a signal and block on src IP. Don't show your cards. The JA4 extensions that compare network latency against HTTP/TLS latency are also pretty elite for identifying folks who are proxying.
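The latency trick, sketched below with made-up numbers and thresholds: a forwarding proxy answers the TCP handshake close to your server, but the TLS handshake still has to round-trip to the distant real client, so the two timings diverge.

```python
def proxy_score(tcp_rtt_ms: float, tls_rtt_ms: float) -> float:
    """Ratio of TLS handshake round-trip time to TCP round-trip time.

    For a direct client both are dominated by the same network path,
    so the ratio stays near 1. A forwarding proxy completes the TCP
    handshake locally but must relay the TLS handshake onward to the
    real client, inflating the ratio.
    """
    return tls_rtt_ms / max(tcp_rtt_ms, 0.1)

# Illustrative readings:
print(proxy_score(40.0, 48.0))   # direct-ish client: ratio near 1
print(proxy_score(3.0, 150.0))   # something is relaying in between
```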
Some of the bad actors, and Chrome, randomize extensions, but only their order. I think it's ja3n that started to sort the extensions, before doing the hashing.
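Roughly what that normalization looks like; the field layout here is simplified (real JA3 also hashes elliptic curves and point formats):

```python
import hashlib

def ja3_like(version, ciphers, extensions, normalize=False):
    # JA3 joins TLS ClientHello parameters into a string and MD5-hashes it.
    # A "ja3n"-style variant sorts the extension list first, so Chrome's
    # per-connection extension shuffling no longer changes the hash.
    if normalize:
        extensions = sorted(extensions)
    fields = [str(version),
              "-".join(map(str, ciphers)),
              "-".join(map(str, extensions))]
    return hashlib.md5(",".join(fields).encode()).hexdigest()

# Two ClientHellos that differ only in extension order:
a = ja3_like(771, [4865, 4866], [0, 23, 65281])
b = ja3_like(771, [4865, 4866], [65281, 0, 23])
print(a == b)  # plain JA3: the hashes differ
print(ja3_like(771, [4865, 4866], [0, 23, 65281], normalize=True) ==
      ja3_like(771, [4865, 4866], [65281, 0, 23], normalize=True))  # normalized: same
```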
Blocking on source IP is tricky, because that frequently means blocking or rate-limiting thousands of IPs. If you're fine with just blocking entire subnets or all of AWS, I'd agree that it's probably better.
It really depends on who your audience is and who the bad actors are. For many of us the bad actors are AI companies, and they don't seem to randomize their TLS extensions. Frankly many of them aren't that clever when it comes to building scrapers, which is exactly the problem.
I'm curious about why the user-agent he described can bypass Anubis, since it contains "Mozilla", sounds like a bug to me.
Edit: Nevermind, I see part of the default config is allowing Googlebot, so this is literally intended. Seems like people who criticize Anubis often don't understand what the opinionated default config is supposed to accomplish (only punish bots/scrapers pretending to be real browsers).
It is a cute technique, but I would prefer if the fingerprint were used higher up in the stack. The fingerprint should be compared against the User-Agent. I'm more interested in blocking curl when it is specifically reporting itself as Chrome/x.y.z.
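A sketch of that cross-check. The fingerprint hashes below are placeholders, not real JA3 digests; an actual deployment would use a curated fingerprint database:

```python
# Hypothetical fingerprint database: the hash keys are placeholders.
KNOWN_JA3 = {
    "0000chromeplaceholder": "chrome",
    "0000firefoxplaceholder": "firefox",
}

def ua_mismatch(ja3_hash: str, user_agent: str) -> bool:
    """True when the TLS fingerprint belongs to a known browser family
    but the User-Agent header claims to be something else."""
    expected = KNOWN_JA3.get(ja3_hash)
    if expected is None:
        return False  # unknown fingerprint: no claim to contradict
    return expected not in user_agent.lower()

print(ua_mismatch("0000chromeplaceholder", "curl/8.5.0"))  # curl wearing a Chrome TLS stack
print(ua_mismatch("0000chromeplaceholder", "Mozilla/5.0 Chrome/126.0"))
```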
Most of the abusive scraping is much lower hanging fruit. It is easy to identify the bots and relate that back to ASNs. You can then block all of Huawei cloud and the other usual suspects. Many networks aren't worth allowing at this point.
For the rest, the standard advice about performant sites applies.
So clearly the `data` member is already an integer. The sane way to cast it would be to cast to the actual desired destination type, rather than first to some other random integer and then to a `void` pointer.
Like so:
uint8_t * const data = (uint8_t *) ctx->data;
I added the `const` since the pointer value is not supposed to change, since we got it from the incoming structure. Note that this `const` does not mean we can't write to `data` if we feel like it; it means the base pointer itself can't change, i.e. we can't "re-point" the pointer. This is often a nice property, of course.
Your code emits a compiler warning about casting an integer to a pointer. Changing the cast to void* emits a slightly different warning about the size of integer being cast to a pointer being smaller than the pointer type. Casting to a long and then a void* avoids both of these warnings.
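A way to dodge both warnings without picking an arbitrary intermediate integer: cast through `uintptr_t` from `<stdint.h>`, which is defined to round-trip pointer values. A sketch, assuming a hypothetical struct layout where the field is an integer carrying a pointer value:

```c
#include <stdint.h>

/* Hypothetical stand-in for the real structure under discussion. */
struct ctx {
    uint64_t data;  /* integer field carrying a pointer value */
};

static uint8_t *get_data_ptr(const struct ctx *ctx)
{
    /* Going through uintptr_t tells the compiler this is a deliberate
       integer-to-pointer conversion at the correct width, so neither
       the int-to-pointer nor the size-mismatch warning fires. */
    return (uint8_t *)(uintptr_t)ctx->data;
}
```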
Those bots would be really naive not to use curl-impersonate. I basically use it for any request I make even if I don’t expect to be blocked because why wouldn’t I.
There are plenty of naive bots. That is why tar pits work so well at trapping them. And this TLS-based detection looks just like an offline/broken site to bots, so it will be harder to spot unless you are trying to scrape only that one single site.
I heard about curl-impersonate yesterday when I was hitting a CF page. Did something else to completely bypass it, which has been successful, but should try this.
mrb|5 months ago
But, yes, most bots don't bother randomizing ciphers, so most will be blocked.
Symbiote|4 months ago
In a month or two, I can be annoyed when I see some vibe-coded AI startup's script making five million requests a day to my work's website with this.
They'll have been ignoring the error responses:
— a message we also write in the first line of every HTML page source. Then I will spend more time fighting this shit, and less time improving the public data system.
renegat0x0|5 months ago
https://github.com/lexiforest/curl_cffi
https://github.com/encode/httpx
https://github.com/scrapy/scrapy
https://github.com/apify/crawlee
klaussilveira|4 months ago
Keystroke dynamics and mouse movement analysis are pretty fun ways to tackle more advanced bots: https://research.roundtable.ai/proof-of-human/
But of course, it is a game of cat and mouse and there are ways to simulate it.
geek_at|5 months ago
~$ curl https://foxmoss.com/.git/config
    [core]
        repositoryformatversion = 0
        filemode = true
        bare = false
        logallrefupdates = true
    [remote "origin"]
        url = https://github.com/FoxMoss/PersonalWebsite
        fetch = +refs/heads/*:refs/remotes/origin/*
    [branch "master"]
        remote = origin
        merge = refs/heads/master