https://imgur.com/a/K5z1X2R

It's a long .gif which shows that Cloudflare's website loads just fine, but HIBP is unusable. Thanks Troy and Cloudflare for making this (free) service unusable. It's free, so I shouldn't expect it to work anyway. Chrome on Linux and no VPN, fwiw.
The "Checking if the site connection is secure" message pisses me off so much. They are lying straight to my face. It is about checking whether they want to serve me, nothing about the site or my connection.
I believe the implementation is buggy, because I almost never see a Cloudflare page and I can consistently see one here when double-clicking the button.
This looks like an obvious misconfiguration, after passing the first quick Cloudflare check you should've been forwarded to a page with the right results. Sounds like Troy needs to fix his API calls.
Just out of interest, have you checked if installing Cloudflare's Privacy Pass bypasses the issue?
Cloudflare doesn’t like my VPN so I get a lot of their challenges. Now whenever I see one I just close the tab.
Let’s say you run a SaaS using Cloudflare. You may be extremely happy you block 10000 bot requests for every false positive that’s a real human. But let’s say that false positive was a potential customer that would only have paid if they weren’t blocked, and now you just saved less than a penny in server costs to lose hundreds of dollars of lifetime value from that customer.
Sure if you run a free service use Cloudflare. Give in to centralizing the web more, supporting more censorship, and annoying the hell out of a ton of people in the process. But if you’re making money, I don’t see why you wouldn’t have authentication tied to paying users for anything of value, or think of bot traffic as a cost of doing business.
Depends on the application. I worked on one where each new customer cost upwards of $8 due to reporting requirements (similar to background checks). Letting 10k bot requests supply stolen identities was incredibly costly.
Similarly, say an e-commerce business releases a limited edition product. Many users won't end up getting it anyway so blocking a few users is usually a much better experience than letting bots buy the product for resale later.
On the other hand, it's absolutely infuriating when blogs/search engines come up with these.
> You may be extremely happy you block 10000 bot requests for every false positive that’s a real human. But let’s say that false positive was a potential customer that would only have paid if they weren’t blocked, and now you just saved less than a penny in server costs to lose hundreds of dollars of lifetime value from that customer.
The problem is that the economics work the opposite way. We are not happy blocking 10,000 bot requests; we are happy blocking one bot that would make millions of requests. This means it is sometimes OK to lose a few customers who would each pay hundreds of dollars per month, if it allows us to block one bot that would cost thousands.

Bots have the bad habit of targeting the most expensive parts of the system: if there is one query that is hard, expensive, or could be abused, that's the one that will be targeted first.
Cloudflare is the solution to VPN companies allowing bot accounts to taint their IP addresses.
It sucks as a real person using a VPN, but having your website overwhelmed by bots sucks more than a few VPN users having trouble using your website.
If Cloudflare didn't exist, websites would probably just block VPN IPs like streaming services do.
Cost of doing business. Depending on your SaaS, you could be saving more money by blocking those bot requests than you would earn by acquiring one more customer. It's a trade-off you have to deal with when you get to a certain scale.
> That's a 91% hit rate of solved challenges which is great. That remaining 9% is either humans with a false positive or... bots getting rejected
If I meet a "human check", I quickly decide whether it is worth solving, or just close the tab. I could imagine 9% of people just giving up. Some of these CAPTCHAs require you to find 20 fire hydrants across 3 different rounds of tiles, just to fail you anyway. We have loads of data on websites keeping users' attention [1], and this seems to apply to CAPTCHAs too.

Besides, I think it is now well known that AI is fully capable of solving CAPTCHAs.

[1] https://www.nngroup.com/articles/response-times-3-important-...
I'm trying to buy a car, which involves going to a bunch of dealer websites and seeing what they have in stock. Usually after checking 2-3 dealers I'll get a Cloudflare error and can't access any car dealer websites for a day or two (or I can keep going if I use a different browser, or just Incognito mode).
I guess "Turnstile" might be what I'm running into?

Fortunately I haven’t seen the longer blocking you mention.
This is a practical solution to a very real problem. HIBP is valuable enough that both having no API at all, and having the API scraped by bots, would reduce people's online security.

However, relying on the near-universal behaviour tracking and fingerprinting of large corporations is extremely worrying. The better Turnstile works, the more it becomes like Google's proposed Web Environment Integrity.

https://www.eff.org/deeplinks/2023/08/your-computer-should-s...
"Invisible" assuming you have javascript enabled and use a mainstream browser. The failure mode on these is worse than regular captchas because cloudflare won't even give you a chance to prove that you're a human, you'll just be stuck in a refresh loop.
You don't even need to do anything unorthodox. I'm using Firefox with Javascript enabled, set to block fingerprinting and cross-site cookies, and Cloudflare's bot detector regularly puts me into infinite loops. It's especially frustrating when it's the login page for a paid service doing this to me. Why on earth would I abuse your site with a bot using credentials associated with my real name and payment info?
Surely people who disable javascript are used to the majority of the web being broken for them. JS is an integral part of the web, regardless of how people feel about it.
I did a lot of work on this many years ago at Google. As the article says, it can work well and be minimally invasive for users (they need to run JS but that's a much lower bar than solving complicated CAPTCHAs).
There are several services like Turnstile. I'm an advisor to Ocule [1] which is a similar thing, except it's a standalone service you can use regardless of your serving setup rather than being integrated into a monolithic CDN. It's a smaller company too so you can get the red carpet treatment from them, and they aren't so aggressive about blocking VPNs and privacy modes because their anti-bot JS challenges are strong enough to not need it. They're ex-reverse engineers so know a lot about what works and what doesn't. Their tech may be worth looking at if you're concerned about over-blocking.
The mention of Turnstile using proof of work/space is a bit puzzling/disappointing. That stuff doesn't work so well. There are much better ways to create invisible JS challenges. The core idea is to verify you're in the intended execution environment, and then obfuscate and randomize so effectively that the adversaries give up trying to emulate your code and just run it, which can (a) lead to detection and (b) is very slow and resource intensive for them even if they aren't detected. Proof of work/space doesn't prove much about the real execution environment.
BTW, the author asks what proof of space is. It's where you allocate a huge amount of RAM and then ensure the allocation was actually real by filling it with stuff and doing computations on it. The goal is to restrict per-machine parallelism by causing OOM if many threads are running in parallel, something end users won't do. Obviously it's also a brute force technique that can break cheaper devices.

[1] https://ocule.io/
I think the idea is that they ramp up the difficulty of the proof of work when a user looks suspect according to their other tests. Once it's ramped up to bot levels, those requests become more costly even if they're not blocked.
Interesting that both the client side API and server side API for Cloudflare's turnstile seem to match Google's reCAPTCHA nearly exactly, which works in pretty much the same way with the exception that you can't configure it to _never_ show a visual captcha (in rare cases the v2 Invisible reCAPTCHA will still show the "select all the X from the images below" dialog)
Even down to the API endpoint and JS API names.
https://www.google.com/recaptcha/api/siteverify
https://challenges.cloudflare.com/turnstile/v0/siteverify
grecaptcha.render({ callback: function (token) { ... } });
turnstile.render({ callback: function (token) { ... } });
As soon as I saw the examples I recognised the names. I guess it's designed to be a drop-in replacement?

https://developers.google.com/recaptcha/docs/verify
https://developers.google.com/recaptcha/docs/invisible
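To illustrate the parity, here is a sketch of a server-side verifier that works against either endpoint. Both services accept a form-encoded POST with `secret` and `response` and reply with JSON containing a `success` flag; the helper names are mine, not from either API:

```javascript
// Both siteverify endpoints share the same request/response shape, so
// switching providers is essentially a URL swap.
const RECAPTCHA_VERIFY = 'https://www.google.com/recaptcha/api/siteverify';
const TURNSTILE_VERIFY = 'https://challenges.cloudflare.com/turnstile/v0/siteverify';

function buildVerifyRequest(endpoint, secret, token) {
  return {
    url: endpoint,
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    body: new URLSearchParams({ secret, response: token }).toString(),
  };
}

// Returns true if the provider accepted the token.
async function verifyToken(endpoint, secret, token) {
  const { url, ...init } = buildVerifyRequest(endpoint, secret, token);
  const res = await fetch(url, init);
  return (await res.json()).success === true;
}
```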
That peak is around 400 req/s, right? I would expect the usual solution to be to just rate limit each user to some reasonable limit, and then 429 if they're over.
I feel like 400 req/s should be absorbed, especially since my 5€/month VPS can handle about 2x that, sustained. I might be missing something here, but that just doesn't seem like a peak big enough to warrant degrading the user experience.

Sadly, I'm often categorized by websites as "probably a bot haha get fucked", and it's lost sites hundreds or thousands of $$$ worth of revenue over the years, just from me alone.
The service is unauthenticated so there aren't any "users" to rate limit on. You could try to guess based on IP but it's trivial to get access to massive amounts of IPs (there are proxy/scraping services that do this for you)
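For context, the "usual solution" mentioned above might look like this toy per-IP token bucket (the class name and limits are made up); the point of this reply is that the `ip` key is exactly what large proxy pools make meaningless:

```javascript
// Toy per-IP token bucket rate limiter. Each IP gets `capacity` tokens,
// refilled at `refillPerSec`; a request with no tokens left would get a 429.
class TokenBucket {
  constructor(capacity, refillPerSec) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.buckets = new Map(); // ip -> { tokens, last }
  }

  allow(ip, now = Date.now()) {
    const b = this.buckets.get(ip) ?? { tokens: this.capacity, last: now };
    // Refill proportionally to elapsed time, capped at capacity.
    b.tokens = Math.min(this.capacity, b.tokens + ((now - b.last) / 1000) * this.refillPerSec);
    b.last = now;
    this.buckets.set(ip, b);
    if (b.tokens < 1) return false; // over the limit: respond 429
    b.tokens -= 1;
    return true;
  }
}
```

With a scraping service rotating through thousands of IPs, each bucket stays full and the limiter never fires, which is the weakness being described.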
That's 400 req/s on a distributed cloud environment. 400 requests per second on a $5 VPS is practically free, 400 requests per second on Azure Lambda (or whatever they call it) can bankrupt you.
That $5 VPS will also be null routed within days with the kinds of DDOS traffic sites like HIBP get.
Where is your 5 euro/month VPS hosted, and does it use HTTPS?
I did a non-scientific test last year in which I ran a basic Apache instance that returned a 204 (No Content) response on a $5 instance, and HTTPS alone made it drop around 500 requests/second (once again, no other processing happening beyond returning the headers for a 204 response) [1]. My understanding at the time was that on lower-priced virtualization options you generally don't get VMs with access to the CPU instructions that speed up encryption/decryption.

[1] my comment at the time https://news.ycombinator.com/item?id=30155559
"Which poses an interesting question: how do you create an API that should only be consumed asynchronously from a web page and never programmatically via a script?"
Web developers use JavaScripts to make HTTP requests to API endpoints. The data is being consumed by a script, programmatically. Unless Javascripts are neither scripts nor programs. Good luck with that argument.
There is a W3C TAG Ethical Web Principle that states web users can consume and display data from the web any way they like. They do not need to use, for example, particular software or a particular web page design.
2.12 People should be able to render web content as they want
People must be able to change web pages according to their needs. For example, people should be able to install style sheets, assistive browser extensions, and blockers of unwanted content or scripts or auto-played videos. We will build features and write specifications that respect peoples' agency, and will create user agents to represent those preferences on the web user's behalf.
With respect to the JavaScripts authored by web developers and inlined/sourced in web pages, web users have no control over them short of blocking them outright. As such, arguably they are not ideal for making HTTP requests to API endpoints. Unfortunately these scripts can be, and are, used to deny web users' agency.
He wants to provide a free service, which he is not obligated to do and costs money for him to do, for individuals to check their email. He never intended for bots to check a billion emails, and obviously doesn't want to pay for that.
That should be respected, and people failing to respect it is why we see the destruction of the open web with things like remote attestation as the only way forward.
Complaints like "but the open web" or "but my exotic browser" are honestly worth nothing against a potential solution for a real, pressing issue like spam requests and bots. If we want to keep the nice things, we'd better start coming up with alternative solutions to the real problem, because corporations will decide the future for us if we don't. Ignoring it will not work out for us.
I asked an online shop I frequent if they have an API, as they discount some inventory sometimes, to which they responded there is no public API. It's easy enough to find (using Inspect->Network->XHR) how they serve data, and with just five minutes of work, their raw JSON can be loaded in Python and polled for specific brands/categories/individual products, including anything flagged for a discount.
Programmatic? Yes. Against their wishes? Also, yes.
The guilt or bad feelings subsided after realizing I was using more of their bandwidth with how I was searching before, via the browser. Using the API, I can see when the last inventory update occurred, and only then do I search their discount inventory.
(In any case, if they are on EC2, I cost them 3 to 8 cents per month, estimating high, which they have certainly recouped.)
What about botting in games? It's something that almost everyone is against so it's clear that the general opinion is not aligned with the W3C TAG Ethical Web Principle there.
I've seen many games die due to bots. If developers just straight-up allow them, all the normal players quit.

> 2.12 People should be able to render web content as they want

Yes, the choice between Chrome/FF/Safari/Lynx or some other browser remains. Any discussion that doesn't address how to deal with abusers is moot.
Just put a multi-second delay on _all_ requests to that API; humans will wait, but this costs bots time and slows their aggregate progress.

Add HTTP headers documenting the API you want them to use, as a bot author will look and wonder why it's performing so badly.

The difference between human speed and computer speed is so noticeable that it can be leveraged... No high bar or complex adversarial solution needed, and the side benefit is that if a bot does persist, it spreads the load.
Not sure why this was downvoted without much comment. I can see a possible problem in that the bots would just establish multiple parallel connections and wait.
I am most likely misunderstanding this, but why can't you just use browser automation to generate the Turnstile token? For example, use Playwright/Selenium/Phantom to generate the token and then use that in an API call?

Otherwise, great write-up! And excellent service!
Cloudflare's code attempts to detect browser automation is happening.
For example, desktop computer but clicking things without moving your mouse? Suspicious. Say you're a phone, but have desktop computer fonts installed? Suspicious. And suchlike, the precise methods are the results of a cat-and-mouse game.
If these heuristics identify your browser as suspicious, they either show you an interactive captcha, or they just refuse your request.

And if you don't stop them, you've at least slowed them down big time. That alone could be enough to make the attack useless.
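A toy illustration of that kind of heuristic scoring (the real signals, weights, and thresholds are secret and constantly changing; everything below is made up):

```javascript
// Score a session by how contradictory its signals are. Higher = more
// bot-like. Signal names and weights are invented for illustration.
function suspicionScore(signals) {
  let score = 0;
  if (signals.clicked && signals.mouseMoves === 0) score += 2;   // clicked without moving the mouse
  if (signals.claimsMobile && signals.hasDesktopFonts) score += 2; // phone UA, desktop fonts
  if (!signals.ranJs) score += 3;                                 // challenge script never executed
  return score;
}

// e.g. score >= 3 -> show an interactive captcha; score >= 5 -> refuse outright
```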
> the unsolved challenges are when the Turnstile widget is loaded but not solved (hopefully due to it being a bot rather than a false positive)
So people employ these measures and have no clue whom they filter. Reminds me of the online shops that block you because you click on products too fast. Congratulations, you lost a customer to keep CPU utilization at 20%.
Really interesting article and details. We are in a similar situation, so I would definitely consider such an implementation, and we'll also look at alternatives to Cloudflare.

Thank you, Troy, for writing and sharing your experience.
For this style of abuse mitigation I’m always surprised that HashCash [1] or similar simple, locally implemented proof-of-work mechanisms aren’t more common.
This can be implemented in a way that remains transparent (albeit via JS), poses little impact on ‘good’ users, but protects against a lot of traffic patterns that may be undesirable. The cost can be scaled to match infra capability and the challenge can be a combo of the request data and time. Valid windows for that time can then be synced with cache validity which removes the need to keep tabs on any state.
For those deeper in this space: what am I missing here that prevents this from being the norm?

[1]: http://www.hashcash.org/
The concept makes sense to me; it's reminiscent of HMAC, but with the additional constraint that the "secret" key is in the open. And the server-side verification is opaque, so you have to hope it works. But I like that it's not a CAPTCHA, which has made web browsing worse over the years.
I'm curious to know if there have been similar open source implementations of Turnstile where website operators have found ways of limiting an API call to just a browser, without a captcha. Does anyone know of any?
Recently I've been fighting bots in a different situation. I discovered you can return the non-standard code 444 in nginx, which tells it to close the connection without sending any response.
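For reference, 444 is the special code nginx documents for closing the connection without a response; a minimal, illustrative config fragment (the location path is made up):

```nginx
# Drop abusive requests on the floor: nginx's non-standard 444 closes the
# connection without sending anything back, wasting no bandwidth on bots.
location /scraped-endpoint {
    return 444;
}
```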