
Anubis Works

319 points | evacchi | 10 months ago | xeiaso.net

208 comments


raggi|10 months ago

It's amusing that Xe managed to turn what was historically mostly a joke/shitpost into an actually useful product. They did always say timing was everything.

I am kind of surprised how many sites seem to want/need this. I get the slow git pages problem for some of the git servers that are super deep, lack caches, serve off slow disks, etc.

UNESCO surprised me some. The sub-site in question is pretty big, with thousands of documents of content, but the content is static - this should be trivial to serve, so what's going on? Well, it looks like it's a poorly deployed WordPress on top of Apache, with no caching enabled, no content compression, and no HTTP/2 or HTTP/3. It would likely be fairly easy to get this serving super cheap on a very small machine, but of course doing so requires some expertise, and expertise still isn't cheap.

Sure you could ask an LLM, but they still aren't good at helping when you have no clue what to ask - if you don't even really know the site is slower than it should be, why would you even ask? You'd just hear about things getting crushed and reach for the furry defender.

adrian17|10 months ago

> but of course doing so requires some expertise, and expertise still isn't cheap

Sure, but at the same time, the number of people with the expertise to set up Anubis (not that it's particularly hard, but I mean: to even be aware that it exists) is surely even lower than that of people with WordPress administration experience, so I'm still surprised.

If I were to guess, the reasons for not touching WordPress were unrelated: not wanting to touch a brittle instance, organizational permissions, or maybe the admins just assumed that WP was already configured well.

jtbayly|10 months ago

My site that I'd like this for has a lot of posts, but there are links to a faceted search system based on tags that produces an infinite number of possible combinations and pages for each one. There is no way to cache this, and the bots don't respect the robots.txt file, so they just constantly request URLs, fetching the posts over and over in different numbers and combinations. It's a pain.

mrweasel|10 months ago

> I am kind of surprised how many sites seem to want/need this.

The AI scrapers are not only poorly written, they also go out of their way to do cache busting. So far I've seen a few solutions: Cloudflare, requiring a login, Anubis, or just insane amounts of infrastructure. Some sites have reported 60% of their traffic coming from bots now; for smaller sites it's probably much higher.

cedws|10 months ago

PoW anti-bot/scraping/DDoS protection was already being done a decade ago; I'm not sure why it's only catching on now. I even recall a project that tried to make the PoW useful.

gyomu|10 months ago

If you’re confused about what this is - it’s to prevent AI scraping.

> Anubis uses a proof-of-work challenge to ensure that clients are using a modern browser and are able to calculate SHA-256 checksums

https://anubis.techaro.lol/docs/design/how-anubis-works
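
In rough terms, the client grinds nonces until SHA-256(challenge + nonce) has enough leading zero hex digits, then sends the winning nonce back. A minimal Go sketch of the idea (Anubis does this in browser JavaScript; the function names and difficulty check here are illustrative, not Anubis's actual code):

  package main

  import (
      "crypto/sha256"
      "encoding/hex"
      "fmt"
      "strconv"
      "strings"
  )

  // solve grinds nonces until the hex digest of challenge+nonce
  // starts with `difficulty` zero characters. Expected work is
  // ~16^difficulty hashes; verifying the result takes one hash.
  func solve(challenge string, difficulty int) int {
      prefix := strings.Repeat("0", difficulty)
      for nonce := 0; ; nonce++ {
          sum := sha256.Sum256([]byte(challenge + strconv.Itoa(nonce)))
          if strings.HasPrefix(hex.EncodeToString(sum[:]), prefix) {
              return nonce
          }
      }
  }

  func main() {
      fmt.Println(solve("example-challenge", 4))
  }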

This is pretty cool, I have a project or two that might benefit from it.

x3haloed|10 months ago

I’ve been wondering to myself for many years now whether the web is for humans or machines. I personally can’t think of a good reason to specifically try to gate bots when it comes to serving content. Trying to post content or trigger actions could obviously be problematic under many circumstances.

But I find that when it comes to simple serving of content, human vs. bot is not usually what you’re trying to filter or block on. As long as a given client is not abusing your systems, then why do you care if the client is a human?

namanyayg|10 months ago

"It also uses time as an input, which is known to both the server and requestor due to the nature of linear timelines"

A funny line from his docs

xena|10 months ago

OMG lol I forgot that I left that in. Hilarious. I think I'm gonna keep it.

pie_flavor|10 months ago

Unfortunately it is also false (if taken out of context; Anubis rounds the time to the nearest week, which is probably good enough if the next-nearest week is valid too). Clock desync for a variety of reasons is pervasive - you can't expect the 10th percentile of your users to be accurate even to the day, and even the 25th percentile will be five minutes or so off.

AnonC|10 months ago

Those images on the interstitial page(s) while waiting for Anubis to complete its check are so cute! (I’ve always found all the art and the characters in Xe’s blog very beautiful)

Tangentially, I was wondering how this would impact common search engines (not AI crawlers) and how this compares to Cloudflare’s solution to stop AI crawlers, and that’s explained on the GitHub page. [1]

> Installing and using this will likely result in your website not being indexed by some search engines. This is considered a feature of Anubis, not a bug.

> This is a bit of a nuclear response, but AI scraper bots scraping so aggressively have forced my hand.

> In most cases, you should not need this and can probably get by using Cloudflare to protect a given origin. However, for circumstances where you can't or won't use Cloudflare, Anubis is there for you.

[1]: https://github.com/TecharoHQ/anubis/

JsonCameron|10 months ago

Yeah, unfortunately at the current moment it does prevent indexing. Perhaps down the line we can whitelist search engine IPs. However, some, like Google, use the same ones for both AI and search indexing.

We are still making some improvements, like passing Open Graph tags through so at least rich previews work!

snvzz|10 months ago

>Those images on the interstitial page(s) while waiting for Anubis to complete its check are so cute!

Love them too, and abhor knowing that someone is bound to eventually remove them because they're found to be "problematic" in one way or another.

prologic|10 months ago

I've read about Anubis, cool project! Unfortunately, as pointed out in the comments, it requires your site's visitors to have Javascript™ enabled. This is totally fine for sites that require Javascript™ anyway to enhance the user experience, but not so great for static sites and such that require no JS at all.

I built my own solution that blocks these "Bad Bots" at the network level. I block the entirety of several large "Big Tech / Big LLM" networks at the ASN (BGP) level by utilizing MaxMind's database and a custom WAF and reverse proxy I put together.
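
For illustration, that kind of ASN-level blocking can be a few lines of middleware. A sketch in Go, assuming MaxMind's GeoLite2 ASN database and the geoip2-golang reader (the deny list and handler below are made-up examples, not my actual setup):

  package main

  import (
      "log"
      "net"
      "net/http"

      "github.com/oschwald/geoip2-golang"
  )

  // Illustrative deny list; a real deployment would maintain this
  // from observed scraper traffic (64496+ are documentation ASNs).
  var blockedASNs = map[uint]bool{64496: true, 64497: true}

  func main() {
      db, err := geoip2.Open("GeoLite2-ASN.mmdb")
      if err != nil {
          log.Fatal(err)
      }
      defer db.Close()

      http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
          host, _, _ := net.SplitHostPort(r.RemoteAddr)
          rec, err := db.ASN(net.ParseIP(host))
          if err == nil && blockedASNs[rec.AutonomousSystemNumber] {
              http.Error(w, "blocked", http.StatusForbidden)
              return
          }
          w.Write([]byte("hello\n"))
      })
      log.Fatal(http.ListenAndServe(":8080", nil))
  }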

xyzzy_plugh|10 months ago

A significant portion of the bot traffic TFA is designed to handle originates from consumer/residential space. Sure, there are ASN games being played alongside reputation fraud, but it's very hard to combat. A cursory investigation of our logs showed these bots (which make ~1 request from a given residential IP) are likely in ranges that our real human users occupy as well.

Simply put, you risk blocking legitimate traffic. This solution does as well, but for most humans the actual risk is much lower.

As much as I'd love to not need JavaScript and to support users who run with it disabled, I've never once had a customer or end user complain about needing JavaScript enabled.

It is an incredibly vocal minority who disapprove of requiring JavaScript, the majority of whom, upon encountering a site for which JavaScript is required, simply enable it. I'd speculate that, even then, only a handful ever release a defeated sigh.

jadbox|10 months ago

How do you know it's an LLM scraper and not a VPN? How do you use MaxMind's database to isolate LLMs?

runxiyu|10 months ago

Do you have a link to your own solution?

roenxi|10 months ago

I like the idea but this should probably be something that is pulled down into the protocol level once the nature of the challenge gets sussed out. It'll ultimately be better for accessibility if the PoW challenge is closer to being part of TCP than implemented in JavaScript individually by each website.

fc417fc802|10 months ago

Ship an arbitrary challenge as a SPIR-V or MLIR black box. Integrate the challenge-response exchange with HTTP. That should permit broad support and flexible hardware acceleration.

The "good enough" solution is the existing and widely used SHA( seed, nonce ). That could easily be integrated into a lower level of the stack if the tech giants wanted it.

tripdout|10 months ago

The bot detection takes 5 whole seconds to solve on my phone, wow.

bogwog|10 months ago

I'm using Fennec (a Firefox fork on F-Droid) and a Pixel 9 Pro XL, and it takes around 8 seconds at difficulty 4.

Personally, I don't think the UX is that bad since I don't have to do anything. I definitely prefer it to captchas.

Hakkin|10 months ago

Much better than infinite Cloudflare captcha loops.

oynqr|10 months ago

Lucky. Took 30s for me.

nicce|10 months ago

For me it is like 0.5s. Interesting.

cookiengineer|10 months ago

I am currently building a prototype of what I call the "enigma webfont" where I want to implement user sessions with custom seeds / rotations for a served and cached webfont.

The goal is to make web scraping infeasible because of the computational cost of OCR. It's a cat-and-mouse game right now and I want to change the odds a little. The HTML source would be effectively void without the user session, meaning OTP-like behavior could also make web pages unreadable once the assets go uncached.

This would effectively allow creating a captcha that modifies the local seed window until the user can read a specified word. "Move the slider until you can read the word Foxtrot", for example.

I sure would love to hear your input, Xe. Maybe we can combine our efforts?

My tech stack is Go, though, because it was the only language where I could easily change the webfont files directly without issues.

lifthrasiir|10 months ago

Aside from the obvious accessibility issue, wouldn't that be a substitution cipher at best? Enough corpus would make its cryptanalysis much easier.

rollcat|10 months ago

The problem isn't as much that the websites are scraped (search engines have been doing this for over three decades), it's the request volume that brings the infrastructure down and/or costs up.

I don't think mangling the text would help you, they will just hit you anyway. The traffic patterns seem to indicate that whoever programmed these bots, just... <https://www.youtube.com/watch?v=ulIOrQasR18>

> I sure would love to hear your input, Xe. Maybe we can combine our efforts?

From what I've gathered, they need help in making this project more sustainable for the near and far future, not to add more features. Anubis seems to be doing an excellent job already.

pabs3|10 months ago

It works to block users who have JavaScript disabled, that is for sure.

udev4096|10 months ago

Exactly, it's a really poor attempt to make it appealing to the larger audience. Unless they roll out a version for no-JS, they are the same as "AI" scrapers in enshittifying the web.

throwaway150|10 months ago

Looks cool. But please help me understand: what's to stop AI companies from solving the challenge, completing the proof of work, and scraping websites anyway?

crq-yml|10 months ago

It's a strategy to redefine the doctrine of information warfare on the public Internet from maneuver (leveraged and coordinated usage of resources to create relatively greater effects) towards attrition (resources are poured in indiscriminately until one side capitulates).

Individual humans don't care about a proof-of-work challenge if the information is valuable to them - many web sites already load slowly through a combination of poor coding and spyware ad-tech. But companies care, because that changes their ability to scrape from a modest cost of doing business into a money pit.

In the earlier periods of the web, scraping wasn't necessarily adversarial because search engines and aggregators were serving some public good. In the AI era it's become belligerent - a form of raiding and repackaging credit. Proof of work as a deterrent was proposed to fight spam decades ago (Hashcash), but it's only now that it's really needed to become weaponized.

marginalia_nu|10 months ago

The problem with scrapers in general is the asymmetry of compute resources involved in generating versus requesting a website. You can likely make millions of HTTP requests with the compute required to generate the average response.

If you make it more expensive to request documents at scale, you make this type of crawling prohibitively expensive. On a small scale it really doesn't matter, but if you're casting an extremely wide net and re-fetching the same documents hundreds of times, yeah, it really does matter. Even if you have a big VC budget.
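
Back-of-the-envelope, assuming a challenge that demands d leading zero hex digits (the difficulty values are illustrative):

  expected client cost per page  ~ 16^d hashes
    d = 4: ~65 thousand SHA-256 ops
    d = 5: ~1 million SHA-256 ops
  server verification cost       = 1 hash, regardless of d

Multiply the client cost by hundreds of millions of fetches and the crawl budget stops looking free.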

ndiddy|10 months ago

This makes it much more expensive for them to scrape because they have to run full web browsers instead of limited headless browsers without full Javascript support like they currently do. There's empirical proof that this works. When GNOME deployed it on their Gitlab, they found that around 97% of the traffic in a given 2.5 hour period was blocked by Anubis. https://social.treehouse.systems/@barthalion/114190930216801...

perching_aix|10 months ago

Nothing. The idea is instead that at scale the expense of solving the challenges becomes too great.

userbinator|10 months ago

This is basically the DRM wars again. Those who have vested interests in mass crawling will have the resources to blast through anything, while the legit users get subjected to more and more draconian measures.

ronsor|10 months ago

I know companies that already solve it.

mushufasa|10 months ago

I looked through the documentation and I've come across a couple sites using this already.

Genuine question: why not turn the proof-of-work challenge into mining that generates some revenue for the website? Not a new idea, but when I looked at the docs, the challenge didn't seem to be tied to any monetary coin value.

This is coming from someone who is NOT a big crypto person, but it strikes me that this would be a much better way to monetize organic high-quality content in this day and age. Basically the idea that the Brave browser started with, meeting its moment.

I'm sure Xe has already considered this. Do they have a blog post about this anywhere?

snvzz|10 months ago

My Amiga 1200 hates these tools.

It is really sad that the worldwide web has been taken to the point where this is needed.

pabs3|10 months ago

Recently I heard of a site blocking bot requests with a message telling the bot to download the site via Bittorrent instead.

Seems like a good solution to the badly behaved scrapers, and I feel like the web needs to move away from the client-server model towards a swarm model like Bittorrent anyway.

seba_dos1|10 months ago

Even if these stupid bots just learned to clone git repos instead of crawling through GitLab UI pages, it would already be helpful.

deknos|10 months ago

I wish there was also a tunnel software (client+server) where:

* the server appears on the outside as an HTTPS server/reverse proxy
* the server supports self-signed certificates or Letsencrypt
* when a client goes to a certain (sub)site or route, HTTP auth can be used
* after HTTP auth, all traffic tunneled over that subsite/route is protected against traffic analysis, for example like obfsproxy does it

Does anyone know something like that? I am tempted to ask xeiaso to add such features, but I do not think his tool is meant for that...

rollcat|10 months ago

Your requirements are quite specific, and HTTP servers are built to be generic and flexible. You can probably put something together with nginx and some Lua, aka OpenResty: <https://openresty.org/>

> his

I believe it's their.

immibis|10 months ago

Tor's Webtunnel?

dmtfullstack|10 months ago

Humans are served by bots. Any bot requesting traffic is doing so on behalf of a human somewhere.

What is the problem with bots asking for traffic, exactly?

Context of my perspective: I am a contractor for a team that hosts thousands of websites on a Kubernetes cluster. All of the websites are on a storage cluster (combination of ZFS and Ceph) with SATA and NVMe SSDs. The machines in the storage cluster and also the machines the web endpoints run on have tons of RAM.

We see a lot of traffic from what are obviously scraping bots. They haven't caused any problems.

Tarq0n|10 months ago

Ok? Not everyone has the same resources or technical sophistication.

udev4096|10 months ago

PoW captchas are not new. What's different with Anubis? How can it possibly prevent "AI" scrapers if the bots have enough compute to solve the PoW challenge? AI companies have quite a lot of GPUs at their disposal and I wouldn't be surprised if they used it for getting around PoW captchas

relistan|10 months ago

The point is to make it expensive to crawl your site. Anyone determined to do so is not blocked. But why would they be determined to do so for some random site? The value to the AI crawler likely does not match the cost to crawl it. It will just move on to another site.

So the point is not to be faster than the bear. It’s to be faster than your fellow campers.

matt3210|10 months ago

A package which includes the cool artwork would be awesome

xena|10 months ago

You mean with the art assets extracted?

  $ mkdir -p ./tmp/anubis/static && anubis --extract-resources=./tmp/anubis/static

appleaday1|10 months ago

Nice, will try to deploy to my sites after I eat some mac and cheese

matt3210|10 months ago

Very nice work!

perching_aix|10 months ago

> Sadly, you must enable JavaScript to get past this challenge. This is required because AI companies have changed the social contract around how website hosting works. A no-JS solution is a work-in-progress.

Will be interested to hear of that. In the meantime, at least I learned of JShelter.

Edit:

Why not use the passage of time as the limiter? I guess it would still require JS though, unless there's some hack possible with CSS animations, like requesting an image with certain URL params only after an animation finishes.

This does remind me how all of these additional hoops are making web browsing slow.

Edit #2:

Thinking even more about it, time could be made a hurdle by just... slowly serving incoming requests. No fancy timestamp signing + CSS animations or other trickery required.

I'm also not sure whether time would make at-scale scraping as much more expensive as PoW does. Time is money, sure, but that much? I'm also not sold on the UX of it, but it could be mitigated somewhat by doing news-website-style "I'm only serving the first 20% of my content initially" stuff.

So yeah, will be curious to hear the non-JS solution. The easy way out would be a browser extension, but then it's not really non-JS, just JS compartmentalized, isn't it?

Edit #3:

Turning reasoning on for a moment, this whole thing is a bit iffy.

First of all, the goal is that a website operator would be able to control the use of information they disseminate to the general public via their website, such that it won't be used specifically for AI training. In principle, this is nonsensical. The goal of sharing information with the general public (so, people) involves said information eventually traversing through a non-technological medium (air, as light), to reach a non-technological entity (a person). This means that any technological measure will be limited to before that medium, and won't be able to affect said target either. Put differently, I can rote copy your website out into a text editor, or hold up a camera with OCR and scan the screen, if scale is needed.

So in principle we're definitely hosed, but in practice you can try to hold onto the modality of "scraping for AI training" by leveraging the various technological fingerprints of such activity, which is how we get to at-scale PoW. But then this also combats any other kind of at-scale scraping, such as search engines. You could whitelist specific search engines, but then you're engaging in anti-competitive measures, since smaller third party search engines now have to magically get themselves on your list. And even if they do, they might be lying about being just a search engine, because e.g. Google may scrape your website for search, but will 100% use it for AI training then too.

So I don't really see any technological modality that would be able to properly discriminate AI-training-purposed scraping traffic for you to use PoW or other methods against. You may decide to engage in this regardless based on statistical data, and just live with the negative aspects of your efforts, but then it's a bit iffy.

Finally, what about the energy consumption shaped elephant in the room? Using PoW for this is going basically exactly against the spirit of wanting less energy to be spent on AI and co. That said, this may not be a goal for the author.

The more I think about this, the less sensible and agreeable it is. I don't know man.

Philpax|10 months ago

> First of all, the goal is that a website operator would be able to control the use of information they disseminate to the general public via their website, such that it won't be used specifically for AI training.

This isn't the goal; the goal is to punish/demotivate poorly-behaved scrapers that hammer servers instead of moderating their scraping behaviour. At least a few of the organisations deploying Anubis are fine with having their data scraped and being made part of an AI model.

They just don't like having their server being flooded with non-organic requests because the people making the scrapers have enough resources that they don't have to care that they're externalising the costs of their malfeasance on the rest of the internet.

abetusk|10 months ago

Time delay, as you proposed, is easily defeated by concurrent connections. In some sense, you're sacrificing latency without sacrificing throughput.

A bot network can make many connections at once, waiting out the timeout to get the entirety of their (multiple) requests. Every serial delay you put in is a minor inconvenience to a bot network, since they're automated anyway, but a degraded experience for good-faith use.
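
For example (a hypothetical sketch; the URL and counts are made up), a hundred delayed pages still finish in roughly one delay's worth of wall time when fetched concurrently:

  package main

  import (
      "fmt"
      "net/http"
      "sync"
  )

  func main() {
      var wg sync.WaitGroup
      for i := 0; i < 100; i++ {
          wg.Add(1)
          go func(page int) {
              defer wg.Done()
              // Each goroutine waits out the server's artificial
              // delay in parallel, so throughput barely suffers.
              resp, err := http.Get(fmt.Sprintf("https://example.com/?page=%d", page))
              if err == nil {
                  resp.Body.Close()
              }
          }(i)
      }
      wg.Wait()
  }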

Time delay solutions get worse for services like posting, account creation, etc. as they're sidestepped by concurrent connections that can wait out the delay to then flood the server.

Requiring proof-of-work costs the agent something in terms of resources. The proof-of-work certificate allows for easy verification (in terms of compute resources) relative to the amount of work to find the certificate in the first place.

A small resource tax on agents has minimal effect on everyday use but a compounding effect on bots, as any bot crawl now needs resources that scale linearly with the number of pages it requests. Without proof-of-work, the limiting resource for bots is network bandwidth, as processing page data is effectively free relative to bandwidth costs. By attaching work/energy expenditure to requests, bots now have compute as a bottleneck.

As an analogy, consider if sending an email cost $0.01. For most people, the emails sent over the course of a year would easily total less than $20.00, but for spam bots that send blasts to up to 10k recipients, this would now cost $100.00 per shot. The tax on individual users is minimal but is significant enough that mass spam efforts are strained.

It doesn't prevent spam, or bots, entirely, but the point is to provide some friction that's relatively transparent to end users while mitigating abusive use.

marginalia_nu|10 months ago

You basically need proof-of-work to make this work. Idling a connection is not computationally expensive, so it is not a deterrent.

It's a shitty solution to an even shittier reality.

1oooqooq|10 months ago

Tries to block abusive companies running on an infinite money glitch from clueless investors, by making every request cost a few fractions of a cent more.

... yeah, that will totally work.

apt-apt-apt-apt|10 months ago

Since Anubis is related to AI, the part below read as contradictory at first. As if too many donations would cause the creator to disappear off to Tahiti along with the product development.

"If you are using Anubis .. please donate on Patreon. I would really love to not have to work in generative AI anymore..."

mentalgear|10 months ago

Seems like a great idea, but it'd be nice if the project had a simple description (and didn't use so much anime, as it gives an unprofessional impression).

This is what it actually does: instead of only the provider bearing the cost of content hosting (traffic, storage), the client also bears a cost when accessing it, in the form of computation. Basically, it runs an additional expensive computation on the client, which makes accessing thousands of your webpages at high frequency expensive for crawlers.

> Anubis uses a proof of work in order to validate that clients are genuine. The reason Anubis does this was inspired by Hashcash, a suggestion from the early 2000's about extending the email protocol to avoid spam. The idea is that genuine people sending emails will have to do a small math problem that is expensive to compute, but easy to verify such as hashing a string with a given number of leading zeroes. This will have basically no impact on individuals sending a few emails a week, but the company churning out industrial quantities of advertising will be required to do prohibitively expensive computation.

userbinator|10 months ago

Yes, it just worked to stop me, an actual human, from seeing what you wanted to say... and I'm not interested enough to find a way around it that doesn't involve cozying up to Big Browser. At least CloudFlare's discrimination can be gotten around without JS.

Wouldn't it be ironic if the amount of JS served to a "bot" cost even more bandwidth than the content itself? I've seen that happen with CF before. Also keep in mind that if you anger the wrong people, you might find yourself receiving a real DDoS.

If you want to stop blind bots, perhaps consider asking questions that would easily trip LLMs but not humans. I've seen and used such systems for forum registrations to prevent generic spammers, and they are quite effective.

userbinator|10 months ago

Looks like I struck a nerve. Big Browser, hello ;-)