
The Cost of Being Crawled: LLM Bots and Vercel Image API Pricing

112 points | navs | 10 months ago | metacast.app

119 comments


leerob|10 months ago

(I work at Vercel) While it's good our spend limits worked, it clearly wasn't obvious how to block or challenge AI crawlers¹ with our firewall (which it seems you found manually). We'll surface this better in the UI, and we also have more bot protection features coming soon. Also glad our improved image optimization pricing² would have helped. Open to other feedback as well, thanks for sharing.

¹: https://vercel.com/templates/vercel-firewall/block-ai-bots-f...

²: https://vercel.com/changelog/faster-transformations-and-redu...

ilyabez|10 months ago

Hi, I'm the author of the blog (though I didn't post it on HN).

1) Our biggest issue right now is unidentified crawlers with user agents resembling regular users. We get hundreds of thousands of requests from those daily and I'm not sure how to block them on Vercel.

I'd love them to be challenged. If a bot doesn't identify itself, we don't want to let it in.

2) While we fixed the Image Optimization part and optimized caching, we're now struggling with ISR Write costs. We deploy often and the ISR cache is reset on each deploy.

We are about to put Cloudflare in front of the site, so that we can set Cache-Control headers and cache SSR pages (rather than using ISR) independently.
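For what it's worth, the Cache-Control half of that plan can be sketched in a few lines (a hypothetical helper; the names and TTL values are illustrative, not Metacast's actual config):

```typescript
// Build a Cache-Control value so a CDN like Cloudflare caches SSR pages
// at the edge while browsers always revalidate.
export function cacheControlFor(edgeTtl: number, staleTtl: number): string {
  // s-maxage applies only to shared caches (the CDN); max-age=0 keeps
  // browsers from holding stale HTML; stale-while-revalidate lets the
  // edge serve the old page while it refetches in the background.
  return `public, max-age=0, s-maxage=${edgeTtl}, stale-while-revalidate=${staleTtl}`;
}

// In a Next.js getServerSideProps this would be applied roughly as:
//   res.setHeader("Cache-Control", cacheControlFor(3600, 86400));
```

The point of `s-maxage` is that the edge, not the origin, absorbs repeat crawler hits, independent of ISR.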

zamalek|10 months ago

I'm sure what you can share is limited, as I'm guessing this is cat and mouse. That being said, is there anything you can share about your implementation?

bhouston|10 months ago

The issue is that the Vercel Image API is ridiculously expensive and also not efficient.

I would recommend using Thumbor instead: https://thumbor.readthedocs.io/en/latest/. You could have ChatGPT write up a React image wrapper pretty quickly for this.
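A minimal sketch of such a wrapper, assuming a self-hosted Thumbor at a made-up host (the helper name and host are mine, not from the comment):

```typescript
// Hypothetical helper: build a Thumbor "unsafe" resize URL. Real
// deployments should use HMAC-signed URLs instead, so strangers can't
// mint arbitrary resize work on your server.
const THUMBOR_HOST = "https://thumbor.example.com"; // assumed deployment

export function thumborUrl(src: string, width: number, height: number): string {
  // Thumbor encodes the size spec and the (encoded) origin image URL in the path.
  return `${THUMBOR_HOST}/unsafe/${width}x${height}/${encodeURIComponent(src)}`;
}

// A React wrapper is then essentially just:
//   <img src={thumborUrl(props.src, 640, 360)} alt={props.alt} />
```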

styfle|10 months ago

The article explains that they were using the old Vercel price and that the new price is much cheaper.

> On Feb 18, 2025, just a few days after we published this blog post, Vercel changed their image optimization pricing. With the new pricing we'd not have faced a huge bill.

gngoo|10 months ago

I once sat down to calculate the costs of my app if it ever went viral while hosted on Vercel. That put me off hosting anything on Vercel, or even touching Next.js, ever again. It feels like total vendor lock-in once you have something running there, and you kind of end up paying them 10x more than if you had taken the extra time to deploy it yourself.

arkh|10 months ago

> you kind of end up paying them 10x more than if you had taken the extra time to deploy it yourself

The lengths to which many devs will go to avoid learning server management (or SQL).

sharps_xp|10 months ago

I also do the sit-down-and-calculate exercise. I always end up down a rabbit hole of how to run a viral site as cheaply as possible. It always ends up in the same place: Redis, SQLite, SSE, suspended Fly machines, and a CDN.

jhgg|10 months ago

$5 to resize 1,000 images is ridiculously expensive.

At my last job we resized a very large amount of images every day, and did so for significantly cheaper (a fraction of a cent for a thousand images).

Am I missing something here?

jsheard|10 months ago

It's the usual PaaS convenience tax: you end up paying an order of magnitude or so premium for the underlying bandwidth and compute. AIUI Vercel runs on AWS, so in their case it's a compound platform tax; AWS is expensive even before Vercel adds their own margin on top.

mvdtnz|10 months ago

You're not missing anything. A generation of programmers has been raised to believe platforms like Vercel / Next.js are not only normal, but ideal.

BonoboIO|10 months ago

Absolutely insane pricing. Maybe it works for small blogs, but didn't they think this through?

Millions of episodes; of course they will be visited and the optimization will run.

Banditoz|10 months ago

Yeah, curious too.

Can't the `convert` CLI tool resize images? Can that not be used here instead?
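It can. A quick sketch with ImageMagick (filenames are illustrative; in ImageMagick 7 the binary is `magick` rather than `convert`):

```shell
# Resize to fit within 640x360, preserving aspect ratio, at JPEG quality 80.
convert cover.jpg -resize 640x360 -quality 80 cover-small.jpg
```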

ashishb|10 months ago

As someone who maintains a Music+Podcast app as a hobby project, I intentionally have no servers for it.

You don't need one. You can fetch RSS feeds directly on mobile devices; it's faster, less work to maintain, and has a smaller attack surface for rogue bots.

arresin|10 months ago

If you want to do something interesting with the feeds, though, it gets harder.

VladVladikoff|10 months ago

Death by stupid microservices. Even at 1.5 million pages, and the traffic they are talking about, this could easily be hosted on a fixed $80/month Linode.

KennyBlanken|10 months ago

This isn't specific to microservices. I've seen two organizations with a lot of content have their website brought to its knees because multiple AI crawlers were hitting it.

One of them was pretending to be a very specific version of Microsoft Edge, coming from an Alibaba datacenter. Suuuuuuuuuuuuuuuuuure. Blocked its IP range and about ten minutes later a different subnet was hammering away again. I ended up just blocking based off the first two octets; the client didn't care, none of their visitors are from China.

All of this was sailing right through Cloudflare.
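For reference, that kind of "first two octets" block looks like this as an nginx config fragment (the ranges below are illustrative, not the actual ones):

```nginx
# in the http{} or server{} context
deny 203.0.113.0/24;   # the specific subnet doing the hammering
deny 198.51.0.0/16;    # the whole "first two octets" range
allow all;
```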

ramesh31|10 months ago

The cost of getting locked into Vercel.

nullorempty|10 months ago

Yeah, AI crawlers - add that to my list of phobias. Though for a bootstrapped startup, why not cut all recurring expenses and just deploy ImageMagick, which I'm sure will do the trick for less?

GodelNumbering|10 months ago

Wow, this is interesting. I launched my site about a week ago and only submitted it to Google, but all the crawlers (especially the SEO bots) mentioned in the article were heavily crawling it within a few days.

Interestingly, the OpenAI crawler visited over 1,000 times, many of them as "ChatGPT-User/1.0", which is supposed to be used when a user searches from ChatGPT. Not a single referred visitor, though. Makes me wonder if it's at all beneficial to content publishers to allow bot crawls.

I ended up banning every SEO bot in robots.txt, along with a bunch of other bots.
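As a reference point, banning a crawler in robots.txt is just a pair of lines per bot (the bot names below are common SEO crawlers given as examples, not necessarily the ones actually banned):

```
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /
```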

marcusb|10 months ago

I've seen a bunch of requests with forged ChatGPT-related user agent headers (at least, I believe many are forged - I don't think OpenAI uses Chinese residential IPs or Tencent cloud for their data crawling activities.)

Some of the LLM bots will switch to user agent headers that match real browsers if blocked outright.

outloudvi|10 months ago

Vercel has a fairly generous free quota and non-negligibly high pricing beyond it - I think people still remember https://service-markup.vercel.app/ .

As for the crawl problem, I'll wait and see whether robots.txt proves enough to stop GenAI bots from crawling, since I confidently believe these GenAI companies are too "well-behaved" to respect it.

otherme123|10 months ago

This is my experience with AI bots. This is my robots.txt:

User-agent: *
Crawl-Delay: 20

Clear enough. Google, Bing, and others respect the limits, and while about half my traffic is bots, they never DoS the site.

When a very well-known AI bot crawled my site in August, they set off everything: fail2ban put them temporarily in jail multiple times, the nginx per-IP request limit was serving 426 and 444 to more than half their requests (but they kept hammering the same URLs), and some human users contacted me complaining about the site returning 503. I had to block the bot's IPs at the firewall. They ignore (if they even read) the robots.txt.
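The nginx per-IP request limit described here looks roughly like this (the specific numbers are illustrative, not the commenter's):

```nginx
# Shared-memory zone keyed by client IP, allowing ~5 requests/second.
limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

server {
    location / {
        # Queue up to 20 bursty requests; reject the rest immediately.
        limit_req zone=perip burst=20 nodelay;
        limit_req_status 444;  # nginx's 444 closes the connection with no response
    }
}
```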

dvrj101|10 months ago

Nope they have been ignoring robots.txt since the start. There are multiple posts all over the internet.

randunel|10 months ago

> Optimizing an image meant that Next.js downloaded the image from one of those hosts to Vercel first, optimized it, then served to the users.

So Metacast generates bot traffic on other websites, presumably to "borrow" their content and serve it to their own users, but they don't like it when others do the same to them.

ilyabez|10 months ago

Hi, I'm the author of the blog (though I didn't post it on HN).

I'd encourage you to read up on how the podcast ecosystem works.

Podcasts are distributed via RSS feeds hosted all over the internet, but mostly on specialized hosting providers like Transistor, Megaphone, Omny Studio, etc. that are designed to handle huge amounts of traffic.

All podcast apps (literally, all of them) like Apple Podcasts, Spotify, YouTube Music, Overcast, Pocket Casts, etc. constantly crawl and download RSS feeds, artwork images and mp3s from podcast hosts.

This is how podcasts have been distributed since Apple introduced them in the early 2000s. It's why podcasting still remains an open, decentralized ecosystem.

sergiotapia|10 months ago

Another story for https://serverlesshorrors.com/

It's crazy how these companies are really fleecing their customers who don't know any better. Is there even a way to tell Vercel: "I only want to spend $10 a month max on this project, CUT ME OFF if I go past it."? This is crazy.

I spend $12 a month on BunnyCDN. And $9 a month on BunnyCDN's image optimizer that allows me to add HTTP params to the url to modify images.

1.33TB of CDN traffic. (ps: can't say enough good things about bunnycdn, such a cool company, does exactly what you pay for nothing more nothing less)

This is nuts dude

jsheard|10 months ago

> Is there even a way to tell Vercel: "I only want to spend $10 a month max on this project, CUT ME OFF if I go past it."?

Yes actually, there's a lot to complain about with Vercel but to their credit they do offer both soft and hard spending limits, unlike most other newfangled clouds.

OTOH god help you if you're on Netlify, there you're looking at $0.55/GB with unbounded billing...

leerob|10 months ago

> Is there even a way to tell Vercel: "I only want to spend $10 a month max on this project, CUT ME OFF if I go past it."? This is crazy.

(I work at Vercel). Yes, there are soft and hard spend limits. OP was using this feature, it's called "spend management": https://vercel.com/docs/spend-management

sgarland|10 months ago

+1 for BunnyCDN. It's fantastic.

greatgib|10 months ago

A single $5 vps should be able to handle easily tens of thousands of requests...

Not that much for simple thumbnails, either. So sad that the trend of "fullstack" engineers who are really just frontend JS/TS devs took off, leaving thousands of companies with no clue at all about how to serve websites, run backends, or do server engineering...

bigiain|10 months ago

It's 1999 or 2000, and "proper" web developers, who wrote Perl (as God intended) or possibly C (if they were contributors to the Apache project), started to notice the trend of graphic designers overreaching from their place as HTML jockeys and running whole dynamic websites with some abomination called PHP.

History repeats itself...

majorchord|10 months ago

> A single $5 vps should be able to handle easily tens of thousands of requests...

Source:

e____g|10 months ago

> A single $5 vps should be able to handle easily tens of thousands of requests

Sure, given enough time. Did you miss a denominator?

mediumsmart|10 months ago

Don’t feed the bots. Why a pixel image? Take an SVG and make it pulse while playing.

CharlieDigital|10 months ago

Is there no CDN? This feels like it's a non-issue if there's a CDN.

ilyabez|10 months ago

Hi, I'm the author of the blog (though I didn't post it on HN).

We're going to put Cloudflare in front of our Vercel site and control cache for SSR pages with Cache-Control headers.

dylan604|10 months ago

I guess it goes to show how jaded I am, but as I was reading this, it felt like an ad for Vercel. I'm so sick of marketing content being submitted as actual content, that when I read a potentially actual blog/post-mortem, my spidey senses get all tingly about potential advertising. However, I feel like if I turn down the sensitivity knob, I'll be worse off than knee jerk thinking things like this are ads.

ilyabez|10 months ago

Hi, I'm the author of the blog (though I didn't post it on HN).

I can assure you it is not an ad for Vercel.

bitbasher|10 months ago

$5 for 1,000 image optimizations? Is Vercel not caching the optimized output? Why would it do more than one optimization per image on a fresh deploy?

cratermoon|10 months ago

"Step 3: robots.txt"

Will do nothing to mitigate the problem. As is well known, these bots don't respect it.

randunel|10 months ago

Would you reckon OP's bot(s) respect it when borrowing content from the large variety (their words) of podcast sources they scrape?

andrethegiant|10 months ago

It’s a shame that the knee-jerk reaction has been to outright block these bots. I think in the future, websites will learn to serve pure markdown to these bots instead of blocking. That way, websites prevent bandwidth overages like in the article, while still informing LLMs about the services their website provides.

[disclaimer: I run https://pure.md, which helps websites shield from this traffic]

mtlynch|10 months ago

>I think in the future, websites will learn to serve pure markdown to these bots instead of blocking. That way, websites prevent bandwidth overages like in the article, while still informing LLMs about the services their website provides.

Why?

There's no value to the website for a bot scraping all of their content and then reselling it with no credit or payment to the original author.

dmitrygr|10 months ago

Until these bots become good citizens (eg: respecting robots.txt), I will be serving them gzipped gibberish that decompresses to terabytes.

The ball is in their court. You don’t get to demand civility AFTER being a dick. You apologize and HOPE you’re forgiven.
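Serving "gzipped gibberish" like this is sometimes called a gzip bomb; a rough sketch of generating one (sizes illustrative):

```shell
# ~10 MB on the wire that inflates to 10 GB, because zeroes compress
# almost perfectly. Serve it as the response body with
# `Content-Encoding: gzip` and a naive crawler will try to inflate it all.
dd if=/dev/zero bs=1M count=10240 2>/dev/null | gzip -9 > bomb.gz
ls -lh bomb.gz
```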

pavel_lishin|10 months ago

> I think in the future, websites will learn to serve pure markdown to these bots instead of blocking.

What for? Why would I serve anything to these leeches?

RamblingCTO|10 months ago

I think you're a bit late to the game ;) I built and sold 2markdown last year, which was then copied by Firecrawl/Mendable. And then there's also Jina Reader. Also, "compare with" in the footer does nothing.

Swizec|10 months ago

If only there were some way for websites to serve information and provide interactivity in a machine readable format. Like some sort of application programming interface. You could even return different formats based on some sort of 4-letter code at the end of a URL like .html, .json, .xml, etc.

And what if there was some standard sort of way for robots to tell your site what they're trying to do with some sort of verb like GET, PUT, POST, DELETE etc. They could even use a standard way to name the resource they're trying to interact with. Like a universal resource finder of some kind. You could even use identifiers to be specific! Like /items/ gives you a list of items and /items/1.json gives you data about a specific item.

That would be so awesome. The future is amazing.

tough|10 months ago

how would one serve them .txt instead?
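One possible answer, sketched as a user-agent check (the bot tokens are real crawler names, but the routing helper and the Next.js wiring are assumptions, not a known implementation):

```typescript
// Known LLM crawler user-agent tokens.
const BOT_UA = /GPTBot|ChatGPT-User|ClaudeBot|PerplexityBot|CCBot/i;

// Map a request to a plain-text route for crawlers, e.g.
// /episodes/123 -> /episodes/123.txt, and leave humans on the HTML page.
export function routeFor(userAgent: string, path: string): string {
  return BOT_UA.test(userAgent) ? `${path}.txt` : path;
}

// In Next.js middleware this would be roughly:
//   if (BOT_UA.test(req.headers.get("user-agent") ?? ""))
//     return NextResponse.rewrite(new URL(`${pathname}.txt`, req.url));
```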

happyzappy|10 months ago

Cool globe graphic on that site :)

detaro|10 months ago

or you know, AI crawlers could behave and get all that without any extra work for everybody. What makes you think they'll suddenly respect your scheme?

cachedthing0|10 months ago

"Together they sent 66.5k requests to our site within a single day."

Only script kiddies get into trouble at such low numbers. I'm sure security is your next "misconfiguration". Better to look for an offline job in the entertainment industry.

aledalgrande|10 months ago

I know the language earned you the downvotes (please be kind), but the author of the article is ex-Google and ex-AWS; I too would expect some better infra in place (caching?) and certainly not Vercel.