I run a small service, ifconfig.io, that is now getting 200 million hits a day from around the world.
The response from it is about as small as you could make it; however, at that volume it comes to about 150 GB a day.
If I hosted this on AWS, the bandwidth alone, without any compute, would cost $900 a month. That's prohibitively expensive for a service I just made for fun.
The cost of just sending the HTTP response headers alone is the majority of that cost, too. There is no way to shrink it.
It is currently hosted on a single $40 Linode instance and can easily keep up with the ~2400 sustained QPS. I think it can take about 50% more traffic before I have to scale it. And Linode includes enough bandwidth with that compute to support the service without extra costs.
I don't see how anyone pays the bandwidth ransom that GCP and AWS charge.
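For a rough sanity check, those numbers hang together; a back-of-the-envelope sketch (assuming decimal gigabytes):

```python
# Back-of-the-envelope check on the numbers above.
hits_per_day = 200_000_000
egress_per_day_gb = 150  # decimal GB, i.e. 150e9 bytes

# Average size of one response, including headers.
bytes_per_hit = egress_per_day_gb * 1e9 / hits_per_day
print(bytes_per_hit)  # 750.0 bytes, header-dominated as the comment says

# Sustained request rate implied by the daily hit count.
qps = hits_per_day / 86_400
print(round(qps))  # ~2315, close to the quoted ~2400 sustained QPS
```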
Well, to be fair, most of us who are paying the "bandwidth ransom" do have to scale, quite significantly, I might add, and so the value is in the platform as a whole.
Furthermore, if you are doing something for fun like you are, the bandwidth ransom definitely comes into play for elastic cloud environments, but anyone doing anything significant on AWS/GCP has definitely already negotiated down their bandwidth spend with their AWS/GCP account management team.
For a lot of services, bandwidth is the smallest part of the hosting cost; often around 10%. It really depends on the kind of workloads and traffic you are getting. Of course, the low percentage is partially because their other services are also all very expensive relative to a VPS or dedicated host, but it's not really a comparable service offering.
This is a good list of ways to reduce outgoing bandwidth costs, but as someone who has switched from backend developer to running a small business, I can't help but notice that they don't talk at all about whether any of their cost savings were meaningful to the business.
Sure, it looks like they saved about $2000/month, but consider that those savings probably won't even pay for more than a quarter of one of their developers.
Even though their service is free (their parent company gets business value from the aggregate analytics they obtain through their service), it's very possible that there's something they could have done to bring more value to their parent company than the money they saved here.
Maybe it's unreasonable to expect a company to talk about that in a blog post, but it left me wondering.
My read was that they actually saved over $8000 per month:
- They mention that the initial savings of $1500/mo from omitting unnecessary headers was 12% of their egress cost (so the total before this was $12500)
- Then they got an additional 8% of savings by increasing the ALB idle connection timeout to 10 minutes (down to $10120)
- Finally they said they saved $200 per day by switching to a lighter TLS certificate chain ($6000/mo, so down to $4120)
None of those steps seem to have required any meaningful amount of development work. Let's say this took a developer one week? The return on that effort would be $100k a year, or $2500/hour for the first year alone.
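Working backwards from the percentages in the post, the arithmetic looks roughly like this (a sketch; the exact dollar figures are inferred from the quoted savings, not stated in the article):

```python
# Reconstruct the monthly egress bill from the three savings steps.
total = 1500 * 100 // 12        # $1500/mo was 12% of the bill -> $12,500
after_headers = total - 1500    # drop unnecessary response headers -> $11,000
after_timeout = after_headers - after_headers * 8 // 100  # 8% more from the ALB idle timeout -> $10,120
after_certs = after_timeout - 200 * 30  # $200/day from the lighter cert chain, ~$6000/mo -> $4,120

monthly_savings = total - after_certs
print(monthly_savings)  # 8380 dollars per month
```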
> consider that those savings probably won't even pay for more than a quarter of one of their developers
Although I have never run a business, I do believe this kind of optimization is quite meaningful, even though it will never be the top priority of a business.
Those optimizations lower operational cost while being mostly maintenance-free (except the one that switches off AWS Certificate Manager, which may add some effort when renewing), risk-free (unlike refactoring a large legacy system), and requiring little engineering effort (maybe 10 engineering days from investigation to writing the blog post?).
In addition, the blog post itself brings intangible benefits for their branding, website ranking, and hiring.
Let's say it is $2k/mo. I also run a small business. And when things are growing it's easy to think that way. But in the long run every business that faces competition needs to focus on the bottom line. How many developer hours do you think it took to save that $24,000 per year? Not much. And that is just one example. A culture that ignores efficiency is doomed to failure.
I'm not sure this is really questioning anything more than "I wonder if there is something they could have done better in terms of business operations" to which I can't imagine the answer ever being anything other than "yes", especially in retrospect.
> those savings probably won't even pay for more than a quarter of a [developer]
So you're assuming that configuring nginx properly, once, takes 3 months, every year? If it takes the developer (or sysadmin) less long than that, you're already saving money.
If it saves $24,000 a year, and your developer cost is $100/hour, 240 hours or less spent a year on this effort is your breakeven. Pretty sure that's a win.
Another suggestion: terminate somewhere else.
If you fit inside of the CloudFlare T&Cs, you can probably save a much larger amount terminating there and having them peer with you using the same TLS every time, or failing that, try someone like BunnyCDN.
I've found that while AWS CloudFront is easy to instrument, it's neither very performant (lots of cache misses even when well configured) nor cost-effective (very high per-byte cost).
This. If your service is collecting aggregated analytics data from users (bytes that those users would never care to send in the first place), you can get vastly better pricing on traffic by going with providers that don't care too much about high-quality peering.
We went through something similar a couple of years ago, when TLS wasn't as pervasive as it is today and at first focused mostly on minimising the response size – we were already using 204 No Content, but just like the OP we had headers we didn't need to send. In the end we deployed a custom compiled nginx that responded with "204 B" instead of "204 No Content" to shave off a few more bytes. It turned out none of the clients we tested with cared about the string part of the status, just that there was a string part.
When TLS started to become more common we realised the same thing as the OP: the certificates we had were unnecessarily large and cost us a lot, so we switched to another vendor. When ACM came we were initially excited about the convenience it offered, but after a quick look decided it would be too expensive to use for that part of our product.
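The status-line trick above is easy to quantify; a sketch of the byte count (HTTP/1.1 framing only, before TLS overhead, and note HTTP/2 drops the reason phrase entirely):

```python
# Compare the two status lines byte for byte.
standard = b"HTTP/1.1 204 No Content\r\n"
trimmed = b"HTTP/1.1 204 B\r\n"

saved = len(standard) - len(trimmed)
print(saved)  # 9 bytes per response

# At the article's ~5 billion responses/day, that's ~45 decimal GB/day
# of egress from the reason phrase alone.
print(saved * 5_000_000_000 / 1e9)  # 45.0
```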
I was honestly expecting some kind of meh article that said to reduce headers, enable compression and other basic stuff. I was pleasantly surprised that wasn’t the case... and absolutely astounded that the handshake provided that much of a difference, it was the last thing I would have thought of.
At such a high volume of requests it probably makes sense to consider going one abstraction level lower and replacing HTTPS with plain SSL-socket-based communication for further cost reduction.
> Also, the certificate contains lengthy URLs for CRL download locations and OCSP responders, 164 bytes in total.
If you're going down that path, it's probably best to avoid revocation altogether, since it doesn't really work, and go the Let's Encrypt way: certificates with shorter lifespans.
At that scale a 15-day cert on rotation is probably fine.
> We’re currently using an RSA certificate with a 2048-bit public key. We could try switching to an ECC certificate with a 256-bit key instead
Having just ruled out RSA on an embedded project for exactly this reason, definitely the first thing that came to mind.
If they’re getting down to byte-level differences, then under their additional options they really should have used binary-serialized data instead of JSON. Something like CBOR allows near-immediate conversion to JSON, but it would mean an update to all of their endpoints; it might not be feasible now, but it could be worked in for new projects over time.
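To make the CBOR point concrete, here's a toy sketch (an encoder covering only small unsigned ints, short ASCII strings, and small maps; the event fields are made up for illustration):

```python
import json

def cbor_encode(obj):
    """Toy CBOR encoder: small unsigned ints, short ASCII strings, small maps."""
    if isinstance(obj, int) and 0 <= obj < 24:
        return bytes([obj])                      # major type 0, value inline
    if isinstance(obj, str) and len(obj) < 24:
        data = obj.encode("ascii")
        return bytes([0x60 | len(data)]) + data  # major type 3 (text string)
    if isinstance(obj, dict) and len(obj) < 24:
        out = bytearray([0xA0 | len(obj)])       # major type 5 (map)
        for k, v in obj.items():
            out += cbor_encode(k) + cbor_encode(v)
        return bytes(out)
    raise TypeError(f"unsupported in this sketch: {obj!r}")

event = {"uid": 7, "lvl": 3}  # hypothetical analytics event
as_json = json.dumps(event, separators=(",", ":")).encode()
as_cbor = cbor_encode(event)
print(len(as_json), len(as_cbor))  # 17 11
```

Roughly a third smaller even on a tiny payload, and the saving grows with integer-heavy data.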
I'm sad about the state of support for ed25519/curve25519 crypto in TLS.
If you could reasonably deploy a website that doesn't offer anything else for HTTPS, you'd instantly fix many session-establishment-based CPU DoS attacks. It's multiple times faster than what you usually allow your server to negotiate.
I doubt it. AWS's certs are just another three-quarters baked AWS feature. They did the best they could with the resources they had.
At my last job we had a fun and exciting outage when AWS simply didn't auto-renew our certificate. We were given no warning that anything was broken, and it apparently began the internal renewal process at the exact instant the cert expired (rather than 30 days in advance as is common with ACME-based renewal). Ultimately the root cause was that some DNS record in Route 53 went missing, and that silently prevents certificate renewal.
We switched TLS termination from the load balancer to Envoy + cert-manager and the results were much better. You also get HTTP/2 out of the deal. We also wrote a thing that fetches every HTTPS host and makes sure the certificate works, and fed the expiration times into Prometheus so we'd actually be alerted when rotation is broken. Both are features Amazon should support out of the box for the $20/month + $$/gigabyte you pay them for a TLS-terminating load balancer. Both are features Amazon says "you'll pay us anyway" to, and they're right.
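The expiry-alerting part doesn't need much code; a minimal stdlib-only sketch of the idea (function and host names are my own, not from the comment):

```python
import socket
import ssl
from datetime import datetime, timezone

def seconds_left(not_after, now=None):
    """Seconds until a cert's notAfter timestamp, e.g. 'Mar 15 12:00:00 2031 GMT'."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).total_seconds()

def check_host(host, port=443):
    """Fetch the live certificate for a host and return seconds until it expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return seconds_left(cert["notAfter"])
```

Feed the number into your metrics system and alert when it drops below, say, 14 days' worth of seconds.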
Funny enough, Amazon.com uses a DigiCert certificate similar to the one mentioned in the article; they don't seem to use the ones they provide for free on AWS.
You have to terminate TLS at their load balancers though as they don't hand out any private keys of course. Still a great service.
DigiCert is pretty expensive otherwise... always a shock when I look up prices... There is Let's Encrypt, but I never tested it with anything hosted on AWS.
Still, the article has great tips. And even if your app is some B2B service with <200 users, it still wouldn't hurt to implement the measures, even if the product owner doesn't care whether the solution costs $20 or $200 a month. Some of these tips are pretty low effort. Saves energy at least.
Big surprise. Contrary to popular belief, AWS wasn't/isn't built to support Amazon.com. Some fundamental pieces are designed for Amazon.com scale, but most other services are not (ACM in this case).
Didn't see it mentioned: SSL session tickets. If you're running an NLB and nginx in a pool of instances, you can use an OpenResty-based implementation of SSL tickets to dramatically speed up negotiation for reconnecting clients. You will need a Redis server to store the rotating ticket keys, but that's easy with AWS ElastiCache. You will also need to generate the random keys every so often and store them in Redis, removing the oldest ones as you do. This is a task I accomplished by writing a small Go service.
If you serve a latency-critical service, tickets are a must.
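The key-rotation service described above boils down to something like this (a sketch with an in-memory dict standing in for Redis; the real thing would write the keys somewhere every nginx/OpenResty node can read them):

```python
import os

TICKET_KEY_BYTES = 80  # modern nginx accepts 80-byte ticket keys (48-byte legacy also works)

def rotate_ticket_keys(store, keep=3):
    """Add a fresh random ticket key and drop the oldest beyond `keep`.

    The newest key encrypts new tickets; older keys are retained so
    recently issued tickets can still be decrypted after a rotation.
    """
    version = max(store, default=0) + 1
    store[version] = os.urandom(TICKET_KEY_BYTES)
    while len(store) > keep:
        del store[min(store)]
    return version

keys = {}  # stand-in for Redis
for _ in range(5):
    rotate_ticket_keys(keys)
print(sorted(keys))  # [3, 4, 5]: only the three newest key versions survive
```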
I guess this might be especially relevant for traffic patterns similar to the one described in the article; for other use cases those optimisations most likely will not translate into big savings.
How about the obvious solution of not having ANY data transfer out?
Encrypt and sign the data via NaCl or similar, send it via UDP duplicated 5-10 times, with no response at all from the server (it's analytics; it doesn't matter if a few events are lost, and you can even estimate the loss rate).
As for the REST API, deprecate it and, if still needed, place it on whichever three VPS services have the lowest costs, and use low-TTL DNS round-robin with something that removes hosts that are down from DNS.
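The fire-and-forget idea can be sketched with the stdlib (using HMAC-SHA256 as a stand-in for the NaCl sign+encrypt the comment suggests; the collector hostname is hypothetical):

```python
import hashlib
import hmac
import socket

KEY = b"shared-secret-key"  # placeholder; NaCl keys would replace this in practice

def build_packet(payload, key=KEY):
    """Prefix the event with an HMAC-SHA256 tag so the server can drop forgeries."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload

def send_event(payload, addr=("collector.example.com", 9999), copies=5):
    """Fire-and-forget: send the same datagram several times, read nothing back."""
    pkt = build_packet(payload)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        for _ in range(copies):
            s.sendto(pkt, addr)
```

Duplicating the datagram trades a little upstream bandwidth for delivery probability, while the server sends zero bytes out.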
Fascinating article. I love posts with this type of in-depth investigation into what everyone else would just pass over and not even think about.
It's not surprising that it's related to the gaming industry. Some of the best AWS re:Invent videos I've seen are in the GAM (gaming) track. Even though I've never worked in that field, the problems they get hit with and are solving often are very relevant to any high-traffic site. Because of the extreme volume and spikiness of gaming workloads, they tend to find a lot of edge cases, gotchas, and what I'll call anti-best practices (situations where the "best practice" turns out to be an anti-pattern for one reason or another, typically cost).
I wonder what the cost is compared to terminating SSL at CloudFront? For my web-tier architectures, I use CloudFront to reverse proxy both dynamic content (from the API) and static content (from S3). SSL is terminated only at CloudFront.
So for 10k HTTPS requests, the price is $0.01. If you serve 5 billion per day, that is $5,000 a day. With such high traffic I believe you need to handle it with performant webservers (Go, Erlang?) to keep costs reasonable, and terminating SSL at the load balancer is probably the way to go.
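Spelled out, assuming the quoted $0.01 per 10k requests:

```python
# $0.01 per 10k HTTPS requests, as quoted above.
requests_per_day = 5_000_000_000
price_cents_per_10k = 1

daily_cost = requests_per_day // 10_000 * price_cents_per_10k / 100
print(daily_cost)        # 5000.0 dollars per day
print(daily_cost * 30)   # 150000.0 per month
```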
This is an awesome article but if your egress costs are so high that you're deciding which HTTP headers to exclude, you should probably be moving to an unmetered bandwidth provider, or at least one that charges a reasonable amount for egress.
Is there any such thing? I don't know of any cloud service provider that offers unlimited bandwidth. There are very few providers who could handle five billion connections per day in the first place, regardless of bandwidth.
Maybe also consider caching API responses in a cheaper non-AWS CDN where possible, for APIs like "zip code to list of cities" where the output is the same for all users and doesn't change often.
whatupmd|6 years ago
Johnson, you’re fired! I just saved myself 10 grand a month!
Narrator: where do I sign up to work for that guy...
StavrosK|6 years ago
Can you elaborate for someone who isn't that familiar with networking? How does this work?
maxkuzmins|6 years ago
Nice deep dive into the S of HTTPS anyway.
bandris|6 years ago
Interesting idea from the post: "it could be a selling point for a Certificate Authority to use URLs that are as short as possible"
arkadiyt|6 years ago
They do talk about it, SSL tickets and TLS session resumption are referring to the same thing.
nimish|6 years ago
Bandwidth is the killer thing with AWS. It's designed to make you move services inside the boundary.