There's a fixed-memory solution that doesn't suffer from the boundary condition that allows you to double the rate. It's pretty straightforward, so I'll describe it in prose, since it's 6am and I'd rather not get the code wrong. :)
The approach uses a ring buffer. If you're not familiar with them, they are a fixed-size array or linked list that you iterate through monotonically, wrapping around to the beginning when you reach the end. Our ring buffer will hold timestamps and should be initialized to hold 0's -- ensuring that only someone with a time machine could be rate limited before they send any requests. The size of the buffer is the rate limit's value, expressed in whatever time unit you find convenient.
As each request comes in, you fetch the value from the buffer at the current position and compare it to the current time. If the value from the buffer is more than the current time minus the rate limit interval you're using, then you return a 420 to the client and are done. If not, their request is ok and you should serve it normally, but first you store the current timestamp in the buffer and then advance the counter/index.
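Since the comment above describes the algorithm only in prose, here is a minimal Python sketch of that ring-buffer limiter (the class and method names are mine, and it returns a boolean rather than sending the 420 itself):

```python
import time


class RingBufferLimiter:
    """At most `limit` requests per `interval` seconds, in O(limit) memory."""

    def __init__(self, limit, interval_seconds):
        self.interval = interval_seconds
        self.slots = [0.0] * limit   # timestamps; 0 means "never used"
        self.index = 0               # current ring position

    def allow(self, now=None):
        now = time.time() if now is None else now
        # The slot at `index` holds the timestamp of the request exactly
        # `limit` requests ago. If that request is still inside the
        # interval, serving this one would exceed the limit.
        if self.slots[self.index] > now - self.interval:
            return False             # caller should respond 420/429
        self.slots[self.index] = now
        self.index = (self.index + 1) % len(self.slots)
        return True
```

With a limit of 3 per 60 seconds, the fourth request in a minute is rejected, and a request arriving after the oldest timestamp ages out is allowed again.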
The article describes solutions that use Redis so that multiple app servers can share rate limits. The article's second solution is basically what you're describing, except adapted to work with Redis.
Also, what do you mean by "fixed" memory? Sure, the memory doesn't grow over time, but neither does the memory of the other solutions. Of the solutions listed in the article, this is the most memory-hungry.
In response, we implemented a shadow ban: On the surface, the attackers continued to receive a 200 HTTP response code, but behind the scenes we simply stopped sending document invitations after they exceeded the rate limit.
And right there they broke the service for legitimate users. Totally unacceptable collateral damage IMHO.
Shadow ban for everyone that exceeded the rate limit or just the one attacker? As others have said that's shitty for legitimate users that go over the rate limit.
The problem with rate limiting is to distinguish a normal user from a spammer. A normal user can send more requests than usual sometimes. If a normal user gets rate-limited by mistake, you are going to get lots of upset users.
I would think that if you have a consumer application that can't handle double the set rate limit during a very small corner case (the start and end of the minute boundary), you have bigger problems, since you're still effectively enforcing your rate limit over time with that approach. This just sounds like micro-optimization at its worst.
Agreed. Especially given that these rate limits seemed to be aimed at stopping something catastrophic like spammers using 100x allotted capacity, a 2x inaccuracy shouldn't really matter. The solution was interesting, however, and I could see it being useful in a situation where users are expected to run very close to their rate limits. For example, I could see AWS being fairly careful about not letting anyone use more than their allotted compute/network bandwidth, because getting double bandwidth without paying for it is a pretty big deal.
Yeah, I was also thinking how meaningful ~20MB of memory use really would be in this context. Or how badly would racy token bucket perform in the real world. Still, enjoyed the read.
Had we not discovered the attack, we could have faced a huge surge in our delivery costs and a decline in our email sender reputation.
This isn't just about the application being able to handle double the amount of load, it's about keeping costs down and preventing someone from drastically increasing your bill. I work at a fintech company, and many of the 3rd party providers we use have a non-negligible cost per API call. You wouldn't want to wake up one morning and find that someone triggered one of those calls thousands of times while you were asleep.
This can be appropriate as a tactical short-term hack, but for anyone reading who doesn't have a good sense for when to cut corners here, this isn't a great general-purpose solution. You're building assumptions about the way your machines receive requests into your service logic instead of externalizing it in your load balancer, which isn't a good practice.
In practical terms this means that whenever the characteristics of your environment change, your rate limiting suddenly gets wonky. If you're doing a rolling deployment and take 1/3 of your machines out of service, or if some machines go down for maintenance, or a variety of other things happen to your machines, you're going to end up with fluctuating rate limits.
It sounds like OP is running a relatively mature service, so this probably isn't the best idea for them.
This only works if your bucket size is much larger than your number of servers.
In the degenerate case, imagine a rate limit of 2 total requests per minute load balanced across 2 servers with enough traffic that my request is basically hitting a random server. In this case, 50% of the time my second request will be rate limited incorrectly because each server has a bucket of 1 and my second request went to the same server as my first.
I'm sure someone smarter than me (and better at probability) could come up with an equation where you input your rate limit & number of servers and it tells you the probability of a false positive for a user.
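For what it's worth, a quick Monte Carlo estimate is easy to write. This hypothetical sketch assumes the global limit is split evenly across servers and each request lands on a uniformly random server:

```python
import random


def false_positive_rate(limit, servers, trials=100_000):
    """Estimate how often a burst exactly at the global limit gets a
    request wrongly rejected when the limit is split evenly across
    `servers`, each request hitting a uniformly random server."""
    per_server = limit // servers        # each server's local bucket
    rejected = 0
    for _ in range(trials):
        counts = [0] * servers
        ok = True
        for _ in range(limit):           # a burst at the global limit
            s = random.randrange(servers)
            counts[s] += 1
            if counts[s] > per_server:   # local bucket exhausted too early
                ok = False
                break
        rejected += not ok
    return rejected / trials
```

For the degenerate case above (limit of 2 across 2 servers), this comes out near 0.5, matching the 50% figure in the comment.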
I think rate limiting is the wrong idea. Say, for example, a client wants to re-fetch everything it has cached: it may send a burst of requests in a short amount of time, and some of those requests may be wrongly rejected due to rate limits. This is what happens when a browser refreshes a page, for example.
A better approach, I think, is delaying request handling based on resource allocation. If one client is disproportionately using up more time than others, then that client's requests will be queued up to process later, while well-behaved clients will get their requests handled quickly. I think this is a more realistic approach and imposes less arbitrary restrictions.
This is why I've suggested, and used, fair queuing for rate limiting.
Any source can make lots of requests, but they queue up by IP address. (In practice, you have queues organized by hashed IP address, which you service round-robin. That's how routers do fair queuing.) If one queue gets very long, then you start sending 429 errors for new adds to that queue. Otherwise, the requests from the heavy load source just take a little longer.
Unlike most rate limiters, this requires little or no tuning. Each source IP address gets an equal amount of service. Sources which generate heavy load compete with themselves, not others.
The main tuning parameter is how long a queue can get. It's useful to have both length and time limits. But only hostile sources should hit those. They don't change with capacity or number of users. Brief load transients should be handled normally, if slowly.
This won't stop a distributed denial of service attack from many IP addresses. But anything coming from a small number of sources will be handled without much trouble.
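A toy version of that scheme might look like the following; this assumes in-memory queues keyed by hashed source IP and round-robin service (the names and the queue-length limit are made up for illustration):

```python
import hashlib
from collections import OrderedDict, deque


class FairQueue:
    """Round-robin service over per-source queues, keyed by hashed IP.
    Over-long queues answer 429 instead of accepting new work."""

    def __init__(self, max_queue_len=100):
        self.max_queue_len = max_queue_len
        self.queues = OrderedDict()   # bucket -> deque of pending requests

    def _bucket(self, ip):
        return hashlib.sha256(ip.encode()).hexdigest()[:8]

    def enqueue(self, ip, request):
        q = self.queues.setdefault(self._bucket(ip), deque())
        if len(q) >= self.max_queue_len:
            return False              # signal a 429 to this source
        q.append(request)
        return True

    def next_request(self):
        """Pop from the first non-empty queue, rotating queues to the back
        as they are visited so every source gets an equal turn."""
        for bucket in list(self.queues):
            q = self.queues[bucket]
            self.queues.move_to_end(bucket)
            if q:
                return q.popleft()
        return None
```

A heavy source fills only its own queue: its extra requests wait behind each other while other sources' queues are still served every round.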
Once you start getting into quality-of-service, basic rate limiting like those discussed is not enough. I think Stripe did a better job of covering such concerns in their rate limiter post. They specifically talk about being lenient to "bursty" traffic, as that was a legitimate use-case for clients.
Where do you queue those requests? If you do it anywhere under your own control, you will accumulate memory and open connections.
If you issue a 429, the client knows it needs to wait and retry, and you've pushed the backpressure all the way past the demarcation line of your own infrastructure.
This depends why you want to rate limit. If it's the classic case of stopping one customer consuming all resources because of an overenthusiastic script, it's fine. If you're trying to avoid paying for lots of calls to an expensive API, or competitors scraping data, you want to reject the requests entirely rather than just delaying them.
fwiw TokenBuckets are very happy to separate "burst" from "refill rate". ie It's easy to give each client 10 tokens, but then have their bucket refill at 60/minute.
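That split is visible in a bare-bones token bucket, where burst size and refill rate are independent knobs. A sketch (with an explicit clock parameter, so it stays testable; names are illustrative):

```python
import time


class TokenBucket:
    """Holds up to `burst` tokens; refills continuously at `rate` per second."""

    def __init__(self, burst, rate, now=None):
        self.burst = burst
        self.rate = rate
        self.tokens = float(burst)                       # start full
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = now - self.last
        self.last = now
        # Refill for elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

`TokenBucket(10, 1.0)` is the example from the comment: each client can burst 10 requests, then is held to 60 per minute.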
Queueing sounds nice, but the web is synchronous. If somebody crushes you do you really want to queue them and then process the reqs 5 min later? After the connection has already terminated?
This is why I like HackerNews comments. As a primarily front-end dev building a SaaS, I'd already bookmarked this post and was planning implementing it. But it seems like the comments here are pointing me in a better direction.
I wrote Locomotor[0] to automate the translation of Python code to Lua for Redis. Basically, you add an annotation to a function and the first time the code executes, it's converted to Lua and shipped to the server. You can see an example here[1]. It's far from foolproof and many Python language constructs have not been implemented, but it can handle some relatively complex code.
agreed. I was a bit leery of diving into Lua as I was building http://ratelim.it but it really expands Redis's capabilities dramatically and was easy enough to add.
My apps all write the lua into redis and store the hash when they boot up. Duplicative, but means everybody is on the same page and it's easy to store the lua in the main codebase.
This article is about the server deciding which requests to reject. Exponential backoff is a strategy clients use deciding when to retry after their request is rejected. (Plus, the article is about malicious clients; they're not going to follow your preferred backoff strategy.)
More concretely, how would exponential backoff ensure that you don't allow more than 10 requests/second per user?
Literally was about to post exactly this (I was thinking Fibonacci). The article's solutions seems like way too much work for something that shouldn't be half as complicated.
It's good practice to rate limit endpoints for a variety of reasons, but in particular any endpoint that exposes user authentication should be rate limited.
So this should be a tool that every service at scale has access to.
IIRC the lack of rate limiting burned Apple relatively recently.
Is this yet another area where we all reinvent the wheel? I've yet to see a recommendation for an off-the-shelf solution.
Thanks for referencing my earlier post in the article! We use the "sliding window log" you described at ClassDojo, but your more memory-efficient approach looks great.
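For anyone unfamiliar with the term, a sliding window log can be sketched in a few lines. It stores one timestamp per accepted request, which is why it's the memory-hungry option mentioned elsewhere in the thread (names here are illustrative):

```python
import time
from collections import deque


class SlidingWindowLog:
    """Allow a request only if fewer than `limit` accepted requests
    fall inside the trailing `window` seconds."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()   # timestamps of accepted requests

    def allow(self, now=None):
        now = time.time() if now is None else now
        # Drop entries that have aged out of the window.
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

Unlike the fixed-window counter, this never lets a burst straddle a window boundary, at the cost of O(limit) timestamps per user.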
Was wondering what would be the best way to rate limit API requests that are on the order of thousands per minute; am guessing none of the methods suggested in this write-up helps?! Help pls.
Zooming out a little, the fundamental problem here is broadly recognized as a modern protocol design challenge. To phrase the consideration roughly: the response to a request should not require more resources than the client has already spent to request it, either in terms of bandwidth or processing (including memory, storage IO bandwidth, etc.).
Obviously, in some cases such design is not possible. The classic case is HTTP, where the entire purpose is to supply some (arbitrarily large) volume of data in response to a small request, and therefore there is a bandwidth challenge.
Conventional defense strategies tend to utilize the fact that TCP requires a three-way handshake to instantiate, thus validating the peer's IP address (unlike UDP), and include:
(1) An authenticated session, eg. using a session key derived from a separate API call.
(2) Rate limiting per authenticated user, either based upon data over time or request frequency. (This alone is the subject of the article)
(3) Segregating read-only, cacheable data (even if it expires within seconds) on separate infrastructure such as CDNs or memory-based caches.
(4) Aggressive use of HTTP caching.
(5) Careful tuning of HTTP session length related configuration to suit the application profile.
A newer strategy is the use of CAPTCHAs; however, this is not viable in automated (i.e. API) use cases. Another relatively 'new' (for HTTP) strategy is the use of websockets or other mechanisms for real-time 'push', to avoid the latency and processing overheads of the conventionally enforced HTTP request-response model.
Additional options would include segregating clients to speak to different servers (ideally in different data centers, on different links) such that overhead may be scaled across different infrastructure. Thus even if a single server is targeted by an attacker damage is limited in scope.
Another architectural defense would be the enforcement of a gateway/proxy layer (internal or third party) obscuring real datacenter netblocks from attackers, however this comes at the cost of latency where data cannot be cached.
Cloudflare basically provides all of the above (plus additional features) as a service.
Finally, in native mobile application development where an API is the primary interface with some central system, another simple step that can be taken (with appropriate care regarding peer authentication and cache invalidation) is the use of a cached set of IP addresses within the client as a means to identify servers. In this way, attacks against DNS infrastructure will also be nullified, and you can focus the demands of segments of your user base on different infrastructure. (Here in China, DNS is often hijacked or broken, though this is much less of a concern in typical western environments. It will also reduce startup latency on slow mobile links the world over.)
joneholland|9 years ago
Your incoming requests should be balanced across all of the servers so you just derive the allowed throughput and divide by the number front ends....
drchickensalad|9 years ago
Otherwise you can easily get them to hit their limit super early with bad luck.
jdmichal|9 years ago
https://stripe.com/blog/rate-limiters
https://news.ycombinator.com/item?id=13997029
rco8786|9 years ago
Queues and backpressure to properly serve bursty traffic and rate limiting to protect against bad actors.
jdwyah|9 years ago
See https://www.ratelim.it/documentation/safety for how to configure a limit with "burst" in RateLim.it
michaelmior|9 years ago
[0] https://github.com/michaelmior/locomotor
[1] https://github.com/michaelmior/locomotor/blob/master/bench/t...
seniorghost|9 years ago
https://engineering.classdojo.com/blog/2015/02/06/rolling-ra...