There's a fixed-memory solution that doesn't suffer from the boundary condition that allows you to double the rate. It's pretty straightforward, so I'll describe it in prose, since it's 6am and I'd rather not get the code wrong. :)
The approach uses a ring buffer. If you're not familiar with them, they are a fixed-size array or linked list that you iterate through monotonically, wrapping around to the beginning when you reach the end. Our ring buffer will hold timestamps and should be initialized to hold 0's -- ensuring that only someone with a time machine could be rate limited before they send any requests. The size of the buffer is the rate limit's value, expressed in whatever time unit you find convenient.
As each request comes in, you fetch the value from the buffer at the current position and compare it to the current time. If the value from the buffer is more than the current time minus the rate limit interval you're using, then you return a 420 to the client and are done. If not, their request is ok and you should serve it normally, but first you store the current timestamp in the buffer and then advance the counter/index.
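Since the comment above describes the algorithm only in prose, here is a minimal Python sketch of that ring-buffer limiter (the class and method names are mine, and it returns a boolean rather than sending the 420 itself):

```python
import time


class RingBufferLimiter:
    """At most `limit` requests per `interval` seconds, in O(limit) memory."""

    def __init__(self, limit, interval_seconds):
        self.interval = interval_seconds
        self.slots = [0.0] * limit   # timestamps; 0 means "never used"
        self.index = 0               # current ring position

    def allow(self, now=None):
        now = time.time() if now is None else now
        # The slot at `index` holds the timestamp of the request exactly
        # `limit` requests ago. If that request is still inside the
        # interval, serving this one would exceed the limit.
        if self.slots[self.index] > now - self.interval:
            return False             # caller should respond 420/429
        self.slots[self.index] = now
        self.index = (self.index + 1) % len(self.slots)
        return True
```

With a limit of 3 per 60 seconds, the fourth request in a minute is rejected, and a request arriving after the oldest timestamp ages out is allowed again.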
The article describes solutions that use Redis so that multiple app servers can share rate limits. The article's second solution is basically what you're describing, except adapted to work with Redis.
Also, what do you mean by "fixed" memory? Sure, the memory doesn't grow over time, but neither does the memory of the other solutions. Of the solutions listed in the article, this is the most memory-hungry.
In response, we implemented a shadow ban: On the surface, the attackers continued to receive a 200 HTTP response code, but behind the scenes we simply stopped sending document invitations after they exceeded the rate limit.
And right there they broke the service for legitimate users. Totally unacceptable collateral damage IMHO.
Shadow ban for everyone that exceeded the rate limit or just the one attacker? As others have said that's shitty for legitimate users that go over the rate limit.
The problem with rate limiting is to distinguish a normal user from a spammer. A normal user can send more requests than usual sometimes. If a normal user gets rate-limited by mistake, you are going to get lots of upset users.
I would think that if you have a consumer application that can't handle double the set rate limit during a very small corner case (the start and end of the minute boundary), you have bigger problems, since you're still effectively enforcing your rate limit over time with that approach. This just sounds like micro-optimization at its worst.
Agreed. Especially given that these rate limits seemed to be aimed at stopping something catastrophic like spammers using 100x allotted capacity, a 2x inaccuracy shouldn't really matter. The solution was interesting, however, and I could see it being useful in a situation where users are expected to run very close to their rate limits. For example, I could see AWS being fairly careful about not letting anyone use more than their allotted compute/network bandwidth, because getting double bandwidth without paying for it is a pretty big deal.
Yeah, I was also thinking how meaningful ~20MB of memory use really would be in this context. Or how badly would racy token bucket perform in the real world. Still, enjoyed the read.
Had we not discovered the attack, we could have faced a huge surge in our delivery costs and a decline in our email sender reputation.
This isn't just about the application being able to handle double the amount of load, it's about keeping costs down and preventing someone from drastically increasing your bill. I work at a fintech company, and many of the 3rd party providers we use have a non-negligible cost per API call. You wouldn't want to wake up one morning and find that someone triggered one of those calls thousands of times while you were asleep.
This can be appropriate as a tactical short-term hack, but for anyone reading who doesn't have a good sense for when to cut corners here, this isn't a great general-purpose solution. You're building assumptions about the way your machines receive requests into your service logic instead of externalizing it in your load balancer, which isn't a good practice.
In practical terms this means that whenever the characteristics of your environment change, your rate limiting suddenly gets wonky. If you're doing a rolling deployment and take 1/3 of your machines out of service, or if some machines go down for maintenance, or a variety of other things happen to your machines, you're going to end up with fluctuating rate limits.
It sounds like OP is running a relatively mature service, so this probably isn't the best idea for them.
This only works if your bucket size is much larger than your number of servers.
In the degenerate case, imagine a rate limit of 2 total requests per minute load balanced across 2 servers with enough traffic that my request is basically hitting a random server. In this case, 50% of the time my second request will be rate limited incorrectly because each server has a bucket of 1 and my second request went to the same server as my first.
I'm sure someone smarter than me (and better at probability) could come up with an equation where you input your rate limit & number of servers and it tells you the probability of a false positive for a user.
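For what it's worth, a quick Monte Carlo estimate is easy to write. This hypothetical sketch assumes the global limit is split evenly across servers and each request lands on a uniformly random server:

```python
import random


def false_positive_rate(limit, servers, trials=100_000):
    """Estimate how often a burst exactly at the global limit gets a
    request wrongly rejected when the limit is split evenly across
    `servers`, each request hitting a uniformly random server."""
    per_server = limit // servers        # each server's local bucket
    rejected = 0
    for _ in range(trials):
        counts = [0] * servers
        ok = True
        for _ in range(limit):           # a burst at the global limit
            s = random.randrange(servers)
            counts[s] += 1
            if counts[s] > per_server:   # local bucket exhausted too early
                ok = False
                break
        rejected += not ok
    return rejected / trials
```

For the degenerate case above (limit of 2 across 2 servers), this comes out near 0.5, matching the 50% figure in the comment.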
I think rate limiting is the wrong idea. Say, for example, a client wants to re-fetch everything it has cached: it may send a burst of requests in a short amount of time, and some of those requests may be wrongly rejected due to rate limits. This is what happens when a browser refreshes a page, for example.
A better approach, I think, is delaying request handling based on resource allocation. If one client is disproportionately using up more time than others, then that client's requests will be queued up to process later, while well-behaved clients will get their requests handled quickly. I think this is a more realistic approach and imposes less arbitrary restrictions.
This is why I've suggested, and used, fair queuing for rate limiting.
Any source can make lots of requests, but they queue up by IP address. (In practice, you have queues organized by hashed IP address, which you service round-robin. That's how routers do fair queuing.) If one queue gets very long, then you start sending 429 errors for new adds to that queue. Otherwise, the requests from the heavy load source just take a little longer.
Unlike most rate limiters, this requires little or no tuning. Each source IP address gets an equal amount of service. Sources which generate heavy load compete with themselves, not others.
The main tuning parameter is how long a queue can get. It's useful to have both length and time limits. But only hostile sources should hit those. They don't change with capacity or number of users. Brief load transients should be handled normally, if slowly.
This won't stop a distributed denial of service attack from many IP addresses. But anything coming from a small number of sources will be handled without much trouble.
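A toy version of that scheme might look like the following; this assumes in-memory queues keyed by hashed source IP and round-robin service (the names and the queue-length limit are made up for illustration):

```python
import hashlib
from collections import OrderedDict, deque


class FairQueue:
    """Round-robin service over per-source queues, keyed by hashed IP.
    Over-long queues answer 429 instead of accepting new work."""

    def __init__(self, max_queue_len=100):
        self.max_queue_len = max_queue_len
        self.queues = OrderedDict()   # bucket -> deque of pending requests

    def _bucket(self, ip):
        return hashlib.sha256(ip.encode()).hexdigest()[:8]

    def enqueue(self, ip, request):
        q = self.queues.setdefault(self._bucket(ip), deque())
        if len(q) >= self.max_queue_len:
            return False              # signal a 429 to this source
        q.append(request)
        return True

    def next_request(self):
        """Pop from the first non-empty queue, rotating queues to the back
        as they are visited so every source gets an equal turn."""
        for bucket in list(self.queues):
            q = self.queues[bucket]
            self.queues.move_to_end(bucket)
            if q:
                return q.popleft()
        return None
```

A heavy source fills only its own queue: its extra requests wait behind each other while other sources' queues are still served every round.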
Once you start getting into quality-of-service, basic rate limiting like those discussed is not enough. I think Stripe did a better job of covering such concerns in their rate limiter post. They specifically talk about being lenient to "bursty" traffic, as that was a legitimate use-case for clients.
Where do you queue those requests? If you do it anywhere under your own control, you will accumulate memory and open connections.
If you issue a 429, the client knows it needs to wait and retry, and you've pushed the backpressure all the way past the demarcation line of your own infrastructure.
This depends why you want to rate limit. If it's the classic case of stopping one customer consuming all resources because of an overenthusiastic script, it's fine. If you're trying to avoid paying for lots of calls to an expensive API, or competitors scraping data, you want to reject the requests entirely rather than just delaying them.
fwiw TokenBuckets are very happy to separate "burst" from "refill rate". ie It's easy to give each client 10 tokens, but then have their bucket refill at 60/minute.
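That split is visible in a bare-bones token bucket, where burst size and refill rate are independent knobs. A sketch (with an explicit clock parameter, so it stays testable; names are illustrative):

```python
import time


class TokenBucket:
    """Holds up to `burst` tokens; refills continuously at `rate` per second."""

    def __init__(self, burst, rate, now=None):
        self.burst = burst
        self.rate = rate
        self.tokens = float(burst)                       # start full
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = now - self.last
        self.last = now
        # Refill for elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

`TokenBucket(10, 1.0)` is the example from the comment: each client can burst 10 requests, then is held to 60 per minute.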
Queueing sounds nice, but the web is synchronous. If somebody crushes you do you really want to queue them and then process the reqs 5 min later? After the connection has already terminated?
This is why I like HackerNews comments. As a primarily front-end dev building a SaaS, I'd already bookmarked this post and was planning implementing it. But it seems like the comments here are pointing me in a better direction.
I wrote Locomotor[0] to automate the translation of Python code to Lua for Redis. Basically, you add an annotation to a function and the first time the code executes, it's converted to Lua and shipped to the server. You can see an example here[1]. It's far from foolproof and many Python language constructs have not been implemented, but it can handle some relatively complex code.
agreed. I was a bit leery of diving into Lua as I was building http://ratelim.it but it really expands Redis's capabilities dramatically and was easy enough to add.
My apps all write the lua into redis and store the hash when they boot up. Duplicative, but means everybody is on the same page and it's easy to store the lua in the main codebase.
This article is about the server deciding which requests to reject. Exponential backoff is a strategy clients use deciding when to retry after their request is rejected. (Plus, the article is about malicious clients; they're not going to follow your preferred backoff strategy.)
More concretely, how would exponential backoff ensure that you don't allow more than 10 requests/second per user?
Literally was about to post exactly this (I was thinking Fibonacci). The article's solutions seems like way too much work for something that shouldn't be half as complicated.
It's good practice to rate limit endpoints for a variety of reasons, but in particular any endpoint that exposes user authentication should be rate limited.
So this should be a tool that every service at scale has access to.
IIRC the lack of rate limiting burned Apple relatively recently.
Is this yet another area where we all reinvent the wheel? I've yet to see a recommendation for an off-the-shelf solution.
Thanks for referencing my earlier post in the article! We use the "sliding window log" you described at ClassDojo, but your more memory-efficient approach looks great.
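For anyone unfamiliar with the term, a sliding window log can be sketched in a few lines. It stores one timestamp per accepted request, which is why it's the memory-hungry option mentioned elsewhere in the thread (names here are illustrative):

```python
import time
from collections import deque


class SlidingWindowLog:
    """Allow a request only if fewer than `limit` accepted requests
    fall inside the trailing `window` seconds."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()   # timestamps of accepted requests

    def allow(self, now=None):
        now = time.time() if now is None else now
        # Drop entries that have aged out of the window.
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

Unlike the fixed-window counter, this never lets a burst straddle a window boundary, at the cost of O(limit) timestamps per user.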
Was wondering what would be the best way to rate limit API requests that are on the order of thousands per minute; am guessing none of the methods suggested in this write-up helps?! Help pls.
Zooming out a little, the fundamental problem here is broadly recognized as a modern protocol design challenge. To phrase the consideration roughly: the response to a request should not require more resources than the client has already spent to request it, either in terms of bandwidth or processing (including memory, storage IO bandwidth, etc.).
Obviously, in some cases such design is not possible. The classic case is HTTP, where the entire purpose is to supply some (arbitrarily large) volume of data in response to a small request, and therefore there is a bandwidth challenge.
Conventional defense strategies tend to utilize the fact that TCP requires a three-way handshake to instantiate, thus validating the peer's IP address (unlike UDP), and include:
(1) An authenticated session, eg. using a session key derived from a separate API call.
(2) Rate limiting per authenticated user, either based upon data over time or request frequency. (This alone is the subject of the article)
(3) Segregating read-only, cacheable data (even if it expires within seconds) on separate infrastructure such as CDNs or memory-based caches.
(4) Aggressive use of HTTP caching.
(5) Careful tuning of HTTP session length related configuration to suit the application profile.
A newer strategy is the use of CAPTCHAs; however, this is not viable in automated (i.e. API) use cases. Another relatively 'new' (for HTTP) strategy is the use of websockets or other mechanisms for real-time 'push', to avoid the latency and processing overheads of the conventionally enforced HTTP request-response model.
Additional options would include segregating clients to speak to different servers (ideally in different data centers, on different links) such that overhead may be scaled across different infrastructure. Thus even if a single server is targeted by an attacker damage is limited in scope.
Another architectural defense would be the enforcement of a gateway/proxy layer (internal or third party) obscuring real datacenter netblocks from attackers, however this comes at the cost of latency where data cannot be cached.
Cloudflare basically provides all of the above (plus additional features) as a service.
Finally, in native mobile application development where an API is the primary interface with some central system, another simple step that can be taken (with appropriate care regarding peer authentication and cache invalidation) is the use of a cached set of IP addresses within the client as a means to identify servers. In this way, attacks against DNS infrastructure will also be nullified, and you can focus the demands of segments of your user base on different infrastructure. (Here in China, DNS is often hijacked or broken, though this is much less of a concern in typical western environments. It will also reduce startup latency on slow mobile links the world over.)
joneholland|9 years ago
Your incoming requests should be balanced across all of the servers so you just derive the allowed throughput and divide by the number front ends....
drchickensalad|9 years ago
Otherwise you can easily get them to hit their limit super early with bad luck.
jdmichal|9 years ago
https://stripe.com/blog/rate-limiters
https://news.ycombinator.com/item?id=13997029
rco8786|9 years ago
Queues and backpressure to properly serve bursty traffic and rate limiting to protect against bad actors.
jdwyah|9 years ago
See https://www.ratelim.it/documentation/safety for how to configure a limit with "burst" in RateLim.it
michaelmior|9 years ago
[0] https://github.com/michaelmior/locomotor
[1] https://github.com/michaelmior/locomotor/blob/master/bench/t...
seniorghost|9 years ago
https://engineering.classdojo.com/blog/2015/02/06/rolling-ra...