item 16941554

Scaling a High-Traffic Rate Limiting Stack with Redis Cluster

213 points | momonga | 8 years ago | brandur.org

40 comments

[+] jihadjihad|8 years ago|reply
Redis IMHO is in the pantheon of excellent open-source projects, right up there with the likes of HAProxy in terms of code quality, speed, and downright reliability. 100% agree with the notion that more such building blocks need to be built.
[+] spmurrayzzz|8 years ago|reply
Agreed. I'd throw nginx into that cohort as well.
[+] papercruncher|8 years ago|reply
We use Redis Cluster quite extensively. The one thing to be very cautious about, and to load test if you're running in a cloud environment, is failover of nodes holding a lot of keys. If your nodes are holding multiple GBs of data, then depending on your persistence and other configuration settings, Redis may need to hit the disk to recover. If you don't have enough IOPS provisioned, be prepared for a long recovery time. The other thing that used to be a problem, but is getting much better now, is the maturity of the different client libraries with respect to handling Redis Cluster-specific idiosyncrasies.
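The persistence knobs that drive that disk-bound recovery live in redis.conf. A sketch of the directives involved (the values here are illustrative, not recommendations):

```conf
# RDB snapshots: sparser save points mean more unsaved data lost on crash
save 900 1

# AOF: the fsync policy trades durability against write latency
appendonly yes
appendfsync everysec

# On restart, Redis prefers the AOF when both exist; loading a multi-GB
# AOF is where under-provisioned IOPS shows up as slow recovery
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
```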
[+] chucky_z|8 years ago|reply
I just got back from RedisConf and antirez brought up the idea (or that it's already in-development... he was not clear) of releasing an official redis cluster proxy for use with older/less-featured clients.

I believe it was brought up in the keynote (which I missed unfortunately), and also as part of one of the Redis Clients talks.

[+] kraftman|8 years ago|reply
Interesting. At what point does this recovery become a problem? I'd assume it would only be recovering on the slave, since there will have been a newly promoted master after failover?
[+] chucky_z|8 years ago|reply
Excellent article! The use of Lua solves a lot of potential issues here with competing writes to the same keyspace for rate limiting, which could otherwise cause bizarre errors.

The one thing I would note that doesn't seem to be covered: if you are running a relatively large Lua script with `EVAL` over and over, the full script body is sent over the wire every time. Instead you can run `SCRIPT LOAD ...` once, which returns a SHA-1 digest that can then be run with `EVALSHA <sha1> <numkeys> <keys> <args>`. This can potentially speed stuff up as well as cut back on bandwidth.

[+] hamandcheese|8 years ago|reply
But that requires extra logic, and possibly tooling, to do correctly. The script cache isn't persisted IIRC, so if a node restarts the script won't be loaded.
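A common pattern handles both points above: precompute the SHA-1 client-side (it is just the digest of the script's source text, the same value `SCRIPT LOAD` returns) and fall back to `EVAL` when the node answers NOSCRIPT. A minimal sketch in Python; the Lua script and the `rate_limit` helper are illustrative, not from the article:

```python
import hashlib

# Hypothetical rate-limiting script: INCR a counter key and
# set its TTL on the first hit in the window.
LUA_SCRIPT = """
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return current
""".strip()

# EVALSHA identifies a script by the SHA-1 of its source, so the
# digest can be computed without a round trip to the server.
script_sha = hashlib.sha1(LUA_SCRIPT.encode()).hexdigest()

def rate_limit(conn, key, window_seconds):
    """Run the script via EVALSHA; fall back to EVAL if the script
    cache is empty (e.g. after a node restart or SCRIPT FLUSH)."""
    try:
        return conn.execute_command("EVALSHA", script_sha, 1, key, window_seconds)
    except Exception as e:  # redis-py raises NoScriptError here
        if "NOSCRIPT" not in str(e):
            raise
        # EVAL both runs the script and caches it under its SHA-1,
        # so subsequent EVALSHA calls succeed again.
        return conn.execute_command("EVAL", LUA_SCRIPT, 1, key, window_seconds)
```

The fallback means no deploy-time `SCRIPT LOAD` tooling is strictly required: the first call after a restart pays the full-script cost once, and every later call sends only the 40-byte digest.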
[+] baconomatic|8 years ago|reply
I couldn't agree more with "We need more building blocks like Redis that do what they’re supposed to, then get out of the way." Redis has become such a foundational piece of software for me and the projects I work on.

Plus, it's just plain fun to use.

[+] dnomad|8 years ago|reply
Frankly this strikes me as really hacky. A million operations a second isn't even that much. Something like Chronicle [1] can do millions of atomic operations a second. A cluster of 10 nodes for what are basic in-memory counters? And the wackiness of Lua scripts to read from the cache?

It all seems a bit much. I've solved similar problems in the trading space (processing raw market data feeds) with much less.

It's interesting how different communities have their hammers and nails. Redis seems to have really taken over certain consumer-web-oriented communities. In other more enterprise communities I've seen people lean heavily on distributed cache products like Hazelcast etc. And in trading this sort of thing is so bread and butter and common that everybody has internal solutions.

[1] https://chronicle.software/

[+] dividuum|8 years ago|reply
I wonder if this would also be a use case for FoundationDB. All the "clustering" would be built-in, and performance seems to be quite good (https://apple.github.io/foundationdb/performance.html), although probably not comparable to Redis with a configuration that accepts data loss. Does anyone have experience with that?
[+] spullara|8 years ago|reply
I've used it for similar things in the past. Best practice on FDB would be to use snapshot reads on the counters and the add atomic mutation operation so you never have conflicts.
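For context, FDB's ADD mutation takes a little-endian packed integer as the delta, and snapshot reads don't add read-conflict ranges, which is what keeps hot counters contention-free. A sketch assuming the `fdb` Python bindings; the transactional part is shown as comments since it needs a live cluster, and `bump` is a hypothetical name:

```python
import struct

def pack_delta(n):
    # FoundationDB's ADD atomic op interprets the value bytes as a
    # little-endian integer delta.
    return struct.pack("<q", n)

def unpack_counter(raw):
    # Counters written via ADD read back as little-endian integers.
    return struct.unpack("<q", raw)[0]

# Hypothetical usage with the fdb bindings (requires a running cluster):
#
# @fdb.transactional
# def bump(tr, key):
#     tr.add(key, pack_delta(1))      # atomic mutation: no read, no conflict
#     raw = tr.snapshot[key]          # snapshot read: no read-conflict range
#     return unpack_counter(raw)
```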
[+] sciurus|8 years ago|reply
It's nice to hear a success story about Redis Cluster. When I worked at Eventbrite we used Redis heavily, both for the usual use cases (caching, ephemeral storage like sessions) as well as at the core of services like reserved seating. We did our own sharding client-side as a layer on top of the redis-py library and relied on Sentinel to handle failover. After Redis Cluster was released, we had some interest in it, but we were nervous enough about the limitations in its capabilities and the additional complexity of operating it that we never experimented with it.
[+] ttul|8 years ago|reply
I fucking love Redis. We use it inside a large scale email sending platform to do all manner of rate limiting and real time analysis of streaming data to make routing decisions. Could not live without Redis.
[+] abalone|8 years ago|reply
Silly question but any idea what tools were used to create the diagrams in this post?
[+] awshepard|8 years ago|reply
Hazarding a guess, it looks like it might have been Monodraw, or something similar.
[+] pulkitsh1234|8 years ago|reply
More details on Stripe's rate limiter(s): https://stripe.com/blog/rate-limiters. There's a great gist linked at the bottom too, with implementations of the different rate limiters, including the `EVAL` part this post talks about.
[+] xstartup|8 years ago|reply
In adtech, we average over 100 million operations per second, and we don't even touch Redis.

We've been using Memcache all the while and have no desire to change that.

[+] zxcmx|8 years ago|reply
This would be an interesting post if you mentioned what you were doing 100 million times per second. How tangled are your writes? What are your consistency requirements?

100 million set operations per second is not the same as 100 million counter increments etc.

[+] sandGorgon|8 years ago|reply
Isn't this the exact use case that Kafka solves? It's great to see Redis being able to do it probably just as well as Kafka.

I'm quite interested to see how they implemented a queueing solution without the new Redis Streams infrastructure.