Scaling a High-Traffic Rate Limiting Stack with Redis Cluster

[+] jihadjihad|8 years ago|reply

Redis IMHO is in the pantheon of excellent open-source projects, right up there with the likes of HAProxy in terms of code quality, speed, and downright reliability. 100% agree with the notion that more such building blocks need to be built.

[+] spmurrayzzz|8 years ago|reply

Agreed. I'd throw nginx into that cohort as well.

[+] papercruncher|8 years ago|reply

We use Redis Cluster quite extensively. The one thing to be very cautious and load test if running in a cloud environment is failover of nodes that are very loaded in terms of keys. If your nodes are holding multiple GBs of data, and depending on your persistence and other configuration settings, Redis may need to hit the disk to recover. If you don't have enough IOPS provisioned, be prepared for a long recovery time. The other thing that used to be problem but is getting much better now is the maturity of the different client libraries with respect to handling Redis Cluster specific idiosyncrasies.

[+] chucky_z|8 years ago|reply

I just got back from RedisConf and antirez brought up the idea (or that it's already in-development... he was not clear) of releasing an official redis cluster proxy for use with older/less-featured clients.

I believe it was brought up in the keynote (which I missed unfortunately), and also as part of one of the Redis Clients talks.

[+] kraftman|8 years ago|reply

Interesting. At which point is this recovery a problem? Id assume it would only be recovering on the slave since there will have been a newly promoted master after failover?

[+] chucky_z|8 years ago|reply

Excellent article! The use of Lua solves a lot of potential issues here with competing writes to similar spaces for rate limiting, causing potential bizarre errors.

The one thing I would note that doesn't seem to be covered is if you are using a relatively large Lua script and running eval over and over it's getting cached every time, instead `SCRIPT LOAD ...` can be ran, which spits out a sha1 which can then be ran with `EVALSHA (sha1) (keys) (args)`. This can potentially speed stuff up as well as cutting back on memory.

[+] hamandcheese|8 years ago|reply

But requires extra logic and possibly tooling to do that correctly. The scripts aren’t persisted iirc, so if a node restarts the script won’t be loaded.

[+] baconomatic|8 years ago|reply

I couldn't agree more with "We need more building blocks like Redis that do what they’re supposed to, then get out of the way." Redis has become such a foundational piece of software for me and the projects I work on.

Plus, it's just plain fun to use.

[+] dnomad|8 years ago|reply

Frankly this strikes me as really hacky. A million operations a second isn't even that much. Something like Chronicle [1] can do millions of atomic operations a second. A cluster of 10 nodes for what are basic in-memory counters? And the wackiness of Lua scripts to read from the cache?

It all seems a bit much. I've solved similar problems in the trading space (processing raw market data feeds) with much less.

It's interesting how different communities have their hammers and nails. Redis seems to have really taken over certain consumer-web-oriented communities. In other more enterprise communities I've seen people lean heavily on distributed cache products like Hazelcast etc. And in trading this sort of thing is so bread and butter and common that everybody has internal solutions.

[1] https://chronicle.software/

[+] dividuum|8 years ago|reply

I wonder if this would also be a use case for foundationdb. All the "clustering" would be built-in and performance seems to be quite good (https://apple.github.io/foundationdb/performance.html), although probably not comparable to redis with configuration that accepts data loss. Anyone has experience with that?

[+] spullara|8 years ago|reply

I've used it for similar things in the past. Best practice on FDB would be to use snapshot reads on the counters and the add atomic mutation operation so you never have conflicts.

[+] sciurus|8 years ago|reply

It's nice to hear a success story about Redis Cluster. When I worked at Eventbrite we used Redis heavily, both for the usual use cases (caching, ephemeral storage like sessions) as well as at the core of services like reserved seating. We did our own sharding client side as a layer on top of the redis-py library and relied on sentinel to handle failover. After Redis Cluster was released, we had some interest in it, but were were nervous enough about the limitations in its capabilities and the additional complexity of operating it that we never experimented with it.

[+] ttul|8 years ago|reply

I fucking love Redis. We use it inside a large scale email sending platform to do all manner of rate limiting and real time analysis of streaming data to make routing decisions. Could not live without Redis.

[+] garganzol|8 years ago|reply

Author has an enjoyable writing style. Thumbs up for quality writing.

[+] simonw|8 years ago|reply

His blog is one of my favorites - so much good stuff on there. A few recent highlights:

Touring a Fast, Safe, and Complete(ish) Web Service in Rust: https://brandur.org/rust-web

Scaling Postgres with Read Replicas & Using WAL to Counter Stale Reads: https://brandur.org/postgres-reads

Redis Streams and the Unified Log: https://brandur.org/redis-streams

[+] shizcakes|8 years ago|reply

Another approach to this problem is to use Twemproxy: https://github.com/twitter/twemproxy, which can be used like a sidecar Redis load-balancer.

[+] sciurus|8 years ago|reply

Similarly, Envoy has redis support that looks promising.

https://www.envoyproxy.io/docs/envoy/v1.6.0/intro/arch_overv...

[+] nasalgoat|8 years ago|reply

Twemproxy has memory and latency issues that caused us to write our own balancing code in our API. Just FYI.

[+] abalone|8 years ago|reply

Silly question but any idea what tools were used to create the diagrams in this post?

[+] awshepard|8 years ago|reply

Hazarding a guess, it looks like it might have been Monodraw, or something similar.

[+] pulkitsh1234|8 years ago|reply

More details on Stripe's rate limiter(s): https://stripe.com/blog/rate-limiters. An awesome gist is given at the bottom too, which has implementations of the different rate limiters, And also the `EVAL` part this post talks about.

[+] xstartup|8 years ago|reply

In adtech, we average over a 100 million operations per second and we don't even touch redis.

We've been using Memcache all while and have no desire to change that.

[+] zxcmx|8 years ago|reply

This would be an interesting post if you mentioned what you were doing 100 million times per second. How tangled are your writes? What are your consistency requirements?

100 million set operations per second is not the same as 100 million counter increments etc.

[+] unknown|8 years ago|reply

[deleted]

[+] 3uclid|8 years ago|reply

Which company?

[+] sandGorgon|8 years ago|reply

isnt this the exact usecase that kafka solves ? Its great to see redis being able to do it just as well as kafka probably.

I'm quite interested to see how they implemented a queueing solution without the new Redis Streams infrastructure.

[+] unknown|8 years ago|reply

[deleted]

40 comments