dormando | 3 years ago

Hi! I'd like to offer some hopefully useful information if any Slack folks end up reading this, or anyone else with a similar infrastructure. I'll start with some tech and make a separate philosophical comment.

Also caveat: I have no deep view into Slack's infrastructure so anything I say here may not even be relevant. YMMV.

First some self promotion: memcached itself is shipping router/proxy software (https://github.com/memcached/memcached/wiki/Proxy). Mcrouter is difficult to manage and unsupported. This proxy is community developed, more flexible, likely faster, and will support more native features of memcached. We're currently in a stabilization round ensuring it won't eat pets, but all of the basic features have been in for a while. Documentation and example libraries are still needed, but community feedback (or any kind of question/help request) helps speed those up tremendously.

It's not clear to me why memcached is being managed like this; mcrouter seems to be used only to abstract the configuration away from the clients. It has a lot of features for redundant pools and so on. Especially with what sounds like globally immutable data and the threat of cascading failures during rolling upgrades, that sounds like it would be very helpful here.

If cost or pool sizes are the main reasons why the structure is flat, using Extstore (https://github.com/memcached/memcached/wiki/Extstore) can likely help. Even if object value sizes are in the realm of 500 bytes, using flash storage can still greatly reduce the amount of RAM necessary or reduce the pool size (provided the network can still keep up) with nearly identical performance. Extstore makes a lot of tradeoffs (e.g., keeping keys in RAM) to ensure most operations don't actually write to flash or double-read. Extstore's in use in tons of places and everyone's immediately addicted.
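
To make the "keys in RAM, values on flash" tradeoff concrete, here is a back-of-envelope sketch. Every number in it (item header size, key length, size of the in-RAM pointer to flash) is an illustrative assumption and not a measured memcached figure, and it ignores slab overhead and the recent items Extstore still keeps in RAM.

```python
def ram_bytes(items: int, key_len: int, value_len: int,
              header_len: int = 50, flash_ptr_len: int = 12,
              values_on_flash: bool = False) -> int:
    """Rough RAM estimate for a pool; all sizes are hypothetical inputs."""
    if values_on_flash:
        # Key and item header stay in RAM; the value is replaced by a
        # small pointer into flash storage.
        per_item = header_len + key_len + flash_ptr_len
    else:
        # Everything lives in RAM.
        per_item = header_len + key_len + value_len
    return items * per_item


# Hypothetical pool: 1M items, 40-byte keys, 500-byte values.
all_ram = ram_bytes(1_000_000, key_len=40, value_len=500)
with_flash = ram_bytes(1_000_000, key_len=40, value_len=500,
                       values_on_flash=True)
print(all_ram, with_flash, round(all_ram / with_flash, 1))
```

Under these assumed sizes the RAM footprint drops by several times even with values as small as 500 bytes; the saving shrinks as keys get longer relative to values.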

Finally, the Meta Protocol (https://github.com/memcached/memcached/wiki/MetaCommands) can help with stampeding herds, keeping DB load from exploding without adding excess network roundtrips under normal conditions. I've seen lots of workarounds people build, but this protocol extension gives a lot of flexibility you can use to help survive degraded states: anti-stampeding herd, serve-stale, better counter semantics, and so on.
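
As a rough illustration of the anti-stampede flow, here is a minimal client-side sketch. It assumes the documented `mg` flags `v` (value), `t` (TTL), `N` (vivify on miss) and `R` (recache when TTL is low), and the return flags `W` (this client won the right to refill), `Z` (someone else already won) and `X` (item is stale). The helper names, the 30-second TTLs, and treating an empty value plus `Z` as "placeholder being filled" are my own assumptions, not part of any client library.

```python
from dataclasses import dataclass, field


@dataclass
class MetaHeader:
    status: str   # "VA" (value follows), "HD" (no value), "EN" (miss)
    size: int = 0  # value length in bytes, only set for "VA"
    flags: dict = field(default_factory=dict)


def build_mg(key: str, vivify_ttl: int = 30, recache_ttl: int = 30) -> bytes:
    """Build a meta-get that asks for the value, TTL, vivify and recache."""
    return f"mg {key} v t N{vivify_ttl} R{recache_ttl}\r\n".encode()


def parse_header(line: bytes) -> MetaHeader:
    """Parse the first response line, e.g. b'VA 5 W t30\\r\\n'."""
    parts = line.decode().strip().split(" ")
    status, rest = parts[0], parts[1:]
    size = 0
    if status == "VA":
        size = int(rest[0])
        rest = rest[1:]
    # Each return flag is one letter, optionally followed by a token.
    flags = {tok[0]: tok[1:] for tok in rest if tok}
    return MetaHeader(status, size, flags)


def decide(header: MetaHeader) -> str:
    """Pick what this client should do with the response."""
    if header.status == "EN" or "W" in header.flags:
        return "refill"        # plain miss, or we won: go to the DB once
    if "Z" in header.flags and header.size == 0:
        return "wait"          # another client is filling the placeholder
    if "X" in header.flags:
        return "serve_stale"   # stale, but someone else is refreshing it
    return "use"               # normal hit


for raw in (b"VA 0 W t30\r\n", b"VA 0 Z t29\r\n",
            b"VA 5 X Z t-1\r\n", b"VA 5 t120\r\n"):
    print(raw, "->", decide(parse_header(raw)))
```

Only the client that sees `W` talks to the database; everyone else either uses what is cached, serves the stale copy, or briefly waits and retries, all within the same single round trip as a normal get.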
