top | item 28833202

dormando | 4 years ago

Looks like this article got one bit of updated information but missed everything else... I'll address some things point by point:

Data structures: yes, fewer data structures, if any. The point isn't whether the features exist or not, but that memcached is a distributed system _first_, so any feature has to make sense in that context.

"Redis is better supported, updated more often. (or maybe memcached is "finished" or has a narrower scope?)" - I've been cutting monthly releases for like 5 years now (mind the pandemic gap). Sigh.

Memory organization: This is mostly accurate but missing some major points. The sizes of the slab classes don't change, but slab pages can and do get re-assigned automatically. If you assign all memory to the 1MB page class, then empty that class, memory will go back to a global pool to get re-assigned. There are edge cases, but it isn't static and hasn't been for ten years.
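That reassignment flow can be sketched as a toy model (illustrative only; the class and method names are made up for this example and are not memcached internals):

```python
# Toy model of slab page reassignment: pages freed from an emptied
# slab class return to a global pool, where any class can claim them.
class SlabAllocator:
    PAGE_SIZE = 1024 * 1024  # each slab page is 1MB

    def __init__(self, total_pages):
        self.global_pool = total_pages  # unassigned pages
        self.assigned = {}              # slab class id -> page count

    def assign_page(self, slab_class):
        """Move a page from the global pool to a slab class."""
        if self.global_pool == 0:
            return False
        self.global_pool -= 1
        self.assigned[slab_class] = self.assigned.get(slab_class, 0) + 1
        return True

    def release_page(self, slab_class):
        """When a class empties out, its pages go back to the global
        pool -- assignment is not static."""
        if self.assigned.get(slab_class, 0) == 0:
            return False
        self.assigned[slab_class] -= 1
        self.global_pool += 1
        return True
```

So even if every page starts in one class, emptying that class lets other classes pick the memory up later.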

Item size limit: The max slab size has actually been 512k internally for a long time now, despite the item limit being 1mb. Why? Because "large" items are stitched together from smaller slab chunks. Setting a 2mb or 10mb limit is fine in most use cases, but again there are edge cases, especially for very small memory limits. Usually large items aren't combined with small memory limits.
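The stitching is just chunking arithmetic: a large item spans however many max-size chunks it needs. A minimal sketch (the function name is mine, not memcached's):

```python
import math

SLAB_CHUNK_MAX = 512 * 1024  # internal max chunk size (512KB)

def chunks_needed(item_size, chunk_max=SLAB_CHUNK_MAX):
    """Large items are stitched together from chunks no bigger than
    chunk_max: a 1MB item spans two 512KB chunks, a 2MB item four."""
    return math.ceil(item_size / chunk_max)
```

This is why raising the item limit above the internal slab size works: the limit bounds the stitched total, not any single allocation.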

You can also _reduce the slab class overhead_ (which doesn't typically exceed 5-10%) by lowering the "slab_chunk_max" option, which puts the slab classes closer together at the expense of stitching items larger than this class. I.e., if all of your objects are 16kb or less, you can freely set this limit to 16kb and reduce your slab class overhead. I'd love to make this automatic or at least reduce the defaults.
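The overhead being traded off comes from the gaps between slab class sizes: an item lands in the smallest class that fits it, and the unused remainder of that chunk is waste. A rough sketch, assuming a memcached-style geometric class progression (the minimum size, growth factor, and function names here are illustrative, not the server's actual defaults):

```python
def slab_classes(min_size=80, growth=1.25, chunk_max=512 * 1024):
    """Generate slab class sizes: each class is the previous times a
    growth factor, capped at chunk_max."""
    sizes = []
    size = min_size
    while size < chunk_max:
        sizes.append(size)
        size = int(size * growth)
    sizes.append(chunk_max)
    return sizes

def worst_case_overhead(sizes, item_size):
    """Fraction of a chunk wasted when item_size is stored in the
    smallest class that fits it."""
    cls = next(s for s in sizes if s >= item_size)
    return (cls - item_size) / cls
```

Lowering chunk_max trims the top of the class table; items larger than the cap then get stitched from chunks instead of occupying one big, half-empty chunk.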

LRU: looks like the author did notice the blog post (https://memcached.org/blog/modern-lru/) - I'll add that the LRU bumping (mutex contention) is completely removed from the _access path_. This is why it scales to 48 threads. The LRU crawler is not necessary to expire items; there is also a dedicated thread that does the LRU balancing.

The LRU crawler is used to proactively expire items. It is highly efficient since it independently scans slab classes; the more memory an object uses, the fewer neighbors it has, and it schedules when to run on each slab class, so it can focus on areas with the highest return.
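The "focus on highest return" idea can be sketched as a toy scheduler that favors the slab class where past crawls found the most expired items per item scanned (this is my simplification, not the crawler's actual heuristic):

```python
def pick_crawl_target(stats):
    """Toy crawl scheduler. stats maps slab class id to a tuple of
    (expired_found, items_scanned) from previous crawls; pick the
    class with the best yield, and explore unscanned classes first."""
    def yield_rate(cls):
        found, scanned = stats[cls]
        return found / scanned if scanned else 1.0  # unscanned: explore
    return max(stats, key=yield_rate)
```

Classes full of short-TTL items keep getting revisited; classes that rarely yield expirations get crawled less often.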

Most of the thread scalability is pretty old; not just since 2020.

Also worth noting memcached has an efficient flash-backed storage system: https://memcached.org/blog/nvm-caching/ - it requires RAM to keep track of keys, but can put value data on disk. With this tradeoff we can use flash devices without burning them out, as non-get/non-set operations do not touch the SSD (i.e., delete removes from memory but doesn't cause a write). Many very huge installations of this exist.

I've also been working on an internal proxy which is nearing production-readiness for an early featureset: https://github.com/memcached/memcached/issues/827 - scriptable in lua, will have lots of useful features.

boulos|4 years ago

For people who don't know and didn't realize this from the comment: dormando is the principal maintainer of memcached and has been for years (e.g., bradfitz was much less involved after he joined Google).

brody_hamer|4 years ago

I’m sorry for asking a rookie question, but you seem to know memcached really well and I couldn’t find an answer online.

Is there a way to obtain/monitor the time stamp of LRU evictions?

I want to get a sense of how memory constrained my memcached server is, and it seems intuitive to me to monitor the "last used" date of recent evictions. Like, if the server is evicting values that haven't been accessed in 3 months: great. But if the server is evicting values that were last used < 24 hours ago, I have concerns.

dormando|4 years ago

There are stats in "stats items" / "stats slabs". Last access time for most recent eviction per slab class, etc. (see doc/protocol.txt from the tarball).
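For example, the text protocol's "stats items" response contains per-class "STAT items:<class>:<field> <value>" lines (field meanings are in doc/protocol.txt). A small sketch of fetching and parsing them; the helper names are mine, and the exact fields available depend on your server version:

```python
import socket

def parse_stats(text):
    """Parse 'STAT key value' lines from a memcached stats response."""
    stats = {}
    for line in text.splitlines():
        if line.startswith("STAT "):
            _, key, val = line.split(" ", 2)
            stats[key] = val
    return stats

def fetch_stats(host="127.0.0.1", port=11211, sub="items"):
    """Send 'stats items' (or 'stats slabs') over the text protocol
    and read until the terminating END line."""
    with socket.create_connection((host, port), timeout=2) as s:
        s.sendall(f"stats {sub}\r\n".encode())
        buf = b""
        while not buf.endswith(b"END\r\n"):
            chunk = s.recv(4096)
            if not chunk:
                break
            buf += chunk
    return parse_stats(buf.decode())
```

Polling these counters per slab class is enough to graph how "fresh" evicted items are over time.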

"watch evictions" command will also show a stream of details for items being evicted.

tayo42|4 years ago

Not sure if there's a better place to ask, but I'll just try here. Curious about a design decision in extstore. It seems to include a lot of extra machinery for managing writes and tracking what's in memory and what's on disk. Why do you think this is better than just mmap-ing and letting the OS decide what's in memory, using the fs cache, and which pages stay on disk?

dormando|4 years ago

That's an excellent question; it turns out there are a _lot_ of semantics the OS is covering up for you when using mmap. For instance (this may be fixed by now), any process doing certain mmap syscalls could lock access to every open mmap in the OS. So some random cronjob firing could clock your mmap'ed app pretty solidly.

There are also wild bugs; if you google my threads on the LKML you'll find me trying to hunt down a few in the past.

Mainly what I'm doing with extstore is maintaining a clear line between what I want the OS doing and what I want the app doing: a hard rule that the memcached worker threads _cannot_ be blocked for any reason. When they submit work to extstore, they submit to background threads, then return to dequeueing network traffic. If the flash disk hiccups for any reason it means some queues can bloat, but other ops may still succeed.
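That worker/IO split can be sketched with two queues (a toy model of the pattern, not extstore's code; memcached itself is C and event-driven):

```python
import queue
import threading

io_queue = queue.Queue()      # work submitted by worker threads
results = queue.Queue()       # completed reads, picked up later

def io_thread():
    """Background thread: the only place blocking device reads happen.
    If the disk hiccups, only this queue backs up."""
    while True:
        key = io_queue.get()
        if key is None:  # shutdown sentinel
            break
        # simulate a (possibly slow) flash read
        results.put((key, f"value-for-{key}"))
        io_queue.task_done()

def worker_handle_get(key):
    """Worker thread: enqueue the read and return immediately to
    servicing network traffic -- never block on the device."""
    io_queue.put(key)

threading.Thread(target=io_thread, daemon=True).start()
worker_handle_get("foo")
key, value = results.get(timeout=2)
```

The design choice is predictability: a slow device degrades one queue's latency instead of stalling every connection the worker owns.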

Further, by controlling when we defrag or drop pages we can be more careful with where writes to flash happen.

TLDR: for predictable performance. Extstore is also a lot simpler than it may sound; it's a handful of short functions built on a lot of design decisions instead of a lot of code building up an algorithm.

avinassh|4 years ago

I have a noob question: why did it have a limit of 8 threads earlier, and why is it now 48? Why not just use all the available threads?

dormando|4 years ago

It was an algorithmic/lock scaling limit. Originally it was single threaded, then when it was first multi-threaded it scaled up to 4 threads. Then I split up some locks and it scaled to 8 threads (depending). Then I rewrote the LRU and now reads mostly scale linearly and writes don't. If there's enough interest we'll make writes scale better.
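The "split up some locks" step is the classic lock-striping move: replace one global lock with N stripes so operations on different keys rarely contend. A toy illustration (simplified; memcached's actual locking is in C and far more involved):

```python
import threading

class StripedDict:
    """Hash table guarded by N lock stripes instead of one big lock,
    so threads touching different keys usually take different locks."""
    def __init__(self, stripes=8):
        self.locks = [threading.Lock() for _ in range(stripes)]
        self.shards = [{} for _ in range(stripes)]

    def _stripe(self, key):
        return hash(key) % len(self.locks)

    def set(self, key, value):
        i = self._stripe(key)
        with self.locks[i]:
            self.shards[i][key] = value

    def get(self, key):
        i = self._stripe(key)
        with self.locks[i]:
            return self.shards[i].get(key)
```

Striping helps until some shared structure (like a single LRU list) becomes the bottleneck, which is why the LRU rewrite, not more stripes, was what pushed read scaling further.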

Partly this is because the software is so old that the thread scalability tends to track how many CPUs people actually have.