top | item 36923888

jontonsoup|2 years ago

Has anyone seen max (p100) client latencies of 300 to 400ms but totally normal p99? We see this across almost all our redis clusters on elasticache and have no idea why. CPU usage is tiny. Slowlog shows nothing.
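To make the symptom concrete: a handful of slow outliers among thousands of fast requests can leave p99 untouched while dominating the max. A toy illustration with made-up numbers (nothing here is measured from the poster's clusters):

```python
import random

random.seed(42)
# 10,000 requests: almost all take 1-2 ms, plus five 300-400 ms outliers
samples_ms = [random.uniform(1.0, 2.0) for _ in range(9995)]
samples_ms += [310.0, 340.0, 355.0, 380.0, 400.0]

def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# p99 only looks at the 9,900th-fastest request, which is still a ~1-2 ms
# one; the five outliers live entirely in the top 0.05% and only show up
# at p100 (the max).
print(f"p99={percentile(samples_ms, 99):.1f}ms p100={max(samples_ms):.1f}ms")
```

So a clean p99 with a 300-400 ms p100 just means the slow requests are rare, not that they don't exist.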

GauntletWizard|2 years ago

I would guess your problem is scheduler based. The default(ish) Linux scheduler operates in roughly 100ms increments, and the first use of a client connection takes 3-4 round trips: the TCP connection opens and blocks, the request is sent and the client blocks on write, then the client attempts to read and blocks on read. If CPU usage is momentarily high, each of these steps yields to another process and your client isn't scheduled again for another 100ms.
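The arithmetic behind this theory, using the commenter's own numbers (a ~100ms scheduling quantum and 3-4 blocking steps), lands exactly in the reported band:

```python
# Back-of-envelope version of the parent's scheduling theory; the quantum
# and step counts are the commenter's claims, not measured values.
SCHED_QUANTUM_MS = 100                      # claimed ~100 ms scheduler increment
worst_cases = [steps * SCHED_QUANTUM_MS     # each blocked step loses one quantum
               for steps in (3, 4)]         # 3-4 round trips on a fresh client
print(worst_cases)                          # 300-400 ms worst case
```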

jontonsoup|2 years ago

Hmm. We have super low CPU utilization – something like 9%. This is also across 10+ different clusters.

nicwolff|2 years ago

Are you evicting or deleting large sets (or lists, or sorted sets)? We use a Django ORM caching library that adds each resultset's cache key to a set of keys to invalidate when that table is updated – at which point it issues `DEL <set key>`, and if that set has grown to hundreds of thousands – or millions! – of keys, the main Redis process will block completely for as long as it takes to loop through and evict them.
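A common mitigation for this (not from the thread) is to drain the set in batches and let `UNLINK` (Redis >= 4.0) reclaim whatever remains off the main thread, instead of a single blocking `DEL`. A minimal sketch assuming a redis-py-style client; the function name and batch size are illustrative:

```python
def delete_big_set(client, key, batch_size=1000):
    """Delete a huge set without blocking Redis for one long DEL.

    `client` is assumed to expose redis-py's sscan/srem/unlink methods.
    """
    cursor = 0
    while True:
        # SSCAN walks the set incrementally; each call is a short command,
        # so other clients get served between batches.
        cursor, members = client.sscan(key, cursor, count=batch_size)
        if members:
            client.srem(key, *members)
        if cursor == 0:
            break
    # UNLINK removes the key immediately but frees its memory on a
    # background thread, unlike DEL which frees it inline.
    client.unlink(key)
```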

tayo42|2 years ago

Is the memory full and evicting? Or do you have a large db with lots of keys with TTLs? Redis does a bunch of maintenance stuff (like actively expiring keys) on the same main thread, iirc – it's "in the background" but not really.
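One way to check this theory from the outside is to watch the `expired_keys` and `evicted_keys` counters that Redis reports in `INFO stats` – if they jump around the same time as the latency spikes, expiration/eviction work is a suspect. A minimal sketch of parsing that "name:value" output; the sample text below is made up:

```python
def parse_info(text):
    """Parse Redis INFO output ("name:value" lines) into a dict."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and section headers
            continue
        name, _, value = line.partition(":")
        stats[name] = value
    return stats

# Illustrative sample, not real cluster output
sample = """# Stats
total_connections_received:105
expired_keys:120331
evicted_keys:0
"""
stats = parse_info(sample)
print(stats["expired_keys"], stats["evicted_keys"])
```

Diffing these counters between two calls a few seconds apart gives the expiration/eviction rate.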

jontonsoup|2 years ago

Memory is maybe 50% full. We are totally over-provisioned. We actually just downsized and it didn’t impact anything.

We do expire keys, but we don’t think we have a thundering-herd problem with them all expiring at the same time.

secondcoming|2 years ago

Is it doing backups?

jontonsoup|2 years ago

My understanding is elasticache does not let you turn them off.