Visualizing AWS Storage with Real-Time Latency Spectrograms

[+] btown|11 years ago|reply

> Every few seconds one of the writes takes forever [~5s]. You can notice the long periods of inactivity, and after that a green dot at the right of the chart: that’s our slow call. What is likely happening is: the local cache saturates and when that happens the application has to wait until the local data is pushed to the remote volume. Boy, you sure don’t want one of your critical code paths to hit one of these slow calls.

I'm surprised that there's no asynchronous way for the FS cache to flush itself, e.g. when it reaches 50% capacity, and to rate-limit incoming requests if it gets too full. The idea that an FS cache is so dumb that it can't do anything while it's flushing its entire self is a bit scary - I'd expect that circular buffers and granular locking mechanisms could be used to great effect here. Is this kernel code? Userspace code? Is there research into this? Fundamental tradeoffs that I'm missing?
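
To make the idea above concrete, here is a toy model of a two-threshold cache: flushing starts in the background at a soft watermark, and the writer only blocks while the cache is completely full. It is a sketch under assumed numbers, not a description of what the kernel or EBS actually does; `simulate`, its parameters, and the tick-based model are all hypothetical. (For what it's worth, Linux exposes a similar pair of thresholds as the vm.dirty_background_ratio and vm.dirty_ratio sysctls.)

```python
def simulate(ticks, capacity, write_rate, flush_rate, background_mark=None):
    """Return the longest run of ticks during which the writer was blocked.

    background_mark=None models the "dumb" policy: nothing is flushed until
    the cache is full, then the writer waits until it is fully drained.
    Otherwise flushing runs in the background once `background_mark` units
    are dirty, and the writer blocks only when its write would not fit.
    All units and rates are made up for illustration.
    """
    dirty = 0
    draining = False
    longest = current = 0
    for _ in range(ticks):
        if background_mark is None:
            if dirty >= capacity:
                draining = True
            if draining:
                # Writer is stalled for the whole drain.
                dirty = max(0, dirty - flush_rate)
                if dirty == 0:
                    draining = False
                blocked = True
            else:
                dirty = min(capacity, dirty + write_rate)
                blocked = False
        else:
            if dirty >= background_mark:
                # Background flush overlaps with incoming writes.
                dirty = max(0, dirty - flush_rate)
            blocked = dirty + write_rate > capacity
            if not blocked:
                dirty += write_rate
        current = current + 1 if blocked else 0
        longest = max(longest, current)
    return longest


if __name__ == "__main__":
    print("flush only when full:", simulate(200, 100, 10, 5))
    print("background at 50%:  ", simulate(200, 100, 10, 5, background_mark=50))
```

With these made-up rates the writer still ends up throttled to the flush rate, but as many one-tick pauses instead of one long stall, which is roughly the tradeoff the question is asking about.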

[+] mrjones|11 years ago|reply

It would be interesting to see the client/benchmarking program. It almost sounds like it could be single-threaded ... which would mean the delay is an artifact of the benchmark only having one op outstanding, rather than something inherent in the storage layer.
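
A quick way to see the effect described above, assuming a made-up workload of fast ops with a single slow one in the middle. The benchmark in the post is not shown, so this does not reproduce it; `completion_times`, `longest_gap`, and the latency values are hypothetical.

```python
import heapq


def completion_times(latencies, workers):
    """Finish time of each op when ops are issued in order to `workers`
    threads, each of which keeps exactly one op outstanding."""
    free_at = [0] * workers  # min-heap: time at which each worker is idle
    heapq.heapify(free_at)
    finished = []
    for latency in latencies:
        start = heapq.heappop(free_at)  # next worker to become idle
        end = start + latency
        finished.append(end)
        heapq.heappush(free_at, end)
    return finished


def longest_gap(finish_times):
    """Longest interval during which no op completed."""
    ordered = sorted(finish_times)
    return max(b - a for a, b in zip([0] + ordered, ordered))


if __name__ == "__main__":
    # Arbitrary time units: three fast ops, one slow op, then 100 fast ops.
    workload = [1] * 3 + [50] + [1] * 100
    print("1 worker :", longest_gap(completion_times(workload, 1)))
    print("2 workers:", longest_gap(completion_times(workload, 2)))
```

With one worker nothing completes while the slow op is in flight, so the chart would show a blank stretch; with a second worker the fast ops keep landing and the gap disappears, even though the slow op itself is just as slow.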

[+] huhtenberg|11 years ago|reply

That's clever and well executed. Wrong palette though :P

Red implies problems, green implies "normality", but here this association is misplaced. Perhaps a typical "fire" palette would be better - from dark brown to red to orange to yellow and, ultimately, to white for the extremes.

[+] degio|11 years ago|reply

OP here. Unfortunately the ansi palette is pretty limited so I didn't have a lot of flexibility in the color choice. That said, this can definitely be improved. I can work on it if people find it useful.

In the meantime, it's very easy to tune the colors on your own: just modify this line https://github.com/draios/sysdig/blob/master/userspace/sysdi... in your local version of the script, using this as a reference http://misc.flogisoft.com/_media/bash/colors_format/256_colo....
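
For anyone who wants to try the "fire" palette suggested above, here is a rough sketch of how it could map onto the xterm 256-color codes. This is not the script's actual code; `FIRE`, `fire_color`, and `cell` are hypothetical names, and the specific color codes are just one possible ramp from dark red through orange and yellow to white.

```python
# xterm 256-color codes, darkest to brightest:
# dark red -> red -> orange -> yellow -> pale yellow -> white
FIRE = [52, 88, 124, 160, 196, 202, 208, 214, 220, 226, 228, 230, 231]


def fire_color(count, max_count):
    """Pick a palette entry for a cell with `count` events, or None if empty."""
    if count <= 0 or max_count <= 0:
        return None
    clamped = min(count, max_count)
    return FIRE[clamped * (len(FIRE) - 1) // max_count]


def cell(count, max_count):
    """Render one chart cell as a colored space using ANSI escape codes."""
    code = fire_color(count, max_count)
    if code is None:
        return " "
    # 48;5;N selects background color N from the 256-color table.
    return f"\x1b[48;5;{code}m \x1b[0m"


if __name__ == "__main__":
    # Print a sample row going from empty to the hottest cell.
    print("".join(cell(n, 40) for n in range(0, 41)))
```

Running it in a 256-color terminal prints a single strip, which is enough to judge whether the ramp reads better than red/green before touching the real script.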

[+] bcantrill|11 years ago|reply

Neat! This is definitely a step forward -- and thanks for the shout-out to our (that is, Sun's and Joyent's) prior work here. Tempted to also incorporate this into agghist and aggpack, the new DTrace actions I added for this kind of functionality.[1] Anyway, good stuff -- it's always good to see new visualizations of system behavior!

[1] http://dtrace.org/blogs/bmc/2013/11/10/agghist-aggzoom-and-a...

[+] andrewguenther|11 years ago|reply

It would be interesting to run these tests on different instance sizes, specifically for data on the instance store. The larger the instance, the fewer neighbors you have to worry about spending those precious IOPS.

As for SSD vs Magnetic EBS, I can't say that I'm surprised. I'd assume that EBS implements some sort of cache in between you and your actual disk on the other side of the network so that the writes can return even faster. Try doing this again with reads and I'd bet you'd get some interesting results.

Edit: Also, did you pre-warm your EBS volumes? http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-prewa...

[+] degio|11 years ago|reply

Yes, I did pre-warm the volumes before using them.

And yes, there are several interesting workloads that I didn't test, including read only and read+write. It's potential material for another blog post.

[+] robszumski|11 years ago|reply

Nice job on the graphics for the post. Thanks for taking the time to animate and annotate well.

[+] amulyasharma|11 years ago|reply

In a world of provisioned IOPS, with applications demanding faster and faster IOPS, this tool is handy for a devops person to find out how many IOPS are really being used and how the storage is performing, and to decide whether the storage needs an upgrade.

[+] outputlogic|11 years ago|reply

Calling this visualization a heatmap would be more appropriate than a spectrogram.

[+] digikata|11 years ago|reply

I really want to lop off the 'ns' and '10 sec' divisions of all the charts and expand the resolution...
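
A sketch of what trimming the range would buy, assuming the chart places each call on a log-scaled latency axis between a lower and an upper bound. `column_for` and its parameters are hypothetical and not taken from the tool; the 100-column width is an arbitrary choice.

```python
import math


def column_for(latency_ns, lo_ns, hi_ns, columns):
    """Map a latency to a column on a log-scaled axis from lo_ns to hi_ns.

    Values outside the range are clamped to the first or last column.
    """
    clamped = min(max(latency_ns, lo_ns), hi_ns)
    frac = math.log10(clamped / lo_ns) / math.log10(hi_ns / lo_ns)
    return min(columns - 1, int(frac * columns))


if __name__ == "__main__":
    sample = 3_000_000  # a 3 ms call, in nanoseconds
    # Full range, 1 ns to 10 s: ten decades share 100 columns.
    print(column_for(sample, 1, 10**10, 100))
    # Trimmed range, 1 us to 1 s: six decades share the same 100 columns.
    print(column_for(sample, 10**3, 10**9, 100))
```

Dropping the outer decades goes from 10 columns per decade to about 16, at the cost of piling any outliers into the edge columns.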

16 comments