pilif | 14 years ago:
I would love to see some longer-term measurements. I get the impression that this benchmark was a one-time measurement, but the big problem I have with cloud services is I/O performance over a longer time period.
We read so many times about very unreliable performance. Sometimes it's OK and sometimes it's really, really bad.
Without a benchmark run continuously over some period of time, this doesn't really help. For all we know, the first-placed service was just having a good day and the last-placed a very bad one.
snewman | 14 years ago:
As a hobby, I've been running long-term benchmarks on a handful of cloud services, for exactly the reason you suggest -- performance varies quite a bit over time. You can browse through some of the data at (http://amistrongeryet.com/dashboard.jsp). The UI on this site is abysmal, and it only shows 30 days of data (I've actually collected almost two years), but it still gives some flavor of how much variability there is. For instance, check out this graph of SimpleDB reads: (http://amistrongeryet.com/op_detail.jsp?op=simpledb_readInco...). I've blogged on the data from time to time; for instance, (http://amistrongeryet.blogspot.com/2010/04/three-latency-ano...).
If there's interest, I'll work on making this data more accessible, adding documentation, and providing access to the full two years. In any case it's limited by the fact that I'm only probing AWS and Google App Engine, and only one or two instances of each. What I'd really like to do is open this up to crowdsourcing -- as I discussed a while back at (http://amistrongeryet.blogspot.com/2011/07/cloudsat.html). If anyone is interested in participating in a project like this, let me know!
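The kind of long-term probing described above can be sketched in a few lines. This is a hypothetical harness, not snewman's actual code: `probe` times an arbitrary operation and records timestamped latencies, which a cron job could append to durable storage for charting over months.

```python
import statistics
import time

def probe(op, samples=5):
    """Time one operation several times; return (timestamp, latency_ms) pairs.

    `op` is any zero-argument callable standing in for a real cloud call
    (e.g. a SimpleDB read). The wall-clock timestamps are what let you
    chart drift and variability over weeks or months.
    """
    results = []
    for _ in range(samples):
        start = time.perf_counter()
        op()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        results.append((time.time(), elapsed_ms))
    return results

def summarize(results):
    latencies = sorted(ms for _, ms in results)
    return {
        "median_ms": statistics.median(latencies),
        "max_ms": latencies[-1],
    }

# Stand-in for a real cloud operation; a long-term harness would run
# this every few minutes and persist `samples` rather than printing.
samples = probe(lambda: time.sleep(0.001))
print(summarize(samples))
```

Medians and maxima diverging over time is exactly the "good day / bad day" effect the parent comment is worried about.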
kristianp | 14 years ago:
I was surprised that the Storm on Demand SSD instances didn't seem to do better (really, only ~2x as fast as spinning disks?). It turns out this is some weird bundle of benchmarks.
petedoyle | 14 years ago:
I spun up a storm-ssd-3gb instance and ran bonnie++ on it. Results here: https://gist.github.com/2069845
If I'm reading it right, that's ~868MB/s sequential write (CPU bound) and ~594MB/s sequential read (CPU bound). I'm not sure how to read the random I/O results. The bigger instances would probably be faster still (more CPU).
So the Storm on Demand SSD instances seem to be blazingly fast.
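A number like "~868MB/s sequential write" can be sanity-checked without bonnie++'s quirks. The sketch below (illustrative only; the sizes and block granularity are assumptions, not what bonnie++ uses) times a plain sequential write with an fsync at the end, so the figure isn't inflated by the page cache.

```python
import os
import tempfile
import time

def seq_write_mb_s(path, total_mb=16, block_kb=1024):
    """Sequential write throughput in MB/s for one explicit workload.

    Deliberately explicit about block size, total size, and fsync,
    since those choices (not just the disk) determine the number.
    """
    block = b"\0" * (block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # ensure the data actually hit the device
    return total_mb / (time.perf_counter() - start)

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    target = tmp.name
print(f"{seq_write_mb_s(target):.1f} MB/s sequential write")
os.remove(target)
```

On a fast SSD a single-threaded writer like this can indeed become CPU bound, which matches the "more CPU, faster still" observation above.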
The service seems really interesting, but the charts/graphs could use a little Tufte love.
dfc | 14 years ago:
Why are the bars in the chart in a different order than the rows in the table? Maybe make the different Amazon/xyz-provider entries different shades of the same color?
Interesting from an I/O point of view. It would be nice to have a clearer tie-in with CPU and other benchmarks for these providers. Does anyone have experience with Storm on Demand to share? Their prices for SSD-based servers look enticing.
jread | 14 years ago:
We co-locate with their parent company Liquid Web and have conducted extensive testing of their cloud services. They are a very solid service: knowledgeable technical staff, very quick support response, great hardware, excellent reliability, and reasonable prices. We've been monitoring Storm for 2 years with 100% availability in 2 of their 3 regions: http://cloudharmony.com/status
Here is a CPU performance comparison for the same set of servers. This metric is an approximation of EC2's ECU for other providers (the algorithm uses 16 CPU benchmarks and EC2 instance performance as a baseline): http://cloudharmony.com/benchmarks?benchmarkId=ccu&selec...
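The general shape of an ECU-style composite metric can be sketched as a geometric mean of per-benchmark ratios against a reference instance. To be clear, this is a hypothetical reconstruction: CloudHarmony's actual 16-benchmark suite and weighting are not specified here, and the benchmark names below are made up.

```python
import math

def ccu_score(measured, baseline, baseline_ecu=8.0):
    """Composite CPU score: geometric mean of per-benchmark ratios
    against a reference instance, scaled to that instance's ECU rating.

    `measured` and `baseline` map benchmark name -> score, with higher
    meaning faster. The geometric mean keeps one outlier benchmark from
    dominating the composite the way an arithmetic mean would.
    """
    ratios = [measured[name] / baseline[name] for name in baseline]
    geo_mean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    return baseline_ecu * geo_mean

# Hypothetical scores for two benchmarks on a target server vs. an
# EC2 reference instance rated at 8 ECU.
baseline = {"bench_a": 100.0, "bench_b": 200.0}
measured = {"bench_a": 150.0, "bench_b": 300.0}
print(ccu_score(measured, baseline))  # ~1.5x the reference, so ~12
```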
If you are running additional queries and need a web service token (the site currently only allows you to generate 5 reports/day for free), ping me and I'll send you one for free.
CPlatypus | 14 years ago:
I really wish they'd give more information about their methodology. For example, I've used the SoD SSD servers for some of my own testing, and they pull 30K IOPS for small random synchronous writes. How does that translate to "361.62"? WTF are the units here? What workload were they even testing? Yes, I know they list the random grab bag of tools they're using, but most of those can generate many different kinds of I/O, and they don't say what arguments they used. "361.62" seems very precise. I'm sure the two digits after the decimal point really impress the pithed snails or MBAs who are the benchmarkers' apparent target market. However, given both the bogosity of combining disparate measurements like this and the well-known variability over time of cloud performance, that precision is not justified. Numbers that are more precise than they are accurate or meaningful are just decoration.
P.S. I expect someone will ask for more specifics, so here are a few. First, Bonnie++ sucks. Many of the numbers it produces measure the memory system or the libc implementation more than the actual I/O system. I've seriously gotten more testing value from building it than running it, so its very presence taints the result. Second, fio/hdparm/iozone might be redundant, depending on which arguments are used. Or the results might be non-comparable. Either way, the aggregate result could only apply to an application with exactly that (unspecified) balance of read vs. write, sequential vs. random, sync vs. async, file counts and sizes, etc. Did they even run tests over enough data to get past caching effects? That's particularly important since they used different memory sizes on different providers. Similarly, what thread counts did they use on these different numbers of CPUs/cores? Same across all, or best-performing for each component benchmark? With such sloppy methodology, anything less than an order-of-magnitude difference doesn't even tell you which platforms to test yourself.
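The "small random synchronous writes" figure quoted above only means something once every parameter is pinned down. Here is a minimal sketch of such a measurement on POSIX systems (the block size, file size, and op count are illustrative assumptions, and file size must exceed RAM for the number to reflect the device rather than the cache):

```python
import os
import random
import tempfile
import time

def sync_write_iops(path, file_mb=8, block=4096, ops=200):
    """Small random synchronous writes, with every parameter stated.

    O_SYNC forces each write to reach stable storage before returning;
    without specifying block size, sync behavior, and file size relative
    to RAM, an 'IOPS' figure is meaningless.
    """
    size = file_mb * 1024 * 1024
    with open(path, "wb") as f:
        f.truncate(size)  # sparse file of the target size
    fd = os.open(path, os.O_WRONLY | os.O_SYNC)
    buf = os.urandom(block)
    try:
        start = time.perf_counter()
        for _ in range(ops):
            offset = random.randrange(size // block) * block
            os.lseek(fd, offset, os.SEEK_SET)
            os.write(fd, buf)
        return ops / (time.perf_counter() - start)
    finally:
        os.close(fd)

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    target = tmp.name
print(f"{sync_write_iops(target):.0f} IOPS (4KiB random O_SYNC writes)")
os.remove(target)
```

Change any one of those parameters (async instead of O_SYNC, a file that fits in RAM, a single thread instead of many) and the result can move by an order of magnitude, which is the core of the methodology complaint.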
Cloven | 14 years ago:
So this compared Amazon's cc.4xl server at $2.10 an hour against, e.g., Rackspace's 16GB server at $0.96 an hour, and used Amazon's local storage in RAID 0 vs. Rackspace's SAN storage? I'm pretty sure you couldn't have compared apples to oranges any better.