James has put out a lot of great material. He's given a series of talks and written a conference paper called "On Designing and Deploying Internet-Scale Services," which is a trove of valuable design insights.
He writes some of the more practical and useful advice I've seen about how to run a successful business based on high-scale distributed systems.
One of his mnemonics that's stuck with me is the "Four Rs" of recovery-oriented computing: Restart, Reboot, Reimage, Replace (slide 6 in the PDF). A human shouldn't be engaged to troubleshoot a problem until the platform has first tried, in succession, to restart the software, reboot the OS, re-image the machine, and finally replace the hardware entirely. Only after these automated recovery steps have failed is it time to engage a human. He ties these techniques directly to significantly lower operations costs.
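The escalation order described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper or slides: it assumes each recovery action can report whether the node is healthy afterwards, and the `make_simulated_steps` helper and the host name `web-01` are hypothetical stand-ins for real restart, reboot, reimage, and replace tooling.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A recovery step is (name, action); the action returns True if the node
# is healthy again after the step runs.
RecoveryStep = Tuple[str, Callable[[str], bool]]


def recover(
    node: str,
    steps: Sequence[RecoveryStep],
    page_human: Callable[[str, List[str]], None],
) -> Optional[str]:
    """Try each automated step in order and stop at the first one that leaves
    the node healthy. Page a human only if every step has failed.

    Returns the name of the step that worked, or None if a human was paged.
    """
    attempted: List[str] = []
    for name, action in steps:
        attempted.append(name)
        if action(node):
            return name
    page_human(node, attempted)
    return None


def make_simulated_steps(fixed_by: str) -> List[RecoveryStep]:
    """Hypothetical stand-ins for real tooling: each simulated action reports
    success only if it is the step named by fixed_by."""
    names = ["restart", "reboot", "reimage", "replace"]
    # Bind each name as a default argument so every lambda keeps its own value.
    return [(n, lambda node, n=n: n == fixed_by) for n in names]


if __name__ == "__main__":
    paged: List[Tuple[str, List[str]]] = []
    result = recover("web-01", make_simulated_steps("reimage"),
                     lambda node, tried: paged.append((node, tried)))
    print(result, paged)  # reimage []
```

Keeping the ladder as an ordered list makes the cost ordering explicit: the cheapest, fastest actions run first, and paging a human is the fallback rather than the first response.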
This is why Amazon beat Google in the "as a Service" space. Google provided a framework with AppEngine, but customers wanted the flexibility of the primitives offered by AWS.
I think that Amazon being first in 2005/2006 helped a lot, too. Google now provides "primitives" through GCE. I think Google realizes that the future is GCE, not GAE.
Offer IaaS first, then PaaS. Google had it backward.
While AWS does have a small amount of vendor lock-in too, in the end it's still reasonable: it's doable, for example, to move an AWS architecture to OpenStack without a major code rewrite. If you built your web app on GAE and GWT, however, your investment to move it out of Google will be much bigger.
You have to give AWS full credit for inventing previously unknown pricing models. In some cases, it was actually the pricing model that created new use cases. Glacier and S3 come to mind.
That's a very important observation. When I describe AWS to audiences I always make clear that it is a combination of a technology and a business model.
So, with no SSH, how do you debug that one-off problem that only shows up on machine abc12? I'm not talking about mutating anything; I'm talking about attaching gdb to the process while it's handling requests, or collecting CPU profiling information from production. Stuff like that.
yarapavan | 10 years ago
jcrites | 10 years ago
https://www.usenix.org/legacy/event/lisa07/tech/full_papers/... (somewhat condensed)
Slides from a talk: http://mvdirona.com/jrh/talksAndPapers/JamesRH_AmazonDev.pdf
jimbokun | 10 years ago
nopzor | 10 years ago
eranation | 10 years ago
rodionos | 10 years ago
jeffbarr | 10 years ago
axelfontaine | 10 years ago
Amen. We couldn't agree more: https://boxfuse.com/blog/no-ssh
bluecmd | 10 years ago
LinuxBender | 10 years ago
Instead of the SSH vs. no-SSH debate, how about:
* Use SSH AllowGroups to limit logins to the development manager/lead or on-call engineer
* with a policy that, for every call beyond N per week, you buy their team a round of drinks of their choosing?
Deal?
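To make the first bullet concrete: per sshd_config(5), `AllowGroups` takes space-separated group-name patterns, and a login is accepted only if the user's primary group or one of their supplementary groups matches one of them. The sketch below models that check in Python, assuming hypothetical group names `oncall` and `devlead` (the real config line would be `AllowGroups oncall devlead`). It handles only the `*` and `?` wildcards and ignores negated (`!`) patterns, so treat it as an illustration of the rule rather than a reimplementation of sshd.

```python
import re
from typing import Iterable

# Hypothetical sshd_config line restricting logins to two groups.
SSHD_CONFIG_LINE = "AllowGroups oncall devlead"


def pattern_to_regex(pattern: str) -> str:
    """Translate an sshd-style pattern ('*' and '?' wildcards only) to a regex."""
    out = []
    for ch in pattern:
        if ch == "*":
            out.append(".*")
        elif ch == "?":
            out.append(".")
        else:
            out.append(re.escape(ch))
    return "".join(out)


def login_allowed(user_groups: Iterable[str], config_line: str = SSHD_CONFIG_LINE) -> bool:
    """Return True if any of the user's groups matches any AllowGroups pattern.

    Simplified model: negated ('!') patterns and other Allow/Deny directives
    are not handled.
    """
    keyword, *patterns = config_line.split()
    # sshd_config keywords are case-insensitive.
    if keyword.lower() != "allowgroups":
        raise ValueError(f"expected an AllowGroups line, got {keyword!r}")
    groups = list(user_groups)
    return any(re.fullmatch(pattern_to_regex(p), g) for p in patterns for g in groups)


if __name__ == "__main__":
    print(login_allowed(["staff", "oncall"]))      # True
    print(login_allowed(["staff", "developers"]))  # False
```

Restricting by group keeps SSH available for the rare case that needs it while limiting interactive access to the small set of people the comment has in mind.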