teyc|15 years ago
There was a major cascading failure in the power grid a few years back.
I thought there was a case of an Amazon outage attributed to the same class of error.
The engineering trade-offs that are required are:
1) to protect the servers themselves from being damaged
2) to accept that when servers go offline to protect themselves, they may cause other servers to go offline
3) to isolate the failure to specific subgroups in a network
4) to provide enough excess capacity to take the load in the event of an outage
Bugs will occur, no matter how good the engineering is. Clients will need to be smarter - for example, implementing some kind of exponential backoff depending on whether the network is responsive or not.
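The backoff idea above can be sketched as follows (a minimal illustration, not any real client's logic; the function names and constants are invented for the example):

```python
import random

def backoff_ceiling(attempt, base=1.0, cap=64.0):
    """Upper bound on the retry delay in seconds: doubles with each
    failed attempt, capped so delays don't grow without limit."""
    return min(cap, base * (2 ** attempt))

def next_delay(attempt):
    """Full jitter: pick a random delay below the ceiling, so that
    thousands of clients don't retry in lockstep and re-crush the
    servers the moment they come back up."""
    return random.uniform(0.0, backoff_ceiling(attempt))
```

A client would sleep for next_delay(attempt) after each failed request, and reset attempt to zero once the network responds again.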
lukev|15 years ago
Very interesting. As with all outages of major services, it seems it started with a confluence of independently minor, unforeseen events.
One question they didn't address, though, is whether they're going to address the core problem - that a positive feedback loop of overloading supernodes is possible. It seems to me that a p2p system should be able to recover from having 20% of its nodes taken offline, rather than spiraling into a full collapse.
Avoiding the scenario where 20% of supernodes go offline to begin with is of course desirable, but since any number of things could cause that, it seems like a genuinely resilient system should remain functional (if in a degraded capacity) even when only a small fraction of nodes remains available.
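The feedback loop described above can be illustrated with a toy model (the numbers and the uniform-load assumption are invented for the sketch): when the load of failed supernodes is redistributed onto identical survivors, either the survivors have headroom and absorb it, or they too are pushed past capacity and the whole network goes down.

```python
def surviving_nodes(n=100, capacity=1.0, load=0.85, knocked_out=20, shed=False):
    """Toy cascade model: every node is identical, and clients of dead
    nodes immediately reconnect to the survivors."""
    alive = n - knocked_out
    demand = load * n                      # total client load is unchanged
    while alive:
        if demand / alive <= capacity:
            break                          # survivors can absorb the load
        if shed:
            demand = capacity * alive      # refuse the excess instead of dying
            break
        alive = 0                          # identical overloaded nodes all crash
    return alive
```

With 15% headroom a 20% outage cascades to zero survivors, while either shedding the excess load (shed=True) or provisioning more headroom (e.g. load=0.7) keeps all 80 remaining nodes up.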
toumhi|15 years ago
That's desired behavior, although much more difficult in practice than in theory. If 20% of your network goes down and you can still serve clients normally, it means you have a big reserve of machines useful only in case of big outages. I don't know if you can justify that economically.
You can also gracefully degrade performance - by rejecting client connections, progressively disconnecting some clients, accepting loss of consistency, etc. It depends on how far you can go without infuriating your customers.
We discovered that large-scale real-time systems (in our case, currently 400,000 concurrent connections) are really hard to stabilize against presence storms, network problems and buggy clients, among other things.
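Progressively rejecting clients can be as simple as a load-dependent admission ramp (a sketch; the thresholds and the linear ramp are arbitrary choices for illustration):

```python
def admission_probability(current_conns, soft_limit=200_000, hard_limit=400_000):
    """Accept every new connection below the soft limit, none at the
    hard limit, and a linearly shrinking fraction in between, so the
    server sheds load gradually instead of falling over."""
    if current_conns <= soft_limit:
        return 1.0
    if current_conns >= hard_limit:
        return 0.0
    return (hard_limit - current_conns) / (hard_limit - soft_limit)
```

Each incoming connection is then accepted with this probability, which turns a hard overload cliff into a gradual slope.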
djipko|15 years ago
Valid points, although I do not have enough experience with p2p networks (and almost no knowledge of Skype's particular architecture) to judge - but 20-30% of supernodes does seem substantial.
What I would like to see, and couldn't find anywhere, is more detail on the objective impact it had on users (the number of users with degraded or no service, etc.). I think that would give a more complete picture.
comex|15 years ago
I guess companies can have several faces, but it still strikes me as bizarre.
teoruiz|15 years ago
I assume they can't turn regular Skype nodes into supernodes because they must be reachable from a public IP address.
Were they using a cloud computing provider such as EC2?
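The reachability constraint could be sketched as a promotion check (hypothetical - nothing here reflects Skype's real protocol; the idea is that a node behind NAT sees its own private address differ from the address its peers observe):

```python
import ipaddress

def may_promote_to_supernode(local_addr, peer_observed_addr):
    """A node is a supernode candidate only if it binds a public IP
    and is not behind NAT (its peers observe the same address)."""
    addr = ipaddress.ip_address(local_addr)
    return not addr.is_private and local_addr == peer_observed_addr
```

Ordinary clients behind home routers fail both tests, which is why the pool of promotable nodes is limited.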
JoachimSchipper|15 years ago
unknown|15 years ago
[deleted]