Early yesterday afternoon, one of our developers poked his head into my office. “Hey, have you been following the Twitter traffic on Skype? Apparently it’s down. Are you still connected?” Sure enough, the Skype on my desktop was down. In fact, it had crashed, and the client needed to be restarted. And yesterday evening I discovered that the same had happened on my home computer.
Much has been written about the scale and magnitude of the outage – how big it was, how it happened, the technical underpinnings of Skype’s super-node model, whether businesses should rely on Skype, and so on. Om Malik called Skype “one of the key applications of the modern web”, and pointed out the incomprehensible productivity loss associated with the outage.
At its low point, fewer than 100,000 computers were connected to the Skype network. At 6 PM EST yesterday, reports were coming in that low hundreds of thousands of people were attached, and by midnight I was seeing 1.7 million. This morning at 7 AM, 5 million, and as I write this, 6.8 million.
Skype is coming back online. Slowly, but it is coming back to life. It’s no small task, either. If 25 million Skype users need to reboot their PCs, and it takes 5 minutes per reboot, then the aggregate time to get the Skype network back online would be 125 million minutes, or 237.8 years of rebooting. Naturally, most of that activity is going to take place in parallel – that is, at the same time as other Skype users are rebooting their computers. But what will the elapsed time actually be?
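For anyone who wants to check the back-of-the-envelope math, here is a minimal Python sketch. It assumes the same round numbers used above (25 million users, 5 minutes per reboot) and a 365-day year; the function name is mine, purely for illustration.

```python
# Rough aggregate reboot time, using the round numbers from the post.
# Assumptions: 25 million users, 5 minutes per reboot, 365-day year.
MINUTES_PER_YEAR = 60 * 24 * 365  # 525,600


def aggregate_reboot_minutes(users: int, minutes_per_reboot: int) -> int:
    """Total minutes spent rebooting if every user reboots once."""
    return users * minutes_per_reboot


total_minutes = aggregate_reboot_minutes(25_000_000, 5)
total_years = total_minutes / MINUTES_PER_YEAR

print(f"{total_minutes:,} minutes, about {total_years:.1f} years")
# prints: 125,000,000 minutes, about 237.8 years
```

This is the serial total only. It says nothing about elapsed time, which depends on how many people restart at once.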
It seems as if people are waking up, discovering that their Skype clients have crashed, and then restarting them. In turn, super-nodes are coming back online, and capacity on the network is increasing. Even so, we may not see Skype’s full recovery for another couple of weeks, as many people have already left for their holidays.
If Skype were a true telephone company, it likely could have been back online much more quickly. The concentrated and centralized architecture of a telco lends itself much more easily to a restart, and that raises the question “How does a peer-to-peer network plan for a catastrophic failure?”
So far, Skype’s answer seems to be to bring online a cluster of “mega-super-nodes” – big beefy computers that can presumably seed the core super-node network, rather than relying on third parties. By maintaining these nodes directly, Skype can presumably start a cascade reboot if necessary. If, for example, Skype maintained 100 massive servers that could each act as a super-node for 10,000 Skype users, they could bring 1,000,000 users back online within a matter of minutes, instead of the nearly 12 hours it took yesterday.
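The seeding arithmetic is just as easy to play with. The sketch below is hypothetical: the server count and per-server capacity are the illustrative figures from the paragraph above, not anything Skype has published, and the function names are my own.

```python
import math

# Hypothetical seed-capacity math for dedicated "mega-super-nodes".
# The figures are illustrative assumptions, not published Skype numbers.


def seeded_users(servers: int, users_per_server: int) -> int:
    """Users that a pool of dedicated super-nodes could carry at once."""
    return servers * users_per_server


def servers_needed(target_users: int, users_per_server: int) -> int:
    """Dedicated servers required to carry a target number of users."""
    return math.ceil(target_users / users_per_server)


print(f"{seeded_users(100, 10_000):,} users seeded by 100 servers")
print(f"{servers_needed(25_000_000, 10_000):,} servers to carry 25 million users")
```

On those assumptions, 100 servers cover the first million users, while carrying all 25 million directly would take 2,500 of them. That is presumably why the dedicated nodes would only seed the network and let ordinary super-nodes take over from there.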
The businesses that Skype is courting as part of its push to increase revenues are going to want answers. It’s simply impossible to rely on a voice service that might take days to come back to life.
Over to you, Skype.