outage

Early yesterday afternoon, one of our developers poked his head into my office. “Hey, have you been following the Twitter traffic on Skype? Apparently it’s down. Are you still connected?” Sure enough, the Skype on my desktop was down. In fact, it had crashed, and the client needed to be restarted. And yesterday evening I discovered that the same had happened on my home computer.

Much has been written about the scale and magnitude of the outage – how big it was, how it happened, the technical underpinnings of Skype’s super-node model, whether businesses should rely on Skype, and so on.  Om Malik called Skype “one of the key applications of the modern web”, and pointed out the incomprehensible productivity loss associated with the outage.

At its low, fewer than 100,000 computers were connected to Skype. At 6 PM EST yesterday, reports were coming in that low hundreds of thousands of people were connected, and by midnight I was seeing 1.7 million. This morning at 7 AM, 5 million, and as I write this, 6.8 million.

Skype is coming back online. Slowly, but it is coming back to life. It’s no small task, either. If 25 million Skype users need to reboot their PCs, and it takes 5 minutes per reboot, then the aggregate time to get the Skype network back online would be 125 million minutes, or 237.8 years of rebooting. Naturally, most of that activity is going to take place in parallel, that is, at the same time as other Skype users are rebooting their computers. But what will the elapsed time actually be?
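For anyone who wants to check the back-of-the-envelope math, here is a minimal Python sketch. The 25 million users and 5 minutes per reboot are the assumed figures from the paragraph above, not measurements, and the function names are purely illustrative. Multiplying them gives 125 million minutes of aggregate effort, which is roughly 237.8 years.

```python
# Back-of-the-envelope check of the aggregate reboot time.
# Both inputs are assumptions taken from the text, not measured values.
USERS = 25_000_000          # assumed number of Skype users who must restart
MINUTES_PER_REBOOT = 5      # assumed time for one restart
MINUTES_PER_YEAR = 60 * 24 * 365


def aggregate_reboot_minutes(users: int, minutes_per_reboot: int) -> int:
    """Total minutes spent rebooting if every restart happened one after another."""
    return users * minutes_per_reboot


def minutes_to_years(minutes: float) -> float:
    """Convert minutes to 365-day years."""
    return minutes / MINUTES_PER_YEAR


total_minutes = aggregate_reboot_minutes(USERS, MINUTES_PER_REBOOT)
total_years = minutes_to_years(total_minutes)
print(f"{total_minutes:,} minutes is about {total_years:.1f} years")
```

This only measures the serial total. It says nothing about the elapsed time, which depends on how many people restart at once and how quickly super-nodes return.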

It seems as if people are waking up, discovering that their Skype clients have crashed, and then restarting them.  In turn, super-nodes are coming back online, and capacity on the network is increasing.   Even so, we may not see Skype’s full recovery for another couple of weeks, as many people have already left for their holidays.

If Skype were a true telephone company, they likely could have been back online much more quickly. The concentrated and centralized architecture of a telco lends itself much more easily to a restart, and that raises the question “How does a peer-to-peer network plan for a catastrophic failure?”

So far, Skype’s answer seems to be to bring online a cluster of “mega-super-nodes”: big, beefy computers that can presumably seed the core super-node network, rather than relying on third parties. By maintaining these nodes directly, Skype could start a cascade reboot if necessary. If, for example, Skype maintained 100 massive servers that could each act as a super-node for 10,000 Skype users, they could bring 1,000,000 users back online within a matter of minutes, instead of the nearly 12 hours it took yesterday.
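The seeding arithmetic is easy to play with. The sketch below is only an illustration: the 100 servers and 10,000 users per server come from the example above, while the 25 million target is borrowed from the earlier reboot estimate, and the helper names are hypothetical. It does not model how real Skype super-nodes behave.

```python
import math

# Illustrative capacity math for directly maintained "mega-super-nodes".
# All figures are the hypothetical ones from the text, not Skype's real numbers.


def seeded_capacity(servers: int, users_per_server: int) -> int:
    """Users that can be brought back online by the seed servers alone."""
    return servers * users_per_server


def servers_needed(target_users: int, users_per_server: int) -> int:
    """Seed servers required to cover a target number of users directly."""
    return math.ceil(target_users / users_per_server)


capacity = seeded_capacity(servers=100, users_per_server=10_000)
needed = servers_needed(target_users=25_000_000, users_per_server=10_000)
print(f"100 servers seed {capacity:,} users")
print(f"Covering 25,000,000 users directly would take {needed:,} servers")
```

In practice the seed servers would not need to carry everyone. The idea is that the first million users bring ordinary super-nodes back with them, and capacity cascades from there.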

The businesses that Skype is courting as part of its push to increase revenues are going to want answers. It’s simply impossible to rely on a voice service that might take days to come back to life.

Over to you, Skype.


Lights out at Facebook

by alec on July 31, 2007

We might be looking at the first big Facebook outage, folks. Today, at various times, the front page of Facebook has displayed:

  • Your account is temporarily unavailable (7:00 AM EDT today).
  • Facebook is upgrading.
  • Facebook is temporarily unavailable.  We're working on it…

And a few minutes ago DNS wouldn't resolve the site name altogether. 

TechCrunch picked up on it at 10 PDT, but perhaps more interesting is this assertion that Facebook may have been hacked. One thing is certain… you don't take your site offline in the middle of the day to do server upgrades. Whatever the reason (hacked, failed upgrade, dead servers), the tension must be excruciating in downtown Palo Alto at the moment.


BlackBerry outage explained

April 20, 2007

If you were one of the many people who lost BlackBerry service earlier this week, RIM has offered an explanation. Apparently an insufficiently tested storage software upgrade was the root cause. The good news, presumably, is that when they get it right, increased storage will allow us all access to more features and, of course, more email.
