giza: Giza White Mage (Default)
[personal profile] giza
[Edit: It has since been pointed out to me that I made some incorrect assumptions about the situation. As such, I have edited this post to reflect the facts of what really went on. I apologize for not getting the full story the first time around.]

The Anthrocon website is back up, with no data lost. Here's the post-mortem of what happened:

At about 7 AM today, Limelight Networks, the company that owns the data center, performed some scheduled maintenance on their power grid. Unfortunately, Murphy's law intervened: their main breaker malfunctioned and cut power to many machines. Included in the outage were Limelight's own email servers and phone system. Even more unfortunately, while Limelight did the right thing by notifying their customers about the scheduled maintenance beforehand, some of those emails (including the one to NearlyFreeSpeech) simply never reached their destination. It's unclear what exactly happened to the emails, other than that they failed to make it into the recipients' inboxes OR spam filters.

Our webhost, NearlyFreeSpeech (NFSN), noticed that all their machines stopped responding at the same time, and promptly tried to get in touch with Limelight. This was complicated somewhat by the fact that Limelight's phone service was still out. Once they did get in touch with Limelight, they were informed that a second shutdown would be necessary in about one hour to replace the main breaker. At this point, NFSN proceeded to bring things up in a limited capacity only, since the power would be going out again.

After about 1.5 hours, it turned out that the second shutdown would not be necessary after all. So NFSN proceeded to finish up their file system checks and start bringing all of the servers back up. The webservers were brought back online fairly quickly, but the real bottleneck was the MySQL servers: it took about 40 minutes to start them all.

One additional problem remained: routing. Since routers were also affected by the power outage, some of the database servers were unreachable from the webservers (three /24 subnets, for the curious). This was responsible for about half of the total downtime.
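For a sense of scale, here is a minimal sketch of what three /24 subnets amount to, using Python's standard `ipaddress` module. The subnets below are placeholder documentation ranges, not the ones that were actually unreachable, and the helper names `total_addresses` and `is_affected` are made up for illustration.

```python
import ipaddress

# Hypothetical placeholder subnets (reserved documentation ranges),
# NOT the subnets that were actually unreachable during the outage.
UNREACHABLE_SUBNETS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]


def total_addresses(subnets):
    """Count every address covered by the given subnets (256 per /24)."""
    return sum(net.num_addresses for net in subnets)


def is_affected(host, subnets):
    """Return True if the host's IP address falls inside any of the subnets."""
    addr = ipaddress.ip_address(host)
    return any(addr in net for net in subnets)


total = total_addresses(UNREACHABLE_SUBNETS)
print(total)  # 3 subnets x 256 addresses each
```

A database server whose address sits in one of those ranges would be cut off from the webservers until routing was restored, even if the server itself was up and running.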

So what did I learn today? NFSN is pretty awesome when it comes to dealing with service issues and staying in touch with their customers. They were active in their forums and gave us updates about every half hour or so. They even went so far as to look into renting a truck to move their servers to a new data center if the outage proved to be extended.
Douglas Muth
