AFF We're back!!

Status
Not open for further replies.

admin
Established Member, Administrator
Joined: Jun 18, 2002
Posts: 1,455
Qantas: Bronze
Virgin: Red
As you may be aware, we experienced a major outage at our Data Centre on Monday morning. The outage lasted for 3 days!! It goes without saying that I apologise for AFF not being available for such an extended period and thank you all for your patience and understanding.

I know that many of our members are interested in what happened. If you are, please read on...

You can find an archive of the announcements I posted HERE, as well as a summary of the situation as at 11am yesterday HERE, so I won't repeat them here.

Long story short: at about 7:30am Monday (all times in AEST) there was a municipal power outage in the town where our server is hosted. The standby generator kicked in, then caught fire, activating the automatic sprinkler system. The fire and water damaged a number of servers, which had to be individually inspected and repaired. I suspect some of the delay was also due to the limited availability of emergency staff (including the municipal fire inspectors who had to give the OK to re-power the data centre) over the Easter weekend.

I must stress that at no time was our data at risk. We perform continual rolling backups which are stored in a different location - just for this type of disaster situation.

Now I fully acknowledge that "bad things" happen, but I do question how the Data Centre manages its emergency infrastructure. Of greater concern to me was the lack of communication. (You can read the comments in the discussion forums in the links I referenced above.) This meant that I was denied the information I needed to make an informed decision on whether or not to move to a different Data Centre. The main issue with moving (other than cost and the loss of a few hours of data since the last backup) is the delay in arranging the server, then provisioning, configuring and securing it, and then restoring the backup. Even in a non-emergency situation this can take 24 to 48 hours. There is no point in doing this if the current server could be restored quickly.

Yesterday morning (after 2 days) I pulled the plug and arranged to move to a new Data Centre. By 4am this morning, the new server was set up and configured and the restore job had commenced. The restore would have taken 4 to 6 hours. Then lo and behold, our old server came back up!! So we are now live on our old server (so there is no data loss at all). I still plan to move to the new Data Centre in the coming days and have been advised that there will be very little (if any) downtime.

As you can appreciate, much of what happened was beyond my control. Other than the unaffordable (for us) option of having a parallel live server in a separate location, there wasn't much more I could have done. Hindsight is a wonderful thing: had I known on Monday morning the extent of the problem, I would have immediately commissioned a new server. But I didn't know that then! Having an off-site backup and an independent server administrator are the key components of our disaster recovery setup, and they served us well.

Apologies once again and a big thank you for the support messages I received.

Finally, I'd like to thank all those who worked so hard to get us back online with zero data loss. This is the first time we have had such a long outage - let's hope it's the last!

You can discuss the outage HERE.