Ref: Our sincere apology.

Unfortunately, today we experienced downtime on a few of our servers (10). We have been working to rectify the problem and service is now live again. My sincere apologies for this incident. I've been incredibly blessed by the support many of you have shown on this unfortunate occasion; you have been very gracious and understanding. Although the outage didn't affect all our customers, it did affect the majority, and not having to deal with lots of furious clients helps get things resolved more quickly. That doesn't stop all of us at Blue Sky Hosts, and those we work with, from being aware that this circumstance is unacceptable and cannot happen again.

This is the first time in 7 years of business that we have ever had such an issue occur, and we will do all that we can to ensure it does not happen again. In roughly 61,320 hours of live business we have had approximately 5 hours of outage, all of it from today's incident, which works out to about 0.008% downtime.
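For those who like the maths: 7 years × 365 days × 24 hours = 61,320 hours, and 5 ÷ 61,320 ≈ 0.00008, which is where the 0.008% figure comes from (roughly 99.992% uptime overall).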

Nothing to the extent of the unfortunate events at BlackBerry or Microsoft earlier this year, but nonetheless pretty poor.

It's taken years to build up our client base, from government organisations to small personal blogs, and we know first-hand what downtime can do to a business. We care that today your site may have turned away visitors with failed attempts at access, and that the outage may have left you unable to send or receive time-sensitive emails. We know very well how important uptime is when running a site, and we hope the explanation below will help you understand where the issues came from and why they shouldn't happen again.

Firstly, our data centres are in Dallas, with copies/backups in Chicago; both are in the USA. This enables us to deal with long-standing business colleagues in America whom I know personally and meet face to face, ensuring high-quality, top-of-the-range equipment that is secure and backed up, rather than outsourcing to a foreign country with staff I've never met. I work with an all-English-speaking team. Within the staff structure there are higher-level server support technicians who handle server-side issues of an advanced nature. As soon as this issue arose, Troy was working on the case from Nashville, TN.

During some routine overnight router configuration updates, stemming from recent security tightening (such as Romanian IP blocking; see footnote) which needs daily updating due to rapid changes, a bad router configuration was committed. Unfortunately, our data centre had nobody on hand to reset the configuration (a situation we will have to get resolved), and we had to await a senior technician's 2:00 p.m. GMT arrival. Like most hosting companies, we run remote access and pay other staff to be physically present.
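For the technically curious, the change involved was roughly of the following shape. This is an illustration only: we are assuming Junos-style syntax (the same family whose "rescue configuration" feature comes up below), and the filter name, interface, and address prefix shown are placeholders rather than our real values. The command of interest is "commit confirmed", which rolls the change back automatically if the operator loses access and cannot confirm it:

    [edit]
    tech@edge1# set firewall family inet filter ABUSE-BLOCK term deny-list from source-address 192.0.2.0/24
    tech@edge1# set firewall family inet filter ABUSE-BLOCK term deny-list then discard
    tech@edge1# set interfaces ge-0/0/0 unit 0 family inet filter input ABUSE-BLOCK
    tech@edge1# commit confirmed 10
    commit confirmed will be automatically rolled back in 10 minutes unless confirmed
    commit complete
    tech@edge1# commit

With a time-bounded commit like this, a configuration that cuts off access reverts on its own within minutes instead of requiring hands at the rack.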

Two problems exacerbated this outage, and we will address them both to prevent another outage of such length. 

1) The data centre where we colocate our equipment had nobody on hand who could perform complex router configuration overnight, and our staff, without remote access (because the router was out), couldn't do it. (US staff were up from the very moment this server went offline and began working to resolve it, as is their job.)

2) We did not have a rescue configuration set on the router to make recovery simpler.

Problem 2 was a really silly oversight on our part, and it has been resolved. If a bad router configuration crops up in the future, a data centre technician can simply hit the reset switch on the router and a working configuration will be restored within minutes, rather than waiting for a senior technician as we had to this time.
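To show how small that fix is (again in illustrative Junos-style syntax, with a placeholder device name): saving a rescue configuration is a single operational-mode command, and restoring it later takes two more in configuration mode:

    tech@edge1> request system configuration rescue save

    [edit]
    tech@edge1# rollback rescue
    load complete
    tech@edge1# commit

On some devices the physical CONFIG button performs the same rescue rollback, which is exactly the "hit the reset switch" recovery described above.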

The first problem will take more work. We colocate our equipment in Dallas because bandwidth there is so much less expensive than in Nashville, where our American team, headed up by Troy, is based. However, over the past several months we've been negotiating with another data centre to obtain more competitive pricing. While we won't be able to match Dallas pricing (it looks like it will end up being double the cost instead of the typical five times the cost), we will investigate moving our equipment to Nashville, or even re-sourcing to London so that we have direct access at all times; this may, however, result in increased pricing. The alternative is to find out whether our current data centre can put more senior techs on call 24/7 rather than just standard staff. Rest assured, the server is monitored 24/7 and this should not happen again; if it does, we will have procedures in place to make sure it is resolved quickly.



Tuesday, December 11, 2012




