During routine overnight router configuration updates, a bad router configuration was committed. These updates stem from recent security tightening (such as *Romanian IP blocking - see footnote) and need to be done daily because the rules change rapidly. Unfortunately, our data center had nobody on hand to reset the configuration (a situation we will have to get resolved), and we had to await a senior technician's 2:00 p.m. GMT arrival. Like most hosting companies, we manage our equipment by remote access and pay other staff to be physically present.
Two problems exacerbated this outage, and we will address them both to prevent another outage of such length.
1) The data center where we colocate our equipment had nobody on hand who could perform complex router configuration overnight, and our own staff could not do it either, because the router outage had cut off their remote access. (US staff were up from the very moment this server went offline and began working to resolve it, as is their job.)
2) We did not have a rescue configuration set on the router to make recovery simpler.
Problem 2 was a really silly oversight on our part, and has been resolved. If a router configuration problem crops up in the future, a data center technician can simply hit the reset switch on the router and a working configuration will be restored within minutes, unlike this time, when we had to wait for a senior technician.
The first problem will take more work. We colocate our equipment in Dallas because bandwidth is so much less expensive than in Nashville, where our American team, headed up by Troy, is based. However, over the past several months we've been negotiating with another data center to obtain more competitive pricing. We won't be able to match Dallas pricing (it looks like it will end up being double the cost instead of the typical five times the cost for service), but we will investigate moving our equipment to Nashville, or even re-sourcing to London, so that we'll have direct access all the time. This may, however, result in increased pricing. The alternative is to find out whether our current data center can put more senior techs on call 24-7 rather than just standard staff in the data center. Rest assured the server is monitored 24-7 and this should not happen again. If it does, we will have procedures in place to make sure it is resolved quickly.
Tuesday, December 11, 2012