July 26, 2017

Disaster Diaries 1 - Servers and water don't mix

Disaster Diaries are true stories taken from our support logs. They show how we work with IT support companies to improve their profitability by helping them rescue their clients when disaster strikes.

It’s Saturday and alerts start coming through about servers being offline…

“It’s got to be electricity,” is the first thought that goes through the mind of Acme Financial Services CIO Nabeel Hamdi. At least, that’s what he’s hoping.

Acme holds the lease on a full floor of a multi-tenanted office building in Mayfair, London. At 14:02 one Saturday afternoon, SMS alerts reporting server failures started arriving one after another. The same alerts were picked up by Acme’s IT support company, Askforhelp-IT. Acme’s MD and its CIO, Nabeel, agreed to meet at the office in 30 minutes to find out what was going on.

14:35

As they stepped out of the lift (which was running, so the building clearly had power), the two men put their ears to the door of Acme’s offices. The distinctive sound of running water could be heard somewhere beyond it. Nabeel got his key in the lock and flung the door open onto an office that was both chilled and wet, the carpet squelching beneath their feet as they dashed through to the server room, where the sound of water was coming from.

Water was cascading through the drop ceiling right onto the main server cabinet. The room, normally filled with the hum of dozens of server fans, was ominously quiet except for the sound of running water.

The clock starts ticking – it’s 39 hours until the trading day starts on Monday

No servers? Then no email, no customer data and no data feeds. Without these critical systems the business would be unable to trade on behalf of its customers, risking lost income and reputational damage, not to mention a breach of FCA regulations.

Two possible outcomes

DR using tape backup:

The servers were beyond repair. Replacement servers were ordered immediately from a hardware leasing company, then received, configured and deployed while the backup tapes were ordered from the off-site tape library outside the city.

It’s now noon on Monday.

The leased servers are now deployed in a temporary rack and industrial dehumidifiers are running constantly to draw water out of the walls and carpets, but the tape recoveries have only just begun, starting with the large email server and the domain controller. If all goes to plan, these two critical machines will be recovered by 23:00 on Monday night. The DR team returns at 06:00 on Tuesday to move on to the document management system and the first of two database servers.

Because Acme missed Monday’s trading, every customer and business contact had to be called, and the Financial Conduct Authority had to be informed of a breach of its guidelines on continuity of service.

Oh and just to make things interesting…

Friday’s backup tapes were still in the machines when the flood started on Saturday, so Thursday’s backup was the only viable source for the disaster recovery.

Aggravation index: This headache goes to 11.

DR using DATAFORT Critical Care:

In fact, DATAFORT is a trusted partner of Askforhelp-IT, and Acme is a DATAFORT Critical Care client. The service captures block-level changes from each production server every 15 minutes throughout the trading day and adds them to server images held on a local backup appliance, which are then mirrored to the data centre. These images are continually recompiled, so an up-to-date image is always available for recovery.
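For readers who want a feel for what "block-level changes" means in practice, here is a minimal Python sketch. It is not DATAFORT's implementation: it assumes a fixed-size volume split into 4 KiB blocks and uses SHA-256 digests to spot which blocks changed since the last snapshot, so only those blocks need to be copied to the replica. The function names `block_hashes`, `changed_blocks` and `apply_changes` are hypothetical, chosen for illustration only.

```python
import hashlib

# Assumption: a 4 KiB block size; real backup products pick their own.
BLOCK_SIZE = 4096


def block_hashes(volume: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Return a SHA-256 digest for each fixed-size block of a volume image."""
    return [
        hashlib.sha256(volume[i:i + block_size]).hexdigest()
        for i in range(0, len(volume), block_size)
    ]


def changed_blocks(volume: bytes, previous: list[str],
                   block_size: int = BLOCK_SIZE) -> dict[int, bytes]:
    """Collect only the blocks whose digest differs from the previous snapshot."""
    changes = {}
    for index, digest in enumerate(block_hashes(volume, block_size)):
        if index >= len(previous) or previous[index] != digest:
            start = index * block_size
            changes[index] = volume[start:start + block_size]
    return changes


def apply_changes(image: bytearray, changes: dict[int, bytes],
                  block_size: int = BLOCK_SIZE) -> bytearray:
    """Write changed blocks into a replica image of the same (fixed) size."""
    for index, block in changes.items():
        start = index * block_size
        image[start:start + len(block)] = block
    return image


if __name__ == "__main__":
    volume = bytearray(BLOCK_SIZE * 8)      # hypothetical 8-block server volume
    replica = bytearray(volume)             # e.g. the copy on the backup appliance
    baseline = block_hashes(bytes(volume))  # digests from the previous cycle

    volume[3 * BLOCK_SIZE:3 * BLOCK_SIZE + 5] = b"HELLO"  # a write lands in block 3

    delta = changed_blocks(bytes(volume), baseline)
    print(sorted(delta))                    # only block 3 needs to be shipped
    apply_changes(replica, delta)
    print(replica == volume)                # replica now matches the live volume
```

Shipping only the changed blocks is what keeps a 15-minute cycle practical. Rehashing the whole volume, as this sketch does, is the simplest way to show the idea; production tools often track writes as they happen (for example through a volume filter driver) instead.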

The backup appliance was also destroyed in the flood, so the decision was made to invoke the virtual servers from the data centre. Askforhelp-IT notified DATAFORT of the situation at 14:15 on Saturday, and by 07:00 on Monday morning the virtual servers had been invoked and tested, with email accessible through webmail. Traders were working from a serviced office on laptops provided by Askforhelp-IT, using their mobiles to field business calls.

Customers, suppliers and business partners were never aware that there had been a problem at all. And what about the FCA? There was no breach of service guidelines, so no need to contact them.

Aggravation index: -1. What aggravation?

The preceding describes an actual event that took place at a customer’s office in Q4 2014. Thankfully there was no need to resort to tape recovery, but the tape scenario accurately presents the protocol that would have been used to recover a customer’s network from tape.

Is a particular customer’s recovery strategy giving you cause for concern? Call or email Business Development Director Marcie Terman to find out how we can work with you to protect your customers while enhancing your business efficiency and reputation, and sharing revenue generously.