One of the core reasons so many businesses are switching to cloud computing, besides speed, security, scalability, and convenience, is the opportunity to benefit from data centers' almost permanent uptime. When you outsource your computing processes to a remote data center, you no longer carry the worries associated with systems located on your premises. The risk of downtime, sabotage, failure, data loss, and theft is on the provider's end. But as we know, nothing works perfectly. Even leading-edge data centers sometimes suffer an outage.
In March 2018, a major data center in London lost power for several hours. During that time, many UK businesses weren't able to access their data and processes in the cloud, resulting in a financial loss that is hard to calculate. The cause was a power outage, very likely followed by a generator failure.
One of the most common mistakes data centers make is neglecting the state of their backup infrastructure. All data centers have generators, but when the grid power supply remains stable for years, those generators are forgotten. Because no one checks the actual condition of these and other backup devices, the failure hits when it's least desired. Data center engineers learn that the backup infrastructure is down only when they fall back on it during a failure. As a result, UPS failure remains the leading cause of data center outages.
The risk is great because many data centers don't routinely check their backup infrastructure. The reasoning goes: "It worked the last time we ran it, and we haven't switched it on since, so it must still work." While this is unlikely to happen at the world's largest cloud services, which employ rigorous quality measures and well-trained staff in their data centers, human error can still occur.
On Feb 28, 2017, Amazon Web Services suffered a major outage. It effectively shut down many large websites and services: Wix (including the sites built on it), Quora, Trello, and many others. Some other services, such as Alexa and the cloud backend for Nest thermostats, ran slowly. While debugging an issue with the billing system, an employee needed to shut down a few servers. Instead, the command caused a domino effect that shut down a large portion of AWS servers. The outage lasted for hours.
Amazon responded to the outage by upgrading the tool used to shut down servers before maintenance, which should greatly reduce the risk of a domino effect and of switching off more capacity than intended. But this is just one of many possible weak spots in such large and complex systems. No one knows where else a human error can cause significant damage despite all the safeguards in place.
Data center outages are not uncommon, despite all the measures providers take to prevent them.
A 2016 Ponemon Institute report compared the causes of data center outages in 2010, 2013, and 2016. The infographic below shows the root causes of data center outages.
As you can see, environmental causes, including water and air conditioner failures, are now less common than before, but UPS failure remains the leading root cause of outages. What is most worrying, however, is the sharp growth in incidents caused by cybercrime.
As seen above, the costs are very high and tend to rise further. The average downtime in this study was 95 minutes, and an average outage costs about $750,000, a figure that might grow in the future.
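To put those two figures side by side, here is a rough back-of-the-envelope sketch in Python. It simply divides the quoted average cost by the quoted average duration, which gives roughly $7,900 per minute. This is a naive average and an assumption made for illustration, not a figure taken from the report: real outage costs are unlikely to grow linearly with time, and the function names below are hypothetical.

```python
# Rounded figures quoted in the text above (from the cited 2016 study).
AVERAGE_OUTAGE_COST_USD = 750_000
AVERAGE_OUTAGE_MINUTES = 95


def cost_per_minute(total_cost_usd: float, duration_minutes: float) -> float:
    """Naive average: spread the total cost evenly over the outage duration."""
    if duration_minutes <= 0:
        raise ValueError("duration_minutes must be positive")
    return total_cost_usd / duration_minutes


def estimated_outage_cost(duration_minutes: float) -> float:
    """Hypothetical linear estimate for an outage of the given length.

    Assumes cost scales linearly with time, which is a simplification.
    """
    rate = cost_per_minute(AVERAGE_OUTAGE_COST_USD, AVERAGE_OUTAGE_MINUTES)
    return rate * duration_minutes


if __name__ == "__main__":
    rate = cost_per_minute(AVERAGE_OUTAGE_COST_USD, AVERAGE_OUTAGE_MINUTES)
    print(f"Approximate cost per minute: ${rate:,.0f}")
    print(f"Linear estimate for a 4-hour outage: ${estimated_outage_cost(240):,.0f}")
```

Treat the result as an order-of-magnitude illustration only: the first minutes of an outage may cost far more or far less than the later ones, depending on the business affected.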