Routine upgrade causes major outage: 3 lessons for IT

Scheduled upgrades usually go off without a hitch, but when they don’t – well, we don’t need to tell you what a headache that can be. Dropbox learned that lesson over the weekend. 

The cloud storage service was set for a routine upgrade on Friday, but things went wrong. According to the head of infrastructure at Dropbox, Akhil Gupta:

During this process, the upgrade script checks to make sure there is no active data on the machine before installing the new OS.

A subtle bug in the script caused the command to reinstall a small number of active machines. Unfortunately, some master-replica pairs were impacted which resulted in the site going down.

The silver lining – if there can be one in an incident like this – is that no one’s data was at risk during the outage. It was just inaccessible.

But that’s of little comfort to those who needed access to it.

Takeaways for IT

Upgrades are likely on the minds of IT pros everywhere. Windows XP’s looming end-of-support deadline alone is enough to have tech-minded managers thinking: What would happen if we were faced with a prolonged outage?

Here are three things to keep in mind when upgrading systems:

  • Who needs notifications? Some upgrades are certain to cause outages. These require letting every user know the systems will be down. But others carry only the possibility of a brief outage. In these instances, who should be alerted? Should the outage be assumed or treated as a worst-case scenario? Have a set plan for how your department will handle these notifications.
  • When should you alert them? With more and more users working remotely some or all of the time, you can no longer assume that an email outage at midnight on a Friday won’t affect anyone. Give plenty of advance notice whenever possible.
  • Realize disaster recovery isn’t perfect. Relying on disaster recovery to get you back up and running is fine. But it can’t be relied on to instantly bring everything back exactly as it was at the moment of the outage. Recognize these limitations, and test more than you think you have to before beginning an upgrade.
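
For teams that script their own upgrades, the kind of pre-install check Gupta describes can be rehearsed as a dry run before anything is reinstalled. The Python sketch below is an illustration only, not Dropbox's script: the Machine record, its has_active_data and pair_id fields, and the select_for_reinstall function are all hypothetical names. It skips any machine that holds active data and never approves both members of a master-replica pair in the same batch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Machine:
    """Hypothetical inventory record; the field names are assumptions."""
    name: str
    has_active_data: bool
    pair_id: Optional[str] = None  # shared by a master and its replica


def select_for_reinstall(
    candidates: List[Machine],
) -> Tuple[List[str], Dict[str, str]]:
    """Dry run: decide which machines are safe to reinstall in one batch.

    Returns the approved machine names and a reason for each rejection.
    """
    approved: List[str] = []
    rejected: Dict[str, str] = {}
    pairs_in_batch = set()  # pairs that already have one approved member

    for machine in candidates:
        if machine.has_active_data:
            rejected[machine.name] = "holds active data"
        elif machine.pair_id is not None and machine.pair_id in pairs_in_batch:
            rejected[machine.name] = "partner already in this batch"
        else:
            approved.append(machine.name)
            if machine.pair_id is not None:
                pairs_in_batch.add(machine.pair_id)

    return approved, rejected


# Made-up fleet used only to exercise the check.
fleet = [
    Machine("db-1", has_active_data=True, pair_id="pair-a"),
    Machine("db-2", has_active_data=False, pair_id="pair-a"),
    Machine("web-1", has_active_data=False),
    Machine("db-3", has_active_data=False, pair_id="pair-b"),
    Machine("db-4", has_active_data=False, pair_id="pair-b"),
]

approved, rejected = select_for_reinstall(fleet)
print("Would reinstall:", approved)
for name, reason in rejected.items():
    print(f"Skipping {name}: {reason}")
```

Reviewing the skipped list before the real run is a cheap way to test more than you think you have to. The check is only as reliable as the inventory data behind it, so it is worth spot-checking a few machines by hand.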