A few keystrokes brought the Internet to its knees

There’s new information on what caused that massive Amazon Web Services (AWS) outage last week. Or should we say, who caused it? 

Amazon has released more details about the outage and says human error was the main culprit. An Amazon Simple Storage Service (S3) team member was troubleshooting a problem and executed a command to take a few servers offline.

However, because of an input error, a larger set of servers than intended was taken down. That included servers that support several popular sites and services, and it took longer than expected to get those sites back up and running.

Making changes

Anyone can make a mistake. But most people’s mistakes don’t lead to massive outages of web services.

The good news: Amazon has laid out the lessons it learned and the changes it has made to prevent this from happening again. These include:

  • changing its tools so capacity can’t be taken offline too quickly
  • adding safeguards to prevent the command from affecting too many systems in the future, and
  • auditing its other operational tools.
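
For teams that want to build similar guardrails into their own tooling, the first two safeguards could look something like the sketch below. This is only an illustration, not Amazon's actual tooling: the function name plan_removal, the CapacityGuardError class, and the min_required and max_batch limits are all hypothetical.

```python
class CapacityGuardError(Exception):
    """Raised when a removal request would breach the safety limits."""


def plan_removal(requested, active, min_required, max_batch):
    """Split a removal request into small batches, or refuse it outright.

    All names and limits here are hypothetical, for illustration only.

    requested    -- server names the operator asked to take offline
    active       -- server names currently in service
    min_required -- fewest servers that must stay in service
    max_batch    -- most servers allowed offline in a single step
    """
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")

    # Drop duplicate names while keeping the operator's order.
    requested = list(dict.fromkeys(requested))

    # A mistyped name should stop the command, not be silently ignored.
    unknown = [name for name in requested if name not in active]
    if unknown:
        raise CapacityGuardError(f"not in service: {unknown}")

    # Safeguard: never let the pool drop below its minimum capacity.
    remaining = len(active) - len(requested)
    if remaining < min_required:
        raise CapacityGuardError(
            f"refusing: {remaining} servers would remain, "
            f"minimum is {min_required}"
        )

    # Safeguard: take servers offline in small steps, not all at once.
    return [
        requested[i:i + max_batch]
        for i in range(0, len(requested), max_batch)
    ]
```

With ten servers in service, a minimum of five and a batch size of two, a request to remove three servers comes back as two small batches, while a request to remove eight is rejected before anything goes offline.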

Post-mortems like Amazon’s are a must for any department. When something goes wrong, get the whole team together and talk about what happened and how it can be prevented in the future.

It’s important to make these meetings as blame-free as possible. Get to the bottom of what caused the issue and work with your team to come up with permanent solutions.