A single typo brought Amazon Web Services to a grinding halt

3 Mar 2017

AWS cloud stand. Image: drserg/Shutterstock

Amazon has learned a valuable lesson after a costly outage of its Amazon Web Services was found to be the result of a single typo.

When many major online services such as Trello and Medium all suffer outages at the same time, questions are immediately directed at Amazon Web Services (AWS), Amazon’s web hosting division.

This is what happened last Tuesday (28 February), when a number of websites either began to experience issues with their services or, in the ironic case of isitdownrightnow.com, went down altogether.

It even affected internet of things devices used by the company IFTTT, which resulted in smart lightbulbs not being able to turn on.

The outage lasted for several hours and, while it did not appear to affect some major social media clients such as Instagram, it was still damaging to a considerable number of other web services.

Following up on the cause of the incident, Amazon has blamed a particular engineer who mistyped a certain command during an inspection of its Simple Storage Service (S3) subsystems.

According to The Guardian, the engineer was attempting to uncover why S3’s billing service was performing much more slowly than it should.

To help identify the problem, the engineer tried to take a small number of S3 subsystems offline, but the typo resulted in a far larger shutdown.

‘A larger set of servers was removed than intended’

In a statement, Amazon said: “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.

“The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region.”

To make matters more challenging for Amazon, many of these subsystems had not been restarted in years.

“S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected,” Amazon said.

To a certain extent, AWS was lucky with the outage, as it only affected the northern Virginia region. The company said it has now put protection measures in place to prevent a similar incident from happening again.

Colm Gorey was a senior journalist with Silicon Republic
