The incident has raised questions about public cloud reliability, as the global outages are believed to have stemmed from Google Cloud.
Multiple sites including Spotify and Discord suffered outages yesterday (8 March), becoming unavailable for many users around the world.
Both companies confirmed outages on Twitter after users reported being unable to properly connect to these platforms.
Something’s not quite right, and we’re looking into it. Thanks for your reports!
— Spotify Status (@SpotifyStatus) March 8, 2022
A couple of hours after Spotify and Discord acknowledged the issue online, both companies reported that their systems had returned to normal.
Based on Discord’s incident report, the increase in errors was noticed at around 9am Pacific Standard Time and was resolved by around 1pm.
Cybersecurity watchdog NetBlocks said that Twitter also suffered “intermittent failures” at around the same time, while some users said they were unable to access Wikipedia.
Google Cloud
The issue has been linked to Google Cloud, with Discord referencing Google Cloud as the cause in its incident report. While Spotify did not give details of its outage, a case study from Google suggests that Spotify is a customer of its cloud services.
“Engineering has root-caused this issue to a Google Cloud component called Traffic Director, which is responsible for configuring our load balancing layer,” Discord said in its report. “In its malfunction, it caused our internal load balancing layer to not have a valid configuration, which caused a loss of availability of the API.”
Google Cloud’s Status Dashboard yesterday said an issue with Traffic Director was caused by a “recent release”, which was rolled back following the incident.
“We have identified a probable root cause and will be publishing an incident report within the next several days,” Google said.
The global outage has raised concerns about the reliability of public cloud services, with ServerChoice commercial director Adam Bradshaw saying they are “not a silver bullet for IT infrastructure”.
“Ultimately, if you don’t own the infrastructure, you don’t control it,” Bradshaw said. “There are many other components that can be used alongside public cloud, like colocation services, that can ensure public cloud outages don’t result in mission-critical incidents for a business.”
Bradshaw added that these outages had the potential to affect “millions of users” and cause “significant reputational damage” for companies like Spotify.
Cloud service provider Civo’s CEO, Mark Boost, said that many online services “solely rely” on public cloud providers such as Google Cloud and that many believe brand familiarity and “sheer size” are enough to maintain uptime.
“However, as Spotify’s outage demonstrates, the sheer size of hyperscalers massively increases its complexity, with the huge number of variables and moving parts heightening the risk of a malfunction that leads to an outage,” Boost said.
“Hyperscaler cloud providers use extremely complex billing systems to maximise their revenue streams, all whilst giving off the impression that they are always the most reliable option for businesses – this just simply isn’t true.”