Last week was an unhappy one for Google. It suffered two major outages; the root cause of each could concisely be described as “oops”.
For 47 minutes Dec. 14, many Google Cloud services were apparently down. They actually weren’t – but nobody could authenticate to them so they were inaccessible. Then, for a combined total of six hours and 41 minutes on Monday and Tuesday, by Google’s count, Gmail began bouncing emails sent to some gmail.com addresses, saying those addresses did not exist.
The company has now released detailed reports of what went wrong, and they offer lessons for every IT shop. Kudos to Google for its transparency in describing the incidents in detail, including the embarrassing bits.
Here’s what happened.
The 47 minutes Google techs would likely love to forget began in October, when the User ID Service – which maintains unique identifiers for each account and handles OAuth authentication credentials – was being migrated to a new quota management system. As part of that migration, a change was made that registered the service with the new system. That was fine. However, parts of the old system remained in place, and they erroneously reported usage of the User ID Service as zero. Nothing happened at that point because of an existing grace period on enforcing quota restrictions.
On Dec. 14, the grace period expired.
Suddenly the usage of the User ID Service apparently fell to zero. The service uses a distributed database to store account data (it uses Paxos protocols to coordinate updates) and rejects authentication requests when it detects outdated data. With what it thought was zero usage, the quota management system reduced the available storage for the database, which prevented writes. Within minutes, the majority of read operations became outdated, generating authentication errors. And to make life more interesting for the technicians trying to troubleshoot, some of their internal tools were impacted as well.
Google does have safety checks in place that should detect unintended quota changes, but the edge case of zero usage was not covered. Lesson: even if it seems improbable, take those edge cases into account.
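Google's report does not include code, but the class of bug is easy to sketch. Here is a hypothetical quota-resize routine (all names and thresholds are invented for illustration, not taken from Google's report) where a well-intentioned safety check exists yet a special case for zero usage routes around it:

```python
# Hypothetical sketch of how a "zero usage" edge case can slip past
# a quota safety check. None of these names come from Google's report.

def new_quota_is_safe(current_quota: int, proposed_quota: int,
                      max_shrink_ratio: float = 0.5) -> bool:
    """Reject any single-step quota reduction larger than max_shrink_ratio."""
    if proposed_quota >= current_quota:
        return True  # growing a quota is always allowed here
    return proposed_quota >= current_quota * (1 - max_shrink_ratio)

def apply_reported_usage(current_quota: int, reported_usage: int) -> int:
    """Resize the quota toward reported usage, honouring the safety check."""
    if reported_usage == 0:
        # BUG: treating zero as "no data yet" bypasses the safety check
        # entirely, so an erroneous zero-usage report zeroes out the quota.
        return 0
    proposed = reported_usage * 2  # keep 2x headroom over reported usage
    if new_quota_is_safe(current_quota, proposed):
        return proposed
    return current_quota  # refuse an unsafe shrink

# A wrong-but-nonzero report is caught; a wrong zero report is not.
print(apply_reported_usage(current_quota=1_000_000, reported_usage=100_000))  # 1000000
print(apply_reported_usage(current_quota=1_000_000, reported_usage=0))        # 0
```

The fix is the lesson in the text: the zero branch must go through the same safety gate as every other value, because "no usage" and "no data" are different claims.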
To get things moving again, Google took several steps. First, it disabled the quota management system in one datacentre, and when that quickly improved the situation, five minutes later disabled it everywhere. Within six minutes, most services had returned to normal. Some suffered lingering impact; you can see the whole list here.
But now the real work begins. In addition to fixing the root cause, Google is implementing a number of changes, including:
- Reviewing its quota management automation to prevent fast implementation of global changes
- Improving monitoring and alerting to catch incorrect configurations sooner
- Improving the reliability of tools and procedures for posting external communications during outages that affect internal tools
- Evaluating and implementing improved write-failure resilience in the User ID Service database
- Improving the resilience of GCP services to more strictly limit the impact on the data plane during User ID Service failures
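Several of these commitments boil down to the same principle Google applied during the recovery: change one location first, watch, then go wide. A minimal sketch of such a staged rollout, with rollback on a failed health check (stage groupings, datacentre names, and the health check are all illustrative assumptions):

```python
# Illustrative staged-rollout loop: apply a change to progressively
# larger groups of datacentres, checking health after each stage.
# Everything here (stage sizes, names, checks) is an assumption.

STAGES = [["dc-1"], ["dc-2", "dc-3"], ["dc-4", "dc-5", "dc-6"]]

def rollout(apply_change, healthy) -> str:
    done = []
    for i, stage in enumerate(STAGES, start=1):
        for dc in stage:
            apply_change(dc)
            done.append(dc)
        if not all(healthy(dc) for dc in done):
            for dc in reversed(done):       # unwind everything applied so far
                apply_change(dc, revert=True)
            return f"rolled back after stage {i}"
    return "rollout complete"

# Demo: simulate a change that breaks dc-2, surfacing in stage 2.
applied = set()

def apply_change(dc, revert=False):
    (applied.discard if revert else applied.add)(dc)

def healthy(dc):
    return dc != "dc-2"

print(rollout(apply_change, healthy))  # rolled back after stage 2
print(sorted(applied))                 # [] – nothing left half-deployed
```

This mirrors the recovery sequence in the narrative: Google disabled the quota system in one datacentre, confirmed improvement, and only then disabled it everywhere.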
The Gmail failure hit in two waves. On Monday, Google Engineering began receiving internal user reports of delivery errors and traced them to a recent code change in an underlying configuration system, which resulted in an invalid domain name (instead of gmail.com) being provided to the SMTP inbound service. When the Gmail accounts service checked these addresses, it could not find a valid user, so it generated SMTP error 550 – a permanent error that, for many automated mailing systems, results in the recipient being removed from their lists. The code change was reversed, which corrected the situation.
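The permanent/temporary distinction in SMTP reply codes is what made this bug so damaging. A 4xx reply means "try again later"; a 5xx reply like 550 means "this will never work". A hedged sketch of the bounce handling a typical bulk mailer applies (the function and thresholds are assumptions about common practice, not any specific mailer):

```python
# Illustrative sketch of how a typical bulk-mailing system reacts to
# SMTP reply codes. The logic is an assumption about common practice,
# not taken from Google's report or any specific mailer.

def handle_bounce(smtp_code: int, address: str, subscribers: set) -> str:
    if 400 <= smtp_code < 500:
        # 4xx = temporary failure (mailbox full, greylisting, ...):
        # keep the address and retry later.
        return "retry"
    if 500 <= smtp_code < 600:
        # 5xx = permanent failure. A 550 tells the sender the mailbox is
        # unavailable, so well-behaved mailers unsubscribe the address
        # immediately to protect their sender reputation.
        subscribers.discard(address)
        return "unsubscribed"
    return "delivered"

subs = {"alice@gmail.com", "bob@gmail.com"}
handle_bounce(550, "alice@gmail.com", subs)  # Gmail's bug triggered this path
print(sorted(subs))  # ['bob@gmail.com'] – alice is silently gone for good
```

That is why reversing the code change could not fully undo the damage: every 550 issued during the incident may have permanently unsubscribed a real, valid address from some third-party list.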
On Tuesday, the configuration system was updated again (Google does not say whether it was the same change, re-applied, or another buggy one), and bounces started again. The changes were reversed, and Google has committed to the following:
- Update the existing configuration difference tests to detect unexpected changes to the SMTP service configuration before applying the change.
- Improve internal service logging to allow more accurate and faster diagnosis of similar types of errors.
- Implement additional restrictions on configuration changes that may affect production resources globally.
- Improve static analysis tooling for configuration differences to more accurately project differences in production behaviour.