Google announces the cause of last week's large-scale outage
Last week, Google suffered a large-scale service outage around the world. Users could not sign in to any service that requires a Google account, which left many services unavailable.
In a short initial report, Google said the problem was a quota failure in its authentication system: the storage quota could not be expanded automatically, which disrupted the system's operation.
Google has now released a detailed incident report, which shows that the root cause of the outage was a problem introduced while migrating Google's authentication service from its old quota system to a new one.
Google writes:
Google uses an evolving suite of automation tools to manage the quota of various resources allocated for services. As part of an ongoing migration of the User ID Service to a new quota system, a change was made in October to register the User ID Service with the new quota system, but parts of the previous quota system were left in place which incorrectly reported the usage for the User ID Service as 0. An existing grace period on enforcing quota restrictions delayed the impact, which eventually expired, triggering automated quota systems to decrease the quota allowed for the User ID service and triggering this incident. Existing safety checks exist to prevent many unintended quota changes, but at the time they did not cover the scenario of zero reported load for a single service:
• Quota changes to large number of users, since only a single group was the target of the change,
• Lowering quota below usage, since the reported usage was inaccurately being reported as zero,
• Excessive quota reduction to storage systems, since no alert fired during the grace period,
• Low quota, since the difference between usage and quota exceeded the protection limit.
As a result, the quota for the account database was reduced, which prevented the Paxos leader from writing. Shortly after, the majority of read operations became outdated which resulted in errors on authentication lookups.
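To make the gap described in the report more concrete, here is a minimal Python sketch of a pre-flight check that an automated quota system could run before lowering a quota. It is not Google's implementation: the `QuotaProposal` type, the `check_proposal` function, the `last_known_usage` input, and the 50% reduction limit are all hypothetical names and values chosen for illustration. The point it shows is that a sudden report of zero usage for a previously busy service can be treated as a likely reporting fault rather than as permission to shrink the quota.

```python
from dataclasses import dataclass


@dataclass
class QuotaProposal:
    """A hypothetical automated request to change one service's quota."""
    service: str
    current_quota: int    # units are arbitrary in this sketch
    reported_usage: int   # usage as reported by the quota system
    proposed_quota: int


def check_proposal(p, last_known_usage, max_reduction_fraction=0.5):
    """Return a list of reasons to block the change; an empty list means allow.

    last_known_usage is an assumed, independently recorded usage figure,
    used to cross-check the reported value.
    """
    reasons = []

    # A busy service that suddenly reports zero usage is suspicious.
    if p.reported_usage == 0 and last_known_usage > 0:
        reasons.append("reported usage dropped to zero; possible reporting fault")

    # Compare against the larger of the two usage figures, so a faulty
    # zero report cannot hide a reduction below real usage.
    effective_usage = max(p.reported_usage, last_known_usage)
    if p.proposed_quota < effective_usage:
        reasons.append("proposed quota is below usage")

    # Limit how much quota can be removed in a single step.
    if p.current_quota > 0:
        reduction = (p.current_quota - p.proposed_quota) / p.current_quota
        if reduction > max_reduction_fraction:
            reasons.append("reduction exceeds the allowed fraction for one step")

    return reasons


# Example: usage is reported as 0 although it was recently 800.
proposal = QuotaProposal("user-id-service", 1000, 0, 10)
for reason in check_proposal(proposal, last_known_usage=800):
    print("BLOCKED:", reason)
```

In this sketch, a reported usage of zero does not pass the "below usage" check on its own, because the check falls back to the last independently known figure.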
As a remedy, Google will improve its quota management automation so that rapid global configuration changes made by the automation cannot affect its services.
Internal tools will also be improved so that Google engineers are not locked out and unable to recover from errors the next time a similar problem occurs.
Google apologized to all affected users and said it will conduct a more thorough investigation and adjust and improve its projects based on the findings.