Google: SRE configuration change overloaded a key system and caused Google's service outage


Google released an analysis report on the large-scale service disruption of March 12, stating that a system overloaded by an SRE configuration change caused the Google Cloud Storage error rate to increase.

On the 12th, many users around the world, including in parts of North America, South America, Europe, and Asia, reported problems with Gmail, YouTube, Google Drive, Google Music, and other Google services, and Google subsequently acknowledged the problem. The Google Cloud Status Dashboard shows that the failure affected all regions of Google Cloud Storage.


Google said that its internal blob (large data object) storage service experienced a service interruption lasting 4 hours and 10 minutes. According to the root cause analysis, on March 11 Google SREs were alerted to a significant increase in the storage resources used for metadata by the internal blob service. On March 12, to reduce resource usage, the SREs made a configuration change. Its side effect was to overload a critical part of the system responsible for looking up the location of blob data, and the increased load eventually led to elevated error rates.

Google said it will publish a separate incident report on the impact on services outside Google Cloud Platform. Google apologized to the customers whose services and applications were affected by the incident and said it is taking steps to improve availability and prevent such disruptions from happening again.