Google: We had to shut down a datacenter to save it during London’s heatwave
Google has revealed the root cause of the outage that disrupted services at its europe-west2-a zone, based in London, during a recent heatwave.
“One of the datacenters that hosts zone europe-west2-a could not maintain a safe operating temperature due to a simultaneous failure of multiple, redundant cooling systems combined with the extraordinarily high outside temperatures,” states Google’s incident report.
The report doesn’t explain why the cooling systems failed, but does say Google first became aware of “an issue affecting two cooling systems in one of the datacenters that hosts europe-west2-a on Tuesday, 19 July 2022 at 06:33 US/Pacific and began an investigation.”
The Register has checked weather records for the day in question: just before Google noticed cooling problems – 2:20PM in London – the temperature was 102°F/39°C.
That’s a level of heat that’s manageable in places where datacenter designers know that sort of temperature can be expected. But as July 19 was the hottest day on record in London, the UK capital is not such a place.
Engineers began working on mitigations for the failed cooling systems at 07:02 Pacific, but their efforts did not succeed.
London temperatures remained above 95°F/35°C deep into the evening and, at around 6PM in London, Google engineers “powered down this part of the zone to prevent an even longer outage or damage to machines.”
In other words, they shut the zone down to save it from a worse outage.
Chaos kicked in after the shutdown decision. Closing the datacenter meant “Compute Engine terminated all VMs in the impacted datacenter, representing approximately 35 percent of the VMs in the europe-west2-a zone.”
Google also made a mess of trying to provide redundancy.
“At the start of the incident, we inadvertently modified traffic routing for internal services to avoid all three zones in the europe-west2 region, rather than just the impacted europe-west2-a zone.”
So while only part of the europe-west2-a zone was down, Google told itself to ignore working resources in the region's other two zones.
Google and other cloud vendors advise users to employ multiple zones to improve resilience. Google’s error, therefore, went against its own advice.
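The difference between the intended and the actual routing change can be illustrated with a short sketch. This is not Google's tooling: the `routable_zones` helper and the exclusion sets are hypothetical, and the zone list simply assumes the region's three zones are named `europe-west2-a`, `-b` and `-c`.

```python
# Hypothetical illustration only; not Google's internal routing system.
# Assumption: the europe-west2 region consists of these three zones.
REGION_ZONES = ["europe-west2-a", "europe-west2-b", "europe-west2-c"]


def routable_zones(all_zones, excluded):
    """Return the zones that internal traffic may still be routed to."""
    return [zone for zone in all_zones if zone not in excluded]


# Intended change: avoid only the zone with the overheating datacenter.
intended = routable_zones(REGION_ZONES, {"europe-west2-a"})

# Change described in the report: every zone in the region was avoided.
actual = routable_zones(REGION_ZONES, set(REGION_ZONES))

print("intended:", intended)
print("actual:  ", actual)
```

With the intended exclusion, two healthy zones remain available; with the region-wide exclusion, none do, which is why a multi-zone deployment offered no protection here.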
The cooling system came back online at 14:13 Pacific – past 10PM in London, when temperatures were still sizzling.
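The report gives its timestamps in US/Pacific, while the events took place on London time. A minimal sketch of the conversion follows, assuming Python 3.9+ with the standard `zoneinfo` module and system time zone data available. The `pacific_to_london` helper and the event labels are illustrative names, not taken from the report, and `America/Los_Angeles` is used as the IANA name for US/Pacific.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Assumption: "US/Pacific" in the report corresponds to America/Los_Angeles.
PACIFIC = ZoneInfo("America/Los_Angeles")
LONDON = ZoneInfo("Europe/London")


def pacific_to_london(hour, minute):
    """Convert a US/Pacific time on 19 July 2022 to London local time."""
    pacific_time = datetime(2022, 7, 19, hour, minute, tzinfo=PACIFIC)
    return pacific_time.astimezone(LONDON)


# Timestamps quoted in the article, in US/Pacific.
events = {
    "cooling issue noticed": (6, 33),
    "mitigation work begins": (7, 2),
    "cooling back online": (14, 13),
}

for label, (hour, minute) in events.items():
    london = pacific_to_london(hour, minute)
    print(f"{label}: {hour:02d}:{minute:02d} Pacific = {london:%H:%M} London")
```

In July both locations observe daylight saving time, so London runs eight hours ahead of US/Pacific.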
“Google engineers are actively conducting a detailed analysis of the cooling system failure that triggered this incident,” the report states.
The search giant and cloud aspirant has also pledged to:
- Investigate and develop more advanced methods to progressively decrease the thermal load within a single datacenter space, reducing the probability that a full shutdown is required;
- Examine procedures, tooling, and automated recovery systems for gaps to substantially improve recovery times in the future;
- Audit cooling system equipment and standards across the datacenters that house Google Cloud globally.
The incident report also offers a detailed account of the incident’s impact on Google Cloud services, and gives the duration of the outage as 18 hours, 23 minutes – plus a “long tail duration” of 35 hours, 15 minutes before things were back to normal. ®
via The Register https://ift.tt/o72NuR1
July 31, 2022 at 11:38PM