Microsoft had “insufficient” staff levels at its data centre campus last week when a power sag knocked the chiller plant serving two data halls offline, cooking portions of its storage hardware.
The company has released a preliminary post-incident report (PIR) for the large-scale failure, which saw large enterprise customers including Bank of Queensland and Jetstar completely lose service.
The PIR sheds light on why some enterprises lost service altogether: so many storage nodes were gracefully shut down - or had components fried - in the incident that data, and all replicas of it, were offline.
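The PIR does not describe Azure's placement logic or replica counts, but the failure mode it describes can be illustrated generically: if every node holding a replica of a given piece of data is offline at the same time, that data is unavailable until at least one of those nodes is recovered. A minimal sketch of that idea, with a hypothetical three-replica layout:

```python
# Illustrative only: a hypothetical partition-to-node placement showing how data
# becomes unavailable when every node holding a replica is offline at once.
# This is not Microsoft's actual placement scheme.

REPLICAS = {
    # partition -> nodes holding its three replicas (invented layout)
    "partition-a": {"node1", "node2", "node3"},
    "partition-b": {"node2", "node4", "node6"},
    "partition-c": {"node5", "node7", "node8"},
}

def unavailable_partitions(offline_nodes: set[str]) -> list[str]:
    """Return partitions whose replicas are *all* on offline nodes."""
    return [
        partition
        for partition, nodes in REPLICAS.items()
        if nodes <= offline_nodes  # every node holding a replica is offline
    ]

# If a thermal shutdown takes out nodes 1-4 and 6, partition-a and partition-b
# lose every replica, while partition-c stays readable from nodes 5, 7 and 8.
print(unavailable_partitions({"node1", "node2", "node3", "node4", "node6"}))
# ['partition-a', 'partition-b']
```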
In addition, after storage nodes were finally recovered, a "tenant ring" hosting over 250,000 databases failed - albeit with uneven impact on customers.
Chillers offline
Microsoft said the cooling capacity for the two affected data halls “consisted of seven chillers, with five chillers in operation and two chillers in standby (N+2)”.
A power sag - a brief voltage dip - caused all five operating chillers to fault, and only one of the two standby units came online.
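The PIR doesn't give capacity figures, but the redundancy arithmetic can be sketched with assumed numbers: N+2 protects against two independent unit failures, not a common-cause event, like a voltage sag, that faults every running unit at once.

```python
# Rough illustration of N+2 chiller redundancy under a common-cause fault.
# The figures mirror the unit counts in the PIR; everything else is assumed.

duty_units = 5        # chillers carrying the load of the two data halls
standby_units = 2     # spares (N+2)
required_running = 5  # units needed to hold temperatures

# The power sag faults all five duty units; only one standby unit starts.
duty_surviving = 0
standby_started = 1

running = duty_surviving + standby_started
shortfall = required_running - running
print(f"{running} of {required_running} chillers running; short by {shortfall}")
# 1 of 5 chillers running; short by 4 -> temperatures in the halls climb
```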
Microsoft said the onsite staff “performed our documented emergency operational procedures (EOP) to attempt to bring the chillers back online, but were not successful.”
The company appeared to have been caught out by the scale of the incident: there were not enough staff onsite, and its emergency procedures did not cater for an issue of that size.
“Due to the size of the data centre campus, the staffing of the team at night was insufficient to restart the chillers in a timely manner,” the company said.
“We have temporarily increased the team size from three to seven, until the underlying issues are better understood and appropriate mitigations can be put in place.”
On its EOP, Microsoft said: “The EOP for restarting chillers is slow to execute for an event with such a significant blast radius.”
“We are exploring ways to improve existing automation to be more resilient to various voltage sag event types.”
While there weren’t enough staff onsite to execute the documented procedures in time, extra hands may only have reached the same result faster, as the chillers themselves had issues.
Preliminary investigations showed the chiller plant did not automatically restart “because the corresponding pumps did not get the run signal from the chillers.”
“This is important as it is integral to the successful restarting of the chiller units,” Microsoft said.
“We are partnering with our OEM vendor to investigate why the chillers did not command their respective pump to start.”
Microsoft said the faulted chillers could not be manually restarted “as the chilled water loop temperature had exceeded the threshold.”
With temperatures rising and infrastructure raising thermal warnings, Microsoft had no choice but to shut down servers.
“This successfully allowed the chilled water loop temperature to drop below the required threshold and enabled the restoration of the cooling capacity,” it said.
Storage, SQL database recovery
Still, not everything recovered smoothly.
The incident impacted seven storage tenants - five “standard”, two “premium”.
Some storage hardware was “damaged by the data hall temperatures”, Microsoft said.
Diagnostics weren’t available for troubleshooting because the storage nodes were offline.
“As a result, our onsite data centre team needed to remove components manually, and re-seat them one by one to identify which particular component(s) were preventing each node from booting,” Microsoft said.
“Several components needed to be replaced for successful data recovery and to restore impacted nodes.
“In order to completely recover data, some of the original/faulty components were required to be temporarily re-installed in individual servers.”
An infrastructure-as-code automation also failed, “incorrectly approving stale requests, and marking some healthy nodes as unhealthy, which slowed storage recovery efforts.”
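The PIR doesn't detail how that automation works, but the general problem it describes - acting on node-health requests after their context has gone stale - is commonly guarded against with a freshness check before any state change is applied. A minimal sketch of that pattern, with invented names and thresholds:

```python
# Minimal sketch of guarding automation against acting on stale requests.
# Names, fields and the threshold are invented for illustration; this is not
# Microsoft's actual tooling.

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_REQUEST_AGE = timedelta(minutes=15)  # assumed freshness window

@dataclass
class NodeStateRequest:
    node_id: str
    requested_state: str   # e.g. "unhealthy"
    created_at: datetime   # when the request was raised

def should_apply(request: NodeStateRequest, now: datetime | None = None) -> bool:
    """Reject requests whose justification may no longer reflect reality."""
    now = now or datetime.now(timezone.utc)
    if now - request.created_at > MAX_REQUEST_AGE:
        # A node flagged unhealthy during the outage may have recovered since;
        # re-evaluate it rather than approving the old request automatically.
        return False
    return True
```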
The failure of a tenant ring hosting over 250,000 SQL databases further slowed recovery, Microsoft said.
“As we attempted to migrate databases out of the degraded ring, SQL did not have well tested tools on hand that were built to move databases when the source ring was in [a] degraded health scenario,” the company said.
“Soon this became our largest impediment to mitigating impact.”
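Microsoft hasn't said what the missing tooling would look like, but the requirement it implies is a migration loop that tolerates a partially unhealthy source: retrying failed reads, deferring databases it cannot yet reach, and never aborting the whole run on a single error. A hypothetical sketch of that behaviour:

```python
# Hypothetical sketch of a migration loop that tolerates a degraded source ring.
# The callables, retry policy and error handling are assumptions for
# illustration only, not SQL's actual tooling.

import time

def migrate_databases(databases, read_from_source, write_to_target,
                      max_attempts=3, backoff_seconds=5):
    """Try to move each database; defer failures instead of aborting the run."""
    deferred = []
    for db in databases:
        for attempt in range(1, max_attempts + 1):
            try:
                payload = read_from_source(db)   # may fail while the ring is degraded
                write_to_target(db, payload)
                break
            except ConnectionError:
                if attempt == max_attempts:
                    deferred.append(db)          # revisit once more nodes recover
                else:
                    time.sleep(backoff_seconds * attempt)
    return deferred  # databases still stuck on the degraded ring
```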
A final PIR is expected to be completed in a few weeks.