Root Cause Analysis of Microsoft Azure Service Interruption
Microsoft recently experienced a significant interruption of their Azure cloud service. Since a decent amount of data was available for this incident, I decided to do a partial root cause analysis. All of my source data came from Microsoft's official Azure blog post on 2014-Nov-24.
I did this primarily to create a sample analysis that could be used on my Causal Factor Tree Analysis page. (It has needed a sample for years!) Another motivation, however, was that Microsoft's statement of root cause seemed rather inadequate to me... it was more like a restatement of the immediate cause for the disruption. Quoting from their blog:
A bug in the Blob Front-Ends which was exposed by the configuration change made as a part of the performance improvement update, which resulted in the Blob Front-Ends to going into an infinite loop.
That just doesn't cut it, in my book. I think they could have done a whole lot better. Maybe they could re-start from my partial analysis; they wouldn't even have to pay me! Not very much, anyway. 😀
Unfortunately, I don't have quite enough data to arrive at final root causes, but I think they might be found by investigating along the following paths. (Note: I didn't address the Service Health Dashboard issues in this analysis because I didn't have much to go on; ideally, any complete analysis would include them.)
- A bug in the Blob Front-Ends code wasn't found prior to that code being released to production. Why not?
- The infinite-loop-inducing behaviour caused by the interaction of the Blob Front-Ends bug and the new performance improvement patch wasn't seen during patch testing (referred to as "flighting" in Microsoft's blog post). Why not?
- The "standard" process for rolling out patch installations to the service in incremental batches (rather than all servers in a short timeframe) wasn't followed. Why not? (...and don't go concluding that "human error" was the root cause, because that's useless; see the analysis chart for more info.)
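For readers new to causal factor trees, the three paths above can be sketched as a small tree whose unexplained leaves are the open "Why not?" questions. This is only a rough illustration: the `Factor` class, the `open_questions` helper, and the way the factors are grouped are my own hypothetical choices, not the layout of the actual chart and not anything from Microsoft.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Factor:
    """One node in a (hypothetical) causal factor tree."""
    description: str
    children: List["Factor"] = field(default_factory=list)
    resolved: bool = False  # set to True once evidence explains this factor


def open_questions(node: Factor) -> List[str]:
    """Collect leaf factors that still need a 'Why not?' answered."""
    if not node.children:
        return [] if node.resolved else [node.description]
    found: List[str] = []
    for child in node.children:
        found.extend(open_questions(child))
    return found


# Assumed grouping of the three investigation paths; BFE = Blob Front-Ends.
tree = Factor("Azure service interruption", [
    Factor("BFEs went into an infinite loop", [
        Factor("BFE bug not found before release to production"),
        Factor("Bug/patch interaction not seen during flighting"),
    ]),
    Factor("Patch rollout was not limited to incremental batches", [
        Factor("Standard incremental rollout process not followed"),
    ]),
])

for question in open_questions(tree):
    print(f"{question}. Why not?")
```

Each printed line is a branch that needs more data before it can be pushed down to a root cause; marking a leaf as `resolved` or adding children to it mirrors how the tree grows as the investigation proceeds.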
Now, finally, here is a PDF copy of the analysis. (Note: BFE in the chart stands for Blob Front-Ends.)
If you want a copy of the Microsoft Visio file that the PDF was created from, please leave a comment below!
by Bill Wilson
Cool example, thanks for providing the chart too; it makes it easier for me to follow and understand. I'm trying to wrap my head around root cause analysis.