Root Cause Analysis of Microsoft Azure Service Interruption

Microsoft recently experienced a significant interruption of their Azure cloud service. Since a decent amount of data was available for this incident, I decided to do a partial root cause analysis. All of my source data came from Microsoft's official Azure blog post on 2014-Nov-24.

I did this primarily to create a sample analysis that could be used on my Causal Factor Tree Analysis page. (It has needed a sample for years!) Another motivation, however, was that Microsoft's statement of root cause seemed rather inadequate to me... it was more like a restatement of the immediate cause for the disruption. Quoting from their blog:

A bug in the Blob Front-Ends which was exposed by the configuration change made as a part of the performance improvement update, which resulted in the Blob Front-Ends to going into an infinite loop.

That just doesn't cut it, in my book. I think they could have done a whole lot better. Maybe they could re-start from my partial analysis; they wouldn't even have to pay me! Not very much, anyway. 😀

Unfortunately, I don't have quite enough data to arrive at final root causes, but I think they might be found by investigating along the following paths. (Note: I didn't address the Service Health Dashboard issues in this analysis because I didn't have much to go on, but ideally, they would be included in any complete analysis.)

  1. A bug in the Blob Front-Ends code wasn't found prior to that code being released to production. Why not?
  2. The infinite-loop-inducing behaviour due to interaction of the Blob Front-Ends bug and the new performance improvement patch wasn't seen during patch testing (referred to as "flighting" in Microsoft's blog post). Why not?
  3. The "standard" process for rolling out patch installations to the service in incremental batches (rather than to all servers in a short timeframe) wasn't followed. Why not? (...and don't go concluding that "human error" was the root cause, because that's useless; see the analysis chart for more info.)
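
For readers who think better in code than in charts, here is a minimal sketch of how those three paths could be held in a causal factor tree. It is only an illustration: the Factor class, its field names, and the way I've nested the nodes are my own assumptions, not the layout of the chart in the PDF and certainly not anything from Microsoft. Each leaf is a factor whose own causes are still unknown, which is exactly where the next "Why not?" needs to be asked.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Factor:
    """One node in a causal factor tree (hypothetical structure, for illustration only)."""
    description: str
    causes: List["Factor"] = field(default_factory=list)

    def open_questions(self) -> List[str]:
        """Return the leaf factors, i.e. those with no known causes yet."""
        if not self.causes:
            return [self.description]
        found: List[str] = []
        for cause in self.causes:
            found.extend(cause.open_questions())
        return found


# Node wording paraphrases the three paths listed above; the nesting is an assumption.
tree = Factor(
    "Azure service interruption",
    [
        Factor(
            "Blob Front-Ends went into an infinite loop",
            [
                Factor("BFE bug not found before release to production"),
                Factor("Bug/patch interaction not seen during flighting"),
            ],
        ),
        Factor("Incremental batch rollout process not followed"),
    ],
)

for question in tree.open_questions():
    print(f"Why not? -> {question}")
```

A tree only reaches root causes when every leaf is something you can actually correct, so in this sketch all three leaves are still open questions rather than answers.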

Now, finally, here is a PDF copy of the analysis. (Note: BFE in the chart stands for Blob Front-Ends.)

Root Cause Analysis (Partial) of Microsoft Azure Service Interruption

If you want a copy of the Microsoft Visio file that the PDF was created from, please leave a comment below!

 



by Bill Wilson
Bill Wilson © 2004-2015

Last updated: November 25, 2014 at 9:27 am
