Clearly yesterday was a bad day. Team Foundation Service was mostly down for approximately 9 hours. The underlying issue was an expired SSL certificate in Windows Azure storage. We use HTTPS to access Windows Azure storage, where we store source code files, Git repos, work item attachments and more. The expired certificate prevented access to any of this information, making much of the TFService functionality unavailable.
We watched the issue very closely, stayed on the support bridge continuously and investigated options to mitigate the outage. Unfortunately we were not successful and had to wait until the underlying Azure issue was resolved. I have a new appreciation for the “fog of war” that sets in so easily during a large-scale crisis. We’ll be sitting down early this week to go through the timeline hour by hour – what we knew, what we didn’t know, what we tried, what else we could have tried, how we communicated with customers and everything else – so that we learn everything we can from the experience.
I can appreciate this problem. Team Foundation Service has dozens of “expiring objects” – certificates, credentials, etc. A couple of years ago, when our service was in its infancy, we too were hit by an expired certificate due to an operational oversight. Afterwards we instituted a regime of reviewing all expiring objects every few months to ensure we would never let another one expire. I’m still not as confident in our protection as I’d like. The current process relies on developers documenting any expiring objects they add to the service and on the ops team manually confirming all the expiration dates on a regular schedule. We took the occasion of this incident to raise the priority of automating this check to reduce the likelihood of a recurrence.

Of course, one of the things you quickly learn when operating a large-scale, mission-critical service is that you can’t assume anything is going to work. For instance, our automated expiration checks, once we build them, might fail. Or, when they find an issue, the alerting system may fail to deliver the alert. Or, if the alert is delivered by email, personnel may change and no one updates the address the alert is sent to, so it gets ignored. And on and on. The hard thing is that anything can go wrong, and it’s only obvious in hindsight what you should have been protecting against – so you have to try to protect against every possibility.
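To make that concrete, here’s a rough sketch of what such an automated check might look like: a small Python script that connects to a list of HTTPS endpoints, reads each certificate’s expiration date, and flags anything expiring soon. The endpoint names and the 30-day warning window are purely illustrative, not our actual configuration, and a real implementation would feed the ops alerting pipeline rather than print to a console.

```python
# Rough sketch of an automated certificate expiration check (illustrative only).
import socket
import ssl
from datetime import datetime, timedelta, timezone

# Hypothetical inventory of HTTPS endpoints whose certificates we track.
ENDPOINTS = ["example.blob.core.windows.net", "example.visualstudio.com"]
WARN_WINDOW = timedelta(days=30)  # raise an alert this far ahead of expiration


def cert_expiry(host: str, port: int = 443) -> datetime:
    """Return the expiration (notAfter) date of the certificate served by host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc)


def check_all() -> None:
    now = datetime.now(timezone.utc)
    for host in ENDPOINTS:
        days_left = (cert_expiry(host) - now).days
        if days_left <= WARN_WINDOW.days:
            # A real check would page the ops team; printing stands in for alerting here.
            print(f"WARNING: {host} certificate expires in {days_left} days")
        else:
            print(f"OK: {host} certificate valid for {days_left} more days")


if __name__ == "__main__":
    check_all()
```

Even a tiny script like this illustrates the failure modes above: it has to run on schedule, its inventory of endpoints has to stay current, and its alerts have to reach someone who will act on them.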
I haven’t yet seen the Azure incident review, so I don’t know exactly what failures led to this outage. Yesterday the focus was on restoring the service, not on understanding how to prevent the next issue. That will be the priority over the next couple of weeks.
Any way you look at it, it was a bad, humbling, embarrassing day that we have to learn from and prevent from ever happening again.
I apologize to all of our affected customers and hope you’ll give us a chance to learn and continue to deliver you a great service.
Brian