Yesterday evening Pacific Standard Time, Azure storage services experienced a service interruption across the United States, Europe and parts of Asia, which impacted multiple cloud services in these regions. I want to first sincerely apologize for the disruption this has caused. We know our customers put their trust in us and we take that very seriously. I want to provide some background on the issue that has occurred.
As part of a performance update to Azure Storage, an issue was discovered that resulted in reduced capacity across services utilizing Azure Storage, including Virtual Machines, Visual Studio Online, Websites, Search and other Microsoft services. Prior to applying the performance update, it had been tested over several weeks in a subset of our customer-facing storage service for Azure Tables. We typically call this “flighting,” as we work to identify issues before we broadly deploy any updates. The flighting test demonstrated a notable performance improvement and we proceeded to deploy the update across the storage service. During the rollout we discovered an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting. The net result was an inability for the front ends to take on further traffic, which in turn caused other services built on top to experience issues.
Once we detected this issue, the change was rolled back promptly, but a restart of the storage front ends was required in order to fully undo the update. Once the mitigation steps were deployed, most of our customers started seeing the availability improvement across the affected regions. While services are generally back online, a limited subset of customers are still experiencing intermittent issues, and our engineering and support teams are actively engaged to help customers through this time.
When we have an incident like this, our main focus is rapid time to recovery for our customers, but we also work to closely examine what went wrong and ensure it never happens again. We will continually work to improve our customers’ experiences on our platform. We will update this blog with a RCA (root cause analysis) to ensure customers understand how we have addressed the issue and the improvements we will make going forward.