Root Cause Analysis:
February 2, 2017 – Okta Service Disruption
Problem Description & Impact:
On Thursday, February 2, 2017, between 4:16am and 6:20am PST, Okta had a service disruption in US Cell 2 during which end-users experienced latency and errors while accessing Office 365. Between 4:16am and 5:05am PST, end-users in US Cell 2 experienced an average 25% error rate connecting to Office 365. Error rates increased between 5:06am and 6:20am PST, and peaked as high as 95% at 6:01am PST. During this time, users may also have experienced latency or errors when connecting to other applications.
Between 5:30am and 6:05am PST, end-users in US Cells 1 – 4 also experienced sporadic latency and timeouts, with latency peaking at 40%, while errors and timeouts remained below 10% of requests for interactive user sessions. This degradation was mitigated and service returned to normal at approximately 6:20am PST.
The initial service degradation in US Cell 2 was the result of an anomalous spike in Office 365 authentication requests which began at approximately 4:11am PST. The spike caused an overload condition within Okta’s Proxy Server and Office 365 Web Application tiers. At 5:30am PST, the spike in Office 365 requests resulted in an increase in queued requests and errors which impacted system responsiveness in US Cells 1 – 4.
While Okta has rate limiting in place at multiple layers within the infrastructure, the overload condition which occurred during this event exposed an incorrect rate limit setting for Office 365 requests, which has been resolved.
The cause of the Office 365 authentication spike is also currently under investigation by Okta and will be appended to this Root Cause Analysis once identified and resolved.
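For readers unfamiliar with per-application rate limiting, the following is a minimal illustrative sketch of a token-bucket limiter keyed by application. It is not Okta's implementation: the class names, the "office365" key, and the rate and capacity values are all hypothetical, chosen only to show how a limit set too high (or missing) for one application would let a request spike through to the tiers behind it.

```python
import time


class TokenBucket:
    """Token bucket: allows bursts up to `capacity`, refills at `rate` tokens/sec."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.last = clock()

    def allow(self):
        # Refill based on elapsed time, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerAppLimiter:
    """Hypothetical per-application limiter; apps without a configured limit pass through."""

    def __init__(self, limits, clock=time.monotonic):
        # limits maps app name -> (rate per second, burst capacity); values are assumptions.
        self.buckets = {
            app: TokenBucket(rate, capacity, clock)
            for app, (rate, capacity) in limits.items()
        }

    def allow(self, app):
        bucket = self.buckets.get(app)
        return True if bucket is None else bucket.allow()


if __name__ == "__main__":
    limiter = PerAppLimiter({"office365": (2.0, 3)})
    # A burst of five back-to-back requests against a burst capacity of three.
    print([limiter.allow("office365") for _ in range(5)])
```

In this sketch, requests beyond the configured burst are rejected at the limiter rather than queued, so an overload in one application's traffic is shed before it reaches shared tiers.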
Mitigation Steps and Recommended Future Preventative Measures:
At 4:16am PST, Okta was alerted to errors occurring at our Office 365 Application tier due to the increase in authentication requests. After initial assessment, Okta began mitigation steps by tuning thread capacity within the Office 365 Web Application tier and adding Office 365 Web Application server capacity to respond to the increased request volume.
At approximately 5:30am PST, the increase in Office 365 authentication requests began impacting US Cells 1 – 4. Okta implemented further mitigation steps, both adding capacity to the router and Office 365 tiers to absorb the additional authentication request traffic and blocking the offending Office 365 authentication requests, which returned service to normal.
Okta is taking the following actions in response to this service disruption: