Root Cause Analysis: Service Disruption - 02/02/2017
https://support.okta.com/help/oktaarticledetailpage?childcateg=&id=ka02a000000xahnsak&source=documentation&refurl=http%3a%2f%2fsupport.okta.com%2fhelp%2fdocumentation%2fknowledge_article%2froot-cause-analysis-service-disruption-02022017
Published: Feb 4, 2017   -   Updated: Feb 27, 2017

Root Cause Analysis: 

February 2, 2017 – Okta Service Disruption 

 

Problem Description & Impact: 

On Thursday, February 2, 2017, between 4:16am and 6:20am PST, Okta experienced a service disruption in US Cell 2 in which end-users experienced latency and errors while accessing Office 365.  Between 4:16am and 5:05am PST end-users in US Cell 2 experienced an average 25% error rate connecting to Office 365.  Error rates increased between 5:06am and 6:20am PST, and peaked as high as 95% at 6:01am PST.  During this time, users may also have experienced latency or errors when connecting to other applications.

Between 5:30am and 6:05am PST, end-users in US Cells 1 – 4 also experienced sporadic latency and timeouts, with latency peaking at 40%, while errors and timeouts remained below 10% of requests for interactive user sessions.  This degradation was mitigated and service returned to normal at approximately 6:20am PST.

Root Cause: 

The initial service degradation in US Cell 2 was the result of an anomalous spike in Office 365 authentication requests which began at approximately 4:11am PST.  The spike caused an overload condition within Okta’s Proxy Server and Office 365 Web Application tiers.  At 5:30am PST, the spike in Office 365 requests resulted in an increase in queued requests and errors, which impacted system responsiveness in US Cells 1 – 4.

While Okta has rate limiting in place at multiple layers within the infrastructure, the overload condition which occurred during this event exposed an incorrect rate limit setting for Office 365 requests, which has been resolved. 
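
As an illustration of the kind of per-application rate limit described above, the following is a minimal token-bucket sketch. It is not Okta's implementation: the class names, the "office365" key, and the rate and burst values are all hypothetical, chosen only to show how a limit set too high (or too low) changes which requests are admitted.

```python
import time


class TokenBucket:
    """Admits bursts up to `capacity` and a sustained `rate` requests per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                # caller would reject, e.g. with HTTP 429


class PerAppLimiter:
    """Holds one bucket per application; applications without a limit pass through."""

    def __init__(self, limits, clock=time.monotonic):
        # `limits` maps an application name to (rate_per_second, burst_capacity).
        self.buckets = {
            app: TokenBucket(rate, capacity, clock)
            for app, (rate, capacity) in limits.items()
        }

    def allow(self, app):
        bucket = self.buckets.get(app)
        return True if bucket is None else bucket.allow()


if __name__ == "__main__":
    # Hypothetical values: 2 requests/second sustained, bursts of up to 3.
    limiter = PerAppLimiter({"office365": (2.0, 3.0)})
    print([limiter.allow("office365") for _ in range(5)])
```

With these assumed values, a burst of five back-to-back requests admits the first three and rejects the rest until the bucket refills. The clock is injectable so the behavior can be checked deterministically in a test.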

The cause of the Office 365 authentication spike is also currently under investigation by Okta and will be appended to this Root Cause Analysis once identified and resolved. 

Mitigation Steps and Recommended Future Preventative Measures: 

At 4:16am PST, Okta was alerted to errors occurring at our Office 365 Application tier due to the increase in authentication requests.  After initial assessment, Okta began mitigation steps by tuning thread capacity within the Office 365 Web Application tier and adding Office 365 Web Application server capacity to respond to the increased request volume.

At approximately 5:30am PST, the increase in Office 365 authentication requests began impacting US Cells 1 – 4.  Okta implemented further mitigation steps: adding capacity to the router and Office 365 tiers to absorb the additional authentication request traffic, and blocking the offending Office 365 authentication requests, which returned service to normal.

Okta is taking the following actions following this service disruption: 

  1. Okta has tuned the Office 365 rate limit to prevent future occurrences of the overload condition.  Okta is also conducting a review of the rate limiting functionality and will make additional improvements as needed.
    Status:  Complete. 
     

  2. Okta is actively investigating the source of the spike in Office 365 authentication requests, and has engaged Microsoft to assist in this investigation.  The spike in requests has been determined not to have been malicious in nature. 
    Status: In Progress.  Okta will provide a status update to this RCA on 2/10.
    2/10/2017 Update: Okta is continuing the investigation effort with Microsoft to identify the source of the spike in authentication requests which precipitated this event.  To date, the root cause has not been identified.  We will provide an update to this investigation effort by 2/15/2017.
    2/16/2017 Update: The root cause analysis is ongoing to determine the cause of the spike in Office 365 authentication requests.  We will continue to provide updates here until this is fully explored and mitigated.  The next update to this investigation will be no later than 2/24/2017.  In addition to exploring the root cause for the spike, Okta has deployed additional router infrastructure to isolate these requests to further ensure system reliability.
    2/27/2017 Update: Okta continues to investigate the spike in Office 365 requests with Microsoft. All operational changes to block/prevent this spike from impacting Okta in the future are complete.

  3. Okta has deployed additional Office 365 and Router Tier capacity and tuning changes to all production and preview cells. 
    Status:  Complete.
     

  4. Okta is increasing its load testing profile to include overload scenarios to drive further system stability and process improvements. 
    Status:  Complete.
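
To illustrate what an overload scenario in a load-testing profile might look like, here is a small, hedged simulation. It is a sketch only: the tick-based model, the `simulate` function, and every number in it are assumptions made for illustration and do not describe Okta's test harness or production capacity.

```python
def simulate(offered_per_tick, capacity_per_tick, queue_limit):
    """Return the fraction of requests rejected over the whole run.

    Each tick, new requests join the queue, the tier serves up to
    `capacity_per_tick` of them, and anything beyond `queue_limit`
    that is still waiting is shed (counted as an error).
    """
    queued = 0
    total = 0
    rejected = 0
    for offered in offered_per_tick:
        total += offered
        queued += offered
        queued -= min(queued, capacity_per_tick)   # serve what the tier can handle
        overflow = max(0, queued - queue_limit)    # shed the excess backlog
        rejected += overflow
        queued -= overflow
    return rejected / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical profiles: steady traffic below capacity, then a spike to 3x capacity.
    normal = [80] * 10
    overload = [80] * 5 + [300] * 5
    print(simulate(normal, capacity_per_tick=100, queue_limit=50))
    print(simulate(overload, capacity_per_tick=100, queue_limit=50))
```

Under these assumed numbers the steady profile sheds nothing, while the spike profile sheds half of all requests. A profile that only exercises the first case would not reveal how the tier behaves once demand exceeds capacity.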
