Root Cause Analysis: Service Disruption - 05152017
Published: May 18, 2017   -   Updated: Jun 22, 2018
Root Cause Analysis: 
May 15, 2017 – Service Degradation
 
Problem Description & Impact: 

On Monday, May 15, 2017, between 6:01am and 6:38am PDT, Okta experienced a service degradation in US Cell 2 in which interactive user sessions and API requests experienced increased response times.  A small number of authentication errors also occurred during this window.

Additionally, from 6:38am until 7:34am PDT, Okta administrators may have noticed delays in job processing.  Service was fully restored at 7:34am PDT.
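The durations of the two impact windows described above can be worked out from the stated timestamps. The following is a minimal sketch; the variable and function names are illustrative and not part of the original report, and the times are simply those quoted above (all PDT).

```python
from datetime import datetime

# Timestamps as stated in the report (all PDT, May 15, 2017).
FMT = "%Y-%m-%d %H:%M"
degradation_start = datetime.strptime("2017-05-15 06:01", FMT)
degradation_end = datetime.strptime("2017-05-15 06:38", FMT)
jobs_restored = datetime.strptime("2017-05-15 07:34", FMT)


def minutes_between(start, end):
    """Return the whole number of minutes from start to end."""
    return int((end - start).total_seconds() // 60)


# Window 1: increased response times for user sessions and API requests.
slow_response_minutes = minutes_between(degradation_start, degradation_end)
# Window 2: delayed job processing while the job servers were paused.
job_delay_minutes = minutes_between(degradation_end, jobs_restored)
# Overall: first symptom to full restoration.
total_minutes = minutes_between(degradation_start, jobs_restored)

print(slow_response_minutes, job_delay_minutes, total_minutes)
```

Under these figures, the slow-response window lasted 37 minutes, the job-processing delay 56 minutes, and the overall event 93 minutes.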

Root Cause: 
At 6:01am PDT, Okta experienced an internal network degradation which caused a decrease in capacity within our web application tier.  This resulted in increased response times for interactive user-sessions and API requests.  Okta addressed the waiting thread condition on the impacted web application servers and system responsiveness returned to normal at 6:38am PDT.

During the investigation, Okta paused job processing in US Cell 2 to help with the root-cause analysis.  Once the network degradation was identified as the root cause, job processing was resumed at 7:34am PDT.

Mitigation Steps:
  • In response to the network degradation and resulting server impact, Okta identified the impacted servers, cleared the error condition, and returned the servers to service.
  • Immediately following the event, Okta partnered with our hosting provider and is conducting a root cause analysis to identify and prevent similar network degradation in the future.
  • Okta is deploying multiple monitoring/alerting improvements along with improved response procedures.  This will allow us to identify similar issues more quickly in the future and to execute mitigating steps more effectively once an issue is identified.