Root Cause Analysis: Service Disruption - 05152017
Published: May 18, 2017   -   Updated: Jun 22, 2018
Root Cause Analysis: May 15, 2017 – Service Degradation
Problem Description & Impact: 

On Monday, May 15, 2017, between 6:01am and 6:38am PDT, Okta experienced a service degradation in US Cell 2 in which interactive user sessions and API requests experienced increased response times.  A small number of authentication errors also occurred during this window.

Additionally, from 6:38am until 7:34am PDT, Okta administrators may have noticed delays in job processing.  Service was fully restored at 7:34am PDT.

Root Cause: 
At 6:01am PDT, Okta experienced an internal network degradation that reduced capacity within our web application tier.  This resulted in increased response times for interactive user sessions and API requests.  Okta addressed the waiting thread condition on the impacted web application servers, and system responsiveness returned to normal at 6:38am PDT.

During the investigation, Okta paused job processing in US Cell 2 to help with the root cause analysis.  Once the network degradation was identified as the root cause, job processing was resumed at 7:34am PDT.

Mitigation Steps:
  • In response to the network degradation and resulting server impact, Okta identified the impacted servers, cleared the error condition, and returned the servers to service.
  • Immediately following the event, Okta partnered with our hosting provider and is conducting a root cause analysis to identify and prevent similar network degradation in the future.
  • Okta is deploying multiple monitoring/alerting improvements along with improved response procedures.  This will allow us to more quickly identify similar issues in the future and improve our execution of mitigating steps once identified.
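
As an illustration of the kind of response-time alerting described in the last mitigation step, the following is a minimal sketch only. It does not represent Okta's actual monitoring: the class name, window size, and 2000 ms threshold are all hypothetical values chosen for the example. It keeps a sliding window of recent response times and flags when the 95th percentile exceeds the threshold.

```python
from collections import deque
from statistics import quantiles


class LatencyMonitor:
    """Hypothetical sliding-window monitor that flags sustained slow responses."""

    def __init__(self, window_size=100, p95_threshold_ms=2000.0, min_samples=20):
        # Only the most recent `window_size` samples are retained.
        self.samples = deque(maxlen=window_size)
        self.p95_threshold_ms = p95_threshold_ms  # assumed alert threshold
        self.min_samples = min_samples            # avoid alerting on sparse data

    def record(self, response_time_ms):
        """Add one observed response time, in milliseconds."""
        self.samples.append(response_time_ms)

    def p95(self):
        """Return the 95th percentile of the window, or None if too few samples."""
        if len(self.samples) < self.min_samples:
            return None
        # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
        return quantiles(self.samples, n=20)[-1]

    def should_alert(self):
        """True when the windowed 95th percentile exceeds the threshold."""
        p95 = self.p95()
        return p95 is not None and p95 > self.p95_threshold_ms
```

In use, each completed request would call `record()` with its response time, and a periodic check of `should_alert()` would page the on-call engineer. A percentile over a window is used here rather than a single slow request so that isolated outliers do not trigger an alert.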