Root Cause Analysis:
May 15, 2017 – Service Degradation
Problem Description & Impact:
On Monday, May 15, 2017, between 6:01am and 6:38am PDT, Okta experienced a service degradation in US Cell 2 in which interactive user sessions and API requests experienced increased response times. A small number of authentication errors also occurred during this period.
Additionally, from 6:38am until 7:34am PDT, Okta administrators may have noticed delays in job processing. Service was fully restored at 7:34am PDT.
Root Cause:
At 6:01am PDT, Okta experienced an internal network degradation that reduced capacity within our web application tier. This resulted in increased response times for interactive user sessions and API requests. Okta addressed the waiting thread condition on the impacted web application servers, and system responsiveness returned to normal at 6:38am PDT.
During the investigation, Okta paused job processing in US Cell 2 to help with the root-cause analysis. Once the network degradation was identified as the root cause, job processing was resumed at 7:34am PDT.