Root Cause Analysis - Service Disruption - 11/06/2017
Published: Nov 16, 2017   -   Updated: Jun 22, 2018

Root Cause Analysis:
Minor Service Disruption
November 6, 2017


Problem Description & Impact
On Monday, November 6th, 2017, Okta experienced a service disruption between 12:34pm and 12:56pm PST in US Cell 4. During this window, user- and API-initiated requests intermittently received HTTP 502 error responses. The error rate in US Cell 4 averaged around 4% and peaked at 7% at 12:50pm PST.
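Error-rate figures of this kind are typically derived by bucketing responses into fixed time windows and dividing the HTTP 502 count by total requests. A minimal sketch in Python, using hypothetical per-minute counts rather than the incident's actual traffic:

    # Sketch: deriving an "average ~4%, peak 7%" style error rate from
    # per-minute (total_requests, http_502_count) buckets. Counts are hypothetical.
    def error_rates(windows):
        return [100.0 * errs / total for total, errs in windows if total > 0]

    per_minute = [(120_000, 4_800), (118_000, 5_200), (121_000, 8_470)]  # hypothetical
    rates = error_rates(per_minute)
    print(f"average={sum(rates)/len(rates):.1f}%  peak={max(rates):.1f}%")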

Root Cause
At 12:36pm PST, Okta’s proactive monitoring alerted us to a growing number of HTTP 502 errors reported by a majority of the primary app-tier servers in US Cell 4 responsible for servicing API and interactive user requests. Post-mortem analysis found that the issue was precipitated by an internal API endpoint being invoked repeatedly with a query that drove extremely high resource utilization across 75% of the nodes in the tier. This condition, in turn, produced high resource utilization on the remaining primary nodes and on the failover nodes that were then handling all traffic for the tier.
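The article does not describe the monitoring stack involved; the sketch below shows one way a threshold alert of this kind can be expressed, assuming per-node 502 rates are already being collected. The node names, rates, and thresholds are illustrative assumptions.

    # Fire when a majority of app-tier nodes report an elevated HTTP 502 rate.
    # Thresholds and sample data are assumptions, not Okta's actual values.
    def should_alert(node_502_rates, per_node_threshold=0.02, node_fraction=0.5):
        elevated = [n for n, rate in node_502_rates.items() if rate >= per_node_threshold]
        return len(elevated) / len(node_502_rates) >= node_fraction

    sample = {"app-01": 0.05, "app-02": 0.04, "app-03": 0.01, "app-04": 0.06}  # hypothetical
    if should_alert(sample):
        print("page on-call: elevated 502 rate on a majority of app-tier nodes")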

Mitigating Steps & Corrective Actions
By 12:47pm PST, Okta’s operations team had responded to the problem and begun taking mitigation actions to reduce resource utilization across the affected nodes. Following this mitigation, the HTTP 502 errors abated. To prevent an immediate recurrence, a full block was placed on the API endpoint in question as an additional measure. Following the immediate mitigation actions, an expedited code fix was also made to change the affected query and prevent future recurrence.
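The article does not name the endpoint or the layer at which the block was applied. As one illustration, a full block on a single path can be implemented as a small piece of middleware in front of an application; the path and response below are hypothetical.

    # Sketch of a full block on one internal endpoint, as WSGI middleware.
    # BLOCKED_PREFIX is hypothetical; the affected endpoint is not named in the article.
    BLOCKED_PREFIX = "/internal/expensive-report"

    class EndpointBlock:
        def __init__(self, app, prefix=BLOCKED_PREFIX):
            self.app, self.prefix = app, prefix

        def __call__(self, environ, start_response):
            if environ.get("PATH_INFO", "").startswith(self.prefix):
                start_response("503 Service Unavailable", [("Content-Type", "text/plain")])
                return [b"endpoint temporarily disabled"]
            return self.app(environ, start_response)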

In response to this event, Okta has also performed a review of all queries that use the same code path to ensure that no other queries are able to trigger such a condition in the future. Additional failover read-only nodes are also being deployed in each Okta cell to ensure sufficient capacity to service user requests in a timely manner during automatic failover scenarios.
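The article does not describe how traffic reaches these read-only nodes. The sketch below illustrates one common approach, read/write splitting, so that added replicas can absorb read traffic during failover; the hostnames and routing rule are assumptions, not Okta infrastructure.

    # Sketch of read/write splitting: writes go to the primary, reads are spread
    # across read-only nodes. Hostnames are hypothetical.
    import random

    PRIMARY = "db-primary.cell4.example"
    READ_REPLICAS = ["db-ro-1.cell4.example", "db-ro-2.cell4.example"]

    def pick_host(sql):
        is_read = sql.lstrip().lower().startswith("select")
        return random.choice(READ_REPLICAS) if is_read else PRIMARY

    print(pick_host("SELECT id FROM users WHERE status = 'ACTIVE'"))
    print(pick_host("UPDATE users SET status = 'SUSPENDED' WHERE id = 42"))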