Root Cause Analysis - Service Disruption - 11/06/2017
Published: Nov 16, 2017
Updated: Jun 22, 2018
Root Cause Analysis: Minor Service Disruption November 6, 2017
Problem Description & Impact
On Monday, November 6, 2017, Okta experienced a service disruption in US Cell 4 between 12:34pm and 12:56pm PST. During this window, user- and API-initiated requests intermittently received HTTP 502 error responses. The error rate in US Cell 4 averaged around 4% and peaked at 7% at 12:50pm PST.
Root Cause
At 12:36pm PST, Okta’s proactive monitoring alerted us to a growing number of HTTP 502 errors on a majority of the primary servers in the US Cell 4 app tier, which services API and interactive user requests. Post-mortem analysis found that the issue was precipitated by an internal API endpoint being triggered repeatedly with a query that caused extremely high resource utilization across 75% of the nodes in the tier. This in turn drove up resource utilization on the remaining primary nodes and the failover nodes, which were now handling all traffic for the tier.
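The cascade described in the root cause can be illustrated with simple arithmetic. The sketch below is only an illustration, not Okta's actual capacity model: the node counts, the request rate, and the function name `load_per_healthy_node` are hypothetical. It assumes traffic is balanced evenly and that saturated nodes serve no traffic, both of which are simplifications.

```python
def load_per_healthy_node(total_load, primary_nodes, failover_nodes, saturated_fraction):
    """Return the load carried by each node that is still serving traffic.

    Assumes even load balancing and that saturated nodes take no traffic
    (simplifying assumptions, not details from the incident report).
    """
    saturated = round(primary_nodes * saturated_fraction)
    healthy = (primary_nodes - saturated) + failover_nodes
    if healthy <= 0:
        raise ValueError("no healthy nodes left to serve traffic")
    return total_load / healthy


# Hypothetical tier: 8 primary nodes, 2 failover nodes, 800 requests/s in total.
baseline = load_per_healthy_node(800, 8, 0, 0.0)    # all 8 primaries share the load
degraded = load_per_healthy_node(800, 8, 2, 0.75)   # 2 primaries + 2 failover nodes
print(baseline, degraded)
```

With these made-up numbers, each surviving node carries twice its normal load even after the failover nodes join, which is the kind of pressure that can push the remaining nodes toward saturation as well.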
Mitigating Steps & Corrective Actions
By 12:47pm PST, Okta’s operations team had responded to the problem and begun taking mitigation actions to reduce resource utilization across the affected nodes. After this activity, the HTTP 502 errors abated. To prevent an immediate recurrence, a full block was also placed on the API endpoint in question. Following these immediate mitigation actions, an expedited code fix was made to change the affected query and prevent future recurrence.
In response to this event, Okta has also reviewed all queries that may use the same code path to ensure that no other query can trigger such a condition in the future. In addition, more read-only failover nodes are being deployed in each Okta cell to ensure sufficient capacity to service user requests in a timely manner during automatic failover scenarios.
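As a rough illustration of what "sufficient failover capacity" could mean in practice, the sketch below estimates how many failover nodes a tier would need to keep every serving node under a target utilization. It is a minimal sketch under stated assumptions: the function name `failover_nodes_needed`, the per-node capacity, and the utilization ceiling are hypothetical and do not come from the report.

```python
import math


def failover_nodes_needed(total_load, node_capacity, healthy_primaries, max_utilization=0.75):
    """Estimate the failover nodes required to stay under a utilization ceiling.

    All inputs are hypothetical; load and capacity share one unit (e.g. requests/s).
    """
    usable_per_node = node_capacity * max_utilization
    required_nodes = math.ceil(total_load / usable_per_node)
    # Healthy primaries already cover part of the requirement.
    return max(0, required_nodes - healthy_primaries)


# Hypothetical: 800 requests/s, nodes rated for 200 requests/s, 2 primaries still healthy.
print(failover_nodes_needed(800, 200, 2))
```

A real sizing exercise would also need to account for uneven traffic, warm-up time, and read-only versus read-write workloads, none of which this sketch models.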