Root Cause Analysis - Service Disruption - 05/06/2018
Published: May 10, 2018
Updated: May 10, 2018
Root Cause Analysis: Service Degradation May 6, 2018
Problem Description & Impact
On Sunday, May 6, 2018, beginning at approximately 7:03pm PDT, Okta experienced a service degradation in US Cell 2, during which admins in that cell may have experienced slightly elevated error rates. Administrators, as well as integrations making API update calls, would also have experienced an extended Read-Only mode until the issue was fully resolved at 7:45pm PDT. End user authentication was not affected during this time.
The service degradation was the result of a hardware failure in the primary database infrastructure. The Read-Only mode occurred during the database primacy change, as a function of our failover to the secondary database tier.
Mitigation Steps and Recommended Future Preventative Measures
At approximately 7:03pm PDT, Okta’s proactive monitoring alerted on Read-Only mode operation in US Cell 2. The Okta operations team responded to the problem and took quick action to route traffic back to the primary database infrastructure. Authentication requests that were in flight during this time all succeeded. To prevent this issue from recurring, Okta worked with Amazon Web Services to identify and mitigate the affected hardware infrastructure components. Okta is looking to review and enhance its recovery procedures to minimize the impact when such failures are encountered.