https://support.okta.com/help/oktaarticledetailpage?childcateg=&id=ka02a000000bndysay&source=documentation&refurl=http%3a%2f%2fsupport.okta.com%2fhelp%2fdocumentation%2fknowledge_article%2froot-cause-analysis-may-5-2017-service-disruption-database-failover
Root Cause Analysis - May 5 2017 - Service Disruption - Database Failover
Published: May 9, 2017   -   Updated: May 9, 2017

Root Cause Analysis: 
May 5, 2017 – Service Disruption (Database Failover)


Problem Description & Impact: 
On Friday, May 5, 2017, at approximately 9:55am PDT, Okta experienced a service disruption in US Cell 2 during which an average of approximately 4% of users experienced login failures. Administrators and API users were unable to make system configuration changes during this time, and Administrators saw a banner notification indicating that the Okta system was in Read-Only mode. The disruption continued until 10:11am PDT (about 16 minutes), at which point service returned to normal.


Root Cause: 
The service disruption resulted from a kernel misconfiguration that was applied at approximately 9:55am PDT in response to an intermittent file system deadlock found on the database servers. This misconfiguration triggered an error condition, followed by a database failover and recovery.


Mitigation Steps and Future Preventative Measures: 
Upon detecting the database error and unresponsiveness, Okta immediately began automated database failover and recovery procedures. Authentication requests that were in flight during the error failed. In most cases, error handling and retry logic then executed, allowing the end-user requests to complete successfully.
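
For readers unfamiliar with the pattern, retry logic of the kind mentioned above can be sketched as follows. This is a minimal illustration only, not Okta's implementation: the names `TransientError` and `call_with_retry`, the attempt count, and the delay values are all assumptions chosen for the example.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for a failure that may succeed if retried,
    for example a request issued while a database failover is in progress."""


def call_with_retry(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run `operation`, retrying on TransientError with exponential backoff.

    The delay doubles after each failed attempt (base_delay, 2 * base_delay, ...).
    Once the final attempt fails, the error is re-raised to the caller.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_login():
        # Hypothetical request that fails twice, then succeeds.
        calls["count"] += 1
        if calls["count"] < 3:
            raise TransientError("database failover in progress")
        return "authenticated"

    print(call_with_retry(flaky_login, base_delay=0.1))
```

Requests that exhaust their retries still fail, which is consistent with the report that only most, not all, in-flight requests recovered.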

Okta has identified and remedied the configuration action that triggered the database error condition, to prevent this issue from recurring.
