Root Cause Analysis:
May 5, 2017 – Service Disruption (Database Failover)
Problem Description & Impact:
On Friday, May 5, 2017, at approximately 9:55am PDT, Okta experienced a service disruption in US Cell 2 during which an average of approximately 4% of users experienced login failures. Administrators and API users were unable to make system configuration changes during this time, and Administrators were presented with a banner notification indicating that the Okta system was in Read-Only mode. The disruption lasted until 10:11am PDT (approximately 16 minutes), at which point the service returned to normal.
The service disruption was the result of a kernel misconfiguration that was applied at approximately 9:55am PDT in response to an intermittent file system deadlock found on the database servers. This misconfiguration triggered an error condition, followed by a database failover and recovery.
Mitigation Steps and Future Preventative Measures:
Upon detecting the database error and unresponsiveness, Okta immediately began automated database failover and recovery procedures. Authentication requests that were in flight when the error occurred failed. In most cases, error handling and retry logic then executed, allowing the end-user requests to complete successfully.
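As an illustration only, retry logic of the kind described above is often written as a bounded retry loop with exponential backoff. The sketch below is not Okta's implementation; the names `TransientDatabaseError` and `retry_with_backoff`, the attempt limit, and the delay values are all hypothetical and chosen for the example.

```python
import time


class TransientDatabaseError(Exception):
    """Hypothetical error raised while the database is failing over."""


def retry_with_backoff(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on a transient error, wait and retry.

    The wait doubles after each failed attempt (0.5s, 1s, 2s, ...).
    The final failure is re-raised so the caller can report it.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientDatabaseError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

With a pattern like this, a request that fails while the database is failing over can succeed on a later attempt once recovery completes, while a request that exhausts its attempts still fails visibly rather than hanging.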
Okta has identified and remedied the configuration action that triggered the database error condition, in order to prevent this issue from recurring.