Root Cause Analysis
May 17th, 2018
Problem Description and Impact
On May 17th, beginning at 11:28am PDT, Okta experienced a service disruption in US Cell 6. Users attempting to authenticate to their Okta tenant may have experienced "Service not available: No session DB's available" or HTTP 5xx errors. Throughout the incident, approximately 30% of US Cell 6 users attempting primary authentication or multi-factor authentication encountered these errors. Users were typically able to resolve the issue by retrying the request. End users and administrators who already had an active session with Okta were not impacted. Okta took mitigating steps and the issue was resolved by 12:19pm PDT.
Okta's investigation found that the service disruption occurred when the cell's session token management subsystem reached capacity as the result of an unexpected increase in the number of active sessions persisted within the cluster. Due to a configuration error, the monitoring in place at the time did not adequately detect and alert on the increase in resource consumption, so the errors occurred before mitigating steps were taken.
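The failure mode described above (capacity exhaustion that monitoring did not flag because of a configuration error) can be illustrated with a minimal sketch of a utilization check that validates its own thresholds before use. This is not Okta's monitoring implementation. The names `CapacityThresholds` and `classify_utilization`, and the 70% and 90% threshold values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityThresholds:
    """Hypothetical alert thresholds, expressed as fractions of total capacity."""
    warning: float = 0.70   # assumed value, not from the incident report
    critical: float = 0.90  # assumed value, not from the incident report

    def validate(self) -> None:
        # Reject misconfigured thresholds up front so that a configuration
        # error fails loudly instead of silently disabling the alert.
        if not (0.0 < self.warning < self.critical <= 1.0):
            raise ValueError(
                f"invalid thresholds: warning={self.warning}, critical={self.critical}"
            )


def classify_utilization(
    active_sessions: int,
    capacity: int,
    thresholds: CapacityThresholds = CapacityThresholds(),
) -> str:
    """Return 'ok', 'warning', or 'critical' for the current session count."""
    thresholds.validate()
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    if active_sessions < 0:
        raise ValueError("active_sessions must be non-negative")

    utilization = active_sessions / capacity
    if utilization >= thresholds.critical:
        return "critical"
    if utilization >= thresholds.warning:
        return "warning"
    return "ok"
```

Validating the thresholds on every call is a deliberate choice in this sketch: a misconfigured alert then raises an error immediately, rather than staying silent while resource consumption grows.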
Mitigating Steps and Future Preventive Measures
At approximately 11:47am PDT, Okta began taking steps to mitigate the resource consumption issues within the session token management subsystem. Upon completion of these mitigating activities, the service was returned to normal at approximately 12:19pm PDT.
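For reference, the intervals implied by the timestamps in this report can be worked out as follows. This is a small sketch that treats the stated times as exact, although the report gives 11:47am and 12:19pm only as approximate. The variable and function names are illustrative.

```python
from datetime import datetime

# Timestamps taken from this report (PDT, May 17th, 2018).
disruption_start = datetime(2018, 5, 17, 11, 28)
mitigation_began = datetime(2018, 5, 17, 11, 47)  # approximate per the report
service_restored = datetime(2018, 5, 17, 12, 19)  # approximate per the report


def minutes_between(earlier: datetime, later: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((later - earlier).total_seconds() // 60)


time_to_mitigation_start = minutes_between(disruption_start, mitigation_began)  # 19
mitigation_duration = minutes_between(mitigation_began, service_restored)       # 32
total_disruption = minutes_between(disruption_start, service_restored)          # 51
```

Under these assumptions, roughly 19 minutes passed before mitigation began, mitigation took about 32 minutes, and the disruption lasted about 51 minutes in total.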
To prevent a recurrence of this service disruption, Okta has taken the following actions:
- Okta has deployed multiple monitoring and alerting improvements, along with improved response and mitigation procedures. These enhancements will allow us to identify similar issues more quickly in the future and to execute mitigating steps more effectively once an issue is identified.
- Okta has significantly increased system resources within our session token management subsystem.
- Okta is conducting a thorough architectural review of our session token management subsystem to identify opportunities to increase resiliency and scalability.