Root Cause Analysis - Service Disruption - 05182018
Published: May 22, 2018   -   Updated: Jun 22, 2018

Root Cause Analysis 

May 17th, 2018 
Service Disruption  

 

Problem Description and Impact 


On May 17th, beginning at 11:28am PDT, Okta experienced a service disruption in US Cell 6. Users attempting to authenticate to their Okta tenant may have experienced "Service not available: No session DB's available" or HTTP 5xx errors. Throughout the incident, approximately 30% of US Cell 6 users attempting primary authentication or multi-factor authentication encountered these errors. Users were typically able to resolve the issue by retrying the request. End users and administrators who already had an active session with Okta were not impacted. Okta took mitigating steps, and the issue was resolved by 12:19pm PDT.
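Because retrying the request typically succeeded, an API client calling an affected service could automate that retry. The following is a minimal sketch of a retry with exponential backoff, not Okta's implementation or SDK behavior. The names `TransientServiceError` and `call_with_retries`, and the attempt and delay values, are hypothetical choices made for illustration.

```python
import random
import time


class TransientServiceError(Exception):
    """Hypothetical error standing in for an HTTP 5xx response."""


def call_with_retries(request, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call request() and retry on TransientServiceError.

    The delay doubles after each failed attempt, and random jitter is
    added so that many clients do not retry at the same instant.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except TransientServiceError:
            if attempt == max_attempts:
                # Out of attempts: surface the error to the caller.
                raise
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

Capping the number of attempts matters here: unbounded retries against a subsystem that is already at capacity can add load and prolong an incident.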
 

Root Cause 


Okta's investigation of the issue found that the service disruption occurred when the cell's session token management subsystem reached capacity as the result of an unexpected increase in the number of active sessions persisted within the cluster. Due to a configuration error, the monitoring in place at the time did not adequately detect and alert on the increase in resource consumption, so users encountered errors before mitigating steps were taken.
 

Mitigating Steps and Future Preventive Measures 


At approximately 11:47am PDT, Okta began taking steps to mitigate the resource consumption issues within the session token management subsystem.  Upon completion of these mitigating activities, the service was returned to normal at approximately 12:19pm PDT. 

To prevent a recurrence of this service disruption, Okta has taken the following actions:  
  • Okta has deployed multiple monitoring/alerting improvements along with improved response and mitigation procedures.  These enhancements will allow us to more quickly identify similar issues in the future and improve our execution of mitigating steps once identified. 
  • Okta has significantly increased system resources within our session token management subsystem. 
  • Okta is conducting a thorough architectural review of our session token management subsystem to identify opportunities to increase resiliency and scalability.
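
As an illustration of the kind of capacity monitoring described above, the following is a minimal sketch of a utilization check that validates its own thresholds at load time, so a misconfigured alert fails loudly instead of silently never firing. It is not Okta's monitoring implementation. The names `CapacityAlertConfig` and `evaluate_capacity`, and the 70% and 85% thresholds, are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityAlertConfig:
    """Hypothetical alert thresholds, as fractions of total capacity."""
    warn_ratio: float = 0.70
    page_ratio: float = 0.85

    def __post_init__(self):
        # Reject configurations that could never alert correctly,
        # for example a paging threshold above 100% of capacity.
        if not (0.0 < self.warn_ratio < self.page_ratio <= 1.0):
            raise ValueError(
                "thresholds must satisfy 0 < warn_ratio < page_ratio <= 1"
            )


def evaluate_capacity(active_sessions, max_sessions, config=CapacityAlertConfig()):
    """Return "ok", "warn", or "page" for the current session store utilization."""
    if max_sessions <= 0:
        raise ValueError("max_sessions must be positive")
    utilization = active_sessions / max_sessions
    if utilization >= config.page_ratio:
        return "page"
    if utilization >= config.warn_ratio:
        return "warn"
    return "ok"
```

For example, under these assumed thresholds `evaluate_capacity(90, 100)` returns `"page"`, while constructing `CapacityAlertConfig(warn_ratio=0.9, page_ratio=0.8)` raises a `ValueError` rather than producing an alert that cannot fire in the intended order.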