Root Cause Analysis:
Problem description & Impact:
March 7, 2018
On March 7th, beginning at 5:15pm PST, Okta experienced a service disruption in US Cell 6. End-User, Admin and API requests within Okta tenants in US Cell 6 were slow, unresponsive, or encountered HTTP 5xx errors. Okta took corrective actions to address the issue and service was returned to normal in US Cell 6 by 6:08pm PST.
Root Cause:
The disruption was caused by a memory issue in the cell's Redis cluster, which is used to cache application data. The root cause of the memory issue was traced to a combination of high memory usage within the Redis cluster, a misconfigured application server that increased the volume of caching above expected levels, and sub-optimal Redis configuration settings. Because both the primary and secondary Redis tiers were equally impacted by the increased memory demands, the cluster was unable to recover.
Once the Redis cluster could no longer service requests for cached objects, web requests experienced increased latency and eventually failed to retrieve the cached objects, resulting in HTTP 500 errors.
Mitigating Steps & Corrective Actions:
At 5:15pm PST, Okta’s operational monitoring alerted Okta’s technical operations team to the increased latency and errors. To stabilize the Redis cluster, additional capacity was added, and application servers were recycled to clear hung connections and rebalance load. By 6:08pm PST, the cell had returned to normal performance levels.
To prevent future recurrence, Okta has undertaken the following actions:
- Okta has completed work to up-size our Redis caching clusters in US Cell 6 and will deploy the same mitigating changes across all remaining cells.
- Okta will conduct exercises to replay this failure sequence within a test environment to determine and subsequently deploy the optimal settings for Redis memory management, client connection configuration and health-check thresholds.
- Okta is investigating new tooling to allow us to flush the Redis cache should such a mitigation step be needed in the future.
- Okta has implemented additional monitoring and alerting to detect and act on similar error conditions, improving our mitigation response in the future.
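
As an illustration of the kind of memory alerting described in the actions above, the following is a minimal sketch only and not Okta's implementation. It assumes each cache node reports its used and maximum memory in bytes (for example, the used_memory and maxmemory fields of the Redis INFO command). The threshold values, the MemoryThresholds class, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryThresholds:
    # Hypothetical ratios chosen for illustration; not Okta's settings.
    warn_ratio: float = 0.75
    critical_ratio: float = 0.90


def classify_memory(used_bytes: int, max_bytes: int,
                    thresholds: MemoryThresholds = MemoryThresholds()) -> str:
    """Return "ok", "warn", or "critical" for a single cache node."""
    if max_bytes <= 0:
        # Without a positive memory limit no usage ratio can be computed.
        raise ValueError("max_bytes must be positive")
    ratio = used_bytes / max_bytes
    if ratio >= thresholds.critical_ratio:
        return "critical"
    if ratio >= thresholds.warn_ratio:
        return "warn"
    return "ok"


def cluster_status(nodes: dict[str, tuple[int, int]]) -> str:
    """Return the worst status across all nodes, primary and secondary alike.

    `nodes` maps a node name to a (used_bytes, max_bytes) pair.
    """
    severity = {"ok": 0, "warn": 1, "critical": 2}
    statuses = [classify_memory(used, limit) for used, limit in nodes.values()]
    return max(statuses, key=severity.__getitem__, default="ok")
```

Evaluating every node rather than only the primary reflects the observation above that both Redis tiers were impacted at the same time; a check limited to one tier could miss the loss of failover headroom.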