Root Cause Analysis:
April 18, 2017 – Okta Feature Disruption (SMS Delays)
Problem Description & Impact:
On Tuesday, April 18, 2017, at approximately 10:15am PDT, Okta experienced an issue in all cells in which users encountered extended delays in receiving SMS notifications for multi-factor authentication, password reset, and SMS factor registration. Queued SMS messages were significantly delayed, though latency improved over time until the issue was fully resolved at 11:45am PDT. All other multi-factor authentication methods were unaffected by this disruption.
At approximately 10:15am PDT, one of Okta's redundant SMS providers failed due to a data center failure within that provider. SMS retries were directed to our alternate SMS provider, which then became overloaded by the significantly increased traffic.
Shortly after detection and assessment of the SMS delivery issue, Okta took steps to redistribute SMS traffic to a redundant SMS provider. Rebalancing work continued over the course of the event until the SMS provider service was fully restored.
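To illustrate the overload condition described above, the following Python sketch shows one way SMS traffic could be split across healthy providers in proportion to their capacity, with any excess reported as overflow to be queued. It is an illustration only and does not represent Okta's actual integration: the Provider class, the allocate function, and the capacity figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    """Hypothetical SMS provider; the capacity figure is an assumption."""
    name: str
    capacity_per_sec: int  # assumed maximum sustained send rate
    healthy: bool = True


def allocate(providers, demand_per_sec):
    """Split demand across healthy providers in proportion to capacity.

    Returns (allocation, overflow). Overflow is the part of the demand
    that exceeds total healthy capacity and would have to be queued.
    """
    healthy = [p for p in providers if p.healthy]
    total = sum(p.capacity_per_sec for p in healthy)
    if total == 0:
        return {}, demand_per_sec

    served = min(demand_per_sec, total)
    # Proportional share, rounded down, so no provider exceeds its capacity.
    allocation = {p.name: served * p.capacity_per_sec // total for p in healthy}

    # Hand out whatever integer division left over to providers with headroom.
    remainder = served - sum(allocation.values())
    for p in healthy:
        if remainder == 0:
            break
        extra = min(p.capacity_per_sec - allocation[p.name], remainder)
        allocation[p.name] += extra
        remainder -= extra

    return allocation, demand_per_sec - served
```

With hypothetical capacities of 100 and 40 messages per second and a demand of 90, both providers together absorb the load. If the larger provider is marked unhealthy, the smaller one is capped at 40 and the remaining 50 messages per second become overflow, which corresponds to the kind of queuing delay seen in this event.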
Okta is taking the following steps to prevent this issue from occurring again:
Okta is re-engineering its SMS integration to ensure that all failover providers have the same capacity. The new integration will also enhance our load-balancing capabilities and improve our monitoring and diagnostic tools.
ETA: by 5/31/2017
Okta is adding new operational alerting based on end-user SMS retry behavior. This additional alerting will provide earlier detection of, and faster response to, SMS-related issues in the future.
ETA: by 4/30/2017
Okta will be implementing additional internal test tools for Support and Engineering staff to quickly identify and respond to SMS service provider issues.
ETA: by 5/31/2017
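As a rough illustration of alerting based on end-user SMS retry behavior, the Python sketch below tracks the share of SMS requests in a sliding time window that are retries and signals when that share crosses a threshold. It is a minimal sketch under stated assumptions, not a description of Okta's alerting: the RetryRateAlert class, the window length, the threshold, and the minimum request count are all hypothetical.

```python
from collections import deque


class RetryRateAlert:
    """Hypothetical sliding-window alert on the fraction of SMS retries."""

    def __init__(self, window_sec=300, threshold=0.2, min_requests=50):
        self.window_sec = window_sec      # assumed window length in seconds
        self.threshold = threshold        # assumed retry fraction that alerts
        self.min_requests = min_requests  # ignore windows with too few requests
        self.events = deque()             # (timestamp, is_retry) pairs
        self.retries = 0                  # retries currently inside the window

    def record(self, timestamp, is_retry):
        """Record one SMS request; is_retry marks a user-initiated resend."""
        self.events.append((timestamp, is_retry))
        self.retries += int(is_retry)
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= now - self.window_sec:
            _, was_retry = self.events.popleft()
            self.retries -= int(was_retry)

    def should_alert(self, now):
        """Return True when the retry fraction in the window is too high."""
        self._evict(now)
        total = len(self.events)
        if total < self.min_requests:
            return False
        return self.retries / total >= self.threshold
```

The minimum request count is included so that a handful of retries during a quiet period does not trigger an alert; suitable values would depend on real traffic patterns.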