Root Cause Analysis:
May 4-5, 2017 – Feature Disruption (Intermittent E-mail Delivery)
Problem Description & Impact:
On Thursday, May 4, 2017, at approximately 8:26am PDT, Okta experienced a feature disruption impacting US cells 1-3 whereby approximately 33% of queued e-mails failed to send to the intended recipients. The feature disruption continued until Friday, May 5, at 10:47am PDT, when the issue was fully resolved.
The issue was the result of a bug in a deployment automation script which resulted in an incomplete configuration being applied to an email relay server which had recently been relaunched following a hardware failure. E-mail requests routed through the misconfigured server failed to send correctly to the intended recipient. Subsequent retries routed through other redundant e-mail relay servers were successfully delivered.
Mitigation Steps and Future Preventative Steps:
Okta identified and resolved the e-mail relay server configuration issue by re-deploying the correct deployment automation script. Normal e-mail delivery was confirmed to be fully functional at 10:47am PDT.
Okta is resolving the deployment script bug and is adding additional server launch validation steps. Okta has also added additional monitoring to catch any similar future post-deployment failure conditions within the e-mail relay server infrastructure.