AWS US-East 2a outage

Incident Report for ShipHawk

Postmortem

Incident summary

The ShipHawk API and Web Portal were not available between 9:57 AM and 11:19 AM Pacific Time on 7/28/2022.

The incident was caused by an AWS outage in the US-EAST-2a availability zone.

Detection

This incident was detected at 10:02 AM Pacific Time, when the internal alerting system reported an outage. Some of the application servers, the primary database node, and the search engine nodes were not accessible.

After further investigation, we found that the disk volumes attached to the primary database were completely inaccessible. Eventually, we found that this was caused by a major outage in the AWS US-EAST-2a availability zone.

Recovery

At 10:30 AM, once it was confirmed that the issues were caused by the US-EAST-2a outage, the DevOps team began switching to the database replica, which is located in a different AWS availability zone. The switch was completed at 11:09 AM, and it took an additional 10 minutes for all services to fully recover.
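
The selection step in this recovery (promote a replica that sits outside the failed availability zone) can be illustrated with a minimal sketch. This is not ShipHawk's actual tooling: the `DbNode` type, the `pick_promotion_candidate` function, the node names, and the replica's zone (`us-east-2b`) are all hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DbNode:
    """Hypothetical description of a database node (not a real API)."""
    name: str
    zone: str      # availability zone the node runs in
    role: str      # "primary" or "replica"
    healthy: bool  # result of the most recent health check


def pick_promotion_candidate(nodes: List[DbNode], failed_zone: str) -> Optional[DbNode]:
    """Return the first healthy replica outside the failed zone, or None."""
    for node in nodes:
        if node.role == "replica" and node.healthy and node.zone != failed_zone:
            return node
    return None


# Assumed topology: the replica's zone is not stated in the report.
nodes = [
    DbNode("db-primary", "us-east-2a", "primary", healthy=False),
    DbNode("db-replica-1", "us-east-2b", "replica", healthy=True),
]

candidate = pick_promotion_candidate(nodes, failed_zone="us-east-2a")
print(candidate.name if candidate else "no candidate available")
```

If no healthy replica exists outside the failed zone, the function returns None, which is the case that adding more availability zones is meant to make less likely.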

Timeline

All times are Pacific Time.

09:57 AM - system response times started increasing

10:02 AM - internal notification systems reported the primary database node outage

10:07 AM - the engineering team started the investigation

10:30 AM - the root cause was identified and the team started working on the recovery plan

11:09 AM - the database replica was promoted to primary node

11:19 AM - the system fully recovered
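
The intervals implied by this timeline can be worked out with a short sketch. The phase names (`time to detect` and so on) are labels chosen here for illustration; only the timestamps come from the timeline above.

```python
from datetime import datetime

# Timestamps copied from the timeline (all Pacific Time, same day).
TIMELINE = {
    "degradation_start": "09:57 AM",
    "alert_fired": "10:02 AM",
    "root_cause_identified": "10:30 AM",
    "replica_promoted": "11:09 AM",
    "fully_recovered": "11:19 AM",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day 'HH:MM AM/PM' timestamps."""
    fmt = "%I:%M %p"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between(TIMELINE["degradation_start"], TIMELINE["alert_fired"])
time_to_diagnose = minutes_between(TIMELINE["alert_fired"], TIMELINE["root_cause_identified"])
failover_duration = minutes_between(TIMELINE["root_cause_identified"], TIMELINE["replica_promoted"])
total_outage = minutes_between(TIMELINE["degradation_start"], TIMELINE["fully_recovered"])

print(f"time to detect:    {time_to_detect} min")     # 5 min
print(f"time to diagnose:  {time_to_diagnose} min")   # 28 min
print(f"failover duration: {failover_duration} min")  # 39 min
print(f"total outage:      {total_outage} min")       # 82 min
```

The 39-minute failover is the largest single interval, which is the step the second corrective action below targets.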

Corrective actions

Increase the number of availability zones to minimize the effect of a potential AWS outage.
Reduce the time it takes to switch to redundant availability zones.

Posted Aug 01, 2022 - 13:59 PDT

Resolved

This incident is fully resolved.

Customer impact: Customers were not able to use ShipHawk services.

Start Time: 9:57am Pacific Time
End Time: 11:25am Pacific Time

Posted Jul 28, 2022 - 13:28 PDT

Monitoring

ShipHawk services are now back online. We will continue to monitor as services are restored.

To follow updates from Amazon, please see: https://health.aws.amazon.com/health/status

Customer impact: Customers are not able to use ShipHawk services.

Start Time: 9:57am Pacific Time
End Time: 11:25am Pacific Time

Posted Jul 28, 2022 - 11:27 PDT

Update

It appears that Amazon hosting (AWS) in US-East 2a is experiencing an outage. Our DevOps team is actively working to restore ShipHawk by switching to an AWS facility that is not impacted by this outage. We expect to restore services soon.

To follow updates from Amazon, please see: https://health.aws.amazon.com/health/status

Customer impact: Customers are not able to use ShipHawk services.

Start Time: 9:57am Pacific Time

Posted Jul 28, 2022 - 11:06 PDT

Investigating

We are currently investigating this issue.

Posted Jul 28, 2022 - 10:11 PDT

This incident affected: ShipHawk Application.