The ShipHawk API and Web Portal were unavailable between 9:57 AM and 11:15 AM Pacific Time on 7/28/2022.
The incident was caused by an AWS outage in the US-EAST-2 region.
The incident was detected at 10:02 AM Pacific Time, when the internal alerting system reported an outage. Some of the application servers, the primary database node, and the search engine nodes were not accessible.
Further investigation showed that the disk volumes attached to the primary database were completely inaccessible. We eventually traced this to a major outage in the AWS US-EAST-2a availability zone.
At 10:30 AM, after confirming that the issues were caused by the US-EAST-2a outage, the DevOps team began switching to the database replica, which is located in a different AWS availability zone. The switch was completed at 11:09 AM, and it took an additional 10 minutes for all services to fully recover.
All times are Pacific Time.
09:57 AM - the system response time started growing
10:02 AM - internal notification systems signaled an outage of the primary database node
10:07 AM - the engineering team started the investigation
10:30 AM - the root cause was identified and the team started working on a recovery plan
11:09 AM - the database replica was promoted to a primary node
11:19 AM - the system fully recovered
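
To make the length of each phase easier to read off, the timeline above can be turned into per-phase durations. The following is a minimal sketch, not part of any ShipHawk tooling: the timestamps are copied from the timeline, while the short event labels and the function names `minutes_between` and `phase_durations` are illustrative assumptions.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all Pacific Time, same day).
# The short labels are paraphrases added for this sketch.
TIMELINE = [
    ("09:57 AM", "response time started growing"),
    ("10:02 AM", "outage alert raised"),
    ("10:07 AM", "investigation started"),
    ("10:30 AM", "root cause identified"),
    ("11:09 AM", "replica promoted to primary"),
    ("11:19 AM", "system fully recovered"),
]

TIME_FORMAT = "%I:%M %p"  # 12-hour clock with AM/PM


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes elapsed between two same-day timestamps."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return int(delta.total_seconds() // 60)


def phase_durations(timeline):
    """Pair each event with the next one and compute the gap in minutes."""
    return [
        (f"{a_label} -> {b_label}", minutes_between(a_time, b_time))
        for (a_time, a_label), (b_time, b_label) in zip(timeline, timeline[1:])
    ]


if __name__ == "__main__":
    for label, minutes in phase_durations(TIMELINE):
        print(f"{minutes:>3} min  {label}")
    total = minutes_between(TIMELINE[0][0], TIMELINE[-1][0])
    print(f"{total:>3} min  first symptom -> full recovery")
```

Based on the timeline entries alone, the longest phase is the 39 minutes between identifying the root cause and promoting the replica.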