Trouble logging in
Incident Report for ShipHawk
Postmortem

Incident summary

During an internal process that archives data, we noticed that disk usage was beginning to increase and decided to upgrade the volume proactively. Due to internal AWS optimization processes, the upgrade slowed the system, which later led to the incident. We promoted a replica database to primary, and service was restored at 11:45am PST.

Leadup

9:30am PST - we started an internal process that archives data

10:30am PST - internal monitoring systems alerted on rapidly increasing disk usage

10:35am PST - the volume attached to the database servers was upgraded

This change resulted in degraded database performance.

Fault

Due to internal AWS optimization processes, the volume upgrade created slowness in the system, which later led to the incident starting at 10:42am PST.

Impact

Customers hosted on shared instances were not able to use the system from 10:42am PST to 11:45am PST.

Affected services:

  • Web Portal
  • Workstations
  • ShipHawk API

Detection

The incident was detected by the automated monitoring system and was reported by multiple customers.

Response

After receiving the alerts from the monitoring system, the engineering team connected with ShipHawk Customer Success and described the level of impact. The incident notification was posted to https://status.shiphawk.com/

Recovery

Three steps were performed for the service recovery:

  • the primary database node was disabled
  • the database replica was promoted to primary
  • the OLD primary node hostname was pointed to the NEW primary node by updating DNS records
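
The ordering of these steps matters: the replica should only be promoted after the old primary is disabled, and DNS should only be updated after the promotion succeeds. A minimal sketch of a runner that enforces that order and records how long each step takes is shown below. The step names and the no-op actions are hypothetical placeholders, not ShipHawk's actual tooling.

```python
import time
from typing import Callable, List, Tuple

# A step is a human-readable name plus a zero-argument action.
Step = Tuple[str, Callable[[], None]]


def run_failover(steps: List[Step]) -> List[Tuple[str, float]]:
    """Run steps strictly in order and return (name, seconds) for each.

    If an action raises, the exception propagates and later steps are
    not run, so DNS is never repointed at a replica that was not promoted.
    """
    timings = []
    for name, action in steps:
        started = time.perf_counter()
        action()
        timings.append((name, time.perf_counter() - started))
    return timings


if __name__ == "__main__":
    # Hypothetical placeholder actions; real ones would call the
    # database and DNS provider tooling.
    steps = [
        ("disable old primary", lambda: None),
        ("promote replica to primary", lambda: None),
        ("point old primary hostname at new primary", lambda: None),
    ]
    for name, seconds in run_failover(steps):
        print(f"{name}: {seconds:.3f}s")
```

Recording per-step durations makes it easier to see which part of the promotion process takes the longest.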

Timeline

All times are in PST.

10/15/2021:

10:00am - an internal process that archives data started

10:30am - internal monitoring systems alerted on rapidly increasing disk usage

10:35am - the volume attached to the primary database node was upgraded

10:42am - the database performance degraded

10:43am - the monitoring system alerted on multiple errors and API unresponsiveness

10:50am - the engineering team began an investigation of the incident

11:20am - the root cause was understood and the team created an action plan

11:30am - the primary node was disabled and the replica was promoted to primary

11:40am - the OLD primary node hostname was pointed to the NEW primary node by updating DNS records

11:45am - the service was fully restored

1:30pm - a new database replica was created and the sync process started

10/16/2021:

2:30pm - the new database replica sync process finished
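
The 10/15 entries above can be turned into elapsed times for each phase of the incident. The sketch below uses only the times listed in the timeline; the milestone labels are shortened paraphrases, and the helper name `minutes_between` is our own.

```python
from datetime import datetime

TIME_FORMAT = "%I:%M%p"

# Milestones copied from the 10/15/2021 timeline.
MILESTONES = [
    ("database performance degraded", "10:42am"),
    ("monitoring system alerted", "10:43am"),
    ("investigation began", "10:50am"),
    ("root cause understood", "11:20am"),
    ("replica promoted to primary", "11:30am"),
    ("DNS records updated", "11:40am"),
    ("service fully restored", "11:45am"),
]


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day times like '10:42am'."""
    start_time = datetime.strptime(start.upper(), TIME_FORMAT)
    end_time = datetime.strptime(end.upper(), TIME_FORMAT)
    return int((end_time - start_time).total_seconds() // 60)


if __name__ == "__main__":
    # Elapsed minutes between consecutive milestones.
    for (_, prev_time), (label, at) in zip(MILESTONES, MILESTONES[1:]):
        print(f"{label}: +{minutes_between(prev_time, at)} min")
    total = minutes_between(MILESTONES[0][1], MILESTONES[-1][1])
    print(f"total outage: {total} min")  # 63 min
```

By this reckoning the outage lasted 63 minutes, of which 30 minutes passed between the start of the investigation and the root cause being understood.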

Root cause identification: The Five Whys

  1. The application had an outage because the database performance degraded
  2. The database performance degraded because the volume attached to the primary database node was upgraded
  3. The volume was upgraded because disk usage increased rapidly
  4. Disk usage increased rapidly because we ran data archiving processes that used more disk than expected
  5. The extra disk usage was not anticipated because the data archiving process was tested in an environment with a different primary/replica database configuration, so the problem was not identified during tests

Root cause

The difference between the test and production configurations meant that an inefficiency in the data archiving process went undetected.

Lessons learned

  • The test environment requires configuration changes to more closely resemble production
  • The data archiving process should start more slowly
  • The internal process to promote replica databases to primary needs to be faster
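
One way to make an archiving job "start more slowly" is to begin with small batches, grow them gradually, and stop if disk usage crosses a threshold. The following is a minimal sketch under those assumptions, not a description of ShipHawk's archiving process; `archive_batch` and `disk_used_fraction` are hypothetical callables supplied by the caller, and the default sizes and threshold are illustrative.

```python
import time
from typing import Callable, Iterator, List, Sequence


def ramped_batches(items: Sequence, start_size: int = 100,
                   max_size: int = 10_000, growth: float = 2.0) -> Iterator[Sequence]:
    """Yield consecutive slices whose size grows from start_size up to max_size."""
    if start_size < 1 or growth < 1:
        raise ValueError("start_size and growth must both be at least 1")
    size, index = start_size, 0
    while index < len(items):
        yield items[index:index + size]
        index += size
        size = min(max_size, int(size * growth))


def archive_with_throttle(items: Sequence,
                          archive_batch: Callable[[Sequence], None],
                          disk_used_fraction: Callable[[], float],
                          max_used_fraction: float = 0.8,
                          pause_seconds: float = 1.0,
                          start_size: int = 100,
                          max_size: int = 10_000) -> int:
    """Archive items in ramped batches and return how many were archived.

    disk_used_fraction is checked before every batch; the job stops
    early once usage reaches max_used_fraction. It could be built on
    shutil.disk_usage(path), for example used / total.
    """
    archived = 0
    for batch in ramped_batches(items, start_size, max_size):
        if disk_used_fraction() >= max_used_fraction:
            break  # leave headroom on the volume instead of filling it
        archive_batch(batch)
        archived += len(batch)
        time.sleep(pause_seconds)  # give the database time between batches
    return archived
```

Checking disk usage before each batch means the job halts on its own rather than relying on an operator reacting to an alert.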
Posted Oct 26, 2021 - 11:19 PDT

Resolved
This incident is resolved. We’re sorry this prevented your team from completing fulfillment during the outage. Understanding the urgency, we made every effort to resolve it as quickly as possible. The incident started at 10:42am and was resolved before 11:45am Pacific Time. A post-mortem will be provided and accessible on this status page within the next 3-5 business days.

Please contact support@shiphawk.com if you have additional questions or concerns.
Posted Oct 15, 2021 - 14:12 PDT
Monitoring
A fix has been implemented and we are monitoring the results. Customers can now log in. Monitoring will continue throughout the day. The next update to finalize/close this incident will be provided within the next few hours.
Posted Oct 15, 2021 - 11:45 PDT
Update
We are continuing to investigate this issue.
Posted Oct 15, 2021 - 11:24 PDT
Investigating
Some users may be experiencing trouble when logging in to ShipHawk. Our Engineering team is currently investigating issues related to login. We will send an additional update at 11:45am Pacific Time.
Posted Oct 15, 2021 - 11:20 PDT
This incident affected: ShipHawk Application and ShipHawk API.