Pages may be slow loading
Incident Report for ShipHawk
Postmortem

Incident summary

Between 6:30 am and 3:30 pm PST on Monday, November 29, several customers experienced application slowness.

Leadup

In preparation for the peak season, we provisioned additional servers for anticipated volume. Our customers collectively experienced larger order, shipment and rate request volumes than we expected. Additionally, FedEx, UPS and other Carrier APIs experienced delayed response times to requests made by our system.

The combination of these issues slowed down ShipHawk API response times for some customers.

Fault

With load higher than expected, API response times slowed. The automated load balancer marked some of the slower servers as unhealthy, which shifted additional load onto the remaining healthy servers and slowed the system down even further.

The engineering team decided to add more servers to handle the extra load, but the added resources did not help: provisioning additional rating servers sharply increased database connection usage, which produced errors instead of improving performance.

Impact

ShipHawk users experienced slowness of the service from 6:30 am PST until 3:30 pm PST. Some API requests timed out, and syncing with external systems was delayed.

A total of 9 urgent support cases were submitted to ShipHawk during the impact window.

Detection

The incident was first detected by monitoring systems at 6:30 am PST and was reported by customers at 6:42 am PST.

Response

Customers were notified about the slowness via our status page at 6:44am PST.

We responded to the incident with urgency and ultimately made the changes necessary to solve the problem, while continuing to process volumes similar to Black Friday and Cyber Monday through the end of the week.

Recovery

We needed to add more servers to process the extra API requests, but doing so created too many connections to the database. The solution was to implement a database connection pooling system that allowed us to optimize database connection usage.
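
The report does not name the pooling system used in production (PgBouncer is a common choice for PostgreSQL deployments); the sketch below is only an illustration of the idea, using Python's standard library and an in-memory SQLite database as stand-ins. The key property, missing during the incident, is that the pool caps total open connections no matter how many workers are added:

```python
import queue
import sqlite3

class ConnectionPool:
    """Minimal connection pool: a fixed set of reusable connections.

    Instead of every API worker opening its own database connection
    (which exhausts the database's connection limit as servers are
    added), workers borrow from a small shared pool and return the
    connection when done.
    """

    def __init__(self, dsn, size=5):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False lets a connection be used by
            # whichever worker borrows it next.
            self._pool.put(sqlite3.connect(dsn, check_same_thread=False))

    def acquire(self, timeout=5.0):
        # Blocks until a connection is free rather than opening a new
        # one, so total connections stay bounded under any load.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

# Usage: borrow, query, return.
pool = ConnectionPool(":memory:", size=2)
conn = pool.acquire()
conn.execute("SELECT 1")
pool.release(conn)
```

With this in place, adding application servers increases contention for pooled connections (requests may queue briefly) instead of producing hard "too many connections" errors at the database.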

Around 3:00 pm PST, the new connection pooling system was activated and we were able to add more resources to process API requests and background jobs. That resolved the slowness at 3:30 pm PST.

To further mitigate the chances of another incident, we set up redundant connection poolers and provisioned more production resources throughout the night. That proved effective the next day (Tuesday 11/30), when ShipHawk experienced a similar API load and response times remained stable throughout.

Timeline

All times in PST

Monday, 29 November
6:30 am - monitoring systems alerted on an increase in average API response time and a rising number of "499 Client Closed Request" errors

6:32 am - engineering team started investigating the slowness

6:42 am - customers reported slowness of Item Fulfillments sync and overall application slowness

6:44 am - Status Page was updated with the details about the incident.

7:30 am - API load balancer reconfigured to prevent a cascade effect, in which the load balancer removed slow instances from the pool, adding more load to the healthy instances and making them slow or unhealthy as well
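
One common safeguard against this cascade is a minimum-healthy-hosts rule, similar to Envoy's "panic threshold": if ejecting slow instances would leave too little of the fleet in rotation, the balancer keeps routing to all instances rather than concentrating load on a shrinking healthy subset. The sketch below is illustrative, not ShipHawk's actual configuration:

```python
def select_pool(instances, healthy, min_healthy_fraction=0.5):
    """Choose which instances receive traffic.

    instances: list of instance names
    healthy:   set of names that passed the last health check

    Normally only healthy instances serve traffic. But if ejection
    would leave fewer than min_healthy_fraction of the fleet, keep
    routing to everyone: piling all load onto a shrinking healthy
    subset is exactly what amplifies an overload into a cascade.
    """
    healthy_list = [i for i in instances if i in healthy]
    if len(healthy_list) < min_healthy_fraction * len(instances):
        return list(instances)  # "panic" mode: don't amplify the overload
    return healthy_list

# Half the fleet slow: route only to the healthy half.
select_pool(["a", "b", "c", "d"], healthy={"a", "b"})   # → ["a", "b"]
# Three of four slow: keep everyone in rotation.
select_pool(["a", "b", "c", "d"], healthy={"a"})        # → ["a", "b", "c", "d"]
```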

8:00 am - application servers reconfigured; resources moved from backend services to API services to better match the shape of the load

9:00 am - existing servers upgraded to more powerful EC2 instances, extra servers provisioned for handling the extra load

10:00 am - monitoring systems detected errors related to unusually high database connection usage, which prevented us from provisioning more servers

11:00 am - decided to configure a new database connection pooling system to mitigate the database connection issue and allow provisioning more resources

3:00 pm - a new database connection pooling system was installed and configured

3:30 pm - confirmed that the incident was resolved

Tuesday, 30 November

12:00am - 4:30am - additional application and background processing servers added for redundancy

Root cause identification: The Five Whys

  1. The application had degraded performance because of added load on the API and slow carrier response times.
  2. The system did not automatically address the added load because available database connections were exhausted.
  3. Because we added extra resources and did not expect this to cause a database connection issue.
  4. Because we did not have load tests that would have identified this.
  5. Because we had not felt this kind of testing was necessary until we reached this level of scale.

Root cause

Suboptimal use of database connections prevented the application from scaling. The team did not have an immediate solution because the issue had not been replicated in testing.

Lessons learned

  • We need more application load testing in place.
  • Carrier API response slowness can cause slowness for the application.
  • Customers with highly volatile API usage should be isolated from other multi-tenant users.

Corrective actions

  1. Introduce new load testing processes.
  2. Implement a better automated scaling system for peak load periods.
  3. Prioritize solutions to mitigate response time delays due to carrier response time delays.
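
As a hypothetical example of the kind of check a new load-testing process could enforce before a scale-out, the fleet's maximum connection demand can be compared against the database's connection limit; every number below is illustrative, not taken from ShipHawk's environment:

```python
def connection_headroom(servers, workers_per_server,
                        conns_per_worker, db_max_connections):
    """Capacity check a load test (or pre-scale review) should enforce.

    Total connections the fleet can open versus what the database
    allows. During the incident, adding servers multiplied the first
    number past the second, so scaling out produced connection errors
    instead of added throughput.
    """
    demand = servers * workers_per_server * conns_per_worker
    return db_max_connections - demand

# 20 servers x 16 workers x 1 connection = 320 demanded; a
# 500-connection limit leaves headroom, but doubling the fleet
# overshoots:
connection_headroom(20, 16, 1, 500)  # → 180 (safe to scale)
connection_headroom(40, 16, 1, 500)  # → -140 (would exhaust the DB)
```

A connection pooler changes the inputs to this check: demand is then driven by the pool size rather than by the raw worker count, which is why pooling restored the ability to add servers.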
Posted Dec 03, 2021 - 11:04 PST

Resolved
This incident has been resolved.

In an effort to help during this heightened holiday processing, we will provide extended support hours from 3:00 AM to 9:00 PM Pacific Time via normal support channels through Friday 12/3/21 for all customers.
Posted Nov 29, 2021 - 16:13 PST
Monitoring
Our engineering team is deploying additional changes to address page slowness. We are seeing significant improvement with site and API responsiveness with these changes, and we will continue to closely monitor system performance.
Posted Nov 29, 2021 - 15:12 PST
Update
We continue to experience exponentially larger volumes than anticipated, despite significant over-provisioning of system resources in preparation for Black Friday/Cyber Monday. As a result, some customers are experiencing slower than normal performance. ShipHawk engineering will continue to make incremental improvements throughout the day and will inform you as changes are made.
Posted Nov 29, 2021 - 12:31 PST
Update
The deployed changes are now in effect across the system. Overall site and API performance continues to improve. ShipHawk Engineering will continue to tune and monitor performance.
Posted Nov 29, 2021 - 11:09 PST
Identified
ShipHawk Engineering is deploying changes to address system performance. We expect those changes to have a positive impact on site and API responsiveness over the next 15-25 minutes, and we will continue to monitor system performance.
Posted Nov 29, 2021 - 10:23 PST
Investigating
Some clients have reported they are still seeing slow response times. Our engineering team is investigating further for a complete resolution. We will update you as soon as we know more information.
Posted Nov 29, 2021 - 09:16 PST
Monitoring
Our engineering team was able to improve the responsiveness of ShipHawk's WebPortal and API, and error messages have subsided. We will continue to monitor the issue throughout the day to confirm the resolution of this issue.
Posted Nov 29, 2021 - 08:39 PST
Update
There are no new updates at this time. Engineering is continuing to resolve this issue. We will update you as soon as we have more information.
Posted Nov 29, 2021 - 07:51 PST
Identified
The site is currently experiencing a higher than normal amount of load, and may be causing pages to be slow or unresponsive. We're investigating the cause and will provide an update as soon as possible.

Our engineering team is working on a solution. The next update will be within 30 minutes.
Posted Nov 29, 2021 - 06:50 PST
This incident affected: ShipHawk Application and ShipHawk API.