Between 6:30 am and 3:30 pm PST on Monday, November 29, several customers experienced slow application performance.
In preparation for peak season, we had provisioned additional servers for the anticipated volume. Our customers collectively generated larger order, shipment, and rate request volumes than we expected. Additionally, FedEx, UPS, and other carrier APIs responded more slowly than usual to requests made by our system.
The combination of these issues slowed ShipHawk API response times for some customers.
With load higher than expected, API response times degraded. The automated load balancer then marked some of the slower servers as unhealthy, which shifted their traffic onto the remaining healthy servers and slowed the system down even further.
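As an illustration of the kind of load balancer change involved (applied at 7:30 am, per the timeline below): relaxing health-check thresholds keeps slow-but-responsive instances in rotation instead of evicting them during a load spike. The sketch below is a minimal example that assumes an AWS Elastic Load Balancing target group, since the fleet runs on EC2; the target group ARN and the threshold values are hypothetical, not the actual settings used.

    # Minimal sketch: assumes an AWS ELBv2 target group fronting the API
    # servers; the ARN and threshold values are hypothetical.
    import boto3

    elbv2 = boto3.client("elbv2")

    elbv2.modify_target_group(
        TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/api/...",
        # Give slow instances more time to answer the health check before
        # a check is counted as failed...
        HealthCheckTimeoutSeconds=10,
        HealthCheckIntervalSeconds=30,
        # ...and require more consecutive failures before an instance is
        # pulled from rotation, so a latency spike alone does not shrink
        # the pool and concentrate traffic on the remaining servers.
        UnhealthyThresholdCount=5,
        HealthyThresholdCount=2,
    )

The goal of tuning like this is to tolerate elevated latency during a spike rather than shrinking the pool, which only concentrates traffic on fewer servers.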
The engineering team decided to add more servers to help handle the extra load. The added capacity did not help: because each additional server opens its own database connections, adding new resources for rating sharply increased database connection usage, which produced errors rather than relieving the performance degradation.
ShipHawk users experienced slow service from 6:30 am PST to 3:30 pm PST. Some API requests failed with timeouts, and syncing with external systems was delayed.
A total of 9 urgent support cases were submitted to ShipHawk during the impact window.
The issue was first detected by our monitoring systems at 6:30 am PST and was reported by customers beginning at 6:42 am PST.
Customers were notified about the slowness via our status page at 6:44 am PST.
We responded to the incident with all possible urgency and ultimately made the changes necessary to solve the problem, while continuing to process volumes similar to Black Friday and Cyber Monday through the end of the week.
We needed to add more servers to process the extra API requests, but doing so created too many connections to the database. The solution was to implement a database connection pooling system that allowed us to optimize database connection usage.
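This report does not name the specific pooling system or database, but the general idea is that many workers share a small, capped set of connections instead of each holding its own. Below is a minimal sketch of that idea, assuming a PostgreSQL database accessed from Python via psycopg2; the connection string, pool limits, and query are hypothetical.

    # Minimal connection-pooling sketch; the DSN, pool sizes, and the
    # rate_requests table are illustrative assumptions only.
    from contextlib import contextmanager
    from psycopg2.pool import ThreadedConnectionPool

    # Cap this process at 10 database connections no matter how many
    # request threads it runs; without a pool, each thread would open
    # and hold its own connection.
    POOL = ThreadedConnectionPool(minconn=2, maxconn=10,
                                  dsn="dbname=app user=app host=db.internal")

    @contextmanager
    def db_connection():
        """Borrow a connection for one unit of work, then return it to the pool."""
        conn = POOL.getconn()
        try:
            yield conn
            conn.commit()
        except Exception:
            conn.rollback()
            raise
        finally:
            POOL.putconn(conn)

    def fetch_pending_rate_requests(limit=100):
        with db_connection() as conn:
            with conn.cursor() as cur:
                cur.execute("SELECT id FROM rate_requests LIMIT %s", (limit,))
                return cur.fetchall()

A server-side pooler in front of the database applies the same principle across every application server, which is what mattered here: the total connection count grew with each server added, so capping and sharing connections was a prerequisite for scaling out further.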
Around 3:00 pm PST, the new connection pooling system was activated and we were able to add more resources to process API requests and background jobs. This resolved the slowness by 3:30 pm PST.
To further reduce the chance of another incident, we set up redundant connection poolers and provisioned more resources to production throughout the night. This proved effective the next day (Tuesday, 11/30), when ShipHawk handled a similar API load and response times remained stable throughout.
All times in PST
Monday, 29 November
6:30 am - monitoring systems alerted on an increase in average API response time and a rising number of "499 Client Closed Request" errors
6:32 am - engineering team started investigating the slowness
6:42 am - customers reported slowness of Item Fulfillments sync and overall application slowness
6:44 am - status page updated with details about the incident
7:30 am - API load balancer reconfigured to prevent a cascade effect in which slow instances were removed from the pool, adding more load to the healthy instances and in turn making them slow or unhealthy as well
8:00 am - application servers reconfigured, shifting resources from backend services to API services to better match the shape of the load
9:00 am - existing servers upgraded to more powerful EC2 instances, and additional servers provisioned to handle the extra load
10:00 am - monitoring systems detected errors caused by excessively high database connection usage, which prevented us from provisioning more servers
11:00 am - decision made to configure a new database connection pooling system to mitigate the database connection issue and allow more resources to be provisioned
3:00 pm - a new database connection pooling system was installed and configured
3:30 pm - confirmed that the incident was resolved
Tuesday, 30 November
12:00 am - 4:30 am - additional application and background processing servers added for redundancy
Suboptimal use of database connections prevented the application from scaling as needed. The team did not have an immediate solution because the issue had not been replicated in testing.