Amazon has shared more details about the major outage we reported on late last week, which brought many of the world's biggest websites offline.
In a blog post detailing the fallout, the company's cloud arm AWS explained that Kinesis in the US-EAST-1 region went down because the company added capacity to the service's front-end fleet without checking whether the operating system's configuration actually allowed for it. As you could have guessed by now, it didn't.
For the servers in the Kinesis front-end fleet to communicate with each other, each one creates operating-system “threads,” one for every other server in the fleet. According to AWS, there are “many thousands of servers” in the fleet, so when new ones are added, it can take a few hours before all of these threads are created.
However, in this particular case, adding capacity “caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration.” The fastest way to fix the issue was to restart the entire Kinesis front-end fleet, which took a while because only “a few hundred” servers could be brought back online at a time.
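To see why adding capacity tipped the whole fleet over at once, here is a rough Python sketch. The fleet sizes and the thread limit below are made up, since AWS has not published the real numbers; the point is only that per-server thread usage grows linearly with fleet size, so a single capacity increase can push every server past the same cap at the same time.

```python
# Illustrative sketch only: the real Kinesis fleet sizes and OS limits
# are not public, so the numbers below are hypothetical.

OS_THREAD_LIMIT = 10_000   # hypothetical per-server thread cap from OS config
BASE_THREADS = 500         # hypothetical threads each server needs for other work


def threads_per_server(fleet_size: int) -> int:
    """Each front-end server keeps one thread per peer, so the count
    grows linearly with the size of the fleet."""
    return BASE_THREADS + (fleet_size - 1)


def fits_in_limit(fleet_size: int) -> bool:
    return threads_per_server(fleet_size) <= OS_THREAD_LIMIT


# Because every server tracks the same peers, a capacity increase pushes
# the whole fleet past the limit together, not one box at a time.
for size in (9_000, 9_400, 9_600):
    print(f"fleet={size:>6}  threads/server={threads_per_server(size):>6}  "
          f"within limit: {fits_in_limit(size)}")
```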
To make sure this doesn’t happen again, AWS plans to move to bigger servers and to set up more fine-grained alarms.
“In the very short term, we will be moving to larger CPU and memory servers, reducing the total number of servers and, hence, threads required by each server to communicate across the fleet,” the post reads.
“This will provide significant headroom in thread count used as the total threads each server must maintain is directly proportional to the number of servers in the fleet.”
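AWS has not shared the actual figures, but the proportionality it describes is easy to illustrate: if each server keeps one communication thread per peer, then packing the same capacity into half as many, larger machines roughly halves the threads each server must maintain. The numbers below are purely hypothetical.

```python
# Hypothetical figures illustrating the proportionality AWS describes:
# one communication thread per other server in the fleet.

def peer_threads(fleet_size: int) -> int:
    return fleet_size - 1  # one thread per peer


old_fleet = 10_000            # hypothetical server count today
new_fleet = old_fleet // 2    # same capacity on fewer, larger instances

print(peer_threads(old_fleet))  # 9999 threads per server
print(peer_threads(new_fleet))  # 4999 threads per server
```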
The company also plans new “fine-grained alarming for thread consumption in the service,” as well as “an increase in thread count limits in our operating system configuration, which we believe will give us significantly more threads per server and give us significant additional safety margin there as well.”
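AWS has not said how that alarming will be implemented. As a purely hypothetical sketch of what fine-grained thread-consumption monitoring can look like on a Linux host, the snippet below reads the kernel's system-wide thread ceiling and each process's thread count from /proc and flags anything approaching the limit; the actual limits and thresholds in AWS's configuration are not public.

```python
from pathlib import Path

ALARM_FRACTION = 0.8  # hypothetical threshold: alarm at 80% of the limit


def os_thread_limit() -> int:
    # System-wide ceiling on threads; per-process limits (ulimit -u,
    # cgroup pids.max, stack sizing) can be lower in practice.
    return int(Path("/proc/sys/kernel/threads-max").read_text())


def thread_counts() -> dict[int, int]:
    """Map each pid to the number of threads it is currently running."""
    counts = {}
    for status in Path("/proc").glob("[0-9]*/status"):
        try:
            for line in status.read_text().splitlines():
                if line.startswith("Threads:"):
                    counts[int(status.parent.name)] = int(line.split()[1])
                    break
        except OSError:
            continue  # the process exited while we were reading
    return counts


if __name__ == "__main__":
    limit = os_thread_limit()
    for pid, threads in sorted(thread_counts().items()):
        if threads > ALARM_FRACTION * limit:
            print(f"ALARM: pid {pid} is using {threads} of {limit} allowed threads")
```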