Fault Failed Fastly's Fragile CDN
For a time, social media sites like Twitch and Reddit along with news sites like the Guardian and the New York Times all went offline, returning 503 Service Unavailable errors as browsers were unable to retrieve content from the destination servers.
Fastly, the cloud-computing company responsible for hosting cached versions of those sites, blamed a software bug triggered when a legitimate customer changed their settings. Within minutes, around 85% of Fastly client sites reported errors, with some knocked offline completely. The outage lasted just under an hour: the fault was diagnosed within 40 minutes, a fix was identified at the 50-minute mark, and the fix had propagated and normal service resumed by 58 minutes.
Overall, that's a quick turnaround for a software fault. However, awkward questions remain.
For Fastly, how can an ordinary customer, simply by changing their own settings, trigger a bug that cascades through most of a CDN? What was the error condition? Can you even test for that? What safeguards exist within the CDN to prevent a cascading failure?
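One common safeguard against exactly this class of failure is to canary a customer's configuration change on a small slice of edge nodes before propagating it network-wide, so a poisonous config is caught before it cascades. A minimal sketch of that idea follows; all names (`apply_config`, `staged_rollout`, the "malformed" flag) are hypothetical illustrations, not Fastly's actual mechanism.

```python
# Hypothetical sketch: stage a customer config change through a small
# canary pool of edge nodes before global rollout, aborting on failure.

def apply_config(node, config):
    # Stand-in for pushing a customer's settings to one edge node.
    # A real CDN would deploy the config and watch the node's error rate.
    if config.get("malformed"):
        raise ValueError(f"{node}: config rejected")

def staged_rollout(nodes, config, canary_fraction=0.05):
    """Apply config to a canary slice first; stop before touching the rest."""
    cutoff = max(1, int(len(nodes) * canary_fraction))
    canaries, remainder = nodes[:cutoff], nodes[cutoff:]
    for node in canaries:
        try:
            apply_config(node, config)
        except ValueError:
            return f"rolled back after canary failure on {node}"
    for node in remainder:
        apply_config(node, config)
    return f"deployed to all {len(nodes)} nodes"
```

With a 5% canary pool, a bad config stops at the first handful of nodes instead of taking out 85% of the network; the trade-off is slower propagation of legitimate changes.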
For the wider Internet, is it wise to have such a small number of key players supporting such a large chunk of web infrastructure? You can bet the world's black-hat hackers just added CDNs to their list of targets. Why hack one site when you can hack the CDN and take down thousands?
This may be an object lesson that the commercial cloud, with its consolidation and offloading of performance infrastructure, is not the unbridled success we were led to believe.