Today WordPress.com was down for approximately 110 minutes, our worst downtime in four years. The outage affected 10.2 million blogs, including our VIPs, and appears to have deprived those blogs of about 5.5 million pageviews.
What Happened: We are still gathering details, but it appears an unscheduled change to a core router by one of our datacenter providers messed up our network in a way we haven’t experienced before, and broke the site. It also broke all the mechanisms for failover between our locations in San Antonio and Chicago. All of your data was safe and secure; we just couldn’t serve it.
What we’re doing: We need to dig deeper and find out exactly what happened and why, how to recover more gracefully next time, and how to isolate problems like this so they don’t affect our other locations.
I will update this post as we find out more, and have a more concrete plan for the future.
I know this sucked for you guys as much as it did for us — the entire team was on pins and needles trying to get your blogs back as soon as possible. I hope it will be much longer than four years before we face a problem like this again.
Update 1: We’ve gathered more details about what happened. A few months ago a latent misconfiguration was introduced — specifically, a cable plugged in someplace it shouldn’t have been. Yesterday something called the spanning tree protocol kicked in and started trying to route all of our private network traffic to a public network, over a link much too small and slow to handle even 10% of our traffic, which caused heavy packet loss. This “sort of working” state was much worse than an outright failure: it confused both our systems team and our failsafe systems. It is not yet clear why the misconfiguration bit us yesterday and not earlier. The network issue was unfortunate, but we also responded too slowly in pinpointing it and routing around it, which extended the downtime 3-4x longer than it should have been.
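The “sort of working” failure mode described above is worth a small illustration. The sketch below is hypothetical (this post doesn’t describe our actual failover logic, and the `Link` and health-check functions are invented for the example): a health check that only asks “is the link up at all?” keeps passing on a link dropping 90% of its packets, while a check that measures the loss rate would flag the link as unhealthy and allow failover to trigger.

```python
class Link:
    """Hypothetical degraded link: up, but deterministically
    dropping 9 of every 10 packets, like a saturated uplink."""

    def __init__(self):
        self.sent = 0

    def send_probe(self):
        """Simulate one probe packet; only every 10th one gets through."""
        self.sent += 1
        return self.sent % 10 == 0


def naive_health_check(link, probes=20):
    """'Sort of working' passes this check: a single successful
    probe is enough to call the link healthy, so no failover fires."""
    return any(link.send_probe() for _ in range(probes))


def loss_aware_health_check(link, probes=200, max_loss=0.5):
    """Measures the packet-loss rate instead, so a degraded-but-up
    link is correctly reported as unhealthy."""
    delivered = sum(link.send_probe() for _ in range(probes))
    loss = 1 - delivered / probes
    return loss <= max_loss


print(naive_health_check(Link()))       # True  -- looks healthy, failover never triggers
print(loss_aware_health_check(Link()))  # False -- 90% loss correctly trips the check
```

The point is that failsafe systems tuned to detect a hard-down link can sit idle through a partial failure, which matches how our systems behaved during the outage.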