Monday’s major Facebook outage set the internet ablaze with worry and conjecture about whether it was caused by engineering error, malicious attack, or something else entirely.
But roughly one hour after the outage began, which also kept Instagram and WhatsApp down for about five hours, a Reddit user named u/ramenporn, who claimed to be part of the “Recovery Team” for the ongoing issue, tried to explain what was happening.
Facebook technicians needed to be in physical proximity to routers
Some early answers seemed to come from Reddit, and specifically from people who weren’t authorized to speak.
“As many of you know, DNS for FB services has been affected and this is likely a symptom of the actual issue, and that’s that BGP peering with Facebook peering routers have gone down, very likely due to a configuration change that went into effect shortly before the outages happened (started roughly 1540 UTC),” read the initial post on Reddit by an alleged Facebook insider.
It continued, “There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access is separate from the people with knowledge of how to actually authenticate to the systems and people who know what to actually do, so there is now a logistical challenge with getting all that knowledge unified.”
“Part of this is also due to lower staffing in data centers due to pandemic measures,” added the Reddit user u/ramenporn.
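The post's central claim, that the DNS failure was a symptom of withdrawn BGP routes rather than the root cause, can be illustrated with a toy model. This is a minimal sketch and not a description of Facebook's actual network: the domain `example.test`, the nameserver addresses (taken from reserved documentation ranges), and the function names are all assumptions made for illustration. It only shows that a resolver cannot get an answer when no announced route covers any authoritative nameserver.

```python
import ipaddress

# Hypothetical authoritative nameservers for a hypothetical domain.
# Addresses come from reserved documentation ranges, not from any real network.
NAMESERVERS = {
    "example.test": [
        ipaddress.ip_address("192.0.2.53"),
        ipaddress.ip_address("198.51.100.53"),
    ]
}

def reachable(addr, routes):
    """True if any announced prefix in `routes` covers `addr`."""
    return any(addr in prefix for prefix in routes)

def resolve(domain, routes):
    """Toy resolver: answers only if at least one nameserver is routable."""
    servers = NAMESERVERS.get(domain, [])
    if any(reachable(ns, routes) for ns in servers):
        return "ANSWER"
    return "SERVFAIL"

# Before the change: both prefixes are announced to the rest of the internet.
routes_before = {
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
}
# After the change: the announcements are withdrawn, so no route remains.
routes_after = set()

before = resolve("example.test", routes_before)
after = resolve("example.test", routes_after)
print(before, after)
```

In this model the nameservers themselves never change or fail. Only the routes to them disappear, which is why the DNS errors users saw would be a downstream effect rather than the underlying problem.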
Not long after the alleged insider began sharing potentially sensitive information about Facebook’s desperate attempts to regain control of its domains, the Reddit user’s post was deleted. But the Wayback Machine preserved an archive of the information.
Another Reddit user seemed to think this ordeal would lend credibility to their case against Facebook’s management about the need for knowledgeable staff in major data centers. Other users commiserated on the challenge of solving an issue like this when the network itself goes down.
“The problem is when your networking goes down, even if you get in via a backup DSL connection or something to the datacenter, you can’t get from your jump host to anything else,” read a post from mike_d.
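The jump host problem described in that post is essentially a reachability question. The sketch below is a simplified illustration under assumed names: the hosts `engineer`, `jump_host`, `core_switch`, `router_a`, and `router_b` are hypothetical, and the topology is invented. It treats the network as an undirected graph and checks which machines an engineer can reach with and without the internal links.

```python
from collections import deque

def reachable_from(start, links):
    """Return the set of nodes reachable from `start` over undirected links."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

# Hypothetical out-of-band path (for example, a backup DSL line) to the jump host.
backup_link = ("engineer", "jump_host")
# Hypothetical internal links that depend on the production network.
internal_links = [
    ("jump_host", "core_switch"),
    ("core_switch", "router_a"),
    ("core_switch", "router_b"),
]

healthy = reachable_from("engineer", [backup_link] + internal_links)
broken = reachable_from("engineer", [backup_link])
print(sorted(healthy))
print(sorted(broken))
```

With the internal links gone, the engineer still reaches the jump host but nothing beyond it, which matches the situation the commenter describes.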
The Reddit user u/ramenporn also said that the Monday outage was likely caused by Facebook network engineers accidentally locking themselves out of the larger system during a configuration change.
Once this happened, the only people who could do anything were those in physical proximity to the routers in Facebook’s data center, who could then bring the servers back. Luckily, someone (or probably several extremely stressed-out someones) fixed the underlying issue, and Facebook and Instagram returned to service shortly before 6:00 PM EDT.
This was a developing story and was regularly updated as new information became available.