The massive October 4th Facebook outage was not due to a breach and was not classified as a security issue. But the fact that it went down — and was inaccessible for an extended period — is itself a security concern that the enterprise must address.
That security concern is business continuity.
Early reports attributed the Facebook outage to a misconfiguration of the Border Gateway Protocol (BGP) that snowballed beyond the company’s control. Facebook later explained that, as part of routine maintenance, a command was issued that accidentally disconnected all of its data centers.
Facebook’s DNS servers detected that the network backbone was no longer communicating with the internet and stopped sending out BGP advertisements. To the rest of the internet, it appeared as if Facebook was asking everyone to take its servers off their “internet maps”.
While most (if not all) enterprises are not as big as Facebook, some universal lessons can be learned from this significant incident.
First, we must understand what BGP does and then investigate how the Facebook outage occurred. Finally, we can explore the importance of business continuity planning for security teams and what companies can do to prevent similar shutdowns at their own organizations.
The Role of BGP in the Facebook Outage
BGP, the Border Gateway Protocol, is often mentioned alongside DNS because both are needed for internet traffic to reach its destination. The function of BGP is to act like a GPS: networks use it to announce which addresses they can reach, and routers use those announcements to choose the best available route. With a service as large as Facebook, there are almost endless routes your packets might take.
DNS, on the other hand, is used to translate names to IP addresses, much like an address book.
Cloudflare’s analogy is excellent: DNS tells you where you’re going, and BGP tells you how to get there. With DNS, you know where to go — you have an address. But what route do you take? That’s where BGP comes in.
But if either DNS or BGP is down, your site is unreachable. Ensuring that BGP is doing its job is just as important as keeping DNS operational and secure.
This explanation simplifies things without getting into the specifics, but the high-level overview is enough to underscore the importance.
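To make the "where" versus "how" distinction concrete, here is a minimal Python sketch. It is a toy model, not how real resolvers or routers are implemented: the hostname, the addresses, the peer names and the helper functions are all hypothetical, and real BGP route selection weighs many more attributes than the longest-prefix match shown here.

```python
import ipaddress

# Hypothetical "address book": DNS maps a name to an IP address.
DNS_RECORDS = {"www.example.com": "203.0.113.10"}

# Hypothetical routing table: prefixes learned from BGP advertisements,
# each mapped to the neighbor that announced it.
ROUTES = {
    ipaddress.ip_network("203.0.113.0/24"): "peer-A",
    ipaddress.ip_network("203.0.0.0/16"): "peer-B",
}


def resolve(name):
    """DNS step: where am I going? Returns an IP string or None."""
    return DNS_RECORDS.get(name)


def next_hop(ip, routes):
    """Routing step: how do I get there? Uses longest-prefix match."""
    addr = ipaddress.ip_address(ip)
    matches = [net for net in routes if addr in net]
    if not matches:
        return None  # no advertised route covers this address
    best = max(matches, key=lambda net: net.prefixlen)
    return routes[best]


def reach(name, routes):
    """Combine both steps and describe the outcome."""
    ip = resolve(name)
    if ip is None:
        return "DNS failure: name does not resolve"
    hop = next_hop(ip, routes)
    if hop is None:
        return f"resolved to {ip}, but no route exists"
    return f"resolved to {ip}, forwarding via {hop}"


print(reach("www.example.com", ROUTES))      # both steps succeed
print(reach("www.example.com", {}))          # routes withdrawn
print(reach("missing.example.com", ROUTES))  # name unknown
```

In this sketch, a failure in either table leaves the site unreachable even when the other one is intact.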
How Did All of Facebook’s Internal Dependencies Lead to the Outage?
While BGP did play a role in the Facebook outage, it was not the root cause. According to Facebook’s explanation, it was a routine maintenance command that disconnected the data centers. Things got out of control when the company’s DNS servers detected that the network backbone was no longer communicating with the internet. Because the servers knew something was wrong, they stopped sending out BGP advertisements.
As the outage continued, reports began to circulate that the issue was likely the result of a series of circular internal dependencies. This means that when everything went offline, the systems required to run and restore those services also went offline, because they depended on the very network that was down; it was a vicious circle.
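A circular dependency of this kind can be spotted ahead of time if the dependencies are written down as a graph. The following Python sketch uses a depth-first search to find one cycle. The service names and the `find_cycle` helper are hypothetical and do not describe Facebook’s actual architecture.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of names, or None.

    deps maps each service to the services it needs in order to run.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in deps.get(node, []):
            dep_state = state.get(dep, UNSEEN)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path: we found a loop
                return path[path.index(dep):] + [dep]
            if dep_state == UNSEEN:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in deps:
        if state.get(node, UNSEEN) == UNSEEN:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical example: the tool used to fix the backbone needs
# internal DNS, which in turn needs the backbone.
DEPS = {
    "backbone": ["config-tool"],
    "config-tool": ["internal-dns"],
    "internal-dns": ["backbone"],
    "status-page": ["internal-dns"],
}

print(find_cycle(DEPS))
# ['backbone', 'config-tool', 'internal-dns', 'backbone']
```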
For internet devices and servers everywhere, the message from Facebook was essentially, “please remove our servers from your maps”. Large cloud server providers reportedly noticed a significant number of BGP updates from the social media giant before it went down.
Ultimately, Facebook’s BGP systems took themselves off the map due to other internal infrastructure issues.
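The behavior described above, where DNS servers stop advertising when they sense an unhealthy network, can be sketched in a few lines of Python. This is an assumed, simplified policy for illustration only: the `DnsNode` class, the node names and the prefixes are hypothetical.

```python
class DnsNode:
    """Hypothetical DNS server that announces its prefix only while healthy."""

    def __init__(self, name, prefix):
        self.name = name
        self.prefix = prefix
        self.advertising = True

    def health_check(self, backbone_reachable):
        # Assumed policy: a node that cannot reach the backbone withdraws
        # its route so traffic is steered toward healthy nodes instead.
        self.advertising = backbone_reachable
        return self.advertising


def advertised_prefixes(nodes):
    """Prefixes the rest of the internet can currently see."""
    return sorted(n.prefix for n in nodes if n.advertising)


nodes = [
    DnsNode("dns-east", "192.0.2.0/24"),
    DnsNode("dns-west", "198.51.100.0/24"),
]

# A single-site problem: the safeguard works as intended.
nodes[0].health_check(backbone_reachable=False)
print(advertised_prefixes(nodes))  # ['198.51.100.0/24']

# A backbone-wide problem: every node withdraws at the same time.
for node in nodes:
    node.health_check(backbone_reachable=False)
print(advertised_prefixes(nodes))  # []
```

The sketch shows why a safeguard that is sensible for one failing site can remove an entire service from the map when the failure is global.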
The outage proves how critical business continuity planning is.
Why is Business Continuity Planning So Critical for the Enterprise?
We may never know how much business Facebook lost while it was unavailable for part of October 4. But for any business, regardless of how big or small, an outage can have severe financial consequences. Facebook is large enough to have addressed the issue rather quickly. Many organizations don’t have the internal resources to recover from an incident of that magnitude.
Business continuity should be an essential element of a robust cybersecurity program. But key to understanding where business continuity fits in is knowing the difference between business continuity, disaster recovery and continuous operations.
Essentially, business continuity planning works alongside disaster recovery and other resilience strategies to put your stakeholders in a better position to respond to any incident that could impact your brand, bottom line, reputation or valuation.
A robust business continuity plan allows you to continue offering your product or service during a disruption.
What Can Companies Do to Prevent This From Happening At Their Organization?
Whether your organization is massive like Facebook or not even close in size, it can be challenging for any company to truly grasp all of its internal dependencies without testing to determine what can happen if abnormalities are introduced.
Unless your company has mastered the art of chaos engineering, knowing every single dependency is almost impossible. The best way to approach this is to embrace that fact and plan for the potential failure.
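Short of full chaos engineering, a paper exercise can still help with planning for failure: write down the dependencies you do know, "fail" one component at a time, and see what else goes down with it. The Python sketch below does this for a hypothetical dependency map. The service names and the `impacted_by` helper are illustrative assumptions, and the result is only as complete as the map you feed it.

```python
def impacted_by(failed, deps):
    """Return every service that stops working if `failed` goes down.

    deps maps each service to the services it directly needs.
    """
    down = {failed}
    changed = True
    while changed:  # repeat until no new casualties are found
        changed = False
        for service, needs in deps.items():
            if service not in down and any(n in down for n in needs):
                down.add(service)
                changed = True
    return down - {failed}


# Hypothetical dependency map for a small organization.
DEPS = {
    "web-app": ["auth", "database"],
    "auth": ["internal-dns"],
    "database": [],
    "badge-access": ["internal-dns"],
    "remote-admin": ["vpn"],
    "vpn": ["auth"],
    "internal-dns": [],
}

# Fail each component in turn and report what it would take down.
for component in sorted(DEPS):
    print(f"{component}: {sorted(impacted_by(component, DEPS))}")
```

In this made-up map, losing `internal-dns` takes down every service except `database`, including the remote access an administrator would need to fix it. That is the kind of finding a continuity plan should account for.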
With external threats dominating the cybersecurity conversation, it is understandable to fall into the trap of thinking that disruption can only come from outside sources. But, according to the IBM 2021 Cost of a Data Breach Report, system error and misconfiguration are common attack vectors, too.
Like anything in cybersecurity, it’s a balancing act. Think of the big picture like a circus: Your company is on the high wire but you need a net. And that net, especially for issues involving internal threats, could be working with a third party to get an outside (and honest) opinion about where you stand.
Much like a vulnerability assessment or penetration test, having an external team of experts assess the security risk that could arise from internal dependencies is a critical defense strategy.
At the end of the day, your risk posture is only as good as what you know.