
The FB outage

This outage has caused considerable noise everywhere. It was quite discomforting for me because, during the whole conversation, nobody bothered to understand the gravity of the issue. I don't expect end users to understand it, but this is a blog post for everyone in the tech field.

Such an event can happen no matter how much chaos engineering, or whatever other best practice from the tech jargon, we implement in the stack.

To all my Site Reliability Engineer friends: Site Up is our first priority. I myself have said many a time that an outage is news and SREs should prevent outages. But I'm afraid this is creating a cult in the industry that despises outages and takes no learnings from them.


I don't know what happened at Facebook. I can explain a scenario which may or may not be right, but which can definitely show the gravity of the issue.

Let's draw a probable Facebook architecture




Disclaimer


I don't work at Facebook, so this might not be how Facebook routes traffic. This is based on my experience with large-scale companies.

When we open any big site, our traffic reaches the nearest server PoP (Point of Presence) maintained by the site. How does the traffic reach the nearest location? That's taken care of by routing, which always sends traffic to the nearest location. How does routing know where these IPs live on the internet? Facebook, like any other ASN (Autonomous System), announces its IPs using the BGP protocol. The routes propagate through the internet and the shortest of the available routes is always picked for routing. If a PoP location goes down, the BGP announcement can be withdrawn from that location and traffic seamlessly flows to the other locations.
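To make the route-selection idea concrete, here is a deliberately tiny Python sketch. It is not real BGP: the PoP names and path lengths are invented, and real route selection weighs far more than path length. It only shows why withdrawing one PoP's announcement is routine, while withdrawing all of them makes the whole site unreachable.

```python
# Toy model of anycast-style route selection (not real BGP).
# PoP names and AS-path lengths are invented for illustration.

announcements = {
    # pop_name: AS-path length as seen from one example client's ISP
    "pop-chennai": 2,
    "pop-singapore": 3,
    "pop-frankfurt": 5,
}

def best_pop(routes):
    """Pick the 'nearest' PoP, i.e. the shortest available path."""
    if not routes:
        raise RuntimeError("no announcements left: the site is unreachable")
    return min(routes, key=routes.get)

print(best_pop(announcements))       # pop-chennai

# One PoP goes down -> withdraw only its announcement; traffic shifts over.
announcements.pop("pop-chennai")
print(best_pop(announcements))       # pop-singapore

# What the FB outage looked like: every announcement withdrawn at once.
announcements.clear()
try:
    print(best_pop(announcements))
except RuntimeError as err:
    print(err)                       # the site is unreachable
```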

PoP locations are usually small, and they send traffic onward to a full-blown data centre unless the requested resource is already cached at the PoP (like a viral FB video). The data centre to forward the traffic to is picked based on some rules, and traffic to that data centre is sent over the internet, addressed to the specific data centre's inbound IP range. This is again routed by BGP. PoP to data centre communication may have additional security settings to avoid any untrusted access.
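As a rough illustration of that PoP behaviour, here is a minimal Python sketch under the assumed model above: serve from the local cache when possible, otherwise forward to a data centre chosen by some rule. The cache contents, data centre names and selection rule are all hypothetical.

```python
# Minimal sketch of a PoP's decision: cache hit vs forward to a data centre.
# Cache contents, DC names and the selection rule are hypothetical.

POP_CACHE = {"/videos/viral-cat.mp4": b"...cached bytes..."}
DATA_CENTRES = ["dc-oregon", "dc-virginia", "dc-lulea"]

def pick_data_centre(path: str) -> str:
    """Stand-in for the real (unknown) DC-selection rules: hash the path."""
    return DATA_CENTRES[hash(path) % len(DATA_CENTRES)]

def handle_at_pop(path: str) -> str:
    if path in POP_CACHE:
        return f"served {path} from the PoP cache"
    dc = pick_data_centre(path)
    # In reality this hop goes over the internet to the DC's inbound IP
    # range (again routed by BGP), with extra security between PoP and DC.
    return f"forwarded {path} to {dc}"

print(handle_at_pop("/videos/viral-cat.mp4"))   # cache hit, served at the edge
print(handle_at_pop("/profile/alice"))          # cache miss, forwarded to a DC
```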

Now, during the outage, Facebook's IPs were not reachable on the internet. None of their IPs had announcements, so their services were down.

1. Why not just bring back the announcement?
All the work done by the infrastructure team is remote. Nobody goes to the PoP locations or data centres to configure them. The DC team racks the servers and brings them up; everything else is done remotely, over a VPN connection to the remote site.
Now, with Facebook's IPs not announced from anywhere, reaching the DCs via VPN is also impossible, because even those VPN IPs would have been withdrawn from the announcements.
Is the infrastructure so fragile? There should definitely be out-of-band access to sites, another path to reach each site. What does another path mean?

BGP announcements are made by core routers or switches. There will be at least a pair of routers, and at least a pair of peers who pick up the announcement.

There will be at least another pair of routers to reach the site in the event of a total blackout on the core routers. These routers announce a different set of IPs that cater only to management traffic.
Why do we need separate management access? A bad DDoS (Distributed Denial of Service) attack can take the production network down and cut the data centre off. In one of my orgs, we DoSed ourselves due to a configuration mistake and had to reach the site over the management network to stop the source of the DDoS.
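A minimal sketch of the "two independent paths" idea, assuming a production VPN endpoint and a separate out-of-band management endpoint (both hostnames are made up): try the production path first, then fall back to the management network announced from separate routers on a separate IP range.

```python
# Sketch of out-of-band access: two independent paths to the same site.
# Hostnames are invented; reachable() is a placeholder for a real probe.

PATHS = [
    ("prod-vpn", "vpn.dc1.example.net"),    # rides the production announcements
    ("oob-mgmt", "mgmt.dc1.example.net"),   # separate routers, separate IP range
]

def reachable(host: str) -> bool:
    """Placeholder: in practice, a ping or TCP probe to the endpoint."""
    return False

def pick_access_path() -> str:
    for name, host in PATHS:
        if reachable(host):
            return name
    raise RuntimeError("no remote path to the site: someone has to drive to the DC")
```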

Based on the statements from Facebook and unofficial communication on Reddit, it looks like their management access was also down. For readers: if your org runs non-cloud data centres and does not have out-of-band management access, it's time to learn from this and get that path up.

2. Why would management access be down?
Based on the information available, it looks like a side effect of a command run across all such network infrastructure. Why would somebody run it on all routers at once? Why would somebody run it across all sites? Nobody knows unless FB answers it. But the whole point is not to put the blame on somebody: you, me, any one of us could have run it if we felt it was a read-only, non-fatal command. I have seen network devices crumble under plain metric collection because of a memory leak in SNMP. We have seen a handful of our servers go down on a particular SCSI command that was meant to be harmless and only collect some metrics. If a command is not handled appropriately by the hardware, or if it exacerbates an already existing problem in the devices, then even a command meant to be benign can cause a catastrophe.

Running any command on critical production infrastructure like the devices at the edge probably needs a much slower ramp. Rollouts have to be gradual, one device or one site at a time, and slowly ramped up, as in the sketch below. This is the eternal war between agility and stability.
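Here is a hedged sketch of that slow ramp, assuming placeholder apply_change() and check_health() helpers rather than any real tooling: push the change one site at a time, let it soak, and halt on the first sign of trouble.

```python
# Staged rollout sketch: one site at a time, soak, verify, halt on failure.
# apply_change() and check_health() are placeholders, not real tooling.

import time

SITES = ["edge-1", "edge-2", "edge-3", "edge-4"]

def apply_change(site: str) -> None:
    print(f"applying change on {site}")     # e.g. the config push / command run

def check_health(site: str) -> bool:
    return True                             # placeholder: probe BGP sessions, traffic, errors

def staged_rollout(sites, soak_seconds=300):
    for site in sites:
        apply_change(site)
        time.sleep(soak_seconds)            # give problems time to surface
        if not check_health(site):
            print(f"{site} unhealthy: halting rollout and paging humans")
            return False
    return True

# Contrast with running the same command on every site at once.
# staged_rollout(SITES)
```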

3. So Facebook had to get the DC engineers to reach the sites and re-establish the announcements. They found it difficult to do so for two reasons:

  • They used Facebook's own workplace tools for internal communication, which were also blackholed, so they had to find a new communication platform just to talk to the DC engineers.
  • Secondly, the network devices' access controls might not have allowed the DC engineers to log in and kick-start the announcements. Who grants access to a DC engineer when there is no connectivity to the site through which to grant it? A chicken-and-egg problem.

4. Now, when all of this is chalked out, nobody knows how the systems will behave when traffic is let in. All sites will have caches of least recently used data. All these caches are now stale; they might have evicted data that is past its Time To Live. A cold cache can cause thrashing in the data infrastructure. Distributed systems spanning multiple data centres were fragmented for hours; once the connection comes back up, all of them have to rejoin and become healthy again. Nobody would have planned for a complete shutdown and restart of Facebook. A plan has to be chalked out so that, once things are back up, the site does not crash again because of distributed systems out of sync, cold caches or other unanticipated reasons. Many apps would have crashed when they couldn't establish cross-site connections, and such apps have to be restarted before letting the traffic in.
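To make the cold-cache concern concrete, here is a rough Python sketch of admitting traffic back in steps instead of all at once; the ramp fractions and request counts are invented, and a real recovery would gate each step on cache hit rates and replication health.

```python
# Sketch of ramping traffic back onto a cold cache: admit a growing fraction
# of requests per step instead of opening the floodgates at once.
# The ramp schedule and request counts are invented for illustration.

import random

RAMP = [0.01, 0.05, 0.10, 0.25, 0.50, 1.00]   # fraction of traffic to admit

def admit(fraction: float) -> bool:
    """Per-request decision: hit the (cold) backend or turn the request away."""
    return random.random() < fraction

for step, fraction in enumerate(RAMP, start=1):
    admitted = sum(admit(fraction) for _ in range(10_000))
    print(f"step {step}: admitting ~{fraction:.0%}, let {admitted} of 10000 requests in")
    # A real recovery would hold each step until cache hit rates and
    # cross-DC replication look healthy before moving to the next one.
```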

The whole SRE team would have been in the war room. No chaos engineering would have tested this case, and with the complexity of the stack, nobody would have had the confidence to take a decision and zero in on a plan.

It could be a manual error or oversight that snowballed into a big issue. There are learnings:

  1. Have out-of-band access to every site.
  2. Out-of-band access and the production network should not have changes ramped at the same time. Changes at the edge have to be tested; treating infrastructure as software doesn't mean fast rollouts of software in critical components like the edge.
  3. Have other ways of communication established among the engineers in your org. Wikis and other internal documentation should stay reachable so you can sail through outages; make contingency plans for that, like hosting the wiki with a different cloud provider.
  4. There should be an empowered group to take the call on how to move forward, and they should shield the engineers from pressure. A noisy war room is less useful.
Overall, a site turned itself off and came back up. We should appreciate Facebook and its engineers for the time they put in to sail through the crisis. It's not just a DNS issue, or a configuration issue in one data centre (maybe). It's the outage where the stars are all aligned against you.







