This outage has caused considerable noise everywhere. It was quite discomforting for me that, throughout the whole conversation, nobody bothered to understand the gravity of the issue. I don't expect end users to understand it, but this blog post is for those of us in the tech field.
Such an event can happen no matter how much chaos engineering, or whatever other best practice from the tech lexicon, we implement in the stack.
To all my Site Reliability Engineer friends: Site Up is our first priority. I have myself said many a time that an outage is news and that SREs should prevent outages. But I'm afraid this is breeding a cult in the industry that despises outages and takes no learnings from them.
I don't know what actually happened at Facebook. I can explain a scenario which may or may not be right, but which can definitely show the gravity of the issue.
Let's sketch a probable Facebook architecture.
PoP locations are usually small; they forward traffic to a full-blown data centre unless the requested resource is already cached with them (like a viral FB video). The data centre to forward to is picked based on some rules, and traffic to that data centre is sent over the internet, addressed to its specific inbound IP range. This is again routed by BGP. PoP-to-data-centre communication may have additional security settings to prevent any untrusted access.
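The PoP behaviour above can be sketched in a few lines. This is a hypothetical model, not Facebook's actual logic: the cache contents, data centre names, IP ranges (documentation ranges from RFC 5737), and the lowest-RTT selection rule are all my assumptions.

```python
# Hypothetical sketch of PoP request handling; all names, ranges,
# and the routing rule are assumptions for illustration.

CACHE = {"viral-video-123": b"<cached bytes>"}  # hot objects cached at the PoP

# Each data centre is reachable via its own inbound IP range (announced over BGP).
DATA_CENTRES = [
    {"name": "dc-east", "inbound_range": "198.51.100.0/24", "healthy": True, "rtt_ms": 12},
    {"name": "dc-west", "inbound_range": "203.0.113.0/24", "healthy": True, "rtt_ms": 48},
]

def handle_request(resource: str):
    # 1. Serve straight from the PoP cache when the object is already there.
    if resource in CACHE:
        return ("pop-cache", CACHE[resource])
    # 2. Otherwise pick a data centre by some rule -- here, lowest RTT
    #    among healthy DCs -- and forward over the internet to its range.
    candidates = [dc for dc in DATA_CENTRES if dc["healthy"]]
    if not candidates:
        raise RuntimeError("no data centre reachable from this PoP")
    dc = min(candidates, key=lambda d: d["rtt_ms"])
    return ("forward", dc["name"])

print(handle_request("viral-video-123"))  # served from the PoP cache
print(handle_request("profile-page"))     # forwarded to dc-east (lowest RTT)
```

The key point of the sketch: a PoP can only answer from its cache; everything else depends on a route to a data centre existing at all.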
1. Why not just bring back the announcement?
BGP announcements are made by core routers or switches. There will be at least a pair of routers making the announcement, and at least a pair of peers picking it up.
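A toy model makes clear why "just bring back the announcement" is the crux. This is not how a real BGP implementation works internally; the router and peer names are invented, and the prefix is a documentation range, but the redundancy-then-blackhole behaviour is the point.

```python
# Toy model of why withdrawn announcements blackhole a prefix.
# Router names, peer behaviour, and the prefix are made up for illustration.

class Peer:
    """An external peer's view: prefix -> set of routers announcing it."""
    def __init__(self):
        self.rib = {}

    def receive_announce(self, prefix, router):
        self.rib.setdefault(prefix, set()).add(router)

    def receive_withdraw(self, prefix, router):
        routers = self.rib.get(prefix, set())
        routers.discard(router)
        if not routers:                 # last announcing router gone:
            self.rib.pop(prefix, None)  # the prefix vanishes from the table

    def reachable(self, prefix):
        return prefix in self.rib

peer = Peer()
prefix = "203.0.113.0/24"   # documentation range standing in for real space

# A pair of core routers announce the prefix to the peer.
for router in ("core-rtr-1", "core-rtr-2"):
    peer.receive_announce(prefix, router)
assert peer.reachable(prefix)

# Losing one router is fine -- the pair gives redundancy.
peer.receive_withdraw(prefix, "core-rtr-1")
assert peer.reachable(prefix)

# But if every core router withdraws, peers drop the route entirely
# and traffic to the prefix has nowhere to go.
peer.receive_withdraw(prefix, "core-rtr-2")
print(peer.reachable(prefix))  # False: the prefix is now unreachable
```

Once the last announcement is withdrawn, no amount of retrying from outside helps; someone has to reach the routers themselves, which is exactly the problem described next.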
Running any command on critical production infrastructure, like the devices at the edge, probably needs a slower ramp. The rollout of changes has to be gradual, like one device at a time or one site at a time, slowly ramped up. This is the eternal tension between agility and stability.
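A slow ramp can be as simple as the loop below. This is a minimal sketch, not any real deployment tool; `apply_change` and `health_check` are hypothetical hooks.

```python
# A hedged sketch of a slow-ramp rollout: one device at a time,
# aborting on the first health-check failure. All names are invented.

def rollout(devices, apply_change, health_check):
    """Apply a change device-by-device; stop and report on first failure."""
    done = []
    for device in devices:
        apply_change(device)
        if not health_check(device):
            return {"status": "aborted", "failed": device, "changed": done}
        done.append(device)   # ramp up only after this device looks healthy
    return {"status": "complete", "changed": done}

edge_routers = ["edge-1", "edge-2", "edge-3"]
result = rollout(
    edge_routers,
    apply_change=lambda d: None,           # pretend to push the config
    health_check=lambda d: d != "edge-2",  # simulate edge-2 going unhealthy
)
print(result)  # aborted at edge-2, with only edge-1 changed
```

The design choice is the blast radius: a bad change caught at the first device costs one device, not the whole edge.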
- They used Facebook Workplace for internal communication, which was also blackholed, so they had to find a new communication platform to talk to the DC engineers
- Secondly, the network device access controls might not have allowed DC engineers to log in and kick-start the announcement. And who grants access to a DC engineer when the system that grants access has no connectivity to the site? A chicken-and-egg problem.
4. Now, when all of this is chalked out, nobody knows how the systems will behave when traffic is let back in. All sites hold caches of recently used data. All of these caches are stale now; they may have evicted data that went past its time to live. A cold cache can cause thrashing in the data infrastructure. Distributed systems spanning multiple data centres were fragmented for hours; once connectivity comes back, all of them have to rejoin and become healthy again. Nobody would have planned for a complete shutdown and restart of Facebook. A plan has to be chalked out so that, once things are back up, the site doesn't crash again due to distributed systems being out of sync, the cold cache, or any other unanticipated reason. Many apps would have crashed when they couldn't establish cross-site connections, and such apps have to be restarted before letting traffic in.
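The cold-cache problem can be shown with a tiny model. The numbers and key names are invented; the point is the shape of the load on the backend, not the magnitudes.

```python
# Illustrative model (invented numbers) of why a cold cache can
# thrash the data tier: every miss becomes a backend read.

class Cache:
    def __init__(self, prewarmed=()):
        self.store = {k: f"value-for-{k}" for k in prewarmed}
        self.backend_reads = 0

    def get(self, key):
        if key not in self.store:
            self.backend_reads += 1   # miss -> expensive trip to the database
            self.store[key] = f"value-for-{key}"
        return self.store[key]

hot_keys = [f"key-{i}" for i in range(100)]

# Steady state: the cache already holds the hot set, the backend sees nothing.
warm = Cache(prewarmed=hot_keys)
for k in hot_keys * 10:
    warm.get(k)
print(warm.backend_reads)   # 0

# After a full restart the same traffic all misses at once, and the
# backend has to absorb the entire hot set as one concentrated burst.
cold = Cache()
for k in hot_keys * 10:
    cold.get(k)
print(cold.backend_reads)   # 100, all in the first wave of requests
```

This is why traffic usually has to be let back in gradually after a restart: the backend sized for a warm cache's trickle suddenly sees the full request rate.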
The whole SRE team would have been in the war room. No chaos engineering would have tested this case, and with the complexity of the stack, nobody would have had the confidence to take a decision and zero in on a plan.
It could be a manual error or oversight that snowballed into a big issue. There are learnings:
- Have out-of-band access to the site
- Out-of-band access and the production network should not have changes ramped at the same time. Changes at the edge have to be tested, and treating infrastructure as software doesn't mean fast rollout of software to critical components like the edge
- Establish other channels of communication among engineers within your org. Wikis and other internal documentation should remain reachable to sail through outages; make contingency plans for that, like hosting the wiki with a different cloud provider
- There should be an empowered group to make the call on the plans to move forward, and they should shield engineers from pressure. A noisy war room is less useful.