The Real Reason Why Facebook Went Down

Captions
Yesterday was not an exciting day to be a Facebook engineer because, if you're not living under a rock, you know that Facebook, WhatsApp and Instagram all went down yesterday because of a big outage. In this video I'm going to discuss the complete analysis of what happened. Now we have the information from Facebook officially on what went wrong, so let's just see how that happened. If you're new here, make sure you leave a like, subscribe to the channel and hit the bell icon. This is free of cost and helps the channel grow.
So Facebook came up with two official blog posts, and these blog posts have been written by the vice president of infrastructure at Facebook. The first one did not contain a lot of updates. It was just a generic message that we had a fault and, you know, the general stuff: sorry for the services being down, and so on. But this blog post, which is titled "More details about the October 4 outage", actually goes into the depth of what really happened. Not exactly in full depth, but the overall idea of what went wrong with Facebook is pretty clear. There already were speculations about BGP and DNS problems, but there are a lot of exciting things happening in this whole event, which I'm excited to discuss on the whiteboard in a few seconds.
Besides these couple of blog posts, there was a great blog by Cloudflare as well at the time Facebook was down, so they were pretty fast, and so was [name unclear] on his YouTube channel. The couple of videos which he also launched were pretty helpful in doing an initial diagnosis of what exactly was not working and why Facebook was not working. But now, because we have more data with us, let's just get into the depth of what happened at Facebook on October 4th, 2021.
So, in order to understand the full sequence of events, let's start with how these cloud service providers usually work. For example, we have Cloudflare, we have AWS, and we have Facebook as well. I'll consider Facebook a cloud, even though they just use it for their own internal purposes, because the logic remains the same. If this is Earth, for example, you would have a lot of servers available in different locations of the world, so let's say these are the servers of Facebook. Now there is a network called the internet, which is obviously what we know: you route the traffic over the internet. And there is another network which these cloud providers create, which is their own backbone infrastructure.
So let's say you are somebody who's sitting right here, and when you connect to the Facebook website, it directs you to this server right here. Now let's say you are requesting a video file: you go to facebook.com and you request a video. It is very much possible that this server does not have that video file. You are already connected to this server, so it is the job of this server to fetch that video file, and let's assume the video file is present here. Now the server has two options. The first option is that this server also uses the internet to route to that particular server, just like you do: you are sitting at your home, you connect to your ISP, and you are routed to the nearest server. But if it uses the internet, it is on a network which is used by billions of people simultaneously. So, in order to speed things up, what these cloud providers do is come up with their own backbone infrastructure, which connects only their own services in a network. So if this is the network for Facebook, you're going to see that this is obviously very, very small compared to the
whole internet, but the advantage here is that it is optimized, it is super fast, and it is internal: these routes cannot be accessed on the internet, which makes it relatively safer and faster as well. This is known as Facebook's global backbone infrastructure. So you get the idea from here: if there needs to be any data transfer, that happens on Facebook's global backbone, not on the internet. A similar global backbone exists for AWS, for Cloudflare, for all these major providers as well. They do not send internal traffic between their own servers via the internet. They have their own global backbone infrastructure, whatever you want to call it.
So what happened on October 4th is that some of the Facebook team members were trying to update this part of the network. They were trying to patch something, some updates maybe, whatever they were trying to do. And it is okay to take down a part of the network because, hey, even if you're here and this node is not reachable, maybe you would try this node. Facebook would try this node automatically, and that's fine. You can take down a part of the infrastructure. But what happened was that while taking this part down, it actually took down all the global nodes: the whole global infrastructure went down. Why was that? That was because of a wrong command being issued, so this was again human error, not exactly a machine error. A wrong command was issued and this took down the whole network.
And of course you would wonder why Facebook does not have any safeguards for preventing something like this. The blog article says that they do, but their auditing software, which checks whether these commands should be executed or not, also had a bug, and that bug actually led to this. So there were two things which went really wrong, in a very specific order: the first is the wrong command, and the second one is that this wrong command was
exactly the kind of command which the bug in their auditing software let through, so this was not detected. The whole global infrastructure went down, which means all the nodes were taken down, not just the part of the network which they were trying to do some maintenance on. And the moment this happened, the next set of events got triggered. As you can see on the blog, it says this change caused a complete disconnection of our server connections between our data centers and the internet, and this loss of connection caused a second issue, which was the real problem. I mean, this was also a real problem, but the second one is the problem which is linked to DNS. This part we already knew a bit about from the Reddit posts and the internet and so on, but the earlier one was completely new.
All right, so far all of Facebook's data centers have been disconnected from the internet. That is fine, but why was facebook.com not even opening? Because if you had gone to facebook.com or web.whatsapp.com, you would have seen that the IP address was not found, which is not the sign that you don't have a routable server. It means that you don't even have DNS in place. So what happens after this second thing goes wrong? In order to understand that, we have to do a little DNS crash course.
How it works, on a very high level, is that we have something known as the Border Gateway Protocol, which broadcasts to the whole internet something known as authoritative name servers. These name servers are actual servers which your computer asks where the data center is. So, for example, this is your computer. It goes to your OS, the OS goes to the root name server, and the root name server goes to the authoritative name server, which knows the IP address of facebook.com. Your operating system might be Windows, the root name server is .com for Facebook because it ends in
the .com, and the authoritative name servers are the name servers managed, again, by Facebook. This was the problem: Facebook manages its whole infrastructure from top to bottom. If this was done by some other company which was not down, it would be easier to reboot the whole system. The authoritative name servers contain the IP addresses, or actually not exactly contain: they have to answer with the IP addresses of where these data centers are.
Now the way Facebook is configured is that if a part of the network goes down, it stops broadcasting that authoritative name server over BGP. This is interesting: Facebook does that because, hey, if a part of the network is not even available, why should we send anyone there anyway? That will just maybe crash their page or return some wrong response. So what they do is take it off the broadcasting list on BGP, and BGP is responsible for telling the internet where the authoritative name servers are, which in turn tell the IP addresses of these data centers.
Now, two big problems. The first problem is that these data centers are obviously down. The second problem is that Facebook has configured it in such a way that BGP stopped broadcasting these name servers for Facebook, because the whole infrastructure went down and Facebook was configured to just remove the authoritative name server entries which were not working. This means Facebook has really just disappeared from the face of the earth, because now you have no server which is connected to the internet and you have no routable way of even getting to those servers which are down, not just the public ones but even the private ones. So if you're sitting here in California, where the engineers might be, and you want to SSH or do any sort of thing with some data center here, you cannot do that via the internet, because most likely that is an internal service, and you cannot do it via
the internal network because, well, your infrastructure is completely down. So you need to go ahead and get physical access and manage it physically at those locations. So the next part of the problem was the DNS failure, which is because of the Border Gateway Protocol, the broadcasting and the authoritative name servers. But just to add icing to this, the fourth major problem was that even if you got to these physical data centers, Facebook had physical security in place. This security made it harder for the engineers to actually enter the real buildings, access the real servers and work with them, because there were certain safety measures and guidelines put in place. And to be honest, that's a good thing: you don't want somebody to just walk in in 30 seconds, access everything and do anything they want. So the physical security was another barrier.
I think Facebook was down for eight to ten hours. I don't have the exact number, but in India at least it was close to 9 pm to 4 am: this is where it went down, this is where the DNS came up at least, and then gradually the services came up. Close to eight to ten hours of downtime is pretty impressive. I mean, just figuring out the whole problem, sending people physically to the data center, breaking into the data center, restarting everything, waiting for the DNS to propagate: this must have been a super busy day for people working at Facebook.
But yeah, this was a pretty interesting event of a complete infrastructure failure, which, to be honest, I cannot say will not happen again, because this was fundamentally a bug in the auditing system. That means a bug here and a bug here, two bugs combined together, could just take down the whole Facebook infrastructure. And if there is any other bug existing in other cloud providers, I mean, the day this happens for AWS or Cloudflare, you can pretty much
expect most parts of the internet to be down for six to eight hours, which would be catastrophic. Cloudflare, AWS and Google Cloud are even more fundamental backbones compared to WhatsApp and Facebook and stuff like this, because those services might internally be using them. So yeah, an interesting set of events, but let's see what Facebook learns from it, and maybe they will add some more technical, in-depth detail of how this happened. But that's it for now.
All right, so that's pretty much it for this video. If you liked it, make sure you leave a like and subscribe to the channel. Thank you so much for watching, and I'm going to see you in the next video really soon. If you're still watching this video, make sure you comment down in the comment section: "I watched this video till the end". Also, if you're not part of codedamn's Discord community, you are missing out a lot on events which we organize on a weekly basis to code. You already know the drill: make sure you like the video, subscribe to the channel if you haven't already, and thank you so much for watching.
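The chain of failures described in the captions (a bad command, an audit check that misses it, name servers withdrawing their BGP advertisements, DNS lookups failing) can be sketched as a toy simulation. This is only an illustration under stated assumptions: the class, the command strings, the link and prefix names and the `buggy_audit` flag are all hypothetical and do not reflect Facebook's real tooling, and the returned address is a documentation address, not a real Facebook IP.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNetwork:
    """Toy model of the outage chain; every name here is hypothetical."""

    # Backbone links between regions (invented labels).
    backbone_links: set = field(
        default_factory=lambda: {"us-eu", "us-asia", "eu-asia"}
    )
    # Prefixes for the authoritative name servers, announced over BGP.
    advertised_prefixes: set = field(
        default_factory=lambda: {"dns-prefix-a", "dns-prefix-b"}
    )

    def audit_allows(self, command: str, buggy_audit: bool) -> bool:
        # A working audit rejects a command that drains every link.
        # buggy_audit=True models the audit tool failing to catch it.
        if command == "drain-all":
            return buggy_audit
        return True

    def run(self, command: str, buggy_audit: bool) -> bool:
        """Apply a command if the audit lets it through; return whether it ran."""
        if not self.audit_allows(command, buggy_audit):
            return False
        if command == "drain-all":
            self.backbone_links.clear()
        elif command.startswith("drain "):
            self.backbone_links.discard(command.split(" ", 1)[1])
        self.dns_health_check()
        return True

    def dns_health_check(self) -> None:
        # Name servers that cannot reach any data center withdraw
        # their BGP advertisements, as described in the transcript.
        if not self.backbone_links:
            self.advertised_prefixes.clear()

    def resolve(self, name: str) -> str:
        # Without advertised routes, resolvers cannot reach the name servers.
        if not self.advertised_prefixes:
            raise LookupError(f"no route to authoritative name servers for {name}")
        return "192.0.2.10"  # RFC 5737 documentation address, not a real IP


if __name__ == "__main__":
    net = ToyNetwork()
    net.run("drain us-eu", buggy_audit=True)    # partial maintenance is fine
    print(net.resolve("facebook.com"))
    net.run("drain-all", buggy_audit=True)      # bad command slips past the audit
    try:
        net.resolve("facebook.com")
    except LookupError as err:
        print(err)
```

With `buggy_audit=False` the same `drain-all` command is rejected and lookups keep working, which is the point made in the video: it took both the wrong command and the audit bug together to take everything down.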
Info
Channel: codedamn
Views: 5,615
Rating: 4.9324579 out of 5
Keywords: The Real Reason Why Facebook Went Down, Why Facebook, Instagram & Whatsapp Were Down, Real Reason Why Social Media was Down, facebook down, instagram down, whatsapp down, facebook instagram whatsapp down, facebook whatsapp instagram server down, facebook down globally, whatsapp server down, instagram server down, facebook and instagram down, facebook instagram outage, whatsapp instagram down, whatsapp and instagram down in india, whatsapp facebook instagram server down
Id: N9TsSs0Y6Hs
Length: 12min 15sec (735 seconds)
Published: Wed Oct 06 2021