Why Did Facebook Fail? - Computerphile

Captions
So when it moves, it's changing the route to get between two nodes. When it flashes, it disappears. And then, there we are, that's it gone, and the world is free of Facebook.

If a tree falls, a dream takes, does the tree exist? But no one knows, because no one's seen the fact. What are they going to be saying in Silicon Valley this morning? I'm not sure we can put that on a family-friendly channel.

Once upon a time, not so long ago, there was a little website, and its name was Facebook. And then it stopped. It ceased to be. It ceased to be, took along Instagram and WhatsApp with it, and, yeah, so a bit of excitement for those network nerds last night as we watched Facebook disconnect from the internet.

The first thing to say is that Facebook know what they're doing when it comes to networking. They're designing their own gear, they're designing their own switches, their own routers, they're writing their own operating systems on top of Linux to run these things. They're not some fly-by-night operation that don't know what they're doing here. The thing that's probably happened here is that someone's had a bad day at work: they've pushed something up to the servers and to the switches and things, and it just didn't work. Normally when you do that, you get a slight error message. Someone yesterday probably pushed the wrong button and brought down the whole of Zuckerberg's empire.

What was interesting about it from a networking point of view is just exactly what happened. This wasn't a case of you couldn't access the website, it wasn't responding, but the computers were still there. Facebook literally managed to disconnect themselves from the internet. The stories coming out from journalists in the States and things are that Facebook engineers couldn't get into the building because their cards weren't being swiped, which you can start to understand when you've got such a large network that you're dealing with, and they like to eat their own dog food, and so their
whole network is built on top of their own infrastructure. When you get something as catastrophic as this happen, then it's going to be a big problem for them.

So the big question then, of course, is what actually happened. I'm using the RIPE BGPlay to look at this. I've pre-recorded it so I can skip through it. If we run back in time, we can get to around three o'clock GMT. What we've got here is the network where Facebook's DNS server is. So DNS is the system which takes the name facebook.com and converts it to the IP address that Facebook uses. So watch Mike's video on that.

So DNS acts a bit like a phone book: you can look up facebook.com and you get an IP address back. Well, do people actually know what a phone book is these days? Well, some of our older viewers will. YouTubers might need to know that it's a book that has telephone numbers. Yeah, DNS is the bit of the internet that converts names, facebook.com, instagram.com, whatsapp.com, into the IP addresses of the servers that serve them.

So what we've got on screen here is the network that Facebook's main DNS server is on. Why that's important is that when you request facebook.com, or whichever one it is you want to go to, or your phone's app requests it, it'll end up talking to that, probably indirectly by your ISP getting the IP address from that, and then it will make a connection to that IP address to actually connect to the thing. So before your computer connects to Facebook's computer, it asks where Facebook is.

This red dot is the network that Facebook's DNS server is on. These other dots are just other networks on the internet, so different ISPs and things. We've got Cogent over here, I think it's Telia over there in Sweden, and things, different ISPs that are providing things. If you look at it on a normal day, Facebook is very, very well connected. They have direct network connections, they peer, as the terminology is in the networking industry, with pretty much anyone who will peer with them. So rather than
having to connect to someone else, you connect directly to Facebook: there's a bit of fibre optic between the two of you and the data travels that way. That's how they can cope with the sheer amount of bandwidth from everyone checking their phone.

What we see happening is quite interesting. It's actually a really good example of how the internet can cope with disruption, because if we just scroll forward in time, we can see that as things start to happen, we get to this point where everything starts to reroute, and rather than going directly to Facebook, you can start to see it all going to Cogent and Telia. And then as we go a bit further on, we suddenly see that everything starts to disappear, until you get to a point, there we are, where there's nothing connected to Facebook except there's one machine down here, there's one network. And suddenly Facebook's network, the network that all Facebook's computers are on, is completely disconnected from the internet. And so when you go to look at facebook.com, your computer cannot find out where the machine is on the internet, so it can't even send a packet there saying "connect to Facebook". Not that Facebook would be there to connect to, because it doesn't know where it is.

So at about, what we're looking at, sort of 16:27 GMT yesterday afternoon, the 4th of October 2021, Facebook had effectively disconnected themselves from the internet. Their computers were no longer connected to the internet. Now, this isn't the case of someone going around pulling all the Ethernet cables out, pulling all the fibre optics out, and physically disconnecting them. No, this is a configuration error in a protocol called BGP, for Border Gateway Protocol, version 4.
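
The dependency described above, where a client has to look a name up before it can send a single packet, can be sketched as a toy model. This is only an illustration under stated assumptions: `NameServerUnreachable`, `resolve`, `connect` and `ZONE` are hypothetical names, no real network traffic is generated, and 192.0.2.10 is a reserved documentation address standing in for a real server address.

```python
class NameServerUnreachable(Exception):
    """Raised when there is no route to the authoritative name server."""


def resolve(name, zone, server_reachable):
    # Step 1: ask the name server which address sits behind the name.
    if not server_reachable:
        raise NameServerUnreachable(f"cannot reach the name server for {name}")
    return zone[name]


def connect(name, zone, server_reachable=True):
    # Step 2 only happens if step 1 succeeded: without an address
    # there is nowhere to send the first packet.
    try:
        address = resolve(name, zone, server_reachable)
    except NameServerUnreachable as exc:
        return f"lookup failed: {exc}"
    return f"connecting to {address}"


# Placeholder record (assumption): 192.0.2.10 is a documentation address.
ZONE = {"facebook.com": "192.0.2.10"}

print(connect("facebook.com", ZONE))                          # a normal day
print(connect("facebook.com", ZONE, server_reachable=False))  # routes withdrawn
```

In the second call the web servers themselves never come into it: the failure happens at the lookup stage, which matches the point made in the transcript that the client cannot even work out where to send a packet.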
And this is a protocol that's used by the routers that connect networks together. So rather than having a sort of static map of "if you want to get from A to B, you go down this route, then that route, then that route", the computers configure themselves automatically. They're constantly refreshing things as the situation changes on the internet. So a network link may go down: someone might put a JCB's fork through a fibre optic, might misconfigure a server, or that link may just have a lot of data transmitting across it, in which case it's better to route it via a different route because it will get there faster, because it hasn't got to deal with the contention on that link. So you need to continuously recalculate which route to take so that you get the best performance out of it, and to do that you need to know where everything is on the internet, so you can work out which is the best route to take at this point.

Now, if it was static, you could use something like Dijkstra's algorithm, which Mike's talked about: run that on all the data and you can work out the quickest route. Things like AutoRoute or Google Maps will often do it like that. They'll use an algorithm like that because the road network's static: you know where everything is, you know the sort of capacity of it and things, and so you can work out the route beforehand.

With the internet, we want to do things dynamically. So rather than storing everything statically, the routers on the internet are communicating with each other using the Border Gateway Protocol 4 to tell them what they can see, where they can get to, what the cost of going that way is, and things, and they're exchanging this information. So the router will get information from another router and it says, "OK, I can get to, let's say, Facebook this way, by that machine, but the cost of going that way is higher than the cost of going by the direct link I've got, so I'll send my packet via the direct link", and so it
sends it that way. What we saw here is that suddenly those direct links started disappearing. So as the direct link disappeared, it said, "Well, actually, I can't get that way any more, I'll go this way", which is why we saw that suddenly it went from everyone going to Facebook to, at this point, you're starting to go through a few people who are now saying "I can go that way", and it sort of concentrates onto a certain set of routes.

Now, why did that happen? Well, as part of BGP4, the machines are announcing that they can get somewhere and that this network is at the other end of this thing. What seems to have happened is that around sort of three, four o'clock GMT yesterday, Facebook stopped announcing the routes to their network. So they stopped saying to the routers they were connected to, "Hey, you can reach Facebook, 129.134.30.12, here", and things. They stopped announcing that you could reach things there. So as soon as they stopped announcing it, the routers they were connected to said, "Well, I can't get to it that way because it's not being announced any more. That was a stale route, it doesn't exist any more. I'll go this way, because that is still saying I can reach you that way."

Because you've got propagation times: router A, which is Facebook, say, tells router B, and it tells router C, but router C is also telling router B "you can reach there", and so you've got to wait for it to update and propagate round. So you get this sort of effect where suddenly you get routes that probably don't exist. It's not that they could probably reach Facebook at this point; it's actually that they still thought they could when they told this machine that they could, but by the time it updates they couldn't reach there. And that's when you get things suddenly disappearing completely, and Facebook's no longer accessible because they're not announcing that they exist.

Technically, the machine has probably never disconnected from the internet. If you knew that you could send it to this machine and then that
would pass it on to Facebook, you probably could have hacked a static route into your machine, and if your ISP did it as well, actually got the data to Facebook. But to do that you need to know these things, and then you have to manually configure it, and that's probably what got Facebook into the problem in the first place: changing the configuration on the BGP things.

Now, why would that have an effect internally? Well, there's probably two things at play here. Facebook probably used the same DNS servers to store their internal things as well, and when they did a lookup upon them, they weren't connected to the internet, so they couldn't access them. But I suspect the other thing is that often when you've got a big network, and Facebook do have a big network, they have thousands and probably tens if not hundreds of thousands of machines working on their network, you probably end up building your own internet-type network internally. Facebook, Google, Apple, Microsoft, Amazon, all of these are of such a size that they will have their own, effectively, mini internet, networks of networks, internetworking, which is where the name comes from, internally. And so they'll use BGP internally, or iBGP, which is the one designed for interior networks, and I imagine that the same configuration changes which made it drop the announcements externally probably had a similar effect internally.

So you're sitting at your desk and you realize Facebook's gone down, so you SSH into the server that you need to configure, but the router that you're connected to doesn't know where to send that packet, because it suddenly no longer knows where things are. And then you probably can't make the VoIP, voice over IP, phone call to the person who can switch things back on, and you've suddenly got a chicken-and-egg situation: how do we communicate without Facebook and all the tools we use when our network's down, when everything communicates over our network? So how do we
bring it back up? I did hear one thing where apparently the people who had physical access to the servers, to battery, and the switches to reboot them didn't have the credentials to do so, and the people who had the credentials couldn't communicate with the things because they couldn't get into the building to get the credentials. Whether that's true or not, or just rumours flying around on Hacker News and things, is another story.

Facebook would have needed to roll things back and then to start re-advertising their routes, which is what happened about nine o'clock GMT last night. They started to re-advertise things, then they went down again, and then they sort of came back up as they reconfigured things. So basically a configuration error took it completely out, because they just effectively disconnected themselves from the internet accidentally. I guess it's the sort of BGP equivalent of tripping over the network cable on the way out the door.

Somebody suggested it was like locking your car keys in the boot and not having that. Yeah, so it's a bit like that sort of situation: I need this to get things restarted, but I can't get that, because I can't access it internally because the network's down, externally because the network's down, and things.

What do you think they'll do after this? Secondary DNS servers? What do you reckon?

So, I mean, this is an interesting question. Let's not forget this is an extraordinary event. Facebook have a very resilient network, they have good disaster management. Things like this probably happen all the time internally and we just don't notice it. This just happened to be the one event where everything aligned in the wrong way and things went wrong, and it became big, literally big news. Well, it would have been big news if anyone could actually tell anyone, but because once that was down people wouldn't be able to tell each other what
was going on. It's sort of, yeah, until Computerphile comes out with the news.

But there'll be different procedures put in place for how you update the servers and things to make sure this doesn't happen again. The thing is, you're not going to get it 100% right. There's no way you can get it 100% right, because there'll always be some situation where something could go wrong. Someone on Twitter suggested that there should be a sort of five-minute period where, if it's not confirmed after that, it reverts back to the original configuration, you know, a bit like when you change the display in Windows and you get that sort of "confirm within 10 seconds" so you can see things. The problem with that is that's fine if it's a problem that will occur within those five minutes. If it requires some set of conditions that happen in the sixth minute, you've got the same situation. You can make it slightly better, and I'm sure they actually do have procedures similar to that, but there's always something that can go wrong.

I mean, from our point of view, for people who use Facebook, or for those of you who use Instagram or WhatsApp, it perhaps should make us think about the way that people build these tools. This wouldn't have happened in the same way to email. It would be very much harder for someone to bring email down in the same way, because the servers are all run by different people. If you send an email to me, it talks to my server, whereas if you send an email to someone on Gmail, it talks to Gmail's server. If you talk to someone at the University of Nottingham, it talks to, well, actually Microsoft's server, because they're supporting the network for it, and so on. You might lose Gmail, or you might lose my server, or whatever, but you wouldn't lose the whole lot in one go. The problem here is all our eggs are in one Facebook basket.

Do not try this at home. Only try this at home. Oh yeah, that's a very good... Try this at home, but not anywhere else.

Realistic random number generator, in the
sense that if you're in some state and you've got your rule, you're always going to produce the exact same state that you had before, right? It seems obvious. So this repeats itself after what
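
Returning to the routing behaviour described in the transcript, where routers prefer the cheapest announced route, fall back to another when it is withdrawn, and lose the destination entirely once nobody announces it, here is a minimal sketch. It is a toy under stated assumptions: the `Router` class, its methods and the `PREFIX` label are hypothetical, and path length is used as the only cost, whereas real BGP compares several attributes before choosing a path.

```python
class Router:
    """Toy router: remembers which neighbours announced a route to a prefix."""

    def __init__(self):
        # prefix -> {neighbour name: path as a list of network names}
        self.routes = {}

    def announce(self, prefix, neighbour, path):
        self.routes.setdefault(prefix, {})[neighbour] = path

    def withdraw(self, prefix, neighbour):
        self.routes.get(prefix, {}).pop(neighbour, None)

    def best_path(self, prefix):
        candidates = self.routes.get(prefix, {})
        if not candidates:
            return None  # nobody announces the prefix, so it is unreachable
        # Simplification (assumption): the shortest path wins.
        return min(candidates.values(), key=len)


PREFIX = "facebook-dns-network"  # hypothetical label, not a real prefix

isp = Router()
isp.announce(PREFIX, "facebook", ["facebook"])          # direct peering link
isp.announce(PREFIX, "cogent", ["cogent", "facebook"])  # via a transit network

print(isp.best_path(PREFIX))      # the direct link is preferred
isp.withdraw(PREFIX, "facebook")  # the direct announcement disappears
print(isp.best_path(PREFIX))      # traffic concentrates on the transit route
isp.withdraw(PREFIX, "cogent")    # the stale route is withdrawn as well
print(isp.best_path(PREFIX))      # nothing left: the network has vanished
```

The three printed results mirror the three stages seen in the replay: direct connections, a brief concentration onto a few transit networks, and then no route at all.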
Info
Channel: Computerphile
Views: 205,672
Rating: 4.9681654 out of 5
Keywords: computers, computerphile, computer, science, University of Nottingham, Dr Steve Bagley, Facebook, BGP, Network, Outage, Instagram, WhatsApp, 4K, UHD, Why Facebook Went Down
Id: Bie32IZlMtY
Length: 15min 26sec (926 seconds)
Published: Tue Oct 05 2021