Evolution of Facebook Backbone

Captions
All right, coming up next we have Gaya from Facebook. While he's working his way up, I want to take a minute to recognize our fellowship recipients. They may or may not be in the room right now, but I'll ask them to stand and give us a quick wave. We have Bronwyn Lewis, who was a recipient this time, in the room. All right. Aaron Rusedski and Collin McIntosh are our three fellowship recipients this cycle, so thank you very much to all those that applied, and congratulations, folks. Here we have Gaya from Facebook.

Thank you. Hi, my name is Gaya, and I'll be talking a little bit about the Facebook backbone. Before we start talking protocols and topologies and all of that stuff, let's take a look at the raw numbers and the scale that we deal with. Facebook is pretty global: it has 1.28+ billion users, and it's growing. Over the last couple of years we saw a major trend change: we saw a lot of our users move into the mobile space, which is now sitting at about 600+ million users. What does this mean? It means that we are now catering to a section of users with variable bandwidth and varied, often increased, latencies. It also means the traffic patterns inside the backbone change with the shift to mobile. In addition to the changes that happen outside Facebook's domain, we also have things that happen within our domain itself; currently Instagram is sitting at about 200+ million users. One other thing: how many of you loved the one-minute Look Back that we did for the 10-year anniversary? It's pretty awesome, right? An interesting story about it: all of that was conceived and executed in about two to three weeks before the launch. Think about it: 1.28+ billion users, and the one-minute Look Back done in three weeks. All I'm trying to say is that what this means for us as network and infrastructure is that not only do we need to be reactive, but we also need to be
proactive, such that we are able to help these business services and products get rolled out on that note.

Let's get back to basics and look at our topology. We run a pretty traditional backbone. We have a bunch of clusters that sit in data centers, and they all connect to the data center router layer. This is our label edge router; this is where our label imposition happens; this is the gateway for all the machines out of the data center. We have a standard backbone layer, and we run a true BGP-free core that is label-switched all the way through our network. The other end of the paradigm is the edge. Here is what we call our peering routers, or PRs; these get us our external connectivity to the users. Among the edge routers we run a full mesh of MPLS LSPs.

A quick rundown of our protocols: we like to keep it simple. IS-IS is used as our IGP, with iBGP and eBGP for our peering. RSVP-TE signaling, mainly because we use FRR for sub-50-millisecond failover, and we also run auto-bandwidth. It's a self-healing network; we have a lot of flows, they're bursty, and it helps to be adaptive.

Let's quickly take a look at the traffic pattern in our backbone itself. If you look at the graph that's being shown, there's a significantly large proportion of machine-to-machine traffic. As we start rolling more services out and as our data centers start to grow, we find the chatter among these machine-to-machine services growing more and more. Not only that, it's an interesting packet size pattern as well, in our view. Let's look at the packet distribution at the edge: this is a pretty standard IMIX pattern, where we have about 35% 64-byte packets and the average packet size is about 810 bytes. Now let's flip the coin and look at what happens at our DRs, our data center routers. Here is where it gets interesting: you see a large number of small-sized packets, about 59-60 percent 64-byte packets, and now the average size becomes less than 300 bytes.
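The relationship between a packet-size mix and the average packet size quoted above can be sketched as follows. The bucket shares here are illustrative assumptions, loosely echoing the talk's numbers (roughly 35% 64-byte packets at the edge versus about 60% at the DRs), not actual Facebook measurements.

```python
def avg_packet_size(dist):
    """dist: list of (packet_size_bytes, fraction); fractions sum to 1."""
    assert abs(sum(f for _, f in dist) - 1.0) < 1e-9
    return sum(size * frac for size, frac in dist)

edge_mix = [(64, 0.35), (576, 0.20), (1500, 0.45)]   # assumed edge-like mix
dr_mix   = [(64, 0.60), (576, 0.25), (1500, 0.15)]   # assumed DR-like mix

print(avg_packet_size(edge_mix))  # larger average at the edge (~810 bytes here)
print(avg_packet_size(dr_mix))    # much smaller average at the DRs
```

The point the sketch makes is that a forwarding element sized for the edge mix sees a very different packets-per-second load at the DR layer, even at the same bit rate.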
Now that we've discussed this, let's look at some of the challenges that we face with the backbone itself. A lot of people in the audience know this graph: the demand continues to grow, and with this, capacity planning becomes a big challenge for us. The other well-documented challenge with a TE-based, auto-bandwidth-based backbone has been bin packing. The third one that we significantly face is scheduling: currently we have millions and millions of flows, and the current paradigm of protocols does not allow any kind of scheduling among them.

Let's look at the capacity planning challenges. This is the kind of graph that comes out in general in a network. Why is it this way? We try to protect all the bits that go across our backbone, and we also capacity-plan our backbone with worst-case failure analysis. What does that mean? We make sure that it survives a single node failure or a single link failure. We have utilization with regional peaks and global peaks and so on, so we'll have a bunch of valleys, but ultimately we also have an over-provisioned backbone.

The other aspect that I touched upon earlier was bin packing. A lot of our services have certain latency thresholds that we want to honor, so ideally we would like to run our links pretty hard; we want to run our shortest paths as hard as possible. Let's take a look at how it breaks in the current paradigm. Say we have a demand that goes from A to C, so we build an LSP that is at, say, 3 gig, and let's assume that all of the links are at 10 gig bandwidth. A few minutes later, a second LSP comes in, catering to the demand that goes between node E and node C; it is also at 3 gig, so currently 6 gig is utilized on the 10 gig link. A few more minutes pass, and the demand on the LSP from A to C increases to 6 gig. Auto-bandwidth kicks in, reoptimization happens, and this is perfect, right? We are utilizing our 10 gig link at 9 gig, which is great.
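The scenario above, and the way it breaks next, can be condensed into a toy sketch. The topology and numbers are the assumed ones from the example: a single 10 gig link B to C carrying the shortest paths of both demands, and LSPs that can only be placed whole.

```python
CAPACITY = 10  # Gbps on the shared shortest-path link, per the example

def place(demands, capacity=CAPACITY):
    """Greedily place whole LSPs on the preferred link; overflow is rerouted.

    Because an LSP cannot be split into sub-flows, any demand that does not
    fit entirely must move off the shortest path, stranding residual capacity.
    """
    used, rerouted = 0, []
    for name, gbps in demands:
        if used + gbps <= capacity:
            used += gbps
        else:
            rerouted.append((name, gbps))  # the whole LSP leaves the link
    return used, rerouted

used, rerouted = place([("A->C", 6), ("E->C", 6)])
print(used, rerouted)  # 6 [('E->C', 6)] : 4 Gbps on B->C goes unused
```

With flow splitting, E to C could have sent 4 gig over B to C and only 2 gig elsewhere; without it, the whole 6 gig demand reroutes.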
Here is where it breaks. The next time, the demand between E and C grows to 6 gig as well, and now we are at 12 gig on the 10 gig link. So we have to break the shortest path and move away, which is fine, but unfortunately there is no mechanism today that can break the flow between E and C into two sub-flows and make sure they utilize the link B to C to the maximum. Right now we have 4 gig of bandwidth available on B to C that's not utilized.

The other aspect that we talked about was scheduling. The only way we can alleviate the problem currently is by using QoS, so we have three classes of service, with EF being premium, followed by AF2 and best effort. That's the only way anything can happen in terms of guaranteeing SLAs when there is a capacity constraint; there is no way for an application to understand what's happening in the network and back off.

So we've talked a little bit about the problems; let's see solutions. If you have seen the video about the Facebook campus, or if you have walked on the Facebook campus, you know we are big believers in slogans, and at the recent F8 conference Zuck jokingly put this up: he said we are going to move to a "move fast with stable infra" posture. This does not mean that we are going to suddenly introduce change reviews and multiple levels of peer review and so on. If you listened to David Swafford's talk at NANOG in Phoenix, you know that we believe a lot in automation, and over the past few months, given the size of our team, this has only reinforced our view. What this slogan means to us is: we've got to be intelligent in polling our systems, we've got to automate, and we've got to build systems that can be reused by other systems, so that we can make our network more programmable.

So the first such system that we built was what we call the network global view. For anybody who runs a backbone that's big, which is a mix of peering circuits, a mix of leased transport circuits,
and a mix of owned transport circuits, a single source of truth, or a desired view of the network, is very, very important. A lot of times we sit in a maintenance window and people ask us, hey, can we take down this shelf? There's always a prayer that goes on: I hope my shared risk link groups are right, I hope nothing has changed. So that desired view was very important to us.

The first thing we did was build something known as circuit DB. What it did was take a holistic view of our entire network: all of the links that we can use, all of the shared risk link groups that we have, so we could actually do some resiliency analysis offline. We automated it, we made it repeatable. This also gave us the ability to audit things: we have a desired view of the network that we think is true, and we have a derived view that we actually get from polling and monitoring and so on. And finally, this gave us visibility: this was the system that we could point our edge fabric design, our edge intelligence, or our backbone TE controller or TE decision engine at.

So we built the system, and then how do you make sure this stays the single source of truth, that it keeps its data integrity? The best side effect we got was that we could move to automated provisioning, automated decommissioning, automated grooms, everything, so we ensured that a manual change does not go and touch the router; it's all done through this tool. So what happens, say, when we are going to provision a circuit? It could be coming in from an optical system, it could be coming from peering, and so on. The first few things people do are capacity planning, reserving ports, and circuit testing. Once this is completed, what happens? You allocate an IP, you do configlet generation, and you push it to the routers.
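The "desired versus derived" audit idea just described can be sketched as a simple diff between two views of the network. All record names and fields here are invented for illustration; circuit DB's actual schema is not described in the talk.

```python
# Hypothetical link records: desired state from the circuit database versus
# derived state observed via polling and monitoring.
desired = {"lnk1": {"srlg": "fiber-7"}, "lnk2": {"srlg": "fiber-9"}}
derived = {"lnk1": {"srlg": "fiber-7"}, "lnk3": {"srlg": "fiber-2"}}

def audit(desired, derived):
    missing = sorted(set(desired) - set(derived))   # in the DB, not seen live
    unknown = sorted(set(derived) - set(desired))   # seen live, not in the DB
    drifted = sorted(k for k in set(desired) & set(derived)
                     if desired[k] != derived[k])   # attributes disagree
    return missing, unknown, drifted

print(audit(desired, derived))  # (['lnk2'], ['lnk3'], [])
```

Any non-empty result is exactly the "I hope nothing has changed" prayer turned into an actionable alert.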
The system did all of it in an automated manner, so that the single source of truth we built continues to be the single source of truth.

I alluded earlier to our demand growth. Let me tell you a story about what happened when I was sitting in a couple of capacity planning meetings. We launched Graph Search last year, so we're sitting in a meeting and people come to us and say, hey, we need to transfer our Graph Search indexes across coasts, across continents, among the data centers. Okay, that's pretty cool; what is the amount of data that you're transferring? It's petabytes. Wow. So we said fine. One more week happens, and then Hadoop wants to do a namespace move. All of this basically translates into petabytes and terabytes of data that need to be transferred. So we quickly came to the realization that what we needed was a holistic view of all the jobs in our system as well, and that led to what we call the traffic manager, or bulk traffic manager control.

Typically what happens in our network is that we have a source tier, a collection of hosts sitting in a data center, that wants to transfer some X amount of data to what we call a destination tier, which may either sit in the same data center or sit across the backbone. The first thing we did was say, hey, we need a lightweight daemon that can sit on every host in our data centers. This daemon then talks to what we call the bulk TM aggregator. The aggregator is essentially a collection of all the jobs that need to happen. Say Graph Search comes in and says, I need to transfer X amount of data, and Hadoop comes in and says, I need to transfer X amount of data; they all talk to this global aggregator. Now this job manager, or global aggregator, will in turn talk to a backbone path computation element, or backbone controller, which we'll talk about a little later.
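The job-registration flow just described could look roughly like the sketch below. The class names, fields, and the naive proportional split are all assumptions for illustration; the real aggregator hands rates negotiated with a topology-aware PCE, which is stubbed out here as a single total rate.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    src_tier: str   # source tier of hosts
    dst_tier: str   # destination tier of hosts
    gbytes: int     # total data to move

class BulkAggregator:
    """Toy global aggregator: collects jobs, splits a granted backbone rate."""

    def __init__(self):
        self.jobs = []

    def register(self, job):
        self.jobs.append(job)

    def allocate(self, backbone_rate_gbps):
        # Proportional split by data volume; a real PCE would place demands
        # path by path rather than splitting one scalar grant.
        total = sum(j.gbytes for j in self.jobs)
        return {j.name: backbone_rate_gbps * j.gbytes / total for j in self.jobs}

agg = BulkAggregator()
agg.register(Job("graph-search", "dc1", "dc2", 300))
agg.register(Job("hadoop-ns", "dc1", "dc3", 100))
print(agg.allocate(8.0))  # {'graph-search': 6.0, 'hadoop-ns': 2.0}
```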
So all the bulk traffic manager gets is essentially a transfer rate, which it then goes ahead and programs on all the hosts. Say a job has some minimum amount of bandwidth that it needs to transfer its data. Using iptables, ip rule sets, and the Linux tc shaping function, we ensured that we could monitor the transfers as well. Say we have a source tier that is requesting a lot of bandwidth, and it's been given, say, 100 Mbps, but it's not actually transferring 100 Mbps; we can take that bandwidth away and give it to some other job, and since we are continuously monitoring, as soon as we find this tier starting to transmit more traffic, we bring it back up to its assured minimum bandwidth level. This gave us a view of all the jobs, of what the applications are doing, and then we could actually start talking about controlling them.

So now that we had a view of the network and a view of all of our jobs, the next logical thing was to build a path computation element: let's run an algorithm and make sure all the demands that we have can be placed optimally in the network. What's the current paradigm? We basically have a distributed system, with every router doing what it thinks is best for the network. Let's take that away and have a centralized element that can look at everything. We have all of the statistics that we need: all of our LSP statistics, all the link utilization statistics. And we run a converged backbone; what I mean by converged backbone is that we have all of our machine-to-user and machine-to-machine traffic running on the same set of links, in the same topologies. So now this path computation element can take all of this and build a demand mesh, and it has circuit DB to tell it what the desired state of the network is and what the shared risk link groups are.
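The monitor-and-reclaim behavior described above, taking bandwidth from an idle tier and restoring its assured minimum once it transmits again, can be sketched as a simple control loop. The 50% idle threshold, the 10 Mbps probe floor, and the rates are all assumed numbers, not the real policy.

```python
def reallocate(grants_mbps, observed_mbps, assured_min_mbps=100):
    """One pass of a toy reclaim loop over job-level grants."""
    new = {}
    for job, grant in grants_mbps.items():
        usage = observed_mbps.get(job, 0)
        if usage < 0.5 * grant:                      # mostly idle: shrink grant
            new[job] = max(usage, 10)                # keep a small probe rate
        else:                                        # active again: restore it
            new[job] = max(grant, assured_min_mbps)
    return new

print(reallocate({"hadoop": 100, "search": 100}, {"hadoop": 5, "search": 95}))
# {'hadoop': 10, 'search': 100} : hadoop's unused share can go to other jobs
```

On the host side, the resulting per-job rate would be enforced with tc shaping, with iptables/ip rule classifying the job's flows into the right class.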
So you can also do offline analysis: once we have a demand and we have the health state of the network in terms of failures and everything, we can do some intelligent things across the network.

How does the solution look end to end? This looks like a complex diagram, but it's not; it's basically just a walkthrough of what we did. We have a server that talks to the bulk traffic manager through a bunch of APIs, and the bulk traffic manager in turn talks to the decision engine that's going to do its magic in the backbone and say, hey, you can transfer X amount of data; it could be 100 Mbps or it could be several gigs, whatever it is. One nice thing about this picture is the QoS bubbles you see there. Whenever we wanted to do a job transfer, we ensured that the host itself marks the traffic, so we don't need fancy ACLs or fancy policies on the routers to do QoS; every job is given a forwarding class, which is marked by the hosts themselves before the transfer.

What do you do on Facebook most often? You post status messages, you wish people happy birthday, maybe you tell people that you're fine during Hurricane Sandy, you post pictures. So all we see is a collection of pictures, you might think, but the actual data that we deal with looks like this. So we've established that we have a lot of data to deal with; let's see what we are actually doing with all the systems we built together. The use cases we targeted with the different elements we built were these three. The first was what we call a durable demand: we have application owners come to us and say, hey, can you quickly let me know if this can be done? We want to transfer some data.
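Host-based marking of the kind described above is typically just a socket option: the application (or the host daemon on its behalf) sets the DSCP bits, and the routers only have to trust the field rather than classify with ACLs. This is a generic Linux illustration, not Facebook's code; AF21 is an arbitrary example class, and the mapping of jobs to classes is assumed.

```python
import socket

DSCP_AF21 = 18            # example forwarding class (AF21)
TOS = DSCP_AF21 << 2      # DSCP occupies the top 6 bits of the IPv4 TOS byte

# Every packet this socket sends now carries the chosen forwarding class.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS)
print(sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS))  # 72
sock.close()
```

The router side then only needs a trust-DSCP queueing policy per class, which is far cheaper than per-flow classification at these packet rates.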
That is what we termed a durable demand. They come in, they ask for a demand, it goes through the same process of the bulk TM talking to the backbone, the backbone ensuring everything is fine, and then we go ahead and program it. The demand goes on, and once the job completes, the bulk traffic manager informs the backbone PCE that it's done, and the PCE removes it.

The other aspect was calendaring. Remember, earlier I alluded to Graph Search and Hadoop and so on. What we found out, especially with Graph Search, was that they needed to do this transfer of petabytes twice a week, and the SLA was supposed to be 36 hours. So these became calendared demands, which automatically became inputs into our PCE: every time before such a demand started executing, the PCE would bring up a tunnel with the necessary bandwidth, and then it would just go on. That was the second aspect, calendared demands.

The final piece was periodic optimization. We talked a little bit earlier about bin packing. What we decided was: let's have the PCE monitor every single LSP in the network, and once an LSP reaches a certain maximum threshold, we break it into multiple sub-LSPs, so we can at least achieve much better bin packing. It works the other way as well: we set a minimum threshold on each LSP, and once an LSP reaches this minimum threshold, we can merge LSPs and reduce the number of LSPs in the network.

So let's see what happens. We have a demand from the East Coast to the West Coast; it's fine, it's all been admitted, it's great, so we say okay, go ahead. Voila, everything is fantastic, right? At this point, remember, I said we still run a converged backbone. So now what happens? There's a fiber cut, and at this point we crash and burn.
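The threshold-driven split-and-merge optimization described above can be sketched in a few lines. The thresholds, LSP names, and the even two-way split are assumptions; the real PCE would choose split counts and paths from the demand mesh.

```python
MAX_GBPS, MIN_GBPS = 8.0, 1.0  # assumed split/merge thresholds

def optimize(lsps):
    """lsps: dict name -> Gbps. Returns (new_lsps, merge_candidates)."""
    out, merge = {}, []
    for name, rate in lsps.items():
        if rate > MAX_GBPS:            # too big: split into two sub-LSPs so
            out[name + "-a"] = rate / 2  # each half can be bin-packed separately
            out[name + "-b"] = rate / 2
        else:
            out[name] = rate
            if rate < MIN_GBPS:        # too small: candidate for merging away
                merge.append(name)
    return out, merge

print(optimize({"EC-WC": 12.0, "dc1-dc2": 0.4}))
# ({'EC-WC-a': 6.0, 'EC-WC-b': 6.0, 'dc1-dc2': 0.4}, ['dc1-dc2'])
```

Splitting the 12 gig LSP into two 6 gig sub-LSPs is exactly what lets one of them stay on the shortest path in the earlier B-to-C example.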
This is why the importance of feedback comes in. In the studies that we made, a 0.1 percent loss, just 0.1 percent, will actually cost a 90 percent loss of throughput in TCP. Imagine that, from a 0.1 percent packet loss. So we had to ensure that feedback is tied into the whole system as well. Essentially, when we see this, the PCE reacts: it does a recalculation and is able to inform the bulk traffic manager to either throttle or kill the job. This way we are actually maximizing what happens in the network.

Continuing the theme of posters, here is another one. At Facebook we love to iterate; we love to ship products that can give us some gain. So let's look at the visible results that we saw with all this. This was a capacity graph from before; let's take a transfer and see what happened. This was the allocated bandwidth for a transfer across time, and this was the actual usage; it's pretty close. Essentially, the allocation adapts itself to the growing or decreasing needs of the host tier. This was fantastic for us.

What else can we do? I talked a little bit earlier about our transport layer. Now imagine extending this PCE into multiple layers: imagine you could have L1 and L3 with a shared control plane, have standby transponders sitting there doing L1-assisted wavelength transfers, and have reconfigurable ROADMs. Imagine doing optical restoration through software. To me personally, as a network engineer, this is an exciting time to work in, because we are actually challenging the fundamentals that we have grown up with. We truly believe that we're just scratching the surface, that we're just beginning this journey, and that we have to continue to get to solutions that defy the norm to satisfy the demand and the ever-changing landscape.
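The sensitivity to loss quoted above can be sanity-checked against the classic Mathis et al. approximation, TCP throughput on the order of (MSS / RTT) * C / sqrt(p). This is the textbook model, not Facebook's internal study; the cross-country RTT and MSS below are assumed values.

```python
from math import sqrt

def mathis_throughput_mbps(mss_bytes=1460, rtt_s=0.070, loss=0.001, c=1.22):
    """Approximate steady-state TCP throughput under random loss rate `loss`."""
    return (mss_bytes * 8 / rtt_s) * c / sqrt(loss) / 1e6

print(round(mathis_throughput_mbps(loss=0.001), 1))    # ~6.4 Mbps per flow
print(round(mathis_throughput_mbps(loss=0.00001), 1))  # ~64.4 Mbps: 10x more
```

Throughput scales with 1/sqrt(p), so a 100x worsening of loss costs a factor of 10 in per-flow throughput, which makes the order of magnitude of the 0.1% figure plausible for long-RTT bulk transfers.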
If you'd like to talk to us more about our solutions, or have different ideas for our problems, please seek us out in the hallway or catch us at the social that we are hosting tomorrow, and we can talk more about it. Thank you.

We have time for questions.

Matt from Yahoo here. One quick question: you've got your bulk traffic manager talking to all the servers. When it's doing the shaping on the server, is there any communication at the application layer to let the application know that it's being throttled, or are you simply doing the throttling and letting the application deal with the aftermath of the shaping?

Yeah, the throttling is done by the tc mechanism in Linux, and the application handles it beyond that.

All right, thank you.

A question from Lithium Technologies. Does your system support UDP and multicast traffic scheduling as well, or just TCP sockets?

No, we don't do multicast. UDP, yes; it's essentially like TCP for us, but we don't do any multicast at all.

Okay, thank you.

Two questions; it's Vijay. The first question is: I would actually like to see the math where idle transponders actually work out in practice, because I've been trying to do that for like 15 years and I never managed to make the math work, so that would be very cool. The second component is: how do you do the network modeling when you do your bandwidth allocation and the path changes? How do you construct and model the network for a priori traffic allocation and post-facto capacity deletion?

Sorry, I didn't quite get the second question; can you repeat it please?

The second question is: how do you do a representative model of the network for a priori traffic allocations? You have to have the topology, you have to annotate the edges and links of the topology with the available bandwidth and current utilization, and then put that into the path computation for
allocation, right? So that's one question: how do you represent that? And then, post facto, on capacity deletions, say we have a fiber cut or something: how does that signal go back into the model, and how do you propagate that back up to the path controller?

Okay. Regarding the first one, the standby transponders: the point is to defy the odds, right? The point is to do things that current technology now allows us to do. We have advances in the transport layer where you can do software-based transport shifting. How it works out, we'll iterate, we'll learn from it, and we'll see where it goes. On the second aspect, the path computation element: we use a customized third-party solution for this. It reads the demand mesh, it's able to construct the topology, and it's able to know which traffic gets sunk at a node and which traffic is transit through it. The other aspect you asked about was network events and so on. That is where our extensive monitoring system feeds in: there is a Thrift-based API that the PCE listens on, through which the monitoring can feed it link failures; the PCE then goes ahead and does a recalculation, and similarly there is an asynchronous notification back to our bulk traffic manager to say, hey, throttle this, or go ahead and kill it.

Thank you. No, actually, my question was even more basic: how do you represent the model? What is your model representation inside your system? What you're talking about here are data feeds into the system; that is a data feed, but how do you represent the network model itself? That's something we've been trying to noodle around; you saw the talk earlier today about the YANG/NETCONF discussion. How is that implemented at Facebook?

Ultimately, it's a
relational database that we represent all of this in.

Oh, all right, thanks.

LJ, Cisco. Can you comment a little on the methodology you used to get, I think I heard the numbers right, the 0.1% loss that translated to a 90% reduction in throughput? I'm interested in where those numbers came from.

I think there are some papers out there that describe this as well, but for us, what we did was basically a brute-force method. We set up these transfers and started dropping packets on them, to come to a point where we think the throughput loss becomes, you know, nominal, and we felt 0.1 percent is where it was. So for us it's an internal study of what our traffic patterns do, but from what I've heard, there are other research studies as well that agree with what we have seen in the network.

Okay. Any considerations about traffic, you know, vMotion within the traffic manager? [inaudible]

Yeah, so we do not run any virtualized environment; for us, every tier, every host, is a specific application, so we don't have to deal with VMs and stuff like that. That's why there's no vMotion for us.

Early on you showed your traffic packet size distribution histograms, and you said these are quite interesting. Why did you say they are quite interesting? Because I looked at them for a moment and I'm not surprised at all.

Okay, so these are quite interesting because when we talk to vendors and we tell them that, hey, you have to support 60-65 percent 64-byte packets with all the features that we want, they go back and say, what? Interesting in the sense that there is performance associated with all of this; it makes a big difference to a hardware forwarding element which features you turn on and the performance associated with them.

Ram Balakrishnan.
Given the size of your network, is your traffic manager deployment a hierarchical architecture, or is it centralized?

The traffic manager is kind of hierarchical, is what I would say, but there is a global aggregator that is the arbiter of the whole thing. Essentially it feeds into all of these; we have what we call a tier of hosts, and within the tier of hosts you need not have everything talking the same way. That's why it's semi-hierarchical, but essentially you have the aggregator doing most of the job here.

The second question is: did you say that you are throttling at the host level while doing the path and capacity computation at the path level?

Yeah. The throttling actually happens at a job level. For us everything is a job, and everything boils down to a tier of hosts. Say a transfer is told, I am giving you an assured bandwidth of 100 Mbps; how that splits across the hosts is not necessarily that interesting to us. When the job's aggregate allocation drops, say it moves from 100 to 50 Mbps, the whole job gets throttled. It could be a bunch of hosts, but it's the job that gets throttled.

The last one I have is: all the QoS marking is done on every packet at the host level?

Yes.

Clark Gaylord, Virginia Tech. Your 0.1 percent creating 90 percent loss, presumably that's because this is random loss as opposed to congestion-based loss, is that correct? What kind of a link are you thinking is going to lose one packet in a thousand?

Well, I think it's pretty subjective. In essence, what we see in terms of, say, an actual cut tells us what services need to be protected with FRR and where the sub-50-millisecond requirement comes in; where we have hotspots, how hard do
we have to run things, and so on. To us it was more about understanding what loss means to our applications and how we need to minimize it; that's what the study was for.

Last question, middle back.

Scott, Google. I just want to clarify: when you say ninety percent, you mean 90 percent throughput loss, right, not ninety percent packet loss?

Yes. Okay, yes, thank you.

All right, we're going right ahead out to the afternoon break, sponsored by GoDaddy and XKL. When we come back, we have a pair of submarine cable talks as well as a block of lightning talks. Thank you very much; see you soon.
Info
Channel: NANOG
Views: 9,021
Keywords: Gaya Nagarajan, Verilan, Washington, traffic-engineered, Network Operators, Facebook, Nanog 61, backbone, NANOG, Internet Service Provider Industry, Bellevue Citytownvillage
Id: sLnS3e4MSrM
Length: 31min 5sec (1865 seconds)
Published: Tue Jun 03 2014