Building Scalable Data Centers: BGP is the Better IGP

Video Statistics and Information

Captions
do we have Peter? Peter, tell us why BGP is the better IGP.

So, thanks everybody for coming, and thanks to the committee for picking our talk. Today we're going to talk about our experience with data center design and using BGP as the only protocol for data center routing. Before I begin, let's quickly look at the structure of this talk: I'm going to start with the problem statement and talk about our data centers and our challenges, then outline our first design, which was done a few years ago. We'll talk about why we chose BGP over more conventional protocols such as OSPF or IS-IS, and finally we'll talk about the new approach we're using right now for new deployments.

So let's have a look at our data center specifics. We run online services, and online services are everything that serves customers over the Internet and has millions of active users at any time. We run servers on the scale of hundreds of thousands in our data centers, and these servers utilize mostly 10G and 1G NICs. I can't say it's all 10G right now, there's still a lot of 1G, but 10G NICs were a major driver for higher capacity and higher bandwidth. All the applications we run are a specific kind of server application: they are well aware of the network, they are designed to run on the network, and we treat the network as a computer. Essentially the application is aware of the topology and utilizes the full capacity and full capabilities of the network to perform computations. In other words, the applications are parallel and specialized to perform just one function, such as web index computation or, say, serving email and storing mailboxes. This is quite different from a traditional enterprise, of course, and this brings some specifics to the design and to application behavior.

The most important part, I believe, as was mentioned before, is the specific type of traffic flow that we have. There are some nuances, but in a very simple fashion you can divide the traffic flows into queries and background traffic. Queries are what customers send you, or what applications send to each other to perform, say, calls or requests; background traffic is the traffic that does all the work, such as web index replication or mailbox moves and all those things. Background traffic is the majority of all traffic in our data centers; it's heavy, bulky flow, and it runs from server to server, or, as the industry calls it, east-west.

So the challenge was twofold: first, to find a topology which is simple enough but provides enough capacity from server to server, and at the same time allows us to keep the design very simple and uniform; and secondly, to pick a routing protocol that allows us to use a single solution everywhere in the data center, a protocol which is as simple as possible and which has very wide vendor support. The last requirement is specific because we often tend to change our vendors for the purpose of driving costs down, so in order to keep the solution working we need a protocol which has very wide interoperability and support from every vendor.

So let's look at our first, initial solution. Our choice of a Clos topology is, I guess, now very common in the industry. A Clos, as the industry calls it, is a network which has three stages: the middle stage is often called the spine layer, and the input and output stages are often folded together and called the leaf layer. The Clos may have multiple stages; normally it has three, but you can actually grow the topology and have five, seven, and more stages. The only problem is that as you grow a Clos topology, you also increase the complexity of the cabling structure. This topology provides full bisection bandwidth as long as the number of uplinks from the leafs is greater than or equal to the number of downlinks going to your, say, top-of-rack or end-of-row switches; we use top-of-rack only, so this is why you see top-of-rack switches on the diagram.
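(A minimal sketch of the full-bisection check just described, with made-up port counts; nothing here comes from the talk itself.)

```python
# Toy check of the condition above: a leaf stage provides full bisection
# bandwidth when its uplink capacity toward the spine is at least equal to its
# downlink capacity toward the top-of-rack switches.

def leaf_oversubscription(uplinks: int, uplink_gbps: int,
                          downlinks: int, downlink_gbps: int) -> float:
    """Return the downlink:uplink oversubscription ratio for one leaf switch."""
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

# Illustrative numbers only (not the presenter's actual port counts).
ratio = leaf_oversubscription(uplinks=16, uplink_gbps=10,
                              downlinks=32, downlink_gbps=10)
print(f"oversubscription ratio {ratio:.1f}:1")           # 2.0:1
print("full bisection" if ratio <= 1.0 else "oversubscribed")
```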
So our initial design was exactly a Clos topology, which had anywhere from eight to thirty-two switches in the spine layer and which connects our containers, because we mostly use containerized deployments in our data centers, and links them into a single fabric inside the data center. We deploy leaf switches in pairs, so this way we can use link aggregation, or MLAG, to increase capacity or bandwidth inside the container. As you can see, the initial design had Layer 2 inside the container. The reason for this was mostly to have compatibility with our old designs, because our old designs were based mostly on Layer 2 topologies, and as you can see, MLAG was chosen as the solution to aggregate capacity inside the pod sets, or containers as we call them. Also, one specific point: oversubscription is only done at the top-of-rack layer. This way you can treat the topology above the top of rack as a single bandwidth domain; in other words, when we talk to the application guys, you can tell them they have full capacity in the rack, and out of the rack they have this oversubscription ratio.

So, for the routing, after a lot of discussions we chose to use BGP as the sole protocol. As you can see, we have a single AS number for all spine devices and then a single AS number for every container. Essentially, on the leaf switches you simply advertise the prefixes into BGP and you use equal-cost multipath, as shown, across the links in the topology. The spine width is normally from 8 to 32 switches, like I said, and this means we require a vendor to support this large fan-out for equal-cost multipath on both the leaf and spine layers of this topology.
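(A minimal sketch, not the presenter's actual tooling, of how this AS numbering could be generated automatically: one assumed AS for the whole spine, one assumed AS per container, and an eBGP session from every leaf to every spine switch.)

```python
# Hypothetical generator for the first design's BGP numbering and peerings.
SPINE_AS = 65000           # assumed value, for illustration only
CONTAINER_BASE_AS = 65100  # assumed value, for illustration only

def generate_sessions(num_spines: int, num_containers: int, leafs_per_container: int):
    """Emit one eBGP session descriptor per leaf-to-spine link."""
    sessions = []
    for c in range(num_containers):
        container_as = CONTAINER_BASE_AS + c          # single AS per container
        for l in range(leafs_per_container):
            leaf = f"container{c}-leaf{l}"
            for s in range(num_spines):
                sessions.append({"local": leaf, "local_as": container_as,
                                 "peer": f"spine{s}", "peer_as": SPINE_AS})
    return sessions

# Every leaf then learns each remote prefix over num_spines equal-cost eBGP paths.
print(len(generate_sessions(num_spines=16, num_containers=4, leafs_per_container=8)))
```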
So, probably the main question: why did we choose BGP over more conventional protocols such as OSPF or IS-IS? First of all, the main argument was the simplicity of BGP, because if you look at the BGP specification and source code, you can see that the state machines and data structures in BGP are much simpler compared to, say, OSPF. If you simply compare the OSPF adjacency formation process with BGP session establishment and update exchange, you can clearly see that BGP is much simpler. Of course there are some downsides, because some implementations of BGP put too much effort into optimization, specifically into the update packing process, and we've seen some vendors having issues and bugs specifically in the update packing process. But in general, our experience in testing and operations was that BGP deployments are much simpler to operate and support.

Secondly, probably the most important feature is that BGP, being essentially a distance vector protocol, allows you to change routing behavior on a hop-by-hop basis. Essentially, in our deployments we use this to do some sort of simple traffic engineering: we can peer with every device over a multihop session and inject routing information into every table, and this way perform some sort of very simple, crude, but actually efficient traffic engineering.

Thirdly, trying to troubleshoot BGP is generally simpler, basically because BGP has a single RIB, the local RIB, whereas if you take OSPF or IS-IS, you have a link-state database which has to be translated into the routing table using SPF calculations. BGP is more explicit in the sense that what you see is most often what you get in the routing table, plus you can always see which routes you receive and which ones you send to your neighbors, whereas in OSPF you have to look at the flood lists and the LSAs being sent to your neighbors; overall, trying to troubleshoot OSPF is generally more difficult. Of course there are some downsides again. For example, OSPF may allow you to find an MTU issue before you run into any problems; BGP will form an adjacency, or actually a peering session, even if MTUs mismatch, because TCP will work around it and you'll never find out about the MTU problem. However, in our environment we have very tight control of all the switches and the network, so we can always ensure that the MTU size is uniform everywhere.

The last argument: event propagation, or event flooding, in BGP is more constrained. Since BGP is distance vector, a change in the topology will normally stop at some point in the hierarchy. Let's say a link flaps: in OSPF the event floods through the whole area no matter what happens; in BGP, normally once you find a different best path, you stop propagating the event and just converge on the new path. So in general we found that under specific failure scenarios BGP has better fault isolation.

There are two main common arguments against BGP. The first one is that configuring BGP is much more complex compared to, say, a link-state protocol: all the neighbors, AS numbers, possibly routing policies on the peering sessions. But in our environment we have all of this done automatically by using automated configuration generation, and there are no complex policies, just peering sessions and AS numbers, nothing special, so this wasn't actually a problem for our deployments. As for the convergence time, which is BGP's main problem on the Internet, it hasn't been an issue for us, mostly because we have a very simple and symmetric topology such as the Clos: all links are uniform, all paths are uniform. If something fails, analysis, simulations, and experience show that it takes only a few seconds to converge, and probably the biggest factor which contributes here is that all links are point-to-point and it's all fiber with no converters, so once a link fails, most BGP implementations shut down the BGP session and the network restores from the failure almost instantly. Our experience has been that practically any failure has been healed within maybe less than a second, and even if it takes, say, five to six seconds, our environment can tolerate this with no problems.
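(A toy model of the restoration behavior just described, under the assumption that each prefix is reachable over an ECMP group of point-to-point links; this is an editor's illustration, not code from the talk.)

```python
# When a fiber link fails, the BGP session riding on it goes down with it and the
# path via that spine is withdrawn; traffic simply re-hashes over the remaining
# equal-cost next hops, which is why restoration is nearly instant.

ecmp_group = {"10.0.0.0/24": {"spine1", "spine2", "spine3", "spine4"}}  # hypothetical prefix

def link_down(prefix: str, failed_next_hop: str) -> None:
    """Remove the next hop learned over the failed link from the ECMP group."""
    ecmp_group[prefix].discard(failed_next_hop)

link_down("10.0.0.0/24", "spine3")
print(sorted(ecmp_group["10.0.0.0/24"]))  # ['spine1', 'spine2', 'spine4']
```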
So there have been some problems with this old approach. First of all, as you can see, we still had Layer 2 in the container, and this of course had all the usual problems, mostly broadcast storms, unknown-unicast storms, and MLAG issues, because all MLAG implementations are proprietary, and as you change the vendor you get new issues with every new implementation. MLAG looks like a nice picture, but since it's a non-standard technology you always encounter some issues; I guess almost everyone has had this experience. Secondly, having just a single spine of large spine devices limits the scalability of the data center capacity. The only way you can scale a single spine is by adding more line cards or more port density in the spine boxes, which somewhat limits your choice of vendors, because there are vendors who have some really nice spine devices, but say you have to change a vendor for cost reasons: you actually limit yourself by locking into chassis boxes with large port densities. So one of the ideas was: let's try to get to simple boxes with lower port densities and try to scale the spine not up but horizontally. And also, the MLAGs only let us scale to two devices, two switches, in the container, so essentially the same problem: you have to grow capacity up and can't scale horizontally.

So our new design was: let's take our Clos topology and create multiple parallel Clos topologies instead. Let's say on this diagram you see two Clos topologies, green and red, and every ToR switch, or top-of-rack switch, has an uplink into every Clos topology. So right now we double the capacity by having two parallel topologies. In other words, this is actually the same Clos topology but now having five stages: if you count the stages from the ToR up to a spine and then down again, you'll see there are actually five stages as you cross from ToR to ToR. So this topology allows you to scale horizontally by adding more and more parallel planes, or parallel topologies, and you no longer depend on having high port densities in your spine switches to scale the capacity.

The next problem was fitting our BGP design into this new topology. Since our main intent was to get rid of Layer 2 in the container, we went ahead and pushed BGP all the way down to the top-of-rack switch. As you can see, in this topology every top-of-rack switch has a separate AS number, so effectively every top-of-rack switch is a small BGP autonomous system on its own. We allocate a single AS for the whole set of leaf devices in every container, and the spine, once again: all the parallel spines in this topology share a single AS number. The two biggest benefits are Layer 3 down to the ToR and the use of BGP, the same protocol as before, for routing. Of course our biggest win was that we don't have any L2 issues, because the majority of our support and operational issues had been tied to broadcast storms, unknown-unicast storms, anything which is specific to Ethernet. And now we can grow capacity horizontally by simply adding more and more boxes in the spine or the leaf layer; you can increase capacity as much as you want. Of course there is a downside, because by adding more switches you also increase the cabling density, and so the main challenge is trying to manage that many links in this topology, especially handling link errors. But we still have the same protocol everywhere, it's still BGP, and there is no requirement to do any redistribution or interworking, just a single protocol everywhere. And lastly, BGP has some nice properties; specifically, the AS_PATH attribute allows you to see the exact path a prefix took through the topology, and this also helps in troubleshooting because you can tell that this prefix came from a ToR switch in a particular container.

We don't have time to go over all the specifics of this design, with all the tips and tricks here and there to optimize BGP in this design, so we'll only cover a few of the most important issues we found. First of all, we need vendors to support a feature known as BGP multipath relax. It's not very widely used; some vendors have it, some don't. The requirement for this feature is specifically to allow equal-cost multipath across paths which have different AS_PATH attribute contents: the same AS_PATH length but different contents. Normally BGP doesn't do load sharing across these paths, but we need it to implement our load balancing over anycast prefixes. So this was the first requirement to vendors.
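(A simplified sketch of the multipath decision just described; the ASNs are hypothetical, and in some implementations the knob is spelled along the lines of "bestpath as-path multipath-relax".)

```python
# With plain BGP multipath, two paths are only combined when their AS_PATHs are
# identical; with multipath relax, only the AS_PATH *length* has to match, which
# is what equal-length paths through different parallel planes look like.

def ecmp_eligible(as_path_a: list, as_path_b: list, relax: bool) -> bool:
    if relax:
        return len(as_path_a) == len(as_path_b)
    return as_path_a == as_path_b

path_via_plane1 = [65101, 65001]  # hypothetical: leaf AS, spine AS of plane 1
path_via_plane2 = [65101, 65002]  # same length, different contents (plane 2 spine)

print(ecmp_eligible(path_via_plane1, path_via_plane2, relax=False))  # False
print(ecmp_eligible(path_via_plane1, path_via_plane2, relax=True))   # True
```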
Not all vendors support this functionality, but I guess right now most of them actually do. Secondly, we had to use 16-bit AS numbers, and we had to rely on the private AS number range. The main reason for private AS numbers is that we can use them for simplified filtering at the edge, and we can use the remove-private-AS feature on the data center edge to easily strip all the information about data center routes; in other words, this helps to implement whatever policies we choose. But the downside is that we only have about 1000 private AS numbers, and the number of ToR switches often exceeds 1000, maybe by two or three fold. So the question was: how can we reuse private AS numbers in this topology without getting into any black-hole routing issues? The solution was a pretty old and well-known piece of functionality known as allowas-in, which permits a BGP speaker to accept routes carrying its own AS number. We apply this feature on our ToR switches on the uplink sessions to the leafs, and we reuse the same BGP AS numbers across the containers; essentially, in every container the ToR AS numbers are, let's say, 1, 2, 3, 4, and then 1, 2, 3, 4 again, and so on and so forth, something like that. Once again, this functionality is not uniformly supported by all vendors; let's say Cisco has been changing this a few times, so first it was there, then it wasn't again, and it's different between NX-OS, Cisco IOS, and IOS XR, but we managed to get most vendors to support this functionality properly. So basically a short message to our vendors, if they are listening: we only require a few very simple features to make the design work, and most of you have them, some don't have them implemented properly, but it's not that hard to implement this functionality in BGP; those are all very simple features to support.

We only have a few minutes left, but let me point out a few very specific design issues which we ran into. First of all, you cannot do summarization in this design; in other words, you cannot summarize the server subnets. The reason is that if you do, you can easily run into black-hole scenarios. Let's take the first specific summarization case, default routing: say your leaf switches send only a default route to the ToR switches, so in our case A and B send a default only to the ToR switches to improve scalability and reduce table sizes. What's going to happen if the link from A to D fails? Switch C will not know about that and will keep sending traffic to both A and B leaf switches, and this will create a black hole for traffic going from switch C to switch D. The only way to avoid this is by not using summarization from the upper layers to the bottom layers. The same problem applies when you try to summarize server subnets: let's say you take leaf switches A and B and summarize the subnets behind C and D into one range; once again you get a black-hole scenario, where switch B may lose its connection to switch C but still advertises the summary range. In traditional designs people solve this by creating peer links between the leaf switches, for example. The problem here is that the number of leaf switches keeps growing: when you have two devices it's fine, when you have four devices you have to use more links to create a full peer mesh, and when you have 16 devices it's almost impossible to create this peer mesh, because those links would have to carry the capacity of the topology in case of a failure. This might be a scale limitation, but in our case the switches we use and the scale of prefixes are pretty consistent, and we only use maybe 40% to 50% of the full FIB size in every switch in the network, so we found out we don't need to summarize; and if we do need to, we only summarize the point-to-point links, which doesn't cause any issues.
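(A toy model of the default-route black hole described above, assuming leafs A and B, ToRs C and D, and no peer link between A and B; purely an editor's illustration.)

```python
# ToRs C and D hold only a default route toward the leafs, so they keep ECMPing
# traffic to both A and B. When the A-D link fails, the share of C's D-bound
# traffic hashed to A is dropped, because A no longer has any path to D.

links = {("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")}

def deliver(dst_tor: str, via_leaf: str) -> bool:
    """Traffic sent up to a leaf is delivered only if that leaf still reaches the ToR."""
    return (via_leaf, dst_tor) in links

links.discard(("A", "D"))              # the A-D link fails
print(deliver("D", via_leaf="A"))      # False -> this half of the flows is black-holed
print(deliver("D", via_leaf="B"))      # True  -> the other half still works
```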
Alright, so the quick summary: the traditional perception of BGP was that it is only suitable for large-scale deployments and Internet peering, but what we found out is that it works perfectly as an IGP, a single protocol for our routing problems. BGP is simple enough, it supports hop-by-hop traffic engineering, and it's implemented and supported by basically all the vendors we know, and all of this together made this protocol probably the best choice for us to use in our data centers. Thank you. Any questions? Thanks.

Okay, so we are a little bit over into the break time; we're going to try to come back on time, though. Real quick, we do have a survey winner from the surveys yesterday: is Hans Otto in the room? Go to the back table to collect an iPod Touch. And for the rest of the giveaways for the rest of the week, we have another iPod Touch and a pair of noise-cancelling...
Info
Channel: NANOG
Views: 27,186
Rating: 4.8714857 out of 5
Id: yJbqnOdD3cg
Length: 26min 51sec (1611 seconds)
Published: Tue May 17 2016