Datacenter Networking @ Facebook

Captions
All right, we're going to get started, if everyone could take your seats. Quick survey-winner announcement for yesterday: I am actually very impressed, we had almost 200 survey responses, which means this is finally working, and thankfully most of them were pretty complimentary. Is Mark in the room? No? That's a shame, no iPad mini for him. How about Hannes Adelman? There we go, Hannes, your prize is with the registration desk; you're welcome to grab it now or later on. We still have two more of these to give away: one for the survey responses for today and one for Wednesday, so please fill them out after you've seen the content. One thing to keep in mind is that you can fill them out as the presentations happen and edit your responses later, so feel free to fill it out for this one, then for the next one, and come back tonight to give us a more complete view of the rest of the day, including the tracks. One other program note: for the first time ever, I'm extremely proud to announce that all of the recorded content from yesterday is now live on YouTube, so you can pull up all the sessions now and send them to your friends. Over the years we've gotten requests to get this up quicker than a week or two afterward, so we're extremely proud. I'm not sure if each session is individually linked to the agenda yet; if not, it will be by the end of the day, but it's all searchable on YouTube by searching for NANOG 59. With that, I'd like to invite David Swafford up to talk about data center networking at Facebook.

Thank you. Hi, I'm David, and today we're going to take a look at data center networking at Facebook. Before we get started, I want to give you a few key stats for context. We have a lot of people on Facebook today: 1.15 billion people active every single month. They upload 350 million photos every single day, and that alone accounts for seven petabytes of storage consumed each month.

In terms of the network graphs that we see on the engineering side, this graph covers about a year and compares traffic in and out of Facebook as a whole, to and from the internet, shown in green at the bottom and labeled machine to user, against traffic on the inside of the network, the machine-to-machine traffic. Over just one year we've literally doubled, if not almost tripled, the traffic on the inside of the network while remaining relatively stable on the traffic entering and leaving our network. This is mostly due to the dynamic nature of replication and things becoming more and more media rich, for example high-res photos in your news feed, and just more content being produced on the inside.

We're going to be talking throughout this talk about clusters at Facebook, so to give you some idea, a cluster in our world is just a unit of computing power: a bunch of servers, a bunch of racks, and the related network. Taking a look at what happens when you actually hit facebook.com from a web browser or a phone: you initially hit a web machine in one of our front-end clusters. That web machine doesn't have all the content you need, though; it just has the static content, so your request generates hundreds if not thousands of requests in the background to all the different machines that serve things like news feed, advertising, even messages and photos, and ultimately database servers in the back end. One key challenge we have at Facebook is that a single user's request to one web machine in the front end easily generates hundreds if not thousands of requests in the background, which is what we saw on that graph earlier, and microbursting, for example, is one definite problem we see on a day-to-day basis.
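To make that fan-out concrete, here is a minimal sketch, not from the talk, of one front-end request spreading into many parallel back-end requests; the service names and request counts are invented for illustration.

```python
import asyncio
import random

# Hypothetical back-end services a single page load might touch.
BACKENDS = ["newsfeed", "ads", "messages", "photos", "database"]

async def backend_call(service: str, item: int) -> str:
    # Stand-in for an RPC to a back-end machine; real calls would
    # traverse the data center network to another cluster.
    await asyncio.sleep(random.uniform(0.001, 0.01))
    return f"{service}:{item}"

async def handle_page_load(user_id: int) -> list[str]:
    # One user request fans out into hundreds of back-end requests,
    # which is why machine-to-machine traffic dwarfs machine-to-user.
    calls = [
        backend_call(service, item)
        for service in BACKENDS
        for item in range(50)          # 5 services x 50 items = 250 calls
    ]
    return await asyncio.gather(*calls)

if __name__ == "__main__":
    results = asyncio.run(handle_page_load(user_id=42))
    print(f"one front-end request generated {len(results)} back-end requests")
```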
Now let's look at the network that used to power these clusters; we'll walk through the different generations throughout this talk. Our first-generation cluster was entirely a layer 2 domain. We had a rack switch at the top of every rack, a single rack switch, then and still today, because we accept losing any single rack. That rack switch had an IP address for management, it did layer 2 switching, and nothing else. It connected upstream to a pair of cluster switches, which are along the lines of what you might expect in a data center environment: large chassis-based platforms that stood pretty high off the ground, from one of a few vendors. The early designs used one gig and then eventually early first-generation 10 gig.

We had some challenges with this design. One of the big ones was usable capacity. We only had two cluster switches, and regardless of the redundancy inside a box, we always plan for an entire box going crazy, so right away we had to over-provision 50 percent in this model. As we started to adopt 10 gig, it wasn't really 10 gig, it was oversubscribed 10 gig, so usable capacity was even less than it looked. But ultimately the reason we left this design was the layer 2 scaling challenges we ran into. As we grew these clusters bigger and bigger, we didn't exceed any manufacturer spec in terms of how many MAC addresses we could throw at these devices, but we found they started to choke: things like MAC learning became problematic, ARP processing would overwhelm a device, and it would just fall over. That wasn't any specific vendor or platform; it was just something we noticed as we grew.

Looking at our second-generation cluster: we had a lot of layer 2 challenges, so we said let's keep layer 2 constrained within the rack. We still have a bunch of servers with a rack switch at the top, but now that rack switch is a router in our environment; it runs a routing protocol upstream to our cluster switches. In this design we also added two more cluster switches, so that when we lose one we only have to over-provision for 25 percent of a failure. In this model we also have line-rate 10 gig, so we legitimately have 30 gigs out of this rack, which is great because the application is no longer fighting or starving to get bandwidth.

I mentioned we run a routing protocol to the rack; specifically, we run BGP as the only routing protocol to the rack. You might ask yourself why. A few things we really like about running BGP within the data center: it provides a lot of granularity for control; we use BGP communities to tag routes so that we can flexibly filter on them without having to do exact prefix matching; we can also control inbound and outbound at multiple layers to protect ourselves; and it scales really well, for example it runs the internet, which is much larger than us, so that gave us some confidence going into this.
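The talk doesn't show the actual routing policy, but the idea of tagging routes with BGP communities and filtering on the tag instead of on exact prefixes can be sketched roughly like this; the community values and routes are made up for illustration, and real policy would of course live on the routers themselves.

```python
# Hypothetical community plan: 65001:100 = rack prefix,
# 65001:200 = cluster aggregate, 65001:300 = backbone-learned.
CLUSTER_AGGREGATE = "65001:200"

routes = [
    {"prefix": "10.1.0.0/24", "communities": {"65001:100"}},
    {"prefix": "10.1.0.0/16", "communities": {"65001:200"}},
    {"prefix": "192.0.2.0/24", "communities": {"65001:300"}},
]

def advertise_upstream(route: dict) -> bool:
    # Instead of matching exact prefixes, decide based on the tag the
    # route carries: only cluster aggregates leave the cluster switch.
    return CLUSTER_AGGREGATE in route["communities"]

for r in routes:
    action = "advertise" if advertise_upstream(r) else "filter"
    print(f"{r['prefix']:<16} -> {action}")
```

The appeal, as described in the talk, is that new prefixes inherit the right treatment simply by carrying the right tag, with no policy edits per prefix.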
Looking at one of the challenges we ran into from there: as we grew, as in the graph at the beginning of the slides, our traffic between clusters is pretty heavy, and while we had a lot of traffic going in and out of the internet, the traffic between clusters was growing at such an exponential rate that it became pretty hard for us to keep up. Backbone links were starting to become saturated, because in the early design we may have only had a few clusters and they connected upstream to a backbone, and as we grew, they all still connected upstream to the backbone regardless of their location.

Before we go into the problem in more detail, let's talk about why we have a backbone. Its purpose is to connect our data centers together and also to connect our private and transit peering. One of the problems we started running into is that the backbone devices connecting these clusters are very powerful; they are high-end devices that can easily carry millions of routes, but their density in terms of 10-gig ports is much lower than on the data center side when you look at the actual cost per port. They became almost too powerful for the need of carrying tons of traffic between clusters within the data center. We stepped back and said, there's just got to be a better way to approach this.

That led to the next drawing: within each data center region we built a separate back-end data center network. By data center region I mean, for example, the West Coast of the United States or the East Coast, and each of these will have one or multiple buildings in a little campus. In this design we cross-connect all the clusters in a full mesh over that back-end data center network, which is local to that data center region or campus, and now the only traffic actually hitting the backbone is traffic leaving the data center or going to and from internet peering providers. That worked for quite a while; it gave our application owners a lot of room to grow and to start turning on more and more features that needed a lot more bandwidth.

But it's not perfect, because there are always reasons to improve, especially in our environment. One of the big challenges we saw with this design is that we had a lot of chassis devices throughout the environment, and chassis devices are pretty complicated. They're typically proprietary, and they have lots of state to maintain, lots of line cards, lots of chips and ASICs; that's a lot of things to make sure are actually working properly from an individual box's perspective. As we started to grow, we noticed they break in really obscure ways. A great failure for us is a line card just completely failing and powering off, but a bad failure is one where it silently fails: it still looks like it's okay, but it starts black-holing traffic.
Efficiency is another reason we wanted to improve, because in the model I just showed, we were building a cluster for photos and a cluster for web, and regardless of how many servers went into a cluster, we still needed to provide at least four cluster switches, or two in the first design, so there was some minimum amount of sizing and planning to install all of this. But what happens when you want to build a web cluster that's small and a photos cluster that's twice the size? Eventually you run out of ports and you just build two of those, but your web cluster is under-provisioned, and while you can buy fewer line cards, it's not as easy to grow incrementally.

The third-generation design, which we have in place now, is a folded Clos, a data-center-wide fabric. In this model we still start out with a rack switch at the top of every rack, but now it connects upstream to one of a handful of fabric switches, eight in this model, though it could be 16. Right away this is not that much different, but we rename this to a pod. A pod is a chunk of a cluster, think a fraction of one, and it connects upstream to a spine layer, pretty common at first. The idea going into this design is to get away from chassis and go to smaller, more fixed-platform devices that are easier to manage and easier to automate. In this model we want to be able to grow horizontally a little faster; with the click of a button we can double the spine. But a server pod is only a chunk of a cluster, and we still need to build many clusters, so the idea is that we keep growing these pods and eventually turn this into what's called the plane layer. As we go to scale, the entire data center will have a whole bunch of pods connecting to a bunch of spines, and the spines basically stack three-dimensionally until you have enough ports to connect everything.

The interesting change is that there's no chassis involved here, or maybe very small chassis in some places; it's just a lot more devices. So now, instead of planning for 25 percent over-provisioning to lose a cluster switch, you might over-provision a sixteenth, a thirty-second, maybe a sixty-fourth of the capacity, because your device maybe only has 64 links on it instead of hundreds. This is an exciting design and also a scary one at some points, because it's very different from all the tooling we had in place. That meant all of our tooling no longer worked, which is exciting, though, because it gave us the opportunity to start fresh and approach it with a brand new perspective: what software did we want that we didn't have before, and what kind of analytics were we not doing? Because there are a lot more links here, there are a lot more potentials for failure.
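As a quick back-of-the-envelope check on that spare-capacity argument, using only the device counts mentioned in the talk:

```python
def spare_fraction_needed(parallel_devices: int) -> float:
    # If traffic is spread evenly over N parallel devices and you must
    # survive one of them failing, you have to keep 1/N of the total
    # capacity in reserve.
    return 1 / parallel_devices

for n, label in [(2, "gen-1: two cluster switches"),
                 (4, "gen-2: four cluster switches"),
                 (16, "fabric: 16-wide spine"),
                 (64, "fabric: 64-wide spine")]:
    print(f"{label:<30} spare capacity needed: {spare_fraction_needed(n):.1%}")
```

Running this prints 50%, 25%, 6.2%, and 1.6%, which is the shift the talk describes: the wider the fabric, the smaller the slice of capacity any single failure can take away.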
That goes into our next section: how do we manage all these devices? At Facebook we try to approach everything from a software mindset on the networking side. From configuring devices, to auditing, to alerting on issues, and even remediating issues, we try to make software do as much as possible so that we can free up our engineers to make a greater impact, ditch the whole traditional NOC perspective, and have real engineers actually respond to real alarms. This frees up our team to spend time learning how to write better software, to automate anything that's still done manually, and to finish the feedback loop by sharing back with the team, so that we don't make the same mistake twice and we continually build on top of each other. This is actually my favorite part about working at Facebook, because you can come in as a network engineer with no programming or scripting knowledge, be encouraged to grow and learn it, and then be supported to focus on and attack a problem you see that management might not see. It's a very supportive and encouraging environment across the whole team.

Let's take a look at this in some real detail, because what I said sounds cool and all, but let's see that it actually exists. How do we deploy a cluster switch, in either the fabric model or the original models? Everything from planning the port map, to generating the configuration, to applying that configuration, to validating that the configuration works, for example did those BGP sessions you configured in your template actually come up, is entirely done by software. We use engineers just to physically install devices and to enable them for live traffic. The last one we could easily automate when we feel comfortable; it's just a matter of comfort level.

Stepping into another system we have at Facebook: it's called FBAR, Facebook's auto-remediation system. Network engineers create alerts based on issues they've seen; for example, a common alert might be that this interface has received power on an optical link that is worse than such-and-such threshold. The engineers create these audits and they also create remediation scripts, which tell FBAR what to do when it sees this alert from some device. These trigger alarms in our environment, and FBAR reacts to those alarms.

Looking at a real example of FBAR in use: we connect with a lot of peering providers, many in this room, for example at internet exchange points. With an internet exchange connection you have a lot of BGP sessions involved, and it's common for one of those to go down every now and then; for example, we might be doing maintenance, somebody else might be doing maintenance, or it might be a hardware failure. That session goes down and normally somebody has to look at it. Instead, we use software to help filter and analyze the situation first, through FBAR. In this example, a router dropped its BGP session at an internet exchange. That device generates a syslog message about the session going down, which creates an alarm, but that alarm doesn't go to anyone right away; it goes to FBAR. FBAR says: I know what that device is, it's a router at an exchange point; I know what a BGP session is; I know what that alert means; and somebody has actually told me what to do in this case. FBAR will log in to the device and say, hey device, I saw that this session went down, are you still down? The device responds, no, it's good. But then FBAR says, well, let's check a few things first. Which interface connects this peer, how is that interface looking, do I have any errors, do I have good light from the optic? The device responds, no, everything's cool. FBAR also keeps track of when it last saw this alarm, so assuming everything is cool, the BGP session is back up, and there has been no recent occurrence of this, it's probably safe to ignore this alarm, and in this case FBAR will throw it away.

But let's say that when we went and checked with the device, we saw we had bad light levels on that interface, even though the session came back up. Bad light levels, from our experience, beyond a certain threshold will cause CRC errors; they also cause things to just bounce and traffic to get garbled. So that's likely the cause, in this case, of why the session dropped. FBAR has the intelligence to say, okay, somebody needs to go look at this, even though it's back up there's a problem, and it will escalate it to an engineer. The cool part about this is it lets you eliminate the NOC. Our network operations team doesn't have scripts that they read; it's not a big projection screen of watching monitors and reacting. They get alarms and tickets that are legitimately filtered already, so instead of having to filter the noise and figure out what actually happened and what caused it, they get a ticket that says FBAR saw this BGP session drop and the interface has bad light, and now you've eliminated a whole bunch of troubleshooting.
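FBAR itself isn't public, but the decision flow described above, check the session, check the interface, check the optical light, check how recently the alarm fired, then discard or escalate, might be sketched along these lines. Every function name, threshold, and the stand-in device classes here are hypothetical.

```python
import time

LIGHT_THRESHOLD_DBM = -14.0      # hypothetical "bad light" cutoff
RECENT_WINDOW_SECONDS = 3600     # hypothetical flap-history window

class FakeDevice:
    """Stand-in for the real device access layer (SSH/CLI under the hood)."""
    def bgp_session_up(self, peer): return True
    def interface_for_peer(self, peer): return "et-0/0/1"
    def interface_errors(self, iface): return 0
    def rx_light_dbm(self, iface): return -3.2

class FakeAlarmHistory:
    def last_seen(self, device, peer): return 0.0   # long ago

def remediate_bgp_session_down(device, peer, history) -> str:
    # 1. Is the session still down? Then a human needs to look anyway.
    if not device.bgp_session_up(peer):
        return "escalate: session still down"
    # 2. Session recovered; check the underlying interface health.
    iface = device.interface_for_peer(peer)
    if device.interface_errors(iface) > 0:
        return "escalate: interface errors after recovery"
    if device.rx_light_dbm(iface) < LIGHT_THRESHOLD_DBM:
        return "escalate: bad optical light, likely cause of the flap"
    # 3. Seen this alarm recently? Repeated flaps deserve a human.
    if time.time() - history.last_seen(device, peer) < RECENT_WINDOW_SECONDS:
        return "escalate: flapping"
    # 4. One-off flap, everything healthy again; safe to discard.
    return "discard"

if __name__ == "__main__":
    print(remediate_bgp_session_down(FakeDevice(), "203.0.113.1", FakeAlarmHistory()))
```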
Let's take a look at another section. I mentioned we have a lot of chassis at Facebook, and a lot of line cards. This drawing is just a generic representation of a line card on a network device. Line cards by themselves are not horribly complicated: in this drawing we've got a switch ASIC, a control-plane CPU, PHY controllers, interfaces, and a fabric interconnect. It starts getting more complicated when you look at what the card actually does. The switch ASIC, for example, has layer 3 tables and layer 2 tables, and it's probably keeping track of a bunch of other state too. When you look at this in the form of a chassis, you've got a lot of these cards connected together. So what happens when one line card goes crazy but doesn't completely fail, and the control plane doesn't notice? Your route processor or supervisor says, I don't see any problem, everything's cool. Traffic is still flowing through these cards, but one of them has just lost its mind and starts throwing away traffic for maybe one prefix, while everything else is fine. Something is obviously wrong, but in some cases you might not notice until somebody reports a problem; for example, maybe your load balancer went down and you noticed it from a load balancer alarm. In our environment we see these kinds of issues often enough that it got us interested in the fabric design we moved to. The point is, you get to where you have enough oddball issues that you just can't trust the box, you can't trust the line card anymore.

That goes into our next section: we try to monitor everything. Every link, every interface, every BGP session. We also try to monitor every FIB across the entire environment. For example, for the line cards I just showed, across every cluster switch, every device with line cards, we monitor to make sure that the routing table in the control plane is actually in sync with the routing tables on the line cards, and we automate that in the background to proactively catch the kinds of issues I was mentioning earlier. We also look, in aggregate, at tons of data from all of our servers, because they're everywhere and they're ultimately going to see issues more readily than we might from the network device. For example, tracking TCP retransmit stats at the server is an interesting metric, because if you start seeing one section doing tons and tons of retransmits, something upstream is wrong, because that shouldn't be normal. It's an interesting way to look at it.
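As a rough illustration of that last idea, and nothing more, here is one way to aggregate per-rack retransmit rates and flag outliers; the data layout, sample values, and threshold are invented.

```python
from statistics import mean, median

# Hypothetical samples: retransmit rate (retransmits / segments sent)
# per server, grouped by rack, as collected from the hosts themselves.
samples = {
    "rack-a1": [0.001, 0.002, 0.001, 0.002],
    "rack-a2": [0.002, 0.001, 0.001, 0.003],
    "rack-b7": [0.045, 0.051, 0.048, 0.050],   # something upstream is wrong
}

def flag_outlier_racks(samples: dict, multiplier: float = 5.0) -> list[str]:
    # Compare each rack's average against the fleet-wide median, which is
    # robust to one badly behaving rack dragging the baseline up.
    baseline = median(rate for rates in samples.values() for rate in rates)
    return [rack for rack, rates in samples.items()
            if mean(rates) > multiplier * baseline]

print("suspect racks:", flag_outlier_racks(samples))
```

The point the talk makes is that the servers see the symptom first; the network team's job is to correlate many such host-level signals back to a device or link.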
At Facebook we are very big on this culture of automation. We try to filter all the noise in software and automate anything that's repetitive, mainly so that engineers can focus on real problems. You might be thinking, that works great, you probably have hundreds of engineers doing this. In reality we don't; we have a very small team. It's about keeping focused and automating anything you're doing today in little chunks, stepping into bigger chunks, and it builds on itself, so that eventually you're doing something completely new and exciting tomorrow and helping the others grow.

Now let's talk about our rack switches. We have a lot of rack switches at Facebook, and we used to have an interesting problem: it was an install-and-forget model. As we installed rack switches, whatever configuration standard was present at that point, that's the standard the rack switch was configured with. Over time we had some drift, and it got to the point where the drift was pretty bad, and when we started trying to turn on IPv6 it was not a fun experience, because you had IPv6 support in the new racks but not in the old racks. For an application such as messaging that wanted to use v6, that didn't work, because they don't want to turn on v6 for one data center and not the other, so it stalled the progress.

IPv6 specifically is something we're excited about. We want every part of the environment to have IPv6, and we literally mean IPv6 everywhere, for real, not just at the edge. We want every single rack across every single data center to have v6, and every single service in every data center serving production traffic on v6, literally in the early part of next year, and the every-rack part is actually this year. A few things might come to mind: why would we want to do this? Well, in our environment IPv4 is not going to last forever, both in the amount of public address space and in the private range, and just as a protocol we don't want to drag on; we want to help push adoption of v6. Also, when you're troubleshooting issues, band-aids are not much fun to troubleshoot; we don't really do tons of NAT inside, but things like that make troubleshooting harder at some points. And finally, it's just really cool: you can put face:b00c in every single IP address on every single server if you want to, which we're trying to do.

So let's take a look at how we went about rolling out IPv6. It involves dual-stacking the backbone and cluster switches. It also involves rack switch upgrades, because all of those old rack switches had to be upgraded to new code and had to have a fresh configuration applied. And we also had to turn up IPv6 for all of our services, so anywhere any code specified v4, it had to be migrated to support both v4 and v6.
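One small, hedged illustration of the "turn up IPv6 for all services" step: checking whether a hostname actually publishes both A and AAAA records. The hostnames below are just placeholders, and this only checks DNS, not that the service answers on v6.

```python
import socket

def address_families(hostname: str) -> set[str]:
    # Resolve the name and report which address families it publishes.
    families = set()
    try:
        for family, *_ in socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP):
            if family == socket.AF_INET:
                families.add("IPv4")
            elif family == socket.AF_INET6:
                families.add("IPv6")
    except socket.gaierror:
        pass
    return families

for host in ["www.facebook.com", "example.com"]:     # placeholder targets
    fams = address_families(host)
    status = "dual-stack" if fams == {"IPv4", "IPv6"} else (", ".join(sorted(fams)) or "unresolvable")
    print(f"{host}: {status}")
```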
Let's look at the old way we would upgrade a cluster of rack switches. When we wanted to try out new code or a new configuration idea, we don't really have a big lab, so we typically tried things out in the web clusters. When we went to roll new code, for example to try out a new feature, we would coordinate an outage window with all of the affected service owners, and we would drain traffic, meaning we would start redirecting traffic away from that cluster and into other clusters. For a web cluster that was easy, because there's not really any state there other than the existing active TCP connections; once those rolled off and you stopped sending new connections, there was nothing there, and then we ultimately did the upgrades. But even in the old model we did a lot of the upgrade work by hand: we had scripts we'd run to do the reboots of the rack switches, but it still heavily involved typing commands to push and stage all the config and code.

That was the web clusters; let's look at the difficulties we ran into as we tried to upgrade the other clusters. We actually found that database clusters are pretty easy to upgrade in our environment, even though they're pretty dangerous, because if your cluster happens to be the master database region for everybody, you've got a lot of risk involved: you're going to be failing over database servers and trying to make it as transparent as possible. But specifically in the database realm of our world, we only have a few teams to coordinate with, about five total on-call people or teams, such as database engineers and backup engineers. So the web cluster was easy, and the database cluster was surprisingly, relatively easy.

Where we found the most difficulty was the service clusters, where all the things like news feed sit: the photos machines, even the advertising machines. They typically had a lot of different people involved in the same clusters; on average we saw about 120 unique people that we, the network team, had to coordinate with when we wanted to do an outage and drain an entire service cluster. One of the big problems that caused some heartache is that in our world we have this idea of dedicated and shared racks. A dedicated rack is where one service owner owns the entire rack, so all of the servers present are, for example, for photos, or all handle chat. But we also have shared racks, where somebody like myself, who might need a few servers, can just go allocate them, and I would share a rack with somebody else who might have advertising machines or news feed machines present. The problem that really made us step back and revisit this idea of draining whole clusters is that it doesn't make sense to make 120 people agree when one person might own five servers and another might own a bunch of servers. So we asked: why are we draining the entire cluster, why does it even need to be a single window, and why does it have to be so heavily involved? Literally upgrading a rack switch with new code or new configuration can be automated pretty easily; it was all the extra pieces, involving project managers, involving people to say yes, you can go reboot my rack switch, that was the painful part.

Let's take a look at an early attempt to solve this. In that early attempt, we said we're going to blacklist all the unsafe racks based on the hostnames present. In our environment, a hostname for a server easily maps back to a service; for example, if it handles news feed, it probably has the word newsfeed in it.
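A rough sketch of that blacklist idea, assuming a made-up naming scheme in which the service name is embedded in the hostname; the blacklisted services and hostnames are invented.

```python
# Hypothetical naming scheme: <service><index>.<rack>.<cluster>.example.com
BLACKLISTED_SERVICES = {"newsfeed", "messaging"}     # invented "unsafe" services

def service_from_hostname(hostname: str) -> str:
    # e.g. "newsfeed123.rack042.clusterA.example.com" -> "newsfeed"
    first_label = hostname.split(".", 1)[0]
    return first_label.rstrip("0123456789")

def rack_is_safe_to_reboot(hostnames: list[str]) -> bool:
    # A rack is only auto-upgradable if none of its hosts belong to a
    # service that asked not to be rebooted without notice.
    return not any(service_from_hostname(h) in BLACKLISTED_SERVICES
                   for h in hostnames)

rack = ["photos017.rack042.clusterA.example.com",
        "newsfeed123.rack042.clusterA.example.com"]
print("safe to reboot without notice:", rack_is_safe_to_reboot(rack))
```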
So we went to all these service owners, specifically for one of the clusters we were trying, and said: hey, we want to go upgrade all these racks, we can't tell you when because we're too far out, but are you safe with us rebooting your rack at any point without telling you? A lot of them said, you're crazy, because they want to know when we're going to reboot the rack switch; if their on-call gets paged for chat being down, they want to know that we caused it, for example. We spent a lot of time trying this idea. The model was basically that we were going to walk all the racks and upgrade whatever passed and did not have blacklisted hosts. We invested a lot of time asking people about this and didn't really get to upgrade very many rack switches at all. We still tried it, and it was a good learning experience.

Looking at how we actually solved the problem: we said, let's split this apart. A dedicated rack and a shared rack are actually two completely different things, even though it's the same cluster. For dedicated racks, meaning only photos machines are present, for example, we said let's shift the responsibility of upgrading the rack switch to the service owner and have them help us do the rack switch upgrade. For shared racks, we said we can't make everybody agree on a single window, so we're going to pick for them, but we're going to pick a bit more accurately this time: we schedule every single rack, giving it a time slot, you're starting to be upgraded at this time and ending at this time, and we built full automation around it.

For dedicated racks, I mentioned we shifted the responsibility to the service owner. You might say we must be crazy, because what service owner wants to log in to a rack switch and do their own upgrades when all they know is PHP? We'll talk about it in a minute, but we didn't actually make them log in to the rack switch, and in the real world we found it was a lot less time spent, for all the service owners and for ourselves. No longer are we spending time saying, hey, can we do this maintenance, or hey, we want to do it at this time, are you cool with that? Now it's: we need your help, you want IPv6, we want IPv6, help us do it by launching this tool, here's how, here's where you can go for help, and let us know if you need anything. It was pretty cool; service owners actually took to this idea really quickly, because it empowered them. It let them do their upgrades around their own maintenance schedules, not around arbitrary times we picked. Now they're not just a user in the rack, they're the customer, and we're actually friends, actually talking, and not having the tension that we had before.

Specifically for the dedicated racks, where we asked them to help us, we focused from the beginning on making this a smooth user experience. We provided software that we built, basically a CLI tool, that allows them to do a single-button upgrade. They didn't have to know anything about the rack switch they were on; they didn't even have to know its name. All they have to do is say, I'm in this rack, and based on the standard naming scheme the tool can answer: do I even need an upgrade? The tool will tell them yes, and then they can say, okay, cool, go upgrade it, and the tool does the upgrade, gives them some progress along the way, and lets them view what's going on. We also focused early on providing detailed reporting from the start, so if something goes wrong, the service owner knows exactly what happened, and so do our engineers who need to help and resolve it quickly. And we wanted to make it easy to manage a job; for example, if they started one and needed to stop it suddenly, that needed to be built in right away, so that we don't keep upgrading racks by accident.
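The actual tool isn't public, but a single-button upgrade CLI along the lines described, where the service owner only names the rack and the tool decides whether an upgrade is needed and reports progress, might be shaped like this; all names, versions, and the version-lookup logic are invented.

```python
import argparse
import time

TARGET_VERSION = "7.2.1"                      # hypothetical target image

def current_version(rack: str) -> str:
    # Placeholder for asking the rack switch (found by its standard name,
    # derived from the rack) what image it is currently running.
    return "6.9.4"

def upgrade(rack: str) -> None:
    print(f"[{rack}] staging image {TARGET_VERSION} ...")
    time.sleep(0.1)                           # stand-in for the real work
    print(f"[{rack}] rebooting rack switch ...")
    time.sleep(0.1)
    print(f"[{rack}] verifying BGP sessions and interfaces ... ok")

def main() -> None:
    parser = argparse.ArgumentParser(description="self-service rack switch upgrade (sketch)")
    parser.add_argument("rack", help="rack the service owner's servers live in")
    parser.add_argument("--check", action="store_true",
                        help="only report whether an upgrade is needed")
    args = parser.parse_args()

    needed = current_version(args.rack) != TARGET_VERSION
    if args.check or not needed:
        print(f"{args.rack}: upgrade {'needed' if needed else 'not needed'}")
        return
    upgrade(args.rack)

if __name__ == "__main__":
    main()
```

The design point from the talk is that the tool hides the network entirely: the service owner's only input is the rack they already know.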
I mentioned this idea of scheduling the shared racks. The reason we wanted to schedule them is that we needed to provide all of the service owners accurate timing and notification. While you can schedule something easily with cron, for example, that doesn't really close the feedback loop of letting people know what's going on. So we built a system that wrapped the timing around our task management system, so that when we scheduled, we sent them a task and said, hey, you're scheduled for this time slot, say 5:30 tomorrow evening, and your rack is going to be upgraded between 5:30 and 6:00. The tooling we built also went and reminded them: say we sent the task two weeks out, we'll remind you a day ahead, hey, just a reminder, we'll be in your rack tomorrow, and then again, hey, we're upgrading your rack right now, just in case you forgot. We built software to do that specifically because a lot of services still needed to be manually drained. We have a lot of services; I mentioned the big names like the photos machines and chat machines, but we also have a lot of internal services, for example servers I might manage or servers our monitoring teams might manage, and those might not have quite the intelligence to handle this automatically, so they may have to prepare for the event. And we also said in our discussions, why not finish the whole workflow of scheduling everything and eliminate all the pieces that typically stopped us before.

So where are we now? Right now service owners handle rack switch upgrades; they literally are upgrading racks whenever we ask and need them to, and we've reached the point where the network is constantly upgrading itself. You might ask how we did all this, because there are a lot of pieces left out. We built a system called Aggiornamento, "to refresh" in Italian, just to do a little playful pun with everybody. It's a client-server model based on Thrift, it's written entirely in Python, and it's backed by MySQL. The idea of this system is to integrate every piece that was manual before. For example, I mentioned identifying the service owners present: some service owners live in one cluster but not another, so we had to automate the step of finding out who actually lives in what cluster and who's actually on call at that time, so that we're not bugging the wrong person. There's also the scheduling, and the general logic to actually run and execute these upgrades based on the schedule, while keeping it super safe, so that if we get off schedule we don't start taking down a bunch of rack switches that we forgot or missed. And finally, we made job management super simple.
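Aggiornamento itself isn't open source, but the "stay safe if we fall behind schedule" behavior can be sketched simply: only touch a rack whose advertised window is currently open, and skip anything that was missed rather than catching up on it. The window length, dates, and rack names below are illustrative.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)     # each rack gets a 30-minute slot (illustrative)

# Hypothetical schedule: rack -> start of its upgrade window.
schedule = {
    "rack042": datetime(2013, 10, 8, 17, 30),
    "rack043": datetime(2013, 10, 8, 18, 0),
    "rack044": datetime(2013, 10, 8, 18, 30),
}

def racks_to_upgrade_now(schedule: dict, now: datetime) -> list[str]:
    # Only act inside the advertised window. If we fell behind and a
    # window has already closed, skip it and reschedule later rather than
    # surprising service owners by tearing through the missed racks.
    return [rack for rack, start in schedule.items()
            if start <= now < start + WINDOW]

now = datetime(2013, 10, 8, 18, 10)
print("upgrading now:", racks_to_upgrade_now(schedule, now))
print("skipped (window already closed):",
      [r for r, s in schedule.items() if s + WINDOW <= now])
```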
I just want to ask everybody a question here: what would you do if you were not afraid? I would say, automate your day job so that you can focus on the impossible. I'd like to thank everybody for attending and open up the floor to questions. On top of the questions, I actually have several teammates here, and we're going to be outside right afterwards with a discussion area in case anybody has deeper discussions that we can't go into here. But any questions as well?

Scott Light from Google: with respect to FBAR, how is it talking BGP, how does it check what the BGP configuration is, and how does it gather other stats off the switches themselves?

So FBAR is built to react to alarms. We actually have an external system that we configure with different metrics, for example SNMP polling.

So SNMP polling is how you get the stats; and how do you actually reconfigure the BGP, is it just CLI?

Yes, it's two parts. Getting the stats originally: if you take the whole workflow of that session issue, we generate an alarm based on the syslog message about the BGP session going down, but the rest of the interactions with the device are basically scripting around CLI commands. FBAR will literally launch an SSH session to the box, and the remediation script that the network team has written lays out step by step what commands to issue, what to look for, and what to perform. The remediation script itself is actually Python code, so it's not literally send-and-check expect behavior, but it's Python code that wraps simple expect-style CLI commands.

In the center: hi, Leslie Carr with Wikimedia. Are any of these tools open source and available to the NANOG community?

The FBAR one is actually publicly talked about; I don't believe it's open source right now. Some of these in a way are: for example, we built the back-end system, Aggiornamento, on Thrift, and Thrift is an open source project now with the Apache Foundation, though that veers off a little and digresses. The idea of Thrift is open source, but I don't think any of the other systems are just yet, mostly because we've been moving quickly and haven't stepped back to put in the time to do that.

Back left, sorry, back right: Jared from GoDaddy. Regarding the Python environment, did you use a programming framework, and did you consider using things like an AMQP service-oriented architecture or something like RabbitMQ?

In terms of the Python framework, we do most of our actual coding directly, just text editing, but there are a lot of tools built around Facebook infrastructure that let us build on each other. For example, I mentioned we integrate with things like the task system to help notify people, and also the idea of finding out who owns the servers; there are a lot of tools other teams have built within infrastructure that let us access that information generically, so we don't have to directly access different databases but can go through intermediary helper services. On the frameworks, I've forgotten the other part of the question and wasn't familiar with what you were asking there, but you can come back and ask again.
Back right: Paul Porter with Topsy. In regards to what you were describing, having automation to the point where, if an alert fires, there's a script that can take a look under the hood, determine if something's wrong, and then possibly just discard the alert: we're doing a lot with automation as well, but where do you draw the line? Typically, if a lot of the alerts you're getting are being discarded, there's either something wrong with your check or something wrong with the device; something is wrong. Generally there aren't alerts that fire off that you can just ignore. So do you now have people going through and checking what's being discarded, or do you find yourself losing visibility into an impending problem? Because as the saying goes, where there's smoke there's fire.

Yeah, that's a good point. We do still go back and look, overall, at what is being thrown away. One thing to note is that we don't filter every single alert with FBAR. For example, if it's a critical backbone device, carrying links that cross-connect between data centers, there's some intelligence built in that says, regardless of what this alert is, it's super important, because a ton of capacity might have been affected. We basically use FBAR more to help us weed out the smaller things. For example, a rack switch generates an alarm, because we actually monitor even the syslog of every rack switch, and sometimes with rack switches you'll run into problems that are probably a problem but that you can ignore. Since we monitor all syslog messages, there is likely some cleanup we can do to not generate the alert to start with, but we go for the idea of generating too many alarms and then paring them down, versus losing them, because we used to not alert on anything and now we've taken the opposite approach, which Peter here is laughing about. Actually, would you like to help on this one? This is Peter, who is with Facebook as well: I think the key answer to your question is that when we get a bunch of alerts that we don't have classified, or that get reclassified as drops, we also have graphs that we look at to see if there's a spike in any given type over a given time. So we can see, okay, there are ten alarms right now, but there were 3,000 that were ignored, and that in and of itself is considered a problem. Thank you.

Center left: Bechard with Limelight Networks. One question I had: the upgrade systems you talked about, for updating configurations, firmware, what have you, are those confined simply to the edge devices, those closer to the servers, or are they also incorporated into the more important, higher-scale backbone and inter-pop connections? And with regard to that, how do you address the introduction of new gear, new equipment, new technologies into your network with this upgrade facility in place? Surely that must create a few problems.

Yeah, so on the first question: I talked deeply about the rack switches specifically because the rack switch was our biggest problem area; we simply had way more devices there, and they were always ignored the most, whereas the backbone, while really important, has a lot fewer devices, so it wasn't as painful. Today we do not do the backbone or the cluster switches with that system, mostly because it's relatively new and we're still baking it, but there is work in progress to add backbone upgrade support and also cluster switch support.
On your other question, about introducing new devices: we're very big on being multi-vendor for everything. Today we have multiple vendors, and when we go to build anything, we don't try to constrain ourselves to any single product or any single CLI. For example, if we're writing code that's going to log in to a device, we generally, even if we only have one vendor deployed in that role but have other vendors in use in other parts of the network, write the software to handle all three or four or however many vendors we might have. I think that focus from the beginning, when we write tooling, on already supporting the other vendors is what saves us the heartache of adding one later. There are occasions where we'll add another vendor that is totally new; it causes some problems at first, but it's just more software, and it's not too unmanageable at that point.

I think we've got time for one more question. Anyone? Okay, well, thank you very much, and thank you for coming.
Info
Channel: NANOG
Views: 17,838
Rating: 4.8347106 out of 5
Keywords: Internet Service Provider (Industry), David Swafford, datacenter, Facebook, NANOG 59, NANOG, Verilan, Network Operators
Id: xC461XfmI0E
Length: 42min 57sec (2577 seconds)
Published: Tue Oct 08 2013