Arista Networks EOS Evolution and Quality with Ken Duda

Video Statistics and Information

Captions
My name is Ken Duda. I'm one of the company's founders, I'm currently the CTO, and I also run the software team. I want to talk about a particular issue with software which is really near and dear to my heart, which is quality. Quality is the most important attribute of our products. Not one of the most important; it's the most important. And the reason why is really simple: when the network ain't working, ain't nothing working. It's got to stay up, it's got to work. So we've taken the challenge of how to build a much higher quality network box, how to give customers a much better experience in their network, as our first priority from day one.

How do you build network software that's fundamentally better? I believe the answer has three parts. Quality comes from three places: it comes from culture, it comes from architecture, and it comes from testing. I'm gonna talk about each of those three. And I love questions, by the way, so please interrupt with any questions.

On the cultural side, what I mean by culture is: what's important to people when they come in each day? How do people think about their jobs? How does each person make decisions that support the overall company's objectives? When it comes to the culture of quality, what it means is behaving every day and in every way like you mean it, like quality really is the most important thing.

For example, do you give your executives bonuses for hitting a particular ship date? At Arista, nobody gets a bonus for hitting a ship date. At some of our competitors, sometimes they do. You say, well, come on, sure they give a bonus, you've got to encourage the engineers to go faster, right? So it's not really so bad, is it? Well, no one's gonna intentionally ship products that don't work. But when there's that meeting where you're trying to figure out whether you pull the trigger and let this thing out, you're going around the room, and there are seven or eight software managers, and the first seven say, yep, my stuff's solid, it's ready to go, we can hit full QA now, we're gonna
be on target, no issues. And you're the eighth guy, and you know that it's really close. I mean, it mostly works. Yeah, there are a few unexplained test failures, and there's just one case you know about, but that never really happens in the field, right? And you start making excuses, and you say, "I think we can do it." The pressure! Do you really want to be the guy who stands up and says, "Sorry, my boss doesn't get his bonus because my group's not ready"? It's a disaster, right? So you don't do it. People make excuses, and they end up shipping things before they're ready.

Whereas when I go to my boss, the CEO, and I say, "Jayshree, you know I told you I thought we could ship by early September? It's looking more like October. We're just seeing more failures in our automated test beds than we like to see," she says the same thing to me every time, which is, "OK, we'll manage in the field."

I was explaining this to an executive at a major financial firm, and he nodded. He's not really a networking person, more on the business side, and he says, "Oh, I see, you're shifting income statement risk from our balance sheet to yours." I'm like, what? And then I realized what he said was really smart. That's exactly what we're doing. We're saying we will delay the shipment of our software and our products until they're ready, and by doing so maybe we don't make our numbers one quarter. We take that risk, and we take that risk to avoid putting our customers' networks at risk, because it's just too important to us that their networks actually really, really work.

So why is that? Is there really a business reason for that? It's kind of interesting. I feel like culturally, at the heart of it, it's not even really a business decision, at least not for me. For me it's about: why am I doing this? Why did I start this company? Why do I come to work each day? You know, I've done well enough at previous companies that I could have retired before founding Arista. So what makes it worth it? What makes it worth it to me is the pride
of my switches deployed in some of the world's greatest networks: massive web companies, mainstream financial services, the big HPC centers. I'm so proud that we make their networks better each day. And what I hate is messing up their network, because that's not why I'm here. I didn't come in today, I didn't found this company, to build switches that mess up people's networks. So if our switches are misbehaving and that's causing someone a problem, that isn't supposed to happen. That's what I want to focus on and get taken care of, because that's why we're doing this, that's why we're here. I don't know what gives us the best stock value in three months or in 12 months or in four or five years. To me that's not really the important thing. My assumption is that if we do a great job for our customers, the stock price ultimately takes care of itself. So that's why we're doing this.

Another example of the culture of quality is: when a customer is having a problem, how does the organization react? The TAC engineer looks at the situation and says, "I'm not sure that's what it's supposed to be doing. That doesn't look good to me." What do they do? Do they have to go through some complex escalation procedure: talk to their manager, who talks to an escalations manager, who talks to an engineering manager, and then they kind of make a business decision about, OK, well, which customer is this, and how much did they buy last quarter, and how much do we think they're gonna buy next year, and what engineer do we need to come look at the problem, and what feature is that engineer on, is it critical for the next release, blah blah blah? I think that's a completely false business trade-off. The right answer is: you give the TAC engineers the phone numbers of the technical leads in each of your areas, and when they think the product's misbehaving, they pick up their phone and they call the right person. No escalation procedure, no managers involved, no phony business decision to make. Because first, there's
again that sort of cultural reason, which is that quality is the most important thing. We need to live that every day, and this is part of it: if something's going wrong in the field, that's our priority. So it's partly for the cultural reason, but I actually believe that in the long run the business case is pretty clear. If there's a problem in your code and it's impacting a customer, it doesn't actually matter what customer it is. In fact, you should thank your lucky stars if it's a relatively small customer who's relatively patient. Because you know what? Even if it's the nicest, easiest-going, lowest-key, smallest customer in the world, you jump on that problem, because it's just a matter of time until whatever that problem is goes and bites one of your top ten customers. The fact that it has happened somewhere else is great luck. Take advantage of it: go fix it there, get the maintenance release out, and then you're not gonna have a problem at massive scale. Whereas if you go through this false business trade-off, you end up delaying, not avoiding, the work of actually making your software work, and hurting far more customers in the interim. So our tech leads have very clear instructions from me: when something's wrong in the field, go fix it, because it's more important than new products, it's more important than new features, it's more important than hitting dates.

I talk to account managers about this, and of course the salespeople need to close deals, right? They need the latest features and the latest platforms, and they need it all yesterday. Except: never sacrifice quality. I get that message from them loud and clear. They would rather apologize to the customer a hundred times for missing the date than apologize even once for melting down the network. And our account managers tell me they're prouder to represent Arista's products, and happier representing our products, than anything earlier in their career, because they can spend their
time selling and not apologizing. So it really pays for itself. The culture of quality is really the foundation, it's really the backbone of getting to a high-quality product and a high-quality experience. It's just living every day, behaving 100% consistently with the idea that getting it right, getting it working, is more important than features, more important than dates, more important than shipping new platforms. And that's what we do each day.

On the architecture side, there are a bunch of things we've done architecturally to lead to higher quality software. The first was just starting with the right foundation: pure Linux. Our competitors look at Linux and they say, "Oh, I get it, it's like VxWorks but it's free." They miss the UNIX architecture, the UNIX philosophy, and its community. When the first thing you do is take the kernel and start hacking, hack out the whole network stack, hack in something else, you cut your connections to that community. You can no longer easily upgrade to the newest version of Linux, and you can no longer just take ordinary Linux software and run it on your switch. It has to be ported too, because the syscalls are subtly different and your environment's not really exactly quite Linux anymore. By taking pure Linux, we follow the first rule of the UNIX philosophy, which is that when it comes to the kernel, if you don't touch it, you don't break it. So we keep our hands out and keep our code in user-space processes, where they're all individually restartable.

The second thing we did architecturally is that we took the system database approach. There's a question, once you separate your code into lots of processes: how do you then coordinate those processes? The obvious answer is, you know, message passing. One process learns something, it tells another process, which does some computation and updates a third process, and so on. Message passing suffers from two really serious flaws. The first is the restartability problem. If a process exits and restarts, how do
you get that process caught back up with all the others? Maybe you can restart from a checkpoint and then replay a bunch of messages and get there eventually. There are real systems that work that way, like journaled metadata in file systems, or redo logs in transactional database systems. But doing that in every single one of your agents, your user-space processes, is just not a good way to get your architecture off on the right foot.

But even if you solve that problem, there's actually a bigger one hiding behind it, which is the update rate mismatch problem. This happens when a source of messages generates them faster than the receiver can handle them. They start to queue up, the receiver falls behind, and the farther behind it gets, the wonkier the system behaves, until eventually, since you can't save up messages forever, something's gotta give, and it crashes. And then the food fight starts within our competitors' engineering teams, because the guys sending the messages say, "Hey, you guys have got to process those messages faster," and the guys responsible for the receiving code say, "We're processing them as fast as we can. You guys have got to throttle back the rate you're sending them." They can't even agree on whose bug it is, and the reason is that they're both right. It isn't either of their bugs. The bug is in the architecture. Message passing is a flawed construct, so we don't use it.

Instead, we use a database. When things change, when the FIB changes, when links go up and down, whatever is happening in the box, the process that detects those things updates the system database. The system database then propagates the updates to the agents that need to react, to load the new routes into the hardware or whatever. You may be saying at this point, "What are you talking about? If you change the state so fast that I can't keep up, you've got the exact same problem." No. That's the beauty of being state-oriented. If one process
receiving updates starts to fall behind, our system database automatically coalesces the updates. If a route's next hop changes from A to B to C to D and the ASIC driver is falling behind, I don't have to send it all four of those updates. I can just send it "the prefix changes to D." As long as you get into the correct final state, it doesn't matter what path you take to get there. So when somebody's falling behind, we can automatically shed load, coalesce work, and reduce the load on the processes that are falling behind, which allows them to keep up with everybody else and keeps the whole system stable, so that once routing settles down, the whole thing's running correctly.

I believe that this architectural advantage compared to message passing has been a major benefit for us and our customers when it comes to the stability of our equipment under surprising, demanding scenarios. You know, when all you-know-what breaks loose in your network, and some router is flapping up and down, adding and withdrawing 10,000 routes, and adding in 10,000 routes and withdrawing them again, something nutty happens and nobody can keep up with that kind of churn. But there's a very big difference between doing the best you can until the condition is removed and then resuming normal operation, versus crashing, which is what we've seen some of our competitors' boxes do in those kinds of demanding scenarios.

The third part of quality is testing, and the way I like to say this is: the first rule of quality is get rid of QA. We don't have a QA team at Arista at all, because QA is a fundamentally flawed concept. QA is the lie that you tell yourself that it is possible to say, "Hey, these guys here, they're responsible for quality. They're the Quality Assurance team." It's just not true. It doesn't work. These products are too complicated. The software has too many dark corners. The QA team doesn't even know where the dark corners are, and even if they knew where they were, they don't have the tools to drive the software into those
corners. In order to actually achieve quality, what you need to do is have each development team be responsible for the quality of their area of the software. That means every feature team owes me two things: they owe me the new code, and they owe me a set of automated tests that prove the code works.

We've taken a 100% automated approach to testing from the very beginning, and the reason why we did that is actually kind of interesting. It was selfish. It wasn't for our customers; I did it for me, as a software person, because I've always been a software guy at heart. And if you're a software person, you know that a major requirement in order to create great software is to refactor. When you first do a module, you might do a great job. It does everything you need for the first version. But then version 2 shows up with new requirements: "Oh, well, I think I can fit that in here and fit that in over there," and it still hangs together pretty well, and you ship. And then comes version 3, and it's like, "OK, I did not know I was going to need you to do that." So at that point you need to refactor the module.

But what happens in some of our competitors' environments is that you try to refactor the module and, well, let me say this differently. You're trying to do your feature, and you look at the module you're building on, and you're like, it doesn't meet my needs. It doesn't give me the right callbacks, it doesn't give me the right notifications, it doesn't have the right indices to make the right lookups or do the right updates fast. How are you supposed to build on this thing? So you look at other features that are similar to yours, and you find these elaborate workarounds: shadow copies of state, parallel indices, these hacks where you say, "Well, I care about A, but I'm gonna get a callback on B. I know, I'll hook B and set a timer and then start polling A," and oh my god. This is the way you create software that usually works. This is not the way you create software that
actually works. So what would happen to you if you tried to go and fix this problem, if you tried to refactor that module? Something really interesting happens. You do the work: you refactor the module, you rip out piles of workaround code from other modules, you do your feature, you're feeling great. You go to check this in, and you can't check that in. Why not? "Well, you know, Ken, that code you just refactored, that's being used by four different features on nine different platforms. Tell me, were you planning to retest all that yourself?" "Well, of course I can't do all the testing myself, that's crazy. But, you know, we're doing a big release here, right? Surely QA is already gonna test all this stuff." "Well, back when we first developed the feature or first shipped the platform, yeah, we did full feature test. But, you know, for this release we weren't planning on making any major changes in that area, so we thought we'd just do regression test." Oh, regression test. What does that mean? We poke it with a stick, it wiggles, we ship. The customers can tell us if there are any problems.

So just as my first rule of quality is get rid of QA, my first rule of avoiding regressions is get rid of regression test. I started Arista bound and determined that I, as a software engineer, could refactor any module I wanted on any release I wanted, because our automated test environment would run every test, every case, for every feature, on every platform, for every release. And that's what we've done. We have a data center here, and another one across Central Expressway, full of equipment running 24 by 7, where software decides which branch to build, which test bed to load that build onto, and which tests to run on that test bed; runs the tests; puts the results in a database; and, if a test fails, matches the log against fingerprints of known failures. If there's a match, it updates the bug: hey, it happened again, right over there. If not, it goes to an outsourced group to fingerprint it, open a bug, and dispatch it to engineering. This system has
effectively eliminated regressions. We've had almost zero. I wish I could look you in the eye and say we've never had a bug, but that's not true. We've had bugs. But as you go down an Arista maintenance train, the software just gets better. It's not the situation our customers are used to. What they're used to is, "Hey, the vendor finally has a release that fixes the bug," and it introduces five more, right? So customers are loath to upgrade, and as a result, what I see now is people in the field getting burned by zombies: bugs that we fixed, sometimes years earlier, where the customer hasn't upgraded.

So what we're trying to do about that is, first, be able to assure the customer with a straight face that as you go along the maintenance train, it's just getting better. You don't have to recertify every time we fix a bug.

Second, make it easier for the customer to tell if they need to upgrade. We're working on a feature that monitors your installation and figures out which known bugs apply, given the hardware models you have, the software versions they're running, the features that are enabled, and potentially other operating characteristics, and alerts the customer: these five bugs are present in your network. Maybe they haven't bitten you yet, but they might. Now would be a very good time to upgrade. Here's a maintenance release that has all those bugs addressed and no new exposures.

And then we've got to make it easier and less painful for customers to actually do the upgrade, because if you've got a network ops background, updating the software on 2,000 switches is not a fun task. That's what we're trying to do with our Smart System Upgrade, addressing it from two points of view. First, not requiring maintenance windows: you can update the software on our switches on a live network. [unclear] at the leaf you lose 17 milliseconds; if you're using MLAG or maintenance mode, you lose exactly zero, both on the way down and on the way back up. And then second, automate the process
of going switch by switch by switch: taking the switch out of service, updating the software, putting it back into service, and validating that it is still functioning correctly and still doing the same role it was doing prior to the upgrade. If anything looks off, raise your hand and say, "Somebody with a, you know, carbon-based brain, please look at this, because I'm not sure this is working exactly like it should." Then a human operator gets involved and either says, "Oh, that's fine, that's expected, go to the next switch," or says, "Oh boy, that's not right, roll back," and then we automatically undo the upgrades across the network.

So that's what we're doing to try to take our customers' operational experience, which we believe is already far better than with competitors in terms of the quality they've experienced, up to the next level: to get them onto the heads of our maintenance trains so we can put an end to zombie bites. And then all of us will be in a much better place.
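
The state-oriented coalescing described in the talk (a next hop changing from A to B to C to D, with a slow reader only needing to see D) can be illustrated with a minimal Python sketch. This is not Arista's implementation: the class `CoalescingQueue` and its methods `publish` and `drain` are hypothetical names chosen for illustration, and the sketch assumes updates are keyed by the piece of state that changed, such as a route prefix.

```python
class CoalescingQueue:
    """Pending state updates for one slow reader (hypothetical sketch).

    Unlike a message queue, which grows by one entry per message,
    this holds at most one pending entry per key.
    """

    def __init__(self):
        # Plain dicts preserve insertion order in Python 3.7+.
        self._pending = {}

    def publish(self, key, value):
        # A newer value for the same key replaces the older one,
        # so intermediate states are dropped, not queued.
        self._pending[key] = value

    def drain(self):
        # The reader picks up only the latest value for each key.
        updates = list(self._pending.items())
        self._pending.clear()
        return updates


queue = CoalescingQueue()
for next_hop in ["A", "B", "C", "D"]:
    queue.publish("10.1.0.0/16", next_hop)
queue.publish("10.2.0.0/16", "A")

print(queue.drain())  # [('10.1.0.0/16', 'D'), ('10.2.0.0/16', 'A')]
```

The amount of pending work is bounded by the number of distinct keys rather than by the number of changes, which is why a reader that falls behind during heavy churn can still converge on the correct final state.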
Info
Channel: Tech Field Day
Views: 25,250
Keywords: Tech Field Day, Networking Field Day, Networking Field Day 10, NFD10, Arista Networks, Ken Duda
Id: VdJZq4dRjf4
Length: 20min 57sec (1257 seconds)
Published: Fri Aug 21 2015