Cisco 8000 Series - Under the Hood

Captions
All right, so my name is LJ Walker. I'm a principal TME in the SP networking group, which is whatever we call the big-router business unit this week; we seem to change names more often than I change jobs, so at one point I had a set of business cards printed that just said "highly trained network monkey," and I figured it was easier to leave it at that than to try to keep up.

We'll talk in a little bit of detail about the system. If anybody watching this online is in Barcelona and wants to come hang out this afternoon at the breakout session, we're going to do these slides and a few more there. What we're going to cover is an introduction to the technical side of the systems, a little more detail about Cisco Silicon One - the new NPU, the new ASIC that powers the system - and a bit of a feel for the kinds of technology challenges we have to solve trying to build systems that go this big and this fast. The key theme when you build a system like this ends up being what trade-offs you're going to make: everybody wants it to go faster, everybody wants it to use less power, everybody wants it to be cheaper, and at some point you can't choose all three of those things at the same time. So a lot of building systems, especially when you get out to the edge of some of these parameters, is about making trade-offs that make sense. Then we'll talk a little bit about what we're looking for moving forward, over the next two to four years, and after that we'll segue into Mark's material - he's got some excellent content on how we're going to do 400 gig in the optical domain.

So, to start out: why do we need a new chip? The answer is, you don't necessarily need a new chip if everything the old chip does is fine forever - but we don't tend to work in an industry that supports maintaining things forever. The requirements change: everybody wants to build bigger networks, everybody wants to build faster networks. In large part this architectural shift has been either enabled or demanded by the growth of the mega-scale cloud providers. Some people think that's a terrible idea and some people think it's a great idea, but the reality of the market is that the networks being built in that space are orders of magnitude larger and orders of magnitude higher bandwidth than what most of the market was building five or ten years ago. There's been a tremendous amount of both implicit and explicit change in how those networks are built, and those changes have flowed down into the supply chain and into how people are asking us to build these really big systems.

With the scale-out of these networks, there's a meaningful increase in the amount of attention people pay to the bandwidth that gets delivered and how much power it takes. If you look at the TCO calculations for a lot of these big networks, the amount of money people spend on power is often comparable to - and in some cases even outstrips - what they're paying for the systems themselves. So building systems that are power efficient, and that allow the people deploying them to make the trade-offs they want to make in terms of where to burn watts - watts that are actually going to either save them money or make them money, depending on which side of that you're on - matters a lot.
Again, that's driven primarily by the hyperscale cloud players, and a lot of the technology and innovation that's been developed to support those very, very large networks is now flowing down into the rest of the ecosystem, so everybody else gets to benefit from it.

We'll talk a little bit about some of the changes in the underlying component technologies. These are all the little bits and pieces that the serious hardware nerds - of which I only pretend to be one - have to think about in terms of how we're going to build these systems and how we're going to make everything fit. An ASIC architecture, if you'll let me bastardize that term a little, usually lasts somewhere around ten to fifteen years. So when we started looking at the program and at what was available in the market - some of which had been completely designed by Cisco, some of which had been partially designed by Cisco, and some of which was purchased through merchant silicon providers - we took a look and said: we don't have a really good view of where the silicon is going to be in the 2020s. This was about five or six years ago, when we sat down, got a meeting of the minds, did some of the math, and said, we're not sure what's going to exist in the 2020s, so we're going to have to start looking at things that give us options for building some of these systems differently.

As you start thinking about what the chip needs to do, or what kind of chip you need to go out and get, you think about where it's going to be deployed and to what extent it has to have feature parity. I've been building and supporting big routers for almost 25 years now, and one of the things people have had to get used to is that in the old, old days, new routers had all of the features of the old routers. Over the last 15 or 20 years that has changed: new routers, certainly when they first ship, don't have all the features the old routers do, and there are always a couple of things where, if you want to go another order of magnitude faster, you're not going to be able to do whatever feature you used to do. So people have started designing and building networks differently, and we have to think about how we're going to deliver the features that people absolutely can't live without, and how we're going to help people get away from the features they're going to have to live without if they want to do - in 2020 - what we're talking about, which is hundreds of terabits in a single system.

Then you have to think about what kind of lifespan you're going to have for a chip like this. You can't spend the amount of money it takes to design a chip this complex, do one iteration of it, sell a few thousand of them, and go home - you'd never, ever, in a million years make back the money it takes. It's on the order of hundreds of millions of dollars to get something like this out the door. It takes an incredibly large amount of software and hardware development, and then there's the initial manufacturing investment: look at the fabrication facilities from the likes of TSMC - multibillion-dollar buildings using machines that cost tens of millions of dollars each to fabricate these chips.
What you end up with is something that's 800 square millimeters' worth of silicon that can move terabits and terabits of data.

Audience: General-purpose CPUs have also developed a lot. A lot of the functions we used to do on specialized ASICs are now implemented very nicely on general-purpose CPUs - encryption, for example: back in the day we had specialized chips for that, and now modern x86 has caught up with those functions. How has this influenced your build process? Do you also have to consider what general-purpose CPUs might be able to do five, six, ten years ahead?

We do. One of the challenges of putting these systems together is that the network people primarily think in terms of bandwidth and the CPU people primarily think in terms of operations per second, or cycles per second, and the two are orders of magnitude apart. The really high-end x86 forwarding paths will do ten-ish gigs on a core, and we're talking about systems that do 200 terabits - so in my universe, software forwarding doesn't exist. What we have done, and will certainly continue to do, is this: if you start drilling down into the different blocks inside the NPU, almost all of those functionalities were things that at some point were too complex to do in hardware. The lookup tables, the data structures - just like in the CPU universe, more and more of those things have been implemented in logic and then integrated directly into the NPU, and it's that combination of things that lets you do the features at terabit scale. That said, there are certainly still plenty of things with no hardware support, some of it because the algorithms are complex enough that it just doesn't make sense - but a lot of that technology ends up being integrated into silicon over time.

Audience: You said the new routers don't always have all the features of the old ones - do you have an example of that?

The list is probably too long to get into here in terms of which features get delivered when and where. Generally, you've got to crawl before you can walk from a forwarding-plane standpoint. Why do you do IP and MPLS first? Well, (a) because that's what most people want, and (b) because you can't do BGP policy accounting until you can do all the forwarding it sits on top of. So a lot of it is just about what order we implement things in. Then, as you get faster and faster, there are things like almost anything that requires stateful or deep packet inspection: if you're going to build an NPU, if you're going to build silicon that can look really, really far into the packet, you're paying that cost on every packet you switch. If I say I want to be able to look a thousand bytes into the packet for whatever key fields are there, I've got to carry that data through the entire pipeline - so whether I use it or not, I'm paying for it: I'm burning those wires, I'm burning those gates. For ultra-fast systems the trade-off just doesn't make sense. Nobody's interested in me adding 50 percent power to the NPU so I can do a stateful inspection feature. Everybody wants the feature, but when you look at what that trade-off is going to cost, especially at really high speeds, the math just doesn't net out to where people are ultimately willing to pay for it.
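To put the software-forwarding comparison above in perspective, here is a rough back-of-the-envelope sketch using the speaker's approximate figures (ten-ish gigabits per high-end x86 forwarding core versus a 200-terabit system); the numbers are illustrative approximations, not measurements.

```python
# Back-of-the-envelope: software forwarding vs. a 200 Tbps system,
# using the rough figures quoted in the talk.
GBPS_PER_X86_CORE = 10    # "really high-end x86 forwarding paths will do 10-ish gigs on a core"
SYSTEM_TBPS = 200         # "we're talking about systems that do 200 terabits"

cores_needed = SYSTEM_TBPS * 1000 / GBPS_PER_X86_CORE
print(f"x86 cores needed to match {SYSTEM_TBPS} Tbps: {cores_needed:,.0f}")
# -> 20,000 cores, which is why "software forwarding doesn't exist" at this scale.
```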
So, a little bit of history on the system. It's 2020 now; we started this plan about six years ago, when some really, really bright people sat down in a room and asked: what do the systems look like now, what do we think they're going to look like then, and what sorts of physical and mechanical constraints are we going to have building them? We acquired a company called Leaba Semiconductor in 2016 - that's the group that did the core intellectual property on the ASIC. We did the QSFP-DD MSA in 2016 for the optical modules. Fast-forward three or four years and a ton of hard work by a really large group of people, and we shipped the first platforms to customers three or four months ago, then did the formal launch last month from San Francisco, where Chuck and a bunch of folks got up on stage, waved hands, and showed off what we'd all been working on for four or five years.

Same overview here that Najim covered. The two fixed systems: one 1RU, one 2RU. The product numbers are not the number of rack units - they just happen to line up for the first two systems; an 8203, for example, would still be a 1RU system. Then the 8800 series is the modular distributed chassis, where the number is the number of line cards.

One of the guiding principles when we started the development was that, through the Leaba acquisition, we had access to a chip that could function as a standalone processing unit for a fixed system, as the forwarding NPU for line-card-based systems, and as the fabric interconnect for the larger modular systems. There are lots of technology and supply-chain cost benefits to being able to use the same silicon in different parts of the system: you get to amortize some of those really, really high fixed silicon costs over multiple uses throughout the system.

One of the things we talk about is that the 8000 is really the first completely new router design we've done in about 15 or 16 years. Look at the legacy of big routers inside Cisco - and I'm pretty old at Cisco; I'm not old enough to have been around for the AGS, but I was old enough to have been around for the 7000 and the 7500 and the VIPs that went into them, which was all one very similar architecture, all software forwarding. In 1998 we launched the GSR, the first really, truly modular switch-fabric interconnect system on the market; we did six meaningfully different hardware engines on the GSR, and that switch fabric scaled from 2.5 gig per slot in its first incarnation - which is almost impossible for me to imagine now - all the way up to 40 gigs per slot by the time the product was done. In 2003 we launched the CRS, which inherited a fair bit of software technology from the GSR but had a completely new network processing unit; that became CRS-3 and CRS-X. We took the NPU technology from the CRS and used it to build the NCS 6000, took the fabric interconnect from the DNX group and built the NCS 6000 out of that, and built the two-terabit line card on the NCS 6000. In 2008 we took the Nexus 7000 fabric and a customized EZchip NPU and turned that combination into the ASR 9000, and we've now run through the fifth generation of line cards and silicon on the ASR 9000.
In 2015 and 2016, in the XR universe, we built merchant-silicon-based systems: the NCS 5500, the modular system based on Jericho, Jericho+, and Jericho 2, and the NCS 5000, which is based on the XGS-series silicon from Broadcom. On Spitfire - on the 8000 - we've taken a completely new NPU that we got from the Leaba acquisition, made some meaningful changes in the software universe with IOS XR 7, and put those together to create the 8000 series.

Zooming in: the top system here, the 1RU box, is the 8201 - that's 24 by 400 gig plus 12 by 100 gig. The 2RU version on the bottom is the 8202. It's exactly the same chip, and if you look at everything other than the mechanical design it's basically an identical system: we took the same number of SerDes, the same number of links, and did the internal mechanical breakouts differently to optimize the 2RU system for 100-gig density rather than 400-gig density. That's really about the only difference between these two systems.

The 8800 series - the front view here - is the modular system. There are two RPs, and the RPs run XR, so in that regard the software architecture - the software model, if you will - looks very similar to the other distributed XR routers we've shipped in the past. Then there are the two different line card types; up top is a picture of the 48 by 100 gig card. If you notice, the vertical pitch of these line cards is a little bit bigger than what folks are used to seeing on, say, the 5500 or the ASR 9000, and there are two reasons for that. The first is that we wanted to be able to build the 48 by 100, and from a physical standpoint this is how much space you need to fit the optics. We also wanted - and again, these were decisions made four and five years ago - to look down the road at very high power, long-reach 400-gig optics. Twenty watts doesn't sound like a whole lot of power in most contexts, but when you start talking about putting 36 of them in something that's only about this big, it's an awful lot of power, and when router nerds say power, what they almost always mean is heat. So we have to look at the mechanical design to be able to take those optics - we want to be able to put any of those optics, any of those modules, into any of those plug holes at any given time - so mechanically designing the system with sufficient thermal capacity is pretty important.

If you look at the back of the chassis, the fabric cards go into the back vertically and the line cards go in horizontally; we borrowed the orthogonal design from the NCS 5500. The fan trays go behind the fabric cards and pull air through the front. The power connections are in the back, and the power modules get added and removed through the front. All of the modular systems have four fan trays and eight fabric cards. One of the things we paid a lot of attention to in the design was making sure people had enough flexibility to populate the fabric however they want, so we have customers and use cases for four, five, seven, and eight fabric cards - obviously the more fabric cards you have, the more bandwidth you get. If the system is only populated with the hundred-gig cards, which is probably a fair bit of the market for the foreseeable future, you only need five of the fabric cards to provide a fully redundant system: if you've got a system with five fabric cards and you shoot one of them in the head, all the hundred-gig cards are still going to run wire rate, every port, all the time, on just four cards. As you move to the 400-gig line cards you've got more uplink bandwidth, so you need eight fabric cards for full redundancy; if you lose one of those, you can run the system at full bandwidth on seven. Most of the other competing or alternative architectures we looked at don't do fabric redundancy in that sense - if you lose something while all the cards are populated, you're not going to be able to carry all that traffic. Some people need that, some people want that, some people don't.
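As a quick sketch of the arithmetic behind the fixed boxes and the optics-power comment above: the 8201 port counts are as stated in the talk, and assuming every cage of a 36-port line card takes a roughly 20 W long-reach module is a worst-case illustration on my part, not a shipping configuration.

```python
# Front-panel bandwidth of the 8201 as described: 24 x 400G + 12 x 100G.
front_panel_gbps = 24 * 400 + 12 * 100
print(f"8201 front panel: {front_panel_gbps / 1000:.1f} Tbps")      # 10.8 Tbps

# Worst-case optics heat on a 36-port line card if every cage takes a ~20 W module.
ports, watts_per_optic = 36, 20
print(f"Optics power per line card: {ports * watts_per_optic} W")   # 720 W of heat to pull back out
```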
If you explode the system out and just draw all the NPUs and how they're connected, it's just a great big leaf-spine topology inside the system. Take the 18-slot system: it's four NPUs on a line card and four NPUs on each fabric card, so you've basically got 32 spines and eighteen times four - 72 - leaf NPUs in a great big two-layer Clos fabric. XR software wraps the instrumentation and control system around that and makes it look like one router, but if you just draw boxes and lines on a sheet of paper, it's nothing other than a big leaf-spine system that's been mechanically compressed.

Here's the block diagram of the system. The fan trays go in the back there in yellow; the fabric cards are in this salmon-orange color, with vertical orthogonal connectors to the line cards that go in the front of the chassis. The two block diagrams on the right: up on top there are four forwarding ASICs - four NPUs - and 36 QSFP-DD ports, so each of those NPUs is wired up to nine of those ports. All of the line cards have a standard x86 CPU and DRAM running XR, which looks very much like the other XR modular systems we've built. On the bottom is the block diagram for the 48 by 100 card: two forwarding ASICs - two NPUs - on this card. We've got more physical ports but not as much bandwidth, so we don't need as much silicon to do the work. There's the MACsec device that lets us do wire-rate MACsec on all of those hundred-gig ports, and then there are 48 QSFP28 cages on the front of that line card.

All right - Cisco Silicon One. This is my baby. The chip attempts to walk the line between things that were formerly only in the domain of switching ASICs and things that were formerly only in the domain of routing ASICs. It doesn't do all of both - nothing ever does all of everything; that's not a thing. The chip itself was a complete redesign: it was a startup that got started in 2015, and it honestly doesn't look like anything anybody's ever built. It was a different approach to solving a different set of problems, and we'll talk in a little more detail about how they did it, but it's incredibly impressive: 10.8 terabits, and it will do 7.2 billion packets a second. You can use it as a fabric interconnect, you can use it as a line card NPU, you can use it in a standalone system. The external name for this is the Q100 - that's what's been shipped. If anybody wants to walk across the street, we've got a mechanical sample on a really fancy little light box over there. It looks like a great big chip, which from the outside is never all that impressive, but from the inside it's pretty amazing.
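One way to read the two headline Q100 numbers together is to find the average packet size where they cross over; this is just arithmetic on the quoted figures and ignores per-packet wire overhead such as preamble and inter-frame gap.

```python
# Where 10.8 Tbps and 7.2 billion packets/second meet.
CHIP_TBPS = 10.8
CHIP_PPS = 7.2e9

bytes_per_sec = CHIP_TBPS * 1e12 / 8
crossover = bytes_per_sec / CHIP_PPS
print(f"Crossover at ~{crossover:.0f}-byte average packets")
# Above ~188 bytes the chip is bandwidth-bound; below that it is packet-rate-bound.
```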
So - I always make the joke that this is the two-hour slide. I'm kind of an NPU and ASIC nerd; this is the stuff I like talking about, so I'll try not to drive everybody totally crazy here. We built these spider charts, because when you start trying to build and design these chips, these are all the things that everybody wants: everybody wants a system that is infinitely scalable, counts every single statistic known to man, is completely programmable, has very dense, very deep, very fast QoS, has an infinite amount of packet buffering, goes infinitely fast, uses no power, costs no money, and has more bandwidth than anybody will ever need to switch. Exactly right - it's super easy, we'll just step through the matrix and build it. In reality, what we end up doing is making trade-offs in terms of which of these things we're going to optimize for, and these spider charts represent the trade-offs that were made by the major silicon players in the market.

On the right side, if you will, I have two examples of what we traditionally thought of as switching SoCs. Look at the things they optimized for: bandwidth, power, packets per second. In exchange you're giving up scalability, you're giving up buffering, and in a lot of cases you're giving up the ability to add features later. What you get out of that is something that's faster and cheaper and lighter. If you take the opposite approach and go over to the left side of this slide, you look at devices that were optimized for - and I hate the quote-unquote "service provider versus data center" terminology, but it translates relatively well and most people know what you're talking about - the classic service router space, where people cared about really different things. These systems generally didn't go into the network in 2010 and get ripped out in 2012 when something new came along; if you look back at things like the GSR, those went into the network in 1999 and came out of the network in 2014. These are systems where people want to put them in, add features, swap out line cards, have the old line cards work with the new line cards and the newer line cards work with the old ones, and have everything run for a much longer period of time. So you end up building the NPUs in really different ways: you have a lot more programmability, you have a lot more scale - especially because you don't necessarily know how the systems are going to have to scale over time - and that market traditionally cares a lot more about QoS and a lot more about deep buffering. You just make a different set of trade-offs when you design the chip.

When we went and built this guy, the chart doesn't really look like anything else; it's a different set of trade-offs. One of the examples I use here is that everybody gets the same amount of string: the current state of chip-building technology says that, for a given chip, I can only put down whatever it is - 700 or 750 square millimeters of silicon.
That's the biggest chip anybody can physically manufacture. No matter what I do, or how bright I am, or what choices I make, I am ultimately limited to a certain number of transistors, and if I can't solve the problem in that number of transistors, the problem can't be solved. So in that context, picking and choosing which things to optimize for is really, really important - and finding one or two ways, whether it's algorithms or intellectual property blocks or whatever noun you want to use, to do things meaningfully more efficiently or meaningfully faster can drive really big impacts into how fast the chip can go.

When we look at the Silicon One architecture, the chart in the middle is kind of a hybrid of these other things. In absolute terms it's not going to be quite as fast, or have quite as much bandwidth, or be quite as power efficient as a chip that is 100 percent optimized around those things - but it's really close. It won't scale in terms of FIB or routing or stats to the extent that something like the ASR 9000 or the CRS would, because the memory required to do that drives too much cost - too much opportunity cost - into the rest of the system. But we've made choices that we obviously think are the right ones for a big part of the market, and we definitely think that's the direction things are going, just from cost pressures and power pressures. I have this conversation with people all the time: they say, "but I really like this thing that I used to have," and I say, "we can give you that - are you willing to spend 50 percent more in terms of power to get it?" It used to be that the answer was yes, because power wasn't terribly important; it just wasn't that big a number. Now it is that big a number, and people are starting to care an awful lot.

Audience: Okay, so the middle one is the one that you want going forward. What I find interesting is that PPS - packets per second - is probably the only one that's right out on the outermost edge it can be, and that's only important for the upper-right one; the others weren't even close to that. So you must have made the decision that this is really, really important, even though it's only in one of them and not the others.

Yep, it is. The packet rate, and the ability to do that processing, is important partly because it's directly coupled to bandwidth, so driving that bandwidth implicitly drives the packet rate. One of the two or three key intellectual property advantages we have in this chip - and I'll show you a diagram of the chip in a minute - is that we've got six different forwarding slices, and each of those pipelines runs at the clock rate, where the highest number I've ever seen on another device is four. Clock rates are in the 1 to 1.2 gigahertz range; we run them between 1 and 1.05 depending on exactly which system it is. With that kind of parallel slice architecture in the chip, I certainly don't get those packet rates for free, but I get them at a meaningfully lower cost than if I had to replicate the entire forwarding core six times over. One of the meaningful changes is that we've got some really slick intellectual property around how these multiple forwarding slices access the memory structures that store the information, and a lot of that drives packet-per-second rates that generally you wouldn't associate with this much bandwidth.
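A small sanity check on how the six parallel slices relate to the headline packet rate; assuming the 7.2 billion packets per second is spread evenly across the slices is my simplification, not a stated spec.

```python
# Six forwarding slices sharing the headline packet rate.
CHIP_PPS = 7.2e9
SLICES = 6

pps_per_slice = CHIP_PPS / SLICES
print(f"{pps_per_slice / 1e9:.1f} billion packets/second per slice")   # 1.2 Bpps
# With pipeline clocks in the ~1 GHz class, that is on the order of one packet per
# clock per slice -- which is why adding parallel slices, rather than pushing the
# clock rate, is what scales the aggregate packet rate.
```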
So, the silicon itself. I think it's fair to say this is a merger of the two best silicon teams in the networking industry: the Leaba team was the group that built the DNX chips - that was Dune Networks, acquired by Broadcom - and the Cisco silicon teams built the ASR 9000 silicon and the forwarding engines for the CRS and the GSR. It's that group of people who've been combined into what is now the Cisco silicon organization. The goal was to get capabilities that looked like routing at speeds and feeds and power that looked like switching - that same set of trade-offs.

One of the ways we did that was by eliminating all of the off-chip memory. We'll talk about this in a little detail: memory bandwidth and memory access rates do not track Moore's law - they don't grow that fast - so as we try to grow bandwidth and capabilities at what looks like Moore's law, we have to find other ways to do it, and the cost, power, and board space it takes to put down a whole bunch of off-chip memories is just more than you can afford if you want to go this fast. This chip uses what we call a 2.5D design: the HBM, the high-bandwidth memory, lives on a silicon interposer that's directly connected to the main forwarding die, and it's all packaged into a single chip. So there's a large silicon block, roughly this big, that has two of the HBM devices plus the main forwarding die. That lets you get the buffer memory - the eight gigs of buffer memory - physically closer to the main die without having to put that much memory into the chip itself; you could never fit that much memory onto a chip and run it that fast. Then there are the algorithms and the intellectual property that go into doing the forwarding - that's kind of where the secret sauce is, and we won't talk too much about it, because it's the secret sauce. And we'll talk about the multi-slice architecture and how you can scale the NPU up and down.

Audience: I have one question about the buffers. Were you aiming for a concrete number for the kind of memory you'd like to have in the chip, or was it more "we put in as much as we can" in the design?

It's not "as much as we can"; some of it is determined by what memory technology is available in the industry. There's no technical reason I couldn't have put down four HBM stacks rather than two and had 16 gigs of buffering. The physics of the memory universe is such that you can put on the order of tens of megabytes of memory in the chip, and to go from tens of megabytes to gigabytes is obviously a couple of orders of magnitude; the only way to jump over that huge chasm is to go to an external device. HBM is a memory standard, and the devices come in the sizes they come in. I think you could probably contend that eight gigs is more than you actually need for the vast majority of cases, but that's what gets built, so that's what we use - you can't go and get a 512-megabyte HBM part; it just doesn't get built.

Audience: I can also understand that when customers come to you, buffers are like religion - some want them really large, some don't care. For you, of course, you have to serve many use cases, and in the end it's a compromise, like you showed in the graph.

Yeah, yeah.
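One way to put the eight gigabytes of on-package HBM in context is to express it as buffering time at full line rate; this is back-of-the-envelope arithmetic only and says nothing about how the buffer is actually carved up per port or per queue.

```python
# How long 8 GB of packet buffer lasts if the whole 10.8 Tbps chip buffered at line rate.
BUFFER_GBYTES = 8
CHIP_TBPS = 10.8

buffer_bits = BUFFER_GBYTES * 8e9
drain_ms = buffer_bits / (CHIP_TBPS * 1e12) * 1e3
print(f"~{drain_ms:.1f} ms of buffering at full line rate")   # ~5.9 ms
```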
So, in the 8200 series - the fixed series - we run all of the forwarding slices on the chip in what we call network mode, which literally means you have external-facing interfaces plugged into them. Being able to use the entire chip as a forwarding entity lets me build these very, very efficient single-chip systems. For cases where a single system solves the bandwidth problem - where it meets those requirements - it's a binary order of magnitude more efficient in terms of power and space than anything else, because I get to save all the interconnect. As soon as you try to make a system bigger than whatever you can do on a single chip, now I have to build some sort of mesh or fabric of multiple chips, and the increase in cost and power there is pretty substantial.

If you look at the block diagram of the chip, it's a six-slice architecture. Each of those forwarding slices we logically break up into a receive path and a transmit path, and there's what we call the SMS, the shared memory switch - the interconnect inside the device that interconnects the slices. So a packet comes in on slice 0 and wants to go out slice 4: we do the forwarding lookup, figure out where it's going, switch it across this internal SMS over to the other slice, do the transmit functions on that side, and out it goes.

If you take a bunch of these and stitch them together, you have the same silicon - the same part - operating in two different modes. In the line card mode, we take three of these slices and run them in what we call fabric mode - an incredibly original name - where the forwarding operates as a switch fabric rather than as network-facing ports. If you unfold that picture, you end up with the leaf-spine topology I was talking about: you put 32 of these fabric chips in the middle and 72 of them on the outside, and now you've got either a folded or a wrapped leaf-spine architecture. From a scalability standpoint, I can wire up hundreds of these things - I can put them in a modular chassis if I want, or I can take them and wire them into a whole bunch of small fixed boxes and build even larger networks that way. The hyperscalers build systems primarily out of small fixed boxes, because there's no such thing as a modular box that's big enough to do the kinds of bandwidths they need.
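A tiny illustrative model - not the real SDK or any actual configuration syntax - of how the same six-slice die gets provisioned in the roles described above: all network-facing in the fixed 8200s, half network and half fabric on an 8800 line card, and (my assumption) all fabric when the chip is used as a fabric element.

```python
# Illustrative model (not the real SDK) of the slice modes described in the talk.
from enum import Enum

class SliceMode(Enum):
    NETWORK = "network"   # slice drives front-panel ports
    FABRIC = "fabric"     # slice drives links toward the fabric/spine NPUs

def provision(role: str) -> list[SliceMode]:
    """Mode of each of the six slices for a given system role."""
    if role == "fixed-8200":        # standalone box: every slice faces the network
        return [SliceMode.NETWORK] * 6
    if role == "8800-line-card":    # three slices face the network, three feed the fabric
        return [SliceMode.NETWORK] * 3 + [SliceMode.FABRIC] * 3
    if role == "8800-fabric-card":  # pure fabric element (assumed: all slices in fabric mode)
        return [SliceMode.FABRIC] * 6
    raise ValueError(role)

for role in ("fixed-8200", "8800-line-card", "8800-fabric-card"):
    print(role, [m.value for m in provision(role)])

# Unfolding the 18-slot chassis gives the leaf-spine counts from the talk:
print("leaf NPUs:", 18 * 4, "| spine NPUs:", 8 * 4)   # 72 leaves, 32 spines
```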
So, the main architectural trade-offs - the architectures, if you will - that we're going to talk about: there are three we're going to compare, and one of the big points is that none of these is clearly better than the others for all cases; that's not the way the world works. You can optimize these chips for different functions; you make different trade-offs.

In what I call - and I don't pretend these terms are absolutely perfect, they're just the terms I choose to use, so don't give me too hard a time about it - a very highly optimized system-on-chip type NPU, you've got a relatively rigid pipeline. These systems are designed to operate as standalone chassis: they don't have VOQ-based traffic managers, and there's no protocol built into these devices to ask another chip, "hey, how many buffers are you using, because I want to do allocation more fairly?" There's just less interaction. When you look at the hardware blocks of the system, they're much more tightly coupled. The analogy I've used in the past is that you can bolt things together or you can weld things together: it's more efficient to weld them - you don't have to pay for the bolts - but then you can never, ever get them back apart. The blocks inside the chip are very tightly coupled; the entire thing is designed together. The way this has been scaled up for the last six or eight years is that you take a forwarding core, the process nodes get smaller and the chip technology gets better, and you stamp out another copy of that core - which is great, because in the same silicon footprint you get 2x the bandwidth, or maybe 4x if you're willing to put down four of those cores. It gives you the highest amount of bandwidth. Generally, though, you've got to fit everything on the chip: all your forwarding tables, all your lookup tables, and if you want to do ACLs or TCAM-type lookups, all of that silicon has to live on the die - and again, you're constrained in how big that die can be. So you end up with something very tightly coupled and optimized for a very specific set of things, and generally speaking, whatever the feature set is when it ships is the feature set you're going to have three years from now. An example of this would be the XGS line from Broadcom - that traditional switching silicon.

The other point on that triangle, if you will, is what I've called a pipelined, SP-looking architecture. This is where you've made some changes in the underlying assumptions or requirements: you want to be able to build a bigger system, and you want to be able to communicate state back and forth between different NPUs. If you're trying to build a scheduled VOQ fabric, the chip on the right side of the chassis has to know at least something about the queue state on a chip on the left side of the chassis, so you have to build traffic managers and scheduler hierarchies that keep track of all that state to implement system-wide QoS rather than ships-in-the-night queuing, if you will. Most of these are still pipelined, and they're generally still tightly coupled in the sense that stage two comes after stage one and you can't change that order around. If you end up with a feature requirement that needs information you haven't already computed, you're either going to have to live without it or you're going to have to loop the entire packet back around, which is relatively painful from both a software development and a bandwidth standpoint. The current state of the art here is two cores that share some set of the resources and replicate some set of the resources. What you get is generally much higher bandwidth than on a legacy service-router type NPU, but not as much bandwidth as you get on the system-on-chip.
The designers here will generally allocate a little more of that silicon space for the on-chip lookups, the on-chip TCAMs, and then, depending on where you're building it, a lot of times you'll have the ability to add an external TCAM device. People like Broadcom make these external TCAMs; you literally wire the external TCAM onto some of the ports that the NPU would otherwise use for network or fabric interfaces. The advantage of the external TCAM is you get a lot more space to do FIB lookups or ACL lookups or classifiers or policers, but you're spending bandwidth - ports you would otherwise use - to do that lookup. It gives you more flexibility when you do the hardware design, but those TCAMs are very large, very expensive, and use a lot of power, so you've got to make sure you really need that capability before you jump off the bridge and spend the money.

In the hybrid model we do with Silicon One, the forwarding engines themselves are built out of a set of run-to-completion stages. What this means is that I no longer have to worry about stage two being a different hardware block than stage one. I've got some finite number of clock cycles in which to do the math, but I can do them in whatever order I want. So within one of the engines in the chip, I have the ability to do these features in a different order. When customer A swears on their life that doing MPLS inside GRE is the holy grail and that's the way they want to do it, that's okay; when another customer turns around and says no, no, it has to be GRE inside MPLS - well, if I've got something that's welded together and expects one, it's very difficult to do the other, but if I've got two stages and I can program them such that I don't care which one I do first, it gives me a lot more flexibility in how I deliver those features. One of the meaningful advances here is that the engine that lets me do this sort of architecture isn't terribly larger or more expensive in terms of silicon than welding everything together. It's not free - I pay something for that ability - but one of the reasons the chip is so impressive is that you get that kind of - not arbitrary, that's the wrong word - more flexible ordering of features without paying a terribly high tax in silicon.

Audience: Just to clarify - so in one pass, when the packet goes over the chip, over the long pipeline, you can switch what is done in which order? And how long is your pipeline - how many steps can you do?

Right. What it lets you do is build - it's like a time budget, really - an almost arbitrarily long pipeline, because I've got a run-to-completion engine where, if I want to, I can say "just do some more work, just go do another TCAM lookup." I may pay for it in terms of performance, but I don't have to recirculate the entire packet, so it doesn't cut the bandwidth. The traditional way to solve this was: I had three stages, I couldn't get it done, I needed to do some more math, so I looped the packet back around - and now I've forwarded the entire packet again, I've queued it again, I've burned all the bandwidth again, I've done all the memory accesses again, and it's a relatively coarse mechanism. What I get here is the ability to do a much more granular allocation of forwarding resources to the packet.
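A toy contrast between a welded-together fixed-order pipeline and a run-to-completion engine, in the spirit of the MPLS-in-GRE versus GRE-in-MPLS example above; the stage names and "costs" are invented for illustration and are not the chip's actual stages.

```python
# Toy contrast: fixed-order pipeline vs. run-to-completion engine.
# Stage names and costs are invented purely for illustration.

FIXED_ORDER = ["parse", "mpls", "gre", "queue"]          # order is welded in

def fixed_pipeline_cost(required):
    """Rigid pipeline: stages only execute in the wired order, so anything
    out of order forces a full recirculation of the packet."""
    passes, remaining = 1, list(required)
    while remaining:
        for stage in FIXED_ORDER:
            if remaining and remaining[0] == stage:
                remaining.pop(0)
        if remaining:            # couldn't finish in the wired order
            passes += 1          # loop the whole packet back around
    return passes * len(FIXED_ORDER)   # stage-slots (and bandwidth) consumed

def run_to_completion_cost(required):
    """Run-to-completion: do the stages in whatever order the packet needs,
    spending only the work that packet actually requires."""
    return len(required)

for pkt in (["parse", "mpls", "gre", "queue"],     # "GRE inside MPLS" order
            ["parse", "gre", "mpls", "queue"]):    # "MPLS inside GRE" order
    print(pkt, "| fixed:", fixed_pipeline_cost(pkt),
          "| run-to-completion:", run_to_completion_cost(pkt))
```

Under these made-up costs, the out-of-order encapsulation doubles the work in the fixed pipeline (a recirculation) while the run-to-completion engine spends the same budget either way.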
Audience: So you have some flexibility in the order, and flexibility in the steps that you would like to accomplish - wow.

Right - I can write a block of code that says do some lookups, do some math, do some stuff, and those aren't strictly wired the way they used to be. I can do the MPLS part first and then the GRE part, or I can flip that around. And if a packet needs fewer stages: imagine I have a contrived case with an interleaved set of packets where the even packets all need five instructions and the odd packets all need just one, so my average is three. In a legacy architecture I had to do exactly the same thing to every packet, so either the pipeline was three stages long and half the packets had to get recirculated, or the pipeline was five stages long and everything ran through fine but I was wasting all that resource on the simple packets. Here I get to recover all the resources that I don't use on the simple packets.

Audience: That's really kind of a game-changer.

From an architecture standpoint, it's completely unlike anything else that exists.

All right - some of the things we looked at in terms of what makes this hard. We're trying to chase Moore's law, because compute grows mostly at Moore's law, but the pieces we're using to build these systems don't grow that fast. We're trying to do a 4x bandwidth leap in about three years, and you can't just do a new chip: you've got to have the memory technology, you've got to have enough internal bandwidth to move the data around, you've got to figure out how to get this much power to the components, and then you've got to turn around and figure out how to get the equivalent amount of heat off the components - because physics says every watt that goes into a system comes out as 3.41 BTUs' worth of heat. There's no way around that; it's a law of physics: if the system uses the power, the heat's coming out, and I've got to get rid of it somehow. So we have spent an increasing amount of time on the thermal and mechanical design of these systems compared to what we did in the past. And I'll let Mark talk about the optics, because growing the density of the optics is every bit as important as what we do in the silicon.

We've got to have memory: I've got to store state to do packet lookups, and I've got to worry about the physical size of that memory - how many bits, how many square millimeters, what sizes and shapes of memory I can either integrate into the silicon or buy from somebody - and how many operations per second it can do. This is a huge challenge in networking, because the commodity DRAM used in servers - which obviously dominates the memory market, because the volumes are astronomically high - is not remotely close to fast enough for what I want to do. I want to do 7.2 billion packets a second on a chip; each of those packets takes a couple dozen memory accesses; do the math, and pretty soon you're approaching a trillion memory operations a second. Server memory is not designed to do that, so I either have to put SRAM down directly in the chip, or I have to use something like HBM to bridge the gap between the on-chip and off-chip memories. And I've got to try to get rid of as many accesses as I can: to go seven billion packets a second, I can't spend very long on each packet.
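The memory-access arithmetic above, written out; "a couple dozen" accesses per packet is the speaker's rough figure, so 24 here is just a stand-in for it.

```python
# Why commodity server DRAM can't keep up: packets/second times lookups/packet.
PACKETS_PER_SEC = 7.2e9       # quoted chip packet rate
ACCESSES_PER_PACKET = 24      # "a couple dozen" -- a stand-in figure

print(f"~{PACKETS_PER_SEC * ACCESSES_PER_PACKET:.1e} memory operations/second")  # ~1.7e11
print(f"~{1e9 / PACKETS_PER_SEC:.2f} ns average budget per packet, chip-wide")   # ~0.14 ns

# Hundreds of billions of lookups per second, heading toward the trillion mark,
# is why the tables live in on-die SRAM plus on-package HBM rather than server DIMMs.
```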
So I've got to have the right design of both memory types and memory layouts, and do some really slick algorithmic magic in how I'm doing all these lookups, to make the router go fast enough.

Then you've got to worry about how you get the signal around, especially as we go to higher speeds - high-speed signaling is an incredibly difficult part of what we do. I don't have an EE degree, and I don't have a degree in optical physics; I'm a computer science graduate who spent his entire career managing and troubleshooting networks. To paraphrase whoever it was who said it: any sufficiently advanced technology is indistinguishable from magic - and the signal integrity piece is magic to me. The short version is that the closer you put things together, the easier it is; as you start dragging them further apart, you have to do all sorts of fancy math, and when the math starts using Greek letters, that's usually where I get off the train. But we do spend a lot of effort optimizing the systems around this. You can solve some of it: if I've got a trace that's too long to drive a signal from the NPU to the optic, I can just put a retimer in there, which literally regenerates the signal - which is great, except those retimers cost money, they cost power, and they take up space. One of the things we continue to do better than anyone else is the physical design of those boards: how to build a board of the size and shape to meet the bandwidth I want while minimizing the additional devices I have to put down on it.

A huge part of this is driven by the motivation of reducing the power of the system. We've got a ten-terabit router - switch, switch-router, whatever the hell you want to call it - that will run in about 415 watts. You can't even make coffee with that much power. The hundred-gig version runs at about 750 watts. One thing you will see is that we have stopped including the optics power in the system power numbers - and we're not trying to cheat, we're not trying to be devious, we're not trying to get away with anything. In the old days, when I might put 12 optical modules on a line card and those modules were three watts each, nobody really cared whether we were adding or subtracting 35 watts of power. With 36 ports on a line card at 20 watts per optic, that's 720 watts just for the optics. The variance in optics power, across the number of optics we're now looking at, is high enough that I can't make any assumptions about what people are going to use, so I'm just taking the number out and stating the power design on its own.

There's a ton of work that went into power delivery: I've got to get all this power off the utility, into something that cleans it up and turns it into the right voltages, and then out to all the components. Go across the street and look at the chassis - the busbar in there is about this big around, and it has to carry a tremendous amount of power. There are a ton of little improvements and little innovations that go into the mechanical design of these things, and I try to give the engineers a shout-out because it's kind of hard, invisible work. They do a lot of work where, if I can take out one voltage conversion step, maybe that's three or five percent efficiency for that little part of the design - and on a thousand-watt line card, three percent is 30 watts of power. Thirty watts is not that much power, but 30 watts times eighteen line cards is over 500 watts, and if I can do that two or maybe three times, pretty soon I'm talking about enough power that that's one less power module I've got to wire - and between the cost of those power modules and what you've got to pay the colo or the utility, up to a couple thousand dollars a month, that ends up being a lot of money.
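The voltage-conversion savings worked through above, as arithmetic; the one-kilowatt line card and the three percent figure are the speaker's round numbers, taking the low end of "three or five percent".

```python
# One less voltage-conversion step, in the speaker's round numbers.
LINE_CARD_WATTS = 1000
EFFICIENCY_GAIN = 0.03        # low end of "three or five percent"
LINE_CARDS = 18

per_card = LINE_CARD_WATTS * EFFICIENCY_GAIN
per_chassis = per_card * LINE_CARDS
print(f"{per_card:.0f} W per card -> {per_chassis:.0f} W per chassis")   # 30 W -> 540 W
# Do that two or three times across the design and you've freed up roughly the
# output of one power module, before counting what the colo or utility charges.
```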
Tons and tons of work went into this: you've got to get the power in, and then you've got to get the heat out. You've got all these different components - the chips, the memory devices, the optics - and all of them have some temperature range they can operate in, and some of those ranges are really different, so you have to engineer different parts of the system to be cooled to different temperatures. You've got issues where the air comes in the front of the chassis and passes over the optics; if the optics are using 700 watts of power, I've just added 700 watts of heat to that air, so all the downstream devices are now dealing with warmer air. Inside the chassis I've got to do both thermal modeling and thermal engineering: heat sinks where I can to spread the heat out, get it conducted into the air, and get it back out of the system - preferably without melting whatever's behind it, although sometimes we don't care.

Audience: One question - is it always front-to-back, or back-to-front, or can you switch the cooling?

In the modular systems it's always front to back. In the fixed systems you can flip the fans - that said, if you flip the fans so that the optics are on the downstream side, the room has to be cooler; the maximum ambient temperature changes if you invert the fans.

To give you some idea of how much power and how many components we're looking at: a fully loaded chassis is twenty-ish x86-class CPUs, 576 gigs of buffer memory, 640 gigs of DRAM, and over 5,000 lasers. All of that goes in there, all of it generates heat, all of it uses power, and it does so in different physical locations - even within a single device. We're to the point now where I can't take the NPU, which is roughly a square inch, and treat it as if it's a uniform temperature, because there are parts of the chip that generate more heat than others. So I've got to design not only for the total thermal power, but for whatever the hottest particular spot on the chip is. You've got different optical modules that both generate different amounts of heat and require different amounts of power, and I've got to engineer for all of that; all of that work goes into the mechanical design.

I had a slide on optics, but Mark's going to do the optics piece, so that's the end of the NPU-nerd part. Questions? More questions?
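A quick cross-check of the fully-loaded-chassis figure above, assuming 18 of the 36-port line cards at four NPUs each, with the 8 GB of HBM per NPU quoted earlier.

```python
# Cross-checking the "576 gigs of buffer memory" number for a fully loaded chassis.
LINE_CARDS = 18
NPUS_PER_CARD = 4       # the 36 x 400G card described earlier
HBM_GB_PER_NPU = 8

print(f"{LINE_CARDS * NPUS_PER_CARD * HBM_GB_PER_NPU} GB of packet buffer")   # 576 GB
```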
Audience: What's the position of Silicon One beyond the service provider network? Is this the new architecture that's going to bleed out into the data center, into high-end enterprise? Is this something that's going to be pervasive?

I can't talk in terrible detail about that, but I'll joke that we didn't spend that much money on a company just to use it in one part of the network. That technology will certainly find its way into other parts of the network, it will certainly find its way into other Cisco products - as we announced last month - and it will also find its way into customer networks that want to just consume the chip. You don't have to buy a complete, full-stack system anymore.

Audience: To be a component of a white box? I've been reading about that.

Yeah. We're now publicly open to the idea of selling different levels of the stack to different people. That said, I do think there's a lot of hype around people wanting to build their own systems who have no idea how hard it is - but that aside, we've announced that we're open to it. We've been open to it for a while; now we've announced it.

Audience: Now you're playing in Broadcom's game as well.

Yeah, and we've had a number of customers who want that: they want silicon diversity, they want system diversity, they want software diversity. The nice part about my world is that we're kind of the only ones who can sell them all of it. I don't have to sell them all of it, but we can play everywhere.

Audience: Very cool. But still mostly at the high end, or how does that go?

Mostly at the high end, yes. UADP still has its place, the Silicon One runs at the higher end of it, and maybe there will be something in the middle - who knows.

Okay - thank you.
Info
Channel: Tech Field Day
Views: 4,833
Rating: 4.92 out of 5
Id: KIGct1QOtdI
Length: 58min 30sec (3510 seconds)
Published: Thu Jan 30 2020