Barefoot Networks Evolution of Networking

Captions
So I started this roughly a little more than five years ago, and we devoted those years to building something that is quite useful for the industry. Now we're at a natural juncture where we can reflect on what we have done so far, what it means, and try to draw some observations from it. Clearly, programmable switch chips can have the same power, performance, and cost as fixed-function switches; we delivered them. On top of this programmability without compromise, people can now realize their ideas, and those ideas are not necessarily coming from the chip designers. They can come from the OEMs, your vendors, and sometimes from the hyperscalers and large online service providers themselves. That means we'll keep having more innovation going forward. So that's the high-level summary of my talk.

But before we move forward, we need to confirm one thing: where is the evidence that these programmable chips really run without penalty? I wanted to show some numbers produced not by us but by our partners, collected from one of our OEM partners' public website. Here is the number for the Tofino-based switches, and here is the number for the fixed-function-chip-based switch. By the way, these two chips are built with exactly the same process node technology, so it's a fair comparison. In terms of throughput, the two are roughly the same. On power consumption, Tofino is again reported to be slightly better. On latency, the fixed-function chip can be faster by maybe a few hundred nanoseconds, but if you're optimizing Tofino for latency you can go as low as this number. So this really confirms that there is no penalty for programmability.

Audience: Just because it's one of my hobbies, I'm curious: how are you actually getting, how are you defining, full entropy on your chip? Full entropy for which, your ECMP hash algorithm? Getting full entropy out of a chip is not an easy task, and I'm just curious how you're coming up with that.

Barefoot: Well, the beauty of a programmable chip is that you can come up with the algorithm that actually allows you to extract the full entropy out of wherever the differentiation in the traffic is. That's really one of the more unique aspects: it allows you to get a full distribution in the network, where some people are limited to only using seeds that probably don't fully bring that out.

Audience: So, like electrical impulses to generate your entropy?

Barefoot: No, we're still using packet data, but we could follow up offline on some specific examples. It's something that some people care about a lot, and it sounds like you're aware of that.

Audience: Yeah, you start to get really nervous; there's always somebody.
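To make the hashing exchange above a bit more concrete, here is a minimal Python sketch, purely illustrative and not Barefoot's implementation: the field list, the CRC-based mixing, and the flow values are all assumptions. It shows why being able to choose which header fields (and which seed) feed the ECMP hash matters for how much entropy you actually extract.

```python
import zlib
from collections import Counter

def ecmp_hash(pkt, fields, seed=0):
    """Illustrative ECMP hash: mix only the chosen header fields with a seed."""
    key = "|".join(str(pkt[f]) for f in fields).encode()
    return zlib.crc32(key, seed)

def pick_next_hop(pkt, fields, num_paths, seed=0):
    return ecmp_hash(pkt, fields, seed) % num_paths

# Hypothetical flows: many clients talking to one service on port 443.
flows = [{"src_ip": f"10.0.0.{i}", "dst_ip": "10.1.1.1",
          "src_port": 1024 + i, "dst_port": 443, "proto": 6}
         for i in range(1000)]

# Hashing on too few fields concentrates traffic; the full 5-tuple spreads it.
narrow = Counter(pick_next_hop(f, ["dst_ip", "dst_port"], 8) for f in flows)
full   = Counter(pick_next_hop(f, ["src_ip", "dst_ip", "src_port",
                                   "dst_port", "proto"], 8) for f in flows)
print("dst-only hash buckets:", dict(narrow))   # everything lands in one bucket
print("5-tuple hash buckets: ", dict(full))     # roughly even across the 8 paths
```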
Okay, so if you look at the history of computing, this has happened many times. Graphics started with fixed-function chips, but about two decades ago NVIDIA came out with programmable GPUs. You program them in a high-level language such as OpenCL, and the compiler is responsible for lowering that high-level user idea into the real realization. When they did that, they were of course initially focusing on accelerating graphics-related applications, but people gradually realized that with this high-level programmability, the target, which is merchant silicon, can do more things than just the original things they envisioned, including virtual reality and augmented reality. At this point GPUs are used for a lot of other things, like machine learning, cryptocurrency, and so on. If you went back two decades ago and asked their CTO whether they would have imagined these kinds of things, they would have said no; they didn't even know they would happen. But now it has happened.

We believe something similar to that kind of explosion can happen with this programmable merchant silicon. Obviously we're targeting L2/L3 switching, some enterprise switching, and better telemetry kinds of features, but people are gradually realizing that they can build very fast, very cost-efficient middleboxes, application gateways, and so on, and we'll see more and more innovation beyond just regular networking: some combination of storage and compute acceleration, or interconnect for accelerator kinds of workloads.

So I wanted to share some examples of those. The first example is a layer-4 connection load balancer. What is a layer-4 load balancer? It receives traffic from the internet, or the outside world, destined to load-balanced addresses, also known as virtual IP addresses (VIPs). Each virtual IP address is handled by a number of physical servers, each of which owns a direct IP address (DIP). Essentially, the load balancer maintains VIP-to-DIP-pool mappings, and for each new connection it chooses one of the DIPs in the pool and then consistently forwards packets to the chosen DIP. Of course, these pools can change frequently, because you scale up, scale out, or scale down for maintenance and so on.

People have now realized that they can build a very fast layer-4 load balancer using Tofino, complementing the state of the art, which is scaled-out software-based load balancers. They can either build layer-4 load-balancing appliances and deploy them in addition to software load balancers, or they sometimes fold the layer-4 load-balancing function into their existing switches, typically top-of-rack switches. The benefits are obvious. First of all, Tofino can handle roughly five billion packets a second with guaranteed sub-microsecond latency. If it cannot handle that many complicated things, it will fail at compile time, but once the program compiles, full line rate, maximum latency, and power consumption are guaranteed. How nice is that? That saves a lot of cost, especially if you're burning many software-based load balancers in your environment. It offers predictable, high performance, and because of this it's also quite useful for ensuring robustness against attacks. Remember, load balancers are usually the first tier that receives untrusted external traffic, so robustness against availability attacks is usually very important.

When you build a layer-4 load balancer, simple hashing just doesn't work. The reason is this: suppose you had, say, four DIPs, four servers, in your DIP pool, and you were doing simple stateless hashing, so some connections were mapped here. Now DIP 2 is brought down, for maintenance or failure or whatever reason. You could have assigned its hash space to another DIP, say DIP 3, but then this is not a load balancer anymore, because the space is no longer fairly divided. Or you could have rebalanced the space in a fair fashion, but then what happens is something like this: a connection which was previously mapped to DIP 3 is now mapped to DIP 4, even though DIP 3 is still alive, so you're breaking that connection. It's a connection breaker. That's why you have to do this on a per-connection basis: you have to create per-connection state and ensure per-connection consistency.
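Here is a small Python sketch of that argument; the DIP names and the CRC-based hash are hypothetical stand-ins, not the Tofino data plane. With stateless hashing, removing one DIP remaps connections that were happily pinned to surviving servers, while a per-connection table keeps existing connections where they were and lets only new connections see the updated pool.

```python
import zlib

def stateless_pick(conn, dips):
    """Stateless: hash the 5-tuple, take it modulo the current pool size."""
    return dips[zlib.crc32(repr(conn).encode()) % len(dips)]

class ConsistentLB:
    """Per-connection state: remember the first choice and keep using it."""
    def __init__(self, dips):
        self.dips = list(dips)
        self.conn_table = {}            # 5-tuple -> DIP (the per-connection table)

    def pick(self, conn):
        if conn not in self.conn_table:                  # new connection
            self.conn_table[conn] = stateless_pick(conn, self.dips)
        return self.conn_table[conn]                     # existing: stays consistent

pool = ["dip1", "dip2", "dip3", "dip4"]
conns = [("198.51.100.7", 40000 + i, "vip1", 443, "tcp") for i in range(8)]

lb = ConsistentLB(pool)
before = {c: lb.pick(c) for c in conns}

pool_after = [d for d in pool if d != "dip2"]            # DIP 2 taken down
lb.dips = pool_after
broken = [c for c in conns
          if before[c] != "dip2" and stateless_pick(c, pool_after) != before[c]]
kept   = [c for c in conns if before[c] != "dip2" and lb.pick(c) == before[c]]
print(f"stateless rehash remaps {len(broken)} surviving connections")   # typically several
print(f"connection table keeps all {len(kept)} surviving connections intact")
```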
Doing this requires some careful engineering in the data plane. We have actually demonstrated and built prototypes, and some of our customers took that idea and built their own load balancers. When you do per-connection consistency, as I said, you have to ensure scalability, because you're creating per-connection state entries in the data plane, so it's important to be able to support a large scale. Also, when you burn your precious SRAM to maintain this kind of connection state, it's important not to waste that space on useless connections, like attack connections, so it's important to build those kinds of protections into the device too.

We have prototyped two ideas. The first one is what we call cache mode. It's essentially an accelerator, an almost transparent accelerator, that sits in front of, or on top of, an existing software load balancer. As you can see, the data plane has one simple connection table, and when an incoming packet doesn't match there, the switch simply redirects that packet to the software load balancer, which can run in the switch control plane or on another server, and the software load balancer maintains the connection table. This is actually the most dominant way of using this layer-4 load-balancer solution, and many of our hyperscaler customers in the cloud business have built their own load balancers this way, taking advantage of their existing software load balancers. It has been in their production for more than a year now. We have also built a more elaborate and more powerful load balancer, which takes care of DIP selection even for the first packet of a new connection in the data plane, and the control plane only manages the learning process. This is what we demoed and showcased about two years ago.
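A minimal sketch of the cache-mode idea, in Python rather than P4, with hypothetical class and method names (SoftwareLB, handle_miss) that are not Barefoot's or any customer's actual API: the data plane keeps only a connection table, a miss is punted to the software load balancer, and the chosen DIP is installed so that subsequent packets of the connection stay on the fast path.

```python
import random

class SoftwareLB:
    """Slow path: the existing scaled-out software load balancer (or switch CPU)."""
    def __init__(self, vip_to_dips):
        self.vip_to_dips = vip_to_dips

    def handle_miss(self, conn, vip):
        return random.choice(self.vip_to_dips[vip])   # any per-connection policy works

class CacheModeSwitch:
    """Fast path: a Tofino-style connection table in front of the software LB."""
    def __init__(self, software_lb):
        self.software_lb = software_lb
        self.conn_table = {}                          # conn 5-tuple -> DIP
        self.misses = 0

    def forward(self, conn, vip):
        dip = self.conn_table.get(conn)
        if dip is None:                               # table miss: punt to slow path
            self.misses += 1
            dip = self.software_lb.handle_miss(conn, vip)
            self.conn_table[conn] = dip               # install entry for later packets
        return dip

switch = CacheModeSwitch(SoftwareLB({"vip1": ["dip1", "dip2", "dip3"]}))
conn = ("203.0.113.9", 51515, "vip1", 80, "tcp")
path = [switch.forward(conn, "vip1") for _ in range(5)]    # 5 packets, same connection
print(path, "misses:", switch.misses)   # same DIP every time, only 1 slow-path miss
```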
The next interesting example is HPCC, standing for High Precision Congestion Control. This is actually not something we did; one of our customers, Alibaba, did it, and they showcased it at a top-tier conference called SIGCOMM a couple of months ago. Essentially, they used INT, the in-band network telemetry data-plane feature, to do very fast and precise congestion control between the switches and their end hosts. Congestion control is always a closed feedback loop between where congestion happens, basically the switches, and where you can actually throttle the traffic, meaning your end hosts. What they did is something like this: they used our Tofino switches, which add INT information to every packet; this INT information is then piggybacked onto the ACK packets of the TCP or RDMA congestion control protocol, and the end hosts, or their FPGA-based smart NICs, use this information to adjust the rate precisely.

One very important thing is that because this information, link utilization and queue statistics, is so precise, timely, and accurate, they could do what was considered almost unthinkable so far, which is MIMD: multiplicative increase and multiplicative decrease. TCP variants have all used additive increase; they are very conservative when they increase the rate, and when they face congestion they do multiplicative decrease, because they have to back off very quickly. This is good for fairness and congestion control, but it's slow. The reason they had to be slow is that the congestion information was just one bit, an ECN bit, or just a packet drop: very, very opaque information. Now, with INT, they can do aggressive increase, so they handle congestion very fast, and yet it's still stable and converges very quickly.

Here is the summary; this is actually a chart that we copied from their paper, not ours. When incast or sudden congestion happens, the latency of HPCC is less than 10 microseconds, even when you have three hops. DCQCN is the state-of-the-art RDMA-based congestion control, which is widely used in large data centers such as Microsoft's; it is what is actually shipped in today's smart NICs, like Mellanox NICs and Intel NICs, and they can do at best this kind of congestion control. HPCC reduces this way further. Compared to the other popular congestion control protocols used at Microsoft, Google, and other places, HPCC outperforms all of them. That's the beauty of programmability given to the end users, and powerful end users.
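As a rough illustration of the MIMD-versus-AIMD point, here is a simplified Python sketch; it is not the exact update rule from the HPCC paper, and the link model and constants are assumptions. With only a one-bit congestion signal a sender can merely creep up additively, whereas a precise INT-reported utilization lets it scale its window multiplicatively toward a target utilization.

```python
LINK_CAPACITY = 100.0   # window units that fully utilize the link (assumed)

def aimd_update(cwnd, congestion_bit, add_step=1.0):
    """Classic AIMD: one opaque bit -> cautious +1 per RTT, halve on congestion."""
    return cwnd / 2.0 if congestion_bit else cwnd + add_step

def mimd_update(cwnd, measured_util, target_util=0.95):
    """MIMD driven by a precise INT utilization signal: scale the window by how
    far the measured utilization is from the target, so both increase and
    decrease are multiplicative."""
    return cwnd * (target_util / max(measured_util, 0.01))

cwnd_aimd = cwnd_mimd = 5.0      # both senders start on an almost idle link
for rtt in range(6):
    util_aimd = min(1.0, cwnd_aimd / LINK_CAPACITY)
    util_mimd = min(1.0, cwnd_mimd / LINK_CAPACITY)
    print(f"RTT {rtt}: AIMD util={util_aimd:4.2f}  MIMD util={util_mimd:4.2f}")
    cwnd_aimd = aimd_update(cwnd_aimd, congestion_bit=(util_aimd >= 1.0))
    cwnd_mimd = mimd_update(cwnd_mimd, measured_util=util_mimd)
# From here AIMD needs on the order of 100 RTTs to fill the link;
# the MIMD sender reaches ~95% utilization within a couple of RTTs.
```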
The last one is accelerating training. Training is a big deal these days, with GPUs and TPUs and so on. One would think that training is done just once and then you keep reusing that model for inference. That's not true: every single hour, every minute, they actually rerun training, because the goals change, the models change, and the input data keeps changing. So being able to run a lot of training jobs in parallel, and very fast, is very important for all these service providers. The interesting thing is that GPUs and TPUs are getting a lot more powerful in a short period of time, and therefore the total amount of time it takes to run, say, one famous benchmark training set gets reduced over a few years, but this gap, which is basically networking overhead, doesn't go down. GPUs are just way faster, so you actually end up spending most of your training time doing communication. That's what we're observing, and that's the problem we're trying to address here.

In distributed deep neural network training, each worker, each GPU or TPU, runs the model, and they have to periodically exchange all these parameters. In the old days, people introduced x86-based parameter servers, and all the parameters were exchanged through a parameter server. People quickly realized that x86 is not a great I/O machine; it cannot keep up with the rate of GPUs. So they switched to more peer-to-peer, GPU- or TPU-based schemes, where a ring or hypercube is built and model updates are exchanged that way, but this introduces latency, because it always takes on the order of n or log n steps. What we're observing here is that this intensive all-to-all communication between the workers, especially when the number of workers is growing, is very heavy, and faster GPUs make this problem even more skewed. If only we could accelerate this kind of parameter exchange, it would be quite useful.

Essentially, this is a cartoon visualizing what we have done, using Tofino as a very fast parameter server. Remember, Tofino can handle five billion packets a second with guaranteed processing latency; that's like 50 to almost 100 x86 servers' worth of I/O capacity. So Tofino reserves some space in its own SRAM, admits some weight updates from one of these servers, or workers, and collects the weights from the other workers. Then it folds them; by folding I mean it simply computes the sum and generates the average, sends the results back, and then those spaces are made available again so it can admit the next set of weights, and this continues. It's a very simple, high-speed reduction mechanism done on Tofino.

When we use this kind of parameter server and measure the actual time it takes to train some models to the same accuracy target, the performance gain is somewhere between 1.5x and 3x, and this is consistent across the network speeds we tested. As you can imagine, we're turning an order-n-squared problem into order n, or, from each individual worker's point of view, into a constant: each worker sends its own weights and somehow, magically, receives the aggregated weights. That means our system scales linearly, whereas in existing systems, when you increase the number of workers, the performance actually goes down. What this means is that this is even more future-proof: when you have to deal with even larger models, and hence when you need to introduce more workers, the benefit will be even more pronounced. So that's the summary.
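Here is a toy Python sketch of that reduction mechanism; slot counts, chunk sizes, and the averaging step are illustrative assumptions, and the real thing runs in Tofino's SRAM at line rate. The switch holds a small pool of aggregation slots, folds (sums) the matching chunk from every worker, broadcasts one averaged chunk, and frees the slot, so each worker sends and receives a constant amount of data per chunk instead of exchanging with every peer.

```python
class SwitchAggregator:
    """In-network 'parameter server': sum per-chunk contributions in on-chip
    slots, emit the average once all workers have contributed, free the slot."""
    def __init__(self, num_workers, num_slots=4):
        self.num_workers = num_workers
        self.num_slots = num_slots
        self.slots = {}                  # chunk_id -> [running_sum, contributions]

    def ingest(self, chunk_id, weights):
        if chunk_id not in self.slots:
            if len(self.slots) >= self.num_slots:
                return None              # no free SRAM slot: sender retries later
            self.slots[chunk_id] = [[0.0] * len(weights), 0]
        running_sum, count = self.slots[chunk_id]
        for i, w in enumerate(weights):  # fold: element-wise addition
            running_sum[i] += w
        self.slots[chunk_id][1] = count + 1
        if self.slots[chunk_id][1] == self.num_workers:
            del self.slots[chunk_id]     # slot freed for the next set of weights
            return [s / self.num_workers for s in running_sum]  # broadcast average
        return None

# Three hypothetical workers each contribute one 4-value gradient chunk.
switch = SwitchAggregator(num_workers=3)
updates = [[1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0], [3.0, 0.0, 1.0, 0.0]]
for worker_weights in updates:
    result = switch.ingest(chunk_id=0, weights=worker_weights)
print("aggregated chunk broadcast to all workers:", result)  # [2.0, 1.33..., 2.0, 2.0]
```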
So this is my last slide. Now we have programmable, high-speed machines without penalty. If you're an end user, sometimes you may want to do the programming on your own; that's if you're a hyperscaler and have your own development muscle. If you're not, you can just go to your favorite OEMs and tell them, hey, I need these kinds of new features. So far, equipment vendors couldn't send you just a software upgrade; now they can. New forwarding features take weeks, maybe a quarter or two, for them to develop, and by then you don't have to figure out a hack, because you can cleanly solve your problems in your network, and you don't need to do a hardware upgrade, which is usually expensive. You can keep using the same hardware and still enjoy new features. That concludes my talk.

Audience: One of the things I see, as I look at all this, and again I go back to: it's a cool programming language, and all the problems that can come with that. Looking at the load balancing, and the telemetry, and the INT information and tagging, and all the features to add on there, how many simultaneous functions or commands could we look at? Do I have to pick one feature or the other, or is it truly a program where I can give it a series of steps to go through? How much flexibility is really there, or am I in a constrained space? You've got one job, and are you going to do it this way or that way, and which way does this allow me to go?

Barefoot: Well, the nice thing about programmability is that you can choose the simplest way you want to start, for example congestion control.

Audience: But if I want to do load balancing and congestion control and telemetry, where's the limit? It's kind of like all the vendors give me all these cool features and then say you can do all of this, just not at the same time.

Barefoot: In the first generation there was a limit to that capacity in Tofino; with the newer series we added almost double the resources, so you can now do a lot more. Ultimately you're always going to be limited by the amount of memory and engines on the device, but you don't necessarily have to be doing AI computation as well as layer-4 load balancing as well as telemetry all at the same time. The one thing we want to prove out is that you can do everything you can do today, plus typically that one additional thing that is important in that place in the network. Through the power of just loading a different program you can repurpose that one top-of-rack switch: if it's in an AI training cluster, you can have it do the machine learning example that James talked about; if it's in a load-balancing cluster, or where you're doing scale-out storage or whatever else, you can do that.

Audience: Well, I can always envision customers who are going to be like, oh yeah, but we need that, and that, and that, in the same rack, in the exact same switch, perhaps.
Info
Channel: Tech Field Day
Views: 658
Rating: 5 out of 5
Keywords: Tech Field Day, Networking Field Day, Networking Field Day 21, NFD21, Barefoot, Barefoot Networks, Tofino, Intel, P4, Chang Kim, ASIC
Id: -TY8qdNjc2U
Length: 21min 59sec (1319 seconds)
Published: Wed Oct 02 2019