Barefoot Networks Data Plane Telemetry and Deep Insight

Video Statistics and Information

Captions
All right, so on to the next presentation. I'm Roberto Mari. I'm part of the product management team at Barefoot Networks, and I'm responsible for advanced applications, which are basically all the interesting things you can do thanks to the programmability of Tofino and our platform. All the things that Ed and Arkadiy discussed, all that flexibility, is not just an abstract thing; it has a purpose, and I'm focusing on some of these applications. In the last couple of years I was particularly busy with telemetry, which was one of the initial use cases we had here at Barefoot Networks and one that our customers embraced. So I'm going to cover telemetry and analytics as part of this presentation.

Before we jump into analytics and network visibility, we'll talk a little bit about the foundational technology behind this telemetry, which is called INT, or In-band Network Telemetry. Before we go there: this is a scene from one of my favorite movies, Indiana Jones and the Last Crusade. Here you see Indiana Jones stepping onto an invisible bridge over a precipice, and he has to take what is called a leap of faith. He has to walk into the unknown, just having faith that he will not fall into the precipice. So I like to ask this question: would you operate your networks on a leap of faith? Would you just go and deploy a critical business application on these networks without having any visibility into what's happening in your network? The answer I get back is always, "Of course not. I need my management tools, my operational tools. I need visibility into what I'm doing, and I need visibility from the network on what I have implemented so far."

But guess what, that's pretty much what's happening today. Network monitoring and visibility use technology from like 20 years ago. Instrumentation in the network is fundamentally based on probes and on NetFlow, which is sampled, so the visibility you get back from that is
pretty limited. You use ping and traceroute, and you can't capture, especially if you look at modern applications, and especially for us working with the cloud providers that have very performant applications, you can't detect things like microbursts, jitter, delays, and any networking event that can cause microscopic issues that can still be very damaging to your applications. So troubleshooting and analysis is pretty limited, which brings down the visibility, and with no visibility you have no control over what's happening in your network. That's basically where you are at this point. While the application people typically have instrumentation to diagnose issues in their applications, the network guys don't have the same.

So what is the approach we take to solve this problem? We looked at this problem and at what we can provide thanks to the programmability of our platform, and we thought we could design a system, a solution, that helps answer what we call the ground-truth questions of visibility for every packet. So we move from indirect measurements that you do from the control plane, or sampled measurements you do on your data plane using probes or NetFlow, to packet-by-packet measurements, and we are able to answer these four questions. How did this packet get here, so which path did this packet take in the network? And why did it get here, so which forwarding rules were enforced in the data plane, in terms of routing, switching, and other things, so that this packet got here? The last two questions pertain more to performance considerations. Are my applications impacted by latency or congestion? And if they are, what is happening over there, what's the overall dynamic of that congestion, so who did that packet share the queue with when the queue was congested? These are important things that we really believe are going to answer the basic questions about what's
happening in my network, so you don't have to take that leap of faith.

We call that solution SPRINT. INT is the foundational data plane capability; it's called In-band Network Telemetry. It is a scalable and programmable data plane framework to observe packets, collect information about these packets, and report this information back to an analytics engine. We are at INT 2.0, and this snapshot is from the INT specifications, which are open. If you go to p4.org you can download the specs and find out more about INT.

There are two main pillars or domains for INT. You have visibility that collects information based on probes; you see this set of icons in orange on the right side of the slide. Probes fundamentally model your applications, are injected into the network, and represent an emulation of what that application does. This is one aspect of INT telemetry which we don't really believe in; we believe more in making measurements on the real traffic and real applications. So what you see on the left side is the category of INT that applies to real traffic, and there you have three different flavors. There is one that makes observations on a packet-by-packet basis and doesn't modify the packet, so the information about the packet is reported by the device that does the observation, without any modification or adding metadata to the packet. And on the right side you have what we call INT embed mode, where you take that metadata and you put it into packets.

You have all these different versions and flavors because telemetry and customer requirements are not all the same, and we want to offer flexibility so customers can deploy this based on the requirements they have. We heard customers saying, "That's a great thing, you can give me packet-by-packet telemetry, but I don't want you to modify my packets. I just don't want this metadata to leak out of the
INT domain. I don't want you to grow the packet size. I don't want to make any adjustment to my network to deal with this new packet format," which can be, for example, VXLAN-based or Geneve or another type of encapsulation. That's why you have the export mode, which does a packet-by-packet observation but generates reports independently from the packet. And that's why you also have the embed mode, because other customers actually like the solution where the metadata is added into packets as the packet goes through the network, and you collect the metadata on the other side, so now you have the history of that packet through the network. So there are different flavors, different types of designs, and they are all there to offer a flexible implementation for INT.

I have two questions for you. First question: how long is it taking to learn the network's behavior, and how often is it reporting on this?

Yeah, absolutely. This is something I'll also cover as part of the use cases. This is a real-time technology, so detecting a network problem happens as the problem occurs in the network. The reason is that the data plane streams out the telemetry without requiring any agent or any CPU intervention, and that information gets directly to the real-time analytics engine, which is, for example, Deep Insight. As for how often, that's another important thing about this telemetry: the telemetry is event-based. You can have billions of packets going by, but if there is no network change, behavior change, or performance change in your network, you are not going to report any information. You only report when you have packet drops, congestion, or anything that you set as an event or a trigger that you want to monitor.

And if you guys are getting this history of the packets, are you marketing this actively as a security tool?

That's actually a great question. We had a lot of use cases around security, because now that you have
this deep visibility, you can get early information about a potential DDoS attack or volumetric DoS attack. So while today we are not marketing this for security, because this is mainly a network performance monitoring technology, there are potential use cases where security may become an interesting use case in the future.

Yeah, I'm thinking, if I know where my packets have been before they get to where they're going, I think that's something everyone needs.

What are the fields that you can export metadata on? Is it flexible, for somebody who's purchasing your chipset and deploying it in their box, to choose what they want to pull?

Absolutely. Let me go through my next slide, which talks a little bit about that. Here you have a little bit of INT in action. While these packets are flowing by in export mode, where we do reporting independently on every device, we generate information and telemetry reports as the packets flow through the network. The information that you collect is programmable, and the reason this has to be programmable is that customers may want to observe different things for different applications. For example, I may have applications that are latency sensitive, so I may want to collect information about timestamps and latency across the network. I may have applications that are security bounded, so maybe I have a PCI zone in my data center and I want to prove that the path of this application stays within the boundary of the security zone, so I'm going to collect information such as the switch ID. So this set of metadata is programmable, and we also give controls, from an API standpoint, to choose different sets of metadata for different applications.

Are you triggering other workflows as well, like packet captures, things like that, to happen when you collect that data?

Yes. When we collect the data, that little piece of information you see streamed out from each of the switches is a packet snapshot. It can be as small as just the 5
tuple information plus the metadata, or it can even be the entire packet that is sent out. The other key piece of this technology is that we use a system called the change detector. As I mentioned, you have billions of packets per second in these terabit switches. You don't want to send billions of reports; no technology will scale to ingest that type of data. If you set the telemetry to generate reports on an event basis, now you can really make that scalable, because you're going to report only on anomalies. The data plane can recognize when you're violating a threshold or a baseline, when there is a path change, when there is congestion, and generate reports only in these types of situations. So the technology becomes very scalable, and as a result the analytics solution on the other side can scale pretty well.

So if you're analyzing and visualizing the packets, are you actually doing decodes and getting deep into the packets? What all are you getting out of that?

Great. I'll be giving an example through a quick demo, but the answer is yes, we are getting deeper into the snapshot that we get and collecting the information. We can take information up to the 5-tuple or even further than that. For example, for RDMA or another application we could go beyond the 5-tuple and look inside the packet. If you have VXLAN or Geneve, we can look in the inner header and in the outer header, so you can correlate, for example, overlay with underlay visibility. So it's very powerful in that regard, thanks to the programmability of the platform.

Is it going to be specific to types of traffic? You mentioned VXLAN or standard IP packets. Or can I really do the same things to any type of traffic?

Yes, you can, pretty much. In terms of visibility and the mechanism to report and get the analytics, you can do this on every packet, but the type of correlation may be different. For a VXLAN packet, for
example, I may be interested in finding the correlation between the overlay and the underlay, so I may want to see which VTEP this packet is coming from and which application is actually communicating over that VTEP, and keep track of that.

So we have this very flexible INT, which is based on real-time traffic and real packets and not probes, and we can do both an embed mode, where we embed information into packets, and an export mode. We can track flows for path and latency with that, but we also have additional things that we can provide as part of the telemetry. One is packet drop reporting, and when we say packet drop reporting, it is not just saying there was a packet drop, but providing more contextual information: that packet was dropped for this reason, and this is where the packet was dropped, in this switch, in this queue, on this interface. Then an analytics engine can take this rich metadata and correlate that packet drop with other information, for example congestion reports.

Congestion reports are pretty interesting. The way I like to explain them: imagine you have your security system at home. When nothing is happening, your camera is not recording. But as soon as you detect some activity, in this case a congestion build-up, an intelligent data plane like the Tofino data plane can start taking snapshots of that queue and record them. Imagine your security camera turns on, gets like 30 seconds of video recording, and streams that information out to an analytics engine. So again, the magic of these programmable data planes rests on the fact that they are flexible and provide information only for events that are considered anomalies, which makes the whole solution work.

I have a question on the data telemetry side. I work a lot in the service provider space, and as the service providers move towards Metro Ethernet 3.0, MEF 3.0, data telemetry has become a big deal in Metro Ethernet networks for the analysis of performance,
especially on layer 2 networks. Are you working with anybody in the Metro Ethernet space? Has that been brought up for P4 as an application?

We work with a number of vendors that have products in that space, and they definitely brought it up as one of the use cases. In Metro Ethernet you may want to track a congestion point at the access, because that's where things typically happen, and a lot of these Metro Ethernet networks are big layer 2 domains as well, because they are still based on the old type of designs. So congestion and microbursts are likely to happen, and this type of telemetry is definitely valuable for them, yes.

So let's talk a little bit more. INT is the foundational technology, and we added additional controls over In-band Network Telemetry to make it production worthy. We already talked about a few of those, but I'd like to spend a couple of minutes on them. While you can turn this on for every packet in your network and observe every flow, it is important to give control over which specific applications you want to instrument. So you have a control called the watchlist, which on a device is like a sophisticated ACL. You can fundamentally say, "I'd like to monitor this application." Imagine your ACL, instead of saying drop, forward, or mirror the packet, says, "Do INT instrumentation." That is something you can control at a per-switch level, and it gives you the ability to control which applications are being monitored.

Second, and I think somebody asked this question, is what I want to collect. On a per-flow basis, on a per-application basis, you can determine what you may want to collect or instrument. Again, different applications have different requirements. Some of them are more latency bound, like RDMA, for example; some of them are more sensitive to packet drops. So you can enable and turn on this feature using this fine-grained control. Intelligent
triggers are very important. It's very important to say, "I'd like you to report on this condition," and then an intelligent data plane, thanks to the flexibility of Tofino and P4, can generate information when that trigger is activated. So you now have event-based telemetry, rather than telemetry on every packet or telemetry based on probes, which in the first case is not scalable and in the latter case is not accurate.

Last but not least, these intelligent data planes can do intelligent load balancing. If you have a scale-out implementation of an analytics solution on the other side, you can direct the observations for different flows and applications and steer them to different ingestion engines. So if you have hundreds of thousands of flows, or several millions of flows, you can still make the solution scale.

The other component of that control is the programmability. As both Ed and Arkadiy pointed out, programmability helps build the glue when you want to take that solution and distribute it to different places in the network. A great use case for this telemetry is in the network, but it's also at the edge of the network, where you want to get more visibility. You want to stop that application or performance problem at the ingress of the network rather than bringing it into the network. So it's very important, and we are actually looking, as Intel, at how we can port some of this technology to some of the NICs and SmartNICs that we are developing. There are also use cases where you can bring this to things like virtual switches and data planes that are running on bare metal, like XDP or eBPF. This is where the portability of P4 comes in; P4 becomes the vehicle to push this pretty much everywhere. Imagine P4 as being your RFC that describes how INT should be done. We have open-sourced an implementation of our INT in P4 at p4.org, so every vendor can go and grab
it, do the same work, and have a consistent implementation.

The last characteristic and control is real time, which is very important. Somebody was asking how soon I can detect an issue: as soon as it happens, because we stream that information directly from the data plane into the analytics engine. There is no CPU in the middle, there is no agent required. That is fundamentally a key characteristic if you want to implement corrections to these problems, and it becomes very interesting if you want to build closed-loop solutions. So that is how S, P, R becomes SPRINT: the scalable, smart, programmable, real-time INT. That's what we market as unique and differentiating on Tofino.

So let's shift gears and move on to Deep Insight. Deep Insight is the network monitoring solution that, on the other side, collects this INT telemetry and builds intelligent analytics. It's a software application that runs outside the switches. It runs, in fact, on commodity servers, and of course it can run on Intel Architecture. It's an application that is containerized and very modular, and there are different deployment models: you can implement it on a single server, on multiple servers, or on a Kubernetes cluster.

We have two interfaces for Deep Insight. You have a southbound interface that gets the data from the network. These are the INT reports that we can generate hop by hop via export mode, or that we can generate at the sink, the destination point of that INT session, through embed mode. As long as that information is consistent with the open specification and you have consistent semantics for the metadata, we can process it and analyze and visualize the analytics about that specific network. Deep Insight uses that information to build dashboards and intelligent analytics; we'll cover that in the demo. We also have a northbound interface that provides the same information to third-party network management tools. I like to present this to
customers by saying we are not just giving you another network management tool. You have plenty, you have enough, and we don't want to give you another one. We'll fundamentally give you a magnifying glass, so you can augment the visibility of your existing management tools. You can very well keep your network management tool here, or build your own dashboards using this northbound API. We have two models for the API. We have a RESTful interface, so everything that is visualized through the native UI of Deep Insight can also be gathered through the API. And we have a streaming API, so all the information that is detected can be pushed to the northbound API. Imagine a pub/sub type of channel: you subscribe for an anomaly, and you get a report as soon as that anomaly happens.

We have a few customers who are trying to implement micro-segmentation and not having a fun time at it. This seems like overkill, but can this help you out with application dependency mapping, as far as figuring out how you want to segment your network and separate your applications?

Yeah, absolutely. And to give you a little bit of context, I used to be at VMware, where I was a product manager for NSX, so I know exactly the problem. This application provides path information, so by discovering the path and overlaying the path of the flow over the topology, you can use that information to find the application dependency mapping. Especially if you now have INT in something like OVS, you can discover not just what happens in the underlay but what potentially happens in the overlay, and find the correlation between the underlay and the overlay thanks to INT. So yeah, absolutely, that's possible.

Thank you.

So, some of the use cases for the telemetry we are talking about. The main one is network troubleshooting. Some of these problems are really hard to find. They leave clues in the network. When I was working with some of the
Metro Ethernet customers, and they had these big layer 2 domains, when you have a spanning tree loop, good luck finding the root cause of that spanning tree loop.

Yeah, I've found that over 500 miles.

Exactly. Try to stop the car at 100 miles per hour and restart the engine; it's very difficult. The way our customers solve this problem today is they throw a lot of people and tools at the problem, and as a result the cost of operating that network just grows. So one of the first use cases is just to reduce the time of the root cause analysis of the problem, and I'll show you in the demo that with a few clicks we can actually find out who is responsible for that congestion or that drop event, for example. Other customers use this for continuous monitoring, because the application has a specific SLA and I want to monitor for the SLA and get a report. Remember, with event-driven telemetry you get a report as soon as that threshold has been violated. Some are starting to use this to implement feedback loops, because now that you have very fine-grained telemetry, you can use that information in real time to correct the problem before it starts creating major outages or issues for your application. I'm going to focus on this use case during the demo.

Just an interesting question on your network troubleshooting idea, picking on the spanning tree one, where we all know how much fun that is when nothing goes anywhere. What I think I heard you say previously was that the telemetry data, sent out via API or whatever, is injected into the data plane. So if it's going into the same data plane that is having the problem, how do we expect to receive that information to be able to figure those things out?

That's a great question. One way is to inject that metadata into packets as these packets go through the network. The other method is having a kind of out-of-band channel where each device
fundamentally streams out the telemetry. One of the things that is important to understand is that the amount of telemetry we are talking about here is negligible compared to the actual traffic. You can pick, for example, a front-panel port of the switch and use that port to stream out the telemetry, so you can separate the actual traffic from the telemetry traffic.

Okay, so we have the ability to specify exactly where it goes?

Similar to things like ERSPAN, you can separate this on a VLAN, on a VRF, on a specific routing domain. And anyway, the amount of telemetry that you generate, because it's based on packet snapshots, is definitely smaller than the overall traffic, so you're not going to kill your network with telemetry reports.

The ecosystem is also very important, and so is having programmable switches. I think Arkadiy covered some of the OEMs that have a Tofino switch. Cisco was the first OEM to enable INT in their production switches, so you can go and enable INT using NX-OS in a Cisco network, and I'll cover that. We have other OEMs and ODMs. The INT solution is fully integrated into white boxes and also integrated into SONiC via SAI, so if a cloud customer wants to consume this, and we have customers deploying this into production, they'll be able to do so. We're also working on other targets like NICs, and we're working with other virtual data planes like OVS and bare metal servers. So we are expanding that ecosystem as we go.

Some of the key features; this is more of a reference slide. We can do a bunch of things in DI. We can leverage the power of INT to do different types of analysis. The solution is topology-aware, so as long as you enable things like LLDP and you have SNMP, we can go and discover the connectivity matrix, the connectivity between the different switches, and if that connectivity evolves, Deep Insight adapts. We have native dashboards in the UI, but we also offer a
northbound interface to integrate this with other solutions, and I'm going to focus on this particular piece in the demo. We can also do data retention, so you can use this almost like a TiVo: go back in time and find out how your network was behaving, let's say, two or three days ago, because we store historical data in the system.

How long are you keeping that data for, and how are you storing it as well? Where is it stored?

We store it locally on the server. Today we require an SSD on the server. In the current deployments we support up to three days of data in the bigger scale-out implementation, but it's really up to how much storage you have available on the server. In the future we are also supporting a model where you can export snapshots: if you go over the three days and you want to keep that information, we can take a snapshot of the data store and store it on another device, for example an offline storage device or a remote storage device.

When you guys are tying into other network management systems, does Deep Insight have the ability to say, "An anomaly is detected, go ahead and block traffic," or make any changes through a third-party management tool?

Yeah, we don't do provisioning and we don't make changes, but in the demo I'm going to cover that specific use case. We use a third-party system to implement that change. We provide the information so that a controller can implement the change.

So it can be automated?

That's exactly the specific use case I'm going to demo here. We have a closed-loop solution. Imagine you have a controller that provisions your network. One of the first use cases for telemetry is: can I verify that the design was properly realized in the data plane? So use telemetry for that, or just do continuous monitoring through INT and verify whether at any point in time there is an issue that deviates from my model and that I need to correct. Then use a closed-loop solution so that now,
thanks to the streaming APIs from Deep Insight and the SDN controller, I can go back and fix the problem before it causes issues. So this is an example of the closed-loop solution that many customers are trying to implement today.

Some other use cases. Today we do network performance monitoring, but somebody asked about the other use cases. Security is something we have on the horizon. Application monitoring will be enabled once this is pushed onto more types of devices, like the NICs and the virtual switches. And RDMA, which Chang will cover during his talk, is becoming another interesting use case. RDMA is very sensitive to congestion and packet drops, and INT can be used to monitor and adjust the congestion, to do congestion control there.

This is kind of a takeaway slide. From a DI standpoint, it's scale-out network analytics software that we productize and sell to customers separately from the switches. Some customers use it to augment the capability of their existing network monitoring solution. It is often used for network troubleshooting, it is a very accurate DVR for your network, and it can enable some of these feedback-loop types of solutions.

As far as the telemetry that you're receiving, this isn't necessarily just network insight? You're including the NIC in this as well, so you're seeing end to end, from the host to wherever the flow goes?

That's the target. Today it's not part of this demonstration, but that's the target. We have done a number of integrations with different vendors, and of course, now that we are part of Intel, as a strategy and as a roadmap we will push this telemetry onto these devices as well.

Yeah, because expanding out to being able to say, "This packet came from this NIC on this host, to ingress to this network and beyond," would be...

Absolutely.

So do you have a particular speed of NIC that you're targeting, or is it all of the above?

Do you want to take that?

Yeah, I mean, we've demoed this with everything
from 10 gig to 100 gig NICs. The technology is really agnostic. I think we would want to bring this to all NICs, all server-side NICs.

So the technology is agnostic, but the cost is not necessarily so. While 10 gig and up may make sense, bringing it down to a one gig part, while useful, you may look at it and go, "It's really too expensive to get into that commodity."

Yeah, I mean, as we talked about earlier, I think the path is that all NICs, all capabilities, all switches are getting more and more intelligent over time, so I would see it, in a not too distant future, applying even to 10 gig NICs. And it's a good segue to the conversation we'll have here; I'm sure our compatriots from the Intel Ethernet NIC team will talk about some great new capabilities coming to new NICs as well.

Cool. Great. Before you dive into the demo, just one quick question. Gap spaces in your network, where you don't have Tofino and you don't have the capability to see: what are you doing to identify those, and how can you bring them back in?

That's a great question. If you use especially the export mode, the hop-by-hop INT, that's a great way to start putting this technology in the places of your network where it really matters. Typically customers are starting to deploy this at the edge of the network or at the top of the rack, where they need that visibility.

So they provide sort of ingress and egress point locations. I'm assuming you're saying the majority of traffic passes through these interfaces, and then you provide some topology data in order to get some insight about what's happening internally. And I guess then you could always isolate, ultimately, that it's not happening in this part of the network, or this isn't the source, or this isn't what's
receiving it, and so you could always start to slowly deploy this, at least on certain clusters or new deployments. And just out of curiosity, have you worked with any of the other folks out there in this sort of fabric land, to talk about, "Hey, maybe we can use you guys to stream into our platform and aggregate it into a much larger collection system, to get the insight that you guys are providing on this side"?

Yeah, absolutely. As I mentioned, that's why we have the northbound API streaming interface. We want to enable this technology to be consumed by other solutions. We have been talking with a number of vendors in the industry that see this as an additional type of information they want to ingest in their bigger...

You don't care if you're in the path, you're just getting a feed, right?

Exactly, yeah.

Okay, great. So let's move on and do a quick run through the demo. This is the closed-loop demo use case. As you can see, this is a picture of the topology. I'm really happy to show here a scenario where we have different vendors and different platforms working together. We have a Wedge 100 running SONiC that has the INT solution, we have a Cisco Nexus 3464 as a spine, and we have two Cisco top-of-rack switches, 34180, all running INT export mode, hop by hop, and they all work together to provide that type of solution. So as long as you adhere to the specifications, which are open, and thanks to the fact that you have P4, which is now ported to all these platforms, you can get this ubiquitous telemetry deployed everywhere.
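The event-based reporting described in the talk (report only on a path change, a threshold violation, or a drop, rather than on every packet) can be illustrated with a minimal Python sketch. This is not Barefoot's change detector or any Deep Insight API; the class names, the report reasons, and the 100-microsecond threshold are all assumptions made up for the illustration, and a real implementation would run in the switch data plane rather than in software.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical 5-tuple: (src IP, dst IP, protocol, src port, dst port).
FlowKey = Tuple[str, str, int, int, int]


@dataclass(frozen=True)
class Observation:
    """One per-packet observation (illustrative fields only)."""
    flow: FlowKey
    path: Tuple[int, ...]  # switch IDs the packet traversed
    latency_us: int
    dropped: bool = False


@dataclass
class Report:
    """A telemetry report emitted only when something changed."""
    flow: FlowKey
    reason: str
    observation: Observation


class ChangeDetector:
    """Toy event filter: remembers the last state per flow and
    reports only on new flows, path changes, latency threshold
    crossings, and drops."""

    def __init__(self, latency_threshold_us: int) -> None:
        self.latency_threshold_us = latency_threshold_us
        # flow -> (last path, was latency over the threshold?)
        self._last: Dict[FlowKey, Tuple[Tuple[int, ...], bool]] = {}

    def observe(self, obs: Observation) -> Optional[Report]:
        if obs.dropped:
            # Drops are always reported in this sketch.
            return Report(obs.flow, "drop", obs)
        over = obs.latency_us > self.latency_threshold_us
        previous = self._last.get(obs.flow)
        self._last[obs.flow] = (obs.path, over)
        if previous is None:
            return Report(obs.flow, "new_flow", obs)
        prev_path, prev_over = previous
        if obs.path != prev_path:
            return Report(obs.flow, "path_change", obs)
        if over != prev_over:
            reason = "latency_high" if over else "latency_recovered"
            return Report(obs.flow, reason, obs)
        return None  # nothing changed: stay silent


def run_example() -> List[str]:
    flow: FlowKey = ("10.0.0.1", "10.0.0.2", 6, 40000, 443)
    detector = ChangeDetector(latency_threshold_us=100)
    stream = [
        Observation(flow, (1, 2, 3), 20),
        Observation(flow, (1, 2, 3), 25),
        Observation(flow, (1, 2, 3), 30),
        Observation(flow, (1, 4, 3), 30),
        Observation(flow, (1, 4, 3), 250),
        Observation(flow, (1, 4, 3), 260),
        Observation(flow, (1, 4, 3), 40),
        Observation(flow, (1, 4, 3), 40, dropped=True),
    ]
    reports = [detector.observe(obs) for obs in stream]
    return [r.reason for r in reports if r is not None]


if __name__ == "__main__":
    print(run_example())
```

In this toy run, eight observations produce five reports; the packets that match the previously seen state generate nothing, which is the property the talk relies on to keep the collector side scalable.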
Info
Channel: Tech Field Day
Views: 400
Rating: 5 out of 5
Keywords: Tech Field Day, Networking Field Day, Networking Field Day 21, NFD21, Barefoot, Barefoot Networks, Tofino, Intel, P4, Roberto Mari, Telemetry
Id: OQFs764MKds
Length: 38min 38sec (2318 seconds)
Published: Wed Oct 02 2019