Ensuring Customer Experiences on the Wired Network with Juniper Mist

Captions
And thank you, everybody, for attending today's session. We are very excited with today's launch, and even more so with the way we are really changing the wired networking access layer with what AIOps is bringing to the table. Along with me I also have [inaudible]. I'll be presenting, and we'll go a little bit deeper into how Marvis has helped us solve the difficult problem first, the difficult problem being day-two operations: how do we assure and ensure a good experience for every wired client?

So he took you through the journey of Juniper Mist, where we began our foundation in the wireless network, and after the acquisition by Juniper we were also able to bring the AIOps framework to the wired network. I'm very excited to be here today to talk to you and showcase how that has been helping a lot of our customers across different verticals.

So, just as Bob said, on our self-driving network journey telemetry is key. Good data is key to making good decisions. Our focus the last couple of years has been getting the right telemetry information from the wired switches. This works both for third-party switches and for our Juniper stack. Obviously, from a Juniper perspective we get a lot more telemetry information given the comprehensive capabilities of Junos, and that enables us to deliver what we call AI primitives. I'll be talking to you in the follow-on slides about this whole service level experience framework that we have enabled with this extensive telemetry coming into the Mist cloud from the Juniper EX switches.

So what is this service level experience? We in the industry have always been very familiar with SLAs, right? SLAs for the network, SLAs for a WAN link, SLAs for an application. But Mist was the first in the industry, back when we started our journey in 2016, to come out with this framework of user experience. Everything that we do, everything that we leverage machine learning for, is focused on measuring the user experience and then quantifying
identifying, and in some cases proactively even recommending, what is impacting that user experience. That's what you saw Sudheer presenting on the wireless side, with some of the Marvis Actions on wireless and some of the Marvis Actions on WAN, and now we'll go a little bit deeper on the wired side.

Typically we think of wired access as simple: a gig port is a gig port, a 10-gig port is a 10-gig port, life is good. What we found when we enabled this visibility two years ago, and have it running in a lot of large enterprises in retail, in healthcare, in higher ed, is that our customers, our network operators, our network admins, even the help desk, are finding issues that they were previously unaware of and that were actually impacting end-user experiences. These end users could be employees such as us, with our laptops and our workstations. They could also be headless devices: think of your digital signage in retail, your server machines, or your audio-visual equipment in the enterprise, in your EBCs. All of these devices are connecting to our network either wirelessly or wired, and with the foray of IoT we're also seeing a lot more cameras come onto the network.

This is where Wired Assurance comes into play: how exactly are we measuring the user experience for every wired client connecting to the network, both from a pre-connection perspective and from a post-connection perspective? For pre-connection, we do anomaly analysis to see if there are any changes in behavior from the baseline based on DHCP and authentication issues. You saw Sudheer talk about persistently failing clients on the wireless side; as part of this exciting launch we're doing today, we do the same thing on the wired side. So again, the power of full stack, the power of telemetry, enables us to deliver actionable insights to the network admin to ensure that the user experience is always optimal. From a post-connection experience perspective, we look at
congestion: congestion downstream and congestion upstream. A lot of our customers look at capacity planning on the wired network, thinking, "Today I have a one-gig uplink. Should I be going to a 10-gig uplink? In some cases, should I be going to a 100-gig uplink?" when you look at higher ed, for example. The telemetry information that we process, and then apply the baseline to, helps answer the question: is the network out of capacity?

So, enough said about the technology. Let's see exactly how this is helping some of our customers in real life. This is a use case where Wired Assurance really shined for our customer. We are seeing a trend with a lot of the wired infrastructure also now moving towards a more dot1x (802.1X) back-end framework. This transition, as we know, is complex, and it's scary, because you're moving your entire network from your legacy frameworks to a dot1x back end, and not all endpoint devices are capable of it. So it requires a lot of automation to ensure that the migration is seamless.

What happened here with this large retailer, who has deployed in thousands of locations and has more than a thousand switches deployed, is that our AI primitives, in the form of service level expectations, were able to catch an issue across all of their thousand-plus locations, specifically at one particular location. It was not a switch health problem, it was not a throughput problem; there was a successful connect problem. On that particular site, it turns out that as part of the automation that was done, the switch was missed when being added as a NAS client on the RADIUS server, and that led to a huge blip on the network where the clients connecting on that wired switch were not able to authenticate. So with the insights that we were able to provide for this retailer, with distributed enterprises across the U.S. and Canada, they were able to very quickly find that needle in the haystack, find the
actual issue impacting a particular site proactively, and take action. Within 24 hours they were able to triage the situation and escalate to the InfoSec team to get that switch added as a NAS client on the RADIUS server, and all was well. This is how the successful connects SLE helped our customer find that one store out of thousands, one switch out of thousands, and fix the issue.

Now let's get to post-connection. How do we measure user experiences on the wired network from a post-connection perspective? This is where we look at the throughput SLE. The throughput SLE here is measuring the throughput from every port on every switch sending telemetry data to the Mist cloud. So again, it's not about switch health, it's not about network uptime; it's about actually measuring the impact to end users because of a throughput issue. The throughput issue could be because of congestion downstream, it could be because of congestion on the uplink, it could be because of network latency and jitter, and it could also be because of interface anomalies on certain ports. The reasons why an issue could be impacting an end-user experience are many, right? However, being able to sift through all the data, all the logs, all the trends, and very quickly pinpoint what exactly is impacting the end user's post-connection experience is what this throughput SLE is all about. And this is where we're able to correlate: if it is a wired-wireless stack, we are able to correlate that end user across the wireless into the wired; if it's a wired-only stack, we can do the same thing for all the devices connecting to the switches, for every port, and then measure in user minutes whether the experience is positive or negative. So when you see that throughput is 98% success, that success is measured in users actually getting the level of throughput that the network was designed for. That two percent failure, though, is what we're interested in, and that two
percent failure is what we are then able to identify the root cause of, what's causing that failure, as well as what the scope is: which switches, which VLANs, which interfaces on that particular switch, and more importantly which clients. It's always about client to cloud; it's always about measuring the end-user experience on the wired network.

Now let's see how this was actually helping some of our customers when we were able to do this post-connection analysis for every port sending telemetry data to the Mist cloud. This is a large enterprise. Storm control is typically a good thing; however, when misconfigured it can impact the wired clients. In this case the users were always complaining: "My video experience in the executive briefing center in this enterprise is very choppy." And the network IT team was not able to quite figure out why, because they had done all the right configurations, at least from what they had seen. However, Wired Assurance, when enabled on the switches and just getting telemetry information from all the switches, was very quickly able to identify that a handful of clients were consistently facing a throughput issue. The switch events, as proof, confirmed that storm control was in effect frequently, and basically what was happening was that the storm control configuration for multicast had a low threshold. Again, a needle in a haystack: we're talking about 15,000 ports at this one location serving this enterprise customer, and an impact on maybe just a few clients because of a misconfigured storm control policy. Not only did Wired Assurance help find the problem proactively and let them know exactly what it was, but the end result was that once the configuration was corrected, the end-user experience was restored, and that network is now humming. This is, again, not just visibility into data and trended logs, but actually telling the network admin what the root cause is, who it's impacting, and the substantiating evidence to say this is
exactly why the end-user experience is being impacted.

Let's now go to higher ed. A very, very interesting phenomenon here, and a recent one, based on what we're seeing with college campuses opening. Students are coming back to campus, there's a rapid increase in client counts, and a lot of universities are seeing a surge in multicast and broadcast traffic. In this particular higher ed, and this is pretty common from what we have seen, they have a large L2 domain configured as part of their network policy, which is fine. However, what we did see is that, as the students started coming on board, the congestion classifier on the throughput SLE started spiking and raising alerts. As evidence of that spike, we also saw the insights from every switch, every port, showing a spike in traffic. What was a baseline of 20 gig daily usage suddenly spiked 10x, to 200, to 300, actually more than that: 15x, to 300 gig. Obviously there was a problem in the network. Again with the help of the SLEs, with the visibility into exactly which ports and which switches were getting impacted, and trying to maintain the current design, we were able to help this customer resolve the issue. We introduced Mist Edge for the centralized data plane so they could maintain their current large L2 domain architecture and prevent any throughput issues from happening with the students coming back to campus. So visibility, yes, but more importantly visibility with a consequent action: with a large L2 domain having a lot of traffic congestion, how can we maintain that architecture, maintain that design, and make sure that our students, or rather every wired client, is getting the throughput it deserves?

Last but not least, let's see how this is helping in retail. Again a very, very interesting incident here. This was a retail customer with thousands of stores; they've deployed Wired Assurance for the Juniper EX stack in
the access layer, and we were seeing these clients connected to those switches getting successfully connected. They were getting an IP address, but we were seeing that they were consistently coming up under the throughput SLE as clients that were failing to get the right level of throughput. When we went into Marvis Actions, we were able to very quickly see that Marvis had proactively identified a negotiation mismatch for that one client among the hundreds of wired clients in this one store. Again, a needle-in-the-haystack problem that Marvis proactively found. More interestingly, this was a headless device: it was digital signage in one corner of the store, so there was no user complaining. However, the device was in distress. It was not able to show the content on the screen because it was not able to get traffic through the switch, all because of a simple misconfiguration, an MTU mismatch on the particular port that the device was plugged into.

And this is, again, where Wired Assurance for the campus, whether it's enterprise, higher ed, retail, or healthcare, becomes even more important in assuring customer experiences, because you have a lot of IoT devices coming on board that are headless. There is no user complaining; there is nobody opening a ticket to say, "Hey, my app is not working." So how do you have a system that can pinpoint and identify all the end users, headless or not, that are having issues, and more importantly what the issue is caused by? This is where AIOps comes to the fore. This is where baselining using machine learning is so helpful to catch anomalies, and then Marvis comes in to say exactly what the root cause is for a user on the wired network.

I have a question. What do you folks actually mean by "actions"? Because these aren't actions that Marvis is taking on its own, right? Is that really
a label that says these are actions that you want an operator to take, or something? I don't know if that's the appropriate word, but maybe I'm asking the wrong question.

That's a fantastic question, Ed. If I take this back to what Bob was presenting as part of the self-driving journey that we are on, a lot of these actions are what we call driver-assist mode: if the action being recommended is outside Marvis's control, for example if it's a RADIUS issue or a DHCP issue, then we will recommend that the action be taken by the network admin. However, if it is an action that's within the domain of Marvis, for example a missing VLAN (and he will talk to you about some of those actions in detail) on a switch that is being controlled by the Juniper Mist cloud, then that VLAN, once the network admin says "Yes, Marvis, I trust you, go and make the change," can be automatically fixed by Marvis itself. The same thing in the case of an MTU mismatch: again, if that switch is under the domain of Marvis and the network admin allows Marvis to make those changes, we can automatically go in for that port, because we know where the issue is. Marvis can make that MTU configuration change because it knows the device plugged in and what it needs, to ensure that the network is back in operation for that particular client. So it is the self-driving piece, where some actions are taken by Marvis if authorized to do so, and some are driver assist, for elements that are beyond the network's control.

Right. Do you distinguish between the two in terms of the display, or where they're labeled within the UI? Or is it all listed under Actions, and you just need to know the difference between what fits in self-driving and what doesn't?

We actually do. In the follow-on slides you'll see us go through a demo of each action, and when I double-click on an action you'll see something called "AI validated." So if an issue is resolved
by the system admin, it will say AI validated because of resolution by the network administrator, versus if Marvis takes care of it itself, it'll say validated by Marvis making that change.

So is there a mechanism for a human to follow up on an automated action to make sure it actually happened and went through like it should?

Yes. Everything that we are showing here, and these are perfect questions for my follow-on slide, is tied to what we call webhooks. For example, if you're using a ticketing platform, right? As much as we like to showcase Marvis as the cup-of-coffee view, we know that a lot of our customers have different elements, and they essentially pull all of that information into the ticketing platform. So what we do here, with each action being shown and the detail for every action saying exactly this switch, this port, this wired client, is that all of that information automatically gets created as a ticket in a platform like ServiceNow. Then, once that issue gets resolved, either because of Marvis taking the action or because of the network admin taking the action, we can also go ahead and close that ticket in the ticketing platform, saying this issue is now resolved, with the same ticket ID that we generated internally within the system. So it's almost like a full feedback loop. Does that answer your question?

I mean, it's great that you can show the issue is closed, but I guess what I'm thinking about is whether there's some way to verify that it was the correct solution.

Yeah, actually, Drew, that is the point. Within Marvis there are two stages: there's a resolution of the issue, and then there's an AI validation of the issue. So you could say, "Hey, I added this VLAN, so there should be no missing VLAN now," but Marvis will come back and say, "But wait, users are still failing on that AP, on that port." So there is a layer of AI validation that conclusively says the problem we were seeing we
no longer see.

Okay, thank you.

And you will see more of this in the follow-on demo as well. So now, another part of Marvis is what we call the conversational interface. If you heard Bob before, our key vision here is that dashboards are dead, and the only way that network administrators and users will interact with a system or a network will be through the conversational interface. This could be your level-one help desk; this could even be your level-three network operators. So what does the Marvis conversational interface do for wired access? Essentially, you're now chatting with Marvis through this new CI that we have developed, with wired access queries specifically. Again, if it's a wired-wireless stack, Marvis has more data and is able to correlate across the wired-wireless stack. If it's wired only, no problem; even there we have enough telemetry information to identify pre-connection and post-connection issues.

Now, in the video I'll be showing you here, and this is the same live demo network that Sudheer was showing you earlier, we are basically talking to Marvis as a ticket comes in. So this is your level-one help desk, and instead of getting switch logs and going into every port to see what the trend of traffic has been, they just go into the Marvis CI and say, "I want to troubleshoot switches at this particular site." Very quickly the CI tells them exactly which switches, asks the user to choose one of those, and says exactly what's impacting it. In this case the issue was more around the network northbound, where we were seeing impact from the WAN side that was causing end-user experience issues. The same CI can now take you to the failure timeline of when exactly that was happening. That same CI enables a workflow for looking at the switch insights for any events that are transpiring on that switch, and through that same CI you're able
to look across thousands of locations and thousands of switches and find a single port or one switch among the many where, because of a network issue upstream, because of latency and therefore jitter, seven ports on that switch, and therefore seven clients, were being impacted. All done through the Marvis CI. So it's a very easy operational workflow enabled with the Marvis CI, pulling in all the telemetry information, hooking into the Mist dashboard and the AI engine back end, and giving a way for a tier-one, sorry, a level-one help desk to interact with the Juniper cloud to solve very complex problems. It's all about reducing the mean time to resolution by proactively finding what's impacting end-user experiences.

We now go into a deeper look at Marvis Actions. So, to your point, Drew and Ed, what you were raising earlier: Marvis Actions is all about proactively finding issues, some that can be resolved by Marvis directly if authorized to do so, and some that are going to require the network administrator's help to resolve. For example, if it is a bad cable, either on a third-party switch or on a Juniper switch, we will require the network admin to call the installer to replace that cable. However, if it's a missing VLAN or a negotiation mismatch, we will be able to call out exactly what port it's on, and then also present whether the issue still exists, whether it has been resolved, and whether it has been validated. Abhi, I would like you to chime in here and talk a little bit about the loop detected and the port flap actions.

Absolutely. Thank you, Sunalini. We all love our loops on the network. There are a few indicators for us to identify loops on the network. With the product we're about to introduce to you right after this, you'll probably never have to see any loops again, but a lot of networks today still continue to run on traditional designs. So there are a few things: if you've enabled spanning
tree and there's still a loop in the network, you will continue to see a lot of topology change notifications. That is one indicator. You will also typically see a huge surge in either broadcast or multicast traffic, because there's probably a storm brewing. In the data science world, bringing two things together and identifying where there might be an issue is what we call spatial correlation. So we use some of these indicators, topology change notifications and a sudden surge in TX and RX of BCMC (broadcast and multicast) traffic, bring them all together, and identify whether there is a loop in the network. And that's, again, Marvis Actions. The reason it is so powerful is that it works across the entire org, and you're able to pinpoint exactly that there is a loop detected among a few switches. Now, we want to go further as well and do more in terms of correlation as we go along, through a system lens.

Now, what exactly do we do for port flaps? Port flaps are, again, devices that are continuously failing, for example some of these HID card readers or other card readers that can't negotiate well. They're just going up and down continuously. One, they're not able to actually serve clients: nobody's able to swipe their cards and get in. Two, it's just a continuous up and down on the network. So we would like you to understand that and probably take action on it. Probably you as an admin want to shut that port down or go configure it accordingly, but that is the intent of bringing port flaps to your attention.

But I think what I personally like is what is showcased here on the slide in front of you right now: bad cables. In the Juniper Networks org we have about 1K switches and 50K ports. We monitor every single port at all points in time, and we were able to identify 22 bad cables.
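Editor's sketch: the loop-detection idea described in this passage, correlating topology change notifications with a broadcast/multicast surge across several switches, might look roughly like the Python below. This is an illustration only, not Marvis's implementation. The `SwitchSample` record, the field names, and the thresholds (`tcn_threshold`, `surge_factor`, `min_switches`) are all hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SwitchSample:
    """One hypothetical telemetry sample for a switch over a fixed time window."""
    switch: str
    tcn_count: int            # topology change notifications seen in the window
    bcmc_pps: float           # broadcast + multicast packets/sec in the window
    bcmc_baseline_pps: float  # learned baseline for the same measurement


def suspected_loop_switches(samples: List[SwitchSample],
                            tcn_threshold: int = 10,
                            surge_factor: float = 5.0) -> List[str]:
    """Return switches where BOTH indicators fire in the same window."""
    flagged = []
    for s in samples:
        tcn_high = s.tcn_count >= tcn_threshold
        surge = (s.bcmc_baseline_pps > 0
                 and s.bcmc_pps >= surge_factor * s.bcmc_baseline_pps)
        if tcn_high and surge:
            flagged.append(s.switch)
    return flagged


def loop_detected(samples: List[SwitchSample],
                  min_switches: int = 2, **thresholds) -> bool:
    """Assume a loop only when the pattern shows up on several switches at once."""
    return len(suspected_loop_switches(samples, **thresholds)) >= min_switches
```

Requiring both indicators on more than one switch is a design choice made here to cut false positives; a single busy multicast source or a single flapping link would trip only one of the two checks.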
Bad cables, again: do you need an engine to analyze that? Yes. The reason is that there are a few indicators which will tell you when cables are bad, and you want to be able to ingest that data, let the engine know, and see whether those behaviors are continuously being exhibited. Are the errors on the cable continuously incrementing? Is the cable continuously down while the port is still continuously serving power? There are various indicators for us to identify bad cables, and we want to showcase true positives here. It's a similar construct with negotiation mismatches and persistently failing clients. And that has always been the intent of Marvis Actions: it doesn't matter how many ports you are managing as an administrator; it is for us to bring out exactly those 15, 20, or even up to 50 issues that you as an administrator can take a look at and then immediately act on. You may want to shut the port, or open a ticket to fix cable issues, and then once you mark that issue as resolved, we also go back and validate. On the right-hand side of the screen there's a scroll that runs to say it has been AI validated as well: you told the engine that it has been fixed, and it's truly fixed. So that is the intent for Marvis Actions: making life easier for you as an administrator. As Sudheer says, it's the cup-of-coffee view, and we want to keep it that way. We'll continue to add more things into this view.
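Editor's sketch: the two bad-cable indicators mentioned above (error counters that keep incrementing, and a link that stays down while the port keeps serving power) can be expressed as a simple persistence check. This is a minimal illustration under stated assumptions, not the engine described in the talk. The `PortPoll` record, its fields, and the `min_polls` window are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PortPoll:
    """One hypothetical poll of a switch port."""
    link_up: bool
    poe_delivering: bool  # port is still supplying PoE power
    error_count: int      # cumulative interface error counter


def looks_like_bad_cable(polls: List[PortPoll], min_polls: int = 4) -> bool:
    """Flag a port only if an indicator persists across the last min_polls polls."""
    if len(polls) < min_polls:
        return False  # not enough history to call it persistent
    recent = polls[-min_polls:]
    # Indicator 1: the error counter increases at every consecutive poll.
    errors_rising = all(b.error_count > a.error_count
                        for a, b in zip(recent, recent[1:]))
    # Indicator 2: link is down at every poll while power is still delivered.
    down_but_powered = all((not p.link_up) and p.poe_delivering
                           for p in recent)
    return errors_rising or down_but_powered
```

Insisting on persistence across several polls reflects the speaker's emphasis on showing true positives: a single burst of errors or a brief link drop would not be flagged.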
Info
Channel: Tech Field Day
Views: 372
Keywords: Tech Field Day, Gestalt IT
Id: LgwN_aCMpzc
Length: 27min 8sec (1628 seconds)
Published: Thu Sep 16 2021