Advanced Network Device Troubleshooting & End to End Visibility

>>Hi and welcome to today's webinar, Advanced Network Device Troubleshooting and End-to-End Visibility. My name is Brad Hale. I'm the Product Marketing Principal for Solarwinds Network Management Products. With me today I have Chris O'Brien, our Product Manager, specifically for Network Performance Monitor. Just a couple of ground rules here. Like I said, today's content will cover advanced network device troubleshooting and end-to-end visibility. We want to make sure that no attendee is left behind, so please ask questions using the Q and A box as opposed to the chat box and we'll do our best to cover them all as we go through the webcast as well as the demonstration portion. And if not, we'll make sure we get those answered at the end of the webinar, and there will be a follow-up email that goes out with a link to the recorded webcast as well as the answers to the Q and A. So with that in mind, I'm gonna go ahead and go through a couple of overview slides about Solarwinds and then I'll turn it over to Chris. So, Solarwinds' vision is to manage all things IT in a hybrid world. What that means is basically we want to be able to provide you with the ability to manage and monitor your IT infrastructure regardless of where the applications and the underlying infrastructure are deployed. That means it can be on-premise, it can be in a hosted or SaaS-type environment, or in a cloud environment. We want to do that while continuing to take a very user-centric approach to our product development, which is to create products that are modular in their architecture and allow you to buy only what you need when you need it, as opposed to selling a giant monolithic approach that provides capabilities or features or functions that maybe you don't need for your particular environment. So with that, I'm gonna turn it over to Chris and allow him to give you an overview of Network Performance Monitor.
We'll start with talking about some of the advanced network device troubleshooting and then we'll go through a number of different demos. Chris?
>>Thanks, Brad. Hello everyone, welcome and thanks for joining us here. NPM, Network Performance Monitor, is Solarwinds' flagship product. I'm the product manager for NPM and I'd like to give you a quick overview. Then we'll take a look at some of the newer features that we have to help you with advanced device troubleshooting on your networks. One thing that we always like to start out with is that we are a multi-vendor platform. We provide fault, performance, and availability monitoring for your Cisco gear as well as your Dell, EMC, F5, all sorts of different equipment. All of this gets aggregated into a single view so that you can have consistent reporting, alerting, and basically consistent visibility across your IT environment. We've put a lot of focus in the last couple of releases on making sure we cover hybrid and cloud, taking up that new technology and providing visibility for it as people on-board it into their environments, in addition to our traditional sort of bread and butter on-premises coverage. Some recent work also includes topology and dependency-aware intelligent alerts. You'll notice the second half of the list of features here really focuses on how to deliver monitoring that is less noisy and less dependent on the administrator to input all sorts of configuration. So we automatically detect topology. We can set it up to automatically have intelligent alerts. Alerts that, for example, don't tell you about 10 servers going down behind a router that just went down. It just tells you about the router that's down, making it easier to troubleshoot and find that root cause. We cover wired and wireless networks, including that mapping I just discussed. We have automated capacity forecasting, alerting, and reporting, the sort of traditional things for network monitoring.
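As a rough illustration of the dependency-aware alerting idea described above (ten servers behind a down router produce one alert, not eleven), here is a minimal Python sketch. It is not how NPM implements the feature, which the webinar does not describe; the `parents` map, the node names, and the `root_cause_alerts` function are all hypothetical.

```python
def root_cause_alerts(parents, down):
    """Return only the down nodes that have no down node upstream of them.

    parents: hypothetical map of node -> upstream node (or missing for the top).
    down:    set of nodes currently reported as down or unreachable.
    """
    roots = []
    for node in sorted(down):
        parent = parents.get(node)
        seen = set()
        suppressed = False
        # Walk upstream; a down ancestor means this node is a symptom, not a cause.
        while parent is not None and parent not in seen:
            if parent in down:
                suppressed = True
                break
            seen.add(parent)
            parent = parents.get(parent)
        if not suppressed:
            roots.append(node)
    return roots


# Hypothetical topology: ten servers reachable only through router-1.
parents = {f"server-{i}": "router-1" for i in range(1, 11)}
parents["router-1"] = "core"
down = {"router-1"} | {f"server-{i}" for i in range(1, 11)}

print(root_cause_alerts(parents, down))  # ['router-1']
```

The sketch assumes each node has a single upstream parent; a real topology with redundant paths would need a graph reachability check instead.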
And one of the other key tenets of Solarwinds, and NPM specifically, is that it's designed for you to deploy, configure, and maintain without consultants, without purchasing professional services. Solarwinds doesn't even sell professional services because we are very serious about making sure that our customers can use it without professional services. Without further ado, we'll start talking tech with our features here. The first feature I wanna go over is Cisco Switch Stacks. So the cool thing about Cisco Switch Stacks is they've really hit sort of a sweet spot for a lot of customers in terms of having scalability but not having the large upfront investment of a chassis switch. They maintain this sort of pay-as-you-go model of fixed switches, but they give you that scalability of a chassis switch. So a really cool technology. A bunch of people make switch stacks, but Cisco by far is the most common. So we built new monitoring to give you better visibility into Cisco Switch Stacks. Just a reminder of how this technology works. Cisco Switch Stacks take nine or fewer stack-capable switches, as pictured on the right there. They look like fixed switches, but they can form together and sort of pretend, or as tech guys say conspire, to be a single switch. And the benefit of that is that you get a single point to do management from, a single thing to monitor, a single configuration file to handle. All of these sorts of things that make management easier. You don't have to manage this as nine separate switches. So that's great and all, but that introduces a layer of technology above and beyond a standard fixed switch. And traditionally that layer of technology is not something we've had visibility into. So what I'm talking about specifically there is forming a single switch, a logical switch, from a set of physical switches. A couple things need to occur.
First of all, you need to connect them all together, as you can see with that stack cable on the left-hand side of our image there. You need to connect them all together, and that forms the backplane of what would otherwise be a chassis switch. The devices need to talk to each other and elect a master switch, which is effectively the brains of your switch, or sort of like a supervisor. You can think of this as your supervisor in a chassis world. And that device handles all of the operations of the switch and sends data down to the ASICs of all of the other switches to program them for what they need to do, and it presents a management interface to you, the administrator. So that master switch election occurs, and ideally that's something we'd like to get visibility into because it's super critical to the function of the switch stack. Also, all of these devices have their own physical resources, right. And historically you've been able to see the interfaces for all of these devices no matter what, but the CPU, RAM, and hardware health for each one of these components is not something historically you've been able to see for each one of them. You just see it for the master. So despite that being the case, it is true that each one of these physical switches has a finite amount of CPU and RAM, and has hardware health components. And when we say hardware health, we're talking about things like power supplies, potentially multiple power supplies, fans, fan speed, temperature, all of those sorts of components that you need to keep healthy for the overall platform to be healthy, for the hardware to be healthy. And finally, there's this stack ring that's so important, that forms the backplane of our switch, and that now in this stacking environment is external to the switch, whereas with the chassis it was internal.
So there's a greater potential that there will be some sort of break, there will be a loose cable, there will be something wrong there that's affecting your switch stack's capacity or redundancy as a whole. So we need to get visibility into that. So let's jump right in. We will cover all of those things in this demo here and show you what that looks like. Let me pull over our test or demo environment and make sure that I get logged in there. Hopefully everyone can see my screen.
>>While that's coming up, there was a question. Is this just Cisco Switch Stacks or does this support Juniper Virtual Chassis as well?
>>This release is just Cisco Switch Stacks right now. It's by far the most popular vendor. But as you can see, we support a large number of vendors. For this specific piece of technology, the first release of coverage is Cisco. So let's jump over to our NPM summary page and find us a switch stack to look at. We'll jump over to this 3750. So there's a new resource here in this 12.0 release that will list out your switch stacks as well as their current status, and this is inclusive of the data ring and the power ring. It's more interesting when we have problems, so we'll take a look at this 3750 rig here that seems to be reporting some problems. We cover 3750 as well as 2960S switch stacks and a number of other models. Reminder that we have great documentation in our admin guide. If you just google NPM admin guide or you go to our Customer Success Center at solarwinds.com, you can get more information about that. We have the traditional monitoring, but I'm gonna use this new tab that's showing up here on all of my switch stacks. And I'll select and click over to that and we'll see what information comes up here. So the first thing and most obvious challenge to get solved here is listing out the switch stack members. So in this switch stack, we appear to have six member switches and we have the model numbers listed here for each one.
And that's really helpful because it lets you know how much capacity I have for power over ethernet, represented by this p, or just regular ports on each one of these. It also gives you an indicator, if you know your Cisco switch model numbers, of what your backplane performance is and all sorts of stuff that goes into or makes up that model number. So I've got six switches here. This little icon on the left will tell me that this is the master switch. My switch number one is the master switch here, so that's good. That's probably what I've configured. It looks like switch priority is seven. So yeah, that is what I configured here. And it looks like in this case, I've configured other switches to have less and less priority for that master election. So the highest priority that I configured does have master, and I have properly configured backups for that. Another thing that's unique to switch stacks is that because all of these physical members, in this case six members, are pretending to be the same logical switch, it gets a little confusing when you try and find the switch in the closet, like the wiring closet or your data center, because they have the same host name. So if you're used to physically labeling, and you get a little label maker and put the host name printed out on each one of the switches and then you go and find it, that becomes a problem because each one of these has the same host name. So that's why we have serial number here. The serial number's the best way to identify the unique hardware that is acting as switch number one or switch number three or whatever in the switch stack, and it's printed on the back of all of your Cisco switch stack physical devices. We also have MAC address here. And you can start seeing the CPU and RAM and other information that's specific to the hardware health of that one device.
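To make the priority-based master election above concrete, here is a small Python sketch. It is a simplification: it only models "highest configured priority wins", with a lowest-MAC tie-break added as an assumption, while the real Cisco election considers additional factors not discussed in this webinar. The `StackMember` type, the MAC addresses, and the serial numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StackMember:
    number: int     # switch number within the stack
    priority: int   # configured master-election priority (higher is preferred)
    mac: str        # hypothetical MAC address, all in the same text format
    serial: str     # hypothetical serial number, unique per physical switch


def elect_master(members):
    """Pick the member with the highest priority; break ties on lowest MAC."""
    return min(members, key=lambda m: (-m.priority, m.mac.lower()))


# Six members with descending priorities, mirroring the demo's configuration.
members = [
    StackMember(1, 7, "00:11:22:33:44:01", "SERIAL-0001"),
    StackMember(2, 6, "00:11:22:33:44:02", "SERIAL-0002"),
    StackMember(3, 5, "00:11:22:33:44:03", "SERIAL-0003"),
    StackMember(4, 4, "00:11:22:33:44:04", "SERIAL-0004"),
    StackMember(5, 3, "00:11:22:33:44:05", "SERIAL-0005"),
    StackMember(6, 2, "00:11:22:33:44:06", "SERIAL-0006"),
]

master = elect_master(members)
print(master.number, master.serial)      # 1 SERIAL-0001

# If the master fails, the next-highest priority takes over.
print(elect_master(members[1:]).number)  # 2
```

Because every member shares the same host name, the serial number is the field that ties a logical switch number back to one physical box.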
Historically there's been a couple of different ways that Cisco has presented this information, but in all cases it ends up being a poor summary unless you can really drill down into the individual components. If you have CPU or RAM high on one of these, you wanna know. You don't wanna say that the average CPU, for example, is 20% when that average is comprised of 100% on one switch. I wanna know if there's 100% on one physical member. So you can of course alert and report off of all these, but step one gets the check mark here: we know that this logical entity that we call EW3750A is comprised of six switches. We know the master, we know the RAM and CPU for each, and we can find them in the wiring closet. The next component I'd like to discuss here is these interesting-looking resources on the right. These are visualizations for the stack rings. The switch stacks have two different types of rings depending on the capabilities of the switch stack. The first type, and the one that's present in every single switch stack, is the data ring. So this is the cable that forms the data backplane between these different physical switches. And that cable will go from, for example, switch one over to switch two, from switch two to switch three. And if we look at our picture here, we can actually see that over on the left-hand side you can see the cables sort of spidering about there, and they make a loop. Looking here we can see that represented as a loop or a ring, and this is actually a cool thing for me because sometimes as an administrator I forget that this thing is literally a ring. It's a loop or a ring topology. And so each one of these needs to be connected, and it will come back up to the first switch. But when you look at the back of the switch, sometimes that's not obvious or logical and it makes it harder to troubleshoot. Here, showing it as a ring, we can immediately see the cable between switch number four and switch number five is not functioning.
And the result of this is that we've got half redundancy. Our bandwidth is reduced to half. And if we were to have another failure in that data ring there would be a catastrophic failure, so we've lost a lot of our redundancy of that ring. We're in a backup mode where data is still passing. So very cool to have this visualized for you and it makes it a lot easier to troubleshoot problems.
>>Chris, does this work on all Cisco Switch Stacks or is this limited to certain models?
>>This works on the vast majority of Cisco Switch Stacks including the 3750, 3600, 2960S, and the complete list should be available in the NPM admin guide if you want to check your version if it's not included in the ones I've specified there.
>>How about Nexus?
>>Nexus are not switch stacks. Nexus is chassis, and they have some FEXs and a sort of distributed architecture, but that's definitely not stacking. So that will be another topic.
>>Okay.
>>The next thing I wanna look at is the stack power ring. So we've handled the data backplane. The next thing, sort of an innovation that Cisco came up with, or someone came up with and Cisco implemented, is sharing of power. So if you've got a set of switches, say four switches, then in a stack environment it doesn't really make sense for each switch to have two power supplies for redundant power. You want more like an N+1 configuration so you don't pay so much to get your redundancy in power. You want to be able to absorb perhaps a single power supply loss, but you don't need to absorb half of your power supplies going out. So the stack power ring is another ring of cables at the back of your switches that allows one switch to share the power from the power supplies it has with the other switches as necessary. So traditionally what that looks like is you have a set of switches and each switch in that ring has a single supply, and then you select a single additional switch to have two power supplies. So that gives you that N+1 redundancy.
If any of the switches lose a single power supply, they will get additional power from that remaining single backup power supply. One of the things that's interesting with the power rings is you can't fit as many switches into one as into a single data ring. So where traditionally you have a maximum of seven or nine switches, depending on the model, in a data ring, you can only have up to four switches in a power ring, and this ends up being a limitation of how much power do I want to send through a single physical cable that's connected to the back of all of these switches. So with that limit of four, that means you sometimes have a data ring or single logical switch that contains multiple power rings. And so, you want to visualize that as well, and see that ring and be able to detect that, for example, the second ring is running in power sharing mode, so we may have configured that for redundant mode and now we're running in power sharing as a result of losing this one cable. So again, we've lost redundancy, we're still functioning, but we want to know about that redundancy loss and go back and resolve that, so we don't have our production environment always sitting at the edge of a catastrophic failure. That sort of outlines the coverage we have for Cisco Switch Stacks. There's a couple other resources in the other more traditional Solarwinds views that have the power supply, fan, and temperature settings or readings for each one of these physical components. But that gives you the base-level coverage that we want for Cisco Switch Stacks to keep them healthy in your environment. And of course it's all reportable. You can alert when there's a broken data ring or power ring, or a switch is added and there's a new member in the switch stack, or a master election change. Basically all of the stuff you're seeing here, you can alert upon, which is really helpful to get just-in-time notification that you need to go and do some work.
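The ring behavior described above (one failed data cable halves bandwidth and removes redundancy, a second failure is catastrophic, and power rings hold at most four switches) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the ring is assumed to have three or more members cabled in numeric order, and splitting members into power rings sequentially is a guess, since real power rings follow the actual cabling.

```python
def ring_links(members):
    """Cables of a closed ring: each member to the next, last back to first."""
    count = len(members)
    return [(members[i], members[(i + 1) % count]) for i in range(count)]


def data_ring_state(members, broken_links):
    """Classify the data ring by how many of its cables have failed."""
    failed = [link for link in ring_links(members) if link in broken_links]
    if not failed:
        return "full ring"
    if len(failed) == 1:
        # Traffic still passes, but only one way around the ring.
        return "half bandwidth, no redundancy"
    return "ring split"


def power_rings(members, max_size=4):
    """Group members into power rings of at most max_size switches."""
    return [members[i:i + max_size] for i in range(0, len(members), max_size)]


members = [1, 2, 3, 4, 5, 6]

# The demo's failure: the cable between switch 4 and switch 5 is down.
print(data_ring_state(members, {(4, 5)}))  # half bandwidth, no redundancy

# Six switches cannot share one power ring, so there are two.
print(power_rings(members))                # [[1, 2, 3, 4], [5, 6]]
```

An alert rule in this spirit would fire on any state other than "full ring", so the redundancy loss is noticed before a second cable fails.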
>>Chris, we've got a user here, an NPM user, that says they've got a 3750 stack single switch, but NPM is identifying it as a switch stack.
>>That's one of the bugs that we're aware of, so definitely something that should not be happening. But basically what happens is if you have a switch that is capable of stacking, like the 3750, we will report that as a switch stack, so it's technically a stack of one. Now from the Cisco side this is correct, right. This thing is a stack of one, you do have backplane bandwidth, which is reported, you do have master election that occurs, and all of that sort of stuff. So that shows up here as a switch stack with one member. But in general as users we don't care about those things unless there's multiple switches in the stack. We have a feature request to make that appear as not a switch stack. That seems to be the common request from users.
>>Does this work in a VSS configuration?
>>So VSS tends to be more like chassis switches, for example the 6800 or 6500 series chassis switches from Cisco, and that's not really switch stacking. That's a chassis technology, sort of similar in that two switches act as one, but you can have six switches, for example, in a stack. It's sort of a different technology, so the stacking stuff does not apply there. That's not covered. Any other questions? This is what we have to talk about for the Cisco Switch Stacks. We've got two other new features to discuss here, but I want to cover any questions we may have about Cisco Switch Stacks.
>>Yeah, really quick on this. This new switch stack monitoring capability is part of NPM version 12?
>>That's correct.
>>And it does not require any additional modules or anything like that, it's just part of the core NPM capabilities?
>>That's absolutely right. If you have NPM, you have Cisco Switch Stack Monitoring. My other note is we do get requests fairly frequently about adding other vendors, and that's something that we're thinking about.
If you have specific switch stacks in your environment that aren't covered by this Cisco Switch Stack Monitoring, please do shoot us a message either via email or even in the Q and A and let us know, and we'll think about how to add that.
>>Okay, without further ado, we'll jump over to our next topic here.
>>So we did our switch stack demo. The next thing I'd like to talk about is deeper insight into F5 load balancers. So it's a really interesting thing in networks today how, as of 10 years ago, networks had routers, switches, and to a lesser extent wireless access points. And if as a network engineer you had all of that stuff running well, you were doing your job. Your network was running well. But today's networks include these things that we call modern network appliances. And these cover things like Cisco ASAs or any firewall type, any load balancer, web proxy, WAN optimizers, and so these things are sort of sprinkled around your environment. They tend to be much fewer in number than the switches and routers and access points in your environment, the access layer stuff, but they tend to be extremely important. It's for two reasons really. They tend to sit in bottlenecks of your network. For example, firewalls sit between your enterprise network or your data center network and the internet, or between your network and partner networks, and if that single firewall, or ideally a redundant pair, goes down, then your entire network loses critical connectivity, for example internet connectivity. So even though you don't have many firewalls, they're exceptionally critical in your environment. The same holds true for F5 load balancers. F5 load balancers have a unique position because they're the gold standard load balancer but they're also by far the most popular load balancer. So a really high quality load balancer, and because they are so high quality, they cost a lot of money.
So it's not uncommon for enterprises to spend 50, 100, or hundreds of thousands of dollars on F5 load balancers, and they make that investment because of how critical the services are that the load balancers sit in front of. So these load balancers tend to sit in front of the most important services in the entire company. Solarwinds is an example. Solarwinds.com sits behind load balancers. So if those load balancers malfunction, go down, or otherwise aren't providing proper service, then our website goes down, which is a big deal for us, at least, and tends to be a big deal for any of our customers when those services behind the F5 go down. There's sort of a disparity between the number of devices and their criticality, and this is something we've noticed and we want to go fix. We wanna make sure that these devices that are not routers, that are not switches, get the level of coverage that they need, that's commensurate with their importance in the network. In addition to the standard stuff we do for routers and switches, covering CPU, RAM, and interface utilization, for F5 we also cover things like connection count. And that connection count can be sliced in a bunch of different ways. You can look at connection count for an LTM, the whole appliance, for virtual servers, for pools, for specific pool members. You can look at DNS resolution requests for a GTM, directly for the whole thing or for an entire service on the GTM. So this whole set of GTMs and LTMs are reporting data that you can slice any which way to try and understand the health of the environment or specific servers, services, or logical components. So it's very relational. The other thing that we cover is sort of the bread and butter stuff for the platforms. So things like software versions, serial numbers, as well as the H/A status.
And we cover not only whether these things are reporting to be H/A and what failover status they're reporting, but also whether they're reporting that they're synced up, that they're ready for failover. So it's very important. We have all of this in a sort of very relational view, and we'll jump into the demo in about 60 seconds here, but it's a very relational view where we show the servers and the pools and the nodes and all of the statuses and how they connect together, which is very different from how routers work, and it matches the relational way that the technology works. Connection counts we talked about. We also allow you to drill all the way down to the health monitors. So a lot of times when a service goes down you have health monitors deciding that certain servers are unhealthy, and if you want to get to the root of the problem, you need to know what health monitors are deciding what about your servers, and we can do that. And finally, we get down to the individual pool members. So, I'll stop talking about it, we'll start looking at it here. Within Solarwinds, we'll jump back over to my dashboard and network. This is the subsection that comes with NPM. We'll jump over to load balancing. Right off the bat here, this is very different from everywhere else in Solarwinds NPM. We've got this sort of balancing environment view. And when you think about load balancing, it functions based on the relationships of many different components, some of them physical and some of them logical. That's sort of the core of how load balancing works. So at the top, we have a list of our services. These are the things that we're presenting to the world, the things we bought the F5s for, to make sure that they're up. So that's the top level thing. And now, to get this service up, this service depends on a global traffic manager, local traffic manager, virtual servers, pools, and pool members, all of which sort of feed up into the service. And we can see that here. We can see the status of each component.
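The dependency chain just described (service, global traffic manager, local traffic manager, virtual server, pool, pool members) can be modeled as a simple tree. The Python sketch below is an assumption-laden illustration, not NPM's data model: the `Component` type, the component names, and the `deepest_problems` helper are hypothetical. It walks down from a service and reports the lowest-level components that are not up, which is the same drill-down a person would do by hand in a relational view.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    kind: str                 # e.g. "service", "GTM", "LTM", "pool"
    name: str                 # hypothetical display name
    status: str = "up"        # "up", "down", or "unknown"
    children: list = field(default_factory=list)


def deepest_problems(component, path=()):
    """Return paths to the lowest components on each branch that are not up."""
    path = path + (f"{component.kind}:{component.name}",)
    found = [p for child in component.children
             for p in deepest_problems(child, path)]
    if found:
        return found
    return [path] if component.status != "up" else []


# Hypothetical environment: one service fed by a single chain of components.
members = [
    Component("pool member", "app-1", "down"),
    Component("pool member", "app-2", "unknown"),
]
pool = Component("pool", "app-pool", "down", members)
virtual_server = Component("virtual server", "app-vs", "down", [pool])
ltm = Component("LTM", "ltm-1", "up", [virtual_server])
gtm = Component("GTM", "gtm-1", "up", [ltm])
service = Component("service", "app.example.com", "down", [gtm])

for problem_path in deepest_problems(service):
    print(" -> ".join(problem_path))
```

Each printed line ends at a pool member, which points at the servers themselves (or their health monitors) rather than at the service that merely inherits the failure.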
Now if we have a problem with one of these components, we can drill into it. So there's a couple of ways to do that. If I click this specific service, this ADFSDNS North America, and click on show relations, on the back end we have polled and understand how all of these components are related. So this specific service depends on this global traffic manager. And in this case, this global traffic manager was detected to be a pair. And this indicates that these two are forming an H/A pair. This guy is active. The H/A status, you can see there, is in sync in the bottom of that hover-over. So they're prepared for failover, but one of the backups is gray and this is unknown. It looks like this hasn't been added to NPM yet, so we would want to go and add that to NPM, and then you would not only have the device noted but also start getting statistical information from it. And in this case we're having a problem, right? This service comes through this global traffic manager, local traffic manager, virtual server, pool, and pool members, and there's problems with each one of these. If we hover over, we can see not only the status, but also the status reason. And we can sort of start to walk through this problem area and get to the root cause relatively quickly here. So the service as a whole, this ADFSDNS North America, the status is down, the status reason is no enabled pool, so we can see the pool is dependent on the global traffic manager and the local traffic manager. There's only one virtual server. That virtual server has a single pool in it. So let's jump down to that pool that's reporting the problem. Now here we can see the status of the pool is down and it says the children pool members are down. So in this case, we've got two children pool members, so we can look at the status of them. One says unable to connect, no successful connects before the deadline. So our timer has expired. And we'll jump deeper into that in a moment.
Then this guy says availability unknown. So, the unknown status is coming from F5, and F5 is unable to get that availability. So let's drill into one of these. So we're clicking in. And this time we don't wanna show relations, we wanna show the details page. And when you go into the details page, this is of course going to tell you the stuff that's relevant to that specific component. So we have our status and our status reason again, all of the pools that this pool member is participating in, what F5 server it's on, the number of connections, how many connections per second, so this is concurrent as well as per second, and how much bandwidth it's taking as well. Importantly, when something is down, you'll come over here and look at the health monitors, and in our lab we always have sort of data consistency, data accuracy challenges with getting all of this in a single lab. But in this case, you would see a list of the names of the health monitors that it's assigned. In a production environment, you would see this specific health monitor is down and we would show you that health monitor status reason as well. So you can very quickly drill all the way down to your problem, even in complex environments that include multiple physical components and many logical components. Now one of the other things you may have noticed here is that our balancing environment, this sort of mini-stack or miniature view of the balancing environment, has followed us. And the view that has followed us is filtered down to anything that depends on that pool member, showing us all the things this specific pool member is related to. And as we mouse over these things, the mouse-overs contain the data that's relevant, the right data for that specific component. So if I look at global traffic managers, we can see, for example, its IP, the hosting node, its H/A status, and the number of requests, requests per second here being DNS requests.
Whereas if I mouse over the LTM, I can see the number of connections, and this is connections across the entirety of the LTM. If I mouse over virtual server, I can see the number of connections specific to that virtual server, the port number that that virtual server's providing service on, all of these sort of right pieces of data for that specific component. So I'm going to jump into one of these. I see my global traffic manager here is reporting badly, so I'll jump, and I can continue to use that resource as a navigation function. Jumping over to this GTM, we can see DNS resolution by service, and we can jump into each one of these services here. We can list out our services and see their components and see how they're doing their load balancing. We can also, and this is really interesting, track the relationships in this environment. So this global traffic manager has six services that depend on it and it feeds through only a single local traffic manager. So there's no redundancy at the local traffic manager, at least for any of the services dependent on this global traffic manager. So you can very quickly take a single physical asset, in this case a GTM, global traffic manager, and understand, for example, which pool member servers that GTM's function depends on. It's a very cool relational view. Okay, so hopefully some questions have started coming in.
>>Yes, the first question is what version of F5 does this support?
>>So there's two pieces here. The vast majority of this functionality is supported by TMOS version 11.2 or later. And all of this data is acquired via SNMP. Now two things specifically, the health monitor status and status reason, as well as pulling a server in and out of rotation, or more correctly, pulling a pool member in and out of rotation in a pool, function through the iControl API. So that requires TMOS 11.6 or later, if I'm remembering right.
And just to double check this and compare against your environment, along with the other requirements for this functionality, check out the NPM admin guide. There's a section for Network Insight for F5 and specifically a requirements section to make sure your environment fits. Any other questions?
>>No, that's it for now.
>>Okay, we'll jump into a pool and let's see if I can take a look at the pool members here. So within a pool, of course we will list out the members in that pool, the pool members. We'll show you their load, so you can compare relative loads. See how well your load balancing algorithm, which in this case is round robin, is distributing load between these two. They're quite a bit different. Maybe with our round robin we would want to do something more specific, or think about how we're tracking state information, and get that more well-rounded. Or at least make sure that each individual server can carry that larger portion of the capacity. The other thing we have here is some capability to change the rotational presence, so to change what members are actually available in this pool. So clicking change rotation, you can just turn these off and on. What we found is although there's a lot of things you can configure on load balancers, the most common operational task, what you do 80% of the time when you're logging into a load balancer, is simply taking pool members in and out of rotation so that you can do some sort of maintenance or do some sort of transition over to a new version of a site. So you can do that directly in this interface. You would click one of these. Now in the demo environment this is disabled, but when you click one of these it will give you a warning, tell you what you're about to do, and make sure you want to do that right now. And then upon selecting okay, we will go ahead and connect over the API and disable that pool member. Finally, there's some expert tips here.
One of the expert tips in this scenario is about what happens after you turn a server off. This is designed to not produce a lot of impact to your users. So users currently sticking to a specific server will start to siphon off gradually, but the current users will still be served by that specific server. So here we're suggesting that you should give it a few minutes after you remove a server from a pool to make sure that that server is not used by many of your production users. And there's expert tips like this sprinkled throughout the product. Click in and you can get more information. So we will jump over now to our next feature, I think. Any more questions about F5? >>Nope. >>Okay, great. So we'll jump over to our next feature. That is NetPath, here. Let's get this showing up. That switched on me. Actually, we'll just take a look straight in the slide deck here. NetPath provides visibility across the entire service delivery path, particularly for cloud and web-based services, so hybrid environments where part of the path is your environment, part of the path may be the internet, and part of the path is some sort of service provider. Things like our access to salesforce.com are what NetPath is designed to understand and help you troubleshoot. So you can see things like where the problem is, who the responsible party is for the node or the link having the problem, and how to contact them. As a reminder here, traditionally, when we're thinking about network monitoring, we're heavily focused on infrastructure gear. We ask the infrastructure gear, "Hey infrastructure gear, how are you doing? How are your interfaces doing? How is your fan doing? How is your CPU doing?" All of these different questions go to the infrastructure gear. But the reality is our users don't depend on that directly. They depend on the services the infrastructure is providing. So at its most fundamental level, in networking, we as network engineers are delivering services to users. 
Users can be people, or users can be, for example, a web server that uses a SQL server. But effectively we're delivering a service to a user. And NetPath is really designed to give you visibility across that service delivery from your user base, even when it's server users, like a web server as a user of a SQL server, and to do that locally, remotely, and over the internet, for basically all of the different types of environments that fit within the hybrid IT environment that most of us have today. And we do this by deploying a probe only at the source. This is really important. The probe is basically representative of your user base. You can deploy it on the user's machine or on a machine that's adjacent to it in the same office, for example, depending on how you want to do it. And this probe, with no other instrumentation, will detect the path and the performance of the path, and then give you a granular view into how each component along that path is impacting the end-to-end performance. This makes it much, much easier to start troubleshooting. So let's take a look at what that looks like in NetPath here. So again, we'll go to my dashboard, over to the network tab from NPM, and finally, NetPath Services. So here we have a list of services that I'm monitoring with Solarwinds NetPath. A reminder here that NetPath is a feature of NPM. So if you have NPM, you have NetPath. Now in this list of services, we're monitoring things that are remote like Google, Office 365, Salesforce, and so on, but we're also monitoring our AWS lab, which is a combination of their network and ours, and where we have some responsibility for at least the logical infrastructure. And we also have some portals here. The takeaway here is that NetPath will work for any TCP-based service, regardless of where that service sits. So as long as it's TCP based, it will work. So let's take a look at what creating one of these paths looks like. 
So say I want to monitor, for example, thwack.com, a site near and dear to me. And I know that runs over port 443; that's encrypted, that's SSL, so it's port 443. I'll put that in. We work fine with the encryption here. We can put in an alias, some nickname, if we want. We'll put in a probing interval, so how often am I going to probe that thing, and click next. After I specify the destination service, I assign a probe. I can use one of the probes that I've already deployed to one of my offices. I can use the main poller or any of my polling engines already in Orion, or really do what NetPath was designed to do, which is deploy a probe close to my users. And we'll handle that probe deployment for you. We'll just put a small agent out there and we'll centralize and manage all of that functionality. And I click create. Now that won't work in the demo here, but I just wanted to show you how easy it is to create one of these paths. So then you get that service added to this list. If we drill down into one of these services, we can see Salesforce being monitored from the Austin lab. So let's go ahead and jump into that path. I'll click on that. This takes us to the path inspector page. So that path inspector page gives us a visualization of the path from the probe, all the way over here on the left, all the way to Salesforce. And we can see here a summary from this probe, our Austin lab probe, to Salesforce. We're monitoring exactly www.salesforce.com, so we will cover DNS resolution and make sure that's good. And we're covering that in this case on 480 for whatever reason, that should be NET C-PORT. At the bottom, we can see Salesforce, specifically for this location, is handled by two different servers. So the bigger the service is, the more servers tend to be providing coverage for it. And if they have, for example, F5s doing geographically dispersed load balancing, then you may get some regionality. 
So in this case, from our lab environment, we're seeing two specific servers, whereas from our production environment or a different branch we may see completely different servers. The important thing here is that my users at this location depend specifically on these two servers. And that's what NetPath is designed to show you. So we start drilling into this path. We can see from the source, I go to R1, R2, R3. There's some multipath going on here. We continue to go through our network. And we eventually get to our service provider, and we actually use Time Warner Cable at this specific branch at Solarwinds. Time Warner Cable has two autonomous systems, two networks, due to, if you're aware of the history with Time Warner, some mergers and acquisitions with Time Warner Telecom versus Time Warner Cable. We can actually see this surfacing in their network architecture here. They connect through both of their networks. They connect us over to TeliaSonera. Now, TeliaSonera's a backbone provider. You can see it listed there; they're an international carrier. And they aren't someone we have a direct business relationship with. We don't pay them for internet connectivity. We don't pay them for a SaaS service. But we definitely depend on them, and our users depend on them, for salesforce.com access to function properly. So of course, they're discovered here and they're represented here and their performance components are shown. And finally, we get to salesforce.com's own network. Each one of these autonomous systems, these bigger circles here, is a network, and I can click on it and it will expand out for me the nodes that comprise that specific network. And so I click on all of these. And we've got some default summarization because, as it turns out, the internet's complicated. 
And there's lots of nodes and links that you can depend on for all of your internet-based services, but using NetPath, we can really open this up, expand this to the extent we need, and detect where the problems are. Now one of the cool things with NetPath is we assign a latency and packet loss value to every single link and node in this topology, in this path. So we can hover over, for example, this top link and we'll see that 25% of our traffic is taking this link and receiving two milliseconds of delay specifically on this link, whereas some 13% of our traffic goes across this second link with six milliseconds of delay. And you can continue to walk through all of these different components and see how much delay there is and how much of your traffic is going through each one of these links. As you click through, you can also see the ownership information. So this node here is owned by Salesforce. And if I were to need to contact them, I could do that. So let's take a look at a specific troubleshooting scenario and see how this would help us troubleshoot a real problem. Let's start with the scenario. So say, for example, we're the network engineers at this company, and someone came over to our desk and said, "Hey, I had a problem browsing salesforce.com at around noon today. I had a bunch of problems and my neighbor had a bunch of problems, my cube neighbor, but a whole bunch of other people did not have problems. So what's going on here? I need to solve my problem, but it's working for some people. Is this a laptop problem? How do I fix this?" So as the network engineer, you start drilling in. And if you've got NetPath monitoring, you may notice across the bottom here, we've got some red. So this bottom pane here is our path history. So we can see both our availability, the up/down status, as well as our latency numbers here. And we can see our latency number over time. 
And we can see that through most of the day, 11/9, we had about 70 milliseconds or so. It looks like there's quite a bit of variation, which makes me a little bit concerned. But in general, the latency was under 125 milliseconds. But around noon that spiked up. So let's go ahead and click on this interval. And that will actually load the topology graph, or the path graph, that was occurring at that time. So it will show you the path that was being used to deliver this end-to-end performance of 301 milliseconds. So when our end-to-end latency was 301 milliseconds, our path looked like this. Very quickly, we can see a couple pieces of red. So I'm going to let this summarize back down so this is a little bit easier to look at. And we can see on the right-hand side we're still using the same two servers. We didn't switch servers or anything like that. We can mouse over them and we'll see our average latency to this server is 280 milliseconds. The bottom server's is 322 milliseconds. Definitely a problem applying to both of these destinations, so that's not good. But we can see here, NetPath has already identified where the source of that latency problem is. So I'll zoom into that. We can see that there's 234 milliseconds on the LAN link, because this is between two of my nodes. Now, NetPath knows the difference between your links. It knows whether this is a long-haul link that goes over the ocean or satellite or something like that, versus a short LAN link that sits in your wiring closet, maybe between two devices in the same wiring closet. And using that awareness of how long this link is, it can understand whether this 234 milliseconds is some sort of crazy problem for a LAN link or whether that's to be expected, as with a satellite link. So NetPath is saying this is out of the ordinary. This is a problem, this latency of 234 milliseconds. We can also see 67%, most of our traffic, is going over this link. 
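The path-history check described above (latency normally under 125 milliseconds, then a spike to 301 around noon) can be sketched in a few lines of Python. This is only an illustration of the idea and not how NetPath stores or evaluates its history: the `Sample` type, the `find_spikes` helper, and every sample other than the 70 and 301 millisecond readings mentioned in the demo are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One point in a path history: interval label and end-to-end latency."""
    time: str
    latency_ms: float


def find_spikes(history, threshold_ms):
    """Return the samples whose end-to-end latency exceeds the threshold."""
    return [s for s in history if s.latency_ms > threshold_ms]


# Hypothetical history; only the ~70 ms baseline and the 301 ms spike
# around noon come from the demo, the rest is filler for illustration.
history = [
    Sample("10:00", 70),
    Sample("11:00", 95),
    Sample("12:00", 301),
    Sample("13:00", 80),
]

spikes = find_spikes(history, threshold_ms=125)
for s in spikes:
    print(f"{s.time}: {s.latency_ms} ms exceeds threshold")
```

The interval this returns is the one you would click on in the path history to load the path that was in use at that time.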
And when you're troubleshooting, one of the most powerful troubleshooting tools you can use is differences. So we're going to take a look at the interval right before the problem. I'll click on that. And NetPath will load our topology, or our path, for that. We can see that same link between R3 and R5. That was running at three milliseconds before. And now, during the problem period, we're sitting at 234 milliseconds. So there's definitely a problem with this link, this connection between these two routers. Now if you're monitoring those routers in NPM, in NCM, our configuration management and backup tool, and in NTA, our NetFlow tool, you can see that mapped over, so NetPath has detected this node, but it's also pulling in more information from our on-premises monitoring. We can see R3 here. The node name is R3 and it's a Cisco 7206 VXR, with CPU and RAM and all these sorts of things. So the last thing I'd like to mention is that the high latency is caused by some sort of problem. And because we happen to have on-premises monitoring, SNMP access, and we have NPM and NCM monitoring these things, NetPath will attempt to find the root cause of the problem. Here it's telling us there was a configuration change, and we will click on that configuration change. It will go ahead and pull the configs, it will do a diff for us, and if we scroll down here, we can see this highlighted line is the line that changed. And it looks like that was a policer change, or a traffic shaper change. Our traffic shaper changed from one megabit to 10 megabits. So we're queuing some more traffic because we're doing traffic shaping, interesting. And if we look, we can see that traffic shaping command changed on Ethernet-10. So we'll get back to our topology and we can mouse over, even down to the interfaces, and see that yes indeed, this connection is coming into Ethernet-10. So that connection that's having the problem is the interface that we made a configuration change on. 
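The "pull the configs and do a diff" step can be illustrated with Python's standard `difflib` module. This is a sketch under stated assumptions, not the NCM implementation: the two configuration snapshots are invented, the `shaper rate` lines are placeholder text rather than real device syntax, and `changed_lines` is a hypothetical helper.

```python
import difflib

# Invented snapshots. The "shaper rate" lines are placeholders, not real
# device syntax; only the 1 megabit to 10 megabit change and the
# interface name are taken from the demo.
before = [
    "interface Ethernet-10",
    " description link to R5",
    " shaper rate 1mbit",
]
after = [
    "interface Ethernet-10",
    " description link to R5",
    " shaper rate 10mbit",
]


def changed_lines(old, new):
    """Return (removed, added) lines between two config snapshots."""
    removed, added = [], []
    for line in difflib.ndiff(old, new):
        if line.startswith("- "):
            removed.append(line[2:])
        elif line.startswith("+ "):
            added.append(line[2:])
        # Lines starting with "  " are unchanged and "? " lines are hints.
    return removed, added


removed, added = changed_lines(before, after)
print("removed:", removed)
print("added:  ", added)
```

Comparing the snapshot from just before the problem interval with the one taken during it narrows the search to the handful of lines that actually changed.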
So we can select that. That will again connect back to our NPM data set, and we can click to go over to interface details for that. But we can also see here what traffic is going across that interface. Now, we saw a traffic shaper was applied. Traffic shaping happens as an egress function, or an outbound function, for that interface, so I can select egress here. And I can start to look at what the largest flows, or data conversations, are that are happening across that interface. So you really see the end-to-end performance and then drill down into a specific time slot and see a specific component introducing a delay or some packet loss, latency, some sort of performance characteristic that's degraded. You drill into the specific component and drill all the way down, in many cases, to the root cause, explaining why that performance degradation is occurring on the specific node. So that's my quick demo of NetPath. I'm curious if there's any questions, Brad. >>The big question, obviously, is how does this work? Is it using ICMP, is it using some other technology to gather the data? >>Yeah, so the short answer there is we have a proprietary probing algorithm that uses a number of protocols. The funny thing is we started with traceroute. We thought this was an easy problem to solve, but we found out traceroute was blocked by many, maybe even half, of enterprise environments, including our own, so that didn't work for us. We also noticed that traceroute doesn't deal well with multipath, which is prevalent on the internet. So we built our own proprietary probing algorithm. We actually use a packet driver. We create our own custom-crafted TCP packets to do this probing. We listen for the responses, analyze them, send new probes based on that analysis, and come up with this picture. Do we have any other questions around NetPath or indeed any of the other features I have demonstrated here? >>I think we've answered just about all of them that have come through. 
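To make the idea of TCP-based probing concrete, here is a much simpler stand-in written in Python. It is explicitly not the NetPath algorithm, which is proprietary and sends custom-crafted TCP packets through a packet driver. This sketch only times one ordinary TCP handshake to a host and port using the standard `socket` module. The function names are hypothetical, and the demo measures against a throwaway listener on localhost so that it runs without any external network.

```python
import socket
import time


def tcp_connect_latency_ms(host: str, port: int, timeout: float = 3.0) -> float:
    """Time one full TCP handshake to host:port, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # the handshake has completed once the connection exists
    return (time.perf_counter() - start) * 1000.0


def demo_against_local_listener() -> float:
    """Measure against a temporary localhost listener (no external network)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        listener.listen(1)
        port = listener.getsockname()[1]
        return tcp_connect_latency_ms("127.0.0.1", port)
    finally:
        listener.close()


if __name__ == "__main__":
    print(f"handshake took {demo_against_local_listener():.3f} ms")
```

Unlike the probing described in the answer above, a plain handshake like this yields only an end-to-end number for the service port. It says nothing about the individual nodes and links along the path.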
>>There's one more that seemed to pop up on the screen from Paul. >>Oh, is there any performance-related impact to the probing? >>So that's an interesting question. There's a couple of places that the probing intelligence sits. The first place is on the probing machine, of course. So the rule of thumb here is if you're monitoring a path, two paths, three paths, something like that, it's really negligible performance impact. It's sub-5% of your CPU. So if you're monitoring a path or two, it's totally fine to put them on a production machine, whether that's a user box or even a piece of infrastructure. It does have to be a Windows device where you place that probe. Then the data that the probe gets is rolled up to polling engines and our database server. So you can get requirements or limitations around that from the NPM admin guide. But most people are concerned about that probing component, and if you're doing a couple of paths, you're just fine. It will scale up to 30 paths as necessary. And if you've got 30 paths, or if you've got more aggressive intervals, like monitoring at five-minute or three-minute intervals, that's the point where you want to start thinking about having multiple CPUs, and NetPath will take significant resources on this box. So all the requirements around that and some scalability numbers are in the NPM administration guide. >>Okay Chris, thanks for your time. Thanks for the great demos. Just to quickly wrap it up here, all of our products come with a free, fully functional 30-day trial. So you can download those directly from solarwinds.com, get a complete product overview, and demo them in your particular environment. We would also encourage you to go to www.thwack.com. Thwack is our online user community, which has over 150,000 registered users. 
It's a great place for getting information and sharing tips and tricks, not only on the Solarwinds products but also on networking in general. It's also where you can go see what we're working on for all of our different products, so you get an idea of what's coming next. It also provides a venue to give your input into what you'd like to see next in our future products. So with that in mind, I'd like to thank you all for attending, and we look forward to seeing you in webcasts in the future. Thanks a lot.
Info
Channel: solarwindsinc
Views: 7,528
Keywords: SolarWinds
Id: JggmX4PVVsg
Length: 57min 28sec (3448 seconds)
Published: Fri Nov 11 2016