>>Hi and welcome to today's webinar, Advanced Network Device Troubleshooting and End-to-End Visibility. My name is Brad Hale. I'm the Product Marketing Principal for Solarwinds Network Management Products. With me today I have Chris
O'Brien, our Product Manager, specifically for Network
Performance Monitor. Just a couple of ground rules here. Like I said, today's content will cover advanced network device troubleshooting and end-to-end visibility. We want to make sure that
no attendee is left behind, so please ask questions
using the Q and A box as opposed to the chat
box and we'll do our best to cover them all as we
go through the webcast as well as the demonstration portion. And if not, we'll make
sure we get those answered at the end of the webinar and there will be a follow up email that goes out with a link
to the recorded webcast as well as the answers
to the Q and A as well. So with that in mind, I'm gonna go ahead and go through a couple of
overview slides about Solarwinds and then I'll turn it over to Chris. So, Solarwinds' vision is to manage all things IT in a hybrid world. What that means is basically
we want to be able to provide you with the ability
to manage and monitor your IT infrastructure regardless
of where the applications and the underlying
infrastructure are deployed. That means it can be on-premises, it can be in a hosted or SaaS type environment, or in a cloud environment. We want to do that
while continuing to take a very user-centric approach
to our product development, which is to create
products that are modular in the architecture and
allow you to only buy what you need when you need it as opposed to selling a
giant monolithic approach that provides capabilities
or features or functions that maybe you don't need for
your particular environment. So with that, I'm gonna
turn it over to Chris and allow him to give you an overview of Network Performance Monitor. We'll start with talking about some of the advanced network
device troubleshooting and then we'll go through a
number of different demos. Chris >>Thanks Brad. Hello everyone, welcome and
thanks for joining us here. NPM, Network Performance Monitor, is Solarwinds' flagship product. I'm the product manager for NPM and I'd like to give you a quick overview. Then we'll take a look at some of the newer features that
is Solarwinds flagship product. I'm the product manager for NPM and I'd like to give you a quick overview. Then we'll take a look at some of the newer features that
we have to help you with advanced device
troubleshooting on your networks. One thing that we always
like to start out with is that we are a multi-vendor platform. We provide fault,
performance, and availability for your Cisco gear as
well as your Dell, EMC, F5, all sorts of different equipment. All of this gets aggregated
into a single view so that you can have
consistent reporting, alerting, and basically consistent visibility across your IT environment. We've put a lot of focus in
the last couple of releases on making sure we cover hybrid and cloud, and sort of are taking
up that new technology, providing visibility
for that new technology as people on-board that
into their environments in addition to our traditional sort of bread and butter
on-premises coverage. Some recent work also includes topology and dependency-aware intelligent alerts. You'll notice sort of the second half of the list of features
here really focus on how to deliver monitoring
that is less noisy and less dependent on the administrator to input all sorts of configuration. So we automatically detect topology. We can set it up to automatically have intelligent alerts, alerts that, for example, don't tell you about 10 servers going down behind a router that just went down. It just tells you about the router that's down, making it easier to troubleshoot and find that root cause.
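As a rough illustration of what dependency-aware suppression means, here is a minimal sketch of the idea, not SolarWinds' actual implementation: given a parent/child dependency map discovered from topology, alerts for children are suppressed whenever an upstream parent is already down. The device names and the dependency map are hypothetical.

```python
# Minimal sketch of topology/dependency-aware alert suppression.
# Not SolarWinds' implementation; device names and topology are made up.

# child -> upstream parent it depends on (discovered from topology)
depends_on = {
    "server-01": "router-edge-1",
    "server-02": "router-edge-1",
    "server-03": "switch-access-2",
    "switch-access-2": "router-edge-1",
}

def root_causes(down_devices):
    """Return only the devices whose failure is not explained by an
    upstream device that is also down."""
    down = set(down_devices)
    roots = []
    for device in down:
        parent = depends_on.get(device)
        suppressed = False
        # Walk up the dependency chain; if any ancestor is down, suppress.
        while parent is not None:
            if parent in down:
                suppressed = True
                break
            parent = depends_on.get(parent)
        if not suppressed:
            roots.append(device)
    return roots

# The router and everything behind it went down: alert only on the router.
print(root_causes(["server-01", "server-02", "switch-access-2", "router-edge-1"]))
# -> ['router-edge-1']
```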
We cover wired and wireless networks, including that mapping I just discussed. We have automated capacity forecasting, alerting, and reporting, sort of the traditional things for network monitoring. And one of the other key tenets of Solarwinds, and NPM specifically, is that it's designed for
you to deploy, configure, and maintain without consultants, without purchasing professional services. Solarwinds doesn't even
sell professional services because we are very
serious about making sure that our customers can use it
without professional services. Without further ado,
we'll start talking tech with our features here. The first feature I wanna go
over is Cisco Switch Stacks. So the cool thing about
Cisco Switch Stacks is they've really hit sort of a sweet spot for a lot of customers in terms of having scalability but not having the large upfront investment
of a chassis switch. They maintain this sort
of pay as you go model of fixed switches, but they give you that scalability of a chassis switch. So a really cool technology. A bunch of people make switch stacks, but Cisco by far is the most common. So we provide new monitoring,
we built new monitoring to give you better visibility
into Cisco Switch Stacks. Just a reminder of how
this technology works. Cisco Switch Stacks take nine or fewer stack-capable switches, as pictured on the right there. They look like fixed switches,
but they can form together and sort of pretend, or
as tech guys say conspire, to be a single switch. And the benefit of that is that you get a single point to do management from, a single thing to monitor, a single configuration file to handle. All of these sorts of things
that make management easier. You don't have to manage this
as nine separate switches. So that's great and all, but that introduces a layer of technology above and beyond a standard fixed switch. And traditionally that layer of technology is not something we've
had visibility into. So what I'm talking about
specifically there is to form a single switch, a logical switch from a set of physical switches. A couple things need to occur. First of all, you need to
connect them all together, as you can see that stacked cable on the left-hand side of our image there. You need to connect them
all together in that form. It forms the backplane of what would otherwise be a chassis switch. The devices need to talk to each other and elect a master switch, which is effectively the brains of your switch, or sort of like a supervisor. You can think of this as your supervisor in a chassis world. And that device handles all of the operations of the switch and sends data down to the ASICs of all of the other switches to program them for what they need to do, and it presents a management interface to you, the administrator. So that master switch election occurs, and ideally that's something we'd like to get visibility into because it's super critical to the function of the switch stack. Also, all of these devices have their own physical resources, right. And historically you've been
able to see the interfaces for all of these devices no matter what, but the CPU, RAM, and
hardware health for each one of these components is
not something historically you've been able to see
for each one of them. You just see it for the master. So that being the case, it is true that each one of these physical switches has a finite amount of CPU, RAM, and has hardware health components. And when we say hardware health, we're talking about things like power supplies, potentially multiple power supplies, fans, fan speed, temperature, all of those sorts of components that you need to keep healthy for the overall platform to be healthy, for the hardware to be healthy. And finally, there's this stack ring that's so important, that forms the backplane of our switch and that, in this stacking environment, is now external to the switch, whereas with a chassis it was internal. So there's a greater potential that there will be some sort of break, a loose cable, something wrong there that's affecting your switch stack's capacity or redundancy as a whole. So we need to get visibility into that.
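If you want a feel for where this per-member data lives, Cisco exposes it through the CISCO-STACKWISE-MIB. The sketch below is not how NPM itself polls; it just shells out to net-snmp's snmpwalk to pull each stack member's role and state. The host, community string, and the exact OIDs are assumptions from memory that you would verify against your own gear and MIB files.

```python
# Rough sketch: poll switch stack member roles/states via SNMP.
# Assumes net-snmp's snmpwalk is installed; host, community string, and
# OIDs (from CISCO-STACKWISE-MIB as I recall them) should be verified.
import subprocess

HOST = "10.0.0.10"          # hypothetical stack management IP
COMMUNITY = "public"        # hypothetical read-only community

OIDS = {
    "role":  "1.3.6.1.4.1.9.9.500.1.2.1.1.3",   # cswSwitchRole (master/member/notMember)
    "state": "1.3.6.1.4.1.9.9.500.1.2.1.1.6",   # cswSwitchState (ready, progressing, ...)
}

def walk(oid):
    """Return the raw snmpwalk output lines for one OID subtree."""
    out = subprocess.run(
        ["snmpwalk", "-v2c", "-c", COMMUNITY, "-On", HOST, oid],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()

for name, oid in OIDS.items():
    print(f"--- {name} ---")
    for line in walk(oid):
        print(line)   # one row per physical stack member
```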
So let's jump right in. We will cover all of those things in this demo here and show you what that looks like. Let me pull over our test or demo environment and make sure that I get logged in there. Hopefully everyone can see my screen. >>While that's coming
up, there was a question. Is this just Cisco Switch Stacks or does this support Juniper
Virtual Chassis as well? >>This release is just Cisco
Switch Stacks right now. It's by far the most popular vendor. But as you can see, we support a large number of vendors. For this specific piece of technology, the first release of coverage is Cisco. So let's jump over to our NPM summary page and find us a switch stack to look at. We'll jump over to this 3750. So there's a new resource here in this 12.0 release that will list out your switch stacks as well as their current status, and this is inclusive of the
data ring and the power ring. It's more interesting
when we have problems, so we'll take a look at this 3750 rig here that seems to be reporting some problems. We'll cover 3750 as well
as 2960S switch stacks and a number of other models. Reminder that we have great
documentation in our admin guide if you just google NPM admin guide or you go to our Customer
Success Center at solarwinds.com you can get more information about that. We have the traditional monitoring, but I'm gonna use this new
tab that's showing up here on all of my switch stacks. And I'll select and click over to that and we'll see what
information comes up here. So the first thing and
most obvious challenge to get solved here is listing
out the switch stack members. So in this switch stack, we appear to have six member switches and
we have the model numbers listed here for each one. And that's really helpful
because it lets me know how much capacity I have for power over Ethernet, represented by this P, or just regular ports on each one of these. It also gives you an indicator, if you know your Cisco Switch model numbers, of what your backplane performance is and all sorts of stuff that goes into or makes up that model number. So I've got six switches here. This little icon on the left will tell me that this is the master switch. My switch number one is the master switch here, so that's good. That's probably what I've configured. It looks like switch priority is seven. So yeah, that is what I configured here. And it looks like in this case, I've configured other switches to have less and less priority
for that master election. So the highest priority that I configured does have master and I have configured properly backups for that. Another thing that's
unique to Switch Stacks is because all of these physical members, in this case six members, are pretending to be
the same logical switch, it gets a little confusing when you try and find the switch in the closet, like the wiring closet
or your data center, because they have the same host name. So if you're used to physically labeling and you get a little label maker and put the host name printed out on each one of the switches
and then you go and find it, that becomes a problem because each one of these
has the same host name. So that's why we have serial number here. The serial number's the
best way to identify the unique hardware that is acting as switch number one or switch
number three or whatever, in the switch stack, and it's printed on the back of all of your physical Cisco Switch Stack devices. We also have MAC Address here. And you can start seeing the CPU and RAM and other information that's specific to the hardware health of that one device. Historically there's been
a couple of different ways that Cisco has presented this information, but in all cases it ends
up being a poor summary unless you can really drill down into the individual components. If you have CPU or RAM high on one of these, you wanna know. You don't wanna see that the average CPU, for example, is 20% when that average is comprised of 100% on one switch. I wanna know if there's 100% on one physical member. So you can of course alert and report off of all of these, but step one is knowing that this logical entity that we call EW3750A is comprised of six switches. We know the master, we know the RAM and CPU for each, and we can find them in the wiring closet. The next component I'd
like to discuss here is these interesting looking
resources on the right. These are visualizations
for the stack rings. The switch stacks have two
different types of rings depending on the capabilities
of the switch stack. The first type, and the one that's present in every single switch
stack, is the data ring. So this is the cable that forms the data backplane between these different physical switches. And that cable will go from, for example, switch one over to switch two, from switch two to switch three. And if we look at our picture here, we can actually see that
over on the left-hand side you can see the cables sort
of spidering about there and they make a loop. Looking here we can see that
represented as a loop or a ring and this is actually a cool thing for me because sometimes as an
administrator I forgot that this thing was literally a ring. It's a loop or a ring topology. And so each one of these needs to be connected and it will come back
up to the first switch. But when you look at
the back of the switch sometimes that's not obvious or logical and it makes it harder to troubleshoot. Here, showing it as a ring,
we can immediately see the cable between switch number four and switch number five is not functioning. And the result of this is that we've got half redundancy. Our bandwidth is reduced to half. And if we were to have another failure in that data ring, there would be a catastrophic failure, so we've lost a lot of our redundancy of that ring. We're in a backup mode where data is still passing.
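To make the redundancy math concrete, here is a small sketch, my own illustration rather than NPM's logic, of how you might interpret data ring health from per-member stack port states: zero down ports means a full, redundant ring; one break means traffic still passes but bandwidth and redundancy are halved; more than one break splits the stack.

```python
# Sketch: interpret data-ring health from stack port status.
# Hypothetical inputs; NPM derives this from the device, this is just the idea.

def ring_health(port_states):
    """port_states: per-member stack port states, e.g. 'up'/'down'.
    A broken cable shows up as a down port on each of the two adjacent members."""
    breaks = sum(1 for state in port_states if state != "up") // 2
    if breaks == 0:
        return "ring intact: full bandwidth, redundant"
    if breaks == 1:
        return "ring broken in one place: data still passing, half bandwidth, no redundancy"
    return "ring broken in multiple places: stack is split, catastrophic"

# Six members, two stack ports each; the cable between switch 4 and 5 is broken,
# so the port on either side of that cable reports down.
ports = ["up"] * 10 + ["down", "down"]
print(ring_health(ports))
```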
So very cool to have this visualized for you, and it makes it a lot easier to troubleshoot problems. >>Chris does this work on
all Cisco Switch Stacks or is this limited to certain models? >>This works on the vast majority of Cisco Switch Stacks
including the 3750, 3600, 2960S and the complete list should be available in the NPM admin guide if you
want to check your version if it's not included in the
ones I've specified there. >>How about Nexus? >>Nexus are not switch stacks. Nexus is chassis-based and has a sort of distributed architecture, but that's definitely not stacking. So that will be another topic. >>Okay >>The next thing I wanna look at is the stack power ring. So we've handled the data backplane. The next thing, the sort of
innovation that Cisco came up with, or someone came up with and Cisco implemented, is sharing of power. So if you've got a set of switches, say four switches, then in a stack environment it doesn't really make sense for each switch to have two power supplies for redundant power. You want more like an N+1 configuration so you don't pay so much to get your redundancy in power. You want to be able to absorb perhaps a single power supply loss, but you don't need to absorb half of your power supplies going out. So the stack power ring is another ring of cables at the back of your switches that allows one switch to share the power from the power supplies it has with the other switches as necessary. So traditionally what that looks like is you have a set of switches and each switch in that ring has a single supply, and then you select a single additional switch to have two power supplies. So that gives you that N+1 redundancy. If any of the switches loses a single power supply, it will get additional power from that remaining single backup power supply.
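A quick way to think about that N+1 math, as a hypothetical sketch rather than anything NPM computes: the stack is safe if the installed supplies minus the single largest supply can still cover the whole stack's power draw. The wattages below are illustrative.

```python
# Sketch of the N+1 reasoning for a stack power ring (illustrative numbers).

def survives_one_psu_loss(supplies_watts, draw_watts):
    """True if the stack can lose its largest single power supply and still
    power the whole stack from what remains."""
    remaining = sum(supplies_watts) - max(supplies_watts)
    return remaining >= draw_watts

# Four members: three with one supply each, one member carrying a second supply.
supplies = [715, 715, 715, 715, 715]   # five supplies total across the ring
draw = 2600                            # hypothetical total draw of the stack
print(survives_one_psu_loss(supplies, draw))   # True: 4 * 715 = 2860 >= 2600
```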
One of the things that's interesting with the power rings is that you can't fit as many switches into a single power ring. So where traditionally you have
a maximum of seven or nine, depending on the model,
switches in a data ring, you can only have up to four
switches in a power ring, and this ends up being a limitation of how much power you want to send through a single physical cable that's connected to the back of all of these switches. So with that limit of four
that means you sometimes have a data ring or single logical switch that contains multiple power rings. And so, you want to
visualize that as well, and see that ring and be able to detect that, for example, the second ring is running in power sharing mode so we may have configured
that for redundant mode and now we're running in power sharing as a result of losing this one cable. So again, we've lost redundancy,
we're still functioning, but we want to know about
that redundancy loss and go back and resolve that, so we're not always sitting at the edge, we don't have our production environment sitting at the edge of
a catastrophic failure. That sort of outlines the coverage we have for Cisco Switch Stacks. There's a couple other resources in the other more
traditional Solarwinds views that have the power supply,
fan, and temperature settings or readings for each one of
these physical components. But that gives you the base level of coverage that we want for Cisco Switch Stacks to keep them healthy in your environment. And of course it's all reportable. You can alert when
there's a broken data ring or power ring or a switch is added, there's a new member to the switch stack, or a master election change. Basically all of the
stuff you're seeing here, you can alert upon,
which is really helpful to get just in time notification that you need to go and do some work. >>Chris, we've got a
user here, an NPM user, that says they've got a
3750 running as a single switch, but NPM is identifying
it as a switch stack. >>That's one of the bugs
that we're aware of, so definitely something that
should not be happening. But basically what happens
is if you have a switch that is capable of
stacking, like the 3750, we will report that as a switch stack, so you'll see it's technically a stack of one. Now from the Cisco side this is correct, right. This thing is a stack of one, you do have a backplane being reported, you do have master election that occurs, and all of that sort of stuff. So that shows up here as a switch stack with one member. But in general as users we don't care about those things unless there are multiple switches in the stack. We have a feature request to make that appear as not a switch stack. That seems to be the
common request from users. >>Does this work in a VSS configuration? >>So VSS tends to be more
like chassis switches, for example the 6800 or
6500 series chassis switches from Cisco, and that's not
really switch stacking. That's a chassis technology sort of similar in that
two switches act as one, but you can have six switches,
for example, in a stack. It's sort of a different technology, so the stacking stuff
does not apply there. That's not covered. Any other questions? This is what we have to talk about for the Cisco Switch Stacks. We've got two others to discuss here, two other new features to discuss here, but I want to cover any questions we may have about Cisco Switch Stacks. >>Yeah, really quick on this. This new switch stack
monitoring capability is part of NPM version 12? >>That's correct. >>And it does not require any additional modules
or anything like that, it's just part of the core NPM capabilities? >>That's absolutely right. If you have NPM, you have
Cisco Switch Stack Monitoring. My other note is we do get
requests fairly frequently about adding other vendors and that's something that we're thinking about. If you have specific switch stacks in your environment that aren't covered by this Cisco Switch Stack Monitoring, please do shoot us a message either via email or even in the Q and A and let us know and we'll
think about how to add that. >>Okay, without further ado, we'll jump over to our next topic here. >>So we did our switch stack demo, the next thing I'd like to talk about is deeper insight into F5 load balancers. So it's a really interesting
thing in networks today: as of 10 years ago, networks had routers, switches, and to a lesser extent wireless access points. And if, as a network engineer, you had all of that stuff running well, you were doing your job. Your network was running well. But today's networks include these things that we call modern network appliances. And these things cover
things like Cisco ASAs or any firewall type, any load balancer, web proxy, WAN optimizers,
and so these things sort of are sprinkled
around your environment. They tend to be much fewer in number than the switches and
routers and access points in your environment,
the access layer stuff, but they tend to be extremely important. It's for two reasons really. They tend to sit in
bottlenecks of your network. For example, firewalls sit
between your enterprise network or your data center network and the internet, or between your network
and partner networks, and if that single firewall,
or ideally a redundant pair, if that goes down then your entire network loses critical connectivity, for example internet connectivity. So even though you don't
have many firewalls, they're exceptionally
critical in your environment. The same holds true for F5 load balancers. For F5 load balancers,
they have a unique position because they're the gold
standard load balancer but they're also by far the
most popular load balancer. So a really high quality load balancer and because they are so high quality, they cost a lot of money. So it's not uncommon for enterprises to spend 50, 100, or hundreds
of thousands of dollars on F5 load balancers and
they make that investment because of how critical the services are that the load balancers
provide service for. So these load balancers
tend to sit in front of the most important services
in the entire company. Solarwinds is an example. Solarwinds.com sits behind load balancers. So if those load balancers
malfunction, go down, or are otherwise not providing proper service, then our website goes down,
which is a big deal for us, at least, and tends to be a big deal for any of our customers
when those services behind the F5 go down. There's sort of a disparity between the number of devices
and their criticality, and this is something we've
noticed and we want to go fix. We wanna make sure that these devices that are not routers,
that are not switches get the level of coverage that they need, that's commensurate with their
importance in the network. In addition to sort of the standard stuff we do for routers and switches, covering CPU, RAM, interface
utilization for F5, we also cover things like connection count. And that connection count can be sort of sliced in a bunch of different ways. You can look at connection count for an LTM, the whole appliance, for virtual servers, for pools, for specific pool members. You can look at DNS resolution requests for a GTM, directly for the whole thing or for an entire service on the GTM. So this whole set of GTM and LTM are reporting data that you can slice any which way to try and understand the health of the environment or specific servers, services, or logical components. So it's very relational.
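For a sense of where those connection counts can come from outside of NPM, F5 exposes them over SNMP. This is a rough sketch, not NPM's polling code: the host and community string are made up, and the MIB object names, from F5-BIGIP-SYSTEM-MIB and F5-BIGIP-LOCAL-MIB as best I recall them, should be checked against the MIB files shipped with your BIG-IP.

```python
# Sketch: pull connection counts from an F5 BIG-IP over SNMP at different levels.
# Hypothetical host/community; verify object names against your BIG-IP's MIBs.
import subprocess

HOST = "192.0.2.50"
COMMUNITY = "public"

QUERIES = {
    # whole appliance: current client connections
    "ltm appliance": "F5-BIGIP-SYSTEM-MIB::sysStatClientCurConns",
    # per virtual server: current client connections (one row per virtual server)
    "virtual servers": "F5-BIGIP-LOCAL-MIB::ltmVirtualServStatClientCurConns",
    # per pool: current server-side connections
    "pools": "F5-BIGIP-LOCAL-MIB::ltmPoolStatServerCurConns",
}

for label, obj in QUERIES.items():
    out = subprocess.run(
        ["snmpwalk", "-v2c", "-c", COMMUNITY, HOST, obj],
        capture_output=True, text=True,
    )
    print(f"--- {label} ---")
    print(out.stdout or out.stderr)
```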
The other thing that we cover is sort of the bread and butter stuff for the platforms. So things like software versions, serial numbers, as well as the H/A status. And not only are these things reporting to be H/A and what failover status they're reporting, but also whether they're synced up, whether they're ready for failover. So it's very important. We have all of this in sort of a very relational view, and we'll jump into the demo in about 60 seconds here, but it's a very relational view where we show the servers and the pools and the nodes and all of the statuses and
how they connect together, which is very different
than how routers work in the relational way
that the technology works. Connection counts we talked about. We also allow you to drill all the way down to the health monitors. A lot of times when a service goes down, you have health monitors deciding that certain servers are unhealthy, and if you want to get to the root of the problem, you need to know which health monitors are deciding what about your servers, and we can do that. And finally, we get down to the individual pool members.
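Since this hierarchy comes up again and again in the demo, here is a tiny sketch of the relational model being described: services depend on GTMs and LTMs, which front virtual servers, which contain pools, which contain pool members, and status rolls up from the bottom. The class and field names are my own shorthand, not NPM's schema.

```python
# Sketch of the relational model NPM presents for F5: status rolls up
# from pool members to pools to virtual servers to the service.
# Names/fields are illustrative, not NPM's actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PoolMember:
    name: str
    up: bool

@dataclass
class Pool:
    name: str
    members: List[PoolMember] = field(default_factory=list)

    @property
    def up(self):
        # A pool is serviceable if at least one member is up.
        return any(m.up for m in self.members)

@dataclass
class VirtualServer:
    name: str
    pools: List[Pool] = field(default_factory=list)

    @property
    def up(self):
        return any(p.up for p in self.pools)

@dataclass
class Service:
    name: str
    virtual_servers: List[VirtualServer] = field(default_factory=list)

    @property
    def up(self):
        return any(v.up for v in self.virtual_servers)

svc = Service("ADFS DNS North America", [
    VirtualServer("vs-adfs", [
        Pool("pool-adfs", [PoolMember("10.0.0.11:443", False),
                           PoolMember("10.0.0.12:443", False)]),
    ]),
])
print(svc.up)   # False: both pool members down, so the service rolls up as down
```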
So, I'll stop talking about it, we'll start looking at it here. Within Solarwinds, we'll jump back over through my dashboard and network. This is the subsection that comes with NPM. We'll jump over to load balancing. Right off the bat here,
this is very different from everywhere else in Solarwinds NPM. We've got this sort of
balancing environment view. And when you think about load balancing, it functions based on the relationships of many different components, some of them physical
and some of them logical. Sort of the core of how
load balancing works. So at the top, we have
a list of our services. These are the things
that we're providing to the world, the things that we bought the F5s to make sure are up. So that's the top level thing. And now, to get this service up, this service depends on
a global traffic manager, local traffic manager, virtual servers, pools, and pool members, all that sort of feed up into the service. And we can see that here. We can see through the
status of each component. Now if we have a problem
with one of these components, we can drill into it. So there's a couple of ways to do that. If I click this specific service, this ADFSDNS North America,
and click on show relations, on the back end we have polled and understand how all of
these components are related. So this specific service depends on this global traffic manager. And in this case, this
global traffic manager was detected to be a pair. And this indicates that these
two are forming an H/A pair. This guy is active. The H/A status, you can see there, is in sync in the bottom
of that hover-over. So they're prepared for failover, but one of the backups is
gray and this is unknown. It looks like this hasn't
been added to NPM yet, so we would want to go and add that to NPM, and you would get that coverage: the device not only noted, but we'd also start getting statistical information from it. And in this case we're
having a problem, right? This service comes through
this global traffic manager, local traffic manager, virtual server, pool, and pool members and there's problems with each one of these. If we sort of hover over, we can see not only the status, but
also the status reason. And we can sort start to walk
through this problem area and get to the root cause
relatively quickly here. So the service as a whole,
this ADFSDNS North America, the status is down, the status reason is no enabled pool, so we can see the pool is dependent on the global traffic manager and the local traffic manager. There's only one virtual server. That virtual server has a single pool in it. So let's jump down to that pool that's reporting the problem. Now here we can see the
status of the pool is down and it says the children
pool members are down. So in this case, we've got
two children pool members, so we can look at the status of them. One says unable to connect, no successful connects
before the deadline. So our timer has expired. And we'll jump deeper
into that in a moment. Then this guy says availability unknown. So, the unknown status is coming from F5 and F5 is unable to get that availability. So let's drill into one of these. So we're clicking in. And this time we don't
wanna show relations, we wanna show the details space. And when you go into the detail space this is of course going to tell you the stuff that's relevant
to that specific component. So we have our status and
our status reason again, all of the pools that this pool
member is participating in, what F5 server it's on,
the number of connections, how many connections per second, so this is sort of concurrent
as well as per second, and how much bandwidth
it's taking as well. Importantly when something is down, you'll come over here and
look at the health monitors and in our lab we always have sort of data consistency, data
accuracy, challenges with getting all of this in a single lab. But in this case, you would see a list of the names of the health
monitors that it's assigned. In a production environment, you would see this specific health monitor
is down and we would show you that health monitor status reason as well. So you can very quickly
drill all the way down to your problem, even
in complex environments that include multiple
physical components, many logical components. Now one of the other things
you may have noticed here is that our balancing
environment, this sort of mini-stack or miniature view of the balancing environment, has followed us. And the component that has followed us, or the view that has followed us, is filtered down to anything that depends on that pool member, showing us all the things this specific pool member is related to. And as we mouse over these things, the mouse-overs contain data that's relevant, the right data for that specific component. So if I look at global traffic managers, we can see, for example,
its IP, the hosting node, its H/A status, and
the number of requests, requests per second
here being DNS requests. Whereas if I mouse over the LTM, I can see the number of connections and this is connections across
the entirety of the LTM. If I mouse over virtual server, I can see the number of connections specific to that virtual server, the port number that that virtual server's providing service on, all of
these sort of right pieces of data for that specific component. So I'm going to jump into one of these. I see my global traffic manager
here is reporting badly, so I'll jump over, and I can continue to use that resource as a navigation function. Jumping over to this GTM, we can see DNS resolution by service, we can jump into each one of these services here. We can list out our services
and see their components and see how they're doing
their load balancing. We can also, really
interestingly, sort of track the relationships in this environment. So this global traffic manager has six services that depend on it and it feeds through only a single local traffic manager. So there's no redundancy at the local traffic manager, at least for any of the services that depend on this global traffic manager. So you can very quickly take a single physical asset, in this case a GTM, a global traffic manager, and understand what, for example, pool member servers are depended on by that GTM's function. It's a very cool relational view. Okay, so hopefully some
questions have started coming in. >>Yes, the first question is what version of F5 does this support? >>So there's two pieces here. The vast majority of this functionality is supported by version
of TMOS 11.2 or later. And all of this data is acquired via SNMP. Now for two things specifically, the Health Monitor
Status and Status Reason, as well as pulling a server
in and out of rotation, or more correctly, pulling a pool member in and out of rotation in a pool. That functions through the iControl API. So that requires TMOS 11.6 or later, if I'm remembering right. And just to double check this and compare against your environment along
with the other requirements for this functionality, check
out the NPM admin guide. There's a section for
network insight for F5 and specifically a requirements section to make sure your environment fits. Any other questions? >>No, that's it for now. >>Okay, we'll jump into a pool and let's see if I can take a
look at the pool members here. So within a pool, of course we will list out the members in
that pool, the pool members. We'll show you their load, so
you can compare relative loads. See how well your load balancing algorithm, in this case round robin, is distributing load between these two. They're quite a bit different. Maybe our round robin, maybe
we would want to do something more specific or think about how we're tracking state information and get that more well-rounded. Or at least make sure that
each individual server can carry that larger
portion of the capacity. The other thing we have
here is some capability to change the rotational presence, so to change what members are actually available in this pool. So clicking change rotation, you can just turn these off and on. What we found is although
there's a lot of things you can configure on load balancers, sort of the most common operational task, what you do 80% of the
time when you're logging into a load balancer, is you're
simply taking pool members in and out of rotation so that you can do some sort of maintenance
or do some sort of transition over to a new version of a site. So you can do that
directly in this interface. You would click one of these. Now in the demo environment this is disabled, but when you click one of these it will give you a warning, tell you what you're about to do, and make sure you're wanting to do that right now. And then upon selecting okay, we will go ahead and connect over with the API and disable that pool member.
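For anyone curious what that kind of API call looks like outside of NPM, F5's iControl REST interface lets you do the same thing directly. This is only a hedged sketch against a hypothetical BIG-IP and pool; the endpoint and payload reflect iControl REST as I understand it, so verify against F5's iControl documentation for your TMOS version before using it.

```python
# Sketch: take a pool member out of rotation via F5 iControl REST.
# Hypothetical BIG-IP address, credentials, pool, and member; verify the
# endpoint/payload against F5's iControl REST docs for your TMOS version.
import requests

BIGIP = "https://192.0.2.60"
AUTH = ("admin", "password")            # use real credential handling in practice
POOL = "~Common~pool-adfs"              # pool name in iControl's ~partition~name form
MEMBER = "~Common~10.0.0.11:443"        # pool member to disable

url = f"{BIGIP}/mgmt/tm/ltm/pool/{POOL}/members/{MEMBER}"

# "user-disabled" stops new connections while letting existing ones drain,
# which matches the expert tip that follows about giving it a few minutes.
resp = requests.patch(
    url,
    json={"session": "user-disabled"},
    auth=AUTH,
    verify=False,        # lab only; use proper TLS verification in production
)
resp.raise_for_status()
print(resp.json().get("session"))
```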
Finally, there's some expert tips here. One of the expert tips in this scenario: after you turn a server off, this is sort of designed to not produce a lot of impact to your users. So users currently sticking to a specific server will start to siphon off gradually, but the current users will still be served by that specific server. So, here we're suggesting that you should give it a few minutes after
you remove a server from a pool to make sure that that server is not used by many of your production users. And there's sort of expert tips like this sprinkled
throughout the product. And click in and you can
get more information. So we will jump over now to
our next feature, I think. Any more questions about F5? >>Nope >>Okay, great. So we'll jump over to our next feature. That is NetPath, here. Let's get this showing up. That switched on me. Actually, we'll just take a look straight in the slide deck here. NetPath provides visibility across the entire service delivery path, particularly for cloud
and web-based services, so hybrid environments
where part of the path is your environment, part of
the path may be internet, part of the path is some
sort of service provider. Things like our access to salesforce.com is something NetPath is designed to understand and help you troubleshoot. So you can see things
like where the problem is, who the responsible party is for that node having the
problem with that link, and how to contact them. As a reminder here that traditionally, when we're thinking
about network monitoring, we're heavily focused
on infrastructure gear. We ask the infrastructure gear, "Hey infrastructure
gear, how are you doing? "How are your interfaces doing? "How is your fan doing? "How is your CPU doing?" all of these different questions
to the infrastructure gear. But the reality is our users
don't depend on that directly. They depend on the services the
infrastructure is providing. So at its most fundamental level, networking, we as network engineers, we're delivering services to users. Users can be people or users can be, for example, a web server that uses a SQL server. But effectively we're delivering a service to a user. And NetPath is really designed to give you visibility across that service delivery from your user base, even when it's server users, like a web server as a user of a SQL server, and do that locally, remotely, over the internet, for basically all of the different types of environments that fit within the hybrid IT environment that most paths cross today. And we do this by deploying a probe only at the source. This is really important. This is basically representative of your user base. You can deploy it on the user's machine or on a machine that's
sort of adjacent to it in the same office, for example. Depending on how you want to do it. And this probe, with no
other instrumentation, will detect the path, the
performance of the path, and then give you a granular view into how each component along that path is impacting the end-to-end performance. This makes it much, much easier
to start troubleshooting. So let's take a look at what that looks like in NetPath here. So again, we'll go to my dashboard, over to the network tab from NPM, and finally, NetPath Services. So here we have a list of services that I'm monitoring
with Solarwinds NetPath. A reminder here that
NetPath is a feature of NPM. So if you have NPM, you have NetPath. Now in this list of services,
we're monitoring things that are remote like Google, Office
365, Salesforce, and so on, but we're also monitoring our AWS Lab, which is sort of a combination of their network and ours, and which we have some responsibility for, at least the logical infrastructure. And we also have some portals here. The takeaway here is that NetPath will work for
any TCP based service, regardless of where that service sits. So as long as it's TCP
based, it will work. So let's take a look at what creating one of these paths looks like. So if I want to monitor,
for example thwack.com, a site near and dear to me. And I know that runs over
Port 443, that's encrypted, that's SSL, so it's Port 443. I'll put that in. We work fine with the encryption here. We can put in an alias,
some nickname if we want. We'll put a probing interval, so how often am I gonna probe
that thing, and click next. After I specify the destination service, I find a probe. I can use one of the probes that I've already deployed to one of my offices. I can use the main poller or any of my pollers already in Orion, or really do what NetPath
was designed to do, which is deploy a probe close to my users. And we'll handle that
probe deployment for you, we'll just put a small agent out there and we'll centralize and manage
all of that functionality. And I click create. Now that won't work in the demo here, but I just wanted to show you how easy it is to create one of these paths. So then you get that component or that service added to this list. If you drill down into
one of these services, we can see Salesforce being monitored from the Austin lab. So let's go ahead and jump into that path. I'll click on that. This takes us to the path inspector page. So that path inspector page
gives us a visualization of the path from the probe
all the way over here on the left, all the way to Salesforce. And we can see here sort of a summary from this probe, our Austin lab probe, to Salesforce. We're monitoring exactly www.salesforce.com, so we will cover DNS resolution and make sure that's good. And we're covering that in this case on port 480 for whatever reason; that should be the correct port. At the bottom, we can see Salesforce, specifically for this location, is handled by two different servers.
So the bigger the service is, the more servers there tend to be providing coverage for it. And if they have, for example, F5s doing geographically dispersed load balancing, then you may get some regionality. So in this case, from our lab environment, we're seeing two specific servers, whereas from our production
environment or different branch we may see completely different servers. The important thing here is
my users at this location, they depend specifically
on these two servers. And that's what NetPath
is designed to show you. So we sort of start
drilling into this path. We can see from the source, I go to R1, R2, R3. There's some multi-path going on here. Continue to go through our network. And we eventually get
to our service provider, and we actually use Time Warner Cable at this specific branch at Solarwinds. Time Warner Cable has two
autonomous systems, two networks, due to, if you're aware of the history with Time Warner, some mergers and acquisitions with Time Warner Telecom versus Time Warner Cable. We can actually see that surfacing in their network architecture here. They connect through
both of their networks. They connect us over to TeliaSonera. Now, TeliaSonera's a backbone provider. You can see it listed there, they're an international carrier. And they aren't someone we have a direct business relationship with. We don't pay them for
internet connectivity. We don't pay them for a SaaS service. But we definitely depend on them and our users depend on them for salesforce.com access
to function properly. So of course, they're discovered here and they're represented here and their performance
components are shown. And finally, we get to
salesforce.com's own network. Each one of these autonomous systems, these sort of bigger circles here, that's a network and I can click on that and it will expand out for me the nodes that comprise that specific network. And so I click on all of these. And we've got some default summarization because as it turns out,
the internet's complicated. And there's lots of nodes and
links that you can depend on for all of your internet-based services, but using NetPath, we
can really open this up, expand this to the extent we need, and detect where the problems are. Now one of the cool things with NetPath is we assign a latency
and packet loss value to every single link and node
in this topology, in this path. So we can hover over, for
example, this top link and we'll see that 25% of our traffic is taking this link and
receiving two milliseconds of delay specifically on this link. Whereas some 13% of our traffic
goes across this second link with six milliseconds of delay. And you can continue
to walk through all of these different components
and see how much delay and how much of your traffic is going through each one of these links. As you click through, you can also see the ownership information. So this node here is owned by Salesforce. And if I were to need to
contact them, I could do that. So let's take a look at a
specific troubleshooting scenario and see how this would help us
troubleshoot a real problem. Let's start with the scenario. So say for example, someone
came over to our desk, we're the network
engineers at this company, someone came over to our desk and said "Hey I had a problem
browsing salesforce.com "at around noon today and
I had a bunch of problems "and my neighbor had a bunch
of problems, my cube neighbor, "but a whole bunch of other
people did not have problems. "So what's going on here? "I need to solve my problem, "but it's working for some people. "Is this a laptop problem? "How do I fix this?" So as the network engineer,
you start drilling in. And if you've got NetPath Monitoring, you may notice across the
bottom here, we've got some red. So this bottom pane here
is our path history. So we can see both our availability,
sort of up down status, as well as our latency numbers here. And we can see our
latency number over time. And we can see through
most of the day, 11/9, we had about 70 milliseconds or so. It looks like there's
quite a bit of variation, which makes me a little bit concerned. But in general, the latency
was under 125 milliseconds. But around noon that spiked up. So let's go ahead and
click on this interval. And that will actually
load the topology graph, or the path graph, that
was occurring at that time. So it will show you the path
that was being used to deliver this end-to-end performance
of 301 milliseconds. So when our end-to-end latency was 301 milliseconds, our
path looked like this. Very quickly, we can see
a couple pieces of red. So I'm gonna sort of let this summarize back down so this is a little bit easier to look at. And we can see on the right-hand side we're still using the same two servers. We didn't switch servers
or anything like that. We can mouse over them and
we'll see our average latency to this server's 280 milliseconds. The bono server's 322 milliseconds. Definitely a problem applying to both of these destinations,
so that's not good. But we can see here, NetPath
has already identified where the source of
that latency problem is. So I'll zoom into that. We can see that there's 234
milliseconds on the LAN link because this is between two of my nodes. Now, NetPath knows the difference between your links. It knows whether this is a connection that's sort of a long haul link that goes over the ocean or satellite or something like that, versus a short LAN link that goes in your wiring closet, maybe between two devices in the same wiring closet. And using that awareness of how long this link is, it can understand whether this 234 milliseconds is some sort of crazy problem for a LAN link or if that's to be expected with a satellite link. So NetPath is saying this
is out of the ordinary. This is a problem, this
latency of 234 milliseconds. We can also see 67%, most of our traffic, is going over this link. And when you're troubleshooting,
one of the most powerful troubleshooting tools
you can use is differences. So we're gonna take a look at the interval right before the problem,
I'll click on that. And NetPath will load our
topology, or our path for that. We can see that same
link between R3 and R5. That was running at three
milliseconds before. And now, during the problem period, we're sitting at 234 milliseconds. So definitely a problem with this link, this connection between these two routers. Now if you're monitoring
those routers in NPM, in NCM, our configuration management backup tool, and in NTA, our NetFlow tool, you can see that mapped over, so NetPath has detected this node, but it's also pulling in more information from our on-premises monitoring. We can see R3 here. The node name is R3 and
it's a Cisco 7206 VXR, CPU and RAM and all these sorts of things. So with this high latency, the last thing I'd like to mention is that the high latency is
caused by some sort of problem. And because we happen to
have on-premises monitoring, SNMP access, and we have NPM and
NCM monitoring these things, NetPath will attempt to find
the root cause of the problem. Here it's telling us there
was a configuration change and we will click on that
configuration change. It will go ahead and pull the configs, it will do a diff for us,
and if we scroll down here, we can see this highlighted
line is the line that changed. And it looks like that
was a policer change, or a traffic shaper change. Our traffic shaper changed from
one megabit to 10 megabits. So we're queuing some more traffic because we're doing traffic
shaping, interesting. And if we look, we can see that traffic shaping command
changed on Ethernet-10. So we'll get back to our topology and we can mouse over,
even down to the interfaces, and see that yes indeed this connection
is coming into Ethernet-10. So that connection
that's having the problem is the interface that we made
a configuration change on. So we can select that. That will again connect
back to our NPM data set and click to go over to
interface details for that. But we can also see here what traffic is going across that interface. Now we saw a traffic shaper was applied. Traffic shaping happens
as an egress function or an outbound function
for that interface, so I can select egress here. And I can start to look at what are the largest flows or data conversations that are happening across that interface. So you really see the
end-to-end performance and then drill down into
a specific time slot, see a specific component
introducing a delay or some packet loss, some sort of performance
characteristic that's degraded. Drill into the specific component and drill all the way down, in many cases, to the root cause, explaining why that performance degradation is
occurring on the specific node. So that's my quick demo of NetPath. I'm curious if there's
any questions, Brad. >>The big question, obviously,
is how does this work? Is it using ICMP, is it
using some other technology to gather the data? >>Yeah, so the short answer there is we have a proprietary probing algorithm that uses a number of protocols. The funny thing is we started with traceroute, we thought this was an easy problem to solve, but we found out traceroute was blocked by many, maybe even half of enterprise environments, including our own, so that didn't work for us. We also noticed that traceroute doesn't deal well with multipath, which is prevalent on the internet. So we built our own proprietary probing algorithm. We actually use a packet driver. We create our own custom crafted TCP packets to do this probing. We listen for the responses, analyze, we send new probes based on that analysis, and come up with this picture.
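To give a flavor of what TCP-based path probing looks like in general, and this is a generic sketch of the technique rather than SolarWinds' proprietary algorithm, here is a minimal TCP-SYN traceroute built with scapy: SYN packets are sent toward the destination port with increasing TTLs, and the ICMP time-exceeded or SYN-ACK replies identify each hop. The target host and port are assumptions, and the real NetPath probing is considerably more involved, with repeated probes, multipath discovery, and loss and latency estimation.

```python
# Generic sketch of TCP-SYN path probing (not SolarWinds' NetPath algorithm).
# Requires scapy and usually root privileges to craft raw packets.
from scapy.all import IP, TCP, ICMP, sr1

TARGET = "www.example.com"   # hypothetical destination
PORT = 443                   # probe the actual service port, so firewalls that
                             # block ICMP/UDP traceroute still let the probe through

for ttl in range(1, 21):
    # Craft a TCP SYN with a limited TTL; the hop where the TTL expires answers.
    probe = IP(dst=TARGET, ttl=ttl) / TCP(dport=PORT, flags="S")
    reply = sr1(probe, timeout=2, verbose=0)
    if reply is None:
        print(f"{ttl:2d}  *")                       # no answer (filtered or lost)
    elif reply.haslayer(ICMP) and reply[ICMP].type == 11:
        print(f"{ttl:2d}  {reply.src}")             # intermediate hop: TTL exceeded
    elif reply.haslayer(TCP):
        print(f"{ttl:2d}  {reply.src}  (destination reached)")
        break
```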
Do we have any other questions around NetPath or indeed any of the other features I have demonstrated here? >>I think we've answered just about all of them that have come through. >>There's one more that seemed to pop up on the screen from Paul. >>Oh, is there any performance related impact to the probing? >>So that's an interesting question. There's a couple of places that the probing intelligence sits. The first place is on the
probing machine, of course. So the rule of thumb here is
if you're monitoring a path, two paths, three paths,
something like that, it's really negligible performance impact. It's sub-5% of your CPU. So if you're monitoring a path or two, it's totally fine to put
them on a production machine, whether that's a user
box or whether that's even a piece of infrastructure. It does have to be a Windows device where you place that probe. Then the data that the probe gets is rolled up to polling engines
and our database server. So you can get requirements or limitations around that from the NPM admin guide. But most people are concerned
about that polling component and if you're doing a couple
of paths, you're just fine. It will scale up to 30 paths as necessary. And if you've got 30 paths, if you've got more aggressive intervals, like doing monitoring
at five minute intervals or three minute intervals,
that's the point where you want to start thinking
about having multiple CPUs and NetPath will take significant
resources on this box. So all the requirements around that and sort of some scalability numbers are in the NPM administration
guide for specifics there. >>Okay Chris, thanks for your time. Thanks for the great demos. Just to quickly wrap it up here, all of our products come with a free 30 day trial, fully functional. So you can download those
directly from solarwinds.com and get a complete product overview and demo that in your
particular environment. We would also encourage you
to go to www.thwack.com. Thwack is our online user community, which has over 150,000 registered users. It's a great place for getting information and sharing tips and tricks, on not only the Solarwinds products, but also networking in general. This is an area also that you can go see what we're working on for all
of our different products, so you kinda get an idea
of what's coming next. And also provides a venue
to provide your input into what you'd like to see coming
next in our future products. So with that in mind, I'd like to thank you all for attending and we look forward to seeing
you in webcasts in the future. Thanks a lot.