Grafana Dashboard📊: Monitor CPU, Memory, Disk and Network Traffic Using Prometheus and Node Exporter

Video Statistics and Information

Captions
Hey, what's up everybody, my name is Moss Normand, and in today's video I'm going to show you how to write a custom Grafana dashboard. I'm going to walk you through, step by step, how I build a dashboard from scratch to monitor a virtual machine's key resource usage. The dashboard that you're going to build is going to look like this: it will include CPU utilization over time, memory usage over time, disk space used and available over time, and finally network traffic. If you're a system administrator for an application, or maybe even for a cluster of VMs, these are key metrics that you would want to monitor on those VMs.

In addition to Grafana there are two other dependencies. The data source that we're going to be using is Prometheus, and the data collector that we're going to be using is Node Exporter. Node Exporter is going to collect all of that CPU and memory usage and the other metrics data, and it will expose that data on an endpoint that Prometheus will then scrape and provide to Grafana when Grafana queries Prometheus.

I've seen some tutorials on Grafana dashboards where they basically showcase the import feature in Grafana, where you can import a pre-built dashboard, but that's not what we're going to do in this tutorial. In this tutorial we're going to build each panel from scratch and form each of the Prometheus queries ourselves. I'm not saying that the import feature in Grafana isn't useful, I think it's an extremely useful utility, but you'll definitely learn more if you build each panel on the dashboard yourself and form the queries from scratch. Now that you know what we're going to build, grab a coffee and let's get to it.

The first thing I want to show is the endpoint that Node Exporter exposes data on, so that Prometheus can scrape that data and Grafana can then query Prometheus. Node Exporter is running on my VM on port 9100, and the endpoint where I can access the Node Exporter data is the /metrics endpoint. When I access the /metrics endpoint I get pretty much all of the data that Node Exporter is collecting from this virtual machine.

Now let's go into Grafana, and in Grafana I'm going to create a new dashboard. To create a new dashboard I just come up here to the plus sign and hit Dashboard, and we're presented with this screen where we can add a new panel. The first metric that we're going to track is CPU utilization, so we're going to go ahead and click Add new panel. In this panel screen we can select which kind of visualization we want. We do want to keep a graph visualization in this case, and then under the graph itself we can enter a metric. From here we can select the data source. I'm going to keep it as default, which is Prometheus, so the data source is going to be the Prometheus that has our Node Exporter data, and then we can form a PromQL query in this metrics field here.

The first metric that we need is the virtual machine's total CPU time in seconds, and to get that we just type in node_cpu_seconds_total. From here we'll specify the job name that we want to query as well as the instance that we want to query. In this case my job name is jenkins node, and my instance is going to be localhost:9100, which is where Node Exporter is exposing that /metrics endpoint. I'm also going to specify that the mode shouldn't be idle for CPU time, so I'm going to use the not-equal operator and specify mode not equal to idle. It's going to take all the modes like system and user CPU time, but we're going to ignore idle time.

When this loads the graph, you can see that CPU time just continuously increases, and that's because node_cpu_seconds_total is a counter. Grafana gives us a hint here on how we can modify our query: it says that it's a counter and that we should apply the rate function. We're going to apply the irate function, the instant rate function, rather than rate, because we'll get a little bit better resolution of the data. But as Grafana is suggesting, we want the rate of CPU time rather than just the counter of how much CPU time there has been since the machine was started. So to get the rate, let's invoke the irate function. The irate function takes as a parameter the interval of time over which you want to query data points, and in this case we'll just do one minute. We'll let that load, and now you can actually see the rate of CPU time per second over time.

One thing to notice about the legend in this graph is that it breaks it down by mode, and you can see a few other labels. Another thing to notice is that I only have one CPU on this virtual machine. If I had multiple CPUs you would see cpu 0, 1, and so on, however many CPUs I have. Since I only have one CPU I would like to remove the cpu label, but also, if you had multiple CPUs and you wanted to just see the overall CPU utilization across all CPUs, you would want to use an aggregation operator. We'll use avg and then specify without cpu, so that's going to remove the cpu label from this particular query. We'll let that load. Oh, and I forgot to provide parentheses around the second part of the query. So that removes the cpu label from here, and if we had multiple CPUs this would be the average utilization across all CPUs. If we didn't take the average across all CPUs and we had multiple CPUs, then our utilization would show that it was over 100 percent, because each CPU can have up to one second of CPU time per second. So if we had four CPUs we could potentially see 400 percent utilization, when in reality it's just 100 percent utilization across the board.

The next thing I want to do is clean up the legend a little bit. I want to pull the mode and specifically target the mode in the legend, because that's the only thing that's different between each of these legend items. If I add mode here, now I get a much cleaner legend, and it breaks it down by iowait time, system, user, nice, etc. So now we've got a much better graph of CPU utilization.

The last thing that we want to change here is the left Y axis. We want to make sure that this is represented in percent rather than as a plain number. If I go down here to Axes and then the Y axis, I'm going to type in percent. There is percent 0 to 100 here or 0.0 to 1.0, and we'll use 0.0 to 1.0, and now we have the correct percentage and we have the CPU time. I'll update the panel title as well to say CPU utilization. Let's go ahead and save, and I'm going to call this "host resource usage example". We'll save that, and now we have our first panel in our dashboard, representing CPU utilization for the VM that I'm currently operating on.

Now let's add an additional panel. We'll add one next to the CPU utilization. This panel is going to map out our memory usage. What we want to know is what our used memory is on this virtual machine, and we also want our free memory, our buffers, and our cached memory. We're going to use a graph visualization again, and I'll update the title to memory usage.

Our first query is going to be used memory. To calculate used memory we're first going to get the total amount of memory, so that's going to be node_memory_MemTotal_bytes, and we'll pull the labels instance and job in here. The instance is going to be localhost:9100, since that's where Node Exporter is hosted, and the job is going to be jenkins node, which is just the name of this particular job. From total memory we're going to subtract free memory, and that's going to be node_memory_MemFree_bytes with the same labels. We'll also subtract cached memory and finally subtract buffers. Whoops, okay. This should get us used memory in bytes, and you can see now that we have used memory. Let's update the legend here so that it's represented as used memory.

One thing that I notice is that the Y axis is in raw bytes, so let's update that by going down here to Axes and updating the unit to bytes (IEC). When that updates, we can see the bytes are now represented in gigabytes, so that makes it a little bit easier to read.

The next query that we'll add here is just buffers, and let's update the legend to buffers on that one. The next one is going to be cached memory, and then finally we're going to graph free memory. There we go, free bytes. You'll notice that free memory is actually still graphed even though I haven't provided any labels for it, and I believe that's because there's only one instance to query in this case. Okay, so now we have all of our memory usage mapped out. Let's go ahead and save that and hit Apply, and now we have memory usage and CPU.

The next metric that I want to track is disk space, so I want used and available disk space. I'm going to drag this panel down to the bottom here and add a new panel. The first one is going to be used disk space, and then the next panel that I add is going to be available disk space.

To get the used disk space, all we have to do is reference the node filesystem metrics, and we'll get the available bytes from our job here, jenkins node, and the instance is going to be our Node Exporter. In addition to providing the job and instance name, I also have to specify which devices I don't want to track, and there are a couple of devices that I don't want to track usage for. The first one is loop devices, so the device shouldn't match any loop devices. I also don't want to track the temp file system or the namespace file system, so again I'm going to add a matcher here for the device, tmpfs. The last device that I don't want to track is the GNOME virtual file system.

Now we're going to subtract available bytes from the total bytes on the file system, so this actually shouldn't be available bytes, it should be total bytes. From total bytes we're going to subtract the available bytes on the file system, so we'll subtract node filesystem available bytes, and here we'll only specify the jenkins job and the instance. Let's see what that returns.

Okay, so now we have the available disk space, and I'll put this up here in the title. On the Y axis we have raw bytes, so this isn't very readable. Let's update the unit on the left axis to bytes. Okay, and now we get gigabytes on the left axis, so that's a lot more readable. Under the legend we have a lot of labels included, but the only label that we want in the legend is device, so what I'm going to do is pull out the device and have the legend include only the device label. Also, I said that this was available, but this isn't actually available, this is used space on disk. So let's go ahead and save this panel and apply it to our dashboard. Now we've got used disk space. I'm going to drag it across here so we have a little extra real estate.

The next panel that we're going to add is available disk space, so let's go ahead and add a panel. I'm going to drag it below used here, and it is going to be a graph. To get the available disk space we're going to take node filesystem available bytes. Let me just make sure that's correct. Yep, available bytes. Our job is going to equal jenkins node and the instance is our Node Exporter. Similar to the other graph, we don't want to include the loop devices, so I'll just add this small expression here for loop devices. Then again we don't want to include the temp or namespace file systems, so tmpfs and nsfs, and we also don't want to include the GNOME file system. This gives us our available disk space. I'll also update the legend here so we only see the device name in the legend, and that looks a lot cleaner. The last thing we want to do is update that Y axis so that it's in bytes, and now we have gigabytes, a much more human-readable format, on the Y axis. Let's go ahead and save and apply that.

So now our dashboard is looking pretty filled out. The last metric that we want is network traffic. We want to see received traffic and transmitted traffic, so let's go up and add one more panel, and we'll bring this down to the bottom here. To get network traffic, we're going to take the rate of the total received bytes and the rate of the total transmitted bytes from our virtual machine, and it'll be broken down by interface.

To get that, we'll do irate and then node_network_receive_bytes_total. That's going to be on our job, jenkins node, and the instance is going to be localhost:9100. We're going to take it over a five-minute interval, and then let's see if that brings up any data. It does. So we have the rate of bytes received over time, and we can see the labels in the legend here. What we want is the device, so let's break this down to the device. Now we have the interfaces as the only label displayed in the legend, so that's a little bit easier to read. Just to specify that the interface here is for received bytes, let's add that to the legend.

The second query is going to be the total bytes transmitted. Let's take the rate of bytes transmitted, so we're going to do irate of node_network_transmit_bytes_total, and we'll do jenkins node, the instance is going to be localhost:9100, and then the interval will be five minutes. We'll break down the legend the same as before, but now we'll say it's transmit. So now the legend is broken down by receive and transmit for each of the interfaces.

One thing that we have here is that the left Y axis is kind of off. We want to specify the units that we want to track this in, so we're going to use kilobytes per second as our unit. Okay, so now we have a much easier to read graph when we zoom in here. Finally, let's update the panel title to say that this is network traffic.

I'm going to go ahead and save our dashboard. We added available disk space and we added network traffic, and we'll go ahead and apply that. I'll stretch this across a little bit. Great, and there we have it, that's pretty much the whole dashboard. These are some of the key metrics that you would want to track as a system administrator or someone who's responsible for maybe a cluster of servers or, in this case, just a single virtual machine. We can see the key metrics outlined in this dashboard.

I hope you enjoyed this tutorial, and if you didn't get a chance to follow along building the dashboard, no worries. I will export this dashboard into JSON format and provide a GitHub repository where you can download the JSON and import it into your own Grafana instance if you want to. If you liked the video, please consider throwing a like on it and subscribing to the channel for more videos, and if you have any questions, comments, or feedback, please leave them below in the comments. Thanks for watching.
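
The video opens by looking at what Node Exporter serves on its /metrics endpoint. As a rough illustration of that text format, here is a minimal Python sketch that parses a few lines of it. The sample text and its values are made up, and the parser is deliberately simplified (it skips lines with timestamps and does not handle escaped quotes inside label values).

```python
import re

# Illustrative sample of what a /metrics endpoint returns; the values are made up.
SAMPLE = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1000.5
node_cpu_seconds_total{cpu="0",mode="user"} 42.25
node_memory_MemTotal_bytes 4.0e+09
"""

# metric_name, optional {labels}, then a single numeric value.
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # simplified parser: ignore anything it cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


samples = parse_metrics(SAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```

Each sample carries the labels (cpu, mode, and so on) that the later queries filter and aggregate on.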
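
For the CPU panel, the query described in the video can be sketched as below, along with a small simulation of why averaging across CPUs matters. This is an approximation under assumptions: the job label spelling "jenkins_node" is a guess (the video only says "jenkins node"), the sample data is invented, and the irate helper ignores counter resets, which the real PromQL function handles.

```python
# Assumption: the job label is spelled "jenkins_node"; the video only says "jenkins node".
JOB = "jenkins_node"
INSTANCE = "localhost:9100"


def selector(metric, matchers):
    """Render metric{label<op>"value",...} from (label, op, value) triples."""
    inner = ",".join(f'{label}{op}"{value}"' for label, op, value in matchers)
    return f"{metric}{{{inner}}}"


cpu_selector = selector(
    "node_cpu_seconds_total",
    [("job", "=", JOB), ("instance", "=", INSTANCE), ("mode", "!=", "idle")],
)
# Average the per-CPU rates and drop the cpu label, as described in the video.
CPU_QUERY = f"avg without (cpu) (irate({cpu_selector}[1m]))"


def irate(series):
    """Per-second rate from the last two (timestamp, value) samples.

    Simplification: counter resets are ignored.
    """
    (t0, v0), (t1, v1) = series[-2], series[-1]
    return (v1 - v0) / (t1 - t0)


# Made-up data: four fully busy CPUs, each gaining 15 s of CPU time every 15 s.
per_cpu = {cpu: [(0, 100.0), (15, 115.0), (30, 130.0)] for cpu in range(4)}
rates = [irate(series) for series in per_cpu.values()]
summed = sum(rates)                 # would read as 400% on a percent axis
averaged = sum(rates) / len(rates)  # reads as 100%

print(CPU_QUERY)
print(summed, averaged)
```

With the legend set to {{mode}}, each remaining series is labelled by its CPU mode only.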
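
The used-memory calculation from the memory panel (total minus free, cached, and buffers) can be sketched in Python as follows. The label values are assumptions as before, the byte counts are invented, and format_iec is a hypothetical helper that only approximates what Grafana's bytes (IEC) unit does to the axis.

```python
# Assumed label values; the job spelling is a guess from the spoken "jenkins node".
LABELS = '{instance="localhost:9100",job="jenkins_node"}'

# Total - free - cached - buffers, using the Node Exporter memory metrics.
USED_MEMORY_QUERY = " - ".join(
    f"node_memory_{field}_bytes{LABELS}"
    for field in ("MemTotal", "MemFree", "Cached", "Buffers")
)


def used_memory(total, free, cached, buffers):
    """Same arithmetic as the query, applied to plain byte counts."""
    return total - free - cached - buffers


def format_iec(num_bytes):
    """Hypothetical helper: render a byte count with binary (IEC) units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


GIB = 2**30
# Made-up readings for an 8 GiB machine.
used = used_memory(total=8 * GIB, free=2 * GIB, cached=3 * GIB, buffers=1 * GIB)

print(USED_MEMORY_QUERY)
print(format_iec(used))
```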
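
For the disk panels, the exact label matchers typed in the video are not recoverable from the captions, so the following is only a guess at their shape: it excludes loop devices, tmpfs, nsfs, and the GNOME virtual file system by device name. The patterns, the job spelling, and the sample file systems are all assumptions. Prometheus regex matchers are fully anchored, which is why the Python side uses re.fullmatch.

```python
import re

BASE = 'job="jenkins_node",instance="localhost:9100"'  # assumed label values
# Assumed exclusions; the video names the device types but not the exact patterns.
EXCLUDES = 'device!~"/dev/loop.*",device!~"tmpfs|nsfs",device!="gvfsd-fuse"'

# Used space = total size minus available space.
USED_DISK_QUERY = (
    "node_filesystem_size_bytes{" + BASE + "," + EXCLUDES + "}"
    + " - node_filesystem_avail_bytes{" + BASE + "}"
)
AVAILABLE_DISK_QUERY = "node_filesystem_avail_bytes{" + BASE + "," + EXCLUDES + "}"

EXCLUDED_DEVICE_PATTERNS = ["/dev/loop.*", "tmpfs|nsfs", "gvfsd-fuse"]


def is_tracked(device):
    """True when the device matches none of the (fully anchored) exclusions."""
    return not any(re.fullmatch(p, device) for p in EXCLUDED_DEVICE_PATTERNS)


def used_bytes(filesystems):
    """Map device -> size - avail for tracked devices only."""
    return {
        device: size - avail
        for device, (size, avail) in filesystems.items()
        if is_tracked(device)
    }


GIB = 2**30
# Made-up file systems: device -> (size_bytes, avail_bytes).
filesystems = {
    "/dev/sda1": (50 * GIB, 20 * GIB),
    "/dev/loop0": (1 * GIB, 0),
    "tmpfs": (2 * GIB, 2 * GIB),
    "gvfsd-fuse": (0, 0),
}
usage = used_bytes(filesystems)

print(USED_DISK_QUERY)
print(usage)
```

A legend format of {{device}} then leaves only the device name on each series.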
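
The two network queries and their legends can be sketched the same way. The job spelling and the interface name eth0 are assumptions, and render_legend is a hypothetical stand-in that mimics how a {{label}} legend template is filled in; it is not Grafana's own code.

```python
import re

JOB = "jenkins_node"  # assumed spelling of the spoken "jenkins node"
INSTANCE = "localhost:9100"


def network_query(direction):
    """Build the irate query for 'receive' or 'transmit' over a 5m window."""
    metric = f"node_network_{direction}_bytes_total"
    return f'irate({metric}{{job="{JOB}",instance="{INSTANCE}"}}[5m])'


def render_legend(template, labels):
    """Hypothetical helper: replace each {{label}} with that label's value."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: labels.get(match.group(1), ""),
        template,
    )


RECEIVE_QUERY = network_query("receive")
TRANSMIT_QUERY = network_query("transmit")

# Made-up series labels for one interface.
labels = {"device": "eth0", "instance": INSTANCE, "job": JOB}
receive_legend = render_legend("{{device}} receive", labels)
transmit_legend = render_legend("{{device}} transmit", labels)

print(RECEIVE_QUERY)
print(TRANSMIT_QUERY)
print(receive_legend, "|", transmit_legend)
```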
Info
Channel: Tech and Beyond With Moss
Views: 29,258
Rating: 4.909502 out of 5
Keywords: prometheus, grafana, monitoring, node-exporter, devops, data visualization, grafana prometheus, grafana dashboard, grafana prometheus dashboard tutorial, node exporter, prometheus monitoring, prometheus node exporter grafana, prometheus exporters, prometheus monitoring dashboard, grafana tutorial, prometheus node exporter grafana setup, data visualization examples, grafana tutorial for beginners, prometheus grafana, grafana dashboard creation, devops tools 2021
Id: YUabB_7H710
Length: 26min 3sec (1563 seconds)
Published: Sun Oct 18 2020