How to monitor Containers in Kubernetes using Prometheus & cAdvisor & Grafana? CPU, Memory, Network

Captions
In this video I'll show you how to monitor CPU, memory, and network usage of containers running in Kubernetes. In the first dashboard we will focus on CPU: we'll create a graph to monitor CPU usage as a percentage of the limit given to the container. Some of you may find it more useful to monitor CPU usage in virtual cores, with requests and limits shown on the graph. It's also very important to monitor CPU throttling, especially for CPU-intensive applications, and we're going to run some load tests to verify our dashboard. Next is the memory usage of the containers. The first graph will again show memory usage as a percentage of the limit given to the container, and the top graph will measure memory in bytes with requests and limits on the chart. We'll use the container_memory_working_set_bytes metric to measure memory, which is the same metric that Kubernetes uses to decide when to kill a pod. In the third dashboard we'll measure network pressure: bytes received and transmitted, as well as how many packets were dropped due to errors or collisions. For the load test we'll measure network throughput between the pods. Also, if you want to learn how to monitor persistent volumes, you can watch the previous video.

To monitor all of this, we'll deploy cAdvisor as a DaemonSet to collect metrics from running containers, kube-state-metrics to fetch requests and limits of the containers from the Kubernetes API server, and of course Prometheus to scrape all of those targets. To reproduce this example you can use my Terraform code to create Kubernetes in AWS and deploy Prometheus using YAML files. First we need to create the custom resource definitions for the Prometheus Operator and then deploy the rest of the monitoring components. Create the CRDs first by using the create keyword instead of apply, then apply all the files under the monitoring folder. Make sure all the pods are up, including Prometheus, Grafana, cAdvisor, and kube-state-metrics. Since we don't have any ingresses, let's port-forward Prometheus and Grafana to localhost. If you open Targets in Prometheus, by now we should have cAdvisor and kube-state-metrics targets.
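For reference, the deployment steps described above look roughly like this. This is a minimal sketch: the folder names, the monitoring namespace, and the service names and ports are assumptions based on the description, not the exact layout of the repository.

```sh
# Create the Prometheus Operator CRDs with "create" (not "apply"),
# then deploy the rest of the monitoring stack.
kubectl create -f crds/
kubectl apply -f monitoring/

# Make sure all the pods are up: Prometheus, Grafana, cAdvisor, kube-state-metrics.
kubectl get pods -n monitoring

# No ingresses, so port-forward Prometheus and Grafana to localhost.
kubectl port-forward -n monitoring svc/prometheus 9090:9090 &
kubectl port-forward -n monitoring svc/grafana 3000:3000 &
```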
Besides container metrics, cAdvisor also exposes some generic specs of the VM where it's running; those metrics start with machine_. For example, if you run machine_cpu_cores it will give you the number of processors on that machine. For this demo I used the t3.xlarge EC2 instance type, which has four virtual CPUs according to AWS. All the metrics related to containers running on those machines start with container_. You can find CPU, memory, network, and filesystem metrics, but cAdvisor at this point does not support all network-attached storage systems such as EBS, so I decided to leave that out of this tutorial; if you want to monitor usage of persistent volumes, you can watch the previous video. Let's start with the CPU metrics. As you can see, cAdvisor also provides some limits for the containers, but I prefer to take those values from kube-state-metrics.

Now let's log in to Grafana. Unless you have changed the password in the secret, the default username is admin and the password is devops123. Whenever I configure Grafana using ConfigMaps, I always check that the data source is working first. Alright, let's go ahead and create our first dashboard to monitor CPU; let's call it CPU Usage. If you have a lot of namespaces and pods, you would typically create a variable to use as a filter; let's call it namespace. To dynamically populate the variable, you would use one of the metrics that exist in all namespaces, for example container_cpu_usage_seconds_total, because all containers use CPU to run. Some metrics may only be visible for a small subset of namespaces, for example the metrics used to monitor persistent volumes. After you execute this query you can choose the label; in our case we want the Kubernetes pod namespace. In Grafana, select the Prometheus data source (I called it main). Grafana has a special function, label_values: it accepts the metric and the label that you want to use, namespace in our case. Also, let's include the All value in case you want to show all the pods in the cluster on the same chart.

Now we can create our first graph; let's call it CPU Usage as a Percentage of the Limit. Same here, select the data source; you can use a variable for the data source as well. The query that we're going to be using is relatively large: it takes the rate of the CPU usage seconds and divides it by the CPU limit. For the rate function we need to specify the time interval, which is one minute here. This interval must be at least four times larger than the scrape interval; if you followed along and deployed Prometheus using the provided YAML files, you have a 15-second default scrape interval. You can also use the built-in Grafana interval variable, which will adjust based on the time range you specify for the dashboard, but I recommend starting with a fixed interval and adjusting later on. So if you have a scrape interval of 2 minutes, this query with a one-minute interval won't return anything; keep that in mind. Now, if you use an older version of containerd or even Docker to run your containers, you may need to adjust the container label filter: in older versions there was a special container called pause that Kubernetes used to initialize the pod network, and to filter those containers out you would use a container label not equal to "POD". But first just remove that filter and make sure the query works. Also, older versions of cAdvisor had different label names; if you face any issues, execute just the metric without the labels in the Prometheus query explorer and verify the label names. Since a pod can have multiple containers, and each container may have its own limit, we need to sum by both. To use the namespace variable, just use the dollar sign, and don't forget the tilde special character. For the legend we'll use the container name and pod name. I always prefer to move the legend to the right-hand side as a table; for the value you can optionally use the last non-null value. Also, let's change the design of our graph: change the line interpolation, width, and fill opacity, and set gradient mode to opacity as well. Sometimes you may have gaps in the graph; to avoid that I typically use the option to connect null values. Since we divided seconds by the limit, we have a value between 0 and 1; you can multiply it by 100 or use the Grafana unit type to convert it for you. To remove extra zeros we can set decimals to 1.
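To make this first panel concrete, here is a sketch of the variable query and the panel query described above. The metric names come from cAdvisor and kube-state-metrics, but the exact label matchers (and whether you divide by the kube-state-metrics limit or the cAdvisor quota) may differ from the dashboard in the video, so treat this as a starting point.

```promql
# Grafana "namespace" variable, populated with the label_values() function
label_values(container_cpu_usage_seconds_total, namespace)

# CPU usage as a percentage of the limit (a value between 0 and 1).
# Sum by pod and container because a pod can run several containers,
# each with its own limit; container!="" drops the pod-level aggregate series.
sum(rate(container_cpu_usage_seconds_total{namespace=~"$namespace", container!=""}[1m])) by (pod, container)
  /
sum(kube_pod_container_resource_limits{namespace=~"$namespace", resource="cpu"}) by (pod, container)
```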
That's it for the first graph. Let's make it larger and set the refresh rate to 5 or 10 seconds; the global scrape interval is 15 seconds, and you can adjust it as well based on your needs. In case you use one of the managed Prometheus offerings that charge by ingestion and storage, the maximum scrape interval that you can set is 2 minutes; after that you may have problems with the time series data. You can use the variables to switch between Kubernetes namespaces, and you can also use the All option to get containers from all namespaces, which may be helpful in identifying problematic applications.

For the following graph we'll use the CFS (Completely Fair Scheduler) throttled seconds metric to plot the container throttling graph. For this metric we'll use similar labels, such as namespace, pod, and container. Let's create a new panel and call it CPU Throttling. This graph will show the time in seconds for how long the container was throttled. Due to the nature of pod limits and how they are implemented using cgroups, it's almost impossible to get rid of all throttling for CPU-intensive tasks, but you can keep it under control. Some people even recommend removing CPU limits for such applications, but you need to do your own research on that topic. Same here for the labels: depending on your cAdvisor version you may need to remove the container label matcher. For the legend let's use the container name label, since each container in the pod gets its own request and limit. Let's also shift the legend to the right and transform it into a table instead of a list, with the same last non-null value; this value is only for the legend section. Now let's customize the graph a little bit: make the line wider and add some fill opacity. For the unit type let's use seconds; Grafana will automatically convert the value to milliseconds or even microseconds as needed. For now, cAdvisor would be the most CPU-intensive application in our cluster, and it will also be slightly throttled. Other pods that don't use a lot of CPU, such as Prometheus and kube-state-metrics, will show zero throttling, mostly because we have only a couple of targets registered with Prometheus.

The third graph in this dashboard will show CPU usage in cores with requests and limits. Let's create a few more variables. The first one is pod; we can use the same label_values Grafana function and filter by pod name, and as you can see here, to restrict pods to the current namespace we use the namespace variable. For the last graph we would also need a container variable; the same logic here, restricted by the pod name, and it will also be beneficial to add the namespace variable to this query, similar to the pod variable. Alright, we have all the Grafana variables we need. For the following graph, let's call it CPU Usage in Cores with Requests and Limits and select the main data source. For the PromQL query we'll use the rate function and sum by pods and containers; as a reminder, you can find those dashboards in my GitHub repository. For the legend, since each pod can have more than one container, let's use the container and pod labels. Next is the limit: since the same containers in the pod will get the same limits, we can simply use the average function to get rid of extra values. This metric comes from the kube-state-metrics component, and you can use it to get limits for both CPU and memory. For the legend just use the constant string limit. The next query fetches the requests of each container in the pod, exactly the same logic but with the metric named requests instead of limits; for the legend let's use request. That's all for the PromQL queries.
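Here is roughly what those variables and queries look like in PromQL; again, this is a sketch under the assumption that limits and requests come from kube-state-metrics, not the exact dashboard JSON.

```promql
# Grafana "pod" and "container" variables, restricted by the previous variables
label_values(container_cpu_usage_seconds_total{namespace="$namespace"}, pod)
label_values(container_cpu_usage_seconds_total{namespace="$namespace", pod="$pod"}, container)

# CPU throttling: how long the container was throttled, in seconds per second
sum(rate(container_cpu_cfs_throttled_seconds_total{namespace=~"$namespace", container!=""}[1m])) by (pod, container)

# CPU usage in cores for the selected container, with request and limit lines
sum(rate(container_cpu_usage_seconds_total{namespace="$namespace", pod="$pod", container="$container"}[1m])) by (pod, container)
avg(kube_pod_container_resource_limits{namespace="$namespace", pod="$pod", container="$container", resource="cpu"})    # legend: limit
avg(kube_pod_container_resource_requests{namespace="$namespace", pod="$pod", container="$container", resource="cpu"})  # legend: request
```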
Now we can slightly customize the graph: first the legend value, line width, and fill opacity. For the unit we can just use the short type, and instead of 1, let's use 2 for decimals. Now let's customize this dashboard a little bit more. First I want to update the limit line; to do that we need to add an override. You can specify the exact field name or use a regex expression. Select the limit here, then remove the fill opacity, change the line style from solid to dash, and finally update the color to red; you can find it under the color scheme. The limit line is done. Next let's do the same customization for the request line, but for the color use green or any other you want: remove the fill opacity, convert it to dash, and use a green color for the line. Now this graph looks much better, at least to me. You can switch between namespaces and pods to monitor each container; let's make it a little bit smaller.

To run a demo we're going to use two pods, Ubuntu pod 1 and Ubuntu pod 2, with the default Ubuntu 22.04 image. To keep the containers running we can create an infinite while loop. The only difference between those pods is resource allocation: the first pod has a 500 millicore CPU request and an 800 millicore limit, and the second one has a 500 millicore request and a 1500 millicore limit. Let's apply them. We have two running pods; let's go ahead and get a shell into the first Ubuntu pod. To simulate a high CPU load we can use stress; install it using the default Ubuntu package manager, then get a shell into the second pod and install the same stress utility. After you deploy new pods you may need to refresh the dashboard to fetch new variables. Now we have the default namespace here and two Ubuntu pods; the second Ubuntu pod has a 1.5 virtual core limit and a 0.5 request. Let's run the stress utility on both pods with one CPU. Initially the first Ubuntu pod had a 0.8 limit and a 0.5 request; with the stress tool consuming one CPU, it should be throttled by Kubernetes. Let's wait a few seconds. Alright, you can see that the first Ubuntu pod reached the CPU limit and is now throttled by Kubernetes. The pod will still be running, but depending on the amount of throttling it may greatly impact your application. The first graph shows only the selected container from a specific pod, while the other two graphs show all the pods in the namespace. Based on the bottom chart, the CPU usage for Ubuntu pod 1 is 100 percent and for pod 2 it is about 65 percent. We can switch to pod 2: since it has a larger CPU limit, the usage is somewhere in the middle between request and limit and it's not throttled.

Next we're going to create a dashboard to monitor memory consumption. If you type container_memory you'll find multiple metrics, but we'll focus on the working set bytes because it's the same metric that Kubernetes uses to kill pods if they reach the limit. We call CPU a soft limit, since Kubernetes will continue running the pod and just throttle it, but memory is a hard limit: when a pod reaches 100 percent of it, the pod will be killed immediately. We're going to call this dashboard Memory Usage. Also, let's create a few variables right away. The first one is the same namespace; for the query you can use one of the memory metrics or just keep the same CPU metric, it does not matter here. For memory usage let's also include the All option to monitor the whole cluster. The next variable is a pod, and the final one is a container. The first graph will again show the memory usage as a percentage of the memory limit given to each container. To calculate it, we'll divide the memory working set bytes by the limit; this will also give us a value between 0 and 1, which we can convert with the Grafana unit system or multiply by 100.
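A sketch of the queries for the two memory panels (the percentage-of-limit graph described here and the bytes graph covered next), again assuming the limits and requests come from kube-state-metrics:

```promql
# Memory usage as a percentage of the limit (a value between 0 and 1)
sum(container_memory_working_set_bytes{namespace="$namespace", pod="$pod", container!=""}) by (pod, container)
  /
sum(kube_pod_container_resource_limits{namespace="$namespace", pod="$pod", resource="memory"}) by (pod, container)

# Memory usage in bytes (a gauge, so no rate() needed), plus request and limit lines
sum(container_memory_working_set_bytes{namespace="$namespace", pod="$pod", container!=""}) by (pod, container)
avg(kube_pod_container_resource_limits{namespace="$namespace", pod="$pod", resource="memory"})      # legend: limit
avg(kube_pod_container_resource_requests{namespace="$namespace", pod="$pod", resource="memory"})    # legend: request
```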
For the legend, use the container and pod labels, then pretty much the same customization settings; I'll point out only the differences from now on. For the unit use percent (0.0-1.0). That's all for the first graph. The second one will show the container memory usage with respect to requests and limits. When we monitored CPU we used a counter metric type that always goes up; that's why we needed to apply the rate PromQL function. For memory usage Prometheus uses a gauge metric type. A gauge can go up or down, and it's perfect for monitoring memory usage or, say, a temperature; that's why we can simply use this metric and add labels to filter pods and containers. To get the memory limit we use the same metric as in the CPU chart; instead of cores, it's in bytes. And finally the request line; that's all. Now let's customize it in the same way we did for CPU: change the line styles and update the colors. For the unit use bytes, and override the limit and request fields. Alright, we're all set; these requests and limits match the pod spec. Now let's minimize it again. Before the demo, go ahead and get a shell into Ubuntu pod 1; for stress let's use 700 megabytes, which should be in the middle between the request and the limit. On the bottom chart you can see that it's around 70 percent of the limit. I would keep both of those charts; the bottom one can also show the memory usage for all pods if you select the All option.

The last dashboard will be used to monitor network usage. We're going to use bytes received and transmitted, as well as how many packets were dropped; optionally you can also follow the same logic and add errors. In one of the following videos we'll create a dashboard using the four golden signals from the SRE book; there we'll use network saturation instead of simply measuring how many packets were dropped or transmitted. Let's call this dashboard Network Usage. We need the same variables here: the first one is a namespace, then the pod. We don't need a container variable, since all the containers share the same network within the pod. This graph will show how many bytes were received and transmitted, so let's call it Network I/O Pressure. We'll use exactly the same logic here as the Node Exporter Full dashboard, which you can find on grafana.com, except that we'll use the pod metrics instead of the node metrics. The first PromQL query will output received bytes by pod. Here we use the irate function instead of rate; at a high level, it's recommended to use irate for highly volatile values, and network traffic is one of those. Also, let's use the same approach as the node exporter dashboard and convert bytes to bits by multiplying by 8. For the legend use receive and the name of the pod; for the next query use transmit instead of receive. You can plot these metrics on the same graph or choose to use separate graphs, it's up to you; here we have transmit and the pod name. The same legend on the right-hand side as a table, last non-null value, line width, and fill opacity; for the unit use bits. The last customization is to flip the transmitted values to the negative side of the y-axis; we need to use a regex, since we combine transmit with the pod name. The final graph will show the packet drop rate, which in normal conditions should be zero or close to it. We'll also combine received and transmitted on the same chart; for the unit type use packets per second.
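The network panels can be sketched along these lines; the cAdvisor metric names are standard, but the interval and the exact legend formatting are assumptions:

```promql
# Network I/O pressure: received and transmitted traffic per pod, converted to bits/s.
# irate is used instead of rate because network traffic is highly volatile.
sum(irate(container_network_receive_bytes_total{namespace="$namespace", pod="$pod"}[1m])) by (pod) * 8    # legend: receive {{pod}}
sum(irate(container_network_transmit_bytes_total{namespace="$namespace", pod="$pod"}[1m])) by (pod) * 8   # legend: transmit {{pod}}

# Packet drop rate, which in normal conditions should be zero or close to it
sum(irate(container_network_receive_packets_dropped_total{namespace="$namespace", pod="$pod"}[1m])) by (pod)
sum(irate(container_network_transmit_packets_dropped_total{namespace="$namespace", pod="$pod"}[1m])) by (pod)
```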
That's all the dashboards that I wanted to create for this video; now let's run the final test and measure the network throughput between pods deployed on different nodes. We have some gaps here because I forgot to select the connect null values option; let's fix that. To measure network throughput we'll use the iperf utility. We need to install it on both Ubuntu pods; one pod will act as a server and the other one as a client. Since we don't have any Services, let's grab the IP address of the first Ubuntu pod. Optionally we can add the bidir option to transmit and receive at the same time to get a more accurate measurement. I used the t3.xlarge EC2 instance type, where the declared network performance is up to 5 gigabits per second, and based on Prometheus we got similar results. Keep in mind that network data transfer between pods in the same availability zone is free, but for multi-AZ traffic you will be charged for the network usage. The dashboards that we created are a good starting point for monitoring applications in Kubernetes, but it's not enough to go into production with only these dashboards. In the next videos I'll show you how to add monitoring coverage using other approaches, such as the four golden signals and other methods, and we'll also create more application- and language-specific dashboards in the future. Thank you for watching, and I'll see you in the next video.
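For reference, the throughput test described above could be run roughly like this, assuming iperf3 inside the Ubuntu pods (the pod name and IP placeholder are illustrative, and the --bidir flag needs iperf3 3.7 or newer):

```sh
# Install iperf3 in both Ubuntu pods (run inside each pod)
apt-get update && apt-get install -y iperf3

# Pod 1: run the server
iperf3 -s

# Pod 2: run the client against pod 1's IP (no Service, so use the pod IP directly)
# kubectl get pod ubuntu-pod-1 -o wide   # one way to grab the IP; the pod name is a placeholder
iperf3 -c <POD_1_IP> --bidir
```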
Info
Channel: Anton Putra
Views: 19,385
Keywords: kubernetes, prometheus, cadvisor, grafana, kubernetes monitoring, kubernetes tutorial, kubernetes prometheus, kubernetes prometheus and grafana, kubernetes prometheus stack, monitor kubernetes with prometheus and grafana, monitor kubernetes cluster with prometheus and grafana, monitor kubernetes, k8s, prometheus operator, prometheus operator kubernetes, k8s tutorial, monitor containers with prometheus, kubernetes grafana prometheus, devops, aws, monitoring, anton putra, sre
Id: dMca4jHaft8
Length: 25min 57sec (1557 seconds)
Published: Thu Nov 24 2022