Monitoring, Logging, And Alerting In Kubernetes

Reddit Comments

You might be interested in checking out m9sweeper. https://m9sweeper.io/

Specifically - in the context of monitoring and alerting about your k8s security posture

2 points · u/IntelletiveConsult · Jun 14 2022
Captions
What is the ideal self-managed monitoring and logging stack, especially if you're using Kubernetes? Before I answer that question, let me stress one more time that I'm talking about self-managed tools. If you prefer using SaaS, that's great, and you should continue doing that, but this video is about self-managed, not software as a service. So let's get going and see what the ideal monitoring, alerting, and logging stack looks like. There are quite a few pieces we need to assemble, and we're going to start with metrics.

The gold standard for metrics today is the Prometheus project in the CNCF. It has become the de facto standard: Kubernetes exposes its own metrics in Prometheus format, and most of the applications you might be running in Kubernetes do the same by exposing Prometheus endpoints. So it is a standard, and for a good reason.

Let me take my cluster, which I created in advance, and install Prometheus. The command is relatively straightforward, at least if you're using Helm: helm upgrade with --install so it gets installed if it doesn't already exist, a release name, the chart, a namespace (which I'll create), a values file (there's not much happening in it, so no need to show it), and I set the host to prometheus. followed by the IP of my cluster, or of the external load balancer to be more precise. I love using nip.io for demos, though not for the real thing.
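A minimal sketch of that install, assuming the prometheus-community Helm chart and a nip.io hostname (the chart, repo URL, release name, values file, and IP are placeholders, not taken verbatim from the video):

```sh
# Assumption: the community-maintained chart; the video does not name the exact one
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

export CLUSTER_IP=1.2.3.4   # replace with your external load balancer IP

helm upgrade --install prometheus prometheus-community/prometheus \
  --namespace monitoring --create-namespace \
  --values prometheus-values.yaml \
  --set server.ingress.enabled=true \
  --set "server.ingress.hosts={prometheus.$CLUSTER_IP.nip.io}"
```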
Let me get the address, open it in a browser, and take a quick look at Prometheus. Beware: I'm not going to go into detail on everything you should know about Prometheus or any of the other tools featured today; this is a quick overview of what you might want to combine and explore yourself.

One very cool thing about Prometheus can be observed in the Targets section. But first, let me stop for a second: Prometheus is a pull-based mechanism. It pulls metrics from different destinations, unlike push-based mechanisms where an agent pushes metrics to some central database; it's the other way around. Once installed, Prometheus figures out by itself what to watch, and we can see that it already discovered the Kubernetes API server, the nodes, cAdvisor (which is very useful because it gives you the metrics of the containers themselves), pods, endpoints, Prometheus itself, and, out of the box, the Pushgateway. I'll explain the Pushgateway in a couple of minutes; for now, just remember that it's there and that it can be extremely important, especially for legacy stuff that has trouble getting into the cloud-native game.

So Prometheus does three things, essentially: it pulls metrics from whichever locations it discovers, it stores those metrics in its internal time-series database, and it lets us query those metrics to find out whatever we need to know about the system. The query language is called PromQL, so let's take a look at one or two examples.

Let's say I want to find out the CPU utilization of the containers running in my cluster. The simplest possible query is to list all the results of a specific metric, in this case container_cpu_usage_seconds_total (one of many metrics), and hit the Execute button. Assuming you don't like reading a lot of text, we can switch to the graph view and see how CPU usage is doing. One thing you should know from the start is that some metrics are cumulative: we don't see the current usage, we see the cumulative CPU usage over time, essentially from the very beginning. If we want the real CPU usage at any given point in time rather than the cumulative value, we can extend the query and ask for the rate of that metric calculated over five-minute windows. Once you figure that out, you'll discover that there are many, many functions that PromQL supports, and then you'll need to go to the documentation and read it; it's like good literature, right?

Now let me go back to the query. Let's say I want to find out the CPU usage in a specific namespace. Then I would extend my query to limit the results to that namespace. By the way, filtering works through labels: each metric has a number of labels associated with it, and we can use those labels to filter the results. I could narrow it down even further and say it's not only about containers in that namespace, I'd also like to see the CPU usage of containers with a specific name, and then I get a more focused result.
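Here's a sketch of that progression of queries; the metric name is the standard cAdvisor one mentioned above, while the namespace and container label values are made-up examples:

```promql
# All samples of the metric (a counter, so values only accumulate)
container_cpu_usage_seconds_total

# Per-second CPU usage, averaged over five-minute windows
rate(container_cpu_usage_seconds_total[5m])

# The same, filtered by labels (the label values here are examples)
rate(container_cpu_usage_seconds_total{namespace="production", container="my-app"}[5m])
```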
So, apart from Prometheus itself, which is already running in the cluster and fetching metrics from the Kubernetes API, we also got node exporters: there is one on every single node of the cluster, and Prometheus gets metrics from those as well. Then we got the Pushgateway. The Pushgateway is interesting because Prometheus can only pull metrics, and that poses a problem: some applications or parts of the system might not expose metrics in a way Prometheus can scrape. The Pushgateway is yet another target that Prometheus watches and pulls metrics from, so whenever we have something that cannot expose metrics in Prometheus format, we can have it push its metrics to the Pushgateway instead, and Prometheus picks them up from there. Anything can, in theory and in practice, be reconfigured or changed slightly to push metrics to the gateway.

What else did we get? Oh yeah, Alertmanager. That's the last important component of what we installed, though not everything we're going to explore. Alertmanager periodically evaluates queries against Prometheus based on certain rules, and when those queries cross certain thresholds, it sends notifications somewhere: Slack, email, or anything else. So Alertmanager is, in a way, asking Prometheus periodically, "is this okay?" If it is, there's nothing to do; if it's not, notifications go out. Now, that part might be the part I like the least. Alertmanager is a bit... not really friendly, let's say, and I believe there are better solutions, namely Robusta. If you haven't seen or used Robusta, you should; I made a video about it, and there's a link in the description, so go check it out. Now we have everything we need to collect and query metrics and get alerts. More or less.

Stop this video, this is the future me talking: this video is sponsored by Robusta. Robusta is a platform for Kubernetes notifications, troubleshooting, and automation. Within the context of this video, it complements Prometheus and Alertmanager. It has a very easy setup (a few questions through a CLI wizard and off you go), and you get notifications to Slack or MS Teams, if that's what you want, whenever something goes wrong or whenever you want to be notified about something. Truth be told, Alertmanager does something similar, but Robusta puts it on a very different level: it is much better and much easier to work with. It comes with out-of-the-box playbooks for commonly used alerts, and if that's not enough, you can create your own playbooks, which is easier than writing your own Alertmanager rules. On top of that, the alerts come with context. Instead of "something, somewhere failed, go figure it out", you get the information you need to understand what's going on almost immediately. So, long story short, please try out Robusta; it's a great tool, I strongly recommend it, and by trying it you'll be helping this channel a lot. Now let's get back to the original video.

Let's talk about logs: how do we collect them, and what do we use to collect them? In the past (long, long ago; if you're young you might not remember those times), we wrote logs next to the application. There would be a directory next to each application where it wrote its logs, and logs would end up all over the place: the bigger the system, the more locations with logs, and going through logs spread across many different servers was extremely painful. In the meantime, most of us understood that logs should be managed centrally even though they are distributed by nature: logs are created everywhere, and we need to collect them from everywhere and ship them to one place. There are many solutions for that. The most famous, at least in the past, would be Elasticsearch, Logstash, and Kibana, with Logstash collecting logs and pushing them to Elasticsearch. But I'm not using Elasticsearch today, because there is something much better, at least in my opinion, and it's called Loki.

So let me install Loki with a simple command: helm upgrade with --install, the release name is loki, the chart is grafana/loki-stack, the namespace is monitoring again, I want the namespace created, and I want to wait.
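A sketch of that Loki install, assuming the grafana/loki-stack chart (the repo URL and release name are assumptions; that chart bundles Loki together with Promtail, which comes up next):

```sh
# Assumption: the Grafana Helm repo hosting the loki-stack chart
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm upgrade --install loki grafana/loki-stack \
  --namespace monitoring --create-namespace \
  --wait
```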
While I'm waiting for Loki to install, this might be a good opportunity to remind you of something extremely important: subscribe, right now, and then like the video, because it's awesome, right? Now let's see what we get with Loki. Actually, there isn't much to see yet, because we're missing a crucial component: a way to visualize logs and everything else. So for now let me explain what Loki does, and later we'll see it in action. Loki sits in the cluster and waits for something to push logs into it. It's a database, a database specialized for logs. The thing pushing logs to Loki in this case is called Promtail. Just as we have exporters providing metrics on each node for Prometheus, we have Promtail running on every node of the cluster, watching the logs of all the containers on that node and streaming them to Loki. So this is the opposite of Prometheus: Prometheus pulls metrics, but with Loki, logs are pushed from the nodes to Loki itself. And there's a good reason for that: Prometheus does not collect every single data point; it scrapes and aggregates, because collecting everything would be overwhelming. Logs are different; we need every single log entry. So we have Promtail pushing logs to Loki, which is a database for logs, similar to, let's say, Elasticsearch, except that Loki is specialized for logs and logs only, and it does that really, really well.

And now comes the moment everybody's been waiting for: how do we actually look at all this? Prometheus is great for querying but not really for observing what's going on, and so far there's no interface we can use to explore the logs in Loki. We need a visualization tool, a sort of dashboard. Very few third-party applications have become as indispensable as Grafana. Grafana is the ruler of its own domain; almost nobody disputes that it is the de facto standard for aggregating, fetching, and visualizing metrics, logs, and almost anything else. We're going to connect it to Prometheus and Loki, but in theory and in practice you can use Grafana to get data from almost anywhere, and that's what makes it absolutely awesome.

Installing Grafana could be just as simple as installing the previous tools, but I'm going to complicate it a bit, because I don't want only the engine and the UI; I want some dashboards and some data sources baked in from the start, so Grafana knows where to get its information. My Helm values file is therefore more complicated this time. There are the usual suspects, like "yes, ingress should be enabled", but then there are some less usual sections. To begin with, there is a datasources section where I'm saying: fetch metrics from Prometheus and fetch logs from Loki, and here are their addresses. Then there's a dashboard provider section, a bit of boilerplate I won't bother explaining. And then we have dashboards. In this case I have only a single dashboard, though you can probably guess you can have as many as you want. I'm going to use a dashboard that already exists, made by somebody who contributed it to the community, and two things matter here: the ID of the dashboard is 10000 (I'll get back to that ID), and its data source is Prometheus, meaning that's where it gets the data it visualizes.
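A rough sketch of what such a values file might look like, assuming the grafana/grafana chart; the in-cluster service URLs, the ingress host, and the exact layout are assumptions based on that chart's datasources, dashboardProviders, and dashboards values:

```yaml
# grafana-values.yaml (a sketch; field names follow the grafana/grafana chart)
ingress:
  enabled: true
  hosts:
    - grafana.1.2.3.4.nip.io   # placeholder host

datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        url: http://prometheus-server.monitoring   # assumed in-cluster service address
      - name: Loki
        type: loki
        url: http://loki.monitoring:3100           # assumed in-cluster service address

dashboardProviders:
  dashboardproviders.yaml:
    apiVersion: 1
    providers:
      - name: default
        folder: ""
        type: file
        options:
          path: /var/lib/grafana/dashboards/default

dashboards:
  default:
    kubernetes:
      gnetId: 10000          # the community "Kubernetes" dashboard discussed above
      datasource: Prometheus
```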
Now, you might be asking: how do I know that I want dashboard 10000? Well, let me show you how I found it. If you go to grafana.com, you'll see all the dashboards contributed to the project by different individuals and companies. You need to know what you want to visualize; in my case Kubernetes, but it can be anything else, and that's just the very high level, since you can drill down into Kubernetes specifics or visualize almost anything. Once you know what you're after, you search for it, check out the candidates, and pick the one you like. In my case I found a Kubernetes dashboard which happens to have the ID 10000; not 1, not 1000, but 10000.

So let's go back to the terminal and install Grafana with a fairly typical command: helm upgrade with --install, the release name grafana, the chart grafana/grafana, the monitoring namespace again, --create-namespace, the values file we just looked at, the host set to whatever the address is, and then I wait. This time I won't tell you to subscribe to the channel, because you already did that, and I don't have to wait long anyway. There we go, Grafana is up and running. If you paid attention to the instructions the chart printed, you'll see that we should retrieve a secret for the initial authentication. I'll copy and paste that command and execute it, and the output is the password I should use to log into Grafana.
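A sketch of those two steps, assuming the grafana/grafana chart and the values file above (the host and namespace are placeholders; the secret lookup mirrors the command the chart typically prints after installation):

```sh
helm upgrade --install grafana grafana/grafana \
  --namespace monitoring --create-namespace \
  --values grafana-values.yaml \
  --set "ingress.hosts={grafana.$CLUSTER_IP.nip.io}" \
  --wait

# Retrieve the auto-generated admin password (the username is "admin")
kubectl get secret --namespace monitoring grafana \
  -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
```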
With that in mind, let me open Grafana in a browser and see what we get. The username is admin, the password is whatever was in the output. First of all, let's go to the dashboards. There's only one, but it's the famous 10000 Kubernetes dashboard, and inside it we can see some colors moving around, some numbers, and so on. I'll let you explore the dashboard yourself: you can change the namespace, change how long a period you want to observe, and so on. Do it yourself; pretty colors are important, right? Better than Netflix.

Going back to the diagram: we now have Grafana in the cluster, and a person can go to Grafana and visualize things, in this case through dashboards, though there are other ways as well. So far Grafana is fetching data from Prometheus, but that's not the only place; it's getting data from Loki too. So let's see whether we can query the logs stored in Loki through Grafana. Apart from dashboards (and trust me, you can have dashboards for logs as well, if that makes any sense; I don't think it does, because logs serve a different purpose than metrics, but you can), I'm going to go to the Explore section, where I can query data from any data source currently connected to Grafana. I could select Loki and type a query very similar to PromQL, just tuned for logs. Actually, let's postpone Loki and go to Prometheus first: can we execute the same query from Grafana that we executed in Prometheus, apart from seeing the metrics in dashboards? Let me copy and paste the query from before into Grafana, making sure the Prometheus data source is selected, and there we are: those are the same metrics, with a different visualization. That's very important, because this way we have a single place where we visualize data from multiple sources.

For Loki, we can write a query ourselves or build one from the drop-downs below the query field: we can select an application, which gives us a starting point, and from there we can extend the selection with custom queries. The language is not PromQL; it's called LogQL, and it's very similar to Prometheus's query language. Anyway, the logs are there. Not much is happening in my tiny cluster, but it's enough to see that logs are being collected, shipped to Loki, and that I'm using Grafana to query and explore them.

So that's about it. That is, at least in my opinion, the perfect combination for self-hosted (not managed, not SaaS) monitoring, logging, and alerting. Try it out if you haven't already, and if you feel there are better solutions, please let me know in the comments; I'd like to hear what the better self-hosted options are. SaaS is a different beast, because there we don't need to care about some things and we might care about others. But let me know in the comments what you like and what your preference is for monitoring, logging, and alerting. Thank you so much for watching. See you next time. Cheers.
Info
Channel: DevOps Toolkit
Views: 17,585
Keywords: devops, devops toolkit, tutorial, k8s, kubernetes, monitoring, logging, monitoring and logging, monitoring and logging kubernetes, monitoring and logging k8s, troubleshoot, how to troubleshoot kubernetes, prometheus, prometheus kubernetes, prometheus k8s, loki, loki logging, robusta, grafana, grafana dashboard, grafana prometheus dashboard tutorial, prometheus grafana, grafana loki, prometheus and grafana monitoring, grafana prometheus, grafana alerts, grafana dashboard tutorial
Id: XR_yWlOEGiA
Length: 22min 7sec (1327 seconds)
Published: Mon Jun 13 2022