Unified Alerting Grafana 8 | Prometheus | Victoria | Telegraf | Notifications | Alert Templating

Captions

Hey YouTube, this is Bharat from ReadyDedis LLC. Today we are going to learn about basic dashboard creation, alerting setup, and notifications via Slack. For those who aren't aware of Slack, it's an enterprise-level collaboration platform that organizations use for secure yet efficient collaboration across teams and departments. Slack also integrates with most of the plugins and applications out there, so it's very easy for developers to use as well.

I would suggest you look at the previous video on my channel, which covers the setup of Grafana, VictoriaMetrics, Telegraf, and the NGINX reverse proxy. It will help you understand the components we are dealing with today. So let's get started.

Let's go to the Explore option first. Here I get a metrics browser, which shows a list of metrics I can choose from, a list of labels, and then the values of those labels, so I can choose which server or which label I want. I can even group by labels. We will go with a simple metric, the CPU usage metric: cpu_usage_system. Let's select the host name label; it could be anything. You're not seeing the query here yet because I didn't click Use query. So let's select a host name from here, which is pnx03, and then I click Use query, and the query is cpu_usage_system. It is showing only 20 time series, but it has around 40.
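The selection made in the metrics browser amounts to a metric name plus exact-match label filters. A minimal Go sketch of how such a selector string is put together is shown here. The `selector` helper is hypothetical, and `host` is assumed to be the label name (it is the default host tag in Telegraf); Go's `%q` quoting is used as a close approximation of the query language's string escaping.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// selector builds a Prometheus-style series selector such as
// cpu_usage_system{host="pnx03"} from a metric name and exact-match labels.
// It is a hypothetical helper for illustration, not part of Grafana.
func selector(metric string, labels map[string]string) string {
	if len(labels) == 0 {
		return metric
	}

	// Sort label names so the output is deterministic.
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names)

	matchers := make([]string, 0, len(names))
	for _, name := range names {
		// %q adds the double quotes and escapes quotes or backslashes in the value.
		matchers = append(matchers, fmt.Sprintf("%s=%q", name, labels[name]))
	}
	return metric + "{" + strings.Join(matchers, ",") + "}"
}

func main() {
	// "host" is an assumed label name; "pnx03" is the server picked in the walkthrough.
	fmt.Println(selector("cpu_usage_system", map[string]string{"host": "pnx03"}))
}
```

Adding a second label to the map narrows the result further, which is the same thing the metrics browser does each time another label value is ticked.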
So let's assume it has 40 vCPUs, or CPU cores. We cannot really visualize that many cores at a time, so what we can do is say cpu="cpu-total" and just query that. This is the total CPU usage of the system. We can also get an average by using the average function, and then you see the average usage over the past one hour, which is around 9.17.

Let's visualize this in a dashboard. I'll click on the plus button and then on Dashboard. First things first, let's give this dashboard a name, say Phoenix Monitoring; Phoenix is a location. I can also add tags here so you can identify the dashboard, and you can select or create folders and assign each dashboard to its own folder. Let's save this dashboard.

Now we are in the dashboard, so let's add a new panel with this plus button here, Add panel. For the title I'll say CPU Usage Percent, and the title changes as you can see. Now I'll query cpu_usage_system with the host name equal to pnx03, and I want only the total usage. This is coming in as a percentage, but you are not seeing the percent symbol on the y-axis. So we can go to Standard options and choose the unit: we'll say Percent (0-100). There are a lot of other options apart from this. Now we are seeing the percentage here. Minimum can be 0 and maximum can be 100, and you can set your own thresholds.

Now, if you see, this tooltip is too long and gives too much information. We don't want that; we only need the name of the server whose metric we are looking at. So go to the Legend format, open two curly braces, put the host name label, and close them. If you want the IP, and you have set it in your Telegraf configuration, you can add it here as
well, so you can say ip. You see, that's how you do it, and this is the total CPU usage that we are looking at. So that's done. Then you have options for a transparent background and so on. Let's close the panel options.

Next, the Tooltip option. Currently it's only one metric, or one server, but if you have multiple servers and want to see the tooltip for all of them, that is going to be difficult. For that, go to the Tooltip option and click All. Let me show you how you can make use of it. I'll copy this query and paste it here as a second query, so it's cpu_usage_system with the host name equal to pnx04 and the same cpu label. Now, at every point of the data, I get both metrics in the same tooltip.

Now, what if you are using a particular naming convention and want everything to show up within a single query instead of writing multiple queries? You can put a tilde after the equals sign (I know it as a negation mark, but I'm not sure what it's called exactly) and write a regex, so pnx.* and once you run the query you get all the host names that match that naming. That's the only regex I'm aware of, so unfortunately that's the only thing I'm going to show, but maybe in time I'll make more tutorials on regex and how it can make your life easy.

So this is how it looks. Let's go to the next option, Graph styles. Here it's a line graph; if you want, you can go for a different style, such as points. This is more of a styling-related section. Make sure Connect null values is set to Always, because it will connect your data points even if some are missing. Show points, Always or Auto, leave as it is. Stack series is Normal; with Normal it is going to go up to 100 percent, but
if it's not at 100 percent usage and you see this kind of graph, you're actually baffled. So let's go with Off, and you'll have your series. If you think this is too much, you can go back to Standard options and remove the minimum and the maximum, and the graph will scale on its own based on the percentage. You can also just set the minimum to 0 and leave the maximum empty. I'll still keep 0 minimum and 100 maximum. Changing the display name of the series is not needed. You can select the color palette you need. For what to show when there is no value, you don't need to fill this in.

Thresholds now. Let's say you want this graph to turn red as soon as it crosses a particular threshold; say, if it crosses two percent, make the lines red. That will actually happen, though it's not quite visible right now. So let's not use the threshold here; thresholds are more useful for other kinds of panels, like the gauge. Remove the threshold. Then come the value mappings. I'm not quite sure what they do exactly, as I haven't come across a use for them, so maybe I'll make another video covering this. I'm not sure about the data links either. The standard options have been covered. In the Axis part, you can place the axis on the left or the right; this is for the y-axis. A label is usually not needed, but you can set one if you want, and you can set the grid lines to disappear or appear. It's totally up to you; this is all styling. So I'll apply this and save. This is how it looks after saving.

Now, what if I want to create an alert from this? Say any server goes above x percent of CPU usage; what alert should I create? The expression types in Grafana alerting are Classic condition, Math, Reduce, and Resample. The classic
one is a very basic automatic one. You say: when the average of A (A being the CPU usage query here) is above 3, give me an alert. If it's above 3, the value will be 1. Let's say if it's above 10, give me an alert, and then click Run queries; this will most probably become 0. Has anything crossed it? Let's try 50 percent.

For the alert conditions, we'll say: if condition B, evaluated every minute for the past five minutes, is greater than zero, then throw an alert. If there is no data, we'll still throw an alert, and if there is an execution error or timeout, we'll still throw an alert. Then click Preview alerts. Currently the state is Normal, because nothing is over 50. What if I say 11? There should be one or two. Maybe not; let's go with 10 and run the queries. Yes, there's something above 10, so I'll click Preview alerts. Here you see we are getting this metric value.

You cannot really work with this. I mean, it will work and it will send you an alert, but if you want to make the alert understandable for your end users, that's not possible with the classic condition. This is not documented anywhere; I had to search and search and eventually found in a GitHub issue that the classic condition doesn't support the use of variables in the alert notification.

So let's introduce Math and Reduce. First things first, we have a raw collection of data here with certain conditions, and what we'll do is reduce that data to its mean value. So we take the mean of A and run the query, and for each host it gives me a single value. This is an expression, not a query. I'll add another expression and do the math. This is basically doing the same as the classic condition, but with
reducing in between. We'll say: for each host that Reduce sends me, if the value is above 10, give me an alert. So I'll say $B > 10 and run the queries. If you compare the data here and here, you see all the hosts are 0, but this particular one comes out as 1. If I go to Preview alerts, I get all this data, but if I move to the end it says the state is Alerting. Why? Because we are still evaluating condition B. Let's move it to C; C is what we just created. Preview the alerts once again, and now it says these are all Normal. But I want to see the alerting one, so here it is, Alerting. Now, if you see, it has two values: one is variable C and one is variable B. We always take the value from variable B, since C is only 0 or 1.

Now, how do I turn this into a meaningful alert message? I'll scroll down and remove the dashboard UID and the panel ID, and add two annotations: a summary of the alert and the runbook URL. For the runbook, let's say we have a Jira link, jira.atlassian.com, or Confluence for that matter, with a document named something like cpu+usage+alert; a general Confluence link. For the summary I'll say, in curly braces, {{ $labels.host }} is reporting a CPU usage of {{ $values.B.Value }} percent for the last five minutes. Then there are labels: location could be phoenix, client could be [unclear] or someone else. Then you just select a folder, and the rule name would be High CPU usage. Once this is done, I'll click Save and exit.

Let's go back and refresh. We go to Alert rules, and in the [unclear] folder I have this alert ready. If it triggers, we need it to go to a particular channel or a particular person or whatever it is. So what's the
next thing? The next thing is to configure contact points. Contact points are like recipients for alerts. I'll click New contact point. There are many options here: you can choose Webhook, VictorOps, Telegram, Slack, PagerDuty, Pushover, Opsgenie, Email, Discord, and so on. Let's go with Slack. I have already made one. It will not show you the token, but this is a Slack bot token that you get when you create a small Slack app. The next fields are not required, but feel free to fill them in if you want. Then you have the Test option. Click Test. This is my Slack, and I want an alert here, so I'll send a sample alert. You see, this is a test alert. Now, this one fired with a custom template. If I remove the template and save the contact point, you don't get that by default. Send a test again and you get something like this. The value that we saw in the alert section correlates with this one here. This is not quite readable, is it? The labels keep piling up, the annotations keep piling up. So what can we do? That is where templating comes in, and that's for the next section. So that is how you send a test notification.

Then let's say you want only a particular kind of alert to go there; for example, anything related to a particular client should go to a particular channel. For that, go to Notification policies and add a new policy. You can match labels, so you can say client is equal to [unclear], then choose the contact point for it and save. If this particular client has any alert, it will come to that Slack channel. But what if you want one for every channel? For that you would have to create multiple contact points with the same token
but you have to mention a different channel every time. So the recipient can change from alerts to maybe [unclear], or to any other channel that you have. That is one way to do it. We'll save that, and the notification policy is also saved.

Let's go back to the alerts. This is now pending for three minutes, so if its evaluation passes for the five minutes it will send me an alert. You see there are multiple servers here, and it says Pending for this one. If you look at the summary, the variables are now replaced with actual data: [unclear] is reporting a CPU usage of x percent for the last five minutes. This is the message you will receive in Slack. So let's wait.

While this is happening, let's go to our next topic, the template. This is something that no YouTube channel has covered so far. I had a very tough time dealing with this, because the alerts looked very bad with the default template. So let's go through the template that I've made for Slack. This is a very basic template, and I've made it quite readable; anyone who is into programming can probably understand it. It starts by defining certain sections, and then it uses those sections and passes data to them. If you have alerts that are firing, you pass them to this specific section, the Slack default message. If you have alerts that are resolved, you pass them to the other section, which is the resolved one. That is how it works.

To give an explanation, let's start with a new template. Just before we go ahead with it, let me show you how the default one is written. I'll use vi for the sake of it. This is the way they have written it. So if you see, they're doing
the value string over here instead of actually parsing the values; that is something I've replaced. Anyway, for general understanding, it should have been written like this. I'm defining a section called subject, and in it I open a square bracket and close it. Everything that is passed to this section comes in through the dot, which is an object. So .Status is the status, and the function toUpper changes it to uppercase. Then, if .Status equals "firing", they put a colon and the length of the firing alerts, that is, the number of alerts firing, then end, and then they close the square bracket. Anyone who is into programming would probably understand this, but if I lay it out properly, the if goes like this and the end like this. Every if has an end, and every define has an end. That is the basic idea.

Let's start writing a template, and maybe then you'll understand what I'm trying to say. I'll define the template alert_subject, and I'll end it here. Now, how do I want it to look? If I want to add any particular data, I can do it within these sections. Let's say I don't want to use the square brackets [unclear]. So if eq .Status "firing", I'll say Firing, and then end. Then I'll open and close a bracket, and in it I want the name of the alert, so I'll say .Labels.alertname, and I have to enclose this in curly braces as well.

Now, how do I pass every alert through this section? I'll define another template and name it alert_template, and end it. Then I'll do a for loop, and a for loop in
this is like a range. So range, end, and I'll say if .Alerts.Firing, then another end here. If I want to send it to the subject, I have to forward it as a template, because that part is also a template. So I reference it as a template; the name of the template is alert_subject, and I close it, and I forward .Alerts.Firing. That is how I usually send it. That is the basic gist of it. You can also reference this directly in the Slack contact point: if you go to Contact points and edit it, you have an option where you can specify it, template "alert_subject". So that's how it will be.

Coming back to our templating: if the status is firing it will say Firing and the alert name, but if the status is resolved I'll say Resolved and .Labels.alertname. That is one section of the template. I don't need to pass this here, but that's about it, I guess. So, another template now, with .Alerts.Firing; you can also do .Alerts.Resolved. I'll save the template. Now I'll edit my contact point, set it to template "alert_template", and save it, and then I'll send a test notification.

Oh, we already have our previous alert here. You see there's a High CPU usage alert going on, with the alert summary, the alert runbook, the labels, the source of the alert, and, if you want to silence it, the silencing link [unclear].

Coming back to our template: it's alert_template, and you see how I passed the dot here. That dot contains all the objects of the alert, which is why it was added that way. Here, if you see, the Firing part came through but it didn't complete, and that's most probably because I've done something wrong. So, if status... okay, the
status is not coming from the default object, because we have already taken the alerts part here. So I think for each alert we have to do it like this: template "alert_subject" with the dot, inside a range over .Alerts.Firing. We'll save this and send a test. Now you see Firing TestAlert: this is the alert name, and this is the Firing that we have written.

Now, this is the usual subject. If you pass the title of the alert within this template, it comes out in the body. But if you do the same in the title field, template "alert_subject", and reference the section in the alert template that you made earlier, then it would read something like: range over .Alerts.Firing, the alert template, and then end. Save it, edit it once again, and send a test. That didn't quite come out as I expected, so most probably the range functions are not working here. Oh, I forgot to give a dot. Let's save this and try once again. Now, if you see, even the title is getting the same output. Whatever name you give your templates has to be unique, and you cannot expect to reference your title from within the body; the two have to be kept separate.

But that is for the Slack contact point. For other contact points you are not always able to specify templates. For email, for example, you don't have an option to use a template; you can only use variables. Discord has a template option, but VictorOps doesn't, and Telegram doesn't either. So there are only a few contact points that you can use a template for; for all the others, the default one applies. So if you're standardizing on a particular kind of template, then it's better to
edit the default template file in the /var/lib/grafana/alerting folder. That's all about the templating part. I'll leave the template that I've created for Slack on GitHub for you to use as a reference. I hope that helps, and maybe you can share better templates going forward.

Next comes... oh, we already covered the notification policy. So client is equal to [unclear], and we send it to this contact point; there is specific routing based on the labels. You can also configure waiting options and so on in the timing options section. If you edit a specific routing policy you don't have timing options; only the root policy has them. I think they'll bring timing options to the other policies as well, but currently this is how it is.

Then comes the silencing part. You can go to one of your alerts and silence it: just click Silence and set a duration of two hours, four hours, or whatever. This is very helpful during maintenance windows, when you can expect certain alerts to pop up.

Alert groups: this is grouping your alerts based on specific criteria. Say all your alerts related to CPUs can be grouped by a particular tag or label. This is fairly self-explanatory, so I'm not going to cover alert groups in this tutorial. I don't think we have anything else to cover. You can also have external Alertmanagers, but I haven't tried that option yet; once I do, I'll get back to you with another video.

So that's it for today, guys. I hope that was helpful in covering alerts. If you want me to cover anything else, do let me know and I'll do that. Thank you.
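The template structure built in the walkthrough (a define for the subject, a second define that ranges over the alerts and forwards each one with the dot) can be tried outside Grafana, because Grafana's notification templates use Go template syntax. The following is a minimal sketch, not the author's actual template: `Alert`, `Alerts`, and `Data` are simplified stand-ins for the data Grafana passes in, the `render` helper is hypothetical, and the two sample alerts are made up for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"text/template"
)

// Alert, Alerts and Data are simplified stand-ins (assumptions) for the
// objects Grafana hands to a notification template.
type Alert struct {
	Status string
	Labels map[string]string
}

type Alerts []Alert

type Data struct {
	Alerts Alerts
}

// withStatus returns only the alerts whose status matches.
func (as Alerts) withStatus(status string) Alerts {
	var out Alerts
	for _, a := range as {
		if a.Status == status {
			out = append(out, a)
		}
	}
	return out
}

// Firing and Resolved are exported so templates can call .Alerts.Firing
// and .Alerts.Resolved.
func (as Alerts) Firing() Alerts   { return as.withStatus("firing") }
func (as Alerts) Resolved() Alerts { return as.withStatus("resolved") }

// alert_subject formats one alert; alert_template ranges over the alerts and
// forwards each one with the dot, as in the walkthrough.
const tmplText = `{{ define "alert_subject" }}{{ if eq .Status "firing" }}[FIRING]{{ else }}[RESOLVED]{{ end }} {{ .Labels.alertname }}{{ end }}` +
	`{{ define "alert_template" }}{{ range .Alerts.Firing }}{{ template "alert_subject" . }}{{ "\n" }}{{ end }}` +
	`{{ range .Alerts.Resolved }}{{ template "alert_subject" . }}{{ "\n" }}{{ end }}{{ end }}`

// render parses the template text and executes the named section.
func render(name string, data Data) (string, error) {
	t, err := template.New("alerts").Parse(tmplText)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.ExecuteTemplate(&sb, name, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// Hypothetical sample alerts, for illustration only.
	data := Data{Alerts: Alerts{
		{Status: "firing", Labels: map[string]string{"alertname": "High CPU usage"}},
		{Status: "resolved", Labels: map[string]string{"alertname": "Disk space"}},
	}}
	out, err := render("alert_template", data)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

Leaving out the trailing dot in `{{ template "alert_subject" . }}` passes no data to the subject section, which matches the incomplete output seen in the walkthrough before the dot was added.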

Info

Channel: ReadyDedis LLC

Views: 12,302

Keywords: grafana, unified alerting, alertmanager, alert template, slack, notification, grafana 8 alerting, grafana 8 message template

Id: UtmmhLraSnE

Length: 40min 8sec (2408 seconds)

Published: Wed Feb 16 2022