Guide to Grafana 101: Getting started with alerts

Captions
Today's session is all about getting started with Grafana alerts. It's the second session in our Guide to Grafana 101 series; the last one was about building awesome visuals, and this one, as you know, is about alerting. This is an interactive session, so please feel free to ask questions at any time using the Q&A feature in Zoom. There are two features, chat and Q&A: use the Q&A feature if you have questions, and the chat if you just have things to say. We have a team of dashboarding experts, including myself, to answer questions at the end of today's session, so put them in as they come to mind. I'd also like to remind everyone that you will receive a recording of today's session along with code, links to resources, and the slide deck I'm using today, so don't worry if you forget to take notes.

A little bit about me before we get started. My name is Avthar, I'm from South Africa, and I'm super interested in how to use technology to empower people, which is why I really love my job as a developer advocate here at Timescale. As part of my job I learn about new technologies all the time, like alerting in Grafana, DevOps, and monitoring. If you want to follow along with that journey, I talk about what I learn on Twitter and on my website, avthar.com.

Today's session has four parts. In the first part we'll look at alerting principles: what you want to alert on, and a basic understanding of alerts. In the second part we'll look at how alerts specifically function in Grafana. The bulk of today's session is part three, the "let's code" portion, where we'll use the scenario of monitoring a production database: I'll define alerts for uptime, memory consumption, and disk usage on that database, and I'll set up popular notification channels to send alerts to Slack, Opsgenie, and PagerDuty, which are examples of the software you'd want to receive alerts in. In part four I'll leave you with some resources so you can take action on your own, and of course we'll answer your questions, so please submit them throughout the session.

Let's get into the first section, alerting principles. Just so everyone starts on an equal footing (I know some people might be beginners while others are looking for more advanced material): alerts are an important part of any monitoring system because they tell us when things go wrong and when something needs human attention. That could be when something crashes, when you're consuming too many resources, when there's an outage, when users report performance degradation, or even an increase in support tickets; you could alert on any number of things. Essentially, alerts help us know when things go wrong so that humans can take action as soon as possible.

There are some best practices for implementing alerts that I want to mention. The first is to avoid over-alerting: only alert on something that requires immediate human attention.
If an engineer gets alerts too frequently, alerting ceases to be useful and to serve its purpose as something you take notice of. This goes for engineers in our day-to-day work, but also in our personal lives: think about when Facebook notifies you about things you don't care about; you end up tuning out, disabling notifications, or worse, ignoring them, and that can be really bad in a critical context where you need to act when something goes wrong. So make sure to alert only on relevant things, to avoid alert fatigue.

The second practice is to use case-specific alerts. What you alert on really depends on your scenario. For example, if you're running a SaaS platform, you want to alert on things like site uptime and latency, anything where a degradation in your ability to provide your service is something you want to know about. If, on the other hand, you're monitoring infrastructure, you want to monitor things like database disk usage, the CPU and memory of your various hosts, and API errors between the various pieces of infrastructure talking to each other. I want to hear from you in the chat: what are you looking to monitor and use alerts for? That will help contextualize some of what I'll cover in the second part of today's session. And of course, please use the Q&A feature (separate from the chat) to ask questions that we'll answer at the end, if anything is confusing or there's something you'd like to learn more about.

That wraps up part one; everyone now knows what alerts are supposed to do. Now let's look at how alerting works in Grafana. In this section I'll cover its two constituent parts: the first is alert rules and the second is notification channels. I'll also introduce alert states, which we'll examine more closely in section three of today's session. I realize some people might be new to Grafana, so I recommend checking out session one, which was on creating awesome visuals; my colleague Lacey just posted the link in the chat, and you can check it out after today's session for an introduction to visualization in Grafana. Fortunately, you don't need any visualization knowledge for today.

So why would you want to use Grafana alerts? The main reason is that you can run visualization and custom alerting all in one tool. Most people know Grafana as a visualization tool, but it also provides alerting functionality to notify you of anomalies. The main benefit is that you don't have the overhead of learning another piece of software, and you don't have to integrate another system into your backend; you get alerts right in the dashboards you already use for monitoring.

There are some rules about how alerts work in Grafana. The first is that they are limited to Graph visuals with time series output. As you can see in the screenshot on screen, you can only use alerts on Graph panels whose query returns a time series; you can't format the output as a table or anything like that, because the alert needs the notion of time.
You also can't use alerts on other visualization types like gauges or single stats. But the good news is twofold. First, you're mainly dealing with time series data for alerts anyway, because you really want to monitor how something changes over time, whether that's uptime or resource consumption at a point in time. Second, you can turn things like gauges and single stats into a Graph visual, because those panels usually just show the last data point in a time series, and the Graph visual will show you all the data points. So there are ways to work around the limitation and get what you want to alert on into a format Grafana can handle. The other thing to note is that Grafana supports many different data sources and notification channels, but not all of them support alerting; the Grafana docs have a list, and quite a few are supported, like Postgres, Prometheus, and CloudWatch. For the demos today I'm going to use Timescale, which is built on top of Postgres, as my data source (surprise, surprise), mainly because it's the one I'm most familiar with and the one I'm using in my monitoring system.

There are two parts of alerting in Grafana that you need to know about in order to understand today's demo. The first is alert rules. Alert rules are probably the most important thing to understand: these are the conditions you define for when the alert gets triggered and when you want to be notified. Here are some examples. "The last disk usage is more than 90%" is a simple threshold. You can also require that a threshold be sustained: the example on screen is "the average memory is greater than 90% for five consecutive minutes," which we'll explore more shortly. And third, you don't just have to use upper or lower bounds; you can also alert on a range, for example when the average temperature is outside (or inside) a certain range for 10 minutes. You have a lot of flexibility over exactly what you want to alert on.

Now let's look at an example rule in Grafana. First you give the rule a name; in this case I have a descriptive name. Then you set the frequency at which the rule is evaluated: rules in Grafana run on a scheduler, so you define how often that scheduler should evaluate whether the rule is true or false, and here I chose every one minute. Then there's the "for" period, which says how long the rule should be true before we fire off a notification. Then you specify the conditions: you choose an aggregate function (max, average, min, last, percent difference, and so on), select the query letter, and set the start and end time. In this example I'm monitoring query A, which is about memory, from five minutes ago until now, and I want to know if it's above 90. That's the anatomy of an alert in Grafana; we'll implement one in a few minutes, but this should give you an idea of how it works so you're not surprised during the demo.
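For reference, here is roughly how those UI fields map onto a panel's JSON in legacy (pre-unified) Grafana dashboard alerting, written out as a Python dict. The field names are my recollection of that format and are assumptions on my part, so verify them against a dashboard exported from your own Grafana version before relying on them.

```python
# A rough sketch of the "alert" object attached to a Graph panel in legacy
# (Grafana 6/7 era) dashboard alerting. Field names are assumptions; check an
# exported dashboard JSON from your own Grafana to confirm them.
memory_alert = {
    "name": "Sustained high memory on taxidb",      # descriptive rule name
    "frequency": "1m",                               # "Evaluate every"
    "for": "5m",                                     # "For" clause
    "conditions": [{
        "type": "query",
        "query": {"params": ["A", "5m", "now"]},     # query letter + time range
        "reducer": {"type": "avg", "params": []},    # aggregate function
        "evaluator": {"type": "gt", "params": [90]}, # IS ABOVE 90
        "operator": {"type": "and"},
    }],
    "noDataState": "no_data",
    "executionErrorState": "alerting",
    "notifications": [],   # filled in once notification channels exist
    "message": "Memory usage has been above 90% for 5 minutes",
}
```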
Okay, that's alert rules; now let's look at alert notification channels. Notification channels are where alerts get sent once the rules are triggered. If you don't have any notification channels, the alerts will only show up in Grafana, which is fine, but not everyone is looking at Grafana all the time, so you want alerts in places where they'll get people's attention. Usually that means wherever your team can see them: somewhere like Slack, email, Discord, or whatever communication tool you're using. In this example I've selected Slack from the list of notification channel types Grafana supports, and on screen is an example of a notification in Slack: it shows that it came from Grafana, what it's about, and some more information about what's actually going wrong. You can also have alerts sent through various support tools. I've got two on screen that I'll use in today's demo: one is PagerDuty, a very popular way to coordinate notifications through texts, calls, and emails for a whole team, and the other is Opsgenie, which functions very similarly, again coordinating texts, calls, and so on. Grafana provides these integrations using things like webhooks and API keys, and there are dozens of external services. Whenever we create an alert, we need to assign it to a notification channel; we'll see that in part three, where we'll also define the messages to be sent when the alerts go out.

So those are the basics of alert rules and notification channels. I also want to take you through alert states. We'll dig into how alerts move through different states later in today's session when I go through the examples, but there are four main alert states you need to know about, and the reason alerts have states is that you can think of alerts as objects that move through different states depending on the rule associated with them. The first state is OK: everything is fine, it's green, no action needed. The second is pending: the rule is true, but the "for" clause has not yet been satisfied. The first example I'll walk through involves this pending state; it's an in-between state between OK and alerting, where things are not okay, but you haven't yet met the conditions for the alert to fire and notifications to be sent. Then you have the red state, alerting: things are going wrong and notifications are being sent out. And the fourth state, separate from alerting, is no data, for when there's no data to evaluate the alert rule against; that sometimes happens when a piece of infrastructure goes down, or there's network latency, and so on. Those are the four alert states.
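If it helps to keep the vocabulary straight for the rest of the session, here is a minimal sketch of the four states as a Python enum. The member names are illustrative, not Grafana's internal identifiers.

```python
from enum import Enum

class AlertState(Enum):
    """The four alert states described above (names are illustrative)."""
    OK = "ok"              # rule is false, nothing to do
    PENDING = "pending"    # rule is true, but the "for" duration hasn't elapsed yet
    ALERTING = "alerting"  # rule has been true for the whole "for" window; notify
    NO_DATA = "no_data"    # nothing to evaluate (data source down, network lag, ...)
```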
Now let's get into the main part of today's session, actually implementing alerts and notification channels. Just a quick reminder: if you have any questions, please post them using the Q&A feature in Zoom and we'll answer them at the end. So let's get to the fun part, which is "let's code." I'm going to get my keyboard out in front of me and start setting some of these things up.

In this section we're going to implement alerting for a very simple monitoring setup, and I'll also give you a mental model for understanding how alerts work in Grafana: we'll go through the life cycle of an alert and how it moves through different states from start to finish. The scenario I'm using today is monitoring a production database. For those of you who are interested, I'm monitoring this database using Prometheus, with Timescale (the time series database) acting as remote read and write storage for Prometheus: the metrics are scraped by Prometheus and then written into Timescale. We've done a previous tutorial and webinar on how to set this up, and there will be a link to it in the resources section at the end.

In this demo we'll create three different alerts for things we're monitoring, and each alert will go to a different notification channel. The first alert is about average memory consumption; the rule is memory greater than 90% over five minutes, and the channel is Slack. Then we'll look at disk usage, where the rule is disk usage greater than 80% and the channel is PagerDuty. The last one is about service aliveness (is this database up or down?), based on a status metric, and the channel will be Opsgenie.

Let's get into the first one: average memory consumption over five minutes, with Slack as the channel. I'm going to exit my slide deck and move to my Grafana dashboard. What I have in front of me is a simple Grafana dashboard monitoring the three metrics I'm going to create alerts on, and on my second screen I have the different notification channels (Slack, PagerDuty, and Opsgenie) open so we can see notifications pop up as the alerts get sent out.

Let's start with percentage memory used. The way you add an alert to a Grafana graph is to edit the panel, click the bell icon, and select Create Alert. This first example shows memory usage over time, and the goal is to be told when we have sustained high memory usage over a period of time. We'll define that as average memory consumption per minute greater than 90% for five consecutive minutes. The first thing to do is give the rule a descriptive name; I'll call it "Sustained high memory on taxidb". taxidb is just the name of the database I'm monitoring; it's about taxis, for those of you who've done the taxi tutorial in Timescale, and if not, check it out in our docs. Then we set the frequency at which the rule is evaluated.
We'll evaluate it every one minute, so every minute Grafana runs the process to check whether this condition is true. And here I want the "for" period to be five minutes, because that's the window we care about: it tells Grafana how long the condition should hold before sending out a notification. This prevents false positives, and it's especially useful for serious alerts that wake people up at night; you want to be sure something is actually wrong, so it's better to wait a little, in this case five minutes, before sending the alert.

Speaking of alert conditions, now it's time to define the actual condition. You choose an aggregate function; I'm going to keep it as average, applied to a certain query. Grafana has a query lettering system: if you look at my queries, this one is A, and if I copy it, the copy becomes B; you can add C, D, E, and so on as you like. Going back to the alerting panel, we choose query A, and since we want the average memory per minute, I'll monitor from one minute ago until now. You can also choose ranges like "now minus ten minutes" to "now minus two minutes", that is, from a first time until a last time, if you don't want to alert on real-time data, for example when there's some lag before your data is inserted into the database. (Sorry, I'm just looking at the questions coming in; thank you to those submitting them, we'll answer them at the end.) Then we want to know when this average memory is above 90. Because the graph is already set up to display percentages, I can just type 90 here; be careful, because depending on the format of your data you might have to type something like 0.9 instead, but since I'm using percent as the output, it's simply 90.
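To make concrete what the scheduler does on each tick, here is a purely illustrative Python sketch (not Grafana's actual code) of a single evaluation: reduce the samples in the evaluation window with the chosen aggregate function, then compare the result to the threshold. The function name and sample values are made up for illustration.

```python
def evaluate_condition(samples, reducer="avg", threshold=90.0):
    """Illustrative only: reduce the samples in the evaluation window
    (here, percent memory used over the last minute) and compare to the threshold."""
    if not samples:
        return None  # nothing to evaluate -> maps to the "no data" state
    reducers = {
        "avg": lambda xs: sum(xs) / len(xs),
        "max": max,
        "min": min,
        "last": lambda xs: xs[-1],
    }
    value = reducers[reducer](samples)
    return value > threshold  # True means the rule is firing on this tick

# Memory is sitting around 6% right now, so the condition is false:
print(evaluate_condition([6.1, 6.3, 5.9]))  # -> False
```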
Next we specify how "no data" and execution errors are handled. "No data" is when there are null values or nothing to evaluate the rule against; you can set it to alerting, keep the last state, and so on. I'm going to keep it as "no data", because that distinguishes having no data from the alert rule actually being triggered. For an execution error or timeout, "alerting" is usually a good choice, because it tells you something is wrong with the way you've set things up.

Now that we've set up the alerting rule, we need to assign it a notification channel, and the one I'll use is Slack. Let me show you what I've done: you go to the bell icon in the sidebar, open Notification channels, and set up a new notification channel (I've navigated to the existing Slack one I set up earlier). What you often want here is a tiered notification system, so that smaller alerts go to less serious places, like Slack, and bigger alerts go to more serious places, maybe your incident-response software, so that the alert severity matches the channel it uses. For example, you wouldn't want to phone people about minor things. It can be tricky to decide what is serious and what isn't, but many people do something like this: for minor things use Slack or email, and for critical things use phone or SMS via third-party software like PagerDuty, Opsgenie, or VictorOps. For this example I'm using Slack since it's so common; everyone here probably uses Slack with their team, and the process is similar if you're using something like Microsoft Teams.

The second thing we need to do is configure the notification channel itself. In the tutorial linked at the end of today's session you'll find the process for getting a webhook URL; that's what I have here, the URL for my Slack channel. Let me open this up a bit: I have a Slack channel called "avthar-grafana-notifications" (it's a demo, so it's just for me), and this is where all my Grafana notifications get sent; you can see I had a lot of notifications earlier today while testing. You give your notification channel a name, select Slack as the type, and then you can set various parameters: you can make it the default channel so it's used whenever an alert goes out; you can include an image of the graph, which is often useful so the person reading the alert can see what the graph looked like; there's an option to disable the "resolved" message, which is the message sent when things go back to normal; and you can set reminders for how often you want to be re-notified while something stays wrong. I've pasted in the webhook URL, given my Slack bot a name, and you can also mention specific people; in this case I've mentioned the whole channel, so everyone in it gets notified when something goes wrong, which seems pretty reasonable.
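If you'd rather script this than click through the UI, older Grafana versions also expose notification channels over the HTTP API. Below is a hedged sketch assuming the legacy /api/alert-notifications endpoint and the Slack notifier's setting names from that era; the Grafana URL, API key, and webhook URL are placeholders, and the exact setting names should be checked against your Grafana version.

```python
import requests

GRAFANA_URL = "http://localhost:3000"   # assumed local Grafana instance
API_KEY = "<grafana-api-key>"           # placeholder; never commit real keys

slack_channel = {
    "name": "avthar-grafana-notifications",
    "type": "slack",
    "isDefault": False,
    "sendReminder": True,
    "settings": {
        "url": "https://hooks.slack.com/services/<your-webhook>",  # placeholder webhook
        "username": "Grafana Alerts",      # display name of the bot
        "mentionChannel": "channel",       # notify everyone in the channel
        "uploadImage": True,               # attach a snapshot of the graph
    },
}

resp = requests.post(
    f"{GRAFANA_URL}/api/alert-notifications",   # legacy notification-channel endpoint
    json=slack_channel,
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
print(resp.json())
```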
We save the channel and go back to the graph, where we just pick from the notification channels we've set up, in this case Slack, and then type a message that tells the reader what's going on, so whoever reads it knows what the issue is. I've written a short message describing what the notification is about. We save, and our notification should be set up, so let's test the rule. Testing tells you what's going on, and in this case, because my memory is at around six percent, the conditions are false. But what happens if I change the condition to, say, four percent, so that memory is obviously greater than the threshold? I'll save that, and I'll also change the "for" period to two minutes so we don't have to wait too long. I'll save, and then we'll watch the Slack channel on the side to see when things start happening.

While we wait, I want to take you through how alerts that use "for" work in Grafana. As I mentioned, the whole idea of "for" is that the condition must be true for a certain period of time. We're looking at memory usage, and the rule is that it must be greater than some percentage (95 in this diagram) over five minutes. Here's how it works in terms of the alert states you saw a few minutes ago. At some point in time, call it t-1, your average memory is below the threshold, and your default starting state is OK. When the memory goes over the threshold, you move from OK to pending; this pending state only appears when your alert rule has a "for" clause. You stay in pending as long as the "for" duration hasn't elapsed: in this case it's five minutes, so for t+1, t+2, t+3, t+4 we stay in pending as long as the value stays over the threshold of 95, which it does here. When you hit t+5, the "for" condition is satisfied and you move to alerting. Then, say the notification has gone out through Slack and I upgrade my configuration on Timescale Cloud so I have more memory available; the memory might drop, say ten minutes later at t+10, the rule becomes false, and we go back to OK.
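To make that OK, pending, alerting, OK walk-through concrete, here is a toy simulation of the scheduler, one evaluation per minute, for a rule of "value above 95 for five minutes". This is a mental model, not Grafana's implementation, and it also folds in the no-data case we'll use later; the function and the sample readings are made up for illustration.

```python
def simulate(readings, for_minutes):
    """readings: one value per evaluation tick (None means no data).
    Returns the alert state after each tick for the rule 'value > 95'."""
    state, pending_since = "ok", None
    history = []
    for t, value in enumerate(readings):
        if value is None:
            state, pending_since = "no_data", None    # nothing to evaluate
        elif value > 95:                              # rule is true on this tick
            if pending_since is None:
                pending_since = t                     # remember when it first became true
            if t - pending_since >= for_minutes:
                state = "alerting"                    # 'for' satisfied -> notify
            else:
                state = "pending"
        else:
            state, pending_since = "ok", None          # rule false -> back to OK immediately
        history.append(state)
    return history

# Memory climbs above 95% one minute in, stays there, then drops at t+10:
readings = [90, 96, 97, 98, 97, 96, 97, 98, 97, 96, 40]
print(simulate(readings, for_minutes=5))
# ['ok', 'pending', 'pending', 'pending', 'pending', 'pending',
#  'alerting', 'alerting', 'alerting', 'alerting', 'ok']
```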
So that's how alerts with "for" work. Let me exit this and refresh the page; that might have been less than two minutes. Now you can see in the graph that the pending state was triggered at a certain time, around 1:30, maybe a minute ago: the orange line marks when the pending state began. Let me refresh again to see when the alerting state gets triggered, and we should see an alert in the Slack channel. I might have gone through that a bit quicker than ideal, but right on time, you can see the alert just arrived in Slack: there's the message we set, "high memory," plus the values and the specific metric the alert was triggered on. So that's an example of an alert sent to Slack; everyone in the channel gets a notification to go check it out. Looking at what happened in our Grafana dashboard (let me zoom in; okay, that's too much zoom), the orange line is when pending kicked in and the red line is when the alerting state happened; the gap between them shows you when things went wrong. If we examine the alert panel a bit more and go to State history, it shows you the different states this alert went through and at what times. And if we test the rule right now, the condition is actually true and the state is pending again, because it needs to hold for two minutes. So that we don't keep generating Slack notifications, I'm going to change the threshold back to 90 and save; when we revisit Slack we should see this come back to normal.

Let's go through the next example, and if you have any questions, please keep putting them in the Q&A; I see them coming in. The next one is about disk usage, and it's a simpler alert: an alert without "for". The rule is disk usage greater than 80%, using PagerDuty as the channel. I'll come back to how the lifecycle works in a moment. Going back to my Grafana dashboard, you can see that when the memory alert cleared, we got an "OK" message in Slack telling us the situation has resolved from the metric's point of view, and you can then investigate what happened.

For disk usage we repeat the same process: create a new alert, which I'll call "High disk usage on taxidb". One thing to notice is that I still want to evaluate it every one minute, but this time I don't want a "for" clause, so I set "for" to zero minutes. That's because disk usage doesn't fluctuate much; really it can only increase, so I don't need to wait for the condition to be true for a period of time before alerting. For the alert rule I select the last() function, which gives me the last value in my time series.
The query I'm monitoring here is query B, so I select B in the dropdown, and I want its value from one minute ago until now. Let's say the alert fires when disk is above 90 percent, or was it 80? Let's just say 90 for now. Then we keep the defaults for no-data and error handling: if there's no data I set it to "no data", and for execution errors I set it to "alerting". Then we select the notification channel; I said I'd use PagerDuty, so let's see how to set that up. On the notification channel setup page we give it a name (I've called it the DevOps team PagerDuty), select PagerDuty from the dropdown, and then tune the parameters: whether it's your default channel, whether to include images, and so on. PagerDuty is a super popular tool for managing support and incident response for medium and large teams; if you're a small team of maybe ten people you might not need it, but beyond that you probably want some software to coordinate incident response. All that's needed here is an integration key. I haven't pasted my real integration key, otherwise you'd be able to send me notifications and wake me up in the middle of the night, but you put your key in here, and then you can set a severity for your incidents; over in PagerDuty you can mark things as critical or just a warning, which helps differentiate how they're handled. You can also enable auto-resolve, which closes the incident when the alert goes back to the OK state. So that's how you set up PagerDuty: you need your integration key, I've got a PagerDuty account open here, and the tutorial at the end of today's session shows you where to find the key in PagerDuty; for the sake of time we'll skip that part today.

Going back to the alerting screen, I select the PagerDuty channel, which I've called "avthar-pd" (PD for PagerDuty), and I'm also going to select Slack; this shows that you can attach multiple notification channels to the same alert. For the message I'll write that disk usage is above 80 percent and that I might want to upgrade my storage plan or alter the compression settings. Timescale has really good compression, so if you're running out of disk it can be useful to change the compression settings to compress your older data, or set up an automated policy to compress data older than a certain age. I select Save, and that should now be in effect. To demo this, since disk usage is currently around 22 percent, I'm going to set the threshold to "above 15" with a "for" of zero minutes, and save.
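Before the demo threshold fires, here is the same style of API sketch for a PagerDuty channel like the one configured above. The setting names (integrationKey, severity, autoResolve) are what I would expect from legacy Grafana's PagerDuty notifier, so treat them as assumptions, and keep the real integration key out of version control.

```python
import requests

pagerduty_channel = {
    "name": "devops-team-pagerduty",
    "type": "pagerduty",
    "settings": {
        "integrationKey": "<pagerduty-integration-key>",  # placeholder
        "severity": "critical",    # e.g. critical vs warning, to differentiate handling
        "autoResolve": True,       # close the incident when the alert returns to OK
    },
}

requests.post(
    "http://localhost:3000/api/alert-notifications",   # assumed legacy endpoint
    json=pagerduty_channel,
    headers={"Authorization": "Bearer <grafana-api-key>"},  # placeholder key
).raise_for_status()
```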
While we wait for the alert to trigger, let's look at how this alert works when there's no "for". The example is disk usage, with a rule of disk usage greater than 80. As I mentioned, the alert starts in OK as the default state, and because there's no "for", once the threshold is hit it goes straight to alerting, and once the rule is satisfied the notification is sent off. (That was actually a call from PagerDuty you may have heard, if there was a delay in the sound; they're calling to tell me there's an issue with my system.) Then say I, as the engineer, go take action and upgrade the disk capacity: usage drops to 40, the rule becomes false, and we go back to OK. Now let's take a look at PagerDuty. Let me refresh, and going back to my demo, you can see there's a new incident that's been triggered, "High disk usage on taxidb". It tells you the service it's running on; in this case I have a whole database for taxis. I could take action on it, but here I'm just going to resolve it, since it's only a test, and I'll put the threshold back to 90 so we don't get recurring notifications now that I've demonstrated the point.

You'll notice the difference between alerts with "for" and without "for": without "for" you don't get the pending phase and the orange lines on the graphs; for disk usage there's just a red line showing that an alert happened at that time (and whenever a panel is alerting in Grafana, you'll see a glowing red border around it).

There was a question I want to answer live, about whether going from alerting back to OK requires the same timing as going from pending to alerting, and it doesn't. The transition back is based on how often the rule is evaluated. If we go back to the memory alert: it's evaluated every one minute; if the condition is still true it may go back to pending, but if it's not true it goes back to OK. You don't have to wait the same amount of time; as soon as an evaluation sees that the condition is no longer true, the state returns to OK. Pending only applies while the condition is true and the alert is on its way from OK to alerting; the move back from alerting to OK doesn't need the "for" time, only the "evaluate every" interval. That answers Eden's question live; I'll also type the answer out later for others, but I thought it was relevant, so I wanted to mention it. Keep the questions coming in, and they might get answered live if they're relevant to what I'm doing.

The third and final example before we move into Q&A is service aliveness. This is super common: you want to know whether your database is up or not, and in this case I'm going to use Opsgenie as the notification channel. Let's move back to the dashboard, and I'll open Opsgenie on the right-hand side. You can see the query I'm using; the thing to notice is that the status metric is either one or zero: when the status is okay it's one, and when it's not okay it drops to zero. I'm going to create an alert with that in mind, and I'll call it "taxidb service status alert".
I'll evaluate it every one minute, with a "for" of, say, two minutes; you don't want your database to be down for too long, and I think two minutes is enough to account for things like network lag. Then, again using the last() function, the query is query A, the range from five minutes ago until now is fine, and the condition is "is below 1", because one is the value we established for a healthy status. We keep the default no-data and error-handling states; in the demo I'm about to show you, the no-data state is actually going to come into effect.

Before we do that, let's look at how to set up Opsgenie. It's very similar to PagerDuty: Opsgenie is somewhat more advanced software for managing support and incident response, and it lets you configure who gets notified on which platforms, whether email, SMS, voice, or a mobile app. You select Opsgenie from the dropdown menu and put in your API key, which you can find under the team settings; the tutorial in our docs shows how to set that up. There are a couple of options here, such as auto close, if you want incidents closed automatically once the alert state goes back to OK, and override priority, if you want to manage the priority from Grafana. So that's how you set up Opsgenie: the main thing is the API key, the docs show you how to get it, and I don't want to expose mine here. Then, back in the alert's notifications, I select "avthar devops team 2", because that's what the Opsgenie channel is called; it will notify my whole team. For the notification message I'll write "RED ALERT: database is down". It's very important; capital letters are needed for something like this.

If we test this rule now, we can see the state is OK and the conditions are false. But what if I do something that takes the database down? I'm going to open my window into Timescale Cloud, the hosted and managed version of Timescale; it's just the easiest way I like to spin up databases, rather than having multiple Timescale instances running in Docker or Kubernetes and worrying about ports getting mixed up between the various applications I have running. I'm going to power off the service, which simulates the database going down. You usually wouldn't do this by mistake, because there's a power-off confirmation, but who knows. A more common scenario is when you're running on your own infrastructure rather than a hosted service, you run out of resources in a cluster, and your database goes down. I can see my service has now been powered off, so we should get some alerts here in a moment. Let me refresh: the database status is 0 right now; it was one, it's gone to zero, the rule will be evaluated over the "for" period, and the alert will move into the alerting state pretty soon. Let me refresh again.
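For completeness, here is the same style of sketch for an Opsgenie channel like the one described above. Again, the endpoint and the setting names (apiKey, autoClose, overridePriority) are assumptions based on legacy Grafana's notifier options, and the API key is a placeholder; check your Grafana version before using this.

```python
import requests

opsgenie_channel = {
    "name": "avthar devops team 2",
    "type": "opsgenie",
    "settings": {
        "apiKey": "<opsgenie-api-key>",   # placeholder; found under Opsgenie team settings
        "autoClose": True,                # close the Opsgenie alert when the state returns to OK
        "overridePriority": True,         # allow Grafana to set the Opsgenie priority
    },
}

requests.post(
    "http://localhost:3000/api/alert-notifications",   # assumed legacy endpoint
    json=opsgenie_channel,
    headers={"Authorization": "Bearer <grafana-api-key>"},  # placeholder key
).raise_for_status()
```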
While we wait for that to happen, and while I keep Opsgenie and the alert panel open, let me take you through what's going to happen with some of the other notifications we've set up; this is where alerts with no data come in. Because of service aliveness, when your data source stops working it can affect the other alerts you've set up. In this case we're monitoring uptime, and the rule is "up is less than 1", as we've established. When you actually have data, everything behaves normally: the alert might be OK, pending, or alerting, but there's data available. When the data is null, the alert goes to no data, because there's nothing available to evaluate the rule against. Once data comes back, say you turn the database back on or figure out what's going on, the no-data state goes back to OK, because there's data available again to evaluate the rule and everything is fine. That's how alerts with no data work; it's a separate state from alerting.

I'm getting a call right now from Opsgenie telling me something is going on. Let me refresh the page and we should see it; it usually takes a few minutes. What's also interesting, looking at Slack, is that I'm getting a bunch of "no data" notifications for things like high memory on the taxi database and sustained high memory, because those alerts had rules to evaluate and now have no data to evaluate them against. You can see how Grafana distinguishes between alerting and no data being available. In my actual dashboard the database is down, the status is zero, the grey no-data regions are showing, and memory is no longer being populated. We'll give Opsgenie a few more minutes to catch up and see whether the alert populates; such is life with live demos. In the meantime, since we've come to the end of the live demo right on time, let me take you through a recap and next steps, and then we'll answer your questions, so put them in if you haven't already.

Today we did a few things. We went through alerting principles in Grafana: why alerts are important and when you should use them. We looked at a mental model for understanding how alerts work in Grafana, using the scenario of monitoring a database, where I took you through the lifecycles of different types of alerts. We defined alert rules for aliveness, memory, and disk usage; in those rules we looked at alerts with and without "for" and at different aggregate functions like last and average. And lastly, we set up notification channels using Slack, Opsgenie, and PagerDuty. Second, as a summary, I wanted to leave you with this diagram.
I've been using these state diagrams throughout today's presentation, and this one is a summary of all the possible state transitions, to help you reason about alerts when you implement them in Grafana yourself. You can take a screenshot now, or view it in the slide deck we'll send you after today's presentation, probably tomorrow or Friday. It's a summary I found useful to put together because there's nothing quite like it in the Grafana docs, and it helps you quickly understand, once you're in a certain state, what happens next depending on the data that comes in. So if you haven't taken a screenshot, don't worry, it will be sent to you. That's the second thing I want to leave you with to help you reason about alerting.

So what's next? We've put up an alerting tutorial, and it's live right now at the short link on screen, that lets you replicate the demo I've done today. If you want to try this on a toy system with sample data before implementing it on your real system, you can do that with the alerting tutorial. If you want to use Timescale in your monitoring setup, we've just made available the new Prometheus adapter, as well as Helm charts to spin up Timescale, Prometheus, and Grafana on Kubernetes; the GitHub link on screen will take you there. You can also join our community Slack if you have any questions; you can find me there, along with Timescale co-founders and other engineers, people like Grafana contributor Sven Klemm from Timescale, who is always helping people with Grafana-related things, and you can get support from other community members with experience implementing these kinds of things; they're always active in the Timescale Slack. And if you're ready to get started with Timescale, the easiest way is through Timescale Cloud: you'll get $300 in credits to start if you use the link on screen.

Two last things before we move into Q&A. First, we want your feedback on today's session: if you have ideas for future topics we should cover, or feedback on what you enjoyed or didn't, please fill out the feedback survey we'll send in the follow-up email. Second, you're the first to know that we're going to have another session in our Grafana 101 series of webinars, this one on getting started with templating and sharing. I saw some questions about template variables, so I'll take you through exactly how those work, and when you can and can't use them. That will be in about a month, on Wednesday, June 17th, so mark your calendars; you'll get an RSVP link in the follow-up email. Again, thank you so much for spending this time with me today. We'll take your questions now, and I'll be answering them by text in the Q&A function in Zoom, so please put them in if you haven't already. I'm going to go on mute; thanks so much for joining us for today's webinar.
Info
Channel: TimescaleDB
Views: 24,932
Keywords: time series, data, database, timescale, timescaledb
Id: n6yZuRr36uI
Length: 52min 46sec (3166 seconds)
Published: Thu May 21 2020