Getting started with Grafana OnCall

Captions
Welcome, everyone, and thank you for joining. My name is Matvey Kukuy, I'm leading the Grafana OnCall project, and I will be giving this webinar, Getting Started with Grafana OnCall. We'll walk through the basics and do some deep dives into different parts of the product. The goal of this webinar is to make you comfortable: afterwards you should feel empowered to go and set up OnCall for you and your team and actually use this tool.

Before we start: this webinar is being recorded, and every one of you will receive access to the recording. Please use the Q&A feature in Zoom for questions; it will help me see which questions I have answered and which I haven't. I'll also see the chat, so if you want to ask follow-up questions, feel free to use it.

I'm switching to the practical part. Before I start demoing the product and connecting and configuring everything, I want to tell you about Grafana OnCall in general. The purpose of Grafana OnCall is to collect alerts from different alert sources, not only Grafana Cloud. We allow connecting it with Grafana you host yourself, [inaudible], and other sources; actually, any service which can issue a webhook. It will group alerts and deliver the grouped alerts to multiple destinations, to your engineers. The basic platform for us is Slack. We are adding new platforms: we recently added Telegram and are working on others. One of the core features of Grafana OnCall is following on-call schedules, so it will help you distribute alert pressure across your team in order to achieve reliable incident response.

I'm starting to share my screen. Where can you find Grafana OnCall? Grafana OnCall is available in Grafana Cloud only, so it's a cloud product. I just signed up in Grafana Cloud, and you see this stack is absolutely empty, at [inaudible]five.grafana.net. Grafana OnCall is actually a Grafana Cloud plugin, so it's here, this round logo. You can
find it here. The first thing we need to do is connect it with alert sources. I'm jumping to Integrations and creating an integration for alerts. Here we have a huge list of possible integrations. If you don't see your alert source here, no worries: we support webhooks. We have a webhook with an expected format, but if your alert source cannot issue a webhook in the expected format, feel free to use this one, and I will show you later how to adjust parsing on the Grafana OnCall side.

The easiest way is to connect Grafana OnCall with the current Grafana Cloud stack, so I'm doing this, going back, and here it is. It connected with Alerting, created a contact point in the current Grafana stack, and it will start receiving alerts from it. Let's go to Grafana Alerting, and here you'll find a webhook contact point; that's what we actually created. If you set up alerts here, they will show up here.

I also want to show you how to connect it with self-hosted Grafana, so I prepared one. Let's start it. I just launched my local Grafana, and it's accessible on localhost:3000; admin/admin is the default login. It's asking me to set up my password, but I will not use this instance for a long time. For example, I want to connect this Docker Grafana with my cloud OnCall. So I'm going here to Contact points, New contact point, choosing Webhook, and here I need to specify the webhook URL of my Grafana OnCall cloud consumer. I'm going to Grafana OnCall, New integration, I see this option, Other Grafana, I'm choosing it, and it gave me a webhook URL. I'm copy-pasting it here and sending a test notification, and it should show up here. That's it: you see it received one alert in one alert group. You can add a lot of sources here: Zabbix (we give you instructions on how to connect it), Datadog, Stackdriver, etc.

OK, so we configured consuming alerts from alert sources. Now we need to set up escalation
chains. An escalation chain is a set of instructions which execute one by one after an alert group is registered. Alerts go from the source to Grafana OnCall, they are grouped if needed (it's configurable), and for those groups we start to execute escalation policies one by one. So I'm going to Escalation Chains and creating a new escalation chain. For example, this is my SRE critical escalation chain, only for alert groups which require an immediate response, and here I start creating rules. At the beginning, let's notify me; after this, let's wait for five minutes, then notify me again.

We have a lot of instructions here which can help you find a comfortable escalation chain for your team. For example, we have this instruction: it will pause the execution of the escalation chain at night, and it will continue executing in the morning. Using this instruction you can hold all your alert groups if they come in at night and start escalating them in the morning. Or we have round-robin, which is useful for small teams. Where is it... OK, yeah, "one by one". Just pick users, and every next alert group will be escalated to the next person. It's a very basic instrument for distributing alert pressure within a team.

OK, so we have a very basic escalation chain, and now it's time to connect it to our integration. I am going here and connecting it to my source. I could connect it to another source also, and to a third source. That's it. You can edit it here. Please note that if you edit the escalation chain here, it will also be edited for your other sources.

That's the moment to think about more complex ways of distributing load. If I receive an alert from this integration, it will be grouped into an alert group and this escalation chain will be triggered, so every time I will be notified. But how do you distribute load between time zones in a distributed team? That's where our on-call schedules come into play. I'm going to the Schedules menu and creating a new on-call
schedule called rotation. This is a schedule for the SRE team. We allow configuring on-call rotations a little bit differently from other products. We found out that it's very comfortable to use calendaring apps for setting up on-call rotations, so we allow connecting Google Calendar, Outlook, and other calendaring systems to Grafana OnCall. You'll be able to configure on-call rotations there, using your mobile apps and sharing calendars between team members, so calendaring apps give you good visibility of who is on call when.

In my example I will connect a Google Calendar to Grafana OnCall. I'm going to Calendar. I created a calendar which is shared with my team. I am going to Settings and sharing, finding this secret address in iCal format, copying it, and putting it here. Here we allow pretty deep configuration of how to work with this on-call schedule. For example, it will tell you in Slack if there are some problems or gaps, so you can be pretty confident using this schedule. We check, I think, all possible problems.

OK, so for now our schedule is empty, and I'm going to my Google Calendar and starting to configure my schedule. I just create a slot and name it after the person I want to put on call. My nickname is material5, so I'm going here and saving the time slot. Grafana OnCall downloads the whole on-call schedule every 10 minutes, but I can poke it just in case, so I click reload, and here it is. It consumed my user, and I will be on call from 6 p.m. till 11 p.m.

We support all calendar features, like repeats and custom repeats. For example, I can put myself on daily, and Grafana OnCall will consume this, so I can check the next day: OK, tomorrow I will also be on call. Back to the calendar: if we put multiple people on call at the same time, all of them will be on call in this schedule. So I can add admin, and this will mean that from 6 p.m. till 7 p.m. only I will be on call, but from 7 p.m. till 10 p.m. both of us will be
on call, and from 10 till 11 me again. That is how it will work. You see "on call now": two people.

We also allow some magic here. By default, all users in an on-call schedule are in level one... level zero, sorry. We can add users in higher levels, and users in higher levels will override users in the levels below. It's useful for overrides. For example, I have my schedule configured somehow and I want to override only Wednesday. So I add a user, I type the level, and I will completely override my 2nd of March. Let's reload, go to this date, and you see that admin is in level one, so only admin will be notified. That's actually it; a very easy thing to use.

Also, please go here and check the other settings; we try to give thoughtful explanations of what each could be used for. We also allow configuring the basic on-call rotation in one calendar and using another calendar entirely for overrides. It's useful if there's one person in your team who is managing the on-call rotation for everyone, it's kind of solid, and you want every member to have access to overrides. You just go to Google Calendar, create a new on-call schedule, add it here, and all events from this calendar will be considered overrides.

Let's go back to our escalation chains. How do we connect on-call schedules with escalation chains? In a very easy way: notifying people from an on-call schedule is just an instruction in the escalation chain. So I'm going here, choosing "Notify user from on-call schedule", and choosing the schedule. That's it. Once an alert comes in, it will be grouped into an alert group, Grafana OnCall will choose the current on-caller from this schedule and escalate to them, then it will wait five minutes and escalate to me.

Those instructions are in charge of the question of who to notify, but we haven't touched on how to notify; I will talk about that a little bit later. For now, let's see how it's working. I will start demoing sending
alerts, and before that I want to connect Slack, because we have a very good Slack integration. We spent a lot of effort building it, testing, and rebuilding. In order to connect Slack, you should go here: ChatOps, Install Slack integration. It gives you a warning: please make sure you install it in the right workspace. I'll follow the rule. So I'm installing it in Playground, and here it is. We have some Slack-specific configuration here, and we're going back to Integrations.

Let's see how it's working with Slack. For Slack, we allow configuring a default channel to show alert groups. I'll pick my webinar channel, so by default everything will go here, but it's possible to override it at the integration level. We see the default Slack channel is selected here, and it's time to go to Slack. Here's my channel, and let's send a demo alert. It's registered, and here it is.

The philosophy of our product: we really want you to be able to work with alert groups without leaving Slack. By design, you should be able to do everything you need from the on-call side here, so we add multiple buttons. You can acknowledge the alert group; the Acknowledge button will stop the escalation, and if you unacknowledge, escalation will continue from the beginning. You can resolve. You can invite somebody to build a war room. You can silence for some time, and when this time is over, the escalation chain will start from the beginning. You can attach alert groups to each other and add resolution notes.

One of the cool features we have built here is a live log of what happened and, actually, what will happen in the future: who will be notified, why they will be notified, and which channel will be used. We see that in 90 minutes it will call me by phone; I'll talk about that a little bit later. I can acknowledge, and this plan will update after some delay, and I can resolve the alert group. All those features
are available from the mobile Slack app, so it's pretty useful on the go.

OK, going to the next point. You could ask: how do you distribute alerts between Slack channels? Because it doesn't make sense to collect everything for the whole company here. For this, and also for enforcing different escalation chains for different types of alerts from the same alert source, we have routes. I'll create a route here. Our routes are designed for flexibility, so we decided to use regular expressions. We have a helper tool which helps you figure out which alert groups will match a regular expression. For example, for this regular expression (we use Python-flavored regular expressions) all alert groups will match. If I want to use only some specific region, so I want to create a route for alerts only from this region, I will write such a regular expression, and all alert groups from this region will match.

So I am creating a route, and for this route I could use the same escalation chain or go here and create a different one. Also, here on the Integrations page I am able to change the Slack channel I want to route those alert groups to. For example, I want to send it to [inaudible]; let me send a demo alert. It will go to another Slack channel. It went here.

All this configuration is meant to define who to notify. Let's switch to the question of how to notify. How to notify is up to the user: we let users specify how they want to be notified. In your personal account settings you have two personal notification chains. For some average, warning-level alert groups I want to be notified this way; for something important I want to be notified this way. We let users configure it by themselves. By default it's Slack, 15 minutes, and a phone call, and a phone call for something critical. I usually prefer doing something like this for critical, so I receive both at the
same moment, and for non-critical, something like this. Here I can also connect my phone number. If I connect it, I will receive an SMS, and I will be able to make a test call; it will actually call me and pronounce everything from the template. I'll tell you later how to configure what exactly to pronounce in phone notifications.

OK, so you may ask how we configure where to send alert groups, here or here. We specify it here in the escalation chain: for each instruction where we pick a person, we choose the default or the important notification chain. If I choose this one, everyone from this on-call schedule will receive notifications using their personal important notification chains.

Let me check the plan for what's next. Next is resolution notes. Let's play a little bit with notifications. Let's configure my on-call schedule for today: let's put me on call, go here to the schedule, reload, and send a demo alert. Here it is. It's building the escalation plan, and it invited me in Slack first, so it mentioned me and I will receive a push notification. It's asking me to invite the Slack bot here; I need to do this. Oh, interesting. Yeah, this way.

Now, for example, I'm working on an incident. I'm acknowledging it, so I interrupted the escalation chain, and I started discussing what's going on with my colleagues. Some useless information, just anything: "something's broken". And, oh wow, I found the problem: Kubernetes pod ABC misbehaved, etc. Once I resolve the alert group, I want to store this information because it could be useful in the future. So I just go here and add it to the resolution notes, and you see my resolution notes changed. I can see that if I go to alert groups in the web interface. Is it this one? Oh no, it's number eight; I should unresolve it. You see, it is here: we see a resolution note from Slack. So we can just pick messages from Slack and add them to the database. We
could filter by resolution notes only, or we could even type something from here. Oh wow, cool: I added a resolution note here and it's synchronized and published here also. So we exchange the information bidirectionally, and this helps us analyze our past alerts. If we wanted to organize a meeting and see what happened last week and what the reason was, we could filter by "started at: last week", remove the filtering by status (include resolved, acknowledged, new, and silenced), filter by those which have a resolution note, and read them in the meeting one by one, analyzing what went on. So we don't lose any information about past outages.

I'm going to the other features we have; we're almost finished. We have some other features like outgoing webhooks. It's possible to poke other systems from Grafana OnCall, so you could recreate alerts from Grafana OnCall in other locations like Jira or GitHub or any other system using a webhook. We just create an outgoing webhook here, "republish to GitHub", set everything here, and create the hook. We go to the escalation chain and add an instruction to republish to GitHub. For example, we could do something like this: if the incident wasn't resolved in 30 minutes, republish it to GitHub and resolve automatically. So it will disappear from open incidents in Grafana OnCall and show up in GitHub.

Another one of the cool things we have is maintenance mode. Once you set everything up, or once you connect a new integration, it could be pretty messy to test things out. For example, I connect this Zabbix and I want to see how everything works, but I don't want to bother my colleagues. So I just declare debug mode for my Zabbix. OnCall will behave absolutely like normal, but it won't make phone calls and it won't mention people in Slack; it will post messages to Slack and act like normal, but without bothering your colleagues. And we need to choose a duration, because
we don't allow enabling debug mode forever; it should stop at some point. So we declare debug mode, and that's it. It can stop automatically, or we can stop it ourselves. This feature can also be triggered from our API, so you could trigger it from CI/CD during a deploy, when you expect a lot of mess from your monitoring systems, and disable it from the CI/CD system. For this case we have maintenance mode, which will collect everything into one incident: whatever alerts come in, they will be collected into one incident published to Slack, and nobody will be paged.

And yeah, I referred to the API. The Grafana OnCall API reference is available in our documentation. In order to access the API, you need to go to Settings. Here's the API URL; create a new API token. And here we have the full reference: how to access your alerts and their body, and how to access escalation chains, alert groups, integrations, and on-call shifts. We even allow editing on-call rotations through the API, so some users I know hack around some cool stuff, exporting their Google Drive spreadsheets to on-call rotations, etc. And that's actually it.

Oh, one thing I forgot. I talked about alerts and alert groups a lot, but how do we group alerts into alert groups? And actually there's a bigger question: how do we render alert groups? How do we understand what the title is, what the body is, and what to pronounce during a phone call? All this is configurable, and you can change it here in the settings of your integration. It's a pretty powerful instrument, and it's based on Jinja. We receive a payload, which is actually JSON (here's an example), and we apply a Jinja template to this payload and figure out the title, the message, what to pronounce during a phone call, what the image URL is, and how to register the resolution. For example, this is our alert payload, and this is a template for grouping: alerts with the same result of this Jinja template will be grouped into
one alert group; it's like a grouping key. So if I want to group by region, I go here, remove all of this, and access the payload. You see the result of rendering: payload, I see labels here, and I see region. So all alerts with the same region will have the same grouping key, and they will be grouped into the same alert group. I could group by region and alert name; here it is. I could change what OnCall says to me when it makes a call. Say I want it to pronounce, I don't know, an annotation, so I use payload.annotations. That's what Grafana OnCall will say to me during a phone call.

That's all. We went pretty deep into the system, and I think this is the moment for me to answer questions. I see eight open questions, and I see some messages in the chat.

"Is it free for three users in the free Grafana Cloud tier?" Yeah, let's talk about pricing. Grafana OnCall is available as part of Grafana Cloud, so if you have access to Grafana Cloud, you will see Grafana OnCall there. Cloud pricing... I don't want to make a mistake. Here it is. Yeah, on Free you have three users of Grafana Cloud, so you'll have three users of Grafana OnCall. On Pro you'll have 10 users, and yes, you need to pay some additional amount for each new user, but I can say it's pretty cheap.

"Can you generate daily or weekly SLA breach reports with mean time to respond or to acknowledge?" We are working on that. We have pretty nice ideas for how to do that. It's not possible yet, but it will be pretty soon.

"Is there an API endpoint that shows who is on call at the moment, with their contact details? Let's say tier one would like to see who is on call in tier two at the moment for service X, without checking their calendar." If you want to see who is on call fast, we have this page with all schedules. We expose on-callers now in our API. I think you could get the current on-caller here in the schedule API. Yeah, you see who's on call now. You need to make additional requests for the users, but actually, using two
requests, or N requests, you'll receive the list of users who are on call now.

"Will Microsoft Teams also be supported?" Stay tuned; I'm pretty positive about this.

"Is SMS supported?" Yes, of course we support SMS. Here in my personal settings, in my profile, where I choose how I want to be notified, I have four options, and we are working on adding more. So SMS is available here.

"Are there any limitations on the destination countries for phone numbers?" We need to check this in our docs. There are some limitations with China; I'm not aware of limitations in other countries.

"It's my understanding that maintenance mutes all alert notifications. Is it possible to mute alerts by group, for example for a specific cluster, or filter by text?" It's interesting. You could silence alert groups forever, and here's how it will actually act: the grouping will continue. It will continue grouping alerts here, because if it's resolved and we receive new alerts for this group, it will issue a new alert group, but while this alert group is not resolved, it will continue grouping and grouping. So I think the silence feature should answer your question.

"Is there any reporting on [inaudible]?" As I answered before, stay tuned; it will be available soon.

"Are you able to manually open incidents from Slack?" Yeah, we have a pretty cool feature for creating incidents, alert groups, from Slack. Go to Slack, type "oncall", and here we allow creating incidents, alert groups. "Test" and "test", and it was registered under this integration, "Incident from Slack". If you create routes here, they will be available there in the drop-down, so it's useful for distributing alerts between teams. We actually created routes based on team names and connected those routes with escalation chains for those teams.

"Can this be configured via Terraform?" Some of you may guess that that's one of the reasons why
we created such a full API. We are working on that.

"Can we get the recording of the session?" Yes, it will be available for everyone who signed up.

I think that's all. I see plenty of questions in the chat, but let's wrap up because we've used a lot of time. Let me share another screen. Thank you for attending this webinar. I'm sorry if I didn't answer your question; I see a lot of messages in the chat. If you want to continue the conversation, please join our Slack channel. The whole development team is there, in the Grafana OnCall channel in the Grafana Slack. We also answer on community.grafana.com; in this forum we already have some discussions about Grafana OnCall. If you want to see some features, if you missed something, or if you tried it and you can't live without some button or feature, please let us know. We will be happy to build it, or we will be happy to tell you where to find it.

Thank you a lot. It's a pleasure for me to talk about the product in front of such an audience. That's all. Goodbye.
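As an illustration of the grouping and routing ideas from the talk, here is a minimal Python sketch. It is not Grafana OnCall's implementation: the payload fields, route patterns, and chain names are hypothetical, a plain function stands in for the Jinja grouping template, and matching each route's regular expression against the JSON text of the payload is an assumption. The only detail taken from the talk is that routes use Python-flavored regular expressions and that alerts with the same grouping key land in the same alert group.

```python
import json
import re
from collections import defaultdict

# Hypothetical alert payloads; the field names are illustrative only.
ALERTS = [
    {"labels": {"alertname": "HighCPU", "region": "eu-west"}},
    {"labels": {"alertname": "DiskFull", "region": "eu-west"}},
    {"labels": {"alertname": "HighCPU", "region": "us-east"}},
]

# Hypothetical routes: (regular expression, escalation chain), checked in order.
ROUTES = [
    (r'"region": "eu-west"', "sre-critical"),
    (r".*", "default"),
]


def grouping_key(payload):
    """Stand-in for a grouping template: group by the region label."""
    return payload["labels"]["region"]


def pick_route(payload, routes=ROUTES):
    """Return the chain of the first route whose regex matches the JSON text."""
    text = json.dumps(payload, sort_keys=True)
    for pattern, chain in routes:
        if re.search(pattern, text):
            return chain
    return None


def group_alerts(alerts):
    """Collect alerts that share a grouping key into one alert group."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[grouping_key(alert)].append(alert)
    return dict(groups)


groups = group_alerts(ALERTS)
# Route each group by its first alert (a simplification for this sketch).
routed = {key: pick_route(members[0]) for key, members in groups.items()}
```

In this sketch the two eu-west alerts collapse into one group routed to the hypothetical "sre-critical" chain, while the us-east alert falls through to the catch-all route. Changing grouping_key to return both region and alert name would split the eu-west group in two, which mirrors the "group by region and alert name" variation shown in the demo.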
Info
Channel: Grafana
Views: 13,800
Keywords: Grafana, Monitoring, OnCall
Id: 7uSe1pulgs8
Length: 45min 11sec (2711 seconds)
Published: Wed Mar 02 2022