Fool-Proof Kubernetes Dashboards for Sleep-Deprived Oncalls - David Kaltschmidt, Grafana Labs

Captions

Hello, hello. Could people sitting on the end of the rows here please squeeze in? We want to leave enough space because there are still people going to be trickling in. Thank you. Just find a seat. That's my seat.

Right then. Hello everybody, my name's Tom. I will not be giving this talk; I'm just here to introduce David.

Hello everyone. Tom's already given pep talks...

But I haven't finished introducing you yet. It's only going to take about 30 minutes to do the introduction, so you're good with five, yeah? Now, I've known David for years. I've worked with him for a long time and he's one of the best UX guys I know, so I hope you really enjoy the talk. Thank you for staying so late. I know this is the last one, and David's promised he's going to cut it short so we can all go and get some beer. So, a round of applause for David. [Applause]

All right, hey everyone. If the title of the talk was very clickbaity, I think I succeeded: there's a lot of people here. I can also disappoint you right away: you won't be getting ready-made dashboards to solve all your Kubernetes problems. But I hope I can inspire you to follow a couple of practices that we identified as useful.

Just real quick about myself: I'm David, working at Grafana now. I used to work with Tom already at Weaveworks and Kausal, and now I'm a bit more interested in UX, and I also get really excited by things like this. Has anyone taken the Metro on the way to the conference centre? Initially I thought a taxi would be good, but there was really a lot of traffic here, so we switched over to the Metro, and then I just saw people get frustrated with this door handle. Who here thinks you have to move the lever to the right? Yeah, OK, so you will all have problems riding the Barcelona Metro. It turns out you have to move it to the left. There is even a little signifier, which you can't see because it's so small, telling you to move it to the left. But good luck
with this anyway. I found this was a perfect segue and intro to how dashboarding for Kubernetes shouldn't happen, because usually when things are on fire you don't want to be in front of a confusing dashboarding situation.

The first thing I learned about on-call is that on-call is hard. Coincidentally, Tom and I founded a company a couple of years ago. I'm more of a front-end person, and Tom did all the backend stuff and the DevOps stuff. Then one weekend he said, "David, I'm going to a heavy metal festival and you need to watch the servers." I was freaking out, and I learned some things in that time. On-call is hard, but there's also an easy time for on-call, where not much is going on and you can work on some features and eliminate some toil. But there are also bad times, when you get paged and you have to respond to incidents and do some firefighting. I was just hoping that would never happen at a time when Tom was in a tent somewhere, sleeping.

Being on-call in the context of Kubernetes is massively more complicated. Even though I had already done a lot of monitoring work in the Kubernetes space, to me it was mind-boggling learning about all the concepts and all the different dimensions. The troubleshooting process is just so difficult because you basically have to troubleshoot across a couple of these dimensions that exist along these Kubernetes concepts. Troubleshooting, to me, is as much about finding the issue as it is about eliminating where the issue isn't, and the existence of all these dimensions just adds to that complexity.

So the tools we use shouldn't get in the way of this. Sometimes they do, especially when Grafana makes it really, really easy to modify dashboards, and all of a sudden you have a thousand. Being in that on-call position when you're having a bad time, that's not really helping,
all right? Because what you want is tooling that reduces cognitive load and doesn't add to it. I'm sure everyone here has been in a situation where they modified just a tiny thing on a dashboard and saved it as a new one, and part of the motivation for this talk is to give you some guidance on why you shouldn't be doing this.

So the main motivation is clear now, and I want to introduce a mental model that helps me think about this: a dashboarding maturity model. It should give you some vocabulary and best practices to talk about these things with your colleagues, and it should also give you some guidance on where you are in your dashboarding journey and where you could go. It's pretty simple. We already heard this morning in a keynote that three is a magic number, and I stuck with that as well. By default, you probably don't really have a strategy for dashboarding, and you're just changing some things here and some things there. If you're a bit more serious, you try to manage your dashboarding a bit more and you try to use methodical dashboards; I'll tell you later what those are. And what I think is a very high, or very mature, level is when you try to optimize that use, you try to actively combat sprawl, and, most importantly, you try to ensure that those dashboards are consistent by design.

So let's dive into those. Just to reiterate real quick about bad practices: for us it was always duplicating a dashboard. One of the worst UX decisions we ever made was adding this "copy tags" toggle there. Who here uses tags for dashboards? OK, yeah. So you probably had really good intentions using those, but then someone else later came along and cloned this dashboard, maybe left this toggle on, but modified something that was also represented semantically in one of your tags, and at that point
they've diverged. If you later use those tags to find your things, you end up with this long list of dashboards that match but no longer represent what the tag originally meant.

Another problem is if you don't have version control. Basically, what happens when you make a modification is that you hit save. If you have a standalone Grafana instance and you don't back up your data, or you don't have your dashboard JSON in version control, and your Grafana goes away, then you're going to have a bad time. So this is another characteristic of low maturity.

Similarly, a behavior that's also a symptom is browsing for dashboards. If you find yourself browsing a lot, going through these folders that we introduced (I mean, they're really helpful, but sometimes you have to go back and forth to find the right thing), that's the sort of behavior that, for a very mature observability platform, you want to get away from.

So we can do better. Let me tell you a bit more about what medium maturity can look like. The first thing you can do to prevent the sprawl of dashboards is to use template variables. Who here is using template variables? OK, yeah, well done, cool. What we see here is a node dashboard for Kubernetes. You don't really need a dashboard for each node, obviously, because the things you want to look at, like CPU usage or usage per core, have the same panel layout for all of these nodes. So what you can do with Grafana, for example, is register those as template variables for that dashboard, and then you get this nice drop-down and can look through all of these instances. If you're really clever, you can also do this for various data sources, as a sort of higher-level template variable, and then you can basically access a lot of different clusters.

So let's get to methodical dashboards. What does that mean? So
there's a couple of methods out there that really help you make sense of what could go wrong in Kubernetes along those dimensions I talked about earlier. Two of the popular ones are the USE method and the RED method, also Tom Wilkie here again, so if you click that RED method link you can see more on that. But what's more important, I think, is that you don't necessarily have to use either of these; it's just good to have a method. It's good to have a method for the situation when one of your developer teams comes to you and says, "We want to add a new app. How should we add the dashboards for this?" You should have an answer for this, and that's going to be your method.

Here's an example of a USE method dashboard. These are actually part of the Kubernetes monitoring mixin. There's a link there, and the slides are also attached to the session. This is already a pretty good view of what a node represents and what sort of troubles a node can have. Similarly, there is a whole set of dashboards in this repo, so I just took a picture of what other dashboards are there. For example, there's a new one about persistent volumes, among others.

Another good approach that we've identified, in addition to methodical dashboards, is hierarchical dashboards. Those do a great job at summarizing what is going on along those dimensions, and the big benefit here is basically to use the power of trees, the power of the logarithmic drill-down, so to say. This really helps in the elimination process: you can quickly see in the higher-level dashboards of the tree that things are OK, and you can move on to the next one. One of these hierarchies for Kubernetes, for example (it's up to you, right?), could be cluster, namespace, pod. And then all of these queries will
obviously have to be structured in a way that whatever is above aggregates the metrics of whatever is below in a meaningful way. Then the question becomes: how do you navigate between them? What we do is that for every graph, for example this cluster view broken down by namespaces, the breakdown in the queries is always the next level down, so you can move into the next level by using one of these drill-down links in the table. This is also part of the Kubernetes monitoring mixin, so you can see how that's being done.

You can also use a similar hierarchical approach inside the same dashboard, and you can use it to represent how data flows through an application. In here we're using the RED method again, but we use it with one row per service. Each row follows the RED method by having the request rate and error rate on the left and the duration, or latency, on the right. The really powerful thing here is that you can see from the top row, which is the load balancer, that the user is not going to see any errors, because there's nothing red. So this is good, but there's obviously something wrong because the lower ones are red. And if we go the other way, looking at the app, there's some red there, but we know that the app relies on data from the database, and by virtue of the database also being red, we get a hunch that the error may be in the database. So just this vertical structure, this vertical hierarchy, inherently leads me towards a hunch on where the system is not working.

One thing to keep in mind about these is that sometimes it's worth splitting up a service or an app into two different dashboards, mainly because the magnitude can differ. For example, we run Cortex at Grafana, and Cortex is a Prometheus-based service where a lot of data is being sent all the time, so in
let's say it's orders of magnitude more, like a thousand x, than what is read. So if there were any errors in the query path, the read path, they would just be drowned out visually if they were in the same dashboard and those metrics were aggregated.

One more tip I want to give is to make the charts themselves really expressive. Even the previous ones in the service hierarchy use color to give you a quick hint on what's going on. In the top diagram you can see that the 200s are green and the 500s are red. Someone decided to color the 400s blue; I don't really know what that's about, but this definitely helps to quickly draw a conclusion on the state of this app.

Another tip that we can give is to normalize graphs, also by the y-axis, so you instantly get a feel for how busy something is. This is especially useful for things that track saturation, where the resource is bounded, CPU for example. It's also good to understand how this all works, so to take another example from the Kubernetes monitoring mixin: there's the whole cluster here, and each of these lines represents a node, but we don't really know how many CPUs the nodes have, because they could have been provisioned with different sizes. The important thing is that if we normalize this by the CPU count across the cluster, we can definitely say how much of the resources across the cluster are being used, leading up to a hundred percent. I think this is really powerful because you reduce the cognitive load again of having to draw conclusions on how much space you have left.

Just to go back to template variables again: I do realize that template variables make it sort of hard to navigate through the dashboards, but
the main idea is that this is OK, and that you actually shouldn't just navigate through them, especially if you have three-level hierarchies. What we actually want to encourage is that you don't really browse through the dashboards anymore, but that you use alerts that have links in them, links that lead directly to a particular dashboard with all those variables filled in.

And then lastly: where should we store the dashboard code itself? There are a couple of initiatives going on in Grafana again, and the most important one revolves around improving the provisioning workflow. We want to integrate well with GitHub there, and there's a design doc out there. If you're really interested in this, it's worth clicking the link in the slides, going to issue 13823, and commenting on it.

Before we get to high maturity, I just want to show you another situation from the Barcelona Metro. Who here has seen these? Did anyone also have a little trouble there? Because everyone, or a lot of people, are asked to be left-handed. But I'm right-handed, like I think the majority of people, but maybe I'm wrong there. So having to take the ticket out, having it in my right hand, and then putting it into this turnstile swiper on the left, I don't know, that was a weird concept. Though not all subway entrances have it on the left; some of them have it on the right.

I guess another good segue here is that the people who put this in place knew full well that a lot of people are right-handed. They decided to do this anyway, which you also find sometimes in a DevOps organization, where people know good practices but still deviate. So then the question becomes: how can we achieve consistency by design? Before we get into this, let's quickly talk about sprawl again. Just look at how happy this guy is trimming the hedges. Because
think about all these dashboards that you create, all these one-offs. They're going to be in the way; they're going to be a sprawling hedge that definitely needs to be trimmed. In a very mature dashboarding organization, you will have someone who goes around and removes dashboards that are no longer being used, or you have a very good review process that says only dashboards that are approved can go into the master Grafana, for example. Another cool feature that we're thinking about internally in Grafana is to have meta-analytics, to basically track the use of dashboards and of queries, to give you an idea of which of these could potentially be removed.

So, consistency by design. There's a lot going on on this slide, but the main takeaway is that there are scripting libraries out there that give you higher-order functions to generate certain types of dashboards. The important thing is that those functions can encode, for example, a query panel, and they will have options attached to them. If you then only use this little function call, giving it the data source and the query, you can ensure that all the rows and all the dashboards that have been created share the same style, and there's no longer a fight over whether we should use line fills or not. Just by using these patterns, you guarantee across the organization that your dashboard panels are similar enough that people can find their way really quickly. In my eyes, one of the biggest benefits is the smaller change sets: if you use higher-order functions, you don't have to deal with this massive JSON anymore, because you only have to compare, let's say, the query change.

I've talked about mixins a couple of times. Mixins are sets of dashboards and alerts that are peer reviewed, and a lot of those
resources exist, and I think you should make use of them. The mixins I've been showing you are written in Jsonnet, but even though they're written in Jsonnet, you can still extract the queries in there and use them in your own dashboarding journey. So it's a really good resource to look at how people decided to monitor Prometheus. There are other mixins, for example for Consul or other services. So it's definitely worth watching this talk. Oh, incidentally also... OK, yeah.

In the future, we'd also really like a workflow where you have an editor in the browser to modify the JSON live, so that the dashboard's code can travel from the browser to your PR right away. Ideally you would also use these higher-level function languages to have smaller change sets, and have a compiler somewhere that renders the JSON so that you get the live preview again. But that's a bit of future stuff.

Cool, so a quick summary again. It's good to have a strategy for dashboarding. It's good to start with the goal of managing the use of methodical dashboards, and then the next step can be consistency by design. This is kind of a summary slide of what I had so far, and I basically want to leave you with two main takeaways. One of them is that the dashboarding method you adopt should definitely not get in the way of people trying to use the dashboards. The second one is: don't be the Barcelona Metro of dashboards. Thank you very much.

David, I was the one who chose blue for the 400s.

OK, what color should they be? That's the first question. Yeah? Purple. Purple, OK. Well, luckily, as we use Jsonnet, you can change it in one place. That's true.

Anyway, right then, who's got a question for David? We've got a few minutes. Someone's got a question? There you go. Next year, when
you come to Amsterdam, come by bicycle. Yeah. Are they UX friendly? I bet they are. Who's got a question? Come on, wave a hand. There we go.

Hey, first of all, thanks for the talk and the great tips. I think everyone agrees that this should be the way to go. I'm from one of these organizations where we have literally 1,800 dashboards. We patch Grafana to actually deal with it, so with every upstream version we have to deal with that again. This all comes from a shared Grafana where every team is free to do their own stuff. So do you have any tips on how we get the organization to a level where we can enforce this, or not enforce it, but where everyone adheres to it? Because I know about it, some teams know about it, but there are like 82 development teams that don't. They do the sprawling.

Yeah, so you sort of answered the question yourself already, because training, internal training, is the first step there, to make those developers aware that whatever they add to their Grafana instance can also get in the way of other people getting to the results that they want. So I guess sharing the video of this talk might be a first step. But I can also set you a bit at ease: 1,800 is not the biggest number that we've heard.

Good. Next question? Oh, looks like we're done.

I just want to say one more thing, sorry. I just wanted to call out Frederick here, and Jack here, and is Matthias in the audience? Yeah, Matthias over here. I don't just do the mixins on my own; these guys have helped me, and they really run it now, as I'm too busy. I wanted to call them out, because the Kubernetes mixin has been super helpful, and you're probably already using it and you don't know it: if you're using the Prometheus Operator, that's the default set of dashboards that comes with it. So thanks very much. Thank you, David. [Applause]
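
The "consistency by design" idea from the talk, together with the advice on template variables and on normalizing CPU by core count, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the approach shown on the slides: the mixins in the talk are written in Jsonnet rather than Python; `HOUSE_STYLE`, `query_panel`, and `cluster_cpu_panel` are hypothetical names; the panel fields are a simplified subset of Grafana's dashboard JSON; and the `cluster` label and the `$cluster` and `$datasource` variables are assumed to exist in the target setup.

```python
import json

# Shared look for every panel. These values are a hypothetical house style,
# not Grafana defaults; changing them here changes every generated panel.
HOUSE_STYLE = {"type": "graph", "fill": 1, "linewidth": 1}


def query_panel(title, datasource, expr, unit="short", y_max=None):
    """Build one panel dict from a data source and a query, in the shared style."""
    panel = dict(HOUSE_STYLE)  # copy, so the shared style is never mutated
    panel.update({
        "title": title,
        "datasource": datasource,
        "targets": [{"expr": expr, "refId": "A"}],
        "yaxes": [{"format": unit, "min": 0, "max": y_max}],
    })
    return panel


def cluster_cpu_panel(datasource):
    """CPU utilisation as a fraction of all cores, so the y-axis tops out at 100%."""
    # Averaging the idle rate over every core normalizes by CPU count.
    # The `cluster` label and `$cluster` template variable are assumptions.
    expr = ('1 - avg(rate(node_cpu_seconds_total'
            '{mode="idle", cluster="$cluster"}[5m]))')
    return query_panel("CPU utilisation", datasource, expr,
                       unit="percentunit", y_max=1)


dashboard = {
    "title": "Cluster overview",
    "panels": [cluster_cpu_panel("$datasource")],
}

print(json.dumps(dashboard, indent=2))
```

Because every panel goes through `query_panel`, a review of a change like this only has to look at the query string or the shared style, which is the "smaller change sets" benefit described in the talk.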

Info

Channel: CNCF [Cloud Native Computing Foundation]

Views: 5,465

Rating: 4.9523811 out of 5

Id: YE2aQFiMGfY

Length: 28min 38sec (1718 seconds)

Published: Fri May 24 2019