Grafana is Not Enough: DIY User Interfaces for Prometheus [I] - David Kaltschmidt, Weaveworks

Captions
Very nice. All right, everyone seated? Welcome to, I guess, the last talk for today on this track. I'm glad there are still people showing up, that's good, and people are still awake.

So what's it about? I'm David from Weaveworks, and I think there are also a couple of people from us here in the audience. We were mostly known for container networking, but now we've branched out into monitoring. We're also involved with the Cloud Native Computing Foundation; Prometheus is one of their projects, and we use it heavily for our own monitoring. So naturally we built some tools around it, and I'm going to talk about some of them today.

Let's do a quick round of audience participation to keep everyone awake. Who here uses Prometheus? Oh yeah, that's good. Who uses Prometheus on Kubernetes? Still, all right, so I guess half of the previous ones. Okay, that was just to figure out what the audience is and how quickly I can get over this slide.

I should tell everyone who doesn't know what it is: it's a time series database. It has a really nice dimensional data model, that's going to be a bit important later, and it's got a query engine that's quite nice. It's got some built-in visualization that I'm going to go over really quickly, and plenty more stuff that I'm sure you've heard about today. It's open source, and it's really easy to install and get started: just run that Docker
command and you have Prometheus running, and that's one monitoring itself. If you then go to your localhost:9090 URL, the first thing you'll see is the expression browser. That's the UI that you get with Prometheus in its vanilla shape. It's already got some cool things: type-ahead for metric names, which is pretty cool, so you don't really have to remember them all completely, and you get a chart mode and a table mode. If you know the data model that you've set up to represent your time series quite well, then you're not going to have a hard time, I think. But if you've forgotten already, because you installed that new Prometheus configuration two weeks ago and you don't remember all the label names, then it's a bit tricky.

The Prometheus team realized this pretty quickly and offered another way to look at your Prometheus data, which were console templates. I think that was abandoned really quickly, but it looks like this, and your Prometheus supports this. This basically allows you to predefine graphs that you want to look at, and Prometheus also comes with a bunch of them. For example, if you just go to that URL up there (consoles, Prometheus overview), you get this predefined dashboard. So that's pretty cool, and you can just modify those. That's also part of vanilla Prometheus.

And if you want to go all in on dashboards, then I guess eventually you end up at Grafana, which is a full-fledged dashboard solution. It's got rows and cells, you get auto refresh, and you get a template system that, if you have multiple servers, dynamically generates selectors for you, so that's quite nice. It's got user management, but it needs its own backend. It's widely adopted and has lots of integrations. So, who here is using Grafana? Just look, yeah, see, it's popular. It's so popular, and it's so easy to add more and more dashboards, right? People just keep adding them. And who has more
than 10 dashboards in Grafana? Okay, yeah. Well, you can even add more programmatically. Just as an aside, there's a little library we wrote to write out Grafana configurations. That's open source and it works quite well; it's also documented there under this URL, so you can get a little idea of how that's all set up. The idea there is that the dashboards themselves can be treated as code, so you can also version them. That's quite nice.

But I guess when we talk about Grafana there are also limitations, and most of the ones we see revolve around troubleshooting. The reason for that is that Grafana is usually only the starting point. Well, the starting point is probably the page or the Slack notification you get, and then you probably look at your Grafana dashboard.

So let's dive a bit more into this debugging story. Let's say we want to look at this little peak down there where the arrow is pointing. What kind of options do you have? I guess you can use some of these zoom in, zoom out kind of things. But sometimes you've also forgotten: all right, what was the query again? So you click on the edit button and then you look: okay, there are actually a couple of queries, which one is important now? How can I get out of there? Maybe I want to go to the expression browser. Actually, these links kind of work there; this works in normal Grafana, you can jump into the expression browser if you click one of those. So that's pretty cool already, but what do you do then?

We've thought about this a lot, I guess, and it's all about exploring the queries to figure out what's really going on; the queries are the essential part of troubleshooting. We also want to compare the values between what is happening now and maybe the incident time, so we need some kind of way to compare how something behaved in the past and how it is behaving now. But we also want to document
how we go about our troubleshooting, and maybe you want to add notes and share this whole thing with a co-worker. And then at some point we want to go home and just hand over the incident to whoever's on call next.

This whole thing is not a new concept, this kind of documenting things by using code. If anyone has heard of Jupyter notebooks, they're quite popular in this field: you can basically describe what you want to program in several cells and then add some notes in between. So we took this as a cue to implement something similar, but for Prometheus.

The important thing then is how you get this all done. The first thing you need to do, I guess, is get out of Grafana into some kind of notebook, and then you can use the Prometheus API to build your own queries. The tools that we're going to show are all based around just using the API and using, for example, the Rickshaw charts library that vanilla Prometheus is also using, and how you can use that to your advantage, and at the same time make a nicer query field.

So let's quickly figure out how we can jump out of Grafana. We tried a bunch of things. We tried Grafana plugins, but plugins only give you full panels or data sources, and we didn't really want to maintain our own graphing panel, because that would be a bit annoying. But you can use the Chrome extension that we wrote, which just analyzes the Grafana page and then knows: okay, this is a panel that has a graph, and the graph is based on a query, so I can inject some links to your other monitoring system, which for us is Weave Cloud, and you can jump there. And how do those links get there? Yeah, the browser extension. This is actually open source now, so you can fork it and do your own browser extension if you want, if
you want to use Grafana as your entry and, I guess, also exit point to get into some other kind of troubleshooting. So this is available now, either in the Chrome store if you want to use Weave Cloud, or you can fork it and build your own jumping links, so to say.

So, jumping where? To some kind of dashboard or some kind of notebook. We implemented one for Weave Cloud, and this came out of our internal efforts of dogfooding Prometheus-based monitoring. The idea is to use it as a starting point to document an incident.

So now the question is how we build this, because, yeah, we just built this ourselves, obviously. What makes the notebook a notebook is, most importantly, the query field. That's pretty central, because that's where you enter the queries. We decided to go with a simple typed query field, because from talking to other developers, nobody really wanted a click solution, like a click query builder. We looked at the original expression browser and at what the field there provided, and that was quite helpful. So we started with that as well and just imagined it a bit differently, or tried to take it a bit further.

PromQL obviously has a set of functions, for example, and keywords, so it's good to suggest those as you start typing; we added that on top. And then there are the metric names, which are easily available via this URL from your Prometheus installation, so you can just get those and suggest them.

But we wanted to take it one step further, because once you have entered your metric name you probably also want to filter by some kind of label. So now the question is: if I start typing the opening curly brace, what should happen next? If I use the Prometheus API and insert the metric name there, then I get back all the labels, and all the keys for the
labels and all the values, so I can basically construct a suggestion that's based around the metric name that came before the curly braces. It's really just a simple greedy algorithm: I look at the opening curly brace and ask, what's the metric name to the left? It's really simple. Then you get back label keys and values, and you can construct a whole tree pretty easily. And then you can also look greedily: okay, I just entered a handler key, so for this particular metric with this particular label key, what kind of values would it have? That's basically the whole thing.

And then this is kind of an implementation detail on how it's done: we're using Draft.js, which came out of Facebook. If you use Facebook, all the little messaging fields are basically written in Draft.js. We just used it a bit more, for this expression syntax highlighting and for the suggestions.

What else is missing in a notebook? A table. Then we tried to figure out how we can make this Prometheus values table better. Maybe we just extract all the label values into columns, so we did that. And then what happens if you click on a column cell? That could actually just mean: okay, maybe I just want to filter by this one. So then that gets auto-inserted up there. This is all pretty easy: we just look at the selection API of the browser to see where you are inside the query. It's really just this greedy strategy of figuring out where we are.

The last thing that's missing is charts. They were built by basically reusing what the vanilla Prometheus UI had, so it's based on Rickshaw, and we really just pulled the stuff out; the code is basically the same. So you can do something similar if you want to build your own Prometheus charts. There's also the
last link down there; this is basically how the vanilla Prometheus UI uses the Rickshaw charts to draw the results from the range query.

Okay, so I'm just going to show how this all works together now, and I'm quickly going to go through the setup. The idea here is that you run your local cluster with some services, and you run your local Prometheus and your local Grafana. And then we run a hosted Prometheus service for you where you can send all your data, and Grafana can also access that data. So this is what's going to happen, and with the browser extension we can jump out of Grafana into the cloud. All right, let's see how that all works together.

[Music]

Where... oh, maybe I do this with mirroring, because that's better. Yeah, a few seconds. All righty, so, maximize this one.

So this is just the GitHub repo for the browser extension, if you want to have a look at that one and see how it's implemented. And it's installed here, it's somewhere here. So basically now, if I look at a Grafana, the browser extension notices: okay, this is a Grafana, that's pretty cool, let's insert some links for where you want to jump. So I'll just hit reload, so you can basically see how the links get inserted. Or not. Oh, this is interesting. Is there Ethernet here? I brought an adapter. Oh, that's good. Good, cool.

So this is our own system that we obviously monitor. And if you looked carefully, here in this area there was nothing there before, and after the dashboard rendered, those links were inserted, and then you can just jump out of there. So it's either per panel, so I can either start investigating a single panel, or I can just jump into an investigation prefilled with the whole dashboard. So I'm just going to click on this one here, and then let's see what happens. Yeah, so now I've got a complete notebook filled out, with the title of the dashboard that I
can start renaming, so maybe this was "the users service was acting [unclear]", okay. And then you can see all the query fields prefilled with those queries. And okay, now, some of these panels from Grafana had multiple queries in them, and they're all put together here. That is another thing that we added on top: you can enter multiple queries into a single field so that they all get displayed together. So that's nice.

So now I can just look at all of them, and I can remove some that I think are not very important. This one is probably not very important. Or look, I guess this one here was the most important one; well, there's a bit of movement there. Then I can start documenting this stuff: what happened above? If I've noticed something now and I'm trying to document this, I would turn a cell into a text note: "behavior above was not nice". Okay, so that's okay.

And then I can construct more queries around this. Let's just say HTTP, that's always a good one, [unclear]. Yep. And then I can have a quick look at tables. This is kind of a similar workflow to what you probably used before, where you try to figure out: okay, what's the label space for this thing? This one is not very good, because the values are all 0, that's why they are all grayed out. Hmm, interesting. Service request, this one, mm-hmm. All right, that's a graph. What's more important here would be the table, and then you can see, okay, now it extracted all those values into table columns. And I can look: okay, maybe I only wanted a certain route, let's say I wanted the route "trace". So just by clicking on it, it looked at the query and then figured out: okay, I clicked into the results of a service request duration seconds count query, so I know exactly where to modify the query string. And yeah, this
is basically how it's implemented, that's pretty straightforward. And now it's already narrowed down to only two rows. Yep, so that's nice.

Maybe we want to see this rendered as a graph. Okay, that doesn't look very exciting, because it's probably some kind of counter that's obviously increasing all the time. So it's not very exciting as it is, but we probably want to see the rate of increase, so we do a bit of this. I can never remember, is it outside? Right, yeah, PromQL. And look at this: okay, yeah, we see a lot more spikes. That's right. Maybe we want some finer granularity? No, [unclear], those are probably good enough here. And then, since we remember we had two of those, let's just sum them up so that they're drawn together. Okay, yep, and that was it. And I guess all of these above here we can probably delete, but we documented this part down here, so I guess that's all we cared about.

All right, that was the demo, and let's just get out of this mode, back to this.

Yeah, so we only started recently working on this. I gave a version of this talk two months ago, and in the meantime we've figured out all this Grafana stuff: we got multiple cells working, and multiple graphs, and a values table. The next step that we're working on in the next weeks will be to make them really shareable, so you can share them within your organization if you're a Weave Cloud user or if you sign up for Weave Cloud. And then there are some syntax tweaks for queries. For example, if you look at that last query there, it's a sum by mode, and since we're using some kind of greedy algorithm that looks to the left, maybe here we have to start looking to the right; somehow we have to figure out that it was the node CPU metric. So there are a lot of little tweaks.

All right, so the main takeaways are actually: if you just want to rebuild
this on your own, look at your behavior: where do you get stuck during troubleshooting? You can look for jump points for getting out of the places where you usually get stuck. You can study the APIs and how other tools are using Prometheus, and as I've shown, it's not really that hard, and you'll probably also find ways to take it a step further. And especially if you're a backend engineer, you don't really have to be afraid of the front end; you can just start with something simpler. And there's always Elm, you know.

All right, we're not hiring. No, just kidding: in any of these locations. And here, the backend of this is open source, it runs on Cortex. The front end is still under heavy development; we haven't really figured out how to extract the nice parts, like the expression builder, out into open source components, but that may come. Other than that, I encourage you to configure your Prometheus to point it towards cloud.weave.works, and then you can use [unclear]. Yep, I think that's it. Any questions?

Yep. So that is virtually infinite, because it's using S3 buckets, so basically as much as Amazon gives us, so to say.

Yeah, yeah. No, so, yeah, I think that's exactly the use case we're trying to address here.

Yeah, so they're going to be there until AWS [unclear], basically.

Yeah, I mean, you get a lot more features there, because Cortex is a scalable version of Prometheus, so you can get parallel queries, and especially if they go a bit longer back in time or if they span a lot of time, the query times are a lot faster than with vanilla Prometheus, because they can be parallelized.

Yep. [unclear] Yeah, so the user interface, not yet, but we're thinking about how to make an on-prem version of this. Cortex you can run yourself if you want; if you want to solve your multi-tenancy
problem that you may have with your own Prometheus, then Cortex may be the answer. Yeah, you should check that out. All right, thanks.
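
The greedy label suggestion and the click-to-filter behaviour described in the talk can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the Weave Cloud implementation (which, per the talk, runs in the browser on top of Draft.js). All function names are hypothetical, and the `series` argument is assumed to have the shape of the `data` list that the Prometheus HTTP API returns from `/api/v1/series` for a `match[]` selector: one dict of label names to values per series.

```python
import re
from collections import defaultdict

# Hypothetical helper names; none of these come from the talk or from Prometheus.

# An identifier followed by an opening brace that is still unclosed at the cursor.
METRIC_BEFORE_BRACE = re.compile(r"([a-zA-Z_:][a-zA-Z0-9_:]*)\s*\{[^{}]*$")


def metric_at_cursor(query, cursor):
    """Greedy look to the left: return the metric name in front of the
    nearest unclosed '{' before the cursor, or None if there is none."""
    match = METRIC_BEFORE_BRACE.search(query[:cursor])
    return match.group(1) if match else None


def build_label_index(series):
    """Turn a list of label dicts (assumed /api/v1/series shape) into
    {label key: sorted list of values}, skipping the metric name itself."""
    index = defaultdict(set)
    for labels in series:
        for key, value in labels.items():
            if key != "__name__":
                index[key].add(value)
    return {key: sorted(values) for key, values in index.items()}


def add_label_filter(query, metric, key, value):
    """Insert key="value" into the first selector of `metric`, as clicking
    a table cell might do. Returns the query unchanged if `metric` is absent."""
    matcher = f'{key}="{value}"'
    # Lookarounds keep "up" from matching inside a longer name like "uptime".
    pattern = re.compile(
        r"(?<![\w:])" + re.escape(metric) + r"(?![\w:])(\{([^{}]*)\})?"
    )

    def repl(m):
        existing = m.group(2)
        if existing:
            return f"{metric}{{{existing},{matcher}}}"
        return f"{metric}{{{matcher}}}"

    return pattern.sub(repl, query, count=1)
```

A query field would call `metric_at_cursor` on each keystroke, fetch the series for that metric once, and offer the keys (and, after a key is typed, the values) from `build_label_index`. The sketch only looks to the left of the cursor, so it shares the limitation mentioned in the talk for queries where the metric name appears to the right.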
Info
Channel: CNCF [Cloud Native Computing Foundation]
Views: 10,118
Rating: 4.7192984 out of 5
Keywords: CloudNativeCon + KubeCon 2017, CloudNativeCon Europe, KubeCon, CloudNativeCon + KubeCon Europe 2017, KubeCon Europe 2017, CloudNativeCon Europe 2017, CloudNativeCon + KubeCon, CloudNativeCon + KubeCon Europe, CloudNativeCon 2017, CloudNativeCon, KubeCon 2017, KubeCon Europe
Id: bfSMDERvkZY
Length: 28min 2sec (1682 seconds)
Published: Mon Apr 10 2017