PromCon EU 2019: PromQL for Mere Mortals

Captions
So, my name is Ian Billett. I'm an engineer at Improbable, where I work on our internal platform observability team, and the title of this talk is "PromQL for Mere Mortals". This talk is about the fundamentals of PromQL: it's about helping people develop the right mental models and thought processes that will help them really get to grips with PromQL. It's primarily aimed at the newer people in the ecosystem, but I hope that even the more experienced among you can take something away. This talk is going to be full of mistakes that I've made, which is good: if any of you go away from this talk and don't make one of the mistakes I made, then this talk has been a roaring success. This was me maybe nine or ten months ago when I started out with PromQL, and I'm not sure how much has changed since then.

A quick agenda: we're going to start with the very basics. We'll talk about time series, instant vectors, and range vectors, and we'll use that as a foundation to build up to talking about gauges, counters, operators, and functions. Then we'll tie it all together with a demo at the end.

The first thing to say is that PromQL is important. Queries, alerts, dashboards: anywhere in the ecosystem that requires the retrieval of metrics data will necessarily involve PromQL in some capacity. But at the same time it's an intimidating technology. This spicy nugget [a complex alerting rule on the slide] was one of the first alerting rules I had to dig into, and even now I couldn't easily tell you what's going on there. In the early days of using PromQL, my workflow was very much: find some queries, mash them together, add some square brackets, and keep trying until the console stopped complaining at me. That wasn't a very sophisticated approach, and it definitely did not demonstrate an understanding of the fundamentals of the language. In all seriousness, though: PromQL is essential, but that also means it's a barrier to entry to the ecosystem, and we must not lose sight of that.

Time series first of all. A time series is a stream of timestamp-value pairs, and each time series is uniquely identified by its identifier. This identifier is a set of key-value pairs, which looks like this in the UI; that's just syntactic sugar for what is basically a JSON object.
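(For reference, a series identifier as rendered in the Prometheus UI looks roughly like the line below; node_cpu_seconds_total is the metric used later in the demo, but the exact label values here are assumed:)

    node_cpu_seconds_total{cpu="0", instance="localhost:9100", job="node", mode="idle"}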
When you set off on your adventure into the PromQL documentation, the first thing you bump into is the four fundamental data types in PromQL. These are strings (we all know what those are), scalars (weird name, but that's just a float to me), and then you bump into instant vectors and range vectors. Huh? What are these?

A brief bit of context now, to absolve me of looking like a complete idiot in the next section: PromQL is my third query language, after SQL and jq (although I'd never publicly admit to knowing jq). As a human learning a new skill, it is the most natural thing to take the mental models you have for the next nearest thing and try to apply them to the new thing you're trying to learn. In my case that was SQL and my knowledge of relational databases: I tried to rationalize time series in those terms, which had hilarious, disastrous, and often frustrating consequences.

So firstly, this is the definition of an instant vector from the documentation: a set of time series containing a single sample for each time series, all sharing the same timestamp. I look at that and think: huh, okay, we have a set of time series, we have a timestamp, and we have a value for each series-timestamp pair. The SQL lobe in my brain starts throbbing at this point, and I immediately think of a table: a set of time series, a timestamp, and a value for each series-timestamp pair. Kind of seems sensible, right? This is what it looks like in the Prometheus UI.

Range vectors now. Again, this is the definition of a range vector from the documentation: a set of time series containing a range of data points over time for each time series. I read that and think: huh, okay, we have a set of time series, we have a range of timestamps, and for that range, for each series, we have a range of values. Again the SQL lobe in my head starts kicking in, and I think of a matrix: the time series across the top, time decreasing down the left-hand side, and a range of values for each time series. Kind of seems sensible, right? Again, this is what it looks like in the Prometheus UI.

At this point I was confused. I was really confused. Given the mental models I've just described, I had some questions, like: if we had a range vector with only one sample in it, is that actually an instant vector? Aren't we really just asking what the difference is between an array and a matrix? And that led me to ask: are they really that different? The answer to this question is emphatically yes, they are very, very different. The problem was that I was fundamentally misunderstanding the Prometheus data model: I was taking the mental models that served me very well in the SQL, relational-database world and trying to apply them to the time-series world.

So let me now introduce you to the mental models I've constructed for myself to help me really understand the Prometheus data model. I'm a very, very visual learner, so I hope this is going to help some of you as well. It looks like this: we have a set of time series across the top, we have time decreasing down the left-hand side, and each point here is a timestamp-value pair in that series. The important thing to notice, though, is that none of the points occur at exactly the same time, and this is exactly how Prometheus works. Prometheus will go off to a scrape target, retrieve some data, and insert it into the TSDB. There are likely a lot of these scrapes going on at the same time, and none of them is happening at exactly the same instant.

With this picture in mind, thinking about instant vectors and range vectors becomes obvious. An instant vector will return to you the most recent sample for each series. With range vectors, you effectively give PromQL two timestamps, and PromQL will return to you all of the values that occurred between those two timestamps. That could be 0 values, it could be 1, it could be 10, it could be any number. So, to hammer this point home one last time: with instant vectors you're guaranteed to get one value per time series, and with range vectors you're going to get any number of values between the two timestamps.
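(In selector syntax, the difference is a duration in square brackets; a sketch using the demo's metric, with the five-minute window chosen arbitrarily:)

    node_cpu_seconds_total{mode="idle"}        # instant vector: the single most recent sample per matching series
    node_cpu_seconds_total{mode="idle"}[5m]    # range vector: every sample from the last five minutes, per series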
With this foundation of time series, range vectors, and instant vectors, we can now start talking about more complex topics: counters and gauges. Firstly, counters. A counter is a time series whose value is only ever increasing: it's going up, it's "monotonically increasing", to use a phrase I don't fully understand. And then there are gauges: a gauge is a metric whose value can go up or down. The tricky thing with counters is that if the instance serving that metric to you crashes, the counter is going to be reset to zero and start increasing again from there. This is going to be important later in this talk.

Operators. There are two types of operators: aggregation operators and binary operators. Aggregation operators first. Aggregation operators only take instant vectors as input, and they only return instant vectors as output. Give one a set of time series, and it will return to you a new instant vector where the labels on each series are only the ones you told it to care about, and the value of each series is the aggregation operator applied to the group of values you told it to care about. If that doesn't make any sense, we'll do a demo and it'll make a lot of sense. The usual suspects live here: sum, min, max, and so on.

Binary operators. There are three types of binary operators in PromQL: arithmetic, comparison, and set operators. These are the usual suspects from programming languages; I'm sure you're all fairly familiar with them. Binary operators, surprise surprise, operate between two operands, which can be either scalars or instant vectors. With two scalars, the result is a scalar. With a scalar and an instant vector, the result is an instant vector. With an instant vector and an instant vector, the result is also an instant vector, but this is where it gets spicy: when you're operating between two instant vectors, you need to be very, very careful about the label sets on both sides, because PromQL is very picky about matching those label sets. We're going to see this in our demo in just a bit.

Functions. There are a bunch of functions available to you in PromQL. They either take an instant vector as input and return an instant vector, or they take a range vector as input and return an instant vector. But there is one function that rules them all, one function that will account for 90-plus percent of your use cases in PromQL, and that is rate. rate is a function which takes a range vector as input and returns an instant vector as output, and it calculates the per-second increase of your counter. rate must only be used with a counter, never with a gauge: it makes no sense to use it with a gauge. This is the most common pattern you'll see in PromQL anywhere, and you need to get it branded into your brain.

The rate function is also magic, and the reason it's magic is the way it handles counter resets, like I mentioned before. Say, for example, you had a target exposing a metric whose values went 4, 6, and then it crashed and reset: 1, 3. The rate function is smart enough to interpret this as 4, 6, 7, 9. Using the rate function protects you against inadvertently seeing the world in a way you're not expecting to see it. It's very important.
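(The pattern he's describing, sketched with the demo's metric; the window length here is an assumption, and the Q&A at the end covers how to choose it:)

    rate(node_cpu_seconds_total{mode="idle"}[5m])       # per-second increase, with counter resets handled for you
    sum by (mode) (rate(node_cpu_seconds_total[5m]))    # the ubiquitous "sum of a rate" pattern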
OK, so we've talked about counters, we've talked about gauges, we've talked about operators, and we've talked about functions. Now, I haven't sacrificed anything to the demo gods today, so bear with me; sorry, all I have is a spoon. And Brian, please don't take down your testing server.

The realistic demo we're going to create for ourselves here is this: for the CPU cores on a machine, we want to figure out what proportion of its time each CPU spent in idle mode. We're going to use this metric here, node_cpu_seconds_total. We can inspect this metric, and we see that we have cpu, instance, job, and mode labels. If we want to figure out the proportion of time spent in a certain mode, we need to take the time spent in that mode and divide it by the total time the CPU spent running.

So the first thing we'll do is figure out how long the CPU spent in idle mode. If I query this metric filtered to the idle mode, we see some fairly large numbers here. OK, cool. Now I want to aggregate these series so that I've only got the cpu label at the end, and I might do something like a plain sum. If I evaluate that... that's wrong, that doesn't look right: we got a response back, but there are no labels on the left-hand side and quite a large number on the right-hand side. The problem is that we weren't telling our aggregation operator which labels we care about, and if you don't tell it, it just assumes you don't care about any of them and doesn't give you any back. That's exactly what's happened here. To get around this, we use the by modifier: by tells the aggregation operator, "hey, aggregate only by these labels I tell you about". There's a complementary modifier called without, which says, "keep all the labels except the ones I tell you about". If we evaluate this query with by (cpu), we can see we get the result we're expecting. This seems kind of sensible, right?

No, this is not sensible, and this is a very subtle point that took me a long time to figure out. The problem is that we're summing over a raw counter, and this is a big no-no, because counters can be reset. You might inadvertently be seeing a state of the world where some of your scrape targets have been reset, so you think the state of the world is one thing, but actually something crashed and you're seeing much lower values. How do we protect ourselves against this? We use our handy rate function. The rate function knows how to deal with counter resets, and you should always use it in this pattern; as I said, it's the most common pattern you're going to see in PromQL. We evaluate that, and it looks very sensible. Then we wrap the rate function up in a sum: you always sum a rate, never ever rate a sum. Brand that into your heads as well; a subtlety, but an important one. When we evaluate this query, we get the result we're expecting, and we're also robust against counter resets. So now we have the numerator of our equation.
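(The progression of queries he's walking through, sketched below; the one-minute window is an assumption, and the demo itself may have used a longer one:)

    node_cpu_seconds_total{mode="idle"}                             # raw counters: large, ever-growing numbers
    sum(node_cpu_seconds_total{mode="idle"})                        # wrong: drops every label and sums raw counters
    sum by (cpu) (node_cpu_seconds_total{mode="idle"})              # keeps the cpu label, but still sums raw counters
    sum by (cpu) (rate(node_cpu_seconds_total{mode="idle"}[1m]))    # sum a rate: robust to counter resets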
Next we need to figure out the denominator of our equation, because we want to divide by the total CPU time, and we're going to take the rate again to be robust. In this query we have cpu on the left-hand side, and we'll also have cpu on the right-hand side, so this is going to work, right? Right? No, this is not going to work at all. The problem is that we're using a binary operator, and as I said, binary operators are very, very picky about which labels are on the left-hand side and the right-hand side. On the left-hand side we have only the cpu label, but on the right-hand side we have the cpu, instance, and job labels, so PromQL doesn't know what to do in this instance. To get around this, we just apply the aggregation operator to our denominator as well and tell it to aggregate by the cpu label. We evaluate that, and we get the result we're looking for, whilst being robust to a class of problems that you might otherwise forget about. Awesome.
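(Putting it all together, the final query looks roughly like this, with the same assumed window:)

        sum by (cpu) (rate(node_cpu_seconds_total{mode="idle"}[1m]))
      /
        sum by (cpu) (rate(node_cpu_seconds_total[1m]))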
Let me just switch back to my slides. OK, so, to recap the gotchas in that demo. One: tell your aggregation operators which labels you care about, otherwise you might get unexpected results. Two: never compare raw counters; this is kind of a meaningless thing to do because of the counter-reset problem, so always wrap them up in rate. Finally: be careful with your label sets when you're using binary operators, or you're going to get unexpected and frustrating results a lot of the time.

That's most of what I have to say. Resources for PromQL: the Robust Perception blog is excellent; the posts are short and easily digestible, and for best practices, go and look there. The "Prometheus: Up & Running" book is also excellent, and the documentation, obviously; that goes without saying. That's everything I have to say. Obligatory "we are hiring" slide (everyone is hiring, who isn't?). Come talk to me afterwards about Improbable or anything else. That's it, thank you very much.

Any questions? Oh, there's one. Hello.

[Audience] What is the difference between irate and rate?

Between irate and rate? I believe that rate gives you the per-second average across all the samples in the range, whereas irate only calculates the rate between the last two samples. Is that right? Yep. OK, perfect.

[Audience] Can I ask a question about staleness in five minutes?

[Audience] I've seen a pattern of doing the sum of a rate of a histogram. Do you know what that does? I know that it's something you do, but I've never understood it.

Sum of a rate of a histogram... can anyone...?

[Audience member] I can try to give a quick explanation. Every bucket of a histogram is just a counter of how many observations you've made in that latency category. If you tried to calculate a quantile from that histogram over all the time that the counter increased, you'd get the average over all time, including counter resets and so on, so that wouldn't work. What you really want to look at is usually the latency averaged over the last five minutes or so. By taking the rate over each of the individual bins, the buckets, you see their relative increases during the last five minutes. So you again get a histogram as an output, but now it's basically over the last five minutes, and the absolute values of the buckets don't really matter anymore, because for the quantile calculation we only care about the relative amounts in each bucket. That's my attempt at an explanation.

Which just proves he's not a mere mortal.
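(That pattern usually appears inside a quantile calculation; a sketch, with a hypothetical histogram metric name:)

    histogram_quantile(0.9, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))    # 90th percentile latency over the last 5m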
More questions?

[Audience] I know you folks at Improbable use Thanos, and I believe the effectiveness of PromQL in your alerting and recording rules is somewhat important for you; at least that's what I face in my daily life. How do you assess the effectiveness of alerting rules and recording rules? Do you do it at all, and how do you iterate on them?

Oh gosh, there's a lot to dissect there. The effectiveness of alerting rules is a topic you could spend a whole day talking about; I would only point you to the alerting best practices, you know: page-level alerts only if something is immediately actionable, and something that looks bad but can wait till the morning shouldn't page. Effectiveness in the sense of queries taking too long: there's not really a huge amount you can do about that; it depends entirely on the amount of data being queried on the backend. And the effectiveness of the PromQL itself is tricky, because it's not the easiest thing to read, interpret, and grok, so that's definitely something that needs to be thought about and have things put in place for. I'm afraid I don't have much else to say about that.

Any more questions?

[Audience] If PromQL is nice for mere mortals, what can we do from the documentation perspective to make it clearer? Is there anything we can do?

Well, PromQL clearly is for mere mortals, because I'm using it, and I'm a flesh-and-bones human being, last time I checked. As for the documentation: as I said, I'm a very visual learner; diagrams speak to me. If we could take diagrams like these and put them in the documentation, it might have helped me not go through the process I outlined to you all. I really think more diagrams in documentation is better than fewer, but that's just me, and it can be interpreted in different ways, so that has to be a discussion with the community.

[Audience member] I offer trainings!

[Audience] I did not really understand what number we were calculating with that query there. It was around 0.4?

OK, so the metric we were using was node_cpu_seconds_total, and we were calculating what proportion of its time our CPU spends in idle mode, because a CPU has a bunch of different modes: there's interrupt mode, there's user mode, and so on. Admittedly, the answer we had midway through was the same as the answer we had at the end, but that's just a function of the fact that when you take a per-second rate of a metric measured in seconds, you already get the proportion. For completeness, if the metric we were using didn't have this property, we would absolutely have to do the division: think of http_requests_total and the proportion of requests that were 500s on a particular endpoint; there you'd need to do something like that.

[Audience] Great talk. I'm also using the rate function for CPU, but I have trouble deciding what interval I should take. For example, I saw that you took ten minutes, but I want to take one minute, and then there are too many spikes. What are your thoughts?

OK, that's a really good question. When you're choosing how long to rate over, the one absolute golden rule you must always follow is that the time frame must be at least four times your scrape interval in Prometheus. The reason is that scrapes are fallible, they will fail, and the rate function needs at least two points to calculate a rate. Prometheus out of the box scrapes every 15 seconds, so use a minimum of one minute for these kinds of things to get accurate measurements. But obviously, the longer the time frame, the smoother the result, so it's up to you.

Any more questions? Yep.

[Audience] If sum of rate is so common, and we need to smooth the curve with the first derivative anyway, is there something Prometheus can do to make these queries easier and faster, to avoid that run-around?

On making the querying faster: I think that's kind of beside the point; it's going to do the calculation it needs to do. There is an interesting question there about syntactic sugar for sum-of-rate, because the current form lets you mess it up every time if you don't know to sum a rate rather than rate a sum. There may be a very good reason such sugar doesn't exist, but it doesn't come to mind right now; that's down at the PromQL language level, and I'm not quite at that level yet.

[Audience] How do you deal with metrics that sometimes exist and sometimes don't? For example, binary operators with metrics that don't exist.

Binary operators with metrics that don't exist... if you're operating on a metric that doesn't exist, you're not going to be operating on anything. I'm not sure I understand the question fully; come talk to me afterwards.

[Audience member] There are two blog posts from Brian at Robust Perception that might be interesting for this. One is called "Existential issues with metrics", and I think there's another one, "Using time series as alert thresholds", which touch on that a bit: when some metrics are sometimes missing, you want to fill in a default value or something like that.

[Another audience member] The more useful blog posts in this case are "Absent alerting for jobs" and "Absent alerting for scraped metrics".

Cool. All right: the absent function. I think we're out of time, so thank you very much. Thanks very much.
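(A minimal sketch of the absent() pattern those blog posts describe; the metric and job names here are assumed:)

    absent(node_cpu_seconds_total{job="node"})    # returns a single series with value 1 only when no matching series exist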
Info
Channel: Prometheus Monitoring
Views: 10,947
Rating: 4.9854546 out of 5
Keywords: prometheus monitoring, prometheus, promcon, metrics, monitoring, observability
Id: hTjHuoWxsks
Length: 27min 5sec (1625 seconds)
Published: Wed Jan 01 2020