Understanding Prometheus Metric Types | Meaning and Usage (Gauge, Counter, Summary, Histogram)

Captions

Hi, I'm Julius, and let's talk about the four different metric types in Prometheus: gauges, counters, summaries, and histograms. Each of those types you would use in different situations, and there are things you have to know when using them in your applications and also when querying them with PromQL. So let's have a look at each type, what it means, and what you need to know to use it correctly.

[Music]

Okay, let's start with the simplest of the four metric types, which is the gauge metric. Gauges are just metrics that can naturally go up or down because they represent a current measurement or a current count of some sort, like memory usage, a queue length, disk space, or temperature. Usually for gauges you're really just measuring some numeric value that already exists somewhere else in your program or in the real world, and you just want to expose that value as a Prometheus metric.

So when you add a gauge to your application, the instrumentation client libraries give you a couple of methods to update its value. First, the set method allows you to set the gauge to any arbitrary value, but there are also methods that allow you to increase or decrease the current value in a relative way, either by one or by a given amount. Now the cool thing is that the like and subscribe buttons below this video also have an inc method that you can call by simply pressing them, so I'd be super thankful if you want to do that to help this channel.

Okay, and some client libraries even have helper methods to store the current time as a Unix timestamp in a gauge, and that can be useful if you want to expose the last time that some event has happened, like a boot time, a process start time, or the last run of a batch job.

In the metrics exposition format, a gauge just shows up as a single time series like this, at least if you don't split it up by any additional labels. And when it comes to querying gauges in PromQL, there's not much you have to know. The current value of a gauge is already meaningful by itself, so you can often just
graph it as it is. And of course that doesn't mean that you can't apply all kinds of aggregations or other operations to it when you need to. For example, for gauges that contain timestamps, you may want to subtract them from the current timestamp, which you can get using the time() PromQL function, to figure out how long ago an event has happened.

Okay, on to counters. Counters are a bit more interesting. A counter represents a cumulative count over time, like the total number of HTTP requests that your app has handled so far or the total duration in seconds that it has spent handling those requests. And unlike gauges, counters are only ever allowed to go up over time and never down. The only exception to this is when the process that's tracking and exposing the counter crashes or restarts for some reason, and in that case the counter value always has to restart at zero. That's called a counter reset, and we can deal with those gracefully in PromQL.

Now in instrumentation, counter metrics only have two methods to update the value. The inc method allows you to increment the value by one, and you would usually do that when you've just handled a request or some other type of countable event in your application and you just want to record that it happened. Now sometimes you also want to increment counters by fractional values or even by integer values larger than one. For example, if you wanted to track the number of seconds spent handling requests, that would usually be some fractional number of seconds, or if you just handled an entire batch of requests at once, maybe you want to count all those requests in one go as well. For that, counters also have an add method that allows you to increment the value by an arbitrary amount. And importantly, counters do not have any method to either decrease the current value or to set an absolute value, because that just conceptually doesn't make sense for counters.

In the exposition format, counters look exactly like gauges, except that the optional metadata
indicates that the type of the metric is a counter.

Coming to PromQL again, you'll almost never want to look at the absolute value of a counter. If you think about it, counters are just cumulative counts over some arbitrary amount of time that totally depends on when the counter started from zero. So that may be a year ago, but it also may just be a minute ago, and the absolute value doesn't tell you anything useful at all. So what you really want to know instead is how fast the counter is going up, averaged or smoothed over some window of time, so for example to find out how many requests per second you're getting. For that, PromQL gives you the rate, irate, and increase functions, and you will always want to wrap one of those functions around the counter before doing anything else with it. By the way, all those functions can deal gracefully with counter resets, basically by treating any decrease in the metric value under the provided averaging window as a reset and correcting for that as much as possible.

Okay, on to summaries. Summaries are useful if you want to track the distribution of request latencies or of some other set of numeric values as a percentile, or a quantile as we like to call it in Prometheus. When you first create a summary in your instrumentation, you can specify which quantiles you want to calculate along with the error margins, and then whenever you want to track a specific value, you call the observe method on the summary object with that value. So for example, if you just handled an HTTP request that took 2.3 seconds, you would call the observe method with a value of 2.3 to record that duration, and the summary metric object will automatically update the output quantiles based on a streaming algorithm.

In the exposition format, summaries are expanded into a set of time series: one for each computed target quantile, as well as the total number or count of observations and the total sum of observations. So in the case of request latencies, the _count time series
represents the total number of requests you've handled, and the _sum time series is the total time you've spent handling those requests. You can actually see the output of a summary as a collection of gauge and counter metrics, so in PromQL you can just use the individual series like counters and gauges. Just please don't try to average or otherwise aggregate quantiles from multiple service instances or other label dimensions, because there's just no statistically valid way to average over percentiles. For example, if you had the 90th percentile latency from 10 different service instances, there's just no way you can compute the overall 90th percentile latency across all the instances. That's why you typically only see summaries for distributions where you don't care about aggregating across dimensions. If you do need to aggregate, then you'll need to use histograms instead.

So let's talk about histograms. In a way, they're similar to summaries in that they allow you to track the distribution of a set of numeric values, but instead of outputting pre-computed quantiles, a histogram counts the input values into a set of ranged buckets to give you an idea of how many values you've seen for each range category, so for example how many fast or slow or still reasonable requests you've had. And one specific thing about histograms in Prometheus is that they are cumulative histograms, meaning that each bucket also contains the counts of the previous lower-ranged buckets, so that the normal non-cumulative histogram you can see here would look like this as a cumulative one. The benefit of a cumulative histogram is that we only need to define the upper boundary of each bucket range, since each bucket implicitly starts at zero, and we'll see how that upper bucket boundary is encoded as an le or "less than or equal" label in the time series that make up the histogram.

When you create a histogram in your instrumentation, you have to provide a set of bucket ranges to the constructor, and
then you can observe values like request durations into the histogram, just like you would have done with a summary. Now the histogram then automatically takes care of incrementing the right bucket counters for you.

In the exposition format, each histogram bucket is exposed as a single counter time series with an le label indicating the upper value boundary of that bucket. le just means less than or equal, so for a bucket with an le label of 0.2, that would mean that it counts all requests that took less than or equal to 0.2 seconds. And you also get the sum and the count of all observed values, again just like in a summary.

Now since each bucket generates one output time series, the main trade-off you have to make when choosing the number of buckets and their ranges is between cost and resolution. The more fine-grained you make your histogram buckets, the better the resolution of the histogram is going to be, but at the same time, if you go for too many buckets, you might blow up your Prometheus server. So if you want to learn more about those trade-offs and strategies for choosing buckets, check out this Prometheus docs page, which I'll also link down in the description.

So on to PromQL. You can see a histogram as a set of counters and query each one of them individually, but the most common thing you'll want to do is to calculate approximated percentiles from the histogram. You can do that using the histogram_quantile function, which is the only function in PromQL that actually looks at and understands the meaning of those le bucket boundary labels. And since the buckets are counters, you'll almost always want to wrap either the rate or the increase function around the histogram buckets before passing them into histogram_quantile. That way you constrain the input histogram to only the events that have happened within a known and recent time range, rather than some arbitrarily long time range from whenever your application last restarted. And histograms also allow you to aggregate
between instances or other label dimensions using the sum operator. Just always be sure to preserve the special le label in the aggregation, so that you still end up passing a valid histogram into the histogram_quantile function. By the way, since both histograms and summaries give you the total sum and the count of your observations, you can also use both metric types to calculate the average request latency, even without any buckets or quantiles. That's not as good as knowing the distribution, but it's cheap and can come in handy at times.

Okay, one last important thing: there's a new native histogram type in town that's coming to Prometheus soon. It's already implemented in an experimental way, but a lot of things around it are still being tied down, so it's not quite official yet and I won't dive deeply into it right now. But those native histograms will not only be different from the histograms I just described, but even totally different from all the other metric types, in that they will allow you to store an entire histogram in an efficient way in a single sample of a time series. Let me know if you'd like me to make a video about that once it's fully out.

All right, that's it for metric types. If you want to learn more about Prometheus right now, check out my training courses at training.promlabs.com, and please also like and subscribe down below if you want to see more Prometheus videos in the future. See you next time.
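
Editor's sketch (not from the video): the cumulative le buckets and the approximated percentiles described in the captions can be illustrated in plain Python. This is a simplified illustration under stated assumptions, not Prometheus's actual implementation. The function names cumulative_buckets and estimate_quantile are hypothetical, the sample latencies and bucket bounds are made up, the lowest bucket is assumed to start at zero, and the estimate interpolates linearly inside the bucket that contains the requested rank.

```python
import math
from bisect import bisect_left


def cumulative_buckets(observations, upper_bounds):
    """Count observations into cumulative buckets keyed by upper bound ("le").

    Hypothetical helper: returns [(upper_bound, cumulative_count), ...],
    always ending with a +Inf bucket that holds the total count.
    """
    bounds = sorted(upper_bounds) + [math.inf]
    counts = [0] * len(bounds)
    for value in observations:
        # Index of the first bucket whose upper bound is >= value.
        first = bisect_left(bounds, value)
        # Cumulative: that bucket and every larger bucket count this value.
        for i in range(first, len(bounds)):
            counts[i] += 1
    return list(zip(bounds, counts))


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    Simplified sketch: finds the first bucket whose cumulative count
    reaches the target rank and interpolates linearly inside it.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0  # assumption: lowest bucket starts at 0
    for upper_bound, count in buckets:
        if count >= rank:
            in_bucket = count - lower_count
            if math.isinf(upper_bound) or in_bucket == 0:
                # Cannot interpolate into +Inf (or an empty bucket):
                # fall back to the highest known finite bound.
                return lower_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Made-up request latencies in seconds and made-up bucket bounds.
latencies = [0.05, 0.1, 0.15, 0.3, 0.4, 0.8, 1.2, 2.3]
buckets = cumulative_buckets(latencies, [0.1, 0.2, 0.5, 1.0])
print(buckets)  # cumulative counts per bucket: 2, 3, 5, 6, 8
print(estimate_quantile(0.5, buckets))
```

Because only bucket counts are kept, the estimate can be no more precise than the bucket widths, which is the cost versus resolution trade-off mentioned in the captions. In real queries the bucket counts would first be passed through rate or increase, as described above.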

Info

Channel: Prometheus Monitoring with Julius | PromLabs

Views: 18,656

Keywords: prometheus, monitoring, tutorial, howto, guide, gauge, counter, summary, histogram, metric types, promql, instrumentation, exposition format

Id: fhx0ehppMGM

Length: 11min 19sec (679 seconds)

Published: Fri Jan 13 2023