An Introduction to Systems & Service Monitoring with Prometheus • Julius Volz • GOTO 2019

This is a 31-minute talk from GOTO Amsterdam 2019 by Julius Volz, co-founder of Prometheus. Check out the full talk abstract below:

Since IT systems are a critical component of most businesses, we need to ensure that they are always available and working correctly. This talk gives an overview of different monitoring and observability approaches that can help us achieve that goal, and then presents Prometheus as a popular option in this space.

Prometheus is an opinionated metrics collection and monitoring system that is particularly well-suited for dynamic cloud-based environments. To provide maximum insight, it offers a dimensional data model and powerful query language. With an alerting model that integrates seamlessly with the query language, it allows you to define precise and actionable alerts.
This talk will give an overview of Prometheus, its core features and design principles, and explain where Prometheus fits into the monitoring system landscape.

What will the audience learn from this talk?

  • Where the need for systems and service monitoring comes from.
  • What different approaches to monitoring and observability exist.
  • How Prometheus fits into this landscape and what its basic architecture looks like.
  • How Prometheus's data model improves over previous ones.
  • How Prometheus's query language helps to gain better insight into collected data.
  • How Prometheus's architectural simplicity and efficiency make it easy to get started.
  • Why Prometheus works well for monitoring dynamic environments like cluster schedulers and VMs on cloud providers.
Captions
[Music]

Good afternoon. I'm Julius, one of the co-creators of the open source monitoring system called Prometheus. I usually speak about this in front of a more infrastructure-heavy audience, so I'm curious: who here actually uses Prometheus? Yeah, so you'll all be bored, because this is going to be an introductory talk. And who here has heard about it but is not using it yet? Okay, then it might be useful. So this talk is first of all a tiny bit of an introduction to what systems monitoring is and why we do it, and then I'll dive a bit deeper into Prometheus specifically.

First of all, a primer about systems monitoring. Nowadays IT systems are everywhere: whether you want to sell shoes or you're a large tech corporation, you're going to have IT systems, either your own or in the cloud, with software, hardware, et cetera. The goals for these systems are typically similar. You want things to be available: a website should load, it should load fast, and it should show the correct contents. And then, maybe less user-visible but still important to you as an operator, you want to be running things efficiently, so that you're not wasting a bunch of money or CPU time under the hood of your systems.

In reality, though, anyone who has run any kind of complex distributed system will know that things are always in a state of partial degradation: everything is kind of exploding left and right, and the challenge is to make the overall system look okay to the outside world. A couple of potential problems you might see range from hardware to software issues. For example, a full disk might prevent a database from storing more data, which then fails some write requests that a user sees; software bugs; temperature; et cetera. The last point in that list is a bit different from the others: if you have a container scheduler like Kubernetes and you reserve way too much RAM for each container of a certain type, and the container actually only uses around 10% of that, then the unused memory you're reserving cannot be used by anything else. You're wasting money on a misconfiguration of your memory reservation, which is also something you want to catch and do something about.

So this is where monitoring comes in. I use the term monitoring very broadly — there's also the term observability and some other terms — I just say monitoring for anything that gets information or signals from your infrastructure and services in any way and then allows you to gain insight and act on that information. If you have something that looks like a website, something that serves requests, the typical signals you want to collect are: how many requests per second are you getting, what's the latency distribution, what's the error percentage of the requests, and so on, but also things like memory usage, CPU usage, underlying resource usage metrics, and really anything you can imagine measuring. And then you want to react when something doesn't look like you expect.

There are a couple of different types of monitoring; some were more popular in the past and some are still popular nowadays. A typical old one: who here has suffered from using Nagios, or who is still running Nagios? This is the system that did its job mainly in the 90s and still does nowadays in some legacy installations.
Nagios-style systems do very heavily check-based monitoring, which means you have statically configured hosts — one has a database server, one has a web server, and so on — and you run regular check scripts that check, right now, every minute: is the database server running, is the web server running, is the CPU temperature okay right now? These checks then output a status like OK, warning, or critical, and alert people if something looks bad. This is definitely better than nothing, but it's pretty much focused on probing a system from the outside in a very primitive way — checking whether a process is running, the CPU temperature, and so on — and it's not so good for looking deeply into a process to see what exactly is happening. It's also focused on doing checks on individual machines, in a local context, and only on the right-now moment. Later on these systems gained a bit more capability beyond that, but they were really built around these simple types of checks, and especially they were configured in a static way: this is my database server, it has the following checks. That kind of breaks down in an environment where you have, for example, Kubernetes scheduling containers and workloads shifting around a lot.

Another way to get insight into your applications and act on it is logs. Logs just means your application emits either a structured log line with key/value pairs about an event that just happened, or an unstructured log line — a plain text line that you then have to parse later and make sense of. For example, you might just log to local disk, or you might ship logs to a remote system like Elasticsearch or InfluxDB, where you can then do further analysis and take actions based on the data you've collected. Logs are really cool because they give you kind of the highest possible amount of detail about all the events happening in your processes and systems, and they're also really simple to produce — it's usually just a line of code to emit a log event. They do scale in cost with the rate of your events, though: if you're getting a million requests per second and you want to log all of them as events, then you have to log a million events per second, and that can get really, really expensive. They also don't solve the problem of correlating log lines from different services that correspond to the same request. So, due to the cost, it's a bit tough to base your main operational systems monitoring on logs, at least for high-traffic services.

This is where metrics help a bit. Metrics, or time series, track individual numeric values over time. That could be a temperature, memory usage, or the number of requests a certain process has served since it started. These are all numbers; you sample them every 15 or 30 seconds or so and store them as a series over time. Examples of systems in this category are StatsD, OpenTSDB, and Prometheus. The good thing is that metrics are way, way cheaper than logs — imagine again storing a million events per second versus just storing a counter value every 15 seconds, which is a single number. Of course, this gives you way less detail: it no longer tells you everything about every event that happened, but for the usual cases it's good enough to give you an aggregate view of your system and whether it's healthy. For detailed debugging you still often want logs, so you can jump to the logs of selected systems and figure out what exactly happened. Metrics are also relatively simple to produce, and, like logs, they lack inter-service correlation.
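To make the cost argument concrete, here is a minimal sketch (not from the talk; the names are invented) of the same request being recorded both ways: one log record is emitted per event, so log volume grows with traffic, while the counter is a single in-memory number whose size stays constant no matter how many requests it has counted.

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// A single counter: constant-size state, regardless of request rate.
var httpRequestsTotal uint64

func handler(w http.ResponseWriter, r *http.Request) {
	// Metrics approach: O(1) memory, one number to sample or scrape later.
	atomic.AddUint64(&httpRequestsTotal, 1)

	// Logging approach: one record per event, cost grows with traffic.
	log.Printf("event=request method=%s path=%s", r.Method, r.URL.Path)

	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```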
For inter-service correlation, that's where request tracing really helps. This is about tracking a single request through the entire stack of microservices that serves it. The idea is that a request comes in at the load balancer and gets a trace ID assigned, and that trace ID gets passed along the entire chain of sub-services that handle the user request. Each sub-service records the time span it spent handling its part of the request and sends that span to a remote system, where it is recorded and can later be analyzed. This is really great for getting intuition and insight into the life of an individual request — where it spends how much time and where it may encounter errors. Of course, this is kind of like logging on steroids, so you have to make it not too expensive; typically people sample, say, one in every thousand requests. The more aggressively you sample, the less useful it becomes, but also the cheaper it becomes, so it's a trade-off. Tracing is a bit harder to add to your infrastructure because you need every node in the path to cooperate: everyone needs to pass on the trace ID so you can correlate everything in the end. And it's really mostly suitable for tracking request-style information — metrics also let you record things like CPU temperatures and other gauges and histograms, which don't fit so well into this model.
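As a rough sketch of the cooperation that's needed (illustrative only, not from the talk; real systems use standards such as W3C Trace Context or OpenTelemetry rather than a hand-rolled header), every service reads the incoming trace ID, tags its own work with it, and forwards it downstream:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"log"
	"net/http"
)

// traceMiddleware reads an incoming trace ID or assigns a fresh one at the
// edge, so every service in the request path can tag its spans/logs with it.
// "X-Trace-Id" is an invented header name, used here purely for illustration.
func traceMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		traceID := r.Header.Get("X-Trace-Id")
		if traceID == "" {
			b := make([]byte, 8)
			rand.Read(b)
			traceID = hex.EncodeToString(b)
		}
		// Any downstream HTTP call made while handling this request would
		// copy the same header; each service reports the time it spent
		// ("span") to a tracing backend, keyed by this ID.
		log.Printf("trace_id=%s path=%s", traceID, r.URL.Path)
		next.ServeHTTP(w, r)
	})
}

func main() {
	h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8080", traceMiddleware(h)))
}
```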
So now let's get to Prometheus as an example of a metrics-based monitoring system. Prometheus is a monitoring system that's based on metrics and can also create alerts based on those metrics. It gives you tools for the entire range of what you want to do: instrumentation to get data out of the things you care about, collecting that data, and then doing useful things with it, such as generating alerts or building dashboards. It's especially well suited for dynamic cloud environments like Kubernetes, where containers and applications move around a lot. We try to keep things simple and explicitly don't do certain other things: logging and tracing, which I just mentioned, we don't do in Prometheus — we only do time series. We still think logging and tracing are useful, but you will have to use separate systems for those. We let you specify alerting rules which can be potentially complex but have to be very explicit, so we don't do automatic anomaly detection, where the system looks at some data, decides it looks different than usual, and pings someone. Also, Prometheus itself, for simplicity, only has local storage, which naturally has some limitations in terms of horizontal scalability, but there are ways to build more scalable and durable storage around Prometheus.

Prometheus started in 2012, when I and another engineer, Matt Proud, both came from Google to SoundCloud. SoundCloud had already built its own cluster scheduler before Docker even existed, and obviously before Kubernetes existed, and all the existing open source monitoring tools back then were not really suitable anymore for this kind of dynamic environment. We had trouble finding out what was going on in enough detail to make the site stable and fast. So we were thinking back and saying: well, we've worked with a cluster scheduler at Google already, and Google had a monitoring system that worked well with that. So we started building Prometheus in our free time, inspired by Google's internal monitoring system called Borgmon, and then gradually introduced it at SoundCloud. We made it open source, and the world started using it. Prometheus has been part of the Cloud Native Computing Foundation since 2016 — we were the second project in there, after Kubernetes. And, very importantly, we're a bit different from some other open source projects out there: we are not a company, so we're one of the most independent projects. We have at the moment around 20 voting team members on the core Prometheus team, and they all work for different companies — some are at Red Hat, others at Grafana Labs, one is at Google, one at DigitalOcean, and so on; I'm an independent freelancer, also on that team. There's not one single company driving where this project is going. You can find us at prometheus.io.

Alright, let's look at the actual core system architecture of Prometheus. You start out with some things you care about — we call them targets. In the best case these are services where you control the source code directly. That's the best case because then you can just take a Prometheus client library, add it to your code, and add an HTTP endpoint that exposes metrics, because Prometheus is a pull-based monitoring system that collects metrics over HTTP. The client library also helps you track metrics: it keeps track of the state of internal counters, gauges, histograms, and summary metrics. This gives you very good white-box instrumentation, meaning instrumentation where the process itself keeps track, in high detail, of what it is doing inside.

Then there are things where you cannot add an HTTP endpoint with Prometheus metrics directly to the code — it might still be a while until the Linux kernel or a MySQL server has an HTTP server serving Prometheus metrics directly. For these purposes we have the concept of an exporter: a little sidecar process that you run next to the thing you actually want to monitor. Prometheus gathers metrics from the exporter, and the exporter in the background contacts the backend system, gathers metrics in whatever proprietary format that is, and translates them into Prometheus metrics.

Then, as the heart of the ecosystem, you have the Prometheus server. It starts out in its simplest form as a single-node system, and later you can build more scalable topologies. You configure the Prometheus server to scrape, or pull, metrics from all the targets you have configured at a regular interval, and it then stores those as time series, on local disk at first.

Let's do a little excursion into what actually gets transferred over the wire here — what the exposition format looks like. When Prometheus asks one of the endpoints, "hey, what is the current value of each one of your time series?", the endpoint answers with something like the sample below. I'll talk about the data format a bit more later, but basically it is one sample per line, and each line gives the identity of the time series and then its current sample value. This is all that's ever returned — only the current value of each series that is currently being tracked.
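A concrete illustration of that text exposition format (the metric names, labels, and values here are made up):

```
# HELP http_requests_total Total number of HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{path="/api/users",status="200"} 8423
http_requests_total{path="/api/users",status="500"} 17
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.193408e+07
```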
So how does Prometheus know where all the targets are? Nowadays we have quite dynamic environments, so we need to integrate with service discovery, or some other source of truth in your infrastructure, to tell Prometheus where all the things should be and where they are, so that Prometheus can dynamically, all the time, figure out what it should scrape.

You can then build dashboards against the collected data using Grafana or Prometheus's built-in web UI, run automation against it if you want, and also define alerts. Prometheus evaluates the alerts but dispatches them through a separate component called the Alertmanager, which groups them over time and over different dimensions and eventually sends out notifications via email, PagerDuty, OpsGenie, Slack, and other mechanisms — and you can build your own.

Let's go into some of my favorite features — the features which I think made Prometheus as successful as it is today. Maybe they're not so exciting or unusual anymore, because they're being adopted by more and more other systems, but they really helped set Prometheus apart in the beginning: the dimensional data model to track metrics in detail, a good query language to go along with that data model, being able to start simply and efficiently with a single Prometheus server, and integrating with service discovery to make the whole thing work in dynamic environments. I'll go into each one of these now.

First of all, the data model. Prometheus fundamentally stores time series. A time series in Prometheus has some kind of identifier, and we just append timestamp/value pairs to that identifier as the series evolves over time. The timestamp is always an int64, milliseconds since the Unix epoch, and the values are all float64, which turns out to work really well for operational systems monitoring. The interesting difference back then was how we identify time series. We identify a series first by a central aspect of the system we're monitoring, called the metric name — in this case http_requests_total, the total number of HTTP requests a given process has handled since it started. Then, to further differentiate sub-dimensions, we add key/value pairs that we call labels. These say, for example, the path a request happened on, the status code, or the process it came from — that's the instance label. Some of these labels get added by the instrumentation within the process when it handles events, and other labels get attached by Prometheus when it scrapes your process, recording where the metric actually came from. The key/value nature of this data model makes it pretty flexible. It's not a hierarchy like in some systems that came before it — StatsD or Graphite, for example — where the metric name looked a bit like a path in a directory hierarchy and things were more implicit: you had to know which component meant what and where to put new dimensions if you wanted to add or remove them. So this makes it a bit more explicit and flexible.
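Conceptually, one such series looks roughly like this (label values invented): the identity is the metric name plus its label set, and new timestamp/value pairs are simply appended as scrapes come in.

```
# Identity: metric name + label set
http_requests_total{path="/api/users", status="500", instance="10.1.2.3:8080"}

# Samples appended over time: (int64 ms timestamp, float64 value)
(1570000000000, 17)  (1570000015000, 17)  (1570000030000, 18)  ...
```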
So now you have this data model, you collect data into it, and you want to do useful things with it. Prometheus brings its own query language for that, called PromQL, and it's explicitly not a SQL-style language. Some other systems have tried to build SQL-style languages for this kind of purpose, and it turned out that many of the typical computations you want to do on time series then become either impossible or very cumbersome and unwieldy. So PromQL is a bit different, and it's optimized towards the common computations that are useful on time series.

Here are just a couple of examples. Imagine you have one type of exporter running on each node in your infrastructure, called the node exporter, and one metric it exposes is the size of every partition, or every filesystem, that you have. It has a couple of labels on each of these time series — what's the mount point, what's the device, which machine it came from, and so on — and the sample value is the size. Now, if you just want an overview of your infrastructure — say, "give me all partitions greater than 100 gigabytes that are not the root mount point" — you could start out with the metric name, do a negative filter on the mount point label, divide by a billion to roughly get from bytes to gigabytes, and then filter this list of time series by "> 100" to only keep the ones larger than 100 gigabytes. You get a labeled list of output series.

Another common query is the ratio of the rate of 500 status codes divided by the rate of all requests, in this case averaged over the last five minutes. This is one expression with a division operator in the middle. The result here is just a single number, but these binary operations become really magic once you still have some dimensionality preserved by the summing on each of the sides. For example, we might preserve the path label by adding a "by (path)" modifier, and then the binary operator automatically does a join on the path label: it looks for label sets that are identical on the left-hand and right-hand sides of the operator, divides those by each other, and propagates an equally labeled result into the output. So now you get, for example, the error ratio for each path.

Another example: if you track request latencies in what we call a histogram metric, then we have a function that lets you calculate at query time, say, the 99th percentile latency over all my instances, averaged over the last five minutes. We do that in a statistically valid way — of course it's a bit of an estimation, going from a histogram to quantiles, and how you choose your buckets determines how big the error will be. This language can get way more complicated and complex, and you have to learn it — it's a bit of a steep learning curve, admittedly — but then it really pays off.
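Written out as PromQL, the three examples just described look roughly like this (the metric and label names follow common node exporter and client-library conventions; your own instrumentation may name things differently):

```
# Partitions larger than 100 GB, excluding the root mount point
node_filesystem_size_bytes{mountpoint!="/"} / 1e9 > 100

# Per-path ratio of 500-status requests to all requests, averaged over 5m
  sum by(path) (rate(http_requests_total{status="500"}[5m]))
/ sum by(path) (rate(http_requests_total[5m]))

# 99th percentile request latency across all instances, averaged over 5m
histogram_quantile(0.99,
  sum by(le) (rate(http_request_duration_seconds_bucket[5m])))
```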
Now you can start using PromQL either to look at things right now in the built-in expression browser in Prometheus — what is the current value of all time series output by a given expression — or to graph them over time. But if you want to build real dashboards that you can save, share with your colleagues, and make look nice, we would recommend using Grafana. Grafana is the most popular open source dashboard builder, and it supports all kinds of backends, including Prometheus.

Now the cool thing — and that was also new at the time — is that alerting and collecting time series are no longer separate systems; alerting is based directly on the time series data that we collect. The idea with Prometheus is to really collect everything as a time series first, even if it just looks like a boolean value or an enum. Even a simple up-or-down state you might model as a 0 or 1 sample value, and that actually gets compressed and stored really efficiently in Prometheus. Once you have collected that data, you can have central alerting rules in your Prometheus server that act on it. Here's one example of an alerting rule you would configure into Prometheus: it takes an arbitrary PromQL expression, and when the expression returns any result time series, those time series become alerts — so a good alerting expression is one whose result is normally empty. In this case we take an expression from earlier: the ratio of 500 requests to total requests, multiplied by 100 to get a percentage, and then we filter that list of paths down to the ones whose error rate is above 5%. Each path with a too-high error rate becomes an alert and is shipped to the Alertmanager.
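As a sketch, such a rule might look like this in a Prometheus rule file (the metric names, threshold, and labels are illustrative, not the exact rule shown in the talk):

```yaml
groups:
  - name: example-alerts
    rules:
      - alert: ManyErrors
        # Fires one alert per path whose 500-error ratio over 5m exceeds 5%.
        expr: |
          100
            * sum by(path) (rate(http_requests_total{status="500"}[5m]))
            / sum by(path) (rate(http_requests_total[5m]))
          > 5
        for: 5m
        labels:
          severity: page
        annotations:
          summary: 'High error rate on path {{ $labels.path }}'
```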
Just briefly: Prometheus is pretty simple to get started with operationally. You can start it on a local node and it just writes to a local data directory. It's written in Go, which is also quite convenient for operations, especially before containerization took off. You can still get high availability by running two identically configured Prometheus servers, which pull the same data and calculate the same alerting rules; the Alertmanager notices that the alerts are the same by their label sets, and you only get one notification.

A single server can also get quite efficient if you put it on a big machine. In somewhat synthetic benchmarks we've managed to ingest a million samples per second, and typically big Prometheus servers have no trouble storing a couple of million concurrently active time series — meaning that when you do a full scrape of all your different targets, there might be a couple of million different series being tracked across that whole scrape. The on-disk storage format is also really efficient: each sample typically takes around one to two bytes on disk. The local storage is good for keeping a couple of weeks, maybe months, of data. Some people are very courageous and put years in there, but we don't really recommend it, because it's a single disk. You can back it up — you can even take consistent snapshots and back those up if you want — but of course it's not a clustered, replicated, horizontally scalable system. If you want that, there's a way to do decoupled remote storage, either with the remote write and read interface that we have in Prometheus, or with another project called Thanos, made by friends of Prometheus — actually core Prometheus people as well — which integrates in a somewhat different way with existing Prometheus deployments to add long-term storage and durability and gives you a unified view over different Prometheus servers. I can really recommend checking that one out.

The last point is how Prometheus works well with dynamic environments. Nowadays — first came the VMs, then came the microservices and the cluster schedulers, all building up layers — there are more and more moving parts to keep track of, and for a monitoring system it became harder and harder to know what should currently be where, to gather the data reliably, and to define alerts at the right aggregation level. So how do you make sense of these dynamic environments? The one answer Prometheus has to this is, of course, service discovery integration. Prometheus supports talking to different sources of truth in your infrastructure; the most prominent one nowadays is talking to the Kubernetes API server, saying "give me all the pods with a certain annotation or type, give me all the endpoints, ingresses", and so on. Prometheus then uses this information for three distinct but related purposes, and I think it's important to understand that these are different, even though they're related.

First of all, a monitoring system should always know what should be there. Sometimes this question is a bit ignored — for example, in some push-based monitoring systems, if a node or service never reports in, the monitoring system may never know that something should have reported in, so you still need some kind of service discovery integration to correlate incoming data with what should be coming in. With pull and service discovery you get that out of the box: if Prometheus cannot actually find or reach something that should be there, you can already use that for automatic alerting. Second, on a purely technical level, service discovery gives Prometheus the information it needs to actually pull data from the thing it just discovered — the actual HTTP endpoint, and maybe certain parameters needed to scrape it. And last but not least, a good service discovery mechanism also gives you metadata about the object you discovered. Kubernetes, for example, is a pretty good one: it gives you all kinds of metadata about what you discovered, such as pod labels and annotations, and you can then choose, via Prometheus's built-in configuration language, to map that metadata onto the time series that you collect. So now you know this time series came from a pod of a certain type, with a certain app name, et cetera.

In conclusion: Prometheus works pretty well in these modern, dynamic environments by giving you a detailed data model to collect data in, a detailed query language to make good use of it, an easy start thanks to its simplicity and efficiency, and service discovery integration to actually track dynamic containers and the like in your infrastructure. Thank you, and I'm open for questions.

[Applause]
Info
Channel: GOTO Conferences
Views: 14,365
Rating: 4.9708028 out of 5
Keywords: GOTO, GOTOcon, GOTO Conference, GOTO (Software Conference), Videos for Developers, Computer Science, Programming, GOTOams, GOTO Amsterdam, Julius Volz, Prometheus, Cloud Native
Id: 5O1djJ13gRU
Length: 31min 25sec (1885 seconds)
Published: Fri Oct 11 2019