Prometheus: Design and Philosophy - why it is the way it is

Captions
So, our next presentation in distributed systems: one of the super important aspects is monitoring, and today we have Julius Volz, who is going to tell us about Prometheus. Julius, all yours.

So, I'm Julius, I'm one of the co-creators of the Prometheus monitoring system, and I got invited to talk a bit about it: how it works, what it is, and why we designed certain things in Prometheus the way we did. I didn't forget my own laptop, but I did forget the power plug for my laptop at LinuxCon yesterday, which was a big screw-up, and it's a quite special one, so I'm very thankful to be using his laptop right now. The demo will be on my own laptop; I still have a tiny bit of power left, hopefully sufficient for that.

First: why, and what is this talk about? When people come to Prometheus from a different kind of monitoring system, they often get a bit surprised by how it does things differently. Some decisions that we took might be surprising, and you might even think, "why did they do it like that, isn't that stupid?" So I want to give a bit of insight into the reasoning behind some of these decisions and hopefully make you see that there was at least some thought behind them. You don't have to agree with them, of course, but this is our opinion on how you build a good monitoring system.

Just some background on how it got started: it started when Matt Proud and I joined SoundCloud in 2012. We both came from Google, so we were used to the monitoring tools there, but we came to an environment at SoundCloud where they already had—which was revolutionary for the time—their own cluster scheduling system based on containers, before Docker existed, before Kubernetes and all that stuff existed. This was called Bazooka, and it was based on plain LXC, but it was already a highly dynamic environment where microservice instances were changing hosts and ports dynamically in the cluster the whole time. It was not particularly reliable at that time, and we were trying to monitor all these microservices better. The problem was that we had this dynamically scheduled, clustered world, but we didn't have more modern monitoring tools; we still had tools that were made for a more static world—StatsD, Graphite, New Relic, and on the alerting side of things, Nagios. We were really unhappy with the data models and the way you could query the data in those systems, but also with the efficiency of those approaches. And if you look at alerting, Nagios—besides being static—is a system that doesn't really give you a lot of flexibility in alerting based on the history of data; it only really knows a very brief check history and not a lot of depth.

So in the end we decided to build a completely new monitoring system, which turned out to be Prometheus. It was first introduced at SoundCloud, and by now it's a Cloud Native Computing Foundation project, the second one after Kubernetes, and many companies and people are both using it and contributing to it. I also heard that Docker is thinking of adding native Prometheus metrics to their components, which is really exciting, so I think we're going to have a conversation about that, maybe a BoF session, tomorrow.

First of all, just the scope: what is Prometheus? Prometheus is really a numerical, time-series-based monitoring system, and it's a whole ecosystem for doing that kind of monitoring.
We take care of, or define, ways of getting metrics out of the things you care about—whether that's services or network nodes or whatever—in this dimensional time-series format: collecting that data, storing it, giving you a way to query it, and then doing useful stuff with it. That means waking people up at night if something goes wrong, which is the alerting part, and of course also having dashboards and using that data to answer ad-hoc debugging questions about your system. Our focus was very clearly on operational systems monitoring, versus business metrics or so, and Prometheus was specifically created to work well with dynamic systems—environments where things float and shift around all the time and you still want to have insight and be able to track where exactly a metric came from.

This is the general Prometheus architecture. The centerpiece is the Prometheus server, of which you would usually run one or multiple in your organization—for example one per team, or whatever suits your needs. You configure this server to actively pull metrics from the things you care about, and those things can be one of three different types, which is not completely clearly shown here. One type is your own services, where you own the code and can do anything to it: there you can add direct Prometheus metrics to your code and expose them on an HTTP endpoint in a specific format, where Prometheus can then pull them. If you have things like a Linux node or a MySQL daemon—code that you don't want to put an HTTP server with Prometheus metrics into directly—then what you usually do is run a little sidecar job next to that thing, which Prometheus scrapes, and this sidecar translates the internal metric state into Prometheus metrics. And then you might have some short-lived jobs which you just cannot scrape reliably at all, like a daily batch job that runs for a couple of seconds to delete some users and then just wants to report "I ran successfully and I deleted 200 users". For that we have a gateway into the push world, the Pushgateway, where these little jobs—but only the ones that really need to—can push something; it acts as a metrics cache, and the Prometheus server again pulls from that.

The way Prometheus discovers what should be scraped is service discovery, at least ideally. You can configure things statically, but ideally you would use service discovery; we support around ten different mechanisms for the major cloud and cluster orchestration systems. It would actually be really cool to add native SwarmKit discovery in there as well, because the nice thing is that service discovery not only tells us where all the services we want to pull metrics from are, but also all the metadata about them: is it a dev instance, and so on—all the labels coming from service discovery.

Then of course you have the Alertmanager. Usually you run only one of those in your company; it's the central clearinghouse for alerts, which you define in the different Prometheus servers. They send alerts to the Alertmanager, which can correlate alerts and inhibit some based on others; you can configure silences in there; and it also uses this dimensional data model, which goes through the entire chain here, to figure out where to route alerts—whether to send them to one team over email or to another team over PagerDuty, Slack, or whatever. And then you have various visualization options.
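As a sketch of that Pushgateway path (not shown on stage; the metric names, job name, and host below are made up), a short-lived batch job could push its final state with a plain HTTP request in the Prometheus text format:

```bash
# Hypothetical batch job reporting its result to a Pushgateway at pushgateway:9091.
# The job and instance path segments become grouping labels of the pushed metrics.
cat <<'EOF' | curl --data-binary @- http://pushgateway:9091/metrics/job/user_cleanup/instance/cron-01
# TYPE users_deleted_total counter
users_deleted_total 200
# TYPE last_success_timestamp_seconds gauge
last_success_timestamp_seconds 1476400000
EOF
```

Prometheus then scrapes the Pushgateway like any other target, so the batch job itself never needs to be reachable for scraping.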
I would say Prometheus has four main selling points, maybe even five—the fifth one I will mention later. The biggest ones, I think, are the first two: the dimensional data model, being able to really figure out where a metric came from and what it pertains to; and a powerful query language, which we call PromQL, to work with that data model and answer questions about your system at any time. Then, it's really efficient: a single node can ingest something like 800,000 samples per second, a single node can handle multiple millions of time series, and it also stores them really efficiently on disk. And it's operationally really simple—you know what you always hear in Go talks: it's a static binary. Prometheus is built in Go, it's a single static binary, you need a small config file for it, you just start it up and it starts writing to local disk, so you can set it up in a couple of minutes.

What does Prometheus not do? We want to keep the scope manageable. We don't do collection of raw logs, where you need full detail on every event that happened—we only do aggregated time-series data. If you want to save every user request with its user ID, email, IP, and so on, use a logging system. We don't do request tracing—use something like Zipkin for that. We don't do magic anomaly detection, where we just go into your data, try to figure out automatically that something looks weird, and wake someone up—we only do very explicit queries that result in alerts. We also punted a bit on the whole long-term, horizontally scalable, durable storage question; there are ways to do that externally in a decoupled way, and I'll talk a bit later about why we made that decision. And we're focusing on monitoring, so we didn't integrate user and authentication management into Prometheus itself. So these are the non-goals of Prometheus.

Just some visual examples: you have the built-in expression browser in Prometheus, where you can enter any PromQL expression and it will give you the current state of any resulting time series, just to give you a glimpse into what the cluster, or whatever you're looking at, currently looks like. And then you can also graph things. In this example you have one example PromQL query: the Bazooka instance CPU time in nanoseconds is a counter, and this is not just one number—it's actually a counter for every instance in the cluster, keyed by dimensions such as app, proc, revision, and others. (I'm managing to use this thing here wrong—I probably have to move the mouse up again.) This inner selector actually selects thousands of time series, and then you take the rate, as averaged over the last five minutes, over all of these time series. Then we say: we're summing up over all dimensions, but we're preserving the app and proc dimensions, which in Bazooka basically identify a service, and then we take the top three. So this would give us the current top three CPU-consuming services in the Bazooka cluster—and you could imagine a similar thing in a Swarm cluster, for example. Oh, okay, let's try again—ah, it worked, okay.
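The query being described would look roughly like this in PromQL—a reconstruction, with the metric and label names of the Bazooka setup assumed rather than taken from the slide:

```
# Top 3 CPU-consuming (app, proc) combinations, i.e. services, in the cluster
topk(3,
  sum by(app, proc) (
    rate(bazooka_instance_cpu_time_ns[5m])   # per-instance CPU rate averaged over 5 minutes
  )
)
```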
So the data model is the first design point I really want to talk about—not because it's particularly controversial, but because it's different from other monitoring systems. You might know StatsD and Graphite, which have either a flat or a more hierarchical data model, and as Steve already mentioned in his talk, it gets a bit inflexible to work with that kind of data model. In Prometheus we have a metric name and just a bunch of key-value pairs attached to it, and the metric name together with its key-value pairs identifies one time series. So you might have the total number of HTTP requests in the system, and then the path they happened on, the status code, and so on. If you tried to encode the same kind of data in a hierarchical data model, it would look somewhat like the right side—people might be familiar with this kind of thing from systems like Graphite. Now if we query, for example, for everything that has method="POST", it is really explicit in a label-based data model: you see exactly what you're selecting on, you don't need to know how many other path components there are, and you don't need to implicitly know which path component means what. Also, in a hierarchical data model, depending on the place in the tree, the different hierarchical components might have totally different meanings depending on which application they belong to. So we think the label-based data model is better for this cross-cutting selection, it can be more efficient—the hierarchical model might not be indexed as well for this kind of cross-cutting selection—and it's more explicit.

We have a non-SQL query language, which trips some people up. With time series databases, people often think: well, SQL is the language that all these data analysts already know, so let's invent a dialect based on that. With PromQL we went for a completely custom language. So let's look at some examples of what PromQL looks like for typical queries versus a SQL-like dialect that I just invented. Here you have a query for the total number of API requests of a certain system. Again, this is not just one number: this selector potentially selects hundreds or thousands of time series, and you take the rate, as averaged over five minutes, over all of them. It automatically propagates the same labels the input set had into the output set, so the output set will have the same number of elements—the same number of time series with the same labels—just with the rate applied to them, kind of like a map in a functional language. I could imagine that in a SQL-like dialect this would look somewhat like this: you'd have to explicitly say you want all these labels in the output—you could use a star, but then you might also select the value, of which you only want the rate—and then you have the metric as the table. Not too bad, maybe.

Another example: let's say you have the temperature in Celsius, you measure it in all kinds of places in the world, and you want to average it by city, but only for cities in Germany. This could look like this in PromQL, and the output set would be the average temperature in every city in Germany. Again, this wouldn't be too bad in the SQL-like dialect; it could look something like this.

But what about a use case which is really common in Prometheus, where you want to do binary arithmetic, filtering, or comparisons between two whole sets of time series? In this case you might have the total number of errors as a counter in a service, and the total number of requests, and you want to calculate an error ratio. Again, this is many numbers divided by many numbers, because it's all dimensional, and the division automatically joins up matching label sets on the left and the right to produce an output set of equal cardinality. There are many ways to play with that—if you have different dimensionalities on the two sides, there are modifiers—but this is a simple example, and it gives you the error ratio. I'm not sure, but it could look somewhat like this in a SQL-style dialect: again you're selecting all the labels, you're dividing, then you have to explicitly join—I omitted a bunch of stuff because it would have gotten too long—so basically we end up with a pretty unwieldy query.
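For reference, the PromQL side of those examples would look roughly like this (a sketch; the metric and label names follow the usual Prometheus documentation examples rather than the exact slides):

```
# Label-based selection: every series of this metric with method="POST", across all other dimensions
api_http_requests_total{method="POST"}

# Per-second rate averaged over the last 5 minutes, one output series per selected input series
rate(api_http_requests_total{method="POST"}[5m])

# Average temperature per city, restricted to Germany
avg by(city) (temperature_celsius{country="germany"})

# Error ratio: '/' joins series with matching label sets on the left and right
  rate(api_http_errors_total[5m])
/
  rate(api_http_requests_total[5m])
```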
So the conclusion for us was: PromQL, we think, is better for the kind of metrics computations that we typically want to do in Prometheus. Also, PromQL was designed from the beginning to only do reads—only selects, basically—so why always say SELECT, SELECT, SELECT if you're only reading anyway, and writing happens through a different, out-of-band path.

My favorite topic: pull versus push. How much time do I have left? Let's just check—okay, we started late anyway. This causes some FUD sometimes—FUD is something the Docker world should be familiar with. So: we really like pull. We also acknowledge that in some situations it will create problems for people, but I want to outline what we think is nice about a pull-based approach to gathering metrics. First of all, when you pull directly from all the things you care about and one pull fails, you already have a signal that something is wrong—that instance might be down—and you can already use that for basic health monitoring. Another thing you can do is something like horizontal monitoring—that's a term coined by one of our contributors. Let's say you have team A with their monitoring stack and their service stack, and team B with their monitoring and their service stack. Maybe team B wants to know some metrics from over here, because they're interesting for them. In a push-based world you'd have to talk to the other team and say: can you configure all your instances to push to two monitoring systems—but maybe only some of them. In a pull-based world, all you have to do is change your own monitoring system to pull in exactly the metrics that you want. You should of course maybe still talk to the other team, but you can be very flexible about what you pull from where, and you only configure that in the monitoring system. In general, a pull-based approach makes these kinds of things very flexible.

Another good thing you can do is run a copy of production monitoring on your laptop, just by copying the same Prometheus config to your laptop and running it—you'll get the same data, assuming you're on the VPN and so on—and that lets you play with the monitoring and try out different alerting rules and all these kinds of things. It also allows you to run two completely identically configured Prometheus servers in production to get high availability in a very simple way: they don't talk to each other, but they pull in the same data, so they can calculate the same alerts based on the same pulled-in data and both send them to the Alertmanager. If one goes down, you still get the alert from the other one, and the Alertmanager will actually deduplicate based on the label set.
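A minimal sketch of that HA setup, assuming a current Prometheus release and made-up hostnames (the release shown in the talk configured the Alertmanager differently): the exact same configuration file is deployed to two independent servers, and both point at the same Alertmanager, which deduplicates the resulting alerts.

```yaml
# prometheus.yml — deploy this identical file to two servers, e.g. prometheus-a and prometheus-b
global:
  scrape_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']   # both replicas send their alerts here

rule_files:
  - alert.rules.yml                        # identical alerting rules on both replicas

scrape_configs:
  - job_name: 'my-service'
    static_configs:
      - targets: ['app-1:8080', 'app-2:8080']
```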
Now, there's one point where people say: but a pull-based system is really hard, because now I need a lot of configuration in my monitoring system—it needs to know where all the things in the world are—and that's really cumbersome. And I would say: well, you kind of need to know that anyway. If you're building a monitoring system, it should know what is supposed to be out there in the world—you need that part in the monitoring system anyway, otherwise it cannot tell you if some instance is doing crazy stuff or is completely failing to report in. But in a pull-based world, the instances themselves don't need to know anything. In a push-based world, you additionally need configuration in all your services so that they know who they are, so they can attach their identity to the metrics they're sending, and they need to know where the monitoring lives—which also ties into this flexibility aspect.

Last point: yes, pull actually scales pretty well. I wrote a whole blog post on that—if you go to the prometheus.io blog you'll find it there. Of course, people do have problems with pull in certain setups, for example if they have firewalled-off network segments, or—it gets even worse—if they're doing IoT stuff and have Internet-of-Things devices in people's homes: how can they pull from there? Admittedly, that's not the best use case for Prometheus. But for many of these cases there are alternatives and workarounds. In general we recommend running Prometheus as close as possible to the monitored targets and services, because then you have fewer—oh, it's doing things by itself, that's amazing, what is it doing, okay, okay, anyways—run Prometheus on the same network segment; fewer moving parts is better. You could of course open ports to pull through certain things, but that wouldn't work in the IoT scenario—you can't really ask people to open ports on their plastic routers so you can pull from their homes, that's pretty unrealistic—or, you know, open a tunnel from your devices to some place on the internet from where you can then pull. Yeah, not that great. Anyway, Prometheus is more made for the kind of scenario where you're in a data center, you have full access to the stuff that's in there, you control it, and you can pull from it.

The next point is one that's also going to be interesting to discuss with the Docker people like Steve: there is a philosophy difference around whether you have one big exporter on a machine that collects all the metrics for that machine, and Prometheus just collects them from that exporter, or whether you run one exporter process per process that you care about—per service process, for example. Let's say you had this kind of uber-exporter on a machine, and you have all kinds of different applications pushing their metrics to it: a MySQL server, some host statistics, and maybe a web application that runs there. Sounds pretty cool, right? They only need to push locally to one big uber-exporter, Prometheus comes and collects the metrics data from there, and all is good. But this has a couple of downsides. First of all, it's a bit of an operational bottleneck: the services on the host might belong to totally different people and groups, and this single uber-agent on the machine might belong to yet another group, and now they all have to talk to each other. It becomes a single point of failure with little isolation between the different metrics pushed there, so one user pushing too much can maybe break everyone else. And it gets harder for the monitoring to scrape exactly the metrics that it wants.
The thing is, usually if you have a shared cluster in your organization—and especially if you're running on a dynamic cluster scheduler—you have all kinds of services from all kinds of different teams running on the same node, and you might only be interested in the web app on that node, not the MySQL, not the host stats, and so on. That makes this scenario a bit more difficult. It also becomes harder to do up-ness monitoring and metadata association. Remember, in the pull-based model Prometheus gets metadata about the things it pulls from via the service discovery mechanism: it tells us "this is a development instance" and it has these and those other labels and properties, and we can map that into the time series data. Then, when we see that a scrape of a specific process fails, we know exactly for which labels that failed, and we can alert with exactly those labels and route on those labels, and they give you a lot of information. If you instead first push that stuff locally to an exporter and then Prometheus pulls from it, the services have to push the data with the right identity to that single uber-exporter, and the monitoring system needs equivalent configuration to join that data up and know what is wrong and what is missing—so this becomes harder as well. So in general what we recommend is really running one exporter per process, for those processes that you cannot directly instrument, like the Linux host or a MySQL daemon where you wouldn't want to go hack into the C code. If it's your own process, go directly to your own process and instrument it directly. This way Prometheus knows exactly—keyed by all the nice dimensions coming from service discovery—what's out there, whether it's healthy, and which metrics belong to what. Cool.
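A rough sketch of what that per-process layout looks like in a scrape configuration—the hostnames are made up, and 9100 and 9104 are the default ports of the node exporter and the mysqld_exporter:

```yaml
scrape_configs:
  - job_name: 'node'        # host-level metrics via the node exporter, one per machine
    static_configs:
      - targets: ['db-host:9100']

  - job_name: 'mysqld'      # MySQL metrics via the mysqld_exporter sidecar, one per mysqld
    static_configs:
      - targets: ['db-host:9104']

  - job_name: 'webapp'      # the web app is instrumented directly and scraped on its own port
    static_configs:
      - targets: ['web-host:8080']
```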
So why didn't we do clustering? We have only the simple local storage. Clustering is really hard to get right. Some people ask for it—"we just want to keep all data forever, scale horizontally, recover gaps on one node from another node", and so on—but clustering is hard to get right, and it's the first thing that might break in case of a network outage, for example. And you want your monitoring system to always be up with the most recent data; that's the most important thing—that it can still use the most recent data to alert. If you just run two Prometheus servers in this replicated HA mode, where they don't talk to each other but pull in the same data, you have that guaranteed. So that actually makes for a more stable and simpler system in total.

In conclusion—I could talk about many more points, but then this would get too long—for me it's not so much about Prometheus specifically, but more about all these different monitoring philosophies and decision points that we took in Prometheus. If Prometheus died tomorrow, that would be fine, as long as I get to see the same decisions in other systems. And I would love to work together with the Docker people to integrate Prometheus metrics, of course, and to think about how to apply some of these best practices. I acknowledge—I already talked to Steve—that some opinions differ, and we need to figure out what's best for everyone in the end.

I have a demo. For that I will need my own laptop and this awesome USB Ethernet adapter, and using the last power that's remaining in my laptop, I hope this will work out. That's fine, that's fine—it's already working. Do I have Ethernet connectivity?

The first thing I want to do is show you locally, without containers, how to set up Prometheus with minimal, just-Linux-node monitoring, to show how few moving parts there are. I have a directory with two static binaries and one small config file. One binary is the central Prometheus server itself, and it has a configuration file in a simple YAML-based format. In this case I'm just scraping two services: Prometheus is scraping itself on port 9090, and it will scrape the so-called node exporter, which is one of these little sidecar jobs that basically exports Linux host metrics—it goes to the /proc filesystem and the /sys filesystem, extracts Linux host statistics from there, and exports them in the Prometheus format. The node exporter is not yet running, so that scrape should actually fail. There are of course many more config options, but this is just to keep it simple. Now I can just run Prometheus—by default it reads from that prometheus.yml—and I should be able to get to my Prometheus server. It's up and running, and I should see in the targets page that the node exporter is down, because it's not running yet. I configured it statically in this case; normally you would use service discovery, and it would discover maybe even multiple targets for each service.

So let's bring it up and see how that works. You can specify many flags here, but by default it does kind of the right thing and exports a lot of metrics. I started the node exporter, and now it's up in Prometheus, and we can actually check the endpoint and look at what this format looks like. It reports some metrics—one time series per line, with all these dimensional labels. It exports some metrics about the node exporter process itself, like how many goroutines are in there and so on; these get included automatically when you use the Go client library for Prometheus. And then there are all the node metrics, like the counters for every CPU and CPU mode—how much CPU time in seconds was used since system boot-up—disk metrics, and so on and so on.

So we can go to the query interface and ask for all the node CPU metrics, getting back many time series. We might not care about every individual CPU, so let's just sum that up—but that only gives us one number, so maybe we just want to get rid of the CPU ID label, because we don't care about the individual cores. Now we get a more limited result back. We don't have much data collected yet, so let's do it like this. But this is a counter that just goes up forever, so that's boring—we want to take the rate over it, averaged over, say, the last 15 seconds, because we're collecting every second here, and now we're getting a CPU rate. The rate takes the per-second rate of a seconds value, which gives us a ratio; multiplied by 100 that would be percent. And if we only wanted the idle mode, we could say: only give us the idle CPU mode, and now we know how many percent idle we are—and since we have multiple cores, it's more than 100. We could instead take an average without the CPU label, and then we should get something that is under 100. So that's basically just some very simple PromQL query examples to get started.
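A hedged reconstruction of that local setup—the exact file wasn't shown in the captions, the 1s scrape interval just mirrors the demo, and newer node exporter releases call the CPU metric node_cpu_seconds_total instead of node_cpu:

```yaml
# prometheus.yml — scrape Prometheus itself plus a local node exporter
global:
  scrape_interval: 1s               # the demo scrapes every second; 15s is a more usual default

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']   # node exporter default port
```

And the query progression being walked through, roughly:

```
node_cpu                                                    # one series per instance, cpu, and mode
sum without(cpu) (node_cpu)                                 # drop the per-core dimension
rate(node_cpu[15s])                                         # per-second rate over the last 15 seconds
rate(node_cpu{mode="idle"}[15s]) * 100                      # percent idle, still one series per core
avg without(cpu) (rate(node_cpu{mode="idle"}[15s])) * 100   # percent idle per instance, averaged over cores
```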
I have, like, really one minute left. Jérôme Petazzoni from Docker was really nice and set me up with a five-node EC2 Swarm cluster, and I'm going to try bringing up Prometheus on that now. I first wanted to show it—yeah, okay, maybe it's not going to work—oh right, I forgot, yeah, a cable would really help of course; I thought I had already plugged this in, this is genius. Okay, alright, cool—the remaining battery will be enough. So I have a little Swarm cluster here—how do I do this—docker node ls—we have five nodes here. I'm not really a Docker specialist, so I'll probably do some things wrong, but first I'll create a network called prom—okay, docker network create prom—and then I will start a node exporter. The tricky bit is that normally this node exporter process should not be run in a container, because it really needs to get at your /proc and /sys filesystems and at the host system to collect all the metrics it needs. But if we do want to run it in a container, we can bind-mount all these things from the host into the container so it can read the same things, and it ends up reporting pretty much the same metrics. Just for the demo I'm also adding a publish here, so we can play with it in the browser. So let's start the node exporter—alright, right, yep, I need an overlay network: docker network rm prom, then create it again with the overlay driver—that's why it failed. Okay, let's create the node exporter again, named node-exporter—and it's on there, that's cool, we should be able to see it on this IP in the cluster. By the way, the global thing I did there—--mode global—started the node exporter on every node in the cluster, so if I SSH into any of the other nodes it will be there too, and this is actually a load-balanced endpoint over all the node exporters.

Just for the demo I'll also bring up cAdvisor, which is a container exporter—it exports metrics about all the containers running on a host—and then I'll bring up Prometheus itself. All I did here is a little Dockerfile that inherits from the standard vanilla Prometheus image and just swaps out the Prometheus configuration—the simplest hack to get a different configuration file in there, just for the demo. So I'm going to build that, tag it for a custom registry, and push the image to that custom registry, just so that we have the configuration file baked in there. And now we're going to create the Prometheus service—Grafana can come later. So now we should also have Prometheus running here. How much time do I have left? Like zero minutes? One, two? Okay, perfect, that's all I need.

We have Prometheus running here, we can start Grafana, and we'll have Grafana running on the same IP. What we can do now is log in—admin/admin, it's always the default in Grafana—and create a data source, call it Prometheus, type Prometheus. The cool thing is that we can now refer to the Docker service name, so that would be just the Prometheus service on port 9090, proxied through Grafana, I think—that should work—yup, the data source is working. And now we can create a dashboard: a new graph from this Prometheus data source. I'm going to grab a cheat-sheet query here—this one, for example—put it in, and we're getting the data. Of course we don't have six hours of data yet, we have maybe a couple of minutes, but it's showing that query. So that's basically how you could set all of that up, in a very hacky way, briefly, on Docker Swarm.
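A rough reconstruction of those commands, not a verbatim replay of the demo: the image names, registry, and exporter flags are assumptions, and current node exporter releases take --path.procfs/--path.sysfs for the bind-mounted host paths:

```bash
# Overlay network that all monitoring services join
docker network create -d overlay prom

# One node exporter per node (--mode global), with the host's /proc and /sys bind-mounted in
docker service create --name node-exporter --mode global --network prom \
  --publish 9100:9100 \
  --mount type=bind,source=/proc,target=/host/proc,readonly \
  --mount type=bind,source=/sys,target=/host/sys,readonly \
  prom/node-exporter \
  --path.procfs=/host/proc --path.sysfs=/host/sys

# One cAdvisor per node for per-container metrics (mounts abbreviated; see the cAdvisor docs)
docker service create --name cadvisor --mode global --network prom \
  --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock,readonly \
  --mount type=bind,source=/,target=/rootfs,readonly \
  gcr.io/cadvisor/cadvisor

# A Prometheus image that just swaps in a custom prometheus.yml, plus Grafana on port 3000
docker service create --name prometheus --network prom --publish 9090:9090 my-registry/prometheus-swarm
docker service create --name grafana --network prom --publish 3000:3000 grafana/grafana
```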
We now have node metrics from all the nodes in our Swarm; we would also have container metrics, but I didn't go into that. And you can see that the way we configured Prometheus on this Swarm cluster is that it discovers the different tasks of the SwarmKit services via DNS. It would be better to not just use DNS but to somehow talk directly to the managers—however that works—and get the real label-based metadata for each instance, so that we can map it into the time series. But you can see that it has actually discovered five cAdvisors, one for every node, five node exporters, and one Prometheus server. So yeah, that's basically it. I hope to see you around, and thank you very much. Of course I'm open for questions if there is still time. — Sure, yeah, we can just do questions at the, what's it called, birds-of-a-feather session. — Exactly, yeah, or in the 30-minute break. I might have to run and try to get a power supply for my laptop before I go to London, otherwise I'm royally screwed, but I will definitely be around either today or tomorrow—I mean, tomorrow I'll definitely be here—so let's discuss all the Prometheus things then. Cool, thank you very much.
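For reference, the DNS-based task discovery described above could be expressed roughly like this in prometheus.yml—a sketch that relies on Swarm's tasks.<service> DNS name resolving to all task IPs on the overlay network:

```yaml
scrape_configs:
  - job_name: 'node'
    dns_sd_configs:
      - names: ['tasks.node-exporter']   # one A record per task of the global service
        type: A
        port: 9100
  - job_name: 'cadvisor'
    dns_sd_configs:
      - names: ['tasks.cadvisor']
        type: A
        port: 8080
```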
Info
Channel: Docker
Views: 23,349
Rating: 4.8649788 out of 5
Keywords: docker, containers
Id: QgJbxCWRZ1s
Length: 40min 29sec (2429 seconds)
Published: Fri Oct 14 2016