How Prometheus Monitoring works | Prometheus Architecture explained

Captions
In this video we're going to talk about Prometheus. First, I'm going to explain what Prometheus is, the different use cases where it is used, and why it is such an important tool in modern infrastructure. We're going to go through the Prometheus architecture, so the different components it contains. We're going to see an example configuration, and also some of the key characteristics that made it so widely accepted and popular, especially in containerized environments.

Prometheus was created to monitor highly dynamic container environments like Kubernetes, Docker Swarm, etc. However, it can also be used in a traditional non-container infrastructure where you have just bare servers with applications deployed directly on them. Over the past years Prometheus has become the mainstream monitoring tool of choice in the container and microservice world. So let's see why Prometheus is so important in such infrastructure and what some of its use cases are.

Modern DevOps is becoming more and more complex to handle manually and therefore needs more automation. Typically you have multiple servers that run containerized applications, there are hundreds of different processes running on that infrastructure, and things are interconnected, so keeping such a setup running smoothly and without application downtime is very challenging. Imagine having such a complex infrastructure, with loads of servers distributed over many locations, and you have no insight into what is happening at the hardware level or at the application level: errors, response latency, hardware that is down or overloaded, maybe running out of resources, etc. In such a complex infrastructure there are more things that can go wrong. When you have tons of services and applications deployed, any one of them can crash and cause the failure of other services. You have so many moving pieces, and when the application suddenly becomes unavailable to users, you must quickly identify what exactly out of these hundred different things went wrong, and
that could be difficult and time-consuming when debugging the system manually.

So let's take a specific example. Say one specific server ran out of memory and kicked out a running container that was responsible for providing database sync between two database pods in a Kubernetes cluster. That in turn caused those two database pods to fail. That database was used by an authentication service, which also stopped working because the database was unavailable, and then an application that depended on that authentication service couldn't authenticate users in the UI anymore. But from a user perspective, all you see is an error in the UI: can't log in. So how do you know what actually went wrong when you don't have any insight into what's going on inside the cluster? You don't see the red line of the chain of events as displayed here; you just see the error. So now you start working backwards from there to find the cause and fix it. You check: is the application backend running? Does it show an exception? Is the authentication service running? Did it crash? Why did it crash? And so on, all the way back to the initial container failure.

What would make this process of searching for the problem more efficient is a tool that constantly monitors whether services are running and alerts the maintainers as soon as one service crashes, so you know exactly what happened. Or, even better, it identifies problems before they even occur and alerts the system administrators responsible for that infrastructure so they can prevent the issue. For example, in this case it would regularly check the memory usage on each server, and when on one of the servers it spikes over, for example, 70% for over an hour, or keeps increasing, it would notify about the risk that the memory on that server might soon run out.

Or let's consider another scenario, where suddenly you stop seeing logs for your application because Elasticsearch doesn't accept any new logs, because the server ran out of disk space or Elasticsearch reached the storage limit that
was allocated for it. Again, the monitoring tool would continuously check the storage space and compare it with Elasticsearch's storage consumption, and it would see the risk and notify the maintainers of the possible storage issue. And you can tell the monitoring tool what the critical point is at which the alert should be triggered. For example, if you have a very important application that absolutely cannot have any log data loss, you may be very strict and want to take measures as soon as 50 or 60 percent capacity is reached. Or maybe you know that adding more storage space will take long because it's a bureaucratic process in your organization, where you need the approval of some IT department and several other people; then maybe you also want to be notified earlier about the possible storage issue so that you have more time to fix it.

Or a third scenario, where the application suddenly becomes too slow because one service breaks down and starts sending hundreds of error messages in a loop across the network. That creates high network traffic and slows down other services too. Having a tool that detects such spikes in network load, plus tells you which service is responsible for causing them, can give you a timely alert to fix the issue. Such automated monitoring and alerting is exactly what Prometheus offers as part of a modern DevOps workflow.

So how does Prometheus actually work, or what does its architecture actually look like? At its core, Prometheus has the main component called the Prometheus server, which does the actual monitoring work and is made up of three parts. First, it has a time series database that stores all the metrics data, like current CPU usage or the number of exceptions in an application. Second, it has a data retrieval worker that is responsible for getting or pulling those metrics from applications, services, servers, and other target resources and storing them, or pushing them, into that database. And third, it has a web server, or server API, that accepts queries for that stored data, and
that web server component, or the server API, is used to display the data in a dashboard or UI, either through the Prometheus dashboard or some other data visualization tool like Grafana.

So the Prometheus server monitors a particular thing, and that thing could be anything: an entire Linux server or Windows server, a standalone Apache server, a single application, or a service like a database. Those things that Prometheus monitors are called targets, and each target has units of monitoring. For a Linux server target it could be the current CPU status, its memory usage, disk space usage, etc. For an application, for example, it could be the number of exceptions, the number of requests, or the request duration. That unit that you would like to monitor for a specific target is called a metric, and metrics are what gets saved into the Prometheus database component.

Prometheus defines a human-readable, text-based format for these metrics. Metric entries have TYPE and HELP attributes to increase readability. HELP is basically a description of what the metric is about, and TYPE is one of the metric types, three of which are covered here. For metrics about how many times something happened, like the number of exceptions an application had or the number of requests it has received, there is the counter type. A metric that can go both up and down is represented by a gauge, for example: what is the current value of CPU usage, what is the current capacity of disk space, or what is the number of concurrent requests at a given moment? And for tracking how long something took or how big, for example, the size of a request was, there is the histogram type.

So now the interesting question is: how does Prometheus actually collect those metrics from the targets? Prometheus pulls metrics data from the targets over an HTTP endpoint, which by default is the host address followed by /metrics. For that to work, one, targets must expose that /metrics endpoint, and two, the data available at the /metrics endpoint must be in the format that
Prometheus understands, and we saw those example metrics before. Some services already expose Prometheus endpoints natively, so you don't need extra work to gather metrics from them. But many services don't have native Prometheus endpoints, so an extra component is required to do that, and this component is an exporter. An exporter is basically a script or service that fetches metrics from your target, converts them into the format Prometheus understands, and exposes this converted data at its own /metrics endpoint, where Prometheus can scrape them. Prometheus has a list of exporters for different services, like MySQL, Elasticsearch, Linux servers, build tools, cloud platforms, and so on. I will put the link to the Prometheus official documentation and exporter list, as well as its repository, in the description.

So, for example, if you want to monitor a Linux server, you can download the node exporter tar file from the Prometheus repository, untar and execute it, and it will start converting the metrics of the server and making them scrapable at its own /metrics endpoint. Then you can go and configure Prometheus to scrape that endpoint. These exporters are also available as Docker images. So, for example, if you want to monitor your MySQL container in a Kubernetes cluster, you can deploy a sidecar container of the MySQL exporter that will run inside the pod with the MySQL container, connect to it, and start translating MySQL metrics for Prometheus and making them available at its own /metrics endpoint. And again, once you add the MySQL exporter endpoint to the Prometheus configuration, Prometheus will start collecting those metrics and saving them in its database.

What about monitoring your own applications? Let's say you want to see how many requests your application is getting at different times, or how many exceptions are occurring, how many server resources your application is using, etc. For this use case there are Prometheus client libraries for different languages like Node.js,
Java, etc. Using these libraries you can expose the /metrics scraping endpoint in your application and provide the metrics that are relevant for you on that endpoint. This is a pretty convenient way for the infrastructure team to tell developers: emit metrics that are relevant to you, and we'll collect and monitor them in our infrastructure. I will also link the list of client libraries Prometheus supports, where you can see the documentation on how to use them.

So I mentioned that Prometheus pulls this data from endpoints, and that's actually an important characteristic of Prometheus. Let's see why. Most monitoring systems, like Amazon CloudWatch or New Relic, use a push system, meaning applications and servers are responsible for pushing their metric data to a centralized collection platform of that monitoring tool. So when you're working with many microservices and each service pushes its metrics to the monitoring system, it creates a high load of traffic within your infrastructure, and your monitoring can actually become your bottleneck. So you have monitoring, which is great, but you pay the price of overloading your infrastructure with constant push requests from all the services, thus flooding the network. Plus, you also have to install daemons on each of these targets to push the metrics to the monitoring server, while Prometheus requires just a scraping endpoint. This way metrics can also be pulled by multiple Prometheus instances.

Another advantage is that, using pull, Prometheus can easily detect whether a service is up and running, for example when it doesn't respond to the pull or when the endpoint isn't available. With push, if the service doesn't push any data or send its health status, there might be many reasons other than the service not running: it could be that the network isn't working, the packet got lost on the way, or some other problem, so you don't really have insight into what happened.

But there are a limited number of
cases where a target that needs to be monitored runs only for a short time, so it isn't around long enough to be scraped. An example could be a batch job or scheduled job that, say, cleans up some old data or does backups. For such jobs Prometheus offers the Pushgateway component, so that these services can push their metrics to it, and Prometheus scrapes them from there. But obviously, using the Pushgateway to gather metrics in Prometheus should be an exception, because of the reasons I mentioned earlier.

So how does Prometheus know what to scrape and when? All that is configured in the prometheus.yaml configuration file. You define which targets Prometheus should scrape and at what interval. Prometheus then uses a service discovery mechanism to find those target endpoints. When you first download and install Prometheus, you will see a sample config file with some default values in it. Here is an example. We have the global config, which defines the scrape interval, or how often Prometheus will scrape its targets, and you can override this for individual targets. The rule_files block specifies the location of any rules we want the Prometheus server to load. The rules are basically either for aggregating metric values or for creating alerts when some condition is met, like CPU usage reaching 80%, for example. So Prometheus uses rules to create new time series entries and to generate alerts, and the evaluation interval option in the global config defines how often Prometheus will evaluate these rules. The last block, scrape_configs, controls what resources Prometheus monitors. This is where you define the targets. Since Prometheus has its own metrics endpoint to expose its own data, it can monitor its own health. So in this default configuration there is a single job called prometheus, which scrapes the metrics exposed by the Prometheus server. It has a single target at localhost:9090, and Prometheus expects metrics to be available on a target at the path /metrics, which is the default path that is configured for that
endpoint. Here you can also define other endpoints to scrape through jobs, so you can create another job and, for example, override the scrape interval from the global configuration and then define the target's host address.

A couple of important points here. The first one is: how does Prometheus actually trigger the alerts that are defined by rules, and who receives them? Prometheus has a component called the Alertmanager that is responsible for firing alerts via different channels: it could be email, a Slack channel, or some other notification client. So the Prometheus server will read the alert rules, and if the condition in a rule is met, an alert gets fired through the configured channel.

The second one is Prometheus data storage. Where does Prometheus store all this data that it collects and then aggregates, and how can other systems access this data? Prometheus stores the metrics data on disk, so it includes a local on-disk time series database, but it also optionally integrates with remote storage systems. The data is stored in a custom time series format, and because of that you can't write Prometheus data directly into a relational database, for example.

Once you've collected the metrics, Prometheus also lets you query the metrics data on targets through its server API using the PromQL query language. You can use the Prometheus dashboard UI to ask the Prometheus server via PromQL to, for example, show the status of a particular target right now, or you can use more powerful data visualization tools like Grafana to display the data, which under the hood also use PromQL to get the data out of Prometheus. This is an example of a PromQL query: this one here basically queries all HTTP status codes except the ones in the 400 range, and this one basically does a subquery on that for a period of 30 minutes. This is just to give you an example of what the query language looks like. But with Grafana, instead of writing PromQL queries directly into the
Prometheus server, you basically have the Grafana UI, where you can create dashboards that then, in the background, use PromQL to query the data that you want to display.

Now, concerning PromQL, the Prometheus configuration, and the Grafana UI, I have to say from my personal experience that configuring the prometheus.yaml file to scrape different targets and then creating all those dashboards to display meaningful data out of the scraped metrics can actually be pretty complex, and it's also not very well documented. So there is a steep learning curve to learning how to correctly configure Prometheus and how to then query the collected metrics data to create dashboards. I will make a separate video where I configure Prometheus to monitor Kubernetes services to show some practical examples.

The final point is an important characteristic of Prometheus: it is designed to be reliable even when other systems have an outage, so that you can diagnose the problems and fix them. Each Prometheus server is standalone and self-contained, meaning it doesn't depend on network storage or other remote services. It's meant to work when other parts of the infrastructure are broken, and you don't need to set up extensive infrastructure to use it, which of course is a great thing. However, it also has the disadvantage that Prometheus can be difficult to scale. When you have hundreds of servers, you might want to have multiple Prometheus servers that somewhere aggregate all this metrics data, and configuring that and scaling Prometheus in that way can actually be very difficult because of this characteristic. So while using a single node is less complex and you can get started very easily, it puts a limit on the number of metrics that can be monitored by Prometheus. To work around that, you either increase the capacity of the Prometheus server so it can store more metrics data, or you limit the number of metrics that Prometheus collects from the applications to keep it down to only the
relevant ones.

Finally, in terms of Prometheus with Docker and Kubernetes: as I mentioned throughout the video with different examples, Prometheus is fully compatible with both. Prometheus components are available as Docker images and can therefore easily be deployed in Kubernetes or other container environments. It integrates well with Kubernetes infrastructure, providing cluster node resource monitoring out of the box, which means once it's deployed on Kubernetes it starts gathering metrics data on each Kubernetes node server without any extra configuration. I will make a separate video on how to deploy and configure Prometheus to monitor your Kubernetes cluster, so subscribe to my channel, click that notification bell, and you will be notified when the new video is out.
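
As a rough illustration of the text-based metrics format and the /metrics endpoint described in the captions above, here is a minimal Python sketch that uses only the standard library. It is not one of the official Prometheus client libraries mentioned in the video. The Counter and Gauge classes, the metric names app_requests_total and app_memory_usage_bytes, the serve helper, and port 8000 are all assumptions made up for this example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class Metric:
    """One named value plus the HELP text and TYPE shown in the output."""

    kind = "untyped"

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self.value = 0.0


class Counter(Metric):
    """Counts how many times something happened; it can only go up."""

    kind = "counter"

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge(Metric):
    """Holds a current value that can go both up and down."""

    kind = "gauge"

    def set(self, value):
        self.value = float(value)


def render(metrics):
    """Render metrics as text: a HELP line, a TYPE line, then the sample."""
    lines = []
    for m in metrics:
        lines.append(f"# HELP {m.name} {m.help_text}")
        lines.append(f"# TYPE {m.name} {m.kind}")
        lines.append(f"{m.name} {m.value}")
    return "\n".join(lines) + "\n"


# Hypothetical application metrics (names are made up for this sketch).
REQUESTS = Counter("app_requests_total", "Total number of requests received.")
MEMORY = Gauge("app_memory_usage_bytes", "Current memory usage in bytes.")


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers GET /metrics with the rendered metrics, 404 otherwise."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render([REQUESTS, MEMORY]).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving /metrics on localhost (port is an assumption)."""
    HTTPServer(("localhost", port), MetricsHandler).serve_forever()
```

Calling serve() and adding localhost:8000 as a target in a scrape job would be one way to try this out. For a real application, one of the client libraries mentioned in the video is the more likely choice, since they also cover histograms and labels, which this sketch leaves out.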
Info
Channel: TechWorld with Nana
Views: 355,843
Rating: 4.9518666 out of 5
Keywords: prometheus monitoring, prometheus monitoring explained, prometheus architecture, prometheus architecture explained, prometheus monitoring tutorial, what is prometheus monitoring, what is prometheus monitoring tool, how prometheus work, how does prometheus work, prometheus monitoring kubernetes, what is prometheus, monitoring tools in devops, prometheus setup, prometheus monitoring tutorial for beginners, techworld with nana, devops tools, devops, prometheus tutorial
Id: h4Sl21AKiDg
Length: 21min 30sec (1290 seconds)
Published: Fri Apr 24 2020