How to Export Prometheus Metrics from Just About Anything - Matt Layher, DigitalOcean

Captions
Hi everyone, my name is Matt Layher, and this is How to Export Prometheus Metrics from Just About Anything. Just to start, here's a little bit about me: I'm a senior engineer at DigitalOcean and a member of the Prometheus team. You can find me on GitHub and Twitter at mdlayher, and all the content in this talk will be linked from my talks repository on GitHub.

So, a crash course on Prometheus. How many folks here have deployed Prometheus in their environments? Quite a few. Awesome, that's really great. So what is Prometheus anyway? Prometheus is an open-source systems monitoring and alerting toolkit. It uses a pull-based metrics gathering system over HTTP, and a simple text-based format for exposing those metrics to a Prometheus server over the network. It also features PromQL, a powerful query language built right into Prometheus. The Prometheus architecture looks a little something like this, but for the purposes of our talk, all we really care about is the Prometheus server in the middle, a display such as Grafana on the right, and jobs, exporters, and other sources of pull-based metrics on the left.

This is an example of the Prometheus text format. If I curl the node exporter running on my local machine, we'll see a little bit of information about metrics here. A metric has components such as a name; possibly some labels that describe different dimensions of the metric; a value, which is a raw float64; and some metadata, including help text that identifies a metric's purpose, and the type, such as counter, gauge, summary, or histogram. This is a super basic example of PromQL: I just say, Prometheus, give me the value of this time series from this instance over, say, a 24-hour period, and this shows the hard drive temperatures on some of the machines at my house.

So what is a Prometheus exporter anyway? An exporter is a system that bridges the gap between Prometheus and systems that do not speak the Prometheus metrics format.
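To make the text format concrete, here is a minimal sketch of how one metric is laid out on the wire: a `# HELP` line, a `# TYPE` line, and the sample itself with optional labels. The metric name `hddtemp_temp_celsius` is just an illustration of the hard-drive-temperature example, not the exporter's actual metric name.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderMetric emits one metric in the Prometheus text exposition format:
// a # HELP line, a # TYPE line, and the sample itself.
func renderMetric(name, help, typ string, labels map[string]string, value float64) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", name, help)
	fmt.Fprintf(&b, "# TYPE %s %s\n", name, typ)

	// Sort label names so the output is deterministic.
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	if len(pairs) > 0 {
		fmt.Fprintf(&b, "%s{%s} %g\n", name, strings.Join(pairs, ","), value)
	} else {
		fmt.Fprintf(&b, "%s %g\n", name, value)
	}
	return b.String()
}

func main() {
	fmt.Print(renderMetric(
		"hddtemp_temp_celsius",
		"Hard drive temperature in Celsius.",
		"gauge",
		map[string]string{"device": "/dev/sda"},
		34,
	))
}
```

In practice the client libraries render this format for you; the point is only that the format is simple enough to read with curl.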
You might have some service that exposes its metrics to something like StatsD, or perhaps it has its own custom JSON endpoint, or expvar from Go, but sometimes those systems aren't compatible with Prometheus, and that's where exporters come in. Typically you would run these on the same machine as the service itself, but not always, as we'll talk about later. Some extremely common exporters include the node exporter, which exposes system metrics from Unix-like machines such as Linux and FreeBSD; the mysqld exporter, which exposes metrics from a MySQL server; and the blackbox exporter, which is a unique one: it actually dials out to remote systems using things like HTTP or ICMP ping to probe these black-box systems.

So let's imagine we're all living in a cloud native future. You've got your Prometheus and you've got your cloud, and the cloud's up: it exposes a metric that says everything's all good, right? So who has an environment that looks something like this? Oh, nobody. I was going to ask to come work with you. But some of us live in a bare-metal reality. We've got all these switches: core routers, spine switches, top-of-rack switches; we've got these racks and racks of bare-metal servers. So naturally, I log into the router and I say, you know, please give me Prometheus metrics. Seems like a very polite thing to ask your router. Oh... no such luck. What gives? OK, so, you know, Linux is pretty open-source friendly, so there's probably, like, a Prometheus metrics device file, right? Oh no, no such luck. So where can I find Prometheus metrics for these systems? If you want to find Prometheus exporters, your best bet is the Prometheus website, or you can scour the internet: the mailing lists, the Prometheus wiki, GitHub, etc. Or, as a last resort, you may just have to roll your own, and that's going to be the focus of our talk today. So let's start with some of the basics of building an exporter in the Go
programming language. To start, your function main is going to build your types, wire up dependencies, and start an HTTP server. First we create this collector type, which we'll talk about shortly, and we make the Prometheus client aware of the collector. We set up an HTTP handler to expose metrics over the standard /metrics endpoint, and then we finally start listening for HTTP connections from Prometheus.

Let's talk about the Prometheus Go client a little bit. The most important interface is probably the prometheus.Collector interface, which the Prometheus client uses to actually expose metrics over the network. Our collector structure here is going to be our implementation of the Collector interface for some service. Let's say we want to expose a metric called requests_total, which is just the number of requests that go to some arbitrary system, and we're also going to explicitly pass a function dependency here so that we can swap it out for testing. This function returns an integer, the number of requests that have occurred, and an error if anything goes wrong.

OK, so let's create our collector. We have this constructor here, and we explicitly pass our dependencies, which makes things much easier to test, such as our requests function, so we can swap it out. This is also where we create the meat of our metrics. We have our requests total metric here: we give it a name, such as exporter_requests_total; help text, such as "the total number of requests that have occurred", which is meant for humans; and, if we want, some label dimensions, but in this particular case we don't need any variable or constant label dimensions, so we go ahead and omit those.

The first method of the prometheus.Collector interface is Describe. Describe accepts a channel of metric descriptions, and it gathers metadata about each metric. Basically, you can collect the descriptions in a slice, iterate over that, pass the descriptions on the channel,
and you're pretty much done. Collect is a little more interesting; this is the second method of the prometheus.Collector interface. Collect accepts a channel of Prometheus metrics that you can send on. What we need to do first is take a metrics snapshot using the function we passed in, and this must be concurrency safe: if you think about it, you might have one Prometheus server, but you could have two or five or ten, so you need to make sure that you lock appropriately so that nothing can run into a nasty data race. We retrieve the value, or, if that fails, we can send an invalid metric to notify Prometheus of the error, and we can alert on that later on. If it succeeds, we take that request value and use the MustNewConstMetric constructor, passing the name of our metric, the fact that it is a counter type, and the raw value of the requests. We'll talk about const metrics shortly and why those are important.

So if you want to build an exporter in Go, here are my recommendations. Build reusable packages: you don't want to mix low-level details of some file format, binary network protocol, or filesystem traversal with actually exporting metrics. If you separate these things cleanly, you're going to do yourself a lot of favors in the long run. Write unit tests: if you think about it, a Prometheus exporter could be a pretty critical part of your production environment, so you need to make sure this thing functions appropriately. My recommendation is to set it up in tests and perform an HTTP GET using some fixed set of inputs and a fixed set of outputs, and compare the two. And finally, use promtool check metrics for linting your metrics: if I curl some exporter and pipe the output through this tool, it will give me recommendations on how to make my metrics more standard. In this case we have a metric called x_gigabytes, and this is a counter, so by convention it should have a _total suffix; you'll also notice that we use a unit of gigabytes, when instead we
should use the base unit of bytes and leave that conversion up to your display systems, such as Grafana.

With that being said, let's go get some metrics. Some of the sources of metrics we're going to talk about today include files, hardware devices, and system calls, and there's definitely some overlap between the three. To start, let's talk about files, in particular gathering metrics from /proc/stat on Linux. /proc/stat contains kernel and system statistics; that's a fun one to say. The numbers indicated here are the amount of time the CPU spent in various states, such as user, system, idle, etc., and these are in a kernel-internal unit called USER_HZ, but you can convert it to seconds later on. The top line shows us a summary of all these times added together, but we don't need that, so we can skip it. And, like I said, for the purposes of our talk we're going to look at user, system, and idle, even though there are, I believe, 11 different values here.

To start, let's focus on creating a clear and concise exported API. We create the CPUStat structure, which contains stats for an individual CPU. We give it an ID in string format, so we can have cpu0, 1, 2, 3, etc., and then we export the values that we care about as integers: user, system, and idle. Next we're going to create this top-level function Scan, and Scan's job is to read and parse CPU stat information from an io.Reader r. If you're familiar with Go, io.Reader is one of the most fundamental interfaces in the Go standard library: it essentially represents something like a file, a network stream, or a byte buffer. This is great because we can accept a file in our production code, but we can also swap it out for a byte buffer or another source later on for our tests. So use interfaces; they're very powerful. Also, as it turns out, the bufio.Scanner type takes an io.Reader r, and we can use that to easily scan over text-based input. We create this bufio.Scanner, and we're gonna skip the first summary
line because we don't need it, and then we enter this inner scanning loop. If for some reason our scanning loop exits early, say, for example, we're reading from a network stream and there's an EOF or something similar, we need to be sure we check the error from the scanner type. This is important and often overlooked, so be careful.

Within the inner loop we need to carefully handle our slice boundaries for this file. Within the loop we have these CPU stat lines, and we know that each one should contain a cpu prefix and exactly 11 fields, so we explicitly check for both of those things. If a line doesn't have a cpu prefix, it's not related to CPU stats and we don't care. But the lines that we do expect should have 11 fields, so we make sure we check for that, because otherwise, if you access your slice later on, you could run into a nasty out-of-range panic that'll take down your program, and nobody wants that.

Now that we've gathered these values into this string slice, we're going to parse the values we care about, user, system, and idle, from indices 1, 3, and 4 respectively, and we're going to parse them into an array of 3 values, because we know exactly how many we need. We iterate over the indices, convert each raw string into an integer, and pack it into our array, and finally we unpack it all into a top-level CPUStat structure at the end of our loop.

Now that we've put this all together, let's build an example to try out our API and just see how it goes. We open a handle to /proc/stat, defer closing it to make sure that we clean up after ourselves, and pass the file, which is an io.Reader, directly to our CPU stat Scan function. If it succeeds, we iterate over all of those structures and print them to the screen, and as you'll see, it seems to work just fine. Like I said, all this code is available in the slides as well as my talks repository, so check that out later if you'd like to see full examples. So now that we've put this all together, let's build a Prometheus exporter.
To start, we need to wire up our dependencies appropriately in our function main. We create this stats function, which is a closure: it has no arguments and it returns a slice of these CPUStat structures. Within each call to this function, we open a unique handle to /proc/stat, defer closing it to make sure that we clean up, and then pass it to our CPU stat Scan function. This fulfills two purposes. One, every time this function is called by the collector's Collect, we get a unique handle to /proc/stat, so we're totally concurrency safe. And two, this is nice because we can swap out this function for one that returns, say, a fixed set of values for our unit tests, which makes your life much easier. We pass this function to our collector type, and then we register it with Prometheus.

So this is our collector type: we're going to export a metric called time_user_hz_total, and we're also going to pass that function explicitly as a dependency of the collector. Here's a tip: when you're working with structures that have repetitive fields, such as the user, system, and idle integers, you can use anonymous structures to simplify your code somewhat. We call our stats function here, we get our stats, and we iterate over each entry. What we want to do is build this slice of anonymous structures; if you've ever done table-driven tests in Go, you may be familiar with this. Essentially we're going to associate each of these mode strings with the raw value, so the user string is associated with the user value, and so on. Next we iterate over all these tuples, unpack them, and then we create the metrics using this MustNewConstMetric constructor, and we associate two labels: the ID of the CPU and the current CPU mode.

Let's talk about const metrics while we're here. Let's say, for example, I have a system with two CPUs and they're hot-swappable. Your exporter is running along, gathering metrics from those CPUs.
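The anonymous-structure trick looks like this in isolation. Each mode label is paired with its raw value in a slice of anonymous structs, exactly the shape used in table-driven tests; in the real collector each tuple would feed a MustNewConstMetric call with cpu and mode labels. The helper name `modeTuples` is mine, not from the talk.

```go
package main

import "fmt"

// CPUStat holds the per-CPU values parsed from /proc/stat earlier.
type CPUStat struct {
	ID                 string
	User, System, Idle int
}

// modeTuples associates each mode label with its raw value using a slice
// of anonymous structs, avoiding three near-identical blocks of code.
func modeTuples(s CPUStat) []struct {
	Mode string
	V    int
} {
	return []struct {
		Mode string
		V    int
	}{
		{Mode: "user", V: s.User},
		{Mode: "system", V: s.System},
		{Mode: "idle", V: s.Idle},
	}
}

func main() {
	stat := CPUStat{ID: "cpu0", User: 100, System: 50, Idle: 400}
	for _, t := range modeTuples(stat) {
		// In the real collector each tuple becomes a const metric with
		// cpu and mode labels; here we just print the would-be samples.
		fmt.Printf("time_user_hz_total{cpu=%q,mode=%q} %d\n", stat.ID, t.Mode, t.V)
	}
}
```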
Suddenly one of the CPUs dies, or you take it out of the system. If you don't use the const metric constructors, the Prometheus client will continue to export the last value for those time series forever. So you need to make sure that if you're implementing the prometheus.Collector interface, you use the const metric constructors, because they allow time series to come and go as time goes on.

All right, let's put this all together and give our exporter a try with curl and just see what happens. We curl the exporter, we run it, and as we'll see, we have some metrics available with our given labels, such as the CPU ID and the mode, and these raw integer values. So if you want to gather metrics from files with Go, here are my recommendations. Use io.Reader whenever possible; this is much more flexible than, say, accepting a file path in your API. The bufio.Scanner type is super useful for all the files that reside in /proc and /sys, because typically these are text files that are pretty easy to parse, and that's a great type for doing so. Always check your slice and array boundaries: you don't want to run into an out-of-range panic in your exporter; it's no fun for anybody. And also check out the Prometheus procfs library; we've done a lot of the work for you for these types.

All right, let's move on to the second part of our talk and gather metrics from some hardware devices, in this case the SiliconDust HDHomeRun. So who here still pays for cable TV? Wow, just a few of us. I love my cable TV, OK? I love NFL and HBO and all those things. But this is a pretty cool device if you have a cable subscription; I understand it also works with over-the-air, but this one's specific to cable. You get your cable from your provider, you plug it in, you get a CableCARD to decrypt the signal, and you plug in an Ethernet cable, and what this becomes is essentially a network TV tuner. So you can use this to watch live TV on different devices, or also record to something like a
Plex media server. They also offer a Linux utility, so I started poking around: I can discover the device on my network and use its ID to ask it questions, such as "give me your tuner debugging information", and as you'll see, there are a lot of different statistics available here. We have information about the tuners, such as the current channel and the channel it's trying to lock to, the signal strength, the signal-to-noise ratio, various bit rates as data passes within the device, and finally the network packets-per-second and error rates.

So naturally, I bust out tcpdump and start looking around at all these packets. But it turns out I didn't ever need to do that, because SiliconDust actually has an open-source library called libhdhomerun, written in C, so we can use that as inspiration to create a Go client. Let's start by building a Go network client API. We create this Client type: it has a mutex so we can serialize access to the device, we're going to embed a connection in there, and we also give it a timeout, because if you're making a network connection, you want a timeout. Next we have this Dial constructor, and Dial's job is to dial a TCP connection to an HDHomeRun device.

First we need to build up our low-level communications types. The HDHomeRun speaks using these specialized packets, and this is what a packet looks like in Go. A packet has a type, which specifies the type of message it carries, and tags, which specify optional attributes such as debugging information, information about the device, and lots of other things. Then we create this little function called execute, and execute's job is to send a single packet and receive a single packet. We lock the connection to the device, marshal the packet to its binary format, write it out, wait for a response, unpack it, and then we're all done. You'll notice I've omitted the error checking, but don't ever do that in production code. Ever. On top of that, we can build a higher-level, friendly API, because who cares what the packets for this
thing look like, right? So we create this high-level function called Query, and Query's job is to perform a read-only query to retrieve information from the device. We accept this query string, put it into our packet, and then send it to the device. While I'm here, I want to mention this really cool tool called c-for-go: given a C header, it can output pure Go constants, so you don't have to write down enumerations by hand. Super useful if you ever work with a thing from the kernel; would recommend.

But we can do even better than that Query function. We create this TunerDebug type, which contains debugging information about a tuner, and then we can add a TunerDebug method to our Client type. We query for the debugging information from the device, for tuner 0 for example, and we retrieve this raw byte slice. And hey, as it turns out, a raw byte slice is a lot like a text file, so it's pretty much the same routine as we did last time: we wrap the byte slice in a bytes.Reader so it adapts to the io.Reader interface, we can put an io.Reader into a bufio.Scanner, scan through it, and hey, we're all done. So now let's build an example program and just give this a shot and make sure it all actually works. I build and run my program, point it at my device, ask it for the tuner 0 debug information, and as you can see, it seems to work just fine.

So let's move on: we want to export HDHomeRun metrics using Prometheus, so how do we do that? Well, the HDHomeRun device has this network API, but we can't actually run our own code on it; it's a locked-down little third-party device. What we have to do is enable our Prometheus exporter to dial out to a remote device, and this is where things get kind of interesting. We create this function dial, which is used to connect to an HDHomeRun device: given some address string, it's going to create the HDHomeRun client for us. Within dial we call our HDHomeRun Dial method, and then we make sure we set a timeout.
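The marshal/unmarshal half of that execute routine can be sketched with encoding/binary. To be clear, this is a simplified stand-in for the wire format, not the real one: the actual HDHomeRun protocol (see libhdhomerun) also carries a length field, variable-length tag encoding, and a trailing CRC, all omitted here, and the type and tag values below are illustrative.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// packet is a simplified sketch of an HDHomeRun-style message: a 16-bit
// type followed by tag/value pairs.
type packet struct {
	Type uint16
	Tags []tag
}

// tag is an optional attribute: a one-byte ID, a one-byte length, and data.
type tag struct {
	ID   uint8
	Data []byte
}

// marshalBinary packs the packet into its binary form for writing to the wire.
func (p *packet) marshalBinary() []byte {
	var buf bytes.Buffer
	binary.Write(&buf, binary.BigEndian, p.Type)
	for _, t := range p.Tags {
		buf.WriteByte(t.ID)
		buf.WriteByte(uint8(len(t.Data)))
		buf.Write(t.Data)
	}
	return buf.Bytes()
}

// unmarshalBinary is the inverse, so execute can decode a device's reply.
func (p *packet) unmarshalBinary(b []byte) error {
	r := bytes.NewReader(b)
	if err := binary.Read(r, binary.BigEndian, &p.Type); err != nil {
		return err
	}
	for r.Len() > 0 {
		var id, n uint8
		if err := binary.Read(r, binary.BigEndian, &id); err != nil {
			return err
		}
		if err := binary.Read(r, binary.BigEndian, &n); err != nil {
			return err
		}
		data := make([]byte, n)
		if _, err := io.ReadFull(r, data); err != nil {
			return err
		}
		p.Tags = append(p.Tags, tag{ID: id, Data: data})
	}
	return nil
}

func main() {
	req := &packet{Type: 0x0004, Tags: []tag{{ID: 0x03, Data: []byte("/tuner0/debug")}}}
	b := req.marshalBinary()

	var got packet
	if err := got.unmarshalBinary(b); err != nil {
		panic(err)
	}
	fmt.Printf("type=%#04x query=%s\n", got.Type, got.Tags[0].Data)
}
```

The round trip is the important property: execute writes the marshaled request, reads the raw reply bytes, and unmarshals them back into the same structure.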
You always want to set a timeout on your network communications, because what happens if something goes wrong? The device is down, your network is down, something in the middle is down. You don't want to leak file descriptors; otherwise you run out of file descriptors, your exporter stops working, you get paged, and nobody's happy.

OK, so we have our dial function, and we pass it as a dependency to our handler type. The handler looks something like this: it implements the http.Handler interface from the Go standard library. And this is where it gets interesting: we need to actually configure Prometheus to send a target parameter with each scrape request. We retrieve the target parameter here, we validate it and make sure it's got a valid port, and if it doesn't, we add the default and join it back together. Then we try dialing out to whatever was passed as the target. If that fails, we return an HTTP 500 so Prometheus can know something is wrong, but if it succeeds, we defer closing the client, wrap it in a small interface for testing, and actually serve the metrics for it.

So what might that interface look like? As it turns out, our HDHomeRun client can do most of the work for us. We create this device interface, which wraps the HDHomeRun client type, and it has the same method signature as our HDHomeRun client, so by the rules of Go, the HDHomeRun client type implements our device interface.

Putting this all together, let's give the exporter a try with curl and that target parameter and just see how things go. We curl our exporter running on, let's say, my local machine, and it dials out to the remote HDHomeRun device. We expose a couple of metrics here: we have network packets-per-second, which shows the packets-per-second rate for a given TV tuner.
We also have this interesting tuner info metric, which contains metadata about each of the tuners available to a device; it exports a constant value of 1, more on this shortly. If you want to configure Prometheus to actually use this thing, you have to make Prometheus pass a target parameter. This configuration is mostly taken from the blackbox exporter, so definitely look at that repository for more information, but the basic idea is: you have this targets list, and you pass a list of, say, your HDHomeRun devices. You create this relabel configuration, which tells Prometheus to pass the target parameter, and also to replace, I believe, the address of the HDHomeRun exporter in the metrics with the address of your HDHomeRun device. Relabeling is a super powerful concept; I definitely recommend reading more about it, and the Robust Perception blog in particular is super useful.

Putting this all together, it seems to work just fine: Prometheus is up and running, it is scraping our HDHomeRun exporter running on some machine, which reaches out to the HDHomeRun device itself, gathers the metrics, and exports them in the Prometheus format, and everything is good.

So let's talk about that information metric again, that metadata. You can use these synthetic information metrics for super powerful PromQL queries. We construct it like this: we have these labels here, which are the tuner, the channel it is trying to lock to, and the channel it's actually locked to, and we're going to export this metric with a constant value of 1; it's a gauge, and it just has these labels. So why is this useful? As it turns out, PromQL is super powerful, and you can use this to effectively perform a relational join between different time series. So if I want to answer a question like "what is the packets-per-second rate for a given channel?", this query seems to work: I'm effectively joining my network packets-per-second metric on the channel label with my HDHomeRun tuner
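The relabel configuration described here follows the multi-target pattern documented for the blackbox exporter. A sketch of what that scrape config might look like, with placeholder addresses (the device IP and the exporter's listen address are assumptions, not values from the talk):

```yaml
scrape_configs:
  - job_name: hdhomerun
    metrics_path: /metrics
    static_configs:
      - targets:
          - 192.0.2.10:65001   # your HDHomeRun device(s)
    relabel_configs:
      # Copy the device address into the target URL parameter...
      - source_labels: [__address__]
        target_label: __param_target
      # ...surface it as the instance label on scraped series...
      - source_labels: [__param_target]
        target_label: instance
      # ...and point the actual scrape at the exporter itself.
      - target_label: __address__
        replacement: 127.0.0.1:9137   # where the exporter runs
```

The effect is exactly what the talk describes: Prometheus scrapes the exporter, passes `target=192.0.2.10:65001` as a parameter, and the resulting series carry the device's address as their instance label rather than the exporter's.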
info metric. As we'll see, on channel 3, for my HDHomeRun device on tuner 0, the packets-per-second rate for that channel was 840. This is another super powerful PromQL concept, and if you want to learn more about this technique, check out Brian's blog post on how to have labels for machine roles.

So if you want to gather metrics from hardware devices, here are my tips. Set timeouts for network connections, always; it won't take you very much time, just do it. Use interfaces for testing; it makes your life much easier, and you shouldn't have to dial out to an actual hardware device to run your unit tests, right? Create these synthetic information metrics: yeah, you could add these labels to every single one of your Prometheus metrics, but that can result in some very high-cardinality time series, and that's no good for Prometheus. So take advantage of PromQL: export your labels as metadata on a single metric and then join them together using PromQL. And finally, learn a little bit about relabeling; it's a super powerful concept, and one I'm admittedly not very familiar with, but it's super useful.

And that brings us to the home stretch: let's talk about gathering metrics from some system calls, in this case statfs on Linux. statfs is used to get filesystem statistics: given some mount point, the function statfs returns information about a mounted filesystem. To start, we want to build a high-level, OS-agnostic API, because that's the friendly thing to do in Go. We create this FileSystem structure: it has a path, or mount point; a type, which is an enumeration of filesystem types like ext4 or NFS or XFS; and the number of files that reside within that filesystem. We create this top-level function Get, with a capital G, and this retrieves stats for the filesystem mounted at path. Within Get, we're going to call get, with a lowercase g, and this is going to be our operating-system-specific implementation of Get. So if you want to make use of system calls in Go, you must
always guard them with build tags; your code won't compile on other platforms if you're using system calls unless you're using build tags appropriately, and this echoes one of Rob Pike's Go proverbs. Let's talk about the Linux implementation first. We have our file statfs_linux.go, and we have this Linux build tag at the top. You don't necessarily need the build tag, because the file name suffix is enough, but I like to add it just because it's more explicit. And instead of using the standard library syscall package, we're going to use golang.org/x/sys/unix; this is effectively the modern replacement for syscall, because syscall had to be frozen after a certain point in time.

So get, once again with a lowercase g, is going to be our Linux implementation of that function, and it accepts a path. We create this unix.Statfs_t structure, we call out to the unix.Statfs system call, provide it some mount point, and pass a pointer to the structure, and as a result the kernel is able to populate that structure. So when our function call returns successfully, we have a structure with all the information that we need. Finally, we unpack it into the OS-agnostic structure. Remember, system calls can't cross the boundary outside of where your build tag is, so we unpack the structure here, within our Linux file, to make sure that it's friendly for all operating systems.

Next up, we're going to talk about code for platforms that are not Linux. We have this file statfs_others.go, and at the top there's a build tag that says: for every operating system that is not Linux, build this code. Here we just leave the function get unimplemented. Perhaps there are similar APIs for other platforms, but for the time being, leaving it unimplemented is fine, and I definitely like to return an error that says something like "not implemented on your current platform", such as Windows, Darwin, etc.
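The OS-agnostic API shape can be sketched in one file. Note the deliberate simplification: in the real package the fallback below lives in statfs_others.go behind a `//go:build !linux` tag, while statfs_linux.go implements get with unix.Statfs from golang.org/x/sys/unix; this single-file sketch includes only the fallback so it compiles on every platform.

```go
package main

import (
	"fmt"
	"runtime"
)

// FileSystem is the OS-agnostic result type: callers never see raw
// syscall structures or pointers.
type FileSystem struct {
	Path  string // mount point
	Type  string // e.g. "ext4", "xfs", "nfs"
	Files uint64 // number of files in the filesystem
}

// Get retrieves statistics for the filesystem mounted at path by
// delegating to a platform-specific get implementation.
func Get(path string) (*FileSystem, error) {
	return get(path)
}

// get is the non-Linux fallback: rather than failing to compile, it
// returns a clear "not implemented" error at runtime. On Linux, a
// build-tagged sibling file would call unix.Statfs and unpack the
// kernel-populated unix.Statfs_t into FileSystem.
func get(path string) (*FileSystem, error) {
	return nil, fmt.Errorf("statfs: not implemented on %s/%s", runtime.GOOS, runtime.GOARCH)
}

func main() {
	if _, err := Get("/"); err != nil {
		fmt.Println(err)
	}
}
```

Splitting the code this way keeps the exported surface identical on every platform, so callers compile everywhere and only the behavior differs.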
Next, let's build an example program and give this a shot, just to make sure it all works. We create our program, we get the first argument, which is going to be our mount point path, and we pass it to our statfs Get function; this should compile successfully on all platforms, even if it doesn't run everywhere. Then we print the information to the screen, and as you'll see, it seems to work just fine on Linux. Of course, you can implement more platforms if you're able, but for the time being, Linux-only is OK. If you want to export metrics for this, well, I'm a little shorter on time than I might have thought, so this is going to be an exercise for the reader.

If you want to gather metrics with system calls in Go, here are my recommendations. Build a high-level, easy-to-use API: your caller should not have to care about system calls or pointers or any of those kinds of unsafe things, so make it as easy to use as possible and take away as much of the complexity as possible from your caller. Always use build tags with system calls: if you don't, your code will not compile on all platforms, and Go is a great cross-platform language, so we should keep it that way. And also, be really careful about using elevated privileges with system calls: it's pretty easy to do something like, I don't know, overwrite your main disk or reboot your machine or something nasty, so just be cautious about using root.

So, in conclusion, if you're going to build Prometheus exporters in Go, it mostly comes down to a set of typical software best practices. Avoid global and package-level state as much as possible, and pass your dependencies explicitly as parameters; it's going to make your life much easier in the long run. It makes your code easier to read, easier for your co-workers to review, and easier to maintain in the future. Focus on creating simple and reusable package APIs when building exporters: like I said, you don't need to mix the details of exporting metrics with crawling a file system, some binary format, or some network
protocol. Separate these things, create nice reusable packages, and then import them into your exporter. And finally, read up on the Prometheus metrics best practices on prometheus.io; these are really great. There's a lot to learn, but apply them judiciously in your exporters and you'll be better off in the long run.

And finally, I have a special, unrelated announcement: DigitalOcean is working on a Kubernetes product, so if you're interested, there are some folks in here who work on that; otherwise, come see us at our booth. And that's it. Thank you very much for your time, and I look forward to answering your questions.

All right, well, I'm not sure where we're at on time right now, but I'd be happy to take any questions if anybody has them; otherwise, just come find me in the hall, that's just fine too.

[Audience question] OK, sure. I'm sorry, I think I missed some of that; could you just speak up a little bit? Yes. Yeah, so the question was: if you have a bunch of metrics in a push format, how can you best convert that to a pull format? To start, I believe there are lots of Prometheus exporter adapters, essentially, for things like StatsD, Graphite, etc., so if you wanted a stopgap solution, you could push to those exporters instead, and then those export native Prometheus metrics. But if you want to actually swap it out, it kind of depends on your application; you might have to go through, and, I guess, I don't know which languages you're using in your environment, but, yeah, sure, it's kind of tough. My recommendation would be to definitely look at the adapter exporters, things like the StatsD exporter; there are several under the Prometheus organization, so check those out, and those will probably give you a good start.

Any more questions? All right, cool, thank you very much. Oh, go on, sorry. [Music] Yeah, sure, so the question is: how easy is it to deploy Prometheus in a Kubernetes environment, and if so, do you have, like, YAML files prepared for deploying
Prometheus in a Kubernetes environment? And truth be told, I don't deploy Prometheus on Kubernetes. I have co-workers here who do, and I'm sure there are folks in the room. I run this at my house using systemd and Ansible; I'm bare-metal native, right? So unfortunately I'm not the best person to answer, but I'm sure there are folks here who would be happy to help. Any more questions? Cool. Thank you very much for attending my talk, and I hope it was entertaining. [Applause]
Info
Channel: CNCF [Cloud Native Computing Foundation]
Views: 27,369
Rating: 4.8125 out of 5
Id: Zk09Mbu0YQk
Length: 30min 50sec (1850 seconds)
Published: Fri May 04 2018