Are You Testing Your Observability? Patterns for Instrumenting Your Go Services | Kemal & Bartłomiej

Captions
Hello everyone, we are extremely excited to be here at the GoDays conference — it's actually my first one ever, and so far it looks great. We're also excited to speak about topics we really care about and love: observability and programming in Go. We hope our talk will be inspiring and actionable for you, because by the end of it we would like you to know three things. First, why instrumenting a backend Go application with actionable metrics is essential. Second, how to add metrics quickly in Go and how to test them properly. And last but not least, what common mistakes you should avoid — mistakes we've seen a lot during our work with metrics in Go, in the amazing but sometimes wild open-source world.

Before that, a short introduction. My name is Bartłomiej Płotka; I'm an engineer working at Red Hat on the monitoring team. I love open source and solving problems using Go. I'm part of the Prometheus team, and I'm also a co-author of the Thanos project, a durable metric system that is meant to scale Prometheus. With me we also have Kemal: "Hello everyone, my name is Kemal. I also work for Red Hat, on the OpenShift Observability team, and I contribute to Thanos as well." Our job is mainly focused on building scalable observability solutions and platforms for OpenShift, but a major part of it is also maintaining the Prometheus and Thanos projects on a daily basis. Those projects are focused on enabling monitoring via metrics for infrastructure and server-side applications — for example, microservices running in Go.

But let's leave that for now and focus on a very fun task: building a load balancer. For demo purposes, imagine we want to implement an application-level HTTP load balancer in Go. Why? Because why not — it's fun.
So let's imagine we have implemented the load balancer as presented in this diagram. From a high-level perspective we have a couple of Go components. First, a single HTTP server that implements the ServeHTTP method — the handler — via the ReverseProxy struct from the standard net/http/httputil package. ReverseProxy allows us to inject a custom transport, anything implementing the http.RoundTripper interface, and that's where we inject our load-balancing RoundTripper, which we call transport. Internally it uses a few components: a Discoverer, which discovers the targets we should proxy requests to, and a round-robin Picker, which rotates over those targets and picks one in a fair, round-robin manner — replica one, two, three, one, two, three — as you can see below. At the end the RoundTripper takes the chosen target, forwards the request to the corresponding replica, and proxies the response back to the user.

Now, this looks great — everything should work, right? But is it production ready? That's the question. Let's say we deploy a couple of replicas of this service in production, in front of some microservice replicas. As soon as it starts running, I manually hit the LB endpoint and it works — it gives me the response I expect. Very good, right? We can go home. Well, not necessarily. It works for me, but are we sure it works for other users? Are we sure it works when I'm not checking? How many Bad Gateway errors are we actually returning over time? We can't really tell. What about discovery — maybe DNS wasn't working for a portion of time; how many replicas does it discover now, and how many did it discover yesterday at 5 p.m.? And what about the picking mechanism — maybe it's not really picking in round-robin fashion. Is a third of all requests really going to replica two? What's the actual distribution, and what was it over time?
And what if a user reports our load balancer being slow? Is it our replicas being slow, or is the load-balancing logic itself introducing latency? Finally, how much memory are we actually using — maybe we have a memory leak — and what version are we running? There are so many questions we need to answer, and we don't know which of them we'll need until an incident happens. That's why, when you read the SRE book, you will find monitoring at the foundation of building any reliable system, before even implementing the system itself.

As you might know, there are a few monitoring signals we can introduce, especially in Go: traces, logs, and metrics. Guess which signal will answer questions like the distribution of requests or tail latency? Yep, metrics. Metrics will most likely give us the answer — an answer that, in comparison to logs and traces, may be cheaper, is near real time, and is definitely actionable: we can create an alert that triggers another action based on behavior that just happened. Why Prometheus, though? Well, I might be biased, but Prometheus is currently the simplest and cheapest option for collecting, storing, and querying metrics, as well as for reliable alerting. It is part of the CNCF; it fits small applications but also bigger ones, with the help of CNCF projects like Thanos or Cortex.

So let's figure out how to add a metric to our load balancer — how to answer, for example: what is the error rate across all requests that hit the load balancer? We can do that by adding a metric, an http_requests_total counter, and counting those requests: whenever the load balancer handles a request, we increment it, recording the method and the code that was returned for that request. We can introduce this metric really fast, in a few simple steps. First of all, we need to import the Go client — the Prometheus client library, client_golang.
During our talk, by the way, we'll mainly be talking about this library. It's a really neat Go module designed to help with instrumenting any Go application: tiny, heavily optimized, quite simple, and very popular — it's the official one. As the next step we need to define a variable, so let's do that: let's name our counter http_requests_total, put in some description, and add the labels code and method. Those labels are essentially dimensions for our metric; each unique combination of label values will create another series in Prometheus.

The next step is to actually count requests, so let's create a simple wrapper over ServeHTTP which, once the wrapped handler has responded with some code — we need a little trickery here to record the status — increments our counter. Something that is easy to forget is the step after that: registering the metric in a registry, a Prometheus registry. That's important for the final step, which is adding another handler to our server: we have the LB endpoint, and we also need an endpoint for metrics. The Prometheus library provides an HTTP handler for that, and it's a good pattern to serve it on the /metrics path.

So what have we achieved? We defined and instrumented a metric, and under /metrics we now have a handler that exposes it in the Prometheus text exposition format, which looks like this — literally human readable. As you can see, after many requests our counters have some values; here we have only two different response codes, 500 and 200.
What happens next is that you run a separate Prometheus binary and point it at your load balancer's /metrics page. Prometheus will periodically scrape — collect — those metrics at a configured interval, for example every 15 seconds, and store the samples. Then, from the Prometheus UI, you can query that data over time. What you see here is essentially the increase() function over our counter, which visualizes the number of requests per minute our load balancer served. Right now we have 120 per minute; some of them were errors, most were successes.

Cool — this looks easy, right? And in most cases it is fine. However, during this talk we'd like to present what we learned over a couple of years of developing and instrumenting Go code that is meant to run in production, in both closed and open source — and sometimes it's tricky. Together with Kemal we'll go through a few more or less advanced issues we've seen and how to resolve them, because we really want you to be able to avoid them while instrumenting your applications.

First one: globals. There is a saying in Peter Bourgon's amazing blog post on the theory of modern Go: magic is bad; global state is magic. This is very true, also in the case of the Prometheus client — especially if you are instrumenting some Go package with metrics and then your project imports this package, but also open-source users could import it and use your library, with its metrics, as well. The library allows you to use globals for a certain simplicity, and somehow this usage leaked everywhere and everyone just uses those globals. We want to break with this — we want to make this pattern obsolete, and I will show you why it is a really bad one.
Let's see where that magic is and how to fix it, using our load balancer in Go as the example. Take our http_requests_total metric: we have two pieces of global state here. First, a package-level variable where we store our counter. Second, a piece of global state hidden under prometheus.MustRegister, because it uses the global default registry — global library state.

Now, what's the issue? The first problem is the magic. Let's say I added my server requests total variable and registered it, and then some other package that we import — or even a dependency of a dependency, you name it — registers a metric with exactly the same name. Our registration will panic, and the main problem is that the panic only mentions where this second registration happened, because that's where the stack trace is; it does not mention where the first registration lives. You have to look who-knows-where — it's ambiguous, and we've seen such issues a lot when global state is used. So please don't use globals.

There is a second issue as well: lack of flexibility. Imagine we don't have one endpoint but three: /1, /2, /3. What happens if I instrument those endpoints like this? We have only two labels, code and method, so we are aggregating all requests no matter which endpoint they hit. I have no answer to the question "what's the error rate for the /3 endpoint?", because everything is lumped together — we are losing insight, and there is no way to get it back when the library uses globals.

So let's fix it — by removing the globals. First we replace the global variable: we create a struct you can instantiate, so you can have many of them, and a constructor that creates the metric with a given name and labels. Now you can create such an object and use it everywhere, which is easy.
But there is still one piece of global state that we didn't fix, so let's do it: let's remove the default registry. We do that by injecting a custom registry — a Registerer, sorry — on which the constructor registers our metric. So we create the registry, then create the new server metrics, and we are good. Are we, though? Say I want three different http_requests_total metrics for the three different endpoints. If I do this, it will panic as well, because again we are registering exactly the same name, as you remember. But now, in contrast to the version where everything was global, it is fixable: we are in control of our registry and we know what we are registering, because it's explicit. So I can go ahead and use a wrapper available in the Prometheus library, WrapRegistererWith, and inject a label to make each registration unique. Now it registers just fine: I have /1, /2, /3, which in the end gives me grouping by endpoint, so I can finally answer my question — what was the error rate for the /3 endpoint? It was 40 errors so far.

This, in our opinion, is how you should build instrumentation: any abstraction that uses metrics should consider allowing the caller to inject the metric and to inject the Registerer — a custom registry — and should never rely on magic inside your code.
OK, second pitfall, as important as the title of the talk suggests: no tests. This is something I'm really passionate about. Metrics — and other observability signals like traces, logs, and profiles — are rarely tested, right? Come on, who asserts in their unit tests on which log message was actually produced, or whether the log was produced at all? We don't do that, because logs are usually informational, meant for humans; a computer doesn't act on them, so there's maybe no point. With metrics, in my opinion, it's a totally different story, and I'm super serious here: metrics have to be tested. Let me explain why.

Take our http_requests_total example again, and our load-balancing logic — exactly the same as in the diagram. We have a RoundTripper which first asks for targets, passes those targets to the Picker to choose one, and forwards the request to the chosen target via a proper transport, for example http.DefaultTransport. At the end we mark which address was chosen in a header, just for testing purposes. Then, in our server endpoint wrapper, the reverse proxy is invoked, we record the status, and we increment our metric. That's the logic of our load balancer.

Now imagine we introduce a bug: while changing something, we end up with a buggy recorder that always records code 200. No matter whether our load balancer — or the backend behind it — returns 502 or 404, the metric will always show 200. Let's look at the consequences, and first of all at whether the existing tests catch this. We obviously have tests for the load balancing, so let's go through them. We have test cases: we mock certain targets, and we mock the response from the backend — as you can see, we trigger an error and then expect the whole thing, the load balancer, to return code 502, Bad Gateway. Normally you'd have more cases, but let's focus on this one.
In the test, for each of those cases, we invoke the whole load balancer with a custom request, record what it returns, and make sure it returns what we expected. It's a table test — a really good pattern, by the way; I recommend it. Now, the question: does it catch our bug? Well, no. In this case, when we return 502, http_requests_total is incremented for code 200, meaning the metric suggests there was no error. This is very critical: imagine you have an alert on the number of errors, or you are looking at a beautiful graph checking for errors, and you see all successes — although there are many, many errors. The errors are properly handled, the test succeeds, the load balancing works, but our monitoring is broken, and there is no way to even learn about that in testing or CI. This is really bad, so let's fix it: let's add tests for our metrics.

There is a package for this inside the Prometheus Go client library — there was a clash with other test utilities, so it's named testutil. We add a custom registry and our new metrics, which are effectively mocked, and then, before we make any request to our load balancer, we make sure no metrics have been incremented yet. We do that by checking cardinality — how many series this metric would expose. There should be none, and that is what we assert with the first function we want to introduce here: CollectAndCount.
Now — and this is huge — let's add an expectation to our test case: we expect the metric with code 502 to be incremented to one, and nothing else to be incremented. Let's assert that this actually happens: after the test case has run and the load balancer has performed the request and we have the result, we use ToFloat64 to extract the value of the metric for certain labels — 502, which we expect — and check that it equals one. And after everything, we also check that the cardinality is exactly one, again with CollectAndCount, because we expect only one series. Without this line you could have other codes — other series — incremented that you don't expect. This is how you assert on metrics.

What happens now is that this test fails because of our bug — we finally caught it. This is super important; please do it, maybe not for all your metrics, but at least for the important ones. It's also super useful to see, while asserting on those metrics, how you would actually debug your application if you had only metrics: it shows you the behavior and logic behind them, and whether they are even useful or not. That's amazing, and we follow this in our projects, like Thanos.
OK, another pitfall: lack of consistency. What do I mean by that? There are useful methods for creating metrics: in the SRE book you have the Four Golden Signals, and there are the USE method and the RED method. These methods specify from which aspects you should monitor your application using metrics, and they bring two advantages: first, they help you not to forget any aspect; second, you can use common tooling — dashboards, alerts, and recording rules. There is even the monitoring mixins project: if your application follows such a method, you can reuse the same dashboards, for example.

Let's focus on one method, RED. R stands for rate — requests per second, so what load my load balancer is handling; E means errors — how many errors per second I have; and D is duration — what the tail latency of my application looks like. These three aspects should be sufficient for understanding your application. So, does our load balancer satisfy the RED method? R is fine, because http_requests_total gives us the number of requests. E — we definitely have errors, because the code label can hold 500s as well as successes. What about D? We don't have timing of the requests, so let's add it. That's pretty easy: we just need a histogram, which lets us, again in our ServeHTTP wrapper, observe how long each request took. It doesn't store the exact observed values; it puts each observation into one of the buckets you define — if a request took one second, it lands in the bucket whose bound it is below, for example the one-second bucket rather than the three-second one. Why buckets? Kemal will explain in a second. With this we have latency, we satisfy the RED method, and we can use the common tooling. And with that I'll pass the mic to Kemal for further pitfalls.
All right, so we have yet another pitfall, and of course it's about naming, because naming things is hard. Sticking to the naming convention should be easy enough, though — there is official documentation in Prometheus; you just need to follow it — yet this is one of the most common mistakes we see in the community. And as Bartek mentioned, by sticking to the convention you also enable projects like the monitoring mixins. I'd like to emphasize a couple of points from that documentation. First, your metric names should carry suffixes describing their base units — and the base-unit part matters, because you want consistency across all the metrics in your system. Second, you need the _total suffix on all your counters; this is going into the OpenMetrics standard, where it will be mandatory, so please respect it. And last but not least, the _info suffix: put it on your metadata metrics — for example a metric exposing the version of your running service.

Yet another issue with naming is stability. Here I declare a metric — the http_requests_total from the previous slides, where we track the number of requests in the system — and we build an alert on it, as Bartek showed. Then, for whatever reason — some regulation or whatnot — we decide to change the metric name and put "protocol" in it, and now we have broken our alert. Please try to avoid this; it's one of the most common mistakes. If you stick to the conventions, you won't break your graphs and especially not your alerts — alerts are proactive, and you depend on them when you run your application in production.
Another popular topic is cardinality. When you talk about Prometheus and performance, it always comes down to cardinality — and cardinality, in the Prometheus context, is the number of time series you expose. You need to be careful especially about your labels, because each label value you add to your metric creates another series, incrementing your cardinality. An example: take http_requests_total again, with an HTTP method label on it. If we have just GET and POST, that's two time series — cardinality is two; add another method and it's three.

Now let's see a bad example of how to introduce unbounded labels and blow up the cardinality. Again we have a metric, and we register it — this one tracks connection failures. Say we add an "error reason" label, and whenever we have a connection error we just put the error string in there and increment the metric. What could go wrong, right? Well, when Prometheus scrapes our metrics, this is what we get: a lot of gibberish — arbitrary error messages as label values. This is not what we want from Prometheus; it's an aggregating, metrics-based system, not an event log. This just inflates your cardinality, can easily break your system, and there's no value in it.

So how do we fix it? Let me introduce one of the tactics we use in our own systems: define some constant labels, then create a helper method that maps errors to those constants. In such an example the important part is this: you take an error, return a constant label, and put that in your labels — so you control everything in one place.
Let's move on to another pitfall about cardinality: histograms. When you are dealing with histograms you need to be extra careful about cardinality. Histograms are more complex metric types compared to counters and gauges, but underneath they are just a bunch of labeled counters — by default, if I remember correctly, around ten or twelve series each. They are very useful because they give you a distribution of sampled observations — you typically use them for request duration or request size — but since one histogram already exposes that many series, we need to watch the cardinality.

Let's see an example. For simplicity I define only six buckets, and we still use the HTTP request duration. We deploy our system, check our graphs a bit later, and realize our average latency is too high and we need more buckets to see what's going on. What could go wrong — just add a couple of buckets. Then you deploy again and realize you need more; OK, just add three more, what could go wrong? Everything looks nice, everything is fine — except the series keep growing and growing, because with histograms you have a multiplier: each label value you add multiplies the dozen-or-so series per histogram. So especially when using histograms, watch your cardinality, and don't just keep adding buckets like this — this is not the way. Adjusting buckets is an art form, more art than science, so let's jump into that topic next.
So, one of the other pitfalls is poorly chosen histogram buckets. Histograms are used for observing request durations, latencies, or request sizes, as I told you before, and they are cumulative observations: they have this concept of buckets because, rather than keeping all the raw data, you collect observations into buckets — that's why they are cheap. But being cheap via buckets means you need to watch your bucket layout, for the sake of your performance and of any incident you might need to debug. Let's look at an example: six buckets, the same metric, and the actual data for it. Since Prometheus uses cumulative histograms, each bucket includes the preceding buckets as well. From this data you can tell that we have really, really low-latency requests and they are all collected at the lower end of our layout. How can we fix this? Again, client_golang is our friend: we can use the LinearBuckets helper, and with it we get a more or less better distribution. But the most important part of this slide is: know your distribution. When coming up with your layout you need a way to distribute the buckets correctly, and for that I suggest you come up with your service level objectives, decide what you want to watch, and lay out your buckets accordingly.

Our last pitfall is not initializing metrics — one of the most common, and I want to show you how it goes. Again we have a metric, the one tracking failed connections; we define it, and we add to it while the application is running. Everything seems fine — we have our metric — but is it really? When we check the increase of that metric, we don't see anything; it's just zero, and we don't catch the change, because for Prometheus there is no way to tell whether a metric was just initialized or whether previous scrapes simply missed it. To actually tell Prometheus that everything is fine, you need to initialize your metrics so it can catch all those changes — and in most cases your alerts depend on the rate() or increase() functions, so this is important. How do we fix it? Again we use a constant label and call WithLabelValues, which initializes the series at zero; then, when the failure actually occurs, we see the peak in increase() and our alerts work.
To sum things up — I need to hurry a bit, I'm out of time — if you take three points from this presentation, please take these. First, monitoring is essential: please monitor your apps, instrument them with whatever you have — metrics, maybe traces, maybe logs — because you need to know what's going on in the running system; then you can build alerts, graphs, and any other facilities that help you introspect it. Second, since you build those alerts and graphs on top of it, please test your instrumentation, because you rely on it in production. And last but not least — a low-hanging fruit — don't use the global registry; avoid globals. As Bartek said, magic is bad, so don't use it. We have a demo — an actual running load balancer you can run and observe — so if you want to dig in more, go ahead and check it out. That's it from us. Thank you.
Info
Channel: GoDays
Views: 316
Keywords: #GoDays20, talk, goconference, golang, Berlin
Id: LU6D5cNeHks
Length: 38min 30sec (2310 seconds)
Published: Fri Feb 07 2020