How Prometheus Revolutionized Monitoring at SoundCloud • Björn Rabenstein • GOTO 2017

Captions
Thank you. It's called Production Engineering, but we are always joking about it because Product Engineering is the other thing, and it's always very hard to keep those straight. So I'm all in infrastructure at SoundCloud. I hope you know what SoundCloud is: put sounds into the cloud. Sadly I am not so close to the music side; I'm just infrastructure. I did a lot of stuff with Prometheus, and I hope you already know what Prometheus might be — maybe you only know what SoundCloud is because you know about Prometheus. I call SoundCloud the cradle of Prometheus, because we did most of the initial development there.

"Four years in" — if you have counted, you might notice that the big announcement of Prometheus was in 2015, later we reached 1.0, and right now, like earlier this month, we released the glorious version 2.0. So people talk about it as if it were one and a half years after 1.0, and now you see "four years". That's because Prometheus wasn't strictly started at SoundCloud; it was started as a pet project by Matt and Julius, the two founders of the project. But SoundCloud was the company where it got its initial development and its initial use, and it was always open. Some people think it was developed in secret — that's not true, it was always in the same GitHub org where you can find it today. Of course it took a while to be ready for production; let's say in 2013 we used it for the first time for real things, but it wasn't really ready yet to throw at you all, so that happened a bit later.

Now Prometheus is well known enough that I didn't want to just give the next introduction talk, because I guess 20 percent of you are already using it or know a bit about it, and that would be boring. On the other hand, this is not a dedicated monitoring conference, or even PromCon, a conference that is just about Prometheus, so giving something super advanced would probably not help many of you either. So I thought: let's look back at how those four years have changed our way of monitoring. The hope is that you will get some inspiration for how you can do things in your own company or organization; you will see some resemblance to problems you run into, and then you might get an idea whether something like Prometheus is for you. I'll also talk about other things, because it's not only Prometheus.

So let's start. SoundCloud is about ten years old by now — pretty precisely ten years, I guess. Level zero: no monitoring. That's what many people do, right? It's sad, but sometimes you just don't have the time. It's usually a bad thing to not have any monitoring, but I guess that's the default. And even if you don't do monitoring, you are still being monitored — by your users or your customers — and they will complain. If you're lucky they complain to you, but most of the time they use this little bird and tweet to the world that they are unhappy with your service. So Twitter is actually a pretty good monitoring tool, in a way. I know companies that have automated scripts: if certain phrases show up on Twitter suggesting their site is down, people get paged. We don't have that, but we have this little section in our post-mortem form. We do this in every post-mortem: we ask our community team, have users tweeted about it, have they written emails to you? Sometimes, when we know there's an outage, they put a page up on our help system, and then they check how many users have visited that page. This is really good because you get real-world feedback.
Every post-mortem process for something that is user-facing should have this section. You might learn that this huge outage all your monitoring systems were telling you about didn't have much user impact — and that's good to know, right? It might mean your monitoring system is oversensitive, or it might mean your graceful degradation is really good; without that feedback you can't make an informed call. But the opposite also happens all the time: you haven't noticed anything internally, but the users were complaining for weeks. So all of this is very good for calibration.

Okay, but let's talk about systems monitoring, not just users tweeting at us. Level one, obviously: Nagios. I guess most of you went through that. Let's say seven years ago, more or less, if you wanted open-source monitoring, that was the gold standard. There's Icinga now, and there are other systems that are more modern, but they're essentially iterations on Nagios; I just put Icinga here as the placeholder for all of these. In practice we are using Icinga at SoundCloud — we are still using it. It's so hard to get rid of legacy systems once they are established; sometimes it's also not worth it to get rid of them, but that's a different story.

So this is what happens first; it looks like this. This is Icinga — actually a pretty recent screenshot, to prove that we're still using it — and the first thing that comes to mind: this is all about hosts. You look at a specific host, and Nagios pings that host. If the host pings, great, then it runs other checks. There are loads of plugins; Nagios is very versatile. It has this NRPE thing, "Nagios Remote Plugin Executor" or something — I don't even know exactly what the acronym means — but that's pretty awesome, because you can send something to the host, and on the host it can run things. That's very flexible; you can do almost anything. You see we check for StatsD, which is another monitoring system, and we check for NFS — yes, we are still using NFS, so things are scary here. But this is pretty reasonable; originally there were way more checks.

The problem we had with that — I mean, there are many problems, but the most obvious one — is that this is all host-oriented. At some point you are talking about distributed systems, so you have literally thousands of hosts and more. Some of them are always down, and that's not an alerting condition, that's normal life: you have to deal with them, but you don't want to wake somebody up. A service that is distributed is designed to tolerate that a host is down at any time. So it doesn't feel right, and in practice it creates a lot of problems. Now, Nagios has developed a lot, it's a very mature system, so there are things like cluster checks, which are exactly about clustered, distributed services, and we were using those as well. But there's also this interesting thing: what's the host here? The host is called "go-link redirector", which is an internal service — it doesn't matter what it does, I just picked it because it has "go" in the name, like the conference. This is backed by many instances that run on some cluster (we'll talk about that later), and the host is gray, status pending, so there's no host check — but there's a service on it. That's what you do in Nagios if you have something that is actually not bound to a host: you create a pseudo-host, and then you can still check it.
You see, this doesn't really feel right. But there are some things that people think are wrong about Nagios that are actually not that bad, so let's quickly talk about that. One is that many people think this whole idea of the monitoring server going to the monitored target and retrieving some information is bad, fundamentally flawed, doesn't scale. Now, if you know anything about Prometheus, you know that Prometheus is doing exactly the same thing. Of course it retrieves different things and it's all very different, but this pull-based monitoring is exactly what Prometheus does, and when we created it we were actually thinking this is the only way you can do it. There was a bit of a religious struggle about it; in practice — and I think the monitoring community has settled down a bit on this — it doesn't matter so much. You have to get it right, of course, but push versus pull (you can read a lot about that on the internet) is not the most important question here. So if you say "we cannot use Nagios because it pulls": first, that's wrong, and second, you couldn't use Prometheus either then.

The other thing is this whole story about black-box versus white-box monitoring. You might have heard that modern monitoring must be white-box; we'll talk about that later. Now, Nagios is kind of in the black-box monitoring domain, but I think that's not by concept, it's more a cultural thing. Traditionally, monitoring was the ops domain: you had the ops folks in the company and you had the dev folks, and the dev folks would throw something over the fence, and what they throw over the fence is, from the ops perspective, a black box — "here's our product, it runs, please monitor it." So you cannot do anything else but send probes to it and try to find out whether it behaves as expected. White-box monitoring, if you don't know the term, is the kind of monitoring where you open the box, look into it, and check what it is actually doing, which usually requires the box to cooperate: you instrument your code to provide metrics, or you have stats tables in your database you can query, things like that. And that involves the devs — the developers have to be concerned about monitoring — which is a weird thing in a traditional separation of ops and dev, but in the DevOps world it's actually very natural. SoundCloud had this pretty radical "you build it, you run it" approach, which became more and more elaborate. I usually sell it as: we are doing true DevOps, because we only have people who do both. We don't have ops and dev, and it's not just about getting them to get along more nicely with each other — we really don't have dedicated ops teams. But that's a different story. So Nagios is kind of black-box by culture, but this whole NRPE thing could totally do white-box monitoring if you wanted it to. That's the one thing — but also, perhaps black-box monitoring is not that bad after all.

Now, every single one of my talks mentions this book. I'm a fan — I mean, every single talk ever since this book came out, and by now it's freely available on the internet. If you haven't heard about it, if you haven't read it, and you run anything medium to large scale, you need to read it. Of course I'm biased: Niall Murphy was my last boss at Google. So I have to admit I didn't work for SoundCloud forever; I was at this different company, Google, before, as a Site Reliability Engineer — yes, you said that in the introduction, you spoiled my surprise. Anyway, this book describes my old job, and I know most of the people who wrote it, so of course I'm biased, but this is true gold.
It's not that you should just copy Google, because "you are not Google", to quote a popular blog post from earlier this year. That's true, you cannot just copy Google — believe me, we tried. SoundCloud had a bunch of former Googlers, and they thought: now we are getting bigger and more complex, let's just do SRE like Google does. That didn't work. But you can still learn from Google and their vast experience with technology that we are only now starting to use. So definitely look into it; they have a lot of stuff about monitoring, and you will learn something, and that is symptom-based alerting. That's the Kool-Aid.

Traditionally you have probably never thought about symptoms and causes and what to alert on, and this is mostly because in a traditional world symptoms and causes are more or less the same. If I have an Apache server, a LAMP stack or whatever — let's really go back to the '90s — and the server doesn't ping anymore, okay, that's what Nagios alerts me on, and the server being down is the possible cause of an outage. But since it's your one server that serves your site, your site is also down; it's all the same. Nowadays there are so many things that could go wrong in your distributed, complex, resilient, self-healing, all-the-buzzwords system. If the database has higher latency, or a replica is down, perhaps that's a problem, perhaps not — should I wake somebody up? Causes are something you want to be informed about during work hours, to make sure your system stays in a sane state, but they are definitely not something you want to wake people up for, because then you would wake people up all the time. On the other hand, your complex system has so many possible causes that you cannot even anticipate what might actually cause a problem for your users or customers. So what do you page people on? This is what I call pages versus tickets: you page people on symptoms, on actually occurring or imminent problems. If you have an SLO, or even an SLA — a contractual requirement with your customers, "I serve 99% of my queries successfully within a hundred milliseconds" — that's great, because with a proper monitoring system you can alert on exactly that. You can wake somebody up and tell them: listen, we are not fulfilling our contractual requirement, wake up, fix it. That's a great thing: it catches all the possible causes, and it doesn't wake you up for potential causes that might not be problematic at the moment. So that's the Kool-Aid. And if you think about it, that's what black-box monitoring is about, right? You send probes from the outside to your system, and if you get the expected reply, great: everything is up, users are happy, customers are happy. So black-box monitoring is actually pretty good for this super-cool modern thing, symptom-based alerting.
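Just to make the idea concrete: a minimal sketch of what such a symptom-based SLO alert condition could look like in Prometheus terms (which we get to later in the talk). The metric names are illustrative — a standard request-duration histogram — not SoundCloud's actual metrics.

```promql
# Fire when less than 99% of requests over the last 5 minutes
# completed within 100ms (le="0.1" is the 100ms histogram bucket).
  sum(rate(http_request_duration_seconds_bucket{le="0.1"}[5m]))
/
  sum(rate(http_request_duration_seconds_count[5m]))
< 0.99
```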
So, is black-box monitoring the thing we want? Sometimes my talks are like a Bible reading club, because now I open a quote from the Blue Book — oh, low contrast here, I'll try to read it. Let's read this and think about what Google is doing; you are not copying them, but you want to learn from them. Google is doing the following: "We combine heavy use of white-box monitoring with modest but critical uses of black-box monitoring. The simplest way to think about black-box monitoring versus white-box monitoring is that black-box monitoring is symptom-oriented and represents active — not predicted — problems; good for paging. Black-box monitoring has the key benefit of forcing discipline to only nag a human when a problem is both already ongoing and contributing to real symptoms. On the other hand, for not-yet-occurring but imminent problems, black-box monitoring is fairly useless." And this is the big downer, right? If you want to alert on something that will for sure happen if nobody intervenes, but that hasn't happened yet — that's the best case — that is difficult with black-box monitoring. If you read on in the book there are more points; this is my little selection, and perhaps a bit my personal take on it.

So — we went up and down this roller coaster: black-box is old-school, then you realize it's actually pretty good for modern alerting — why is it still not sufficient? The first thing: your probe is not real user traffic. At SoundCloud (we'll talk about that later) we do a lot of black-box probing ourselves to measure our availability. We play a reference track; it's boring, it's probably just noise or a beep or whatever. Now imagine some track goes viral on SoundCloud: millions of users want to listen to that track, and the S3 bucket where that track is located gets overloaded. I'm not saying it actually works like that, but you could imagine it: that track becomes inaccessible, millions of people in the world are unhappy, but our boring reference track plays just fine, and our black-box monitoring tells me we're up. So this is a real issue, because real user traffic might look different.

Another thing is long-tail latency. Probably all of you have heard one of those very excited talks where people suddenly realize what large shops like Google have known for a long time: long-tail latency in a complex, distributed, microservice, multi-tier, whatever system is really important. You can design your system to avoid hitting the long tail, but then you're already thinking about it, so you definitely want to know about your long-tail latency. That means it's not sufficient to know median or average latency; you really care about the 90th percentile, the 99th percentile, perhaps even the 99.9th — that's really important. Now do the math: how many probes do I need to find out my 99.9th percentile? I need a thousand probes, and the slowest one is it — with a lot of statistical uncertainty. How often can you probe, every minute, every five minutes? A thousand probes takes an eternity until I know my 99.9th percentile, if probing is all I do. So that's where I want to open the box and ask the binary: what do you think your latency is? The binary might lie to me, but I get every single request accounted for. I can also do fancy things: in a multi-tier system I can ask the front end, what do you think the latency of your back end is? There you won't get a lie — the front end will tell you, yes, this back end serves everything within a hundred milliseconds, or every hundredth request takes a second. That is still white-box monitoring, because you have to inspect one of those tiers, but it's kind of observing the real user traffic, which is also pretty good. You only get this with white-box monitoring.
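As a rough idea of why the instrumented approach wins here: with a latency histogram that every real request passes through, the long-tail percentile is one query away instead of thousands of probes away. A minimal PromQL sketch, again with an illustrative metric name:

```promql
# 99.9th percentile latency over the last 5 minutes, computed from
# a request-duration histogram that accounts for every single request.
histogram_quantile(0.999,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```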
And then there's the biggest thing: you still need to investigate causes. If you get a nice symptom-based alert — whether based on black-box or white-box monitoring — it tells you something is broken and the users are unhappy, and your system is super complex. Okay, let's fix it — but what is actually going wrong? That's really hard, and this is where you need more than just alerting: you need some means to investigate what's going wrong. That was a big issue at SoundCloud when we went microservices and all those things.

Okay, let's quickly check this out. This is Catchpoint; we are using Catchpoint for probing from all over the world — no endorsement implied, it's just what we happened to end up with. You can do fairly complex things with it. It's difficult to read here, but if you have ever used Catchpoint you know what this looks like: you can simulate little browser sessions, like a user going to the SoundCloud landing page and playing a track, and that's how we measure our availability over a year or over a quarter. That's great, because that's a lot of probes over a quarter and you can really count how many nines you have, and you get paged if you are missing your targets. But you can already see this has a long delay and it's very noisy. It's great for our quarterly availability goals, but it's not right if I want to know within five minutes that my 99th percentile latency is bad right now.

So we conclude: we need white-box monitoring. It's the year 2011 — what do we do for white-box monitoring? Again, take the coolest thing from Hacker News; back then that was StatsD, which is actually not that old compared to Nagios, or to computer technology in general. StatsD is already really nice white-box monitoring: you instrument your code, you count requests, you send out your latencies. It has implementation problems — it uses UDP, which is very lossy, so at SoundCloud we didn't count 30% of our traffic and didn't even notice — there are certain issues in the practical implementation, but the idea is really sound. Also, again, our "you build it, you run it" approach made it really easy: developers realized, okay, I have to insert those three lines of code to count my requests and send them via StatsD. They get aggregated and end up in Graphite, which is a weird hybrid of a time series database and a dashboard builder, and this is what we got. Back then those dashboards looked pretty fancy; not bad, in a way.

What were the problems? It aggregates a bit too much — it's very difficult to nail things down to individual producers, because Graphite is just not powerful enough for that. We should actually consider what those different things are. I already mentioned that monitoring is many things; this list is not complete, it's just what comes to mind right now. Monitoring is observing: I want fancy dashboards, I want to see what's going on with my systems, what my vital signs are. But it's also exploring. If you have an outage and you are lucky, it is an anticipated problem, and you have, say, a graph for the latency of every single database replica, and you see: ah, this one sticks out, it's probably slow and creating problems. But often, of course, the real problems are the ones you haven't anticipated. You are sitting there, your dashboards look fine, your users are unhappy — what's going on? Then you have to do some interactive exploration, and that's where the Graphite combo is weak. You can do a bit, you can run queries against the Graphite time series database, to call it that — it has a query language — but it's really hard to ask questions you hadn't thought about when you set up your metrics. It has this hierarchical aggregation, and aggregating the other way around is really hard.
And there's also what I just mentioned: pinning something down to a single instance of your 500-instance service is really hard, because Graphite just doesn't have that granularity. And then alerting — that's what Nagios does for us, right? It wakes me up if something is really wrong and the machines need me. And this is all disconnected. How do I even do something like: I have a graph on a dashboard, and if that graph looks like this, I want to get an alert — how does Nagios know about that? Again, you have this flexibility of plugins, where you can tell Nagios: please run this Graphite query, and if the outcome is above one threshold send me a warning, and if it's above another threshold send me a page. You can do this, but it's all separate, and you want this to be more of one thing — it's all monitoring, right? It feels disconnected: you have alerting in a different universe from exploring or observing, and the exploring part is really the weak one here. So we had our dashboards, we had alerts that were fairly noisy, the exploring in between wasn't really working, and it was all very separate and different.

And then everything got more complicated because of containers. You have many machines, perhaps virtual machines at a cloud provider — that's bad enough already — but then containers happened. Nowadays you all know Docker and you know container orchestration; you run many of those containers somewhere. I put Kubernetes here because we are using it at SoundCloud; it's fairly new in the mainstream, but it's based on those ten years of Google experience, Blue Book and so on. There are others, you don't have to use that one — it's fundamentally the same problem. SoundCloud used containers before it was cool — I mean, after Google, of course, but before it was cool. There was no Docker; there was LXC, and we created an in-house container orchestration thingy called Bazooka — a name I don't like, because I don't like weapons, but I didn't come up with it. It had many flaws and maturity problems, but fundamentally this was top-notch: that was real container orchestration, like the big ones, at a time when normal people didn't even have Docker because it didn't exist yet. Great stuff — but then you have thousands and tens of thousands of containers floating around, and now your monitoring targets are containers; it's even more difficult to find where the problem is. Everything gets more complicated.

And then there's this sentence, again from chapter 10: "We need monitoring systems that allow us to alert for high-level service objectives" — that's the whole point, to have proper alerting — "but retain the granularity to inspect individual components as needed." You want both, and that is, in one sentence, why we needed to create Prometheus: there was nothing in the open-source space that would do that for us.
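To make that one sentence tangible in Prometheus terms: the same metric can be rolled up for the high-level objective and broken down again when you need to inspect components. A hedged sketch with made-up metric and job names:

```promql
# High-level service objective: error ratio across the whole service
# (the thing you alert and page on).
  sum(rate(http_requests_total{job="api", code=~"5.."}[5m]))
/
  sum(rate(http_requests_total{job="api"}[5m]))

# Same data, retained granularity: error ratio per instance
# (the thing you drill into during an incident).
  sum by (instance) (rate(http_requests_total{job="api", code=~"5.."}[5m]))
/
  sum by (instance) (rate(http_requests_total{job="api"}[5m]))
```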
So, level three: Prometheus. This is the architecture diagram. I said I'm not going to give a technical introduction to Prometheus; I want to focus on one point, which is how all the dots are connected. We often say "Prometheus" when we are referring to the Prometheus server, which is the middle box, but that's not the whole story. If you check out the Prometheus GitHub org, there are dozens of repos, and there are many more that are not in the org at all; it's a whole ecosystem, and it's also many ideas and concepts. That's all Prometheus — Promethean thinking; I often use that in code reviews, like "this is not very Promethean". So it's more than just a binary that runs as your monitoring server.

First of all, there are instrumentation libraries, so you already get help to instrument your code, similar to StatsD. It's really not that difficult once you get over the culture shock. And, only represented here by the exporters in the corner: exporters are little glue binaries for systems that aren't instrumented. Ideally you have your own software, you instrument it, and then you have metrics. Then Prometheus is about how to collect those: like Nagios, it pulls things in, but everything else is different. It's about storage — here we get to the time series database, like Graphite — so you record those metrics over time, which has a lot of benefits, not only for dashboarding. Then you have an expression language, PromQL. It's very powerful, sometimes confusing, and it's used for everything: you use it to formulate the query that feeds your dashboard, you use it to interactively explore your data, and you use it to formulate alerts. It's all the same language; all of a sudden everything makes sense, everything is consistent. PromDash was the dashboard builder we built at SoundCloud back then, because there was no suitable one; it also reads from Graphite, so you had nice migration paths. Nowadays Grafana is the hottest thing — it didn't really exist back then — so PromDash is totally deprecated, it should disappear from our web page, and it has disappeared from the architecture diagram. Grafana has many data sources and is really great for drawing dashboards from all kinds of sources, and Prometheus is a very popular one. The Alertmanager takes all the alerts that you formulate in PromQL, dedupes them, routes them (we'll talk about that later) and sends them to the right person. Which is another thing, right: if your alerting is a completely separate system, then all your ownership has to be redefined in the alerting system again. So this is also an area where you want more consolidation.

This slide very quickly proves that I glossed over the fact that we were using way more things than Nagios, StatsD and Graphite — Ganglia and all kinds of other things. We never used Heapster and InfluxDB, but that would have been the path if we had had no Prometheus: when we started to use Kubernetes instead of our own platform, it had Heapster in place, because they needed a story about monitoring. Later they learned about Prometheus, and now, yeah, you kind of end up using Prometheus if you want to monitor Kubernetes. New Relic is here as the representative of external monitoring providers. That's something we used pretty heavily, out of desperation: the external monitoring providers are fairly good, they do a good job at what they're doing, and the on-premise, open-source, home-grown solutions couldn't really solve our problem, so you use those providers and you pay them a lot of money. New Relic was just the one we paid the most money to, but we had more than this. And this is how it looks today: obvious consolidation. You do all the layers of things with the same system, all with the same language, and then beautiful dashboards with Grafana. That alone was quite a relief.

Talking about these external monitoring providers — that's pretty important, because they are doing such a good job and they keep evolving: you have New Relic and Datadog and others, and that's great stuff.
Now, there are situations where you really want this. If you are just a small shop, you don't really want to have a whole team dedicated to monitoring; it totally makes sense to say, hey, we can outsource that to an external provider, and they really have the right know-how and everything. But I guess many mid-size organizations are, and we definitely were back then, in a state where they feel they have no choice. We didn't want hosted monitoring, we wanted it on premise. We had weird things like an outage because our external internet connection was too slow, and then some services started to throw exceptions, which we all sent to an external exception-tracking service, and then our external internet connection got even slower because we had to send out so many exceptions. There are things where you say: I don't want this, I'm big enough — we are even running bare metal in our own data center, why should our monitoring be somewhere else? Also, it's expensive. This dotted line is our usage of external monitoring; it just grew and grew, because the on-premise stack couldn't really pull it off, and then Prometheus happened, so we could reduce it. Nowadays — I put the black flame there because Prometheus can also do black-box probing if you really want to, with the blackbox exporter — we have reduced it a lot, and we now have a very organically grown, very reasonable use of external monitoring, for example Catchpoint for our externally perceived availability, which totally makes sense. And this is significant: we are a shop of a hundred engineers, and I guess two of them are paid by the money we save by not using external monitoring providers anymore. It's really a lot of money you would otherwise have to pay to do serious monitoring.

So yeah, Catchpoint — no endorsement implied again, it's just what we use. Kind of an endorsement implied for this one: it's Latency.at. Of course I'm biased again, because fish, or Johannes Ziemke, who runs it, is a former SoundClouder and also a Prometheus contributor. He essentially said: okay, we can do black-box probing with Prometheus, let's do this globally from different PoPs and offer it as a service. So this is off-premises probing for your on-premises Prometheus service. If you are using Prometheus and you want some global probing, combined again with the same nice, consistent semantics, that might be something to look at; we are just starting to use it, pretty interesting.

Okay, how are we doing time-wise? Okay. So that was the big high-level overview. I'm not making this overly technical, but just to get an idea of how it looks more concretely, let's go through the stack quickly and see how we do monitoring now. The first thing is instrumentation: for white-box monitoring you want your code to be instrumented. You can do the classical thing using the vanilla Prometheus instrumentation libraries. This is just an example for Go — don't look too closely at every single line, but you can see the orange part: you add a few lines, and then you get a histogram for all your latencies and you count your different status codes. You also see something about labels — everything is labeled in Prometheus, which we'll talk about a bit later.
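The slide itself isn't reproduced here, but a minimal, self-contained sketch of that kind of instrumentation with the Go client library looks roughly like this (metric names, port, and handler are made up for illustration, not the code from the slide):

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Histogram of request latencies.
	requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "Latency of handled HTTP requests.",
		Buckets: prometheus.DefBuckets,
	})
	// Counter of requests, labeled by status code.
	requestsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Handled HTTP requests, partitioned by status code.",
	}, []string{"code"})
)

func handler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	w.Write([]byte("hello"))
	requestsTotal.WithLabelValues("200").Inc()
	requestDuration.Observe(time.Since(start).Seconds())
}

func main() {
	prometheus.MustRegister(requestDuration, requestsTotal)
	http.HandleFunc("/", handler)
	// The /metrics endpoint is what a Prometheus server scrapes.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```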
But then we also have something called JVM kit. The radical microservice approach is "share nothing", every team does their own thing, but in practice you don't want to be too radical, so you have some shared code. Whenever you do something on the JVM — which at SoundCloud mostly means Scala, but also some other things — we have a library, or framework, that helps you write a microservice, and it's called JVM kit. This gives you basic metrics for free: you write a microservice with that framework and you get the same set of metrics (it doesn't work for Go, because that's not the JVM — that's why I put the Go code on the left side). For most of our microservices you just get that same set of metrics, and then you add a few on top with the Prometheus library that are custom to your service. So it's fairly easy to get all those metrics into your services. And then, this is the best thing: create an internal tool, make it open source from the beginning, and make sure it becomes the next hipster monitoring tool, because then other people will integrate against it. We're using etcd, we're using Kubernetes, we're using linkerd — all CNCF projects — and they all have Prometheus metrics already built in. Great stuff, right? They are using our monitoring tool, and we can use it with their products.

So that's instrumentation: easier than you think. Then collection: probably more difficult than you think. As I already mentioned, it can be difficult to find your targets. People often frame Prometheus as "the Kubernetes monitoring system", but in practice you can use it to monitor everything. It's not tailored out of the box to one environment: you can monitor Kubernetes, but also Mesos and EC2 and whatever you want. That means you have to configure it for your particular use case, and that can get really complicated because it's so flexible. What the community usually recommends: use some kind of configuration management if you have anything reasonably complex. There are also things like Operators — the pattern where you can easily run certain things on Kubernetes — and there is one for Prometheus as well, which makes it super easy for that particular use case. We use Chef as configuration management, and for our developers it's really easy. On the left you see this is for api-mobile, our mobile API: if you deploy that, you put this little snippet into a Chef role, and then magically everything about api-mobile on Kubernetes gets monitored — the system and cluster labels are already there (more about labels later). This is totally custom, not predefined; we just chose that we want a "system" label. Chef then generates this Prometheus config — it goes on for pages, it looks really daunting, and you should not write it by hand — but it really depends on your use case. This all has to do with finding things on Kubernetes: the monitoring targets are all dynamic, and it follows along as containers come up and go down. Kubernetes has labels, Prometheus has labels, so obviously we have to match them, and we get all the Kubernetes labels into Prometheus. On Kubernetes you say: this is my api-mobile microservice, and it has the system label "api-mobile"; so Prometheus knows it should monitor everything with that label, and it puts all those labels onto the metrics, together with other labels like the version, or whether this is a canary or production. If you look at the Prometheus server's web UI, it shows you the targets: if api-mobile has 200 things running on Kubernetes, all those labels get magically attached. Once you're through this hoop of configuring it, it's all matching and nice and great.
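This is not the Chef-generated config from the slide, but a heavily condensed sketch of what that kind of Kubernetes service discovery plus relabeling can look like (the `system` label and its handling are the made-up example carried over from above):

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod            # discover pods; targets stay up to date dynamically
    relabel_configs:
      # Only keep pods that carry a "system" Kubernetes label.
      - source_labels: [__meta_kubernetes_pod_label_system]
        regex: .+
        action: keep
      # Copy the Kubernetes label onto every scraped time series.
      - source_labels: [__meta_kubernetes_pod_label_system]
        target_label: system
```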
Then dashboards. This is Grafana, and since all our microservices that use JVM kit have that same nice set of metrics, you can have a generic dashboard. This is essentially one dashboard we have for everything: you just pick your system, your environment, your component — all those label dimensions we defined for ourselves because they make sense for us — and you can, for example, compare the release to the canary and see whether it performs better or worse. And this is without a developer doing anything for their particular service. Of course, if you have custom metrics for a specific service you can also create custom dashboards, but these are super meaningful dashboards you just get for free.

Then interactive exploration — I mentioned how important that is. On the right you see part of the classical Prometheus server web UI, where you can just hack in PromQL queries; that is how we did it traditionally. But the latest Grafana version has autocompletion and all that jazz in the little text field where you would usually just paste in your query. So nowadays it's awesome to just go to your Grafana dashboard, create a new panel, and start hacking: it knows PromQL, it autocompletes, like coding in a programming language; it asks the Prometheus server about labels and metric names. It's very pleasant to assemble queries that way. If you end up with a nice query you can save it as a dashboard, or perhaps it was really just interactive exploration because you were debugging a current outage. That works really well.

Alert creation: in Prometheus this is all PromQL too. By the way, Prometheus 2.0 has a slightly different format for rules, so don't be confused — we still use the old one because we have it all mixed together and generated with Chef, but those are details. I won't go into the specifics here, but you can see it's something like: I have a rate of missed iterations, I divide it by the total number of iterations, so that's some percentage over time, and if it's more than 5% for four hours, I send out a warning to somebody. And you see there are nice annotations — runbook links are really important, you can read about that in the Blue Book — and dashboards too: the alert will tell you which dashboard to look at to find out more. That's really great.

At this point I also want to emphasize how important time series are. This is the proverbial disk-full example — it could be any resource that gets consumed. With a classical Nagios disk-full alert you just alert on a certain threshold, say 85% full. On the left side you would alert all the time, although this disk has been filled in a controlled fashion, you're done with it and you just read from it — but Nagios will still alert you, and silencing a Nagios alert is really hard (it's really easy in Prometheus, by the way). On the right side, some job goes crazy and fills the disk, and you get alerted like two minutes before you hit a hundred percent. I think I have the "this is fine" dog here: for Nagios, the right side is fine, while it keeps alerting on the left side, which is actually fine. What you want instead is to predict the future a little: do some linear extrapolation over the time series. I can do that with Prometheus; Nagios just doesn't know what a time series is. You can see that on the right it goes up and up — we should probably tell somebody — while the left is actually fine. In Prometheus this looks like the example here; it's a bit more complex, and predict_linear is a function that looks into the time series database and tells you something like: if in four hours we are at zero free file space, and there is already a certain amount of data on the disk, and it's not a read-only file system, then it's severity critical.
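A hedged sketch of such a rule in the Prometheus 2.0 rule format (SoundCloud still used the older format at the time; the node_exporter metric name, the `for` duration, and the annotation URLs are illustrative placeholders):

```yaml
groups:
  - name: disk.rules
    rules:
      - alert: DiskWillFillInFourHours
        # Linear extrapolation of the last hour of free-space samples,
        # four hours into the future. (The rule described in the talk also
        # excluded read-only file systems and nearly-empty disks.)
        expr: predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Disk on {{ $labels.instance }} predicted to fill up within 4 hours"
          runbook: https://runbooks.example.internal/disk-full
```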
Again, you define those severities for yourself; for us, critical means somebody gets paged, and the runbook links and all those things are on there. Now, where to send the alerts? This is done with an alert routing tree; this all lives in the Alertmanager now, and again it's all label-based. We use this "system" label that we defined for ourselves, so we know which system is owned by which team; here, api-mobile goes to certain receivers for warning or critical. In practice this is more complicated than you might think: you have this whole routing tree, you can do whatever you want with it, and you can graph it — the routing tree for us looks like a flower or something. Hopefully yours will look less complex, but this happens easily, and you see how complicated it actually is to route alerts to the right person; you cannot imagine how that worked with something like Nagios before. You also group alerts — that's the orange part, again totally configurable. For us, we group by the alert name, the zone (which data center it's in) and the system. We don't want to send out one notification per alert; we want one bundled notification. The classical thing with Nagios: a rack with 32 servers goes down, you get 32 pages. Not with Prometheus — you can say, if a host is down in the same data center for the same owner, just send one page that enumerates all 32 hosts that are down. That's also very helpful against alerting fatigue.

Delivery: again, if you have an open-source system that goes viral, everybody will integrate with you, so PagerDuty has an explicit Prometheus integration that is really easy to configure. PagerDuty — again, no endorsement implied, we just use PagerDuty to deliver pages. We are also using Slack for everything, including non-paging alerts, and there you see all those nice things with the runbook link and the dashboard link right in Slack. Grafana also runs nicely on a mobile phone, so if you're the person on call, you get something, click on it, look at the dashboard, and you can already see what's going on even before you have opened your laptop and logged into the VPN. That's also pretty good.
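A condensed sketch of what such a label-based routing tree with grouping and receivers can look like in an Alertmanager config (team names, channels, and credentials are placeholders, not SoundCloud's actual tree):

```yaml
global:
  slack_api_url: https://hooks.slack.com/services/PLACEHOLDER

route:
  # One bundled notification per alert name, data center and owning system.
  group_by: [alertname, zone, system]
  receiver: slack-default
  routes:
    - match:
        system: api-mobile          # the example system label from above
      routes:
        - match:
            severity: critical
          receiver: api-mobile-pager
        - match:
            severity: warning
          receiver: api-mobile-slack

receivers:
  - name: slack-default
    slack_configs:
      - channel: '#alerts'
  - name: api-mobile-pager
    pagerduty_configs:
      - service_key: PLACEHOLDER
  - name: api-mobile-slack
    slack_configs:
      - channel: '#team-api-mobile'
```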
Okay, looks good. So, this is the parting slide. SoundCloud went from "we have no clue whether we are up or not, we just look at Twitter" — and it was a joke that people wouldn't even tweet about us when we were down; they would tweet about us when we were up, because that was so rare — to nowadays, where we have many nines. I'm not sure if I'm allowed to tell you how many, but by now we essentially reach our ambitious availability goals every quarter, despite several horrible things that happened to us, or even to our third-party infrastructure providers — that DNS attack and so on. We asked the managers and tech leads in a retrospective: we are doing way better — why? Just write down the one most important reason why we are doing better. And most of them wrote down: because we have improved monitoring.

That matches nicely with my final reference, a shout-out to the Blue Book again. Mikey, one of my former colleagues, made up this famous hierarchy of service reliability, modeled on Maslow's hierarchy of human needs, where you have things like food and shelter at the bottom, and only then can you do things that are very human, like philosophy and culture and science — without food, nothing works. There is a similar thing in service reliability: your product is the one thing you care about, but it's only the tip of this pyramid, and you have to build the pyramid from the bottom. I really like how Mikey put monitoring as the base here, and this is kind of my life by now, so it makes a lot of sense. From "unknown" to "low" to "high" in four years — that's how it worked. Okay, thank you. I'll link the slides on Twitter and they will be uploaded to the GOTO site, so you can look them up. And now for questions — we have minus five, no, plus four minutes for questions.

[Applause]

Thank you very much, Björn. We have two minutes, but I think you're floating around afterwards and people can ask you directly. First question: are the results of the monitoring persisted, for example for displaying data in the dashboards, or do you lose the data when starting or restarting nodes?

Very good question. There are two things to it. In the Prometheus world we think of metrics as something where you want history, because it's time-series-based, trending-based alerting and everything, but we also think of it as ephemeral data: you want enough to say, okay, I can predict disk space, or I can see how many errors I had over the last ten minutes or one hour, but not necessarily super-long-term trending. Although — the MySQL people in the company have started to say, hmm, let's look at the last six months of database table growth and predict when we will run out of disk space, and then you really want long-term data. So it is persistent, but it's more like: if your server blows up, the data is gone. With Prometheus 2.0 we can finally do consistent hot backups, which is a nice improvement to make the data a bit safer. But there are also the remote read and write adapters, so you can forward all the data into some really distributed, replicated storage like OpenTSDB or InfluxDB, and you can even read it back and run queries against the total data in your distributed storage. But you always have your local storage with the more recent data, which will never go down, even if your network is on fire and your distributed database doesn't work anymore.
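For reference, wiring up such a remote store is a couple of lines in the server configuration; this is only a hedged sketch with a placeholder adapter URL, not SoundCloud's setup:

```yaml
remote_write:
  - url: http://tsdb-adapter.example.internal:9201/write   # e.g. an InfluxDB/OpenTSDB adapter
remote_read:
  - url: http://tsdb-adapter.example.internal:9201/read
```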
Okay, thanks. A lot of questions came in late. Last question: when monitoring detects issues, you look at the logs to determine the cause — what stack do you use for logging, for collecting and storage?

Really nice, it's almost as if scripted, because that's the huge thing about logs and metrics. People often confuse the two, or they ask me: do I need logs or do I need metrics? And the answer is: they complement each other, you need both, and they really are different things. You want some stack for collecting logs and looking at them — this is event logging — and Elasticsearch, the ELK stack, say, is a good thing there. We just collect logs at the moment via Kafka, so you can look at them, and we archive them somewhere on Hadoop. You want some way of saying "there are my logs, I want to look at them", and you might even want to do some search and aggregation on them. But this is different from metrics. If you google "Prometheus logs metrics" you will find some nice blog posts about it; it's mostly about scalability. If you're small-scale you can just log everything and create metrics from the logs, but at larger scale, metrics are on a higher level, and then you can scale up way more. If we had an Elasticsearch cluster for all the logs we create on our thousands of microservice instances, we would need an Elasticsearch cluster that is bigger than our whole production system. So you cannot base everything just on logs — but you also need logs for certain things, so the two definitely complement each other. It's really interesting to dig deeper, and Google is your friend: it's also a search engine, not just a source of technology.

Okay, so a little spoiler: Björn talked a lot about Site Reliability Engineering at Google, and the next talk is exactly about Site Reliability Engineering at Google. With that, I want to thank you for your talk.

Thank you — and just catch me afterwards for more questions.
Info
Channel: GOTO Conferences
Views: 9,949
Keywords: GOTO, GOTOcon, GOTO Conference, GOTO (Software Conference), Videos for Developers, Computer Science, GOTOber, GOTO Berlin, Björn Rabenstein, SoundCloud, Prometheus, microservices, Cloud, Monitoring, alerting system
Id: hhZrOHKIxLw
Length: 50min 55sec (3055 seconds)
Published: Thu Mar 01 2018