Getting started with Grafana Loki (Grafana Office Hours #09)

- Hello, everyone, and welcome back to another Grafana Office Hours. As always, I'm Nicole van der Hoeven, and I am a developer advocate at Grafana Labs. I'm with two of my colleagues today. - I'm Paul Balogh. I'm another developer advocate at Grafana Labs. - Yeah and I'm Ward Bekker. I'm a solutions engineer at Grafana Labs. Thanks for hosting me today. - (chuckles) Awesome. We're glad to have you here, for sure. Tell us what you do. What is a solutions engineer? - That's a great question. What do we do? Well, we talk about a lot of things that we're not really experts about. (all laughing) - Wait, I'm just a developer advocate. - Hey, join the club. - Yeah, exactly. I'm just joking a bit there. No, it's actually a very fun job. We are technical, so we typically have a technical background, and we are part of the pre-sales process. That means that if a potential customer comes to us, to Grafana Labs, they have all kinds of questions about how to best use our tools. Maybe they want to do a POV or a POC, and we just help them, we just assist them. And like I said, we try to be experts and try to know everything about everything. But especially nowadays with Grafana Labs, with the big platform, there are so many things that you can't know everything, so we are also like the gateway to engineering. So if somebody has a question that I don't know, I can just reach out to all the great people within Grafana Labs and get them also on the call. - Nice. - And hopefully help the customer, or the prospect that wants to become our customer. That's the goal, that's the goal. And then I'm happy. - So then after they're already onboarded, are you hands off? You're like, "Leave me alone. Don't bother me. You already paid." (laughing) - Yeah, that would be easy. But no, that's not really it, so we do have a lot of folks that help customers when there's indeed a contract signed.
We have a customer success team that is in regular contact with the customer to make sure that their goals are being achieved. We have support. So there's a bunch, like busloads of people that are actually helping the customer be successful. But we do interact sometimes. For example, there are always some new use cases that people want to adopt, so we're gonna tell them about it. And we also have, of course, quite a lot of context. So we build up kind of a relationship with the customer. So yeah, we know where they're coming from. We know a little bit what their use cases are. So we can sometimes help with like solutioning, architecture, and that kind of stuff. If they want to, for example, migrate to new systems or, yeah, adopt new technologies. - Yeah, that's awesome. - So today we're going to be talking all about Loki. And I'm gonna be completely honest here. I'm a performance tester. I haven't used Loki. And I've kind of just taken advantage of the fact that Loki has already been set up for me. So I'm on the side where I'm watching everything going into the dashboard. So I'm gonna be asking you all of the questions and if you can answer them, that would be awesome, but if not, that's fine too. (laughing) - Okay, let's see if we can make you a Loki expert. - Oh, an expert? - In just one hour. In just one hour. - Yeah, sure. I can totally do that. (laughs) How about we start from the beginning? What is a log and how is it even different from metrics? 'Cause I think most people when they think observability, the first kind of pillar that they think of is metrics. So how is a log any different? - Well, actually, I disagree. I think people typically start with logs. - Logs. - Because if you're a developer, what's the one thing you do to actually troubleshoot your program? You say print line or print, and then you get a log and you get an output.
- Do you do the like console log and then here and then here now, and then you get a bunch of different logs that make no sense to anyone but you? - Of course not, I never do this. I always... (Nicole and Paul laugh) I'm a professional developer. I strictly go into the debugger and never use print line for my troubleshooting. Exactly. - Oh, yeah, me either. - It is something that I definitely do and I still do. And that makes a lot of sense, right? Because it is very easy. Of course there are more advanced ways to troubleshoot, but people do that. And if you are developing a program, you sometimes don't even know what kind of metrics you actually want to monitor, right? So you typically see that when people are developing, they start with, okay, let's just start logging. And then later on, they get a better understanding of the program, a better understanding of what the KPIs are that they want to get metrics out of. And then you see people typically moving to metrics. So I would say that people typically start with logs as the first pillar. And of course, right, that depends a bit on whether you are monitoring your own application or something that comes outta the box, like Kubernetes, because Kubernetes, of course, already has that Prometheus monitoring built in. So it has Prometheus exporters, all the metrics that you need, but still you also want to get logs from it. So how are logs different from metrics, as you were asking? Logs are more like events. They contain a lot of contextual information. So all the information about that specific record of an event, and if you compare it to a metric, it's just one data point. So typically when you are troubleshooting a system or there's an incident in a system, you can see on those metrics, well, my error rate goes up. That doesn't look good. But you can't know from just the metrics alone what actually goes wrong in that application.
For that you actually need to go to the logs, because logs contain all the detail. And with metrics, it becomes really hard to troubleshoot a program, at least for the unknown unknowns. Because there might be some cases where you're just like, okay, I see a metric going up, I just need to scale up. And that doesn't require any logs. But for a lot of troubleshooting, you actually want to go to the logs. - I think that's a good point. I think I'm thinking like a tester, not as a developer first, because as testers, we're very focused on thresholds and how do you say if something has passed or failed? And usually we've already done the work to set up things that we know we're going to need. Like if we know we're gonna need CPU utilization, then we're going to have a metric that tracks specifically that. But you were so right when you talk about logs capturing the unknown unknowns, because if you knew that there was something that you needed to be watching out for, then you might already have a metric for that. But logs are kind of like a catchall for the things that you don't even think that you're going to need and they kind of tell more of a story, right? It's not just a number. It's like what happened in what order. - Especially when it's a stack trace. Those are the fun logs. - Oh! (all laughing) - I went there, I went there. - Yeah, everybody wants to have like a thousand-line stack trace. Java, I'm looking at you. It's like. - Yeah, JavaDucky. Paul's handle on a lot of social media, social networks is JavaDucky, just so you know. - Oh. - I was in the Java space for quite a few years. Just a couple. - Yeah and going back to that point, right, metrics versus logs, what is interesting, I think, is that you can actually use your logs to create metrics.
So one of the things that I typically recommend people to do when they use something like Loki is that it is very easy to actually build metrics on top of your logs. And then maybe, as a next step, you could actually make them like Prometheus metrics. So first you're gonna use your logs as the data source, and then later on, you might want to make them a little bit more official and create Prometheus exported metrics. So that could be a way to start with logs and end up with metrics. - Hmm, so we've already touched on a few difficulties regarding logs. The first is that you need to have some way of making sense of it, 'cause logs can be verbose, like stack traces, and it's one thing to have the logs, but it's another to have your memory fill up because of all of these unusable stack traces or whatever. So it's also about making sure that you get only what is necessary and really nothing else if possible. And the other thing is how to make metrics out of that. So can you give us a bit of an example or something of what would be a log that you can make a metric out of? - Hmm hmm. So typically people divide logs by format. I'm assuming logs are text-based, so that is starting from there. You typically have three types of logs. There's completely unstructured. So, the question is if a stack trace is completely unstructured, I don't know, but it doesn't have a lot of structure. Of course, it is a stack trace, but it's not like JSON, which is a completely structured format, right? Because it contains a very rigid format where you have very specific key-value pair mapping, and that in a kind of a hierarchy. So that is structured. And there are some formats that are comma-separated or whitespace-separated, like what you typically get from Apache, Nginx, that kind of stuff. It's a little bit in between. So we call that typically semi-structured logs.
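The three log shapes described here, and the earlier idea of building metrics on top of logs, map onto LogQL's parsers and metric functions. A rough sketch, where the label values (`app="myapp"`, `job="nginx"`) and the Nginx pattern layout are assumptions for illustration:

```logql
# Structured JSON: extract fields with the json parser, then filter on them
{app="myapp"} | json | level="error"

# Semi-structured (e.g. an Nginx-style access log): pattern parser
{job="nginx"} | pattern `<ip> - - <_> "<method> <path> <_>" <status> <_>`

# Unstructured: plain line filter, then turn matching lines into a metric
sum(rate({app="myapp"} |= "error" [5m]))
```

The last query is the "logs as a data source for metrics" idea: no Prometheus exporter needed, the counter is computed from the log stream at query time.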
And yeah, some of those logs are a little bit more difficult to parse so you need a lot of flexibility in your system, in your log aggregation system to be able to handle that and actually to extract metrics out of that and to get value out of that. And it's actually some of the things as we're talking about Loki today, it's actually some of the things that we try to do with Loki very, very well. So that it's actually easy to work with all kinds of formats and still get value out of it and still are able to build up those metrics out of logs quite easily. And doing that because maybe people are like familiar with large ETL processes where there's like, I dunno, they use Spark, they use all kinds of analytical tooling to transform the data into their very specific schema. And actually what we wanted to do is make it easy for people to do this stuff without a lot of pre-processing needed. It's just like, okay, you're gonna decide, you're gonna first ingest everything into Loki and then you're gonna later decide what you actually want to do with it. Because again, if you do that upfront processing, that might be, eh, you might end up with a lot of very specific, very useful logs. But at the same time, you're also optimizing for the problems that you know. And going back a bit to the point that you made, Nicole, it's like sometimes it's kind of a security, oh, how do you say it? Insurance policy, to actually have just all the logs because there might be some stuff that you have missed and you just want to go back and maybe it contains a lot of details about that specific incident that you of course didn't imagine that it would happen. - Right? - So, that's why we're trying to be very flexible with Loki and try to support all the formats. It just needs to be text-based. We're gonna ingest it. And later on, at query time, we are gonna help you getting value out of that as easy as possible. - Why is it called Loki? (all laughing) I don't know if you know. - I know. 
It's actually quite funny. We actually wanted to call it Tempo first. So for the folks that know a little bit more about the Grafana stack, Tempo- - Oh wow. - Is the name of our tracing solution. Yeah, exactly. And I think it was named by Tom Wilkie, who is the creator of Grafana Loki. He's our CTO. And I think we chose to call it Loki because first of all, it is a Nordic god, yeah. So we have like a preference for Nordic gods. - Yeah. - Like Prometheus, right? It is also stealing fire from the Nordic gods, or actually that was not the Nordic gods. That was actually from the Greek gods. Mount Olympus, I'm not- Yeah, so I don't know why we switched to Loki then, but anyway it is a god. It is from the past. And it is starting with an L. So we have LGTM, the acronym we use for our stack: Loki for logs, Grafana, Tempo for traces, Mimir for metrics. And yeah, it just sounds great. So that's why I think we called it Loki, but why we typically went with Loki, I think it was just one of the first gods that actually had a name that started with an L. That would be my suspicion. (Paul laughs) - So I have no idea at all why it was called that. But I did wonder after the fact, this is just coming from me, Loki's like the trickster god, right? He's also a shape-shifter in Norse mythology. And so I was wondering if it was the shape-shifting, like changing something, something that could be difficult to understand, into something that's easier to understand. I don't know. That's my theory. - Oh. - That's actually pretty good. (laughing) - Yeah, we might just want to run with that and just rewrite history a bit. - Sounds good. (Nicole laughs) - There was a lot more thought. There was deep thought into that. - Yeah, never let actual history get in the way of a good story, right? - Yeah. (laughing) That's totally why it's called Loki. - Yeah, exactly. - So what exactly is Loki? - Yeah, that's a great question. So Loki is a log aggregation tool.
So for the folks that might know typical systems like Lucene, on which you have Solr and Elasticsearch, or maybe Splunk, those are all log aggregators. And the goal of log aggregation is quite simple. You just pull all the logs from all kinds of systems that you want to monitor and you put them in one place, and then you need to be able to query it and ask questions of that data to gain insights. And that is what Loki does. - Yeah, that was something that was mind-blowing to me when I first got, well, it started with an S, rhymes with funk, was my first exposure to an aggregator. Because yeah, back in the old days, 'cause I am kind of old, we used to have all these secure shell terminal windows up, and you know, if we had a distributed system, we're sitting there tailing logs from five different machines. And now with this aggregation, it's so nice that they're all getting forwarded to that system, and then you can have one browser window up and open and see all this data coming through. But yeah, we only had 17-inch monitors and we liked it. It was all we had. (laughing) - Yeah. Exactly. So that was, if you don't have a log aggregator, that is kind of how you do it, right? You just try to log into that server, you need to find, okay, where did that program store the log files? Are the log files still there? Because- - Right. - Disk space might be limited. The file might be rotated out. So it was really hard, right, to find and to be productive with that. And especially nowadays when you look at cloud native, Loki is really designed to be cloud native. And one of the things is that a lot of that data is ephemeral, so it might already not be there anymore. Or if you look at serverless, there's no server to log into, so there's no server anymore. So you actually need to send the data somewhere. So you typically want to do that to a log aggregator.
- Yeah, I'm glad you brought that up about the ephemeral, 'cause yeah, with these pods, I mean they go away, and then there went your logs too, unless you're writing to a persistent volume or something. But, yeah, no. - Yeah, and of course, the benefit is also, because you have all your logs in one place, it is also easier, right, if there's an issue that actually involves multiple systems, to make all the puzzle pieces come together. - Yeah. - In one UI or with one query. Because when you have a distributed system where you have one application maybe running on multiple nodes, and you round-robin or load-balance requests, one piece of a customer's request might go to one node, the other to a second node. Yeah, you still see only a partial request if you log into those individual nodes. So that is also one reason why you want to bring it all together in one single place. - Yeah. - Even in performance testing, on the side of the load generation, it's quite common for you to have dozens or hundreds of load generators, each of which is also potentially spewing out logs. And sometimes when something goes wrong, I need to not just be able to find that part in the logs, but also identify which load generator actually encountered that error and what else was it doing at the time? So there is aggregation, but it still needs to be identifiable, otherwise it's no longer useful. If you have a hundred load generators and you know one of them encountered an error, it's like, well, I don't know how to troubleshoot that unless I know which one it was and what it was doing. - Yep, yeah, you need to know where it came from. And that is also something that we ask when people are sending log lines to Loki: you're just gonna send the log lines, but we do ask you to put on a little bit of metadata that says where it came from.
So maybe, of course the nodes, the server, the team, the data center, maybe the application, all that kind of information. That's really good to know. - Hmm hmm and grep for the win for sure. (all laughing) Hey, and I just gotta make an observation here. Is this a plant? Do you have a plant in here? The name sounds vaguely familiar. - I invited all my family. I invited all my family, yeah. - Wow. - Actually, I know Ruan Bekker. He's actually a very nice guy from South Africa. We're probably like somewhat related at some point, I don't know, but he is not directly related, but great to have you here, Ruan. - I was gonna say, wait, your family understands what you do? How did you manage that? (laughing) - Oh, no, no, no, no. Yeah, if I explain to my family indeed what I do, they get very sleepy and they're very quickly distracted. - Do they just say, "I have a problem with my printer, can you help me?" - Yes! Yes. - Yeah, well that's my function in my house. My daughter and my son, they like to play a lot of games online and I'm the resident IT expert, indeed. I help them. - Yeah. - With all game purchases and network connectivity issues. (Paul and Nicole laughing) - So why don't you tell us a few more things about Loki? Like what- - Yeah. - What makes it, how is it different from those other alternatives that you mentioned? - Yeah, that's great, right, because there are already a lot of those systems. And also when folks started out developing Loki, there were a bunch of ones, a few popular ones. So why did we build, as Grafana Labs, another Loki, another log aggregator? Well it turns out that a lot of those systems that are very popular, they don't really fit our way of working. And when I say, our way of working is that, first of all, at Grafana Labs, we typically develop systems that can help us in better operating our cloud products. We have Grafana Clouds and we are big users of Grafana and Prometheus. 
And we just found there was like a mismatch in the way that most of those logging systems worked in combination with Prometheus. What we actually wanted to do is build a kind of a log aggregator that would work great together with Prometheus, so it would be very similar, so people that come from Prometheus could also work with Loki directly, and it also would need to be integrated with Grafana very well. So yeah, Grafana is kind of the first choice as UI. And there were also two other things that are really important for us. First of all, what we noticed with cloud native and microservices is that they generate a huge amount of logs. And typically those existing systems are great, they have a lot of functionality, but they're not really equipped for handling that amount of logs. And even if they are, the cost, the TCO, was really, so how do you say? Was the factor high or low? I dunno, it was really expensive basically, to collect all those logs, and also operating such a large system, when you talk about multiple terabytes a day, was very difficult. So the folks that maybe are here today on the Office Hours know about Solr or Elasticsearch, when you have a large cluster, and especially in the past, and of course, right, those companies, they're great. They make great products. They also innovate their products. But in the past it was a lot of JVM garbage collection optimization and it was also really hard. I see like Paul, like yeah- - JVM, yep. - JVM, you gotta love them. And it was also very fragile. So that means that if there was something wrong, and maybe somebody misconfigured something and was writing much more than they typically do, it would maybe affect the query path, and the query path could also affect the write path, that kind of stuff. So there were actually all kinds of things that we thought, okay, well we might be able to do that better.
And what we also wanted to do is make it cost effective. So what we actually did is we wanted to build a system that was built for Kubernetes and would use object storage as a very cost effective, very durable way of storing a huge amount of logs. And we wanted to basically build a system that can handle petabytes, and not like, oh wow, you have a few gigs. It's like, no, we really want to make sure that we are able to handle that huge volume of logs that modern systems generate. - And that's kinda like that insurance policy, right? You know, keep all that there because you might use it in the future or might need to lean on that data in the future. - Yeah, and the thing is, if you make it as cheap as possible, as cost effective as possible, cheap may be the wrong word, because if you store multiple petabytes per month, that is still not cheap, but it is very- - Not free. - No, it's not free, but it is really cost effective and it's also very durable, so it means that you really know that you're not gonna accidentally delete one of those critical pieces of logs. And of course, right, some logs are more valuable than others, because some of them are very chatty but probably are never used in troubleshooting. But well, if it is relatively cheap to store, it might make sense to just keep a little bit more than you would normally do. When you actually pay a pretty high price for every gig, then you might want to say to your developer, "Stop logging." (Paul laughing) And we see that. Yeah, you see that with a lot of companies where they actually say, "Folks, you cannot log. Don't do that." - Yeah. No, speaking from experience, I know, we were in that kind of a situation, and you know, we had something like a, oh, I don't know, a Kafka consumer or publisher that was just spewing logs like crazy, and one of our SREs would send out a message and say, "Stop it for the love of God! Stop it.
You're blowing out our license," you know? - Yeah, and that can have a lot of like big effects. So not only kind of a monetary effect, but also like impact the rest of the core use cases that are running on those platforms. - Yeah. - So, yeah. So that is actually why we wanted to create kind of a better system. And yeah, I think we achieved that with Loki. And you also see that in the years that, I think, I don't remember exactly where, what dates we actually released Loki, but I think it's now like around like five years that Loki is up and running and you see really massive adoption. So it's really great to see that a lot of folks are using Loki and are looking at Loki as kind of the default log aggregation to use in combination with Grafana and Prometheus in cloud native situations. - Yeah, one of the big pluses too, isn't it, that it's like a kind of leveraging the, I mean it's LogQL and it's similar to like PromQL? Isn't that one of the things to try to keep them a little bit similar and kinda leverage that, like, I guess, you know, understanding? - Yeah, exactly. And so maybe it's good that I show a quick slide. - Sure. - On how we actually store the logs because that is a little bit about like the secret sauce from Loki. - Yeah. - Because when people, when we say Loki doesn't index the logs, because that is something that we do say, a lot of people get confused, especially when they come from like fully indexed solutions like Elasticsearch. - Yeah. - So what we do is we have a very similar structure to PromQL or to Prometheus. So that means also that LogQL can have a similar structure as PromQL. So what we do is we store the timestamp. We store the same type of label selector pairs as you would normally see in Prometheus, and we store the log line. The only thing is we only index this piece, but we don't index the log line. So we don't do anything except a little bit of compression. So, right, it's otherwise wasting bytes, you want to compress. 
It's typically compressible by a factor of four, so you would definitely want to do that. And what we do is we store that on object storage so that people can, yeah, keep it there quite easily. And the cool thing is, because we don't really index the log line, this piece, this index, is really very small. If you compare it to a lot of those systems that do full-text indexing, the indexed data or the index files are sometimes even bigger than the original log lines. So let's say you're gonna ingest half a petabyte per day; that can also be the size of the index. So you actually need to store both the original logs and the index, and that is a huge burden. And that, of course, costs a lot of resources. So we want to keep the amount of resources small. That's why we say, okay, just a little bit of indexing, and then compress the content of the log lines. Then the next step is that you're gonna query that. So first of all, that index, like I said, is really small. For the folks that need a little bit of an idea: 10 terabytes of log data actually goes into something like 20 megabytes of index. So that's really, really small. So basically that doesn't even count, right? It's negligible. And the way, if you look at how we are gonna query that with LogQL, so LogQL is the Loki query language. What we do there is we store all those raw logs, so we don't, how do you say that? We don't index them, but what we do is we say, okay, just give us the logs from a certain application or use case or department. This is called a label selector. And that's exactly the same, right? If you have PromQL, if you're familiar with that, you always need to specify a selector there. The only thing that we don't have is of course a metric name, because that is very Prometheus-like. And what we then do is we apply a timeframe. Same as if you would fire off a PromQL query. Yeah, you also need to specify the duration for that.
And then what we do is we fire off a needle-in-a-haystack query. And this is one example where we just basically do a kind of a distributed grep. Yeah, so people were joking about grepping and going into a terminal, ssh-ing to a box and grepping. We still do the same thing, only we do it in a centralized, aggregated way. And also (indistinct), what we do is we do a lot of parallelization. So that means that basically with those label filters that you see here, with a timeframe, we filter down the logs that you actually want to grep through, and then we grep, but we do that in a parallelized way. So that means that we're distributing that workload across a lot of different nodes in our cluster, and they're all gonna take a piece of those logs, those eight terabytes, so maybe eight nodes. So every node will take one terabyte and they will search through it. And we actually are able to achieve speeds of one terabyte a second. So that means that if you want to search through eight terabytes of logs, it'll cost you around eight seconds of query time. And that is also different compared to a lot of those earlier systems: sometimes we're a little bit slower or a little bit less performant. But the thing is, it is a trade-off that we make. It's like, I'm okay waiting a few seconds if that means that my TCO is an order of magnitude less. It's much more cost effective. And that is basically the design decision that we took with Loki, to make sure that it fits our use case, because a few seconds doesn't matter. And a lot of those systems like Elasticsearch were actually created for kind of different use cases, where you want to have subsecond response times, where every 10 milliseconds counted. And in our case, that's not really needed for DevOps use cases. And that allows us to be much more cost effective and do this type of stuff by brute force at query time. - Yeah, that's interesting.
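The needle-in-a-haystack query described above, a label selector to narrow the streams via the small index, a time range, then a brute-force scan of the line content, looks like this in LogQL. The label names and the search strings are hypothetical:

```logql
{cluster="prod", app="checkout"} |= "connection timeout" != "healthcheck"
```

The `{cluster="prod", app="checkout"}` part is the indexed label selector, just like a PromQL selector without the metric name; `|=` and `!=` are the line filters that get executed as the parallelized, distributed grep over the compressed chunks. The time range itself is supplied alongside the query by the client, for example the time picker in Grafana Explore.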
Yeah, it is that trade off thing 'cause it does make me think of like, I use some NoSQL databases in the past too, and they weren't really indexed, but you would have these materialized views and really, it was a complete copy of all the data. It was just formatted differently for a faster search. - Yep. - But you were paying by having much more storage needs so. - Yeah and, indeed. And so it's not only storage, right? So a lot of those systems, they really like to keep all the data in RAM because otherwise it becomes, again, it becomes slow to retrieve it from disk. And typically they also like to do like fast CPUs. So yeah, it's really easy to spend a lot of money and you get a fast cluster, don't get me wrong, right? For the right use case, that makes a lot of sense. But in our case where we just want to ingest a huge amount of data as cost effective as possible, and then maybe we're going to query only on a subset of data, depending on those unknown unknowns, this makes much more sense for us. - Yeah, it's interesting because you're talking about cost effectiveness, but to some degree, that is also performance-based. When you talk about cost, it's not just a monetary cost, right? It's also the cost in terms of resource utilization. And if there's a query that you need to run, that doesn't necessarily have to be done immediately, then you're going to need fewer resources to do that as well, right? - Yeah, and the nice thing about Loki is that because we have that microservices architecture, it's a quite modern architecture. So you can choose a little bit your own adventure. So if you want more performance, yeah, you can actually scale up the query path. And let's say if you double the query path, it can be very well that you actually have two times the query performance. And so you can basically choose how cost effective you want to be there. And that is also really great. 
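The "choose your own adventure" scaling of the query path mentioned here is, in practice, a matter of changing replica counts. A sketch of what that might look like in a values file for the grafana/loki Helm chart's simple scalable mode; the exact key names are assumptions, so check them against the chart you deploy:

```yaml
# Hypothetical values.yaml fragment for the grafana/loki Helm chart
# (simple scalable deployment mode).
read:
  replicas: 4   # scale this up to distribute queries over more readers
write:
  replicas: 3   # at least three, matching the default three-way replication
backend:
  replicas: 3
```

Doubling the read replicas can roughly double query throughput, since Loki splits a query across all available read-path instances, while the write path is sized for ingestion volume and replication rather than query speed.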
And also it's not only around physical resources like the hardware, but it's also around operational costs. There are a lot of folks that are needed to scale those very popular existing systems to very high scale, right? There are books written that you can find on Amazon. It's like the dark magic arts to actually create such a cluster and run it at that certain scale, because you need to have a lot of knowledge. And I'm not saying that Loki is like, there's still a lot of inherent complexity, but it is, again, I think an order of magnitude simpler to run at those scales than a lot of those older systems. But I might be biased. I might be biased of course, right? There's probably somebody that will say, "What now? You're just a Loki fanboy," which I am, which I am, I cannot deny. - I mean, we're on the Grafana Channel, like. (laughing) We're all biased here. (laughs) Everyone has a bias, just some of us are more honest about them than others. - Yeah. (laughing) No, but yeah. But to that point, it's like, I've used those systems. I'm still using a lot of those existing systems. They have their uses. It's just, I always like to pick the right tool for the right job. And sometimes I also talk to prospects and customers and they're like, yeah, we want to use Loki. And they explain to me what they want to do. And I'm like, "That doesn't sound like Loki, don't use Loki for that. That will not be cost effective for you and you will have a horrible experience. So please don't." - So you touched upon one of the things that makes Loki awesome, and that's that it is built for performance and also horizontally scalable. Now what exactly do you have to do to be able to scale up Loki? - That is a nice one, as well. And let me show you this. I hope this is viewable on the YouTube channel, if people are able to zoom in enough. So I created this kind of diagram. It's also in the Loki documentation, around the architecture of Loki.
And I typically talk with folks when they ask like, how do I scale Loki? I talk them through this architecture slide. So what you see here on the left is what we call the simple scalable deployment mode. So that's the default mode of Loki. So you can also run Loki as a single binary, by the way, which is really easy. So if you just want to play around with it on your computer, it's actually quite powerful. But of course, right, it doesn't have like high availability and that kind of stuff. So for that you want to deploy it in that mode I was just talking about. And what we do is- - Would you be able to zoom a little bit on that? - I'll try. I'll try. Maybe... I don't think I can. - Are you on a Mac? - I'm on a Mac. You have Mac- - Maybe you can just pinch and zoom. - I tried. Sorry, I cannot. - That's okay. - Make this bigger, so. But I'll just explain the boxes and I'll read out what it says. - Sure. - So hopefully that makes sense. (laughing) And you can always look at the Loki documentation, look at like Loki architecture, and you'll find this slide, as well. So what you see is that Loki runs on a Kubernetes cluster. It doesn't need to be, it can also run on like Docker on your own server, but we typically recommend running it on a Kubernetes cluster. And it consists of three types of microservices. And that is the write path, the read path and an administrative microservice. So the write path, as the name says, everything that comes in here goes through a load balancer, and then you have one or more of those write path microservices. So we typically would say you want at least three because we use three-way replication. So that means that if one of those writes goes down, you still have two successful write instances that can still write data to object storage. And we need to have two because we want to make sure that we're actually writing consistent data to the object store, and we're not accidentally like writing like partial data.
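As a rough sketch of what the write path receives, this is approximately the JSON shape an agent sends to Loki's `/loki/api/v1/push` endpoint: a stream identified by a small set of labels, plus raw log lines with nanosecond timestamps. This is an illustration, not Promtail's actual code, and the labels and log line here are made up:

```python
import json
import time

def build_push_payload(labels, lines):
    """Build a payload for Loki's push API (/loki/api/v1/push).

    `labels` identify the log stream; each line is paired with a
    nanosecond-precision timestamp string, as the API expects.
    """
    now_ns = str(time.time_ns())
    return {
        "streams": [
            {
                "stream": labels,
                "values": [[now_ns, line] for line in lines],
            }
        ]
    }

# Hypothetical labels and log line, just to show the shape.
payload = build_push_payload(
    {"job": "myapp", "host": "web-01"},
    ['level=info msg="server started"'],
)
print(json.dumps(payload, indent=2))
```

An agent would POST this body with `Content-Type: application/json` and, on success, get the 201 response discussed below.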
So people want to trust what data they have written to Loki. And then if that is a successful write, it'll say, okay, 200 successful, well actually it says 201, and then it sends that code back to Promtail, and Promtail is the agent. So Grafana Loki has Promtail as the default agent, but actually there's a bunch of agents that you can use with Grafana Loki, but Promtail is the one that comes out of the box, and when it receives the 201, it'll send the next batch of log lines from your server. If, for example, there's something wrong, then it will actually do an exponential back off. So that means that it retries, but it's not gonna like every minute hammer your server because you can imagine, right? If you have like thousands of Promtails and they're all hammering the same server at the same time, and let's say the server or the cluster is actually just coming up after there was kind of a big failure, that's of course not a nice, healthy way to actually return to service. So that's why we do exponential back off. And if all goes well, the write path microservice actually writes the index and the chunk, which is the log lines, to object storage. And the read path is very similar. It reads based on the queries that Grafana gives. So Grafana will query it, we will load balance the query, and then it will send it to one of the read path microservices. And what's really cool is that we made it like this, that if you have multiple read paths, it actually will distribute the query over all the read path microservices that you have available. So when we say, hey, you can choose your own performance adventure, you can actually replicate the amount of read path microservices and that can increase the performance of all your queries. And of course, right, if you're like querying only a very small piece of data, right, it doesn't really make sense to actually split that load up. It makes only sense that you scale up when you have enough data that you're gonna query.
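The exponential back off behavior described here can be sketched in a few lines. This illustrates the general technique, not Promtail's actual retry code, and the base, cap, and jitter values are made-up defaults:

```python
import random

def backoff_delays(attempts, base=0.5, cap=300.0, jitter=0.1):
    """Yield exponentially growing retry delays, capped at `cap` seconds.

    A small random jitter spreads retries out, so thousands of agents
    don't all hammer a recovering server at the same instant.
    """
    delay = base
    for _ in range(attempts):
        yield delay + random.uniform(0, jitter * delay)
        delay = min(delay * 2, cap)

# The first few delays roughly double each time: ~0.5s, ~1s, ~2s, ~4s, ...
for d in backoff_delays(5):
    print(round(d, 2))
```

The cap matters as much as the doubling: without it, a long outage would push retry intervals out so far that recovery takes unnecessarily long.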
So let's say you're gonna go through that eight terabytes, then it might actually make sense to not have eight read paths, but you can actually do 16 read paths and that will actually significantly increase the performance of your query. And then we have this yellow box and that yellow box takes care of some administrative services, which is nowadays, for example, if you issue a delete, so the delete is actually not taking effect directly. It is actually something that we do in the background and we are actually also thinking of maybe introducing kind of a compactor so that we can make the data that's actually written to object storage maybe even more efficient for reading later on. So because now, yeah, go ahead. Sorry. - We have two questions about object storage actually while you're on the subject. Prashant Singh says, "Need detail about object storage. Any OSS object?" And also, "Can we use object storage with Unix host server?" - Ah, yeah, great questions. So yeah, it depends on object storage. I think there is a mode in Loki that you can actually use file storage, but it's mostly more for development purposes. And indeed, thank you, my long lost relative from - (all laughing) He's actually giving that- - That's secretly you, isn't it? - Yeah! Just my alter ego, indeed. Yeah. (all laughing) - Also, uh- - Yeah? - I'm sorry, go on. - Yeah, no, what I want to say is that you can use MinIO definitely, and that definitely works. But the thing is that it's only as fast as your storage is actually capable of. And what you actually see in a distributed application is that storage is, it's very easy to make that a bottleneck. So a lot of like cloud providers, they optimize their architecture to have compute that's separate from storage and they optimize the heck out of it. So that's why you have incredibly high throughput from your object storage to your compute. And in this case, these are the microservices.
And so what you see with customers that are doing a lot of Loki, a lot of logs, a lot of querying, if the performance of that object storage and the throughput of that object storage is not optimal, they will not have an optimal Loki experience. So it's really important to make sure that if you have object storage and performance is important, make sure that it's either on the public clouds like AWS or GCP or that you benchmark it in a way that's actually in line or quite close to what public cloud providers can offer. - Okay and we have a question from mbaykara, who says, "Please reduce the number of Helm charts for Loki. Why too many Helm charts out there? It is confusing sometimes, especially for newcomers." - Yeah, I'm totally not gonna defend that. Yeah. (Paul laughing) Now so we've been through a few iterations with Helm charts. I think we now have like the one Helm chart that we say, okay, this is gonna be the Helm chart going forward, but there might be still some traces left and some documentation links left to the older Helm charts. So apologies for the confusion. - Where is that? - Yeah, that's a good one. I need to Google that one, but let me do that. - My guess, that's just simply because of all the different deployment options. I mean, if you want multiple reader, writer and all that, or if you want just a kinda, I hate to say monolithic, but you know, the monolithic functionality where it's all just kinda one big thing. - Yep, yeah, exactly. And I'm not sure if I can join the chat, can I put it here on YouTube, the link? Oh no, I need to connect to YouTube for that. - Oh, we can move it over. It's okay. - If you can move it over. - Okay, I just did that. - Yeah, indeed, so Paul, you're correct. So we do allow you to run, for example, in microservice mode.
So we still have multiple deployment options, but if you are asking me, okay, which one should I pick to get up and running with Loki, I would nowadays actually pick the scalable version, which is the one that I've linked there. And yeah, I totally agree. It's like a little bit confusing and we're actually trying with the entire Loki team to improve on that documentation. But yeah, try that one out and you should be able to be up and running. - I would say also the link that I just posted in chat is to the Grafana docs. I think that that's a good rule of thumb. Like we're not the only ones who can create Helm charts, you know, everything's open source. So I've seen a bunch of like third party ones and they change a bunch of different options depending on what you need. I think the Grafana docs is a good place to start. That's always going to hopefully be the most updated one that is good for most use cases. - Yep, yep. Yeah, exactly. And if people want, so shameless plug, I sometimes also make some videos around Grafana Loki. So there's actually a Grafana Screencast playlist and one of the videos is indeed a video about the simple scalable deployment mode. So this is a one year old video. So some things might have changed, but definitely look at it because a lot of the concepts still remain the same. And this might be a nice quick start for getting you up and running with the Helm chart. - Nice. - So maybe let's go back a little bit. How do you actually install Loki? We talked about Helm charts for Kubernetes and you also talked about there being a binary. - Yep, yep. Yeah, so if you just go to the Loki GitHub repo, there are releases and there is a release which you can download, which is just a single binary. And if you follow the instructions, you're just a single command away from an up and running Loki server, so that's great. And then you can start writing data to it.
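As a sketch of what a Helm install of the simple scalable mode might look like, here is an illustrative values file for the consolidated `grafana/loki` chart. Field names vary between chart versions, so treat this as a starting point and check the chart's own documentation rather than copying it verbatim:

```yaml
# values.yaml -- illustrative, not a tested configuration
deploymentMode: SimpleScalable

write:
  replicas: 3   # at least three writers, matching the three-way replication
read:
  replicas: 3   # scale this up for more query performance
backend:
  replicas: 3   # ruler, deletes, and other administrative services

loki:
  storage:
    type: s3    # object storage backend (S3, GCS, Azure, MinIO, ...)
```

You would then install it with something like `helm repo add grafana https://grafana.github.io/helm-charts` followed by `helm install loki grafana/loki -f values.yaml`.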
What I typically would recommend if people want to have it even easier is actually to go to Grafana Cloud. So for Grafana Cloud, we have a completely free tier and it actually comes with quite a generous amount of logs. I think we do like a hundred gigs, something like that, for a month. So that means that you can just send almost 100 gigs of logs per month to Grafana Cloud, to Grafana Cloud Logs, which is powered by Loki. And then that's it, you're done. So that is an even easier way, but if you want to install it yourself, then definitely go with the binary. - That's the one that I took advantage of is the free tier. So if anybody wants to go to javaducky.com, 'cause I'm not testing the rate limits. (laughing) - I'm just gonna write a k6 test right now. (all laughing) - Exactly, exactly. - Another question about the integration between Loki and alerting, Grafana alerting. - Oh that's a nice one. Yeah, so we definitely wanted to have it integrate with alerting. And the cool thing is we actually based Loki alerting on the Prometheus alerting. Yeah, how do you say that? Strategy or solution. So Loki actually also works together with the Prometheus Alertmanager, and actually configuring a Loki alert and a Prometheus alert is exactly the same. The only difference is that the Loki query language is of course slightly different than PromQL. But what you do is you actually create a metric out of your logs using LogQL and then based on that, you actually create a nice alert out of that. So I can actually, if people are interested, I can actually just show you what that could look like. - Awesome. Yeah. - So you're actually doing that inside of Loki and not inside of Prometheus? So it's not one of those things where it's like you're creating the metrics and forwarding into Prometheus and then Prometheus then does the alerting? - So the Prometheus Alertmanager is actually the component that does the alerting, yeah. - Ah ha.
- Loki will integrate, and so the alerting, so the ruler, which is the component that actually evaluates the Loki queries and then sees whether it needs to trigger an alert. That is like a Loki component, but it interacts with, in this case, the Alertmanager. So let me see if I have a nice example recording rule. Okay, yeah, I do have, so let me show you this one, but this is from my test environment and what you can see here is that you can actually create, and this is actually a recording rule which is similar to an alert rule from Loki. And actually I could create an alert rule out of that. So what you need to do there is you create a new alert rule. So Loki test rule. You're gonna select the Loki data source. So in this case, it would be Loki Cloud. And then you write a query. And that is what I mean by writing a LogQL query. So this is one of those LogQL queries. And what we do here is we are looking at, and this I can make bigger, so that's maybe more useful. Here, we're actually selecting a log stream, so I'm actually looking at maybe a certain agent that I'm interested in, and then I actually can find, okay, every log line that contains error. And if that increases, I want to actually be alerted on it. So what we're doing is we're counting, for that last five minutes, we're counting how many log lines have an error in it, and then we can actually say, okay, if the threshold is above five, then we're gonna alert. So that is my alert condition. And then I can then, just the same as we do with the Prometheus alerting, we can add some additional metadata. So this is the metadata that gets sent with alerts. And that can then be used in your alerting and your on-call configuration. So where you can say, okay, this is the type of alert, this is why it triggers and maybe a runbook that you reference with, okay, this is how you typically solve this issue if it appears. So this is just a quick way on how to do Loki alerting.
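The alert Ward describes, counting error lines over the last five minutes and firing above a threshold of five, would look roughly like this as a Loki ruler rule. The rule file format is the same as Prometheus's; the stream selector, labels, and runbook URL below are placeholders:

```yaml
groups:
  - name: example-loki-alerts
    rules:
      - alert: HighErrorLogRate
        # Count log lines containing "error" over the last five minutes;
        # fire when the count goes above five.
        expr: sum(count_over_time({job="my-agent"} |= "error" [5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Elevated error rate in my-agent logs
          runbook_url: https://example.com/runbooks/high-error-rate
```

The `expr` is the LogQL part: a log stream selector plus a line filter, wrapped in a range aggregation that turns logs into a metric, exactly the pattern described above.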
- Okay, another question here about the new Helm charts for Loki. So mbaykara says that charts "Come with a component called backend other than just the write and read components. So what does the backend do?" And Prashant says that they're also confused about that. - Yep, yep, so the backend takes care of all those administrative processes. So it also runs the ruler. So the alerting is actually something that is actually running. It's not part of the write path, not part of the read path, but it is a service that needs to be running, right? And that is in those backend processes. And also when people want to issue deletes, those deletes are not done synchronously. So these deletes are async and they work on an interval. And that is also taken care of by that backend service. And you'll actually see, like I mentioned before, that there's gonna be more functionality added to that backend service, like maybe optimizing the way data is stored in object storage. Great questions by the way. - Yeah, I'd like to do a bit of a wrap up of everything that we discussed, but before we do that, do you have anything else that you want to show? - Oh, that's a good one. Yeah, one thing that I typically always like to show, and again, shameless plug, this is a dashboard that I created, but it's also a very helpful dashboard. So if you go to play.grafana.org, we have a lot of examples of dashboards and tooling that you can play around with from this hosted Grafana. So this is a public Grafana that people can use and they can play around with it. And this NGINX dashboard is actually a dashboard that's created based on data that's actually coming from one of my own web servers. So I have a pet project and I just send the NGINX logs to, in this case, Grafana Cloud Logs. And you're actually just looking at the logs from my websites, so that's pretty cool. And everything you see here on this dashboard is actually created from Loki.
So we're actually creating those metrics from those logs ad hoc. And the cool thing is that I actually have two versions. So there is actually a pattern version that works with space-delimited log lines. So you can actually see how we would handle a little bit like semi-structured logs. And there's the JSON version that I was showing, which is actually the one that shows how to work with structured logs. And there you can, hey, if you're interested and you're learning Loki, it's a great way to just click into those type of panels and see how that log line or how that query is being built up and what kind of functions are being called to make sure that you get a very similar dashboard. And you can actually download it from the Grafana dashboard directory, so it should be available there, as well. So definitely check that one out. It comes with a lot of great examples to kickstart your Loki journey. - Awesome, and actually next week, we are going to be talking about Grafana Play in general 'cause there's a lot of other examples other than just Loki stuff. And it's a really great way to just play around with Grafana without any commitment, without having to sign up for anything or pay for anything. - Yeah, that's great. And let me give you one secret. There is actually a game within play.grafana.org. - Wait, I didn't know that. - Yeah, there is a game that you can play, a very famous game. - Oh, I know what it is. - Maybe the next host knows it. I don't know. - Oh! (Paul laughs) Oh, oh, I think I know now. - But let's keep that as a cliffhanger for our next episode. - Okay. Okay. - Let's hope that everybody will watch the next episode, as well. - So I didn't tell either of you this, Paul might already, you know, have guessed because I've done this before. I'd like to play a game. (laughing) I have this thing called Ultraspeaking where there is a bit of a podcast game.
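For reference, the two flavors of parsing mentioned here, structured JSON logs versus space-delimited lines, look roughly like this in LogQL. The `job` label and the pattern itself are illustrative, loosely modeled on an NGINX access log:

```logql
# Structured: parse JSON fields, then count requests per status code
sum by (status) (count_over_time({job="nginx"} | json [5m]))

# Semi-structured: name the space-delimited fields positionally with
# the pattern parser, then filter on one of them
{job="nginx"}
  | pattern `<ip> - - <_> "<method> <uri> <_>" <status> <size>`
  | status >= 500
```

Once a parser has turned parts of the line into labels, the same range aggregations and label filters work regardless of the original log format.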
And what I'd like to do is spend the next few minutes summing up what we've discussed just today on this Grafana Office Hours episode. So Ward, just so you know, we're going to be, our names are going to be showing up and we have to switch. The premise is we are doing a podcast and we just have to talk about Loki, which is what you already were doing. - Okay. Let's see. - It's also gonna be timed. (laughing) - Yeah. - Let's do it. - Not to put any pressure on this- - Oops. (Paul laughs) Let's not suggest a title. Okay. (Paul laughs) So... I'm gonna start. Okay. So we've talked today about what a log is and how it is different from metrics. A lot of times, things that we do when we're developing, we start with logs. We might just have like, we might just be debugging something in our terminal. - Or collecting logs from multiple applications all going to a single location. Come on, switch over 'cause I'm like lost here now. (laughing) - Yeah, because the alternative is that you actually need to log into all kinds of nodes and then manually grab the files and the files might not even be there. So it's really important that you actually aggregate them and make them available so that also all your teammates can actually take advantage of that centralized log store. - There are a few issues regarding logs that Loki solves. The first is that logs are notoriously difficult to parse. Sometimes logs are unstructured. Sometimes they're semi-structured. - And also, yeah, or strictly structured with a JSON format where everything is basically labeled and that can provide easier indexing. Oh come on. It seems to go so long. (Ward laughing) - And the cool thing about Loki is that you actually don't need to worry too much about the indexing. We just need a little bit of metadata and just the raw log line.
- And that is actually one of the features of Loki, that unlike other solutions, Loki indexes just the metadata and not the full text, which means that it has real performance benefits, which leads into the cost effectiveness of the tool, as well. - Right, and it's all about the object storage, so that's keeping those costs down, but it does imply that there's addressing total, oop. - Yeah, and the costs are, of course, the costs are not only about the amount of resources that you need to spend on the CPU and the RAM, but also a lot of like people that need to keep the servers up and running and of course need to tweak the performance all the time. - Those operational costs come into effect, as well. And also distributed computing in general means distributed logs, so you really need some sort of log aggregation to put it all together. Now how do you actually install Loki? Well, there are a few ways. One is that you can use the binary that you can download from the repo. - Or you can use one of several Helm charts to actually install the microservices setup. But we're actually working on consolidating that to a more singular, better option of a Helm chart, which is displayed on your screen. - Wonderful. And if you just want to kick the tires of Loki from kind of a usage perspective, I definitely recommend creating a free account on Grafana Cloud and starting with the hundred gigs of logs per month included. - Yeah, I think also another way that you can just quickly have a look at Loki already up and running is by going to play.grafana.org. You don't even need an account. You can just play around and even see how Ward's web server is doing. (laughing) - (laughs) And you can also really use that to learn about a lot of the different dashboard functionality that you can use with Loki. Yeah. - And today, we definitely touched a bit on Grafana Loki. If you want to learn more about Loki in a much more in-depth fashion, we actually provide also workshops.
So go to Grafana.com and there's probably some button about workshops and we typically organize those workshops frequently. So definitely sign up if you're interested to know more about Loki. Sorry, Paul, I'm talking through your time. Go ahead. (Paul laughs) It's your turn. - It's all good. - Go for it. - Yeah. And you can use those to learn the LogQL syntax and just get real efficient. - That LogQL is based heavily on Prometheus because we think Prometheus is awesome and we often say that Loki is like Prometheus, but for logs. - (laughing) Yes, Prometheus. And then these are all based on what is it? Greek gods or no wait, Nordic gods is what we try to stick to for the product names now. - Yeah, and I still think that the name Loki was chosen because Loki's a trickster god that's a shape-shifter, just like you need to shape-shift your logs to make them useful. - Yeah, and that is now the final real history of the name of Loki. From now on, that's gonna be like our origin story. - From here on out. - Wonderful, from here on out. And I would like to really thank everybody for attending. It was great. Thanks so much to my entire family for also asking questions. (all laughing) And thanks, Nicole and Paul, for hosting. It was lovely. (Paul laughing) - Thank you, Ward, for coming to join us. And if you'd like to know more about Grafana Play, then we will be back next week at the same time with a different guest to talk all about just the easiest way that you can play around with Grafana without installing anything and without any data of your own. So check back for that. - And ask about the game. Ask about the game. - Yes, the game. - Yes, all right. Thank you, everybody, for watching and good luck finding that game. Have a good weekend. - Cheers, folks. Thanks so much. - Bye, everyone. - Bye-Bye.
Info
Channel: Grafana
Views: 2,688
Id: OLebNPLIJMI
Length: 63min 10sec (3790 seconds)
Published: Sat Aug 26 2023