- Hello, everyone, and welcome back to another
Grafana Office Hours. As always, I'm Nicole van der Hoeven, and I am a developer
advocate at Grafana Labs. I'm with two of my colleagues today. - I'm Paul Balogh. I'm another developer
advocate at Grafana Labs. - Yeah and I'm Ward Bekker. I'm a solutions engineer at Grafana Labs. Thanks for hosting me today. - (chuckles) Awesome. We're glad to have you here, for sure. Tell us what you do. What is a solutions engineer? - That's a great question. What do we do? Well, we talk about a lot of things that we're not really expert about. (all laughing) - Wait, I'm just a developer advocate. - Hey, join the club.
- Yeah, exactly. I'm just joking a bit there. No, so it's actually a very fun job. So we are technical, so we typically have
a technical background and we are part of the pre-sales process. So that means that if a
potential customer comes to us, to Grafana Labs, they have all kinds of
questions about how to best use our tools. Maybe they want to do a POV or a POC, and we just help them
and we just assist them. And like I said, we try to be experts and try to know everything
about everything. But especially nowadays with Grafana Labs, with the big platform, there are so many things that you don't know, so we also are like the
gateway to engineering. So if somebody has a
question that I don't know, I can just reach out
to all the great people within Grafana Labs and
get them also on the call. - Nice. - And hopefully help the customer, or the prospect that
wants to be our customer. That's the goal, that's the goal. And then I'm happy. - So then after they're already onboarded, are you hands off? You're like, "Leave me alone. Don't bother me. You already paid." (laughing) - Yeah, that would be easy. But also, no, that's not really, so we do have a lot of
folks that help customers when they are indeed, when there's a contract signed. We have like a customer success team that is in regular
contact with the customer to make sure that their
goals are being achieved. We have support. So there's a bunch,
like busloads of people that are actually helping
the customer be successful. But we do interact sometimes. For example, there's
always some new use cases that people want to adopt so
we're gonna tell them about it. And we also have, of course,
quite a lot of context. So we build up a kind of a
relationship with the customer. So yeah, we know where
they're coming from. We know a little bit
what their use cases are. So we can sometimes help
with like solutioning, architecture, and that kind of stuff. If they want to, for
example, migrate to new systems or yeah, adopt new technologies. - Yeah, that's awesome. - So today we're going to
be talking all about Loki. And I'm gonna be completely honest here. I'm a performance tester. I haven't used Loki. And I've kind of just
taken advantage of the fact that Loki has already been set up for me. So I'm on the side where
I'm watching everything going into the dashboard. So I'm gonna be asking
you all of the questions and if you can answer them,
that would be awesome, but if not, that's fine too. (laughing) - Okay, let's see if we
can make you a Loki expert. - Oh, an expert?
- In just one hour. In just one hour. - Yeah, sure. I can totally do that. (laughs) How about we start from the beginning? What is a log and how is it
even different from metrics? 'Cause I think most people
when they think observability, the first kind of pillar that
they think of is metrics. So how is a log any different? - Well, actually, I think I disagree, I'd say that actually people
start typically with logs. - Logs. - Because if you're a developer, what's the one thing you do to actually troubleshoot your program, you say print line or print, and then you get a log
and you get an output. - Do you do the like console log and then here and then here now, and then you get a bunch of
different logs that just say, that make no sense to
you, to anyone but you? - Of course not, I never do this. I always...
(Nicole and Paul laugh) I'm a professional developer. I strictly go into the debugger and never use print line
for my troubleshooting. Exactly.
- Oh, yeah, me either. - It is something that I
definitely do and I still do. And that makes a lot of sense, right? Because it is very easy. Of course there are more
advanced ways to troubleshoot, but people do that. And if you are developing a program, you sometimes don't even
know what kind of metrics you actually want to monitor, right? So you typically see
when you are developing that people are starting,
okay, let's just start logging. And then later on, they get a better
understanding of the program, a better understanding of the KPIs they want to actually get metrics out of. And then you see people
typically moving to metrics. So I would say that people
typically start with logs as the first pillar. And of course, right, that depends a bit on whether you are monitoring
your own application or are you monitoring something
that comes outta the box, like Kubernetes, because
Kubernetes of course, already has that Prometheus
monitoring built in. So it has Prometheus exporters, all the metrics that you need, but still you also want
to get logs from it. So, and how are logs kind
of different from metrics as you were asking, is that logs are more like events. So it contains a lot of
contextual information. So all the information about
that specific recorded event, and if you compare it to a metric, it's just one data point. So typically when you are
like troubleshooting a system or there's an incident in a system, you see on those metrics, you can see, well, my error rate goes up. That doesn't look good. But you can't know from
just the metrics alone what actually goes wrong
in that application. For that you actually
need to go to the logs because logs contain all the detail. And for metrics, it becomes really hard to
troubleshoot a program, at least for the unknown unknowns. Because there might be some
cases where you're just like, okay, I see like a metric going up, I just need to scale up. And that doesn't require any logs. But for a lot of troubleshooting, you actually want to go to the logs. - I think that's a good point. I think I'm like thinking like a tester, not as a developer first, because with testers, we're very focused on like thresholds and how do you say if
something has passed or failed? And usually we've already done the work to set up things that we
know we're going to need. Like if we know we're
gonna need CPU utilization, then we're going to have a metric that tracks specifically that. But you were so right,
when you talk about logs capturing the unknown unknowns, because if you knew
that there was something that you needed to be watching out for, then you should probably, you might already have a metric for that if it isn't already set up for you. But logs are kind of like a catchall for the things that you don't even think that you're going to need and they kind of tell
more of a story, right? It's not just a number. It's like what happened in what order. - Especially when it's a stack trace. Those are the fun logs.
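To make that contrast concrete with an invented example: the metric only tells you *that* something failed, while the matching log event tells you *what* and *why*. (The service name, labels, and error text below are hypothetical.)

```text
# a metric: one data point, no context
http_request_errors_total{service="checkout"}  42

# a log event: the story behind one of those 42 errors
2024-05-02T10:14:07Z level=error service=checkout order_id=987
  msg="payment declined" upstream="payments-db-3" err="connection timeout"
```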
- Oh! (all laughing) - I went there, I went there. - Yeah, everybody wants to have like a thousand-line stack trace. Java, I'm looking at you. It's like. - Yeah, JavaDucky. Paul's handle on a lot of social media, social networks is
JavaDucky, just so you know. - Oh. - I was in Java space
for quite a few years. Just a couple. - Yeah and going back
to that point, right, metrics versus logs, what is I think interesting is that you can actually
use your logs based on that, to actually create metrics. So one of the things that I
typically recommend people to do when they use something like Loki is that it is very easy
to actually build metrics on top of your logs. And then maybe as a next step, you could actually make them
like Prometheus metrics. So first you're gonna use
your logs as the data source and then later on, you might want to make them
a little bit more official and create Prometheus exported metrics. So that could be a way to start with logs, end up with metrics. - Hmm, so we've already touched on a few difficulties regarding logs. The first is that you
need to have some way of making sense of it
'cause logs can be verbose, like stack traces, and it's one thing to have the logs, but it's another to have
your memory fill up because of all of these unusable
stack traces or whatever. So it's also about making sure that you get only what is necessary and really nothing else if possible. And the other thing is how
to make metrics out of that. So can you give us a bit of a, like an example or something of what would be a log that
you can make a metric out of? - Hmm hmm. So typically people divide
like the formats of logs. So I'm assuming like logs are text-based. So that is starting from there. You typically have like
three types of logs. So there's completely unstructured. So, the question is if a stack trace is completely unstructured, I don't know, but it doesn't have a lot of structure. Of course, it is a stack
trace, but it's not like JSON, which is a complete structured way, right? Because it contains a very rigid format where you have very specific
key value pair mapping and that in a kind of a hierarchy. So that is like structured
and there's some formats that are like comma
separated or whitespace-separated files, like what you typically
get from Apache, Nginx, that kind of stuff. It's a little bit in between. So we call that typically like semi-structured logs. And yeah, some of those logs are a little bit more difficult to parse so you need a lot of
flexibility in your system, in your log aggregation system to be able to handle that and actually to extract
metrics out of that and to get value out of that. And it's actually some of the things as we're talking about Loki today, it's actually some of the things that we try to do with
Loki very, very well. So that it's actually easy to
work with all kinds of formats and still get value out of it and still are able to
build up those metrics out of logs quite easily. And doing that because maybe
people are like familiar with large ETL processes
where there's like, I dunno, they use Spark, they use all kinds of analytical tooling to transform the data into
their very specific schema. And actually what we wanted to do is make it easy for people to do this stuff without a lot of pre-processing needed. It's just like, okay, you're gonna decide, you're gonna first ingest
everything into Loki and then you're gonna later decide what you actually want to do with it. Because again, if you do
that upfront processing, that might be, eh, you might end up with a lot of very
specific, very useful logs. But at the same time, you're also optimizing for
the problems that you know. And going back a bit to the
point that you made, Nicole, it's like sometimes
it's kind of a security, oh, how do you say it? Insurance policy, to actually
have just all the logs because there might be some
stuff that you have missed and you just want to go back and maybe it contains a lot of details about that specific incident that you of course didn't imagine that it would happen.
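That ingest-first, decide-later approach can be sketched in LogQL. The label names and log shapes below are invented for illustration; `|=` line filters, the `json` and `pattern` parsers, and `rate` are the building blocks Loki provides:

```logql
# unstructured logs: a plain line filter, essentially a distributed grep
{app="checkout"} |= "timeout"

# structured JSON logs: extract fields at query time, no pre-processing
{app="checkout"} | json | status >= 500

# semi-structured access logs (e.g. Nginx): the pattern parser
{job="nginx"} | pattern `<ip> - - <_> "<method> <path> <_>" <status> <size>` | status = 500

# and the logs-to-metrics step mentioned earlier: error lines per second over 5m
rate({app="checkout"} |= "error" [5m])
```

All of the parsing happens at query time, which is exactly the point: nothing has to be decided when the logs are ingested.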
- Right? - So, that's why we're trying
to be very flexible with Loki and try to support all the formats. It just needs to be text-based. We're gonna ingest it. And later on, at query time, we are gonna help you
getting value out of that as easy as possible. - Why is it called Loki? (all laughing) I don't know if you know. - I know. It's actually quite funny. We actually wanted to call it first Tempo. So for the folks that
know a little bit more about the Grafana stack, Tempo-
- Oh wow. - Is the name of our tracing solution. Yeah, exactly. And I think it was named by Tom Wilkie, who is the creator of Grafana Loki. He's our CTO. And I think we chose to call it Loki because first of all, it
is a Nordic god, yeah. So we have like a preference for Nordic gods.
- Yeah. - Like Prometheus, right? It is also stealing fire
from the Nordic gods or actually that was not Nordic gods. That was actually from the Greek gods. Mount Olympus, I'm not- Yeah, so I don't know why
we switched to Loki then, but anyway it is a god. It is from the past. And it is starting with an L. So we have like an LGTM, which is an acronym that
we use for the LGTM stack, logs, Grafana, traces, metrics, and yeah, it just sounds great. So that's why I think we called it Loki, but why we typically went with Loki, I think it was just one of the first gods that actually had a name
that started with an L. That would be my suspicion. (Paul laughs) - So I have no idea at all
why it was called that. But I did wonder after the fact, this is just coming from me, Loki's like the trickster god, right? He's also a shape-shifter
in Norse mythology. And so I was wondering if like
it was the shape shifting, like changing something, something that could be
difficult to understand into something that's
easier to understand. I don't know. That's my theory.
- Oh. - That's actually pretty good. (laughing) - Yeah, we might just
want to run with that and just rewrite history a bit. - Sounds good. (Nicole laughs) - There was a lot more thought. There was deep thought into that. - Yeah, never let actual
history get in the way for a good story, right? - Yeah. (laughing) That's totally why it's called Loki. - Yeah, exactly. - So what exactly is Loki? - Yeah, that's a great question. So Loki is a log aggregation tool. So for the folks that might know typical systems like Lucene, on which you have Solr, you have Elasticsearch, maybe Splunk, those are all log aggregators. And the goal of log
aggregation is quite simple. You just pull all the logs
from all kinds of systems that you want to monitor and you put it in one place and then you need to be able to query it and ask questions to that
data to gain insights. And that is what Loki does. - Yeah, that was something
that was mind blowing to me when I first got, well, it started with
an S, rhymes with funk was my first exposure to an aggregator. But because yeah, back in the old days, 'cause I am kind of old, we used to have all these
secure shell terminal windows up and you know, so if we
had a distributed system, we're sitting there tailing logs from five different machines and you know, now with this aggregation, it's so nice that they're all
getting forwarded to that system and then you can go to, you can have one browser
window up and open and seeing all this data coming through. But yeah, we only had 17 inch
monitors and we liked it. It was all we had. (laughing) - Yeah. Exactly. So that was, if you don't
have a log aggregator, that is kind of how you do it, right? You just try to log into that server, you need to find, okay, where did that
program store the log files? Are the log files still there? Because-
- Right. - Disk space might be limited. The file might be rotated out. So it was really hard, right, to find and to be productive with that. And especially nowadays when
you look at cloud native, Loki is really designed for
it to be cloud native. And one of the things is that a lot of that data is ephemeral, so it might already not be there anymore. Or if you look at serverless, there's no server to log into, so there's no server anymore. So you actually need to
send the data somewhere. So you typically want to do
that to a log aggregator. - Yeah, I'm glad you brought
that up about the ephemeral 'cause yeah, with these pods, I mean they go away, then there went your logs too, unless you're writing to a
persistent volume or something. But, yeah, no. - Yeah, and of course, the benefit is also because you have all
your logs in one place, it is also easier, right, if there's an issue that actually
involves multiple systems to basically, all the puzzle pieces, to make it come together.
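As a sketch of what pulling those puzzle pieces together looks like, one LogQL query can search every involved service at once (the labels and search term here are invented):

```logql
# all services in one namespace that touched a particular request
{namespace="prod", app=~"cart|payments|shipping"} |= "order_id=987"
```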
- Yeah. - In one UI or with one query. And because when you
have a distributed system where you have one application maybe running on multiple nodes, and you might, when you round robin
or load balance requests, there might be one piece of
the request of a customer going to one node, the other
piece to the second node. Yeah, you still see only a partial request if you would log into
those individual nodes. So that is also one reason why you want to bring it all
together in one single place. - Yeah. - Even in performance testing, on the side of the load generation, it's really, it's quite
common for you to have dozens or hundreds of load generators, each of which is also
potentially spewing out logs. And sometimes when something goes wrong, I need to not
just be able to find that part in the logs, but also identify like
which load generator actually encountered that error and what else was it doing at the time? So there's also, there is aggregation, but it still needs to be identifiable, otherwise, it's no longer useful. If you have like a hundred load generators and you know one of them
encountered an error, it's like, well, I don't
know how to troubleshoot that unless I know which one it
was and what it was doing. - Yep, yeah, you need to
know where it came from. And that is also something that we ask when people are sending data to Loki, log lines to Loki is you're
just gonna send the log lines, but we do ask you to put
a little bit of metadata where it says where it came from. So maybe, of course the nodes, the server, the team, the data center, maybe the application, all
that kind of information. That's really good to know. - Hmm hmm and grep for the win for sure. (all laughing) Hey, and I just gotta
make an observation here. Is this a plant? Do you have a plant in here? The name sounds vaguely familiar. - I invited all my family. I invited all my family, yeah. - Wow. - Actually, I know Ruan Bekker. He's actually a very nice
guy from South Africa. We're probably like somewhat
related at some point, I don't know, but he is
not directly related, but great to have you here, Ruan. - I was gonna say, wait, your family understands what you do? How did you manage that? (laughing) - Oh, no, no, no, no. Yeah, if I explain to my
family indeed what I do, they get very sleepy and they're very quickly distracted. - Do they just say, "I have
a problem with my printer, can you help me?"
- Yes! Yes. - Yeah, well that's my
function in my house. My daughter and my son, they like to play a lot of games online and I'm the resident IT expert, indeed. I help them.
- Yeah. - With all game purchases and
network connectivity issues. (Paul and Nicole laughing) - So why don't you tell us a
few more things about Loki? Like what-
- Yeah. - What makes it, how is it different from those other alternatives
that you mentioned? - Yeah, that's great, right, because there are already
a lot of those systems. And also when folks started
out developing Loki, there were a bunch of
ones, a few popular ones. So why did we, as
Grafana Labs, build another log aggregator? Well it turns out that
a lot of those systems that are very popular, they don't really fit our way of working. And when I say, our
way of working is that, first of all, at Grafana Labs, we typically develop
systems that can help us in better operating our cloud products. We have Grafana Cloud and we are big users of
Grafana and Prometheus. And we just found there
was like a mismatch in the way that most of
those logging systems worked in combination with Prometheus. What we actually wanted to do is build a kind of a log aggregator that would work great
together with Prometheus so it would like be very similar, so people that come from Prometheus also could work with Loki directly and it also would need to be integrated with Grafana very well. So yeah, Grafana is kind
of the first choice as UI. And there were also two other things that are really important for us. First of all, what we noticed at Cloud
native and microservices, they generate a huge amount of logs. And typically those
existing systems are great, they have a lot of functionality, but they're not really equipped for handling that amount of logs. And even if they are, the cost, the TCO, the total cost of ownership, was really, how do you say? I dunno, it was really
expensive basically, to collect all those logs and also operating such a large system so when you talk about
multiple terabytes a day, it was also very difficult. So for the folks that maybe are here today on the Office Hours know about like Solr or Elasticsearch, when you have a large cluster, and especially in the
past, and of course, right, those companies, they're great. They make great products. They also innovate their products. But in the past it was like a lot of JVM garbage collection optimization and it was also really hard. I see like Paul, like yeah- - JVM, yep. - JVM, you gotta love them. And it was also very fragile. So that means that if
there was something wrong and maybe somebody misconfigured writing, maybe was writing much more than they typically do, it would maybe affect the query path, and also the write path
could affect the read path
and that kind of stuff. So there were actually all
kinds of things that we thought, okay, well we might be
able to do that better. And what we also wanted to do
is make it so cost effective. So what we actually did is
we wanted to build a system that was built for Kubernetes and would use object storage as a very cost effective, very durable way of storing a huge amount of logs. And we wanted to basically build a system that can handle petabytes and not like, oh wow, you have a few gigs. It's like, no, we really want to make sure that we are able to handle
that huge volume of logs that modern systems generate. - And that's kinda like that
insurance policy, right? You know, keep all that there because you might use it in the future or might need to lean on
that data in the future. - Yeah, and the thing is, if you make it as cheap as possible, as cost effective as possible, cheapest may be the wrong word because if you store
multiple petabytes per month, that is still not cheap, but it is very- - Not free. - No, it's not free, but
it is really cost effective and it's also very durable so it means that you really know that you're not gonna accidentally delete one of those critical pieces of logs. And of course, right, some logs are more valuable than others because some of them are very chatty but probably are never
used in troubleshooting. But well, if it is
relatively cheap to store, it might make sense to
just get a little bit more than you would normally do
when you actually need to actually look at, okay, I pay for a pretty high
price for every gig so then you want to be, yeah, you might want to
say to your developer, "Stop logging."
(Paul laughing) And we see that. Yeah, you see that with a lot of companies where they actually say it's like, "Folks you cannot log. Don't do that." - Yeah. No, speaking from experience, I know, we were in that
kind of a situation and you know, we had
something with like a, oh, I don't know, a Kafka consumer or publisher was just
spewing logs like crazy and one of our SREs would
send out a message and say, "Stop it for the love of God! Stop it. You're blowing out our license," you know? - Yeah, and that can have
a lot of like big effects. So not only kind of a monetary effect, but also like impact the
rest of the core use cases that are running on those platforms. - Yeah.
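Loki can even help find those chatty offenders. A hypothetical query using LogQL's `bytes_over_time` and `topk` (the `namespace` and `app` labels are assumptions):

```logql
# top 5 apps by log volume over the last hour
topk(5, sum by (app) (bytes_over_time({namespace="prod"}[1h])))
```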
- So, yeah. So that is actually
why we wanted to create kind of a better system. And yeah, I think we
achieved that with Loki. And you also see that in
the years that, I think, I don't remember exactly where, what dates we actually released Loki, but I think it's now like
around like five years that Loki is up and running and you see really massive adoption. So it's really great to see that a lot of folks are using Loki and are looking at Loki as kind of the default log aggregation to use in combination with
Grafana and Prometheus in cloud native situations. - Yeah, one of the big pluses too, isn't it, that it's like
a kind of leveraging the, I mean it's LogQL and it's similar to like PromQL? Isn't that one of the things to try to keep them a little bit similar and kinda leverage that, like, I guess, you know, understanding? - Yeah, exactly. And so maybe it's good
that I show a quick slide. - Sure. - On how we actually store the logs because that is a little bit about like the secret sauce from Loki. - Yeah.
- Because when people, when we say Loki doesn't index the logs, because that is something that we do say, a lot of people get confused, especially when they come from like fully indexed solutions
like Elasticsearch. - Yeah. - So what we do is we have
a very similar structure to PromQL or to Prometheus. So that means also that LogQL can have a similar structure as PromQL. So what we do is we store the timestamp. We store the same type
of label selector pairs as you would normally see in Prometheus, and we store the log line. The only thing is we
only index this piece, but we don't index the log line. So we don't do anything except
a little bit of compression. So, right, it's otherwise wasting bytes, you want to compress. It's typically like a
factor of four compressible so you would definitely want to do that. And what we do is we
store that on object storage so that people can, yeah,
keep that there quite easily. And the cool thing is, because we don't really
index the log line, this piece, this index
is really very small. So if you compare it to
a lot of those systems that do full text indexing, sometimes the indexed data or the index files are
even bigger than the original log lines. So let's say you're gonna ingest like half a petabyte per day, that is also the index. So you actually need to store both the original logs
typically and the index so that is a huge burden. And that of course,
costs a lot of resources. So we want to keep the
amount of resources small. So that's why we say, okay, just a little bit of indexing and then compress the
content of log lines. Then the next step is that
you're gonna query that. So first of all, that index, like I said, is really small. So for the folks that need
a little bit of an idea is that's 10 terabytes of log data actually goes into like
20 megabytes of index. So that's really, really small. So basically that doesn't
even count, right? It's basically negligible. And the way, if you look at how we are
gonna query that with LogQL, so LogQL is the Loki query language. What we do there is we store all those raw logs, so we don't, how do you say that? We don't index them, but
what we do is we say, okay, just give us the logs
from a certain application or use case or department. So this is called a label selector. And that's exactly the same, right? If you have PromQL, if
you're familiar with that, you always need to
specify a selector there. The only thing that we don't have is of course a metric name, because that is very Prometheus-like. And what we then do is
we apply a timeframe. So same if you would
fire off a PromQL query. Yeah, you also need to
specify the duration for that. And then what we do is we fire off a needle in a haystack query. And this is one example where we just basically do a
kind of a distributed grep. Yeah, so people were like
joking about like grepping and going into terminal, going ssh to a box and grepping. We still do the same thing, only we do that in a
centralized, aggregated way. And also (indistinct), what we do is we do a
lot of parallelization. So that means that basically
with those label filters that you see here with a timeframe, we filter down the logs that you actually want to grep through and then we grep but we do
that in a parallelized way. So that means that we're distributing that workload across a
lot of different nodes in our cluster, and they're all gonna take
a piece of those logs, those eight terabytes,
so maybe eight nodes. So every node will take one terabyte and they will search through it. And we actually are able to achieve speeds of one terabyte a second. So that means that if you want to search through eight terabytes of logs, it'll cost you around eight
seconds of query time. And that is also different compared to a lot of those earlier systems is that sometimes we're
a little bit slower or a little bit less performant. But the thing is, it is
a trade off that we make. It's like, I'm okay waiting a few seconds if that means that my TCO is like an order of magnitude less. It's like, it's much more cost effective. And that is basically the design decision that we took with Loki, to make sure that it fits the use case, because a few seconds doesn't matter. And a lot of those systems
like Elasticsearch, they were actually created for
kind of different use cases where you want to have
subsecond response times where like every 10 millisecond counted. And in our case, that's not
really needed for DevOps use cases. And that allows us to
be much more cost effective and do this type of stuff in
brute force at query time. - Yeah, that's interesting. Yeah, it is that trade off thing 'cause it does make me think of like, I used some NoSQL
databases in the past too, and they weren't really indexed, but you would have
these materialized views and really, it was a complete
copy of all the data. It was just formatted
differently for a faster search. - Yep. - But you were paying by having
much more storage needs so. - Yeah and, indeed. And so it's not only storage, right? So a lot of those systems, they really like to
keep all the data in RAM because otherwise it becomes, again, it becomes slow to retrieve it from disk. And typically they also
like to do like fast CPUs. So yeah, it's really easy
to spend a lot of money and you get a fast cluster,
don't get me wrong, right? For the right use case,
that makes a lot of sense. But in our case where we just want to
ingest a huge amount of data as cost effective as possible, and then maybe we're going to query only on a subset of data, depending on those unknown unknowns, this makes much more sense for us. - Yeah, it's interesting because you're talking
about cost effectiveness, but to some degree, that
is also performance-based. When you talk about cost, it's not just a monetary cost, right? It's also the cost in terms
of resource utilization. And if there's a query
that you need to run, that doesn't necessarily
have to be done immediately, then you're going to need fewer resources to do that as well, right? - Yeah, and the nice
thing about Loki is that because we have that
microservices architecture, it's a quite modern architecture. So you can choose a little
bit your own adventure. So if you want more performance, yeah, you can actually
scale up the query path. And let's say if you
double the query path, it can be very well that you actually have two
times the query performance. And so you can basically choose how cost effective you want to be there. And that is also really great. And also it's not only around
like physical resources like the hardware, but it's also around operational costs. There are a lot of folks that are needed to scale those very
popular existing systems to very high scale, right? There are like books written
that you can find on Amazon. It's like the dark magic arts to actually create such a cluster and run it at that certain scale because you need to
have a lot of knowledge. And I'm not saying that Loki is like, there's still a lot of
inherent complexity, but it is, again, an order of magnitude, I think, simpler
to run at those scales than a lot of those older systems. But I might be biased. I might be biased of course, right? There's probably somebody
that will say, "What now? You're just a Loki fanboy," which I am, which I am, I cannot deny. - I mean, we're on the Grafana
Channel, like. (laughing) We're all biased here. (laughs) Everyone has a bias, just some of us are more
honest about them than others. - Yeah. (laughing) No, but yeah. But to that point, it's like,
I've used those systems. I'm still using a lot of
those existing systems there. They have their uses. It's just, I always like
to pick the right tool for the right job. And sometimes I also talk
to prospects and customers and they're like, yeah,
we want to use Loki. And they explain to me,
what they want to do. And I'm like, "That
doesn't sound like Loki, don't use Loki for that. That will be not cost effective for you and you will have a horrible experience. So please don't." - So you touched upon one of the things that makes Loki awesome, and that's that it is
built for performance and also horizontally scalable. Now what exactly do you have to do to be able to scale up Loki? - That is a nice one, as well. And let me show you this. I hope this is viewable
on the YouTube channel if people are able to zoom in enough. So I created this kind of diagram. It's also in the Loki documentation around the architecture of Loki. And I typically talk with
folks when they ask like, how do I scale Loki? I talk them through
this architecture slide. So what you see here on the left is what we call the simple
scalable deployment mode. So that's the default mode of Loki. So you can also run Loki as
a single binary, by the way, which is really easy. So if you just want to play
around with it on your computer, it's actually quite powerful. But of course, right, it doesn't have like high availability and that kind of stuff. So for that you want to
deploy it in that mode I was just talking about. And what we do is- - Would you be able to
zoom a little bit on that? - I'll try. I'll try. Maybe... I don't think I can. - Are you on a Mac? - I'm on a Mac. You have Mac- - Maybe you can just pinch and Zoom. - I tried. Sorry, I cannot.
- That's okay. - Make this bigger, so. But I'll just explain the boxes and I'll read up what it says.
- Sure. - So hopefully that
makes sense. (laughing) And you can always look
at the Loki documentation, look at like Loki architecture, and you'll find this slide, as well. So what you see is that Loki
runs on a Kubernetes cluster. It doesn't need to be, it can also run on like a
Docker on your own server, but we typically recommend it to run it on a Kubernetes cluster. And it consists of three
types of microservices. And that is the write path, the read path and an
administrative microservice. So for the write path, as the name says, everything that comes in goes through a load balancer, and then you have one or more of those write path microservices. So we typically would say
you want at least three because we use three-way replication. So that means that if one
of those writes goes down, you still have two
successful write instances that can still write data to object storage. And we need to have two
because we want to make sure that we're actually writing
consistent data to the object store, and we're not accidentally
writing partial data. So people want to trust what
data they have written to Loki. And then if that is a successful write, it'll say, okay, 200 successful, well actually it says 201, and then it sends that
code back to Promtail, and Promtail is the agent. So Grafana Loki has Promtail
as the default agent, but actually there's a bunch
of agents that you can use with Grafana Loki, but Promtail is the one
that comes outta the box and when it receives the 201, it'll send the next batch of
log lines from your server. If for example, there's something wrong, then it will actually do
an exponential back off. So that means that it retries, but it's not gonna like every
minute hammer your server because you can imagine, right? If you have like thousands of Promtails and they're all hammering the
same server at the same time, and let's say the server or the cluster is actually just coming up after there was kind of a big failure, that's of course not a nice, healthy way to actually return to service. So that's why we do exponential back off. And if all goes well, the write path microservice actually writes the index and the chunks, which are
the log lines, to object storage. And the read path is very similar. It reads based on the
queries that Grafana gives. So Grafana will query it, we will load balance the query, and then it will send to one
of the read path microservices. And what's really cool is
that we made it like this, that if you have multiple read paths, it actually will distribute the query over all the read path microservices that you have available. So when we say, hey, you can choose your
own performance adventure, you can actually replicate the amount of read path microservices and that can increase the
performance of all your queries. And of course, right, if you're like querying only a very small piece of data, right, it doesn't really make sense
to actually split that load up. It makes only sense that you scale up when you have enough data
that you're gonna query. So let's say you're gonna go
through that eight terabytes, it might actually make sense to not have just eight nodes or eight read paths, but you can actually do 16 read paths and that will actually
significantly increase the performance of your query. And then we have this yellow box and that yellow box takes care of some administrative
services, which is nowadays, for example, if you issue a delete, so the delete is actually
not taking effect directly. It is actually something
that we do in the background and we are actually also thinking of maybe introducing kind of a compactor so that we can make the data that's actually written to object storage, maybe to make that even more
efficient for reading later on. So because now, yeah, go ahead. Sorry. - We have two questions
about object storage actually while you're on the subject. Prashant Singh says, "Need detail about object storage. Any OSS object?" And also, "Can we use object
storage with Unix host server?" Ah, yeah, great questions. So yeah, Loki depends on object storage. I think there is a mode in Loki where you can actually use file storage, but it's mostly more for
development purposes. And indeed, thank you, my long lost relative from - (all laughing) He's actually giving that- - That's secretly you, isn't it? - Yeah! Just my alter ego, indeed. Yeah. (all laughing) - Also, uh-
- Yeah? - I'm sorry, go on. - Yeah, no, what I want to say is that you can use MinIO definitely, and that definitely works. But the thing is that it's only as fast as your storage is actually capable of. And what you actually see
in a distributed application is that storage is, it's very easy to make that a bottleneck. So a lot of like cloud providers, they optimize their architecture to have a compute that's
separate from storage and they optimize the heck out of it. So that's why you have
incredibly high throughput from your object storage to your compute. And in this case, these
are the microservices. And so what you see with customers that are doing a lot of Loki, a lot of logs, a lot of querying, if the performance and the throughput of that object storage is not optimal, they will not have an
optimal Loki experience. So it's really important to make sure that if you have object storage and performance is important, make sure that it's either
on the public clouds like AWS or GCP, or that you benchmark it in a way that's actually in line or quite close to what public cloud providers can offer. - Okay and we have a
question from mbaykara, who says, "Please reduce the
number of Helm charts for Loki. Why too many Helm charts out there? It is confusing sometimes,
especially for newcomers." - Yeah, I'm totally not gonna defend that. Yeah.
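For reference, getting started with the current consolidated Helm chart typically looks something like the following sketch; the release name and namespace here are illustrative, and the chart's defaults may change between versions.

```shell
# Sketch: installing Loki via the grafana/loki Helm chart.
# Release name "loki" and namespace "loki" are illustrative.
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki --namespace loki --create-namespace
```

You'd normally also pass a values file (`-f values.yaml`) to pick a deployment mode and point Loki at your object storage.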
(Paul laughing) No, so we've been through a few
iterations with Helm charts. I think we now have like the
one Helm chart that we say, okay, this is gonna be the
Helm chart going forward, but there might be still some traces left and some documentation links left to the older Helm charts. So apologies for the confusion. - Where is that? - Yeah, that's a good one. I need to Google that
one, but let me do that. - My guess, that's just simply because of all the different
deployment options. I mean, if you want multiple
reader, writer and all that, or if you want just a kinda, I hate to say monolithic, but you know, the monolithic functionality where it's all just kinda one big thing. - Yep, yeah, exactly. And I'm not sure if I could join the chat, can I put it here on YouTube, the link? Oh no, I need to connect
to YouTube for that. - Oh, we can move it over. It's okay. - If you can move it over. - Okay, I just did that. - Yeah, indeed, so Paul, you're correct. So we do allow you to run, for example, in microservice mode. So we still have multiple
deployment options, but if you are asking me, okay, which one should I pick to
get up and running with Loki, I would nowadays actually
pick the scalable version, which is the one that I've linked there. And yeah, I totally agree. It's like a little bit confusing and we're actually trying
with the entire Loki team to improve on that documentation. But yeah, try that one out and you should be able
to be up and running. - I would say also the link
that I just posted in chat is to the Grafana docs. I think that that's a good rule of thumb. Like we're not the only ones
who can create Helm charts, you know, everything's open source. So I've seen a bunch of
like third party ones and they change a bunch
of different options depending on what you need. I think the Grafana docs
is a good place to start. That's always going to hopefully be the most updated one that
is good for most use cases. - Yep, yep. Yeah, exactly. And if people want, so shameless plug, I sometimes also make some
videos around Grafana Loki. So there's actually a
Grafana Screencast playlist and one of the videos is indeed a video about this simple
scalable deployment mode. So this is a one-year-old video. So some things might have changed, but definitely look at it because a lot of the concepts
still remain the same. And this might be a nice quick start for getting you up and
running with the Helm chart. - Nice. - So maybe let's go back a little bit. How do you actually install Loki? We talked about Helm charts for Kubernetes and you also talked about
there being a binary. - Yep, yep. Yeah, so if you just go
to the Loki GitHub repo, there are releases and there is a release which you can download, which
is just a single binary. And if you follow the instructions, you're just a single command away from having an up and running
Loki server, so that's great. And then you can start writing data to it. What I typically would recommend if people want to have it even easier is actually to go to Grafana Cloud. So for Grafana Cloud, we have a completely free tier and it actually comes with
quite a generous amount of logs. I think we do like a hundred gigs, something like that, for a month. So that means that you can just send almost 100 gigs of logs per
month to Grafana Cloud,
which is powered by Loki. And then that's it, you're done. So that is an even easier way, but if you want to install it yourself, then definitely go with the binary. - That's the one that I took advantage of is the free tier. So if anybody wants to
go to javaducky.com, 'cause I'm not testing the
rate limits. (laughing) - I'm just gonna write
a k6 test right now. (all laughing) - Exactly, exactly. - Another question about
the integration between Loki and alerting, Grafana alerting. - Oh that's a nice one. Yeah, so we definitely wanted to have it integrate with alerting. And the cool thing is we
actually based Loki alerting on the Prometheus alerting. Yeah, how do you say that? Strategy or solution. So Loki actually also works together with the Prometheus Alert Manager, and actually configuring a Loki alert and a Prometheus alert is exactly the same. The only difference is that
the Loki query language is of course slightly
different than PromQL. But what you do is you
actually create a metric out of your logs using LogQL and then based on that, you actually create a nice alert out of that. So I can actually, if
people are interested, I can actually just show you
how that could look like. - Awesome. Yeah. - So you're actually
doing that inside of Loki and not inside of Prometheus? So it's not one of those things where it's like you're
creating the metrics and forwarding into Prometheus and then Prometheus
then does the alerting? - So the Prometheus Alert
Manager is actually the component that does the alerting, yeah.
- Ah ha. - Loki will integrate, and so the alerting, so the ruler, which is the component
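As a rough sketch of what Ward describes next: a Loki ruler rule file uses the same format as Prometheus alerting rules, with a LogQL expression in place of PromQL. The stream selector, labels, and threshold below are hypothetical, just mirroring the error-counting example he walks through.

```yaml
# Hypothetical Loki ruler rule file; same shape as a Prometheus rule,
# but the expr is LogQL instead of PromQL.
groups:
  - name: loki-test-rules
    rules:
      - alert: HighErrorLogRate
        # Count log lines containing "error" in a hypothetical
        # {job="myapp"} stream over the last five minutes;
        # fire when the count exceeds five.
        expr: 'count_over_time({job="myapp"} |= "error" [5m]) > 5'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: More than five error log lines in the last five minutes
```

The ruler evaluates the expression and hands firing alerts to the Prometheus Alertmanager, as discussed here.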
that actually evaluates the Loki queries and then sees whether it
needs to trigger an alert. That is like a Loki component, but it interacts with, in this case, the alert manager. So let me see if I have a
nice example recording rule. Okay, yeah, I do have, so let me show you this one, but this is from my test environment and what you can see here is
that you can actually create, and this is actually a recording rule which is similar to an
alert rule from Loki. And actually I could create
an alert rule out of that. So what you need to do there is you create a new alert rule. So Loki test rule. You're gonna select the Loki data source. So in this case this,
it would be Loki Cloud. And then you write a query. And that is what I mean
by writing a LogQL query. So this is one of those LogQL queries. And what we do here is we are looking at, and this I can make bigger, so that's maybe more useful. Here, we're actually selecting a log stream, so I'm actually looking
at maybe a certain agent that I'm interested in, and then I actually can find, okay, every log line then contains error. And if that increases, I want to actually be alerted on it. So what we're doing is we're counting for that last five minutes, we're counting how many log
lines have an error in it, and then we can actually say, okay, if the threshold is above five, then we're gonna alert. So that is my alert condition. And then I can then, just the same as we do with
the Prometheus alerting, we can add some additional metadata. So this is the metadata
that gets sent with alerts. And that can be then used in your alerting and your on-call configuration. So where you can say, okay,
this is the type of alert, this is why it triggers and maybe a runbook that you reference with, okay, this is how you
typically solve this issue if it appears. So this is just a quick way
on how to do Loki alerting. - Okay, another question here about the new Helm charts for Loki. So mbaykara says that charts "Come with a component called backend other than just the write
and read components. So what does the backend do?" And Prashant says that they're
also confused about that. - Yep, yep, so the backend takes care of all those administrative processes. So it also does the ruler. So the alerting is actually something that is actually running. It's not part of the write paths, not part of the read path, but it is a service that
needs to be running, right? And that is in those backend processes. And also when people
want to issue deletes, those deletes are also not done synchronously. So these deletes are async and they work on an interval. And that is also taken care
of by that backend service. And you'll actually see,
like I mentioned before, that there's gonna
be more functionality added to that backend service, like maybe optimizing the way
data is stored in object storage. Great questions by the way. - Yeah, I'd like to do a bit
of a wrap up of everything that we discussed, but before we do that, do you have anything else
that you want to show? - Oh, that's a good one. Yeah, one thing that I
typically always like to show, and again, shameless plug, this is a dashboard that I created, but it's also very helpful dashboard. So if you go to play.Grafana.org, we have a lot of examples
of dashboards and tooling that you can play around with from this hosted Grafana. So this is a public
Grafana that people can use and they can play around with it. And this NGINX
dashboard is actually data, or a dashboard that's
created based on data that's actually coming from
one of my own web servers. So I have a pet project and I just send the NGINX logs to, in this case, Grafana Cloud Logs. And you're actually
just looking at my logs from my websites, so that's pretty cool. And everything you see
here on this dashboard is actually created from Loki. So we're actually creating those metrics from those logs ad hoc. And the cool thing is that I
actually have two versions. So there is actually a pattern version that works with space-delimited log lines. So you can actually
see how we would handle a little bit like semi-structured logs. And there's the JSON
version that I was showing, which is actually the one that shows how to work
with structured logs. And there you can, hey,
if you're interested and you're learning Loki, it's a great way to just click
into those types of panels and see how that log line or how that query is being built up and what kind of functions
are being called out to make sure that you get
a very similar dashboard. And you can actually download it from the Grafana dashboard directory so it should be there available, as well. So definitely check that one out. It comes with a lot of great examples to kickstart your Loki journey. - Awesome, and actually next week, we are going to be talking
about Grafana Play in general 'cause there's a lot of other examples other than just Loki stuff. And it's a really great way to
just play around with Grafana without any commitment, without having to sign up for
anything or pay for anything. - Yeah, that's great. And let me give you one secret. There is actually a game
within play.Grafana.org. - Wait, I didn't know that. - Yeah, there is a game that you can play, a very famous game. - Oh, I know what it is. - Maybe the next host knows it. I don't know. - Oh!
(Paul laughs) Oh, oh, I think I know now. - But let's keep that as a
cliffhanger for our next episode. - Okay. Okay. - Let's hope that everybody will watch the next episode, as well. - So I didn't tell either of you this, Paul might already, you know, have guessed because I've done this before. I'd like to play a game. (laughing) I have this thing called Ultraspeaking where there is a bit of a podcast game. And what I'd like to do is
spend the next few minutes summing up what we've discussed, just today on this Grafana
Office Hours episode. So Ward, just so you
know, we're going to be, our names are going to be showing up and we have to switch. The premise is we are doing a podcast and we just have to talk about Loki, which is what you already were doing. - Okay. Let's see. - It's also gonna be timed. (laughing) - Yeah.
- Let's do it. - Not to put any pressure on this- - Oops.
(Paul laughs) Let's not suggest a title. Okay. (Paul laughs) So... I'm gonna start. Okay. So we've talked today about what a log is and how it is different from metrics. A lot of times, things that
we do when we're developing, we start with logs. We might just have like, we might just be debugging
something in our terminal. - Or collecting logs from
multiple applications all going to a single location. Come on, switch over 'cause I'm like lost here now. (laughing) - Yeah, because the alternative is that you actually need to
log into all kinds of nodes and then manually grab the files and the files might not even be there. So it's really important that
you actually aggregate them and make them available and so that also all your teammates can actually take advantage
of that centralized log store. - There are a few issues
regarding logs that Loki solves. The first is that logs are
notoriously difficult to parse. Sometimes logs are unstructured. Sometimes they're semi-structured. - And also, yeah, or strictly
structured with a JSON format where everything is basically labeled and that can provide easier indexing. Oh come on. It seems to go so long. (Ward laughing) - And the cool thing about Loki is that you actually don't
need to worry too much about the indexing. We just need a little bit of metadata and just the raw log line. - And that is actually one
of the features of Loki, that unlike other solutions, Loki indexes just the metadata
and not the full text, which means that it has
real performance benefits which leads into the cost effectiveness of the tool, as well. - Right and it's all
about the object storage, so that's keeping those costs down, but it does imply that
there's addressing total, oop. - Yeah, and the costs are, of course, the costs are not only about the amount of resource
that you need to spend on the CPU and the RAM, but also a lot of like people that need to keep the
servers up and running and of course need to tweak
the performance all the time. - Those operational costs
come into effect, as well. And also distributed computing in general means distributed logs so you really need some
sort of log aggregation to put it all together. Now how do you actually install Loki? Well, there are a few ways. One is that you can use the binary that you can download from the repo. - Or you can use one
of several Helm charts to actually install the
microservices set up. But we're actually working
on consolidating that to a more singular, better
option of a Helm chart which is displayed on your screen. - Wonderful. And if you just want to
kick the tires of Loki from kind of a usage perspective, I definitely recommend
creating a free account on Grafana Cloud and starting with a hundred gigs
of logs per month included. - Yeah, I think also another way that you can just quickly
have a look at Loki already up and running is by going to play.grafana.org. You don't even need an account. You can just play around and even see how Ward's web
server is doing. (laughing) - (laughs) And you can
also really use that to learn about a lot of the different dashboard functionality
that you can use with Loki. Yeah. - And today, we definitely
touched a bit on Grafana Loki. If you want to learn more about Loki in a much more in-depth fashion, we actually provide also workshops. So go to Grafana.com and there's probably some
button about workshops and we typically organize
those workshops frequently. So definitely sign up if you're more interested
to know more about Loki. Sorry, Paul, I'm talking
through your time. Go ahead.
(Paul laughs) It's your turn.
- It's all good. - Go for it.
- Yeah. And you can use those to
learn the LogQL syntax and just get real efficient. - That LogQL is based
heavily on Prometheus because we think Prometheus is awesome and we often say that Loki is
like Prometheus, but for logs. - (laughing) Yes, Prometheus. And then these are all
based on what is it? Greek gods or no wait, Nordic gods is what we try to stick to for the product names now. - Yeah, and I still
think that the name Loki was chosen because Loki's a trickster god that's a shape-shifter, just like you need to
shape-shift your logs to make them useful. - Yeah, and that is now
the final real history of the name of Loki. From now on, that's gonna
be like our origin story. - From here on out.
- Wonderful, from here on out. And I would like to really
thank everybody for attending. It was great. Thanks so much to my entire
family for also asking questions. (all laughing) And thanks, Nicole and Paul, for hosting. It was lovely.
(Paul laughing) - Thank you, Ward, for coming to join us. And if you'd like to know
more about Grafana Play, then we will be back next
week at the same time with a different guest to talk all about, just the easiest way that
you can play around with Grafana without installing anything and without any data of your own. So check back for that. - And ask about the game. Ask about the game.
- Yes, the game. - Yes, all right. Thank you, everybody, for watching and good luck finding that game. Have a good weekend.
- Cheers, folks. Thanks much. - Bye, everyone.
- Bye-Bye.