Welcome to this session of the AKS Best Practices webinar. Today's subject is monitoring and logging. My name is Dennis Zielke, I'm part of the cloud native architecture team, and I'll be guiding you through today's subject. Let me give a bit of an intro to what we're going to talk about: I want to introduce the problem that we're trying to solve with logging and monitoring, introduce the Azure-native solution that is available to you, demo it hands-on with a couple of typical use cases, and give you resources for following up and implementing this in your own infrastructure.

So what is the problem at all? Think about how containerized applications are a little different from traditional on-prem applications: if your application writes log files, you typically cannot go through those log files in a containerized environment in your cluster. From the process problems up to the technical ones, you need a different approach. In that sense, you typically ship the logs that are written to standard out and standard error to someplace else for later indexing, searching and aggregation, and maybe even for longer-term persistence. That addresses the question "what happened in my application?": if a user problem seems to have occurred, I go through the log files, narrow down the time frame, understand the problem, and then fix it.
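In Kubernetes terms, the convention described above means the container writes to stdout/stderr and the platform collects those streams. A minimal sketch of reading the collected logs with kubectl; the pod name my-pod is a hypothetical placeholder and a configured cluster is assumed:

```shell
# Containerized apps log to stdout/stderr instead of local files; the platform
# collects those streams so they can be shipped, indexed and searched later.
# Read the collected logs of a (hypothetical) pod, narrowed to a time frame:
kubectl logs my-pod --since=1h

# Logs of the previous, crashed instance of the same container:
kubectl logs my-pod --previous
```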
The other thing you typically want is to monitor the application at runtime by observing the metrics that your system is exhibiting, starting with the underlying infrastructure: the different nodes in the cluster surface metrics like CPU usage, memory, disks and networking, and the same is obviously also available on the Kubernetes level for pods in different namespaces and for sets of pods. So you want observability from the application metrics down to the underlying base metrics on the networking and file level, across different environments. Thinking about this as the bigger problem you want to solve: how do I get logs efficiently, and how do I observe metrics at runtime at any point, across different roles, environments, and applications? Now, entering the
solution for that problem: Azure Monitor is our first-party solution that is available in Azure. You can activate it very simply for containerized environments, specifically Kubernetes in this case, and make use of it. The typical problems that we're trying to solve here start with a simple one: "Is my cluster healthy?" Coming from the infrastructure perspective, you have a set of environments and want to understand at a glance that everything is good. The other thing you want is some kind of troubleshooting guide: if a problem seems to be occurring, how do you diagnose and drill down efficiently on the problem? And you want to give people who might not be familiar with all the environments (think of the average developer) a way to get down to the problem without directly connecting them to the machines. We want all of this at a very low maintenance overhead: running your own logging and monitoring infrastructure can consume additional cost in terms of personnel and infrastructure, and we want to keep that very low, which is the design point here. You also want the effort of bringing this into running environments to be as small as possible. So how does this look in
practice? Think about your average Azure environment: you have a set of clusters running in a set of subscriptions. As you can see here, I have 3 clusters across my subscriptions, 2 of which are monitored by Azure Monitor, which is sort of the entry view onto all of my environments. There are additional environments, as you can see here, which are being detected but not really monitored. I can bring those into Azure Monitor as well, and since in this case that is pure Kubernetes infrastructure, I'm going to show how this can be done later. For the moment we want to start with the Azure-managed Kubernetes clusters. I can see there are 2 clusters running, and one of them seems to be having a problem. As you can see, from the infrastructure perspective the cluster is running 4 nodes, all of which seem to be fine; the system pods are also running fine, so the Kubernetes space is all right. But in the user space, meaning my own application, there seems to be something wrong, as only half of the pods are actually healthy and running, which might be something relevant to drill down and look into. The good news is that the other, high-value business application running on a smaller cluster doesn't have any problems, either in the kube-system pods or in the running application pods. Being alerted to this, you naturally start by going a little deeper into the cluster that seems to be having problems. In this case, that's our dev environment. The whole point here is that
this is a managed Kubernetes cluster, and it comes out of the box with the managed version of Azure Monitor included. First of all, we start again with the high-level view: what you see here is the cluster overview. You see there is a set of nodes running in this cluster, in this case 5 nodes, and there seem to be some scale operations going on for some reason, as the cluster size has increased from 2 to 5 in the last hour. I can change the time frame to get a sharper view of what's happening here. The assumption would probably be that there is some sort of autoscaler running, as the number of pods has also increased, and with a rising number of pods the number of nodes naturally follows, if you have a scaler rule activated. But that alone doesn't seem to be troublesome: as you can see, the node memory utilization seems fairly OK and the CPU utilization seems to be fine as well. In the pending space you can now see that the users or developers seem to have requested the scheduling of a number of pods in this cluster, which have been in a pending state for quite a while. So the cluster size on the left-hand side has been increased to accommodate that need, and now we're happy in this space, because all the pods are simply running, nothing is pending, and there is nothing unusual. That being a good sign, let's go a little deeper into what is happening here. In the next phase,
having looked at the cluster, we now want to take a look at the nodes that are running this environment, which brings us to the next tabular view of what is happening on these nodes. With the scale operations that have happened, you can see that some of these nodes have been introduced into this environment only recently. You can also see, in the individual columns we're comparing here, the actual memory consumption of each node versus the available limit on the box: it is using 1.3 gigabytes of the available 6.8, so we're happy with this view. What we also see when we look at the nodes on the right-hand side are the Kubernetes version this environment is running, the underlying operating system and the Docker engine, and we can also take a look at the labels and the metadata attached to each node, which allows us to see what is actually happening here. We also have a tree view that allows us to drill down a little deeper into what pods are being scheduled on a particular node. If we want to test an assumption about what is happening, we can also switch between different metrics and between nodes, but since there's nothing serious happening here, we can be happy in this space.

Now we come down to actually analyzing the underlying applications, and for that we're going to switch back to the other cluster, because we have a specially prepared application that should help us here. You can see that the cluster, or rather the application, has repaired itself: because the scale operations happened, the pods are now scheduled, or have been scaled down because the user load has gone down, and our cluster is now being served happily.
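The pending-pods situation shown above can also be checked from the command line; a small sketch, assuming kubectl is configured against the cluster:

```shell
# List pods stuck in Pending across all namespaces, the same signal the
# portal's pending-pod view surfaces:
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# "describe" then shows why a pod is pending, e.g. insufficient CPU or memory
# (the pod and namespace names below are placeholders):
# kubectl describe pod my-pod -n my-namespace
```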
Let's move on to the other application and take a deeper look into how applications are performing. Assume in this case that we've been getting reports about a production system that it hasn't been performing that well, which is a bit surprising, because everything is green here. But if we switch the view, we get an indication that something else might be happening. What Azure Monitor also does is observe the application's actual behavior versus the allowed specification. What we see here is that there seems to be a buggy or crashing application: the memory working set of the application has a limit of 265 megabytes. The application seems to be running fine at the beginning, consuming only 50 megabytes, but that increases over time until eventually the application is killed by the orchestrator because it was simply consuming too much, with the color indicating that this is very likely the case. In fact, the
application here did contain a memory leak, and the interesting part is walking through the troubleshooting of what was happening. I want to get insight into what's happening in the application at runtime, so I want to drill down into the logs that have been emitted by this pod. What I can see here is the specification and the controller that is managing it. If I look at the individual container instance, I also have the option to look at the live logs that are being streamed towards me. My application surfaces a set of log lines, so I can go in here and look at what is happening in real time. Here is another application that serves better as an example: I can send a message to this application from the outside, so I just trigger another message. Let me change the text to "hi there webinar" and fire it up, and you can see that the response, or rather the log output, is made available on this side quite quickly. I also have the option to search live in this stream: if I want to look for something, for example "webinar", I can see it highlighting all the messages that contain this term, which gives me good insight into the output. That helps me as a developer get insight into the live logs that are happening right now on the production system.
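Outside the portal, the same live-log streaming and searching can be done with kubectl and grep; a minimal sketch, where the deployment name my-app is a hypothetical placeholder:

```shell
# Stream a container's stdout/stderr live and filter for a search term,
# similar to the portal's live-log search ("my-app" is a placeholder name):
# kubectl logs -f deployment/my-app | grep --line-buffered "webinar"

# Without a cluster at hand, the filtering step can be simulated locally:
printf 'starting up\nhi there webinar\nrequest served\n' | grep "webinar"
```

The `--line-buffered` flag makes grep emit matches immediately instead of waiting for a full output buffer, which matters for a live stream.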
Another part that might be helpful is insight into what the application is doing here: in this case you can see that the application is continuously consuming more memory. There is the memory leak, and you can watch in real time as the application keeps allocating more and more memory and eventually crashes because of it.

What else do I want to do? That was the real-time view; now I want to go back in time and see how often this has been happening. So I take my buggy app, and as you can see, next to the specification of the application and the live logs there is also the option to drill into the persisted logs of that particular application. When I do this, I'm redirected to another dashboard, which has already created a query specifically for this container: it is using the time frame I gave before, and it has inserted the container name and the right cluster. You can have multiple clusters and multiple environments, therefore we need to be specific in this space. I can also aggregate, create a set of rules over the output, and page through the days of logs that have been persisted for this application. I can also search for terms in the output if I want to read into what is happening here and that might be relevant. So this gives me a view of how I can formulate queries per system and drill down into everything that is happening. If I'm a little smarter about it, I can also create queries that only surface the events I care about: in this case the schema, for example, gives me the option to ask for all the container-state-failed messages from this buggy app, and to count the number of container crashes in the last 10 minutes.
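A crash-counting query along those lines can also be run outside the portal with the Azure CLI; a sketch only, where the workspace GUID and the app name are placeholders, and the table and column names (here KubePodInventory and ContainerStatus from the container insights schema) should be verified against your own workspace:

```shell
# Count recent container restarts/failures for one app via a Log Analytics
# query (workspace GUID and 'buggy-app' are placeholder values; check the
# exact table/column names in your workspace schema first):
az monitor log-analytics query \
  --workspace "00000000-0000-0000-0000-000000000000" \
  --analytics-query "KubePodInventory
    | where Name contains 'buggy-app' and ContainerStatus != 'running'
    | summarize crashes = count() by bin(TimeGenerated, 10m)"
```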
That has an advantage, because maybe you don't want to sift through all the logs yourself; you also want to get notified when something happens. So think about the problem: if that application keeps crashing again after I've implemented the fix, I want to get a notification right away, or someone should be notified right away to take another look, because now I know that this application might be troublesome later on. I can persist a query and also export the output, but what's also interesting here is that I can define alert rules. In this space I go in, say I want to reuse this query, I've already added a condition for it, and I want to use the crash count. So let's say if the application crashes more than 3 times in 10 minutes, based on an evaluation period of 5 minutes evaluated every 10 minutes, I want to get a notification. This is the next step that the Azure Monitor backend allows me to do: use these log queries together with existing action groups. Here I can send myself a mail; I can trigger some kind of Function or Logic App that starts a process or interfaces with my ITSM system; I can trigger some kind of webhook and react with custom code to the things that are happening; or I can trigger an Automation Runbook, think of a runbook that expands the cluster or changes configuration rules depending on the problem at hand. We're not going to do this for now, but the important part is that there is the option to not only use log messages in retrospect, but also to use logs to trigger events, which might be even more relevant in this space, and to share this with additional teams.
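Those notification targets live in Azure Monitor action groups; a hedged CLI sketch, where the resource group, names, addresses and webhook URL are all placeholder values:

```shell
# Create an action group that emails an operator and calls a webhook when an
# alert fires (all names, addresses and URLs below are placeholders):
az monitor action-group create \
  --resource-group my-rg \
  --name oncall-ag \
  --action email oncall ops@example.com \
  --action webhook pager https://example.com/hooks/alert
```

An alert rule built from the log query can then reference this action group, so a mail and a webhook call go out whenever the crash-count condition is met.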
What might also be interesting is to create some kind of chart here: in this case, I want an overview of how many times the application has been crashing, a slightly boring chart in this case, but you can imagine that, based on the queries that are available and the schema you can see on the left-hand side, you can create pretty interesting queries and also share them with the bigger team. Think about turning this graphic into a dashboard that can be shared with a set of teams: think of this as my operational dashboard, which I will be serving to my operations team, who can configure and drill down not only into the metrics, the logs and the graphs from my Kubernetes infrastructure, but maybe also into my network, storage and the other parts of the application, for a bigger, holistic view of what is happening. I can share this with everybody who has an Azure AD account in my environment and also set different permissions on it to control who this overview is available to. You might not always give everybody the same permissions to access the actual infrastructure, but I think it makes a lot of sense to give people an overview of the logs, or at least the dashboards that are surfacing these infrastructures. Coming back to the
environment, let's drill back down into the cluster. There are a couple of additional points that might be interesting here as well, beyond looking at these environments. You also have the option to surface the metrics here, and the logs, as I showed you, are accessible from here, to give you a quick review of what is happening. The other interesting part is the schema: what kind of properties of each of these messages are available for filtering or for creating queries? As you can see, there is a large set of fields available: there is text, there are numbers, there are date-times, and all of these can be used in different kinds of queries, which is very useful to narrow down your assumption of what is happening in the environment, validate it, and keep a direct overview of what is going on there. Another operational aspect
that might be interesting in this space is that you might not only be interested in your own application, but also in the backend. This is a managed environment, which means that you do not actually have access to the master nodes that are running it, but you might still be interested in the scheduler logs or in an audit log, which give you an overview of what's happening in the environment. This is a little bit hidden, and you have to know that if you click on the resource group that in this case contains your AKS cluster, you can click on diagnostic settings, and in the diagnostic settings you have the option to turn on diagnostic logging. In this diagnostic logging you have the option to include the kube-apiserver, the controller manager, the cluster autoscaler, the scheduler and also the audit logs, and to forward all of these logs into the same Log Analytics workspace, meaning the same place from which you can query, monitor, and build dashboards and alert rules, all in one place.
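The diagnostic settings described above can also be enabled with the CLI; a sketch with placeholder resource IDs, and the category names should be double-checked against what the portal offers for your cluster:

```shell
# Forward AKS control plane (master) logs into a Log Analytics workspace
# (both resource IDs below are placeholders):
az monitor diagnostic-settings create \
  --name aks-control-plane \
  --resource "/subscriptions/<sub>/resourceGroups/my-rg/providers/Microsoft.ContainerService/managedClusters/my-aks" \
  --workspace "/subscriptions/<sub>/resourceGroups/my-rg/providers/Microsoft.OperationalInsights/workspaces/my-workspace" \
  --logs '[{"category":"kube-apiserver","enabled":true},
           {"category":"kube-controller-manager","enabled":true},
           {"category":"kube-scheduler","enabled":true},
           {"category":"cluster-autoscaler","enabled":true},
           {"category":"kube-audit","enabled":true}]'
```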
Something else that might be interesting is how you configure all of this for an existing environment. In the user interface this is a very simple flow: you spin up an environment, answer a couple of questions, and simply leave the monitoring switch on yes. You can serve multiple AKS clusters from the same Log Analytics workspace, which makes a lot of sense if these applications are tied together logically and you want to compare the logs across the different clusters, or you can spin up a separate workspace for one cluster and keep them separated. There's also the option to
create these things declaratively: in this case you can see from the naming convention that I've created all these infrastructure artifacts using Terraform. Next to native Azure Resource Manager templates, this gives you the option to make sure that when you are spinning up an infrastructure, things like the networking, the Log Analytics workspace, the Kubernetes cluster, the storage accounts and all the other resources used by the application are defined in one template that you can roll out consistently across different environments, so that they are all configured in the same way, and not each in its own unique way.
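The transcript uses Terraform for this; as a rough CLI equivalent, creating a cluster with monitoring already wired to an existing workspace might look like the following sketch (all names and IDs are placeholders):

```shell
# Create an AKS cluster with the monitoring add-on attached to an existing
# Log Analytics workspace (names and the resource ID are placeholders):
az aks create \
  --resource-group my-rg \
  --name my-aks \
  --node-count 2 \
  --enable-addons monitoring \
  --workspace-resource-id "/subscriptions/<sub>/resourceGroups/my-rg/providers/Microsoft.OperationalInsights/workspaces/my-workspace"

# Or attach monitoring to an already-running cluster:
# az aks enable-addons --resource-group my-rg --name my-aks --addons monitoring
```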
That was the overview. The point is that we want to make it very easy for you to spin up an AKS cluster and attach a monitoring solution for your workloads that doesn't cost you much, because it has no infrastructure cost of its own, only the storage that persists the logs, and because it is available as a managed service that is upgraded by Microsoft for your cluster.

Having hopefully addressed all these points, let's go a bit deeper into the bigger picture. We want to make this available as a cross-tenant solution: think about having an Azure tenant with many subscriptions, running different resources and different applications. There is a whole stack of stuff that you want to monitor and drill down into, not only in the container context, but also across the other resources that are being used: think of the databases, the storage accounts, and the network pieces that make this work. Health-wise there is an integration point into the processes that deploy your applications, in this case preferably Azure DevOps, which is our solution for bringing applications into your environment and making sure that all these things are available.
It gives me insights through different visualization forms, as we showed you with the dashboards, to drill in and see what is happening across all the different artifacts and resources you've provisioned, giving you tools to analyze what is going on: think of the metrics and the log space we've shown you. It also lets you trigger automated responses to events on the resources being used: think of the application example we've shown, where, if your application crashes three times, it will first of all notify somebody, and it may even trigger an automated response reacting to it, and also integrate this into other line-of-business processes that you might already have in place, in this case something like an ITSM system that you want to notify for each error message, so that someone can take a look later and understand what happened. That is the bigger picture of how Azure Monitor fits, first of all, into your Azure environment, but also into how you manage processes inside your organization. Now, the onboarding
experience is actually quite fast, as shown in the user interface, and there is also the option to deploy this using Terraform if you want to. It's actually not that complicated: one option is to spin this up interactively on the cluster, and the other option is obviously to do it declaratively with Terraform and configure it on your own. Control plane logs are something that we
showed you already: I showed you how to activate them, and this is how you can formulate a query against the diagnostic log that contains the master logs. That is particularly interesting if you think something is wrong that is not directly related to the application, but rather to how the cluster operates, giving you the option to figure out whether it's a developer problem or an orchestrator operations problem that you might need to solve by including somebody else.
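A query against the forwarded control plane logs can be sketched with the CLI as follows; the workspace GUID is a placeholder, and the category and column names (such as log_s in the AzureDiagnostics table) should be verified against your own workspace schema:

```shell
# Read recent kube-apiserver entries from the forwarded control plane logs
# (workspace GUID is a placeholder; verify table/column names first):
az monitor log-analytics query \
  --workspace "00000000-0000-0000-0000-000000000000" \
  --analytics-query "AzureDiagnostics
    | where Category == 'kube-apiserver'
    | project TimeGenerated, log_s
    | take 20"
```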
To close up, I want to give you a couple of follow-up resources: first of all, additional documentation on how you bring up and learn AKS in Azure. There are a couple of learning seminars that give you more details and different insights, and we obviously also want to give you insights on how you can bring Azure Monitor into your AKS environment running in Azure. Thank you very much. I hope you enjoyed this and can implement what you've seen here today. I'm looking forward to having you in a different seminar on a different subject too.