How to monitor your Kubernetes clusters | Kubernetes Best Practices Series

Captions
Welcome to this session of the AKS Best Practices webinar. Today's subject is monitoring and logging. My name is Dennis Zielke, I'm part of the cloud native architecture team, and I'll be guiding you through today's subject. Let me give a short intro of what we're going to talk about: I want to introduce the problem we're trying to solve with logging and monitoring, introduce the Azure-native solution that is available to you, demo it hands-on with a couple of typical use cases, and give you resources for following up and implementing this in your own infrastructure.

So what is the problem in the first place? Containerized applications are a little different from traditional on-prem applications: if your application writes log files, you typically cannot go through those log files in a containerized environment in your cluster. You need a different approach, for the process problems as well as the technical ones. In that sense, you typically ship the logs that are written to standard out and standard error somewhere else for later indexing, searching, aggregation, and maybe even longer-term persistence. That addresses the question "what happened in my application": if a user problem seems to have occurred, you go through the log files, narrow down the time frame, understand the problem, and arrive at an option for fixing it.

The other thing you typically want is to monitor the application at runtime by observing the metrics your system exhibits, starting with the underlying infrastructure: the different nodes in the cluster expose metrics like CPU usage, memory, disks, and networking. The same is obviously also available at the levels above, for pods in different namespaces and for sets of pods that surface certain metrics, so you want observation from the application down to the underlying base metrics on the networking and file level, across different environments. Thinking of this as one bigger problem: how do I get logs efficiently, and how do I observe metrics at runtime, at any point, across different roles, environments, and applications?

Enter the solution: Azure Monitor is our first-party solution available in Azure. You can activate it very simply for containerized environments, specifically Kubernetes in this case, and make use of it. The typical problems we're trying to solve start with a simple one: "Is my cluster healthy?" From the infrastructure perspective, you have a set of environments and want to understand, in a simple view, that everything is good. The other part is a troubleshooting guide: if a problem seems to be occurring, how do you diagnose and drill down on it efficiently? And how do you enable people who might not be familiar with all the environments, think of average developers, to get down to the problem without connecting them directly to the machines?
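As a sketch of what that first health question can look like once the cluster's telemetry flows into a Log Analytics workspace (table and column names follow the Container insights schema; the 30-minute window is an arbitrary choice):

```kusto
// Latest reported status per node in the monitored clusters.
// KubeNodeInventory is populated by Azure Monitor for containers.
KubeNodeInventory
| where TimeGenerated > ago(30m)
| summarize arg_max(TimeGenerated, Status, KubeletVersion) by Computer
```

Any node whose latest Status is something other than "Ready" is the first thing to drill into.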
We want all of this at a very low maintenance overhead: running your own logging and monitoring infrastructure consumes additional cost in terms of personnel and infrastructure, and we want to keep that very low, which is the design point here. We also want the effort of bringing this into running environments to be as small as possible.

So how does it look in practice? Think about your average Azure environment: you have a set of clusters running across a set of subscriptions. As you can see here, I have three clusters in a set of subscriptions; two of them are monitored by Azure Monitor, which serves as the entry view onto all of my environments, and there is an additional environment, as you can see here, which has been detected but is not really monitored yet. I can bring it into Azure Monitor as well, and since in this case it is pure, non-managed Kubernetes infrastructure, I'm going to show how that can be done later. For the moment we'll start with the Azure-managed Kubernetes clusters.

I can see there are two clusters running, and one of them seems to have a problem. From the infrastructure perspective the cluster is running four nodes, all of which seem fine; the system pods are also running fine, so the Kubernetes side is all right. But in the user space, meaning my own application, something seems to be wrong: only about half of the pods are actually healthy and running, which might be relevant to drill down and look into. The good news is that there is another high-value business application running on a smaller cluster which doesn't have any problems, either in the kube-system pods or in the running application pods.

Being alerted to this, you naturally start by going a little deeper into the cluster that seems to have problems, in this case our dev environment. The whole point here is that this is a managed Kubernetes cluster, and it comes out of the box with the managed version of Azure Monitor included. First of all, we start again with the high-level view, the cluster overview. You see there is a set of nodes running in this cluster, in this case five, and there seem to be some scale operations going on for some reason, as the cluster size has increased from 2 to 5 in the last hour; I can change the time frame to get a sharper view of what's happening. The assumption would be that some sort of autoscaler is running, since the number of pods has also increased, and with the number of pods rising, the number of nodes naturally follows, provided you have the autoscaler rule activated. That alone doesn't seem troublesome: the node memory utilization looks fairly OK, and the CPU utilization seems fine as well. In the pending chart you can see that users or developers have requested the scheduling of a number of pods in this cluster, which sat in pending state for quite a while. The cluster size on the left-hand side has been increased to accommodate that need, and now we're happy in this space, because all the pods are simply running, nothing is pending, and nothing is unusual. That being a good sign, let's go a little deeper into what is happening here.
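If you want to reproduce that pending-pods view as a log query rather than a chart, a minimal sketch over the Container insights schema could look like this (KubePodInventory and its PodStatus column come from that schema; the window and bin size are arbitrary):

```kusto
// Distinct pods stuck in Pending, bucketed in 5-minute intervals per namespace.
KubePodInventory
| where TimeGenerated > ago(1h)
| where PodStatus == "Pending"
| summarize PendingPods = dcount(Name) by Namespace, bin(TimeGenerated, 5m)
| order by TimeGenerated asc
```

A count that drops back to zero after the scale-out confirms the autoscaler has caught up with demand.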
The next phase: having looked at the cluster, we now want to look at the nodes that are running this environment, which brings us to the next, tabular view of what is happening on these nodes. With the scale operations that happened, you can see that some of these nodes were introduced into this environment only recently. You can also see, in the columns we're comparing here, the actual memory consumption of each node versus the available limit on the box: this one is using 1.3 gigabytes out of 6.8 available, so we're happy with this view. When we look at a node on the right-hand side, we also see the Kubernetes version this environment is running, the underlying operating system, and the Docker engine version; we can also look at the labels and the metadata attached to the node, which lets us keep track of the networking and see what is actually happening. There is also a tree view that lets us drill a little deeper into which pods are scheduled on this particular node. If we want to form an assumption about what is happening, we can also switch between different metrics and between nodes, but since nothing serious is happening here, we can be happy in this space.

Now we come to the point of actually analyzing the underlying application, and for that we're going to switch back to the other cluster, because we have a specially prepared application that should help us here. You can see that the cluster, or rather the application, has repaired itself: with the scale operations that happened, these pods are now scheduled, or scaled back down because the user load has gone down, and our cluster is serving happily again.

Let's move on to the other application and take a deeper look into how applications are performing. Assume in this case that we've been getting reports that a production system hasn't been performing well, which is a bit surprising because everything is green here; but if we switch the view, we get an indication that something else might be going on. What Azure Monitor also does is monitor the application's actual behavior against its allowed specification. What we see here is that there is a buggy, crashing application: the memory working set of the application has, in this case, a limit of 256 megabytes. The application seems to run fine at the beginning, consuming only 50 megabytes, but consumption increases over time until eventually the application is killed by the orchestrator because it was simply consuming too much; the color indicates that this is very likely what happened (see the query sketch after this section). The application here did in fact contain a memory leak, and the interesting part is going over to troubleshooting what was happening. I want insight into what's happening in the application at runtime, so I want to drill down into the logs emitted by this pod. What I can see here is the specification and the controller involved, and if I look at the individual container instance, I also have the option to look at the live logs being streamed towards me.
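The memory-working-set-versus-limit chart shown in the portal can also be reproduced from the Perf table that Container insights fills; this is a sketch using the documented counter name (the one-hour window and one-minute bins are arbitrary):

```kusto
// Memory working set per container over time; a line climbing steadily
// toward the container's limit is the classic memory-leak signature.
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "K8SContainer" and CounterName == "memoryWorkingSetBytes"
| summarize AvgWorkingSet = avg(CounterValue) by InstanceName, bin(TimeGenerated, 1m)
| render timechart
```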
Seeing this here: my application surfaces a set of logs, so I can go in and look at what is happening in real time. Here is another application that demonstrates this better: I can send a message to this application from the outside, so I just trigger another message. Let me change the text to "hi there webinar" and fire it up, and you can see that the response, the log output, is made available on this side quite quickly. I also have the option to search live within this: if I want to look for something, say "webinar", I can see it highlighting all the messages that contain it, which gives me good insight into the output. That helps me, as a developer, get insight into live logs as they happen on the production system. The other part that might be helpful is insight into what the application is doing: in this case you can see that the application is continuously consuming more memory; there is the memory leak, and you see in real time how the application pulls in more and more memory and eventually crashes.

So that was the real-time view. Now I want to go back in time and see how often this has been happening. I take my buggy app, and as you can see, next to the specification of the application and the live log there is also the option to drill into the persisted logs of that particular application. When I do this, I'm redirected to another dashboard, which has already created a query specifically for this container: it is using the time frame I gave before, and it has inserted the container name and the right cluster. You can have multiple clusters and multiple environments, so we need to be specific in this space. I can also aggregate, and produce a set of rows that is output here and lets me page through the days of logs that have been persisted for this application. I can also search the output for things that might be relevant, so this gives me a view of how I can formulate queries per system and drill down into everything that is happening.

With a smarter query, I can also surface only specific events. In this case the schema gives me, for example, the option to ask for all the container-state-failed messages from this buggy app, and to count the number of container crashes in the last 10 minutes. That has an advantage, because maybe you don't want to sift through all the logs; you also want to be notified when something happens. Think about the problem: if that application crashes continuously again after I've implemented the fix, I want a notification right away, or someone should be notified right away to take another look, because now I know this application might be troublesome later on. I can persist a query and export the output, but what's also interesting is that I can define alert rules: I go in here, say I want to reuse this rule, I've already added a condition for it, and I want to use the crash count as the measure.
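The crash-counting query described here, against the ContainerInventory table of the Container insights schema, could be sketched like this ("buggy-app" is a stand-in for the demo container's real name; adjust the state filter to the values your workspace actually records):

```kusto
// Count how often the buggy app's container entered the Failed state
// in the last 10 minutes - the same measure the alert rule reuses.
ContainerInventory
| where TimeGenerated > ago(10m)
| where Name contains "buggy-app"        // hypothetical container name
| where ContainerState == "Failed"
| summarize CrashCount = count()
```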
So let's say: if the application crashes more than 3 times in 10 minutes, based on an aggregation period of 5 minutes evaluated every 10 minutes, I want to get a notification. This is the next step the Azure Monitor backend allows: using these log queries against action groups you define. Here I can send myself a mail; I can trigger some kind of Azure Function or Logic App that starts a process interfacing with my ITSM system; I can trigger some kind of webhook and react with custom code to whatever is happening; or I can trigger an Automation Runbook, think of a runbook that expands the cluster or changes configuration rules depending on the problem at hand. We're not going to do that now, but the important part is that there is an option not only to use log messages in retrospect, but also to use logs to trigger events, which might be more relevant in this space, and to share this with additional teams.

It might also be interesting to create some kind of chart. In this case I want an overview of how many times the application has been crashing: a bit of a boring chart, but you can imagine that, based on the available queries and the schema you can see on the left-hand side, you can create pretty interesting queries and share them with the bigger team. Think about turning such a graphic into a dashboard shared with a set of teams, say, my operational dashboard, which I serve to my operations team. They can then configure and drill down into not only the metrics, logs, and graphs from my Kubernetes infrastructure, but maybe also from my network, storage, and the other parts of the application, giving a bigger, holistic view of what is happening. I can share this with everybody who has an Azure AD account in my environment, and set different permissions on it to make sure the overview is available to everybody. You might not always give everybody the same permission to access the actual infrastructure, but I think it makes a lot of sense to give people an overview of the logs, or at least the dashboards serving these infrastructures.

Coming back to the environment, let's drill back down into the cluster. There are a couple of additional points that might be interesting here as well. Besides looking at these environments, you also have the option to surface the metrics here, and the logs, as I showed you, are accessible here too, giving you a review of what is happening. The other interesting part is the schema: which properties of each of these messages are available for filtering or for building queries. As you can see, there is a large set available, text, numbers, datetimes, and all of these can be used in different ways, which is very helpful for narrowing down your assumption about what is happening in the environment, validating it, and keeping a direct overview of what is going on. Another operational aspect that might be interesting is that you may not only care about your own application, but also about the backend.
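Before moving on to those backend logs, a note on the schema point: Kusto's getschema operator lists every column of a table together with its data type, which is a quick way to see which properties are available for filters, for example:

```kusto
// List all columns (and their types) that container log records expose.
ContainerLog
| getschema
```

The same works for KubePodInventory, KubeNodeInventory, Perf, and the other Container insights tables.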
This is a managed environment, which means you do not actually have access to the master nodes that run it, but you might still be interested in the scheduler logs or an audit log; they give you an overview of what's happening in the environment. This is a little bit hidden, and you have to know that if you click on the resource group that contains your AKS cluster, you can open the diagnostic settings; there you have the option to turn on diagnostic logging, and to include the kube API server, the controller manager, the autoscaler, the scheduler, and the audit logs, and forward all of these into the same Log Analytics workspace, meaning the same place where you query, monitor, and build dashboards and alert rules (a query sketch follows at the end of this section).

Something else that might be interesting is how you bring all of this into an existing environment. In the user interface this is a very simple flow: as you spin up an environment, it asks a couple of questions, and for monitoring you simply leave the switch on yes. You can serve multiple AKS clusters from the same Log Analytics workspace, which makes a lot of sense if the applications are tied together logically, so you can compare logs across the different clusters; or you can spin up an individual workspace per cluster and keep them separate.

There is also the option to create these things declaratively: in this case, as you can see from the naming convention, I created all these infrastructure artifacts using Terraform. Next to native Azure Resource Manager templates, you have the option to make sure that when you spin up an infrastructure, the networking, the Log Analytics workspace, the Kubernetes cluster, the storage accounts, and all the other resources used by the application are defined in one template that you can roll out consistently across different environments, so that they are all configured the same way and don't each drift in their own direction.

That was the overview. The point is that we want to make it very easy for you to spin up an AKS cluster and attach a monitoring solution for your workloads that doesn't cost you much, because it has no infrastructure cost, only the storage persisting the logs, and is available as a managed service that is upgraded by Microsoft in your cluster.

Having hopefully addressed all these points, let's go a bit deeper into the bigger picture. We want to make this available as a cross-tenant solution: think about having an Azure tenant with many subscriptions, and those subscriptions holding different resources running different applications. There is a large stack of things you want to monitor and drill down into, not only in the container context, but also across the other resources in use, think about the databases, the storage accounts, the network pieces that make this work. Health-wise, there is also an integration point into the processes that deploy the application, in this case, for example, Azure DevOps, which is our solution for bringing applications into your environment and making sure all of these pieces are available.
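Picking up the diagnostic settings from a moment ago: once enabled, the control plane logs land in the AzureDiagnostics table of the same workspace. A minimal sketch of querying the API server logs (the Category values match the log names chosen in the diagnostic settings, and log_s carries the raw log line; "kube-scheduler" and "kube-audit" work the same way):

```kusto
// Most recent API server log lines forwarded from the managed control plane.
AzureDiagnostics
| where Category == "kube-apiserver"
| project TimeGenerated, log_s
| order by TimeGenerated desc
```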
This gives me insight through the different visualization forms we showed in the dashboards, letting me drill in and see what is happening across all the artifacts and resources I've provisioned, and giving me tools to analyze it, think about the metrics and the log space we've shown. It also means triggering automated responses to events on the resources in use, think about the application scenario we showed: if your application crashes three times, it should first of all notify somebody, it may even trigger an automated response, and it can integrate with other line-of-business processes you might already have in place, in this case something like an incident management system that you want to notify for each error message, so that someone can take a look later and understand what happened. That is the bigger picture of how Azure Monitor fits, first of all, into your Azure environment, but also into how you manage processes inside your organization.

The onboarding experience is actually quite fast, as shown in the user interface, and there is also the option to define and deploy this using Terraform if you prefer; it's actually not that complicated. So one option is to spin this up interactively for the cluster, and the other is to do it declaratively with Terraform and configure it yourself.

Control plane logs are something we showed already: I showed you how to activate them, and how you can formulate a query against the diagnostic log that contains the master logs. That is particularly interesting if you suspect something is wrong that is not directly related to the application but rather to how the cluster operates, because it lets you figure out whether it's a developer problem or an orchestrator operations problem that you might need to solve by involving somebody else.

To close, I want to give you a couple of follow-up resources: first of all, additional documentation on how to get started and learn AKS in Azure. There are a couple of learning seminars that give you more details and different insights, and we obviously also want to show you how to bring Azure Monitor into your AKS environment running in Azure. Thank you very much. I hope you enjoyed this and can implement what you've seen here today. I'm looking forward to having you at another seminar on a different subject.
Info
Channel: Microsoft Azure
Views: 13,698
Keywords: Microsoft, Azure, Kubernetes, Azure Kubernetes Services, Azure Monitor for Containers, kubernetes, monitoring, prometheus, grafana, logging, key metrics, demo, monitor your Kubernetes clusters, Kubernetes clusters, Dennis Zielke
Id: RjsNmapggPU
Length: 29min 13sec (1753 seconds)
Published: Thu Feb 14 2019