In this video, I’ll show you how
to set up a production-ready Azure Kubernetes Service using Terraform and run
multiple tests to verify its functionality. First of all, we’ll create an Azure virtual
network (VNet) using Terraform. If you have never used Terraform before, I’ll guide you step
by step. Then, we’ll create a couple of subnets, which are, in my opinion, a simplified version of AWS networking,
and we will discuss why during the tutorial. Then, we’ll create a managed identity
for the AKS cluster and use it to create Kubernetes itself. We’ll bind it with an
Azure role to be able to create public and private load balancers in Kubernetes.
Additionally, besides the default node group, we’ll create another one using spot
instances and add taints to that node group. We will also enable an OpenID
Connect provider and workload identity. For the first example, we'll create a simple
Kubernetes deployment and expose it using a Kubernetes service of type LoadBalancer. We'll
configure it to create a private or internal load balancer within the VNet we previously set up.
Additionally, we'll expose the same application using the public or external load balancer,
to allow anyone on the internet to access it. For the second example, we'll create another deployment to test the auto-scaling
feature built into the AKS cluster. In the 3rd example, we’ll deploy an NGINX
ingress controller using Helm and Terraform, and expose our application to
the internet using Ingress. In the 4th example, we’ll additionally deploy
cert-manager to automatically obtain and renew certificates from Let's Encrypt and secure
our Ingress with TLS. I’ll also show you step by step how to debug cert-manager
if you fail to obtain a certificate. And finally, in the 5th example, we'll
test workload identity. We’ll create a managed identity in the Azure cloud and
map it to the Kubernetes pod running inside the cluster. To test, we’ll create a
storage account and a container, and use a Kubernetes deployment based on the azure-cli to
authenticate and list objects in the container. You can find the source code and all the
commands that I run in my public GitHub repository. This time, I decided to do this
tutorial a little bit differently. Instead of creating all those Terraform files and examples,
I’ll put myself in the viewer's shoes and assume that I just cloned this repo. I’ll still go over
the code and explain what all parts are doing, and we’ll make changes when it’s necessary;
for example, for workload identity, you would need to replace the client id.
Let’s go ahead and start. For all my tutorials, I’ll try
to use only a few variables, which makes the code more readable and easier
to follow. For production use, you would of course parameterize as much as possible and
perhaps convert this code to a Terraform module. Now, here are the most common variables
that you may want to change. Typically, we provision more than one environment
in a single Azure account, so it’s very helpful to have an environment prefix for
each component, such as VNet, Kubernetes, and identities. Then we have a region, which
I think is the cheapest one to test. Since we need to reference the resource group in multiple
places, I decided to factor this out as a local variable as well. Regarding the AKS, I just copied
the same variables from the AWS EKS tutorial; that’s why I have 'eks'—it should be 'aks', but
it's not a big deal. And finally, the version: 1.27 is the latest Kubernetes version,
which is generally available in Azure.
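Here is roughly what those locals look like; the exact names and values are assumptions on my part, so check the repo:

```hcl
# Rough sketch of the locals discussed above; names and values may differ in the repo.
locals {
  env            = "dev"            # environment prefix for every component
  region         = "eastus"         # assumption: a cheap region for testing
  resource_group = "${local.env}-rg"
  eks_name       = "demo"           # copied from the EKS tutorial; 'aks' would be a better name
  eks_version    = "1.27"           # latest GA Kubernetes version at recording time
}
```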
Next, we need the Azure provider and must set up a few version constraints for each provider we’ll use. Later in the tutorial, we’ll use Helm to
deploy NGINX ingress and Cert-Manager. By the way, I use numbers as file prefixes just to sort
them in Visual Studio Code for the tutorial; you don’t need to do it. Terraform, in general,
does not care about file naming conventions.
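As a sketch, the provider setup looks something like this; the version constraints here are assumptions, so use the ones pinned in the repo:

```hcl
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"  # assumption
    }
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.12" # assumption; used later for NGINX ingress and cert-manager
    }
  }
}

provider "azurerm" {
  # The features block is required by the azurerm provider, even if empty.
  features {}
}
```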
Then, we need to create a resource group. A resource group is a container that holds related resources for Azure, and you should
place resources, such as VNet and others, in the same group when they have
the same lifecycle. For example, AKS will create its own resource
group for the Kubernetes node pools. For the resource group, we just need
to specify the name and the location. Next, we will create a virtual
network from scratch. I call it a VPC, but in Azure terminology, it’s a VNet, which
is almost the same thing, but in my opinion, it’s a simplified version of the AWS VPC because
you don’t need to set up separate public and private subnets, route tables, and other networking components
such as an internet gateway and NAT gateways. Let’s call it 'main', but I would also suggest
using the environment as a prefix and calling it 'dev-main'. Next, let’s define the CIDR /16,
which will give us around 65,000 IP addresses. Then, specify the same region and the
resource group that we just defined in the previous file. If you’re not
familiar with Terraform, you can use the cross-reference feature and use the name
attribute exported by the resource group resource. Finally, let’s add a tag to indicate that this
VNet is created for the dev environment. Next, we need to create a couple of subnets. In
AWS, each subnet maps to a single availability zone; on the other hand, subnets in Azure
are created in the region where you create a VNet. Also, you don’t need to create an
internet gateway and NAT gateways. All subnets in Azure have internet access by default;
you just need to have a public IP address. Let’s create two subnets: Subnet1 and Subnet2.
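Put together, the resource group, VNet, and subnets look roughly like this; the CIDR split and names are assumptions, so compare with the repo:

```hcl
resource "azurerm_resource_group" "this" {
  name     = local.resource_group
  location = local.region
}

resource "azurerm_virtual_network" "main" {
  name                = "main"           # or "${local.env}-main"
  address_space       = ["10.0.0.0/16"]  # roughly 65,000 addresses
  location            = local.region
  resource_group_name = azurerm_resource_group.this.name

  tags = {
    env = local.env
  }
}

resource "azurerm_subnet" "subnet1" {
  name                 = "subnet1"
  resource_group_name  = azurerm_resource_group.this.name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = ["10.0.0.0/19"] # assumption
}

resource "azurerm_subnet" "subnet2" {
  name                 = "subnet2"
  resource_group_name  = azurerm_resource_group.this.name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = ["10.0.32.0/19"] # assumption
}
```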
Now, if you already have a VNet and subnets and want to create AKS in the existing virtual
network, you can use a data resource instead. However, I would strongly recommend that if
you created them manually, import them into Terraform first and use a standard resource
reference instead of the data resource. Next, similar to AWS, we have an option to create
a managed identity for our cluster. In AWS, it is mandatory, but in Azure, it’s optional.
However, in our case, if we want to create private load balancers in our subnets, we need to create
it first. Then, we need to bind that managed identity with at least the Network Contributor
role, and the scope must be the resource group that we created for the virtual network. We need
that because the default AKS identity would have permissions only in the resource group created
by the AKS itself. Also, you may notice that sometimes I use 'this,' which is a common pattern
while creating a Terraform module when you have a single instance of that resource. Also, I
use 'base' for that managed identity because, later down the road, we’ll create another
managed identity to test workload identity.
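Here’s the shape of that identity and role binding (a sketch; the exact names may differ):

```hcl
resource "azurerm_user_assigned_identity" "base" {
  name                = "base"
  location            = local.region
  resource_group_name = azurerm_resource_group.this.name
}

resource "azurerm_role_assignment" "base" {
  # Scope it to the resource group that holds the VNet, so AKS can
  # create internal load balancers in our subnets.
  scope                = azurerm_resource_group.this.id
  role_definition_name = "Network Contributor"
  principal_id         = azurerm_user_assigned_identity.base.principal_id
}
```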
Now we can create the Kubernetes cluster itself. AKS represents the managed control plane of Kubernetes; the worker nodes will
still run on virtual machines that we configure and pay for. For the name, let’s use the environment as
a prefix. Then, we need to specify the same region as well as the resource group name. So,
the AKS control plane will be provisioned in our initial resource group, and it will
create another one to manage node pools. By the way, I use more or less the same
configuration parameters in other tutorials when creating AWS EKS and Google GKE clusters.
Next is a DNS prefix. It is used to generate a unique Fully Qualified Domain
Name for your cluster when it is created. Next, is the Kubernetes version. The latest
supported version in GA is 1.27, but most likely, by the time you watch this video, 1.28 will have
been released. You can pick the latest one or simply upgrade, which is as easy as changing
the version and running 'terraform apply.' Next, we can set the channel
to automatically upgrade the cluster. If you set it to something other
than 'none,' Azure will auto-upgrade your cluster. If you run stateful or some
custom operator-managed applications, you may want to manually upgrade the cluster
and carefully monitor the health of your apps. If you run a bunch of stateless applications,
you'll probably be fine with auto-upgrade. Next, I'll create a public cluster since
I don’t have a VPN or a bastion host, but four options are available. One
option is to create a private cluster with a public endpoint, but it’s in preview. Next, you can override the name for
the resource group. If you omit this, Azure will generate one for you anyway. I prefer
to have the same naming conventions everywhere. Then, for the tier, to test it out, use
the free version which, at this time, allows you to create up to a 10-node cluster.
For production, of course, use standard. Next is the OpenID Connect provider.
In AWS, you need to manually create it using the certificate thumbprint and the
issuer URL. In the case of Azure and Google, it’s basically a checkbox. Let’s enable
it since it is required for the workload identity. The next parameter is for the
identity itself; let’s enable it as well. Now, regarding the networking profile settings:
The 'azure' networking plugin (Azure CNI) allows AKS to use VNet-native networking, similar to AWS
and even GKE. If you want Cilium or other networking and service mesh plugins, you
can modify it. But it’s a good starting point. Then, you can specify service_cidr
to allocate IPs for the Kubernetes services, and also you can use the pod_cidr option
to allocate IP CIDR to assign to pods. Next, the default node pool for Kubernetes.
Pretty much the standard settings; you choose the VM type, version, and most notably, you
can enable autoscaling as easily as setting autoscaling to true, similar to Google Cloud. In
AWS, you define the autoscaling config and then need to additionally deploy a cluster autoscaler
as well as set up the OpenID Connect provider. Optionally, we can set custom labels, such as
role equal to general, which helps to perform node group migrations if you need to.
Next, we need to provide the managed identity that we created earlier,
set the tag to the dev environment, and optionally ignore node count since we enabled
autoscaling. Let’s also explicitly depend on the role assignment. That’s pretty much it for
setting up managed Kubernetes in Azure Cloud.
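For reference, here’s roughly what that cluster resource looks like with the azurerm 3.x provider; the VM size, upgrade channel, and node counts are assumptions, so check the repo before applying:

```hcl
resource "azurerm_kubernetes_cluster" "this" {
  name                = "${local.env}-${local.eks_name}"
  location            = local.region
  resource_group_name = azurerm_resource_group.this.name
  dns_prefix          = "demo"

  kubernetes_version        = local.eks_version
  automatic_channel_upgrade = "stable" # assumption; set to "none" for manual upgrades
  sku_tier                  = "Free"   # use "Standard" for production
  node_resource_group       = "${local.env}-${local.eks_name}-nodes"

  # Required for workload identity.
  oidc_issuer_enabled       = true
  workload_identity_enabled = true

  network_profile {
    network_plugin = "azure" # Azure CNI; VNet-native pod networking
  }

  default_node_pool {
    name                 = "general"
    vm_size              = "Standard_D2_v2" # assumption
    vnet_subnet_id       = azurerm_subnet.subnet1.id
    orchestrator_version = local.eks_version
    enable_auto_scaling  = true
    min_count            = 1
    max_count            = 10

    node_labels = {
      role = "general"
    }
  }

  identity {
    type         = "UserAssigned"
    identity_ids = [azurerm_user_assigned_identity.base.id]
  }

  tags = {
    env = local.env
  }

  lifecycle {
    # Node count is managed by the cluster autoscaler.
    ignore_changes = [default_node_pool[0].node_count]
  }

  depends_on = [azurerm_role_assignment.base]
}
```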
Now, most Kubernetes clusters have multiple node groups that are tailored for specific workloads. They can be compute or memory-optimized. They
can be GPU-based nodes to run machine learning pipelines, or they can be spot nodes, which are
much cheaper but can be terminated by Azure at any time. The config is very similar,
except we add node taints that would prevent any pods from being scheduled on them
unless they explicitly tolerate those taints. And also a few additional node labels that
can be used in node selectors or affinity.
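A sketch of that spot node pool (the VM size and counts are assumptions):

```hcl
resource "azurerm_kubernetes_cluster_node_pool" "spot" {
  name                  = "spot"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.this.id
  vm_size               = "Standard_D2_v2" # assumption
  orchestrator_version  = local.eks_version
  vnet_subnet_id        = azurerm_subnet.subnet1.id

  priority        = "Spot"
  eviction_policy = "Delete"
  spot_max_price  = -1 # pay up to the current on-demand price

  enable_auto_scaling = true
  min_count           = 1
  max_count           = 10

  node_labels = {
    role = "spot"
  }

  # Azure expects this taint on spot pools; only pods that tolerate it get scheduled here.
  node_taints = ["kubernetes.azure.com/scalesetpriority=spot:NoSchedule"]

  tags = {
    env = local.env
  }
}
```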
If you have never used Terraform or the Azure CLI with this account before, you need to authenticate with Azure first.
Run az login and log in through the browser. Next, you need to find the subscription
ID; you can do it from the console or by running az account list.
The last step is to set that subscription ID as your default. That’s
all; now we can initialize Terraform. It will initialize the local state and download
all providers mentioned in your Terraform code. And finally, run terraform
apply and confirm that you want to create all those components in the cloud.
In a few minutes, your VNet and the Kubernetes cluster should be ready. To authenticate with
the cluster, you need to run the get-credentials subcommand. It will update your local Kubernetes
config and set AKS as the active cluster. If you used a different name for the resource group
or a cluster name, you need to change it here. Then, a quick test to verify if we can access
the cluster: if kubectl get nodes returns your worker nodes, it means everything
is set up correctly, and we can continue. In the Azure console, you can also find
resources such as a resource group. We have two: one for networking components and AKS, and another
for Kubernetes node pools. Additionally, we have a Kubernetes cluster with the name 'dev-demo'.
Now, let’s run a first example. The goal is to ensure that we can create public and private
load balancers and expose our application to the internet. Here is a basic nginx deployment object.
Then we have a standard service of type load balancer. Every managed Kubernetes cluster, such
as EKS, AKS, and GKE, comes with a cloud controller manager that is responsible for provisioning
cloud-native components, such as load balancers. In this case, AKS will create a public
load balancer in your account and set up routing to your application. 'Public',
or sometimes referred to as 'external', means that the load balancer will have a
public IP address, reachable from the internet. Next, let’s expose the same application
but only within our virtual network, in case other applications deployed outside of
Kubernetes need to reach your application. This type of load balancer is usually called private or
internal. To set it to private, we need to add a special annotation to our service object.
Now, let’s apply. This private load balancer is exactly why we had to create a dedicated managed
identity and bind it with the Network Contributor role. If you were to use a default one created
by AKS and try to create a private load balancer, it would be stuck in a pending state due to
the lack of permissions for AKS to use our subnets to create a load balancer. However, a
public load balancer would still be created. Let's wait a few more minutes, and instead of
pending, you should see the IP addresses. For the public load balancer, we obtained
a public IP, but for the private one, we got an IP from the VNet range we defined.
I'll only test the external load balancer since I don't have a bastion or a VPN. It’s very easy
to test: just use curl and hit that public IP. Alright, we received a response from the
nginx running in the Kubernetes cluster. When you are done with the test,
I would suggest immediately terminating all pods and load balancers.
In the next example, we’ll test auto-scaling. In Azure and Google Cloud, it's very easy to
set up by simply defining the min and max node counts on the node pool. In AWS, you need to create
an OpenID Connect provider, set up permissions, and additionally deploy a cluster auto-scaler.
Nonetheless, let’s still test the auto-scaling. Here we have six pods, and I've also
defined the resource block. Initially, not all of them should fit, which should
trigger the auto-scaler to expand the cluster. Right now, I have two nodes, and
the second one should have taints, so it will be excluded by the Kubernetes
scheduler. Let's go ahead and apply. If you check the pods, you’ll notice that
some are in a pending state. Let's go ahead and describe one of them.
Now, you can see a message saying the autoscaler triggered a scale-up from 1 node to 2.
Let’s wait until all the pods are running and the new node has joined the cluster.
Maybe in about 1 minute, all the pods should be running. The cluster auto-scaler can
also remove nodes to reduce the cluster cost; you just need to terminate the pods.
That’s all for the second example. In the 3rd example, we’ll use an ingress. So
first, we need to deploy one of the ingress controllers in the cluster. For this video, I
picked the most common NGINX ingress controller, which we’ll deploy with Terraform and Helm.
First, similar to the Azure Terraform provider, we need to authenticate Helm with our Kubernetes
cluster. You can do it by obtaining a certificate from AKS. Define the data resource to pull AKS
data first. We also need to explicitly depend on the AKS cluster. Then, we use those values in
the provider and finally define the Helm release resource. An important part here is that we want
to override a few variables. One way to do it is by creating a values.yaml file and listing
all variables there that you want to override.
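The wiring looks roughly like this; the chart version, namespace, and values-file path are assumptions, and the controller.ingressClass setting is my guess at how the repo makes the controller honor the legacy annotation that cert-manager needs later:

```hcl
data "azurerm_kubernetes_cluster" "this" {
  name                = azurerm_kubernetes_cluster.this.name
  resource_group_name = azurerm_resource_group.this.name
  depends_on          = [azurerm_kubernetes_cluster.this]
}

provider "helm" {
  kubernetes {
    host                   = data.azurerm_kubernetes_cluster.this.kube_config[0].host
    client_certificate     = base64decode(data.azurerm_kubernetes_cluster.this.kube_config[0].client_certificate)
    client_key             = base64decode(data.azurerm_kubernetes_cluster.this.kube_config[0].client_key)
    cluster_ca_certificate = base64decode(data.azurerm_kubernetes_cluster.this.kube_config[0].cluster_ca_certificate)
  }
}

resource "helm_release" "external_nginx" {
  name             = "external"
  repository       = "https://kubernetes.github.io/ingress-nginx"
  chart            = "ingress-nginx"
  namespace        = "ingress"
  create_namespace = true
  version          = "4.8.0" # assumption

  # values/ingress.yaml would set the ingress class name (external-nginx) and the Azure
  # health-probe annotation on the controller service, e.g.
  # service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: /healthz
  values = [file("${path.module}/values/ingress.yaml")]

  # Assumption: lets the controller also watch the legacy kubernetes.io/ingress.class
  # annotation, which cert-manager's HTTP-01 solver ingress relies on (covered later).
  set {
    name  = "controller.ingressClass"
    value = "external-nginx"
  }
}
```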
In earlier versions of Kubernetes, we used annotations to specify the ingress class. Nowadays, we use the stable v1 API for
Ingress, which has an ingressClassName field. Similar to load balancers, you can
create public and private ingresses. For this tutorial, I’ll use the Public one and call
it external-nginx. If you want a private one, just add the same annotation that we
used on the load balancer. Additionally, Azure requires a health-probe request path for the controller’s service; if you
leave this out, your ingress won’t work. We also have a section for TLS that I’ll
cover later. Now, since we already applied the Terraform at the beginning, the ingress
controller is already deployed in our cluster. Let’s go over the example. It's a simple
deployment based on the echo server. Then, we have a service to create an endpoint and
route traffic to the pods. And we have the ingress resource. Here, we specify the ingress
class name, and we can route traffic to our echo server based on the path or a domain. You
can use any domain here; I’ll show how to simulate a host header later. I would even suggest
keeping this host in place and not changing it. Before we apply, let me show you that we have
the NGINX ingress controller installed in the cluster. If you check services in the ingress
namespace, you’ll find a public load balancer IP address. For every ingress that uses this
NGINX controller, you should create an A record pointing to this IP address.
Alright, let’s apply. First of all, make sure that the pod is running.
Next, check the ingresses. You can find the same IP address under the Address column. Keep
in mind that it may take some time for the NGINX controller to update this value, so
it’s not immediately present here. Again, for the ingress, you just create an A record
and point this host to this IP address. For this test, I don’t want to create any DNS
records. An easy test that you can perform involves overriding the host header. With
curl, for example, you can use the resolve flag and map the host to the IP locally.
Alright, it works! We received a response from the echo server running in Kubernetes. This concludes
the 3rd example. Don’t forget to clean up. Now, the next example is a bit more complicated.
We’ll use the same ingress but secure our endpoint with a TLS certificate issued from Let's Encrypt.
To automatically obtain and renew the certificate, we need to deploy an additional component called
cert-manager. We’ll use Helm to install it as well, and the configuration is minimal—just enable
CRDs. If you have one or two variables that you want to override, you can use set blocks.
There are two main ways you can prove that you own a domain to get a public certificate.
One way is the HTTP-01 challenge, which would require our ingress controller to dynamically
create an endpoint serving a token provided by the issuer, in this case, Let's Encrypt.
Another way is to use the DNS-01 challenge, which requires us to create TXT records with
a value provided by the issuer. In production, I tend to use the DNS-01 challenge more
often because you can obtain the certificate beforehand and test. But the DNS challenge
would require setting up workload identity. Now, cert-manager has traditionally refused to
use the new ingressClass field on the ingress resource to resolve the HTTP challenge. They
explain this by saying they want to be generic and allow other ingress controllers that don’t use
the ingress class field to still use cert-manager. Alright, to fix that, we need to
configure our NGINX controller to also watch for the ingress class annotation.
The first step to secure the ingress is to create issuers. First, let's create a staging
issuer. Let's Encrypt has a limit on how many certificates you can issue in a week, and if you
start testing with a production endpoint, you can quickly reach the limit. So, I always suggest
starting to test with a staging environment. Here, you also need to replace the email. In case your
cert-manager fails to renew the certificate, you’ll get a notification from Let's Encrypt.
Then, specify the secret name where the certificate and a private key will be stored.
And on line 18, we have the ingress class that will be used to resolve the HTTP-01 challenge.
This value will be used in the ingress annotation and not in the ingressClassName field. That’s why
we need extra settings on the ingress controller. Second is the production issuer, which
you should only start using after you validate the staging certificate.
The same deployment object as in the previous example.
Service object. And for the ingress itself, there are a few
important parts. First is the annotation with the type of the issuer. Then, we
have the extra TLS section. These names will appear on the certificate, and you can
have multiple hosts. However, in this case, the host field must be real, and we would need
to create an A record for the HTTP-01 challenge. Let’s apply. Make sure that the pod is running.
Here, you can notice two pods; the first one is from cert-manager. It
received a challenge token from Let's Encrypt and now exposes it through a temporary
ingress to prove that you own the domain. We also have two ingresses, and again,
the first one is temporarily created by the cert-manager. You can notice here that it
does not use the ingressClassName field—you can describe it to find out that it uses
an annotation to select the ingress class. Also, cert-manager created a custom resource
called Certificate, which is responsible for obtaining the certificate. I’ll
show you how to debug it now. So, let’s describe the certificate.
You can find that the certificate created another custom resource,
which is called CertificateRequest. Let’s describe that as well.
Now, the CertificateRequest created an Order. Let’s describe the Order.
The Order created a Challenge custom resource. Describe it.
Finally, we can see the error message explaining why we’re not getting the certificate.
Obviously, we need to create a DNS record. Let’s retrieve the ingress
again to find the IP addresses. I host it in Google Domains, but
it doesn't matter—you just need to create an A record pointing to the shared
NGINX ingress controller load balancer IP. Let’s watch the certificate. DNS can take
up to 24 hours to update but usually takes a few minutes. So, in my case, it took
maybe 5 minutes. If, in 10 minutes, the certificate is still marked as false, try
to describe it again and find an error message. Alright, so now we have the certificate,
but keep in mind that it’s not a real one. It works, but we get a warning that the connection
is not private, which is expected. Let’s check the certificate and ensure that it was issued
from the Let's Encrypt staging environment. After that, we can use the production
issuer. You just need to update 'staging' to 'production' and reapply the ingress.
Cert-manager will create a new challenge and obtain a real, new certificate. Wait until it’s
true again and then go to the web browser to test. Alright, it works, and if you inspect the
certificate, you’ll find that it’s a real one and you have a lock indication
that the connection is now secure. That’s all for this example; let’s clean up.
In the last, 5th example, I want to show you how to set up workload identity. It simply means that
you can map Azure managed identities, and the roles assigned to them, to individual Kubernetes pods. As you remember, we enabled the OpenID
Connect provider and workload identity on the AKS cluster resource. Here, we’ll create a new
managed identity and map it with the Kubernetes service account. Let’s call it dev-test.
Then, be careful here; we need to create a federated identity credential resource. The name
can be anything, but let’s keep dev-test. Then the resource group name; we’ll use the one we
created for the virtual network. The audience stays the same; you don’t need to change it. The
issuer is the OpenID Connect provider issuer URL created by the AKS. We can use the Terraform
resource cross-reference to obtain that value. The Parent ID is the managed identity we just
defined. Then, the subject is the service account, which will be located in the dev Kubernetes
namespace, and the name of the Kubernetes service account is my-account. This
is needed to establish trust between Azure AD and the Kubernetes service account.
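Here’s roughly what that pair of resources looks like (a sketch; the names are assumptions):

```hcl
resource "azurerm_user_assigned_identity" "dev_test" {
  name                = "dev-test"
  location            = local.region
  resource_group_name = azurerm_resource_group.this.name
}

resource "azurerm_federated_identity_credential" "dev_test" {
  name                = "dev-test"
  resource_group_name = azurerm_resource_group.this.name
  audience            = ["api://AzureADTokenExchange"]
  issuer              = azurerm_kubernetes_cluster.this.oidc_issuer_url
  parent_id           = azurerm_user_assigned_identity.dev_test.id
  # Trust tokens issued for the my-account service account in the dev namespace.
  subject             = "system:serviceaccount:dev:my-account"
}
```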
To test this workload integration, let’s create a storage account and a container,
which is similar to an S3 bucket in AWS. Each storage account must have a globally unique name, so
let’s use a random integer to generate one. Then, we’ll create a container
in that storage account. And finally, we’ll grant the dev-test
managed identity that we just created the Storage Blob Data Contributor role. It will allow it to get, list,
create, and delete objects in that container.
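A sketch of the storage pieces; the naming scheme and the role are assumptions (data-plane blob access over Azure AD normally needs one of the Storage Blob Data roles):

```hcl
resource "random_integer" "this" {
  min = 10000
  max = 99999
}

resource "azurerm_storage_account" "test" {
  # Storage account names must be globally unique, 3-24 lowercase alphanumeric characters.
  name                     = "devtest${random_integer.this.result}"
  resource_group_name      = azurerm_resource_group.this.name
  location                 = local.region
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

resource "azurerm_storage_container" "test" {
  name                  = "test"
  storage_account_name  = azurerm_storage_account.test.name
  container_access_type = "private"
}

resource "azurerm_role_assignment" "dev_test" {
  scope                = azurerm_storage_account.test.id
  role_definition_name = "Storage Blob Data Contributor" # assumption
  principal_id         = azurerm_user_assigned_identity.dev_test.principal_id
}
```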
Let’s switch to the Kubernetes side. First, we need to create a dev namespace. Then, the service account. To bind this service
account with the managed identity in Azure, we use the azure.workload.identity/client-id annotation. Of
course, you can automate this with Terraform, but to learn, it’s way better this way.
Now we need to replace the client-id. You can use a Terraform output variable
or get it from the Azure console. Go to managed identities.
Select dev-test. And copy and replace the client id.
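If you prefer the Terraform route mentioned above, an output like this exposes the client ID after apply:

```hcl
output "dev_test_client_id" {
  # Paste this value into the azure.workload.identity/client-id annotation
  # on the my-account service account.
  value = azurerm_user_assigned_identity.dev_test.client_id
}
```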
Let’s also create a deployment object based on the azure-cli image to test workload identity access.
Let’s apply it. Wait until the pod is running and then exec into that
pod. Again, you'll find all the commands in the README file; this one is used to authenticate with
Azure. If you get something other than this, you made a mistake somewhere, and you need to go back.
So, in the Azure console, you can find the storage account that was created by Terraform.
And also, we have a test container. Now, from the pod, let’s run the blob list
command to get all the objects in that container. Well, we got an empty array since we don’t
have anything there yet. If you upload a file in that container and run this command
again, you'll get metadata about that object. Alright, that concludes this tutorial. If you
want to learn more about AWS EKS and Google GKE, I have the tutorials and the Terraform
code to create and test those clusters on my channel. Thank you for watching,
and I’ll see you in the next video.