Azure Kubernetes Service (AKS) Tutorial: Terraform, Nginx Ingress & TLS, OIDC Workload Identity

Captions
In this video, I’ll show you how to set up a production-ready Azure Kubernetes Service using Terraform and run multiple tests to verify its functionality. First of all, we’ll create an Azure virtual network (VNet) using Terraform. If you have never used Terraform before, I’ll guide you step by step. Then, we’ll create a couple of subnets, which, in my opinion, are a simplified version of AWS subnets, and we will discuss why during the tutorial. Then, we’ll create a managed identity for the AKS cluster and use it to create Kubernetes itself. We’ll bind it with an Azure role to be able to create public and private load balancers in Kubernetes. Additionally, besides the default node group, we’ll create another one using spot instances and add taints to that node group. We will also enable an OpenID Connect provider and workload identity.

For the first example, we'll create a simple Kubernetes deployment and expose it using a Kubernetes service of type LoadBalancer. We'll configure it to create a private, or internal, load balancer within the VNet we previously set up. Additionally, we'll expose the same application using a public, or external, load balancer to allow anyone on the internet to access it.

For the second example, we'll create another deployment to test the auto-scaling feature built into the AKS cluster.

In the 3rd example, we’ll deploy an NGINX ingress controller using Helm and Terraform, and expose our application to the internet using Ingress.

In the 4th example, we’ll additionally deploy cert-manager to automatically obtain and renew certificates from Let's Encrypt and secure our Ingress with TLS. I’ll also show you step by step how to debug cert-manager if you fail to obtain a certificate.

And finally, in the 5th example, we'll test workload identity. We’ll create a managed identity in the Azure cloud and map it to a Kubernetes pod running inside the cluster. To test it, we’ll create a storage account and a container, and use a Kubernetes deployment based on the azure-cli image to authenticate and list objects in the container.

You can find the source code and all the commands that I run in my public GitHub repository. This time, I decided to do this tutorial a little bit differently. Instead of creating all those Terraform files and examples from scratch, I’ll put myself in the viewer's shoes and assume that I just cloned this repo. I’ll still go over the code and explain what all the parts are doing, and we’ll make changes when necessary; for example, for workload identity, you would need to replace the client ID. Let’s go ahead and start.

For all my tutorials, I try to use only a few variables, which makes the code more readable and easier to follow. For production use, you would of course parameterize as much as possible and perhaps convert this code to a Terraform module.

Now, here are the most common variables that you may want to change. Typically, we provision more than one environment in a single Azure account, so it’s very helpful to have an environment prefix for each component, such as the VNet, Kubernetes, and identities. Then we have a region, which I think is the cheapest one to test in. Since we need to reference the resource group in multiple places, I decided to factor it out as a local variable as well. Regarding the AKS, I just copied the same variables from the AWS EKS tutorial; that’s why I have 'eks'. It should be 'aks', but it's not a big deal. And finally, the version: 1.27 is the latest Kubernetes version that is generally available in Azure.
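Put together, the locals block looks roughly like this. This is a minimal sketch: the environment prefix, the 'eks' naming, and the version come from the walkthrough, while the region and resource group values are placeholder assumptions.

```hcl
# Minimal sketch of the locals described above; only env, eks_name, and
# eks_version are confirmed by the walkthrough, the rest are assumptions.
locals {
  env            = "dev"       # environment prefix used for the VNet, AKS, and identities
  region         = "eastus"    # assumption: pick a cheap region for testing
  resource_group = "tutorial"  # assumption: factored out because it is referenced in many places
  eks_name       = "demo"      # kept as 'eks' from the AWS tutorial; it should really be 'aks'
  eks_version    = "1.27"      # latest Kubernetes version generally available in Azure at recording time
}
```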
Next, we need the Azure provider, and we must set a few version constraints for each provider we’ll use. Later in the tutorial, we’ll use Helm to deploy the NGINX ingress controller and cert-manager. By the way, I use numbers as file prefixes just to sort the files in Visual Studio Code for the tutorial; you don’t need to do that. Terraform, in general, does not care about file naming conventions.

Then, we need to create a resource group. A resource group is a container that holds related Azure resources, and you should place resources, such as the VNet and others, in the same group when they share the same lifecycle. For example, AKS will create its own resource group for the Kubernetes node pools. For the resource group, we just need to specify the name and the location.

Next, we will create a virtual network from scratch. I call it a VPC, but in Azure terminology it’s a VNet, which is almost the same thing; in my opinion, it’s a simplified version of the AWS VPC because you don’t need to set up different subnets, route tables, and other networking components such as internet gateways and NAT gateways. Let’s call it 'main', but I would also suggest using the environment as a prefix and calling it 'dev-main'. Next, let’s define the /16 CIDR, which will give us around 65,000 IP addresses. Then, specify the same region and the resource group that we just defined in the previous file. If you’re not familiar with Terraform, you can use the cross-reference feature and use the name attribute of the resource group. Finally, let’s add a tag to indicate that this VNet is created for the dev environment.

Next, we need to create a couple of subnets. In AWS, each subnet maps to a single availability zone; on the other hand, subnets in Azure are created in the region where you create the VNet. Also, you don’t need to create an internet gateway and NAT gateways: all subnets in Azure have internet access by default; you just need a public IP address. Let’s create two subnets: subnet1 and subnet2. Now, if you already have a VNet and subnets and want to create AKS in the existing virtual network, you can use a data resource instead. However, if you created them manually, I would strongly recommend importing them into Terraform first and using a standard resource reference instead of the data resource.

Next, similar to AWS, we have an option to create a managed identity for our cluster. In AWS, it is mandatory, but in Azure, it’s optional. However, in our case, if we want to create private load balancers in our subnets, we need to create it first. Then, we need to bind that managed identity with at least the Network Contributor role, and the scope must be the resource group that we created for the virtual network. We need that because the default AKS identity would have permissions only in the resource group created by AKS itself. Also, you may notice that sometimes I use 'this', which is a common pattern when creating a Terraform module and you have a single instance of that resource. I use 'base' for this managed identity because, later down the road, we’ll create another managed identity to test workload identity.
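Here is a condensed sketch of the resource group, VNet, subnets, and the 'base' managed identity with its Network Contributor binding, as described above. The resource labels and CIDR ranges are my assumptions and may differ from the repository.

```hcl
resource "azurerm_resource_group" "this" {
  name     = "${local.env}-${local.resource_group}"
  location = local.region
}

resource "azurerm_virtual_network" "main" {
  name                = "${local.env}-main"
  location            = local.region
  resource_group_name = azurerm_resource_group.this.name
  address_space       = ["10.0.0.0/16"]

  tags = {
    env = local.env
  }
}

resource "azurerm_subnet" "subnet1" {
  name                 = "subnet1"
  resource_group_name  = azurerm_resource_group.this.name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = ["10.0.0.0/19"]   # assumption
}

resource "azurerm_subnet" "subnet2" {
  name                 = "subnet2"
  resource_group_name  = azurerm_resource_group.this.name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = ["10.0.32.0/19"]  # assumption
}

# Managed identity for the cluster, bound to Network Contributor on the
# resource group so AKS can create load balancers in our subnets.
resource "azurerm_user_assigned_identity" "base" {
  name                = "base"
  location            = local.region
  resource_group_name = azurerm_resource_group.this.name
}

resource "azurerm_role_assignment" "base" {
  scope                = azurerm_resource_group.this.id
  role_definition_name = "Network Contributor"
  principal_id         = azurerm_user_assigned_identity.base.principal_id
}
```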
Now we can create the Kubernetes cluster itself. AKS represents the control plane of Kubernetes; the worker nodes are still virtual machines that we manage. For the name, let’s use the environment as a prefix. Then, we need to specify the same region as well as the resource group name. So, the AKS control plane will be provisioned in our initial resource group, and AKS will create another one to manage the node pools. By the way, I use more or less the same configuration parameters in other tutorials when creating AWS EKS and Google GKE clusters.

Next is a DNS prefix. It is used to generate a unique fully qualified domain name for your cluster when it is created.

Next is the Kubernetes version. The latest supported version in GA is 1.27, but most likely, by the time you watch this video, 1.28 will have been released. You can pick the latest one or simply upgrade, which is as easy as changing the version and running 'terraform apply'.

Next, we can set the channel to automatically upgrade the cluster. If you set it to something other than 'none', Azure will auto-upgrade your cluster. If you run stateful or custom operator-managed applications, you may want to upgrade the cluster manually and carefully monitor the health of your apps. If you run a bunch of stateless applications, you'll probably be fine with auto-upgrade.

Next, I'll create a public cluster since I don’t have a VPN or a bastion host, but four options are available. One option is to create a private cluster with a public endpoint, but it’s in preview at this time.

Next, you can override the name of the resource group that AKS creates for the node pools. If you omit this, Azure will generate one for you anyway; I prefer to have the same naming conventions everywhere. Then, for the tier: to test things out, use the Free tier, which at this time allows you to create up to a 10-node cluster. For production, of course, use Standard.

Next is the OpenID Connect provider. In AWS, you need to manually create it using the certificate thumbprint and the issuer URL. In the case of Azure and Google, it’s basically a checkbox. Let’s enable it, since it is required for workload identity. The next parameter is workload identity itself; let’s enable it as well.

Now, regarding the network profile settings: the 'azure' network plugin allows AKS to use native VNet networking, similar to AWS and even GKE. If you want Cilium or other networking and service mesh plugins, you can change it, but it’s a good starting point. Then, you can specify service_cidr to allocate IPs for the Kubernetes services, and you can use the pod_cidr option to define the CIDR from which pod IPs are assigned.

Next is the default node pool for Kubernetes. These are pretty much the standard settings: you choose the VM type and version, and most notably, you can enable autoscaling as easily as setting it to true, similar to Google Cloud. In AWS, you define the autoscaling config and then additionally need to deploy a cluster autoscaler as well as set up the OpenID Connect provider. Optionally, we can set custom labels, such as role equal to general, which helps to perform node group migrations if you need to.

Next, we need to provide the managed identity that we created earlier, set the tag to the dev environment, and optionally ignore the node count since we enabled autoscaling. Let’s also explicitly depend on the role assignment. That’s pretty much it for setting up managed Kubernetes in Azure Cloud.
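Here is a trimmed-down sketch of the cluster resource covering the settings just discussed. The VM size, DNS prefix, and node resource group name are assumptions; everything else mirrors the walkthrough.

```hcl
resource "azurerm_kubernetes_cluster" "this" {
  name                      = "${local.env}-${local.eks_name}"
  location                  = azurerm_resource_group.this.location
  resource_group_name       = azurerm_resource_group.this.name
  dns_prefix                = "devaks"                                  # assumption
  kubernetes_version        = local.eks_version
  automatic_channel_upgrade = "stable"                                  # or "none" for manual upgrades
  private_cluster_enabled   = false
  node_resource_group       = "${azurerm_resource_group.this.name}-nodes"  # assumption
  sku_tier                  = "Free"                                    # use "Standard" for production

  oidc_issuer_enabled       = true   # required for workload identity
  workload_identity_enabled = true

  network_profile {
    network_plugin = "azure"
  }

  default_node_pool {
    name                 = "general"
    vm_size              = "Standard_D2_v2"   # assumption
    vnet_subnet_id       = azurerm_subnet.subnet1.id
    orchestrator_version = local.eks_version
    enable_auto_scaling  = true
    node_count           = 1
    min_count            = 1
    max_count            = 10

    node_labels = {
      role = "general"
    }
  }

  identity {
    type         = "UserAssigned"
    identity_ids = [azurerm_user_assigned_identity.base.id]
  }

  tags = {
    env = local.env
  }

  lifecycle {
    # Autoscaling changes node_count outside of Terraform, so ignore it.
    ignore_changes = [default_node_pool[0].node_count]
  }

  depends_on = [azurerm_role_assignment.base]
}
```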
Now, most Kubernetes clusters have multiple node groups that are tailored for specific workloads. They can be compute- or memory-optimized, they can be GPU-based nodes to run machine learning pipelines, or they can be spot nodes, which are much cheaper but can be terminated by Azure at any time. The config is very similar, except we add node taints that prevent any pods from being scheduled on those nodes unless they explicitly tolerate the taints, and also a few additional node labels that can be used in node selectors or affinity rules.

If you have never used Terraform or the Azure CLI with Azure before, you need to authenticate first. Run az login and log in through the browser. Next, you need to find the subscription ID; you can do it from the console or by running az account list. The last step is to set that subscription ID as your default. That’s all; now we can initialize Terraform. It will initialize the local state and download all the providers mentioned in your Terraform code. And finally, run terraform apply and confirm that you want to create all those components in the cloud.

In a few minutes, your VNet and the Kubernetes cluster should be ready. To authenticate with the cluster, you need to run the get-credentials subcommand. It will update your local Kubernetes config and set AKS as the active cluster. If you used a different name for the resource group or the cluster, you need to change it here. Then, a quick test to verify that we can access the cluster: if the get nodes command returns your worker nodes, it means everything is set up correctly, and we can continue.

In the Azure console, you can also find resources such as the resource groups. We have two: one for the networking components and AKS, and another for the Kubernetes node pools. Additionally, we have a Kubernetes cluster with the name 'dev-demo'.

Now, let’s run the first example. The goal is to ensure that we can create public and private load balancers and expose our application to the internet. Here is a basic nginx deployment object. Then we have a standard service of type LoadBalancer. Every managed Kubernetes cluster, such as EKS, AKS, and GKE, comes with a cloud controller manager that is responsible for provisioning cloud-native components, such as load balancers. In this case, AKS will create a public load balancer in your account and set up routing to your application. 'Public', or sometimes 'external', means that the load balancer will have a public IP address reachable from the internet.

Next, let’s expose the same application but only within our virtual network, in case other applications deployed outside of Kubernetes need to reach it. This type of load balancer is usually called private or internal. To set it to private, we need to add a special annotation to our service object, as shown in the sketch below. Now, let’s apply.

That’s why we had to create a dedicated managed identity and bind it with the Network Contributor role. If you were to use the default identity created by AKS and try to create a private load balancer, it would be stuck in a pending state due to the lack of permissions for AKS to use our subnets to create a load balancer. However, a public load balancer would still be created.
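For reference, the annotated Service looks roughly like this. It is a sketch: the names, labels, and ports are assumptions, but the annotation itself is the standard Azure one.

```yaml
# Sketch of a Service that asks AKS for a private (internal) load balancer.
apiVersion: v1
kind: Service
metadata:
  name: nginx-private          # assumption: the name in the repo may differ
  annotations:
    # Tells the Azure cloud controller manager to provision an internal load
    # balancer with an IP from the VNet instead of a public one.
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: nginx                 # assumption: must match the deployment's pod labels
  ports:
    - port: 80
      targetPort: 80
```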
Let's wait a few more minutes, and instead of pending, you should see the IP addresses. For the public load balancer, we got a public IP, but for the private one, we got an IP from the VNet range we defined. I'll only test the external load balancer since I don't have a bastion or a VPN. It’s very easy to test: just use curl and hit that public IP. Alright, we received a response from the nginx running in the Kubernetes cluster. When you are done with the test, I would suggest immediately terminating all the pods and load balancers.

In the next example, we’ll test auto-scaling. In Azure and Google Cloud, it's very easy to set up by simply defining the scaling block on the instance group. In AWS, you need to create an OpenID Connect provider, set up permissions, and additionally deploy a cluster autoscaler. Nonetheless, let’s still test the auto-scaling. Here we have six pods, and I've also defined the resources block; initially, not all of them should fit, which should trigger the autoscaler to expand the cluster. Right now, I have two nodes, and the second one has taints, so it will be excluded by the Kubernetes scheduler. Let's go ahead and apply.

If you check the pods, you’ll notice that some are in a pending state. Let's go ahead and describe one of them. Now, you can see a message saying it triggered a scale-up from 1 node to 2. Let’s wait until all the pods are running and the new node has joined the cluster. In about a minute, all the pods should be running. The cluster autoscaler can also remove nodes to reduce the cluster cost; you just need to terminate the pods. That’s all for the second example.

In the 3rd example, we’ll use an ingress. First, we need to deploy one of the ingress controllers in the cluster. For this video, I picked the most common one, the NGINX ingress controller, which we’ll deploy with Terraform and Helm. First, similar to the Azure Terraform provider, we need to authenticate Helm with our Kubernetes cluster. You can do it by obtaining a certificate from AKS: define a data resource to pull the AKS data first, and explicitly depend on the AKS cluster. Then, we use those values in the provider and finally define the Helm release resource. An important part here is that we want to override a few variables. One way to do it is by creating a values.yaml file and listing all the variables there that you want to override.

In earlier versions of Kubernetes, we used annotations to specify the ingress class. Nowadays, we use the stable v1 Ingress API, which has an ingressClassName field. Similar to load balancers, you can create public and private ingresses. For this tutorial, I’ll use a public one and call it external-nginx. If you want a private one, just add the same annotation that we used on the load balancer. Additionally, Azure requires a health probe; if you leave this out, your ingress won’t work. We also have a section for TLS that I’ll cover later. Now, since we already applied the Terraform at the beginning, the ingress controller is already deployed in our cluster; a sketch of the Helm release is shown below.
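The data source, provider wiring, and helm_release shape below follow the walkthrough; the namespace and the exact chart values are assumptions, and the values are inlined here instead of a separate values.yaml just to keep the sketch self-contained.

```hcl
# Pull connection details from the cluster we just created.
data "azurerm_kubernetes_cluster" "this" {
  name                = azurerm_kubernetes_cluster.this.name
  resource_group_name = azurerm_resource_group.this.name
  depends_on          = [azurerm_kubernetes_cluster.this]
}

provider "helm" {
  kubernetes {
    host                   = data.azurerm_kubernetes_cluster.this.kube_config[0].host
    client_certificate     = base64decode(data.azurerm_kubernetes_cluster.this.kube_config[0].client_certificate)
    client_key             = base64decode(data.azurerm_kubernetes_cluster.this.kube_config[0].client_key)
    cluster_ca_certificate = base64decode(data.azurerm_kubernetes_cluster.this.kube_config[0].cluster_ca_certificate)
  }
}

resource "helm_release" "ingress_nginx" {
  name             = "ingress-nginx"
  repository       = "https://kubernetes.github.io/ingress-nginx"
  chart            = "ingress-nginx"
  namespace        = "ingress"   # assumption
  create_namespace = true

  values = [<<-EOT
    controller:
      ingressClassResource:
        name: external-nginx   # the class referenced by ingressClassName
      # Assumption: also honor the legacy ingress.class annotation, which
      # cert-manager's HTTP-01 solver uses for its temporary ingress.
      ingressClass: external-nginx
      service:
        annotations:
          # AKS needs an explicit health probe path, otherwise the ingress won't work.
          service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: /healthz
  EOT
  ]
}
```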
Let’s go over the example. It's a simple deployment based on the echo server. Then, we have a service to create an endpoint and route traffic to the pods. And we have the ingress resource: here, we specify the ingress class name, and we can route traffic to our echo server based on the path or a domain. You can use any domain here; I’ll show how to simulate a host header later, and I would even suggest keeping this host in place and not changing it.

Before we apply, let me show you that we have the NGINX ingress controller installed in the cluster. If you check the services in the ingress namespace, you’ll find a public load balancer IP address. For all ingresses that use this NGINX controller, you should create an A record pointing to this IP address. Alright, let’s apply. First of all, make sure that the pod is running.

Next, check the ingresses. You can find the same IP address under the Address column. Keep in mind that it may take some time for the NGINX controller to update this value, so it’s not immediately present. Again, for the ingress, you just create an A record and point the host to this IP address.

For this test, I don’t want to create any DNS records. An easy test that you can perform involves overriding the host header: with curl, for example, you can use the --resolve flag and map the host to the IP locally. Alright, it works! We received a response from the echo server running in Kubernetes. This concludes the 3rd example. Don’t forget to clean up.

Now, the next example is a bit more complicated. We’ll use the same ingress but secure our endpoint with a TLS certificate issued by Let's Encrypt. To automatically obtain and renew the certificate, we need to deploy an additional component called cert-manager. We’ll use Helm to install it as well, and the configuration is minimal; we just enable the CRDs. If you have only one or two variables that you want to override, you can use set blocks instead of a values file.

There are two main ways you can prove that you own a domain to get a public certificate. One way is the HTTP-01 challenge, which requires our ingress controller to dynamically create an endpoint serving a secret provided by the issuer, in this case Let's Encrypt. Another way is the DNS-01 challenge, which requires us to create TXT records with a secret provided by the issuer. In production, I tend to use the DNS-01 challenge more often because you can obtain the certificate beforehand and test it, but the DNS challenge would require setting up workload identity.

Now, cert-manager has traditionally refused to use the new ingressClassName field on the ingress resource to resolve the HTTP challenge. They explain this by saying they want to stay generic and allow ingress controllers that don’t use the ingress class field to still work with cert-manager. To fix that, we need to configure our NGINX controller to also watch for the ingress class annotation.

The first step to secure the ingress is to create issuers. First, let's create a staging issuer. Let's Encrypt has a limit on how many certificates you can issue per week, and if you start testing with the production endpoint, you can quickly hit that limit. So, I always suggest starting your tests with the staging environment. Here, you also need to replace the email; in case cert-manager fails to renew the certificate, you’ll get a notification from Let's Encrypt. Then, specify the secret name where the ACME account private key will be stored. Finally, in the solver section, we have the ingress class that will be used to resolve the HTTP-01 challenge. This value will be used in the ingress annotation and not in the ingressClassName field; that’s why we need the extra settings on the ingress controller. Second is the production issuer, which you should only start using after you validate the staging certificate.
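Here is roughly what the staging issuer looks like. The metadata and secret names are assumptions, and the email is a placeholder you need to replace; the production issuer is identical except for its name and the ACME server URL (https://acme-v02.api.letsencrypt.org/directory).

```yaml
# Sketch of a staging ClusterIssuer for cert-manager.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: staging                      # assumption: the name in the repo may differ
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: you@example.com           # replace with your own email
    privateKeySecretRef:
      name: letsencrypt-staging      # secret that stores the ACME account private key
    solvers:
      - http01:
          ingress:
            class: external-nginx    # applied as the ingress.class annotation, not ingressClassName
```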
The deployment object is the same as in the previous example, along with the service object. For the ingress itself, there are a few important parts. First is the annotation with the type of the issuer. Then, we have the extra TLS section. These names will appear on the certificate, and you can have multiple hosts. However, in this case, the host must be real, and we need to create an A record for the HTTP-01 challenge to succeed. Let’s apply and make sure that the pod is running.

Here, you can notice two pods; the first one is from cert-manager. It received a secret from Let's Encrypt and is now used to expose it via a temporary ingress to prove that you own the domain. We also have two ingresses, and again, the first one is temporarily created by cert-manager. You can notice that it does not use the ingressClassName field; if you describe it, you'll find that it uses an annotation to select the ingress class.

Also, cert-manager created a custom resource called Certificate, which is responsible for obtaining the certificate. I’ll show you how to debug it now. So, let’s describe the Certificate. You can see that it created another custom resource called CertificateRequest; let’s describe that as well. The CertificateRequest created an Order; let’s describe the Order. The Order created a Challenge custom resource; describe it. Finally, we can see the error message explaining why we’re not getting the certificate: obviously, we need to create a DNS record.

Let’s retrieve the ingress again to find the IP address. I host my domain in Google Domains, but it doesn't matter; you just need to create an A record pointing to the shared NGINX ingress controller load balancer IP.

Let’s watch the certificate. DNS can take up to 24 hours to propagate but usually takes a few minutes; in my case, it took maybe 5 minutes. If, after 10 minutes, the certificate is still marked as not ready, describe it again and look for an error message.

Alright, so now we have the certificate, but keep in mind that it’s not a real one. It works, but we get a warning that the connection is not private, which is expected. Let’s check the certificate and ensure that it was issued by the Let's Encrypt staging environment. After that, we can use the production issuer: just update 'staging' to 'production' and reapply the ingress. Cert-manager will create a new challenge and obtain a real certificate. Wait until the certificate is ready again and then test in the web browser. Alright, it works, and if you inspect the certificate, you’ll find that it’s a real one, and the lock icon indicates that the connection is now secure. That’s all for this example; let’s clean up.

In the last, 5th example, I want to show you how to set up workload identity. It simply means that you can map Azure roles to individual Kubernetes pods. As you remember, we enabled the OpenID Connect provider and workload identity on the AKS cluster resource. Here, we’ll create a new managed identity and map it to a Kubernetes service account; let’s call it dev-test. Then, be careful here: we need to create a federated identity credential resource. The name can be anything, but let’s keep dev-test. Then comes the resource group name; we’ll use the one we created for the virtual network. The audience stays the same; you don’t need to change it. The issuer is the OpenID Connect issuer URL created by AKS; we can use a Terraform resource cross-reference to obtain that value. The parent ID is the managed identity we just defined. Then, the subject is the service account, which will live in the dev Kubernetes namespace, and the name of the Kubernetes service account is my-account. This is needed to establish trust between Azure and the Kubernetes side.
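A sketch of the Azure side of this mapping, with resource labels as assumptions:

```hcl
# New managed identity that we will map to a Kubernetes service account.
resource "azurerm_user_assigned_identity" "dev_test" {
  name                = "dev-test"
  location            = local.region
  resource_group_name = azurerm_resource_group.this.name
}

# Federated credential that lets the 'my-account' service account in the
# 'dev' namespace exchange its projected token for this identity.
resource "azurerm_federated_identity_credential" "dev_test" {
  name                = "dev-test"
  resource_group_name = azurerm_resource_group.this.name
  audience            = ["api://AzureADTokenExchange"]
  issuer              = azurerm_kubernetes_cluster.this.oidc_issuer_url
  parent_id           = azurerm_user_assigned_identity.dev_test.id
  subject             = "system:serviceaccount:dev:my-account"
}
```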
To test this workload identity integration, let’s create a storage account and a container, which is similar to an S3 bucket in AWS. Each storage account must have a globally unique name, so let’s use a random integer to generate one. Then, we’ll create a container in that storage account. And finally, we’ll grant the dev-test managed identity that we just created the Storage Blob Data Contributor role, which will allow it to get, list, create, and delete objects in that container.

Let’s switch to the Kubernetes side. First, we need to create a dev namespace, then the service account. To bind this service account to the managed identity in Azure, we use a special annotation with the client ID. Of course, you can automate this with Terraform, but for learning purposes it’s better this way. Now we need to replace the client ID. You can use a Terraform output variable or get it from the Azure console: go to Managed Identities, select dev-test, and copy and replace the client ID. Let’s also create a deployment object based on the azure-cli image to test workload identity access.

Let’s apply it. Wait until the pod is running and then exec into that pod. Again, you'll find all the commands in the README file; this one is used to authenticate with Azure. If you get something other than this output, you made a mistake somewhere and need to go back.

In the Azure console, you can find the storage account that was created by Terraform, and we also have the test container. Now, from the pod, let’s run the blob list command to get all the objects in that container. We got an empty array since we don’t have anything there yet. If you upload a file to that container and run this command again, you'll get metadata about that object.

Alright, that concludes this tutorial. If you want to learn more about AWS EKS and Google GKE, I have tutorials and the Terraform code to create and test those clusters on my channel. Thank you for watching, and I’ll see you in the next video.
Info
Channel: Anton Putra
Views: 4,702
Keywords: aks, terraform, azure, Azure Kubernetes Service, k8s, kubernetes, kubernetes tutorial, k8s tutorial, aks tutorial, create aks using terraform, devops, anton putra, azure tutorial, azure tutorial for beginners, azure aks tutorial, azure aks cluster setup, azure aks networking, azure aks interview questions, azure aks terraform, aks tutorial azure, azure kubernetes, opentofu, nginx ingress, cert-manager, tls, aks workload identity
Id: 8HmReos6dlY
Length: 30min 46sec (1846 seconds)
Published: Tue Oct 10 2023