In this video, I’ll show you how
to set up a production-ready Azure Kubernetes Service using Terraform and run
multiple tests to verify its functionality. First of all, we’ll create an Azure virtual
network (VNet) using Terraform. If you have never used Terraform before, I’ll guide you step
by step. Then, we’ll create a couple of subnets, which are, in my opinion, a simplified version of AWS networking,
and we will discuss why during the tutorial. Then, we’ll create a managed identity
for the AKS cluster and use it to create Kubernetes itself. We’ll bind it with an
Azure role to be able to create public and private load balancers in Kubernetes.
Additionally, besides the default node group, we’ll create another one using spot
instances and add taints to that node group. We will also enable an OpenID
Connect provider and workload identity. For the first example, we'll create a simple
Kubernetes deployment and expose it using a Kubernetes service of type LoadBalancer. We'll
configure it to create a private or internal load balancer within the VNet we previously set up.
Additionally, we'll expose the same application using the public or external load balancer,
to allow anyone on the internet to access it. For the second example, we'll create another deployment to test the auto-scaling
feature built into the AKS cluster. In the 3rd example, we’ll deploy an NGINX
ingress controller using Helm and Terraform, and expose our application to
the internet using Ingress. In the 4th example, we’ll additionally deploy
cert-manager to automatically obtain and renew certificates from Let's Encrypt and secure
our Ingress with TLS. I’ll also show you step by step how to debug cert-manager
if you fail to obtain a certificate. And finally, in the 5th example, we'll
test workload identity. We’ll create a managed identity in the Azure cloud and
map it to the Kubernetes pod running inside the cluster. To test, we’ll create a
storage account and a container, and use a Kubernetes deployment based on the azure-cli to
authenticate and list objects in the container. You can find the source code and all the
commands that I run in my public GitHub repository. This time, I decided to do this
tutorial a little bit differently. Instead of creating all those Terraform files and examples,
I’ll put myself in the viewer's shoes and assume that I just cloned this repo. I’ll still go over
the code and explain what all parts are doing, and we’ll make changes when it’s necessary;
for example, for workload identity, you would need to replace the client id.
Let’s go ahead and start. For all my tutorials, I’ll try
to use only a few variables, which makes the code more readable and easier
to follow. For production use, you would of course parameterize as much as possible and
perhaps convert this code to a Terraform module. Now, here are the most common variables
that you may want to change. Typically, we provision more than one environment
in a single Azure account, so it’s very helpful to have an environment prefix for
each component, such as VNet, Kubernetes, and identities. Then we have a region, which
I think is the cheapest one to test. Since we need to reference the resource group in multiple
places, I decided to factor this out as a local variable as well. Regarding the AKS, I just copied
the same variables from the AWS EKS tutorial; that’s why I have 'eks'—it should be 'aks', but
it's not a big deal. And finally, the version: 1.27 is the latest Kubernetes version,
which is generally available in Azure.
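Here is roughly what those locals look like; the exact names and values are assumptions on my part, so check the repo:

```hcl
# Rough sketch of the locals discussed above; names and values may differ in the repo.
locals {
  env            = "dev"            # environment prefix for every component
  region         = "eastus"         # assumption: a cheap region for testing
  resource_group = "${local.env}-rg"
  eks_name       = "demo"           # copied from the EKS tutorial; 'aks' would be a better name
  eks_version    = "1.27"           # latest GA Kubernetes version at recording time
}
```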
Next, we need the Azure provider and must set up a few version constraints for each provider we’ll use. Later in the tutorial, we’ll use Helm to
deploy NGINX ingress and Cert-Manager. By the way, I use numbers as file prefixes just to sort
them in Visual Studio Code for the tutorial; you don’t need to do it. Terraform, in general,
does not care about file naming conventions.
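As a sketch, the provider setup looks something like this; the version constraints here are assumptions, so use the ones pinned in the repo:

```hcl
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"  # assumption
    }
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.12" # assumption; used later for NGINX ingress and cert-manager
    }
  }
}

provider "azurerm" {
  # The features block is required by the azurerm provider, even if empty.
  features {}
}
```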
Then, we need to create a resource group. A resource group is a container that holds related resources for Azure, and you should
place resources, such as VNet and others, in the same group when they have
the same lifecycle. For example, AKS will create its own resource
group for the Kubernetes node pools. For the resource group, we just need
to specify the name and the location. Next, we will create a virtual
network from scratch. I call it a VPC, but in Azure terminology, it’s a VNet, which
is almost the same thing, but in my opinion, it’s a simplified version of the AWS VPC because
you don’t need to set up separate public and private subnets, route tables, and other networking components
such as an internet gateway and NAT gateways. Let’s call it 'main', but I would also suggest
using the environment as a prefix and calling it 'dev-main'. Next, let’s define the CIDR /16,
which will give us around 65,000 IP addresses. Then, specify the same region and the
resource group that we just defined in the previous file. If you’re not
familiar with Terraform, you can use the cross-reference feature and use the name
attribute exported by the resource group resource. Finally, let’s add a tag to indicate that this
VNet is created for the dev environment. Next, we need to create a couple of subnets. In
AWS, each subnet maps to a single availability zone; on the other hand, subnets in Azure
are created in the region where you create a VNet. Also, you don’t need to create an
internet gateway and NAT gateways. All subnets in Azure have internet access by default;
you just need to have a public IP address. Let’s create two subnets: Subnet1 and Subnet2.
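Put together, the resource group, VNet, and subnets look roughly like this; the CIDR split and names are assumptions, so compare with the repo:

```hcl
resource "azurerm_resource_group" "this" {
  name     = local.resource_group
  location = local.region
}

resource "azurerm_virtual_network" "main" {
  name                = "main"           # or "${local.env}-main"
  address_space       = ["10.0.0.0/16"]  # roughly 65,000 addresses
  location            = local.region
  resource_group_name = azurerm_resource_group.this.name

  tags = {
    env = local.env
  }
}

resource "azurerm_subnet" "subnet1" {
  name                 = "subnet1"
  resource_group_name  = azurerm_resource_group.this.name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = ["10.0.0.0/19"] # assumption
}

resource "azurerm_subnet" "subnet2" {
  name                 = "subnet2"
  resource_group_name  = azurerm_resource_group.this.name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = ["10.0.32.0/19"] # assumption
}
```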
Now, if you already have a VNet and subnets and want to create AKS in the existing virtual
network, you can use a data resource instead. However, I would strongly recommend that if
you created them manually, import them into Terraform first and use a standard resource
reference instead of the data resource. Next, similar to AWS, we have an option to create
a managed identity for our cluster. In AWS, it is mandatory, but in Azure, it’s optional.
However, in our case, if we want to create private load balancers in our subnets, we need to create
it first. Then, we need to bind that managed identity with at least the Network Contributor
role, and the scope must be the resource group that we created for the virtual network. We need
that because the default AKS identity would have permissions only in the resource group created
by the AKS itself. Also, you may notice that sometimes I use 'this,' which is a common pattern
while creating a Terraform module when you have a single instance of that resource. Also, I
use 'base' for that managed identity because, later down the road, we’ll create another
managed identity to test workload identity.
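Here’s the shape of that identity and role binding (a sketch; the exact names may differ):

```hcl
resource "azurerm_user_assigned_identity" "base" {
  name                = "base"
  location            = local.region
  resource_group_name = azurerm_resource_group.this.name
}

resource "azurerm_role_assignment" "base" {
  # Scope it to the resource group that holds the VNet, so AKS can
  # create internal load balancers in our subnets.
  scope                = azurerm_resource_group.this.id
  role_definition_name = "Network Contributor"
  principal_id         = azurerm_user_assigned_identity.base.principal_id
}
```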
Now we can create the Kubernetes cluster itself. AKS represents the managed control plane of Kubernetes; the worker nodes will
still run on virtual machines that we configure and pay for. For the name, let’s use the environment as
a prefix. Then, we need to specify the same region as well as the resource group name. So,
the AKS control plane will be provisioned in our initial resource group, and it will
create another one to manage node pools. By the way, I use more or less the same
configuration parameters in other tutorials when creating AWS EKS and Google GKE clusters.
Next is a DNS prefix. It is used to generate a unique Fully Qualified Domain
Name for your cluster when it is created. Next, is the Kubernetes version. The latest
supported version in GA is 1.27, but most likely, by the time you watch this video, 1.28 will have
been released. You can pick the latest one or simply upgrade, which is as easy as changing
the version and running 'terraform apply.' Next, we can set the channel
to automatically upgrade the cluster. If you set it to something other
than 'none,' Azure will auto-upgrade your cluster. If you run stateful or some
custom operator-managed applications, you may want to manually upgrade the cluster
and carefully monitor the health of your apps. If you run a bunch of stateless applications,
you'll probably be fine with auto-upgrade. Next, I'll create a public cluster since
I don’t have a VPN or a bastion host, but four options are available. One
option is to create a private cluster with a public endpoint, but it’s in preview. Next, you can override the name for
the resource group. If you omit this, Azure will generate one for you anyway. I prefer
to have the same naming conventions everywhere. Then, for the tier, to test it out, use
the free version which, at this time, allows you to create up to a 10-node cluster.
For production, of course, use standard. Next is the OpenID Connect provider.
In AWS, you need to manually create it using the certificate thumbprint and the
issuer URL. In the case of Azure and Google, it’s basically a checkbox. Let’s enable
it since it is required for the workload identity. The next parameter is for the
identity itself; let’s enable it as well. Now, regarding the networking profile settings:
The 'azure' networking plugin (Azure CNI) allows AKS to use VNet-native networking, similar to AWS
and even GKE. If you want Cilium or other networking and service mesh plugins, you
can modify it. But it’s a good starting point. Then, you can specify service_cidr
to allocate IPs for the Kubernetes services, and also you can use the pod_cidr option
to allocate IP CIDR to assign to pods. Next, the default node pool for Kubernetes.
Pretty much the standard settings; you choose the VM type, version, and most notably, you
can enable autoscaling as easily as setting autoscaling to true, similar to Google Cloud. In
AWS, you define the autoscaling config and then need to additionally deploy a cluster autoscaler
as well as set up the OpenID Connect provider. Optionally, we can set custom labels, such as
role equal to general, which helps to perform node group migrations if you need to.
Next, we need to provide the managed identity that we created earlier,
set the tag to the dev environment, and optionally ignore node count since we enabled
autoscaling. Let’s also explicitly depend on the role assignment. That’s pretty much it for
setting up managed Kubernetes in Azure Cloud.
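For reference, here’s roughly what that cluster resource looks like with the azurerm 3.x provider; the VM size, upgrade channel, and node counts are assumptions, so check the repo before applying:

```hcl
resource "azurerm_kubernetes_cluster" "this" {
  name                = "${local.env}-${local.eks_name}"
  location            = local.region
  resource_group_name = azurerm_resource_group.this.name
  dns_prefix          = "demo"

  kubernetes_version        = local.eks_version
  automatic_channel_upgrade = "stable" # assumption; set to "none" for manual upgrades
  sku_tier                  = "Free"   # use "Standard" for production
  node_resource_group       = "${local.env}-${local.eks_name}-nodes"

  # Required for workload identity.
  oidc_issuer_enabled       = true
  workload_identity_enabled = true

  network_profile {
    network_plugin = "azure" # Azure CNI; VNet-native pod networking
  }

  default_node_pool {
    name                 = "general"
    vm_size              = "Standard_D2_v2" # assumption
    vnet_subnet_id       = azurerm_subnet.subnet1.id
    orchestrator_version = local.eks_version
    enable_auto_scaling  = true
    min_count            = 1
    max_count            = 10

    node_labels = {
      role = "general"
    }
  }

  identity {
    type         = "UserAssigned"
    identity_ids = [azurerm_user_assigned_identity.base.id]
  }

  tags = {
    env = local.env
  }

  lifecycle {
    # Node count is managed by the cluster autoscaler.
    ignore_changes = [default_node_pool[0].node_count]
  }

  depends_on = [azurerm_role_assignment.base]
}
```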
Now, most Kubernetes clusters have multiple node groups that are tailored for specific workloads. They can be compute or memory-optimized. They
can be GPU-based nodes to run machine learning pipelines, or they can be spot nodes, which are
much cheaper but can be terminated by Azure at any time. The config is very similar,
except we add node taints that would prevent any pods from being scheduled on them
unless they explicitly tolerate those taints. And also a few additional node labels that
can be used in node selectors or affinity.
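A sketch of that spot node pool (the VM size and counts are assumptions):

```hcl
resource "azurerm_kubernetes_cluster_node_pool" "spot" {
  name                  = "spot"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.this.id
  vm_size               = "Standard_D2_v2" # assumption
  orchestrator_version  = local.eks_version
  vnet_subnet_id        = azurerm_subnet.subnet1.id

  priority        = "Spot"
  eviction_policy = "Delete"
  spot_max_price  = -1 # pay up to the current on-demand price

  enable_auto_scaling = true
  min_count           = 1
  max_count           = 10

  node_labels = {
    role = "spot"
  }

  # Azure expects this taint on spot pools; only pods that tolerate it get scheduled here.
  node_taints = ["kubernetes.azure.com/scalesetpriority=spot:NoSchedule"]

  tags = {
    env = local.env
  }
}
```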
If you have never used Terraform or the Azure CLI with this account before, you need to authenticate with Azure first.
Run az login and log in through the browser. Next, you need to find the subscription
ID; you can do it from the console or by running az account list.
The last step is to set that subscription ID as your default. That’s
all; now we can initialize Terraform. It will initialize the local state and download
all providers mentioned in your Terraform code. And finally, run terraform
apply and confirm that you want to create all those components in the cloud.
In a few minutes, your VNet and the Kubernetes cluster should be ready. To authenticate with
the cluster, you need to run the get-credentials subcommand. It will update your local Kubernetes
config and set AKS as the active cluster. If you used a different name for the resource group
or a cluster name, you need to change it here. Then, a quick test to verify if we can access
the cluster: if kubectl get nodes returns your worker nodes, it means everything
is set up correctly, and we can continue. In the Azure console, you can also find
resources such as a resource group. We have two: one for networking components and AKS, and another
for Kubernetes node pools. Additionally, we have a Kubernetes cluster with the name 'dev-demo'.
Now, let’s run a first example. The goal is to ensure that we can create public and private
load balancers and expose our application to the internet. Here is a basic nginx deployment object.
Then we have a standard service of type load balancer. Every managed Kubernetes cluster, such
as EKS, AKS, and GKE, comes with a cloud controller manager that is responsible for provisioning
cloud-native components, such as load balancers. In this case, AKS will create a public
load balancer in your account and set up routing to your application. 'Public',
or sometimes referred to as 'external', means that the load balancer will have a
public IP address, reachable from the internet. Next, let’s expose the same application
but only within our virtual network, in case other applications deployed outside of
Kubernetes need to reach your application. This type of load balancer is usually called private or
internal. To set it to private, we need to add a special annotation to our service object.
Now, let’s apply. This private load balancer is exactly why we had to create a dedicated managed
identity and bind it with the Network Contributor role. If you were to use a default one created
by AKS and try to create a private load balancer, it would be stuck in a pending state due to
the lack of permissions for AKS to use our subnets to create a load balancer. However, a
public load balancer would still be created. Let's wait a few more minutes, and instead of
pending, you should see the IP addresses. For the public load balancer, we obtained
a public IP, but for the private one, we got an IP from the VNet range we defined.
I'll only test the external load balancer since I don't have a bastion or a VPN. It’s very easy
to test: just use curl and hit that public IP. Alright, we received a response from the
nginx running in the Kubernetes cluster. When you are done with the test,
I would suggest immediately terminating all pods and load balancers.
In the next example, we’ll test auto-scaling. In Azure and Google Cloud, it's very easy to
set up by simply defining the min and max node counts on the node pool. In AWS, you need to create
an OpenID Connect provider, set up permissions, and additionally deploy a cluster auto-scaler.
Nonetheless, let’s still test the auto-scaling. Here we have six pods, and I've also
defined the resource block. Initially, not all of them should fit, which should
trigger the auto-scaler to expand the cluster. Right now, I have two nodes, and
the second one should have taints, so it will be excluded by the Kubernetes
scheduler. Let's go ahead and apply. If you check the pods, you’ll notice that
some are in a pending state. Let's go ahead and describe one of them.
Now, you can see a message saying the autoscaler triggered a scale-up from 1 node to 2.
Let’s wait until all the pods are running and the new node has joined the cluster.
Maybe in about 1 minute, all the pods should be running. The cluster auto-scaler can
also remove nodes to reduce the cluster cost; you just need to terminate the pods.
That’s all for the second example. In the 3rd example, we’ll use an ingress. So
first, we need to deploy one of the ingress controllers in the cluster. For this video, I
picked the most common NGINX ingress controller, which we’ll deploy with Terraform and Helm.
First, similar to the Azure Terraform provider, we need to authenticate Helm with our Kubernetes
cluster. You can do it by obtaining a certificate from AKS. Define the data resource to pull AKS
data first. We also need to explicitly depend on the AKS cluster. Then, we use those values in
the provider and finally define the Helm release resource. An important part here is that we want
to override a few variables. One way to do it is by creating a values.yaml file and listing
all variables there that you want to override.
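The wiring looks roughly like this; the chart version, namespace, and values-file path are assumptions, and the controller.ingressClass setting is my guess at how the repo makes the controller honor the legacy annotation that cert-manager needs later:

```hcl
data "azurerm_kubernetes_cluster" "this" {
  name                = azurerm_kubernetes_cluster.this.name
  resource_group_name = azurerm_resource_group.this.name
  depends_on          = [azurerm_kubernetes_cluster.this]
}

provider "helm" {
  kubernetes {
    host                   = data.azurerm_kubernetes_cluster.this.kube_config[0].host
    client_certificate     = base64decode(data.azurerm_kubernetes_cluster.this.kube_config[0].client_certificate)
    client_key             = base64decode(data.azurerm_kubernetes_cluster.this.kube_config[0].client_key)
    cluster_ca_certificate = base64decode(data.azurerm_kubernetes_cluster.this.kube_config[0].cluster_ca_certificate)
  }
}

resource "helm_release" "external_nginx" {
  name             = "external"
  repository       = "https://kubernetes.github.io/ingress-nginx"
  chart            = "ingress-nginx"
  namespace        = "ingress"
  create_namespace = true
  version          = "4.8.0" # assumption

  # values/ingress.yaml would set the ingress class name (external-nginx) and the Azure
  # health-probe annotation on the controller service, e.g.
  # service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: /healthz
  values = [file("${path.module}/values/ingress.yaml")]

  # Assumption: lets the controller also watch the legacy kubernetes.io/ingress.class
  # annotation, which cert-manager's HTTP-01 solver ingress relies on (covered later).
  set {
    name  = "controller.ingressClass"
    value = "external-nginx"
  }
}
```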
In earlier versions of Kubernetes, we used annotations to specify the ingress class. Nowadays, we use the stable v1 API for
Ingress, which has an ingressClassName field. Similar to load balancers, you can
create public and private ingresses. For this tutorial, I’ll use the Public one and call
it external-nginx. If you want a private one, just add the same annotation that we
used on the load balancer. Additionally, Azure requires a health-probe request path for the controller’s service; if you
leave this out, your ingress won’t work. We also have a section for TLS that I’ll
cover later. Now, since we already applied the Terraform at the beginning, the ingress
controller is already deployed in our cluster. Let’s go over the example. It's a simple
deployment based on the echo server. Then, we have a service to create an endpoint and
route traffic to the pods. And we have the ingress resource. Here, we specify the ingress
class name, and we can route traffic to our echo server based on the path or a domain. You
can use any domain here; I’ll show how to simulate a host header later. I would even suggest
keeping this host in place and not changing it. Before we apply, let me show you that we have
the NGINX ingress controller installed in the cluster. If you check services in the ingress
namespace, you’ll find a public load balancer IP address. For every ingress that uses this
NGINX controller, you should create an A record pointing to this IP address.
Alright, let’s apply. First of all, make sure that the pod is running.
Next, check the ingresses. You can find the same IP address under the Address column. Keep
in mind that it may take some time for the NGINX controller to update this value, so
it’s not immediately present here. Again, for the ingress, you just create an A record
and point this host to this IP address. For this test, I don’t want to create any DNS
records. An easy test that you can perform involves overriding the host header. With
curl, for example, you can use the resolve flag and map the host to the IP locally.
Alright, it works! We received a response from the echo server running in Kubernetes. This concludes
the 3rd example. Don’t forget to clean up. Now, the next example is a bit more complicated.
We’ll use the same ingress but secure our endpoint with a TLS certificate issued from Let's Encrypt.
To automatically obtain and renew the certificate, we need to deploy an additional component called
cert-manager. We’ll use Helm to install it as well, and the configuration is minimal—just enable
CRDs. If you have one or two variables that you want to override, you can use set blocks.
There are two main ways you can prove that you own a domain to get a public certificate.
One way is the HTTP-01 challenge, which would require our ingress controller to dynamically
create an endpoint serving a token provided by the issuer, in this case, Let's Encrypt.
Another way is to use the DNS-01 challenge, which requires us to create TXT records with
a value provided by the issuer. In production, I tend to use the DNS-01 challenge more
often because you can obtain the certificate beforehand and test. But the DNS challenge
would require setting up workload identity. Now, cert-manager has traditionally refused to
use the new ingressClass field on the ingress resource to resolve the HTTP challenge. They
explain this by saying they want to be generic and allow other ingress controllers that don’t use
the ingress class field to still use cert-manager. Alright, to fix that, we need to
configure our NGINX controller to also watch for the ingress class annotation.
The first step to secure the ingress is to create issuers. First, let's create a staging
issuer. Let's Encrypt has a limit on how many certificates you can issue in a week, and if you
start testing with a production endpoint, you can quickly reach the limit. So, I always suggest
starting to test with a staging environment. Here, you also need to replace the email. In case your
cert-manager fails to renew the certificate, you’ll get a notification from Let's Encrypt.
Then, specify the secret name where the certificate and a private key will be stored.
And on line 18, we have the ingress class that will be used to resolve the HTTP-01 challenge.
This value will be used in the ingress annotation and not in the ingressClassName field. That’s why
we need extra settings on the ingress controller. Second is the production issuer, which
you should only start using after you validate the staging certificate.
The same deployment object as in the previous example.
Service object. And for the ingress itself, there are a few
important parts. First is the annotation with the type of the issuer. Then, we
have the extra TLS section. These names will appear on the certificate, and you can
have multiple hosts. However, in this case, the host field must be real, and we would need
to create an A record for the HTTP-01 challenge. Let’s apply. Make sure that the pod is running.
Here, you can notice two pods; the first one is from cert-manager. It
received a challenge token from Let's Encrypt and now exposes it through a temporary
ingress to prove that you own the domain. We also have two ingresses, and again,
the first one is temporarily created by the cert-manager. You can notice here that it
does not use the ingressClassName field—you can describe it to find out that it uses
an annotation to select the ingress class. Also, cert-manager created a custom resource
called Certificate, which is responsible for obtaining the certificate. I’ll
show you how to debug it now. So, let’s describe the certificate.
You can find that the certificate created another custom resource,
which is called CertificateRequest. Let’s describe that as well.
Now, the CertificateRequest created an Order. Let’s describe the Order.
The Order created a Challenge custom resource. Describe it.
Finally, we can see the error message explaining why we’re not getting the certificate.
Obviously, we need to create a DNS record. Let’s retrieve the ingress
again to find the IP addresses. I host it in Google Domains, but
it doesn't matter—you just need to create an A record pointing to the shared
NGINX ingress controller load balancer IP. Let’s watch the certificate. DNS can take
up to 24 hours to update but usually takes a few minutes. So, in my case, it took
maybe 5 minutes. If, in 10 minutes, the certificate is still marked as false, try
to describe it again and find an error message. Alright, so now we have the certificate,
but keep in mind that it’s not a real one. It works, but we get a warning that the connection
is not private, which is expected. Let’s check the certificate and ensure that it was issued
from the Let's Encrypt staging environment. After that, we can use the production
issuer. You just need to update 'staging' to 'production' and reapply the ingress.
Cert-manager will create a new challenge and obtain a real, new certificate. Wait until it’s
true again and then go to the web browser to test. Alright, it works, and if you inspect the
certificate, you’ll find that it’s a real one and you have a lock indication
that the connection is now secure. That’s all for this example; let’s clean up.
In the last, 5th example, I want to show you how to set up workload identity. It simply means that
you can map Azure managed identities, and the roles assigned to them, to individual Kubernetes pods. As you remember, we enabled the OpenID
Connect provider and workload identity on the AKS cluster resource. Here, we’ll create a new
managed identity and map it with the Kubernetes service account. Let’s call it dev-test.
Then, be careful here; we need to create a federated identity credential resource. The name
can be anything, but let’s keep dev-test. Then the resource group name; we’ll use the one we
created for the virtual network. The audience stays the same; you don’t need to change it. The
issuer is the OpenID Connect provider issuer URL created by the AKS. We can use the Terraform
resource cross-reference to obtain that value. The Parent ID is the managed identity we just
defined. Then, the subject is the service account, which will be located in the dev Kubernetes
namespace, and the name of the Kubernetes service account is my-account. This
is needed to establish trust between Azure AD and the Kubernetes service account.
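Here’s roughly what that pair of resources looks like (a sketch; the names are assumptions):

```hcl
resource "azurerm_user_assigned_identity" "dev_test" {
  name                = "dev-test"
  location            = local.region
  resource_group_name = azurerm_resource_group.this.name
}

resource "azurerm_federated_identity_credential" "dev_test" {
  name                = "dev-test"
  resource_group_name = azurerm_resource_group.this.name
  audience            = ["api://AzureADTokenExchange"]
  issuer              = azurerm_kubernetes_cluster.this.oidc_issuer_url
  parent_id           = azurerm_user_assigned_identity.dev_test.id
  # Trust tokens issued for the my-account service account in the dev namespace.
  subject             = "system:serviceaccount:dev:my-account"
}
```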
To test this workload integration, let’s create a storage account and a container,
which is similar to an S3 bucket in AWS. Each storage account must have a globally unique name, so
let’s use a random integer to generate one. Then, we’ll create a container
in that storage account. And finally, we’ll grant the dev-test
managed identity that we just created the Storage Blob Data Contributor role. It will allow it to get, list,
create, and delete objects in that container.
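A sketch of the storage pieces; the naming scheme and the role are assumptions (data-plane blob access over Azure AD normally needs one of the Storage Blob Data roles):

```hcl
resource "random_integer" "this" {
  min = 10000
  max = 99999
}

resource "azurerm_storage_account" "test" {
  # Storage account names must be globally unique, 3-24 lowercase alphanumeric characters.
  name                     = "devtest${random_integer.this.result}"
  resource_group_name      = azurerm_resource_group.this.name
  location                 = local.region
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

resource "azurerm_storage_container" "test" {
  name                  = "test"
  storage_account_name  = azurerm_storage_account.test.name
  container_access_type = "private"
}

resource "azurerm_role_assignment" "dev_test" {
  scope                = azurerm_storage_account.test.id
  role_definition_name = "Storage Blob Data Contributor" # assumption
  principal_id         = azurerm_user_assigned_identity.dev_test.principal_id
}
```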
Let’s switch to the Kubernetes side. First, we need to create a dev namespace. Then, the service account. To bind this service
account with the managed identity in Azure, we use the azure.workload.identity/client-id annotation. Of
course, you can automate this with Terraform, but to learn, it’s way better this way.
Now we need to replace the client-id. You can use a Terraform output variable
or get it from the Azure console. Go to managed identities.
Select dev-test. And copy and replace the client id.
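If you prefer the Terraform route mentioned above, an output like this exposes the client ID after apply:

```hcl
output "dev_test_client_id" {
  # Paste this value into the azure.workload.identity/client-id annotation
  # on the my-account service account.
  value = azurerm_user_assigned_identity.dev_test.client_id
}
```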
Let’s also create a deployment object based on the azure-cli image to test workload identity access.
Let’s apply it. Wait until the pod is running and then exec into that
pod. Again, you'll find all the commands in the README file; this one is used to authenticate with
Azure. If you get something other than this, you made a mistake somewhere, and you need to go back.
So, in the Azure console, you can find the storage account that was created by Terraform.
And also, we have a test container. Now, from the pod, let’s run the blob list
command to get all the objects in that container. Well, we got an empty array since we don’t
have anything there yet. If you upload a file in that container and run this command
again, you'll get metadata about that object. Alright, that concludes this tutorial. If you
want to learn more about AWS EKS and Google GKE, I have the tutorials and the Terraform
code to create and test those clusters on my channel. Thank you for watching,
and I’ll see you in the next video.