In this video, I'll show you how to create a GKE
cluster using Terraform. You may follow along and create a VPC from scratch with Terraform, or
you can plug in values from existing network and subnets. We will create two instance groups,
one for general services and one instance group that will use spot instances and have proper
taints and labels. In the first demo, I'll show you how to configure autoscaling for the cluster.
In the second one, we will use workload identity and grant access to the pod to list GS buckets in
our Google project. For the final example, I will deploy the nginx ingress controller with a public
load balancer to expose services to the internet. Let's create a terraform folder where we're going
to place all the terraform-related files. First of all, we need to declare a terraform provider.
You can think of it as a library with methods to create and manage infrastructure in a
specific environment. In this case, it is Google Cloud Platform. When it comes to file names, try to give them self-explanatory names. I'll include a link to the official
doc for most of the new resources I will use in the code. There you can find an example of how to use the resource, all possible input parameters, and the output attributes exported when the
resource is created. You need to include a project id and a region for the google
provider. When you create resources in GCP such as VPC, Terraform needs a way to keep track
of them. If you simply apply terraform right now, it will keep all the state locally on your
computer. It's very hard to collaborate with other team members and easy to accidentally destroy all
your infrastructure. You can declare Terraform backend to use remote storage instead.
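Jumping ahead slightly, here is a rough sketch of what the provider and backend declarations might look like once the bucket exists; the bucket name, prefix, and version constraint are placeholders you would replace with your own values.

```hcl
terraform {
  # The bucket must already exist; name and prefix are placeholders.
  backend "gcs" {
    bucket = "devops-v4-terraform-state"
    prefix = "terraform/state"
  }

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 4.0" # example version constraint
    }
  }
}

provider "google" {
  project = "devops-v4" # replace with your project id
  region  = "us-central1"
}
```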
Since we're creating infrastructure in GCP, the logical approach would be to use Google
Storage Bucket to store Terraform state. You need to provide a bucket name and a prefix. We
will create them in a minute. Also, you have an option to set constraints on the version of the
provider that you want to use. Now, let's create a bucket before running Terraform. Go to Google
Cloud console and select Cloud Storage. You can create a GCS bucket using the console or the gcloud CLI, but you can't use Terraform here, since Terraform needs the bucket to exist beforehand. Click on Create Bucket. Then pick a
name for the bucket; it must be globally unique. It's common to separate terraform workspaces by
the environment to reduce risks. This bucket will be used to manage infrastructure in the
staging environment. For the location, I always use multi-region. Terraform will only keep the state in the form of JSON files. It will not require a lot of storage and should be pretty cheap. You can also select different storage classes, but Standard is perfect for Terraform. The
only parameter that you need to update is version control. It will help you to recover the state
in case of an incident. That's all, now we can create a bucket. Nothing stops you from using
the existing VPC to create a Kubernetes cluster, but I will create all the infrastructure
using Terraform for this lesson. If you want to use an existing network, instead of a resource block you can use a data source to reference it in Terraform. Before creating a VPC in a new GCP project, you need to enable the compute API. To create a GKE cluster, you also need to enable the container API.
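A minimal sketch of enabling those APIs with Terraform (the resource labels are just my own choice):

```hcl
resource "google_project_service" "compute" {
  service = "compute.googleapis.com"
}

resource "google_project_service" "container" {
  service = "container.googleapis.com"
}
```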
Now let's create the VPC itself. Give it a name, for example, main. Then select the routing
mode. You have two options here: Regional and Global. If set to REGIONAL, this network's
cloud routers will only advertise routes with subnets of this network in the same region as
the router. If set to GLOBAL, this network's cloud routers will advertise routes with all
subnets of this network across regions. Then set auto_create_subnetworks to false. When set to false, the network is created in "custom subnet mode," which lets us define our own subnets.
Then the Maximum Transmission Unit in bytes; the minimum value for this field is 1460. The next flag, if set to true, deletes the default route to the internet when the network is created. Finally, we need to explicitly specify the resources that must be created before the VPC: the compute API, and optionally the container API, using depends_on.
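Putting those options together, the VPC resource could look roughly like this:

```hcl
resource "google_compute_network" "main" {
  name                            = "main"
  routing_mode                    = "REGIONAL"
  auto_create_subnetworks         = false
  mtu                             = 1460
  delete_default_routes_on_create = false

  # Make sure the APIs are enabled before creating the VPC.
  depends_on = [
    google_project_service.compute,
    google_project_service.container,
  ]
}
```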
The next step is to create a private subnet to place Kubernetes nodes. When you use a GKE cluster, the Kubernetes control plane is managed
by Google, and you only need to worry about the placement of Kubernetes workers. Give it a name
private. If you have more than one private subnet, it's better to be more specific. Then the CIDR
range of the subnet. 10.0.0.0/18 will give you 16 thousand ip addresses to play with. VPC in
the Google Cloud is a global concept. You can create subnets in different regions; on the other
hand, in AWS, VPC belongs to a specific region. You need to provide a reference to the network
that we created earlier. Enable private IP google access. VMs in this subnetwork without external IP
addresses can access Google APIs and services, for example, Managed Redis or Postgres. Then you need
to provide secondary ip ranges. Kubernetes nodes will use IPs from the main CIDR range, but the
Kubernetes pods will use IPs from the secondary ranges. In case you need to open a firewall to
access other VMs in your VPC from Kubernetes, you would need to use this secondary ip range as
a source and optionally service account of the Kubernetes nodes. Each secondary IP
range has a name associated with it, which we will use in the GKE configuration. The
second secondary range will be used to assign IP addresses for ClusterIPs in Kubernetes. When you
create a regular service in Kubernetes, an IP address will be taken from that range.
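Here is a sketch of that subnet; the secondary range names and CIDRs are example values (a larger range for pods, a smaller one for services) that you would adapt to your own addressing plan.

```yaml
```

```hcl
resource "google_compute_subnetwork" "private" {
  name                     = "private"
  ip_cidr_range            = "10.0.0.0/18"
  region                   = "us-central1"
  network                  = google_compute_network.main.id
  private_ip_google_access = true

  # Pods will get IPs from this range.
  secondary_ip_range {
    range_name    = "k8s-pod-range"
    ip_cidr_range = "10.48.0.0/14"
  }

  # ClusterIP services will get IPs from this range.
  secondary_ip_range {
    range_name    = "k8s-service-range"
    ip_cidr_range = "10.52.0.0/20"
  }
}
```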
Next, we need to create Cloud Router to advertise routes. It will be used with the NAT
gateway to allow VMs without public IP addresses to access the internet. For example, Kubernetes
nodes will be able to pull Docker images from Docker Hub. Give it a name, router. Then
the region, us-central1, is the same region where we created the subnet. Then the reference
to the VPC, where you want to place this router. Now, let's create Cloud NAT. Give it a name and a reference to the
Cloud Router. Then the region us-central1. You can decide to advertise this
Cloud NAT to all subnets in that VPC, or you can select specific ones. In this example, I will choose the private subnet only. The
next option is very important, especially if you have external clients. You can let Google allocate and assign an IP address for your NAT, or you can choose to manage it yourself. If you have a webhook or a client that needs to whitelist your public IP address (allow your IP address to access their network by opening up a firewall), managing the address yourself is the only way to go. Then the list of
subnetworks to advertise the NAT. The first one is for the private subnet. You can also choose
to advertise to only the main CIDR range or both, including secondary IP ranges. Since we will
allocate External IP addresses ourselves, we need to provide them in the nat_ips field. You
can allocate more than one IP address for NAT. The following resource allocates that IP.
Give it a name and a type External. Also, you need to select the network_tier. It can be
PREMIUM or STANDARD. Since we create the VPC from scratch, we need to make sure that the compute API is enabled before allocating the IP.
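A condensed sketch of the router, the static external IP, and the Cloud NAT; the resource names and option values mirror the narration but are still illustrative:

```hcl
resource "google_compute_router" "router" {
  name    = "router"
  region  = "us-central1"
  network = google_compute_network.main.id
}

resource "google_compute_address" "nat" {
  name         = "nat"
  address_type = "EXTERNAL"
  network_tier = "PREMIUM"

  depends_on = [google_project_service.compute]
}

resource "google_compute_router_nat" "nat" {
  name   = "nat"
  router = google_compute_router.router.name
  region = "us-central1"

  # Only advertise the NAT to the subnets listed below.
  source_subnetwork_ip_ranges_to_nat = "LIST_OF_SUBNETWORKS"

  # Manage the external IP ourselves instead of letting Google allocate it.
  nat_ip_allocate_option = "MANUAL_ONLY"
  nat_ips                = [google_compute_address.nat.self_link]

  subnetwork {
    name                    = google_compute_subnetwork.private.id
    source_ip_ranges_to_nat = ["ALL_IP_RANGES"]
  }
}
```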
The next resource is a firewall. We don't need to create any firewalls manually for GKE; it's just to give you an example. This firewall will allow SSH
to the compute instances within VPC. The name is allow-ssh. Reference to the main network. Ports
and protocols to allow. For ssh, we need TCP protocol and a standard port 22. For the source,
we can restrict it to certain service accounts or network tags, or we can use a CIDR range. 0.0.0.0/0
will allow any IP to access port 22 on our VMs.
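The allow-ssh firewall, roughly:

```hcl
resource "google_compute_firewall" "allow_ssh" {
  name    = "allow-ssh"
  network = google_compute_network.main.name

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }

  # Any source IP; you could restrict this to specific ranges or tags instead.
  source_ranges = ["0.0.0.0/0"]
}
```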
Finally, we get to the Kubernetes resources. First, we need to configure the control plane of the cluster itself. Primary will be the cluster
name. Now for location, you can either select a region or an availability zone. If you choose a
region, GKE will create a highly available cluster for you in multiple availability zones of that
region. No doubt that it is the preferred setup, but it will cost you more money. If you are budget sensitive, you may want to select a zonal cluster and still spread your Kubernetes nodes across different availability zones. If you go with that approach, as we will in this video, and something happens to your control plane, all your applications will continue running without interruption; you just won't be able to access the control plane itself, for example to deploy a new application or service. But I would highly recommend choosing at least two availability zones for the Kubernetes nodes.
In my experience, availability zones go down often. This cluster will have a single NOT highly
available control plane in the us-central1-a zone. Then choose to remove the default node pool, since we will create additional instance groups for the Kubernetes cluster. The initial node count does not matter since the pool will be destroyed anyway. Provide a link to the main VPC and a subnet. In this
case, it's a private subnet. Now be very careful with services that you enable for Kubernetes.
Obviously, you want logging for your applications. This option will deploy a Fluent Bit agent on each
node and scrape all the logs that your application sends to the console. But it will add cost to
your infrastructure. At one point, for a short period of time in one of my environments, the cost of logging exceeded the cost of the infrastructure itself, because a developer enabled debug logs. Be very careful and constantly monitor the cost. Next is monitoring; the same thing here, it's not free.
If you plan to deploy Prometheus, you may want to disable it. All cloud providers will try to sell
you as many managed services as possible, which are very easily scalable and convenient. But it
may lead to a huge bill at the end of the month. The networking mode is VPC_NATIVE. Available
options are VPC_NATIVE or ROUTES. VPC-native clusters have several benefits; you can read about
them here. As I mentioned before, if we create a zonal cluster, we want to add at least one
availability zone. We already have us-central1-a zone; let's add b zone. There are many different
addons you can enable and disable. For example, you can deploy istio service mesh or disable
http_load_balancing if you're planning to use nginx ingress or plain load balancers
to expose your services from Kubernetes. Later I will deploy the nginx ingress controller
anyway, so let's disable this addon. The second is horizontal pod autoscaling; I want to keep
this addon enabled. The release channel will manage your Kubernetes cluster upgrades. Keep in mind that you will never be able to completely disable upgrades for the Kubernetes control plane. However, you can disable them for nodes. Then I want to enable workload identity. You can
substitute this with variables and data objects. You need to replace devops-v4 with your project
ID. Under the ip allocation policy, you need to provide the names of the secondary ranges.
First for the pods and then for the cluster IPs. To make this cluster private, we need
to enable private nodes. This will only use private IP addresses from our private subnet for
the Kubernetes nodes. Next is the private endpoint. If you have a VPN set up or you use a bastion host to connect to the Kubernetes cluster, set this option to true; otherwise keep it false
to be able to access GKE from your computer. You would also need to provide a CIDR range for the
control plane. Since it's managed by Google, they will create a control plane in their network
and establish a peering connection to your VPC. Optionally you can specify the CIDR ranges
which can access the Kubernetes cluster. The typical use case is to enable Jenkins
to access your GKE. If you skip this, anyone can access your control plane endpoint.
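Condensed into a single resource, the control plane configuration could look roughly like this; the secondary range names and the control-plane CIDR are example values, and devops-v4 stands in for your project ID:

```hcl
resource "google_container_cluster" "primary" {
  name                     = "primary"
  location                 = "us-central1-a"
  remove_default_node_pool = true
  initial_node_count       = 1
  network                  = google_compute_network.main.self_link
  subnetwork               = google_compute_subnetwork.private.self_link
  logging_service          = "logging.googleapis.com/kubernetes"
  monitoring_service       = "monitoring.googleapis.com/kubernetes"
  networking_mode          = "VPC_NATIVE"

  # Zonal cluster, with nodes also placed in a second zone.
  node_locations = ["us-central1-b"]

  addons_config {
    http_load_balancing {
      disabled = true
    }
    horizontal_pod_autoscaling {
      disabled = false
    }
  }

  release_channel {
    channel = "REGULAR"
  }

  workload_identity_config {
    workload_pool = "devops-v4.svc.id.goog" # replace devops-v4 with your project id
  }

  ip_allocation_policy {
    cluster_secondary_range_name  = "k8s-pod-range"
    services_secondary_range_name = "k8s-service-range"
  }

  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false
    master_ipv4_cidr_block  = "172.16.0.0/28" # example CIDR for the managed control plane
  }
}
```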
Before we can create node groups for Kubernetes, if we want to follow best practices, we need
to create a dedicated service account. In this tutorial, we will create two node groups.
The first one is general, without taints, to be able to run cluster components such as DNS.
Provide a cluster id. This node group will not have autoscaling enabled; we need to specify
how many nodes we want. For the management, allow auto_repair and auto_upgrade. Under node
config, we can specify that this node group is not preemptible. Choose a machine type, for
example, e2-small. I prefer to have large instances and a small number of nodes since
there are a lot of system components that need to be deployed on each node, such as Fluent Bit, node exporter, and many others. If you have smaller instances, those components will eat a
lot of your resources. You can give this node group a label. Provide a service account and
oauth_scope cloud-platform. Google recommends custom service accounts that have cloud-platform
scope and permissions granted via IAM Roles. Later we will grant the IAM role to the service
account to access GS buckets in our project. Now the second instance group. It will have a few
different parameters. Give it a name spot. Then the same cluster-id. Management config will
stay the same. But now we have autoscaling. You can define the minimum number of nodes and a
maximum number of nodes. Under node config, let's set preemptible equal to true. This will use much
cheaper VMs for the Kubernetes nodes, but they can be taken away by Google at any time, and they last up to 24 hours. They are perfect for batch jobs and data pipelines. They can be used with regular applications, but those applications have to be able to tolerate nodes going down. Give it a label, team equal to devops. And most importantly, such nodes must have taints to avoid accidental
scheduling. In this case, your deployment or pod object must tolerate those taints. Same
service account and scope for this node group.
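Here is a sketch of the dedicated service account and the spot node pool; the general pool looks the same, except it has a fixed node_count instead of autoscaling and no taints or preemptible flag. The taint key and value are my own example choices and must match what your workloads tolerate.

```hcl
resource "google_service_account" "kubernetes" {
  account_id = "kubernetes"
}

resource "google_container_node_pool" "spot" {
  name    = "spot"
  cluster = google_container_cluster.primary.id

  management {
    auto_repair  = true
    auto_upgrade = true
  }

  autoscaling {
    min_node_count = 0
    max_node_count = 10
  }

  node_config {
    preemptible  = true
    machine_type = "e2-small"

    labels = {
      team = "devops"
    }

    # Example taint; pods must tolerate it to be scheduled here.
    taint {
      key    = "instance_type"
      value  = "spot"
      effect = "NO_SCHEDULE"
    }

    service_account = google_service_account.kubernetes.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform",
    ]
  }
}
```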
To run Terraform locally on your computer, you need to configure application default credentials. Run the gcloud auth application-default login
command. It will open the default browser, where you would need to complete authorization.
When it's done, make sure that the Google project ID matches the one you used in Terraform. Let's change to the directory where we have the Terraform files. The first command that you need to run is
terraform init. It will download the Google provider and initialize the Terraform backend to use the GCS
bucket. To actually create all those resources that we defined in Terraform, we need to run
terraform apply. Terraform wants to create 12 resources and destroy 0. Looks like that
is what we want; let's agree and type yes.
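To recap, the local workflow boils down to these commands:

```sh
gcloud auth application-default login
cd terraform
terraform init
terraform apply
```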
It may take up to 20 minutes to create all those components, so be patient. Once it's completed, let's go to the Google console to look at the VPC and other resources. When you enable the compute API, Google will generate a default
network for you. You can disable the creation of default networks by creating an organization
policy with the compute.skipDefaultNetworkCreation constraint. Projects that inherit this policy
won't have a default network. Let's delete this VPC manually. Click DELETE VPC NETWORK and
confirm that you want to delete it. This is the main network that we created with Terraform. We
have a single private subnet in the us-central1 region. You can see the main IP address
range and secondary ranges for Kubernetes. You will always have many more pods than services in Kubernetes; that's why the two ranges have different sizes: a /14 for pods and a /20 for services. We
also have a few firewalls created by GKE and our allow-ssh firewall as well. Now let's go to the
Kubernetes engine. You may see some warnings, but I've found that more often than not, they are
false alarms. We have two availability zones for Kubernetes nodes, regular channel
and public endpoint. Under node pools, you will find two instance groups. General with
2 nodes (one per zone) and spot with autoscaling. To connect to the cluster, you need
to click connect and copy the command. Then just paste it to the terminal and execute. To check the connection, just run kubectl
get svc. It should return the Kubernetes service from the default namespace.
We can also run kubectl get nodes to list all the nodes in the cluster. We
have two from the general node pool. Now let's deploy a few examples to the
Kubernetes. The first one is the deployment object to demonstrate cluster autoscaling.
Let's use the nginx image and set two replicas. We want to deploy it to the spot instance group
that does not have any nodes right now. First, we need to tolerate those taints set on the nodes.
Then we want to restrict deployment to only nodes with label team equal to devops. PodAntiAffinity
will force Kubernetes to spread pods between different nodes.
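A sketch of that Deployment; the toleration key and value must match whatever taint you put on the spot node pool, so treat them as example values.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      # Allow scheduling on the tainted spot nodes.
      tolerations:
        - key: instance_type
          operator: Equal
          value: spot
          effect: NoSchedule
      # Only run on nodes labeled team=devops.
      nodeSelector:
        team: devops
      # Spread the pods across different nodes.
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: nginx
              topologyKey: kubernetes.io/hostname
      containers:
        - name: nginx
          image: nginx
```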
We can use the kubectl apply command and provide a path to the folder or file, in this case, example one. Let's use the watch
command to repeatedly run kubectl get pods. For now, pods are in a pending state since they
only can be scheduled on the spot instance group. Let's split the screen and also run the kubectl
get nodes command. We can describe one of the pods to check the status of autoscaling. You can find
the message from cluster-autoscaler that the pod triggered a scale-up. It's a good sign; we just need to wait a few minutes. Two additional nodes joined the cluster from the
spot group. When they become ready, two pods will be able to schedule. Now we have four nodes in
total, two for general and two for the spot. In the following example, I'll show you how to use
workload identity and grant access for the pod to list GS buckets. First of all, we need to create
a service account in the Google Cloud Platform. Let's give it an account id service-a. Then we
need to grant access to that service account to list buckets. Specify the project. Then the
role, for example, Storage Admin. And a member, which is a service account. I suggest using
google_project_iam_member. It's non-authoritative; other members for the role
for the project are preserved. Finally, we need to allow the Kubernetes service
account to impersonate this GCP service account. To establish a link between the Kubernetes RBAC
system and the GCP IAM system. It's always the same role workload identity user, but you need
to update the member. First is the project ID for your GKE cluster. Then the Kubernetes namespace
where you are planning to deploy your application. In this case, it's a staging namespace. And
a name for the Kubernetes service account, which is service-a. We will create a
namespace and a service account in a minute.
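A sketch of those three resources; replace devops-v4 with your own project ID.

```hcl
resource "google_service_account" "service_a" {
  account_id = "service-a"
}

# Non-authoritative role binding for listing/managing storage.
resource "google_project_iam_member" "service_a" {
  project = "devops-v4"
  role    = "roles/storage.admin"
  member  = "serviceAccount:${google_service_account.service_a.email}"
}

# Allow the Kubernetes service account to impersonate the GCP service account.
resource "google_service_account_iam_member" "service_a" {
  service_account_id = google_service_account.service_a.id
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:devops-v4.svc.id.goog[staging/service-a]"
}
```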
Now we need to reapply the Terraform to create the GCP service account and bindings. Let's go back to the Google console.
And choose Service Accounts. Here is a service-a service account created
by Terraform. Under the IAM tab, we can inspect the roles assigned to that service
account. Here is an account and a Storage Admin role. To check who can use this account, you
can go to the permission tab. Under principals, you will find Kubernetes workload identity
associated with Kubernetes service account. Time to create the second example.
The first will be a staging namespace. Then the deployment. Give it a name gcloud
and specify the same staging namespace. We will use the gcloud-sdk image and define the
command and arguments to prevent exiting right after the pod is started. This will give
us time to exec into the pod and run gcloud commands inside of it.
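A sketch of the namespace and the Deployment; I'm assuming the google/cloud-sdk image here, with a sleep command so the container stays up.

```yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: staging
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gcloud
  namespace: staging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gcloud
  template:
    metadata:
      labels:
        app: gcloud
    spec:
      containers:
        - name: gcloud
          image: google/cloud-sdk
          # Keep the container running so we can exec into it.
          command: ["sleep"]
          args: ["infinity"]
```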
Let's apply and test if we can list GCS buckets. We have a staging namespace created a few seconds ago. Since we created the deployment in the staging namespace, don't forget to provide the namespace to kubectl get pods.
To execute commands inside that pod, use kubectl exec and provide the id of the pod.
You can specify the command to execute after the pod. Now run gcloud storage ls to get all
the buckets in the project. We got an error; the caller does not have storage.buckets.list
access. That's because when we omit the service account in the Deployment object, it will use
the default service account in that namespace. Let's fix it. First, create a service account with
the service-a name, then you have to bind this service account with the GCP service account using
the annotation. Under the spec of the deployment, you need to override the service account
to use the one we just created. Optionally, you can use affinity to place this pod on a node with workload identity enabled. We have enabled workload identity on each instance group
already.
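The fix, sketched out; the annotation is what links the Kubernetes service account to the GCP one, and devops-v4 again stands in for your project ID.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: service-a
  namespace: staging
  annotations:
    # Bind this Kubernetes service account to the GCP service account.
    iam.gke.io/gcp-service-account: service-a@devops-v4.iam.gserviceaccount.com
```

Then, in the Deployment's pod spec, set serviceAccountName: service-a so the pod uses it.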
We updated the Deployment object with a service account; when we reapply it, it will force Kubernetes to redeploy the pod. We have a new service-a service account. Now we need to
exec into the new pod and run list buckets again. If you run the gcloud storage ls command again,
it should impersonate the GCP service account and get access to buckets. Alright, we have a single
GS bucket that we created for Terraform backend. For the last example, let me deploy the
nginx ingress controller using Helm. Add the ingress-nginx repository.
Update the helm index. And search for the nginx ingress. We will use
the 4.0.17 Helm Chart version. To override some default variables, create the nginx-values.yaml
file. This is just an example that you can provide some variables such as compute-full-forwarded-for,
use-forwarded-headers, proxy-body-size, and many others. Those variables override
global default nginx settings. You can find all possible options on the nginx website.
You can also override the same settings on the ingress level instead. Newer versions of the ingress controller use a custom resource called IngressClass instead of the old annotation to specify which ingress should handle a service. Enable it, and optionally, you can mark it as the default ingress class. I highly suggest
using podAntiAffinity with all kinds of ingresses. This will make it highly available and spread
the pods between different nodes. In case one node fails or is simply being upgraded, you always have another pod to handle the requests. For this example, I use a single replica, but for production, always use multiple replicas. We will disable the admission webhook for this
example since in GCP with private GKE clusters, it will require opening an additional port;
it's out of the scope of this video. You have an option to configure the load balancer for the
ingress. By default, it will create a public load balancer. But in case you want to have a private
ingress, you can set load balancer annotation to Internal. Metrics are out of scope for this
video as well, but I have another tutorial that explains how to use ingress with Prometheus. It's
time to deploy the ingress. Provide the Helm release name, namespace, version, and the values file to override defaults.
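Here is a sketch of nginx-values.yaml reflecting the options above, followed by the install commands; the keys follow the ingress-nginx chart's values layout as I understand it, so double-check them against the chart documentation, and the release name and namespace are my own choices.

```yaml
# nginx-values.yaml
controller:
  # Global nginx settings (ConfigMap entries).
  config:
    compute-full-forwarded-for: "true"
    use-forwarded-headers: "true"
    proxy-body-size: "8m"
  ingressClassResource:
    name: external-nginx
    enabled: true
    default: false
  admissionWebhooks:
    enabled: false
  replicaCount: 1
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: ingress-nginx
            topologyKey: kubernetes.io/hostname
  # For a private ingress, annotate the service instead of using the default public LB:
  # service:
  #   annotations:
  #     networking.gke.io/load-balancer-type: "Internal"
```

```sh
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install external ingress-nginx/ingress-nginx \
  --namespace ingress --create-namespace \
  --version 4.0.17 \
  --values nginx-values.yaml
```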
In a few minutes, you will get a fully functional ingress controller. Let's check if the controller
is running. Also, you should check the service; if it is stuck in a pending state, you need to describe the service and find the error message. Sometimes it's because you exceeded the
limit of the number of load balancers in the project. When you get the IP address, you
can continue with ingress. Create the third example. We will reuse the deployment object
created earlier for the autoscaling demo. This service will select those pods by using the
app: nginx label. Then the ingress itself. Default namespace. Then specify the ingress class
name to be external-nginx. The first example will be applied to all possible domain names. Specify
the path and a backend pointing to the service created before. Now let's apply this ingress. You
Now let's apply this ingress. You can also verify that you set up the ingress correctly by getting the ingress class. You should see
the external-nginx class name. When you get the ingress, you should see the host. In our case,
the star will represent all possible domain names, and you should see the address. It may take
a few seconds or minutes, but you should see that this address is equal to the nginx ingress load
balancer. The final step to make this ingress work is to create a DNS A record with your DNS provider.
In my case, it's hosted in Google Domains. Give it a name. Then change the record type to A. For the IP,
use the one from the nginx ingress load balancer. You can test it in the browser. Paste
your domain; it should return the default nginx page NOT from ingress but from the
Deployment. Looks like our ingress works as expected. To use this ingress with only
a specific domain, update the host property. If you get the ingress right now, you should
see the HOST field filled with your domain. If you refresh your browser, nothing should
change. If you want to learn more about ingress, I have a dedicated video with a bunch of examples
of how to use nginx ingress, including TLS, cert-manager, and TCP proxies. Thank you for
watching, and I'll see you in the next one.