In this tutorial, I'll guide you step by step on
how to create a VPC using plain Terraform code. Next, we'll refactor it into a well-structured
Terraform module by following best practices. Finally, we'll make use of Terragrunt's
top features to set up your infrastructure. We'll not only build a VPC module, but also
create EKS and Kubernetes add-ons modules from scratch. With the Kubernetes add-ons
module, you can easily enable managed add-ons like the CSI storage driver, or self-managed
add-ons such as the cluster autoscaler, load balancer controller, ArgoCD,
and others deployed as Helm charts. In this tutorial, you'll learn many valuable
techniques and best practices for creating Terraform modules, even if you don't end up using
Terragrunt in your projects. Towards the end, I'll demonstrate a production-ready setup that
includes using an S3 bucket and DynamoDB table to lock the state, as well as creating an IAM role
that can be assumed by users or automation tools like Jenkins. Additionally, we'll set up separate
Git repositories for the live infrastructure and Terraform modules, to help maintain a
well-organized and efficient workflow. Terragrunt offers many useful features
that can enhance your workflows. One such feature is the ability to execute and apply
Terraform on multiple modules simultaneously, while also sharing output variables from
one module as input to others. For instance, we'll create a VPC module and use the
private subnet IDs as input for the EKS modules. Terragrunt can also generate
backend configurations and provider configs, saving you from copy-pasting and keeping
your code DRY. Many people, including myself, find Terragrunt extremely helpful and use
it in production environments all the time. Let’s get started. In this tutorial, we'll talk
about various ways to organize your Terraform code. One method is to make an "environments"
folder containing separate folders for each environment, like "dev" for development. This
environment is for testing new infrastructure and app features, and it's okay if things break.
Next, you might have a "staging" environment where you test your apps before going live. Sometimes
called "pre-prod," this more stable environment should resemble production but might be smaller
in terms of resources and cost. And so forth. If we want to follow Terragrunt's recommended
approach, we'll refer to these environments as "live environments" or "infrastructure live."
This naming sets this folder or Git repository apart from other Terraform code and modules. The
"live" label means you can treat this repo as a reliable source, and whatever is declared
in those folders should be up and running. For your reference, I'll label this
first approach as "v1" when we set up our infrastructure using the basic Terraform code. Alright, let's begin by creating a VPC under the "dev" folder. We'll use this VPC
later to deploy an EKS cluster. First, we need to declare the provider. There
are several ways to authenticate with AWS. For this initial example, I'll stick to the default
method, which uses either environment variables or your default AWS profile. Later, we'll
assume roles, and I'll explain the difference it makes when creating an EKS cluster.
Next, let's set some Terraform constraints. You should use version 1 or higher.
When Terraform provisions infrastructure, it needs a way to track what was created.
This is called the Terraform state, and we use backend configuration to decide
where to store it. For testing purposes, we'll use the local backend, which is the default
option, but we can adjust a few parameters. Later, we'll switch to using an S3 bucket and a DynamoDB
table to lock our state. This approach should be used most of the time, especially when working
in a team, instead of relying on local state. Here, we can set a path for storing the state
file. Often, this path will be in an object store like an S3 or Google Cloud Storage
(GCS) bucket. Additionally, we can set version constraints for the AWS provider itself.
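For reference, here's a rough sketch of what that provider configuration might look like; the state path and version numbers are just examples:

```hcl
provider "aws" {
  region = "us-east-1"
}

terraform {
  # Require Terraform 1.x.
  required_version = ">= 1.0"

  # Local backend for testing only; we'll switch to S3 and DynamoDB later.
  backend "local" {
    path = "terraform.tfstate"
  }

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}
```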
That's all for the provider configuration. Next, let's create a VPC. First, we need
to assign it a CIDR block. At this point, you need to make an important decision.
If you plan to peer multiple VPCs in the future, you should come up with
unique CIDR ranges beforehand. Many third-party add-ons for
Kubernetes, like the EFS storage driver, require DNS support. It's a good
idea to enable it from the start, as it can save you a lot of time
when troubleshooting issues later on. Lastly, let's add the Name tag. You'll
notice that we use environment prefixes, which is a common practice even if you have
separate accounts for different environments. Next, we'll create an internet gateway
to provide internet access for public subnets. We need to attach it
to the VPC and add a Name tag.
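As a rough sketch, with example names and an example CIDR, the VPC and internet gateway could look like this:

```hcl
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"

  # DNS support is required by many Kubernetes add-ons, such as the EFS CSI driver.
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = {
    Name = "dev-main"
  }
}

resource "aws_internet_gateway" "igw" {
  # Attach the gateway to the VPC.
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "dev-igw"
  }
}
```

Now, let's create subnets. We'll make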
two public and two private subnets. The CIDR block for each subnet should be
a subset of the VPC's CIDR block. Then, choose the availability zone. For EKS, you
need at least two different availability zones. Subnet tags are crucial, especially
for EKS. First, add a Name tag. Then, include a tag indicating that EKS can
use the subnet to create private load balancers. Add another tag to
associate this subnet with EKS, with a value of either "owned" or "shared." You
can create an EKS cluster without these tags, but some components might not work as expected.
For example, the Cluster Autoscaler or Karpenter will use these tags to auto-discover subnets for
creating additional Kubernetes workers. Next, create another private subnet in a different
availability zone. Remember, the cluster tag must match your EKS cluster name. For the dev
environment, we'll create the "dev-demo" cluster. Now, let's create the public subnets. You'll
want to enable assigning public IP addresses when virtual machines launch. Additionally, tag
these subnets to allow EKS to create public load balancers. Public load balancers get public IP
addresses and are used to expose your service to the internet. For example, an Nginx ingress
controller can create a public load balancer. In contrast, private load balancers only get
private IP addresses, allowing you to expose your service within your VPC only.
Finally, create the last public subnet in a different availability zone.
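Here's a sketch of one private and one public subnet; the CIDRs, zones, and resource names are illustrative, but the kubernetes.io tags follow the convention EKS expects:

```hcl
resource "aws_subnet" "private_us_east_1a" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.0.0/19"
  availability_zone = "us-east-1a"

  tags = {
    Name                              = "dev-private-us-east-1a"
    "kubernetes.io/role/internal-elb" = "1"     # allow private load balancers
    "kubernetes.io/cluster/dev-demo"  = "owned" # must match the EKS cluster name
  }
}

resource "aws_subnet" "public_us_east_1a" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.64.0/19"
  availability_zone       = "us-east-1a"
  map_public_ip_on_launch = true

  tags = {
    Name                             = "dev-public-us-east-1a"
    "kubernetes.io/role/elb"         = "1" # allow public load balancers
    "kubernetes.io/cluster/dev-demo" = "owned"
  }
}
```

Next, we need to create a NAT gateway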
to provide internet access to private subnets. I recommend manually allocating
a static public IP address, as you might need to whitelist it with your clients
in the future. It's better to allocate multiple public IPs in case you need to perform
blue-green deployments. For the NAT gateway, we must explicitly depend on the
internet gateway. Additionally, the NAT gateway must be placed in one of
the public subnets with an internet gateway.
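A sketch of the Elastic IP and NAT gateway, again with example names:

```hcl
resource "aws_eip" "nat" {
  # On older AWS provider versions this was "vpc = true".
  domain = "vpc"

  tags = {
    Name = "dev-nat"
  }
}

resource "aws_nat_gateway" "nat" {
  allocation_id = aws_eip.nat.id

  # The NAT gateway must live in a public subnet that routes to the internet gateway.
  subnet_id = aws_subnet.public_us_east_1a.id

  tags = {
    Name = "dev-nat"
  }

  # Make sure the internet gateway exists first.
  depends_on = [aws_internet_gateway.igw]
}
```

Finally, we need to create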
routing tables. For now, both public and private subnets have the same
default route, which is limited to your VPC only. First, we'll create a private routing table
and use the NAT gateway as the default route. A route that matches all IP addresses (0.0.0.0/0) is called the default route. Then, the second, public routing table will have the default route set to the
internet gateway. Next, we need to associate all four subnets (two private and two public) with the
corresponding routing tables. In the next example, we'll dynamically generate subnets and associate
them with routes. That's pretty much it.
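Here's roughly what the routing tables and associations could look like for one private and one public subnet (the remaining subnets follow the same pattern):

```hcl
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  # Default route: all IPv4 traffic goes through the NAT gateway.
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat.id
  }

  tags = {
    Name = "dev-private"
  }
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  # Default route goes through the internet gateway.
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }

  tags = {
    Name = "dev-public"
  }
}

resource "aws_route_table_association" "private_us_east_1a" {
  subnet_id      = aws_subnet.private_us_east_1a.id
  route_table_id = aws_route_table.private.id
}

resource "aws_route_table_association" "public_us_east_1a" {
  subnet_id      = aws_subnet.public_us_east_1a.id
  route_table_id = aws_route_table.public.id
}
```

Optionally, you can expose the VPC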
ID. This can be useful if you use it as input for another Terraform code or module.
That's all for the VPC setup. The current folder structure consists of a "live" folder, followed
by "environments" and component-specific Terraform code, such as the VPC. Now, let's switch to the
VPC folder and initialize Terraform. This will download all required Terraform providers
and initialize the Terraform state. Then, run "apply" and enter "yes" to create your VPC.
It may take about 2 or 3 minutes to complete. Once Terraform finishes,
it should return the VPC ID. Now, you can check the AWS console to confirm
that we have a newly created VPC with the proper Name tag and all four subnets with the "dev"
prefix. This is a typical example of how to use Terraform to create an AWS VPC. It's
very straightforward: define in the code what you want to create and apply Terraform. The
challenge is reproducing this at scale and keeping the Terraform code DRY (Don't Repeat Yourself).
Additionally, based on your backend configuration, Terraform will create a state
file under the "dev/vpc" folder. If we want to reproduce the same setup
in another environment, such as staging, using just plain Terraform code, we'll need to
copy the VPC folder to the new environment and replace all the environment-specific references.
For some of these references, we could use variables, but for others, such as the backend configuration, we cannot.
First, let's remove the state folders and file. In the backend block, variables are not supported,
and you'll have to manually replace the path for each new environment. For example, replace
"dev" with "staging." If you use a separate bucket for each environment, you'll need
to replace the bucket name instead. Then, we need to find all the references to the specific
environment and replace them with "staging." You can definitely use variables here. Under
subnets, we have a lot of references to "dev" that we need to replace. I'll use a Visual Studio
shortcut to replace all occurrences of "dev." Next, update the NAT gateway and routes. It's possible that we may miss something, so
let's search for any remaining "dev" references. Also, don't forget to update the internet gateway. At this point, the directory structure has both "dev" and "staging" folders under "environments," each with its own copy of the VPC code. That's it! Let's go ahead and initialize
Terraform in the staging VPC folder. You may encounter an error since we
copied the Terraform folder. To fix it, just run init with the "-reconfigure" flag, but be very careful and first ensure that you're in the right place. Then, apply the Terraform configuration. Alright! Now, we have one VPC for the
dev environment and another for staging. The same applies to the subnets; we
have four for dev and another four for staging. Some companies may choose to
dedicate a separate AWS account for the production environment with very limited access.
Additionally, we now have another Terraform state file for the staging environment under the
"staging" folder. In the following example, we'll improve it. But before we proceed, make
sure to destroy the staging and dev VPCs. Once completed, you should not have any VPCs in
that account besides the default VPC. By the way, you should not use the default VPC in any
situation. It's only there for demonstration purposes, so feel free to delete it.
Sometimes you may get an error that the default VPC does not exist; that's actually a useful signal that you forgot to explicitly configure your own VPC somewhere. In the next part of this tutorial, we will
transform our Terraform code into a module. This will greatly reduce the amount of
code we need to duplicate when working with different environments. By doing this,
we are taking the first step towards making our code more efficient and less repetitive. At
the moment, we will keep everything in the same repository and create a folder specifically
for infrastructure modules. In the future, we will explore how to structure Terraform modules across different Git repositories. Now, let's copy the entire VPC folder
and place it under the modules folder. We can remove the dev folder, which
contains the Terraform state and lock files. Next, we'll begin refactoring. We don't need to
declare the provider within the module. Instead, we can just set version constraints for Terraform
and the provider. Let's remove the provider and backend configurations. The provider and backend
will be managed in the infrastructure-live folder. Next, create a variables file. We'll modify our
code by moving some parts into variables. One commonly used variable is the environment
variable. This helps differentiate between different environments and is often used as a
prefix for your infrastructure components. It's a good idea to add a description and specify
a type, like "string," for this variable. The next variable is a CIDR block. In this case, we can set a default value for the variable
and only override it when necessary.
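For example, with placeholder descriptions and a placeholder default CIDR:

```hcl
variable "environment" {
  description = "Environment name, used as a prefix (e.g. dev, staging)"
  type        = string
}

variable "vpc_cidr_block" {
  description = "CIDR block for the VPC"
  type        = string
  default     = "10.0.0.0/16"
}
```

Now, let's begin converting this code into a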
module, following best practices. If there's no more descriptive and general name
available, or if the resource module creates only one resource of this type,
the resource name should be called "this." Next, replace the hardcoded CIDR block for the
VPC with a variable. Whenever you create a module, replace all possible configurations with variables
and provide default values if you don't want to change them. This approach will be helpful
in the future if you receive new requirements and need to update a parameter. It makes your
modules flexible and future-proof. For instance, when working with DNS, you should also create
variables and set the defaults to true. Now, replace the "dev" prefix with
the environment variable. You can also replace the entire name with
a variable, not just the prefix. Next, apply the same process to the
internet gateway. Replace the resource name with the "this" keyword. Update the
vpc_id to reference "this" as well. Also, update the name tag following the same
approach used for the VPC resource. For subnets, let's remove the
existing code and replace it with logic that dynamically creates subnets
on demand. You can find similar logic in the official AWS VPC module. But
first, let's add a few new variables. The first variable is for availability zones.
We'll pass a list containing the zones we want to use. Next, create a variable for CIDR
ranges for the private subnets. Similarly, create another variable for CIDR
ranges for the public subnets. Next, create two more variables to pass
additional tags for private and public subnets. This is particularly useful when using
a VPC for EKS clusters, as previously mentioned.
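Here's a sketch of these variables; the names are assumptions, loosely modeled on the official AWS VPC module:

```hcl
variable "azs" {
  description = "Availability zones for the subnets"
  type        = list(string)
}

variable "private_subnets" {
  description = "CIDR ranges for the private subnets"
  type        = list(string)
}

variable "public_subnets" {
  description = "CIDR ranges for the public subnets"
  type        = list(string)
}

variable "private_subnet_tags" {
  description = "Additional tags for private subnets (e.g. EKS discovery tags)"
  type        = map(any)
  default     = {}
}

variable "public_subnet_tags" {
  description = "Additional tags for public subnets (e.g. EKS discovery tags)"
  type        = map(any)
  default     = {}
}
```

Now that we have variables for subnets, we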
can create Terraform code for them. Instead of using a "for each" loop, we'll use the
"count" variable and create as many private subnets as needed, based on the input provided
to the module. We'll then use the VPC reference. For the CIDR block, we'll use the count index.
We'll apply the same logic for the availability zone. For the tags, we'll use the built-in
merge function to combine the provided tags with a name tag. Essentially, we'll use the same
logic to create public subnets for the module.
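A sketch of the private subnets using count, the count index, and the merge function (the public subnets follow the same pattern):

```hcl
resource "aws_subnet" "private" {
  # Create one subnet per CIDR range provided to the module.
  count = length(var.private_subnets)

  vpc_id            = aws_vpc.this.id
  cidr_block        = var.private_subnets[count.index]
  availability_zone = var.azs[count.index]

  # Combine the Name tag with any extra tags, such as the EKS discovery tags.
  tags = merge(
    {
      Name = "${var.environment}-private-${var.azs[count.index]}"
    },
    var.private_subnet_tags
  )
}
```

Now, let's update the NAT Terraform code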
to use the "this" keyword and modify the Name tags to incorporate the environment variable. For the subnet, let's use the
first generated public subnet. The same applies to the routes. We need to
refactor the code and use the "this" keyword. We should also update the logic
to associate these routing tables with the generated subnets.
Instead of hardcoding each route, let's use a count variable and the index of
the subnets. First, let's associate a private routing table with all private subnets,
and then do the same for public subnets. Now, let's add a few output variables, such
as lists of private and public subnets. These output variables can be used later when
passing information to the EKS module. We can use the star (*) to return all created subnets.
These shortcuts are simple but very powerful.
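For example, the subnet outputs with the splat expression might look like this:

```hcl
output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}

output "public_subnet_ids" {
  value = aws_subnet.public[*].id
}
```

We have now completed the module. Next,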
let's create another live environment where we can call these Terraform
modules. We'll use the same structure: a 'dev' folder for the development environment
and another folder for the staging environment. Now, create another VPC folder
to invoke the VPC module. One crucial lesson learned from writing hundreds
of thousands of lines of infrastructure code is that large modules should be considered harmful.
In other words, it's not a good idea to define all your environments (dev, stage, prod, etc.) or even
a significant amount of infrastructure (servers, databases, load balancers, DNS, etc.) in a
single Terraform module. Large modules are slow, insecure, hard to update, challenging
to code review, and difficult to test. Alright, create a main.tf file
to call the VPC module. First, as with plain Terraform code, we
need to declare the AWS provider and set up the backend. In this example,
we'll continue to use the local state. Next, declare the VPC module. In this
case, 'vpc' is just an arbitrary label for the module block. Then, specify the source. A Terraform module is simply
a folder containing a bunch of Terraform code. You can reference it using a relative path
or, later on, use a dedicated git repository. Declare the environment. Since
it's under the 'dev' folder, it should be the development environment. Later, I'll show you how to dynamically obtain
this information from the folder structure. Next, provide the availability zones, private
subnet CIDR ranges, and public ranges. Finally, let's pass the same subnet tags for
both private and public subnets. As you can see, the code is now much more concise.
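Here's a hedged sketch of that main.tf; the relative path, zone names, and CIDRs are examples:

```hcl
module "vpc" {
  source = "../../../infrastructure-modules/vpc"

  environment     = "dev"
  azs             = ["us-east-1a", "us-east-1b"]
  private_subnets = ["10.0.0.0/19", "10.0.32.0/19"]
  public_subnets  = ["10.0.64.0/19", "10.0.96.0/19"]

  # Tags that let EKS and the Cluster Autoscaler discover the subnets.
  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = "1"
    "kubernetes.io/cluster/dev-demo"  = "owned"
  }

  public_subnet_tags = {
    "kubernetes.io/role/elb"         = "1"
    "kubernetes.io/cluster/dev-demo" = "owned"
  }
}
```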
We've reduced our Terraform code to a single file. Optionally, we can use output variables if we
want to display them in the console; otherwise, you can still use them, but they won't be printed
to the console. To reference an output variable, first use the 'module' keyword, followed by
the module name and the module output variable. So far, we have the VPC Terraform module and
a main.tf file under the 'dev' environment to invoke it. Now, let's switch to the
environment and initialize Terraform. After that, run 'terraform apply'
to create the VPC using this module. In the terminal, you can see all output
variables, such as subnets and the VPC ID. Now, instead of copying the entire
Terraform folder with all its files, we'll simply create another main.tf file under
the staging environment. Let me copy the content of the main file from the 'dev' environment and
replace all references with the 'staging' keyword. Also, let's copy the output file. Switch to the 'staging' VPC folder, initialize
Terraform, and then apply the changes. Now, we have the same setup
as in the first example: a 'dev' and 'staging' VPC, along with 8 subnets. Before moving on to the next example, let's
destroy both the 'dev' and 'staging' VPCs. Alright, we have successfully deleted both VPCs.
In this section, we'll improve our current setup by using Terragrunt. Terragrunt is a simple tool
that offers additional features for making your configurations more efficient, working with many
Terraform modules, and handling remote state. Let's make another live environment
to use Terragrunt. To set it up, we need to create a terragrunt.hcl file.
If you're using the same S3 bucket for every environment and only changing the path where the state is stored, you can place this file above your environment
folders. You’ll see more examples later. In this tutorial, we'll cover many features
of Terragrunt. While most of them are simple shortcuts, when used together, they
can greatly enhance your workflow. First, let's reorganize the Terraform backend
configuration. Usually, you'll use remote state, but for this initial example, we'll stick with
local state. Keep in mind that the backend configuration doesn't support variables or
expressions, so you'll need to copy and paste it, updating the parameters as needed. For
instance, even when using local state, you must update the path to the state
file: for the development environment, it's "dev/vpc/state," while for staging, it's
"staging/vpc/state." If you use different buckets, you'll also need to update the
bucket name for each environment. Terragrunt helps you maintain efficient
backend configurations by letting you define them just once in a root
location and then inheriting that configuration in all child modules.
The "path_relative_to_include" will be translated to "dev" for the development
environment and "staging" for the staging environment. This way, you won't have
to repeat yourself in the configuration. With Terragrunt, you can now create your
backend configuration just once in the root terragrunt.hcl file, and it will
be used across all environments and modules. This simplifies your
setup and reduces repetition.
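A minimal sketch of that root terragrunt.hcl, still assuming the local backend for now (the generated file name is arbitrary):

```hcl
remote_state {
  backend = "local"

  # Terragrunt writes this backend block into each module for us.
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }

  config = {
    # Becomes "dev/vpc/terraform.tfstate", "staging/vpc/terraform.tfstate", and so on.
    path = "${path_relative_to_include()}/terraform.tfstate"
  }
}
```

Managing provider configurations across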
all your modules can be challenging, particularly when customizing authentication
credentials. If you need to update your provider, you must do so in each environment separately,
which can be time-consuming and repetitive. Let's say you want Terraform to assume
an IAM role before connecting to AWS; you need to add a provider block with the
"assume_role" configuration. You would then copy and paste this configuration into
every one of your Terraform modules. While it's not a significant amount of code, it
can be difficult to maintain. For instance, if you need to modify the configuration to
expose another parameter (e.g., "session_name"), you would have to go through each of your modules
and make the change, which can be cumbersome. In addition, what if you wanted to
directly deploy a general-purpose module, such as one from the Terraform module
registry? These modules typically do not expose provider configurations as
it is tedious to expose every single provider configuration parameter imaginable
through the module interface. Terragrunt allows you to refactor common Terraform
code to keep your Terraform modules DRY. I’ll show you more examples
later. For this basic example, we'll use the AWS provider with the default
authentication method and the "us-east-1" region.
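For this basic example, the generated provider could look like this:

```hcl
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"

  contents = <<EOF
provider "aws" {
  region = "us-east-1"
}
EOF
}
```

That's all the setup required to begin using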
Terragrunt. With this basic configuration, you can start taking advantage of its
features to improve your Terraform workflows. Now, let's create a standard folder structure for
our environments, including "dev" and "staging" folders as usual. We'll also create a "vpc" folder
to call the VPC module. To use Terragrunt, we need to declare a single file. Start by defining the
source for the module, which can be a local path, a Git repository, or the Terraform registry,
just like a regular module source attribute. Here's where things differ: we'll include the
root terragrunt file that we defined earlier. This will generate backend configuration and set up the
provider for us. Then, under the "inputs" section, you'll provide the same Terraform variables
that we used in the previous example, such as environment, availability zones, etc.
This part is identical to a regular module, except that you need to use the "inputs" block to supply these variables.
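Putting it together, the child terragrunt.hcl for the dev VPC could look roughly like this; the paths and input values are examples:

```hcl
terraform {
  # Local path for now; later this becomes a Git URL.
  source = "../../../infrastructure-modules/vpc"
}

# Pull in the root terragrunt.hcl to generate the backend and provider.
include "root" {
  path = find_in_parent_folders()
}

inputs = {
  environment     = "dev"
  azs             = ["us-east-1a", "us-east-1b"]
  private_subnets = ["10.0.0.0/19", "10.0.32.0/19"]
  public_subnets  = ["10.0.64.0/19", "10.0.96.0/19"]
}
```

To start using Terragrunt, you first need to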
install it. You can download it from the source, but a preferred method is to use a
package manager. For instance, on a Mac, you would use Homebrew to install Terragrunt. Next, navigate to the "vpc" folder within the
development environment. Instead of running "terraform init," you just need to
run "terragrunt init" (you may want to create an alias for this command). Then,
execute "terragrunt apply." Once Terragrunt completes the deployment, you'll have
the same VPC and subnets as before. Now, let's examine the backend configuration.
Terragrunt generates the backend and provider configurations in its own working directory, which you can find under the ".terragrunt-cache" folder. You'll notice that we still have the "dev" key
for the state. It's important not to use local state with Terragrunt; later, we'll convert it
to S3. With the current setup, it's challenging to share your state with other team members
since the entire folder is ignored by Git. Now, let's create a similar VPC in
the staging environment. Copy the Terragrunt file from the "dev" environment
and replace all references with "staging." That's pretty much it. Navigate to the
"vpc" folder under the staging environment, initialize Terragrunt, and then apply the changes. Because we used the same backend configuration, Terragrunt created a different path based on
the location, which starts with the "staging" key. This approach will be very helpful when
working with remote backend configurations. As a result, we have the same VPC for both the
development and staging environments, along with their respective subnets. While it may not seem
like a significant change, it can save you a lot of time and help maintain DRY configurations
when you have a large number of environments. Another useful Terragrunt feature is the
ability to run commands in multiple folders simultaneously. This will be incredibly
valuable later when we define dependencies between modules. For now, instead of changing
directories and running "destroy" to clean up, we can simply run "terragrunt run-all
destroy" from the root folder. It will show you where it's going to execute those
commands and ask for your confirmation. By running just one command, we've successfully destroyed both VPCs in
the staging and development environments. In the following part of this tutorial,
we will make an EKS Terraform module and add some extra features like cluster
autoscaling. Let's move forward by making a new folder called 'eks' inside
the 'infrastructure modules' folder. We'll start by using the same version
constraints for the AWS Terraform provider. Before we can set up the EKS control
plane, we need to create an IAM role with the EKS service principal. After that, we must attach the AmazonEKSClusterPolicy, which allows EKS to create EC2 instances and load balancers. For the cluster name, we'll combine the environment prefix with the EKS name variable. Like in the previous module, it's best practice
to parameterize all possible options and set defaults, rather than hardcoding them in the
module or relying on provider defaults. Then, attach the IAM role to the cluster. We need
to parameterize these values too. For now, I'll turn off the private endpoint since I don't
have a VPN in this cluster, and enable the public endpoint. This way, I can access the EKS
from my laptop and deploy applications. Up next, we have to supply subnets for
EKS, which should be located in at least two different availability zones.
Amazon EKS sets up cross-account elastic network interfaces in these subnets to
enable communication between your worker nodes and the Kubernetes control plane. We'll pass
this variable dynamically from the VPC module using the Terragrunt dependency feature. That's
about everything we need for the control plane.
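Here's a hedged sketch of the control plane pieces described above; the resource and variable names are assumptions:

```hcl
resource "aws_iam_role" "eks" {
  name = "${var.environment}-${var.eks_name}-eks-cluster"

  # Allow the EKS service to assume this role.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "eks.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "eks" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
  role       = aws_iam_role.eks.name
}

resource "aws_eks_cluster" "this" {
  name     = "${var.environment}-${var.eks_name}"
  version  = var.eks_version
  role_arn = aws_iam_role.eks.arn

  vpc_config {
    # No VPN here, so keep the private endpoint off and the public endpoint on.
    endpoint_private_access = false
    endpoint_public_access  = true

    # Subnets in at least two availability zones, passed in from the VPC module.
    subnet_ids = var.subnet_ids
  }

  depends_on = [aws_iam_role_policy_attachment.eks]
}
```

Now, let's create an IAM policy and IAM role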
for the Kubernetes nodes. We'll use a similar prefix with the environment and cluster name. If you want to set up multiple environments in the same account, you'll need this prefix; otherwise, you'll face a naming conflict when trying to create another environment. Next, we have to
attach multiple IAM policies to this role. We'll use a 'for each' loop to iterate over all provided
policies and attach them to the nodes' IAM role. The last policy is optional; it allows you to
use the session manager to SSH into the node. In the next file, let's create EKS-managed
instance groups. As in the previous example, we want to iterate over all node groups provided as
a map variable. All node groups must be connected to the EKS cluster we created earlier. We'll
use the key of each map entry as the node group name, for instance, 'general'. We'll also share the same IAM
role among all node groups. If you need to grant applications running in EKS additional access to the AWS API, you would use an OpenID Connect provider (IAM roles for service accounts) rather than extending the node role. You'll see an example of this
later on when we deploy the cluster autoscaler. Next, we'll set the capacity type of the node,
which can be either on-demand or spot. Then, we'll specify the list of instance types
associated with the EKS Node Group, such as 't3a.xlarge'. After that, we'll configure
the scaling settings. Remember that these settings only set up the initial autoscaling
group parameters, like minimum, maximum, and desired size. To enable autoscaling, you must
deploy the cluster autoscaler or use Karpenter. Next, we'll set the maximum number of worker nodes that can be unavailable during a node group update. I'll keep
the default of 1 node at a time. It's also helpful to assign labels to
the Kubernetes workers. Later, you can use nodeSelector or affinity to bind
pods with nodes. Additionally, we'll use a similar 'depends_on' statement to ensure the IAM
role is ready before creating instance groups.
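Here's a sketch of the node role and the managed node groups, assuming a node_iam_policies map of policy ARNs and a node_groups map shaped like the inputs we'll pass in later:

```hcl
resource "aws_iam_role" "nodes" {
  name = "${var.environment}-${var.eks_name}-eks-nodes"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

# Attach every IAM policy provided to the module (worker node, CNI, ECR, SSM, ...).
resource "aws_iam_role_policy_attachment" "nodes" {
  for_each = var.node_iam_policies

  policy_arn = each.value
  role       = aws_iam_role.nodes.name
}

resource "aws_eks_node_group" "this" {
  for_each = var.node_groups

  cluster_name    = aws_eks_cluster.this.name
  node_group_name = each.key # e.g. "general"
  node_role_arn   = aws_iam_role.nodes.arn
  subnet_ids      = var.subnet_ids

  capacity_type  = each.value.capacity_type # ON_DEMAND or SPOT
  instance_types = each.value.instance_types

  # Initial sizes only; actual scaling is handled by the Cluster Autoscaler or Karpenter.
  scaling_config {
    desired_size = each.value.scaling_config.desired_size
    max_size     = each.value.scaling_config.max_size
    min_size     = each.value.scaling_config.min_size
  }

  update_config {
    max_unavailable = 1
  }

  labels = {
    role = each.key
  }

  depends_on = [aws_iam_role_policy_attachment.nodes]
}
```

The next step involves setting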
up the OpenID Connect provider, which is used to grant access to the AWS API. Most
of the time, you'll want this in your cluster, but sometimes it's not necessary. Let's create
a boolean flag called 'enable_irsa' that we can use to create this provider on demand. Then,
you'll need to point it to the EKS control plane. Once you've retrieved the EKS TLS certificate,
you can proceed to create the OpenID Connect provider. Since we'll be creating additional
Kubernetes add-on modules, we want to expose some variables that can be passed to another
module. For example, the full EKS name that includes the environment prefix. To deploy the
cluster autoscaler, we'll need to use this OpenID provider ARN to establish trust between
AWS IAM and the Kubernetes service account.
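A sketch of the optional OpenID Connect provider and the outputs we'll need later; the tls provider supplies the certificate data source:

```hcl
data "tls_certificate" "this" {
  count = var.enable_irsa ? 1 : 0

  url = aws_eks_cluster.this.identity[0].oidc[0].issuer
}

resource "aws_iam_openid_connect_provider" "this" {
  count = var.enable_irsa ? 1 : 0

  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = [data.tls_certificate.this[0].certificates[0].sha1_fingerprint]
  url             = aws_eks_cluster.this.identity[0].oidc[0].issuer
}

output "eks_name" {
  value = aws_eks_cluster.this.name
}

output "openid_provider_arn" {
  # Null when IRSA is disabled.
  value = try(aws_iam_openid_connect_provider.this[0].arn, null)
}
```

Finally, let's declare the variables that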
we want to provide for this module. First, we'll use the same environment variable as a
prefix. Then, we'll specify the desired EKS version and the name of the EKS cluster. Next,
we'll provide the list of subnet IDs that we need to pass to EKS, followed by the default IAM
policies that we have to attach to the EKS nodes. Lastly, we'll define the node groups, specifying
all the parameters of the desired Kubernetes node group. And finally, we'll add the 'enable_irsa'
flag to create the OpenID Connect provider. Now, let's create another live
environment—our fourth one. First, let's create a Terragrunt file and define shared
objects between environments and modules. We'll continue using the local state for now, but
this will be the last time, I promise. Also, we'll set up the AWS Terraform
provider with the 'us-east-1' region. Next, create a 'dev' folder for the development
environment. Inside it, let's create another Terragrunt file, but in this case, define common
variables only for this development environment. Later, when we need to create another environment,
that's when you'll need to update most of them. For example, I want to share the 'dev' environment
prefix across all modules in this environment. Before we can create the EKS cluster, we need
to provision a VPC in your AWS account. Let's copy the VPC module from the previous example
and paste it under the development folder. Make sure to delete the lock and state files
from the previous example, as we don't need them. First, let's refactor the environment
variable. Like in the root, we can include that environment variable
in the code and use it wherever we need it. The 'expose' attribute determines if the
included config should be parsed and made available as a variable. This enables other
parts of the configuration to access and use it. Instead of hardcoding the environment variable, we can dynamically pass it from the parent
folder. This is another step to make your Terraform code DRY (Don't Repeat Yourself). That's
all for the VPC; we'll leave the other variables as they are. If you decide to use a different
EKS name, you must update the tags accordingly.
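One way to wire this up, assuming an env.hcl file at the environment level (the file name and attribute names are just one possible convention):

```hcl
# live/dev/env.hcl
locals {
  env = "dev"
}
```

And then the VPC configuration picks it up through an exposed include:

```hcl
# live/dev/vpc/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

include "env" {
  path   = find_in_parent_folders("env.hcl")
  expose = true
}

terraform {
  source = "../../../infrastructure-modules/vpc"
}

inputs = {
  # Taken from the exposed env.hcl instead of being hardcoded.
  environment = include.env.locals.env

  azs             = ["us-east-1a", "us-east-1b"]
  private_subnets = ["10.0.0.0/19", "10.0.32.0/19"]
  public_subnets  = ["10.0.64.0/19", "10.0.96.0/19"]
}
```

Now, create another folder for the EKS module and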
make a new Terragrunt file. For the time being, we'll keep using the relative path to specify
the module's source. Next, include the root to generate backend configuration and
the AWS provider. Then, add a similar environment variable. By the way, you can also
expose some variables in the root Terragrunt file if you want to share them between different
environments, such as the AWS account or region. Next, we need to provide input variables
for the module. First, we want to use the most recent EKS control plane version available
at the moment, which is currently 1.26. Then, we'll use the same environment variable as in
the VPC module and set the EKS cluster name. This is where Terragrunt really shines—it allows
you to define dependencies between modules. For example, the EKS module depends on the VPC module
and needs subnet IDs. With plain Terraform, you would have to use the Terraform remote state
and execute those modules sequentially. Next, we need to create a 'node_groups' variable with
the desired settings for the group. Finally, define the dependency on
the VPC module. To do that, you simply need to point to the VPC
folder where you invoke that module. It's also important to provide some mock outputs.
This is helpful when you want to run 'terraform plan' on both modules simultaneously.
If you omit this mock output variable, the plan will exit with an error stating
that the EKS module needs subnet IDs. You'll see this in action soon. That's
all for the VPC and EKS module.
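Here's a hedged sketch of the dependency and inputs in the EKS terragrunt.hcl; the include blocks mirror the VPC one, and the output and variable names are assumptions:

```hcl
dependency "vpc" {
  config_path = "../vpc"

  # Mock outputs let "run-all plan" succeed before the VPC actually exists.
  mock_outputs = {
    private_subnet_ids = ["subnet-1234", "subnet-5678"]
  }
}

inputs = {
  environment = include.env.locals.env
  eks_version = "1.26"
  eks_name    = "demo"

  # Real subnet IDs come from the VPC module outputs at apply time.
  subnet_ids = dependency.vpc.outputs.private_subnet_ids

  node_groups = {
    general = {
      capacity_type  = "ON_DEMAND"
      instance_types = ["t3a.xlarge"]
      scaling_config = {
        desired_size = 1
        max_size     = 5
        min_size     = 0
      }
    }
  }
}
```

Now, let's go ahead and initialize Terraform.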
Terragrunt offers another useful feature that allows you to run the same command in
multiple folders and, most importantly, respect dependencies. Let's switch
to the development environment. From here, we can run 'init' for both
VPC and EKS modules. By the way, this step is optional; Terragrunt will automatically initialize
it when you run 'plan' or 'apply' anyway. You can see that Terragrunt will run 'init' in
the VPC first and then in the EKS module since we've defined the dependency. Now, let's run
'plan'. It will execute in the same order—VPC and then EKS. If you leave the mock variables
out, the plan will exit with an error on EKS. Alright, we can run 'apply' now.
Terragrunt will show you the order again and ask you to confirm the action.
When you say yes, it will create the VPC first and use subnet IDs output variables
as inputs in the EKS module. Typically, we have many different modules in our
environment, and it becomes extremely helpful to share output variables and run 'apply'
on the entire environment simultaneously. Alright. It will take maybe 10 minutes
to create the VPC and EKS cluster. To access the EKS cluster, we need to update
our local Kubernetes config using the 'aws eks' command. Let's run 'kubectl get nodes'
to verify that we can connect to the cluster. At this point, we have the VPC and EKS cluster
set up. Most of the time, you would want to deploy additional components, such as cluster autoscaler,
CSI storage drivers, load balancer controllers, etc. For that, let's create a separate
Terraform module called 'kubernetes-addons'. We'll combine managed and self-managed
addons that are deployed as Helm charts. As always, let's begin by creating version
constraints. In the case of this module, we want to use the Helm provider to deploy self-managed
Kubernetes addons as Helm charts. Next, let's create an additional Terraform file for each
addon. The first one is the cluster autoscaler. The cluster autoscaler needs access to the AWS
API to discover autoscaling groups and adjust the desired size setting on them. For that, we need to use IAM roles for service accounts (IRSA). We'll deploy it in the 'kube-system' namespace, and the Kubernetes
service account name is 'cluster-autoscaler'. We need to set it in the Helm chart later. All
Kubernetes addons will have a flag to enable them, such as 'enable_cluster_autoscaler'. If
it's true, the count is 1, which means we will create this type of resource. Then, use
the EKS name as a prefix for the IAM role. Next, let's create the IAM policy that allows
the cluster autoscaler to work properly, as I described earlier. Attach this policy to the
trusted IAM role. And let's create a Helm release. It will also use a boolean flag to enable it, the
name of the Helm release, the remote repository to use, and the chart name. Specify the namespace,
which must match the namespace on the IAM role. Then, provide the chart version. Now it's
important to match the service account name with the IAM role. To establish trust, we must set the
Kubernetes service account annotation with the IAM role ARN. Finally, provide the EKS name so that
the cluster autoscaler can auto-discover subnets and autoscaling groups. This functionality is
based on the subnet tags that we provided earlier.
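Assuming the IAM role above is defined as aws_iam_role.cluster_autoscaler, the Helm release might look roughly like this; the value keys follow the upstream cluster-autoscaler chart, but double-check them against the chart version you install:

```hcl
resource "helm_release" "cluster_autoscaler" {
  count = var.enable_cluster_autoscaler ? 1 : 0

  name       = "autoscaler"
  repository = "https://kubernetes.github.io/autoscaler"
  chart      = "cluster-autoscaler"
  namespace  = "kube-system"
  version    = var.cluster_autoscaler_helm_version

  # Must match the service account name referenced by the IAM role.
  set {
    name  = "rbac.serviceAccount.name"
    value = "cluster-autoscaler"
  }

  # Annotate the service account with the IAM role ARN to establish trust (IRSA).
  set {
    name  = "rbac.serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = aws_iam_role.cluster_autoscaler[0].arn
  }

  # Lets the autoscaler discover the cluster's autoscaling groups by tags.
  set {
    name  = "autoDiscovery.clusterName"
    value = var.eks_name
  }
}
```

Now, I intentionally limited the number of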
parameters that we can customize so that everyone can follow the same process and limit the drift
between environments. To deploy the load balancer controller, just create another Terraform
file and follow the same logic. The same goes for Karpenter, ArgoCD, and other components.
First, as always, the environment variable. Then, the EKS cluster name that we'll get from
the EKS output variable, a flag to deploy the cluster autoscaler, the Helm chart version of
the cluster autoscaler, and finally, we need to pass the OpenID Connect provider ARN from the
EKS module. That's all for the addons module. Next, create an addons folder under the live
development environment and create a Terragrunt file. As always, we need to define the source of
the module, include the backend and AWS provider config, and then the same environment variable.
Now, we need to pass the EKS name as a dependency from the EKS module and the OpenID Connect
provider ARN, enable the cluster autoscaler, specify the version of the chart that we
want to install, and define dependencies on the EKS module. Specifically, we need the EKS
cluster name and the OpenID Connect provider ARN. The tricky part is to authenticate the Helm
provider, which we will generate using Terragrunt as well. Keep in mind that you cannot pass
variables from the EKS module here. This provider will be generated and can only use variables that
are provided to the module itself. Essentially, you can generate anything you want. To initialize
the Helm provider, we need to get a temporary token. You can pretty much initialize any
provider that needs to authenticate with EKS using the same principle, such as the
Kubernetes and kubectl Terraform providers.
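Here's a sketch of that generated Helm provider, using a temporary token from the EKS cluster; note that it can only rely on variables passed to the addons module itself, such as the EKS name:

```hcl
generate "helm_provider" {
  path      = "helm-provider.tf"
  if_exists = "overwrite_terragrunt"

  contents = <<EOF
data "aws_eks_cluster" "this" {
  name = var.eks_name
}

data "aws_eks_cluster_auth" "this" {
  name = var.eks_name
}

provider "helm" {
  kubernetes {
    host                   = data.aws_eks_cluster.this.endpoint
    cluster_ca_certificate = base64decode(data.aws_eks_cluster.this.certificate_authority[0].data)
    token                  = data.aws_eks_cluster_auth.this.token
  }
}
EOF
}
```

That's all. Let's go back to the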
terminal and run terraform apply on the whole development environment.
Since we defined dependencies, you can see the order in which Terraform will
apply the infrastructure. Let's confirm it. After deploying the infrastructure with
Terraform, it's important to verify that the cluster autoscaler has been installed
correctly. To do this, you can check the installed helm charts and confirm that
the autoscaler is installed. Additionally, you should make sure that the autoscaler pod
is up and running. It's also recommended to check the logs of the autoscaler for any
errors. If there are any misconfigurations, you may see an error in the logs indicating
that the autoscaler is unable to access the AWS API due to permission denied. If the
autoscaler is unable to access the AWS API, it won't be able to adjust the desired
size property for autoscaling. Therefore, it's important to confirm that the
autoscaler has been deployed successfully. To test if the autoscaler is working correctly,
we can create a simple deployment based on Nginx with 4 replicas. After creating the deployment,
we can watch the pods in the default namespace and see that one pod is in a pending state.
This is because there are not enough nodes to schedule the pod. We can describe the pending
pod and confirm that the autoscaler is triggered to add more nodes. We should see a message
indicating that the pod triggered a scale-up from 1 to 2 instances. After a few seconds or
minutes, a new node should join the cluster, and the pending pod should be scheduled. This test
confirms that the autoscaler is working correctly. To replicate the setup in the staging environment, you can simply create a copy of the
dev folder and rename it to staging. The only required change would be to update the
environment variable from "dev" to "staging". You can also adjust other parameters, such as the EKS node instance types, but since autoscaling is set up, the same configuration can be used across different environments, including production. Switch to the staging environment, and initialize. When we run the plan, we may receive an error message; running the command again can help, but the Helm provider requires an existing EKS cluster, so we cannot fake it with mock outputs. If we still want to run the plan command on some folders and exclude the addons, we can use Terragrunt's exclude-dir option. Alternatively, we can run the apply command, which will work because the addons module is applied only after the EKS cluster is provisioned. Overall,
we have three groups in this execution plan. Great! We have successfully set up the VPC, EKS cluster, and autoscaler in the staging
environment. To connect to the staging EKS cluster, just update the name of the cluster
in your Kubernetes config file. You can see that the cluster autoscaler is
running in the kube-system namespace. Now, let's check our clusters on the AWS console.
Since we used environment prefixes, we were able to create two independent environments in a
single AWS account. We now have two clusters, each with its own environment prefix. To destroy
both environments, we just need to run "destroy" on the top level. Terragrunt will destroy both the
development and staging environments at the same time. It will do so in reverse order, deleting
the helm charts, EKS clusters, and lastly, VPCs. If you check the AWS console, you will see
that all the clusters and VPCs have been removed. In this part of the tutorial, I'll teach
you how to use Terragrunt for real-world projects. We'll move our Terraform modules into a dedicated Git repository, save our remote state in an S3 bucket, and secure it with DynamoDB locking. Plus, we'll
use an IAM role to set up our infrastructure. First, we need to create an S3 bucket to save
the Terraform state. You can name it anything, but remember it must be globally unique. It's also
a good idea to turn on bucket versioning, so if something happens to your state, you can always
go back to an earlier version and recover it. Now, let's create a DynamoDB table
to lock the Terraform state. This helps avoid conflicts when several
team members try to run Terraform simultaneously. We'll name the table
"terraform-lock-table" and create a "LockID" partition key. That's all we need
to manage the remote state in our S3 bucket. Instead of giving users direct access,
we can create a dedicated IAM role for applying all infrastructure changes.
This role can be assumed by users or, even better, by automation tools like Jenkins. For now, let's give the role admin access, as
the specific permissions needed will depend on your infrastructure plans. We won't cover
IAM permissions in this tutorial. Choose "AdministratorAccess" and add it to the role,
then name it "terraform." By default, any user in the account can potentially assume this role.
We can limit this on the principal side if needed, but it's not required, as we'll need to explicitly
grant users permission to assume the role. In the next step, we'll create an IAM policy that
allows users to use the "terraform" role. First, copy the ARN of the role. Then, create
a new policy allowing users to use the "terraform" IAM role and name it "AllowTerraform." Following best practices, let's avoid attaching
the policy directly to users. Instead, create an IAM group called "devops" and add
the "AllowTerraform" policy to this group. For the demo, create a new IAM user
and add them to the "devops" group. This will allow the user to assume the
"terraform" role. Keep in mind that any user with Admin access in the account will
also be able to assume the "terraform" role. Next, generate security credentials for the user. We'll use these credentials to
create an AWS local profile. Download the credentials. Now you can use the "aws configure"
command to add a new profile for the user. Next, create a separate Git repository to store
your Terraform modules. You have two options: store all modules in the same Git repo or
create a separate Git repository for each Terraform module. With the latter option,
you might end up with many repositories, which can be difficult to maintain.
If you're just starting out or have a relatively small DevOps team, begin with a single
repository. Name it "infrastructure-modules." Make the repository private, add a README file,
and include a .gitignore file for Terraform. Now, clone the "infrastructure-modules" Git
repository and open it with a text editor. First, let's copy the VPC Terraform
module we created earlier. Add this module to the Git repo and commit the
changes. When using a single Git repository for multiple modules, create a Git tag specific
to this module so we can reference it later. If you need to make changes to the VPC module,
commit again and create a new VPC tag. Finally, push the Git tags to GitHub
or another remote Git server. So far, we have a single tag
in the GitHub repository. Next, let's copy the EKS module. Follow
the same workflow: add it to the repo, commit the changes, and create a new
Git tag specific to this EKS module. Finally, let's move the "kubernetes-addons"
module to the "infrastructure-modules" Git repo. Add it to the repo and create a new tag.
This is the last module we're going to add. Now we have three separate
Git tags related to specific modules. That's how you manage multiple
Terraform modules in a single repo. Next, create a new Git repository to store
the live state of our infrastructure. Make the repository private and select
"Terraform" for the .gitignore file. Clone this new repository as well and open it in
Visual Studio Code or your preferred text editor. First, create a root Terragrunt
file. In this case, we'll use a remote S3 bucket to store our state. Since
we're using an IAM role to apply changes, we need to provide the IAM role ARN and specify
the profile that can assume that role. Then, provide the S3 bucket and, finally,
the DynamoDB table to lock the state. Next, let's generate the AWS Terraform
provider. We'll also use the same role and AWS profile to run Terraform. Optionally,
you can give the provider a session name.
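Here's a hedged sketch of that root terragrunt.hcl; the bucket name, account ID, and profile name are placeholders you'd replace with your own values:

```hcl
remote_state {
  backend = "s3"

  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }

  config = {
    bucket         = "my-terraform-state-bucket" # must be globally unique
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-lock-table"

    # Assume the dedicated "terraform" role via a local AWS profile.
    role_arn = "arn:aws:iam::111111111111:role/terraform"
    profile  = "my-profile"
  }
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"

  contents = <<EOF
provider "aws" {
  region  = "us-east-1"
  profile = "my-profile"

  assume_role {
    role_arn     = "arn:aws:iam::111111111111:role/terraform"
    session_name = "terraform"
  }
}
EOF
}
```

Now, let's copy the development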
environment from the previous example and make a few adjustments,
as it won't work out-of-the-box. First, let's clean up by deleting previous
local state and lock files from all modules. Next, we need to update the source of the modules
to point to the remote Git repository and use tags to pin each module to a specific version. Let's
update the "kubernetes-addons" source first, and then the VPC module. This example is
taken from the Terragrunt Quick Start. Additionally, we must update the Helm provider
to use a token; I'll explain why later. We also need to ignore the Terragrunt cache. Now, switch to the development
environment and run "terragrunt apply." It will show the order in which the modules
will be applied and ask for confirmation. It seems that Terraform was updated and no longer accepts a plain "github.com" address as a module source. We need to update this in the code. Let's go through all three modules and add the "git::" schema prefix before each module source: the Kubernetes addons module, and finally the VPC module.
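For example, with a hypothetical organization name and tag, a module source would change to something like this:

```hcl
terraform {
  # "git::" schema, a double slash for the subfolder, and "ref" to pin the Git tag.
  source = "git::https://github.com/your-org/infrastructure-modules.git//vpc?ref=vpc-v1.0.0"
}
```

Run the command again, confirm that you want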
to update the state, and in a few minutes, Terraform should create the VPC, EKS
cluster, and deploy the auto-scaler. Now, you can check the S3 bucket
and find that there's a "dev" key and separate paths for each module:
VPC, EKS, and Kubernetes addons. Let's try connecting to the cluster
as we did in the previous example. We now encounter an error; even when using
the default AWS profile with admin access, we don't have permissions to access
Kubernetes. The problem is that only the IAM user or role used to create the
cluster has access to it. I have a separate video on how to add additional users and
IAM roles while following best practices. For now, let's create a new AWS profile for
the "terraform" IAM role. This indicates that the "anton" user will be used
to assume this "terraform" role. Next, update the Kubernetes context once again,
but this time use the "terraform" profile. Now, we can access the cluster. Let's
also check if the auto-scaler is running. To destroy all the infrastructure, just
run "destroy" in the "dev" environment. The next step is to add additional users and
learn how to use ArgoCD to deploy applications to Kubernetes. Thank you for watching,
and I'll see you in the next video.