Infrastructure as Code: Terraform & CloudFormation
Why declarative IaC exists, how Terraform's plan/apply/state workflow operates, what the state file is and why losing it is catastrophic, Terraform vs CloudFormation vs Pulumi trade-offs, and the module and remote state patterns used in production.
What you'll learn
- IaC enables PR review for infrastructure changes, reproducible environments, and drift detection — console clicks have none of these
- The Terraform state file maps HCL resources to actual cloud resources; losing it does not destroy infrastructure but makes it unmanageable
- Always use remote state (S3 + DynamoDB) with versioning and locking for any team-based Terraform usage
- Use terraform plan -out=tfplan to save the plan, then terraform apply tfplan to prevent plan drift between CI stages
- Split large state files by domain (networking, compute, database) to limit blast radius and serialisation of IaC operations
Why IaC Exists and What It Actually Solves
Infrastructure as Code is the practice of defining and managing cloud resources through version-controlled configuration files rather than manual console clicks or one-off CLI commands. The core problem it solves is not laziness — it is drift, repeatability, and review.
Problems IaC solves that manual operations cannot
- Drift: production diverges from documentation. A developer adds an S3 bucket in the console during an incident and forgets to document it. Three months later, a security audit finds an unmanaged bucket with sensitive data. IaC narrows this gap: out-of-band changes to managed resources show up as drift in the next plan, and when every legitimate resource is defined in code, anything that is not in code stands out in an account inventory. Terraform does not discover unmanaged resources by itself, so the inventory step still matters.
- Repeatability: identical environments are impossible manually. Creating a dev environment that perfectly mirrors production requires either perfect documentation (never exists) or a script (which is IaC without the state management). Running terraform apply on the same configuration, with the same inputs and pinned provider and module versions, produces equivalent environments every time.
- Review: infrastructure changes go through pull requests. A Terraform PR shows exactly what will be created, modified, or destroyed. Security engineers, architects, and teammates review the change before it touches production. Console clicks have no review trail.
- Disaster recovery: recreate everything from code. If your production account is compromised or accidentally deleted, IaC lets you rebuild the infrastructure in hours (the data itself still has to come from backups). Without it, you are looking at days of console work from memory and screenshots.
Declarative vs imperative IaC
Terraform, CloudFormation, and Pulumi (whose programs build a declarative resource graph) describe what you want: the desired state. The tool computes the diff against the current state and executes only the necessary changes. Shell scripts, and Ansible playbooks to the extent that they are written as ordered steps, describe how to get there (imperative). A declarative apply is idempotent: if nothing has changed, running it a second time makes no changes. Imperative scripts usually are not: running "aws ec2 run-instances" twice launches two sets of instances, and running "aws ec2 create-security-group" twice fails on the second call because the group name already exists. Declarative is strongly preferred for infrastructure definitions.
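A minimal sketch of the declarative style, assuming an existing VPC whose ID is passed in. The names `web`, `web-sg`, and `vpc_id` are hypothetical and chosen only for illustration.

```hcl
# Hypothetical input: the ID of a VPC that already exists.
variable "vpc_id" {
  type        = string
  description = "ID of an existing VPC (assumed to exist)"
}

# Desired state: one security group that allows inbound HTTPS.
resource "aws_security_group" "web" {
  name        = "web-sg"
  description = "Allow inbound HTTPS"
  vpc_id      = var.vpc_id

  ingress {
    description = "HTTPS from anywhere"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    description = "All outbound traffic"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

The first terraform apply creates the group. A second apply with no edits to the file and no out-of-band changes should report that no changes are needed, because the recorded state already matches the configuration.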
Terraform: Plan, Apply, State
Terraform's workflow has three phases: write configuration (HCL), run terraform plan to see what will change, run terraform apply to make it so. The state file is what enables the plan to know the difference between "this resource needs to be created" and "this resource already exists and needs to be updated."
The Terraform workflow in a team environment
1. Write HCL configuration defining resources. Commit to a feature branch.
2. Open a pull request. CI runs terraform fmt (formatting) and terraform validate (syntax check).
3. CI runs terraform plan against a remote state backend. The plan output is posted as a PR comment showing exact changes.
4. Team reviews the plan. Security team checks for overly permissive IAM policies or public S3 buckets.
5. After approval, CI runs terraform apply using the reviewed plan. State file is updated automatically.
6. Monitor the apply output and confirm resources are created correctly. Tag the release.
The state file is the source of truth — treat it like a production database
Terraform state (terraform.tfstate) records the mapping between your HCL resource definitions and the actual cloud resources they manage. If the state file is deleted, Terraform no longer knows which resources it manages. Running terraform apply after a state deletion can create duplicate resources, or worse — Terraform may try to create resources that already exist and fail midway, leaving the environment in an inconsistent state. The state file can also contain plaintext secrets (database passwords, API keys) as values of sensitive resource attributes. Never store state in a local file in a team environment. Never commit state to git.
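One way to bring an existing resource back under management, whether after lost state or after an emergency console fix, is an import block. This is a sketch that assumes Terraform 1.5 or later; the bucket name is a placeholder.

```hcl
# Requires Terraform 1.5 or later (import blocks).
# Tells Terraform that the existing bucket corresponds to the resource
# address below, so the next apply records it in state instead of
# trying to create it.
import {
  to = aws_s3_bucket.my_bucket
  id = "my-existing-bucket-name" # placeholder: the real bucket name
}

resource "aws_s3_bucket" "my_bucket" {
  bucket = "my-existing-bucket-name"
}
```

Running terraform plan with this in place should list the bucket as a resource to import rather than to create. Once the import has been applied, the import block can be deleted.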
Remote state backend requirements for teams
- Centralised storage: S3, Terraform Cloud, or GCS. State must be accessible to all CI pipelines and team members. S3 with versioning enabled is the standard choice on AWS. Versioning lets you recover from a corrupted state by rolling back to a previous version.
- State locking: DynamoDB (for the S3 backend) or native locking (Terraform Cloud). Locking prevents two concurrent applies from modifying state at the same time. Without it, two CI pipelines running apply simultaneously on the same workspace can overwrite each other's writes and corrupt the state file. Newer Terraform releases (1.10 and later) can also lock natively in S3 through the backend's use_lockfile setting.
- Encryption: S3 SSE or Terraform Cloud encryption. State files can contain plaintext resource attributes and outputs, including database passwords, private keys, and connection strings. The S3 bucket must have server-side encryption and strict IAM access policies.
- Separate state per environment. Never share state between dev, staging, and production. A plan/apply mistake in dev that corrupts state should not affect production. Use separate S3 keys (dev/terraform.tfstate, prod/terraform.tfstate) or separate Terraform Cloud workspaces.
```hcl
# terraform/main.tf: example remote state backend configuration
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  # Always use a remote backend in team environments; local state causes
  # corruption and secrets leakage.
  backend "s3" {
    bucket  = "my-company-terraform-state"
    key     = "production/vpc/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true # SSE-S3 encryption
    # The DynamoDB table prevents concurrent applies from corrupting state.
    dynamodb_table = "terraform-state-locks" # DynamoDB for state locking
    # Never hardcode credentials here; use an IAM role or env vars
  }
}

# Example VPC resource using a module
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.1.0"

  name = "production-vpc"
  cidr = "10.0.0.0/16"
  azs  = ["us-east-1a", "us-east-1b", "us-east-1c"]

  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway = true
  # single_nat_gateway = false is more expensive but required for
  # AZ-resilient private subnet egress.
  single_nat_gateway   = false # One NAT per AZ for resilience
  enable_dns_hostnames = true

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

# View current state without applying
# terraform state list
# terraform state show module.vpc.aws_vpc.this[0]

# Import a resource created outside Terraform (emergency console fix)
# terraform import aws_s3_bucket.my_bucket my-existing-bucket-name

# Detect drift between state and actual infrastructure
# terraform plan -refresh-only
```
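The backend block above assumes the state bucket and lock table already exist. The sketch below shows how they might be provisioned, typically from a small separate bootstrap configuration with its own state. The resource labels `tf_state` and `tf_locks` are hypothetical; the bucket and table names are taken from the example above.

```hcl
# Bootstrap sketch: the bucket and table that the S3 backend points at.
resource "aws_s3_bucket" "tf_state" {
  bucket = "my-company-terraform-state"

  lifecycle {
    prevent_destroy = true # refuse any plan that would delete the bucket
  }
}

# Versioning allows rolling back to an earlier state file after corruption.
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Server-side encryption (SSE-S3) for state objects at rest.
resource "aws_s3_bucket_server_side_encryption_configuration" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# State must never be publicly readable.
resource "aws_s3_bucket_public_access_block" "tf_state" {
  bucket                  = aws_s3_bucket.tf_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Lock table: the S3 backend expects a string partition key named "LockID".
resource "aws_dynamodb_table" "tf_locks" {
  name         = "terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Keeping these resources in their own small configuration avoids the circular problem of a backend whose storage is managed by the state it stores.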
Terraform vs CloudFormation vs Pulumi
The IaC ecosystem has three primary tools, each with different design philosophies, ecosystem maturity, and operational trade-offs. Knowing when to use each — and what pain points each brings — is essential knowledge for cloud engineers.
| Dimension | Terraform (HCL) | CloudFormation (YAML/JSON) | Pulumi (Python/TypeScript/Go) |
|---|---|---|---|
| Language | HCL (domain-specific) | YAML / JSON | General-purpose (Python, TS, Go, C#) |
| Multi-cloud | Yes (900+ providers) | AWS only | Yes (major providers) |
| State management | External (S3 + DynamoDB or TF Cloud) | AWS-managed (no state file to operate) | Pulumi Cloud or self-managed |
| Drift detection | terraform plan -refresh-only | CloudFormation Drift Detection | pulumi refresh |
| Destroy protection | prevent_destroy lifecycle rule | DeletionPolicy: Retain | protect option on resource |
| Nested / modular | Modules (local and registry) | Nested stacks | Stacks + component resources |
| Testing | Terratest, terraform test (1.6+) | cfn-lint, taskcat | Pulumi testing SDK |
| Secrets handling | sensitive flag redacts CLI output; values remain plaintext in state | Dynamic references to SSM / Secrets Manager | Secrets encrypted in state |
| Learning curve | Low (HCL is simple) | Medium (YAML is verbose) | Low for existing programmers, high for ops |
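For the Terraform column, the destroy-protection and secrets-handling rows can be sketched as follows. The identifiers (`app`, `app-db`, `db_password`) and the instance sizing are hypothetical values picked for illustration.

```hcl
# Marked sensitive: the value is redacted in plan and apply output,
# but it is still written to the state file in plaintext.
variable "db_password" {
  type      = string
  sensitive = true
}

resource "aws_db_instance" "app" {
  identifier                = "app-db"
  engine                    = "postgres"
  instance_class            = "db.t3.micro"
  allocated_storage         = 20
  username                  = "app"
  password                  = var.db_password
  final_snapshot_identifier = "app-db-final"

  lifecycle {
    # Any plan that would destroy this resource fails with an error
    # until this flag is removed from the configuration.
    prevent_destroy = true
  }
}

output "db_endpoint" {
  value = aws_db_instance.app.endpoint
}
```

Because the password still ends up in state, the `sensitive` flag is a display safeguard only and does not replace encrypting and restricting access to the state backend.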
Choose Terraform for multi-cloud or when the team knows cloud infrastructure; CloudFormation when you need native AWS integrations
Terraform wins on multi-cloud and community module ecosystem. CloudFormation wins when you need native AWS integrations (StackSets for multi-account, Service Catalog, CloudFormation Hooks for policy enforcement) and AWS-managed state. Pulumi wins when your team is composed of developers who want to write real code with conditionals, loops, and unit tests rather than declarative configuration. For a team new to IaC, Terraform is the safest default due to the richest ecosystem and community.
Always use terraform plan -out to prevent plan drift between CI and apply
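Running terraform plan and terraform apply as separate CI steps without a saved plan means apply computes a fresh plan, which can differ from the one that was reviewed if the infrastructure or the code changed in between. Use terraform plan -out=tfplan to save the exact plan, then terraform apply tfplan to execute exactly that plan. If the state has changed since the plan was saved, Terraform rejects the plan as stale instead of applying it. Treat the plan file as sensitive, since it can contain the same secret values as state.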
Module Patterns and Organisation at Scale
As an IaC codebase grows, unstructured HCL becomes just as hard to maintain as unstructured code. Module patterns, workspace strategies, and the monorepo vs repo-per-service question significantly affect how productive your team is with Terraform over time.
IaC organisation patterns in production
- Root module structure: separate by environment and service. For example: terraform/environments/prod/vpc/, terraform/environments/prod/ecs/, terraform/environments/staging/vpc/. Each directory is an independent Terraform root module with its own state. This prevents a single large state file from becoming a blast-radius problem.
- Shared modules: reuse with versioning via registry or git tags. For example: module "vpc" { source = "git::https://github.com/company/terraform-modules.git//vpc?ref=v2.1.0" }. Pin to a specific version tag. Updating a shared module should trigger a plan review in all consuming root modules before merging.
- Data sources for cross-state references (not outputs shared directly). The networking module outputs its VPC ID. The application module uses a terraform_remote_state data source to read it: data.terraform_remote_state.vpc.outputs.vpc_id. This creates a dependency between state files without coupling their apply lifecycle. Note that the consumer needs read access to the whole networking state snapshot, not just the one output.
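The cross-state reference pattern can be sketched like this, assuming the networking root module stores its state at the key used in the backend example earlier and exposes a `vpc_id` output. The `app` security group is a hypothetical consumer.

```hcl
# In the networking root module (separate directory and state), an output
# like the following is assumed to exist:
#
#   output "vpc_id" {
#     value = module.vpc.vpc_id
#   }

# In the application root module: read the networking state, read-only.
data "terraform_remote_state" "vpc" {
  backend = "s3"

  config = {
    bucket = "my-company-terraform-state"
    key    = "production/vpc/terraform.tfstate"
    region = "us-east-1"
  }
}

# Consume the output without managing the VPC from this state.
resource "aws_security_group" "app" {
  name        = "app-sg"
  description = "Application security group"
  vpc_id      = data.terraform_remote_state.vpc.outputs.vpc_id
}
```

Only root-level outputs of the networking configuration are visible this way, so anything the application needs has to be exported there explicitly.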
Large state files are a reliability risk — split them early
A single Terraform state file managing 500 resources has two problems: (1) plan and apply both take a lock on the entire state by default, so IaC operations across the team are serialised, and every run has to refresh all 500 resources; (2) a corrupted or lost state file leaves all 500 resources unmanaged at once. A commonly cited rule of thumb is to keep each state file to roughly 100–150 resources. Split by logical domain: networking state, ECS cluster state, database state, IAM state.
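One possible way to move a resource out of a large state without recreating it is to pair a removed block in the old root module with an import block in the new one. This is a sketch under assumptions: Terraform 1.7 or later, and a hypothetical database instance at address `aws_db_instance.app` with identifier `app-db`. The two fragments belong in different directories, and the resource block itself moves unchanged into the new root module.

```hcl
# --- Old (large) root module -------------------------------------------
# The resource block for aws_db_instance.app is deleted from this
# configuration and replaced by a removed block.
# Requires Terraform 1.7 or later.
removed {
  from = aws_db_instance.app

  lifecycle {
    destroy = false # drop it from this state only; keep the real database
  }
}

# --- New database root module (its own directory and state key) --------
# The original resource block for aws_db_instance.app is pasted here
# unchanged, alongside this import block.
# Requires Terraform 1.5 or later.
import {
  to = aws_db_instance.app
  id = "app-db" # hypothetical DB instance identifier
}
```

Before applying either side, terraform plan in the old root should show the resource being forgotten rather than destroyed, and terraform plan in the new root should show an import with nothing to create.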
How this might come up in interviews
Cloud engineer, DevOps engineer, and platform engineer interviews universally ask about IaC. Both conceptual questions ("what is state?") and operational scenarios ("apply failed halfway — now what?") are common. Senior roles expect module design and team workflow patterns.
Common questions:
- What is the Terraform state file and what happens if you lose it?
- Explain the difference between terraform plan and terraform apply. Why does the -out flag matter?
- How would you manage Terraform state in a team of 20 engineers working on the same infrastructure?
- What is the difference between Terraform and CloudFormation? When would you choose one over the other?
- A terraform apply failed halfway through. How do you determine the state of the infrastructure and recover?
Clarifying questions to ask: Is there an existing IaC codebase or is this greenfield? How many engineers will be applying Terraform? Is the workload multi-cloud or AWS-only? Are there compliance requirements for audit logging of infrastructure changes?
Strong answer: Mentions S3 versioning and DynamoDB locking unprompted when describing remote state. Explains the -out flag for plan/apply separation. Talks about splitting state files by domain for blast radius reduction. Mentions terraform state list and terraform import for state recovery scenarios.
Red flags: Does not know what the state file is. Suggests storing state in git. Cannot explain what happens when apply fails partway through. Treats Terraform and Ansible as equivalent tools (they solve different problems).
Key takeaways
- IaC enables PR review for infrastructure changes, reproducible environments, and drift detection — console clicks have none of these
- The Terraform state file maps HCL resources to actual cloud resources; losing it does not destroy infrastructure but makes it unmanageable
- Always use remote state (S3 + DynamoDB) with versioning and locking for any team-based Terraform usage
- Use terraform plan -out=tfplan to save the plan, then terraform apply tfplan to prevent plan drift between CI stages
- Split large state files by domain (networking, compute, database) to limit blast radius and serialisation of IaC operations
💡 Analogy
Infrastructure as Code is the architectural blueprint for your building. Terraform is the construction crew that reads the blueprint and builds what it specifies. The state file is the "as-built drawing" — a record of what was actually constructed, including which bolts were used, where every pipe runs, and what deviates from the original plan. Without the as-built drawing, even the original architect cannot safely renovate the building, because they do not know if what was built matches what was designed. Lose the as-built drawing and the crew treats the building as unmeasured — they cannot safely add or remove anything without risking structural damage.
⚡ Core Idea
Terraform computes the difference between desired state (HCL) and current state (state file + cloud API), then executes only the changes needed to reconcile them. The state file is the bridge between code and reality. Without it, Terraform cannot know which resources it manages.
🎯 Why It Matters
IaC is non-negotiable at any serious scale: it enables PR review for infrastructure, reproducible environments, disaster recovery, and drift detection. But the state file creates a new category of operational risk — it is a security-sensitive, business-critical file that must be stored securely, backed up, locked against concurrent access, and never deleted. Understanding the plan/apply/state workflow and its failure modes separates engineers who use Terraform as a productivity tool from those who create new categories of outage with it.