Multi-Cloud Infrastructure as Code with Terraform: Lessons Learned
Best practices for managing infrastructure across AWS, GCP, and Azure using Terraform, including state management, modules, and CI/CD integration.
Introduction: Why Multi-Cloud with Terraform?
Managing infrastructure across AWS, Azure, and GCP by hand invites configuration drift, inconsistency, and outages. Terraform enables consistent, repeatable infrastructure deployment across all three clouds.
Why Terraform:
- Single tool, multiple clouds
- Infrastructure as Code (version controlled, reviewable)
- State management and drift detection
- Modular and reusable components
- Plan before apply (no surprises)
Common multi-cloud drivers:
1. Disaster Recovery: Primary in AWS, failover in Azure
2. Vendor Diversification: Avoid single-vendor lock-in
3. Cost Optimization: Use the cheapest region/service for each workload
4. Regulatory Compliance: Data residency requirements
5. Best-of-Breed: Use the best service from each cloud
Terraform maturity levels:
- Level 1: Manual state, no modules (1-2 engineers)
- Level 2: Remote state, basic modules (5-10 engineers)
- Level 3: Workspaces, CI/CD, governance (10-50 engineers)
- Level 4: Platform team, custom modules, policy as code (50+ engineers)
Production infrastructure managed: 2,000+ resources across 3 clouds, 15+ regions.
State Management
Terraform state is critical: it tracks your infrastructure and enables collaboration.
Remote State Best Practices:
- S3 for state storage (versioned, encrypted)
- DynamoDB for state locking (prevents concurrent modifications)
- Enable server-side encryption (SSE-S3 or KMS)
- Encrypt sensitive values (passwords, keys) with SOPS
- Separate state files per environment (dev/staging/prod) via workspaces or separate backends; see the partial-backend sketch after this list
- Never share state across unrelated infrastructure
- Enable S3 versioning (rollback capability)
- Take periodic state backups to a separate location
- Test the state recovery process
- Restrict state access (IAM policies)
- Separate read and write permissions
- Audit state access (CloudTrail)
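One way to implement the per-environment split is Terraform's partial backend configuration: the backend block omits the state key, and each environment supplies its own at init time. A minimal sketch, reusing the bucket and lock-table names from the examples below:

# backend.tf -- partial configuration: the key is supplied per environment
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}

# Each environment then selects its own state file at init time:
#   terraform init -backend-config="key=dev/terraform.tfstate"
#   terraform init -backend-config="key=prod/terraform.tfstate"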
Common State Issues:
Problem: State drift (Terraform state != actual infrastructure)
Solution: Run terraform plan -refresh-only (the successor to terraform refresh) regularly
Problem: State corruption
Solution: Use state locking, enable versioning, keep backups
Problem: Secrets in state
Solution: Pull secrets from AWS Secrets Manager or Vault at apply time rather than passing them in as Terraform variables; see the sketch below
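For the secrets problem, one option is to read the secret from Secrets Manager with a data source. A minimal sketch, where the secret name app/db-password and the database resource are illustrative; note the value still passes through state, which is why encrypting the backend matters:

# Fetch a database password from Secrets Manager instead of a variable
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "app/db-password" # illustrative secret name
}

resource "aws_db_instance" "app" {
  identifier        = "app-db"
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 50
  username          = "app"

  # The value never appears in .tf files or tfvars; it does still
  # land in state, so an encrypted backend remains essential.
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}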
# Backend configuration for multi-cloud state management
# backend.tf
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "production/multi-cloud/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"

    # Server-side encryption with KMS
    kms_key_id = "arn:aws:kms:us-east-1:123456789:key/..."

    # Note: versioning is not a backend argument; enable it on the
    # bucket itself (see aws_s3_bucket_versioning below).
  }
}
# State locking with DynamoDB
resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  tags = {
    Name        = "Terraform State Lock"
    Environment = "production"
  }
}
# State bucket with versioning and encryption
resource "aws_s3_bucket" "terraform_state" {
  bucket = "company-terraform-state"

  lifecycle {
    prevent_destroy = true # Protect state bucket from accidental deletion
  }

  tags = {
    Name        = "Terraform State"
    Environment = "production"
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
kms_master_key_id = aws_kms_key.terraform_state.arn
}
}
}
# Block public access
resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
# IAM policy for state access
resource "aws_iam_policy" "terraform_state_access" {
  name        = "TerraformStateAccess"
  description = "Policy for Terraform state operations"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:ListBucket",
          "s3:GetObject",
          "s3:PutObject"
        ]
        Resource = [
          aws_s3_bucket.terraform_state.arn,
          "${aws_s3_bucket.terraform_state.arn}/*"
        ]
      },
      {
        Effect = "Allow"
        Action = [
          "dynamodb:GetItem",
          "dynamodb:PutItem",
          "dynamodb:DeleteItem"
        ]
        Resource = aws_dynamodb_table.terraform_lock.arn
      }
    ]
  })
}
Reusable Modules for Multi-Cloud
Modules are the key to maintainable multi-cloud infrastructure. Build once, deploy everywhere.
Module Structure Best Practices:
Interface design:
- Define a common interface across clouds
- Hide cloud-specific details
- Use consistent naming conventions
- Validate inputs with variable validation blocks
- Provide sensible defaults
- Document all variables
Outputs:
- Return consistent outputs (IDs, endpoints, etc.)
- Include everything dependent modules need
- Use descriptive output names
Versioning:
- Pin module versions in production
- Use semantic versioning
- Test upgrades in lower environments
Module Organization:
modules/
├── compute/
│   ├── aws/          (AWS EC2 implementation)
│   ├── azure/        (Azure VM implementation)
│   ├── gcp/          (GCP Compute Engine implementation)
│   └── interface.tf  (Common interface)
├── database/
│   ├── aws/          (RDS)
│   ├── azure/        (Azure SQL)
│   └── gcp/          (Cloud SQL)
└── networking/
    ├── aws/          (VPC)
    ├── azure/        (VNet)
    └── gcp/          (VPC)
Example: Multi-Cloud Compute Module
# modules/compute/interface.tf
# Common interface for compute resources across clouds
variable "cloud_provider" {
  description = "Cloud provider (aws, azure, gcp)"
  type        = string

  validation {
    condition     = contains(["aws", "azure", "gcp"], var.cloud_provider)
    error_message = "Provider must be aws, azure, or gcp."
  }
}

variable "instance_size" {
  description = "Instance size (small, medium, large)"
  type        = string
  default     = "medium"

  validation {
    condition     = contains(["small", "medium", "large"], var.instance_size)
    error_message = "Size must be small, medium, or large."
  }
}

variable "environment" {
  description = "Environment (dev, staging, prod)"
  type        = string
}
# modules/compute/aws/main.tf
# AWS-specific implementation
locals {
  instance_types = {
    small  = "t3.medium"
    medium = "t3.large"
    large  = "t3.xlarge"
  }
}

# Resolve the latest Ubuntu 20.04 LTS AMI published by Canonical
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
  }
}

resource "aws_instance" "app" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = local.instance_types[var.instance_size]

  tags = {
    Name        = "${var.environment}-app-server"
    Environment = var.environment
    ManagedBy   = "terraform"
  }

  root_block_device {
    volume_type = "gp3"
    volume_size = 50
    encrypted   = true
  }

  metadata_options {
    http_tokens = "required" # Enforce IMDSv2
  }
}

output "instance_id" {
  value = aws_instance.app.id
}

output "public_ip" {
  value = aws_instance.app.public_ip
}
# modules/compute/azure/main.tf
# Azure-specific implementation
locals {
  vm_sizes = {
    small  = "Standard_B2s"
    medium = "Standard_D2s_v3"
    large  = "Standard_D4s_v3"
  }
}

resource "azurerm_linux_virtual_machine" "app" {
  name                = "${var.environment}-app-vm"
  resource_group_name = var.resource_group_name
  location            = var.location
  size                = local.vm_sizes[var.instance_size]
  admin_username      = "adminuser"

  # A NIC is required; assumed to be created elsewhere and passed in
  network_interface_ids = [var.network_interface_id]

  admin_ssh_key {
    username   = "adminuser"
    public_key = var.ssh_public_key
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Premium_LRS"
    disk_size_gb         = 50
  }

  # Ubuntu 20.04 LTS (the legacy UbuntuServer offer has no 20.04 SKU)
  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-focal"
    sku       = "20_04-lts"
    version   = "latest"
  }

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

output "instance_id" {
  value = azurerm_linux_virtual_machine.app.id
}

output "public_ip" {
  value = azurerm_linux_virtual_machine.app.public_ip_address
}
# Root module usage - environment/prod/main.tf
# The cloud is implied by each module's source path, so no
# cloud_provider argument is needed here.
module "compute_aws" {
  source        = "../../modules/compute/aws"
  instance_size = "large"
  environment   = "production"
}

module "compute_azure" {
  source               = "../../modules/compute/azure"
  instance_size        = "large"
  environment          = "production"
  resource_group_name  = azurerm_resource_group.prod.name
  location             = "eastus"
  ssh_public_key       = file("~/.ssh/id_rsa.pub")        # illustrative path
  network_interface_id = azurerm_network_interface.app.id # defined elsewhere
}
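Because Terraform module sources must be static strings, a wrapper module cannot switch source on var.cloud_provider directly. A common workaround is to instantiate each implementation conditionally with count and normalize the outputs; a sketch under that assumption (variables come from interface.tf above):

# modules/compute/main.tf -- dispatch on the interface's cloud_provider
module "aws" {
  count         = var.cloud_provider == "aws" ? 1 : 0
  source        = "./aws"
  instance_size = var.instance_size
  environment   = var.environment
}

module "azure" {
  count                = var.cloud_provider == "azure" ? 1 : 0
  source               = "./azure"
  instance_size        = var.instance_size
  environment          = var.environment
  resource_group_name  = var.resource_group_name
  location             = var.location
  ssh_public_key       = var.ssh_public_key
  network_interface_id = var.network_interface_id
}

# Callers see one output regardless of the cloud chosen
# (coalesce errors if neither module was instantiated)
output "instance_id" {
  value = coalesce(
    one(module.aws[*].instance_id),
    one(module.azure[*].instance_id)
  )
}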
CI/CD Pipeline for Terraform
Automated Terraform workflows with GitHub Actions ensure consistent, safe deployments.
CI/CD Best Practices:
Pipeline stages (every pull request):
- terraform fmt check (code formatting)
- terraform validate (syntax validation)
- Security scan (Checkov, tfsec)
- Cost estimation (Infracost)
- terraform plan (preview changes)
- Comment plan output on the PR
Merge and deploy:
- Require PR approval (2+ reviewers)
- Re-run plan automatically on merge
- Manual approval before apply
- terraform apply on approval
- Notify on Slack/Teams
Security gates:
- Checkov: policy-as-code validation
- tfsec: security best practices
- Prevent merges on critical findings
Cost gates:
- Infracost: estimate costs before apply
- Alert on >20% cost increase
- Require approval for changes >$1K/month
Production Pipeline Example:
# .github/workflows/terraform.yml
name: 'Terraform CI/CD'

on:
  pull_request:
    paths:
      - 'terraform/**'
      - '.github/workflows/terraform.yml'
  push:
    branches:
      - main
    paths:
      - 'terraform/**'
env:
  TF_VERSION: '1.6.0'
  AWS_REGION: 'us-east-1'

# Required for OIDC role assumption and PR comments
permissions:
  id-token: write
  contents: read
  pull-requests: write

jobs:
  terraform-validate:
    name: 'Validate and Plan'
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: terraform/production

    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: ${{ secrets.AWS_TERRAFORM_ROLE_ARN }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Terraform Format Check
        id: fmt
        run: terraform fmt -check -recursive
        continue-on-error: true

      - name: Terraform Init
        id: init
        run: terraform init

      - name: Terraform Validate
        id: validate
        run: terraform validate -no-color

      - name: Run Checkov Security Scan
        id: checkov
        uses: bridgecrewio/checkov-action@master
        with:
          directory: terraform/production
          framework: terraform
          output_format: cli
          soft_fail: false # Fail on security issues

      - name: Run tfsec
        uses: aquasecurity/tfsec-action@v1.0.0
        with:
          working_directory: terraform/production
      - name: Terraform Plan
        id: plan
        run: |
          terraform plan -no-color -out=tfplan
          # Save a human-readable copy for the PR comment step
          terraform show -no-color tfplan > tfplan.txt
        continue-on-error: true
      - name: Setup Infracost
        uses: infracost/actions/setup@v2
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}

      - name: Generate Cost Estimate
        id: cost
        run: |
          # Infracost reads the plan in JSON form, not the binary plan file
          terraform show -json tfplan > /tmp/plan.json
          infracost breakdown --path /tmp/plan.json --format json --out-file /tmp/cost.json
          infracost output --path /tmp/cost.json --format github-comment --out-file /tmp/cost_comment.md
      - name: Comment PR with Plan
        uses: actions/github-script@v6
        if: github.event_name == 'pull_request'
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('terraform/production/tfplan.txt', 'utf8');
            const cost = fs.readFileSync('/tmp/cost_comment.md', 'utf8');
            const output = `#### Terraform Format and Style 🖌 \`${{ steps.fmt.outcome }}\`
            #### Terraform Initialization ⚙️ \`${{ steps.init.outcome }}\`
            #### Terraform Validation 🤖 \`${{ steps.validate.outcome }}\`
            #### Terraform Plan 📖 \`${{ steps.plan.outcome }}\`

            <details><summary>Show Plan</summary>

            \`\`\`terraform
            ${plan}
            \`\`\`

            </details>

            ${cost}

            *Pusher: @${{ github.actor }}, Action: \`${{ github.event_name }}\`, Workflow: \`${{ github.workflow }}\`*`;

            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });

      # Plan ran with continue-on-error so the comment is always posted;
      # fail the job here if the plan itself failed
      - name: Plan Status
        if: steps.plan.outcome == 'failure'
        run: exit 1
  terraform-apply:
    name: 'Apply Changes'
    runs-on: ubuntu-latest
    needs: terraform-validate
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    environment: production # Requires manual approval in GitHub

    defaults:
      run:
        working-directory: terraform/production

    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: ${{ secrets.AWS_TERRAFORM_ROLE_ARN }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Terraform Init
        run: terraform init
      - name: Terraform Apply
        id: apply
        run: |
          set -o pipefail # tee must not mask a failed apply
          terraform apply -auto-approve -no-color | tee apply.log
          echo "APPLY_OUTPUT<<EOF" >> $GITHUB_ENV
          cat apply.log >> $GITHUB_ENV
          echo "EOF" >> $GITHUB_ENV
      - name: Notify Slack on Success
        if: success()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "✅ Terraform apply succeeded in production",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "*Terraform Apply Successful*\nEnvironment: Production\nCommit: ${{ github.sha }}\nActor: @${{ github.actor }}"
                  }
                }
              ]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
      - name: Notify Slack on Failure
        if: failure()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "❌ Terraform apply failed in production",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "*Terraform Apply Failed*\nEnvironment: Production\nCommit: ${{ github.sha }}\nActor: @${{ github.actor }}\nWorkflow: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
                  }
                }
              ]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
Multi-Cloud Networking and Security
Consistent networking and security policies across AWS, Azure, and GCP.
Network Architecture Patterns:
Hub-and-spoke topology (see the Transit Gateway sketch after this list):
- Central hub VPC/VNet for shared services
- Spoke VPCs/VNets for applications
- Transit Gateway (AWS) / Virtual WAN (Azure) / Network Connectivity Center (GCP)
Network zones:
- Public zone: internet-facing resources
- Private zone: application tier
- Data zone: databases, sensitive data
- Management zone: admin access, monitoring
Cross-cloud connectivity:
- VPN tunnels for site-to-site links
- Direct Connect / ExpressRoute / Interconnect for dedicated connectivity
- Cloud Router for BGP peering
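As a concrete example of hub-and-spoke on the AWS side, a Transit Gateway can act as the hub with each spoke VPC attached to it. A minimal sketch, assuming the spoke VPC, subnets, and route table are defined elsewhere (names and CIDRs are illustrative):

# Hub: one Transit Gateway shared by all spokes
resource "aws_ec2_transit_gateway" "hub" {
  description                     = "Multi-cloud hub"
  default_route_table_association = "enable"
  default_route_table_propagation = "enable"

  tags = {
    Name      = "hub-tgw"
    ManagedBy = "terraform"
  }
}

# Spoke: attach an application VPC to the hub
resource "aws_ec2_transit_gateway_vpc_attachment" "app_spoke" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  vpc_id             = aws_vpc.app.id       # spoke VPC defined elsewhere
  subnet_ids         = aws_subnet.app[*].id # one subnet per AZ
}

# Route spoke traffic for other private networks through the hub
resource "aws_route" "to_hub" {
  route_table_id         = aws_route_table.app.id
  destination_cidr_block = "10.0.0.0/8" # illustrative summary route
  transit_gateway_id     = aws_ec2_transit_gateway.hub.id
}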
Security Best Practices:
Network segmentation:
- Separate subnets per tier (web, app, data); see the security-group sketch after this list
- Security groups / NSGs / firewall rules
- Zero-trust architecture
Encryption:
- TLS for data in transit
- KMS / Key Vault / Cloud KMS for data at rest
- Automatic key rotation
Identity:
- IAM roles (least privilege)
- Service accounts (no long-lived credentials)
- MFA enforcement for human access
Audit:
- Flow logs for all networks
- CloudTrail / Activity Log / Cloud Audit Logs
- SIEM integration (Splunk, Datadog, ELK)
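On AWS, the tiered model translates naturally into security groups that reference each other, so each tier accepts traffic only from the tier in front of it. A minimal sketch (ports, names, and the aws_vpc.app reference are illustrative):

# Web tier: accepts HTTPS from the internet
resource "aws_security_group" "web" {
  name   = "web-tier"
  vpc_id = aws_vpc.app.id # assumed to exist

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# App tier: accepts traffic only from the web tier's security group
resource "aws_security_group" "app" {
  name   = "app-tier"
  vpc_id = aws_vpc.app.id

  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.web.id]
  }
}

# Data tier: accepts Postgres only from the app tier
resource "aws_security_group" "data" {
  name   = "data-tier"
  vpc_id = aws_vpc.app.id

  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]
  }
}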
Production scale:
- 15 regions across AWS, Azure, and GCP
- 50+ VPCs/VNets globally
- 99.99% uptime SLA
- Sub-50ms latency between clouds