
Production Checklist

A step-by-step checklist for hardening your evm-cloud deployment before running in production.


Remote State Backend

Terraform state contains all variable values, including secrets, in plaintext. Never use local state for production.
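Because everything in the state is readable, it is worth checking what would leak before granting anyone read access to the backend. A minimal sketch -- the attribute-name pattern is illustrative, not exhaustive:

```shell
# Scan state JSON (on stdin) for attribute names that suggest secrets.
# Extend the pattern to match your own variable names.
scan_state_for_secrets() {
  grep -Eio '"[a-z_]*(password|secret|token)[a-z_]*"[[:space:]]*:[[:space:]]*"[^"]*"'
}
# Usage, on a machine with state access:
#   terraform state pull | scan_state_for_secrets
```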

S3 + KMS + DynamoDB

# backend.tf (add to your example directory)
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "evm-cloud/production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    kms_key_id     = "alias/terraform-state"
    dynamodb_table = "terraform-locks"
  }
}

Create the prerequisite resources:

# S3 bucket with versioning
aws s3api create-bucket --bucket myorg-terraform-state --region us-east-1
aws s3api put-bucket-versioning --bucket myorg-terraform-state \
  --versioning-configuration Status=Enabled
 
# KMS key for encryption
aws kms create-alias --alias-name alias/terraform-state \
  --target-key-id $(aws kms create-key --query KeyMetadata.KeyId --output text)
 
# DynamoDB table for state locking
aws dynamodb create-table \
  --table-name terraform-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST

After adding the backend configuration, migrate existing state:

terraform init -migrate-state

Environment Isolation

Use separate state files (or separate directories) for each environment. Never share Terraform state across dev/staging/production.

Option A: Separate directories

environments/
  dev/
    main.tf          # source = "../../"
    dev.tfvars
    secrets.auto.tfvars
    backend.tf       # key = "evm-cloud/dev/terraform.tfstate"
  staging/
    main.tf
    staging.tfvars
    secrets.auto.tfvars
    backend.tf       # key = "evm-cloud/staging/terraform.tfstate"
  production/
    main.tf
    production.tfvars
    secrets.auto.tfvars
    backend.tf       # key = "evm-cloud/production/terraform.tfstate"

Option B: Terraform workspaces

terraform workspace new production
terraform workspace select production
terraform apply -var-file=production.tfvars

Option A is recommended -- it provides stronger isolation, and because each directory pins its own backend key and tfvars, you cannot accidentally apply a dev config against production state.
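With separate directories, a small wrapper keeps the environment choice explicit in CI. A sketch assuming the layout above (deploy_env is a hypothetical helper, not part of evm-cloud):

```shell
# Apply a single environment from its own directory; refuses unknown names.
deploy_env() {
  env="$1"
  if [ ! -d "environments/$env" ]; then
    echo "unknown environment: $env" >&2
    return 1
  fi
  (
    cd "environments/$env" || return 1
    terraform init -input=false
    terraform apply -var-file="$env.tfvars"
  )
}
# Usage: deploy_env production
```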


CI/CD Gates

On every pull request

Run make qa to catch formatting issues, validation errors, linting violations, and security misconfigurations:

make qa
# Runs: fmt-check + validate + lint (tflint) + security (checkov)

Before merge

Run make verify to validate all examples plan successfully:

make verify
# Runs: qa + plan for all examples against LocalStack

Plan-then-apply workflow

Never run terraform apply directly in CI. Generate a plan artifact, review it, then apply the specific plan file:

# CI: generate plan
terraform plan -var-file=production.tfvars -out=tfplan
# Upload tfplan as a build artifact
 
# After human review: apply the exact plan
terraform apply tfplan

This ensures the applied changes match exactly what was reviewed. No surprises from concurrent state changes.
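One way to implement the gate in CI is terraform plan's -detailed-exitcode flag, which exits 0 when there are no changes, 1 on error, and 2 when changes are pending. A sketch:

```shell
# Generate a plan and branch on its exit status; only a "changes pending"
# result should produce an artifact for human review.
plan_and_gate() {
  status=0
  terraform plan -var-file=production.tfvars -out=tfplan -detailed-exitcode || status=$?
  case $status in
    0) echo "no changes; nothing to apply" ;;
    2) echo "changes detected; upload tfplan for review" ;;
    *) echo "plan failed (exit $status)" >&2; return 1 ;;
  esac
}
```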


Instance Sizing

Use t3.medium or larger for production. The default, t3.medium (4 GB RAM), fits eRPC (1 GB) + rindexer (2 GB) + the OS (1 GB) with no headroom -- size up if you expect growth or backfills.

| Workload | Recommended Instance | Memory Config |
| --- | --- | --- |
| Single chain, few contracts | t3.medium (4 GB) | rindexer 2g, eRPC 1g |
| Single chain, many contracts | t3.large (8 GB) | rindexer 4g, eRPC 2g |
| Multi-chain or backfill | t3.xlarge (16 GB) | rindexer 8g, eRPC 2g |

Monitor resource usage to right-size:

# EC2: check Docker container memory
ssh ubuntu@<ip> "sudo docker stats --no-stream"
 
# K8s: check pod resource usage
kubectl top pods
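For continuous monitoring on EC2, a small headroom check can run from cron and alert before the OOM killer does. A sketch using the "available" column of procps free (the 512 MB threshold is illustrative):

```shell
# Fail when available memory drops below a threshold (in MB).
# Relies on the "available" column of procps free (Linux).
check_headroom() {
  threshold="${1:-512}"
  avail=$(free -m | awk '/^Mem:/ {print $7}')
  if [ "$avail" -lt "$threshold" ]; then
    echo "low memory: ${avail}MB available (threshold ${threshold}MB)" >&2
    return 1
  fi
  echo "ok: ${avail}MB available"
}
# Usage (e.g. from cron): check_headroom 512 || <your alerting hook>
```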

See Variable Reference -- Instance Sizing for the full sizing table.


NAT Gateway

Enable the NAT gateway for production if your workloads run in private subnets and need outbound internet access (for RPC calls, ClickHouse Cloud connections, etc.):

network_enable_nat_gateway = true

Warning: NAT gateways add approximately $35/month plus data transfer charges. Skip for dev environments where instances can use public subnets.
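The $35 figure is roughly the hourly gateway price over a month, with data processing billed on top. A back-of-envelope check, assuming us-east-1 pricing of about $0.045/hour and $0.045/GB (verify against current AWS pricing):

```shell
# Base NAT gateway cost: hourly price times a ~730-hour month.
hourly=0.045
monthly=$(awk -v h="$hourly" 'BEGIN { printf "%.2f", h * 730 }')
echo "base NAT cost: \$${monthly}/month, plus \$0.045 per GB processed"
```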


Database Backup

Managed PostgreSQL (RDS)

Set the backup retention period:

# production.tfvars
postgres_backup_retention = 30  # days (default is 7)

RDS handles automated daily backups and point-in-time recovery. Verify backups are running:

aws rds describe-db-instances \
  --query "DBInstances[?DBInstanceIdentifier=='evm-cloud-prod'].{Retention:BackupRetentionPeriod,LatestRestore:LatestRestorableTime}"
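Beyond a one-off check, you can assert that the latest restorable point is recent. A sketch for a scheduled job (GNU date; the 2-hour window and the evm-cloud-prod identifier are illustrative):

```shell
# Return 0 if the RDS point-in-time recovery window extends to within
# the last two hours; requires GNU date for ISO-8601 parsing.
backup_is_fresh() {
  latest=$(aws rds describe-db-instances \
    --db-instance-identifier "${1:-evm-cloud-prod}" \
    --query 'DBInstances[0].LatestRestorableTime' --output text)
  now=$(date -u +%s)
  ts=$(date -u -d "$latest" +%s) || return 1
  [ $((now - ts)) -lt 7200 ]
}
```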

ClickHouse (BYODB)

If using ClickHouse Cloud, backups are managed by the service. For self-hosted ClickHouse, configure backups on the ClickHouse side -- evm-cloud does not manage external database backups.


Secrets Rotation

To rotate database credentials or other secrets:

  1. Update the password in secrets.auto.tfvars (or your TF_VAR_* environment variable)
  2. Run terraform apply
  3. Verify the new credentials are propagated:
| Engine | Post-rotation Step |
| --- | --- |
| EC2 | SSH in, run pull-secrets.sh, then docker compose restart |
| EKS (terraform) | Terraform updates the K8s Secret and triggers rollout automatically |
| EKS (external) | Re-run deployers/eks/deploy.sh |
| k3s | Re-run deployers/k3s/deploy.sh |
| Bare metal | Terraform re-provisions .env via SSH automatically |
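Step 1 can avoid writing the new password to disk at all by using the environment-variable form. A sketch -- the variable name postgres_password is an assumption, match your module's variable:

```shell
# Generate a new password and hand it to Terraform via TF_VAR_*,
# keeping it out of secrets.auto.tfvars.
new_password="$(openssl rand -base64 24)"
export TF_VAR_postgres_password="$new_password"
# Then run: terraform apply -var-file=production.tfvars
```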

Set the Secrets Manager recovery window appropriately:

# production.tfvars
ec2_secret_recovery_window_in_days = 30  # default is 7, use 0 only for dev

See Secrets Management for the full secrets lifecycle.


Destroy Safety

Kubernetes deployments (k3s, EKS external)

Always remove workloads before destroying infrastructure:

# Step 1: Remove workloads
./deployers/k3s/teardown.sh handoff.json
# or: helm uninstall rindexer erpc
 
# Step 2: Destroy infrastructure
terraform destroy -var-file=production.tfvars

Skipping step 1 will not leak AWS resources (the EC2 instance is terminated along with everything on it), but Helm release state and any persistent volumes will be lost without a clean shutdown.

All deployments

Before running terraform destroy in production:

  1. Generate a destroy plan and review it:
terraform plan -destroy -var-file=production.tfvars -out=destroy-plan
# Review the plan carefully
terraform apply destroy-plan
  2. Verify no unexpected resources are being destroyed (especially RDS instances or EBS volumes)
  3. Confirm database backups are current before destroying any database resources

Warning: Never run terraform destroy in production without reviewing the destroy plan first. RDS instances have deletion protection enabled by default, but other resources do not.
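You can confirm deletion protection from the CLI before planning a destroy; a sketch (the instance identifier is illustrative):

```shell
# Print "True" when the RDS instance cannot be deleted without first
# disabling protection.
deletion_protected() {
  aws rds describe-db-instances --db-instance-identifier "$1" \
    --query 'DBInstances[0].DeletionProtection' --output text
}
# Usage:
#   [ "$(deletion_protected evm-cloud-prod)" = "True" ] || echo "WARNING: unprotected"
```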


Monitoring

EC2

Terraform creates a CloudWatch log group (available in the handoff as artifacts.cloudwatch_log_group). View logs:

# From the instance
ssh ubuntu@<ip> "sudo docker compose -f /opt/evm-cloud/docker-compose.yml logs rindexer --tail 100"
ssh ubuntu@<ip> "sudo docker compose -f /opt/evm-cloud/docker-compose.yml logs erpc --tail 100"
 
# Via CloudWatch (if configured)
aws logs tail /evm-cloud/my-indexer --follow

Kubernetes (k3s, EKS)

kubectl logs -l app=rindexer --tail=100
kubectl logs -l app=erpc --tail=100
 
# Watch for restarts
kubectl get pods -w

Health checks

Verify the indexer is making progress:

# rindexer health endpoint
curl http://<ip>:3001/health
 
# eRPC health endpoint
curl http://<ip>:4000/health
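For smoke tests after a deploy, the same endpoints can be polled until they come up. A minimal sketch (ports follow the examples above; the retry count and timeout are illustrative):

```shell
# Poll a health endpoint until it answers 2xx, or give up.
wait_healthy() {
  url="$1"
  tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -fsS --max-time 2 "$url" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "health check failed after $tries attempts: $url" >&2
  return 1
}
# Usage: wait_healthy "http://<ip>:3001/health" && wait_healthy "http://<ip>:4000/health"
```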

Note: A future release will include a Prometheus + Grafana monitoring addon for dashboards and alerting. For now, use CloudWatch (EC2) or standard Kubernetes monitoring tooling.


Summary Checklist

Use this as a pre-launch checklist:

  • Remote state backend configured (S3 + KMS + DynamoDB)
  • Environment isolation in place (separate directories or workspaces)
  • CI runs make qa on every PR
  • Plan-then-apply workflow enforced (no direct terraform apply in CI)
  • Instance sized for workload (t3.medium minimum, monitor and adjust)
  • NAT gateway enabled if private subnets need outbound access
  • Database backup retention set (postgres_backup_retention = 30)
  • Secrets Manager recovery window set to 30 days
  • secrets.auto.tfvars and handoff.json are in .gitignore
  • Destroy procedure documented (workload teardown before terraform destroy)
  • Monitoring and log access verified
  • Health check endpoints reachable
