From ClickOps to Code: Infrastructure as Code Transformation

Kevin Brown
Client: E-commerce Platform
Industry: Retail / E-commerce
Project Type: Infrastructure as Code
Duration: 4 months

Overview

An e-commerce company with eight years of manual AWS changes needed to bring their infrastructure under version control. We imported their entire production environment into Terraform, established PR-based change workflows, and reduced new environment provisioning from two weeks to two hours—all without disrupting their live platform.

The Challenge

Eight years of “just click it in the console” had created a mess nobody fully understood.

The client was a mid-sized e-commerce platform with 15 engineers and an AWS environment that had grown organically since 2016. Every time someone needed infrastructure, they logged into the AWS console and created it. Security groups, EC2 instances, RDS databases, S3 buckets, IAM roles — all of it provisioned by hand, configured by hand, and documented nowhere.

Nobody knew what existed. The AWS bill showed 400+ EC2 instances, but the team could only account for about 250. There were S3 buckets with names like “test-backup-2019” and “johns-temp-bucket” that nobody wanted to delete because nobody knew if they mattered. Security groups had rules added over the years with no comments explaining why.

Configuration drift was constant. Production and staging were supposed to be identical, but they’d diverged so much that bugs would appear in production that couldn’t be reproduced in staging. Deployments were terrifying because nobody was sure the environments actually matched.

Provisioning new environments took two weeks. When the team needed a new staging environment for a large feature, an engineer would spend days clicking through the console, trying to replicate production from memory and tribal knowledge. The result was never quite right.

They’d tried Terraform before. Two years earlier, a previous engineer had started a Terraform project. It ended badly — a state file corruption wiped out the mapping between Terraform and real resources, and the team spent a week recovering. The project was abandoned, and the word “Terraform” became taboo.

The breaking point came when a production change went wrong. Someone modified a security group in the console, didn’t realize it was shared across services, and took down the payment system for 45 minutes. The incident report recommended infrastructure as code. This time, it had executive sponsorship.

The Approach

The first two weeks were archaeology. I needed to understand what actually existed before I could codify it.

I ran AWS Config queries to inventory every resource across all regions. I used tools like aws-nuke (an open-source tool for cleaning up AWS accounts) in dry-run mode to see what would be deleted if we started fresh—we weren’t going to, but it was illuminating for understanding the full scope. I interviewed every engineer about what they’d created and why.

The inventory revealed 847 distinct resources that mattered—everything from VPCs and subnets to Lambda functions and CloudWatch alarms. About 200 more were clearly abandoned but too risky to delete without more investigation.

The key decisions shaped the rest of the project:

  • Terraform over Pulumi
    The team had some familiarity with HCL from the failed previous attempt. Terraform's ecosystem was also more mature for our use case, with better tooling for importing existing resources.
  • Remote state from day one
    The previous failure was partly due to local state files. We set up S3 backend with DynamoDB locking before importing a single resource. State file problems would not sink this project.
  • Import first, refactor later
    We'd import resources into Terraform exactly as they existed, even if the configuration was messy. Cleaning up the HCL could happen after we had everything under control.
  • Atlantis for workflow
    Every infrastructure change would go through a pull request, with Atlantis running terraform plan automatically and requiring approval before terraform apply. This gave us audit trails and prevented the "quick console fix" that had caused the incident.

The implementation happened in phases:

Table: Implementation phase timeline

We saved IAM for late in the project because it’s the most likely to break things if imported incorrectly. The final phase proved we could spin up a complete staging environment from code.

The obstacles were what you’d expect. Resources created by departed employees with no documentation. Security groups with rules nobody remembered adding. IAM policies attached to roles that no longer made sense. Each one required detective work to understand before we could safely import it.

The Solution

The final architecture centered on Terraform modules organized by service tier, with Atlantis[1] enforcing PR-based workflows for all changes.

Module Structure

We organized Terraform code into layers that reflected how the team thought about their infrastructure.

Each environment composed modules with environment-specific variables. Production and staging used the same module versions with different inputs for instance sizes, replica counts, and domain names.
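As a sketch (module names, paths, and inputs here are illustrative, not the client's), an environment's root configuration composed the shared modules like this:

```hcl
# environments/production/main.tf -- hypothetical layout and values
module "network" {
  source   = "../../modules/network"
  vpc_cidr = "10.0.0.0/16"
}

module "web" {
  source        = "../../modules/web"
  vpc_id        = module.network.vpc_id
  instance_type = "m5.xlarge"
  replica_count = 6
  domain_name   = "shop.example.com"
}
```

Staging reused the same module versions with smaller inputs (a smaller instance type, fewer replicas, a staging domain), which is what kept the two environments from drifting apart again.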

Import Process

Terraform’s native import command works, but it’s tedious for hundreds of resources. We built a Python script that queried AWS APIs and generated both the terraform import commands and initial HCL scaffolding. Running the imports still required manual work—verifying that plans showed no changes after import—but the automation cut the time significantly.
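A minimal sketch of that helper script follows. In the real script the inventory came from AWS API (boto3) describe calls across every region; here a hypothetical static list stands in, and the resource names are invented for illustration:

```python
def import_command(res):
    """Build the CLI command that adopts an existing resource into state."""
    return f'terraform import {res["type"]}.{res["name"]} {res["id"]}'

def hcl_scaffold(res):
    """Emit a minimal resource block; real attributes get filled in
    once `terraform plan` shows what the imported resource contains."""
    return (
        f'resource "{res["type"]}" "{res["name"]}" {{\n'
        "  # imported as-is; attributes reconciled after the first plan\n"
        "}\n"
    )

# Hypothetical inventory -- the real script built this from AWS APIs.
inventory = [
    {"type": "aws_s3_bucket", "name": "assets", "id": "example-assets-bucket"},
    {"type": "aws_security_group", "name": "web", "id": "sg-0123456789abcdef0"},
]

commands = [import_command(r) for r in inventory]
scaffolding = "".join(hcl_scaffold(r) for r in inventory)

print("\n".join(commands))
print(scaffolding)
```

The scaffolding is deliberately empty: after each `terraform import`, the verification step was running `terraform plan` and copying attributes into the block until the plan showed no changes.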

For particularly complex resources like IAM policies, we used terraformer (Google’s tool for generating Terraform configurations from existing infrastructure) to generate initial HCL, then cleaned up the output. The generated code was ugly but accurate, which was exactly what we needed for the import phase.

Atlantis Workflow

Atlantis ran in ECS, connected to GitHub via webhooks. The workflow:

  • Engineer opens PR with Terraform changes
  • Atlantis automatically runs terraform plan and comments the output on the PR
  • Reviewer checks the plan, approves if correct
  • Engineer comments atlantis apply to execute
  • Atlantis applies changes and comments the result

This replaced the old process of “ask in Slack, then click in the console.” Every change had a PR, a plan, an approval, and an audit trail.
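A minimal repo-level `atlantis.yaml` for this workflow might look like the following; the project name and directory layout are illustrative:

```yaml
# atlantis.yaml (sketch) -- directories and names are hypothetical
version: 3
projects:
  - name: production
    dir: environments/production
    autoplan:
      when_modified: ["*.tf", "../../modules/**/*.tf"]
    # Requires server-side permission to set in the repo config
    apply_requirements: [approved]
```

The `apply_requirements: [approved]` line is what enforced the review step: Atlantis refuses to apply until a reviewer has approved the PR.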

Security and Cost Guardrails

We integrated tfsec[2] into the Atlantis workflow. PRs that introduced security issues (public S3 buckets, overly permissive security groups) failed CI and couldn’t be merged without explicit override.

Infracost estimated the monthly cost impact of changes. This didn’t block PRs, but seeing “+$450 / month” in the PR comment made engineers think twice about oversized instances.
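Wiring the scanner into the plan stage can be sketched as a custom Atlantis workflow; the exact severity threshold here is an illustrative choice, not the client's:

```yaml
# Server-side Atlantis repo config (sketch): run tfsec before plan,
# so a failing scan blocks the plan from ever being posted.
workflows:
  secure:
    plan:
      steps:
        - init
        - run: tfsec . --minimum-severity HIGH
        - plan
```

Because a `run` step that exits non-zero fails the whole plan, an engineer sees the security finding in the PR comment instead of discovering it after apply.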

The Results

Four months after starting, the transformation was complete:

Transformation outcomes
  1. 100% of production infrastructure in version control
     Every VPC, instance, database, and security group existed in Terraform. The AWS console became read-only for the team.
  2. New environment provisioning dropped from 2 weeks to 2 hours
     The team spun up a complete staging environment by running terraform apply with different variables. What used to take a senior engineer days of clicking now took anyone on the team a couple of hours, mostly waiting for RDS to provision.
  3. Configuration drift incidents went from weekly to zero
     With all changes going through PRs, environments stayed in sync. The "works in staging, fails in production" problem disappeared.
  4. All infrastructure changes require PR review
     No more surprise security group modifications. Every change was visible, reviewed, and reversible.
  5. Mean time to recover from infrastructure issues dropped by 70%
     When something went wrong, engineers could read the Terraform code to understand the current state, check Git history to see what changed, and roll back with confidence.

The abandoned resources got cleaned up too. We identified 180 resources that were genuinely unused and deleted them over the following month, saving $12,000/month in wasted spend.

Key Takeaways

  • Import before you refactor. The instinct is to design a beautiful module structure and rebuild everything cleanly. Resist it. Import what exists first, prove it works, then refactor incrementally. You cannot improve what you do not control.
  • Remote state and locking are non-negotiable. The previous failed attempt taught this lesson painfully. State file corruption or conflicts will happen if you do not prevent them. S3 + DynamoDB locking is table stakes.
  • PR-based workflows change behavior. Once all changes required a PR, the "quick fix in the console" pattern stopped. Engineers adapted because the new process was actually easier. Running atlantis apply is less work than clicking through the console. Rollback is trivial.

Footnotes

  1. Atlantis is an open-source tool that automates Terraform workflows through pull requests by connecting a version control system such as GitHub, GitLab, or Bitbucket to Terraform plan and apply execution.

  2. tfsec was an open-source static analysis tool for scanning Terraform code for misconfigurations and security risks before deployment, including issues such as unencrypted storage and unintended public access across AWS, Azure, and GCP. The project has since been incorporated into Trivy.