Overview
A digital media company’s two-person platform team had become the bottleneck for 120 engineers across 8 product teams. We built a self-service infrastructure system that let teams provision their own AWS resources safely, reducing request lead time from two weeks to under an hour while maintaining SOC2 compliance with zero policy violations.
The Challenge
The platform team was drowning in tickets.
The client was a digital media company with 120 engineers spread across 8 product teams. Two platform engineers supported all of them. Every database, every S3 bucket, every Lambda function — if it touched AWS, it went through the platform team’s ticket queue.
The backlog was brutal. Average wait time for infrastructure requests was two weeks. Product teams would submit a ticket for a new RDS instance, then wait. And wait. The platform engineers weren’t slow—they were just outnumbered 60 to 1.
Product deadlines were slipping. Teams planned features, started building, then discovered they needed infrastructure that wouldn’t be ready for weeks. Some teams learned to front-load infrastructure requests at the start of every sprint, even if they weren’t sure exactly what they’d need. This created waste and still didn’t solve the timing problem.
Shadow IT had emerged. Frustrated engineers started creating resources directly in AWS using personal IAM credentials that had been provisioned for debugging. The security team discovered S3 buckets and Lambda functions that weren’t in the official inventory. Nobody knew who owned them or what they were for. Some had public access enabled.
The platform team was miserable too. They’d joined to build platforms, not to process tickets. Most of their day was spent on repetitive provisioning tasks that required their expertise to do safely but not their expertise to design. They wanted to work on interesting problems, but the ticket queue never let them.
The constraints made this tricky. The company had SOC2 compliance requirements, which meant security guardrails weren’t optional. We couldn’t just give product teams AWS console access and hope for the best. Every resource needed encryption, proper network isolation, and audit logging. The solution had to be both self-service and secure.
The Approach
I started by understanding what teams actually needed. I cataloged every infrastructure ticket from the past six months—487 requests total. The distribution was revealing:
| Resource Type | % of Requests |
|---|---|
| S3 buckets | 28% |
| RDS databases | 24% |
| Lambda functions | 19% |
| SQS queues | 12% |
| ElastiCache | 8% |
| Other | 9% |
Five resource types accounted for 91% of all requests. And within each type, the configurations were remarkably similar. Most RDS requests were for PostgreSQL with the same encryption and backup settings. Most S3 buckets needed the same access policies. Teams weren’t asking for exotic infrastructure—they were asking for the same things over and over.
This was the insight that shaped the solution: we didn’t need to give teams full AWS access. We needed to give them access to approved patterns that handled the security requirements automatically.
The key decisions:
Terraform modules over custom portal. We considered building a web interface, but Terraform modules offered more flexibility and lower maintenance. Engineers already understood infrastructure as code. We’d meet them where they were.
Policy as code with OPA. Security requirements would be codified as Open Policy Agent policies. Every Terraform plan would be checked against these policies before apply. Compliance became automated, not manual.
CI/CD-driven provisioning. Teams would provision infrastructure by opening PRs, not by clicking in a portal. This gave us audit trails, review workflows, and integration with existing development processes.
The implementation happened in five phases:
Month 1: Core modules. We built Terraform modules for the top five resource types with security best practices baked in. Each module exposed only the configuration options teams actually needed—you couldn’t accidentally create a public S3 bucket because the module didn’t have a parameter for that.
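The module source isn't shown in this case study; a minimal sketch of the idea, with illustrative names, is an interface that simply has no public-access parameter and hard-codes the guardrail:

```hcl
# Illustrative sketch of a constrained S3 module (names are assumptions).
# The interface exposes only safe options; public access blocking is not
# configurable, so callers cannot create a public bucket.

variable "name" {
  type = string
}

variable "versioning_enabled" {
  type    = bool
  default = true
}

resource "aws_s3_bucket" "this" {
  bucket = var.name
}

# Hard-coded guardrail: there is no variable to opt out of this.
resource "aws_s3_bucket_public_access_block" "this" {
  bucket                  = aws_s3_bucket.this.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

The design choice is that safety lives in what the interface omits, not in what reviewers catch.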
Month 2: Policy layer. We wrote OPA policies encoding SOC2 requirements: encryption at rest, encryption in transit, no public access, proper tagging. We integrated Conftest to run these policies as part of the CI pipeline.
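The pipeline step itself isn't shown; one plausible shape for it (the command names are real, the file paths and policy directory are assumptions) is to render the plan as JSON and hand it to Conftest:

```shell
# Illustrative CI step: export the Terraform plan as JSON, then evaluate
# the Rego policies against it. Paths are assumptions.
terraform plan -out=tfplan.binary
terraform show -json tfplan.binary > tfplan.json
conftest test --policy policy/ tfplan.json
```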
Month 3: CI/CD wrapper. We set up Atlantis to run Terraform plans and applies on PR events. Teams got plan output as PR comments. Applies required approval from at least one platform team member initially.
Month 4: Self-service rollout. We onboarded teams one by one. Each team got a repository with module references they could customize. We trained them on the workflow and supported their first few requests closely.
Month 5: Refinement and autonomy. As teams proved they could use the system safely, we removed the approval requirement for pre-approved patterns. Platform team review became optional for standard resources.
The security team was initially skeptical. “You want to let developers provision their own databases?” We addressed this by showing them the policy layer in action. We demonstrated that developers literally couldn’t create non-compliant resources—the CI pipeline would reject them. The policies enforced what reviews had previously caught manually.
The Solution
Module Architecture
The Terraform modules lived in a central repository owned by the platform team. Each module encapsulated security best practices and exposed a simplified interface:
```hcl
# Example: what a team's Terraform looks like
module "user_uploads" {
  source = "git::https://github.com/company/tf-modules//s3-bucket?ref=v2.1.0"

  name               = "user-uploads"
  team               = "content-platform"
  environment        = "production"
  versioning_enabled = true
  lifecycle_rules    = "standard-90-day"
}
```

The module handled everything else: encryption with KMS, blocking public access, access logging, proper IAM policies, and required tags. Teams specified what they needed; the module ensured it was built correctly.
We versioned modules using Git tags. Teams pinned to specific versions, and the platform team could release new versions with improvements without breaking existing infrastructure. Major version bumps required team action to upgrade.
Policy Enforcement
OPA policies ran on every Terraform plan. A sample policy for S3:
```rego
package main

# Deny S3 buckets without encryption
deny[msg] {
    resource := input.planned_values.root_module.resources[_]
    resource.type == "aws_s3_bucket"
    not resource.values.server_side_encryption_configuration
    msg := sprintf("S3 bucket '%s' must have encryption enabled", [resource.name])
}
```
The policies covered encryption, network isolation, tagging requirements, and naming conventions. Conftest (a CLI tool for running OPA policies against structured data like Terraform plans) ran these policies as a GitHub Actions step before Atlantis could apply anything. Failed policies blocked the PR with clear error messages explaining what needed to change.
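The case study shows only the encryption rule; a sketch of what a tagging rule might have looked like, assuming the same plan-JSON input shape and a hypothetical required `team` tag, is:

```rego
package main

# Hypothetical companion rule: require a "team" tag on every S3 bucket.
# Assumes the same Terraform plan JSON input as the encryption policy.
deny[msg] {
    resource := input.planned_values.root_module.resources[_]
    resource.type == "aws_s3_bucket"
    not resource.values.tags.team
    msg := sprintf("S3 bucket '%s' must carry a 'team' tag", [resource.name])
}
```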
Cost Controls
Infracost estimated the monthly cost of each infrastructure change and posted it as a PR comment. Teams saw immediately if their new RDS instance would cost $500/month or $5,000/month.
We also implemented per-team budgets. Each team had a monthly infrastructure allocation. The system tracked provisioned resources against these budgets and warned when teams approached their limits. This prevented surprise bills and encouraged teams to think about cost as a design constraint.
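The budget-tracking implementation isn't shown; a minimal sketch of the warning logic, with an illustrative 80% warning threshold and example dollar figures, might look like:

```python
# Hypothetical per-team budget check: classify monthly infrastructure
# spend against the team's allocation. Threshold and figures are illustrative.

def budget_status(spend: float, allocation: float, warn_ratio: float = 0.8) -> str:
    """Return 'ok', 'warning', or 'over-budget' for a team's monthly spend."""
    if allocation <= 0:
        raise ValueError("allocation must be positive")
    ratio = spend / allocation
    if ratio >= 1.0:
        return "over-budget"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"

# Example: a team with a $10,000/month allocation
print(budget_status(4_500, 10_000))   # ok
print(budget_status(8_700, 10_000))   # warning
print(budget_status(11_200, 10_000))  # over-budget
```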
Workflow
The day-to-day workflow for a product team:
- Engineer creates a branch in their team’s infrastructure repo
- Adds or modifies Terraform configuration using approved modules
- Opens a PR
- GitHub Actions runs `terraform plan`, OPA policy checks, and Infracost
- PR comment shows the plan, any policy violations, and cost estimate
- For standard patterns, engineer merges after self-review
- Atlantis applies the changes automatically
- Resources are provisioned, tagged, and compliant
For non-standard requests—anything outside the pre-approved modules—the PR required platform team review. This kept the escape hatch available while ensuring experts reviewed novel infrastructure.
The Results
Five months after starting, the transformation was complete:
| Metric | Before | After |
|---|---|---|
| Infrastructure request lead time | 2 weeks | <1 hour |
| Platform team ticket volume | Baseline | -75% |
| Security policy violations | Unknown (shadow IT) | Zero |
| Self-sufficient product teams | 0 of 8 | 6 of 8 |
| Shadow IT resources | Discovered during audit | Eliminated |
| Infrastructure spend | Baseline | -12% |
For standard resources, most requests completed in 15-30 minutes. The tickets that remained, roughly a quarter of the previous volume, were non-standard requests that genuinely required platform expertise: interesting problems, not repetitive provisioning. With ticket volume down 75%, the platform engineers finally had time for the work they wanted to do: improving observability, building deployment tooling, and designing the next generation of infrastructure patterns.
The cost visibility produced an unexpected but significant benefit. When teams could see infrastructure costs in their PR comments, behavior changed. One team realized their staging environment was costing as much as production and voluntarily downsized. Another team chose smaller RDS instance types after seeing the monthly estimates. Total infrastructure spend dropped 12% despite increased provisioning volume—the opposite of what typically happens when you make provisioning easier. Visibility drove accountability.
Key Takeaways
- Self-service doesn’t mean uncontrolled. The key insight was that teams didn’t need full AWS access—they needed access to approved patterns. Constrained self-service is often more valuable than unconstrained freedom because it removes the cognitive load of security and compliance decisions.
- Codified policies beat manual reviews. When security requirements are code, they’re consistent and fast. Manual reviews are slow, inconsistent, and create bottlenecks. Automate the rules, review the exceptions.
- Meet teams where they are. We used Terraform and Git because that’s what engineers already knew. A custom portal might have looked nicer but would have added friction and required training. Familiar tools with new capabilities beat new tools with new capabilities.