Kubernetes Adoption for a High-Volume Payment Processor

Kevin Brown
Client: Payment Processing Fintech
Industry: Financial Services
Project Type: Kubernetes Adoption
Duration: 9 months

Overview

A payment processing fintech needed to modernize their deployment infrastructure without compromising PCI compliance or their sub-100ms latency SLAs. Over nine months, we migrated 40+ microservices from EC2 instances to Amazon EKS, shortening release cycles from two weeks to multiple deployments per day while passing their PCI-DSS[1] audit on the new architecture.

The Challenge

The client processed over 10 million payment transactions daily through a microservices architecture running on EC2 instances. Each service had its own deployment pipeline, its own scaling rules, and its own operational quirks. What had started as a clean architecture had become a sprawling estate of 40+ services that nobody fully understood.

Deployments were painful. A two-week release cycle was the norm — not because they wanted to ship slowly, but because coordinating deployments across services, managing configuration drift, and handling rollbacks manually consumed enormous amounts of engineering time. Teams were afraid to deploy because rolling back meant SSH-ing into instances and manually reverting changes.

The on-call burden was crushing. Engineers were getting paged for issues that container orchestration handles automatically: instance failures, scaling events, health check flapping. Senior engineers were burning out. Two had left in the past six months, citing on-call fatigue.

The constraints made this harder than a typical Kubernetes adoption. PCI-DSS compliance meant we needed network segmentation, encryption in transit, comprehensive audit logging, and strict access controls. The payment processing SLA required sub-100ms P99 latency—we couldn’t introduce overhead that would breach that. And the business couldn’t tolerate payment failures during migration. Every transaction mattered.

The Approach

We started with a month of assessment work. I mapped service dependencies, analyzed traffic patterns, and worked with their compliance team to understand exactly what PCI-DSS required in the new architecture. We also evaluated team readiness—Kubernetes has a learning curve, and I needed to know where the gaps were.

The key architectural decisions came out of that assessment: managed EKS with Karpenter for node provisioning, Istio for mTLS and network segmentation, ArgoCD for GitOps deployments, and Vault for secrets. Each is detailed in the Solution section below.

The migration happened in four phases over nine months:

[Figure: Migration phase timeline]

Phase 1 built the platform with no production traffic. Phase 2 validated the platform and built team familiarity. Phase 3 migrated services with parallel running and gradual traffic shifting—payment-critical services went last. Phase 4 worked with the QSA (Qualified Security Assessor) to certify the new architecture.

The biggest obstacle was service discovery. Many services had hard-coded IP addresses or relied on EC2 instance metadata. We had to refactor these to use Kubernetes DNS and service abstractions before they could migrate.

The Solution

The final architecture reflected the tradeoffs between operational simplicity, compliance requirements, and performance:

Cluster Architecture

We deployed a multi-AZ EKS cluster in us-east-1 with node groups spread across three availability zones. The cluster ran Kubernetes 1.28 with managed add-ons for CoreDNS, kube-proxy, and the VPC CNI[2].
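A cluster of this shape can be sketched with an eksctl config. This is an illustrative sketch only: the cluster name, node group, and instance sizes are hypothetical, not the client's actual values.

```yaml
# Illustrative eksctl ClusterConfig; names and sizes are hypothetical.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: payments-platform        # hypothetical cluster name
  region: us-east-1
  version: "1.28"
availabilityZones: [us-east-1a, us-east-1b, us-east-1c]
addons:                          # EKS managed add-ons
  - name: coredns
  - name: kube-proxy
  - name: vpc-cni
managedNodeGroups:
  - name: system                 # baseline capacity for platform services
    instanceType: m5.xlarge
    minSize: 3
    maxSize: 6
    privateNetworking: true
```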

Karpenter[3] handled node provisioning with provisioners tuned for different workload types. Payment processing services ran on dedicated node pools with instance types selected for consistent performance (no burstable instances). Batch processing and internal tools shared a separate pool that could use spot instances.
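The payment pool's shape can be expressed as a Karpenter Provisioner (the v1alpha5 API current at the time): on-demand capacity only, fixed compute-optimized instance types, and a taint so only payment workloads land there. The taint key and names below are hypothetical.

```yaml
# Illustrative Karpenter Provisioner for the payment node pool.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: payments
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]         # no spot for payment-critical services
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["c5.2xlarge", "c5.4xlarge"]  # no burstable instances
  taints:
    - key: workload-class            # hypothetical taint
      value: payments
      effect: NoSchedule             # only tolerating pods schedule here
  limits:
    resources:
      cpu: "500"                     # cap total pool size
  providerRef:
    name: payments                   # matches an AWSNodeTemplate
```

The batch/internal pool would be a second Provisioner that also allows `"spot"` in `karpenter.sh/capacity-type`.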

Service Mesh and Security

Istio provided the service mesh layer. Every service-to-service call used mTLS with certificates rotated automatically via Istio’s certificate management. This satisfied the PCI requirement for encryption in transit without requiring application changes.
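Mesh-wide strict mTLS of this kind is a single Istio resource: a PeerAuthentication in the root namespace that rejects any plaintext service-to-service traffic.

```yaml
# Mesh-wide strict mTLS: placed in istio-system, this applies to
# every sidecar-injected workload in the mesh.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
```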

Network policies enforced segmentation. The cardholder data environment (CDE) services could only communicate with explicitly allowed services. Default-deny policies meant new services were isolated until explicitly permitted.
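The default-deny pattern looks roughly like this: one policy that isolates every pod in a CDE namespace, plus explicit allows per permitted path. The namespace and label names here are hypothetical.

```yaml
# Default-deny for a CDE namespace, plus one explicit allow.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: cde                    # hypothetical CDE namespace
spec:
  podSelector: {}                   # selects every pod in the namespace
  policyTypes: [Ingress, Egress]    # no rules listed = deny all
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gateway-to-payments
  namespace: cde
spec:
  podSelector:
    matchLabels:
      app: payments-api             # hypothetical service label
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: edge            # hypothetical gateway namespace
      ports:
        - protocol: TCP
          port: 8443
```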

We chose Istio over Linkerd despite Linkerd’s simpler operational model. The deciding factors were Istio’s more mature authorization policies and its integration with external authorization systems — we needed to enforce fine-grained access controls that Linkerd couldn’t support at the time.

GitOps and Deployment

ArgoCD managed all deployments from a central Git repository. Each service had its own directory with Kubernetes manifests and Kustomize overlays for environment-specific configuration. Deployments happened by merging to the main branch — ArgoCD detected changes and reconciled the cluster state.

We configured ArgoCD with automated sync for staging and manual sync for production. Production deployments required explicit approval in the ArgoCD UI, creating a clear audit trail of who deployed what and when.
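An ArgoCD Application per service captures this split. The sketch below shows the staging shape with automated sync; the production variant would simply omit the `automated` block so every sync requires explicit approval. Repo URL and paths are hypothetical.

```yaml
# Illustrative ArgoCD Application for a staging environment.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api-staging
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/deployments.git  # hypothetical
    targetRevision: main
    path: services/payments-api/overlays/staging   # Kustomize overlay
  destination:
    server: https://kubernetes.default.svc
    namespace: payments-staging
  syncPolicy:
    automated:          # staging only; production requires manual sync
      prune: true       # remove resources deleted from Git
      selfHeal: true    # revert out-of-band cluster drift
```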

The choice of ArgoCD over Flux came down to the UI. The team needed visibility into deployment status without kubectl access. ArgoCD’s dashboard let product managers and compliance officers see deployment history without needing cluster credentials.

Secrets and Configuration

HashiCorp Vault stored all secrets with the Vault Secrets Operator (HashiCorp’s official Kubernetes operator) injecting them into pods. This replaced a fragile system of EC2 parameter store lookups and hardcoded credentials in deployment scripts. Vault’s audit logging met PCI requirements for tracking access to sensitive configuration.
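With the Vault Secrets Operator, each consuming service declares a resource like the sketch below, which syncs a Vault KV secret into a Kubernetes Secret the pod can mount. The mount, path, and names are hypothetical.

```yaml
# Illustrative Vault Secrets Operator resource (KV-v2 secret sync).
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultStaticSecret
metadata:
  name: payments-db-creds
  namespace: payments               # hypothetical namespace
spec:
  vaultAuthRef: default             # references a VaultAuth resource
  type: kv-v2
  mount: kv                         # hypothetical KV-v2 mount
  path: payments/db                 # hypothetical secret path
  destination:
    name: payments-db-creds         # resulting Kubernetes Secret
    create: true
  refreshAfter: 1h                  # re-sync interval
```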

Observability

Datadog provided unified metrics, logs, and traces. We instrumented all services with the Datadog agent and configured distributed tracing to follow requests across service boundaries. Custom dashboards showed payment processing latency, error rates, and transaction volumes. Alerts tied into PagerDuty with severity-based routing.

Tradeoffs We Made

Istio's overhead
Istio adds roughly 2-3ms of latency per hop. For payment processing with a 100ms SLA, this mattered. We accepted it because the security and traffic management benefits outweighed the cost, and the services still met SLA comfortably.
Managed node groups vs. Karpenter
We started with managed node groups for simplicity, then migrated to Karpenter in month 4 when we understood our workload patterns better. Karpenter's flexibility was worth the additional operational complexity.
Single cluster vs. multi-cluster
We considered separate clusters for CDE and non-CDE workloads. The compliance team ultimately accepted namespace-level isolation with network policies, which reduced operational overhead significantly.

The Results

Nine months after starting, the migration was complete and certified:

Migration outcomes summary:

  1. Deployment frequency increased from bi-weekly to 15+ times daily. Teams now deployed independently without coordinating release windows. Most deployments were fully automated, with manual approval required only for production.
  2. Mean time to recovery dropped from 45 minutes to 8 minutes. Kubernetes self-healing, combined with proper health checks and Istio's traffic management, meant most issues resolved automatically. When human intervention was needed, rollback was a single click in ArgoCD.
  3. Infrastructure costs reduced by 30%. Better bin-packing from Karpenter, automatic scaling that actually worked, and the elimination of over-provisioned EC2 instances drove the savings.
  4. PCI-DSS audit passed with zero findings. The QSA specifically called out the mTLS implementation and audit logging as exemplary.
  5. On-call pages reduced by 60%. The self-healing capabilities and improved observability meant engineers were no longer woken up for issues the platform could handle itself.
  6. P99 latency improved from 85ms to 72ms. Despite Istio's overhead, the consistent performance from dedicated node pools and better resource allocation actually improved latency.

Key Takeaways

  • Compliance constraints can drive better architecture. The PCI requirements forced us to implement mTLS, network segmentation, and comprehensive audit logging. These practices made the system more secure and more observable than it would have been otherwise.
  • Migrate incrementally with traffic shifting. Running old and new systems in parallel with gradual traffic migration eliminated big-bang risk. When we found issues, we shifted traffic back. No customer ever noticed a migration-related problem.
  • Invest in the platform before migrating workloads. The three months we spent building the platform foundation (observability, CI/CD, security controls) before migrating any production traffic paid off. Teams trusted the platform because it was proven before they depended on it.

Footnotes

  1. PCI-DSS is the Payment Card Industry Data Security Standard, a set of security requirements for systems that store, process, or transmit cardholder data.

  2. The Amazon VPC CNI is EKS's default networking plugin. It assigns VPC IP addresses directly to pods so they can talk to other AWS network resources without an overlay network, which improves performance and makes native VPC controls like flow logs, routing, and security policies easier to use.

  3. Karpenter is an open-source Kubernetes autoscaler that launches and removes right-sized nodes based on what pending workloads actually need. It is faster and more flexible than older autoscaling approaches, but it adds some operational complexity and depends heavily on cloud-provider integration.