# HPA Autoscaling: Signals, Delays, and Traps
An e-commerce team configures HPA with a 50% CPU target. During a flash sale, traffic spikes 10x in 30 seconds. HPA takes 15 seconds to detect the load, another 15 seconds for its evaluation cycle, then the stabilization window kicks in. Meanwhile, new pods need scheduling, image pulls, and readiness probes. By the time capacity catches up—3+ minutes later—frustrated users have already left.
I’ve watched this play out repeatedly. The Horizontal Pod Autoscaler looks deceptively simple: set a target CPU percentage, and Kubernetes scales your pods automatically. In practice, teams discover that HPA reacts too slowly to traffic spikes, oscillates between scaling up and down, or scales on entirely the wrong signals.
The problem is that HPA is reactive, not predictive. By the time it decides to scale, your workload is already under stress. Tuning HPA means understanding where delays come from, choosing metrics that respond quickly, and configuring behavior that matches your traffic patterns.
## The Delay Problem
The HPA controller runs every 15 seconds by default, querying the metrics server and calculating how many replicas are needed. That sounds fast, but the total time from “traffic spike begins” to “new capacity receives traffic” is much longer. Every step in the pipeline adds latency.
| Delay Component | Typical Duration | Notes |
|---|---|---|
| Metrics scrape | 15-60s | Metrics Server polls kubelets at this interval |
| HPA sync | 15s | Controller evaluation cycle |
| Stabilization | 0-300s | Prevents thrashing; configurable |
| Pod scheduling | 1-10s | Depends on cluster capacity |
| Pod startup | 5-60s+ | Application initialization time |
| Readiness probe | 5-30s | Before pod receives traffic |
These delays stack. The metrics scrape and HPA sync cycles run independently—they’re not synchronized. In the worst case, a CPU spike happens right after a scrape, waits nearly a full interval to be captured, then waits again for HPA to evaluate. Add kubelet collection overhead (~10 seconds), and you’re looking at 85+ seconds before HPA even decides to scale—before stabilization, scheduling, or startup.
With aggressive tuning and cached images, the timeline improves but still isn’t instant. Here’s a typical tuned scenario:
- T+0s: Traffic spike begins
- T+15s: Metrics server captures the load
- T+30s: HPA evaluates and decides to scale
- T+45s: Stabilization check passes
- T+50s: New pods scheduled
- T+80s: Pods pass readiness probe, receive traffic
That’s 80+ seconds of degraded performance under favorable conditions. With default settings—including a 300-second stabilization window in older Kubernetes versions—it’s significantly worse.
Even with aggressive tuning, expect 30-45 seconds minimum from spike to new capacity. If your traffic can increase faster than that, HPA alone won’t save you. You’ll need pre-scaling for known events, higher baseline capacity, or request queuing to absorb the delay.
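One delay component you control directly is probe timing. This sketch tightens a readiness probe so a started pod receives traffic as soon as possible; the container name, image, and `/healthz` endpoint are illustrative, not from any particular application:

```yaml
# Container spec fragment for a Deployment; names and endpoint are illustrative.
containers:
  - name: api-server
    image: example.com/api-server:1.0  # a cached image avoids pull delay on scale-up
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 2   # no longer than startup actually takes
      periodSeconds: 3         # probe frequently so ready pods join the pool fast
      failureThreshold: 2
```

With the defaults (`initialDelaySeconds: 0`, `periodSeconds: 10`), a pod that becomes ready just after a probe can sit idle for up to 10 seconds before the next check, so tightening `periodSeconds` alone can shave several seconds off the tail of the scaling timeline.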
## Choosing the Right Metric
Most teams start with CPU utilization because it’s built-in. But CPU is a lagging indicator: by the time CPU spikes, requests are already queuing and users are already waiting. For web services, the metric you choose determines whether HPA responds to incoming load or to damage already done.
Leading indicators like requests per second and queue depth increase the moment traffic arrives—before your system shows stress. Lagging indicators like CPU utilization and response latency rise only after work is already backing up. When possible, scale on what’s coming, not what’s already hurting.
| Metric | Type | Best For |
|---|---|---|
| Requests per second | Leading | Web APIs, high-traffic services |
| Queue depth | Leading | Async workers, background job processors |
| CPU utilization | Lagging | Compute-bound workloads, synchronous processing |
| Response latency | Lagging | SLO-driven scaling, user-facing services |
CPU works well for compute-bound workloads where processing time scales linearly with CPU. But for I/O-bound services—anything waiting on databases, external APIs, or message queues—CPU stays low while requests pile up. By the time CPU rises, you’re already degraded.
Requests per second scales based on incoming traffic regardless of how much work each request requires. It’s a direct measure of load, and it increases immediately when traffic arrives. The tradeoff is that it requires custom metrics setup through the Prometheus Adapter[^1] or KEDA[^2], but the improved responsiveness is worth the effort for most web services.
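Assuming the Prometheus Adapter is already installed and exposes a per-pod request-rate metric (the name `http_requests_per_second` here is an assumption, not a standard), an RPS target replaces the CPU metric in the HPA spec like this:

```yaml
# HPA metrics fragment; assumes the Prometheus Adapter exposes
# an http_requests_per_second custom metric for the target pods.
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"  # scale so each pod handles roughly 100 req/s
```

With `AverageValue`, HPA divides the total metric across pods, so doubling traffic immediately doubles the desired replica count without waiting for CPU to climb.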
Queue depth is ideal for async workers. If your service pulls from a message queue, scale on how many messages are waiting. A growing backlog means you need more consumers—don’t wait for CPU to tell you what the queue already knows.
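A queue-depth trigger is simplest to express with KEDA. This is a minimal sketch assuming a RabbitMQ-backed worker; the Deployment name, queue name, target of 20 messages per replica, and the `RABBITMQ_HOST` environment variable are all illustrative:

```yaml
# KEDA ScaledObject sketch: scale a worker Deployment on queue backlog.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker            # Deployment consuming the queue
  minReplicaCount: 1
  maxReplicaCount: 30
  triggers:
    - type: rabbitmq
      metadata:
        queueName: jobs
        mode: QueueLength   # scale on messages waiting, not message rate
        value: "20"         # target ~20 queued messages per replica
        hostFromEnv: RABBITMQ_HOST
```

KEDA creates and manages the underlying HPA for you, so the asymmetric behavior tuning discussed below still applies to the object it generates.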
> When possible, scale on leading indicators (RPS, queue depth) rather than lagging ones (CPU, latency). By the time CPU spikes, requests are already queuing.
## Configuring Asymmetric Scaling
The key insight for HPA behavior configuration is that scale up and scale down should be asymmetric. When traffic spikes, you want capacity now—waiting costs customer experience. When traffic drops, you want to wait—scaling down too fast means you’ll scale right back up if traffic returns, wasting the pods you just terminated.
HPA v2 provides the behavior field for fine-grained control. Two mechanisms work together: stabilization windows prevent thrashing by requiring metrics to stay above or below thresholds for a duration before acting, and scaling policies limit how many replicas can change per time period.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
      selectPolicy: Min
```

The selectPolicy field determines how HPA chooses between multiple policies. Max uses the policy that results in the largest change—for scale up, this adds the most pods. Min uses the smallest change—for scale down, this removes the fewest pods. Disabled prevents scaling in that direction entirely.
With zero stabilization on scale up, HPA acts on the first evaluation showing high load. A 100% policy means you can double capacity every 15 seconds—going from 5 to 10 to 20 to 40 pods in under a minute if needed.
For scale down, 300 seconds of stabilization means HPA waits five minutes of consistently low metrics before removing pods. A 10% policy limits removal to one-tenth of current capacity per minute, preventing rapid scale down that you’d immediately regret.
Over-provisioning costs money. Under-provisioning loses customers. The asymmetric pattern—aggressive scale up, conservative scale down—fits most workloads because the cost of having extra capacity is usually lower than the cost of not having enough.
## Putting It Together
HPA tuning comes down to three things: understanding delays so you know what’s realistic, choosing metrics that respond to incoming load rather than existing damage, and configuring asymmetric behavior that prioritizes availability over cost optimization.
If your traffic is predictable—business hours, scheduled events, marketing campaigns—don’t rely on HPA to catch up. Pre-scale ahead of time using KEDA cron triggers or simple CronJobs. HPA’s job is handling unexpected variance around your baseline, not scrambling to meet traffic you knew was coming.
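KEDA's cron scaler expresses this directly. A sketch assuming a daily morning peak; the schedule, timezone, target Deployment name, and replica count are illustrative:

```yaml
# KEDA cron trigger sketch: pre-scale before a known daily peak.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-server-prescale
spec:
  scaleTargetRef:
    name: api-server
  triggers:
    - type: cron
      metadata:
        timezone: America/New_York
        start: 45 8 * * *      # scale up at 08:45, before the 09:00 rush
        end: 0 18 * * *        # release the extra capacity at 18:00
        desiredReplicas: "20"
```

Inside the cron window KEDA holds the floor at 20 replicas, while other triggers (or plain HPA metrics) can still scale above it if real traffic exceeds the forecast.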
Start with defaults, observe behavior under real load, and tune based on what you see. If you’re scaling too slowly, reduce stabilization windows and increase policy percentages. If you’re oscillating, increase stabilization and lower your target utilization. There’s no universal “best” configuration—only the configuration that matches your traffic pattern.
## Footnotes

[^1]: The Prometheus Adapter is a Kubernetes component that queries Prometheus for metrics and exposes them through the Kubernetes custom metrics API, allowing HPA to scale on any metric Prometheus collects.

[^2]: KEDA (Kubernetes Event-driven Autoscaling) extends HPA with scalers for dozens of event sources—message queues, databases, HTTP traffic, cron schedules—and enables scale-to-zero for event-driven workloads.