HPA Autoscaling: Signals, Delays, and Traps


An e-commerce team configures HPA with a 50% CPU target. During a flash sale, traffic spikes 10x in 30 seconds. HPA takes 15 seconds to detect the load, another 15 seconds for its evaluation cycle, then the stabilization window kicks in. Meanwhile, new pods need scheduling, image pulls, and readiness probes. By the time capacity catches up—3+ minutes later—frustrated users have already left.

I’ve watched this play out repeatedly. The Horizontal Pod Autoscaler looks deceptively simple: set a target CPU percentage, and Kubernetes scales your pods automatically. In practice, teams discover that HPA reacts too slowly to traffic spikes, oscillates between scaling up and down, or scales on entirely the wrong signals.

The problem is that HPA is reactive, not predictive. By the time it decides to scale, your workload is already under stress. Tuning HPA means understanding where delays come from, choosing metrics that respond quickly, and configuring behavior that matches your traffic patterns.

The Delay Problem

The HPA controller runs every 15 seconds by default, querying the metrics server and calculating how many replicas are needed. That sounds fast, but the total time from “traffic spike begins” to “new capacity receives traffic” is much longer. Every step in the pipeline adds latency.

| Delay Component | Typical Duration | Notes |
| --- | --- | --- |
| Metrics scrape | 15-60s | Metrics Server polls kubelets at this interval |
| HPA sync | 15s | Controller evaluation cycle |
| Stabilization | 0-300s | Prevents thrashing; configurable |
| Pod scheduling | 1-10s | Depends on cluster capacity |
| Pod startup | 5-60s+ | Application initialization time |
| Readiness probe | 5-30s | Before pod receives traffic |

HPA delay components.

These delays stack. The metrics scrape and HPA sync cycles run independently—they’re not synchronized. In the worst case, a CPU spike happens right after a scrape, waits nearly a full interval to be captured, then waits again for HPA to evaluate. Add kubelet collection overhead (~10 seconds), and you’re looking at 85+ seconds before HPA even decides to scale—before stabilization, scheduling, or startup.

With aggressive tuning and cached images, the timeline improves but still isn’t instant. Here’s a typical tuned scenario:

  • T+0s: Traffic spike begins
  • T+15s: Metrics server captures the load
  • T+30s: HPA evaluates and decides to scale
  • T+45s: Stabilization check passes
  • T+50s: New pods scheduled
  • T+80s: Pods pass readiness probe, receive traffic

That’s 80+ seconds of degraded performance under favorable conditions. With default settings—including a 300-second stabilization window in older Kubernetes versions—it’s significantly worse.

Info callout:

Even with aggressive tuning, expect 30-45 seconds minimum from spike to new capacity. If your traffic can increase faster than that, HPA alone won’t save you. You’ll need pre-scaling for known events, higher baseline capacity, or request queuing to absorb the delay.

Choosing the Right Metric

Most teams start with CPU utilization because it’s built-in. But CPU is a lagging indicator: by the time CPU spikes, requests are already queuing and users are already waiting. For web services, the metric you choose determines whether HPA responds to incoming load or to damage already done.

Leading indicators like requests per second and queue depth increase the moment traffic arrives—before your system shows stress. Lagging indicators like CPU utilization and response latency rise only after work is already backing up. When possible, scale on what’s coming, not what’s already hurting.

| Metric | Type | Best For |
| --- | --- | --- |
| Requests per second | Leading | Web APIs, high-traffic services |
| Queue depth | Leading | Async workers, background job processors |
| CPU utilization | Lagging | Compute-bound workloads, synchronous processing |
| Response latency | Lagging | SLO-driven scaling, user-facing services |

Scaling metrics by response type.

CPU works well for compute-bound workloads where processing time scales linearly with CPU. But for I/O-bound services—anything waiting on databases, external APIs, or message queues—CPU stays low while requests pile up. By the time CPU rises, you’re already degraded.


Requests per second scales based on incoming traffic regardless of how much work each request requires. It’s a direct measure of load, and it increases immediately when traffic arrives. The tradeoff is that it requires custom metrics setup through the Prometheus Adapter¹ or KEDA², but the improved responsiveness is worth the effort for most web services.
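Once a per-pod request-rate metric is exposed through the custom metrics API, the HPA manifest changes only in its `metrics` section. A sketch, assuming the adapter exposes a metric named `http_requests_per_second` (the metric name and the 100 RPS target are illustrative and depend on your adapter configuration):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-rps-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second  # illustrative; must match the adapter's exposed metric name
        target:
          type: AverageValue
          averageValue: "100"  # add replicas so each pod averages ~100 RPS
```

With `AverageValue`, HPA divides total RPS across pods and scales to keep the per-pod average at the target, so the replica count tracks traffic directly rather than CPU pressure.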

Queue depth is ideal for async workers. If your service pulls from a message queue, scale on how many messages are waiting. A growing backlog means you need more consumers—don’t wait for CPU to tell you what the queue already knows.
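With KEDA, queue-depth scaling is a `ScaledObject` pointing at the worker Deployment. A sketch using the RabbitMQ scaler (the queue name, target value, and connection env var are assumptions for illustration; other queue systems have equivalent scalers):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: queue-worker  # Deployment running the consumers (illustrative name)
  minReplicaCount: 1
  maxReplicaCount: 30
  triggers:
    - type: rabbitmq
      metadata:
        queueName: jobs
        mode: QueueLength
        value: "20"  # target ~20 waiting messages per replica
        hostFromEnv: RABBITMQ_URL  # connection string read from the workload's env
```

KEDA manages an HPA under the hood, so the asymmetric `behavior` tuning discussed below still applies to the generated autoscaler.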

Success callout:

When possible, scale on leading indicators (RPS, queue depth) rather than lagging ones (CPU, latency). By the time CPU spikes, requests are already queuing.

Configuring Asymmetric Scaling

The key insight for HPA behavior configuration is that scale up and scale down should be asymmetric. When traffic spikes, you want capacity now—waiting costs customer experience. When traffic drops, you want to wait—scaling down too fast means you’ll scale right back up if traffic returns, wasting the pods you just terminated.

HPA v2 provides the behavior field for fine-grained control. Two mechanisms work together: stabilization windows prevent thrashing by requiring metrics to stay above or below thresholds for a duration before acting, and scaling policies limit how many replicas can change per time period.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
      selectPolicy: Min
Asymmetric HPA behavior—aggressive scale up, conservative scale down.

The selectPolicy field determines how HPA chooses between multiple policies. Max uses the policy that results in the largest change—for scale up, this adds the most pods. Min uses the smallest change—for scale down, this removes the fewest pods. Disabled prevents scaling in that direction entirely.

With zero stabilization on scale up, HPA acts on the first evaluation showing high load. A 100% policy means you can double capacity every 15 seconds—going from 5 to 10 to 20 to 40 pods in under a minute if needed.

For scale down, 300 seconds of stabilization means HPA waits five minutes of consistently low metrics before removing pods. A 10% policy limits removal to one-tenth of current capacity per minute, preventing rapid scale down that you’d immediately regret.

Info callout:

Over-provisioning costs money. Under-provisioning loses customers. The asymmetric pattern—aggressive scale up, conservative scale down—fits most workloads because the cost of having extra capacity is usually lower than the cost of not having enough.

Putting It Together

HPA tuning comes down to three things: understanding delays so you know what’s realistic, choosing metrics that respond to incoming load rather than existing damage, and configuring asymmetric behavior that prioritizes availability over cost optimization.


If your traffic is predictable—business hours, scheduled events, marketing campaigns—don’t rely on HPA to catch up. Pre-scale ahead of time using KEDA cron triggers or simple CronJobs. HPA’s job is handling unexpected variance around your baseline, not scrambling to meet traffic you knew was coming.
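For schedule-driven pre-scaling, KEDA's cron scaler raises the floor before traffic arrives and lowers it afterward. A sketch, assuming weekday business-hours traffic (the schedule, timezone, and replica counts are placeholders to adapt):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-server-prescale
spec:
  scaleTargetRef:
    name: api-server
  minReplicaCount: 3  # off-hours baseline
  triggers:
    - type: cron
      metadata:
        timezone: America/New_York
        start: 0 8 * * 1-5   # 08:00 weekdays, before traffic ramps
        end: 0 18 * * 1-5    # 18:00 weekdays
        desiredReplicas: "20"
```

During the cron window, capacity is already in place; outside it, HPA handles unexpected variance around the lower baseline.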

Start with defaults, observe behavior under real load, and tune based on what you see. If you’re scaling too slowly, reduce stabilization windows and increase policy percentages. If you’re oscillating, increase stabilization and lower your target utilization. There’s no universal “best” configuration—only the configuration that matches your traffic pattern.

Footnotes

  1. The Prometheus Adapter is a Kubernetes component that queries Prometheus for metrics and exposes them through the Kubernetes custom metrics API, allowing HPA to scale on any metric Prometheus collects. ↩

  2. KEDA (Kubernetes Event-driven Autoscaling) extends HPA with scalers for dozens of event sources—message queues, databases, HTTP traffic, cron schedules—and enables scale-to-zero for event-driven workloads. ↩
