# HPA Autoscaling: Signals, Delays, and Traps
An e-commerce team configures HPA with a 50% CPU target. During a flash sale, traffic spikes 10x in 30 seconds. HPA takes 15 seconds to detect the load, another 15 seconds for its evaluation cycle, then the stabilization window kicks in. Meanwhile, new pods need scheduling, image pulls, and readiness probes. By the time capacity catches up—3+ minutes later—frustrated users have already left.
I’ve watched this play out repeatedly. The Horizontal Pod Autoscaler looks deceptively simple: set a target CPU percentage, and Kubernetes scales your pods automatically. In practice, teams discover that HPA reacts too slowly to traffic spikes, oscillates between scaling up and down, or scales on entirely the wrong signals.
The problem is that HPA is reactive, not predictive. By the time it decides to scale, your workload is already under stress. Tuning HPA means understanding where delays come from, choosing metrics that respond quickly, and configuring behavior that matches your traffic patterns.
## The Delay Problem
The HPA controller runs every 15 seconds by default, querying the metrics server and calculating how many replicas are needed. That sounds fast, but the total time from “traffic spike begins” to “new capacity receives traffic” is much longer. Every step in the pipeline adds latency.
| Delay Component | Typical Duration | Notes |
|---|---|---|
| Metrics scrape | 15-60s | Metrics Server polls kubelets at this interval |
| HPA sync | 15s | Controller evaluation cycle |
| Stabilization | 0-300s | Prevents thrashing; configurable |
| Pod scheduling | 1-10s | Depends on cluster capacity |
| Pod startup | 5-60s+ | Application initialization time |
| Readiness probe | 5-30s | Before pod receives traffic |
These delays stack. The metrics scrape and HPA sync cycles run independently—they’re not synchronized. In the worst case, a CPU spike happens right after a scrape, waits nearly a full interval to be captured, then waits again for HPA to evaluate. Add kubelet collection overhead (~10 seconds), and you’re looking at 85+ seconds before HPA even decides to scale—before stabilization, scheduling, or startup.
With aggressive tuning and cached images, the timeline improves but still isn’t instant. Here’s a typical tuned scenario:
- T+0s: Traffic spike begins
- T+15s: Metrics server captures the load
- T+30s: HPA evaluates and decides to scale
- T+45s: Stabilization check passes
- T+50s: New pods scheduled
- T+80s: Pods pass readiness probe, receive traffic
That’s 80+ seconds of degraded performance under favorable conditions. With default settings—including a 300-second stabilization window in older Kubernetes versions—it’s significantly worse.
Even with aggressive tuning, expect 30-45 seconds minimum from spike to new capacity. If your traffic can increase faster than that, HPA alone won’t save you. You’ll need pre-scaling for known events, higher baseline capacity, or request queuing to absorb the delay.
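One delay component you control directly is probe timing. This sketch tightens a readiness probe so a started pod receives traffic as soon as possible; the container name, image, and `/healthz` endpoint are illustrative, not from any particular application:

```yaml
# Container spec fragment for a Deployment; names and endpoint are illustrative.
containers:
  - name: api-server
    image: example.com/api-server:1.0  # a cached image avoids pull delay on scale-up
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 2   # no longer than startup actually takes
      periodSeconds: 3         # probe frequently so ready pods join the pool fast
      failureThreshold: 2
```

With the defaults (`initialDelaySeconds: 0`, `periodSeconds: 10`), a pod that becomes ready just after a probe can sit idle for up to 10 seconds before the next check, so tightening `periodSeconds` alone can shave several seconds off the tail of the scaling timeline.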
## Choosing the Right Metric
Most teams start with CPU utilization because it’s built-in. But CPU is a lagging indicator: by the time CPU spikes, requests are already queuing and users are already waiting. For web services, the metric you choose determines whether HPA responds to incoming load or to damage already done.
Leading indicators like requests per second and queue depth increase the moment traffic arrives—before your system shows stress. Lagging indicators like CPU utilization and response latency rise only after work is already backing up. When possible, scale on what’s coming, not what’s already hurting.
| Metric | Type | Best For |
|---|---|---|
| Requests per second | Leading | Web APIs, high-traffic services |
| Queue depth | Leading | Async workers, background job processors |
| CPU utilization | Lagging | Compute-bound workloads, synchronous processing |
| Response latency | Lagging | SLO-driven scaling, user-facing services |
CPU works well for compute-bound workloads where processing time scales linearly with CPU. But for I/O-bound services—anything waiting on databases, external APIs, or message queues—CPU stays low while requests pile up. By the time CPU rises, you’re already degraded.
Requests per second scales based on incoming traffic regardless of how much work each request requires. It’s a direct measure of load, and it increases immediately when traffic arrives. The tradeoff is that it requires custom metrics setup through the Prometheus Adapter[^1] or KEDA[^2], but the improved responsiveness is worth the effort for most web services.
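Assuming the Prometheus Adapter is already installed and exposes a per-pod request-rate metric (the name `http_requests_per_second` here is an assumption, not a standard), an RPS target replaces the CPU metric in the HPA spec like this:

```yaml
# HPA metrics fragment; assumes the Prometheus Adapter exposes
# an http_requests_per_second custom metric for the target pods.
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"  # scale so each pod handles roughly 100 req/s
```

With `AverageValue`, HPA divides the total metric across pods, so doubling traffic immediately doubles the desired replica count without waiting for CPU to climb.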
Queue depth is ideal for async workers. If your service pulls from a message queue, scale on how many messages are waiting. A growing backlog means you need more consumers—don’t wait for CPU to tell you what the queue already knows.
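A queue-depth trigger is simplest to express with KEDA. This is a minimal sketch assuming a RabbitMQ-backed worker; the Deployment name, queue name, target of 20 messages per replica, and the `RABBITMQ_HOST` environment variable are all illustrative:

```yaml
# KEDA ScaledObject sketch: scale a worker Deployment on queue backlog.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker            # Deployment consuming the queue
  minReplicaCount: 1
  maxReplicaCount: 30
  triggers:
    - type: rabbitmq
      metadata:
        queueName: jobs
        mode: QueueLength   # scale on messages waiting, not message rate
        value: "20"         # target ~20 queued messages per replica
        hostFromEnv: RABBITMQ_HOST
```

KEDA creates and manages the underlying HPA for you, so the asymmetric behavior tuning discussed below still applies to the object it generates.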
> When possible, scale on leading indicators (RPS, queue depth) rather than lagging ones (CPU, latency). By the time CPU spikes, requests are already queuing.
## Configuring Asymmetric Scaling
The key insight for HPA behavior configuration is that scale up and scale down should be asymmetric. When traffic spikes, you want capacity now—waiting costs customer experience. When traffic drops, you want to wait—scaling down too fast means you’ll scale right back up if traffic returns, wasting the pods you just terminated.
HPA v2 provides the behavior field for fine-grained control. Two mechanisms work together: stabilization windows prevent thrashing by requiring metrics to stay above or below thresholds for a duration before acting, and scaling policies limit how many replicas can change per time period.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
      selectPolicy: Min
```

The selectPolicy field determines how HPA chooses between multiple policies. Max uses the policy that results in the largest change—for scale up, this adds the most pods. Min uses the smallest change—for scale down, this removes the fewest pods. Disabled prevents scaling in that direction entirely.
With zero stabilization on scale up, HPA acts on the first evaluation showing high load. A 100% policy means you can double capacity every 15 seconds—going from 5 to 10 to 20 to 40 pods in under a minute if needed.
For scale down, 300 seconds of stabilization means HPA waits five minutes of consistently low metrics before removing pods. A 10% policy limits removal to one-tenth of current capacity per minute, preventing rapid scale down that you’d immediately regret.
Over-provisioning costs money. Under-provisioning loses customers. The asymmetric pattern—aggressive scale up, conservative scale down—fits most workloads because the cost of having extra capacity is usually lower than the cost of not having enough.
## Putting It Together
HPA tuning comes down to three things: understanding delays so you know what’s realistic, choosing metrics that respond to incoming load rather than existing damage, and configuring asymmetric behavior that prioritizes availability over cost optimization.
If your traffic is predictable—business hours, scheduled events, marketing campaigns—don’t rely on HPA to catch up. Pre-scale ahead of time using KEDA cron triggers or simple CronJobs. HPA’s job is handling unexpected variance around your baseline, not scrambling to meet traffic you knew was coming.
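KEDA's cron scaler expresses this directly. A sketch assuming a daily morning peak; the schedule, timezone, target Deployment name, and replica count are illustrative:

```yaml
# KEDA cron trigger sketch: pre-scale before a known daily peak.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-server-prescale
spec:
  scaleTargetRef:
    name: api-server
  triggers:
    - type: cron
      metadata:
        timezone: America/New_York
        start: 45 8 * * *      # scale up at 08:45, before the 09:00 rush
        end: 0 18 * * *        # release the extra capacity at 18:00
        desiredReplicas: "20"
```

Inside the cron window KEDA holds the floor at 20 replicas, while other triggers (or plain HPA metrics) can still scale above it if real traffic exceeds the forecast.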
Start with defaults, observe behavior under real load, and tune based on what you see. If you’re scaling too slowly, reduce stabilization windows and increase policy percentages. If you’re oscillating, increase stabilization and lower your target utilization. There’s no universal “best” configuration—only the configuration that matches your traffic pattern.
## Footnotes

[^1]: The Prometheus Adapter is a Kubernetes component that queries Prometheus for metrics and exposes them through the Kubernetes custom metrics API, allowing HPA to scale on any metric Prometheus collects.

[^2]: KEDA (Kubernetes Event-driven Autoscaling) extends HPA with scalers for dozens of event sources—message queues, databases, HTTP traffic, cron schedules—and enables scale-to-zero for event-driven workloads.