Your First Chaos Experiment: Start Today with kubectl
Introduction
I once watched a team spend four months evaluating enterprise chaos platforms. They built elaborate ROI presentations, negotiated enterprise licenses, and planned a sophisticated experiment program. Two weeks before their first scheduled experiment, a routine power test at the data center knocked out a PDU (power distribution unit)—and revealed that three services had hard dependencies on a fourth service that failed to restart automatically. The outage lasted six hours.
A fifteen-minute experiment killing a single pod would have found that bug.
Chaos engineering doesn’t require expensive platforms or dedicated teams. You can run meaningful experiments today with nothing more than `kubectl delete pod` and a hypothesis about what should happen when you press Enter.
This article covers what chaos engineering actually is (hint: it’s not random destruction), how to run your first experiment this week, and the three mistakes that turn experiments into outages.
What Chaos Engineering Actually Is
The term “chaos engineering” sounds destructive, which leads to a common misconception: that it’s about randomly breaking things to see what happens. That’s not chaos engineering—that’s just chaos.
Real chaos engineering follows the scientific method. You form a hypothesis about how your system should behave under specific failure conditions, you design a controlled experiment to test that hypothesis, and you observe whether reality matches your expectations. The goal isn’t to cause outages; it’s to build confidence that your system handles failures gracefully—or to learn exactly how it doesn’t.
| Chaos Engineering IS | Chaos Engineering IS NOT |
|---|---|
| Hypothesis-driven experimentation | Random destruction |
| Controlled failure injection | Breaking things for fun |
| Learning about system behavior | Proving you can cause outages |
| Building confidence | Testing in production blindly |
The key insight: without a hypothesis, you’re just breaking things. If you kill a pod and the service degrades, what did you learn? Was that expected? Was it acceptable? Without a hypothesis—“the service should recover within 30 seconds with no user-visible errors”—you can’t answer those questions. You can’t distinguish expected behavior from bugs, and you won’t know if your system got better or worse.
Your First Experiment: Pod Termination
The simplest and most valuable first experiment is killing a pod. If you’ve never run a chaos experiment, start here—you can do it this week with tools you already have.
Hypothesis: When a single pod is terminated, Kubernetes will automatically restart it and the service will continue handling requests with no more than 2 minutes of degraded availability.
Prerequisites:
- Your deployment has replicas > 1 (so traffic continues on surviving pods during recovery)
- Health checks (liveness and readiness probes) are configured
- You have monitoring in place to observe the experiment
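The first two prerequisites can be checked mechanically before you touch anything. Here is a minimal pre-flight sketch; the deployment name `myservice` is a placeholder, and the helper just gates on the two values you'd pull from the deployment spec:

```shell
#!/bin/bash
# Pre-flight gate: succeed only if replicas > 1 and a readiness probe exists.
check_prereqs() {
  # $1 = replica count, $2 = readinessProbe spec ("" if missing)
  [ "${1:-0}" -gt 1 ] && [ -n "$2" ]
}

# Against a live cluster ('myservice' is a placeholder deployment name):
#   REPLICAS=$(kubectl get deploy myservice -o jsonpath='{.spec.replicas}')
#   PROBES=$(kubectl get deploy myservice \
#     -o jsonpath='{.spec.template.spec.containers[*].readinessProbe}')
#   check_prereqs "$REPLICAS" "$PROBES" || echo "NOT READY" >&2

check_prereqs 3 '{"httpGet":{"path":"/healthz"}}' && echo "OK"
```

If the gate fails, fix the deployment first; killing the only replica of a probe-less service is an outage, not an experiment.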
Steps:
- Record baseline metrics for 5 minutes (error rate, latency, throughput)
- Delete one pod: `kubectl delete pod <pod-name>`
- Observe recovery for 5 minutes, watching for new pod scheduling, health checks passing, and traffic resuming
- Compare post-experiment metrics to baseline
Success criteria:
- New pod running within 60 seconds
- No 5xx errors during recovery
- Latency returned to baseline within 2 minutes
Here’s the kubectl command to kill one random pod from a deployment and watch the recovery:
```bash
#!/bin/bash
# Kill one pod and watch Kubernetes recover
# Replace 'myservice' with your app label

# Get the name of one running pod
POD=$(kubectl get pods -l app=myservice -o jsonpath='{.items[0].metadata.name}')

# Delete it and immediately watch the recovery
kubectl delete pod "$POD" && kubectl get pods -l app=myservice -w
```

What teams commonly discover from this experiment:
- Slow container startup delays recovery. If your container takes 45 seconds to start, that’s 45 seconds of degraded capacity. Optimize image size, add startup probes, or adjust replica counts.
- Missing readiness probes cause traffic to unhealthy pods. Without a readiness probe, Kubernetes sends traffic to pods that aren’t ready to handle it, causing errors during startup.
- No alerts fired despite degradation. This is the “aha” moment—your monitoring gap was invisible until you tested it.
These findings alone justify the experiment. And you can run it in staging today, with no budget approval, no platform purchase, and no dedicated chaos team.
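The first discovery, slow startup, is easy to put a number on: the gap between the replacement pod's creation timestamp and its Ready transition. A sketch, assuming GNU `date` and the placeholder label `app=myservice`:

```shell
#!/bin/bash
# Seconds between two RFC 3339 timestamps (GNU date assumed)
recovery_seconds() {
  echo $(( $(date -d "$2" +%s) - $(date -d "$1" +%s) ))
}

# Against a live cluster ('app=myservice' is a placeholder label):
#   POD=$(kubectl get pods -l app=myservice -o jsonpath='{.items[0].metadata.name}')
#   CREATED=$(kubectl get pod "$POD" -o jsonpath='{.metadata.creationTimestamp}')
#   READY=$(kubectl get pod "$POD" \
#     -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}')
#   echo "Pod became Ready in $(recovery_seconds "$CREATED" "$READY")s"

recovery_seconds 2024-01-01T00:00:00Z 2024-01-01T00:00:45Z  # prints 45
```

If that number is larger than your success criterion allows, you have a concrete target for image slimming or a startup probe.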
Three Mistakes That Turn Experiments Into Incidents
Now that you know how to run an experiment, here’s how to avoid turning it into an incident. These three mistakes are the difference between controlled learning and self-inflicted outages.
Mistake 1: No Hypothesis
I covered this earlier, but it bears repeating: “let’s see what happens” isn’t an experiment. Write down what you expect before you start, or you won’t know if what you observed was a bug or expected behavior.
Mistake 2: No Abort Conditions
Running an experiment with the plan “watch and manually stop if things look bad” is a recipe for turning experiments into incidents. Humans are slow to react, especially when they’re not sure if what they’re seeing is expected behavior or a problem.
Define abort conditions before you start: error rate > 5%, p99 latency > 5 seconds, any customer complaint. Ideally these trigger automatic rollback. At minimum, write them down so you know when to pull the plug.
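Those thresholds are easy to encode as a tiny gate you can call from a watcher loop. A sketch, using the article's example thresholds; the metrics query in the comment is hypothetical, so wire in your own backend:

```shell
#!/bin/bash
# Abort gate for the thresholds above: error rate > 5% or p99 > 5 seconds.
should_abort() {
  # $1 = error rate in percent, $2 = p99 latency in ms (decimals OK)
  awk -v e="$1" -v p="$2" 'BEGIN { exit !(e > 5 || p > 5000) }'
}

# A watcher would poll your metrics backend every few seconds, e.g.:
#   while sleep 10; do
#     ERR=$(...query your metrics backend...)
#     P99=$(...query your metrics backend...)
#     should_abort "$ERR" "$P99" && { echo "ABORT: rolling back"; break; }
#   done

should_abort 6 100 && echo "ABORT"     # 6% error rate trips the gate
should_abort 1 100 || echo "continue"  # within thresholds
```

Even if the rollback itself stays manual, an unambiguous trip wire removes the "is this expected?" hesitation that turns experiments into incidents.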
Mistake 3: No Follow-Through
The worst pattern: find bugs, document them, move on. Six months later, someone finds the document and asks why none of the bugs were fixed.
Chaos without fixes is expensive documentation. The workflow must be: find bugs → document → create tickets → fix → re-run experiment to verify the fix. If you’re finding bugs but not fixing them, you’re wasting effort and building false confidence.
The value of chaos engineering is not in finding problems—it’s in fixing them. An experiment without follow-through is worse than no experiment because it creates false confidence.
Conclusion
You can start chaos engineering today. Pick a service. Form a hypothesis about what happens when you kill one pod. Open your monitoring dashboard. Run `kubectl delete pod`. Watch what happens.
That’s a real chaos experiment—no enterprise platform required.
Once you’re comfortable with pod termination, the natural next steps are latency injection (what happens when your database responds slowly?) and network partitions (what happens when your cache is unreachable?). These three failure modes cover the vast majority of real-world incidents.
The barrier to starting isn’t tooling or budget. It’s deciding to run that first experiment.