Your First Chaos Experiment: Start Today with kubectl
Introduction
I once watched a team spend four months evaluating enterprise chaos platforms. They built elaborate ROI presentations, negotiated enterprise licenses, and planned a sophisticated experiment program. Two weeks before their first scheduled experiment, a routine power test at the data center knocked out a PDU (power distribution unit)—and revealed that three services had hard dependencies on a fourth service that failed to restart automatically. The outage lasted six hours.
A fifteen-minute experiment killing a single pod would have found that bug.
Chaos engineering doesn’t require expensive platforms or dedicated teams. You can run meaningful experiments today with nothing more than `kubectl delete pod` and a hypothesis about what should happen when you press Enter.
This article covers what chaos engineering actually is (hint: it’s not random destruction), how to run your first experiment this week, and the three mistakes that turn experiments into outages.
What Chaos Engineering Actually Is
The term “chaos engineering” sounds destructive, which leads to a common misconception: that it’s about randomly breaking things to see what happens. That’s not chaos engineering—that’s just chaos.
Real chaos engineering follows the scientific method. You form a hypothesis about how your system should behave under specific failure conditions, you design a controlled experiment to test that hypothesis, and you observe whether reality matches your expectations. The goal isn’t to cause outages; it’s to build confidence that your system handles failures gracefully—or to learn exactly how it doesn’t.
| Chaos Engineering IS | Chaos Engineering IS NOT |
|---|---|
| Hypothesis-driven experimentation | Random destruction |
| Controlled failure injection | Breaking things for fun |
| Learning about system behavior | Proving you can cause outages |
| Building confidence | Testing in production blindly |
The key insight: without a hypothesis, you’re just breaking things. If you kill a pod and the service degrades, what did you learn? Was that expected? Was it acceptable? Without a hypothesis—“the service should recover within 30 seconds with no user-visible errors”—you can’t answer those questions. You can’t distinguish expected behavior from bugs, and you won’t know if your system got better or worse.
Your First Experiment: Pod Termination
The simplest and most valuable first experiment is killing a pod. If you’ve never run a chaos experiment, start here—you can do it this week with tools you already have.
Hypothesis: When a single pod is terminated, Kubernetes will automatically restart it and the service will continue handling requests with no more than 2 minutes of degraded availability.
Prerequisites:
- Your deployment has replicas > 1 (so traffic continues on surviving pods during recovery)
- Health checks (liveness and readiness probes) are configured
- You have monitoring in place to observe the experiment
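The first two prerequisites can be checked mechanically before you touch anything. Here is a minimal pre-flight sketch; the deployment name `myservice` is a placeholder, and the helper just gates on the two values you'd pull from the deployment spec:

```shell
#!/bin/bash
# Pre-flight gate: succeed only if replicas > 1 and a readiness probe exists.
check_prereqs() {
  # $1 = replica count, $2 = readinessProbe spec ("" if missing)
  [ "${1:-0}" -gt 1 ] && [ -n "$2" ]
}

# Against a live cluster ('myservice' is a placeholder deployment name):
#   REPLICAS=$(kubectl get deploy myservice -o jsonpath='{.spec.replicas}')
#   PROBES=$(kubectl get deploy myservice \
#     -o jsonpath='{.spec.template.spec.containers[*].readinessProbe}')
#   check_prereqs "$REPLICAS" "$PROBES" || echo "NOT READY" >&2

check_prereqs 3 '{"httpGet":{"path":"/healthz"}}' && echo "OK"
```

If the gate fails, fix the deployment first; killing the only replica of a probe-less service is an outage, not an experiment.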
Steps:
- Record baseline metrics for 5 minutes (error rate, latency, throughput)
- Delete one pod: `kubectl delete pod <pod-name>`
- Observe recovery for 5 minutes, watching for new pod scheduling, health checks passing, and traffic resuming
- Compare post-experiment metrics to baseline
Success criteria:
- New pod running within 60 seconds
- No 5xx errors during recovery
- Latency returned to baseline within 2 minutes
Here’s the kubectl command to kill one random pod from a deployment and watch the recovery:
```bash
#!/bin/bash
# Kill one pod and watch Kubernetes recover
# Replace 'myservice' with your app label

# Get the name of one running pod
POD=$(kubectl get pods -l app=myservice -o jsonpath='{.items[0].metadata.name}')

# Delete it and immediately watch the recovery
kubectl delete pod "$POD" && kubectl get pods -l app=myservice -w
```

What teams commonly discover from this experiment:
- Slow container startup delays recovery. If your container takes 45 seconds to start, that’s 45 seconds of degraded capacity. Optimize image size, add startup probes, or adjust replica counts.
- Missing readiness probes cause traffic to unhealthy pods. Without a readiness probe, Kubernetes sends traffic to pods that aren’t ready to handle it, causing errors during startup.
- No alerts fired despite degradation. This is the “aha” moment—your monitoring gap was invisible until you tested it.
These findings alone justify the experiment. And you can run it in staging today, with no budget approval, no platform purchase, and no dedicated chaos team.
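The first discovery, slow startup, is easy to put a number on: the gap between the replacement pod's creation timestamp and its Ready transition. A sketch, assuming GNU `date` and the placeholder label `app=myservice`:

```shell
#!/bin/bash
# Seconds between two RFC 3339 timestamps (GNU date assumed)
recovery_seconds() {
  echo $(( $(date -d "$2" +%s) - $(date -d "$1" +%s) ))
}

# Against a live cluster ('app=myservice' is a placeholder label):
#   POD=$(kubectl get pods -l app=myservice -o jsonpath='{.items[0].metadata.name}')
#   CREATED=$(kubectl get pod "$POD" -o jsonpath='{.metadata.creationTimestamp}')
#   READY=$(kubectl get pod "$POD" \
#     -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}')
#   echo "Pod became Ready in $(recovery_seconds "$CREATED" "$READY")s"

recovery_seconds 2024-01-01T00:00:00Z 2024-01-01T00:00:45Z  # prints 45
```

If that number is larger than your success criterion allows, you have a concrete target for image slimming or a startup probe.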
Three Mistakes That Turn Experiments Into Incidents
Now that you know how to run an experiment, here’s how to avoid turning it into an incident. These three mistakes are the difference between controlled learning and self-inflicted outages.
Mistake 1: No Hypothesis
I covered this earlier, but it bears repeating: “let’s see what happens” isn’t an experiment. Write down what you expect before you start, or you won’t know if what you observed was a bug or expected behavior.
Mistake 2: No Abort Conditions
Running an experiment with the plan “watch and manually stop if things look bad” is a recipe for turning experiments into incidents. Humans are slow to react, especially when they’re not sure if what they’re seeing is expected behavior or a problem.
Define abort conditions before you start: error rate > 5%, p99 latency > 5 seconds, any customer complaint. Ideally these trigger automatic rollback. At minimum, write them down so you know when to pull the plug.
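Those thresholds are easy to encode as a tiny gate you can call from a watcher loop. A sketch, using the article's example thresholds; the metrics query in the comment is hypothetical, so wire in your own backend:

```shell
#!/bin/bash
# Abort gate for the thresholds above: error rate > 5% or p99 > 5 seconds.
should_abort() {
  # $1 = error rate in percent, $2 = p99 latency in ms (decimals OK)
  awk -v e="$1" -v p="$2" 'BEGIN { exit !(e > 5 || p > 5000) }'
}

# A watcher would poll your metrics backend every few seconds, e.g.:
#   while sleep 10; do
#     ERR=$(...query your metrics backend...)
#     P99=$(...query your metrics backend...)
#     should_abort "$ERR" "$P99" && { echo "ABORT: rolling back"; break; }
#   done

should_abort 6 100 && echo "ABORT"     # 6% error rate trips the gate
should_abort 1 100 || echo "continue"  # within thresholds
```

Even if the rollback itself stays manual, an unambiguous trip wire removes the "is this expected?" hesitation that turns experiments into incidents.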
Mistake 3: No Follow-Through
The worst pattern: find bugs, document them, move on. Six months later, someone finds the document and asks why none of the bugs were fixed.
Chaos without fixes is expensive documentation. The workflow must be: find bugs → document → create tickets → fix → re-run experiment to verify the fix. If you’re finding bugs but not fixing them, you’re wasting effort and building false confidence.
The value of chaos engineering is not in finding problems—it’s in fixing them. An experiment without follow-through is worse than no experiment because it creates false confidence.
Conclusion
You can start chaos engineering today. Pick a service. Form a hypothesis about what happens when you kill one pod. Open your monitoring dashboard. Run `kubectl delete pod`. Watch what happens.
That’s a real chaos experiment—no enterprise platform required.
Once you’re comfortable with pod termination, the natural next steps are latency injection (what happens when your database responds slowly?) and network partitions (what happens when your cache is unreachable?). These three failure modes cover the vast majority of real-world incidents.
The barrier to starting isn’t tooling or budget. It’s deciding to run that first experiment.