Kubernetes StatefulSets: Managing Stateful Applications
My first attempt to run PostgreSQL on Kubernetes used a Deployment. Three replicas, shared PVC, fingers crossed. It worked great until the first pod restart — at which point three Postgres instances all thought they were the primary, our data directory looked like modern art, and I gained a healthy respect for workloads that have opinions about identity.
Stateful applications remember things. They care about hostname. They expect storage to follow them around. They want to start up in a predictable order. Deployments treat every pod like an interchangeable intern. StatefulSets treat each pod like a named employee with a desk and a filing cabinet.
If you’re running databases, message brokers, or anything that writes to disk and gossips with peers, you want StatefulSets. Here’s what we figured out deploying them in production on Kubernetes 1.11+.
Deployments vs StatefulSets: Know Your Pod’s Personality
| | Deployment | StatefulSet |
|---|---|---|
| Pod names | Random hash suffix | Stable ordinal: web-0, web-1 |
| Storage | Shared or none | Dedicated PVC per pod |
| Startup order | All at once | Sequential (by default) |
| Network identity | Ephemeral | Stable DNS per pod |
| Best for | Web APIs, workers | Databases, queues, clusters |
The mental model: Deployments scale out stateless work. StatefulSets scale out stateful work. Mixing them up is how you get data corruption and 3 AM pages.
Your First StatefulSet (With Storage That Sticks)
This nginx example is deliberately simple — the interesting part is volumeClaimTemplates, which gives each pod its own PersistentVolumeClaim:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "web"
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:1.15
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "fast-ssd"
      resources:
        requests:
          storage: 10Gi
When this creates three pods, Kubernetes also creates three PVCs: www-web-0, www-web-1, www-web-2. Kill web-1, and when it comes back, it reattaches to the same volume. Your data doesn’t wander off.
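The naming pattern is predictable enough to script against. Here is a minimal sketch in plain shell that prints the claim names for a given template, StatefulSet, and replica count. The helper name pvc_names is ours for illustration, not a kubectl feature.

```shell
# Print the PVC names a StatefulSet creates, one per line.
# Pattern: <volumeClaimTemplate name>-<StatefulSet name>-<ordinal>
pvc_names() {
  template=$1
  statefulset=$2
  replicas=$3
  i=0
  while [ "$i" -lt "$replicas" ]; do
    echo "${template}-${statefulset}-${i}"
    i=$((i + 1))
  done
}

# Prints www-web-0, www-web-1, www-web-2 (one per line)
pvc_names www web 3
```

The output lines can be compared against kubectl get pvc when you want to confirm every pod has the claim you expect.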
The Headless Service: DNS Names That Actually Mean Something
StatefulSets need a headless Service (clusterIP: None), referenced by spec.serviceName, to publish a DNS record for each pod. Without it, pods still get stable hostnames, but peers have no DNS name to find them by. With it, each pod is reachable at a predictable DNS name:
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  clusterIP: None # Headless service
  selector:
    app: web
  ports:
  - port: 80
    name: web
Each pod gets:
- Stable hostname: web-0, web-1, web-2
- Stable DNS: web-0.web.default.svc.cluster.local
# Pods are accessible by name
ping web-0.web
ping web-1.web
ping web-2.web
This is non-negotiable for clustered databases. “Connect to whoever is available” doesn’t work when you’re setting up replication and one node needs to know it’s postgres-0 and the other is postgres-1.
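Because the names follow a fixed pattern, a bootstrap script can build its peer list without asking the API server. This is a rough sketch under a few assumptions: the cluster domain is the common default cluster.local, the namespace defaults to default, and the helper names pod_fqdn and peer_list are ours.

```shell
# Build the stable DNS name for one StatefulSet pod:
#   <statefulset>-<ordinal>.<service>.<namespace>.svc.<cluster domain>
pod_fqdn() {
  fq_name=$1
  fq_ordinal=$2
  fq_service=$3
  fq_namespace=${4:-default}
  fq_domain=${5:-cluster.local}   # assumption: default cluster domain
  echo "${fq_name}-${fq_ordinal}.${fq_service}.${fq_namespace}.svc.${fq_domain}"
}

# Comma-separated DNS names for pods 0..replicas-1
peer_list() {
  pl_name=$1
  pl_replicas=$2
  pl_service=$3
  pl_namespace=${4:-default}
  pl_out=""
  pl_i=0
  while [ "$pl_i" -lt "$pl_replicas" ]; do
    pl_out="${pl_out:+$pl_out,}$(pod_fqdn "$pl_name" "$pl_i" "$pl_service" "$pl_namespace")"
    pl_i=$((pl_i + 1))
  done
  echo "$pl_out"
}

# Prints web-0.web.default.svc.cluster.local
pod_fqdn web 0 web
```

Calling peer_list postgres 2 postgres yields both postgres-0 and postgres-1 names in one string, which is the shape most replication configs want.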
Ordered Deployment: Patience, Kubernetes Style
By default, StatefulSets use podManagementPolicy: OrderedReady:
spec:
  podManagementPolicy: OrderedReady # Default
  # Pods created sequentially: 0, then 1, then 2
Pod 1 won’t start until pod 0 is Running and Ready. Annoying when you’re impatient; essential when pod 1 needs to join a cluster that pod 0 initialized.
For workloads that don’t care about order (some cache layers, stateless-ish workers with sticky storage), you can go parallel:
spec:
  podManagementPolicy: Parallel
  # All pods created simultaneously
We use OrderedReady for anything involving leader election or primary/replica setup. Parallel for Redis cluster nodes once the cluster bootstrap script handles concurrent joins.
Persistent Storage: One Disk Per Pod, No Sharing
volumeClaimTemplates:
- metadata:
    name: data
  spec:
    accessModes: [ "ReadWriteOnce" ]
    storageClassName: "ssd"
    resources:
      requests:
        storage: 100Gi
Each pod gets its own PVC:
- data-web-0
- data-web-1
- data-web-2
ReadWriteOnce means one node mounts it at a time — which is what you want for a database. Don’t try to share one PVC across pods unless you enjoy filesystem corruption.
Pick your storageClassName carefully. Databases on slow network-attached storage feel fine in dev and miserable under load.
PostgreSQL on Kubernetes: The Example We Wish We’d Had Sooner
This isn’t “Postgres in prod on K8s is easy” — it’s “if you’re going to do it, do it with eyes open.” ConfigMap for tuning, Secret for credentials, probes that actually check readiness:
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-config
data:
  postgresql.conf: |
    # Needed once this file replaces the image's default config,
    # otherwise Postgres only listens on localhost
    listen_addresses = '*'
    max_connections = 200
    shared_buffers = 256MB
    effective_cache_size = 1GB
    maintenance_work_mem = 64MB
    checkpoint_completion_target = 0.9
    wal_buffers = 16MB
    default_statistics_target = 100
    random_page_cost = 1.1
    effective_io_concurrency = 200
    work_mem = 4MB
    min_wal_size = 1GB
    max_wal_size = 4GB
---
apiVersion: v1
kind: Secret
metadata:
  name: postgres-secret
type: Opaque
data:
  postgres-password: cG9zdGdyZXM= # base64 encoded
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: "postgres"
  # As written, these are three independent Postgres instances;
  # replication between them has to be configured separately
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:13
        # Point Postgres at the mounted config; without this the file is ignored
        args:
        - -c
        - config_file=/etc/postgresql/postgresql.conf
        ports:
        - containerPort: 5432
          name: postgres
        env:
        - name: POSTGRES_DB
          value: mydb
        - name: POSTGRES_USER
          value: postgres
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: postgres-password
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        volumeMounts:
        - name: postgres-data
          mountPath: /var/lib/postgresql/data
        - name: config
          mountPath: /etc/postgresql/postgresql.conf
          subPath: postgresql.conf
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - pg_isready -U postgres
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - pg_isready -U postgres
          initialDelaySeconds: 5
          periodSeconds: 5
      volumes:
      - name: config
        configMap:
          name: postgres-config
  volumeClaimTemplates:
  - metadata:
      name: postgres-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "fast-ssd"
      resources:
        requests:
          storage: 100Gi
Honest caveat from 2018: many teams run Postgres outside Kubernetes and keep K8s for stateless apps. That’s often the right call. But when you need it in-cluster, StatefulSets are the tool — not Deployments with wishful thinking.
Redis Cluster: StatefulSets for Distributed Cache
Redis Cluster wants stable identities and persistent node configuration:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: "redis"
  replicas: 6
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:6-alpine
        ports:
        - containerPort: 6379
          name: redis
        command:
        - redis-server
        - /etc/redis/redis.conf
        - --cluster-enabled
        - "yes"
        - --cluster-config-file
        - /data/nodes.conf
        - --cluster-node-timeout
        - "5000"
        - --appendonly
        - "yes"
        volumeMounts:
        - name: redis-data
          mountPath: /data
        - name: config
          mountPath: /etc/redis
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
      volumes:
      # Assumes a redis-config ConfigMap with a redis.conf key already exists
      - name: config
        configMap:
          name: redis-config
  volumeClaimTemplates:
  - metadata:
      name: redis-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "fast-ssd"
      resources:
        requests:
          storage: 20Gi
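Six replicas because Redis Cluster wants three masters and three replicas. The StatefulSet only starts the nodes; joining them into a cluster (for example with redis-cli --cluster create) is a separate step. Resource limits keep one noisy neighbor from starving everything else on the node.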
Scaling: Up Is Easy, Down Requires Thought
# Scale up
kubectl scale statefulset web --replicas=5
# Scale down (pods removed in reverse order)
kubectl scale statefulset web --replicas=2
Scale-down removes highest-index pods first (web-4, then web-3, then web-2). Their PVCs stick around by default, which is usually what you want (data survives), but it means scaling down doesn’t cut storage costs until you manually delete the orphaned PVCs.
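To see which claims a scale-down leaves behind, the same naming pattern helps. A small sketch follows; orphaned_pvcs is a hypothetical helper of ours, and it only prints names so that deletion stays a deliberate manual step.

```shell
# List the PVCs left behind when a StatefulSet scales down.
# Usage: orphaned_pvcs <template> <statefulset> <old replicas> <new replicas>
orphaned_pvcs() {
  template=$1
  statefulset=$2
  old_replicas=$3
  new_replicas=$4
  # Pods new_replicas..old_replicas-1 are gone, but their claims remain
  i=$new_replicas
  while [ "$i" -lt "$old_replicas" ]; do
    echo "${template}-${statefulset}-${i}"
    i=$((i + 1))
  done
}

# Scaling web from 5 to 2 leaves www-web-2, www-web-3, www-web-4
orphaned_pvcs www web 5 2
```

Once you have confirmed the data is no longer needed, each name can be passed to kubectl delete pvc.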
For rolling updates, you can partition the rollout to update replicas incrementally:
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2 # Update pods 2, 3, 4... but not 0, 1
Set partition: 0 to roll all pods; set it to replicas to pause updates entirely. Setting it to replicas - 1 updates only the highest-index pod, which is handy for canary-testing a new image.
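If the partition arithmetic feels slippery, it can be spelled out. This is a sketch with a helper name of our own (updated_ordinals) that lists which pod ordinals a given partition lets the controller update, in the order a rolling update visits them (highest first).

```shell
# Print the ordinals a partitioned RollingUpdate will touch, highest first.
# Usage: updated_ordinals <replicas> <partition>
updated_ordinals() {
  replicas=$1
  partition=$2
  i=$((replicas - 1))
  # Only pods with ordinal >= partition are updated
  while [ "$i" -ge "$partition" ] && [ "$i" -ge 0 ]; do
    echo "$i"
    i=$((i - 1))
  done
}

# 5 replicas, partition 2: prints 4, 3, 2 (one per line)
updated_ordinals 5 2
```

With partition equal to replicas the function prints nothing, which matches the "paused" behaviour described above.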
Protecting Availability During Chaos
Kubernetes will happily evict your database pods during node maintenance. A PodDisruptionBudget says “not all at once, please”:
apiVersion: policy/v1 # Kubernetes 1.21+; older clusters use policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
For a 3-replica StatefulSet, minAvailable: 2 means at most one pod can be disrupted at a time. Without this, a cluster upgrade can take down your entire database tier simultaneously.
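The budget is simple subtraction, but it is worth seeing what happens when a pod is already unhealthy. A tiny sketch (allowed_disruptions is our own illustrative helper, not a kubectl command):

```shell
# How many voluntary evictions a minAvailable budget permits right now.
# Usage: allowed_disruptions <currently healthy pods> <minAvailable>
allowed_disruptions() {
  healthy=$1
  min_available=$2
  allowed=$((healthy - min_available))
  # The budget never goes negative; it just blocks evictions at zero
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

allowed_disruptions 3 2   # prints 1
allowed_disruptions 2 2   # prints 0
```

The second call is the case that surprises people: with one pod already down, a node drain will wait rather than evict another one.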
Init Containers: First-Day Setup Without Custom Images
Need to initialize data before the main container starts? Init containers run to completion before the app container launches:
spec:
  template:
    spec:
      initContainers:
      - name: init-db
        image: busybox
        command:
        - sh
        - -c
        - |
          if [ ! -f /data/initialized ]; then
            echo "Initializing database..."
            touch /data/initialized
          fi
        volumeMounts:
        - name: data
          mountPath: /data
      containers:
      - name: app
        # ...
We use init containers for schema migrations, config templating, and “wait until the leader pod is ready” logic.
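The "wait until the leader pod is ready" logic is mostly a bounded retry loop. Here is a minimal sketch of the kind of script an init container could run; wait_for is a hypothetical helper name, and the check command you pass in (a DNS lookup, pg_isready, and so on) is up to you.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_for <max attempts> <seconds between attempts> <command...>
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@" >/dev/null 2>&1; do
    if [ "$n" -ge "$attempts" ]; then
      echo "gave up after ${attempts} attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example (not run here): block until pod 0 has a DNS record
#   wait_for 30 2 nslookup web-0.web
```

Failing after a fixed number of attempts matters: an init container that exits non-zero gets restarted and shows up in kubectl get pods, while one that loops forever just looks stuck.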
Update Strategies: Rolling vs OnDelete
RollingUpdate (default) — Kubernetes updates pods in reverse ordinal order, one at a time.
OnDelete — Pods update only when you manually delete them. Useful when you want full control over maintenance windows:
spec:
  updateStrategy:
    type: OnDelete
    # Pods updated only when manually deleted
We default to RollingUpdate with a partition for staged rollouts. OnDelete for production databases where “automatic” and “database” shouldn’t share a sentence without careful review.
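With OnDelete, the rollout is whatever you script it to be. This is a rough sketch of a one-pod-at-a-time restart, highest ordinal first. The function name roll_on_delete, the timeout, and the retry count are our own assumptions; it relies only on kubectl delete pod and kubectl wait.

```shell
# Restart an OnDelete StatefulSet one pod at a time, highest ordinal first.
# Usage: roll_on_delete <statefulset> <replicas>
roll_on_delete() {
  sts=$1
  i=$(($2 - 1))
  while [ "$i" -ge 0 ]; do
    pod="${sts}-${i}"
    echo "restarting ${pod}"
    # Blocks until the old pod is gone; the controller then recreates it
    kubectl delete pod "$pod" || return 1
    # kubectl wait fails fast if the new pod does not exist yet, so retry
    tries=0
    until kubectl wait --for=condition=Ready "pod/${pod}" --timeout=120s; do
      tries=$((tries + 1))
      if [ "$tries" -ge 10 ]; then
        echo "${pod} never became Ready; stopping the rollout" >&2
        return 1
      fi
      sleep 3
    done
    i=$((i - 1))
  done
}

# Example (not run here): roll_on_delete postgres 3
```

Stopping at the first pod that fails to become Ready is the point of doing this by hand: a bad image takes out one replica instead of all of them.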
Patterns That Show Up in Real Clusters
Master-Slave by Pod Index
Pod mysql-0 becomes master; everyone else replicates from it. The hostname is stable, so the startup script can branch:
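# Sketch only: selector, labels, image, and serviceName omitted for brevity
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: mysql
        command:
        - /bin/sh
        - -c
        - |
          # Pod hostnames look like mysql-0, mysql-1, ...; the suffix is the ordinal
          host="$(hostname)"
          ordinal="${host##*-}"
          if [ "$ordinal" = "0" ]; then
            # Master configuration: binary logging on, server-id 1
            exec mysqld --server-id=1 --log-bin
          else
            # Slave configuration: unique server-id per pod (2, 3, ...).
            # --replicate-from is not a real mysqld option; point the replica
            # at mysql-0 with a CHANGE MASTER TO statement once it is running.
            exec mysqld --server-id=$((ordinal + 1))
          fi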
It’s crude, but explicit — and sometimes crude beats a complex operator you don’t fully understand yet.
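The one piece worth getting right is pulling the ordinal out of the hostname. Here is a small sketch that does it with plain parameter expansion and refuses names without a numeric suffix. The helper names pod_ordinal and server_id are ours, and the "ordinal plus one" rule is just one way to keep IDs unique with pod 0 as server 1.

```shell
# Print the ordinal from a StatefulSet pod hostname ("mysql-2" -> 2).
# Returns non-zero if the name has no numeric suffix.
pod_ordinal() {
  suffix="${1##*-}"   # strip everything up to the last hyphen
  case "$suffix" in
    ''|*[!0-9]*)
      echo "no ordinal in hostname: $1" >&2
      return 1
      ;;
  esac
  echo "$suffix"
}

# Derive a unique, non-zero server ID: pod 0 -> 1, pod 1 -> 2, ...
server_id() {
  ordinal=$(pod_ordinal "$1") || return 1
  echo $((ordinal + 1))
}

pod_ordinal mysql-0   # prints 0
server_id mysql-2     # prints 3
```

Inside a pod you would call server_id "$(hostname)"; failing loudly on an unexpected hostname beats silently starting two servers with the same ID.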
Leader Election
For workloads that need a single active leader:
# Use init container for leader election
initContainers:
- name: elect-leader
  image: leader-elector:latest
  env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
What We Do Differently Now
Headless services are mandatory — we don’t deploy StatefulSets without them. Resource requests and limits go on every container; stateful workloads without limits are future eviction victims. Liveness and readiness probes must test actual readiness (pg_isready, not “port is open”). PDBs protect every production StatefulSet. Fast storage classes for databases, with disk usage alerts because PVCs don’t shrink themselves. And backups happen outside the StatefulSet lifecycle — Kubernetes keeps pods alive; it doesn’t replace your backup strategy.
The Bottom Line
StatefulSets exist because some software has memory — literal and figurative. They give you stable names, sticky storage, ordered startup, and DNS that peers can rely on. Deployments are brilliant for stateless apps. For databases, caches, and clustered stateful services, StatefulSets are the difference between “it runs” and “it survives Tuesday.”
Just please don’t run three Postgres primaries on a shared volume. Some lessons you only need to learn once.
Written September 2018, covering Kubernetes 1.11+ StatefulSet features. Kubernetes storage, operators, and managed database offerings have matured significantly since — evaluate whether in-cluster state is still the right tradeoff for your team.