
Control-Plane Watch-Cache Freeze (Zero Pod Creations / Stalled Reflectors)

Service: microk8s control plane (pvek8s)
First documented: 2026-06-12
Incident: PGM-241 (KCM dead 16h with zero pod creations cluster-wide; no alert fired)
Linear: PGM-241, PGM-242
Nagios: microk8s-newest-pod-age (warn ≥15m, crit ≥30m without any new pod)


Failure Mode

A node's apiserver watch cache freezes because the apiserver's watch on kine (snap.microk8s.daemon-k8s-dqlite) breaks and never recovers. Every client whose reflector lists at resourceVersion=0 — kubelet, scheduler, KCM, ARC controller, any controller-runtime operator — is served the frozen cache and never sees new objects, while:

  • the process stays alive and systemd reports active
  • leases keep renewing (the lease goroutine doesn't depend on the cache)
  • logs keep flowing (probe noise, etc.)
  • kubectl get looks completely normal (quorum reads bypass the cache)

Symptoms depend on which component is homed on the frozen node:

Component on frozen node | Symptom
------------------------ | -------
KCM (leader)             | Zero pod creations cluster-wide; CronJobs stop; Deployments don't reconcile; deleted pods not replaced
Scheduler (leader)       | Pods pile up Pending with no FailedScheduling events
kubelet                  | Pods assigned to the node sit Pending forever; Scheduled is the last event
ARC controller           | Runner pods not created; stale Running EphemeralRunner CRs; GH Actions jobs queue for hours

Diagnosis — the RV=0 test

Create any object, then compare a watch-cache read against a quorum read on the suspect node's local apiserver:

kubectl --context pvek8s run cache-canary -n default --image=busybox:1.36 \
  --restart=Never --command -- true

ssh <node> "sudo /snap/bin/microk8s kubectl get --raw \
  '/api/v1/namespaces/default/pods?resourceVersion=0' | grep -c cache-canary"   # cache read
ssh <node> "sudo /snap/bin/microk8s kubectl get --raw \
  '/api/v1/namespaces/default/pods' | grep -c cache-canary"                     # quorum read

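A count of 0 from the cache read together with 1 from the quorum read means the watch cache on that node is frozen. Test every node: in PGM-241 two of three nodes were frozen and the healthy one (k8s02) was silently doing all the work. Delete the cache-canary pod once every node has been tested.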
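The two reads and their interpretation can be wrapped in a small helper so every node is tested the same way. This is a minimal sketch, not part of the original procedure: the function names classify_cache and check_node are hypothetical, and the "inconclusive" verdict (canary invisible even to the quorum read) is an added assumption to catch a canary that was never created or an ssh failure.

```shell
# Classify one node from the two grep -c counts of the RV=0 test.
# $1 = matches in the cache read (resourceVersion=0), $2 = matches in the quorum read.
classify_cache() {
  local cache_count="$1"
  local quorum_count="$2"
  if [ "$quorum_count" -eq 0 ]; then
    echo "inconclusive"   # canary not visible even to a quorum read
  elif [ "$cache_count" -eq 0 ]; then
    echo "frozen"         # quorum read sees it, watch cache does not
  else
    echo "healthy"
  fi
}

# Run both reads against one node's local apiserver and print a verdict.
# $1 = node to ssh to, $2 = canary pod name (defaults to cache-canary).
check_node() {
  local node="$1"
  local canary="${2:-cache-canary}"
  local base='/api/v1/namespaces/default/pods'
  local cache quorum
  # grep -c exits non-zero on zero matches, so tolerate that exit status.
  cache=$(ssh "$node" "sudo /snap/bin/microk8s kubectl get --raw '${base}?resourceVersion=0' | grep -c ${canary}") || true
  quorum=$(ssh "$node" "sudo /snap/bin/microk8s kubectl get --raw '${base}' | grep -c ${canary}") || true
  echo "${node}: $(classify_cache "${cache:-0}" "${quorum:-0}")"
}
```

After creating the canary, something like `for n in k8s01 k8s02; do check_node "$n"; done` (extend the list to every control-plane node) prints one verdict per node.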

Supporting signals:

# kine watch stream churn on the node (high during/after the break)
sudo journalctl -u snap.microk8s.daemon-kubelite --since "-5 minutes" --no-pager \
  | grep -cE "kine.sock.*closed|unexpected EOF"

# the moment a controller's informers died (e.g. ARC)
kubectl logs -n arc-systems <gharc-pod> | grep "Unexpected EOF during watch stream"

What does NOT work

  • Restarting kubelite alone. The rebuilt watch cache freezes again immediately because kine's feed is still broken — PGM-241 saw two consecutive kubelite restarts on k8s01 produce kubelets stalled from birth.
  • Deleting the leader lease. The stale instance's lease goroutine is alive and re-wins the election within seconds.
  • Waiting. The cache does not self-heal; KCM was dead 16h.

Recovery

Work through the affected nodes one at a time: nodes that are not the dqlite leader first, the dqlite leader last. To find the leader, run .leader in the dqlite client, or combine /var/snap/microk8s/current/var/kubernetes/backend/info.yaml with a leader query. For each node:

NODE=<node>

# 1. Cordon (mandatory — kubelet watch-race on restart, see kubelet-silent-stall.md)
kubectl --context pvek8s cordon "$NODE"

# 2. Restart kine/dqlite FIRST — this is the broken layer
ssh "$NODE" "sudo systemctl restart snap.microk8s.daemon-k8s-dqlite"
ssh "$NODE" "systemctl is-active snap.microk8s.daemon-k8s-dqlite"
# wait ~30s; confirm no 'database is locked' errors:
ssh "$NODE" "sudo journalctl -u snap.microk8s.daemon-k8s-dqlite --since '-1 minute' --no-pager | grep -c 'database is locked'"

# 3. Then kubelite
ssh "$NODE" "sudo systemctl restart snap.microk8s.daemon-kubelite"

# 4. Verify BEFORE uncordon: canary must run AND appear in the cache read
kubectl --context pvek8s run "${NODE}-canary" -n default --image=busybox:1.36 \
  --restart=Never --overrides="{\"spec\":{\"nodeName\":\"${NODE}\"}}" --command -- true
kubectl --context pvek8s wait -n default "pod/${NODE}-canary" \
  --for=jsonpath='{.status.phase}'=Succeeded --timeout=120s
ssh "$NODE" "sudo /snap/bin/microk8s kubectl get --raw \
  '/api/v1/namespaces/default/pods?resourceVersion=0' | grep -c ${NODE}-canary"   # must be 1
kubectl --context pvek8s delete pod -n default "${NODE}-canary"

# 5. Uncordon
kubectl --context pvek8s uncordon "$NODE"
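The node ordering required at the start of this procedure (non-leader dqlite nodes first, leader last) can be scripted. The sketch below is illustrative only: order_nodes and dqlite_leader are hypothetical helper names, and the dqlite client path, flags, and certificate file names in dqlite_leader are assumptions based on the standard MicroK8s snap layout, so verify them on your install before relying on them.

```shell
# Print the affected nodes one per line: non-leader nodes first, dqlite leader last.
# $1 = name of the dqlite leader node, remaining args = affected nodes.
order_nodes() {
  local leader="$1"
  shift
  local n
  local leader_affected=""
  for n in "$@"; do
    if [ "$n" = "$leader" ]; then
      leader_affected=1      # hold the leader back until the end
    else
      echo "$n"
    fi
  done
  if [ -n "$leader_affected" ]; then
    echo "$leader"
  fi
}

# Ask a node's dqlite client who the leader is (prints an address, not a node name).
# ASSUMPTION: client binary, flags, and file names follow the stock MicroK8s snap.
dqlite_leader() {
  local node="$1"
  local b=/var/snap/microk8s/current/var/kubernetes/backend
  ssh "$node" "sudo /snap/microk8s/current/bin/dqlite -s file://${b}/cluster.yaml -c ${b}/cluster.crt -k ${b}/cluster.key k8s .leader"
}
```

For example, `order_nodes k8s01 k8s01 k8s03` lists k8s03 before k8s01. Because dqlite_leader returns an address, mapping it to a node name is left to the operator.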

After recovery, also check for collateral stale controllers that broke when the apiserver bounced: restart any controller-runtime operator pods (e.g. gharc-controller in arc-systems) whose logs end in watch EOF errors, and clean up stale CRs (EphemeralRunners claiming Running against Completed/missing pods).
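Finding the collateral controllers can be partly automated by checking whether each pod's recent log output ends in the watch EOF error quoted under the supporting signals. This is a rough sketch under assumptions: the helper names are hypothetical, and the 20-line window and the --tail=50 fetch size are arbitrary choices, not values from the incident.

```shell
# Succeeds if the trailing lines of a log on stdin contain a watch EOF error.
# $1 = number of trailing lines to inspect (default 20, an arbitrary choice).
log_ends_in_watch_eof() {
  tail -n "${1:-20}" | grep -qi "unexpected EOF during watch stream"
}

# Print the pods in a namespace whose recent logs end in watch EOF errors.
# $1 = namespace, e.g. arc-systems.
stale_watch_pods() {
  local ns="$1"
  local pod
  for pod in $(kubectl --context pvek8s get pods -n "$ns" -o name); do
    if kubectl --context pvek8s logs -n "$ns" "$pod" --tail=50 2>/dev/null \
        | log_ends_in_watch_eof; then
      echo "$pod"
    fi
  done
}
```

Running `stale_watch_pods arc-systems` lists candidates to restart; a pod that logged an EOF but has since produced more than 20 newer lines is treated as recovered, which may or may not match reality, so read the logs before deleting anything.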

Post-Recovery Verification

kubectl --context pvek8s get pods -A --field-selector status.phase=Pending --no-headers | wc -l   # → 0 (after backlog drains)
kubectl --context pvek8s get pods -A --sort-by=.metadata.creationTimestamp --no-headers | tail -1  # newest pod < 2m old (per-minute cronjobs)

Expect a large backlog flood (Jobs for every missed CronJob window) — let it drain; jiva replica pods may briefly go Pending while rescheduling.
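The newest-pod check above can be turned into a number and compared against the same thresholds the microk8s-newest-pod-age Nagios check uses (warn at 15 minutes, crit at 30 minutes). The following is a sketch rather than the actual Nagios plugin: the function names are hypothetical, and newest_pod_age assumes GNU date (for `date -d`) and that kubectl applies --sort-by before the jsonpath output.

```shell
# Map the age of the newest pod (in seconds) to a Nagios-style status.
# Thresholds mirror the documented check: warn >= 15m, crit >= 30m.
classify_pod_age() {
  local age_s="$1"
  if [ "$age_s" -ge 1800 ]; then
    echo "CRITICAL"
  elif [ "$age_s" -ge 900 ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

# Print the age in seconds of the most recently created pod in the cluster.
# ASSUMPTION: GNU date is available for parsing the RFC 3339 timestamp.
newest_pod_age() {
  local ts
  ts=$(kubectl --context pvek8s get pods -A \
    --sort-by=.metadata.creationTimestamp \
    -o jsonpath='{.items[-1:].metadata.creationTimestamp}') || return 1
  echo $(( $(date +%s) - $(date -d "$ts" +%s) ))
}
```

Usage: `classify_pod_age "$(newest_pod_age)"`. While the backlog is still draining, new Jobs are created constantly, so this should report OK almost immediately after a successful recovery.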

References