Control-Plane Watch-Cache Freeze (Zero Pod Creations / Stalled Reflectors)¶

Service: microk8s control plane (pvek8s) First documented: 2026-06-12 Incident: PGM-241 — KCM dead 16h with zero pod creations cluster-wide; no alert fired Linear: PGM-241, PGM-242 Nagios: microk8s-newest-pod-age (warn ≥15m, crit ≥30m without any new pod)

Failure Mode¶

A node's apiserver watch cache freezes because the apiserver's watch on kine (snap.microk8s.daemon-k8s-dqlite) breaks and never recovers. Every client whose reflector lists at resourceVersion=0 — kubelet, scheduler, KCM, ARC controller, any controller-runtime operator — is served the frozen cache and never sees new objects, while:

the process stays alive and systemd reports active
leases keep renewing (the lease goroutine doesn't depend on the cache)
logs keep flowing (probe noise, etc.)
kubectl get looks completely normal (quorum reads bypass the cache)

Symptoms depend on which component is homed on the frozen node:

Component on frozen node	Symptom
KCM (leader)	Zero pod creations cluster-wide; CronJobs stop; Deployments don't reconcile; deleted pods not replaced
Scheduler (leader)	Pods pile up Pending with no FailedScheduling events
kubelet	Pods assigned to the node sit Pending forever; `Scheduled` is the last event
ARC controller	Runner pods not created; stale `Running` EphemeralRunner CRs; GH Actions jobs queue for hours

Diagnosis — the RV=0 test¶

Create any object, then compare a watch-cache read against a quorum read on the suspect node's local apiserver:

kubectl --context pvek8s run cache-canary -n default --image=busybox:1.36 \
  --restart=Never --command -- true

ssh <node> "sudo /snap/bin/microk8s kubectl get --raw \
  '/api/v1/namespaces/default/pods?resourceVersion=0' | grep -c cache-canary"   # cache read
ssh <node> "sudo /snap/bin/microk8s kubectl get --raw \
  '/api/v1/namespaces/default/pods' | grep -c cache-canary"                     # quorum read

0 from the cache read and 1 from the quorum read = frozen watch cache. Test every node — in PGM-241 two of three nodes were frozen and the healthy one (k8s02) was silently doing all the work.

Supporting signals:

# kine watch stream churn on the node (high during/after the break)
sudo journalctl -u snap.microk8s.daemon-kubelite --since "-5 minutes" --no-pager \
  | grep -cE "kine.sock.*closed|unexpected EOF"

# the moment a controller's informers died (e.g. ARC)
kubectl logs -n arc-systems <gharc-pod> | grep "Unexpected EOF during watch stream"

What does NOT work¶

Restarting kubelite alone. The rebuilt watch cache freezes again immediately because kine's feed is still broken — PGM-241 saw two consecutive kubelite restarts on k8s01 produce kubelets stalled from birth.
Deleting the leader lease. The stale instance's lease goroutine is alive and re-wins the election within seconds.
Waiting. The cache does not self-heal; KCM was dead 16h.

Recovery¶

Per affected node, non-dqlite-leader nodes first, leader last (find the leader with .leader via the dqlite client, or /var/snap/microk8s/current/var/kubernetes/backend/info.yaml + leader query):

NODE=<node>

# 1. Cordon (mandatory — kubelet watch-race on restart, see kubelet-silent-stall.md)
kubectl --context pvek8s cordon "$NODE"

# 2. Restart kine/dqlite FIRST — this is the broken layer
ssh "$NODE" "sudo systemctl restart snap.microk8s.daemon-k8s-dqlite"
ssh "$NODE" "systemctl is-active snap.microk8s.daemon-k8s-dqlite"
# wait ~30s; confirm no 'database is locked' errors:
ssh "$NODE" "sudo journalctl -u snap.microk8s.daemon-k8s-dqlite --since '-1 minute' --no-pager | grep -c 'database is locked'"

# 3. Then kubelite
ssh "$NODE" "sudo systemctl restart snap.microk8s.daemon-kubelite"

# 4. Verify BEFORE uncordon: canary must run AND appear in the cache read
kubectl --context pvek8s run "${NODE}-canary" -n default --image=busybox:1.36 \
  --restart=Never --overrides="{\"spec\":{\"nodeName\":\"${NODE}\"}}" --command -- true
kubectl --context pvek8s wait -n default "pod/${NODE}-canary" \
  --for=jsonpath='{.status.phase}'=Succeeded --timeout=120s
ssh "$NODE" "sudo /snap/bin/microk8s kubectl get --raw \
  '/api/v1/namespaces/default/pods?resourceVersion=0' | grep -c ${NODE}-canary"   # must be 1
kubectl --context pvek8s delete pod -n default "${NODE}-canary"

# 5. Uncordon
kubectl --context pvek8s uncordon "$NODE"

After recovery, also check for collateral stale controllers that broke when the apiserver bounced: restart any controller-runtime operator pods (e.g. gharc-controller in arc-systems) whose logs end in watch EOF errors, and clean up stale CRs (EphemeralRunners claiming Running against Completed/missing pods).

Post-Recovery Verification¶

kubectl --context pvek8s get pods -A --field-selector status.phase=Pending --no-headers | wc -l   # → 0 (after backlog drains)
kubectl --context pvek8s get pods -A --sort-by=.metadata.creationTimestamp --no-headers | tail -1  # newest pod < 2m old (per-minute cronjobs)

Expect a large backlog flood (Jobs for every missed CronJob window) — let it drain; jiva replica pods may briefly go Pending while rescheduling.

References¶

Incident: PGM-241 (2026-06-10/11) — 16h KCM stall, then scheduler, then kubelets on k8s01+k8s03
Related: kubelet-silent-stall.md — kubelet-only variant and why cordon-before-restart is mandatory
Related: kcm-stale-terminating-replicas.md — earlier, narrower KCM informer staleness
Related: kubelet-volume-manager-stall.md — processorListener variant; dqlite restart safety checks
Related: dqlite-write-contention.md — the write-storm conditions (PGM-237) that break kine watch streams in the first place