Post Incident Review: k8s02 Watch-Cache Freeze — Ansible Parallel dqlite Restart Stalls Pod Creation¶
Date: 2026-06-24 Duration: ~4h active (Nagios alert ~20:20 UTC → recovery ~00:20 UTC 2026-06-25); underlying dqlite disruption at 16:16 UTC Severity: High (KCM stalled cluster-wide; zero pod creations for ~4h; running workloads unaffected) Status: Resolved
Executive Summary¶
At 16:02 UTC on 2026-06-24, update/home.yml ran against the k8s group without serial: 1. The playbook processed k8s01 and k8s03 in parallel. After upgrading packages and modifying /var/snap/microk8s/current/args/k8s-dqlite, the Ansible handler fired systemctl restart snap.microk8s.daemon-k8s-dqlite on both nodes at the same second (16:16:06 UTC). Ansible also restarted kubelite on both nodes at 16:16:10 UTC.
With k8s01 and k8s03 dqlite simultaneously offline, k8s02 (the only remaining raft member) could not achieve quorum. All kine transactions on k8s02 failed with context canceled. k8s01 and k8s03 dqlite restarted within ~2 seconds (16:16:08), raft re-elected a leader, and those nodes recovered normally. k8s02's kubelite was never restarted — its existing HTTP/2 watch stream connection stayed alive via TCP keepalives but stopped delivering events after the raft disruption.
From 16:26 UTC (after raft recovery), k8s02's watch cache appeared to work — the watch stream was alive and delivering events via the old HTTP/2 connection. At ~20:04 UTC, the long-lived HTTP/2 watch stream connection finally closed (after ~3.5h of degraded operation and sustained write contention). The apiserver's watch cache froze. Because k8s02 held the KCM leader lease, all pod creation, CronJob scheduling, and Deployment reconciliation halted cluster-wide.
The microk8s-watch-cache Nagios check detected the freeze at ~20:20 UTC. Recovery followed the control-plane-watch-cache-freeze runbook: cordon k8s02, restart snap.microk8s.daemon-k8s-dqlite, restart snap.microk8s.daemon-kubelite, verify, uncordon. Total recovery time ~10 minutes.
Fix applied (PGM-281): Added serial: 1 to the k8s play in update/home.yml.
Timeline (UTC)¶
| Time | Event |
|---|---|
| 16:02 UTC | Ansible update/home.yml begins on k8s01 AND k8s03 simultaneously (no serial: 1) — AnsiballZ_setup.py runs on both nodes at same second |
| 16:03-16:04 UTC | APT update + full upgrade runs on k8s01 and k8s03 simultaneously |
| 16:04-16:10 UTC | dqlite database is locked (try: 500) on k8s01 and k8s03 — write contention from lease updates; Ansible's APT activity adds write pressure |
| 16:14:22 UTC | k8s02 kubelite logs first kine.sock connection failure (operation was canceled) — pre-existing write contention propagates to k8s02 kine |
| 16:15:51-16:16:02 UTC | Ansible modifies /var/snap/microk8s/current/args/k8s-dqlite on k8s01 and k8s03: adds --metrics and --metrics-listen=127.0.0.1:9042 |
| 16:16:03-16:16:05 UTC | Ansible queries journald for "database is locked" and "use of closed network connection" errors — detects contention but does not abort |
| 16:16:06 UTC | Root cause: ansible.builtin.systemd state=restarted fires for snap.microk8s.daemon-k8s-dqlite on k8s01 AND k8s03 at the same second |
| 16:16:07 UTC | k8s02 dqlite sees connection refused to k8s01 (172.22.22.6) and k8s03 (172.22.22.9) — both peers offline simultaneously |
| 16:16:07-16:16:31 UTC | k8s02 dqlite: no known leader (18+ attempts) → all kine transactions fail: context canceled, context deadline exceeded |
| 16:16:08 UTC | k8s01 and k8s03 dqlite restart with new PIDs — raft begins leader election |
| 16:16:10 UTC | Ansible restarts kubelite on k8s01 and k8s03 — new kubelite PIDs establish fresh connections to newly-restarted dqlite |
| 16:16:24-16:16:30 UTC | k8s01 and k8s03 new dqlite elect leader; raft cluster reforms; k8s02 reconnects as raft follower |
| ~16:26 UTC | k8s02 dqlite reconnects to new raft leader; kine resumes event delivery on existing HTTP/2 watch stream; last dqlite log entry (recovery complete) |
| 16:24-20:04 UTC | k8s02 kubelite logs kine.sock: use of closed network connection every ~4-5 min — gRPC connection pool channels failing to reconnect; watch stream on existing HTTP/2 connection still alive |
| ~16:27 UTC | dqlite database is locked resumes on k8s01 and k8s03 — write contention continues after restart (hostpath-provisioner, masterleases TTL) |
| ~20:04 UTC | Long-lived HTTP/2 watch stream connection on k8s02 finally closes (sustained write contention + ~3.5h TCP lifetime) |
| ~20:20 UTC | microk8s-watch-cache CRITICAL fires on k8s02 (NRPE polling interval); microk8s-newest-pod-age CRITICAL fires on all nodes (no pod created cluster-wide for >30m) |
| ~00:10 UTC (+1d) | Investigation begins |
| ~00:12 UTC | RV=0 canary test confirms k8s02 cache frozen (stale annotations ~4h old); k8s01 and k8s03 caches healthy |
| ~00:13 UTC | KCM leader lease confirmed as k8s02_7f7db8d0-b54f-432a-a75b-915ea452be27 |
| ~00:15 UTC | kubectl cordon k8s02 |
| ~00:15 UTC | systemctl restart snap.microk8s.daemon-k8s-dqlite on k8s02 |
| ~00:16 UTC | Wait 30s; zero "database is locked" errors confirmed |
| ~00:16 UTC | systemctl restart snap.microk8s.daemon-kubelite on k8s02 |
| ~00:18 UTC | RV=0 test on k8s02 shows fresh canary stamp — watch cache live |
| ~00:19 UTC | kubectl uncordon k8s02 |
| ~00:19 UTC | KCM lease moves to k8s01_944b8f8c-9407-4692-a5a0-9c33bd109f67 |
| ~00:19 UTC | Backlog drains: CronJobs, ARC runners, hostpath-provisioner jobs all resume; newest pod < 2m |
Root Causes¶
Primary cause (confirmed): update/home.yml ran k8s nodes without serial: 1¶
The k8s play in update/home.yml had no serial directive, causing Ansible to process k8s01, k8s02, and k8s03 in parallel. When the ansible-role-microk8s role modified /var/snap/microk8s/current/args/k8s-dqlite on k8s01 and k8s03, the handler fired systemctl restart snap.microk8s.daemon-k8s-dqlite on both nodes at the same second (16:16:06 UTC). Kubelite was also restarted on both nodes at 16:16:10 UTC.
With k8s01 and k8s03 dqlite simultaneously offline, k8s02 (the only remaining raft member) lost quorum for ~2 seconds. All kine transactions on k8s02 failed with context canceled, disrupting the connection pool between k8s02's apiserver and its kine socket.
Fix: Added serial: 1 to the k8s play in ansible/update/home.yml (PGM-281).
Compounding cause (confirmed): dqlite restart during active write contention¶
At the time of the restart (16:16:06 UTC), k8s01 and k8s03 dqlite were experiencing database is locked (try: 500) — severe write contention from lease updates. Ansible detected this contention via journald queries at 16:16:03-16:16:05 but had no abort condition — it restarted dqlite regardless.
The restart during active contention likely made the raft re-election noisier and caused kine's connection pool on k8s02 to enter a prolonged broken state (ongoing use of closed network connection errors for ~3.5h).
Compounding cause (confirmed): k8s02 kubelite not restarted — long-lived HTTP/2 watch stream survived ~3.5h¶
Ansible's k8s update did NOT restart kubelite on k8s02 (k8s02 was not in the concurrent Ansible run — it was processed separately, or skipped). k8s02's kubelite (PID 2547215) kept running continuously. Its existing HTTP/2 watch stream to kine.sock (established well before the disruption) stayed alive via TCP keepalives.
After raft recovery at ~16:26, k8s02's kine resumed event delivery on the old HTTP/2 connection. New connection pool channel requests continued to fail (use of closed network connection every ~4-5 min), but the watch stream was alive. The watch cache functioned normally from ~16:26 to ~20:04.
At ~20:04, the long-lived HTTP/2 connection finally closed — likely from sustained write contention pushing kine to terminate stale streams, or an HTTP/2 server-side connection lifetime limit. The watch cache froze.
Compounding cause (confirmed): KCM leader was on k8s02¶
The KCM leader lease was held by k8s02. With k8s02's cache frozen, the KCM could not see any new object changes. Pod creation, CronJob scheduling, and Deployment reconciliation halted cluster-wide. The KCM lease was not revoked — k8s02's lease-renewal goroutine operates independently of the watch cache.
What Went Well¶
microk8s-watch-cacheandmicrok8s-newest-pod-ageNagios checks correctly detected the freeze- The RV=0 canary test immediately pinpointed k8s02 as the frozen node (k8s01 and k8s03 tested healthy)
- The recovery runbook was complete and accurate; no surprises during execution
- Total recovery time from diagnosis to fully restored was ~10 minutes
- No running workload was interrupted — pods already scheduled continued running throughout
- Journald persistence on all nodes meant 7h+ of logs were available for post-incident archaeology
What Could Improve¶
serial: 1not enforced on k8s plays: Theupdate/home.ymlk8s play lackedserial: 1. Themicrok8s-monthly-maintenance.ymlplaybook hasserial: 1but the main update playbook did not. K8s operations must always be serial to prevent simultaneous dqlite restarts.- No abort on lock contention: Ansible detected
database is lockederrors at 16:16:03-16:16:05 immediately before triggering the restart, but restarted anyway. The role should abort (or warn and skip) if recent lock contention is detected. - 3.5h silent degradation: The watch cache was eventually working again from 16:26 to 20:04, but kine's connection pool was in a broken state throughout. A Nagios check for persistent kine.sock reconnection failures (the
use of closed network connectionpattern repeating every 5 min) would have surfaced this much earlier. - Detection delay on watch stream health: The
microk8s-watch-cachecheck fires when the cache is already frozen. An earlier check (detecting repeated kine.sock errors or high dqlite raft churn) could fire before the stream breaks entirely. - No automatic KCM leader failover: When k8s02's cache froze, k8s02 kept renewing the KCM lease normally. Other nodes could not preempt it.
Action Items¶
| # | Action | Ticket | Status |
|---|---|---|---|
| 1 | Add serial: 1 to k8s play in update/home.yml |
PGM-281 | Done — committed 2026-06-25 |
| 2 | Add guard in ansible-role-microk8s to abort dqlite restart if recent lock contention detected |
PGM-281 subtask | Open |
| 3 | Add Nagios check for persistent kine.sock reconnection failures as early-warning signal | PGM-281 subtask | Open |
References¶
- Runbook: control-plane-watch-cache-freeze.md
- Linear: PGM-281 — root cause investigation and fix
- Related: PGM-241 — original watch-cache freeze incident (2026-06-10)
- Related PIR: 2026-04-02 dqlite snapshot crash-loop
- Related PIR: 2026-05-15 AWX pod pending — dqlite write storm