Incidents

Post-incident reviews documenting what went wrong, why, and how we fixed it.

| Date | Title | Severity | Duration |
|------|-------|----------|----------|
| 2026-05-23 | k8s01 Calico CNI Unauthorized — Stale Pod-Bound Token After Calico Upgrade | Medium | ~1h |
| 2026-05-18 | k8s03 Extended Recovery — kine Watch Corruption, VXLAN Route Corruption, and Kubelet Watch Stream Stall | High | ~2h10m |
| 2026-05-17 | k8s03 PLEG Deadlock — Stale Calico IPAM Blocks + Generic PLEG Serial-Poll Vulnerability | High | ~9h |
| 2026-05-16 | microk8s 1.34 → 1.35 Rolling Upgrade — cgroup v2, containerd Shim, Disk Pressure, and Kubelet Stall | High | ~8.75h |
| 2026-05-15 | AWX Automation Pod Stuck Pending — Calico RBAC Gap + dqlite Write Storm | Medium | ~13 min silent + ~8 min to fix |
| 2026-04-12 | pvek8s Complete Cluster Outage — dqlite Quorum Loss and Ansible-Injected Invalid Flags | Critical | 7d degraded + ~1h 12m full outage |
| 2026-04-02 | dqlite Snapshot Bloat → kube-apiserver Instability → Controller Crash-Loop Cascade and Watch Stream Failure | High | ~7h |
| 2026-03-30 | Sonarr Outage Due to iSCSI Hairpin NAT Failure on k8s03 | High | ~45m |
| 2026-03-28 | Radarr Outage — OpenEBS Jiva Replica Divergence (Second Occurrence) | High | ~30h |
| 2026-03-28 | ARC GitHub Actions Runner Pods Stuck Pending — Kubelet Sync Loop Stall and Multi-Node Degradation | High | ~7h40m |
| 2026-02-22 | Radarr Outage Due to OpenEBS Jiva Replica Divergence | High | ~17h |
| 2026-01-06 | Cascading Kubernetes Cluster Failures | Critical | ~3 days |