Incidents

Post-incident reviews documenting what went wrong, why, and how we fixed it.

| Date | Title | Severity | Duration |
|------|-------|----------|----------|
| 2026-05-15 | AWX Pod Stuck Pending — Calico RBAC Gap + dqlite Write Storm | Medium | ~13 min silent + ~8 min to fix |
| 2026-04-12 | pvek8s Complete Cluster Outage — dqlite Quorum Loss and Ansible-Injected Invalid Flags | Critical | 7d degraded + ~1h 12m full outage |
| 2026-03-30 | Sonarr Outage — iSCSI Hairpin NAT Failure (ContainerCreating) | High | ~45m |
| 2026-03-28 | Radarr Outage — OpenEBS Jiva Replica Divergence (Second Occurrence) | High | ~30h |
| 2026-03-28 | ARC GitHub Actions Runner Pods Stuck Pending — Kubelet Sync Loop Stall | High | ~7h40m |
| 2026-02-22 | Radarr Outage Due to OpenEBS Jiva Replica Divergence | High | ~17h |
| 2026-01-06 | Cascading Kubernetes Cluster Failures | Critical | ~3 days |