Incidents
Post-incident reviews documenting what went wrong, why, and how we fixed it.
| Date | Title | Severity | Duration |
|---|---|---|---|
| 2026-05-15 | AWX Pod Stuck Pending — Calico RBAC Gap + dqlite Write Storm | Medium | ~13m silent + ~8m to fix |
| 2026-04-12 | pvek8s Complete Cluster Outage — dqlite Quorum Loss and Ansible-Injected Invalid Flags | Critical | 7d degraded + ~1h 12m full outage |
| 2026-03-30 | Sonarr Outage — iSCSI Hairpin NAT Failure (ContainerCreating) | High | ~45m |
| 2026-03-28 | Radarr Outage — OpenEBS Jiva Replica Divergence (Second Occurrence) | High | ~30h |
| 2026-03-28 | ARC GitHub Actions Runner Pods Stuck Pending — Kubelet Sync Loop Stall | High | ~7h40m |
| 2026-02-22 | Radarr Outage Due to OpenEBS Jiva Replica Divergence | High | ~17h |
| 2026-01-06 | Cascading Kubernetes Cluster Failures | Critical | ~3 days |
- Post Incident Review: AWX Automation Pod Stuck Pending — Calico RBAC Gap + dqlite Write Storm — 2026-05-15
- Post Incident Review: pvek8s Complete Cluster Outage — dqlite Quorum Loss and Ansible-Injected Invalid Flags — 2026-04-04 (degraded) → 2026-04-12
- Post Incident Review: dqlite Snapshot Bloat → kube-apiserver Instability → Controller Crash-Loop Cascade and Watch Stream Failure — 2026-04-01 to 2026-04-02
- Post Incident Review: Sonarr Outage Due to iSCSI Hairpin NAT Failure on k8s03 — 2026-03-30
- Post Incident Review: Radarr Outage — OpenEBS Jiva Replica Divergence (Second Occurrence) — 2026-03-28
- Post Incident Review: ARC GitHub Actions Runner Pods Stuck Pending — Kubelet Sync Loop Stall and Multi-Node Degradation — 2026-03-28
- Post Incident Review: Radarr Outage Due to OpenEBS Jiva Replica Divergence — 2026-02-22
- Post Incident Review: Cascading Kubernetes Cluster Failures — 2026-01-06