Post-Incident Reviews
Welcome to the internal Post-Incident Review (PIR) documentation site. This site contains detailed analyses of system incidents, root cause investigations, and lessons learned.
Purpose
Post-incident reviews are critical for:
- Learning from failures - Understanding what went wrong and why
- Preventing recurrence - Implementing safeguards and preventive measures
- Improving systems - Identifying architectural and operational improvements
- Knowledge sharing - Building team expertise and institutional memory
PGMac.Net Service Status
These documents provide clarity and detail on incidents discovered and communicated through my Nagios Status Page.
Recent Incidents
2026
- 2026-03-30 - Sonarr Outage Due to iSCSI Hairpin NAT Failure on k8s03
  - Severity: P2
  - Duration: Unknown period of silent failure + ~45m active investigation and recovery
  - Summary: Sonarr became stuck in ContainerCreating on k8s03 because microk8s Calico does not support hairpin NAT for host-namespace iSCSI clients: the kubelet's iSCSI connection to the Jiva controller ClusterIP was looped back to the same node and dropped at the PDU receive stage. Resolution required cordoning k8s03 and force-deleting the pod so it rescheduled to a different node.
- 2026-03-28 - Radarr Outage: OpenEBS Jiva Replica Divergence (Second Occurrence)
  - Severity: High
  - Duration: ~30h silent failure + ~50m active recovery
  - Summary: Second occurrence of all three OpenEBS Jiva replicas entering CrashLoopBackOff with diverged snapshot chains, rendering the radarr-config PVC unmountable. Unlike the February incident, no single authoritative replica existed, so all data directories had to be wiped (total data loss). Recovery was further complicated by a ghost RW replica entry in the controller API and a stale iSCSI session on k8s03.
- 2026-03-28 - ARC GitHub Actions Runner Pods Stuck Pending: Kubelet Sync Loop Stall
  - Severity: High
  - Duration: ~7h40m
  - Summary: Five ARC runner pods remained Pending for over 7 hours due to a layered failure: disk exhaustion on k8s02 triggered image pull failures, recovery attempts caused Calico disruption and PLEG desync, and a ghost containerd record on k8s01 (orphaned from a force-deleted pod) stalled the kubelet sync loop every 60 seconds. Resolution required clearing the ghost container, restarting kubelite, pinning the ARC controller to k8s01, and disabling a Wazuh webhook that returned 500 errors on every API server event.
- 2026-02-22 - Radarr Outage Due to OpenEBS Jiva Replica Divergence
  - Severity: High
  - Duration: ~16h30m silent failure + ~47m active recovery (~20h47m total outage)
  - Summary: Radarr became completely unavailable when all three OpenEBS Jiva storage replicas simultaneously entered CrashLoopBackOff following an ungraceful shutdown during an active rebuild, leaving the iSCSI-backed PVC unmountable due to ext4 journal corruption.
- Severity: Critical
- Duration: ~8 hours (Phase 1-2) + 16.5 hours (Phase 3) + 12+ hours (Phase 5)
- Summary: Multi-phase cascading failure across the microk8s cluster spanning 4 days, involving node reboots, kubelet failures, disk exhaustion, storage issues, job controller corruption, and container runtime corruption.
PIR Structure
Each post-incident review follows a standard structure:
- Executive Summary - High-level overview of the incident
- Timeline - Detailed chronological sequence of events
- Root Causes - Analysis of underlying issues
- Impact - Affected services, duration, and scope
- Resolution Steps - Actions taken to resolve the incident
- Verification - Confirmation of service restoration
- Preventive Measures - Immediate and long-term improvements
- Lessons Learned - Key takeaways and insights
- Action Items - Specific follow-up tasks with owners
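The standard structure above lends itself to a generated starting point. The following is a minimal sketch, not the site's actual template: the section names come from the list above, while the function name `render_pir_skeleton`, the Markdown heading levels, and the example title are assumptions made for illustration.

```python
from datetime import date

# Section names are taken from the standard structure list above.
# The "#"/"##" heading levels are an assumption, not the site's confirmed template.
PIR_SECTIONS = [
    "Executive Summary",
    "Timeline",
    "Root Causes",
    "Impact",
    "Resolution Steps",
    "Verification",
    "Preventive Measures",
    "Lessons Learned",
    "Action Items",
]


def render_pir_skeleton(incident_date: date, title: str) -> str:
    """Return an empty PIR document with one heading per standard section."""
    lines = [f"# {incident_date.isoformat()} - {title}", ""]
    for section in PIR_SECTIONS:
        lines.append(f"## {section}")
        lines.append("")  # blank line left for the section body
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical title, used only to show the output shape.
    print(render_pir_skeleton(date(2026, 3, 30), "Example incident title"))
```

Keeping the section list in one place means a new PIR cannot silently omit a section such as Verification or Action Items.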
Contributing
When creating a new PIR document:
- Use the naming convention: YYYY-MM-DD-brief-description.md
- Place documents in the docs/incidents/ directory
- Update the mkdocs.yml navigation section
- Follow the standard PIR structure template
- Include relevant technical details, commands, and verification steps
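The naming and placement rules above can be checked mechanically before a document is committed. This is a rough sketch under stated assumptions: the helper name `pir_path` is hypothetical, and the rule that the description consists of lowercase letters and digits separated by hyphens is an assumption, since the convention only specifies the general YYYY-MM-DD-brief-description.md shape. Updating the mkdocs.yml navigation section is not covered here and remains a manual step.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# Directory named in the contributing steps above.
INCIDENTS_DIR = PurePosixPath("docs/incidents")

# Assumption: the description is lowercase letters/digits separated by hyphens.
_NAME_RE = re.compile(r"([0-9]{4}-[0-9]{2}-[0-9]{2})-([a-z0-9]+(?:-[a-z0-9]+)*)\.md")


def pir_path(filename: str) -> PurePosixPath:
    """Validate a PIR filename and return its path under docs/incidents/."""
    match = _NAME_RE.fullmatch(filename)
    if match is None:
        raise ValueError(
            f"expected YYYY-MM-DD-brief-description.md, got {filename!r}"
        )
    # strptime raises ValueError for impossible dates such as 2026-02-30.
    datetime.strptime(match.group(1), "%Y-%m-%d")
    return INCIDENTS_DIR / filename


if __name__ == "__main__":
    # Hypothetical filename, used only to illustrate the convention.
    print(pir_path("2026-03-30-sonarr-iscsi-hairpin-nat.md"))
```

A check like this could run as a pre-commit hook or CI step so that misnamed files are caught before they reach the navigation menu.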
Navigation
Use the navigation menu to browse incidents by date, or use the site search to find specific topics.