Post-Incident Reviews
Welcome to the internal Post-Incident Review (PIR) documentation site. This site contains detailed analyses of system incidents, root cause investigations, and lessons learned.
Purpose
Post-incident reviews are critical for:
- Learning from failures - Understanding what went wrong and why
- Preventing recurrence - Implementing safeguards and preventive measures
- Improving systems - Identifying architectural and operational improvements
- Knowledge sharing - Building team expertise and institutional memory
Recent Incidents
2026
- 2026-01-06 - Cascading Kubernetes Cluster Failures
  - Severity: Critical
  - Duration: ~8 hours (Phase 1-2) + 16.5 hours (Phase 3) + 12+ hours (Phase 5)
  - Summary: Multi-phase cascading failure across the microk8s cluster over 4 days, involving node reboots, kubelet failures, disk exhaustion, storage issues, job controller corruption, and container runtime corruption
PIR Structure
Each post-incident review follows a standard structure:
- Executive Summary - High-level overview of the incident
- Timeline - Detailed chronological sequence of events
- Root Causes - Analysis of underlying issues
- Impact - Affected services, duration, and scope
- Resolution Steps - Actions taken to resolve the incident
- Verification - Confirmation of service restoration
- Preventive Measures - Immediate and long-term improvements
- Lessons Learned - Key takeaways and insights
- Action Items - Specific follow-up tasks with owners
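As a rough illustration, the structure above could be checked automatically before a PIR is merged. The sketch below assumes each section appears as a Markdown heading (for example "## Timeline"); the function name `missing_sections` and the heading format are assumptions for illustration, not part of any existing tooling on this site.

```python
import re

# Section names taken from the standard PIR structure listed above.
REQUIRED_SECTIONS = [
    "Executive Summary",
    "Timeline",
    "Root Causes",
    "Impact",
    "Resolution Steps",
    "Verification",
    "Preventive Measures",
    "Lessons Learned",
    "Action Items",
]

# Assumption: each section is a Markdown ATX heading such as "## Timeline".
HEADING_RE = re.compile(r"^#{1,6}\s+(.+?)\s*#*\s*$")


def missing_sections(markdown_text: str) -> list[str]:
    """Return the required section names that have no matching heading."""
    found = set()
    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line.strip())
        if match:
            # Compare case-insensitively so "## timeline" still counts.
            found.add(match.group(1).strip().lower())
    return [name for name in REQUIRED_SECTIONS if name.lower() not in found]
```

An empty result means every standard section is present; otherwise the returned names are listed in the same order as the structure above, which makes the gaps easy to spot in review.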
Contributing
When creating a new PIR document:
- Use the naming convention: YYYY-MM-DD-brief-description.md
- Place documents in the docs/incidents/ directory
- Update the mkdocs.yml navigation section
- Follow the standard PIR structure template
- Include relevant technical details, commands, and verification steps
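The naming and location rules above lend themselves to a small pre-commit style check. The following is a minimal sketch, not an existing script in this repository: the function name `check_pir_path` is hypothetical, and it assumes descriptions are written in lowercase letters and digits separated by hyphens, which the convention itself does not spell out.

```python
import re
from datetime import date
from pathlib import PurePosixPath

# Directory named in the contributing steps above.
INCIDENTS_DIR = PurePosixPath("docs/incidents")

# YYYY-MM-DD, then a hyphen-separated description, then ".md".
# Assumption: descriptions use lowercase letters and digits only.
NAME_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-([a-z0-9]+(?:-[a-z0-9]+)*)\.md$")


def check_pir_path(path: str) -> list[str]:
    """Return a list of problems with a PIR file path (empty if none)."""
    problems = []
    p = PurePosixPath(path)

    if p.parent != INCIDENTS_DIR:
        problems.append(f"expected file under {INCIDENTS_DIR}/")

    match = NAME_RE.match(p.name)
    if match is None:
        problems.append("file name does not match YYYY-MM-DD-brief-description.md")
        return problems

    # The pattern only checks digit counts, so validate the calendar date too.
    year, month, day = (int(match.group(i)) for i in (1, 2, 3))
    try:
        date(year, month, day)
    except ValueError:
        problems.append("date prefix is not a valid calendar date")
    return problems
```

For example, a hypothetical path such as docs/incidents/2026-01-06-cascading-kubernetes-cluster-failures.md passes with no problems, while docs/incidents/2026-13-40-bad-date.md is flagged for its date prefix.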
Navigation
Use the navigation menu to browse incidents by date, or use search to find specific topics.