Post-Incident Reviews

Welcome to the internal Post-Incident Review (PIR) documentation site. This site contains detailed analyses of system incidents, root cause investigations, and lessons learned.

Purpose

Post-incident reviews are critical for:

Learning from failures - Understanding what went wrong and why
Preventing recurrence - Implementing safeguards and preventive measures
Improving systems - Identifying architectural and operational improvements
Knowledge sharing - Building team expertise and institutional memory

Recent Incidents

2026

2026-01-06 - Cascading Kubernetes Cluster Failures
- Severity: Critical
- Duration: ~8 hours (Phases 1-2) + 16.5 hours (Phase 3) + 12+ hours (Phase 5)
- Summary: Multi-phase cascading failure spanning 4 days across a MicroK8s cluster, involving node reboots, kubelet failures, disk exhaustion, storage issues, job controller corruption, and container runtime corruption

PIR Structure

Each post-incident review follows a standard structure:

Executive Summary - High-level overview of the incident
Timeline - Detailed chronological sequence of events
Root Causes - Analysis of underlying issues
Impact - Affected services, duration, and scope
Resolution Steps - Actions taken to resolve the incident
Verification - Confirmation of service restoration
Preventive Measures - Immediate and long-term improvements
Lessons Learned - Key takeaways and insights
Action Items - Specific follow-up tasks with owners
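
The standard structure above lends itself to a generated starting point. The following is a minimal sketch, assuming PIRs are written in Markdown (as the .md naming convention suggests); the function name render_pir_skeleton, the heading levels, and the TODO placeholder are hypothetical and not part of any existing tooling on this site.

```python
# Section names taken from the standard PIR structure, in order.
PIR_SECTIONS = [
    "Executive Summary",
    "Timeline",
    "Root Causes",
    "Impact",
    "Resolution Steps",
    "Verification",
    "Preventive Measures",
    "Lessons Learned",
    "Action Items",
]


def render_pir_skeleton(title: str) -> str:
    """Return a Markdown skeleton with one heading per standard PIR section.

    Heading levels and the TODO placeholder are assumptions for illustration.
    """
    lines = [f"# {title}", ""]
    for section in PIR_SECTIONS:
        lines.append(f"## {section}")
        lines.append("")
        lines.append("TODO")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_pir_skeleton("Example Incident"))
```

Keeping the section list in one place means the order of a new PIR cannot drift from the standard structure.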

Contributing

When creating a new PIR document:

Use the naming convention: YYYY-MM-DD-brief-description.md
Place documents in the docs/incidents/ directory
Add the new document to the nav section of mkdocs.yml
Follow the standard PIR structure template
Include relevant technical details, commands, and verification steps
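
The naming and placement rules above can be sketched as a small helper. This is an illustration only: the function names are hypothetical, and the requirement that the description be lowercase and hyphen-separated is an assumption read from the brief-description pattern, not a documented rule.

```python
import re
from datetime import date
from pathlib import PurePosixPath

# Directory named in the contributing steps.
INCIDENTS_DIR = PurePosixPath("docs/incidents")

# Assumed shape of YYYY-MM-DD-brief-description.md (lowercase, hyphenated).
_NAME_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})-[a-z0-9]+(?:-[a-z0-9]+)*\.md$")


def pir_filename(incident_date: date, description: str) -> str:
    """Build a file name that follows YYYY-MM-DD-brief-description.md."""
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    if not slug:
        raise ValueError("description must contain a letter or digit")
    return f"{incident_date.isoformat()}-{slug}.md"


def pir_path(incident_date: date, description: str) -> PurePosixPath:
    """Return the repository-relative path for a new PIR document."""
    return INCIDENTS_DIR / pir_filename(incident_date, description)


def is_valid_pir_name(name: str) -> bool:
    """Check the name pattern and that the date prefix is a real date."""
    match = _NAME_RE.match(name)
    if match is None:
        return False
    try:
        date.fromisoformat(match.group(1))
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    # Illustrative input; the resulting name is not claimed to match any
    # existing file on this site.
    print(pir_path(date(2026, 1, 6), "Cascading Kubernetes Cluster Failures"))
```

A check such as is_valid_pir_name could run in a pre-commit hook or CI step so that misnamed files are caught before the navigation is updated.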

Use the navigation menu to browse incidents by date, or use search to find specific topics.