Post-Incident Reviews¶
Welcome to the internal Post-Incident Review (PIR) documentation site. This site contains detailed analyses of system incidents, root cause investigations, and lessons learned.
Purpose¶
Post-incident reviews are critical for:
- Learning from failures - Understanding what went wrong and why
- Preventing recurrence - Implementing safeguards and preventive measures
- Improving systems - Identifying architectural and operational improvements
- Knowledge sharing - Building team expertise and institutional memory
PGMac.Net Service Status¶
These documents give clarity and detail on incidents discovered and communicated through my Nagios Status Page.
PIR Structure¶
Each post-incident review follows a standard structure:
- Executive Summary - High-level overview of the incident
- Timeline - Detailed chronological sequence of events
- Root Causes - Analysis of underlying issues
- Impact - Affected services, duration, and scope
- Resolution Steps - Actions taken to resolve the incident
- Verification - Confirmation of service restoration
- Preventive Measures - Immediate and long-term improvements
- Lessons Learned - Key takeaways and insights
- Action Items - Specific follow-up tasks with owners
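As a rough illustration of the structure above, the following Python sketch builds an empty PIR skeleton and reports which standard sections a draft is missing. It is not the site's actual template: the `## ` heading level, the `TODO` placeholder, and the function names `pir_skeleton` and `missing_sections` are assumptions made for this example.

```python
# Section names copied from the standard PIR structure listed above.
PIR_SECTIONS = [
    "Executive Summary",
    "Timeline",
    "Root Causes",
    "Impact",
    "Resolution Steps",
    "Verification",
    "Preventive Measures",
    "Lessons Learned",
    "Action Items",
]


def pir_skeleton(title: str) -> str:
    """Return a Markdown skeleton with one heading per standard section.

    Assumption: sections are level-2 headings; the real template may differ.
    """
    lines = [f"# {title}", ""]
    for section in PIR_SECTIONS:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)


def missing_sections(markdown_text: str) -> list[str]:
    """List the standard sections that have no matching '## ' heading."""
    headings = {
        line[3:].strip()
        for line in markdown_text.splitlines()
        if line.startswith("## ")
    }
    return [section for section in PIR_SECTIONS if section not in headings]
```

A check like `missing_sections(draft_text)` could run before a PIR is published, so that an omitted section is a deliberate choice rather than an oversight.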
Contributing¶
Creating a PIR¶
- Use the naming convention: YYYY-MM-DD-brief-description.md
- Place documents in the src/incidents/ directory; auto-nav picks them up automatically, no mkdocs.yml changes needed
- Add a row to the top of src/incidents/index.md (newest-first)
- Follow the PIR structure template; each section is explained with guidance on what to write and why
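The naming convention and the newest-first ordering can be checked mechanically. The sketch below is one possible approach, not tooling that ships with this site. It assumes the brief description is lowercase words joined by hyphens, and the filename in the usage note is hypothetical.

```python
import re
from datetime import date

# Assumed reading of "brief-description": lowercase letters and digits
# in hyphen-separated words. The site itself does not specify this.
PIR_NAME = re.compile(
    r"^(\d{4})-(\d{2})-(\d{2})-([a-z0-9]+(?:-[a-z0-9]+)*)\.md$"
)


def parse_pir_filename(name: str) -> tuple[date, str]:
    """Split a PIR filename into its incident date and description slug.

    Raises ValueError if the name does not match the convention or the
    date is not a real calendar date.
    """
    match = PIR_NAME.match(name)
    if match is None:
        raise ValueError(f"not a YYYY-MM-DD-brief-description.md name: {name!r}")
    year, month, day, slug = match.groups()
    return date(int(year), int(month), int(day)), slug


def is_newest_first(names: list[str]) -> bool:
    """Check that filenames are ordered from newest to oldest incident."""
    dates = [parse_pir_filename(name)[0] for name in names]
    return all(newer >= older for newer, older in zip(dates, dates[1:]))
```

For example, `parse_pir_filename("2024-03-15-dns-outage.md")` returns `(date(2024, 3, 15), "dns-outage")`, while a name such as `dns-outage.md` raises `ValueError`.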
Creating a Runbook¶
Write a runbook when an incident has a repeatable failure mode with a concrete, step-by-step recovery procedure that an on-call engineer could follow cold.
- Use a descriptive name: <service>-<failure-description>.md (e.g., calico-cni-unauthorized.md)
- Place documents in the src/runbooks/ directory; auto-nav picks them up automatically
- Follow the runbook template; it covers both the simple pattern (one failure mode) and the multi-mode pattern (same symptom, multiple root causes)
- Consider extending an existing runbook with a new failure mode section instead of creating a new file if the observable symptom is the same
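A small helper can turn a service name and failure description into a path that follows the naming pattern above. This is a minimal sketch under assumptions: the `slugify` rules (lowercase, hyphens in place of any other characters) and the function names are illustrative and not part of the runbook template.

```python
import re


def slugify(text: str) -> str:
    """Lowercase the text and collapse every non-alphanumeric run to '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def runbook_path(service: str, failure: str) -> str:
    """Build src/runbooks/<service>-<failure-description>.md."""
    service_slug = slugify(service)
    failure_slug = slugify(failure)
    if not service_slug or not failure_slug:
        raise ValueError("both a service and a failure description are required")
    return f"src/runbooks/{service_slug}-{failure_slug}.md"
```

With these rules, `runbook_path("Calico CNI", "Unauthorized")` yields `src/runbooks/calico-cni-unauthorized.md`, matching the example name above.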
Navigation¶
Use the navigation menu to browse incidents by date, or use search to find specific topics.