Post-Incident Reviews

Welcome to the internal Post-Incident Review (PIR) documentation site. This site contains detailed analyses of system incidents, root cause investigations, and lessons learned.

Purpose

Post-incident reviews are critical for:

Learning from failures - Understanding what went wrong and why
Preventing recurrence - Implementing safeguards and preventive measures
Improving systems - Identifying architectural and operational improvements
Knowledge sharing - Building team expertise and institutional memory

PGMac.Net Service Status

These documents provide clarity and detail on incidents that were discovered and communicated through my Nagios Status Page.

PIR Structure

Each post-incident review follows a standard structure:

Executive Summary - High-level overview of the incident
Timeline - Detailed chronological sequence of events
Root Causes - Analysis of underlying issues
Impact - Affected services, duration, and scope
Resolution Steps - Actions taken to resolve the incident
Verification - Confirmation of service restoration
Preventive Measures - Immediate and long-term improvements
Lessons Learned - Key takeaways and insights
Action Items - Specific follow-up tasks with owners
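A quick completeness check against the structure above can be sketched as follows. This is a minimal sketch, not part of the site tooling: it assumes each PIR is a Markdown file in which every section appears as a heading whose text matches the section name exactly, and `missing_sections` is a hypothetical helper name.

```python
import re

# Section names taken from the PIR structure list, in order.
PIR_SECTIONS = [
    "Executive Summary",
    "Timeline",
    "Root Causes",
    "Impact",
    "Resolution Steps",
    "Verification",
    "Preventive Measures",
    "Lessons Learned",
    "Action Items",
]

# Assumption: sections are Markdown ATX headings ("#" through "######").
HEADING_RE = re.compile(r"^#{1,6}\s+(.+?)\s*$")


def missing_sections(markdown_text):
    """Return the standard section names that have no matching heading."""
    found = set()
    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            found.add(match.group(1).lower())
    # Preserve the standard order so the report reads like the template.
    return [name for name in PIR_SECTIONS if name.lower() not in found]
```

Calling `missing_sections(text)` on a draft returns an empty list when every section is present, which makes it easy to run before opening a pull request.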

Contributing

Creating a PIR

Use the naming convention: YYYY-MM-DD-brief-description.md
Place documents in the src/incidents/ directory. Auto-nav picks them up, so no mkdocs.yml changes are needed
Add a row to the top of src/incidents/index.md (newest-first)
Follow the PIR structure template, which explains what to write in each section and why
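The naming convention and target directory from the steps above can be sketched as a small helper. This is an illustrative sketch under stated assumptions: descriptions are taken to be lowercase letters, digits, and hyphens only, and the function names `is_valid_pir_name` and `pir_path` are hypothetical, not part of the site.

```python
import re
from datetime import date
from pathlib import PurePosixPath

# Directory named in the steps above.
INCIDENTS_DIR = PurePosixPath("src/incidents")

# YYYY-MM-DD, a hyphen, a hyphen-separated description, then ".md".
# Assumption: the description uses lowercase letters, digits, and hyphens.
PIR_NAME_RE = re.compile(
    r"^([0-9]{4}-[0-9]{2}-[0-9]{2})-[a-z0-9]+(?:-[a-z0-9]+)*\.md$"
)


def is_valid_pir_name(filename):
    """Check the YYYY-MM-DD-brief-description.md convention."""
    match = PIR_NAME_RE.match(filename)
    if not match:
        return False
    try:
        # Reject impossible dates such as month 13 or February 30.
        date.fromisoformat(match.group(1))
    except ValueError:
        return False
    return True


def pir_path(incident_date, description):
    """Build the path for a new PIR from a date and a free-text description."""
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    if not slug:
        raise ValueError("description must contain a letter or digit")
    return INCIDENTS_DIR / f"{incident_date.isoformat()}-{slug}.md"
```

For example, `pir_path(date(2024, 3, 15), "DNS outage")` yields `src/incidents/2024-03-15-dns-outage.md`; the date and description here are made up for illustration.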

Creating a Runbook

Write a runbook when an incident has a repeatable failure mode with a concrete, step-by-step recovery procedure that an on-call engineer could follow cold.

Use a descriptive name: <service>-<failure-description>.md (e.g., calico-cni-unauthorized.md)
Place documents in the src/runbooks/ directory, where auto-nav picks them up
Follow the runbook template — it covers both the simple pattern (one failure mode) and the multi-mode pattern (same symptom, multiple root causes)
If the observable symptom is the same, consider extending an existing runbook with a new failure mode section instead of creating a new file
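The runbook naming rule, together with the advice to look for an existing runbook before creating a new one, can be sketched like this. It is a rough sketch only: the helper names `runbook_name` and `existing_runbooks_for` are hypothetical, and the slug rules (lowercase, hyphen-separated) are an assumption inferred from the calico-cni-unauthorized.md example.

```python
import re
from pathlib import Path


def slugify(text):
    """Lowercase the text and join alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def runbook_name(service, failure_description):
    """Build a <service>-<failure-description>.md file name."""
    service_slug = slugify(service)
    failure_slug = slugify(failure_description)
    if not service_slug or not failure_slug:
        raise ValueError("service and failure description must not be empty")
    return f"{service_slug}-{failure_slug}.md"


def existing_runbooks_for(service, runbooks_dir="src/runbooks"):
    """List runbooks that already start with this service's prefix.

    Matching is by file name prefix only, so judging whether the
    observable symptom is the same still needs a human read.
    """
    root = Path(runbooks_dir)
    if not root.is_dir():
        return []
    prefix = slugify(service) + "-"
    return sorted(p.name for p in root.glob("*.md") if p.name.startswith(prefix))
```

Here `runbook_name("Calico CNI", "unauthorized")` reproduces the example name calico-cni-unauthorized.md, and checking `existing_runbooks_for("calico")` first gives a short list of candidates to extend.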

Use the navigation menu to browse incidents by date, or use the search box to find specific topics.