
Post Incident Review: Cascading Kubernetes Cluster Failures

Date: 2026-01-06 to 2026-01-09
Duration: ~8 hours of acute outage on 2026-01-06 (estimated 09:00 - 17:00 AEST), with follow-up phases through 2026-01-09
Severity: Critical (Complete cluster instability, multiple service outages)
Status: Resolved


Executive Summary

A cascading failure across the microk8s Kubernetes cluster began with unplanned node reboots, leading to widespread kubelet failures, disk exhaustion, controller corruption, and ultimately service outages. The incident progressed through five distinct phases spanning January 6-9, 2026:

Phase 1 (2026-01-06 09:00-12:30): Cascading node failures caused kubelet hangs on all three nodes due to disk pressure (97-100% usage), audit buffer overload, and orphaned pod accumulation. The cluster reached a critical state where pods could not be scheduled, started, or terminated. 571 orphaned GitHub Actions runner pods and 22 stuck OpenEBS replica pods contributed to resource exhaustion.

Phase 2 (2026-01-06 12:30-15:35): After stabilizing node operations, secondary issues emerged: OpenEBS Jiva volume snapshot accumulation (1011+ snapshots), ingress controller endpoint caching failures, and volume capacity exhaustion. Multiple media services (Sonarr, Radarr, Overseerr) became inaccessible.

Phase 3 (2026-01-08 02:00-18:25): Job controller corruption prevented all cluster-wide job creation for 16.5 hours. Resolving the persistent database state corruption originating from Phase 1 required the nuclear option (cluster restart with a dqlite backup).

Phase 4 (2026-01-08 19:00-19:45): ArgoCD application recovery required manual finalizer removal and configuration fixes for GitHub Actions runner controllers and LinkAce cronjob.

Phase 5 (2026-01-09 09:50-09:55): k8s01 container runtime corruption recurred 48+ hours after the Phase 3 nuclear option, demonstrating that the cluster restart cleared cluster-global state but not node-local container runtime issues. Four runner pods were stuck Pending for 12+ hours due to the same silent failure pattern.

Resolution required systematic intervention across multiple infrastructure layers: node recovery, disk cleanup, pod force-deletion, storage subsystem repair, ingress refresh, database backup/restart, and multiple node-local container runtime restarts. All services restored to full functionality with complete volume replication (3/3 replicas).


Timeline (AEST - UTC+10)

Phase 1: Cascading Node and Kubelet Failures

Time Event
~09:00 INCIDENT START: Unplanned node reboots across k8s01, k8s02, k8s03 (likely power event or scheduled maintenance)
09:15-09:30 Cluster comes back online but exhibits severe instability: pods not scheduling, not starting, not terminating
09:30-10:00 Initial diagnostics: Control plane components healthy, scheduler functioning, but kubelets not processing assigned pods
10:00-10:15 Identified k8s02 kubelet hung: pods assigned by scheduler but never reaching ContainerCreating state
10:15-10:20 RESOLUTION 1.1: Restarted kubelite on k8s02 (systemctl restart snap.microk8s.daemon-kubelite)
10:20-10:30 k8s01 kubelet repeatedly crashing: “Kubelet stopped posting node status” within minutes of restart
10:30-10:45 Root cause analysis k8s01: Disk at 97% usage + audit buffer overload (“audit buffer queue blocked” errors)
10:45-11:00 Database lock errors in kine (etcd replacement): “database is locked” preventing state updates
11:00-11:15 k8s03 diagnostics: Disk at 100% capacity with garbage collection failures
11:15-11:30 Discovered 571 orphaned GitHub Actions runner pods in ci namespace (deployment scaled to 0 but pods remained)
11:30-11:45 RESOLUTION 1.2: Disk cleanup on k8s01 (container images, logs) reducing from 97% → 87% usage
11:45-12:00 RESOLUTION 1.3: Disk cleanup on k8s03 reducing from 100% → 81% usage
12:00-12:15 k8s01 kubelet stabilized after disk cleanup, node maintaining Ready status
12:15-12:20 Deleted RunnerDeployment and HorizontalRunnerAutoscaler (GitHub Actions runner controller orphaned)
12:20-12:25 Force-deleted 22 OpenEBS replica pods stuck in Terminating state
12:25-12:30 Began aggressive force-deletion of 571 runner pods in batches (Pending, ContainerStatusUnknown, StartError, Completed)

Phase 2: Storage and Ingress Service Outages

Time Event
~12:30 PHASE 2 START: User reports 504 Gateway Timeout errors for Sonarr at https://sonarr.int.pgmac.net/
12:30-12:45 Initial investigation: Examined ingress controller logs showing upstream timeouts to pods at old IP addresses (10.1.236.34:8989 for Sonarr, etc.)
12:45-13:00 Root cause analysis: Discovered Radarr pod in CrashLoopBackOff with “No space left on device” error. Sonarr pod Pending on k8s01 node.
13:00-13:15 Volume analysis: Identified OpenEBS Jiva volumes with excessive snapshots (1011 vs 500 threshold) affecting Radarr, Sonarr, and Overseerr
13:15-13:30 Node troubleshooting: Identified k8s01 node unable to start new containers despite being healthy (residual kubelet issues from Phase 1)
13:30-13:35 RESOLUTION 2.1: Restarted microk8s on k8s01, resolving pod scheduling issues
13:35-14:00 Snapshot cleanup: Triggered manual Jiva snapshot cleanup job (jiva-snapshot-cleanup-manual)
13:52 Cleanup job started processing Overseerr volume (pvc-05e03b60)
13:57 Sonarr volume (pvc-17e6e808) cleanup completed
14:10 RESOLUTION 2.2: Restarted all 3 ingress controller pods to clear stale endpoint cache
14:11 SERVICE RESTORED: Sonarr accessible at https://sonarr.int.pgmac.net/ (200 OK responses)
14:15 Overseerr confirmed accessible (200 OK responses)
14:20 Radarr volume (pvc-311bef00) cleanup completed
14:25 Radarr pod still crashing: volume at 100% capacity (958M/974M used)
14:28 RESOLUTION 2.3: Cleared 49M of old backups from Radarr volume, reducing to 95% usage
14:30 SERVICE RESTORED: Radarr accessible at https://radarr.int.pgmac.net/
14:35 Identified 8-9 Jiva replica pods stuck in Pending state on k8s03 (residual from Phase 1)
~15:30 RESOLUTION 2.4: Restarted microk8s on k8s03, resolving all Pending replica pods
15:35 PHASE 2 END: All services operational, all replicas running (3/3), no problematic pods

Cleanup Operations (Parallel with Phase 2)

Time Event
12:30-12:45 Force-deleted 299 Pending runner pods
12:45-13:00 Force-deleted 110 ContainerStatusUnknown runner pods
13:00-13:15 Force-deleted 58 StartError/RunContainerError runner pods
13:15-13:30 Force-deleted 79 Completed runner pods
13:30-14:00 Force-deleted final batch of 247 non-Running/non-Terminating runner pods
14:00 Runner pod cleanup substantially complete: 128 Terminating and 18 Running pods remain after the final batch of 247 deletions

Phase 3: LinkAce CronJob Controller Corruption (2026-01-08 02:00-18:25)

Time Event
2026-01-08 ~02:00 PHASE 3 START: LinkAce cronjob (* * * * * schedule) begins failing to create jobs successfully
02:00-05:00 Cronjob creates job objects but pods orphaned (parent job deleted before pod creation)
05:00-05:30 44+ jobs stuck in Running state (0/1 completions, 6min-11h old), 24+ pods Pending
05:30-06:00 Investigation reveals job controller stuck syncing deleted job linkace-cronjob-29463021
06:00-06:15 Job controller logs: “syncing job: tracking status: jobs.batch not found” errors
06:15-06:30 Cleanup attempts: Suspended cronjob, deleted orphaned pods, cleared stale active jobs
06:30-07:00 Restarted kubelite on k8s01, temporary improvement but orphaned job reference persists
07:00-08:00 Created a dummy job with the stale name and deleted it cleanly, but new jobs still not creating pods
08:00-09:00 User added timeout configuration to ArgoCD manifest (activeDeadlineSeconds: 300, ttlSecondsAfterFinished: 120)
09:00-09:30 ArgoCD synced configuration successfully but cronjob deleted to recreate cleanly
09:30-10:00 ArgoCD failed to auto-recreate deleted cronjob despite OutOfSync status
10:00-10:30 Manually recreated cronjob, but job controller completely wedged (not creating pods for any jobs)
10:30-18:00 Self-healing attempted: waited 1.5 hours for TTL cleanup and active deadline enforcement - failed
18:00-18:05 Jobs created by cronjob but no pods spawned, active deadline not enforced (jobs 85+ min old still Running)
18:05-18:10 TTL cleanup not working (no jobs auto-deleted after completion)
18:10 DECISION: Nuclear option approved - etcd cleanup with cluster restart
18:12-18:15 RESOLUTION 3.1: Stopped MicroK8s on all 3 nodes (k8s01, k8s02, k8s03)
18:15 RESOLUTION 3.2: Backed up etcd/dqlite database to /var/snap/microk8s/common/backup/etcd-backup-20260108-201540
18:15-18:17 RESOLUTION 3.3: Restarted MicroK8s cluster, all nodes returned Ready
18:17-18:18 RESOLUTION 3.4: Force-deleted all stuck jobs and cronjob
18:18-18:19 RESOLUTION 3.5: Triggered ArgoCD sync to recreate cronjob with fresh state
18:19 Cronjob recreated successfully with all timeout settings applied
18:20-18:22 First job (linkace-cronjob-29464462) created successfully, completed in 7 seconds
18:22-18:25 TTL cleanup verified working: completed jobs auto-deleted after 2 minutes
18:25 PHASE 3 END: Cronjob fully functional, no orphaned pods, all cleanup mechanisms working

Phase 4: ArgoCD Application Recovery (2026-01-08 ~19:00-19:45)

Time Event
2026-01-08 ~19:00 PHASE 4 START: Investigation of 5 ArgoCD applications stuck OutOfSync or Progressing
19:00-19:05 Identified problematic applications: ci-tools (OutOfSync + Progressing), gharc-runners-pgmac-net-self-hosted (OutOfSync + Healthy), gharc-runners-pgmac-user-self-hosted (Synced + Progressing), hass (Synced + Progressing), linkace (OutOfSync + Healthy)
19:05-19:15 ci-tools investigation: Found child application gharc-runners-pgmac-user-self-hosted stuck with resources “Pending deletion”
19:15-19:18 RESOLUTION 4.1: Removed finalizers from 4 stuck resources (AutoscalingRunnerSet, ServiceAccount, Role, RoleBinding) using kubectl patch --type json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
19:18-19:20 Triggered ArgoCD sync for ci-tools, application became Synced + Healthy
19:20-19:25 gharc-runners-pgmac-net-self-hosted investigation: Found old listener resources with hash 754b578d needing deletion
19:25-19:28 Deleted 3 old listener resources manually (ServiceAccount, Role, RoleBinding)
19:28-19:30 Discovered 6 runner pods stuck Pending for 44+ minutes (residual from Phase 3 job controller corruption)
19:30-19:32 Force-deleted 6 stuck runner pods: pgmac-renovatebot-* pods with PodScheduled=True but no containers created
19:32-19:33 Application status: OutOfSync + Healthy (acceptable due to ignoreDifferences configuration for AutoscalingListener, Role, RoleBinding)
19:33-19:35 gharc-runners-pgmac-user-self-hosted: Already deleted during ci-tools cleanup
19:35-19:37 hass investigation: Application self-resolved during investigation, showing Synced + Healthy (StatefulSet rollout completed)
19:37-19:40 linkace investigation: Found linkace-cronjob OutOfSync despite application Healthy, ArgoCD attempted 23 auto-heal operations
19:40-19:42 Root cause identified: LinkAce Helm chart doesn’t support backoffLimit and resources configuration in cronjob
19:42-19:43 RESOLUTION 4.2: Edited /Users/paulmacdonnell/pgmac/pgk8s/pgmac.net/media/templates/linkace.yaml to remove unsupported fields (backoffLimit, resources block)
19:43 Kept critical timeout settings: startingDeadlineSeconds, activeDeadlineSeconds, ttlSecondsAfterFinished, history limits
19:43-19:44 Committed changes with message “Remove unsupported LinkAce cronjob configuration”
19:44 Git push rejected due to remote changes, used git stash && git pull --rebase && git stash pop && git push
19:45 PHASE 4 END: All applications resolved or explained; 2 applications Synced + Healthy (ci-tools, hass), 2 applications OutOfSync + Healthy acceptable (gharc-runners-pgmac-net-self-hosted, linkace), 1 application deleted (gharc-runners-pgmac-user-self-hosted)

Phase 5: k8s01 Container Runtime Corruption Recurrence (2026-01-09 ~09:50-09:55)

Time Event
2026-01-09 ~09:50 PHASE 5 START: Investigation revealed 4 runner pods in arc-runners namespace stuck in Pending state for 12+ hours
09:50-09:51 Identified all 4 Pending pods assigned to k8s01 node: self-hosted-l52x9-runner-2nnsr, -69qnv, -ls8c2, -w8mcd
09:51 Pod describe showed PodScheduled=True but no container initialization, no events generated (silent failure pattern from Phase 2/3)
09:51-09:52 Verified k8s01 node showing Ready status despite being unable to start new containers
09:52 Root cause identified: Container runtime state corruption on k8s01 (residual from Phase 1-3, not fully cleared by Phase 3 nuclear option)
09:52-09:53 Found 10 EphemeralRunner resources but only 4 pods exist (6 pgmac-slack-scores runners have no pods at all)
09:53 RESOLUTION 5.1: User restarted microk8s on k8s01 (microk8s stop && microk8s start)
09:55 PHASE 5 END: All 4 Pending pods cleared, container runtime recovered

Root Causes

Phase 1: Node and Control Plane Failures

1.1 Cascading Node Reboots (Primary Trigger)

1.2 k8s01 Kubelet Crash Loop (Critical)

1.3 k8s02 Kubelet Process Hang (Critical)

1.4 k8s03 Disk Exhaustion (Critical)

1.5 GitHub Actions Runner Controller Orphaned Pods (Secondary)

1.6 OpenEBS Replica Pods Stuck Terminating (Secondary)

Phase 2: Storage and Ingress Failures

2.1 OpenEBS Jiva Snapshot Accumulation (Primary)

2.2 Residual Node Container Runtime Issues

2.3 Ingress Controller Stale Endpoint Cache

2.4 Radarr Volume Capacity (Secondary)

Phase 3: LinkAce CronJob Controller Corruption (2026-01-08)

3.1 Job Controller State Corruption (Primary - Critical)

3.2 Dqlite Database State Corruption (Primary)

3.3 Timeout Configuration Not Enforced (Secondary)

3.4 ArgoCD Auto-Sync Failure (Secondary)

Phase 4: ArgoCD Application Recovery (2026-01-08)

4.1 GitHub Actions Runner Controller Finalizer Issues (Primary)

4.2 GitHub Actions Runner Controller State Drift (Secondary)

4.3 LinkAce Helm Chart Configuration Drift (Primary)

4.4 Home Assistant Application Self-Healing (None)

Phase 5: k8s01 Container Runtime Corruption Recurrence (2026-01-09)

5.1 Persistent Container Runtime Corruption on k8s01 (Critical)

5.2 Detection Gap for Node-Local Failures (Secondary)


Impact

Services Affected

Phase 1:

Phase 2:

Phase 3:

Phase 4:

Phase 5:

Duration

Scope


Resolution Steps Taken

Phase 1: Node and Kubelet Recovery

1. k8s02 Kubelet Restart

# On k8s02 node
sudo systemctl restart snap.microk8s.daemon-kubelite

2. k8s01 Disk Cleanup and Stabilization

# On k8s01 node
# Removed unused container images
microk8s ctr images rm <image-id>...

# Cleaned container logs (methods vary)
# Removed old/stopped containers
# Result: 97% → 87% disk usage

3. k8s03 Disk Cleanup

# On k8s03 node
# Similar cleanup process
# Result: 100% → 81% disk usage
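
For reference, one possible shape of the cleanup applied on both nodes; the exact commands run during the incident varied, and the paths and sizes below are assumptions:

# Find the largest consumers of space
sudo du -xh --max-depth=1 /var/snap/microk8s/common /var/log | sort -h | tail -20

# Trim the systemd journal
sudo journalctl --vacuum-size=500M

# List image references, then remove those no longer needed
microk8s ctr images ls -q
microk8s ctr images rm <image-ref>...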

4. Runner Controller Cleanup

# Deleted orphaned controller resources
kubectl delete runnerdeployment pgmac.pgmac-runnerdeploy -n ci --context pvek8s
kubectl delete horizontalrunnerautoscaler pgmac-pgmac-runnerdeploy-autoscaler -n ci --context pvek8s

# Force-deleted 546+ orphaned runner pods in batches
# Pending pods (299)
kubectl get pods -n ci --context pvek8s --no-headers | \
  grep "pgmac.pgmac-runnerdeploy" | grep "Pending" | \
  awk '{print $1}' | xargs -I {} kubectl delete pod {} -n ci \
  --context pvek8s --force --grace-period=0 --wait=false

# ContainerStatusUnknown (110)
kubectl get pods -n ci --context pvek8s --no-headers | \
  grep "pgmac.pgmac-runnerdeploy" | grep "ContainerStatusUnknown" | \
  awk '{print $1}' | xargs -I {} kubectl delete pod {} -n ci \
  --context pvek8s --force --grace-period=0 --wait=false

# StartError/RunContainerError (58)
kubectl get pods -n ci --context pvek8s --no-headers | \
  grep "pgmac.pgmac-runnerdeploy" | grep -E "StartError|RunContainerError|Error" | \
  awk '{print $1}' | xargs -I {} kubectl delete pod {} -n ci \
  --context pvek8s --force --grace-period=0 --wait=false

# Completed (79)
kubectl get pods -n ci --context pvek8s --no-headers | \
  grep "pgmac.pgmac-runnerdeploy" | grep "Completed" | \
  awk '{print $1}' | xargs -I {} kubectl delete pod {} -n ci \
  --context pvek8s --force --grace-period=0 --wait=false

# Final cleanup batch (247)
kubectl get pods -n ci --context pvek8s --no-headers | \
  grep "pgmac.pgmac-runnerdeploy" | grep -v "Running" | grep -v "Terminating" | \
  awk '{print $1}' | xargs -I {} kubectl delete pod {} -n ci \
  --context pvek8s --force --grace-period=0 --wait=false

5. OpenEBS Replica Pod Cleanup

# Force-deleted 22 stuck Terminating replica pods
kubectl get pods -n openebs --context pvek8s | grep Terminating | \
  awk '{print $1}' | xargs -I {} kubectl delete pod {} -n openebs \
  --context pvek8s --force --grace-period=0

Phase 2: Storage and Ingress Recovery

6. k8s01 Full Restart (Residual Issues)

# On k8s01 node
microk8s stop && microk8s start

7. Jiva Snapshot Cleanup

# Triggered manual cleanup job
kubectl --context pvek8s create job -n openebs jiva-snapshot-cleanup-manual \
  --from=cronjob/jiva-snapshot-cleanup

# Job processed all Jiva volumes sequentially
# - Rolling restart of 3 replicas per volume
# - 30-second stabilization period between replicas
# - Total runtime: ~60 minutes for all volumes
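
To follow the cleanup while it ran, the job's logs can be streamed (job name as created above; a convenience check rather than part of the original procedure):

kubectl --context pvek8s logs -n openebs -f job/jiva-snapshot-cleanup-manual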

8. Ingress Controller Refresh

# Restarted all ingress controllers to clear endpoint cache
kubectl --context pvek8s delete pod -n ingress \
  nginx-ingress-microk8s-controller-2chvz \
  nginx-ingress-microk8s-controller-k56gn \
  nginx-ingress-microk8s-controller-t56r5
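
An equivalent refresh that avoids hard-coding pod names, assuming the standard MicroK8s ingress addon DaemonSet name:

kubectl --context pvek8s rollout restart daemonset/nginx-ingress-microk8s-controller -n ingress
kubectl --context pvek8s rollout status daemonset/nginx-ingress-microk8s-controller -n ingress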

9. Radarr Volume Emergency Cleanup

# Freed space by removing old backups
kubectl --context pvek8s exec -n media radarr-<pod> -- \
  sh -c 'rm -rf /config/Backups/*'

# Result: 100% → 95% usage, sufficient for startup
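
Before deleting anything, the volume's largest directories can be checked from inside the pod (pod name as above; a sketch of the inspection step):

kubectl --context pvek8s exec -n media radarr-<pod> -- df -h /config
kubectl --context pvek8s exec -n media radarr-<pod> -- sh -c 'du -sh /config/* | sort -h'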

10. k8s03 Full Restart (Replica Pod Issues)

# On k8s03 node
microk8s stop && microk8s start

# All Pending replica pods recreated successfully after restart

Phase 3: Job Controller and Database Recovery (Nuclear Option)

11. Cluster-Wide Restart with Database Backup

# Stop MicroK8s on all nodes (prevent database writes during backup)
ssh k8s01 "sudo snap stop microk8s"
ssh k8s02 "sudo snap stop microk8s"
ssh k8s03 "sudo snap stop microk8s"

# Wait for clean shutdown
sleep 30

# Backup dqlite database (on k8s01 primary node)
ssh k8s01 "sudo mkdir -p /var/snap/microk8s/common/backup && \
  sudo cp -r /var/snap/microk8s/current/var/kubernetes/backend \
  /var/snap/microk8s/common/backup/etcd-backup-20260108-201540"

# Start MicroK8s on all nodes
ssh k8s01 "sudo snap start microk8s"
ssh k8s02 "sudo snap start microk8s"
ssh k8s03 "sudo snap start microk8s"

# Wait for cluster to be ready
kubectl --context pvek8s wait --for=condition=Ready nodes --all --timeout=300s

# Verify cluster health
kubectl --context pvek8s get nodes
kubectl --context pvek8s get componentstatuses

12. Clean Job and CronJob State

# Delete all stuck jobs (85+ jobs accumulated)
kubectl --context pvek8s delete jobs -n media -l app.kubernetes.io/instance=linkace

# Delete cronjob to get fresh state
kubectl --context pvek8s delete cronjob linkace-cronjob -n media

# Trigger ArgoCD sync to recreate with fresh state
kubectl --context pvek8s patch application linkace -n argocd \
  --type merge -p '{"operation":{"sync":{"revision":"HEAD"}}}'

13. Update ArgoCD Manifest with Timeout Configuration

# In pgk8s/pgmac.net/media/templates/linkace.yaml (lines 101-114)
cronjob:
  startingDeadlineSeconds: 60 # Grace period for job creation
  activeDeadlineSeconds: 300 # 5-minute job timeout
  ttlSecondsAfterFinished: 120 # 2-minute cleanup after completion
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 2
  resources:
    limits:
      memory: 512Mi
      cpu: 500m
    requests:
      memory: 256Mi
      cpu: 100m

14. Verification

# Verify cronjob recreated with correct configuration
kubectl --context pvek8s get cronjob linkace-cronjob -n media -o yaml

# Wait for next minute and verify job creation
watch -n 10 'kubectl --context pvek8s get jobs -n media -l app.kubernetes.io/instance=linkace'

# Verify job completes successfully
kubectl --context pvek8s wait --for=condition=complete \
  job/linkace-cronjob-<generated> -n media --timeout=120s

# Verify TTL cleanup working (jobs deleted after 2 minutes)
# Monitor job count - should stabilize at 1 successful + max 2 failed
watch -n 30 'kubectl --context pvek8s get jobs -n media -l app.kubernetes.io/instance=linkace'

# Check job execution time (should be ~7 seconds)
kubectl --context pvek8s get job linkace-cronjob-<latest> -n media -o yaml | \
  grep -A 5 "startTime\|completionTime"

# Verify no orphaned pods
kubectl --context pvek8s get pods -n media -l job-name

Phase 4: ArgoCD Application and Finalizer Recovery

15. Remove Finalizers from Stuck Resources

# Remove finalizer from AutoscalingRunnerSet
kubectl --context pvek8s patch autoscalingrunnerset pgmac-slack-scores -n arc-runners \
  --type json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

# Remove finalizer from ServiceAccount
kubectl --context pvek8s patch serviceaccount pgmac-slack-scores-gha-rs-no-permission -n arc-runners \
  --type json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

# Remove finalizer from Role
kubectl --context pvek8s patch role pgmac-slack-scores-gha-rs-manager -n arc-runners \
  --type json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

# Remove finalizer from RoleBinding
kubectl --context pvek8s patch rolebinding pgmac-slack-scores-gha-rs-manager -n arc-runners \
  --type json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

# Trigger ArgoCD sync for parent application
kubectl --context pvek8s patch application ci-tools -n argocd \
  --type merge -p '{"operation":{"sync":{"revision":"HEAD"}}}'

16. Clean Up Old Runner Controller Resources

# Delete old listener resources with stale hash
kubectl --context pvek8s delete serviceaccount \
  pgmac-renovatebot-gha-rs-listener-754b578d -n arc-runners

kubectl --context pvek8s delete role \
  pgmac-renovatebot-gha-rs-listener-754b578d -n arc-runners

kubectl --context pvek8s delete rolebinding \
  pgmac-renovatebot-gha-rs-listener-754b578d -n arc-runners

# Force-delete stuck runner pods (residual from Phase 3)
kubectl --context pvek8s delete pod pgmac-renovatebot-<pod-id> -n arc-runners \
  --force --grace-period=0 --wait=false
# Repeat for all 6 stuck pods

17. Fix LinkAce Helm Chart Configuration

# Edit ArgoCD manifest to remove unsupported fields
# File: /Users/paulmacdonnell/pgmac/pgk8s/pgmac.net/media/templates/linkace.yaml
# Removed lines (backoffLimit and resources block):
#   backoffLimit: 0
#   resources:
#     limits:
#       memory: 512Mi
#       cpu: 500m
#     requests:
#       memory: 256Mi
#       cpu: 100m

# Kept critical timeout configuration:
#   startingDeadlineSeconds: 60
#   activeDeadlineSeconds: 300
#   ttlSecondsAfterFinished: 120
#   successfulJobsHistoryLimit: 1
#   failedJobsHistoryLimit: 2

# Commit changes
cd /Users/paulmacdonnell/pgmac/pgk8s
git add pgmac.net/media/templates/linkace.yaml
git commit -m "Remove unsupported LinkAce cronjob configuration"

# Handle git push rejection
git stash
git pull --rebase
git stash pop
git push

18. Verification

# Verify ci-tools application status
kubectl --context pvek8s get application ci-tools -n argocd

# Verify gharc-runners-pgmac-net-self-hosted (acceptable OutOfSync + Healthy)
kubectl --context pvek8s get application gharc-runners-pgmac-net-self-hosted -n argocd

# Verify hass application (should be Synced + Healthy)
kubectl --context pvek8s get application hass -n argocd

# Verify linkace application (acceptable OutOfSync + Healthy)
kubectl --context pvek8s get application linkace -n argocd

# Verify no stuck runner pods remain
kubectl --context pvek8s get pods -n arc-runners | grep Pending

# Verify ArgoCD sync status
kubectl --context pvek8s get applications -n argocd | grep -E "OutOfSync|Progressing"

Phase 5: k8s01 Container Runtime Recovery

19. k8s01 Container Runtime Investigation

# List all pods in arc-runners namespace
kubectl --context pvek8s get pods -n arc-runners

# Identified 4 Pending pods (12+ hours old):
# - self-hosted-l52x9-runner-2nnsr
# - self-hosted-l52x9-runner-69qnv
# - self-hosted-l52x9-runner-ls8c2
# - self-hosted-l52x9-runner-w8mcd

# Describe pod to check status
kubectl --context pvek8s describe pod self-hosted-l52x9-runner-2nnsr -n arc-runners
# Observed: PodScheduled=True, assigned to k8s01, no events generated

# Check node status
kubectl --context pvek8s get nodes
# k8s01 showing Ready status despite being unable to start containers

# Check pod locations
kubectl --context pvek8s get pods -n arc-runners -o wide
# All 4 Pending pods assigned to k8s01 node

# Check EphemeralRunner resources
kubectl --context pvek8s get ephemeralrunner -n arc-runners
# Found 10 EphemeralRunner resources but only 4 pods exist
# 6 pgmac-slack-scores runners have no corresponding pods

20. k8s01 MicroK8s Restart

# On k8s01 node (user executed)
microk8s stop && microk8s start

# Wait for node to return Ready
kubectl --context pvek8s wait --for=condition=Ready node/k8s01 --timeout=300s

# Verify Pending pods cleared
kubectl --context pvek8s get pods -n arc-runners
# All 4 Pending pods should be gone, container runtime recovered

21. Verification

# Verify no Pending pods remain in arc-runners namespace
kubectl --context pvek8s get pods -n arc-runners | grep Pending

# Verify EphemeralRunner resources
kubectl --context pvek8s get ephemeralrunner -n arc-runners

# Verify k8s01 node health
kubectl --context pvek8s describe node k8s01

# Check for any new container creation issues
kubectl --context pvek8s get events -n arc-runners --sort-by='.lastTimestamp'

Verification

Service Health Checks
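
A minimal sketch of the HTTP checks used to confirm recovery; the URLs are from the timeline above, and the exact commands run during the incident may have differed:

for url in https://sonarr.int.pgmac.net/ https://radarr.int.pgmac.net/; do
  printf '%s -> ' "${url}"
  curl -s -o /dev/null -w '%{http_code}\n' "${url}"
done
# Expect 200 for each; Overseerr was verified the same way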

Infrastructure Health

Volume Replication

Overseerr:  3/3 replicas Running
Sonarr:     3/3 replicas Running
Radarr:     3/3 replicas Running
All others: 3/3 replicas Running

Node Disk Status (Post-Cleanup)

k8s01: 87% (down from 97%)
k8s02: Stable (no initial disk pressure)
k8s03: 81% (down from 100%)

Preventive Measures

Immediate Actions Required

  1. Implement Node Disk Space Monitoring (Critical Priority)
    • Current: No alerts for disk usage >85%
    • Target: Alert at 80%, critical alert at 90%
    • Actions:
      • Deploy Prometheus node-exporter on all nodes
      • Configure AlertManager rules for disk pressure
      • Add Nagios checks for disk usage as backup
    • Rationale: Both k8s01 (97%) and k8s03 (100%) hit critical thresholds without detection
  2. Automated Container Image Garbage Collection (High Priority)
    • Current: Manual cleanup required during incident
    • Target: Automated daily cleanup maintaining <75% disk usage
    • Actions:
      • Configure kubelet imageGCHighThresholdPercent=75 (default: 85); see the sketch after this list
      • Configure kubelet imageGCLowThresholdPercent=70 (default: 80)
      • Schedule weekly cleanup cronjob as backup
    • Rationale: 4+ years of accumulated images contributed to disk exhaustion
  3. Audit Log Rotation and Buffer Management (High Priority)
    • Current: Audit buffer overload caused kubelet crashes on k8s01
    • Actions:
      • Reduce audit log verbosity (current level generating excessive data)
      • Implement aggressive log rotation (hourly vs daily)
      • Configure audit buffer size limits
      • Consider disabling detailed audit logging for non-critical operations
    • Rationale: “audit buffer queue blocked” directly caused kubelet instability
  4. Radarr PVC Expansion (High Priority)
    • Current: 1Gi volume at 95% capacity (carried over from Phase 2)
    • Target: 2Gi to accommodate media artwork growth
    • Action: Requires PVC recreation (Jiva doesn’t support online expansion)
    • Steps:
      # 1. Backup Radarr config
      # 2. Create new 2Gi PVC
      # 3. Restore data
      # 4. Update deployment to use new PVC
      
  5. Jiva Snapshot Cleanup Frequency (High Priority)
    • Current: Daily at 2 AM (threshold: 500 snapshots)
    • Problem: 1011 snapshots accumulated when cronjob couldn’t run during Phase 1
    • Actions:
      • Lower threshold from 500 to 300 snapshots
      • Increase frequency to every 12 hours (2 AM and 2 PM)
      • Add monitoring/alerting for snapshot counts >400
      • Add pod anti-affinity to ensure cleanup job can run on healthy nodes
    • Rationale: Cronjob failure during Phase 1 directly caused Phase 2 storage issues
  6. GitHub Actions Runner Controller Migration (Medium Priority)
    • Current: Orphaned runner pods consumed significant resources
    • Actions:
      • Migrate to GitHub-hosted runners or alternative self-hosted solution
      • If keeping self-hosted: implement strict maxReplicas limits
      • Add PodDisruptionBudgets to prevent runaway scaling
      • Configure aggressive pod cleanup policies
    • Rationale: 571 orphaned pods significantly contributed to cluster instability
  7. CronJob Timeout Configuration Baseline (High Priority - Added from Phase 3)
    • Current: CronJobs created without timeout settings, allowing infinite hangs
    • Target: All cronjobs have defensive timeout configuration
    • Actions:
      • Create baseline cronjob template with standard timeouts:
        • startingDeadlineSeconds: 60 (for minute-frequency jobs)
        • activeDeadlineSeconds: <appropriate for task> (e.g., 300 for 5-min tasks)
        • ttlSecondsAfterFinished: 120 (2-minute cleanup)
        • successfulJobsHistoryLimit: 1
        • failedJobsHistoryLimit: 2
      • Audit all existing cronjobs and add timeout configuration
      • Add validation in ArgoCD to require timeout settings
    • Rationale: Timeout settings proved critical for self-healing, but were missing
  8. Job Controller Health Monitoring (Critical Priority - Added from Phase 3)
    • Current: No monitoring for job controller state or corruption
    • Actions:
      • Add synthetic job creation tests every 5 minutes cluster-wide (see the sketch after this list)
      • Monitor job controller logs for “not found” errors
      • Alert on jobs with 0 pods after 2 minutes
      • Alert on jobs exceeding activeDeadlineSeconds without termination
      • Monitor dqlite database health and replication lag
    • Rationale: Job controller corruption went undetected for 16+ hours
  9. Dqlite Database Backup Automation (High Priority - Added from Phase 3)
    • Current: Manual backup procedures only
    • Target: Automated hourly backups with 24-hour retention
    • Actions:
      • Create cronjob to backup dqlite database (requires node-local execution); see the sketch after this list
      • Store backups on NFS with rotation policy
      • Document and test restoration procedure
      • Add alerts for backup failures
    • Rationale: Database backup was critical for nuclear option confidence
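
For item 2 above, a sketch of how the image GC thresholds could be set on a MicroK8s node; the --image-gc-* flags are the command-line equivalents of the imageGC*ThresholdPercent settings named above, and the args file path is the standard MicroK8s location:

# On each node: append the GC thresholds to the kubelet args and restart kubelite
echo "--image-gc-high-threshold=75" | sudo tee -a /var/snap/microk8s/current/args/kubelet
echo "--image-gc-low-threshold=70"  | sudo tee -a /var/snap/microk8s/current/args/kubelet
sudo systemctl restart snap.microk8s.daemon-kubelite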
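
For item 8, a minimal synthetic job probe that could run every few minutes; the probe name, image, namespace, and timeout are illustrative:

# Create a trivial job and alert if the job controller fails to complete it in time
name="job-probe-$(date +%s)"
kubectl --context pvek8s create job "${name}" -n default --image=busybox:1.36 -- true
if ! kubectl --context pvek8s wait --for=condition=complete "job/${name}" -n default --timeout=120s; then
  echo "ALERT: job controller did not complete synthetic job ${name}" >&2
fi
kubectl --context pvek8s delete job "${name}" -n default --wait=false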
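
For item 9, a sketch of an hourly backup with 24-copy retention, reusing the paths from Resolution step 11. Note that the incident backup was taken with MicroK8s stopped; copying the backend of a running cluster may produce an inconsistent snapshot, so this approach needs validation before it is relied on:

# Run on the primary node (k8s01) via cron or a systemd timer
ts=$(date +%Y%m%d-%H%M%S)
sudo mkdir -p /var/snap/microk8s/common/backup
sudo cp -r /var/snap/microk8s/current/var/kubernetes/backend \
  "/var/snap/microk8s/common/backup/etcd-backup-${ts}"
# Keep only the 24 most recent backups
ls -1dt /var/snap/microk8s/common/backup/etcd-backup-* | tail -n +25 | xargs -r sudo rm -rf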

Longer-Term Improvements

  1. Node Health Synthetic Testing (High Priority); see the sketch after this list
  2. Dqlite State Recovery Procedures (Medium Priority - Updated from Phase 3)
  3. Node Reboot Resilience Testing (Medium Priority)
  4. Ingress Endpoint Monitoring (Medium Priority)
    • Add monitoring to detect stale endpoint caching
    • Alert on pod IP changes not reflected in ingress logs
    • Consider automated ingress controller restarts after pod migrations
  5. Volume Capacity Monitoring (High Priority)
    • Implement alerts for PVC usage >85%
    • Current gap: No visibility into Jiva volume capacity
    • Tool: Consider deploying Prometheus with node-exporter + custom Jiva metrics
  6. Snapshot Management Strategy (Medium Priority)
    • Investigate snapshot growth rate per volume
    • Document expected snapshot accumulation patterns
    • Consider application-specific snapshot retention policies
    • Evaluate if 3-replica Jiva setup is necessary (vs 2-replica for non-critical data)
  7. MediaCover Cleanup Automation (Low Priority)
    • Radarr MediaCover directory: 837M of 974M total
    • Implement periodic cleanup of orphaned/old media artwork
    • Consider storing media artwork on NFS instead of Jiva volumes
  8. Runbook Documentation (High Priority - Updated from Phase 3)
    • Document kubelet/kubelite restart procedures for all nodes
    • Document disk cleanup emergency procedures with target thresholds
    • Document Jiva snapshot cleanup manual trigger process
    • Document ingress controller restart for endpoint refresh
    • Document force-deletion procedures for stuck pods
    • NEW: Document nuclear option procedures (cluster restart with dqlite backup)
    • NEW: Document job controller corruption recovery steps
    • NEW: Document self-healing verification checklist
    • Add to on-call playbook with estimated recovery times
    • Rationale: Multiple manual interventions required across all 3 phases; procedures must be documented
    • Reference: /tmp/linkace-cronjob-nuclear-option.md created during Phase 3
  9. Cluster Architecture Review (Low Priority)
    • Current: 4+ year old microk8s installation
    • Consider: Upgrade path to newer Kubernetes versions
    • Evaluate: Migration to managed Kubernetes (EKS, GKE, AKS) or alternative distributions
    • Rationale: Age of installation may contribute to accumulated technical debt
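
For the node health synthetic testing item above, a sketch of a per-node pod-startup canary that would have caught the Phase 5 failure mode (node Ready but unable to start containers). Node names are this cluster's; the image, sleep duration, and timeout are illustrative:

for node in k8s01 k8s02 k8s03; do
  name="canary-${node}-$(date +%s)"
  kubectl --context pvek8s run "${name}" --image=busybox:1.36 --restart=Never \
    --overrides="{\"apiVersion\":\"v1\",\"spec\":{\"nodeName\":\"${node}\"}}" -- sleep 60
  if ! kubectl --context pvek8s wait --for=condition=Ready "pod/${name}" --timeout=120s; then
    echo "ALERT: ${node} did not start a pod within 120s" >&2
  fi
  kubectl --context pvek8s delete pod "${name}" --force --grace-period=0 --wait=false
done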

Lessons Learned

What Went Well

  1. Systematic troubleshooting approach: Correctly identified kubelet issues as separate from scheduler problems
  2. Node cordoning strategy: Temporarily removing k8s02 from rotation helped isolate the problem
  3. Diagnostic tools worked effectively: kubectl commands, journalctl, and custom scripts like check-jiva-volumes.py provided crucial insights
  4. Modular architecture: Issues isolated to specific components, preventing total cluster failure
  5. Quick node recovery: microk8s restarts resolved kubelet issues within 1-2 minutes
  6. Automated cleanup existed: Jiva snapshot cleanup cronjob was already in place, just needed manual trigger
  7. Full replication: Jiva 3-replica setup meant volumes remained accessible with 2/3 replicas during the incident
  8. Force-deletion strategy: Successfully cleared 546+ orphaned pods using batched force-delete commands
  9. Phase 3 - Timeout configuration added proactively: ArgoCD manifest updated with defensive timeout settings before nuclear option
  10. Phase 3 - Database backup procedures: Successfully backed up dqlite database before nuclear option, providing rollback capability
  11. Phase 3 - Nuclear option executed cleanly: Cluster restart resolved all issues within 15 minutes with zero data loss
  12. Phase 3 - Verification thoroughness: Systematic verification of job creation, completion, TTL cleanup, and pod lifecycle

What Didn’t Go Well

  1. Cascading failure propagation: Initial node reboot triggered multiple secondary failures across all infrastructure layers
  2. No proactive monitoring: Disk usage (97%, 100%) and snapshot accumulation (1011) went undetected
  3. Kubelet instability: Disk pressure caused repeated kubelet crashes without clear error messages in pod status
  4. Database state corruption: Dqlite database corruption persisted for 48+ hours, spanning Phase 1 → Phase 3
  5. Manual intervention required: Multiple manual steps needed across 8-hour (Phase 1-2) + 16.5-hour (Phase 3) + 12+ hour (Phase 5) periods vs automated recovery
  6. Long cleanup duration: 60+ minutes for snapshot cleanup job to process all volumes
  7. Ingress endpoint caching: No automatic detection/refresh of stale endpoints
  8. Runner controller orphaned pods: 571 pods remained despite controller scaled to 0
  9. Capacity planning gap: Radarr volume undersized for actual usage patterns
  10. Node Ready status misleading: Nodes reported Ready but couldn’t start containers (kubelet vs containerd state mismatch)
  11. Cronjob failure during node issues: Snapshot cleanup cronjob couldn’t run during Phase 1, directly causing Phase 2 storage issues
  12. Phase 3 - Self-healing complete failure: Waited 1.5 hours for timeout-based self-healing that never occurred
  13. Phase 3 - Job controller corruption went undetected: 16+ hours of cronjob failures without alerting
  14. Phase 3 - Controller restarts ineffective: Multiple kubelite restarts across all nodes failed to clear corruption
  15. Phase 3 - ArgoCD auto-sync failed: GitOps automation failed when resources were deleted for clean state
  16. Phase 3 - No job controller monitoring: Zero visibility into controller state or processing errors
  17. Phase 5 - Nuclear option insufficient: Cluster-wide restart (Phase 3) didn’t clear node-local container runtime corruption on k8s01
  18. Phase 5 - Silent failure undetected: 12+ hour delay in detecting Pending pods with no container initialization
  19. Phase 5 - No node-local runtime monitoring: Zero visibility into container runtime health vs kubelet health

Surprise Findings

  1. Audit buffer overload: Audit logging directly caused kubelet crashes (not commonly documented failure mode)
  2. Dqlite database corruption persistence: Database corruption from Phase 1 persisted for 48+ hours despite multiple controller restarts
  3. Kubelet crash without pod warnings: Pods showed “Pending” with no indication kubelet was crashing
  4. Disk threshold: 97% disk usage was sufficient to crash kubelet despite ~3% of space still free
  5. Runner pod accumulation: 571 pods accumulated without triggering any resource quota or alerts
  6. Snapshot physical storage: 1011 snapshots consumed 3GB physical space in 1Gi logical volume
  7. Media artwork growth: Radarr artwork (837M) exceeded database size (33M) by 25x
  8. Cleanup job thoroughness: Job processed ALL Jiva volumes, not just over-threshold volumes
  9. Cross-phase dependency: Phase 1 kubelet/disk issues directly prevented Phase 2 cronjobs from running, and Phase 1 database corruption caused Phase 3 job controller failures
  10. Phase 3 - Job controller single point of failure: Single corrupted job reference prevented ALL job creation cluster-wide
  11. Phase 3 - Timeout settings ignored: Properly configured activeDeadlineSeconds and ttlSecondsAfterFinished completely ignored by corrupted controller
  12. Phase 3 - Controller restart insufficient: Restarting kubelite service didn’t clear in-memory controller state
  13. Phase 3 - Nuclear option effectiveness: Full cluster restart immediately resolved all controller corruption issues
  14. Phase 3 - Self-healing timeline invalid: Expected 6-12 hour self-healing never occurred; corruption was permanent without intervention
  15. Phase 5 - Nuclear option scope limitation: Cluster restart cleared cluster-global state (dqlite, controllers) but not node-local container runtime corruption
  16. Phase 5 - Corruption dormancy: Container runtime corruption from Phase 1 remained dormant for 48+ hours until new workloads attempted to schedule on k8s01
  17. Phase 5 - Silent failure persistence: Same silent failure pattern from Phase 2 (PodScheduled=True, no events) persisted despite Phase 3 nuclear option

Action Items

Priority Action Owner Due Date Status
Critical Deploy node disk space monitoring with alerts (80%/90% thresholds) SRE 2026-01-08 Open
Critical Configure automated container image garbage collection (75% threshold) SRE 2026-01-09 Open
Critical Implement job controller health monitoring with synthetic tests SRE 2026-01-09 Open
High Implement audit log rotation and reduce verbosity SRE 2026-01-10 Open
High Expand Radarr PVC from 1Gi to 2Gi SRE 2026-01-13 Open
High Lower snapshot threshold to 300, increase cleanup frequency to 12h SRE 2026-01-08 Open
High Audit all cronjobs and add timeout configuration baseline SRE 2026-01-15 Open
High Implement automated dqlite database backups (hourly, 24h retention) SRE 2026-01-10 Open
High Document nuclear option runbook (cluster restart with dqlite backup) SRE 2026-01-12 Open
High Implement synthetic pod startup health checks on all nodes SRE 2026-01-15 Open
High Add PVC capacity monitoring and alerting (>85%) SRE 2026-01-20 Open
Medium Test dqlite backup restoration in non-production scenario SRE 2026-01-17 Open
Medium Add dqlite replication lag monitoring SRE 2026-01-20 Open
Medium Migrate GitHub Actions to hosted runners or implement strict limits SRE 2026-01-27 Open
Medium Test node reboot resilience with controlled failures SRE 2026-02-03 Open
Medium Investigate k8s01/k8s03 kubelet/containerd logs from incident SRE 2026-01-13 Open
Medium Add ingress endpoint staleness monitoring SRE 2026-02-10 Open
Medium Investigate ArgoCD auto-sync failure for deleted resources SRE 2026-01-20 Open
Low Implement Radarr MediaCover cleanup automation Dev 2026-02-03 Open
Low Evaluate reducing Jiva replication from 3 to 2 for non-critical data SRE 2026-02-10 Open
Low Review cluster architecture and upgrade path SRE 2026-03-01 Open
Low Consider scheduled preventive cluster restarts (quarterly) SRE 2026-03-01 Open

Technical Details

Environment

Affected Resources

Phase 1:

Namespaces: ci, openebs, kube-system, all namespaces (scheduler impact)
Nodes:
  - k8s01: Kubelet crash loop (disk 97% + audit buffer overload)
  - k8s02: Kubelet hung (process restart required)
  - k8s03: Disk 100% full (garbage collection failure)
Pods:
  - GitHub Actions runners: 571 orphaned (299 Pending, 110 ContainerStatusUnknown, 79 Completed, 58 StartError, 25 other)
  - OpenEBS replicas: 22 stuck Terminating
  - Various: Unable to start/stop across all namespaces

Phase 2:

Namespaces: media, openebs, ingress
Pods:
  - sonarr-7b8f6fcfc4-4wm8m (Pending → Running)
  - radarr-5c95c64cff-* (CrashLoopBackOff → Running, multiple restarts)
  - overseerr-58cc7d4569-kllz2 (Running, intermittent timeouts)
PVCs:
  - radarr-config (pvc-311bef00..., 1Gi, 100% full → 95% after cleanup)
  - sonarr-config (pvc-17e6e808..., 1Gi, 1011 snapshots)
  - overseerr-config (pvc-05e03b60..., 1Gi, 1011 snapshots)

Snapshot Cleanup Job Output

Volumes processed: 13
Volumes cleaned: 13
Snapshots consolidated: 1011 → ~100 per volume (estimated)
Duration: ~60 minutes
Method: Rolling restart of replicas (3 per volume, 30s stabilization between)

Node Disk Usage Timeline

k8s01: 97% (critical) → 87% (stable) after cleanup
k8s02: Stable throughout (no disk pressure)
k8s03: 100% (critical) → 81% (stable) after cleanup

Kubelet Error Patterns (Phase 1)

k8s01 errors:
- "audit buffer queue blocked"
- "database is locked" (kine)
- "Failed to garbage collect required amount of images"
- "Kubelet stopped posting node status"

k8s02 errors:
- Pods assigned but never reached ContainerCreating (silent failure)

k8s03 errors:
- "Failed to garbage collect required amount of images. Attempted to free 13GB, but only found 0 bytes eligible to free"

References


Reviewers


Notes

This incident demonstrated the fragility of a long-running Kubernetes cluster under cascading failure conditions across five distinct phases spanning 2026-01-06 to 2026-01-09. Key takeaways:

Cross-Phase Insights

  1. Disk pressure is a critical failure mode: Both 97% and 100% disk usage caused complete kubelet failure, not just degraded performance
  2. Audit logging can become a liability: Excessive audit log generation directly caused kubelet crashes via buffer overload
  3. Node “Ready” status is insufficient: Nodes reported Ready while unable to start containers (kubelet vs containerd state mismatch)
  4. Cascading failures span days, not hours: Initial Phase 1 node reboot → disk pressure → kubelet failures → dqlite corruption → 48 hours later → Phase 3 job controller corruption
  5. Automated cleanup jobs are single points of failure: Snapshot cleanup cronjob failure during Phase 1 directly caused Phase 2 storage issues
  6. Orphaned pods accumulate silently: 571 runner pods accumulated over time without triggering resource quotas or alerts
  7. Force-deletion is sometimes necessary: Normal deletion failed for 546+ pods due to finalizer/controller corruption
  8. Database state corruption is persistent: Dqlite corruption persisted for 48+ hours despite multiple controller restarts
  9. Multiple layers require monitoring: Node health, disk space, kubelet status, pod lifecycle, storage subsystem, ingress endpoints, controller state
  10. Age matters: 4+ year old installation accumulated technical debt (images, logs, state corruption)

Phase 3-Specific Insights (Job Controller Corruption)

  1. Controller corruption is catastrophic: Single corrupted job reference prevented ALL job creation cluster-wide
  2. Service restarts don’t clear all state: Restarting kubelite service didn’t clear in-memory controller state or dqlite database corruption
  3. Self-healing has limits: Properly configured timeout settings (activeDeadlineSeconds, ttlSecondsAfterFinished) were completely ignored by corrupted controller
  4. Nuclear option is sometimes necessary: Full cluster restart with database backup was the only effective recovery path
  5. Timeout configuration is defensive, not curative: Timeout settings prevent runaway resource consumption but don’t fix controller corruption
  6. Job controller is a single point of failure: No redundancy or failover mechanism for corrupted job controller state
  7. GitOps auto-sync can fail: ArgoCD auto-sync failed when resources deleted for clean state, requiring manual intervention
  8. Database backups provide confidence: Having dqlite backup before nuclear option provided rollback capability and reduced risk
  9. Verification is critical: Systematic verification of job lifecycle (creation → pod spawn → completion → TTL cleanup) necessary after controller recovery
  10. Controller monitoring is essential: Zero visibility into job controller processing state delayed detection by 16+ hours

The resolution required comprehensive intervention across all infrastructure layers (compute, storage, networking, control plane, database), demonstrating the interconnected nature of Kubernetes cluster health.

Future incidents can be prevented or mitigated through the preventive measures outlined above.

The multi-phase nature of this incident (five phases spanning three days) highlights that cascading failures can have long-term delayed effects requiring sustained vigilance and multiple recovery strategies beyond initial stabilization.