
Post Incident Review: Cascading Kubernetes Cluster Failures

Date: 2026-01-06 to 2026-01-09
Duration: ~8 hours of acute outage on 2026-01-06 (estimated 09:00 - 17:00 AEST), with follow-up phases through 2026-01-09
Severity: Critical (Complete cluster instability, multiple service outages)
Status: Resolved


Executive Summary

A cascading failure across the microk8s Kubernetes cluster began with unplanned node reboots, leading to widespread kubelet failures, disk exhaustion, controller corruption, and ultimately service outages. The incident progressed through five distinct phases spanning January 6-9, 2026:

Phase 1 (2026-01-06 09:00-12:30): Cascading node failures caused kubelet hangs on all three nodes due to disk pressure (97-100% usage), audit buffer overload, and orphaned pod accumulation. The cluster reached a critical state where pods could not be scheduled, started, or terminated. 571 orphaned GitHub Actions runner pods and 22 stuck OpenEBS replica pods contributed to resource exhaustion.

Phase 2 (2026-01-06 12:30-15:35): After stabilizing node operations, secondary issues emerged: OpenEBS Jiva volume snapshot accumulation (1011+ snapshots), ingress controller endpoint caching failures, and volume capacity exhaustion. Multiple media services (Sonarr, Radarr, Overseerr) became inaccessible.

Phase 3 (2026-01-08 02:00-18:25): Job controller corruption prevented all cluster-wide job creation for 16.5 hours. Resolving the persistent database state corruption originating from Phase 1 required the nuclear option (cluster restart with a dqlite backup).

Phase 4 (2026-01-08 19:00-19:45): ArgoCD application recovery required manual finalizer removal and configuration fixes for GitHub Actions runner controllers and LinkAce cronjob.

Phase 5 (2026-01-09 09:50-09:55): k8s01 container runtime corruption recurred 48+ hours after the Phase 3 nuclear option, demonstrating that the cluster restart cleared cluster-global state but not node-local container runtime issues. Four runner pods were stuck Pending for 12+ hours due to the same silent failure pattern.

Resolution required systematic intervention across multiple infrastructure layers: node recovery, disk cleanup, pod force-deletion, storage subsystem repair, ingress refresh, database backup/restart, and multiple node-local container runtime restarts. All services restored to full functionality with complete volume replication (3/3 replicas).


Timeline (AEST - UTC+10)

Phase 1: Cascading Node and Kubelet Failures

Time Event
~09:00 INCIDENT START: Unplanned node reboots across k8s01, k8s02, k8s03 (likely power event or scheduled maintenance)
09:15-09:30 Cluster comes back online but exhibits severe instability: pods not scheduling, not starting, not terminating
09:30-10:00 Initial diagnostics: Control plane components healthy, scheduler functioning, but kubelets not processing assigned pods
10:00-10:15 Identified k8s02 kubelet hung: pods assigned by scheduler but never reaching ContainerCreating state
10:15-10:20 RESOLUTION 1.1: Restarted kubelite on k8s02 (systemctl restart snap.microk8s.daemon-kubelite)
10:20-10:30 k8s01 kubelet repeatedly crashing: “Kubelet stopped posting node status” within minutes of restart
10:30-10:45 Root cause analysis k8s01: Disk at 97% usage + audit buffer overload (“audit buffer queue blocked” errors)
10:45-11:00 Database lock errors in kine (etcd replacement): “database is locked” preventing state updates
11:00-11:15 k8s03 diagnostics: Disk at 100% capacity with garbage collection failures
11:15-11:30 Discovered 571 orphaned GitHub Actions runner pods in ci namespace (deployment scaled to 0 but pods remained)
11:30-11:45 RESOLUTION 1.2: Disk cleanup on k8s01 (container images, logs) reducing from 97% → 87% usage
11:45-12:00 RESOLUTION 1.3: Disk cleanup on k8s03 reducing from 100% → 81% usage
12:00-12:15 k8s01 kubelet stabilized after disk cleanup, node maintaining Ready status
12:15-12:20 Deleted RunnerDeployment and HorizontalRunnerAutoscaler (GitHub Actions runner controller orphaned)
12:20-12:25 Force-deleted 22 OpenEBS replica pods stuck in Terminating state
12:25-12:30 Began aggressive force-deletion of 571 runner pods in batches (Pending, ContainerStatusUnknown, StartError, Completed)

Phase 2: Storage and Ingress Service Outages

Time Event
~12:30 PHASE 2 START: User reports 504 Gateway Timeout errors for Sonarr at https://sonarr.int.pgmac.net/
12:30-12:45 Initial investigation: Examined ingress controller logs showing upstream timeouts to pods at old IP addresses (10.1.236.34:8989 for Sonarr, etc.)
12:45-13:00 Root cause analysis: Discovered Radarr pod in CrashLoopBackOff with “No space left on device” error. Sonarr pod Pending on k8s01 node.
13:00-13:15 Volume analysis: Identified OpenEBS Jiva volumes with excessive snapshots (1011 vs 500 threshold) affecting Radarr, Sonarr, and Overseerr
13:15-13:30 Node troubleshooting: Identified k8s01 node unable to start new containers despite being healthy (residual kubelet issues from Phase 1)
13:30-13:35 RESOLUTION 2.1: Restarted microk8s on k8s01, resolving pod scheduling issues
13:35-14:00 Snapshot cleanup: Triggered manual Jiva snapshot cleanup job (jiva-snapshot-cleanup-manual)
13:52 Cleanup job started processing Overseerr volume (pvc-05e03b60)
13:57 Sonarr volume (pvc-17e6e808) cleanup completed
14:10 RESOLUTION 2.2: Restarted all 3 ingress controller pods to clear stale endpoint cache
14:11 SERVICE RESTORED: Sonarr accessible at https://sonarr.int.pgmac.net/ (200 OK responses)
14:15 Overseerr confirmed accessible (200 OK responses)
14:20 Radarr volume (pvc-311bef00) cleanup completed
14:25 Radarr pod still crashing: volume at 100% capacity (958M/974M used)
14:28 RESOLUTION 2.3: Cleared 49M of old backups from Radarr volume, reducing to 95% usage
14:30 SERVICE RESTORED: Radarr accessible at https://radarr.int.pgmac.net/
14:35 Identified 8-9 Jiva replica pods stuck in Pending state on k8s03 (residual from Phase 1)
~15:30 RESOLUTION 2.4: Restarted microk8s on k8s03, resolving all Pending replica pods
15:35 PHASE 2 END: All services operational, all replicas running (3/3), no problematic pods

Cleanup Operations (Parallel with Phase 2)

Time Event
12:30-12:45 Force-deleted 299 Pending runner pods
12:45-13:00 Force-deleted 110 ContainerStatusUnknown runner pods
13:00-13:15 Force-deleted 58 StartError/RunContainerError runner pods
13:15-13:30 Force-deleted 79 Completed runner pods
13:30-14:00 Force-deleted final batch of 247 non-Running/non-Terminating runner pods
14:00 Runner pod cleanup substantially complete: 128 Terminating and 18 Running pods remain after the final batch of 247 deletions

Phase 3: LinkAce CronJob Controller Corruption (2026-01-08 02:00-18:25)

Time Event
2026-01-08 ~02:00 PHASE 3 START: LinkAce cronjob (* * * * * schedule) begins failing to create jobs successfully
02:00-05:00 Cronjob creates job objects but pods orphaned (parent job deleted before pod creation)
05:00-05:30 44+ jobs stuck in Running state (0/1 completions, 6min-11h old), 24+ pods Pending
05:30-06:00 Investigation reveals job controller stuck syncing deleted job linkace-cronjob-29463021
06:00-06:15 Job controller logs: “syncing job: tracking status: jobs.batch not found” errors
06:15-06:30 Cleanup attempts: Suspended cronjob, deleted orphaned pods, cleared stale active jobs
06:30-07:00 Restarted kubelite on k8s01, temporary improvement but orphaned job reference persists
07:00-08:00 Created a dummy job with the stale name and deleted it cleanly, but new jobs still not creating pods
08:00-09:00 User added timeout configuration to ArgoCD manifest (activeDeadlineSeconds: 300, ttlSecondsAfterFinished: 120)
09:00-09:30 ArgoCD synced configuration successfully but cronjob deleted to recreate cleanly
09:30-10:00 ArgoCD failed to auto-recreate deleted cronjob despite OutOfSync status
10:00-10:30 Manually recreated cronjob, but job controller completely wedged (not creating pods for any jobs)
10:30-18:00 Self-healing attempted: waited 1.5 hours for TTL cleanup and active deadline enforcement - failed
18:00-18:05 Jobs created by cronjob but no pods spawned, active deadline not enforced (jobs 85+ min old still Running)
18:05-18:10 TTL cleanup not working (no jobs auto-deleted after completion)
18:10 DECISION: Nuclear option approved - etcd cleanup with cluster restart
18:12-18:15 RESOLUTION 3.1: Stopped MicroK8s on all 3 nodes (k8s01, k8s02, k8s03)
18:15 RESOLUTION 3.2: Backed up etcd/dqlite database to /var/snap/microk8s/common/backup/etcd-backup-20260108-201540
18:15-18:17 RESOLUTION 3.3: Restarted MicroK8s cluster, all nodes returned Ready
18:17-18:18 RESOLUTION 3.4: Force-deleted all stuck jobs and cronjob
18:18-18:19 RESOLUTION 3.5: Triggered ArgoCD sync to recreate cronjob with fresh state
18:19 Cronjob recreated successfully with all timeout settings applied
18:20-18:22 First job (linkace-cronjob-29464462) created successfully, completed in 7 seconds
18:22-18:25 TTL cleanup verified working: completed jobs auto-deleted after 2 minutes
18:25 PHASE 3 END: Cronjob fully functional, no orphaned pods, all cleanup mechanisms working

Phase 4: ArgoCD Application Recovery (2026-01-08 ~19:00-19:45)

Time Event
2026-01-08 ~19:00 PHASE 4 START: Investigation of 5 ArgoCD applications stuck OutOfSync or Progressing
19:00-19:05 Identified problematic applications: ci-tools (OutOfSync + Progressing), gharc-runners-pgmac-net-self-hosted (OutOfSync + Healthy), gharc-runners-pgmac-user-self-hosted (Synced + Progressing), hass (Synced + Progressing), linkace (OutOfSync + Healthy)
19:05-19:15 ci-tools investigation: Found child application gharc-runners-pgmac-user-self-hosted stuck with resources “Pending deletion”
19:15-19:18 RESOLUTION 4.1: Removed finalizers from 4 stuck resources (AutoscalingRunnerSet, ServiceAccount, Role, RoleBinding) using kubectl patch --type json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
19:18-19:20 Triggered ArgoCD sync for ci-tools, application became Synced + Healthy
19:20-19:25 gharc-runners-pgmac-net-self-hosted investigation: Found old listener resources with hash 754b578d needing deletion
19:25-19:28 Deleted 3 old listener resources manually (ServiceAccount, Role, RoleBinding)
19:28-19:30 Discovered 6 runner pods stuck Pending for 44+ minutes (residual from Phase 3 job controller corruption)
19:30-19:32 Force-deleted 6 stuck runner pods: pgmac-renovatebot-* pods with PodScheduled=True but no containers created
19:32-19:33 Application status: OutOfSync + Healthy (acceptable due to ignoreDifferences configuration for AutoscalingListener, Role, RoleBinding)
19:33-19:35 gharc-runners-pgmac-user-self-hosted: Already deleted during ci-tools cleanup
19:35-19:37 hass investigation: Application self-resolved during investigation, showing Synced + Healthy (StatefulSet rollout completed)
19:37-19:40 linkace investigation: Found linkace-cronjob OutOfSync despite application Healthy, ArgoCD attempted 23 auto-heal operations
19:40-19:42 Root cause identified: LinkAce Helm chart doesn’t support backoffLimit and resources configuration in cronjob
19:42-19:43 RESOLUTION 4.2: Edited /Users/paulmacdonnell/pgmac/pgk8s/pgmac.net/media/templates/linkace.yaml to remove unsupported fields (backoffLimit, resources block)
19:43 Kept critical timeout settings: startingDeadlineSeconds, activeDeadlineSeconds, ttlSecondsAfterFinished, history limits
19:43-19:44 Committed changes with message “Remove unsupported LinkAce cronjob configuration”
19:44 Git push rejected due to remote changes, used git stash && git pull --rebase && git stash pop && git push
19:45 PHASE 4 END: All applications resolved or explained; 2 applications Synced + Healthy (ci-tools, hass), 2 applications OutOfSync + Healthy acceptable (gharc-runners-pgmac-net-self-hosted, linkace), 1 application deleted (gharc-runners-pgmac-user-self-hosted)

Phase 5: k8s01 Container Runtime Corruption Recurrence (2026-01-09 ~09:50-09:55)

Time Event
2026-01-09 ~09:50 PHASE 5 START: Investigation revealed 4 runner pods in arc-runners namespace stuck in Pending state for 12+ hours
09:50-09:51 Identified all 4 Pending pods assigned to k8s01 node: self-hosted-l52x9-runner-2nnsr, -69qnv, -ls8c2, -w8mcd
09:51 Pod describe showed PodScheduled=True but no container initialization, no events generated (silent failure pattern from Phase 2/3)
09:51-09:52 Verified k8s01 node showing Ready status despite being unable to start new containers
09:52 Root cause identified: Container runtime state corruption on k8s01 (residual from Phase 1-3, not fully cleared by Phase 3 nuclear option)
09:52-09:53 Found 10 EphemeralRunner resources but only 4 pods exist (6 pgmac-slack-scores runners have no pods at all)
09:53 RESOLUTION 5.1: User restarted microk8s on k8s01 (microk8s stop && microk8s start)
09:55 PHASE 5 END: All 4 Pending pods cleared, container runtime recovered

Root Causes

Phase 1: Node and Control Plane Failures

1.1 Cascading Node Reboots (Primary Trigger)

1.2 k8s01 Kubelet Crash Loop (Critical)

1.3 k8s02 Kubelet Process Hang (Critical)

1.4 k8s03 Disk Exhaustion (Critical)

1.5 GitHub Actions Runner Controller Orphaned Pods (Secondary)

1.6 OpenEBS Replica Pods Stuck Terminating (Secondary)

Phase 2: Storage and Ingress Failures

2.1 OpenEBS Jiva Snapshot Accumulation (Primary)

2.2 Residual Node Container Runtime Issues

2.3 Ingress Controller Stale Endpoint Cache

2.4 Radarr Volume Capacity (Secondary)

Phase 3: LinkAce CronJob Controller Corruption (2026-01-08)

3.1 Job Controller State Corruption (Primary - Critical)

3.2 Dqlite Database State Corruption (Primary)

3.3 Timeout Configuration Not Enforced (Secondary)

3.4 ArgoCD Auto-Sync Failure (Secondary)

Phase 4: ArgoCD Application Recovery (2026-01-08)

4.1 GitHub Actions Runner Controller Finalizer Issues (Primary)

4.2 GitHub Actions Runner Controller State Drift (Secondary)

4.3 LinkAce Helm Chart Configuration Drift (Primary)

4.4 Home Assistant Application Self-Healing (None)

Phase 5: k8s01 Container Runtime Corruption Recurrence (2026-01-09)

5.1 Persistent Container Runtime Corruption on k8s01 (Critical)

5.2 Detection Gap for Node-Local Failures (Secondary)


Impact

Services Affected

Phase 1:

Phase 2:

Phase 3:

Phase 4:

Phase 5:

Duration

Scope


Resolution Steps Taken

Phase 1: Node and Kubelet Recovery

1. k8s02 Kubelet Restart

# On k8s02 node
sudo systemctl restart snap.microk8s.daemon-kubelite

2. k8s01 Disk Cleanup and Stabilization

# On k8s01 node
# Removed unused container images
microk8s ctr images rm <image-id>...

# Cleaned container logs (methods vary)
# Removed old/stopped containers
# Result: 97% → 87% disk usage

3. k8s03 Disk Cleanup

# On k8s03 node
# Similar cleanup process
# Result: 100% → 81% disk usage
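
For reference, one possible shape of the cleanup applied on both nodes; the exact commands run during the incident varied, and the paths and sizes below are assumptions:

# Find the largest consumers of space
sudo du -xh --max-depth=1 /var/snap/microk8s/common /var/log | sort -h | tail -20

# Trim the systemd journal
sudo journalctl --vacuum-size=500M

# List image references, then remove those no longer needed
microk8s ctr images ls -q
microk8s ctr images rm <image-ref>...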

4. Runner Controller Cleanup

# Deleted orphaned controller resources
kubectl delete runnerdeployment pgmac.pgmac-runnerdeploy -n ci --context pvek8s
kubectl delete horizontalrunnerautoscaler pgmac-pgmac-runnerdeploy-autoscaler -n ci --context pvek8s

# Force-deleted 546+ orphaned runner pods in batches
# Pending pods (299)
kubectl get pods -n ci --context pvek8s --no-headers | \
  grep "pgmac.pgmac-runnerdeploy" | grep "Pending" | \
  awk '{print $1}' | xargs -I {} kubectl delete pod {} -n ci \
  --context pvek8s --force --grace-period=0 --wait=false

# ContainerStatusUnknown (110)
kubectl get pods -n ci --context pvek8s --no-headers | \
  grep "pgmac.pgmac-runnerdeploy" | grep "ContainerStatusUnknown" | \
  awk '{print $1}' | xargs -I {} kubectl delete pod {} -n ci \
  --context pvek8s --force --grace-period=0 --wait=false

# StartError/RunContainerError (58)
kubectl get pods -n ci --context pvek8s --no-headers | \
  grep "pgmac.pgmac-runnerdeploy" | grep -E "StartError|RunContainerError|Error" | \
  awk '{print $1}' | xargs -I {} kubectl delete pod {} -n ci \
  --context pvek8s --force --grace-period=0 --wait=false

# Completed (79)
kubectl get pods -n ci --context pvek8s --no-headers | \
  grep "pgmac.pgmac-runnerdeploy" | grep "Completed" | \
  awk '{print $1}' | xargs -I {} kubectl delete pod {} -n ci \
  --context pvek8s --force --grace-period=0 --wait=false

# Final cleanup batch (247)
kubectl get pods -n ci --context pvek8s --no-headers | \
  grep "pgmac.pgmac-runnerdeploy" | grep -v "Running" | grep -v "Terminating" | \
  awk '{print $1}' | xargs -I {} kubectl delete pod {} -n ci \
  --context pvek8s --force --grace-period=0 --wait=false

5. OpenEBS Replica Pod Cleanup

# Force-deleted 22 stuck Terminating replica pods
kubectl get pods -n openebs --context pvek8s | grep Terminating | \
  awk '{print $1}' | xargs -I {} kubectl delete pod {} -n openebs \
  --context pvek8s --force --grace-period=0

Phase 2: Storage and Ingress Recovery

6. k8s01 Full Restart (Residual Issues)

# On k8s01 node
microk8s stop && microk8s start

7. Jiva Snapshot Cleanup

# Triggered manual cleanup job
kubectl --context pvek8s create job -n openebs jiva-snapshot-cleanup-manual \
  --from=cronjob/jiva-snapshot-cleanup

# Job processed all Jiva volumes sequentially
# - Rolling restart of 3 replicas per volume
# - 30-second stabilization period between replicas
# - Total runtime: ~60 minutes for all volumes
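
To follow the cleanup while it ran, the job's logs can be streamed (job name as created above; a convenience check rather than part of the original procedure):

kubectl --context pvek8s logs -n openebs -f job/jiva-snapshot-cleanup-manual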

8. Ingress Controller Refresh

# Restarted all ingress controllers to clear endpoint cache
kubectl --context pvek8s delete pod -n ingress \
  nginx-ingress-microk8s-controller-2chvz \
  nginx-ingress-microk8s-controller-k56gn \
  nginx-ingress-microk8s-controller-t56r5
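
An equivalent refresh that avoids hard-coding pod names, assuming the standard MicroK8s ingress addon DaemonSet name:

kubectl --context pvek8s rollout restart daemonset/nginx-ingress-microk8s-controller -n ingress
kubectl --context pvek8s rollout status daemonset/nginx-ingress-microk8s-controller -n ingress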

9. Radarr Volume Emergency Cleanup

# Freed space by removing old backups
kubectl --context pvek8s exec -n media radarr-<pod> -- \
  sh -c 'rm -rf /config/Backups/*'

# Result: 100% → 95% usage, sufficient for startup
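
Before deleting anything, the volume's largest directories can be checked from inside the pod (pod name as above; a sketch of the inspection step):

kubectl --context pvek8s exec -n media radarr-<pod> -- df -h /config
kubectl --context pvek8s exec -n media radarr-<pod> -- sh -c 'du -sh /config/* | sort -h'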

10. k8s03 Full Restart (Replica Pod Issues)

# On k8s03 node
microk8s stop && microk8s start

# All Pending replica pods recreated successfully after restart

Phase 3: Job Controller and Database Recovery (Nuclear Option)

11. Cluster-Wide Restart with Database Backup

# Stop MicroK8s on all nodes (prevent database writes during backup)
ssh k8s01 "sudo snap stop microk8s"
ssh k8s02 "sudo snap stop microk8s"
ssh k8s03 "sudo snap stop microk8s"

# Wait for clean shutdown
sleep 30

# Backup dqlite database (on k8s01 primary node)
ssh k8s01 "sudo mkdir -p /var/snap/microk8s/common/backup && \
  sudo cp -r /var/snap/microk8s/current/var/kubernetes/backend \
  /var/snap/microk8s/common/backup/etcd-backup-20260108-201540"

# Start MicroK8s on all nodes
ssh k8s01 "sudo snap start microk8s"
ssh k8s02 "sudo snap start microk8s"
ssh k8s03 "sudo snap start microk8s"

# Wait for cluster to be ready
kubectl --context pvek8s wait --for=condition=Ready nodes --all --timeout=300s

# Verify cluster health
kubectl --context pvek8s get nodes
kubectl --context pvek8s get componentstatuses

12. Clean Job and CronJob State

# Delete all stuck jobs (85+ jobs accumulated)
kubectl --context pvek8s delete jobs -n media -l app.kubernetes.io/instance=linkace

# Delete cronjob to get fresh state
kubectl --context pvek8s delete cronjob linkace-cronjob -n media

# Trigger ArgoCD sync to recreate with fresh state
kubectl --context pvek8s patch application linkace -n argocd \
  --type merge -p '{"operation":{"sync":{"revision":"HEAD"}}}'

13. Update ArgoCD Manifest with Timeout Configuration

# In pgk8s/pgmac.net/media/templates/linkace.yaml (lines 101-114)
cronjob:
  startingDeadlineSeconds: 60 # Grace period for job creation
  activeDeadlineSeconds: 300 # 5-minute job timeout
  ttlSecondsAfterFinished: 120 # 2-minute cleanup after completion
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 2
  resources:
    limits:
      memory: 512Mi
      cpu: 500m
    requests:
      memory: 256Mi
      cpu: 100m

14. Verification

# Verify cronjob recreated with correct configuration
kubectl --context pvek8s get cronjob linkace-cronjob -n media -o yaml

# Wait for next minute and verify job creation
watch -n 10 'kubectl --context pvek8s get jobs -n media -l app.kubernetes.io/instance=linkace'

# Verify job completes successfully
kubectl --context pvek8s wait --for=condition=complete \
  job/linkace-cronjob-<generated> -n media --timeout=120s

# Verify TTL cleanup working (jobs deleted after 2 minutes)
# Monitor job count - should stabilize at 1 successful + max 2 failed
watch -n 30 'kubectl --context pvek8s get jobs -n media -l app.kubernetes.io/instance=linkace'

# Check job execution time (should be ~7 seconds)
kubectl --context pvek8s get job linkace-cronjob-<latest> -n media -o yaml | \
  grep -A 5 "startTime\|completionTime"

# Verify no orphaned pods
kubectl --context pvek8s get pods -n media -l job-name

Phase 4: ArgoCD Application and Finalizer Recovery

15. Remove Finalizers from Stuck Resources

# Remove finalizer from AutoscalingRunnerSet
kubectl --context pvek8s patch autoscalingrunnerset pgmac-slack-scores -n arc-runners \
  --type json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

# Remove finalizer from ServiceAccount
kubectl --context pvek8s patch serviceaccount pgmac-slack-scores-gha-rs-no-permission -n arc-runners \
  --type json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

# Remove finalizer from Role
kubectl --context pvek8s patch role pgmac-slack-scores-gha-rs-manager -n arc-runners \
  --type json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

# Remove finalizer from RoleBinding
kubectl --context pvek8s patch rolebinding pgmac-slack-scores-gha-rs-manager -n arc-runners \
  --type json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

# Trigger ArgoCD sync for parent application
kubectl --context pvek8s patch application ci-tools -n argocd \
  --type merge -p '{"operation":{"sync":{"revision":"HEAD"}}}'

16. Clean Up Old Runner Controller Resources

# Delete old listener resources with stale hash
kubectl --context pvek8s delete serviceaccount \
  pgmac-renovatebot-gha-rs-listener-754b578d -n arc-runners

kubectl --context pvek8s delete role \
  pgmac-renovatebot-gha-rs-listener-754b578d -n arc-runners

kubectl --context pvek8s delete rolebinding \
  pgmac-renovatebot-gha-rs-listener-754b578d -n arc-runners

# Force-delete stuck runner pods (residual from Phase 3)
kubectl --context pvek8s delete pod pgmac-renovatebot-<pod-id> -n arc-runners \
  --force --grace-period=0 --wait=false
# Repeat for all 6 stuck pods

17. Fix LinkAce Helm Chart Configuration

# Edit ArgoCD manifest to remove unsupported fields
# File: /Users/paulmacdonnell/pgmac/pgk8s/pgmac.net/media/templates/linkace.yaml
# Removed lines (backoffLimit and resources block):
#   backoffLimit: 0
#   resources:
#     limits:
#       memory: 512Mi
#       cpu: 500m
#     requests:
#       memory: 256Mi
#       cpu: 100m

# Kept critical timeout configuration:
#   startingDeadlineSeconds: 60
#   activeDeadlineSeconds: 300
#   ttlSecondsAfterFinished: 120
#   successfulJobsHistoryLimit: 1
#   failedJobsHistoryLimit: 2

# Commit changes
cd /Users/paulmacdonnell/pgmac/pgk8s
git add pgmac.net/media/templates/linkace.yaml
git commit -m "Remove unsupported LinkAce cronjob configuration"

# Handle git push rejection
git stash
git pull --rebase
git stash pop
git push

18. Verification

# Verify ci-tools application status
kubectl --context pvek8s get application ci-tools -n argocd

# Verify gharc-runners-pgmac-net-self-hosted (acceptable OutOfSync + Healthy)
kubectl --context pvek8s get application gharc-runners-pgmac-net-self-hosted -n argocd

# Verify hass application (should be Synced + Healthy)
kubectl --context pvek8s get application hass -n argocd

# Verify linkace application (acceptable OutOfSync + Healthy)
kubectl --context pvek8s get application linkace -n argocd

# Verify no stuck runner pods remain
kubectl --context pvek8s get pods -n arc-runners | grep Pending

# Verify ArgoCD sync status
kubectl --context pvek8s get applications -n argocd | grep -E "OutOfSync|Progressing"

Phase 5: k8s01 Container Runtime Recovery

19. k8s01 Container Runtime Investigation

# List all pods in arc-runners namespace
kubectl --context pvek8s get pods -n arc-runners

# Identified 4 Pending pods (12+ hours old):
# - self-hosted-l52x9-runner-2nnsr
# - self-hosted-l52x9-runner-69qnv
# - self-hosted-l52x9-runner-ls8c2
# - self-hosted-l52x9-runner-w8mcd

# Describe pod to check status
kubectl --context pvek8s describe pod self-hosted-l52x9-runner-2nnsr -n arc-runners
# Observed: PodScheduled=True, assigned to k8s01, no events generated

# Check node status
kubectl --context pvek8s get nodes
# k8s01 showing Ready status despite being unable to start containers

# Check pod locations
kubectl --context pvek8s get pods -n arc-runners -o wide
# All 4 Pending pods assigned to k8s01 node

# Check EphemeralRunner resources
kubectl --context pvek8s get ephemeralrunner -n arc-runners
# Found 10 EphemeralRunner resources but only 4 pods exist
# 6 pgmac-slack-scores runners have no corresponding pods

20. k8s01 MicroK8s Restart

# On k8s01 node (user executed)
microk8s stop && microk8s start

# Wait for node to return Ready
kubectl --context pvek8s wait --for=condition=Ready node/k8s01 --timeout=300s

# Verify Pending pods cleared
kubectl --context pvek8s get pods -n arc-runners
# All 4 Pending pods should be gone, container runtime recovered

21. Verification

# Verify no Pending pods remain in arc-runners namespace
kubectl --context pvek8s get pods -n arc-runners | grep Pending

# Verify EphemeralRunner resources
kubectl --context pvek8s get ephemeralrunner -n arc-runners

# Verify k8s01 node health
kubectl --context pvek8s describe node k8s01

# Check for any new container creation issues
kubectl --context pvek8s get events -n arc-runners --sort-by='.lastTimestamp'

Verification

Service Health Checks
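
A minimal sketch of the HTTP checks used to confirm recovery; the URLs are from the timeline above, and the exact commands run during the incident may have differed:

for url in https://sonarr.int.pgmac.net/ https://radarr.int.pgmac.net/; do
  printf '%s -> ' "${url}"
  curl -s -o /dev/null -w '%{http_code}\n' "${url}"
done
# Expect 200 for each; Overseerr was verified the same way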

Infrastructure Health

Volume Replication

Overseerr:  3/3 replicas Running
Sonarr:     3/3 replicas Running
Radarr:     3/3 replicas Running
All others: 3/3 replicas Running

Node Disk Status (Post-Cleanup)

k8s01: 87% (down from 97%)
k8s02: Stable (no initial disk pressure)
k8s03: 81% (down from 100%)

Preventive Measures

Immediate Actions Required

  1. Implement Node Disk Space Monitoring (Critical Priority)
    • Current: No alerts for disk usage >85%
    • Target: Alert at 80%, critical alert at 90%
    • Actions:
      • Deploy Prometheus node-exporter on all nodes
      • Configure AlertManager rules for disk pressure
      • Add Nagios checks for disk usage as backup
    • Rationale: Both k8s01 (97%) and k8s03 (100%) hit critical thresholds without detection
  2. Automated Container Image Garbage Collection (High Priority)
    • Current: Manual cleanup required during incident
    • Target: Automated daily cleanup maintaining <75% disk usage
    • Actions:
      • Configure kubelet imageGCHighThresholdPercent=75 (default: 85); see the sketch after this list
      • Configure kubelet imageGCLowThresholdPercent=70 (default: 80)
      • Schedule weekly cleanup cronjob as backup
    • Rationale: 4+ years of accumulated images contributed to disk exhaustion
  3. Audit Log Rotation and Buffer Management (High Priority)
    • Current: Audit buffer overload caused kubelet crashes on k8s01
    • Actions:
      • Reduce audit log verbosity (current level generating excessive data)
      • Implement aggressive log rotation (hourly vs daily)
      • Configure audit buffer size limits
      • Consider disabling detailed audit logging for non-critical operations
    • Rationale: “audit buffer queue blocked” directly caused kubelet instability
  4. Radarr PVC Expansion (High Priority)
    • Current: 1Gi volume at 95% capacity (carried over from Phase 2)
    • Target: 2Gi to accommodate media artwork growth
    • Action: Requires PVC recreation (Jiva doesn’t support online expansion)
    • Steps:
      # 1. Backup Radarr config
      # 2. Create new 2Gi PVC
      # 3. Restore data
      # 4. Update deployment to use new PVC
      
  5. Jiva Snapshot Cleanup Frequency (High Priority)
    • Current: Daily at 2 AM (threshold: 500 snapshots)
    • Problem: 1011 snapshots accumulated when cronjob couldn’t run during Phase 1
    • Actions:
      • Lower threshold from 500 to 300 snapshots
      • Increase frequency to every 12 hours (2 AM and 2 PM)
      • Add monitoring/alerting for snapshot counts >400
      • Add pod anti-affinity to ensure cleanup job can run on healthy nodes
    • Rationale: Cronjob failure during Phase 1 directly caused Phase 2 storage issues
  6. GitHub Actions Runner Controller Migration (Medium Priority)
    • Current: Orphaned runner pods consumed significant resources
    • Actions:
      • Migrate to GitHub-hosted runners or alternative self-hosted solution
      • If keeping self-hosted: implement strict maxReplicas limits
      • Add PodDisruptionBudgets to prevent runaway scaling
      • Configure aggressive pod cleanup policies
    • Rationale: 571 orphaned pods significantly contributed to cluster instability
  7. CronJob Timeout Configuration Baseline (High Priority - Added from Phase 3)
    • Current: CronJobs created without timeout settings, allowing infinite hangs
    • Target: All cronjobs have defensive timeout configuration
    • Actions:
      • Create baseline cronjob template with standard timeouts:
        • startingDeadlineSeconds: 60 (for minute-frequency jobs)
        • activeDeadlineSeconds: <appropriate for task> (e.g., 300 for 5-min tasks)
        • ttlSecondsAfterFinished: 120 (2-minute cleanup)
        • successfulJobsHistoryLimit: 1
        • failedJobsHistoryLimit: 2
      • Audit all existing cronjobs and add timeout configuration
      • Add validation in ArgoCD to require timeout settings
    • Rationale: Timeout settings proved critical for self-healing, but were missing
  8. Job Controller Health Monitoring (Critical Priority - Added from Phase 3)
    • Current: No monitoring for job controller state or corruption
    • Actions:
      • Add synthetic job creation tests every 5 minutes cluster-wide (see the sketch after this list)
      • Monitor job controller logs for “not found” errors
      • Alert on jobs with 0 pods after 2 minutes
      • Alert on jobs exceeding activeDeadlineSeconds without termination
      • Monitor dqlite database health and replication lag
    • Rationale: Job controller corruption went undetected for 16+ hours
  9. Dqlite Database Backup Automation (High Priority - Added from Phase 3)
    • Current: Manual backup procedures only
    • Target: Automated hourly backups with 24-hour retention
    • Actions:
      • Create cronjob to backup dqlite database (requires node-local execution); see the sketch after this list
      • Store backups on NFS with rotation policy
      • Document and test restoration procedure
      • Add alerts for backup failures
    • Rationale: Database backup was critical for nuclear option confidence
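
For item 2 above, a sketch of how the image GC thresholds could be set on a MicroK8s node; the --image-gc-* flags are the command-line equivalents of the imageGC*ThresholdPercent settings named above, and the args file path is the standard MicroK8s location:

# On each node: append the GC thresholds to the kubelet args and restart kubelite
echo "--image-gc-high-threshold=75" | sudo tee -a /var/snap/microk8s/current/args/kubelet
echo "--image-gc-low-threshold=70"  | sudo tee -a /var/snap/microk8s/current/args/kubelet
sudo systemctl restart snap.microk8s.daemon-kubelite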
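
For item 8, a minimal synthetic job probe that could run every few minutes; the probe name, image, namespace, and timeout are illustrative:

# Create a trivial job and alert if the job controller fails to complete it in time
name="job-probe-$(date +%s)"
kubectl --context pvek8s create job "${name}" -n default --image=busybox:1.36 -- true
if ! kubectl --context pvek8s wait --for=condition=complete "job/${name}" -n default --timeout=120s; then
  echo "ALERT: job controller did not complete synthetic job ${name}" >&2
fi
kubectl --context pvek8s delete job "${name}" -n default --wait=false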
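
For item 9, a sketch of an hourly backup with 24-copy retention, reusing the paths from Resolution step 11. Note that the incident backup was taken with MicroK8s stopped; copying the backend of a running cluster may produce an inconsistent snapshot, so this approach needs validation before it is relied on:

# Run on the primary node (k8s01) via cron or a systemd timer
ts=$(date +%Y%m%d-%H%M%S)
sudo mkdir -p /var/snap/microk8s/common/backup
sudo cp -r /var/snap/microk8s/current/var/kubernetes/backend \
  "/var/snap/microk8s/common/backup/etcd-backup-${ts}"
# Keep only the 24 most recent backups
ls -1dt /var/snap/microk8s/common/backup/etcd-backup-* | tail -n +25 | xargs -r sudo rm -rf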

Longer-Term Improvements

  1. Node Health Synthetic Testing (High Priority); see the sketch after this list
  2. Dqlite State Recovery Procedures (Medium Priority - Updated from Phase 3)
  3. Node Reboot Resilience Testing (Medium Priority)
  4. Ingress Endpoint Monitoring (Medium Priority)
    • Add monitoring to detect stale endpoint caching
    • Alert on pod IP changes not reflected in ingress logs
    • Consider automated ingress controller restarts after pod migrations
  5. Volume Capacity Monitoring (High Priority)
    • Implement alerts for PVC usage >85%
    • Current gap: No visibility into Jiva volume capacity
    • Tool: Consider deploying Prometheus with node-exporter + custom Jiva metrics
  6. Snapshot Management Strategy (Medium Priority)
    • Investigate snapshot growth rate per volume
    • Document expected snapshot accumulation patterns
    • Consider application-specific snapshot retention policies
    • Evaluate if 3-replica Jiva setup is necessary (vs 2-replica for non-critical data)
  7. MediaCover Cleanup Automation (Low Priority)
    • Radarr MediaCover directory: 837M of 974M total
    • Implement periodic cleanup of orphaned/old media artwork
    • Consider storing media artwork on NFS instead of Jiva volumes
  8. Runbook Documentation (High Priority - Updated from Phase 3)
    • Document kubelet/kubelite restart procedures for all nodes
    • Document disk cleanup emergency procedures with target thresholds
    • Document Jiva snapshot cleanup manual trigger process
    • Document ingress controller restart for endpoint refresh
    • Document force-deletion procedures for stuck pods
    • NEW: Document nuclear option procedures (cluster restart with dqlite backup)
    • NEW: Document job controller corruption recovery steps
    • NEW: Document self-healing verification checklist
    • Add to on-call playbook with estimated recovery times
    • Rationale: Multiple manual interventions required across all 3 phases; procedures must be documented
    • Reference: /tmp/linkace-cronjob-nuclear-option.md created during Phase 3
  9. Cluster Architecture Review (Low Priority)
    • Current: 4+ year old microk8s installation
    • Consider: Upgrade path to newer Kubernetes versions
    • Evaluate: Migration to managed Kubernetes (EKS, GKE, AKS) or alternative distributions
    • Rationale: Age of installation may contribute to accumulated technical debt
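
For the node health synthetic testing item above, a sketch of a per-node pod-startup canary that would have caught the Phase 5 failure mode (node Ready but unable to start containers). Node names are this cluster's; the image, sleep duration, and timeout are illustrative:

for node in k8s01 k8s02 k8s03; do
  name="canary-${node}-$(date +%s)"
  kubectl --context pvek8s run "${name}" --image=busybox:1.36 --restart=Never \
    --overrides="{\"apiVersion\":\"v1\",\"spec\":{\"nodeName\":\"${node}\"}}" -- sleep 60
  if ! kubectl --context pvek8s wait --for=condition=Ready "pod/${name}" --timeout=120s; then
    echo "ALERT: ${node} did not start a pod within 120s" >&2
  fi
  kubectl --context pvek8s delete pod "${name}" --force --grace-period=0 --wait=false
done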

Lessons Learned

What Went Well

  1. Systematic troubleshooting approach: Correctly identified kubelet issues as separate from scheduler problems
  2. Node cordoning strategy: Temporarily removing k8s02 from rotation helped isolate the problem
  3. Diagnostic tools worked effectively: kubectl commands, journalctl, and custom scripts like check-jiva-volumes.py provided crucial insights
  4. Modular architecture: Issues isolated to specific components, preventing total cluster failure
  5. Quick node recovery: microk8s restarts resolved kubelet issues within 1-2 minutes
  6. Automated cleanup existed: Jiva snapshot cleanup cronjob was already in place, just needed manual trigger
  7. Full replication: Jiva 3-replica setup meant volumes remained accessible with 2/3 replicas during the incident
  8. Force-deletion strategy: Successfully cleared 546+ orphaned pods using batched force-delete commands
  9. Phase 3 - Timeout configuration added proactively: ArgoCD manifest updated with defensive timeout settings before nuclear option
  10. Phase 3 - Database backup procedures: Successfully backed up dqlite database before nuclear option, providing rollback capability
  11. Phase 3 - Nuclear option executed cleanly: Cluster restart resolved all issues within 15 minutes with zero data loss
  12. Phase 3 - Verification thoroughness: Systematic verification of job creation, completion, TTL cleanup, and pod lifecycle

What Didn’t Go Well

  1. Cascading failure propagation: Initial node reboot triggered multiple secondary failures across all infrastructure layers
  2. No proactive monitoring: Disk usage (97%, 100%) and snapshot accumulation (1011) went undetected
  3. Kubelet instability: Disk pressure caused repeated kubelet crashes without clear error messages in pod status
  4. Database state corruption: Dqlite database corruption persisted for 48+ hours, spanning Phase 1 → Phase 3
  5. Manual intervention required: Multiple manual steps needed across 8-hour (Phase 1-2) + 16.5-hour (Phase 3) + 12+ hour (Phase 5) periods vs automated recovery
  6. Long cleanup duration: 60+ minutes for snapshot cleanup job to process all volumes
  7. Ingress endpoint caching: No automatic detection/refresh of stale endpoints
  8. Runner controller orphaned pods: 571 pods remained despite controller scaled to 0
  9. Capacity planning gap: Radarr volume undersized for actual usage patterns
  10. Node Ready status misleading: Nodes reported Ready but couldn’t start containers (kubelet vs containerd state mismatch)
  11. Cronjob failure during node issues: Snapshot cleanup cronjob couldn’t run during Phase 1, directly causing Phase 2 storage issues
  12. Phase 3 - Self-healing complete failure: Waited 1.5 hours for timeout-based self-healing that never occurred
  13. Phase 3 - Job controller corruption went undetected: 16+ hours of cronjob failures without alerting
  14. Phase 3 - Controller restarts ineffective: Multiple kubelite restarts across all nodes failed to clear corruption
  15. Phase 3 - ArgoCD auto-sync failed: GitOps automation failed when resources were deleted for clean state
  16. Phase 3 - No job controller monitoring: Zero visibility into controller state or processing errors
  17. Phase 5 - Nuclear option insufficient: Cluster-wide restart (Phase 3) didn’t clear node-local container runtime corruption on k8s01
  18. Phase 5 - Silent failure undetected: 12+ hour delay in detecting Pending pods with no container initialization
  19. Phase 5 - No node-local runtime monitoring: Zero visibility into container runtime health vs kubelet health

Surprise Findings

  1. Audit buffer overload: Audit logging directly caused kubelet crashes (not commonly documented failure mode)
  2. Dqlite database corruption persistence: Database corruption from Phase 1 persisted for 48+ hours despite multiple controller restarts
  3. Kubelet crash without pod warnings: Pods showed “Pending” with no indication kubelet was crashing
  4. Disk threshold: 97% disk usage was sufficient to crash kubelet despite ~3% of space still free
  5. Runner pod accumulation: 571 pods accumulated without triggering any resource quota or alerts
  6. Snapshot physical storage: 1011 snapshots consumed 3GB physical space in 1Gi logical volume
  7. Media artwork growth: Radarr artwork (837M) exceeded database size (33M) by 25x
  8. Cleanup job thoroughness: Job processed ALL Jiva volumes, not just over-threshold volumes
  9. Cross-phase dependency: Phase 1 kubelet/disk issues directly prevented Phase 2 cronjobs from running, and Phase 1 database corruption caused Phase 3 job controller failures
  10. Phase 3 - Job controller single point of failure: Single corrupted job reference prevented ALL job creation cluster-wide
  11. Phase 3 - Timeout settings ignored: Properly configured activeDeadlineSeconds and ttlSecondsAfterFinished completely ignored by corrupted controller
  12. Phase 3 - Controller restart insufficient: Restarting kubelite service didn’t clear in-memory controller state
  13. Phase 3 - Nuclear option effectiveness: Full cluster restart immediately resolved all controller corruption issues
  14. Phase 3 - Self-healing timeline invalid: Expected 6-12 hour self-healing never occurred; corruption was permanent without intervention
  15. Phase 5 - Nuclear option scope limitation: Cluster restart cleared cluster-global state (dqlite, controllers) but not node-local container runtime corruption
  16. Phase 5 - Corruption dormancy: Container runtime corruption from Phase 1 remained dormant for 48+ hours until new workloads attempted to schedule on k8s01
  17. Phase 5 - Silent failure persistence: Same silent failure pattern from Phase 2 (PodScheduled=True, no events) persisted despite Phase 3 nuclear option

Action Items

Priority Action Owner Due Date Status
Critical Deploy node disk space monitoring with alerts (80%/90% thresholds) SRE 2026-01-08 Open
Critical Configure automated container image garbage collection (75% threshold) SRE 2026-01-09 Open
Critical Implement job controller health monitoring with synthetic tests SRE 2026-01-09 Open
High Implement audit log rotation and reduce verbosity SRE 2026-01-10 Open
High Expand Radarr PVC from 1Gi to 2Gi SRE 2026-01-13 Open
High Lower snapshot threshold to 300, increase cleanup frequency to 12h SRE 2026-01-08 Open
High Audit all cronjobs and add timeout configuration baseline SRE 2026-01-15 Open
High Implement automated dqlite database backups (hourly, 24h retention) SRE 2026-01-10 Open
High Document nuclear option runbook (cluster restart with dqlite backup) SRE 2026-01-12 Open
High Implement synthetic pod startup health checks on all nodes SRE 2026-01-15 Open
High Add PVC capacity monitoring and alerting (>85%) SRE 2026-01-20 Open
Medium Test dqlite backup restoration in non-production scenario SRE 2026-01-17 Open
Medium Add dqlite replication lag monitoring SRE 2026-01-20 Open
Medium Migrate GitHub Actions to hosted runners or implement strict limits SRE 2026-01-27 Open
Medium Test node reboot resilience with controlled failures SRE 2026-02-03 Open
Medium Investigate k8s01/k8s03 kubelet/containerd logs from incident SRE 2026-01-13 Open
Medium Add ingress endpoint staleness monitoring SRE 2026-02-10 Open
Medium Investigate ArgoCD auto-sync failure for deleted resources SRE 2026-01-20 Open
Low Implement Radarr MediaCover cleanup automation Dev 2026-02-03 Open
Low Evaluate reducing Jiva replication from 3 to 2 for non-critical data SRE 2026-02-10 Open
Low Review cluster architecture and upgrade path SRE 2026-03-01 Open
Low Consider scheduled preventive cluster restarts (quarterly) SRE 2026-03-01 Open

Technical Details

Environment

Affected Resources

Phase 1:

Namespaces: ci, openebs, kube-system, all namespaces (scheduler impact)
Nodes:
  - k8s01: Kubelet crash loop (disk 97% + audit buffer overload)
  - k8s02: Kubelet hung (process restart required)
  - k8s03: Disk 100% full (garbage collection failure)
Pods:
  - GitHub Actions runners: 571 orphaned (299 Pending, 110 ContainerStatusUnknown, 79 Completed, 58 StartError, 25 other)
  - OpenEBS replicas: 22 stuck Terminating
  - Various: Unable to start/stop across all namespaces

Phase 2:

Namespaces: media, openebs, ingress
Pods:
  - sonarr-7b8f6fcfc4-4wm8m (Pending → Running)
  - radarr-5c95c64cff-* (CrashLoopBackOff → Running, multiple restarts)
  - overseerr-58cc7d4569-kllz2 (Running, intermittent timeouts)
PVCs:
  - radarr-config (pvc-311bef00..., 1Gi, 100% full → 95% after cleanup)
  - sonarr-config (pvc-17e6e808..., 1Gi, 1011 snapshots)
  - overseerr-config (pvc-05e03b60..., 1Gi, 1011 snapshots)

Snapshot Cleanup Job Output

Volumes processed: 13
Volumes cleaned: 13
Snapshots consolidated: 1011 → ~100 per volume (estimated)
Duration: ~60 minutes
Method: Rolling restart of replicas (3 per volume, 30s stabilization between)

Node Disk Usage Timeline

k8s01: 97% (critical) → 87% (stable) after cleanup
k8s02: Stable throughout (no disk pressure)
k8s03: 100% (critical) → 81% (stable) after cleanup

Kubelet Error Patterns (Phase 1)

k8s01 errors:
- "audit buffer queue blocked"
- "database is locked" (kine)
- "Failed to garbage collect required amount of images"
- "Kubelet stopped posting node status"

k8s02 errors:
- Pods assigned but never reached ContainerCreating (silent failure)

k8s03 errors:
- "Failed to garbage collect required amount of images. Attempted to free 13GB, but only found 0 bytes eligible to free"

References


Reviewers


Notes

This incident demonstrated the fragility of a long-running Kubernetes cluster under cascading failure conditions across five distinct phases spanning 2026-01-06 to 2026-01-09. Key takeaways:

Cross-Phase Insights

  1. Disk pressure is a critical failure mode: Both 97% and 100% disk usage caused complete kubelet failure, not just degraded performance
  2. Audit logging can become a liability: Excessive audit log generation directly caused kubelet crashes via buffer overload
  3. Node “Ready” status is insufficient: Nodes reported Ready while unable to start containers (kubelet vs containerd state mismatch)
  4. Cascading failures span days, not hours: Initial Phase 1 node reboot → disk pressure → kubelet failures → dqlite corruption → 48 hours later → Phase 3 job controller corruption
  5. Automated cleanup jobs are single points of failure: Snapshot cleanup cronjob failure during Phase 1 directly caused Phase 2 storage issues
  6. Orphaned pods accumulate silently: 571 runner pods accumulated over time without triggering resource quotas or alerts
  7. Force-deletion is sometimes necessary: Normal deletion failed for 546+ pods due to finalizer/controller corruption
  8. Database state corruption is persistent: Dqlite corruption persisted for 48+ hours despite multiple controller restarts
  9. Multiple layers require monitoring: Node health, disk space, kubelet status, pod lifecycle, storage subsystem, ingress endpoints, controller state
  10. Age matters: 4+ year old installation accumulated technical debt (images, logs, state corruption)

Phase 3-Specific Insights (Job Controller Corruption)

  1. Controller corruption is catastrophic: Single corrupted job reference prevented ALL job creation cluster-wide
  2. Service restarts don’t clear all state: Restarting kubelite service didn’t clear in-memory controller state or dqlite database corruption
  3. Self-healing has limits: Properly configured timeout settings (activeDeadlineSeconds, ttlSecondsAfterFinished) were completely ignored by corrupted controller
  4. Nuclear option is sometimes necessary: Full cluster restart with database backup was the only effective recovery path
  5. Timeout configuration is defensive, not curative: Timeout settings prevent runaway resource consumption but don’t fix controller corruption
  6. Job controller is a single point of failure: No redundancy or failover mechanism for corrupted job controller state
  7. GitOps auto-sync can fail: ArgoCD auto-sync failed when resources deleted for clean state, requiring manual intervention
  8. Database backups provide confidence: Having dqlite backup before nuclear option provided rollback capability and reduced risk
  9. Verification is critical: Systematic verification of job lifecycle (creation → pod spawn → completion → TTL cleanup) necessary after controller recovery
  10. Controller monitoring is essential: Zero visibility into job controller processing state delayed detection by 16+ hours

The resolution required comprehensive intervention across all infrastructure layers (compute, storage, networking, control plane, database), demonstrating the interconnected nature of Kubernetes cluster health.

Future incidents can be prevented or mitigated through the preventive measures outlined above.

The multi-phase nature of this incident (five phases spanning three days) highlights that cascading failures can have long-term delayed effects requiring sustained vigilance and multiple recovery strategies beyond initial stabilization.