Post Incident Review: microk8s 1.34 → 1.35 Rolling Upgrade — cgroup v2, containerd Shim, Disk Pressure, and Kubelet Stall¶

Date: 2026-05-16
Duration: ~8.75 hours active (09:30–18:15 AEST)
Severity: High (planned maintenance; unexpected multi-failure cascade; all 3 nodes impacted)
Status: Resolved

Executive Summary¶

A planned rolling upgrade of the pvek8s cluster from microk8s 1.34 to 1.35 hit five separate failure modes across the three nodes, turning a routine maintenance window expected to take ~45 minutes into a ~3.5-hour initial incident.

Kubernetes 1.35 dropped cgroup v1 support entirely, but Ubuntu 20.04 defaults to cgroup v1, and no pre-upgrade check existed to detect or enable cgroup v2 before the snap refresh. After cgroup v2 was enabled and the nodes rebooted, a second failure emerged: the containerd runtime template file had its ${RUNTIME_TYPE} variable pre-substituted to io.containerd.runc.v1 (the cgroup v1 path) on a prior startup, and microk8s 1.35 ships only containerd-shim-runc-v2, so all container creation failed.

On k8s03, the old snap revision (8695) was still consuming disk space after the upgrade to rev 8612, pushing disk utilisation to 94% and triggering kubelet DiskPressure. After disk cleanup, the containerd imagefs metrics API change in containerd 2.1.3 (bundled with microk8s 1.35) caused the kubelet eviction manager to enter a degraded, silent state: the node appeared Ready and heartbeating but did not process any pod assignments for 17+ minutes. Finally, a stale BuildKit lock file on the persistent volume, left by an unclean termination during the upgrade disruption, prevented CI/CD from running until it was cleaned up manually.

All issues were resolved by the end of the maintenance window. All 3 nodes reached v1.35.0 Ready status, and 38 ArgoCD applications were Synced+Healthy (1 remaining: dependency-track, tracked as PGM-192). A seventh failure emerged after the post-upgrade cleanup: the k8s 1.35 endpoint controller failed to refresh the argocd-repo-server EndpointSlice after the pod restarted and obtained a new pod IP, so comparisons for 4 apps were routed to the wrong pod and showed Unknown sync status. This was fixed by patching the stale Endpoints object and recreating the EndpointSlice. Ansible automation was updated to prevent recurrence of the cgroup and disk cleanup issues.

Timeline (AEST — UTC+10)¶

Time	Event
~09:30 AEST	Planned maintenance window starts. Rolling upgrade begins; k8s01 cordoned and drained. `ansible-playbook k8s-upgrade.yml` running.
~10:08 AEST	k8s01 snap refresh to `1.35/stable` (rev 8612) completes. kubelite starts.
~10:10 AEST	k8s01 NotReady. Kubelite crash loop begins. Error: `kubelet is configured to not run on a host using cgroup v1. cgroup v1 support is unsupported.`
~10:15 AEST	Root cause confirmed: Ubuntu 20.04 defaults to cgroup v1; `/sys/fs/cgroup/cgroup.controllers` absent. cgroup v2 enabled: `systemd.unified_cgroup_hierarchy=1` added to GRUB, `update-grub` run. k8s01 rebooted.
~10:25 AEST	k8s01 back online after reboot. cgroup v2 active. kubelite starts but calico-node pods enter Error/Completed loop.
~10:30 AEST	calico-node error: `failed to create containerd task: runtime "io.containerd.runc.v1" binary not installed "containerd-shim-runc-v1": file does not exist`.
~10:35 AEST	Root cause found: `/var/snap/microk8s/8612/args/containerd-template.toml` has `runtime_type = "io.containerd.runc.v1"` (pre-substituted from prior cgroup v1 run). Restored `${RUNTIME_TYPE}` variable via `sed -i`. Restarted `snap.microk8s.daemon-containerd`.
~10:45 AEST	calico-node pods Running on k8s01. k8s01 Ready, uncordoned. k8s02 cordoned and drained.
~10:50 AEST	k8s02 snap refresh completes. Same cgroup v2 enablement + containerd-template fix applied proactively. k8s02 reboots.
~11:05 AEST	k8s02 Ready. Uncordoned. k8s03 cordoned and drained.
~11:10 AEST	k8s03 snap refresh to 1.35/stable completes. Disk pressure immediately apparent: `DiskPressure: True`. `df /var/snap/microk8s` → 94% used.
~11:15 AEST	Disk cleanup on k8s03: old snap rev 8695 removed (`snap remove --revision 8695 microk8s`) freeing ~3GB; journals vacuumed to 256MB (freed ~808MB); `apt-get clean`. Post-cleanup: 81% used.
~11:25 AEST	DiskPressure condition cleared on k8s03. cgroup v2 + containerd-template fix applied. Proactive disk cleanup also run on k8s01 and k8s02.
~11:30 AEST	k8s03 kubelite started. Node shows Ready and heartbeating, but no pods scheduling to k8s03.
~11:45 AEST	Investigation: k8s03 kubelite running (PID alive) but producing zero log output. Start-up error found in systemd journal: `"eviction manager: failed to check if we have separate container filesystem. Ignoring." err="no imagefs label for configured runtime"`.
~11:50 AEST	k8s03 kubelite restarted via `sudo systemctl restart snap.microk8s.daemon-kubelite`. Pods begin scheduling immediately.
~12:00 AEST	argocd-redis-ha-haproxy pod Pending due to anti-affinity. Deleted and rescheduled. ArgoCD healthy.
~12:10 AEST	BuildKit pod (arc-runners namespace) in CrashLoopBackOff. Error: `"could not lock /var/lib/buildkit/buildkitd.lock, another instance running?"`
~12:15 AEST	Stale lock file removed from PVC mount: `rm /var/snap/microk8s/common/default-storage/.../buildkitd.lock`. BuildKit pod deleted and restarted cleanly.
~12:30 AEST	All 3 nodes Ready at v1.35.0. All ArgoCD apps Synced+Healthy.
~13:10 AEST	Full verification complete. Ansible branch committed and pushed. Linear PGM-159 updated. Initial incident resolved.
~14:00 AEST	Post-upgrade cleanup begins. Additional pods identified as not recovering: csi-nfs-node (k8s02, k8s03) 2/3 Error; metallb speakers CrashLoopBackOff; buildkitd stuck Pending; jiva-csi-node livenessprobe Error.
~14:20 AEST	Orphaned metallb speaker processes (port 7946) and jiva-csi-node livenessprobe processes killed with `kill -9` on k8s02 and k8s03. metallb speakers: Running.
~14:30 AEST	buildkitd pod stuck Pending on k8s03 for 20+ minutes despite node Ready and kubelet alive — no kubelet events generated. kubelite restarted on k8s03 to clear stuck internal pod-processing state. Stale lock file cleared. Zombie buildkitd process (PID 29824) holding exclusive flock killed. BuildKit: 1/1 Running.
~15:00 AEST	Root cause found for csi-nfs-node 2/3 Error: stale `containerd-shim-runc-v2` processes from snap rev 8695 (old microk8s 1.34 snap, retained as rollback) survived the upgrade. Each old shim spawned orphaned `livenessprobe` child processes holding host ports 9808 and 29653, blocking new container restarts from binding. Multiple shim generations killed on k8s02 and k8s03. csi-nfs-node: 3/3 Running on all nodes.
~15:30 AEST	Stale arc-runners and arc-systems listener pods (Unknown/stuck-Pending from kubelite restart) force-deleted. ARC controller recreated listener pods cleanly. Only remaining issue: `dependency-track-postgresql-0` ImagePullBackOff (PGM-192 — pre-existing). Cluster infrastructure stable at v1.35.0.
~15:45 AEST	ArgoCD issue reported: applications (calibre, dependency-track, gharc-runners-pgmac-net-self-hosted, vaultwarden) showing Unknown sync status with ComparisonError. Investigation begins.
~16:00 AEST	Root cause identified: argocd-repo-server EndpointSlice last updated at 11:20 AEST with stale pod IP `10.1.237.8` (the n8n pod). Actual repo-server pod at `10.1.237.19`. k8s 1.35 endpoint controller failed to refresh the EndpointSlice after the repo-server pod restarted and acquired a new IP. All ArgoCD comparison traffic was being routed to n8n → connection refused.
~16:10 AEST	Fix applied: legacy Endpoints object patched to `10.1.237.19`; stale EndpointSlice deleted; EndpointSlice manually recreated with correct pod IP and targetRef. Hard refresh forced on all Unknown/Progressing apps.
~16:15 AEST	All ArgoCD apps recovered: 38 Synced+Healthy. Only remaining non-Healthy: dependency-track (Progressing — downstream of PGM-192 postgresql ImagePullBackOff). Full cluster resolution complete.
~16:30 AEST	User reports `links.pgmac.net.au` (linkace) returning 502 externally; readarr and hass also flagged. Ingress-nginx access log shows k8s03 pod (`tlbfv`) routing linkace traffic to `10.1.237.21` (stale) — connection refused × 3 retries → 502. k8s01/k8s02 ingress pods have correct endpoint `10.1.237.246`.
~16:45 AEST	Root cause identified: k8s03 ingress-nginx pod's internal Lua state did not receive the watch notification after Phase 8 EndpointSlice patch. Pod was restarted 6h35m ago (during upgrade chaos); at that time linkace EndpointSlice may have still had the stale IP. Subsequent EndpointSlice patch was not picked up.
~17:00 AEST	Fix: POSTed correct linkace endpoint (`10.1.237.246:80`) to k8s03 ingress-nginx internal API (`POST /configuration/backends`). Payload contained only the linkace entry, inadvertently replacing the full backends list (28 services → 1).
~17:05 AEST	readarr now returning 503 from k8s03 ingress (no upstream configured). Identified the inadvertent backends list replacement.
~17:10 AEST	Fix: Captured full 28-backend list from k8s01 ingress pod; copied to k8s03 and POSTed to restore full list. linkace → `10.1.237.246` (correct); readarr → `10.1.237.5` (correct); all 28 backends restored.
~17:15 AEST	Verified no remaining stale backends across all 3 ingress pods (automated scan). linkace: 302, readarr: 302 (→ 401 auth), hass: 200. All flagged services confirmed operational.
~18:15 AEST	PIR updated. Full cluster recovery confirmed.
~22:00 AEST (2026-05-17)	PGM-195 follow-up: k8s03 kubelet PLEG failure + pod watch broken re-emerges after subsequent kubelite restarts. Stale cgroup investigation begins. 58 stale entries found under `/sys/fs/cgroup/kubepods/besteffort/pod98df12a6-*/` (root cause: openebs-jiva-csi-node-rp2nk at 83+ restarts).
~23:00 AEST (2026-05-17)	Goroutine dump captured from kubelet. `handleAnyWatch` goroutine confirmed blocked in `[select]` for 3–4+ minutes with zero pod watch events. Structural watch-cache consistency issue in kubelite restart identified as root cause.
~23:15 AEST (2026-05-17)	Cordon-before-restart procedure tested. k8s03 cordoned, kubelite restarted. Node Ready in 31 seconds. All pending pods started (buildkitd 1/1, argocd-redis-ha-server-2 3/3). k8s03 uncordoned.
~23:30 AEST (2026-05-17)	PGM-195 resolved. Cordon-before-restart procedure documented as required for all future kubelite restarts on k8s03.
~00:30 AEST (2026-05-18)	Sustained dqlite `deadline_exceeded` errors observed correlating with k8s03 kubelite restart activity (spike of 17–20 errors/hour at 01:00 UTC, driven by API server reconnect and leader election cycles on each kubelite restart). k8s03 cordoned, drained, and rebooted to reduce cluster-wide dqlite write load. k8s03 left cordoned. Errors drop to near-zero within ~30 minutes.
~02:00 AEST (2026-05-18)	dqlite health investigation: per-hour error analysis confirms errors are tightly coupled to k8s03 disruption window; last errors across all nodes at 02:50–03:02 UTC (>2h before check); backend sizes balanced at 561MB across all 3 nodes; no orphaned EndpointSlices or stale leases; snapshot rotation active. dqlite assessed as healthy.
~02:30 AEST (2026-05-18)	Discovery: `ansible-role-microk8s-maintenance` monthly VACUUM step ran `sqlite3 cluster.db 'VACUUM;'` on the 4KB cluster membership metadata file, not the actual Kubernetes state stored in 226MB dqlite snapshot files. Manual VACUUM also not available via the dqlite CLI (restricted commands). VACUUM step removed from role (`tasks/vacuum.yml` deleted, defaults cleaned) and from `microk8s-monthly-maintenance.yml` playbook. Committed and pushed to both repos. Automatic snapshot rotation already handles dqlite log compaction.
~03:00 AEST (2026-05-18)	k8s03 uncordoned. All 3 nodes Ready. DaemonSet pods confirmed Running; newly-scheduled workloads (ARC runners, hostpath-provisioner, jiva replicas) start cleanly. Full cluster recovery confirmed.

Root Causes¶

The Infinite How's Chain¶

"The infinite how's" methodology: at each causal step, ask "how?" rather than accepting the surface answer. Keep drilling until reaching an actionable, preventable cause.

Chain 1: Kubelite crash on all nodes immediately after snap upgrade¶

How did kubelite crash immediately after snap refresh?¶

The kubelet component inside kubelite rejected the host's cgroup configuration with "kubelet is configured to not run on a host using cgroup v1. cgroup v1 support is unsupported." kubelite exited, systemd restarted it, and the cycle repeated.

How was cgroup v1 active on the nodes?¶

Ubuntu 20.04 defaults to the cgroup v1 (legacy) hierarchy. cgroup v2 (unified) requires explicit kernel boot parameter: systemd.unified_cgroup_hierarchy=1 in GRUB. The nodes had never had this configured.

How did the upgrade proceed without cgroup v2 being enabled first?¶

The Ansible upgrade playbook (k8s-upgrade.yml) had no step to check or enforce cgroup v2 before performing the snap refresh. The playbook went directly from cordon/drain to snap refresh microk8s --channel 1.35/stable.

How was the cgroup v2 requirement not caught before the upgrade began?¶

The K8s 1.35 release notes document the cgroup v1 deprecation and removal. PGM-159's investigation phase identified this requirement, but the implementation of the GRUB fix was not added to the upgrade playbook before execution began — it was added reactively during the incident.

How was there no gate to stop the upgrade on an incompatible node?¶

The upgrade role (ansible-role-microk8s/tasks/upgrade.yml) had no pre-flight checks — it simply ran snap refresh and waited for microk8s status to report Ready. A cgroup version check before the snap refresh would have caught this before causing a crash loop.
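A pre-flight gate of the kind described here can be very small. The following is a minimal sketch, assuming the check used during the incident (presence of `cgroup.controllers` at the cgroup mount root) is sufficient; the function names and the optional root argument are illustrative and not part of the existing role.

```shell
#!/bin/sh
# Report which cgroup hierarchy is mounted at the given root.
# The root argument defaults to /sys/fs/cgroup and exists only so the
# function can be exercised against a scratch directory.
cgroup_version() {
    root="${1:-/sys/fs/cgroup}"
    if [ -f "$root/cgroup.controllers" ]; then
        echo v2
    else
        echo v1
    fi
}

# Hypothetical gate: fail before `snap refresh` runs on a cgroup v1 host.
preflight_cgroup() {
    if [ "$(cgroup_version "$1")" != "v2" ]; then
        echo "cgroup v2 not active; refusing to upgrade" >&2
        return 1
    fi
}
```

Calling `preflight_cgroup` with no argument before the refresh step would stop the play on a node that still boots with the legacy hierarchy.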

Chain 2: calico-node pods failing with containerd runtime shim not found¶

How did calico-node pods fail with `containerd-shim-runc-v1: file does not exist`?¶

containerd was configured to use io.containerd.runc.v1 as the container runtime type. microk8s 1.35 ships containerd 2.1.3, which only includes containerd-shim-runc-v2. The v1 shim was removed.

How was containerd configured for the v1 runtime type?¶

The file /var/snap/microk8s/8612/args/containerd-template.toml — a mutable copy in SNAP_DATA — contained a hardcoded runtime_type = "io.containerd.runc.v1" instead of the expected template variable runtime_type = "${RUNTIME_TYPE}".

How did the template variable get pre-substituted?¶

When microk8s starts, a startup script reads containerd-template.toml, substitutes ${RUNTIME_TYPE} based on the detected cgroup version (io.containerd.runc.v1 for cgroup v1, io.containerd.runc.v2 for cgroup v2), and writes the result to the live config. On k8s01's first start after snap upgrade — before cgroup v2 was enabled — the script ran under cgroup v1 and substituted v1 into the mutable SNAP_DATA copy.

How did the variable not get reset when cgroup v2 was subsequently enabled?¶

After enabling cgroup v2 and rebooting, the startup script detects cgroup v2 and attempts to write io.containerd.runc.v2 — but only if ${RUNTIME_TYPE} is still present as a template variable in the file. Since the prior run had already replaced the variable with a literal string, the substitution step was effectively a no-op and the file retained v1.

How was this not handled by the snap upgrade process?¶

The mutable SNAP_DATA file is intentionally designed to survive upgrades so that user customisations are preserved. The startup script is designed to substitute the variable on first use per snap revision. It does not reset the file or re-substitute if the variable is already expanded — so an intermediate state (cgroup v1 startup → cgroup v2 migration → same revision) leaves a stale pre-substituted value.

How was there no corrective task in the upgrade playbook?¶

The upgrade role had no task to reset or re-template containerd-template.toml after a snap refresh. This was not known to be necessary before this incident — the scenario of upgrading snap version at the same time as migrating cgroup versions is a novel combination.
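A corrective task could detect the pre-substituted state before restarting containerd. This sketch assumes the template contains a single runc `runtime_type` line in the form shown above; the function names are hypothetical, and the real template may contain other runtime entries that should be reviewed before applying a blanket replacement.

```shell
#!/bin/sh
# Succeeds (exit 0) when the template has lost its ${RUNTIME_TYPE}
# placeholder and carries a literal runc runtime type instead.
template_is_expanded() {
    template="$1"
    # -F matches the placeholder literally rather than as a regex.
    if grep -qF 'runtime_type = "${RUNTIME_TYPE}"' "$template"; then
        return 1
    fi
    grep -q 'runtime_type = "io\.containerd\.runc\.' "$template"
}

# Restore the placeholder for a literal runc runtime (v1 or v2).
# Single quotes keep the shell from expanding ${RUNTIME_TYPE}.
restore_runtime_placeholder() {
    sed -i 's/runtime_type = "io\.containerd\.runc\.v[12]"/runtime_type = "${RUNTIME_TYPE}"/' "$1"
}
```

On a node this would be pointed at the revision's `args/containerd-template.toml`, followed by a restart of `snap.microk8s.daemon-containerd`.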

Chain 3: k8s03 disk pressure at 94% causing pod scheduling failure¶

How did k8s03 reach 94% disk utilisation and trigger DiskPressure?¶

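DiskPressure is set by the kubelet's eviction thresholds on available disk space rather than by the image GC threshold (whose upstream default high mark is 85%). At 94% used, k8s03 had crossed its eviction threshold; the kubelet set DiskPressure: True, the node was tainted accordingly, and new pods were no longer scheduled to or admitted on it.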

How did disk utilisation reach 94%?¶

The old snap revision (8695, microk8s 1.34) was still present alongside the new revision (8612, microk8s 1.35). Each revision occupies several gigabytes of overlayfs snapshots and container images. With both revisions on disk simultaneously, the cumulative usage exceeded 80%.

How was the old snap revision not removed after upgrade?¶

snap refresh keeps the previous revision installed as a rollback point. How many revisions are kept is controlled by snapd's refresh.retain setting (default 2 on classic Ubuntu systems, 3 on Ubuntu Core). No explicit snap remove --revision step was included in the Ansible upgrade playbook.
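Retained revisions can be listed with `snap list --all`, where inactive ones carry `disabled` in the Notes column. The helper below is a sketch that assumes the usual column layout (Name, Version, Rev, Tracking, Publisher, Notes); `disabled_revisions` is an illustrative name.

```shell
#!/bin/sh
# Reads `snap list --all <snap>` output on stdin and prints the revision
# number (third column) of every row whose Notes column contains "disabled".
disabled_revisions() {
    awk 'NR > 1 && $NF ~ /disabled/ { print $3 }'
}

# Intended use on a node, once the new revision is confirmed healthy:
#   snap list --all microk8s | disabled_revisions | while read -r rev; do
#       sudo snap remove microk8s --revision "$rev"
#   done
```

Removing the disabled revision also removes the rollback point, so this belongs after post-upgrade verification rather than before it.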

How was there no disk space pre-flight check before upgrading k8s03?¶

The upgrade playbook had no step to verify that sufficient free space existed before running snap refresh. The cleanup was applied on k8s03 only after disk pressure was observed there, and old snap revisions were removed from k8s01 and k8s02 only afterwards.
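A disk pre-flight check needs little more than `df`. This is a sketch under the assumption that a single percentage limit on the filesystem holding `/var/snap/microk8s` is an adequate gate; the function names and any particular limit are illustrative.

```shell
#!/bin/sh
# Print the used percentage (without the % sign) of the filesystem
# that contains the given path, using POSIX-format df output.
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Fail when usage is at or above the limit, e.g. preflight_disk /var/snap/microk8s 75
preflight_disk() {
    path="$1"
    max_pct="$2"
    used=$(disk_used_pct "$path")
    if [ "$used" -ge "$max_pct" ]; then
        echo "$path is ${used}% full (limit ${max_pct}%)" >&2
        return 1
    fi
}
```

The limit should sit below the kubelet's own eviction threshold, with enough headroom for two snap revisions to coexist during the refresh.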

How was disk pressure not caught by monitoring before the upgrade?¶

Nagios NRPE disk checks were configured for overall filesystem utilisation. The /var/snap/microk8s/ subtree does not have its own dedicated monitoring check with a tighter threshold. This is the same monitoring gap identified in PGM-138 for the dqlite WAL partition — the same fix was not applied to the containerd snapshotter path.

Chain 4: k8s03 kubelet silent stall — heartbeating but not processing pods¶

How were pods not scheduling to k8s03 despite the node showing Ready?¶

k8s03's kubelet was heartbeating and the node's Ready condition was True, but the kubelet was not processing any pod assignments. Pods sat Pending indefinitely even though k8s03 had available capacity.

How was the kubelet running but not processing pod assignments?¶

The kubelite startup logs contained: "eviction manager: failed to check if we have separate container filesystem. Ignoring." err="no imagefs label for configured runtime". The eviction manager failed to initialise the imagefs metrics collector. This caused the kubelet's eviction subsystem to enter a degraded mode where it silently dropped pod lifecycle operations while continuing to send node heartbeats.

How did a containerd API change cause this?¶

containerd 2.1.3 (bundled with microk8s 1.35) changed the internal metrics API surface for image filesystem labels. The kubelet eviction manager, when it could not resolve the imagefs label for the configured runtime, logged a non-fatal warning and continued startup — but the internal state left the pod assignment pipeline inoperative.

How did the node appear healthy to the cluster?¶

The kubelet's node heartbeat runs independently from pod assignment processing. Node Ready: True only indicates that the kubelet is alive and can communicate with the API server; it does not verify that pod assignment is functional. The cluster scheduler saw an available, Ready node and assigned pods to it — those pods then stalled at Pending with no progress.

How was this not detected until 17+ minutes later?¶

There is no monitoring for kubelet log silence or for the elapsed time since a node last processed a pod assignment. The Ready condition gave a false positive of node health. Discovery required manual observation that pods assigned to k8s03 were not starting, combined with inspection of kubelite logs.
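One cheap signal for this condition is the number of pods that are bound to a node but still Pending. The sketch below only counts rows; it assumes `kubectl` is configured for the cluster, and the function name is illustrative. Pods that are still pulling images or creating containers are also in the Pending phase, so a check built on this should alert only on a count that stays non-zero for several minutes.

```shell
#!/bin/sh
# Reads `kubectl get pods --no-headers` rows on stdin and prints how many
# non-empty rows there are (0 when the input is empty).
count_assigned_pending() {
    awk 'NF > 0 { n++ } END { print n + 0 }'
}

# Intended use:
#   kubectl get pods -A --no-headers \
#       --field-selector spec.nodeName=k8s03,status.phase=Pending \
#       | count_assigned_pending
```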

Chain 5: BuildKit stale lock preventing CI/CD¶

How did buildkitd fail to start?¶

buildkitd exited on startup with "could not lock /var/lib/buildkit/buildkitd.lock, another instance running?". No other buildkitd pod was running at the time.

How was the lock file present with no running process?¶

The previous buildkitd pod had been evicted or killed uncleanly during k8s03's disruption (disk pressure events and node churn during the upgrade). buildkitd writes its lock file into /var/lib/buildkit/ which is backed by a persistent volume claim. A PVC survives pod termination — so the lock file from the unclean termination persisted into the next pod start.

How does buildkitd not handle a stale lock?¶

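buildkitd takes an exclusive flock on buildkitd.lock and exits if the lock cannot be acquired. An flock belongs to a live process and is released by the kernel when that process exits, so a leftover file cannot hold the lock by itself; something must still have had the file open and locked. The most likely holder is the orphaned buildkitd process from the previous pod that was later found on the host in Phase 6 (PID 29824). Removing the file worked as a workaround because the new pod then created and locked a fresh file.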

How was this not caught before the pod entered CrashLoopBackOff?¶

There is no init container or startup script in the buildkitd deployment that checks for a stale lock before the daemon starts. Each failed start is retried by the kubelet with an exponentially increasing back-off (CrashLoopBackOff), so several restart cycles pass before an operator notices.
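Whether a lock file is actually held can be tested from the host with `flock(1)` from util-linux before anything is deleted. This is a sketch that assumes the PVC is backed by a local path visible on the node, as in Phase 5; `lock_is_held` is an illustrative name.

```shell
#!/bin/sh
# Succeeds (exit 0) when some live process holds an flock on the file.
# `flock -n` attempts a non-blocking exclusive lock and fails immediately
# if another open file description already holds one.
lock_is_held() {
    if flock -n "$1" -c true; then
        return 1    # lock was free; it is released again when flock exits
    fi
    return 0
}
```

If the lock is reported as held while no buildkitd pod is running, the next step is to look for an orphaned process on the host rather than to remove the file.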

Chain 6: Stale containerd shims from old snap revision causing post-upgrade pod failures¶

How did csi-nfs-node pods on k8s02 and k8s03 stay in 2/3 Error after the upgrade?¶

The liveness-probe container inside csi-nfs-node crashed immediately on each restart attempt with listen tcp 0.0.0.0:29653: bind: address already in use. An orphaned livenessprobe process on the host already held port 29653 (and port 9808 for older container generations), preventing the new container from binding.

How was port 29653 (and 9808) held by orphaned host processes?¶

livenessprobe processes — children of containerd-shim-runc-v2 — continued running on the host after their parent containers were killed. Because csi-nfs-node uses hostNetwork: true, the livenessprobe binary binds directly to a host port. When the container was SIGKILL'd, the shim process (which is the real process supervisor) remained alive, keeping its child alive and the port bound.

How did containerd shim processes survive the container being killed?¶

The shim (containerd-shim-runc-v2) is the lifecycle supervisor for a container, not a child of the container itself. It runs as a separate process parented by PID 1. When a container is killed (including SIGKILL from Kubernetes), the shim receives the exit notification and reports it to containerd — but the shim process itself only exits once containerd confirms the container is fully cleaned up. If containerd's cleanup is disrupted (e.g., by a kubelite restart mid-cleanup), the shim can persist indefinitely.

How were snap 8695 shims still running after the snap upgrade to revision 8612?¶

snap refresh retains the previous snap revision on disk as a rollback point but does not terminate processes started by that revision's binaries. The shims started under snap 8695 (/snap/microk8s/8695/bin/containerd-shim-runc-v2) continued running because they were parented by PID 1, not by any microk8s daemon. The new snap revision (8612) starts fresh containerd and kubelet processes, but has no mechanism to enumerate or clean up shim processes from the old revision.

How was there no cleanup of old snap shim processes during or after upgrade?¶

The snap refresh lifecycle does not include process cleanup for the previous revision's runtime binaries. Kubernetes and containerd track containers by containerd task ID, not by shim process lineage across snap revisions. The upgrade playbook had no step to enumerate stale shims from the old revision after the new revision was confirmed healthy.
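Enumerating leftover processes by the revision in their binary path is one way to add that missing step. The sketch below filters `ps` output; it assumes procps-style `ps -eo pid=,args=` and that the binary path is the first word of the command line. The function name and the `OLD_REV` variable are illustrative.

```shell
#!/bin/sh
# Reads `ps -eo pid=,args=` rows on stdin and prints the PIDs whose command
# starts with a binary under /snap/microk8s/<revision>/.
pids_from_revision() {
    rev="$1"
    awk -v prefix="/snap/microk8s/$rev/" 'index($2, prefix) == 1 { print $1 }'
}

# Intended use, with OLD_REV set to the revision that was just replaced:
#   ps -eo pid=,args= | pids_from_revision "$OLD_REV"
```

The output is only a candidate list; each PID should be inspected before it is killed, since a shim may still be supervising a container that containerd considers live.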

Chain 7: k8s 1.35 endpoint controller not refreshing EndpointSlice IPs after pod restarts¶

How were ArgoCD WebUI and multiple ingress-nginx services unreachable after the cluster appeared stable?¶

Traffic to ArgoCD (172.22.22.200:80) received connection refused. Services behind ingress-nginx returned 502 Bad Gateway. Both problems had the same root cause: EndpointSlices were pointing to stale pod IPs, so kube-proxy routed service traffic to the wrong (old) pod IPs — causing connection refused or timeout on the receiving pod, which ingress-nginx surfaced as 502.

How were 17 EndpointSlices pointing to wrong pod IPs?¶

The endpoint controller (part of kube-controller-manager in kubelite) tracked the correct pod name in each EndpointSlice targetRef field, but the addresses field retained the IP the pod had at the time the EndpointSlice was last written. Pods that restarted during the upgrade and cleanup phases obtained new IPs — but the endpoint controller never updated the EndpointSlice addresses to reflect the new IPs.

How did the endpoint controller fail to detect that pod IPs had changed?¶

When kubelite was restarted on k8s03 during Phase 6, the endpoint controller became the leader and performed an initial reconciliation. This reconciliation apparently did not detect the discrepancy between the EndpointSlice addresses and actual pod IPs from the API server. The exact mechanism of the failure is not known — it may be a k8s 1.35 regression in the endpoint controller's startup reconciliation logic, or a race condition where the controller's internal pod cache was populated before pods had completed their final IP assignment.

How were 9 different namespaces affected?¶

All affected pods shared a common pattern: they were running on k8s03 and had restarted during the upgrade window (which included kubelite restarts on k8s03). After each pod restart, the pod acquired a new IP from k8s03's IPAM range (10.1.237.x). The EndpointSlices retained the pod's pre-restart IP. The endpoint controller, once running on k8s03 with a potentially stale cache, did not detect or correct these mismatches.

How was there no early detection of this problem?¶

The Nagios checks for ingress-proxied services fire HTTP CRITICAL after 3 consecutive failures — these appeared in the unhandled problems list but were not immediately investigated because the cluster appeared "stable" (all nodes Ready, all pods Running). There was no automated check comparing EndpointSlice addresses to actual pod IPs, and the ArgoCD app status (Unknown sync) was the first signal investigated. The widespread 502s pointed back to a common cause only after both argocd-server and argocd-repo-server were found to have stale endpoints.

How was there no step in the upgrade playbook to verify endpoint correctness?¶

The upgrade playbook checked node Ready status and ArgoCD app Synced+Healthy status. Neither check detects EndpointSlice staleness. A post-upgrade endpoint consistency check (comparing EndpointSlice pod IPs to actual pod IPs from the API server) would have caught this immediately after node uncordon.
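Such a consistency check can be sketched as a join between two listings. The example assumes both inputs have been written as `<namespace>/<pod> <ip>` rows; the function name and file names are illustrative, and the `kubectl` jsonpath expressions in the comments are a suggestion that should be verified against the cluster before use.

```shell
#!/bin/sh
# Prints "<ns/pod> <slice-ip> <pod-ip>" for every EndpointSlice address that
# differs from the live pod IP. Both arguments are files of "<ns/pod> <ip>" rows.
stale_endpoints() {
    pods_file="$1"
    slices_file="$2"
    awk 'FILENAME == ARGV[1] { ip[$1] = $2; next }
         ($1 in ip) && ip[$1] != $2 { print $1, $2, ip[$1] }' \
        "$pods_file" "$slices_file"
}

# Possible way to produce the inputs:
#   kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name} {.status.podIP}{"\n"}{end}' > pods.txt
#   kubectl get endpointslices -A -o jsonpath='{range .items[*]}{range .endpoints[*]}{.targetRef.namespace}/{.targetRef.name} {.addresses[0]}{"\n"}{end}{end}' > slices.txt
#   stale_endpoints pods.txt slices.txt
```

Any output from `stale_endpoints` after uncordon would indicate the mismatch described in this chain.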

Chain 8: k8s03 kubelet pod watch broken on kubelite restart (PGM-195)¶

How were pods scheduled to k8s03 after a kubelite restart never starting despite the node appearing Ready?¶

After cleaning stale cgroups and restarting kubelite (Phase 6), the kubelet registered and showed Ready — but pods newly scheduled to k8s03 after the restart stayed Pending indefinitely. The kubelet's /pods endpoint returned 53 pods, all stale/deleted pods from before the restart, with no newly-assigned pods visible.

How was the kubelet not seeing newly scheduled pods?¶

A goroutine dump captured from the kubelet (via kill -SIGUSR1 <pid>) revealed that goroutine 12104 (handleAnyWatch, created by newSourceApiserverFromLW in config/apiserver.go:67) was blocked in [select] for 3–4+ minutes with zero pod watch events received:

goroutine 12104 [select, 4 minutes]:
k8s.io/client-go/tools/cache.handleAnyWatch(...)
  .../client-go/tools/cache/reflector.go:904
created by k8s.io/kubernetes/pkg/kubelet/config.newSourceApiserverFromLW
  k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:67

The pod watch goroutine was technically alive but receiving no events.

How did the watch receive zero events?¶

kubelite is monolithic — the API server and kubelet restart as a single process. On restart:

1. The API server rebuilds its watch cache.
2. The kubelet connects to https://127.0.0.1:16443 (its own local API server) and issues a LIST of pods with fieldSelector=spec.nodeName=k8s03.
3. The watch cache returns stale object content (showing deleted pods) but with the current resourceVersion.
4. The kubelet starts a WATCH from this current RV.
5. All pod deletions and new assignments occurred before this RV, so they are past history and no watch events are ever delivered for them.

The watch succeeds with no errors but is permanently empty for pods assigned after the restart.

How does the kubelet not self-heal from this state?¶

The kubelet's pod source from the API server uses resync period = 0 — no periodic re-LIST. It only re-LISTs when the watch fails (HTTP 410 Gone or connection drop). Since the watch is technically healthy (just receiving no events), no re-LIST is triggered and the kubelet never recovers without a manual restart.

How was this different from the Phase 4 imagefs stall?¶

Phase 4's stall ("no imagefs label for configured runtime") was caused by the eviction manager failing to initialise its imagefs metrics collector — a startup-time failure resolved by a single restart. The pod watch stall is a separate structural issue caused by the timing of the kubelite restart relative to pod scheduling: if new pods are assigned to k8s03 between the kubelite restart and the kubelet completing its initial LIST, those pods land in the watch stream's past and are never processed. Each subsequent restart without cordoning re-creates the problem.

How was this fixed?¶

Workaround: cordon k8s03 before restarting kubelite. This prevents new pods from being scheduled during the restart window. Pre-assigned pods already have spec.nodeName=k8s03 set and appear in the kubelet's initial LIST — processed correctly without depending on the broken watch stream. Documented in PGM-195.
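The procedure can be wrapped so that the cordon is never forgotten. This is a sketch only: it assumes it runs on the node being restarted, that `kubectl` talks to an API server that stays reachable during the restart, and that a 120-second timeout is acceptable. The `run` helper, the `DRY_RUN` variable and the function name are illustrative.

```shell
#!/bin/sh
# Echo commands instead of running them when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Cordon, restart kubelite, wait for Ready, then uncordon.
# The node stays cordoned if any step fails, which is the safe outcome.
restart_kubelite_cordoned() {
    node="$1"
    run kubectl cordon "$node" || return 1
    run sudo systemctl restart snap.microk8s.daemon-kubelite || return 1
    run kubectl wait --for=condition=Ready "node/$node" --timeout=120s || return 1
    run kubectl uncordon "$node"
}
```

Running `DRY_RUN=1 restart_kubelite_cordoned k8s03` prints the four commands without executing them. Because the Ready condition can be stale immediately after the restart, confirming that pending pods have started before uncordoning is still worthwhile.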

Impact¶

Services Affected¶

Service	Impact	Duration
All Kubernetes workloads on k8s01	No new pods schedulable; existing pods unaffected	~35 min (cgroup fix + containerd fix)
All Kubernetes workloads on k8s02	No new pods schedulable; existing pods unaffected	~20 min (fixes applied proactively)
All Kubernetes workloads on k8s03	No new pods schedulable; disk pressure + kubelet stall	~80 min (disk cleanup + kubelet restart)
GitHub Actions CI/CD (BuildKit)	Build jobs failing	~15 min (lock cleared); recurred ~2h later due to zombie process — resolved in Phase 6
ArgoCD	Degraded (Syncing failed for pods trying to land on impacted nodes)	~90 min
csi-nfs-node (k8s02, k8s03)	2/3 Error — livenessprobe port conflict	~2h (Phase 6: stale shim cleanup)
metallb-speaker (k8s02, k8s03)	CrashLoopBackOff — port 7946 conflict	~1h (Phase 6: orphaned process kill)
ArgoCD WebUI	Unresponsive — argocd-server endpoint stale	~45 min (Phase 8: endpoint patch)
ArgoCD app comparisons (calibre, dependency-track, gharc-runners, vaultwarden)	Unknown sync — argocd-repo-server endpoint stale	~45 min (Phase 8: endpoint patch + hard refresh)
All ingress-nginx-proxied services (10+ services)	502 Bad Gateway — ingress-nginx endpoint stale, bad upstream routing	~45 min (Phase 8: endpoint patch)
linkace (links.pgmac.net.au)	502 externally via Cloudflare — k8s03 ingress Lua state stale, routing to `10.1.237.21`	~2h (Phase 10: k8s03 ingress backend restore)
readarr, hass	Intermittent 502/503 from k8s03 ingress (stale backends + inadvertent backend list replacement)	~30 min (Phase 10: k8s03 ingress backend restore)
dependency-track-postgresql	ImagePullBackOff (pre-existing, unrelated)	Ongoing — tracked as PGM-192

Duration¶

Active maintenance window: ~09:30 → ~13:10 AEST (~3.5 hours)
Expected duration: ~45 minutes
Overrun: ~2 hours 45 minutes

Scope¶

All 3 nodes of the pvek8s microk8s cluster affected sequentially
No persistent data loss
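No user-facing homelab services were disrupted during the initial upgrade window (pods continued running on existing containers throughout); the later stale-endpoint phases did cause 502/503 responses on ingress-proxied services, as listed under Services Affected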
Cluster state and ArgoCD configurations fully intact

Resolution Steps Taken¶

Phase 1: cgroup v2 Enablement (k8s01)¶

Identified cgroup v1 rejection via kubelite journal: "kubelet is configured to not run on a host using cgroup v1."
Confirmed /sys/fs/cgroup/cgroup.controllers absent — cgroup v2 not active.
Added systemd.unified_cgroup_hierarchy=1 to GRUB_CMDLINE_LINUX in /etc/default/grub.
Ran update-grub and rebooted k8s01.
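
The detection step above can be scripted. This is a minimal sketch, not the check that was run during the incident. The function name `cgroup_mode` and its optional mount-point argument are illustrative assumptions; the argument exists only so the check can be exercised against a scratch directory.

```shell
# Report "v2" when the unified hierarchy is mounted, "v1" otherwise.
# cgroup v2 exposes a cgroup.controllers file at the root of the mount.
# $1: cgroup mount point (hypothetical parameter; defaults to /sys/fs/cgroup)
cgroup_mode() {
  if [ -e "${1:-/sys/fs/cgroup}/cgroup.controllers" ]; then
    echo v2
  else
    echo v1
  fi
}
```

Running `cgroup_mode` on each node before the snap refresh, and refusing to continue unless it prints `v2`, is one way to catch this failure ahead of the upgrade.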

Phase 2: containerd Template Fix (k8s01, then all nodes proactively)¶

Identified runtime_type = "io.containerd.runc.v1" hardcoded in /var/snap/microk8s/8612/args/containerd-template.toml.

Restored template variable:

sudo sed -i 's/runtime_type = "io.containerd.runc.v1"/runtime_type = "${RUNTIME_TYPE}"/' \
  /var/snap/microk8s/8612/args/containerd-template.toml
sudo systemctl restart snap.microk8s.daemon-containerd

Applied same fix proactively to k8s02 and k8s03 before their upgrades.
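
For repeat use, the `sed` fix can be wrapped so that it is idempotent and also covers a pre-substituted v2 literal (PGM-184 proposes resetting both). A sketch under those assumptions; the function name `restore_runtime_type` is hypothetical, and containerd still needs restarting afterwards as shown above.

```shell
# Reset a pre-substituted runtime_type back to the template variable.
# $1: path to containerd-template.toml
# Handles both the v1 and v2 literals. A file that already contains
# ${RUNTIME_TYPE} is left unchanged, so the function is safe to re-run.
restore_runtime_type() (
  file=$1
  tmp=$(mktemp) || exit 1
  sed 's/runtime_type = "io\.containerd\.runc\.v[12]"/runtime_type = "${RUNTIME_TYPE}"/' \
    "$file" > "$tmp" &&
    cat "$tmp" > "$file"   # overwrite in place to keep owner and mode
  status=$?
  rm -f "$tmp"
  exit "$status"
)
```

Example call (needs root on a real node): `restore_runtime_type /var/snap/microk8s/8612/args/containerd-template.toml`.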

Phase 3: Disk Cleanup (k8s03, then all nodes)¶

Identified disk at 94% on k8s03: du -sh /var/snap/microk8s/common/var/lib/containerd/... showed 13G.

Removed old snap revision:

sudo snap remove microk8s --revision 8695

Vacuumed systemd journals to 256MB:
```
sudo journalctl --vacuum-size=256M
```
Cleared apt cache: sudo apt-get clean.
Disk dropped from 94% → 81%, DiskPressure cleared.
Applied same cleanup proactively to k8s01 and k8s02.
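
Rather than hard-coding the old revision number, the revisions that are safe to remove can be read from `snap list --all`, which marks inactive revisions as `disabled` in its Notes column. A small sketch; the helper name `disabled_revisions` is an assumption of this example.

```shell
# Print the revision numbers that snapd marks "disabled" (no longer the
# active revision), given `snap list --all <snap>` output on stdin.
# Column 3 is Rev; the last column is Notes.
disabled_revisions() {
  awk 'NR > 1 && $NF ~ /disabled/ { print $3 }'
}
```

Usage on a node: `snap list --all microk8s | disabled_revisions`, then pass each printed revision to `sudo snap remove microk8s --revision <rev>`.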

Phase 4: Kubelet Stall Recovery (k8s03)¶

Observed pods Pending on k8s03 with node showing Ready.
Inspected kubelite logs: zero output for 17+ minutes post-startup.
Found startup error: "no imagefs label for configured runtime" in systemd journal.

Restarted kubelite:

sudo systemctl restart snap.microk8s.daemon-kubelite

Pod assignments resumed immediately.
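
The stall signature (pods bound to a Ready node but never started) can be surfaced from the API without node access. The filter below is a sketch; the function name and the four-column input format are assumptions chosen for this example, and the `kubectl` invocation in the comment is one way to produce that format.

```shell
# List pods that are bound to a node but still Pending.
# stdin: lines of "<namespace> <name> <phase> <node>", for example from:
#   kubectl get pods -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase,NODE:.spec.nodeName
# $1: node name to inspect
pods_stuck_on_node() {
  awk -v node="$1" '$3 == "Pending" && $4 == node { print $1 "/" $2 }'
}
```

Pods that are still pulling images also report Pending for a short time, so output only indicates a stall when it persists, which is why the proposed alert (PGM-188) uses a 5 minute threshold.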

Phase 5: BuildKit Lock Cleanup¶

Identified buildkitd.lock stale file in PVC:

sudo rm /var/snap/microk8s/common/default-storage/arc-runners-buildkitd-cache-pvc-*/buildkitd.lock

Deleted CrashLoopBackOff buildkitd pod. New pod started cleanly.

Phase 6: Post-Upgrade Pod Recovery (k8s02, k8s03)¶

Identified orphaned metallb speaker processes (port 7946) and jiva-csi-node livenessprobe processes on k8s02 and k8s03:
```
kill -9 <port-7946-pids>   # metallb speakers
kill -9 <port-9808-pids>   # jiva-csi livenessprobe
```
metallb speakers and jiva-csi-node: all Running.

buildkitd pod stuck Pending on k8s03 with no kubelet events for 20+ minutes. Two root causes resolved:

# Clear kubelet stuck pod-processing state
sudo snap restart microk8s.daemon-kubelite

# Kill zombie buildkitd processes holding the lock
sudo kill -9 29824

# Remove stale lock file
sudo rm /var/snap/microk8s/common/default-storage/\
arc-runners-buildkitd-cache-pvc-66fe10fc-5f6a-424f-89aa-9c3ff87be4e4/buildkitd.lock

BuildKit: 1/1 Running.

Identified root cause of csi-nfs-node 2/3 Error: stale containerd-shim-runc-v2 processes from snap rev 8612 running on k8s02 and k8s03, each parenting orphaned livenessprobe processes holding ports 9808 and 29653. Killed all stale 8612 shims and their children:
```
# Enumerate and kill stale snap 8612 shim processes
sudo pgrep -f microk8s/8612/bin/containerd-shim | xargs -r sudo kill -9
# (On k8s03: targeted per-shim to avoid disrupting SSH session)
sudo kill -9 <shim-pid> <livenessprobe-pid>
```
csi-nfs-node: 3/3 Running on all nodes.
Force-deleted all stale Unknown/stuck-Pending arc-runners runner pods and arc-systems listener pods left from kubelite restart. ARC controller recreated listener pods cleanly.
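
When enumerating stale shims, matching on the executable path rather than the whole command line avoids picking up the search tooling itself (a `sudo pgrep -f <pattern>` or `awk` process carries the pattern in its own arguments). A sketch, with `stale_shim_pids` as a hypothetical helper name:

```shell
# Print PIDs of containerd shims still running from a given snap revision.
# stdin: `ps -eo pid=,args=` output
# $1:    snap revision whose shims should be listed
# Only the executable path (field 2) is matched, so processes that merely
# mention the pattern in their arguments are ignored.
stale_shim_pids() {
  awk -v pat="microk8s/$1/bin/containerd-shim" 'index($2, pat) { print $1 }'
}
```

Usage on a node: `ps -eo pid=,args= | stale_shim_pids 8612`. Reviewing the list before killing anything matches the per-shim approach taken on k8s03.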

Phase 8: Endpoint Controller Staleness Fix (cluster-wide)¶

Identified ArgoCD WebUI returning connection refused and multiple services returning 502 Bad Gateway.
Found argocd-server EndpointSlice pointing to 10.1.237.11 (stale), pod at 10.1.237.55. Patched.
Found argocd-repo-server EndpointSlice pointing to 10.1.237.8 (stale), pod at 10.1.237.19. Patched.
Ran cluster-wide endpoint staleness scan (Python script comparing EndpointSlice addresses to actual pod IPs). Found 17 stale EndpointSlices across 9 namespaces — including ingress/ingress-nginx-controller (root cause of all 502s).

Patched all 17 EndpointSlices and 19 legacy Endpoints in a single automated pass:

# For each EndpointSlice managed by endpointslice-controller.k8s.io:
#   Compare addresses[*] to actual pod IP from API server
#   Patch if mismatch
kubectl --context pvek8s patch endpointslice -n <ns> <name> \
  --type=json -p='[{"op":"replace","path":"/endpoints/N/addresses/0","value":"<actual-ip>"}]'

Forced ArgoCD hard refresh on Unknown/Progressing apps. All 38 apps recovered to Synced+Healthy.
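
The scan used during the incident was a Python script that is not reproduced here. The comparison it performs can be sketched in shell as below; the function name, the two intermediate file formats, and the `kubectl` jsonpath expressions in the comments are assumptions of this example and have not been run against this cluster.

```shell
# Report EndpointSlice addresses that disagree with the live pod IP.
# $1: file of "<namespace>/<pod> <podIP>" lines, e.g. from:
#   kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name} {.status.podIP}{"\n"}{end}'
# $2: file of "<namespace>/<pod> <address>" lines, e.g. from:
#   kubectl get endpointslices -A -o jsonpath='{range .items[*]}{range .endpoints[*]}{.targetRef.namespace}/{.targetRef.name} {.addresses[0]}{"\n"}{end}{end}'
# Prints "<namespace>/<pod> slice=<address> pod=<podIP>" for each mismatch.
# Endpoints whose target pod no longer exists are skipped, not reported.
find_stale_endpoints() {
  awk 'FILENAME == ARGV[1] { ip[$1] = $2; next }
       ($1 in ip) && ip[$1] != $2 { print $1, "slice=" $2, "pod=" ip[$1] }' "$1" "$2"
}
```

Empty output means every EndpointSlice address with a live target pod matches that pod's IP.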

Phase 10: ingress-nginx k8s03 Lua State Restoration¶

Identified k8s03 ingress-nginx pod (ingress-nginx-controller-tlbfv) routing linkace to stale IP 10.1.237.21 (connection refused) via its internal backends API (GET http://localhost:10246/configuration/backends).
Triggered EndpointSlice watch notification (annotation touch) — k8s03 pod did not self-update. Root cause: pod was last restarted during upgrade chaos before EndpointSlice patch; subsequent patch did not propagate to this pod's Lua state.

Fixed linkace routing by POSTing correct endpoint to k8s03 ingress internal API:

# Inadvertently sent single-backend payload — replaced full list
curl -X POST -H 'Content-Type: application/json' -d '[{"name":"media-linkace-80","endpoints":[{"address":"10.1.237.246","port":"80"}],...}]' \
  http://localhost:10246/configuration/backends

Identified inadvertent backends list replacement (28 → 1): readarr and all other services on k8s03 returned 503.

Restored full 28-backend list by capturing from k8s01 pod and POSTing to k8s03:

kubectl exec -n ingress ingress-nginx-controller-trzsp -- sh -c \
  "curl -s http://localhost:10246/configuration/backends" > /tmp/full-backends.json
kubectl cp /tmp/full-backends.json ingress/ingress-nginx-controller-tlbfv:/tmp/full-backends.json
kubectl exec -n ingress ingress-nginx-controller-tlbfv -- sh -c \
  "curl -X POST -H 'Content-Type: application/json' -d @/tmp/full-backends.json http://localhost:10246/configuration/backends"

Verified all 28 backends correctly restored. No stale endpoints remaining across any of the 3 ingress pods.
linkace: 302, readarr: 302 (→ 401 auth as expected), hass: 200.

Phase 11: Ansible Automation Update¶

Updated k8s-upgrade.yml to add cgroup v2 detection and enablement before the snap refresh:
- stat /sys/fs/cgroup/cgroup.controllers — register whether cgroup v2 active
- lineinfile to write systemd.unified_cgroup_hierarchy=1 to GRUB if not active
- command: update-grub and reboot conditional on change
Committed and pushed to branch paulymac/pgm-159-upgrade-microk8s-1.35.
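
The playbook change itself is in the branch above and is not reproduced here. For reference, the idempotent edit that the `lineinfile` step performs can be approximated in shell as follows. This is a sketch: `ensure_grub_param` is a hypothetical helper, and it assumes `GRUB_CMDLINE_LINUX` is a single double-quoted line with no trailing comment.

```shell
# Append a kernel parameter to GRUB_CMDLINE_LINUX unless it is already there.
# $1: path to the grub defaults file   $2: parameter to ensure
ensure_grub_param() (
  file=$1
  tmp=$(mktemp) || exit 1
  awk -v p="$2" '
    /^GRUB_CMDLINE_LINUX="/ && index($0, p) == 0 {
      sub(/"$/, "")                     # drop the closing quote
      sep = ($0 ~ /="$/) ? "" : " "     # no separator if the value was empty
      $0 = $0 sep p "\""
    }
    { print }' "$file" > "$tmp" && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  exit "$status"
)
```

On a node this would be called as `ensure_grub_param /etc/default/grub systemd.unified_cgroup_hierarchy=1`, followed by `update-grub` and a reboot only if the file changed.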

Verification¶

Cluster Health¶

NAME    STATUS   ROLES    AGE      VERSION
k8s01   Ready    <none>   ~4y      v1.35.0
k8s02   Ready    <none>   ~4y      v1.35.0
k8s03   Ready    <none>   ~4y      v1.35.0

microk8s status:
  high-availability: yes
  datastore master nodes: 172.22.22.6:19001 172.22.22.8:19001 172.22.22.9:19001

cgroup v2 active: /sys/fs/cgroup/cgroup.controllers present on all 3 nodes
containerd-template.toml: runtime_type = "${RUNTIME_TYPE}" on all 3 nodes (snap rev 8612)

ArgoCD¶

38 applications: Synced + Healthy (dependency-track: Synced/Progressing — downstream of PGM-192)
ArgoCD server, repo-server, application-controller, redis-ha: all Running
argocd-redis-ha-haproxy: 3/3 replicas Running across 3 nodes
ArgoCD WebUI: responding (HTTP 307 → HTTPS)

Ingress / Services¶

All ingress-nginx-proxied services: responding (HTTP 308 redirects)
17 stale EndpointSlices patched + 19 stale legacy Endpoints patched
No remaining stale EndpointSlices detected post-fix
ingress-nginx k8s03 internal Lua state (28 backends) restored; no stale IPs in any of the 3 ingress pods
linkace: 302 (external via Cloudflare confirmed), readarr: 302→401 (expected auth), hass: 200

Disk Usage (post-cleanup)¶

Node	Before	After
k8s01	~74%	~66%
k8s02	~75%	~67%
k8s03	94%	81%

Preventive Measures¶

Immediate Actions Required¶

Add snap old-revision cleanup to upgrade playbook (High)
- No automated removal of previous snap revision after successful upgrade; each revision consumes 3-5GB.
- Action: Add snap remove microk8s --revision <prev> task to upgrade playbook after node reaches Ready.
- Linear: PGM-183
Add containerd-template.toml ${RUNTIME_TYPE} restoration to upgrade playbook (High)
- The pre-substitution of the template variable can silently persist across cgroup migrations and snap upgrades.
- Action: Add task to reset io.containerd.runc.v[12] back to ${RUNTIME_TYPE} in containerd-template.toml as part of upgrade.
- Linear: PGM-184
Add pre-upgrade disk space check to upgrade playbook (High)
- If disk is above 70% before starting a snap refresh (which will temporarily have 2 revisions), abort and alert.
- Action: assert task checking df /var/snap/microk8s < 70% before snap refresh.
- Linear: PGM-185
Add snap partition disk monitoring (High)
- The same gap as PGM-138 (dqlite WAL partition) exists for the containerd overlayfs snapshotter path. Both live under /var/snap/microk8s/.
- Action: Extend PGM-138's NRPE disk check to cover /var/snap/microk8s/ partition with warn < 25%, critical < 15% free.
- Linear: PGM-186
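
The disk preflight (PGM-185) reduces to one comparison. A minimal sketch, assuming the check is run on the node itself; the helper names `used_pct` and `disk_preflight_ok` are illustrative, and the playbook may implement the same logic as an Ansible `assert` instead.

```shell
# Extract the used-space percentage from `df -P <path>` output on stdin.
# Field 5 of the data row is the Capacity column, e.g. "94%".
used_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed only when usage is at or below the limit.
# $1: used percentage   $2: limit (this review proposes 70)
disk_preflight_ok() {
  [ "$1" -le "$2" ]
}
```

Usage: `disk_preflight_ok "$(df -P /var/snap/microk8s | used_pct)" 70 || exit 1`. Going by the "Before" figures in the Disk Usage table (~74%, ~75%, 94%), a limit of 70 would have stopped the upgrade on all three nodes until cleanup had been done.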

Longer-Term Improvements¶

Document kubelet stall pattern and runbook (Medium)
- A node appearing Ready while silently not processing pod assignments is a subtle failure mode.
- Action: Add runbook entry covering detection (pods Pending on Ready node + kubelite log silence) and fix (restart snap.microk8s.daemon-kubelite).
- Linear: PGM-187
Add kubelet pod-processing staleness alert (Medium)
- No alert exists for the case where a Ready node is not processing pod assignments.
- Action: NRPE check: if any pod has been Pending > 5 minutes on a Ready node with capacity, fire Warning.
- Linear: PGM-188
Add journald size limits to k8s nodes (Medium)
- Journals grew large enough to contribute to disk pressure. Rolling size limit should be configured by default.
- Action: Configure SystemMaxUse=512M via ansible-role-journald on all k8s nodes.
- Linear: PGM-189
Tune kubelet imageGC thresholds (Medium)
- Default high threshold (80%) provides insufficient headroom given baseline usage of 70-75%.
- Action: Set imageGCHighThresholdPercent=70 and imageGCLowThresholdPercent=60 in kubelet configuration.
- Linear: PGM-190
Add BuildKit stale lock init container (Low)
- BuildKit crashes on stale lock with no self-healing.
- Action: Add init container to buildkitd deployment that removes buildkitd.lock before daemon starts.
- Linear: PGM-191
Update dependency-track chart postgresql image (Medium)
- docker.io/bitnami/postgresql:11.13.0-debian-10-r40 was removed from Docker Hub; pod is in ImagePullBackOff.
- Action: Update dependency-track Helm chart to a current postgresql chart version with a published image.
- Linear: PGM-192
Add post-upgrade EndpointSlice staleness check to upgrade playbook (High)
- k8s 1.35 endpoint controller failed to update pod IPs in 17 EndpointSlices after pod restarts. No check existed to detect this.
- Action: Add Python script (or Ansible task using kubernetes.core) to scan all managed EndpointSlices for address/pod-IP mismatches after all nodes reach Ready; fail playbook loudly if mismatches found.
- Linear: PGM-193
Add post-upgrade ingress-nginx backend state validation (High)
- k8s03 ingress-nginx pod's internal Lua backend state retained stale pod IPs after the EndpointSlice patch. EndpointSlice-level fixes are insufficient if ingress-nginx pods have not picked up the changes in their Lua state.
- Action: After the EndpointSlice staleness check, also verify that each ingress-nginx pod's /configuration/backends matches actual pod IPs; if stale, POST the full correct backends list. Add a script to ansible-role-microk8s tasks.
- Linear: PGM-194
Require and document cordon-before-restart for kubelite restarts on k8s03 (High)
- After a kubelite restart without cordoning, the kubelet's pod watch goroutine receives zero events for newly scheduled pods. The node appears Ready but no new workloads start. This is a structural issue with kubelite's monolithic restart and does not self-heal.
- Action: Add the cordon-before-restart procedure to the runbook (PGM-187) and to ansible-role-microk8s documentation. Required procedure: cordon → restart → wait Ready → verify pods → uncordon.
- Linear: PGM-195
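
The cordon-before-restart procedure (PGM-195) can be sketched as a wrapper. This is an illustration, not the runbook text: the function name is hypothetical, it assumes `kubectl` already targets the cluster and that the node is reachable over `ssh` under its node name, and it deliberately stops before the uncordon so that the pod listing can be checked by a person first.

```shell
# Cordon, restart kubelite, wait for Ready, then list pods on the node.
# $1: node name. Set RUN=echo to print the commands instead of running them.
# The final uncordon is left to the operator once the pods look healthy.
restart_kubelite_cordoned() {
  node=$1
  ${RUN:-} kubectl cordon "$node" &&
  ${RUN:-} ssh "$node" sudo systemctl restart snap.microk8s.daemon-kubelite &&
  ${RUN:-} kubectl wait --for=condition=Ready "node/$node" --timeout=300s &&
  ${RUN:-} kubectl get pods -A --field-selector "spec.nodeName=$node"
}
```

A dry run with `RUN=echo restart_kubelite_cordoned k8s03` prints the four commands in order. Because a stalled node still reports Ready, the wait step alone does not prove recovery; the pod listing is the meaningful check before running `kubectl uncordon k8s03`.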

Lessons Learned¶

What Went Well¶

Progressive fix application: After identifying each root cause on k8s01, the same fix was applied proactively to k8s02 and k8s03 before their upgrades, preventing the cgroup and containerd template issues from repeating.
Rolling serial upgrade isolated blast radius: Because serial: 1 was enforced, k8s02 and k8s03 were unaffected while k8s01 was being repaired; no full cluster outage occurred.
Clear error messages throughout: kubelite's cgroup v1 rejection, containerd's missing shim, and buildkitd's lock message all pointed directly at the root cause without extensive log archaeology.
Playbook updated in-session: The cgroup v2 detection and enablement block was written to k8s-upgrade.yml and committed before the session ended; the fix was not left as a follow-up.

What Didn't Go Well¶

No pre-upgrade preflight checks: Three of the five failures (cgroup v2, disk pressure, containerd template) were detectable before running snap refresh. A preflight phase in the upgrade playbook would have caught all three.
The containerd-template.toml pre-substitution trap is non-obvious: The fact that a previous run under cgroup v1 would leave a permanent literal string in a mutable SNAP_DATA file — surviving snap revision changes — is not documented and required deep investigation to understand.
Disk monitoring gap persisted from prior incident: PGM-138 identified the need for snap partition disk monitoring; that fix was not extended to cover the containerd overlayfs path, which is orders of magnitude larger than the dqlite WAL.
Kubelet silent stall produced no actionable alert: The node showed Ready, which gave false confidence for 17+ minutes. Alerting alone is unlikely to prevent this, but a runbook would have shortened diagnosis.
Maintenance window significantly overran: Expected ~45 minutes; actual ~3.5 hours. The failure cascade across 5 distinct issues was not anticipated.

Surprise Findings¶

containerd-template.toml pre-substitution survives snap revision changes: The SNAP_DATA mutable files (e.g. /var/snap/microk8s/8612/args/containerd-template.toml) are version-specific directories but the substitution is done once per startup under each revision. If a revision starts under cgroup v1, the v1 literal is baked in for that revision's lifetime — a subsequent cgroup v2 reboot cannot correct it.
K8s 1.35 dropped cgroup v1 entirely: Unlike K8s 1.34, which only deprecated cgroup v1, 1.35 refuses to start on a cgroup v1 host. This is a hard break, not a warning. Ubuntu 20.04 hosts need explicit GRUB configuration before upgrading.
kubelet heartbeat and pod processing are independent: A node can report Ready (heartbeat alive, API connectivity good) while the pod assignment pipeline is completely non-functional due to an eviction manager initialisation failure. kubectl get nodes gives false assurance.
containerd 2.1.3 imagefs label API changed: The eviction manager's "no imagefs label for configured runtime" error is new in containerd 2.1.3 (shipped in microk8s 1.35). A kubelite restart resolves it but the root cause is an API compatibility issue between the K8s 1.35 eviction manager and containerd 2.1.3's metrics surface.
k8s 1.35 endpoint controller does not self-heal after pod restarts during upgrade: The endpoint controller tracks pod names correctly in EndpointSlice targetRef fields but silently retains stale IPs in addresses fields when pods restart during an upgrade window. This causes all service traffic routed through kube-proxy (and ingress-nginx upstream resolution) to fail — and can affect 10+ namespaces simultaneously with no single obvious alert. A cluster-wide automated scan is the only reliable detection method.
kubelite pod watch permanently broken after restart without cordoning: When kubelite restarts, the API server's watch cache may serve a LIST with the current resourceVersion but stale object content. The kubelet starts watching from this RV, missing all historical pod deletions and assignments — pods scheduled after the restart are never seen. Since resync=0, the watch never re-LISTs and the kubelet never self-heals. Workaround: always cordon the node before restarting kubelite (see PGM-195).
Monthly maintenance VACUUM targeted wrong dqlite file: The ansible-role-microk8s-maintenance role's VACUUM step ran sqlite3 cluster.db 'VACUUM;' — but cluster.db is only 4KB of cluster membership metadata (node addresses). The actual Kubernetes state lives in 226MB dqlite snapshot files which are automatically compacted by dqlite's Raft snapshot rotation. The step stopped the dqlite service (temporarily reducing quorum margin to 1 node) for no practical space recovery. Manual VACUUM is also not available via the dqlite CLI (SQL commands are restricted). The step has been removed; snapshot rotation already handles compaction.
kubelite restarts generate dqlite write spikes: Each kubelite restart on k8s03 triggered an API server reconnect and Raft leader election cycle, generating bursts of 17–20 deadline_exceeded errors per hour on the dqlite leader and follower nodes. Multiple restarts in a short window (PGM-195 investigation) produced sustained error elevation. Cordoning k8s03 eliminated the source of these write bursts entirely.

Action Items¶

#	Action	Priority	Linear
1	Add snap old-revision cleanup to k8s upgrade playbook	High	PGM-183
2	Add containerd-template.toml `${RUNTIME_TYPE}` restoration to upgrade playbook	High	PGM-184
3	Add pre-upgrade disk space preflight check (abort if > 70% used)	High	PGM-185
4	Extend snap partition disk monitoring to cover containerd overlayfs path	High	PGM-186
5	Document kubelet silent stall detection and recovery runbook	Medium	PGM-187
6	Add NRPE alert: pods Pending > 5min on Ready node with capacity	Medium	PGM-188
7	Configure journald size limits on all k8s nodes via ansible-role-journald	Medium	PGM-189
8	Tune kubelet imageGC thresholds (high: 70%, low: 60%)	Medium	PGM-190
9	Add init container to buildkitd deployment to clear stale lock on startup	Low	PGM-191
10	Update dependency-track Helm chart to current postgresql image	Medium	PGM-192
11	Add post-upgrade EndpointSlice staleness check to k8s upgrade playbook	High	PGM-193
12	Add post-upgrade ingress-nginx backend Lua state validation	High	PGM-194
13	Document and enforce cordon-before-restart procedure for k8s03 kubelite restarts	High	PGM-195

Technical Details¶

Environment¶

Cluster: pvek8s (microk8s HA, 3 nodes: k8s01/k8s02/k8s03)
Kubernetes version before: v1.34.x (snap rev 8695)
Kubernetes version after: v1.35.0 (snap rev 8612)
Container runtime: containerd 2.1.3 (microk8s 1.35)
Host OS: Ubuntu 20.04 LTS
Default cgroup version: v1 (hierarchy: legacy)

Key Error Signatures¶

cgroup v1 rejection:

kubelet is configured to not run on a host using cgroup v1.
cgroup v1 support is unsupported.
Please enable cgroup v2 support in the kernel before upgrading.

containerd shim not found:

failed to create containerd task: failed to create shim task:
OCI runtime create failed:
runtime "io.containerd.runc.v1" binary not installed
"containerd-shim-runc-v1": file does not exist: unknown

Disk pressure kubelet event:

Disk usage on image filesystem is over the high threshold,
do image garbage collection. usage: 94, highThreshold: 80

Kubelet eviction manager stall:

"eviction manager: failed to check if we have separate container filesystem.
Ignoring." err="no imagefs label for configured runtime"

BuildKit stale lock:

could not lock /var/lib/buildkit/buildkitd.lock,
another instance running?
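
For triage during a future upgrade, the signatures above can be turned into a simple log classifier. A sketch only: the match strings are taken from the messages listed in this section, while the function name and the short labels are invented for this example.

```shell
# Map log lines on stdin to the failure mode they indicate.
# Lines that match none of the known signatures produce no output.
classify_signature() {
  awk '
    /host using cgroup v1/                    { print "cgroup-v1"; next }
    /containerd-shim-runc-v1/                 { print "shim-v1"; next }
    /io\.containerd\.runc\.v1/                { print "shim-v1"; next }
    /over the high threshold/                 { print "disk-pressure"; next }
    /no imagefs label for configured runtime/ { print "eviction-manager-stall"; next }
    /could not lock .*buildkitd\.lock/        { print "buildkit-lock"; next }
  '
}
```

For example, `journalctl -u snap.microk8s.daemon-kubelite -b | classify_signature | sort | uniq -c` summarises which kubelite-side signatures appeared since boot; the containerd and BuildKit messages would need their own log sources piped in.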

containerd-template.toml Fix¶

# Before (pre-substituted — wrong):
runtime_type = "io.containerd.runc.v1"

# After (template variable restored — correct):
runtime_type = "${RUNTIME_TYPE}"

# Fix command:
sudo sed -i \
  's/runtime_type = "io.containerd.runc.v1"/runtime_type = "${RUNTIME_TYPE}"/' \
  /var/snap/microk8s/8612/args/containerd-template.toml
sudo systemctl restart snap.microk8s.daemon-containerd

Disk Cleanup Commands¶

# Remove old snap revision (replace 8695 with actual prev revision)
sudo snap remove microk8s --revision 8695

# Vacuum journals
sudo journalctl --vacuum-size=256M

# Clear apt cache
sudo apt-get clean

References¶

Linear ticket: PGM-159 — Upgrade microk8s cluster
Linear ticket: PGM-195 — k8s03 kubelet PLEG and pod watch failure post-upgrade (cordon-before-restart workaround)
Notion investigation: PGM-195: k8s03 Kubelet Deadlock Investigation & Recovery
Ansible PR (upgrade playbook cgroup v2 fix): pgmac-net/ansible#146
pgk8s PR (ingress-nginx migration): pgmac-net/pgk8s#472
Related incident (dqlite quorum loss): pvek8s Complete Cluster Outage — dqlite Quorum Loss and Ansible-Injected Invalid Flags
Related incident (Calico RBAC dqlite write storm): AWX Automation Pod Stuck Pending — Calico RBAC Gap + dqlite Write Storm
microk8s 1.35 release notes: https://microk8s.io/docs/release-notes
Kubernetes cgroup v2 migration: https://kubernetes.io/docs/concepts/architecture/cgroups/

Reviewers¶

@pgmac