Post Incident Review: pvek8s Complete Cluster Outage — dqlite Quorum Loss and Ansible-Injected Invalid Flags¶
Date: 2026-04-04 (degraded) → 2026-04-12 Duration: 7 days degraded (2/3 nodes) + ~1h 12m complete outage (2026-04-12 09:27–10:39 AEST) Severity: Critical (complete cluster outage; all NRPE checks timing out; API server unreachable) Status: Resolved
Executive Summary¶
The pvek8s microk8s cluster suffered a complete 3-node outage when dqlite's Raft consensus layer lost quorum. The proximate trigger was disk pressure on k8s02 (detected 2026-04-04) causing dqlite WAL write failures, which put k8s02 into a crash-restart loop. Over the following 7 days the cluster continued operating with 2/3 voters, but accumulated un-ACKed WAL entries; on 2026-04-11 a snapshot compaction deadlock propagated to k8s01 and k8s03, taking all 3 nodes down.
Recovery was blocked for ~1.5 hours by a second, unrelated issue: an Ansible dqlite tuning role had injected --snapshot-threshold=512 and --snapshot-trailing-logs=256 into the k8s-dqlite args file overnight. These flags do not exist in k8s-dqlite snap rev 8695. The binary exited with Error: unknown flag: --snapshot-threshold on every start attempt, preventing dqlite — and therefore the API server — from coming up.
A further complication: the standard single-node recovery procedure (edit cluster.yaml to one node, start that node) failed silently because dqlite Raft snapshots embed the cluster membership at snapshot time. A single node cannot reach quorum (2/3) against its own snapshot's 3-node membership list. Recovery required starting all 3 nodes simultaneously.
Total active recovery time from incident detection to resolution: ~1h 12m. No persistent data loss occurred.
Timeline (AEST — UTC+10)¶
| Time | Event |
|---|---|
| 2026-04-04 (est.) | k8s02 disk pressure causes dqlite WAL write failure. k8s02 dqlite enters crash-restart loop. Cluster continues on 2/3 voters (k8s01 + k8s03). No alert fires. |
| 2026-04-11 ~16:07 | dqlite snapshot compaction ACK deadlock on all 3 nodes. k8s01 and k8s03 dqlite enter crash loops. All 3 nodes now failing. ~3,794 restarts on k8s01. |
| 2026-04-12 09:27 AEST | Incident opened (PGM-137). Nagios Nagios shows microk8s-dqlite-scheduler CRITICAL on k8s01, k8s02, k8s03 with 15s socket timeouts. kubectl --context pvek8s get nodes fails. |
| ~09:35 | Nagios MCP confirms all 3 nodes CRITICAL. NRPE execution time 15.03s — confirms check is hanging, not just reporting failure. |
| ~09:40 | journalctl shows [go-dqlite] attempt N: server 172.22.22.X:19001: connection refused / no known leader on all nodes. Port 19001 not bound. NRestarts=3794 on k8s01. |
| ~09:45 | Disk confirmed adequate (k8s01: 74%, k8s02: ~75%, k8s03: 71%). Certificates valid until 2027. Root cause: Raft quorum loss confirmed. |
| ~09:52 | All 3 nodes stopped: sudo microk8s stop (parallel). Stop sequence: k8s02, k8s01, k8s03. |
| ~09:55 | NRPE check fix: --request-timeout=10s added to both kubectl calls in check_microk8s_dqlite.sh. Deployed to all 3 nodes. Ansible source files updated. |
| ~10:00 | microk8s start on k8s01 (single-node recovery attempt). NRestarts=39 immediately — binary still failing. |
| ~10:05 | Invalid flags discovered: manual invocation of /snap/microk8s/8695/bin/k8s-dqlite --snapshot-threshold=512 → Error: unknown flag: --snapshot-threshold. Args file contains Ansible-injected flags not supported by this binary version. |
| ~10:08 | Invalid flags removed from args files on all 3 nodes: sed -i '/snapshot-threshold\|snapshot-trailing-logs\|^#/d'. Ansible role dqlite.yml updated to remove the task entirely. |
| ~10:12 | k8s01 restarted. Binary now starts but stuck at [go-dqlite] attempt N: no known leader indefinitely (250+ attempts). |
| ~10:20 | Snapshot membership issue discovered: raft snapshot (216MB, term 2774) embeds 3-node cluster config. Single-node cluster.yaml has no effect. Raft segments deleted to force fresh bootstrap — still stuck at quorum 1/3. |
| ~10:30 | 3-node cluster.yaml restored from backup on k8s01. All 3 nodes started simultaneously. |
| ~10:39 AEST | Raft election succeeds in ~60s. All 3 nodes: microk8s is running. High availability: yes. Datastore master nodes: 172.22.22.6 172.22.22.8 172.22.22.9. |
| ~10:42 | kubectl get nodes: k8s01/k8s02/k8s03 all Ready. ArgoCD server running. Workloads resuming. PGM-137 closed. |
Root Causes¶
The Infinite How's Chain¶
"The infinite how's" methodology: at each causal step, ask "how?" rather than accepting the surface answer. Keep drilling until reaching an actionable, preventable cause.
How were NRPE checks timing out with no useful error?¶
check_microk8s_dqlite.sh called microk8s kubectl get componentstatuses and microk8s kubectl get pods -A with no --request-timeout flag. When the API server was unavailable, both calls blocked indefinitely — exceeding the NRPE socket timeout of 15 seconds. Nagios received CHECK_NRPE STATE CRITICAL: Socket timeout after 15 seconds rather than any diagnostic information about the actual cluster state.
How was the API server unavailable?¶
The kube-apiserver (inside kubelite) depends on kine.sock — a UNIX socket created by k8s-dqlite that proxies etcd API calls into dqlite. k8s-dqlite was not running on any node, so kine.sock was never created, and kubelite failed to start.
How was k8s-dqlite not running after microk8s start?¶
The k8s-dqlite binary exited with Error: unknown flag: --snapshot-threshold on every invocation. The systemd service restarted it continuously (NRestarts=39 within minutes), but each attempt immediately failed. Since set -e is active in the startup script and exec replaces the process, the exit code propagated directly to systemd.
How did an unknown flag end up in the args file?¶
The Ansible dqlite tuning role (roles-dev/ansible-role-microk8s/tasks/dqlite.yml) used ansible.builtin.blockinfile to inject:
These flags were added based on the k8s-dqlite API as it existed when the role was written (after the Jan 2026 dqlite lock contention incident). The flags were removed or renamed in a subsequent microk8s snap release. Snap rev 8695 (the installed version) does not include them.
How was the role allowed to inject unsupported flags without detection?¶
The Ansible role had no mechanism to validate that a flag exists in the target binary before writing it. The run script (run-k8s-dqlite-with-args) passes all lines from the args file directly to the binary — including comment lines injected by blockinfile — without any pre-validation. After the playbook ran overnight and the restart was completed, there was no health check task to verify the service remained running.
Additionally, blockinfile injects # BEGIN/END ANSIBLE MANAGED BLOCK comment lines. While bash evaluates these as comments in an eval-style array expansion, their presence alongside invalid flags made diagnosis harder — it was initially unclear whether the comments or the flags were causing the failure.
How did the cluster reach complete quorum loss (the underlying outage)?¶
k8s02 ran low on disk space (estimated 2026-04-04). dqlite's write-ahead log (WAL) requires disk writes to commit entries. When dqlite tried to write a WAL entry on k8s02 and the filesystem was full, the write failed. dqlite exited, and systemd restarted it. On restart, the same WAL entries needed to be applied — and dqlite exited again. k8s02 entered a crash-restart loop with thousands of restarts.
With k8s02's dqlite out of the election pool, the cluster continued on 2 voters (k8s01 + k8s03), which still constitutes quorum in a 3-node cluster. The cluster remained nominally operational for ~7 days.
How did the 2-node operation cascade to all 3 nodes?¶
During 7 days of degraded operation, the 2 active nodes continued taking dqlite snapshots and accumulating WAL entries. At some point (2026-04-11 ~16:07 AEST), a snapshot compaction cycle attempted to ACK completion across all 3 Raft voters. Because k8s02 was not responding to Raft RPC calls, the ACK could not complete. The snapshot compaction deadlock caused the dqlite process on k8s01 and k8s03 to stall, eventually crash. With all 3 nodes now unable to run dqlite, the Raft quorum was completely lost.
How did disk pressure on k8s02 go undetected for 7 days?¶
There was no disk space monitoring or alerting on k8s nodes targeting the /var/snap/microk8s/ mount point. The existing Nagios NRPE configuration checked overall system disk usage but did not specifically alert on the snap data partition where dqlite stores its WAL, snapshots, and database files.
How did the dqlite crash-loop go undetected for 7 days?¶
There was no NRPE check or Nagios service monitoring the systemd NRestarts property for snap.microk8s.daemon-k8s-dqlite. The existing microk8s-dqlite-scheduler check relied on kubectl API calls — which continued to succeed while 2/3 voters were running — and so returned OK throughout the degraded period.
Secondary Finding: Snapshot-Embedded Raft Membership¶
The standard dqlite recovery guide suggests editing cluster.yaml to reduce membership to a single node and starting that node. This approach does not work when an existing raft snapshot is present.
dqlite Raft snapshots embed the full cluster configuration (membership) at the time the snapshot was taken. When the single node loads the snapshot, it reads the embedded 3-node membership and requires 2/3 votes to elect a leader. Since the node is alone, it gets 1/3 votes and is stuck at no known leader indefinitely.
This is counter-intuitive because cluster.yaml is clearly documented as controlling cluster membership. The documentation does not make clear that for an existing cluster with an existing snapshot, the snapshot's embedded membership takes precedence over cluster.yaml for Raft election purposes.
Impact¶
Services Affected¶
| Service | Impact | Duration |
|---|---|---|
| All Kubernetes workloads | API server unreachable; no new pod scheduling possible | ~7 days degraded + ~1h 12m full outage |
| Nagios NRPE monitoring | All 3 k8s nodes reporting CRITICAL socket timeout (misleading, not diagnostic) | ~1h 12m visible outage |
| ArgoCD | Unavailable (depends on API server) | ~1h 12m |
| All homelab applications | Running on existing pods; unaffected (no new scheduling required) | 0 |
Duration¶
- Degraded operation (2/3 voters, no alerting): ~2026-04-04 → 2026-04-11 (~7 days)
- Complete outage (0/3 voters): ~2026-04-11 16:07 AEST → 2026-04-12 10:39 AEST (~18h 32m)
- Active incident window: 2026-04-12 09:27 AEST → 10:39 AEST (~1h 12m)
Scope¶
- 3-node microk8s HA cluster
pvek8s - No persistent storage data loss
- No user-facing services disrupted (all applications continued running on existing pods)
- Cluster state recovered from snapshot taken at 2026-04-11 16:17 AEST
Resolution Steps Taken¶
Phase 1: Diagnosis¶
-
Queried Nagios MCP for unhandled problems — confirmed all 3 nodes CRITICAL with 15s socket timeout.
-
Confirmed API server down:
-
Identified crash-restart loop on all nodes:
-
Confirmed data intact: snapshot
snapshot-2774-2115903254-99297685(216MB, Apr 11 16:17) present on k8s01.
Phase 2: NRPE Fix¶
-
Added
--request-timeout=10sto both kubectl calls incheck_microk8s_dqlite.sh:# Before: COMPSTAT=$(/snap/bin/microk8s kubectl get componentstatuses -o json 2>/dev/null) PENDING_PODS=$(/snap/bin/microk8s kubectl get pods -A 2>/dev/null | grep -c "Pending") # After: COMPSTAT=$(/snap/bin/microk8s kubectl get componentstatuses --request-timeout=10s -o json 2>/dev/null) PENDING_PODS=$(/snap/bin/microk8s kubectl get pods -A --request-timeout=10s 2>/dev/null | grep -c "Pending") -
Deployed to all 3 nodes via
scp+ssh. Updated Ansible source files.
Phase 3: Invalid Flags Discovery and Fix¶
-
Stopped all nodes:
for node in k8s01 k8s02 k8s03; do ssh $node "sudo microk8s stop" & done -
Attempted
microk8s starton k8s01 — NRestarts=39 immediately, service still failing. -
Ran binary manually to diagnose:
-
Removed invalid flags and comment lines from args files on all 3 nodes:
-
Removed the task from Ansible role
roles-dev/ansible-role-microk8s/tasks/dqlite.ymlentirely.
Phase 4: Single-Node Recovery Attempt (Failed)¶
-
Edited
cluster.yamlon k8s01 to single-node configuration. Started k8s01 dqlite. Observed[go-dqlite] attempt N: no known leaderfor 250+ attempts (~4 minutes). -
Removed raft segments to force fresh bootstrap from snapshot:
Still stuck atno known leader— snapshot-embedded 3-node membership requires quorum.
Phase 5: Correct Recovery — Simultaneous 3-Node Start¶
-
Restored 3-node
cluster.yamlfrom backup on k8s01. -
Started all 3 nodes simultaneously:
-
Raft election completed in ~60 seconds. All 3 nodes joined as datastore masters.
Verification¶
Cluster Health¶
NAME STATUS ROLES AGE VERSION
k8s01 Ready <none> 4y234d v1.34.5
k8s02 Ready <none> 4y234d v1.34.5
k8s03 Ready <none> 4y234d v1.34.5
microk8s status:
high-availability: yes
datastore master nodes: 172.22.22.6:19001 172.22.22.8:19001 172.22.22.9:19001
NRPE Fix Verification¶
grep 'request-timeout' /etc/nagios/nrpe.d/check_microk8s_dqlite.cfg
# → --request-timeout=10s present in both kubectl calls on all 3 nodes
Preventive Measures¶
Immediate Actions Required¶
-
Add disk space monitoring for k8s nodes (Urgent)
- No alerting existed for disk pressure on
/var/snap/microk8s/. k8s02 disk pressure was the cascade trigger. - Action: Add NRPE check for
/var/snap/microk8s/utilisation: warn < 20% free, critical < 10% free. - Linear: PGM-138
- No alerting existed for disk pressure on
-
Add dqlite service crash-loop alert (Urgent)
snap.microk8s.daemon-k8s-dqliteaccumulated 3,794 restarts over 7+ days with no alerting.- Action: Add NRPE check on
systemctl show snap.microk8s.daemon-k8s-dqlite --property=NRestarts: warn > 10, critical > 50. - Linear: PGM-139
-
Add flag validation and health check to Ansible dqlite role (High)
- The role wrote flags unsupported by the installed binary with no pre-check and no post-restart health verification.
- Action: Add
--help | grep <flag>validation before writing flags; add service health check after any restart triggered by Ansible. - Linear: PGM-140
-
Document dqlite quorum recovery procedure (High)
- The snapshot-embedded membership constraint is not documented. The single-node recovery approach fails silently and cost ~1.5h.
- Action: Add runbook: start all 3 nodes simultaneously; do NOT attempt single-node recovery with existing snapshots.
- Linear: PGM-141
Longer-Term Improvements¶
-
Fix Ansible blockinfile comment pollution in k8s-dqlite args file (Medium)
blockinfileinjects# BEGIN/END ANSIBLE MANAGED BLOCKcomment lines. Replace withlineinfile.- Linear: PGM-142
-
Add NRPE check test suite to prevent kubectl timeout omissions (Medium)
- No automated check prevents
kubectlcalls without--request-timeoutfrom being added to NRPE scripts. - Action: Add pre-commit lint check: any
kubectlcall inansible/files/nagios/*.shmust include--request-timeout. - Linear: PGM-143
- No automated check prevents
Lessons Learned¶
What Went Well¶
- Data intact throughout: The 216MB dqlite snapshot from 2026-04-11 16:17 was preserved on all nodes. No Kubernetes state was lost.
- Invalid flag diagnosis was fast: Running the binary manually with explicit args immediately revealed
Error: unknown flag: --snapshot-threshold. ~3 minutes from suspicion to confirmed root cause. - Simultaneous 3-node restart was decisive: Once the correct recovery approach was identified, the cluster was up in ~60 seconds with no manual raft manipulation required.
- NRPE fix deployed immediately: The
--request-timeout=10sfix was deployed to all nodes and Ansible source files in parallel with the recovery work, before the cluster came back up.
What Didn't Go Well¶
- 7 days of silent degraded operation: k8s02's dqlite crash-loop went completely undetected. Two alerting gaps — no disk monitoring, no NRestarts monitoring — together allowed a single-node failure to accumulate for long enough to take the entire cluster down.
- Ansible role silently deployed breaking change: The overnight playbook run added unsupported flags to all 3 nodes with no validation and no notification. This turned a repairable quorum loss into a complete binary-level failure that blocked recovery.
- Single-node recovery approach wasted ~1.5 hours: The standard recovery guide's suggestion to reduce
cluster.yamlto one node is correct for bootstrap scenarios but fails for existing snapshots. This was not documented and cost significant investigation time. - Comment lines in args file complicated diagnosis: The
# BEGIN/END ANSIBLE MANAGED BLOCKcomment lines were initially suspected as the cause of the binary failure, requiring additional time to confirm that it was actually the invalid flags.
Surprise Findings¶
- dqlite Raft snapshots embed cluster membership: Editing
cluster.yamldoes not change the cluster configuration used for Raft elections when an existing snapshot is present. The snapshot's membership overridescluster.yaml. This is a critical undocumented constraint for recovery scenarios. blockinfilecomment lines are passed to the binary: The k8s-dqlite run script usesdeclare -a args="($(cat $SNAP_DATA/args/$app))"to build argument arrays. In this evaluation context,#lines behave as comments — but only in some bash versions and evaluation modes. This is an unintended interaction between Ansible's file management approach and the microk8s startup mechanism.--snapshot-thresholdand--snapshot-trailing-logsremoved in snap rev 8695: The k8s-dqlite binary no longer supports these flags. Any Ansible role, runbook, or documentation referencing these flags must be updated before applying to a cluster running rev 8695+.
Action Items¶
| # | Action | Priority | Linear |
|---|---|---|---|
| 1 | Add disk space monitoring for k8s nodes (/var/snap/microk8s/ < 20% free) |
Urgent | PGM-138 |
| 2 | Add dqlite crash-loop alert (NRestarts > 10 warn, > 50 critical) | Urgent | PGM-139 |
| 3 | Add flag validation + health check to Ansible dqlite args role | High | PGM-140 |
| 4 | Document dqlite quorum recovery (all-nodes-simultaneous, snapshot membership) | High | PGM-141 |
| 5 | Fix Ansible blockinfile comment pollution in k8s-dqlite args file |
Medium | PGM-142 |
| 6 | Add NRPE check lint: all kubectl calls must include --request-timeout |
Medium | PGM-143 |
Technical Details¶
Environment¶
- Cluster:
pvek8s(microk8s HA, 3 nodes) - Kubernetes version: v1.34.5
- microk8s snap revision: 8695
- Container runtime: containerd 1.7.28
- dqlite snapshot at recovery:
snapshot-2774-2115903254-99297685(216MB, 2026-04-11 16:17 AEST) - k8s01 dqlite NRestarts at incident start: 3,794
k8s-dqlite Args File Before Fix¶
--storage-dir=${SNAP_DATA}/var/kubernetes/backend/
--listen=unix://${SNAP_DATA}/var/kubernetes/backend/kine.sock:12379
# BEGIN ANSIBLE MANAGED BLOCK - dqlite tuning
--snapshot-threshold=512
--snapshot-trailing-logs=256
# END ANSIBLE MANAGED BLOCK - dqlite tuning
k8s-dqlite Args File After Fix¶
--storage-dir=${SNAP_DATA}/var/kubernetes/backend/
--listen=unix://${SNAP_DATA}/var/kubernetes/backend/kine.sock:12379
Key Error Signatures¶
Invalid flag (primary recovery blocker):
Raft quorum loss (underlying outage):
time="..." level=warning msg="[go-dqlite] attempt N: server 172.22.22.X:19001: connection refused"
time="..." level=warning msg="[go-dqlite] attempt N: server 172.22.22.6:19001: no known leader"
Single-node recovery failure (snapshot membership):
# 250+ identical lines — node never becomes leader
time="..." level=warning msg="[go-dqlite] attempt 251: server 172.22.22.6:19001: no known leader"
References¶
- Related incident (dqlite snapshot bloat + WAL deadlock, April 2026): dqlite Snapshot Bloat → kube-apiserver Instability → Controller Crash-Loop Cascade and Watch Stream Failure
- Linear incident ticket: PGM-137
- Memory record on dqlite recovery:
memory/project_dqlite_recovery.md - microk8s dqlite documentation: https://microk8s.io/docs/dqlite
- go-dqlite project: https://github.com/canonical/go-dqlite
Reviewers¶
- @pgmac