dqlite Write Contention¶
Symptom¶
Nagios microk8s-dqlite-lock or microk8s-kine alerts go WARNING or CRITICAL. kube-controller-manager logs show database is locked errors.
kine logs on a node show retry counts climbing to try: 500.
Downstream effects: kcm/scheduler informer caches go stale, deployments stop reconciling, new pods fail to schedule or miss calico host routes.
Root cause¶
dqlite uses SQLite as its storage backend, which is single-writer. All Kubernetes object writes (pod status, node heartbeats, job tracking, events) queue through a single dqlite write lock per leader. Under normal load this is fine. After a kubelite restart — especially on the dqlite leader node — every controller reconnects simultaneously and floods the write queue. Multiple restarts in one session compound this.
The try: 500 threshold means kine exhausted its maximum retry budget (~500 attempts with backoff). At that point kine drops the connection, which breaks the API server's watch stream and causes informer caches on the kcm, scheduler, and calico-node to go stale.
Prevention¶
Before any kubelite restart¶
- Check which node holds the dqlite / kcm lease:
  Restart a non-leader node first where possible.
- Delete accumulated stale jobs to reduce background write rate:
  Verify before deleting; skip one-off migration jobs you want to keep.
- Check the microk8s-dqlite-lock and microk8s-kine Nagios checks. If either is already WARNING, do not proceed with restarts until they recover.
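For the lease check in the first step, one possible approach is to read the kube-controller-manager Lease and reduce its holder identity to a node name. This is a minimal sketch under assumptions: the lease uses the upstream default name kube-controller-manager in kube-system, holderIdentity has the form <hostname>_<uuid>, and lease_holder_node is a hypothetical helper name. It does not identify the dqlite leader.

```shell
# Hypothetical helper: reduce a lease holderIdentity ("<hostname>_<uuid>")
# read from stdin to just the hostname part.
lease_holder_node() {
  local id
  # kubectl jsonpath output has no trailing newline, so tolerate EOF here.
  read -r id || true
  printf '%s\n' "${id%%_*}"
}

# Usage (assumes the upstream default lease name and namespace):
#   kubectl -n kube-system get lease kube-controller-manager \
#     -o jsonpath='{.spec.holderIdentity}' | lease_holder_node
```

The node printed here is the one to restart last, or to avoid restarting while the Nagios checks are degraded.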
During a restart session¶
- Restart one node at a time (serial: 1).
- After each restart, wait for full stabilisation before proceeding:
  - All nodes Ready with no taints
  - kubectl -n <ns> get deployment <name> -o jsonpath='{.status.observedGeneration}' matches .metadata.generation
  - No database is locked errors in kcm logs (sudo journalctl -u snap.microk8s.daemon-kubelite.service --since='2 minutes ago' | grep 'database is locked')
- Use the maintenance playbook for intentional rolling restarts, which enforces pacing:
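Separately from the playbook, the observedGeneration comparison in the stabilisation checks can be scripted. The following is a rough sketch, not the playbook's own logic; generation_synced is a hypothetical helper that expects both values on one line of stdin.

```shell
# Hypothetical helper: succeeds only when stdin carries two equal, non-empty
# fields: "<metadata.generation> <status.observedGeneration>".
generation_synced() {
  local gen obs
  # jsonpath output has no trailing newline, so tolerate EOF here.
  read -r gen obs || true
  [ -n "$gen" ] && [ "$gen" = "$obs" ]
}

# Usage (<ns> and <name> are placeholders, as in the checklist above):
#   kubectl -n <ns> get deployment <name> \
#     -o jsonpath='{.metadata.generation} {.status.observedGeneration}' \
#     | generation_synced && echo "controller has caught up"
```

An empty observedGeneration is treated as not synced, which is the safer reading while the controller cache is still rebuilding.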
Recovery¶
Immediate¶
- Stop adding more kubelite restarts. Allow the cluster to absorb the write backlog.
- Delete stale jobs and other high-churn objects:
      # Field selectors cannot match job conditions, so filter with jq (assumed to be installed)
      kubectl get job --all-namespaces -o json \
        | jq -r '.items[] | select(any(.status.conditions[]?; (.type == "Complete" or .type == "Failed") and .status == "True")) | "\(.metadata.namespace) \(.metadata.name)"' \
        | while read -r ns name; do kubectl -n "$ns" delete job "$name"; done
      # Old events (high write volume, low value)
      kubectl delete events --all-namespaces --field-selector=reason=BackOff 2>/dev/null
- Monitor kcm logs on the leader node to confirm the lock rate is dropping:
  Expect the rate to fall within 2–5 minutes.
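To make "the lock rate is dropping" measurable, the lock errors can be bucketed per minute. This is a sketch under assumptions: it expects journalctl's default short output format (month, day, HH:MM:SS as the first three fields), and lock_errors_per_minute is a hypothetical helper name.

```shell
# Hypothetical helper: count "database is locked" lines per minute.
# Expects chronological journalctl output in the default short format,
# e.g. "Jan 02 15:04:05 host unit[pid]: message", on stdin.
lock_errors_per_minute() {
  awk '
    /database is locked/ {
      minute = $1 " " $2 " " substr($3, 1, 5)
      if (minute != prev) {
        if (prev != "") print prev, count
        prev = minute
        count = 0
      }
      count++
    }
    END { if (prev != "") print prev, count }
  '
}

# Usage on the leader node:
#   sudo journalctl -u snap.microk8s.daemon-kubelite.service \
#     --since='10 minutes ago' | lock_errors_per_minute
```

Each output line is a minute followed by its error count; a healthy recovery shows the counts shrinking and then the output going empty. Minutes with no errors are simply absent.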
If kubelite connections to kine are broken after a restart¶
snap.microk8s.daemon-k8s-dqlite is a separate systemd service from kubelite. It is not restarted when kubelite restarts. After a write contention storm, k8s-dqlite can accumulate corrupt internal kine connection state that causes every new kubelite instance to fail its etcd-client connections at high retry counts (attempt:80+, grpc: the client connection is closing).
Restart k8s-dqlite independently on each affected node before restarting kubelite:
# Restart k8s-dqlite first (clears kine internal state)
ssh <node> "sudo systemctl restart snap.microk8s.daemon-k8s-dqlite.service"
sleep 10
# Then restart kubelite (it will connect to a clean kine session)
kubectl cordon <node>
ssh <node> "sudo systemctl restart snap.microk8s.daemon-kubelite.service"
kubectl wait node/<node> --for=condition=Ready --timeout=120s --context pvek8s
kubectl uncordon <node>
Detection: kubelite logs show retrying of unary invoker failed ... attempt:80+ or grpc: the client connection is closing at startup; kubelite may also be silent (PLEG stall) — see kubelet-silent-stall.md Failure Mode 3.
If kcm/scheduler watches are stale (observedGeneration not advancing)¶
The kcm's informer cache may be stuck if kine connections dropped during the write storm. The kcm will not process any reconciliations until the cache sync completes.
- Identify the current kcm leader:
- If the leader's cache is stuck (no RS/Deployment reconciliation in logs for >5 minutes), restart k8s-dqlite first, then cordon and restart kubelite on the leader node to force a fresh leader election:
      ssh <leader-node> "sudo systemctl restart snap.microk8s.daemon-k8s-dqlite.service"
      sleep 10
      kubectl cordon <leader-node>
      ssh <leader-node> "sudo systemctl restart snap.microk8s.daemon-kubelite.service"
      # -w keeps 'NotReady,SchedulingDisabled' from matching
      until kubectl get node <leader-node> --no-headers | grep -qw 'Ready,SchedulingDisabled'; do sleep 5; done
      kubectl uncordon <leader-node>
- Verify the new leader is reconciling within 2 minutes:
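One way to put a hard limit on that 2-minute verification is a small polling helper. This is a generic sketch rather than a command from this runbook; wait_until and the check function in the usage comment are hypothetical names.

```shell
# Hypothetical helper: run a command every <interval> seconds until it
# succeeds or roughly <timeout> seconds have elapsed.
# Returns 0 on success, 1 on timeout.
wait_until() {
  local timeout=$1 interval=$2 waited=0
  shift 2
  until "$@"; do
    waited=$((waited + interval))
    [ "$waited" -ge "$timeout" ] && return 1
    sleep "$interval"
  done
  return 0
}

# Usage (<ns> and <name> are placeholders): wait up to 120 s for the
# deployment controller to observe the latest generation.
#   check() {
#     [ "$(kubectl -n <ns> get deployment <name> -o jsonpath='{.metadata.generation}')" = \
#       "$(kubectl -n <ns> get deployment <name> -o jsonpath='{.status.observedGeneration}')" ]
#   }
#   wait_until 120 5 check || echo "leader still not reconciling"
```

The timeout counts sleep time only, so slow kubectl calls stretch the real wall-clock limit somewhat.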
If calico-node watch is stale (new pods missing host routes)¶
Symptom: pods scheduled on a node crash immediately with connect: no route to host to 10.152.183.1:443. No host route for the pod's IP exists (ip route show | grep <pod-IP> returns nothing on the node).
kubectl -n kube-system delete pod -l k8s-app=calico-node --field-selector=spec.nodeName=<affected-node>
Wait ~30 seconds for the new calico-node pod to start and program routes. Verify:
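A plain grep for the pod IP also matches longer addresses that share the prefix (10.1.2.3 matches 10.1.2.30). A stricter check can compare the route destination exactly. This is a sketch that assumes the host route appears in ip route show with the pod IP as its first field; has_host_route is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if some route line on stdin has exactly
# the given IP as its destination (first field).
has_host_route() {
  awk -v ip="$1" '$1 == ip { found = 1 } END { exit !found }'
}

# Usage on the affected node (<pod-IP> is a placeholder):
#   ip route show | has_host_route <pod-IP> && echo "route present"
```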
If a Deployment is stuck (phantom RS status)¶
Symptom: kubectl get deployment shows AVAILABLE=1 but no pod exists. The kcm's informer has a ghost pod in its cache.
- Patch the stale RS status to force reconciliation:
- If the deployment controller itself is stuck (observedGeneration not advancing after the patch), rolling-restart the deployment:
  Note: if ArgoCD manages the deployment with selfHeal: true, the restart annotation will be reverted. In that case, restart the kcm leader's kubelite instead (step 2 of the stale watch recovery above).
Verification¶
Cluster is healthy when:
- microk8s-dqlite-lock and microk8s-kine Nagios checks return OK
- kubectl get nodes shows all nodes Ready with no taints
- kubectl get jobs --all-namespaces shows no accumulation of Complete/Failed jobs
- No database is locked in kcm logs for 5+ minutes
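The node check in this list can be reduced to a filter that prints only the nodes needing attention. This is a rough sketch: nodes_not_ready is a hypothetical helper, and it only inspects the STATUS column, so it catches NotReady and cordoned nodes but not other taints.

```shell
# Hypothetical helper: print the names of nodes whose STATUS column is not
# exactly "Ready" (this also flags "Ready,SchedulingDisabled").
nodes_not_ready() {
  awk '$2 != "Ready" { print $1 }'
}

# Usage:
#   kubectl get nodes --no-headers | nodes_not_ready
```

Empty output means every node reports plain Ready; taints beyond the cordon taint still need a separate look.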