k8s Upgrade Post-Upgrade Validation¶
Service: microk8s (pvek8s) First observed: 2026-05-16 PIR: microk8s 1.34 → 1.35 Upgrade Linear: PGM-193
Purpose¶
After any microk8s rolling upgrade, two controller components can silently retain stale state that causes service disruptions:
- Endpoint controller — may hold pre-upgrade pod IPs in EndpointSlices for pods that restarted and got new IPs during the upgrade window
- Ingress-nginx Lua backend cache — may route requests to pod IPs that no longer exist
Both are now checked automatically by k8s-upgrade.yml. This runbook documents the manual procedure for when the automated checks flag issues or need to be run outside the playbook.
Automated checks in k8s-upgrade.yml¶
The upgrade playbook runs two post-upgrade validation plays:
| Play tag | Script | What it checks | Failure action |
|---|---|---|---|
endpoint-validate |
files/endpoints/check_endpoints.py |
EndpointSlice IPs vs actual pod IPs | Fails loudly — manual recovery required |
ingress-validate |
files/ingress/check_backends.py |
ingress-nginx Lua backend IPs vs Running pods | Auto-restarts stale ingress-nginx pods |
Run only the validation plays (no upgrade) with:
Failure Mode 1 — Stale EndpointSlice IPs¶
When it occurs¶
After a rolling upgrade where kubelite restarts cause kubelets to evict and reschedule pods. The endpoint controller (part of kube-controller-manager) uses an informer with a watch bookmark. Pods that changed IP during the restart window — before the controller reestablished its watch — fall outside the controller's resync scope. The EndpointSlice retains the old IP; the new pod IP is never recorded.
This is not a fixed k8s regression. It's an inherent property of the informer model under upgrade chaos. The detection check exists to catch it before services are affected.
Detection¶
# Run the check script directly
cd /home/paul/pgmac/ansible
python3 files/endpoints/check_endpoints.py | python3 -m json.tool
Healthy output:
Stale output:
{
"ok": false,
"stale_count": 2,
"details": [
{
"namespace": "argocd",
"endpointslice": "argocd-redis-ha-abc123",
"service": "argocd-redis-ha",
"pod": "argocd-redis-ha-pkm99",
"stale_ip": "10.1.237.21",
"actual_ip": "10.1.237.25"
}
]
}
Cross-check a single EndpointSlice manually:
kubectl --context pvek8s get endpointslice -n <namespace> <endpointslice> -o yaml
# Look at .endpoints[].addresses[0] — should match the pod's actual IP
kubectl --context pvek8s get pod -n <namespace> <pod> -o jsonpath='{.status.podIP}'
The Nagios NRPE check runs continuously from all nodes:
Recovery — Option A: delete and recreate the EndpointSlice¶
The endpoint controller recreates the EndpointSlice within ~5 seconds with correct IPs. This is the fast, low-risk fix.
# 1. Verify the stale entry
kubectl --context pvek8s get endpointslice -n <namespace> <endpointslice> -o yaml
# 2. Delete — controller recreates immediately
kubectl --context pvek8s delete endpointslice -n <namespace> <endpointslice>
# 3. Wait for recreation and verify correct IP
kubectl --context pvek8s get endpointslice -n <namespace> -w
# Should appear within 5s with the correct pod IP
# 4. Confirm the service is routing correctly
kubectl --context pvek8s get endpoints -n <namespace> <service>
Script for bulk recovery (all stale slices from the check output):
python3 files/endpoints/check_endpoints.py | \
python3 -c "
import json, sys, subprocess
data = json.load(sys.stdin)
for d in data['details']:
cmd = ['kubectl', '--context', 'pvek8s', 'delete', 'endpointslice',
'-n', d['namespace'], d['endpointslice']]
print('Deleting:', ' '.join(cmd))
subprocess.run(cmd, check=True)
print('Done — controller will recreate within 5s')
"
Recovery — Option B: restart the kcm leader to force full resync¶
Use when multiple namespaces are affected or when Option A doesn't clear all mismatches (e.g., controller has a corrupted in-memory view).
# 1. Identify the kcm leader node
LEADER=$(kubectl --context pvek8s -n kube-system get lease kube-controller-manager \
-o jsonpath='{.spec.holderIdentity}' | cut -d_ -f1)
echo "kcm leader: $LEADER"
# 2. Cordon the leader node to prevent scheduling disruption
kubectl --context pvek8s cordon $LEADER
# 3. Restart k8s-dqlite first (prevents kine connection errors)
ssh $LEADER "sudo systemctl restart snap.microk8s.daemon-k8s-dqlite.service"
sleep 10
# 4. Restart kubelite on the leader (forces new leader election, fresh informer sync)
ssh $LEADER "sudo systemctl restart snap.microk8s.daemon-kubelite.service"
kubectl --context pvek8s wait node/$LEADER --for=condition=Ready --timeout=120s
# 5. Uncordon
kubectl --context pvek8s uncordon $LEADER
# 6. Wait ~30s for informer resync, then re-run the check
sleep 30
python3 files/endpoints/check_endpoints.py | python3 -m json.tool
Verification¶
# Re-run the endpoint check — should show ok: true
python3 files/endpoints/check_endpoints.py | python3 -m json.tool
# Re-run the full upgrade validation plays
ansible-playbook -i inventory/hosts.ini k8s-upgrade.yml \
--tags endpoint-validate,ingress-validate
Failure Mode 2 — Ingress-nginx Stale Lua Backend Cache¶
When it occurs¶
ingress-nginx caches pod IP → backend mappings in its Lua state. When pods restart with new IPs, the Lua cache isn't updated until ingress-nginx is restarted. Affects all services behind ingress-nginx (502 Bad Gateway).
This is handled automatically by the ingress-validate play in k8s-upgrade.yml — it restarts stale ingress-nginx pods automatically.
Detection¶
Recovery¶
The ingress-validate play handles this automatically. Manual procedure:
# Delete stale ingress-nginx pods (they restart with fresh Lua state)
kubectl --context pvek8s delete pod -n ingress -l app.kubernetes.io/name=ingress-nginx
# Wait for Ready
kubectl --context pvek8s wait pod -n ingress -l app.kubernetes.io/name=ingress-nginx \
--for=condition=Ready --timeout=120s
# Verify
python3 files/ingress/check_backends.py | python3 -m json.tool
Full Post-Upgrade Checklist¶
Run these after every microk8s rolling upgrade:
cd /home/paul/pgmac/ansible
# 1. All nodes Ready
kubectl --context pvek8s get nodes
# 2. EndpointSlice staleness check (automated in k8s-upgrade.yml)
python3 files/endpoints/check_endpoints.py | python3 -m json.tool
# 3. Ingress-nginx backend check (automated in k8s-upgrade.yml)
python3 files/ingress/check_backends.py | python3 -m json.tool
# 4. ArgoCD sync health
kubectl --context pvek8s -n argocd get applications
# 5. Cluster component health
kubectl --context pvek8s get componentstatuses 2>/dev/null || \
kubectl --context pvek8s get pods -n kube-system
# 6. Check for pods in non-Running/non-Completed state
kubectl --context pvek8s get pods -A \
--field-selector='status.phase!=Running,status.phase!=Succeeded' \
| grep -v Completed
References¶
- Linear: PGM-193 — endpoint staleness discovery and root cause
- PIR: microk8s 1.34 → 1.35 Upgrade
- Scripts:
ansible/files/endpoints/check_endpoints.py,ansible/files/ingress/check_backends.py - NRPE:
ansible/files/nagios/check_k8s_endpoints.sh - Related: dqlite-write-contention runbook — kcm leader restart context