Post Incident Review: Radarr Outage Due to OpenEBS Jiva Replica Divergence¶
Date: 2026-02-22
Duration: ~19h40m undetected + ~43m active recovery (22:25 AEST 2026-02-21 → 18:48 AEST 2026-02-22)
Severity: High (single service outage: Radarr completely unavailable)
Status: Resolved
Executive Summary¶
Radarr became unavailable when its pod failed to start, remaining stuck in ContainerCreating for over 4 hours before investigation began. The pod could not start because its persistent volume (radarr-config) could not be mounted. The mount failure was caused by a corrupted ext4 filesystem on the iSCSI block device (/dev/sdi), which itself was caused by all three OpenEBS Jiva storage replicas simultaneously entering a CrashLoopBackOff state with diverged snapshot chains.
The Jiva replicas failed because all three had been left in a Rebuilding: true state following an ungraceful shutdown at approximately 22:25 AEST on 2026-02-21, roughly 15.5 hours before the Radarr pod was scheduled and nearly 20 hours before the failure was detected. Without a healthy replica to serve as a rebuild source, the Jiva controller could not serve a consistent iSCSI target. This left the filesystem journal dirty, which caused fsck -a (run automatically by kubelet before each mount attempt) to fail repeatedly.
Two additional PVCs (overseerr-config and scotchcraft-minecraft-datadir) were found to have suffered the same underlying Jiva failure but self-recovered because at least one of their replicas remained in a healthy state. Their replica pods had restart counts of 10-11 and 52-53 respectively, indicating significant instability around the same event.
Resolution required: scaling down all Jiva deployments, patching volume.meta on one replica to clear the Rebuilding flag, clearing the image data on the other two replicas so they rebuilt from the good source, scaling back up in sequence, and allowing the ext4 journal recovery to complete during the next successful mount.
Timeline (AEST — UTC+10)¶
| Time | Event |
|---|---|
| 2026-02-21 ~22:25 | ROOT EVENT: Ungraceful shutdown interrupts an in-progress Jiva rebuild across all 3 replicas. All nodes' revision.counter files share this timestamp. |
| ~22:25 onwards | All 3 Jiva replicas for radarr-config enter CrashLoopBackOff. Jiva controller loses all healthy backends. iSCSI LUN (/dev/sdi) becomes unserviceable. |
| ~22:25 onwards | Kubelet begins retrying PVC mount for radarr pod (not yet scheduled). Each retry runs fsck -a /dev/sdi, which fails with "can't read superblock". |
| 2026-02-22 14:01 | Radarr pod radarr-cd6596b59-lbc2v scheduled, enters ContainerCreating. PVC mount failing silently — pod status gives no indication of storage failure. |
| 14:01 → 18:05 | Pod remains in ContainerCreating for 4h4m with no alerting. FailedMount events accumulate in pod describe but are not visible without active investigation. |
| ~18:05 | INCIDENT DETECTED: Manual investigation triggered. kubectl describe pod reveals repeated FailedMount events citing can't read superblock on /dev/sdi and fsck found errors but could not correct them. |
| ~18:08 | All 3 Jiva replica pods identified in CrashLoopBackOff: rep-1 (k8s03), rep-2 (k8s02), rep-3 (k8s01). Controller running 2/2 but with zero healthy backends. |
| ~18:10 | Replica logs reveal fatal error: "Current replica's checkpoint not present in rwReplica chain, Shutting down..." |
| ~18:12 | All 3 nodes' volume.meta inspected — all show "Rebuilding":true with identical RevisionCounter: 2538385 but diverged Parent snapshot chains. Root cause confirmed. |
| ~18:15 | RESOLUTION START: All 4 Jiva deployments (controller + 3 replicas) scaled to 0. |
| ~18:17 | volume.meta on k8s01 (rep-3) patched: "Rebuilding": false. This replica designated as the authoritative source. |
| ~18:19 | All .img and .img.meta files moved to .bak directories on k8s02 and k8s03. volume.meta files also moved so those replicas start completely fresh. |
| ~18:21 | Jiva controller scaled back to 1. Becomes ready within 80 seconds. |
| ~18:22 | rep-3 (k8s01, the fixed replica) scaled to 1. Joins controller as RW replica. |
| ~18:24 | rep-1 and rep-2 scaled to 1. Begin rebuilding from rep-3. Jiva correctly serialises — only one WO rebuild at a time. |
| ~18:29 | All 4 Jiva pods Running. Snapshot sync active on rep-2, rep-1 queued. |
| ~18:30 | Old D-state fsck.ext4 process (from prior kubelet retry) clears. /dev/sdi becomes free. |
| ~18:32 | Manual fsck -y /dev/sdi attempt fails — kubelet has already spawned a new fsck -a process, racing for the device. |
| ~18:37 | Radarr scaled to 0 to stop kubelet from competing for /dev/sdi. |
| ~18:40 | D-state fsck -a process clears. dmesg shows EXT4-fs (sdi): recovery complete — the kernel's ext4 journal recovery succeeded during a mount attempt after Jiva became healthy. |
| ~18:47 | Radarr scaled back to 1. |
| 18:48:28 | INCIDENT RESOLVED: Radarr pod radarr-cd6596b59-mlbs6 reaches 1/1 Running. |
Root Causes¶
The Infinite How's Chain¶
"The infinite how's" methodology: at each causal step, ask "how?" rather than accepting the surface answer. Keep drilling until reaching an actionable, preventable cause.
How did radarr become unavailable?¶
The radarr pod entered ContainerCreating and never progressed. The startup probe (TCP socket on port 7878) could not succeed because the container itself never launched.
How did the container fail to launch?¶
Kubelet was unable to mount the radarr-config PVC. Volume mounting is a prerequisite for container creation; without it, the pod is stuck in ContainerCreating indefinitely.
How did the PVC mount fail?¶
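Kubelet automatically runs fsck before mounting a block device-backed volume. The fsck first reported "can't read superblock" on /dev/sdi, and later that it had found errors but could not correct them (UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY). The full message is reproduced under Key Log Entries.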
The auto-repair flag (-a) is insufficient for this class of journal inconsistency. Kubelet has no mechanism to escalate beyond fsck -a; it simply retries and logs FailedMount.
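The two fsck messages point at different problems, so it can help to tell them apart when triaging FailedMount events. The sketch below is illustrative only: the category names are hypothetical labels, and the mapping of "can't read superblock" to an unavailable backend reflects what was observed in this incident rather than a general rule.

```python
from enum import Enum


class MountFailure(Enum):
    # Hypothetical triage labels, not Kubernetes or OpenEBS terminology.
    BACKEND_UNAVAILABLE = "backend unavailable"
    JOURNAL_INCONSISTENT = "journal inconsistent"
    UNKNOWN = "unknown"


def classify_failed_mount(message: str) -> MountFailure:
    """Classify a FailedMount event message by the fsck text it quotes."""
    text = message.lower()
    # In this incident, seen while no Jiva replica was serving the iSCSI target.
    if "can't read superblock" in text:
        return MountFailure.BACKEND_UNAVAILABLE
    # Seen when fsck -a could read the device but declined to repair it.
    if "unexpected inconsistency" in text or "could not correct" in text:
        return MountFailure.JOURNAL_INCONSISTENT
    return MountFailure.UNKNOWN


if __name__ == "__main__":
    event = "/dev/sdi: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY."
    print(classify_failed_mount(event).value)
```

A "backend unavailable" result suggests looking at the Jiva replicas first; a "journal inconsistent" result suggests the device is reachable and the filesystem itself needs attention.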
How did the ext4 filesystem on /dev/sdi become inconsistent?¶
/dev/sdi is the iSCSI block device provided by the OpenEBS Jiva controller for the radarr-config PVC. When all Jiva replica pods simultaneously entered CrashLoopBackOff, the controller had no healthy backends to service I/O. The iSCSI target remained presented to the host but writes timed out or returned errors. The last radarr write session left the ext4 journal in a dirty, uncommitted state. fsck -a cannot repair this state because the superblock's needs_recovery flag was clear even though the journal still held data.
How did all three Jiva replica pods enter CrashLoopBackOff?¶
Each replica's log contained:
level=fatal msg="Failed to add replica to controller, err: Current replica's
checkpoint not present in rwReplica chain, Shutting down..."
Jiva's safety mechanism: when a replica restarts, it contacts the controller and verifies that its latest snapshot checkpoint exists in the controller's canonical chain. If not, the replica refuses to join (to prevent serving stale or diverged data) and exits. With all three replicas failing this check, the controller has zero healthy backends.
How did all three replicas end up with checkpoints that didn't match the controller's chain?¶
Inspection of each replica's volume.meta showed that all three replicas had identical RevisionCounter values (2538385) and identical Checkpoint UUIDs, but each had a different Parent snapshot for its head image:
| Node | Head Parent |
|---|---|
| k8s01 | volume-snap-3d6f0344... |
| k8s02 | volume-snap-b5c23a63... |
| k8s03 | volume-snap-af55ce5c... |
All three were in Rebuilding: true. They had been simultaneously mid-rebuild when something caused a cluster-wide disruption. Each replica had snapshotted at the moment of trying to join (standard Jiva behaviour during rebuild) and those new snapshots were not present in any sibling's chain — causing the circular checkpoint mismatch.
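The comparison described above can be sketched as a small check over the parsed volume.meta contents from each node. This is a minimal sketch, not a supported tool: the field names (Rebuilding, RevisionCounter, Checkpoint, Parent) are the ones quoted in this review, the assumption that they are top-level JSON keys is unverified because the schema is not publicly documented, and `diagnose_replicas` and its result keys are hypothetical names.

```python
def diagnose_replicas(metas: dict) -> dict:
    """Summarise replica state from parsed volume.meta dicts, keyed by node name."""
    rebuilding = [node for node, meta in metas.items() if meta.get("Rebuilding")]
    parents = {meta.get("Parent") for meta in metas.values()}
    checkpoints = {meta.get("Checkpoint") for meta in metas.values()}
    revisions = {meta.get("RevisionCounter") for meta in metas.values()}
    all_rebuilding = bool(metas) and len(rebuilding) == len(metas)
    return {
        "all_rebuilding": all_rebuilding,
        "parents_diverged": len(parents) > 1,
        "shared_checkpoint": len(checkpoints) == 1,
        "shared_revision": len(revisions) == 1,
        # With every replica flagged and the chains diverged, no replica can
        # act as a rebuild source, which is the state found in this incident.
        "needs_manual_recovery": all_rebuilding and len(parents) > 1,
    }


if __name__ == "__main__":
    checkpoint = "volume-snap-0fd00bc8-aaa8-40d1-90c3-1971d4837540.img"
    # Parent values are truncated exactly as they appear in the table above.
    state = {
        "k8s01": {"Rebuilding": True, "RevisionCounter": 2538385,
                  "Checkpoint": checkpoint, "Parent": "volume-snap-3d6f0344..."},
        "k8s02": {"Rebuilding": True, "RevisionCounter": 2538385,
                  "Checkpoint": checkpoint, "Parent": "volume-snap-b5c23a63..."},
        "k8s03": {"Rebuilding": True, "RevisionCounter": 2538385,
                  "Checkpoint": checkpoint, "Parent": "volume-snap-af55ce5c..."},
    }
    print(diagnose_replicas(state))
```

The healthy post-recovery state would report a shared checkpoint, a shared parent, and no replica flagged as rebuilding.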
How did all three replicas end up rebuilding at the same time?¶
The revision.counter file on all three nodes bore the same timestamp: 2026-02-21 22:25 AEST. Jiva updates the revision counter as writes are processed, so the file's mtime marks the last write each replica accepted. The identical timestamp is strong evidence that all three nodes stopped at the same moment in a simultaneous ungraceful shutdown: a power event, network partition, or host-level failure causing all three nodes to lose connectivity or restart at the same instant.
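The timestamp comparison can be made repeatable with a few lines of Python. The sketch assumes the mtime of each node's revision.counter has already been collected as epoch seconds (for example with `stat -c %Y` on each node); `to_aest` and `mtime_spread` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

# AEST is UTC+10, as stated in the timeline heading.
AEST = timezone(timedelta(hours=10))


def to_aest(epoch_seconds: float) -> datetime:
    """Convert an mtime in epoch seconds to an AEST datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=AEST)


def mtime_spread(epoch_by_node: dict) -> timedelta:
    """Return the gap between the earliest and latest mtime across nodes."""
    values = list(epoch_by_node.values())
    return timedelta(seconds=max(values) - min(values))


def same_event(epoch_by_node: dict, window: timedelta = timedelta(minutes=1)) -> bool:
    """True when every node's last write falls inside one short window."""
    return mtime_spread(epoch_by_node) <= window
```

A spread of a minute or less across all three nodes supports a single shared event; a spread of hours would instead suggest the replicas stopped independently.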
OpenEBS Jiva only rebuilds one WO (write-only) replica at a time under normal operation. For all three to be in a rebuilding state simultaneously, the disruption must have occurred while a multi-replica rebuild was already in progress — meaning the system had already been in a partially degraded state before the 22:25 event.
How did a prior degraded state go undetected?¶
There is no alerting on:
- Jiva replica CrashLoopBackOff or elevated restart counts
- Jiva replica Rebuilding: true flag persisting beyond a threshold
- PVC FailedMount events accumulating on pods
- Pods remaining in ContainerCreating beyond a time threshold
The two other affected PVCs (overseerr-config, scotchcraft-minecraft-datadir) had accumulated 10-11 and 52-53 replica restarts respectively before self-recovering, also without triggering any alert.
How did the radarr pod sit in ContainerCreating for over 4 hours without detection?¶
The pod status ContainerCreating is a normal transient state during startup. Kubernetes does not surface FailedMount events prominently in kubectl get pods output — they are only visible via kubectl describe pod. Without a dashboard widget or alert rule explicitly targeting pods stuck in ContainerCreating beyond a threshold (e.g., 5 minutes), the failure was invisible.
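Until an alert rule exists, the same condition can be checked from the output of `kubectl get pods -A -o json`. The following is a rough sketch rather than a monitoring solution: `stuck_creating` is a hypothetical helper, and the 5-minute default simply mirrors the example threshold above.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(value: str) -> datetime:
    """Parse a Kubernetes timestamp such as 2026-02-22T04:01:00Z (UTC)."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stuck_creating(pod_list: dict, now: datetime,
                   threshold: timedelta = timedelta(minutes=5)) -> list:
    """Return namespace/name for pods waiting in ContainerCreating past the threshold."""
    stuck = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        reasons = [
            cs.get("state", {}).get("waiting", {}).get("reason")
            for cs in status.get("containerStatuses", [])
        ]
        if "ContainerCreating" not in reasons:
            continue
        # startTime is set once the kubelet accepts the pod; fall back to creation time.
        started = status.get("startTime") or pod["metadata"]["creationTimestamp"]
        if now - parse_k8s_time(started) > threshold:
            stuck.append(f'{pod["metadata"]["namespace"]}/{pod["metadata"]["name"]}')
    return stuck
```

Feeding it `json.load(sys.stdin)` from the kubectl command and `datetime.now(timezone.utc)` would list any pod in the state the Radarr pod sat in between 14:01 and 18:05.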
Secondary Findings¶
pvc-05e03b60 (overseerr-config) and pvc-f1888541 (scotchcraft-minecraft)¶
Both PVCs were hit by the same underlying Feb 21 22:25 disruption. Both showed the same "checkpoint not present in rwReplica chain" fatal error in replica logs. Unlike radarr, at least one replica for each volume had remained in a healthy (non-Rebuilding) state before the disruption, allowing them to self-recover by electing one replica as RW and rebuilding the others from it. Recovery took 50-90 minutes and produced 10-53 container restarts per replica pod — indicating significant thrashing before convergence.
Impact¶
Services Affected¶
- Radarr (https://radarr.int.pgmac.net): Completely unavailable. Pod stuck in ContainerCreating, no web UI, no API, no media management functionality.
- Overseerr (https://overseerr.int.pgmac.net): Elevated Jiva replica instability but service remained available throughout.
- Scotchcraft Minecraft: Elevated Jiva replica instability but service remained available throughout.
Duration¶
- Radarr total outage: ~20h23m (from 22:25 AEST 2026-02-21 to 18:48 AEST 2026-02-22)
- Silent failure period before the pod was scheduled (undetected): ~15h36m (22:25 → ~14:01)
- Pod stuck in ContainerCreating (undetected): ~4h04m (14:01 → ~18:05)
- Active recovery: ~43m (~18:05 → 18:48)
- Overseerr instability: ~6-8h duration, self-resolved, no user-visible outage confirmed
- Minecraft instability: ~6-8h duration, self-resolved, no user-visible outage confirmed
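The Radarr durations follow directly from the four timestamps in the timeline and can be recomputed as a cross-check. This is plain arithmetic on the (approximate) timeline values; `fmt` is just a local formatting helper.

```python
from datetime import datetime, timedelta, timezone

AEST = timezone(timedelta(hours=10))

# Timestamps taken from the timeline; several are marked approximate there.
root_event = datetime(2026, 2, 21, 22, 25, tzinfo=AEST)
pod_scheduled = datetime(2026, 2, 22, 14, 1, tzinfo=AEST)
detected = datetime(2026, 2, 22, 18, 5, tzinfo=AEST)
resolved = datetime(2026, 2, 22, 18, 48, tzinfo=AEST)


def fmt(delta: timedelta) -> str:
    """Format a timedelta as XhYYm, dropping seconds."""
    minutes = int(delta.total_seconds() // 60)
    return f"{minutes // 60}h{minutes % 60:02d}m"


print("root event -> resolved:     ", fmt(resolved - root_event))      # 20h23m
print("root event -> pod scheduled:", fmt(pod_scheduled - root_event))  # 15h36m
print("pod scheduled -> detected:  ", fmt(detected - pod_scheduled))    # 4h04m
print("root event -> detected:     ", fmt(detected - root_event))       # 19h40m
print("detected -> resolved:       ", fmt(resolved - detected))         # 0h43m
```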
Scope¶
- Storage: OpenEBS Jiva storage subsystem for 3 PVCs across 3 namespaces
- User-facing: Media management (no new media could be tracked or imported via Radarr)
- Monitoring: No detection for ~19h40m of silent failure (22:25 → ~18:05)
Resolution Steps Taken¶
1. Create ArgoCD SyncWindow¶
Create a deny SyncWindow in ArgoCD on all applications to ensure ArgoCD does NOT attempt to auto-sync any changes during the restoration.
2. Scale Down All Jiva Deployments¶
kubectl scale deployment -n openebs \
pvc-a634b9a3-fdaa-4b45-9dc3-2486e716d755-ctrl \
pvc-a634b9a3-fdaa-4b45-9dc3-2486e716d755-rep-1 \
pvc-a634b9a3-fdaa-4b45-9dc3-2486e716d755-rep-2 \
pvc-a634b9a3-fdaa-4b45-9dc3-2486e716d755-rep-3 \
--replicas=0
3. Patch volume.meta on k8s01 (rep-3) — the Authoritative Source¶
# Backup first
sudo cp volume.meta volume.meta.bak
# Patch Rebuilding flag to false
sudo python3 -c "
import json
path = '/var/snap/microk8s/common/var/openebs/pvc-a634b9a3-.../volume.meta'
with open(path) as f:
    data = json.load(f)
data['Rebuilding'] = False
with open(path, 'w') as f:
    json.dump(data, f, separators=(',', ':'))
"
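For a runbook, the same patch could be wrapped in a function that takes the backup itself and replaces the file atomically, so an interrupted write cannot leave a half-written volume.meta. This is a sketch under assumptions, not the procedure that was run: `clear_rebuilding_flag` is a hypothetical name, and it assumes the Jiva deployments are already scaled to 0 and that the script runs with permission to write the replica directory.

```python
import json
import os
import shutil
import tempfile


def clear_rebuilding_flag(meta_path: str) -> str:
    """Set Rebuilding to false in volume.meta and return the backup path."""
    backup_path = meta_path + ".bak"
    shutil.copy2(meta_path, backup_path)  # keep a rollback copy first

    with open(meta_path) as f:
        data = json.load(f)
    data["Rebuilding"] = False

    # Write to a temp file in the same directory, then rename over the
    # original so readers never see a partially written file.
    directory = os.path.dirname(os.path.abspath(meta_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".volume.meta.")
    with os.fdopen(fd, "w") as f:
        json.dump(data, f, separators=(",", ":"))
        f.flush()
        os.fsync(f.fileno())
    shutil.copymode(meta_path, tmp_path)  # mkstemp creates the file as 0600
    os.replace(tmp_path, meta_path)
    return backup_path
```

The compact separators match the original one-liner, on the assumption that the file should stay in the same compact form it was found in.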
4. Clear Image Data on k8s02 and k8s03¶
# On k8s02 and k8s03 — move (not delete) all img files and volume.meta to backup
sudo mkdir -p /var/snap/microk8s/common/var/openebs/pvc-a634b9a3-....bak
sudo mv /var/snap/microk8s/common/var/openebs/pvc-a634b9a3-.../*.img \
/var/snap/microk8s/common/var/openebs/pvc-a634b9a3-....bak/
sudo mv /var/snap/microk8s/common/var/openebs/pvc-a634b9a3-.../*.img.meta \
/var/snap/microk8s/common/var/openebs/pvc-a634b9a3-....bak/
sudo mv /var/snap/microk8s/common/var/openebs/pvc-a634b9a3-.../volume.meta \
/var/snap/microk8s/common/var/openebs/pvc-a634b9a3-....bak/
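The same move can be expressed as a small Python helper, which is easier to reuse from a runbook across several volumes. It is a sketch only: `set_aside_replica_data` is a hypothetical name, it assumes the replica deployments are scaled to 0, and it touches only the file types moved above, leaving anything else in the directory in place.

```python
import shutil
from pathlib import Path


def set_aside_replica_data(replica_dir: str) -> list:
    """Move image files and volume.meta into a sibling .bak directory."""
    src = Path(replica_dir)
    dest = src.with_name(src.name + ".bak")
    dest.mkdir(parents=True, exist_ok=True)

    moved = []
    # Same three patterns as the shell commands: nothing is deleted.
    for pattern in ("*.img", "*.img.meta", "volume.meta"):
        for path in sorted(src.glob(pattern)):
            shutil.move(str(path), str(dest / path.name))
            moved.append(path.name)
    return moved
```

Returning the list of moved file names gives a record that can be pasted into the incident notes before the replicas are scaled back up.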
5. Scale Up in Sequence¶
# Controller first
kubectl scale deployment -n openebs pvc-a634b9a3-...-ctrl --replicas=1
# Wait for controller ready
kubectl wait --for=condition=ready pod -n openebs \
-l openebs.io/persistent-volume=pvc-a634b9a3-...,openebs.io/controller=jiva-controller \
--timeout=60s
# Good replica (k8s01, rep-3) next
kubectl scale deployment -n openebs pvc-a634b9a3-...-rep-3 --replicas=1
# Allow rep-3 to establish as RW, then bring up the others
kubectl scale deployment -n openebs \
pvc-a634b9a3-...-rep-1 \
pvc-a634b9a3-...-rep-2 \
--replicas=1
6. Stop Radarr to Clear the Mount Race¶
# Radarr was generating competing fsck -a processes preventing manual fsck
kubectl scale deployment -n media radarr --replicas=0
At this point the kernel's ext4 journal recovery completed automatically during a mount attempt (dmesg showed EXT4-fs (sdi): recovery complete and mounted filesystem with ordered data mode), eliminating the need for a manual fsck -y.
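Before attempting a manual fsck it is useful to confirm that no earlier fsck process is still blocked on the device. The sketch below reads process state from /proc on the node; `parse_proc_stat`, `blocked_fsck` and `read_stat_lines` are hypothetical helper names, and matching on a command name starting with "fsck" is an assumption based on the fsck.ext4 process seen in this incident.

```python
from pathlib import Path


def parse_proc_stat(line: str) -> tuple:
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = int(line[:open_idx])
    comm = line[open_idx + 1:close_idx]
    state = line[close_idx + 1:].split()[0]
    return pid, comm, state


def blocked_fsck(stat_lines) -> list:
    """Return PIDs of fsck processes in uninterruptible sleep (state D)."""
    pids = []
    for line in stat_lines:
        pid, comm, state = parse_proc_stat(line)
        if state == "D" and comm.startswith("fsck"):
            pids.append(pid)
    return pids


def read_stat_lines():
    """Yield stat contents for every process currently visible in /proc."""
    for stat_file in Path("/proc").glob("[0-9]*/stat"):
        try:
            yield stat_file.read_text()
        except OSError:
            continue  # the process exited between listing and reading
```

On the affected node, `blocked_fsck(read_stat_lines())` returning an empty list would indicate the device is no longer held by a stuck fsck.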
7. Restore Radarr¶
# Scale Radarr back up now that the Jiva backend is healthy
kubectl scale deployment -n media radarr --replicas=1
8. Cleanup¶
# Remove backup directories from all nodes
ssh k8s01 "sudo rm -f .../volume.meta.bak"
ssh k8s02 "sudo rm -rf ...pvc-a634b9a3-....bak"
ssh k8s03 "sudo rm -rf ...pvc-a634b9a3-....bak"
9. Remove SyncWindow¶
Remove the deny SyncWindow in ArgoCD so that normal auto-sync operation resumes.
Verification¶
Service Health¶
- ✅ Radarr: 1/1 Running, stable for 8+ check intervals post-recovery
- ✅ Overseerr: 1/1 Running, no further replica restarts
- ✅ Minecraft: 1/1 Running, no further replica restarts
Storage Health¶
pvc-a634b9a3 (radarr-config):
ctrl: 2/2 Running
rep-1: 1/1 Running (rebuilt from rep-3)
rep-2: 1/1 Running (rebuilt from rep-3)
rep-3: 1/1 Running (authoritative source)
pvc-05e03b60 (overseerr-config):
All replicas: 1/1 Running, Rebuilding=false, shared Checkpoint ✅
pvc-f1888541 (minecraft-datadir):
All replicas: 1/1 Running, Rebuilding=false, shared Checkpoint ✅
Volume Metadata (Post-Recovery)¶
All radarr-config replicas confirmed with:
- "Rebuilding": false
- Shared Checkpoint UUID across all 3 nodes
- Shared Parent snapshot reference
- Active sync converging RevisionCounter values
Preventive Measures¶
Immediate Actions Required¶
- Alert on pods stuck in ContainerCreating > 5 minutes (Critical Priority)
  - Current: No alerting; a pod can sit stuck indefinitely without detection
  - Target: PagerDuty/Slack alert when any pod remains in ContainerCreating beyond 5 minutes
  - Implementation: Prometheus kube_pod_status_phase + duration alert rule
  - Rationale: 4+ hours elapsed before manual detection. This single alert would have reduced radarr's outage from hours to minutes.
- Alert on Jiva replica CrashLoopBackOff (Critical Priority)
  - Current: No alerting on OpenEBS replica pod failures
  - Target: Immediate alert when any Jiva replica pod enters CrashLoopBackOff or Error
  - Implementation: Prometheus kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} filtered to openebs namespace
  - Rationale: All 3 replicas were in CrashLoopBackOff for ~19.5 hours (22:25 → ~18:05) before detection
- Alert on Jiva replica restart count threshold (High Priority)
  - Current: No alerting; overseerr and minecraft accumulated 10-53 restarts silently
  - Target: Alert when any Jiva replica pod exceeds 5 restarts within 30 minutes
  - Implementation: increase(kube_pod_container_status_restarts_total[30m]) > 5 filtered to openebs namespace
  - Rationale: The self-recovered PVCs showed the same failure pattern in a slightly less severe form; an early restart alert would flag the pattern before it becomes critical
- Alert on FailedMount events (High Priority)
  - Current: FailedMount events are only visible via kubectl describe; no alerting
  - Target: Alert when a pod generates more than 3 FailedMount events
  - Implementation: Prometheus kube_event_count{reason="FailedMount"} alert rule
  - Rationale: The mount failure was generating repeated events for hours with no visibility
- Document OpenEBS Jiva replica divergence recovery runbook (High Priority)
  - Current: No documented procedure; recovery required real-time diagnosis
  - Target: Step-by-step runbook covering: identify diverged replicas → patch volume.meta → clear image data on non-source replicas → scale up in sequence
  - Location: incidents/docs/runbooks/openebs-jiva-replica-recovery.md
  - Rationale: Recovery took ~43 minutes of active work; a runbook would reduce this significantly and remove the knowledge dependency
- Investigate and document the Feb 21 22:25 root event (High Priority)
  - Current: The simultaneous all-node disruption at 22:25 AEST is unexplained
  - Target: Identify whether this was a power event, network partition, kernel bug, or other cause
  - Actions:
    - Review UPS/PDU logs for that timeframe
    - Review node-level system logs (/var/log/syslog) from all 3 nodes around 22:25
    - Check Proxmox/hypervisor logs if nodes are VMs
  - Rationale: The same unknown event also degraded overseerr and minecraft. If it recurs, all Jiva volumes are at risk of the same failure mode
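When turning the restart target into a Prometheus rule, note that PromQL's rate() returns a per-second average while increase() returns the growth over the whole window, so a threshold stated as "restarts within 30 minutes" needs converting before it is compared with a rate() result. A short worked conversion, using the 5-restart and 30-minute figures from the target above:

```python
WINDOW_SECONDS = 30 * 60   # the 30-minute alert window
RESTART_THRESHOLD = 5      # restarts within that window

# Equivalent per-second threshold for a rate()-style expression.
per_second_threshold = RESTART_THRESHOLD / WINDOW_SECONDS


def restarts_in_window(rate_per_second: float, window: int = WINDOW_SECONDS) -> float:
    """Convert a per-second restart rate back into restarts per window."""
    return rate_per_second * window


print(f"{per_second_threshold:.5f} restarts/s")    # 0.00278 restarts/s
print(f"{restarts_in_window(0.1):.0f} restarts")   # 180 restarts
```

A rate() threshold of 0.1 would therefore only fire at roughly 180 restarts in 30 minutes, far above anything seen in this incident; comparing increase() over 30 minutes against 5 expresses the target directly.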
Longer-Term Improvements¶
- Jiva Rebuilding flag monitoring (Medium Priority)
  - Add a periodic check (every 5 minutes) that inspects volume.meta on all Jiva replica nodes and alerts if Rebuilding: true persists beyond 30 minutes
  - A replica stuck in Rebuilding for >30 minutes indicates a stalled or failed rebuild that requires intervention
  - Implementation: CronJob running a script against node hostPaths, or custom Prometheus exporter
- Jiva rebuild serialisation guard (Medium Priority)
  - When a cluster-wide disruption leaves all replicas in Rebuilding state simultaneously, Jiva has no self-healing path because no replica can establish as RW
  - Investigate whether OpenEBS Jiva has a recovery mode or operator-level intervention hook that can be automated
  - Consider upgrading OpenEBS if newer versions have improved recovery handling for this scenario
- Structured review of all Jiva volumes' health state (Medium Priority)
  - Run a periodic job that checks volume.meta on all Jiva replica hostPaths across all nodes
  - Report: revision counter skew between replicas, Rebuilding flag, Dirty flag, snapshot chain depth
  - This would surface partial degradation (e.g., one of three replicas in an unhealthy state) before it becomes a full outage
- Snapshot chain depth monitoring (Medium Priority)
  - Referenced from the 2026-01-06 PIR: excessive snapshot accumulation caused Phase 2 storage issues in that incident
  - The radarr-config PVC had 280+ snapshot files on k8s01 at recovery time, indicating the jiva-snapshot-cleanup cronjob may not be running effectively
  - Verify snapshot cleanup cronjob is healthy and its threshold/frequency is appropriate (see 2026-01-06 action items)
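The Rebuilding flag monitor proposed above could start as a script run on each node. The following is a minimal sketch under stated assumptions: `stale_rebuilds` is a hypothetical name, the `pvc-*/volume.meta` layout is inferred from the replica hostPath in this review, and the file's mtime is used as a stand-in for "how long the flag has been set", which has not been verified against Jiva's behaviour.

```python
import json
import time
from pathlib import Path


def stale_rebuilds(openebs_root: str, max_age_seconds: int = 30 * 60, now=None) -> list:
    """Return (volume, age_seconds) for replicas flagged Rebuilding for too long."""
    now = time.time() if now is None else now
    findings = []
    for meta in sorted(Path(openebs_root).glob("pvc-*/volume.meta")):
        try:
            data = json.loads(meta.read_text())
            age = now - meta.stat().st_mtime
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or mid-write; pick it up on the next run
        # Assumption: mtime approximates when the flag was last written.
        if data.get("Rebuilding") and age > max_age_seconds:
            findings.append((meta.parent.name, int(age)))
    return findings
```

A CronJob wrapper would call this with the node's OpenEBS hostPath root and raise an alert whenever the returned list is non-empty.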
Lessons Learned¶
What Went Well¶
- Thorough diagnostic approach: The full causal chain from "pod not starting" to "all replicas Rebuilding simultaneously" was traced in approximately 10 minutes using kubectl describe, pod logs, node SSH access, and volume.meta inspection
- Careful recovery sequencing: Scaling down before making filesystem changes, choosing the source replica deliberately, and moving (not deleting) replica data until recovery was confirmed all prevented data loss
- Self-healing worked for two of three affected PVCs: The overseerr and minecraft volumes recovered without intervention, demonstrating that Jiva's rebuild mechanism works correctly when at least one healthy replica survives
- Backup before touching metadata: volume.meta.bak was created before patching, and image files were moved rather than deleted, preserving rollback options throughout
- The kernel handled ext4 recovery: Once the Jiva backend was healthy, the kernel's built-in ext4 journal recovery resolved the filesystem corruption without requiring a separate manual fsck, simplifying the recovery
What Didn't Go Well¶
- Nearly 20 hours of silent failure: The root event occurred at 22:25 AEST; the incident was not detected until ~18:05 the next day, a detection gap of about 19h40m
- ContainerCreating is invisible as an error state: The pod appeared "normal" to casual inspection; only kubectl describe revealed the FailedMount events
- No Jiva health alerting whatsoever: Three separate PVCs experienced Jiva replica failures affecting 3 different services, all without any alert being generated
- fsck -a race condition: Kubelet continuously spawning new fsck -a processes prevented manual fsck -y from acquiring the device, requiring the workaround of scaling radarr to 0
- The underlying 22:25 disruption remains unexplained: The simultaneous all-replica crash is the true root cause, and without knowing what caused it, the risk of recurrence cannot be assessed or mitigated
- Jiva has no self-healing path when all replicas diverge: The system had no ability to recover without manual metadata surgery, which is a fundamental architectural limitation
Surprise Findings¶
- All 3 replicas can simultaneously enter an unrecoverable state: The assumption that 3-replica Jiva provides resilience only holds if the disruption affects fewer than a quorum. A simultaneous all-node disruption during an active rebuild defeats this assumption entirely.
- The Rebuilding flag persists across pod restarts: volume.meta is on the node's hostPath, not in the pod. Each time a replica pod restarted, it read Rebuilding: true and immediately failed. The CrashLoopBackOff was not a transient issue; it would never self-resolve.
- The revision.counter timestamp as a forensic tool: The identical Feb 21 22:25 mtime on all three nodes' revision.counter files provided precise timing of the root event without any application-level logging.
- ext4 journal recovery as a "free" fix: The kernel handled the filesystem repair during the successful mount after Jiva was fixed, avoiding the need for manual fsck -y. The can't read superblock error from the earlier kubelet attempts was due to the total absence of Jiva backends, not permanent disk corruption.
- Two other services were also impacted but self-recovered: Without checking all Jiva pod restart counts, the full blast radius of the Feb 21 event would have been unknown. The same root event caused instability across minecraft and overseerr.
Action Items¶
| Priority | Action | Owner | Due Date | Status |
|---|---|---|---|---|
| Critical | Alert: pod stuck in ContainerCreating > 5 minutes | SRE | 2026-03-01 | Open |
| Critical | Alert: Jiva replica pod CrashLoopBackOff in openebs namespace | SRE | 2026-03-01 | Open |
| High | Alert: Jiva replica restart rate > 5 in 30 minutes | SRE | 2026-03-08 | Open |
| High | Alert: FailedMount events > 3 on any pod | SRE | 2026-03-08 | Open |
| High | Write OpenEBS Jiva replica recovery runbook | SRE | 2026-03-08 | Open |
| High | Investigate Feb 21 22:25 root event (UPS, PDU, hypervisor logs) | SRE | 2026-03-01 | Open |
| Medium | Implement Jiva Rebuilding flag monitor (cronjob/exporter) | SRE | 2026-03-15 | Open |
| Medium | Investigate Jiva upgrade path or automated rebuild recovery | SRE | 2026-03-22 | Open |
| Medium | Periodic Jiva volume health report (revision skew, chain depth) | SRE | 2026-03-22 | Open |
| Medium | Verify jiva-snapshot-cleanup cronjob health and thresholds | SRE | 2026-03-01 | Open |
| Low | Investigate ci namespace high-restart pods (dependency-track: 176 restarts) | SRE | 2026-03-15 | Open |
| Low | Investigate media namespace chronic restarters (metasearch: 34, linkace: 11) | SRE | 2026-03-22 | Open |
Technical Details¶
Environment¶
- Cluster: pvek8s (microk8s on 3 nodes: k8s01/172.22.22.6, k8s02, k8s03/172.22.22.9)
- Storage: OpenEBS Jiva 2.12.1 (openebs-jiva-default storage class)
- Affected PVC: radarr-config (pvc-a634b9a3-fdaa-4b45-9dc3-2486e716d755), 5Gi RWO
- iSCSI target: iqn.2016-09.com.openebs.jiva:pvc-a634b9a3-fdaa-4b45-9dc3-2486e716d755 at 10.152.183.80:3260
- Block device on k8s01: /dev/sdi (ext4 filesystem, 2G)
- Replica hostPath: /var/snap/microk8s/common/var/openebs/pvc-a634b9a3-fdaa-4b45-9dc3-2486e716d755/
Replica State at Discovery¶
| Node | Replica | Revision Counter | Head Image | Rebuilding | Status |
|---|---|---|---|---|---|
| k8s01 | rep-3 | 2538385 | volume-head-280.img | true | CrashLoopBackOff |
| k8s02 | rep-2 | 2538385 | volume-head-376.img | true | CrashLoopBackOff |
| k8s03 | rep-1 | 2538385 | volume-head-378.img | true | CrashLoopBackOff |
All three had identical Checkpoint: "volume-snap-0fd00bc8-aaa8-40d1-90c3-1971d4837540.img" but different Parent snapshot references, confirming divergence during an interrupted multi-replica rebuild.
Key Log Entries¶
Jiva replica fatal error (all 3 replicas):
level=fatal msg="Failed to add replica to controller, err: Current replica's
checkpoint not present in rwReplica chain, Shutting down..."
Kubelet mount failure (pod events):
Warning FailedMount kubelet MountVolume.MountDevice failed for volume
"pvc-a634b9a3-..." : 'fsck' found errors on device /dev/disk/by-path/...
but could not correct them:
/dev/sdi: recovering journal
/dev/sdi: Superblock needs_recovery flag is clear, but journal has data.
/dev/sdi: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
ext4 recovery success (dmesg on k8s01):
[...] EXT4-fs (sdi): recovery complete
[...] EXT4-fs (sdi): mounted filesystem with ordered data mode. Opts: (null)
[...] sd 10:0:0:0: [sdi] Synchronizing SCSI cache
Other Affected PVCs (Same Root Event)¶
| PVC | App | Max Restarts | Self-Recovered | Outage |
|---|---|---|---|---|
| pvc-05e03b60 (overseerr-config) | Overseerr | 11 | Yes | None confirmed |
| pvc-f1888541 (minecraft-datadir) | Scotchcraft Minecraft | 53 | Yes | None confirmed |
References¶
- Previous incident covering Jiva snapshot accumulation: incidents/docs/incidents/2026-01-06-cluster-cascade-failure.md
- OpenEBS Jiva documentation: https://openebs.io/docs/user-guides/jiva
- OpenEBS Jiva volume.meta schema: internal replica metadata, not publicly documented
Reviewers¶
- Prepared by: Claude (AI Assistant)
- Date: 2026-02-22
- Review Status: Draft — Pending human review