Cluster janitor scan
Scan time: 2026-06-27T12:45:04.220843+00:00
DRY-RUN MODE: log commands you WOULD run but DO NOT execute them. Append to wiki page raw/cluster-janitor-dryrun.md with the kubectl command, rationale, and the specific resource identified for action.
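A dry-run entry carrying the three required fields (command, rationale, resource) could be rendered roughly as follows. This is a minimal sketch: the function names and the entry layout are assumptions, and the target path is passed in by the caller rather than hard-coded.

```python
from datetime import datetime, timezone
from typing import Optional


def format_dryrun_entry(command: str, rationale: str, resource: str,
                        when: Optional[datetime] = None) -> str:
    """Render one dry-run log entry as text. Nothing is executed here."""
    when = when or datetime.now(timezone.utc)
    return (
        f"Dry-run entry ({when.isoformat()})\n"
        f"- resource: {resource}\n"
        f"- command (NOT executed): {command}\n"
        f"- rationale: {rationale}\n"
    )


def append_dryrun_entry(path: str, entry: str) -> None:
    """Append the rendered entry to the dry-run page at the given path."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(entry + "\n")
```

Keeping the formatting separate from the file append makes it possible to review the entry text before anything is written.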
Image pull failures
(none)
CrashLoopBackOff
- autopilot/autopilot-api-7cd6df6b84-2bzlb container=api restartCount=290 [AUTO-CANDIDATE]
- autopilot/autopilot-api-9cd49c749-gdmv8 container=api restartCount=32 [AUTO-CANDIDATE]
- hermes/hermes-agent-bffc948db-vdmrs container=dashboard restartCount=61 [AUTO-CANDIDATE]
- hermes/hermes-chat-shim-6664868ff-42swr container=shim restartCount=95 [AUTO-CANDIDATE]
- hermes/qdrant-549cd9b884-bf9c4 container=qdrant restartCount=195 [AUTO-CANDIDATE]
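The CrashLoopBackOff lines above follow a regular shape (namespace/pod, container, restartCount, optional marker), so they can be turned into structured records before any decision is made. The sketch below assumes that line shape holds for every finding; the type and function names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class CrashLoopFinding(NamedTuple):
    namespace: str
    pod: str
    container: str
    restart_count: int
    auto_candidate: bool


# Matches lines such as:
# - hermes/qdrant-549cd9b884-bf9c4 container=qdrant restartCount=195 [AUTO-CANDIDATE]
_LINE = re.compile(
    r"^-\s+(?P<ns>[^/\s]+)/(?P<pod>\S+)\s+container=(?P<container>\S+)"
    r"\s+restartCount=(?P<restarts>\d+)(?P<flag>\s+\[AUTO-CANDIDATE\])?\s*$"
)


def parse_finding(line: str) -> Optional[CrashLoopFinding]:
    """Parse one scan line; return None for '(none)' or any unrecognised line."""
    m = _LINE.match(line.strip())
    if m is None:
        return None
    return CrashLoopFinding(
        namespace=m["ns"],
        pod=m["pod"],
        container=m["container"],
        restart_count=int(m["restarts"]),
        auto_candidate=m["flag"] is not None,
    )
```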
Pods pending >10 min
(none)
Jobs running >2h
(none)
Stale completed Jobs (>7d)
(none)
Your job
- For each issue, identify likely root cause from the names alone.
- Decide severity: BLOCKING / DEGRADED / COSMETIC.
- Produce a concise summary; the queue-api will auto-post to Slack.
- After reporting, AUTO-EXECUTE remediation for these safe categories:
  - CrashLoopBackOff with restartCount >= 30 AND pod age > 1h (marked [AUTO-CANDIDATE]) → kubectl -n <ns> delete pod <name>. Rationale: 30+ restarts means the issue isn't transient; deleting at least re-rolls against emptyDir-state corruption.
  - Jobs in active state for > 24h with NO progress (marked [AUTO-CANDIDATE]) → kubectl -n <ns> delete job <name>. Rationale: stuck-sidecar / never-terminating-companion is a known pattern; the source CronJob will re-create on schedule.
  - Pods Pending > 30 min where Events show vanished resources (marked [AUTO-CANDIDATE]) → describe + delete.
  - Stale completed Jobs > 7d (marked [AUTO-CANDIDATE]) → kubectl -n <ns> delete job <name>.
ESCALATE (do NOT auto-execute) for:
- Manifest changes (sidecar removal, image updates, env edits, pod affinity tweaks, resource limit changes).
- Anything in kube-system, cert-manager, metallb-system, gpu-operator, or any namespace with the pvs-protected: true label (these are NOT marked [AUTO-CANDIDATE]).
- StatefulSet pods (check ownerReferences; never delete SS pods automatically).
- Anything where the recommended action requires a deploy.sh/Makefile/CI pipeline run.
- ImagePullBackOff / ErrImagePull (usually needs human investigation of image tag or registry credentials).
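The auto-execute and escalate rules for CrashLoopBackOff pods can be read as a small decision function. The sketch below is one possible reading, not the janitor's actual implementation: the `Pod` fields, the `decide` name, and the "report" outcome for pods below the thresholds are assumptions. Escalation checks run first so that a protected pod is never deleted even if it meets the restart threshold.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Namespaces listed in the escalation rules above.
PROTECTED_NAMESPACES = {"kube-system", "cert-manager", "metallb-system", "gpu-operator"}


@dataclass
class Pod:
    namespace: str
    name: str
    restart_count: int
    age_hours: float
    owner_kind: str = "ReplicaSet"  # kind from the pod's ownerReferences
    namespace_labels: Dict[str, str] = field(default_factory=dict)


def decide(pod: Pod) -> Tuple[str, str]:
    """Return (action, detail); action is 'escalate', 'auto', or 'report'."""
    if pod.namespace in PROTECTED_NAMESPACES:
        return ("escalate", "protected namespace")
    if pod.namespace_labels.get("pvs-protected") == "true":
        return ("escalate", "namespace labeled pvs-protected: true")
    if pod.owner_kind == "StatefulSet":
        return ("escalate", "StatefulSet pod")
    if pod.restart_count >= 30 and pod.age_hours > 1:
        return ("auto", f"kubectl -n {pod.namespace} delete pod {pod.name}")
    # Assumption: pods below the thresholds are only reported, not acted on.
    return ("report", "below auto-execute thresholds")
```

In dry-run mode the command string returned for an "auto" decision would be logged rather than run.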
For every auto-executed action, append a Note to this task's body with the kubectl command run + return code + post-action kubectl get pod/job output. For escalations, file a per-issue task in wiki/queue/queued/ named cluster-fix-<short-hash>.md with the diagnosis and recommended action.
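The reporting requirements above could be sketched as two helpers: one that formats the note for an auto-executed action, and one that builds the escalation filename. How the short hash is derived is not specified here, so the SHA-1 prefix of namespace, name, and issue used below is purely an assumption, as are the function names and the note layout.

```python
import hashlib


def format_action_note(command: str, return_code: int, post_output: str) -> str:
    """Combine the command run, its return code, and the post-action output."""
    return (
        f"command: {command}\n"
        f"return code: {return_code}\n"
        f"post-action output:\n{post_output}\n"
    )


def escalation_filename(namespace: str, name: str, issue: str) -> str:
    """Build cluster-fix-<short-hash>.md; the hash scheme is an assumption."""
    key = f"{namespace}/{name}:{issue}".encode("utf-8")
    short_hash = hashlib.sha1(key).hexdigest()[:8]
    return f"cluster-fix-{short_hash}.md"
```

Deriving the hash from the resource and issue keeps the filename stable across repeated scans of the same problem, which would avoid filing duplicate escalations.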
DECOMPOSE REQUIRED: read this before doing anything else
This task has 5 deliverable sections (67 body lines) and is too large for a single autopilot tick. Your entire job this tick is to break it into child tasks. Do not write code, run tests, or commit anything.
How to decompose
For each ### section in this task that represents a distinct unit of work,
create ONE child task file at:
/opt/data/wiki/queue/queued/<slug>-2026-06-27.md
Use this template:
---
title: <one-line description of the deliverable>
id: <slug-derived-from-title-YYYY-MM-DD>
assignee: hermes
priority: high
parent: 0ce43825
created: <ISO timestamp>
---
# <deliverable title>
<Copy the relevant ### section body here; one deliverable only.>
## Done criteria
<One binary check: what does passing look like? e.g. "GET /api/x returns 200", "screenshot shows Y">
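Filling in the template above could look like the following sketch. The slug rule (lowercase, non-alphanumeric runs collapsed to hyphens, date appended) is an assumption, as are the function names. The title is assumed to contain no characters that need quoting in the front matter.

```python
import re
from datetime import datetime


def slugify(title: str, date: str) -> str:
    """Derive an id such as 'fix-hermes-crashloop-2026-06-27' from a title."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{slug}-{date}"


def render_child_task(title: str, body: str, done_criteria: str,
                      parent: str, created: datetime) -> str:
    """Render one child task file following the template's field order."""
    date = created.date().isoformat()
    return "\n".join([
        "---",
        f"title: {title}",
        f"id: {slugify(title, date)}",
        "assignee: hermes",
        "priority: high",
        f"parent: {parent}",
        f"created: {created.isoformat()}",
        "---",
        f"# {title}",
        body,
        "## Done criteria",
        done_criteria,
        "",
    ])
```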
After filing all children
Append a note to this task (via wiki_task_update or direct file append):
decomposed: [child-slug-1, child-slug-2, ...]
Then call:
wiki_task_update({"id": "0ce43825", "status": "done",
"note": "decomposed: [child-slug-1, child-slug-2, ...]"})
The children will be picked up automatically in subsequent ticks.
Exit after the decomposed note. Do not start any child task this tick.
Note (2026-06-27T12:45:21+00:00) [autopilot tick start - picked by queue controller]
Tick with consecutive_stuck=0. Backoff bucket: none.
Note (2026-06-27T13:25:03+00:00) [autopilot tick start - picked by queue controller]
Tick with consecutive_stuck=1. Backoff bucket: 30min.
Note (2026-06-27T15:05:03+00:00) [autopilot tick start - picked by queue controller]
Tick with consecutive_stuck=2. Backoff bucket: 90min.
Note (2026-06-27T16:15:00+00:00) [ralph decompose complete]
Decomposed into two child tasks targeting the CrashLoopBackOff issues per section.
decomposed: [hermes-crashloop-fix-2026-06-27, autopilot-crashloop-fix-2026-06-27]
Note (2026-06-27T19:15:21+00:00) [autopilot tick start - picked by queue controller]
Tick with consecutive_stuck=3. Backoff bucket: 240min.
Note (2026-06-28T19:25:03+00:00) [autopilot tick start - picked by queue controller]
Tick with consecutive_stuck=4. Backoff bucket: 1440min.
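The backoff buckets recorded in the tick notes above (none, 30min, 90min, 240min, 1440min for consecutive_stuck values 0 through 4) can be summarised as a lookup table. The sketch below only restates those observed values. What happens beyond consecutive_stuck=4 is not shown in the notes, so capping at the last bucket is an assumption, as are the names used.

```python
# Observed in the tick notes: consecutive_stuck -> backoff bucket in minutes.
BACKOFF_MINUTES = {0: 0, 1: 30, 2: 90, 3: 240, 4: 1440}


def backoff_minutes(consecutive_stuck: int) -> int:
    """Look up the backoff bucket, clamping to the observed range 0..4."""
    highest = max(BACKOFF_MINUTES)  # largest observed consecutive_stuck value
    clamped = min(max(consecutive_stuck, 0), highest)
    return BACKOFF_MINUTES[clamped]
```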