Cluster janitor scan

Cluster janitor scan

Scan time: 2026-06-27T12:45:04.220843+00:00

🟑 DRY-RUN MODE β€” log commands you WOULD run but DO NOT execute them. Append to wiki page raw/cluster-janitor-dryrun.md with the kubectl command, rationale, and the specific resource identified for action.

Image pull failures

(none)

CrashLoopBackOff

  • autopilot/autopilot-api-7cd6df6b84-2bzlb container=api restartCount=290 [AUTO-CANDIDATE]
  • autopilot/autopilot-api-9cd49c749-gdmv8 container=api restartCount=32 [AUTO-CANDIDATE]
  • hermes/hermes-agent-bffc948db-vdmrs container=dashboard restartCount=61 [AUTO-CANDIDATE]
  • hermes/hermes-chat-shim-6664868ff-42swr container=shim restartCount=95 [AUTO-CANDIDATE]
  • hermes/qdrant-549cd9b884-bf9c4 container=qdrant restartCount=195 [AUTO-CANDIDATE]

Pods pending >10 min

(none)

Jobs running >2h

(none)

Stale completed Jobs (>7d)

(none)

Your job

  1. For each issue, identify likely root cause from the names alone.

  2. Decide severity: BLOCKING / DEGRADED / COSMETIC.

  3. Produce a concise summary; the queue-api will auto-post to Slack.

  4. After reporting, AUTO-EXECUTE remediation for these safe categories:

    • CrashLoopBackOff with restartCount >= 30 AND pod age > 1h (marked [AUTO-CANDIDATE]) β†’ kubectl -n <ns> delete pod <name>. Rationale: 30+ restarts means the issue isn’t transient; deleting at least re-rolls against emptyDir-state corruption.
    • Jobs in active state for > 24h with NO progress (marked [AUTO-CANDIDATE]) β†’ kubectl -n <ns> delete job <name>. Rationale: stuck-sidecar / never-terminating- companion is a known pattern; the source CronJob will re-create on schedule.
    • Pods Pending > 30 min where Events show vanished resources (marked [AUTO-CANDIDATE]) β†’ describe + delete.
    • Stale completed Jobs > 7d (marked [AUTO-CANDIDATE]) β†’ kubectl -n <ns> delete job <name>.

    ESCALATE (do NOT auto-execute) for:

    • Manifest changes (sidecar removal, image updates, env edits, pod affinity tweaks, resource limit changes).
    • Anything in kube-system, cert-manager, metallb-system, gpu-operator, or any namespace with pvs-protected: true label (these are NOT marked [AUTO-CANDIDATE]).
    • StatefulSet pods (check ownerReferences β€” never delete SS pods automatically).
    • Anything where the recommended action requires a deploy.sh/Makefile/CI pipeline run.
    • ImagePullBackOff / ErrImagePull (usually needs human investigation of image tag or registry credentials).

    For every auto-executed action, append a Note to this task’s body with the kubectl command run + return code + post-action kubectl get pod/job output. For escalations, file a per-issue task in wiki/queue/queued/ named cluster-fix-<short-hash>.md with the diagnosis and recommended action.

DECOMPOSE REQUIRED β€” read this before doing anything else

This task has 5 deliverable sections (67 body lines) and is too large for a single autopilot tick. Your entire job this tick is to break it into child tasks. Do not write code, run tests, or commit anything.

How to decompose

For each ### section in this task that represents a distinct unit of work, create ONE child task file at:

/opt/data/wiki/queue/queued/<slug>-2026-06-27.md

Use this template:

---
title: <one-line description of the deliverable>
id: <slug-derived-from-title-YYYY-MM-DD>
assignee: hermes
priority: high
parent: 0ce43825
created: <ISO timestamp>
---

# <deliverable title>

<Copy the relevant ### section body here β€” one deliverable only.>

## Done criteria
<One binary check: what does passing look like? e.g. "GET /api/x returns 200", "screenshot shows Y">

After filing all children

Append a note to this task (via wiki_task_update or direct file append):

decomposed: [child-slug-1, child-slug-2, ...]

Then call:

wiki_task_update({"id": "0ce43825", "status": "done",
    "note": "decomposed: [child-slug-1, child-slug-2, ...]"})

The children will be picked up automatically in subsequent ticks.

Exit after the decomposed note. Do not start any child task this tick.

Note (2026-06-27T12:45:21+00:00) [autopilot tick start β€” picked by queue controller]

Tick with consecutive_stuck=0. Backoff bucket: none.

Note (2026-06-27T13:25:03+00:00) [autopilot tick start β€” picked by queue controller]

Tick with consecutive_stuck=1. Backoff bucket: 30min.

Note (2026-06-27T15:05:03+00:00) [autopilot tick start β€” picked by queue controller]

Tick with consecutive_stuck=2. Backoff bucket: 90min.

Note (2026-06-27T16:15:00+00:00) [ralph decompose complete]

Decomposed into two child tasks targeting the CrashLoopBackOff issues per section.

decomposed: [hermes-crashloop-fix-2026-06-27, autopilot-crashloop-fix-2026-06-27]

Note (2026-06-27T19:15:21+00:00) [autopilot tick start β€” picked by queue controller]

Tick with consecutive_stuck=3. Backoff bucket: 240min.

Note (2026-06-28T19:25:03+00:00) [autopilot tick start β€” picked by queue controller]

Tick with consecutive_stuck=4. Backoff bucket: 1440min.