Ops Janitor — Review

What it is

Kubernetes cluster janitorial automation: cleanup of stale Jobs and CRDs, task-queue hygiene, and pruning of accumulated done/cancelled tasks.
Scope migrated from the hermes-infra project.

Current state

t-001 (stale-job sweep): DONE — scanned 76 jobs across namespaces (52 Complete, 19 Failed, ~10 Running). Identified stale failed jobs but did not apply any cleanup yet.
t-002 (prune done/ tasks): TODO — not started.
t-003 (seed index.md): DONE.
todos.json is out of sync with index.md: its entries still show status: todo and owner: hermes, so the completed tasks (t-001, t-003) need to be marked done in the JSON as well.
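
The drift described above could be detected mechanically. The following is a minimal sketch, assuming todos.json holds a list of objects with id, status, and owner fields and that the statuses from index.md have already been parsed into a dict; both shapes are hypothetical, since neither file's schema is shown in this review.

```python
def find_drift(todos, index_status):
    """Return (task_id, json_status, index_status) for every mismatch.

    todos:        list of dicts as assumed to be stored in todos.json
    index_status: dict of task id -> status as assumed to be read from index.md
    """
    drift = []
    for task in todos:
        task_id = task["id"]
        expected = index_status.get(task_id)
        # Tasks missing from index.md are skipped rather than flagged.
        if expected is not None and task["status"] != expected:
            drift.append((task_id, task["status"], expected))
    return drift


# Hypothetical data mirroring the state described in this review.
todos = [
    {"id": "t-001", "status": "todo", "owner": "hermes"},
    {"id": "t-002", "status": "todo", "owner": "hermes"},
    {"id": "t-003", "status": "todo", "owner": "hermes"},
]
index_status = {"t-001": "done", "t-002": "todo", "t-003": "done"}

for task_id, have, want in find_drift(todos, index_status):
    print(f"{task_id}: todos.json says {have!r}, index.md says {want!r}")
```

Running a check like this before each phase would show whether todos.json can be trusted as a source of truth at that point.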

Gaps / Risks

Stale jobs were identified but not cleaned up: the cluster is still carrying 52 Complete and 19 Failed Jobs, which consume etcd space.
No recurring cadence defined (one-off sweep, no CronJob/scheduler).
todos.json drifts from actual progress — unreliable as source of truth.
All tasks are owned by “hermes” (now obsolete); ownership needs to be split across apollo/mercury/metis per policy.
No dry-run or approval gate documented before deletion.

Recommended approach

Fix data integrity first — sync todos.json to reflect actual done state, correct owners.
Execute remaining sweep — apply cleanup with dry-run output reviewed by pvs.
Automate — turn stale-job sweep into a scheduled CronJob with safe thresholds (age > N days, status filter).
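
The "safe thresholds" and dry-run steps above can be sketched as a pure selection function plus a gated sweep. This is an illustration only: the job dict shape (namespace, name, status, finished_at), the 7-day default for N, and the delete callback are all assumptions, not anything taken from the existing sweep.

```python
from datetime import datetime, timedelta, timezone


def select_stale_jobs(jobs, now, max_age_days=7, statuses=("Failed",)):
    """Pick jobs whose status matches and that finished more than max_age_days ago."""
    cutoff = now - timedelta(days=max_age_days)
    return [
        job for job in jobs
        if job["status"] in statuses
        and job["finished_at"] is not None  # Running jobs have no finish time
        and job["finished_at"] < cutoff
    ]


def sweep(jobs, now, delete, dry_run=True, max_age_days=7, statuses=("Failed",)):
    """Report stale jobs; call delete(job) only when dry_run is False."""
    stale = select_stale_jobs(jobs, now, max_age_days, statuses)
    for job in stale:
        if dry_run:
            print(f"[dry-run] would delete {job['namespace']}/{job['name']}")
        else:
            delete(job)  # hypothetical callback wrapping the real deletion
    return stale


# Hypothetical sample: one old Failed job, one recent Failed job, one Running job.
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
sample = [
    {"namespace": "default", "name": "old-failed", "status": "Failed",
     "finished_at": now - timedelta(days=10)},
    {"namespace": "default", "name": "new-failed", "status": "Failed",
     "finished_at": now - timedelta(days=1)},
    {"namespace": "default", "name": "running", "status": "Running",
     "finished_at": None},
]
sweep(sample, now, delete=lambda job: None)  # dry-run by default
```

Keeping selection separate from deletion means the dry-run output shown to the reviewer and the set later deleted come from the same filter.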

Phased plan

Phase	Goal	Owner
1	Sync todos.json + index.md, correct owners	apollo
2	Run actual cleanup on stale failed jobs (dry-run → review → apply)	apollo
3	Implement done/ task pruner (30-day TTL)	apollo
4	Convert sweep to CronJob with alerting	mercury
5	Audit full scope — add CRD cleanup, orphaned PVCs if needed	metis
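
For Phase 3, a done/ pruner with a 30-day TTL might look like the sketch below. It assumes each done task is a regular file directly inside a done/ directory and that file modification time is an acceptable proxy for completion time; neither assumption is confirmed by this review, and the function name is hypothetical.

```python
import time
from pathlib import Path

TTL_SECONDS = 30 * 24 * 60 * 60  # 30-day TTL from the phased plan


def prune_done(done_dir, now=None, ttl_seconds=TTL_SECONDS, dry_run=True):
    """Return files in done_dir older than the TTL; delete them unless dry_run."""
    now = time.time() if now is None else now
    expired = sorted(
        path for path in Path(done_dir).iterdir()
        if path.is_file() and now - path.stat().st_mtime > ttl_seconds
    )
    if not dry_run:
        for path in expired:
            path.unlink()
    return expired
```

Calling prune_done("done") only lists candidates; passing dry_run=False after review performs the deletion, which matches the dry-run, review, apply order used in Phase 2.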