Ops Janitor — Review

What it is

  • Janitorial automation for the Kubernetes cluster: stale Job/CRD cleanup, task-queue hygiene, and pruning of accumulated done/cancelled tasks.
  • Scope migrated from the hermes-infra project.

Current state

  • t-001 (stale-job sweep): DONE. Scanned 76 jobs across namespaces (52 Complete, 19 Failed, ~10 Running). Stale failed jobs were identified, but no cleanup has been applied yet.
  • t-002 (prune done/ tasks): TODO — not started.
  • t-003 (seed index.md): DONE.
  • todos.json is out of sync with index.md: its entries still show status: todo and owner: hermes. t-001 should also be marked done in the JSON.
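
A drift check along these lines could make the mismatch visible before anyone edits the files by hand. This is a minimal sketch, not the project's tooling: the todos.json schema (a list of objects with id, status and owner fields) is an assumption, and the index.md statuses are transcribed into a dict because the layout of index.md is not described in this review.

```python
import json

# ASSUMPTION: todos.json is a list of {"id", "status", "owner"} objects.
# The real schema is not shown in the review; field names are hypothetical.
TODOS_JSON = """
[
  {"id": "t-001", "status": "todo", "owner": "hermes"},
  {"id": "t-002", "status": "todo", "owner": "hermes"},
  {"id": "t-003", "status": "todo", "owner": "hermes"}
]
"""

# Task status as recorded in index.md, transcribed by hand for this sketch.
INDEX_STATUS = {"t-001": "done", "t-002": "todo", "t-003": "done"}

# Owners that should no longer appear on any task.
OBSOLETE_OWNERS = {"hermes"}


def find_drift(todos, index_status, obsolete_owners):
    """Return readable findings where todos.json disagrees with index.md."""
    findings = []
    for task in todos:
        task_id = task["id"]
        expected = index_status.get(task_id)
        # Report a status mismatch only for tasks that index.md knows about.
        if expected is not None and task["status"] != expected:
            findings.append(
                f"{task_id}: status is {task['status']!r}, index.md says {expected!r}"
            )
        if task["owner"] in obsolete_owners:
            findings.append(f"{task_id}: owner {task['owner']!r} is obsolete")
    return findings


if __name__ == "__main__":
    for line in find_drift(json.loads(TODOS_JSON), INDEX_STATUS, OBSOLETE_OWNERS):
        print(line)
```

The function only reports; it never rewrites todos.json, so the output can be reviewed before any correction is applied.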

Gaps / Risks

  • Stale jobs were identified but not cleaned up: the cluster still holds 52 Complete and 19 Failed Jobs, all consuming etcd space.
  • No recurring cadence defined (one-off sweep, no CronJob/scheduler).
  • todos.json drifts from actual progress — unreliable as source of truth.
  • All tasks owned by “hermes” (now obsolete); need owner split across apollo/mercury/metis per policy.
  • No dry-run or approval gate documented before deletion.

Recommendations

  1. Fix data integrity first: sync todos.json to reflect the actual done state and correct the owners.
  2. Execute the remaining sweep: apply cleanup only after the dry-run output has been reviewed by pvs.
  3. Automate: turn the stale-job sweep into a scheduled CronJob with safe thresholds (age > N days, status filter).
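
The age and status thresholds from step 3, combined with the dry-run gate from step 2, might look like the sketch below. It works on dicts shaped like the items returned by `kubectl get jobs -A -o json` and only renders `kubectl delete` commands as strings; nothing is executed. The 7-day value for N, the helper names, and the sample jobs are all assumptions made for illustration, since the review leaves N unspecified.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(value):
    """Parse the UTC timestamps Kubernetes emits, e.g. 2024-01-01T00:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def terminal_condition(job):
    """Return the Complete/Failed condition of a Job, or None if it is still running."""
    for cond in job.get("status", {}).get("conditions", []):
        if cond.get("type") in ("Complete", "Failed") and cond.get("status") == "True":
            return cond
    return None


def select_stale_jobs(jobs, now, max_age_days, statuses=("Failed",)):
    """Pick (namespace, name) of finished Jobs older than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for job in jobs:
        cond = terminal_condition(job)
        if cond is None or cond["type"] not in statuses:
            continue  # still running, or a status we were not asked to clean
        if parse_k8s_time(cond["lastTransitionTime"]) < cutoff:
            meta = job["metadata"]
            stale.append((meta["namespace"], meta["name"]))
    return stale


def render_delete_commands(stale, apply=False):
    """Render kubectl commands; unless apply=True they carry --dry-run=client."""
    suffix = "" if apply else " --dry-run=client"
    return [f"kubectl delete job {name} -n {ns}{suffix}" for ns, name in stale]


def make_job(ns, name, cond_type=None, finished=None):
    """Build a minimal Job-shaped dict for the demo (not real cluster data)."""
    conditions = []
    if cond_type:
        conditions.append(
            {"type": cond_type, "status": "True", "lastTransitionTime": finished}
        )
    return {"metadata": {"namespace": ns, "name": name}, "status": {"conditions": conditions}}


# Hypothetical sample data; namespaces and names are invented.
SAMPLE_JOBS = [
    make_job("batch", "old-failed", "Failed", "2024-01-01T00:00:00Z"),
    make_job("batch", "recent-failed", "Failed", "2024-01-19T00:00:00Z"),
    make_job("batch", "old-complete", "Complete", "2024-01-01T00:00:00Z"),
    make_job("batch", "still-running"),
]

if __name__ == "__main__":
    now = datetime(2024, 1, 20, tzinfo=timezone.utc)
    # max_age_days=7 is a placeholder for the unspecified N.
    for cmd in render_delete_commands(select_stale_jobs(SAMPLE_JOBS, now, max_age_days=7)):
        print(cmd)
```

Keeping selection and command rendering separate means the dry-run list can be handed to a reviewer unchanged, and the same selection is reused with apply=True once it is approved.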

Phased plan

Phase  Goal                                                                Owner
1      Sync todos.json + index.md, correct owners                          apollo
2      Run actual cleanup on stale failed jobs (dry-run → review → apply)  apollo
3      Implement done/ task pruner (30-day TTL)                            apollo
4      Convert sweep to CronJob with alerting                              mercury
5      Audit full scope: add CRD cleanup, orphaned PVCs if needed          metis
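
For phase 3, a done/ pruner with a 30-day TTL could start from something like the following. It is a sketch under stated assumptions: done/ is taken to be a flat directory of task files, and a file's age is measured by its modification time. Neither detail is confirmed by this review, and the function name is hypothetical.

```python
import time
from pathlib import Path

TTL_DAYS = 30  # TTL taken from the phased plan


def prune_done(done_dir, ttl_days=TTL_DAYS, now=None, dry_run=True):
    """Return files in done_dir older than ttl_days; delete them only if dry_run is False.

    ASSUMPTION: age is judged by file mtime, and done_dir has no nested folders to prune.
    """
    now = time.time() if now is None else now
    cutoff = now - ttl_days * 86400  # seconds in a day
    expired = sorted(
        p for p in Path(done_dir).iterdir()
        if p.is_file() and p.stat().st_mtime < cutoff
    )
    if not dry_run:
        for path in expired:
            path.unlink()
    return expired


if __name__ == "__main__":
    # Dry run by default: lists what would be removed from ./done without deleting.
    target = Path("done")
    if target.is_dir():
        for path in prune_done(target):
            print(f"would delete {path}")
```

Defaulting to dry_run=True matches the dry-run → review → apply flow used for the job cleanup in phase 2.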