Cluster Health Monitoring

Periodic dry-run scans of the Kubernetes cluster identify degraded or unhealthy pods. The scans are performed by the cluster-janitor script, which runs on a schedule.

Scan Categories

Category              | Threshold                                | Severity
Image pull failures   | Any pod in ImagePullBackOff/ErrImagePull | DEGRADED
CrashLoopBackOff pods | Restart count ≥ 30, age > 1h             | CRITICAL (if recurring)
Pods pending >10 min  | Any pod stuck in Pending                 | DEGRADED
Jobs running >2h      | Any long-running job exceeding threshold | INFO
Stale completed Jobs  | Completed jobs older than 7 days         | INFO (cleanup)
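
The pod-level thresholds above can be read as a simple classification rule. The following is a minimal sketch of that rule, not the actual cluster-janitor logic: the function name classify_pod, its argument order, and the "OK" fallback label are all assumptions made for illustration, and the "(if recurring)" qualifier on CRITICAL is not modelled.

```shell
# Hypothetical helper: classify_pod STATUS RESTARTS AGE_MINUTES
# Prints a severity label based on the Scan Categories thresholds.
# "OK" is an assumed label for pods that match no category.
classify_pod() {
  local status="$1" restarts="$2" age_min="$3"
  case "$status" in
    ImagePullBackOff|ErrImagePull)
      # Any pod in an image pull failure state is DEGRADED.
      echo "DEGRADED" ;;
    CrashLoopBackOff)
      # Restart count >= 30 and age > 1h (60 minutes).
      if [ "$restarts" -ge 30 ] && [ "$age_min" -gt 60 ]; then
        echo "CRITICAL"
      else
        echo "OK"
      fi ;;
    Pending)
      # Pending for more than 10 minutes.
      if [ "$age_min" -gt 10 ]; then
        echo "DEGRADED"
      else
        echo "OK"
      fi ;;
    *)
      echo "OK" ;;
  esac
}
```

For example, `classify_pod CrashLoopBackOff 49 73` prints CRITICAL, while `classify_pod Pending 0 9` prints OK because the pod has not yet crossed the 10-minute threshold.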

Recent Scan History

Date                   | Findings                                                | Severity
2026-06-21             | autopilot-api (213 restarts), qdrant (46 restarts)      | DEGRADED
2026-06-26             | Added hermes-agent, hermes-chat-shim to CrashLoop list  | CRITICAL
2026-06-27T03:30+10:00 | Summary of 4 pods with crashloops                       | CRITICAL
2026-06-27T19:35+10:00 | Crashloop returned; autopilot-planner error pods (7)    | CRITICAL

Action Items from Latest Scan (2026-06-27)

CrashLoopBackOff Pods

  • autopilot-api in hermes — 49 restarts, 73m old
  • hermes-agent-bffc948db-vdmrs in hermes — dashboard container, 18 restarts
  • hermes-chat-shim-6664868ff-42swr in hermes — shim container, 50 restarts

Error Pods (autopilot-planner job)

7 pods failed (autopilot-planner-29709120-*); all exited with an error.

Pending Pods

  • archive-inactive-sessions-29709000-gczrf in hermes — stuck for 9m

Dry-Run Commands (awaiting sign-off)

# Restart crashloop pods
kubectl -n hermes delete pod autopilot-api
kubectl -n hermes delete pod hermes-agent-bffc948db-vdmrs
kubectl -n hermes delete pod hermes-chat-shim-6664868ff-42swr
 
# Clean error pods (autopilot-planner job)
kubectl -n autopilot delete pod \
  autopilot-planner-29709120-2tv6r \
  autopilot-planner-29709120-5sfp9 \
  autopilot-planner-29709120-ngp4c \
  autopilot-planner-29709120-t647v \
  autopilot-planner-29709120-vdbqw \
  autopilot-planner-29709120-wpc59 \
  autopilot-planner-29709120-xfl6p

Status: dry-run only. No live changes have been executed; remediation requires sign-off from pvs.
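
Before sign-off, the delete commands can be previewed without touching the cluster. The sketch below only prints the commands it would run, each with kubectl's standard --dry-run=client flag appended so that even a copy-pasted line makes no live change. The helper name preview_delete is hypothetical, and whether cluster-janitor itself works this way is an assumption.

```shell
# Hypothetical helper: preview_delete NAMESPACE POD...
# Prints one "kubectl delete pod" command per pod, each with
# --dry-run=client so running the printed line changes nothing.
preview_delete() {
  local ns="$1"
  shift
  local pod
  for pod in "$@"; do
    printf 'kubectl -n %s delete pod %s --dry-run=client\n' "$ns" "$pod"
  done
}

# Example: preview the crashloop restarts listed above.
preview_delete hermes \
  hermes-agent-bffc948db-vdmrs \
  hermes-chat-shim-6664868ff-42swr
```

Printing rather than executing keeps the review step explicit: the output can be attached to the sign-off request, and the flag is removed only once remediation is approved.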

See Also

  • hermes-k8s-deployment — deployment topology and resource layout
  • 2026-06-27 — latest queue baseline (tick health)
  • scripts/cluster-janitor — the janitor script source