Cluster Health Monitoring
Periodic dry-run scans of the Kubernetes cluster that identify degraded or unhealthy pods. The scans are performed by the cluster-janitor script, which runs on a schedule.
Scan Categories
| Category | Threshold | Severity |
|---|---|---|
| Image pull failures | Any pod in ImagePullBackOff/ErrImagePull | DEGRADED |
| CrashLoopBackOff pods | Restart count ≥ 30, age > 1h | CRITICAL (if recurring) |
| Pods pending >10 min | Any pod stuck in Pending | DEGRADED |
| Jobs running >2h | Any Job still running after 2h | INFO |
| Stale completed Jobs | Completed jobs older than 7 days | INFO (cleanup) |
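The pod-level rows of the table above can be read as a small decision function. The following is a minimal sketch only, not the cluster-janitor implementation: `classify_pod`, its argument order, and the `NONE` label for "no finding" are hypothetical names introduced here, and the "if recurring" qualifier on CrashLoopBackOff is not modeled because the table does not define it.

```shell
#!/bin/sh
# Hypothetical helper: map one pod's observed state to a severity label
# using the thresholds from the Scan Categories table.
# Arguments: <reason-or-phase> <restart-count> <age-in-minutes>
classify_pod() {
    reason=$1
    restarts=$2
    age_min=$3

    case "$reason" in
        ImagePullBackOff|ErrImagePull)
            # Any pod in an image pull failure state is reported.
            echo "DEGRADED" ;;
        CrashLoopBackOff)
            # Table threshold: restart count >= 30 and age > 1h.
            if [ "$restarts" -ge 30 ] && [ "$age_min" -gt 60 ]; then
                echo "CRITICAL"
            else
                echo "NONE"   # assumption: below threshold means no finding
            fi ;;
        Pending)
            # Table threshold: pending for more than 10 minutes.
            if [ "$age_min" -gt 10 ]; then
                echo "DEGRADED"
            else
                echo "NONE"
            fi ;;
        *)
            echo "NONE" ;;
    esac
}
```

For example, `classify_pod CrashLoopBackOff 49 73` prints `CRITICAL`, since 49 restarts and 73 minutes both exceed the table thresholds.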
Recent Scan History
| Date | Findings | Severity |
|---|---|---|
| 2026-06-21 | autopilot-api (213 restarts), qdrant (46 restarts) | DEGRADED |
| 2026-06-26 | Added hermes-agent, hermes-chat-shim to CrashLoop list | CRITICAL |
| 2026-06-27T03:30+10:00 | Summary of 4 pods with crashloops | CRITICAL |
| 2026-06-27T19:35+10:00 | Crashloop returned; autopilot-planner error pods (7) | CRITICAL |
Action Items from Latest Scan (2026-06-27)
CrashLoopBackOff Pods
- autopilot-api in hermes — 49 restarts, 73m old
- hermes-agent-bffc948db-vdmrs in hermes — dashboard container, 18 restarts
- hermes-chat-shim-6664868ff-42swr in hermes — shim container, 50 restarts
Error Pods (autopilot-planner job)
7 pods matching autopilot-planner-29709120-* failed. Each ran to completion and exited with an error.
Pending Pods
- archive-inactive-sessions-29709000-gczrf in hermes — stuck for 9m
Dry-Run Commands (awaiting sign-off)
# Restart crashloop pods
kubectl -n hermes delete pod autopilot-api
kubectl -n hermes delete pod hermes-agent-bffc948db-vdmrs
kubectl -n hermes delete pod hermes-chat-shim-6664868ff-42swr
# Clean error pods (autopilot-planner job)
kubectl -n autopilot delete pod autopilot-planner-29709120-2tv6r autopilot-planner-29709120-5sfp9 autopilot-planner-29709120-ngp4c autopilot-planner-29709120-t647v autopilot-planner-29709120-vdbqw autopilot-planner-29709120-wpc59 autopilot-planner-29709120-xfl6p
Status: Dry-run only. No live changes executed. Requires pvs sign-off for remediation.
See Also
- hermes-k8s-deployment — deployment topology and resource layout
- 2026-06-27 — latest queue baseline (tick health)
- scripts/cluster-janitor — the janitor script source