Cluster Health Monitoring

Periodic dry-run scans of the Kubernetes cluster to identify degraded or unhealthy pods. Uses cluster-janitor script that runs on a scheduled basis.

Scan Categories

Category	Threshold	Severity
Image pull failures	Any pod in ImagePullBackOff/ErrImagePull	DEGRADED
CrashLoopBackOff pods	Restart count ≥ 30, age > 1h	CRITICAL (if recurring)
Pods pending >10 min	Any pod stuck in Pending	DEGRADED
Jobs running >2h	Any long-running job exceeding threshold	INFO
Stale completed Jobs	Completed jobs older than 7 days	INFO (cleanup)

Recent Scan History

Date	Findings	Severity
2026-06-21	autopilot-api (213 restarts), qdrant (46 restarts)	DEGRADED
2026-06-26	Added hermes-agent, hermes-chat-shim to CrashLoop list	CRITICAL
2026-06-27T03:30+10:00	Summary of 4 pods with crashloops	CRITICAL
2026-06-27T19:35+10:00	Crashloop returned; autopilot-planner error pods (7)	CRITICAL

Action Items from Latest Scan (2026-06-27)

CrashLoopBackOff Pods

autopilot-api in hermes — 49 restarts, 73m old
hermes-agent-bffc948db-vdmrs in hermes — dashboard container, 18 restarts
hermes-chat-shim-6664868ff-42swr in hermes — shim container, 50 restarts

Error Pods (autopilot-planner job)

7 pods failed: autopilot-planner-29709120-*. All completed with error.

Pending Pods

archive-inactive-sessions-29709000-gczrf in hermes — stuck for 9m

Dry-Run Commands (awaiting sign-off)

# Restart crashloop pods
kubectl -n hermes delete pod autopilot-api
kubectl -n hermes delete pod hermes-agent-bffc948db-vdmrs
kubectl -n hermes delete pod hermes-chat-shim-6664868ff-42swr
 
# Clean error pods (autopilot-planner job)
kubectl -n autopilot delete pod autopilot-planner-29709120-2tv6r autopilot-planner-29709120-5sfp9 autopilot-planner-29709120-ngp4c autopilot-planner-29709120-t647v autopilot-planner-29709120-vdbqw autopilot-planner-29709120-wpc59 autopilot-planner-29709120-xfl6p

Status: Dry-run only. No live changes executed. Requires pvs sign-off for remediation.

Quartz 4

Explorer

Cluster Health Monitoring

Cluster Health Monitoring

Scan Categories

Recent Scan History

Action Items from Latest Scan (2026-06-27)

CrashLoopBackOff Pods

Error Pods (autopilot-planner job)

Pending Pods

Dry-Run Commands (awaiting sign-off)

See Also

Graph View

Table of Contents