Failure & Recovery
Problem Diagnosis & Escalation — What broke and what decision do you need from me?
Build Specification: Failure & Recovery Dashboard
Specification Source: Hilbert Factory Sections 4 (Chief Engineer), 5 (Packet-not-ready), 9 (Stop Conditions), 17 (Chief Engineer Ops) + Dashboard Spec View 4
Panel 4.1 — Active Failures
Chart Type: Data Table with severity highlighting
Data Source: GET /api/orchestrator/queue?status=failed,repairing,escalated
Refresh Rate: FREQUENT (every 30 seconds)
Display: ESCALATED packets are pinned to the top with a red highlight because they need human action. Each row shows: packet_id, failure type, failure step, retry count, time since failure, current routing.
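Illustrative sketch of the row ordering described above. Only packet_id and the three status values come from this spec; the other field names, the ISO timestamp format, and the oldest-first tie-break are assumptions made for the example, not part of the queue API contract.

```typescript
// Hypothetical row shape for the Active Failures table.
// Only packet_id and the status values appear in the spec; the rest is assumed.
type FailureStatus = "failed" | "repairing" | "escalated";

interface ActiveFailure {
  packet_id: string;
  status: FailureStatus;
  failure_type: string;
  failure_step: string;
  retry_count: number;
  failed_at: string; // assumed ISO 8601 timestamp
  current_routing: string;
}

// Escalated rows first; within each group, oldest failure first (assumed tie-break).
function sortActiveFailures(rows: ActiveFailure[]): ActiveFailure[] {
  const rank = (r: ActiveFailure): number => (r.status === "escalated" ? 0 : 1);
  return rows
    .slice() // copy so the caller's array is not mutated
    .sort(
      (a, b) =>
        rank(a) - rank(b) ||
        Date.parse(a.failed_at) - Date.parse(b.failed_at),
    );
}

// Rows that should receive the red highlight.
function needsHumanAction(row: ActiveFailure): boolean {
  return row.status === "escalated";
}

// Whole minutes elapsed since the failure, for the "time since failure" column.
function minutesSinceFailure(row: ActiveFailure, nowMs: number): number {
  return Math.floor((nowMs - Date.parse(row.failed_at)) / 60000);
}
```

Sorting on the client keeps the ordering rule independent of whatever order the endpoint returns.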
Panel 4.2 — Escalation Queue (Human Action Required)
Chart Type: Action Card list
Data Source: GET /api/escalations/pending
Refresh Rate: FREQUENT (every 30 seconds)
Display: One card per escalation showing escalation_id, packet_id, category, time waiting, and the Chief Engineer’s recommended actions. Each card has four action buttons: “Approve Repair”, “Modify Architecture”, “Override and Resume”, “Suspend Build”.
Interaction: Action buttons trigger POST /api/escalations/{id}/resolve with the chosen action
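Illustrative sketch of how a button click could be turned into the resolve call. The endpoint path and button labels come from this spec; the action identifiers, the JSON body shape ({ action }), and the Content-Type header are assumptions, since the spec does not define the request payload.

```typescript
// Hypothetical wire values for the four buttons; the spec only gives the labels.
type EscalationAction =
  | "approve_repair"
  | "modify_architecture"
  | "override_and_resume"
  | "suspend_build";

// Button labels as written in the spec, keyed by the assumed wire value.
const ACTION_LABELS: Record<EscalationAction, string> = {
  approve_repair: "Approve Repair",
  modify_architecture: "Modify Architecture",
  override_and_resume: "Override and Resume",
  suspend_build: "Suspend Build",
};

interface ResolveRequest {
  method: "POST";
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Builds a request descriptor for POST /api/escalations/{id}/resolve.
// The body shape is an assumption, not a documented contract.
function buildResolveRequest(
  escalationId: string,
  action: EscalationAction,
): ResolveRequest {
  return {
    method: "POST",
    url: `/api/escalations/${encodeURIComponent(escalationId)}/resolve`,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ action }),
  };
}
```

Returning a plain descriptor rather than sending the request directly keeps the sketch independent of any particular HTTP client; the caller can pass its fields to whichever client the dashboard uses.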
Panel 4.3 — Chief Engineer Activity
Chart Type: Data Table + Gauge
Data Source: GET /api/chief-engineer/activity — returns all interventions
Refresh Rate: PERIODIC (every 5 minutes)
Display: Resolution rate gauge (% resolved without human escalation). Table: diagnosis_id, packet_id, classification, root cause, confidence score, outcome (REPAIR/ESCALATE), duration.
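Illustrative sketch of the gauge value. It assumes that an intervention with outcome REPAIR counts as "resolved without human escalation" and that the gauge is computed over every intervention the endpoint returns; the Intervention type lists only the fields needed here, and returning null for an empty list is a choice made for the example.

```typescript
type Outcome = "REPAIR" | "ESCALATE";

// Minimal subset of a Chief Engineer intervention record (assumed shape).
interface Intervention {
  diagnosis_id: string;
  outcome: Outcome;
}

// Percentage of interventions resolved without human escalation.
// Returns null when there is nothing to measure, so the gauge can show
// an empty state instead of a misleading 0%.
function resolutionRate(interventions: Intervention[]): number | null {
  if (interventions.length === 0) return null;
  const repaired = interventions.filter((i) => i.outcome === "REPAIR").length;
  return (repaired / interventions.length) * 100;
}
```

For example, three REPAIR outcomes and one ESCALATE outcome give a gauge value of 75.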
Panel 4.4 — Failure Pattern Analysis
Chart Type: Pie Chart + Bar Chart + Line Chart (Recharts)
Data Source: GET /api/failures/patterns — returns aggregated failure data
Refresh Rate: SESSION
Display: Most common failure types (pie), failure rate by phase (bar), and failure rate trend over 30 days (line). Systemic issue highlight: if the same root cause appears 3 or more times, show a red badge reading “Systemic Issue — architecture review recommended”.
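Illustrative sketch of the systemic issue rule. The threshold of 3 comes from this spec; the per-failure record shape and the root_cause field name are assumptions, and the endpoint may well return pre-aggregated counts, in which case only the threshold comparison applies.

```typescript
// Hypothetical per-failure record; the real endpoint may return counts instead.
interface FailureRecord {
  packet_id: string;
  root_cause: string;
}

// From the spec: the same root cause appearing 3 or more times is systemic.
const SYSTEMIC_THRESHOLD = 3;

// Returns the root causes that should carry the red "Systemic Issue" badge,
// in order of first appearance.
function findSystemicIssues(
  records: FailureRecord[],
  threshold: number = SYSTEMIC_THRESHOLD,
): string[] {
  const counts = new Map<string, number>();
  for (const record of records) {
    counts.set(record.root_cause, (counts.get(record.root_cause) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .filter(([, count]) => count >= threshold)
    .map(([cause]) => cause);
}
```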