hero_osis crash-loops under hero_proc — health-check probes a socket the binary never creates #178

Closed
opened 2026-04-30 06:27:37 +00:00 by sameh-farouk · 1 comment

Symptom

  • service_osis start reports success.
  • proc service status hero_osis shows state: running, restarts: N>0, growing.
  • pgrep -au $USER -af hero_osis shows only hero_osis_ui, never hero_osis (the backend).
  • Browser hits 404 on /hero_osis_base/rpc (and every other domain) with Socket 'rpc.sock' not found for 'hero_osis_base'.
  • WASM logs Failed to fetch contexts after 5 retries.
  • Logs show repeated received SIGTERM / starting — socket: ... cycles.

Cause

The unified hero_osis backend binds per-domain sockets:

~/hero/var/sockets/hero_osis_base/rpc.sock
~/hero/var/sockets/hero_osis_business/rpc.sock
~/hero/var/sockets/hero_osis_calendar/rpc.sock
... 16 total

It never creates a singular hero_osis/rpc.sock. But service_osis.nu's action spec references that non-existent path:

kill_other: { socket: [$"($sock_base)/hero_osis/rpc.sock"] }
health_checks: [{ openrpc_socket: $"($sock_base)/hero_osis/rpc.sock", ... }]

With svc_server_health_policy (start_period_ms: 3000, timeout_ms: 5000, retries: 3), hero_proc:

  1. Waits 3s grace.
  2. Probes the missing path → fails.
  3. Retries 3× over ~15s → all fail.
  4. SIGTERMs the (perfectly healthy) backend, respawns, repeats.
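
The failure mode in the steps above can be reproduced in isolation. The following Python sketch is not hero_proc's code: `probe_socket`, `is_unhealthy`, `demo_probe`, and the temporary directory layout are hypothetical names chosen for illustration, and the probe is assumed to be a plain Unix-socket connect (the real openrpc_socket check may do more). It shows that a process listening on hero_osis_base/rpc.sock is still judged unhealthy when the probe targets hero_osis/rpc.sock.

```python
import os
import socket
import tempfile


def probe_socket(path: str, timeout_s: float = 5.0) -> bool:
    """Return True if something accepts a connection on the Unix socket at `path`."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.settimeout(timeout_s)
        try:
            client.connect(path)
            return True
        except OSError:
            # Covers a missing path, a refused connection, and a timeout.
            return False


def is_unhealthy(path: str, retries: int = 3, timeout_s: float = 5.0) -> bool:
    """True when none of up to `retries` probes succeeds (assumed policy shape)."""
    return not any(probe_socket(path, timeout_s) for _ in range(retries))


def demo_probe() -> tuple[bool, bool]:
    """Bind only the per-domain socket, then probe both candidate paths."""
    with tempfile.TemporaryDirectory() as sock_base:
        os.mkdir(os.path.join(sock_base, "hero_osis_base"))
        bound = os.path.join(sock_base, "hero_osis_base", "rpc.sock")
        missing = os.path.join(sock_base, "hero_osis", "rpc.sock")
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
            server.bind(bound)
            server.listen(8)
            # The listener is healthy the whole time; only the probed path differs.
            return is_unhealthy(missing), is_unhealthy(bound)


if __name__ == "__main__":
    print(demo_probe())  # (True, False)
```

The listener never changes between the two checks, so the only variable is the path handed to the probe, which matches the diagnosis that the action spec, not the binary, is at fault.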

Confirmation

Run standalone (without hero_proc), ~/hero/bin/hero_osis stays up cleanly and binds all 16 per-domain sockets immediately. restarts: 0. So the binary is fine; the action spec is the bug.
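
One way to repeat this check is to list which directories under the socket root actually contain an rpc.sock. The Python sketch below is illustrative only: `bound_rpc_domains` and `demo_listing` are hypothetical helpers, and the socket root is assumed to be ~/hero/var/sockets as in the paths listed under Cause.

```python
import socket
import stat
import tempfile
from pathlib import Path


def bound_rpc_domains(sock_base: Path) -> list[str]:
    """Names of directories under `sock_base` that hold a real `rpc.sock` socket."""
    domains = []
    for candidate in sorted(sock_base.glob("*/rpc.sock")):
        # Ignore regular files that merely happen to be named rpc.sock.
        if stat.S_ISSOCK(candidate.stat().st_mode):
            domains.append(candidate.parent.name)
    return domains


def demo_listing() -> list[str]:
    """Mimic the layout from the issue: a per-domain rpc.sock plus the UI socket."""
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        servers = []
        for rel in ("hero_osis_base/rpc.sock", "hero_osis/ui.sock"):
            path = base / rel
            path.parent.mkdir()
            server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            server.bind(str(path))
            servers.append(server)
        try:
            return bound_rpc_domains(base)
        finally:
            for server in servers:
                server.close()


if __name__ == "__main__":
    print(demo_listing())  # ['hero_osis_base']
```

Against a live system the call would be `bound_rpc_domains(Path.home() / "hero/var/sockets")`. If the diagnosis holds, hero_osis should be absent from the result while the per-domain names are present.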

Fix

Swap both fields to hero_osis_base/rpc.sock. base is the root domain and is registered first by the unified server, and registration is atomic across all 16 domains, so a healthy base socket is a sufficient liveness signal. hero_osis_ui's path (hero_osis/ui.sock) is correct because hero_osis_ui actually creates that singular socket; leave it untouched.

PR: #177

Author
Member

fixed


Reference
lhumina_code/hero_skills#178