hero_inspector crash-loops in Docker container (hero_proc conflict) #100

Closed
opened 2026-03-27 12:35:57 +00:00 by mik-tf · 1 comment
Owner

Problem

hero_inspector_server crash-loops in the Docker container (~0.5s restart interval). When run manually it works fine — discovers 42 services, binds socket, serves requests.

The crash-loop causes 4 smoke test failures:

  • hero_inspector_server /rpc (0 methods — service not running)
  • MCP gateway (hero_redis_server) — depends on inspector
  • hero_inspector_ui status-dot missing
  • hero_inspector_ui connection-status.js not found

Root cause

hero_proc starts hero_inspector_server, but the process may exit or get killed before hero_proc detects it as healthy. hero_proc then restarts it, but the previous socket file may still exist, causing a bind conflict. The rapid restart loop (~0.5s) suggests hero_proc is retrying without enough delay.

When run manually (docker exec herolocal /root/hero/bin/hero_inspector_server), it works perfectly — discovers all 42 services, binds socket, serves RPC.

Evidence

Docker logs show repeated restarts:

job process started job_id=18 pid=1916 action=user.hero_inspector_server.run
job process started job_id=18 pid=2005 action=user.hero_inspector_server.run
...(every 0.5s)

Likely fix

  • Check hero_proc restart delay configuration for inspector
  • Ensure stale socket cleanup before rebind
  • May need --start flag support in hero_inspector_server for proper self-registration

Signed-off-by: mik-tf



Fixed

Root cause: the service TOML had exec = "hero_inspector_server serve", but the binary's clap CLI defines no serve subcommand. Clap rejected the unrecognized argument and exited immediately, and hero_proc's retry on every failed start produced the crash loop.

Fix: removed serve argument from both server and UI exec lines in hero_services/services/hero_inspector.toml.
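The change amounts to a one-word edit per exec line. A sketch of the before/after in hero_services/services/hero_inspector.toml; the table name shown here is an assumption, only the exec values come from the issue:

```toml
# Before: clap has no `serve` subcommand, so the binary errors out at startup
# [service.hero_inspector_server]
# exec = "hero_inspector_server serve"

# After: invoke the binary with no arguments
[service.hero_inspector_server]
exec = "hero_inspector_server"
```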

Result: 122/124 smoke tests pass, 0 failures (was 4 failures).

Signed-off-by: mik-tf
