hero_inspector crash-loops in Docker container (hero_proc conflict) #100

Closed
opened 2026-03-27 12:35:57 +00:00 by mik-tf · 1 comment
Owner

Problem

hero_inspector_server crash-loops in the Docker container (~0.5s restart interval). When run manually it works fine — discovers 42 services, binds socket, serves requests.

The crash-loop causes 4 smoke test failures:

  • hero_inspector_server /rpc (0 methods — service not running)
  • MCP gateway (hero_redis_server) — depends on inspector
  • hero_inspector_ui status-dot missing
  • hero_inspector_ui connection-status.js not found

Root cause

hero_proc starts hero_inspector_server, but the process may exit or get killed before hero_proc detects it as healthy. hero_proc then restarts it, but the previous socket file may still exist, causing a bind conflict. The rapid restart loop (~0.5s) suggests hero_proc is retrying without enough delay.

When run manually (docker exec herolocal /root/hero/bin/hero_inspector_server), it works perfectly — discovers all 42 services, binds socket, serves RPC.

Evidence

Docker logs show repeated restarts:

job process started job_id=18 pid=1916 action=user.hero_inspector_server.run
job process started job_id=18 pid=2005 action=user.hero_inspector_server.run
...(every 0.5s)

Likely fix

  • Check hero_proc restart delay configuration for inspector
  • Ensure stale socket cleanup before rebind
  • May need --start flag support in hero_inspector_server for proper self-registration

Signed-off-by: mik-tf



Fixed

Root cause: the service TOML had exec = "hero_inspector_server serve", but the binary's clap CLI defines no serve subcommand. Clap rejected the unrecognized argument and exited immediately, and hero_proc's retry on every failed start produced the crash loop.

Fix: removed serve argument from both server and UI exec lines in hero_services/services/hero_inspector.toml.
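The change amounts to a one-word edit per exec line. A sketch of the before/after in hero_services/services/hero_inspector.toml; the table name shown here is an assumption, only the exec values come from the issue:

```toml
# Before: clap has no `serve` subcommand, so the binary errors out at startup
# [service.hero_inspector_server]
# exec = "hero_inspector_server serve"

# After: invoke the binary with no arguments
[service.hero_inspector_server]
exec = "hero_inspector_server"
```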

Result: 122/124 smoke tests pass, 0 failures (was 4 failures).

Signed-off-by: mik-tf
