lab service destructively deletes hero_proc rpc.sock on false-negative liveness probe #255

Open
opened 2026-05-15 19:43:11 +00:00 by mik-tf · 1 comment
Owner

Observed (session 95 / hero_code sweep, lab build #50469)

While bootstrapping the hero_proc → hero_db → hero_code dep chain via lab service, lab decided the running hero_proc daemon was "not running" (probe failed for an unrelated reason — screen was not installed yet, separate issue), then proceeded to:

pkill: pattern that searches for process name longer than 15 characters will result in zero matches
Try `pkill -f' option to match against the complete command line.
  removing leftover socket: /home/pctwo/hero/var/sockets/hero_proc/rpc.sock
  socket removed ok
  launching: screen -dmS hero_proc_server /home/pctwo/hero/bin/hero_proc_server
  fail … failed to run screen — is it installed?: No such file or directory (os error 2)

The "removing leftover socket" step deleted the live rpc.sock BEFORE confirming a replacement could be launched. With screen not installed, the launch failed but the socket was already gone — so existing hero_proc clients (e.g. ~/hero/bin/hero_proc service list) started failing with Connection error: No such file or directory.

lab service resetall cleanly recovered (wipes hero_proc DB + sockets + restarts), but the immediate post-cleanup state was broken until that recovery.

Also: pkill matches process names truncated to 15 chars (kernel limit). The literal pattern hero_proc_server is 16 chars and never matches by name — silent zero-kill. pkill -f hero_proc_server matches the command line; that's the right flag.
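
A minimal sketch of the length rule described above, in case it helps when patching the call site. The helper name `pkill_args` and the constant `COMM_MAX` are hypothetical and not taken from lab's code; the only assumption is that lab builds a pkill argument list from a process name.

```python
from typing import List

# Linux exposes at most 15 visible characters of a process name (comm),
# so a longer pattern can never match unless pkill -f is used.
COMM_MAX = 15


def pkill_args(pattern: str) -> List[str]:
    """Build a pkill argv, switching to -f when the name cannot match comm."""
    if len(pattern) > COMM_MAX:
        # -f matches against the full command line instead of the short name.
        return ["pkill", "-f", pattern]
    return ["pkill", pattern]


if __name__ == "__main__":
    print(pkill_args("hero_proc_server"))
    print(pkill_args("hero_proc"))
```

Note that `-f` matches the pattern anywhere in the command line, so it can also hit unrelated processes that merely mention the name (for example an editor opened on a file called hero_proc_server). A more specific pattern, such as the full binary path, may be safer.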

Suggested fixes

  1. Probe the socket via connect() (not just file existence) before deciding "not running". If connect() succeeds, a process is listening on the socket and the daemon is alive.
  2. Even after deciding "not running", verify the socket file is not owned by a live process before unlinking — e.g. lsof / fuser check, or connect()+timeout. Don't delete if owned.
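  3. Use pkill -f (long form: pkill --full) for process names longer than 15 characters, so the pattern is matched against the full command line instead of the truncated process name.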
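
Fixes 1 and 2 could look roughly like the sketch below. It is an illustration under stated assumptions, not lab's actual implementation: the names `SocketState`, `probe_socket` and `remove_if_stale` are hypothetical, and it assumes rpc.sock is a stream-type Unix domain socket. The socket file is unlinked only when connect() is refused; any ambiguous result (timeout, permission error) leaves it in place.

```python
import enum
import os
import socket


class SocketState(enum.Enum):
    ALIVE = "alive"      # connect() succeeded: something is listening
    STALE = "stale"      # file exists but connect() was refused
    MISSING = "missing"  # no file at that path
    UNKNOWN = "unknown"  # timeout, permission error, or other failure


def probe_socket(path: str, timeout: float = 1.0) -> SocketState:
    """Classify a Unix socket path by actually trying to connect to it."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        try:
            s.connect(path)
        except FileNotFoundError:
            return SocketState.MISSING
        except ConnectionRefusedError:
            return SocketState.STALE
        except OSError:
            # Covers timeouts and anything else we cannot interpret safely.
            return SocketState.UNKNOWN
        return SocketState.ALIVE


def remove_if_stale(path: str, timeout: float = 1.0) -> bool:
    """Unlink the socket file only when the probe proves nobody is listening."""
    if probe_socket(path, timeout) is SocketState.STALE:
        os.unlink(path)
        return True
    return False
```

There is still a small window between the probe and the unlink in which a daemon could start and bind the path, so a real fix may also want a lock or a pid file check around this step.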

Why this matters for the sweep

Every repo's smoke gate via lab service … --start will go through this dep-bootstrap path. A flaky liveness probe on a developer's machine becomes destructive instead of merely failing to start.

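Refs: lhumina_code/hero_proc#102 (sweep tracker, https://forge.ourworld.tf/lhumina_code/hero_proc/issues/102), lhumina_code/hero_code#15 (where this surfaced, https://forge.ourworld.tf/lhumina_code/hero_code/pulls/15).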

Author
Owner

Fix in PR #257 (https://forge.ourworld.tf/lhumina_code/hero_skills/pulls/257), awaiting squash-merge gate.

Verified under lab build #54729 during testing on the hero_proc#102 sweep:

  • lab service hero_code --start now starts both server and admin via a single invocation (8/8 smoke checks).
  • start_hero_proc no longer destroys a live hero_proc socket on false-negative liveness probe.
  • Missing screen fails fast with a clear pointer to lab install base BEFORE any state cleanup.
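
The fail-fast ordering in the last bullet can be sketched as follows. This is an assumption about the shape of the fix, not the code in PR #257: `start_with_preflight` and `MissingDependency` are hypothetical names, and `cleanup` and `launch` stand in for whatever lab does to remove old state and start the daemon.

```python
import shutil
from typing import Callable, TypeVar

T = TypeVar("T")


class MissingDependency(RuntimeError):
    """Raised when a required launcher binary is not on PATH."""


def start_with_preflight(
    launcher: str,
    cleanup: Callable[[], None],
    launch: Callable[[], T],
) -> T:
    """Check the launcher exists first, then clean up, then launch."""
    if shutil.which(launcher) is None:
        # Nothing has been touched yet, so a running daemon stays reachable.
        raise MissingDependency(
            f"{launcher} not found on PATH; install it before starting services"
        )
    cleanup()
    return launch()
```

The point of the ordering is that every check that can fail without side effects runs before the first destructive step.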
Reference
lhumina_code/hero_skills#255