[hero_codescalers] Fix duplicate instance appearing in hero_router #18

Closed
opened 2026-04-29 07:10:13 +00:00 by mahmoud · 1 comment
Owner

Problem

A second unexpected hero_codescalers instance is appearing in hero_router
dashboard. This is caused by a registration issue.

Steps to fix

  • Identify why a second instance is being registered
  • Ensure only one instance registers per service name
  • Verify multi-instance naming convention is followed:
    instance N uses socket dir hero_codescalers_serverN/

Acceptance Criteria

  • Only the expected instance(s) appear in hero_router
  • No ghost/duplicate entries after restart
## Problem A second unexpected hero_codescalers instance is appearing in hero_router dashboard. This is caused by a registration issue. ## Steps to fix - Identify why a second instance is being registered - Ensure only one instance registers per service name - Verify multi-instance naming convention is followed: instance N uses socket dir `hero_codescalers_serverN/` ## Acceptance Criteria - [ ] Only the expected instance(s) appear in hero_router - [ ] No ghost/duplicate entries after restart
Author
Owner

Implementation Spec for Issue #18

Objective

Eliminate the duplicate/ghost hero_codescalers entry in the hero_router dashboard by detecting and pruning stale legacy socket directories (hero_codescalers_server[/_N]) left over from before PR #21, and by hardening --start so a fresh start cannot leave such directories behind in the future.

Root Cause

(A) Stale-on-disk legacy directory is the dominant root cause. Ranked analysis:

  1. (A) Confirmed — stale legacy directory. Before commit 7edb4e3 (PR #16, "fix(cli): align CLI socket dir default with the server's"), the CLI's sock_dir_name(0) returned "hero_codescalers_server" and exported it via HERO_CODESCALERS_SOCK_NAME to the spawned server/UI children, causing them to bind their UDS at $HERO_SOCKET_DIR/hero_codescalers_server/{rpc,ui}.sock. After the fix, the canonical directory is hero_codescalers/ (see crates/hero_codescalers/src/main.rs:42-51). On any host that previously ran the old binary and was upgraded after PR #16/#21, $HERO_SOCKET_DIR/hero_codescalers_server/ still exists with the old socket files. hero_router scans every subdirectory of $HERO_SOCKET_DIR (per hero_sockets skill section 4 and hero_router skill probing strategy), so it discovers BOTH hero_codescalers/ (live) and hero_codescalers_server/ (ghost). The ghost shows up as Inactive (no listener) but still appears in the dashboard as a second entry. For non-zero instances the legacy names were hero_codescalers_server1, hero_codescalers_server2, etc. (no underscore between server and the number), so any of those may also be lingering.

  2. (B) Not confirmed. self_start calls hp.restart_service(&svc_name, ...) (crates/hero_codescalers/src/main.rs:406). The hero_proc service is keyed by svc_name, so calling --start repeatedly is idempotent — it replaces the existing service definition and does not register a second copy.

  3. (C) Not active. sock_dir_name(instance) (lines 42-51) yields a unique directory per instance (hero_codescalers, hero_codescalers_1, hero_codescalers_2, …). Server/UI actions for the same instance share that one directory but bind different filenames (rpc.sock, ui.sock), which is by design.

  4. (D) Cosmetic only. Both crates/hero_codescalers_server/heroservice.json and crates/hero_codescalers_ui/heroservice.json are static, embedded at compile time, and always declare "name": "hero_codescalers" — even for instance 1, instance 2, etc. hero_router keys services by their socket directory name (URL routing prefix /<service>/...), so two running instances are distinguished correctly. The static name field is only informational. Out of scope for this fix unless the dashboard renders the manifest name as the primary label.

Requirements

  • Detect any sibling socket directories under $HERO_SOCKET_DIR whose names match the legacy pattern (hero_codescalers_server, hero_codescalers_server<N>, hero_codescalers_server_<N>).
  • For each match, log a clear, actionable eprintln! warning during --start showing the offending path(s).
  • Be conservative about destructive cleanup. Default to warn-only (do not delete user data automatically). Optionally allow opt-in pruning via a flag (e.g. --prune-legacy-sockets) so an operator can confirm the action.
  • The check must run in --start only — never at runtime in the server or UI binary, and never in any subcommand path.
  • The check must be a no-op when no legacy directory exists.
  • Preserve idempotency of --start (already in place via restart_service).

Files to Modify/Create

  • crates/hero_codescalers/src/main.rs — add a legacy-directory scan/warn (and optional prune) helper, invoke it from self_start, add a CLI flag to opt into pruning.
  • README.md — add a brief "Migrating from pre-PR#16 layout" subsection under "Unix socket directories" pointing operators at the warning and the optional prune flag.
  • (No changes to server, UI, or SDK crates. Static heroservice.json files do not need to change for this fix.)

Implementation Plan

Step 1: Add a legacy-directory detector helper

Files: crates/hero_codescalers/src/main.rs
Changes:

  • Add a private function (near the other naming helpers around lines 42-98), e.g. legacy_sock_dirs(base: &str) -> Vec<PathBuf>, that:
    • Reads the entries of base = socket_base().
    • Returns every directory whose name starts with the literal "hero_codescalers_server" (covers hero_codescalers_server, hero_codescalers_server1, hero_codescalers_server_1, etc.). Be careful to use starts_with("hero_codescalers_server") AND require the next char (if any) to be either end-of-string, an underscore, or an ASCII digit, so we do not accidentally match a hypothetical future hero_codescalers_server_admin etc.
    • Excludes any current canonical directory (hero_codescalers, hero_codescalers_<N>). Since the legacy names all start with hero_codescalers_server, no canonical name will match — but document this invariant in a comment.
    • Returns absolute paths.
  • Unit-test the predicate logic (pure name-classifier function on &str) under #[cfg(test)] mod tests near the existing parse_duration_ms tests.
    Dependencies: none.

Step 2: Wire the detector into self_start (warn by default)

Files: crates/hero_codescalers/src/main.rs
Changes:

  • In self_start (line 402), before calling hp.restart_service(...):
    • Call the new helper.
    • For each match, emit a single multi-line eprintln!("warning: ...") block: name the legacy directory, explain that it is from a pre-PR#16 layout, suggest running hero_codescalers --start --prune-legacy-sockets (or rm -rf <path>) to remove it, and warn that it will appear as a ghost service in hero_router until removed.
    • Do not delete anything in the default code path. This is the conservative default per the brief.
      Dependencies: Step 1.

Step 3: Add an opt-in --prune-legacy-sockets flag

Files: crates/hero_codescalers/src/main.rs
Changes:

  • Add a #[arg(long)] prune_legacy_sockets: bool field on Cli (around lines 119-147) — global so it can pair with --start. Document it in the long help.
  • In self_start (signature change to accept the bool, plumb through from main), after the warn step, if prune_legacy_sockets == true:
    • For each detected legacy directory, attempt std::fs::remove_dir_all(path). On success, log eprintln!("removed legacy socket dir: {path}"). On failure, log a warning but continue.
  • Keep the warning-only path as default behavior.
    Dependencies: Steps 1, 2.

Step 4: Document the migration in README

Files: README.md
Changes:

  • Under the existing "Unix socket directories" section (around line 54), append a short "Migrating from pre-PR#16 layout" subsection explaining:
    • Pre-PR#16 builds used hero_codescalers_server[N]/ as the socket dir.
    • After upgrading, the legacy directory is left behind and shows up as a ghost in the hero_router dashboard.
    • The fix: re-run hero_codescalers --start (it will warn) and optionally pass --prune-legacy-sockets to clean up.
      Dependencies: Step 3 (so the flag name in the doc is accurate).

Step 5: Add a regression test

Files: crates/hero_codescalers/src/main.rs (test module)
Changes:

  • Unit-test the pure name classifier added in Step 1: assert that hero_codescalers_server, hero_codescalers_server1, hero_codescalers_server_2 are flagged as legacy; that hero_codescalers, hero_codescalers_1, hero_codescalers_99, and unrelated names like hero_router, hero_codescalers_audit are NOT flagged.
  • Optional: a tempfile-based integration test that creates a fake $HERO_SOCKET_DIR with a mix of canonical and legacy directories, runs the helper, and asserts the returned list. Skip if the crate does not already pull tempfile.
    Dependencies: Step 1.

Acceptance Criteria

  • Only the expected hero_codescalers[/_N] entry/entries appear in hero_router after running hero_codescalers --start --prune-legacy-sockets once.
  • No ghost/duplicate entries after restart on a host that previously ran a pre-PR#16 build.
  • Running hero_codescalers --start on a clean host (no legacy dir) is a no-op for the new code path — no spurious warnings.
  • Running hero_codescalers --start (without the flag) on a host with a legacy dir prints a clear, actionable warning naming the offending path(s) and exits successfully.
  • cargo test -p hero_codescalers passes (legacy-name classifier unit tests).
  • No breaking change to existing --start/--stop semantics or the restart_service idempotency.

Notes

  • This builds on top of branch development_fix_codescalers_service_name (PR #21, unmerged). Either rebase this work on top of #21 once merged, or accept that the two PRs touch overlapping regions of crates/hero_codescalers/src/main.rs and may conflict. Both fixes target the same root cause from different angles: PR #21 catches a stale HERO_CODESCALERS_SOCK_NAME env var; this fix catches a stale on-disk directory.
  • Manifest-name aliasing (cause D) is not addressed here. If a future user reports that the hero_router dashboard renders both instances under the same display label, we may need to make heroservice.json dynamic (write it from the server at startup using args.sockname instead of a static include_str!). That is a separate, larger change and is out of scope for issue #18.
  • The hero_proc action names (hero_codescalers_server[_N]) are unchanged and are not the cause of any duplication: they live inside hero_proc, not on the socket filesystem, and the hero_router dashboard does not key off them.
  • Conservative default (warn-only) is deliberate: remove_dir_all on $HERO_SOCKET_DIR/hero_codescalers_server would silently delete any orphaned socket files even if the operator is running an old version of the binary in another window. The opt-in flag puts the operator in control.
## Implementation Spec for Issue #18 ### Objective Eliminate the duplicate/ghost `hero_codescalers` entry in the hero_router dashboard by detecting and pruning stale legacy socket directories (`hero_codescalers_server[/_N]`) left over from before PR #21, and by hardening `--start` so a fresh start cannot leave such directories behind in the future. ### Root Cause **(A) Stale-on-disk legacy directory** is the dominant root cause. Ranked analysis: 1. **(A) Confirmed — stale legacy directory.** Before commit `7edb4e3` (PR #16, "fix(cli): align CLI socket dir default with the server's"), the CLI's `sock_dir_name(0)` returned `"hero_codescalers_server"` and exported it via `HERO_CODESCALERS_SOCK_NAME` to the spawned server/UI children, causing them to bind their UDS at `$HERO_SOCKET_DIR/hero_codescalers_server/{rpc,ui}.sock`. After the fix, the canonical directory is `hero_codescalers/` (see `crates/hero_codescalers/src/main.rs:42-51`). On any host that previously ran the old binary and was upgraded after PR #16/#21, `$HERO_SOCKET_DIR/hero_codescalers_server/` still exists with the old socket files. `hero_router` scans every subdirectory of `$HERO_SOCKET_DIR` (per `hero_sockets` skill section 4 and `hero_router` skill probing strategy), so it discovers BOTH `hero_codescalers/` (live) and `hero_codescalers_server/` (ghost). The ghost shows up as Inactive (no listener) but still appears in the dashboard as a second entry. For non-zero instances the legacy names were `hero_codescalers_server1`, `hero_codescalers_server2`, etc. (no underscore between `server` and the number), so any of those may also be lingering. 2. **(B) Not confirmed.** `self_start` calls `hp.restart_service(&svc_name, ...)` (`crates/hero_codescalers/src/main.rs:406`). The hero_proc service is keyed by `svc_name`, so calling `--start` repeatedly is idempotent — it replaces the existing service definition and does not register a second copy. 3. **(C) Not active.** `sock_dir_name(instance)` (lines 42-51) yields a unique directory per instance (`hero_codescalers`, `hero_codescalers_1`, `hero_codescalers_2`, …). Server/UI actions for the same instance share that one directory but bind different filenames (`rpc.sock`, `ui.sock`), which is by design. 4. **(D) Cosmetic only.** Both `crates/hero_codescalers_server/heroservice.json` and `crates/hero_codescalers_ui/heroservice.json` are static, embedded at compile time, and always declare `"name": "hero_codescalers"` — even for instance 1, instance 2, etc. `hero_router` keys services by their socket directory name (URL routing prefix `/<service>/...`), so two running instances are distinguished correctly. The static `name` field is only informational. Out of scope for this fix unless the dashboard renders the manifest name as the primary label. ### Requirements - Detect any sibling socket directories under `$HERO_SOCKET_DIR` whose names match the legacy pattern (`hero_codescalers_server`, `hero_codescalers_server<N>`, `hero_codescalers_server_<N>`). - For each match, log a clear, actionable `eprintln!` warning during `--start` showing the offending path(s). - Be conservative about destructive cleanup. Default to **warn-only** (do not delete user data automatically). Optionally allow opt-in pruning via a flag (e.g. `--prune-legacy-sockets`) so an operator can confirm the action. - The check must run in `--start` only — never at runtime in the server or UI binary, and never in any subcommand path. - The check must be a no-op when no legacy directory exists. - Preserve idempotency of `--start` (already in place via `restart_service`). ### Files to Modify/Create - `crates/hero_codescalers/src/main.rs` — add a legacy-directory scan/warn (and optional prune) helper, invoke it from `self_start`, add a CLI flag to opt into pruning. - `README.md` — add a brief "Migrating from pre-PR#16 layout" subsection under "Unix socket directories" pointing operators at the warning and the optional prune flag. - (No changes to server, UI, or SDK crates. Static `heroservice.json` files do not need to change for this fix.) ### Implementation Plan #### Step 1: Add a legacy-directory detector helper Files: `crates/hero_codescalers/src/main.rs` Changes: - Add a private function (near the other naming helpers around lines 42-98), e.g. `legacy_sock_dirs(base: &str) -> Vec<PathBuf>`, that: - Reads the entries of `base = socket_base()`. - Returns every directory whose name starts with the literal `"hero_codescalers_server"` (covers `hero_codescalers_server`, `hero_codescalers_server1`, `hero_codescalers_server_1`, etc.). Be careful to use `starts_with("hero_codescalers_server")` AND require the next char (if any) to be either end-of-string, an underscore, or an ASCII digit, so we do not accidentally match a hypothetical future `hero_codescalers_server_admin` etc. - Excludes any current canonical directory (`hero_codescalers`, `hero_codescalers_<N>`). Since the legacy names all start with `hero_codescalers_server`, no canonical name will match — but document this invariant in a comment. - Returns absolute paths. - Unit-test the predicate logic (pure name-classifier function on `&str`) under `#[cfg(test)] mod tests` near the existing `parse_duration_ms` tests. Dependencies: none. #### Step 2: Wire the detector into `self_start` (warn by default) Files: `crates/hero_codescalers/src/main.rs` Changes: - In `self_start` (line 402), before calling `hp.restart_service(...)`: - Call the new helper. - For each match, emit a single multi-line `eprintln!("warning: ...")` block: name the legacy directory, explain that it is from a pre-PR#16 layout, suggest running `hero_codescalers --start --prune-legacy-sockets` (or `rm -rf <path>`) to remove it, and warn that it will appear as a ghost service in `hero_router` until removed. - Do **not** delete anything in the default code path. This is the conservative default per the brief. Dependencies: Step 1. #### Step 3: Add an opt-in `--prune-legacy-sockets` flag Files: `crates/hero_codescalers/src/main.rs` Changes: - Add a `#[arg(long)] prune_legacy_sockets: bool` field on `Cli` (around lines 119-147) — global so it can pair with `--start`. Document it in the long help. - In `self_start` (signature change to accept the bool, plumb through from `main`), after the warn step, if `prune_legacy_sockets == true`: - For each detected legacy directory, attempt `std::fs::remove_dir_all(path)`. On success, log `eprintln!("removed legacy socket dir: {path}")`. On failure, log a warning but continue. - Keep the warning-only path as default behavior. Dependencies: Steps 1, 2. #### Step 4: Document the migration in README Files: `README.md` Changes: - Under the existing "Unix socket directories" section (around line 54), append a short "Migrating from pre-PR#16 layout" subsection explaining: - Pre-PR#16 builds used `hero_codescalers_server[N]/` as the socket dir. - After upgrading, the legacy directory is left behind and shows up as a ghost in the hero_router dashboard. - The fix: re-run `hero_codescalers --start` (it will warn) and optionally pass `--prune-legacy-sockets` to clean up. Dependencies: Step 3 (so the flag name in the doc is accurate). #### Step 5: Add a regression test Files: `crates/hero_codescalers/src/main.rs` (test module) Changes: - Unit-test the pure name classifier added in Step 1: assert that `hero_codescalers_server`, `hero_codescalers_server1`, `hero_codescalers_server_2` are flagged as legacy; that `hero_codescalers`, `hero_codescalers_1`, `hero_codescalers_99`, and unrelated names like `hero_router`, `hero_codescalers_audit` are NOT flagged. - Optional: a `tempfile`-based integration test that creates a fake `$HERO_SOCKET_DIR` with a mix of canonical and legacy directories, runs the helper, and asserts the returned list. Skip if the crate does not already pull `tempfile`. Dependencies: Step 1. ### Acceptance Criteria - [ ] Only the expected `hero_codescalers[/_N]` entry/entries appear in `hero_router` after running `hero_codescalers --start --prune-legacy-sockets` once. - [ ] No ghost/duplicate entries after restart on a host that previously ran a pre-PR#16 build. - [ ] Running `hero_codescalers --start` on a clean host (no legacy dir) is a no-op for the new code path — no spurious warnings. - [ ] Running `hero_codescalers --start` (without the flag) on a host with a legacy dir prints a clear, actionable warning naming the offending path(s) and exits successfully. - [ ] `cargo test -p hero_codescalers` passes (legacy-name classifier unit tests). - [ ] No breaking change to existing `--start`/`--stop` semantics or the `restart_service` idempotency. ### Notes - This builds on top of branch `development_fix_codescalers_service_name` (PR #21, unmerged). Either rebase this work on top of #21 once merged, or accept that the two PRs touch overlapping regions of `crates/hero_codescalers/src/main.rs` and may conflict. Both fixes target the same root cause from different angles: PR #21 catches a stale `HERO_CODESCALERS_SOCK_NAME` env var; this fix catches a stale on-disk directory. - Manifest-name aliasing (cause D) is not addressed here. If a future user reports that the hero_router dashboard renders both instances under the same display label, we may need to make `heroservice.json` dynamic (write it from the server at startup using `args.sockname` instead of a static `include_str!`). That is a separate, larger change and is out of scope for issue #18. - The `hero_proc` action names (`hero_codescalers_server[_N]`) are unchanged and are not the cause of any duplication: they live inside hero_proc, not on the socket filesystem, and the hero_router dashboard does not key off them. - Conservative default (warn-only) is deliberate: `remove_dir_all` on `$HERO_SOCKET_DIR/hero_codescalers_server` would silently delete any orphaned socket files even if the operator is running an old version of the binary in another window. The opt-in flag puts the operator in control.
Sign in to join this conversation.
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_codescalers#18
No description provided.