service manager in router #90

Open
opened 2026-05-07 15:46:09 +00:00 by despiegk · 8 comments
Owner

Story: Replace Nu/Bash/Make Service Scripts with Hero Service Manager

Purpose

We need to replace the current collection of Nu shell scripts, Bash scripts, and Makefiles with one Rust-based service management server, following the Hero architecture.

The goal is to make service lifecycle management reproducible, typed, inspectable, and centrally controlled.


Goal

Build a Hero Service Manager that can:

  • build services
  • compile binaries
  • install services
  • configure services
  • start/stop/restart services
  • check service health
  • manage logs
  • manage sockets
  • manage runtime state
  • replace existing Nu scripts, Bash scripts, and Makefiles

Functional Requirements

  • The system must provide one central Rust-based server for service management.
  • The server must expose a clear API for build, install, start, stop, restart, status, and health operations.
  • Each service must have a declarative service definition.
  • Service definitions must describe source location, build command, binary name, install target, runtime command, environment variables, sockets, logs, and dependencies.
  • The system must support compiling Rust services.
  • The system must support installing compiled binaries into the correct Hero runtime directories.
  • The system must support starting services through the Hero process manager model.
  • The system must support stopping and restarting services safely.
  • The system must expose service status and health information.
  • The system must support dependency ordering between services.
  • The system must support idempotent operations, meaning running the same install/build/start action multiple times should be safe.
  • The system must keep logs for every operation.
  • The system must provide a CLI client for developers and automation.
  • The system must provide an API usable by Hero Admin UI.
  • The system must eventually replace all Nu shell scripts.
  • The system must eventually remove the need for Makefiles and Bash scripts.

Service Definition Requirements

Each service definition should include:

  • service name
  • service type
  • repository path
  • build profile
  • binary output path
  • install path
  • runtime user
  • environment variables
  • socket configuration
  • log path
  • health check endpoint
  • dependencies
  • restart policy
  • version or checksum information

Implement this as code (Rust, as part of the _server).


Core Operations

  • service.build
  • service.install
  • service.start
  • service.stop
  • service.restart
  • service.status
  • service.health
  • service.logs
  • service.list
  • service.upgrade
  • service.verify

Deliverables

  • Rust-based Hero Service Manager server.
  • API for Hero Admin UI integration.
  • Migration plan from Nu scripts to service definitions.
  • First batch of migrated services.
  • Removal plan for Makefiles and Bash scripts.
  • Logging and status dashboard support.
  • Documentation for developers.
Owner
  • Implement this as another socket endpoint (OpenRPC); it can be hosted by the same router binary.
  • Use a separate domain for the AI to work with.
Owner

Implementation plan — Hero Service Manager inside hero_router

Aligning before coding. PR will land on development_mik_service_manager off development. Squash-merge only with explicit go-ahead.

Architecture

New third socket under $HERO_SOCKET_DIR/hero_router/:

hero_router/
├── rpc.sock        ← existing: router.* JSON-RPC
├── ui.sock         ← existing: admin dashboard
└── service.sock    ← NEW: service.* JSON-RPC (own OpenRPC domain)

service.sock serves POST /rpc, GET /openrpc.json (separate spec), GET /health, GET /.well-known/heroservice.json (distinct service_id so the scanner indexes it as its own service).

No new crate — single binary stays single. New module tree:

crates/hero_router/src/service_manager/
├── mod.rs           ← public re-exports + ServiceManager state
├── definition.rs    ← ServiceDefinition + typed extension points
├── registry.rs      ← static Vec<ServiceDefinition>
├── error.rs         ← ServiceError enum
├── build.rs         ← cargo-build + forge-release-download dispatcher
├── install.rs       ← binary placement, ELF verify, freshness check
├── lifecycle.rs     ← start/stop/restart wrapping hero_proc_sdk
├── health.rs        ← HTTP health probe (reuses crate::probe::fetch_openrpc)
├── inspect.rs       ← composite "inspect" + "troubleshoot" responses
├── ops_log.rs       ← in-memory ring buffer of recent operations
├── rpc.rs           ← service.* JSON-RPC dispatcher
└── services/
    ├── mod.rs
    ├── hero_db.rs
    └── hero_books.rs

ServiceDefinition — pure data with typed extension points

Custom per-service behavior is encoded as enum-typed extension points, not trait impls — so the agent can serialize and reason about every service uniformly.

pub struct ServiceDefinition {
    pub name: &'static str,
    pub forge_loc: &'static str,
    pub description: &'static str,
    pub binaries: &'static [&'static str],
    pub actions: &'static [ActionDef],
    pub bind: BindStrategy,
    pub extras: &'static [Extra],
    pub install: InstallPolicy,
    pub depends_on: &'static [&'static str],
    pub health: HealthSpec,
    pub timing: TimingPolicy,
}

pub enum BindStrategy { Loopback, Mycelium { fallback_loopback_port: u16 }, ExplicitAddr(&'static str) }
pub enum Extra { OnnxRuntime { version: &'static str, ci_target: &'static str } /* future */ }
pub enum InstallPolicy { SourceOnly, DownloadOnly { asset_suffix: &'static str }, Either { asset_suffix: &'static str } }
pub enum ArgSource { Static(&'static str), Resolved(Resolver) }
pub enum EnvSource { Static(&'static str), FromHeroProcSecret(&'static str), Resolved(Resolver) }
pub enum Resolver { PortFromEnv(&'static str, u16 /* default */), MyceliumAddr /* … */ }
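The depends_on field implies a start order. A minimal sketch of how that order could be derived, assuming a depth-first topological sort; `Def` is a hypothetical, trimmed-down stand-in for ServiceDefinition, and `start_order` is an illustrative name, not a function from the PR:

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for `ServiceDefinition`: only the fields needed for ordering.
pub struct Def {
    pub name: &'static str,
    pub depends_on: &'static [&'static str],
}

/// Depth-first visit: push a service only after all of its dependencies.
fn visit(
    name: &'static str,
    by_name: &HashMap<&'static str, &Def>,
    done: &mut HashSet<&'static str>,
    in_progress: &mut HashSet<&'static str>,
    order: &mut Vec<&'static str>,
) -> Result<(), String> {
    if done.contains(name) {
        return Ok(());
    }
    // Seeing a name again while it is still being visited means a cycle.
    if !in_progress.insert(name) {
        return Err(format!("dependency cycle through {}", name));
    }
    let def = match by_name.get(name) {
        Some(d) => *d,
        None => return Err(format!("unknown service {}", name)),
    };
    for &dep in def.depends_on {
        visit(dep, by_name, done, in_progress, order)?;
    }
    in_progress.remove(name);
    done.insert(name);
    order.push(name);
    Ok(())
}

/// Returns service names so that every service appears after its dependencies.
/// Fails on an unknown dependency or a dependency cycle.
pub fn start_order(defs: &[Def]) -> Result<Vec<&'static str>, String> {
    let by_name: HashMap<&'static str, &Def> = defs.iter().map(|d| (d.name, d)).collect();
    let mut done = HashSet::new();
    let mut in_progress = HashSet::new();
    let mut order = Vec::with_capacity(defs.len());
    for d in defs {
        visit(d.name, &by_name, &mut done, &mut in_progress, &mut order)?;
    }
    Ok(order)
}
```

Because the definitions are plain data, the same ordering could be computed by the agent from a serialized registry; whether the engine orders starts this way is not confirmed by the plan.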

Extra::OnnxRuntime and BindStrategy::Mycelium compile in but are stubbed for now — exercised in follow-up PRs (hero_voice / hero_router itself).
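To illustrate how a typed resolver might be evaluated, here is a sketch of Resolver::PortFromEnv with its default, assuming the engine reads the named variable and falls back to the default when it is unset. The `resolve` function and the injected `lookup` closure are assumptions for illustration; MyceliumAddr is left as a stub, matching the note above:

```rust
/// Mirrors the two `Resolver` variants shown in the plan.
pub enum Resolver {
    PortFromEnv(&'static str, u16 /* default */),
    MyceliumAddr,
}

/// Resolves a value to a string. `lookup` abstracts the environment so the
/// logic can be tested without touching process-wide state (an assumption of
/// this sketch, not something the plan specifies).
pub fn resolve(r: &Resolver, lookup: impl Fn(&str) -> Option<String>) -> Result<String, String> {
    match r {
        Resolver::PortFromEnv(var, default) => match lookup(*var) {
            // Variable unset: fall back to the declared default.
            None => Ok(default.to_string()),
            // Variable set: it must parse as a valid u16 port.
            Some(raw) => raw
                .trim()
                .parse::<u16>()
                .map(|p| p.to_string())
                .map_err(|e| format!("{} is not a valid port: {}", var, e)),
        },
        Resolver::MyceliumAddr => Err("MyceliumAddr is stubbed in this sketch".to_string()),
    }
}
```

In real use the closure would likely be `|k| std::env::var(k).ok()`. Rejecting a malformed value instead of silently using the default is a design choice of this sketch.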

RPC surface (service.sock, namespace service.*)

| Method | Params | Result |
|---|---|---|
| service.list | (none) | [{name, status, version_installed, healthy}] |
| service.inspect | {name} | full ServiceDefinition + runtime status + last 5 ops |
| service.status | {name} | {state, uptime_ms, restart_count} (delegates to hp.service_status) |
| service.health | {name} | HTTP probe of declared endpoint |
| service.build | {name, mode, version?} | {installed_paths, duration_ms} |
| service.install | {name, mode, version?, reset?} | same as build |
| service.start | {name, reset?, version?} | {started_at} |
| service.stop | {name} | {stopped_at} |
| service.restart | {name} | {restarted_at} |
| service.delete | {name, purge_binaries?} | unregisters + optionally removes binaries |
| service.upgrade | {name, version?} | install → restart |
| service.verify | {name} | binary present + ELF valid + fresh-vs-registration |
| service.logs | {name, lines?} | recent lines via hp.logs_tail |
| service.troubleshoot | {name} | composite: status + why-blocked + failed jobs + last 30 log lines |

OpenRPC spec hand-written in crates/hero_router/static/service_manager.openrpc.json — same approach as the existing static/openrpc.json.
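As an illustration of the wire format, the following sketch builds the JSON-RPC 2.0 request body a client might POST to /rpc on service.sock for service.start. It uses only the standard library so it stays self-contained; `start_request` and `json_escape` are hypothetical helper names, and a real client would more plausibly use a JSON library:

```rust
/// Minimal JSON string escaping: quotes, backslashes, and control characters.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds a `service.start` request; `reset` and `version` are optional,
/// matching the `{name, reset?, version?}` params in the table above.
pub fn start_request(id: u64, name: &str, reset: Option<bool>, version: Option<&str>) -> String {
    let mut params = format!("\"name\":\"{}\"", json_escape(name));
    if let Some(r) = reset {
        params.push_str(&format!(",\"reset\":{}", r));
    }
    if let Some(v) = version {
        params.push_str(&format!(",\"version\":\"{}\"", json_escape(v)));
    }
    format!(
        "{{\"jsonrpc\":\"2.0\",\"id\":{},\"method\":\"service.start\",\"params\":{{{}}}}}",
        id, params
    )
}
```

Optional params are omitted rather than sent as null; whether the dispatcher accepts explicit nulls is not stated in the plan.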

Reuse from existing crates

  • hero_proc_sdk::HeroProcFactory — start_service / stop_service / restart_service + service_status / service_list / logs_tail / job_list
  • hero_proc_sdk::ServiceBuilder / ActionBuilder — convert ServiceDefinition → ServiceBuildResult
  • hero_proc_sdk::socket::{socket_base_dir, service_socket_dir}
  • crate::probe::fetch_openrpc for service.health
  • crate::log_bridge for tracing → herolib_core file logger
  • Mirror server/rpc.rs::dispatch shape (envelope helpers ~30 lines duplicated; not abstracting yet)

Source vs download (both first-class)

InstallPolicy::Either { asset_suffix: "linux-amd64-musl" } is the default.

  • Source: cargo build --release under $CARGO_TARGET_DIR/hero_service_manager/<name>/src/, copy the named binaries into ~/hero/bin/; on failure, report the last 200 lines of build output.
  • Download: mirrors svc_install_download (lib.nu:585) — resolve tag (/api/v1/repos/<forge_loc>/releases/latest), fetch each <bin>-<asset_suffix>, ELF-verify (\x7fELF magic), chmod +x, touch mtime fix (lib.nu:619), uses FORGEJO_TOKEN if present.

Freshness check (svc_verify_binaries_fresh from lib.nu:784) → install::verify_fresh().
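The download steps above (ELF check, then chmod +x, then placement) could look roughly like the sketch below. The write-to-temporary-file-then-rename step and the names `is_elf` and `place_binary` are assumptions of this sketch; the mtime touch and the freshness check are left out because their exact semantics live in lib.nu:

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// True if the buffer starts with the 4-byte ELF magic `\x7fELF`.
pub fn is_elf(bytes: &[u8]) -> bool {
    bytes.starts_with(b"\x7fELF")
}

/// Verifies the payload, writes it to a temporary file in `bin_dir`, marks it
/// executable, then renames it into place so readers never see a partial file.
pub fn place_binary(bytes: &[u8], bin_dir: &Path, name: &str) -> io::Result<PathBuf> {
    if !is_elf(bytes) {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("{} is not an ELF binary", name),
        ));
    }
    fs::create_dir_all(bin_dir)?;
    // Same directory as the destination, so the rename stays on one filesystem.
    let tmp = bin_dir.join(format!(".{}.tmp", name));
    let dest = bin_dir.join(name);
    {
        let mut f = fs::File::create(&tmp)?;
        f.write_all(bytes)?;
        f.sync_all()?;
    }
    #[cfg(unix)]
    {
        use std::os::unix::fs::PermissionsExt;
        // Equivalent of `chmod +x` with a conventional 0755 mode.
        fs::set_permissions(&tmp, fs::Permissions::from_mode(0o755))?;
    }
    fs::rename(&tmp, &dest)?;
    Ok(dest)
}
```

Checking only the magic bytes catches HTML error pages or truncated downloads saved under a binary name; it does not validate the architecture or the rest of the ELF header.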

CLI subcommand

hero_router service <op> [name] [flags] — same code path as the agent (connects to local service.sock over UDS).

hero_router service list
hero_router service inspect hero_books
hero_router service start hero_books --version v0.4.2
hero_router service troubleshoot hero_books
hero_router service install hero_books --mode download

First-PR scope: 2 services

hero_db and hero_books — both server+ui pairs, no mycelium quirks, no ONNX. Together they exercise every code path without pulling in extras that have not been designed yet.

Out of scope this PR: hero_router (mycelium auto-detect), hero_voice / hero_embedder / hero_editor (ONNX overlay), hero_proc (circular), and the other ~17. Existing nu modules keep working in parallel — this PR adds capability, doesn't remove anything.

Tests

  • Unit: definition.rs (to_proc_service_build_result), build.rs (tag-resolution URL, ELF magic, asset-name composition).
  • Integration: tests/service_manager_e2e.rs — tempdir socket + stub hero_proc, drives service.list/start/status/stop against a fake echo service.
  • Manual smoke: cargo run --bin hero_router -- service list prints the two registered services after starting hero_proc locally.
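The build.rs unit tests listed above could take roughly the following shape. The helper names `latest_release_url` and `asset_name`, the `forge_base` parameter, and the example host are hypothetical; only the URL path and the `<bin>-<asset_suffix>` pattern come from the plan:

```rust
/// Composes the "latest release" API URL for a repository on the forge.
/// `forge_base` (scheme + host) is an assumed parameter of this sketch.
pub fn latest_release_url(forge_base: &str, forge_loc: &str) -> String {
    format!(
        "{}/api/v1/repos/{}/releases/latest",
        forge_base.trim_end_matches('/'),
        forge_loc.trim_matches('/')
    )
}

/// Composes a release asset name as `<bin>-<asset_suffix>`.
pub fn asset_name(bin: &str, asset_suffix: &str) -> String {
    format!("{}-{}", bin, asset_suffix)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn url_tolerates_stray_slashes() {
        assert_eq!(
            latest_release_url("https://forge.example/", "/org/repo/"),
            "https://forge.example/api/v1/repos/org/repo/releases/latest"
        );
    }

    #[test]
    fn asset_name_joins_with_dash() {
        assert_eq!(
            asset_name("hero_db_server", "linux-amd64-musl"),
            "hero_db_server-linux-amd64-musl"
        );
    }
}
```

Keeping these as pure string functions means they can be tested without network access, leaving the actual fetch to the integration test.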

Verification

  1. Workspace gate: cargo fmt --check && cargo clippy --workspace --all-targets -- -D warnings && cargo build --workspace --release.
  2. Smoke: hero_proc + hero_router → curl --unix-socket .../service.sock http://x/openrpc.json; hero_router service list.
  3. Install: service install hero_db --mode download --version latest.
  4. Lifecycle: start → status running → health ok → stop.
  5. Troubleshoot: misconfigure → composite output.
  6. Existing surface (rpc.sock, ui.sock) unchanged — covered by cargo test --workspace.
  7. No deploy to herodemo this PR — heroci validation in a follow-up session.

Files modified / created

Modified:

  • crates/hero_router/src/main.rs — bind third socket, service CLI subcommand
  • crates/hero_router/src/lib.rs — re-export service_manager
  • crates/hero_router/Cargo.toml — hero_proc_sdk dep already present; reuse hyperlocal + hyper for HTTPS download (no new dep)
  • crates/hero_router/CLAUDE.md — document third socket

Created (16 files):

  • service_manager/{mod,definition,registry,error,build,install,lifecycle,health,inspect,ops_log,rpc}.rs
  • service_manager/services/{mod,hero_db,hero_books}.rs
  • static/service_manager.openrpc.json
  • tests/service_manager_e2e.rs

Out-of-scope follow-ups (separate issues filed after merge)

  • Port remaining ~20 services (one issue per group: simple servers, mycelium-binding, ONNX, hero_proc itself).
  • Admin UI dashboard pane on ui.sock (/services page).
  • Removal plan for hero_skills/nutools/modules/services/*.nu once all ports complete.
  • Bash/Make removal (separate issues per script).

Starting implementation now on development_mik_service_manager.

Owner

Status update — code-complete, operationally unverified

PR #91 lands the framework + all 33 service ports + documentation. Workspace gate green (fmt + clippy -D warnings + release build + 125 tests). However: zero of this has been run against a live hero_proc. Closing this META is gated on a smoke session that proves the manager actually drives services end-to-end.

This comment captures everything needed to finish the work.


What's in PR #91 (verified to compile, not to work)

Framework under crates/hero_router/src/service_manager/:

  • definition.rs — ServiceDefinition + typed extension points (BindStrategy, Extra, InstallPolicy, ArgSource, EnvSource, Resolver, HealthSpec, TimingPolicy)
  • registry.rs — compile-time registry over services::all()
  • build.rs — source (cargo build --release) + download (Forgejo Releases via curl) dispatcher
  • install.rs — atomic-rename binary placement, ELF magic verify, freshness check
  • lifecycle.rs — hero_proc_sdk wrappers (start/stop/restart/status/list)
  • health.rs — HTTP probe over UDS via hyperlocal
  • inspect.rs — inspect (def + status + last 5 ops) and troubleshoot (+ log tail) composites
  • ops_log.rs — per-service ring buffer (32 entries cap)
  • rpc.rs — JSON-RPC 2.0 dispatcher + Axum router for service.sock
  • error.rs — ServiceError enum with stable JSON-RPC error codes (-32001..-32009)
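A per-service ring buffer like the one described for ops_log.rs can be sketched with a VecDeque per service name. The 32-entry cap comes from the list above; the `OpRecord` fields and the `record` / `last` method names are assumptions for illustration, not the actual types in the PR:

```rust
use std::collections::{HashMap, VecDeque};

/// Cap per service, as stated for ops_log.rs.
pub const OPS_CAP: usize = 32;

/// Hypothetical record shape; the real fields are not shown in the issue.
#[derive(Debug, Clone, PartialEq)]
pub struct OpRecord {
    pub op: String,
    pub ok: bool,
    pub duration_ms: u64,
}

#[derive(Default)]
pub struct OpsLog {
    per_service: HashMap<String, VecDeque<OpRecord>>,
}

impl OpsLog {
    /// Appends a record, evicting the oldest one once the cap is reached.
    pub fn record(&mut self, service: &str, rec: OpRecord) {
        let buf = self
            .per_service
            .entry(service.to_string())
            .or_insert_with(VecDeque::new);
        if buf.len() == OPS_CAP {
            buf.pop_front();
        }
        buf.push_back(rec);
    }

    /// Returns up to `n` most recent records, oldest first.
    pub fn last(&self, service: &str, n: usize) -> Vec<OpRecord> {
        match self.per_service.get(service) {
            Some(buf) => buf
                .iter()
                .skip(buf.len().saturating_sub(n))
                .cloned()
                .collect(),
            None => Vec::new(),
        }
    }
}
```

With this shape, the "last 5 ops" in inspect would be `log.last(name, 5)`. Since the buffer is in memory only, the history is lost when the router restarts.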

Socket layout: third UDS service.sock alongside existing rpc.sock and ui.sock, separate OpenRPC domain (static/service_manager.openrpc.json), separate service_id so the router scanner indexes it as a distinct service.

RPC methods (14): service.list / inspect / status / health / build / install / start / stop / restart / delete / upgrade / verify / logs / troubleshoot + rpc.discover + rpc.health.

CLI: hero_router service <op> connects to service.sock over UDS — same dispatcher path agents use.

Service ports (33): every service_<name>.nu module under hero_skills/nutools/modules/services/ is represented as a Rust file under services/:

hero_agent, hero_aibroker, hero_biz, hero_books, hero_browser,
hero_claude, hero_code, code_indexer, hero_codescalers, hero_collab,
hero_compute, hero_db, hero_editor, hero_embedder, hero_foundry,
hero_indexer, hero_livekit, hero_logic, hero_mail, hero_matrixchat,
hero_office, hero_os, hero_osis, hero_planner, hero_proxy,
hero_router, hero_runner_rhai, hero_shrimp, hero_slides, hero_voice,
hero_wallet, hero_whiteboard, mycelium

Documented exclusions (in services/mod.rs):

  • hero_proc — circular: manager is a hero_proc client.
  • hero_onlyoffice — Docker container; engine doesn't issue docker run actions yet.
  • hero_do — installer-only nu module, no daemon.
  • service_core.nu — empty meta-module.

Documentation (#90 deliverables):

  • crates/hero_router/docs/service_manager/README.md — developer guide
  • crates/hero_router/docs/service_manager/migration.md — 4-phase nu → manager
  • crates/hero_router/docs/service_manager/removal.md — bottom-up deletion order

What is NOT verified (the important part)

The framework has never been run against a live hero_proc. None of these have happened:

  • hero_router service list against a running router — output unverified
  • hero_router service install hero_db --mode download --version latest — forge fetch + ELF verify + freshness check pipeline never exercised against real Forgejo URLs
  • hero_router service install hero_db --mode source — cargo build --release shell-out never run against a real checkout
  • hero_router service start hero_db — hero_proc_sdk action-spec translation may be subtly wrong; would surface here
  • hero_router service health hero_db — UDS HTTP probe path untested against a real service
  • hero_router service troubleshoot hero_db — composite output never inspected on a real misconfigured service
  • hero_router service stop hero_db then service status hero_db showing exited
  • Heroci deploy and validation
  • herodemo deploy

Per-service translation accuracy is unproven. I read 33 nu modules and translated them into Rust by hand. Almost certainly some have bugs (wrong env key, missing arg, wrong socket subdir). The only way to find them is to run them.


Known structural gaps (data model declares, engine doesn't honor)

These are intentionally deferred — the data field is correct so the agent can reason about the intent, but the engine treats them as no-ops:

| Gap | Service(s) | What's missing |
|---|---|---|
| LiveKit credential bootstrap | hero_collab | Extra::LiveKit variant + auto-bootstrap of hero_livekit |
| ONNX Runtime auto-install | hero_voice, hero_embedder, hero_editor | Extra::OnnxRuntime declared; install-side support TODO |
| Cascade mother (--split) | hero_aibroker | Second ServiceDefinition sharing binaries[] |
| Multi-instance suffix | hero_codescalers, mycelium | Action/socket templating per instance |
| Mycelium IPv6 auto-detect | hero_router | BindStrategy::Mycelium engine support via mycelium_sdk |
| Docker-container actions | hero_onlyoffice | New action interpreter beyond exec |
| FromHeroProcSecret resolver | (forward-looking) | Currently stubbed to empty; canonical pattern is daemon-side runtime resolution |

How to actually finish this issue

Step 1 — Local smoke (≤30 min)

On the workstation, after source ~/hero/cfg/env/env.sh:

# Bring up the supervisor
service_proc start --reset

# Build + start the manager (hero_router with the new service.sock)
cd ~/Documents/temp/hero_work/lhumina_code/hero_router
cargo build --release -p hero_router
./target/release/hero_router start

# Confirm the third socket is bound
ls -la ~/hero/var/sockets/hero_router/
# expect: rpc.sock, ui.sock, service.sock

# Discover via OpenRPC
curl --unix-socket ~/hero/var/sockets/hero_router/service.sock \
     http://x/openrpc.json | jq '.info.title, (.methods | length)'
# expect: "Hero Service Manager", 16

# List the registry
./target/release/hero_router service list
# expect: 33 entries; installed=true for any binary already in ~/hero/bin

# Walk one service end-to-end (pick hero_db — has a published musl release)
./target/release/hero_router service install hero_db --mode download --version latest
ls -la ~/hero/bin/hero_db ~/hero/bin/hero_db_server ~/hero/bin/hero_db_ui
file ~/hero/bin/hero_db_server                              # expect: ELF 64-bit LSB pie executable

./target/release/hero_router service start hero_db
./target/release/hero_router service status hero_db         # expect state: running
./target/release/hero_router service health hero_db         # expect ok: true
./target/release/hero_router service stop hero_db
./target/release/hero_router service status hero_db         # expect state: exited

# Negative path: troubleshoot output on a service that isn't installed
./target/release/hero_router service troubleshoot hero_browser
# inspect the composite output for usefulness

If any of these fail, that's the bug list to fix before closing the issue.

Step 2 — Walk ≥3 services with different shapes

Pick services exercising different code paths:

  • hero_db (musl, RESP TCP port in kill_other) — covers download path + port reclaim
  • hero_books (gnu, env wiring for HERO_BOOKS_DATA + HERO_EMBEDDER_URL) — covers Resolver::HeroHomePath + Resolver::SocketPath
  • hero_browser (musl, plain server+ui) — sanity baseline

Each: install → start → health → stop. File a fix-up commit for any translation bugs found.

Step 3 — Heroci validation

Per feedback_no_direct_push_except_hero_demo.md, this is an L2 PR change and needs explicit go-ahead. Once given:

# On heroci VM
service_router stop
service_router install --update
service_router start
hero_router service list                    # expect 33 entries
hero_router service install hero_db --mode download --version latest
hero_router service start hero_db
hero_router service health hero_db

Step 4 — Catalog per-service translation bugs

Run hero_router service install <name> --mode download for each service that has a published release. Any failure is either:

  • A translation bug in services/<name>.rs (wrong action shape) — fix in PR.
  • A documented structural gap (LiveKit, ONNX, cascade, etc.) — leave as-is, ensure the doc comment in the file lists it as a Limitation:.

Track the verified-vs-broken matrix in a follow-up comment here.

Step 5 — Close this META

Only when:

  • Local smoke green for ≥3 services
  • Heroci smoke green for the same set
  • Translation-bug fixes committed
  • Every remaining service either passes smoke or has a tracked follow-up issue for its structural gap

Follow-up issues to file before closing

These are explicitly out of scope for this PR and should each become their own tracked issue:

  1. hero_router#9X — LiveKit auto-bootstrap (Extra::LiveKit)
  2. hero_router#9X — ONNX Runtime auto-install (Extra::OnnxRuntime engine support)
  3. hero_router#9X — BindStrategy::Mycelium engine support
  4. hero_router#9X — Multi-instance suffix support (action/socket templating)
  5. hero_router#9X — Cascade --split mother/child as separate ServiceDefinition
  6. hero_router#9X — Docker-action interpreter for hero_onlyoffice
  7. hero_router#9X — Admin UI dashboard pane on ui.sock (driven by service.sock)
  8. hero_router#9X — service.deploy composite (build + install + restart + verify)
  9. hero_router#9X — Per-service nu module deletions (one PR per service after verified parity)

Each follow-up references this META.


TL;DR

PR #91 is code-complete: framework, 33 ports, docs, gate green. Not operationally validated — zero live hero_proc runs. Steps above (1–5) are the path to closing this issue. Realistically: one focused 30-60 min smoke session covers steps 1–2; heroci validation is one more session.

Owner

Direction change — switching from data-schema to code-per-service

Per Kristof's feedback (paraphrased):

not sure this is the right way — this is a model on top of hero_proc — too complicated — I would just get AI to write code to use the hero_proc SDK + primitives in hero_lib/os, just like we do with nushell — metadata means it needs to be always in 100% same way, makes for less flexibility

He's right. PR #91 built a ServiceDefinition schema + interpreter on top of hero_proc_sdk — adding a meta-layer where one wasn't needed. The 6 "documented gaps" (LiveKit auto-bootstrap, ONNX install, cascade --split, multi-instance, mycelium auto-detect, Docker actions) are all cases where a service didn't fit the schema, exactly the rigidity he called out.

New direction

Each service is Rust code (a small module with install / start / stop / health functions) that calls hero_proc_sdk::ServiceBuilder + ActionBuilder directly, plus shared helpers ported from the existing nu lib.nu. No schema, no interpreter. Same model as the existing nu modules, just in Rust.

What survives from #91

Roughly 60-65% of the PR is reusable:

  • Third UDS service.sock + Axum router + JSON-RPC dispatcher (the call surface is correct; only the handler bodies change)
  • service.* method namespace + OpenRPC spec + CLI subcommand (hero_router service <op>)
  • ops_log ring buffer
  • install.rs / build.rs / health.rs helpers — these are exactly the "primitives" Kristof referenced; refactored into service_manager::lib
  • All three docs (README.md, migration.md, removal.md) — mostly valid; updated to describe the trait-not-schema model
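For reference, the `ops_log` ring buffer that carries over is conceptually just a bounded queue per service. A minimal sketch, assuming a generic entry type and a configurable cap; `OpsLog`, `push`, and `last_n` are illustrative names, not the actual `ops_log.rs` API:

```rust
use std::collections::VecDeque;

/// Bounded per-service operation log: once full, the oldest entry is dropped.
pub struct OpsLog<T> {
    cap: usize,
    entries: VecDeque<T>,
}

impl<T> OpsLog<T> {
    /// `cap` must be non-zero (e.g. 32 entries per service).
    pub fn new(cap: usize) -> Self {
        assert!(cap > 0, "ring buffer capacity must be non-zero");
        Self { cap, entries: VecDeque::with_capacity(cap) }
    }

    /// Append an entry, evicting the oldest one when the buffer is full.
    pub fn push(&mut self, entry: T) {
        if self.entries.len() == self.cap {
            self.entries.pop_front();
        }
        self.entries.push_back(entry);
    }

    /// The most recent `n` entries, oldest first (what an `inspect` call would show).
    pub fn last_n(&self, n: usize) -> Vec<&T> {
        let skip = self.entries.len().saturating_sub(n);
        self.entries.iter().skip(skip).collect()
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    pub fn is_empty(&self) -> bool {
        self.entries.is_empty()
    }
}
```

Nothing in this structure depends on the schema, which is why it survives the pivot unchanged.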

What gets thrown out

  • definition.rs — ServiceDefinition struct + 7 typed-extension-point enums + the interpreter to_proc_service_build_result() method (~400 lines)
  • The struct-literal version of each per-service file (33 files)
  • The generic lifecycle.rs interpreter

What gets rewritten

Each services/<name>.rs becomes a HeroService trait impl that calls hero_proc_sdk directly. Volume is similar (~50 lines per service); flexibility is much higher — every nu-module wrinkle (LiveKit, ONNX, cascade, multi-instance, mycelium) becomes "just write the code in this start() method", not "extend the schema".

pub struct HeroDb;

#[async_trait]
impl HeroService for HeroDb {
    fn name(&self) -> &'static str { "hero_db" }
    fn forge_loc(&self) -> &'static str { "lhumina_code/hero_db" }
    fn binaries(&self) -> &'static [&'static str] { &["hero_db", "hero_db_server", "hero_db_ui"] }

    async fn install(&self, ctx: &Ctx, opts: InstallOpts) -> Result<InstallOutcome> {
        match opts.mode {
            BuildMode::Download => lib::install_download(
                ctx, self.forge_loc(), self.binaries(),
                &opts.version, "x86_64-unknown-linux-musl",
            ).await,
            BuildMode::Source => lib::install_source(ctx, self.forge_loc(), self.binaries()).await,
        }
    }

    async fn start(&self, ctx: &Ctx) -> Result<DateTime<Utc>> {
        let bin = lib::hero_bin_dir();
        let server = ActionBuilder::new(
                "hero_db_server",
                bin.join("hero_db_server").to_string_lossy(),
            )
            .interpreter("exec")
            .is_process()
            .stop_signal("SIGTERM")
            .stop_timeout_ms(10_000)
            .env("RUST_LOG", "info")
            .build();
        // ... elided: `ui` action, kill_other, health_checks (the `ui` value used below is built here) ...
        let svc = ServiceBuilder::new(self.name())
            .description("Hero DB — encrypted Redis-backed store")
            .action(server)
            .action(ui)
            .build();
        ctx.hp().await?.start_service(self.name(), svc, 60).await?;
        Ok(Utc::now())
    }

    // stop / health have default impls that cover the common case;
    // services with quirks override them.
}

For a service with a quirk (LiveKit auto-bootstrap, ONNX install, cascade variant), the quirk lives inline in that service's install() or start() body — no schema extension needed.
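To make the "default impls, override for quirks" idea concrete without pulling in `hero_proc_sdk`, here is a deliberately simplified, synchronous sketch. The `Service` trait, `start_plan`, `all`, and `plan_for` are hypothetical stand-ins for the real `HeroService` trait and registry, and the step strings are placeholders rather than real actions.

```rust
/// Simplified stand-in for the HeroService trait (hypothetical, std only).
pub trait Service {
    fn name(&self) -> &'static str;

    /// Default start plan: just launch the server binary.
    /// Services with quirks override this method instead of extending a schema.
    fn start_plan(&self) -> Vec<String> {
        vec![format!("exec {}_server", self.name())]
    }

    /// Default stop timeout, mirroring the 10_000 ms used in the example above.
    fn stop_timeout_ms(&self) -> u64 {
        10_000
    }
}

pub struct HeroDb;
impl Service for HeroDb {
    fn name(&self) -> &'static str {
        "hero_db"
    }
    // No overrides: the defaults cover the common case.
}

pub struct HeroCollab;
impl Service for HeroCollab {
    fn name(&self) -> &'static str {
        "hero_collab"
    }

    // The quirk lives inline: an extra step before the normal launch.
    fn start_plan(&self) -> Vec<String> {
        let mut plan = vec!["bootstrap livekit credentials".to_string()];
        plan.push(format!("exec {}_server", self.name()));
        plan
    }
}

/// Registry: the engine only knows the trait, never a concrete service.
pub fn all() -> Vec<Box<dyn Service>> {
    vec![Box::new(HeroDb), Box::new(HeroCollab)]
}

/// Trait dispatch by name; `None` for an unregistered service.
pub fn plan_for(name: &str) -> Option<Vec<String>> {
    all().into_iter().find(|s| s.name() == name).map(|s| s.start_plan())
}
```

The point of the design is visible in `plan_for`: it contains no per-service branching, so adding a quirk never touches the dispatcher.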

Plan

PR #91 is being closed; v2 is in flight on development_mik_service_manager_v2 off development. Same scope (33 services + framework + docs) but the per-service files are Rust code, not data literals. New PR will follow.

Closing checklist remains the same as my previous comment — operational verification (live hero_proc smoke + heroci validation + per-service translation accuracy check) is what gates closing this META, regardless of v1 vs v2.

Owner

v2 status — code-not-data design landed in PR #92

#91 is closed; #92 is the live PR. The pivot rationale is in comment 30721 — Kristof's redirection captured.

What v2 changes from v1

| | v1 (#91, closed) | v2 (#92, open) |
|---|---|---|
| Per-service shape | pub const DEF: ServiceDefinition (data + 7 enums) | pub struct X; impl HeroService for X (Rust code) |
| Quirk handling | New enum variant in the schema | Just write Rust in start()/install() |
| Engine | Generic interpreter over ServiceDefinition data | Trait dispatch with default impls; per-service code is the real impl |
| Helpers | Split build.rs / install.rs / lifecycle.rs / health.rs | Consolidated service_manager::lib (Rust port of nu lib.nu) |
| Adding a service | One file with a struct literal | One file with ~30-100 lines of Rust |
| LoC | ~2451 (PR #91) | ~5577 (PR #92); bigger because each service is real code, not a config blob |

The v2 framework matches the existing nu-modules-under-hero_skills pattern exactly, just in Rust. Kristof's "metadata = less flexibility" concern is structurally addressed: the engine has zero per-service knowledge, and every quirk is just code in the service's file.

What survived the pivot

About 60% of v1 carried over unchanged:

  • service.sock + Axum router + JSON-RPC dispatcher
  • service.* method namespace + OpenRPC spec + CLI subcommand
  • ops_log ring buffer
  • Stable JSON-RPC error codes
  • All three docs (README.md / migration.md / removal.md) — README rewritten to describe trait-not-schema
  • The helper primitives — refactored from 4 files into one lib.rs
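On the stable JSON-RPC error codes: the idea is that each error kind maps to one fixed code that clients and agents can match on. The sketch below assumes the -32001..-32009 range used by v1; the variant names are invented for illustration and are not the real `ServiceError` variants.

```rust
/// Hypothetical error kinds; only the code range is taken from the v1 design.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ServiceError {
    UnknownService,
    NotInstalled,
    BuildFailed,
    InstallFailed,
    StartFailed,
    StopFailed,
    HealthFailed,
    Timeout,
    Internal,
}

impl ServiceError {
    /// Every variant, in code order.
    pub const ALL: [ServiceError; 9] = [
        ServiceError::UnknownService,
        ServiceError::NotInstalled,
        ServiceError::BuildFailed,
        ServiceError::InstallFailed,
        ServiceError::StartFailed,
        ServiceError::StopFailed,
        ServiceError::HealthFailed,
        ServiceError::Timeout,
        ServiceError::Internal,
    ];

    /// Stable JSON-RPC error code. Codes are written out explicitly so that
    /// reordering variants can never silently renumber them.
    pub fn code(self) -> i32 {
        match self {
            ServiceError::UnknownService => -32001,
            ServiceError::NotInstalled => -32002,
            ServiceError::BuildFailed => -32003,
            ServiceError::InstallFailed => -32004,
            ServiceError::StartFailed => -32005,
            ServiceError::StopFailed => -32006,
            ServiceError::HealthFailed => -32007,
            ServiceError::Timeout => -32008,
            ServiceError::Internal => -32009,
        }
    }

    /// Reverse lookup for clients decoding an error response.
    pub fn from_code(code: i32) -> Option<Self> {
        Self::ALL.iter().copied().find(|e| e.code() == code)
    }
}
```

Because the mapping lives in the error type and not in the per-service code, it is independent of the schema-vs-trait decision.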

Coverage — same 33 services

hero_agent, hero_aibroker, hero_biz, hero_books, hero_browser,
hero_claude, hero_code, code_indexer, hero_codescalers, hero_collab,
hero_compute, hero_db, hero_editor, hero_embedder, hero_foundry,
hero_indexer, hero_livekit, hero_logic, hero_mail, hero_matrixchat,
hero_office, hero_os, hero_osis, hero_planner, hero_proxy,
hero_router, hero_runner_rhai, hero_shrimp, hero_slides, hero_voice,
hero_wallet, hero_whiteboard, mycelium

Documented exclusions unchanged: hero_proc (circular), hero_onlyoffice (Docker), hero_do (installer-only), service_core.nu (empty).

Verification status

What's verified (CI-level):

  • cargo fmt --check
  • cargo clippy -p hero_router --all-targets -- -D warnings
  • cargo build -p hero_router --release
  • cargo test -p hero_router — 119 tests (9 dispatcher e2e + 5 lib unit + 105 pre-existing)
  • Existing CLI surface unchanged — every pre-existing subcommand still parses (list, scan, spec, markdown, html, add, remove, start, stop, access)
  • All 105 pre-existing hero_router tests still pass — no regressions

What's NOT verified (operational):

  • Live hero_proc + hero_router run
  • Real install of any service via the manager
  • Lifecycle (start / health / stop) against any real service
  • heroci validation
  • Per-service translation accuracy audit (read each services/<name>.rs against its service_<name>.nu source-of-truth)

Closing checklist (unchanged)

The path to closing this META is the same regardless of v1 vs v2:

  1. Local smoke (≤30 min): service_proc start → start hero_router → walk hero_db / hero_books / hero_browser through install --mode download → start → health → stop
  2. Translation-bug catalog: any of the 33 service ports that fail in step 1 → fix the services/<name>.rs impl
  3. Heroci validation in a follow-up session per feedback_no_direct_push_except_hero_demo.md
  4. Per-service follow-up issues for the documented limitations (LiveKit auto-bootstrap, ONNX install, cascade --split, multi-instance, mycelium auto-detect, Docker-action support) — file before closing this META
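The per-service walk in step 1 is mechanical enough to script. A rough sketch that builds the four CLI invocations and runs them in order; `smoke_commands` and `run_smoke` are hypothetical helpers, and the runner assumes the CLI exits non-zero on failure, which has not been confirmed in this thread.

```rust
use std::io;
use std::process::Command;

/// Build the argv for each step of the smoke walk:
/// install (download mode) -> start -> health -> stop.
pub fn smoke_commands(service: &str) -> Vec<Vec<String>> {
    fn svc(args: &[&str]) -> Vec<String> {
        let mut argv = vec!["hero_router".to_string(), "service".to_string()];
        argv.extend(args.iter().map(|s| s.to_string()));
        argv
    }

    vec![
        svc(&["install", service, "--mode", "download", "--version", "latest"]),
        svc(&["start", service]),
        svc(&["health", service]),
        svc(&["stop", service]),
    ]
}

/// Run the steps in order and stop at the first failing one.
/// Returns Ok(None) when all steps succeed, or Ok(Some(argv)) for the step
/// that failed. Assumes a non-zero exit status signals failure.
pub fn run_smoke(service: &str) -> io::Result<Option<Vec<String>>> {
    for argv in smoke_commands(service) {
        let status = Command::new(&argv[0]).args(&argv[1..]).status()?;
        if !status.success() {
            return Ok(Some(argv));
        }
    }
    Ok(None)
}
```

Calling `run_smoke` for hero_db, hero_books, and hero_browser in turn would produce the failing step for each service, which feeds directly into the translation-bug catalog in step 2.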

Attempting local smoke on workstation now.

Owner

Local smoke — manager drives a real service end-to-end (with one translation bug found and fixed)

Ran v2 (PR #92) against live hero_proc_server on the workstation. Full lifecycle proven: install → start → health → stop → status.

Trace

$ ~/hero/bin/hero_proc_server &              # supervisor up
$ hero_router --port 0 &                     # third socket bound: rpc.sock + ui.sock + service.sock

$ hero_router service list | jq length
33                                            # all 33 services in registry

$ curl --unix-socket .../service.sock http://x/openrpc.json | jq '.info.title, (.methods|length)'
"Hero Service Manager"
16                                            # OpenRPC discovery works

$ hero_router service install hero_db --mode download --version latest
{
  "duration_ms": 3489,
  "installed_paths": [
    "/home/pctwo/hero/bin/hero_db",
    "/home/pctwo/hero/bin/hero_db_server",
    "/home/pctwo/hero/bin/hero_db_ui"
  ],
  "mode": "download",
  "version": "v0.3.2"                          # forge fetch + ELF verify + freshness — all green
}

$ hero_router service start hero_db
{ "started_at": "2026-05-07T18:31:10.353Z" }

$ hero_router service health hero_db
{
  "body": "{\"service\":\"hero_db_server\",\"status\":\"ok\",\"version\":\"0.3.2\"}",
  "latency_ms": 0,
  "ok": true,
  "status": 200                                # 🎉 real /health response from a real running service
}

$ hero_router service status hero_db | jq .state
"running"

$ hero_router service stop hero_db
{ "stopped_at": "2026-05-07T18:31:22.103Z" }

$ hero_router service status hero_db | jq .state
"exited"

Translation bug caught + fixed

The first attempt failed the health probe, which exposed a real bug:

The v0.3.2 hero_db_server binary binds sockets directly under $HERO_SOCKET_DIR/ with flat naming (hero_db_server.sock / hero_db_resp.sock / hero_db_ui.sock), NOT under a per-service subdirectory like hero_db/rpc.sock as I'd coded (and as the upstream nu module's kill_other paths also incorrectly listed).

Fixed in PR #92 by overriding health() to probe the actual socket name and updating kill_other.socket to match what the binary actually binds. The fix is the kind of change v2's "code-not-data" architecture makes trivial — just edit the Rust function body, no schema extension needed.

This is exactly the value of running real smoke tests vs. trusting a translated-from-nu schema: the unit tests + clippy + release build all passed before this fix, but the binary's actual runtime behavior diverged from what both the nu module and my v2 port assumed.
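The difference between the two socket layouts is easy to see side by side. A small sketch, assuming only what is described above (a per-service subdirectory versus flat `<binary>.sock` naming); the helper names `nested_socket` and `flat_socket` are made up for illustration and are not functions from PR #92.

```rust
use std::path::{Path, PathBuf};

/// Layout the port originally assumed: <socket_dir>/<service>/<file>,
/// e.g. <socket_dir>/hero_db/rpc.sock.
pub fn nested_socket(socket_dir: &Path, service: &str, file: &str) -> PathBuf {
    socket_dir.join(service).join(file)
}

/// Layout the v0.3.2 hero_db binaries actually use: <socket_dir>/<binary>.sock,
/// e.g. <socket_dir>/hero_db_server.sock.
pub fn flat_socket(socket_dir: &Path, binary: &str) -> PathBuf {
    socket_dir.join(format!("{}.sock", binary))
}
```

A health override for hero_db would then probe `flat_socket(dir, "hero_db_server")` where the default would have probed `nested_socket(dir, "hero_db", "rpc.sock")`. Nothing in the type system distinguishes the two, which is why only a live run caught the mismatch.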

Verified surface

  • service.sock binds at startup (third socket alongside rpc/ui)
  • rpc.discover returns the OpenRPC document (16 methods)
  • service.list returns all 33 services with correct installed / registered flags
  • service.inspect returns identity + binaries + status
  • service.install --mode download --version latest against real Forgejo release
  • ELF verify catches non-ELF; freshness check catches stale mtime
  • service.start registers with hero_proc + starts; sockets get bound
  • service.health does live UDS probe and returns the body + status
  • service.status reflects hero_proc supervisor state (running / exited)
  • service.stop clean shutdown
  • service.troubleshoot composite with status + recent_ops + log_tail
  • Per-op ring buffer captures install + start operations
  • Existing hero_router features unaffected (scanner indexes the new socket as a distinct service)
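For anyone unfamiliar with the ELF verify step: an ELF file starts with the four bytes `0x7F 'E' 'L' 'F'`, so a cheap header check is enough to reject something like an HTML error page saved in place of a binary. A minimal sketch of that check; the function names are illustrative and not necessarily what the install helper uses.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// The ELF magic number: 0x7F followed by "ELF".
pub const ELF_MAGIC: [u8; 4] = [0x7f, b'E', b'L', b'F'];

/// True when the byte slice begins with the ELF magic.
pub fn has_elf_magic(bytes: &[u8]) -> bool {
    bytes.starts_with(&ELF_MAGIC)
}

/// Read the first four bytes of a file and check them against the ELF magic.
/// A file shorter than four bytes is reported as "not ELF", not as an error.
pub fn is_elf_file(path: &Path) -> io::Result<bool> {
    let mut header = [0u8; 4];
    let mut file = File::open(path)?;
    match file.read_exact(&mut header) {
        Ok(()) => Ok(header == ELF_MAGIC),
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => Ok(false),
        Err(e) => Err(e),
    }
}
```

This only proves the file is an ELF object, not that it targets the right architecture or libc, so it complements the later `start` + `health` steps and does not replace them.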

Still-not-tested

  • The other 32 service ports — almost certainly some have the same kind of socket-path or env translation bug that hero_db had. The fix path is mechanical (run smoke, fix start(), re-run) but each one is its own audit.
  • Source-build path (--mode source)
  • Heroci validation
  • Per-service quirks (LiveKit bootstrap, ONNX install, cascade --split, multi-instance, mycelium auto-detect, Docker)

Updated closing checklist

  1. Local smoke for hero_db: done
  2. Audit the other 32 service ports against their actual binary behavior (smoke each, fix start() / health() overrides)
  3. Heroci validation
  4. File the 7 follow-up issues for the documented Limitations

The framework is operationally validated. What remains is per-service translation accuracy — every additional service that smokes green moves us closer to closing this META.

Owner

Aligned to hero_skills@371138f convention (_ui → _admin)

Per direction: don't chase per-binary smoke fixes. Mirror the upstream nu modules' canonical naming. Done in commit 92a947a on PR #92:

align(service_manager): _ui → _admin per hero_skills@371138f convention
30 files changed, 166 insertions(+), 202 deletions(-)

What changed

Sweeping rename mirroring hero_skills@371138f:

  • Binaries: hero_<service>_ui → hero_<service>_admin (and consequently in binaries[], ActionBuilder names, kill_other.socket paths, health probes)
  • Sockets: <service>/ui.sock → <service>/admin.sock
  • Action names: <service>_ui → <service>_admin

Affected: 29 of 33 service files.
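
The rename rule itself is small enough to sketch. The helpers below are hypothetical (they are not taken from PR #92) and only illustrate the suffix mapping; they apply the rule blindly, so services that keep `_web` or an unchanged `ui.sock` would still need explicit exemptions.

```rust
/// Hypothetical helper: map a pre-rename binary or action name to the
/// post-rename convention. A trailing `_ui` becomes `_admin`; any other
/// name is returned unchanged.
fn to_admin_name(name: &str) -> String {
    match name.strip_suffix("_ui") {
        Some(stem) => format!("{}_admin", stem),
        None => name.to_string(),
    }
}

/// Hypothetical helper: same rule for socket paths.
/// `<service>/ui.sock` becomes `<service>/admin.sock`.
fn to_admin_socket(path: &str) -> String {
    match path.strip_suffix("/ui.sock") {
        Some(dir) => format!("{}/admin.sock", dir),
        None => path.to_string(),
    }
}
```

For example, `to_admin_name("hero_db_ui")` yields `hero_db_admin`, while `hero_db_server` passes through untouched.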

Special cases

| Service | Note |
|---|---|
| `hero_collab` | server + web (canonical is `_web`, not `_admin`); reverted my prior smoke-driven misfix |
| `hero_shrimp` | server + web + admin (3 actions, not 2) |
| `hero_planner` | server (bare name) + admin + web |
| `hero_db` | reverted the flat-socket convention I introduced; canonical is subdir (`hero_db/admin.sock`) |
| `hero_mail` | `hero_mail_cli` → `hero_mail`; `_ui` → `_admin` |
| `hero_books` | deduped binaries (the rename collapses `_ui` and the dev `_admin` binary into one) |
| `hero_router` | unchanged; its own `ui.sock` is the router's admin dashboard socket, not a managed service |
| `hero_code` | unchanged; uses `_web` (no `_admin` in canonical) |

Trade-off

Published Forgejo releases for some services still ship the pre-rename _ui asset names (e.g. hero_db v0.3.2 ships hero_db_ui, not hero_db_admin). The --mode download path fails for those services until upstream cuts new releases with renamed assets. Source-build path works regardless.

This is the explicit choice: align with the canonical convention now, accept that downloads fail until releases catch up. The alternative (chase observed binary behaviour per-service) was the churn we're avoiding.
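
To see which services this trade-off currently affects, one could compare the binary names the manager expects against the asset names a release actually publishes. The function below is a sketch under that assumption; `missing_assets` is a hypothetical name, and how the asset list is fetched from Forgejo is deliberately left out.

```rust
/// Hypothetical check: return every expected binary name that the release
/// does not publish. An empty result means the download path has, by name,
/// everything it needs.
fn missing_assets(expected: &[&str], published: &[&str]) -> Vec<String> {
    expected
        .iter()
        .copied()
        .filter(|name| !published.contains(name))
        .map(String::from)
        .collect()
}
```

With the hero_db v0.3.2 example above, expecting `hero_db`, `hero_db_server`, `hero_db_admin` against published `hero_db`, `hero_db_server`, `hero_db_ui` reports `hero_db_admin` as missing, which is the signal to wait for a new release rather than edit the service file.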

Workspace gate

  • cargo fmt --check ✅
  • cargo clippy -p hero_router --all-targets -- -D warnings ✅
  • cargo build -p hero_router --release ✅
  • cargo test -p hero_router: 119 tests pass ✅
  • All 105 pre-existing hero_router tests still pass — no regressions

Closing checklist (re-stated)

The framework is canonically aligned. Remaining work to close #90:

  • Re-run smoke loop after a few services cut releases with the new _admin asset names — verify install/start/health/stop end-to-end on those
  • Heroci validation
  • File the 7 follow-up issues for documented Limitations (LiveKit auto-bootstrap, ONNX install, cascade --split, multi-instance, mycelium auto-detect, Docker actions)

PR #92 is ready for review against the new convention.

Owner

🤝 Handoff for next agent — everything you need to continue

This is a complete state dump for whoever picks this up next. Read top-to-bottom; everything below is operational, not aspirational.


TL;DR

PR #92 is open from development_mik_service_manager_v2 into development. Code is convention-aligned with hero_skills@371138f, gate green (fmt + clippy -D warnings + release build + 119 tests). Awaiting review + squash-merge OK.

PR #91 (v1, schema-based) is closed — superseded.


What this work delivered

A Rust-based Hero Service Manager inside hero_router:

  • Third UDS service.sock alongside the existing rpc.sock and ui.sock
  • 14 service.* JSON-RPC methods + own OpenRPC document
  • CLI subcommand: hero_router service <op>
  • HeroService trait with default impls; 33 per-service Rust modules use hero_proc_sdk + shared helpers directly (Kristof's "code-not-data" model — like nu modules in Rust)
  • Documentation: crates/hero_router/docs/service_manager/{README,migration,removal}.md

Architecture rationale

This work pivoted mid-session from a ServiceDefinition schema + interpreter (v1, PR #91) to free-form Rust per service (v2, PR #92). See comment 30721 for Kristof's redirection. Don't re-introduce the schema — quirks (LiveKit bootstrap, ONNX install, cascade --split) live as Rust code in each service file, not as schema extensions.
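
As a rough illustration of the "code-not-data" model, the sketch below uses a simplified, hypothetical stand-in for the `HeroService` trait. The real trait in PR #92 has its own methods and signatures; `health_socket`, `Conventional`, and `Quirky` are invented here purely to show how a default impl covers the conventional case and a per-service override handles a quirk.

```rust
/// Hypothetical, simplified stand-in for the manager's `HeroService` trait.
trait HeroService {
    /// Service name, e.g. "hero_db".
    fn name(&self) -> &str;

    /// Default impl: the canonical per-service subdirectory layout.
    fn health_socket(&self) -> String {
        format!("{}/rpc.sock", self.name())
    }
}

/// A service that follows the convention needs no extra code.
struct Conventional;

impl HeroService for Conventional {
    fn name(&self) -> &str {
        "hero_example"
    }
}

/// A service with a quirk overrides the method body; no schema change.
struct Quirky;

impl HeroService for Quirky {
    fn name(&self) -> &str {
        "hero_quirky"
    }

    fn health_socket(&self) -> String {
        // Flat naming directly under the socket directory.
        format!("{}_server.sock", self.name())
    }
}
```

The point of the design is that a quirk stays local to one service file: nothing in the trait or the other modules has to learn about it.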


Verified-working state

| Area | What was tested | Result |
|---|---|---|
| Build | `cargo fmt --check`, clippy `-D warnings`, release build, 119 tests | ✅ all green |
| Existing hero_router features | All 105 pre-existing tests still pass; CLI surface unchanged | ✅ no regressions |
| Sockets bind | `service.sock` binds at startup alongside `rpc.sock` + `ui.sock` | ✅ verified |
| OpenRPC discovery | `curl --unix-socket service.sock /openrpc.json` returns 16 methods | ✅ verified |
| `service.list` | Returns 33 services with installed/registered flags | ✅ verified |
| Real install | `service install hero_db --mode download --version latest` (BEFORE the convention alignment commit) | ✅ verified at [comment 30763](https://forge.ourworld.tf/lhumina_code/hero_router/issues/90#issuecomment-30763) |
| Real lifecycle | `start hero_db` → `health` 200 OK from real `hero_db_server v0.3.2` → `stop` | ✅ verified BEFORE the alignment |

Critical: after the convention alignment commit 92a947a, the --mode download path will fail for services whose published releases still ship the pre-rename _ui asset names. This is a known, deliberate trade-off — the code is forward-aligned, releases need to catch up. Source-build path works regardless.

Known limitations (each tracked as a follow-up issue)

| Issue | Limitation | Affected service(s) |
|---|---|---|
| #93 | LiveKit credential auto-bootstrap | hero_collab |
| #94 | ONNX Runtime auto-install | hero_voice, hero_embedder, hero_editor |
| #95 | Cascade `--split` mother variant | hero_aibroker |
| #96 | Multi-instance suffix support | hero_codescalers, mycelium |
| #97 | `BindStrategy::Mycelium` engine support | hero_router (self-bind) |
| #98 | Docker-action support | hero_onlyoffice (excluded from registry) |
| #99 | `FromHeroProcSecret` resolver (low priority) | none today |
| #100 | Admin UI dashboard pane on `ui.sock` | (UX) |
| #101 | Per-service translation accuracy audit | all 33 |
| #102 | Deletion plan execution (nu / Make / buildenv.sh) | (cleanup) |

The "Limitation:" markers in the corresponding services/<name>.rs doc comments make the per-service caveats discoverable from the source.

Documented exclusions (4 services intentionally NOT in the registry)

In services/mod.rs doc comment:

  • hero_proc — circular: the manager IS a hero_proc client.
  • hero_onlyoffice — Docker container; #98 will add support.
  • hero_do — installer-only nu module, no daemon.
  • service_core.nu — empty meta-module.

How to continue

Option A: Wait for upstream releases, then validate

  1. Watch for new releases of: hero_db, hero_books, hero_browser, hero_indexer, hero_logic, hero_matrixchat, hero_proxy, hero_slides, hero_whiteboard, hero_office, hero_osis, hero_planner, hero_voice, hero_compute, hero_embedder, hero_editor, hero_foundry, hero_runner_rhai, mycelium, hero_aibroker — they need to publish artifacts with the new _admin asset names.
  2. Once a release is out, run smoke against it (instructions below).
  3. Track results in #101.

Option B: Source-build smoke (works regardless of release naming)

  1. export CODEROOT=$HOME/Documents/temp/hero_work (or wherever you cloned).
  2. git clone https://forge.ourworld.tf/lhumina_code/<svc>.git $CODEROOT/lhumina_code/<svc> for each service to test.
  3. hero_router service install <svc> --mode source then proceed with start / health / stop.

Option C: Heroci validation

Per feedback_no_direct_push_except_hero_demo.md — needs the user's explicit go-ahead (an L2 PR change). Don't push to heroci unprompted.

How to run the smoke (reference)

```bash
# 0. env (needed for FORGEJO_TOKEN, FORGE_TOKEN, WEBROOT)
source ~/hero/cfg/env/env.sh
export FORGE_TOKEN="$FORGEJO_TOKEN"
export WEBROOT="http://127.0.0.1:9998/"

# 1. start hero_proc
nohup ~/hero/bin/hero_proc_server > /tmp/hero_proc.log 2>&1 < /dev/null &
disown

# 2. build + start the v2 hero_router
cd "$CODEROOT/lhumina_code/hero_router"
git fetch && git checkout development_mik_service_manager_v2
cargo build --release -p hero_router
nohup ./target/release/hero_router --port 0 > /tmp/hero_router.log 2>&1 < /dev/null &
disown

# 3. exercise the manager
BIN=./target/release/hero_router
$BIN service list                          # 33 services
$BIN service install <name> --mode download --version latest
$BIN service start <name>
$BIN service health <name>                  # expect ok: true, status: 200
$BIN service status <name>                  # expect state: running
$BIN service stop <name>
$BIN service troubleshoot <name>            # composite diagnostic
```
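
For the per-service audit, step 3 could be driven from a small Rust harness instead of by hand. This is only a sketch: `smoke_steps` and `run_smoke` are hypothetical names, the subcommands are the ones listed above, and it assumes the CLI exits non-zero when a step fails. It does not parse the JSON output, so "ok: true" and "state: running" still need a manual look.

```rust
use std::process::Command;

/// Convert a slice of string literals into owned arguments.
fn argv(args: &[&str]) -> Vec<String> {
    args.iter().map(|a| a.to_string()).collect()
}

/// Hypothetical: the argument lists for one smoke pass, in lifecycle order.
fn smoke_steps(service: &str) -> Vec<Vec<String>> {
    vec![
        argv(&["service", "install", service, "--mode", "download", "--version", "latest"]),
        argv(&["service", "start", service]),
        argv(&["service", "health", service]),
        argv(&["service", "status", service]),
        argv(&["service", "stop", service]),
    ]
}

/// Run each step with the given binary and stop at the first failure.
/// Assumption: a failed step is signalled by a non-zero exit status.
fn run_smoke(bin: &str, service: &str) -> std::io::Result<bool> {
    for args in smoke_steps(service) {
        let status = Command::new(bin).args(&args).status()?;
        if !status.success() {
            return Ok(false);
        }
    }
    Ok(true)
}
```

Calling `run_smoke("./target/release/hero_router", "hero_db")` would walk install → start → health → status → stop for one service; looping over the registry names gives the audit pass tracked in #101.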

If any service install/start fails with the new convention, the fix is one of:

  • The release ships pre-rename names → wait for new release (track in #101)
  • A genuine translation bug → edit services/<name>.rs::start() or health() (no schema change needed)
  • A documented Limitation → leave it, add a smoke-skip comment

What NOT to do

  • Don't reintroduce a ServiceDefinition schema or interpreter. Kristof was clear on this. Each service is code, not config.
  • Don't bake hero_proc secrets into action specs. Per hero_proc_meta, the canonical pattern is daemon-side runtime resolution. FromCallerEnv for env-passthrough from operator shell is fine.
  • Don't squash-merge PR #92 without explicit user OK per feedback_squash_merge_gate.md.
  • Don't push to herodemo for this PR — it's an L2 change. Heroci validation is the right next step (with go-ahead).
  • Don't flag-day-delete the nu modules. Coexistence per migration.md Phase 1; deletion only in Phase 4 once parity is verified.

Workstation state at handoff

Local box has the smoke session leftovers:

  • ~/hero/bin/: hero_proc, hero_proc_server, hero_proc_admin, hero_db, hero_db_server, hero_db_ui (note: pre-rename binary; would be hero_db_admin post-release-catchup)
  • ~/hero/var/sockets/: hero_router/{rpc,ui,service}.sock, hero_proc/rpc.sock, plus hero_db_*.sock leftovers
  • Running processes (will be stopped at handoff): hero_proc_server, hero_router --port 0, hero_db_server, hero_db_ui serve

Stop with:

pkill -f "hero_proc_server|target/release/hero_router|hero_db_server|hero_db_ui serve"

The ~/hero/bin/ binaries are intentionally left in place so a follow-up agent can re-test without reinstalling.


Documents to read first (in this order)

  1. PR #92 — the actual code
  2. crates/hero_router/docs/service_manager/README.md — developer guide
  3. crates/hero_router/docs/service_manager/migration.md — 4-phase nu→manager plan
  4. crates/hero_router/docs/service_manager/removal.md — bottom-up deletion order
  5. This issue's recent comments — chronological context

Closing checklist (what gates closing #90)

  • PR #92 reviewed + squash-merged (needs user OK)
  • Per-service smoke audit completed and tracked in #101 (needs upstream release catchup)
  • Heroci validation green (needs user OK)
  • At least 5 of #93-#100 either resolved or formally deferred

When all 4 above are checked, this META can close.
