Migrate hero_services to zinit 0.4.0 job model (restart + health checks) #25

Open
opened 2026-03-13 12:40:22 +00:00 by mik-tf · 0 comments
Owner

Migrate hero_services to zinit 0.4.0 job model

Context

Follow-up from #24 (watchdog hotfix). Zinit 0.4.0 is already installed in the container and supports a job-based model with restart policies and periodic health checks, but hero_services_server still generates legacy TOML configs that don't use these features. We currently rely on a watchdog loop in entrypoint.sh as a band-aid (#24).

Problem

No restart-on-failure

write_service_config_with_deps() in install.rs writes old-format TOMLs:

[service]
name = "user.hero_embedder_server"
exec = "/root/hero/bin/hero_embedder_server serve"
oneshot = false

When a service crashes, zinit marks it inactive and nothing restarts it.

Dummy health checks

write_health_config_with_deps() in install.rs writes health checks that are no-ops for most services:

  • Services with build section: runs make health-check (target usually doesn't exist → "healthy by default")
  • Services with ports: curl probe on HTTP port (works for UI services only)
  • Socket-only servers (embedder, books, auth, etc.): echo "No health check configured" → always passes

All 29 services on herodev2/herodemo2 have inactive health checks.

Hung process detection missing

Even with the watchdog hotfix (#24), a process that is alive but unresponsive (e.g., stuck on an external API call) will not be detected or restarted.

Target state

Auto-restart (every non-oneshot service)

Using zinit 0.4.0 job model:

zinit add job --exec "/root/hero/bin/hero_embedder_server serve" \
  --trigger start --restart on-failure --restart-delay 5000 --max-restarts 20 \
  user.hero_embedder_server main
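
For the prototype step, the command above could be assembled from Rust before the SDK path is wired in. This is only a sketch: `StartJob`, `Restart`, `push_flag`, and `start_job_args` are hypothetical names, and the flags are taken from the CLI reference in this issue. Whether `ServiceBuilder` / `ActionBuilder` in zinit_sdk expose the same fields has not been checked here.

```rust
/// Restart policy values listed for `--restart` in this issue's CLI reference.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum Restart {
    Always,
    OnFailure,
    Never,
}

impl Restart {
    fn as_flag(self) -> &'static str {
        match self {
            Restart::Always => "always",
            Restart::OnFailure => "on-failure",
            Restart::Never => "never",
        }
    }
}

/// Hypothetical description of a long-running job with a `start` trigger.
struct StartJob<'a> {
    service: &'a str,
    job_name: &'a str,
    exec: &'a str,
    restart: Restart,
    restart_delay_ms: u64,
    max_restarts: u32,
}

/// Appends a flag and its value as two separate arguments.
fn push_flag(args: &mut Vec<String>, flag: &str, value: &str) {
    args.push(flag.to_string());
    args.push(value.to_string());
}

/// Builds the arguments for `zinit add job` in the same order as the
/// command shown above (flags first, then service and job name).
fn start_job_args(job: &StartJob) -> Vec<String> {
    let mut args = vec!["add".to_string(), "job".to_string()];
    push_flag(&mut args, "--exec", job.exec);
    push_flag(&mut args, "--trigger", "start");
    push_flag(&mut args, "--restart", job.restart.as_flag());
    push_flag(&mut args, "--restart-delay", &job.restart_delay_ms.to_string());
    push_flag(&mut args, "--max-restarts", &job.max_restarts.to_string());
    args.push(job.service.to_string());
    args.push(job.job_name.to_string());
    args
}
```

The resulting vector can be passed to `std::process::Command::new("zinit").args(...)`. Because each element becomes its own argument, the exec string does not need shell quoting.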

Real health checks (every socket-based service)

Periodic probe of server.health on the Unix socket:

zinit add job --trigger check --interval-ms 60000 --timeout-ms 30000 \
  --exec "echo '{\"jsonrpc\":\"2.0\",\"method\":\"server.health\",\"id\":1}' \
    | socat - UNIX-CONNECT:/root/hero/var/sockets/hero_embedder_server.sock" \
  user.hero_embedder_server check

Health checks should only begin after a grace period of 30-60 seconds following service start, to allow initialization.
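
The same probe could also be done without socat. Below is a minimal std-only sketch, assuming the server accepts a newline-terminated JSON-RPC request and answers with a single line; that framing is an assumption, not something verified against the hero servers. `probe_health` is a hypothetical helper, and the substring check stands in for real JSON parsing.

```rust
use std::io::{BufRead, BufReader, Write};
use std::os::unix::net::UnixStream;
use std::path::Path;
use std::time::Duration;

/// Sends a JSON-RPC `server.health` request over a Unix socket.
///
/// Returns Ok(true) when a reply line containing "result" and no "error"
/// arrives, Ok(false) for any other reply (including an empty one), and
/// Err when the connection fails or a read/write exceeds `timeout`.
fn probe_health(socket: &Path, timeout: Duration) -> std::io::Result<bool> {
    let mut stream = UnixStream::connect(socket)?;
    // Without timeouts a hung server would block the probe forever.
    stream.set_read_timeout(Some(timeout))?;
    stream.set_write_timeout(Some(timeout))?;

    // Same payload as the echo in the zinit check job above.
    stream.write_all(b"{\"jsonrpc\":\"2.0\",\"method\":\"server.health\",\"id\":1}\n")?;

    // Assumption: the reply is a single newline-terminated line.
    let mut line = String::new();
    BufReader::new(&stream).read_line(&mut line)?;

    // Crude check; a real implementation would parse the JSON reply.
    Ok(line.contains("\"result\"") && !line.contains("\"error\""))
}
```

A process that is alive but stuck shows up here as a timeout error rather than a false "healthy", which is the case the watchdog from #24 cannot detect.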

Scope

All changes are in hero_services only — no zinit changes needed. Zinit 0.4.0 already supports everything.

Files to modify:

  • crates/hero_services_server/src/install.rs — replace write_service_config_with_deps() and write_health_config_with_deps() to use zinit SDK
  • crates/hero_services_server/src/zinit.rs — already imports zinit_sdk, extend to use ServiceBuilder + ActionBuilder
  • docker/entrypoint.sh — remove watchdog loop once this is deployed

Investigation findings

During #24 investigation, we tested the zinit 0.4.0 job API and found:

  • zinit add job on a service loaded from legacy TOML does NOT enable auto-restart (job is accepted but doesn't change behavior)
  • Recreating a service via zinit add service + zinit add job changes process management (service goes inactive immediately after start)
  • Conclusion: a full migration from TOML generation to the SDK API is needed (can't just overlay jobs on legacy services)

Zinit 0.4.0 job API reference

zinit add service <name> [--persist]
zinit add job <service> <job_name> \
  --exec <cmd> \
  --trigger start|stop|check \
  --restart always|on-failure|never \
  --restart-delay <ms> \
  --max-restarts <n> \
  --interval-ms <ms> \
  --timeout-ms <ms> \
  -e KEY=VALUE \
  [--persist]
# --interval-ms and --timeout-ms apply to the check trigger

Steps

  • Understand how zinit reload interacts with API-created services (does reload wipe them?)
  • Prototype: replace TOML generation for one service (e.g., hero_embedder) with SDK calls, verify restart works
  • Implement write_service_via_sdk() replacing write_service_config_with_deps()
  • Implement write_health_via_sdk() replacing write_health_config_with_deps() with real socket probes
  • Add grace period logic (delay health check start by 30-60s after service start)
  • Test on herodev2: kill services, verify auto-restart + health check recovery
  • Remove watchdog from entrypoint.sh
  • Rebuild images, promote to herodemo2
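
The grace period step could be as small as a guard that runs before each probe. A minimal sketch, assuming hero_services records when each service was started; `grace_elapsed` is a hypothetical helper, not an existing function in install.rs or zinit.rs.

```rust
use std::time::{Duration, Instant};

/// Returns true once `grace` has elapsed since the service was started.
/// If `now` is earlier than `started_at`, the elapsed time is treated as
/// zero, so the guard keeps reporting "still in grace".
fn grace_elapsed(started_at: Instant, now: Instant, grace: Duration) -> bool {
    now.saturating_duration_since(started_at) >= grace
}
```

A health check would be skipped (and counted as neither pass nor fail) while this returns false. Passing `now` in explicitly keeps the function easy to test without sleeping.
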

Related

  • #24 (watchdog hotfix, closed and deployed)
  • #23 (Hero OS UI polish — parent)
  • hero_services/crates/hero_services_server/src/install.rs — TOML generation code
  • hero_services/crates/hero_services_server/src/zinit.rs — zinit SDK integration
Reference
lhumina_code/home#25