Migrate hero_services to zinit 0.4.0 job model (restart + health checks) #25
Migrate hero_services to zinit 0.4.0 job model
Context
Follow-up from #24 (watchdog hotfix). Zinit 0.4.0 is already installed in the container and supports a job-based model with restart policies and periodic health checks, but hero_services_server still generates legacy TOML configs that don't use these features. We currently rely on a watchdog loop in entrypoint.sh as a band-aid (#24).
Problem
No restart-on-failure
write_service_config_with_deps() in install.rs writes old-format TOMLs. When a service crashes, zinit marks it inactive and nothing restarts it.
Dummy health checks
write_health_config_with_deps() in install.rs writes no-op health checks:
- build section: runs make health-check (target usually doesn't exist → "healthy by default")
- ports: curl probe on HTTP port (works for UI services only)
- echo "No health check configured" → always passes
All 29 services on herodev2/herodemo2 have inactive health checks.
Hung process detection missing
Even with the watchdog hotfix (#24), a process that is alive but unresponsive (e.g., stuck on an external API call) will not be detected or restarted.
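To illustrate why a liveness check is not enough: a hung process still accepts connections (or at least leaves them in the kernel backlog) but never answers, so it can only be caught by a request that is bounded by a timeout. The following is a minimal sketch, not the hero_services implementation. The function name probe_socket and the newline-delimited request/reply framing are assumptions made for illustration; the real wire protocol of the services is not described in this issue.

```rust
use std::io::{self, BufRead, BufReader, Write};
use std::os::unix::net::UnixStream;
use std::path::Path;
use std::time::Duration;

/// Hypothetical helper: sends one request line over a Unix socket and waits
/// for one reply line. Each read and write is bounded by `timeout`, so a peer
/// that is alive but never answers yields an error instead of blocking forever.
/// ASSUMPTION: newline-delimited framing; adapt to the real protocol.
pub fn probe_socket(path: &Path, request: &str, timeout: Duration) -> io::Result<String> {
    // Fails immediately if the socket file is missing or nobody listens on it.
    let mut stream = UnixStream::connect(path)?;
    stream.set_read_timeout(Some(timeout))?;
    stream.set_write_timeout(Some(timeout))?;

    stream.write_all(request.as_bytes())?;
    stream.write_all(b"\n")?;

    // A hung service never writes a reply, so this read hits the timeout.
    let mut reply = String::new();
    let bytes_read = BufReader::new(stream).read_line(&mut reply)?;
    if bytes_read == 0 {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "peer closed the connection without replying",
        ));
    }
    Ok(reply.trim_end().to_string())
}
```

A caller would treat any Err (timeout, refused connection, early close) as a failed probe. Note that the timeout applies per read or write call, not to the exchange as a whole.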
Target state
Auto-restart (every non-oneshot service)
Use the zinit 0.4.0 job model with a restart policy.
Real health checks (every socket-based service)
Periodic probe of server.health on the Unix socket, with a grace period after start (30-60 seconds) to allow initialization.
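The target state boils down to two rules (which services get auto-restart, which get a probe) plus a grace period that suppresses probe failures right after start. A minimal sketch of that decision logic follows. All type and function names here (ServiceInfo, Plan, Verdict, plan_for, evaluate) are hypothetical and are not part of zinit or zinit_sdk; in the real migration this behavior would be expressed through zinit's job configuration rather than hand-written code.

```rust
use std::time::Duration;

/// Hypothetical description of a service as hero_services might see it.
#[derive(Debug, Clone, PartialEq)]
pub struct ServiceInfo {
    pub name: String,
    pub oneshot: bool,
    /// Path of the service's Unix socket, if it exposes one.
    pub socket: Option<String>,
}

/// What the generated configuration should enable for a service.
#[derive(Debug, Clone, PartialEq)]
pub struct Plan {
    pub auto_restart: bool,
    pub health_probe: bool,
}

/// Auto-restart for every non-oneshot service; a real health probe for every
/// socket-based service.
pub fn plan_for(svc: &ServiceInfo) -> Plan {
    Plan {
        auto_restart: !svc.oneshot,
        health_probe: svc.socket.is_some(),
    }
}

/// Outcome of a single probe, taking the grace period into account.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Starting,
    Healthy,
    Unhealthy,
}

/// A failed probe only counts once the service has been up longer than `grace`.
/// ASSUMPTION: a successful probe during the grace period already counts as healthy.
pub fn evaluate(uptime: Duration, grace: Duration, probe_ok: bool) -> Verdict {
    if probe_ok {
        Verdict::Healthy
    } else if uptime < grace {
        Verdict::Starting
    } else {
        Verdict::Unhealthy
    }
}
```

With a 30-second grace period, a failed probe 5 seconds after start yields Starting, while the same failure at 31 seconds yields Unhealthy and would be the point at which a restart is warranted.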
Scope
All changes are in hero_services only; no zinit changes are needed. Zinit 0.4.0 already supports everything.
Files to modify:
- crates/hero_services_server/src/install.rs: replace write_service_config_with_deps() and write_health_config_with_deps() to use the zinit SDK
- crates/hero_services_server/src/zinit.rs: already imports zinit_sdk, extend to use ServiceBuilder + ActionBuilder
- docker/entrypoint.sh: remove the watchdog loop once this is deployed
Investigation findings
During the #24 investigation, we tested the zinit 0.4.0 job API and found:
- zinit add job on a service loaded from legacy TOML does NOT enable auto-restart (the job is accepted but doesn't change behavior)
- zinit add service + zinit add job changes process management (the service goes inactive immediately after start)
Zinit 0.4.0 job API reference
Steps
- Check how zinit reload interacts with API-created services (does reload wipe them?)
- Add write_service_via_sdk() replacing write_service_config_with_deps()
- Add write_health_via_sdk() replacing write_health_config_with_deps() with real socket probes
- Remove the watchdog loop from entrypoint.sh
Related
- hero_services/crates/hero_services_server/src/install.rs: TOML generation code
- hero_services/crates/hero_services_server/src/zinit.rs: zinit SDK integration