reliability service #104
Reference
lhumina_code/hero_proc#104
Hero Proc Service Reliability Notes
We need to do deeper research and cleanup around how service works.
A service is a combination of actions. Each action represents one thing that needs to be done to keep the service alive.
Anything marked as a process must be highly reliable. The service runner should keep trying until the service is alive and healthy.
Actions and Singleton Behavior
An action can have a singleton property.
When an action is marked as singleton, it means there is a PID file somewhere for that action. We need to verify that singleton handling is implemented correctly per action.
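The lifecycle should be: check the PID file, kill the old process, clean up its sockets and ports, restart, and keep retrying.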
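A minimal sketch of the first lifecycle step, deciding what a singleton PID file means before anything is killed or restarted. The names SingletonState, parse_pid and singleton_state are hypothetical, and the liveness check is passed in as a closure (an assumption made so the sketch stays std-only); it is not the hero_proc implementation.

```rust
use std::fs;
use std::path::Path;

/// What the PID file tells us about a singleton action (hypothetical type).
#[derive(Debug, PartialEq)]
enum SingletonState {
    /// No usable PID file: nothing to kill, just start.
    NoPidFile,
    /// PID file names a process that is gone: clean up, then start.
    Stale(u32),
    /// PID file names a live process: kill it first, then clean up and start.
    Running(u32),
}

/// Parse the contents of a PID file; zero and garbage are rejected.
fn parse_pid(contents: &str) -> Option<u32> {
    contents.trim().parse::<u32>().ok().filter(|pid| *pid > 0)
}

/// Classify the PID file. `is_alive` is injected by the caller (assumption),
/// for example a check based on signal 0 or on /proc.
fn singleton_state(pid_file: &Path, is_alive: impl Fn(u32) -> bool) -> SingletonState {
    // A missing or unreadable file is treated the same as an empty one.
    let contents = fs::read_to_string(pid_file).unwrap_or_default();
    match parse_pid(&contents) {
        None => SingletonState::NoPidFile,
        Some(pid) if is_alive(pid) => SingletonState::Running(pid),
        Some(pid) => SingletonState::Stale(pid),
    }
}
```

Returning a state instead of acting directly keeps the kill and cleanup decisions in one place in the runner.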
Action Metadata
Sockets and TCP ports should already be defined as part of the action itself. That means the action should explicitly know which sockets and which TCP ports it owns.
This allows the service runner to be much more robust.
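As an illustration, action metadata carrying its own sockets and ports could look like the trimmed-down struct below. The field types follow the spec later in this issue (sockets: Vec<String>, tcp_ports: Vec<u16>); the name field, the declares_resources helper and the reduced field set are assumptions, since the real ActionSpec has more fields.

```rust
/// Trimmed-down, hypothetical view of an action's metadata.
#[derive(Debug, Clone, Default, PartialEq)]
struct ActionSpec {
    name: String,
    singleton: bool,
    /// Unix domain socket paths the action is expected to own.
    sockets: Vec<String>,
    /// TCP ports the action is expected to own.
    tcp_ports: Vec<u16>,
}

impl ActionSpec {
    /// True when the runner has something to inspect before starting.
    fn declares_resources(&self) -> bool {
        !self.sockets.is_empty() || !self.tcp_ports.is_empty()
    }
}
```

Because both lists default to empty, an action that never sets them behaves as before.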
Socket and Port Cleanup
Before starting an action, we should inspect the sockets and ports defined on the action.
If sockets exist, we can determine whether another process is using them.
The runner should then stop any conflicting holder and remove stale socket files before starting the action.
This is important when restarting hero_proc itself. After a restart, the service runner should be able to recover the full service state and go through the same cleanup/start cycle again.
It can only clean up other processes if kill_others or singleton is set (setting both is fine).
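The check-then-decide flow for a single TCP port can be sketched as follows. Probing by trying to bind is an assumption chosen to keep the example std-only (the spec below talks about lsof-based helpers instead), and PortPlan, tcp_port_in_use, may_clean_up_others and plan_for_port are hypothetical names.

```rust
use std::net::TcpListener;

/// Probe a local TCP port by trying to bind it. A failed bind is treated as
/// "in use"; this is a heuristic, since a bind can also fail for other
/// reasons such as missing permissions.
fn tcp_port_in_use(port: u16) -> bool {
    TcpListener::bind(("127.0.0.1", port)).is_err()
}

/// Rule from the notes: other holders may only be cleaned up when the action
/// is a singleton or has kill_others set (both together is fine).
fn may_clean_up_others(singleton: bool, kill_others: bool) -> bool {
    singleton || kill_others
}

/// What the runner would do about one declared port (hypothetical type).
#[derive(Debug, PartialEq)]
enum PortPlan {
    /// Nobody holds the port: start right away.
    Free,
    /// Someone holds it and we are allowed to stop them.
    Reclaim,
    /// Someone holds it and we are not allowed to touch them.
    Blocked,
}

fn plan_for_port(port: u16, singleton: bool, kill_others: bool) -> PortPlan {
    if !tcp_port_in_use(port) {
        PortPlan::Free
    } else if may_clean_up_others(singleton, kill_others) {
        PortPlan::Reclaim
    } else {
        PortPlan::Blocked
    }
}
```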
Runs and Jobs
Each service run should keep working within the same run context.
If an action fails, we should not keep creating new jobs forever. Instead, we should reuse the same job and track how many times it was restarted.
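A small sketch of restarting in place: the job keeps its id, a counter is bumped, and only transient state is reset. The counter uses the restart_count name proposed in this issue; the pid and exit_code fields and the record_restart method are hypothetical stand-ins for whatever transient state the real Job carries.

```rust
/// Hypothetical, reduced job record.
#[derive(Debug, Clone, PartialEq)]
struct Job {
    id: u32,
    restart_count: u32,
    pid: Option<u32>,
    exit_code: Option<i32>,
}

impl Job {
    fn new(id: u32) -> Self {
        Job { id, restart_count: 0, pid: None, exit_code: None }
    }

    /// Restart in place: same id, counter bumped, transient state cleared.
    fn record_restart(&mut self, new_pid: u32) {
        self.restart_count = self.restart_count.saturating_add(1);
        self.pid = Some(new_pid);
        self.exit_code = None;
    }
}
```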
Add a new field to the job model: restart_count.
This is clearer than occurrence.
Each time the binary or script inside the action is restarted, increment restart_count.
Logging
Each job should have one stable job ID.
When the job restarts, we should not create endless new job logs.
Instead, keep the same job ID, rotate or clear the previous running log, and record each restart in restart_count.
This avoids the current mess where we get too many job blocks and makes service behavior much easier to understand.
Main Goal
Make services self-healing.
A service should continuously try to keep its actions alive, clean up stale state, reuse the same job identity, and restart reliably until the action succeeds.
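The self-healing loop can be sketched as a plain retry loop. The attempt budget and fixed delay are assumptions added so the example terminates (the notes ask for retrying until the action is alive, and backoff is listed as deferred), and keep_alive is a hypothetical name.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Keep calling `start_and_check` until it reports a healthy action or the
/// attempt budget runs out. The closure receives the attempt index, starting
/// at 0. On success the index of the successful attempt is returned, which
/// equals the number of restarts that were needed.
fn keep_alive<F>(mut start_and_check: F, max_attempts: u32, delay: Duration) -> Option<u32>
where
    F: FnMut(u32) -> bool,
{
    for attempt in 0..max_attempts {
        if start_and_check(attempt) {
            return Some(attempt);
        }
        // Wait before the next cleanup/start cycle.
        sleep(delay);
    }
    None
}
```

In a real runner the closure would run the cleanup step, spawn the action, and probe its health.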
Implementation Spec for Issue #104 — Reliability Service
Objective
Make hero_proc services self-healing by giving every action explicit knowledge of the sockets and TCP ports it owns, making singleton lifecycle deterministic (PID file → kill old → cleanup sockets/ports → restart → keep retrying), and making restart accounting visible by adding a restart_count field on Job. This first slice keeps the same job identity across restarts and rotates its running log instead of creating an endless stream of new job records.
Requirements
- ActionSpec exposes the sockets and TCP ports an action is expected to own.
- Pre-spawn cleanup (when singleton or kill_other is set) stops conflicting holders before spawning.
- Job gains restart_count: u32 persisted in SQLite (kept alongside attempt).
- The supervisor reuses the existing job row for an is_process action that needs to be respawned (no new SQL row, no new logs prefix); restart_count is incremented and the prior "running" log block is rotated/cleared.
- Functional tests cover UDS/TCP-port reclaim and a stable job id with restart_count increment.
Files to Modify/Create
- crates/hero_proc_server/src/db/actions/model.rs: extend ActionSpec with sockets: Vec<String> and tcp_ports: Vec<u16> (defaulted, backwards-compatible serde). Update Default, validation, and the unit tests at the bottom of the file. The new fields describe what the action owns, while KillOther keeps describing what to reap.
- crates/hero_proc_server/src/db/jobs/model.rs: add restart_count: u32 to Job and JobSummary, update Default, to_summary, schema and additive migration ALTER TABLE jobs ADD COLUMN restart_count INTEGER NOT NULL DEFAULT 0, insert_job, update_job, SELECT lists, and row_to_job.
- crates/hero_proc_server/src/pid/mod.rs: add force_kill_owners(sockets: &[String], tcp_ports: &[u16]) -> Vec<u32> and cleanup_stale_sockets(sockets: &[String]).
- crates/hero_proc_server/src/process.rs: add helpers pids_holding_tcp_port(port: u16) -> Vec<u32> and pids_holding_unix_socket(path: &Path) -> Vec<u32> (lsof-based, Linux/macOS gated).
- crates/hero_proc_server/src/supervisor/executor.rs:
  - Right before the kill_other block, call prepare_action_resources(&job), which removes stale UDS files and, when singleton || kill_other.is_some(), reclaims live owners via pid::force_kill_owners.
  - For is_process jobs: bump restart_count, keep phase = Retrying, and rotate the running log.
- crates/hero_proc_server/src/supervisor/mod.rs: make autostart_process_jobs reuse the existing job row when a process action needs respawning (no new Job). Update list_process_jobs_needing_restart and list_is_process_terminal_jobs accordingly.
- crates/hero_proc_server/src/logging/store.rs: add rotate_job_logs(job_id: u32, restart_count: u32) to archive or clear prior running entries before respawn.
- crates/hero_proc_server/openrpc.json + the two checked-in generated client mirrors: extend Job and ActionSpec schemas.
- crates/hero_proc_app/src/types.rs: mirror restart_count, sockets, tcp_ports.
- crates/hero_proc_test/src/tests/functional/singleton.rs: case for UDS/TCP-port reclaim on respawn.
- crates/hero_proc_test/src/tests/functional/uc_31_34_action_cascade_process.rs: case for stable job id + restart_count increment.
Implementation Plan
Step 1: Extend models
Files: crates/hero_proc_server/src/db/actions/model.rs, crates/hero_proc_server/src/db/jobs/model.rs, crates/hero_proc_app/src/types.rs
- Add sockets, tcp_ports to ActionSpec (#[serde(default, skip_serializing_if = "Vec::is_empty")]).
- Add restart_count: u32 to Job / JobSummary (#[serde(default)]).
- Update to_summary, schema, additive migration, every INSERT/UPDATE and row_to_job.
- Unit test: restart_count = 3 round-trip.
Dependencies: none.
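One way to keep the migration additive and safe to re-run is to look at the existing columns before issuing the ALTER TABLE. The sketch below only plans the statements; the ALTER TABLE text is the one quoted in this spec, while pending_job_migrations and the idea of feeding it column names (for example from SQLite's PRAGMA table_info) are assumptions, not the project's migration code.

```rust
/// Additive migration statement quoted from this spec.
const ADD_RESTART_COUNT: &str =
    "ALTER TABLE jobs ADD COLUMN restart_count INTEGER NOT NULL DEFAULT 0";

/// Given the column names currently on the jobs table, return the statements
/// that still need to run. An existing restart_count column yields nothing,
/// so planning twice is harmless.
fn pending_job_migrations(existing_columns: &[&str]) -> Vec<&'static str> {
    let already_there = existing_columns
        .iter()
        .any(|column| column.eq_ignore_ascii_case("restart_count"));
    if already_there {
        Vec::new()
    } else {
        vec![ADD_RESTART_COUNT]
    }
}
```

The NOT NULL DEFAULT 0 clause is what lets historical rows read back with a zero counter.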
Step 2: Resource-aware process helpers
Files: crates/hero_proc_server/src/process.rs, crates/hero_proc_server/src/pid/mod.rs
- pids_holding_tcp_port / pids_holding_unix_socket using lsof.
- cleanup_stale_sockets and force_kill_owners in pid/mod.rs.
- tempfile-based unit tests.
Dependencies: Step 1.
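A sketch of what an lsof-based port helper might look like. The spec only says the helpers are lsof-based; the exact flags used here (-t for PID-only output, -n and -P to skip name lookups, -iTCP:<port> with -sTCP:LISTEN to match listeners) and the parse_lsof_pids helper are assumptions.

```rust
use std::process::Command;

/// Parse PID-only lsof output (one PID per line) into a sorted,
/// de-duplicated list. Lines that are not numbers are ignored.
fn parse_lsof_pids(stdout: &str) -> Vec<u32> {
    let mut pids: Vec<u32> = stdout
        .lines()
        .filter_map(|line| line.trim().parse().ok())
        .collect();
    pids.sort_unstable();
    pids.dedup();
    pids
}

/// PIDs listening on a local TCP port. Returns an empty list when lsof is
/// not installed or reports no holder.
fn pids_holding_tcp_port(port: u16) -> Vec<u32> {
    let output = Command::new("lsof")
        .args(["-t", "-n", "-P"])
        .arg(format!("-iTCP:{}", port))
        .arg("-sTCP:LISTEN")
        .output();
    match output {
        Ok(out) => parse_lsof_pids(&String::from_utf8_lossy(&out.stdout)),
        Err(_) => Vec::new(),
    }
}
```

Keeping the parser separate from the command invocation makes it testable without lsof being present.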
Step 3: Pre-spawn cleanup in the executor
Files: crates/hero_proc_server/src/supervisor/executor.rs
- prepare_action_resources(job: &Job) called right before kill_other.
- Live owners are reclaimed only when singleton || kill_other.is_some().
Dependencies: Steps 1, 2.
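The stale-socket half of the pre-spawn step could be sketched as below (Unix only). Treating a socket path as stale when a connection attempt is refused is an assumption of this sketch, and remove_stale_sockets is a hypothetical name. The paths it returns still have a live owner, and reclaiming those stays gated on singleton or kill_other as described above.

```rust
use std::fs;
use std::os::unix::net::UnixStream;
use std::path::Path;

/// Remove socket files nobody is listening on and return the paths that
/// still have a live owner. Any existing path that refuses a connection is
/// treated as stale, so callers should only pass paths the action owns.
fn remove_stale_sockets(sockets: &[String]) -> Vec<String> {
    let mut live = Vec::new();
    for path in sockets {
        if !Path::new(path).exists() {
            // Nothing on disk: nothing to clean.
            continue;
        }
        if UnixStream::connect(path).is_ok() {
            // Someone accepted the connection, so the socket has an owner.
            live.push(path.clone());
        } else {
            // Leftover file from a dead process; ignore removal errors.
            let _ = fs::remove_file(path);
        }
    }
    live
}
```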
Step 4: Reuse-row restart loop
Files: crates/hero_proc_server/src/supervisor/mod.rs, crates/hero_proc_server/src/supervisor/executor.rs, crates/hero_proc_server/src/db/factory.rs, crates/hero_proc_server/src/db/jobs/model.rs
- list_process_jobs_needing_restart (and underlying SQL helper) returns Retrying/Failed rows too.
- autostart_process_jobs mutates the existing row in place: reset transient state, increment restart_count, drop the duplicate-check block.
- In the executor (for is_process jobs): set Retrying, bump restart_count, reset transient fields.
Dependencies: Steps 1, 3.
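The reuse-row idea can be pictured with an in-memory stand-in for the jobs table. Only Retrying and Failed are phases named in this step; the other Phase variants, the reduced Job struct, and the choice to mark a respawned row as Running are assumptions of this sketch.

```rust
/// Hypothetical phase set; only Retrying and Failed come from this spec.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Phase {
    Running,
    Succeeded,
    Retrying,
    Failed,
}

/// Hypothetical, reduced job row.
#[derive(Debug, Clone, PartialEq)]
struct Job {
    id: u32,
    is_process: bool,
    phase: Phase,
    restart_count: u32,
}

/// Selection rule: process jobs in Retrying or Failed need a respawn.
fn needs_restart(job: &Job) -> bool {
    job.is_process && matches!(job.phase, Phase::Retrying | Phase::Failed)
}

/// Respawn by mutating matching rows in place. No row is added or removed;
/// the returned ids are the jobs that were restarted.
fn restart_in_place(jobs: &mut [Job]) -> Vec<u32> {
    let mut restarted = Vec::new();
    for job in jobs.iter_mut().filter(|job| needs_restart(job)) {
        job.restart_count += 1;
        job.phase = Phase::Running;
        restarted.push(job.id);
    }
    restarted
}
```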
Step 5: Rotate the running log on restart
Files: crates/hero_proc_server/src/logging/store.rs, crates/hero_proc_server/src/supervisor/executor.rs
- rotate_job_logs(job_id, restart_count): either delete or re-tag previous "running" entries.
Dependencies: Step 4.
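Both rotation options, delete or re-tag, can be sketched over an in-memory list of entries. LogEntry, Rotation, and the extra entries and mode parameters are hypothetical; the spec's rotate_job_logs takes only job_id and restart_count and works against the log store.

```rust
/// Hypothetical log entry; `running` marks lines of the current run.
#[derive(Debug, Clone, PartialEq)]
struct LogEntry {
    job_id: u32,
    running: bool,
    /// Set when an entry is archived: which restart produced it.
    restart: Option<u32>,
    line: String,
}

/// The two strategies named in this step.
enum Rotation {
    Delete,
    Retag,
}

/// Rotate the "running" entries of one job before it is respawned.
/// `restart_count` is the counter value of the run that just ended.
fn rotate_job_logs(entries: &mut Vec<LogEntry>, job_id: u32, restart_count: u32, mode: Rotation) {
    match mode {
        // Drop the previous running block entirely.
        Rotation::Delete => entries.retain(|e| !(e.job_id == job_id && e.running)),
        // Keep the lines but mark them as belonging to an earlier run.
        Rotation::Retag => {
            for entry in entries.iter_mut().filter(|e| e.job_id == job_id && e.running) {
                entry.running = false;
                entry.restart = Some(restart_count);
            }
        }
    }
}
```

Re-tagging keeps history available for debugging, while deleting keeps the store small; either way only one running block per job remains.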
Step 6: Plumb through OpenRPC & app types
Files: crates/hero_proc_server/openrpc.json, both checked-in client mirrors, crates/hero_proc_app/src/types.rs
- Extend Job and ActionSpec schemas; regenerate the mirrors.
Dependencies: Step 1.
Step 7: Tests
Files: crates/hero_proc_test/src/tests/functional/singleton.rs, crates/hero_proc_test/src/tests/functional/uc_31_34_action_cascade_process.rs
- Respawn keeps the same job.id, restart_count goes 0 → 1 → 2, old running:true entries are gone.
Dependencies: Steps 1–5.
Acceptance Criteria
- ActionSpec carries sockets and tcp_ports; round-trips through SQLite and OpenRPC.
- Job carries restart_count; additive migration on existing DBs.
- Stale socket files listed in ActionSpec.sockets are removed before every start; live owners reclaimed only when singleton || kill_other.
- Respawning a process action keeps the same Job.id and bumps restart_count.
- Existing hero_proc_test functional tests pass; the two new tests pass.
Notes
- Scope of this first slice includes restart_count + log rotation. Deferred: exponential-backoff continuous loop, HealthCheck-driven respawn, UI surface for restart_count.
- sockets / tcp_ports describe what an action binds; KillOther keeps describing what to reap from others. Existing service definitions keep working without setting the new fields.
- restart_count defaults to 0 for historical rows.
- macOS lsof UDS output differs from Linux; reuse the existing kill_other socket logic so platform behaviour does not drift.
- Regenerate the client mirrors via crates/hero_proc_sdk/build.rs or update them in lockstep to keep admin UI / app consumers compiling.