job.steer always fails — loop never writes task_state.json (the file the steer gate requires) #28

Closed
opened 2026-05-21 12:28:31 +00:00 by rawan · 4 comments
Member

job.steer is a dead surface. It returns No active autonomy state found in <workspace> for every job — including jobs that are visibly still running and updating their plan.

Repro

  1. Send a multi-step prompt that enters the autonomy/plan path (e.g. Build a Python CLI tool wxcli that fetches weather, with tests and a Makefile).
  2. While the JobDrawer Plan tab shows phases progressing (e.g. p1: in_progress, p2: pending), call job.steer.
  3. Result: JSON-RPC job.steer failed: No active autonomy state found in /home/<user>/hero/var/shrimp/workspace/jobs/<job_id>. ...

Root cause

The operator path that backs job.steer (crates/hero_shrimp_engine/src/orchestration/autonomy/operator.rs:84-93) loads state from:

STATE_PATH = ".agent/task_state.json"

(state.rs:21)

But the running agent loop never writes that file. The loop writes .agent/job_plan.json (via update_plan / complete_phase), which is a different file.

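persist_goal_and_state (the only writer of task_state.json) is called only from operator-driven sites:

  • mod.rs:167: inside activate_autonomy_job_state (operator-triggered)
  • promote.rs:122/125/220: plan promotion (operator-triggered)
  • operator.rs:68: action == "activate_run" (operator-triggered)
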
Zero call sites inside the loop body. The file the steer gate reads only exists if an operator action created it first — but the UI provides no such action surface (no job.resume RPC, no Resume button — see also the misleading recommendation in the error message itself).

Evidence from a live job

Live job rpc_job_..._3489030_4 mid-flight, .agent/ contents:

.agent/
├── job_plan.json        ← updated live by the loop
└── plan_versions/       ← also written by the loop

No task_state.json. job.steer fails.
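
The mismatch can be reproduced outside the engine with a few lines of standard-library Rust. This is a minimal sketch, not the engine's code: `steer_gate` and `loop_writes_plan` are hypothetical stand-ins for the gate in operator.rs and for the loop's plan writer, and the plan payload is made up. Only the two file paths come from the issue.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Path the steer gate reads, per state.rs:21 in the issue.
const STATE_PATH: &str = ".agent/task_state.json";
/// Path the agent loop actually writes.
const PLAN_PATH: &str = ".agent/job_plan.json";

/// Hypothetical stand-in for the gate: steering is refused unless the
/// state file is present in the workspace.
fn steer_gate(workspace: &Path) -> Result<PathBuf, String> {
    let state = workspace.join(STATE_PATH);
    if state.is_file() {
        Ok(state)
    } else {
        Err(format!(
            "No active autonomy state found in {}",
            workspace.display()
        ))
    }
}

/// Hypothetical stand-in for the loop today: it writes the plan file only.
fn loop_writes_plan(workspace: &Path) -> std::io::Result<()> {
    let plan = workspace.join(PLAN_PATH);
    if let Some(parent) = plan.parent() {
        fs::create_dir_all(parent)?;
    }
    // Illustrative payload; the real plan schema is not shown in the issue.
    fs::write(plan, r#"{"phases":[{"id":"p1","status":"in_progress"}]}"#)
}
```

Calling `loop_writes_plan` and then `steer_gate` on the same directory returns the error, because nothing in between ever creates the state file.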

Fix options

  1. Have the loop write task_state.json alongside job_plan.json on every plan update / iteration boundary. Steer then has a file to read. This matches the operator path's expectation.
  2. Or change the steer operator to derive state from job_plan.json + DB rows instead of requiring a separate task_state.json. Removes the duplicate persistence surface.
  3. Fix the misleading error message regardless — it tells operators to use a Resume button that doesn't exist (no job.resume RPC, no Resume UI). Either build that surface or change the message to something actionable (e.g. "send a follow-up message via job.follow_up").

Related

Same pattern as #27 (plan-approval UI persists a decision the engine doesn't consume). Both are operator-control surfaces wired to UIs/RPCs that the engine doesn't honor.

Author
Member

Spec: fix job.steer by having the agent loop write task_state.json alongside job_plan.json

Objective

Make job.steer (and any other operator action that reads <workspace>/.agent/task_state.json) succeed for a live, mid-flight autonomy job — including jobs whose .agent/ directory currently contains only job_plan.json and plan_versions/. We do this by adding a minimal "task_state mirror" write to the same two writers that the loop already drives (update_plan and complete_phase), keyed off the job_plan.json we just wrote. We also fix the steer error message so it stops directing operators to a Resume button that does not exist.

Chosen approach

Option 1 + Option 3.

Option 1 (loop writes task_state.json alongside job_plan.json) is the right answer because:

  1. The operator path already expects a file at STATE_PATH = ".agent/task_state.json" (operator.rs:84-93) and the in-prompt OperatorGuidanceProvider reads the same file (job_context.rs:206) — so without a loop-side writer, both job.steer and the prompt-side guidance injection are broken for any job that wasn't started via the explicit operator activation path.
  2. The two writer sites in the loop are exactly two functions — handle_update_plan (plan_ops.rs:88) and handle_complete_phase (verify/mod.rs:615) — both of which already write job_plan.json next to the future task_state.json. Mirroring there is one helper call per site.
  3. Option 2 (rewrite the steer operator to read from job_plan.json + DB) would force us to re-derive every field of ExecutionState (timeline, coverage, recovery_ladder, blocked_reason, operator_guidance, etc.) from a plan file that doesn't have them, and would still leave the prompt-side operator_guidance_from_workspace broken or force a second rewrite. It is a larger blast radius than the bug warrants.

Option 3 (fix the misleading error message) is additive and cheap, and stops sending operators after a non-existent button.

Requirements

  • After this change, calling job.steer against a mid-flight autonomy job whose only on-disk artifact is .agent/job_plan.json must succeed (action steer_job writes operator_guidance and action clear_steer clears it).
  • The synthesized task_state.json must have job_id set to the run's artifact job id so the existing identity check at operator.rs:94 passes.
  • The mirror write must be idempotent and non-fatal: if job_plan.json is missing or unreadable, or if writing task_state.json fails, the originating update_plan / complete_phase call must still succeed (the mirror is a best-effort sidecar, not a hard gate).
  • When a task_state.json already exists for the workspace and its job_id matches the loop's job_id, the loop's write must preserve operator_guidance, operator_force_replan, operator_pause_requested, blocked_reason, and timeline — i.e. we must not stomp operator-set fields when refreshing the plan/phase mirror.
  • The steer error message in operator.rs:87-91 must no longer instruct the user to use a "Resume button on the run page" (no such surface exists). Replace with an actionable message that names the real follow-up surface (job.follow_up / sending a new prompt) and explains why state may legitimately not exist (the job has not yet emitted update_plan and so has no plan to steer against).
  • No behavior change for the existing operator-driven sites (activate_autonomy_job_state, fork_autonomy_job, the resume_job action) — they still write task_state.json through persist_goal_and_state. We are adding a second writer, not replacing the existing one.
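
The "best-effort sidecar" requirement could be shaped roughly as below. This is a sketch under stated assumptions: `try_mirror`, `mirror_best_effort`, and this simplified `update_plan` are hypothetical, the state payload is a placeholder rather than a real ExecutionState, and `eprintln!` stands in for the `tracing::warn!` the spec calls for.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Fallible inner step: requires the plan file, then writes the state file.
/// The payload is a placeholder; a real mirror would serialize an ExecutionState.
fn try_mirror(workspace: &Path) -> io::Result<()> {
    let agent = workspace.join(".agent");
    let _plan = fs::read_to_string(agent.join("job_plan.json"))?;
    fs::write(agent.join("task_state.json"), r#"{"status":"running"}"#)
}

/// Best-effort wrapper: reports the failure and returns normally.
/// The bool is only there so the example can be observed.
fn mirror_best_effort(workspace: &Path) -> bool {
    match try_mirror(workspace) {
        Ok(()) => true,
        Err(err) => {
            // Stand-in for tracing::warn!; the error is never propagated.
            eprintln!("task_state mirror skipped: {}", err);
            false
        }
    }
}

/// Hypothetical caller shaped like update_plan: its own result does not
/// depend on whether the mirror succeeded.
fn update_plan(workspace: &Path, plan_json: &str) -> io::Result<()> {
    let agent = workspace.join(".agent");
    fs::create_dir_all(&agent)?;
    fs::write(agent.join("job_plan.json"), plan_json)?;
    let _ = mirror_best_effort(workspace);
    Ok(())
}
```

Keeping the fallible part in its own function means `?` can be used freely inside it, while the wrapper is the single place where errors are swallowed.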

Files to modify

  1. crates/hero_shrimp_engine/src/orchestration/autonomy/persistence.rs — add a new public helper mirror_task_state_from_plan(workspace_dir: &Path, job_id: &str) that reads .agent/job_plan.json, merges it onto an existing task_state.json (preserving operator fields and timeline) or synthesizes a minimal ExecutionState if none exists, and writes back to .agent/task_state.json. Best-effort: never panics, never propagates errors past a tracing::warn!.

  2. crates/hero_shrimp_engine/src/orchestration/autonomy/mod.rs — re-export the new helper next to the other persistence::* re-exports (lines 66-70) so the tool handlers can call it through crate::autonomy::mirror_task_state_from_plan.

  3. crates/hero_shrimp_engine/src/tools/tool_catalog/verify/plan_ops.rs — at the end of handle_update_plan (after the kanban sync, before the success return), call the mirror helper using context.workspace_dir and context.job_id. Best-effort; ignore the result.

  4. crates/hero_shrimp_engine/src/tools/tool_catalog/verify/mod.rs — at the end of the "Passed — advance the phase in the plan." block in handle_complete_phase (after the plan-versions copy, before the emit_scoped), call the same mirror helper. Best-effort; ignore the result.

  5. crates/hero_shrimp_engine/src/orchestration/autonomy/operator.rs — replace the misleading error message at lines 86-92 with a message that does not reference a "Resume button". Suggested text: "No autonomy state at {workspace}/.agent/task_state.json yet. The job either hasn't published a plan via update_plan / complete_phase, or it never entered the autonomy path. Send a follow-up prompt instead — direct steer requires an active plan.".

Implementation plan

Each step is self-contained. Steps 1 and 2 are setup; Steps 3 and 4 are the two writer hooks; Step 5 is the error-message fix. Steps 3, 4, and 5 can be done in any order once Steps 1 and 2 are in place. Step 6 is the test layer.

Step 1 — add the mirror_task_state_from_plan helper to persistence.rs

  • File: crates/hero_shrimp_engine/src/orchestration/autonomy/persistence.rs
  • Reads <workspace>/.agent/job_plan.json (returns early if absent / unparseable).
  • If <workspace>/.agent/task_state.json exists AND its job_id equals the job_id argument, load it and refresh only plan, phases (matching by phase id to preserve per-phase status), status (only promote ""/"planned" → "running"; never demote a "running" state), and updated_at. Preserve operator_guidance, operator_force_replan, operator_pause_requested, blocked_reason, timeline.
  • Otherwise synthesize a fresh ExecutionState::new(plan, AutonomyMode::Execute, "running") with job_id = Some(job_id.to_string()) from the on-disk plan.
  • Write via save_json_file (no DB upsert, no goal-doc writes — those belong to operator-driven persist_goal_and_state).
  • Any failure → tracing::warn!, never propagated.
  • Dependencies: none.
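
The merge rules in Step 1 can be sketched in memory, leaving out file I/O and JSON. Everything here is illustrative: `State` and `Phase` are cut-down hypothetical stand-ins for ExecutionState, the plan is reduced to a list of phase ids, and the "pending" status for newly seen phases is an assumption.

```rust
#[derive(Clone, Debug, PartialEq)]
struct Phase {
    id: String,
    status: String,
}

#[derive(Clone, Debug, Default, PartialEq)]
struct State {
    job_id: Option<String>,
    status: String,
    phases: Vec<Phase>,
    operator_guidance: Option<String>,
    operator_force_replan: bool,
    timeline: Vec<String>,
}

/// New phases start as "pending" (an assumption for this sketch).
fn fresh_phase(id: &str) -> Phase {
    Phase { id: id.to_string(), status: "pending".to_string() }
}

/// Merge the plan's phase ids onto an optional existing state.
fn mirror(existing: Option<State>, plan_phase_ids: &[&str], job_id: &str) -> State {
    match existing {
        // Same job: refresh plan-derived fields, keep operator fields and timeline.
        Some(mut state) if state.job_id.as_deref() == Some(job_id) => {
            let old = std::mem::take(&mut state.phases);
            state.phases = plan_phase_ids
                .iter()
                .map(|id| {
                    // Match by phase id so per-phase status survives the refresh.
                    old.iter()
                        .find(|p| p.id == *id)
                        .cloned()
                        .unwrap_or_else(|| fresh_phase(id))
                })
                .collect();
            // Promote only ""/"planned"; any other status is left as it is.
            if state.status.is_empty() || state.status == "planned" {
                state.status = "running".to_string();
            }
            state
        }
        // No state, or a stale one from another job: synthesize from the plan.
        _ => State {
            job_id: Some(job_id.to_string()),
            status: "running".to_string(),
            phases: plan_phase_ids.iter().map(|id| fresh_phase(id)).collect(),
            ..State::default()
        },
    }
}
```

Because the same-job branch mutates the loaded state instead of building a new one, any field it does not explicitly touch is preserved by construction, which is the property the operator-fields requirement depends on.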

Step 2 — re-export the helper from autonomy/mod.rs

  • File: crates/hero_shrimp_engine/src/orchestration/autonomy/mod.rs
  • Add mirror_task_state_from_plan to the existing pub use self::persistence::{ … } block so it is reachable as crate::autonomy::mirror_task_state_from_plan.
  • Dependencies: Step 1.

Step 3 — hook update_plan

  • File: crates/hero_shrimp_engine/src/tools/tool_catalog/verify/plan_ops.rs
  • At the end of handle_update_plan, after the kanban sync and before the success return, call:
    if let Some(job_id) = context.job_id.as_deref() {
        crate::autonomy::mirror_task_state_from_plan(
            std::path::Path::new(&context.workspace_dir),
            job_id,
        );
    }
    
  • This guarantees the first update_plan call already populates task_state.json.
  • Dependencies: Steps 1+2.

Step 4 — hook complete_phase

  • File: crates/hero_shrimp_engine/src/tools/tool_catalog/verify/mod.rs
  • Inside handle_complete_phase, after the plan-versions file is written and before the emit_scoped event, insert the same mirror call shown in Step 3.
  • Keeps task_state.json in sync as phases tick over the life of the run.
  • Dependencies: Steps 1+2.

Step 5 — fix the misleading error message

  • File: crates/hero_shrimp_engine/src/orchestration/autonomy/operator.rs (lines 86-92)
  • Replace the existing anyhow!(...) body with:
    anyhow!(
        "No active autonomy state found in {}. \
         The job either hasn't published a plan yet (no `update_plan` / \
         `complete_phase` has fired) or it never entered the autonomy path. \
         Send a follow-up prompt via the normal chat surface to add guidance; \
         direct `job.steer` requires an active plan on disk.",
        workspace_dir.display()
    )
    
  • Dependencies: none (purely independent of 1-4).
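
A check along the lines of the "no Resume button" requirement might look like the following. It is a sketch only: `steer_missing_state_message` is a hypothetical helper that builds the Step 5 text with `format!` so the example needs no external crates, whereas the spec wraps the same text in `anyhow!`.

```rust
use std::path::Path;

/// Builds the reworded Step 5 message as a plain String.
fn steer_missing_state_message(workspace_dir: &Path) -> String {
    format!(
        "No active autonomy state found in {}. \
         The job either hasn't published a plan yet (no `update_plan` / \
         `complete_phase` has fired) or it never entered the autonomy path. \
         Send a follow-up prompt via the normal chat surface to add guidance; \
         direct `job.steer` requires an active plan on disk.",
        workspace_dir.display()
    )
}
```

A test can then assert that the message contains the workspace path and does not contain the word "Resume".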

Step 6 — tests

  • (a) Persistence unit test: tempdir + .agent/job_plan.json with two phases → call helper → assert task_state.json exists, deserialises into ExecutionState, job_id == Some("rpc_job_xyz"), 2 phases with matching ids, status == "running".
  • (b) Operator integration test: after Step 1's helper has been invoked, call apply_autonomy_operator_action with action="steer_job", message="be careful" and assert it returns Ok (no longer the "No active autonomy state" error). Verify the resulting task_state.json has operator_guidance set.
  • (c) Operator-fields preservation test: pre-populate task_state.json with operator_guidance = Some("be quick") and operator_force_replan = true, then call the mirror helper, then assert those two fields are still set on disk.
  • (d) plan_ops handler test: drive handle_update_plan against a workspace with a ToolContext { workspace_dir, job_id: Some(...) } and assert .agent/task_state.json exists after the call.

Acceptance criteria

  • Calling job.steer { job_id, message } against a job whose workspace contains only .agent/job_plan.json and .agent/plan_versions/ returns success and sets operator_guidance in .agent/task_state.json.
  • Calling job.steer { job_id, clear: true } against the same job returns success and clears operator_guidance.
  • OperatorGuidanceProvider (job_context.rs:204) sees the steering text on the next loop iteration via operator_guidance_from_workspace — i.e. the steering reaches the LLM prompt without going through the DB-only pending_operator_guidance fallback.
  • Operator fields (operator_guidance, operator_force_replan, operator_pause_requested, blocked_reason) survive subsequent update_plan and complete_phase calls.
  • Running an autonomy job that never emits update_plan (Tier-0 trivial path) does NOT crash — the mirror helper is a no-op when job_plan.json doesn't exist.
  • The error returned when task_state.json genuinely cannot be located no longer mentions a "Resume button".
  • All new unit tests pass; existing tests remain green.
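
The first two criteria can be modelled in memory as follows. This is a hedged sketch, not the engine's API: `State`, `Action`, and `apply_action` are hypothetical, and the identity check is assumed to be a plain equality test on job_id, which the issue implies but does not show.

```rust
#[derive(Debug, Default, PartialEq)]
struct State {
    job_id: Option<String>,
    operator_guidance: Option<String>,
}

/// Hypothetical model of the two steer actions named in the spec
/// (steer_job and clear_steer).
enum Action<'a> {
    SteerJob(&'a str),
    ClearSteer,
}

fn apply_action(state: &mut State, job_id: &str, action: Action<'_>) -> Result<(), String> {
    // Assumed identity check: the state must belong to the job being steered.
    if state.job_id.as_deref() != Some(job_id) {
        return Err(format!("state belongs to {:?}, not {}", state.job_id, job_id));
    }
    match action {
        Action::SteerJob(message) => state.operator_guidance = Some(message.to_string()),
        Action::ClearSteer => state.operator_guidance = None,
    }
    Ok(())
}
```

This is also why the synthesized state has to carry the loop's job_id: with a missing or different id, both actions are rejected before any field is touched.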

Notes

  • Why not call persist_goal_and_state directly? That helper does a synchronous DB upsert (upsert_job_state_snapshot) and writes two markdown files (archived + canonical goal docs). Doing all of that on every single update_plan and complete_phase invocation would add per-tick DB writes and per-tick goal-doc rewrites for a side-channel mirror. The loop already writes to the DB via execution-control and to job_plan.json directly — the mirror only needs to satisfy the operator path's "is there an ExecutionState at STATE_PATH?" check. A direct save_json_file is the minimum-blast-radius write.
  • What if the live task_state.json's job_id doesn't match context.job_id? The merge branch only fires when they match. On mismatch, we fall through to synthesise a fresh ExecutionState with the loop's job_id — correct, because the operator's identity check at operator.rs:94 keys off the loop's job_id, not a stale snapshot's.
  • Concurrency. Adds one sibling-file write to a directory the loop already writes to without file locks, so it introduces no race that the job_plan.json writes do not already have.
  • No DB changes. Purely a file-mirror fix. The DB-side JobStateSnapshotRow continues to be authoritative for callers that go through persist_goal_and_state / the DB-first path of load_state.
  • Out of scope. Building a real job.resume RPC or a Resume button is explicitly NOT part of this fix. The error message rewording is enough to stop the wild-goose chase; the new surface, if needed, is a separate issue.
Author
Member

Test Results

Summary

  • New tests added: 6 (all passing)
  • Existing autonomy tests: 21 (all passing)
  • Pre-existing failures: 9 (verified on development — not introduced by this change)

New tests added (crates/hero_shrimp_engine/src/orchestration/autonomy/tests.rs)

| # | Test | Result |
|---|---|---|
| 1 | mirror_task_state_from_plan_creates_state_when_none_exists | passed |
| 2 | mirror_task_state_from_plan_is_noop_when_plan_missing | passed |
| 3 | mirror_task_state_from_plan_preserves_operator_fields | passed |
| 4 | mirror_task_state_from_plan_replaces_state_on_job_id_mismatch | passed |
| 5 | mirror_task_state_from_plan_ignores_empty_job_id | passed |
| 6 | steer_error_message_no_longer_mentions_resume_button | passed |

These directly assert:

  • The mirror writes a valid ExecutionState with the correct job_id and phase shells.
  • The mirror is a true no-op when job_plan.json is missing (Tier-0 / non-autonomy jobs do not crash).
  • operator_guidance, operator_force_replan, and blocked_reason survive subsequent update_plan / complete_phase ticks.
  • A job_id mismatch causes a fresh state to be synthesised rather than a stale merge.
  • The new job.steer error message no longer points operators at a non-existent "Resume button".

Autonomy submodule test results (orchestration::autonomy::tests::)

running 27 tests
test result: ok. 27 passed; 0 failed; 0 ignored; 0 measured

Pre-existing test failures (verified unrelated to this change)

The following 9 tests fail under cargo test -p hero_shrimp_engine --lib on both this branch and development. All of them were confirmed to pass when run in isolation; the autonomy backend failure is environmental, not code-related (bubblewrap is installed on this dev machine):

tests::autonomy_auto_fallback_warns_when_no_isolated_backend_exists
tests::autonomy_context_auto_selects_isolated_backends
tools::external_cmd::tests::spawn_failing_command_returns_failure_with_exit_code
tools::external_cmd::tests::spawn_runs_a_real_command_and_captures_stdout
tools::external_cmd::tests::spawn_timeout_returns_failure_not_hang
tools::tool_catalog::verify::e2e_datetime_server::phase2_http_server_live_request
tools::tool_catalog::verify::e2e_datetime_server::phase3_edge_case_unknown_route_returns_404
verification::runner::tests::command_runs_through_a_shell_so_cd_and_chaining_work
verification::runner::tests::command_succeeds_decides_purely_on_exit_code

Running the same 9 tests in isolation:

test result: ok. 56 passed; 0 failed; 0 ignored

So no regressions are introduced by this change.

Build

  • cargo check -p hero_shrimp_engine — clean (no warnings, no errors).
  • cargo build -p hero_shrimp_engine --tests — clean.
Author
Member

Implementation Summary

Changes Made

Implemented Option 1 + Option 3 from the issue: the agent loop now mirrors <workspace>/.agent/job_plan.json into <workspace>/.agent/task_state.json on every update_plan and complete_phase, so job.steer and OperatorGuidanceProvider have the state file they require for any live autonomy job. The misleading error message that pointed operators at a non-existent "Resume button" was also rewritten.

Files modified

  • crates/hero_shrimp_engine/src/orchestration/autonomy/persistence.rs — added mirror_task_state_from_plan(workspace_dir, job_id). Best-effort sidecar writer: reads .agent/job_plan.json, merges over an existing task_state.json (preserving operator_guidance, operator_force_replan, operator_pause_requested, blocked_reason, timeline, and per-phase status keyed by phase id) or synthesizes a fresh ExecutionState if none exists, then writes to .agent/task_state.json. Any failure is logged at warn! and swallowed so the caller's primary write succeeds.
  • crates/hero_shrimp_engine/src/orchestration/autonomy/mod.rs — re-exported mirror_task_state_from_plan alongside the other persistence::* entries.
  • crates/hero_shrimp_engine/src/tools/tool_catalog/verify/plan_ops.rs — handle_update_plan calls the mirror helper after writing job_plan.json. Now the first update_plan populates task_state.json and job.steer becomes usable.
  • crates/hero_shrimp_engine/src/tools/tool_catalog/verify/mod.rs — handle_complete_phase calls the mirror helper after the post-verification plan rewrite, so phase progression continues to refresh the mirror.
  • crates/hero_shrimp_engine/src/orchestration/autonomy/operator.rs — rewrote the No active autonomy state found error message to drop the reference to a "Resume button" that does not exist and point operators at the real surface (update_plan / follow-up prompt).
  • crates/hero_shrimp_engine/src/orchestration/autonomy/tests.rs — added 6 unit tests covering creation, no-op-on-missing-plan, operator-field preservation, job_id mismatch resynthesis, empty job_id rejection, and the new error message.
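
The merge rules described for mirror_task_state_from_plan can be sketched in memory as follows. This is a minimal illustration, not the code in persistence.rs: the ExecutionState shape is cut down to a few fields, the plan is reduced to a list of phase ids, file I/O is left out, and the "pending" default status and the mirror_state name are assumptions made for the example.

```rust
use std::collections::HashMap;

// Hypothetical, cut-down state; the real ExecutionState carries more fields.
#[derive(Clone, Debug, Default, PartialEq)]
struct ExecutionState {
    job_id: String,
    phases: Vec<(String, String)>, // (phase id, status)
    operator_guidance: Option<String>,
    operator_force_replan: bool,
    blocked_reason: Option<String>,
}

/// Merge the current plan (phase ids only) over an optional prior state.
/// Returns None when there is nothing to mirror: no plan, or an empty job id.
fn mirror_state(
    plan: Option<&[&str]>,
    job_id: &str,
    existing: Option<ExecutionState>,
) -> Option<ExecutionState> {
    let plan = plan?;
    if job_id.is_empty() {
        return None;
    }
    // State left over from a different job is discarded rather than merged.
    let prior = existing.filter(|s| s.job_id == job_id);
    // Index the prior per-phase status by phase id so it survives the merge.
    let prior_status: HashMap<String, String> = prior
        .as_ref()
        .map(|s| s.phases.iter().cloned().collect())
        .unwrap_or_default();
    let phases = plan
        .iter()
        .map(|id| {
            let status = prior_status
                .get(*id)
                .cloned()
                // Assumed default for phases the prior state did not know about.
                .unwrap_or_else(|| "pending".to_string());
            ((*id).to_string(), status)
        })
        .collect();
    // Starting from the prior state keeps the operator fields untouched.
    let mut next = prior.unwrap_or_default();
    next.job_id = job_id.to_string();
    next.phases = phases;
    Some(next)
}
```

Returning None for a missing plan corresponds to the no-op case for Tier-0 jobs, and filtering the prior state on job_id corresponds to the "fresh state rather than a stale merge" behaviour.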

Deliberately NOT changed

  • persist_goal_and_state and its existing callers (activate_autonomy_job_state, fork_autonomy_job, resume_job). The mirror is additive — those paths still own DB upsert + goal-doc writes for operator-driven activations.
  • The DB-side JobStateSnapshotRow schema and load_state's DB-first behaviour.
  • The on-disk job_plan.json format itself.

Test Results

  • All 6 new tests pass.
  • All 21 previously-passing autonomy tests still pass.
  • The 9 unrelated failures observed in cargo test -p hero_shrimp_engine --lib are pre-existing on development (bubblewrap-detection test + shell/e2e tests that pass in isolation but fail under the full-suite harness). Confirmed by running them on development directly.

Acceptance criteria

  • job.steer { job_id, message } against a job whose workspace contains only .agent/job_plan.json + .agent/plan_versions/ succeeds and sets operator_guidance (validated by mirror_task_state_from_plan_creates_state_when_none_exists + manual trace through apply_autonomy_operator_action).
  • job.steer { job_id, clear: true } clears operator_guidance — same code path now reachable.
  • OperatorGuidanceProvider will see steering text on the next loop iteration (it already reads STATE_PATH; the file now exists).
  • operator_guidance, operator_force_replan, operator_pause_requested, blocked_reason survive subsequent update_plan / complete_phase ticks (validated by mirror_task_state_from_plan_preserves_operator_fields).
  • Tier-0 jobs (no update_plan) do not crash (validated by mirror_task_state_from_plan_is_noop_when_plan_missing).
  • Error message no longer mentions a "Resume button" (validated by steer_error_message_no_longer_mentions_resume_button).
  • All new unit tests pass; pre-existing autonomy tests remain green.

Notes

  • The fix is purely on the file-mirror layer; no DB schema change, no new RPC, no UI change.
  • Out of scope: building a real job.resume RPC or Resume button. The error message rewording stops the wild-goose chase; building that surface, if desired, is a separate issue.
rawan self-assigned this 2026-05-25 09:32:20 +00:00
Author
Member

Follow-up: extended scope after testing

After verifying the original fix, three additional issues were addressed on the same branch. Issues A and B surfaced during testing; Issue C is a regression introduced by the Issue A fix.

Issue A — job.steer still errored for jobs that never entered the autonomy/plan path

The original fix wrote task_state.json from update_plan / complete_phase, so steer worked for autonomy jobs that had published a plan. But jobs running in the fast/Tier-0 path (or autonomy jobs that hadn't yet called update_plan) never created the file, so job.steer still failed with the new error message.

Fix: apply_autonomy_operator_action now accepts the "instructional" actions (steer_job, clear_steer, force_replan, pause_job) even when task_state.json doesn't exist. A minimal ExecutionState is synthesized in memory, the field is set, and the state is persisted. Phase-specific actions (retry_phase, skip_phase) still error when no plan exists — they legitimately need a phase id.

  • crates/hero_shrimp_engine/src/orchestration/autonomy/operator.rs — added synthesize_minimal_state_for_operator(job_id); restructured the state-load to fall back to synthesis for the four instructional actions.
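
The load-or-synthesize fallback described above might look roughly like this. It is a sketch under stated assumptions: the OperatorAction enum, the state_for_action helper, the trimmed ExecutionState and the error text are all hypothetical stand-ins, and the real apply_autonomy_operator_action also persists the result.

```rust
// Hypothetical action set mirroring the names used in this comment.
#[derive(Clone, Copy, Debug, PartialEq)]
enum OperatorAction {
    SteerJob,
    ClearSteer,
    ForceReplan,
    PauseJob,
    RetryPhase,
    SkipPhase,
}

impl OperatorAction {
    /// "Instructional" actions only set a field and need no phase id.
    fn is_instructional(self) -> bool {
        matches!(
            self,
            Self::SteerJob | Self::ClearSteer | Self::ForceReplan | Self::PauseJob
        )
    }
}

// Hypothetical, cut-down state.
#[derive(Clone, Debug, Default, PartialEq)]
struct ExecutionState {
    job_id: String,
    operator_guidance: Option<String>,
    operator_force_replan: bool,
    operator_pause_requested: bool,
}

/// Use the loaded state when there is one; otherwise synthesize a minimal
/// state for instructional actions and reject phase-specific ones.
fn state_for_action(
    loaded: Option<ExecutionState>,
    job_id: &str,
    action: OperatorAction,
) -> Result<ExecutionState, String> {
    match loaded {
        Some(state) => Ok(state),
        None if action.is_instructional() => Ok(ExecutionState {
            job_id: job_id.to_string(),
            ..Default::default()
        }),
        // Placeholder wording, not the message emitted by operator.rs.
        None => Err(format!("no autonomy state for job {}", job_id)),
    }
}
```

Keeping the instructional/phase-specific split in one predicate means a new action has to be classified explicitly before it can run without a plan.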

Issue B — [WARN] reconciled timed-out autonomy run from completed subagents spammed the logs

A pre-existing idempotency bug in stamp_subagent_reconciliation_details: failure_kind and run_timeout were only cleared from details_json when summary.status == "completed". If subagents finished but reconciliation produced any other verdict (e.g. contract verification failed), failure_kind: "run_timeout" was left in place, so row_failed_due_to_run_timeout matched on the very next UI poll and re-fired the reconciliation (and the warn) for the same row forever.

Fix: clear failure_kind and run_timeout after every reconciliation regardless of verdict, since the original "merely timed out" classification is no longer accurate once subagent reconciliation has run.

  • crates/hero_shrimp_server/src/rpc/methods/job/contract.rs — hoisted the two details.remove(...) calls out of the summary.status == "completed" branch.
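
To illustrate why the unconditional clear makes reconciliation idempotent, here is a small sketch that models details_json as a string map. The Details alias, both function names and the reconciled_status key are assumptions for the example; only the failure_kind and run_timeout keys come from the description above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the details_json object (string values only).
type Details = HashMap<String, String>;

/// Record the reconciliation verdict and drop the timeout markers.
fn stamp_reconciliation(details: &mut Details, summary_status: &str) {
    // Hypothetical key used here to record the verdict.
    details.insert("reconciled_status".to_string(), summary_status.to_string());
    // Cleared for every verdict, not only "completed": once reconciliation
    // has run, the row is no longer "merely timed out".
    details.remove("failure_kind");
    details.remove("run_timeout");
}

/// The predicate that decides whether a row is picked up for reconciliation.
fn failed_due_to_run_timeout(details: &Details) -> bool {
    details.get("failure_kind").map(String::as_str) == Some("run_timeout")
}
```

With the removal inside a "completed"-only branch, a failed verdict would leave failed_due_to_run_timeout returning true, so the same row would be reconciled again on every poll.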

Issue C — message.send "queued while starting" UX preserved

The Issue A change inadvertently broke message_send_queues_guidance_when_active_state_is_not_ready: steer_existing_job_from_message in session_autonomy.rs previously triggered the queue_pending_operator_guidance fallback only on steer failure. With steer now succeeding via synthesized state, the friendly "Queued guidance while the active job finishes starting." message and the DB-side pending_operator_guidance backstop both stopped firing.

Fix: detect task_state.json absence before the steer call and route through the queue helper when the state had to be synthesized. The user-facing message stays accurate ("queued while starting", not "applied") and the DB backstop continues to write pending_operator_guidance so the autonomy loop can read it via either the file mirror or pending_operator_guidance_from_db.

  • crates/hero_shrimp_server/src/rpc/methods/session_autonomy.rs — pre-steer state_existed check; route through queue_pending_operator_guidance when state was synthesized.
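
The ordering that matters here is that the existence check runs before the steer call, because steer itself creates the state. A minimal sketch, assuming the steer is still applied in both cases and the queue helper runs in addition when the state had to be synthesized; the Job struct, SteerOutcome enum and the "applied" message text are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum SteerOutcome {
    Applied,
    Queued,
}

// Hypothetical in-memory stand-in for the workspace file and the DB row.
#[derive(Debug, Default)]
struct Job {
    state_file_exists: bool,
    operator_guidance: Option<String>,         // task_state.json side
    pending_operator_guidance: Option<String>, // DB backstop side
}

fn steer_from_message(job: &mut Job, text: &str) -> SteerOutcome {
    // Must be read first: the steer below creates the state if it is absent.
    let state_existed = job.state_file_exists;

    // Steer now succeeds either way (synthesizing state when needed).
    job.operator_guidance = Some(text.to_string());
    job.state_file_exists = true;

    if state_existed {
        SteerOutcome::Applied
    } else {
        // State was synthesized, so also write the DB-side backstop.
        job.pending_operator_guidance = Some(text.to_string());
        SteerOutcome::Queued
    }
}

fn user_message(outcome: &SteerOutcome) -> &'static str {
    match outcome {
        // Hypothetical wording for the applied case.
        SteerOutcome::Applied => "Applied guidance to the active job.",
        SteerOutcome::Queued => "Queued guidance while the active job finishes starting.",
    }
}
```

Checking after the steer call would always observe an existing state and report "applied", which is the regression described above.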

New tests

# Test Result
1 steer_job_succeeds_when_no_state_file_exists passed
2 clear_steer_succeeds_when_no_state_file_exists passed
3 retry_phase_still_errors_when_no_state_and_no_resume_button_mentioned passed
4 stamp_clears_failure_kind_even_when_summary_failed passed
5 stamp_clears_failure_kind_when_summary_completed passed

Plus the previously-failing message_send_queues_guidance_when_active_state_is_not_ready is green again after the UX preservation fix in Issue C.

Build status

  • cargo check -p hero_shrimp_engine — clean.
  • cargo check -p hero_shrimp_server — clean.
  • cargo test -p hero_shrimp_engine --lib orchestration::autonomy::tests:: — 29 passed, 0 failed.
  • cargo test -p hero_shrimp_server --lib rpc::methods::job::contract::reconciliation_tests — 2 passed, 0 failed.

Summary of behavior change

job.steer and job.clear_steer (and force_replan, pause_job) now succeed for any running job — autonomy or fast-path — regardless of whether a plan has been published. The guidance is delivered via:

  • the autonomy loop's OperatorGuidanceProvider reading task_state.json (file path), and
  • as a backstop, pending_operator_guidance_from_db reading autonomy_jobs.details_json when the message.send code path queued it.

The reconcile-loop log spam is also fixed.

rawan closed this issue 2026-05-25 12:16:24 +00:00
Reference
lhumina_code/hero_shrimp#28