Phase 3: Benchmark flow — statistics per workflow version #7

Open
opened 2026-04-19 16:33:13 +00:00 by timur · 2 comments
Owner

Context

After Phases 1 & 2, workflows have versions and typed inputs. We need a way to measure how each version performs across realistic inputs, persisting the results so we can compare versions and track trends.

This also lays the groundwork for Phase 4 (version router), which uses benchmark data to pick the right version per user request.

This is similar to optimize_flow but without the optimization part — just run the workflow N times, collect stats, save them. No config mutation.

Goal

  1. New OSIS root object Benchmark storing per-run metrics
  2. New flow template benchmark_flow.json that runs the benchmark
  3. UI integration: "Benchmark" button on a workflow version, benchmark history visible on the workflow page

Depends on

  • #5 must be merged (needs typed Workflow.inputs to generate mock inputs)
  • #6 recommended (examples can seed mock inputs)

OSchema changes

File: crates/hero_logic/schemas/logic/logic.oschema

# PerRunResult captures one benchmark run's metrics.
PerRunResult = {
    play_sid: str
    input_values: {str: str}    # which inputs this run used
    status: str                 # success|failed|timeout
    duration_ms: u64
    retries: u32                # count of script_execution node iterations > 0
    tokens_prompt: u64          # 0 if not captured
    tokens_completion: u64
    error: str                  # empty on success
}

# Benchmark is a persisted measurement of a workflow version. [rootobject]
Benchmark = {
    sid: str
    workflow_sid: str
    workflow_version_sid: str
    created_at: int
    num_runs: u32
    success_rate: f32           # 0.0-100.0
    avg_duration_ms: u64
    min_duration_ms: u64
    max_duration_ms: u64
    total_retries: u32
    avg_tokens_prompt: u64
    avg_tokens_completion: u64
    estimated_cost_usd: f64     # computed from tokens if available, else 0
    difficulty_rating: f32      # 0.0-1.0, set by future flows (#8), defaults to 0.5
    runs: [PerRunResult]        # full per-run data
    notes: str                  # optional human/AI notes
}

New flow template

File: crates/hero_logic/templates/benchmark_flow.json

A 4-node flow:

Node 1: fetch_workflow_meta (Python)

Input: target_workflow_sid, optional target_version_sid.

  • RPC workflow_get to retrieve the workflow
  • Extract declared inputs (name, type, description, required, default)
  • If target_version_sid not provided, use current_version_sid
  • Output: {workflow_sid, workflow_version_sid, inputs, name}

Node 2: generate_mock_inputs (AI)

Input: workflow meta from node 1, num_runs parameter, optional input_hints (e.g., "realistic task prompts").

  • System prompt: given this workflow's declared inputs and description, generate N realistic input value sets that exercise typical use cases. Output as JSON array of objects where each object maps input_name → value.
  • Output: [{input_values}, {input_values}, ...]

Node 3: run_plays (Python)

Input: mock inputs from node 2.

  • For each input set: logicservice.play_start(workflow_sid, input_data=json.dumps(inputs), name=f"benchmark-{i}") with the target version SID
  • Poll each play to completion
  • Collect PerRunResult for each (play_sid, status, duration_ms, retries, tokens_prompt, tokens_completion, error)
  • If hero_proc starts capturing tokens (currently returns 0), use those; otherwise leave tokens at 0
  • Output: [PerRunResult, PerRunResult, ...]

Node 4: compute_stats_and_save (Python)

Input: run results from node 3 + meta from node 1.

  • Compute aggregates: success_rate, avg/min/max duration, total_retries, avg_tokens_prompt, avg_tokens_completion
  • Estimate cost: map model IDs to per-1k-token prices (hardcoded table in the script for now), compute avg_tokens_prompt * prompt_price + avg_tokens_completion * completion_price
  • Build a Benchmark record: use benchmark_set RPC via hero_logic
  • Output: the Benchmark SID and a short human summary

Flow inputs (Phase 1 format):

{
  "inputs": [
    {"name": "target_workflow_sid", "type": "string", "required": true, "description": "Workflow to benchmark"},
    {"name": "target_version_sid", "type": "string", "required": false, "default": "", "description": "Specific version (defaults to current)"},
    {"name": "num_runs", "type": "integer", "required": false, "default": "5", "description": "Number of mock input sets to test"},
    {"name": "input_hints", "type": "string", "required": false, "default": "", "description": "Hints for mock input generation"}
  ]
}

Code changes

crates/hero_logic/src/logic/server/rpc.rs

New RPC methods:

  • benchmark_list_for_workflow(workflow_sid) -> [Benchmark] — list all benchmarks for a workflow, newest first
  • benchmark_list_for_version(workflow_version_sid) -> [Benchmark] — list for a specific version
  • benchmark_latest_for_version(workflow_version_sid) -> Benchmark? — most recent benchmark for quick access

(CRUD for Benchmark — get/set/delete/find — is auto-generated from OSchema)

Cost estimation script (inside compute_stats_and_save)

Simple hardcoded price table; expand later:

PRICE_PER_1K = {
    "openai/gpt-4o-mini": (0.00015, 0.00060),  # (prompt, completion)
    "openai/gpt-4o": (0.0025, 0.01),
    "openai/gpt-5-pro": (0.015, 0.060),
    "anthropic/claude-haiku": (0.0008, 0.004),
    "anthropic/claude-sonnet-4.6": (0.003, 0.015),
    "anthropic/claude-opus-4": (0.015, 0.075),
    # fallback: 0
}

UI changes

File: crates/hero_logic_ui/templates/workflow_editor.html + JS

  • "Benchmark" button in the editor header next to "Start play". Clicking opens a modal with num_runs input (default 5) and optional input_hints. Submitting triggers benchmark_flow.
  • Benchmark history panel (left sidebar or new tab) per workflow version:
    • Latest stats card: success_rate, avg_duration_ms, estimated_cost_usd
    • History table: created_at, num_runs, success_rate, avg_duration, cost
    • Click a row → drawer with per-run breakdown
  • Dashboard workflow card: show latest benchmark stats inline (avg_dur: 12.5s · $0.02/run)

Integration with optimize_flow

optimize_flow can use the Benchmark record instead of duplicating metrics collection. Follow-up refactor: make optimize_flow delegate metric collection to benchmark_flow per config. Not required for this issue.

Acceptance criteria

  • Benchmark root object exists with all fields; CRUD works
  • benchmark_flow template loads and runs end-to-end
  • AI-generated mock inputs conform to the target workflow's declared inputs
  • Running benchmark_flow persists a Benchmark record linked to {workflow_sid, workflow_version_sid}
  • benchmark_list_for_workflow returns benchmarks for a workflow, ordered by created_at DESC
  • Workflow editor has a "Benchmark" button and benchmark history panel
  • Cost estimation works for known models (gpt-4o-mini, claude-sonnet, etc.); shows $0 for unknown
  • Per-run breakdown shows play_sid links (click to see the full play)

Out of scope

  • Token capture from hero_proc (currently returns 0s) — tracked separately
  • Comparative charts across versions — Phase 4 might introduce visualizations
  • Running benchmarks on a schedule
  • CI integration
## Context After Phases 1 & 2, workflows have versions and typed inputs. We need a way to measure how each version performs across realistic inputs, persisting the results so we can compare versions and track trends. This also lays the groundwork for Phase 4 (version router), which uses benchmark data to pick the right version per user request. This is similar to `optimize_flow` but without the optimization part — just run the workflow N times, collect stats, save them. No config mutation. ## Goal 1. New OSIS root object `Benchmark` storing per-run metrics 2. New flow template `benchmark_flow.json` that runs the benchmark 3. UI integration: "Benchmark" button on a workflow version, benchmark history visible on the workflow page ## Depends on - **#5 must be merged** (needs typed `Workflow.inputs` to generate mock inputs) - **#6** recommended (examples can seed mock inputs) ## OSchema changes File: `crates/hero_logic/schemas/logic/logic.oschema` ``` # PerRunResult captures one benchmark run's metrics. PerRunResult = { play_sid: str input_values: {str: str} # which inputs this run used status: str # success|failed|timeout duration_ms: u64 retries: u32 # count of script_execution node iterations > 0 tokens_prompt: u64 # 0 if not captured tokens_completion: u64 error: str # empty on success } # Benchmark is a persisted measurement of a workflow version. [rootobject] Benchmark = { sid: str workflow_sid: str workflow_version_sid: str created_at: int num_runs: u32 success_rate: f32 # 0.0-100.0 avg_duration_ms: u64 min_duration_ms: u64 max_duration_ms: u64 total_retries: u32 avg_tokens_prompt: u64 avg_tokens_completion: u64 estimated_cost_usd: f64 # computed from tokens if available, else 0 difficulty_rating: f32 # 0.0-1.0, set by future flows (#8), defaults to 0.5 runs: [PerRunResult] # full per-run data notes: str # optional human/AI notes } ``` ## New flow template File: `crates/hero_logic/templates/benchmark_flow.json` A 4-node flow: ### Node 1: `fetch_workflow_meta` (Python) Input: `target_workflow_sid`, optional `target_version_sid`. - RPC `workflow_get` to retrieve the workflow - Extract declared `inputs` (name, type, description, required, default) - If `target_version_sid` not provided, use `current_version_sid` - Output: `{workflow_sid, workflow_version_sid, inputs, name}` ### Node 2: `generate_mock_inputs` (AI) Input: workflow meta from node 1, `num_runs` parameter, optional `input_hints` (e.g., "realistic task prompts"). - System prompt: given this workflow's declared inputs and description, generate N realistic input value sets that exercise typical use cases. Output as JSON array of objects where each object maps input_name → value. - Output: `[{input_values}, {input_values}, ...]` ### Node 3: `run_plays` (Python) Input: mock inputs from node 2. - For each input set: `logicservice.play_start(workflow_sid, input_data=json.dumps(inputs), name=f"benchmark-{i}")` with the target version SID - Poll each play to completion - Collect PerRunResult for each (play_sid, status, duration_ms, retries, tokens_prompt, tokens_completion, error) - If `hero_proc` starts capturing tokens (currently returns 0), use those; otherwise leave tokens at 0 - Output: `[PerRunResult, PerRunResult, ...]` ### Node 4: `compute_stats_and_save` (Python) Input: run results from node 3 + meta from node 1. - Compute aggregates: success_rate, avg/min/max duration, total_retries, avg_tokens_prompt, avg_tokens_completion - Estimate cost: map model IDs to per-1k-token prices (hardcoded table in the script for now), compute `avg_tokens_prompt * prompt_price + avg_tokens_completion * completion_price` - Build a Benchmark record: use `benchmark_set` RPC via hero_logic - Output: the Benchmark SID and a short human summary Flow inputs (Phase 1 format): ```json { "inputs": [ {"name": "target_workflow_sid", "type": "string", "required": true, "description": "Workflow to benchmark"}, {"name": "target_version_sid", "type": "string", "required": false, "default": "", "description": "Specific version (defaults to current)"}, {"name": "num_runs", "type": "integer", "required": false, "default": "5", "description": "Number of mock input sets to test"}, {"name": "input_hints", "type": "string", "required": false, "default": "", "description": "Hints for mock input generation"} ] } ``` ## Code changes ### `crates/hero_logic/src/logic/server/rpc.rs` New RPC methods: - `benchmark_list_for_workflow(workflow_sid) -> [Benchmark]` — list all benchmarks for a workflow, newest first - `benchmark_list_for_version(workflow_version_sid) -> [Benchmark]` — list for a specific version - `benchmark_latest_for_version(workflow_version_sid) -> Benchmark?` — most recent benchmark for quick access (CRUD for Benchmark — get/set/delete/find — is auto-generated from OSchema) ### Cost estimation script (inside `compute_stats_and_save`) Simple hardcoded price table; expand later: ```python PRICE_PER_1K = { "openai/gpt-4o-mini": (0.00015, 0.00060), # (prompt, completion) "openai/gpt-4o": (0.0025, 0.01), "openai/gpt-5-pro": (0.015, 0.060), "anthropic/claude-haiku": (0.0008, 0.004), "anthropic/claude-sonnet-4.6": (0.003, 0.015), "anthropic/claude-opus-4": (0.015, 0.075), # fallback: 0 } ``` ## UI changes File: `crates/hero_logic_ui/templates/workflow_editor.html` + JS - **"Benchmark" button** in the editor header next to "Start play". Clicking opens a modal with `num_runs` input (default 5) and optional `input_hints`. Submitting triggers `benchmark_flow`. - **Benchmark history panel** (left sidebar or new tab) per workflow version: - Latest stats card: success_rate, avg_duration_ms, estimated_cost_usd - History table: created_at, num_runs, success_rate, avg_duration, cost - Click a row → drawer with per-run breakdown - **Dashboard workflow card**: show latest benchmark stats inline (`avg_dur: 12.5s · $0.02/run`) ### Integration with optimize_flow `optimize_flow` can use the Benchmark record instead of duplicating metrics collection. Follow-up refactor: make `optimize_flow` delegate metric collection to `benchmark_flow` per config. Not required for this issue. ## Acceptance criteria - [ ] `Benchmark` root object exists with all fields; CRUD works - [ ] `benchmark_flow` template loads and runs end-to-end - [ ] AI-generated mock inputs conform to the target workflow's declared `inputs` - [ ] Running benchmark_flow persists a Benchmark record linked to `{workflow_sid, workflow_version_sid}` - [ ] `benchmark_list_for_workflow` returns benchmarks for a workflow, ordered by created_at DESC - [ ] Workflow editor has a "Benchmark" button and benchmark history panel - [ ] Cost estimation works for known models (gpt-4o-mini, claude-sonnet, etc.); shows $0 for unknown - [ ] Per-run breakdown shows play_sid links (click to see the full play) ## Out of scope - Token capture from hero_proc (currently returns 0s) — tracked separately - Comparative charts across versions — Phase 4 might introduce visualizations - Running benchmarks on a schedule - CI integration
Author
Owner

Backend + template landed: 98831aa

Phase 3 is functional end-to-end. UI integration (Benchmark button, history panel) still to do.

What's live

OSchema:

  • PerRunResult: per-run metrics (play_sid, status, duration, retries, tokens, error, input_values_json)
  • Benchmark root object: aggregate measurement — success_rate, min/avg/max duration, retries, tokens, cost, difficulty_rating, runs[], linked to {workflow_sid, workflow_version_sid}

RPC methods:

  • benchmark_list_for_workflow(workflow_sid) — newest-first
  • benchmark_list_for_version(workflow_version_sid)
  • benchmark_latest_for_version(workflow_version_sid) — empty string when none

New template: benchmark_flow.json

4-node DAG:

  1. fetch_meta — loads Workflow record, parses declared inputs + current_version_sid from OTOML, calls workflow_version_fetch for the full version
  2. generate_inputs — AI generates N realistic input sets conforming to typed inputs
  3. run_and_measure — starts N plays, polls to completion, aggregates metrics, persists a Benchmark record via benchmark.set
  4. report — fetches the stored Benchmark and prints a summary

Verified end-to-end

$ play_start benchmark_flow({target_workflow_sid: 00dn, num_runs: 2})
→ fetch_meta:succ generate_inputs:succ run_and_measure:succ report:succ (35s)

======================================================================
BENCHMARK REPORT
======================================================================

Benchmark: 00du
Workflow:  00dn (version 00do)

Runs: 2
Success rate: 100.0%
Duration (min/avg/max): 11117ms / 13647ms / 16178ms
Total retries: 2
======================================================================

$ benchmark_list_for_workflow(00dn)
→ ['00du']  ✓

Cost tracking stub

estimated_cost_usd defaults to 0 because hero_proc doesn't report tokens per job yet. The price table lives in run_and_measure for easy swap-in when tokens become available:

PRICE_PER_1K = {
    'openai/gpt-4o-mini': (0.00015, 0.0006),
    'openai/gpt-4o': (0.0025, 0.01),
    'anthropic/claude-sonnet-4-6': (0.003, 0.015),
    ...
}

Remaining

  • Benchmark + PerRunResult types
  • benchmark_list_for_workflow / _for_version / _latest_for_version RPC methods
  • benchmark_flow.json template (4 nodes)
  • AI-generated mock inputs conform to declared workflow inputs
  • Persisted Benchmark record linked to {workflow_sid, workflow_version_sid}
  • UI: Benchmark button on workflow page
  • UI: Benchmark history table + latest stats card
  • UI: Per-run drawer with play_sid links
  • Token/cost capture once hero_proc exposes it

Moving to Phase 4 (#8) next.

## Backend + template landed: 98831aa Phase 3 is functional end-to-end. UI integration (Benchmark button, history panel) still to do. ### What's live **OSchema:** - `PerRunResult`: per-run metrics (play_sid, status, duration, retries, tokens, error, input_values_json) - `Benchmark` root object: aggregate measurement — success_rate, min/avg/max duration, retries, tokens, cost, difficulty_rating, runs[], linked to `{workflow_sid, workflow_version_sid}` **RPC methods:** - `benchmark_list_for_workflow(workflow_sid)` — newest-first - `benchmark_list_for_version(workflow_version_sid)` - `benchmark_latest_for_version(workflow_version_sid)` — empty string when none **New template: `benchmark_flow.json`** 4-node DAG: 1. `fetch_meta` — loads Workflow record, parses declared `inputs` + `current_version_sid` from OTOML, calls `workflow_version_fetch` for the full version 2. `generate_inputs` — AI generates N realistic input sets conforming to typed `inputs` 3. `run_and_measure` — starts N plays, polls to completion, aggregates metrics, persists a `Benchmark` record via `benchmark.set` 4. `report` — fetches the stored Benchmark and prints a summary ### Verified end-to-end ``` $ play_start benchmark_flow({target_workflow_sid: 00dn, num_runs: 2}) → fetch_meta:succ generate_inputs:succ run_and_measure:succ report:succ (35s) ====================================================================== BENCHMARK REPORT ====================================================================== Benchmark: 00du Workflow: 00dn (version 00do) Runs: 2 Success rate: 100.0% Duration (min/avg/max): 11117ms / 13647ms / 16178ms Total retries: 2 ====================================================================== $ benchmark_list_for_workflow(00dn) → ['00du'] ✓ ``` ### Cost tracking stub `estimated_cost_usd` defaults to 0 because `hero_proc` doesn't report tokens per job yet. The price table lives in `run_and_measure` for easy swap-in when tokens become available: ```python PRICE_PER_1K = { 'openai/gpt-4o-mini': (0.00015, 0.0006), 'openai/gpt-4o': (0.0025, 0.01), 'anthropic/claude-sonnet-4-6': (0.003, 0.015), ... } ``` ### Remaining - [x] `Benchmark` + `PerRunResult` types - [x] `benchmark_list_for_workflow` / `_for_version` / `_latest_for_version` RPC methods - [x] `benchmark_flow.json` template (4 nodes) - [x] AI-generated mock inputs conform to declared workflow inputs - [x] Persisted Benchmark record linked to {workflow_sid, workflow_version_sid} - [ ] **UI: Benchmark button on workflow page** - [ ] **UI: Benchmark history table + latest stats card** - [ ] **UI: Per-run drawer with play_sid links** - [ ] Token/cost capture once hero_proc exposes it Moving to Phase 4 (#8) next.
Author
Owner

Phase 3 UI landed (commit 761b396)

Benchmark controls + history panel are now in the workflow editor:

  • Topbar Benchmark button opens a config modal (num_runs, input_hints).
  • Submit calls workflow_from_template("benchmark_flow") then play_start on the resulting workflow with {target_workflow_sid, target_version_sid, num_runs, input_hints} as structured input_data, then redirects to the play view so the user can watch the benchmark execute.
  • Sidebar Benchmarks panel lists the 10 most recent Benchmark records for the current WorkflowVersion via benchmark_list_for_version, rendering {success_rate, num_runs, avg_duration, estimated_cost} per row. Refresh button re-fetches.
  • fetchBenchmarkScalars parses the OTOML returned by benchmark.get inline (no JS TOML lib required) — extracts only the flat scalar summary fields.

Verified:

  • Editor page now includes openBenchmarkModal, refreshBenchmarks, hl-benchmarks-list, hl-benchmark-modal in rendered HTML
  • benchmark_list_for_version(00e0) returns [] (expected — no benchmarks yet)

Still open for this issue:

  • Per-run drawer: click a benchmark row → modal showing the PerRunResult list
  • Auto-refresh the sidebar panel after the benchmark play completes (currently requires the refresh button)
  • Token/cost capture: populate avg_tokens_* / estimated_cost_usd by having hero_proc expose per-job token counts to the executor
### Phase 3 UI landed (commit 761b396) Benchmark controls + history panel are now in the workflow editor: - **Topbar `Benchmark` button** opens a config modal (`num_runs`, `input_hints`). - Submit calls `workflow_from_template("benchmark_flow")` then `play_start` on the resulting workflow with `{target_workflow_sid, target_version_sid, num_runs, input_hints}` as structured `input_data`, then redirects to the play view so the user can watch the benchmark execute. - **Sidebar `Benchmarks` panel** lists the 10 most recent `Benchmark` records for the current `WorkflowVersion` via `benchmark_list_for_version`, rendering `{success_rate, num_runs, avg_duration, estimated_cost}` per row. Refresh button re-fetches. - `fetchBenchmarkScalars` parses the OTOML returned by `benchmark.get` inline (no JS TOML lib required) — extracts only the flat scalar summary fields. **Verified:** - Editor page now includes `openBenchmarkModal`, `refreshBenchmarks`, `hl-benchmarks-list`, `hl-benchmark-modal` in rendered HTML - `benchmark_list_for_version(00e0)` returns `[]` (expected — no benchmarks yet) **Still open for this issue:** - Per-run drawer: click a benchmark row → modal showing the PerRunResult list - Auto-refresh the sidebar panel after the benchmark play completes (currently requires the refresh button) - Token/cost capture: populate `avg_tokens_*` / `estimated_cost_usd` by having hero_proc expose per-job token counts to the executor
Sign in to join this conversation.
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_logic#7
No description provided.