lhumina_code/hero_logic

Fork 0

Phase 3: Benchmark flow — statistics per workflow version #7

New issue

Open

opened 2026-04-19 16:33:13 +00:00 by timur · 2 comments

timur commented

2026-04-19 16:33:13 +00:00

Owner

Context

After Phases 1 & 2, workflows have versions and typed inputs. We need a way to measure how each version performs across realistic inputs, persisting the results so we can compare versions and track trends.

This also lays the groundwork for Phase 4 (version router), which uses benchmark data to pick the right version per user request.

This is similar to optimize_flow but without the optimization part — just run the workflow N times, collect stats, save them. No config mutation.

Goal

New OSIS root object Benchmark storing per-run metrics
New flow template benchmark_flow.json that runs the benchmark
UI integration: "Benchmark" button on a workflow version, benchmark history visible on the workflow page

Depends on

#5 must be merged (needs typed Workflow.inputs to generate mock inputs)
#6 recommended (examples can seed mock inputs)

OSchema changes

File: crates/hero_logic/schemas/logic/logic.oschema

# PerRunResult captures one benchmark run's metrics.
PerRunResult = {
    play_sid: str
    input_values: {str: str}    # which inputs this run used
    status: str                 # success|failed|timeout
    duration_ms: u64
    retries: u32                # count of script_execution node iterations > 0
    tokens_prompt: u64          # 0 if not captured
    tokens_completion: u64
    error: str                  # empty on success
}

# Benchmark is a persisted measurement of a workflow version. [rootobject]
Benchmark = {
    sid: str
    workflow_sid: str
    workflow_version_sid: str
    created_at: int
    num_runs: u32
    success_rate: f32           # 0.0-100.0
    avg_duration_ms: u64
    min_duration_ms: u64
    max_duration_ms: u64
    total_retries: u32
    avg_tokens_prompt: u64
    avg_tokens_completion: u64
    estimated_cost_usd: f64     # computed from tokens if available, else 0
    difficulty_rating: f32      # 0.0-1.0, set by future flows (#8), defaults to 0.5
    runs: [PerRunResult]        # full per-run data
    notes: str                  # optional human/AI notes
}

New flow template

File: crates/hero_logic/templates/benchmark_flow.json

A 4-node flow:

Node 1: `fetch_workflow_meta` (Python)

Input: target_workflow_sid, optional target_version_sid.

RPC workflow_get to retrieve the workflow
Extract declared inputs (name, type, description, required, default)
If target_version_sid not provided, use current_version_sid
Output: {workflow_sid, workflow_version_sid, inputs, name}

Node 2: `generate_mock_inputs` (AI)

Input: workflow meta from node 1, num_runs parameter, optional input_hints (e.g., "realistic task prompts").

System prompt: given this workflow's declared inputs and description, generate N realistic input value sets that exercise typical use cases. Output as JSON array of objects where each object maps input_name → value.
Output: [{input_values}, {input_values}, ...]

Node 3: `run_plays` (Python)

Input: mock inputs from node 2.

For each input set: logicservice.play_start(workflow_sid, input_data=json.dumps(inputs), name=f"benchmark-{i}") with the target version SID
Poll each play to completion
Collect PerRunResult for each (play_sid, status, duration_ms, retries, tokens_prompt, tokens_completion, error)
If hero_proc starts capturing tokens (currently returns 0), use those; otherwise leave tokens at 0
Output: [PerRunResult, PerRunResult, ...]

Node 4: `compute_stats_and_save` (Python)

Input: run results from node 3 + meta from node 1.

Compute aggregates: success_rate, avg/min/max duration, total_retries, avg_tokens_prompt, avg_tokens_completion
Estimate cost: map model IDs to per-1k-token prices (hardcoded table in the script for now), compute avg_tokens_prompt * prompt_price + avg_tokens_completion * completion_price
Build a Benchmark record: use benchmark_set RPC via hero_logic
Output: the Benchmark SID and a short human summary

Flow inputs (Phase 1 format):

{
  "inputs": [
    {"name": "target_workflow_sid", "type": "string", "required": true, "description": "Workflow to benchmark"},
    {"name": "target_version_sid", "type": "string", "required": false, "default": "", "description": "Specific version (defaults to current)"},
    {"name": "num_runs", "type": "integer", "required": false, "default": "5", "description": "Number of mock input sets to test"},
    {"name": "input_hints", "type": "string", "required": false, "default": "", "description": "Hints for mock input generation"}
  ]
}

Code changes

`crates/hero_logic/src/logic/server/rpc.rs`

New RPC methods:

benchmark_list_for_workflow(workflow_sid) -> [Benchmark] — list all benchmarks for a workflow, newest first
benchmark_list_for_version(workflow_version_sid) -> [Benchmark] — list for a specific version
benchmark_latest_for_version(workflow_version_sid) -> Benchmark? — most recent benchmark for quick access

(CRUD for Benchmark — get/set/delete/find — is auto-generated from OSchema)

Cost estimation script (inside `compute_stats_and_save`)

Simple hardcoded price table; expand later:

PRICE_PER_1K = {
    "openai/gpt-4o-mini": (0.00015, 0.00060),  # (prompt, completion)
    "openai/gpt-4o": (0.0025, 0.01),
    "openai/gpt-5-pro": (0.015, 0.060),
    "anthropic/claude-haiku": (0.0008, 0.004),
    "anthropic/claude-sonnet-4.6": (0.003, 0.015),
    "anthropic/claude-opus-4": (0.015, 0.075),
    # fallback: 0
}

UI changes

File: crates/hero_logic_ui/templates/workflow_editor.html + JS

"Benchmark" button in the editor header next to "Start play". Clicking opens a modal with num_runs input (default 5) and optional input_hints. Submitting triggers benchmark_flow.
Benchmark history panel (left sidebar or new tab) per workflow version:
- Latest stats card: success_rate, avg_duration_ms, estimated_cost_usd
- History table: created_at, num_runs, success_rate, avg_duration, cost
- Click a row → drawer with per-run breakdown
Dashboard workflow card: show latest benchmark stats inline (avg_dur: 12.5s · $0.02/run)

Integration with optimize_flow

optimize_flow can use the Benchmark record instead of duplicating metrics collection. Follow-up refactor: make optimize_flow delegate metric collection to benchmark_flow per config. Not required for this issue.

Acceptance criteria

Benchmark root object exists with all fields; CRUD works
benchmark_flow template loads and runs end-to-end
AI-generated mock inputs conform to the target workflow's declared inputs
Running benchmark_flow persists a Benchmark record linked to {workflow_sid, workflow_version_sid}
benchmark_list_for_workflow returns benchmarks for a workflow, ordered by created_at DESC
Workflow editor has a "Benchmark" button and benchmark history panel
Cost estimation works for known models (gpt-4o-mini, claude-sonnet, etc.); shows $0 for unknown
Per-run breakdown shows play_sid links (click to see the full play)

Out of scope

Token capture from hero_proc (currently returns 0s) — tracked separately
Comparative charts across versions — Phase 4 might introduce visualizations
Running benchmarks on a schedule
CI integration

## Context After Phases 1 & 2, workflows have versions and typed inputs. We need a way to measure how each version performs across realistic inputs, persisting the results so we can compare versions and track trends. This also lays the groundwork for Phase 4 (version router), which uses benchmark data to pick the right version per user request. This is similar to `optimize_flow` but without the optimization part — just run the workflow N times, collect stats, save them. No config mutation. ## Goal 1. New OSIS root object `Benchmark` storing per-run metrics 2. New flow template `benchmark_flow.json` that runs the benchmark 3. UI integration: "Benchmark" button on a workflow version, benchmark history visible on the workflow page ## Depends on - **#5 must be merged** (needs typed `Workflow.inputs` to generate mock inputs) - **#6** recommended (examples can seed mock inputs) ## OSchema changes File: `crates/hero_logic/schemas/logic/logic.oschema` ``` # PerRunResult captures one benchmark run's metrics. PerRunResult = { play_sid: str input_values: {str: str} # which inputs this run used status: str # success|failed|timeout duration_ms: u64 retries: u32 # count of script_execution node iterations > 0 tokens_prompt: u64 # 0 if not captured tokens_completion: u64 error: str # empty on success } # Benchmark is a persisted measurement of a workflow version. [rootobject] Benchmark = { sid: str workflow_sid: str workflow_version_sid: str created_at: int num_runs: u32 success_rate: f32 # 0.0-100.0 avg_duration_ms: u64 min_duration_ms: u64 max_duration_ms: u64 total_retries: u32 avg_tokens_prompt: u64 avg_tokens_completion: u64 estimated_cost_usd: f64 # computed from tokens if available, else 0 difficulty_rating: f32 # 0.0-1.0, set by future flows (#8), defaults to 0.5 runs: [PerRunResult] # full per-run data notes: str # optional human/AI notes } ``` ## New flow template File: `crates/hero_logic/templates/benchmark_flow.json` A 4-node flow: ### Node 1: `fetch_workflow_meta` (Python) Input: `target_workflow_sid`, optional `target_version_sid`. - RPC `workflow_get` to retrieve the workflow - Extract declared `inputs` (name, type, description, required, default) - If `target_version_sid` not provided, use `current_version_sid` - Output: `{workflow_sid, workflow_version_sid, inputs, name}` ### Node 2: `generate_mock_inputs` (AI) Input: workflow meta from node 1, `num_runs` parameter, optional `input_hints` (e.g., "realistic task prompts"). - System prompt: given this workflow's declared inputs and description, generate N realistic input value sets that exercise typical use cases. Output as JSON array of objects where each object maps input_name → value. - Output: `[{input_values}, {input_values}, ...]` ### Node 3: `run_plays` (Python) Input: mock inputs from node 2. - For each input set: `logicservice.play_start(workflow_sid, input_data=json.dumps(inputs), name=f"benchmark-{i}")` with the target version SID - Poll each play to completion - Collect PerRunResult for each (play_sid, status, duration_ms, retries, tokens_prompt, tokens_completion, error) - If `hero_proc` starts capturing tokens (currently returns 0), use those; otherwise leave tokens at 0 - Output: `[PerRunResult, PerRunResult, ...]` ### Node 4: `compute_stats_and_save` (Python) Input: run results from node 3 + meta from node 1. - Compute aggregates: success_rate, avg/min/max duration, total_retries, avg_tokens_prompt, avg_tokens_completion - Estimate cost: map model IDs to per-1k-token prices (hardcoded table in the script for now), compute `avg_tokens_prompt * prompt_price + avg_tokens_completion * completion_price` - Build a Benchmark record: use `benchmark_set` RPC via hero_logic - Output: the Benchmark SID and a short human summary Flow inputs (Phase 1 format): ```json { "inputs": [ {"name": "target_workflow_sid", "type": "string", "required": true, "description": "Workflow to benchmark"}, {"name": "target_version_sid", "type": "string", "required": false, "default": "", "description": "Specific version (defaults to current)"}, {"name": "num_runs", "type": "integer", "required": false, "default": "5", "description": "Number of mock input sets to test"}, {"name": "input_hints", "type": "string", "required": false, "default": "", "description": "Hints for mock input generation"} ] } ``` ## Code changes ### `crates/hero_logic/src/logic/server/rpc.rs` New RPC methods: - `benchmark_list_for_workflow(workflow_sid) -> [Benchmark]` — list all benchmarks for a workflow, newest first - `benchmark_list_for_version(workflow_version_sid) -> [Benchmark]` — list for a specific version - `benchmark_latest_for_version(workflow_version_sid) -> Benchmark?` — most recent benchmark for quick access (CRUD for Benchmark — get/set/delete/find — is auto-generated from OSchema) ### Cost estimation script (inside `compute_stats_and_save`) Simple hardcoded price table; expand later: ```python PRICE_PER_1K = { "openai/gpt-4o-mini": (0.00015, 0.00060), # (prompt, completion) "openai/gpt-4o": (0.0025, 0.01), "openai/gpt-5-pro": (0.015, 0.060), "anthropic/claude-haiku": (0.0008, 0.004), "anthropic/claude-sonnet-4.6": (0.003, 0.015), "anthropic/claude-opus-4": (0.015, 0.075), # fallback: 0 } ``` ## UI changes File: `crates/hero_logic_ui/templates/workflow_editor.html` + JS - **"Benchmark" button** in the editor header next to "Start play". Clicking opens a modal with `num_runs` input (default 5) and optional `input_hints`. Submitting triggers `benchmark_flow`. - **Benchmark history panel** (left sidebar or new tab) per workflow version: - Latest stats card: success_rate, avg_duration_ms, estimated_cost_usd - History table: created_at, num_runs, success_rate, avg_duration, cost - Click a row → drawer with per-run breakdown - **Dashboard workflow card**: show latest benchmark stats inline (`avg_dur: 12.5s · $0.02/run`) ### Integration with optimize_flow `optimize_flow` can use the Benchmark record instead of duplicating metrics collection. Follow-up refactor: make `optimize_flow` delegate metric collection to `benchmark_flow` per config. Not required for this issue. ## Acceptance criteria - [ ] `Benchmark` root object exists with all fields; CRUD works - [ ] `benchmark_flow` template loads and runs end-to-end - [ ] AI-generated mock inputs conform to the target workflow's declared `inputs` - [ ] Running benchmark_flow persists a Benchmark record linked to `{workflow_sid, workflow_version_sid}` - [ ] `benchmark_list_for_workflow` returns benchmarks for a workflow, ordered by created_at DESC - [ ] Workflow editor has a "Benchmark" button and benchmark history panel - [ ] Cost estimation works for known models (gpt-4o-mini, claude-sonnet, etc.); shows $0 for unknown - [ ] Per-run breakdown shows play_sid links (click to see the full play) ## Out of scope - Token capture from hero_proc (currently returns 0s) — tracked separately - Comparative charts across versions — Phase 4 might introduce visualizations - Running benchmarks on a schedule - CI integration

timur referenced this issue

2026-04-19 16:34:07 +00:00

Phase 4: Request-based version router — pick_version_flow #8

timur referenced this issue from a commit

2026-04-19 17:01:27 +00:00

feat(phase1): workflow versioning + typed flow inputs (#5)

timur referenced this issue

2026-04-19 17:29:41 +00:00

Phase 2: Examples use typed flow inputs #6

timur referenced this issue from a commit

2026-04-19 18:02:35 +00:00

feat(phase3): benchmark_flow + Benchmark persistence (#7)

timur commented

2026-04-19 18:03:05 +00:00

Author

Owner

Backend + template landed: `98831aa`

Phase 3 is functional end-to-end. UI integration (Benchmark button, history panel) still to do.

What's live

OSchema:

PerRunResult: per-run metrics (play_sid, status, duration, retries, tokens, error, input_values_json)
Benchmark root object: aggregate measurement — success_rate, min/avg/max duration, retries, tokens, cost, difficulty_rating, runs[], linked to {workflow_sid, workflow_version_sid}

RPC methods:

benchmark_list_for_workflow(workflow_sid) — newest-first
benchmark_list_for_version(workflow_version_sid)
benchmark_latest_for_version(workflow_version_sid) — empty string when none

New template: benchmark_flow.json

4-node DAG:

fetch_meta — loads Workflow record, parses declared inputs + current_version_sid from OTOML, calls workflow_version_fetch for the full version
generate_inputs — AI generates N realistic input sets conforming to typed inputs
run_and_measure — starts N plays, polls to completion, aggregates metrics, persists a Benchmark record via benchmark.set
report — fetches the stored Benchmark and prints a summary

Verified end-to-end

$ play_start benchmark_flow({target_workflow_sid: 00dn, num_runs: 2})
→ fetch_meta:succ generate_inputs:succ run_and_measure:succ report:succ (35s)

======================================================================
BENCHMARK REPORT
======================================================================

Benchmark: 00du
Workflow:  00dn (version 00do)

Runs: 2
Success rate: 100.0%
Duration (min/avg/max): 11117ms / 13647ms / 16178ms
Total retries: 2
======================================================================

$ benchmark_list_for_workflow(00dn)
→ ['00du']  ✓

Cost tracking stub

estimated_cost_usd defaults to 0 because hero_proc doesn't report tokens per job yet. The price table lives in run_and_measure for easy swap-in when tokens become available:

PRICE_PER_1K = {
    'openai/gpt-4o-mini': (0.00015, 0.0006),
    'openai/gpt-4o': (0.0025, 0.01),
    'anthropic/claude-sonnet-4-6': (0.003, 0.015),
    ...
}

Remaining

Benchmark + PerRunResult types
benchmark_list_for_workflow / _for_version / _latest_for_version RPC methods
benchmark_flow.json template (4 nodes)
AI-generated mock inputs conform to declared workflow inputs
Persisted Benchmark record linked to {workflow_sid, workflow_version_sid}
UI: Benchmark button on workflow page
UI: Benchmark history table + latest stats card
UI: Per-run drawer with play_sid links
Token/cost capture once hero_proc exposes it

Moving to Phase 4 (#8) next.

## Backend + template landed: 98831aa Phase 3 is functional end-to-end. UI integration (Benchmark button, history panel) still to do. ### What's live **OSchema:** - `PerRunResult`: per-run metrics (play_sid, status, duration, retries, tokens, error, input_values_json) - `Benchmark` root object: aggregate measurement — success_rate, min/avg/max duration, retries, tokens, cost, difficulty_rating, runs[], linked to `{workflow_sid, workflow_version_sid}` **RPC methods:** - `benchmark_list_for_workflow(workflow_sid)` — newest-first - `benchmark_list_for_version(workflow_version_sid)` - `benchmark_latest_for_version(workflow_version_sid)` — empty string when none **New template: `benchmark_flow.json`** 4-node DAG: 1. `fetch_meta` — loads Workflow record, parses declared `inputs` + `current_version_sid` from OTOML, calls `workflow_version_fetch` for the full version 2. `generate_inputs` — AI generates N realistic input sets conforming to typed `inputs` 3. `run_and_measure` — starts N plays, polls to completion, aggregates metrics, persists a `Benchmark` record via `benchmark.set` 4. `report` — fetches the stored Benchmark and prints a summary ### Verified end-to-end ``` $ play_start benchmark_flow({target_workflow_sid: 00dn, num_runs: 2}) → fetch_meta:succ generate_inputs:succ run_and_measure:succ report:succ (35s) ====================================================================== BENCHMARK REPORT ====================================================================== Benchmark: 00du Workflow: 00dn (version 00do) Runs: 2 Success rate: 100.0% Duration (min/avg/max): 11117ms / 13647ms / 16178ms Total retries: 2 ====================================================================== $ benchmark_list_for_workflow(00dn) → ['00du'] ✓ ``` ### Cost tracking stub `estimated_cost_usd` defaults to 0 because `hero_proc` doesn't report tokens per job yet. The price table lives in `run_and_measure` for easy swap-in when tokens become available: ```python PRICE_PER_1K = { 'openai/gpt-4o-mini': (0.00015, 0.0006), 'openai/gpt-4o': (0.0025, 0.01), 'anthropic/claude-sonnet-4-6': (0.003, 0.015), ... } ``` ### Remaining - [x] `Benchmark` + `PerRunResult` types - [x] `benchmark_list_for_workflow` / `_for_version` / `_latest_for_version` RPC methods - [x] `benchmark_flow.json` template (4 nodes) - [x] AI-generated mock inputs conform to declared workflow inputs - [x] Persisted Benchmark record linked to {workflow_sid, workflow_version_sid} - [ ] **UI: Benchmark button on workflow page** - [ ] **UI: Benchmark history table + latest stats card** - [ ] **UI: Per-run drawer with play_sid links** - [ ] Token/cost capture once hero_proc exposes it Moving to Phase 4 (#8) next.

timur referenced this issue

2026-04-19 18:08:06 +00:00

Phase 4: Request-based version router — pick_version_flow #8

timur referenced this issue from a commit

2026-04-19 18:29:08 +00:00

feat(phase3-ui): benchmark button + history panel (#7)

timur commented

2026-04-19 18:29:20 +00:00

Author

Owner

Phase 3 UI landed (commit `761b396`)

Benchmark controls + history panel are now in the workflow editor:

Topbar Benchmark button opens a config modal (num_runs, input_hints).
Submit calls workflow_from_template("benchmark_flow") then play_start on the resulting workflow with {target_workflow_sid, target_version_sid, num_runs, input_hints} as structured input_data, then redirects to the play view so the user can watch the benchmark execute.
Sidebar Benchmarks panel lists the 10 most recent Benchmark records for the current WorkflowVersion via benchmark_list_for_version, rendering {success_rate, num_runs, avg_duration, estimated_cost} per row. Refresh button re-fetches.
fetchBenchmarkScalars parses the OTOML returned by benchmark.get inline (no JS TOML lib required) — extracts only the flat scalar summary fields.

Verified:

Editor page now includes openBenchmarkModal, refreshBenchmarks, hl-benchmarks-list, hl-benchmark-modal in rendered HTML
benchmark_list_for_version(00e0) returns [] (expected — no benchmarks yet)

Still open for this issue:

Per-run drawer: click a benchmark row → modal showing the PerRunResult list
Auto-refresh the sidebar panel after the benchmark play completes (currently requires the refresh button)
Token/cost capture: populate avg_tokens_* / estimated_cost_usd by having hero_proc expose per-job token counts to the executor

### Phase 3 UI landed (commit 761b396) Benchmark controls + history panel are now in the workflow editor: - **Topbar `Benchmark` button** opens a config modal (`num_runs`, `input_hints`). - Submit calls `workflow_from_template("benchmark_flow")` then `play_start` on the resulting workflow with `{target_workflow_sid, target_version_sid, num_runs, input_hints}` as structured `input_data`, then redirects to the play view so the user can watch the benchmark execute. - **Sidebar `Benchmarks` panel** lists the 10 most recent `Benchmark` records for the current `WorkflowVersion` via `benchmark_list_for_version`, rendering `{success_rate, num_runs, avg_duration, estimated_cost}` per row. Refresh button re-fetches. - `fetchBenchmarkScalars` parses the OTOML returned by `benchmark.get` inline (no JS TOML lib required) — extracts only the flat scalar summary fields. **Verified:** - Editor page now includes `openBenchmarkModal`, `refreshBenchmarks`, `hl-benchmarks-list`, `hl-benchmark-modal` in rendered HTML - `benchmark_list_for_version(00e0)` returns `[]` (expected — no benchmarks yet) **Still open for this issue:** - Per-run drawer: click a benchmark row → modal showing the PerRunResult list - Auto-refresh the sidebar panel after the benchmark play completes (currently requires the refresh button) - Token/cost capture: populate `avg_tokens_*` / `estimated_cost_usd` by having hero_proc expose per-job token counts to the executor

No labels

No milestone

No project

No assignees

1 participant

Notifications

Due date

The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference

lhumina_code/hero_logic#7

No description provided.

Rows
Columns

Phase 3: Benchmark flow — statistics per workflow version #7

Context

Goal

Depends on

OSchema changes

New flow template

Node 1: fetch_workflow_meta (Python)

Node 2: generate_mock_inputs (AI)

Node 3: run_plays (Python)

Node 4: compute_stats_and_save (Python)

Code changes

crates/hero_logic/src/logic/server/rpc.rs

Cost estimation script (inside compute_stats_and_save)

UI changes

Integration with optimize_flow

Acceptance criteria

Out of scope

Backend + template landed: 98831aa

What's live

Verified end-to-end

Cost tracking stub

Remaining

Phase 3 UI landed (commit 761b396)

Node 1: `fetch_workflow_meta` (Python)

Node 2: `generate_mock_inputs` (AI)

Node 3: `run_plays` (Python)

Node 4: `compute_stats_and_save` (Python)

`crates/hero_logic/src/logic/server/rpc.rs`

Cost estimation script (inside `compute_stats_and_save`)

Backend + template landed: `98831aa`

Phase 3 UI landed (commit `761b396`)