Phase 3: Benchmark flow — statistics per workflow version #7
Context
After Phases 1 & 2, workflows have versions and typed inputs. We need a way to measure how each version performs across realistic inputs, persisting the results so we can compare versions and track trends.
This also lays the groundwork for Phase 4 (version router), which uses benchmark data to pick the right version per user request.
This is similar to optimize_flow but without the optimization part: just run the workflow N times, collect stats, and save them. No config mutation.
Goal
A Benchmark root object storing per-run metrics
A benchmark_flow.json template that runs the benchmark
Depends on
Workflow.inputs (to generate mock inputs)
OSchema changes
File: crates/hero_logic/schemas/logic/logic.oschema
New flow template
File: crates/hero_logic/templates/benchmark_flow.json
A 4-node flow:
Node 1: fetch_workflow_meta (Python)
Input: target_workflow_sid, optional target_version_sid.
Calls workflow_get to retrieve the workflow.
Reads the declared inputs (name, type, description, required, default).
If target_version_sid is not provided, uses current_version_sid.
Output: {workflow_sid, workflow_version_sid, inputs, name}

Node 2: generate_mock_inputs (AI)
Input: workflow meta from node 1, num_runs parameter, optional input_hints (e.g., "realistic task prompts").
Output: [{input_values}, {input_values}, ...]

Node 3: run_plays (Python)
Input: mock inputs from node 2.
For each input set, calls logicservice.play_start(workflow_sid, input_data=json.dumps(inputs), name=f"benchmark-{i}") with the target version SID.
If hero_proc starts capturing tokens (currently returns 0), use those; otherwise leave tokens at 0.
Output: [PerRunResult, PerRunResult, ...]

Node 4: compute_stats_and_save (Python)
Input: run results from node 3 + meta from node 1.
Estimated cost: avg_tokens_prompt * prompt_price + avg_tokens_completion * completion_price
Saves the result with the benchmark_set RPC via hero_logic.

Flow inputs (Phase 1 format):
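As an illustration only, here is a minimal Python sketch of the inputs this flow takes, using the names that appear in this issue (target_workflow_sid, target_version_sid, num_runs, input_hints). It does not reproduce the Phase 1 declaration syntax. The BenchmarkFlowInputs class, its validation rules, and the to_input_data helper are hypothetical, and the default of 5 for num_runs is borrowed from the UI section.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class BenchmarkFlowInputs:
    """Hypothetical model of the benchmark flow inputs (not the Phase 1 format)."""

    target_workflow_sid: str
    # When omitted, node 1 is described as falling back to current_version_sid.
    target_version_sid: Optional[str] = None
    # Default taken from the UI section of this issue (assumption for the flow itself).
    num_runs: int = 5
    input_hints: Optional[str] = None

    def __post_init__(self) -> None:
        # Illustrative validation; the real flow may validate differently.
        if not self.target_workflow_sid:
            raise ValueError("target_workflow_sid is required")
        if self.num_runs < 1:
            raise ValueError("num_runs must be at least 1")

    def to_input_data(self) -> str:
        """Serialize to a JSON string, the shape play_start's input_data is shown taking."""
        return json.dumps(asdict(self), sort_keys=True)
```

A caller would build the object once and pass `to_input_data()` wherever a JSON string of inputs is expected.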
Code changes
crates/hero_logic/src/logic/server/rpc.rs
New RPC methods:
benchmark_list_for_workflow(workflow_sid) -> [Benchmark] lists all benchmarks for a workflow, newest first
benchmark_list_for_version(workflow_version_sid) -> [Benchmark] lists benchmarks for a specific version
benchmark_latest_for_version(workflow_version_sid) -> Benchmark? returns the most recent benchmark for quick access
(CRUD for Benchmark, i.e. get/set/delete/find, is auto-generated from OSchema.)
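The intended semantics of the three methods can be sketched over an in-memory list. This is a Python model for illustration, not the Rust code in rpc.rs. The Benchmark dataclass below carries only the fields needed here, created_at is assumed to be a sortable integer timestamp, and newest-first ordering for the per-version list is an assumption (the issue states it only for the per-workflow list).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Benchmark:
    """Trimmed, hypothetical stand-in for the Benchmark root object."""

    sid: str
    workflow_sid: str
    workflow_version_sid: str
    created_at: int  # assumed: sortable timestamp, larger means newer


def benchmark_list_for_workflow(records: List[Benchmark], workflow_sid: str) -> List[Benchmark]:
    """All benchmarks for a workflow, newest first (created_at DESC)."""
    matches = [b for b in records if b.workflow_sid == workflow_sid]
    return sorted(matches, key=lambda b: b.created_at, reverse=True)


def benchmark_list_for_version(records: List[Benchmark], workflow_version_sid: str) -> List[Benchmark]:
    """All benchmarks for one version; newest-first ordering is an assumption."""
    matches = [b for b in records if b.workflow_version_sid == workflow_version_sid]
    return sorted(matches, key=lambda b: b.created_at, reverse=True)


def benchmark_latest_for_version(records: List[Benchmark], workflow_version_sid: str) -> Optional[Benchmark]:
    """Most recent benchmark for a version, or None when there is none."""
    listed = benchmark_list_for_version(records, workflow_version_sid)
    return listed[0] if listed else None
```

Building the "latest" lookup on top of the per-version list keeps the ordering rule in one place, at the cost of sorting when only the maximum is needed.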
Cost estimation script (inside compute_stats_and_save)
Simple hardcoded price table; expand later:
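A minimal sketch of what such a table and the cost formula from Node 4 could look like. The model name and per-token prices are placeholders, not real rates, and the function name estimate_cost_usd is hypothetical. Only the formula (avg_tokens_prompt * prompt_price + avg_tokens_completion * completion_price) comes from this issue.

```python
# Placeholder prices in USD per token. These values are made up for illustration;
# replace them with real rates for the models actually in use.
PRICE_TABLE = {
    "example-model": {"prompt": 0.000001, "completion": 0.000002},
}


def estimate_cost_usd(avg_tokens_prompt: float,
                      avg_tokens_completion: float,
                      model: str = "example-model") -> float:
    """Estimated cost per run from average token counts.

    Returns 0.0 for a model missing from the table, which matches leaving
    the cost at 0 while token counts are not yet reported.
    """
    prices = PRICE_TABLE.get(model)
    if prices is None:
        return 0.0
    return (avg_tokens_prompt * prices["prompt"]
            + avg_tokens_completion * prices["completion"])
```

With token counts at 0, as they currently are, the estimate is 0.0 regardless of the table contents.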
UI changes
File: crates/hero_logic_ui/templates/workflow_editor.html + JS
Benchmark modal with a num_runs input (default 5) and optional input_hints. Submitting triggers benchmark_flow.
Benchmark summary display (e.g., avg_dur: 12.5s · $0.02/run)
Integration with optimize_flow
optimize_flow can use the Benchmark record instead of duplicating metrics collection. Follow-up refactor: make optimize_flow delegate metric collection to benchmark_flow per config. Not required for this issue.
Acceptance criteria
Benchmark root object exists with all fields; CRUD works
benchmark_flow template loads and runs end-to-end
Generated mock inputs conform to the workflow's declared inputs
Benchmark record is saved and linked to {workflow_sid, workflow_version_sid}
benchmark_list_for_workflow returns benchmarks for a workflow, ordered by created_at DESC
Out of scope
Backend + template landed: 98831aa
Phase 3 is functional end-to-end. UI integration (Benchmark button, history panel) still to do.
What's live
OSchema:
PerRunResult: per-run metrics (play_sid, status, duration, retries, tokens, error, input_values_json)
Benchmark root object: aggregate measurement (success_rate, min/avg/max duration, retries, tokens, cost, difficulty_rating, runs[]), linked to {workflow_sid, workflow_version_sid}

RPC methods:
benchmark_list_for_workflow(workflow_sid): newest first
benchmark_list_for_version(workflow_version_sid)
benchmark_latest_for_version(workflow_version_sid): empty string when none

New template: benchmark_flow.json
4-node DAG:
fetch_meta: loads the Workflow record, parses the declared inputs + current_version_sid from OTOML, calls workflow_version_fetch for the full version
generate_inputs: AI generates N realistic input sets conforming to the typed inputs
run_and_measure: starts N plays, polls to completion, aggregates metrics, persists a Benchmark record via benchmark.set
report: fetches the stored Benchmark and prints a summary

Verified end-to-end
Cost tracking stub
estimated_cost_usd defaults to 0 because hero_proc doesn't report tokens per job yet. The price table lives in run_and_measure for easy swap-in when tokens become available:
Remaining
Benchmark + PerRunResult types
benchmark_list_for_workflow / _for_version / _latest_for_version RPC methods
benchmark_flow.json template (4 nodes)

Moving to Phase 4 (#8) next.
Phase 3 UI landed (commit 761b396)
Benchmark controls + history panel are now in the workflow editor:
Benchmark button opens a config modal (num_runs, input_hints).
Submitting calls workflow_from_template("benchmark_flow"), then play_start on the resulting workflow with {target_workflow_sid, target_version_sid, num_runs, input_hints} as structured input_data, then redirects to the play view so the user can watch the benchmark execute.
Benchmarks panel lists the 10 most recent Benchmark records for the current WorkflowVersion via benchmark_list_for_version, rendering {success_rate, num_runs, avg_duration, estimated_cost} per row. Refresh button re-fetches.
fetchBenchmarkScalars parses the OTOML returned by benchmark.get inline (no JS TOML lib required) and extracts only the flat scalar summary fields.

Verified:
openBenchmarkModal, refreshBenchmarks, hl-benchmarks-list, and hl-benchmark-modal are present in the rendered HTML
benchmark_list_for_version(00e0) returns [] (expected, since there are no benchmarks yet)

Still open for this issue:
Populate avg_tokens_* / estimated_cost_usd by having hero_proc expose per-job token counts to the executor