Hero Agent v0.7.x: fix read aloud, conversations, convo mode + cross-browser voice #80
Context
v0.7.0-dev is deployed on herodev.gent04.grid.tf. Core features work (SSE chat, STT, 62 MCP tools, system prompt, 5 aibroker models, integration tests 20/20). But voice UI and conversation persistence have browser-level issues that need fixing.
What Works (v0.7.0-dev)
What Needs Fixing
Level 1: Browser user gesture issues (BLOCKING)
Browsers require `speechSynthesis.speak()` and `new AudioContext()` to be called in direct response to a user click. Dioxus `spawn(async { document::eval(...) })` runs OUTSIDE the click context — browsers silently block it.

Console error: `The AudioContext was not allowed to start. It must be resumed (or created) after a user gesture on the page.`

Candidate fixes:

- Set a `window._heroAutoRead = true` flag on Read click, add a MutationObserver that watches for new AI message bubbles and auto-speaks them
- `dangerous_inner_html` with a raw `<button onclick="...">` for the Read toggle
- `<script>` that listens for custom events
- `resp.ok` check

Level 2: Conversation persistence
{"conversations":[...]}wrapper) — verify it worksuse_effectrestore may have race conditionsLevel 3: Cross-browser voice (Phase 2 from issue #78)
{"type":"wake_word"}via WebSocket. Works on ALL browsers.ortcrate to hero_voice, export Whisper tiny to ONNX, fallback chain: local → Groq cloudScriptProcessorNodein Convo mode JSLevel 4: Server-side TTS (nice to have)
- `/api/voice/tts` returns audio instead of 404

Key Technical Insight
Dioxus async eval cannot satisfy browser user gesture requirements.
The pattern `onclick → spawn(async { document::eval("speechSynthesis.speak(...)") })` does NOT work because:

1. `onclick` triggers a Rust closure
2. `spawn()` schedules an async task
3. `document::eval()` calls JS via the WASM bridge

By the time the JS finally runs, the browser no longer considers it part of the click gesture, so the call is silently blocked. The fix is to keep audio initialization in pure JS triggered by DOM events, not through the Dioxus→WASM→JS bridge.
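A minimal sketch of that pure-JS pattern, assuming a `data-tts-text` attribute carries the text to speak (the attribute and function names here are illustrative, not the repo's actual ones). A single delegated listener, installed once from a plain `<script>`, keeps the speak call synchronous inside the click event:

```javascript
// Delegated click handler factory. `speak` is injected so the wiring can be
// exercised without a browser; in production it would call
// speechSynthesis.speak(new SpeechSynthesisUtterance(text)).
function makeReadAloudHandler(speak) {
  return function (event) {
    // Walk up from the clicked node to the nearest read-aloud button.
    const btn = event.target.closest('[data-tts-text]');
    if (!btn) return;
    speak(btn.getAttribute('data-tts-text'));
  };
}

// Browser wiring — one listener covers all current AND future buttons,
// so bubbles rendered later by Dioxus need no extra setup:
// document.addEventListener('click', makeReadAloudHandler(function (text) {
//   speechSynthesis.cancel();
//   speechSynthesis.speak(new SpeechSynthesisUtterance(text));
// }));
```

Because the handler runs synchronously inside the dispatched click event, the browser treats the speak call as gesture-initiated; no Dioxus→WASM→JS hop is involved.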
Files to modify
- `hero_archipelagos/.../ai/src/island.rs`
- `hero_archipelagos/.../ai/src/views/message_bubble.rs`
- `hero_archipelagos/.../ai/src/services/ai_service.rs`
- `hero_agent/.../routes.rs`
- `hero_voice/.../audio.rs`
- `hero_voice/.../ws.rs`

Repos involved
Build & test
Priority order
Status: Work in Progress
Technical Decision: Pure JS Event Delegation (not MutationObserver or pre-warm hacks)
After assessing all approaches against production standards (clean code, future-proof, industry standard, secure):
Architecture: Separation of Concerns
- `data-*` attributes on elements, JS reads them on click

Key pattern:
For server TTS with slow responses: create the `AudioContext` at click time (gesture-valid), then fetch audio and decode — the `AudioContext` stays valid after creation.

Deliverables
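Sketched in JS, with the browser globals injected so the control flow is visible (the endpoint path matches the `/api/voice/tts` route above; helper names and the request shape are assumptions):

```javascript
// Create the AudioContext synchronously in the click handler (gesture-valid),
// then fetch and decode asynchronously — the context stays usable afterwards.
// `deps` injects AudioContext/fetch/speechSynthesis stand-ins for testing.
function handleReadClick(text, deps) {
  const ctx = deps.getContext(); // must run before the first await
  return playServerTts(ctx, text, deps);
}

async function playServerTts(ctx, text, deps) {
  try {
    const resp = await deps.fetch('/api/voice/tts', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text }),
    });
    if (!resp.ok) throw new Error('tts unavailable: ' + resp.status);
    const buf = await ctx.decodeAudioData(await resp.arrayBuffer());
    const src = ctx.createBufferSource();
    src.buffer = buf;
    src.connect(ctx.destination);
    src.start();
    return 'server';
  } catch (_) {
    deps.speakBrowser(text); // speechSynthesis fallback
    return 'browser';
  }
}
```

In the real handler, `deps.getContext()` would lazily do `new AudioContext()` (or `resume()` a suspended one); the important property is that it happens before any `await`.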
Level 1 — Browser gesture fixes (this PR):

- `[data-read-aloud]` buttons → server TTS with speechSynthesis fallback
- `AudioContext` created in JS onclick of the convo toggle, stored globally; WASM references it but never creates it
- `AudioContext` created at gesture time — no `new Audio()` autoplay needed

Level 2 — Backend (this PR):
- `POST /api/conversations` handler in hero_agent

Out of scope (issue #78):
Repos touched
- `hero_archipelagos` — AI island JS + message bubble + input area
- `hero_agent` — conversation POST endpoint

Build plan
`make dist-clean-wasm` → `make test-local` (20/20) → squash merge → deploy v0.7.1-dev

Signed-off-by: mik-tf
Update: Rewrote JS delegation → pure web-sys (Rust)
After review, the JS event delegation approach didn't fit Hero's Rust-first architecture. Rewrote to use `web-sys` bindings directly from Dioxus onclick handlers.

What changed
- `data-tts-text` + JS delegated click → onclick calls `voice::ensure_tts_context()` + `voice::speak()`
- `window._heroTtsSpeak()` global → `voice::speak()` (AudioContext from toggle click)
- `window._heroTtsStop()` eval → `voice::stop_tts()` (pure Rust)
- `#hero-convo-btn` → `voice::ensure_convo_context()` in Dioxus onclick
- Convo streaming kept as `eval()` (callback-heavy API, impractical in pure web-sys)

New file: `voice.rs` — dedicated module with:

- `ensure_tts_context()` — create/resume AudioContext (gesture-valid)
- `ensure_convo_context()` — 16kHz AudioContext for conversation streaming
- `speak_browser()` — browser speechSynthesis (synchronous)
- `speak_server()` — fetch TTS from hero_agent, play via AudioContext
- `speak()` — server TTS with browser fallback
- `stop_tts()` — cancel all playback

Passes `cargo check` on native.

Why web-sys works for gesture chain
Dioxus `onclick` runs the Rust closure synchronously in the click event. `web_sys::AudioContext::new()` called from that closure is in gesture context — the browser allows it. Only `document::eval()` breaks the chain (async bridge).

Remaining JS eval (acceptable)
Convo mode WebSocket + ScriptProcessor streaming: these APIs are deeply callback-based (`onmessage`, `onaudioprocess`). Pure web-sys would require leaked closures. Kept as eval, but the AudioContext is created in Rust first.

Rebuilding now. Will re-test 20/20 before deploy.
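The eval'd streaming glue can still keep its logic in small testable pieces. A sketch of the `onaudioprocess` → WebSocket hop (the 16-bit PCM framing is an assumption about what hero_voice expects, and the function names are illustrative):

```javascript
// Forward mic frames from onaudioprocess to the WebSocket while it is open.
// `encode` converts the raw Float32 frame to the wire format.
function wireConvoStreaming(processor, socket, encode) {
  processor.onaudioprocess = function (e) {
    if (socket.readyState !== 1) return; // 1 === WebSocket.OPEN
    socket.send(encode(e.inputBuffer.getChannelData(0)));
  };
}

// Float32 samples in [-1, 1] → Int16Array PCM, a common wire format for
// Whisper-style STT backends.
function floatTo16BitPcm(float32) {
  const out = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

Keeping the conversion pure (no DOM, no globals) is what makes the eval'd glue tolerable: only the two-line handler actually touches browser-only APIs.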
Signed-off-by: mik-tf
Deployed: v0.7.1-dev on herodev
Test results
Repos touched
- (`975bfdd`): POST/DELETE/PATCH conversation endpoints, list returns full info
- (`f45c6fe`): `voice.rs` web-sys module, pure Rust TTS, gesture-valid AudioContext

What was fixed
- `SpeechSynthesis` + `AudioContext` in Dioxus onclick — gesture chain preserved
- `ensure_convo_context()` in onclick, WebSocket streaming via eval
- `voice::speak()` with AudioContext from toggle click
- `voice.rs` module, no JS globals, pure web-sys bindings

Release
https://forge.ourworld.tf/lhumina_code/hero_services/releases/tag/v0.7.1-dev
Signed-off-by: mik-tf
Status update: v0.7.2-dev
What works in v0.7.2-dev
- `voice.rs` web-sys module — clean Rust foundation for audio APIs

Known limitation: TTS playback (read aloud / auto-read)
Browser TTS (speechSynthesis + AudioContext) requires user gesture context that expires unpredictably across browsers. The web-sys approach creates AudioContext correctly in onclick, but the actual audio playback call runs async and some browsers reject it.
Decision: defer TTS to issue #78 (server-side audio). Server-side TTS via WebSocket eliminates all browser gesture issues permanently.
Moving to #78 immediately
Instead of fighting browser audio policies with stepping stones, we're implementing the production solution:
This makes #80 scope = conversation CRUD + voice input + auto-scroll (delivered). TTS playback = #78 scope.
Signed-off-by: mik-tf
Closing — v0.7.2-dev deployed
Delivered:
TTS playback deferred to #78 (server-side audio — the production solution).
Release: https://forge.ourworld.tf/lhumina_code/hero_services/releases/tag/v0.7.2-dev
Signed-off-by: mik-tf