Slack Feature Parity — Implementation Plan #9
Reference
lhumina_code/hero_collab#9
Context
Hero Collab is a markdown-centric collaboration platform built as a Hero OS service. The backend is ~95% complete (65 RPC methods, 18 DB tables, group-based permissions), but the user-facing experience has critical gaps that prevent production use as a Slack alternative.
Current State
What Works
What's Broken (before this work)
- toggleReaction() always called message.react first; the server's INSERT OR IGNORE silently succeeded on duplicates, so unreact never fired
- caller_id is client-sent and optional (bypasses all permission checks if missing)
- rpc_proxy drops identity headers: req.into_body() consumes the request, losing all X-Hero-User/Context/Claims headers before forwarding to rpc.sock
What's Missing (Slack Feature Map)
Implementation Plan (7 Phases)
Phase 0: Critical Bug Fixes (1-2 days)
- message.toggle_react atomic RPC
Phase 1: Authentication via hero_proxy (3-5 days)
- COLLAB_AUTH_MODE feature flag
- X-Hero-User header, map to collab user via new external_id column, inject as caller_id
- user.me RPC method
- users.list RPC for invites
Phase 2: Chat UI Completion (5-7 days)
Phase 3: Missing Core Features (5-7 days)
Phase 4: Canvases & Documents (5-7 days)
Phase 5: Voice/Video Huddles (7-10 days, optional)
Phase 6: Hardening & Polish (3-4 days)
TECH_SPEC Divergences
The original TECH_SPEC.md predates hero_proxy, hero_collab_app, and the CLI crate:
TECH_SPEC should be updated after Phase 2. Until then, plan/slack-feature-parity.md supersedes it.
Post-Implementation Updates
Each phase requires updating: openrpc.json (SDK auto-regenerates), CLI subcommands, Dioxus app types, and documentation. External repos: hero_proxy (configure roles/claims after Phase 1), hero_os (verify island after Phase 6).
Phase 0 Progress - Complete ✅ (not pushed yet)
Changes are on the local development branch, pending push:
- b541818 fix: add atomic reaction toggle and fix reaction bug
- bc9a843 feat: compose-with-files attachment flow + DB migrations
- 1bebdee chore: update OpenRPC spec for Phase 0 changes
What was done:
Reaction toggle fix:
- message.toggle_react RPC: atomic server-side method that checks existence and does INSERT or DELETE, returns {action: "added"|"removed"}
- Updated toggleReaction() in both chat-app.js (Dioxus island) and chat.html (standalone) to use the new method
Attachment compose-with-files flow:
- attachments.message_id made nullable via safe table recreation migration (PRAGMA check pattern)
- Storage path: attachments/{workspace_id}/{attachment_id}/{filename}
- attachment.upload accepts optional message_id + workspace_id, validates 25MB file size limit
- message.send accepts optional attachment_ids array, associates pending attachments (syncs both DB column and data blob)
- message.get / message.list now return attachments[] array (same JOIN pattern as reactions)
- Added external_id column to users table (for Phase 1 hero_proxy mapping)
OpenRPC spec updated:
- message.toggle_react method
- attachment.upload (message_id optional, +workspace_id)
- message.send (+attachment_ids)
Blocker for Phase 1+
WebSocket real-time sync is broken when accessed via hero_router. The proxy_to_socket function in hero_router:
- Buffers the entire response body (resp.into_body().collect()), which kills streaming
- Drops the Upgrade response header (line 556 in routes.rs), which kills the WebSocket handshake
This affects ALL Hero services with WebSocket, not just hero_collab. hero_whiteboard has the identical pattern (/ws/{board_id} on ui.sock) and would have the same issue.
The WebSocket architecture (UI socket serves WS relay, browser connects via hero_router) is the standard Hero pattern: hero_collab's TECH_SPEC explicitly says it was "learned from hero_whiteboard's proven dual-channel pattern." The issue is in hero_router's proxy layer.
See hero_router issue for proposed fix.
WebSocket Blocker Resolved ✅
The WebSocket issue was not a hero_router bug. hero_router already has WebSocket passthrough support (commit d1632cd, "feat: add WebSocket tunnel support to hero_router", April 13). The locally installed binary was simply built before that commit.
After rebuilding hero_router from the latest development branch, WebSocket works correctly:
The hero_router issue (#35) has been closed. WebSocket real-time sync is no longer a blocker for Phase 1 and Phase 2.
Note: hero_collab's chat.html had a missing base-path prefix in the WebSocket URL (was /ws/{id}, fixed to {basePath}/ws/{id}). This has been fixed locally (not pushed yet).
Phase 0 pushed to development ✅
4 commits pushed:
- b541818 fix: add atomic reaction toggle and fix reaction bug
- bc9a843 feat: compose-with-files attachment flow + DB migrations
- 1bebdee chore: update OpenRPC spec for Phase 0 changes
- b77429d fix: WebSocket base-path prefix and dual-codebase sync
Working: Reaction toggle, attachment upload/download, OpenRPC spec (65 methods).
Known issue: WebSocket connects (101 Switching Protocols via curl) but the browser sees the connection close immediately ("Finished" status, empty response). The WS tunnel in hero_router (proxy_ws_tunnel) may be dropping the connection after handshake. RPC-based features work fine; real-time sync via WebSocket needs further debugging. This does not block Phase 1 (auth) or Phase 2 UI work.
Proceeding to Phase 1 (authentication via hero_proxy).
Phase 1 (Auth) — Core Complete, pushed to development
Commits:
- 9077493 feat: Phase 1 auth - hero_proxy header integration, user.me, rpc_proxy forwarding
- 395cb70 feat: Phase 1E+1F - user_id validation, channel member permissions, auth-aware UI
What was done:
Auth header reading (1A): http_rpc() reads X-Hero-User, X-Hero-Context, X-Hero-Claims. handle_rpc() looks up the user by external_id/email/alias and injects it as caller_id. Verified e2e: hero_proxy → hero_router → hero_collab_server.
Feature flag (1-PRE-2): COLLAB_AUTH_MODE=proxy|dev env var. Default dev.
RPC proxy header forwarding (1B): rpc_proxy() extracts identity headers BEFORE consuming the request body and forwards them to rpc.sock. Previously all headers were silently dropped.
user.me RPC (1D): Returns the authenticated user or auto-creates one from X-Hero-User. Returns {authenticated: false} in dev mode.
user_id validation (1E): 12 self-operation handlers use resolve_self_user_id(), which blocks impersonation when authenticated and stays backward compatible in dev mode. Channel member add/remove got new permission checks (was zero before).
Chat UI auth (1F): init() calls user.me first. If authenticated, skips the user picker. Falls back to the dev mode picker otherwise.
Remaining Phase 1 items (deferred):
Next: Phase 2 (Chat UI Completion)
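The header lookup (1A) and the impersonation guard (1E) described above could be sketched roughly as follows. This is a minimal illustration, not the actual hero_collab code: the `User` struct, the lookup order (external_id, then email, then alias), the function name `resolve_self_user_id_sketch`, and the choice to reject unauthenticated calls in proxy mode are all assumptions made for the example.

```rust
/// Auth mode, mirroring the COLLAB_AUTH_MODE=proxy|dev flag.
#[derive(Clone, Copy, PartialEq, Debug)]
enum AuthMode {
    Proxy,
    Dev,
}

/// Minimal user row; the field set is illustrative only.
struct User {
    id: u64,
    external_id: Option<String>,
    email: String,
    alias: String,
}

/// Map an X-Hero-User header value to a collab user id.
/// Assumed order: external_id first, then email, then alias.
fn lookup_caller(users: &[User], header: &str) -> Option<u64> {
    users
        .iter()
        .find(|u| u.external_id.as_deref() == Some(header))
        .or_else(|| users.iter().find(|u| u.email == header))
        .or_else(|| users.iter().find(|u| u.alias == header))
        .map(|u| u.id)
}

/// Decide which user id a self-operation may act as (hypothetical helper).
fn resolve_self_user_id_sketch(
    mode: AuthMode,
    caller_id: Option<u64>,
    requested: Option<u64>,
) -> Result<u64, String> {
    match (caller_id, requested) {
        // Authenticated caller asking to act as someone else: impersonation.
        (Some(c), Some(r)) if c != r => {
            Err(format!("user_id {r} does not match caller {c}"))
        }
        // Authenticated: always act as the proxy-supplied identity.
        (Some(c), _) => Ok(c),
        // No proxy identity: only dev mode trusts the client-sent id.
        (None, Some(r)) if mode == AuthMode::Dev => Ok(r),
        (None, _) => Err("authentication required".to_string()),
    }
}
```

The point of the second function is that the client-sent user_id is only ever honored when no authenticated identity exists and the server is explicitly in dev mode; everywhere else the proxy-supplied identity wins.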
Post-Phase 1 Fixes — Pushed to development
Commits since last update:
- eb93d62 fix: switch hero_collab_ui to axum::serve() for WebSocket support
- a513aed fix: consolidate dual JS codebase, image previews, message ordering, soft-delete filtering
- 3c9645f fix: include attachments in message.send response for immediate preview
- 5dc9417 fix: export missing global functions (selectChannel, startDm, pickUser, etc.)
WebSocket Fixed
Root cause: hero_collab_ui used a manual hyper::server::conn::http1::Builder, which didn't properly propagate WebSocket upgrades through hero_router's tunnel. Switched to axum::serve() (same as hero_proc_ui, which has working WebSocket). The connection now stays open: "Connected" status in sidebar.
Dual Codebase Consolidated
- Inline JS copy deleted from chat.html
- chat.html is now a pure HTML template that loads chat-app.js via <script src>
- Functions exported on window.* for HTML handlers
- Auth init (user.me call) added to chat-app.js (was only in the deleted inline copy)
Bug Fixes
- message.list results reversed from DESC to chronological order. Messages now appear in the same order before and after refresh.
- Soft-deleted messages filtered from message.list results (was showing deleted messages after refresh).
- message.send now includes attachments[] in the response so image previews appear immediately, not just after refresh.
Current state
All Phase 0 + Phase 1 core features working:
Ready for Phase 2 (Chat UI Completion): unread counts, typing indicators, presence, inline editing.
Comprehensive Review Complete — Fixes Pushed
Ran a full code review of all Phase 0+1 work. Found 12 issues, fixed the critical ones:
Fixed (pushed):
- cb781ef: user.me race condition: concurrent calls for a new user could create duplicates. Fixed with existence check inside lock + INSERT OR IGNORE.
- cb781ef: Send button stayed disabled when files were attached but no text typed. Fixed button logic to check state.pendingAttachments.length.
- 17059e4: DM functions (pickDmUser, dmSearchKey, filterDmUsers) not exported to window after IIFE consolidation. Caused ReferenceError when clicking DM autocomplete entries.
Accepted (not blocking):
COLLAB_AUTH_MODE=proxy)
Review verified working:
Starting Phase 2 (Chat UI Completion).
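The user.me race fix above (existence check inside the lock) can be illustrated with a small in-memory sketch. This is not the actual handler: `UserStore` stands in for the users table, and the HashMap keyed by external id plays the role that the unique constraint plus INSERT OR IGNORE plays in SQLite. All names here are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// In-memory stand-in for the users table, keyed by external id.
#[derive(Default)]
struct UserStore {
    next_id: u64,
    by_external_id: HashMap<String, u64>,
}

/// Return the existing user id or create one. The existence check and the
/// insert happen under the same lock, so two concurrent callers cannot both
/// observe "missing" and both insert.
fn get_or_create_user(store: &Mutex<UserStore>, external_id: &str) -> u64 {
    let mut guard = store.lock().unwrap();
    if let Some(&id) = guard.by_external_id.get(external_id) {
        return id;
    }
    guard.next_id += 1;
    let id = guard.next_id;
    guard.by_external_id.insert(external_id.to_string(), id);
    id
}

/// Fire `n` concurrent first-time lookups for the same identity and return
/// the id each caller saw plus the number of rows that ended up stored.
fn concurrent_me_calls(n: usize, external_id: &str) -> (Vec<u64>, usize) {
    let store = Arc::new(Mutex::new(UserStore::default()));
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let store = Arc::clone(&store);
            let ext = external_id.to_string();
            thread::spawn(move || get_or_create_user(&store, &ext))
        })
        .collect();
    let ids: Vec<u64> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    let rows = store.lock().unwrap().by_external_id.len();
    (ids, rows)
}
```

If the check were done before taking the lock, two threads could both see no row and both create one, which is the duplicate-user symptom the review found.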
Phase 2 Complete + Security Fixes — Pushed
Commits since Phase 1 update:
Phase 2 features:
- e76b5ca feat: Phase 2A-D - unread counts, typing indicators, presence, WebSocket events
- 1c5f7f6 fix: Phase 2 review - WS broadcast, sendBeacon, typing cleanup, heartbeat leak
Review fixes:
- edd697f fix: gate delete button to message owner, filter deleted messages in get
- 17059e4 fix: export missing DM functions (pickDmUser, dmSearchKey, filterDmUsers)
- cb781ef fix: review fixes - user.me race condition, send button with attachments
Security fixes:
- 9897044 fix: path traversal protection in storage, channel membership check on attachments
- 88ec970 feat: WebSocket auth - extract X-Hero-User on WS upgrade (Phase 1C)
- b58cfe0 docs: add known deferred issues section to plan
Phase 2 features working:
Security hardening done:
Ecosystem research on WebSocket auth:
Checked hero_whiteboard and hero_proc — neither checks auth on WebSocket connections. The Hero pattern is: WebSocket relays are unauthenticated broadcast pipes, auth happens at hero_proxy edge. Our implementation (extracting X-Hero-User + logging) is ahead of the ecosystem standard.
Remaining deferred (Phase 6):
Ready for Phase 3 (search, notifications, pins, channel browse).
Deferred Items Resolved — Only 3 remain
Previously deferred 5 items from Phase 1+2 reviews. Resolved 2 more:
Now complete:
- ws_handler extracts X-Hero-User, calls user.me + channel.member.list via rpc.sock to verify channel membership. Non-members get 403 Forbidden. Dev mode (no auth header) still allows connections for backward compat. Goes beyond hero_whiteboard and hero_proc, which have zero WS auth. (88ec970, 43dd7f9)
- LocalBackend.full_path() now canonicalizes paths and rejects traversal attempts. (9897044)
- attachment.get/delete check channel membership via check_attachment_access(). (9897044)
- Delete button gated to message owner. (edd697f)
- Deleted messages filtered in get. (edd697f)
Still deferred (Phase 6):
- uploaded_by column + schema change)
Starting Phase 3 (search, notifications, pins, channel browse).
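As an illustration of the path traversal protection listed above, here is a minimal sketch of a root-confined join. Note the swap: the real LocalBackend.full_path() is described as canonicalizing paths, which also resolves symlinks and needs the file to exist; this example uses a purely lexical component check instead so it can run without touching the filesystem. `safe_join` is a hypothetical name.

```rust
use std::path::{Component, Path, PathBuf};

/// Join `relative` onto `root`, refusing anything that could escape it.
/// Lexical check only: symlinks are NOT resolved here.
fn safe_join(root: &Path, relative: &str) -> Result<PathBuf, String> {
    let mut out = root.to_path_buf();
    for component in Path::new(relative).components() {
        match component {
            // Ordinary path segment: append it.
            Component::Normal(part) => out.push(part),
            // "." is harmless and can be skipped.
            Component::CurDir => {}
            // "..", a leading "/", or a Windows prefix would leave the root.
            Component::ParentDir | Component::RootDir | Component::Prefix(_) => {
                return Err(format!("path traversal rejected: {relative}"));
            }
        }
    }
    Ok(out)
}
```

With a storage key shaped like attachments/{workspace_id}/{attachment_id}/{filename}, the filename segment is the client-influenced part, so it is the one that most needs this kind of check.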
Phase 3 Complete — Pushed to development
Initial implementation:
- 9f774be feat: Phase 3 - search, notifications, pins, channel browse
Production hardening (13 issues fixed from code review):
- b6e3948 fix: Phase 3 production hardening - 13 issues from code review
OpenRPC spec updated:
- 33e00b6 chore: update OpenRPC spec - 70 methods (was 65)
Phase 3 features:
3A. Full-text search:
3B. Mention notifications:
3C. Pinned messages:
3D. Channel browse/discovery:
Review hardening (13 fixes):
FTS sync on edit/delete, FTS query sanitization, FTS backfill, notification interval leak, sender_name in mentions, mention ID from PK not blob, list_pinned SQL filter, pin permission check, search access control, mention LIMIT, json_extract for name lookup, ID type safety
OpenRPC spec: 70 methods. Ready for Phase 4 (Canvases & Documents).
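One of the hardening items above is FTS query sanitization. The thread does not say how hero_collab does it, so the following is only one common approach, shown as an assumption: wrap every whitespace-separated term in an FTS5 string literal, doubling any embedded double quote, so user input cannot inject query operators. `sanitize_fts_query` is a hypothetical name.

```rust
/// Turn free-form user input into an FTS5 MATCH expression in which every
/// whitespace-separated term is a quoted string. Inside an FTS5 string a
/// double quote is escaped by doubling it, so tokens such as OR, NEAR,
/// `*` or `:` are matched as text instead of being parsed as syntax.
/// Returns None when the input contains no terms at all.
fn sanitize_fts_query(input: &str) -> Option<String> {
    let terms: Vec<String> = input
        .split_whitespace()
        .map(|term| format!("\"{}\"", term.replace('"', "\"\"")))
        .collect();
    if terms.is_empty() {
        None
    } else {
        // Adjacent quoted terms are combined with an implicit AND in FTS5.
        Some(terms.join(" "))
    }
}
```

The resulting string would be passed as a bound parameter to the MATCH clause rather than spliced into the SQL text, so quoting the terms and binding the parameter address two separate injection surfaces.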
UX Polish Round — Pushed
Multiple UX fixes after hands-on testing:
- e0dc5fc Pin button in hover bar, editor 404 fix, thread composer buttons
- e2ce56d Thread hover actions, emoji/attachment context-awareness
- 56d2b73 Thread-aware emoji, attachments, pin, delete, reactions
- 979c3c3 Soft-delete in thread.replies, pending attachment targeting
- 613cd88 Proper separation of main and thread composer state (no workaround)
- 70bbd11 Emoji picker positioned near triggering button
- 5fd5d68 Thread reactions, inline editing, scroll to new message
- d318235 Emoji picker viewport-aware positioning
Key improvements:
OpenRPC: 70 methods. Starting Phase 4 (Canvases).
Phase 4 Research Complete — Architecture Redesigned
What changed
Original Phase 4 plan used Editor.js + last-write-wins per block. After research and verification, completely redesigned to use CRDT for proper real-time collaboration.
Research conducted
Open-source alternatives studied: Mattermost/Focalboard (block model), Zulip (event queues), Conduit (Rust Matrix), Rocket.Chat, Element/Matrix.
CRDT library evaluation (Rust):
Editor evaluation:
Verified against actual code and crates
- esm.sh) - no bundler needed, vanilla JS
- encode_state_as_update_v1() → BLOB → SQLite
New architecture
Key difference from original plan:
- No canvas_blocks table: yrs handles document structure as a CRDT
What's NOT changing
- document.* system stays for simple markdown files
- [canvas:{id}] message cards (same plan)
Plan updated and synced to plan/slack-feature-parity.md. Ready to implement.
Phase 4 Complete + Extended UX Redesign - Pushed to development
Phase 4 Core (Canvas Collaborative Editing)
Backend (11 RPC methods):
- Tables: canvases, canvas_state (yrs CRDT BLOB), canvas_collaborators
- canvas.create/get/list/update/delete/share/unshare/update_role/get_preview/save_state/load_state
- channel.find_dm: proper DM lookup by participant IDs
Frontend:
- scripts/vendor-bundle/build.sh
- /canvas/{id} with toolbar, title editing, undo/redo
OpenRPC: 82 methods. CLI: full canvas subcommands.
Extended Phase 4 — Slack-aligned UX
Share modal:
Canvas in chat:
- [canvas:ID] → async hydration showing title, creator, timestamp
Sidebar: creator name, relative time, shared indicator, right-click context menu (Open, Copy link, Delete)
Collaborator bar: colored avatar circles via awareness protocol
Security (3 audit rounds, 19 issues fixed)
- securityLevel: strict (was loose, an XSS risk)
Chat Architecture Refactor
- flex-direction: column-reverse: scroll pins to bottom automatically, with no JS hacks; images loading async do not affect scroll
Commits (21):
What's Next
Phase 5 (Voice/Video Huddles) — optional, requires LiveKit
Phase 6 (Hardening):
Phase 4 — Canvases with Real-Time Collaboration — ✅ Shipped
Ref commits: 263cb89, 58d3715, 35c2b2c, 80f9a24, 9d8e09d, c143c0b, 753076e, 07150da, 06911fb, 6214d95
What landed
Backend: 11 RPC methods, new yrs CRDT infrastructure:
- canvases (metadata), canvas_state (yrs-encoded BLOB), canvas_collaborators
- canvas.create / get / list / update / delete / share / unshare / update_role / get_preview / save_state / load_state
- /ws/canvas/{id} (distinct from the text-only channel WS) running the yrs y-sync protocol
- Arc<RwLock<HashMap<u64, yrs::Doc>>> for active docs; periodic flush to canvas_state BLOB
- canvas.save_state / load_state forward X-Hero-User for auth
Frontend: Tiptap + Yjs editor, no bundler:
- templates/canvas.html + static/js/canvas-app.js: Tiptap editor bound to Yjs via @tiptap/extension-collaboration
- @tiptap/extension-collaboration-cursor for live cursors with colored avatars on the topbar
- [canvas:ID] card
- [canvas:ID] in any message renders as an interactive card (title, creator, edited-X-ago) fetched via canvas.get_preview
- [canvas:ID]
Vendored bundle:
- scripts/vendor-bundle/build.sh produces static/js/vendor/collab-editor.min.js (Tiptap + Yjs + y-websocket + collaboration-cursor) via esbuild
- y-prosemirror → @tiptap/y-tiptap so all packages share a single ySyncPluginKey (a subtle bug we hit during integration)
Notable mid-Phase corrections
Several UX bugs discovered during hands-on testing and fixed:
- flex-direction: column-reverse and zero JS scroll hacks)
- channel.find_dm RPC to avoid duplicate DM channels on share)
Acceptance notes
Two-browser collaboration verified: type in tab A, see edits merge in tab B with sub-second latency; cursors visible with correct colors and names. Restart server — canvas content survives (reloaded from SQLite BLOB). Non-collaborator WS connect returns 403. Chat card hydration works on page reload and live message append.
Docs updated in plan/slack-feature-parity.md (Phase 4 + Phase 4-EXT sections).
Phase 5 - Slack-Style LiveKit Huddles - ✅ Shipped (audio v1)
Ref commits: 55ddf63 (feature), 75017f7 (sync openrpc + SDK + CLI)
What landed
Server (hero_collab_server):
- src/livekit.rs: JWT generation via livekit-api 0.4. HS256 signing, 15-minute TTL, can_publish_sources = ["microphone"] (audio-only, no video/screen push), can_publish_data = false (chat runs on the existing WS). LiveKitConfig::from_env() loads once at startup; falls back between LIVEKIT_URL (hero_osis convention) and LIVEKIT_WS_URL.
- src/handlers/huddle.rs: 5 RPCs: huddle.start / join / leave / list / participants. Reuses rooms + room_participants tables via the JSON-blob pattern. Idempotent start (second caller gets the active room back). Thread-anchor message created on start with kind: "huddle_start" and auto-flipped to huddle_ended on auto-end. Final N people count sourced from a joiner_ids set kept on the room blob, which survives leave events, unlike room_participants.
- check_channel_member(): proxy-mode auth: missing caller_id → "Authentication required for huddle operations"; DB errors propagate with ? instead of collapsing into "Not a member". Dev mode bypass confined to explicit COLLAB_AUTH_MODE=dev.
- resolve_display_name(): distinguishes missing user / missing name / DB error; each path logged at the right level.
- src/huddle_reaper.rs: background tokio task that polls LiveKit's ListParticipants every 5 minutes for each active huddle. If LiveKit reports empty, force_end_huddle(room_id) mirrors the leave-auto-end branch: deletes participant rows, marks ended, flips anchor. Treats LiveKit as the source of truth so crashed-browser ghosts get reaped. Skips force-end on LiveKit errors: a ghost is preferable to cutting off a live call.
- main.rs: COLLAB_AUTH_MODE default flipped to proxy (fail-closed). dev must be explicit. Reaper spawned after AppState init.
UI (hero_collab_ui):
- static/js/huddle.js (HuddleManager IIFE): wraps the livekit-client SDK; on joinHuddle, calls huddle.start + huddle.join, connects to LiveKit Room, attaches remote audio tracks to a hidden sink, publishes mic (catches permission denial with an explicit "joined muted" toast), subscribes to ActiveSpeakersChanged. onDisconnected keyed on DisconnectReason: user toasts for DUPLICATE_IDENTITY, SERVER_SHUTDOWN, PARTICIPANT_REMOVED, CONNECTION_ERROR, etc. forceStopLocalTracks() safety net to prevent a stuck browser-mic LED.
- chat-app.js: huddle button in channel header, in-call bar with mute / leave / participant avatars (speaking indicator), sidebar participant-count indicator (clickable → joinHuddleFromSidebar), Slack-style anchor cards in the message list: active with Join button, ended faded with participant-count summary. Companion thread auto-opens on join. refreshChannelHuddleIndicator degrades closed on error instead of silently lying.
- chat.html: huddle-bar markup, <audio id="huddle-audio-sink">, CSS for buttons / indicator / system-message cards / speaking pulse.
- static/js/vendor/livekit-client.min.js: vendored UMD bundle (506 KB, no CDN at runtime).
Tooling + spec:
- scripts/vendor-bundle/build.sh: copies the LiveKit UMD bundle alongside the collab-editor bundle.
- openrpc.json: 5 huddle.* method specs with full response schemas: huddle.start returns anchor + thread_message_id + creator_name + joiner_ids; huddle.leave returns anchor when ended=true. This enables the "server returns anchor → client broadcasts via existing chat WS" pattern that delivers real-time UI updates without any new push transport.
- openrpc.client.generated.rs: SDK regenerated by the openrpc_client! proc macro; HuddleStartOutput etc. now have typed fields for the new data.
- hero_collab/src/main.rs (CLI): new hero_collab huddle start / join / leave / list / participants subcommand group, mirrors the existing Room pattern.
Architectural decisions
- hero_whiteboard. Own SQLite, own token generation. Copied token-generation logic from hero_osis/communication/rpc.rs rather than depending on hero_osis at runtime.
- hero_osis points at (same env vars: LIVEKIT_URL, LIVEKIT_API_KEY, LIVEKIT_API_SECRET).
- huddle.start and huddle.leave return the anchor message, and the initiating/leaving client broadcasts message.created / message.updated on the existing chat WS. Reaper covers the ghost case via LiveKit polling. Proper server-push bridge deferred to v2 as V2-Optional in plan/feature-huddles-v2.md with implement-vs-drop tradeoffs documented.
Verification
docker run livekit/livekit-server --dev --bind 0.0.0.0):
- huddle.start is idempotent, returns the anchor inline
- huddle.join returns a valid JWT; decoded payload has exp - nbf = 900, canPublishSources = ["microphone"], canPublishData = false, correct room and sub claims
- ListRooms Twirp API returns 401 permissions-denied (token is signature-valid but we only granted roomJoin, which proves the signature is accepted)
- ended: false; Alice leaves (last) → ended: true, anchor flips to "🎧 Huddle ended · less than a minute · 2 people" (correct distinct-joiner count)
- caller_id request rejected, non-member rejected, member accepted
- cursor: pointer and working onclick; thread auto-opens on join (URL hash updates to .../thread/{anchor_id}). Real WebRTC not testable in headless; the JS is a thin SDK wrapper, and ws_url + token validity are confirmed.
External audit - key findings
- plan/known-issues.md and plan/feature-huddles-v2.md.
- hero_router/scanner.rs background-task pattern, env vars match hero_osis, no new dependencies, startup env load is actually cleaner than hero_osis (per-request env::var was flagged as a footgun in the sibling codebase).
Remaining Work
Round 3 — Huddle fix-up (next commit, ~1 hour)
Findings from the external audit that should land before calling the feature "production-grade broadly" (currently ship-ready behind a feature flag). Each is concrete:
- huddle.list on current channel, detect transition to "empty" → invalidate state.activeHuddles[channelId] + reload messages. ~10 lines in chat-app.js.
- broadcastState silently drops if WS is mid-reconnect. Fix: console.warn on drop; consider queuing critical events. 1 line today, optional upgrade later.
- anchor in the response, client broadcasts a synthetic huddle.ended event that triggers refreshChannelHuddleIndicator on receivers. ~15 lines split between server + client.
- onDisconnected vs leaveHuddle duplicate-broadcast race. Fix: skip huddle.participant_left broadcast in onDisconnected when reason is CLIENT_INITIATED; defer nulling huddle.room until after onDisconnected has run (prevents forceStopLocalTracks no-op on the happy leave path). 5 lines.
- catch (err) {} swallows huddle-specific exceptions (the huddle dispatch block is now inside this try/catch so it's load-bearing). Fix: console.error. 1 line.
- tracing::info! on huddle start / join / leave with room, user, channel, ended, total_participants (H-3-7 in known-issues). The one operational thing that blocks debugging a production incident. ~30 min.
Sub-total: ~1 hour of focused work, then browser re-test of the specific B1 (reaper → clients) and B4 (leave-flow) scenarios.
Phase 6 - Hardening & polish (3–4 days, from plan/slack-feature-parity.md)
Not yet started. The full list:
- idx_workspace_members_user, idx_channel_members_user, idx_messages_user.
- user_id. 60 RPC/min, 10 message.send/min. New rate_limit.rs.
- log_activity(db, ws_id, user_id, action, data) called from key operations. Table exists but is unused.
- main.rs:http_rpc; don't leak SQL or filesystem path details.
- CorsLayer from tower-http (already in deps).
- attachment.cleanup_pending RPC; periodic sweep via tokio task or startup pass.
- tracing::info!, expose active_ws_connections on /health, add rpc_calls_total / rpc_errors_total / avg_latency_ms to system.metrics.
- hero_collab_server + hero_collab_ui via hero_proc, creates workspace/user/channel, exercises each feature end-to-end.
- sendBeacon, localStorage, FileReader, fetch). Test Chrome / Firefox / Safari.
Other gaps in the main plan (not Phase 6, not huddles)
From plan/slack-feature-parity.md:
- dev mode: user.preferences RPC (referenced by huddles v2: mute-on-join preference).
Tracked elsewhere (follow-up issues, not in this thread)
- plan/feature-huddles-v2.md: huddles v2 roadmap (screenshare, video, DM ringing, reactions, recording, optional server-push bridge).
- plan/known-issues.md: Tier-3 deferred items on the huddles v1 path (H-3-1..9: transactional writes, partial unique index, unit tests, joiner_ids table, TZ storage, observability, mute-on-join, active-speaker ring).
- plan/feature-voice-to-canvas.md: voice-to-text dictation into canvas (separate feature, separate issue).
Recommended next milestone
Round-3 fix-up → merge → then Phase 6A/6I/6J first (observability + load-test + integration tests) before broadening huddles beyond feature-flag. Everything else can fan out in parallel.
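The Phase 6 list above calls for per-user rate limiting (60 RPC/min, 10 message.send/min) in a new rate_limit.rs that has not been written yet. A minimal sketch of one way to do it follows, assuming a fixed-window counter with the clock passed in as milliseconds. The struct name, method name, and the choice of fixed windows over a token bucket are all assumptions, not the planned design.

```rust
use std::collections::HashMap;

/// Fixed-window counter: at most `limit` calls per `window_ms` per user.
/// Hypothetical shape for the planned rate_limit.rs.
struct RateLimiter {
    limit: u32,
    window_ms: u64,
    /// user_id -> (window start in ms, calls counted in that window)
    windows: HashMap<u64, (u64, u32)>,
}

impl RateLimiter {
    fn new(limit: u32, window_ms: u64) -> Self {
        Self { limit, window_ms, windows: HashMap::new() }
    }

    /// Record a call at `now_ms`. Returns Ok(()) if allowed, or
    /// Err(retry_after_ms) once the user has used up the current window.
    fn check(&mut self, user_id: u64, now_ms: u64) -> Result<(), u64> {
        let entry = self.windows.entry(user_id).or_insert((now_ms, 0));
        if now_ms.saturating_sub(entry.0) >= self.window_ms {
            // Window expired: start a fresh one at the current time.
            *entry = (now_ms, 0);
        }
        if entry.1 >= self.limit {
            // Time left until the current window ends.
            return Err(entry.0 + self.window_ms - now_ms);
        }
        entry.1 += 1;
        Ok(())
    }
}
```

Two instances, for example `RateLimiter::new(60, 60_000)` for all RPCs and `RateLimiter::new(10, 60_000)` for message.send, would cover the two limits named in the plan. Passing the time in explicitly keeps the limiter testable without sleeping.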
Round 3 Hardening — ✅ Shipped (ship-ready broadly, no feature flag)
Ref commit: cc3dcc6
What landed
Two prior audit rounds surfaced five correctness gaps that kept the feature "ship-ready behind a feature flag." Round 3 closes them; Round 3.5 fixes the two regressions Round 3 itself introduced (caught by the follow-up audit).
Backend (hero_collab_server):
handlers/huddle.rs:
- load_message filters deleted_at IS NULL: soft-deleted anchors can never be broadcast as live.
- fetch_participants logs + continues on a malformed row instead of rendering ghost {} avatars.
- leave returns anchor_missing: true when ended=true and the updated-anchor reload fails, so the client can broadcast a synthetic huddle.ended event and peers still flip their cards.
- participants now fails closed on a room blob missing channel_id (prior if let Some(...) leaked participant metadata on corrupt rows).
- tracing::info! on huddle.start / join / leave with fields huddle_event, room_id, channel_id, user_id, ended, anchor_missing, distinct_joiners. A "User X stuck huddle for 10 min" incident is now greppable end-to-end from logs.
huddle_reaper.rs:
- LIVEKIT_URL must start with ws:// or wss:// or the reaper refuses to spawn with a loud error!. No more silent per-tick warnings against a broken api_host.
- snapshot_active_huddles (created_at < now - 5min): eliminates the "brand-new huddle with first client still in handshake gets reaped" race.
- Records poll_started_at BEFORE the LiveKit call, then deletes only participant rows with joined_at < poll_started_at. A late-joiner row whose joined_at >= poll_started_at preserves the huddle: TOCTOU defense without defeating ghost cleanup (the Round 3 regression).
- force_end_huddle: anchor-flip failure returns Err, whole thing rolls back. No more half-ended state.
Frontend (hero_collab_ui/static/js):
huddle.js:
- DisconnectReason is a numeric protobuf enum (CLIENT_INITIATED=1, DUPLICATE_IDENTITY=2, …), not a string. The previous reasonStr === 'CLIENT_INITIATED' check never matched → the skip-broadcast branch was unreachable (duplicate broadcasts on every leave) and the string-keyed DISCONNECT_MESSAGES toast map was silently dead (no disconnect toasts ever fired). Now keyed numerically; enum values verified live against window.LivekitClient.DisconnectReason.
- Toasts for DUPLICATE_IDENTITY, SERVER_SHUTDOWN, PARTICIPANT_REMOVED, ROOM_DELETED, STATE_MISMATCH, JOIN_FAILURE, SIGNAL_CLOSE, ROOM_CLOSED, CONNECTION_TIMEOUT, MEDIA_FAILURE. CLIENT_INITIATED and UNKNOWN_REASON intentionally stay silent.
- disconnectReasonName() helper for readable log lines.
- broadcastState: console.warn on dropped broadcasts with readyState + event type.
- leaveHuddle: on anchor_missing, broadcasts synthetic huddle.ended event.
chat-app.js:
- onmessage catch logs with payload instead of silently swallowing.
- 30s reconciliation poll (startHuddleReconciliationPoll): browsers learn about reaper-initiated ends that never went through the chat WS. Guard narrowed to "current-channel huddle" so a user huddling in #A still gets #B reconciliation. Wrapped in try/catch.
- toggleHuddle: post-join UI failures no longer show a misleading "failed to start" toast, since the huddle IS live at that point.
- toggleHuddle + huddleLeave explicitly call refreshChannelHuddleIndicator on their own tab (WS relay skips own conn_id).
Verification
End-to-end verified with real WebRTC — the missing piece from prior rounds (Playwright headless has no media stack):
Reaper ghost cleanup live-tested: huddle.list empty after, anchor flipped to "Huddle ended · 18 minutes · 1 person".
Browser 30s reconciliation poll live-tested: manually ended a huddle via DB, observed the UI flip within 35s with no WS event involved.
TOCTOU defense live-tested: reaper correctly aborts force-end when a participant row exists with joined_at >= poll_started_at.
JWT hygiene decoded: TTL=900s, canPublishSources=["microphone"], canPublishData=false. Proxy-mode auth rejects both missing caller_id and non-channel-member callers.
Audit trail this round
Four audits dispatched in parallel after Round 3:
- channel_id fail-open (fixed) and setInterval missing try/catch (fixed).
- huddleLeave stale-sidebar on refresh failure (mitigated by 30s poll).
Final Round 3.5 audit: "Round 3.5 ready to commit."
Operational note
Requires LiveKit started with --node-ip <host-reachable-ip>, e.g. 127.0.0.1 for local single-machine testing. The default Docker container IP (172.17.0.2) is not routable from the host, causing the "could not establish pc connection" error. In production, set it to your LAN or public IP depending on where clients connect from. This matches how hero_osis is deployed.
Day-1-after-ship watchlist
Per the Hero-OS-philosophy audit:
- grep "has .* live joiner(s) that arrived after the LiveKit poll". If non-trivial per day, the reaper interval vs LiveKit handshake window needs tuning.
- grep anchor_missing=true. Should be ~0. Non-zero means load_message is failing post-leave (messages table contention, soft-delete regression).
- "Huddle broadcast dropped". Ask beta users to share console on any "sidebar stuck" report. If common, move huddle events off the chat WS onto a dedicated reliable channel.
What's deferred (still)
- plan/known-issues.md: H-3-1 transactional writes in huddle.start, H-3-2 check_channel_member DB-error propagation (now done ✓), H-3-3 partial unique index, H-3-4 unit tests, H-3-5 joiner_ids dedicated table, H-3-6 TZ storage, H-3-7 structured logs (now done ✓), H-3-8 mute-on-join, H-3-9 active-speaker ring.
- plan/feature-huddles-v2.md: full v2 roadmap including optional server-push bridge.
Verdict
Feature is ship-ready broadly. No feature flag required. Phase 5 done.
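The reaper's TOCTOU defense described in this round (capture poll_started_at before the LiveKit call, delete only rows with joined_at < poll_started_at, and keep the huddle if any later row exists) can be sketched as a pure function. This is an in-memory illustration only: the real code operates on room_participants rows in SQLite, and `ParticipantRow`, `ReapOutcome`, and `reap_empty_huddle` are hypothetical names.

```rust
/// One room_participants row; `joined_at` is a Unix timestamp in seconds.
#[derive(Clone, Debug, PartialEq)]
struct ParticipantRow {
    user_id: u64,
    joined_at: u64,
}

#[derive(Debug, PartialEq)]
enum ReapOutcome {
    /// Ghost rows removed and nobody arrived during the poll: end the huddle.
    Ended { removed: usize },
    /// Someone joined at or after the poll started: keep the huddle alive.
    Preserved { removed: usize, late_joiners: usize },
}

/// Apply the result of a LiveKit poll that reported an empty room.
/// `poll_started_at` must be captured BEFORE the LiveKit call, so any row
/// written while the poll was in flight counts as a late joiner.
fn reap_empty_huddle(rows: &mut Vec<ParticipantRow>, poll_started_at: u64) -> ReapOutcome {
    let before = rows.len();
    // Keep only rows that joined at or after the poll started.
    rows.retain(|row| row.joined_at >= poll_started_at);
    let removed = before - rows.len();
    if rows.is_empty() {
        ReapOutcome::Ended { removed }
    } else {
        ReapOutcome::Preserved { removed, late_joiners: rows.len() }
    }
}
```

Ghost rows are still cleaned up even when the huddle is preserved, which is the property the Round 3.5 fix restores: the late-joiner check no longer disables ghost cleanup.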
Phase 6.0 — Typed RpcError Model — ✅ Shipped
Ref commit: e7c31b9
What landed
Replaces string-based error classification in the RPC dispatch path with a typed RpcError enum. Every variant has a stable JSON-RPC error code and a sanitization strategy. The Internal variant, the safe default for anything handlers didn't explicitly classify, logs the full cause chain server-side with a unique trace_id and returns only that trace_id to the client. No SQL, no filesystem paths, no panic messages ever reach the wire.
Wire contract
Variant (data):
- InvalidRequest
- MethodNotFound
- Validation ({field, reason})
- Internal ({trace_id})
- Unauthenticated
- PermissionDenied
- NotFound
- Conflict
- RateLimited ({retry_after_ms})
Why typed, not anyhow-string-matching
The prior handler chain was handler → anyhow::bail!("Something went wrong with X") → handle_rpc → if msg.starts_with("Method not found") { -32601 } else { -32000 }. That's fragile (wording changes break clients), carries no structured data, and leaked SQL / file paths / panic text onto the wire whenever an unexpected internal error surfaced. The Round-3 audit flagged this as "workaround-grade." Phase 6.0 replaces it.
Key design choices
- From<anyhow::Error> for RpcError ALWAYS produces Internal. No blanket string-classification: that is the workaround we're removing. Handlers that want a typed variant construct it explicitly: Err(RpcError::PermissionDenied("not in channel".into())). A unit test explicitly smuggles anyhow!("Authentication required") through the From impl and asserts the result is Internal, not Unauthenticated, because classification would be easy and wrong.
- Legacy handlers still return anyhow::Result<Value>; the dispatch layer adds .map_err(Into::into), which wraps every such error as Internal. Safe by default; loud on the server (trace_id + full chain); sanitized on the wire.
- trace_id format: e<pid_hex>_<counter_hex>, short enough to read aloud in a support ticket, unique within a process run. Emitted at tracing::warn! level with the full anyhow chain so ops can grep.
Migrated reference implementation
Huddle handlers migrated end-to-end as the pattern for others to follow:
- check_channel_member now returns RpcResult<()>: missing caller_id in proxy mode → Unauthenticated, not-in-channel → PermissionDenied, DB error → Internal.
- huddle::start / join / leave / list / participants return RpcResult<Value> with explicit variants: NotFound, InvalidRequest, Conflict, Internal with trace_id, and Conflict (deployment config issue).
- require_u64(params, field) helper returns Validation with structured data: {field, reason}. Replaces the params["X"].as_u64().ok_or_else(|| anyhow!("X required"))? idiom across all huddle handlers.
Deliberately deferred
- Legacy handlers (workspace, user, channel, message, thread, document, canvas, attachment, room, presence, read, group, permissions, mention) still return anyhow::Result<Value>. Their errors flow through From<anyhow::Error> and arrive as Internal with a trace_id. Client UX is strictly no worse than before: the old path stringified errors directly onto the wire; the new path sanitizes. Each handler migrates to structured variants in follow-up commits as it's touched for validation / rate-limiting / observability work in Phase 6.1+.
- handlers::resolve_self_user_id stays anyhow::Result<u64> for the same reason: 12 callers across 6 modules would need coordinated migration. Tracked for a later sub-phase.
Verification
[…] rpc_error.rs, all passing:
- Internal never leaks cause string contents; message is fixed text, data always contains a trace_id
- From<anyhow::Error> is guaranteed never to classify
- does.not.exist → -32601 "Method not found: does.not.exist"
- huddle.start (no args) → -32602 + data {field:"channel_id", reason:"missing_or_wrong_type"}
- huddle.join (bad room) → -32003 "Huddle room 99999 does not exist"
- huddle.list (non-member) → -32002 "Not a member of channel 4"
- huddle.join (ended room) → -32004 "Huddle has ended"
- (missing method) → -32600 "missing method"
Compatibility
The SDK (openrpc.client.generated.rs) and CLI consume {code, message, data} generically via the vendored ClientError::Rpc(code, msg), with no code-specific pattern matching anywhere in our clients. This commit is strictly additive on the wire: legacy clients continue to work unchanged; new clients can start reading error.code + error.data for structured handling.
Ecosystem notes also in this commit
plan/slack-feature-parity.md: added a note about hero_proxy commit 5f7bb04 (X-Hero-Context now sourced from the authenticated user, not per-domain route, which matches our Phase 1 assumptions exactly).
What's next
Phase 6.1a on top of this foundation: a proper validator crate-based input validation layer, with RpcError::Validation as the output type and structured data conveying which field failed which rule. Then rate limiting (uses the RateLimited variant reserved here), DB indexes, attachment cleanup, integration tests, load test, browser compat.
Phase 6.1a - Input Validation (parse-don't-validate newtypes) - ✅ Shipped
Ref commit: 73e48f7
What landed
Every user-facing string field with a semantic rule (length cap, email format, channel-slug pattern) is now a newtype whose Deserialize impl runs the validator. A value of type Name, Email, ChannelName, Description, or MessageContent is guaranteed to have been validated at serde-decode time; there is no in-crate constructor that skips the check. Handlers cannot see an invalid value; rejection happens before the handler body runs.
Wire contract - live-verified
- workspace.create name="" → field `name`: must not be empty
- workspace.create name=100×🎧
- workspace.create name=101×🎧
- user.create email="not-an-email" → Missing separator '@'
- channel.create name="UPPERCASE" → ^[a-z0-9][a-z0-9_-]*$
- message.send content=""
- {nme: "Typo"} → deny_unknown_fields catches typo
- attachment.upload oversized → {max_bytes, actual_bytes}
- _hero_user present with auth
Key decisions made through specialist consultation
Three independent specialists reviewed the design (type-design, ecosystem precedent, industry consensus):
- […] typify codegen: typify would have saved ~100 lines of Rust, but only by committing to an openrpc.json restructure (components.schemas + $ref) and a build script. That scope is a separate refactor. Public API (Name, Email, etc.) is stable either way; swapping to typify-generated implementations later is mechanical.
- #[serde(deny_unknown_fields)] on input structs. For hero_collab's monorepo with co-versioned SDK/server, the benefit (catching typo'd field names at the boundary) outweighs the forward-compat concern industry auditors flagged for multi-version deployments.
- openrpc.json with JSON Schema constraint annotations (maxLength, minLength, pattern, format): the openrpc_client! macro tolerates unknown keys (verified by reading the macro source), so additions are wire-safe. Spec is now single source of truth for "what is valid input".
- […] user@localhost that every major registration flow rejects in practice.
Newtypes + constants
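A minimal sketch of the parse-don't-validate pattern for one such newtype follows. Assumptions are marked in the comments: the cap of 100 is inferred from the 100×🎧 / 101×🎧 rows in the live-verified list, `chars().count()` approximates the grapheme counting that the real code would need a Unicode segmentation library for, and `TryFrom<String>` stands in for the serde Deserialize hook.

```rust
use std::convert::TryFrom;

/// Assumed cap, inferred from the 100 vs 101 rows above.
pub const NAME_MAX_CHARS: usize = 100;

/// The inner field is private, so the only way to obtain a `Name`
/// from outside this module is through the validating conversion.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Name(String);

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ValidationError {
    pub field: &'static str,
    pub reason: String,
}

impl TryFrom<String> for Name {
    type Error = ValidationError;

    fn try_from(raw: String) -> Result<Self, Self::Error> {
        if raw.is_empty() {
            return Err(ValidationError {
                field: "name",
                reason: "must not be empty".to_string(),
            });
        }
        // Approximation: counts Unicode scalar values, not grapheme clusters.
        let len = raw.chars().count();
        if len > NAME_MAX_CHARS {
            return Err(ValidationError {
                field: "name",
                reason: format!("length {} exceeds {}", len, NAME_MAX_CHARS),
            });
        }
        Ok(Name(raw))
    }
}

impl Name {
    pub fn as_str(&self) -> &str {
        &self.0
    }
}
```

Because a handler receives a `Name` rather than a `String`, it has no code path on which an unvalidated value can arrive.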
What was migrated
9 handlers now return RpcResult<Value> and deserialize through parse_input:
- workspace.create / update
- user.create / update
- channel.create / update
- message.send / update
- attachment.upload
Untouched handlers stay on anyhow::Result<Value> with Internal-wrap-via-From, per Phase 6.0's progressive migration design. Each time a handler is touched for later hardening (rate limit, activity log), it migrates too.
Spec-drift test
A unit test parses openrpc.json at test time, finds each annotated field (maxLength / format / pattern), and asserts every adversarial payload the schema would reject is also rejected by the Rust newtype. Catches drift between spec and Rust WITHOUT requiring code generation. 14 constraint annotations covered; a floor assert!(covered >= 10) fires if annotations get silently removed.
Audit trail
Three-specialist design consultation (pre-implementation) → all three validated:
- […] hero_osis already annotates its specs; we add discipline of Rust enforcement on top
- […] AccountId32
Production-grade audit of the implementation caught two critical bugs BEFORE commit:
- _hero_user injection collided with deny_unknown_fields, which would have broken every authenticated call. Fixed: parse_input strips server-injected fields before typed deserialize.
- UserCreate.workspace_id was spuriously required but neither spec, SDK, nor CLI sends it, so user.create via CLI was 100% broken. Fixed: removed from the struct.
Both fixes have regression tests. Full test suite: 18 unit tests passing.
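The first of the two fixes above (strip server-injected keys, then apply the strict unknown-field check) can be illustrated without serde. In this sketch a `BTreeMap<String, String>` stands in for the JSON params object and an explicit `allowed` slice stands in for the struct fields that deny_unknown_fields would enforce; `strip_then_check` and `SERVER_INJECTED` are hypothetical names, and `_hero_user` is the only injected key named in this thread.

```rust
use std::collections::BTreeMap;

/// Keys the server adds before dispatch (only `_hero_user` is attested here).
const SERVER_INJECTED: &[&str] = &["_hero_user"];

/// Removes server-injected keys first, then rejects any key the typed
/// input would not recognise. Doing it in the other order would reject
/// every authenticated call, which is the bug described above.
pub fn strip_then_check(
    mut params: BTreeMap<String, String>,
    allowed: &[&str],
) -> Result<BTreeMap<String, String>, String> {
    for key in SERVER_INJECTED {
        params.remove(*key);
    }
    if let Some(unknown) = params.keys().find(|k| !allowed.contains(&k.as_str())) {
        return Err(format!("unknown field `{}`", unknown));
    }
    Ok(params)
}
```

With this ordering, an authenticated `{name, _hero_user}` payload passes while a typo such as `{nme}` is still rejected at the boundary.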
Hardening change clients should know
Typed u64 fields now reject string-encoded numeric IDs ("123" is no longer coerced to 123). Pre-migration handlers permissively accepted both via id_from_json. hero_collab_sdk always sends numeric IDs; CLI uses the SDK; no in-house callers affected. Browser-side callers that send BigInt IDs as strings (JSON.stringify cannot serialize a BigInt directly) now need to send numbers. Documented so future integrators have a trail.
What's next
Phase 6.1a is done. Remaining in Phase 6:
- Rate limiting (RpcError::RateLimited variant reserved in Phase 6.0).
Typify evaluation for a spec-driven codegen path is tracked as a separate potential refactor; the current hand-written public API (Name, Email, etc.) would stay stable; only the origin (hand-written ↔ generated) would swap.
Phase 0-5 status audit - done vs documented
Ref commit: 40e8d92 (plan docs only, no functional change)
Audit result
39 DONE · 1 PARTIAL · 7 NOT DONE (1 of the 7 is explicitly deferred by plan). Evidence: file:line per item, from a code-level cross-reference against every subitem in plan/slack-feature-parity.md.
Top 5 gaps (production-impactful)
- […] user.me on first login). Practically blocks multi-user invites. Fix: add proxy_client.rs + new collab.users.available RPC that queries ~/hero/var/sockets/hero_proxy/rpc.sock. Tracked as K-4-2.
- […] K-4-3.
- […] docker-compose.yml, no .env.example, no README section explaining --node-ip. Ops blocker for anyone but the original author. Tracked as K-4-4.
- […] K-4-1.
- […] presence.update / read.updated handlers. Outbound both fire correctly; the browser WS onmessage switch doesn't consume these from peers. Multi-tab presence lags the 60-second poll; read cursors from other tabs never update. Tracked as K-4-5.
Work shipped beyond Phase 5 scope (not folded into "Phase 5 done")
These land in their own commits and are tracked separately from the phase 0-5 audit:
- […] (cc3dcc6): reaper ghost cleanup, DisconnectReason enum fix, URL validation, TOCTOU defense, observability
- Phase 6.0 (e7c31b9): typed RpcError model + sanitized Internal + trace_id
- Phase 6.1a (73e48f7): parse-don't-validate newtypes + spec-drift test + openrpc.json constraint annotations
What this commit contains
Non-code only. Updated plan docs:
- plan/slack-feature-parity.md → new "Implementation Status Snapshot" section at top with the audit table
- plan/known-issues.md → 5 new entries (K-4-1 through K-4-5) with file:line references, concrete fixes, severity tags
Recommended next steps
Phase 6.1b - typed inputs for canvas/document/group (commit 2158cf5)
Extends the Phase 6.1a parse-don't-validate pattern to the remaining write handlers.
Added newtypes (validation.rs):
- Title: ≤200 graphemes, non-empty, trimmed-on-store (canvas/document titles)
- DocumentContent: ≤500,000 graphemes, empty OK (long-form markdown bodies)
- Icon: ≤32 graphemes, empty OK (short UI slug / emoji glyph)
- Role enum: Viewer | Editor, rejects unknown values at deserialize time
Migrated handlers (10 entrypoints): canvas.{create, get, update, delete, share, unshare, update_role, get_preview, save_state, load_state}, document.{create, update, share}, group.{create, update}. All return RpcResult<Value>; dispatcher arms in rpc.rs drop .map_err(Into::into) accordingly.
Audit fixes folded in:
- check_canvas_access now returns RpcResult<()> so policy denials surface as -32002 PermissionDenied / -32001 Unauthenticated instead of being sanitized into -32603 Internal.
- […] Conflict (-32004) to PermissionDenied (-32002): it's a policy refusal, not a state conflict.
- […] unshare owner-removal and soft-deleted canvas lookups (-32003 NotFound).
Backwards compatibility:
group.create preserves all three pre-migration caller shapes: admin dashboard's {name, description}-only (global groups), CLI's singular workspace_id, and the legacy workspace_ids array. Precedence matches: workspace_ids wins when both present.
Spec + tests: 15 new JSON Schema constraint annotations in openrpc.json. The drift-test pick_newtype_test gained three match arms (title / content / icon); coverage floor raised from ≥10 to ≥20 so forgetting to extend the test when adding a newtype fails CI. 30 passing unit tests (up from 25).
Live-verified: Every adversarial payload (empty title, over-cap content, unknown field via deny_unknown_fields, invalid role enum, owner-role change, non-collaborator access, soft-deleted get) returns the expected structured error code via raw Unix socket RPC.
Next up: Phase 6.2 (rate limiting, DB indexes, activity log, observability extension) → Phase 6.3 (attachment cleanup, integration + load tests, browser-compat baseline).
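The group.create precedence rule from the backwards-compatibility note above (the workspace_ids array wins over the singular workspace_id; neither present means a global group) is small enough to sketch directly. The function name `resolve_workspace_ids` and the choice to represent a global group as an empty list are assumptions made for this illustration.

```rust
/// Resolves the three pre-migration caller shapes into one list.
/// Hypothetical helper; an empty result stands for a global group.
pub fn resolve_workspace_ids(
    workspace_id: Option<u64>,
    workspace_ids: Option<Vec<u64>>,
) -> Vec<u64> {
    match (workspace_ids, workspace_id) {
        // Legacy array shape: wins whenever it is present.
        (Some(ids), _) => ids,
        // CLI shape: a single workspace.
        (None, Some(id)) => vec![id],
        // Admin-dashboard shape: {name, description} only.
        (None, None) => Vec::new(),
    }
}
```

Keeping the precedence in one function means the three caller shapes cannot drift apart as the handler evolves.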
Phase 6.2 shipped — runtime hardening (4 commits)
Three sub-phases, landed across concurrent tracks (6.2d/6.2a on main + 6.2b/6.2c in a worktree-isolated agent, merged via cherry-pick).
6.2d - Observability (23573bb)
handle_rpc now emits a structured rpc.dispatch tracing event on every call with rpc.method, rpc.duration_ms, rpc.status (ok|error), and rpc.error_code when failing. Operators can grep rpc.method=huddle.join or filter validation noise via rpc.error_code=-32602.
Three atomic counters added to AppState (rpc_calls_total, rpc_errors_total, rpc_latency_sum_us) and surfaced through system.metrics as rpc_calls_total, rpc_errors_total, avg_latency_ms. Relaxed ordering throughout: monitoring, not consistency. Latency summed in microseconds so sub-ms calls don't round to zero.
hero_collab_ui gains an active_ws_connections gauge (AtomicUsize, paired fetch_add/fetch_sub around both chat and canvas WS handlers) exposed via /health. Single number for "are clients connected".
6.2a - Rate limiting (ae9c3c0)
New rate_limit.rs module. Two buckets per authenticated caller_id, including a message.send/min bucket (tighter inner constraint on the write path that hits the DB hardest).
Two-phase check-then-commit: refill + verify both buckets, then consume only if both pass. A rejection never spuriously depletes the other bucket. This mattered because the naive ordering wasted tokens on send-rejection, which was caught by a unit test.
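The two-phase check-then-commit described above might look roughly like the following. This is a sketch under stated assumptions: the limits passed to `CallerLimits::new` are placeholders (the real per-minute values are not given here), time is passed in as seconds so the example stays deterministic, and the type and method names are hypothetical rather than those in rate_limit.rs.

```rust
/// A token bucket that refills continuously up to `capacity`.
#[derive(Debug, Clone)]
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill_secs: f64,
}

impl TokenBucket {
    pub fn per_minute(limit: u32, now_secs: f64) -> Self {
        let capacity = f64::from(limit);
        TokenBucket { capacity, tokens: capacity, refill_per_sec: capacity / 60.0, last_refill_secs: now_secs }
    }

    fn refill(&mut self, now_secs: f64) {
        let elapsed = (now_secs - self.last_refill_secs).max(0.0);
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill_secs = now_secs;
    }

    fn retry_after_ms(&self) -> u64 {
        let missing = (1.0 - self.tokens).max(0.0);
        (missing / self.refill_per_sec * 1000.0).ceil() as u64
    }
}

/// Outer bucket for every call, inner bucket for message.send only.
pub struct CallerLimits {
    all_calls: TokenBucket,
    sends: TokenBucket,
}

impl CallerLimits {
    pub fn new(calls_per_min: u32, sends_per_min: u32, now_secs: f64) -> Self {
        CallerLimits {
            all_calls: TokenBucket::per_minute(calls_per_min, now_secs),
            sends: TokenBucket::per_minute(sends_per_min, now_secs),
        }
    }

    /// Returns Err(retry_after_ms) when throttled.
    pub fn check(&mut self, is_send: bool, now_secs: f64) -> Result<(), u64> {
        // Phase 1: refill and verify both buckets without consuming.
        self.all_calls.refill(now_secs);
        self.sends.refill(now_secs);
        if self.all_calls.tokens < 1.0 {
            return Err(self.all_calls.retry_after_ms());
        }
        if is_send && self.sends.tokens < 1.0 {
            return Err(self.sends.retry_after_ms());
        }
        // Phase 2: commit. Reached only when every relevant bucket passed.
        self.all_calls.tokens -= 1.0;
        if is_send {
            self.sends.tokens -= 1.0;
        }
        Ok(())
    }

    /// Whole tokens left in (all_calls, sends), for inspection.
    pub fn remaining(&self) -> (u32, u32) {
        (self.all_calls.tokens as u32, self.sends.tokens as u32)
    }
}
```

A rejected send leaves the outer bucket untouched, which is the invariant the naive consume-then-check ordering violated.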
Rejections flow through the same error path as any other error, so they hit the 6.2d counters and tracing. Operators can grep rpc.error_code=-32005 to see throttled callers.
Unauthenticated dev-mode calls (caller_id=None) bypass the limiter entirely, which matches the existing auth model.
7 unit tests covering bucket mechanics, per-caller isolation, and the two-phase invariant. Drops the #[allow(dead_code)] on RpcError::RateLimited: the variant reserved in Phase 6.0 now has a consumer.
6.2b + 6.2c - Indexes + activity log (275674e)
Three new DB indexes on user-id columns: idx_workspace_members_user, idx_channel_members_user, idx_messages_user. CREATE INDEX IF NOT EXISTS so re-running against existing DBs is safe.
New handlers/activity.rs module with log_activity_or_warn, which is fire-and-forget: activity-log failure must NEVER break a successful business operation, so it's logged via tracing::warn! and swallowed. 26 call sites across 6 handler files (workspace, channel, message, canvas, document, group). message.send payload deliberately omits content: it is a high-frequency path, and content is PII.
Ran as a worktree-isolated background agent in parallel with 6.2d/6.2a; cherry-picked onto main after both tracks verified.
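The fire-and-forget rule for the activity log above can be sketched as a wrapper that never returns an error to its caller. Only the helper name `log_activity_or_warn` comes from this thread; its closure-based signature, the `create_workspace` caller, and the use of `eprintln!` in place of `tracing::warn!` are assumptions for illustration.

```rust
/// Runs the activity-log write and swallows any failure after logging it.
/// Deliberately returns `()`: callers have nothing to propagate with `?`.
pub fn log_activity_or_warn<F>(action: &str, write: F)
where
    F: FnOnce() -> Result<(), String>,
{
    if let Err(e) = write() {
        // Stand-in for `tracing::warn!`.
        eprintln!("WARN activity log write failed action={} error={}", action, e);
    }
}

/// Hypothetical handler: the business result is decided before logging
/// and is returned regardless of whether the log write succeeded.
pub fn create_workspace(name: &str, log_write_succeeds: bool) -> Result<String, String> {
    let created = format!("workspace:{}", name);
    log_activity_or_warn("workspace.create", || {
        if log_write_succeeds {
            Ok(())
        } else {
            Err("disk full".to_string())
        }
    });
    Ok(created)
}
```

Making the helper return `()` is the design choice that enforces the rule: a handler cannot accidentally turn a log failure into a failed RPC.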
Live-verified
- […] rpc.health as caller 100 → 60 ok + 3 rate-limited with retry_after_ms ≈ 998ms
- […] rpc_errors_total and emit structured traces
- […] (workspace.create → canvas.create → document.create → group.create) lands 4 rows in activity_log with correct action + workspace_id + data payload
- […] sqlite3 .indexes
- /health on ui.sock returns active_ws_connections
Test count: 40 passing (up from 30)
+7 rate_limit, +3 activity_log.
Next: Phase 6.3 (attachment cleanup, integration tests, load test, browser compat) — running 6H/6J/6K in a parallel agent now, 6G on main.
Phase 6.3 shipped — hardening polish (2 commits)
Two tracks ran in parallel: main thread shipped attachment cleanup; a background agent produced the load test + integration suite + browser-compat doc. Both pushed to development.
6.3G - Attachment cleanup (7beb6b6)
Pending uploads (message_id IS NULL) previously accumulated indefinitely: opening the file picker then closing the tab left a blob in storage + a row in the DB that nothing ever swept. Two entry points now exist:
- attachment.cleanup_pending RPC: admin-only (routes through check_permission; unknown actions fall through to false). Optional older_than_secs param overrides the 24h TTL; handy for ad-hoc reclamation and tests. Returns {deleted_rows, files_removed, older_than_secs}; the two counts diverge only when a blob was already missing on disk (logged warn, DB row still deleted so state reconciles).
- […] main.rs::async_main alongside the existing huddle reaper. 24h TTL, errors logged but don't stop the loop. Silent when there's nothing to do.
Live-verified: uploaded 2 pending + 1 attached; cleanup with older_than_secs=1 after a 2s wait reports deleted_rows=2, files_removed=2 and leaves the attached row untouched.
6.3H/6J/6K - Load test + integration tests + browser compat (2a9b697)
- crates/hero_collab_examples/examples/load_test.rs (473 lines, manual-run only): 20 concurrent WS clients × 10 msg/s × 30s = 6000 sends. Bootstraps fresh workspace + channel + 20 distinct users so each caller has its own rate-limit bucket. Prints per-client + aggregate latency percentiles. Exit 0 iff no WS drops, no SQLite lock timeouts, p95 < 200ms. Rate-limit rejections (-32005) counted separately, do not fail the run.
- crates/hero_collab_server/tests/integration.rs (500 lines): 5 #[tokio::test] fixtures each spawn hero_collab_server via env!("CARGO_BIN_EXE_...") against /private/tmp/hct_.../s/ (short enough for macOS sun_path ≤ 104 bytes; the /tmp symlink breaks storage.rs's canonicalize-based traversal guard). Fixture SIGTERMs on Drop. Tests: happy-path message send+list, reaction toggle idempotence, attachment→message flow with attachment_ids, rate-limit fires on burst, validation fires on empty canvas title. 45/45 server tests pass (40 pre-existing unit + 5 new integration).
- crates/hero_collab_ui/BROWSER_SUPPORT.md: minimum supported browsers (Chrome/Edge 100+, Firefox 100+, Safari 15.4+) with per-API rationale sourced from an actual grep of static/js/ + templates/. The Tested Configurations table currently shows all "not tested"; no structured compat sweep has been performed at the 6.3 baseline, and future manual passes should fill it in honestly.
Test count: 45 (was 40)
+5 integration tests covering end-to-end flows across 6.1a/6.1b/6.2a/6.3G work.
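The 6.3G sweep described above (delete stale pending rows, remove their blobs, and report both counts) can be modelled in memory. This is a sketch only: a `Vec` stands in for the attachments table, a `HashSet` of keys stands in for blob storage, and the struct and function names other than the three reported fields are hypothetical.

```rust
use std::collections::HashSet;

pub struct Attachment {
    pub id: u64,
    /// `None` models `message_id IS NULL`, i.e. a pending upload.
    pub message_id: Option<u64>,
    pub created_at_secs: u64,
    pub blob_key: String,
}

#[derive(Debug, PartialEq, Eq)]
pub struct CleanupReport {
    pub deleted_rows: usize,
    pub files_removed: usize,
    pub older_than_secs: u64,
}

/// Deletes pending rows older than `older_than_secs` and their blobs.
/// A row whose blob is already gone is still deleted, so the two
/// counts diverge only in that case.
pub fn cleanup_pending(
    rows: &mut Vec<Attachment>,
    blobs: &mut HashSet<String>,
    now_secs: u64,
    older_than_secs: u64,
) -> CleanupReport {
    let mut deleted_rows = 0;
    let mut files_removed = 0;
    rows.retain(|row| {
        let age = now_secs.saturating_sub(row.created_at_secs);
        let stale = row.message_id.is_none() && age > older_than_secs;
        if !stale {
            return true; // keep attached rows and fresh pending rows
        }
        if blobs.remove(&row.blob_key) {
            files_removed += 1;
        } else {
            // Stand-in for the `warn` log mentioned above.
            eprintln!("WARN blob already missing for attachment {}", row.id);
        }
        deleted_rows += 1;
        false
    });
    CleanupReport { deleted_rows, files_removed, older_than_secs }
}
```

Run against two pending rows and one attached row, all two seconds old, with `older_than_secs = 1`, the sketch reproduces the shape of the live-verified result: two rows deleted, two files removed, the attached row untouched.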
Phase 6 status
Core hardening complete. Every planned 6.x subitem has shipped: 6.0 typed errors, 6.1a/6.1b typed inputs + newtypes, 6.2a rate limiting, 6.2b indexes, 6.2c activity log, 6.2d observability, 6.3G cleanup, 6H/6J/6K tests+docs.
Not closing #9 yet — real gaps remain
The Phase 0-5 audit from 2026-04-17 surfaced real production blockers tracked as K-4-1..K-4-5 + K-6-1 in plan/known-issues.md:
- […] onmessage doesn't handle presence.update / read.updated. Multi-tab lag.
Rough sequencing for next ~1 working week: 5A + 0A together (small) → 1G federation (blocker) → 2D (small WS fix) → 2F + 2G (UI panels) → K-6-1 docs.
Post-Plan-A hardening — Sprints H1/H2/H3 shipped; P0/P1 next
Three hardening sprints landed since the Phase 6.3 comment, plus an audit round that surfaced a real XSS ship-blocker. Everything below is on development as of 74d137b.
Sprint H1 - ship-blockers from the Plan-A audit (055d5b8)
Four audits ran in parallel (code-reviewer, silent-failure-hunter, Slack-parity, business-logic) against the Plan-A end state. Three ship-blockers landed in H1:
- canvas.update_role against a missing canvas returned -32603 Internal instead of -32003 NotFound. Replaced .ok().flatten() with explicit QueryReturnedNoRows match.
- message.send attachment-claim path used let _ = db.execute(...) on the blob UPDATE, so a silent failure desynced the message_id column from the blob's message_id field. message.get (which reads the blob) would then show attachments: [] for a successful insert. Fix: wrap the whole send flow in unchecked_transaction, use ? instead of let _ =, commit at end.
- attachment.cleanup_pending denial in proxy mode returned -32603 Internal instead of -32002 PermissionDenied (anyhow path, not the typed wrapper).
Regression tests: h1_s1, h1_s2, h1_s4 in tests/integration.rs.
Sprint H2 - pre-beta hardening, straightforward items (8dc9ada)
10+ items, skipping S3 (canvas WS close UX) and H-B (startDm federated path) per Specialist 1's scope estimate at the time:
- hero_proxy_sdk::users_list calls inside users_available now bounded by a 3s tokio timeout. A hero_proxy mid-restart (socket bound, accept loop not running) was blocking block_in_place workers indefinitely; under 10 concurrent calls that would exhaust the tokio pool.
- room.end now allows creator-OR-admin. Previously required workspace_admin.
- channel.member.add self-add guard extended from kind == "dm" to also cover kind == "group". Pre-fix this produced a 1-member "group DM with nobody."
- message.send wrapped in unchecked_transaction (also covers S2 above; they merged).
- chat-app.js WS onclose handler now inspects close codes 1008/1011/≥4000, suppresses reconnect, and shows a toast. Exponential backoff restored for transient codes.
- […] check_permission_typed across canvas.update_role, the 6.3G RPC, and the related .ok().flatten() sites.
- openrpc.json schema fixes (attachment.upload, canvas.share, canvas.save_state, canvas.load_state) + missing collab.users.available method spec.
Regression tests: h2_hd, h2_he.
Sprint H3 - close the two H2 deferrals (8fce623)
Verification during H3 found Specialist 1's scope warnings on S3/H-B were based on y-websocket v1.x APIs; the repo actually pins y-websocket@^3.0.0 (scripts/vendor-bundle/package.json). v3 exposes connection-close as a public event delivering the native CloseEvent, so S3 collapses from a half-day subclassing exercise to ~15 LoC of event handler.
- canvas-app.js::initEditor now subscribes to wsProvider.on('connection-close', ...), halts reconnect on terminal codes 1008/1011/≥4000, and surfaces CloseEvent.reason via the existing showError() panel. In parallel, routes.rs::ws_canvas_handler now upgrades the socket and sends CloseFrame(1008, reason) on auth rejections instead of returning HTTP 403: browsers translate a 403-on-upgrade to code 1006 with no reason string, losing the server message entirely. Upgrade-then-close is the standard pattern that lets the reason reach the client.
- […] startDm against a federated-only user. New RPC collab.users.claim_federated(username) in handlers/federation.rs mirrors user.me's upsert-by-external-id: verify the username exists in hero_proxy via users_list, then INSERT OR IGNORE into users with external_id = username. Fail-closed: hero_proxy unreachable → error, never an unverified local row. chat-app.js::pickDmUser now materialises local: false picks via this RPC before calling startDm. Idempotent: repeated claim for the same identity returns the existing id without a hero_proxy round-trip.
Regression tests: h_b_1 (missing username → Validation), h_b_2 (idempotent claim for existing local row).
Full 24-test integration suite green across H1/H2/H3.
SDK regeneration (74d137b)
openrpc.client.generated.rs picks up collab.users.claim_federated, collab.users.available, and the attachment.upload input-shape fix. No behaviour change; typed clients available for all three.
External audit round - P0/P1 ratified, proceeding locally
An independent audit re-landed after H3. Cross-checked every claim against the tree; ratified the list with one reframe and one second-order finding.
P0 (actual ship-blocker, 1 item)
XSS via markdown → innerHTML. Custom _markedRenderer at chat-app.js:218-222 overrides only .html. marked@15.0.7's default .link handler does not block javascript: / data: schemes; its URL helper Y() only runs encodeURI. So [click](javascript:alert(document.cookie)) renders as a live <a href="javascript:alert(...)"> across six renderMarkdown → innerHTML sites (chat-app.js:1038, 1214, 1274, 1869, 1896, 2476). Zero DOMPurify anywhere.
Fix order, pinned after verifying both regex replacements in renderMarkdown emit hardcoded-shape HTML (mention \w+ capture, canvas-card \d+ capture; both literal-only attrs):
P0 second-order (same class, deployment-invariant)
base_path_middleware (routes.rs:52-60) reads X-Forwarded-Prefix with zero scheme validation, only .trim_end_matches('/'). Template inlines into <meta name="base-path"> and chat-app.js concatenates into canvas-card href as basePath + '/canvas/' + id. An attacker reaching ui.sock directly can inject X-Forwarded-Prefix: javascript:alert(1)# → href="javascript:alert(1)#/canvas/1". The # fragment trick fires the javascript: scheme.
Same class as the rpc_proxy 4-header trust boundary. Fix: one ops note in deploy/README.md ("ui.sock must sit behind hero_router") + a two-line defense-in-depth in base_path_middleware: if the trimmed value is non-empty and doesn't start with /, discard.
P1 (correctness / quality; user's stated bar)
- […] message.rs:528, 597). Still anyhow::bail!, so wire codes are wrong. Tracked as P-D-1 / P-D-4.
- […] if let Some(cid) = caller_id { if cid > 0 { check } } (message.rs:31,333,379,520,589 + channel.rs:187,281). Fail-open by shape in dev mode, safe in proxy, but collapse into one require_caller() helper.
- […] data-* + addEventListener.
- […] -D warnings in CI: 52 current warnings, 44 of them fail -D warnings (the other 8 are allowed-by-config lints). Tracked as P-C-5 / P-C-6.
- […] Notification, but we do. One-line fix.
P2 / P3 (tracked, not scheduled)
- […] .unwrap() on .lock() → db_lock() helper. Hygiene, not blocking.
- […] canvas.save_state / huddle.* to parse_input. Functionally equivalent today.
- […] schema_version table + explicit migration list before the 4th schema change.
- […] hero_archipelagos_core + other ecosystem crates to commit SHAs instead of branch = "development".
- chat-app.js ES-module split (3,649 lines, 40+ window.* leaks): own project.
- hero_collab_app admin island coverage of new entities: scope question.
Execution plan
- […] base_path_middleware defense-in-depth + BROWSER_SUPPORT.md fix.
- […] require_caller() helper → inline-onclick modernisation → rpc_proxy deploy doc → clippy gate (last, once all other lints are quiet).
- […] cargo test + Playwright browser smoke before the next one starts.
Will post follow-up per-commit summaries.
P0 + P1 shipped - six commits on development (3088595 → 69b9c22)
All items from the external-audit plan I outlined in #issuecomment-20102 are now closed. Each commit verified locally with cargo test (28/28 integration) + a Playwright browser smoke against a TCP-bridged dev stack.
P0 - XSS + base_path (3088595)
The audit's one ship-blocker was that marked v15 no longer blocks javascript: / data: / vbscript: URL schemes through the default .link renderer; the custom _markedRenderer at chat-app.js:218 only overrode .html. A message [click](javascript:alert(document.cookie)) rendered as a live XSS on every renderMarkdown() → innerHTML site (six of them).
- scripts/vendor-bundle picks up dompurify@^3.4.0; the UMD bundle is copied to crates/hero_collab_ui/static/js/purify.min.js (same pattern as marked / highlight.js).
- […] renderMarkdown and in editor.html::updatePreview: Steps 3 and 4 are safe post-sanitize because the captures (\w+, \d+) are constrained and the emitted HTML is fixed-shape with only literal attributes.
- […] sanitizeMarkdown() falls through to HTML-entity escaping, so a broken vendor dep produces text-only messages, not live HTML.
- base_path_middleware (routes.rs) now rejects any X-Forwarded-Prefix value that doesn't start with /. Without that, an attacker reaching ui.sock directly could have injected javascript:alert(1)# and turned every templated link into a live javascript: scheme.
- […] Notification under "we do not use" while H10 shipped background-tab notifications. Fixed, and added DOMPurify to the "we use" matrix.
- scripts/test-xss-sanitization.mjs loads the same minified marked.min.js + purify.min.js the browser gets, runs renderMarkdown against an XSS corpus (javascript: / data: / onerror / <script> / mention preservation / canvas-card \d+ boundary / https non-regression). 12 assertions, all green.
P1-A - pin/unpin typed errors (06801b0)
message::pin / message::unpin migrated from anyhow::Result<Value> to RpcResult<Value>. Pre-fix, every failure mode collapsed to -32603 Internal, so clients couldn't distinguish "message not found" from "not a channel member" from a real server bug.
Post-fix: […] id param […] anyhow::bail!)
Three regression tests (p1a_pin_missing_message_is_not_found, p1a_pin_missing_id_is_validation, p1a_pin_by_non_member_is_permission_denied) pin the wire codes. rpc.rs drops the .map_err(Into::into) for both methods.
The broader P-D-1 cascade (react / toggle_react / get / list / update / delete / search + channel/document/group leftovers) still depends on migrating check_permission itself to RpcResult. Tracked as a follow-up; this commit closes the specific pin/unpin pair the audit called out.
P1-B - require_caller() helper (63d5f17)
The if let Some(cid) = caller_id { if cid > 0 { check } } shape appeared at 7 sites (message.rs:31/333/379/520/589, channel.rs:187/281). Its latent failure mode: in proxy mode, if the header middleware ever failed to inject caller_id, the handler silently skipped the permission check.
New helper permissions::require_caller(caller_id: Option<u64>) -> RpcResult<Option<u64>>:
- Ok(Some(cid)) when caller is present and > 0 → run the check.
- Ok(None) in dev mode, no caller → skip check (preserves CLI + user-picker flows).
- Err(Unauthenticated) in proxy mode, no caller → fail-closed.
All 7 sites migrated.
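The three outcomes listed above map onto a small match. In this sketch the auth mode is passed in explicitly as a hypothetical `AuthMode` parameter (the real helper takes only `caller_id` and reads the mode from process state), the error type is reduced to the one variant needed, and treating `Some(0)` the same as a missing caller is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AuthMode {
    Dev,
    Proxy,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RpcError {
    Unauthenticated(String),
}

/// Ok(Some(cid)) → caller known, run the permission check.
/// Ok(None)      → dev mode without a caller, skip the check.
/// Err(..)       → proxy mode without a caller, fail closed.
pub fn require_caller(mode: AuthMode, caller_id: Option<u64>) -> Result<Option<u64>, RpcError> {
    match caller_id {
        Some(cid) if cid > 0 => Ok(Some(cid)),
        // Assumption: Some(0) is handled like a missing caller.
        _ => match mode {
            AuthMode::Dev => Ok(None),
            AuthMode::Proxy => Err(RpcError::Unauthenticated(
                "caller identity required".to_string(),
            )),
        },
    }
}
```

A call site then reads `if let Some(cid) = require_caller(mode, caller_id)? { /* permission check */ }`, so the fail-closed branch cannot be skipped by accident.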
p1b_dev_mode_missing_caller_id_still_works pins the non-regression guarantee (dev-mode pin without caller_id still succeeds). The proxy-mode fail-closed is guaranteed by the helper's code shape; a full end-to-end proxy test requires provisioning workspace_admin grants, blocked by the broader P-D-1 typed-error migration.
Caveat: for the still-anyhow handlers (channel.member.add/remove, message.react, message.delete), Err(Unauthenticated) converts back to anyhow::Error → RpcError::Internal on the wire (collapses to -32603). The check itself still fires (the ? short-circuit prevents the handler body from running), but the wire code is classified incorrectly. Resolved when P-D-1 migrates those handlers.
P1-C - inline-onclick modernisation (bed794e)
Three inline onclicks interpolated a full JSON-serialised object into the HTML attribute:
Contained today (objects are server-sourced), but one escape-mismatch away from breaking JS parsing inside the attribute.
Replaced with data-channel-id / data-user-id / data-start-dm-user-id numeric attributes + a one-time delegated click listener on each parent (#channels-list, #dm-list, #user-picker-list), bound with a data-p1c-bound="1" marker to prevent double-registration. The huddle-join icon gets its own data-join-channel-id so the delegate can distinguish "clicked icon" from "clicked row" (preserving the previous event.stopPropagation() behaviour).
Browser-verified: no inline onclicks on .channel-item after render; clicking a row drives selectChannel via the delegate; zero console errors.
P1-D - deployment invariant doc (5d8736c)
Codified the hero_proxy/hero_router trust boundary in deploy/README.md. The audit's "allowlist forwarded headers" framing was wrong: rpc_proxy already forwards only 4 named headers. The real gap was missing documentation.
Added production-deploy checklist (hero_proxy binds external only, socket-dir perms locked to hero_router's user, no stray forwarders, TCP shims bind loopback). Called out that COLLAB_AUTH_MODE=dev does NOT fail-closed on missing identity: intentional for CLI/test flows but explicitly unsupported in production.
P1-E - clippy -D warnings gate (69b9c22)
Brings server + ui to cargo clippy -- -D warnings clean. 43 source warnings → 0.
Most fixed by cargo clippy --fix (38 sites: collapsible_if × 16, useless_conversion × 6, and_then→map, map_or→is_some_and, etc.). Manual fixes for:
- routes.rs::canvas_channels: extracted CanvasBroadcastSender type alias.
- message.rs::search: if let (Some(cid), Some(ws_id)) = (caller_id, workspace_id) replaces the unwrap()-after-is_some() pair.
- attachment.rs::sanitize_attachment_filename: removed identical-branches if and the now-dead decoded variable.
- #[allow(dead_code)] on public API helpers with no current callers but stable contracts (UserCreate.caller_id, UserUpdate.caller_id, TokenBucket::try_consume, Description::is_empty).
New make lint-strict target: cargo clippy -p hero_collab_server -p hero_collab_ui --no-deps -- -D warnings. Scoped to the two audited crates for now; broadening to --workspace is a follow-up once sdk/app/cli are also clean. This is the CI gate the P-C-5/P-C-6 follow-up items tracked.
What's left
Still tracked in
plan/known-issues.mdcheck_permission→RpcResultmigration; unlocks correct wire codes for react/toggle_react/get/list/update/delete/search + channel/document/group leftovers. ~15 handlers.db_lock()helper +.unwrap()sweep (107 sites), validation-style convergence onparse_inputfor huddle/canvas.save_state,schema_versiontable.branch = "development"(hero_archipelagos_core, hero_proxy_sdk, etc.) — a SHA pin would lock out silent-upstream-rebase risk. Not unique to hero_collab; ecosystem-wide pattern.Pre-existing bootstrap quirk surfaced during browser smoke (not a regression)
When hero_collab_server boots fresh, the default channel 1 is created but the default user 1 is NOT auto-added as a member. A
message.sendfrom user 1 → channel 1 is denied by the member check (and user 1 has noworkspace_admingrant). This has existed since before the audit — fixtures work around it by callingchannel.member.addin every setup. Should fix at the bootstrap layer (either auto-add creator as member inchannel.create, or provision user 1 as workspace_admin in the initial migration). Filing as a follow-up.SDK + CLI + openrpc.json
Verified current:
openrpc.client.generated.rsdoesn't drift oncargo build, CLI has no pin/unpin/require_caller/base_pathtouchpoints, openrpc.json has 90 methods (no schema additions in P0-P1). P0/P1 was purely security, typed-error, refactor, docs, style — no RPC shape changes.All six commits pushed to
origin/development.Post-P1 hygiene + dev-UX sprint — 4 more landmark closures
Since #issuecomment-20104 landed the P0/P1 batch, 11 commits shipped on development (ca38d23..51b6e4b). Split into five themes:

Config plumbing: CLI flags end-to-end
ca38d23 feat(config): externalise auth-mode + LiveKit config via CLI flags

Fixed a silent hero_proc-supervision gap: the CLI wrapper registered the server action without forwarding any configuration, and hero_proc spawns children with a clean env. As a result, COLLAB_AUTH_MODE=dev hero_collab --start silently booted in proxy mode, and LiveKit credentials the operator exported never reached the binary.

Three-tier precedence:
hero_collab --start flag
--auth-mode=<dev|proxy>, --auth-mode=<dev|proxy>, COLLAB_AUTH_MODE
--livekit-url=…, --livekit-url=…, LIVEKIT_URL
--livekit-api-key=…, --livekit-api-key=…, LIVEKIT_API_KEY
--livekit-api-secret-file=…, LIVEKIT_API_SECRET_FILE or legacy LIVEKIT_API_SECRET

No --livekit-api-secret CLI flag: argv is world-readable via ps auxww. The secret-file path is the only argv-safe option; the env fallback stays for legacy deployments.

Supporting plumbing:
- 9065374: dev-mode flag is read from a OnceLock<bool> populated by init_dev_mode() at startup. It was previously read from the env var at request time, which went stale the moment the CLI flag became the source of truth.
- b2f7cbc: dev-mode now bypasses check_permission unconditionally instead of only when caller_id was absent. This eliminated a foot-gun where any DevTools RPC that passed caller_id hit a fresh-DB permission wall.
- 5d858b1: LiveKit UDP port range narrowed from 10 000 ports to the single 7882 that --dev mode actually uses. This caught a bind: address already in use during local bring-up on port 53667.

Bootstrap rewrite: seeded dev fixtures
4855d72 feat(dev): --seed-dev-users flag + make devstart + drop "You" auto-create
8d0be18 fix(channel): auto-add creator to channel_members as admin on create

Two closures on the bootstrap design gap:
- channel.create now auto-adds the creator as a channel admin. Previously the handler left the new channel empty of members, so every client had to follow every channel.create with a channel.member.add, which was error-prone and not Slack-parity. Closed by 8d0be18.
- chat-app.js::init no longer silently creates {name:"You", email:"you@hero.collab"} on an empty DB. An empty DB now shows an explicit welcome screen with a make devstart pointer. Server-side seeding is handled by a new dev_seed module behind --seed-dev-users, provisioning 4 named users (Alice/Bob/Carol/Dave), one "General" workspace, and one #general channel with all 4 as members (Alice admin). Triple-gated (opt-in flag + dev-mode-only + empty-DB-only). A new make devstart target does the whole wipe-install-restart-seed flow in one call. Closed by 4855d72.

The third bootstrap sub-case (first-user-is-workspace-admin in proxy mode) remains open; it's a real design question about Slack-style "workspace owner" semantics and is tracked in known-issues.md.

Hygiene: parking_lot swap
dde51a4 refactor(hygiene): swap std::Mutex for parking_lot::Mutex on the DB handle

Closes the 106-site .lock().unwrap() pattern and its latent bug: a single handler panicking while holding the DB lock would poison std::sync::Mutex, turning every subsequent .lock().unwrap() into a panic that killed the server for the rest of its life. parking_lot::Mutex has no poison semantics: a panicking holder cleanly releases the lock. Mechanical sweep of all 106 call sites (rate-limit's unrelated Mutex<HashMap> left on std::sync). Zero API-surface change, zero happy-path behavior change, class of bug eliminated.

P-D-1 typed-error cascade: closed in two commits
5436c80 refactor(types): P-D-1 step 1 - migrate foundation fns + user.rs + remove _typed shims
49a89ae refactor(types): P-D-1 step 2 - migrate 37 handlers to RpcResult, close the cascade

The big one. check_permission and resolve_self_user_id now return RpcResult directly; the *_typed string-match wrappers are gone. All 37 previously-anyhow handlers (message / channel / workspace / document / canvas / group / user families) migrated to RpcResult<Value>. Every .map_err(Into::into) dropped from the rpc.rs dispatch (37 sites). Common patterns (ok_or_else(|| anyhow!("X required"))) replaced with RpcError::Validation { field, reason }. QueryReturnedNoRows branches mapped to RpcError::NotFound.

Wire-code contract now matches the failure mode:
-32603 Internal is now reserved for genuine unclassified exceptions. anyhow stays as the error type for internal DB helpers (has_cycle, resolve_user_rights, fetch_*_for_messages); they bubble up through RpcError::Internal(anyhow::Error) at the handler boundary, which preserves the full cause chain for trace_id logging.

Also closes P-D-4 (message.react/unreact/toggle_react + check_message_not_deleted), subsumed by the same sweep.
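For readers who want the shape of the typed-error pattern, here is a minimal sketch, not the actual hero_collab code. Only -32603 for Internal is stated in this thread. -32602 is the JSON-RPC 2.0 "Invalid params" code, and using it for validation failures is an assumption. The -320xx values are placeholders from the spec's server-defined range. Box<dyn Error> stands in for anyhow::Error to keep the sketch dependency-free, and find_message and require_u64 are hypothetical helpers.

```rust
/// Hypothetical error type mirroring the failure classes discussed above.
#[derive(Debug)]
enum RpcError {
    Validation { field: &'static str, reason: String },
    NotFound(String),
    Unauthenticated,
    PermissionDenied,
    RateLimited,
    Conflict(String),
    /// Stand-in for RpcError::Internal(anyhow::Error).
    Internal(Box<dyn std::error::Error + Send + Sync>),
}

type RpcResult<T> = Result<T, RpcError>;

impl RpcError {
    /// Wire code for each failure class. Only -32603 is taken from the
    /// thread; every other value here is an illustrative assumption.
    fn code(&self) -> i32 {
        match self {
            RpcError::Validation { .. } => -32602, // JSON-RPC "Invalid params"
            RpcError::NotFound(_) => -32001,       // placeholder
            RpcError::Unauthenticated => -32002,   // placeholder
            RpcError::PermissionDenied => -32003,  // placeholder
            RpcError::RateLimited => -32004,       // placeholder
            RpcError::Conflict(_) => -32005,       // placeholder
            RpcError::Internal(_) => -32603,       // JSON-RPC "Internal error"
        }
    }
}

/// Replaces the ok_or_else(|| anyhow!("X required")) pattern with a
/// typed validation error that names the offending field.
fn require_u64(value: Option<u64>, field: &'static str) -> RpcResult<u64> {
    value.ok_or_else(|| RpcError::Validation {
        field,
        reason: "required".to_string(),
    })
}

/// Hypothetical handler: a missing row becomes NotFound, not Internal.
fn find_message(store: &[(u64, &str)], id: Option<u64>) -> RpcResult<String> {
    let id = require_u64(id, "message_id")?;
    store
        .iter()
        .find(|(mid, _)| *mid == id)
        .map(|(_, body)| body.to_string())
        .ok_or_else(|| RpcError::NotFound(format!("message {id}")))
}
```

With this shape, a client can branch on the numeric code alone: a missing message_id and an unknown message id produce different codes, and neither is reported as an internal failure.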
Docs + tracker catchup
9cf1664 docs: refresh README + deploy + known-issues for the CLI-flag era
51b6e4b docs(plan): consolidate external-audit follow-up + mark in-line closures

README quickstart rewritten around make devstart. deploy/README.md gained the three-tier flag/env precedence table plus a "getting docker compose to see .env" workaround (hit during LiveKit bring-up: compose v2 didn't auto-load the .env despite the file being in the right directory; set -a; source .env; set +a is the fix). plan/known-issues.md consolidated: P-A-6, P-D-1, P-D-4, and the bootstrap-design-gap sub-cases all marked CLOSED with commit cross-references.

Verification status
- cargo build (server + ui + sdk + cli + app)
- cargo clippy --no-deps -- -D warnings
- cargo test -p hero_collab_server --test integration
- make devstart end-to-end

What's next
The following items remain in known-issues.md with open status:
- user.me grants admin to the first caller.
- parse_input: canvas.save_state still uses the params["..."].as_u64() pattern. Style/DRY, not correctness.
- schema_version table + explicit migration list: speculative; bites on the 4th migration, not today.

All pushed to development. Ready for review.

Plan closure: Slack feature parity shipped
Closing out the 7-phase plan with a look at what actually changed. This comment is written for product/business review; commit messages, plan/known-issues.md, and the session-level comments above carry the technical detail.

Where we started
Before the plan, hero_collab was a solid backend (most of the RPC surface already existed — 64 methods, 18 tables, group-based permissions) paired with a UI that hit a wall the moment anyone tried to use it. Broken reactions, a paperclip button that did nothing, no auth, no unread counts, no search, no canvases, no huddles, a user labelled "You". Maybe 10-15% of what a user would call "Slack-shaped" was working end-to-end.
The plan was an exhaustive Slack-feature-by-Slack-feature map, 7 phases of implementation, backed by research into Slack's own model plus how Mattermost, Zulip, Rocket.Chat, Conduit and Element had each tackled the same space. Tool choices (Tiptap + Yjs for canvases, LiveKit for huddles, SQLite FTS5 for search) came out of that research — each evaluated against the next 2-3 alternatives before committing.
What shipped
Identity & onboarding. Authentication now flows through hero_proxy: the internet-facing layer authenticates (OAuth, signature, or IP match), strips spoofable headers, and injects trusted identity headers that hero_collab reads downstream. Operators can test locally in dev mode via a one-line make devstart that provisions 4 named users plus a default workspace and channel. The "You" placeholder is gone; the welcome screen is explicit. Multi-user testing is just open-two-tabs-and-pick.

The core chat loop. Every Slack-daily-use feature works: send/edit/delete messages, threaded replies, reactions (the toggle bug was one of the first fixes), file attachments (drag-and-drop compose flow with image preview), @mentions with autocomplete + notification badge + background-tab browser alerts, inline editing, public + private channels, direct messages and group DMs, channel discovery, pinned messages, full-text search via SQLite FTS5, unread counts, typing indicators, presence (online/offline with timeout-based stale-detection), context menus, keyboard shortcuts, a full emoji picker, markdown rendering with syntax highlighting and mermaid diagrams.
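As an illustration of the timeout-based stale-detection mentioned for presence, the following is a minimal sketch under stated assumptions rather than the shipped implementation. PresenceTracker, the heartbeat model, and the timeout value used with it are all hypothetical. The only idea taken from the text is that a user counts as offline once no activity has been seen within a timeout.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical in-memory presence table (names are illustrative only).
struct PresenceTracker {
    timeout: Duration,
    last_seen: HashMap<u64, Instant>,
}

impl PresenceTracker {
    fn new(timeout: Duration) -> Self {
        Self { timeout, last_seen: HashMap::new() }
    }

    /// Record activity for a user (assumed to be a periodic client heartbeat).
    fn heartbeat(&mut self, user_id: u64, now: Instant) {
        self.last_seen.insert(user_id, now);
    }

    /// A user is online only if their last heartbeat is within the timeout.
    fn is_online(&self, user_id: u64, now: Instant) -> bool {
        self.last_seen
            .get(&user_id)
            .is_some_and(|seen| now.saturating_duration_since(*seen) <= self.timeout)
    }

    /// Remove stale entries and return the ids that just went offline,
    /// sorted so callers get a deterministic order.
    fn sweep(&mut self, now: Instant) -> Vec<u64> {
        let timeout = self.timeout;
        let mut stale: Vec<u64> = self
            .last_seen
            .iter()
            .filter(|(_, seen)| now.saturating_duration_since(**seen) > timeout)
            .map(|(id, _)| *id)
            .collect();
        stale.sort_unstable();
        for id in &stale {
            self.last_seen.remove(id);
        }
        stale
    }
}
```

Passing the current time in as a parameter, instead of calling Instant::now() inside the tracker, keeps the logic testable without sleeping; a periodic task could call sweep and broadcast the returned ids as offline events.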
Canvases. Real-time collaborative markdown documents, the closest analogue to Slack Canvases. Two people editing the same canvas in two tabs see each other's cursors, and edits merge without conflict: CRDT-backed via Tiptap + Yjs (server-side yrs). Canvas cards can be shared into chat messages; the picker lets you insert them inline in the composer.

Huddles. Voice calls backed by LiveKit, an open-source WebRTC SFU. Click the headphones icon in any channel header, get a JWT, connect. The whole stack (LiveKit + Redis) ships as a docker-compose file; operators just fill in credentials.
Federation. hero_proxy users who have never opened hero_collab now appear in the invite dropdown, which fixes the "you can only invite people who have already logged in" gap. A claim_federated RPC materialises the local row on first DM, with a fail-closed guard if hero_proxy is unreachable.

Architecture decisions
Two choices shaped most of the implementation and deserve being on the record:
Path A over Path B: we chose "standalone chat service with its own SQLite + hero_proxy headers" rather than "chat UI layer backed by hero_osis's multi-context communication domain". The former ships faster, iterates freely on the data model, and works as a single-context product without requiring hero_osis. The trade-off is that hero_collab can't participate in hero_osis multi-tenancy and its messages can't be referenced by hero_osis's cross-domain features (calendar integration, etc.). Migration to Path B remains possible: every hardening improvement we made is reusable; only the schemas would be re-authored. Full rationale, timeline evidence, and migration triggers are documented in plan/known-issues.md.

Typed errors over anyhow: every RPC now returns a typed RpcResult with specific JSON-RPC error codes for each failure class (validation, not-found, unauthenticated, permission-denied, rate-limited, conflict). Before this migration, every failure mode collapsed to a generic "-32603 Internal error"; clients couldn't distinguish "you sent bad input" from "the server is broken." It took two commits spread across sprints to fully cascade through 40+ handlers.

Hardening: the invisible half
A platform that crashes once a week looks broken even if it has every feature. Phase 6 and Plan A were entirely about not-that:
parking_lot.
/ (blocks a javascript: injection vector), the deployment invariant that internal sockets must sit behind hero_router is documented with a production checklist.
-D warnings CI gate.

What's out of scope
Three Slack-heavyweight features are explicitly NOT in the plan: third-party app integrations (the "/giphy" kind of thing), Slack Connect (cross-org channels), and Enterprise Grid (multiple workspaces as one tenant). These are billion-dollar features that would each be their own multi-month project; we didn't build toward them and the plan says so.
Before vs. after
Roughly: we went from ~10-15% of Slack's daily-use surface working to ~85-90%. The remaining gap is the explicitly-out-of-scope set above (apps, Slack Connect, Enterprise Grid) plus polish items tracked in plan/known-issues.md (first-user-is-admin convention for proxy mode, chat-app.js ES-module split, a few minor UX refinements).

What's next
A UX-focused iteration pass to refine the rough edges: the kind of polish that comes from actually using the product for a week straight rather than testing features in isolation. Mostly flow-level, not feature-level. Separately, the plan/known-issues.md backlog tracks the design questions that surfaced during the work (bootstrap ownership convention, ecosystem version pinning, admin-dashboard coverage of newer entities); those each deserve their own scoped conversation rather than riding along in another broad sweep.

Closing this one.