[ops] Restore herodemo.gent01.grid.tf to fully-functional state — services updated, data populated, demo browseable #46

Open
opened 2026-04-30 16:30:49 +00:00 by mik-tf · 1 comment
Owner

Goal

Get https://herodemo.gent01.grid.tf/ back to a fully-functional demo state — every archipelago tab shows live content, all services on the latest origin/development binaries, demo browseable end-to-end. This is the operational priority, separate from the reproducibility work tracked in the runbook + seed issues.

Why this is now an issue (state at 2026-04-30 ~16:30 UTC)

Today's session fixed multiple real bugs (hero_proc sysmon fd leak hero_proc#81, hero_office X-Forwarded-Proto, photos double-slash, etc.) but the operational disruption (hero_proc daemon restarts during diagnosis, partial sweep failure on hero_os install, mass-bounce that didn't include the per-domain hero_osis_* services) left the demo in a state where:

  • Per-domain hero_osis_* services have stale supervisor state. UI calls return HTTP 404: Socket 'rpc.sock' not found for 'hero_osis_<X>' (observed for hero_osis_base and hero_osis_communication; likely affects all 17 domains listed in step 1).
  • 15 of 24 services still on pre-sweep binaries — they were bounced into running state but their binaries weren't actually rebuilt because the sweep stopped at hero_os install (local changes detected, committing before pull error).
  • Demo data status is unverified. Earlier screenshots from working demo days showed Persons / Companies / Deals / Sprints / Calendar events / Messages / Photos / Songs / Videos. Right now every OSIS-backed view in the UI is empty. It is unclear whether the OSIS data on disk is intact and merely blocked by the dead sockets, or whether some of it was lost during today's daemon swaps.

Acceptance criteria

  • Step 1 — bounce per-domain hero_osis_* services (immediate fix for 404s):
    for d in identity base business calendar code communication embedder files finance flow job ledger media network projects settings ai; do
        hero_proc service restart hero_osis_$d
    done
    
    Verify each has rpc.sock + ui.sock post-restart.
  • Step 2 — browser audit in incognito: log in, click through every archipelago tab, record what's empty, what's populated, what errors. Match against the canonical "working demo" state from the Apr 24 baseline.
  • Step 3 — resolve the hero_os sweep blocker so phase 1 can complete:
    • Investigate hero_os source repo on the VM at ~/hero/code0/hero_os: what local uncommitted changes are there?
    • Either commit (if real), stash, or hard-reset to origin/development.
    • Re-run service_complete --update --release to get the remaining 15 services rebuilt.
    • Phase 2 will then bounce them all with new binaries.
  • Step 4 — handle missing demo data:
    • If OSIS data is intact post step 1 (records still there, UI shows them): take a clean backup snapshot.
    • If data is missing: choose between (a) restoring the most recent backup tarball to recover the historical demo state, or (b) waiting on the OSIS-seed work and re-seeding fresh. Probably (a) for now; (b) once the seed is in place.
  • Step 5 — final verification:
    • hero_proc service list — every service green
    • All ~25 socket dirs have the expected sockets (no missing rpc.sock or ui.sock)
    • hero_proc daemon healthy (RSS bounded, fd count low)
    • Full browser walk-through: every archipelago shows realistic content
    • Office docs still openable in OnlyOffice (X-Forwarded-Proto + OO_PUBLIC_PROTO chain still works)
    • Take a snapshot tarball of the verified-working state for safekeeping
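
The socket checks in steps 1 and 5 could be scripted roughly as follows. This is a minimal sketch, not the project's tooling: the function name `audit_sockets`, the `SOCK_TEST` override, and the idea that all service socket directories sit under one root directory are assumptions, and the real socket root on the VM is not stated in this issue.

```shell
# Hypothetical helper: audit_sockets ROOT
# Prints one line per missing socket and returns non-zero if any are missing.
# Assumes each service has its own directory directly under ROOT.
audit_sockets() {
    local root="$1"
    # -S requires a real Unix socket; set SOCK_TEST=-e to accept any file (dry runs).
    local sock_test="${SOCK_TEST:--S}"
    local missing=0
    local dir name sock
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue
        name=$(basename "$dir")
        for sock in rpc.sock ui.sock; do
            if ! test "$sock_test" "$dir$sock"; then
                echo "MISSING $name/$sock"
                missing=$((missing + 1))
            fi
        done
    done
    [ "$missing" -eq 0 ]
}
```

Usage would be `audit_sockets /path/to/socket/root`, where the path is whatever directory hero_proc actually uses on the VM. An empty output with exit status 0 means every directory has both sockets.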

Sequencing

  1. Steps 1-2 today — both quick. Just need to actually execute step 1 and observe step 2.
  2. Step 3 — depends on what the local changes are. Fixable in ~10 min if it's just dirty patch leftovers.
  3. Step 4 — judgement call after seeing step 2 results.
  4. Step 5 — final acceptance.

This issue stays open until the demo is visibly populated end-to-end. It is independent of the reproducibility issues.
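
For step 3, the investigation of local changes in the hero_os checkout might start with a non-destructive triage like the one below. It is a sketch under assumptions: the function names `dirty_count` and `triage_repo` are illustrative, and only standard git commands are used. The repo path and branch in the trailing comments are the ones named in this issue.

```shell
# dirty_count REPO -> number of modified, staged, or untracked paths.
dirty_count() {
    git -C "$1" status --porcelain | wc -l | tr -d ' '
}

# triage_repo REPO -> prints "clean", or "dirty" followed by a summary of the changes.
# Read-only: nothing is committed, stashed, or reset here.
triage_repo() {
    local repo="$1"
    if [ "$(dirty_count "$repo")" -eq 0 ]; then
        echo "clean"
    else
        echo "dirty"
        git -C "$repo" status --short
        git -C "$repo" diff --stat
    fi
}

# Possible follow-ups once the changes have been reviewed (pick one):
#   git -C ~/hero/code0/hero_os stash push -u -m "pre-sweep leftovers"
#   git -C ~/hero/code0/hero_os fetch origin &&
#       git -C ~/hero/code0/hero_os reset --hard origin/development
```

Note that `git reset --hard` leaves untracked files in place, so a stash with `-u` (or a manual review of untracked files) is the safer first move if the leftovers might matter.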

References

  • hero_proc#81 — sysmon fd leak fix that started this whole sweep effort
  • hero_skills@4cb40f6 — gentle cargo + force restart in service_complete --update

Signed-off-by: mik-tf

Author
Owner

Status update — observations during execution (2026-04-30 PM session)

Sweep ran via service_complete --update --release with the gentle-cargo + force-restart fixes from hero_skills@4cb40f6. Phase 1 successfully built and installed binaries for 12 services (proc, router, mycelium, code, codescalers, lib_rhai, embedder, proxy, db, os, osis, collab) before phase 2 stalled on service_livekit.

Sysmon fix held perfectly

hero_proc#81 (the sysmon /proc fd leak fix deployed earlier in the session) stayed solid through the whole sweep:

  • 3h 30m+ uptime, 51 MB RSS, 151 fds at the end — vs. the pre-fix curve which would have put the daemon at 25+ GB by now.
  • Zero /proc/<pid>/stat retention regardless of how many service restarts hammered through.

The sweep would not have been survivable without the leak fix: gentle cargo alone would not have prevented the previous OOM trajectory.
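
The RSS and fd figures above can be sampled directly from /proc on Linux. The sketch below is one way to do it; the function names `proc_health` and `fd_ok` are illustrative, and the daemon's process name in the usage note is an assumption.

```shell
# proc_health PID -> prints "fds=<n> rss_kb=<n>" read from /proc (Linux only).
proc_health() {
    local pid="$1"
    local fds rss
    fds=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l | tr -d ' ')
    rss=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status" 2>/dev/null)
    echo "fds=${fds} rss_kb=${rss:-0}"
}

# fd_ok PID LIMIT -> succeeds while the open fd count is at or below LIMIT.
fd_ok() {
    local fds
    fds=$(ls "/proc/$1/fd" 2>/dev/null | wc -l | tr -d ' ')
    [ "$fds" -le "$2" ]
}
```

If the daemon runs under the process name `hero_proc` (an assumption), something like `proc_health "$(pgrep -o -x hero_proc)"` run every few minutes would show whether the fd count keeps climbing across service restarts.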

Sweep blocked at service_livekit start

Failure trace: service_livekit.nu start path probes Redis on a hardcoded port 6379 (Hero's actual default is 6378), then on probe failure falls through agent.nu to invoke ^claude (Claude Code CLI). Two bugs in one path. Filed as a separate hero_skills issue.

Workaround used today: hero_proc service start hero_livekit directly (bypasses the broken nu start logic). hero_livekit came back up healthy on the freshly-built binary.
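
To illustrate the two problems in that start path, here is a shell sketch of a probe that takes its port from configuration and fails loudly instead of falling through to another tool. It is not the code in service_livekit.nu (which is Nushell). `HERO_REDIS_PORT` is a hypothetical variable name, and 6378 is simply the default cited above.

```shell
# redis_port -> the port to probe; HERO_REDIS_PORT is a hypothetical override.
redis_port() {
    echo "${HERO_REDIS_PORT:-6378}"
}

# redis_up -> succeeds only if a server answers PING with PONG on that port.
redis_up() {
    command -v redis-cli >/dev/null 2>&1 || { echo "redis-cli not installed" >&2; return 2; }
    [ "$(redis-cli -h 127.0.0.1 -p "$(redis_port)" ping 2>/dev/null)" = "PONG" ]
}

# require_redis -> stops with a clear message rather than invoking a fallback.
require_redis() {
    redis_up || { echo "Redis not reachable on port $(redis_port)" >&2; return 1; }
}
```

The design point is that a failed probe returns an error to the caller, so a sweep would report "Redis not reachable" instead of shelling out to an unrelated CLI.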

Remaining 9 services done via manual loop

Because phase 2 stops on first failure, services after livekit (biz, aibroker, logic, slides, whiteboard, indexer, foundry, voice, agent) never got their start --reset --update. Resolved with a manual loop on the VM:

for svc in biz aibroker logic slides whiteboard indexer foundry voice agent; do
    nu -c "use ~/hero/code/hero_skills/tools/modules/services *; service_$svc install --update --release"
    hero_proc service restart hero_$svc
done
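
Since stopping on the first failure is what left these services behind, a loop that records failures and keeps going may be preferable for future sweeps. The following is a generic sketch, not the behaviour of service_complete; `run_each` and `restart_one` are illustrative names.

```shell
# run_each CMD ITEM... -> runs "CMD ITEM" for every item, continues on failure,
# prints the failed items at the end, and returns non-zero if any failed.
run_each() {
    local cmd="$1"; shift
    local failed=""
    local item
    for item in "$@"; do
        if ! "$cmd" "$item"; then
            failed="$failed $item"
        fi
    done
    if [ -n "$failed" ]; then
        echo "FAILED:$failed"
        return 1
    fi
    echo "ALL OK"
}
```

With a wrapper such as `restart_one() { hero_proc service restart "hero_$1"; }`, the call `run_each restart_one biz aibroker logic slides whiteboard indexer foundry voice agent` would attempt all nine services and list any that need a second look.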

Per-domain hero_osis_* recovery

A separate symptom surfaced before the sweep: every hero_osis_<domain>/rpc.sock was missing, breaking Contexts, Photos, Biz, Messages UIs (all 404s on the per-domain sockets). Root cause: the unified hero_osis server (which atomically binds all 17 per-domain sockets) had been killed during today's earlier mass-bounce; hero_osis_ui was alive but the actual server wasn't. Fixed by service_osis start --reset — restored all per-domain sockets in one shot. All 5 contexts (root, default, geomind, incubaid, threefold) and underlying OSIS data on disk were intact — no data loss.

Demo availability during the sweep

https://herodemo.gent01.grid.tf/ stayed responsive (HTTP 401 in <1s) for the entire ~3h sweep. Gentle cargo (nice 19 ionice idle -j 4) kept the VM load average at 1-4, versus 80+ during this morning's un-niced cargo storm.
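
The "gentle cargo" shorthand above maps onto standard `nice` and `ionice` options. A minimal wrapper could look like this. The exact flags hero_skills passes are not shown in this issue, so treat the expansion as an assumption; `gentle` is an illustrative name.

```shell
# gentle CMD [ARGS...] -> runs CMD at the lowest CPU priority and, where ionice
# is available and permitted, in the idle I/O scheduling class (class 3).
gentle() {
    if command -v ionice >/dev/null 2>&1 && ionice -c 3 true 2>/dev/null; then
        nice -n 19 ionice -c 3 "$@"
    else
        # Fall back to CPU niceness only when ionice is missing or blocked.
        nice -n 19 "$@"
    fi
}
```

A build would then be started as `gentle cargo build --release -j 4`, which caps parallelism at four jobs while leaving CPU and disk to the running demo services.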

Next operational steps for #46

  1. Wait for the manual loop to complete (in progress, ~25 more min as of this comment)
  2. Final socket audit across all 22 service dirs
  3. Browser walk-through: Photos / Videos / Biz / Messages / Office / etc. — verify every archipelago populates
  4. Take a snapshot tarball of the verified-working state

Related issues spawned today

  • hero_proc#81 — sysmon /proc fd leak (fixed, deployed)
  • hero_skills issue (filing now) — service_livekit redis port + claude CLI dependencies
  • hero_demo issue (filing now) — bootstrap must work on any Ubuntu, root or unprivileged user
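
For the snapshot step, a timestamped tarball can be produced with plain `tar`. This is a sketch under assumptions: `snapshot` is an illustrative name, and the source directory (wherever the OSIS data lives on the VM) is not specified in this issue.

```shell
# snapshot SRC_DIR DEST_DIR -> writes DEST_DIR/snapshot-<UTC timestamp>.tar.gz
# and prints the archive path. DEST_DIR should not be inside SRC_DIR.
snapshot() {
    local src="$1" dest="$2"
    local out
    out="$dest/snapshot-$(date -u +%Y%m%dT%H%M%SZ).tar.gz"
    mkdir -p "$dest"
    tar -czf "$out" -C "$src" . && echo "$out"
}
```

Listing the result with `tar -tzf <archive>` before relying on it is a cheap sanity check that the expected records made it into the snapshot.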

Signed-off-by: mik-tf

Reference
lhumina_code/hero_demo#46