Provision fresh herodemo on DO droplet using lab download-install flow #66
Summary
Provision a fresh herodemo on a DigitalOcean droplet using the canonical lab build/install flow (download prebuilt musl-x86_64 binaries from each repo's latest Forgejo release; no source compile on the VM). Replace the dead TFGrid herodemo at herodemo.gent01.grid.tf (contracts cancelled 2026-05-19, see below). Keep the older hero.gent04.grid.tf deployment untouched as the fallback during transition.
Why now
The 2026-01-16-vintage TFGrid herodemo VM accumulated 2–3 weeks of binary drift (most services on May 1 / May 6 builds) and hit the home#202 supervisor-vs-socket drift pattern multiple times in one inspection. A 2026-05-19 attempt to upgrade in-place to hero_proc_server v0.6.0 exposed:
- Old action scripts still pass the old CLI flags (e.g. hero_router --port 9988); new binaries reject those (now --bind <ADDR:PORT>).
- [[env]] blocks lack PATH_ROOT: daemons panic at paths.rs:38:32 per Lesson #19 / s125 (mycelium_network PR #48 added this exact fix to 4 daemons).
- nutools/modules/services/ in hero_skills HEAD has been gutted to a single file; bootstrap_droplet_source.sh clones source but does not register services.
Net: upgrading in-place is riskier than a fresh provision. The fresh provision uses today's latest Forgejo releases, which we just populated for the full canonical demo set (~20 repos).
Cancelled / contract state
Cancelled on 2026-05-19 via tfcmd cancel contracts:
- herodemo (5 contracts): net 2091015 + 2091016, vm 2091017, gateway 2091019, name 2091018 → herodemo.gent01.grid.tf is GONE
- herozero (2 contracts) + heroci (5 contracts), earlier in the same session
Kept on TFGrid:
- hero (4 contracts: net 2070724, vm 2070725, gateway 2070727, name 2070726) → hero.gent04.grid.tf. DO NOT CANCEL until DO droplet is green. This is the only Hero deploy left during the transition window.
New target architecture
- tfgrid-compose
- […] do_droplet_create.sh + cf_dns_set.sh)
- hero_builder
- latest release downloads via lab build <repo> --download --install
- […] latest)
- --info --json output
- herodemo.gent01.grid.tf (TFGrid name contract)
- <name>.gent01.grid.tf or DO-managed FQDN; Cloudflare/etc. proxying as needed
Plan
Phase 1 — bootstrap script rework
Update deploy/cloud_vm/scripts/bootstrap_droplet_source.sh (or write a new bootstrap_droplet_releases.sh) to:
- Install lab (one-line install; depends on hero_skills#268 install.sh fix)
- Run lab user init --root ~/hero to populate init.sh with PATH_ROOT / PATH_VAR / PATH_BUILD / PATH_CODE / HERO_SOCKET_DIR
- Run lab build <repo> --download --install --platforms linux-musl-x86_64 to pull today's latest binaries to ~/hero/build/bins/, then copy them with platform-suffix-stripped names to ~/hero/bin/
- Start hero_proc_server in screen with the full env exported
- Auto-register services: for each ~/hero/bin/<bin>, call --info --json, then use hero_proc service add + hero_proc service add-job with proper script/env/health-check/restart-policy
- […] cf_dns_set.sh)
Phase 2 — DO droplet provisioning
- deploy/cloud_vm/scripts/do_droplet_create.sh for create
- scp then ssh ... bash bootstrap_droplet_releases.sh
Phase 3 — verification
- https://<new-fqdn>/hero_os/ui loads past the WASM shell
- […] hero_proc service list
- hero_osis_* sockets respond 200 to /health
- hero_router log shows N services scanned, ≥95% healthy
- Cancel hero.gent04.grid.tf only AFTER new deploy is green for 24h+
Phase 4 — clean-up
- […] bootstrap_droplet_source.sh clone-and-source-build path (kept as _source variant in the script archive) once _releases is proven
- […] memory/project_demo_service_set.md if the canonical set drifts
Gotchas — must be solved as part of this plan
Gotcha 1 — service.toml [[env]] PATH_ROOT block per Lesson #19
Every daemon needs an [[env]] name="PATH_ROOT" default="~/hero" block in its repo's service.toml, otherwise hero_proc-spawned actions panic at paths.rs:38:32. Status (2026-05-19):
- mycelium_network: fixed in PR #48 (4 daemons)
Phase 1 step 6 (auto-registration from --info --json) must inject PATH_ROOT into every action's env block as a safety net. Alternatively: audit and fix every repo's service.toml before this plan starts. Recommend belt-and-suspenders.
Gotcha 2 — old CLI flags vs new CLI flags
Several services changed CLI between the May-1 deploy and today's latest:
- hero_router: --port 0 --address 10.1.2.2 --ui-port 9990 → --bind <ADDR:PORT> (repeatable)
- […] --help on each new binary before writing the registration script.
The auto-registration must derive the start command from service.toml [[binaries]] + [[binaries.sockets]] + [[binaries.tcp]] blocks (which are the new source of truth), NOT from old action scripts.
Gotcha 3 — lab build --download --install doesn't strip the platform suffix for non-host platforms
lab build --download --install --platforms linux-musl-x86_64 on a glibc-x86_64 host downloads <bin>-linux-musl-x86_64 to ~/hero/build/bins/ but does NOT copy to ~/hero/bin/<bin> because lab considers linux-musl-x86_64 ≠ host. The musl-static binaries DO run on glibc Linux; this is an over-conservative gate.
Workaround in the bootstrap script:
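A minimal sketch of what such an inline workaround could look like, assuming the downloaded files follow the <bin>-linux-musl-x86_64 naming and the ~/hero/build/bins/ → ~/hero/bin/ layout described above. The function name strip_platform_suffix and its arguments are illustrative, not taken from the actual bootstrap script.

```shell
#!/usr/bin/env bash
# Sketch only: copy downloaded musl binaries into the bin dir under their
# suffix-free names. The function name and arguments are hypothetical;
# the suffix and directory layout come from the issue text.
set -euo pipefail

strip_platform_suffix() {
  local src_dir="$1" dst_dir="$2" suffix="${3:--linux-musl-x86_64}"
  local src name
  mkdir -p "$dst_dir"
  for src in "$src_dir"/*"$suffix"; do
    # With no matches the glob stays literal, so skip anything that is not a file.
    [ -f "$src" ] || continue
    name="$(basename "$src")"
    name="${name%"$suffix"}"   # hero_router-linux-musl-x86_64 -> hero_router
    # install copies the file and sets the executable bit in one step.
    install -m 0755 "$src" "$dst_dir/$name"
  done
}
```

In a bootstrap script this would be called once after the lab build loop, for example as strip_platform_suffix ~/hero/build/bins ~/hero/bin. Copying rather than moving keeps the suffixed originals in place, so a rerun stays idempotent.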
Upstream fix would be either:
(a) lab build --install treats musl-static as host-compatible on glibc Linux (sensible default), OR
(b) Add --install-musl-as-host opt-in flag.
Worth filing as a separate lab issue. For now, the bootstrap script handles it inline.
Gotcha 4 — install.sh 404 at the canonical URL (also blocks hero_skills#268)
The lab skill advertises https://forge.ourworld.tf/lhumina_code/hero_skills/raw/branch/development/crates/lab/install.sh but that URL returns 404. The bootstrap script currently needs scp of the local lab binary as a workaround. Tracked in hero_skills#268; resolution unblocks both this issue and that one.
Definition of done
- […] lab in CI
- service.toml files audited; missing [[env]] PATH_ROOT blocks added (or auto-injection landed in bootstrap)
- bootstrap_droplet_releases.sh lands in hero_demo/deploy/cloud_vm/scripts/ and replaces _source in the canonical flow
- […] do_droplet_create.sh + bootstrap script
- hero.gent04.grid.tf cancelled only after 24h green window on the new deploy
- memory/project_demo_service_set.md reviewed against actual deploy
Estimated effort
What's already done (skip these — don't re-do)
- […] latest release (populated 2026-05-19; hero_proc, hero_router, hero_os, hero_osis, hero_db, hero_collab, hero_livekit, hero_aibroker, hero_embedder, hero_indexer, hero_biz, hero_slides, hero_books, hero_office, hero_proxy, hero_foundry, hero_logic, hero_code, hero_whiteboard + hero_proc)
- herodemo, herozero, heroci cancelled
Related issues
- […] development_crate_layout for that repo)
s132 close — TFGrid VM bootstrap from binaries proven end-to-end
The original arc this issue tracks (s132 = TFGrid VM bootstrap → s133 = app services → s134 = gateway + verification) has been rescoped post-meeting. Today's s132 work delivered the first leg; the whole arc was then realigned to the new hero_os_tfgrid_deployer / hero_cockpit umbrella, per the meeting minutes shared in hero_os_tfgrid_deployer#1.
What s132 landed
- make deploy ENV=herolab (OpenTofu) in 61 s. Mycelium IPv6 4fe:c68d:e525:af6a:ff0f:d77d:6b71:f322. Gateway herolab.gent02.grid.tf.
- make setup-binaries ENV=herolab → apt → driver user → curl prebuilt lab from hero_skills latest → 34/34 PASS on lab build $repo --download --install (mycelium_network skipped; TFGrid native via zinit) → 3 libonnxruntime sonames installed → hero_proc_server PID 6971 + hero_router PID 7070 via lab service → 6/6 smoke tests pass → curl http://127.0.0.1:9988/.well-known/heroservice.json → HTTP 200.
- 09f8365: 11 dev_mik commits squashed direct-to-development; +252/-1271 across 26 files (deleted 8 legacy env profiles + deploy/cloud_vm/; added envs/herolab/ + scripts/setup-binaries.sh + scripts/d07_set.txt + Makefile setup-binaries target).
What this issue's s133-s134 plan becomes
The original plan was "start all 34 app services" (s133) → "nginx + LE + htpasswd + Cloudflare" (s134). Per the meeting, this scope is now absorbed into the larger demo-deployer arc:
Known followup carried forward
- […] https://herolab.gent02.grid.tf: hero_router binds to loopback by default; needs --bind 0.0.0.0 (s74 work) + hero_proxy in front. Tracked as Track B (s140 territory) in the new roadmap.
Closing recommendation
This issue can be closed once the team confirms the rescope. The substantive s132 piece (VM bootstrap from binaries) is GREEN; the rest is rerouted to the new arc.
Full session narrative + per-commit detail in sessions/132.yml (local workspace).