lhumina_code/hero_shrimp

hero_shrimp #10

Closed

opened 2026-03-12 13:53:19 +00:00 by thabeta (Owner) · 0 comments

HeroShrimp

HeroShrimp is a single-user autonomous agent runtime with:

local and chat-channel interaction
tool use
model routing
skills
memory
reliability and recovery
admin/runtime observability
operator tooling

This ticket covers the full product and engineering surface, not just open bugs.

Master Checklist

1. Runtime Foundation

[x] Single-user runtime model
[x] Shared inbound contract across channels
[x] SQLite persistence with WAL mode
[x] Startup recovery flow
[x] Runtime maintenance loop
[x] Channel watchdog / self-healing restarts

Acceptance criteria:

runtime starts cleanly with DB init and recovery
long-running maintenance updates freshness state
unclean shutdowns are detected and repaired where supported
channels share a single-user execution contract
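
The startup criteria above (DB init, WAL mode, unclean-shutdown detection) could be sketched roughly as follows. This is a minimal illustration, not HeroShrimp's actual code: the `runtime_state` table, the `clean_shutdown` key, and both function names are assumptions made up for the example.

```python
import sqlite3


def open_runtime_db(path):
    """Open the DB in WAL mode; return (connection, previous_run_was_unclean)."""
    conn = sqlite3.connect(path)
    # For file-backed databases the pragma returns the journal mode now in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        raise RuntimeError(f"expected WAL journal mode, got {mode!r}")
    # Hypothetical key/value table used only to track the shutdown marker.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runtime_state "
        "(key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    row = conn.execute(
        "SELECT value FROM runtime_state WHERE key = 'clean_shutdown'"
    ).fetchone()
    # No row means first start; '0' means the last run never marked itself clean.
    unclean = row is not None and row[0] == "0"
    conn.execute(
        "INSERT OR REPLACE INTO runtime_state (key, value) "
        "VALUES ('clean_shutdown', '0')"
    )
    conn.commit()
    return conn, unclean


def mark_clean_shutdown(conn):
    """Record an orderly exit so the next start skips repair."""
    conn.execute("UPDATE runtime_state SET value = '1' WHERE key = 'clean_shutdown'")
    conn.commit()
    conn.close()
```

A caller would run its repair steps whenever the second return value is true, then continue normal startup.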

2. Channels

[x] Interactive CLI channel
[x] One-shot CLI execution with --prompt
[x] One-shot CLI execution with --prompt-stdin
[x] One-shot CLI execution with --json
[x] Admin HTTP channel
[x] SSE event stream

Acceptance criteria:

CLI works both interactively and non-interactively
Admin server exposes runtime and doctor APIs
Telegram/WhatsApp can start when configured
live channel smoke path is documented and repeatable
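
One way the three one-shot flags could fit together is sketched below. Only the flag names come from this ticket; the program name, the helper functions, and the JSON result shape are assumptions for illustration.

```python
import argparse
import json


def build_parser():
    parser = argparse.ArgumentParser(prog="shrimp-sketch")  # hypothetical name
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--prompt", help="run once with this prompt and exit")
    source.add_argument(
        "--prompt-stdin", action="store_true", help="read the prompt from stdin"
    )
    parser.add_argument(
        "--json", action="store_true", help="emit a machine-readable result"
    )
    return parser


def resolve_prompt(args, stdin):
    """Return the one-shot prompt, or None to fall back to interactive mode."""
    if args.prompt is not None:
        return args.prompt
    if args.prompt_stdin:
        return stdin.read().strip()
    return None


def render(reply, as_json):
    """Format the reply; the JSON shape here is an assumed example."""
    return json.dumps({"ok": True, "reply": reply}) if as_json else reply
```

Making `--prompt` and `--prompt-stdin` mutually exclusive is a design choice of the sketch: it keeps scripted smoke checks unambiguous about where the prompt came from.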

3. Core Agent Behavior

[x] Triage layer
[x] Quick-answer path
[x] Main multi-step agent loop
[-] Plan gating / active-step enforcement UX
[x] Tool routing before LLM tool exposure
[x] Partial progress / checkpoint behavior
[x] Blocked-state bailout after repeated no-progress iterations

Acceptance criteria:

simple queries can complete without unnecessary tool use
multi-step requests can use tools iteratively
repeated no-progress behavior fails safely instead of looping forever
planning constraints do not deadlock normal research/coding flows
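
The blocked-state bailout can be pictured as a loop that counts consecutive iterations without a change in some progress marker. The sketch below assumes a hypothetical `step()` callable returning `(done, progress_marker)`; the thresholds and names are illustrative, not taken from the runtime.

```python
from dataclasses import dataclass


@dataclass
class LoopResult:
    status: str      # "done" or "blocked"
    iterations: int  # how many times step() was called


def run_agent_loop(step, max_no_progress=3, max_iterations=50):
    """Call step() until done, bailing out after repeated no-progress iterations."""
    last_marker = None
    stalled = 0
    for i in range(1, max_iterations + 1):
        done, marker = step()
        if done:
            return LoopResult("done", i)
        if marker == last_marker:
            # Same marker as the previous iteration: nothing observable changed.
            stalled += 1
            if stalled >= max_no_progress:
                return LoopResult("blocked", i)
        else:
            stalled = 0
            last_marker = marker
    # Hard cap so even slowly changing markers cannot loop forever.
    return LoopResult("blocked", max_iterations)
```

With the defaults, a step that never changes its marker is stopped on the fourth call: the first call establishes the marker and the next three count as stalls.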

4. Tool System

[x] Built-in tool registry
[x] Tool schema validation
[x] Tool timeout handling
[x] Tool caching
[x] Tool audit logging
[x] Tool policy hook integration
[x] Parallel tool-call limit
[x] Dynamic loader ignores test/spec files

Acceptance criteria:

tools validate arguments before execution
dangerous or blocked tools fail safely
tool execution is auditable
duplicate malformed tool exposure is avoided
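
To illustrate "validate arguments before execution" together with timeout handling, here is a small sketch. The schema format is a simplified JSON-Schema-like dict chosen for the example, and `validate_args` and `call_tool` are hypothetical names; the real registry may work differently.

```python
import concurrent.futures

TYPES = {"string": str, "integer": int, "boolean": bool}


def validate_args(schema, args):
    """Return a list of problems; an empty list means the arguments are acceptable."""
    problems = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required argument: {name}")
    for name, value in args.items():
        if name not in props:
            problems.append(f"unknown argument: {name}")
            continue
        expected = TYPES[props[name]["type"]]
        # bool is a subclass of int in Python, so reject it explicitly for integers.
        wrong_bool = expected is int and isinstance(value, bool)
        if not isinstance(value, expected) or wrong_bool:
            problems.append(f"{name} must be {props[name]['type']}")
    return problems


def call_tool(tool, schema, args, timeout_s=5.0):
    """Validate first, then run the tool in a worker thread with a deadline."""
    problems = validate_args(schema, args)
    if problems:
        return {"ok": False, "error": "; ".join(problems)}
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(tool, **args)
    try:
        return {"ok": True, "result": future.result(timeout=timeout_s)}
    except concurrent.futures.TimeoutError:
        return {"ok": False, "error": f"tool timed out after {timeout_s}s"}
    finally:
        # Do not block on a stuck worker; a thread cannot be force-killed from here.
        pool.shutdown(wait=False)
```

Note the limitation in the last comment: a thread-based timeout stops waiting but does not stop the work, so a tool that must be cancelled for real would need a subprocess or cooperative cancellation.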

5. Safety and Policy

[x] Safety levels (strict, standard, relaxed)
[x] Env-based allowlist/denylist
[x] Workspace policy file support
[x] shell_run remains enabled with hardening
[-] Higher-level policy profiles

Acceptance criteria:

blocked tools do not execute
shell execution is constrained rather than disabled
policy behavior is visible and explainable to the operator
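
An env-based allowlist/denylist check that also explains its decision might look like the sketch below. The variable names `SKETCH_TOOL_ALLOWLIST` and `SKETCH_TOOL_DENYLIST` are placeholders invented for this example (the ticket does not name the real ones), and the precedence rule, deny wins, is likewise an assumption.

```python
def parse_list(raw):
    """Split a comma-separated env value into a set of trimmed names."""
    return {item.strip() for item in (raw or "").split(",") if item.strip()}


def decide(tool_name, env):
    """Return (allowed, reason). Deny wins; an empty allowlist allows everything."""
    allow = parse_list(env.get("SKETCH_TOOL_ALLOWLIST"))  # hypothetical variable
    deny = parse_list(env.get("SKETCH_TOOL_DENYLIST"))    # hypothetical variable
    if tool_name in deny:
        return False, f"{tool_name} is on the denylist"
    if allow and tool_name not in allow:
        return False, f"{tool_name} is not on the allowlist"
    return True, "allowed by policy"
```

Returning the reason alongside the verdict is what makes the policy explainable to the operator: the same string can go to the audit log and to the user-facing refusal.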

6. Model Integration and Routing

[x] Direct OpenRouter support
[x] AI Broker support when configured
[x] Primary/fallback model chain
[x] Manual set_model tool
[x] Phase-based model routing
[x] Routing heuristics for simple/standard/complex work
[x] Routing runtime stats
[x] One-shot JSON proof of selected/success model

Acceptance criteria:

runtime can call configured models successfully
phase-based routing can choose different primary models
fallback chain still works on model/provider failure
routed decisions are visible via logs and /api/runtime
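
Phase-based routing with a fallback chain, and the "selected/success model" proof mentioned in the checklist, can be sketched as follows. The phase names, model names, and result fields are placeholders; `call_model` stands in for whatever provider client the runtime uses.

```python
# Hypothetical per-phase chains: first entry is the primary, the rest are fallbacks.
ROUTES = {
    "triage": ["small-model", "large-model"],
    "plan": ["large-model", "small-model"],
}


def call_with_fallback(phase, call_model, prompt, routes=ROUTES):
    """Try each model for the phase in order and report which one succeeded."""
    chain = routes[phase]
    errors = []
    for model in chain:
        try:
            reply = call_model(model, prompt)
        except Exception as exc:  # provider or model failure: record and move on
            errors.append({"model": model, "error": str(exc)})
            continue
        return {
            "selected": chain[0],  # what routing picked as primary
            "success": model,      # what actually answered
            "reply": reply,
            "errors": errors,
        }
    raise RuntimeError(f"all models failed for phase {phase!r}: {errors}")
```

Keeping `selected` and `success` as separate fields lets a script detect silent fallbacks, which is the point of the one-shot JSON proof.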

7. Skills System

[x] Markdown skills as first-class tools
[x] Frontmatter parsing and schema generation
[x] Recursive skill discovery
[x] Support for SHRIMP_SKILLS_DIR
[x] Support for ~/.agents/skills
[x] Support for ~/.agents
[x] Eligibility checks (channel/user/patterns)
[x] Skill hot reload
[x] Skill ranking before exposure to LLM
[x] Skill explainability (considered -> exposed -> called)
[x] Runtime skill counters
[x] Persistent skill usage stats in SQLite

Acceptance criteria:

nested SKILL.md files are discovered
irrelevant skills are not blindly exposed
operators can inspect which skills were considered, exposed, and called
skill usage is visible in runtime/admin surfaces
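
Recursive discovery of nested SKILL.md files across the three locations listed above could be sketched like this. The search order and the deduplication step are assumptions of the example; they matter because `~/.agents` contains `~/.agents/skills`, so a naive walk would report the same file twice.

```python
from pathlib import Path


def skill_roots(env):
    """Build the list of directories to scan (order here is an assumption)."""
    roots = []
    if env.get("SHRIMP_SKILLS_DIR"):
        roots.append(Path(env["SHRIMP_SKILLS_DIR"]))
    home = Path(env.get("HOME", str(Path.home())))
    roots += [home / ".agents" / "skills", home / ".agents"]
    return roots


def discover_skills(roots):
    """Return every SKILL.md under the given roots, once each, in a stable order."""
    found = {}
    for root in roots:
        if not root.is_dir():
            continue
        for path in sorted(root.rglob("SKILL.md")):
            # Resolve so overlapping roots map to the same key.
            found.setdefault(path.resolve(), None)
    return list(found)
```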

8. Memory System

[x] Memory backend abstraction
[x] Save/query/list memory
[x] Prompt memory retrieval
[x] Ranked memory retrieval pipeline
[x] Query expansion / temporal decay / diversity controls
[x] Memory outbox durability
[x] Retry / dead-letter behavior
[x] Memory compaction
[x] Snapshot export/import

Acceptance criteria:

memory survives restarts
retrieval pipeline behavior is controllable and observable
outbox failures are recoverable
snapshots can export/import correctly
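
As an illustration of the temporal-decay control in the ranked retrieval pipeline, the sketch below halves a memory's relevance every `half_life_days`. The exponential half-life formula, the field names, and the defaults are assumptions; the ticket does not say which decay function the runtime uses.

```python
def decayed_score(similarity, age_days, half_life_days=30.0):
    """Halve a memory's relevance every half_life_days."""
    return similarity * 0.5 ** (age_days / half_life_days)


def rank_memories(candidates, now_day, limit=5, half_life_days=30.0):
    """Rank candidate dicts with 'text', 'similarity' (0..1) and 'created_day'."""
    scored = [
        (
            decayed_score(m["similarity"], now_day - m["created_day"], half_life_days),
            m["text"],
        )
        for m in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]
```

For example, a 90-day-old memory with similarity 0.9 scores 0.9 × 0.125 = 0.1125, so a fresh memory with similarity 0.5 outranks it. Exposing `half_life_days` as a parameter is what makes the behavior controllable.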

9. Reliability and Recovery

[x] Inbound dedup cache
[x] Inflight duplicate join
[x] Transient inbound retry with backoff
[x] Startup repair for stuck outbox work
[x] Runtime maintenance freshness checks
[x] Outbox lag-age health checks
[x] Outbox recover/replay/drain actions
[-] Broader chaos coverage

Acceptance criteria:

duplicate inbound events do not trigger duplicate work
stale outbox states can be repaired automatically or manually
doctor surfaces unhealthy backlog/lag conditions clearly
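
Two of the inbound mechanisms above, the dedup cache and retry with backoff, are simple enough to sketch. The class name, TTL, and backoff constants below are illustrative assumptions, not values from the runtime.

```python
import time


class InboundDedup:
    """Remember recently seen inbound event IDs for ttl_s seconds."""

    def __init__(self, ttl_s=300.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock  # injectable so tests can control time
        self._seen = {}

    def first_time(self, event_id):
        """Return True only the first time an ID is seen within the TTL window."""
        now = self.clock()
        # Drop expired entries so the cache does not grow without bound.
        self._seen = {k: t for k, t in self._seen.items() if now - t < self.ttl_s}
        if event_id in self._seen:
            return False
        self._seen[event_id] = now
        return True


def backoff_delays(attempts, base_s=0.5, cap_s=30.0):
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at cap_s."""
    return [min(cap_s, base_s * 2 ** i) for i in range(attempts)]
```

A channel handler would call `first_time(event_id)` before dispatching work and drop the event when it returns False; transient failures would sleep through `backoff_delays(n)` between attempts.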

10. Doctor and Operator Tooling

[x] Doctor health check command
[x] Doctor maintenance command
[x] Snapshot export/import commands
[x] Outbox repair commands
[x] Doctor tool interface
[x] Admin doctor API actions

Acceptance criteria:

common operational fixes do not require manual DB edits
doctor output is actionable
admin and CLI paths expose the same core repair capabilities
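
"Actionable" doctor output can be pictured as a finding that pairs a status with a suggested next step. The sketch below checks outbox lag age and backlog; the thresholds, field names, and action wording are assumptions for illustration only.

```python
def check_outbox(pending_created_at, now, max_lag_s=300.0, max_backlog=100):
    """Return a doctor-style finding: status, measurements, and a suggested action.

    pending_created_at is a list of creation timestamps (seconds) for pending items.
    """
    backlog = len(pending_created_at)
    lag_s = (now - min(pending_created_at)) if pending_created_at else 0.0
    finding = {"backlog": backlog, "lag_s": lag_s}
    if lag_s > max_lag_s:
        finding["status"] = "unhealthy"
        finding["action"] = "oldest pending item exceeds the lag limit; run outbox recover or replay"
    elif backlog > max_backlog:
        finding["status"] = "degraded"
        finding["action"] = "backlog is large; run outbox drain"
    else:
        finding["status"] = "ok"
        finding["action"] = "none"
    return finding
```

Because the function is pure (timestamps in, finding out), the same check could back both the CLI command and the admin API action, which is one way to keep the two paths equivalent.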

11. Observability and Admin UX

[x] Dashboard stats view
[x] Messages view
[x] Audit view
[x] Usage view
[x] Memories view
[x] Jobs view
[x] Config view
[x] Runtime API
[x] Routing runtime visibility
[x] Skill runtime visibility

Acceptance criteria:

operator can inspect runtime state without terminal access
routing, outbox, maintenance, and skill behavior are visible
dashboard remains usable on desktop and mobile

12. Plugins and MCP

[x] Workspace plugin loading
[x] Runtime hook registration
[-] MCP operational polish

Acceptance criteria:

plugins can extend tools/hooks safely
MCP tools can be discovered when configured
plugin/MCP failures degrade cleanly

13. Documentation

[x] User-first README
[x] Full env reference
[x] .env.example aligned with runtime config
[x] Architecture doc
[x] Internals doc
[x] Tools doc
[x] Channels doc
[x] Database doc
[x] Comparison docs

Acceptance criteria:

a new user can configure and run the project from docs alone
runtime behavior described in docs matches current implementation
optional vs required config is explicit

14. Testing and Verification

[x] Typecheck coverage
[x] Routing tests
[x] Skills-system tests
[x] Startup recovery tests
[x] Runtime maintenance tests
[x] Inbound reliability tests
[x] Doctor tests
[x] One-shot CLI arg tests
[x] Skill observability tests

Acceptance criteria:

core runtime behavior is verified by automated tests
one-shot execution is scriptable for smoke checks
live integrations have a documented verification flow

Definition of “Project in Good Shape”

HeroShrimp is in good shape when:

core runtime is reliable for long-lived single-user operation
one-shot and interactive flows are both usable
model routing is observable and script-testable
skills are discoverable, explainable, and measurable
memory/recovery/operator tooling are strong enough to avoid manual DB surgery
docs match implementation closely
