hero_shrimp #10

Closed
opened 2026-03-12 13:53:19 +00:00 by thabeta · 0 comments

HeroShrimp

HeroShrimp is a single-user autonomous agent runtime with:

  • local and chat-channel interaction
  • tool use
  • model routing
  • skills
  • memory
  • reliability and recovery
  • admin/runtime observability
  • operator tooling

This ticket covers the full product and engineering surface, not just open bugs.

Master Checklist

1. Runtime Foundation

  • Single-user runtime model
  • Shared inbound contract across channels
  • SQLite persistence with WAL mode
  • Startup recovery flow
  • Runtime maintenance loop
  • Channel watchdog / self-healing restarts

Acceptance criteria:

  • runtime starts cleanly with DB init and recovery
  • long-running maintenance updates freshness state
  • unclean shutdowns are detected and repaired where supported
  • channels share a single-user execution contract
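
The startup flow above (DB init in WAL mode, then detection of an unclean shutdown) could look roughly like the following. This is a minimal sketch, not HeroShrimp's actual code: the `runtime_state` table, the `running` key, and both function names are assumptions made for illustration.

```python
import os
import sqlite3
import tempfile

def open_runtime_db(path):
    """Open the DB in WAL mode and report whether the previous run ended uncleanly.

    The runtime_state table and its 'running' flag are hypothetical.
    """
    conn = sqlite3.connect(path)
    # PRAGMA journal_mode returns the mode that is now in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runtime_state (key TEXT PRIMARY KEY, value TEXT)"
    )
    row = conn.execute(
        "SELECT value FROM runtime_state WHERE key = 'running'"
    ).fetchone()
    # A leftover '1' means the last process never reached mark_clean_shutdown().
    unclean = row is not None and row[0] == "1"
    conn.execute(
        "INSERT OR REPLACE INTO runtime_state (key, value) VALUES ('running', '1')"
    )
    conn.commit()
    return conn, mode, unclean

def mark_clean_shutdown(conn):
    """Clear the running flag so the next startup skips recovery."""
    conn.execute("UPDATE runtime_state SET value = '0' WHERE key = 'running'")
    conn.commit()
    conn.close()
```

When `unclean` is true, the runtime would run whatever repair steps it supports before accepting inbound work.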

2. Channels

  • Interactive CLI channel
  • One-shot CLI execution with --prompt
  • One-shot CLI execution with --prompt-stdin
  • One-shot CLI execution with --json
  • Admin HTTP channel
  • SSE event stream

Acceptance criteria:

  • CLI works both interactively and non-interactively
  • Admin server exposes runtime and doctor APIs
  • Telegram/WhatsApp can start when configured
  • live channel smoke path is documented and repeatable
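
To illustrate how the three one-shot flags listed above might combine, here is a hedged sketch. Only the flag names come from this ticket; the program name, the mutual exclusivity of `--prompt` and `--prompt-stdin`, and the shape of the JSON output are assumptions.

```python
import argparse
import json

def build_parser():
    # The prog name is an assumption; the ticket does not name the binary.
    parser = argparse.ArgumentParser(prog="heroshrimp")
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--prompt", help="run one prompt and exit")
    source.add_argument("--prompt-stdin", action="store_true",
                        help="read the prompt from standard input")
    parser.add_argument("--json", action="store_true",
                        help="emit machine-readable output")
    return parser

def resolve_prompt(argv, stdin):
    """Return (prompt, as_json); a prompt of None means interactive mode."""
    args = build_parser().parse_args(argv)
    if args.prompt_stdin:
        return stdin.read().strip(), args.json
    return args.prompt, args.json

def render(reply, as_json):
    """Format a reply for scripts (JSON) or for humans (plain text)."""
    return json.dumps({"reply": reply}) if as_json else reply
```

Passing `stdin` as a parameter keeps the one-shot path testable without a real pipe, which suits the scriptable smoke checks mentioned later in this ticket.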

3. Core Agent Behavior

  • Triage layer
  • Quick-answer path
  • Main multi-step agent loop
  • [-] Plan gating / active-step enforcement UX
  • Tool routing before LLM tool exposure
  • Partial progress / checkpoint behavior
  • Blocked-state bailout after repeated no-progress iterations

Acceptance criteria:

  • simple queries can complete without unnecessary tool use
  • multi-step requests can use tools iteratively
  • repeated no-progress behavior fails safely instead of looping forever
  • planning constraints do not deadlock normal research/coding flows
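
The blocked-state bailout can be sketched as a counter that resets on progress. The outcome labels, the thresholds, and the `LoopResult` type below are illustrative assumptions, not the runtime's real interface.

```python
from dataclasses import dataclass

@dataclass
class LoopResult:
    status: str      # "done" or "blocked"
    iterations: int  # number of steps actually executed

def run_agent_loop(step, max_no_progress=3, max_iterations=50):
    """Call step(i) until it reports "done".

    step must return "done", "progress", or "no_progress". The loop gives up
    with status "blocked" after max_no_progress consecutive stalled steps,
    or when max_iterations is exhausted.
    """
    stalled = 0
    for i in range(1, max_iterations + 1):
        outcome = step(i)
        if outcome == "done":
            return LoopResult("done", i)
        if outcome == "progress":
            stalled = 0  # any progress resets the stall counter
        else:
            stalled += 1
            if stalled >= max_no_progress:
                return LoopResult("blocked", i)
    return LoopResult("blocked", max_iterations)
```

Counting only consecutive stalls lets a long research or coding flow continue as long as it keeps moving, while a stuck loop still terminates.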

4. Tool System

  • Built-in tool registry
  • Tool schema validation
  • Tool timeout handling
  • Tool caching
  • Tool audit logging
  • Tool policy hook integration
  • Parallel tool-call limit
  • Dynamic loader ignores test/spec files

Acceptance criteria:

  • tools validate arguments before execution
  • dangerous or blocked tools fail safely
  • tool execution is auditable
  • duplicate or malformed tool definitions are not exposed to the LLM
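
A minimal sketch of validating arguments before execution and enforcing a timeout follows. The schema format (argument name mapped to a Python type) and the result dictionary are assumptions chosen for brevity; the real runtime may well use a richer schema.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def validate_args(schema, args):
    """Return a list of problems; an empty list means the arguments are acceptable."""
    problems = []
    for name, expected_type in schema.items():
        if name not in args:
            problems.append(f"missing argument: {name}")
        elif not isinstance(args[name], expected_type):
            problems.append(f"wrong type for {name}")
    for name in args:
        if name not in schema:
            problems.append(f"unexpected argument: {name}")
    return problems

def call_tool(fn, schema, args, timeout_s=5.0):
    """Validate first, then run fn in a worker thread with a deadline."""
    problems = validate_args(schema, args)
    if problems:
        return {"ok": False, "error": "; ".join(problems)}
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, **args)
    try:
        return {"ok": True, "result": future.result(timeout=timeout_s)}
    except FutureTimeout:
        return {"ok": False, "error": "timeout"}
    except Exception as exc:
        return {"ok": False, "error": str(exc)}
    finally:
        # Do not block on a stuck worker; the thread itself cannot be killed.
        pool.shutdown(wait=False)
```

Note that a thread-based timeout only stops waiting for the tool; a tool that must be forcibly terminated would need to run in a subprocess instead.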

5. Safety and Policy

  • Safety levels (strict, standard, relaxed)
  • Env-based allowlist/denylist
  • Workspace policy file support
  • shell_run remains enabled with hardening
  • [-] Higher-level policy profiles

Acceptance criteria:

  • blocked tools do not execute
  • shell execution is constrained rather than disabled
  • policy behavior is visible and explainable to the operator
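
One way an env-based allowlist/denylist could produce decisions that are explainable to the operator is sketched below. The variable names `TOOL_ALLOWLIST` and `TOOL_DENYLIST` are hypothetical, as is the rule that the denylist wins over the allowlist.

```python
def parse_list(value):
    """Split a comma-separated env value into a set of trimmed names."""
    return {item.strip() for item in (value or "").split(",") if item.strip()}

def decide(tool, env):
    """Return (allowed, reason) for a tool name.

    env is any mapping, for example os.environ. The variable names
    TOOL_ALLOWLIST and TOOL_DENYLIST are assumptions for this sketch.
    """
    allow = parse_list(env.get("TOOL_ALLOWLIST"))
    deny = parse_list(env.get("TOOL_DENYLIST"))
    if tool in deny:
        return (False, f"{tool} is on the denylist")
    # An empty allowlist means "no restriction" in this sketch.
    if allow and tool not in allow:
        return (False, f"{tool} is not on the allowlist")
    return (True, "allowed")
```

Returning the reason alongside the verdict gives audit logs and the admin surface something concrete to show when a tool is refused.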

6. Model Integration and Routing

  • Direct OpenRouter support
  • AI Broker support when configured
  • Primary/fallback model chain
  • Manual set_model tool
  • Phase-based model routing
  • Routing heuristics for simple/standard/complex work
  • Routing runtime stats
  • One-shot JSON proof of selected/success model

Acceptance criteria:

  • runtime can call configured models successfully
  • phase-based routing can choose different primary models
  • fallback chain still works on model/provider failure
  • routed decisions are visible via logs and /api/runtime
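
The primary/fallback chain and the "selected/success model" proof might be modeled as below. This is an illustrative sketch: `call_model` stands in for whatever provider client is configured, and the result field names are assumptions rather than the actual JSON output.

```python
def call_with_fallback(chain, call_model, prompt):
    """Try each model in order and record which one finally answered.

    chain is a non-empty list of model names, primary first.
    call_model(model, prompt) is a caller-supplied function that raises
    on model or provider failure.
    """
    attempts = []
    for model in chain:
        try:
            reply = call_model(model, prompt)
        except Exception as exc:
            attempts.append({"model": model, "error": str(exc)})
            continue
        return {
            "selected": chain[0],  # the model routing picked first
            "success": model,      # the model that actually answered
            "reply": reply,
            "attempts": attempts,  # failures that preceded the success
        }
    raise RuntimeError(f"all models failed: {attempts}")
```

Keeping `selected` and `success` separate makes a silent fallback visible: when they differ, the primary failed and the chain recovered.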

7. Skills System

  • Markdown skills as first-class tools
  • Frontmatter parsing and schema generation
  • Recursive skill discovery
  • Support for SHRIMP_SKILLS_DIR
  • Support for ~/.agents/skills
  • Support for ~/.agents
  • Eligibility checks (channel/user/patterns)
  • Skill hot reload
  • Skill ranking before exposure to LLM
  • Skill explainability (considered -> exposed -> called)
  • Runtime skill counters
  • Persistent skill usage stats in SQLite

Acceptance criteria:

  • nested SKILL.md files are discovered
  • irrelevant skills are not blindly exposed
  • operators can inspect which skills were considered, exposed, and called
  • skill usage is visible in runtime/admin surfaces
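
Recursive discovery of nested `SKILL.md` files with simple frontmatter parsing could be sketched as follows. The parser only handles flat `key: value` pairs, and falling back to the parent directory name when no `name` is given is an assumption, not documented behavior.

```python
from pathlib import Path

def parse_frontmatter(text):
    """Parse flat 'key: value' frontmatter between leading '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}  # no closing delimiter: treat the file as having no frontmatter

def discover_skills(root):
    """Find every SKILL.md under root, at any depth, in a stable order."""
    skills = []
    for path in sorted(Path(root).rglob("SKILL.md")):
        meta = parse_frontmatter(path.read_text(encoding="utf-8"))
        skills.append({
            "path": str(path),
            # Hypothetical fallback: use the directory name as the skill name.
            "name": meta.get("name", path.parent.name),
        })
    return skills
```

The same function could be run once per configured root, such as the directory named by `SHRIMP_SKILLS_DIR`, with the results merged before eligibility checks and ranking.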

8. Memory System

  • Memory backend abstraction
  • Save/query/list memory
  • Prompt memory retrieval
  • Ranked memory retrieval pipeline
  • Query expansion / temporal decay / diversity controls
  • Memory outbox durability
  • Retry / dead-letter behavior
  • Memory compaction
  • Snapshot export/import

Acceptance criteria:

  • memory survives restarts
  • retrieval pipeline behavior is controllable and observable
  • outbox failures are recoverable
  • snapshots can export/import correctly
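
As an illustration of the temporal decay control in the ranked retrieval pipeline, the sketch below halves a memory's score every `half_life_days`. The exponential form, the 30-day default, and the `similarity`/`age_days` fields are assumptions; the ticket does not specify the actual scoring function.

```python
import math

def decayed_score(similarity, age_days, half_life_days=30.0):
    """Scale a similarity score by exponential decay with the given half-life."""
    return similarity * math.pow(0.5, age_days / half_life_days)

def rank_memories(memories, limit=3, half_life_days=30.0):
    """Return the top `limit` memories by decayed score, best first.

    Each memory is a dict with hypothetical 'similarity' and 'age_days' keys.
    """
    ranked = sorted(
        memories,
        key=lambda m: decayed_score(m["similarity"], m["age_days"], half_life_days),
        reverse=True,
    )
    return ranked[:limit]
```

Exposing `half_life_days` as a parameter is what makes the behavior controllable: a larger value favors relevance, a smaller one favors recency.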

9. Reliability and Recovery

  • Inbound dedup cache
  • Inflight duplicate join
  • Transient inbound retry with backoff
  • Startup repair for stuck outbox work
  • Runtime maintenance freshness checks
  • Outbox lag-age health checks
  • Outbox recover/replay/drain actions
  • [-] Broader chaos coverage

Acceptance criteria:

  • duplicate inbound events do not trigger duplicate work
  • stale outbox states can be repaired automatically or manually
  • doctor surfaces unhealthy backlog/lag conditions clearly
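
The inbound dedup cache and the backoff schedule for transient retries can be sketched like this. The TTL, the base delay, and the cap are illustrative values, and the class and function names are hypothetical.

```python
import time

class DedupCache:
    """Remember recently seen event IDs so duplicates are not processed twice."""

    def __init__(self, ttl_s=300.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock  # injectable so tests can control time
        self.seen = {}

    def first_time(self, event_id):
        """Return True only the first time an ID is seen within the TTL."""
        now = self.clock()
        # Drop entries older than the TTL before checking.
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl_s}
        if event_id in self.seen:
            return False
        self.seen[event_id] = now
        return True

def backoff_delays(attempts, base_s=0.5, cap_s=30.0):
    """Exponential backoff schedule: base, 2x base, 4x base, ... capped at cap_s."""
    return [min(cap_s, base_s * 2 ** i) for i in range(attempts)]
```

A production version would usually add jitter to the delays and persist the cache if duplicates must be suppressed across restarts.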

10. Doctor and Operator Tooling

  • Doctor health check command
  • Doctor maintenance command
  • Snapshot export/import commands
  • Outbox repair commands
  • Doctor tool interface
  • Admin doctor API actions

Acceptance criteria:

  • common operational fixes do not require manual DB edits
  • doctor output is actionable
  • admin and CLI paths expose the same core repair capabilities

11. Observability and Admin UX

  • Dashboard stats view
  • Messages view
  • Audit view
  • Usage view
  • Memories view
  • Jobs view
  • Config view
  • Runtime API
  • Routing runtime visibility
  • Skill runtime visibility

Acceptance criteria:

  • operator can inspect runtime state without terminal access
  • routing, outbox, maintenance, and skill behavior are visible
  • dashboard remains usable on desktop and mobile

12. Plugins and MCP

  • Workspace plugin loading
  • Runtime hook registration
  • [-] MCP operational polish

Acceptance criteria:

  • plugins can extend tools/hooks safely
  • MCP tools can be discovered when configured
  • plugin/MCP failures degrade cleanly

13. Documentation

  • User-first README
  • Full env reference
  • .env.example aligned with runtime config
  • Architecture doc
  • Internals doc
  • Tools doc
  • Channels doc
  • Database doc
  • Comparison docs

Acceptance criteria:

  • a new user can configure and run the project from docs alone
  • runtime behavior described in docs matches current implementation
  • optional vs required config is explicit

14. Testing and Verification

  • Typecheck coverage
  • Routing tests
  • Skills-system tests
  • Startup recovery tests
  • Runtime maintenance tests
  • Inbound reliability tests
  • Doctor tests
  • One-shot CLI arg tests
  • Skill observability tests

Acceptance criteria:

  • core runtime behavior is verified by automated tests
  • one-shot execution is scriptable for smoke checks
  • live integrations have a documented verification flow

Definition of “Project in Good Shape”

HeroShrimp is in good shape when:

  • core runtime is reliable for long-lived single-user operation
  • one-shot and interactive flows are both usable
  • model routing is observable and script-testable
  • skills are discoverable, explainable, and measurable
  • memory/recovery/operator tooling are strong enough to avoid manual DB surgery
  • docs match implementation closely
thabeta self-assigned this 2026-03-12 13:53:23 +00:00
thabeta added this to the ACTIVE project 2026-03-12 13:53:33 +00:00

Reference
lhumina_code/hero_shrimp#10