docs: complete v1.1 research synthesis for session lifecycle, memory & reporting
.planning/research/PITFALLS.md
# Domain Pitfalls: ngn-agent v1.1

**Domain:** Hermes Agent session workspace, memory & reporting integration
**Researched:** 2026-06-14

## Critical Pitfalls

Mistakes that cause rewrites or major issues.

### Pitfall 1: Docker Container Restart Loses Cloned Repos
**What goes wrong:** When the Docker container restarts (after an idle timeout or a gateway restart), every git repo cloned inside it is lost, including the clones made by session-init.
**Why it happens:** With `container_persistent: true`, a container is kept alive only until it has been inactive for `lifetime_seconds` (300 s = 5 min). After that it is destroyed. Anything cloned into `/tmp` or `/root` is ephemeral.
**Consequences:** Repos disappear mid-session and the agent loses its workspace.
**Prevention:** **Always clone to a host-mounted volume.** Add `~/Projects:/workspace/repos:rw` to `docker_volumes`. The session-init script should check for existing clones on the mounted volume and clone only when one is missing.
**Detection:** The agent reports "directory not found" when accessing repos, or the script re-clones on every start (slow) instead of checking whether the `.git` directory exists.
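The clone-if-missing check from the prevention step could look roughly like the sketch below. It assumes bash and the `/workspace/repos` mount point mentioned above; the function name `ensure_repo` and the `REPO_ROOT` variable are illustrative and not part of Hermes.

```shell
#!/usr/bin/env bash
# Sketch only: ensure_repo and REPO_ROOT are hypothetical names, not Hermes APIs.
REPO_ROOT="${REPO_ROOT:-/workspace/repos}"   # assumed container-side mount point

# Clone <url> into $REPO_ROOT/<name> unless a clone already exists there.
ensure_repo() {
  local url="$1" name="$2"
  local dest="$REPO_ROOT/$name"
  if [ -d "$dest/.git" ]; then
    echo "skip: $dest already cloned"
    return 0
  fi
  mkdir -p "$REPO_ROOT" || return 1
  git clone "$url" "$dest"
}

# Example call (placeholder URL):
#   ensure_repo git@example.com:org/repo.git repo
```

Because the check looks at the mounted volume rather than the container filesystem, a restarted container finds the existing clone and skips the slow re-clone.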

### Pitfall 2: Memory Provider Conflict (Multiple External Providers)
**What goes wrong:** Configuring two external memory providers (e.g., Hindsight + Honcho) fails silently: only the first one is registered.
**Why it happens:** `MemoryManager.add_provider()` explicitly rejects a second external provider and only logs a warning (agent/memory_manager.py:342-354).
**Consequences:** The user believes both providers are active, but only one works. The rejection is visible only in the logs.
**Prevention:** Set `memory.provider: hindsight` and nothing else. Never add a second external provider.
**Detection:** `hermes memory list` shows only one provider. Check `~/.hermes/hermes-agent/logs/` for the "Rejected memory provider" warning.

### Pitfall 3: Cron Job Prompt Injection via Skill Content
**What goes wrong:** A skill loaded by a cron job contains a hidden prompt-injection payload that causes the cron LLM to take unintended actions.
**Why it happens:** Cron jobs load skill content at runtime via `_build_job_prompt()`. The content is scanned for injection patterns, but false negatives are possible (cron/scheduler.py:1249-1303).
**Consequences:** The injected instructions run unattended with whatever tools are auto-approved. Cron jobs have `approvals.cron_mode: deny`, but that setting only denies tool approval requests; it does not filter LLM output.
**Prevention:** Keep cron job skills simple and vetted. Use `no_agent` scripts for deterministic operations.
**Detection:** Cron output contains unexpected content. Check `cron/output/<job_id>/` for anomalous responses.

## Moderate Pitfalls

### Pitfall 1: Hindsight Cloud API Rate Limits
**What goes wrong:** The Hindsight API rate-limits or throttles requests, and memory writes fail silently because they are asynchronous and non-blocking in MemoryManager.
**Why it happens:** `sync_turn()` is dispatched to a background thread. Failures are logged as warnings and are not surfaced to the agent or the user.
**Consequences:** Memory loss: the agent believes it saved facts that were never persisted.
**Prevention:** Monitor `~/.hermes/hermes-agent/logs/` for "sync_turn failed" warnings. Consider Hindsight local mode if Cloud proves unreliable.
**Detection:** `grep "sync_turn failed" ~/.hermes/hermes-agent/logs/*`
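For routine monitoring, the grep above can be wrapped so that it yields a single number to alert on. This is a minimal sketch assuming bash; `count_sync_failures` is a hypothetical helper name, and the default path is the log directory named above.

```shell
#!/usr/bin/env bash
# Sketch only: count_sync_failures is a hypothetical helper, not a Hermes command.
# Prints how many log lines under a directory contain "sync_turn failed".
count_sync_failures() {
  local log_dir="${1:-$HOME/.hermes/hermes-agent/logs}"
  # -r: recurse, -h: omit file names, -F: match a fixed string.
  # "|| true" keeps the pipeline going when grep finds nothing.
  { grep -rhF "sync_turn failed" "$log_dir" 2>/dev/null || true; } | wc -l | tr -d ' '
}

# Example call:
#   count_sync_failures            # uses the default log directory
```

A non-zero count after a session suggests that some memory writes never reached Hindsight.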

### Pitfall 2: SSH Key Exposure Inside Docker
**What goes wrong:** The Hermes agent running inside Docker has read access to `~/.ssh/` via the mounted volume.
**Why it happens:** The agent has file read tools. If an attacker compromises the agent (e.g., through prompt injection), they can exfiltrate the SSH keys.
**Consequences:** Private SSH keys are leaked, giving access to every repo those keys authorize.
**Prevention:**
- Mount `~/.ssh` read-only (`:ro`). This stops the agent from modifying the keys but does not stop it from reading them.
- Use a **deploy key** (per-repo, read-only) instead of a personal SSH key
- Run `ssh-add -l` to check which keys are loaded
- Consider HTTPS + personal access token (scoped, revocable) instead of SSH

**Detection:** Monitor Docker container network egress for unexpected outbound connections.

### Pitfall 3: Shell Init Script Blocking Container Start
**What goes wrong:** The `session-init.sh` script hangs (git clone waiting for an SSH key passphrase, network timeout, etc.) and blocks the Docker shell.
**Why it happens:** `shell_init_files` runs synchronously before the shell prompt appears, so a hanging script prevents the agent from starting.
**Consequences:** The agent gets a timeout error from the terminal backend and the session is stuck.
**Prevention:** Add a timeout to clone operations: `timeout 30 git clone ...`. Wrap the script body in `(sleep 5; ...) &` for asynchronous init. Add `set -euo pipefail` so failures surface early.
**Detection:** Test shell responsiveness with `docker exec <container> /bin/bash -c "echo test"`.
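One way to combine the timeout with non-interactive git is sketched below. It assumes bash and GNU `timeout`; `clone_with_timeout` is a hypothetical helper name. `GIT_TERMINAL_PROMPT=0` and the ssh `BatchMode=yes` option are additions not mentioned above; they make git and ssh fail immediately instead of waiting for a credential or passphrase prompt.

```shell
#!/usr/bin/env bash
# Sketch only: clone_with_timeout is a hypothetical helper name.
# Runs git clone non-interactively and gives up after a fixed number of seconds.
clone_with_timeout() {
  local url="$1" dest="$2" secs="${3:-30}"
  # GIT_TERMINAL_PROMPT=0 stops git from prompting for credentials;
  # BatchMode=yes makes ssh fail instead of asking for a passphrase.
  GIT_TERMINAL_PROMPT=0 GIT_SSH_COMMAND="ssh -o BatchMode=yes" \
    timeout "$secs" git clone --quiet "$url" "$dest"
}

# Background use so the shell prompt is not delayed (placeholder URL and path):
#   clone_with_timeout git@example.com:org/repo.git /workspace/repos/repo &
```

When the limit is reached, GNU `timeout` exits with status 124, which the init script can log instead of hanging.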

## Minor Pitfalls

### Pitfall 1: Hindsight API Key in Git History
**What goes wrong:** A `.env` file containing `HINDSIGHT_API_KEY` gets committed to a git repo.
**Why it happens:** A developer accidentally stages `.env` files.
**Prevention:** `/Users/bapung/.hermes/` is outside the ngn-agent repo, so there is no risk unless `.env` is copied into a repo directory.
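If a `.env` file is ever copied into a repo, a quick history check can confirm whether it was committed. The sketch below assumes bash; `env_in_history` is a hypothetical helper name, and it only looks for a `.env` at the repository root.

```shell
#!/usr/bin/env bash
# Sketch only: env_in_history is a hypothetical helper name.
# Succeeds (exit 0) if any commit on any ref touched the top-level .env file.
env_in_history() {
  local repo="${1:-.}"
  # --all walks every ref; --format=%H prints one hash per matching commit.
  git -C "$repo" log --all --format=%H -- .env 2>/dev/null | grep -q .
}

# Example call (placeholder path):
#   env_in_history ~/Projects/some-repo && echo "WARNING: .env is in history"
```

A match means the key should be treated as exposed, even if the file has since been deleted from the working tree.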

### Pitfall 2: DEFAULT_REPOS Superposition
**What goes wrong:** Two different session-init scripts or skills try to clone the same repo simultaneously.
**Why it happens:** Both `shell_init_files` and a session-start hook try to clone.
**Prevention:** Use only ONE mechanism. Prefer `shell_init_files` as it's guaranteed to run before the agent starts.
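Where a second clone path cannot be fully ruled out, a lock can serve as an extra guard alongside the single-mechanism rule. This is a minimal sketch assuming bash; `with_clone_lock` and the lock path are hypothetical, and the technique (an atomic `mkdir` lock) is a general shell pattern rather than anything Hermes provides.

```shell
#!/usr/bin/env bash
# Sketch only: with_clone_lock and the lock path are hypothetical.
# Runs a command only if no other process holds the lock. mkdir is atomic,
# so exactly one caller can create the lock directory.
with_clone_lock() {
  local lock_dir="$1"; shift
  if ! mkdir "$lock_dir" 2>/dev/null; then
    echo "skip: another clone is in progress"
    return 0
  fi
  "$@"
  local rc=$?
  rmdir "$lock_dir"   # release the lock whether or not the command succeeded
  return "$rc"
}

# Example call (placeholder URL and paths):
#   with_clone_lock /workspace/repos/.clone.lock \
#     git clone git@example.com:org/repo.git /workspace/repos/repo
```

The second caller returns immediately instead of cloning into a directory that is still being written.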

### Pitfall 3: Cron Report Delivers to Wrong Chat
**What goes wrong:** The daily report is delivered to the wrong Telegram chat.
**Why it happens:** `deliver: origin` routes to the chat where the cron job was created. If the job was created via the CLI, `origin` is missing and cron falls back to the first available home channel (cron/scheduler.py:444).
**Prevention:** Explicitly set `deliver: telegram:474440517` (the ngn-agent DM) instead of `deliver: telegram` or `deliver: origin`.
**Detection:** Check cron delivery errors via `hermes cron list`.

## Phase-Specific Warnings

| Phase Topic | Likely Pitfall | Mitigation |
|-------------|---------------|------------|
| Hindsight activation | Provider conflict with other external provider | Verify `memory.provider` is set to only `hindsight` |
| Docker SSH volume | Key exposure via agent | Use deploy keys, read-only mount, monitor egress |
| Session init script | Blocking clone hangs container | Add timeouts, async background mode |
| Daily report skill | Poor quality LLM summaries | Iterate skill prompt; test with `hermes cron run <id>` |
| Stale cleanup script | Deleting active sessions | Add dry-run mode; check `last_updated` carefully |
| Docker volumes | Path mismatch between host/container | Use absolute paths in `docker_volumes` config |
| Git clone auth | SSH key passphrase prompt | Use key without passphrase or `ssh-agent` forwarding |

## Sources

- Hermes v0.16.0 source: `agent/memory_manager.py` lines 342-354 (provider conflict)
- Hermes v0.16.0 source: `agent/memory_provider.py` lines 115-131 (async sync_turn, silent failure)
- Hermes v0.16.0 source: `cron/scheduler.py` lines 1249-1303 (prompt injection scanning)
- Hermes v0.16.0 source: `cron/scheduler.py` line 444 (origin fallback for delivery)
- Docker container lifecycle: `container_persistent: true` + `lifetime_seconds: 300` in config.yaml
- Existing shell init script pattern: `terminal.shell_init_files: []` (currently empty)