Domain Pitfalls: ngn-agent v1.1

Domain: Hermes Agent session workspace, memory & reporting integration
Researched: 2026-06-14

Critical Pitfalls

Mistakes that cause rewrites or major issues.

Pitfall 1: Docker Container Restart Loses Cloned Repos

What goes wrong: A Docker container restart (after an idle timeout or a gateway restart) loses every git repo cloned inside the container, including the session-init clones.
Why it happens: With container_persistent: true, a container stays alive only until it has been idle for lifetime_seconds (300 s = 5 min). After that, the container is destroyed. Any git clone inside /tmp or /root is ephemeral.
Consequences: Repos disappear mid-session and the agent loses its workspace.
Prevention: Always clone to a host-mounted volume. Add ~/Projects:/workspace/repos:rw to docker_volumes. The session-init script should check for existing clones on the mounted volume and clone only if they are missing.
Detection: The agent reports "directory not found" when accessing repos, or the script re-clones every time (slow) instead of checking whether the .git directory exists.
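The check-before-clone step could be sketched as follows. This is a minimal illustration, not the project's actual session-init script: the function names and the REPO_ROOT variable are hypothetical, and /workspace/repos is simply the container side of the mount suggested above.

```shell
#!/usr/bin/env bash
# Sketch: clone only when the repo is missing from the host-mounted volume.
# REPO_ROOT and both function names are illustrative assumptions.
REPO_ROOT="${REPO_ROOT:-/workspace/repos}"

# Succeeds (exit 0) when the directory does not yet contain a git clone.
needs_clone() {
  [ ! -d "$1/.git" ]
}

# Clone URL ($1) into REPO_ROOT/NAME ($2) unless a clone already exists there.
ensure_repo() {
  local url="$1" dest="$REPO_ROOT/$2"
  if needs_clone "$dest"; then
    git clone "$url" "$dest"
  else
    echo "skip: $dest already cloned"
  fi
}
```

A session-init script would then call something like ensure_repo <repo-url> <name> once per repo, so a restarted container reuses the clone that survived on the host.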

Pitfall 2: Memory Provider Conflict (Multiple External Providers)

What goes wrong: Configuring two external memory providers (e.g., Hindsight + Honcho) fails silently: only the first is registered.
Why it happens: MemoryManager.add_provider() explicitly rejects a second external provider and only logs a warning (agent/memory_manager.py:342-354).
Consequences: The user thinks both are active, but only one works. No error message is visible outside the logs.
Prevention: Set memory.provider: hindsight and nothing else. Never add a second external provider.
Detection: hermes memory list shows only one provider. Check ~/.hermes/hermes-agent/logs/ for the "Rejected memory provider" warning.

Pitfall 3: Cron Job Prompt Injection via Skill Content

What goes wrong: A skill loaded by a cron job contains a hidden prompt-injection payload that causes the cron LLM to take unintended actions.
Why it happens: Cron jobs load skill content at runtime via _build_job_prompt(). Skill content is scanned for injection patterns, but false negatives are possible (cron/scheduler.py:1249-1303).
Consequences: The cron job runs with auto-approved tools. Cron jobs have approvals.cron_mode: deny, but that denial applies to tool-approval requests, not to LLM output.
Prevention: Keep cron job skills simple and vetted. Use no_agent scripts for deterministic operations.
Detection: Cron output contains unexpected content. Check cron/output/<job_id>/ for anomalous responses.

Moderate Pitfalls

Pitfall 1: Hindsight Cloud API Rate Limits

What goes wrong: The Hindsight API rate-limits or throttles requests, causing memory writes to fail silently (they are async and non-blocking in MemoryManager).
Why it happens: sync_turn() is dispatched to a background thread. Failures are logged as warnings, not surfaced to the agent or user.
Consequences: Memory loss. The agent thinks it saved facts, but they never persisted.
Prevention: Monitor ~/.hermes/hermes-agent/logs/ for "sync_turn failed" warnings. Consider Hindsight local mode if Cloud proves unreliable.
Detection: grep "sync_turn failed" ~/.hermes/hermes-agent/logs/*
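Because these failures only appear in the logs, the grep above can be wrapped in a small health check that a cron job or a manual run can act on. The sketch below assumes the log directory and the "sync_turn failed" message quoted above; the LOG_DIR variable and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: turn the "sync_turn failed" grep into a pass/fail health check.
# LOG_DIR and the function names are illustrative assumptions.
LOG_DIR="${LOG_DIR:-$HOME/.hermes/hermes-agent/logs}"

# Print how many log lines under a directory mention a failed memory sync.
count_sync_failures() {
  # -r recurse, -h omit file names, -F match a fixed string.
  # "|| true" keeps a no-match grep from failing the pipeline.
  { grep -rhF "sync_turn failed" "${1:-$LOG_DIR}" 2>/dev/null || true; } \
    | wc -l | tr -d ' '
}

# Exit 0 when no failed syncs are logged, non-zero otherwise.
memory_sync_healthy() {
  [ "$(count_sync_failures "${1:-$LOG_DIR}")" -eq 0 ]
}
```

For example, memory_sync_healthy || echo "memory writes are failing" gives a visible signal where the agent itself stays silent.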

Pitfall 2: SSH Key Exposure Inside Docker

What goes wrong: The Hermes agent running inside Docker has read access to ~/.ssh/ via a mounted volume.
Why it happens: The agent has file read tools. If an attacker compromises the agent (prompt injection), they could exfiltrate SSH keys.
Consequences: Private SSH keys are leaked, giving access to all repos the keys authorize.
Prevention:

  • Mount ~/.ssh as :ro (read-only, so the agent cannot modify the keys; it can still read them)
  • Use a deploy key (per-repo, read-only) instead of a personal SSH key
  • Run ssh-add -l to check which keys are loaded
  • Consider HTTPS + personal access token (scoped, revocable) instead of SSH

Detection: Monitor Docker container network egress for unexpected outbound connections.
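A pre-flight check can at least refuse to start when the SSH mount is writable. The sketch below assumes docker_volumes entries use the host:container:mode form seen in the ~/Projects:/workspace/repos:rw example; the function name is hypothetical. A read-only mount does not stop the agent from reading the keys, so this complements deploy keys rather than replacing them.

```shell
#!/usr/bin/env bash
# Sketch: succeed only when a volume spec ends in the read-only mode flag.
# Assumes the host:container:mode form; it does not handle combined
# options such as "ro,z".
is_readonly_mount() {
  case "$1" in
    *:ro) return 0 ;;
    *)    return 1 ;;
  esac
}
```

A guard such as is_readonly_mount "$spec" || exit 1 before container start makes a writable or mode-less SSH mount a hard failure.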

Pitfall 3: Shell Init Script Blocking Container Start

What goes wrong: The session-init.sh script hangs (git clone waits for an SSH key passphrase, a network timeout, etc.), blocking the Docker shell.
Why it happens: shell_init_files runs synchronously before the shell prompt appears. A hanging script prevents the agent from starting.
Consequences: The agent gets a timeout error from the terminal backend and the session is stuck.
Prevention: Add a timeout to clone operations (timeout 30 git clone ...). Wrap the script in (sleep 5; ...) & for async init. Add set -euo pipefail for early failure detection.
Detection: Run docker exec <container> /bin/bash -c "echo test" to verify that the shell responds.
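One way to combine the timeout with a clone that never prompts is sketched below. It relies on GNU coreutils timeout (exit status 124 on expiry), ssh's BatchMode option, and git's GIT_TERMINAL_PROMPT and GIT_SSH_COMMAND variables. The CLONE_TIMEOUT variable and the function names are hypothetical, and the non-interactive settings are an addition beyond the timeout the text recommends.

```shell
#!/usr/bin/env bash
# Sketch: keep session init from hanging on a slow or interactive clone.
# CLONE_TIMEOUT and the function names are illustrative assumptions.
CLONE_TIMEOUT="${CLONE_TIMEOUT:-30}"

# Run "LIMIT COMMAND..." under GNU timeout and report when the limit is hit.
bounded() {
  local rc=0
  timeout "$@" || rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "init: timed out: $*" >&2
  fi
  return "$rc"
}

# Clone without prompting: BatchMode makes ssh fail instead of asking for a
# passphrase, and GIT_TERMINAL_PROMPT=0 disables git's own credential prompts.
clone_noninteractive() {
  bounded "$CLONE_TIMEOUT" env GIT_TERMINAL_PROMPT=0 \
    GIT_SSH_COMMAND="ssh -o BatchMode=yes" git clone "$1" "$2"
}
```

Appending & to the clone_noninteractive call moves the clone into the background, which matches the async init idea: the shell prompt appears immediately and a failed or slow clone cannot block it.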

Minor Pitfalls

Pitfall 1: Hindsight API Key in Git History

What goes wrong: A .env file containing HINDSIGHT_API_KEY gets committed to a git repo.
Why it happens: A developer accidentally stages .env files.
Prevention: /Users/bapung/.hermes/ is outside the ngn-agent repo, so there is no risk unless .env is copied into a repo directory.

Pitfall 2: DEFAULT_REPOS Superposition

What goes wrong: Two different session-init scripts or skills try to clone the same repo simultaneously.
Why it happens: Both shell_init_files and a session-start hook try to clone.
Prevention: Use only ONE mechanism. Prefer shell_init_files, as it is guaranteed to run before the agent starts.
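If a second mechanism cannot be ruled out entirely, a lock is a defensive fallback (it is not a substitute for the single-mechanism rule above). The sketch uses mkdir as an atomic lock: only one caller can create the directory, so a concurrent caller skips its clone. The function name, the lock path, and the exit code 75 are hypothetical choices.

```shell
#!/usr/bin/env bash
# Sketch: directory-based lock so two init paths cannot clone at once.
# Usage: with_clone_lock LOCK_DIR COMMAND [ARGS...]
with_clone_lock() {
  local lock="$1" rc=0
  shift
  if mkdir "$lock" 2>/dev/null; then
    # This caller holds the lock: run the command, then release the lock.
    "$@" || rc=$?
    rmdir "$lock"
    return "$rc"
  fi
  echo "skip: another init holds $lock" >&2
  return 75  # arbitrary "try again later" status
}
```

A stale lock left behind by a killed process has to be removed by hand in this version, so the sketch fits short-lived init scripts better than long-running jobs.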

Pitfall 3: Cron Report Delivers to Wrong Chat

What goes wrong: The daily report is delivered to the wrong Telegram chat.
Why it happens: deliver: origin routes to the chat where the cron job was created. If the job was created via the CLI, origin is missing and cron falls back to the first available home channel (cron/scheduler.py:444).
Prevention: Explicitly set deliver: telegram:474440517 (the ngn-agent DM) instead of deliver: telegram or deliver: origin.
Detection: Check cron delivery errors via hermes cron list.

Phase-Specific Warnings

| Phase Topic | Likely Pitfall | Mitigation |
|---|---|---|
| Hindsight activation | Provider conflict with other external provider | Verify memory.provider is set to only hindsight |
| Docker SSH volume | Key exposure via agent | Use deploy keys, read-only mount, monitor egress |
| Session init script | Blocking clone hangs container | Add timeouts, async background mode |
| Daily report skill | Poor quality LLM summaries | Iterate skill prompt; test with hermes cron run <id> |
| Stale cleanup script | Deleting active sessions | Add dry-run mode; check last_updated carefully |
| Docker volumes | Path mismatch between host/container | Use absolute paths in docker_volumes config |
| Git clone auth | SSH key passphrase prompt | Use key without passphrase or ssh-agent forwarding |

Sources

  • Hermes v0.16.0 source: agent/memory_manager.py lines 342-354 (provider conflict)
  • Hermes v0.16.0 source: agent/memory_provider.py lines 115-131 (async sync_turn, silent failure)
  • Hermes v0.16.0 source: cron/scheduler.py lines 1249-1303 (prompt injection scanning)
  • Hermes v0.16.0 source: cron/scheduler.py line 444 (origin fallback for delivery)
  • Docker container lifecycle: container_persistent: true + lifetime_seconds: 300 in config.yaml
  • Existing shell init script pattern: terminal.shell_init_files: [] (currently empty)