# Inbox Skill Test Plan

## Purpose

This directory tracks human-readable test plans for the `skills/inbox/` Codex skill bundle.

These documents are not command-contract specs for the `inbox` CLI itself. That coverage already lives under [../inbox/](../inbox/).

This directory exists to describe a different test surface:

- whether an agent can actually use the packaged inbox skill
- whether multiple agents can coordinate through the bundled CLI asset
- whether a real skill-guided conversation reaches the expected inbox state

## Test Model

- `README.md` is the index for this directory
- each skill test case lives in its own Markdown file
- use stable case slugs in filenames

## Shared Execution Contract

Use these defaults unless a case file explicitly overrides them:

- run the scenario with real subagents, not simulated transcripts
- inject the same skill bundle into every participating agent
- launch all role agents in parallel when the scenario depends on agent-to-agent timing
- require every agent to coordinate through the bundled CLI and shared SQLite DB instead of ordinary chat
- validate the final inbox state independently from the main thread after the agents stop

## How An Agent Runs These Cases

Use one test-runner agent to execute each case.

The test-runner agent is responsible for:

- reading this `README.md` first, then one specific case file
- creating an isolated temporary directory and SQLite DB path for that run
- launching the role agents described in `Agent Topology`
- injecting the same `skills/inbox/` bundle into every role agent
- passing each role agent the prompt text from the case file with concrete values substituted for `SKILL_PATH`, `TMPDIR`, and `THREAD_ID` when needed
- coordinating launch order or parallel start according to the case file
- collecting agent final summaries as evidence
- resolving the final `THREAD_ID`
- running the `Validation Commands` from the main thread after the role agents stop
- comparing the observed results against `Expected Outcomes` and `Assertions`
- returning a final pass/fail judgment with concrete evidence

The role agents are responsible for:

- acting only within the role assigned in the case file
- using the injected inbox skill rather than ad hoc repository discovery
- coordinating through the bundled CLI and shared DB
- reporting the concrete thread id, key command outcomes, and final observed state back to the test-runner agent

The test-runner agent should treat a case as passed only when:

- all role agents reach a final state without violating the case contract
- the independent validation commands succeed
- the final inbox state matches the assertions in the case file

The test-runner agent should treat a case as failed when:

- any role agent times out or stalls
- a required inbox action is skipped
- a role agent falls back to ordinary chat for critical coordination
- the final inbox state conflicts with the documented assertions

The test-runner agent should report results in this shape:

- `case`
- `db_path`
- `thread_id`
- `result`: `pass` or `fail`
- `agent_summaries`
- `validation_evidence`
- `assertion_checklist`
- `notes`

## Default Timeouts

Use these defaults unless a case file explicitly overrides them:

- per-agent timeout: `3m`
- overall scenario timeout: `5m`
- async wait margin for the main thread: `30s`

## Default Failure Conditions

Treat the test as failed if any of the following happens:

- any required agent does not reach a final state before timeout
- any required inbox command returns a non-success result unless the case expects that failure
- the final `show` output does not match the expected thread state
- the expected message sequence or key message bodies do not appear
- the agents fall back to ordinary chat for critical coordination instead of inbox messages

## Evidence Capture

Collect at least the following artifacts for every run:

- agent final summaries
- final `show --thread THREAD_ID --json` output
- at least one independent listing or lookup command such as `list` or `fetch`
- the temporary DB path and resolved thread id

## Cleanup Policy

Use these defaults unless a case file explicitly overrides them:

- keep the temporary DB and working directory on failure for debugging
- cleanup the temporary DB and working directory on success only if the caller does not need replay artifacts

## Per-Case Template

Each case file should use this structure:

- `Test Type`
- `Purpose`
- `Preconditions`
- `Agent Topology`
- `Inputs`
- `Execution Parameters`
- `Execution Steps`
- `Validation Commands`
- `Expected Outcomes`
- `Assertions`
- `Cleanup`
- `Recorded Example Run` when a real run has already been captured

## Case Files

| Case Slug | File | Coverage Note |
| --- | --- | --- |
| `multi-agent-roundtrip-through-bundled-cli` | [multi-agent-roundtrip-through-bundled-cli.md](./multi-agent-roundtrip-through-bundled-cli.md) | validates that two agents can use the bundled inbox skill to complete a blocked question and done result roundtrip |
| `parallel-workers-claim-conflict-through-bundled-cli` | [parallel-workers-claim-conflict-through-bundled-cli.md](./parallel-workers-claim-conflict-through-bundled-cli.md) | validates that two workers using the skill observe a real `lease_conflict` on the same thread |
| `blocked-worker-timeout-without-reply-through-bundled-cli` | [blocked-worker-timeout-without-reply-through-bundled-cli.md](./blocked-worker-timeout-without-reply-through-bundled-cli.md) | validates that a blocked worker using the skill receives the expected `wait-reply` timeout outcome when no leader reply arrives |
| `leader-cancels-claimed-thread-through-bundled-cli` | [leader-cancels-claimed-thread-through-bundled-cli.md](./leader-cancels-claimed-thread-through-bundled-cli.md) | validates that a leader can cancel an actively claimed thread and that both agents observe the cancelled terminal state |
| `artifact-roundtrip-through-bundled-cli` | [artifact-roundtrip-through-bundled-cli.md](./artifact-roundtrip-through-bundled-cli.md) | validates that bundled CLI usage through the skill preserves body-file and artifact data across task and result messages |

## Scope

In scope:

- explicit `$inbox` skill invocation
- bundled `./assets/inbox` CLI usage
- shared SQLite DB coordination between multiple agents
- end-to-end thread state and message history validation
- negative-path skill scenarios such as lease conflicts and reply timeouts
- skill-guided artifact and body-file roundtrips

Out of scope:

- per-command flag and JSON contract coverage
- store-level race conditions
- implicit skill triggering without `$inbox`

## Relationship To Other Test Docs

- [../inbox/](../inbox/) covers CLI command behavior
- this directory covers skill-guided multi-agent behavior on top of that CLI
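
The test-runner bookkeeping described under `How An Agent Runs These Cases` could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not part of the skill bundle: the helper names (`make_run_dir`, `fill_prompt`, `judge`), the `CaseReport` field types, and the `inbox.db` filename are all hypothetical. Only the report field names, the three placeholders, and the pass conditions come from this README.

```python
import os
import re
import tempfile
from dataclasses import dataclass, field
from typing import Dict, Tuple

# The three placeholders named in this README.
PLACEHOLDERS = re.compile(r"SKILL_PATH|TMPDIR|THREAD_ID")


def make_run_dir() -> Tuple[str, str]:
    """Create an isolated temporary directory and a DB path inside it."""
    tmpdir = tempfile.mkdtemp(prefix="inbox-skill-test-")
    # "inbox.db" is an assumed filename; the README only asks for an isolated path.
    return tmpdir, os.path.join(tmpdir, "inbox.db")


def fill_prompt(template: str, values: Dict[str, str]) -> str:
    """Substitute concrete values in a single pass; unknown placeholders stay as-is."""
    return PLACEHOLDERS.sub(lambda m: values.get(m.group(0), m.group(0)), template)


@dataclass
class CaseReport:
    """Result shape listed in this README; the field types are assumptions."""
    case: str
    db_path: str
    thread_id: str
    result: str  # "pass" or "fail"
    agent_summaries: Dict[str, str] = field(default_factory=dict)
    validation_evidence: Dict[str, str] = field(default_factory=dict)
    assertion_checklist: Dict[str, bool] = field(default_factory=dict)
    notes: str = ""


def judge(agents_finished: bool, validation_ok: bool, checklist: Dict[str, bool]) -> str:
    """Return "pass" only when every pass condition holds, otherwise "fail"."""
    # An empty checklist counts as a failure: nothing was actually asserted.
    all_assertions_hold = bool(checklist) and all(checklist.values())
    return "pass" if agents_finished and validation_ok and all_assertions_hold else "fail"
```

Treating an empty assertion checklist as a failure is a design choice of this sketch, so that a run which never checked anything cannot be reported as passed.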