docs: add inbox skill test scenarios

2026-03-19 12:35:05 +08:00
parent 1a9fc4c136
commit 72d7caa552
7 changed files with 568 additions and 0 deletions
@@ -22,6 +22,7 @@ As of now:
 - integration tests now cover each implemented inbox command, plus the main inbox workflows, wait/watch flows, artifact persistence, unread behavior, and JSON error contracts
 - a human-readable inbox command test-plan set has been authored under `docs/tests/inbox/`
 - a reusable Codex skill package for `inbox` now exists under `skills/inbox/`, with a formal `SKILL.md`, `agents/openai.yaml`, and a bundled CLI binary asset
 - an inbox skill forward-test plan directory now exists under `docs/tests/inbox-skill/`, with a shared execution template and multiple scenario cases
 - `orch` currently exists as a command skeleton only
 - no scheduler workflows have been implemented yet
@@ -0,0 +1,113 @@
 # Inbox Skill Test Plan
 ## Purpose
 This directory tracks human-readable test plans for the `skills/inbox/` Codex skill bundle.
 These documents are not command-contract specs for the `inbox` CLI itself.
 That coverage already lives under [../inbox/](../inbox/).
 This directory exists to describe a different test surface:
 - whether an agent can actually use the packaged inbox skill
 - whether multiple agents can coordinate through the bundled CLI asset
 - whether a real skill-guided conversation reaches the expected inbox state
 ## Test Model
 - `README.md` is the index for this directory
 - each skill test case lives in its own Markdown file
 - use stable case slugs in filenames
 ## Shared Execution Contract
 Use these defaults unless a case file explicitly overrides them:
 - run the scenario with real subagents, not simulated transcripts
 - inject the same skill bundle into every participating agent
 - launch all role agents in parallel when the scenario depends on agent-to-agent timing
 - require every agent to coordinate through the bundled CLI and shared SQLite DB instead of ordinary chat
 - validate the final inbox state independently from the main thread after the agents stop
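A minimal sketch of the parallel-launch rule, assuming each role agent can be wrapped in a Python callable that returns its final summary. `run_agents_in_parallel` and the callables are hypothetical names for illustration; the real subagent launcher is not specified in this plan.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Dict


def run_agents_in_parallel(
    agents: Dict[str, Callable[[], str]], per_agent_timeout_s: float
) -> Dict[str, str]:
    """Start every role agent at once; return each final summary or 'timeout'."""
    pool = ThreadPoolExecutor(max_workers=len(agents))
    # Submit all agents before waiting on any of them, so timing-sensitive
    # scenarios (claim conflicts, blocked questions) actually overlap.
    futures = {role: pool.submit(fn) for role, fn in agents.items()}
    deadline = time.monotonic() + per_agent_timeout_s
    results: Dict[str, str] = {}
    for role, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            results[role] = future.result(timeout=remaining)
        except FutureTimeout:
            results[role] = "timeout"
    # Do not block on stragglers; the caller decides how to stop them.
    pool.shutdown(wait=False)
    return results
```

All agents share one deadline measured from launch, so a slow first agent does not extend the budget of the others.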
 ## Default Timeouts
 Use these defaults unless a case file explicitly overrides them:
 - per-agent timeout: `3m`
 - overall scenario timeout: `5m`
 - async wait margin for the main thread: `30s`
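The defaults above can be merged with per-case overrides as in the sketch below. The key names and the `parse_duration` helper are assumptions made for illustration; this plan only fixes the values `3m`, `5m`, and `30s`.

```python
import re
from typing import Dict, Optional

# Default values copied from this README; the key names are hypothetical.
DEFAULT_TIMEOUTS = {"per_agent": "3m", "overall": "5m", "main_thread_margin": "30s"}

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert strings such as '30s' or '3m' into whole seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def resolve_timeouts(overrides: Optional[Dict[str, str]] = None) -> Dict[str, int]:
    """Apply case-level overrides on top of the shared defaults."""
    merged = {**DEFAULT_TIMEOUTS, **(overrides or {})}
    return {name: parse_duration(value) for name, value in merged.items()}
```

For example, a case that shortens only the per-agent budget would call `resolve_timeouts({"per_agent": "10s"})` and keep the other two defaults.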
 ## Default Failure Conditions
 Treat the test as failed if any of the following happens:
 - any required agent does not reach a final state before timeout
 - any required inbox command returns a non-success result unless the case expects that failure
 - the final `show` output does not match the expected thread state
 - the expected message sequence or key message bodies do not appear
 - the agents fall back to ordinary chat for critical coordination instead of inbox messages
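One way to apply these conditions mechanically is to collect run facts into a record and derive the failure reasons from it. The `RunRecord` fields below are hypothetical; they mirror the five bullets above and are not defined by the inbox CLI.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RunRecord:
    """Hypothetical summary of one scenario run, filled in by the main thread."""
    agents_timed_out: List[str] = field(default_factory=list)
    unexpected_command_failures: List[str] = field(default_factory=list)
    final_state_matches: bool = True
    message_sequence_matches: bool = True
    used_chat_for_coordination: bool = False


def failure_reasons(run: RunRecord) -> List[str]:
    """Return one reason per violated default failure condition."""
    reasons: List[str] = []
    for role in run.agents_timed_out:
        reasons.append(f"agent {role} did not reach a final state before timeout")
    for command in run.unexpected_command_failures:
        reasons.append(f"unexpected non-success result from: {command}")
    if not run.final_state_matches:
        reasons.append("final show output does not match the expected thread state")
    if not run.message_sequence_matches:
        reasons.append("expected message sequence or key bodies are missing")
    if run.used_chat_for_coordination:
        reasons.append("agents used ordinary chat for critical coordination")
    return reasons
```

A run passes only when `failure_reasons` returns an empty list. Failures that a case expects, such as `lease_conflict`, should not be added to `unexpected_command_failures`.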
 ## Evidence Capture
 Collect at least the following artifacts for every run:
 - agent final summaries
 - final `--json show --thread THREAD_ID` output
 - at least one independent listing or lookup command such as `list` or `fetch`
 - the temporary DB path and resolved thread id
 ## Cleanup Policy
 Use these defaults unless a case file explicitly overrides them:
 - keep the temporary DB and working directory on failure for debugging
 - clean up the temporary DB and working directory on success, but only if the caller does not need replay artifacts
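The policy reduces to a single decision, sketched below. `make_workdir` and `apply_cleanup_policy` are hypothetical helper names, and the directory prefix is an arbitrary choice for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def make_workdir() -> Path:
    """Create the empty temporary directory that the cases call TMPDIR."""
    return Path(tempfile.mkdtemp(prefix="inbox-skill-"))


def apply_cleanup_policy(
    workdir: Path, passed: bool, keep_replay_artifacts: bool = False
) -> bool:
    """Remove workdir only after a passing run whose caller needs no replay.

    Returns True when the directory was removed.
    """
    if passed and not keep_replay_artifacts:
        shutil.rmtree(workdir, ignore_errors=True)
        return True
    # Failed runs, and passing runs kept for replay, stay on disk.
    return False
```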
 ## Per-Case Template
 Each case file should use this structure:
 - `Test Type`
 - `Purpose`
 - `Preconditions`
 - `Agent Topology`
 - `Inputs`
 - `Execution Parameters`
 - `Execution Steps`
 - `Validation Commands`
 - `Expected Outcomes`
 - `Assertions`
 - `Cleanup`
 - `Recorded Example Run` when a real run has already been captured
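A small lint for this template could look like the sketch below. It assumes every template entry appears as a level-two Markdown heading, which is how the existing case files are written; `missing_sections` is a hypothetical helper, not part of the repository.

```python
import re
from typing import List

# Required headings from the template; `Recorded Example Run` is optional.
REQUIRED_SECTIONS = [
    "Test Type", "Purpose", "Preconditions", "Agent Topology", "Inputs",
    "Execution Parameters", "Execution Steps", "Validation Commands",
    "Expected Outcomes", "Assertions", "Cleanup",
]


def missing_sections(markdown: str) -> List[str]:
    """List required level-two headings that a case file does not contain."""
    headings = {
        match.group(1)
        for match in re.finditer(r"^[ \t]*##[ \t]+(.+?)[ \t]*$", markdown, flags=re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name not in headings]
```

Deeper headings such as `### Leader Prompt` are ignored, so they neither satisfy nor break the check.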
 ## Case Files
 | Case Slug | File | Coverage Note |
 | --- | --- | --- |
 | `multi-agent-roundtrip-through-bundled-cli` | [multi-agent-roundtrip-through-bundled-cli.md](./multi-agent-roundtrip-through-bundled-cli.md) | validates that two agents can use the bundled inbox skill to complete a roundtrip from a blocked question to a `done` result |
 | `parallel-workers-claim-conflict-through-bundled-cli` | [parallel-workers-claim-conflict-through-bundled-cli.md](./parallel-workers-claim-conflict-through-bundled-cli.md) | validates that two workers using the skill observe a real `lease_conflict` on the same thread |
 | `blocked-worker-timeout-without-reply-through-bundled-cli` | [blocked-worker-timeout-without-reply-through-bundled-cli.md](./blocked-worker-timeout-without-reply-through-bundled-cli.md) | validates that a blocked worker using the skill receives the expected `wait-reply` timeout outcome when no leader reply arrives |
 | `leader-cancels-claimed-thread-through-bundled-cli` | [leader-cancels-claimed-thread-through-bundled-cli.md](./leader-cancels-claimed-thread-through-bundled-cli.md) | validates that a leader can cancel an actively claimed thread and that both agents observe the cancelled terminal state |
 | `artifact-roundtrip-through-bundled-cli` | [artifact-roundtrip-through-bundled-cli.md](./artifact-roundtrip-through-bundled-cli.md) | validates that bundled CLI usage through the skill preserves body-file and artifact data across task and result messages |
 ## Scope
 In scope:
 - explicit `$inbox` skill invocation
 - bundled `./assets/inbox` CLI usage
 - shared SQLite DB coordination between multiple agents
 - end-to-end thread state and message history validation
 - negative-path skill scenarios such as lease conflicts and reply timeouts
 - skill-guided artifact and body-file roundtrips
 Out of scope:
 - per-command flag and JSON contract coverage
 - store-level race conditions
 - implicit skill triggering without `$inbox`
 ## Relationship To Other Test Docs
 - [../inbox/](../inbox/) covers CLI command behavior
 - this directory covers skill-guided multi-agent behavior on top of that CLI
@@ -0,0 +1,83 @@
 # Case: `artifact-roundtrip-through-bundled-cli`
 ## Test Type
 This is a `forward-test` and an artifact-preservation validation.
 The goal is to verify that agents using the packaged inbox skill can exchange body-file content and artifacts through the bundled CLI without losing message data.
 ## Purpose
 Validate that all of the following can be true at the same time:
 - the leader can create task input files and send them through the bundled CLI
 - the worker can inspect those artifacts through inbox history
 - the worker can return a final result using body-file or artifact inputs
 - the final thread history preserves both task-side and result-side file references
 ## Preconditions
 - skill path exists: `SKILL_PATH=skills/inbox`
 - bundled CLI executable exists: `SKILL_PATH/assets/inbox`
 - use an empty temporary directory `TMPDIR`
 - test database path is `TMPDIR/coord.db`
 ## Agent Topology
 - `leader`
 - `worker-a`
 ## Inputs
 ### Leader Prompt
 ```text
 Use $inbox at SKILL_PATH to act as leader on SQLite DB TMPDIR/coord.db. Only coordinate through the bundled inbox CLI from the skill. Workflow: 1) initialize the DB, 2) create a small task file under TMPDIR, 3) send one task to worker-a using body-file plus at least one artifact and artifact metadata, 4) wait until worker-a marks the thread done, 5) inspect the final thread with show, 6) stop. Do not use ordinary chat to coordinate with the other agent.
 ```
 ### Worker Prompt
 ```text
 Use $inbox at SKILL_PATH to act as worker-a on SQLite DB TMPDIR/coord.db. Only coordinate through the bundled inbox CLI from the skill. Workflow: 1) fetch and claim the task, 2) inspect the task message with show and confirm the artifact is visible, 3) create a small result file under TMPDIR, 4) finish the thread with done using body-file or artifact input, 5) stop after reporting what files were preserved. Do not use ordinary chat to coordinate with the other agent.
 ```
 ## Execution Parameters
 - use the shared execution contract from [README.md](./README.md)
 - use the shared timeout defaults from [README.md](./README.md)
 - do not override the default cleanup policy
 ## Execution Steps
 1. Inject the same `skills/inbox/` skill into both real agents
 2. Point both agents at the same database path `TMPDIR/coord.db`
 3. Launch `leader` and `worker-a` in parallel
 4. Wait for both agents to finish
 5. Resolve `THREAD_ID` from the agent outputs or inbox history
 6. Independently run the validation commands from the main thread
 ## Validation Commands
 ```bash
 SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json show --thread THREAD_ID
 ```
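When the main thread drives validation from a script, the command above can be assembled as an argument list. This is a sketch: `inbox_argv` and `run_inbox_json` are hypothetical helpers, and the assumption that `--json` output arrives on stdout is not confirmed by this plan.

```python
import json
import subprocess
from typing import Any, List, Optional, Tuple


def inbox_argv(skill_path: str, db_path: str, *command: str) -> List[str]:
    """Build the bundled-CLI invocation in the flag order the cases use."""
    return [f"{skill_path}/assets/inbox", "--db", db_path, "--json", *command]


def run_inbox_json(skill_path: str, db_path: str, *command: str) -> Tuple[int, Optional[Any]]:
    """Run one inbox command; return (exit code, parsed JSON or None).

    Assumption: JSON is written to stdout. Adjust if errors go to stderr.
    """
    completed = subprocess.run(
        inbox_argv(skill_path, db_path, *command), capture_output=True, text=True
    )
    payload = json.loads(completed.stdout) if completed.stdout.strip() else None
    return completed.returncode, payload
```

For this case the call would be `run_inbox_json(SKILL_PATH, f"{TMPDIR}/coord.db", "show", "--thread", THREAD_ID)` with the placeholders resolved.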
 ## Expected Outcomes
 - `leader` successfully creates a task file and sends it through `body-file`
 - the initial task message contains at least one artifact reference
 - `worker-a` successfully inspects the task artifact through `show`
 - `worker-a` completes the thread with `done`
 - the final `show` output preserves task-side and result-side file content or artifact references
 ## Assertions
 - the first task message contains non-empty body content sourced from a file
 - the first task message contains at least one artifact entry
 - the final `result` message contains either body-file content or at least one artifact entry
 - the final thread status is `done`
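The assertions above could be checked against parsed `show` JSON roughly as follows. The layout is an assumption: `data.messages` and `thread.status` follow paths used elsewhere in these cases, while the `kind`, `body`, and especially `artifacts` keys are guesses to verify against real output.

```python
from typing import Any, Dict, List


def check_artifact_roundtrip(show: Dict[str, Any]) -> List[str]:
    """Return a list of violated assertions; an empty list means the case passes."""
    data = show.get("data", {})
    messages = data.get("messages", [])
    problems: List[str] = []

    tasks = [m for m in messages if m.get("kind") == "task"]
    if not tasks:
        problems.append("no task message found")
    else:
        first = tasks[0]
        if not (first.get("body") or "").strip():
            problems.append("first task message has an empty body")
        if not first.get("artifacts"):  # assumed key name
            problems.append("first task message has no artifact entry")

    results = [m for m in messages if m.get("kind") == "result"]
    if not results:
        problems.append("no result message found")
    else:
        last = results[-1]
        if not ((last.get("body") or "").strip() or last.get("artifacts")):
            problems.append("final result has neither body content nor artifacts")

    if data.get("thread", {}).get("status") != "done":
        problems.append("final thread status is not done")
    return problems
```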
 ## Cleanup
 - use the default cleanup policy from [README.md](./README.md)
 - if the run fails, retain `TMPDIR`, created files, and `coord.db` for replay and manual inspection
@@ -0,0 +1,88 @@
 # Case: `blocked-worker-timeout-without-reply-through-bundled-cli`
 ## Test Type
 This is a `forward-test` and a timeout-path skill validation.
 The goal is to verify that a blocked worker using the bundled inbox skill sees the correct `wait-reply` timeout behavior when no answer arrives.
 ## Purpose
 Validate that all of the following can be true at the same time:
 - a worker can use the skill to fetch, claim, and block a real thread
 - the worker can call `wait-reply` through the bundled CLI
 - the leader intentionally does not answer
 - the worker receives the expected timeout contract instead of silently succeeding
 - the thread remains in a blocked state with the question preserved
 ## Preconditions
 - skill path exists: `SKILL_PATH=skills/inbox`
 - bundled CLI executable exists: `SKILL_PATH/assets/inbox`
 - use an empty temporary directory `TMPDIR`
 - test database path is `TMPDIR/coord.db`
 ## Agent Topology
 - `leader`
 - `worker-a`
 ## Inputs
 ### Leader Prompt
 ```text
 Use $inbox at SKILL_PATH to act as leader on SQLite DB TMPDIR/coord.db. Only coordinate through the bundled inbox CLI from the skill. Workflow: 1) initialize the DB, 2) send exactly one task to worker-a, 3) monitor until worker-a asks one blocked question, 4) intentionally do not reply, 5) stop after confirming the thread is still blocked. Do not use ordinary chat to coordinate with the other agent.
 ```
 ### Worker Prompt
 ```text
 Use $inbox at SKILL_PATH to act as worker-a on SQLite DB TMPDIR/coord.db. Only coordinate through the bundled inbox CLI from the skill. Workflow: 1) fetch pending work, 2) claim it, 3) send a blocked update with one precise question, 4) call wait-reply with a short timeout, 5) stop after reporting the timeout result exactly as observed. Do not use ordinary chat to coordinate with the other agent.
 ```
 ## Execution Parameters
 - use the shared execution contract from [README.md](./README.md)
 - set the worker-side `wait-reply` timeout to a short interval such as `10s`, overriding the shared default
 - keep the default cleanup policy
 ## Execution Steps
 1. Inject the same `skills/inbox/` skill into both real agents
 2. Point both agents at the same database path `TMPDIR/coord.db`
 3. Launch `leader` and `worker-a` in parallel
 4. Wait for both agents to finish
 5. Resolve `THREAD_ID` from the agent outputs or inbox history
 6. Independently run the validation commands from the main thread
 ## Validation Commands
 ```bash
 SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json show --thread THREAD_ID
 SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json list --status blocked
 ```
 ## Expected Outcomes
 - `leader` successfully creates one thread for `worker-a`
 - `worker-a` successfully fetches and claims it
 - `worker-a` emits one blocked `question`
 - the blocked question is preserved at least in `message.payload_json.question`
 - `worker-a` runs `wait-reply` and receives the no-match timeout contract
 - the leader emits no `answer` message
 - the final thread status remains `blocked`
 ## Assertions
 - the worker reports exit code `10` and JSON error code `no_matching_work` from `wait-reply`
 - `show` includes the blocked `question` message
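 - `show.data.messages[*].payload_json.question` contains the question text `worker-a` actually sent; the prompts in this case do not fix the wording, so compare against the worker's final summary (for example `Should logging go to stdout or stderr?`)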
 - `show` does not include any `answer` message
 - `list --status blocked` returns the thread
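A sketch of these checks, given the worker's reported exit code and error code plus the parsed `show` JSON. The exit code `10`, the error code `no_matching_work`, and the `payload_json.question` path come from this case; the `kind` key and the possibility that `payload_json` is stored as a JSON string are assumptions.

```python
import json
from typing import Any, Dict, List


def _payload(message: Dict[str, Any]) -> Dict[str, Any]:
    """Return payload_json as a dict, decoding it if it is stored as a string."""
    raw = message.get("payload_json") or {}
    return json.loads(raw) if isinstance(raw, str) else raw


def check_timeout_case(exit_code: int, error_code: str, show: Dict[str, Any]) -> List[str]:
    """Return violated assertions for the no-reply timeout scenario."""
    data = show.get("data", {})
    messages = data.get("messages", [])
    problems: List[str] = []
    if exit_code != 10:
        problems.append(f"wait-reply exit code was {exit_code}, expected 10")
    if error_code != "no_matching_work":
        problems.append(f"error code was {error_code!r}, expected 'no_matching_work'")
    questions = [m for m in messages if m.get("kind") == "question"]
    if not any(_payload(m).get("question") for m in questions):
        problems.append("no question message with payload_json.question")
    if any(m.get("kind") == "answer" for m in messages):
        problems.append("an answer message exists but none was expected")
    if data.get("thread", {}).get("status") != "blocked":
        problems.append("final thread status is not blocked")
    return problems
```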
 ## Cleanup
 - use the default cleanup policy from [README.md](./README.md)
 - if the run fails, retain `TMPDIR` and `coord.db` for replay and manual inspection
@@ -0,0 +1,84 @@
 # Case: `leader-cancels-claimed-thread-through-bundled-cli`
 ## Test Type
 This is a `forward-test` and a terminal-state intervention validation.
 The goal is to verify that a leader and worker can both observe a thread transition to `cancelled` through the bundled inbox skill while the thread is actively claimed.
 ## Purpose
 Validate that all of the following can be true at the same time:
 - the worker can fetch and claim a real thread through the skill
 - the leader can cancel that thread through the same bundled CLI
 - the final thread state is `cancelled`
 - both parties can inspect the terminal state from inbox history
 ## Preconditions
 - skill path exists: `SKILL_PATH=skills/inbox`
 - bundled CLI executable exists: `SKILL_PATH/assets/inbox`
 - use an empty temporary directory `TMPDIR`
 - test database path is `TMPDIR/coord.db`
 ## Agent Topology
 - `leader`
 - `worker-a`
 ## Inputs
 ### Leader Prompt
 ```text
 Use $inbox at SKILL_PATH to act as leader on SQLite DB TMPDIR/coord.db. Only coordinate through the bundled inbox CLI from the skill. Workflow: 1) initialize the DB, 2) send exactly one task to worker-a, 3) wait until worker-a has claimed the thread or reported in_progress, 4) cancel the thread with a clear reason, 5) inspect the final thread with show, 6) stop. Do not use ordinary chat to coordinate with the other agent.
 ```
 ### Worker Prompt
 ```text
 Use $inbox at SKILL_PATH to act as worker-a on SQLite DB TMPDIR/coord.db. Only coordinate through the bundled inbox CLI from the skill. Workflow: 1) fetch pending work, 2) claim it, 3) send an in_progress update, 4) keep monitoring the thread until it reaches a terminal state, 5) stop after reporting the final status you observed. Do not use ordinary chat to coordinate with the other agent.
 ```
 ## Execution Parameters
 - use the shared execution contract from [README.md](./README.md)
 - use the shared timeout defaults from [README.md](./README.md)
 - do not override the default cleanup policy
 ## Execution Steps
 1. Inject the same `skills/inbox/` skill into both real agents
 2. Point both agents at the same database path `TMPDIR/coord.db`
 3. Launch `leader` and `worker-a` in parallel
 4. Wait for both agents to finish
 5. Resolve `THREAD_ID` from the agent outputs or inbox history
 6. Independently run the validation commands from the main thread
 ## Validation Commands
 ```bash
 SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json show --thread THREAD_ID
 SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json list --status cancelled
 ```
 ## Expected Outcomes
 - `worker-a` successfully claims the thread
 - `worker-a` emits one `progress` message
 - `leader` successfully emits `cancel` with a reason
 - the final thread status is `cancelled`
 - the worker reports that it observed the cancelled terminal state
 ## Assertions
 - `show` contains at least the message kinds `task -> event -> progress -> control`, in that order
 - the final thread status is `cancelled`
 - the terminal message or thread history captures the cancel reason
 - `list --status cancelled` returns the thread
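The terminal-state and cancel-reason assertions can be sketched as below. Because this case does not say which field carries the reason, the helper searches each serialized `control` message for the text. The `kind` key and the `data` layout are assumptions.

```python
import json
from typing import Any, Dict, List


def check_cancel_case(show: Dict[str, Any], expected_reason: str) -> List[str]:
    """Return violated assertions for the leader-cancel scenario."""
    data = show.get("data", {})
    messages = data.get("messages", [])
    problems: List[str] = []
    if data.get("thread", {}).get("status") != "cancelled":
        problems.append("final thread status is not cancelled")
    controls = [m for m in messages if m.get("kind") == "control"]
    # Field-agnostic search: look for the reason anywhere in the message.
    # Reasons containing quotes or backslashes would need escaping first.
    if not any(expected_reason in json.dumps(m, ensure_ascii=False) for m in controls):
        problems.append("no control message captures the cancel reason")
    return problems
```

Pass the reason exactly as the leader reported it in its final summary.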
 ## Cleanup
 - use the default cleanup policy from [README.md](./README.md)
 - if the run fails, retain `TMPDIR` and `coord.db` for replay and manual inspection
@@ -0,0 +1,106 @@
 # Case: `multi-agent-roundtrip-through-bundled-cli`
 ## Test Type
 This is a `forward-test` and a multi-agent end-to-end skill validation.
 The goal is not to validate one CLI subcommand in isolation. The goal is to validate that two real agents can complete a closed-loop coordination flow through the packaged `skills/inbox/` skill and bundled CLI.
 ## Purpose
 Validate that all of the following can be true at the same time:
 - both agents can explicitly use `$inbox`
 - both agents coordinate through the bundled `./assets/inbox` against the same SQLite DB
 - the worker follows the protocol `fetch -> claim -> update -> wait-reply -> done`
 - the leader follows the protocol `init -> send -> show/reply -> show`
 - the final inbox thread state and message history match the expected contract
 ## Preconditions
 - skill path exists: `SKILL_PATH=skills/inbox`
 - bundled CLI executable exists: `SKILL_PATH/assets/inbox`
 - use an empty temporary directory `TMPDIR`
 - test database path is `TMPDIR/coord.db`
 ## Agent Topology
 - `leader`
 - `worker-a`
 ## Inputs
 ### Leader Prompt
 ```text
 Use $inbox at SKILL_PATH to act as leader on SQLite DB TMPDIR/coord.db. Only coordinate through the bundled inbox CLI from the skill. Workflow: 1) initialize the DB, 2) send exactly one task to worker-a asking them to implement a small logging choice, 3) monitor the thread until worker-a asks one blocked question, 4) answer the blocked question with a clear decision ('use stdout'), 5) wait until worker-a marks the thread done, 6) inspect the final thread with show, then stop. Do not use ordinary chat to coordinate with the other agent.
 ```
 ### Worker Prompt
 ```text
 Use $inbox at SKILL_PATH to act as worker-a on SQLite DB TMPDIR/coord.db. Only coordinate through the bundled inbox CLI from the skill. Workflow: 1) wait until there is pending work for worker-a, 2) fetch it, 3) claim it, 4) send an in_progress update, 5) send a blocked update with one precise question asking whether logging should go to stdout or stderr, 6) wait for a reply, 7) finish the task with done using the received decision, 8) stop. Do not use ordinary chat to coordinate with the other agent.
 ```
 ## Execution Parameters
 - use the shared execution contract from [README.md](./README.md)
 - use the shared timeout defaults from [README.md](./README.md)
 - do not override the default cleanup policy
 ## Execution Steps
 1. Inject the same `skills/inbox/` skill into both real agents
 2. Point both agents at the same database path `TMPDIR/coord.db`
 3. Launch `leader` and `worker-a` in parallel
 4. Wait for both agents to finish
 5. Resolve `THREAD_ID` from the agent outputs or inbox history
 6. Independently run the validation commands from the main thread
 ## Validation Commands
 ```bash
 SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json show --thread THREAD_ID
 SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json list --assigned-to worker-a
 ```
 ## Expected Outcomes
 - `leader` successfully runs `init`
 - `leader` successfully `send`s one new thread to `worker-a`
 - `worker-a` successfully `fetch`es that thread and successfully `claim`s it
 - `worker-a` emits one `progress` message
 - `worker-a` emits one `question` message focused on `stdout` vs `stderr`
 - `leader` successfully emits one `answer` message with the explicit decision `Use stdout.`
 - `worker-a` successfully consumes that answer through `wait-reply`
 - `worker-a` successfully emits `done`
 - `show` returns `thread.status == "done"`
 ## Assertions
 - `show` contains at least the following message kinds in order:
  - `task`
  - `event` (`thread claimed`)
  - `progress`
  - `question`
  - `answer`
  - `result`
 - `question.body == "Should logging go to stdout or stderr?"`
 - `answer.body == "Use stdout."`
 - the final `result` message explicitly states that logging uses `stdout`
 - `list --assigned-to worker-a` shows the thread and its status is `done`
 - coordination happens primarily through the inbox thread rather than ordinary chat
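The ordering assertion is a subsequence check: extra messages may appear between the expected kinds, but the expected kinds must occur in order. A sketch, assuming each entry in `data.messages` exposes `kind` and `body` keys:

```python
from typing import Any, Dict, Iterable, List

EXPECTED_KINDS = ["task", "event", "progress", "question", "answer", "result"]


def contains_in_order(kinds: Iterable[str], expected: List[str]) -> bool:
    """True when `expected` is a subsequence of `kinds`."""
    remaining = iter(kinds)
    # `kind in remaining` advances the iterator, which enforces ordering.
    return all(kind in remaining for kind in expected)


def check_roundtrip(show: Dict[str, Any]) -> List[str]:
    """Return violated assertions for the roundtrip scenario."""
    data = show.get("data", {})
    messages = data.get("messages", [])
    problems: List[str] = []
    if not contains_in_order([m.get("kind") for m in messages], EXPECTED_KINDS):
        problems.append("message kinds are missing or out of order")
    # Keep the last body seen for each kind.
    bodies = {m.get("kind"): m.get("body") for m in messages}
    if bodies.get("question") != "Should logging go to stdout or stderr?":
        problems.append("question body does not match")
    if bodies.get("answer") != "Use stdout.":
        problems.append("answer body does not match")
    if "stdout" not in (bodies.get("result") or ""):
        problems.append("result does not state that logging uses stdout")
    if data.get("thread", {}).get("status") != "done":
        problems.append("final thread status is not done")
    return problems
```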
 ## Cleanup
 - use the default cleanup policy from [README.md](./README.md)
 - if the run fails, retain `TMPDIR` and `coord.db` for replay and manual inspection
 ## Recorded Example Run
 This case already has one reference forward-test run:
 - DB: `/tmp/inbox-skill-fwd.j9kKvp/coord.db`
 - Thread: `thr_48d6f6a77eff4c2e88ce80e8fdc05da3`
 That run passed. The thread history contained `task -> event -> progress -> question -> answer -> result`, and the final thread state was `done`.
@@ -0,0 +1,93 @@
 # Case: `parallel-workers-claim-conflict-through-bundled-cli`
 ## Test Type
 This is a `forward-test` and a multi-agent negative-path validation.
 The goal is to verify that two workers using the same bundled inbox skill can exercise a real claim conflict through the SQLite-backed inbox instead of simulating the outcome.
 ## Purpose
 Validate that all of the following can be true at the same time:
 - multiple workers can use the same `skills/inbox/` bundle against one shared DB
 - one worker can successfully claim the thread
 - a competing worker can observe and attempt to claim that same thread
 - the competing worker receives the expected `lease_conflict` contract
 - the thread remains owned by the original worker
 ## Preconditions
 - skill path exists: `SKILL_PATH=skills/inbox`
 - bundled CLI executable exists: `SKILL_PATH/assets/inbox`
 - use an empty temporary directory `TMPDIR`
 - test database path is `TMPDIR/coord.db`
 ## Agent Topology
 - `leader`
 - `worker-a`
 - `worker-b`
 ## Inputs
 ### Leader Prompt
 ```text
 Use $inbox at SKILL_PATH to act as leader on SQLite DB TMPDIR/coord.db. Only coordinate through the bundled inbox CLI from the skill. Workflow: 1) initialize the DB, 2) send exactly one task assigned to worker-a, 3) stop after confirming the thread exists and report the thread id. Do not use ordinary chat to coordinate with the workers.
 ```
 ### Worker A Prompt
 ```text
 Use $inbox at SKILL_PATH to act as worker-a on SQLite DB TMPDIR/coord.db. Only coordinate through the bundled inbox CLI from the skill. Workflow: 1) wait for pending work assigned to worker-a, 2) fetch it, 3) claim it, 4) stop after confirming the claim succeeded and report the thread id and lease result. Do not use ordinary chat to coordinate with the other agents.
 ```
 ### Worker B Prompt
 ```text
 Use $inbox at SKILL_PATH to act as worker-b on SQLite DB TMPDIR/coord.db. Only coordinate through the bundled inbox CLI from the skill. This is a conflict test. Workflow: 1) wait until there is a thread assigned to worker-a visible through inbox inspection, 2) resolve its thread id, 3) attempt to claim that thread as worker-b, 4) stop after reporting the exact error contract you observed. Do not use ordinary chat to coordinate with the other agents.
 ```
 ## Execution Parameters
 - use the shared execution contract from [README.md](./README.md)
 - use the shared timeout defaults from [README.md](./README.md)
 - do not override the default cleanup policy
 ## Execution Steps
 1. Inject the same `skills/inbox/` skill into all three real agents
 2. Point all three agents at the same database path `TMPDIR/coord.db`
 3. Launch `leader`, `worker-a`, and `worker-b` in parallel
 4. Wait for all agents to finish
 5. Resolve `THREAD_ID` from the agent outputs or inbox history
 6. Independently run the validation commands from the main thread
 ## Validation Commands
 ```bash
 SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json show --thread THREAD_ID
 SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json list --assigned-to worker-a
 ```
 ## Expected Outcomes
 - `leader` successfully runs `init`
 - `leader` successfully creates one thread for `worker-a`
 - `worker-a` successfully `claim`s that thread
 - `worker-b` attempts `claim --agent worker-b --thread THREAD_ID`
 - `worker-b` receives exit code `20` and JSON error code `lease_conflict`
 - the final thread remains assigned to `worker-a`
 ## Assertions
 - `show` contains a worker-side `event` message with summary `thread claimed`
 - the final thread status is still `claimed` or `in_progress`, and ownership has not transferred to `worker-b`
 - `list --assigned-to worker-a` still returns the thread
 - no agent reports successful ownership transfer to `worker-b`
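A sketch of the conflict checks, given what `worker-b` reported and the parsed `show` JSON. Exit code `20` and `lease_conflict` come from this case; the `kind` and `summary` keys are assumed field names, with a fallback to `body` in case the summary is stored there.

```python
from typing import Any, Dict, List


def check_claim_conflict(exit_code: int, error_code: str, show: Dict[str, Any]) -> List[str]:
    """Return violated assertions for the parallel claim-conflict scenario."""
    data = show.get("data", {})
    messages = data.get("messages", [])
    problems: List[str] = []
    if exit_code != 20:
        problems.append(f"worker-b exit code was {exit_code}, expected 20")
    if error_code != "lease_conflict":
        problems.append(f"error code was {error_code!r}, expected 'lease_conflict'")
    claimed_events = [
        m for m in messages
        if m.get("kind") == "event"
        and "thread claimed" in (m.get("summary") or m.get("body") or "")
    ]
    if not claimed_events:
        problems.append("no event message with summary 'thread claimed'")
    if data.get("thread", {}).get("status") not in ("claimed", "in_progress"):
        problems.append("final thread status is neither claimed nor in_progress")
    return problems
```

Ownership itself still needs the `list --assigned-to worker-a` check, since this sketch does not assume a field name for the current lease holder.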
 ## Cleanup
 - use the default cleanup policy from [README.md](./README.md)
 - if the run fails, retain `TMPDIR` and `coord.db` for replay and manual inspection