# Inbox Skill Test Plan

## Purpose

This directory tracks human-readable test plans for the `skills/inbox/` Codex skill bundle.

These documents are not command-contract specs for the `inbox` CLI itself.
That coverage already lives under [../inbox/](../inbox/).

This directory exists to describe a different test surface:

- whether an agent can actually use the packaged inbox skill
- whether multiple agents can coordinate through the bundled CLI asset
- whether a real skill-guided conversation reaches the expected inbox state

## Test Model

- `README.md` is the index for this directory
- each skill test case lives in its own Markdown file
- use stable case slugs in filenames

## Shared Execution Contract

Use these defaults unless a case file explicitly overrides them:

- run the scenario with real subagents, not simulated transcripts
- inject the same skill bundle into every participating agent
- launch all role agents in parallel when the scenario depends on agent-to-agent timing
- require every agent to coordinate through the bundled CLI and shared SQLite DB instead of ordinary chat
- validate the final inbox state independently from the main thread after the agents stop
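
As a rough sketch of the defaults above, a runner could create one isolated SQLite path per run and start every role agent at the same time. `launch_role_agent` is a hypothetical stand-in for whatever mechanism actually starts a subagent; it is not part of the inbox skill or CLI. The role names and the `inbox.db` filename are illustrative assumptions as well.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def launch_role_agent(role: str, skill_path: str, db_path: str) -> dict:
    """Hypothetical placeholder: start one real subagent and return its final summary."""
    # A real runner would hand the agent the skill bundle and the shared DB path here.
    return {"role": role, "skill_path": skill_path, "db_path": db_path}


def run_roles_in_parallel(roles: list[str], skill_path: str) -> tuple[str, list[dict]]:
    # One isolated directory and DB path per run, shared by every role agent.
    run_dir = Path(tempfile.mkdtemp(prefix="inbox-skill-test-"))
    db_path = str(run_dir / "inbox.db")  # assumed filename

    # Submit every role before waiting on any of them so they start together.
    with ThreadPoolExecutor(max_workers=len(roles)) as pool:
        futures = [
            pool.submit(launch_role_agent, role, skill_path, db_path)
            for role in roles
        ]
        summaries = [future.result() for future in futures]
    return db_path, summaries


db_path, summaries = run_roles_in_parallel(["leader", "worker"], "skills/inbox/")
```

Every role receives the same `skill_path` and `db_path`, which mirrors the requirement that all agents share one bundle and one DB.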

## How An Agent Runs These Cases

Use one test-runner agent to execute each case.

The test-runner agent is responsible for:

- reading this `README.md` first, then one specific case file
- creating an isolated temporary directory and SQLite DB path for that run
- launching the role agents described in `Agent Topology`
- injecting the same `skills/inbox/` bundle into every role agent
- passing each role agent the prompt text from the case file with concrete values substituted for `SKILL_PATH`, `TMPDIR`, and `THREAD_ID` when needed
- coordinating launch order or parallel start according to the case file
- collecting agent final summaries as evidence
- resolving the final `THREAD_ID`
- running the `Validation Commands` from the main thread after the role agents stop
- comparing the observed results against `Expected Outcomes` and `Assertions`
- returning a final pass/fail judgment with concrete evidence

The role agents are responsible for:

- acting only within the role assigned in the case file
- using the injected inbox skill rather than ad hoc repository discovery
- coordinating through the bundled CLI and shared DB
- reporting the concrete thread id, key command outcomes, and final observed state back to the test-runner agent

The test-runner agent should treat a case as passed only when:

- all role agents reach a final state without violating the case contract
- the independent validation commands succeed
- the final inbox state matches the assertions in the case file

The test-runner agent should treat a case as failed when:

- any role agent times out or stalls
- a required inbox action is skipped
- a role agent falls back to ordinary chat for critical coordination
- the final inbox state conflicts with the documented assertions

The test-runner agent should report results in this shape:

- `case`
- `db_path`
- `thread_id`
- `result`: `pass` or `fail`
- `agent_summaries`
- `validation_evidence`
- `assertion_checklist`
- `notes`
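
The report shape above could be modeled as a small record, sketched here in Python. Only the field names and the `pass`/`fail` values come from this document; the field types, the `CaseReport` name, and the sample values are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CaseReport:
    """Hypothetical container for one test-runner result."""

    case: str
    db_path: str
    thread_id: str
    result: str  # "pass" or "fail"
    agent_summaries: list = field(default_factory=list)      # assumed: one entry per role agent
    validation_evidence: list = field(default_factory=list)  # assumed: captured command output
    assertion_checklist: dict = field(default_factory=dict)  # assumed: assertion text -> bool
    notes: str = ""

    def __post_init__(self) -> None:
        # Reject anything other than the two documented result values.
        if self.result not in ("pass", "fail"):
            raise ValueError(f"result must be 'pass' or 'fail', got {self.result!r}")


report = CaseReport(
    case="multi-agent-roundtrip-through-bundled-cli",
    db_path="/tmp/example/inbox.db",  # illustrative path
    thread_id="THREAD_ID",            # placeholder, resolved by the runner
    result="fail",
    assertion_checklist={"final inbox state matches the case file": False},
)
print(json.dumps(asdict(report), indent=2))
```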

## Default Timeouts

Use these defaults unless a case file explicitly overrides them:

- per-agent timeout: `3m`
- overall scenario timeout: `5m`
- async wait margin for the main thread: `30s`
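
A minimal sketch of how a runner might turn these defaults into seconds and apply per-case overrides. The key names (`per_agent`, `overall`, `async_wait_margin`) and the override mechanism are assumptions; only the three duration values come from this document.

```python
import re

# Default values from this README; key names are hypothetical.
DEFAULT_TIMEOUTS = {"per_agent": "3m", "overall": "5m", "async_wait_margin": "30s"}
_UNIT_SECONDS = {"s": 1, "m": 60}


def parse_duration(text: str) -> int:
    """Convert strings such as '3m' or '30s' to whole seconds."""
    match = re.fullmatch(r"(\d+)([sm])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def resolve_timeouts(overrides=None) -> dict:
    """Merge case-level overrides over the defaults and return seconds."""
    merged = {**DEFAULT_TIMEOUTS, **(overrides or {})}
    return {name: parse_duration(value) for name, value in merged.items()}


defaults = resolve_timeouts()
overridden = resolve_timeouts({"per_agent": "90s"})
```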

## Default Failure Conditions

Treat the test as failed if any of the following happens:

- any required agent does not reach a final state before timeout
- any required inbox command returns a non-success result unless the case expects that failure
- the final `show` output does not match the expected thread state
- the expected message sequence or key message bodies do not appear
- the agents fall back to ordinary chat for critical coordination instead of inbox messages
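
The conditions above can be read as a checklist. The sketch below assumes the runner has already reduced its observations to a few flags; `RunObservations` and its field names are hypothetical and exist only to illustrate the evaluation.

```python
from dataclasses import dataclass


@dataclass
class RunObservations:
    """Hypothetical summary of what the runner observed in one run."""

    agents_finished_in_time: bool
    unexpected_command_failures: int  # failures the case file did not expect
    final_show_matches: bool
    message_sequence_matches: bool
    used_chat_for_coordination: bool


def failure_reasons(obs: RunObservations) -> list[str]:
    """Return every default failure condition that the run triggered."""
    reasons = []
    if not obs.agents_finished_in_time:
        reasons.append("a required agent did not reach a final state before timeout")
    if obs.unexpected_command_failures > 0:
        reasons.append("a required inbox command returned an unexpected non-success result")
    if not obs.final_show_matches:
        reasons.append("final show output does not match the expected thread state")
    if not obs.message_sequence_matches:
        reasons.append("expected message sequence or key message bodies are missing")
    if obs.used_chat_for_coordination:
        reasons.append("agents used ordinary chat for critical coordination")
    return reasons


clean_run = RunObservations(True, 0, True, True, False)
stalled_run = RunObservations(False, 0, True, True, False)
```

An empty list means none of the default failure conditions applied; a case file may still add its own.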

## Evidence Capture

Collect at least the following artifacts for every run:

- agent final summaries
- final `show --thread THREAD_ID --json` output
- at least one independent listing or lookup command such as `list` or `fetch`
- the temporary DB path and resolved thread id

## Cleanup Policy

Use these defaults unless a case file explicitly overrides them:

- keep the temporary DB and working directory on failure for debugging
- clean up the temporary DB and working directory on success only if the caller does not need replay artifacts
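
One way to express this policy in code, as a sketch only. It assumes the temporary DB lives inside the run's working directory, and the `keep_replay_artifacts` flag is a hypothetical way for the caller to say it still needs the artifacts.

```python
import shutil
import tempfile
from pathlib import Path


def apply_cleanup_policy(run_dir: Path, result: str, keep_replay_artifacts: bool = False) -> bool:
    """Remove run_dir only for a passing run whose artifacts are not needed.

    Returns True when the directory was removed.
    """
    if result != "pass" or keep_replay_artifacts:
        # Failures are always kept for debugging; passes are kept on request.
        return False
    shutil.rmtree(run_dir, ignore_errors=True)
    return True


# Usage: a failed run keeps its directory, a passing run removes it.
failed_dir = Path(tempfile.mkdtemp(prefix="inbox-skill-test-"))
passed_dir = Path(tempfile.mkdtemp(prefix="inbox-skill-test-"))
failed_removed = apply_cleanup_policy(failed_dir, "fail")
passed_removed = apply_cleanup_policy(passed_dir, "pass")
```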

## Per-Case Template

Each case file should use this structure:

- `Test Type`
- `Purpose`
- `Preconditions`
- `Agent Topology`
- `Inputs`
- `Execution Parameters`
- `Execution Steps`
- `Validation Commands`
- `Expected Outcomes`
- `Assertions`
- `Cleanup`
- `Recorded Example Run` when a real run has already been captured
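
A case file could be checked against this template with a short script. The sketch below assumes each template entry appears as a Markdown heading in the case file, which this README does not state explicitly; `Recorded Example Run` is treated as optional.

```python
REQUIRED_SECTIONS = [
    "Test Type",
    "Purpose",
    "Preconditions",
    "Agent Topology",
    "Inputs",
    "Execution Parameters",
    "Execution Steps",
    "Validation Commands",
    "Expected Outcomes",
    "Assertions",
    "Cleanup",
]


def heading_titles(markdown_text: str) -> set[str]:
    """Collect the text of every line that starts with '#'."""
    titles = set()
    for line in markdown_text.splitlines():
        if line.startswith("#"):
            # Drop the leading '#' markers, whitespace, and optional backticks.
            titles.add(line.lstrip("#").strip().strip("`"))
    return titles


def missing_sections(markdown_text: str) -> list[str]:
    """Return required template sections that have no matching heading."""
    found = heading_titles(markdown_text)
    return [name for name in REQUIRED_SECTIONS if name not in found]


# Build a sample case file that omits the Assertions section.
sample = "\n\n".join(f"## {name}" for name in REQUIRED_SECTIONS if name != "Assertions")
```

This is a line-based check, so a `#` comment inside a fenced code block would also be counted as a heading.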

## Case Files

| Case Slug | File | Coverage Note |
| --- | --- | --- |
| `multi-agent-roundtrip-through-bundled-cli` | [multi-agent-roundtrip-through-bundled-cli.md](./multi-agent-roundtrip-through-bundled-cli.md) | validates that two agents can use the bundled inbox skill to complete a blocked question and done result roundtrip |
| `parallel-workers-claim-conflict-through-bundled-cli` | [parallel-workers-claim-conflict-through-bundled-cli.md](./parallel-workers-claim-conflict-through-bundled-cli.md) | validates that two workers using the skill observe a real `lease_conflict` on the same thread |
| `blocked-worker-timeout-without-reply-through-bundled-cli` | [blocked-worker-timeout-without-reply-through-bundled-cli.md](./blocked-worker-timeout-without-reply-through-bundled-cli.md) | validates that a blocked worker using the skill receives the expected `wait-reply` timeout outcome when no leader reply arrives |
| `leader-cancels-claimed-thread-through-bundled-cli` | [leader-cancels-claimed-thread-through-bundled-cli.md](./leader-cancels-claimed-thread-through-bundled-cli.md) | validates that a leader can cancel an actively claimed thread and that both agents observe the cancelled terminal state |
| `artifact-roundtrip-through-bundled-cli` | [artifact-roundtrip-through-bundled-cli.md](./artifact-roundtrip-through-bundled-cli.md) | validates that bundled CLI usage through the skill preserves body-file and artifact data across task and result messages |

## Scope

In scope:

- explicit `$inbox` skill invocation
- bundled `./assets/inbox` CLI usage
- shared SQLite DB coordination between multiple agents
- end-to-end thread state and message history validation
- negative-path skill scenarios such as lease conflicts and reply timeouts
- skill-guided artifact and body-file roundtrips

Out of scope:

- per-command flag and JSON contract coverage
- store-level race conditions
- implicit skill triggering without `$inbox`

## Relationship To Other Test Docs

- [../inbox/](../inbox/) covers CLI command behavior
- this directory covers skill-guided multi-agent behavior on top of that CLI