107 lines
4.4 KiB
Markdown
107 lines
4.4 KiB
Markdown
# Case: `multi-agent-roundtrip-through-bundled-cli`
|
|
|
|
## Test Type
|
|
|
|
This is a `forward-test` and a multi-agent end-to-end skill validation.
|
|
|
|
The goal is not to validate one CLI subcommand in isolation. The goal is to validate that two real agents can complete a closed-loop coordination flow through the packaged `skills/inbox/` skill and bundled CLI.
|
|
|
|
## Purpose
|
|
|
|
Validate that all of the following can be true at the same time:
|
|
|
|
- both agents can explicitly use `$inbox`
|
|
- both agents coordinate through the bundled `./assets/inbox` against the same SQLite DB
|
|
- the worker follows the protocol `fetch -> claim -> update -> wait-reply -> done`
|
|
- the leader follows the protocol `init -> send -> show/reply -> show`
|
|
- the final inbox thread state and message history match the expected contract
|
|
|
|
## Preconditions
|
|
|
|
- skill path exists: `SKILL_PATH=skills/inbox`
|
|
- bundled CLI executable exists: `SKILL_PATH/assets/inbox`
|
|
- use an empty temporary directory `TMPDIR`
|
|
- test database path is `TMPDIR/coord.db`
|
|
|
|
## Agent Topology
|
|
|
|
- `leader`
|
|
- `worker-a`
|
|
|
|
## Inputs
|
|
|
|
### Leader Prompt
|
|
|
|
```text
|
|
Use $inbox at SKILL_PATH to act as leader on SQLite DB TMPDIR/coord.db. Only coordinate through the bundled inbox CLI from the skill. Workflow: 1) initialize the DB, 2) send exactly one task to worker-a asking them to implement a small logging choice, 3) monitor the thread until worker-a asks one blocked question, 4) answer the blocked question with a clear decision ('use stdout'), 5) wait until worker-a marks the thread done, 6) inspect the final thread with show, then stop. Do not use ordinary chat to coordinate with the other agent.
|
|
```
|
|
|
|
### Worker Prompt
|
|
|
|
```text
|
|
Use $inbox at SKILL_PATH to act as worker-a on SQLite DB TMPDIR/coord.db. Only coordinate through the bundled inbox CLI from the skill. Workflow: 1) wait until there is pending work for worker-a, 2) fetch it, 3) claim it, 4) send an in_progress update, 5) send a blocked update with one precise question asking whether logging should go to stdout or stderr, 6) wait for a reply, 7) finish the task with done using the received decision, 8) stop. Do not use ordinary chat to coordinate with the other agent.
|
|
```
|
|
|
|
## Execution Parameters
|
|
|
|
- use the shared execution contract from [README.md](./README.md)
|
|
- use the shared timeout defaults from [README.md](./README.md)
|
|
- do not override the default cleanup policy
|
|
|
|
## Execution Steps
|
|
|
|
1. Inject the same `skills/inbox/` skill into both real agents
|
|
2. Point both agents at the same database path `TMPDIR/coord.db`
|
|
3. Launch `leader` and `worker-a` in parallel
|
|
4. Wait for both agents to finish
|
|
5. Resolve `THREAD_ID` from the agent outputs or inbox history
|
|
6. Independently run the validation commands from the main thread
|
|
|
|
## Validation Commands
|
|
|
|
```bash
|
|
SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json show --thread THREAD_ID
|
|
SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json list --assigned-to worker-a
|
|
```
|
|
|
|
## Expected Outcomes
|
|
|
|
- `leader` successfully runs `init`
|
|
- `leader` successfully `send`s one new thread to `worker-a`
|
|
- `worker-a` successfully `fetch`es that thread and successfully `claim`s it
|
|
- `worker-a` emits one `progress` message
|
|
- `worker-a` emits one `question` message focused on `stdout` vs `stderr`
|
|
- `leader` successfully emits one `answer` message with the explicit decision `Use stdout.`
|
|
- `worker-a` successfully consumes that answer through `wait-reply`
|
|
- `worker-a` successfully emits `done`
|
|
- `show` returns `thread.status == "done"`
|
|
|
|
## Assertions
|
|
|
|
- `show` contains at least the following message kinds in order:
|
|
- `task`
|
|
- `event` (`thread claimed`)
|
|
- `progress`
|
|
- `question`
|
|
- `answer`
|
|
- `result`
|
|
- `question.body == "Should logging go to stdout or stderr?"`
|
|
- `answer.body == "Use stdout."`
|
|
- the final `result` message explicitly states that logging uses `stdout`
|
|
- `list --assigned-to worker-a` shows the thread and its status is `done`
|
|
- coordination happens primarily through the inbox thread rather than ordinary chat
|
|
|
|
## Cleanup
|
|
|
|
- use the default cleanup policy from [README.md](./README.md)
|
|
- if the run fails, retain `TMPDIR` and `coord.db` for replay and manual inspection
|
|
|
|
## Recorded Example Run
|
|
|
|
This case already has one reference forward-test run:
|
|
|
|
- DB: `/tmp/inbox-skill-fwd.j9kKvp/coord.db`
|
|
- Thread: `thr_48d6f6a77eff4c2e88ce80e8fdc05da3`
|
|
|
|
That run passed. The thread history contained `task -> event -> progress -> question -> answer -> result`, and the final thread state was `done`.
|