Case: `multi-agent-roundtrip-through-bundled-cli`

Test Type

This is a forward test: a multi-agent, end-to-end validation of the skill.

The goal is not to validate one CLI subcommand in isolation, but to validate that two real agents can complete a closed-loop coordination flow through the packaged skills/inbox/ skill and its bundled CLI.

Purpose

Validate that all of the following can be true at the same time:

both agents can explicitly use $inbox
both agents coordinate through the bundled ./assets/inbox against the same SQLite DB
the worker follows the protocol fetch -> claim -> update -> wait-reply -> done
the leader follows the protocol init -> send -> show/reply -> show
the final inbox thread state and message history match the expected contract

Preconditions

skill path exists: SKILL_PATH=skills/inbox
bundled CLI executable exists: SKILL_PATH/assets/inbox
use an empty temporary directory TMPDIR
test database path is TMPDIR/coord.db
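
The preconditions above can be checked mechanically before either agent is launched. The following is a minimal sketch, not part of the skill: the function name `check_preconditions` is hypothetical, and it assumes a POSIX-style executable bit on the bundled CLI.

```python
import os
from pathlib import Path


def check_preconditions(skill_path: Path, tmpdir: Path) -> list[str]:
    """Return a description of every unmet precondition (empty list if all hold)."""
    problems = []
    cli = skill_path / "assets" / "inbox"  # bundled CLI location from this case

    if not skill_path.is_dir():
        problems.append(f"skill path does not exist: {skill_path}")
    if not (cli.is_file() and os.access(cli, os.X_OK)):
        problems.append(f"bundled CLI is missing or not executable: {cli}")
    if not tmpdir.is_dir():
        problems.append(f"temporary directory does not exist: {tmpdir}")
    elif any(tmpdir.iterdir()):
        problems.append(f"temporary directory is not empty: {tmpdir}")
    return problems
```

Returning a list instead of raising on the first failure lets a harness report every unmet precondition in one pass.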

Agent Topology

leader
worker-a

Inputs

Leader Prompt

Use $inbox at SKILL_PATH to act as leader on SQLite DB TMPDIR/coord.db. Only coordinate through the bundled inbox CLI from the skill. Workflow: 1) initialize the DB, 2) send exactly one task to worker-a asking them to implement a small logging choice, 3) monitor the thread until worker-a asks one blocked question, 4) answer the blocked question with a clear decision ('use stdout'), 5) wait until worker-a marks the thread done, 6) inspect the final thread with show, then stop. Do not use ordinary chat to coordinate with the other agent.

Worker Prompt

Use $inbox at SKILL_PATH to act as worker-a on SQLite DB TMPDIR/coord.db. Only coordinate through the bundled inbox CLI from the skill. Workflow: 1) wait until there is pending work for worker-a, 2) fetch it, 3) claim it, 4) send an in_progress update, 5) send a blocked update with one precise question asking whether logging should go to stdout or stderr, 6) wait for a reply, 7) finish the task with done using the received decision, 8) stop. Do not use ordinary chat to coordinate with the other agent.

Execution Parameters

use the shared execution contract from README.md
use the shared timeout defaults from README.md
do not override the default cleanup policy

Execution Steps

Inject the same skills/inbox/ skill into both real agents
Point both agents at the same database path TMPDIR/coord.db
Launch leader and worker-a in parallel
Wait for both agents to finish
Resolve THREAD_ID from the agent outputs or inbox history
Independently run the validation commands from the main thread
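
Resolving THREAD_ID from the agent outputs could be sketched as below. This is an illustration under a loud assumption: the ID pattern (`thr_` followed by 32 lowercase hex digits) is inferred only from the single recorded run in this case, not from a documented format, and `resolve_thread_id` is a hypothetical helper name.

```python
import re

# Assumed ID shape, inferred from one recorded run: "thr_" + 32 lowercase hex digits.
THREAD_ID_RE = re.compile(r"\bthr_[0-9a-f]{32}\b")


def resolve_thread_id(*agent_outputs: str) -> str:
    """Return the single thread ID mentioned across the outputs, or raise."""
    found = {m for text in agent_outputs for m in THREAD_ID_RE.findall(text)}
    if len(found) != 1:
        # Zero IDs or several distinct IDs both mean the run needs manual inspection.
        raise ValueError(f"expected exactly one thread ID, found {sorted(found)}")
    return found.pop()
```

Requiring exactly one distinct ID is a deliberate choice here: the leader is asked to send exactly one task, so a second ID would itself indicate a protocol violation.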

Validation Commands

SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json show --thread THREAD_ID
SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json list --assigned-to worker-a
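
A harness might wrap the two validation commands as follows. This is a sketch, not the skill's own tooling: the helper names are hypothetical, only the flags shown above are used, and it assumes that with `--json` the CLI writes a single JSON document to stdout and exits non-zero on failure.

```python
import json
import subprocess
from pathlib import Path


def build_inbox_command(skill_path, db_path, *args):
    """Build the argv for the bundled CLI using the global flags shown in this case."""
    cli = Path(skill_path) / "assets" / "inbox"
    return [str(cli), "--db", str(db_path), "--json", *args]


def run_inbox_json(skill_path, db_path, *args, timeout=30):
    """Run one CLI command and parse its stdout as JSON (assumed output format)."""
    completed = subprocess.run(
        build_inbox_command(skill_path, db_path, *args),
        capture_output=True,
        text=True,
        check=True,       # raise CalledProcessError on a non-zero exit status
        timeout=timeout,  # hypothetical default; the real timeouts live in README.md
    )
    return json.loads(completed.stdout)
```

For example, `run_inbox_json(SKILL_PATH, db, "show", "--thread", thread_id)` and `run_inbox_json(SKILL_PATH, db, "list", "--assigned-to", "worker-a")` correspond to the two commands above.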

Expected Outcomes

leader successfully runs init
leader successfully sends one new thread to worker-a
worker-a successfully fetches that thread and successfully claims it
worker-a emits one progress message
worker-a emits one question message focused on stdout vs stderr
leader successfully emits one answer message with the explicit decision "Use stdout."
worker-a successfully consumes that answer through wait-reply
worker-a successfully emits done
show returns thread.status == "done"

Assertions

show contains at least the following message kinds in order:
- task
- event (thread claimed)
- progress
- question
- answer
- result
question.body == "Should logging go to stdout or stderr?"
answer.body == "Use stdout."
the final result message explicitly states that logging uses stdout
list --assigned-to worker-a shows the thread and its status is done
coordination happens through the inbox thread, not through ordinary chat
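
The assertions on the show output can be expressed as a small checker. This is a sketch under a stated assumption: the JSON shape (a top-level `thread` object with `status`, plus a `messages` list whose items carry `kind` and `body`) is guessed from the field names used above and may not match the real CLI output. The substring test on the result body is only a weak proxy for "explicitly states that logging uses stdout".

```python
EXPECTED_KINDS = ["task", "event", "progress", "question", "answer", "result"]


def kinds_in_order(messages, expected):
    """True if `expected` is a subsequence of the message kinds (extra messages allowed)."""
    kinds = iter(m.get("kind") for m in messages)
    # `kind in kinds` consumes the iterator, so each match must come after the previous one.
    return all(kind in kinds for kind in expected)


def first_body(messages, kind):
    """Body of the first message of the given kind, or None if there is none."""
    return next((m.get("body") for m in messages if m.get("kind") == kind), None)


def check_show(show):
    """Return a list of failed assertions for one parsed `show` document."""
    failures = []
    messages = show.get("messages", [])  # assumed key name
    if show.get("thread", {}).get("status") != "done":
        failures.append("thread.status is not 'done'")
    if not kinds_in_order(messages, EXPECTED_KINDS):
        failures.append("message kinds are missing or out of order")
    if first_body(messages, "question") != "Should logging go to stdout or stderr?":
        failures.append("question.body does not match")
    if first_body(messages, "answer") != "Use stdout.":
        failures.append("answer.body does not match")
    results = [m.get("body") or "" for m in messages if m.get("kind") == "result"]
    if not results or "stdout" not in results[-1].lower():
        failures.append("final result does not mention stdout")
    return failures
```

The subsequence check mirrors the "at least ... in order" wording: additional messages between the expected kinds do not fail the run.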

Cleanup

use the default cleanup policy from README.md
if the run fails, retain TMPDIR and coord.db for replay and manual inspection

Recorded Example Run

This case already has one reference forward-test run:

DB: /tmp/inbox-skill-fwd.j9kKvp/coord.db
Thread: thr_48d6f6a77eff4c2e88ce80e8fdc05da3

That run passed. The thread history contained task -> event -> progress -> question -> answer -> result, and the final thread state was done.
