kurihada/ai-workflow-skill

kurihada b110bb24d9 docs: add execution roadmap workflow

2026-03-19 12:52:02 +08:00

Inbox Skill Test Plan

Purpose

This directory tracks human-readable test plans for the skills/inbox/ Codex skill bundle.

These documents are not command-contract specs for the inbox CLI itself. That coverage already lives under ../inbox/.

This directory exists to describe a different test surface:

whether an agent can actually use the packaged inbox skill
whether multiple agents can coordinate through the bundled CLI asset
whether a real skill-guided conversation reaches the expected inbox state

Test Model

README.md is the index for this directory
each skill test case lives in its own Markdown file
use stable case slugs in filenames

Shared Execution Contract

Use these defaults unless a case file explicitly overrides them:

run the scenario with real subagents, not simulated transcripts
inject the same skill bundle into every participating agent
launch all role agents in parallel when the scenario depends on agent-to-agent timing
require every agent to coordinate through the bundled CLI and shared SQLite DB instead of ordinary chat
validate the final inbox state independently from the main thread after the agents stop
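
The parallel-launch default above could be sketched as follows. This is a minimal illustration, not the project's harness: `launch_role_agents` and the `run_role_agent` callable are hypothetical names, and how a real subagent is actually started is left to the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def launch_role_agents(roles, run_role_agent, parallel=True):
    """Run one agent per role and return {role: final_summary}.

    run_role_agent is a hypothetical callable supplied by the test runner;
    it starts a role agent and blocks until that agent returns a summary.
    """
    if not roles:
        return {}
    if not parallel:
        # Sequential launch, for cases that prescribe an explicit order.
        return {role: run_role_agent(role) for role in roles}
    # One worker per role so every agent starts without waiting on another.
    with ThreadPoolExecutor(max_workers=len(roles)) as pool:
        futures = {role: pool.submit(run_role_agent, role) for role in roles}
        return {role: future.result() for role, future in futures.items()}
```

Sizing the pool to the number of roles matters for timing-sensitive cases: with fewer workers than roles, a late agent would only start after an earlier one finished.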

How An Agent Runs These Cases

Use one test-runner agent to execute each case.

The test-runner agent is responsible for:

reading this README.md first, then one specific case file
creating an isolated temporary directory and SQLite DB path for that run
launching the role agents described in Agent Topology
injecting the same skills/inbox/ bundle into every role agent
passing each role agent the prompt text from the case file with concrete values substituted for SKILL_PATH, TMPDIR, and THREAD_ID when needed
coordinating launch order or parallel start according to the case file
collecting agent final summaries as evidence
resolving the final THREAD_ID
running the Validation Commands from the main thread after the role agents stop
comparing the observed results against Expected Outcomes and Assertions
returning a final pass/fail judgment with concrete evidence
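
As an illustration of the setup steps in the list above, the sketch below creates an isolated working directory and substitutes concrete values into a case prompt. It assumes the placeholders appear in the prompt as the bare tokens SKILL_PATH, TMPDIR, and THREAD_ID; the function names and the `inbox.db` file name are hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Placeholder names taken from this README; the bare-token syntax is an assumption.
PLACEHOLDERS = ("SKILL_PATH", "TMPDIR", "THREAD_ID")


def prepare_run_dir(case_slug):
    """Create a fresh temporary directory and a DB path inside it."""
    tmpdir = Path(tempfile.mkdtemp(prefix=f"{case_slug}-"))
    # "inbox.db" is a hypothetical file name; only isolation is required.
    return tmpdir, tmpdir / "inbox.db"


def render_prompt(template, values):
    """Replace known placeholder tokens in a single pass.

    Tokens without a supplied value are left untouched, which covers
    THREAD_ID before the thread has been created.
    """
    pattern = re.compile(r"\b(" + "|".join(PLACEHOLDERS) + r")\b")
    return pattern.sub(lambda m: values.get(m.group(1), m.group(1)), template)
```

A single regex pass avoids double substitution when a substituted value happens to contain another placeholder name.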

The role agents are responsible for:

acting only within the role assigned in the case file
using the injected inbox skill rather than ad hoc repository discovery
coordinating through the bundled CLI and shared DB
reporting the concrete thread id, key command outcomes, and final observed state back to the test-runner agent

The test-runner agent should treat a case as passed only when:

all role agents reach a final state without violating the case contract
the independent validation commands succeed
the final inbox state matches the assertions in the case file

The test-runner agent should treat a case as failed when:

any role agent times out or stalls
a required inbox action is skipped
a role agent falls back to ordinary chat for critical coordination
the final inbox state conflicts with the documented assertions
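
The pass and fail rules above can be folded into one decision function. The sketch below is one possible encoding; `RunObservations` and its field names are hypothetical and only mirror the conditions listed in this section.

```python
from dataclasses import dataclass, field


@dataclass
class RunObservations:
    """What the test runner observed; all field names are hypothetical."""
    validation_ok: bool              # independent validation commands succeeded
    assertions_ok: bool              # final inbox state matches the case assertions
    stalled_agents: list = field(default_factory=list)   # timed out or stalled
    skipped_actions: list = field(default_factory=list)  # required inbox actions skipped
    chat_fallback: bool = False      # critical coordination happened in ordinary chat


def judge_case(obs):
    """Return ("pass" | "fail", reasons) according to the rules above."""
    reasons = []
    if obs.stalled_agents:
        reasons.append(f"agents timed out or stalled: {obs.stalled_agents}")
    if obs.skipped_actions:
        reasons.append(f"required inbox actions skipped: {obs.skipped_actions}")
    if obs.chat_fallback:
        reasons.append("critical coordination fell back to ordinary chat")
    if not obs.validation_ok:
        reasons.append("independent validation commands did not succeed")
    if not obs.assertions_ok:
        reasons.append("final inbox state conflicts with the assertions")
    return ("fail" if reasons else "pass", reasons)
```

Collecting every reason instead of returning on the first failure gives the final report concrete evidence for each violated rule.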

The test-runner agent should report results in this shape:

case
db_path
thread_id
result: pass or fail
agent_summaries
validation_evidence
assertion_checklist
notes
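
One way to keep reports in this shape consistent is a small record type. The field names below come from the list above; the field types, the `CaseReport` name, and the JSON serialization are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CaseReport:
    case: str
    db_path: str
    thread_id: str
    result: str  # "pass" or "fail"
    agent_summaries: dict = field(default_factory=dict)      # role -> final summary
    validation_evidence: dict = field(default_factory=dict)  # command -> captured output
    assertion_checklist: dict = field(default_factory=dict)  # assertion -> True/False
    notes: str = ""

    def __post_init__(self):
        # Reject anything other than the two documented result values.
        if self.result not in ("pass", "fail"):
            raise ValueError(f"result must be 'pass' or 'fail', got {self.result!r}")

    def to_json(self):
        """Serialize with fields in the documented order."""
        return json.dumps(asdict(self), indent=2)
```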

Default Timeouts

Use these defaults unless a case file explicitly overrides them:

per-agent timeout: 3m
overall scenario timeout: 5m
async wait margin for the main thread: 30s
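
These defaults and their per-case overrides might be represented as below. The key names are hypothetical, and adding the async wait margin to the scenario timeout to get the main thread's wait budget is an assumption about how the margin is meant to be used.

```python
from datetime import timedelta

# Default values from this README; the key names are hypothetical.
DEFAULT_TIMEOUTS = {
    "per_agent": timedelta(minutes=3),
    "scenario": timedelta(minutes=5),
    "async_wait_margin": timedelta(seconds=30),
}


def resolve_timeouts(overrides=None):
    """Merge case-file overrides over the defaults, rejecting unknown keys."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULT_TIMEOUTS)
    if unknown:
        raise KeyError(f"unknown timeout keys: {sorted(unknown)}")
    return {**DEFAULT_TIMEOUTS, **overrides}


def main_thread_wait_budget(timeouts):
    # Assumption: the main thread waits for the whole scenario plus the margin.
    return timeouts["scenario"] + timeouts["async_wait_margin"]
```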

Default Failure Conditions

Treat the test as failed if any of the following happens:

any required agent does not reach a final state before timeout
any required inbox command returns a non-success result unless the case expects that failure
the final show output does not match the expected thread state
the expected message sequence or key message bodies do not appear
the agents fall back to ordinary chat for critical coordination instead of inbox messages

Evidence Capture

Collect at least the following artifacts for every run:

agent final summaries
final show --thread THREAD_ID --json output
at least one independent listing or lookup command such as list or fetch
the temporary DB path and resolved thread id
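
A test runner could check that nothing from this list is missing before it reports a result. The sketch below is illustrative only: the bundle keys are hypothetical, and it assumes the `show --thread THREAD_ID --json` output was captured as a string that parses as JSON, without assuming anything about its schema.

```python
import json

# Hypothetical keys, one per artifact listed above.
REQUIRED_EVIDENCE = (
    "agent_summaries",
    "show_json",            # captured output of show --thread THREAD_ID --json
    "independent_lookup",   # captured output of a command such as list or fetch
    "db_path",
    "thread_id",
)


def missing_evidence(bundle):
    """Return the required evidence keys that are absent, empty, or unusable."""
    missing = []
    for key in REQUIRED_EVIDENCE:
        value = bundle.get(key)
        if not value:
            missing.append(key)
        elif key == "show_json":
            try:
                json.loads(value)
            except (TypeError, ValueError):
                # Output that is not valid JSON cannot serve as evidence.
                missing.append(key)
    return missing
```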

Cleanup Policy

Use these defaults unless a case file explicitly overrides them:

keep the temporary DB and working directory on failure for debugging
clean up the temporary DB and working directory on success only if the caller does not need replay artifacts
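
The cleanup defaults reduce to a single condition. A minimal sketch, assuming the temporary DB lives inside the working directory; `apply_cleanup_policy` and its parameters are hypothetical names.

```python
import shutil
from pathlib import Path


def apply_cleanup_policy(workdir, passed, keep_replay_artifacts):
    """Delete the working directory only for a passing run with no replay need.

    Returns True when the directory was removed, False when it was kept.
    """
    if passed and not keep_replay_artifacts:
        # Assumption: the temporary DB is stored under workdir.
        shutil.rmtree(Path(workdir), ignore_errors=True)
        return True
    # Failed runs, and runs whose artifacts are wanted, stay on disk.
    return False
```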

Per-Case Template

Each case file should use this structure:

Test Type
Purpose
Preconditions
Agent Topology
Inputs
Execution Parameters
Execution Steps
Validation Commands
Expected Outcomes
Assertions
Cleanup
Recorded Example Run (only when a real run has already been captured)
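
A case file can be checked against this template mechanically. The sketch below assumes each section name sits on its own line, optionally prefixed with Markdown `#` markers; that heading format is an assumption, and `missing_sections` is a hypothetical helper.

```python
# Required section names from the template above. "Recorded Example Run"
# is optional because it only applies once a real run has been captured.
REQUIRED_SECTIONS = (
    "Test Type",
    "Purpose",
    "Preconditions",
    "Agent Topology",
    "Inputs",
    "Execution Parameters",
    "Execution Steps",
    "Validation Commands",
    "Expected Outcomes",
    "Assertions",
    "Cleanup",
)


def missing_sections(case_text):
    """Return required section names that do not appear as a heading line."""
    headings = {
        line.lstrip("#").strip()
        for line in case_text.splitlines()
        if line.strip()
    }
    return [name for name in REQUIRED_SECTIONS if name not in headings]
```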

Case Files

| Case Slug | File | Coverage Note |
| --- | --- | --- |
| `multi-agent-roundtrip-through-bundled-cli` | multi-agent-roundtrip-through-bundled-cli.md | validates that two agents can use the bundled inbox skill to complete a blocked question and done result roundtrip |
| `parallel-workers-claim-conflict-through-bundled-cli` | parallel-workers-claim-conflict-through-bundled-cli.md | validates that two workers using the skill observe a real `lease_conflict` on the same thread |
| `blocked-worker-timeout-without-reply-through-bundled-cli` | blocked-worker-timeout-without-reply-through-bundled-cli.md | validates that a blocked worker using the skill receives the expected `wait-reply` timeout outcome when no leader reply arrives |
| `leader-cancels-claimed-thread-through-bundled-cli` | leader-cancels-claimed-thread-through-bundled-cli.md | validates that a leader can cancel an actively claimed thread and that both agents observe the cancelled terminal state |
| `artifact-roundtrip-through-bundled-cli` | artifact-roundtrip-through-bundled-cli.md | validates that bundled CLI usage through the skill preserves body-file and artifact data across task and result messages |

Scope

In scope:

explicit $inbox skill invocation
bundled ./assets/inbox CLI usage
shared SQLite DB coordination between multiple agents
end-to-end thread state and message history validation
negative-path skill scenarios such as lease conflicts and reply timeouts
skill-guided artifact and body-file roundtrips

Out of scope:

per-command flag and JSON contract coverage
store-level race conditions
implicit skill triggering without $inbox

Relationship To Other Test Docs

../inbox/ covers CLI command behavior
this directory covers skill-guided multi-agent behavior on top of that CLI
