Inbox Skill Test Plan

Purpose

This directory tracks human-readable test plans for the skills/inbox/ Codex skill bundle.

These documents are not command-contract specs for the inbox CLI itself. That coverage already lives under ../inbox/.

This directory exists to describe a different test surface:

  • whether an agent can actually use the packaged inbox skill
  • whether multiple agents can coordinate through the bundled CLI asset
  • whether a real skill-guided conversation reaches the expected inbox state

Test Model

  • README.md is the index for this directory
  • each skill test case lives in its own Markdown file
  • use stable case slugs in filenames

Shared Execution Contract

Use these defaults unless a case file explicitly overrides them:

  • run the scenario with real subagents, not simulated transcripts
  • inject the same skill bundle into every participating agent
  • launch all role agents in parallel when the scenario depends on agent-to-agent timing
  • require every agent to coordinate through the bundled CLI and shared SQLite DB instead of ordinary chat
  • validate the final inbox state independently from the main thread after the agents stop

How An Agent Runs These Cases

Use one test-runner agent to execute each case.

The test-runner agent is responsible for:

  • reading this README.md first, then one specific case file
  • creating an isolated temporary directory and SQLite DB path for that run
  • launching the role agents described in Agent Topology
  • injecting the same skills/inbox/ bundle into every role agent
  • passing each role agent the prompt text from the case file with concrete values substituted for SKILL_PATH, TMPDIR, and THREAD_ID when needed
  • coordinating launch order or parallel start according to the case file
  • collecting agent final summaries as evidence
  • resolving the final THREAD_ID
  • running the Validation Commands from the main thread after the role agents stop
  • comparing the observed results against Expected Outcomes and Assertions
  • returning a final pass/fail judgment with concrete evidence
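
The setup steps above (an isolated workspace plus placeholder substitution) could be sketched roughly as follows. This is a minimal illustration, not part of the skill bundle: the helper names `prepare_run` and `render_prompt` and the DB file name `inbox.db` are hypothetical, and it assumes the placeholders appear in the prompt text as the bare tokens SKILL_PATH, TMPDIR, and THREAD_ID.

```python
import re
import tempfile
from pathlib import Path

PLACEHOLDERS = ("SKILL_PATH", "TMPDIR", "THREAD_ID")


def prepare_run(case_slug: str) -> dict:
    """Create an isolated temporary directory and a DB path inside it."""
    workdir = Path(tempfile.mkdtemp(prefix=f"inbox-skill-{case_slug}-"))
    # "inbox.db" is a hypothetical file name chosen for this sketch.
    return {"tmpdir": workdir, "db_path": workdir / "inbox.db"}


def render_prompt(template: str, values: dict) -> str:
    """Substitute known placeholders in one pass; leave unknown ones as-is."""
    pattern = re.compile(r"\b(" + "|".join(PLACEHOLDERS) + r")\b")

    def replace(match: re.Match) -> str:
        name = match.group(1)
        # THREAD_ID is often unresolved at launch, so keep the token if missing.
        return str(values[name]) if name in values else name

    return pattern.sub(replace, template)
```

Substituting in a single pass avoids re-scanning text that was already inserted, so a path that happens to contain the string TMPDIR is not rewritten a second time.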

The role agents are responsible for:

  • acting only within the role assigned in the case file
  • using the injected inbox skill rather than ad hoc repository discovery
  • coordinating through the bundled CLI and shared DB
  • reporting the concrete thread id, key command outcomes, and final observed state back to the test-runner agent

The test-runner agent should treat a case as passed only when:

  • all role agents reach a final state without violating the case contract
  • the independent validation commands succeed
  • the final inbox state matches the assertions in the case file

The test-runner agent should treat a case as failed when:

  • any role agent times out or stalls
  • a required inbox action is skipped
  • a role agent falls back to ordinary chat for critical coordination
  • the final inbox state conflicts with the documented assertions
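
The pass and fail conditions above reduce to a single conjunction, which a test-runner agent might encode along these lines. The `RunObservations` type and its field names are hypothetical; how each flag is actually derived depends on the case file.

```python
from dataclasses import dataclass


@dataclass
class RunObservations:
    """Hypothetical summary of what the test-runner observed in one run."""
    all_agents_reached_final_state: bool  # no timeout or stall
    required_actions_completed: bool      # no required inbox action skipped
    coordinated_through_inbox: bool       # no fallback to ordinary chat
    validation_commands_succeeded: bool
    assertions_matched: bool              # final inbox state matches the case file


def judge(obs: RunObservations) -> str:
    """Return "pass" only when every condition holds, otherwise "fail"."""
    passed = (
        obs.all_agents_reached_final_state
        and obs.required_actions_completed
        and obs.coordinated_through_inbox
        and obs.validation_commands_succeeded
        and obs.assertions_matched
    )
    return "pass" if passed else "fail"
```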

The test-runner agent should report results in this shape:

  • case
  • db_path
  • thread_id
  • result: pass or fail
  • agent_summaries
  • validation_evidence
  • assertion_checklist
  • notes
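
One way to keep reports in this shape consistent is a small record type. The sketch below uses the field names from the list above; the field types and the JSON serialization are assumptions, since this README does not prescribe a report format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CaseReport:
    case: str
    db_path: str
    thread_id: str
    result: str  # "pass" or "fail"
    agent_summaries: dict = field(default_factory=dict)      # assumed: role -> summary text
    validation_evidence: list = field(default_factory=list)  # assumed: captured command outputs
    assertion_checklist: dict = field(default_factory=dict)  # assumed: assertion -> bool
    notes: str = ""

    def __post_init__(self) -> None:
        if self.result not in ("pass", "fail"):
            raise ValueError(f"result must be 'pass' or 'fail', got {self.result!r}")

    def to_json(self) -> str:
        """Serialize the report with fields in the documented order."""
        return json.dumps(asdict(self), indent=2)
```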

Default Timeouts

Use these defaults unless a case file explicitly overrides them:

  • per-agent timeout: 3m
  • overall scenario timeout: 5m
  • async wait margin for the main thread: 30s
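
A runner that lets case files override these defaults might merge and normalize them as sketched below. The key names and the `3m` / `30s` string format for overrides are assumptions made for illustration.

```python
DEFAULT_TIMEOUTS = {
    "per_agent": "3m",
    "scenario": "5m",
    "async_wait_margin": "30s",
}

_UNIT_SECONDS = {"s": 1, "m": 60}


def to_seconds(text: str) -> int:
    """Convert a duration such as "3m" or "30s" to whole seconds."""
    value, unit = text[:-1], text[-1:]
    if unit not in _UNIT_SECONDS or not value.isdigit():
        raise ValueError(f"unsupported duration: {text!r}")
    return int(value) * _UNIT_SECONDS[unit]


def effective_timeouts(overrides=None) -> dict:
    """Apply case-file overrides on top of the defaults, in seconds."""
    merged = {**DEFAULT_TIMEOUTS, **(overrides or {})}
    return {name: to_seconds(value) for name, value in merged.items()}
```

For example, `effective_timeouts({"per_agent": "90s"})` keeps the default scenario timeout and margin while shortening the per-agent limit.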

Default Failure Conditions

Treat the test as failed if any of the following happens:

  • any required agent does not reach a final state before timeout
  • any required inbox command returns a non-success result unless the case expects that failure
  • the final show output does not match the expected thread state
  • the expected message sequence or key message bodies do not appear
  • the agents fall back to ordinary chat for critical coordination instead of inbox messages

Evidence Capture

Collect at least the following artifacts for every run:

  • agent final summaries
  • final show --thread THREAD_ID --json output
  • at least one independent listing or lookup command such as list or fetch
  • the temporary DB path and resolved thread id
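
Capturing the `show --thread THREAD_ID --json` artifact from the main thread could look roughly like this. Only that subcommand and those two flags come from this README. How the CLI is pointed at the temporary DB is not specified here, so the sketch leaves it to a caller-supplied `extra_args` (a hypothetical parameter), and it stores the JSON output without assuming a schema.

```python
import json
import subprocess
from typing import Callable, Sequence


def capture_show(
    cli: str,
    thread_id: str,
    extra_args: Sequence[str] = (),
    run: Callable = subprocess.run,
) -> dict:
    """Run the bundled CLI's show command and keep the raw and parsed output."""
    argv = [cli, "show", "--thread", thread_id, "--json", *extra_args]
    completed = run(argv, capture_output=True, text=True)
    parsed = None
    if completed.returncode == 0:
        try:
            parsed = json.loads(completed.stdout)
        except json.JSONDecodeError:
            parsed = None  # keep the raw stdout as evidence instead
    return {
        "argv": argv,
        "returncode": completed.returncode,
        "stdout": completed.stdout,
        "parsed": parsed,
    }
```

The `run` parameter exists only so the helper can be exercised without the real CLI; in normal use it defaults to `subprocess.run`.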

Cleanup Policy

Use these defaults unless a case file explicitly overrides them:

  • keep the temporary DB and working directory on failure for debugging
  • clean up the temporary DB and working directory on success only if the caller does not need replay artifacts
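
The default policy amounts to one conditional. A minimal sketch, assuming the temporary DB lives inside the working directory and using a hypothetical `keep_replay_artifacts` flag for the caller's preference:

```python
import shutil
from pathlib import Path


def apply_cleanup(workdir: Path, passed: bool, keep_replay_artifacts: bool = False) -> bool:
    """Remove the run directory only on success when no replay is needed.

    Returns True if the directory was removed, False if it was kept.
    """
    if passed and not keep_replay_artifacts:
        shutil.rmtree(workdir)
        return True
    # On failure (or when replay artifacts are wanted) keep everything for debugging.
    return False
```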

Per-Case Template

Each case file should use this structure:

  • Test Type
  • Purpose
  • Preconditions
  • Agent Topology
  • Inputs
  • Execution Parameters
  • Execution Steps
  • Validation Commands
  • Expected Outcomes
  • Assertions
  • Cleanup
  • Recorded Example Run (only when a real run has already been captured)
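
A quick structural check for a case file could be written as below. It assumes each section appears as a Markdown heading (a line starting with one or more `#`) whose text matches the name in the list exactly; `missing_sections` is a hypothetical helper, and Recorded Example Run is treated as optional.

```python
import re

REQUIRED_SECTIONS = [
    "Test Type",
    "Purpose",
    "Preconditions",
    "Agent Topology",
    "Inputs",
    "Execution Parameters",
    "Execution Steps",
    "Validation Commands",
    "Expected Outcomes",
    "Assertions",
    "Cleanup",
]

_HEADING = re.compile(r"^#+[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(markdown_text: str) -> list:
    """Return required section names that have no matching heading."""
    headings = {match.group(1) for match in _HEADING.finditer(markdown_text)}
    return [name for name in REQUIRED_SECTIONS if name not in headings]
```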

Case Files

| Case Slug | File | Coverage Note |
| --- | --- | --- |
| multi-agent-roundtrip-through-bundled-cli | multi-agent-roundtrip-through-bundled-cli.md | validates that two agents can use the bundled inbox skill to complete a blocked question and done result roundtrip |
| parallel-workers-claim-conflict-through-bundled-cli | parallel-workers-claim-conflict-through-bundled-cli.md | validates that two workers using the skill observe a real lease_conflict on the same thread |
| blocked-worker-timeout-without-reply-through-bundled-cli | blocked-worker-timeout-without-reply-through-bundled-cli.md | validates that a blocked worker using the skill receives the expected wait-reply timeout outcome when no leader reply arrives |
| leader-cancels-claimed-thread-through-bundled-cli | leader-cancels-claimed-thread-through-bundled-cli.md | validates that a leader can cancel an actively claimed thread and that both agents observe the cancelled terminal state |
| artifact-roundtrip-through-bundled-cli | artifact-roundtrip-through-bundled-cli.md | validates that bundled CLI usage through the skill preserves body-file and artifact data across task and result messages |

Scope

In scope:

  • explicit $inbox skill invocation
  • bundled ./assets/inbox CLI usage
  • shared SQLite DB coordination between multiple agents
  • end-to-end thread state and message history validation
  • negative-path skill scenarios such as lease conflicts and reply timeouts
  • skill-guided artifact and body-file roundtrips

Out of scope:

  • per-command flag and JSON contract coverage
  • store-level race conditions
  • implicit skill triggering without $inbox

Relationship To Other Test Docs

  • ../inbox/ covers CLI command behavior
  • this directory covers skill-guided multi-agent behavior on top of that CLI