Orch Skill Test Plan
Purpose
This directory tracks human-readable test plans for the skills/orch/ Codex skill bundle.
These documents are not command-contract specs for the orch CLI itself.
That coverage already lives under ../orch/.
This directory exists to describe a different test surface:
- whether a leader agent can actually use the packaged `orch` skill
- whether the bundled `./assets/orch` CLI works inside real skill-guided conversations
- whether leader-side orchestration driven by the skill reaches the expected run, task, thread, and worktree state
Test Model
- `README.md` is the index for this directory
- each skill test case lives in its own Markdown file
- use stable case slugs in filenames
Shared Execution Contract
Use these defaults unless a case file explicitly overrides them:
- run the scenario with real subagents, not simulated transcripts
- inject `skills/orch/` into the leader agent
- inject `skills/inbox/` into worker agents whenever worker-side thread progress is required
- initialize the shared SQLite DB before launching role agents with `INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json init`
- require the leader to coordinate through the bundled `./assets/orch` CLI from the skill instead of ordinary chat
- require workers to coordinate through the bundled `./assets/inbox` CLI from their skill instead of ordinary chat
- validate final run and thread state independently from the main thread after the agents stop
- create any required Git repo fixture before launching agents for worktree cases
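The DB initialization step can be sketched as a small helper that builds the documented command line. This is a minimal sketch, assuming POSIX paths; the function name `build_init_command` is hypothetical, and only the flags shown in the contract above are used.

```python
from pathlib import Path


def build_init_command(inbox_skill_path: str, tmpdir: str) -> list[str]:
    """Return the argv for the documented DB init step.

    Mirrors: INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json init
    """
    inbox_cli = Path(inbox_skill_path) / "assets" / "inbox"
    db_path = Path(tmpdir) / "coord.db"
    return [str(inbox_cli), "--db", str(db_path), "--json", "init"]


# Example with illustrative values; nothing is executed here.
print(build_init_command("skills/inbox", "/tmp/orch-skill-case"))
```

A test runner could pass the returned list to `subprocess.run(..., check=True)` once, before any role agent starts.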
How An Agent Runs These Cases
Use one test-runner agent to execute each case.
The test-runner agent is responsible for:
- reading this `README.md` first, then one specific case file
- creating an isolated temporary directory and DB path for that run
- initializing the DB once through the bundled inbox CLI before launching role agents
- creating any required temporary Git repo fixture before launching role agents
- launching the role agents described in `Agent Topology`
- injecting `skills/orch/` into the leader and `skills/inbox/` into workers
- passing each role agent the prompt text from the case file with concrete values substituted for `ORCH_SKILL_PATH`, `INBOX_SKILL_PATH`, `TMPDIR`, `RUN_ID`, `THREAD_ID`, and `WORKTREE_PATH` when needed
- coordinating launch order or parallel start according to the case file
- collecting agent final summaries as evidence
- resolving final run ids, thread ids, and worktree paths from agent outputs
- running the `Validation Commands` from the main thread after the role agents stop
- comparing the observed results against `Expected Outcomes` and `Assertions`
- returning a final pass/fail judgment with concrete evidence
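The placeholder substitution step above might look like the following. This is a sketch under assumptions: the helper name `substitute_placeholders` is hypothetical, placeholders are assumed to appear as bare uppercase tokens in the prompt text, and any placeholder without a supplied value is left untouched.

```python
import re

# The six placeholder names listed in this README.
PLACEHOLDERS = (
    "ORCH_SKILL_PATH",
    "INBOX_SKILL_PATH",
    "TMPDIR",
    "RUN_ID",
    "THREAD_ID",
    "WORKTREE_PATH",
)

# Word boundaries keep RUN_ID from matching inside a longer identifier.
_PATTERN = re.compile(r"\b(" + "|".join(PLACEHOLDERS) + r")\b")


def substitute_placeholders(prompt: str, values: dict[str, str]) -> str:
    """Replace known placeholders in prompt with concrete values."""
    unknown = set(values) - set(PLACEHOLDERS)
    if unknown:
        raise KeyError(f"unknown placeholders: {sorted(unknown)}")
    # A function replacement avoids backslash escapes being interpreted.
    return _PATTERN.sub(lambda m: values.get(m.group(1), m.group(1)), prompt)
```

Values such as `RUN_ID` are usually only known after the leader has created a run, so a runner may call this more than once as ids are resolved.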
The role agents are responsible for:
- acting only within the role assigned in the case file
- using the injected skill bundle rather than ad hoc repository discovery
- coordinating through the bundled CLI and shared DB
- reporting concrete run ids, thread ids, worktree paths, and key command outcomes back to the test-runner agent
The test-runner agent should treat a case as passed only when:
- all role agents reach a final state without violating the case contract
- the independent validation commands succeed
- the final orch and inbox state matches the assertions in the case file
The test-runner agent should treat a case as failed when:
- any required agent times out or stalls
- a required orch or inbox action is skipped
- the leader falls back to ordinary chat for orchestration decisions that should go through `orch`
- workers fall back to ordinary chat for progress that should go through `inbox`
- the final run, task, thread, or worktree state conflicts with the documented assertions
The test-runner agent should report results in this shape:
- `case`
- `db_path`
- `run_id`
- `thread_ids`
- `worktree_paths`
- `result`: `pass` or `fail`
- `agent_summaries`
- `validation_evidence`
- `assertion_checklist`
- `notes`
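One way to hold this report, together with the pass/fail rules above, is a small dataclass. This is an illustrative sketch: the field names come from the list above, but the field types, the class name `CaseReport`, and the helper `judge` are assumptions, as is treating an empty assertion checklist as a failure.

```python
from dataclasses import dataclass, field


@dataclass
class CaseReport:
    case: str
    db_path: str
    run_id: str
    thread_ids: list[str]
    worktree_paths: list[str]
    result: str  # "pass" or "fail"
    agent_summaries: dict[str, str] = field(default_factory=dict)
    validation_evidence: dict[str, str] = field(default_factory=dict)
    assertion_checklist: dict[str, bool] = field(default_factory=dict)
    notes: str = ""

    def __post_init__(self) -> None:
        if self.result not in ("pass", "fail"):
            raise ValueError(f"result must be 'pass' or 'fail', got {self.result!r}")


def judge(agents_ok: bool, validation_ok: bool, checklist: dict[str, bool]) -> str:
    """Return "pass" only when every pass condition holds.

    agents_ok: all role agents reached a final state within the case contract.
    validation_ok: the independent validation commands succeeded.
    checklist: one entry per assertion in the case file (assumed non-empty).
    """
    passed = agents_ok and validation_ok and bool(checklist) and all(checklist.values())
    return "pass" if passed else "fail"
```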
Default Timeouts
Use these defaults unless a case file explicitly overrides them:
- per-agent timeout: `4m`
- overall scenario timeout: `6m`
- async wait margin for the main thread: `45s`
Default Failure Conditions
Treat the test as failed if any of the following happens:
- any required agent does not reach a final state before timeout
- any required orch or inbox command returns a non-success result unless the case expects that failure
- the final `orch status` output does not match the expected run or task state
- the final `inbox show` output does not match the expected thread or message history
- a required worktree is missing too early or still present after cleanup in a cleanup case
- the agents fall back to ordinary chat for critical coordination instead of the bundled CLIs
Evidence Capture
Collect at least the following artifacts for every run:
- agent final summaries
- final `orch status --run RUN_ID --json` output
- final `inbox show --thread THREAD_ID --json` output for every relevant thread
- any `blocked`, `wait`, `retry`, `reassign`, or `cleanup` output relevant to the case
- the temporary DB path, resolved run id, resolved thread ids, and any worktree paths
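The two final-state commands above can be assembled per run as follows. This is a sketch only: the helper name and the evidence file names are hypothetical, and the commands are built exactly as written above, so any DB-selection flags a real invocation needs are not shown here.

```python
def evidence_commands(
    orch_cli: str, inbox_cli: str, run_id: str, thread_ids: list[str]
) -> dict[str, list[str]]:
    """Map an evidence file name to the argv that produces its content."""
    commands = {
        "orch-status.json": [orch_cli, "status", "--run", run_id, "--json"],
    }
    for thread_id in thread_ids:
        # One inbox show capture per relevant thread.
        commands[f"inbox-show-{thread_id}.json"] = [
            inbox_cli, "show", "--thread", thread_id, "--json",
        ]
    return commands
```

Each argv can then be run with `subprocess.run(argv, capture_output=True, text=True)` and its `stdout` written next to the agent summaries.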
Cleanup Policy
Use these defaults unless a case file explicitly overrides them:
- keep the temporary DB, repo fixture, and working directory on failure for debugging
- clean up the temporary working directory on success only if the caller does not need replay artifacts
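The default policy reduces to a single decision, sketched below. The function name and the `keep_replay_artifacts` parameter are hypothetical; a case file that overrides the policy would bypass this helper.

```python
import shutil
from pathlib import Path


def apply_cleanup_policy(workdir: Path, passed: bool, keep_replay_artifacts: bool) -> bool:
    """Remove workdir only on success when no replay artifacts are needed.

    Returns True if the directory was removed, False if it was kept.
    """
    if not passed or keep_replay_artifacts:
        # Failures always keep the DB, repo fixture, and working directory.
        return False
    shutil.rmtree(workdir, ignore_errors=True)
    return True
```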
Direct CLI Replay
The repository also includes a reusable direct replay runner at scripts/run_orch_skill_forward_tests.sh.
This runner executes the bundled skills/orch/assets/orch and skills/inbox/assets/inbox binaries against temporary SQLite DBs and Git fixtures without spawning Codex role agents.
Use it to validate packaged CLI behavior and record concrete evidence quickly, but do not treat it as a full replacement for the real subagent-forward model described above.
The case files in this directory now include recorded example runs captured through that direct replay path on 2026-03-19.
Real Subagent Forward Runs
The five cases in this directory were also executed with real spawned role agents on 2026-03-19.
That run used injected project-local skills/orch/ and skills/inbox/ bundles with a narrow-context fallback (fork_context: false) after an earlier wider-context attempt proved unreliable for this repo.
The successful evidence root for those runs was /tmp/orch-skill-subagents.J1XWgs.
Some longer cases used staged leader progression while keeping the same leader agent active across phases so the run still exercised real agent-driven orch control flow instead of a main-thread direct replay.
Per-Case Template
Each case file should use this structure:
- `Test Type`
- `Purpose`
- `Preconditions`
- `Agent Topology`
- `Inputs`
- `Execution Parameters`
- `Execution Steps`
- `Validation Commands`
- `Expected Outcomes`
- `Assertions`
- `Cleanup`
- `Recorded Example Run` when a real run has already been captured
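A case file can be checked against this template with a short script. This is a rough sketch: it assumes each section name appears on its own line, optionally prefixed with Markdown `#` markers, which this README does not specify. `Recorded Example Run` is treated as optional.

```python
REQUIRED_SECTIONS = [
    "Test Type",
    "Purpose",
    "Preconditions",
    "Agent Topology",
    "Inputs",
    "Execution Parameters",
    "Execution Steps",
    "Validation Commands",
    "Expected Outcomes",
    "Assertions",
    "Cleanup",
]


def missing_sections(case_text: str) -> list[str]:
    """Return required section names that never appear as a heading line."""
    headings = {line.lstrip("#").strip() for line in case_text.splitlines()}
    return [name for name in REQUIRED_SECTIONS if name not in headings]
```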
Case Files
| Case Slug | File | Coverage Note |
|---|---|---|
| `leader-run-dispatch-reconcile-through-bundled-cli` | leader-run-dispatch-reconcile-through-bundled-cli.md | validates that a leader can drive a complete run -> task -> dispatch -> reconcile -> status happy path through the packaged orch skill |
| `leader-blocked-answer-resume-through-bundled-cli` | leader-blocked-answer-resume-through-bundled-cli.md | validates that a leader can observe a blocked task, answer it through orch, and reach final completion with a real worker |
| `strict-worktree-dispatch-to-cleanup-through-bundled-cli` | strict-worktree-dispatch-to-cleanup-through-bundled-cli.md | validates that the skill can drive strict worktree allocation, reconcile completion, and cleanup through the bundled orch CLI |
| `leader-retries-failed-task-through-bundled-cli` | leader-retries-failed-task-through-bundled-cli.md | validates that a leader can reconcile a failed attempt and create a successful retry through the packaged orch skill |
| `leader-reassigns-blocked-task-through-bundled-cli` | leader-reassigns-blocked-task-through-bundled-cli.md | validates that a leader can reassign a blocked task from one worker to another and close the run through the packaged orch skill |
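Since every file name in the table is the case slug plus `.md`, the index can be linted mechanically. The sketch below is illustrative: the slug pattern (lowercase words joined by hyphens) is inferred from the five rows above rather than stated as a rule, and `index_problems` is a hypothetical helper.

```python
import re

# Assumed slug shape, inferred from the listed cases.
SLUG_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def index_problems(rows: list[tuple[str, str]]) -> list[str]:
    """Return human-readable problems for (slug, file_name) index rows."""
    problems = []
    seen = set()
    for slug, file_name in rows:
        if not SLUG_RE.fullmatch(slug):
            problems.append(f"{slug}: not a lowercase hyphenated slug")
        if file_name != f"{slug}.md":
            problems.append(f"{slug}: file name should be {slug}.md")
        if slug in seen:
            problems.append(f"{slug}: duplicate slug")
        seen.add(slug)
    return problems
```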
Scope
In scope:
- explicit `$orch` skill invocation
- bundled `./assets/orch` CLI usage
- leader-side run, task, dependency, dispatch, reconcile, answer, retry, reassign, wait, status, and cleanup flows
- interaction between a leader using `skills/orch/` and workers using `skills/inbox/`
- worktree-backed dispatch and cleanup validation
- end-to-end run state and thread history validation
Out of scope:
- per-command flag and JSON contract coverage for `orch`
- worker-only skill behavior that already belongs under ../inbox-skill/
- the separate `council-review` skill package
- implicit skill triggering without `$orch`
Relationship To Other Test Docs
- ../orch/ covers CLI command behavior
- ../inbox-skill/ covers worker-side skill-guided behavior on top of inbox
- this directory covers leader-side skill-guided behavior on top of `orch`