# Orch Skill Test Plan

## Purpose

This directory tracks human-readable test plans for the `skills/orch/` Codex skill bundle.

These documents are not command-contract specs for the `orch` CLI itself. That coverage already lives under [../orch/](../orch/).

This directory exists to describe a different test surface:

- whether a leader agent can actually use the packaged `orch` skill
- whether the bundled `./assets/orch` CLI works inside real skill-guided conversations
- whether leader-side orchestration driven by the skill reaches the expected run, task, thread, and worktree state

## Test Model

- `README.md` is the index for this directory
- each skill test case lives in its own Markdown file
- use stable case slugs in filenames

## Shared Execution Contract

Use these defaults unless a case file explicitly overrides them:

- run the scenario with real subagents, not simulated transcripts
- inject `skills/orch/` into the leader agent
- inject `skills/inbox/` into worker agents whenever worker-side thread progress is required
- initialize the shared SQLite DB before launching role agents with `INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json init`
- require the leader to coordinate through the bundled `./assets/orch` CLI from the skill instead of ordinary chat
- require workers to coordinate through the bundled `./assets/inbox` CLI from their skill instead of ordinary chat
- validate final run and thread state independently from the main thread after the agents stop
- create any required Git repo fixture before launching agents for worktree cases

## How An Agent Runs These Cases

Use one test-runner agent to execute each case.

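
The DB initialization step from the shared execution contract can be sketched as a small helper that substitutes concrete values for `INBOX_SKILL_PATH` and `TMPDIR`. This is a minimal sketch, not part of the skill bundle: the function name `build_inbox_init_command` and the example paths are hypothetical, and only the argument order is taken from the documented command. The helper builds the argument list and does not run the CLI.

```python
from pathlib import PurePosixPath


def build_inbox_init_command(inbox_skill_path: str, tmpdir: str) -> list[str]:
    """Build the documented init command with placeholders resolved.

    Hypothetical helper mirroring:
    INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json init
    """
    # POSIX-style joins keep the output identical on every platform.
    inbox_cli = PurePosixPath(inbox_skill_path) / "assets" / "inbox"
    db_path = PurePosixPath(tmpdir) / "coord.db"
    return [str(inbox_cli), "--db", str(db_path), "--json", "init"]


if __name__ == "__main__":
    # Example values are placeholders, not real locations.
    print(build_inbox_init_command("skills/inbox", "/tmp/orch-case"))
```

A test runner could pass the returned list to a process launcher once per run, before any role agent starts, so that every agent sees the same initialized DB.
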
The test-runner agent is responsible for:

- reading this `README.md` first, then one specific case file
- creating an isolated temporary directory and DB path for that run
- initializing the DB once through the bundled inbox CLI before launching role agents
- creating any required temporary Git repo fixture before launching role agents
- launching the role agents described in `Agent Topology`
- injecting `skills/orch/` into the leader and `skills/inbox/` into workers
- passing each role agent the prompt text from the case file with concrete values substituted for `ORCH_SKILL_PATH`, `INBOX_SKILL_PATH`, `TMPDIR`, `RUN_ID`, `THREAD_ID`, and `WORKTREE_PATH` when needed
- coordinating launch order or parallel start according to the case file
- collecting agent final summaries as evidence
- resolving final run ids, thread ids, and worktree paths from agent outputs
- running the `Validation Commands` from the main thread after the role agents stop
- comparing the observed results against `Expected Outcomes` and `Assertions`
- returning a final pass/fail judgment with concrete evidence

The role agents are responsible for:

- acting only within the role assigned in the case file
- using the injected skill bundle rather than ad hoc repository discovery
- coordinating through the bundled CLI and shared DB
- reporting concrete run ids, thread ids, worktree paths, and key command outcomes back to the test-runner agent

The test-runner agent should treat a case as passed only when:

- all role agents reach a final state without violating the case contract
- the independent validation commands succeed
- the final orch and inbox state matches the assertions in the case file

The test-runner agent should treat a case as failed when:

- any required agent times out or stalls
- a required orch or inbox action is skipped
- the leader falls back to ordinary chat for orchestration decisions that should go through `orch`
- workers fall back to ordinary chat for progress that should go through `inbox`
- the final run, task, thread, or worktree state conflicts with the documented assertions

The test-runner agent should report results in this shape:

- `case`
- `db_path`
- `run_id`
- `thread_ids`
- `worktree_paths`
- `result`: `pass` or `fail`
- `agent_summaries`
- `validation_evidence`
- `assertion_checklist`
- `notes`

## Default Timeouts

Use these defaults unless a case file explicitly overrides them:

- per-agent timeout: `4m`
- overall scenario timeout: `6m`
- async wait margin for the main thread: `45s`

## Default Failure Conditions

Treat the test as failed if any of the following happens:

- any required agent does not reach a final state before timeout
- any required orch or inbox command returns a non-success result unless the case expects that failure
- the final `orch status` output does not match the expected run or task state
- the final `inbox show` output does not match the expected thread or message history
- a required worktree is missing too early or still present after cleanup in a cleanup case
- the agents fall back to ordinary chat for critical coordination instead of the bundled CLIs

## Evidence Capture

Collect at least the following artifacts for every run:

- agent final summaries
- final `orch status --run RUN_ID --json` output
- final `inbox show --thread THREAD_ID --json` output for every relevant thread
- any `blocked`, `wait`, `retry`, `reassign`, or `cleanup` output relevant to the case
- the temporary DB path, resolved run id, resolved thread ids, and any worktree paths

## Cleanup Policy

Use these defaults unless a case file explicitly overrides them:

- keep the temporary DB, repo fixture, and working directory on failure for debugging
- clean up the temporary working directory on success only if the caller does not need replay artifacts

## Per-Case Template

Each case file should use this structure:

- `Test Type`
- `Purpose`
- `Preconditions`
- `Agent Topology`
- `Inputs`
- `Execution Parameters`
- `Execution Steps`
- `Validation Commands`
- `Expected Outcomes`
- `Assertions`
- `Cleanup`
- `Recorded Example Run` when a real run has already been captured

## Case Files

| Case Slug | File | Coverage Note |
| --- | --- | --- |
| `leader-run-dispatch-reconcile-through-bundled-cli` | [leader-run-dispatch-reconcile-through-bundled-cli.md](./leader-run-dispatch-reconcile-through-bundled-cli.md) | validates that a leader can drive a complete `run -> task -> dispatch -> reconcile -> status` happy path through the packaged orch skill |
| `leader-blocked-answer-resume-through-bundled-cli` | [leader-blocked-answer-resume-through-bundled-cli.md](./leader-blocked-answer-resume-through-bundled-cli.md) | validates that a leader can observe a blocked task, answer it through `orch`, and reach final completion with a real worker |
| `strict-worktree-dispatch-to-cleanup-through-bundled-cli` | [strict-worktree-dispatch-to-cleanup-through-bundled-cli.md](./strict-worktree-dispatch-to-cleanup-through-bundled-cli.md) | validates that the skill can drive strict worktree allocation, reconcile completion, and cleanup through the bundled orch CLI |
| `leader-retries-failed-task-through-bundled-cli` | [leader-retries-failed-task-through-bundled-cli.md](./leader-retries-failed-task-through-bundled-cli.md) | validates that a leader can reconcile a failed attempt and create a successful retry through the packaged orch skill |
| `leader-reassigns-blocked-task-through-bundled-cli` | [leader-reassigns-blocked-task-through-bundled-cli.md](./leader-reassigns-blocked-task-through-bundled-cli.md) | validates that a leader can reassign a blocked task from one worker to another and close the run through the packaged orch skill |

## Scope

In scope:

- explicit `$orch` skill invocation
- bundled `./assets/orch` CLI usage
- leader-side run, task, dependency, dispatch, reconcile, answer, retry, reassign, wait, status, and cleanup flows
- interaction between a leader using `skills/orch/` and workers using `skills/inbox/`
- worktree-backed dispatch and cleanup validation
- end-to-end run state and thread history validation

Out of scope:

- per-command flag and JSON contract coverage for `orch`
- worker-only skill behavior that already belongs under [../inbox-skill/](../inbox-skill/)
- the separate `council-review` skill package
- implicit skill triggering without `$orch`

## Relationship To Other Test Docs

- [../orch/](../orch/) covers CLI command behavior
- [../inbox-skill/](../inbox-skill/) covers worker-side skill-guided behavior on top of inbox
- this directory covers leader-side skill-guided behavior on top of `orch`
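
As a closing illustration, the report shape and pass criteria described under `How An Agent Runs These Cases` can be sketched as a small data structure. This is a sketch under assumptions: the `CaseReport` class, the `judge` function, and every field type are hypothetical, and only the field names come from the documented report shape. Treating an empty assertion checklist as a failure is also an assumption, chosen so that a case cannot pass without checking anything.

```python
from dataclasses import dataclass, field


@dataclass
class CaseReport:
    """Hypothetical container for the documented report fields."""

    case: str
    db_path: str
    run_id: str
    thread_ids: list[str]
    worktree_paths: list[str]
    result: str  # "pass" or "fail"
    agent_summaries: dict[str, str] = field(default_factory=dict)
    validation_evidence: list[str] = field(default_factory=list)
    assertion_checklist: dict[str, bool] = field(default_factory=dict)
    notes: str = ""


def judge(
    agents_finished: bool,
    validation_ok: bool,
    assertion_checklist: dict[str, bool],
) -> str:
    """Return "pass" only when all three documented criteria hold.

    agents_finished: every role agent reached a final state within the contract.
    validation_ok: the independent validation commands succeeded.
    assertion_checklist: one boolean per assertion from the case file.
    """
    # Assumption: an empty checklist counts as a failure, not a vacuous pass.
    assertions_hold = bool(assertion_checklist) and all(assertion_checklist.values())
    return "pass" if agents_finished and validation_ok and assertions_hold else "fail"
```

Keeping the judgment in one function makes the pass rule easy to audit against the case file, while the evidence itself stays in the report fields.
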