# Orch Skill Test Plan
## Purpose
This directory tracks human-readable test plans for the `skills/orch/` Codex skill bundle.
These documents are not command-contract specs for the `orch` CLI itself.
That coverage already lives under [../orch/](../orch/).
This directory exists to describe a different test surface:
- whether a leader agent can actually use the packaged `orch` skill
- whether the bundled `./assets/orch` CLI works inside real skill-guided conversations
- whether leader-side orchestration driven by the skill reaches the expected run, task, thread, and worktree state
## Test Model
- `README.md` is the index for this directory
- each skill test case lives in its own Markdown file
- use stable case slugs in filenames
## Shared Execution Contract
Use these defaults unless a case file explicitly overrides them:
- run the scenario with real subagents, not simulated transcripts
- inject `skills/orch/` into the leader agent
- inject `skills/inbox/` into worker agents whenever worker-side thread progress is required
- initialize the shared SQLite DB before launching role agents with `INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json init`
- require the leader to coordinate through the bundled `./assets/orch` CLI from the skill instead of ordinary chat
- require workers to coordinate through the bundled `./assets/inbox` CLI from their skill instead of ordinary chat
- launch-bridge cases may use a leader-only topology where the leader spawns worker subagents after dispatch instead of relying on the test-runner to launch separate worker roles
- validate final run and thread state independently from the main thread after the agents stop
- create any required Git repo fixture before launching agents for worktree cases
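As an illustration of the setup steps above, the following minimal sketch builds the one-time DB init command as an argument list. The helper names `build_init_command` and `make_run_dir` and the directory prefix are hypothetical; only the command shape (`assets/inbox --db TMPDIR/coord.db --json init`) comes from this document. The sketch only constructs the command and does not run it.

```python
import tempfile
from pathlib import Path


def make_run_dir(prefix: str = "orch-skill-case.") -> str:
    """Create an isolated temporary directory for a single case run.

    The prefix is an illustrative assumption, not a documented convention.
    """
    return tempfile.mkdtemp(prefix=prefix)


def build_init_command(inbox_skill_path: str, tmpdir: str) -> list[str]:
    """Return the argv for the shared-DB init step described above."""
    inbox_cli = str(Path(inbox_skill_path) / "assets" / "inbox")
    db_path = str(Path(tmpdir) / "coord.db")
    return [inbox_cli, "--db", db_path, "--json", "init"]
```

A test-runner could pass the returned list to `subprocess.run` once, before any role agent is launched.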
## How An Agent Runs These Cases
Use one test-runner agent to execute each case.
The test-runner agent is responsible for:
- reading this `README.md` first, then one specific case file
- creating an isolated temporary directory and DB path for that run
- initializing the DB once through the bundled inbox CLI before launching role agents
- creating any required temporary Git repo fixture before launching role agents
- launching the role agents described in `Agent Topology`
- injecting `skills/orch/` into the leader and `skills/inbox/` into workers
- passing each role agent the prompt text from the case file with concrete values substituted for `ORCH_SKILL_PATH`, `INBOX_SKILL_PATH`, `TMPDIR`, `RUN_ID`, `THREAD_ID`, and `WORKTREE_PATH` when needed
- coordinating launch order or parallel start according to the case file
- collecting agent final summaries as evidence
- resolving final run ids, thread ids, and worktree paths from agent outputs
- running the `Validation Commands` from the main thread after the role agents stop
- comparing the observed results against `Expected Outcomes` and `Assertions`
- returning a final pass/fail judgment with concrete evidence
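The placeholder substitution step above could be sketched as follows. The function name `substitute_placeholders` is hypothetical, and the choice to leave unresolved placeholders untouched is an assumption; the placeholder names themselves come from this document.

```python
import re

# Placeholder names listed in the responsibilities above.
PLACEHOLDERS = (
    "ORCH_SKILL_PATH",
    "INBOX_SKILL_PATH",
    "TMPDIR",
    "RUN_ID",
    "THREAD_ID",
    "WORKTREE_PATH",
)

# Whole-word match so a placeholder embedded in a longer identifier is left alone.
_PATTERN = re.compile(r"\b(" + "|".join(PLACEHOLDERS) + r")\b")


def substitute_placeholders(prompt: str, values: dict[str, str]) -> str:
    """Replace known placeholders in one pass; keep any without a value as-is."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        return values.get(name, name)

    return _PATTERN.sub(replace, prompt)
```

Substituting in a single pass means a concrete value that happens to contain a placeholder name is not substituted a second time.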
The role agents are responsible for:
- acting only within the role assigned in the case file
- using the injected skill bundle rather than ad hoc repository discovery
- coordinating through the bundled CLI and shared DB
- reporting concrete run ids, thread ids, worktree paths, and key command outcomes back to the test-runner agent
For launch-bridge cases:
- the leader may be the only top-level role agent
- that leader is responsible for spawning any worker subagents itself after `dispatch`
- spawned worker subagents should use the generated worker brief plus `skills/inbox/`, not ordinary chat
The test-runner agent should treat a case as passed only when:
- all role agents reach a final state without violating the case contract
- the independent validation commands succeed
- the final orch and inbox state matches the assertions in the case file
The test-runner agent should treat a case as failed when:
- any required agent times out or stalls
- a required orch or inbox action is skipped
- the leader falls back to ordinary chat for orchestration decisions that should go through `orch`
- workers fall back to ordinary chat for progress that should go through `inbox`
- the final run, task, thread, or worktree state conflicts with the documented assertions
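One way to read the pass and fail rules above is as a single conjunction: a case passes only when every pass condition holds and no failure condition was observed. The sketch below assumes a hypothetical `CaseObservations` record filled in by the test-runner; the field names are illustrative and not part of any case file.

```python
from dataclasses import dataclass, field


@dataclass
class CaseObservations:
    """Hypothetical summary of what the test-runner observed for one case."""

    agents_reached_final_state: bool
    validation_commands_succeeded: bool
    state_matches_assertions: bool
    # Free-text notes such as "leader used ordinary chat for dispatch decision".
    contract_violations: list[str] = field(default_factory=list)


def judge(obs: CaseObservations) -> str:
    """Return "pass" only when all pass conditions hold, otherwise "fail"."""
    passed = (
        obs.agents_reached_final_state
        and not obs.contract_violations
        and obs.validation_commands_succeeded
        and obs.state_matches_assertions
    )
    return "pass" if passed else "fail"
```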
The test-runner agent should report results in this shape:
- `case`
- `db_path`
- `run_id`
- `thread_ids`
- `worktree_paths`
- `result`: `pass` or `fail`
- `agent_summaries`
- `validation_evidence`
- `assertion_checklist`
- `notes`
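The report shape above could be represented as a small record type. In this sketch the field names match the list above, but the value types (lists of strings, string-keyed dictionaries) are assumptions, since this document only names the fields.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CaseReport:
    """Report record with the fields listed above; value types are assumed."""

    case: str
    db_path: str
    run_id: str
    thread_ids: list[str]
    worktree_paths: list[str]
    result: str  # "pass" or "fail"
    agent_summaries: dict[str, str] = field(default_factory=dict)
    validation_evidence: dict[str, str] = field(default_factory=dict)
    assertion_checklist: dict[str, bool] = field(default_factory=dict)
    notes: str = ""

    def __post_init__(self) -> None:
        if self.result not in ("pass", "fail"):
            raise ValueError(f"result must be 'pass' or 'fail', got {self.result!r}")

    def to_json(self) -> str:
        """Serialize the report, keeping the documented field order."""
        return json.dumps(asdict(self), indent=2)
```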
## Default Timeouts
Use these defaults unless a case file explicitly overrides them:
- per-agent timeout: `4m`
- overall scenario timeout: `6m`
- async wait margin for the main thread: `45s`
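For a test-runner that needs these defaults as numbers, a minimal sketch might parse the duration strings and merge per-case overrides. The key names (`per_agent`, `overall`, `async_wait_margin`) and the assumption that durations use only `s` and `m` suffixes are illustrative.

```python
import re
from typing import Optional

# Defaults copied from the list above; key names are hypothetical.
DEFAULT_TIMEOUTS = {"per_agent": "4m", "overall": "6m", "async_wait_margin": "45s"}

_UNIT_SECONDS = {"s": 1, "m": 60}


def parse_duration(text: str) -> int:
    """Convert a duration such as "4m" or "45s" to whole seconds."""
    match = re.fullmatch(r"(\d+)([sm])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def resolve_timeouts(overrides: Optional[dict[str, str]] = None) -> dict[str, int]:
    """Merge case-level overrides over the defaults and return seconds."""
    merged = {**DEFAULT_TIMEOUTS, **(overrides or {})}
    return {name: parse_duration(value) for name, value in merged.items()}
```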
## Default Failure Conditions
Treat the test as failed if any of the following happens:
- any required agent does not reach a final state before timeout
- any required orch or inbox command returns a non-success result unless the case expects that failure
- the final `orch status` output does not match the expected run or task state
- the final `inbox show` output does not match the expected thread or message history
- a required worktree is removed before the case expects it, or is still present after cleanup in a cleanup case
- the agents fall back to ordinary chat for critical coordination instead of the bundled CLIs
## Evidence Capture
Collect at least the following artifacts for every run:
- agent final summaries
- final `orch status --run RUN_ID --json` output
- final `inbox show --thread THREAD_ID --json` output for every relevant thread
- any `blocked`, `wait`, `retry`, `reassign`, or `cleanup` output relevant to the case
- the temporary DB path, resolved run id, resolved thread ids, and any worktree paths
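The final-state commands named above can be assembled per run and per thread. This sketch mirrors the two command shapes exactly as written in this section and deliberately adds no other arguments; if a case needs a DB argument or other flags for these subcommands, the case file's `Validation Commands` are the authority. The helper names are hypothetical.

```python
def orch_status_command(orch_cli: str, run_id: str) -> list[str]:
    """argv for `orch status --run RUN_ID --json` as listed above."""
    return [orch_cli, "status", "--run", run_id, "--json"]


def inbox_show_command(inbox_cli: str, thread_id: str) -> list[str]:
    """argv for `inbox show --thread THREAD_ID --json` as listed above."""
    return [inbox_cli, "show", "--thread", thread_id, "--json"]


def evidence_commands(
    orch_cli: str, inbox_cli: str, run_id: str, thread_ids: list[str]
) -> list[list[str]]:
    """One status command for the run, then one show command per thread."""
    commands = [orch_status_command(orch_cli, run_id)]
    commands.extend(inbox_show_command(inbox_cli, t) for t in thread_ids)
    return commands
```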
## Cleanup Policy
Use these defaults unless a case file explicitly overrides them:
- keep the temporary DB, repo fixture, and working directory on failure for debugging
- clean up the temporary working directory on success only if the caller does not need replay artifacts
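The default cleanup policy reduces to a two-input decision, sketched below. The function name `apply_cleanup_policy` and its parameters are hypothetical; a case file that overrides the policy would bypass this logic.

```python
import shutil


def apply_cleanup_policy(workdir: str, passed: bool, keep_replay_artifacts: bool) -> bool:
    """Remove workdir only on success when replay artifacts are not needed.

    Returns True if the directory was removed, False if it was kept.
    """
    if not passed or keep_replay_artifacts:
        # Failures always keep the DB, repo fixture, and working directory.
        return False
    shutil.rmtree(workdir, ignore_errors=True)
    return True
```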
## Direct CLI Replay
The repository also includes a reusable direct replay runner at `scripts/run_orch_skill_forward_tests.sh`.
This runner executes the bundled `skills/orch/assets/orch` and `skills/inbox/assets/inbox` binaries against temporary SQLite DBs and Git fixtures without spawning Codex role agents.
Use it to validate packaged CLI behavior and record concrete evidence quickly, but do not treat it as a full replacement for the real subagent-forward model described above.
Eight case files in this directory (the five original cases and the three gap-fill cases described below) include recorded example runs captured through that direct replay path on `2026-03-19`.
## Real Subagent Forward Runs
The original five cases in this directory were also executed with real spawned role agents on `2026-03-19`.
That run used injected project-local `skills/orch/` and `skills/inbox/` bundles with a narrow-context fallback (`fork_context: false`) after an earlier wider-context attempt proved unreliable for this repo.
The successful evidence root for those runs was `/tmp/orch-skill-subagents.J1XWgs`.
Some longer cases used staged leader progression while keeping the same leader agent active across phases so the run still exercised real agent-driven `orch` control flow instead of a main-thread direct replay.
The three gap-fill cases added later on `2026-03-19` currently have direct replay evidence only and have not yet been rerun through the real subagent-forward path.
## Per-Case Template
Each case file should use this structure:
- `Test Type`
- `Purpose`
- `Preconditions`
- `Agent Topology`
- `Inputs`
- `Execution Parameters`
- `Execution Steps`
- `Validation Commands`
- `Expected Outcomes`
- `Assertions`
- `Cleanup`
- `Recorded Example Run` when a real run has already been captured
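A case file can be checked against this template mechanically. The sketch below assumes each section appears as a Markdown ATX heading (`#` through `######`) whose text equals the section name; this document does not specify the heading level, so that is an assumption. `Recorded Example Run` is treated as optional, as the template states.

```python
import re

REQUIRED_SECTIONS = [
    "Test Type",
    "Purpose",
    "Preconditions",
    "Agent Topology",
    "Inputs",
    "Execution Parameters",
    "Execution Steps",
    "Validation Commands",
    "Expected Outcomes",
    "Assertions",
    "Cleanup",
]

# Matches an ATX heading on one line and captures its text.
_HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(markdown_text: str) -> list[str]:
    """Return required section names that have no matching heading."""
    headings = {m.group(1) for m in _HEADING.finditer(markdown_text)}
    return [name for name in REQUIRED_SECTIONS if name not in headings]
```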
## Case Files
| Case Slug | File | Coverage Note |
| --- | --- | --- |
| `leader-run-dispatch-reconcile-through-bundled-cli` | [leader-run-dispatch-reconcile-through-bundled-cli.md](./leader-run-dispatch-reconcile-through-bundled-cli.md) | validates that a leader can drive a complete `run -> task -> dispatch -> reconcile -> status` happy path through the packaged orch skill |
| `leader-blocked-answer-resume-through-bundled-cli` | [leader-blocked-answer-resume-through-bundled-cli.md](./leader-blocked-answer-resume-through-bundled-cli.md) | validates that a leader can observe a blocked task, answer it through `orch`, and reach final completion with a real worker |
| `strict-worktree-dispatch-to-cleanup-through-bundled-cli` | [strict-worktree-dispatch-to-cleanup-through-bundled-cli.md](./strict-worktree-dispatch-to-cleanup-through-bundled-cli.md) | validates that the skill can drive `execution-mode code` worktree allocation, reconcile completion, and cleanup through the bundled orch CLI |
| `leader-dispatches-dependent-task-after-prerequisite-through-bundled-cli` | [leader-dispatches-dependent-task-after-prerequisite-through-bundled-cli.md](./leader-dispatches-dependent-task-after-prerequisite-through-bundled-cli.md) | validates that a leader can use `dep add` and `ready` to hold back dependent work until a prerequisite completes, then dispatch the newly ready task |
| `leader-cancels-active-task-through-bundled-cli` | [leader-cancels-active-task-through-bundled-cli.md](./leader-cancels-active-task-through-bundled-cli.md) | validates that a leader can cancel an already active task through the packaged orch skill without cancelling unrelated ready work |
| `leader-answers-blocked-task-with-payload-json-through-bundled-cli` | [leader-answers-blocked-task-with-payload-json-through-bundled-cli.md](./leader-answers-blocked-task-with-payload-json-through-bundled-cli.md) | validates that a leader can answer a blocked task with structured payload data only and still drive the run to completion |
| `leader-retries-failed-task-through-bundled-cli` | [leader-retries-failed-task-through-bundled-cli.md](./leader-retries-failed-task-through-bundled-cli.md) | validates that a leader can reconcile a failed attempt and create a successful retry through the packaged orch skill |
| `leader-reassigns-blocked-task-through-bundled-cli` | [leader-reassigns-blocked-task-through-bundled-cli.md](./leader-reassigns-blocked-task-through-bundled-cli.md) | validates that a leader can reassign a blocked task from one worker to another and close the run through the packaged orch skill |
| `leader-dispatches-and-launches-worker-through-codex-bridge` | [leader-dispatches-and-launches-worker-through-codex-bridge.md](./leader-dispatches-and-launches-worker-through-codex-bridge.md) | validates that a leader can dispatch a task, render a standardized worker brief, and launch a worker subagent from the same Codex thread |
| `strict-worktree-dispatch-launches-worker-through-codex-bridge` | [strict-worktree-dispatch-launches-worker-through-codex-bridge.md](./strict-worktree-dispatch-launches-worker-through-codex-bridge.md) | validates that a leader can launch a code-writing worker subagent from saved `execution-mode code` dispatch metadata while preserving the assigned worktree contract |
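Because every row above pairs a slug with a file named `<slug>.md`, the index can be checked for drift with a short script. This is a sketch under the assumption that rows keep the three-column layout shown here; the function name `mismatched_rows` is hypothetical.

```python
import re

# Captures the slug, the link label, and the link target from one table row.
_ROW = re.compile(r"^\|\s*`([^`]+)`\s*\|\s*\[([^\]]+)\]\(([^)]+)\)\s*\|")


def mismatched_rows(table_markdown: str) -> list[str]:
    """Return slugs whose link label or target is not `<slug>.md`."""
    problems = []
    for line in table_markdown.splitlines():
        match = _ROW.match(line.strip())
        if match is None:
            continue  # header, separator, or non-table line
        slug, label, target = match.groups()
        expected = f"{slug}.md"
        if label != expected or target != f"./{expected}":
            problems.append(slug)
    return problems
```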
## Scope
In scope:
- explicit `$orch` skill invocation
- bundled `./assets/orch` CLI usage
- leader-side run, task, dependency, dispatch, reconcile, answer, retry, reassign, wait, status, and cleanup flows
- interaction between a leader using `skills/orch/` and workers using `skills/inbox/`
- leader-side launch-bridge workflows where the leader spawns worker subagents after `dispatch`
- worktree-backed dispatch and cleanup validation
- end-to-end run state and thread history validation
Out of scope:
- per-command flag and JSON contract coverage for `orch`
- worker-only skill behavior that already belongs under [../inbox-skill/](../inbox-skill/)
- the separate `council-review` skill package
- implicit skill triggering without `$orch`
- changing the core `orch` CLI so it launches workers by itself
## Relationship To Other Test Docs
- [../orch/](../orch/) covers CLI command behavior
- [../inbox-skill/](../inbox-skill/) covers worker-side skill-guided behavior on top of inbox
- this directory covers leader-side skill-guided behavior on top of `orch`