Add council-review skill test plan docs

2026-03-19 17:25:40 +08:00
parent 8b26815d53
commit 0b533a70f9
7 changed files with 584 additions and 0 deletions
@@ -0,0 +1,174 @@
+# Council Review Skill Test Plan
+
+## Purpose
+
+This directory tracks human-readable test plans for the `skills/council-review/` Codex skill bundle.
+
+These documents are not command-contract specs for the `orch council` CLI itself.
+That coverage already lives under [../orch/](../orch/).
+
+This directory exists to describe a different test surface:
+
+- whether a leader agent can actually use the packaged `council-review` skill
+- whether the bundled `./assets/orch` CLI works inside real skill-guided council workflows
+- whether a council run driven by the skill reaches the expected reviewer, grouping, tally, and report state
+
+## Test Model
+
+- `README.md` is the index for this directory
+- each skill test case lives in its own Markdown file
+- use stable case slugs in filenames
+
+## Shared Execution Contract
+
+Use these defaults unless a case file explicitly overrides them:
+
+- run the scenario with real subagents, not simulated transcripts
+- inject `skills/council-review/` into the leader agent
+- inject `skills/inbox/` into reviewer agents whenever reviewer task completion is required
+- before launching role agents, initialize the shared SQLite DB with `INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json init`
+- require the leader to coordinate through the bundled `./assets/orch` CLI from the council-review skill instead of ordinary chat
+- require reviewer agents to coordinate through the bundled `./assets/inbox` CLI from their skill instead of ordinary chat
+- validate the final council run, reviewer task state, and report state independently, from the main thread, after the agents stop
+- create any required repo fixture before launching agents for mixed or repo-target cases
+
+## How An Agent Runs These Cases
+
+Use one test-runner agent to execute each case.
+
+The test-runner agent is responsible for:
+
+- reading this `README.md` first, then one specific case file
+- creating an isolated temporary directory and DB path for that run
+- initializing the DB once through the bundled inbox CLI before launching role agents
+- creating any required temporary Git repo fixture before launching role agents
+- launching the role agents described in `Agent Topology`
+- injecting `skills/council-review/` into the leader and `skills/inbox/` into reviewers
+- passing each role agent the prompt text from the case file with concrete values substituted for `COUNCIL_SKILL_PATH`, `INBOX_SKILL_PATH`, `TMPDIR`, `RUN_ID`, `THREAD_ID`, and `REPORT_PATH` when needed
+- coordinating launch order or parallel start according to the case file
+- collecting agent final summaries as evidence
+- resolving final run ids, thread ids, and report artifact paths from agent outputs
+- running the `Validation Commands` from the main thread after the role agents stop
+- comparing the observed results against `Expected Outcomes` and `Assertions`
+- returning a final pass/fail judgment with concrete evidence
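
The placeholder substitution step in the list above could be sketched as follows. This is a minimal illustration under stated assumptions, not the project's tooling: the helper name `substitute_placeholders` and the example values are hypothetical, and only the placeholder names and the init command text come from this README.

```python
import re

# Placeholder names taken from the responsibilities list above.
PLACEHOLDERS = (
    "COUNCIL_SKILL_PATH",
    "INBOX_SKILL_PATH",
    "TMPDIR",
    "RUN_ID",
    "THREAD_ID",
    "REPORT_PATH",
)


def substitute_placeholders(text: str, values: dict[str, str]) -> str:
    """Replace whole-word placeholders in text with concrete values.

    Placeholders missing from `values` are left untouched, because the
    plan only asks for substitution "when needed".
    """
    known = [name for name in PLACEHOLDERS if name in values]
    if not known:
        return text
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, known)) + r")\b")
    # A function replacement avoids backslash-escape processing in values.
    return pattern.sub(lambda m: values[m.group(1)], text)


# Hypothetical example values; the command text is the init command from
# the Shared Execution Contract.
command = "INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json init"
resolved = substitute_placeholders(
    command,
    {"INBOX_SKILL_PATH": "/skills/inbox", "TMPDIR": "/tmp/case-001"},
)
print(resolved)
```

Matching on word boundaries keeps a longer identifier that merely contains a placeholder name (for example a hypothetical `MY_RUN_ID`) from being rewritten by accident.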
+
+The role agents are responsible for:
+
+- acting only within the role assigned in the case file
+- using the injected skill bundle rather than ad hoc repository discovery
+- coordinating through the bundled CLI and shared DB
+- reporting concrete run ids, thread ids, report artifact paths, and key command outcomes back to the test-runner agent
+
+The test-runner agent should treat a case as passed only when:
+
+- all role agents reach a final state without violating the case contract
+- the independent validation commands succeed
+- the final council, orch, and inbox state matches the assertions in the case file
+
+The test-runner agent should treat a case as failed when:
+
+- any required agent times out or stalls
+- a required council, orch, or inbox action is skipped
+- the leader falls back to ordinary chat for workflow control that should go through the bundled council-review skill
+- reviewer agents fall back to ordinary chat instead of returning results through inbox
+- the final council grouping, summary, or report state conflicts with the documented assertions
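
The pass and fail criteria above can be read as one decision over a handful of observations. The sketch below is one hedged way to encode them. The `CaseObservations` type, its field names, and the `judge` function are hypothetical and exist only for illustration; the conditions themselves mirror the two lists above.

```python
from dataclasses import dataclass


@dataclass
class CaseObservations:
    """Facts the test-runner gathers for one case (hypothetical field names)."""

    agents_reached_final_state: bool
    required_actions_completed: bool
    leader_used_chat_for_control: bool
    reviewers_used_chat_for_results: bool
    validation_commands_succeeded: bool
    final_state_matches_assertions: bool


def judge(obs: CaseObservations) -> tuple[str, list[str]]:
    """Return ("pass" or "fail", reasons) following the criteria above."""
    reasons: list[str] = []
    if not obs.agents_reached_final_state:
        reasons.append("a required agent timed out or stalled")
    if not obs.required_actions_completed:
        reasons.append("a required council, orch, or inbox action was skipped")
    if obs.leader_used_chat_for_control:
        reasons.append("leader used ordinary chat for workflow control")
    if obs.reviewers_used_chat_for_results:
        reasons.append("reviewers returned results through ordinary chat")
    if not obs.validation_commands_succeeded:
        reasons.append("independent validation commands did not succeed")
    if not obs.final_state_matches_assertions:
        reasons.append("final state conflicts with the documented assertions")
    # A case passes only when no failure reason was recorded.
    return ("fail" if reasons else "pass", reasons)


verdict, reasons = judge(
    CaseObservations(
        agents_reached_final_state=True,
        required_actions_completed=True,
        leader_used_chat_for_control=False,
        reviewers_used_chat_for_results=False,
        validation_commands_succeeded=True,
        final_state_matches_assertions=True,
    )
)
print(verdict, reasons)
```

Collecting every reason, rather than stopping at the first, gives the test-runner concrete evidence to include in its final judgment.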
+
+The test-runner agent should report results in this shape:
+
+- `case`
+- `db_path`
+- `run_id`
+- `thread_ids`
+- `report_paths`
+- `result`: `pass` or `fail`
+- `agent_summaries`
+- `validation_evidence`
+- `assertion_checklist`
+- `notes`
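
As an illustration of the result shape listed above, the fields could be carried in a small record and serialized as JSON. This is only a sketch. The README names the fields but not their types, so the types chosen here (lists for ids and paths, dictionaries for summaries, evidence, and the checklist) are assumptions, as are the class name `CaseResult` and all example values except the case slug.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CaseResult:
    """One test-runner report; field names follow the list above."""

    case: str
    db_path: str
    run_id: str
    thread_ids: list[str]
    report_paths: list[str]
    result: str  # "pass" or "fail"
    agent_summaries: dict[str, str] = field(default_factory=dict)
    validation_evidence: dict[str, str] = field(default_factory=dict)
    assertion_checklist: dict[str, bool] = field(default_factory=dict)
    notes: str = ""

    def __post_init__(self) -> None:
        # The plan allows exactly two values for `result`.
        if self.result not in ("pass", "fail"):
            raise ValueError(f"result must be 'pass' or 'fail', got {self.result!r}")


# Hypothetical values for illustration only.
report = CaseResult(
    case="council-wait-timeout-through-bundled-cli",
    db_path="/tmp/case-001/coord.db",
    run_id="run-example",
    thread_ids=["thread-a", "thread-b"],
    report_paths=[],
    result="pass",
    assertion_checklist={"leader saw timeout contract": True},
)
print(json.dumps(asdict(report), indent=2))
```

Validating `result` at construction time keeps a misspelled verdict from reaching whoever consumes the report.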
+
+## Default Timeouts
+
+Use these defaults unless a case file explicitly overrides them:
+
+- per-agent timeout: `4m`
+- overall scenario timeout: `6m`
+- async wait margin for the main thread: `45s`
+
+## Default Failure Conditions
+
+Treat the test as failed if any of the following happens:
+
+- any required agent does not reach a final state before timeout
+- any required council, orch, or inbox command returns a non-success result unless the case expects that failure
+- the final `council report --json` output does not match the expected grouped recommendations
+- the final `orch status` output does not match the expected reviewer task state
+- a required markdown report artifact is missing when the case expects one
+- the agents fall back to ordinary chat for critical coordination instead of the bundled CLIs
+
+## Evidence Capture
+
+Collect at least the following artifacts for every run:
+
+- agent final summaries
+- final `council report --json` output when the case reaches report stage
+- final `orch status --run RUN_ID --json` output
+- final `inbox show --thread THREAD_ID --json` output for every relevant reviewer thread when reviewers participated
+- any `council wait` or `council tally` output relevant to the case
+- the temporary DB path, resolved run id, resolved thread ids, and any report artifact paths
+
+## Cleanup Policy
+
+Use these defaults unless a case file explicitly overrides them:
+
+- keep the temporary DB, repo fixture, and working directory on failure for debugging
+- clean up the temporary working directory on success only if the caller does not need replay artifacts
+
+## Per-Case Template
+
+Each case file should use this structure:
+
+- `Test Type`
+- `Purpose`
+- `Preconditions`
+- `Agent Topology`
+- `Inputs`
+- `Execution Parameters`
+- `Execution Steps`
+- `Validation Commands`
+- `Expected Outcomes`
+- `Assertions`
+- `Cleanup`
+- `Recorded Example Run` when a real run has already been captured
+
+## Case Files
+
+| Case Slug | File | Coverage Note |
+| --- | --- | --- |
+| `council-brainstorm-end-to-end-through-bundled-cli` | [council-brainstorm-end-to-end-through-bundled-cli.md](./council-brainstorm-end-to-end-through-bundled-cli.md) | validates that the council-review skill can drive `start -> wait -> tally -> report` with three real reviewer agents |
+| `council-unanimous-only-default-report-through-bundled-cli` | [council-unanimous-only-default-report-through-bundled-cli.md](./council-unanimous-only-default-report-through-bundled-cli.md) | validates that unanimous-only runs default to `consensus` output while preserving the underlying summary counts |
+| `council-wait-timeout-through-bundled-cli` | [council-wait-timeout-through-bundled-cli.md](./council-wait-timeout-through-bundled-cli.md) | validates that the leader sees the expected timeout contract when reviewer tasks do not complete |
+| `council-report-rejects-before-tally-through-bundled-cli` | [council-report-rejects-before-tally-through-bundled-cli.md](./council-report-rejects-before-tally-through-bundled-cli.md) | validates that the skill surfaces the stable invalid-state error when report is attempted before tally |
+
+## Scope
+
+In scope:
+
+- explicit `$council-review` skill invocation
+- bundled `./assets/orch` CLI usage for `orch council ...`
+- end-to-end council start, wait, tally, and report flows
+- interaction between a leader using `skills/council-review/` and reviewers using `skills/inbox/`
+- default report policy, unanimous-only behavior, and timeout/error-path validation
+
+Out of scope:
+
+- per-command flag and JSON contract coverage for `orch council`
+- generic leader orchestration flows that already belong under [../orch-skill/](../orch-skill/)
+- worker-only skill behavior that belongs under [../inbox-skill/](../inbox-skill/)
+- implicit skill triggering without `$council-review`
+
+## Relationship To Other Test Docs
+
+- [../orch/](../orch/) covers CLI command behavior
+- [../orch-skill/](../orch-skill/) covers generic leader-side orchestration behavior on top of `orch`
+- [../inbox-skill/](../inbox-skill/) covers worker-side skill-guided behavior on top of inbox
+- this directory covers the separate user-facing `council-review` skill on top of `orch council`