ai-workflow-skill/docs/tests/council-review-skill/README.md

# Council Review Skill Test Plan

## Purpose

This directory tracks human-readable test plans for the `skills/council-review/` Codex skill bundle.

These documents are not command-contract specs for the `orch council` CLI itself.
That coverage already lives under [../orch/](../orch/).

This directory exists to describe a different test surface:

- whether a leader agent can actually use the packaged `council-review` skill
- whether the bundled `./assets/orch` CLI works inside real skill-guided council workflows
- whether a council run driven by the skill reaches the expected reviewer, grouping, tally, and report state

## Test Model

- `README.md` is the index for this directory
- each skill test case lives in its own Markdown file
- use stable case slugs in filenames

## Shared Execution Contract

Use these defaults unless a case file explicitly overrides them:

- run the scenario with real subagents, not simulated transcripts
- inject `skills/council-review/` into the leader agent
- inject `skills/inbox/` into reviewer agents whenever reviewer task completion is required
- initialize the shared SQLite DB before launching role agents with `INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json init`
- require the leader to coordinate through the bundled `./assets/orch` CLI from the council-review skill instead of ordinary chat
- require reviewer agents to coordinate through the bundled `./assets/inbox` CLI from their skill instead of ordinary chat
- after the agents stop, validate the final council run, reviewer task state, and report state from the main thread, independently of what the agents report
- create any required target-file or repo fixture before launching agents for target-file, mixed, or repo-target cases
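
The DB initialization step in the contract above can be sketched as a small helper that builds the command line. This is a minimal sketch, not part of the skill bundle: the function name `build_init_command` is hypothetical, and only the argument order is taken from the command shown in the list.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def build_init_command(inbox_skill_path: str, tmpdir: str) -> list[str]:
    """Return the argv that initializes the shared SQLite DB.

    Mirrors the documented command:
    INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json init
    """
    # PurePosixPath keeps the joined paths identical on every platform.
    inbox_cli = PurePosixPath(inbox_skill_path) / "assets" / "inbox"
    db_path = PurePosixPath(tmpdir) / "coord.db"
    return [str(inbox_cli), "--db", str(db_path), "--json", "init"]


if __name__ == "__main__":
    # Example values only; a real run substitutes its own paths.
    print(build_init_command("/skills/inbox", "/tmp/council-run"))
```

The returned list can be passed to `subprocess.run` by a test runner that wants to avoid shell quoting issues.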

## How An Agent Runs These Cases

Use one test-runner agent to execute each case.

The test-runner agent is responsible for:

- reading this `README.md` first, then one specific case file
- creating an isolated temporary directory and DB path for that run
- initializing the DB once through the bundled inbox CLI before launching role agents
- creating any required temporary target file or Git repo fixture before launching role agents
- launching the role agents described in `Agent Topology`
- injecting `skills/council-review/` into the leader and `skills/inbox/` into reviewers
- passing each role agent the prompt text from the case file with concrete values substituted for `COUNCIL_SKILL_PATH`, `INBOX_SKILL_PATH`, `TMPDIR`, `RUN_ID`, `THREAD_ID`, and `REPORT_PATH` when needed
- coordinating launch order or parallel start according to the case file
- collecting agent final summaries as evidence
- resolving final run ids, thread ids, and report artifact paths from agent outputs
- running the `Validation Commands` from the main thread after the role agents stop
- comparing the observed results against `Expected Outcomes` and `Assertions`
- returning a final pass/fail judgment with concrete evidence
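
The placeholder substitution step from the list above might look like the following. It is a sketch under assumptions: the helper name `substitute_placeholders` is hypothetical, and the choice to leave a placeholder untouched when no value is supplied is an illustrative reading of "when needed", not documented behavior.

```python
from __future__ import annotations

import re

# Placeholder names taken from the list above.
PLACEHOLDERS = (
    "COUNCIL_SKILL_PATH",
    "INBOX_SKILL_PATH",
    "TMPDIR",
    "RUN_ID",
    "THREAD_ID",
    "REPORT_PATH",
)

# Word boundaries prevent matches inside longer identifiers such as MY_RUN_ID.
_PLACEHOLDER_RE = re.compile(r"\b(" + "|".join(PLACEHOLDERS) + r")\b")


def substitute_placeholders(prompt: str, values: dict[str, str]) -> str:
    """Replace each known placeholder that has a value; leave the rest as-is."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        return values.get(name, match.group(0))

    return _PLACEHOLDER_RE.sub(replace, prompt)


if __name__ == "__main__":
    template = "INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json init"
    print(substitute_placeholders(
        template,
        {"INBOX_SKILL_PATH": "/skills/inbox", "TMPDIR": "/tmp/council-run"},
    ))
```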

The role agents are responsible for:

- acting only within the role assigned in the case file
- using the injected skill bundle rather than ad hoc repository discovery
- coordinating through the bundled CLI and shared DB
- reporting concrete run ids, thread ids, report artifact paths, and key command outcomes back to the test-runner agent

The test-runner agent should treat a case as passed only when:

- all role agents reach a final state without violating the case contract
- the independent validation commands succeed
- the final council, orch, and inbox state matches the assertions in the case file

The test-runner agent should treat a case as failed when:

- any required agent times out or stalls
- a required council, orch, or inbox action is skipped
- the leader falls back to ordinary chat for workflow control that should go through the bundled council-review skill
- reviewer agents fall back to ordinary chat instead of returning results through inbox
- the final council grouping, summary, or report state conflicts with the documented assertions

The test-runner agent should report results in this shape:

- `case`
- `db_path`
- `run_id`
- `thread_ids`
- `report_paths`
- `result`: `pass` or `fail`
- `agent_summaries`
- `validation_evidence`
- `assertion_checklist`
- `notes`
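
One way to hold a result in this shape is a small dataclass. The field names come from the list above; the class name `CaseResult`, the field types, and the default for `notes` are assumptions made for illustration, since this README specifies names only.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass


@dataclass
class CaseResult:
    """Result record for one case; field order follows the list above."""

    case: str
    db_path: str
    run_id: str
    thread_ids: list[str]                 # assumed: one id per reviewer thread
    report_paths: list[str]               # assumed: paths to report artifacts
    result: str                           # "pass" or "fail"
    agent_summaries: dict[str, str]       # assumed: role name -> final summary
    validation_evidence: dict[str, str]   # assumed: command -> captured output
    assertion_checklist: dict[str, bool]  # assumed: assertion -> held or not
    notes: str = ""

    def __post_init__(self) -> None:
        # Only the two documented result values are accepted.
        if self.result not in ("pass", "fail"):
            raise ValueError(f"result must be 'pass' or 'fail', got {self.result!r}")


if __name__ == "__main__":
    example = CaseResult(
        case="council-wait-timeout-through-bundled-cli",
        db_path="/tmp/council-run/coord.db",
        run_id="example-run",
        thread_ids=[],
        report_paths=[],
        result="fail",
        agent_summaries={},
        validation_evidence={},
        assertion_checklist={},
        notes="illustrative values only",
    )
    print(json.dumps(asdict(example), indent=2))
```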

## Default Timeouts

Use these defaults unless a case file explicitly overrides them:

- per-agent timeout: `4m`
- overall scenario timeout: `6m`
- async wait margin for the main thread: `45s`
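
A test runner needs these defaults as numbers and has to honor per-case overrides. The sketch below shows one way to do that. The key names (`per_agent`, `overall_scenario`, `async_wait_margin`) and both function names are hypothetical, and only the `s` and `m` suffixes used in this README are handled.

```python
from __future__ import annotations

import re

# Defaults copied from the list above; key names are illustrative.
DEFAULT_TIMEOUTS = {
    "per_agent": "4m",
    "overall_scenario": "6m",
    "async_wait_margin": "45s",
}

_UNIT_SECONDS = {"s": 1, "m": 60}


def parse_duration(text: str) -> int:
    """Convert a duration such as '4m' or '45s' into whole seconds."""
    match = re.fullmatch(r"(\d+)([sm])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def resolve_timeouts(overrides: dict[str, str] | None = None) -> dict[str, int]:
    """Merge case-level overrides over the defaults and return seconds."""
    merged = {**DEFAULT_TIMEOUTS, **(overrides or {})}
    return {name: parse_duration(value) for name, value in merged.items()}


if __name__ == "__main__":
    print(resolve_timeouts())
    print(resolve_timeouts({"per_agent": "90s"}))
```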

## Default Failure Conditions

Treat the test as failed if any of the following happens:

- any required agent does not reach a final state before timeout
- any required council, orch, or inbox command returns a non-success result unless the case expects that failure
- the final `council report --json` output does not match the expected grouped recommendations
- the final `orch status` output does not match the expected reviewer task state
- a required Markdown report artifact is missing when the case expects one
- the agents fall back to ordinary chat for critical coordination instead of the bundled CLIs

## Evidence Capture

Collect at least the following artifacts for every run:

- agent final summaries
- final `council report --json` output when the case reaches report stage
- final `orch status --run RUN_ID --json` output
- final `inbox show --thread THREAD_ID --json` output for every relevant reviewer thread when reviewers participated
- any `council wait` or `council tally` output relevant to the case
- the temporary DB path, resolved run id, resolved thread ids, and any report artifact paths
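
The two fully specified evidence commands in the list above can be assembled per run as shown below. This is a sketch only: the function name and the `orch_cli` / `inbox_cli` parameters are hypothetical, and any DB-selection flags are left out because this README does not show them for these commands. The `council report`, `council wait`, and `council tally` commands are also left out because their full argument lists are not given here.

```python
from __future__ import annotations


def build_evidence_commands(
    run_id: str,
    thread_ids: list[str],
    orch_cli: str = "./assets/orch",
    inbox_cli: str = "./assets/inbox",
) -> list[list[str]]:
    """Return argv lists for the status and per-thread evidence commands."""
    # orch status --run RUN_ID --json
    commands = [[orch_cli, "status", "--run", run_id, "--json"]]
    # inbox show --thread THREAD_ID --json, once per reviewer thread
    for thread_id in thread_ids:
        commands.append([inbox_cli, "show", "--thread", thread_id, "--json"])
    return commands


if __name__ == "__main__":
    for argv in build_evidence_commands("example-run", ["thread-a", "thread-b"]):
        print(" ".join(argv))
```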

## Cleanup Policy

Use these defaults unless a case file explicitly overrides them:

- keep the temporary DB, repo fixture, and working directory on failure for debugging
- clean up the temporary working directory on success only if the caller does not need replay artifacts
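
The default policy reduces to a single condition: remove the working directory only when the case passed and nobody needs the artifacts. A minimal sketch, assuming the DB and any repo fixture live inside the working directory; the function and parameter names are hypothetical.

```python
from __future__ import annotations

import os
import shutil
import tempfile


def apply_cleanup_policy(
    workdir: str, passed: bool, keep_replay_artifacts: bool
) -> bool:
    """Remove workdir only on success when replay artifacts are not needed.

    Returns True if the directory was removed, False if it was kept.
    """
    if not passed or keep_replay_artifacts:
        # Failures always keep their artifacts for debugging.
        return False
    shutil.rmtree(workdir)
    return True


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp(prefix="council-case-")
    # A failed run keeps its directory.
    print(apply_cleanup_policy(demo_dir, passed=False, keep_replay_artifacts=False))
    print(os.path.isdir(demo_dir))
    # A passing run with no replay need removes it.
    print(apply_cleanup_policy(demo_dir, passed=True, keep_replay_artifacts=False))
```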

## Per-Case Template

Each case file should use this structure:

- `Test Type`
- `Purpose`
- `Preconditions`
- `Agent Topology`
- `Inputs`
- `Execution Parameters`
- `Execution Steps`
- `Validation Commands`
- `Expected Outcomes`
- `Assertions`
- `Cleanup`
- `Recorded Example Run` when a real run has already been captured
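
A case file can be checked against this template with a short script. The sketch below assumes each section appears as a Markdown heading at any level, which this README does not state. The function name `missing_sections` is hypothetical, and `Recorded Example Run` is treated as optional because it applies only once a real run has been captured.

```python
from __future__ import annotations

import re

# Section names copied from the template above.
REQUIRED_SECTIONS = (
    "Test Type",
    "Purpose",
    "Preconditions",
    "Agent Topology",
    "Inputs",
    "Execution Parameters",
    "Execution Steps",
    "Validation Commands",
    "Expected Outcomes",
    "Assertions",
    "Cleanup",
)

# Matches ATX headings such as "## Purpose" (assumed heading style).
_HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(markdown_text: str) -> list[str]:
    """Return required section names that have no heading, in template order."""
    headings = {m.group(1).strip() for m in _HEADING_RE.finditer(markdown_text)}
    return [name for name in REQUIRED_SECTIONS if name not in headings]


if __name__ == "__main__":
    sample = "# Some Case\n\n## Purpose\n\ntext\n\n## Inputs\n"
    print(missing_sections(sample))
```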

## Case Files

| Case Slug | File | Coverage Note |
| --- | --- | --- |
| `council-brainstorm-end-to-end-through-bundled-cli` | [council-brainstorm-end-to-end-through-bundled-cli.md](./council-brainstorm-end-to-end-through-bundled-cli.md) | validates that the council-review skill can drive `start -> wait -> tally -> report` with three real reviewer agents |
| `council-unanimous-only-default-report-through-bundled-cli` | [council-unanimous-only-default-report-through-bundled-cli.md](./council-unanimous-only-default-report-through-bundled-cli.md) | validates that unanimous-only runs default to `consensus` output while preserving the underlying summary counts |
| `council-wait-timeout-through-bundled-cli` | [council-wait-timeout-through-bundled-cli.md](./council-wait-timeout-through-bundled-cli.md) | validates that the leader sees the expected timeout contract when reviewer tasks do not complete |
| `council-report-rejects-before-tally-through-bundled-cli` | [council-report-rejects-before-tally-through-bundled-cli.md](./council-report-rejects-before-tally-through-bundled-cli.md) | validates that the skill surfaces the stable invalid-state error when report is attempted before tally |
| `council-report-show-all-includes-minority-through-bundled-cli` | [council-report-show-all-includes-minority-through-bundled-cli.md](./council-report-show-all-includes-minority-through-bundled-cli.md) | validates that an explicit `--show all` report includes the otherwise hidden minority group |
| `council-report-rejects-invalid-show-through-bundled-cli` | [council-report-rejects-invalid-show-through-bundled-cli.md](./council-report-rejects-invalid-show-through-bundled-cli.md) | validates that the leader sees the stable `invalid_input` contract for an invalid report bucket selection |
| `council-tally-strict-keeps-distinct-proposals-through-bundled-cli` | [council-tally-strict-keeps-distinct-proposals-through-bundled-cli.md](./council-tally-strict-keeps-distinct-proposals-through-bundled-cli.md) | validates that strict similarity preserves near-duplicate wording as separate minority groups |
| `council-reviewer-output-invalid-json-fails-tally-through-bundled-cli` | [council-reviewer-output-invalid-json-fails-tally-through-bundled-cli.md](./council-reviewer-output-invalid-json-fails-tally-through-bundled-cli.md) | validates that malformed reviewer result JSON reaches the leader as the stable tally-time `invalid_input` contract |
| `council-start-with-target-file-through-bundled-cli` | [council-start-with-target-file-through-bundled-cli.md](./council-start-with-target-file-through-bundled-cli.md) | validates that the skill can start a council run from explicit `--target-file` context instead of a pure inline prompt |

## Scope

In scope:

- explicit `$council-review` skill invocation
- bundled `./assets/orch` CLI usage for `orch council ...`
- end-to-end council start, wait, tally, and report flows
- interaction between a leader using `skills/council-review/` and reviewers using `skills/inbox/`
- default report policy, explicit minority inclusion, and invalid report-filter validation
- normal and strict tally behavior
- malformed reviewer-output failure paths
- non-prompt target context including `target-file`

Out of scope:

- per-command flag and JSON contract coverage for `orch council`
- generic leader orchestration flows that already belong under [../orch-skill/](../orch-skill/)
- worker-only skill behavior that belongs under [../inbox-skill/](../inbox-skill/)
- implicit skill triggering without `$council-review`

## Relationship To Other Test Docs

- [../orch/](../orch/) covers CLI command behavior
- [../orch-skill/](../orch-skill/) covers generic leader-side orchestration behavior on top of `orch`
- [../inbox-skill/](../inbox-skill/) covers worker-side skill-guided behavior on top of inbox
- this directory covers the separate user-facing `council-review` skill on top of `orch council`