Council Review Skill Test Plan

Purpose

This directory tracks human-readable test plans for the skills/council-review/ Codex skill bundle.

These documents are not command-contract specs for the orch council CLI itself. That coverage already lives under ../orch/.

This directory exists to describe a different test surface:

  • whether a leader agent can actually use the packaged council-review skill
  • whether the bundled ./assets/orch CLI works inside real skill-guided council workflows
  • whether a council run driven by the skill reaches the expected reviewer, grouping, tally, and report state

Test Model

  • README.md is the index for this directory
  • each skill test case lives in its own Markdown file
  • use stable case slugs in filenames

Shared Execution Contract

Use these defaults unless a case file explicitly overrides them:

  • run the scenario with real subagents, not simulated transcripts
  • inject skills/council-review/ into the leader agent
  • inject skills/inbox/ into reviewer agents whenever reviewer task completion is required
  • before launching role agents, initialize the shared SQLite DB by running INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json init
  • require the leader to coordinate through the bundled ./assets/orch CLI from the council-review skill instead of ordinary chat
  • require reviewer agents to coordinate through the bundled ./assets/inbox CLI from their skill instead of ordinary chat
  • validate final council run, reviewer task state, and report state independently from the main thread after the agents stop
  • create any required repo fixture before launching agents for mixed or repo-target cases
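
The DB initialization step above can be sketched as a small helper for the test-runner. This is a minimal sketch, not part of the skill bundle: the function names and the 60-second timeout are assumptions, while the argument order mirrors the command given in the contract.

```python
import subprocess


def build_init_command(inbox_skill_path: str, tmpdir: str) -> list[str]:
    """Return the argv for the one-time DB init described in the contract."""
    inbox_cli = f"{inbox_skill_path.rstrip('/')}/assets/inbox"
    db_path = f"{tmpdir.rstrip('/')}/coord.db"
    return [inbox_cli, "--db", db_path, "--json", "init"]


def init_shared_db(inbox_skill_path: str, tmpdir: str,
                   timeout_s: float = 60.0) -> subprocess.CompletedProcess:
    """Run the init command once, before any role agent is launched.

    timeout_s is an assumed value; check=True raises on a non-zero exit.
    """
    return subprocess.run(
        build_init_command(inbox_skill_path, tmpdir),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
```

Keeping the argv builder separate from the call makes the exact command easy to log as evidence.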

How An Agent Runs These Cases

Use one test-runner agent to execute each case.

The test-runner agent is responsible for:

  • reading this README.md first, then one specific case file
  • creating an isolated temporary directory and DB path for that run
  • initializing the DB once through the bundled inbox CLI before launching role agents
  • creating any required temporary Git repo fixture before launching role agents
  • launching the role agents described in Agent Topology
  • injecting skills/council-review/ into the leader and skills/inbox/ into reviewers
  • passing each role agent the prompt text from the case file with concrete values substituted for COUNCIL_SKILL_PATH, INBOX_SKILL_PATH, TMPDIR, RUN_ID, THREAD_ID, and REPORT_PATH when needed
  • coordinating launch order or parallel start according to the case file
  • collecting agent final summaries as evidence
  • resolving final run ids, thread ids, and report artifact paths from agent outputs
  • running the Validation Commands from the main thread after the role agents stop
  • comparing the observed results against Expected Outcomes and Assertions
  • returning a final pass/fail judgment with concrete evidence
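
Substituting concrete values into a case prompt could look like the sketch below. It assumes the placeholders appear in case files as bare uppercase tokens, as they do in this README; the function name is hypothetical. Placeholders without a supplied value are left untouched, which matches the "when needed" wording above.

```python
import re

# Placeholder names taken from this README.
PLACEHOLDERS = (
    "COUNCIL_SKILL_PATH",
    "INBOX_SKILL_PATH",
    "TMPDIR",
    "RUN_ID",
    "THREAD_ID",
    "REPORT_PATH",
)

# Word boundaries keep e.g. MY_TMPDIR from being rewritten by accident.
_PLACEHOLDER_RE = re.compile(r"\b(" + "|".join(PLACEHOLDERS) + r")\b")


def substitute_placeholders(prompt: str, values: dict[str, str]) -> str:
    """Replace known placeholders with concrete values; keep the rest as-is."""
    def replace(match: re.Match) -> str:
        return values.get(match.group(1), match.group(0))

    return _PLACEHOLDER_RE.sub(replace, prompt)
```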

The role agents are responsible for:

  • acting only within the role assigned in the case file
  • using the injected skill bundle rather than ad hoc repository discovery
  • coordinating through the bundled CLI and shared DB
  • reporting concrete run ids, thread ids, report artifact paths, and key command outcomes back to the test-runner agent

The test-runner agent should treat a case as passed only when:

  • all role agents reach a final state without violating the case contract
  • the independent validation commands succeed
  • the final council, orch, and inbox state matches the assertions in the case file

The test-runner agent should treat a case as failed when:

  • any required agent times out or stalls
  • a required council, orch, or inbox action is skipped
  • the leader falls back to ordinary chat for workflow control that should go through the bundled council-review skill
  • reviewer agents fall back to ordinary chat instead of returning results through inbox
  • the final council grouping, summary, or report state conflicts with the documented assertions
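
The pass and fail rules above reduce to a single decision over what the test-runner observed. The sketch below is one possible encoding; the `CaseObservation` fields are hypothetical names, not something the skill bundle defines.

```python
from dataclasses import dataclass, field


@dataclass
class CaseObservation:
    """What the test-runner observed for one case (field names are assumptions)."""
    agents_reached_final_state: bool
    validation_commands_succeeded: bool
    # e.g. "leader used ordinary chat for workflow control", "skipped tally"
    contract_violations: list[str] = field(default_factory=list)
    # assertions from the case file that did not hold
    failed_assertions: list[str] = field(default_factory=list)


def judge_case(obs: CaseObservation) -> tuple[str, list[str]]:
    """Return ("pass" | "fail", reasons). Any single reason fails the case."""
    reasons: list[str] = []
    if not obs.agents_reached_final_state:
        reasons.append("a required agent timed out or stalled")
    reasons.extend(obs.contract_violations)
    if not obs.validation_commands_succeeded:
        reasons.append("independent validation commands did not succeed")
    reasons.extend(f"assertion failed: {a}" for a in obs.failed_assertions)
    return ("pass" if not reasons else "fail", reasons)
```

Returning the reasons alongside the verdict gives the final report concrete evidence for a failure.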

The test-runner agent should report results in this shape:

  • case
  • db_path
  • run_id
  • thread_ids
  • report_paths
  • result: pass or fail
  • agent_summaries
  • validation_evidence
  • assertion_checklist
  • notes
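
One way to keep reports in this shape is a small record type that serializes to JSON. The field names come from the list above; the field types, the class name, and the choice of JSON are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CaseResult:
    """Final report for one case; field order follows the README list."""
    case: str
    db_path: str
    run_id: str
    thread_ids: list[str]
    report_paths: list[str]
    result: str  # "pass" or "fail"
    agent_summaries: dict[str, str] = field(default_factory=dict)
    validation_evidence: dict[str, str] = field(default_factory=dict)
    assertion_checklist: dict[str, bool] = field(default_factory=dict)
    notes: str = ""

    def __post_init__(self) -> None:
        if self.result not in ("pass", "fail"):
            raise ValueError(f"result must be 'pass' or 'fail', got {self.result!r}")

    def to_json(self) -> str:
        """Serialize with keys in declaration order."""
        return json.dumps(asdict(self), indent=2)
```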

Default Timeouts

Use these defaults unless a case file explicitly overrides them:

  • per-agent timeout: 4m
  • overall scenario timeout: 6m
  • async wait margin for the main thread: 45s
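
A test-runner needs these values in seconds and needs to apply per-case overrides. The sketch below assumes durations are written as an integer followed by s or m, as in the defaults above; the key names are hypothetical. How the wait margin combines with the other two timeouts is not specified here, so the sketch only resolves the values.

```python
import re
from typing import Optional

# Defaults copied from this README; key names are illustrative.
DEFAULT_TIMEOUTS = {"per_agent": "4m", "overall": "6m", "wait_margin": "45s"}


def parse_duration(text: str) -> int:
    """Convert '4m' or '45s' to whole seconds."""
    match = re.fullmatch(r"(\d+)([sm])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = int(match.group(1)), match.group(2)
    return value * 60 if unit == "m" else value


def resolve_timeouts(overrides: Optional[dict[str, str]] = None) -> dict[str, int]:
    """Merge case-file overrides over the defaults and return seconds."""
    merged = {**DEFAULT_TIMEOUTS, **(overrides or {})}
    return {name: parse_duration(value) for name, value in merged.items()}
```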

Default Failure Conditions

Treat the test as failed if any of the following happens:

  • any required agent does not reach a final state before timeout
  • any required council, orch, or inbox command returns a non-success result unless the case expects that failure
  • the final council report --json output does not match the expected grouped recommendations
  • the final orch status output does not match the expected reviewer task state
  • a required markdown report artifact is missing when the case expects one
  • the agents fall back to ordinary chat for critical coordination instead of the bundled CLIs

Evidence Capture

Collect at least the following artifacts for every run:

  • agent final summaries
  • final council report --json output when the case reaches report stage
  • final orch status --run RUN_ID --json output
  • final inbox show --thread THREAD_ID --json output for every relevant reviewer thread when reviewers participated
  • any council wait or council tally output relevant to the case
  • the temporary DB path, resolved run id, resolved thread ids, and any report artifact paths
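
Before judging a case, the test-runner can check that nothing from this list is missing. The sketch below assumes evidence is gathered into a plain dict; every key name is hypothetical. Council wait and tally output is left out because it is only collected when relevant to the case.

```python
def missing_evidence(collected: dict,
                     reached_report_stage: bool,
                     reviewer_thread_ids: list[str]) -> list[str]:
    """Return the names of required artifacts that were not captured."""
    required = ["agent_summaries", "orch_status_json", "db_path", "run_id"]
    if reached_report_stage:
        # Only required when the case actually reaches the report stage.
        required.append("council_report_json")
    missing = [name for name in required if not collected.get(name)]

    # One `inbox show` capture is expected per participating reviewer thread.
    shown = collected.get("inbox_show_json", {})
    missing.extend(
        f"inbox_show_json[{thread_id}]"
        for thread_id in reviewer_thread_ids
        if thread_id not in shown
    )
    return missing
```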

Cleanup Policy

Use these defaults unless a case file explicitly overrides them:

  • keep the temporary DB, repo fixture, and working directory on failure for debugging
  • clean up the temporary working directory on success only if the caller does not need replay artifacts
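
The default policy can be expressed as a single predicate plus a guarded delete. This is a sketch with hypothetical function names; it assumes the DB and repo fixture live inside the temporary working directory.

```python
import shutil
from pathlib import Path


def should_cleanup(result: str, caller_needs_replay: bool) -> bool:
    """Delete only after a pass, and only if nobody needs replay artifacts."""
    return result == "pass" and not caller_needs_replay


def apply_cleanup_policy(workdir: Path, result: str,
                         caller_needs_replay: bool) -> bool:
    """Remove workdir when the policy allows it; return True if removed."""
    if should_cleanup(result, caller_needs_replay):
        shutil.rmtree(workdir, ignore_errors=True)
        return True
    return False
```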

Per-Case Template

Each case file should use this structure:

  • Test Type
  • Purpose
  • Preconditions
  • Agent Topology
  • Inputs
  • Execution Parameters
  • Execution Steps
  • Validation Commands
  • Expected Outcomes
  • Assertions
  • Cleanup
  • Recorded Example Run when a real run has already been captured
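
A quick structural check can catch case files that drift from this template. The sketch below assumes each section is a Markdown heading whose text matches the names above exactly; that heading convention and the function name are assumptions.

```python
import re

# Section names copied from the template; Recorded Example Run is optional.
REQUIRED_SECTIONS = (
    "Test Type",
    "Purpose",
    "Preconditions",
    "Agent Topology",
    "Inputs",
    "Execution Parameters",
    "Execution Steps",
    "Validation Commands",
    "Expected Outcomes",
    "Assertions",
    "Cleanup",
)

_HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(case_markdown: str) -> list[str]:
    """Return required section names with no matching heading, in template order."""
    headings = {m.group(1) for m in _HEADING_RE.finditer(case_markdown)}
    return [name for name in REQUIRED_SECTIONS if name not in headings]
```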

Case Files

| Case Slug | File | Coverage Note |
| --- | --- | --- |
| council-brainstorm-end-to-end-through-bundled-cli | council-brainstorm-end-to-end-through-bundled-cli.md | validates that the council-review skill can drive start -> wait -> tally -> report with three real reviewer agents |
| council-unanimous-only-default-report-through-bundled-cli | council-unanimous-only-default-report-through-bundled-cli.md | validates that unanimous-only runs default to consensus output while preserving the underlying summary counts |
| council-wait-timeout-through-bundled-cli | council-wait-timeout-through-bundled-cli.md | validates that the leader sees the expected timeout contract when reviewer tasks do not complete |
| council-report-rejects-before-tally-through-bundled-cli | council-report-rejects-before-tally-through-bundled-cli.md | validates that the skill surfaces the stable invalid-state error when report is attempted before tally |

Scope

In scope:

  • explicit $council-review skill invocation
  • bundled ./assets/orch CLI usage for orch council ...
  • end-to-end council start, wait, tally, and report flows
  • interaction between a leader using skills/council-review/ and reviewers using skills/inbox/
  • default report policy, unanimous-only behavior, and timeout/error-path validation

Out of scope:

  • per-command flag and JSON contract coverage for orch council
  • generic leader orchestration flows that already belong under ../orch-skill/
  • worker-only skill behavior that belongs under ../inbox-skill/
  • implicit skill triggering without $council-review

Relationship To Other Test Docs

  • ../orch/ covers CLI command behavior
  • ../orch-skill/ covers generic leader-side orchestration behavior on top of orch
  • ../inbox-skill/ covers worker-side skill-guided behavior on top of inbox
  • this directory covers the separate user-facing council-review skill on top of orch council