kurihada/ai-workflow-skill

kurihada 0b533a70f9 Add council-review skill test plan docs

2026-03-19 17:25:40 +08:00

Council Review Skill Test Plan

Purpose

This directory tracks human-readable test plans for the skills/council-review/ Codex skill bundle.

These documents are not command-contract specs for the orch council CLI itself. That coverage already lives under ../orch/.

This directory exists to describe a different test surface:

whether a leader agent can actually use the packaged council-review skill
whether the bundled ./assets/orch CLI works inside real skill-guided council workflows
whether a council run driven by the skill reaches the expected reviewer, grouping, tally, and report state

Test Model

README.md is the index for this directory
each skill test case lives in its own Markdown file
use stable case slugs in filenames

Shared Execution Contract

Use these defaults unless a case file explicitly overrides them:

run the scenario with real subagents, not simulated transcripts
inject skills/council-review/ into the leader agent
inject skills/inbox/ into reviewer agents whenever reviewer task completion is required
initialize the shared SQLite DB before launching role agents by running `INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json init`
require the leader to coordinate through the bundled ./assets/orch CLI from the council-review skill instead of ordinary chat
require reviewer agents to coordinate through the bundled ./assets/inbox CLI from their skill instead of ordinary chat
validate final council run, reviewer task state, and report state independently from the main thread after the agents stop
create any required repo fixture before launching agents for mixed or repo-target cases
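
As a minimal sketch of the DB initialization step in the contract above, a test runner could assemble the init command as an argument list before handing it to a process launcher. The function name `build_init_command` is hypothetical; only the command shape (`INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json init`) comes from this plan.

```python
from pathlib import PurePosixPath


def build_init_command(inbox_skill_path: str, tmpdir: str) -> list[str]:
    """Return the argv for the one-time shared DB initialization.

    Mirrors the command shape given in the Shared Execution Contract.
    The helper name and signature are illustrative, not part of any skill.
    """
    inbox_cli = PurePosixPath(inbox_skill_path) / "assets" / "inbox"
    db_path = PurePosixPath(tmpdir) / "coord.db"
    return [str(inbox_cli), "--db", str(db_path), "--json", "init"]
```

The resulting list can be passed to `subprocess.run(argv, check=True)` once per run, before any role agent starts.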

How An Agent Runs These Cases

Use one test-runner agent to execute each case.

The test-runner agent is responsible for:

reading this README.md first, then one specific case file
creating an isolated temporary directory and DB path for that run
initializing the DB once through the bundled inbox CLI before launching role agents
creating any required temporary Git repo fixture before launching role agents
launching the role agents described in Agent Topology
injecting skills/council-review/ into the leader and skills/inbox/ into reviewers
passing each role agent the prompt text from the case file with concrete values substituted for COUNCIL_SKILL_PATH, INBOX_SKILL_PATH, TMPDIR, RUN_ID, THREAD_ID, and REPORT_PATH when needed
coordinating launch order or parallel start according to the case file
collecting agent final summaries as evidence
resolving final run ids, thread ids, and report artifact paths from agent outputs
running the Validation Commands from the main thread after the role agents stop
comparing the observed results against Expected Outcomes and Assertions
returning a final pass/fail judgment with concrete evidence
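
The placeholder substitution step listed above could look roughly like the following. This is a sketch under assumptions: the placeholder names come from this plan, but the helper `substitute_placeholders` and the choice to leave unresolved placeholders untouched are illustrative.

```python
import re

# Placeholder names taken from this test plan.
PLACEHOLDERS = (
    "COUNCIL_SKILL_PATH",
    "INBOX_SKILL_PATH",
    "TMPDIR",
    "RUN_ID",
    "THREAD_ID",
    "REPORT_PATH",
)

# Word boundaries keep RUN_ID from matching inside a longer identifier.
_PLACEHOLDER_RE = re.compile(r"\b(" + "|".join(PLACEHOLDERS) + r")\b")


def substitute_placeholders(prompt: str, values: dict[str, str]) -> str:
    """Replace known placeholders with concrete values.

    Placeholders without a value are left as-is (an assumption), so the
    runner can detect them later instead of silently dropping them.
    """
    def replace(match: re.Match) -> str:
        name = match.group(1)
        return values.get(name, name)

    return _PLACEHOLDER_RE.sub(replace, prompt)
```

Using a replacement function rather than a replacement string avoids backslash-escape surprises when a substituted path contains backslashes.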

The role agents are responsible for:

acting only within the role assigned in the case file
using the injected skill bundle rather than ad hoc repository discovery
coordinating through the bundled CLI and shared DB
reporting concrete run ids, thread ids, report artifact paths, and key command outcomes back to the test-runner agent

The test-runner agent should treat a case as passed only when:

all role agents reach a final state without violating the case contract
the independent validation commands succeed
the final council, orch, and inbox state matches the assertions in the case file

The test-runner agent should treat a case as failed when:

any required agent times out or stalls
a required council, orch, or inbox action is skipped
the leader falls back to ordinary chat for workflow control that should go through the bundled council-review skill
reviewer agents fall back to ordinary chat instead of returning results through inbox
the final council grouping, summary, or report state conflicts with the documented assertions
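
The pass and fail rules above can be read as a single decision: a case passes only when nothing on the fail list was observed. A minimal sketch follows, assuming the runner has already reduced its observations to the hypothetical `CaseObservation` fields shown here.

```python
from dataclasses import dataclass


@dataclass
class CaseObservation:
    """What the test-runner observed; field names are illustrative."""
    agents_reached_final_state: bool
    contract_violations: list[str]      # e.g. "leader used ordinary chat"
    validation_commands_succeeded: bool
    failed_assertions: list[str]


def judge_case(obs: CaseObservation) -> tuple[str, list[str]]:
    """Return ("pass" | "fail", reasons) following the rules above."""
    reasons: list[str] = []
    if not obs.agents_reached_final_state:
        reasons.append("a required agent timed out or stalled")
    reasons.extend(obs.contract_violations)
    if not obs.validation_commands_succeeded:
        reasons.append("independent validation commands failed")
    reasons.extend(f"assertion failed: {name}" for name in obs.failed_assertions)
    return ("pass" if not reasons else "fail", reasons)
```

Returning the reasons alongside the verdict gives the runner concrete evidence to include in its final report.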

The test-runner agent should report results in this shape:

case
db_path
run_id
thread_ids
report_paths
result: pass or fail
agent_summaries
validation_evidence
assertion_checklist
notes
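
One way to keep the result shape consistent across runs is a small record type. The field names below come from the list above; the field types and the `CaseResult` class itself are assumptions, since this plan does not fix a serialization format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CaseResult:
    """Result record with the fields listed above (types are assumed)."""
    case: str
    db_path: str
    run_id: str
    thread_ids: list[str]
    report_paths: list[str]
    result: str  # "pass" or "fail"
    agent_summaries: dict[str, str] = field(default_factory=dict)
    validation_evidence: dict[str, str] = field(default_factory=dict)
    assertion_checklist: dict[str, bool] = field(default_factory=dict)
    notes: str = ""

    def __post_init__(self) -> None:
        if self.result not in ("pass", "fail"):
            raise ValueError(f"result must be 'pass' or 'fail', got {self.result!r}")

    def to_json(self) -> str:
        # asdict keeps the declared field order, matching the list above.
        return json.dumps(asdict(self), indent=2)
```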

Default Timeouts

Use these defaults unless a case file explicitly overrides them:

per-agent timeout: 4m
overall scenario timeout: 6m
async wait margin for the main thread: 45s

Default Failure Conditions

Treat the test as failed if any of the following happens:

any required agent does not reach a final state before timeout
any required council, orch, or inbox command returns a non-success result unless the case expects that failure
the final council report --json output does not match the expected grouped recommendations
the final orch status output does not match the expected reviewer task state
a required markdown report artifact is missing when the case expects one
the agents fall back to ordinary chat for critical coordination instead of the bundled CLIs

Evidence Capture

Collect at least the following artifacts for every run:

agent final summaries
final council report --json output when the case reaches report stage
final orch status --run RUN_ID --json output
final inbox show --thread THREAD_ID --json output for every relevant reviewer thread when reviewers participated
any council wait or council tally output relevant to the case
the temporary DB path, resolved run id, resolved thread ids, and any report artifact paths
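
The status and thread evidence commands named above can be generated from the resolved ids. This is a sketch only: the subcommands and flags are the ones quoted in this section, while the helper name is hypothetical and any further global flags a real invocation may need (for example a DB path) are left out because this plan does not spell them out for these commands.

```python
def evidence_commands(
    orch_cli: str, inbox_cli: str, run_id: str, thread_ids: list[str]
) -> list[list[str]]:
    """Build argv lists for the per-run evidence commands.

    Covers `orch status --run RUN_ID --json` and one
    `inbox show --thread THREAD_ID --json` per reviewer thread.
    """
    commands = [[orch_cli, "status", "--run", run_id, "--json"]]
    for thread_id in thread_ids:
        commands.append([inbox_cli, "show", "--thread", thread_id, "--json"])
    return commands
```

Capturing each command's output next to the resolved ids keeps the evidence replayable after the agents have stopped.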

Cleanup Policy

Use these defaults unless a case file explicitly overrides them:

keep the temporary DB, repo fixture, and working directory on failure for debugging
clean up the temporary working directory on success only if the caller does not need replay artifacts
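
The default cleanup policy reduces to one condition: remove the working directory only for a passing run whose caller does not need replay artifacts. A minimal sketch, with a hypothetical helper name and parameters:

```python
import shutil
from pathlib import Path


def apply_cleanup_policy(workdir: Path, passed: bool, keep_replay_artifacts: bool) -> bool:
    """Apply the default cleanup policy; return True if workdir was removed.

    Failed runs always keep their DB, repo fixture, and working directory
    so they can be debugged.
    """
    if not passed or keep_replay_artifacts:
        return False
    shutil.rmtree(workdir, ignore_errors=True)
    return True
```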

Per-Case Template

Each case file should use this structure:

Test Type
Purpose
Preconditions
Agent Topology
Inputs
Execution Parameters
Execution Steps
Validation Commands
Expected Outcomes
Assertions
Cleanup
Recorded Example Run (only when a real run has already been captured)
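
A case file can be checked against this template mechanically. The sketch below assumes each section appears as a Markdown ATX heading (for example `## Purpose`); that heading style is an assumption about the case files, not something this plan states, and `missing_sections` is a hypothetical helper.

```python
import re

# Required sections from the template above; Recorded Example Run is optional.
REQUIRED_SECTIONS = [
    "Test Type", "Purpose", "Preconditions", "Agent Topology", "Inputs",
    "Execution Parameters", "Execution Steps", "Validation Commands",
    "Expected Outcomes", "Assertions", "Cleanup",
]

_HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(case_markdown: str) -> list[str]:
    """Return required section titles that have no matching heading."""
    headings = {m.group(1) for m in _HEADING_RE.finditer(case_markdown)}
    return [title for title in REQUIRED_SECTIONS if title not in headings]
```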

Case Files

Case Slug	File	Coverage Note
`council-brainstorm-end-to-end-through-bundled-cli`	council-brainstorm-end-to-end-through-bundled-cli.md	validates that the council-review skill can drive `start -> wait -> tally -> report` with three real reviewer agents
`council-unanimous-only-default-report-through-bundled-cli`	council-unanimous-only-default-report-through-bundled-cli.md	validates that unanimous-only runs default to `consensus` output while preserving the underlying summary counts
`council-wait-timeout-through-bundled-cli`	council-wait-timeout-through-bundled-cli.md	validates that the leader sees the expected timeout contract when reviewer tasks do not complete
`council-report-rejects-before-tally-through-bundled-cli`	council-report-rejects-before-tally-through-bundled-cli.md	validates that the skill surfaces the stable invalid-state error when report is attempted before tally
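
The table above pairs each slug with a file named after it. A small consistency check could guard that convention as cases are added. The lowercase kebab-case pattern is an assumption about what a stable slug means here, and `index_problems` is a hypothetical helper.

```python
import re

# Slug -> file pairs copied from the table above.
CASE_FILES = {
    "council-brainstorm-end-to-end-through-bundled-cli":
        "council-brainstorm-end-to-end-through-bundled-cli.md",
    "council-unanimous-only-default-report-through-bundled-cli":
        "council-unanimous-only-default-report-through-bundled-cli.md",
    "council-wait-timeout-through-bundled-cli":
        "council-wait-timeout-through-bundled-cli.md",
    "council-report-rejects-before-tally-through-bundled-cli":
        "council-report-rejects-before-tally-through-bundled-cli.md",
}

# Assumed slug shape: lowercase words and digits joined by single hyphens.
SLUG_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def index_problems(case_files: dict[str, str]) -> list[str]:
    """Return human-readable problems found in a slug -> filename index."""
    problems = []
    for slug, filename in case_files.items():
        if SLUG_RE.fullmatch(slug) is None:
            problems.append(f"slug is not kebab-case: {slug!r}")
        if filename != f"{slug}.md":
            problems.append(f"file {filename!r} does not match slug {slug!r}")
    return problems
```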

Scope

In scope:

explicit $council-review skill invocation
bundled ./assets/orch CLI usage for orch council ...
end-to-end council start, wait, tally, and report flows
interaction between a leader using skills/council-review/ and reviewers using skills/inbox/
default report policy, unanimous-only behavior, and timeout/error-path validation

Out of scope:

per-command flag and JSON contract coverage for orch council
generic leader orchestration flows that already belong under ../orch-skill/
worker-only skill behavior that belongs under ../inbox-skill/
implicit skill triggering without $council-review

Relationship To Other Test Docs

../orch/ covers CLI command behavior
../orch-skill/ covers generic leader-side orchestration behavior on top of orch
../inbox-skill/ covers worker-side skill-guided behavior on top of inbox
this directory covers the separate user-facing council-review skill on top of orch council
