From b75310231250d353a8ae5c45069c29dcdbcc9019 Mon Sep 17 00:00:00 2001
From: kurihada
Date: Thu, 19 Mar 2026 18:41:10 +0800
Subject: [PATCH] Record council-review skill test evidence

---
 .../council-review-skill-direct-replay.md     | 62 ++++++++++++++++++
 ...review-skill-gap-fill-real-forward-test.md | 65 +++++++++++++++++++
 ...ejects-invalid-show-through-bundled-cli.md | 16 +++++
 ...l-includes-minority-through-bundled-cli.md | 17 +++++
 ...id-json-fails-tally-through-bundled-cli.md | 17 +++++
 ...rt-with-target-file-through-bundled-cli.md | 16 +++++
 ...-distinct-proposals-through-bundled-cli.md | 16 +++++
 7 files changed, 209 insertions(+)
 create mode 100644 docs/roadmaps/archive/council-review-skill-direct-replay.md
 create mode 100644 docs/roadmaps/archive/council-review-skill-gap-fill-real-forward-test.md

diff --git a/docs/roadmaps/archive/council-review-skill-direct-replay.md b/docs/roadmaps/archive/council-review-skill-direct-replay.md
new file mode 100644
index 0000000..b35c753
--- /dev/null
+++ b/docs/roadmaps/archive/council-review-skill-direct-replay.md
@@ -0,0 +1,62 @@
+# Title
+
+Direct Replay Of Council Review Skill Forward Tests
+
+## Status
+
+- `completed`
+
+## Owner
+
+- Codex main agent
+
+## Started At
+
+- `2026-03-19`
+
+## Goal
+
+- Execute the documented `docs/tests/council-review-skill/` forward-test scenarios with real subagents and bundled skill assets.
+- Collect pass/fail outcomes and concrete evidence for the current skill bundle behavior.
+
+## Scope
+
+- Run the current council-review skill test-plan cases against isolated temp DBs.
+- Use `skills/council-review/` for the leader and `skills/inbox/` for reviewers where the case requires reviewer completion.
+- Validate outcomes from the main thread with bundled CLI commands.
+
+## Checklist
+
+- [x] Review the council-review skill test-plan directory and choose execution order.
+- [x] Run `council-report-rejects-before-tally-through-bundled-cli`.
+- [x] Run `council-wait-timeout-through-bundled-cli`.
+- [x] Run `council-brainstorm-end-to-end-through-bundled-cli`.
+- [x] Run `council-unanimous-only-default-report-through-bundled-cli`.
+- [x] Summarize results and archive this execution roadmap.
+
+## Files
+
+- `docs/tests/council-review-skill/README.md`
+- `docs/tests/council-review-skill/*.md`
+- `docs/roadmaps/archive/council-review-skill-direct-replay.md`
+
+## Decisions
+
+- Start with the single-agent error/timeout cases to verify the leader skill behavior before spending time on four-agent end-to-end runs.
+- Keep each case in its own temp directory and DB for isolation.
+
+## Blockers
+
+- none
+
+## Next Step
+
+- If desired, append `Recorded Example Run` sections to the council-review skill case docs using the captured run ids and temp paths from this replay.
+
+## Completion Summary
+
+- `council-report-rejects-before-tally-through-bundled-cli`: passed on `/tmp/council-skill-report-before-tally.AXZn2p/coord.db`; main-thread replay returned exit code `30` with `invalid_state` and the expected “run council tally first” message.
+- `council-wait-timeout-through-bundled-cli`: passed on `/tmp/council-skill-wait-timeout.csirvt/coord.db`; main-thread replay returned `woke == false`, `all_complete == false`, and three visible reviewer statuses while `orch status` showed the run still `running`.
+- `council-brainstorm-end-to-end-through-bundled-cli`: passed on `/tmp/council-skill-e2e.DLaTj6/coord.db`; main-thread validation confirmed `run.status == done`, three reviewer tasks `done`, default report `show == ["consensus","majority"]`, summary counts `1/1/1`, and markdown artifact `/tmp/council-skill-e2e.DLaTj6/.orch/reports/council_skill_001.md`.
+- `council-unanimous-only-default-report-through-bundled-cli`: passed on `/tmp/council-skill-unanimous.MzF1lp/coord.db`; main-thread validation confirmed `run.status == done`, default report `show == ["consensus"]`, preserved summary counts `1/1/1`, and markdown artifact `/tmp/council-skill-unanimous.MzF1lp/.orch/reports/council_skill_002.md`.
+- One reviewer agent in the unanimous-only run had an initial thread-id parsing misstep, but it retried through the bundled inbox CLI and finished successfully; the case still passed under independent main-thread validation.
diff --git a/docs/roadmaps/archive/council-review-skill-gap-fill-real-forward-test.md b/docs/roadmaps/archive/council-review-skill-gap-fill-real-forward-test.md
new file mode 100644
index 0000000..264b9f5
--- /dev/null
+++ b/docs/roadmaps/archive/council-review-skill-gap-fill-real-forward-test.md
@@ -0,0 +1,65 @@
+# Title
+
+Replay New Council Review Skill Gap-Fill Cases With Sub-Agents
+
+## Status
+
+- `completed`
+
+## Owner
+
+- Codex main agent
+
+## Started At
+
+- `2026-03-19`
+
+## Goal
+
+- Execute the five newly added `docs/tests/council-review-skill/` gap-fill cases with real sub-agents and bundled skill assets.
+- Capture concrete pass/fail evidence for each case and record the outcome in the workstream trace.
+
+## Scope
+
+- Run the five new `council-review-skill` case docs with sub-agents rather than direct CLI replay alone.
+- Use `skills/council-review/` for leader roles and `skills/inbox/` for reviewer roles where the case requires reviewer completion.
+- Validate outcomes from the main thread with bundled CLI commands and temp-path evidence.
+
+## Checklist
+
+- [x] Review the relevant roadmap and case docs before execution.
+- [x] Launch sub-agent runners for the five new council-review skill cases.
+- [x] Collect final evidence and determine pass/fail for each case.
+- [x] Update docs or recorded evidence as needed and archive this execution roadmap.
+
+## Files
+
+- `docs/tests/council-review-skill/README.md`
+- `docs/tests/council-review-skill/council-report-show-all-includes-minority-through-bundled-cli.md`
+- `docs/tests/council-review-skill/council-report-rejects-invalid-show-through-bundled-cli.md`
+- `docs/tests/council-review-skill/council-tally-strict-keeps-distinct-proposals-through-bundled-cli.md`
+- `docs/tests/council-review-skill/council-reviewer-output-invalid-json-fails-tally-through-bundled-cli.md`
+- `docs/tests/council-review-skill/council-start-with-target-file-through-bundled-cli.md`
+- `docs/roadmaps/archive/council-review-skill-gap-fill-real-forward-test.md`
+
+## Decisions
+
+- Use sub-agents as the execution surface because the user explicitly asked for sub-agent-based testing.
+- Group the five cases into a few parallel runners to balance throughput against coordination overhead.
+- Prefer the documented forward-test model first; use main-thread validation commands to independently confirm the reported outcome.
+
+## Blockers
+
+- initial double-case runners were too broad: leader sub-agents spent time on repository process discovery instead of immediately running the documented bundled-CLI steps
+- nested role-agent shell startup needed the narrower `codex exec --dangerously-bypass-approvals-and-sandbox` workaround before the local bundled CLI commands could start reliably
+
+## Next Step
+
+- Commit or otherwise preserve the recorded real-forward evidence if the user wants the updated case docs saved in Git history.
+
+## Completion Summary
+
+- All five newly added `council-review-skill` cases passed under real sub-agent execution with isolated temp DBs and bundled skill assets.
+- Main-thread validation independently confirmed the critical assertions for `target-file`, `show all`, invalid `--show`, `strict` tally semantics, and malformed-reviewer JSON failure at tally time.
+- Added `Recorded Real Forward Run` sections to the five case docs with concrete temp paths, run ids, thread ids, and validation summaries.
+- The final successful runs used narrower role prompts that explicitly forbade repo discovery or roadmap work before executing the bundled CLI workflow steps.
diff --git a/docs/tests/council-review-skill/council-report-rejects-invalid-show-through-bundled-cli.md b/docs/tests/council-review-skill/council-report-rejects-invalid-show-through-bundled-cli.md
index 8c9ab0c..65b2108 100644
--- a/docs/tests/council-review-skill/council-report-rejects-invalid-show-through-bundled-cli.md
+++ b/docs/tests/council-review-skill/council-report-rejects-invalid-show-through-bundled-cli.md
@@ -84,3 +84,19 @@ COUNCIL_SKILL_PATH/assets/orch --db TMPDIR/coord.db --json council report --run
 
 - use the default cleanup policy from [README.md](./README.md)
 - if the run fails, retain `TMPDIR` and `coord.db` for replay and manual inspection
+
+## Recorded Real Forward Run
+
+- recorded on: `2026-03-19`
+- execution mode: `real_subagent_forward_test`
+- result: `pass`
+- evidence root: `/tmp/council-skill-invalid-show-narrow.Sw6so6`
+- observed run id: `council_skill_006`
+- observed thread ids:
+- `architecture-reviewer`: `thr_7fad634dd9d245239d4fbd2287992d54`
+- `implementation-reviewer`: `thr_fc76cff125f04fc491064b828a18ff69`
+- `risk-reviewer`: `thr_f421bf49fa1240beb5c7a2d5f38aab6b`
+- evidence summary:
+- main-thread `status --run council_skill_006 --json` returned `run.status == "done"` and `task_counts.done == 3`
+- main-thread `council report --run council_skill_006 --show consensus,invalid --json` exited with code `30`
+- the returned error payload was `invalid_input` with message `show must contain consensus, majority, minority, or all`
diff --git a/docs/tests/council-review-skill/council-report-show-all-includes-minority-through-bundled-cli.md b/docs/tests/council-review-skill/council-report-show-all-includes-minority-through-bundled-cli.md
index 1407c8b..66d8a5d 100644
--- a/docs/tests/council-review-skill/council-report-show-all-includes-minority-through-bundled-cli.md
+++ b/docs/tests/council-review-skill/council-report-show-all-includes-minority-through-bundled-cli.md
@@ -88,3 +88,20 @@ test -f REPORT_PATH
 
 - use the default cleanup policy from [README.md](./README.md)
 - if the run fails, retain `TMPDIR` and `coord.db` for replay and manual inspection
+
+## Recorded Real Forward Run
+
+- recorded on: `2026-03-19`
+- execution mode: `real_subagent_forward_test`
+- result: `pass`
+- evidence root: `/tmp/council-skill-show-all-narrow.Uk0ThB`
+- observed run id: `council_skill_005`
+- observed thread ids:
+- `architecture-reviewer`: `thr_c4cb0a9a5dd142619e854fc0f3864ea8`
+- `implementation-reviewer`: `thr_3a54f2e1bc6945f38627958f7f6b4728`
+- `risk-reviewer`: `thr_16765453dedf45b4a6ccf4ecfab710db`
+- observed report path: `/tmp/council-skill-show-all-narrow.Uk0ThB/.orch/reports/council_skill_005.md`
+- evidence summary:
+- main-thread `status --run council_skill_005 --json` returned `run.status == "done"` and `task_counts.done == 3`
+- main-thread `council report --run council_skill_005 --show all --json` returned `show == ["consensus","majority","minority"]`, summary counts `1/1/1`, and `grouped_recommendations` length `3`
+- the returned groups included a `minority` bucket and the markdown artifact existed on disk
diff --git a/docs/tests/council-review-skill/council-reviewer-output-invalid-json-fails-tally-through-bundled-cli.md b/docs/tests/council-review-skill/council-reviewer-output-invalid-json-fails-tally-through-bundled-cli.md
index aada6ad..51a0a7a 100644
--- a/docs/tests/council-review-skill/council-reviewer-output-invalid-json-fails-tally-through-bundled-cli.md
+++ b/docs/tests/council-review-skill/council-reviewer-output-invalid-json-fails-tally-through-bundled-cli.md
@@ -107,3 +107,20 @@ COUNCIL_SKILL_PATH/assets/orch --db TMPDIR/coord.db --json council tally --run c
 
 - use the default cleanup policy from [README.md](./README.md)
 - if the run fails, retain `TMPDIR` and `coord.db` for replay and manual inspection
+
+## Recorded Real Forward Run
+
+- recorded on: `2026-03-19`
+- execution mode: `real_subagent_forward_test`
+- result: `pass`
+- evidence root: `/tmp/council-reviewer-output-invalid-json-fails-tally-through-bundled-cli.narrow1.i6ZP98`
+- observed run id: `council_skill_008`
+- observed thread ids:
+- `architecture-reviewer`: `thr_350c43fdf8a449228b8611ce5114326d`
+- `implementation-reviewer`: `thr_db858b530cb044a7bceeaa417f1cea75`
+- `risk-reviewer`: `thr_1c93381b070c47c49e312039b8343655`
+- evidence summary:
+- main-thread `council wait --run council_skill_008 --timeout-seconds 2 --json` returned `woke == true` and `all_complete == true`
+- main-thread `council tally --run council_skill_008 --similarity normal --json` exited with code `30`
+- the returned error payload was `invalid_input` with message `reviewer output must be valid JSON`
+- this run confirmed the negative path where reviewer tasks are all `done` but tally still fails on stored reviewer-output validation
diff --git a/docs/tests/council-review-skill/council-start-with-target-file-through-bundled-cli.md b/docs/tests/council-review-skill/council-start-with-target-file-through-bundled-cli.md
index f914070..faf5a84 100644
--- a/docs/tests/council-review-skill/council-start-with-target-file-through-bundled-cli.md
+++ b/docs/tests/council-review-skill/council-start-with-target-file-through-bundled-cli.md
@@ -95,3 +95,19 @@ sqlite3 TMPDIR/coord.db "SELECT acceptance_json FROM tasks WHERE run_id = 'counc
 
 - use the default cleanup policy from [README.md](./README.md)
 - if the run fails, retain `TMPDIR`, `brief.md`, and `coord.db` for replay and manual inspection
+
+## Recorded Real Forward Run
+
+- recorded on: `2026-03-19`
+- execution mode: `real_subagent_forward_test`
+- result: `pass`
+- evidence root: `/tmp/council-skill-target-file.ikPOLP`
+- observed run id: `council_skill_009`
+- observed thread ids:
+- `CR1`: `thr_32df58f9b55945b899257f583708b7ef`
+- `CR2`: `thr_c5f8c552cb1240649546df8386be3668`
+- `CR3`: `thr_172eabff13eb48ed9af2deee928a9438`
+- evidence summary:
+- main-thread `status --run council_skill_009 --json` returned three `dispatched` council tasks and a non-terminal run
+- main-thread `sqlite3` validation showed `council_inputs.target_file == "/tmp/council-skill-target-file.ikPOLP/brief.md"` with empty `prompt`, `repo_path`, and `target_task_id`
+- main-thread `sqlite3` validation of `CR1` acceptance JSON showed the same `target_file` persisted into the council task payload
diff --git a/docs/tests/council-review-skill/council-tally-strict-keeps-distinct-proposals-through-bundled-cli.md b/docs/tests/council-review-skill/council-tally-strict-keeps-distinct-proposals-through-bundled-cli.md
index 1b5932f..683ae02 100644
--- a/docs/tests/council-review-skill/council-tally-strict-keeps-distinct-proposals-through-bundled-cli.md
+++ b/docs/tests/council-review-skill/council-tally-strict-keeps-distinct-proposals-through-bundled-cli.md
@@ -102,3 +102,19 @@ COUNCIL_SKILL_PATH/assets/orch --db TMPDIR/coord.db --json council tally --run c
 
 - use the default cleanup policy from [README.md](./README.md)
 - if the run fails, retain `TMPDIR` and `coord.db` for replay and manual inspection
+
+## Recorded Real Forward Run
+
+- recorded on: `2026-03-19`
+- execution mode: `real_subagent_forward_test`
+- result: `pass`
+- evidence root: `/tmp/council-tally-strict-keeps-distinct-proposals-through-bundled-cli.narrow4.UCbqOc`
+- observed run id: `council_skill_007`
+- observed thread ids:
+- `architecture-reviewer`: `thr_9e153f61692b4475a55f5c3068842ea5`
+- `implementation-reviewer`: `thr_abbd9a2961374b13b3d3e27720fe27ab`
+- `risk-reviewer`: `thr_3f2d64211f274f64b606bd8b8c6be5f7`
+- evidence summary:
+- main-thread `council wait --run council_skill_007 --timeout-seconds 2 --json` returned `woke == true` and `all_complete == true`
+- main-thread `council tally --run council_skill_007 --similarity strict --json` returned `similarity == "strict"` and `counts.minority == 3`
+- the returned proposal set preserved all three distinct values, including both `Move API contract definitions into a dedicated module.` and `Move API contract definitions into dedicated module`