Add orch skill forward test evidence

2026-03-19 18:36:31 +08:00
parent d17b5ebfbd
commit e9cbb15c2d
10 changed files with 1036 additions and 0 deletions
@@ -27,6 +27,7 @@ As of now:
- an inbox skill forward-test plan directory now exists under `docs/tests/inbox-skill/`, with a shared execution template and multiple scenario cases
- an orch skill forward-test plan directory now exists under `docs/tests/orch-skill/`, with a shared execution contract and initial leader-side workflow scenarios
- a repo-local replay runner now exists at `scripts/run_orch_skill_forward_tests.sh`, and the five `docs/tests/orch-skill/` cases now include recorded example runs from a bundled-CLI replay captured on `2026-03-19`
- the five `docs/tests/orch-skill/` cases now also include recorded real subagent-forward runs captured on `2026-03-19`, with spawned leader and worker agents using the packaged `skills/orch/` and `skills/inbox/` bundles
- a council-review skill forward-test plan directory now exists under `docs/tests/council-review-skill/`, with a shared execution contract and nine council workflow scenarios covering end-to-end flow, unanimous-only defaults, timeout/before-tally errors, explicit minority reporting, invalid report filters, strict tally semantics, malformed reviewer JSON, and target-file inputs
- an execution-roadmap workflow now exists under `docs/roadmaps/active/` and `docs/roadmaps/archive/` for agent-level work traces and completion archives
- a repo-local `scripts/package_skill_clis.sh` packaging flow now builds bundled skill CLI assets for `inbox`, `orch`, and `council-review`
@@ -0,0 +1,66 @@
# Title
Direct Replay For Orch Skill Cases
## Status
- `completed`
## Owner
- codex
## Started At
- `2026-03-19`
## Goal
- Execute the documented `docs/tests/orch-skill/` scenarios against the bundled `skills/orch/assets/orch` and `skills/inbox/assets/inbox` binaries, capture concrete evidence, and sync the repo docs with the observed results.
## Scope
- add a reusable local runner for the five documented orch-skill scenarios
- run the scenarios and capture per-case evidence
- update the orch-skill docs with recorded runs and note the execution mode
- update the implementation roadmap to reflect the new replay coverage
## Checklist
- [x] Review the orch-skill case docs and bundled CLI surfaces.
- [x] Add a reusable direct replay runner for the five orch-skill scenarios.
- [x] Execute the runner and collect evidence for all five cases.
- [x] Update the orch-skill docs with recorded example runs and execution notes.
- [x] Update the implementation roadmap and archive this execution roadmap.
## Files
- `scripts/run_orch_skill_forward_tests.sh`
- `docs/tests/orch-skill/README.md`
- `docs/tests/orch-skill/leader-run-dispatch-reconcile-through-bundled-cli.md`
- `docs/tests/orch-skill/leader-blocked-answer-resume-through-bundled-cli.md`
- `docs/tests/orch-skill/strict-worktree-dispatch-to-cleanup-through-bundled-cli.md`
- `docs/tests/orch-skill/leader-retries-failed-task-through-bundled-cli.md`
- `docs/tests/orch-skill/leader-reassigns-blocked-task-through-bundled-cli.md`
- `docs/implementation-roadmap.md`
- `docs/roadmaps/archive/orch-skill-direct-replay.md`
## Decisions
- Use direct bundled-CLI replay instead of spawning Codex role agents in this turn, because the current session does not permit sub-agent delegation unless the user explicitly asks for it.
- Keep the replay runner repo-local so the same scenarios can be rerun later without reconstructing the command flow by hand.
## Blockers
- none
## Next Step
- rerun `scripts/run_orch_skill_forward_tests.sh` when the bundled skill binaries or orch-skill case docs change, and add true multi-agent forward coverage later if explicit sub-agent execution is needed
## Completion Summary
- Added `scripts/run_orch_skill_forward_tests.sh` as a reusable direct bundled-CLI replay runner for the five documented orch-skill scenarios.
- Executed the runner on `2026-03-19`; all five scenarios passed and produced per-case JSON evidence under a temporary output root.
- Updated `docs/tests/orch-skill/README.md` plus all five case files with recorded example runs and explicit execution-mode notes.
- Updated `docs/implementation-roadmap.md` to record the new replay runner and captured orch-skill execution evidence.
@@ -0,0 +1,67 @@
# Title
Real Subagent Forward Tests For Orch Skill
## Status
- `completed`
## Owner
- codex
## Started At
- `2026-03-19`
## Goal
- Execute the documented `docs/tests/orch-skill/` scenarios using real spawned role agents with injected `skills/orch/` and `skills/inbox/`, then record concrete pass/fail evidence and sync the repository docs.
## Scope
- validate subagent skill injection for project-local orch and inbox skills
- run the five documented orch-skill forward cases with real leader and worker subagents
- collect main-thread validation evidence and agent summaries
- update the orch-skill docs and implementation roadmap with the real forward-test results
## Checklist
- [x] Re-read the orch-skill shared execution contract and worker skill constraints.
- [x] Validate project-local skill injection with a small spawned-agent probe.
- [x] Execute the five orch-skill cases with real spawned role agents and collect evidence.
- [x] Update the orch-skill docs and implementation roadmap with the real forward-test results.
- [x] Archive this execution roadmap with a completion summary.
## Files
- `docs/tests/orch-skill/README.md`
- `docs/tests/orch-skill/leader-run-dispatch-reconcile-through-bundled-cli.md`
- `docs/tests/orch-skill/leader-blocked-answer-resume-through-bundled-cli.md`
- `docs/tests/orch-skill/strict-worktree-dispatch-to-cleanup-through-bundled-cli.md`
- `docs/tests/orch-skill/leader-retries-failed-task-through-bundled-cli.md`
- `docs/tests/orch-skill/leader-reassigns-blocked-task-through-bundled-cli.md`
- `docs/implementation-roadmap.md`
- `docs/roadmaps/archive/orch-skill-real-forward-test.md`
## Decisions
- Use real spawned role agents per case instead of the direct replay runner, because the user explicitly asked for true tests with subagents.
- Keep the main thread responsible for DB setup, fixture creation, and independent validation so the final judgment does not rely only on role-agent self-reporting.
- Fall back from `fork_context: true` to `fork_context: false` for the real case runs after the first wider-context attempt stalled and mis-executed the worker-side contract in this repo.
- For the longer `retry` and `reassign` cases, keep one leader agent active across staged prompts instead of one long monolithic prompt, because staged execution proved more reliable while still preserving a real agent-owned `orch` flow.
## Blockers
- none
## Next Step
- rerun the same five cases when the packaged skill binaries or case docs change, and consider adding the same real subagent coverage for `council-review` if that surface needs parity
## Completion Summary
- Verified both project-local skill bundles with spawned-agent help-command probes before the real runs.
- Collected successful real subagent evidence for all five orch-skill cases under `/tmp/orch-skill-subagents.J1XWgs`.
- Main-thread validation confirmed all five final successful runs reached the expected `orch` and `inbox` states.
- Updated `docs/tests/orch-skill/README.md`, all five case files, and `docs/implementation-roadmap.md` to record the new real forward-test coverage.
@@ -122,6 +122,26 @@ Use these defaults unless a case file explicitly overrides them:
- keep the temporary DB, repo fixture, and working directory on failure for debugging
- clean up the temporary working directory on success only if the caller does not need replay artifacts
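The retention defaults above can be sketched as a small context manager. This is an illustrative sketch only: the name `case_workdir`, the `keep_artifacts` flag, and the directory prefix are assumptions, not part of the repository's runner.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def case_workdir(keep_artifacts: bool = False):
    """Yield a temporary working directory for one case run.

    On failure (any exception) the directory is always retained for debugging.
    On success it is removed unless the caller asked to keep replay artifacts.
    """
    # Hypothetical prefix; the real runner may name its directories differently.
    workdir = Path(tempfile.mkdtemp(prefix="orch-skill-case."))
    try:
        yield workdir
    except BaseException:
        # Failure: keep the DB, fixture, and working directory for inspection.
        print(f"case failed; artifacts retained at {workdir}")
        raise
    else:
        if not keep_artifacts:
            shutil.rmtree(workdir)
```

A caller would wrap each case in `with case_workdir() as tmpdir:` and create `coord.db` and any Git fixture under `tmpdir`.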
## Direct CLI Replay
The repository also includes a reusable direct replay runner at `scripts/run_orch_skill_forward_tests.sh`.
This runner executes the bundled `skills/orch/assets/orch` and `skills/inbox/assets/inbox` binaries against temporary SQLite DBs and Git fixtures without spawning Codex role agents.
Use it to validate packaged CLI behavior and record concrete evidence quickly, but do not treat it as a full replacement for the real subagent-forward model described above.
The case files in this directory now include recorded example runs captured through that direct replay path on `2026-03-19`.
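For readers who want to script a single replay step outside the shell runner, the sketch below builds the `inbox show` invocation in the flag order the case files use (`--db`, `--json`, `show`, `--thread`). The helper names `inbox_show_argv` and `run_json` are hypothetical, and the assumption that the command prints one JSON document on stdout is not confirmed by this directory.

```python
import json
import subprocess
from pathlib import Path


def inbox_show_argv(skill_path: Path, db_path: Path, thread_id: str) -> list[str]:
    """Build the argv for the bundled inbox `show` call used by the case files."""
    return [
        str(skill_path / "assets" / "inbox"),
        "--db", str(db_path),
        "--json",
        "show",
        "--thread", thread_id,
    ]


def run_json(argv: list[str]) -> dict:
    """Run a command and parse its stdout as JSON.

    Assumes the command exits 0 and prints exactly one JSON document.
    """
    completed = subprocess.run(argv, check=True, capture_output=True, text=True)
    return json.loads(completed.stdout)
```

Keeping argv construction separate from execution makes the command easy to log as evidence before it runs.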
## Real Subagent Forward Runs
The five cases in this directory were also executed with real spawned role agents on `2026-03-19`.
Those runs used injected project-local `skills/orch/` and `skills/inbox/` bundles with a narrow-context fallback (`fork_context: false`) after an earlier wider-context attempt (`fork_context: true`) proved unreliable for this repo.
The successful evidence root for those runs was `/tmp/orch-skill-subagents.J1XWgs`.
The longer `retry` and `reassign` cases used staged leader prompts while keeping the same leader agent active across phases, so each run still exercised real agent-driven `orch` control flow rather than a main-thread direct replay.
## Per-Case Template
Each case file should use this structure:
@@ -87,3 +87,30 @@ INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json show --thread THREAD_I
- use the default cleanup policy from [README.md](./README.md)
- if the run fails, retain `TMPDIR` and `coord.db` for replay and manual inspection
## Recorded Example Run
- recorded on: `2026-03-19`
- execution mode: `direct_cli_replay` via `scripts/run_orch_skill_forward_tests.sh`
- result: `pass`
- observed run id: `run_blog_skill_002`
- observed thread id: `thr_42ce634f273745e9b95badc14ce52708`
- evidence summary:
- `orch wait --for task_blocked` woke on the worker question, and `inbox wait-reply` later woke on the leader answer
- final `orch status --run run_blog_skill_002 --json` returned `run.status == "done"` and `tasks[0].status == "done"`
- final `inbox show --thread thr_42ce634f273745e9b95badc14ce52708 --json` contained `question`, `answer`, and `result` messages
- the recorded `question` payload was `Should logging go to stdout or stderr?`, and the recorded `answer` body was `Use stdout for MVP.`
- note: this recorded run exercised the packaged binaries directly in a temporary DB and did not spawn separate Codex role agents
## Recorded Real Forward Run
- recorded on: `2026-03-19`
- execution mode: `real_subagent_forward_test`
- result: `pass`
- evidence root: `/tmp/orch-skill-subagents.J1XWgs/leader-blocked-answer-resume-through-bundled-cli`
- observed run id: `run_blog_skill_002`
- observed thread id: `thr_fd11536a0b2f4c668f6e78c38090816e`
- evidence summary:
- a real leader agent using `skills/orch/` completed `wait --for task_blocked`, `blocked`, `answer`, `wait --for task_done`, `reconcile`, and `status`
- a real worker agent using `skills/inbox/` completed `claim`, `update --status in_progress`, `update --status blocked`, `wait-reply`, resume `update`, and `done`
- main-thread validation confirmed `run.status == "done"`, `task.status == "done"`, the blocked question payload `Should logging go to stdout or stderr?`, and the answer body `Use stdout for MVP.`
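The thread-history check in this case can be sketched as an ordered-subsequence test over message kinds. The message shape used here (a list of dicts with a `kind` field, plus a `body` field in the sample data) is an assumption for illustration, and checking order rather than mere presence is an extra strictness not stated in the recorded evidence.

```python
def has_kinds_in_order(messages: list[dict], expected: list[str]) -> bool:
    """Return True if `expected` kinds occur in `messages` as an ordered subsequence."""
    # A single shared iterator is consumed left to right, so each expected
    # kind must be found after the position of the previous match.
    kinds = iter(m.get("kind") for m in messages)
    return all(kind in kinds for kind in expected)


# Hypothetical history mirroring the recorded question and answer text.
history = [
    {"kind": "task"},
    {"kind": "question", "body": "Should logging go to stdout or stderr?"},
    {"kind": "answer", "body": "Use stdout for MVP."},
    {"kind": "result"},
]
```

With this data, `has_kinds_in_order(history, ["question", "answer", "result"])` holds, while the reversed pair `["answer", "question"]` does not.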
@@ -96,3 +96,34 @@ INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json show --thread THREAD_I
- use the default cleanup policy from [README.md](./README.md)
- if the run fails, retain `TMPDIR` and `coord.db` for replay and manual inspection
## Recorded Example Run
- recorded on: `2026-03-19`
- execution mode: `direct_cli_replay` via `scripts/run_orch_skill_forward_tests.sh`
- result: `pass`
- observed run id: `run_blog_skill_reassign_001`
- observed original thread id: `thr_0a61240412134de3b3d9ab219b6c8f19`
- observed reassigned thread id: `thr_12fbcf6d89d948548306198d013d77a5`
- evidence summary:
- `orch wait --for task_blocked` woke after worker-a posted a blocked question with payload `Proceed with v1 scope?`
- `orch reassign --run run_blog_skill_reassign_001 --task T1 --to worker-b --json` returned `attempt_no == 2` and assigned the new attempt to `worker-b`
- final `inbox show` on the original thread returned `thread.status == "cancelled"` and preserved the blocked `question` message
- final `inbox show` on the reassigned thread returned `thread.status == "done"`
- final `orch status --run run_blog_skill_reassign_001 --json` returned `run.status == "done"` and `tasks[0].status == "done"`
- note: this recorded run exercised the packaged binaries directly in a temporary DB and did not spawn separate Codex role agents
## Recorded Real Forward Run
- recorded on: `2026-03-19`
- execution mode: `real_subagent_forward_test`
- result: `pass`
- evidence root: `/tmp/orch-skill-subagents.J1XWgs/leader-reassigns-blocked-task-through-bundled-cli-phased`
- observed run id: `run_blog_skill_reassign_001`
- observed original thread id: `thr_7d43af5bc1f7467da98a39adb0de5808`
- observed reassigned thread id: `thr_eba253db8965423b855d0c784a29702c`
- evidence summary:
- the same real leader agent using `skills/orch/` completed the case in three phases: initial `run/task/dispatch`, then `wait --for task_blocked` plus `reassign`, then final `wait --for task_done` plus `status`
- a real `worker-a` agent using `skills/inbox/` claimed the original thread and posted the blocked question `Proceed with v1 scope?`
- a real `worker-b` agent using `skills/inbox/` claimed the reassigned thread and completed it
- main-thread validation confirmed the original thread finished `cancelled`, the reassigned thread finished `done`, and the original blocked question remained visible in thread history
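The main-thread validation for this case can be sketched as one function that collects every mismatch instead of stopping at the first. The input shapes are assumptions: `reassign` stands for the parsed `orch reassign` output, `original` and `reassigned` for parsed `inbox show` outputs with a `thread.status` field, and the `messages`/`kind` layout is hypothetical.

```python
def check_reassign(reassign: dict, original: dict, reassigned: dict) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    if reassign.get("attempt_no") != 2:
        problems.append(f"expected attempt_no 2, got {reassign.get('attempt_no')!r}")
    if original.get("thread", {}).get("status") != "cancelled":
        problems.append("original thread is not cancelled")
    if reassigned.get("thread", {}).get("status") != "done":
        problems.append("reassigned thread is not done")
    # Hypothetical shape: a `messages` list whose items carry a `kind` field.
    kinds = [m.get("kind") for m in original.get("messages", [])]
    if "question" not in kinds:
        problems.append("blocked question missing from original thread history")
    return problems
```

Returning a list rather than raising keeps all failed expectations visible in one evidence record.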
@@ -89,3 +89,33 @@ INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json show --thread THREAD_I
- use the default cleanup policy from [README.md](./README.md)
- if the run fails, retain `TMPDIR` and `coord.db` for replay and manual inspection
## Recorded Example Run
- recorded on: `2026-03-19`
- execution mode: `direct_cli_replay` via `scripts/run_orch_skill_forward_tests.sh`
- result: `pass`
- observed run id: `run_blog_skill_retry_001`
- observed first thread id: `thr_8dbf2d2e46d7469891cc1ef604da476f`
- observed second thread id: `thr_bdd86f4fe08e4ebfb39b8151ac41a3bb`
- evidence summary:
- `orch wait --for task_failed` woke after the first worker-owned thread failed
- `orch retry --run run_blog_skill_retry_001 --task T1 --json` returned `attempt_no == 2` with a distinct replacement thread for the same worker
- final `inbox show` on the first thread returned `thread.status == "failed"`
- final `inbox show` on the second thread returned `thread.status == "done"`
- final `orch status --run run_blog_skill_retry_001 --json` returned `run.status == "done"` and `tasks[0].status == "done"`
- note: this recorded run exercised the packaged binaries directly in a temporary DB and did not spawn separate Codex role agents
## Recorded Real Forward Run
- recorded on: `2026-03-19`
- execution mode: `real_subagent_forward_test`
- result: `pass`
- evidence root: `/tmp/orch-skill-subagents.J1XWgs/leader-retries-failed-task-through-bundled-cli-phased`
- observed run id: `run_blog_skill_retry_001`
- observed first thread id: `thr_1e22121642294b56aae351ddec5180d1`
- observed second thread id: `thr_f2ab1f1899964007b2447796204e1928`
- evidence summary:
- the same real leader agent using `skills/orch/` completed the case in three phases: initial `run/task/dispatch`, then `wait --for task_failed` plus `retry`, then final `wait --for task_done` plus `status`
- a real worker agent using `skills/inbox/` failed the first thread, polled for the retried pending thread, then claimed and completed the second thread
- main-thread validation confirmed the two thread ids were distinct, the first thread finished `failed`, the second thread finished `done`, and the run/task both finished `done`
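The worker's wait for the retried pending thread can be sketched as a generic poll loop with a deadline. Here `fetch` is a hypothetical callable standing in for whatever `inbox` call lists pending threads for the worker; it is not a documented API, and the timeout and interval defaults are arbitrary.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(
    fetch: Callable[[], Optional[T]],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
) -> T:
    """Call `fetch` until it returns a non-None value or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        found = fetch()
        if found is not None:
            return found
        if time.monotonic() >= deadline:
            raise TimeoutError("no pending thread appeared before the timeout")
        time.sleep(interval_s)
```

Using `time.monotonic()` keeps the deadline stable even if the wall clock is adjusted during the run.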
@@ -88,3 +88,29 @@ INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json show --thread THREAD_I
- use the default cleanup policy from [README.md](./README.md)
- if the run fails, retain `TMPDIR` and `coord.db` for replay and manual inspection
## Recorded Example Run
- recorded on: `2026-03-19`
- execution mode: `direct_cli_replay` via `scripts/run_orch_skill_forward_tests.sh`
- result: `pass`
- observed run id: `run_blog_skill_001`
- observed thread id: `thr_eced1b8cb1254065a7cd3aaff6dc0bcb`
- evidence summary:
- final `orch status --run run_blog_skill_001 --json` returned `run.status == "done"` with a single task `T1` in state `done`
- final `inbox show --thread thr_eced1b8cb1254065a7cd3aaff6dc0bcb --json` returned thread state `done` and message kinds `task`, `progress`, and `result`
- the replay also observed `orch wait --for task_done` wake successfully before the final reconcile
- note: this recorded run exercised the packaged binaries directly in a temporary DB and did not spawn separate Codex role agents
## Recorded Real Forward Run
- recorded on: `2026-03-19`
- execution mode: `real_subagent_forward_test`
- result: `pass`
- evidence root: `/tmp/orch-skill-subagents.J1XWgs/leader-run-dispatch-reconcile-through-bundled-cli`
- observed run id: `run_blog_skill_001`
- observed thread id: `thr_7c64e75bbcce4143a7fc425242f7e7d3`
- evidence summary:
- a real leader agent using `skills/orch/` completed `run init`, `task add`, `dispatch`, `wait`, `reconcile`, and `status`
- a real worker agent using `skills/inbox/` completed `fetch`, `claim`, `update --status in_progress`, and `done`
- main-thread validation confirmed `status.data.run.status == "done"`, `status.data.tasks[0].status == "done"`, and thread history kinds `task`, `progress`, and `result`
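The status assertions above can be sketched as a single predicate over the parsed `orch status --json` output. The recorded evidence refers to both `run.status` and `status.data.run.status`, so this sketch accepts the payload with or without a `data` envelope; that tolerance is an assumption rather than documented CLI behavior.

```python
def run_and_tasks_done(status: dict) -> bool:
    """Return True if the run and every task in the status payload are `done`."""
    # Accept either {"data": {"run": ..., "tasks": ...}} or the bare body.
    body = status.get("data", status)
    run_done = body.get("run", {}).get("status") == "done"
    tasks = body.get("tasks", [])
    # An empty task list is treated as a failure, since each case expects T1.
    return run_done and bool(tasks) and all(t.get("status") == "done" for t in tasks)
```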
@@ -88,3 +88,32 @@ test ! -d WORKTREE_PATH
- use the default cleanup policy from [README.md](./README.md)
- if the run fails, retain `TMPDIR`, `coord.db`, and the Git repo fixture for replay and manual inspection
## Recorded Example Run
- recorded on: `2026-03-19`
- execution mode: `direct_cli_replay` via `scripts/run_orch_skill_forward_tests.sh`
- result: `pass`
- observed run id: `run_blog_skill_worktree_001`
- observed thread id: `thr_5743259fdccb41f9bb33dce0040b27a5`
- observed worktree suffix: `.orch/worktrees/run-blog-skill-worktree-001/T1/attempt-1`
- evidence summary:
- `orch dispatch --strict-worktree` returned `base_ref == "HEAD"`, a concrete `base_commit`, branch `orch/run-blog-skill-worktree-001/T1/attempt-1`, and a non-empty `worktree_path`
- the task payload stored on the worker thread exposed the same `worktree_path`
- final `orch status --run run_blog_skill_worktree_001 --json` returned `run.status == "done"` and `tasks[0].status == "done"`
- final `orch cleanup --run run_blog_skill_worktree_001 --task T1 --json` returned one cleaned attempt and the worktree directory no longer existed afterward
- note: this recorded run exercised the packaged binaries directly in a temporary DB and Git fixture and did not spawn separate Codex role agents
## Recorded Real Forward Run
- recorded on: `2026-03-19`
- execution mode: `real_subagent_forward_test`
- result: `pass`
- evidence root: `/tmp/orch-skill-subagents.J1XWgs/strict-worktree-dispatch-to-cleanup-through-bundled-cli`
- observed run id: `run_blog_skill_worktree_001`
- observed thread id: `thr_089527cd07f74b52a524ba07ed74c2e4`
- observed worktree path: `/private/tmp/orch-skill-subagents.J1XWgs/strict-worktree-dispatch-to-cleanup-through-bundled-cli/repo/.orch/worktrees/run-blog-skill-worktree-001/T1/attempt-1`
- evidence summary:
- a real leader agent using `skills/orch/` completed strict `dispatch`, `wait`, `reconcile`, `cleanup`, and `status`
- a real worker agent using `skills/inbox/` claimed the thread and finished it with `done`
- main-thread validation confirmed that the task payload included the same `worktree_path`, although the worker agent's own summary failed to notice it, and that the worktree directory no longer existed after cleanup
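The evidence root is recorded under `/tmp` while the observed worktree path starts with `/private/tmp` (on macOS, `/tmp` is a symlink to `/private/tmp`), so a prefix comparison would need symlink resolution first. The sketch below sidesteps that by matching only the path suffix. The naming rule (underscores in the run id replaced by hyphens) is inferred from the observed values in this case, not from documented `orch` behavior, and both helper names are hypothetical.

```python
from pathlib import PurePosixPath


def expected_worktree_suffix(run_id: str, task_id: str, attempt_no: int) -> PurePosixPath:
    """Derive the worktree suffix seen in the recorded runs (inferred rule)."""
    slug = run_id.replace("_", "-")
    return PurePosixPath(".orch", "worktrees", slug, task_id, f"attempt-{attempt_no}")


def path_ends_with(path: str, suffix: PurePosixPath) -> bool:
    """Return True if the trailing components of `path` equal `suffix`."""
    parts = PurePosixPath(path).parts
    return parts[-len(suffix.parts):] == suffix.parts
```

For the recorded run, `expected_worktree_suffix("run_blog_skill_worktree_001", "T1", 1)` yields `.orch/worktrees/run-blog-skill-worktree-001/T1/attempt-1`, which matches the tail of the observed worktree path.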