Add orch skill forward test evidence
@@ -27,6 +27,7 @@ As of now:
- an inbox skill forward-test plan directory now exists under `docs/tests/inbox-skill/`, with a shared execution template and multiple scenario cases
- an orch skill forward-test plan directory now exists under `docs/tests/orch-skill/`, with a shared execution contract and initial leader-side workflow scenarios
- a repo-local replay runner now exists at `scripts/run_orch_skill_forward_tests.sh`, and the five `docs/tests/orch-skill/` cases now include recorded example runs from a bundled-CLI replay captured on `2026-03-19`
- the five `docs/tests/orch-skill/` cases now also include recorded real subagent-forward runs captured on `2026-03-19`, with spawned leader and worker agents using the packaged `skills/orch/` and `skills/inbox/` bundles
- a council-review skill forward-test plan directory now exists under `docs/tests/council-review-skill/`, with a shared execution contract and nine council workflow scenarios covering end-to-end flow, unanimous-only defaults, timeout/before-tally errors, explicit minority reporting, invalid report filters, strict tally semantics, malformed reviewer JSON, and target-file inputs
- an execution-roadmap workflow now exists under `docs/roadmaps/active/` and `docs/roadmaps/archive/` for agent-level work traces and completion archives
- a repo-local `scripts/package_skill_clis.sh` packaging flow now builds bundled skill CLI assets for `inbox`, `orch`, and `council-review`
@@ -0,0 +1,66 @@
# Title

Direct Replay For Orch Skill Cases

## Status

- `completed`

## Owner

- codex

## Started At

- `2026-03-19`

## Goal

- Execute the documented `docs/tests/orch-skill/` scenarios against the bundled `skills/orch/assets/orch` and `skills/inbox/assets/inbox` binaries, capture concrete evidence, and sync the repo docs with the observed results.

## Scope

- add a reusable local runner for the five documented orch-skill scenarios
- run the scenarios and capture per-case evidence
- update the orch-skill docs with recorded runs and note the execution mode
- update the implementation roadmap to reflect the new replay coverage

## Checklist

- [x] Review the orch-skill case docs and bundled CLI surfaces.
- [x] Add a reusable direct replay runner for the five orch-skill scenarios.
- [x] Execute the runner and collect evidence for all five cases.
- [x] Update the orch-skill docs with recorded example runs and execution notes.
- [x] Update the implementation roadmap and archive this execution roadmap.

## Files

- `scripts/run_orch_skill_forward_tests.sh`
- `docs/tests/orch-skill/README.md`
- `docs/tests/orch-skill/leader-run-dispatch-reconcile-through-bundled-cli.md`
- `docs/tests/orch-skill/leader-blocked-answer-resume-through-bundled-cli.md`
- `docs/tests/orch-skill/strict-worktree-dispatch-to-cleanup-through-bundled-cli.md`
- `docs/tests/orch-skill/leader-retries-failed-task-through-bundled-cli.md`
- `docs/tests/orch-skill/leader-reassigns-blocked-task-through-bundled-cli.md`
- `docs/implementation-roadmap.md`
- `docs/roadmaps/archive/orch-skill-direct-replay.md`

## Decisions

- Use direct bundled-CLI replay instead of spawning Codex role agents in this turn, because the current session does not permit sub-agent delegation unless the user explicitly asks for it.
- Keep the replay runner repo-local so the same scenarios can be rerun later without reconstructing the command flow by hand.

## Blockers

- none

## Next Step

- rerun `scripts/run_orch_skill_forward_tests.sh` when the bundled skill binaries or orch-skill case docs change, and add true multi-agent forward coverage later if explicit sub-agent execution is needed

## Completion Summary

- Added `scripts/run_orch_skill_forward_tests.sh` as a reusable direct bundled-CLI replay runner for the five documented orch-skill scenarios.
- Executed the runner on `2026-03-19`; all five scenarios passed and produced per-case JSON evidence under a temporary output root.
- Updated `docs/tests/orch-skill/README.md` plus all five case files with recorded example runs and explicit execution-mode notes.
- Updated `docs/implementation-roadmap.md` to record the new replay runner and captured orch-skill execution evidence.
@@ -0,0 +1,67 @@
# Title

Real Subagent Forward Tests For Orch Skill

## Status

- `completed`

## Owner

- codex

## Started At

- `2026-03-19`

## Goal

- Execute the documented `docs/tests/orch-skill/` scenarios using real spawned role agents with injected `skills/orch/` and `skills/inbox/`, then record concrete pass/fail evidence and sync the repository docs.

## Scope

- validate subagent skill injection for project-local orch and inbox skills
- run the five documented orch-skill forward cases with real leader and worker subagents
- collect main-thread validation evidence and agent summaries
- update the orch-skill docs and implementation roadmap with the real forward-test results

## Checklist

- [x] Re-read the orch-skill shared execution contract and worker skill constraints.
- [x] Validate project-local skill injection with a small spawned-agent probe.
- [x] Execute the five orch-skill cases with real spawned role agents and collect evidence.
- [x] Update the orch-skill docs and implementation roadmap with the real forward-test results.
- [x] Archive this execution roadmap with a completion summary.

## Files

- `docs/tests/orch-skill/README.md`
- `docs/tests/orch-skill/leader-run-dispatch-reconcile-through-bundled-cli.md`
- `docs/tests/orch-skill/leader-blocked-answer-resume-through-bundled-cli.md`
- `docs/tests/orch-skill/strict-worktree-dispatch-to-cleanup-through-bundled-cli.md`
- `docs/tests/orch-skill/leader-retries-failed-task-through-bundled-cli.md`
- `docs/tests/orch-skill/leader-reassigns-blocked-task-through-bundled-cli.md`
- `docs/implementation-roadmap.md`
- `docs/roadmaps/archive/orch-skill-real-forward-test.md`

## Decisions

- Use real spawned role agents per case instead of the direct replay runner, because the user explicitly asked for true tests with subagents.
- Keep the main thread responsible for DB setup, fixture creation, and independent validation so the final judgment does not rely only on role-agent self-reporting.
- Fall back from `fork_context: true` to `fork_context: false` for the real case runs after the first wider-context attempt stalled and mis-executed the worker-side contract in this repo.
- For the longer `retry` and `reassign` cases, keep one leader agent active across staged prompts instead of one long monolithic prompt, because staged execution proved more reliable while still preserving a real agent-owned `orch` flow.

## Blockers

- none

## Next Step

- rerun the same five cases when the packaged skill binaries or case docs change, and consider adding the same real subagent coverage for `council-review` if that surface needs parity

## Completion Summary

- Verified both project-local skill bundles with spawned-agent help-command probes before the real runs.
- Collected successful real subagent evidence for all five orch-skill cases under `/tmp/orch-skill-subagents.J1XWgs`.
- Main-thread validation confirmed all five final successful runs reached the expected `orch` and `inbox` states.
- Updated `docs/tests/orch-skill/README.md`, all five case files, and `docs/implementation-roadmap.md` to record the new real forward-test coverage.
@@ -122,6 +122,26 @@ Use these defaults unless a case file explicitly overrides them:
- keep the temporary DB, repo fixture, and working directory on failure for debugging
- cleanup the temporary working directory on success only if the caller does not need replay artifacts
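The two cleanup defaults above reduce to a small decision rule. The following is a minimal sketch in Python, assuming a hypothetical helper `should_remove_workdir` and a caller-supplied `keep_artifacts` flag; neither name comes from the runner script or the case files.

```python
def should_remove_workdir(passed: bool, keep_artifacts: bool) -> bool:
    """Return True only when the temporary working directory may be deleted.

    Failed runs always keep the DB, repo fixture, and working directory for
    debugging. Passing runs are cleaned up unless the caller asked to keep
    replay artifacts. (Hypothetical helper, for illustration only.)
    """
    if not passed:
        return False
    return not keep_artifacts
```

A case file that overrides the defaults would bypass this rule and apply its own policy.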
## Direct CLI Replay

The repository also includes a reusable direct replay runner at `scripts/run_orch_skill_forward_tests.sh`.

This runner executes the bundled `skills/orch/assets/orch` and `skills/inbox/assets/inbox` binaries against temporary SQLite DBs and Git fixtures without spawning Codex role agents.

Use it to validate packaged CLI behavior and record concrete evidence quickly, but do not treat it as a full replacement for the real subagent-forward model described above.

The case files in this directory now include recorded example runs captured through that direct replay path on `2026-03-19`.
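Direct replay amounts to invoking a bundled binary against a temporary DB and parsing its JSON output. The sketch below is not the implementation of `scripts/run_orch_skill_forward_tests.sh`; it only illustrates the idea in Python. The argument order follows the `INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json show --thread` form used in the case files, and the helper names `inbox_show_argv` and `run_json` are hypothetical.

```python
import json
import subprocess
from pathlib import Path


def inbox_show_argv(skill_path: str, tmpdir: str, thread_id: str) -> list[str]:
    """Build the argv for the bundled inbox binary's `show` command."""
    binary = Path(skill_path) / "assets" / "inbox"
    db = Path(tmpdir) / "coord.db"
    return [str(binary), "--db", str(db), "--json", "show", "--thread", thread_id]


def run_json(argv: list[str]) -> dict:
    """Run a command and parse its stdout as JSON.

    Raises subprocess.CalledProcessError on a non-zero exit status, so a
    failing CLI call stops the replay instead of being silently ignored.
    """
    completed = subprocess.run(argv, check=True, capture_output=True, text=True)
    return json.loads(completed.stdout)
```

Keeping argv construction separate from execution makes the command line easy to log as evidence before it runs.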
## Real Subagent Forward Runs

The five cases in this directory were also executed with real spawned role agents on `2026-03-19`.

That run used injected project-local `skills/orch/` and `skills/inbox/` bundles with a narrow-context fallback (`fork_context: false`) after an earlier wider-context attempt proved unreliable for this repo.

The successful evidence root for those runs was `/tmp/orch-skill-subagents.J1XWgs`.

Some longer cases used staged leader progression while keeping the same leader agent active across phases so the run still exercised real agent-driven `orch` control flow instead of a main-thread direct replay.

## Per-Case Template

Each case file should use this structure:
@@ -87,3 +87,30 @@ INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json show --thread THREAD_I
- use the default cleanup policy from [README.md](./README.md)
- if the run fails, retain `TMPDIR` and `coord.db` for replay and manual inspection

## Recorded Example Run

- recorded on: `2026-03-19`
- execution mode: `direct_cli_replay` via `scripts/run_orch_skill_forward_tests.sh`
- result: `pass`
- observed run id: `run_blog_skill_002`
- observed thread id: `thr_42ce634f273745e9b95badc14ce52708`
- evidence summary:
  - `orch wait --for task_blocked` woke on the worker question, and `inbox wait-reply` later woke on the leader answer
  - final `orch status --run run_blog_skill_002 --json` returned `run.status == "done"` and `tasks[0].status == "done"`
  - final `inbox show --thread thr_42ce634f273745e9b95badc14ce52708 --json` contained `question`, `answer`, and `result` messages
  - the recorded `question` payload was `Should logging go to stdout or stderr?`, and the recorded `answer` body was `Use stdout for MVP.`
- note: this recorded run exercised the packaged binaries directly in a temporary DB and did not spawn separate Codex role agents

## Recorded Real Forward Run

- recorded on: `2026-03-19`
- execution mode: `real_subagent_forward_test`
- result: `pass`
- evidence root: `/tmp/orch-skill-subagents.J1XWgs/leader-blocked-answer-resume-through-bundled-cli`
- observed run id: `run_blog_skill_002`
- observed thread id: `thr_fd11536a0b2f4c668f6e78c38090816e`
- evidence summary:
  - a real leader agent using `skills/orch/` completed `wait --for task_blocked`, `blocked`, `answer`, `wait --for task_done`, `reconcile`, and `status`
  - a real worker agent using `skills/inbox/` completed `claim`, `update --status in_progress`, `update --status blocked`, `wait-reply`, resume `update`, and `done`
  - main-thread validation confirmed `run.status == "done"`, `task.status == "done"`, the blocked question payload `Should logging go to stdout or stderr?`, and the answer body `Use stdout for MVP.`
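The thread-history check in this case (a `question`, then an `answer`, then a `result`) can be expressed as a small validation. This is a sketch under one assumption: that each message in the `inbox show` output is a dict with a `kind` key. The real JSON shape is not shown in this case file, and the helper names are hypothetical.

```python
def missing_kinds(messages: list[dict], required: tuple[str, ...]) -> list[str]:
    """Return the required message kinds that never appear in the history."""
    seen = {m.get("kind") for m in messages}
    return [kind for kind in required if kind not in seen]


def kinds_appear_in_order(messages: list[dict], expected: tuple[str, ...]) -> bool:
    """True when `expected` is a subsequence of the message kinds.

    Other kinds (for example `task` or `progress`) may be interleaved.
    """
    kinds = iter(m.get("kind") for m in messages)
    # `kind in kinds` advances the iterator, so each match must come
    # after the previous one.
    return all(kind in kinds for kind in expected)
```

For the blocked-answer-resume flow, a caller would pass `("question", "answer", "result")` as the expected sequence.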
@@ -96,3 +96,34 @@ INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json show --thread THREAD_I
- use the default cleanup policy from [README.md](./README.md)
- if the run fails, retain `TMPDIR` and `coord.db` for replay and manual inspection

## Recorded Example Run

- recorded on: `2026-03-19`
- execution mode: `direct_cli_replay` via `scripts/run_orch_skill_forward_tests.sh`
- result: `pass`
- observed run id: `run_blog_skill_reassign_001`
- observed original thread id: `thr_0a61240412134de3b3d9ab219b6c8f19`
- observed reassigned thread id: `thr_12fbcf6d89d948548306198d013d77a5`
- evidence summary:
  - `orch wait --for task_blocked` woke after worker-a posted a blocked question with payload `Proceed with v1 scope?`
  - `orch reassign --run run_blog_skill_reassign_001 --task T1 --to worker-b --json` returned `attempt_no == 2` and assigned the new attempt to `worker-b`
  - final `inbox show` on the original thread returned `thread.status == "cancelled"` and preserved the blocked `question` message
  - final `inbox show` on the reassigned thread returned `thread.status == "done"`
  - final `orch status --run run_blog_skill_reassign_001 --json` returned `run.status == "done"` and `tasks[0].status == "done"`
- note: this recorded run exercised the packaged binaries directly in a temporary DB and did not spawn separate Codex role agents

## Recorded Real Forward Run

- recorded on: `2026-03-19`
- execution mode: `real_subagent_forward_test`
- result: `pass`
- evidence root: `/tmp/orch-skill-subagents.J1XWgs/leader-reassigns-blocked-task-through-bundled-cli-phased`
- observed run id: `run_blog_skill_reassign_001`
- observed original thread id: `thr_7d43af5bc1f7467da98a39adb0de5808`
- observed reassigned thread id: `thr_eba253db8965423b855d0c784a29702c`
- evidence summary:
  - the same real leader agent using `skills/orch/` completed the case in three phases: initial `run/task/dispatch`, then `wait --for task_blocked` plus `reassign`, then final `wait --for task_done` plus `status`
  - a real `worker-a` agent using `skills/inbox/` claimed the original thread and posted the blocked question `Proceed with v1 scope?`
  - a real `worker-b` agent using `skills/inbox/` claimed the reassigned thread and completed it
  - main-thread validation confirmed the original thread finished `cancelled`, the reassigned thread finished `done`, and the original blocked question remained visible in thread history
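The reassign outcome recorded here has three checkable parts: the original thread ends `cancelled`, the reassigned thread ends `done`, and the new attempt is number 2. A minimal sketch of that check follows, assuming each `inbox show` result exposes `thread.status` as written in the evidence summary. Any outer JSON envelope is unknown, and `check_reassign` is a hypothetical name.

```python
def check_reassign(original: dict, reassigned: dict, attempt_no: int) -> list[str]:
    """Return a list of human-readable problems; an empty list means pass."""
    problems = []
    original_status = original.get("thread", {}).get("status")
    reassigned_status = reassigned.get("thread", {}).get("status")
    if original_status != "cancelled":
        problems.append(f"original thread is {original_status!r}, expected 'cancelled'")
    if reassigned_status != "done":
        problems.append(f"reassigned thread is {reassigned_status!r}, expected 'done'")
    if attempt_no != 2:
        problems.append(f"attempt_no is {attempt_no}, expected 2")
    return problems
```

Returning every problem at once, rather than failing on the first, gives more useful evidence when a run is retained for inspection.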
@@ -89,3 +89,33 @@ INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json show --thread THREAD_I
- use the default cleanup policy from [README.md](./README.md)
- if the run fails, retain `TMPDIR` and `coord.db` for replay and manual inspection

## Recorded Example Run

- recorded on: `2026-03-19`
- execution mode: `direct_cli_replay` via `scripts/run_orch_skill_forward_tests.sh`
- result: `pass`
- observed run id: `run_blog_skill_retry_001`
- observed first thread id: `thr_8dbf2d2e46d7469891cc1ef604da476f`
- observed second thread id: `thr_bdd86f4fe08e4ebfb39b8151ac41a3bb`
- evidence summary:
  - `orch wait --for task_failed` woke after the first worker-owned thread failed
  - `orch retry --run run_blog_skill_retry_001 --task T1 --json` returned `attempt_no == 2` with a distinct replacement thread for the same worker
  - final `inbox show` on the first thread returned `thread.status == "failed"`
  - final `inbox show` on the second thread returned `thread.status == "done"`
  - final `orch status --run run_blog_skill_retry_001 --json` returned `run.status == "done"` and `tasks[0].status == "done"`
- note: this recorded run exercised the packaged binaries directly in a temporary DB and did not spawn separate Codex role agents

## Recorded Real Forward Run

- recorded on: `2026-03-19`
- execution mode: `real_subagent_forward_test`
- result: `pass`
- evidence root: `/tmp/orch-skill-subagents.J1XWgs/leader-retries-failed-task-through-bundled-cli-phased`
- observed run id: `run_blog_skill_retry_001`
- observed first thread id: `thr_1e22121642294b56aae351ddec5180d1`
- observed second thread id: `thr_f2ab1f1899964007b2447796204e1928`
- evidence summary:
  - the same real leader agent using `skills/orch/` completed the case in three phases: initial `run/task/dispatch`, then `wait --for task_failed` plus `retry`, then final `wait --for task_done` plus `status`
  - a real worker agent using `skills/inbox/` failed the first thread, polled for the retried pending thread, then claimed and completed the second thread
  - main-thread validation confirmed the two thread ids were distinct, the first thread finished `failed`, the second thread finished `done`, and the run/task both finished `done`
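The record says the worker polled for the retried pending thread but does not say how. As a generic illustration only, a bounded polling loop could look like the sketch below. Here `fetch` stands in for whatever call lists pending threads and returns a thread id or `None`. It is a hypothetical callable, not an `inbox` API.

```python
import time
from typing import Callable, Optional


def poll_until(
    fetch: Callable[[], Optional[str]],
    timeout_s: float,
    interval_s: float = 0.5,
) -> Optional[str]:
    """Call fetch() until it returns a non-None value or the timeout elapses.

    Returns the fetched value, or None if nothing appeared in time.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval_s)
```

Using `time.monotonic()` keeps the deadline stable even if the wall clock is adjusted during the run.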
@@ -88,3 +88,29 @@ INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json show --thread THREAD_I
- use the default cleanup policy from [README.md](./README.md)
- if the run fails, retain `TMPDIR` and `coord.db` for replay and manual inspection

## Recorded Example Run

- recorded on: `2026-03-19`
- execution mode: `direct_cli_replay` via `scripts/run_orch_skill_forward_tests.sh`
- result: `pass`
- observed run id: `run_blog_skill_001`
- observed thread id: `thr_eced1b8cb1254065a7cd3aaff6dc0bcb`
- evidence summary:
  - final `orch status --run run_blog_skill_001 --json` returned `run.status == "done"` with a single task `T1` in state `done`
  - final `inbox show --thread thr_eced1b8cb1254065a7cd3aaff6dc0bcb --json` returned thread state `done` and message kinds `task`, `progress`, and `result`
  - the replay also observed `orch wait --for task_done` wake successfully before the final reconcile
- note: this recorded run exercised the packaged binaries directly in a temporary DB and did not spawn separate Codex role agents

## Recorded Real Forward Run

- recorded on: `2026-03-19`
- execution mode: `real_subagent_forward_test`
- result: `pass`
- evidence root: `/tmp/orch-skill-subagents.J1XWgs/leader-run-dispatch-reconcile-through-bundled-cli`
- observed run id: `run_blog_skill_001`
- observed thread id: `thr_7c64e75bbcce4143a7fc425242f7e7d3`
- evidence summary:
  - a real leader agent using `skills/orch/` completed `run init`, `task add`, `dispatch`, `wait`, `reconcile`, and `status`
  - a real worker agent using `skills/inbox/` completed `fetch`, `claim`, `update --status in_progress`, and `done`
  - main-thread validation confirmed `status.data.run.status == "done"`, `status.data.tasks[0].status == "done"`, and thread history kinds `task`, `progress`, and `result`
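The main-thread check on `status.data.run.status` and `status.data.tasks[0].status` can be written as a short predicate over the parsed `orch status --json` output. This sketch assumes only the two key paths quoted above. The helper name `run_is_done` is hypothetical, and it checks every task rather than just the first, which is slightly stricter than the recorded validation.

```python
def run_is_done(status: dict) -> bool:
    """True when the run and all of its tasks report status "done".

    Expects the parsed JSON to carry `data.run.status` and `data.tasks[*].status`.
    A run with no tasks is treated as not done.
    """
    data = status.get("data", {})
    run_done = data.get("run", {}).get("status") == "done"
    tasks = data.get("tasks", [])
    return run_done and bool(tasks) and all(t.get("status") == "done" for t in tasks)
```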
@@ -88,3 +88,32 @@ test ! -d WORKTREE_PATH
- use the default cleanup policy from [README.md](./README.md)
- if the run fails, retain `TMPDIR`, `coord.db`, and the Git repo fixture for replay and manual inspection

## Recorded Example Run

- recorded on: `2026-03-19`
- execution mode: `direct_cli_replay` via `scripts/run_orch_skill_forward_tests.sh`
- result: `pass`
- observed run id: `run_blog_skill_worktree_001`
- observed thread id: `thr_5743259fdccb41f9bb33dce0040b27a5`
- observed worktree suffix: `.orch/worktrees/run-blog-skill-worktree-001/T1/attempt-1`
- evidence summary:
  - `orch dispatch --strict-worktree` returned `base_ref == "HEAD"`, a concrete `base_commit`, branch `orch/run-blog-skill-worktree-001/T1/attempt-1`, and a non-empty `worktree_path`
  - the task payload stored on the worker thread exposed the same `worktree_path`
  - final `orch status --run run_blog_skill_worktree_001 --json` returned `run.status == "done"` and `tasks[0].status == "done"`
  - final `orch cleanup --run run_blog_skill_worktree_001 --task T1 --json` returned one cleaned attempt and the worktree directory no longer existed afterward
- note: this recorded run exercised the packaged binaries directly in a temporary DB and Git fixture and did not spawn separate Codex role agents

## Recorded Real Forward Run

- recorded on: `2026-03-19`
- execution mode: `real_subagent_forward_test`
- result: `pass`
- evidence root: `/tmp/orch-skill-subagents.J1XWgs/strict-worktree-dispatch-to-cleanup-through-bundled-cli`
- observed run id: `run_blog_skill_worktree_001`
- observed thread id: `thr_089527cd07f74b52a524ba07ed74c2e4`
- observed worktree path: `/private/tmp/orch-skill-subagents.J1XWgs/strict-worktree-dispatch-to-cleanup-through-bundled-cli/repo/.orch/worktrees/run-blog-skill-worktree-001/T1/attempt-1`
- evidence summary:
  - a real leader agent using `skills/orch/` completed strict `dispatch`, `wait`, `reconcile`, `cleanup`, and `status`
  - a real worker agent using `skills/inbox/` claimed the thread and finished it with `done`
  - main-thread validation confirmed that the task payload did include the same `worktree_path` even though the worker agent summary failed to notice it, and also confirmed the worktree directory no longer existed after cleanup
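The observed branch and worktree suffix both look derived from the run id, task id, and attempt number. The sketch below encodes that pattern so a validator can compare it against the dispatch output. The underscore-to-hyphen rule is inferred from this single recorded run and is an assumption, not documented `orch` behavior. All helper names are hypothetical.

```python
from pathlib import Path


def _slug(run_id: str) -> str:
    # Assumption: the run id is slugged by replacing underscores with hyphens,
    # as seen in run_blog_skill_worktree_001 -> run-blog-skill-worktree-001.
    return run_id.replace("_", "-")


def expected_worktree_suffix(run_id: str, task_id: str, attempt_no: int) -> str:
    """Relative worktree path expected under the repo fixture."""
    return f".orch/worktrees/{_slug(run_id)}/{task_id}/attempt-{attempt_no}"


def expected_branch(run_id: str, task_id: str, attempt_no: int) -> str:
    """Branch name expected for the attempt."""
    return f"orch/{_slug(run_id)}/{task_id}/attempt-{attempt_no}"


def worktree_cleaned(worktree_path: str, suffix: str) -> bool:
    """True when the path ends with the expected suffix and no longer exists."""
    return worktree_path.endswith(suffix) and not Path(worktree_path).exists()
```

Comparing by suffix rather than full path sidesteps the `/tmp` versus `/private/tmp` difference visible between the evidence root and the observed worktree path.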