Add orch skill forward test evidence

2026-03-19 18:36:31 +08:00
parent d17b5ebfbd
commit e9cbb15c2d
10 changed files with 1036 additions and 0 deletions
@@ -122,6 +122,26 @@ Use these defaults unless a case file explicitly overrides them:
- keep the temporary DB, repo fixture, and working directory on failure for debugging
- clean up the temporary working directory on success only if the caller does not need replay artifacts
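
The retention defaults above can be sketched as a small context manager. This is only an illustration under stated assumptions: `case_workdir`, `keep_artifacts`, and the `orch-case-` prefix are hypothetical names and are not part of the `orch` or `inbox` tooling.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def case_workdir(keep_artifacts: bool = False):
    """Yield a temporary working directory for one case run.

    Hypothetical helper: the directory is removed only when the body
    succeeds and the caller did not ask to keep replay artifacts.
    """
    workdir = Path(tempfile.mkdtemp(prefix="orch-case-"))
    try:
        yield workdir
    except BaseException:
        # Failure: keep the DB, repo fixture, and working directory for debugging.
        print(f"case failed; artifacts retained at {workdir}")
        raise
    else:
        # Success: clean up unless the caller needs the replay artifacts.
        if not keep_artifacts:
            shutil.rmtree(workdir)
```

A caller would place `coord.db` and any Git fixture under the yielded directory, so a single decision covers all three artifacts.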
## Direct CLI Replay
The repository also includes a reusable direct replay runner at `scripts/run_orch_skill_forward_tests.sh`.
This runner executes the bundled `skills/orch/assets/orch` and `skills/inbox/assets/inbox` binaries against temporary SQLite DBs and Git fixtures without spawning Codex role agents.
Use it to validate packaged CLI behavior and record concrete evidence quickly, but do not treat it as a full replacement for the real subagent-forward model described above.
The case files in this directory now include recorded example runs captured through that direct replay path on `2026-03-19`.
## Real Subagent Forward Runs
The five cases in this directory were also executed with real spawned role agents on `2026-03-19`.
That run used injected project-local `skills/orch/` and `skills/inbox/` bundles with a narrow-context fallback (`fork_context: false`) after an earlier wider-context attempt proved unreliable for this repo.
The successful evidence root for those runs was `/tmp/orch-skill-subagents.J1XWgs`.
Some longer cases staged the leader's progression across phases but kept the same leader agent active throughout, so the run still exercised real agent-driven `orch` control flow rather than a main-thread direct replay.
## Per-Case Template
Each case file should use this structure:
@@ -87,3 +87,30 @@ INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json show --thread THREAD_I
- use the default cleanup policy from [README.md](./README.md)
- if the run fails, retain `TMPDIR` and `coord.db` for replay and manual inspection
## Recorded Example Run
- recorded on: `2026-03-19`
- execution mode: `direct_cli_replay` via `scripts/run_orch_skill_forward_tests.sh`
- result: `pass`
- observed run id: `run_blog_skill_002`
- observed thread id: `thr_42ce634f273745e9b95badc14ce52708`
- evidence summary:
  - `orch wait --for task_blocked` woke on the worker question, and `inbox wait-reply` later woke on the leader answer
  - final `orch status --run run_blog_skill_002 --json` returned `run.status == "done"` and `tasks[0].status == "done"`
  - final `inbox show --thread thr_42ce634f273745e9b95badc14ce52708 --json` contained `question`, `answer`, and `result` messages
  - the recorded `question` payload was `Should logging go to stdout or stderr?`, and the recorded `answer` body was `Use stdout for MVP.`
- note: this recorded run exercised the packaged binaries directly in a temporary DB and did not spawn separate Codex role agents
## Recorded Real Forward Run
- recorded on: `2026-03-19`
- execution mode: `real_subagent_forward_test`
- result: `pass`
- evidence root: `/tmp/orch-skill-subagents.J1XWgs/leader-blocked-answer-resume-through-bundled-cli`
- observed run id: `run_blog_skill_002`
- observed thread id: `thr_fd11536a0b2f4c668f6e78c38090816e`
- evidence summary:
  - a real leader agent using `skills/orch/` completed `wait --for task_blocked`, `blocked`, `answer`, `wait --for task_done`, `reconcile`, and `status`
  - a real worker agent using `skills/inbox/` completed `claim`, `update --status in_progress`, `update --status blocked`, `wait-reply`, resume `update`, and `done`
  - main-thread validation confirmed `run.status == "done"`, `task.status == "done"`, the blocked question payload `Should logging go to stdout or stderr?`, and the answer body `Use stdout for MVP.`
@@ -96,3 +96,34 @@ INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json show --thread THREAD_I
- use the default cleanup policy from [README.md](./README.md)
- if the run fails, retain `TMPDIR` and `coord.db` for replay and manual inspection
## Recorded Example Run
- recorded on: `2026-03-19`
- execution mode: `direct_cli_replay` via `scripts/run_orch_skill_forward_tests.sh`
- result: `pass`
- observed run id: `run_blog_skill_reassign_001`
- observed original thread id: `thr_0a61240412134de3b3d9ab219b6c8f19`
- observed reassigned thread id: `thr_12fbcf6d89d948548306198d013d77a5`
- evidence summary:
  - `orch wait --for task_blocked` woke after worker-a posted a blocked question with payload `Proceed with v1 scope?`
  - `orch reassign --run run_blog_skill_reassign_001 --task T1 --to worker-b --json` returned `attempt_no == 2` and assigned the new attempt to `worker-b`
  - final `inbox show` on the original thread returned `thread.status == "cancelled"` and preserved the blocked `question` message
  - final `inbox show` on the reassigned thread returned `thread.status == "done"`
  - final `orch status --run run_blog_skill_reassign_001 --json` returned `run.status == "done"` and `tasks[0].status == "done"`
- note: this recorded run exercised the packaged binaries directly in a temporary DB and did not spawn separate Codex role agents
## Recorded Real Forward Run
- recorded on: `2026-03-19`
- execution mode: `real_subagent_forward_test`
- result: `pass`
- evidence root: `/tmp/orch-skill-subagents.J1XWgs/leader-reassigns-blocked-task-through-bundled-cli-phased`
- observed run id: `run_blog_skill_reassign_001`
- observed original thread id: `thr_7d43af5bc1f7467da98a39adb0de5808`
- observed reassigned thread id: `thr_eba253db8965423b855d0c784a29702c`
- evidence summary:
  - the same real leader agent using `skills/orch/` completed the case in three phases: initial `run/task/dispatch`, then `wait --for task_blocked` plus `reassign`, then final `wait --for task_done` plus `status`
  - a real `worker-a` agent using `skills/inbox/` claimed the original thread and posted the blocked question `Proceed with v1 scope?`
  - a real `worker-b` agent using `skills/inbox/` claimed the reassigned thread and completed it
  - main-thread validation confirmed the original thread finished `cancelled`, the reassigned thread finished `done`, and the original blocked question remained visible in thread history
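
The terminal-state checks recorded for this case can be sketched as a small validator. This is a minimal sketch, not the project's validation code: `ThreadOutcome` and `check_replacement` are hypothetical names, and the caller is assumed to have already extracted each thread's id and `thread.status` from the `inbox show` output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreadOutcome:
    """Hypothetical container for one thread's id and final status."""
    thread_id: str
    status: str


def check_replacement(original: ThreadOutcome,
                      replacement: ThreadOutcome,
                      attempt_no: int,
                      superseded_status: str = "cancelled") -> list[str]:
    """Return a list of failures; an empty list means every check passed."""
    failures = []
    if original.thread_id == replacement.thread_id:
        failures.append("replacement thread reuses the original thread id")
    if original.status != superseded_status:
        failures.append(
            f"original thread is {original.status!r}, expected {superseded_status!r}")
    if replacement.status != "done":
        failures.append(f"replacement thread is {replacement.status!r}, expected 'done'")
    if attempt_no != 2:
        failures.append(f"attempt_no is {attempt_no}, expected 2")
    return failures


# Thread ids taken from the recorded real forward run for this case.
original = ThreadOutcome("thr_7d43af5bc1f7467da98a39adb0de5808", "cancelled")
reassigned = ThreadOutcome("thr_eba253db8965423b855d0c784a29702c", "done")
print(check_replacement(original, reassigned, attempt_no=2))  # prints []
```

The same shape would fit the retry case by passing `superseded_status="failed"`, since there the first thread is recorded as finishing `failed` rather than `cancelled`.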
@@ -89,3 +89,33 @@ INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json show --thread THREAD_I
- use the default cleanup policy from [README.md](./README.md)
- if the run fails, retain `TMPDIR` and `coord.db` for replay and manual inspection
## Recorded Example Run
- recorded on: `2026-03-19`
- execution mode: `direct_cli_replay` via `scripts/run_orch_skill_forward_tests.sh`
- result: `pass`
- observed run id: `run_blog_skill_retry_001`
- observed first thread id: `thr_8dbf2d2e46d7469891cc1ef604da476f`
- observed second thread id: `thr_bdd86f4fe08e4ebfb39b8151ac41a3bb`
- evidence summary:
  - `orch wait --for task_failed` woke after the first worker-owned thread failed
  - `orch retry --run run_blog_skill_retry_001 --task T1 --json` returned `attempt_no == 2` with a distinct replacement thread for the same worker
  - final `inbox show` on the first thread returned `thread.status == "failed"`
  - final `inbox show` on the second thread returned `thread.status == "done"`
  - final `orch status --run run_blog_skill_retry_001 --json` returned `run.status == "done"` and `tasks[0].status == "done"`
- note: this recorded run exercised the packaged binaries directly in a temporary DB and did not spawn separate Codex role agents
## Recorded Real Forward Run
- recorded on: `2026-03-19`
- execution mode: `real_subagent_forward_test`
- result: `pass`
- evidence root: `/tmp/orch-skill-subagents.J1XWgs/leader-retries-failed-task-through-bundled-cli-phased`
- observed run id: `run_blog_skill_retry_001`
- observed first thread id: `thr_1e22121642294b56aae351ddec5180d1`
- observed second thread id: `thr_f2ab1f1899964007b2447796204e1928`
- evidence summary:
  - the same real leader agent using `skills/orch/` completed the case in three phases: initial `run/task/dispatch`, then `wait --for task_failed` plus `retry`, then final `wait --for task_done` plus `status`
  - a real worker agent using `skills/inbox/` failed the first thread, polled for the retried pending thread, then claimed and completed the second thread
  - main-thread validation confirmed the two thread ids were distinct, the first thread finished `failed`, the second thread finished `done`, and the run/task both finished `done`
@@ -88,3 +88,29 @@ INBOX_SKILL_PATH/assets/inbox --db TMPDIR/coord.db --json show --thread THREAD_I
- use the default cleanup policy from [README.md](./README.md)
- if the run fails, retain `TMPDIR` and `coord.db` for replay and manual inspection
## Recorded Example Run
- recorded on: `2026-03-19`
- execution mode: `direct_cli_replay` via `scripts/run_orch_skill_forward_tests.sh`
- result: `pass`
- observed run id: `run_blog_skill_001`
- observed thread id: `thr_eced1b8cb1254065a7cd3aaff6dc0bcb`
- evidence summary:
  - final `orch status --run run_blog_skill_001 --json` returned `run.status == "done"` with a single task `T1` in state `done`
  - final `inbox show --thread thr_eced1b8cb1254065a7cd3aaff6dc0bcb --json` returned thread state `done` and message kinds `task`, `progress`, and `result`
  - the replay also observed `orch wait --for task_done` wake successfully before the final reconcile
- note: this recorded run exercised the packaged binaries directly in a temporary DB and did not spawn separate Codex role agents
## Recorded Real Forward Run
- recorded on: `2026-03-19`
- execution mode: `real_subagent_forward_test`
- result: `pass`
- evidence root: `/tmp/orch-skill-subagents.J1XWgs/leader-run-dispatch-reconcile-through-bundled-cli`
- observed run id: `run_blog_skill_001`
- observed thread id: `thr_7c64e75bbcce4143a7fc425242f7e7d3`
- evidence summary:
  - a real leader agent using `skills/orch/` completed `run init`, `task add`, `dispatch`, `wait`, `reconcile`, and `status`
  - a real worker agent using `skills/inbox/` completed `fetch`, `claim`, `update --status in_progress`, and `done`
  - main-thread validation confirmed `status.data.run.status == "done"`, `status.data.tasks[0].status == "done"`, and thread history kinds `task`, `progress`, and `result`
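
The main-thread validation described above can be sketched in a few lines of Python. The `data.run.status` and `data.tasks[0].status` paths come from the recorded checks. The location of the message list (`data.messages`) and the per-message `kind` key are assumptions made for illustration, and the sample payloads are hand-written stand-ins, not captured CLI output.

```python
import json

EXPECTED_KINDS = ("task", "progress", "result")


def check_final_state(status_json: str, thread_json: str) -> list[str]:
    """Return a list of failures; an empty list means the case passed."""
    failures = []

    status = json.loads(status_json)["data"]
    if status["run"]["status"] != "done":
        failures.append(f"run.status is {status['run']['status']!r}, expected 'done'")
    tasks = status["tasks"]
    if not tasks or tasks[0]["status"] != "done":
        failures.append("tasks[0].status is not 'done'")

    # ASSUMPTION: messages live under data.messages and each carries a "kind".
    messages = json.loads(thread_json)["data"]["messages"]
    kinds = {message["kind"] for message in messages}
    for expected in EXPECTED_KINDS:
        if expected not in kinds:
            failures.append(f"missing message kind {expected!r}")
    return failures


# Hand-written sample payloads shaped like the checks above (not recorded output).
sample_status = json.dumps(
    {"data": {"run": {"status": "done"}, "tasks": [{"status": "done"}]}})
sample_thread = json.dumps(
    {"data": {"messages": [{"kind": "task"}, {"kind": "progress"}, {"kind": "result"}]}})
print(check_final_state(sample_status, sample_thread))  # prints []
```

Returning a list of failures instead of raising on the first mismatch keeps every deviation visible in one pass, which suits recorded evidence summaries.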
@@ -88,3 +88,32 @@ test ! -d WORKTREE_PATH
- use the default cleanup policy from [README.md](./README.md)
- if the run fails, retain `TMPDIR`, `coord.db`, and the Git repo fixture for replay and manual inspection
## Recorded Example Run
- recorded on: `2026-03-19`
- execution mode: `direct_cli_replay` via `scripts/run_orch_skill_forward_tests.sh`
- result: `pass`
- observed run id: `run_blog_skill_worktree_001`
- observed thread id: `thr_5743259fdccb41f9bb33dce0040b27a5`
- observed worktree suffix: `.orch/worktrees/run-blog-skill-worktree-001/T1/attempt-1`
- evidence summary:
  - `orch dispatch --strict-worktree` returned `base_ref == "HEAD"`, a concrete `base_commit`, branch `orch/run-blog-skill-worktree-001/T1/attempt-1`, and a non-empty `worktree_path`
  - the task payload stored on the worker thread exposed the same `worktree_path`
  - final `orch status --run run_blog_skill_worktree_001 --json` returned `run.status == "done"` and `tasks[0].status == "done"`
  - final `orch cleanup --run run_blog_skill_worktree_001 --task T1 --json` returned one cleaned attempt and the worktree directory no longer existed afterward
- note: this recorded run exercised the packaged binaries directly in a temporary DB and Git fixture and did not spawn separate Codex role agents
## Recorded Real Forward Run
- recorded on: `2026-03-19`
- execution mode: `real_subagent_forward_test`
- result: `pass`
- evidence root: `/tmp/orch-skill-subagents.J1XWgs/strict-worktree-dispatch-to-cleanup-through-bundled-cli`
- observed run id: `run_blog_skill_worktree_001`
- observed thread id: `thr_089527cd07f74b52a524ba07ed74c2e4`
- observed worktree path: `/private/tmp/orch-skill-subagents.J1XWgs/strict-worktree-dispatch-to-cleanup-through-bundled-cli/repo/.orch/worktrees/run-blog-skill-worktree-001/T1/attempt-1`
- evidence summary:
  - a real leader agent using `skills/orch/` completed strict `dispatch`, `wait`, `reconcile`, `cleanup`, and `status`
  - a real worker agent using `skills/inbox/` claimed the thread and finished it with `done`
  - main-thread validation confirmed that the task payload included the same `worktree_path`, even though the worker agent's summary failed to notice it, and that the worktree directory no longer existed after cleanup
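
The observed branch and worktree values suggest a naming pattern that a validator could check. The sketch below infers that pattern from the recorded values only (underscores in the run id become hyphens, followed by the task id and `attempt-N`); it is an assumption, not a documented `orch` rule, and the function names are hypothetical.

```python
def run_slug(run_id: str) -> str:
    """Inferred from the recorded values: underscores become hyphens."""
    return run_id.replace("_", "-")


def expected_branch(run_id: str, task_id: str, attempt_no: int) -> str:
    return f"orch/{run_slug(run_id)}/{task_id}/attempt-{attempt_no}"


def expected_worktree_suffix(run_id: str, task_id: str, attempt_no: int) -> str:
    return f".orch/worktrees/{run_slug(run_id)}/{task_id}/attempt-{attempt_no}"


def worktree_matches(worktree_path: str, run_id: str, task_id: str, attempt_no: int) -> bool:
    """True when the path ends with the expected suffix as whole path components."""
    return worktree_path.endswith("/" + expected_worktree_suffix(run_id, task_id, attempt_no))


# Worktree path recorded in the real forward run above.
observed = (
    "/private/tmp/orch-skill-subagents.J1XWgs/"
    "strict-worktree-dispatch-to-cleanup-through-bundled-cli/repo/"
    ".orch/worktrees/run-blog-skill-worktree-001/T1/attempt-1"
)
print(worktree_matches(observed, "run_blog_skill_worktree_001", "T1", 1))  # prints True
```

Matching on the suffix rather than the full path avoids depending on the temporary evidence root, which differs between runs.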