ai-workflow-skill/docs/architecture.md

# Agent Coordination Architecture

## Purpose

This document defines the system split between the worker-facing `inbox` layer and the leader-facing `orch` layer.

The design target is a local, file-portable agent coordination stack:

- `inbox`: durable communication bus
- `orch`: task graph and scheduling control plane
- worktree-backed task execution for code-writing workers
- optional user-facing council review workflow on top of `orch`
- shared SQLite database file
- leader and workers coordinated through stable CLI commands

## Why Two Layers

`inbox` and `orch` solve different problems.

- `inbox` answers: how do agents exchange durable messages, claim work, report progress, and return results?
- `orch` answers: what work exists, which tasks are ready, who should get them, and what happens after a block, failure, or retry?

If `inbox` is reduced to pure chat storage, the scheduler must reconstruct state from message history and ownership becomes ambiguous. If `inbox` tries to become a full scheduler, worker concerns and leader concerns get mixed into one unstable interface.

## Role Model

- `user`: talks only to the leader
- `leader`: owns the overall goal, task graph, acceptance criteria, and final integration
- `worker`: executes one assigned task at a time and reports through `inbox`
- `inbox`: durable thread/message/lease/artifact store
- `orch`: run/task/dependency/dispatch state machine built on top of `inbox`

## Default Usage Rules

- The leader should use `orch` as the default control surface.
- The leader may use `inbox` directly for inspection or manual repair.
- Workers should use `inbox` only.
- Workers should not use `orch`.
- `orch dispatch` only creates handoff state; it does not start execution. Leaders still need a separate worker runtime or worker agent to consume the assigned inbox thread.
- User-facing discussion stays with the leader.
- Code-writing workers should run in `orch`-assigned Git worktrees, not in the user's primary checkout.

## Shared Storage Model

Both CLIs should point at the same SQLite file.

- `inbox` owns communication tables such as threads, messages, leases, and artifacts.
- `orch` owns scheduling tables such as runs, tasks, dependencies, and attempts.
- Both layers append to a shared event stream that backs blocking waits.
- `orch dispatch` creates or updates `inbox` threads.
- `orch reconcile` reads `inbox` state and updates task state.

This preserves a clean boundary while keeping deployment simple.
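
The ownership split above can be sketched with Python's standard `sqlite3` module. This is a minimal illustration only: the column names, status values, and the helper functions `open_shared_db`, `dispatch`, and `reconcile` are assumptions made for the example, not the actual schema or implementation.

```python
import sqlite3

# Hypothetical, heavily reduced schemas; the real tables may look different.
INBOX_SCHEMA = """
CREATE TABLE IF NOT EXISTS threads (
    thread_id   INTEGER PRIMARY KEY,
    assigned_to TEXT,
    status      TEXT NOT NULL DEFAULT 'open'
);
"""

ORCH_SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    task_id TEXT PRIMARY KEY,
    run_id  TEXT NOT NULL,
    status  TEXT NOT NULL DEFAULT 'pending'
);
CREATE TABLE IF NOT EXISTS attempts (
    attempt_id INTEGER PRIMARY KEY,
    task_id    TEXT NOT NULL REFERENCES tasks(task_id),
    thread_id  INTEGER REFERENCES threads(thread_id)
);
"""


def open_shared_db(path):
    """Open the one SQLite file that both layers point at."""
    conn = sqlite3.connect(path)
    conn.executescript(INBOX_SCHEMA)  # tables owned by inbox
    conn.executescript(ORCH_SCHEMA)   # tables owned by orch
    return conn


def dispatch(conn, task_id, worker):
    """orch side: create an inbox thread and map a new attempt to it."""
    with conn:  # one transaction, so the thread and the attempt appear together
        cur = conn.execute(
            "INSERT INTO threads (assigned_to) VALUES (?)", (worker,)
        )
        thread_id = cur.lastrowid
        conn.execute(
            "INSERT INTO attempts (task_id, thread_id) VALUES (?, ?)",
            (task_id, thread_id),
        )
        conn.execute(
            "UPDATE tasks SET status = 'dispatched' WHERE task_id = ?",
            (task_id,),
        )
    return thread_id


def reconcile(conn, task_id):
    """orch side: read inbox thread state and update the task state."""
    row = conn.execute(
        "SELECT t.status FROM attempts a "
        "JOIN threads t ON t.thread_id = a.thread_id "
        "WHERE a.task_id = ? ORDER BY a.attempt_id DESC LIMIT 1",
        (task_id,),
    ).fetchone()
    if row is not None and row[0] == "done":
        with conn:
            conn.execute(
                "UPDATE tasks SET status = 'done' WHERE task_id = ?",
                (task_id,),
            )
    return conn.execute(
        "SELECT status FROM tasks WHERE task_id = ?", (task_id,)
    ).fetchone()[0]
```

In this sketch `orch` writes to `threads` only inside `dispatch`, and otherwise treats inbox tables as read-only input to `reconcile`.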

## Optional Codex Launch Bridge

Some environments may layer an execution bridge on top of `orch`.

Recommended shape:

- `orch dispatch --json` creates the durable handoff state
- a leader-side Codex bridge reads the dispatch result
- that bridge may spawn a worker sub-agent and pass it the mapped `thread_id`, `assigned_to`, and any `worktree_path`
- the worker still reports only through `inbox`

This bridge belongs above the CLI layer.
It should not be implemented as core `orch` runtime behavior because worker launch is host-specific while run and attempt state are meant to stay portable.
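
As a rough sketch of the bridge's first step, the snippet below parses a dispatch result and turns it into launch instructions for a worker. The JSON shape is an assumption: only the field names `thread_id`, `assigned_to`, and `worktree_path` come from this document, and `WorkerLaunch`, `parse_dispatch_result`, and `build_worker_prompt` are hypothetical helpers.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerLaunch:
    thread_id: str
    assigned_to: str
    worktree_path: Optional[str]  # None when the dispatch has no worktree


def parse_dispatch_result(raw: str) -> WorkerLaunch:
    """Parse the output of `orch dispatch --json` (assumed to be a flat object)."""
    data = json.loads(raw)
    return WorkerLaunch(
        thread_id=str(data["thread_id"]),
        assigned_to=data["assigned_to"],
        worktree_path=data.get("worktree_path"),
    )


def build_worker_prompt(launch: WorkerLaunch) -> str:
    """Build host-specific launch instructions; the wording is illustrative."""
    lines = [
        f"You are worker {launch.assigned_to}.",
        f"Claim inbox thread {launch.thread_id} and report only through inbox.",
    ]
    if launch.worktree_path:
        lines.append(f"Work inside the worktree at {launch.worktree_path}.")
    return "\n".join(lines)
```

How the prompt is handed to a sub-agent is deliberately left out, since that part is host-specific.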

## Worker Execution Model

For code tasks, execution should be isolated from the user's primary checkout.

- `orch dispatch --execution-mode code` should create a task-attempt worktree
- the assigned worktree path should be stored in attempt metadata and inbox task payload
- the worker runtime should execute inside that worktree
- strict mode should require a committed base revision
- `orch dispatch --execution-mode analysis` should stay on a thread-only path with no worktree, but it still requires a separate worker runtime to claim the inbox thread

See [worktree-execution.md](/home/kurihada/project/ai-workflow-skill/docs/worktree-execution.md) for the full lifecycle.
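
The two execution modes can be sketched as a planning function that decides whether an attempt gets a worktree. Everything below is an assumption for illustration: the `plan_attempt` helper, the `.worktrees` directory, the branch naming, and the reading of strict mode as "a base revision must be supplied". Only `git worktree add` itself is a real Git command.

```python
from pathlib import Path


def plan_attempt(execution_mode, task_id, attempt_no,
                 repo_root=None, base_revision=None, strict=True):
    """Return attempt metadata for a dispatch (hypothetical shape)."""
    if execution_mode == "analysis":
        # Thread-only path: no worktree is allocated.
        return {"execution_mode": "analysis",
                "worktree_path": None,
                "git_command": None}
    if execution_mode != "code":
        raise ValueError(f"unknown execution mode: {execution_mode!r}")
    if repo_root is None:
        raise ValueError("code mode needs a repository root")
    if strict and not base_revision:
        raise ValueError("strict mode requires a committed base revision")

    name = f"{task_id}-attempt-{attempt_no}"
    worktree_path = Path(repo_root) / ".worktrees" / name  # assumed location
    git_command = [
        "git", "-C", str(repo_root),
        "worktree", "add",
        "-b", f"task/{name}",        # assumed branch naming
        str(worktree_path),
        base_revision or "HEAD",
    ]
    return {"execution_mode": "code",
            "worktree_path": str(worktree_path),
            "git_command": git_command}
```

The returned `worktree_path` is the value that would be stored in attempt metadata and copied into the inbox task payload; running `git_command` is left to the caller.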

## Event-Driven Waiting

The leader does not receive worker messages as an in-memory push. Workers write state into `inbox`, and the leader must read it back through CLI commands.

The intended solution is event-driven blocking waits, not ad hoc `sleep` loops.

- leaders should use `orch wait`
- blocked workers should use `inbox wait-reply`
- low-level polling may still exist internally, but it should be hidden inside the CLI

Blocking waits do not introduce a second leader. There is still one logical leader, and the wait is simply a primitive that it calls.
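
A minimal sketch of such a wait primitive, assuming the CLI hides a short polling loop behind one blocking call. The `wait_for` function and its parameters are hypothetical; `fetch_new` stands in for whatever query returns events newer than the caller's cursor.

```python
import time


def wait_for(fetch_new, timeout=30.0, interval=0.2,
             clock=time.monotonic, sleep=time.sleep):
    """Block until fetch_new() returns a non-empty list or the timeout expires.

    Returns the new events, or an empty list on timeout.
    """
    deadline = clock() + timeout
    while True:
        events = fetch_new()
        if events:
            return events
        remaining = deadline - clock()
        if remaining <= 0:
            return []
        # The polling interval is an internal detail the caller never sees.
        sleep(min(interval, remaining))
```

Callers see a single blocking call with a timeout, which is the property that replaces ad hoc `sleep` loops in leader or worker scripts.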

## Shared Event Stream

To support blocking waits cleanly, both layers should append rows to a shared `events` table.

Typical emitters:

- `inbox`: claim, progress, blocked, answer, done, fail, cancel
- `orch`: dispatch, answer, retry, reassign, cancel, reconcile-driven task state changes

Typical consumers:

- `orch wait`: watches run-scoped task events for the leader
- `inbox wait-reply`: watches thread-scoped reply events for a blocked worker

Every waiter should use a monotonic cursor such as `event_id` or `message_id`, so it can resume safely without reprocessing old events.
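
The cursor idea can be illustrated with a small `events` table and a reader that only returns rows past the last seen `event_id`. The column set and the `emit` and `events_after` helpers are assumptions for this sketch; only the table name and the `event_id` cursor come from the text above.

```python
import sqlite3

# Hypothetical minimal shape of the shared events table.
EVENTS_SCHEMA = """
CREATE TABLE IF NOT EXISTS events (
    event_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    source    TEXT NOT NULL,   -- 'inbox' or 'orch'
    kind      TEXT NOT NULL,   -- e.g. 'claim', 'blocked', 'dispatch'
    run_id    TEXT,
    thread_id TEXT
);
"""


def emit(conn, source, kind, run_id=None, thread_id=None):
    """Append one event and return its event_id."""
    with conn:
        cur = conn.execute(
            "INSERT INTO events (source, kind, run_id, thread_id) "
            "VALUES (?, ?, ?, ?)",
            (source, kind, run_id, thread_id),
        )
    return cur.lastrowid


def events_after(conn, cursor, run_id=None, thread_id=None):
    """Return (rows, new_cursor) for events with event_id greater than cursor."""
    query = "SELECT event_id, source, kind FROM events WHERE event_id > ?"
    params = [cursor]
    if run_id is not None:        # run-scoped view, as a leader would use
        query += " AND run_id = ?"
        params.append(run_id)
    if thread_id is not None:     # thread-scoped view, as a worker would use
        query += " AND thread_id = ?"
        params.append(thread_id)
    query += " ORDER BY event_id"
    rows = conn.execute(query, params).fetchall()
    new_cursor = rows[-1][0] if rows else cursor
    return rows, new_cursor
```

Because the cursor only moves forward, a waiter that restarts with its last saved cursor sees each event at most once.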

## Recommended Binary Layout

The recommended v1 shape is:

- `inbox` binary for communication primitives
- `orch` binary for leader-side planning and scheduling
- one shared `--db PATH`

If packaging later favors a single binary, the same model can be exposed as command groups:

- `agentctl inbox ...`
- `agentctl orch ...`
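
As an illustration of the single-binary variant, the command groups could be wired up with nested `argparse` subparsers. This is a sketch under assumptions: only the subcommand names already mentioned in this document are registered, and `build_parser` is a hypothetical helper, not the real entry point.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="agentctl")
    # One shared database path for both command groups.
    parser.add_argument("--db", required=True, metavar="PATH")
    groups = parser.add_subparsers(dest="group", required=True)

    inbox = groups.add_parser("inbox", help="communication primitives")
    inbox_cmds = inbox.add_subparsers(dest="command", required=True)
    inbox_cmds.add_parser("wait-reply")

    orch = groups.add_parser("orch", help="leader-side planning and scheduling")
    orch_cmds = orch.add_subparsers(dest="command", required=True)
    for name in ("dispatch", "reconcile", "wait"):
        orch_cmds.add_parser(name)

    return parser
```

For example, `build_parser().parse_args(["--db", "state.db", "orch", "dispatch"])` yields a namespace with `group="orch"` and `command="dispatch"`.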

## Responsibility Split

`inbox` should own:

- directed messages
- durable threads
- worker claiming and leases
- progress, blocked, result, and failure events
- artifact references
- thread history and watch functionality
- thread-scoped waiting for replies
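
To make "worker claiming and leases" concrete, here is one way an atomic claim could work on SQLite: a single conditional `UPDATE` that succeeds only if the lease is free or expired. The `leases` columns, the `claim_thread` helper, and the 300-second default are assumptions for the sketch.

```python
import sqlite3

# Hypothetical lease table: one row per claimable thread.
LEASES_SCHEMA = """
CREATE TABLE IF NOT EXISTS leases (
    thread_id  TEXT PRIMARY KEY,
    holder     TEXT,
    expires_at REAL
);
"""


def claim_thread(conn, thread_id, worker, now, lease_seconds=300):
    """Try to take the lease; return True only if this worker now holds it."""
    with conn:
        cur = conn.execute(
            "UPDATE leases SET holder = ?, expires_at = ? "
            "WHERE thread_id = ? AND (holder IS NULL OR expires_at <= ?)",
            (worker, now + lease_seconds, thread_id, now),
        )
    # rowcount is 1 when the condition matched, 0 when someone else holds it.
    return cur.rowcount == 1
```

Keeping the check and the write in one statement avoids a read-then-write race between two workers claiming the same thread.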

`orch` should own:

- runs
- task graph and dependencies
- ready queue calculation
- dispatch decisions
- task-attempt worktree allocation
- blocked queue review for the leader
- retries, reassignment, and cancellation
- mapping task attempts to inbox threads
- run-scoped waiting for actionable events
- reusable higher-level workflows such as council review
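
Ready queue calculation, for instance, reduces to a filter over the dependency graph. The sketch below assumes task states named `pending` and `done` and a plain dict representation; both are illustrative choices, as is the `ready_tasks` helper.

```python
def ready_tasks(status, depends_on):
    """Return pending tasks whose dependencies are all done.

    status:     dict of task_id -> state string
    depends_on: dict of task_id -> iterable of prerequisite task_ids
    """
    ready = []
    for task_id, state in status.items():
        if state != "pending":
            continue
        deps = depends_on.get(task_id, ())
        if all(status.get(dep) == "done" for dep in deps):
            ready.append(task_id)
    return sorted(ready)  # stable order keeps dispatch deterministic
```

This logic depends only on task and dependency state, which is why it belongs in `orch` rather than in the message layer.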

## What Not To Mix

Do not put these into `inbox`:

- dependency graph logic
- automatic worker selection policy
- retry policy
- acceptance-driven task completion logic

Do not put these into `orch`:

- worker claiming
- low-level message append/reply primitives
- raw thread history storage

## Reading Order

- [inbox-cli.md](/home/kurihada/project/ai-workflow-skill/docs/inbox-cli.md): worker-facing bus and low-level message protocol
- [orch-cli.md](/home/kurihada/project/ai-workflow-skill/docs/orch-cli.md): leader-facing scheduler and task graph control plane
- [worktree-execution.md](/home/kurihada/project/ai-workflow-skill/docs/worktree-execution.md): strict worktree model for code-writing task attempts
- [council-review.md](/home/kurihada/project/ai-workflow-skill/docs/council-review.md): user-facing three-reviewer brainstorm and voting workflow
- [skill-workspace-monorepo.md](/home/kurihada/project/ai-workflow-skill/docs/skill-workspace-monorepo.md): repository structure, package ownership, and skill workspace layout

## Skills

The intended skill split mirrors the CLI split.

- `inbox` skill: used when an agent needs to fetch work, claim a thread, send progress, ask blocked questions, reply, or return results through `inbox`
- `orch` skill: used when the leader needs to create runs, decompose tasks, manage dependencies, dispatch ready work, inspect blocks, answer them, retry failures, or reassign work through `orch`; it is not itself the worker launcher
- The `orch` skill may include helper assets for leader-side launch bridges, but the durable source of truth for scheduling remains the `orch` CLI and the shared SQLite state.
- `council-review` skill: used when the user explicitly wants a structured three-reviewer brainstorm or review with grouped and tallied recommendations