feat(skill/gemini-image-web): unify image flow with music/video skills

2026-03-04 13:54:27 +08:00
parent 787a3334b6
commit b9153c70c7
4 changed files with 364 additions and 35 deletions
@@ -7,12 +7,14 @@ description: "Generate images in Gemini web via browser automation, download res
 ## Workflow
-1. Open Gemini web and confirm user is logged in.
-2. Set output directory and target image count.
-3. Send one image-generation prompt per request until target count is reached.
-4. For each request, wait until generation ends (`停止回答` button disappears), then download.
-5. Collect downloaded files into target folder with batch naming, dedupe, and manifest.
-6. Return file paths, manifest path, and failure summary.
+1. Open Gemini web and check whether the user is logged in.
+2. If not logged in, stop and explicitly ask the user to log in.
+3. If logged in, open `工具` and click `创作图片`/`制作图片`.
+4. Set output directory and target image count.
+5. Send one image-generation prompt per request until target count is reached.
+6. For each request, wait until generation ends (`停止回答` button disappears), then download the latest image result.
+7. Collect downloaded files into target folder with batch naming, dedupe, and manifest.
+8. Return file paths, manifest path, and failure summary.
 ## 1) Prerequisites
@@ -22,16 +24,38 @@ description: "Generate images in Gemini web via browser automation, download res
  - `export PLAYWRIGHT_SHARED_SESSION=codex-shared`
  - Invoke Playwright CLI through `/Users/xd/java/xhs/tools/pw` (do not pass `--session` manually).
 - Decide output directory before generation, for example:
-  - `/Users/xd/java/xhs/output/gemini`
+  - `/Users/xd/java/xhs/output/gemini-image`
-## 2) Open Gemini
+Quick run:
 ```bash
 export PLAYWRIGHT_SHARED_SESSION=codex-shared
 python3 scripts/run_image_flow.py \
  --prompt "生成一张电影感赛博朋克街景海报，夜晚霓虹，雨天反光，纵向构图。" \
  --target /Users/xd/java/xhs/output/gemini-image \
  --count 1
 ```
 ## 2) Open Gemini And Enforce Login Gate
 - Navigate to Gemini app page.
-- Confirm login state by checking account/avatar area.
-- If not logged in, stop and ask user to complete login manually.
+- Check login state via account/avatar area or login controls.
+- If login controls are present (`登录`, `Sign in`, or `ServiceLogin` URL), stop immediately and ask user to log in.
 - Continue only when login is confirmed.
 - If model selection is needed, choose a model that supports image output.
-## 3) Multi-Image Generation Strategy
+## 3) Enter Image Creation Tool
 - Click `工具`.
 - Click image tool item by visible text priority:
  - `创作图片`
  - `制作图片`
  - `Create image`
  - `Image`
 - If quick-intent card click is intercepted by overlay, retry via `工具` menu item.
 - If image tool is not present after login is confirmed, stop and report capability unavailable for this account/region/model.
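A minimal sketch of the "visible text priority" rule above, assuming the caller has already scraped the visible menu labels into a list. `pick_tool_label` and `LABEL_PRIORITY` are hypothetical names used only for illustration; the sketch uses exact text matches, whereas the bundled runner matches by regular expression.

```python
from typing import Iterable, Optional

# Priority order copied from the list above; earlier entries win.
LABEL_PRIORITY = ("创作图片", "制作图片", "Create image", "Image")


def pick_tool_label(visible_labels: Iterable[str]) -> Optional[str]:
    """Return the highest-priority label that is currently visible, else None."""
    visible = {label.strip() for label in visible_labels}
    for label in LABEL_PRIORITY:
        if label in visible:
            return label
    # No known label is present: the caller should stop and report
    # that the capability is unavailable.
    return None
```

Returning `None` rather than raising keeps the "capability unavailable" decision with the caller, which matches the stop-and-report behavior described above.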
 ## 4) Multi-Image Generation Strategy
 - Gemini web currently returns one image per request.
 - If user asks for `N` images, run `N` requests in sequence.
@@ -44,7 +68,7 @@ Prompt construction rules:
 - Include visual style, lighting, composition, and aspect ratio.
 - Include banned elements only if user requests negative constraints.
-## 4) Wait For Completion (Explicit End Condition)
+## 5) Wait For Completion (Explicit End Condition)
 - After submit, wait for generation state to appear.
 - Treat generation as complete only when:
@@ -52,14 +76,14 @@ Prompt construction rules:
  - latest assistant response has downloadable image action.
 - If refs are stale or state is unclear, re-snapshot and retry once.
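The "re-snapshot and retry once" rule can be sketched as a small helper. This is an illustration only: `retry_once`, `action`, and `refresh` are hypothetical names, and the sketch assumes any exception from `action` signals a stale ref or unclear state.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_once(action: Callable[[], T], refresh: Callable[[], None]) -> T:
    """Run action; if it fails, call refresh (e.g. re-snapshot) and try exactly once more."""
    try:
        return action()
    except Exception:
        refresh()
        # A second failure propagates to the caller; there is no further retry.
        return action()
```

Capping the retry at one attempt keeps a persistently broken page from looping forever.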
-## 5) Download Images
+## 6) Download Images
 - Download from the latest assistant response block (not old history blocks).
 - Click `下载完整尺寸的图片`.
 - Wait for download completion toast/progress to end before next request.
 - Repeat until target count is reached or retry budget is exhausted.
-## 6) Collect Downloaded Files
+## 7) Collect Downloaded Files
 Use bundled script:
@@ -67,11 +91,11 @@ Use bundled script:
 python3 scripts/collect_downloads.py \
  --source /var/folders/.../playwright-mcp-output/<session-id> \
  --source ~/Downloads \
-  --target /ABS/PATH/TO/output/gemini \
+  --target /ABS/PATH/TO/output/gemini-image \
  --since <download_start_unix_ts> \
  --limit <max_to_collect> \
  --expected-count <required_count> \
-  --prefix gemini \
+  --prefix gemini-image \
  --batch-id <run_id> \
  --prompt "<prompt_used>"
 ```
@@ -80,17 +104,19 @@ Script behavior:
 - Source strategy:
  - Prefer Playwright temp download directory first.
-  - Fallback to `~/Downloads` when primary source has no matches.
+  - Also scan `.playwright-cli` and fallback to `~/Downloads`.
 - Filters to image extensions (`png,jpg,jpeg,webp`).
 - Uses batch naming (`<prefix>-<batch-id>-NN.ext`).
 - Dedupes by SHA-256 (current run + existing target files).
 - Captures dimensions (`width`, `height`) and writes JSON manifest.
 - Prints absolute output paths and manifest path.
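As a rough illustration of the naming and dedupe behavior listed above, the sketch below hashes each candidate with SHA-256, skips content already seen, and assigns `<prefix>-<batch-id>-NN.ext` names. It is not the bundled script: `sha256_of` and `plan_batch` are hypothetical helpers, and the two-digit zero padding for `NN` is an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_batch(files, existing_hashes, prefix, batch_id):
    """Return (source_path, new_name) pairs, skipping duplicate content."""
    seen = set(existing_hashes)  # hashes already present in the target folder
    planned = []
    for path in files:
        file_hash = sha256_of(path)
        if file_hash in seen:
            continue  # duplicate of an earlier file in this run or in the target
        seen.add(file_hash)
        index = len(planned) + 1
        ext = path.suffix.lower().lstrip(".")
        planned.append((path, f"{prefix}-{batch_id}-{index:02d}.{ext}"))
    return planned
```

Numbering only the files that survive dedupe keeps the `NN` sequence gap-free.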
-## 7) Failure Handling By Step
+## 8) Failure Handling By Step
 - Login step:
  - If login/captcha/MFA blocks, stop and ask user to complete manually.
 - Tool-selection step:
  - If `创作图片` is missing after login, stop and report unsupported capability.
 - Generation step:
  - If failed once, retry once with minimal prompt rewrite.
  - If still failing, record failure reason and continue remaining quota if requested.
@@ -105,7 +131,7 @@ Script behavior:
  - If dedupe removes all files, return manifest with `no_files_after_dedupe`.
  - If collected count < required count, return `insufficient_files`.
-## 8) Return Output
+## 9) Return Output
 Return:
@@ -115,13 +141,13 @@ Return:
 - manifest absolute path
 - retries, failures, and skipped duplicates
-## 9) Reliability Rules
+## 10) Reliability Rules
-- Re-snapshot after navigation, model switch, and generation completion.
+- Re-snapshot after navigation, tool switch, and generation completion.
 - If refs are stale or click intercepted, re-snapshot and retry once.
 - Do not assume static selectors across Gemini updates; rely on visible text and role-first matching.
-## 10) Boundaries
+## 11) Boundaries
 - Do not bypass login verification, captcha, paywalls, or security checks.
 - Do not submit disallowed or unsafe image prompts.
@@ -130,4 +156,5 @@ Return:
 ## Scripts
 - `/Users/xd/java/xhs/tools/pw`: Shared Playwright CLI entrypoint with fixed session + lock.
 - `scripts/run_image_flow.py`: End-to-end runner (login gate, enter image tool, generate, download image, collect files).
 - `scripts/collect_downloads.py`: Collect recent downloaded images with fallback sources, dedupe, and manifest.
@@ -1,4 +1,4 @@
 interface:
  display_name: "Gemini Image Web"
  short_description: "Generate Gemini images via web, multi-request, dedupe, and manifest."
-  default_prompt: "Use $gemini-image-web with PLAYWRIGHT_SHARED_SESSION=codex-shared; run browser steps only through /Users/xd/java/xhs/tools/pw, generate one image per Gemini request until target count is reached, download full-size outputs, then collect files with fallback source strategy, dedupe, and manifest metadata."
+  default_prompt: "Use $gemini-image-web with PLAYWRIGHT_SHARED_SESSION=codex-shared; run scripts/run_image_flow.py, which drives the browser through /Users/xd/java/xhs/tools/pw, to verify login, generate images, prefer full-size download, and collect deduped outputs with manifest."
@@ -58,8 +58,8 @@ def parse_args() -> argparse.Namespace:
    )
    parser.add_argument(
        "--prefix",
-        default="gemini",
+        default="gemini-image",
-        help="Filename prefix for collected files. Default: gemini",
+        help="Filename prefix for collected files. Default: gemini-image",
    )
    parser.add_argument(
        "--batch-id",
@@ -107,7 +107,7 @@ def collect_candidates(source: Path, since_ts: float, allowed_ext: set[str]) ->
    files: list[Path] = []
    if not source.exists():
        return files
-    for path in source.iterdir():
+    for path in source.rglob("*"):
        if not path.is_file():
            continue
        ext = path.suffix.lower().lstrip(".")
@@ -125,7 +125,10 @@ def collect_candidates(source: Path, since_ts: float, allowed_ext: set[str]) ->
 def discover_playwright_sources() -> list[Path]:
    globs = (
        "/var/folders/*/*/T/playwright-mcp-output/*",
        "/private/var/folders/*/*/T/playwright-mcp-output/*",
        "/var/folders/*/*/*/T/playwright-mcp-output/*",
        "/private/var/folders/*/*/*/T/playwright-mcp-output/*",
        "/tmp/playwright-mcp-output/*",
    )
    candidates: list[Path] = []
@@ -147,6 +150,8 @@ def resolve_sources(raw_sources: list[str] | None) -> list[Path]:
    if raw_sources:
        return [Path(item).expanduser().resolve() for item in raw_sources]
    auto_sources = discover_playwright_sources()
+    auto_sources.append((Path.cwd() / ".playwright-cli").resolve())
+    auto_sources.append((Path(__file__).resolve().parents[3] / ".playwright-cli").resolve())
    auto_sources.append((Path.home() / "Downloads").resolve())
    result: list[Path] = []
    seen: set[Path] = set()
@@ -214,16 +219,23 @@ def iso_ts(ts: float) -> str:
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
-def select_source_candidates(
+def collect_candidates_all_sources(
    sources: list[Path], since_ts: float, allowed_ext: set[str]
-) -> tuple[Path | None, list[Path], list[dict[str, object]]]:
+) -> tuple[list[Path], list[dict[str, object]]]:
    tried: list[dict[str, object]] = []
+    merged: list[Path] = []
+    seen: set[Path] = set()
    for source in sources:
        files = collect_candidates(source, since_ts, allowed_ext)
        tried.append({"source": str(source), "matches": len(files)})
-        if files:
-            return source, files, tried
-    return None, [], tried
+        for file_path in files:
+            resolved = file_path.resolve()
+            if resolved in seen:
+                continue
+            seen.add(resolved)
+            merged.append(file_path)
+    merged.sort(key=lambda p: p.stat().st_mtime, reverse=True)
+    return merged, tried
 def collect_existing_hashes(target: Path, allowed_ext: set[str]) -> set[str]:
@@ -269,9 +281,7 @@ def main() -> int:
        return 2
    sources = resolve_sources(args.source)
-    selected_source, candidates, tried_sources = select_source_candidates(
-        sources, args.since, allowed_ext
-    )
+    candidates, tried_sources = collect_candidates_all_sources(sources, args.since, allowed_ext)
    if not candidates:
        payload = {
            "status": "no_matching_files",
@@ -345,7 +355,6 @@ def main() -> int:
        "batch_id": batch_id,
        "prompt": args.prompt,
        "target_dir": str(target),
-        "source_dir": str(selected_source) if selected_source else None,
        "sources_tried": tried_sources,
        "since_ts": args.since,
        "limit": args.limit,
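The behavioral change in this file is easiest to see side by side. The sketch below is a simplified model, not the script itself: sources are plain lists of file names rather than directories, the newest-first `st_mtime` sort is omitted, and `first_match` / `merge_all` are hypothetical names.

```python
from __future__ import annotations


def first_match(sources: dict[str, list[str]]) -> list[str]:
    """Old behaviour: stop at the first source that has any match."""
    for files in sources.values():
        if files:
            return list(files)
    return []


def merge_all(sources: dict[str, list[str]]) -> list[str]:
    """New behaviour: gather matches from every source, dropping repeats."""
    merged: list[str] = []
    seen: set[str] = set()
    for files in sources.values():
        for name in files:
            if name in seen:
                continue
            seen.add(name)
            merged.append(name)
    return merged


sources = {
    "playwright-tmp": ["a.png"],
    ".playwright-cli": [],
    "Downloads": ["a.png", "b.png"],
}
print(first_match(sources))  # ['a.png']
print(merge_all(sources))    # ['a.png', 'b.png']
```

With the old strategy, a file that landed only in `~/Downloads` was missed whenever the Playwright temp directory had at least one match; merging trades a wider scan for not losing those files.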
@@ -0,0 +1,293 @@
 #!/usr/bin/env python3
 """Run Gemini image generation flow end-to-end via Playwright CLI."""
 from __future__ import annotations
 import argparse
 import json
 import os
 import re
 import subprocess
 import sys
 import time
 from pathlib import Path
 class FlowError(RuntimeError):
    """Raised when a subprocess command in the flow fails."""
 def run_command(
    cmd: list[str], *, capture_output: bool = True, check: bool = True
 ) -> subprocess.CompletedProcess[str]:
    kwargs: dict[str, object] = {"text": True}
    if capture_output:
        kwargs["stdout"] = subprocess.PIPE
        kwargs["stderr"] = subprocess.STDOUT
    proc = subprocess.run(cmd, **kwargs)
    if check and proc.returncode != 0:
        output = proc.stdout if capture_output else ""
        raise FlowError(
            f"Command failed ({proc.returncode}): {' '.join(cmd)}\n{output}"
        )
    return proc
 def run_pw(pw_shared: Path, *args: str) -> str:
    proc = run_command([str(pw_shared), *args], capture_output=True)
    return proc.stdout or ""
 def is_login_required(pw_shared: Path) -> bool:
    out = run_pw(
        pw_shared,
        "eval",
        (
            "() => {"
            "const hasAccount = !!document.querySelector("
            "'button[aria-label*=\\\"Google 账号\\\"], "
            "button[aria-label*=\\\"Google Account\\\"]'"
            ");"
            "const hasService = !!document.querySelector('a[href*=\\\"ServiceLogin\\\"]');"
            "const hasLoginCtl = Array.from(document.querySelectorAll('a,button'))"
            ".some(el => /登录|Sign in/i.test((el.textContent || '').trim()));"
            "return !hasAccount && (hasService || hasLoginCtl);"
            "}"
        ),
    )
    return bool(re.search(r"(?m)^true$", out))
 def enter_image_tool(pw_shared: Path) -> None:
    js = r"""
 async (page) => {
 const labels = [/创作图片/, /制作图片/, /Create image/i, /Image/i];
 const openToolMenu = async () => {
  const cn = page.getByRole('button', { name: '工具', exact: true }).first();
  if (await cn.count()) {
    await cn.click();
    return true;
  }
  const generic = page.getByRole('button', { name: /工具|Tools/i }).first();
  if (await generic.count()) {
    await generic.click();
    return true;
  }
  return false;
 };
 const tryCardButtons = async () => {
  for (const re of labels) {
    const btn = page.getByRole('button', { name: re }).first();
    if (await btn.count()) {
      try {
        await btn.click({ timeout: 2000 });
        return true;
      } catch (_) {
        // Overlay may intercept pointer. Fall through to menu strategy.
      }
    }
  }
  return false;
 };
 const tryToolMenu = async () => {
  const opened = await openToolMenu();
  if (!opened) return false;
  for (const re of labels) {
    const itemCheck = page.getByRole('menuitemcheckbox', { name: re }).first();
    if (await itemCheck.count()) {
      await itemCheck.click();
      return true;
    }
    const itemPlain = page.getByRole('menuitem', { name: re }).first();
    if (await itemPlain.count()) {
      await itemPlain.click();
      return true;
    }
  }
  return false;
 };
 let ok = await tryCardButtons();
 if (!ok) ok = await tryToolMenu();
 if (!ok) ok = await tryToolMenu();
 if (!ok) throw new Error('Image tool entry not found');
 }
 """
    run_pw(pw_shared, "run-code", js)
 def submit_and_download_one(pw_shared: Path, prompt: str) -> None:
    js = f"""
 async (page) => {{
 const prompt = {json.dumps(prompt)};
 const input = page.getByRole('textbox', {{ name: /为 Gemini 输入提示|Enter a prompt/i }}).first();
 await input.click();
 await input.fill(prompt);
 await input.press('Enter');
 const stopBtn = page.getByRole('button', {{ name: /停止回答|Stop response/i }}).first();
 await stopBtn.waitFor({{ state: 'visible', timeout: 15000 }}).catch(() => {{}});
 await stopBtn.waitFor({{ state: 'hidden', timeout: 240000 }});
 const downloadBtn = page.getByRole('button', {{ name: /下载完整尺寸的图片|下载图片|Download full size|Download image|Download/i }}).last();
 if (!(await downloadBtn.count())) {{
  throw new Error('Image download button not found');
 }}
 const downloadPromise = page.waitForEvent('download', {{ timeout: 45000 }}).catch(() => null);
 await downloadBtn.click();
 const preferredItem = page.getByRole('menuitem', {{ name: /完整尺寸|Full size|PNG|JPG|JPEG|WEBP/i }}).first();
 if (await preferredItem.isVisible().catch(() => false)) {{
  await preferredItem.click();
 }} else {{
  const anyItem = page.getByRole('menuitem').first();
  if (await anyItem.isVisible().catch(() => false)) {{
    await anyItem.click();
  }}
 }}
 const download = await downloadPromise;
 if (!download) {{
  const failedToast = page.getByText(/下载失败|Download failed|无法下载|保存失败/i).first();
  if (await failedToast.isVisible().catch(() => false)) {{
    throw new Error('Image download failed');
  }}
  throw new Error('Image download did not start');
 }}
 await download.path().catch(() => null);
 await page.waitForTimeout(800);
 }}
 """
    run_pw(pw_shared, "run-code", js)
 def retry_click_latest_download(pw_shared: Path) -> None:
    js = r"""
 async (page) => {
 const btn = page.getByRole('button', { name: /下载完整尺寸的图片|下载图片|Download full size|Download image|Download/i }).last();
 if (!(await btn.count())) {
  throw new Error('Image download button not found for retry');
 }
 const downloadPromise = page.waitForEvent('download', { timeout: 45000 }).catch(() => null);
 await btn.click();
 const download = await downloadPromise;
 if (!download) {
  throw new Error('Retry image download did not start');
 }
 await download.path().catch(() => null);
 await page.waitForTimeout(800);
 }
 """
    run_pw(pw_shared, "run-code", js)
 def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Generate images on Gemini web and collect downloaded files."
    )
    parser.add_argument("--prompt", required=True, help="Prompt text for image generation.")
    parser.add_argument(
        "--target", required=True, help="Absolute output directory for collected files."
    )
    parser.add_argument(
        "--count", type=int, default=1, help="Number of images to generate. Default: 1."
    )
    parser.add_argument(
        "--no-headed",
        action="store_true",
        help="Run browser without headed mode.",
    )
    return parser.parse_args()
 def main() -> int:
    args = parse_args()
    if args.count < 1:
        print("--count must be a positive integer.", file=sys.stderr)
        return 1
    repo_root = Path(__file__).resolve().parents[3]
    pw_shared = Path(
        os.environ.get("PW_SHARED_WRAPPER", str(repo_root / "tools/pw"))
    ).expanduser()
    collect_script = (Path(__file__).resolve().parent / "collect_downloads.py").resolve()
    if not pw_shared.exists() or not pw_shared.is_file():
        print(f"Shared Playwright wrapper not found: {pw_shared}", file=sys.stderr)
        return 1
    if not os.access(pw_shared, os.X_OK):
        print(f"Shared Playwright wrapper is not executable: {pw_shared}", file=sys.stderr)
        return 1
    if not collect_script.exists():
        print(f"Collector script not found: {collect_script}", file=sys.stderr)
        return 1
    target = Path(args.target).expanduser().resolve()
    target.mkdir(parents=True, exist_ok=True)
    start_ts = time.time()
    try:
        os.environ["PLAYWRIGHT_SHARED_INIT_MODE"] = (
            "headless" if args.no_headed else "headed"
        )
        run_pw(pw_shared, "snapshot")
        run_pw(pw_shared, "goto", "https://gemini.google.com/app")
        run_pw(pw_shared, "snapshot")
        if is_login_required(pw_shared):
            print(
                "Gemini is not logged in. Please log in at https://gemini.google.com/app and rerun.",
                file=sys.stderr,
            )
            return 2
        enter_image_tool(pw_shared)
        for i in range(1, args.count + 1):
            current_prompt = args.prompt
            if args.count > 1:
                current_prompt = (
                    f"{args.prompt}\n"
                    f"变体要求：这是第 {i} / {args.count} 张。保持主题一致，但构图和光影细节需要变化。"
                )
            submit_and_download_one(pw_shared, current_prompt)
        collect_cmd = [
            sys.executable,
            str(collect_script),
            "--target",
            str(target),
            "--since",
            str(start_ts),
            "--expected-count",
            str(args.count),
            "--limit",
            str(args.count),
            "--prefix",
            "gemini-image",
            "--prompt",
            args.prompt,
        ]
        proc = run_command(collect_cmd, capture_output=False, check=False)
        if proc.returncode == 0:
            return 0
        # Fallback: click latest image download button once and retry collection.
        try:
            retry_click_latest_download(pw_shared)
        except FlowError:
            return proc.returncode
        retry_proc = run_command(collect_cmd, capture_output=False, check=False)
        return retry_proc.returncode
    except FlowError as exc:
        print(str(exc), file=sys.stderr)
        return 1
 if __name__ == "__main__":
    raise SystemExit(main())
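A note on how `submit_and_download_one` embeds the prompt: the JavaScript is built with an f-string, so literal braces are doubled and the prompt is interpolated through `json.dumps`, which escapes quotes, backslashes, and newlines. The standalone sketch below isolates that technique; `build_js` is a hypothetical helper and the returned snippet is trimmed to the relevant lines.

```python
import json


def build_js(prompt: str) -> str:
    """Embed an arbitrary prompt into a JavaScript snippet as a string literal."""
    # json.dumps produces a double-quoted literal with quotes, backslashes and
    # newlines escaped; with the default ensure_ascii=True, non-ASCII text
    # becomes \uXXXX escapes, which JavaScript also accepts.
    return f"""
async (page) => {{
  const prompt = {json.dumps(prompt)};
  return prompt;
}}
"""


print(build_js('say "hi"\nnext'))
```

Interpolating the raw prompt instead would break the script as soon as the text contained a quote or line break, which is common in multi-line image prompts.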