---
name: web-automation
description: Browse and scrape web pages using Playwright-compatible CloakBrowser. Use when automating web workflows, extracting rendered page content, handling authenticated sessions, or scraping websites with bot protection.
---

# Web Automation with CloakBrowser (Codex)

Automated web browsing and scraping using Playwright-compatible CloakBrowser with two execution paths under one skill:

- one-shot extraction via `extract.js`
- broader stateful automation via CloakBrowser and the existing `auth.ts`, `browse.ts`, `flow.ts`, and `scrape.ts`

## When To Use Which Command

- Use `node scripts/extract.js ""` for one-shot extraction from a single URL when you need rendered content, bounded stealth behavior, and JSON output.
- Use `npx tsx scrape.ts ...` when you need markdown output, Readability extraction, full-page cleanup, or selector-based scraping.
- Use `npx tsx browse.ts ...`, `auth.ts`, or `flow.ts` when the task needs interactive navigation, persistent sessions, login handling, click/type actions, or multi-step workflows.

## Requirements

- Node.js 20+
- pnpm
- Network access to download the CloakBrowser binary on first use or via preinstall

## First-Time Setup

```bash
cd ~/.openclaw/workspace/skills/web-automation/scripts
pnpm install
npx cloakbrowser install
pnpm approve-builds
pnpm rebuild better-sqlite3 esbuild
```

## Updating CloakBrowser

```bash
cd ~/.openclaw/workspace/skills/web-automation/scripts
pnpm up cloakbrowser playwright-core
npx cloakbrowser install
pnpm approve-builds
pnpm rebuild better-sqlite3 esbuild
```

## Prerequisite Check (MANDATORY)

Before running any automation, verify CloakBrowser and Playwright Core dependencies are installed and scripts are configured to use CloakBrowser.
```bash
cd ~/.openclaw/workspace/skills/web-automation/scripts
node check-install.js
```

If any check fails, stop and return:

"Missing dependency/config: web-automation requires `cloakbrowser` and `playwright-core` with CloakBrowser-based scripts. Run setup in this skill, then retry."

If runtime fails with missing native bindings for `better-sqlite3` or `esbuild`, run:

```bash
cd ~/.openclaw/workspace/skills/web-automation/scripts
pnpm approve-builds
pnpm rebuild better-sqlite3 esbuild
```

## Quick Reference

- Install check: `node check-install.js`
- One-shot JSON extract: `node scripts/extract.js "https://example.com"`
- Zillow photo URLs: `node scripts/zillow-photos.js "https://www.zillow.com/homedetails/..."`
- HAR photo URLs: `node scripts/har-photos.js "https://www.har.com/homedetail/..."`
- Browse page: `npx tsx browse.ts --url "https://example.com"`
- Scrape markdown: `npx tsx scrape.ts --url "https://example.com" --mode main --output page.md`
- Authenticate: `npx tsx auth.ts --url "https://example.com/login"`
- Natural-language flow: `npx tsx flow.ts --instruction 'go to https://example.com then click on "Login" then type "user@example.com" in #email then press enter'`

## OpenClaw Exec Approvals / Allowlist

If OpenClaw prompts for exec approval every time this skill runs, add a local approvals allowlist for the main agent before retrying. This is especially helpful for repeated `extract.js`, `browse.ts`, and other CloakBrowser-backed scrapes.
```bash
openclaw approvals allowlist add --agent main "/opt/homebrew/bin/node"
openclaw approvals allowlist add --agent main "/usr/bin/env"
openclaw approvals allowlist add --agent main "~/.openclaw/workspace/skills/web-automation/scripts/*.js"
openclaw approvals allowlist add --agent main "~/.openclaw/workspace/skills/web-automation/scripts/node_modules/.bin/*"
```

Then verify:

```bash
openclaw approvals get
```

Notes:

- If `node` lives somewhere else on the host, replace `/opt/homebrew/bin/node` with the output of `which node`.
- If matching problems persist, replace `~/.openclaw/...` with the full absolute path such as `/Users//.openclaw/...`.
- Keep the allowlist scoped to the main agent unless there is a real reason to broaden it.
- Prefer file-based commands like `node check-install.js` or `node scripts/zillow-photos.js ...` over inline interpreter eval (`node -e`, `node --input-type=module -e`). OpenClaw exec approvals treat inline eval as a higher-friction path.

## One-shot extraction

Use `extract.js` when you need a single page fetch with JavaScript rendering and lightweight anti-bot shaping, but not a full automation session.

```bash
node scripts/extract.js "https://example.com"
WAIT_TIME=5000 node scripts/extract.js "https://example.com"
SCREENSHOT_PATH=/tmp/page.png SAVE_HTML=true node scripts/extract.js "https://example.com"
```

Output is JSON only and includes fields such as:

- `requestedUrl`
- `finalUrl`
- `title`
- `content`
- `metaDescription`
- `status`
- `elapsedSeconds`
- `challengeDetected`
- optional `screenshot`
- optional `htmlFile`

## General flow runner

Use `flow.ts` for multi-step commands in plain language (go/click/type/press/wait/screenshot).

Example:

```bash
npx tsx flow.ts --instruction 'go to https://search.fiorinis.com then type "pippo" then press enter then wait 2s'
```

## Real-estate photo extraction

Use the dedicated extractors before trying a free-form gallery flow.
- Zillow: `node scripts/zillow-photos.js ""`
- HAR: `node scripts/har-photos.js ""`

These scripts are purpose-built for the common `See all photos` / `Show all photos` workflow:

- open the listing page
- click the all-photos entry point
- wait for the resulting photo page or scroller view
- extract direct image URLs from the rendered page

Output is JSON with:

- `requestedUrl`
- `finalUrl`
- `clickedLabel`
- `photoCount`
- `imageUrls`
- `notes`

For property-assessor style workflows, prefer these dedicated commands over generic natural-language gallery automation.

### Gallery/lightbox and all-photos workflows

For real-estate listings and other image-heavy pages, prefer the most accessible all-photos view first.

Practical rules:

- A scrollable all-photos page, expanded photo grid, or photo list is an acceptable source for condition review if it clearly exposes the listing images.
- Do not treat a listing page hero image, gallery collage preview, or modal landing view alone as full photo review.
- Only rely on next-arrow / slideshow traversal when the site does not provide an accessible all-photos view.
- If using a gallery, confirm the image changed before counting the next screenshot as reviewed.
- If a generic `Next` control exits the gallery or returns to the listing shell, stop and adjust the selector/interaction; do not claim the photos were reviewed.
- Blind `ArrowRight` presses are not reliable enough unless you have already verified that they advance the gallery on that site.
- For smaller listings, review all photos when practical; otherwise review enough distinct photos to cover kitchen, baths, living areas, bedrooms, exterior, and any waterfront/balcony/deck elements.
- If automation cannot reliably access enough photos, say so explicitly in the final answer.

Where possible, prefer a site’s explicit `See all photos` / `Show all photos` path over fragile modal navigation.
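
The rule about confirming that the image changed before counting a photo as reviewed can be sketched as a small bookkeeping helper. This is a minimal illustration, not code from the skill's scripts: `createGalleryTracker` and its `record` method are hypothetical names, and the sketch assumes the caller can read the currently displayed image URL after each gallery step.

```javascript
// Hypothetical helper (not part of this skill's scripts): counts a gallery
// image as reviewed only when the displayed image actually changed.
function createGalleryTracker() {
  const seen = new Set(); // distinct image URLs counted so far
  let lastSrc = null;     // URL displayed after the previous step

  return {
    // Call after each gallery step with the URL of the image now displayed.
    // Returns true only for a new, distinct image.
    record(src) {
      if (!src || src === lastSrc) {
        return false; // nothing displayed, or the gallery did not advance
      }
      lastSrc = src;
      if (seen.has(src)) {
        return false; // gallery wrapped around to an image already counted
      }
      seen.add(src);
      return true;
    },
    // Number of distinct images reviewed so far.
    get reviewedCount() {
      return seen.size;
    },
  };
}
```

A caller would take a screenshot only when `record(...)` returns `true`, and would stop and adjust the selector or interaction if several consecutive steps return `false`.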
## Compatibility Aliases

- `CAMOUFOX_PROFILE_PATH` still works as a legacy alias for `CLOAKBROWSER_PROFILE_PATH`
- `CAMOUFOX_HEADLESS` still works as a legacy alias for `CLOAKBROWSER_HEADLESS`
- `CAMOUFOX_USERNAME` and `CAMOUFOX_PASSWORD` still work as legacy aliases for `CLOAKBROWSER_USERNAME` and `CLOAKBROWSER_PASSWORD`

## Notes

- Sessions persist in CloakBrowser profile storage.
- Use `--wait` for dynamic pages.
- Use `--mode selector --selector "..."` for targeted extraction.
- `extract.js` keeps stealth and bounded anti-bot shaping while keeping the browser sandbox enabled.
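
The legacy environment variable aliases listed under Compatibility Aliases amount to a fallback lookup. The sketch below illustrates the idea only. `resolveSetting` and `LEGACY_ALIASES` are hypothetical names rather than the skill's actual code, and the precedence shown (the `CLOAKBROWSER_*` name wins when both are set) is an assumption that this document does not state.

```javascript
// Hypothetical mapping from current names to their legacy aliases
// (pairs taken from the Compatibility Aliases list).
const LEGACY_ALIASES = new Map([
  ["CLOAKBROWSER_PROFILE_PATH", "CAMOUFOX_PROFILE_PATH"],
  ["CLOAKBROWSER_HEADLESS", "CAMOUFOX_HEADLESS"],
  ["CLOAKBROWSER_USERNAME", "CAMOUFOX_USERNAME"],
  ["CLOAKBROWSER_PASSWORD", "CAMOUFOX_PASSWORD"],
]);

// Returns the value of `name`, falling back to its legacy alias if unset.
// Assumption: the current name takes precedence when both are set.
function resolveSetting(name, env = process.env) {
  const current = env[name];
  if (current !== undefined && current !== "") {
    return current;
  }
  const legacy = LEGACY_ALIASES.get(name);
  return legacy === undefined ? undefined : env[legacy];
}
```

For example, `resolveSetting("CLOAKBROWSER_HEADLESS")` would return the value of `CAMOUFOX_HEADLESS` on a host where only the legacy variable is exported.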