refactor: consolidate web scraping into web-automation

2026-03-10 19:24:17 -05:00
parent 4b505e4421
commit 6e2fd17734
12 changed files with 136 additions and 247 deletions
@@ -1,18 +1,20 @@
 # web-automation

-Automated web browsing and scraping using Playwright with Camoufox anti-detection browser.
+Automated web browsing and scraping using Playwright, with one-shot extraction and broader Camoufox-based automation under a single skill.

 ## What this skill is for

+- One-shot extraction from one URL with JSON output
 - Automating web workflows
 - Authenticated session flows (logins/cookies)
 - Extracting page content to markdown
 - Working with bot-protected or dynamic pages

-## Routing rule
+## Command selection

-- For one-shot page extraction from a single URL, prefer `playwright-safe`
-- Use `web-automation` only when the task needs interactive browser control, multi-step navigation, or authenticated flows
+- Use `node skills/web-automation/scripts/extract.js "<URL>"` for one-shot extraction from a single URL
+- Use `npx tsx scrape.ts ...` for markdown scraping modes
+- Use `npx tsx browse.ts ...`, `auth.ts`, or `flow.ts` for interactive or authenticated flows

 ## Requirements

@@ -25,6 +27,7 @@ Automated web browsing and scraping using Playwright with Camoufox anti-detectio
 ```bash
 cd ~/.openclaw/workspace/skills/web-automation/scripts
 pnpm install
+npx playwright install chromium
 npx camoufox-js fetch
 pnpm approve-builds
 pnpm rebuild better-sqlite3 esbuild
@@ -50,6 +53,9 @@ Without this, `browse.ts` and `scrape.ts` may fail before launch because the nat
 ## Common commands

 ```bash
+# One-shot JSON extraction
+node skills/web-automation/scripts/extract.js "https://example.com"
+
 # Browse a page
 npx tsx browse.ts --url "https://example.com"

@@ -63,6 +69,41 @@ npx tsx auth.ts --url "https://example.com/login"
 npx tsx flow.ts --instruction 'go to https://search.fiorinis.com then type "pippo" then press enter then wait 2s'
 ```

+## One-shot extraction (`extract.js`)
+
+Use `extract.js` when the task is just: open one URL, render it, and return structured content.
+
+### Features
+
+- JavaScript rendering
+- lightweight stealth and bounded anti-bot shaping
+- JSON-only output
+- optional screenshot and saved HTML
+- browser sandbox left enabled
+
+### Options
+
+```bash
+WAIT_TIME=5000 node skills/web-automation/scripts/extract.js "https://example.com"
+SCREENSHOT_PATH=/tmp/page.png node skills/web-automation/scripts/extract.js "https://example.com"
+SAVE_HTML=true node skills/web-automation/scripts/extract.js "https://example.com"
+HEADLESS=false node skills/web-automation/scripts/extract.js "https://example.com"
+USER_AGENT="Mozilla/5.0 ..." node skills/web-automation/scripts/extract.js "https://example.com"
+```
+
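The environment variables in the Options block can be read into a typed object along the following lines. This is a minimal sketch, not the code in `extract.js`: the `ExtractOptions` shape, the `readExtractOptions` helper, and every default (headless unless `HEADLESS=false`, no HTML saved unless `SAVE_HTML=true`) are assumptions made for illustration, and the unit of `WAIT_TIME` is not stated in the README.

```typescript
// Hypothetical option shape; property names mirror the documented variables.
interface ExtractOptions {
  waitTime: number | undefined; // unit not stated in the README
  screenshotPath: string | undefined;
  saveHtml: boolean;
  headless: boolean;
  userAgent: string | undefined;
}

type Env = Record<string, string | undefined>;

// Hypothetical helper: turns environment variables into ExtractOptions.
function readExtractOptions(env: Env): ExtractOptions {
  const rawWait = env.WAIT_TIME;
  const waitTime =
    rawWait === undefined || rawWait === "" ? undefined : Number(rawWait);
  if (waitTime !== undefined && (!Number.isFinite(waitTime) || waitTime < 0)) {
    throw new Error(`WAIT_TIME must be a non-negative number, got "${rawWait}"`);
  }
  return {
    waitTime,
    screenshotPath: env.SCREENSHOT_PATH || undefined,
    // Assumption: only the literal string "true" enables saving HTML.
    saveHtml: env.SAVE_HTML === "true",
    // Assumption: headless unless HEADLESS is explicitly "false".
    headless: env.HEADLESS !== "false",
    userAgent: env.USER_AGENT || undefined,
  };
}
```

Passing the environment as an argument, rather than reading `process.env` inside the helper, keeps the parsing logic easy to exercise without launching a browser.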
+### Output fields
+
+- `requestedUrl`
+- `finalUrl`
+- `title`
+- `content`
+- `metaDescription`
+- `status`
+- `elapsedSeconds`
+- `challengeDetected`
+- optional `screenshot`
+- optional `htmlFile`
+
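A caller that consumes this JSON may want to confirm that the documented fields are present before using them. The sketch below assumes the script prints a single JSON object to stdout; `parseExtractOutput` and `REQUIRED_FIELDS` are hypothetical names, and field values are typed as `unknown` because the README lists field names but not their types.

```typescript
// Non-optional field names taken from the "Output fields" list.
const REQUIRED_FIELDS = [
  "requestedUrl",
  "finalUrl",
  "title",
  "content",
  "metaDescription",
  "status",
  "elapsedSeconds",
  "challengeDetected",
] as const;

type RequiredField = (typeof REQUIRED_FIELDS)[number];

// Value types are left as unknown: the README does not specify them.
type ExtractResult = Record<RequiredField, unknown> & {
  screenshot?: unknown;
  htmlFile?: unknown;
};

// Hypothetical helper: parses stdout and checks the required keys exist.
function parseExtractOutput(stdout: string): ExtractResult {
  const parsed: unknown = JSON.parse(stdout);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("expected a JSON object");
  }
  const missing = REQUIRED_FIELDS.filter((field) => !(field in parsed));
  if (missing.length > 0) {
    throw new Error(`missing fields: ${missing.join(", ")}`);
  }
  return parsed as ExtractResult;
}
```

The check covers key presence only; validating value types would require knowing what `extract.js` actually emits for each field.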
 ## Natural-language flow runner (`flow.ts`)

 Use `flow.ts` when you want a general command style like: