diff --git a/README.md b/README.md index fbbc1eb..f135c32 100644 --- a/README.md +++ b/README.md @@ -18,10 +18,9 @@ This repository contains practical OpenClaw skills and companion integrations. I |---|---|---| | `elevenlabs-stt` | Transcribe local audio files with ElevenLabs Speech-to-Text, with diarization, language hints, event tags, and JSON output. | `skills/elevenlabs-stt` | | `gitea-api` | Interact with Gitea via REST API (repos, issues, PRs, releases, branches, user info). | `skills/gitea-api` | -| `playwright-safe` | Single-entry Playwright scraper for one-shot extraction with JS rendering and moderate anti-bot handling. | `skills/playwright-safe` | | `portainer` | Manage Portainer stacks via API (list, start/stop/restart, update, prune images). | `skills/portainer` | | `searxng` | Search through a local or self-hosted SearXNG instance for web, news, images, and more. | `skills/searxng` | -| `web-automation` | Automate browsing/scraping with Playwright + Camoufox (auth flows, extraction, bot-protected sites). | `skills/web-automation` | +| `web-automation` | One-shot extraction plus broader browsing/scraping with Playwright + Camoufox (auth flows, extraction, bot-protected sites). | `skills/web-automation` | ## Integrations diff --git a/docs/README.md b/docs/README.md index 94fdb74..8e59c30 100644 --- a/docs/README.md +++ b/docs/README.md @@ -6,11 +6,9 @@ This folder contains detailed docs for each skill in this repository. 
- [`elevenlabs-stt`](elevenlabs-stt.md) — Local audio transcription through ElevenLabs Speech-to-Text - [`gitea-api`](gitea-api.md) — REST-based Gitea automation (no `tea` CLI required) -- [`playwright-safe`](playwright-safe.md) — Single-entry Playwright scraper for one-shot extraction with JS rendering and moderate anti-bot handling - [`portainer`](portainer.md) — Portainer stack management (list, lifecycle, updates, image pruning) - [`searxng`](searxng.md) — Privacy-respecting metasearch via a local or self-hosted SearXNG instance -- [`web-automation`](web-automation.md) — Playwright + Camoufox browser automation and scraping - +- [`web-automation`](web-automation.md) — One-shot extraction plus Playwright + Camoufox browser automation and scraping ## Integrations diff --git a/docs/playwright-safe.md b/docs/playwright-safe.md deleted file mode 100644 index 5b7545b..0000000 --- a/docs/playwright-safe.md +++ /dev/null @@ -1,72 +0,0 @@ -# playwright-safe - -Single-entry Playwright scraper for one-shot page extraction with JavaScript rendering and moderate anti-bot handling. - -## What this skill is for - -- Extracting title, visible text, and metadata from one URL -- Pages that need client-side rendering -- Moderate anti-bot shaping without a full browser automation workflow -- Structured JSON output that agents can consume directly - -## What this skill is not for - -- Multi-step browser workflows -- Authenticated login flows -- Interactive click/type sequences across multiple pages - -Use `web-automation` for those broader browser tasks. - -## Runtime requirements - -- Node.js 18+ -- Local Playwright install under the skill directory - -## First-time setup - -```bash -cd ~/.openclaw/workspace/skills/playwright-safe -npm install -npx playwright install chromium -``` - -## Entry point - -```bash -node skills/playwright-safe/scripts/playwright-safe.js "" -``` - -Only pass a user-provided `http` or `https` URL. 
- -## Options - -```bash -WAIT_TIME=5000 node skills/playwright-safe/scripts/playwright-safe.js "" -SCREENSHOT_PATH=/tmp/page.png node skills/playwright-safe/scripts/playwright-safe.js "" -SAVE_HTML=true node skills/playwright-safe/scripts/playwright-safe.js "" -HEADLESS=false node skills/playwright-safe/scripts/playwright-safe.js "" -USER_AGENT="Mozilla/5.0 ..." node skills/playwright-safe/scripts/playwright-safe.js "" -``` - -## Output - -The script prints JSON only. It includes: - -- `requestedUrl` -- `finalUrl` -- `title` -- `content` -- `metaDescription` -- `status` -- `elapsedSeconds` -- `challengeDetected` -- optional `screenshot` -- optional `htmlFile` - -## Security posture - -- Keeps lightweight stealth and anti-bot shaping -- Keeps the browser sandbox enabled -- Does not use `--no-sandbox` -- Does not use `--disable-setuid-sandbox` -- Avoids site-specific extractors and cross-skill dependencies diff --git a/docs/web-automation.md b/docs/web-automation.md index 2960043..ee33bda 100644 --- a/docs/web-automation.md +++ b/docs/web-automation.md @@ -1,18 +1,20 @@ # web-automation -Automated web browsing and scraping using Playwright with Camoufox anti-detection browser. +Automated web browsing and scraping using Playwright, with one-shot extraction and broader Camoufox-based automation under a single skill. 
## What this skill is for +- One-shot extraction from one URL with JSON output - Automating web workflows - Authenticated session flows (logins/cookies) - Extracting page content to markdown - Working with bot-protected or dynamic pages -## Routing rule +## Command selection -- For one-shot page extraction from a single URL, prefer `playwright-safe` -- Use `web-automation` only when the task needs interactive browser control, multi-step navigation, or authenticated flows +- Use `node skills/web-automation/scripts/extract.js ""` for one-shot extraction from a single URL +- Use `npx tsx scrape.ts ...` for markdown scraping modes +- Use `npx tsx browse.ts ...`, `auth.ts`, or `flow.ts` for interactive or authenticated flows ## Requirements @@ -25,6 +27,7 @@ Automated web browsing and scraping using Playwright with Camoufox anti-detectio ```bash cd ~/.openclaw/workspace/skills/web-automation/scripts pnpm install +npx playwright install chromium npx camoufox-js fetch pnpm approve-builds pnpm rebuild better-sqlite3 esbuild @@ -50,6 +53,9 @@ Without this, `browse.ts` and `scrape.ts` may fail before launch because the nat ## Common commands ```bash +# One-shot JSON extraction +node skills/web-automation/scripts/extract.js "https://example.com" + # Browse a page npx tsx browse.ts --url "https://example.com" @@ -63,6 +69,41 @@ npx tsx auth.ts --url "https://example.com/login" npx tsx flow.ts --instruction 'go to https://search.fiorinis.com then type "pippo" then press enter then wait 2s' ``` +## One-shot extraction (`extract.js`) + +Use `extract.js` when the task is just: open one URL, render it, and return structured content. 
+ +### Features + +- JavaScript rendering +- lightweight stealth and bounded anti-bot shaping +- JSON-only output +- optional screenshot and saved HTML +- browser sandbox left enabled + +### Options + +```bash +WAIT_TIME=5000 node skills/web-automation/scripts/extract.js "https://example.com" +SCREENSHOT_PATH=/tmp/page.png node skills/web-automation/scripts/extract.js "https://example.com" +SAVE_HTML=true node skills/web-automation/scripts/extract.js "https://example.com" +HEADLESS=false node skills/web-automation/scripts/extract.js "https://example.com" +USER_AGENT="Mozilla/5.0 ..." node skills/web-automation/scripts/extract.js "https://example.com" +``` + +### Output fields + +- `requestedUrl` +- `finalUrl` +- `title` +- `content` +- `metaDescription` +- `status` +- `elapsedSeconds` +- `challengeDetected` +- optional `screenshot` +- optional `htmlFile` + ## Natural-language flow runner (`flow.ts`) Use `flow.ts` when you want a general command style like: diff --git a/skills/playwright-safe/.gitignore b/skills/playwright-safe/.gitignore deleted file mode 100644 index 8539a8a..0000000 --- a/skills/playwright-safe/.gitignore +++ /dev/null @@ -1,3 +0,0 @@ -node_modules/ -*.png -*.html diff --git a/skills/playwright-safe/SKILL.md b/skills/playwright-safe/SKILL.md deleted file mode 100644 index 171f041..0000000 --- a/skills/playwright-safe/SKILL.md +++ /dev/null @@ -1,68 +0,0 @@ ---- -name: playwright-safe -description: Use when a page needs JavaScript rendering or moderate anti-bot handling and the agent should use a single local Playwright scraper instead of generic web fetch tooling. ---- - -# Playwright Safe - -Single-entry Playwright scraper for dynamic or moderately bot-protected pages. 
- -## When To Use - -- Page content depends on client-side rendering -- Generic `scrape` or `webfetch` is likely to miss rendered content -- The task needs one direct page extraction with lightweight stealth behavior - -## Do Not Use - -- For multi-step browser workflows with login/stateful interaction -- For site-specific automation flows -- When the page can be handled by a simpler built-in fetch path - -## Setup - -```bash -cd ~/.openclaw/workspace/skills/playwright-safe -npm install -npx playwright install chromium -``` - -## Command - -```bash -node scripts/playwright-safe.js "" -``` - -Only pass a user-provided `http` or `https` URL. - -## Options - -```bash -WAIT_TIME=5000 node scripts/playwright-safe.js "" -SCREENSHOT_PATH=/tmp/page.png node scripts/playwright-safe.js "" -SAVE_HTML=true node scripts/playwright-safe.js "" -HEADLESS=false node scripts/playwright-safe.js "" -USER_AGENT="Mozilla/5.0 ..." node scripts/playwright-safe.js "" -``` - -## Output - -The script prints JSON only, suitable for direct agent consumption. 
Fields include: - -- `requestedUrl` -- `finalUrl` -- `title` -- `content` -- `metaDescription` -- `status` -- `elapsedSeconds` -- `challengeDetected` -- optional `screenshot` -- optional `htmlFile` - -## Safety Notes - -- Stealth and anti-bot shaping are retained -- Chromium sandbox remains enabled -- No sandbox-disabling flags are used -- No site-specific extractors or foreign tool dependencies are used diff --git a/skills/playwright-safe/package-lock.json b/skills/playwright-safe/package-lock.json deleted file mode 100644 index 8027c38..0000000 --- a/skills/playwright-safe/package-lock.json +++ /dev/null @@ -1,59 +0,0 @@ -{ - "name": "playwright-safe", - "version": "0.1.0", - "lockfileVersion": 3, - "requires": true, - "packages": { - "": { - "name": "playwright-safe", - "version": "0.1.0", - "dependencies": { - "playwright": "^1.52.0" - } - }, - "node_modules/fsevents": { - "version": "2.3.2", - "resolved": "https://registry.npmjs.org/fsevents/-/fsevents-2.3.2.tgz", - "integrity": "sha512-xiqMQR4xAeHTuB9uWm+fFRcIOgKBMiOBP+eXiyT7jsgVCq1bkVygt00oASowB7EdtpOHaaPgKt812P9ab+DDKA==", - "hasInstallScript": true, - "license": "MIT", - "optional": true, - "os": [ - "darwin" - ], - "engines": { - "node": "^8.16.0 || ^10.6.0 || >=11.0.0" - } - }, - "node_modules/playwright": { - "version": "1.58.2", - "resolved": "https://registry.npmjs.org/playwright/-/playwright-1.58.2.tgz", - "integrity": "sha512-vA30H8Nvkq/cPBnNw4Q8TWz1EJyqgpuinBcHET0YVJVFldr8JDNiU9LaWAE1KqSkRYazuaBhTpB5ZzShOezQ6A==", - "license": "Apache-2.0", - "dependencies": { - "playwright-core": "1.58.2" - }, - "bin": { - "playwright": "cli.js" - }, - "engines": { - "node": ">=18" - }, - "optionalDependencies": { - "fsevents": "2.3.2" - } - }, - "node_modules/playwright-core": { - "version": "1.58.2", - "resolved": "https://registry.npmjs.org/playwright-core/-/playwright-core-1.58.2.tgz", - "integrity": "sha512-yZkEtftgwS8CsfYo7nm0KE8jsvm6i/PTgVtB8DL726wNf6H2IMsDuxCpJj59KDaxCtSnrWan2AeDqM7JBaultg==", - "license": 
"Apache-2.0", - "bin": { - "playwright-core": "cli.js" - }, - "engines": { - "node": ">=18" - } - } - } -} diff --git a/skills/playwright-safe/package.json b/skills/playwright-safe/package.json deleted file mode 100644 index 2a93c61..0000000 --- a/skills/playwright-safe/package.json +++ /dev/null @@ -1,12 +0,0 @@ -{ - "name": "playwright-safe", - "version": "0.1.0", - "private": true, - "description": "Single-entry Playwright scraper skill with bounded stealth behavior", - "scripts": { - "smoke": "node scripts/playwright-safe.js" - }, - "dependencies": { - "playwright": "^1.52.0" - } -} diff --git a/skills/web-automation/SKILL.md b/skills/web-automation/SKILL.md index 49bea82..5549200 100644 --- a/skills/web-automation/SKILL.md +++ b/skills/web-automation/SKILL.md @@ -1,20 +1,20 @@ --- name: web-automation -description: Browse and scrape web pages using Playwright with Camoufox anti-detection browser. Use when automating web workflows, extracting page content to markdown, handling authenticated sessions, or scraping websites with bot protection. +description: Browse and scrape web pages using Playwright with Camoufox anti-detection browser. Use when automating web workflows, extracting rendered page content, handling authenticated sessions, or scraping websites with bot protection. --- # Web Automation with Camoufox (Codex) -Automated web browsing and scraping using Playwright with Camoufox anti-detection browser. +Automated web browsing and scraping using Playwright with two execution paths under one skill: -## Routing Rule +- one-shot extraction via `extract.js` +- broader stateful automation via Camoufox and the existing `auth.ts`, `browse.ts`, `flow.ts`, and `scrape.ts` -Before using this skill, classify the task: +## When To Use Which Command -- If the task is one-shot page extraction for title/content from a single URL, use `~/.openclaw/workspace/skills/playwright-safe/SKILL.md` instead. 
-- If the task needs a multi-step browser flow, authenticated session handling, or interactive navigation/click/type behavior, use this `web-automation` skill. - -Do not use `web-automation` for simple single-page extraction when `playwright-safe` is available. +- Use `node scripts/extract.js ""` for one-shot extraction from a single URL when you need rendered content, bounded stealth behavior, and JSON output. +- Use `npx tsx scrape.ts ...` when you need markdown output, Readability extraction, full-page cleanup, or selector-based scraping. +- Use `npx tsx browse.ts ...`, `auth.ts`, or `flow.ts` when the task needs interactive navigation, persistent sessions, login handling, click/type actions, or multi-step workflows. ## Requirements @@ -27,6 +27,7 @@ Do not use `web-automation` for simple single-page extraction when `playwright-s ```bash cd ~/.openclaw/workspace/skills/web-automation/scripts pnpm install +npx playwright install chromium npx camoufox-js fetch pnpm approve-builds pnpm rebuild better-sqlite3 esbuild @@ -38,13 +39,13 @@ Before running any automation, verify Playwright + Camoufox dependencies are ins ```bash cd ~/.openclaw/workspace/skills/web-automation/scripts -node -e "require.resolve('playwright-core/package.json');require.resolve('camoufox-js/package.json');console.log('OK: playwright-core + camoufox-js installed')" +node -e "require.resolve('playwright/package.json');require.resolve('playwright-core/package.json');require.resolve('camoufox-js/package.json');console.log('OK: playwright + playwright-core + camoufox-js installed')" node -e "const fs=require('fs');const t=fs.readFileSync('browse.ts','utf8');if(!/camoufox-js/.test(t)){throw new Error('browse.ts is not configured for Camoufox')}console.log('OK: Camoufox integration detected in browse.ts')" ``` If any check fails, stop and return: -"Missing dependency/config: web-automation requires `playwright-core` + `camoufox-js` and Camoufox-based scripts. Run setup in this skill, then retry." 
+"Missing dependency/config: web-automation requires `playwright`, `playwright-core`, and `camoufox-js` with Camoufox-based scripts. Run setup in this skill, then retry." If runtime fails with missing native bindings for `better-sqlite3` or `esbuild`, run: @@ -56,11 +57,35 @@ pnpm rebuild better-sqlite3 esbuild ## Quick Reference +- One-shot JSON extract: `node scripts/extract.js "https://example.com"` - Browse page: `npx tsx browse.ts --url "https://example.com"` - Scrape markdown: `npx tsx scrape.ts --url "https://example.com" --mode main --output page.md` - Authenticate: `npx tsx auth.ts --url "https://example.com/login"` - Natural-language flow: `npx tsx flow.ts --instruction 'go to https://example.com then click on "Login" then type "user@example.com" in #email then press enter'` +## One-shot extraction + +Use `extract.js` when you need a single page fetch with JavaScript rendering and lightweight anti-bot shaping, but not a full automation session. + +```bash +node scripts/extract.js "https://example.com" +WAIT_TIME=5000 node scripts/extract.js "https://example.com" +SCREENSHOT_PATH=/tmp/page.png SAVE_HTML=true node scripts/extract.js "https://example.com" +``` + +Output is JSON only and includes fields such as: + +- `requestedUrl` +- `finalUrl` +- `title` +- `content` +- `metaDescription` +- `status` +- `elapsedSeconds` +- `challengeDetected` +- optional `screenshot` +- optional `htmlFile` + ## General flow runner Use `flow.ts` for multi-step commands in plain language (go/click/type/press/wait/screenshot). @@ -76,3 +101,4 @@ npx tsx flow.ts --instruction 'go to https://search.fiorinis.com then type "pipp - Sessions persist in Camoufox profile storage. - Use `--wait` for dynamic pages. - Use `--mode selector --selector "..."` for targeted extraction. +- `extract.js` keeps stealth and bounded anti-bot shaping while keeping the Chromium sandbox enabled. 
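The `SKILL.md` changes above say `extract.js` prints JSON only and list its output fields. A minimal sketch of how a caller might interpret that text follows. The helper name `summarizeExtractOutput` and its return shape are hypothetical. The sketch assumes `challengeDetected` is a boolean and that failures appear as a JSON object with an `error` key, as the `fail()` helper in the script suggests.

```javascript
// Hypothetical helper (not part of the skill): interpret the JSON text printed by extract.js.
// Field names follow the documented output list; their exact types are assumptions.
function summarizeExtractOutput(stdoutText) {
  let payload;
  try {
    payload = JSON.parse(stdoutText);
  } catch (err) {
    // The script is documented as JSON-only, so anything else is treated as a failure.
    return { ok: false, reason: "invalid-json" };
  }
  if (payload === null || typeof payload !== "object") {
    return { ok: false, reason: "invalid-json" };
  }
  if (payload.error) {
    // Assumed error shape: { error, details? }, mirroring fail() in extract.js.
    return { ok: false, reason: "extract-error", message: String(payload.error) };
  }
  if (payload.challengeDetected) {
    // An anti-bot challenge page was detected, so the content is probably not the real page.
    return { ok: false, reason: "challenge", finalUrl: payload.finalUrl };
  }
  return {
    ok: true,
    title: payload.title ?? "",
    content: payload.content ?? "",
    redirected: payload.requestedUrl !== payload.finalUrl,
  };
}

// Usage with a hand-written sample payload (illustrative values only).
const sampleOutput = JSON.stringify({
  requestedUrl: "https://example.com",
  finalUrl: "https://example.com/",
  title: "Example Domain",
  content: "Example body text",
  challengeDetected: false,
});
console.log(summarizeExtractOutput(sampleOutput));
```

Keeping the interpretation in a pure function means it can be checked without launching a browser. Spawning `node scripts/extract.js` and capturing its output is left to the caller.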
diff --git a/skills/playwright-safe/scripts/playwright-safe.js b/skills/web-automation/scripts/extract.js similarity index 88% rename from skills/playwright-safe/scripts/playwright-safe.js rename to skills/web-automation/scripts/extract.js index 04bf5fd..45fe724 100755 --- a/skills/playwright-safe/scripts/playwright-safe.js +++ b/skills/web-automation/scripts/extract.js @@ -1,7 +1,8 @@ #!/usr/bin/env node -const fs = require("fs"); -const path = require("path"); +import fs from "node:fs"; +import path from "node:path"; +import { fileURLToPath } from "node:url"; const DEFAULT_WAIT_MS = 5000; const MAX_WAIT_MS = 20000; @@ -11,6 +12,9 @@ const CONTENT_LIMIT = 12000; const DEFAULT_USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36"; +const __filename = fileURLToPath(import.meta.url); +const __dirname = path.dirname(__filename); + function fail(message, details) { const payload = { error: message }; if (details) payload.details = details; @@ -26,7 +30,7 @@ function parseWaitTime(raw) { function parseTarget(rawUrl) { if (!rawUrl) { - fail("Missing URL. Usage: node scripts/playwright-safe.js "); + fail("Missing URL. Usage: node skills/web-automation/scripts/extract.js "); } let parsed; @@ -66,6 +70,17 @@ async function detectChallenge(page) { } } +async function loadPlaywright() { + try { + return await import("playwright"); + } catch (error) { + fail( + "Playwright is not installed for this skill. 
Run pnpm install and npx playwright install chromium in skills/web-automation/scripts first.", + error.message + ); + } +} + async function main() { const requestedUrl = parseTarget(process.argv[2]); const waitTime = parseWaitTime(process.env.WAIT_TIME); @@ -74,15 +89,7 @@ async function main() { const headless = process.env.HEADLESS !== "false"; const userAgent = process.env.USER_AGENT || DEFAULT_USER_AGENT; const startedAt = Date.now(); - let chromium; - try { - ({ chromium } = require("playwright")); - } catch (error) { - fail( - "Playwright is not installed for this skill. Run npm install and npx playwright install chromium first.", - error.message - ); - } + const { chromium } = await loadPlaywright(); let browser; try { @@ -176,8 +183,9 @@ async function main() { } if (saveHtml) { - const htmlTarget = - screenshotPath ? screenshotPath.replace(/\.[^.]+$/, ".html") : path.resolve(`page-${Date.now()}.html`); + const htmlTarget = screenshotPath + ? screenshotPath.replace(/\.[^.]+$/, ".html") + : path.resolve(__dirname, `page-${Date.now()}.html`); ensureParentDir(htmlTarget); fs.writeFileSync(htmlTarget, await page.content()); result.htmlFile = htmlTarget; diff --git a/skills/web-automation/scripts/package.json b/skills/web-automation/scripts/package.json index 8468ae3..bef6a7c 100644 --- a/skills/web-automation/scripts/package.json +++ b/skills/web-automation/scripts/package.json @@ -4,6 +4,7 @@ "description": "Web browsing and scraping scripts using Camoufox", "type": "module", "scripts": { + "extract": "node extract.js", "browse": "tsx browse.ts", "scrape": "tsx scrape.ts", "fetch-browser": "npx camoufox-js fetch" @@ -14,6 +15,7 @@ "camoufox-js": "^0.8.5", "jsdom": "^24.0.0", "minimist": "^1.2.8", + "playwright": "^1.58.2", "playwright-core": "^1.40.0", "turndown": "^7.1.2", "turndown-plugin-gfm": "^1.0.2" diff --git a/skills/web-automation/scripts/pnpm-lock.yaml b/skills/web-automation/scripts/pnpm-lock.yaml index 3e349b3..35fde47 100644 --- 
a/skills/web-automation/scripts/pnpm-lock.yaml +++ b/skills/web-automation/scripts/pnpm-lock.yaml @@ -23,6 +23,9 @@ importers: minimist: specifier: ^1.2.8 version: 1.2.8 + playwright: + specifier: ^1.58.2 + version: 1.58.2 playwright-core: specifier: ^1.40.0 version: 1.57.0 @@ -440,6 +443,11 @@ packages: fs-constants@1.0.0: resolution: {integrity: sha512-y6OAwoSIf7FyjMIv94u+b5rdheZEjzR63GTyZJm5qh4Bi+2YgwLCcI/fPFZkL5PSixOt6ZNKm+w+Hfp/Bciwow==} + fsevents@2.3.2: + resolution: {integrity: sha512-xiqMQR4xAeHTuB9uWm+fFRcIOgKBMiOBP+eXiyT7jsgVCq1bkVygt00oASowB7EdtpOHaaPgKt812P9ab+DDKA==} + engines: {node: ^8.16.0 || ^10.6.0 || >=11.0.0} + os: [darwin] + fsevents@2.3.3: resolution: {integrity: sha512-5xoDfX+fL7faATnagmWPpbFtwh/R77WmMMqqHGS65C3vvB0YHrgF+B1YmZ3441tMj5n63k0212XNoJwzlhffQw==} engines: {node: ^8.16.0 || ^10.6.0 || >=11.0.0} @@ -679,6 +687,16 @@ packages: engines: {node: '>=18'} hasBin: true + playwright-core@1.58.2: + resolution: {integrity: sha512-yZkEtftgwS8CsfYo7nm0KE8jsvm6i/PTgVtB8DL726wNf6H2IMsDuxCpJj59KDaxCtSnrWan2AeDqM7JBaultg==} + engines: {node: '>=18'} + hasBin: true + + playwright@1.58.2: + resolution: {integrity: sha512-vA30H8Nvkq/cPBnNw4Q8TWz1EJyqgpuinBcHET0YVJVFldr8JDNiU9LaWAE1KqSkRYazuaBhTpB5ZzShOezQ6A==} + engines: {node: '>=18'} + hasBin: true + prebuild-install@7.1.3: resolution: {integrity: sha512-8Mf2cbV7x1cXPUILADGI3wuhfqWvtiLA1iclTDbFRZkgRQS0NqsPZphna9V+HyTEadheuPmjaJMsbzKQFOzLug==} engines: {node: '>=10'} @@ -1196,6 +1214,9 @@ snapshots: fs-constants@1.0.0: {} + fsevents@2.3.2: + optional: true + fsevents@2.3.3: optional: true @@ -1428,6 +1449,14 @@ snapshots: playwright-core@1.57.0: {} + playwright-core@1.58.2: {} + + playwright@1.58.2: + dependencies: + playwright-core: 1.58.2 + optionalDependencies: + fsevents: 2.3.2 + prebuild-install@7.1.3: dependencies: detect-libc: 2.1.2
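The `extract.js` hunk above changes where `SAVE_HTML` writes its file: next to the screenshot when `SCREENSHOT_PATH` is set, otherwise under the script directory. The rule can be sketched as a pure function. `deriveHtmlTarget`, `baseDir`, and `timestamp` are hypothetical names standing in for the inline expression, `__dirname`, and `Date.now()`. The string join is a POSIX-only simplification of `path.resolve`.

```javascript
// Sketch of the SAVE_HTML target-path rule shown in the extract.js diff.
// baseDir and timestamp are hypothetical parameters (the script uses __dirname and Date.now()).
function deriveHtmlTarget(screenshotPath, baseDir, timestamp) {
  if (screenshotPath) {
    // Same regex as the diff: swap the trailing ".ext" for ".html".
    return screenshotPath.replace(/\.[^.]+$/, ".html");
  }
  // Simplified stand-in for path.resolve(baseDir, `page-${timestamp}.html`) on POSIX paths.
  return `${baseDir}/page-${timestamp}.html`;
}

console.log(deriveHtmlTarget("/tmp/page.png", "/skills/web-automation/scripts", 0));
console.log(deriveHtmlTarget("", "/skills/web-automation/scripts", 1700000000000));
console.log(deriveHtmlTarget("/tmp/page", "/skills/web-automation/scripts", 0));
```

In this sketch a screenshot path with no extension comes back unchanged, so the HTML target would equal the screenshot path. Whether `extract.js` guards against that case elsewhere is not visible in the diff.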