
Stefano Fiorini 6e2fd17734 refactor: consolidate web scraping into web-automation

2026-03-10 19:24:17 -05:00

4.0 KiB


name	description
web-automation	Browse and scrape web pages using Playwright with Camoufox anti-detection browser. Use when automating web workflows, extracting rendered page content, handling authenticated sessions, or scraping websites with bot protection.

Web Automation with Camoufox (Codex)

Automated web browsing and scraping using Playwright with two execution paths under one skill:

one-shot extraction via extract.js
broader stateful automation via Camoufox and the existing auth.ts, browse.ts, flow.ts, and scrape.ts

When To Use Which Command

Use node scripts/extract.js "<URL>" for one-shot extraction from a single URL when you need rendered content, bounded stealth behavior, and JSON output.
Use npx tsx scrape.ts ... when you need markdown output, Readability extraction, full-page cleanup, or selector-based scraping.
Use npx tsx browse.ts ..., auth.ts, or flow.ts when the task needs interactive navigation, persistent sessions, login handling, click/type actions, or multi-step workflows.
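
The guidance above can be read as a small decision table. The sketch below is a hypothetical helper, not part of the skill. The option names (`needsLogin`, `multiStep`, `interactive`, `output`) are invented for illustration, and the split between auth.ts, flow.ts, and browse.ts is an assumption based on the Quick Reference entries.

```javascript
// Hypothetical helper, not part of the skill: picks a script name
// from the guidance above. All option names are illustrative.
function chooseScript({
  needsLogin = false,
  multiStep = false,
  interactive = false,
  output = "json",
} = {}) {
  if (needsLogin) return "auth.ts"; // login handling
  if (multiStep) return "flow.ts"; // multi-step workflows
  if (interactive) return "browse.ts"; // interactive navigation, persistent sessions
  if (output === "markdown") return "scrape.ts"; // markdown, Readability, selectors
  return "extract.js"; // one-shot JSON extraction from a single URL
}

console.log(chooseScript({ output: "markdown" }));
```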

Requirements

Node.js 20+
pnpm
Network access to download browser binaries

First-Time Setup

cd ~/.openclaw/workspace/skills/web-automation/scripts
pnpm install
npx playwright install chromium
npx camoufox-js fetch
pnpm approve-builds
pnpm rebuild better-sqlite3 esbuild

Prerequisite Check (MANDATORY)

Before running any automation, verify that the Playwright and Camoufox dependencies are installed and that the scripts are configured to use Camoufox.

cd ~/.openclaw/workspace/skills/web-automation/scripts
node -e "require.resolve('playwright/package.json');require.resolve('playwright-core/package.json');require.resolve('camoufox-js/package.json');console.log('OK: playwright + playwright-core + camoufox-js installed')"
node -e "const fs=require('fs');const t=fs.readFileSync('browse.ts','utf8');if(!/camoufox-js/.test(t)){throw new Error('browse.ts is not configured for Camoufox')}console.log('OK: Camoufox integration detected in browse.ts')"

If any check fails, stop and return:

"Missing dependency/config: web-automation requires playwright, playwright-core, and camoufox-js with Camoufox-based scripts. Run setup in this skill, then retry."
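
For readability, the two one-liners above can be unrolled into a small script. This is a minimal sketch and not a file shipped with the skill. The function names (`findMissingPackages`, `usesCamoufox`, `checkPrerequisites`) and the injectable `resolve` and `readFile` parameters are assumptions added so the logic can be exercised without installing anything.

```javascript
const fs = require("node:fs");

// Packages named in the prerequisite check above.
const REQUIRED_PACKAGES = ["playwright", "playwright-core", "camoufox-js"];

// Returns the packages whose package.json cannot be resolved.
function findMissingPackages(resolve = (id) => require.resolve(id)) {
  return REQUIRED_PACKAGES.filter((name) => {
    try {
      resolve(`${name}/package.json`);
      return false;
    } catch {
      return true;
    }
  });
}

// Same test as the second one-liner: does the source mention camoufox-js?
function usesCamoufox(source) {
  return /camoufox-js/.test(source);
}

// Combines both checks. An unreadable browse.ts counts as "not configured".
function checkPrerequisites({
  resolve = (id) => require.resolve(id),
  readFile = (path) => fs.readFileSync(path, "utf8"),
} = {}) {
  const missing = findMissingPackages(resolve);
  let configured = false;
  try {
    configured = usesCamoufox(readFile("browse.ts"));
  } catch {
    configured = false;
  }
  return { ok: missing.length === 0 && configured, missing, configured };
}
```

Calling `checkPrerequisites()` with no arguments from the scripts directory would run the same two checks as the commands above. When `ok` is false, return the "Missing dependency/config" message.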

If runtime fails with missing native bindings for better-sqlite3 or esbuild, run:

cd ~/.openclaw/workspace/skills/web-automation/scripts
pnpm approve-builds
pnpm rebuild better-sqlite3 esbuild

Quick Reference

One-shot JSON extract: node scripts/extract.js "https://example.com"
Browse page: npx tsx browse.ts --url "https://example.com"
Scrape markdown: npx tsx scrape.ts --url "https://example.com" --mode main --output page.md
Authenticate: npx tsx auth.ts --url "https://example.com/login"
Natural-language flow: npx tsx flow.ts --instruction 'go to https://example.com then click on "Login" then type "user@example.com" in #email then press enter'

One-Shot Extraction

Use extract.js when you need a single page fetch with JavaScript rendering and lightweight anti-bot shaping, but not a full automation session.

node scripts/extract.js "https://example.com"
WAIT_TIME=5000 node scripts/extract.js "https://example.com"
SCREENSHOT_PATH=/tmp/page.png SAVE_HTML=true node scripts/extract.js "https://example.com"

Output is JSON only and includes fields such as:

requestedUrl
finalUrl
title
content
metaDescription
status
elapsedSeconds
challengeDetected
optional screenshot
optional htmlFile
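
A caller will usually want to run extract.js from another program and act on the JSON. The following is a sketch under stated assumptions: the environment variable names and field names come from this section, but the field types (`content` as a string, `challengeDetected` as a truthy flag) and the helper names `buildExtractEnv`, `summarizeExtract`, and `runExtract` are assumptions, not part of the skill.

```javascript
const { execFile } = require("node:child_process");

// Builds the environment for extract.js from the variables shown above.
// `base` defaults to the current environment and is injectable for testing.
function buildExtractEnv({ waitMs, screenshotPath, saveHtml } = {}, base = process.env) {
  const env = { ...base };
  if (waitMs !== undefined) env.WAIT_TIME = String(waitMs);
  if (screenshotPath) env.SCREENSHOT_PATH = screenshotPath;
  if (saveHtml) env.SAVE_HTML = "true";
  return env;
}

// Parses the JSON output and reduces it to a few fields.
// Assumption: `content` is a string and `challengeDetected` is truthy
// when a bot challenge was seen.
function summarizeExtract(stdout) {
  const result = JSON.parse(stdout);
  return {
    url: result.finalUrl ?? result.requestedUrl,
    title: result.title,
    blocked: Boolean(result.challengeDetected),
    contentLength: typeof result.content === "string" ? result.content.length : 0,
  };
}

// Runs extract.js as a child process. Assumes the current directory
// is the one from which `node scripts/extract.js` works.
function runExtract(url, options = {}) {
  return new Promise((resolve, reject) => {
    execFile(
      "node",
      ["scripts/extract.js", url],
      { env: buildExtractEnv(options), maxBuffer: 16 * 1024 * 1024 },
      (error, stdout) => {
        if (error) return reject(error);
        try {
          resolve(summarizeExtract(stdout));
        } catch (parseError) {
          reject(parseError);
        }
      }
    );
  });
}
```

A call such as `runExtract("https://example.com", { waitMs: 5000 })` would then resolve to the summary object. Checking `blocked` before using the content is one way to avoid treating a challenge page as real data.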

General Flow Runner

Use flow.ts for multi-step commands in plain language (go/click/type/press/wait/screenshot).

Example:

npx tsx flow.ts --instruction 'go to https://search.fiorinis.com then type "pippo" then press enter then wait 2s'
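
When the steps come from a program rather than being typed by hand, the instruction string can be assembled from a list. This is a minimal sketch, assuming only what the examples show: steps are separated by the word "then" and typed text is wrapped in double quotes. `buildInstruction` and `quoted` are hypothetical helpers, and how flow.ts handles escaping is not documented here, so values that contain a double quote are rejected rather than guessed at.

```javascript
// Joins individual steps with " then ", the separator used in the examples.
function buildInstruction(steps) {
  const cleaned = steps.map((step) => step.trim()).filter((step) => step.length > 0);
  if (cleaned.length === 0) {
    throw new Error("at least one step is required");
  }
  return cleaned.join(" then ");
}

// Wraps a value in double quotes, as the examples do for typed text.
// Assumption: embedded double quotes are unsupported, so they are rejected.
function quoted(value) {
  if (value.includes('"')) {
    throw new Error("values containing double quotes are not supported in this sketch");
  }
  return `"${value}"`;
}

const instruction = buildInstruction([
  "go to https://search.fiorinis.com",
  `type ${quoted("pippo")}`,
  "press enter",
  "wait 2s",
]);

console.log(instruction);
```

Passing the result to flow.ts through an argument array (for example with `child_process.execFile`) rather than through a shell string avoids a second layer of quoting.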

Notes

Sessions persist in Camoufox profile storage.
Use --wait for dynamic pages.
Use --mode selector --selector "..." for targeted extraction.
extract.js applies stealth and bounded anti-bot shaping and leaves the Chromium sandbox enabled.
