# Scrapling MCP Server

The Scrapling MCP server exposes nine web scraping tools over the MCP protocol. It supports CSS-selector-based content narrowing (reducing tokens by extracting only relevant elements before returning results), three levels of scraping capability (plain HTTP, browser-rendered, and stealth/anti-bot bypass), and persistent browser session management.

All scraping tools return a `ResponseModel` with fields: `status` (int), `content` (list of strings), `url` (str).

## Tools

### `get` -- HTTP request (single URL)

Fast HTTP GET with browser fingerprint impersonation (TLS, headers). Suitable for static pages with no/low bot protection.

**Key parameters:**

| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | str | required | URL to fetch |
| `extraction_type` | `"markdown"` / `"html"` / `"text"` | `"markdown"` | Output format |
| `css_selector` | str or null | null | CSS selector to narrow content (applied after `main_content_only`) |
| `main_content_only` | bool | true | Restrict output to the page's main content |
| `impersonate` | str | `"chrome"` | Browser fingerprint to impersonate |
| `proxy` | str or null | null | Proxy URL, e.g. `"http://user:pass@host:port"` |
| `proxy_auth` | dict or null | null | `{"username": "...", "password": "..."}` |
| `auth` | dict or null | null | HTTP basic auth, same format as `proxy_auth` |
| `timeout` | number | 30 | Seconds before timeout |
| `retries` | int | 3 | Retry attempts on failure |
| `retry_delay` | int | 1 | Seconds between retries |
| `stealthy_headers` | bool | true | Generate realistic browser headers and Google referer |
| `http3` | bool | false | Use HTTP/3 (may conflict with `impersonate`) |
| `follow_redirects` | bool or `"safe"` | `"safe"` | Follow redirects. `"safe"` rejects redirects to internal/private IPs |
| `max_redirects` | int | 30 | Max redirects (-1 for unlimited) |
| `headers` | dict or null | null | Custom request headers |
| `cookies` | dict or null | null | Request cookies |
| `params` | dict or null | null | Query string parameters |
| `verify` | bool | true | Verify HTTPS certificates |

### `bulk_get` -- HTTP request (multiple URLs)

Async concurrent version of `get`. Same parameters except `url` is replaced by `urls` (list of strings). All URLs are fetched in parallel. Returns a list of `ResponseModel`.

### `fetch` -- Browser fetch (single URL)

Opens a Chromium browser via Playwright to render JavaScript. Suitable for dynamic/SPA pages with no/low bot protection.

**Key parameters (beyond shared ones):**

| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | str | required | URL to fetch |
| `extraction_type` | str | `"markdown"` | `"markdown"` / `"html"` / `"text"` |
| `css_selector` | str or null | null | Narrow content before extraction |
| `main_content_only` | bool | true | Restrict output to the page's main content |
| `headless` | bool | true | Run browser hidden (true) or visible (false) |
| `proxy` | str or dict or null | null | String URL or `{"server": "...", "username": "...", "password": "..."}` |
| `timeout` | number | 30000 | Timeout in **milliseconds** |
| `wait` | number | 0 | Extra wait (ms) after page load before extraction |
| `wait_selector` | str or null | null | CSS selector to wait for before extraction |
| `wait_selector_state` | str | `"attached"` | State for `wait_selector`: `"attached"` / `"visible"` / `"hidden"` / `"detached"` |
| `network_idle` | bool | false | Wait until no network activity for 500ms |
| `disable_resources` | bool | false | Block fonts, images, media, stylesheets, etc. for speed |
| `google_search` | bool | true | Set a Google referer header |
| `real_chrome` | bool | false | Use locally installed Chrome instead of bundled Chromium |
| `cdp_url` | str or null | null | Connect to existing browser via CDP URL |
| `extra_headers` | dict or null | null | Additional request headers |
| `useragent` | str or null | null | Custom user-agent (auto-generated if null) |
| `cookies` | list or null | null | Playwright-format cookies |
| `timezone_id` | str or null | null | Browser timezone, e.g. `"America/New_York"` |
| `locale` | str or null | null | Browser locale, e.g. `"en-GB"` |
| `session_id` | str or null | null | Reuse a persistent session from `open_session` instead of creating a new browser |

### `bulk_fetch` -- Browser fetch (multiple URLs)

Concurrent browser version of `fetch`. Same parameters (including `session_id`) except `url` is replaced by `urls` (list of strings). Each URL opens in a separate browser tab. Returns a list of `ResponseModel`.

### `stealthy_fetch` -- Stealth browser fetch (single URL)

Anti-bot bypass fetcher with fingerprint spoofing. Use this for sites with Cloudflare Turnstile/Interstitial or other strong protections.

**Additional parameters (beyond those in `fetch`):**

| Parameter | Type | Default | Description |
|---|---|---|---|
| `solve_cloudflare` | bool | false | Automatically solve Cloudflare Turnstile/Interstitial challenges |
| `hide_canvas` | bool | false | Add noise to canvas operations to prevent fingerprinting |
| `block_webrtc` | bool | false | Force WebRTC to respect proxy settings (prevents IP leak) |
| `allow_webgl` | bool | true | Keep WebGL enabled (disabling is detectable by WAFs) |
| `additional_args` | dict or null | null | Extra Playwright context args (overrides Scrapling defaults) |
| `session_id` | str or null | null | Reuse a persistent stealthy session from `open_session` |

All parameters from `fetch` are also accepted.

### `bulk_stealthy_fetch` -- Stealth browser fetch (multiple URLs)

Concurrent stealth version. Same parameters (including `session_id`) as `stealthy_fetch` except `url` is replaced by `urls` (list of strings). Returns a list of `ResponseModel`.

### `open_session` -- Create a persistent browser session

Opens a browser session that stays alive across multiple fetch calls, avoiding the overhead of launching a new browser each time. Returns a `SessionCreatedModel` with `session_id`, `session_type`, `created_at`, `is_alive`, and `message`.

**Key parameters:**

| Parameter | Type | Default | Description |
|---|---|---|---|
| `session_type` | `"dynamic"` / `"stealthy"` | required | Type of browser session to create |
| `headless` | bool | true | Run browser hidden or visible |
| `max_pages` | int | 5 | Max concurrent browser tabs (1-50) |
| `proxy` | str or dict or null | null | Proxy for all requests in this session |
| `timeout` | number | 30000 | Default timeout in ms |
| `solve_cloudflare` | bool | false | (Stealthy only) Auto-solve Cloudflare challenges |
| `hide_canvas` | bool | false | (Stealthy only) Canvas fingerprint noise |
| `block_webrtc` | bool | false | (Stealthy only) Block WebRTC IP leak |
| `allow_webgl` | bool | true | (Stealthy only) Keep WebGL enabled |

Plus all other browser session parameters (`google_search`, `real_chrome`, `cdp_url`, `locale`, `timezone_id`, `useragent`, `extra_headers`, `cookies`, `disable_resources`, `network_idle`, `wait_selector`, `wait_selector_state`).

A dynamic session can only be used with `fetch`/`bulk_fetch`. A stealthy session can only be used with `stealthy_fetch`/`bulk_stealthy_fetch`.

### `close_session` -- Close a persistent browser session

Closes a session and frees its browser resources. Always close sessions when done.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `session_id` | str | required | Session ID from `open_session` |

Returns a `SessionClosedModel` with `session_id` and `message`.

### `list_sessions` -- List active sessions

Returns a list of `SessionInfo` objects, each with `session_id`, `session_type`, `created_at`, and `is_alive`. No parameters.

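The three session tools are meant to be used together: open a session, pass its `session_id` to the matching fetch tool, then close it. The sketch below shows one way that sequence might look as MCP `tools/call` requests (JSON-RPC 2.0). The helpers `tool_call` and `session_requests`, the request ids, the example URLs, and the placeholder session id are assumptions made for illustration and are not part of Scrapling. A real client would read `session_id` from the `SessionCreatedModel` returned by `open_session`, and most MCP client libraries build these envelopes for you.

```python
import json
from itertools import count

_ids = count(1)  # illustrative JSON-RPC request ids


def tool_call(name: str, **arguments) -> dict:
    """Build an MCP `tools/call` request envelope for the named tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Fetch tools that accept each session type, per the rule stated above:
# dynamic -> fetch/bulk_fetch, stealthy -> stealthy_fetch/bulk_stealthy_fetch.
SESSION_TOOLS = {
    "dynamic": ("fetch", "bulk_fetch"),
    "stealthy": ("stealthy_fetch", "bulk_stealthy_fetch"),
}


def session_requests(session_type: str, urls: list[str], session_id: str) -> list[dict]:
    """Return the open -> fetch (one per URL) -> close request sequence."""
    single_tool, _bulk_tool = SESSION_TOOLS[session_type]
    requests = [tool_call("open_session", session_type=session_type)]
    requests += [tool_call(single_tool, url=u, session_id=session_id) for u in urls]
    requests.append(tool_call("close_session", session_id=session_id))
    return requests


# Placeholder id: in practice this comes from the open_session response.
example = session_requests(
    "stealthy",
    ["https://example.com/a", "https://example.com/b"],
    session_id="SESSION_ID_FROM_OPEN_SESSION",
)
for request in example:
    print(json.dumps(request))
```

Keeping the session-type-to-tool mapping in one table means a mismatched pairing (for example, a dynamic session passed to `stealthy_fetch`) cannot be generated by accident, and an unknown session type fails early with a `KeyError`.
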
## Tool selection guide

| Scenario | Tool |
|---|---|
| Static page, no bot protection | `get` |
| Multiple static pages | `bulk_get` |
| JavaScript-rendered / SPA page | `fetch` |
| Multiple JS-rendered pages | `bulk_fetch` |
| Cloudflare or strong anti-bot protection | `stealthy_fetch` (with `solve_cloudflare=true` for Turnstile) |
| Multiple protected pages | `bulk_stealthy_fetch` |
| Multiple pages from the same site | `open_session` + `fetch`/`stealthy_fetch` with `session_id` |

Start with `get` (fastest, lowest resource cost). Escalate to `fetch` if content requires JS rendering. Escalate to `stealthy_fetch` only if blocked. For multiple pages from the same site, use a persistent session to avoid browser launch overhead.

## Content extraction tips

- Use `css_selector` to narrow results before they reach the model -- this saves significant tokens.
- `main_content_only=true` (default) strips nav/footer by restricting output to the page's main content.
- `extraction_type="markdown"` (default) is best for readability. Use `"text"` for minimal output, `"html"` when structure matters.
- If a `css_selector` matches multiple elements, all are returned in the `content` list.

## Prompt injection protection

When `main_content_only=true` (the default), the server automatically sanitizes scraped content to prevent prompt injection from malicious websites. It strips:

- CSS-hidden elements (`display:none`, `visibility:hidden`, `opacity:0`, `font-size:0`, `height:0`, `width:0`)
- `aria-hidden="true"` elements
- `