Developer Guides

CAPTCHA Solver for Scrapy — Python Integration Guide

Scrapy handles HTTP at scale, but it has no built-in CAPTCHA handling. When a spider hits a protected page, Scrapy's default behavior is to receive the CAPTCHA page and parse garbage — silently failing. This guide covers how to detect and solve CAPTCHAs inside Scrapy spiders using the CaptchaAI API.

For the core API integration pattern, see the CAPTCHA Solver API Integration Guide.

Prerequisites

pip install scrapy requests

Where CAPTCHAs Appear in Scrapy Pipelines

  • Login form before scraping: reCAPTCHA v2/v3, hCaptcha
  • Protected listing page: Cloudflare Turnstile
  • High-volume crawling (IP flagged): reCAPTCHA v3, Cloudflare Managed
  • Form submission in spider: reCAPTCHA v2 invisible

Scrapy's HTTP-only engine can handle token injection approaches (where you inject the solved token into a POST request). For browser-dependent CAPTCHAs (Cloudflare Managed Challenge, some hCaptcha Enterprise), you need Playwright middleware — covered at the end.

Approach 1 — Solve and Inject in Spider (Simple)

The most straightforward approach: detect the CAPTCHA page in parse(), solve it, then re-submit the request with the token. Note that the solver call below blocks while polling, which stalls Scrapy's event loop; that is acceptable for a one-time login, but not for per-page solving at high concurrency.

import scrapy
import requests
import time


CAPTCHAI_KEY = "YOUR_API_KEY"


def solve_recaptcha_v2(page_url: str, site_key: str) -> str:
    """Submit a reCAPTCHA v2 task to CaptchaAI and poll until solved."""
    r = requests.post("https://ocr.captchaai.com/in.php", data={
        "key": CAPTCHAI_KEY,
        "method": "userrecaptcha",
        "googlekey": site_key,
        "pageurl": page_url,
        "json": 1,
    }, timeout=30)
    resp = r.json()
    if resp.get("status") != 1:
        raise RuntimeError(f"in.php error: {resp.get('request')}")
    task_id = resp["request"]

    time.sleep(5)
    for _ in range(24):
        r = requests.get("https://ocr.captchaai.com/res.php", params={
            "key": CAPTCHAI_KEY, "action": "get", "id": task_id, "json": 1,
        }, timeout=30)
        d = r.json()
        if d.get("status") == 1:
            return d["request"]
        if d.get("request") != "CAPCHA_NOT_READY":
            raise RuntimeError(f"res.php error: {d.get('request')}")
        time.sleep(5)
    raise TimeoutError("reCAPTCHA solve timed out")


class LoginSpider(scrapy.Spider):
    name = "login_spider"
    start_urls = ["https://example.com/login"]

    def parse(self, response):
        # Extract site key from the page
        site_key = response.css("[data-sitekey]::attr(data-sitekey)").get()

        if site_key:
            # Solve the CAPTCHA
            token = solve_recaptcha_v2(response.url, site_key)

            # Submit the login form with the CAPTCHA token
            yield scrapy.FormRequest.from_response(
                response,
                formdata={
                    "username": "user@example.com",
                    "password": "password123",
                    "g-recaptcha-response": token,
                },
                callback=self.after_login,
            )
        else:
            yield from self.after_login(response)

    def after_login(self, response):
        if "dashboard" in response.url or "welcome" in response.text.lower():
            self.logger.info("Login successful")
            # Continue scraping...
        else:
            self.logger.error(f"Login failed: {response.url}")

Approach 2 — Downloader Middleware (Scalable)

For spiders that encounter CAPTCHAs across many pages, a downloader middleware centralizes CAPTCHA handling without modifying each spider. Bear in mind that the solver calls below are blocking, so each solve stalls the engine for its duration; keep CONCURRENT_REQUESTS modest, or offload the solve to a thread if throughput matters.

Middleware Implementation

# middlewares/captcha_middleware.py

import re
import requests
import time
import logging

logger = logging.getLogger(__name__)


def _extract_sitekey(text: str, patterns: list[str]) -> str | None:
    for p in patterns:
        m = re.search(p, text)
        if m:
            return m.group(1)
    return None


def _submit_and_poll(api_key: str, task_data: dict, first_wait: int) -> str:
    """Submit a task to in.php, then poll res.php until solved or timed out."""
    r = requests.post("https://ocr.captchaai.com/in.php", data={
        "key": api_key, "json": 1, **task_data,
    }, timeout=30)
    resp = r.json()
    if resp.get("status") != 1:
        raise RuntimeError(f"in.php error: {resp.get('request')}")
    task_id = resp["request"]

    time.sleep(first_wait)
    for _ in range(24):
        r = requests.get("https://ocr.captchaai.com/res.php", params={
            "key": api_key, "action": "get", "id": task_id, "json": 1,
        }, timeout=30)
        d = r.json()
        if d.get("status") == 1:
            return d["request"]
        if d.get("request") != "CAPCHA_NOT_READY":
            raise RuntimeError(f"res.php error: {d.get('request')}")
        time.sleep(5)
    raise TimeoutError("Solve timed out")


def _solve_recaptcha_v2(api_key: str, page_url: str, site_key: str) -> str:
    return _submit_and_poll(api_key, {
        "method": "userrecaptcha", "googlekey": site_key, "pageurl": page_url,
    }, first_wait=5)


def _solve_hcaptcha(api_key: str, page_url: str, site_key: str) -> str:
    return _submit_and_poll(api_key, {
        "method": "hcaptcha", "sitekey": site_key, "pageurl": page_url,
    }, first_wait=10)


class CaptchaSolverMiddleware:
    """
    Scrapy downloader middleware that detects and solves CAPTCHAs
    before returning the response to the spider.
    """

    RECAPTCHA_PATTERNS = [
        r'data-sitekey=["\']([0-9A-Za-z_\-]{40,})["\']',
        r'"sitekey"\s*:\s*"([0-9A-Za-z_\-]{40,})"',
    ]
    HCAPTCHA_PATTERNS = [
        r'data-sitekey=["\']([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})["\']',
    ]

    def __init__(self, api_key: str):
        self.api_key = api_key

    @classmethod
    def from_crawler(cls, crawler):
        return cls(api_key=crawler.settings.get("CAPTCHAI_API_KEY"))

    def process_response(self, request, response, spider):
        # Only attempt one solve per original request, to avoid re-solve loops
        if request.meta.get("captcha_retried"):
            return response

        body = response.text

        # Detect hCaptcha (UUID site key format — check first)
        hcaptcha_key = _extract_sitekey(body, self.HCAPTCHA_PATTERNS)
        if hcaptcha_key and "hcaptcha" in body:
            logger.info(f"hCaptcha detected on {response.url}")
            token = _solve_hcaptcha(self.api_key, response.url, hcaptcha_key)
            return self._resubmit_with_token(
                request, response, "h-captcha-response", token
            )

        # Detect reCAPTCHA (40-char alphanumeric site key)
        recaptcha_key = _extract_sitekey(body, self.RECAPTCHA_PATTERNS)
        if recaptcha_key and ("grecaptcha" in body or "recaptcha" in body):
            logger.info(f"reCAPTCHA detected on {response.url}")
            token = _solve_recaptcha_v2(self.api_key, response.url, recaptcha_key)
            return self._resubmit_with_token(
                request, response, "g-recaptcha-response", token
            )

        return response

    def _resubmit_with_token(self, request, response, field_name: str, token: str):
        """Re-submit the original request with the CAPTCHA token injected."""
        from scrapy.http import FormRequest

        return FormRequest.from_response(
            response,
            formdata={field_name: token},
            callback=request.callback,
            errback=request.errback,
            meta={**request.meta, "captcha_retried": True},
            dont_filter=True,
        )

Register in settings.py

# settings.py

CAPTCHAI_API_KEY = "YOUR_API_KEY"

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.captcha_middleware.CaptchaSolverMiddleware": 600,
}

The middleware runs at priority 600, between Scrapy's built-in RetryMiddleware (550) and CookiesMiddleware (700). Note that RedirectMiddleware also defaults to 600; if you need deterministic ordering relative to redirects, pick a nearby free slot such as 610.

Approach 3 — Scrapy-Playwright for Browser-Dependent CAPTCHAs

For Cloudflare Turnstile or Managed Challenge, use scrapy-playwright to run a real browser:

pip install scrapy-playwright
playwright install chromium

# settings.py additions
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}

# spider using playwright + captcha solving
import scrapy
from scrapy_playwright.page import PageMethod


class CloudflareSpider(scrapy.Spider):
    name = "cf_spider"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/protected",
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_load_state", "networkidle"),
                ],
            },
            callback=self.parse_protected,
        )

    def parse_protected(self, response):
        # At this point, playwright has loaded the page
        # If Turnstile is present, solve and inject via page methods
        site_key = response.css("[data-sitekey]::attr(data-sitekey)").get()
        if site_key:
            # Solve externally (solve_turnstile mirrors the earlier in.php/res.php
            # helpers, using the provider's Turnstile method), then inject via PageMethod
            token = solve_turnstile(CAPTCHAI_KEY, response.url, site_key)
            self.logger.info(f"Turnstile token obtained: {token[:20]}...")
            # Inject and submit would be handled via additional PageMethods
        yield {"url": response.url, "title": response.css("title::text").get()}
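The solve_turnstile helper referenced above is not defined in this snippet. A minimal sketch is below, assuming CaptchaAI's 2captcha-compatible endpoint accepts method=turnstile with sitekey and pageurl parameters (verify the exact method name against the provider's docs):

```python
import time
import requests


def solve_turnstile(api_key: str, page_url: str, site_key: str) -> str:
    """Submit a Turnstile task and poll until a token is returned."""
    r = requests.post("https://ocr.captchaai.com/in.php", data={
        "key": api_key,
        "method": "turnstile",   # assumed method name (2captcha-compatible API)
        "sitekey": site_key,
        "pageurl": page_url,
        "json": 1,
    }, timeout=30)
    resp = r.json()
    if resp.get("status") != 1:
        raise RuntimeError(f"in.php error: {resp.get('request')}")
    task_id = resp["request"]

    time.sleep(10)
    for _ in range(24):
        r = requests.get("https://ocr.captchaai.com/res.php", params={
            "key": api_key, "action": "get", "id": task_id, "json": 1,
        }, timeout=30)
        d = r.json()
        if d.get("status") == 1:
            return d["request"]
        if d.get("request") != "CAPCHA_NOT_READY":
            raise RuntimeError(f"res.php error: {d.get('request')}")
        time.sleep(5)
    raise TimeoutError("Turnstile solve timed out")
```

The polling structure is identical to the reCAPTCHA helper in Approach 1, so the two can share a common submit-and-poll function if you prefer.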

Common Issues

  • Spider gets CAPTCHA HTML, not target data. Cause: CAPTCHA not detected. Fix: check the detection regex against the actual page HTML.
  • FormRequest.from_response fails. Cause: the page has no <form> tag. Fix: build the POST request manually with scrapy.Request.
  • Token rejected at the target. Cause: the middleware re-submits a GET instead of a POST. Fix: ensure _resubmit_with_token uses the correct method.
  • Middleware not triggering. Cause: priority conflict. Fix: check the DOWNLOADER_MIDDLEWARES priority; 600 is generally safe.
  • Concurrent requests solve the same page twice. Cause: race condition. Fix: set CONCURRENT_REQUESTS = 1 while debugging.

Production Readiness Notes

Use this guide as a decision and implementation aid, not just a one-time reference. The practical test for a Scrapy CAPTCHA solver is whether the same approach behaves reliably when traffic is messy: rotating sessions, expired tokens, changing widget parameters, intermittent solver delays, and target pages that refresh without warning. For Python developers building scrapers with Scrapy, the safest rollout is to start with a narrow fixture, record every submitted task, and compare the solver response with the browser state that finally submits the form. That makes failures explainable instead of mysterious, especially when a target alternates between visible challenges, invisible checks, and server-side verification.

Evaluation Criteria

A how-to should be exercised against staging first, then promoted with feature flags so failed solves can fall back without blocking the entire workflow. For developer integration work, the most useful scorecard combines technical acceptance with operational cost. A low nominal price is not enough if retries double the real cost per accepted token, and a fast median solve time is not enough if p95 latency stalls the queue. Track these criteria before you standardize the workflow:

  • The challenge subtype, sitekey, action, rqdata, blob, captchaId, or page URL used for each task.
  • Median and p95 solve time, separated by provider and target domain.
  • Accepted-token rate on the target page, not just successful API responses.
  • Retry count, timeout count, zero-balance incidents, and invalid-parameter errors.
  • The exact browser, proxy region, and user-agent that submitted the solved token.
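The latency criteria above can be tracked with nothing beyond the standard library. A minimal sketch, with illustrative field names, computing median and p95 from recorded solve times:

```python
import statistics


def latency_summary(solve_times_ms: list[float]) -> dict:
    """Median and p95 of recorded solve times (group per provider/domain upstream)."""
    ordered = sorted(solve_times_ms)
    # statistics.quantiles with n=20 yields 19 cut points; index 18 is the 95th percentile
    p95 = statistics.quantiles(ordered, n=20)[18]
    return {
        "count": len(ordered),
        "median_ms": statistics.median(ordered),
        "p95_ms": p95,
    }


summary = latency_summary([900, 1100, 1300, 5200, 1000, 1250, 980, 1500, 7000, 1050])
```

Computing both values from the same recorded samples is what surfaces the failure mode described above: a healthy median hiding a p95 that stalls the queue.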

Rollout Checklist

Before this guidance moves into a production job, build a small acceptance suite around the pages that matter most. Run it with a fixed browser profile, then repeat with the proxy and concurrency settings you expect in production. Keep the first release conservative: bounded polling, clear timeout handling, and a fallback path when the solver cannot return a usable answer. For developer integration, treat the code as a production pattern: timeouts, retries, logging, secret storage, and test fixtures matter as much as the solve request itself. That checklist keeps the article useful after the first copy-paste, because the integration is judged by end-to-end completion rather than by whether a code sample returned a string.

Monitoring Signals

Healthy CAPTCHA automation is observable. Log the task id, provider, challenge type, target host, queue time, solve time, final submit status, and normalized error code for every attempt. Review those logs in daily batches at first, then move to alerts once the baseline is stable. Sudden drops usually come from target-side changes: a new sitekey, a changed action name, a stricter hostname check, an added managed challenge, or a proxy pool that no longer matches the expected geography. When you can see those shifts quickly, provider switching becomes a controlled decision instead of a late-night rewrite.

Maintenance Cadence

Revisit the setup whenever the target UI changes, when the solver provider changes task names or pricing, or when benchmark data shows a sustained latency or solve-rate shift. Keep one known-good fixture for each CAPTCHA subtype and rerun it after dependency upgrades, browser updates, and proxy changes. If the article is used for vendor selection, repeat the same fixture across at least two providers before renewing a balance or migrating the whole pipeline. That habit keeps the Scrapy CAPTCHA-solving work aligned with real target behavior rather than with stale assumptions.

Related Posts

Developer Guides: How to Use a CAPTCHA Solver with Selenium
Step-by-step guide to integrating a CAPTCHA solver into Selenium automation. Covers reCAPTCHA v2, hCaptcha, ...
May 04, 2026

Developer Guides: How to Use a CAPTCHA Solver with Playwright
Step-by-step guide to integrating a CAPTCHA solver into Playwright automation. Covers reCAPTCHA v2, hCaptcha, ...
May 06, 2026

hCaptcha: How to Solve hCaptcha in Python
Complete Python tutorial for solving hCaptcha automatically — covers site key extraction, solver API integration with CaptchaAI, token injection using Playwright, ...
May 05, 2026