D4Vinci/Scrapling

Selection methods · Fetchers · Spiders · Proxy Rotation · CLI · MCP

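Scrapling is an adaptive web scraping framework that handles everything from a single request to a full-scale crawl.

Its parser learns from website changes and automatically relocates your elements when pages update. Its fetchers bypass anti-bot systems such as Cloudflare Turnstile out of the box. And its spider framework lets you scale up to concurrent, multi-session crawls in a few lines of Python, with pause/resume and automatic proxy rotation. One library, zero compromises.

Blazing-fast crawls with real-time stats and streaming. Built by Web Scrapers for Web Scrapers and regular users alike, so there is something for everyone.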

from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
StealthyFetcher.adaptive = True
p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True) # Fetch website under the radar!
...

Or scale up to full crawls

from scrapling.spiders import Spider, Response
class MySpider(Spider):
    name = "demo"
    ...



🕷️ Scrapy-like Spider API: Define spiders with `start_urls`, async `parse` callbacks, and `Request`/`Response` objects.
⚡ Concurrent Crawling: Configurable concurrency limits, per-domain throttling, and download delays.
🔄 Multi-Session Support: Unified interface for HTTP requests, and stealthy headless browsers in a single spider - route requests to different sessions by ID.
💾 Pause & Resume: Checkpoint-based crawl persistence. Press Ctrl+C for a graceful shutdown; restart to resume from where you left off.
📡 Streaming Mode: Stream scraped items as they arrive via `async for item in spider.stream()` with real-time stats - ideal for UI, pipelines, and long-running crawls.
🛡️ Blocked Request Detection: Automatic detection and retry of blocked requests with customizable logic.
🤖 Robots.txt Compliance: Optional `robots_txt_obey` flag that respects `Disallow`, `Crawl-delay`, and `Request-rate` directives with per-domain caching.
🧪 Development Mode: Cache responses to disk on the first run and replay them on subsequent runs - iterate on your `parse()` logic without re-hitting the target servers.
📦 Built-in Export: Export results through hooks and your own pipeline or the built-in JSON/JSONL with `result.items.to_json()` / `result.items.to_jsonl()` respectively.

HTTP Requests: Fast and stealthy HTTP requests with the `Fetcher` class. Can impersonate browsers' TLS fingerprint, headers, and use HTTP/3.
Dynamic Loading: Fetch dynamic websites with full browser automation through the `DynamicFetcher` class supporting Playwright's Chromium and Google's Chrome.
Anti-bot Bypass: Advanced stealth capabilities with `StealthyFetcher` and fingerprint spoofing. Can easily bypass all types of Cloudflare's Turnstile/Interstitial with automation.
Session Management: Persistent session support with `FetcherSession`, `StealthySession`, and `DynamicSession` classes for cookie and state management across requests.
Proxy Rotation: Built-in `ProxyRotator` with cyclic or custom rotation strategies across all session types, plus per-request proxy overrides.
Domain & Ad Blocking: Block requests to specific domains (and their subdomains) or enable built-in ad blocking (~3,500 known ad/tracker domains) in browser-based fetchers.
DNS Leak Prevention: Optional DNS-over-HTTPS support to route DNS queries through Cloudflare's DoH, preventing DNS leaks when using proxies.
Async Support: Complete async support across all fetchers and dedicated async session classes.
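
To make "cyclic rotation plus per-request overrides" concrete, here is a minimal standard-library sketch. The `CyclicRotator` class, the `pick_proxy` helper, and the proxy URLs are hypothetical; this is not the `ProxyRotator` API, only the round-robin idea behind it.

```python
from itertools import cycle
from typing import Optional


class CyclicRotator:
    """Hypothetical round-robin rotator (not Scrapling's ProxyRotator)."""

    def __init__(self, proxies: list[str]):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)  # endless iterator over the proxy list

    def next_proxy(self) -> str:
        return next(self._pool)


def pick_proxy(rotator: CyclicRotator, override: Optional[str] = None) -> str:
    """A per-request override wins; otherwise take the next proxy in the cycle."""
    return override if override is not None else rotator.next_proxy()


rotator = CyclicRotator(["http://proxy-a:8080", "http://proxy-b:8080"])
picked = [pick_proxy(rotator) for _ in range(3)]
overridden = pick_proxy(rotator, override="http://proxy-c:8080")
after_override = pick_proxy(rotator)  # the override did not advance the cycle
```

A custom strategy would replace `next_proxy` with any other selection rule, for example random choice or least-recently-failed.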

🔄 Smart Element Tracking: Relocate elements after website changes using intelligent similarity algorithms.
🎯 Smart Flexible Selection: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
🔍 Find Similar Elements: Automatically locate elements similar to found elements.
🤖 MCP Server to be used with AI: Built-in MCP server for AI-assisted Web Scraping and data extraction. The MCP server features powerful, custom capabilities that leverage Scrapling to extract targeted content before passing it to the AI (Claude/Cursor/etc), thereby speeding up operations and reducing costs by minimizing token usage. (demo video)
🚀 Lightning Fast: Optimized performance that outperforms most Python scraping libraries.
🔋 Memory Efficient: Optimized data structures and lazy loading for a minimal memory footprint.
⚡ Fast JSON Serialization: 10x faster than the standard library.
🏗️ Battle tested: Scrapling has 92% test coverage and full type hints, and it has been used daily by hundreds of Web Scrapers over the past year.
🎯 Interactive Web Scraping Shell: Optional built-in IPython shell with Scrapling integration, shortcuts, and new tools that speed up the development of Web Scraping scripts, such as converting curl requests to Scrapling requests and viewing request results in your browser.
🚀 Use it directly from the Terminal: Scrape a URL directly without writing any code.
🛠️ Rich Navigation API: Advanced DOM traversal with parent, sibling, and child navigation methods.
🧬 Enhanced Text Processing: Built-in regex, cleaning methods, and optimized string operations.
📝 Auto Selector Generation: Generate robust CSS/XPath selectors for any element.
🔌 Familiar API: Similar to Scrapy/BeautifulSoup, with the same pseudo-elements used in Scrapy/Parsel.
📘 Complete Type Coverage: Full type hints for excellent IDE support and code completion. The code is automatically scanned with PyRight and MyPy on every change.
🔋 Ready Docker image: With each release, a Docker image containing all browsers is automatically built and pushed.
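
The idea of relocating an element by similarity after a page changes can be sketched with the standard library alone. This is only an illustration under stated assumptions: the `Element` record, the `fingerprint` string, the `difflib` scoring, and the 0.7 threshold are all hypothetical and are not the algorithm Scrapling uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass(frozen=True)
class Element:
    """Hypothetical snapshot of the properties saved for an element."""
    tag: str
    attrs: tuple[tuple[str, str], ...]
    text: str
    path: str  # tag path from the document root


def fingerprint(el: Element) -> str:
    attrs = " ".join(f"{key}={value}" for key, value in el.attrs)
    return f"{el.tag}|{attrs}|{el.text}|{el.path}"


def similarity(a: Element, b: Element) -> float:
    """Score in [0, 1]; 1.0 means the fingerprints are identical."""
    return SequenceMatcher(None, fingerprint(a), fingerprint(b)).ratio()


def relocate(saved: Element, candidates: list[Element], threshold: float = 0.7) -> Optional[Element]:
    """Return the most similar candidate, or None if nothing is close enough."""
    best = max(candidates, key=lambda c: similarity(saved, c), default=None)
    if best is None or similarity(saved, best) < threshold:
        return None
    return best


# Element saved from the old page layout.
saved = Element("span", (("class", "price"),), "$19.99", "html/body/div/span")

# Elements found after the site was redesigned (class and path both changed).
new_price = Element("span", (("class", "product-price"),), "$21.49", "html/body/main/div/span")
nav_link = Element("a", (("class", "nav-link"),), "Home", "html/body/nav/a")

found = relocate(saved, [nav_link, new_price])
missing = relocate(saved, [Element("a", (), "Home", "nav")])
```

The threshold guards against returning an unrelated element when the original has been removed from the page entirely.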

Let's take a quick look at what Scrapling can do without diving deep.

HTTP requests with session support

from scrapling.fetchers import Fetcher, FetcherSession
with FetcherSession(impersonate='chrome') as session: # Use the latest version of Chrome's TLS fingerprint
    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
...

Advanced stealth mode

from scrapling.fetchers import StealthyFetcher, StealthySession
with StealthySession(headless=True, solve_cloudflare=True) as session: # Keep the browser open until you finish
    page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
...

Full browser automation

from scrapling.fetchers import DynamicFetcher, DynamicSession
with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session: # Keep the browser open until you finish
    page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
...

Build full crawlers with concurrent requests, multiple session types, and pause/resume:

from scrapling.spiders import Spider, Request, Response
class QuotesSpider(Spider):
    name = "quotes"
    ...
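
The concurrency limit and streaming behaviour described for spiders can be illustrated with plain `asyncio`. This is a sketch of the general pattern, not Scrapling's spider engine: `fake_fetch`, `stream`, the `stats` dictionary, and the URLs are hypothetical stand-ins, and no real network request is made.

```python
import asyncio
from typing import AsyncIterator

# Minimal real-time stats, updated while requests are in flight.
stats = {"in_flight": 0, "peak": 0, "done": 0}


async def fake_fetch(url: str) -> dict:
    """Stand-in for a network request (hypothetical; performs no real I/O)."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    stats["done"] += 1
    return {"url": url, "status": 200}


async def stream(urls: list[str], limit: int = 2) -> AsyncIterator[dict]:
    """Yield each item as soon as it is ready, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url: str) -> dict:
        async with semaphore:
            return await fake_fetch(url)

    tasks = [asyncio.create_task(bounded(url)) for url in urls]
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def main() -> list[dict]:
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    collected = []
    async for item in stream(urls):
        collected.append(item)  # a UI or pipeline could consume each item here
    return collected


items = asyncio.run(main())
```

The semaphore caps concurrency, while the async generator lets the consumer process items before the whole crawl has finished.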

Use multiple session types in a single spider:

from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession
class MultiSessionSpider(Spider):
    ...

Pause and resume long crawls with checkpoints:

QuotesSpider(crawldir="./crawl_data").start()

Press Ctrl+C to pause gracefully; progress is saved automatically. When you start the spider again later, pass the same crawldir and it resumes from where it stopped.
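
The checkpoint idea can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not Scrapling's on-disk format: the `crawl` function, the `checkpoint.json` file name, and the `stop_after` parameter (which simulates pressing Ctrl+C) are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def crawl(urls: list[str], crawldir: str, stop_after: Optional[int] = None) -> list[str]:
    """Process URLs, skipping those recorded in the checkpoint file."""
    state_file = Path(crawldir) / "checkpoint.json"  # hypothetical file name
    done = set(json.loads(state_file.read_text())) if state_file.exists() else set()
    processed = []
    try:
        for url in urls:
            if url in done:
                continue  # finished in an earlier run
            if stop_after is not None and len(processed) >= stop_after:
                raise KeyboardInterrupt  # simulate the user pressing Ctrl+C
            processed.append(url)  # a real spider would fetch and parse here
            done.add(url)
    except KeyboardInterrupt:
        pass  # graceful shutdown: fall through and save progress
    finally:
        state_file.write_text(json.dumps(sorted(done)))
    return processed


urls = [f"https://example.com/page/{i}" for i in range(4)]
with tempfile.TemporaryDirectory() as crawldir:
    first_run = crawl(urls, crawldir, stop_after=2)  # interrupted after two pages
    second_run = crawl(urls, crawldir)               # same directory: resumes
```

Because the second call receives the same directory, it finds the saved state and only processes the pages the first run did not reach.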

from scrapling.fetchers import Fetcher
# Rich element selection and navigation
page = Fetcher.get('https://quotes.toscrape.com/')
...

You can use the parser directly without fetching a website:

from scrapling.parser import Selector
page = Selector("<html>...</html>")

And it works exactly the same way!

import asyncio
from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession
async with FetcherSession(http3=True) as session: # `FetcherSession` is context-aware and can work in both sync/async patterns
    ...
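
A class can support both `with` and `async with` by implementing both context-manager protocols. The sketch below shows that general Python mechanism; `DualSession` is a hypothetical class and says nothing about how `FetcherSession` is built internally. It also shows that `async with` has to run inside a coroutine.

```python
import asyncio


class DualSession:
    """Hypothetical session usable with both `with` and `async with`."""

    def __init__(self) -> None:
        self.mode = None
        self.closed = False

    # Synchronous context-manager protocol.
    def __enter__(self) -> "DualSession":
        self.mode = "sync"
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.closed = True

    # Asynchronous context-manager protocol.
    async def __aenter__(self) -> "DualSession":
        self.mode = "async"
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        self.closed = True


with DualSession() as sync_session:
    sync_mode = sync_session.mode


async def main() -> str:
    # `async with` is only valid inside a coroutine.
    async with DualSession() as async_session:
        return async_session.mode


async_mode = asyncio.run(main())
```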

Scrapling includes a powerful command-line interface (CLI):

Launch the interactive Web Scraping shell

scrapling shell

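Extract pages to a file directly without writing code (by default, the content inside the body tag is extracted). If the output file ends with .txt, the text content of the target is extracted; if it ends with .md, the output is a Markdown representation of the HTML content; if it ends with .html, the output is the HTML content itself.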

scrapling extract get 'https://example.com' content.md
scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome' # All elements matching the CSS selector `#fromSkipToProducts`
scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
...
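
The rule that the output file's extension selects the output format can be sketched as a small dispatch function. This is not the CLI's code: `output_format`, `render`, and the `_TextExtractor` helper are hypothetical, and Markdown conversion is left out because it needs a dedicated converter.

```python
from html.parser import HTMLParser
from pathlib import Path

# Extension-to-format mapping, following the rule described above.
FORMATS = {".txt": "text", ".md": "markdown", ".html": "html"}


class _TextExtractor(HTMLParser):
    """Collect the non-empty text nodes of an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.chunks.append(data.strip())


def output_format(filename: str) -> str:
    """Choose the output format from the file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"unsupported output extension: {suffix!r}")
    return FORMATS[suffix]


def render(html: str, filename: str) -> str:
    """Render HTML for a .txt or .html target."""
    fmt = output_format(filename)
    if fmt == "html":
        return html
    if fmt == "text":
        extractor = _TextExtractor()
        extractor.feed(html)
        extractor.close()
        return "\n".join(extractor.chunks)
    raise NotImplementedError("Markdown output needs a dedicated HTML-to-Markdown converter")


sample_html = "<body><h1>Title</h1><p>Hello</p></body>"
as_text = render(sample_html, "content.txt")
as_html = render(sample_html, "content.html")
```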

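Note

There are many additional features, including the MCP server and the interactive Web Scraping shell, but we want to keep this page concise. Check out the full documentation.

Scrapling is not only powerful, it is also blazing fast. The following benchmarks compare Scrapling's parser with the latest versions of other popular libraries.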

#	Library	Time (ms)	vs Scrapling
1	Scrapling	2.02	1.0x
...
Scrapling's adaptive element finding significantly outperforms the alternative:

Library	Time (ms)	vs Scrapling
Scrapling	2.39	1.0x
AutoScraper	12.45	5.209x

All benchmarks are averages of 100+ runs. See benchmarks.py for the methodology.
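
For readers who want to reproduce this kind of measurement, the sketch below shows one way to average a function's runtime over 100 runs and to derive a "vs Scrapling" ratio from two averages. It uses the standard `timeit` module and is an assumption about the general approach, not the contents of benchmarks.py; `average_ms`, `relative_to`, and the sample workload are hypothetical.

```python
import timeit
from typing import Callable


def average_ms(func: Callable[[], object], runs: int = 100) -> float:
    """Average wall-clock time per call, in milliseconds, over `runs` calls."""
    total_seconds = timeit.timeit(func, number=runs)
    return total_seconds / runs * 1000


def relative_to(baseline_ms: float, other_ms: float) -> float:
    """How many times slower `other_ms` is than the baseline."""
    return round(other_ms / baseline_ms, 3)


# Placeholder workload; a real benchmark would call each library's parser here.
sample_ms = average_ms(lambda: sorted(range(1000)))

# The ratio column of the adaptive-finding table follows from its two timings.
ratio = relative_to(2.39, 12.45)
```

Dividing the two reported timings this way gives the 5.209x figure shown for AutoScraper.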

Scrapling requires Python 3.10 or higher:

pip install scrapling

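This installation includes only the parser engine and its dependencies, without any fetchers or command-line dependencies.

To use any of the extra features below, the fetchers, or their classes, you need to install the fetchers' dependencies and their browser dependencies as follows: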

pip install "scrapling[fetchers]"

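scrapling install # normal install

scrapling install --force # force reinstall

This downloads all browsers, along with their system dependencies and fingerprint manipulation dependencies.

Or you can install them from code instead of running a command:

from scrapling.cli import install
install([], standalone_mode=False) # normal install
install(["--force"], standalone_mode=False) # force reinstall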

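Extra features:
Install the MCP server feature: pip install "scrapling[ai]"
Install the shell features (Web Scraping shell and the extract command): pip install "scrapling[shell]"
Install everything: pip install "scrapling[all]"

After installing any of these extras, you need to install the browser dependencies with scrapling install (if you have not already).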


You can also install a Docker image with all extras and browsers from DockerHub with the following command:

docker pull pyd4vinci/scrapling

Or download it from the GitHub registry:

docker pull ghcr.io/d4vinci/scrapling:latest

This image is automatically built and pushed using GitHub Actions from the repository's main branch.
