webscraping.app
Distributed Crawling
A curated collection of the best distributed web crawling systems for large-scale data collection across multiple nodes.
Popular Categories: Browser Automation (14), Scraping Frameworks (11), Analytics Databases (9), SERP APIs (9), ETL Tools (9), Workflow Orchestration (9), AI Web Scraping (8), Scraping APIs (8), Distributed Crawling (6), Cloud Compute (6), Proxy Services (6), Search Engines (6)
Scrapy Cluster: Distributed on-demand scraping with Scrapy
Scrapy Cluster uses Redis and Kafka to create a distributed, on-demand Scrapy crawling cluster for coordinated large-scale web scraping.
Frontera: Scalable crawl frontier framework
Frontera is a Python crawl frontier framework for managing when and what to crawl, enabling web crawlers of any scale, with Scrapy integration.
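The core job of a crawl frontier is deciding what to fetch next while never scheduling the same URL twice. The toy class below sketches that idea in plain Python; it is an illustration of the concept only, not Frontera's actual API.

```python
from collections import deque

class CrawlFrontier:
    """Toy crawl frontier: tracks what to crawl next and avoids revisits.
    Illustrative only -- not Frontera's real API."""

    def __init__(self, seeds):
        self._queue = deque()
        self._seen = set()
        for url in seeds:
            self.add(url)

    def add(self, url):
        # Only enqueue URLs that have never been scheduled before.
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def next_url(self):
        # FIFO order gives breadth-first crawling; a real frontier would
        # also apply politeness delays, priorities, and per-host budgets.
        return self._queue.popleft() if self._queue else None

frontier = CrawlFrontier(["https://example.com/"])
frontier.add("https://example.com/page1")
frontier.add("https://example.com/")  # duplicate, silently ignored
```

Frontera generalizes this pattern with pluggable backends (in-memory, SQL, distributed), which is what lets the same frontier logic scale from a single process to a crawler fleet.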
Nutch: Highly extensible and scalable web crawler
Apache Nutch is a highly extensible, production-ready web crawler built on Hadoop for large-scale batch crawling and data acquisition tasks.
Heritrix: Web-scale archival web crawler
Heritrix is the Internet Archive's open-source, extensible, archival-quality web crawler, designed for large-scale web preservation and data collection.
Scrapy-Redis: Redis-based distributed components for Scrapy
Scrapy-Redis provides Redis-backed components for Scrapy, enabling distributed crawling with shared request queues and item pipelines.
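Wiring Scrapy-Redis into an existing project is typically a matter of a few lines in the project's settings.py, pointing the scheduler and duplicate filter at a shared Redis instance. A minimal sketch (the Redis URL and pipeline priority here are illustrative values, not requirements):

```python
# settings.py -- route scheduling and deduplication through a shared Redis
# instance so every worker running the same spider pulls from one queue.

# Use Scrapy-Redis's scheduler and duplicate filter instead of the defaults.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the request queue in Redis between runs so crawls can pause and resume.
SCHEDULER_PERSIST = True

# Optionally push scraped items into Redis as a shared item pipeline.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}

# Location of the shared Redis instance (adjust for your deployment).
REDIS_URL = "redis://localhost:6379"
```

With this in place, the same spider can be started on any number of machines and they cooperate automatically: whichever worker is free pops the next request from the shared queue, and the shared dupefilter prevents two workers from fetching the same URL.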
Scrapy: Fast, high-level web crawling framework for Python (also listed under Scraping Frameworks)
Scrapy is an open-source Python framework for building fast, scalable web crawlers that extract structured data from websites efficiently.