Scrapy Cluster

Published Jun 12, 2026. Last updated Jul 13, 2026.

Scrapy Cluster uses Redis and Kafka to create a distributed, on-demand Scrapy crawling cluster for coordinated large-scale web scraping.

Sources

github.com

Verified against the istresearch/scrapy-cluster repository.

Scrapy Cluster is a distributed, on-demand web scraping system that combines Scrapy with Redis and Kafka for coordinated, large-scale crawling operations.

Key Features:

Kafka Integration - API-driven crawl requests via Kafka for real-time, on-demand scraping
Redis Coordination - Distributed request management and deduplication across nodes
Horizontal Scaling - Add crawler nodes dynamically to increase throughput
Kafka Monitor - Validates crawl and action requests arriving on Kafka and hands them to Redis; a separate REST service accepts jobs over HTTP
Modular Design - Pluggable components for customizing crawl behavior and data flow

Whether you're building on-demand scraping services, scaling data collection infrastructure, or coordinating crawlers across a cluster, Scrapy Cluster provides a proven architecture for distributed Scrapy deployments.
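
As a rough illustration of the Kafka-driven request flow listed above, the sketch below builds a crawl request and serializes it into the bytes a Kafka producer would send. The field names (url, appid, crawlid, maxdepth) follow the project's documented crawl API as best understood here. The helper names build_crawl_request and encode_request are hypothetical and not part of Scrapy Cluster, so check the request schema shipped with your version before relying on this.

```python
import json
import uuid


def build_crawl_request(url, appid, crawlid=None, maxdepth=0):
    """Build a crawl request dict (hypothetical helper, not a Scrapy Cluster API).

    Assumption: url, appid and crawlid are the required fields and
    maxdepth is optional, per the project's crawl API documentation.
    """
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"url must be absolute: {url!r}")
    return {
        "url": url,
        "appid": appid,
        # Generate an id when the caller does not supply one.
        "crawlid": crawlid or uuid.uuid4().hex,
        "maxdepth": maxdepth,
    }


def encode_request(request):
    """Serialize the request to UTF-8 JSON bytes, the form a Kafka producer sends."""
    return json.dumps(request, sort_keys=True).encode("utf-8")


if __name__ == "__main__":
    req = build_crawl_request("http://example.com", appid="testapp", crawlid="abc123")
    print(encode_request(req).decode("utf-8"))
```

To actually submit the request, these bytes would be published to the cluster's incoming Kafka topic with any Kafka client. The topic name depends on the deployment's settings and is not specified in this listing.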

Categories:

Distributed Crawling

Features:

Distributed Crawling, Job Queue, Deduplication, REST API, Real-Time Streaming, Open Source

Tags:

free, open-source

Stars: 1.2K
Forks: 322
Last commit: 2 years ago
License: MIT
Language: Python

Similar to Scrapy Cluster

Scrapy

Fast, high-level web crawling framework for Python

Stars: 59.5K
Forks: 11.2K
Last commit: 5 months ago

Same category, Open source, Stronger repo

Scrapy is an open-source Python framework for building fast, scalable web crawlers that extract structured data from websites efficiently.

Scrapy Redis

Redis-based distributed components for Scrapy

Stars: 5.6K
Forks: 1.6K
Last commit: 2 years ago

Same category, Open source, Stronger repo

Scrapy-Redis provides Redis-backed components for Scrapy, enabling distributed crawling with shared request queues and item pipelines.

Heritrix

Web-scale archival web crawler

Stars: 3.2K
Forks: 782
Last commit: 6 months ago

Same category, Open source, Stronger repo

Heritrix is the Internet Archive's open-source, extensible, archival-quality web crawler designed for large-scale web preservation and data collection.