Nutch

Apache Nutch is a highly extensible, production-ready web crawler built on Hadoop for large-scale batch crawling and data acquisition tasks.

Nutch runs its crawls as batch jobs on Apache Hadoop, so the work can be distributed across a cluster and scaled out by adding nodes.

Key Features:

  • Hadoop Integration — Built on Hadoop for distributed crawling across large clusters
  • Pluggable Architecture — Extensible parsing, indexing, and scoring via plugins
  • Scalable — Handles web-scale crawling with horizontal scaling across nodes
  • Search Integration — Native integration with Apache Solr and Elasticsearch for indexing
  • Mature — Battle-tested in production environments for over two decades
  • Fine-Grained Config — Configurable URL filtering, fetch scheduling, and politeness

Whether you're building search engines, creating web archives, or running large-scale data acquisition, Apache Nutch provides a proven, extensible crawler for distributed environments.

  • Stars: 3.1K
  • Forks: 1.3K
  • Last commit: 2 months ago
  • License: Apache-2.0
  • Language: Java