Heritrix

Heritrix is the Internet Archive's open-source, extensible, archival-quality web crawler designed for large-scale web preservation and data collection.

Key Features:

  • Web-Scale Crawling — Designed to crawl billions of pages for web archival projects
  • WARC Output — Produces standard WARC files for long-term web preservation
  • Fine-Grained Control — Extensive configuration for crawl scope, politeness, and prioritization
  • Web UI — Browser-based interface for monitoring and controlling active crawls
  • Docker Support — Run Heritrix in containers for reproducible crawl environments
  • Extensible — Plugin system for custom processors, extractors, and modules

Whether you're preserving web content for research, building large-scale web archives, or conducting comprehensive domain crawls, Heritrix provides production-grade crawling trusted by the Internet Archive.
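To make the WARC Output feature listed above more concrete, here is a minimal sketch of the general layout of a single WARC record: a version line, named header fields, a blank line, the payload, and a trailing blank-line pair. This is not Heritrix code and does not reproduce its output exactly; the function name `build_warc_resource_record` and the example URL are hypothetical, chosen only for illustration.

```python
import uuid
from datetime import datetime, timezone


def build_warc_resource_record(target_uri: str, payload: bytes,
                               content_type: str = "text/plain") -> bytes:
    """Assemble one WARC/1.0 'resource' record as bytes (illustrative only)."""
    headers = [
        ("WARC-Type", "resource"),
        # Record IDs are URIs wrapped in angle brackets; a random UUID URN is used here.
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        # Timestamp in UTC, formatted as YYYY-MM-DDThh:mm:ssZ.
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        # Content-Length counts only the payload bytes, not the headers.
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Each record ends with two CRLF pairs after the payload.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


if __name__ == "__main__":
    record = build_warc_resource_record("http://example.com/", b"hello")
    print(record.decode("utf-8"))
```

A real crawl would typically write many such records (requests, responses, metadata) into compressed `.warc.gz` files; the sketch only shows why the format suits long-term preservation: each record is self-describing plain text plus the captured bytes.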

Repository:

  • Stars: 3.2K
  • Forks: 782
  • Last commit: 3 months ago
  • License: NOASSERTION
  • Language: Java
