
Heritrix is an extensible, archival-quality web crawler built by the Internet Archive for large-scale web preservation and data collection.
Whether you're preserving web content for research, building large-scale web archives, or conducting comprehensive domain crawls, Heritrix provides production-grade crawling trusted by the Internet Archive.
Stars: 3.2K
Forks: 782
Last commit: 3 months ago
License: NOASSERTION
Language: Java