Heritrix is the Internet Archive's open-source, extensible, archival-quality web crawler designed for large-scale web preservation and data collection.
Nodriver is the successor to undetected-chromedriver, providing fast CDP-based browser automation with built-in anti-detection and no Selenium dependency.
Apache Hudi is an open data lakehouse platform that enables efficient record-level upserts, incremental processing, and ACID transactions on data lakes.
BullMQ is an open-source, Redis-based message queue with libraries for Node.js, Python, and more, trusted by thousands of companies processing billions of jobs daily.
Browserless provides Docker-based headless browser infrastructure for web scraping and automation, with built-in bot-detection bypass and CAPTCHA solving.