webscraping.app
Data Processing
Browse 5 subcategories of Data Processing tools and find the perfect solution for your needs.
Workflow Orchestration
9 tools
Task Queues
5 tools
ETL Tools
9 tools
Data Transformation
5 tools
Distributed Crawling
6 tools
AWS Athena
Serverless SQL queries on S3. Query scraped data files directly without loading. Pay per query.
ETL Tools
AWS Glue
Serverless ETL service. Data catalog, crawlers, Spark-based transforms. Native S3 and Redshift integration.
ETL Tools
Scrapy Cluster
A Scrapy project that uses Redis and Kafka to create a distributed, on-demand scraping cluster.
Distributed Crawling
Frontera
A scalable crawl frontier framework for web crawlers.
Distributed Crawling
Meltano
Code-first ELT engine built on Singer. GitOps workflows, 300+ connectors, integrates with dbt.
ETL Tools
Nutch
Apache Nutch is an extensible and scalable web crawler.
Distributed Crawling
Heritrix
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Distributed Crawling
Inngest
The leading workflow orchestration platform. Run stateful step functions and AI workflows on serverless, servers, or the edge.
Task Queues
dlt (data load tool)
Lightweight Python library for data loading. Auto schema inference, 5000+ sources supported.
Data Transformation
ETL Tools
Scrapy Redis
Redis-based components for Scrapy.
Distributed Crawling
Huey
Tiny but feature-rich Python task queue with Redis or SQLite storage.
Task Queues
BullMQ
Message queue and batch processing for Node.js, Python, Elixir, and PHP, based on Redis.
Task Queues
Mage AI
Mix Python, SQL, and R in the same pipeline.
Workflow Orchestration
RQ (Redis Queue)
Simple, lightweight Python job queue backed by Redis.
Task Queues
dbt
Transform data in your warehouse using SQL. Version control, testing, documentation for data models.
Data Transformation
ETL Tools
Great Expectations
Data quality testing with 'Expectations'. Validate scraped data, auto-generate docs, CI/CD integration.
Data Transformation
Dagster
An orchestration platform for the development, production, and observation of data assets.
ETL Tools
Workflow Orchestration
Windmill
Fast execution, with both code and a visual builder.
Workflow Orchestration
Temporal
Durable execution platform for long-running, fault-tolerant workflows.
Workflow Orchestration
Luigi
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, and visualization, and comes with built-in Hadoop support.
Workflow Orchestration
Airbyte
Open-source ELT platform with 600+ connectors. Move data from APIs, databases, files to warehouses and lakes.
ETL Tools
Prefect
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
ETL Tools
Workflow Orchestration
Kestra
Event-driven orchestration and scheduling platform for mission-critical applications.
Workflow Orchestration
Pydantic
Data validation using Python type hints. Rust-powered core for speed. Define schemas for scraped data.
Data Transformation
Celery
Distributed task queue for Python.
Task Queues
Polars
Lightning-fast DataFrame library in Rust. 10-100x faster than pandas. Lazy evaluation, out-of-core processing.
Data Transformation
Apache Airflow
A platform to programmatically author, schedule, and monitor workflows.
ETL Tools
Workflow Orchestration
Scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
Distributed Crawling
Scraping Frameworks
n8n
Low-code/no-code workflow automation with a large community.
Workflow Orchestration
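
Several entries above (Pydantic, Great Expectations) concern validating scraped records against a schema before loading them. As a rough illustration of that idea only, here is a minimal sketch using Python's standard-library dataclasses as a stand-in; it is not Pydantic's API. The `ScrapedItem` schema, its field names, and the `validate` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ScrapedItem:
    # Hypothetical schema for one scraped product record.
    url: str
    title: str
    price: float


# Hypothetical per-field converters; each raises on values it cannot coerce.
_CONVERTERS: dict[str, Callable[[Any], Any]] = {
    "url": str,
    "title": str,
    "price": float,
}


def validate(raw: dict[str, Any]) -> ScrapedItem:
    """Coerce a raw dict into a ScrapedItem, raising ValueError on bad data."""
    values = {}
    for name, convert in _CONVERTERS.items():
        if name not in raw:
            raise ValueError(f"missing field: {name}")
        try:
            values[name] = convert(raw[name])
        except (TypeError, ValueError) as exc:
            raise ValueError(f"bad value for {name}: {raw[name]!r}") from exc
    return ScrapedItem(**values)


if __name__ == "__main__":
    record = {"url": "https://example.com/p/1", "title": "Widget", "price": "19.99"}
    print(validate(record))
```

A dedicated validation library typically adds nested models, richer error reports, and stricter type handling, so this sketch is best read as the shape of the problem rather than a replacement for those tools.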