Apache Hudi

Published Jun 12, 2026. Last updated Jul 13, 2026.

Apache Hudi is an open data lakehouse platform that enables efficient record-level upserts, incremental processing, and ACID transactions on data lakes.

Sources

github.com

Verified against the apache/hudi repository and the PyPI package.

Apache Hudi is an open-source lakehouse platform that replaces full batch reprocessing with efficient incremental data pipelines.

Key Features:

Record-Level Upserts - Quickly update and delete individual records with fast, pluggable indexing.
Incremental Processing - Replace batch pipelines with incremental pipelines that process only the records changed since the last run, for up to 10x faster data processing.
ACID Transactions - Guarantee atomic writes with snapshot isolation tailored for lake-scale operations.
Time Travel - Query historical data, audit changes, and roll back to previous table versions.
Multi-Engine Integration - Works with Spark, Flink, Presto, Trino, and Hive, and integrates with dbt for transformations.
Streaming Ingestion - Ingest from Kafka, Pulsar, and CDC sources with built-in deduplication.
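
As a rough illustration of how the upsert, incremental, and time travel features above are typically driven from Spark, the sketch below only builds the option dictionaries that would be passed to the Hudi datasource. The table name, column names, path, and instant values are hypothetical. The option keys are the ones documented for Hudi's Spark datasource, but they should be checked against the Hudi version in use. No Spark session is started here.

```python
# Minimal sketch: option dictionaries for Hudi's Spark datasource.
# Table/column names and instants below are hypothetical placeholders.


def upsert_options(table: str, record_key: str, precombine: str, partition: str) -> dict:
    """Options for a record-level upsert write."""
    return {
        "hoodie.table.name": table,
        "hoodie.datasource.write.operation": "upsert",
        # Field that uniquely identifies a record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # Field used to pick the latest version when keys collide (deduplication).
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.datasource.write.partitionpath.field": partition,
    }


def incremental_options(begin_instant: str) -> dict:
    """Options for reading only records committed after begin_instant."""
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }


def time_travel_options(as_of: str) -> dict:
    """Options for querying the table as of a past instant."""
    return {"as.of.instant": as_of}


write_opts = upsert_options("trips", "trip_id", "updated_at", "city")
incr_opts = incremental_options("20260101000000")
tt_opts = time_travel_options("2026-01-01 00:00:00")

# With a Spark session that has the Hudi bundle on its classpath (assumed, not shown):
#   df.write.format("hudi").options(**write_opts).mode("append").save(base_path)
#   spark.read.format("hudi").options(**incr_opts).load(base_path)
#   spark.read.format("hudi").options(**tt_opts).load(base_path)
```

Keeping the options in plain dictionaries makes them easy to review and reuse across the write, incremental read, and time travel paths.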

Whether you're building real-time analytics, managing CDC pipelines, or modernizing batch ETL, Apache Hudi delivers efficient lakehouse storage with minute-level freshness.

Categories:

Analytics Databases Data Lakehouse

Features:

Open Source, Batch Processing, Real-Time Streaming, ETL Pipeline, Deduplication, ACID Transactions, Time Travel

Tags:

free java open-source python