8 projects
Big Data Integration Platforms
Comprehensive frameworks that package, deploy, test, and maintain integrated big data ecosystems with multiple components working together. These platforms simplify the deployment and management of distributed big data technologies across clusters.
25,199 contributors
$2.5B
OpenSearch
The OpenSearch Software Foundation raises, budgets, and spends funds in support of open source, open data, and open standards projects related to open source search and analysis solutions.
14,614
1,871
$571M
Apache Hudi
Apache Hudi is a data lake platform that provides streaming data ingestion and bulk data management capabilities. It enables atomic updates, record-level change streams, and incremental data processing on large analytical datasets stored in data lakes. The platform supports ACID transactions, efficient upserts, and real-time analytics while maintaining data quality and consistency.
3,034
270
$23M
Apache SeaTunnel
Apache SeaTunnel is a distributed data integration platform that enables high-performance data synchronization between various data sources and destinations. It provides a unified pipeline for real-time and batch data transfer, supporting multiple data systems like databases, messaging systems, and file storage, with features for data transformation and processing.
2,465
170
$20M
Apache Hadoop
Apache Hadoop is a distributed computing framework that enables processing and storage of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, with each offering local computation and storage.
2,314
274
$190M
YTsaurus
YTsaurus is a distributed storage and processing platform designed for managing large-scale data. It provides a comprehensive suite of tools for data organization, processing, and analysis, supporting features like distributed execution, data replication, and resource management across clusters.
1,508
29
$1.6B
Apache Kyuubi
Apache Kyuubi is a distributed multi-tenant service that provides high-performance SQL query capabilities and resource management for big data workloads. It offers a unified gateway for accessing data lakes through various engines like Apache Spark, enabling secure, scalable, and highly available data processing.
974
108
$8.2M
HPCC Systems Platform
HPCC Systems Platform is an open-source, enterprise-grade big data analytics platform that processes and analyzes massive data sets across parallel computing clusters. It provides an end-to-end data lake management solution with built-in ETL capabilities, high-performance distributed computing, and a declarative programming language called ECL.
290
9
$89M
Bigtop
Bigtop is an Apache Software Foundation project for infrastructure engineers and data scientists who need comprehensive packaging, testing, and configuration of the leading open source big data components.