28 projects
ClickHouse
ClickHouse is an open-source column-oriented database management system that enables real-time analytics using SQL queries. It is designed for high performance on large datasets, featuring fast data ingestion, efficient compression, and parallel processing capabilities.
11,770
1,654
$105M
Polars
Polars is a high-performance DataFrame library implemented in Rust, offering fast data manipulation and analysis capabilities with a Python API. It features a query optimizer, parallel execution, and efficient memory usage through Arrow columnar format.
5,776
1,184
$23M
The Presto Foundation Fund
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.
5,422
746
$2.1B
Trino
Trino is a distributed SQL query engine designed to query large data sets distributed across multiple heterogeneous data sources. It enables fast, interactive analytics across diverse data sources including Hadoop, object stores, relational databases, and other systems.
5,214
733
$69M
Apache Beam
Apache Beam is a unified programming model and framework for building and executing batch and streaming data processing pipelines. It provides a portable API that enables developers to write data processing code once and run it on various execution engines like Apache Spark, Apache Flink, and Google Cloud Dataflow.
4,769
635
$95M
Dask
Dask is a flexible parallel computing library for analytics that provides dynamic task scheduling optimized for computation and integrates with Python data science libraries like NumPy, pandas, and scikit-learn. It enables parallel and distributed computing through intuitive APIs and scales Python code from multi-core machines to clusters.
3,588
905
$6.9M
Apache Hudi
Apache Hudi is a data lake platform that provides streaming data ingestion and bulk data management capabilities. It enables atomic updates, record-level change streams, and incremental data processing on large analytical datasets stored in data lakes. The platform supports ACID transactions, efficient upserts, and real-time analytics while maintaining data quality and consistency.
3,060
275
$24M
Hazelcast
Hazelcast is an open-source distributed computing platform that provides in-memory data storage and processing capabilities. It offers features like distributed caching, distributed data structures, distributed computing, and clustering for building scalable applications.
2,979
462
$64M
Apache DataFusion
Apache DataFusion is a fast, extensible query execution framework written in Rust that enables efficient processing of large-scale data using SQL. It provides a modular architecture for building high-performance data processing systems and analytics applications, with support for various data sources and formats.
2,557
602
$23M
Apache Hadoop
Apache Hadoop is a distributed computing framework that enables processing and storage of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, with each offering local computation and storage.
2,331
270
$189M
Apache Hive
Apache Hive is a data warehouse software project built on top of Apache Hadoop that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. It provides a mechanism to project structure onto data and query it using HQL (Hive Query Language), a SQL-like language.
1,503
148
$96M
Apache HBase
Apache HBase is a distributed, scalable, big data store designed to provide quick random access to huge amounts of structured data. It is a NoSQL database that runs on top of Hadoop HDFS, offering real-time read/write access to large datasets and supporting high-throughput applications.
1,477
128
$42M
sparklyr
R interface for Apache Spark.
1,294
143
$2.1M
Apache Paimon
Apache Paimon is a streaming data lake platform that provides real-time data lake storage and management capabilities. It enables unified batch and streaming data processing with ACID transactions, schema evolution, and time travel features while maintaining high performance for both streaming and batch workloads.
1,042
99
$22M
Vespa
Vespa is an open source big data serving engine that enables storing, searching, ranking and organizing structured content at large scale. It provides real-time processing, advanced search capabilities, and machine learning-powered ranking for applications requiring low latency access to large datasets.
778
118
$71M
Scio
Scio is an Apache Beam-based Scala API for data processing pipelines, particularly focused on big data and analytics workloads. It provides a high-level, type-safe interface for building data pipelines that can run on various execution engines like Google Cloud Dataflow.
666
125
$3.8M
Apache Drill
Apache Drill is a distributed SQL query engine that enables data exploration and analytics across diverse data sources, including Hadoop, NoSQL databases, cloud storage, and local files. It provides schema-free SQL querying capabilities and allows users to analyze structured and semi-structured data without requiring predefined schemas.
634
101
$34M
Scalding
Scalding is a Scala library built on top of Cascading that makes it easy to write MapReduce jobs in a concise, type-safe way. It provides a domain-specific language for expressing complex data transformations and analytics on Hadoop.
576
121
$2.6M
Apache CarbonData
Apache CarbonData is a high-performance columnar data format and storage solution for big data analytics. It provides efficient data compression, indexing, and query optimization to improve processing speed of analytical workloads. The project enables fast interactive queries on petabyte scale data while supporting seamless integration with Hadoop ecosystems.
483
34
$14M
Apache Impala
Apache Impala is a massively parallel processing (MPP) SQL query engine for Apache Hadoop that provides high-performance, low-latency SQL queries on data stored in popular Apache Hadoop file formats. It enables users to query data using familiar SQL syntax, whether the data resides in HDFS, Apache HBase, or other data storage systems.
416
19
$46M
ListenBrainz
ListenBrainz is an open source music listening history service that allows users to track and share their music listening habits. It collects and stores information about what songs users listen to and provides analytics, recommendations, and insights based on listening data.
385
57
$9.6M
HPCC Systems Platform
HPCC Systems Platform is an open-source, enterprise-grade big data analytics computing platform that allows processing and analysis of massive data sets across parallel computing clusters. It provides a complete end-to-end data lake management solution with built-in ETL capabilities, high-performance distributed computing, and a declarative programming language called ECL.
290
10
$89M
AsterixDB
AsterixDB is a scalable, open-source Big Data Management System (BDMS) that provides storage, management, and query processing capabilities for large volumes of semi-structured data. It supports a rich data model (ADM) similar to JSON and offers a declarative query language (AQL) for efficient data manipulation and analytics.
175
6
$38M
Vineyard
Vineyard is an in-memory immutable data manager that provides out-of-the-box high-level abstraction and zero-copy in-memory sharing for distributed data in big data tasks, such as graph analytics, numerical computing, and machine learning.
149
39
$7.6M
PartD
PartD is a key-value byte store with a file-based cache, designed for parallel computing and big data applications. It provides fast writes of Python objects to disk through caching and buffering, making it particularly useful for parallel computing frameworks like Dask.
54
30
$84K
GenevaERS
GenevaERS is the single-pass optimization engine for data extraction and reporting on z/OS. Originally developed by PricewaterhouseCoopers as the Geneva Enterprise Reporting System and acquired by IBM, GenevaERS offers businesses a high-level reporting solution uniquely tuned for big data scanning and improved financial transparency for better decision-making. This project combines the processing power of GenevaERS, the reliability of the mainframe and the dynamic open source community.
36
10
$57M
TonY Project
The mission of the TonY Project is to design and implement an open source framework for running distributed deep learning jobs reliably on computing infrastructure.
24
1
Apache Spark Website