20 projects
ClickHouse
ClickHouse is an open-source column-oriented database management system that enables real-time analytics using SQL queries. It is designed for high performance on large datasets, featuring fast data ingestion, efficient compression, and parallel processing capabilities.
11,482
1,602
$102M
Polars
Polars is a high-performance DataFrame library implemented in Rust, offering fast data manipulation and analysis capabilities with a Python API. It features a query optimizer, parallel execution, and efficient memory usage through Arrow columnar format.
5,567
1,125
$22M
The Presto Foundation Fund
Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics; it approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.
5,382
749
$2B
Trino
Trino is a distributed SQL query engine designed to query large data sets distributed across multiple heterogeneous data sources. It enables fast, interactive analytics across diverse data sources including Hadoop, object stores, relational databases, and other systems.
5,117
724
$68M
Apache Beam
Apache Beam is a unified programming model and framework for building and executing batch and streaming data processing pipelines. It provides a portable API that enables developers to write data processing code once and run it on various execution engines like Apache Spark, Apache Flink, and Google Cloud Dataflow.
4,741
624
$94M
Dask
Dask is a flexible parallel computing library for analytics that provides dynamic task scheduling optimized for computation and integrates with Python data science libraries like NumPy, pandas, and scikit-learn. It enables parallel and distributed computing through intuitive APIs and scales Python code from multi-core machines to clusters.
3,564
900
$6.8M
Apache Hudi
Apache Hudi is a data lake platform that provides streaming data ingestion and bulk data management capabilities. It enables atomic updates, record-level change streams, and incremental data processing on large analytical datasets stored in data lakes. The platform supports ACID transactions, efficient upserts, and real-time analytics while maintaining data quality and consistency.
3,033
270
$23M
Hazelcast
Hazelcast is an open-source distributed computing platform that provides in-memory data storage and processing capabilities. It offers features like distributed caching, distributed data structures, distributed computing, and clustering for building scalable applications.
2,976
464
$63M
Apache DataFusion
Apache DataFusion is a fast, extensible query execution framework written in Rust that enables efficient processing of large-scale data using SQL. It provides a modular architecture for building high-performance data processing systems and analytics applications, with support for various data sources and formats.
2,378
558
$21M
Apache Hadoop
Apache Hadoop is a distributed computing framework that enables processing and storage of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, with each offering local computation and storage.
2,311
268
$190M
Apache Hive
Apache Hive is data warehouse software built on top of Apache Hadoop that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. It provides a mechanism to project structure onto data and query it using HiveQL (Hive Query Language, also abbreviated HQL), a SQL-like language.
1,500
149
$96M
Apache HBase
Apache HBase is a distributed, scalable, big data store designed to provide quick random access to huge amounts of structured data. It is a NoSQL database that runs on top of Hadoop HDFS, offering real-time read/write access to large datasets and supporting high-throughput applications.
1,478
131
$41M
sparklyr
R interface for Apache Spark.
1,297
139
$2.1M
Vespa
Vespa is an open-source big data serving engine for storing, searching, ranking, and organizing structured content at large scale. It provides real-time processing, advanced search capabilities, and machine-learning-powered ranking for applications that require low-latency access to large datasets.
765
115
$71M
ListenBrainz
ListenBrainz is an open-source music listening history service that lets users track and share their listening habits. It collects and stores information about the songs users listen to and provides analytics, recommendations, and insights based on that listening data.
342
58
$9.3M
HPCC Systems Platform
HPCC Systems Platform is an open-source, enterprise-grade big data analytics computing platform that allows processing and analysis of massive data sets across parallel computing clusters. It provides a complete end-to-end data lake management solution with built-in ETL capabilities, high-performance distributed computing, and a declarative programming language called ECL.
289
9
$89M
Vineyard
Vineyard is an in-memory immutable data manager that provides out-of-the-box high-level abstraction and zero-copy in-memory sharing for distributed data in big data tasks, such as graph analytics, numerical computing, and machine learning.
148
36
$7.6M
GenevaERS
GenevaERS is a single-pass optimization engine for data extraction and reporting on z/OS. Originally developed by PricewaterhouseCoopers as the Geneva Enterprise Reporting System and later acquired by IBM, GenevaERS offers businesses a high-level reporting solution tuned for big data scanning and improved financial transparency for better decision-making. This project combines the processing power of GenevaERS, the reliability of the mainframe, and a dynamic open-source community.
36
10
$57M
TonY Project
The mission of the TonY Project is to design and implement an open-source framework for running distributed deep learning jobs reliably on computing infrastructure.
24
1
Apache Spark