LFX Insights

Big Data Processing Frameworks

Tools for processing and analyzing extremely large and complex datasets.

28 projects

57,448 contributors

$3.2B

ClickHouse

ClickHouse is an open-source column-oriented database management system that enables real-time analytics using SQL queries. It is designed for high performance on large datasets, featuring fast data ingestion, efficient compression, and parallel processing capabilities.

Contributors

11,770

Organizations

1,654

Software value

$105M

Polars

Polars is a high-performance DataFrame library implemented in Rust, offering fast data manipulation and analysis capabilities with a Python API. It features a query optimizer, parallel execution, and efficient memory usage through Arrow columnar format.

Contributors

5,776

Organizations

1,184

Software value

$23M

The Presto Foundation Fund

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

Contributors

5,422

Organizations

746

Software value

$2.1B

Trino

Trino is a distributed SQL query engine designed to query large data sets distributed across multiple heterogeneous data sources. It enables fast, interactive analytics across diverse data sources including Hadoop, object stores, relational databases, and other systems.

Contributors

5,214

Organizations

733

Software value

$69M

Apache Beam

Apache Beam is a unified programming model and framework for building and executing batch and streaming data processing pipelines. It provides a portable API that enables developers to write data processing code once and run it on various execution engines like Apache Spark, Apache Flink, and Google Cloud Dataflow.

Contributors

4,769

Organizations

635

Software value

$95M

Dask

Dask is a flexible parallel computing library for analytics that provides dynamic task scheduling optimized for computation and integrates with Python data science libraries such as NumPy, pandas, and scikit-learn. It enables parallel and distributed computing through intuitive APIs and scales Python code from multi-core machines to clusters.

Contributors

3,588

Organizations

905

Software value

$6.9M

Apache Hudi

Apache Hudi is a data lake platform that provides streaming data ingestion and bulk data management capabilities. It enables atomic updates, record-level change streams, and incremental data processing on large analytical datasets stored in data lakes. The platform supports ACID transactions, efficient upserts, and real-time analytics while maintaining data quality and consistency.

Contributors

3,060

Organizations

275

Software value

$24M

Hazelcast

Hazelcast is an open-source distributed computing platform that provides in-memory data storage and processing capabilities. It offers features like distributed caching, distributed data structures, distributed computing, and clustering for building scalable applications.

Contributors

2,979

Organizations

462

Software value

$64M

Apache DataFusion

Apache DataFusion is a fast, extensible query execution framework written in Rust that enables efficient processing of large-scale data using SQL. It provides a modular architecture for building high-performance data processing systems and analytics applications, with support for various data sources and formats.

Contributors

2,557

Organizations

602

Software value

$23M

Apache Hadoop

Apache Hadoop is a distributed computing framework that enables processing and storage of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, with each offering local computation and storage.

Contributors

2,331

Organizations

270

Software value

$189M

Apache Hive

Apache Hive is a data warehouse software project built on top of Apache Hadoop that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. It provides a mechanism to project structure onto data and query it using HQL (Hive Query Language), a SQL-like language.

Contributors

1,503

Organizations

148

Software value

$96M

Apache HBase

Apache HBase is a distributed, scalable, big data store designed to provide quick random access to huge amounts of structured data. It is a NoSQL database that runs on top of Hadoop HDFS, offering real-time read/write access to large datasets and supporting high-throughput applications.

Contributors

1,477

Organizations

128

Software value

$42M

sparklyr

sparklyr is an R interface for Apache Spark.

Contributors

1,294

Organizations

143

Software value

$2.1M

Apache Paimon

Apache Paimon is a streaming data lake platform that provides real-time data lake storage and management capabilities. It enables unified batch and streaming data processing with ACID transactions, schema evolution, and time travel features while maintaining high performance for both streaming and batch workloads.

Contributors

1,042

Organizations

99

Software value

$22M

Vespa

Vespa is an open source big data serving engine that enables storing, searching, ranking and organizing structured content at large scale. It provides real-time processing, advanced search capabilities, and machine learning-powered ranking for applications requiring low latency access to large datasets.

Contributors

778

Organizations

118

Software value

$71M

Scio

Scio is an Apache Beam-based Scala API for data processing pipelines, particularly focused on big data and analytics workloads. It provides a high-level, type-safe interface for building data pipelines that can run on various execution engines like Google Cloud Dataflow.

Contributors

666

Organizations

125

Software value

$3.8M

Apache Drill

Apache Drill is a distributed SQL query engine that enables data exploration and analytics across diverse data sources, including Hadoop, NoSQL databases, cloud storage, and local files. It provides schema-free SQL querying capabilities and allows users to analyze structured and semi-structured data without requiring predefined schemas.

Contributors

634

Organizations

101

Software value

$34M

Scalding

Scalding is a Scala library built on top of Cascading that makes it easy to write MapReduce jobs in a concise, type-safe way. It provides a domain-specific language for expressing complex data transformations and analytics on Hadoop.

Contributors

576

Organizations

121

Software value

$2.6M

Apache CarbonData

Apache CarbonData is a high-performance columnar data format and storage solution for big data analytics. It provides efficient data compression, indexing, and query optimization to speed up analytical workloads. The project enables fast interactive queries on petabyte-scale data and integrates with the Hadoop ecosystem.

Contributors

483

Organizations

34

Software value

$14M

Apache Impala

Apache Impala is a massively parallel processing (MPP) SQL query engine for Apache Hadoop that provides high-performance, low-latency SQL queries on data stored in popular Apache Hadoop file formats. It enables users to query data using familiar SQL syntax, whether the data resides in HDFS, Apache HBase, or other data storage systems.

Contributors

416

Organizations

19

Software value

$46M

ListenBrainz

ListenBrainz is an open source music listening history service that allows users to track and share their music listening habits. It collects and stores information about what songs users listen to and provides analytics, recommendations, and insights based on listening data.

Contributors

385

Organizations

57

Software value

$9.6M

HPCC Systems Platform

HPCC Systems Platform is an open-source, enterprise-grade big data analytics computing platform that allows processing and analysis of massive data sets across parallel computing clusters. It provides a complete end-to-end data lake management solution with built-in ETL capabilities, high-performance distributed computing, and a declarative programming language called ECL.

Contributors

290

Organizations

10

Software value

$89M

AsterixDB

AsterixDB is a scalable, open-source Big Data Management System (BDMS) that provides storage, management, and query processing capabilities for large volumes of semi-structured data. It supports a rich data model (ADM) similar to JSON and offers a declarative query language (AQL) for efficient data manipulation and analytics.

Contributors

175

Organizations

6

Software value

$38M

Vineyard

Vineyard is an in-memory immutable data manager that provides out-of-the-box high-level abstraction and zero-copy in-memory sharing for distributed data in big data tasks, such as graph analytics, numerical computing, and machine learning.

Contributors

149

Organizations

39

Software value

$7.6M

PartD

PartD is a key-value byte store with a file-based cache, designed for parallel computing and big data applications. It provides fast writes of Python objects to disk through caching and buffering, making it particularly useful for parallel computing frameworks like Dask.

Contributors

54

Organizations

30

Software value

$84K

GenevaERS

GenevaERS is a single-pass optimization engine for data extraction and reporting on z/OS. Originally developed by PricewaterhouseCoopers as the Geneva Enterprise Reporting System and later acquired by IBM, it offers businesses a high-level reporting solution tuned for big data scanning and improved financial transparency. The project combines the processing power of GenevaERS, the reliability of the mainframe, and a dynamic open source community.

Contributors

36

Organizations

10

Software value

$57M

Archived

TonY Project

The TonY Project's mission is to design and implement an open source framework for running distributed deep learning jobs reliably on computing infrastructure.

Contributors

24

Organizations

1

Apache Spark Website

This project hasn't been onboarded to LFX Insights.