LFX Insights

Big Data Processing Frameworks

Tools for processing and analyzing extremely large and complex datasets.

20 projects, 52,430 contributors, $3B software value

ClickHouse

ClickHouse is an open-source column-oriented database management system that enables real-time analytics using SQL queries. It is designed for high performance on large datasets, featuring fast data ingestion, efficient compression, and parallel processing capabilities.

Contributors: 11,482
Organizations: 1,602
Software value: $102M

Polars

Polars is a high-performance DataFrame library implemented in Rust, offering fast data manipulation and analysis capabilities with a Python API. It features a query optimizer, parallel execution, and efficient memory usage through Arrow columnar format.

Contributors: 5,567
Organizations: 1,125
Software value: $22M

The Presto Foundation Fund

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

Contributors: 5,382
Organizations: 749
Software value: $2B

Trino

Trino is a distributed SQL query engine designed to query large data sets distributed across multiple heterogeneous data sources. It enables fast, interactive analytics across diverse data sources including Hadoop, object stores, relational databases, and other systems.

Contributors: 5,117
Organizations: 724
Software value: $68M

Apache Beam

Apache Beam is a unified programming model and framework for building and executing batch and streaming data processing pipelines. It provides a portable API that enables developers to write data processing code once and run it on various execution engines like Apache Spark, Apache Flink, and Google Cloud Dataflow.

Contributors: 4,741
Organizations: 624
Software value: $94M

Dask

Dask is a flexible parallel computing library for analytics that provides dynamic task scheduling optimized for computation and integrates with Python data science libraries such as NumPy, pandas, and scikit-learn. It enables parallel and distributed computing through intuitive APIs and scales Python code from multi-core machines to clusters.

Contributors: 3,564
Organizations: 900
Software value: $6.8M

Apache Hudi

Apache Hudi is a data lake platform that provides streaming data ingestion and bulk data management capabilities. It enables atomic updates, record-level change streams, and incremental data processing on large analytical datasets stored in data lakes. The platform supports ACID transactions, efficient upserts, and real-time analytics while maintaining data quality and consistency.

Contributors: 3,033
Organizations: 270
Software value: $23M

Hazelcast

Hazelcast is an open-source distributed computing platform that provides in-memory data storage and processing capabilities. It offers features like distributed caching, distributed data structures, distributed computing, and clustering for building scalable applications.

Contributors: 2,976
Organizations: 464
Software value: $63M

Apache DataFusion

Apache DataFusion is a fast, extensible query execution framework written in Rust that enables efficient processing of large-scale data using SQL. It provides a modular architecture for building high-performance data processing systems and analytics applications, with support for various data sources and formats.

Contributors: 2,378
Organizations: 558
Software value: $21M

Apache Hadoop

Apache Hadoop is a distributed computing framework that enables processing and storage of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, with each offering local computation and storage.

Contributors: 2,311
Organizations: 268
Software value: $190M

Apache Hive

Apache Hive is a data warehouse software project built on top of Apache Hadoop that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. It provides a mechanism to project structure onto data and query it using HQL (Hive Query Language), a SQL-like language.

Contributors: 1,500
Organizations: 149
Software value: $96M

Apache HBase

Apache HBase is a distributed, scalable, big data store designed to provide quick random access to huge amounts of structured data. It is a NoSQL database that runs on top of Hadoop HDFS, offering real-time read/write access to large datasets and supporting high-throughput applications.

Contributors: 1,478
Organizations: 131
Software value: $41M

sparklyr

sparklyr is an R interface for Apache Spark.

Contributors: 1,297
Organizations: 139
Software value: $2.1M

Vespa

Vespa is an open source big data serving engine that enables storing, searching, ranking and organizing structured content at large scale. It provides real-time processing, advanced search capabilities, and machine learning-powered ranking for applications requiring low latency access to large datasets.

Contributors: 765
Organizations: 115
Software value: $71M

ListenBrainz

ListenBrainz is an open source music listening history service that allows users to track and share their music listening habits. It collects and stores information about what songs users listen to and provides analytics, recommendations, and insights based on listening data.

Contributors: 342
Organizations: 58
Software value: $9.3M

HPCC Systems Platform

HPCC Systems Platform is an open-source, enterprise-grade big data analytics computing platform that allows processing and analysis of massive data sets across parallel computing clusters. It provides a complete end-to-end data lake management solution with built-in ETL capabilities, high-performance distributed computing, and a declarative programming language called ECL.

Contributors: 289
Organizations: 9
Software value: $89M

Vineyard

Vineyard is an in-memory immutable data manager that provides out-of-the-box high-level abstraction and zero-copy in-memory sharing for distributed data in big data tasks, such as graph analytics, numerical computing, and machine learning.

Contributors: 148
Organizations: 36
Software value: $7.6M

GenevaERS

GenevaERS is the single-pass optimization engine for data extraction and reporting on z/OS. Originally developed by PricewaterhouseCoopers as the Geneva Enterprise Reporting System and acquired by IBM, GenevaERS offers businesses a high-level reporting solution uniquely tuned for big data scanning and improved financial transparency for better decision-making. This project combines the processing power of GenevaERS, the reliability of the mainframe and the dynamic open source community.

Contributors: 36
Organizations: 10
Software value: $57M

TonY Project

The TonY Project's mission is to design and implement an open source framework for running distributed deep learning jobs reliably on computing infrastructure.

Contributors: 24
Organizations: 1

Apache Spark

This project hasn't been onboarded to LFX Insights.