25 projects
Apache Spark
Apache Spark is a unified analytics engine for large-scale data processing, providing high-level APIs in Scala, Java, Python, and R, along with modules for SQL, streaming, machine learning, and graph processing.
9,131
1,283
$82M
Airbyte
Airbyte is an open-source data integration platform that helps users replicate data from applications, APIs, and databases to data warehouses, lakes, and other destinations. It provides a large collection of pre-built connectors and allows users to build custom ones, enabling automated data synchronization and ETL workflows.
8,620
1,251
$113M
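The cursor-based incremental replication that such connectors automate can be sketched in a few lines of plain Python. This is a concept sketch only, not Airbyte's connector API; the function and field names (`incremental_sync`, `updated_at`) are illustrative:

```python
# Concept sketch of cursor-based incremental replication, the pattern
# Airbyte-style connectors implement per source. Not Airbyte's API.

def incremental_sync(source_rows, destination, state):
    """Copy only rows whose cursor value is newer than the saved state."""
    cursor = state.get("cursor")
    new_rows = [r for r in source_rows
                if cursor is None or r["updated_at"] > cursor]
    for row in new_rows:
        destination.append(row)          # stand-in for a warehouse write
    if new_rows:
        # Persist the high-water mark so the next run skips old rows.
        state["cursor"] = max(r["updated_at"] for r in new_rows)
    return state
```

Running the sync twice against the same source copies each row only once, because the saved cursor filters out rows already replicated.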
Logstash
Logstash is a server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a destination of choice. It is commonly used to collect logs and other time-series data for search, analysis and visualization in Elasticsearch.
5,696
1,297
$5.4M
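A minimal pipeline of the kind described might look like the following config sketch (the port, grok pattern, and index name are illustrative, not from any particular deployment):

```
input {
  beats { port => 5044 }
}
filter {
  grok { match => { "message" => "%{COMBINEDAPACHELOG}" } }
  date { match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ] }
}
output {
  elasticsearch { hosts => ["localhost:9200"] index => "weblogs-%{+YYYY.MM.dd}" }
}
```

The three stages mirror the description: ingest (input), transform (filter), and ship to a destination (output), here Elasticsearch.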
dbt Core
dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
4,004
614
$8.6M
Dagster
Dagster is an open-source data orchestration framework that lets you define, test, and orchestrate data pipelines using Python code. It provides tools for building, testing, and monitoring data workflows while emphasizing software engineering best practices like modularity, testability, and gradual typing.
3,962
711
$79M
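The core idea, running pipeline steps in dependency order, can be sketched with the standard library alone. This illustrates the orchestration concept, not Dagster's actual API (which uses decorators such as `@asset`); all names below are made up for the example:

```python
# Concept sketch of DAG-based orchestration using only the stdlib.
from graphlib import TopologicalSorter

def run_pipeline(tasks, deps):
    """Execute callables in dependency order; return (order, results)."""
    order = list(TopologicalSorter(deps).static_order())
    results = {}
    for name in order:
        results[name] = tasks[name](results)   # each step sees upstream output
    return order, results

tasks = {
    "extract": lambda r: [1, 2, 3],                        # pretend source rows
    "transform": lambda r: [x * 10 for x in r["extract"]],
    "load": lambda r: len(r["transform"]),                 # pretend row count
}
deps = {"transform": {"extract"}, "load": {"transform"}}
order, results = run_pipeline(tasks, deps)
```

An orchestrator adds scheduling, retries, observability, and typed inputs/outputs on top of this skeleton, which is what makes the "software engineering best practices" framing apt.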
Apache SeaTunnel
Apache SeaTunnel is a distributed data integration platform that enables high-performance data synchronization between various data sources and destinations. It provides a unified pipeline for real-time and batch data transfer, supporting multiple data systems like databases, messaging systems, and file storage, with features for data transformation and processing.
2,521
173
$20M
Apache NiFi
Apache NiFi is an enterprise data flow management and automation platform that enables organizations to reliably process, route, transform and distribute data between diverse systems. It provides a web-based interface for designing, controlling and monitoring data flows, with features for data provenance, security, extensibility and real-time control.
1,876
229
$44M
Debezium
Debezium is an open source distributed platform for change data capture (CDC). It captures row-level changes in databases like MySQL, PostgreSQL, MongoDB, and others, and streams them to applications in real-time. This enables event-driven architectures, data replication, and microservices integration.
1,656
270
$14M
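A downstream consumer applies Debezium's row-level events keyed on the operation type. The envelope fields below (`op`, `before`, `after`) follow Debezium's documented event shape; the in-memory table and apply logic are a simplified sketch, not part of Debezium itself:

```python
# Sketch of applying Debezium-style change events to an in-memory table.
# Event shape (op/before/after) follows Debezium's envelope; the rest is
# illustrative.

def apply_change(table, event, key="id"):
    """Apply one row-level change event to a dict keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):        # create, snapshot read, update
        row = event["after"]
        table[row[key]] = row
    elif op == "d":                  # delete: only "before" is populated
        table.pop(event["before"][key], None)
    return table
```

Replaying such a stream from the beginning reconstructs the source table, which is why CDC underpins replication and event-driven integration.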
Apache DevLake
Apache DevLake is an open-source dev data platform to ingest, analyze, and visualize the fragmented data from DevOps tools, extracting insights for engineering excellence, developer experience, and community growth.
1,046
200
$9.4M
Mage AI
🧙 Build, run, and manage data pipelines for integrating and transforming data.
978
132
$24M
Pentaho Data Integration
Pentaho Data Integration (PDI), also known as Kettle, is an open source ETL (Extract, Transform, Load) tool that enables users to design and implement data integration workflows. It provides a graphical interface for creating data pipelines, transforming data between different formats and systems, and automating data movement processes.
913
51
$52M
Apache Hop
Apache Hop (the Hop Orchestration Platform) is an open-source data orchestration and data engineering platform that lets users design workflows and pipelines visually and run them on its native engine or on Apache Beam runners.
720
53
$44M
CDAP
CDAP is an open source framework for building and deploying data analytic applications, providing a unified interface over big data infrastructure such as Hadoop and Spark.
660
33
$24M
Meltano
Meltano is an open source ELT (Extract, Load, Transform) platform that helps organizations integrate and manage their data pipelines. It provides a command-line interface and web UI for orchestrating data workflows, managing configurations, and connecting various data tools and services.
419
83
$3.5M
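Meltano projects are driven by a declarative project file; a minimal `meltano.yml` along these lines declares an extractor and a loader, which a command such as `meltano run tap-postgres target-jsonl` would then execute as a pipeline (the plugin names here are illustrative examples):

```yaml
# meltano.yml (illustrative sketch; plugin names are examples)
version: 1
default_environment: dev
plugins:
  extractors:
    - name: tap-postgres
  loaders:
    - name: target-jsonl
```

The extractor/loader split reflects the ELT model named in the description: raw data is loaded first and transformed downstream.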
BigQuery ETL
SQL queries and tooling for ETL pipelines in Google BigQuery.
366
31
$19M
Cumulus Framework
The Cumulus Framework and Cumulus API: a cloud-based framework for ingesting, archiving, distributing, and managing Earth science data.
252
33
$17M
Tapdata
Tapdata is a live data platform that uses change data capture (CDC) to replicate and centralize data from operational databases in real time.
151
9
$18M
Instill Core
Instill Core is an open-source MLOps platform that provides infrastructure for building and deploying AI applications. It enables integration of various AI models and data sources through a unified API and pipeline system.
138
26
$2.1M
Tenzir
Tenzir is a high-performance data processing engine that enables real-time analysis and transformation of large-scale network and security data. It provides a unified platform for collecting, enriching, and analyzing diverse data sources with a focus on network security and observability.
134
35
$7.7M
Stroom
Stroom is a highly scalable data storage, processing and analysis platform.
68
8
$50M
Cloud Dataflow Templates
Google-provided Cloud Dataflow templates for common in-cloud data movement and processing tasks.
FIWARE Cygnus
A connector that persists FIWARE context data into third-party databases and storage systems, creating a historical view of the context.
stream-reactor
A collection of open source, Apache 2.0-licensed Kafka Connect connectors maintained by Lenses.io.