15 projects
Airbyte
Airbyte is an open-source data integration platform that helps users replicate data from applications, APIs, and databases to data warehouses, lakes, and other destinations. It provides a large collection of pre-built connectors and allows users to build custom ones, enabling automated data synchronization and ETL workflows.
8,550
1,242
$109M
LlamaIndex
LlamaIndex is an open-source data framework for building LLM applications, providing tools to connect custom data sources to large language models, with features for data ingestion, structuring, retrieval, and natural language querying.
8,409
1,081
$61M
Geospatial Data Abstraction Library (GDAL)
GDAL (Geospatial Data Abstraction Library) is a translator library for raster and vector geospatial data formats that provides a unified abstract data model and API for accessing and manipulating geographic data. It supports reading, writing, and processing of a wide variety of geospatial file formats.
3,108
625
$87M
MapStruct
MapStruct is a Java annotation processor that automates the generation of type-safe bean mapping code, reducing the need to write manual object transformations between Java bean types. It generates readable and performant code for converting between different object models at compile time.
2,608
374
$4.6M
Apache SeaTunnel
Apache SeaTunnel is a distributed data integration platform that enables high-performance data synchronization between various data sources and destinations. It provides a unified pipeline for real-time and batch data transfer, supporting multiple data systems like databases, messaging systems, and file storage, with features for data transformation and processing.
2,465
170
$20M
OpenRefine
OpenRefine is an open-source tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with web services and external data. It allows users to explore large data sets, fix inconsistencies, reconcile and match data to databases like Wikidata, and transform data into different formats for further use.
1,639
311
$22M
Pentaho Data Integration
Pentaho Data Integration (PDI), also known as Kettle, is an open-source ETL (Extract, Transform, Load) tool that enables users to design and implement data integration workflows. It provides a graphical interface for creating data pipelines, transforming data between different formats and systems, and automating data movement processes.
928
49
$52M
Spring Data Commons
Spring Data Commons is a foundational library that provides shared infrastructure for Spring Data modules, offering core interfaces, annotations, and utilities for implementing data access patterns and object-relational mapping across different data stores. It standardizes basic CRUD operations, query derivation, and repository abstractions.
835
186
$2.8M
Node CSV
Node CSV is a comprehensive Node.js library for parsing, formatting, transforming and stringifying CSV data. It provides a collection of packages for working with CSV files, including modules for parsing CSV to arrays/objects, converting data to CSV format, and transforming CSV data streams.
633
219
$7.3M
Pentaho Platform
Pentaho Platform is an open-source business intelligence and data integration suite that provides data warehousing, reporting, analytics, data mining, and ETL (Extract, Transform, Load) capabilities. It offers a comprehensive platform for data-driven decision making, including tools for data visualization, dashboards, and enterprise reporting.
624
31
$19M
Keyv
Keyv is a simple key-value storage system with support for multiple backends, providing a consistent interface for caching and storing data across different storage adapters like Redis, MongoDB, MySQL, and others.
477
174
$531K
Apache Jena
Apache Jena is a free and open-source Java framework for building Semantic Web and Linked Data applications. It provides APIs for reading, processing, writing, and querying RDF data, along with support for OWL and SPARQL. The framework includes tools for working with RDF graphs, ontologies, and reasoning engines.
474
120
$26M
eemeli/yaml
eemeli/yaml is a JavaScript library for parsing and working with YAML data, providing a complete implementation of the YAML 1.2 specification with support for all common use cases and extensible APIs.
459
207
$670K
Comunica
Comunica is a modular JavaScript framework for querying Linked Data on the Web. It provides a flexible architecture for building SPARQL query engines that can operate over various data sources and interfaces, supporting both local and remote data access.
259
94
$5M
Microsoft Power Query Documentation
Public repository for Microsoft Power Query documentation. All content in this repository is published to learn.microsoft.com.