LFX Insights

Content Extraction Libraries

Libraries for extracting text and metadata from a wide range of file formats to enable content analysis and processing.

23 projects

12,652 contributors

$151M software value

pypdf

A pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files.

Contributors

2,193

Organizations

436

Software value

$2M

OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

Contributors

1,424

Organizations

223

Software value

$1.1M

NPOI

NPOI is a .NET library for reading and writing Microsoft Office formats like Excel, Word and PowerPoint files without requiring Microsoft Office to be installed. It provides APIs to manipulate various Office document formats including XLS, XLSX, DOC and PPTX.

Contributors

1,367

Organizations

129

Software value

$20M

Unstructured

Open source libraries and APIs for building custom preprocessing pipelines that feed labeling, training, or production machine learning workflows.

Contributors

1,329

Organizations

225

Software value

$38M

pdfminer.six

A Python library for extracting text, images, and metadata from PDF files. It provides tools for parsing PDF documents, analyzing their structure, and converting them into other formats. The library supports both Python 2 and 3, and includes features for handling various PDF encodings and document layouts.

Contributors

1,099

Organizations

214

Software value

$3.9M

qpdf

QPDF is a command-line tool and C++ library for structural, content-preserving transformations on PDF files. It supports operations like linearization, encryption, decryption, and manipulation of PDF objects without changing the content of the original PDF.

Contributors

776

Organizations

161

Software value

$3.5M

Apache Tika

Apache Tika is a content detection and analysis framework that detects and extracts metadata and text from various file formats. It provides a unified interface for parsing different file types including documents, images, audio, and video files, enabling applications to process content without needing to understand the underlying file formats.

Contributors

503

Organizations

115

Software value

$9.1M

Bio-Formats

Bio-Formats is a Java library for reading and writing data in life sciences image file formats. It is developed by the Open Microscopy Environment. Bio-Formats is released under the GNU General Public License (GPL); commercial licenses are available from Glencoe Software.

Contributors

490

Organizations

94

Software value

$6.5M

MediaInfo

Convenient unified display of the most relevant technical and tag data for video and audio files.

Contributors

487

Organizations

67

Software value

$3.8M

metadata-extractor

A Java library for reading metadata (Exif, IPTC, XMP, ICC, etc.) from image and video files, providing a simple API to extract format-specific information from media files

Contributors

467

Organizations

81

Software value

$1.3M

geotiff.js

A JavaScript library for reading and parsing GeoTIFF images in web browsers and Node.js. It provides functionality to decode geographic raster data stored in TIFF format, supporting various compression methods and data types.

Contributors

404

Organizations

86

Software value

$246K

Apache POI

Mirror of Apache POI

Contributors

333

Organizations

71

Software value

$19M

music-metadata

Stream- and file-based music metadata parser for Node.js, supporting a wide range of audio and tag formats.

Contributors

328

Organizations

86

Software value

$569K

LibreDWG

LibreDWG is a free software library for reading and writing DWG/DXF files, providing tools and APIs to handle AutoCAD drawing formats. It aims to be a free alternative to proprietary DWG processing libraries.

Contributors

323

Organizations

36

Software value

$40M

Globby

A Node.js package that provides pattern matching of files using glob patterns, with support for multiple patterns, negation, and gitignore rules

Contributors

297

Organizations

138

Software value

$74K

gray-matter

A Node.js library for parsing front-matter from documents. It extracts and parses YAML, JSON, or other front-matter blocks from strings or files, commonly used in static site generators and documentation tools.

Contributors

228

Organizations

93

Software value

$156K

Replace in file

A Node.js library that provides functionality to find and replace text content in files using regular expressions or strings, with support for synchronous and asynchronous operations, multiple file patterns, and various configuration options

Contributors

205

Organizations

55

Software value

$73K

plist.js

A JavaScript library for parsing and building Apple Property List (plist) files, supporting both XML and binary formats

Contributors

170

Organizations

78

Software value

$617K

ignore

A Node.js library for parsing .gitignore-style ignore files and matching file paths against ignore patterns. It provides functionality to create ignore pattern matchers, test paths, and filter file lists based on ignore rules.

Contributors

120

Organizations

62

Software value

$47K

glob-stream

A Node.js library that converts glob expressions into object streams of file paths, used by gulp and other build tools to match files using patterns like '*.js'

Contributors

109

Organizations

56

Software value

$22K

Mailparser

Decodes MIME-formatted e-mails.

This project hasn't been onboarded to LFX Insights.

front-matter

A Node.js library for parsing YAML front matter from strings or files, commonly used in static site generators and markdown processing to extract metadata from content files

This project hasn't been onboarded to LFX Insights.