23 projects
pypdf
A pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files.
2,193
436
$2M
OCRmyPDF
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
1,424
223
$1.1M
NPOI
NPOI is a .NET library for reading and writing Microsoft Office formats like Excel, Word and PowerPoint files without requiring Microsoft Office to be installed. It provides APIs to manipulate various Office document formats including XLS, XLSX, DOC and PPTX.
1,367
129
$20M
Unstructured
Open-source libraries and APIs for building custom preprocessing pipelines for labeling, training, or production machine learning workflows.
1,329
225
$38M
pdfminer.six
A Python library for extracting text, images, and metadata from PDF files. It provides tools for parsing PDF documents, analyzing their structure, and converting them into other formats. Originally compatible with both Python 2 and 3 (current releases target Python 3), it includes features for handling various PDF encodings and document layouts.
1,099
214
$3.9M
qpdf
QPDF is a command-line tool and C++ library for structural, content-preserving transformations on PDF files. It supports operations like linearization, encryption, decryption, and manipulation of PDF objects without changing the content of the original PDF.
776
161
$3.5M
Apache Tika
Apache Tika is a content detection and analysis framework that detects and extracts metadata and text from various file formats. It provides a unified interface for parsing different file types including documents, images, audio, and video files, enabling applications to process content without needing to understand the underlying file formats.
503
115
$9.1M
Bio-Formats
Bio-Formats is a Java library for reading and writing data in life sciences image file formats. It is developed by the Open Microscopy Environment. Bio-Formats is released under the GNU General Public License (GPL); commercial licenses are available from Glencoe Software.
490
94
$6.5M
MediaInfo
Convenient unified display of the most relevant technical and tag data for video and audio files.
487
67
$3.8M
metadata-extractor
A Java library for reading metadata (Exif, IPTC, XMP, ICC, etc.) from image and video files, providing a simple API to extract format-specific information from media files
467
81
$1.3M
geotiff.js
A JavaScript library for reading and parsing GeoTIFF images in web browsers and Node.js. It provides functionality to decode geographic raster data stored in TIFF format, supporting various compression methods and data types.
404
86
$246K
Apache POI
Mirror of Apache POI
333
71
$19M
music-metadata
Stream- and file-based music metadata parser for Node.js, supporting a wide range of audio and tag formats.
328
86
$569K
LibreDWG
LibreDWG is a free software library for reading and writing DWG/DXF files, providing tools and APIs to handle AutoCAD drawing formats. It aims to be a free alternative to proprietary DWG processing libraries.
323
36
$40M
Globby
A Node.js package that provides pattern matching of files using glob patterns, with support for multiple patterns, negation, and gitignore rules
297
138
$74K
gray-matter
A Node.js library for parsing front-matter from documents. It extracts and parses YAML, JSON, or other front-matter blocks from strings or files, commonly used in static site generators and documentation tools.
228
93
$156K
Replace in file
A Node.js library for finding and replacing text in files using regular expressions or strings, with support for synchronous and asynchronous operation, multiple file patterns, and various configuration options.
205
55
$73K
plist.js
A JavaScript library for parsing and building Apple Property List (plist) files, supporting both XML and binary formats
170
78
$617K
ignore
A Node.js library for parsing .gitignore-style ignore files and matching file paths against ignore patterns. It provides functionality to create ignore pattern matchers, test paths, and filter file lists based on ignore rules.
120
62
$47K
glob-stream
A Node.js library that converts glob expressions into object streams of file paths, used by gulp and other build tools to match files using patterns like '*.js'
109
56
$22K
Mailparser
Decodes MIME-formatted emails.
front-matter
A Node.js library for parsing YAML front matter from strings or files, commonly used in static site generators and markdown processing to extract metadata from content files