Content Extraction Libraries

Libraries for extracting text and metadata from a wide range of file formats to enable content analysis and processing.

by The Linux Foundation ・ 25 projects ・ Updated 14 Mar 2025

| Project | Contributors | Organizations | Software value |
| --- | --- | --- | --- |
| pypdf | 2,211 | 440 | $2M |
| OCRmyPDF | 1,437 | 225 | $1.1M |
| NPOI | 1,369 | 129 | $21M |
| Unstructured | 1,350 | 225 | $39M |
| pdfminer.six | 1,101 | 215 | $3.9M |
| qpdf | 778 | 161 | $3.5M |
| readxl | 694 | 144 | $384K |
| Apache Tika | 505 | 116 | $10M |
| Bio-Formats | 492 | | $6.5M |
| MediaInfo | 492 | | $3.8M |
| metadata-extractor | 472 | | $1.3M |
| geotiff.js | 406 | | $252K |
| Apache POI | 342 | | $19M |
| music-metadata | 332 | | $571K |
| LibreDWG | 324 | | $40M |
| Globby | 298 | 136 | $80K |
| gray-matter | 230 | | $156K |
| Replace in file | 205 | | $73K |
| plist.js | 170 | | $617K |
| xan | 143 | | $2M |
| Tabula | 125 | | $556K |
| ignore | 120 | | $47K |
| front-matter | 113 | | $19K |
| glob-stream | 109 | | $22K |
| Mailparser | | | |
