Content Extraction Libraries

Libraries for extracting text and metadata from a wide range of file formats to enable content analysis and processing.

by The Linux Foundation

25 projects ・ Updated 14 Mar 2025

Project             Contributors  Organizations  Software value
pypdf               2,211         440            $2M
OCRmyPDF            1,437         225            $1.1M
NPOI                1,369         129            $21M
Unstructured        1,350         225            $39M
pdfminer.six        1,101         215            $3.9M
qpdf                778           161            $3.5M
readxl              694           144            $384K
Apache Tika         505           116            $10M
Bio-Formats         492           95             $6.5M
MediaInfo           492           68             $3.8M
metadata-extractor  472           81             $1.3M
geotiff.js          406           89             $252K
Apache POI          342           72             $19M
music-metadata      332           86             $571K
LibreDWG            324           36             $40M
Globby              298           136            $80K
gray-matter         230           92             $156K
Replace in file     205           55             $73K
plist.js            170           78             $617K
xan                 143           38             $2M
Tabula              125           34             $556K
ignore              120           62             $47K
front-matter        113           44             $19K
glob-stream         109           55             $22K
Mailparser          0             0              $0