textract
A text extraction node module.
Currently Extracts...
- PDF
- DOC
- DOCX
- XLS
- XLSX
- XLSB
- XLSM
- PPTX
- DXF
- PNG
- JPG
- GIF
- RTF
- application/javascript
- All text/* mime-types
Does textract not extract from files of the type you need? Add an issue or submit a pull request. It's super easy to add an extractor for a new mime type.
Install
npm install textract
Requirements
- PDF extraction requires pdftotext to be installed.
- DOC extraction requires catdoc to be installed.
- RTF extraction requires catdoc to be installed.
- DOCX extraction requires unzip to be available.
- PPTX extraction requires unzip to be available.
- PNG, JPG and GIF extraction requires tesseract to be available. Images need to be clear, high DPI, and made up almost entirely of text for tesseract to extract the text accurately.
- DXF extraction requires drawingtotext to be available.
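The requirements above can be captured as a small lookup table, which may be handy for checking that the right binary is present before attempting an extraction. The following is a minimal sketch only; the names REQUIRED_BINARIES and requiredBinary are hypothetical and are not part of textract.

```javascript
// Hypothetical helper, not part of textract: maps a file type from the
// requirements list to the external binary its extraction depends on.
const REQUIRED_BINARIES = {
  PDF: 'pdftotext',
  DOC: 'catdoc',
  RTF: 'catdoc',
  DOCX: 'unzip',
  PPTX: 'unzip',
  PNG: 'tesseract',
  JPG: 'tesseract',
  GIF: 'tesseract',
  DXF: 'drawingtotext'
};

// Returns the binary name for a type, or null when the requirements
// list names no external binary for it (for example XLSX).
function requiredBinary(fileType) {
  const key = String(fileType).replace(/^\./, '').toUpperCase();
  return Object.prototype.hasOwnProperty.call(REQUIRED_BINARIES, key)
    ? REQUIRED_BINARIES[key]
    : null;
}

console.log(requiredBinary('docx')); // unzip
console.log(requiredBinary('.gif')); // tesseract
console.log(requiredBinary('xlsx')); // null
```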
Usage
Command Line
If textract is installed globally, via npm install -g textract, then the following command will write the extracted text to the console.
$ textract pathToFile
In your node app
Import
var textract = require('textract');
Execution
If you do not know the mime type of the file
textract(filePath, function( error, text ) {})
If you know the mime type of the file
textract(type, filePath, function( error, text ) {})
If you wish to pass some config and know the mime type
textract(type, filePath, config, function( error, text ) {})
If you wish to pass some config, but do not know the mime type
textract(filePath, config, function( error, text ) {})
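The four call forms differ only in which optional arguments are present, so they can be wrapped in a single promise-returning helper. A minimal sketch follows; extractText is a hypothetical name, and the textract function is passed in as an argument so the helper assumes nothing beyond the call signatures listed above.

```javascript
// Hypothetical wrapper, not part of textract. `extract` is expected to be
// the function returned by require('textract'); `type` and `config` are
// optional and are simply left out of the call when not supplied.
function extractText(extract, filePath, options) {
  const opts = options || {};
  const args = [];
  if (opts.type) args.push(opts.type);      // mime type, when known
  args.push(filePath);
  if (opts.config) args.push(opts.config);  // config object, when provided

  return new Promise(function (resolve, reject) {
    // The callback always goes last, matching every documented call form.
    args.push(function (error, text) {
      if (error) reject(error);
      else resolve(text);
    });
    extract.apply(null, args);
  });
}

// Usage, assuming textract is installed and the path is illustrative:
// var textract = require('textract');
// extractText(textract, 'report.docx', { config: { preserveLineBreaks: true } })
//   .then(function (text) { console.log(text); })
//   .catch(function (error) { console.error(error); });
```

Passing the extractor in as a parameter also makes the helper easy to exercise with a stub when textract or its external binaries are not available.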
The error will contain informative text about why the extraction failed. If textract does not currently extract files of the type provided, a typeNotFound flag will be set on the error object.
If processing a .gif on OSX, an error will be thrown with a macProcessGif flag on it set to true. Tesseract has issues with .gif files on OSX.
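One way to act on these flags is to map them to a short category before deciding how to respond. This is a sketch only: classifyExtractionError and its category strings are hypothetical, and it assumes the flagged error is the one delivered to the callback.

```javascript
// Hypothetical helper, not part of textract: turns the documented error
// flags (typeNotFound, macProcessGif) into a short category string.
function classifyExtractionError(error) {
  if (!error) return 'ok';
  if (error.typeNotFound) return 'unsupported-type';
  if (error.macProcessGif) return 'gif-on-osx';
  return 'extraction-failed';
}

// Usage inside a textract callback (file name is illustrative):
// textract('scan.gif', function (error, text) {
//   switch (classifyExtractionError(error)) {
//     case 'ok': console.log(text); break;
//     case 'unsupported-type': console.error('No extractor for this type'); break;
//     case 'gif-on-osx': console.error('GIF processing is off on OSX'); break;
//     default: console.error(error);
//   }
// });
```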
Configuration
Configuration can be passed into textract. The following configuration options are available:
- preserveLineBreaks: By default textract does NOT preserve line breaks. Pass this in as true and textract will not strip any line breaks.
- exec: Some extractors (xlsx, docx, dxf) use node's exec functionality. This setting allows for providing config to exec execution. One reason you might want to provide this config is if you are dealing with very large files, in which case you might want to increase the exec maxBuffer setting.
- [ext].exec: Each extractor can take specific exec config.
- macProcessGif: By default on OSX textract will not run tesseract on .gif files. If you've figured out how to make it work, set this flag to true to turn gif processing back on.
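As an illustration, a config object combining these options might look like the sketch below. The buffer sizes are arbitrary, and the docx key reflects one reading of the [ext].exec option (an extension name holding its own exec settings), which is an assumption rather than something the text above spells out.

```javascript
// A sketch of a config object built from the options described above.
// Buffer sizes are arbitrary illustrations, not recommended values.
const config = {
  // Keep line breaks in the extracted text.
  preserveLineBreaks: true,
  // Passed to node's exec for extractors that shell out; a larger
  // maxBuffer may help with very large files.
  exec: { maxBuffer: 50 * 1024 * 1024 },
  // Assumed shape for "[ext].exec": exec settings for one extractor only.
  docx: { exec: { maxBuffer: 100 * 1024 * 1024 } }
};

// Usage, assuming textract is installed and the path is illustrative:
// var textract = require('textract');
// textract('large.docx', config, function (error, text) { /* handle result */ });
```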
Release Notes
0.12.0
- #21, #22: Now using j via its binaries rather than via node. This makes XLS/X extraction slower, but significantly reduces the memory consumption of textract.
0.11.2
- Updated pdf-text-extract to latest, fixes #20.
0.11.1
- Addressed path escaping issues with tesseract, fixes #18.
0.11.0
- Using j to handle xls and xlsx; this removes the requirement on the xls2csv binary.
- j also supports xlsb and xlsm.