dbashford / textract


node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!

License: MIT License

JavaScript 35.29% CSS 0.01% HTML 54.84% Rich Text Format 9.87%

extract-text extraction nodejs

textract's Introduction

textract

A text extraction node module.

Currently Extracts...

HTML, HTM
ATOM, RSS
Markdown
EPUB
XML, XSL
PDF
DOC, DOCX
ODT, OTT (experimental, feedback needed!)
RTF
XLS, XLSX, XLSB, XLSM, XLTX
CSV
ODS, OTS
PPTX, POTX
ODP, OTP
ODG, OTG
PNG, JPG, GIF
DXF
application/javascript
All text/* mime-types.

In almost all cases above, what textract cares about is the mime type. So .html and .htm, both possessing the same mime type, will be extracted. Other extensions that share mime types with those above should also extract successfully. For example, application/vnd.ms-excel is the mime type for .xls, but also for 5 other file types.

Does textract not extract from files of the type you need? Add an issue or submit a pull request. In many cases textract is already capable; it is just not paying attention to the mime type you may be interested in.

Install

npm install textract

Extraction Requirements

Note: if any of the requirements below are missing, textract will still run and extract every file type it is capable of handling. Not having these items installed does not prevent you from using textract; it only prevents you from extracting those specific file types.

PDF extraction requires pdftotext to be installed.
DOC extraction requires antiword to be installed, unless on OSX, in which case textutil (installed by default) is used.
RTF extraction requires unrtf to be installed, unless on OSX, in which case textutil (installed by default) is used.
PNG, JPG and GIF require tesseract to be available. Images need to be pretty clear, high DPI and made almost entirely of just text for tesseract to be able to accurately extract the text.
DXF extraction requires drawingtotext to be available.

Configuration

Configuration can be passed into textract. The following configuration options are available

preserveLineBreaks: When using the command line this is set to true to preserve stdout readability. When using the library via node this is set to false. Pass this in as true and textract will not strip any line breaks.
preserveOnlyMultipleLineBreaks: Some extractors, like PDF, insert line breaks at the end of every line, even in the middle of a sentence. If this option (default false) is set to true, any single line break is removed but multiple consecutive line breaks are preserved. Check your output with this option, though: it does not preserve paragraphs unless they are separated by multiple breaks.
exec: Some extractors (dxf) use node's exec functionality. This setting allows for providing config to exec execution. One reason you might want to provide this config is if you are dealing with very large files. You might want to increase the exec maxBuffer setting.
[ext].exec: Each extractor can take specific exec config. Keep in mind that many extractors are responsible for extracting multiple types, so, for instance, the odt extractor is what you would configure for odt, odg and related types. Check the extractors to see which you want to specifically configure. At the bottom of each is a list of types for which the extractor is responsible.
tesseract.lang: A pass-through to tesseract allowing for setting of language for extraction. ex: { tesseract: { lang:"chi_sim" } }
tesseract.cmd: tesseract.lang allows a quick means to provide the most popular tesseract option, but if you need to configure more options, you can simply pass cmd. cmd is the string that matches the command-line options you want to pass to tesseract. For instance, to provide language and psm, you would pass { tesseract: { cmd:"-l chi_sim -psm 10" } }
pdftotextOptions: This is a proxy options object to the library textract uses for pdf extraction: pdf-text-extract. Options include ownerPassword, userPassword if you are extracting text from password protected PDFs. IMPORTANT: textract modifies the pdf-text-extract layout default so that, instead of layout: layout, it uses layout:raw. It is not suggested you modify this without understanding what trouble that might get you in. See this GH issue for why textract overrides that library's default.
typeOverride: Used with fromUrl, if set, rather than using the content-type from the URL request, will use the provided typeOverride.
includeAltText: When extracting HTML, whether or not to include alt text with the extracted text. By default this is false.
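
As a rough illustration of what preserveOnlyMultipleLineBreaks describes, the sketch below drops single line breaks while keeping runs of two or more. The function name collapseSingleLineBreaks is hypothetical and the regex is an assumption made for illustration; it is not textract's internal implementation.

```javascript
// Hypothetical helper, not part of textract: replace a lone line break with a
// space, but leave runs of two or more line breaks untouched.
function collapseSingleLineBreaks(text) {
  // Match a newline that is preceded by a non-newline character and is not
  // followed by another newline, then swap that newline for a space.
  return text.replace(/([^\n])\n(?!\n)/g, '$1 ');
}

var sample = 'first line\nsame sentence\n\nnew paragraph';
console.log(JSON.stringify(collapseSingleLineBreaks(sample)));
```

The paragraph break (two newlines) survives, which matches the caveat above: paragraphs separated by only one break would be merged.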

To use this configuration at the command line, prefix each option with a --.

Ex: textract image.png --tesseract.lang=deu

Usage

Command Line

If textract is installed globally, via npm install -g textract, then the following command will write the extracted text to the console for a file on the file system.

$ textract pathToFile

Flags

Configuration flags can be passed into textract via the command line.

textract pathToFile --preserveLineBreaks false

Parameters like exec.maxBuffer can be passed as you'd expect.

textract pathToFile --exec.maxBuffer 500000

And multiple flags can be used together.

textract pathToFile --preserveLineBreaks false --exec.maxBuffer 500000
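
To make the relationship between the nested configuration object and the dotted command-line flags concrete, here is a minimal sketch. The helper toCliFlags is hypothetical (it is not shipped with textract); it only mirrors the flag shapes shown in the examples above.

```javascript
// Hypothetical helper: flatten a nested config object into the
// "--key.path value" pairs used in the command-line examples above.
function toCliFlags(config, prefix) {
  return Object.keys(config).reduce(function (flags, key) {
    var name = prefix ? prefix + '.' + key : key;
    var value = config[key];
    if (value !== null && typeof value === 'object') {
      // Nested objects become dotted names, e.g. exec.maxBuffer.
      return flags.concat(toCliFlags(value, name));
    }
    return flags.concat(['--' + name, String(value)]);
  }, []);
}

var flags = toCliFlags({ preserveLineBreaks: false, exec: { maxBuffer: 500000 } });
console.log(['textract', 'pathToFile'].concat(flags).join(' '));
```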

Node

Import

var textract = require('textract');

APIs

There are several ways to extract text. For all methods, the extracted text and an error object are passed to a callback.

error will contain informative text about why the extraction failed. If textract does not currently extract files of the type provided, a typeNotFound flag will be tossed on the error object.
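
A small sketch of how a callback might branch on the typeNotFound flag described above. The helper classifyResult and its status strings are hypothetical; only the (error, text) callback shape and the typeNotFound flag come from this README.

```javascript
// Hypothetical helper: turn the (error, text) callback arguments into a
// small result object that calling code can branch on.
function classifyResult(error, text) {
  if (error) {
    return {
      // typeNotFound is set when textract does not handle the file type.
      status: error.typeNotFound ? 'unsupported' : 'failed',
      message: error.message
    };
  }
  return { status: 'ok', text: text };
}

// Intended use (requires textract to be installed):
//   textract.fromFileWithPath(filePath, function (error, text) {
//     console.log(classifyResult(error, text));
//   });

// Stand-in error shaped like the one described above, for demonstration only.
var unsupported = new Error('type not handled');
unsupported.typeNotFound = true;
console.log(classifyResult(unsupported, null).status);
console.log(classifyResult(null, 'hello').status);
```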

File

textract.fromFileWithPath(filePath, function( error, text ) {})

textract.fromFileWithPath(filePath, config, function( error, text ) {})

File + mime type

textract.fromFileWithMimeAndPath(type, filePath, function( error, text ) {})

textract.fromFileWithMimeAndPath(type, filePath, config, function( error, text ) {})

Buffer + mime type

textract.fromBufferWithMime(type, buffer, function( error, text ) {})

textract.fromBufferWithMime(type, buffer, config, function( error, text ) {})

Buffer + file name/path

textract.fromBufferWithName(name, buffer, function( error, text ) {})

textract.fromBufferWithName(name, buffer, config, function( error, text ) {})

URL

When passing a URL, the URL can either be a string, or a node.js URL object. Using the URL object allows fine grained control over the URL being used.

textract.fromUrl(url, function( error, text ) {})

textract.fromUrl(url, config, function( error, text ) {})
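
The typeOverride option from the Configuration section pairs naturally with fromUrl. The sketch below wraps that call; extractUrlAs is a hypothetical helper, the example.com URL is a placeholder, and the stub object stands in for the real module so the snippet runs without network access.

```javascript
// Hypothetical wrapper: extract from a URL while forcing the mime type.
// `textractLike` is any object exposing fromUrl(url, config, callback),
// for example the module returned by require('textract').
function extractUrlAs(textractLike, url, mimeType, callback) {
  var config = { typeOverride: mimeType };
  textractLike.fromUrl(url, config, callback);
}

// Stub used only for demonstration; it echoes what it was asked to do.
var stubTextract = {
  fromUrl: function (url, config, callback) {
    callback(null, url + ' as ' + config.typeOverride);
  }
};

extractUrlAs(stubTextract, 'https://example.com/report', 'application/pdf', function (error, text) {
  console.log(error || text);
});
```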

Testing Notes

Running Tests on a Mac?

sudo port install tesseract-chi-sim
sudo port install tesseract-eng
You will also want to disable textract's usage of textutil as the tests are based on output from antiword.
- Go into /lib/extractors/{doc|doc-osx|rtf} and modify the code under if ( os.platform() === 'darwin' ) {. Uncomment the commented lines in these sections.


textract's Issues

xlsx extractor?

Is it possible to build an extractor for Excel (*.xlsx) files?

PPTX beyond 9 pages will end up out of order

Will end up with double digit pages showing up first.
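
The ordering described here is what a plain string sort produces when slide names contain numbers. A minimal sketch of a numeric sort follows; it assumes slide entries are named like slide1.xml, slide2.xml and so on, and sortSlides is a hypothetical helper rather than textract's actual fix.

```javascript
// Hypothetical helper: pull the trailing number out of a name like "slide10.xml".
function slideNumber(name) {
  var match = /(\d+)\.xml$/.exec(name);
  // Names without a number sort last.
  return match ? Number(match[1]) : Number.MAX_SAFE_INTEGER;
}

// Sort a copy of the list numerically instead of lexicographically.
function sortSlides(names) {
  return names.slice().sort(function (a, b) {
    return slideNumber(a) - slideNumber(b);
  });
}

var names = ['slide1.xml', 'slide10.xml', 'slide11.xml', 'slide2.xml', 'slide9.xml'];
console.log(names.slice().sort()); // default string sort puts slide10 before slide2
console.log(sortSlides(names));    // numeric sort restores slide order
```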

Make messages about failed extractors clear "Info" messages.

Because nothing is wrong other than textract won't be able to extract that type.

Problems with cyrillic symbols

When I execute a js file with node.js with the following content (for example with a .doc file):
var textract = require('textract');

textract.fromFileWithPath('test.doc', function( error, text ) {
if (error) throw error;
console.log(text);
})

with a .doc file, all cyrillic symbols are unreadable (but when I execute Catdoc directly, I can read them),
and with a .docx file all cyrillic symbols are removed.

PDF extractor options are ignored

Why does the pdf extractor ignore the options?

A bug when extracting from an image with tesseract

Error:

< 29 Mar 22:40:41 - error: [App] Error extracting [[ /XXX/Screen Shot 2014-03-06 at 14.43.23.png ]], exec error: Error: Command failed: read_params_file: Can't open Shot
< read_params_file: Can't open 2014-03-06
< read_params_file: Can't open at
< read_params_file: Can't open 14.43.23.png
< read_params_file: Can't open /YYY/node_modules/textract/lib/extractors/temp/Screen
< read_params_file: Can't open Shot
< read_params_file: Can't open 2014-03-06
< read_params_file: Can't open at
< read_params_file: Can't open 14.43.23
< Cannot open input file: /XXX

The problem is that the paths are not escaped before calling the tesseract command:

  exec( "tesseract " + filePath + " " + fileTempOutPath + " quiet",

Will submit a pull request fixing the issue.
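
One common way to avoid this class of problem is to skip the shell entirely and pass each path as its own argument through Node's child_process.execFile. The sketch below shows that idea; it is an assumption about a possible fix, not necessarily what the pull request did, and tesseractArgs and runTesseract are hypothetical names.

```javascript
var execFile = require('child_process').execFile;

// Hypothetical helper: build the argument list without any string
// concatenation, so spaces in paths never split an argument.
function tesseractArgs(filePath, outputBase) {
  return [filePath, outputBase, 'quiet'];
}

// Hypothetical wrapper: execFile does not go through a shell, so each
// array element reaches tesseract as exactly one argument.
function runTesseract(filePath, outputBase, callback) {
  execFile('tesseract', tesseractArgs(filePath, outputBase), callback);
}

// Demonstration of the argument list only (does not invoke tesseract).
console.log(tesseractArgs('/XXX/Screen Shot 2014-03-06 at 14.43.23.png', '/tmp/out'));
```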

PPTX support?

Does it work? Because for me it does not.

$ textract 'test.pptx'
textract not ready, retrying in .5 seconds
textract: 'drawingtotext' does not appear to be installed, so it will be unable
to extract DXFs.
textract: 'catdoc' does not appear to be installed, so it will be unable to extr
act DOCs.
[Error: extract powerpoint, pptx, exec error: Error: stdout maxBuffer exceeded.]

Unable to extract text from doc and docx files

Whenever i run the project i keep getting the following warnings:

textract: 'unzip' does not appear to be installed, so textract will be unable to
extract DOCXs.
textract: 'catdoc' does not appear to be installed, so it will be unable to extr
act DOCs.

I have properly installed the catdoc command and it works in the command prompt using the path environment variable.

Also, I am unable to install the unzip module as there is no link provided for it, and I would like to know how to install it.

If anybody would provide some information on this i will be very thankful to him.

Support for Buffer objects containing a base64 encoded string?

Use case: An Express API for taking DOC/PDF/DOCX and returning text.

Rather than uploading a file to the server and then having textract read that off the disk it would be preferable to take a base64 encoded DOC/PDF/DOCX file sent as a string in a POST request, put it in a buffer, and then have textract read that buffer.
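
With the fromBufferWithMime API documented above, this use case can be sketched roughly as follows. The helper extractFromBase64 is hypothetical, and the stub stands in for the real module so the snippet runs without textract installed; a real Express handler would pass require('textract') and the string from the request body instead.

```javascript
// Hypothetical helper: decode a base64 string into a Buffer and hand it to
// fromBufferWithMime. `textractLike` is any object exposing that method.
function extractFromBase64(textractLike, mimeType, base64, callback) {
  var buffer = Buffer.from(base64, 'base64');
  textractLike.fromBufferWithMime(mimeType, buffer, callback);
}

// Stub used only for demonstration; it returns the buffer contents as text.
var stubForBuffers = {
  fromBufferWithMime: function (type, buffer, callback) {
    callback(null, buffer.toString('utf8'));
  }
};

var encoded = Buffer.from('hello textract', 'utf8').toString('base64');
extractFromBase64(stubForBuffers, 'text/plain', encoded, function (error, text) {
  console.log(error || text);
});
```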

Support for more file formats

Hi David,

Do you have plans to update textract to support .ppt, .~~xlsx~~, .~~xltx~~, .~~potx~~, .key, .pages, ~~.xml~~? I'd also love to see support for OpenOffice file formats, like ~~.odt~~, ~~.ott~~, ~~.ods~~, ~~.ots~~, ~~.odg~~, ~~.otg~~, ~~.odp~~, ~~.otp~~.

Thanks!

pdf-to-text version upgrade

We've noticed that the pdf-text-extract npm module has been updated (now at 1.1.2).

This new version fixes some problems we have been having where warnings in the extraction process come back as errors and thus we do not get the extracted text.

Any chance we can get the package.json file updated to use 1.1.2 for pdf-text-extract?

Update NPM?

Any chance of getting an update on NPM so we can have the pptx extractor? I actually wrote a pptx extractor, then noticed you had done it already!

Preserve newline behaviour

For me, the preserve newline behaviour isn't quite working as I expected (tested with the docx extractor).

I have text like this in a docx file:

2 downlighters; door to hall.

Hall
Double glazed window to front;

With preserveLineBreaks I get this output:

2 downlighters; door to hall. Hall
Double glazed window to front;

After outputting some stuff to the console I can see the newlines are there as expected but then they get parsed out.

Taking a look at how preserveLineBreaks is implemented I see it's a big, hairy regex, so not sure what it is doing at first glance. From my naive point of view it would be nicer to get the raw text output, if I need to filter further I can make my own mind. Or if there is a 'clean' function as a configuration option I could use it to override the default behaviour.

Disable info text?

I don't have the DXF conversion software installed, so every time I do a text extraction I get info warning text saying INFO: 'drawingtotext' does not appear to be installed, so textract will be unable to extract DXFs. and then the text of my document after it. Is there any way to disable this?

Add CSV support

We have a requirement for CSV support in a project. Would this be useful to use a popular npm library with the same interface as textract?

I will be able to PR my work early next week.

Determine if replaceTextChars is still necessary and remove if not

See #58

.docx extractor options

It looks like the options passed to other extractors is not utilized for the .docx extraction process. textract API's are passing an empty string back to the callback for large .docx files (testing with a .docx around 400 pages).

Reading files from S3

Hi David!

Trying to create an endpoint in an Express server like this:

app.get('/textract', function(req, res, next) {
  textract("https://s3.amazonaws.com/testbucket1a2b3c/test.pdf", function(error, text) {
    console.log(error);
    res.end();
  });
});

Console returns [Error: File at path [[ https://s3.amazonaws.com/testbucket1a2b3c/test.pdf ]] does not exist.]

What does this mean exactly? Textract only works with local files? (in this case my file is uploaded to S3). Thanks!

Add support for .key, .pages

Ref #42

make the temp folder in an actual temp location

I installed textract:

$ sudo npm install -g textract

Every invocation of textract seems to fail:

$ textract -h

fs.js:647
  return binding.mkdir(pathModule._makeLong(path),
                 ^
Error: EACCES, permission denied '/usr/local/lib/node_modules/textract/lib/extractors/temp'
    at Object.fs.mkdirSync (fs.js:647:18)
    at Object.<anonymous> (/usr/local/lib/node_modules/textract/lib/extractors/images.js:83:8)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at module.exports (/usr/local/lib/node_modules/textract/lib/extract.js:85:10)
    at Array.map (native)

This happens on OSX because the module was installed as root but invoked as a normal user. On linux and osx the temp folder should probably be a proper temporary directory in a location like /tmp
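
A minimal sketch of the suggestion, using Node's os.tmpdir() and fs.mkdtempSync to create a per-run directory under the system temp location. The helper name makeTempDir and the textract- prefix are assumptions for illustration, not the module's actual behavior.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Hypothetical helper: create a uniquely named directory under the
// system temp location, which is writable by the invoking user.
function makeTempDir(prefix) {
  return fs.mkdtempSync(path.join(os.tmpdir(), prefix));
}

var tempDir = makeTempDir('textract-');
console.log(tempDir);

// Remove the (empty) directory again so the example leaves nothing behind.
fs.rmdirSync(tempDir);
```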

[Error: extract docx unzip exec error: Error: stdout maxBuffer exceeded.]

Hi David,

I'm getting this error when trying to textract a big .docx file (1.8MB). I tried increasing the maxBuffer setting by doing $ textract big-file.docx --exec.maxBuffer 512000 but it's not working (tried many values, but none seem to work).

Do you know a possible fix?

Thanks!

Extractor that fails test still registers

if ( extractor.test ) {
  extractor.test();
}
return extractor;
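
One way to address this, sketched under the assumption that an extractor's test function is synchronous and throws when its dependency is missing (the real extractors may work differently). The names registerIfUsable and registry are hypothetical.

```javascript
// Hypothetical sketch: only add the extractor to the registry when its
// self-test does not throw. Assumes test() is synchronous.
function registerIfUsable(extractor, registry) {
  if (typeof extractor.test === 'function') {
    try {
      extractor.test();
    } catch (err) {
      // Self-test failed, so skip registration.
      return false;
    }
  }
  registry.push(extractor);
  return true;
}

var registry = [];
registerIfUsable({ name: 'ok', test: function () {} }, registry);
registerIfUsable({ name: 'broken', test: function () { throw new Error('missing tool'); } }, registry);
console.log(registry.map(function (e) { return e.name; }));
```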

Consider replacing catdoc

I tried installing catdoc on osx 10.9.3 (for RTF support) using brew as well as from source, and for whatever reason it just does not want to play nice. What formats currently use catdoc? Are there pure-JS text extractors for those formats?

Filenames with round brackets "(" or ")" break the extraction process

If your filename is named with brackets, for instance "new doc(1).docx" the extraction fails(at least for docx files). Escaping the brackets won't work because then fs.exists on line 7 of index.js fails.

Remove extraneous white space

Get a lot of extractions that'll look something like this

some text                more text             some other text

No need for all the white space.
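
A minimal sketch of the clean-up being asked for: collapse runs of spaces and tabs into a single space. The function name collapseSpaces is hypothetical, and this is only an illustration, not the module's cleansing code.

```javascript
// Hypothetical helper: squeeze runs of spaces/tabs down to one space and
// trim the ends. Line breaks are left alone.
function collapseSpaces(text) {
  return text.replace(/[ \t]{2,}/g, ' ').trim();
}

console.log(collapseSpaces('some text                more text             some other text'));
```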

Markdown support

Streams?

Any plans on using node streams?

use node modules instead of external programs

For example:

pdf-text for PDF files
xlsjs for XLS files
xlsx for XLSX/XLSM/XLSB files

I'm sure more pure-JS parsers exist

Some spaces showing up in the middle of words

From here: #5 (comment)

This change has caused random spaces in the middle of words in the .docx files I've been using. It seems to be an issue when either the w:t tag has an attribute of xml-spacing="preserve" or the sibling to the w:t tag w:rPr has a child node of

Here you go:
https://docs.google.com/file/d/0Bxcbem1SSxNoaXRRazcwWG82Y1k/edit
the extracted text will be this:
this is a test docu ment that won t be extracted properly.
should be:
this is a test document that won't be extracted properly.
(the quote thing might be a little harder to fix than the space).

PPT Support?

Pre-2007 powerpoint

how can i set options with language

i want use language chi_sim

where can i set options

Removes too much whitespace

I am finding that textract is removing all of the line breaks within a document. Commenting out cleanseText seemed to fix it but perhaps a better way would be to specify whether text is 'cleansed' with params?

lang parameter

How do I pass the language that should be used for ocr?

Add ability to optionally write file to disk

Many (most) extractors now do not need to be on disk to be extracted. Would be nice to avoid that step.

textract function can be invoked before all extractors are loaded

I added a simple call to "textract(filePath, callback)" in my "app.js", like this:

    var textract = require('textract');
    var filePath = "examples/Cosmos.pdf";
    textract(filePath, function( error, text )
    {
        if (error)
        {
            console.log("%s", error);
        }
        else if (!text)
        {
            console.log("Error: no text received");
        }
        else
        {
            // Ignore punctuation for now...
            var terms = text.split(" ");
            console.log("terms found: #%d", terms.length);
        }
    });

When running it via "node app" it reports that "Error: textract does not currently extract files of type [[ application/pdf ]]".

Reading the source I found that the extractor for PDFs was indeed there (under "lib/extractors/") so I added a "console.log()" to "registerExtractor(extractor)" in "lib/extract.js" and I found that the PDF extractor was loaded AFTER my call to "textract()" was "completed".

I rearranged my code as follows and it works (because I'm now waiting 5 seconds for the extractors to be loaded):

var delayedExtraction = function()
{
    textract(filePath, function( error, text )
    {
        if (error)
        {
            console.log("%s", error);
        }
        else if (!text)
        {
            console.log("Error: no text received");
        }
        else
        {
            // Ignore punctuation for now...
            var terms = text.split(" ");
            console.log("terms found: #%d", terms.length);
        }
    });
};
setTimeout(delayedExtraction, 5000);

I know this way it works, but I'd like textract to take care of this concurrency issue in a deterministic way ;-)
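
A deterministic alternative to the setTimeout workaround is to queue calls until loading has finished. The sketch below shows the general pattern only; createReadyGate, run and markReady are hypothetical names and this is not how textract itself is implemented.

```javascript
// Hypothetical sketch: queue tasks until markReady() is called, then run
// them in order. Tasks submitted after that run immediately.
function createReadyGate() {
  var ready = false;
  var pending = [];
  return {
    run: function (task) {
      if (ready) {
        task();
      } else {
        pending.push(task);
      }
    },
    markReady: function () {
      ready = true;
      while (pending.length > 0) {
        pending.shift()();
      }
    }
  };
}

var gate = createReadyGate();
var log = [];
gate.run(function () { log.push('queued extraction'); }); // arrives too early
log.push('extractors loaded');
gate.markReady();                                          // flushes the queue
gate.run(function () { log.push('immediate extraction'); });
console.log(log);
```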

Thanks!

Exceeding buffer error

With larger docx files a buffer exceeded error is generated.

I got around this by modifying:
lib/extractors/docx.js

adding the following to the exec statement near the top of the file:
{maxBuffer: 50000*1024},

Ideally this could be a configurable parameter.

Cheers!

docx files "preserveLineBreaks" does not seem to work.

Verify and fix

Look into yauzl for replacing requirement for unzip

Error: Cannot find module `ppt`

I've just made a deployment with the latest version of the lib (0.17) and get the following error in the log:

/graspeo/current/node_modules/mongoose/node_modules/mongodb/lib/mongodb/db.js:297
          throw err;
                ^
Error: Cannot find module 'ppt'
    at Function.Module._resolveFilename (module.js:338:15)
    at Function.Module._load (module.js:280:25)
    at Function.cls_wrapMethod [as _load] (/graspeo/current/node_modules/newrelic/lib/shimmer.js:208:38)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at Object.<anonymous> (/graspeo/current/node_modules/textract/lib/extractors/ppt.js:2:11)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)

Earlier today everything was fine so I assume this is because of the new release. Changing version back to 0.16 made things work again.

By the way, thanks for the great lib!

support cyrillic

cleanseText removes cyrillic letters.

The cause is that WHITELIST_PRESERVE_LINEBREAKS and WHITELIST_STRIP_LINEBREAKS will remove all unknown characters.

See RegEx with extended alphabet to match all unicode letters.
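
As an illustration of the suggested direction, JavaScript regular expressions with the u flag support Unicode property escapes such as \p{L} (any letter). The sketch below keeps letters, numbers, punctuation and whitespace from any script; stripUnknownChars is a hypothetical name and the character classes are an assumption, not the module's actual whitelist.

```javascript
// Hypothetical sketch: remove everything that is not a Unicode letter,
// number, punctuation mark or whitespace character.
function stripUnknownChars(text) {
  return text.replace(/[^\p{L}\p{N}\p{P}\s]/gu, '');
}

// The NUL character is stripped; the cyrillic letters are kept.
console.log(stripUnknownChars('Привет, мир! \u0000ok'));
```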

Error: extractNewWordDocument exec error: Error: stdout maxBuffer exceeded

$ textract some-file.docx 
textract not ready, retrying in .5 seconds
textract: 'drawingtotext' does not appear to be installed, so it will be unable to extract DXFs.
[Error: extractNewWordDocument exec error: Error: stdout maxBuffer exceeded.]

Any way to avoid this error? Or is it just something im doing wrong?
I dont need drawingtotext, just for doc and docx i guess?

Add support for speech to text via google API

Capture catdoc existing differently

analyze the output of catdoc __filename to see if catdoc is there but just can't find the file.

ODT Support

Is ODT support in the pipeline?

Also with docx files "preserveLineBreaks" does not seem to work.

PPTX missing newlines, writes error messages to stdout

I took the test file and used powerpoint to save as an RTF file. Using textutil on OSX, I generated a baseline. Ideally, textract should produce the exact text:

$ textutil -convert txt layout_types_2011.rtf # creates layout_types_2011.txt
$ textract layout_types_2011.pptx 2>/dev/null >layout_types_2011.textract
$ diff layout_types_2011.txt layout_types_2011.textract

While the differences might be conscious decisions, it's worth clarifying:

A) the line "textract not ready, retrying in .5 seconds" is printed to stdout. This probably should be printed to stderr: https://github.com/dbashford/textract/blob/master/lib/extract.js#L72 should use console.error rather than console.log

B) Newlines are completely lost. For example, slide 10 reads

Who thought this would be a good idea?

Unfortunately the arrow keys act relative to the screen rather than the text

The entire input situation is confusing

but textract is writing

Who thought this would be a good idea? Unfortunately the arrow keys act relative to the screen rather than the text The entire input situation is confusing

C) The … character U+2026 is missing (is that intentional?)

Parsing issues.

Receiving the following error when trying get text from simple docx. http://www.filedropper.com/testres
[Error: extractNewWordDocument exec error: Error: Command failed: [tests/testres.docx] End-of-central-directory signature not found. Either this file is not a zipfile, or it constitutes one disk of a multi-part archive. In the latter case the central directory and zipfile comment will be found on the last disk(s) of this archive. note: tests/testres.docx may be a plain executable, not an archive ]

Do you support rtf? Should I be forcing a file type?
{ [Error: textract does not currently extract files of type [[ application/rtf ]]] typeNotFound: true }

~~Parsing a plain .txt http://www.filedropper.com/testres_1 I receive~~

~~*********************** C o u r i e r N e w~~

Here's the server thats trying to parse these files. Using express and node.js

exports.indexFile = function (req, res) {
  console.log(JSON.stringify(req.body));
  var path = req.body.path,
      ext = req.body.extension,
      ext = ext.toString().toLowerCase();
  if(ext == "pdf" || ext == "doc" || ext == "docx" || ext == "rtf" || ext == "txt") {
    textract(path, function(err, text) {
      console.log(err);
      console.log(text);
      res.send(text);
    });
  } else {
    res.send("File type not supported.");
  }
}

Please let me know asap.

EDIT: I forgot to close the document creator before uploading the files, resulting in a corrupted document. But the RTF question is still open.

Using a docker container for dependencies

I've quickly implemented from a project using this currently on my fork where you can find a contribution guide. Its the smallest image out there doing the same at 86MB and you should be able to build the container locally with different versions of node after pulling from the image repository.

In Node v4.2.1 I'm getting child deprecation warnings, which cause the command line tests to fail, and we would have to work out how to compile the drawingtotext binary, as I can't find much documentation other than running make. This might be a separate container which generates the package and hosts it on github.

Let me know your thoughts!

Fork: https://github.com/sidhuko/textract
Github: https://github.com/sidhuko/docker-textract
Docker hub: https://hub.docker.com/r/sidhuko/textract/

With at least docx words can end up smashed together.

2126150Microsoft Macintosh Word011falseW

Only seen this with docx, usually with things like complex footers.

Excessive memory usage?

We've recently began to shard out our text extraction processes and I noticed a significant spike in memory usage. Looks like it's coming from this module. Running the following:

var textract = require('textract');
setInterval(function () {
  console.error(process.memoryUsage());
}, 1000);

Results in around 135 MB of memory being used. Comment out the first line and that shoots down to around 10 MB.

Any ideas what's causing this?
