
dbashford / textract

node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!

License: MIT License

textract's Introduction

textract

A text extraction node module.

Currently Extracts...

  • HTML, HTM
  • ATOM, RSS
  • Markdown
  • EPUB
  • XML, XSL
  • PDF
  • DOC, DOCX
  • ODT, OTT (experimental, feedback needed!)
  • RTF
  • XLS, XLSX, XLSB, XLSM, XLTX
  • CSV
  • ODS, OTS
  • PPTX, POTX
  • ODP, OTP
  • ODG, OTG
  • PNG, JPG, GIF
  • DXF
  • application/javascript
  • All text/* mime-types.

In almost all cases above, what textract cares about is the mime type. So .html and .htm, both possessing the same mime type, will be extracted. Other extensions that share mime types with those above should also extract successfully. For example, application/vnd.ms-excel is the mime type for .xls, but also for 5 other file types.
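To illustrate the point about shared mime types, here is a minimal sketch of an extension lookup. The table and the `mimeFor` helper are hypothetical and are not textract's internals; the `application/vnd.ms-excel` entry comes from the paragraph above, and `text/html` is the standard type for both `.html` and `.htm`.

```javascript
var path = require('path');

// Hypothetical, deliberately tiny extension-to-mime table for illustration.
var MIME_BY_EXT = {
  '.html': 'text/html',
  '.htm': 'text/html',
  '.xls': 'application/vnd.ms-excel'
};

// Returns the mime type for a file name, or null when the table has no entry.
function mimeFor(fileName) {
  var ext = path.extname(fileName).toLowerCase();
  return Object.prototype.hasOwnProperty.call(MIME_BY_EXT, ext)
    ? MIME_BY_EXT[ext]
    : null;
}
```

Because both HTML extensions resolve to the same type, a dispatcher keyed on mime type handles them with a single extractor.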

Does textract not extract from files of the type you need? Add an issue or submit a pull request. In many cases textract is already capable; it is just not paying attention to the mime type you are interested in.

Install

npm install textract

Extraction Requirements

Note: if any of the requirements below are missing, textract will still run and extract every file type it is capable of handling. A missing tool does not prevent you from using textract; it only prevents extraction of the file types that depend on it.

  • PDF extraction requires pdftotext be installed, link
  • DOC extraction requires antiword be installed, link, unless on OSX in which case textutil (installed by default) is used.
  • RTF extraction requires unrtf be installed, link, unless on OSX in which case textutil (installed by default) is used.
  • PNG, JPG and GIF require tesseract to be available, link. Images need to be pretty clear, high DPI and made almost entirely of just text for tesseract to be able to accurately extract the text.
  • DXF extraction requires drawingtotext be available, link

Configuration

Configuration can be passed into textract. The following configuration options are available

  • preserveLineBreaks: When using the command line this is set to true to preserve stdout readability. When using the library via node this is set to false. Pass this in as true and textract will not strip any line breaks.
  • preserveOnlyMultipleLineBreaks: Some extractors, like PDF, insert line breaks at the end of every line, even in the middle of a sentence. If this option (default false) is set to true, single line breaks are removed and runs of multiple line breaks are preserved. Check your output when using this option: paragraphs are only preserved where the source has multiple consecutive breaks.
  • exec: Some extractors (dxf) use node's exec functionality. This setting allows for providing config to exec execution. One reason you might want to provide this config is if you are dealing with very large files. You might want to increase the exec maxBuffer setting.
  • [ext].exec: Each extractor can take specific exec config. Keep in mind many extractors are responsible for extracting multiple types, so, for instance, the odt extractor is what you would configure for odt and odg/odt etc. Check the extractors to see which you want to specifically configure. At the bottom of each is a list of types for which the extractor is responsible.
  • tesseract.lang: A pass-through to tesseract allowing for setting of language for extraction. ex: { tesseract: { lang:"chi_sim" } }
  • tesseract.cmd: tesseract.lang allows a quick means to provide the most popular tesseract option, but if you need to configure more options, you can simply pass cmd. cmd is the string that matches the command-line options you want to pass to tesseract. For instance, to provide language and psm, you would pass { tesseract: { cmd:"-l chi_sim -psm 10" } }
  • pdftotextOptions: This is a proxy options object to the library textract uses for pdf extraction: pdf-text-extract. Options include ownerPassword, userPassword if you are extracting text from password protected PDFs. IMPORTANT: textract modifies the pdf-text-extract layout default so that, instead of layout: layout, it uses layout:raw. It is not suggested you modify this without understanding what trouble that might get you in. See this GH issue for why textract overrides that library's default.
  • typeOverride: Used with fromUrl, if set, rather than using the content-type from the URL request, will use the provided typeOverride.
  • includeAltText: When extracting HTML, whether or not to include alt text with the extracted text. By default this is false.
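The effect described for preserveOnlyMultipleLineBreaks can be sketched as follows. This is an illustration of the behaviour only; `collapseSingleLineBreaks` is a hypothetical helper and is not the regular expression textract itself uses.

```javascript
// Replace a lone "\n" with a space; runs of two or more "\n" are left intact.
function collapseSingleLineBreaks(text) {
  // Match a newline that follows a non-newline character and is not
  // followed by another newline, i.e. a single break inside a paragraph.
  return text.replace(/([^\n])\n(?!\n)/g, '$1 ');
}
```

With this sketch, a PDF-style hard wrap inside a sentence becomes a space, while a blank line between paragraphs survives.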

To use this configuration at the command line, prefix each option with a --.

Ex: textract image.png --tesseract.lang=deu
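From Node, the same options are passed as a plain object. The sketch below uses only option names listed above with example values; `extractImageText` is a hypothetical wrapper, and the textract module is passed in as a parameter so the sketch can be exercised with a stand-in (in real use it would be `require('textract')`).

```javascript
// Config mirroring the options documented above; the values are examples only.
var exampleConfig = {
  preserveLineBreaks: true,
  tesseract: { lang: 'deu' },
  exec: { maxBuffer: 500000 }
};

// Hypothetical wrapper: forwards the config to the documented
// fromFileWithPath(filePath, config, callback) signature.
function extractImageText(textract, filePath, callback) {
  textract.fromFileWithPath(filePath, exampleConfig, callback);
}
```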

Usage

Command Line

If textract is installed globally, via npm install -g textract, then the following command will write the extracted text to the console for a file on the file system.

$ textract pathToFile

Flags

Configuration flags can be passed into textract via the command line.

textract pathToFile --preserveLineBreaks false

Parameters like exec.maxBuffer can be passed as you'd expect.

textract pathToFile --exec.maxBuffer 500000

And multiple flags can be used together.

textract pathToFile --preserveLineBreaks false --exec.maxBuffer 500000
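The dotted flag names above line up with nested keys in the config object. As a rough illustration, a script that shells out to the CLI could derive the flags from a config object; `toFlags` is a hypothetical helper and not part of textract.

```javascript
// Flatten a nested config object into ["--dotted.key", "value", ...] pairs.
function toFlags(config, prefix) {
  var flags = [];
  Object.keys(config).forEach(function (key) {
    var name = prefix ? prefix + '.' + key : key;
    var value = config[key];
    if (value !== null && typeof value === 'object') {
      // Nested objects such as { exec: { maxBuffer: ... } } become exec.maxBuffer.
      flags = flags.concat(toFlags(value, name));
    } else {
      flags.push('--' + name, String(value));
    }
  });
  return flags;
}
```

For the last command above, the input would be `{ preserveLineBreaks: false, exec: { maxBuffer: 500000 } }`.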

Node

Import

var textract = require('textract');

APIs

There are several ways to extract text. For all methods, the extracted text and an error object are passed to a callback.

error will contain informative text about why the extraction failed. If textract does not currently extract files of the type provided, a typeNotFound flag will be tossed on the error object.
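A callback can use the typeNotFound flag to tell an unsupported file type apart from a failed extraction. The sketch below is one way to do that; `describeResult` is a hypothetical helper, and the message wording is invented for illustration.

```javascript
// Turn the (error, text) callback arguments into a short status string.
function describeResult(error, text) {
  if (error && error.typeNotFound) {
    // textract has no extractor registered for this mime type.
    return 'unsupported type: ' + error.message;
  }
  if (error) {
    return 'extraction failed: ' + error.message;
  }
  return 'extracted ' + text.length + ' characters';
}
```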

File
textract.fromFileWithPath(filePath, function( error, text ) {})
textract.fromFileWithPath(filePath, config, function( error, text ) {})
File + mime type
textract.fromFileWithMimeAndPath(type, filePath, function( error, text ) {})
textract.fromFileWithMimeAndPath(type, filePath, config, function( error, text ) {})
Buffer + mime type
textract.fromBufferWithMime(type, buffer, function( error, text ) {})
textract.fromBufferWithMime(type, buffer, config, function( error, text ) {})
Buffer + file name/path
textract.fromBufferWithName(name, buffer, function( error, text ) {})
textract.fromBufferWithName(name, buffer, config, function( error, text ) {})
URL

When passing a URL, the URL can either be a string, or a node.js URL object. Using the URL object allows fine grained control over the URL being used.

textract.fromUrl(url, function( error, text ) {})
textract.fromUrl(url, config, function( error, text ) {})
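As a small usage sketch for the URL API, the typeOverride option from the Configuration section can be supplied when the server's content-type is not trustworthy. `extractFromUrl` is a hypothetical wrapper; the textract module is injected as a parameter so the sketch runs with a stand-in object.

```javascript
// Hypothetical wrapper around the documented fromUrl(url, config, callback).
// When mimeType is given, it is passed as typeOverride so the response's
// content-type header is not used.
function extractFromUrl(textract, address, mimeType, callback) {
  var config = mimeType ? { typeOverride: mimeType } : {};
  textract.fromUrl(address, config, callback);
}
```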

Testing Notes

Running Tests on a Mac?

  • sudo port install tesseract-chi-sim
  • sudo port install tesseract-eng
  • You will also want to disable textract's usage of textutil as the tests are based on output from antiword.
    • Go into /lib/extractors/{doc|doc-osx|rtf} and modify the code under if ( os.platform() === 'darwin' ) {. Uncomment the commented lines in these sections.

textract's People

Contributors

agrimm, ahgentil, andre0799, aqum, bbaaxx, davidworkman9, dbashford, dlakata, gitgrimbo, james1x0, jbilcke, kamilziajka, kangik0817, konijnendijk, maxism, nihalagesudraz, olivierb-ob, pags, ripkens, sidhuko, spneto, tracker1, vangorra, voz, weianan, willshiao

textract's Issues

xlsx extractor?

Is it possible to build an extractor for Excel (*.xlsx) files?

Problems with cyrillic symbols

When I run a js file with node.js with the following content (for example with a .doc file):

var textract = require('textract');

textract.fromFileWithPath('test.doc', function( error, text ) {
  if (error) throw error;
  console.log(text);
});

with a .doc file, all Cyrillic symbols are unreadable (but when I run catdoc directly, I can read them), and with a .docx file all Cyrillic symbols are removed.

A bug when extracting from an image with tesseract

Error:

< 29 Mar 22:40:41 - error: [App] Error extracting [[ /XXX/Screen Shot 2014-03-06 at 14.43.23.png ]], exec error: Error: Command failed: read_params_file: Can't open Shot
< read_params_file: Can't open 2014-03-06
< read_params_file: Can't open at
< read_params_file: Can't open 14.43.23.png
< read_params_file: Can't open /YYY/node_modules/textract/lib/extractors/temp/Screen
< read_params_file: Can't open Shot
< read_params_file: Can't open 2014-03-06
< read_params_file: Can't open at
< read_params_file: Can't open 14.43.23
< Cannot open input file: /XXX

The problem is that the paths are not escaped before calling the tesseract command:

  exec( "tesseract " + filePath + " " + fileTempOutPath + " quiet",

Will submit a pull request fixing the issue.
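One common way to sidestep this class of problem is to avoid building a shell command string at all and pass the paths as separate arguments to Node's `child_process.execFile`. The sketch below shows that idea under that assumption; it is not necessarily the fix that was submitted, and `tesseractArgs` and `runTesseract` are hypothetical names.

```javascript
var childProcess = require('child_process');

// Build an argument array instead of a single command string. Each path is
// one element, so spaces in file names are never split into separate words.
function tesseractArgs(filePath, fileTempOutPath) {
  return [filePath, fileTempOutPath, 'quiet'];
}

// execFile does not go through a shell, so no quoting or escaping is needed.
function runTesseract(filePath, fileTempOutPath, callback) {
  childProcess.execFile('tesseract', tesseractArgs(filePath, fileTempOutPath), callback);
}
```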

PPTX support?

Does it work? Because for me it does not.

$ textract 'test.pptx'
textract not ready, retrying in .5 seconds
textract: 'drawingtotext' does not appear to be installed, so it will be unable to extract DXFs.
textract: 'catdoc' does not appear to be installed, so it will be unable to extract DOCs.
[Error: extract powerpoint, pptx, exec error: Error: stdout maxBuffer exceeded.]

Unable to extract text from doc and docx files

Whenever I run the project I keep getting the following warnings:

textract: 'unzip' does not appear to be installed, so textract will be unable to extract DOCXs.
textract: 'catdoc' does not appear to be installed, so it will be unable to extract DOCs.

I have properly installed the catdoc command and it works in the command prompt via the PATH environment variable.

Also, I am unable to install the unzip module as there is no link provided for it, and I would like to know how to install it.

If anybody can provide some information on this I will be very thankful.

Support for Buffer objects containing a base64 encoded string?

Use case: An Express API for taking DOC/PDF/DOCX and returning text.

Rather than uploading a file to the server and then having textract read that off the disk it would be preferable to take a base64 encoded DOC/PDF/DOCX file sent as a string in a POST request, put it in a buffer, and then have textract read that buffer.
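With the Buffer + mime type API listed in the README above, this use case could look roughly like the sketch below. `extractFromBase64` is a hypothetical helper, and the textract module is passed in so the sketch can be run against a stand-in; whether a given extractor copes with buffers for every file type is not confirmed here.

```javascript
// Decode a base64 string into a Buffer and hand it to the documented
// fromBufferWithMime(type, buffer, callback) signature.
function extractFromBase64(textract, mimeType, base64String, callback) {
  var buffer = Buffer.from(base64String, 'base64');
  textract.fromBufferWithMime(mimeType, buffer, callback);
}
```

In an Express route, `base64String` would come from the POST body and `mimeType` from a field the client sends alongside it.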

Support for more file formats

Hi David,

Do you have plans to update textract to support .ppt, .xlsx, .xltx, .potx, .key, .pages, .xml? I'd also love to see support for OpenOffice file formats, like .odt, .ott, .ods, .ots, .odg, .otg, .odp, .otp.

Thanks!

pdf-to-text version upgrade

We've noticed that the pdf-text-extract npm module has been updated (now at 1.1.2).

This new version fixes some problems we have been having where warnings in the extraction process come back as errors and thus we do not get the extracted text.

Any chance we can get the package.json file updated to use 1.1.2 for pdf-text-extract?

Update NPM?

Any chance of getting an update on NPM so we can have the pptx extractor? I actually wrote a pptx extractor, then noticed you had done it already!

Preserve newline behaviour

For me, the preserve newline behaviour isn't quite working as I expected (tested with the docx extractor).

I have text like this in a docx file:

2 downlighters; door to hall.

Hall
Double glazed window to front;

With preserveLineBreaks I get this output:

2 downlighters; door to hall. Hall
Double glazed window to front;

After outputting some stuff to the console I can see the newlines are there as expected but then they get parsed out.

Taking a look at how preserveLineBreaks is implemented I see it's a big, hairy regex, so not sure what it is doing at first glance. From my naive point of view it would be nicer to get the raw text output, if I need to filter further I can make my own mind. Or if there is a 'clean' function as a configuration option I could use it to override the default behaviour.

Disable info text?

I don't have the DXF conversion software installed, so every time I do a text extraction I get info warning text saying INFO: 'drawingtotext' does not appear to be installed, so textract will be unable to extract DXFs. and then the text of my document after it. Is there any way to disable this?

Add CSV support

We have a requirement for CSV support in a project. Would this be useful to use a popular npm library with the same interface as textract?

I will be able to PR my work early next week.

.docx extractor options

It looks like the options passed to other extractors are not used by the .docx extraction process. textract's APIs pass an empty string back to the callback for large .docx files (tested with a .docx of around 400 pages).

Reading files from S3

Hi David!

Trying to create an endpoint in an Express server like this:

app.get('/textract', function(req, res, next) {
  textract("https://s3.amazonaws.com/testbucket1a2b3c/test.pdf", function(error, text) {
    console.log(error);
    res.end();
  });
});

Console returns [Error: File at path [[ https://s3.amazonaws.com/testbucket1a2b3c/test.pdf ]] does not exist.]

What does this mean exactly? Textract only works with local files? (in this case my file is uploaded to S3). Thanks!
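The error suggests the string was treated as a local file path. Assuming the installed version exposes the fromUrl API described in the README above, a handler along the following lines may be closer to what is wanted. `makeTextractHandler` is a hypothetical factory; the textract module and the URL are passed in so the sketch can be tried with stand-ins.

```javascript
// Returns an Express-style handler that extracts text from a remote file
// via the documented fromUrl(url, callback) signature.
function makeTextractHandler(textract, fileUrl) {
  return function (req, res, next) {
    textract.fromUrl(fileUrl, function (error, text) {
      if (error) {
        // Hand the failure to Express's error handling.
        return next(error);
      }
      res.end(text);
    });
  };
}
```

It would be registered as `app.get('/textract', makeTextractHandler(require('textract'), someUrl))`, where `someUrl` is a publicly readable object URL.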

make the temp folder in an actual temp location

I installed textract:

$ sudo npm install -g textract

Every invocation of textract seems to fail:

$ textract -h

fs.js:647
  return binding.mkdir(pathModule._makeLong(path),
                 ^
Error: EACCES, permission denied '/usr/local/lib/node_modules/textract/lib/extractors/temp'
    at Object.fs.mkdirSync (fs.js:647:18)
    at Object.<anonymous> (/usr/local/lib/node_modules/textract/lib/extractors/images.js:83:8)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at module.exports (/usr/local/lib/node_modules/textract/lib/extract.js:85:10)
    at Array.map (native)

This happens on OSX because the module was installed as root but invoked as a normal user. On linux and osx the temp folder should probably be a proper temporary directory in a location like /tmp
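A minimal sketch of that suggestion, using Node's `os.tmpdir()` and `fs.mkdtempSync`: the `makeTempDir` name and the `textract-` prefix are assumptions for illustration, not what the module does.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Create a unique scratch directory under the OS temp dir, which is
// writable by the invoking user regardless of who installed the module.
function makeTempDir() {
  return fs.mkdtempSync(path.join(os.tmpdir(), 'textract-'));
}
```

Each call returns a fresh directory, so concurrent extractions would not collide on file names.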

Consider replacing catdoc

I tried installing catdoc on osx 10.9.3 (for RTF support) using brew as well as from source, and for whatever reason it just does not want to play nice. What formats currently use catdoc? Are there pure-JS text extractors for those formats?

Remove extraneous white space

Get a lot of extractions that'll look something like this

some text                more text             some other text

No need for all the white space.
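The clean-up being asked for amounts to collapsing runs of spaces. A minimal sketch, with `collapseSpaces` as a hypothetical helper rather than textract's own cleansing code:

```javascript
// Collapse any run of two or more spaces or tabs into a single space.
// Line breaks are left alone so paragraph structure is not affected.
function collapseSpaces(text) {
  return text.replace(/[ \t]{2,}/g, ' ');
}
```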

Streams?

Any plans on using node streams?

Some spaces showing up in the middle of words

From here: #5 (comment)

This change has caused random spaces in the middle of words in the .docx files I've been using. It seems to be an issue when either the w:t tag has an attribute of xml-spacing="preserve" or the sibling to the w:t tag w:rPr has a child node of

Here you go:
https://docs.google.com/file/d/0Bxcbem1SSxNoaXRRazcwWG82Y1k/edit
the extracted text will be this:
this is a test docu ment that won t be extracted properly.
should be:
this is a test document that won't be extracted properly.
(the quote thing might be a little harder to fix than the space).

Removes too much whitespace

I am finding that textract is removing all of the line breaks within a document. Commenting out cleanseText seemed to fix it but perhaps a better way would be to specify whether text is 'cleansed' with params?

lang parameter

How do I pass the language that should be used for ocr?

textract function can be invoked before all extractors are loaded

I added a simple call to "textract(filePath, callback)" in my "app.js", like this:

    var textract = require('textract');
    var filePath = "examples/Cosmos.pdf";
    textract(filePath, function( error, text )
    {
        if (error)
        {
            console.log("%s", error);
        }
        else if (!text)
        {
            console.log("Error: no text received");
        }
        else
        {
            // Ignore punctuation for now...
            var terms = text.split(" ");
            console.log("terms found: #%d", terms.length);
        }
    });

When running it via "node app" it reports that "Error: textract does not currently extract files of type [[ application/pdf ]]".

Reading the source I found that the extractor for PDFs was indeed there (under "lib/extractors/") so I added a "console.log()" to "registerExtractor(extractor)" in "lib/extract.js" and I found that the PDF extractor was loaded AFTER my call to "textract()" was "completed".

I rearranged my code as follows and it works (because I'm now waiting 5 seconds for the extractors to be loaded):

var delayedExtraction = function()
{
    textract(filePath, function( error, text )
    {
        if (error)
        {
            console.log("%s", error);
        }
        else if (!text)
        {
            console.log("Error: no text received");
        }
        else
        {
            // Ignore punctuation for now...
            var terms = text.split(" ");
            console.log("terms found: #%d", terms.length);
        }
    });
};
setTimeout(delayedExtraction, 5000);

I know this way it works, but I'd like textract to take care of this concurrency issue in a deterministic way ;-)

Thanks!
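One deterministic alternative to the fixed 5 second delay is to queue calls until loading has finished. The sketch below shows the general pattern only; `createReadyQueue`, `run` and `markReady` are hypothetical names and do not describe how textract handles this.

```javascript
// Queue tasks until markReady() is called, then run them in order.
// Tasks submitted after that point run immediately.
function createReadyQueue() {
  var ready = false;
  var pending = [];
  return {
    run: function (task) {
      if (ready) {
        task();
      } else {
        pending.push(task);
      }
    },
    markReady: function () {
      ready = true;
      while (pending.length) {
        pending.shift()();
      }
    }
  };
}
```

A module using this pattern would call `markReady()` once the last extractor has registered, and wrap each public extraction call in `run(...)`.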

Exceeding buffer error

With larger docx files a buffer exceeded error is generated.

I got around this by modifying:
lib/extractors/docx.js

adding the following to the exec statement near the top of the file:
{maxBuffer: 50000*1024},

Ideally this could be a configurable parameter.

Cheers!

Error: Cannot find module `ppt`

I've just made a deployment with the latest version of the lib (0.17) and get the following error in the log:

/graspeo/current/node_modules/mongoose/node_modules/mongodb/lib/mongodb/db.js:297
          throw err;
                ^
Error: Cannot find module 'ppt'
    at Function.Module._resolveFilename (module.js:338:15)
    at Function.Module._load (module.js:280:25)
    at Function.cls_wrapMethod [as _load] (/graspeo/current/node_modules/newrelic/lib/shimmer.js:208:38)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at Object.<anonymous> (/graspeo/current/node_modules/textract/lib/extractors/ppt.js:2:11)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)

Earlier today everything was fine so I assume this is because of the new release. Changing version back to 0.16 made things work again.

By the way, thanks for the great lib!

support cyrillic

cleanseText removes cyrillic letters.

The cause is that WHITELIST_PRESERVE_LINEBREAKS and WHITELIST_STRIP_LINEBREAKS will remove all unknown characters.

See RegEx with extended alphabet to match all unicode letters.
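In current Node versions, Unicode property escapes offer one way to write such a whitelist. The sketch below assumes a runtime that supports the `u` flag with `\p{...}` classes; `stripUnknownCharacters` is a hypothetical stand-in and not the actual cleanseText implementation.

```javascript
// Keep letters (\p{L}), numbers (\p{N}), punctuation (\p{P}) and whitespace
// from any script; everything else, such as control characters, is removed.
function stripUnknownCharacters(text) {
  return text.replace(/[^\p{L}\p{N}\p{P}\s]/gu, '');
}
```

Note that symbol categories such as currency and math signs are not in this whitelist, so a real implementation would need to decide whether to add `\p{S}`.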

Error: extractNewWordDocument exec error: Error: stdout maxBuffer exceeded

$ textract some-file.docx 
textract not ready, retrying in .5 seconds
textract: 'drawingtotext' does not appear to be installed, so it will be unable to extract DXFs.
[Error: extractNewWordDocument exec error: Error: stdout maxBuffer exceeded.]

Any way to avoid this error? Or is it just something I'm doing wrong?
I don't need drawingtotext, just doc and docx, I guess?

ODT Support

Is ODT support in the pipeline?

Also with docx files "preserveLineBreaks" does not seem to work.

PPTX missing newlines, writes error messages to stdout

I took the test file and used powerpoint to save as an RTF file. Using textutil on OSX, I generated a baseline. Ideally, textract should produce the exact text:

$ textutil -convert txt layout_types_2011.rtf # creates layout_types_2011.txt
$ textract layout_types_2011.pptx 2>/dev/null >layout_types_2011.textract
$ diff layout_types_2011.txt layout_types_2011.textract

While the differences might be conscious decisions, it's worth clarifying:

A) the line "textract not ready, retrying in .5 seconds" is printed to stdout. This probably should be printed to stderr: https://github.com/dbashford/textract/blob/master/lib/extract.js#L72 should use console.error rather than console.log

B) Newlines are completely lost. For example, slide 10 reads

Who thought this would be a good idea?

Unfortunately the arrow keys act relative to the screen rather than the text

The entire input situation is confusing

but textract is writing

Who thought this would be a good idea? Unfortunately the arrow keys act relative to the screen rather than the text The entire input situation is confusing

C) The … character U+2026 is missing (is that intentional?)

Parsing issues.

Receiving the following error when trying get text from simple docx. http://www.filedropper.com/testres
[Error: extractNewWordDocument exec error: Error: Command failed: [tests/testres.docx] End-of-central-directory signature not found. Either this file is not a zipfile, or it constitutes one disk of a multi-part archive. In the latter case the central directory and zipfile comment will be found on the last disk(s) of this archive. note: tests/testres.docx may be a plain executable, not an archive ]

Do you support rtf? Should I be forcing a file type?
{ [Error: textract does not currently extract files of type [[ application/rtf ]]] typeNotFound: true }

Parsing a plain .txt http://www.filedropper.com/testres_1 I receive

*********************** C o u r i e r N e w

Here's the server thats trying to parse these files. Using express and node.js

exports.indexFile = function (req, res) {
  console.log(JSON.stringify(req.body));
  var path = req.body.path,
      ext = req.body.extension.toString().toLowerCase();
  if (ext == "pdf" || ext == "doc" || ext == "docx" || ext == "rtf" || ext == "txt") {
    textract(path, function(err, text) {
      console.log(err);
      console.log(text);
      res.send(text);
    });
  } else {
    res.send("File type not supported.");
  }
}

Please let me know asap.

EDIT: I forgot to close the document creator before uploading the files, resulting in a corrupted document. But the RTF question is still open.

Using a docker container for dependencies

I've quickly implemented from a project using this currently on my fork where you can find a contribution guide. Its the smallest image out there doing the same at 86MB and you should be able to build the container locally with different versions of node after pulling from the image repository.

In Node v4.2.1 I'm getting child deprecation warnings which are failing the command line tests, and we would have to work out how to compile the drawingtotext binary as I can't find much documentation other than making. This might be a separate container which generates the package and hosts it on github.

Let me know your thoughts!

Fork: https://github.com/sidhuko/textract
Github: https://github.com/sidhuko/docker-textract
Docker hub: https://hub.docker.com/r/sidhuko/textract/

Excessive memory usage?

We've recently begun to shard out our text extraction processes and I noticed a significant spike in memory usage. Looks like it's coming from this module. Running the following:

var textract = require('textract');
setInterval(function () {
  console.error(process.memoryUsage());
}, 1000);

Results in around 135 MB of memory being used. Comment out the first line and that shoots down to around 10 MB.

Any ideas what's causing this?
