textract

A text extraction node module.

Currently Extracts...

  • PDF
  • DOC
  • DOCX
  • XLS
  • XLSX
  • XLSB
  • XLSM
  • PPTX
  • DXF
  • PNG
  • JPG
  • GIF
  • RTF
  • application/javascript
  • All text/* mime-types.

Does textract not extract from files of the type you need? Open an issue or submit a pull request. It's super easy to add an extractor for a new mime type.

Install

npm install textract

Requirements

  • PDF extraction requires pdftotext to be installed
  • DOC extraction requires catdoc to be installed
  • RTF extraction requires catdoc to be installed
  • DOCX extraction requires unzip to be available
  • PPTX extraction requires unzip to be available
  • PNG, JPG and GIF extraction requires tesseract to be available. Images need to be clear, high DPI and made up almost entirely of text for tesseract to extract the text accurately.
  • DXF extraction requires drawingtotext to be available
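
The requirements above can be restated as a lookup table, which is handy for failing early with a clear message before calling textract. This is only a sketch: `REQUIRED_BINARY` and `requiredBinaryFor` are hypothetical names, not part of textract, and the table simply mirrors the list above.

```javascript
// Hypothetical helper, not part of textract: maps a file extension to the
// external binary that the requirements list says must be installed.
const path = require('path');

const REQUIRED_BINARY = {
  pdf: 'pdftotext',
  doc: 'catdoc',
  rtf: 'catdoc',
  docx: 'unzip',
  pptx: 'unzip',
  png: 'tesseract',
  jpg: 'tesseract',
  gif: 'tesseract',
  dxf: 'drawingtotext'
};

// Returns the binary name needed for the given file, or null when the
// requirements list names no external binary for that extension.
function requiredBinaryFor(filePath) {
  const ext = path.extname(filePath).slice(1).toLowerCase();
  return Object.prototype.hasOwnProperty.call(REQUIRED_BINARY, ext)
    ? REQUIRED_BINARY[ext]
    : null;
}

console.log(requiredBinaryFor('report.PDF')); // pdftotext
console.log(requiredBinaryFor('notes.txt'));  // null
```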

Usage

Command Line

If textract is installed globally, via npm install -g textract, then the following command will write the extracted text to the console.

$ textract pathToFile

In your node app

Import

var textract = require('textract');

Execution

If you do not know the mime type of the file

textract(filePath, function( error, text ) {})

If you know the mime type of the file

textract(type, filePath, function( error, text ) {})

If you wish to pass some config and know the mime type

textract(type, filePath, config, function( error, text ) {})

If you wish to pass some config, but do not know the mime type

textract(filePath, config, function( error, text ) {})
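
The four call forms above share one shape: an optional mime type, the file path, an optional config, and a Node-style callback in last position. Because the callback comes last and receives (error, text), Node's util.promisify should be able to wrap any of them. The sketch below illustrates that idea. `fakeTextract` is a stand-in stub, an assumption made so the example runs without the module or any external binaries installed, and `toPromise` is a hypothetical helper name. In a real app you would pass require('textract') instead of the stub.

```javascript
const util = require('util');

// Stand-in for textract so this sketch runs anywhere. It mimics only the
// call shape described above: ([type], filePath, [config], callback).
function fakeTextract(...args) {
  const callback = args.pop();            // the callback is always last
  const hasConfig = typeof args[args.length - 1] === 'object';
  const options = hasConfig ? args.pop() : {};
  const filePath = args.pop();            // the path precedes config/callback
  const type = args.pop() || 'unknown';   // the mime type is optional
  callback(null, `${type}|${filePath}|${JSON.stringify(options)}`);
}

// Wrap a textract-like function so it returns a Promise instead of
// taking a callback.
function toPromise(extractFn) {
  return util.promisify(extractFn);
}

const extract = toPromise(fakeTextract);

extract('application/pdf', 'doc.pdf', { preserveLineBreaks: true })
  .then((text) => console.log(text)) // application/pdf|doc.pdf|{"preserveLineBreaks":true}
  .catch((error) => console.error(error.message));
```

The same wrapped function accepts the shorter forms too, for example extract('doc.pdf'), since util.promisify forwards whatever arguments it is given and appends the callback.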

Error will contain informative text about why the extraction failed. If textract does not currently extract files of the type provided, a typeNotFound flag will be set on the error object.

If processing a .gif on OSX, an error will be thrown with its macProcessGif flag set to true. Tesseract has issues with .gifs on OSX.
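
The two flags described above make it possible to tell "unsupported type" apart from other failures. Here is a minimal sketch of how a callback might branch on them. `explainError` is a hypothetical helper, not part of textract, and the errors are simulated so the example runs on its own.

```javascript
// Hypothetical helper: turns an error from textract into a short message
// based on the typeNotFound and macProcessGif flags described above.
function explainError(error) {
  if (!error) {
    return 'ok';
  }
  if (error.typeNotFound) {
    return 'unsupported file type';
  }
  if (error.macProcessGif) {
    return 'gif extraction is disabled on OSX';
  }
  return 'extraction failed: ' + error.message;
}

// Simulated errors, shaped like the ones described above.
const unsupported = Object.assign(new Error('no extractor'), { typeNotFound: true });
const gifOnMac = Object.assign(new Error('gif on OSX'), { macProcessGif: true });

console.log(explainError(unsupported)); // unsupported file type
console.log(explainError(gifOnMac));    // gif extraction is disabled on OSX
console.log(explainError(null));        // ok
```

Inside a real callback, the helper would be called with the error argument before touching the text.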

Configuration

Configuration can be passed into textract. The following configuration options are available:

  • preserveLineBreaks: By default textract does NOT preserve line breaks. Pass this in as true and textract will not strip any line breaks.
  • exec: Some extractors (xlsx, docx, dxf) use node's exec functionality. This setting passes config through to exec. One reason to provide it is very large files, where you might want to increase the exec maxBuffer setting.
  • [ext].exec: Each extractor can take its own specific exec config.
  • macProcessGif: By default on OSX textract will not run tesseract on .gif files (see this Stack Overflow post). If you've figured out how to make it work, set this flag to true to turn gif processing back on.
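
As a rough illustration, the options above might be combined into one config object like this. `buildConfig` is a hypothetical helper, and the nested `docx.exec` shape is an assumption about how the `[ext].exec` option is meant to be written. maxBuffer is the standard option of node's child_process.exec, measured in bytes.

```javascript
// Hypothetical helper that assembles a config object from the options
// listed above. The nested `docx.exec` key is an assumed reading of the
// `[ext].exec` option, not something confirmed by the list itself.
function buildConfig(maxBufferBytes) {
  return {
    preserveLineBreaks: true,                          // keep line breaks in the output
    exec: { maxBuffer: maxBufferBytes },               // shared exec config
    docx: { exec: { maxBuffer: maxBufferBytes * 2 } }, // per-extractor exec config
    macProcessGif: false                               // leave gif processing off on OSX
  };
}

const config = buildConfig(1024 * 1024);
console.log(config.exec.maxBuffer);      // 1048576
console.log(config.docx.exec.maxBuffer); // 2097152
```

The resulting object would then be passed as the config argument, for example textract(filePath, config, callback).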

Release Notes

0.12.0

  • #21, #22: Now using j via its binaries rather than via node. This makes XLS/X extraction slower, but reduces the memory consumption of textract significantly.

0.11.2

  • Updated pdf-text-extract to latest, fixes #20.

0.11.1

  • Addressed path escaping issues with tesseract, fixes #18.

0.11.0

  • Using j to handle xls and xlsx. This removes the requirement on the xls2csv binary.
  • j also supports xlsb and xlsm.
