
richardscottoz / amazon-textract-textractor


This project forked from aws-samples/amazon-textract-textractor


Analyze documents with Amazon Textract and generate output in multiple formats.

License: Apache License 2.0

Python 100.00%


Textractor

textractor helps speed up proofs of concept (PoCs) by letting you quickly extract text, forms, and tables from documents using Amazon Textract. It can generate output in several formats: raw JSON, JSON for each page in the document, text, text in reading order, key/value pairs exported as CSV, and tables exported as CSV. It can also generate insights or translate detected text using Amazon Comprehend, Amazon Comprehend Medical, and Amazon Translate. It uses the Textract response parser library to consume the JSON returned by Amazon Textract.

Prerequisites

Setup

  • Download the code and unzip it on your local machine.
  • Run python -m pip install -r requirements.txt

Usage

Format:

  • python3 textractor.py --documents [file|folder|S3Object|S3Folder] --text --forms --tables --region [AWSRegion] --insights --medical-insights --translate [LanguageCode]

Examples:

  • python3 textractor.py --documents mydoc.jpg --text
  • python3 textractor.py --documents ./mydocs/ --text --forms --tables
  • python3 textractor.py --documents s3://mybucket/mydoc.pdf --text --forms --tables
  • python3 textractor.py --documents s3://mybucket/myfolder/ --forms
  • python3 textractor.py --documents s3://mybucket/myfolder/ --text --forms --tables --region us-east-1 --insights --medical-insights --translate es

The path to a folder, whether on a local drive or in an S3 bucket, must end with /

At least one of the flags --text, --forms, or --tables is required. You can combine all three.

--region is optional. For local files and folders, the default is us-east-1. For documents in S3, the region of the S3 bucket is used as the default AWS region for calling Amazon Textract.

--insights, --medical-insights and --translate are optional.
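
The rules above can be sketched with argparse. This is only an illustration under stated assumptions, not the tool's actual parser: the function names parse_args and classify_documents are hypothetical, the flag spellings are taken from the usage line, and the S3 bucket region lookup is left out (the region stays unset for S3 paths).

```python
import argparse


def classify_documents(path):
    """Classify a --documents value using the rules described above."""
    is_s3 = path.startswith("s3://")
    is_folder = path.endswith("/")  # folder paths must end with "/"
    if is_s3:
        return "s3-folder" if is_folder else "s3-object"
    return "local-folder" if is_folder else "local-file"


def parse_args(argv):
    """Parse and validate the command line (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="textractor.py")
    parser.add_argument("--documents", required=True)
    parser.add_argument("--text", action="store_true")
    parser.add_argument("--forms", action="store_true")
    parser.add_argument("--tables", action="store_true")
    parser.add_argument("--region")  # optional
    parser.add_argument("--insights", action="store_true")
    parser.add_argument("--medical-insights", action="store_true")
    parser.add_argument("--translate", metavar="LanguageCode")
    args = parser.parse_args(argv)

    # At least one of the three extraction flags is required.
    if not (args.text or args.forms or args.tables):
        parser.error("at least one of --text, --forms, --tables is required")

    # Local files and folders default to us-east-1; for S3 the bucket's
    # region would be looked up instead (not shown in this sketch).
    if args.region is None and not args.documents.startswith("s3://"):
        args.region = "us-east-1"
    return args
```

For example, parse_args(["--documents", "./mydocs/", "--text"]) yields a namespace with text set and region defaulted to us-east-1, while omitting all three extraction flags exits with a usage error.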

Generated Output

The tool generates several files, named as follows:

Text, forms and tables related output files

  • document-response.json: Raw JSON response of Amazon Textract API call.
  • document-page-n-response.json: Raw JSON blocks for each page in the document.
  • document-page-n-text.txt: Detected text for each page in the document.
  • document-page-n-text-inreadingorder.txt: Detected text in reading order (multi-column) for each page in the document.
  • document-page-n-forms.csv: Key/Value pairs for each page in the document.
  • document-page-n-tables.csv: Tables detected for each page in the document.
  • document-page-n-table-n-tables.csv: Pretty-printed tables detected for each page in the document.
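
To make the naming pattern concrete, here is a small sketch that builds the per-page file names listed above. It assumes that "document" stands for the input file's name and "n" for the page number; both are assumptions about the pattern, and page_output_names is a hypothetical helper, not part of the tool.

```python
# Suffixes copied from the per-page file names listed above.
PAGE_SUFFIXES = {
    "response": "response.json",
    "text": "text.txt",
    "text_in_reading_order": "text-inreadingorder.txt",
    "forms": "forms.csv",
    "tables": "tables.csv",
}


def page_output_names(document, page):
    """Return the per-page output file names for one page of a document."""
    return {
        kind: "{}-page-{}-{}".format(document, page, suffix)
        for kind, suffix in PAGE_SUFFIXES.items()
    }
```

Under those assumptions, page_output_names("mydoc", 2)["forms"] gives mydoc-page-2-forms.csv.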

Insights related output files

  • document-page-n-insights-entities.csv: Entities in detected text for each page in the document.
  • document-page-n-insights-sentiment.csv: Sentiment in detected text for each page in the document.
  • document-page-n-insights-keyPhrases.csv: Key phrases in detected text for each page in the document.
  • document-page-n-insights-syntax.csv: Syntax in detected text for each page in the document.
  • document-page-n-medical-insights-entities.csv: Medical entities in detected text for each page in the document.
  • document-page-n-medical-insights-phi.json: Protected health information (PHI) in detected text for each page in the document.
  • document-page-n-text-translation.txt: Translation of detected text for each page in the document.
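
As an illustration of how an insights file such as the entities CSV could be produced, the sketch below turns an Amazon Comprehend DetectEntities-style response into CSV text. The response shape (an "Entities" list with "Type", "Text" and "Score") follows the Comprehend API; the column order and the helper name entities_to_csv are assumptions, and the tool's actual CSV layout may differ.

```python
import csv
import io


def entities_to_csv(response):
    """Render a DetectEntities-style response as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Type", "Text", "Score"])  # assumed column order
    for entity in response.get("Entities", []):
        writer.writerow([entity["Type"], entity["Text"], entity["Score"]])
    return buffer.getvalue()
```

The returned string can then be written to a file named after the page, for example document-page-n-insights-entities.csv.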

Arguments

Argument            Description
--documents         Name of the document or local folder/S3 bucket
--text              Extract text from the document
--forms             Extract key/value pairs from the document
--tables            Extract tables from the document
--region            AWS region to use for Amazon Textract API call. us-east-1 is default.
--insights          Generate files with sentiment, entities, syntax, and key phrases.
--medical-insights  Generate files with medical entities and PHI.
--translate         Generate file with translation.

Source Code


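# Imports (module names assumed from this repository's layout: tdp.py and trp.py)
from tdp import DocumentProcessor
from trp import Document

# Call Amazon Textract and get JSON response.
# bucketName, filePath, awsRegion, detectText, detectForms and tables
# are assumed to be set by the caller.
docproc = DocumentProcessor(bucketName, filePath, awsRegion, detectText, detectForms, tables)
response = docproc.run()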

# Get DOM
doc = Document(response)

# Iterate over elements in the document
for page in doc.pages:
    # Print lines and words
    for line in page.lines:
        print("Line: {}--{}".format(line.text, line.confidence))
        for word in line.words:
            print("Word: {}--{}".format(word.text, word.confidence))
    
    # Print tables
    for table in page.tables:
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print("Table[{}][{}] = {}-{}".format(r, c, cell.text, cell.confidence))

    # Print fields (a key may have no detected value)
    for field in page.form.fields:
        value = field.value.text if field.value else ""
        print("Field: Key: {}, Value: {}".format(field.key.text, value))

    # Get field by key
    key = "Phone Number:"
    field = page.form.getFieldByKey(key)
    if field:
        print("Field: Key: {}, Value: {}".format(field.key, field.value))

    # Search fields by key
    key = "address"
    fields = page.form.searchFieldsByKey(key)
    for field in fields:
        print("Field: Key: {}, Value: {}".format(field.key, field.value))
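
For comparison, the raw Textract JSON can also be read without the response parser. The sketch below groups the text of LINE blocks by page, relying only on the "Blocks", "BlockType", "Text" and "Page" keys of a Textract response. The helper name lines_by_page is hypothetical, and falling back to page 1 when "Page" is missing is an assumption made for this sketch.

```python
from collections import defaultdict


def lines_by_page(response):
    """Group the text of LINE blocks by page number."""
    pages = defaultdict(list)
    for block in response.get("Blocks", []):
        if block.get("BlockType") == "LINE":
            # Fall back to page 1 if the block carries no "Page" key.
            pages[block.get("Page", 1)].append(block["Text"])
    return dict(pages)
```

The Document class shown above does this and more (words, tables, form fields), so a helper like this is only useful for quick inspection of a saved response file.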

Cost

  • As you run this tool, it calls different APIs (Amazon Textract, optionally Amazon Comprehend, Amazon Comprehend Medical, Amazon Translate) in your AWS account. You will get charged for all the API calls made as part of the analysis.

Other Resources

License

This library is licensed under the Apache 2.0 License.
