Post-process Amazon Textract results with Hugging Face transformer models for document understanding

License: MIT No Attribution

Python 55.72% Jupyter Notebook 31.44% HTML 7.46% Batchfile 0.04% Dockerfile 0.40% JavaScript 0.24% Vue 3.69% SCSS 0.09% TypeScript 0.92%

amazon-textract huggingface-transformers document-analysis ocr

amazon-textract-transformer-pipeline's Introduction

Trainable Document Extraction with Transformer Models on Amazon SageMaker

To automate document-based business processes, we usually need to extract specific, standard data points from diverse input documents: For example, vendor and line-item details from purchase orders; customer name and date-of-birth from identity documents; or specific clauses in contracts.

In human-readable documents, both layout and text content are important to extract meaning: So accuracy may be disappointing when using text-only methods (like regular expressions or entity recognition models), position-only methods (like template-based models), or manual combinations of the two.

This sample and accompanying blog post demonstrate trainable, multi-modal, layout-aware document understanding on AWS using Amazon SageMaker and open-source models from Hugging Face Transformers - optionally integrated with Amazon Textract.

Solution Overview

This sample sets up a document processing pipeline orchestrated by AWS Step Functions, as shown below:

Documents uploaded to the input bucket automatically trigger the workflow, which:

Extracts document data using Amazon Textract (or an alternative OCR engine).
Enriches the Textract/OCR JSON with extra insights using an ML model deployed in SageMaker.
Consolidates the business-level fields in a post-processing Lambda function.
(If any expected fields were missing or low-confidence), forwards the results to a human reviewer.

In the provided example, input documents are specimen credit card agreements per this dataset published by the US Consumer Finance Protection Bureau (CFPB).

We define a diverse set of "fields" of interest: from short entity-like fields (such as provider name, credit card name, and effective date), to longer and more flexible concepts (like "minimum payment calculation method" and "terms applicable to local areas only").

Bounding box ananotations are collected to train a SageMaker ML model to classify the words detected by Amazon Textract between these field types - using both the content of the text and the location/size of each word as inputs:

From this enriched JSON, the post-processing Lambda can apply simple rule-based logic to extract and validate the fields of interest.

For any documents where required fields could not be detected confidently, the output is forwarded to human review using Amazon Augmented AI (A2I) with a customized task template UI:

By orchestrating the process through AWS Step Functions (rather than point-to-point integrations between each stage for example), we gain the benefits that the overall flow can be graphically visualized and individual stages can be more easily customized. A wide range of integration options are available for adding new stages and storing results (or triggering further workflows) from the output:

Getting Started

To deploy this sample you'll need access to your target AWS Account with sufficient permissions to deploy the various resources created by the solution (which includes IAM resources).

Skipping local setup

Steps 0-3 below are for locally building and deploying the CDK solution and require setting up some developer-oriented tools. If you can't do this or prefer not to, you can deploy the following "bootstrap" CloudFormation stack and skip to step 4:

(Or use .infrastructure/attp-bootstrap.cfn.yaml)

⚠️ Note before using this option:

This bootstrap stack grants broad IAM permissions to the created AWS CodeBuild project role for deploying the sample - so it's not recommended for use in production environments.

If using the 'Launch Stack' button above, remember to check it opens the Console in the correct AWS Region you want to deploy in and switch regions if necessary.

If CDK deployment itself takes longer than 60 minutes for some reason (e.g. customization, alternative configuration), the bootstrap stack will time out and fail. We think this shouldn't happen in normal usage, but if you see it: Remove the CodeBuildCallback setting in your own copy of .infrastructure/attp-bootstrap.cfn.yaml, and please let us know in the GitHub issues!

The bootstrap stack pretty much runs the following steps 0-3 for you in an AWS CodeBuild environment, instead of locally on your computer:

Step 0: Local build prerequisites

To build and deploy this solution, you'll first need:

The AWS CDK (v2), which depends on NodeJS.
Python v3.8+
Docker installed and running (which is used for bundling Python Lambda functions).
- You can see the discussion here on using podman as an alternative with CDK if Docker Desktop licensing is not available for you.
(Optional but recommended) consider using Poetry rather than Python's built-in pip.

CDK Lambda Function bundling uses container images hosted on Amazon ECR Public, so you'll likely need to authenticate to pull these. For example:

# (Always us-east-1 here, regardless of your target region)
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws

You'll also need to bootstrap your AWS environment for CDK with the modern template if you haven't already. For example:

# Assuming your AWS CLI Account and AWS_REGION / AWS_DEFAULT_REGION are set:
export CDK_NEW_BOOTSTRAP=1
cdk bootstrap

🚀 To try the solution out with your own documents and entity types: Review the standard steps below first, but find further guidance in CUSTOMIZATION_GUIDE.md.

Step 1: Set up and activate the project's virtual environment

If you're using Poetry, you should be able to simply run (from this folder):

poetry shell

Otherwise, you can instead run:

# (Create the virtual environment)
python3 -m venv .venv
# (Activate the virtual environment)
source .venv/bin/activate

Depending on your setup, your Python v3 installation may simply be called python instead of python3. If you're on Windows, you can instead try to activate by running .venv\Scripts\activate.bat.

Step 2: Install project dependencies

Once the virtual environment is active, install the required dependencies by running:

# For Poetry:
poetry install --no-root
# ...OR, without Poetry:
pip install -e .[dev]

Step 3: Deploy the solution stacks with CDK

To deploy (or update, if already deployed) all stacks in the solution to your default AWS Region, with default configuration, run:

cdk deploy --all
# To skip approval prompts, you can optionally add: --require-approval never

Some aspects of the solution can also be configured by setting environment variables before running cdk deploy. See cdk_app.py and the Customization Guide for details, but the defaults are a good starting point.

Note that some AWS Regions may not support all services required to run the solution, but it has been tested successfully in at least ap-southeast-1 (Singapore), us-east-1 (N. Virginia), and us-east-2 (Ohio).

You'll be able to see the deployed stacks in the AWS CloudFormation Console.

Step 4: Set up your notebook environment in Amazon SageMaker

The solution stack does not automatically spin up a SageMaker notebook environment because A) there are multiple options and B) users may have an existing one already.

If you're able, we recommend following the instructions to onboard to SageMaker Studio for a more fully-featured user experience and easier setup.

If this is not possible, you can instead choose to create a classic SageMaker Notebook Instance. If using a classic notebook instance:

Instance type ml.t3.medium and volume size 30GB should be sufficient to run through the sample
This repository uses interactive notebook widgets, so you'll need to install extension @jupyter-widgets/jupyterlab-manager either through the extensions manager UI (Settings > Enable Extension Manager) or by attaching a lifecycle configuration script customizing the install-lab-extension template to to reference this particular extension.

Whichever environment type you use, the execution role attached to your Notebook Instance or Studio User Profile will need some additional permissions beyond basic SageMaker access:

Find your deployed OCR pipeline stack in the AWS CloudFormation Console and look up the DataSciencePolicyName from the Outputs tab.
Next, find your studio user's or notebook instance's SageMaker execution role in the AWS IAM Roles Console
Click the Attach policies button to attach the stack's created data science policy to your SageMaker role.

⚠️ Warning: The managed AmazonSageMakerFullAccess policy that some SageMaker on-boarding guides suggest to use grants a broad set of permissions. This can be useful for initial experimentation, but you should consider scoping down access further for shared or production environments.

In addition to the DataSciencePolicy created by the stack, this sample assumes that your SageMaker execution role has permissions to:

Access the internet (to install some additional packages)

Read and write the SageMaker default S3 bucket

Train, and deploy models (and optionally perform automatic hyperparameter tuning) on GPU-accelerated instance types - and invoke the deployed endpoint to test the model

Create SageMaker Ground Truth labeling jobs

Create task templates and workflow definitions in Amazon A2I

To learn more about security with SageMaker, and get started implementing additional controls on your environment, you can refer to the Security section of the SageMaker Developer Guide - as well as this AWS ML Blog Post and associated workshops.

Once your environment is set up, you can:

Open JupyterLab (clicking 'Open Studio' for Studio, or 'Open JupyterLab' for a NBI)
Open a system terminal window in JupyterLab
(If you're on a new SageMaker Notebook Instance, move to the Jupyter root folder with cd ~/SageMaker)
Clone in this repository with:

git clone https://github.com/aws-samples/amazon-textract-transformer-pipeline

Step 5: Follow through the setup notebooks

The Python notebooks in the notebooks/ folder will guide you through the remaining activities to annotate data, train the post-processing model, and configure and test the solution stack. Open each of the .ipynb files in numbered order to follow along.

Background and Use Case Validation

Amazon Textract is a service that automatically extracts text, handwriting, and some structured data from scanned documents: Going beyond simple optical character recognition (OCR) to identify and extract data from tables (with rows and cells), and forms (as key-value pairs).

Many document understanding tasks can already be tackled with Amazon Textract and the Amazon Comprehend AI service for natural language processing. For example:

With the Forms and Tables APIs, Amazon Textract can extract key-value pairs and data tables from documents with no training required.
With Amazon Textract Queries, users can ask simple natural-language questions to guide extraction of specific data points from a page.
With Amazon Comprehend native document NER, users can train layout-aware models to extract entities from documents.
Amazon Textract provides purpose-built APIs for some common document types like invoices and receipts and driving licenses and passports.

This sample demonstrates layout-aware entity recognition, similar to Amazon Comprehend trained with PDF document annotations. In general, users should consider fully-managed AI service options before selecting a more custom SageMaker-based solution like the one shown here. However, this sample may be interesting to users who want:

To process documents in languages not currently supported by Comprehend PDF/document entity recognition - or Amazon Textract Queries.
To customize the modelling tasks - for example for joint entity recognition + document classification, or question answering, or sequence-to-sequence generation.
To integrate new model architectures from research and open source.
To use the end-to-end pipeline already set up here for model training, deployment, and human review.

Next Steps

For this demo we selected the credit card agreement corpus as a good example of a challenging dataset with lots of variability between documents, and realistic commercial document formatting and tone (as opposed to, for example, academic papers).

We also demonstrated detection of field/entity types with quite different characteristics: From short values like annual fee amounts or dates, to full sentences/paragraphs.

The approach should work well for many different document types, and the solution is designed with customization of the field list in mind.

However, there are many more opportunities to extend the approach. For example:

Rather than token/word classification, alternative 'sequence-to-sequence' ML tasks could be applied: Perhaps to fix common OCR error patterns, or to build general question-answering models on documents. Training seq2seq models is discussed further in the Optional Extras notebook.
Just as the BERT-based model was extended to consider coordinates as input, perhaps source OCR confidence scores (also available from Textract) would be useful model inputs.
The post-processing Lambda function could be extended to perform more complex validations on detected fields: For example to extract numerics, enforce regular expression matching, or even call some additional AI service such as Amazon Comprehend.

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file. Included annotation datasets are licensed under the Creative Commons Attribution 4.0 International License. See the notebooks/data/annotations/LICENSE file.

amazon-textract-transformer-pipeline's People

Contributors

Stargazers

Watchers

amazon-textract-transformer-pipeline's Issues

Japanese text not rendering in A2I review UI

When processing PDFs containing digital text in Japanese (for e.g. save this AWS JP ML blog post as PDF and upload to the pipeline), Japanese text is not rendering on the document in the A2I review UI: Only latin text gets preserved with blank space where the Japanese text should be.

In the browser console there are repeated messages like:

Warning: loadFont - translateFont failed: "Error: fetchBuiltInCMap: failed to fetch file "undefinedAdobe-Japan1-UCS2.bcmap" with "Forbidden".".
Warning: Error during font loading: fetchBuiltInCMap: failed to fetch file "undefinedAdobe-Japan1-UCS2.bcmap" with "Forbidden".".

It looks like we may need to enable character mapping in the getDocument call, to allow mapping/translating character sets into available fonts?

[Enhancement] From-scratch model pre-training

This sample currently demonstrates:

Fine-tuning existing models for downstream tasks (NER), and
Continuation pre-training with unlabelled data from an existing model checkpoint.

From-scratch pre-training is considerably more resource-intensive. For example the LayoutXLM paper describes using 64 V100 GPUs (i.e. 8x p3.16xlarge or p3dn.24xlarge instances for several hours) over ~30M documents.

However, some users may still be interested in from-scratch pre-training - especially for low-resource languages or specialised domains - if tested example code was available. Please drop a 👍 or a comment if this is an enhancement that you'd find useful!

Failure to deploy due to outdated Lambda Go runtime

Recently heard from a user facing the following error on CDK deploy:

Resource handler returned message:"The runtime parameter of go1.x is no longer supported for creating or updating AWS Lambda functions. We recommend you use the new runtime (provided.al2023) while creating or updating functions. (Service: Lambda, Status Code: 400, Request ID: 51e30614-01de-45c4-9299-7d3e739c936b)" (RequestToken: 1b0734e2-e230-f80f-541b-a2dd85163213, HandlerErrorCode: InvalidRequest)

It looks like this is due to outdated version of cdk-ecr-deployment, which should now be resolved. Tentatively think upgrading this dependency to cdk-ecr-deployment = ">3.0.13" (in pyproject.toml, then re-running poetry install) should provide a fix, but haven't yet had chance to reproduce and push the update here to the repo.

[Enhancement] Merge layout-aware and generative model components

As of #26, users can train generative models to normalize entity text after extraction: For example to standardize date or currency formats, or correct common OCR error patterns.

This is not ideal though, as the normalization model only sees the specific extracted text without the surrounding context (which could for example give locale cues whether a date is more likely to be MM/DD/YYYY or DD/MM/YYYY).

It would be better if we could directly merge a generative output onto the layout-aware model, and fine tune normalized extraction directly.

CDK-deployed thumbnailer and Tesseract OCR not using increased timeouts/limits

When deploying the pipeline with alternative OCR (Tesseract) configuration I'm seeing both the SageMaker OCR and the thumbnailer endpoint are prone on large docs to internal errors due to payload size (e.g. HTTP 413 request too large) or timing out.

As mentioned on the SageMaker async developer guide:

If you're using a SageMaker provided container, you can increase the model server timeout and payload sizes from the default values to the framework‐supported maximums by setting environment variables in this step. You might not be able to leverage the maximum timeout and payload sizes that Asynchronous Inference supports if you don't explicitly set these variables.

I think there are two issues here:

We seem to be relying on the SageMakerCustomizedDLCModel.max_payload_size parameter to set payload environment variables, but these have been hard-coded to the Multi-Model Server variants (used by HuggingFace and old PyTorch containers) instead of the TorchServe variants (used by current PyTorch containers): E.g. MMS_MAX_REQUEST_SIZE instead of TS_MAX_REQUEST_SIZE.
- Currently both the thumbnailer and the SageMaker OCR option are configured to use PyTorch (v1.10) base container - so should use the TS_ variants instead.
- We should probably remove this max_payload_size option altogether unless the SageMakerCustomizedDLCModel is able to detect the base framework/version of the image being used? Switching the statements to TS_... is a viable workaround but not a good practice in case they change back again in future or e.g. an alternative OCR option uses HuggingFace container as a base.
The TS_DEFAULT_RESPONSE_TIMEOUT env var doesn't seem to be getting set anywhere on the generated models, which is particularly important for OCR as that can be slow.

From my tests so far, adding these TS_ environment variables to the CDK-built thumbnailer and OCR SageMakerCustomizedDLCModels seem to work as a workaround.

[Enhancement] End-to-end support for images (as well as PDFs)

While this sample was originally created for multi-page documents in PDF, other related use-cases (such as ID document or receipt extraction) may operate on single-page images/photographs/scans instead.

Today there's support for images in some aspects of the pipeline, but others assume PDF. It would be great to round out support for images as source documents - particularly for common JPEG+PNG formats which have good native support in e.g. Amazon Textract, SageMaker Ground Truth, and web browsers.

1. (Believe so but need to double-check) Core Textract state machine component supports OCRing image files
2. Notebook entity recognition data prep flow supports image files
3. (Need to check) OCR pipeline trigger and Textract orchestration supports image files
4. (Known gap) A2I human review UI supports image files

[Enhancement] Multi-page training annotation UI

As of now the custom online human review UI is able to render detection bounding boxes over a full multi-page document at once, but the training data annotation UI is based on the SageMaker Ground Truth bounding box tool which can only display one image/page at a time.

This is not ideal in cases where users would like to make model-training annotations on entire documents at a time: For example if context between different pages is important for labellers.

Ideally we would re-use and extend components from the review UI to make a custom annotation UI where we can similarly highlight entities, but process the entire PDF at once rather than a single page image.

[Enhancement] Joint entity recognition and page/document classification

Today we demonstrate annotation and training for entity extraction only. For many use cases document classification is also important, and it should be pretty straightforward to support this too.

A sequence classification task is already supported in the open source (e.g. LayoutLMv2ForSequenceClassification), but joint cls+ner with a single model might be performant and more economical for users - and not require too much extra effort.

[Enhancement] Add visual embeddings (for LayoutLMv2 / LayoutXLM)

More recent successors to the LayoutLM model used in this sample (e.g. LayoutLMv2 and LayoutXLM) make more extensive use of visual embeddings of the page image to boost performance. To get the most out of a possible model architecture upgrade, this sample should probably aim to integrate page image pixel analysis.

Tentative items/components:

1. Incorporate fixed-size page thumbnail generation into batch SageMaker pre-processing job
- This is an additional output alongside the annotation-oriented images, because annotation needs good DPI but model embedding thumbnails are typically low-resolution - 224px square in standard LayoutLMv2
- Processing should be configurable to output either or both of the image types, since annotation images may be required for a smaller subset of the corpus than page thumbnails
2. Incorporate fixed-size page thumbnail generation into online pipeline
- Presumably in parallel with the Amazon Textract step
3. Integrate page thumbnails with SageMaker model training
4. Integrate page thumbnails with SageMaker model inference

[Enhancement] Explicit steps to bring previously-Textracted data

In some cases, users may already have run their corpus through Amazon Textract and want to get started with the sample without taking the cost of re-processing all documents.

Although there's nothing preventing this in the model training code itself today, the notebook walkthrough steps often make S3 structure assumptions. More explicit guidance could greatly reduce the notebook debugging currently required to use pre-Textracted data.

Context

Although the model training itself has a pretty broad interface for accepting JSON-lines manifests like:

{
    "source-ref": "s3://.../.../wherever-your-page-thumbnail-image-is.png",  // images_prefix = "s3://.../..."
    "textract-ref": "s3://.../.../corresponding-textract-result.json", // textract_prefix = "s3://.../..."
    "page-num": 2,  // 1-based number of this page in the textract-ref result
    "labels": { "some-smgt-": "-bbox-compatible-label" },
}

...The notebook sections for preparing/curating the dataset and visualizing results often make more explicit assumptions like:

Textract refs correspond 1:1 with input documents and are at input-doc-path.pdf/consolidated.json
Page thumbnail & full-size images have their S3 paths constructed in particular ways from the raw document URIs.

[Enhancement] SageMaker async inference

To host the model, this sample currently deploys a real-time SageMaker endpoint backed by GPU - which may be fine for high-volume use cases but probably pretty resource-intensive for many.

Since the end-to-end workflow here is asynchronous anyway (may have a human review component), it's probably a good case for new SageMaker asynchronous inference feature which supports scaling down to zero when demand is low.

Wait until async inference is supported in SageMaker Python SDK, rather than confusing the notebook with boto3 endpoint setup (tracking their issue and pull request)
Update notebook 2 to create an async endpoint with scale-to-zero capability
Update pipeline to correctly consume async endpoints

TBD: Do we need to retain a real-time deployment option for anybody that wants to optimize for low latency? Seems unnecessary to me at the moment

Unable to access amazon-textract-transformer-pipeline-assets bucket

We're aware that the amazon-textract-transformer-pipeline-assets S3 bucket used by the "Launch stack" button on the root README is no longer publicly accessible, and working to find a resolution...

In the interim, users can simply download the CloudFormation template from GitHub and upload it from their machine to the AWS CloudFormation console.

LILT

I wanted to ask if this solution would currently support Language-Independent Layout Transformer - RoBERTa model (LiLT)?

If not, I wanted to request that the inference code be updated to support a LiLT model.

[Enhancement] Support self-supervised pre-training

Notebook 2 currently has a placeholder section discussing how model accuracy might be improved by self-supervised pre-training (such as masked language modelling) on a broader corpus of unlabelled data, before fine-tuning on the end task with annotations.

However, the model training scripts in notebooks/src aren't set up for that yet.

It'd be great if we can flesh this out to a point where users can optionally run a pre-training job with Textract JSON only (ideally still with the ability to start from a pre-trained model from HF model zoo); and then use that training job instead of the model zoo models, as the starting point for the fine-tuning task.

[Bug] - Handle non-PDF input documents

trying to run this solution (branch lmv2 on jpg inputs will cause an error. two files need to be updated. submitting an issue instead of PR since this is based on lmv2 v main branch.

Required changes:

1. `preprocess/inference.py`

update the SINGLE_IMAGE_CONTENT_TYPES dictionary on line 520 to include "image/jpg":'JPG"

2. `src/code/inference.py`

update logic for thumbnails to fix the logger message of "Thumbnails expected either array of PNG bytestrings or 4D images array. ". after the logging message add the following code:

if thumbnails.ndim == 3:
    logger.info('Resizing thumbnail of dimension 3 to dimension 4.')
    thumbnails = np.expand_dims(thumbnails, axis=0)

the not images logic also needs to be updated on line 428 and 445.

on line 428, the change is from if processor and not images: to if processor and images is None:. Otherwise it the error will say the comparison with a numpy array is ambigious.

Similariy, on line 445, it must be changed from **({"images": images} if images and processor else {}), to **({"images": images} if images is not None and processor else {}),

[Enhancement] Drop-in alternative open-source OCR engine(s)

This sample is compatible with multi-lingual layout-language models like LayoutXLM, but uses Amazon Textract for initial OCR which today only supports a subset of these languages. For example Thai and Vietnamese are supported by LayoutXLM but not currently by Amazon Textract.

It would be useful for this sample to support easy switching to an alternative, open-source-based OCR - for any users that want to work with low-resource languages.

Design Ideas

In terms of engine, would be interesting to compare options e.g. EasyOCR, TrOCR, Tesseract.
The OCR response format should be wrapped to be Amazon Textract-like, to simplify users switching to fully-managed AI service if and when possible.
Maybe SageMaker Async Inference could be a nice platform for the OCR deployment? Gives lots of infrastructure + timing/payload flexibility, and the SNS callback mechanism is similar to using Amazon Textract works anyway.
Maybe a CDK construct option to toggle between Textract or OSS, rather than deploying supporting infrastructure for both?

2.Model Training.ipynb not working

When trying to run this cell, I got 'sndfile library not found' error. Even after I pip install the packages, the issue is still not resolved. Can anyone suggest how to fix this? Thanks!

[Enhancement] Refactor image splitting to a SM Processing Job

The initial image cleaning/splitting process in notebook 1 takes a long time to complete, and is a good potential use case for a SageMaker Processing Job to scale out the resources. This would be especially useful for any users hoping to process the full corpus (or big corpora of their own).

aws-samples / amazon-textract-transformer-pipeline Goto Github PK