Knowledge Extraction Recipes - Forms

Retrieving information from documents and forms has long been a challenge, and even now at the time of writing, organisations are still handling significant amounts of paper forms that need to be scanned, classified and mined for specific information to enable downstream automation and efficiencies. Automating this extraction and applying intelligence is in fact a fundamental step toward digital transformation that organisations are still struggling to solve in an efficient and scalable manner.

For example, a bank may receive hundreds of kilograms of very diverse remittance forms a day, all of which must be processed manually to extract a few key fields. Similarly, medical prescriptions may need to be processed automatically to extract the prescribed medication and quantity.

Typically, organisations will have built text mining and search solutions that are tailored to a single scenario, with baked-in application logic, resulting in brittle solutions that are difficult and expensive to maintain.

Thanks to the breakthroughs and rapid innovation in the machine learning fields of Computer Vision and Natural Language Processing (NLP), reliable options are now available to provide data driven solutions that generalise and provide high degrees of accuracy in extracting information from structured forms.

Coupled with Azure services, this provides rapidly deployable, cost-efficient and scalable solutions ready for production workloads.

Overview

The goal of this Playbook is to build a set of guidance, tools, examples and documentation that illustrate some known techniques for information extraction, all of which have been applied in real customer solutions.

We hope that the Playbook can significantly reduce overall development time by simplifying the decision-making process, from defining the business problem through to analysis and development.

The first focus of the Playbook is extraction of information from Forms.

Intended audience

The intended audience of this Playbook includes:

Engineering/project leads
Data scientists/data engineers
Machine learning engineers
Software engineers

How this Playbook is structured

This Playbook aims to provide step-by-step guidance for each phase of a typical Forms Extraction project, alongside typical considerations, key outcomes and code accelerators per phase. To follow the guidance process, see the Walkthrough or dip into the individual code accelerators.

Getting Started

The best place to start if this is your first foray into this Playbook is with the Checklist, and then the Walkthrough to ensure that the most important points are addressed in order to build a successful solution in this space.

Terminology used in this Playbook

We refer to the Supervised version of the Form Recognizer service when the argument Use Labels is set to True during training, and to the Unsupervised version of Form Recognizer when the argument Use Labels is set to False.
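To make the Supervised/Unsupervised distinction concrete, here is a minimal sketch of how a training request could be assembled. It assumes the v2.0 REST API, where the Use Labels argument is taken to correspond to the `useLabelFile` field of the request body. The helper name `build_train_request`, the placeholder endpoint and the placeholder SAS URL are hypothetical, and the sketch only builds the request without sending it.

```python
import json


def build_train_request(endpoint, training_data_sas_url, use_labels,
                        prefix="", api_version="v2.0"):
    """Return (url, body) for a custom-model training request.

    Assumption: the v2.0 REST route and field names shown here; check them
    against the API version you actually target.
    """
    url = f"{endpoint.rstrip('/')}/formrecognizer/{api_version}/custom/models"
    body = {
        "source": training_data_sas_url,
        "sourceFilter": {"prefix": prefix, "includeSubFolders": False},
        # True  -> "Supervised" in this Playbook (labelled training data)
        # False -> "Unsupervised" in this Playbook
        "useLabelFile": use_labels,
    }
    return url, body


# Placeholder values; replace with your own resource endpoint and SAS URL.
url, body = build_train_request(
    "https://<your-resource>.cognitiveservices.azure.com",
    "<SAS URL of the training container>",
    use_labels=True,
)
print(url)
print(json.dumps(body, indent=2))
```

Sending the request (for example with an HTTP client and your subscription key) is deliberately left out so that the sketch stays free of credentials.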

We refer to a form issuer as being the unique source of a form, for example, the vendor of an invoice, or the bank of origin of an application form.

End to end demos ✨

Stage	Scenario	Description
AutoLabelling and Prediction	AutoLabelling	Chains AutoLabelling, Training and Prediction on sample invoices
Pre-Processing Remove Boxes	RemoveBoxes	Shows how to remove boxes that cause OCR errors and find the best image transformation
Get Values in CheckBoxes	Detect and get CheckBox value	Detects and gets the value from CheckBoxes

Code accelerators

The following code accelerators serve as starting points for trying approaches that are known to work for Knowledge Extraction. Note: these accelerators are not production ready. They need to be adapted to your data, incorporated into your pipeline, and then tested and profiled.

The code accelerators included are available in Jupyter notebooks, APIs and python scripts that showcase some of the scenarios in this repository using diverse approaches.

Stage	Scenario	Description
Project preparation	Checklist	Steps to ensure success
Project preparation	Decision Guidance	Core decision points
Project preparation	Data Structure	Recommended training data structure
Analysis	Understanding the data distribution	Illustrates a simple way to understand the distribution of vendor to invoice frequency
Analysis	Understanding form variation	Illustrates how to analyse whether variation in a single form type exists
Analysis	Form layout type labelling using clustering based on text features	Shows an approach which can be used to discover/label different layout types within a large dataset of form images
Analysis	Form layout clustering based on text and text layout features	Shows another approach which can be used to discover different layouts within a big dataset of images, taking words and positions of words on a page into account
Analysis	Classifying forms	Illustrates how to use an attribute based search approach to classify forms for Form Recognizer model correlation
Analysis	Routing forms	Demonstrates how to use OCR results to find which Form Recognizer model to send an unknown form to
Pre-Processing	Image Channel Normalisation	Illustrates interactive normalisation, binarization and greyscale conversion
Pre-Processing Remove Boxes	RemoveBoxes	Illustrates interactively how to remove boxes that cause OCR errors and find the best image transformation
Pre-Processing	Conversion	Converting documents between various formats such as TIF to PDF, JPG to PDF etc
Pre-Processing	Scan skewness	Illustrates testing and correcting skewness
Pre-Processing	Projection	Illustrates how to identify document skew and location of text lines
Pre-Processing	Detect and get CheckBox value	Illustrates how to detect and get a CheckBox value
Pre-Processing	Optical Mark Recognition	Illustrates some techniques to determine if a checkbox exists and how to extract it
Training	Dataset representativeness	Illustrates how to test the train and test datasets for representativeness
Training	Named Entity Recognition	Illustrates how NER can be trained and used to identify and extract entities on a form
Training	Auto-labelling and training set optimisation	Illustrates how forms can be automatically labelled for the supervised version of Form Recognizer
Training	Generating a taxonomy	Illustrates a simple approach to generating a taxonomy of known terms from the forms
Extraction	Custom Corpus	Describes an approach to handling a custom corpus
Extraction	Handwriting and common OCR Errors	Describes an approach to dealing with common errors
Extraction	Predicting forms with Form Recognizer Supervised	Predicting forms with Form Recognizer Supervised
Extraction	Predicting forms with Form Recognizer Unsupervised	Predicting forms with Form Recognizer Unsupervised
Extraction	Using filter keys from a taxonomy	Illustrates how to filter the keys extracted from the unsupervised version of Form Recognizer using a taxonomy of known terms
Extraction	Table Extraction	Illustrates extracting tables with Form Recognizer
Evaluation	Scoring	Illustrates how to evaluate and score with Form Recognizer
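As a rough illustration of the "Understanding the data distribution" accelerator listed above, the sketch below counts forms per issuer and reports the cumulative share covered by the most frequent issuers. It is not the accelerator's own code; the function name `issuer_coverage` and the sample issuer labels are hypothetical.

```python
from collections import Counter


def issuer_coverage(issuers):
    """Return (issuer, count, cumulative_share) rows, most frequent first."""
    counts = Counter(issuers)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for issuer, count in counts.most_common():
        cumulative += count
        rows.append((issuer, count, cumulative / total))
    return rows


# Hypothetical sample: one issuer label per invoice in the dataset.
sample = ["Vendor A"] * 6 + ["Vendor B"] * 3 + ["Vendor C"]
for issuer, count, share in issuer_coverage(sample):
    print(f"{issuer}: {count} forms, cumulative share {share:.0%}")
```

A table like this helps to decide how many issuers need a dedicated model: if a handful of issuers cover most of the volume, training on those first is likely to give the quickest return.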

PowerApps ✨

NEW (▀̿Ĺ̯▀̿ ̿)

Stage	Scenario	Description
Invoice Automation	PowerApps	Invoice Automation using the Power Platform

Example Pipelines

The Pipelines section contains some example patterns and pipelines for Knowledge Extraction using Azure Services.

Scenario	Description
Azure Cognitive Search	Sample pipeline using Azure Cognitive Search
Azure Kubernetes Service	Sample pipeline using Azure Kubernetes Service
Azure Machine Learning	Sample pipeline using Azure Machine Learning
Azure Logic Apps	Sample pipeline using Azure Logic Apps
Azure (Durable) Functions	Sample pipeline using Azure (Durable) Functions

Tips and Best Practices for Form Recognizer

For tips and best practices for managing Form Recognizer models via MLOps and deployment pipelines, view MLOps Tips and Tricks for Form Recognizer.

Example Scenarios

This section contains some documented common scenarios.

Scenario	Description
CV or Resume Extraction	Sample extraction flow for a CV/Resume
Email Extraction	Sample extraction from emails
Geolocation Extraction	Sample extraction for Geolocation
Prebuilt Receipt Model	Sample extraction for the prebuilt Receipt model
Table extraction with Form Recognizer	Sample extraction for Tables using Form Recognizer
Document Extraction detailed example using JFK Files	Sample extraction for Tables using Form Recognizer
Dealing with multiple languages	Illustrates a few approaches to dealing with multiple languages
Custom extraction from Japanese forms	Illustrates an approach to custom extraction from Japanese forms
Informative Image Selection using OCR with Form Recognizer Extraction	Illustrates an approach to selecting the most "informative" image from a group of similar images before extracting data with the Form Recognizer

Azure Services used in this repository

Azure Computer Vision OCR

Read API detects text content in an image using our latest recognition models and converts the identified text into a machine-readable character stream. It's optimized for text-heavy images (such as documents that have been digitally scanned) and for images with a lot of visual noise. It will determine which recognition model to use for each line of text, supporting images with both printed and handwritten text. The Read API executes asynchronously because larger documents can take several minutes to return a result.
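Because the Read API is asynchronous, a client typically submits the document and then polls for the result. The sketch below shows only that polling loop, under the assumption that the result payload carries a `status` field that ends in `succeeded` or `failed`. The function `poll_until_done` and its parameters are hypothetical, and the HTTP call is injected as a callable so the example runs without network access.

```python
import time

# Assumption: these are the terminal values of the "status" field.
TERMINAL_STATUSES = {"succeeded", "failed"}


def poll_until_done(fetch_result, interval_seconds=1.0, max_attempts=60,
                    sleep=time.sleep):
    """Call fetch_result() until it reports a terminal status.

    fetch_result is any zero-argument callable returning the parsed JSON
    result as a dict. Raises TimeoutError if no terminal status is seen.
    """
    for _ in range(max_attempts):
        result = fetch_result()
        if str(result.get("status", "")).lower() in TERMINAL_STATUSES:
            return result
        sleep(interval_seconds)
    raise TimeoutError(f"Operation did not finish after {max_attempts} attempts")


# Stubbed responses standing in for successive GET calls.
responses = iter([
    {"status": "notStarted"},
    {"status": "running"},
    {"status": "succeeded", "analyzeResult": {}},
])
result = poll_until_done(lambda: next(responses), sleep=lambda seconds: None)
print(result["status"])  # prints "succeeded"
```

In real use, `fetch_result` would issue a GET request to the operation URL returned when the document was submitted. Capping the number of attempts avoids waiting indefinitely on a stalled operation.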

The OCR API (Computer Vision's optical character recognition API) is similar to the Read API, but it executes synchronously and is not optimized for large documents. It uses an earlier recognition model but works with more languages.

Azure Cognitive Search

Azure Cognitive Search is a fully managed search-as-a-service offering that reduces complexity and scales easily. It includes:

Auto-complete, geospatial search, filtering, and faceting capabilities for a rich user experience
Built-in AI capabilities including OCR, key phrase extraction, and named entity recognition to unlock insights
Flexible integration of custom models, classifiers, and rankers to fit your domain-specific needs

Form Recognizer Service

Form Recognizer applies advanced machine learning to accurately extract text, key/value pairs and tables from documents.

The Form Recognizer has two modes of operation:

Custom Model: This mode can be trained to recognise specific form types based on your own example data set
Prebuilt Receipt Model: This model is pre-trained (requires no training from you) to recognise and extract key data points from receipts (e.g. till receipts, restaurant bills and general retail receipts)

The Custom Model requires the following for training:

Format must be JPG, PNG, or PDF (text or scanned). Text-embedded PDFs are best because there's no possibility of error in character extraction and location.
If your PDFs are password-locked, you must remove the lock before submitting them.
File size must be less than 4 MB.
For images, dimensions must be between 600 x 100 pixels and 4200 x 4200 pixels.
If scanned from paper documents, forms should be high-quality scans.
Text must use the Latin alphabet (English characters).
Data must contain keys and values.
Keys can appear above or to the left of the values, but not below or to the right.
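The file-level requirements above can be checked before any data is uploaded. The following is a minimal sketch that covers only format, file size and image dimensions. It assumes that "4 MB" means 4 * 1024 * 1024 bytes, that `.jpeg` is accepted alongside `.jpg`, and that the dimension rule means a width of 600 to 4200 pixels and a height of 100 to 4200 pixels. The function name `check_training_file` is hypothetical.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".pdf"}
MAX_FILE_SIZE_BYTES = 4 * 1024 * 1024  # assumption: MB interpreted as MiB
MIN_WIDTH, MIN_HEIGHT = 600, 100       # assumption: width x height
MAX_WIDTH, MAX_HEIGHT = 4200, 4200


def check_training_file(filename, size_bytes, width=None, height=None):
    """Return a list of problems found; an empty list means no issue found."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes >= MAX_FILE_SIZE_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    # Pixel dimensions only apply to images, and only when they are known.
    if extension != ".pdf" and width is not None and height is not None:
        width_ok = MIN_WIDTH <= width <= MAX_WIDTH
        height_ok = MIN_HEIGHT <= height <= MAX_HEIGHT
        if not (width_ok and height_ok):
            problems.append(f"dimensions out of range: {width} x {height}")
    return problems


print(check_training_file("invoice_001.pdf", 250_000))
print(check_training_file("scan_017.png", 5 * 1024 * 1024, width=500, height=500))
```

Content-level requirements, such as scan quality, alphabet and key/value placement, cannot be verified this way and still need manual or model-based review.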

Form Recognizer doesn't currently support these types of input data:

Complex tables (nested tables, merged headers or cells, and so on).
Checkboxes or radio buttons.
PDF documents longer than 50 pages.

See more on training Form Recognizer here

The requirements for the prebuilt receipt model are slightly different.

Format must be JPEG, PNG, BMP, PDF (text or scanned) or TIFF.
File size must be less than 20 MB.
Image dimensions must be between 50 x 50 pixels and 10000 x 10000 pixels.
PDF dimensions must be at most 17 x 17 inches, corresponding to Legal or A3 paper sizes and smaller.
For PDF and TIFF, only the first 200 pages are processed (with a free tier subscription, only the first two pages are processed).

Azure Machine Learning service

Azure Machine Learning service is a cloud service used to train, deploy, automate, and manage machine learning models, all at the broad scale that the cloud provides. AzureML is presented in notebooks across different scenarios to enhance the efficiency of developing Natural Language systems at scale and for various AI model development related tasks like:

Accessing Datastores to easily read and write your data in Azure storage services such as blob storage or file share.
Scaling up and out on Azure Machine Learning Compute.
Automated Machine Learning which builds high quality machine learning models by automating model and hyperparameter selection.
Tracking experiments and monitoring metrics to enhance the model creation process.
Distributed Training
Hyperparameter tuning
Deploying the trained machine learning model as a web service to Azure Container Instance for development and test, or for low-scale, CPU-based workloads.
Deploying the trained machine learning model as a web service to Azure Kubernetes Service for high-scale production deployments that provide autoscaling and fast response times.

To run these code accelerators successfully, you will need an Azure subscription, or you can try Azure for free. Other Azure services or products may be used in the code; where they are, an introduction and/or reference is provided in the code itself.

Contributors ✨

Alessandro Jannuzzi	Alex Hocking	Ayaka Hara	Dariusz Parys	Ibrahim Kivanc	Karol Zak	Katia Gil Guzman
Krit Kamtuo	Marcia Dos Santos	Nuno Silva	Ignacio Floristan	Oleksiy Shepetko	Omri Mendels	Pretesh Patel
Raj Nemani	Sergii Baidachnyi	Tomi Paananen	Martin Kearn	Shane Peckham	Mick Vleeshouwer	Jon Malsan
Eva Mok	Mitchell Overfield	Jafar Al-Kofahi	Daniel Fatade	Steve Pucelik