microsoft / genalog

Genalog is an open source, cross-platform python package allowing generation of synthetic document images with custom degradations and text alignment capabilities.

Home Page: https://microsoft.github.io/genalog/

License: MIT License

Jupyter Notebook 93.45% Python 6.35% Shell 0.01% Jinja 0.20%
ner ocr-recognition python text-alignment data-generation data-science machine-learning synthetic-data synthetic-images synthetic-data-generation

genalog's Introduction

Genalog is an open source, cross-platform python package for generating document images with synthetic noise that mimics scanned analog documents (thus the name genalog). You can also add various text degradations to these images. The purpose of this tool is to provide a fast and efficient way to generate synthetic documents from text data by leveraging layout from templates that you create in simple HTML format.

This repo is now in maintenance mode with limited support.

Overview

Genalog has various capabilities:

  1. Flexible format Image Generation
  2. Custom image degradation
  3. Extract Text from Images using Cognitive Search Pipeline
  4. Get OCR Performance Metrics

The aim of this project is to provide a complete solution for generating synthetic images from any text data rich in natural language and to imitate most of the OCR noise found in scanned text documents.

Please refer to our Genalog documentation for more tutorials.

Installation

See the Genalog install guide for more details.

To install the latest release:

pip install genalog

Extra Installation Steps on macOS and Windows

Genalog depends on WeasyPrint, which in turn has non-Python dependencies (Pango, cairo and GDK-PixBuf) that need to be installed separately.

The Pango, cairo and GDK-PixBuf libraries are available by default in Ubuntu 18.04 and later.

If you are running on Windows, macOS, or other Linux distributions, please see the installation instructions from WeasyPrint.

NOTE: If you encounter errors like no library called "libcairo-2" was found, the cause is probably that these three extra dependencies are missing.
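
As a quick diagnostic, the sketch below checks whether these native libraries are visible to Python. It uses only the standard library's ctypes.util.find_library. The short library names (cairo, pango-1.0, gdk_pixbuf-2.0) are assumptions based on common Linux naming and may differ on your platform, so treat a "NOT FOUND" result as a hint rather than proof.

```python
from ctypes.util import find_library

# Short names are assumptions based on typical Linux naming;
# adjust them for your platform if needed.
NATIVE_LIBS = ("cairo", "pango-1.0", "gdk_pixbuf-2.0")


def find_native_libs(names=NATIVE_LIBS):
    """Map each library name to whatever the loader finds, or None."""
    return {name: find_library(name) for name in names}


if __name__ == "__main__":
    for name, location in find_native_libs().items():
        status = location if location else "NOT FOUND"
        print(f"{name}: {status}")
```

If any entry is reported as not found, the WeasyPrint installation instructions for your operating system are the place to look next.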

Getting Started

The following is a summary of the common application scenarios of Genalog. Please refer to the Jupyter notebook examples, which make use of the core code base of Genalog and the repository utilities.

TLDR

If you are interested in a full document generation and degradation pipeline, please see the following notebook:

# | Description                         | In-depth Jupyter Notebook Examples
1 | Analog Document Generation Pipeline | Demo Notebook

Otherwise, we have in-depth walkthroughs of each of the modules in Genalog.

# | Steps                                                 | In-depth Jupyter Notebook Examples | Quick Start Guides
1 | Create Template for Image Generation                  | Demo Notebook                      | Here is our guide to Document Generation
2 | Degrade Prebuilt Images                               | Demo Notebook                      | Here is our guide to Image Degradation
3 | Get Text From Images Using OCR                        | Demo Notebook                      | Here is our guide to Extracting Text
4 | Align Text Produced from OCR with Ground Truth Text   | Demo Notebook                      | Here is our guide to Text Alignment
5 | NER Label Propagation from Ground Truth to OCR Tokens | Demo Notebook                      | Here is our guide to Label Propagation
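
To make step 4 more concrete, here is a minimal sketch of what aligning OCR output with ground-truth text means. It is a plain Needleman-Wunsch global alignment written with the standard library only. It is not Genalog's own implementation, and the function name, the scoring values and the "@" gap character are placeholders chosen for illustration.

```python
def align(gt, ocr, gap_char="@", match=1, mismatch=-1, gap=-1):
    """Globally align two strings and pad gaps with gap_char (illustrative only)."""
    rows, cols = len(gt) + 1, len(ocr) + 1

    def pair_score(i, j):
        # Score for aligning gt[i-1] against ocr[j-1]
        return match if gt[i - 1] == ocr[j - 1] else mismatch

    # score[i][j] = best score for aligning gt[:i] with ocr[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # match or substitution
                score[i - 1][j] + gap,                   # gap in OCR text
                score[i][j - 1] + gap,                   # gap in ground truth
            )

    # Trace back from the bottom-right corner to recover one optimal alignment
    aligned_gt, aligned_ocr = [], []
    i, j = len(gt), len(ocr)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            aligned_gt.append(gt[i - 1])
            aligned_ocr.append(ocr[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_gt.append(gt[i - 1])
            aligned_ocr.append(gap_char)
            i -= 1
        else:
            aligned_gt.append(gap_char)
            aligned_ocr.append(ocr[j - 1])
            j -= 1
    return "".join(reversed(aligned_gt)), "".join(reversed(aligned_ocr))


if __name__ == "__main__":
    gt_line, ocr_line = align("New York", "New Yrk")
    print(gt_line)   # New York
    print(ocr_line)  # New Y@rk
```

Once both strings have the same length, character positions line up one-to-one, which is what makes it possible to carry token-level labels from the ground truth over to the OCR text (step 5).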

We also provide notebooks for the complete end-to-end scenario of generating a synthetic dataset connecting all the components of genalog:

# | Scenario                                              | In-depth Jupyter Notebook
1 | Synthetic Dataset Generation with LABELED NER Dataset | Demo Notebook

Other Requirements:

  1. If you want to use the OCR capabilities of Azure to extract text from the images, you will need the following resources:

    1. Azure Cognitive Search Service Quickstart Guide Here
    2. Azure Blob Storage Quickstart Guide Here

    See Azure Docs for more information on Azure Cognitive Search.

Package Release

Please see RELEASE.md for more details on the release process.

Development with the Repo

We use tox to orchestrate most of the CI procedure. This will ensure the maximum environment parity between local dev boxes and remote CI pipelines.

  1. git clone https://github.com/microsoft/genalog.git
  2. pip install tox
  3. To run static analysis: tox -e flake8
  4. To run the test suites: tox -e -- -m "not azure"

Repo Structure

genalog
├────genalog
│       ├─── generation                      # generate text images
│       ├──── degradation                    # methods for image degradation
│       ├──── ocr                            # running the Azure Search Pipeline
│       └──── text                           # methods to Align OCR Output Text with 
├────devops                                  # CI/CD pipelines
├────docs                                    # containing online documentation
├────examples                                # example Jupyter Notebooks for Various 
├────tests                                   # tests
├────tox.ini                                 # CI orchestration and configurations
├────README.md
└────LICENSE

Trademark Notice

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.

Microsoft Open Source Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Contribution Guidelines

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repositories using our CLA.

Citing genalog

If you find genalog helpful to your work, please consider citing our tool and paper using the following BibTeX entry:

@article{
  gupte2021genalog,
  title={Lights, Camera, Action! A Framework to Improve NLP Accuracy over OCR documents},
  author={Gupte, Amit and Romanov, Alexey and Mantravadi, Sahitya and Banda, Dalitso and Liu, Jianjie and Khan, Raza and Meenal, Lakshmanan Ramu and Han, Benjamin and Srinivasan, Soundar},
  journal={Document Intelligence Workshop at KDD 2021},
  year={2021}
}

Collaborators

Genalog was originally developed by the MAIDAP team at Microsoft Cambridge NERD in association with the Text Analytics Team in Redmond.

genalog's People

Contributors

dbanda, jgc128, laserprec, microsoft-github-operations[bot], microsoftopensource

genalog's Issues

Minor typos

Hi. Thanks for sharing this nice work.

There seem to be some minor typos in the docstrings.

Returns:
a string with formatted alignment.
'|' is for matching character
'.' is for substition
'-' indicates gap

Maybe it should be changed as below?
'.' is for substition -> '.' is for substitution
'-' indicates gap -> ' ' indicates gap

Thanks.

installation issue

I'm trying to install the library, but I'm facing issues with both pip and source installation.

I'm getting this error:

[image]

Adding ability to extract a template CSS from a given PDF or image file

Genalog is great at generating a synthetic document from a given template, but coming up with a template is still a pain.

Wouldn't it be great if I could just point Genalog to a PDF or image, and ask it to synthesize more documents like that?

In other words, can we add the functionality of extracting a CSS template out of a given PDF/image, to complete the cycle?

Thanks!

Document Intelligence

How to run tests?

The RELEASE.md document does not specify how to run tests.

It would be good to have information about running the tests in the "Preparation" step of RELEASE.md.

other languages

Does genalog support the Arabic language?
Thanks in advance.

Can we add line_spacing?

Hello, I am trying to add line spacing. Even though I add new lines manually in the txt file, they are still removed.

with open(txt_path, 'r') as f:
    text = f.read()

# Initialize Content Object
text = text.replace('\n', '\n\n')
paragraphs = text.split('\n\n\n')

Printing paragraphs gives the expected result; however, default_generator.set_styles_to_generate(new_style_combinations) somehow removes the blank lines. Thank you in advance.

Question about generate new templates(html.jinja)

Hi, thank you for sharing this nice work. @laserprec

I want to make my own templates from an HTML file that I already have. How can I make a jinja file that matches my HTML (or PDF) file?

Could you please give me some tips on this?

In addition, can I use a template that matches my own PDF files?

I saw an issue about this, but I could not work out exactly how to do it.

ACTION REQUIRED: Microsoft needs this private repository to complete compliance info

There are open compliance tasks that need to be reviewed for your genalog repo.

Action required: 4 compliance tasks

To bring this repository to the standard required for 2021, we require administrators of this and all Microsoft GitHub repositories to complete a small set of tasks within the next 60 days. This is critical work to ensure the compliance and security of your microsoft GitHub organization.

Please take a few minutes to complete the tasks at: https://repos.opensource.microsoft.com/orgs/microsoft/repos/genalog/compliance

  • The GitHub AE (GitHub inside Microsoft) migration survey has not been completed for this private repository
  • No Service Tree mapping has been set for this repo. If this team does not use Service Tree, they can also opt-out of providing Service Tree data in the Compliance tab.
  • No repository maintainers are set. The Open Source Maintainers are the decision-makers and actionable owners of the repository, irrespective of administrator permission grants on GitHub.
  • Classification of the repository as production/non-production is missing in the Compliance tab.

You can close this work item once you have completed the compliance tasks, or it will automatically close within a day of taking action.

If you no longer need this repository, it might be quickest to delete the repo, too.

GitHub inside Microsoft program information

More information about GitHub inside Microsoft and the new GitHub AE product can be found at https://aka.ms/gim.

FYI: current admins at Microsoft include @amitgupte, @laserprec, @sahityamantravadi, @dbanda, @jgc128

Issue with weasyprint dependency

Genalog depends on WeasyPrint. It uses the function write_image_surface() from Weasyprint.document.Document().
However, WeasyPrint removed this function on 19 April 2020, so even the simplest example code provided by genalog fails to run.

genalog does not work with newer versions of weasyprint

Newer versions of weasyprint (53.x) removed their dependency on cairo and do not support PNG exports anymore (see Kozea/WeasyPrint#1232 and https://www.courtbouillon.org/blog/00004-weasyprint-without-cairo-what-s-different).

This breaks some parts of the genalog code. Specifically, the following methods are affected (as far as I have seen):

def render_png(self, target=None, split_pages=False, resolution=300):

def render_array(self, resolution=300, channel="GRAYSCALE"):

Maybe a warning about this should be added to the documentation. What is the plan moving forward?

Thank you :D
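
A small guard along the following lines could surface the incompatibility early. This is only a sketch under the assumption stated in the report above, namely that PNG export is unavailable from WeasyPrint 53 onward. The helper names are hypothetical and are not part of genalog or WeasyPrint.

```python
from importlib import metadata

# Per the report above, PNG export was dropped in the 53.x releases.
FIRST_VERSION_WITHOUT_PNG = 53


def supports_png_export(version):
    """Return True if the given WeasyPrint version string predates 53."""
    major = int(version.split(".")[0])
    return major < FIRST_VERSION_WITHOUT_PNG


def installed_weasyprint_version():
    """Return the installed WeasyPrint version string, or None if absent."""
    try:
        return metadata.version("weasyprint")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_weasyprint_version()
    if version is None:
        print("WeasyPrint is not installed.")
    elif supports_png_export(version):
        print(f"WeasyPrint {version}: PNG export should be available.")
    else:
        print(f"WeasyPrint {version}: PNG export was removed; render_png() and render_array() may fail.")
```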

Retrieve position of rendered document

I want to use this tool to generate a synthetic dataset for the detection phase of an OCR pipeline. Is there a way to get the location (bounding box) of each word that is rendered in the final documents?
