DCR - Document Content Recognition - README

Based on the paper "Unfolding the Structure of a Document using Deep Learning" (Rahman and Finin, 2019), this software project attempts to use various software techniques to automatically recognise the structure of pdf documents and thus make them more searchable.

DCR enables batch processing of documents with the DCR-CORE library. Details of the DCR-CORE library can be found [here](https://konnexionsgmbh.github.io/dcr-core/). The documents to be processed are expected in a defined file directory. The processing result is made available either in a JSON file or in a PostgreSQL database.

Please see the Documentation for more detailed information.

1. Features

1.1 General

  • Support for documents in different languages - English, French, German and Italian as standard.

1.2 Preprocessor

  • Identifying scanned image pdf documents using PyMuPDF.
  • Converting scanned image pdf documents to a series of jpeg or png files using pdf2image and Poppler.
  • Converting bmp, gif, jp2, jpeg, png, pnm, tif, tiff or webp type documents to pdf format using Tesseract OCR.
  • Converting csv, docx, epub, html, odt, rst or rtf type documents to pdf format using Pandoc and TeX Live.
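
The extension groups listed above suggest a simple routing step in front of the converters. The following is a minimal sketch of such a dispatch, not DCR's actual implementation: the function name `choose_preprocessing_step` and the returned step labels are hypothetical, and only the extension lists are taken from this README.

```python
from pathlib import Path

# Extension groups copied from the feature list in this README.
TESSERACT_EXTENSIONS = {"bmp", "gif", "jp2", "jpeg", "png", "pnm", "tif", "tiff", "webp"}
PANDOC_EXTENSIONS = {"csv", "docx", "epub", "html", "odt", "rst", "rtf"}


def choose_preprocessing_step(file_name: str) -> str:
    """Return a (hypothetical) label for the preprocessing step a file needs."""
    # Normalise the extension: lower case, without the leading dot.
    ext = Path(file_name).suffix.lower().lstrip(".")
    if ext == "pdf":
        # pdf files are first checked for being scanned images (PyMuPDF in DCR).
        return "check_scanned_pdf"
    if ext in TESSERACT_EXTENSIONS:
        # Image formats are converted to pdf via Tesseract OCR.
        return "tesseract_to_pdf"
    if ext in PANDOC_EXTENSIONS:
        # Text-like formats are converted to pdf via Pandoc and TeX Live.
        return "pandoc_to_pdf"
    return "unsupported"
```

Keeping the mapping in plain sets makes it easy to see which tool is responsible for which file type; the real project may organise this differently.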

1.3 Natural Language Processing (NLP)

  • Extracting text and metadata from pdf documents using PDFlib TET.
  • Categorisation of the lines in the document, e.g. body, footer, header lines etc.
  • Determination of the token structure sentence by sentence with the help of spaCy.
  • Optional storage of the analysis result in a PostgreSQL database or in a JSON flat file.
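
To illustrate what line categorisation can look like, here is a deliberately simple sketch. It is an assumed heuristic chosen for illustration, not the method used by DCR or DCR-CORE: a page's first or last line is labelled as header or footer when the same text occupies that position on more than half of all pages. The function name `categorise_lines` and the parameter `min_share` are hypothetical.

```python
from collections import Counter


def categorise_lines(pages, min_share=0.5):
    """Label each line as 'header', 'footer' or 'body'.

    pages: list of pages, each page a list of text lines from top to bottom.
    Assumed heuristic: a first (last) line is a header (footer) when the same
    text is the first (last) line on more than `min_share` of all pages.
    """
    first_counts = Counter(page[0] for page in pages if page)
    last_counts = Counter(page[-1] for page in pages if page)
    threshold = min_share * len(pages)

    result = []
    for page in pages:
        labels = []
        for idx, line in enumerate(page):
            if idx == 0 and first_counts[line] > threshold:
                labels.append((line, "header"))
            elif idx == len(page) - 1 and last_counts[line] > threshold:
                labels.append((line, "footer"))
            else:
                labels.append((line, "body"))
        result.append(labels)
    return result
```

A real document needs more than exact text matching, for example footers with changing page numbers, so this only conveys the general idea.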

2. Directory and File Structure of this Repository

2.1 Directories

| Directory | Content |
|---|---|
| .github/workflows | GitHub Action workflows |
| data | Inbox directories and database setup data |
| docs | DCR documentation files |
| resources | DBeaver configuration, Gammadyne utility and various external documentation |
| scripts | Ubuntu and Windows scripts for running the application |
| src | Python scripts and PDFlib TET files |
| tests | Scripts and data for pytest |

2.2 Files

| File | Functionality |
|---|---|
| .gitignore | Configuration of files and folders to be ignored. |
| .pylintrc | Configuration file for pylint. |
| LICENSE | Text of the licence terms. |
| logging_cfg.yaml | Configuration of the Logger functionality. |
| Makefile | Definition of tasks to be executed with the make command. |
| mkdocs.yml | Configuration file for MkDocs. |
| Pipfile | Definition of the Python package requirements. |
| Pipfile.lock | Definition of the specific versions of the Python packages. |
| pyproject.toml | Configuration file for bandit, black, isort, mypy, pydoc-markdown, pydocstyle, and pytest. |
| README.md | This file. |
| run_dcr_dev | Running the DCR functionality for development purposes. |
| run_dcr_prod | Running the DCR functionality for productive operation. |
| setup.cfg | Configuration file for coverage, DCR, flake8, and radon. |
| setup.cfg.reference | Original setup configuration file. |

3. Support

If you need help with DCR, do not hesitate to get in contact with us!

  • For questions and high-level discussions, use Discussions on GitHub.
  • To report a bug or make a feature request, open an Issue on GitHub.

Please note that we can only provide support for problems and questions regarding core features of DCR. Questions or bug reports about features of third-party themes, plugins, extensions or similar should be directed to their respective projects. Such questions are, however, not banned from the Discussions.

Make sure to stick around to answer some questions as well!

4. Links

5. Contributing to DCR

The DCR project welcomes, and depends on, contributions from developers and users in the open source community. Please see the Contributing Guide for information on how you can help.

6. Code of Conduct

Everyone who interacts in the DCR project's codebase, issue trackers, and discussion forums is expected to follow the Code of Conduct.

7. License

Konnexions Public License (KX-PL)
