DCR - Document Content Recognition - README

Based on the paper "Unfolding the Structure of a Document using Deep Learning" (Rahman and Finin, 2019), this software project attempts to recognise the structure of any PDF document automatically, using a range of software techniques, and thus to make the documents more searchable.

DCR enables batch processing of documents with the DCR-CORE library. Details of the DCR-CORE library can be found [here](https://konnexionsgmbh.github.io/dcr-core/). The documents to be processed are expected in a defined file directory. The processing result is made available either in a JSON file or in a PostgreSQL database.

Please see the Documentation for more detailed information.

1. Features

1.1 General

Support for documents in different languages - English, French, German and Italian as standard.

1.2 Preprocessor

Identifying scanned image pdf documents using PyMuPDF.
Converting scanned image pdf documents to a series of jpeg or png files using pdf2image and Poppler.
Converting bmp, gif, jp2, jpeg, png, pnm, tif, tiff or webp type documents to pdf format using Tesseract OCR.
Converting csv, docx, epub, html, odt, rst or rtf type documents to pdf format using Pandoc and TeX Live.
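
The routing implied by the list above can be sketched as a lookup from file extension to conversion tool. This is a minimal illustration, not DCR's actual code: the function name `choose_preprocessor` and the returned labels are hypothetical, and only the extension groups are taken from the feature list.

```python
from pathlib import Path

# Extension groups copied from the feature list above.
TESSERACT_TYPES = {"bmp", "gif", "jp2", "jpeg", "png", "pnm", "tif", "tiff", "webp"}
PANDOC_TYPES = {"csv", "docx", "epub", "html", "odt", "rst", "rtf"}


def choose_preprocessor(file_name: str) -> str:
    """Return a label for the conversion step a file would need.

    Hypothetical helper: the labels are illustrative and are not
    identifiers used by DCR or DCR-CORE.
    """
    extension = Path(file_name).suffix.lower().lstrip(".")
    if extension == "pdf":
        # PDF input needs no conversion; it is only checked for scanned images.
        return "pdf"
    if extension in TESSERACT_TYPES:
        return "tesseract"
    if extension in PANDOC_TYPES:
        return "pandoc"
    return "unsupported"
```

Matching on the lower-cased extension keeps the sketch independent of how files are named in the inbox directory. A real implementation might inspect the file content instead.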

1.3 Natural Language Processing (NLP)

Extracting text and metadata from pdf documents using PDFlib TET.
Categorisation of the lines in the document, e.g. body, footer, header lines etc.
Determination of the token structure sentence by sentence with the help of spaCy.
Storage of the analysis result, optionally in a PostgreSQL database or in a JSON flat file.
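
To make the line categorisation step more concrete, here is a minimal sketch of one possible heuristic. It is an assumption for illustration only and is not the method used by DCR or DCR-CORE: a line counts as a header (or footer) when its text, with digits masked, recurs as the first (or last) line on most pages. The function name, the `min_share` parameter and the output keys are all hypothetical.

```python
import re
from collections import Counter


def categorise_lines(pages, min_share=0.6):
    """Label every line as 'header', 'footer' or 'body'.

    pages: list of pages, each page being a list of line strings.
    Illustrative heuristic only, not DCR's actual algorithm.
    """

    def mask(text):
        # Mask digits so that "Page 1" and "Page 2" compare as equal.
        return re.sub(r"\d+", "#", text.strip())

    first_lines = Counter(mask(page[0]) for page in pages if page)
    last_lines = Counter(mask(page[-1]) for page in pages if page)
    # Require at least two occurrences so a one-page document has no header.
    threshold = max(2, min_share * len(pages))

    result = []
    for page_no, lines in enumerate(pages, start=1):
        for index, text in enumerate(lines):
            line_type = "body"
            if index == 0 and first_lines[mask(text)] >= threshold:
                line_type = "header"
            elif index == len(lines) - 1 and last_lines[mask(text)] >= threshold:
                line_type = "footer"
            result.append({"page": page_no, "text": text, "type": line_type})
    return result
```

The returned list of dictionaries can be passed directly to `json.dumps`, which mirrors the JSON flat file option mentioned above. The key names are again only illustrative.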

2. Directory and File Structure of this Repository

2.1 Directories

Directory	Content
.github/workflows	GitHub Action workflows
data	Inbox directories and database setup data
docs	DCR documentation files
resources	DBeaver configuration, Gammadyne utility and various external documentation
scripts	Ubuntu and Windows scripts for running the application
src	Python scripts and PDFlib TET files
tests	Scripts and data for pytest

2.2 Files

File	Functionality
.gitignore	Configuration of files and folders to be ignored.
.pylintrc	Configuration file for pylint.
LICENSE	Text of the licence terms.
logging_cfg.yaml	Configuration of the Logger functionality.
Makefile	Definition of tasks to be executed with the `make` command.
mkdocs.yml	Configuration file for MkDocs.
Pipfile	Definition of the Python package requirements.
Pipfile.lock	Definition of the specific versions of the Python packages.
pyproject.toml	Configuration file for bandit, black, isort, mypy, pydoc-markdown, pydocstyle, and pytest.
README.md	This file.
run_dcr_dev	Running the DCR functionality for development purposes.
run_dcr_prod	Running the DCR functionality for productive operation.
setup.cfg	Configuration file for coverage, DCR, flake8, and radon.
setup.cfg.reference	Original setup configuration file.

3. Support

If you need help with DCR, do not hesitate to get in contact with us!

For questions and high-level discussions, use Discussions on GitHub.
To report a bug or make a feature request, open an Issue on GitHub.

Please note that we may only provide support for problems and questions regarding core features of DCR. Questions or bug reports about features of third-party themes, plugins, extensions or similar should be directed to their respective projects. Such questions are, however, not banned from the Discussions.

Make sure to stick around to answer some questions as well!

4. Links

Official Documentation
Release Notes
Discussions (Third-party themes, recipes, plugins and more)

5. Contributing to DCR

The DCR project welcomes, and depends on, contributions from developers and users in the open source community. Please see the Contributing Guide for information on how you can help.

6. Code of Conduct

Everyone who interacts in the DCR project's codebase, issue trackers, and discussion forums is expected to follow the Code of Conduct.

7. License

Konnexions Public License (KX-PL)
