pd3f / dehyphen

📜 Dehyphenation of broken text (mainly German), e.g., text extracted from a PDF

License: GNU General Public License v3.0

Python 100.00%

python pdf dehyphenation hyphens nlp german flair-embeddings flair hyphen pd3f

dehyphen's Introduction

`pd3f`

Experimental, use with care.

pd3f is a PDF text extraction pipeline that is self-hosted, local-first and Docker-based. It reconstructs the original continuous text with the help of machine learning.

pd3f can OCR scanned PDFs with OCRmyPDF (Tesseract) and extracts tables with Camelot and Tabula. It's built upon the output of Parsr. Parsr detects hierarchies of text and splits the text into words, lines and paragraphs.

Even though Parsr brings some structure to the PDF, the text is still scrambled, e.g., by hyphens at line breaks. The underlying Python package pd3f-core tries to reconstruct the original continuous text by removing hyphens, new lines and/or spaces. It uses language models to guess what the original text looked like.
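To make the idea concrete, here is a minimal sketch of candidate-based dehyphenation. It is not pd3f-core's implementation: the function names `join_candidates`, `pick_best` and `toy_score` are hypothetical, and a tiny word list stands in for the language model that would normally score each candidate.

```python
from typing import Callable, List


def join_candidates(left: str, right: str) -> List[str]:
    """Return possible ways to merge a line with the line that follows it."""
    if not left.endswith("-"):
        # No trailing hyphen: the only repair considered is newline -> space.
        return [f"{left} {right}"]
    stem = left[:-1]
    return [
        stem + right,       # hyphen came from line wrapping: "No-" + "vember"
        left + right,       # hyphen belongs to the word: "Wieczorek-" + "Zeul"
        f"{left} {right}",  # hyphen ends a token: "Bundes-" + "und Landesebene"
    ]


def pick_best(left: str, right: str, score: Callable[[str], float]) -> str:
    """Choose the candidate with the lowest score (lower = more plausible)."""
    return min(join_candidates(left, right), key=score)


# Toy stand-in for a language model: counts tokens missing from a word list.
KNOWN = {"am", "21.", "November", "ihren"}


def toy_score(text: str) -> float:
    return sum(1 for token in text.split() if token not in KNOWN)


print(pick_best("am 21. No-", "vember ihren", toy_score))  # am 21. November ihren
```

In a real pipeline the scoring function would come from a language model, so that the most fluent candidate wins even for words that are in no word list.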

pd3f is especially useful for languages with long words such as German. It was mainly developed to parse German letters and official documents. Besides German, pd3f supports English, Spanish, French and Italian. More languages will be added at a later stage.

pd3f includes a Web-based GUI and a Flask-based microservice (API). You can find a demo at demo.pd3f.com.

Documentation

Check out the full Documentation at: https://pd3f.com/docs/

Future Work / TODO

PDFs are hard to process and it's hard to extract information. So the results of this tool may not satisfy you. There will be more work to improve this software but altogether, it's unlikely that it will successfully extract all the information anytime soon.

Here are some things that will be improved.

statistics about how long processing (per page) took in the past

calculate runtime based on job.started_at and job.ended_at
Get average runtime of jobs and store data in redis list

more information about PDF

NER
entity linking
extract keywords
use textacy

add more languages

check if flair has model
what to do if there is no fast model?

Python client

simple client based on requests
send whole folders

Markdown / HTML export

go beyond text

use pdf-scripts / allow more processing

reduce size
repair PDF
detect if scanned
force to OCR again

improve logs / get better feedback

show uncertainty of ML model
allow different log levels
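The runtime statistics item from the list above could look roughly like the following sketch. Everything here is an assumption made for illustration: `FinishedJob` is a hypothetical stand-in for a queue job that exposes `started_at` and `ended_at`, and a plain Python list takes the place of the Redis list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List


@dataclass
class FinishedJob:
    """Hypothetical stand-in for a queue job exposing started_at / ended_at."""
    started_at: datetime
    ended_at: datetime
    pages: int


def seconds_per_page(job: FinishedJob) -> float:
    """Runtime of one job divided by its page count."""
    runtime = (job.ended_at - job.started_at).total_seconds()
    return runtime / max(job.pages, 1)  # guard against a page count of zero


def average_seconds_per_page(history: List[float]) -> float:
    """Average over stored measurements; 0.0 if nothing was recorded yet."""
    return mean(history) if history else 0.0


# A plain list stands in for the Redis list mentioned in the TODO item.
history: List[float] = []
start = datetime(2020, 1, 1, 12, 0, 0)
history.append(seconds_per_page(FinishedJob(start, start + timedelta(seconds=30), pages=10)))
history.append(seconds_per_page(FinishedJob(start, start + timedelta(seconds=10), pages=2)))
print(average_seconds_per_page(history))  # 4.0
```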

Related Work

https://github.com/axa-group/Parsr
https://github.com/jzillmann/pdf-to-markdown
some PDF processing tools in my blog post

Development

Install and use poetry.

Initially run:

./dev.sh --build

Omit --build if the Docker images do not need to be rebuilt. Right now Docker and poetry are not able to cache the installs, so rebuilding the image every time is tedious.

Contributing

If you have a question, found a bug or want to propose a new feature, have a look at the issues page.

Pull requests are especially welcome when they fix bugs or improve the code quality.

License

Affero General Public License 3.0

dehyphen's People

prototypefund davidfarago atmomomo zlian1758

dehyphen's Issues

text_to_format not found

In the provided examples a method 'text_to_format' is mentioned that was not imported.

NameError: name 'text_to_format' is not defined

Returns list with single character entries

I used a text that has a \n at the end of every line but no paragraph breaks, because that is how I obtained it. When I use the scorer function, it returns a list in which every single character is its own entry.

Here is part of my example file:

Beginn: 10.00 UhrPräsident Dr. Norbert Lammert: \nDie Sitzung ist eröffnet.\nLiebe Kolleginnen und Kollegen! Ich begrüße Sie alle\nherzlich, wünsche Ihnen einen guten Morgen und uns\neine interessante Sitzungswoche. \nVor Eintritt in die Tagesordnung gratuliere ich der\nKollegin Heidemarie Wieczorek-Zeul, die am 21. No-\nvember ihren 65. Geburtstag gefeiert hat, im Namen des\nganzen Hauses herzlich. Alle guten Wünsche für die\nnächsten Jahre!\n(Beifall)\nWir kommen nun zum Tagesordnungspunkt I:\nEidesleistung des Bundesministers für Arbeit\nund Soziales\nDer Herr Bundespräsident hat mir mit Schreiben vom\n21. November dieses Jahres Folgendes mitgeteilt:\nGemäß Artikel 64 Absatz 1 des Grundgesetzes für\ndie Bundesrepublik Deutschland habe ich heute auf\nVorschlag der Frau Bundeskanzlerin den Bundes-\nminister für Arbeit und Soziales, Herrn Franz\nMüntefering, aus seinem Amt als Bundesminister\nentlassen und Herrn Olaf Scholz zum Bundesminis-\nter für Arbeit und Soziales ernannt. \nNach Art. 64 Abs. 2 des Grundgesetzes leistet ein\nBundesminister bei der Amtsübernahme den in Art. 56\nvorgesehenen Eid.\nHerr Bundesminister Scholz, ich darf Sie zur Eides-\nleistung zu mir bitten.\n(Die Anwesenden erheben sich)\nHerr Minister, ich darf Sie bitten, den Eid zu leisten. \nOlaf Scholz, Bundesminister für Arbeit und Sozia-\nles: \nIch schwöre, dass ich meine Kraft dem Wohle des\ndeutschen Volkes widmen, seinen Nutzen mehren, Scha-\nden von ihm wenden, das Grundgesetz und die Gesetze\ndes Bundes wahren und verteidigen, meine Pflichten ge-wissenhaft erfüllen und Gerechtigkeit gegen jedermann\nüben werde.

And this is the output I get:

'i'], ['a'], ['r'], ['d'], ['e'], ['n'], [' '], ['E'], ['u'], ['r'], ['o'], [' '], ['d'], ['a'], ['s'], [' '], ['m'], ['e'], ['i'], ['s'], ['t'], ['e'], [' '], ['G'], ['e'], ['l'], ['d'], [' '], ['g'], ['e'], ['b'], ['u'], ['n'], ['d'], ['e'], ['n'], ['.'], [' '], ['D'], ['e'], ['r'], ['\n'], ['A'], ['u'], ['s'], ['b'], ['a'], ['u'], [' '], ['d'], ['e'], ['r'], [' '], ['B'], ['r'], ['e'], ['i'], ['t'], ['b'], ['a'], ['n'], ['d'], ['v'], ['e'], ['r'], ['s'], ['o'], ['r'], ['g'], ['u'], ['n'], ['g'], [' '], ['i'], ['n'], [' '], ['d'], ['e'], ['n'], [' '], ['l'], ['ä'], ['n'], ['d'], ['l'], ['i'], ['c'], ['h'], ['e'], ['n'], [' '], ['R'], ['ä'], ['u'], ['-'], ['\n'], ['m'], ['e'], ['n'], [' '], ['–'], [' '], ['d'], ['a'], ['s'], [' '], ['i'], ['s'], ['t'], [' '], ['s'], ['c'], ['h'], ['o'], ['n'], [' '], ['g'], ['e'], ['n'], ['a'], ['n'], ['n'], ['t'], [' '], ['w'], ['o'], ['r'], ['d'], ['e'], ['n'], [' '], ['–'], [' '], ['i'], ['s'], ['t'], [' '], ['s'], ['e'], ['h'], ['r'], [' '], ['w'], ['i'], ['c'], ['h'], ['t'], ['i'], ['g'], ['.'], ['\n'], ['A'], ['b'], ['e'], ['r'], [' '], ['a'], ['u'], ['c'], ['h'], [' '], ['w'], ['i'], ['r'], [' '], ['s'], ['e'], ['h'], ['e'], ['n'], [' '], ['d'], ['a'], ['s'], [' '], ['–'], [' '], ['d'], ['a'], [' '], ['s'], ['i'], ['n'], ['d'], [' '], ['w'], ['i'], ['r'], [' '], ['m'], ['i'], ['t'], [' '], ['d'], ['e'], ['r'], [' '], ['F'], ['D'], ['P'], [' '], ['e'], ['i'], ['n'], ['e'],
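The thread does not state the cause. One plausible explanation, offered only as an assumption, is that a raw string was passed where a nested structure of paragraphs and lines is expected, because iterating over a Python string yields single characters. The sketch below illustrates the difference with a hypothetical helper `to_paragraphs`; it is not dehyphen's API. The other issue on this page mentions a `text_to_format` function in the project's examples, which may be the intended way to prepare the input.

```python
from typing import List


def to_paragraphs(text: str) -> List[List[str]]:
    """Hypothetical helper: blank lines separate paragraphs, each a list of lines."""
    paragraphs = []
    for block in text.split("\n\n"):
        lines = [line for line in block.split("\n") if line.strip()]
        if lines:
            paragraphs.append(lines)
    return paragraphs


raw = (
    "Vor Eintritt in die Tagesordnung gratuliere ich der\n"
    "Kollegin Heidemarie Wieczorek-Zeul, die am 21. No-\n"
    "vember ihren 65. Geburtstag gefeiert hat"
)

# Iterating over a string yields characters, matching the shape of the report.
print([[ch] for ch in raw][:3])  # [['V'], ['o'], ['r']]

# Splitting first yields one paragraph that holds three lines.
print(to_paragraphs(raw))
```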

ImportError: cannot import name 'SpanLabel' from 'flair.data'

Hi, when trying to run

from dehyphen import FlairScorer

I get the following error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-12-1bbf8cdaa7d0> in <module>
      3 import os
      4 import sqlite3
----> 5 from dehyphen import FlairScorer

/usr/local/lib/python3.9/site-packages/dehyphen/__init__.py in <module>
      7     text_to_format,
      8 )
----> 9 from .scorer import FlairScorer

/usr/local/lib/python3.9/site-packages/dehyphen/scorer.py in <module>
      1 from functools import lru_cache
      2 
----> 3 import flair
      4 from flair.embeddings import FlairEmbeddings
      5 

/usr/local/lib/python3.9/site-packages/flair/__init__.py in <module>
     20 # from . import visual
     21 # from . import trainers
---> 22 from . import nn
     23 from .training_utils import AnnealOnPlateau
     24 

/usr/local/lib/python3.9/site-packages/flair/nn/__init__.py in <module>
      1 from .dropout import LockedDropout, WordDropout
----> 2 from .model import Model, Classifier, DefaultClassifier

/usr/local/lib/python3.9/site-packages/flair/nn/model.py in <module>
     13 import flair
     14 from flair import file_utils
---> 15 from flair.data import DataPoint, Sentence, Dictionary, SpanLabel
     16 from flair.datasets import DataLoader, SentenceDataset
     17 from flair.training_utils import Result, store_embeddings

ImportError: cannot import name 'SpanLabel' from 'flair.data' (/usr/local/lib/python3.9/site-packages/flair/data.py)

Why so? Thanks!
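The last frames of the traceback are inside flair itself: flair/nn/model.py fails to import SpanLabel from flair/data.py. That may point to an inconsistent or mismatched flair installation rather than to dehyphen's own code, although the thread does not confirm this. A first diagnostic step is to print the installed versions. The sketch below uses only the standard library; the package names in the list are a guess at what is relevant here.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(installed_versions(["flair", "dehyphen", "torch"]))
```

Including this output in a bug report makes it easier to tell whether a fresh install into a clean environment is worth trying.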

Some language models at hu-berlin no longer available

The tests use scorer = FlairScorer(lang="multi-v0", fast=True), but the project then tries to download the language model from a stale URL at hu-berlin. Switching to scorer = FlairScorer(lang="de", fast=True) leads to the same problem.

As a workaround, you can use the standard German language model with scorer = FlairScorer(lang="de"), as production does.
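If code should keep working regardless of which model is available, the workaround can be wrapped in a small fallback. This is a sketch only: `first_working` is a hypothetical helper, and the commented usage assumes that FlairScorer raises an exception when the download fails, which the issue does not spell out.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def first_working(
    factory: Callable[..., Any],
    candidates: Iterable[Dict[str, Any]],
) -> Tuple[Any, Dict[str, Any]]:
    """Call factory(**kwargs) per candidate; return the first result that succeeds."""
    errors = []
    for kwargs in candidates:
        try:
            return factory(**kwargs), kwargs
        except Exception as exc:  # e.g. a failed model download
            errors.append(f"{kwargs}: {exc}")
    raise RuntimeError("no configuration worked:\n" + "\n".join(errors))


# Intended use (needs dehyphen and network access, so it is left commented out):
# from dehyphen import FlairScorer
# scorer, used = first_working(
#     FlairScorer,
#     [{"lang": "multi-v0", "fast": True}, {"lang": "de"}],
# )
```

Returning the keyword arguments alongside the object makes it easy to log which configuration was actually used.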