pd3f / dehyphen

📜 Dehyphenation of broken text (mainly German), e.g., text extracted from a PDF

License: GNU General Public License v3.0

Python 100.00%
python pdf dehyphenation hyphens nlp german flair-embeddings flair hyphen pd3f

dehyphen's Introduction

pd3f

Experimental, use with care.

pd3f is a PDF text extraction pipeline that is self-hosted, local-first and Docker-based. It reconstructs the original continuous text with the help of machine learning.

pd3f can OCR scanned PDFs with OCRmyPDF (Tesseract) and extracts tables with Camelot and Tabula. It's built upon the output of Parsr. Parsr detects hierarchies of text and splits the text into words, lines and paragraphs.

Even though Parsr brings some structure to the PDF, the text is still scrambled, e.g., due to hyphens. The underlying Python package pd3f-core tries to reconstruct the original continuous text by removing hyphens, new lines and / or spaces. It uses language models to guess what the original text looked like.
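
The idea in the paragraph above can be illustrated with a small sketch. It is not pd3f-core's implementation: the function names join_candidates, toy_score and best_join are made up for illustration, and the vocabulary lookup is only a stand-in for a real language-model score.

```python
def join_candidates(first_line, second_line):
    """Return the possible ways to join two consecutive lines."""
    if not first_line.endswith("-"):
        # No trailing hyphen: the only option considered here is a space.
        return [first_line + " " + second_line]
    stem = first_line[:-1]
    return [
        stem + second_line,              # hyphen was a line-break artifact
        first_line + second_line,        # hyphen belongs to a compound word
        first_line + " " + second_line,  # hyphen is kept and followed by a space
    ]


def toy_score(text, vocabulary):
    """Count unknown words (lower is better). Stand-in for a language model."""
    return sum(1 for word in text.split() if word.lower() not in vocabulary)


def best_join(first_line, second_line, vocabulary):
    """Pick the candidate that the (toy) scorer likes best."""
    candidates = join_candidates(first_line, second_line)
    return min(candidates, key=lambda c: toy_score(c, vocabulary))


# Hypothetical vocabulary; a real system would score with a language model.
vocabulary = {"den", "bundesminister", "für", "arbeit"}
print(best_join("den Bundes-", "minister für Arbeit", vocabulary))
# prints: den Bundesminister für Arbeit
```

A language model replaces the vocabulary lookup in practice, because it can also rank joins that involve words it has never seen as a whole.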

pd3f is especially useful for languages with long words such as German. It was mainly developed to parse German letters and official documents. Besides German, pd3f supports English, Spanish, French and Italian. More languages will be added at a later stage.

pd3f includes a Web-based GUI and a Flask-based microservice (API). You can find a demo at demo.pd3f.com.

Documentation

Check out the full Documentation at: https://pd3f.com/docs/

Future Work / TODO

PDFs are hard to process, and it's hard to extract information from them, so the results of this tool may not satisfy you. More work will go into improving this software, but it's unlikely that it will successfully extract all the information anytime soon.

Here are some things that will be improved.

statistics about how long processing (per page) took in the past

  • calculate runtime based on job.started_at and job.ended_at
  • Get average runtime of jobs and store data in redis list

more information about PDF

  • NER
  • entity linking
  • extract keywords
  • use textacy

add more languages

  • check if flair has a model
  • what to do if there is no fast model?

Python client

  • simple client based on requests
  • send whole folders

Markdown / HTML export

  • go beyond text

use pdf-scripts / allow more processing

  • reduce size
  • repair PDF
  • detect if scanned
  • force to OCR again

improve logs / get better feedback

  • show uncertainty of ML model
  • allow different log levels
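
The runtime-statistics item in the list above could look roughly like the following sketch. It only assumes that a finished job exposes started_at and ended_at as datetimes, as the item states. The FinishedJob class, the num_pages field and the function names are hypothetical, and a plain Python list stands in for the Redis list (with redis-py, the append and read would presumably map to rpush and lrange).

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class FinishedJob:
    """Hypothetical stand-in for a finished queue job."""
    started_at: datetime
    ended_at: datetime
    num_pages: int  # assumption: the page count is known per job


def seconds_per_page(job):
    """Runtime of one job, normalised by its page count."""
    runtime = (job.ended_at - job.started_at).total_seconds()
    return runtime / max(job.num_pages, 1)


def record(job, history):
    """Append the per-page runtime to the history (stand-in for a Redis list)."""
    history.append(seconds_per_page(job))


def estimate_runtime(num_pages, history):
    """Estimate seconds for a new job, or None if there is no history yet."""
    if not history:
        return None
    return mean(history) * num_pages


history = []
job = FinishedJob(datetime(2020, 1, 1, 12, 0, 0), datetime(2020, 1, 1, 12, 0, 30), 10)
record(job, history)
print(estimate_runtime(4, history))  # prints: 12.0
```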

Related Work

Development

Install and use poetry.

Initially run:

./dev.sh --build

Omit --build if the Docker images do not need to be built. Right now, Docker + poetry cannot cache the installs, so building the image every time is uncool.

Contributing

If you have a question, found a bug or want to propose a new feature, have a look at the issues page.

Pull requests are especially welcomed when they fix bugs or improve the code quality.

License

Affero General Public License 3.0

dehyphen's People

Contributors

jfilter

dehyphen's Issues

text_to_format not found

The provided examples use a function 'text_to_format' that is never imported.

NameError: name 'text_to_format' is not defined

Returns list with single character entries

I used a text that has a \n at the end of every line but no paragraph breaks, because that's how I got the text. When using the scorer function, I get back a list in which every single character is its own entry.

Here is part of my example file:

Beginn: 10.00 UhrPräsident Dr. Norbert Lammert: \nDie Sitzung ist eröffnet.\nLiebe Kolleginnen und Kollegen! Ich begrüße Sie alle\nherzlich, wünsche Ihnen einen guten Morgen und uns\neine interessante Sitzungswoche. \nVor Eintritt in die Tagesordnung gratuliere ich der\nKollegin Heidemarie Wieczorek-Zeul, die am 21. No-\nvember ihren 65. Geburtstag gefeiert hat, im Namen des\nganzen Hauses herzlich. Alle guten Wünsche für die\nnächsten Jahre!\n(Beifall)\nWir kommen nun zum Tagesordnungspunkt I:\nEidesleistung des Bundesministers für Arbeit\nund Soziales\nDer Herr Bundespräsident hat mir mit Schreiben vom\n21. November dieses Jahres Folgendes mitgeteilt:\nGemäß Artikel 64 Absatz 1 des Grundgesetzes für\ndie Bundesrepublik Deutschland habe ich heute auf\nVorschlag der Frau Bundeskanzlerin den Bundes-\nminister für Arbeit und Soziales, Herrn Franz\nMüntefering, aus seinem Amt als Bundesminister\nentlassen und Herrn Olaf Scholz zum Bundesminis-\nter für Arbeit und Soziales ernannt. \nNach Art. 64 Abs. 2 des Grundgesetzes leistet ein\nBundesminister bei der Amtsübernahme den in Art. 56\nvorgesehenen Eid.\nHerr Bundesminister Scholz, ich darf Sie zur Eides-\nleistung zu mir bitten.\n(Die Anwesenden erheben sich)\nHerr Minister, ich darf Sie bitten, den Eid zu leisten. \nOlaf Scholz, Bundesminister für Arbeit und Sozia-\nles: \nIch schwöre, dass ich meine Kraft dem Wohle des\ndeutschen Volkes widmen, seinen Nutzen mehren, Scha-\nden von ihm wenden, das Grundgesetz und die Gesetze\ndes Bundes wahren und verteidigen, meine Pflichten ge-wissenhaft erfüllen und Gerechtigkeit gegen jedermann\nüben werde.

And this is what I get:

'i'], ['a'], ['r'], ['d'], ['e'], ['n'], [' '], ['E'], ['u'], ['r'], ['o'], [' '], ['d'], ['a'], ['s'], [' '], ['m'], ['e'], ['i'], ['s'], ['t'], ['e'], [' '], ['G'], ['e'], ['l'], ['d'], [' '], ['g'], ['e'], ['b'], ['u'], ['n'], ['d'], ['e'], ['n'], ['.'], [' '], ['D'], ['e'], ['r'], ['\n'], ['A'], ['u'], ['s'], ['b'], ['a'], ['u'], [' '], ['d'], ['e'], ['r'], [' '], ['B'], ['r'], ['e'], ['i'], ['t'], ['b'], ['a'], ['n'], ['d'], ['v'], ['e'], ['r'], ['s'], ['o'], ['r'], ['g'], ['u'], ['n'], ['g'], [' '], ['i'], ['n'], [' '], ['d'], ['e'], ['n'], [' '], ['l'], ['ä'], ['n'], ['d'], ['l'], ['i'], ['c'], ['h'], ['e'], ['n'], [' '], ['R'], ['ä'], ['u'], ['-'], ['\n'], ['m'], ['e'], ['n'], [' '], ['–'], [' '], ['d'], ['a'], ['s'], [' '], ['i'], ['s'], ['t'], [' '], ['s'], ['c'], ['h'], ['o'], ['n'], [' '], ['g'], ['e'], ['n'], ['a'], ['n'], ['n'], ['t'], [' '], ['w'], ['o'], ['r'], ['d'], ['e'], ['n'], [' '], ['–'], [' '], ['i'], ['s'], ['t'], [' '], ['s'], ['e'], ['h'], ['r'], [' '], ['w'], ['i'], ['c'], ['h'], ['t'], ['i'], ['g'], ['.'], ['\n'], ['A'], ['b'], ['e'], ['r'], [' '], ['a'], ['u'], ['c'], ['h'], [' '], ['w'], ['i'], ['r'], [' '], ['s'], ['e'], ['h'], ['e'], ['n'], [' '], ['d'], ['a'], ['s'], [' '], ['–'], [' '], ['d'], ['a'], [' '], ['s'], ['i'], ['n'], ['d'], [' '], ['w'], ['i'], ['r'], [' '], ['m'], ['i'], ['t'], [' '], ['d'], ['e'], ['r'], [' '], ['F'], ['D'], ['P'], [' '], ['e'], ['i'], ['n'], ['e'],
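
Output like this is what one would expect when a raw string is passed to code that iterates over a list of lines, since iterating a Python string yields single characters. Whether that is the actual cause here is an assumption. The sketch below reproduces the effect with a hypothetical wrap_lines function that is not part of dehyphen.

```python
def wrap_lines(lines):
    """Hypothetical consumer: expects a list of lines and wraps each one."""
    return [[line] for line in lines]


raw = "die am 21. No-\nvember ihren 65. Geburtstag"

# Passing the raw string: each character is treated as a "line".
print(wrap_lines(raw)[:3])
# prints: [['d'], ['i'], ['e']]

# Splitting into lines first gives the intended structure.
print(wrap_lines(raw.split("\n")))
# prints: [['die am 21. No-'], ['vember ihren 65. Geburtstag']]
```

If the library offers a conversion helper for raw text (the first issue on this page mentions text_to_format in the examples), running the text through it before scoring is probably the intended path.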

ImportError: cannot import name 'SpanLabel' from 'flair.data'

Hi, when trying to

from dehyphen import FlairScorer

I get the following error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-12-1bbf8cdaa7d0> in <module>
      3 import os
      4 import sqlite3
----> 5 from dehyphen import FlairScorer

/usr/local/lib/python3.9/site-packages/dehyphen/__init__.py in <module>
      7     text_to_format,
      8 )
----> 9 from .scorer import FlairScorer

/usr/local/lib/python3.9/site-packages/dehyphen/scorer.py in <module>
      1 from functools import lru_cache
      2 
----> 3 import flair
      4 from flair.embeddings import FlairEmbeddings
      5 

/usr/local/lib/python3.9/site-packages/flair/__init__.py in <module>
     20 # from . import visual
     21 # from . import trainers
---> 22 from . import nn
     23 from .training_utils import AnnealOnPlateau
     24 

/usr/local/lib/python3.9/site-packages/flair/nn/__init__.py in <module>
      1 from .dropout import LockedDropout, WordDropout
----> 2 from .model import Model, Classifier, DefaultClassifier

/usr/local/lib/python3.9/site-packages/flair/nn/model.py in <module>
     13 import flair
     14 from flair import file_utils
---> 15 from flair.data import DataPoint, Sentence, Dictionary, SpanLabel
     16 from flair.datasets import DataLoader, SentenceDataset
     17 from flair.training_utils import Result, store_embeddings

ImportError: cannot import name 'SpanLabel' from 'flair.data' (/usr/local/lib/python3.9/site-packages/flair/data.py)

Why so? Thanks!
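
The traceback shows flair's own flair/nn/model.py failing to import SpanLabel from flair/data.py. That would be consistent with an inconsistent or partially upgraded flair installation, though this diagnosis is an assumption. A first step could be to check which versions are actually installed, as in this sketch (the helper name installed_version is made up; importlib.metadata is part of the standard library since Python 3.8).

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages that appear in the traceback.
for name in ("dehyphen", "flair"):
    print(name, installed_version(name))
```

If the reported flair version looks plausible but the error persists, reinstalling flair into a clean environment would rule out leftover files from an earlier version.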

Some language models at hu-berlin no longer available

The tests use scorer = FlairScorer(lang="multi-v0", fast=True), but this project then tries to download the language model from a stale URL at hu-berlin. Switching to scorer = FlairScorer(lang="de", fast=True) leads to the same problem.

As a workaround, you could use the standard German language model with scorer = FlairScorer(lang="de"), as production does.
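
The workaround above can be wrapped in a small fallback helper, sketched below. The helper first_working is hypothetical and not part of dehyphen. It catches Exception broadly because the issue does not say which error the failed download raises.

```python
def first_working(factories):
    """Call zero-argument factories in order; return the first result that does not raise."""
    errors = []
    for factory in factories:
        try:
            return factory()
        except Exception as exc:  # broad on purpose: the failure type is unknown here
            errors.append(exc)
    raise RuntimeError(f"all {len(errors)} candidates failed: {errors}")


# Intended usage with dehyphen (not executed here, needs the package installed):
# from dehyphen import FlairScorer
# scorer = first_working([
#     lambda: FlairScorer(lang="de", fast=True),  # fast model, may hit the stale URL
#     lambda: FlairScorer(lang="de"),             # standard German model
# ])
```

Trying the fast model first keeps its benefit wherever the download still works, at the cost of one failed attempt otherwise.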
