openphilology / nidaba

An expandable and scalable OCR pipeline

License: GNU General Public License v2.0

Python 65.40% CSS 2.12% JavaScript 29.59% HTML 2.77% XSLT 0.13%

nidaba's Introduction

Overview

Nidaba is the central controller for the entire OGL OCR pipeline. It oversees and automates the process of converting raw images into citable collections of digitized texts.

It offers the following functionality:

Grayscale Conversion
Binarization utilizing Sauvola adaptive thresholding, Otsu, or ocropus's nlbin algorithm
Deskewing
Dewarping
Integration of tesseract, kraken, and ocropus OCR engines
Page segmentation from the aforementioned OCR packages
Various postprocessing utilities like spell-checking, merging of multiple results, and ground truth comparison.

Because it is designed around a common storage medium on network-attached storage and the celery distributed task queue, it scales nicely to multi-machine clusters.
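
Of the binarization methods listed above, Otsu's is the easiest to illustrate in a few lines. The following is a minimal pure-Python sketch of the textbook algorithm, not nidaba's implementation; the function names `otsu_threshold` and `binarize` are made up for this example, and pixels are assumed to be 8-bit gray values in a flat list.

```python
def otsu_threshold(pixels):
    """Return the gray level that maximizes between-class variance."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1

    total = sum(hist)
    weighted_total = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    background_count = 0
    background_sum = 0
    for level in range(256):
        background_count += hist[level]
        background_sum += level * hist[level]
        foreground_count = total - background_count
        if background_count == 0:
            continue  # no pixels at or below this level yet
        if foreground_count == 0:
            break  # every pixel is already in the background class
        mean_bg = float(background_sum) / background_count
        mean_fg = float(weighted_total - background_sum) / foreground_count
        variance = background_count * foreground_count * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(pixels, threshold):
    """Map gray values to 0 (at or below threshold) or 255 (above)."""
    return [255 if value > threshold else 0 for value in pixels]
```

Unlike Sauvola, which computes a threshold per local window, this picks one global threshold for the whole page, which is why it tends to struggle with unevenly lit scans.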

Build

The easiest way to install the latest stable(-ish) nidaba is from PyPI:

$ pip install nidaba

or run:

$ pip install .

in the git repository for the bleeding edge development version.

Some useful tasks have external dependencies. A good start is:

# apt-get install libtesseract3 tesseract-ocr-eng libleptonica-dev liblept

Tests

By default, the dictionaries and OCR models needed to run the tests are not installed. To download the necessary files, run:

$ python setup.py download

Afterwards, the test suite can be run:

$ python setup.py nosetests

Tests for modules that call external programs (currently only tesseract, ocropus, and kraken) will be skipped if these aren't installed.

Running

First edit (the installed) nidaba.yaml and celery.yaml to fit your needs. Have a look at the docs if you haven't set up a celery-based application before.

Then start up the celery daemon with something like:

$ celery -A nidaba worker

Next, jobs can be added to the pipeline using the nidaba executable:

$ nidaba batch -b otsu -l tesseract -o tesseract:eng -- ./input.tiff
Preparing filestore             [✓]
Building batch                  [✓]
951c57e5-f8a0-432d-8d77-8a2e27fff53c

Using the returned batch identifier, the current state of the job can be retrieved:

$ nidaba status 951c57e5-f8a0-432d-8d77-8a2e27fff53c
PENDING

Once the job has been processed, the status command returns a list of paths containing the final output:

$ nidaba status 951c57e5-f8a0-432d-8d77-8a2e27fff53c
SUCCESS
14.tif → .../input_img.rgb_to_gray_binarize.otsu_ocr.tesseract_grc.tif.hocr
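
For scripting, the status output shown above can be split into a state and a mapping of inputs to outputs. This is a rough sketch that assumes the output always looks like the two samples here (a state on the first line, then `input → output` pairs); `parse_status` is a hypothetical helper, not part of nidaba, and other versions print a different format.

```python
def parse_status(output):
    """Split status text into (state, {input_name: output_path})."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        raise ValueError('empty status output')
    state = lines[0]
    results = {}
    for line in lines[1:]:
        if '→' in line:
            source, target = line.split('→', 1)
            results[source.strip()] = target.strip()
    return state, results
```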

Documentation

Want to learn more? Read the Docs

nidaba's Issues

Sending multiple languages to tesseract

Is it still possible to send multiple languages to tesseract ocr? When I use

-o tesseract:languages=grc+eng,extended=True

as a switch, I get the error message

File "/cluster/home/mmunso01/envs/nidaba/bin/nidaba", line 11, in <module>
    sys.exit(main())
  File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/click/core.py", line 700, in __call__
    return self.main(*args, **kwargs)
  File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/click/core.py", line 680, in main
    rv = self.invoke(ctx)
  File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/click/core.py", line 1027, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/click/core.py", line 873, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/click/core.py", line 508, in invoke
    return callback(*args, **kwargs)
  File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/nidaba/cli.py", line 240, in batch
    batch.add_task('ocr', alg[0], **kwargs)
  File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/nidaba/nidaba.py", line 662, in add_task
    task_arg_validator(task.get_valid_args(), **kwargs)
  File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/nidaba/nidaba.py", line 72, in task_arg_validator
    raise NidabaInputException('{} not in list of valid values'.format(val))
nidaba.nidabaexceptions.NidabaInputException: grc+eng not in list of valid values

I see that the languages are supposed to be a list in the documentation, but I am not sure how to get nidaba to recognize a list of languages from the command line.
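
The error message suggests that the whole string grc+eng is being compared against the list of valid values as a single item. The sketch below imitates that behaviour and shows how splitting on + first would avoid it. It is an illustration only: `check_values` and `split_languages` are hypothetical names, not nidaba's validator, and whether the command line accepts such a list is not confirmed here.

```python
def split_languages(spec):
    """Turn 'grc+eng' into ['grc', 'eng']."""
    return [part for part in spec.split('+') if part]


def check_values(valid, value):
    """Accept a single value or a list; every item must be in `valid`."""
    candidates = value if isinstance(value, (list, tuple)) else [value]
    for candidate in candidates:
        if candidate not in valid:
            raise ValueError(
                '{} not in list of valid values'.format(candidate))
    return list(candidates)
```

With valid values `['grc', 'eng']`, passing the raw string `'grc+eng'` raises, while passing `split_languages('grc+eng')` is accepted.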

When segmentation jobs fail, batch crashes

This may also be the case when other jobs fail, but I have noticed it whenever I get a NidabaTesseractException, which I think happens because Tesseract segmentation fails on empty pages. At that point, the segmentation jobs already in the queue appear to finish, but the jobs that come after segmentation do not even start, even for the pages that did not get a segmentation error.
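
The behaviour asked for here is per-page failure isolation: one page failing should not stop the remaining steps for the other pages. The sketch below shows that idea with plain functions; it does not use celery and says nothing about how nidaba actually chains its tasks. `run_pipeline` and the step callables are hypothetical.

```python
def run_pipeline(pages, steps):
    """Run every step on every page; a failing page does not block the rest.

    Returns (done, failed): outputs keyed by page, and exceptions keyed by page.
    """
    done, failed = {}, {}
    for page in pages:
        data = page
        try:
            for step in steps:
                data = step(data)
        except Exception as exc:  # record the failure and move on
            failed[page] = exc
            continue
        done[page] = data
    return done, failed
```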

python setup.py nosetests

When running python setup.py nosetests, I get the following output:

running nosetests
running egg_info
creating nidaba.egg-info
writing pbr to nidaba.egg-info/pbr.json
writing requirements to nidaba.egg-info/requires.txt
writing nidaba.egg-info/PKG-INFO
writing top-level names to nidaba.egg-info/top_level.txt
writing dependency_links to nidaba.egg-info/dependency_links.txt
writing entry points to nidaba.egg-info/entry_points.txt
[pbr] Processing SOURCES.txt
writing manifest file 'nidaba.egg-info/SOURCES.txt'
[pbr] In git context, generating filelist from git
warning: no files found matching 'AUTHORS'
warning: no previously-included files found matching '.gitreview'
warning: no previously-included files matching '*.pyc' found anywhere in distribution
writing manifest file 'nidaba.egg-info/SOURCES.txt'
SSSS........SSSS....................................................................................................................................E...Using default language params
Terminated

syntax error

When running python setup.py download I get the following syntax error 4 times:

creating /home/joel/OCR/nidaba/.eggs/linecache2-1.0.0-py2.7.egg
Extracting linecache2-1.0.0-py2.7.egg to /home/joel/OCR/nidaba/.eggs
  File "/home/joel/OCR/nidaba/.eggs/linecache2-1.0.0-py2.7.egg/linecache2/tests/inspect_fodder2.py", line 102
    def keyworded(*arg1, arg2=1):
                            ^
SyntaxError: invalid syntax
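
The flagged line uses a keyword-only argument after *arg1, which is Python 3 syntax (PEP 3102) and cannot be compiled by Python 2.7. Because the file lives in linecache2's own tests directory, the message is probably harmless for nidaba itself, though that is an assumption. The check below shows the same definition compiling under Python 3; `compiles` is a helper written for this example.

```python
import sys

SNIPPET = "def keyworded(*arg1, arg2=1):\n    return arg1, arg2\n"


def compiles(source):
    """Return True if the running interpreter can compile `source`."""
    try:
        compile(source, '<snippet>', 'exec')
        return True
    except SyntaxError:
        return False


print(sys.version_info[0], compiles(SNIPPET))
```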

Need unified terms for language models

When using Tesseract, the languages used are designated by the "languages" keyword, whereas with kraken/OCRopus they are designated as models. Perhaps this isn't a big deal, but I think it would be nice to simply have a single keyword here so that I can type tesseract:model=grc and kraken:model=grc and have them both work.
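
One way to offer a single keyword without touching the engines is a small translation layer in front of them. The following is only a sketch of that idea: `NATIVE_KEYWORD` and `translate_model_kwarg` are hypothetical names, and the mapping is taken from the description in this issue rather than from nidaba's code.

```python
# Keyword each engine expects, as described in this issue.
NATIVE_KEYWORD = {
    'tesseract': 'languages',
    'kraken': 'model',
    'ocropus': 'model',
}


def translate_model_kwarg(engine, kwargs):
    """Rename a unified 'model' keyword to the engine's native keyword."""
    native = NATIVE_KEYWORD[engine]
    translated = dict(kwargs)
    if native != 'model' and 'model' in translated:
        if native in translated:
            raise ValueError(
                "give either 'model' or '{}', not both".format(native))
        value = translated.pop('model')
        # Assumption: tesseract expects a list of languages.
        translated[native] = value if isinstance(value, list) else [value]
    return translated
```

With this in place, tesseract:model=grc would reach the tesseract task as languages=['grc'], while kraken:model=grc passes through unchanged.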

Catalog integration groundwork

Right now catalog integration is completely missing, except for the rather basic metadata task. Let's think about how to generate MODS/MADS records for batches, TEI metadata from MADS/MODS records, or some weird combination of the above.

pip install . problem

When installing with pip install ., I get the following error message:

Downloading/unpacking pyxDamerauLevenshtein==1.3.1 (from nidaba==0.3.14)
  Downloading pyxDamerauLevenshtein-1.3.1.tar.gz (51kB): 51kB downloaded
  Running setup.py (path:/cluster/home/mmunso01/envs/nidaba/build/pyxDamerauLevenshtein/setup.py) egg_info for package pyxDamerauLevenshtein
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/cluster/home/mmunso01/envs/nidaba/build/pyxDamerauLevenshtein/setup.py", line 28, in <module>
        import numpy
    ImportError: No module named numpy
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/cluster/home/mmunso01/envs/nidaba/build/pyxDamerauLevenshtein/setup.py", line 28, in <module>
        import numpy
    ImportError: No module named numpy

----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /cluster/home/mmunso01/envs/nidaba/build/pyxDamerauLevenshtein
Storing debug log for failure in /cluster/home/mmunso01/.pip/pip.log

By running pip install numpy first and then running pip install ., I was able to overcome this problem. I also had this problem with PyTables so it could be that pyxDamerauLevenshtein needs to be detached and installed later (or numpy detached and installed earlier).

Email?

@mittagessen Ben, your email server (l.unchti.me) seems to not be accepting emails. At least not from smtp.informatik.uni-leipzig.de.

Overhaul web frontend

It is a bloody mess right now and should be rewritten by someone who understands frontend work.

cannot install with python 3.7

  Building wheel for pyxDamerauLevenshtein (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/nidaba/nidaba/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-caa2lwvy/pyxDamerauLevenshtein/setup.py'"'"'; __file__='"'"'/tmp/pip-install-caa2lwvy/pyxDamerauLevenshtein/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-wc90vya7
       cwd: /tmp/pip-install-caa2lwvy/pyxDamerauLevenshtein/

[...]

/home/nidaba/nidaba/lib/python3.7/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: #warning "Using deprecated NumPy API, disable it with " "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
   #warning "Using deprecated NumPy API, disable it with " \
    ^~~~~~~
  pyxdameraulevenshtein/pyxdameraulevenshtein.c: In function ‘__Pyx_GetException’:
  pyxdameraulevenshtein/pyxdameraulevenshtein.c:5209:24: error: ‘PyThreadState’ {aka ‘struct _ts’} has no member named ‘exc_type’; did you mean ‘curexc_type’?
       tmp_type = tstate->exc_type;

====

Any idea?

Spell-checker produces correction candidates for valid words

When using Kraken and the spell-checker flag, the spell-checker produces correction candidates even for words that should be in the dictionary, e.g., ἀλλὰ.
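
One possible cause, not confirmed in this report, is a Unicode normalization mismatch: polytonic Greek can be stored precomposed (NFC) or as base letters plus combining accents (NFD), and the two forms compare unequal even though they look identical. The sketch below shows the effect with ἀλλὰ and a lookup that normalizes both sides; `normalized_lookup` is a hypothetical helper, not nidaba's spell checker.

```python
import unicodedata


def normalized_lookup(word, dictionary, form='NFC'):
    """Check membership after bringing both sides to the same normal form."""
    normalized_word = unicodedata.normalize(form, word)
    normalized_entries = {unicodedata.normalize(form, entry)
                          for entry in dictionary}
    return normalized_word in normalized_entries


composed = '\u1f00\u03bb\u03bb\u1f70'                 # ἀλλὰ, precomposed
decomposed = unicodedata.normalize('NFD', composed)   # same word, combining marks
```

A plain `decomposed in {composed}` test is False here, whereas `normalized_lookup(decomposed, [composed])` is True.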

Improve testing of celery tasks

Testing celery tasks doesn't work well. Running them synchronously invokes all the helper task baggage, which requires a working redis db, while executing the base function using .run() causes a failure in calls to the celery logger. Because of this, all tests are currently inoperative.

Problem with spell checking

When I run the command

nidaba batch -b nlbin:threshold=0.5,zoom=0.5,escale=1.0,border=0.1,perc=80,range=20,low=5,high=90 -l tesseract -o kraken:model=migne -o kraken:model=migne-njp -o kraken:model=omnibus -p spell_check:language=polytonic_greek,filter_punctuation=False -f tei2txt -- *.png

on Homer, I get the error:

postprocessing.spell_check (2, n): AttributeError: 'list' object has no attribute 'startswith'

Binarization fails

When I run the command

nidaba batch --binarize sauvola:10,20,30,40 --ocr tesseract:grc+eng --willitblend -- /home/mmunson/ddd/extracted_books/uc1.b4034434/*.png

I get the following error message for every image:

Error in pixSauvolaBinarize: whsize too large for image
[2015-05-07 14:41:31,041: ERROR/MainProcess] Task nidaba.binarize.sauvola[28767f13-0328-4efb-8585-083f56555a25] raised unexpected: NidabaLeptonicaException('Binarization failed for unknownreason.',)
Traceback (most recent call last):
  File "/home/mmunson/envs/nidaba/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/mmunson/envs/nidaba/local/lib/python2.7/site-packages/nidaba/tasks/helper.py", line 61, in __call__
    ret = super(NidabaTask, self).__call__(*args, **nkwargs)
  File "/home/mmunson/envs/nidaba/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/mmunson/envs/nidaba/local/lib/python2.7/site-packages/nidaba/plugins/leptonica.py", line 69, in sauvola
    lept_sauvola(input_path, output_path, whsize, factor)
  File "/home/mmunson/envs/nidaba/local/lib/python2.7/site-packages/nidaba/plugins/leptonica.py", line 111, in lept_sauvola
    raise NidabaLeptonicaException('Binarization failed for unknown'
NidabaLeptonicaException: Binarization failed for unknownreason.
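
The first line of the log comes from Leptonica rather than nidaba. As far as I recall from Leptonica's source, pixSauvolaBinarize rejects a window half-size (whsize) below 2 and requires both image dimensions to be at least 2 * whsize + 3; treat that rule as an assumption and check it against the installed version. How nidaba maps the values after sauvola: to whsize is not shown here. The helpers below are hypothetical and only restate that rule.

```python
def whsize_fits(width, height, whsize):
    """True if the assumed Leptonica size check would pass."""
    limit = 2 * whsize + 3
    return whsize >= 2 and width >= limit and height >= limit


def max_whsize(width, height):
    """Largest whsize satisfying 2 * whsize + 3 <= min(width, height)."""
    return (min(width, height) - 3) // 2
```

Comparing the pixel dimensions of the failing images against the whsize in use would show quickly whether this check is what triggers the error.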

Error: Invalid value for "--ocr" / "-o": Positional arguments are deprecated!

When I use the following command to initialize a nidaba batch

nidaba batch -b nlbin:threshold=0.5,zoom=0.5,escale=1.0,border=0.1,perc=80,range=20,low=5,high=90 -l tesseract -o kraken:greek -p spell_check:polytonic_greek phaistos/OCR/OGL/septuagint-dev/raw_ocr/hvd.swete.1.1901.kraken/*.pbm.png

I get the following error message:

Usage: nidaba batch [OPTIONS] FILES...

Error: Invalid value for "--ocr" / "-o": Positional arguments are deprecated!

Has something changed with the API? This command worked without a problem before.
I am using nidaba 0.9.3 and kraken 0.4.2
You can check out my nidaba.yaml file on Homer at /home/mmunson/envs/nidaba/etc/nidaba/nidaba.yaml
This is also no longer working on the Tufts cluster where it worked before.
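
Other commands on this page use a key=value form after the colon, for example kraken:model=greek and spell_check:language=polytonic_greek, so the error most likely means that bare values such as kraken:greek are no longer accepted. The parser below sketches that rule. It is not nidaba's option parser, `parse_task_spec` is a hypothetical name, and the keyword names valid for a given nidaba version should be taken from its documentation.

```python
def parse_task_spec(spec):
    """Parse 'name:key=value,key=value' into (name, kwargs)."""
    name, _, arg_string = spec.partition(':')
    kwargs = {}
    if arg_string:
        for item in arg_string.split(','):
            key, separator, value = item.partition('=')
            if not separator:
                raise ValueError(
                    'Positional arguments are deprecated: {!r}'.format(item))
            kwargs[key] = value
    return name, kwargs
```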

Errors with Kraken

When I use the following command:

nidaba batch -b otsu -l tesseract -o kraken:grc_teubner -p spell_check:polytonic_greek -- /cluster/tufts/perseus_ocr/nidaba/teubner/ammonius_1966/*.tif

I get the following error in the celery log for every page, i.e., every page fails with the same error:

[2015-08-13 10:32:52,073: ERROR/MainProcess] Task nidaba.ocr.kraken[00db4d89-a411-4098-9036-9865acabe112] raised unexpected: AttributeError("'NoneType' object has no attribute 'predictString'",)
Traceback (most recent call last):
  File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/nidaba/tasks/helper.py", line 81, in __call__
    ret = super(NidabaTask, self).__call__(*args, **nkwargs)
  File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/nidaba/plugins/kraken.py", line 123, in ocr_kraken
    for rec in rpred.rpred(rnn, img, [(int(x[0]), int(x[1]), int(x[2]), int(x[3])) for x in lines]):
  File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/kraken/rpred.py", line 132, in rpred
    pred = network.predictString(line)
AttributeError: 'NoneType' object has no attribute 'predictString'

nidaba.yaml looks like this:

# The home directory for Iris to store files created by OCR jobs. For example,
# tifs, jp2s, meta.xml, and abbyy file downloaded from archive.org are stored
# here. Each new job is automatically placed in a uniquely named directory.
storage_path: /cluster/tufts/perseus_ocr/nidaba/OCR/

# URL to the redis database. May be shared with celery.
redis_url: 'redis://127.0.0.1:6379'

# Spell check configuration. Dictionaries are kept on the common medium (i.e.
# at STORAGE_PATH/tuple[0]/tuple[1]). Each spell checker requires a list of
# valid words ('dictionary') and a dictionary containing all variants of words
# attained by deletion of single characters (see nidaba.lex.make_deldict).
lang_dicts:
  polytonic_greek: {dictionary: [dicts, greek.dic],
                    deletion_dictionary: [dicts, del_greek.dic]}
  latin: {dictionary: [dicts, latin.dic],
                    deletion_dictionary: [dicts, del_latin.dic]}

# Ocropus/kraken models
ocropus_models:
  greek: [models, omnibus-2014-05-31-10-16-00087000.pyrnn.gz]
  grc_teubner: [models, teubner-serif-2013-12-16-11-26-00067000.pyrnn.gz]
  atlantean: [models, atlantean.pyrnn.gz]
  fraktur: [models, fraktur.pyrnn.gz]
  fancy_ligatures: [models, ligatures.pyrnn.gz]

# Models solely working with kraken (i.e. models in HDF5 format).
kraken_models:
  default: [models, en-default.hdf5]


# List of plugins to load. Additional fields in the associative array will be
# handed over to the setup function of the plugin.  Be aware that plugins
# utilizing external components that aren't installed will cause nidaba to
# abort. 
plugins_load:
  tesseract: {implementation: capi, # set to either legacy (hOCR
                                              # output in an *.html file),
                                              # direct (hOCR output in an
                                              # *.hocr file), or capi
                                              # (tesseract version >= 3.02)
             tessdata: /cluster/tufts/perseus_ocr_code/tesseract/tessdata} # location of the tessdata
                                                 # path. May also be a storage
                                                 # tuple.
  #ocropus: {}
  kraken: {}
  #leptonica: {}
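
The comments in this configuration say that dictionary entries are resolved as STORAGE_PATH/tuple[0]/tuple[1]. Assuming model entries are resolved the same way, a quick check that every configured file actually exists can rule out one cause of a model silently failing to load. Whether a missing file is behind the NoneType error above is not established; the helper names are hypothetical.

```python
import os


def resolve_storage_tuple(storage_path, entry):
    """Join storage_path with a [directory, filename] entry."""
    directory, filename = entry
    return os.path.join(storage_path, directory, filename)


def missing_files(storage_path, entries):
    """Return {name: path} for every entry whose file does not exist."""
    missing = {}
    for name, entry in entries.items():
        path = resolve_storage_tuple(storage_path, entry)
        if not os.path.isfile(path):
            missing[name] = path
    return missing
```

For the configuration above, this would be called with the storage_path value and the ocropus_models mapping loaded from nidaba.yaml.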

Can't prepare filestore

When I use the following command

nidaba batch -b nlbin:threshold=0.5,zoom=0.5,escale=1.0,border=0.1,perc=80,range=20,low=5,high=90 -l tesseract -o kraken:model=greek -p spell_check:dict=polytonic_greek -- /home/mmunson/phaistos/OCR/in_progress/OCR/hvd.swete.1.1901.kraken/*.png

I get the following response

Preparing filestore             [✗]

And then the job exits. The permissions on the destination folder (phaistos/OCR/in-progress/OCR) are 777, so it doesn't appear to be a permissions problem.

Show if batch is finished in status display

Display the current state of the batch in a more useful manner, i.e. show that the batch is finished even when it ended in a non-success state.

File required for tests doesn't exist anymore

From docs:

Per default no dictionaries and OCR models necessary to runs the tests are installed. To download the necessary files run:

$ python setup.py download

Afterwards, the test suite can be run:

$ python setup.py nosetests

python setup.py download tries to fetch the archive http://l.unchti.me/nidaba/tests.tar.bz2, but the link is broken.

Doesn't produce output

When I run this command:

nidaba batch -b otsu -l tesseract -o tesseract:grc+eng,extended=True -p spell_check:polytonic_greek -- /cluster/tufts/perseus_ocr/nidaba/teubner/ammonius_1966/*.tif

It produces the filestore and it gets through the rgb-to-gray and the binarization steps, but then it seems to hang and does not produce any OCR output. When I request the status, it says that it is pending.
I will send my celery log file by email since I don't know how to attach it to this issue.

API on Homer fails with a Broken pipe

The three jobs I sent to Homer to do the LXX all seem to have failed with a Broken pipe error. So, for instance, the following command

nidaba status -h http://139.18.40.155:8000/api/v1 0e81b36d-b20e-4f7a-8612-f69355d45352 | head

returns

Status: failed

2943/4915 tasks completed. 0 running.

Output files:


Errors:

nidaba.ocr.kraken (a): 
Traceback (most recent call last):
  File "/home/mmunson/envs/nidaba/bin/nidaba", line 11, in <module>
    sys.exit(main())
  File "/home/mmunson/envs/nidaba/local/lib/python2.7/site-packages/click/core.py", line 700, in __call__
    return self.main(*args, **kwargs)
  File "/home/mmunson/envs/nidaba/local/lib/python2.7/site-packages/click/core.py", line 680, in main
    rv = self.invoke(ctx)
  File "/home/mmunson/envs/nidaba/local/lib/python2.7/site-packages/click/core.py", line 1027, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mmunson/envs/nidaba/local/lib/python2.7/site-packages/click/core.py", line 873, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mmunson/envs/nidaba/local/lib/python2.7/site-packages/click/core.py", line 508, in invoke
    return callback(*args, **kwargs)
  File "/home/mmunson/envs/nidaba/local/lib/python2.7/site-packages/nidaba/cli.py", line 410, in status
    task['errors'][-1]))
  File "/home/mmunson/envs/nidaba/local/lib/python2.7/site-packages/click/utils.py", line 315, in echo
    file.flush()
IOError: [Errno 32] Broken pipe

And the job does not appear to be running any more. The job ids for these three jobs are:

0e81b36d-b20e-4f7a-8612-f69355d45352
a9dfce52-5589-45be-a910-924cd63632f2
a754df9b-8216-42a4-97a5-7fe84d9c5e68
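
The IOError at the end of the traceback is what a Python program gets when it keeps writing after the reader of a pipe has exited, which is what head does once it has printed its first ten lines. It is therefore probably a side effect of piping the status output through head and separate from the "Status: failed" result itself. A writer can treat this condition as a normal end of output, as in the sketch below; `echo_lines` is a hypothetical helper, not the click or nidaba code.

```python
import errno


def echo_lines(lines, stream):
    """Write lines to stream; return False if the reader closed the pipe."""
    try:
        for line in lines:
            stream.write(line + '\n')
        stream.flush()
    except (IOError, OSError) as exc:
        if exc.errno != errno.EPIPE:
            raise  # unrelated I/O errors should still surface
        return False
    return True
```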

Can't download dictionaries/models

When I run python setup.py download, I get an invalid command error:

python setup.py download
usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
   or: setup.py --help [cmd1 cmd2 ...]
   or: setup.py --help-commands
   or: setup.py cmd --help

error: invalid command 'download'

Additionally, when I try to download these dependencies manually, accessing http://l.unchti.me/nidaba/MANIFEST results in a 404 Not Found error.