cboulanger / excite-docker


Docker image with tools for the annotation of ML training docs for reference extraction based on the EXparser tools

Home Page: https://cboulanger.github.io/excite-docker

License: GNU General Public License v3.0

Dockerfile 0.07% Python 22.59% Shell 0.87% HTML 3.96% JavaScript 66.96% CSS 5.55%

citation-mining exparser

excite-docker's Issues

Tag-Fragments during Segmentation

Dear contributors, I encountered two cases of tag-fragments during segmentation that I could not edit, delete or interact with in any way.

After running the Auto-segmentation, the function tends to leave a fragment of the Last Page tag when the pages in the reference are given in the format “S. Start Page –”. The fragment occurs after the hyphen.
Sometimes, when using the function Correct selected text to add new characters while part of the selected text is already tagged, a fragment of the included tag is left next to the newly added characters. I noticed this multiple times but unfortunately wasn’t able to find a reliable way to reproduce it.

You can find examples of these two cases in the screenshots I added.

Allow switching of models with optional remote model repository

To support specialized models for different kinds of scholarly citation patterns, we should make the directory containing model data (now EXparser/Utils) configurable. The idea is to give each specialized model a unique name which serves both as a well-known id and as the name of the directory in which the model is stored. Since the model data depends directly on the training code, it needs to be versioned. Versioning also makes it possible to run tests comparing the performance of different versions of a model with the same id (for example, by running an evaluation that compares different git branches).

Models are stored in EXparser/Models/<version>/<model_id>. EXparser/Utils/ is renamed to EXparser/Models/<version>/default. The version number is hardcoded in configs.py and manually incremented whenever a change is made in the EXparser code that renders the model data backward-incompatible with previous code versions.
Since different models will have different training material (the whole point of having separate models), EXparser/Dataset needs to be renamed to EXparser/Datasets/default. The training material folders do not need to be versioned.
A new command docker run ... excite_toolchain create_model <model_id> is added which creates a directory EXparser/Models/<version>/<model_id> and copies over the non-reproducible model files (if there are any left). It returns a message saying that the user needs to add training material to EXparser/Datasets/<model_id> and to run training.
The model is selected when running the docker commands, such as docker run ... excite_toolchain exparser <model_id>. If no model name is supplied, "default" is used as the model id. If the model id does not exist, an error is raised saying that the command create_model must be run first.
docker run ... excite_toolchain (segmentation|extraction)_training <model_id> computes the models from the training material in EXparser/Datasets/<model_id> and places them into EXparser/Models/<version>/<model_id>.
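The directory layout and the "model must exist" check described above could look roughly like the following. This is only a sketch: the names MODEL_VERSION, ModelNotFoundError, model_dir, dataset_dir and require_model are hypothetical and do not exist in the code base, and the version string is a placeholder for whatever configs.py would define.

```python
import os

# Hypothetical placeholder; the proposal keeps the real value in configs.py.
MODEL_VERSION = "1"


class ModelNotFoundError(Exception):
    """Raised when a model id has no directory for the current version."""


def model_dir(base, model_id="default", version=MODEL_VERSION):
    """Return <base>/Models/<version>/<model_id>."""
    return os.path.join(base, "Models", version, model_id)


def dataset_dir(base, model_id="default"):
    """Return <base>/Datasets/<model_id>; training material is not versioned."""
    return os.path.join(base, "Datasets", model_id)


def require_model(base, model_id="default", version=MODEL_VERSION):
    """Return the model directory or fail with a hint to run create_model."""
    path = model_dir(base, model_id, version)
    if not os.path.isdir(path):
        raise ModelNotFoundError(
            f"Model '{model_id}' does not exist for version {version}; "
            "run create_model first."
        )
    return path
```

With base set to "EXparser", model_dir("EXparser", "legal", "2") yields EXparser/Models/2/legal, while the dataset path stays the same across versions.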

When we have this system in place, an optional storage system can be built upon it. It works with packages, each of which is a ZIP of the training material and model data, stored in a configurable location.

A new command docker run ... excite_toolchain download_model <model_id> is added which tries to download /excite-docker/<version>/<model_id>.zip from a WebDAV server (URL and credentials are supplied as environment variables). If that is successful, the ZIP is extracted and placed into the training and model directories corresponding to the version and model id.
A new command docker run ... excite_toolchain upload_model <model_id> is added, which uploads the training and model data as a ZIP to the WebDAV folder.
A new command docker run ... excite_toolchain list_models is added, which returns a list of models stored at the given repository that are compatible with the current version.
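The packaging half of this proposal can be sketched with the standard library alone. The function names (package_path, pack_model, unpack_model) and the "dataset"/"model" prefixes inside the archive are assumptions made for illustration; the issue does not specify the internal layout of the ZIP, and the WebDAV transfer itself is left out.

```python
import os
import zipfile


def package_path(version, model_id):
    """Remote path of a model package, following the pattern in the issue."""
    return f"/excite-docker/{version}/{model_id}.zip"


def pack_model(zip_file, dataset_dir, model_dir):
    """Write training material and model data into one ZIP archive."""
    with zipfile.ZipFile(zip_file, "w", zipfile.ZIP_DEFLATED) as zf:
        # The two archive prefixes are an assumed convention, not a given.
        for prefix, root in (("dataset", dataset_dir), ("model", model_dir)):
            for folder, _dirs, files in os.walk(root):
                for name in files:
                    full = os.path.join(folder, name)
                    rel = os.path.relpath(full, root)
                    zf.write(full, os.path.join(prefix, rel))


def unpack_model(zip_file, dataset_dir, model_dir):
    """Extract a package created by pack_model into the two directories."""
    targets = {"dataset": dataset_dir, "model": model_dir}
    with zipfile.ZipFile(zip_file) as zf:
        for info in zf.infolist():
            prefix, _, rel = info.filename.partition("/")
            parts = rel.split("/")
            # Skip directories, unknown prefixes and path traversal attempts.
            if info.is_dir() or prefix not in targets or not rel or ".." in parts:
                continue
            dest = os.path.join(targets[prefix], *parts)
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            with zf.open(info) as src, open(dest, "wb") as out:
                out.write(src.read())
```

Keeping both parts in one archive means a download restores a model together with the material it was trained on, which is what the download_model description asks for.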

Handling of ders. in German citations

Add a rule for "use the last author name previously recognized" when "ders." is encountered.
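As a rough illustration of such a rule, the sketch below post-processes a list of already extracted author strings. The function name resolve_same_author and the idea of applying the rule to a flat list of author strings are assumptions; where this would hook into the EXparser pipeline is not stated in the issue.

```python
# Markers meaning "same author as before"; only "ders." comes from the issue.
SAME_AUTHOR_MARKERS = {"ders."}


def resolve_same_author(authors):
    """Replace each marker with the last real author name seen before it."""
    resolved = []
    last = None
    for author in authors:
        is_marker = author.strip().lower() in SAME_AUTHOR_MARKERS
        if is_marker and last is not None:
            author = last  # reuse the previously recognized author
        elif not is_marker:
            last = author  # remember this name for later markers
        resolved.append(author)
    return resolved
```

A marker that appears before any author has been recognized is left unchanged, since there is nothing to substitute.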

Fail more gracefully if font information is missing

  File "/app/run-main.py", line 171, in <module>
    call_extraction_training(sys.argv[2])
  File "/app/run-main.py", line 119, in call_extraction_training
    os.path.join(model_dir, get_version(), model_name))
  File "/app/EXparser/Training_Ext.py", line 52, in train_extraction
    row2 = reader2[uu]
IndexError: list index out of range

This error occurs if old training material is used that doesn't have the font columns.
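One way to fail more gracefully would be to validate the rows before indexing into them and to raise an error that names the likely cause. The sketch below is hypothetical: TrainingDataError, check_rows and the min_columns parameter are invented for illustration, and the actual number of columns expected by Training_Ext.py is not given in the issue.

```python
class TrainingDataError(Exception):
    """Raised when training material does not have the expected shape."""


def check_rows(rows, min_columns, source="<unknown>"):
    """Return rows unchanged, or raise if any row has too few columns."""
    for number, row in enumerate(rows, start=1):
        if len(row) < min_columns:
            raise TrainingDataError(
                f"{source}, row {number}: expected at least {min_columns} "
                f"columns but found {len(row)}. The training material may "
                "predate the font columns and need to be regenerated."
            )
    return rows
```

Calling such a check right after reading a feature file would replace the bare IndexError with a message that tells the user which file is outdated.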

Make titles of bibliography sections configurable

This function is not only misspelled, it also hardcodes the names of possible bibliography sections. The titles need to be moved into a configurable list.
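A configurable variant might look like the sketch below. The default titles, the name is_bibliography_title and the normalization steps (dropping a leading section number and a trailing colon, ignoring case) are assumptions for illustration; in practice the list would be read from a configuration file.

```python
import re

# Hypothetical defaults; the real list would come from configuration.
BIBLIOGRAPHY_TITLES = [
    "References",
    "Bibliography",
    "Literatur",
    "Literaturverzeichnis",
]


def is_bibliography_title(line, titles=BIBLIOGRAPHY_TITLES):
    """Check whether a line is one of the configured section titles."""
    # Drop leading section numbers such as "7." and a trailing colon.
    normalized = re.sub(r"^[\d.\s]+", "", line).strip().rstrip(":").casefold()
    return normalized in {title.casefold() for title in titles}
```

Passing a different list, for example is_bibliography_title("Quellen", ["Quellen"]), covers section names that the defaults do not include.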

RuntimeWarnings during Feature Extraction

/app/EXparser/src/gle_fun_ext.py:139: RuntimeWarning: invalid value encountered in true_divide
lh2 = 1.0 * lh / sum(lh)
/app/EXparser/src/gle_fun_ext.py:141: RuntimeWarning: invalid value encountered in true_divide
lh = 1.0 * np.array(lh) / sum(lh)
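This warning is what NumPy reports for a 0/0 division, so it most likely appears when sum(lh) is zero and the result becomes NaN. A guarded normalization could look like the sketch below. It uses plain Python lists to stay dependency-free, whereas the original code works on NumPy arrays; the name normalize and the choice to return all zeros for an empty or all-zero input are assumptions, and whether zeros are the right fallback depends on how the feature is used downstream.

```python
import math


def normalize(values):
    """Scale values so they sum to 1, or return zeros if that is impossible."""
    values = [float(v) for v in values]
    total = sum(values)
    if total == 0 or not math.isfinite(total):
        # Avoid 0/0, which is what triggers the RuntimeWarning.
        return [0.0] * len(values)
    return [v / total for v in values]
```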

IndexError: string index out of range during segmentation

  File "/app/run-main.py", line 174, in <module>
    call_segmentation_training(sys.argv[2])
  File "/app/run-main.py", line 125, in call_segmentation_training
    os.path.join(model_dir, get_version(), model_name))
  File "/app/EXparser/Training_Seg.py", line 55, in train_segmentation
    train_feat[len(train_feat) - 1].extend([word2feat(a, stopw, 2, len(ln), b1, b2, b3, b4, b5, b6)])
  File "/app/EXparser/src/gle_fun_seg.py", line 378, in word2feat
    feat.update(get_last(w))
  File "/app/EXparser/src/gle_fun_seg.py", line 281, in get_last
    c = w[-1] * 2
IndexError: string index out of range
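The expression w[-1] fails when w is an empty string, so an empty token apparently reaches get_last. Two possible guards are sketched below. Both helper names are hypothetical, and the sketch only mirrors the one expression visible in the traceback, not the full get_last function.

```python
def non_empty_tokens(tokens):
    """Drop empty strings, e.g. those produced by splitting on double spaces."""
    return [token for token in tokens if token]


def doubled_last_char(word, default=""):
    """Mirror `w[-1] * 2` from the traceback but tolerate empty input."""
    return word[-1] * 2 if word else default
```

Filtering the tokens before feature extraction keeps empty strings out of word2feat altogether, while the second helper only protects the failing line.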

Auto-segmentation deleting first reference

Ignore files starting with "."

Separate backend for extraction and segmentation
