
cboulanger / excite-docker


Docker image with tools for the annotation of ML training docs for reference extraction based on the EXparser tools

Home Page: https://cboulanger.github.io/excite-docker

License: GNU General Public License v3.0

Dockerfile 0.07% Python 22.59% Shell 0.87% HTML 3.96% JavaScript 66.96% CSS 5.55%
citation-mining exparser


excite-docker's Issues

Tag-Fragments during Segmentation

Dear contributors, I encountered two cases of tag fragments during segmentation that I could not edit, delete, or interact with in any way.

  1. After running Auto-segmentation, a fragment of the Last Page tag tends to be left behind when the pages in the reference are given in the format “S. Start Page –”. The fragment appears after the hyphen.

  2. Sometimes, when using the function Correct selected text to add new characters while part of the selected text is already tagged, a fragment of the included tag is left next to the newly added characters. I noticed this several times but unfortunately could not find a reliable way to reproduce it.

You can find examples of these two cases in the screenshots I added.

Tag-FragmentLastPage
Tag-FragmentTextEdit

Allow switching of models with optional remote model repository

To be able to use specialized models for different kinds of scholarly citation patterns, we should make the directory containing model data (now EXparser/Utils) configurable. The idea is to give each specialized model a unique name that serves both as a well-known id and as the name of the directory in which the model files are stored. Since the model data depends directly on the training code, it needs to be versioned. Versioning also makes it possible to run tests comparing the performance of a model with the same id but different versions (for example, by running an evaluation that compares different git branches).

  • Models are stored in EXparser/Models/<version>/<model_id>. EXparser/Utils/ is renamed to EXparser/Models/<version>/default. The version number is hardcoded in configs.py and manually incremented whenever a change in the EXparser code makes the model data incompatible with previous code versions.
  • Since different models will have different training material (the whole point of having separate models), EXparser/Dataset needs to be renamed to EXparser/Datasets/default. The training material folders do not need to be versioned.
  • A new command docker run ... excite_toolchain create_model <model_id> is added, which creates a directory EXparser/Models/<version>/<model_id> and copies over the non-reproducible model files (if there are any left). It returns a message saying that the user needs to add training material to EXparser/Datasets/<model_id> and run the training.
  • The model is selected when running the docker commands, such as docker run ... excite_toolchain exparser <model_id>. If no model name is supplied, "default" is used as the model id. If the model id does not exist, an error is raised saying that the command create_model must be run first.
  • docker run ... excite_toolchain (segmentation|extraction)_training <model_id> computes the models from the training material in EXparser/Datasets/<model_id> and places them into EXparser/Models/<version>/<model_id>.
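The directory layout and lookup rules proposed above could be sketched roughly as follows. This is only an illustration: the names MODEL_VERSION, ModelNotFoundError, model_dir, dataset_dir, create_model and require_model are hypothetical and not taken from the EXparser code, and the version value "1" is a placeholder for whatever configs.py would hold.

```python
import os

# Assumption: stands in for the version number hardcoded in configs.py.
MODEL_VERSION = "1"
DEFAULT_MODEL_ID = "default"


class ModelNotFoundError(Exception):
    """Raised when the requested model directory does not exist (hypothetical)."""


def model_dir(base_dir, model_id=None, version=MODEL_VERSION):
    """Return <base_dir>/Models/<version>/<model_id>; "default" if no id is given."""
    return os.path.join(base_dir, "Models", version, model_id or DEFAULT_MODEL_ID)


def dataset_dir(base_dir, model_id=None):
    """Return <base_dir>/Datasets/<model_id>; training material is not versioned."""
    return os.path.join(base_dir, "Datasets", model_id or DEFAULT_MODEL_ID)


def create_model(base_dir, model_id, version=MODEL_VERSION):
    """Create the model and dataset directories for a new model id."""
    path = model_dir(base_dir, model_id, version)
    os.makedirs(path, exist_ok=True)
    os.makedirs(dataset_dir(base_dir, model_id), exist_ok=True)
    return path


def require_model(base_dir, model_id=None, version=MODEL_VERSION):
    """Return the model directory or raise an error pointing to create_model."""
    path = model_dir(base_dir, model_id, version)
    if not os.path.isdir(path):
        raise ModelNotFoundError(
            "Model '{}' does not exist for version {}. Run create_model first."
            .format(model_id or DEFAULT_MODEL_ID, version)
        )
    return path
```

Keeping the version inside the model path but not inside the dataset path mirrors the proposal: training material stays reusable across code versions, while computed models do not.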

Once this system is in place, an optional storage system can be built upon it. It works with packages, each of which is a ZIP of the training material and model data, stored in a configurable location.

  • A new command docker run ... excite_toolchain download_model <model_id> is added, which tries to download /excite-docker/<version>/<model_id>.zip from a WebDAV server (URL and credentials are supplied as environment variables). If that is successful, the ZIP is extracted and its contents are placed into the training and model directories corresponding to the version and model id.
  • A new command docker run ... excite_toolchain upload_model <model_id> is added, which uploads the training and model data as a ZIP to the WebDAV folder.
  • A new command docker run ... excite_toolchain list_models is added, which returns a list of the models stored at the given repository that are compatible with the current version.
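The packaging half of this idea can be sketched with the standard library alone. The function names below are hypothetical, the directory layout follows the proposal above, and the actual WebDAV transfer (which would read the URL and credentials from environment variables) is deliberately left out.

```python
import io
import os
import zipfile


def remote_package_path(version, model_id):
    """Path of the package on the WebDAV server, as described in the proposal."""
    return "/excite-docker/{}/{}.zip".format(version, model_id)


def pack_model(base_dir, version, model_id):
    """Zip Datasets/<model_id> and Models/<version>/<model_id> into bytes."""
    roots = [
        os.path.join("Datasets", model_id),
        os.path.join("Models", version, model_id),
    ]
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for root in roots:
            for folder, _dirs, files in os.walk(os.path.join(base_dir, root)):
                for name in files:
                    full = os.path.join(folder, name)
                    # Store paths relative to base_dir so the archive can be
                    # extracted directly over another EXparser directory.
                    archive.write(full, os.path.relpath(full, base_dir))
    return buffer.getvalue()


def unpack_model(package_bytes, base_dir):
    """Extract a downloaded package into the training and model directories."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        archive.extractall(base_dir)
```

Because the archive keeps the Datasets/... and Models/<version>/... prefixes, the same package format would serve both upload_model and download_model.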

Fail more gracefully if font information is missing

  File "/app/run-main.py", line 171, in <module>
    call_extraction_training(sys.argv[2])
  File "/app/run-main.py", line 119, in call_extraction_training
    os.path.join(model_dir, get_version(), model_name))
  File "/app/EXparser/Training_Ext.py", line 52, in train_extraction
    row2 = reader2[uu]
IndexError: list index out of range

This error occurs when old training material that lacks the font columns is used.
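One way to fail more gracefully would be to wrap the row lookup and raise a descriptive error instead of a bare IndexError. The helper below is a hypothetical sketch; it does not reproduce Training_Ext.py, and the names TrainingDataError, get_row and source_name are assumptions.

```python
class TrainingDataError(Exception):
    """Raised when training material is incomplete or outdated (hypothetical)."""


def get_row(rows, index, source_name):
    """Return rows[index], or raise an error that explains the likely cause."""
    try:
        return rows[index]
    except IndexError:
        raise TrainingDataError(
            "{}: row {} is missing (only {} rows found). Old training material "
            "without font columns may need to be regenerated."
            .format(source_name, index, len(rows))
        ) from None
```

A call such as get_row(reader2, uu, "font features") would then stop the training run with a message that names the affected file.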

RuntimeWarnings during Feature Extraction

/app/EXparser/src/gle_fun_ext.py:139: RuntimeWarning: invalid value encountered in true_divide
lh2 = 1.0 * lh / sum(lh)
/app/EXparser/src/gle_fun_ext.py:141: RuntimeWarning: invalid value encountered in true_divide
lh = 1.0 * np.array(lh) / sum(lh)
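A warning like this typically appears when the sum used as the divisor is zero or not a number. A guarded normalization might look like the sketch below. It uses plain lists so that it runs without NumPy, the function name normalize is hypothetical, and returning all zeros in the degenerate case is an assumption about what the feature extraction should do.

```python
import math


def normalize(values):
    """Scale values so they sum to 1; return zeros if the sum is 0 or not finite."""
    total = float(sum(values))
    if total == 0.0 or not math.isfinite(total):
        # Assumption: an all-zero vector is an acceptable fallback here.
        return [0.0] * len(values)
    return [v / total for v in values]
```

The same check on sum(lh) could be placed before the two divisions in gle_fun_ext.py.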

IndexError: string index out of range during segmentation

 File "/app/run-main.py", line 174, in <module>
    call_segmentation_training(sys.argv[2])
  File "/app/run-main.py", line 125, in call_segmentation_training
    os.path.join(model_dir, get_version(), model_name))
  File "/app/EXparser/Training_Seg.py", line 55, in train_segmentation
    train_feat[len(train_feat) - 1].extend([word2feat(a, stopw, 2, len(ln), b1, b2, b3, b4, b5, b6)])
  File "/app/EXparser/src/gle_fun_seg.py", line 378, in word2feat
    feat.update(get_last(w))
  File "/app/EXparser/src/gle_fun_seg.py", line 281, in get_last
    c = w[-1] * 2
IndexError: string index out of range
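The traceback shows that w[-1] is evaluated on an empty string. One possible source of empty tokens, which is an assumption here and not confirmed by the report, is splitting a line on a single space when it contains consecutive or trailing spaces. The sketch below filters such tokens and guards the failing expression; both function names are hypothetical.

```python
def non_empty_tokens(line, sep=" "):
    """Split a line and drop the empty tokens caused by repeated separators."""
    return [token for token in line.split(sep) if token]


def last_char_doubled(word):
    """Guarded variant of the failing expression `w[-1] * 2`."""
    return word[-1] * 2 if word else ""
```

Either measure alone would avoid the IndexError; filtering the tokens also keeps empty words out of the training features.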

Auto-segmentation deleting first reference

Dear contributors, I noticed an issue while comparing the displayed lists of references before and after using the Auto-segmentation. The aforementioned function seems to regularly delete the first reference of the extracted list.

Ignore files starting with "."

Layout analysis and exparser should ignore all files starting with a dot (".") so that .gitkeep and other git files are not analyzed.
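A minimal sketch of such a filter, assuming the input files are collected with os.listdir (the function name visible_files is hypothetical):

```python
import os


def visible_files(directory):
    """Return the sorted names of regular files that do not start with a dot."""
    return sorted(
        name
        for name in os.listdir(directory)
        if not name.startswith(".")
        and os.path.isfile(os.path.join(directory, name))
    )
```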
