hipster-philology / pandora

This project forked from mikekestemont/pandora


A Tagger-Lemmatizer for Natural Languages

License: MIT License

Python 99.92% Shell 0.08%

pandora's Introduction

pandora

A (language-independent) Tagger-Lemmatizer for Latin & the Vernacular

The tagging technology behind Pandora is described in the following papers:

Install

For now, installation needs to be done by pulling the repository and installing the required libraries yourself. Currently, Pandora relies on either Keras (+ TensorFlow) or PyTorch as a backend. To run Pandora with the PyTorch backend, go to pytorch.org and follow the installation instructions there.

Environment-free

Note: if you have CUDA installed, run pip install -r requirements-gpu.txt instead.

git clone https://github.com/hipster-philology/pandora.git
cd pandora
pip install -r requirements.txt

Virtualenv

For owners of CUDA-ready machines:

git clone https://github.com/hipster-philology/pandora.git
cd pandora
virtualenv env
source env/bin/activate
pip install -r requirements-gpu.txt

For other machines:

git clone https://github.com/hipster-philology/pandora.git
cd pandora
virtualenv env
source env/bin/activate
pip install -r requirements.txt

Scripts

Note: with the virtualenv install, do not forget to run source env/bin/activate first.

train.py

train.py allows you to train your own models:

python train.py --help
python train.py config.txt --dev /path/to/dev/resources --train /path/to/train/resources --test /path/to/test/resources
python train.py config.txt --dev /path/to/dev/resources --train /path/to/train/resources --test /path/to/test/resources --nb_epochs 1
python train.py path/to/model/config.txt --load --dev /path/to/dev/resources --train /path/to/train/resources --test /path/to/test/resources

tagger.py

tagger.py allows you to annotate a string or a folder:

python tagger.py --help
python tagger.py path/to/model/dir --string --input "Cur in theatrum, Cato severe, venisti?"
python tagger.py path/to/model/dir --input /path/to/dir/to/annotate/ --output /path/to/output/dir/
python tagger.py path/to/model/dir --tokenized_input --input /path/to/dir/to/annotate/ --output /path/to/output/dir/

Note that we no longer officially support the Theano backend for Keras, because Theano development will halt after the 1.0 release (https://groups.google.com/forum/#!topic/theano-users/7Poq8BZutbY).

Examples

The repository includes sample configurations (see the config_example folder) and ships with a small test dataset of Old French epic texts from the Geste corpus (https://github.com/Jean-Baptiste-Camps/Geste).

To launch training on this corpus, run:

python3 train.py config_geste.txt --train data/geste/train --dev data/geste/dev --test data/geste/test

pandora's People

Contributors

mikekestemont, ponteineptique, Jean-Baptiste-Camps, emanjavacas, peterdekker


pandora's Issues

Config parameters not being saved from actual parameters

For now, when a model is saved, config.txt is just a copy of the original config.txt, instead of being regenerated from the parameters actually in use. This prevents, for instance, modifying a parameter during a run (for instance a parameter counting the number of epochs run so far, as needed for #7). Here is the code:

    # save config file:
    if self.config_path:
        # make sure that we can reproduce parametrization when reloading:
        if not self.config_path == os.sep.join((self.model_dir, 'config.txt')):
            shutil.copy(self.config_path, os.sep.join((self.model_dir, 'config.txt')))
    else:
        with open(os.sep.join((self.model_dir, 'config.txt')), 'w') as F:
            F.write('# Parameter file\n\n[global]\n')
            F.write('nb_encoding_layers = '+str(self.nb_encoding_layers)+'\n')
            F.write('nb_dense_dims = '+str(self.nb_dense_dims)+'\n')
            F.write('batch_size = '+str(self.batch_size)+'\n')
            F.write('nb_left_tokens = '+str(self.nb_left_tokens)+'\n')
            F.write('nb_right_tokens = '+str(self.nb_right_tokens)+'\n')
            F.write('nb_embedding_dims = '+str(self.nb_embedding_dims)+'\n')
            F.write('model_dir = '+str(self.model_dir)+'\n')
            F.write('postcorrect = '+str(self.postcorrect)+'\n')
            F.write('nb_filters = '+str(self.nb_filters)+'\n')
            F.write('filter_length = '+str(self.filter_length)+'\n')
            F.write('focus_repr = '+str(self.focus_repr)+'\n')
            F.write('dropout_level = '+str(self.dropout_level)+'\n')
            F.write('include_token = '+str(self.include_context)+'\n')
            F.write('include_context = '+str(self.include_context)+'\n')
            F.write('include_lemma = '+str(self.include_lemma)+'\n')
            F.write('include_pos = '+str(self.include_pos)+'\n')
            F.write('include_morph = '+str(self.include_morph)+'\n')
            F.write('include_dev = '+str(self.include_dev)+'\n')
            F.write('include_test = '+str(self.include_test)+'\n')
            F.write('nb_epochs = '+str(self.nb_epochs)+'\n')
            F.write('halve_lr_at = '+str(self.halve_lr_at)+'\n')
            F.write('max_token_len = '+str(self.max_token_len)+'\n')
            F.write('min_token_freq_emb = '+str(self.min_token_freq_emb)+'\n')
            F.write('min_lem_cnt = '+str(self.min_lem_cnt)+'\n')
            F.write('curr_nb_epochs = '+str(self.curr_nb_epochs)+'\n') 

I suggest removing the first six lines.
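
For illustration, a minimal sketch of regenerating config.txt from the live attributes with Python's configparser (the attribute names come from the snippet above; the exact mapping and helper are assumptions, not the current code):

    # Hypothetical sketch: rebuild config.txt from the tagger's current attributes
    # instead of copying the original file.
    import os
    from configparser import ConfigParser

    def save_config(self):
        config = ConfigParser()
        config['global'] = {
            'nb_encoding_layers': str(self.nb_encoding_layers),
            'batch_size': str(self.batch_size),
            'nb_epochs': str(self.nb_epochs),
            'curr_nb_epochs': str(self.curr_nb_epochs),
            # ... and so on for the remaining parameters listed above
        }
        with open(os.sep.join((self.model_dir, 'config.txt')), 'w') as f:
            config.write(f)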

Epoch evaluation takes an unusually long time

Compared to the previous Keras implementation, the evaluation takes a really long time (on GPU). It takes a few minutes to evaluate, while it takes ~30-35 seconds to train...

This could be related to #54. Could it be that things are run twice?

Failed to open TrueType font

While training, I encountered an error that caused the program to stop, because of a missing font (I usually get a warning about this, but I had never seen it crash everything before). Curiously, it happened only at epoch 15.

    main(**vars(parser.parse_args()))
  File "main.py", line 95, in main
    tagger.epoch(verbose=curr_verbose, autosave=True)
  File "/home/jbcamps/02_lemmatisation/pandora_testing/pandora/tagger.py", line 579, in epoch
    self.save()
  File "/home/jbcamps/02_lemmatisation/pandora_testing/pandora/tagger.py", line 492, in save
    bbox_inches=0)
  File "/home/jbcamps/02_lemmatisation/pandora_testing/env/lib/python3.5/site-packages/matplotlib/pyplot.py", line 697, in savefig
    res = fig.savefig(*args, **kwargs)
  File "/home/jbcamps/02_lemmatisation/pandora_testing/env/lib/python3.5/site-packages/matplotlib/figure.py", line 1573, in savefig
    self.canvas.print_figure(*args, **kwargs)
  File "/home/jbcamps/02_lemmatisation/pandora_testing/env/lib/python3.5/site-packages/matplotlib/backend_bases.py", line 2252, in print_figure
    **kwargs)
  File "/home/jbcamps/02_lemmatisation/pandora_testing/env/lib/python3.5/site-packages/matplotlib/backends/backend_pdf.py", line 2533, in print_pdf
    file.close()
  File "/home/jbcamps/02_lemmatisation/pandora_testing/env/lib/python3.5/site-packages/matplotlib/backends/backend_pdf.py", line 547, in close
    self.writeFonts()
  File "/home/jbcamps/02_lemmatisation/pandora_testing/env/lib/python3.5/site-packages/matplotlib/backends/backend_pdf.py", line 650, in writeFonts
    fonts[Fx] = self.embedTTF(realpath, chars[1])
  File "/home/jbcamps/02_lemmatisation/pandora_testing/env/lib/python3.5/site-packages/matplotlib/backends/backend_pdf.py", line 1124, in embedTTF
    return embedTTFType3(font, characters, descriptor)
  File "/home/jbcamps/02_lemmatisation/pandora_testing/env/lib/python3.5/site-packages/matplotlib/backends/backend_pdf.py", line 910, in embedTTFType3
    filename.encode(sys.getfilesystemencoding()), glyph_ids)
RuntimeError: Failed to open TrueType font

Clean-up CLI system

Additional step from #20: clean up the CLI so it does not rely on so many default variables and does not assume too much.

Padding, end and beginning of sequence symbols

The current approach of using '%', '$' and '|' is very brittle if the input text includes one of those characters.
I think this could be abstracted with less ambiguous symbols, defined as constants somewhere in a utils.py file instead of being hardcoded.
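
For illustration, a minimal sketch of what such constants in utils.py could look like (the names and symbols below are assumptions, not the current implementation):

    # utils.py (hypothetical): sentinel symbols defined once, instead of
    # hardcoding '%', '$' and '|' throughout the code.
    PAD_SYMBOL = '<PAD>'  # padding
    BOS_SYMBOL = '<BOS>'  # beginning of sequence
    EOS_SYMBOL = '<EOS>'  # end of sequence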

How to handle gpu

We should pass params to the model defining whether to use the GPU and which GPU to use.
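
A minimal sketch of what such a parameter could look like with the pytorch backend (the helper and parameter names are assumptions):

    # Hypothetical: make GPU usage an explicit parameter instead of an implicit default.
    import torch

    def place_model(model, use_cuda=False, gpu_id=0):
        if use_cuda and torch.cuda.is_available():
            return model.cuda(gpu_id)  # move the model's parameters to the requested GPU
        return model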

Monitoring of scores

This is just to keep track of model improvements.

Currently, the pytorch model with default params achieves the following after 3 epochs:

::: Train scores (lemmas) :::
  • all acc: 0.6419
  • kno acc: 0.6419
  • unk acc: 0.0
::: Dev scores (lemmas) :::
  • all acc: 0.63
  • kno acc: 0.8316831683168316
  • unk acc: 0.14334470989761092
::: Train scores (pos) :::
  • all acc: 0.8663
  • kno acc: 0.8663
  • unk acc: 0.0
::: Dev scores (pos) :::
  • all acc: 0.831
  • kno acc: 0.9123055162659123
  • unk acc: 0.6348122866894198

config dir

It's useful to have example config files, but they clutter the topdir a little. @Jean-Baptiste-Camps should we move them into a top-level dir of their own?

One-hot format for output data

It seems that keras needs one-hot format for the targets in order to compute the loss, whereas pytorch requires integer format. It makes sense to default to the pytorch format and have the keras implementation call to_categorical on it, rather than having pytorch undo the categorical encoding (which means twice as much computation).
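
A minimal sketch of that convention, assuming integer-encoded targets as the shared default (variable names are assumptions; to_categorical is Keras' own helper):

    # Hypothetical: targets are kept as integer class indices everywhere, and only
    # the keras implementation expands them to one-hot vectors for its loss.
    import numpy as np
    from keras.utils.np_utils import to_categorical

    y_int = np.array([2, 0, 1])                      # integer format (pytorch-friendly)
    y_onehot = to_categorical(y_int, num_classes=3)  # one-hot format for the keras loss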

Document model output

I am having a hard time finding out what the model output should be during training and testing. Unfortunately, the docstrings aren't helping.
For instance, for the lemmas at training time, the docstring says:

            - list of lemma-indices (`include_lemma` = 'label')
            - list of lemma-matrices at character-level (when
              `include_lemma` = 'generate').

But if I print it, it looks like it rather contains class probabilities...

Abstract over model (pytorch vs keras)

Things needed for abstracting over the Keras vs Pytorch API (a rough sketch of such a base class follows the list):

  • serialization: save and load model to dir (including model definition and weights)
  • model.compile() should move to model*.py
  • model.predict()
  • get weights for the embeddings (not essential: mainly for viz)
  • adjust learning rate
  • model.fit()
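
A rough sketch of what such a backend-agnostic base class could look like (the method names follow the list above; everything else is an assumption, not the current code):

    # Hypothetical base class that the keras and pytorch implementations would both subclass.
    class BaseModel(object):

        def fit(self, train_in, train_out):
            """Train for one epoch on prepared inputs/outputs."""
            raise NotImplementedError

        def predict(self, input_data, batch_size=None):
            """Return predictions for prepared input data."""
            raise NotImplementedError

        def adjust_lr(self, factor=0.5):
            """Scale the current learning rate (e.g. halve it)."""
            raise NotImplementedError

        def get_embeddings(self):
            """Return embedding weights (mainly for visualisation)."""
            raise NotImplementedError

        def save(self, model_dir):
            """Serialize model definition and weights to a directory."""
            raise NotImplementedError

        @classmethod
        def load(cls, model_dir):
            """Reload a serialized model from a directory."""
            raise NotImplementedError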

Training with include_morph = label yields TypeError: to_categorical() got an unexpected keyword argument 'nb_classes'

This is similar to the bug described in #12. When I launch training on the Geste sample data, with

include_morph = label

in the config, I get the following error:

Traceback (most recent call last):
  File "train.py", line 6, in <module>
    cli_train()
  File "/home/jbc/Data/F/pandora/pandora_testing/hipster-philology/pandora/pandora/cli.py", line 144, in cli_train
    train_func(**vars(parser.parse_args()))
  File "/home/jbc/Data/F/pandora/pandora_testing/hipster-philology/pandora/pandora/cli.py", line 94, in train_func
    tagger.setup_to_train(**data_sets)
  File "/home/jbc/Data/F/pandora/pandora_testing/hipster-philology/pandora/pandora/tagger.py", line 230, in setup_to_train
    morph=self.train_morph)
  File "/home/jbc/Data/F/pandora/pandora_testing/hipster-philology/pandora/pandora/preprocessing.py", line 516, in transform
    X_morph, nb_classes=len(self.morph_encoder.classes_))
TypeError: to_categorical() got an unexpected keyword argument 'nb_classes'

This does not happen, though, with

include_morph = multilabel

which seems to work quite well (I'm actually impressed by the results obtained with this option).

Difference in length between predictions and input tokens

I am encountering some strange bugs with the current pytorch branch, with some data (and not with others). I suspect it might be related to a character-encoding problem. I have been trying to solve it, but could not pinpoint the cause. Basically, after the first epoch, I get an error due to a difference in length between predictions and input tokens:

-> epoch  1 ...
Epoch 1/1
25845/25845 [==============================] - 12s - loss: 3.5287
Traceback (most recent call last):
  File "main.py", line 61, in <module>
    main(**vars(parser.parse_args()))
  File "main.py", line 46, in main
    tagger.epoch()
  File "/home/jbcamps/02_lemmatisation/pytorch_MAJ/pandora/tagger.py", line 480, in epoch
    train_preds = self.model.predict(train_in)
  File "/home/jbcamps/02_lemmatisation/pytorch_MAJ/pandora/impl/keras/model.py", line 297, in predict
    assert len(out) == len(labels)
AssertionError

Here is a sample of the problematic training data:

Item item
IX 9
£ liura
per pẹr
los lo2
borzes borgẹs
de de
Paris Paris
Item item
XXV 25
£ liura
que que
ac avẹr
P P
Albanels Albanel
de de
part part3
P P
Bonome Bonome
lo lo2
jorn jọrn
de de
la lo2
Saint_Peire Sanh_Peire
de de
feureir febrier

Updating to the Keras 2 API?

Functions used by tagger.py will be deprecated in future Keras versions, and we might need to update them to the Keras 2 API. Here are the warnings:

model.py:142: UserWarning: Update your `Conv1D` call to the Keras 2 API: `Conv1D(activation="relu", padding="valid", input_shape=(20, 28), filters=100, strides=1, kernel_size=3, name="focus_conv", kernel_initializer="glorot_uniform")`
  name='focus_conv')(token_input)
model.py:174: UserWarning: The `merge` function is deprecated and will be removed after 08/2017. Use instead layers from `keras.layers.merge`, e.g. `add`, `concatenate`, etc.
  joined = merge(subnets, mode='concat', name='joined')
env/lib/python3.5/site-packages/keras/legacy/layers.py:460: UserWarning: The `Merge` layer is deprecated and will be removed after 08/2017. Use instead layers from `keras.layers.merge`, e.g. `add`, `concatenate`, etc.
  name=name)
model.py:274: UserWarning: Update your `Model` call to the Keras 2 API: `Model(inputs=[<tf.Tenso..., outputs=[<tf.Tenso...)`
  model = Model(input=inputs, output=outputs)
-> epoch  1 ...
tagger.py:559: UserWarning: The `nb_epoch` argument in `fit` has been renamed `epochs`.
  self.model.fit(train_in, train_out, nb_epoch=1, shuffle=True, batch_size=self.batch_size)

I wonder if it makes sense to do it now, or if we should wait until we merge the pytorch branch back.
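
For reference, a minimal sketch of the same calls under the Keras 2 API, based on the replacements suggested in the warnings above (the surrounding variables such as subnets, token_input, inputs and outputs are assumed from context):

    # Hypothetical before/after for the deprecated calls flagged above.
    from keras.layers import Conv1D, concatenate
    from keras.models import Model

    # merge(subnets, mode='concat', name='joined')  ->
    joined = concatenate(subnets, name='joined')

    # Convolution1D with renamed keyword arguments  ->
    conv = Conv1D(filters=100, kernel_size=3, strides=1, activation='relu',
                  padding='valid', kernel_initializer='glorot_uniform',
                  name='focus_conv')(token_input)

    # Model(input=inputs, output=outputs)  ->
    model = Model(inputs=inputs, outputs=outputs)

    # fit(..., nb_epoch=1, ...)  ->
    model.fit(train_in, train_out, epochs=1, shuffle=True, batch_size=batch_size)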

Information to everyone : code of conduct regarding direct push

Dear all,
I think it would be better to enforce, for everyone, the need to go through PRs. I am willing to enact this by setting up the repo accordingly (GitHub allows blocking pushes to specific branches). I think it would allow more streamlined development and also let people be aware of changes. The four or five of us are active enough to get back within an acceptable timeframe to merge the PRs.

Are you okay with this?

Test scores are printed twice (once with Logger, once alone)

-> Epoch  1 ...
203300/203300 [==============================] - 33s - loss: 2.4688 - lemma_out_loss: 1.4247 - pos_out_loss: 1.0441        
+	all acc: 0.8012268491211514
+	kno acc: 0.8214169909208819
+	unk acc: 0.14492753623188406
+	all acc: 0.8187645000196611
+	kno acc: 0.8274967574578469
+	unk acc: 0.5349143610013175
::: Train Scores (lemma) :::
+	all acc: 0.8055841134137763
+	kno acc: 0.8055841134137763
+	unk acc: 0.0
::: Dev Scores (lemma) :::
+	all acc: 0.7857818784007571
+	kno acc: 0.8105465553919162
+	unk acc: 0.12636165577342048
::: Test scores (lemma) :::
+	all acc: 0.8012268491211514
+	kno acc: 0.8214169909208819
+	unk acc: 0.14492753623188406
::: Train scores (pos) :::
+	all acc: 0.8166913439355407
+	kno acc: 0.8166913439355407
+	unk acc: 0.0
::: Dev scores (pos) :::
+	all acc: 0.8099913256052362
+	kno acc: 0.8238422516773032
+	unk acc: 0.4411764705882353
::: Test scores (pos) :::
+	all acc: 0.8187645000196611
+	kno acc: 0.8274967574578469
+	unk acc: 0.5349143610013175

Tests are currently not passing

There are some issues with max_token_len being None.
Additionally, there is a save_params call inside the tests that doesn't seem to have a counterpart (anymore) in the model API. We should fix this... Could somebody check if there is still stuff missing after the merge?

predict() got an unexpected keyword argument 'batch_size'

Trying to train with a PyTorch model, I encounter this error at evaluation time:

Building model...
-> epoch  1 ...
176400/176400 [==============================] - 1570s - loss: 1.5666 - lemma_out_loss: 1.5666      

Traceback (most recent call last):
  File "main.py", line 6, in <module>
    cli_train()
  File "/home/jbcamps/02_lemmatisation/pandora_edge/pandora/cli.py", line 144, in cli_train
    train_func(**vars(parser.parse_args()))
  File "/home/jbcamps/02_lemmatisation/pandora_edge/pandora/cli.py", line 112, in train_func
    tagger.epoch(autosave=True, eval_test=tagger.include_test)
  File "/home/jbcamps/02_lemmatisation/pandora_edge/pandora/tagger.py", line 504, in epoch
    score_dict = self.logger.epoch(self.curr_nb_epochs, run_eval)
  File "/home/jbcamps/02_lemmatisation/pandora_edge/pandora/logger.py", line 69, in epoch
    score_dict = callback()
  File "/home/jbcamps/02_lemmatisation/pandora_edge/pandora/tagger.py", line 501, in run_eval
    score_dict.update(self.test())
  File "/home/jbcamps/02_lemmatisation/pandora_edge/pandora/tagger.py", line 369, in test
    test_preds = self.model.predict(test_in, batch_size=self.batch_size)
TypeError: predict() got an unexpected keyword argument 'batch_size'

Indeed, if we go to pandora/impl/base_model.py, l. 125,

def predict(self, input_data):

does not have the batch_size argument.
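
One minimal fix would be to let the base signature accept the argument too, so both backends can be called uniformly (a sketch; whether implementations honour or ignore batch_size is up to them):

    # Hypothetical: align the base class signature with the call in tagger.py
    def predict(self, input_data, batch_size=None):
        raise NotImplementedError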

lemma_char_idx

I am trying out both models for a config with lemma == 'generate' and there seems to be a bug in preprocessing:

Traceback (most recent call last):
  File "train.py", line 47, in <module>
    tagger.setup_to_train(train_data=train_data, dev_data=dev_data)
  File "/home/enrique/code/python/pandora/pandora/tagger.py", line 321, in setup_to_train
    self.save()
  File "/home/enrique/code/python/pandora/pandora/tagger.py", line 410, in save
    self.preprocessor.save(self.model_dir)
  File "/home/enrique/code/python/pandora/pandora/preprocessing.py", line 655, in save
    for idx, c in self.lemma_char_idx.items():
AttributeError: 'Preprocessor' object has no attribute 'lemma_char_idx'

I started checking this out and it seems there is an issue with the naming of a couple of variables involving "lemma_char_idx", "lemma_char_vocab", and "lemma_char_lookup".

Perhaps @mikekestemont can have a look at this?

I had to add some other changes, so make sure you pull from the pytorch branch first.

That part of the code also looks a bit brittle, given all the if self....-statements that make it hard to keep track of things while debugging; perhaps some refactoring would be welcome.

Auto-compute max_lemma_len

It might be an issue with my old config, but it seems max_lemma_len was not in my config.
I would propose to compute it automatically when the config value is "auto".
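
A minimal sketch of the proposed behaviour (the config key and variable names are assumptions):

    # Hypothetical: derive max_lemma_len from the training data when the config says "auto"
    if str(params.get('max_lemma_len', 'auto')).lower() == 'auto':
        params['max_lemma_len'] = max(len(lemma) for lemma in train_lemmas)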

Logo

Perhaps we can use a logo? An icon with a gift-wrapped box perhaps? :)

TypeError: to_categorical() got an unexpected keyword argument 'nb_classes'

I have recently tried to use Pandora with Python 3 on Ubuntu 16.04, with a fresh install of all packages. Doing this, I encountered an error that might be related to a change in Keras (?):
TypeError: to_categorical() got an unexpected keyword argument 'nb_classes'
I will try to examine it more closely soon.

Traceback (most recent call last):
  File "main.py", line 59, in <module>
    main(**vars(parser.parse_args()))
  File "main.py", line 42, in main
    dev_data=dev_data
  File "/home/jbcamps/02_lemmatisation/vanilla_pandora/pandora/pandora/tagger.py", line 242, in setup_to_train
    morph=self.train_morph)
  File "/home/jbcamps/02_lemmatisation/vanilla_pandora/pandora/pandora/preprocessing.py", line 490, in transform
    nb_classes=len(self.lemma_encoder.classes_))
TypeError: to_categorical() got an unexpected keyword argument 'nb_classes'
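
This looks like the Keras 2 rename of the keyword argument from nb_classes to num_classes; a minimal sketch of the updated call (the surrounding names are taken from the traceback above):

    # Keras 2 renamed the keyword argument: nb_classes -> num_classes
    from keras.utils.np_utils import to_categorical

    X_lemma = to_categorical(X_lemma, num_classes=len(self.lemma_encoder.classes_))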

RIP Theano

We should stop supporting Theano as a Keras backend -- partly because the current codebase cannot support it anymore, partly because it won't be developed any further...

[ ] add something about this to the README

Learning rate

I have noticed something weird while retraining a model, and I think it has to do with the learning rate. During the first training on the corpus (halve lr at 50), after 100 epochs, my results seemed to be hitting a ceiling:

-> epoch  100 ...
- Lowering learning rate > was: 0.004999999888241291 , now: 0.0025
Epoch 1/1
37853/37853 [==============================] - 200s - loss: 2.6546 - lemma_out_loss: 1.7012 - pos_out_loss: 0.9534   
::: Train scores (lemmas) :::
+	all acc: 0.8658758883047579
+	kno acc: 0.8658758883047579
+	unk acc: 0.0
::: Dev scores (lemmas) :::
+	all acc: 0.8330515638207946
+	kno acc: 0.8961684011352885
+	unk acc: 0.30357142857142855
::: Train scores (pos) :::
+	all acc: 0.8463529971204395
+	kno acc: 0.8463529971204395
+	unk acc: 0.0
::: Dev scores (pos) :::
+	all acc: 0.8229078613693999
+	kno acc: 0.8512298959318827
+	unk acc: 0.5853174603174603
::: Test scores (lemmas) :::
+	all acc: 0.8271996615905245
+	kno acc: 0.8962331201137171
+	unk acc: 0.252465483234714
::: Test scores (pos) :::
+	all acc: 0.8189509306260575
+	kno acc: 0.8538261075574508
+	unk acc: 0.5285996055226825
::: ended :::

Since I was not satisfied, I tried reloading and training again for 50 more epochs. I was then very surprised to see the various accuracies, which had not increased noticeably for some epochs, start to increase by 1% per epoch, with training time going up as well. I wondered why, but now I think it is because of a «bug» in my loading function (?) which set the learning rate very high (notice the change in learning rate and training time):

-> epoch  150 ...
    - Lowering learning rate > was: 1.0 , now: 0.5
Epoch 1/1
37853/37853 [==============================] - 266s - loss: 1.3604 - lemma_out_loss: 0.9304 - pos_out_loss: 0.4300   
::: Train scores (lemmas) :::
+	all acc: 0.992365202229678
+	kno acc: 0.992365202229678
+	unk acc: 0.0
::: Dev scores (lemmas) :::
+	all acc: 0.9234995773457312
+	kno acc: 0.9739829706717124
+	unk acc: 0.5
::: Train scores (pos) :::
+	all acc: 0.983462341161863
+	kno acc: 0.983462341161863
+	unk acc: 0.0
::: Dev scores (pos) :::
+	all acc: 0.9275147928994083
+	kno acc: 0.9477294228949859
+	unk acc: 0.7579365079365079
::: Test scores (lemmas) :::
+	all acc: 0.9141285956006768
+	kno acc: 0.9701492537313433
+	unk acc: 0.4477317554240631
::: Test scores (pos) :::
+	all acc: 0.9282994923857868
+	kno acc: 0.9559346126510305
+	unk acc: 0.6982248520710059

Have you encountered something like that in your trainings? How high is the learning rate supposed to be at startup?
In my case, it was at 0.009999999776482582 for the original run, and at 1 for the second (after loading). I guess it has something to do with the optimizer used?

Different embedding matrices for token and lemma characters

Currently, there are different embedding matrices for token and lemma characters. This is not a huge increase in model params, since the char vocab is typically very small, but it is probably wasteful in terms of updates. I can imagine that using the same embedding space for both could help...
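
A minimal pytorch sketch of what sharing a single character embedding could look like (a shared character vocabulary is assumed; all names and sizes are illustrative):

    # Hypothetical: one shared character embedding used by both the token encoder
    # and the lemma decoder, instead of two separate matrices.
    import torch.nn as nn

    char_vocab_size, char_emb_dim = 60, 50  # illustrative sizes
    char_emb = nn.Embedding(num_embeddings=char_vocab_size, embedding_dim=char_emb_dim)

    class TokenEncoder(nn.Module):
        def __init__(self, emb):
            super(TokenEncoder, self).__init__()
            self.emb = emb  # shared instance

    class LemmaDecoder(nn.Module):
        def __init__(self, emb):
            super(LemmaDecoder, self).__init__()
            self.emb = emb  # same object, so updates affect both sides

    encoder = TokenEncoder(char_emb)
    decoder = LemmaDecoder(char_emb)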

Single Label morph fails

From #69

Ok, this solves the first bug, but it leads to another one:
@Jean-Baptiste-Camps

Traceback (most recent call last):
  File "train.py", line 6, in <module>
    cli_train()
  File "/home/jbc/Data/F/pandora/pandora_testing/hipster-philology/pandora/pandora/cli.py", line 154, in cli_train
    train_func(**vars(parser.parse_args()))
  File "/home/jbc/Data/F/pandora/pandora_testing/hipster-philology/pandora/pandora/cli.py", line 104, in train_func
    tagger.setup_to_train(**data_sets)
  File "/home/jbc/Data/F/pandora/pandora_testing/hipster-philology/pandora/pandora/tagger.py", line 230, in setup_to_train
    morph=self.train_morph)
  File "/home/jbc/Data/F/pandora/pandora_testing/hipster-philology/pandora/pandora/preprocessing.py", line 516, in transform
    X_morph, num_classes=len(self.morph_encoder.classes_))
  File "/home/jbc/Data/F/pandora/pandora_testing/hipster-philology/pandora/pandora/utils.py", line 353, in to_categorical
    y = np.array(y, dtype='int').ravel()
ValueError: invalid literal for int() with base 10: 'NOMB._p|GENRE_f|CAS_r'

Documentation

  • Add documentation, initially with docstrings in the sklearn-format.
  • Add list of supported input file formats.

Figures not being closed

This one has been around for a long time and does not cause many problems:

RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)

Solving it is simple, I think. We just need to add

    sns.plt.close()

in save().
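
For illustration, the plain-matplotlib equivalent would be to close the figure right after saving it (a sketch, not the actual save() code):

    import matplotlib.pyplot as plt

    fig = plt.figure()
    # ... plot the accuracy/loss curves ...
    fig.savefig('curves.pdf', bbox_inches='tight')
    plt.close(fig)  # release the figure so it does not accumulate across epochs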

Merging the pytorch branch

Hello everybody,

We are planning to resume work on the pytorch branch of the repository. In the meantime there has also been some work on the master branch (kudos!). I checked for compatibility and the branches won't merge automatically, but I have the feeling it shouldn't be too much work to make it work.

If you agree, I could open a PR and somebody else could review it (otherwise I could do it myself, but it won't be as clean and transparent).

We know that this is a big move but we are confident that the model will improve in performance given that we will be able to implement things like attention, etc...

What do you think?

Setup.py

  • Add a setup.py
  • Move commands to their submodules.
  • Move base configs to package_data.

Config loading should make sure to import right types

pandora/pandora/tagger.py

Lines 117 to 123 in e16350b

self.include_token = param_dict['include_token']
self.include_context = param_dict['include_context']
self.include_lemma = param_dict['include_lemma']
self.include_pos = param_dict['include_pos']
self.include_morph = param_dict['include_morph']
self.include_dev = param_dict['include_dev']
self.include_test = param_dict['include_test']
These should be converted to bool, according to the following defaults (see the sketch after the list):
postcorrect = True,
include_token = True,
include_context = True,
include_lemma = True,
include_pos = True,
include_morph = True,
include_dev = True,
include_test = True,
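
A minimal sketch of the kind of coercion that would help (the as_bool helper is hypothetical; values read from config.txt arrive as strings, so 'False' would otherwise be truthy):

    # Hypothetical helper for coercing config strings to booleans.
    def as_bool(value):
        if isinstance(value, bool):
            return value
        return str(value).strip().lower() in ('true', '1', 'yes')

    self.include_token = as_bool(param_dict['include_token'])
    self.include_dev = as_bool(param_dict['include_dev'])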

Cannot run PyTorch

Message I get when running the CLI:

RuntimeError: module compiled against API version 0xb but this version of numpy is 0xa
Couldn't import PyTorch. PyTorch implementation not available

Pip Freeze

boto==2.48.0
bz2file==0.98
certifi==2017.7.27.1
chardet==3.0.4
cycler==0.10.0
editdistance==0.3.1
gensim==2.0.0
h5py==2.7.0
idna==2.6
Keras==2.0.4
matplotlib==2.0.2
mock==2.0.0
nltk==3.2.2
numpy==1.12.1
olefile==0.44
pandas==0.20.3
-e git@github.com:hipster-philology/pandora.git@8f71dbca8d9d5eb335eb43ed9fb17705d65b5113#egg=pandora
pbr==3.1.1
Pillow==4.2.1
protobuf==3.4.0
pyparsing==2.2.0
python-dateutil==2.6.1
pytz==2017.2
PyYAML==3.12
requests==2.18.4
scikit-learn==0.18.1
scipy==0.19.1
seaborn==0.7.1
six==1.11.0
smart-open==1.5.3
tensorflow-gpu==1.1.0
Theano==0.9.0
torch==0.2.0.post3
torchvision==0.1.9
urllib3==1.22
Werkzeug==0.12.2

Specifying --dev files mandatory in CLI

Using --dev files is supposedly not mandatory, since the config parameter include_dev can be set to False. But the CLI needs a --dev path to be defined, even if it is not used afterwards. So, right now, even with the config

    include_dev = False

you still need to pass

    --dev path/to/files
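
A minimal argparse sketch of making the flag truly optional (the argument name comes from the CLI above; the default handling is an assumption):

    # Hypothetical: only require --dev when include_dev is actually enabled.
    parser.add_argument('--dev', default=None, required=False,
                        help='Path to dev resources (only needed when include_dev = True)')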

Main.py and tagger.train()

I am not sure this one is worth an issue, but there might be something I am missing here:
why do we use

    for i in range(int(params['nb_epochs'])):
        tagger.epoch()
        tagger.save()

in main.py, instead of the existing tagger.train()?
