
hunvec's Introduction

Pylearn2 (and hunvec) development has ended

Unfortunately pylearn2 no longer has any developers; the maintainers only plan to merge new pull requests, they warn users not to expect any new features, and they suggest using other libraries instead. We decided not to continue the development of hunvec relying on a deprecated library.

Why use hunvec?

hunvec is being developed to use neural networks for various NLP tasks. Our intention is to support researchers by giving them a tool with which one can experiment with different settings when creating neural networks. It is built upon pylearn2/theano, so recent advances in deep learning that are supported in pylearn2 will hopefully work out of the box. It currently supports a basic sequential tagging network based on the Natural Language Processing (Almost) from Scratch paper (Collobert et al., 2011). We designed hunvec to be easily reconfigurable (adding and removing layers, testing hyperparameters, adding new features like dropout) so we can test how well new advances work on NLP tasks. If you have any questions, feel free to use the issues page, or contact us: [email protected]; [email protected]

Sequential tagging

The library is ready for POS and NER (or any other bie1-tagged) training. Preprocessing is done by hunvec/datasets/prepare.py.

Unfortunately, because of a pylearn2/theano limitation, only batch_size==1 can be used, but the training works. There are also scripts for evaluating (F-score and per-word precision) and for tagging with these models. Models can be read back, and their training can be continued. The library supports featurizing; right now only 3gram, casing, and gazetteer features are implemented, but it is easily extendable with pure Python methods (see the sketch below). There are many training options for hunvec/seqtag/trainer.py; see its help message.
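For instance, a new pure-Python feature might look like the following (a minimal, hypothetical sketch; the exact interface expected by features/features.py may differ):

def casing_feature(word):
    """Hypothetical feature function: map a token to a coarse casing class."""
    if word.isupper():
        return 'ALLCAPS'
    if word[0].isupper():
        return 'CAPITALIZED'
    if any(ch.isdigit() for ch in word):
        return 'HAS_DIGIT'
    return 'LOWER'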

Good to know: IndexSequenceSpace.make_theano_batch() and VectorSequenceSpace.make_theano_batch() in pylearn2/space/__init__.py currently have to be modified.

instead of

if batch_size == 1:
    return tensor.matrix(name=name)

one should use

if batch_size == 1 or batch_size is None:
    return tensor.matrix(name=name, dtype=self._dtype)

Sample calls:

Datasets have to be in the common format:

  • one token per line
  • empty line separates sentences
  • in one line: tab-separated fields, with the word in the first column and the tag in the last one
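
For example (one word<TAB>tag pair per line; POS tags are shown here for illustration, but NER tags work the same way):

The	DT
dog	NN
barks	VBZ
.	.

A	DT
second	JJ
sentence	NN
follows	VBZ
.	.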

Preparing a dataset with a train/test/devel split (before preparing, features can be turned on and off in features/features.py):

python hunvec/datasets/prepare.py \
    -w 3 \
    --test_file data/eng.bie1.test \
    --valid_file data/eng.bie1.devel \
    data/eng.bie1.train preprocessed_dataset.pickle

For training, or for continuing a trained model, with a given dataset:

python hunvec/seqtag/trainer.py \
    --epochs 100 \
    --regularization 1e-5 \
    --valid_stop \
    --embedding 100 \
    --hidden 200 \
    --lr .1 \
    --lr_lin_decay .1 \
    --lr_scale \
    dataset.pickle \
    output_model

For evaluation:

python hunvec/seqtag/eval.py --fscore --sets test,valid,train dataset model

For tagging:

cut -f1 data/eng.bie1.train | python hunvec/seqtag/tagger.py model > tagged

multi-model training

There is support for using one model's output in another model. One example use case: train a POS tagger, then use its output as input for an NE tagger. To achieve this, the datasets have to be prepared together, so that words and features have the same indices in both datasets. This is still experimental; we haven't achieved good results yet, but we are continuously working on it.

Example call for preparing multiple datasets together:

python hunvec/datasets/prepare.py \
    -w 3 \
    --test_file test1.tsv,test2.tsv \
    --valid_file valid1.tsv,valid2.tsv \
    train1.tsv,train2.tsv model1.pickle,model2.pickle

Then, if a first model has been trained individually on the first dataset and saved into model1.pickle, it can be passed to trainer.py with the --embedded_model option. Don't forget to use datasets that were prepared together.
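
The second training call could then look something like this (a hypothetical sketch; the file names and the other options are only illustrative):

python hunvec/seqtag/trainer.py \
    --embedded_model model1.pickle \
    --epochs 100 \
    dataset2.pickle \
    output_model2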

Language modeling

Dataset preprocessing, network creation and training are done, but without hierarchical softmax or negative sampling it is very slow. For negative sampling, there is ongoing work in the pylearn2 community; there was lisa-lab/pylearn2#1406, which wasn't finished, but hopefully will be soon.
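
For reference, negative sampling replaces the full-vocabulary softmax with a handful of binary decisions: one observed word is scored against k sampled noise words instead of all |V| outputs. A minimal numpy sketch of the per-example loss (raw dot-product scores are assumed):

import numpy as np

def negative_sampling_loss(pos_score, neg_scores):
    """Logistic loss for one observed word against k sampled noise words."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    neg_scores = np.asarray(neg_scores)
    # push the true word's score up and the noise words' scores down
    return -np.log(sigmoid(pos_score)) - np.log(sigmoid(-neg_scores)).sum()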

hunvec's People

Contributors

makrai, pajkossy, zseder


hunvec's Issues

error message when using regularization

Traceback (most recent call last):
  File "hunvec/seqtag/trainer.py", line 123, in <module>
    main()
  File "hunvec/seqtag/trainer.py", line 118, in main
    wt.create_algorithm(d, args.model_path)
  File "/home/pajkossy/git/hunvec/hunvec/seqtag/sequence_tagger.py", line 151, in create_algorithm
    self.algorithm.setup(self, self.dataset['train'])
  File "/home/pajkossy/pylearn2/pylearn2/training_algorithms/sgd.py", line 316, in setup
    **fixed_var_descr.fixed_vars)
  File "/home/pajkossy/git/hunvec/hunvec/cost/seq_tagger_cost.py", line 22, in expr
    sc += model.tagger.get_weight_decay(self.reg[0])
  File "/home/pajkossy/pylearn2/pylearn2/models/mlp.py", line 695, in get_weight_decay
    for layer, coeff in safe_izip(self.layers, coeffs):
  File "/home/pajkossy/pylearn2/pylearn2/utils/__init__.py", line 277, in safe_izip
    assert all([len(arg) == len(args[0]) for arg in args])
AssertionError

when using --dropout

Traceback (most recent call last):
  File "hunvec/seqtag/trainer.py", line 123, in <module>
    main()
  File "hunvec/seqtag/trainer.py", line 119, in main
    wt.train()
  File "/home/pajkossy/git/hunvec/hunvec/seqtag/sequence_tagger.py", line 157, in train
    self.algorithm.train(dataset=self.dataset['train'])
  File "/home/pajkossy/pylearn2/pylearn2/training_algorithms/sgd.py", line 453, in train
    self.sgd_update(*batch)
  File "/home/pajkossy/theano/theano_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 588, in __call__
    self.fn.thunks[self.fn.position_of_error])
  File "/home/pajkossy/theano/theano_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 579, in __call__
    outputs = self.fn()
IndexError: index 9360 is out of bounds for size 9354
Apply node that caused the error: AdvancedSubtensor1(feats_W, Elemwise{Cast{int64}}.0)
Inputs shapes: [(9354, 100), (168,)]
Inputs strides: [(400, 4), (8,)]
Inputs types: [TensorType(float32, matrix), TensorType(int64, vector)]
Use the Theano flag 'exception_verbosity=high' for a debugprint of this apply node.

optparse

it would be easier to run experiments without always changing parameters in the code (and it would be more git-friendly)

tagging script

prepare an easy-to-use tagging script; the trainer script should be renamed to training, split into train and NN parts, etc.

evaluation of short sentences

Currently, very short sentences are dropped while preparing datasets.
Even if training is not possible with them, the test data should still contain them, so that test results are reliable.

save datasets

save them into a file to avoid reading and featurizing the 3 different files over and over again when there are train/test/devel splits

continue training

  • if a model is given to the trainer, continue training and don't use the training parameters
  • later, maybe we can allow changing training parameters this way

F1 monitoring

Viterbi is done by theano, but it would be easier to run theano-independent tagging + Viterbi + F1 computation with plain numpy and somehow add its result to the monitor; we don't need this information during the training step. See the sketch below.
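
A theano-independent Viterbi decoder along these lines can be written in a few lines of numpy (a sketch; the (T, K) emission and (K, K) transition score shapes are assumptions):

import numpy as np

def viterbi(emissions, transitions):
    """Decode the best tag sequence from log-space scores.

    emissions:   (T, K) array, per-position tag scores
    transitions: (K, K) array, transitions[i, j] = score of tag i -> tag j
    """
    T, K = emissions.shape
    score = emissions[0].copy()            # best score ending in each tag
    backptr = np.zeros((T, K), dtype=int)  # best previous tag at each step
    for t in range(1, T):
        cand = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    tags = [int(score.argmax())]
    for t in range(T - 1, 0, -1):          # follow the back-pointers
        tags.append(int(backptr[t, tags[-1]]))
    return tags[::-1]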

prepare.py options

  • option for automatic replacement of numerals (see the sketch below)
  • option for giving a closed vocabulary (so that words outside of it are mapped to unknown, even if they appear in the training data)
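
The numeral replacement could be as simple as the following (a hypothetical sketch of what the option would do):

import re

def normalize_numerals(word):
    # map every digit to '0', so e.g. '1984' and '2015' share one form
    return re.sub(r'[0-9]', '0', word)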

ProjectionLayer fix parameters

When we use a bigger vocabulary than what is in the training data (external embeddings at evaluation time, for example), there are a lot of invariant parameters in ProjectionLayer.
We should create our own ProjectionLayer (inheriting from the pylearn2 one) that knows which parameters are constant and which are changeable, and then change ProjectionLayer.get_params() to return only the part of W that is changeable (if this slicing operation is permitted in theano). See the sketch below.
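
One possible direction (a hypothetical sketch, not the real pylearn2 API): keep the frozen and trainable embeddings in separate shared variables, concatenate them for the lookup, and expose only the trainable block to the optimizer.

import numpy as np
import theano
import theano.tensor as T

# frozen external embeddings and a smaller trainable block
W_fixed = theano.shared(np.zeros((100000, 100), dtype='float32'), name='W_fixed')
W_train = theano.shared(np.zeros((5000, 100), dtype='float32'), name='W_train')
W = T.concatenate([W_fixed, W_train], axis=0)  # full lookup table

idx = T.ivector('idx')
emb = W[idx]  # embedding lookup over the concatenated table

# a get_params() analogue would return [W_train] only, so SGD never
# touches W_fixed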

pylearn incompatibility

We currently need small changes in pylearn2/utils/iteration.py and pylearn2/space/__init__.py to be able to run. Why are they necessary? Should our changes go into pylearn2? Can we overcome this?

SequenceDataSpace

  • would it be a better choice than IndexSequenceSpace?
  • can it be used to create batches of size 2 or more?

tagger.py bug

Traceback (most recent call last):
  File "hunvec/seqtag/tagger.py", line 61, in <module>
    main()
  File "hunvec/seqtag/tagger.py", line 57, in main
    tag(args)
  File "hunvec/seqtag/tagger.py", line 48, in tag
    tags = wt.tag_sen(words, feats)
  File "/home/pajkossy/hunvec/hunvec/seqtag/sequence_tagger.py", line 224, in tag_sen
    y = self.f(words, feats)
  File "/home/pajkossy/pylearn_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 608, in __call__
    storage_map=self.fn.storage_map)
  File "/home/pajkossy/pylearn_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 597, in __call__
    outputs = self.fn()
IndexError: index 30 is out of bounds for size 28
Apply node that caused the error: AdvancedSubtensor1(feats_W, Flatten{1}.0)
Inputs types: [TensorType(float64, matrix), TensorType(int64, vector)]
Inputs shapes: [(28, 5), (52,)]
Inputs strides: [(40, 8), (8,)]
Inputs values: ['not shown', 'not shown']

Dropout

dropout should be easy with pylearn2

TaggedCorpus refactoring

TaggedCorpus should be split into RawCorpus and TaggedCorpus (see the sketch after this list).

  • only tag-related content goes into TaggedCorpus
  • read() should be implemented this way:
    • RawCorpus.read() should perhaps take a needed_fields argument with a default value of [0] (index 0 = word), and TaggedCorpus.read() should only contain a call to RawCorpus.read(needed_fields=[0, 1]), with the tag at index 1
    • the only problem is that from now on, words will be 1-length lists, to be compatible with tags, which will be 2-length lists (NOT tuples, because lists can be changed in place, so it's easier to turn them into integers later)
  • read() should keep the pre flag, so when the featurizer is preprocessing the data, it will only return words
    • if we call featurizer preprocessing in RawCorpus.__init__(), the needed_fields flag for read() will suffice and only words will be returned, so hopefully no change is needed there
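
A minimal sketch of the proposed split (the names follow the issue text; the details are assumptions):

class RawCorpus(object):
    def read(self, stream, needed_fields=(0,)):
        """Yield sentences; every token keeps only the requested fields."""
        sen = []
        for line in stream:
            line = line.rstrip('\n')
            if not line:  # empty line separates sentences
                if sen:
                    yield sen
                sen = []
                continue
            fields = line.split('\t')
            sen.append([fields[i] for i in needed_fields])
        if sen:
            yield sen


class TaggedCorpus(RawCorpus):
    def read(self, stream):
        # field 0 is the word, field 1 the tag
        return RawCorpus.read(self, stream, needed_fields=(0, 1))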

fake_feats error

if fake_feats are present together with other features, convergence is much slower. Maybe it is because pylearn2's SGD gets confused by features that are always active

unknown word's vector

the vector belonging to the unknown word (index -1) is possibly the same as the one belonging to the last word, because of Python indexing. The new vocab size should be vocab+1, and then they won't collide. See the demonstration below.
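
The collision is easy to demonstrate with plain numpy (toy sizes, only for illustration):

import numpy as np

W = np.random.rand(5, 3)              # vocab of 5 words, 3-dim embeddings
unknown = -1                          # index used for unknown words
print(np.allclose(W[unknown], W[4]))  # True: -1 aliases the last word

# fix: allocate vocab+1 rows and give the unknown word its own row
W = np.random.rand(5 + 1, 3)
unknown = 5                           # dedicated row, no collision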

external embeddings

for ProjectionLayer, it should be possible to set the parameters from embeddings trained outside of this library

config file

argparse is okay right now, but there are a lot of options and there will be a lot more later, so we need a config infrastructure

  • the current command-line arguments should be available in it
  • new configs for the possible datasets (eng+ner, hun+pos, etc.)

maxout

experiment with maxout hidden layers

configurable hidden layers

  • should be able to run with n hidden layers with different numbers of units, for experimenting
  • watch out for regularization: coeff will be a variable-length tuple depending on the number of hidden layers

save corpus output

after training, if we use the model for tagging, we need index->word and index->tag resolution

implement new get() method for dataset

iteration.py:784: UserWarning: dataset is using the old iterator interface which is deprecated and will become officially unsupported as of July 28, 2015. The dataset should implement a get method respecting the new interface.
