zseder / hunvec
Sequential Tagging in NLP using neural networks
- option for automatic replacement of numerals
- option for giving a closed vocabulary (so that words outside of it are mapped to unknown, even if they appear in the training data)
extensions should be an empty list by default
after training, if we use the model for tagging, we need index->word and index->tag resolution
Traceback (most recent call last):
  File "hunvec/seqtag/trainer.py", line 123, in <module>
    main()
  File "hunvec/seqtag/trainer.py", line 119, in main
    wt.train()
  File "/home/pajkossy/git/hunvec/hunvec/seqtag/sequence_tagger.py", line 157, in train
    self.algorithm.train(dataset=self.dataset['train'])
  File "/home/pajkossy/pylearn2/pylearn2/training_algorithms/sgd.py", line 453, in train
    self.sgd_update(*batch)
  File "/home/pajkossy/theano/theano_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 588, in __call__
    self.fn.thunks[self.fn.position_of_error])
  File "/home/pajkossy/theano/theano_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 579, in __call__
    outputs = self.fn()
IndexError: index 9360 is out of bounds for size 9354
Apply node that caused the error: AdvancedSubtensor1(feats_W, Elemwise{Cast{int64}}.0)
Inputs shapes: [(9354, 100), (168,)]
Inputs strides: [(400, 4), (8,)]
Inputs types: [TensorType(float32, matrix), TensorType(int64, vector)]
Use the Theano flag 'exception_verbosity=high' for a debugprint of this apply node.
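The IndexError above means a feature index (9360) exceeds the number of rows in the embedding matrix feats_W (9354), i.e. an index for a feature unseen at matrix-creation time reaches the lookup. A minimal numpy sketch of the usual fix, remapping out-of-range indices onto a dedicated unknown row before the lookup (the function and the reserved-row convention are illustrative, not hunvec's API):

```python
import numpy as np

def safe_lookup(W, indices, unk_index):
    """Map any index outside W's rows onto unk_index, then do the embedding lookup."""
    indices = np.asarray(indices)
    safe = np.where(indices >= W.shape[0], unk_index, indices)
    return W[safe]

W = np.zeros((5, 3))   # 5-row embedding matrix, 3-dim vectors
W[4] = 1.0             # row 4 reserved for the unknown word
rows = safe_lookup(W, [0, 7, 2], unk_index=4)  # index 7 would crash a raw W[...] lookup
```

The same remapping has to happen consistently at training and tagging time, otherwise the two vocabularies drift apart.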
Traceback (most recent call last):
  File "hunvec/seqtag/trainer.py", line 123, in <module>
    main()
  File "hunvec/seqtag/trainer.py", line 118, in main
    wt.create_algorithm(d, args.model_path)
  File "/home/pajkossy/git/hunvec/hunvec/seqtag/sequence_tagger.py", line 151, in create_algorithm
    self.algorithm.setup(self, self.dataset['train'])
  File "/home/pajkossy/pylearn2/pylearn2/training_algorithms/sgd.py", line 316, in setup
    **fixed_var_descr.fixed_vars)
  File "/home/pajkossy/git/hunvec/hunvec/cost/seq_tagger_cost.py", line 22, in expr
    sc += model.tagger.get_weight_decay(self.reg[0])
  File "/home/pajkossy/pylearn2/pylearn2/models/mlp.py", line 695, in get_weight_decay
    for layer, coeff in safe_izip(self.layers, coeffs):
  File "/home/pajkossy/pylearn2/pylearn2/utils/__init__.py", line 277, in safe_izip
    assert all([len(arg) == len(args[0]) for arg in args])
AssertionError
argparse is okay right now, but there are already a lot of options and there will be many more later, so we need a config infrastructure
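One lightweight option is the stdlib config-file parser (ConfigParser under Python 2, configparser under Python 3); a sketch with a hypothetical layout, where the section and option names are illustrative and not hunvec's actual options:

```python
from configparser import ConfigParser
from io import StringIO

# Hypothetical config layout; names are illustrative only.
sample = """\
[model]
hidden_dims = 100,100
embedding_dim = 50

[training]
learning_rate = 0.01
"""

cfg = ConfigParser()
cfg.read_file(StringIO(sample))
hidden_dims = [int(d) for d in cfg.get('model', 'hidden_dims').split(',')]
lr = cfg.getfloat('training', 'learning_rate')
```

Config files also version nicely in git, which helps with the reproducibility point below about running experiments without always changing parameters.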
When we use a bigger vocabulary than what is in the training data (for example, external embeddings at evaluation time), ProjectionLayer contains a lot of invariant parameters.
We should create our own ProjectionLayer (inheriting the one from pylearn2) that knows which parameters are constant and which are trainable, and then change ProjectionLayer.get_params() to return only the part of W that is trainable (if this slicing operation is permitted in theano).
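A plain-numpy illustration of the idea, keeping the first n_fixed rows of W frozen and exposing only the remainder to the optimizer (the class and method names are hypothetical, not pylearn2's API; in pylearn2 this would go through get_params() on a ProjectionLayer subclass):

```python
import numpy as np

class PartiallyFrozenProjection(object):
    """Embedding matrix whose first n_fixed rows are constant (e.g. external embeddings)."""
    def __init__(self, W, n_fixed):
        self.W = W
        self.n_fixed = n_fixed

    def get_params(self):
        # Only the trainable slice is handed to the optimizer.
        return self.W[self.n_fixed:]

    def apply_update(self, grad, lr):
        # Updates touch only the trainable rows; the fixed rows never move.
        self.W[self.n_fixed:] -= lr * grad

W = np.ones((6, 4))
layer = PartiallyFrozenProjection(W, n_fixed=4)
layer.apply_update(np.ones((2, 4)), lr=0.5)
```

Whether theano accepts a sliced shared variable as a trainable parameter is exactly the open question in the note above.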
dropout should be easy with pylearn2
save featurized datasets into a file to avoid reading and featurizing them over and over again from 3 different files when there are train/test/devel splits
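A simple caching helper along these lines (the helper name and the cache-next-to-the-corpus convention are assumptions, not hunvec's layout):

```python
import os
import pickle
import tempfile

def load_or_featurize(path, featurize, cache_suffix='.feats.pkl'):
    """Featurize `path` once; later calls load the pickled result instead."""
    cache = path + cache_suffix
    if os.path.exists(cache):
        with open(cache, 'rb') as f:
            return pickle.load(f)
    data = featurize(path)
    with open(cache, 'wb') as f:
        pickle.dump(data, f)
    return data

calls = []
def fake_featurize(path):
    calls.append(path)            # counts how many times real work happens
    return {'feats': [1, 2, 3]}

tmp = tempfile.mkdtemp()
corpus = os.path.join(tmp, 'train.txt')
open(corpus, 'w').close()         # stand-in for a real corpus file
first = load_or_featurize(corpus, fake_featurize)
second = load_or_featurize(corpus, fake_featurize)  # served from the cache
```

The cache must be invalidated when the featurizer configuration changes; hashing the relevant options into the suffix is one way to handle that.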
Traceback (most recent call last):
  File "hunvec/seqtag/tagger.py", line 61, in <module>
    main()
  File "hunvec/seqtag/tagger.py", line 57, in main
    tag(args)
  File "hunvec/seqtag/tagger.py", line 48, in tag
    tags = wt.tag_sen(words, feats)
  File "/home/pajkossy/hunvec/hunvec/seqtag/sequence_tagger.py", line 224, in tag_sen
    y = self.f(words, feats)
  File "/home/pajkossy/pylearn_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 608, in __call__
    storage_map=self.fn.storage_map)
  File "/home/pajkossy/pylearn_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 597, in __call__
    outputs = self.fn()
IndexError: index 30 is out of bounds for size 28
Apply node that caused the error: AdvancedSubtensor1(feats_W, Flatten{1}.0)
Inputs types: [TensorType(float64, matrix), TensorType(int64, vector)]
Inputs shapes: [(28, 5), (52,)]
Inputs strides: [(40, 8), (8,)]
Inputs values: ['not shown', 'not shown']
It's quite slow right now, but there is a PR (lisa-lab/pylearn2#1406) with which LM training can be sped up
viterbi is done by theano, but it would be easier to run theano-independent tagging + viterbi + F1 computation with plain numpy and somehow add its result to the monitor; we don't need this information during the training step
so correct evaluation is also only possible using tagger.py
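A theano-independent Viterbi decoder over per-position tag scores and a tag-transition matrix could look like this (a plain-numpy sketch under the usual log-score formulation, not hunvec's implementation):

```python
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (T, K) per-position tag log-scores; transitions: (K, K)
    log-scores for moving from tag i to tag j. Returns the best tag sequence."""
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j] = score[i] + transitions[i, j]; maximize over previous tag i
        cand = score[:, None] + transitions
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emissions[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

best = viterbi(np.array([[0., -10.], [-10., 0.]]), np.zeros((2, 2)))  # -> [0, 1]
```

Since this needs only the emission scores and the transition matrix, it can run on held-out data outside the theano graph and feed F1 into the monitor, exactly as the note suggests.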
right now we need small changes in pylearn2/utils/iteration.py and pylearn2/space/__init__.py to be able to run. Why are they necessary? Should our changes go into pylearn2? Can we avoid them?
prepare an easy-to-use tagging script; the trainer script should be renamed to training, split into train and NN, etc.
currently, while preparing datasets, very short sentences are dropped.
Even if training is not possible with them, the test data should still contain them so that test results are reliable
even if the lr_monitor_decay flag is on
coeff will be a variable-length tuple depending on the number of hidden layers
if fake_feats is there along with other features, the convergence is way slower. Maybe it is because pylearn2's SGD gets confused by features that are always active
it would be easier to run experiments without always changing parameters (and it is more git-friendly)
log(sum_i(exp(z_i))) = max_i z_i + log(sum_i(exp(z_i - max_i z_i)))
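This is the standard log-sum-exp trick: shifting by the maximum keeps every exponent at or below zero, so nothing overflows. A minimal numpy version:

```python
import numpy as np

def logsumexp(z):
    """Numerically stable log(sum(exp(z))): shift by the max before exponentiating."""
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

z = np.array([1000.0, 1000.0])
stable = logsumexp(z)   # 1000 + log(2); the naive np.log(np.exp(z).sum()) overflows to inf
```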
~5 (no, the representation is sparse because of high-dimensional feature vectors)
if words in the training data are not included, their embeddings won't be used
possibly the vector belonging to the unknown word (-1) is the same as for the last word, because of python indexing. The new vocab size should be vocab+1, and then they won't collide
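The collision is easy to demonstrate with numpy indexing, along with the vocab+1 fix of reserving an explicit extra row (the matrix here is a toy stand-in, not hunvec's actual embeddings):

```python
import numpy as np

W = np.arange(12).reshape(4, 3)   # vocab of 4 words
unk = -1                          # "unknown" encoded as -1
# With python indexing, W[-1] IS the embedding of the last real word:
collides = (W[unk] == W[3]).all()

# Reserving an extra row removes the collision:
W2 = np.vstack([W, np.zeros((1, 3))])  # vocab+1 rows; row 4 is now the unknown word
```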
in sequence_tagger.py, dropout_fprop() uses self.hdims, but that parameter isn't saved with the model (see __get_state__), so it fails. Instead of checking hdims, a check on self.layers would be better if possible. If not, hdims has to go into __get_state__, but hopefully not, since it is redundant.
tags are now encoded using one-hot encoding, but is that needed? Couldn't they be simple integers?
TaggedCorpus should be split into RawCorpus and TaggedCorpus.
In RawCorpus.read(), maybe we should use a needed_fields argument with a default value of [0] (0th index = word), and TaggedCorpus.read() should only contain a call to RawCorpus.read(needed_fields=[0,1]), where at the first place there is the tag.
read() should keep the pre flag, so when the featurizer is preprocessing the data, it will only return words.
In RawCorpus.__init__(), a needed_fields flag for read() will be good, and only words will be returned, so hopefully no change is needed there.
locations of gazetteer lists etc.
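A sketch of the proposed split, assuming a tab-separated column format with blank lines between sentences (the class names follow the note above; the input handling is simplified for illustration):

```python
class RawCorpus(object):
    """Reads a column-formatted corpus; sentences are separated by blank lines."""
    def __init__(self, lines):
        self.lines = lines  # for the sketch, an iterable of strings

    def read(self, needed_fields=(0,)):
        sen = []
        for line in self.lines:
            line = line.rstrip('\n')
            if not line:
                if sen:
                    yield sen
                sen = []
                continue
            fields = line.split('\t')
            sen.append(tuple(fields[i] for i in needed_fields))
        if sen:
            yield sen


class TaggedCorpus(RawCorpus):
    def read(self):
        # word at column 0, tag at column 1; everything else stays in RawCorpus
        return RawCorpus.read(self, needed_fields=(0, 1))

lines = ["the\tDT", "dog\tNN", "", "runs\tVBZ"]
sents = list(TaggedCorpus(lines).read())
```

With this layout the featurizer's preprocessing path can just call RawCorpus.read() with the default needed_fields and get words only.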
experiment with maxout as hidden layers
for ProjectionLayer, it should be possible to set parameters from embeddings trained outside of this library
iteration.py:784: UserWarning: dataset is using the old iterator interface which is deprecated and will become officially unsupported as of July 28, 2015. The dataset should implement a get method respecting the new interface.
If there is a word in the training data that is not in the predefined vocab, then it will be handled as unknown; check whether this is done properly when using it with --sets train,test,valid