zseder / hunvec
Sequential Tagging in NLP using neural networks
- option for automatic replacement of numerals
- option for giving a closed vocabulary (so that words outside of it are mapped to unknown, even if they appear in the training data)
extensions should be an empty list by default
after training, if we use the model for tagging, we need index->word and index->tag resolution
Traceback (most recent call last):
  File "hunvec/seqtag/trainer.py", line 123, in <module>
    main()
  File "hunvec/seqtag/trainer.py", line 119, in main
    wt.train()
  File "/home/pajkossy/git/hunvec/hunvec/seqtag/sequence_tagger.py", line 157, in train
    self.algorithm.train(dataset=self.dataset['train'])
  File "/home/pajkossy/pylearn2/pylearn2/training_algorithms/sgd.py", line 453, in train
    self.sgd_update(*batch)
  File "/home/pajkossy/theano/theano_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 588, in __call__
    self.fn.thunks[self.fn.position_of_error])
  File "/home/pajkossy/theano/theano_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 579, in __call__
    outputs = self.fn()
IndexError: index 9360 is out of bounds for size 9354
Apply node that caused the error: AdvancedSubtensor1(feats_W, Elemwise{Cast{int64}}.0)
Inputs shapes: [(9354, 100), (168,)]
Inputs strides: [(400, 4), (8,)]
Inputs types: [TensorType(float32, matrix), TensorType(int64, vector)]
Use the Theano flag 'exception_verbosity=high' for a debugprint of this apply node.
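The IndexError above means a feature index (9360) exceeds the number of rows in the embedding matrix feats_W (9354), i.e. an index for a feature unseen at matrix-creation time reaches the lookup. A minimal numpy sketch of the usual fix, remapping out-of-range indices onto a dedicated unknown row before the lookup (the function and the reserved-row convention are illustrative, not hunvec's API):

```python
import numpy as np

def safe_lookup(W, indices, unk_index):
    """Map any index outside W's rows onto unk_index, then do the embedding lookup."""
    indices = np.asarray(indices)
    safe = np.where(indices >= W.shape[0], unk_index, indices)
    return W[safe]

W = np.zeros((5, 3))   # 5-row embedding matrix, 3-dim vectors
W[4] = 1.0             # row 4 reserved for the unknown word
rows = safe_lookup(W, [0, 7, 2], unk_index=4)  # index 7 would crash a raw W[...] lookup
```

The same remapping has to happen consistently at training and tagging time, otherwise the two vocabularies drift apart.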
Traceback (most recent call last):
  File "hunvec/seqtag/trainer.py", line 123, in <module>
    main()
  File "hunvec/seqtag/trainer.py", line 118, in main
    wt.create_algorithm(d, args.model_path)
  File "/home/pajkossy/git/hunvec/hunvec/seqtag/sequence_tagger.py", line 151, in create_algorithm
    self.algorithm.setup(self, self.dataset['train'])
  File "/home/pajkossy/pylearn2/pylearn2/training_algorithms/sgd.py", line 316, in setup
    **fixed_var_descr.fixed_vars)
  File "/home/pajkossy/git/hunvec/hunvec/cost/seq_tagger_cost.py", line 22, in expr
    sc += model.tagger.get_weight_decay(self.reg[0])
  File "/home/pajkossy/pylearn2/pylearn2/models/mlp.py", line 695, in get_weight_decay
    for layer, coeff in safe_izip(self.layers, coeffs):
  File "/home/pajkossy/pylearn2/pylearn2/utils/__init__.py", line 277, in safe_izip
    assert all([len(arg) == len(args[0]) for arg in args])
AssertionError
argparse is okay right now, but there are already a lot of options and there will be many more later, so we need a config infrastructure
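One lightweight option is the stdlib config-file parser (ConfigParser under Python 2, configparser under Python 3); a sketch with a hypothetical layout, where the section and option names are illustrative and not hunvec's actual options:

```python
from configparser import ConfigParser
from io import StringIO

# Hypothetical config layout; names are illustrative only.
sample = """\
[model]
hidden_dims = 100,100
embedding_dim = 50

[training]
learning_rate = 0.01
"""

cfg = ConfigParser()
cfg.read_file(StringIO(sample))
hidden_dims = [int(d) for d in cfg.get('model', 'hidden_dims').split(',')]
lr = cfg.getfloat('training', 'learning_rate')
```

Config files also version nicely in git, which helps with the reproducibility point below about running experiments without always changing parameters.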
When we use a bigger vocabulary than what is in the training data (for example, external embeddings at evaluation time), ProjectionLayer contains a lot of invariant parameters.
We should create our own ProjectionLayer (inheriting the one from pylearn2) that knows which parameters are constant and which are trainable, and then change ProjectionLayer.get_params() to return only the part of W that is trainable (if this slicing operation is permitted in theano).
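A plain-numpy illustration of the idea, keeping the first n_fixed rows of W frozen and exposing only the remainder to the optimizer (the class and method names are hypothetical, not pylearn2's API; in pylearn2 this would go through get_params() on a ProjectionLayer subclass):

```python
import numpy as np

class PartiallyFrozenProjection(object):
    """Embedding matrix whose first n_fixed rows are constant (e.g. external embeddings)."""
    def __init__(self, W, n_fixed):
        self.W = W
        self.n_fixed = n_fixed

    def get_params(self):
        # Only the trainable slice is handed to the optimizer.
        return self.W[self.n_fixed:]

    def apply_update(self, grad, lr):
        # Updates touch only the trainable rows; the fixed rows never move.
        self.W[self.n_fixed:] -= lr * grad

W = np.ones((6, 4))
layer = PartiallyFrozenProjection(W, n_fixed=4)
layer.apply_update(np.ones((2, 4)), lr=0.5)
```

Whether theano accepts a sliced shared variable as a trainable parameter is exactly the open question in the note above.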
dropout should be easy with pylearn2
save featurized datasets into a file to avoid reading and featurizing them over and over again from 3 different files when there are train/test/devel splits
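A simple caching helper along these lines (the helper name and the cache-next-to-the-corpus convention are assumptions, not hunvec's layout):

```python
import os
import pickle
import tempfile

def load_or_featurize(path, featurize, cache_suffix='.feats.pkl'):
    """Featurize `path` once; later calls load the pickled result instead."""
    cache = path + cache_suffix
    if os.path.exists(cache):
        with open(cache, 'rb') as f:
            return pickle.load(f)
    data = featurize(path)
    with open(cache, 'wb') as f:
        pickle.dump(data, f)
    return data

calls = []
def fake_featurize(path):
    calls.append(path)            # counts how many times real work happens
    return {'feats': [1, 2, 3]}

tmp = tempfile.mkdtemp()
corpus = os.path.join(tmp, 'train.txt')
open(corpus, 'w').close()         # stand-in for a real corpus file
first = load_or_featurize(corpus, fake_featurize)
second = load_or_featurize(corpus, fake_featurize)  # served from the cache
```

The cache must be invalidated when the featurizer configuration changes; hashing the relevant options into the suffix is one way to handle that.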
Traceback (most recent call last):
  File "hunvec/seqtag/tagger.py", line 61, in <module>
    main()
  File "hunvec/seqtag/tagger.py", line 57, in main
    tag(args)
  File "hunvec/seqtag/tagger.py", line 48, in tag
    tags = wt.tag_sen(words, feats)
  File "/home/pajkossy/hunvec/hunvec/seqtag/sequence_tagger.py", line 224, in tag_sen
    y = self.f(words, feats)
  File "/home/pajkossy/pylearn_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 608, in __call__
    storage_map=self.fn.storage_map)
  File "/home/pajkossy/pylearn_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 597, in __call__
    outputs = self.fn()
IndexError: index 30 is out of bounds for size 28
Apply node that caused the error: AdvancedSubtensor1(feats_W, Flatten{1}.0)
Inputs types: [TensorType(float64, matrix), TensorType(int64, vector)]
Inputs shapes: [(28, 5), (52,)]
Inputs strides: [(40, 8), (8,)]
Inputs values: ['not shown', 'not shown']
It's quite slow right now, but there is a PR (lisa-lab/pylearn2#1406) with which LM training can be sped up
viterbi is done by theano, but it would be easier to run theano-independent tagging + viterbi + F1 computation with plain numpy and somehow add its result to the monitor; we don't need this information during the training step
so correct evaluation is also only possible using tagger.py
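A theano-independent Viterbi decoder over per-position tag scores and a tag-transition matrix could look like this (a plain-numpy sketch under the usual log-score formulation, not hunvec's implementation):

```python
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (T, K) per-position tag log-scores; transitions: (K, K)
    log-scores for moving from tag i to tag j. Returns the best tag sequence."""
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j] = score[i] + transitions[i, j]; maximize over previous tag i
        cand = score[:, None] + transitions
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emissions[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

best = viterbi(np.array([[0., -10.], [-10., 0.]]), np.zeros((2, 2)))  # -> [0, 1]
```

Since this needs only the emission scores and the transition matrix, it can run on held-out data outside the theano graph and feed F1 into the monitor, exactly as the note suggests.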
right now we need small changes in pylearn2/utils/iteration.py and pylearn2/space/__init__.py to be able to run. Why are they necessary? Should our changes go into pylearn2? Can we avoid them?
prepare an easy-to-use tagging script; the trainer script should be renamed to training, split into train and NN, etc.
currently, while preparing datasets, very short sentences are dropped.
Even if training is not possible with them, the test data should still contain them so that test results are reliable
even if the lr_monitor_decay flag is on
coeff will be a variable-length tuple depending on the number of hidden layers
if fake_feats is there along with other features, the convergence is way slower. Maybe it is because pylearn2's SGD gets confused by features that are always active
it would be easier to run experiments without always changing parameters (and it is more git-friendly)
log(sum_i(exp(z_i))) = max_i z_i + log(sum_i(exp(z_i - max_i z_i)))
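This is the standard log-sum-exp trick: shifting by the maximum keeps every exponent at or below zero, so nothing overflows. A minimal numpy version:

```python
import numpy as np

def logsumexp(z):
    """Numerically stable log(sum(exp(z))): shift by the max before exponentiating."""
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

z = np.array([1000.0, 1000.0])
stable = logsumexp(z)   # 1000 + log(2); the naive np.log(np.exp(z).sum()) overflows to inf
```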
~5 (no, the representation is sparse because of high-dimensional feature vectors)
if words in the training data are not included, their embeddings won't be used
possibly the vector belonging to the unknown word (-1) is the same as for the last word, because of python indexing. The new vocab size should be vocab+1, and then they won't collide
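The collision is easy to demonstrate with numpy indexing, along with the vocab+1 fix of reserving an explicit extra row (the matrix here is a toy stand-in, not hunvec's actual embeddings):

```python
import numpy as np

W = np.arange(12).reshape(4, 3)   # vocab of 4 words
unk = -1                          # "unknown" encoded as -1
# With python indexing, W[-1] IS the embedding of the last real word:
collides = (W[unk] == W[3]).all()

# Reserving an extra row removes the collision:
W2 = np.vstack([W, np.zeros((1, 3))])  # vocab+1 rows; row 4 is now the unknown word
```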
in sequence_tagger.py, dropout_fprop() uses self.hdims, but that parameter isn't saved with the model (see __get_state__), so it fails. Instead of checking hdims, a check on self.layers would be better if possible. If not, hdims has to go into __get_state__, but hopefully not, since it is redundant.
tags are now encoded using one-hot encoding, but is that needed? Couldn't they be simple integers?
TaggedCorpus should be split into RawCorpus and TaggedCorpus.
In RawCorpus.read(), maybe we should use a needed_fields argument with a default value of [0] (0th index = word), and TaggedCorpus.read() should only contain a call to RawCorpus.read(needed_fields=[0,1]), where at the first place there is the tag.
read() should keep the pre flag, so when the featurizer is preprocessing the data, it will only return words.
In RawCorpus.__init__(), a needed_fields flag for read() will be good, and only words will be returned, so hopefully no change is needed there.
locations of gazetteer lists etc.
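A sketch of the proposed split, assuming a tab-separated column format with blank lines between sentences (the class names follow the note above; the input handling is simplified for illustration):

```python
class RawCorpus(object):
    """Reads a column-formatted corpus; sentences are separated by blank lines."""
    def __init__(self, lines):
        self.lines = lines  # for the sketch, an iterable of strings

    def read(self, needed_fields=(0,)):
        sen = []
        for line in self.lines:
            line = line.rstrip('\n')
            if not line:
                if sen:
                    yield sen
                sen = []
                continue
            fields = line.split('\t')
            sen.append(tuple(fields[i] for i in needed_fields))
        if sen:
            yield sen


class TaggedCorpus(RawCorpus):
    def read(self):
        # word at column 0, tag at column 1; everything else stays in RawCorpus
        return RawCorpus.read(self, needed_fields=(0, 1))

lines = ["the\tDT", "dog\tNN", "", "runs\tVBZ"]
sents = list(TaggedCorpus(lines).read())
```

With this layout the featurizer's preprocessing path can just call RawCorpus.read() with the default needed_fields and get words only.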
experiment with maxout as hidden layers
for ProjectionLayer, it should be possible to set parameters from embeddings trained outside of this library
iteration.py:784: UserWarning: dataset is using the old iterator interface which is deprecated and will become officially unsupported as of July 28, 2015. The dataset should implement a get method respecting the new interface.
If there is a word in the training data that is not in the predefined vocab, then it will be handled as unknown; check whether this is done properly when using it with --sets train,test,valid