
hunvec's Introduction

Pylearn2 (and hunvec) development has ended

Unfortunately pylearn2 no longer has any developers; the maintainers only plan to merge new pull requests, they warn users not to expect any new features, and they suggest using other libraries instead. We decided not to continue the development of hunvec relying on a deprecated library.

Why use hunvec?

hunvec is being developed to use neural networks for various NLP tasks. Our intention is to support researchers by giving them a tool with which one can experiment with different settings when creating neural networks. It is built upon pylearn2/theano, so recent advances in deep learning that are supported in pylearn2 will hopefully work out of the box. It currently supports a basic sequential tagging network based on the Natural Language Processing (Almost) from Scratch paper (Collobert et al., 2011). We designed hunvec to be easily reconfigurable (adding and removing layers, testing hyperparameters, adding new features like dropout) so we can test how well new advances work on NLP tasks. If you have any questions, feel free to use the issues page, or contact us: [email protected]; [email protected]

Sequential tagging

The library is ready for POS and NER (or any other bie1-tagged) training. Preprocessing is done by hunvec/datasets/prepare.py.

Unfortunately, because of a pylearn2/theano limitation, only batch_size==1 can be used, but the training works. There are also scripts for evaluating (F-score and per-word precision) and for tagging with these models. Models can be read back, and their training can be continued. The library supports featurizing; right now only 3gram, casing, and gazetteer features are implemented, but it is easily extendable with pure Python methods (see the sketch below). There are many training options for hunvec/seqtag/trainer.py; see its help message.
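For instance, a new pure-Python feature might look like the following (a minimal, hypothetical sketch; the exact interface expected by features/features.py may differ):

def casing_feature(word):
    """Hypothetical feature function: map a token to a coarse casing class."""
    if word.isupper():
        return 'ALLCAPS'
    if word[0].isupper():
        return 'CAPITALIZED'
    if any(ch.isdigit() for ch in word):
        return 'HAS_DIGIT'
    return 'LOWER'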

Good to know: IndexSequenceSpace.make_theano_batch() and VectorSequenceSpace.make_theano_batch() in pylearn2/space/__init__.py currently have to be modified.

instead of

if batch_size == 1:
    return tensor.matrix(name=name)

one should use

if batch_size == 1 or batch_size is None:
    return tensor.matrix(name=name, dtype=self._dtype)

Sample calls:

Datasets have to be in the common format:

  • one token per line
  • empty line separates sentences
  • in one line: tab-separated fields, with the word in the first column and the tag in the last one
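
For example (one word<TAB>tag pair per line; POS tags are shown here for illustration, but NER tags work the same way):

The	DT
dog	NN
barks	VBZ
.	.

A	DT
second	JJ
sentence	NN
follows	VBZ
.	.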

Preparing a dataset with a train/test/devel split (before preparing, features can be turned on and off in features/features.py):

python hunvec/datasets/prepare.py \
    -w 3 \
    --test_file data/eng.bie1.test \
    --valid_file data/eng.bie1.devel \
    data/eng.bie1.train preprocessed_dataset.pickle

For training, or for continuing a trained model, with a given dataset:

python hunvec/seqtag/trainer.py \
    --epochs 100 \
    --regularization 1e-5 \
    --valid_stop \
    --embedding 100 \
    --hidden 200 \
    --lr .1 \
    --lr_lin_decay .1 \
    --lr_scale \
    dataset.pickle \
    output_model

For evaluation:

python hunvec/seqtag/eval.py --fscore --sets test,valid,train dataset model

For tagging:

cut -f1 data/eng.bie1.train | python hunvec/seqtag/tagger.py model > tagged

multi-model training

There is support for using one model's output in another model. One example use case: train a POS tagger, then use its output as input for an NE tagger. To achieve this, the datasets have to be prepared together, so that words and features have the same indices in both datasets. This is still experimental; we haven't achieved good results yet, but we are continuously working on it.

Example call for preparing multiple datasets together:

python hunvec/datasets/prepare.py \
    -w 3 \
    --test_file test1.tsv,test2.tsv \
    --valid_file valid1.tsv,valid2.tsv \
    train1.tsv,train2.tsv model1.pickle,model2.pickle

Then, if a first model has been trained individually on the first dataset and saved into model1.pickle, it can be passed to trainer.py with the --embedded_model option. Don't forget to use datasets that were prepared together.
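
The second training call could then look something like this (a hypothetical sketch; the file names and the other options are only illustrative):

python hunvec/seqtag/trainer.py \
    --embedded_model model1.pickle \
    --epochs 100 \
    dataset2.pickle \
    output_model2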

Language modeling

Dataset preprocessing, network creation and training are done, but without hierarchical softmax or negative sampling it is very slow. For negative sampling, there is ongoing work in the pylearn2 community; there was lisa-lab/pylearn2#1406, which wasn't finished, but hopefully will be soon.
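
For reference, negative sampling replaces the full-vocabulary softmax with a handful of binary decisions: one observed word is scored against k sampled noise words instead of all |V| outputs. A minimal numpy sketch of the per-example loss (raw dot-product scores are assumed):

import numpy as np

def negative_sampling_loss(pos_score, neg_scores):
    """Logistic loss for one observed word against k sampled noise words."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    neg_scores = np.asarray(neg_scores)
    # push the true word's score up and the noise words' scores down
    return -np.log(sigmoid(pos_score)) - np.log(sigmoid(-neg_scores)).sum()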

hunvec's People

Contributors

makrai, pajkossy, zseder


hunvec's Issues

error message when using regularization

Traceback (most recent call last):
  File "hunvec/seqtag/trainer.py", line 123, in <module>
    main()
  File "hunvec/seqtag/trainer.py", line 118, in main
    wt.create_algorithm(d, args.model_path)
  File "/home/pajkossy/git/hunvec/hunvec/seqtag/sequence_tagger.py", line 151, in create_algorithm
    self.algorithm.setup(self, self.dataset['train'])
  File "/home/pajkossy/pylearn2/pylearn2/training_algorithms/sgd.py", line 316, in setup
    **fixed_var_descr.fixed_vars)
  File "/home/pajkossy/git/hunvec/hunvec/cost/seq_tagger_cost.py", line 22, in expr
    sc += model.tagger.get_weight_decay(self.reg[0])
  File "/home/pajkossy/pylearn2/pylearn2/models/mlp.py", line 695, in get_weight_decay
    for layer, coeff in safe_izip(self.layers, coeffs):
  File "/home/pajkossy/pylearn2/pylearn2/utils/__init__.py", line 277, in safe_izip
    assert all([len(arg) == len(args[0]) for arg in args])
AssertionError

when using --dropout

Traceback (most recent call last):
  File "hunvec/seqtag/trainer.py", line 123, in <module>
    main()
  File "hunvec/seqtag/trainer.py", line 119, in main
    wt.train()
  File "/home/pajkossy/git/hunvec/hunvec/seqtag/sequence_tagger.py", line 157, in train
    self.algorithm.train(dataset=self.dataset['train'])
  File "/home/pajkossy/pylearn2/pylearn2/training_algorithms/sgd.py", line 453, in train
    self.sgd_update(*batch)
  File "/home/pajkossy/theano/theano_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 588, in __call__
    self.fn.thunks[self.fn.position_of_error])
  File "/home/pajkossy/theano/theano_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 579, in __call__
    outputs = self.fn()
IndexError: index 9360 is out of bounds for size 9354
Apply node that caused the error: AdvancedSubtensor1(feats_W, Elemwise{Cast{int64}}.0)
Inputs shapes: [(9354, 100), (168,)]
Inputs strides: [(400, 4), (8,)]
Inputs types: [TensorType(float32, matrix), TensorType(int64, vector)]
Use the Theano flag 'exception_verbosity=high' for a debugprint of this apply node.

optparse

it would be easier to run experiments without always changing parameters in the code (and it would be more git-friendly)

tagging script

prepare an easy-to-use tagging script; the trainer script should be renamed to training, split into train and NN parts, etc.

evaluation of short sentences

Currently, very short sentences are dropped while preparing datasets.
Even if training is not possible with them, the test data should still contain them, so that test results are reliable.

save datasets

save them into a file to avoid reading and featurizing the 3 different files over and over again when there are train/test/devel splits

continue training

  • if a model is given to the trainer, continue training and don't use the training parameters
  • later, maybe we can allow changing training parameters this way

F1 monitoring

Viterbi is done by theano, but it would be easier to run theano-independent tagging + Viterbi + F1 computation with plain numpy and somehow add its result to the monitor; we don't need this information during the training step. See the sketch below.
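
A theano-independent Viterbi decoder along these lines can be written in a few lines of numpy (a sketch; the (T, K) emission and (K, K) transition score shapes are assumptions):

import numpy as np

def viterbi(emissions, transitions):
    """Decode the best tag sequence from log-space scores.

    emissions:   (T, K) array, per-position tag scores
    transitions: (K, K) array, transitions[i, j] = score of tag i -> tag j
    """
    T, K = emissions.shape
    score = emissions[0].copy()            # best score ending in each tag
    backptr = np.zeros((T, K), dtype=int)  # best previous tag at each step
    for t in range(1, T):
        cand = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    tags = [int(score.argmax())]
    for t in range(T - 1, 0, -1):          # follow the back-pointers
        tags.append(int(backptr[t, tags[-1]]))
    return tags[::-1]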

prepare.py options

  • option for automatic replacement of numerals (see the sketch below)
  • option for giving a closed vocabulary (so that words outside of it are mapped to unknown, even if they appear in the training data)
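
The numeral replacement could be as simple as the following (a hypothetical sketch of what the option would do):

import re

def normalize_numerals(word):
    # map every digit to '0', so e.g. '1984' and '2015' share one form
    return re.sub(r'[0-9]', '0', word)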

ProjectionLayer fix parameters

When we use a bigger vocabulary than what is in the training data (external embeddings at evaluation time, for example), there are a lot of invariant parameters in ProjectionLayer.
We should create our own ProjectionLayer (inheriting from the pylearn2 one) that knows which parameters are constant and which are changeable, and then change ProjectionLayer.get_params() to return only the part of W that is changeable (if this slicing operation is permitted in theano). See the sketch below.
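
One possible direction (a hypothetical sketch, not the real pylearn2 API): keep the frozen and trainable embeddings in separate shared variables, concatenate them for the lookup, and expose only the trainable block to the optimizer.

import numpy as np
import theano
import theano.tensor as T

# frozen external embeddings and a smaller trainable block
W_fixed = theano.shared(np.zeros((100000, 100), dtype='float32'), name='W_fixed')
W_train = theano.shared(np.zeros((5000, 100), dtype='float32'), name='W_train')
W = T.concatenate([W_fixed, W_train], axis=0)  # full lookup table

idx = T.ivector('idx')
emb = W[idx]  # embedding lookup over the concatenated table

# a get_params() analogue would return [W_train] only, so SGD never
# touches W_fixed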

pylearn incompatibility

We currently need small changes in pylearn2/utils/iteration.py and pylearn2/space/__init__.py to be able to run. Why are they necessary? Should our changes go into pylearn2? Can we overcome this?

SequenceDataSpace

  • would it be a better choice than IndexSequenceSpace?
  • can it be used to create batches of size 2 or more?

tagger.py bug

Traceback (most recent call last):
  File "hunvec/seqtag/tagger.py", line 61, in <module>
    main()
  File "hunvec/seqtag/tagger.py", line 57, in main
    tag(args)
  File "hunvec/seqtag/tagger.py", line 48, in tag
    tags = wt.tag_sen(words, feats)
  File "/home/pajkossy/hunvec/hunvec/seqtag/sequence_tagger.py", line 224, in tag_sen
    y = self.f(words, feats)
  File "/home/pajkossy/pylearn_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 608, in __call__
    storage_map=self.fn.storage_map)
  File "/home/pajkossy/pylearn_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 597, in __call__
    outputs = self.fn()
IndexError: index 30 is out of bounds for size 28
Apply node that caused the error: AdvancedSubtensor1(feats_W, Flatten{1}.0)
Inputs types: [TensorType(float64, matrix), TensorType(int64, vector)]
Inputs shapes: [(28, 5), (52,)]
Inputs strides: [(40, 8), (8,)]
Inputs values: ['not shown', 'not shown']

Dropout

dropout should be easy with pylearn2

TaggedCorpus refactoring

TaggedCorpus should be split into RawCorpus and TaggedCorpus (see the sketch after this list).

  • only tag-related content goes into TaggedCorpus
  • read() should be implemented this way:
    • RawCorpus.read() should perhaps take a needed_fields argument with a default value of [0] (index 0 = word), and TaggedCorpus.read() should only contain a call to RawCorpus.read(needed_fields=[0, 1]), with the tag at index 1
    • the only problem is that from now on, words will be 1-length lists, to be compatible with tags, which will be 2-length lists (NOT tuples, because lists can be changed in place, so it's easier to turn them into integers later)
  • read() should keep the pre flag, so when the featurizer is preprocessing the data, it will only return words
    • if we call featurizer preprocessing in RawCorpus.__init__(), the needed_fields flag for read() will suffice and only words will be returned, so hopefully no change is needed there
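
A minimal sketch of the proposed split (the names follow the issue text; the details are assumptions):

class RawCorpus(object):
    def read(self, stream, needed_fields=(0,)):
        """Yield sentences; every token keeps only the requested fields."""
        sen = []
        for line in stream:
            line = line.rstrip('\n')
            if not line:  # empty line separates sentences
                if sen:
                    yield sen
                sen = []
                continue
            fields = line.split('\t')
            sen.append([fields[i] for i in needed_fields])
        if sen:
            yield sen


class TaggedCorpus(RawCorpus):
    def read(self, stream):
        # field 0 is the word, field 1 the tag
        return RawCorpus.read(self, stream, needed_fields=(0, 1))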

fake_feats error

if fake_feats are present together with other features, convergence is much slower. Maybe it is because pylearn2's SGD gets confused by features that are always active

unknown word's vector

the vector belonging to the unknown word (index -1) is possibly the same as the one belonging to the last word, because of Python indexing. The new vocab size should be vocab+1, and then they won't collide. See the demonstration below.
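
The collision is easy to demonstrate with plain numpy (toy sizes, only for illustration):

import numpy as np

W = np.random.rand(5, 3)              # vocab of 5 words, 3-dim embeddings
unknown = -1                          # index used for unknown words
print(np.allclose(W[unknown], W[4]))  # True: -1 aliases the last word

# fix: allocate vocab+1 rows and give the unknown word its own row
W = np.random.rand(5 + 1, 3)
unknown = 5                           # dedicated row, no collision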

external embeddings

for ProjectionLayer, it should be possible to set the parameters from embeddings trained outside of this library

config file

argparse is okay right now, but there are a lot of options and there will be a lot more later, so we need a config infrastructure

  • the current command-line arguments should be available in it
  • new configs for the possible datasets (eng+ner, hun+pos, etc.)

maxout

experiment with maxout hidden layers

configurable hidden layers

  • should be able to run with n hidden layers with different numbers of units, for experimenting
  • watch out for regularization: coeff will be a variable-length tuple depending on the number of hidden layers

save corpus output

after training, if we use the model for tagging, we need index->word and index->tag resolution

implement new get() method for dataset

iteration.py:784: UserWarning: dataset is using the old iterator interface which is deprecated and will become officially unsupported as of July 28, 2015. The dataset should implement a get method respecting the new interface.
