
Variational Methods for Pretraining in Resource-limited Environments

License: Apache License 2.0

Python 84.78% Dockerfile 0.46% Shell 0.47% Jsonnet 11.28% Jupyter Notebook 3.00%

vampire's Introduction

VAMPIRE

VAriational Methods for Pretraining In Resource-limited Environments

Read paper here.

Citation

@inproceedings{vampire,
 author = {Suchin Gururangan and Tam Dang and Dallas Card and Noah A. Smith},
 title = {Variational Pretraining for Semi-supervised Text Classification},
 year = {2019},
 booktitle = {Proceedings of ACL},
}

Installation

Install the necessary dependencies via requirements.txt, which includes the latest unreleased version of allennlp (installed from the master branch).

pip install -r requirements.txt

Install the spacy English model with:

python -m spacy download en

Verify your installation by running:

SEED=42 pytest -v --color=yes vampire

All tests should pass.

Install from Docker

Alternatively, you can install the repository with Docker.

First, build the container:

docker build -f Dockerfile --tag vampire/vampire:latest .

Then, run the container:

docker run -it vampire/vampire:latest

This will open a shell in a docker container that has all the dependencies installed.

Download Data

Download your dataset of interest and make sure it is made up of JSON-lines files, where each line of each file corresponds to a separate instance. Each line must contain a text field and, optionally, a label field.
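For example, a single line of a hypothetical train.jsonl could look like the following (the field values here are illustrative, not taken from the actual corpus):

{"text": "wall st. bears claw back into the black", "label": "Business"}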

In this tutorial we use the AG News dataset hosted on AllenNLP. Download it using the following script:

sh scripts/download_ag.sh

This will create an examples/ag directory with train, dev, and test files from the AG News corpus.

Preprocess data

To make pretraining fast, we precompute fixed bag-of-words representations of the data.

python -m scripts.preprocess_data \
            --train-path examples/ag/train.jsonl \
            --dev-path examples/ag/dev.jsonl \
            --tokenize \
            --tokenizer-type spacy \
            --vocab-size 30000 \
            --serialization-dir examples/ag

This script will tokenize your data and save the resulting output in the specified --serialization-dir.

Alternatively, we provide a tar file containing pre-processed AG News data (with the vocabulary size set to 30K).

Run

curl -Lo examples/ag/ag.tar https://s3-us-west-2.amazonaws.com/allennlp/datasets/ag-news/vampire_preprocessed_example.tar
tar -xvf examples/ag/ag.tar -C examples/

to access its contents.

In examples/ag (after running the preprocess_data module or unpacking ag.tar), you should see:

  • train.npz - pre-computed bag-of-words representations of the training data (see the loading sketch after this list)
  • dev.npz - pre-computed bag-of-words representations of the dev data
  • vampire.bgfreq - background word frequencies
  • vocabulary/ - AllenNLP vocabulary directory
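To inspect these files, note that (as an assumption about the file format, based on the use of scipy sparse matrices in this pipeline) the .npz files can be read back as document-by-word count matrices:

from scipy import sparse

# load the precomputed bag-of-words matrix; rows are documents, columns are
# vocabulary entries (assumed format: a scipy sparse matrix saved via save_npz)
train = sparse.load_npz("examples/ag/train.npz")
print(train.shape)  # (num_documents, vocab_size)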

This script also creates a reference corpus used to calculate NPMI (normalized pointwise mutual information), a measure of topical coherence that we use for early stopping; a sketch of the computation follows the list below. By default, we use the validation data as our reference corpus. You can supply a --reference-corpus-path to the preprocessing script to use your own reference corpus.

In examples/ag/reference, you should see:

  • ref.npz - pre-computed bag-of-words representations of the reference corpus (the dev data)
  • ref.vocab.json - the reference corpus vocabulary
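As a rough illustration of the NPMI computation over such a reference corpus (a minimal sketch, not the repo's exact implementation), the coherence of a word pair can be computed from binary document-occurrence vectors:

import numpy as np

def npmi(occ1: np.ndarray, occ2: np.ndarray) -> float:
    """Normalized PMI of two words, given binary document-occurrence vectors."""
    n_docs = len(occ1)
    p1 = occ1.sum() / n_docs                      # P(word1 appears in a doc)
    p2 = occ2.sum() / n_docs                      # P(word2 appears in a doc)
    p12 = (occ1.astype(bool) & occ2.astype(bool)).sum() / n_docs
    if p12 == 0.0:
        return 0.0                                # no co-occurrence: define as 0
    return float(np.log(p12 / (p1 * p2)) / -np.log(p12))

A topic's NPMI is then an average over its top word pairs; values near 1 indicate highly coherent topics.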

Pretrain VAMPIRE

Set your data directory and vocabulary size as environment variables:

export DATA_DIR="$(pwd)/examples/ag"
export VOCAB_SIZE=30000

If you're training on a dataset that's too large to fit into RAM, run VAMPIRE in lazy mode by additionally exporting:

export LAZY=1

Then train VAMPIRE:

python -m scripts.train \
            --config training_config/vampire.jsonnet \
            --serialization-dir model_logs/vampire \
            --environment VAMPIRE \
            --device -1

This model can be run on a CPU (--device -1). To run on a GPU instead, run with --device 0 (or any other available CUDA device number).

This command will output training logs at model_logs/vampire.

For convenience, we include the --override flag to remove a previous experiment in the same serialization directory.

Inspect topics learned

During training, we output the learned topics after each epoch in the serialization directory, under model_logs/vampire.

After your model is finished training, check out the best_epoch field in model_logs/vampire/metrics.json, which corresponds to the training epoch at which NPMI is highest.

Then open up the corresponding epoch's file in model_logs/vampire/topics/.

Use VAMPIRE with a downstream classifier

Using VAMPIRE with a downstream classifier is essentially the same as using regular ELMo. See this documentation for details on how to do that.

This library has some convenience functions for including VAMPIRE with a downstream classifier.

First, set some environment variables:

  • VAMPIRE_DIR: path to newly trained VAMPIRE
  • VAMPIRE_DIM: dimensionality of the newly trained VAMPIRE (the token embedder needs it explicitly)
  • THROTTLE: the sample size of the data we want to train on.
  • EVALUATE_ON_TEST: whether or not you would like to evaluate on test
export VAMPIRE_DIR="$(pwd)/model_logs/vampire"
export VAMPIRE_DIM=81
export THROTTLE=200
export EVALUATE_ON_TEST=0

Then, you can run the classifier:

python -m scripts.train \
            --config training_config/classifier.jsonnet \
            --serialization-dir model_logs/clf \
            --environment CLASSIFIER \
            --device -1

As with VAMPIRE, this model can be run on a CPU (--device -1). To run on a GPU instead, run with --device 0 (or any other available CUDA device number).

This command will output training logs at model_logs/clf.

The dataset sample (specified by THROTTLE) is governed by the global seed supplied to the trainer; the same seed will result in the same subsampling of training data. You can set an explicit seed by passing the additional flag --seed to the train module.
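As an illustration of that behavior (hypothetical code, not the train module itself), fixing the seed makes the subsample deterministic:

import numpy as np

# the same seed always selects the same THROTTLE-sized subsample
rng = np.random.RandomState(42)        # 42 stands in for the --seed value
num_train_examples = 120000            # e.g. the full AG News training set
subsample = rng.choice(num_train_examples, size=200, replace=False)  # THROTTLE=200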

With 200 examples, we report a test accuracy of 83.9 ± 0.9 over 5 random seeds on the AG dataset. Note that your results may vary beyond these bounds under the low-resource setting.

Troubleshooting

If you're running into issues during training (e.g. NaN losses), check out the troubleshooting file.

vampire's People

Contributors

dangitstam, iechevarria, kernelmachine, shuningjin


vampire's Issues

Unexpected behavior changing the vocabulary size

I am excited to try out this method! I found two small issues while trying to change the vocab size with the current code.

Minor issue 1:

$git reset --hard origin/master && python -m scripts.make_reference_corpus examples/ag/dev.jsonl examples/ag/reference --vocab-size 1000

I get an error "TypeError: '>=' not supported between instances of 'str' and 'int'"

I think the issue is just that you need to specify the type of the vocab_size argument in scripts/make_reference_corpus.py.

Changing line 49 of scripts/make_reference_corpus.py to parser.add_option('--vocab-size', dest='vocab_size', default=None, type=int) fixes the error for me.

Minor issue 2:

It seems like the vocab size is hardcoded into the VAMPIRE environment. So if you run python -m scripts.train after preprocessing with a vocabulary size that is not 30K, you will get tensor mismatch errors from torch.

"VOCAB_SIZE": 30000,

Happy to put in a PR if that is helpful, but that may be overkill.

recover training: key error

When recovering prior VAMPIRE pretraining or classifier training, I encountered key errors:

Key [random_seed, pytorch_seed, numpy_seed] found in training configuration but not in the serialization directory we're recovering from.


Version: PyTorch 1.2.0, AllenNLP 0.8.4

Pretraining command:

export DATA_DIR="$(pwd)/examples/ag"
export VOCAB_SIZE=30000
python -m scripts.train \
            --config training_config/vampire.jsonnet \
            --serialization-dir model_logs/ag \
            --environment VAMPIRE \
            --device 0 \
            --seed 1 \
            --recover

Pretraining error:

2020-01-15 12:45:52,302 - INFO - allennlp.training.util - Recovering from prior training at model_logs/ag_1.
2020-01-15 12:45:52,324 - ERROR - allennlp.training.util - Key 'random_seed' found in training configuration but not in the serialization directory we're recovering from.
2020-01-15 12:45:52,324 - ERROR - allennlp.training.util - Key 'pytorch_seed' found in training configuration but not in the serialization directory we're recovering from.
2020-01-15 12:45:52,324 - ERROR - allennlp.training.util - Key 'numpy_seed' found in training configuration but not in the serialization directory we're recovering from.

Classification command:

export VAMPIRE_DIR="$(pwd)/model_logs/vampire"
export VAMPIRE_DIM=81
export THROTTLE=200
export EVALUATE_ON_TEST=1
python -m scripts.train \
            --config training_config/classifier.jsonnet \
            --serialization-dir model_logs/clf \
            --environment CLASSIFIER \
            --device 0 \
            --seed 1 \
            --recover

Classification error:

2020-01-15 12:52:46,094 - INFO - allennlp.training.util - Recovering from prior training at model_logs/clf_1.
2020-01-15 12:52:46,112 - ERROR - allennlp.training.util - Key 'pytorch_seed' found in training configuration but not in the serialization directory we're recovering from.
2020-01-15 12:52:46,112 - ERROR - allennlp.training.util - Key 'numpy_seed' found in training configuration but not in the serialization directory we're recovering from.
2020-01-15 12:52:46,112 - ERROR - allennlp.training.util - Key 'random_seed' found in training configuration but not in the serialization directory we're recovering from.

AssertionError in training scripts

I have been trying to run the tutorial in the README and wasn't able to run the training script.

I tried to run it on a different VM but still seem to get the same error: AssertionError: Please set STANFORDNLP_TEST_HOME environment variable for test working dir, base name must be: stanfordnlp_test.

The error stacktrace:

2019-07-17 06:19:27,060 - INFO - pytorch_pretrained_bert.modeling - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
Traceback (most recent call last):
  File "/opt/anaconda3/bin/allennlp", line 10, in <module>
    sys.exit(run())
  File "/opt/anaconda3/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/opt/anaconda3/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 101, in main
    import_submodules(package_name)
  File "/opt/anaconda3/lib/python3.7/site-packages/allennlp/common/util.py", line 328, in import_submodules
    for module_finder, name, _ in pkgutil.walk_packages(path):
  File "/opt/anaconda3/lib/python3.7/pkgutil.py", line 92, in walk_packages
    __import__(info.name)
  File "/opt/anaconda3/lib/python3.7/site-packages/tests/__init__.py", line 17, in <module>
    f'Please set {TEST_HOME_VAR} environment variable for test working dir, base name must be: {TEST_DIR_BASE_NAME}'
AssertionError: Please set STANFORDNLP_TEST_HOME environment variable for test working dir, base name must be: stanfordnlp_test
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/jupyter/vampire/scripts/train.py", line 75, in <module>
    main()
  File "/home/jupyter/vampire/scripts/train.py", line 71, in main
    subprocess.run(" ".join(allennlp_command), shell=True, check=True)
  File "/opt/anaconda3/lib/python3.7/subprocess.py", line 468, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'allennlp train --include-package vampire training_config/vampire.jsonnet -s model_logs/vampire' returned non-zero exit status 1.

Incorrect type annotation for encoder aggregation

The constructor of Seq2Seq defines aggregations as a str, but the forward method treats it as a sequence, and the jsonnet file splits a comma-separated string into an array. I'm not sure if something has changed in allennlp, but for me this results in an exception saying that aggregations is expected to be a string.

def __init__(self, architecture: Seq2SeqEncoder, aggregations: str) -> None:
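A plausible fix (an assumption based on the report above, not a committed patch) is to annotate the parameter as a sequence so that it matches both the forward method and the jsonnet config:

from typing import List

from allennlp.modules.seq2seq_encoders import Seq2SeqEncoder

# hypothetical corrected signature: aggregations is a list of aggregation names
def __init__(self, architecture: Seq2SeqEncoder, aggregations: List[str]) -> None:
    ...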

jsonnet_evaluate_file

Hi,
When I run python -m scripts.train --config training_config/vampire.jsonnet --serialization-dir model_logs/vampire-world --environment VAMPIRE --device 0 -o,
I encounter a problem like this:

Something went wrong during jsonnet_evaluate_file, please report this: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - invalid literal; last read: 'Z'
Aborted (core dumped)

Could you please give me a hand? Thanks!

.jsonnet consolidation

Abstract away all common .jsonnet functionality between pretraining and joint learning; deprecate / remove all .json files and irrelevant .jsonnet files.

RuntimeError: bincount only supports 1-d non-negative integral inputs.

When I tried to execute

python -m scripts.train --config training_config/classifier.jsonnet --serialization-dir model_logs/clf --environment CLASSIFIER --device 0

the following error happens,

File "/home/raman/nlp/lib/python3.6/site-packages/allennlp/modules/token_embedders/bag_of_word_counts_token_embedder.py", line 75, in forward
vec = torch.bincount(document, minlength=self.vocab_size).float()
RuntimeError: bincount only supports 1-d non-negative integral inputs.
2019-09-19 14:10:50,622 - INFO - allennlp.models.archival - removing temporary unarchived model dir at /tmp/tmpzvnq_hx6

Traceback (most recent call last):
File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/raman/vampire/scripts/train.py", line 75, in
main()
File "/home/raman/vampire/scripts/train.py", line 71, in main
subprocess.run(" ".join(allennlp_command), shell=True, check=True)
File "/usr/local/lib/python3.6/subprocess.py", line 418, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'allennlp train --include-package vampire training_config/classifier.jsonnet -s model_logs/clf' returned non-zero exit status 1.

I am using allennlp version 0.8.4. Can someone please help me with this?

Setting dimensions of encoder during joint semi-supervised training

It's possible that given a particular throttling of data that there could be fewer words observed during training than the vocab size specified in the config.

We should think of a way to determine a usable vocabulary size without having to let the model crash first. Possible ways to do this:

  • A preprocessing script that counts unique tokens after applying the same preprocessing as the models (spacy + regexes); see the sketch after this list
  • Reverting back to modularized encoders that instantiate themselves at runtime
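A minimal sketch of the first approach (assumed code, not from the repo), counting unique tokens after spacy tokenization so the vocabulary size can be chosen before training:

import json

import spacy

# count unique lowercased tokens in the training file so VOCAB_SIZE can be set
# up front instead of discovering a mismatch when the model crashes
nlp = spacy.blank("en")
vocab = set()
with open("examples/ag/train.jsonl") as f:
    for line in f:
        text = json.loads(line)["text"]
        vocab.update(token.text.lower() for token in nlp.tokenizer(text))
print(f"unique tokens: {len(vocab)}")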

new dataset loss problem

Hello, I am using your model on my personal dataset, a 4-class classification task similar to IMDB. I followed all your steps, but the performance is not that good. Even if I use all of my training data, the loss still decreases very slowly. A simple MLP will increase accuracy from 0.25 to over 0.8 on my dataset, but the VAE can only increase accuracy from 0.25 to near 0.6, and it takes much longer.
I have read your code for quite a long time, but I just cannot figure out what is going wrong. Can you give me some advice about it? Any advice will be appreciated.

E ImportError: cannot import name 'WordTokenizer' from 'allennlp.data.tokenizers'

================================== ERRORS ==================================
ERROR collecting vampire/tests/data/dataset_readers/semisupervised_text_classification_json_test.py
ImportError while importing test module '/Others_Models/vampire/vampire/tests/data/dataset_readers/semisupervised_text_classification_json_test.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
vampire/tests/data/dataset_readers/semisupervised_text_classification_json_test.py:8: in <module>
    from vampire.data.dataset_readers import SemiSupervisedTextClassificationJsonReader
vampire/data/__init__.py:1: in <module>
    from vampire.data.dataset_readers import SemiSupervisedTextClassificationJsonReader
vampire/data/dataset_readers/__init__.py:1: in <module>
    from vampire.data.dataset_readers.semisupervised_text_classification_json import (
vampire/data/dataset_readers/semisupervised_text_classification_json.py:12: in <module>
    from allennlp.data.tokenizers import Tokenizer, WordTokenizer
E   ImportError: cannot import name 'WordTokenizer' from 'allennlp.data.tokenizers' (~/.conda/envs/allennlp/lib/python3.7/site-packages/allennlp/data/tokenizers/__init__.py)

ERROR collecting vampire/tests/models/vampire_test.py
ImportError while importing test module '/Others_Models/vampire/vampire/tests/models/vampire_test.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
vampire/tests/models/vampire_test.py:8: in <module>
    from vampire.data.dataset_readers import VampireReader
vampire/data/__init__.py:1: in <module>
    from vampire.data.dataset_readers import SemiSupervisedTextClassificationJsonReader
vampire/data/dataset_readers/__init__.py:1: in <module>
    from vampire.data.dataset_readers.semisupervised_text_classification_json import (
vampire/data/dataset_readers/semisupervised_text_classification_json.py:12: in <module>
    from allennlp.data.tokenizers import Tokenizer, WordTokenizer
E   ImportError: cannot import name 'WordTokenizer' from 'allennlp.data.tokenizers' (/.conda/envs/allennlp/lib/python3.7/site-packages/allennlp/data/tokenizers/__init__.py)

I did not have any issues with installation.

numerical stability error?

I got the following error. Note that the difference is only on the order of 10^-16. Is this intended behavior?

================================== FAILURES ==================================
_______________________ TestVampire.test_npmi_computed_correctly _______________________

self = <vampire_test.TestVampire testMethod=test_npmi_computed_correctly>

def test_npmi_computed_correctly(self):
    save_dir = self.TEST_DIR / "save_and_load_test"
    model = train_model_from_file(self.param_file, save_dir, overrides="")

    topics = [(1, ["great", "movie", "film", "amazing", "wow", "best", "ridiculous", "ever", "good", "incredible", "positive"]),
              (2, ["bad", "film", "worst", "negative", "movie", "ever", "not", "any", "gross", "boring"])]
    npmi = model.compute_npmi(topics, num_words=10)

    ref_vocab = model._ref_vocab
    ref_counts = model._ref_count_mat

    vocab_index = dict(zip(ref_vocab, range(len(ref_vocab))))
    n_docs, _ = ref_counts.shape

    npmi_means = []
    for topic in topics:
        words = topic[1]
        npmi_vals = []
        for word_i, word1 in enumerate(words[:10]):
            if word1 in vocab_index:
                index1 = vocab_index[word1]
            else:
                index1 = None
            for word2 in words[word_i+1:10]:
                if word2 in vocab_index:
                    index2 = vocab_index[word2]
                else:
                    index2 = None
                if index1 is None or index2 is None:
                    _npmi = 0.0
                else:
                    col1 = np.array(ref_counts[:, index1].todense() > 0, dtype=int)
                    col2 = np.array(ref_counts[:, index2].todense() > 0, dtype=int)
                    sum1 = col1.sum()
                    sum2 = col2.sum()
                    interaction = np.sum(col1 * col2)
                    if interaction == 0:
                        assert model._npmi_numerator[index1, index2] == 0.0 and model._npmi_denominator[index1, index2] == 0.0
                        _npmi = 0.0
                    else:
                      assert model._ref_interaction[index1, index2] == np.log10(interaction)

E AssertionError: assert 1.0413926851582251 == 1.041392685158225
E -1.0413926851582251
E +1.041392685158225

vampire/tests/models/vampire_test.py:61: AssertionError
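For reference, the two values differ by one unit in the last place of a float64, so an exact equality assertion is fragile; a tolerance-based comparison (a suggested alternative, not the repo's test code) passes:

import numpy as np

# the two values from the failure differ only by ~1e-16
a, b = 1.0413926851582251, 1.041392685158225
assert a != b              # exact comparison distinguishes them
assert np.isclose(a, b)    # tolerance-based comparison treats them as equal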

Make a Colab

Want to prove this works in a low-resource environment? Make a Google Colab that reproduces one of your examples. Thanks.

Document training instability

We have gained a few insights from playing around with the model since publication, including some methods to circumvent training instability, especially when training on larger corpora.

Training instability usually manifests as NaN loss errors. To circumvent this, some easy things to try:

  1. Increase the batch size to at least 256
  2. Reduce the LR to 1e-4 or 1e-5. If you are training over a very large corpus, this shouldn't affect representation quality much.
  3. Use a learning rate scheduler; a slanted triangular scheduler has worked well for me. Make sure you tinker with the total number of epochs you train over.
  4. Clamp the KLD to some max value (e.g. 1000) so it doesn't diverge; see the sketch below
  5. Use a different KLD annealing scheduler (i.e. sigmoid)

Document these insights in the repo for future use.
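A minimal sketch of insight 4 (an assumed implementation, not code from the repo):

import torch

def clamp_kld(kld: torch.Tensor, max_value: float = 1000.0) -> torch.Tensor:
    # cap the KL-divergence term so a diverging KLD cannot turn the loss NaN
    return torch.clamp(kld, max=max_value)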

Hyperparameter search during the pre-training stage

Currently it's unclear what parameters allow for a strong topic model. It's reasonable to suspect that a better VAE will lead to better classifiers both downstream and in the joint model.

A hyperparameter search should be done to ensure the downstream models are using the best possible embeddings, and for the decided starting place of the joint model to be less arbitrary.

Decisions should be made about what parameters can stay fixed (e.g. it doesn't make sense for the pre-training to produce embeddings that are too large to be practical).

Incorrect mask type in Seq2Seq encoder

The Seq2Seq implementation of mean and max pooling crashes using allennlp 2.7 and pytorch 1.8.

The issue is that both mean and max pooling mask the encoded output by converting the original mask from bool to float and multiplying. This masked tensor is then passed to masked_mean or masked_max along with the float mask, but masked_mean and masked_max require a bool mask; masked_max fails on the first line:

vector.masked_fill(~mask, min_value_of_dtype(vector.dtype))

It seems unnecessary in the first place to mask the embeddings, since that is already done inside masked_mean and masked_max.
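A sketch of that simplification (assumed code, not a committed patch): keep the mask boolean and let the AllenNLP utilities apply it themselves:

import torch
from allennlp.nn.util import masked_max, masked_mean

encoded = torch.randn(2, 5, 8)  # (batch, seq_len, dim)
mask = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 1, 0, 0, 0]], dtype=torch.bool)
# pass the bool mask directly; no float conversion or pre-multiplication needed
mean_pooled = masked_mean(encoded, mask.unsqueeze(-1), dim=1)
max_pooled = masked_max(encoded, mask.unsqueeze(-1), dim=1)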

Validation NLL is suspicious (regularize?)

Topics are coherent and specific, yet the validation NLL suffers significantly.

Could this be due to dropping sigma * epsilon at prediction time? Could it be due to the absence of regularization measures?

NPMI

NPMI is the metric we will use for topic coherence.

The best way to log NPMI regularly throughout training is unclear; doing it once per epoch seems to work for now. Is it possible to do this once per batch, similar to other AllenNLP metrics?

Incorporating it as some sort of metric is desirable for Tensorboard integration and for use as an early-stopping metric on validation.

Colab Notebook Not Working

I am trying to get VAMPIRE to run on a Google Colab notebook. I have copied over the example from this repository to a Colab notebook.
However, running the notebook as-is does not work.
Is there anything wrong with what I've done?

How is hyperparameter search done in the codebase?

I understand this is more of an AllenNLP framework question than a VAMPIRE question, but since I am using this model I will just ask it here for a quick response.

How exactly is hyperparameter search done? I see a search-spaces folder with various json files and one jsonnet file with key:value entries like:

**"VAE_HIDDEN_DIM": {
    "sampling strategy": "integer",
    "bounds": [64, 128]   
},**

But I could not figure out how exactly to pass this config file to the model. I replaced the VAMPIRE dict entry in environment.py ("VAE_HIDDEN_DIM": 128,) with the one from vampire_ag_search.json (shown above), but that threw an error.

One workaround would be to have many config files with different values (say, dimensions 64, 80, 100, and 128) and train all the models separately, but my guess is that that is not the right way to do it here.
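For what it's worth, one way to read that search-space entry (my assumption; I haven't found it documented) is that each trial samples an integer uniformly from the bounds and exports it into the environment before calling scripts.train:

import random

# hypothetical interpretation of the search-space entry above
search_space = {"VAE_HIDDEN_DIM": {"sampling strategy": "integer", "bounds": [64, 128]}}
low, high = search_space["VAE_HIDDEN_DIM"]["bounds"]
vae_hidden_dim = random.randint(low, high)  # one sampled trial value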

BOW-embedder temporary fix

Currently, our models zero out these counts themselves; this should be removed when the PR making the BOW embedder zero out OOV and padding by default goes through.

jsonnet_evaluate_file issue on branch allennlp-1.0

Hi,

When following the instructions in https://github.com/allenai/dont-stop-pretraining/blob/master/DATA_SELECTION.md, upon running the command python -m scripts.train --config training_config/vampire.jsonnet --serialization-dir model_logs/vampire-world --environment VAMPIRE --device 0 -o, I get the following error:

overriding model_logs/vampire-world
2021-03-30 17:05:56,684 - INFO - transformers.file_utils - PyTorch version 1.5.1 available.
Something went wrong during jsonnet_evaluate_file, please report this: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - invalid literal; last read: 'Z'
Aborted
Traceback (most recent call last):
  File "/path/to/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/path/to/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/path/to/dont-stop-pretraining/vampire/scripts/train.py", line 89, in <module>
    main()
  File "/path/to/dont-stop-pretraining/vampire/scripts/train.py", line 85, in main
    subprocess.run(" ".join(allennlp_command), shell=True, check=True)
  File "/path/to/python3.7/subprocess.py", line 468, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'allennlp train --include-package vampire training_config/vampire.jsonnet -s model_logs/vampire-world' returned non-zero exit status 134.

I notice this is the same error as issue #65.

I also notice that training works if I use the master branch, so this seems to be an issue with the branch that the DSP DATA_SELECTION.md requires.

Could you perhaps help out with this? Thank you!

Issue with preprocessing background term

At line 90, the calculation of the background frequency seems wrong. Since the background frequency is essentially the probability of each word (its unigram language-model probability), the correct expression should be master.toarray().sum(0) / master.toarray().sum().

Here is what I found with ipdb when running the model on the AG dataset:

 /data/zeyuliu2/vampire/scripts/preprocess_data.py(93)main()
     92     ipdb.set_trace()
---> 93     print(f"all_data shape: {all_data.shape}")
     94     # generate background frequency

ipdb> l                                                                                                                      
     88     vectorized_dev_examples = sparse.hstack((np.array([0] * len(tokenized_dev_examples))[:,None], vectorized_dev_examples))
     89     master = sparse.vstack([vectorized_train_examples, vectorized_dev_examples])
     90     all_data = master.toarray()
     91     import ipdb
     92     ipdb.set_trace()
---> 93     print(f"all_data shape: {all_data.shape}")
     94     # generate background frequency
     95     print("generating background frequency...")
     96     bgfreq = dict(zip(count_vectorizer.get_feature_names(), all_data.sum(1) / args.vocab_size))
     97 
     98     print("saving data...")

ipdb> all_data.shape                                                                                                         
(120000, 30001)
ipdb> all_data.sum(1).shape                                                                                                  
(120000,)
ipdb> all_data.sum(0).shape                                                                                                  
(30001,)

There is no error message when running on the AG dataset because num_doc > num_vocab. But with my smaller dataset, where num_doc < 3000, there is an error (a dimension mismatch during training).

I fixed the bug; the fix is exactly what I suggested: master.toarray().sum(0) / master.toarray().sum().
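A self-contained sketch of the suggested fix (illustrative code, not the repo's script): sum the counts over documents (axis 0) and normalize by the total count, giving unigram probabilities over the vocabulary:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat"]
count_vectorizer = CountVectorizer()
master = count_vectorizer.fit_transform(docs)      # (n_docs, vocab_size), sparse
counts = np.asarray(master.sum(0)).ravel()         # per-word totals over all docs
bgfreq = dict(zip(count_vectorizer.get_feature_names(), counts / counts.sum()))
assert abs(sum(bgfreq.values()) - 1.0) < 1e-9      # probabilities sum to 1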

Warning: NaN or Inf found in input tensor.

When I run the vampire model on the AG dataset, it outputs Warning: NaN or Inf found in input tensor during training. Is this expected?
Here is my config file:

    "dataset_reader": {
        "type": "vampire_reader",
        "lazy": true
    },
    "iterator": {
        "type": "basic",
        "batch_size": 64,
        "track_epoch": true
    },
    "model": {
        "type": "vampire",
        "bow_embedder": {
            "type": "bag_of_word_counts",
            "ignore_oov": true,
            "vocab_namespace": "vampire"
        },
        "kl_weight_annealing": "linear",
        "linear_scaling": "1000",
        "reference_counts": "/data/zeyuliu2/vampire/examples/ag/reference/ref.npz",
        "reference_vocabulary": "/data/zeyuliu2/vampire/examples/ag/reference/ref.vocab.json",
        "sigmoid_weight_1": "0.25",
        "sigmoid_weight_2": "15",
        "track_npmi": true,
        "update_background_freq": false,
        "vae": {
            "type": "logistic_normal",
            "apply_batchnorm": false,
            "decoder": {
                "activations": "linear",
                "hidden_dims": [
                    30001
                ],
                "input_dim": 81,
                "num_layers": 1
            },
            "encoder": {
                "activations": [
                    "relu",
                    "relu"
                ],
                "hidden_dims": [
                    81,
                    81
                ],
                "input_dim": 30001,
                "num_layers": 2
            },
            "log_variance_projection": {
                "activations": "linear",
                "hidden_dims": [
                    81
                ],
                "input_dim": 81,
                "num_layers": 1
            },
            "mean_projection": {
                "activations": "linear",
                "hidden_dims": [
                    81
                ],
                "input_dim": "81",
                "num_layers": 1
            },
            "z_dropout": "0.49"
        }
    },
    "train_data_path": "/data/zeyuliu2/vampire/examples/ag/train.npz",
    "validation_data_path": "/data/zeyuliu2/vampire/examples/ag/dev.npz",
    "trainer": {
        "cuda_device": 0,
        "num_epochs": 50,
        "num_serialized_models_to_keep": 1,
        "optimizer": {
            "type": "adam",
            "lr": "0.00021"
        },
        "patience": 5,
        "validation_metric": "+npmi"
    },
    "vocabulary": {
        "type": "extended_vocabulary",
        "directory_path": "/data/zeyuliu2/vampire/examples/ag/vocabulary/"
    },
    "validation_dataset_reader": {
        "type": "vampire_reader",
        "lazy": true
    }
}

scipy version requirement

FYI: if scipy is 1.2, the tests will fail; upgrading to 1.3 fixes them. You might want to add this to requirements.txt.

Training for IMDB - Running out of patience

I am trying to train on the IMDB dataset, but after only a few epochs the training ends with the message Ran out of patience, which I guess comes from early stopping. Can you recommend any hyperparameters I should adjust to improve training?
