salesforce / awd-lstm-lm
LSTM and QRNN Language Model Toolkit for PyTorch
License: BSD 3-Clause "New" or "Revised" License
@Smerity I tried the following code:
import torch
import torch.nn as nn
from weight_drop import WeightDrop  # WeightDrop from this repo's weight_drop.py

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.gru = nn.GRU(10, 10)
        self.gru = WeightDrop(self.gru, ['weight_hh_l0'], dropout=0.5)

m = Model()
# The following operation throws an error
m.cuda()
Can you take a look and see where the problem occurs? I am using PyTorch 0.2.
I made some modifications to the codebase, so this might be a problem on my end... but does finetune.py require SplitCrossEntropyLoss to be used as the criterion instead? The decoder is only called inside SplitCrossEntropyLoss. I added the appropriate SplitCrossEntropyLoss to finetune.py, and it works as expected.
Lines 8-14 and 54-60 in getdata.sh both contain the same lines of code:
echo "- Downloading WikiText-2 (WT2)"
wget --quiet --continue https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip
unzip -q wikitext-2-v1.zip
cd wikitext-2
mv wiki.train.tokens train.txt
mv wiki.valid.tokens valid.txt
mv wiki.test.tokens test.txt
I think that lines 54-60 of getdata.sh can be deleted with no consequence.
python finetune.py --epochs 750 --data data/wikitext-2 --save WT2.pt --dropouth 0.2 --seed 1882
python pointer.py --save WT2.pt --lambdasm 0.1279 --theta 0.662 --window 3785 --bptt 2000 --data data/wikitext-2
Traceback (most recent call last):
File "finetune.py", line 183, in
stored_loss = evaluate(val_data)
File "finetune.py", line 108, in evaluate
model.eval()
Looks like model loading and more needs to be modified.
Also, I no longer get the reported perplexities in main.py: LSTM gets stuck around the 80s and QRNN around the 90s.
Why does this exist?
bptt = args.bptt if np.random.random() < 0.95 else args.bptt / 2.
Y'all already have...
seq_len = max(5, int(np.random.normal(bptt, 5)))
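For what it's worth, here is a minimal sketch of how the two lines interact (the base length, seed, and loop are illustrative, not the repo's values): the first line occasionally halves the base BPTT, and the second jitters around whichever base was chosen, so training still sees a range of sequence lengths rather than a fixed one.
import numpy as np

np.random.seed(0)
base_bptt = 70
for _ in range(5):
    # 5% of the time, use half the base length
    bptt = base_bptt if np.random.random() < 0.95 else base_bptt / 2.
    # jitter around that base with std 5, but never go below 5
    seq_len = max(5, int(np.random.normal(bptt, 5)))
    print(seq_len)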
I think there is a bug in the way ASGD is being triggered. Right now the code is
if args.optimizer == 'sgd' and 't0' not in optimizer.param_groups[0] and (len(best_val_loss)>args.nonmono and val_loss > min(best_val_loss[:-args.nonmono])):
I believe this should be
if args.optimizer == 'sgd' and 't0' not in optimizer.param_groups[0] and (len(best_val_loss)>args.nonmono and val_loss > min(best_val_loss[-args.nonmono:])):
with the difference being that we grab from the end of the best_val_loss list instead of the start.
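To make the slicing difference concrete, a tiny illustration (the loss values and nonmono setting are made up):
best_val_loss = [5.0, 4.6, 4.4, 4.3, 4.5, 4.6]
nonmono = 5

print(best_val_loss[:-nonmono])  # [5.0] -- everything except the last `nonmono` entries
print(best_val_loss[-nonmono:])  # [4.6, 4.4, 4.3, 4.5, 4.6] -- only the last `nonmono` entries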
Generate still appears to be broken.
Or is finetune necessary before running generate?
Perhaps I'm doing something wrong, but I have generated text with PyTorch's default word language model many times. I haven't dug into your code and assume it is correct... All help appreciated; I'd love to see what QRNN is capable of.
Trained using default settings for QRNN on WT2.
Exited training early.
Then called:
$ python -u generate.py --cuda --words=66 --checkpoint="WT2.pt" --model=QRNN --data=data/wikitext-2
Output:
| Generated 0/66 words
No text file generated.
Hi,
The folks over at pytorch are working on cutting a new 0.4 release. We'd like to make the transition as smooth as possible (if you were planning on upgrading), so we've been testing a number of community repos.
I ran a model and it errors out due to a change in pytorch. Minimal repro:
# Install pytorch-nightly (Currently our pre-release branch)
conda install pytorch-nightly -c pytorch
# Get data
./getdata.sh
# Run model
python main.py --batch_size 20 --data data/penn --dropouti 0.4 --dropouth 0.25 --seed 141 --epoch 1 && \
python -u main.py --model QRNN --batch_size 20 --clip 0.2 --wdrop 0.1 --nhid 1550 --nlayers 4 --emsize 400 --dropouth 0.3 --seed 9001 --dropouti 0.4 --epochs 1
Stack trace: https://gist.github.com/zou3519/142d48df1c03db9fe9c11717ad9a59f2
PyTorch 0.4 adds zero-dimensional tensors that cannot be iterated over, which seems to be what the error is complaining about. The line that appears to need changing is:
Line 8 in f2e8867
cc @soumith
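For reference, a hedged sketch of a 0.4-compatible repackage_hidden (this assumes the linked line is the recursive tuple(...) branch in utils.py; under 0.4 the old type(h) == Variable check no longer matches plain tensors, so the recursion eventually tries to iterate a zero-dimensional scalar):
import torch

def repackage_hidden(h):
    """Detach hidden states from their history (PyTorch 0.4 style)."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)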
I am getting the following warning:
UserWarning: RNN module weights are not part of a single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
I am using PyTorch 0.2.0 and Python 3.5.
Also, how do you compute the probability of a given sentence?
I started to suspect something was wrong when the generate.py script crashed. Then I was surprised to see that the line output, hidden = model(input, hidden) yields an output variable with the hidden size of the last recurrent layer, not the size of the vocabulary. So I took a further look into model.py and was surprised to see that self.decoder is not used at all!
If I have misunderstood something, correct me, but as it stands it seems that it should not work at all (at least if tie_weights is not used).
Line 137 in 32fcb42
For a word in a tail cluster, the probability of that word is p(C) * p(x=target|C), so its contribution to the cross entropy is -log(p(C) * p(x=target|C)) = -log p(C) - log p(x=target|C). We could just add the cross entropy over the head (including the tombstone tokens) and then compute the cross entropy on each tail, so there would be no need to pass head_entropy below.
I have never used Torch, but if I understand your code correctly, the last LSTM layer's hidden size is equal to the first layer's input size when tie_weights is true. But the decoder always takes the hidden size as its input size:
LSTM Layers
self.rnns = [torch.nn.LSTM(ninp if l == 0 else nhid, nhid if l != nlayers - 1 else (ninp if tie_weights else nhid), 1, dropout=0) for l in range(nlayers)]
Decoder
self.decoder = nn.Linear(nhid, ntoken)
There is a commented-out raise ValueError for the case where nhid differs from ninp when using tie_weights:
if tie_weights:
    #if nhid != ninp:
    #    raise ValueError('When using the tied flag, nhid must be equal to emsize')
    self.decoder.weight = self.encoder.weight
So when using tie_weights, should ninp be equal to nhid?
I don't understand why there is this restriction instead of just using ninp as the input size of the decoder when using tie_weights.
I hope you can clarify this for me.
Hi,
I have some questions about the "Regularizing and Optimizing LSTM Language Models" paper. In the second line of the first paragraph on page 10 of the paper, you mention that "using a monotonic criterion instead also hampered performance." I am not sure what you mean by "a monotonic criterion". Do you mean using AvSGD once the validation metric fails to improve?
In addition, I am also confused about the "dropouti" flag you used in the main.py script on line 39. The help text says "dropout for input embedding layers (0 = no dropout)"; is "dropouti" the dropout applied to the input layer, i.e. before the embedding layer?
Thank you so much.
Why do you apply embedding dropout on line 70 of model.py and then apply LockedDropout on line 73?
Don't both functions have the same functionality regarding dropout?
Is it equivalent to applying the embedding dropout with a higher rate?
many thanks
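For what it's worth, here is a minimal contrast of the two ideas (a sketch, not the repository's embed_regularize.py / locked_dropout.py code; sizes and rates are arbitrary): embedding dropout zeroes whole rows of the embedding matrix, so a dropped word type disappears everywhere it occurs, while LockedDropout samples one mask over the feature dimension and reuses it at every time step.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embed = nn.Embedding(10, 4)
words = torch.randint(0, 10, (5, 2))  # (seq_len=5, batch=2)

# Embedding dropout: drop entire rows of the weight matrix (whole word types).
keep = (torch.rand(embed.weight.size(0), 1) > 0.1).float() / (1 - 0.1)
emb_dropped = F.embedding(words, embed.weight * keep)

# LockedDropout: one mask over the embedding dimension, shared by all time steps.
emb = embed(words)
mask = (torch.rand(1, emb.size(1), emb.size(2)) > 0.4).float() / (1 - 0.4)
emb_locked = emb * mask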
I am trying to run the model on multiple GPUs. SplitCrossEntropyLoss probably causes some trouble; any hints?
File "main.py", line 209, in train
raw_loss = criterion(model.module.decoder.weight, model.module.decoder.bias, output, targets)
File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 114, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 124, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
raise output
File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 41, in _worker
output = module(*input, **kwargs)
File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/net/scratch/people/plgkwrobel/awd-lstm-lm/splitcross.py", line 115, in forward
split_targets, split_hiddens = self.split_on_targets(hiddens, targets)
File "/net/scratch/people/plgkwrobel/awd-lstm-lm/splitcross.py", line 103, in split_on_targets
split_hiddens.append(hiddens.masked_select(tmp_mask.unsqueeze(1).expand_as(hiddens)).view(-1, hiddens.size(1)))
File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/tensor.py", line 302, in expand_as
return self.expand(tensor.size())
RuntimeError: The expanded size of the tensor (280) must match the existing size (550) at non-singleton dimension 0
Hi,
First thanks for releasing this, it has been quite helpful.
It would be great if the README mentioned the dependency on pytorch-qrnn (for QRNN-based models) in the software requirements. Currently, following the instructions and running one of the standard QRNN models just throws a ModuleNotFoundError with no guidance. A prior mention and/or a try/except with a link to https://github.com/salesforce/pytorch-qrnn would help.
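As a sketch of the suggested guard (the import follows the pytorch-qrnn README; the wording of the error message is illustrative):
try:
    from torchqrnn import QRNNLayer
except ImportError:
    raise ImportError('QRNN models require the pytorch-qrnn package: '
                      'see https://github.com/salesforce/pytorch-qrnn')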
The behavior of the adaptive softmax is very unpredictable. Sometimes I can run through the whole code on dataset A on the first try, but get an error message when training on dataset B with the same format and schema. Then, if I switch back to dataset A, the code fails again. Here is the error message:
Traceback (most recent call last):
File "main.py", line 244, in
train()
File "main.py", line 208, in train
loss.backward()
File "/opt/conda/lib/python3.6/site-packages/torch/autograd/variable.py", line 167, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
File "/opt/conda/lib/python3.6/site-packages/torch/autograd/init.py", line 99, in backward
variables, grad_variables, retain_graph)
RuntimeError: invalid argument 3: Index tensor must have same dimensions as input tensor at /opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/THC/generic/THCTensorScatterGather.cu:199
This issue has blocked me for a long time. Please review it, thanks!
In the getdata.sh
script, lines 5 and 6 create an empty data
folder.
mkdir -p data
cd data
Lines 7 to 25 never mention a script named prep_enwik8.py.
Lines 26 to 31 then create an empty folder named enwik8 and magically find a Python script named prep_enwik8.py within that folder!
echo "- Downloading enwik8 (Character)"
mkdir -p enwik8
cd enwik8
wget --continue http://mattmahoney.net/dc/enwik8.zip
python prep_enwik8.py
cd ..
I am confused about which parameters are used during training.
Are they the ones averaged from time T, or just the normal SGD ones?
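For context, a hedged sketch of the pattern main.py uses around evaluation (simplified, not copied verbatim; model, optimizer, evaluate, and val_data are the repo's usual names): torch.optim.ASGD keeps the current SGD iterate in each parameter and a running average in optimizer.state[prm]['ax']; training steps continue on the current iterate, and the averaged weights can be swapped in temporarily for evaluation.
# swap the averaged parameters in for evaluation ...
tmp = {}
for prm in model.parameters():
    tmp[prm] = prm.data.clone()
    prm.data = optimizer.state[prm]['ax'].clone()

val_loss_avg = evaluate(val_data)

# ... and restore the raw SGD iterate before training continues
for prm in model.parameters():
    prm.data = tmp[prm].clone()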
Just removing layer_norm=True on line 30 of model.py is a workaround.
Hi there,
cc @Smerity
Thanks for sharing the code, first of all. I've been diving into the details and would really appreciate it if you could share some insight into the WeightDrop class' self._setup() method.
I have 2 questions:
Regarding the comment on the cuDNN RNN weight compacting issue (code here), could anyone expand on what exactly this issue is?
Why does the code delete parameters and register them again by calling register_parameter()? (code here)
Thanks.
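Regarding question 2, a simplified sketch of the delete/re-register pattern (my reading of weight_drop.py, reduced to a single weight name; not the exact code): the original Parameter is removed and re-registered under a _raw name, so that at forward time a dropped-out copy can be assigned back under the original name as a plain tensor, while gradients flow to the _raw Parameter.
import torch
import torch.nn as nn
import torch.nn.functional as F

gru = nn.GRU(10, 10)
name_w = 'weight_hh_l0'

w = getattr(gru, name_w)
del gru._parameters[name_w]                        # drop the original Parameter
gru.register_parameter(name_w + '_raw', nn.Parameter(w.data))

# at forward time: assign a dropped-out copy back under the original name
raw_w = getattr(gru, name_w + '_raw')
setattr(gru, name_w, F.dropout(raw_w, p=0.5, training=True))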
Locally, I fixed these issues and ran some very short experiments on 140 batches on wikitext-103 as follows:
Hi there,
I'm trying to load a trained language model from another project. Unfortunately, I'm not able to load it because doing so requires the definition of the model module. As far as I know, this is a known problem for PyTorch models saved using torch.save (as discussed here: pytorch/pytorch#3678). How do you deal with this problem?
Thank you in advance!
Alessandro
The finetune.py file looks to be the same as the main.py file. The paper does not cover the techniques used in the fine-tune stage. What are the important techniques that are able to increase performance?
From a trained word level PTB model that gets 58 test perplexity, generate.py seems to be producing relatively bad samples, even with a low temperature. Is this expected?
E.g.
such four once billion other assume memotec boston years portfolio thought sooner four fund have than down modest findings compound
making it makes york-based appear u.s. declining number western rate again medical where makes fields parts institute nov. n't indicate
mass. brief areas events died questionable replaced relatively vermont asbestos an one latest even cluett reported yield before have director
Hi guys! Thanks for sharing this awesome project.
I would like to share my weights, similar to the awd-lstm weights.
Is there any way I can do that?
My training was interrupted at epoch 150.
To continue python main.py training, I've added a new argument:
parser.add_argument('--load', type=str, default='',
                    help='path to load the final model')
and modified the model instantiation:
if not args.load:
    model = model.RNNModel(args.model, ntokens, args.emsize, args.nhid, args.nlayers, args.dropout, args.dropouth, args.dropouti, args.dropoute, args.wdrop, args.tied)
else:
    with open(args.load, 'rb') as f:
        model = torch.load(f)
Then run the training procedure:
python3 -u main.py --model QRNN --batch_size 20 --clip 0.2 --wdrop 0.1 --nhid 1550 --nlayers 4 --emsize 400 --dropouth 0.3 --seed 9001 --dropouti 0.4 --epochs 400 --save PTB.pt --load PTB.pt
Do the following logs look fine?
| end of epoch 1 | time: 103.37s | valid loss 4.19 | valid ppl 65.86
| end of epoch 2 | time: 107.36s | valid loss 4.20 | valid ppl 66.46
| end of epoch 3 | time: 105.37s | valid loss 4.19 | valid ppl 66.01
| end of epoch 4 | time: 106.24s | valid loss 4.20 | valid ppl 66.56
| end of epoch 5 | time: 101.58s | valid loss 4.20 | valid ppl 66.42
| end of epoch 6 | time: 102.41s | valid loss 4.19 | valid ppl 66.22
| end of epoch 7 | time: 104.01s | valid loss 4.19 | valid ppl 66.00
Switching!
| end of epoch 8 | time: 110.03s | valid loss 4.14 | valid ppl 62.92
| end of epoch 9 | time: 109.40s | valid loss 4.14 | valid ppl 62.67
| end of epoch 10 | time: 109.45s | valid loss 4.14 | valid ppl 62.52
| end of epoch 11 | time: 110.47s | valid loss 4.13 | valid ppl 62.39
| end of epoch 12 | time: 111.34s | valid loss 4.13 | valid ppl 62.30
| end of epoch 13 | time: 107.84s | valid loss 4.13 | valid ppl 62.25
Hey,
I was inspecting the weight drop (a variant of DropConnect) code and found it a bit confusing (https://github.com/salesforce/awd-lstm-lm/blob/master/weight_drop.py#L34):
for name_w in self.weights:
    raw_w = getattr(self.module, name_w + '_raw')
    w = None
    if self.variational:
        mask = torch.autograd.Variable(torch.ones(raw_w.size(0), 1))
        if raw_w.is_cuda: mask = mask.cuda()
        mask = torch.nn.functional.dropout(mask, p=self.dropout, training=True)
        w = mask.expand_as(raw_w) * raw_w
    else:
        w = torch.nn.functional.dropout(raw_w, p=self.dropout, training=self.training)
    setattr(self.module, name_w, w)
In every iteration the raw_w you get from name_w + '_raw' is the same, isn't it? Because you only setattr to name_w (e.g. weight_hh_l0) at the end. So every time the dropout mask operates on the same raw weight matrix...
Or maybe I just overlooked something. Can someone help me understand this?
Thanks!
In the 'An Analysis of Neural Language Modeling at Multiple Scales' paper, it states that the hierarchy of the words is determined by their frequency.
For some reason I can't find that in the code, neither in the dictionary nor in the corpus build.
It seems like the word ids are determined by the order of their occurrence.
Please point me to where that takes place.
many thanks
Hello guys, first thanks for sharing your code with us.
I have noticed a problem when running the fine-tune process, as I'm getting an error
RuntimeError: invalid argument 2: size '[-1 x 10000]' is invalid for input with 227500 elements at /pytorch/aten/src/TH/THStorage.c:37
which happens when
output_flat = output.view(-1, ntokens)
is called in the evaluate function of finetune.py.
After some investigation, I have found that the call to
decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
has been dropped from model.py.
I understand that SplitCrossEntropyLoss does this step for us during training but, given that the fine-tuning is done with regular cross entropy, shouldn't we include this line back in the code?
My apologies if I'm missing something!
When training has finished, how does the model assign a probability to a new sentence?
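As a generic sketch of one way to do this (assumptions: PyTorch 0.4+, the repo-style Corpus/Dictionary and model.init_hidden/model.decoder attributes, and that the model's forward returns undecoded hidden states as discussed in another issue above; this is not official code): score a sentence by summing the log-probabilities the model assigns to each next word.
import torch
import torch.nn.functional as F

def sentence_log_prob(model, corpus, words):
    model.eval()
    ids = torch.tensor([[corpus.dictionary.word2idx[w]] for w in words])  # (seq, batch=1)
    hidden = model.init_hidden(1)
    with torch.no_grad():
        output, hidden = model(ids[:-1], hidden)                  # hidden states per position
        logits = model.decoder(output.view(-1, output.size(-1)))  # project to vocabulary
        log_probs = F.log_softmax(logits, dim=-1)
        targets = ids[1:].view(-1)
        return log_probs.gather(1, targets.unsqueeze(1)).sum().item()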
I would like to perform a sanity check by passing some input to the model and reading the output text.
Following the PyTorch tutorial on language modelling (https://github.com/pytorch/examples/blob/master/word_language_model/generate.py), I have edited the evaluate function:
def evaluate(data_source, batch_size=10):
    # Turn on evaluation mode which disables dropout.
    if args.model == 'QRNN': model.reset()
    model.eval()
    total_loss = 0
    ntokens = len(corpus.dictionary)
    hidden = model.init_hidden(batch_size)
    for i in range(0, data_source.size(0) - 1, args.bptt):
        data, targets = get_batch(data_source, i, args, evaluation=True)
        print("inputs")
        inp = data.cpu().data.numpy()
        for input_ in inp:
            print([created_inverse_tokenizer_during_training[i] for i in input_])
        output, hidden = model(data, hidden)
        word_weights = output.squeeze().data.div(args.temperature).exp().cpu()
        word_idx = torch.multinomial(word_weights, 10)
        print("outputs")
        for word_ in word_idx:
            for item_ in word_:
                print("next word", created_inverse_tokenizer_during_training[item_])
        print("")
        output_flat = output.view(-1, ntokens)
        total_loss += len(data) * criterion(output_flat, targets).data
        hidden = repackage_hidden(hidden)
    return total_loss[0] / len(data_source)
where created_inverse_tokenizer_during_training is idx2word from the Dictionary class.
I am testing on the PTB dataset and I get the following with a perplexity of approximately 60:
inputs:
[made, value, $, their, intends, N, also, south, , or]
[much, criteria, N, office, to, return, closed, as, one, $]
[difference, devised, billion, visits, restrict, on, sharply, it, analyst, N]
[in, by, , as, the, assets, lower, became, peter, a]
[liquidity, benjamin, a, , rtc, for, across, more, , share]
[in, graham, , breaks, to, security, europe, clear, of, in]
[the, an, , , treasury, pacific, particularly, that, , the]
[pit, analyst, by, but, borrowings, and, in, a, &, fiscal]
[, and, an, massage, only, an, frankfurt, repeat, co., year]
[it, author, , no, unless, N, although, of, new, just]
["s", in, not, matter, the, N, london, the, york, ended]
[too, the, , how, agency, return, and, october, said, up]
[soon, 1930s, though, , receives, on, a, N, the, from]
[to, and, , is, specific, equity, few, crash, gold, $]
[tell, , , still, congressional, , other, was, market, N]
[but, who, english, associated, authorization, the, markets, "nt", already, million]
[people, is, butler, in, , loan, recovered, at, had, in]
[do, widely, in, many, such, growth, some, hand, some, fiscal]
["nt", considered, his, minds, agency, offset, ground, , good, N]
[seem, to, , with, , continuing, after, professionals, , and]
[to, be, proceeds, , borrowing, real-estate, stocks, dominated, technical, $]
[be, the, as, fronts, is, loan, began, municipal, factors, N]
[unhappy, father, if, for, unauthorized, losses, to, trading, that, million]
[with, of, the, , and, in, rebound, throughout, would, in]
[it, modern, realistic, and, expensive, the, in, the, have, N]
outputs:
[berlitz, hydro-quebec, banknote, centrust, gitano, cluett, guterman, aer, fromstein, calloway]
[berlitz, centrust, cluett, fromstein, aer, gitano, hydro-quebec, guterman, calloway, banknote]
[banknote, hydro-quebec, calloway, fromstein, berlitz, gitano, cluett, aer, guterman, centrust]
[calloway, berlitz, cluett, centrust, aer, gitano, hydro-quebec, banknote, guterman, fromstein]
[fromstein, hydro-quebec, aer, banknote, gitano, berlitz, calloway, cluett, centrust, guterman]
[calloway, hydro-quebec, guterman, fromstein, berlitz, banknote, cluett, centrust, gitano, aer]
[gitano, fromstein, hydro-quebec, cluett, calloway, centrust, berlitz, guterman, aer, banknote]
[berlitz, gitano, banknote, cluett, calloway, aer, centrust, fromstein, hydro-quebec, guterman]
[calloway, gitano, guterman, berlitz, centrust, hydro-quebec, cluett, aer, fromstein, banknote]
[hydro-quebec, berlitz, fromstein, gitano, cluett, calloway, aer, centrust, guterman, banknote]
[aer, cluett, fromstein, berlitz, guterman, calloway, hydro-quebec, centrust, banknote, gitano]
[cluett, calloway, centrust, fromstein, banknote, gitano, guterman, hydro-quebec, aer, berlitz]
[hydro-quebec, fromstein, calloway, aer, banknote, berlitz, cluett, gitano, centrust, guterman]
[banknote, gitano, aer, centrust, cluett, fromstein, calloway, guterman, hydro-quebec, berlitz]
[calloway, aer, gitano, berlitz, fromstein, cluett, guterman, banknote, hydro-quebec, centrust]
[banknote, cluett, fromstein, berlitz, gitano, aer, centrust, calloway, hydro-quebec, guterman]
[cluett, fromstein, aer, calloway, guterman, banknote, berlitz, gitano, centrust, hydro-quebec]
[aer, guterman, berlitz, gitano, centrust, cluett, calloway, hydro-quebec, fromstein, banknote]
[centrust, fromstein, cluett, berlitz, aer, banknote, guterman, gitano, calloway, hydro-quebec]
[guterman, banknote, fromstein, cluett, gitano, calloway, aer, centrust, berlitz, hydro-quebec]
[calloway, berlitz, aer, banknote, hydro-quebec, fromstein, cluett, guterman, gitano, centrust]
[banknote, hydro-quebec, berlitz, fromstein, guterman, calloway, cluett, centrust, gitano, aer]
[centrust, aer, fromstein, cluett, hydro-quebec, calloway, gitano, berlitz, guterman, banknote]
[fromstein, centrust, aer, banknote, berlitz, guterman, gitano, hydro-quebec, calloway, cluett]
[cluett, banknote, hydro-quebec, gitano, berlitz, fromstein, calloway, guterman, centrust, aer]
As you can see, the number of unique words in the output is rather small. Why is that? Or am I doing it wrong?
In specifying command-line arguments in main.py and finetune.py, the --cuda and --tied flags are True by default and become False when specified. Is this intentional? It seems counter-intuitive. Does this have any bearing on the results in your paper, Regularizing and Optimizing LSTM Language Models?
Line 45 in bf0742c
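For context, this behaviour matches the standard argparse store_false pattern (a generic sketch, not the repo's exact lines): the flag defaults to True and passing it on the command line turns the option off.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--cuda', action='store_false')   # defaults to True
parser.add_argument('--tied', action='store_false')   # defaults to True

print(parser.parse_args([]))          # Namespace(cuda=True, tied=True)
print(parser.parse_args(['--cuda']))  # Namespace(cuda=False, tied=True)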
The original codebase was written to be run on PyTorch 0.1.12_2. Updating the codebase to work on PyTorch 0.2 requires a number of steps, including modifying WeightDrop
and others.
Best would be to provide two branches - one with the current PyTorch 0.1.12_2 codebase (allowing for exact result replication) and a second branch that is updated to allow for PyTorch 0.2.
After training using Word level WikiText-103 (PTB) with QRNN
Try to finetune:
File "finetune.py", line 107, in evaluate
if args.model == 'QRNN': model.reset()
AttributeError: 'list' object has no attribute 'reset'
Try to generate:
File "generate.py", line 51, in <module>
model.eval()
AttributeError: 'list' object has no attribute 'eval'
All help appreciated.
I was looking into the data.py and saw that the dictionary consists of all tokens in train, val, and test files. I'm wondering if adding unseen tokens in val/test files to the dictionary will affect the testing in any way? Thanks!
The NT-ASGD algorithm compares the validation loss value with the previous n loss values, but I think main.py compares the validation loss value with the loss values from the 1st epoch up to epoch t-n, because of line 222:
if 't0' not in optimizer.param_groups[0] and (len(best_val_loss)>args.nonmono and val_loss > min(best_val_loss[:-args.nonmono])):
If the line is revised to
if 't0' not in optimizer.param_groups[0] and (len(best_val_loss)>args.nonmono and val_loss > min(best_val_loss[-args.nonmono:])):
the line is consistent with the NT-ASGD described in the research paper.
However, if we use that line, the code starts averaging immediately after the validation metric worsens.
So, what about using the following line instead?
if 't0' not in optimizer.param_groups[0] and (len(best_val_loss)>args.nonmono and val_loss > max(best_val_loss[-args.nonmono:])):
Hi, I am currently studying averaging methods for optimization.
I read your paper 'Regularizing and Optimizing LSTM Language Models' and am trying to follow your experiment, but only on PTB. I have a few questions about the source code.
In your source code main.py, at line 276, you use the condition "'t0' not in optimizer.param_groups[0]". I cannot understand this condition at all. What does it mean?
On the same line, there is the condition "len(best_val_loss)>args.nonmono and val_loss > min(best_val_loss[:-args.nonmono])".
Does this mean "after args.nonmono logging intervals L" and "the validation loss of the current epoch is bigger than the previous args.nonmono logging intervals L"?
After changing from SGD to ASGD, how does the program keep updating the parameters?
Does it update the parameters with SGD until the last epoch and return the averaged parameters at the end,
or
does it update the parameters by averaging every iteration, epoch, or some interval?
This is in the same context as Q3: after the program switches the optimizer to ASGD, the validation PPL/BPC stops changing but the training PPL/BPC keeps changing. Why does that happen?
Is there any averaging stop criterion in this program? If so, what is it?
Is there any training stop criterion other than the maximum number of epochs?
Why did you choose 750 epochs as the maximum? Is it just because you thought it was large enough?
Hi! First of all, thanks for your code. I am currently studying your paper "Regularizing and Optimizing LSTM Language Models".
I want to compare the Adam optimizer and the SGD optimizer with the NT-ASGD scheme you proposed.
I tried your command, with some additions, and your Python code:
"python main.py --batch_size 20 --data data/penn --dropouti 0.4 --dropouth 0.25 --seed 141 --epoch 500 --save SGD_PTB.pt --optimizer sgd"
"python main.py --batch_size 20 --data data/penn --dropouti 0.4 --dropouth 0.25 --seed 141 --epoch 500 --save Adam_PTB.pt --optimizer adam"
The thing is that the first command works well, but the second command runs without producing valid loss, ppl, or bpc values (they are all nan). I copied the log below. Please give me any possible solution for this if you don't mind.
| epoch 17 | 200/ 663 batches | lr 30.00000 | ms/batch 67.21 | loss nan | ppl nan | bpc nan
| epoch 17 | 400/ 663 batches | lr 30.00000 | ms/batch 65.92 | loss nan | ppl nan | bpc nan
| epoch 17 | 600/ 663 batches | lr 30.00000 | ms/batch 66.32 | loss nan | ppl nan | bpc nan
Failed on pytorch 0.2.0_1
python3.6 main.py --batch_size 20 --data data/penn --dropouti 0.4 --seed 28 --epoch 300 --save PTB.pt
[LSTM(400, 1150, dropout=0.3), LSTM(1150, 1150, dropout=0.3), LSTM(1150, 400, dropout=0.3)]
Applying weight drop of 0.5 to weight_hh_l0
Applying weight drop of 0.5 to weight_hh_l0
Applying weight drop of 0.5 to weight_hh_l0
Traceback (most recent call last):
File "main.py", line 94, in <module>
model.cuda()
File "/data1/XXXX/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 147, in cuda
return self._apply(lambda t: t.cuda(device_id))
File "/data1/XXXX/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 118, in _apply
module._apply(fn)
File "/data1/XXXX/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 118, in _apply
module._apply(fn)
File "/data1/XXXX/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 118, in _apply
module._apply(fn)
File "/data1/XXXX/.local/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 116, in _apply
self.flatten_parameters()
File "/data1/XXXX/.local/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 104, in flatten_parameters
all_weights = [[p.data for p in l] for l in self.all_weights]
File "/data1/XXXX/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 262, in __getattr__
type(self).__name__, name))
AttributeError: 'LSTM' object has no attribute 'all_weights'
I've trained a QRNN, but when I try to use generate.py with it, I get the following:
File "generate.py", line 68, in <module>
output, hidden = model(input, hidden)
File "/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/awd-lstm-lm/model.py", line 82, in forward
raw_output, new_h = rnn(raw_output, hidden[l])
File "/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
result = self.forward(*input, **kwargs)
File "/miniconda3/lib/python3.6/site-packages/torchqrnn/qrnn.py", line 60, in forward
Xm1 = [self.prevX if self.prevX is not None else X[:1, :, :] * 0, X[:-1, :, :]]
File "/miniconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 76, in __getitem__
return Index.apply(self, key)
File "/miniconda3/lib/python3.6/site-packages/torch/autograd/_functions/tensor.py", line 16, in forward
result = i.index(ctx.index)
ValueError: result of slicing is an empty tensor
I am trying to train a QRNN model on one dataset and then finetune it on another, but I get the following error when I try to finetune on a different dataset than the one the model was initially trained on:
raw_loss = criterion(output.view(-1, ntokens), targets) RuntimeError: invalid argument 2: size '[-1 x 8967]' is invalid for input with 14600000 elements at /pytorch/torch/lib/TH/THStorage.c:37
Would you be able to explain what this error means?
Do I need to pop the last layer and substitute it with a new Linear layer with the needed number of classes?
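The mismatch is a vocabulary-size mismatch: the saved embedding and decoder were sized for the original corpus, while the new corpus has a different number of tokens (8967 here). A hedged sketch of swapping them out (identifiers like new_corpus/old_corpus are illustrative, and this assumes the default tied-weights setup where the decoder input size equals the embedding size):
import torch.nn as nn

new_ntokens = len(new_corpus.dictionary)
emsize = model.encoder.weight.size(1)

new_encoder = nn.Embedding(new_ntokens, emsize)
new_decoder = nn.Linear(emsize, new_ntokens)

# optionally copy embeddings for words shared between the two vocabularies
for word, new_idx in new_corpus.dictionary.word2idx.items():
    if word in old_corpus.dictionary.word2idx:
        old_idx = old_corpus.dictionary.word2idx[word]
        new_encoder.weight.data[new_idx] = model.encoder.weight.data[old_idx]

model.encoder = new_encoder
model.decoder = new_decoder
model.decoder.weight = model.encoder.weight  # re-tie the weights if tying was used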
The URLs used to download the Penn Treebank data in getdata.sh are no longer available, hence they return 404 errors.
Hi, I found that line 80 in model.py is useless and can be removed. Could you check that? Thanks!
I extended the code to multi-GPU training, but GPU usage is extremely imbalanced. The root cause is that we collect all outputs back and calculate the loss on one GPU. I tried to put the loss calculation inside model.forward() as follows:
class RNNModel(nn.Module):
    def __init__(self, ...):
        super(RNNModel, self).__init__()
        from splitcross import SplitCrossEntropyLoss
        splits = [2800, 20000, 76000]
        self.criterion = SplitCrossEntropyLoss(ninp, splits=splits, verbose=False)
        ...

    def forward(self, ...):
        ...
        result = output
        # calculate loss
        result = result.view(result.size(0)*result.size(1), -1)
        raw_loss = self.criterion(decoder_weight, decoder_bias, result, target)
        loss = raw_loss
        # activation regularization
        if args.alpha:
            loss = loss + sum(args.alpha * dropped_rnn_h.pow(2).mean() for dropped_rnn_h in outputs[-1:])
        # Temporal Activation Regularization (slowness)
        if args.beta:
            loss = loss + sum(args.beta * (rnn_h[1:] - rnn_h[:-1]).pow(2).mean() for rnn_h in raw_outputs[-1:])
        # expand loss to two dimensions so it can be gathered over the second dimension
        loss = loss.unsqueeze(1)
        raw_loss = raw_loss.unsqueeze(1)
        if return_h:
            return raw_loss, loss, hidden, raw_outputs, outputs
        return raw_loss, loss, hidden
Then, in my main.py, I collect the losses and use loss.mean().backward() to update the parameters. The interesting thing is that the first call to loss.mean().backward() succeeds, but the second fails with the error:
RuntimeError: invalid argument 3: Index tensor must have same dimensions as input tensor at
/pytorch/torch/lib/THC/generic/THCTensorScatterGather.cu:199
Can anyone help?
Thanks in advance!
Hi, I think I just found an error in main.py.
On line 252, should val_loss be val_loss2?
Because after the program switches to ASGD mode, it no longer calculates val_loss, so the log shows no change in the validation results after switching to ASGD.
Any ideas on how to incorporate attention model from http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html ?
I copy/pasted the command to train the model, but get the error below.
$python34 C:/Users/dat/Desktop/awd-lstm-lm/main.py --batch_size 20 --data C:/Users/dat/Desktop/awd-lstm-lm/data/penn --dropouti 0.4 --dropouth 0.25 --seed 141 --epoch 500 --save C:/Users/dat/Desktop/awd-lstm-lm/PTB.pickle
Applying weight drop of 0.5 to weight_hh_l0
Applying weight drop of 0.5 to weight_hh_l0
Applying weight drop of 0.5 to weight_hh_l0
[WeightDrop (
(module): LSTM(400, 1150)
), WeightDrop (
(module): LSTM(1150, 1150)
), WeightDrop (
(module): LSTM(1150, 400)
)]
Args: Namespace(alpha=2, batch_size=20, beta=1, bptt=70, clip=0.25, cuda=True, data='C:/Users/dat/Desktop/awd-lstm-lm/data/penn', dropout=0.4, dropoute=0.1, dropouth=0.25, dropouti=0.4, emsize=400, epochs=500, log_interval=200, lr=30, model='LSTM', nhid=1150, nlayers=3, nonmono=5, save='C:/Users/dat/Desktop/awd-lstm-lm/PTB.pickle', seed=141, tied=True, wdecay=1.2e-06, wdrop=0.5)
Model total parameters: 24221600
Traceback (most recent call last):
File "C:/Users/dat/Desktop/awd-lstm-lm/main.py", line 185, in
train()
File "C:/Users/dat/Desktop/awd-lstm-lm/main.py", line 146, in train
output, hidden, rnn_hs, dropped_rnn_hs = model(data, hidden, return_h=True)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 206, in call
result = self.forward(*input, **kwargs)
File "C:\Users\dat\Desktop\awd-lstm-lm\model.py", line 70, in forward
emb = embedded_dropout(self.encoder, input, dropout=self.dropoute if self.training else 0)
File "C:\Users\dat\Desktop\awd-lstm-lm\embed_regularize.py", line 21, in embedded_dropout
embed.scale_grad_by_freq, embed.sparse
TypeError: forward() takes 3 positional arguments but 8 were given
Hi, training crashed with an out-of-memory error on a Titan X 12GB with the char-LSTM on enwik8.
The trick about reducing the "cap" on sequence length links to a 404 URL: could you please let me know where I can do that?
Thanks a lot for the great code!