salesforce / awd-lstm-lm
LSTM and QRNN Language Model Toolkit for PyTorch
License: BSD 3-Clause "New" or "Revised" License
@Smerity I tried the following code:
import torch
import torch.nn as nn
from weight_drop import WeightDrop  # WeightDrop from this repo's weight_drop.py

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.gru = nn.GRU(10, 10)
        self.gru = WeightDrop(self.gru, ['weight_hh_l0'], dropout=0.5)

m = Model()
# The following operation throws an error
m.cuda()
Can you take a look and see where the problem occurs? I am using PyTorch 0.2.
I made some modifications to the codebase, so this might be a problem on my end... but does finetune.py require SplitCrossEntropyLoss to be used as the criterion instead? The decoder is only called inside SplitCrossEntropyLoss. I added the appropriate SplitCrossEntropyLoss to finetune.py, and it works as expected.
Lines 8-14 and 54-60 in getdata.sh both contain the same lines of code:
echo "- Downloading WikiText-2 (WT2)"
wget --quiet --continue https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip
unzip -q wikitext-2-v1.zip
cd wikitext-2
mv wiki.train.tokens train.txt
mv wiki.valid.tokens valid.txt
mv wiki.test.tokens test.txt
I think that lines 54-60 of getdata.sh can be deleted with no consequence.
python finetune.py --epochs 750 --data data/wikitext-2 --save WT2.pt --dropouth 0.2 --seed 1882
python pointer.py --save WT2.pt --lambdasm 0.1279 --theta 0.662 --window 3785 --bptt 2000 --data data/wikitext-2
Traceback (most recent call last):
File "finetune.py", line 183, in
stored_loss = evaluate(val_data)
File "finetune.py", line 108, in evaluate
model.eval()
Looks like model loading and more needs to be modified.
Also, I no longer get the reported perplexities in main.py: LSTM gets stuck around the 80s and QRNN around the 90s.
Why does this exist?
bptt = args.bptt if np.random.random() < 0.95 else args.bptt / 2.
Y'all already have...
seq_len = max(5, int(np.random.normal(bptt, 5)))
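For what it's worth, here is a minimal sketch of how the two lines interact (the base length, seed, and loop are illustrative, not the repo's values): the first line occasionally halves the base BPTT, and the second jitters around whichever base was chosen, so training still sees a range of sequence lengths rather than a fixed one.
import numpy as np

np.random.seed(0)
base_bptt = 70
for _ in range(5):
    # 5% of the time, use half the base length
    bptt = base_bptt if np.random.random() < 0.95 else base_bptt / 2.
    # jitter around that base with std 5, but never go below 5
    seq_len = max(5, int(np.random.normal(bptt, 5)))
    print(seq_len)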
I think there is a bug in the way ASGD is being triggered. Right now the code is
if args.optimizer == 'sgd' and 't0' not in optimizer.param_groups[0] and (len(best_val_loss)>args.nonmono and val_loss > min(best_val_loss[:-args.nonmono])):
I believe this should be
if args.optimizer == 'sgd' and 't0' not in optimizer.param_groups[0] and (len(best_val_loss)>args.nonmono and val_loss > min(best_val_loss[-args.nonmono:])):
with the difference being that we grab from the end of the best_val_loss list instead of the start.
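To make the slicing difference concrete, a tiny illustration (the loss values and nonmono setting are made up):
best_val_loss = [5.0, 4.6, 4.4, 4.3, 4.5, 4.6]
nonmono = 5

print(best_val_loss[:-nonmono])  # [5.0] -- everything except the last `nonmono` entries
print(best_val_loss[-nonmono:])  # [4.6, 4.4, 4.3, 4.5, 4.6] -- only the last `nonmono` entries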
Generate still appears to be broken.
Or is finetune necessary before running generate?
Perhaps I'm doing something wrong, but I have generated text with PyTorch's default word language model many times. I haven't dug into your code and assume it is correct... All help appreciated; I'd love to see what QRNN is capable of.
Trained using default settings for QRNN on WT2.
Exited training early.
Then called:
$ python -u generate.py --cuda --words=66 --checkpoint="WT2.pt" --model=QRNN --data=data/wikitext-2
Output:
| Generated 0/66 words
No text file generated.
Hi,
The folks over at pytorch are working on cutting a new 0.4 release. We'd like to make the transition as smooth as possible (if you were planning on upgrading), so we've been testing a number of community repos.
I ran a model and it errors out due to a change in pytorch. Minimal repro:
# Install pytorch-nightly (Currently our pre-release branch)
conda install pytorch-nightly -c pytorch
# Get data
./getdata.sh
# Run model
python main.py --batch_size 20 --data data/penn --dropouti 0.4 --dropouth 0.25 --seed 141 --epoch 1 && \
python -u main.py --model QRNN --batch_size 20 --clip 0.2 --wdrop 0.1 --nhid 1550 --nlayers 4 --emsize 400 --dropouth 0.3 --seed 9001 --dropouti 0.4 --epochs 1
Stack trace: https://gist.github.com/zou3519/142d48df1c03db9fe9c11717ad9a59f2
PyTorch 0.4 adds zero-dimensional tensors that cannot be iterated over, which seems to be what the error is complaining about. The line that appears to need changing is:
Line 8 in f2e8867
cc @soumith
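For reference, a hedged sketch of a 0.4-compatible repackage_hidden (this assumes the linked line is the recursive tuple(...) branch in utils.py; under 0.4 the old type(h) == Variable check no longer matches plain tensors, so the recursion eventually tries to iterate a zero-dimensional scalar):
import torch

def repackage_hidden(h):
    """Detach hidden states from their history (PyTorch 0.4 style)."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)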
I am getting the following warning:
UserWarning: RNN module weights are not part of a single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
I am using PyTorch 0.2.0 and Python 3.5.
Also, how do you compute the probability of a given sentence?
I started to suspect something was wrong when the generate.py script crashed. Then I was surprised to see that the line output, hidden = model(input, hidden) yields an output variable with the hidden size of the last recurrent layer, not the size of the vocabulary. So I took a further look into model.py and was surprised to see that self.decoder is not used at all!
If I have misunderstood something, correct me, but as it stands it seems that it should not work at all (at least if tie_weights is not used).
Line 137 in 32fcb42
For a word in a tail cluster, the probability of that word is p(C) * p(x=target|C), so its contribution to the cross entropy is -log(p(C) * p(x=target|C)) = -log p(C) - log p(x=target|C). We could just add the cross entropy over the head (including the tombstone tokens) and then compute the cross entropy on each tail, so there would be no need to pass head_entropy below.
I have never used Torch, but if I understand your code correctly, the last LSTM layer's hidden size is equal to the first layer's input size when tie_weights is true. But the decoder always takes the hidden size as its input size:
LSTM Layers
self.rnns = [torch.nn.LSTM(ninp if l == 0 else nhid, nhid if l != nlayers - 1 else (ninp if tie_weights else nhid), 1, dropout=0) for l in range(nlayers)]
Decoder
self.decoder = nn.Linear(nhid, ntoken)
There is a commented-out raise ValueError for the case where nhid differs from ninp when using tie_weights:
if tie_weights:
    #if nhid != ninp:
    #    raise ValueError('When using the tied flag, nhid must be equal to emsize')
    self.decoder.weight = self.encoder.weight
So when using tie_weights, should ninp be equal to nhid?
I don't understand why there is this restriction instead of just using ninp as the input size of the decoder when using tie_weights.
I hope you can clarify this for me.
Hi,
I have some questions about the "Regularizing and Optimizing LSTM Language Models" paper. In the second line of the first paragraph on page 10 of the paper, you mention that "using a monotonic criterion instead also hampered performance." I am not sure what you mean by "a monotonic criterion". Do you mean using AvSGD once the validation metric fails to improve?
In addition, I am also confused about the "dropouti" flag you used in the main.py script on line 39. The help text says "dropout for input embedding layers (0 = no dropout)"; is "dropouti" the dropout applied to the input layer, i.e. before the embedding layer?
Thank you so much.
Why do you apply embedding dropout on line 70 of model.py and then apply LockedDropout on line 73?
Don't both functions have the same functionality regarding dropout?
Is it equivalent to applying the embedding dropout with a higher rate?
many thanks
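For what it's worth, here is a minimal contrast of the two ideas (a sketch, not the repository's embed_regularize.py / locked_dropout.py code; sizes and rates are arbitrary): embedding dropout zeroes whole rows of the embedding matrix, so a dropped word type disappears everywhere it occurs, while LockedDropout samples one mask over the feature dimension and reuses it at every time step.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embed = nn.Embedding(10, 4)
words = torch.randint(0, 10, (5, 2))  # (seq_len=5, batch=2)

# Embedding dropout: drop entire rows of the weight matrix (whole word types).
keep = (torch.rand(embed.weight.size(0), 1) > 0.1).float() / (1 - 0.1)
emb_dropped = F.embedding(words, embed.weight * keep)

# LockedDropout: one mask over the embedding dimension, shared by all time steps.
emb = embed(words)
mask = (torch.rand(1, emb.size(1), emb.size(2)) > 0.4).float() / (1 - 0.4)
emb_locked = emb * mask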
I am trying to run the model on multiple GPUs. SplitCrossEntropyLoss probably causes some trouble; any hints?
File "main.py", line 209, in train
raw_loss = criterion(model.module.decoder.weight, model.module.decoder.bias, output, targets)
File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 114, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 124, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
raise output
File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 41, in _worker
output = module(*input, **kwargs)
File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/net/scratch/people/plgkwrobel/awd-lstm-lm/splitcross.py", line 115, in forward
split_targets, split_hiddens = self.split_on_targets(hiddens, targets)
File "/net/scratch/people/plgkwrobel/awd-lstm-lm/splitcross.py", line 103, in split_on_targets
split_hiddens.append(hiddens.masked_select(tmp_mask.unsqueeze(1).expand_as(hiddens)).view(-1, hiddens.size(1)))
File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/tensor.py", line 302, in expand_as
return self.expand(tensor.size())
RuntimeError: The expanded size of the tensor (280) must match the existing size (550) at non-singleton dimension 0
Hi,
First thanks for releasing this, it has been quite helpful.
It would be great if the README mentioned the dependency on pytorch-qrnn (for QRNN-based models) in the software requirements. Currently, following the instructions and running one of the standard QRNN models just throws a ModuleNotFoundError with no guidance. A prior mention and/or a try/except with a link to https://github.com/salesforce/pytorch-qrnn would help.
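As a sketch of the suggested guard (the import follows the pytorch-qrnn README; the wording of the error message is illustrative):
try:
    from torchqrnn import QRNNLayer
except ImportError:
    raise ImportError('QRNN models require the pytorch-qrnn package: '
                      'see https://github.com/salesforce/pytorch-qrnn')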
The behavior of the adaptive softmax is very unpredictable. Sometimes I can run through the whole code on dataset A on the first try, but get an error message when training on dataset B with the same format and schema. Then, if I switch back to dataset A, the code fails again. Here is the error message:
Traceback (most recent call last):
File "main.py", line 244, in
train()
File "main.py", line 208, in train
loss.backward()
File "/opt/conda/lib/python3.6/site-packages/torch/autograd/variable.py", line 167, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
File "/opt/conda/lib/python3.6/site-packages/torch/autograd/init.py", line 99, in backward
variables, grad_variables, retain_graph)
RuntimeError: invalid argument 3: Index tensor must have same dimensions as input tensor at /opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/THC/generic/THCTensorScatterGather.cu:199
This issue has blocked me for a long time. Please review it, thanks!
In the getdata.sh
script, lines 5 and 6 create an empty data
folder.
mkdir -p data
cd data
Lines 7 to 25 never mention a script named prep_enwik8.py.
Lines 26 to 31 then create an empty folder named enwik8 and magically find a Python script named prep_enwik8.py within that folder!
echo "- Downloading enwik8 (Character)"
mkdir -p enwik8
cd enwik8
wget --continue http://mattmahoney.net/dc/enwik8.zip
python prep_enwik8.py
cd ..
I am confused about which parameters are used during training.
Are they the ones averaged from time T, or just the normal SGD ones?
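For context, a hedged sketch of the pattern main.py uses around evaluation (simplified, not copied verbatim; model, optimizer, evaluate, and val_data are the repo's usual names): torch.optim.ASGD keeps the current SGD iterate in each parameter and a running average in optimizer.state[prm]['ax']; training steps continue on the current iterate, and the averaged weights can be swapped in temporarily for evaluation.
# swap the averaged parameters in for evaluation ...
tmp = {}
for prm in model.parameters():
    tmp[prm] = prm.data.clone()
    prm.data = optimizer.state[prm]['ax'].clone()

val_loss_avg = evaluate(val_data)

# ... and restore the raw SGD iterate before training continues
for prm in model.parameters():
    prm.data = tmp[prm].clone()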
Just removing layer_norm=True on line 30 of model.py is a workaround.
Hi there,
cc @Smerity
Thanks for sharing the code, first of all. I've been diving into the details and would really appreciate it if you could share some insight into the WeightDrop class' self._setup() method.
I have 2 questions:
Regarding the comment on the cuDNN RNN weight compacting issue (code here), could anyone expand on what exactly this issue is?
Why does the code delete parameters and register them again by calling register_parameter()? (code here)
Thanks.
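Regarding question 2, a simplified sketch of the delete/re-register pattern (my reading of weight_drop.py, reduced to a single weight name; not the exact code): the original Parameter is removed and re-registered under a _raw name, so that at forward time a dropped-out copy can be assigned back under the original name as a plain tensor, while gradients flow to the _raw Parameter.
import torch
import torch.nn as nn
import torch.nn.functional as F

gru = nn.GRU(10, 10)
name_w = 'weight_hh_l0'

w = getattr(gru, name_w)
del gru._parameters[name_w]                        # drop the original Parameter
gru.register_parameter(name_w + '_raw', nn.Parameter(w.data))

# at forward time: assign a dropped-out copy back under the original name
raw_w = getattr(gru, name_w + '_raw')
setattr(gru, name_w, F.dropout(raw_w, p=0.5, training=True))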
Locally, I fixed these issues and ran some very short experiments on 140 batches on wikitext-103 as follows:
Hi there,
I'm trying to load a trained language model from another project. Unfortunately, I'm not able to load it because doing so requires the definition of the model module. As far as I know, this is a known problem for PyTorch models saved using torch.save (as discussed here: pytorch/pytorch#3678). How do you deal with this problem?
Thank you in advance!
Alessandro
The finetune.py file looks to be the same as the main.py file. The paper does not cover the techniques used in the fine-tune stage. What are the important techniques that are able to increase performance?
From a trained word level PTB model that gets 58 test perplexity, generate.py seems to be producing relatively bad samples, even with a low temperature. Is this expected?
E.g.
such four once billion other assume memotec boston years portfolio thought sooner four fund have than down modest findings compound
making it makes york-based appear u.s. declining number western rate again medical where makes fields parts institute nov. n't indicate
mass. brief areas events died questionable replaced relatively vermont asbestos an one latest even cluett reported yield before have director
Hi guys! Thanks for sharing this awesome project.
I would like to share my weights, similar to the awd-lstm weights.
Is there any way I can do that?
My training was interrupted at epoch 150.
To continue python main.py training, I've added a new argument:
parser.add_argument('--load', type=str, default='',
                    help='path to load the final model')
and modified the model instantiation:
if not args.load:
    model = model.RNNModel(args.model, ntokens, args.emsize, args.nhid, args.nlayers, args.dropout, args.dropouth, args.dropouti, args.dropoute, args.wdrop, args.tied)
else:
    with open(args.load, 'rb') as f:
        model = torch.load(f)
Then run the training procedure:
python3 -u main.py --model QRNN --batch_size 20 --clip 0.2 --wdrop 0.1 --nhid 1550 --nlayers 4 --emsize 400 --dropouth 0.3 --seed 9001 --dropouti 0.4 --epochs 400 --save PTB.pt --load PTB.pt
Do the following logs look fine?
| end of epoch 1 | time: 103.37s | valid loss 4.19 | valid ppl 65.86
| end of epoch 2 | time: 107.36s | valid loss 4.20 | valid ppl 66.46
| end of epoch 3 | time: 105.37s | valid loss 4.19 | valid ppl 66.01
| end of epoch 4 | time: 106.24s | valid loss 4.20 | valid ppl 66.56
| end of epoch 5 | time: 101.58s | valid loss 4.20 | valid ppl 66.42
| end of epoch 6 | time: 102.41s | valid loss 4.19 | valid ppl 66.22
| end of epoch 7 | time: 104.01s | valid loss 4.19 | valid ppl 66.00
Switching!
| end of epoch 8 | time: 110.03s | valid loss 4.14 | valid ppl 62.92
| end of epoch 9 | time: 109.40s | valid loss 4.14 | valid ppl 62.67
| end of epoch 10 | time: 109.45s | valid loss 4.14 | valid ppl 62.52
| end of epoch 11 | time: 110.47s | valid loss 4.13 | valid ppl 62.39
| end of epoch 12 | time: 111.34s | valid loss 4.13 | valid ppl 62.30
| end of epoch 13 | time: 107.84s | valid loss 4.13 | valid ppl 62.25
Hey,
I was inspecting the weight drop (a variant of DropConnect) code and found it a bit confusing (https://github.com/salesforce/awd-lstm-lm/blob/master/weight_drop.py#L34):
for name_w in self.weights:
    raw_w = getattr(self.module, name_w + '_raw')
    w = None
    if self.variational:
        mask = torch.autograd.Variable(torch.ones(raw_w.size(0), 1))
        if raw_w.is_cuda: mask = mask.cuda()
        mask = torch.nn.functional.dropout(mask, p=self.dropout, training=True)
        w = mask.expand_as(raw_w) * raw_w
    else:
        w = torch.nn.functional.dropout(raw_w, p=self.dropout, training=self.training)
    setattr(self.module, name_w, w)
In every iteration the raw_w you get from name_w + '_raw' is the same, isn't it? Because you only setattr to name_w (e.g. weight_hh_l0) at the end. So every time the dropout mask operates on the same raw weight matrix...
Or maybe I just overlooked something. Can someone help me understand this?
Thanks!
In the 'An Analysis of Neural Language Modeling at Multiple Scales' paper, it states that the hierarchy of the words is determined by their frequency.
For some reason I can't find that in the code, neither in the dictionary nor in the corpus build.
It seems like the word ids are determined by the order of their occurrence.
Please point me to where that takes place.
many thanks
Hello guys, first thanks for sharing your code with us.
I have noticed a problem when running the fine-tune process, as I'm getting an error
RuntimeError: invalid argument 2: size '[-1 x 10000]' is invalid for input with 227500 elements at /pytorch/aten/src/TH/THStorage.c:37
which happens when
output_flat = output.view(-1, ntokens)
is called in the evaluate function of finetune.py.
After some investigation, I have found that the call to
decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
has been dropped from model.py.
I understand that SplitCrossEntropyLoss does this step for us during training but, given that the fine-tuning is done with regular cross entropy, shouldn't we include this line back in the code?
My apologies if I'm missing something!
When training has finished, how does the model assign a probability to a new sentence?
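As a generic sketch of one way to do this (assumptions: PyTorch 0.4+, the repo-style Corpus/Dictionary and model.init_hidden/model.decoder attributes, and that the model's forward returns undecoded hidden states as discussed in another issue above; this is not official code): score a sentence by summing the log-probabilities the model assigns to each next word.
import torch
import torch.nn.functional as F

def sentence_log_prob(model, corpus, words):
    model.eval()
    ids = torch.tensor([[corpus.dictionary.word2idx[w]] for w in words])  # (seq, batch=1)
    hidden = model.init_hidden(1)
    with torch.no_grad():
        output, hidden = model(ids[:-1], hidden)                  # hidden states per position
        logits = model.decoder(output.view(-1, output.size(-1)))  # project to vocabulary
        log_probs = F.log_softmax(logits, dim=-1)
        targets = ids[1:].view(-1)
        return log_probs.gather(1, targets.unsqueeze(1)).sum().item()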
I would like to perform a sanity check by passing some input to the model and reading the output text.
Following the PyTorch tutorial on language modelling (https://github.com/pytorch/examples/blob/master/word_language_model/generate.py), I have edited the evaluate function:
def evaluate(data_source, batch_size=10):
    # Turn on evaluation mode which disables dropout.
    if args.model == 'QRNN': model.reset()
    model.eval()
    total_loss = 0
    ntokens = len(corpus.dictionary)
    hidden = model.init_hidden(batch_size)
    for i in range(0, data_source.size(0) - 1, args.bptt):
        data, targets = get_batch(data_source, i, args, evaluation=True)
        print("inputs")
        inp = data.cpu().data.numpy()
        for input_ in inp:
            print([created_inverse_tokenizer_during_training[i] for i in input_])
        output, hidden = model(data, hidden)
        word_weights = output.squeeze().data.div(args.temperature).exp().cpu()
        word_idx = torch.multinomial(word_weights, 10)
        print("outputs")
        for word_ in word_idx:
            for item_ in word_:
                print("next word", created_inverse_tokenizer_during_training[item_])
        print("")
        output_flat = output.view(-1, ntokens)
        total_loss += len(data) * criterion(output_flat, targets).data
        hidden = repackage_hidden(hidden)
    return total_loss[0] / len(data_source)
where created_inverse_tokenizer_during_training is idx2word from the Dictionary class.
I am testing on the PTB dataset and I get the following with a perplexity of approximately 60:
inputs:
[made, value, $, their, intends, N, also, south, , or]
[much, criteria, N, office, to, return, closed, as, one, $]
[difference, devised, billion, visits, restrict, on, sharply, it, analyst, N]
[in, by, , as, the, assets, lower, became, peter, a]
[liquidity, benjamin, a, , rtc, for, across, more, , share]
[in, graham, , breaks, to, security, europe, clear, of, in]
[the, an, , , treasury, pacific, particularly, that, , the]
[pit, analyst, by, but, borrowings, and, in, a, &, fiscal]
[, and, an, massage, only, an, frankfurt, repeat, co., year]
[it, author, , no, unless, N, although, of, new, just]
["s", in, not, matter, the, N, london, the, york, ended]
[too, the, , how, agency, return, and, october, said, up]
[soon, 1930s, though, , receives, on, a, N, the, from]
[to, and, , is, specific, equity, few, crash, gold, $]
[tell, , , still, congressional, , other, was, market, N]
[but, who, english, associated, authorization, the, markets, "nt", already, million]
[people, is, butler, in, , loan, recovered, at, had, in]
[do, widely, in, many, such, growth, some, hand, some, fiscal]
["nt", considered, his, minds, agency, offset, ground, , good, N]
[seem, to, , with, , continuing, after, professionals, , and]
[to, be, proceeds, , borrowing, real-estate, stocks, dominated, technical, $]
[be, the, as, fronts, is, loan, began, municipal, factors, N]
[unhappy, father, if, for, unauthorized, losses, to, trading, that, million]
[with, of, the, , and, in, rebound, throughout, would, in]
[it, modern, realistic, and, expensive, the, in, the, have, N]
outputs:
[berlitz, hydro-quebec, banknote, centrust, gitano, cluett, guterman, aer, fromstein, calloway]
[berlitz, centrust, cluett, fromstein, aer, gitano, hydro-quebec, guterman, calloway, banknote]
[banknote, hydro-quebec, calloway, fromstein, berlitz, gitano, cluett, aer, guterman, centrust]
[calloway, berlitz, cluett, centrust, aer, gitano, hydro-quebec, banknote, guterman, fromstein]
[fromstein, hydro-quebec, aer, banknote, gitano, berlitz, calloway, cluett, centrust, guterman]
[calloway, hydro-quebec, guterman, fromstein, berlitz, banknote, cluett, centrust, gitano, aer]
[gitano, fromstein, hydro-quebec, cluett, calloway, centrust, berlitz, guterman, aer, banknote]
[berlitz, gitano, banknote, cluett, calloway, aer, centrust, fromstein, hydro-quebec, guterman]
[calloway, gitano, guterman, berlitz, centrust, hydro-quebec, cluett, aer, fromstein, banknote]
[hydro-quebec, berlitz, fromstein, gitano, cluett, calloway, aer, centrust, guterman, banknote]
[aer, cluett, fromstein, berlitz, guterman, calloway, hydro-quebec, centrust, banknote, gitano]
[cluett, calloway, centrust, fromstein, banknote, gitano, guterman, hydro-quebec, aer, berlitz]
[hydro-quebec, fromstein, calloway, aer, banknote, berlitz, cluett, gitano, centrust, guterman]
[banknote, gitano, aer, centrust, cluett, fromstein, calloway, guterman, hydro-quebec, berlitz]
[calloway, aer, gitano, berlitz, fromstein, cluett, guterman, banknote, hydro-quebec, centrust]
[banknote, cluett, fromstein, berlitz, gitano, aer, centrust, calloway, hydro-quebec, guterman]
[cluett, fromstein, aer, calloway, guterman, banknote, berlitz, gitano, centrust, hydro-quebec]
[aer, guterman, berlitz, gitano, centrust, cluett, calloway, hydro-quebec, fromstein, banknote]
[centrust, fromstein, cluett, berlitz, aer, banknote, guterman, gitano, calloway, hydro-quebec]
[guterman, banknote, fromstein, cluett, gitano, calloway, aer, centrust, berlitz, hydro-quebec]
[calloway, berlitz, aer, banknote, hydro-quebec, fromstein, cluett, guterman, gitano, centrust]
[banknote, hydro-quebec, berlitz, fromstein, guterman, calloway, cluett, centrust, gitano, aer]
[centrust, aer, fromstein, cluett, hydro-quebec, calloway, gitano, berlitz, guterman, banknote]
[fromstein, centrust, aer, banknote, berlitz, guterman, gitano, hydro-quebec, calloway, cluett]
[cluett, banknote, hydro-quebec, gitano, berlitz, fromstein, calloway, guterman, centrust, aer]
As you can see, the number of unique words in the output is rather small. Why is that? Or am I doing it wrong?
In specifying command-line arguments in main.py and finetune.py, the --cuda and --tied flags are True by default and become False when specified. Is this intentional? It seems counter-intuitive. Does this have any bearing on the results in your paper, Regularizing and Optimizing LSTM Language Models?
Line 45 in bf0742c
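For context, this behaviour matches the standard argparse store_false pattern (a generic sketch, not the repo's exact lines): the flag defaults to True and passing it on the command line turns the option off.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--cuda', action='store_false')   # defaults to True
parser.add_argument('--tied', action='store_false')   # defaults to True

print(parser.parse_args([]))          # Namespace(cuda=True, tied=True)
print(parser.parse_args(['--cuda']))  # Namespace(cuda=False, tied=True)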
The original codebase was written to be run on PyTorch 0.1.12_2. Updating the codebase to work on PyTorch 0.2 requires a number of steps, including modifying WeightDrop
and others.
Best would be to provide two branches - one with the current PyTorch 0.1.12_2 codebase (allowing for exact result replication) and a second branch that is updated to allow for PyTorch 0.2.
After training using Word level WikiText-103 (PTB) with QRNN
Try to finetune:
File "finetune.py", line 107, in evaluate
if args.model == 'QRNN': model.reset()
AttributeError: 'list' object has no attribute 'reset'
Try to generate:
File "generate.py", line 51, in <module>
model.eval()
AttributeError: 'list' object has no attribute 'eval'
All help appreciated.
I was looking into the data.py and saw that the dictionary consists of all tokens in train, val, and test files. I'm wondering if adding unseen tokens in val/test files to the dictionary will affect the testing in any way? Thanks!
The NT-ASGD algorithm compares the validation loss value with the previous n loss values, but I think main.py compares the validation loss value with the loss values from the 1st epoch up to epoch t-n, because of line 222:
if 't0' not in optimizer.param_groups[0] and (len(best_val_loss)>args.nonmono and val_loss > min(best_val_loss[:-args.nonmono])):
If the line is revised to
if 't0' not in optimizer.param_groups[0] and (len(best_val_loss)>args.nonmono and val_loss > min(best_val_loss[-args.nonmono:])):
the line is consistent with the NT-ASGD described in the research paper.
However, if we use that line, the code starts averaging immediately after the validation metric worsens.
So, what about using the following line instead?
if 't0' not in optimizer.param_groups[0] and (len(best_val_loss)>args.nonmono and val_loss > max(best_val_loss[-args.nonmono:])):
Hi, I am currently studying averaging methods for optimization.
I read your paper 'Regularizing and Optimizing LSTM Language Models' and am trying to follow your experiment, but only on PTB. I have a few questions about the source code.
In your source code main.py, at line 276, you use the condition "'t0' not in optimizer.param_groups[0]". I cannot understand this condition at all. What does it mean?
On the same line, there is the condition "len(best_val_loss)>args.nonmono and val_loss > min(best_val_loss[:-args.nonmono])".
Does this mean "after args.nonmono logging intervals L" and "the validation loss of the current epoch is bigger than the previous args.nonmono logging intervals L"?
After changing from SGD to ASGD, how does the program keep updating the parameters?
Does it update the parameters with SGD until the last epoch and return the averaged parameters at the end,
or
does it update the parameters by averaging every iteration, epoch, or some interval?
This is in the same context as Q3: after the program switches the optimizer to ASGD, the validation PPL/BPC stops changing but the training PPL/BPC keeps changing. Why does that happen?
Is there any averaging stop criterion in this program? If so, what is it?
Is there any training stop criterion other than the maximum number of epochs?
Why did you choose 750 epochs as the maximum? Is it just because you thought it was large enough?
Hi! First of all, thanks for your code. I am currently studying your paper "Regularizing and Optimizing LSTM Language Models".
I want to compare the Adam optimizer and the SGD optimizer with the NT-ASGD scheme you proposed.
I tried your command, with some additions, and your Python code:
"python main.py --batch_size 20 --data data/penn --dropouti 0.4 --dropouth 0.25 --seed 141 --epoch 500 --save SGD_PTB.pt --optimizer sgd"
"python main.py --batch_size 20 --data data/penn --dropouti 0.4 --dropouth 0.25 --seed 141 --epoch 500 --save Adam_PTB.pt --optimizer adam"
The thing is that the first command works well, but the second command runs without producing valid loss, ppl, or bpc values (they are all nan). I copied the log below. Please give me any possible solution for this if you don't mind.
| epoch 17 | 200/ 663 batches | lr 30.00000 | ms/batch 67.21 | loss nan | ppl nan | bpc nan
| epoch 17 | 400/ 663 batches | lr 30.00000 | ms/batch 65.92 | loss nan | ppl nan | bpc nan
| epoch 17 | 600/ 663 batches | lr 30.00000 | ms/batch 66.32 | loss nan | ppl nan | bpc nan
Failed on pytorch 0.2.0_1
python3.6 main.py --batch_size 20 --data data/penn --dropouti 0.4 --seed 28 --epoch 300 --save PTB.pt
[LSTM(400, 1150, dropout=0.3), LSTM(1150, 1150, dropout=0.3), LSTM(1150, 400, dropout=0.3)]
Applying weight drop of 0.5 to weight_hh_l0
Applying weight drop of 0.5 to weight_hh_l0
Applying weight drop of 0.5 to weight_hh_l0
Traceback (most recent call last):
File "main.py", line 94, in <module>
model.cuda()
File "/data1/XXXX/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 147, in cuda
return self._apply(lambda t: t.cuda(device_id))
File "/data1/XXXX/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 118, in _apply
module._apply(fn)
File "/data1/XXXX/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 118, in _apply
module._apply(fn)
File "/data1/XXXX/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 118, in _apply
module._apply(fn)
File "/data1/XXXX/.local/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 116, in _apply
self.flatten_parameters()
File "/data1/XXXX/.local/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 104, in flatten_parameters
all_weights = [[p.data for p in l] for l in self.all_weights]
File "/data1/XXXX/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 262, in __getattr__
type(self).__name__, name))
AttributeError: 'LSTM' object has no attribute 'all_weights'
I've trained a QRNN, but when I try to use generate.py with it, I get the following:
File "generate.py", line 68, in <module>
output, hidden = model(input, hidden)
File "/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/awd-lstm-lm/model.py", line 82, in forward
raw_output, new_h = rnn(raw_output, hidden[l])
File "/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
result = self.forward(*input, **kwargs)
File "/miniconda3/lib/python3.6/site-packages/torchqrnn/qrnn.py", line 60, in forward
Xm1 = [self.prevX if self.prevX is not None else X[:1, :, :] * 0, X[:-1, :, :]]
File "/miniconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 76, in __getitem__
return Index.apply(self, key)
File "/miniconda3/lib/python3.6/site-packages/torch/autograd/_functions/tensor.py", line 16, in forward
result = i.index(ctx.index)
ValueError: result of slicing is an empty tensor
I am trying to train a QRNN model on one dataset and then finetune it on another, but I get the following error when I try to finetune on a different dataset than the one the model was initially trained on:
raw_loss = criterion(output.view(-1, ntokens), targets) RuntimeError: invalid argument 2: size '[-1 x 8967]' is invalid for input with 14600000 elements at /pytorch/torch/lib/TH/THStorage.c:37
Would you be able to explain what this error means?
Do I need to pop the last layer and substitute it with a new Linear layer with the needed number of classes?
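The mismatch is a vocabulary-size mismatch: the saved embedding and decoder were sized for the original corpus, while the new corpus has a different number of tokens (8967 here). A hedged sketch of swapping them out (identifiers like new_corpus/old_corpus are illustrative, and this assumes the default tied-weights setup where the decoder input size equals the embedding size):
import torch.nn as nn

new_ntokens = len(new_corpus.dictionary)
emsize = model.encoder.weight.size(1)

new_encoder = nn.Embedding(new_ntokens, emsize)
new_decoder = nn.Linear(emsize, new_ntokens)

# optionally copy embeddings for words shared between the two vocabularies
for word, new_idx in new_corpus.dictionary.word2idx.items():
    if word in old_corpus.dictionary.word2idx:
        old_idx = old_corpus.dictionary.word2idx[word]
        new_encoder.weight.data[new_idx] = model.encoder.weight.data[old_idx]

model.encoder = new_encoder
model.decoder = new_decoder
model.decoder.weight = model.encoder.weight  # re-tie the weights if tying was used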
The URLs used to download the Penn Treebank data in getdata.sh are no longer available, hence they return 404 errors.
Hi, I found that line 80 in model.py is useless and can be removed. Could you check that? Thanks!
I extended the code to multi-GPU training, but GPU usage is extremely imbalanced. The root cause is that we collect all outputs back and calculate the loss on one GPU. I tried to put the loss calculation inside model.forward() as follows:
class RNNModel(nn.Module):
    def __init__(self, ...):
        super(RNNModel, self).__init__()
        from splitcross import SplitCrossEntropyLoss
        splits = [2800, 20000, 76000]
        self.criterion = SplitCrossEntropyLoss(ninp, splits=splits, verbose=False)
        ...

    def forward(self, ...):
        ...
        result = output
        # calculate loss
        result = result.view(result.size(0)*result.size(1), -1)
        raw_loss = self.criterion(decoder_weight, decoder_bias, result, target)
        loss = raw_loss
        # activation regularization
        if args.alpha:
            loss = loss + sum(args.alpha * dropped_rnn_h.pow(2).mean() for dropped_rnn_h in outputs[-1:])
        # Temporal Activation Regularization (slowness)
        if args.beta:
            loss = loss + sum(args.beta * (rnn_h[1:] - rnn_h[:-1]).pow(2).mean() for rnn_h in raw_outputs[-1:])
        # expand loss to two dimensions so it can be gathered over the second dimension
        loss = loss.unsqueeze(1)
        raw_loss = raw_loss.unsqueeze(1)
        if return_h:
            return raw_loss, loss, hidden, raw_outputs, outputs
        return raw_loss, loss, hidden
Then, in my main.py, I collect the losses and use loss.mean().backward() to update the parameters. The interesting thing is that the first call to loss.mean().backward() succeeds, but the second fails with the error:
RuntimeError: invalid argument 3: Index tensor must have same dimensions as input tensor at
/pytorch/torch/lib/THC/generic/THCTensorScatterGather.cu:199
Can anyone help?
Thanks in advance!
Hi, I think I just found an error in main.py.
On line 252, should val_loss be val_loss2?
Because after the program switches to ASGD mode, it no longer calculates val_loss, so the log shows no change in the validation results after switching to ASGD.
Any ideas on how to incorporate attention model from http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html ?
I copy/pasted the command to train the model, but get the error below.
$python34 C:/Users/dat/Desktop/awd-lstm-lm/main.py --batch_size 20 --data C:/Users/dat/Desktop/awd-lstm-lm/data/penn --dropouti 0.4 --dropouth 0.25 --seed 141 --epoch 500 --save C:/Users/dat/Desktop/awd-lstm-lm/PTB.pickle
Applying weight drop of 0.5 to weight_hh_l0
Applying weight drop of 0.5 to weight_hh_l0
Applying weight drop of 0.5 to weight_hh_l0
[WeightDrop (
(module): LSTM(400, 1150)
), WeightDrop (
(module): LSTM(1150, 1150)
), WeightDrop (
(module): LSTM(1150, 400)
)]
Args: Namespace(alpha=2, batch_size=20, beta=1, bptt=70, clip=0.25, cuda=True, data='C:/Users/dat/Desktop/awd-lstm-lm/data/penn', dropout=0.4, dropoute=0.1, dropouth=0.25, dropouti=0.4, emsize=400, epochs=500, log_interval=200, lr=30, model='LSTM', nhid=1150, nlayers=3, nonmono=5, save='C:/Users/dat/Desktop/awd-lstm-lm/PTB.pickle', seed=141, tied=True, wdecay=1.2e-06, wdrop=0.5)
Model total parameters: 24221600
Traceback (most recent call last):
File "C:/Users/dat/Desktop/awd-lstm-lm/main.py", line 185, in
train()
File "C:/Users/dat/Desktop/awd-lstm-lm/main.py", line 146, in train
output, hidden, rnn_hs, dropped_rnn_hs = model(data, hidden, return_h=True)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 206, in call
result = self.forward(*input, **kwargs)
File "C:\Users\dat\Desktop\awd-lstm-lm\model.py", line 70, in forward
emb = embedded_dropout(self.encoder, input, dropout=self.dropoute if self.training else 0)
File "C:\Users\dat\Desktop\awd-lstm-lm\embed_regularize.py", line 21, in embedded_dropout
embed.scale_grad_by_freq, embed.sparse
TypeError: forward() takes 3 positional arguments but 8 were given
Hi, training crashed with an out-of-memory error on a Titan X 12GB with the char-LSTM on enwik8.
The trick about reducing the "cap" on sequence length links to a 404 URL: could you please let me know where I can do that?
Thanks a lot for the great code!