lancopku / superae

Code for "Autoencoder as Assistant Supervisor: Improving Text Representation for Chinese Social Media Text Summarization"

Python 51.91% Perl 48.09%

superae's Introduction

Citation

If you use this code for your research, please cite the paper it is based on, "Autoencoder as Assistant Supervisor: Improving Text Representation for Chinese Social Media Text Summarization":

@inproceedings{Ma2016superAE,
  title   = {Autoencoder as Assistant Supervisor: Improving Text Representation for Chinese Social Media Text Summarization},
  author  = {Shuming Ma and Xu Sun and Junyang Lin and Houfeng Wang},
  booktitle = {{ACL} 2018},
  year      = {2018}
}

superae's People

Contributors

imwebson

superae's Issues

KeyError: 'unexpected key "decoder_s2s.embedding.weight" in state_dict'

Traceback (most recent call last):
File "predict.py", line 177, in <module>
model.load_state_dict(checkpoints['model'])
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 522, in load_state_dict
.format(name))
KeyError: 'unexpected key "decoder_s2s.embedding.weight" in state_dict'
This error occurs when predict.py loads the model. What is causing it?
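An "unexpected key" error from load_state_dict means the checkpoint contains a parameter name that the model being built does not have. One possible cause, which the thread does not confirm, is that the checkpoint was saved from a model built with a different configuration than the one predict.py constructs. A minimal sketch for listing the mismatched names follows; the helper `diff_state_dict_keys` and the sample key names other than `decoder_s2s.embedding.weight` are hypothetical.

```python
from collections import OrderedDict


def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (unexpected, missing) parameter names.

    unexpected: names in the checkpoint that the model does not define.
    missing:    names the model defines that the checkpoint lacks.
    """
    model_keys = list(model_keys)
    checkpoint_keys = list(checkpoint_keys)
    model_set = set(model_keys)
    checkpoint_set = set(checkpoint_keys)
    unexpected = [k for k in checkpoint_keys if k not in model_set]
    missing = [k for k in model_keys if k not in checkpoint_set]
    return unexpected, missing


# Hypothetical key names, for illustration only.
model_keys = ["encoder.embedding.weight", "decoder.embedding.weight"]
checkpoint = OrderedDict([
    ("encoder.embedding.weight", None),
    ("decoder.embedding.weight", None),
    ("decoder_s2s.embedding.weight", None),
])

unexpected, missing = diff_state_dict_keys(model_keys, checkpoint.keys())
print("unexpected:", unexpected)
print("missing:", missing)
```

With a real model, the two arguments would be `model.state_dict().keys()` and `checkpoints['model'].keys()`. If the unexpected names all belong to one sub-module, that suggests the model should be constructed with the same options that were used at training time.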

RuntimeError: The size of tensor a (128) must match the size of tensor b (64) at non-singleton dimension 1

/home/TengWei/SAE/models/attention.py:32: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
weights = self.softmax(weights) # batch * time
Traceback (most recent call last):
File "train.py", line 345, in <module>
main()
File "train.py", line 334, in main
train(i)
File "train.py", line 194, in train
loss, num_total, num_correct = model.train_model(src, src_len, tgt, tgt_len, opt.loss, updates, optim, num_oovs=num_oovs)
File "/home/TengWei/SAE/models/seq2seq.py", line 85, in train_model
loss, num_total, num_correct = self.compute_loss(outputs, targets, loss_fn, updates)
File "/home/TengWei/SAE/models/seq2seq.py", line 61, in compute_loss
return models.cross_entropy_loss(hidden_outputs, self.decoder, targets, self.criterion, self.config)
File "/home/TengWei/SAE/models/loss.py", line 190, in cross_entropy_loss
num_correct = pred.data.eq(targets.data).masked_select(targets.ne(dict.PAD).data).sum()
RuntimeError: The size of tensor a (128) must match the size of tensor b (64) at non-singleton dimension 1

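Cannot reproduce the SAE results

Environment

  1. Character-level tokenization
  2. Default dictionary size
  3. Part I and Part II used as the training set
  4. The first 725 lines of Part III used as the test set (relevance score greater than 3)
  5. The default configuration file, unchanged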
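Items 1 and 4 of the setup above (character-level tokenization and taking the first 725 lines as the test set) could be implemented roughly as in the sketch below. The function names `char_tokenize` and `head` are hypothetical and are not taken from the repository's preprocessing script.

```python
from itertools import islice


def char_tokenize(line):
    """Split a line into single characters joined by spaces, dropping whitespace."""
    return " ".join(ch for ch in line.strip() if not ch.isspace())


def head(lines, n=725):
    """Return the first n items of an iterable of lines."""
    return list(islice(lines, n))


sample = ["今天天气很好\n", "第二行\n"]
test_lines = [char_tokenize(line) for line in head(sample, 1)]
print(test_lines)
```

Note that this simple tokenizer also splits Latin words and numbers into single characters, which may or may not match the preprocessing used in the paper.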

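seq2seq results

seq725

SAE results

sae725

Question

Although I did not follow the paper's settings exactly, the two comparison experiments at least share the same training set, test set, and configuration file. The seq2seq results are close to those in the paper, so why are the SAE results so much worse?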

Model files

Hello, could you provide your trained model files?

hello

I would like to know what the final data format looks like.

What's the version of pytorch that you use?

I have trouble running this code. At first I could not train; after modifying the code, I was able to train the network. But now I have a problem in eval_rouge that I can't figure out. I think the PyTorch version makes a difference. Could you clarify the requirements in the README?
I'm using Python 3.6 with PyTorch 0.4.0.

ValueError: max() arg is an empty sequence

Traceback (most recent call last):
File "train.py", line 343, in <module>
main()
File "train.py", line 337, in main
logging("Best %s score: %.2f\n" % (metric, max(scores[metric])))
ValueError: max() arg is an empty sequence

This error appears when I use Part II as the training set and Part III as the test set. How can I fix it?
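The traceback shows `max()` being called on `scores[metric]` while that list is empty, which would happen if training finished before any evaluation score was recorded (an assumption; the thread does not say why the list is empty). A minimal sketch of a guarded version of that final log line follows; `best_score` and `format_best` are hypothetical helpers and the metric name "rouge" is only an example.

```python
def best_score(scores, metric):
    """Return the highest recorded score for metric, or None if there is none."""
    return max(scores.get(metric, []), default=None)


def format_best(scores, metric):
    """Build the final log line without raising on an empty score list."""
    best = best_score(scores, metric)
    if best is None:
        return "Best %s score: n/a (no evaluation was recorded)\n" % metric
    return "Best %s score: %.2f\n" % (metric, best)


scores = {"rouge": []}
print(format_best(scores, "rouge"), end="")

scores["rouge"].extend([31.5, 33.25])
print(format_best(scores, "rouge"), end="")
```

The guard only removes the crash. If the list is empty, it is still worth checking that the evaluation interval is shorter than the total number of updates, so that at least one evaluation actually runs.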

datainput

How do I feed data into the code? I can't find the path.

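Hello, a question about the vocab file used for the word embeddings

Which vocab file is used to convert Chinese text into indices? I have finished training on your group's training set and would like to build my own test set, but I don't know which vocab file corresponds to it.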
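Whichever file it is, a new test set has to be indexed with the same vocabulary that was built for training, because the model's embedding rows are tied to those indices. The sketch below illustrates the lookup under an assumed file format of one "token index" pair per line. That format, the `<unk>` index, and the helper names `load_vocab` and `to_ids` are assumptions for illustration, not details confirmed by the repository.

```python
def load_vocab(lines):
    """Parse lines of the assumed form 'token index' into a dict."""
    vocab = {}
    for line in lines:
        parts = line.rstrip("\n").rsplit(" ", 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        token, index = parts
        vocab[token] = int(index)
    return vocab


def to_ids(tokens, vocab, unk="<unk>"):
    """Map tokens to indices, falling back to the unknown-token index."""
    unk_id = vocab[unk]
    return [vocab.get(token, unk_id) for token in tokens]


# Hypothetical vocabulary contents.
vocab = load_vocab(["<unk> 1\n", "今 4\n", "天 5\n"])
print(to_ids(["今", "天", "气"], vocab))
```

A test set indexed with a different vocabulary would map most tokens to the wrong rows or to `<unk>`.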

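About the output of preprocess.py

Hello, is the current version of preprocess.py intended for the LCSTS2.0 dataset?
(The LCSTS2.0 data files contain many <> tags, but I could not find any step that removes them.)

I would like to understand the steps you took to get from LCSTS2.0 to lcsts.low.share.train.pt. Thank you!
(I ask because, when I preprocessed another dataset, the processed output caused errors at run time.)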

Illegal division by zero at data/script/ROUGE-1.5.5.pl line 2450

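Hello, I am using the LCSTS2.0 dataset. Following an earlier issue, I put the texts and the summaries into the src and tgt files respectively, but ROUGE fails when it runs. From what I have read, this seems to be caused by the Chinese characters? But this paper is specifically about Chinese summarization.
Also, every file in the candidate folder contains only <unk>. Is that normal, or did I get one of the steps wrong?

The log is as follows: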
/mnt/extend_sdb/workspace/superAE/superAE/models/attention.py:32: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
weights = self.softmax(weights) # batch * time
/mnt/extend_sdb/workspace/superAE/superAE/models/loss.py:190: UserWarning: self and other not broadcastable, but have the same number of elements. Falling back to deprecated pointwise behavior.
num_correct = pred.data.eq(targets.data).masked_select(targets.ne(dict.PAD).data).sum()
/mnt/extend_sdb/workspace/superAE/superAE/models/loss.py:190: UserWarning: self and mask not broadcastable, but have the same number of elements. Falling back to deprecated pointwise behavior.
num_correct = pred.data.eq(targets.data).masked_select(targets.ne(dict.PAD).data).sum()
[======================================== 10000/10000 ================================>] Step: 3s712ms | Tot: 6h2m
epoch: 1, ppl: 1.006, time: 21772.570, updates: 10000, accuracy: 95.07
evaluating after 10000 updates...
/mnt/extend_sdb/workspace/superAE/superAE/models/seq2seq.py:169: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
output = unbottle(self.log_softmax(output))
Illegal division by zero at data/script/ROUGE-1.5.5.pl line 2450........................] Step: 410ms | Tot: 0ms
F_measure: [0.0, 0.0, 0.0] Recall: [0.0, 0.0, 0.0] Precision: [0.0, 0.0, 0.0]

Environment:
Python 3.5, PyTorch 0.3.1, CUDA 9.0
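The "Illegal division by zero" error from ROUGE-1.5.5 is commonly reported when the script is given non-ASCII text, since its tokenizer is generally described as keeping only ASCII letters and digits. A widely used workaround, not confirmed here as the repository's own approach, is to replace every token with a numeric id before writing the reference and candidate files. A minimal sketch with the hypothetical helpers `build_token_ids` and `encode`:

```python
def build_token_ids(texts):
    """Assign a stable numeric id (as a string) to every distinct token."""
    ids = {}
    for text in texts:
        for token in text.split():
            if token not in ids:
                ids[token] = str(len(ids) + 1)
    return ids


def encode(text, ids):
    """Rewrite a space-separated text as a sequence of numeric ids."""
    return " ".join(ids[token] for token in text.split())


reference = "今 天 天 气 很 好"
candidate = "今 天 很 好"

ids = build_token_ids([reference, candidate])
print(encode(reference, ids))
print(encode(candidate, ids))
```

The mapping must be shared between each reference and its candidate so that matching characters receive the same id. Candidates that consist only of `<unk>` are a separate problem that this mapping does not address.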

Processing the raw data

Hello, the raw LCSTS data contains many <> style tags. How can I convert it into data files such as train.src/valid.src?
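A minimal sketch of one way to do this, assuming the raw files wrap each example in `<summary>` and `<short_text>` tags inside a `<doc>` element. The tag layout is an assumption about the LCSTS release and should be checked against the actual files; `extract_pairs` is a hypothetical helper, not part of the repository.

```python
import re

# Matches one summary followed by its short text; DOTALL lets '.' span newlines.
PAIR_RE = re.compile(
    r"<summary>\s*(.*?)\s*</summary>\s*<short_text>\s*(.*?)\s*</short_text>",
    re.DOTALL,
)


def extract_pairs(raw):
    """Return a list of (short_text, summary) pairs found in the raw markup."""
    return [(text, summary) for summary, text in PAIR_RE.findall(raw)]


raw = """<doc id=0>
    <summary>
        摘要文本
    </summary>
    <short_text>
        正文文本
    </short_text>
</doc>"""

pairs = extract_pairs(raw)
print(pairs)
```

Each short text would then be written as one line of the .src file and each summary as the corresponding line of the .tgt file, after whatever tokenization the experiment uses.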
