
nianlonggu / memsum

43 stars, 3 watchers, 16 forks, 82.77 MB

Code for the ACL 2022 paper on long-document summarization: MemSum: Extractive Summarization of Long Documents Using Multi-Step Episodic Markov Decision Processes

Python 21.21% Jupyter Notebook 78.79%
extractive-summarization long-document-summarization acl2022 reinforcement-learning memsum nlp pytorch

memsum's People

Contributors

nianlonggu

memsum's Issues

Cannot get the same oracle result on Pubmed_trunc dataset

The oracle result for pubmed_trunc reported in your paper is 45.12/20.33/40.19, but when I downloaded the dataset you provided and computed it myself, I got the following oracle result:
1 ROUGE-1 Average_R: 0.31187 (95%-conf.int. 0.30779 - 0.31588)
1 ROUGE-1 Average_P: 0.56634 (95%-conf.int. 0.56157 - 0.57099)
1 ROUGE-1 Average_F: 0.37910 (95%-conf.int. 0.37539 - 0.38300)

1 ROUGE-2 Average_R: 0.13605 (95%-conf.int. 0.13329 - 0.13893)
1 ROUGE-2 Average_P: 0.25674 (95%-conf.int. 0.25167 - 0.26153)
1 ROUGE-2 Average_F: 0.16707 (95%-conf.int. 0.16387 - 0.17027)

1 ROUGE-L Average_R: 0.27617 (95%-conf.int. 0.27245 - 0.27985)
1 ROUGE-L Average_P: 0.50572 (95%-conf.int. 0.50079 - 0.51033)
1 ROUGE-L Average_F: 0.33679 (95%-conf.int. 0.33319 - 0.34051)

I would appreciate it if you could provide the calculation code. Thanks.
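For reference, here is a minimal sketch of how such an oracle ROUGE evaluation can be computed in Python. It uses the rouge-score package as a stand-in for the ROUGE-1.5.5 Perl script that produced the output above, so absolute numbers will differ slightly; the file name, record keys, and the 7-sentence cap are assumptions based on the data format discussed in the issues further down:

```python
# Sketch: score oracle extracts against reference summaries with rouge-score.
# Assumes records with "text" (list of sentences), "summary" (list of
# sentences), and oracle "sorted_indices"; adjust the names for the
# truncated PubMed variant you downloaded.
import json
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"],
                                  use_stemmer=True)
totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeLsum": 0.0}
n = 0
with open("test_PUBMED.jsonl") as f:  # hypothetical path
    for line in f:
        rec = json.loads(line)
        oracle = "\n".join(rec["text"][i] for i in rec["sorted_indices"][:7])
        reference = "\n".join(rec["summary"])
        result = scorer.score(reference, oracle)
        for key in totals:
            totals[key] += result[key].fmeasure
        n += 1

for key, total in totals.items():
    print(key, round(total / n, 5))
```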

Do we need to provide a ground-truth summary to run greedy_summary?

Hi, I am implementing a use case where an algorithm sequentially generates the sentence that is most relevant to a summary of the text. It is pretty much what greedy_summary is designed for.

However, when looking through your implementation of greedy_summary (I have not had time to read the paper yet), I noticed that it needs the ground-truth summary as input to pick the next most important sentence. If I used MemSum to generate a summary of the text first and fed that output to greedy_summary as its reference, would that produce a decent result?
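For illustration, below is a minimal sketch of a greedy extractive oracle of this kind. It is not the repository's actual greedy_summary (whose exact signature is not shown here); it assumes the rouge-score package, and greedy_oracle, gain, and max_sents are names invented for this sketch:

```python
# Sketch of a greedy extractive oracle: repeatedly add the sentence that most
# improves ROUGE against a reference summary. Illustrative only; this is not
# the repository's greedy_summary implementation.
from rouge_score import rouge_scorer

def greedy_oracle(sentences, reference, max_sents=7):
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)

    def gain(indices):
        pred = "\n".join(sentences[i] for i in sorted(indices))
        result = scorer.score(reference, pred)
        return result["rouge1"].fmeasure + result["rouge2"].fmeasure

    selected, best = [], 0.0
    while len(selected) < max_sents:
        candidates = [i for i in range(len(sentences)) if i not in selected]
        if not candidates:
            break
        top_gain, top_i = max((gain(selected + [i]), i) for i in candidates)
        if top_gain <= best:  # no remaining sentence improves ROUGE; stop
            break
        selected.append(top_i)
        best = top_gain
    return selected
```

Note that swapping the human reference for a MemSum-generated summary makes the loop self-referential: it will mostly recover the sentences MemSum already extracted, so it is unlikely to add information beyond the model's own output.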

Dataset format question

Hello, how was data in the high_rouge_indices_and_scores.jsonl format produced? I could not find where the indices and score values come from.

Missing `summarizers.py` and `download_and_load_word_embedding.py`

Hi @nianlonggu ,

Thanks for providing the code.

I wanted to try your method but found some files are missing.

For example, based on your README file there should be a file called summarizers.py, but I could not find it.

In addition, is there a file called download_and_load_word_embedding.py? I am not able to find that either.

I really appreciate any help you can provide.

Question about the Hugging Face models

Hello, I downloaded your pretrained model from nianlong/memsum-word-embedding. My question now is: how can I train vocabulary_200dim.pkl and unigram_embeddings_200dim.pkl on my own Chinese dataset? I hope you can reply to my message.
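A minimal sketch of one way to produce such files with gensim word2vec follows. The internal layout MemSum expects for vocabulary_200dim.pkl and unigram_embeddings_200dim.pkl is not documented here, so the word-to-index vocabulary and word-to-vector embedding layout below is an assumption; check the repository's loading code before reusing it:

```python
# Sketch: train 200-dim word embeddings on a pre-tokenized Chinese corpus with
# gensim, then pickle a vocabulary and an embedding lookup. The pickle layouts
# are assumptions; verify how MemSum's loader reads these files.
import pickle
from gensim.models import Word2Vec

# corpus: one tokenized sentence (e.g. via jieba) per inner list
corpus = [["这", "是", "一个", "例子"], ["第二", "个", "句子"]]

model = Word2Vec(sentences=corpus, vector_size=200, window=5,
                 min_count=1, workers=4, epochs=10)

vocab = {word: idx for idx, word in enumerate(model.wv.index_to_key)}
embeddings = {word: model.wv[word] for word in model.wv.index_to_key}

with open("vocabulary_200dim.pkl", "wb") as f:
    pickle.dump(vocab, f)
with open("unigram_embeddings_200dim.pkl", "wb") as f:
    pickle.dump(embeddings, f)
```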

Clarification of the data format

Hi @nianlonggu ,

Thanks a lot for your previous reply.

I just have some questions regarding your data format for PubMed.

For example, for train_PUBMED.jsonl, the keys are ['text', 'summary', 'indices', 'score'], but for the others like test_PUBMED.jsonl, we only have ['summary', 'text', 'sorted_indices'].

What do these keys mean? For example, sorted_indices versus indices and score?

Also, I can see test_PUBMED_ORACLE.jsonl; what is the difference between it and test_PUBMED.jsonl?

How did you generate these files?

If I understand correctly, during training you extract the gold extractive summaries, and then during validation and test your scores are reported by comparing against the real summaries.

Sorry for so many questions.

Thanks in advance.
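As a quick way to compare what each split actually contains, one can print the keys of the first record in every file (file names assumed as above):

```python
# Sketch: inspect the fields of the first record in each PubMed split.
import json

for path in ["train_PUBMED.jsonl", "test_PUBMED.jsonl", "test_PUBMED_ORACLE.jsonl"]:
    with open(path) as f:
        record = json.loads(next(f))
    print(path, "->", sorted(record))
```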

Predicting on a custom dataset

Hi @nianlonggu, I want to compute the ROUGE-1/2/L scores of a summary extracted from a document. I guess the relevant part is sentence_score_history in summarizers.py, in the MemSum model. Is that right?

Thanks!

Clarification on the oracle/sampled summaries

Looking at the paper and the data, it is not very clear how you obtained the oracle summaries.

In the paper you say that B=2 and that at each step you sample one greedy selection and one sub-optimal selection. Doesn't that mean the number of summaries should be a power of 2? But I see that is not the case.

I would appreciate it if you could clarify this selection process or provide some code.
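For illustration, here is a minimal sketch of one way a B=2 greedy/sub-optimal expansion could be implemented. This is a guess at the procedure, not the paper's code, and score_fn, expand_candidates, and steps are names invented for the sketch:

```python
# Sketch: beam-style expansion with B=2. At each step, every partial summary
# is extended with (a) the highest-scoring next sentence and (b) a random
# sub-optimal one. Candidates are stored as sets of sentence indices, so
# duplicates collapse and the final count need not be a power of two.
import random

def expand_candidates(score_fn, n_sentences, steps=3):
    beams = {frozenset()}  # start from the empty summary
    for _ in range(steps):
        new_beams = set()
        for beam in beams:
            remaining = [i for i in range(n_sentences) if i not in beam]
            if not remaining:
                new_beams.add(beam)
                continue
            ranked = sorted(remaining,
                            key=lambda i: score_fn(beam | {i}), reverse=True)
            new_beams.add(beam | {ranked[0]})   # greedy choice
            if len(ranked) > 1:                 # sub-optimal choice
                new_beams.add(beam | {random.choice(ranked[1:])})
        beams = new_beams
    return beams
```

In a scheme like this, the greedy children of different parents can coincide, and a random sub-optimal pick can collide with another branch's greedy pick, so after deduplication the candidate set is typically smaller than 2^steps; that could explain summary counts that are not powers of two.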
