
booksum's Introduction

BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization

Authors: Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, Dragomir Radev

Introduction

The majority of available text summarization datasets include short-form source documents that lack long-range causal and temporal dependencies and often contain strong layout and stylistic biases. While relevant, such datasets will offer limited challenges for future generations of text summarization systems. We address these issues by introducing BookSum, a collection of datasets for long-form narrative summarization. Our dataset covers source documents from the literature domain, such as novels, plays, and stories, and includes highly abstractive, human-written summaries at three levels of granularity of increasing difficulty: paragraph-, chapter-, and book-level. The domain and structure of our dataset pose a unique set of challenges for summarization systems, including processing very long documents, non-trivial causal and temporal dependencies, and rich discourse structures. To facilitate future work, we trained and evaluated multiple extractive and abstractive summarization models as baselines for our dataset.

Paper link: https://arxiv.org/abs/2105.08209

Table of Contents

  1. Updates
  2. Citation
  3. Legal Note
  4. License
  5. Usage
  6. Get Involved

Updates

4/15/2021

Initial commit

Citation

@article{kryscinski2021booksum,
      title={BookSum: A Collection of Datasets for Long-form Narrative Summarization}, 
      author={Wojciech Kry{\'s}ci{\'n}ski and Nazneen Rajani and Divyansh Agarwal and Caiming Xiong and Dragomir Radev},
      year={2021},
      eprint={2105.08209},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Legal Note

By downloading or using the resources, including any code or scripts, shared in this code repository, you hereby agree to the following terms, and your use of the resources is conditioned on and subject to these terms.

  1. You may only use the scripts shared in this code repository for research purposes. You may not use or allow others to use the scripts for any other purposes and other uses are expressly prohibited.
  2. You will comply with all terms and conditions, and are responsible for obtaining all rights, related to the services you access and the data you collect.
  3. We do not make any representations or warranties whatsoever regarding the sources from which data is collected. Furthermore, we are not liable for any damage, loss or expense of any kind arising from or relating to your use of the resources shared in this code repository or the data collected, regardless of whether such liability is based in tort, contract or otherwise.

License

The code is released under the BSD-3 License (see LICENSE.txt for details).

Usage

1. Chapterized Project Gutenberg Data

The chapterized book text from Gutenberg, for the books we use in our work, has been made available through a public GCP bucket. It can be fetched using:

gsutil cp gs://sfr-books-dataset-chapters-research/all_chapterized_books.zip .

or downloaded directly here.

2. Data Collection

Data collection scripts for the summary text are organized by the different sources that we use summaries from. Note: At the time of collecting the data, all links in literature_links.tsv were working for the respective sources.

For each data source, the file literature_links.tsv.pruned contains the links for the books in our dataset. Run get_summaries.py to collect the summaries from the links for each source. Additionally, get_works.py can be used to collect an exhaustive set of summaries from that source.

cd scripts/data_collection/cliffnotes/
python get_summaries.py
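Since each source directory has its own get_summaries.py and the scripts rely on relative paths, a small wrapper can run them all. This is a sketch: the source list is inferred from directory names appearing elsewhere on this page and may be incomplete.

```python
import os
import subprocess

# Source names seen elsewhere in this repo; the full set may differ.
SOURCES = ["cliffnotes", "gradesaver", "novelguide", "sparknotes"]

def collect_all(root="scripts/data_collection", runner=subprocess.run):
    """Run get_summaries.py from inside each source directory.

    The scripts use hardcoded relative paths, so the working directory
    matters; `runner` is injectable so the loop can be tested without
    actually scraping anything.
    """
    for source in SOURCES:
        runner(["python", "get_summaries.py"], cwd=os.path.join(root, source))
```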

3. Data Cleaning

  1. Perform basic cleanup operations and set up the summary text for splitting and further cleaning

    cd scripts/data_cleaning_scripts/
    python basic_clean.py
    
  2. Using a mapping of which chapter summaries are separable (alignments/chapter_summary_aligned.jsonl.aggregate_splits), the summary text is split into different sections (e.g., a "Chapters 1-3" summary is separated into three sections: a Chapter 1 summary, a Chapter 2 summary, and a Chapter 3 summary)

    python split_aggregate_chaps_all_sources.py
    
  3. The main cleanup script separates out analysis/commentary/notes from the summary text, removes prefixes etc.

    python clean_summaries.py
    

Data Alignments

Generating paragraph alignments from the chapter-level summary alignments is performed individually for the train/test/val splits:

Gather the data from the summaries and book chapters into a single jsonl file. The script needs to be run separately for each split, passing the corresponding alignment file as the matched file:

cd paragraph-level-summary-alignments
python gather_data.py --matched_file /path/to/chapter_summary_aligned_{train/test/val}_split.jsonl --split_paragraphs

Generate alignments of the paragraphs with sentences from the summary using the bi-encoder paraphrase-distilroberta-base-v1

python align_data_bi_encoder_paraphrase.py --data_path /path/to/chapter_summary_aligned_{train/test/val}_split.jsonl.gathered --stable_alignment

Troubleshooting

  1. The web archive links we collect the summaries from can be unreliable and slow to load. One mitigation, already implemented in some of the scripts, is to use longer sleep timeouts when a link throws an exception.
  2. Links that consistently throw errors are aggregated in a file called 'section_errors.txt'. This is useful for inspecting which links are actually unavailable and for re-running the data collection scripts on those specific links.
  3. Some paths in the provided files may throw errors depending on where the chapterized books were downloaded. It is recommended to download them into the booksum root directory so the scripts work without any modifications to the paths.
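The sleep-timeout idea from point 1 can be sketched as exponential backoff. This is a sketch, not the repo's implementation; `opener` and `sleep` are injectable so the retry logic can be tested offline.

```python
import time
from urllib.error import URLError
from urllib.request import urlopen

def fetch_with_retry(url, retries=4, base_delay=1.0, opener=urlopen, sleep=time.sleep):
    """Retry a flaky fetch with exponentially growing sleep timeouts."""
    last_error = None
    for attempt in range(retries):
        try:
            return opener(url)
        except URLError as exc:
            last_error = exc
            # Back off: 1s, 2s, 4s, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    raise last_error
```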

Get Involved

Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports. We welcome PRs!

booksum's People

Contributors

jigsaw2212, muggin, svc-scm


booksum's Issues

Absolute path in book and chapter alignments

The jsonl files in chapter-level-summary-alignments and book-level-summary-alignments use an absolute path for book_path, chapter_path, and summary_path which begins with /export/longsum/booksumm/, e.g.

{
  ...
  "book_path": "/export/longsum/booksumm/all_chapterized_books/1023-chapters/63.txt",
  "summary_path": "/export/longsum/booksumm/all_chapterized_books/4517-chapters/book_clean.txt",
  ...
}

or

{
  ...
  "chapter_path": "/export/longsum/booksumm/all_chapterized_books/1023-chapters/63.txt",
  "summary_path": "/export/longsum/booksumm/finished_summaries/gradesaver/Bleak House/section_18_part_2.txt",
  ...
}

For the chapter-level summaries, this causes the gather_data.py script to fail at these two lines: gather.py:179 and gather.py:189. I'm not sure where the book-level summaries are used.

More instructions to reproduce the baseline model results?

Hi! Thank you for your amazing efforts.

I wonder whether you plan to update the repo with more code/instructions for reproducibility? I am (and I guess many others are) interested in reproducing the baseline results for research purposes.

Thank you very much.

Data cleaning scripts error handling

I'm not sure if this is needed once the remaining issues get ironed out for the dataset scripts as a whole, but right now both the basic_clean.py and clean_summaries.py scripts can crash if a book directory isn't found. You'd want to add a continue if books_dir is not found at basic_clean.py:29, or if source_summary_dir isn't found at clean_summaries.py:32.
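The suggested guard could look roughly like this. It is a sketch with hypothetical names; the real scripts build their paths differently.

```python
import os

def existing_book_dirs(root, book_names):
    """Yield only the book directories that were actually downloaded,
    skipping missing ones with a `continue` instead of crashing."""
    for name in book_names:
        books_dir = os.path.join(root, name)
        if not os.path.isdir(books_dir):
            continue  # book was never downloaded; skip it
        yield books_dir
```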

Wrong File Open Mode in <align_data_bi_encoder_paraphrase.py>

In align_data_bi_encoder_paraphrase.py at line 226, the file is opened in write mode. This seems like a bug to me, as it is supposed to be append mode. I spent 3 hours only to find that the output file kept being overwritten by the new samples.

# Original 
with open(basename(args.data_path) + ".stable.bi_encoder_paraphrase", "w") as fd:  
    for stable_example in stable_examples:  
        fd.write(json.dumps(stable_example) + "\n")  
# expected
with open(basename(args.data_path) + ".stable.bi_encoder_paraphrase", "a") as fd:  
    for stable_example in stable_examples:  
        fd.write(json.dumps(stable_example) + "\n")  

Also, a kind suggestion for the usage of tqdm: at line 197, enumerate(tqdm(data)) is probably preferred.

# original
for ix, example in tqdm(enumerate(data)):
# preferred
for ix, example in enumerate(tqdm(data)):

Need more instructions to reproduce the extractive oracle of booksum-chapter

Hi, thank you for your great work!

I plan to reproduce some of your baseline results (following #22), but I ran into some problems reproducing the extractive oracle of booksum-chapter and got a slightly different result from your paper: I got a ROUGE-1/2/L (F1) of 42.38/9.82/20.62, while 42.68/9.66/21.33 is reported in your paper.

Here are my steps:

  1. Split the text in BOOKSUM-paragraph (lines in chapter_summary_aligned_{}_split.jsonl.gathered.stable) into sentences with spaCy, and compute oracles for each instance as described in Section 4.2 of your paper.
  2. Split the text in BOOKSUM-chapter (lines in chapter_summary_aligned_{}_split.jsonl.gathered) into paragraphs with the function "merge_text_paragraphs()" in align_data_bi_encoder_paraphrase.py, then split the paragraphs into sentences as in Step 1.
  3. Map ALL of the oracle sentences obtained in Step 1 to the chapter sentences of BOOKSUM-chapter obtained in Step 2.
  4. Now I have a BOOKSUM-chapter whose texts are split into sentences, with each sentence marked as oracle or not, and I can compute ROUGE for each chapter instance.

Is anything wrong with my steps? Could you give more instructions on how you performed this?

Another question: it seems that those extractive models are not directly provided on Huggingface and need additional effort to reproduce. Did you train and evaluate models such as BertExt and MatchSum using the code from their original repos? Could you also give some instructions about this?

Thank you very much! @jigsaw2212 @muggin

Numbers of Books

Hi, @jigsaw2212 and @muggin

Thanks for sharing the scripts for reproducing the BookSum dataset.
I'm still running get_works.py and get_summaries.py, as they take some time to complete.

I have a few things to confirm:

  1. Your paper says that BookSum Full contains 436 documents, but the zipped Gutenberg project file contains only 269 folders (I assume this corresponds to the number of books). Can you explain the count mismatch?

  2. Could you explain the relationship among the numbers of paragraphs, chapters, and books? Do all paragraphs belong to chapters, and all chapters to books?

Cheers

README Issues

I've run into a few issues with the README:

  1. In Steps 2 and 3 of the Usage section, the python scripts must be called from the individual directories rather than by specifying the full script path from the base of the repo, because the scripts use hardcoded relative paths.
  2. The gather_data.py script under Data Alignments requires a number of parameters to be passed in, and it's not entirely clear which are the correct ones (for example, --join_strings and --split_paragraphs are two mutually exclusive flags and one must be specified). A similar issue exists with the other two alignment scripts (e.g. --stable_alignment vs --greedy_alignment for align_data_bi_encoder_paraphrase.py). It would be useful to have an example of the exact arguments needed for each of these scripts.
  3. It would also be quite useful to know what the python package requirements are. While it's possible to look at the scripts manually to see which packages are used, it's hard to determine which versions are needed. Either documenting this in the README or including a requirements.txt with pinned versions would be useful.
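As a starting point, a hypothetical, unpinned requirements.txt could list the packages mentioned elsewhere on this page (spaCy, tqdm, nltk, transformers, the bi-encoder from sentence-transformers, and BeautifulSoup for the scrapers); the exact set and versions the authors used are unknown:

```text
beautifulsoup4
nltk
sentence-transformers
spacy
tqdm
transformers
```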

Missing book level alignments, extra chapter level alignments

The linked arxiv paper reports that there are 436 book-level alignments, but the alignments/book-level-summary-alignments/ directory only contains 405 alignments (314 train + 45 val + 46 test). The paper also reports 12,293 chapter-level alignments, but the alignments/chapter-level-summary-alignments/ directory currently contains 12,630 alignments (9713 train + 1485 val + 1432 test).

Are there plans to fix these discrepancies? I am trying to reproduce the paper's results with as much fidelity as possible.

Release a dataset snapshot

Hi,

Is it possible to release a link to the processed dataset used in the paper? Some of the download scripts are super slow (and keep raising connection-timed-out errors) on my end...

Thanks!

Licensing Question

Hi Team,

Thanks for contributing this amazing dataset and open sourcing it. I had a question regarding usage of the data:

The Legal note in the README.md file says:

"You may only use the scripts shared in this code repository for research purposes. You may not use or allow others to use the scripts for any other purposes and other uses are expressly prohibited."

Whereas I believe the BSD 3-Clause License does allow commercial use. Could you clarify which is correct?

Request for .gathered Data Alignment files

Thanks for the interesting repo. Lots of moving parts here, some working better than others.

Managed to complete all the text-cleaning tasks (with some missing downloads), but I encounter this 'IndexError: list index out of range' each time I run the following command:

python gather_data.py --matched_file ../chapter-level-summary-alignments/chapter_summary_aligned_train_split.jsonl --split_paragraph

Running python3.7, ubuntu 18.04, and all requirements.

Appears to be working until ...

22%|████████▌ | 2117/9713 [00:38<02:35, 48.90it/s]sentence: The Leech
summary_content: ['The Leech']
fixed_content: []
67%|██████████████████████████ | 6496/9713 [01:55<00:57, 56.16it/s]
Traceback (most recent call last):
File "gather_data.py", line 254, in
main(args)
File "gather_data.py", line 220, in main
summary_content = fix_prefix_quotations(summary_content)
File "gather_data.py", line 82, in fix_prefix_quotations
fixed_content[-1] = fixed_content[-1] + sent_split[0].strip()
IndexError: list index out of range

The script exits without a partial .gathered file being generated. I would like to complete the final step with paraphrase-distilroberta-base-v1, but that requires the .gathered files. I'd appreciate an upload of the files, or advice on how to solve the issue.

Cheers,
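Until the files are available, a guard along these lines would avoid indexing into an empty list at gather_data.py:82. This is a sketch with a hypothetical helper name; the real fix_prefix_quotations does more than this.

```python
def append_to_last(fixed_content, fragment):
    """Append `fragment` to the last collected sentence, or start a new
    entry when nothing has been collected yet (the empty-list case that
    currently raises IndexError)."""
    if fixed_content:
        fixed_content[-1] = fixed_content[-1] + fragment.strip()
    else:
        fixed_content.append(fragment.strip())
    return fixed_content
```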

Bugs in sparknotes/get_summaries.py

There are a couple of bugs in sparknotes/get_summaries.py:

  1. Line 82 can lead to a crash if the span cannot be found; it should probably be moved into the try/except block above on line 76.
  2. On line 128, you should probably continue if there are no section paragraphs, i.e., no summary.

Using book-level booksum

Hi, could you share more details on how book-level summaries should be obtained, and how to evaluate models on the book level?

Which scripts were used to fine-tune the t5 model after the final alignment .jsonl files were generated?

I have generated all the final alignment files and tested several t5 fine-tuning scripts with varying degrees of success.

Seeking to reproduce results from:

Table 8: Examples of decoded summaries of Chapter 1 of “Sense and Sensibility”, part 2.

Table 11: Examples of decoded summaries of the full text of “Sense and Sensibility”, part 3.

Could you please provide information on the scripts and parameters used to generate the above results?

I have tested the latest version of the transformers run_summarization.py script, but it yields the following error on the .jsonl:

TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

I have tested other scripts using the "summarize:" option with the same .jsonl. These work, but they only generate a single-line summary for the Chapter 1 example above. How were the multi-line summaries for Chapter 1 and the entire book generated?

Looking forward to your reply.

Evaluation Metric - rougeL

Hi,

May I know which rougeL metric is used in the paper? Specifically, is it rougeL or rougeLsum? The latter adds a newline after every sentence and computes a union-LCS score.

Thanks.

AttributeError: 'NoneType' object has no attribute 'findAll'

Got an error collecting the cliffnotes summaries (see below). It looks as if there is a problem when some element (here, article) is missing:

Traceback (most recent call last):
  File "scripts/data_collection/cliffnotes/get_summaries.py", line 95, in <module>
    section_paragraphs = list(filter(None, scrape_section_continuation(soup, section_header)))
  File "scripts/data_collection/cliffnotes/get_summaries.py", line 41, in scrape_section_continuation
    section_paragraphs = [paragraph.text.strip() for paragraph in section_data.findAll("p", recursive=False)]
AttributeError: 'NoneType' object has no attribute 'findAll'
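One defensive option (a sketch with a hypothetical helper, not the repo's actual fix) is to bail out before calling findAll when the element lookup returned None:

```python
def section_paragraphs_or_empty(section_data):
    """Return the section's paragraph texts, or [] when the element
    lookup came back as None (the case behind the traceback above)."""
    if section_data is None:
        return []
    return [p.text.strip() for p in section_data.findAll("p", recursive=False)]
```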

Unnecessary imports in align_data_bi_encoder_paraphrase.py

There are a few unnecessary imports in align_data_bi_encoder_paraphrase.py which should be removed:

  1. import rouge
  2. import rouge.rouge_score as rouge_score
  3. from nltk.stem.snowball import SnowballStemmer
  4. from nltk.translate.meteor_score import meteor_score
  5. from bert_score import BERTScorer
  6. from transformers import AutoTokenizer
  7. from nltk.tokenize import word_tokenize, sent_tokenize

Some, like the rouge and bert_score packages, are not used at all, yet they require users to install the packages just to run the script.

Incorrect behavior in separate_multiple_summaries function

I've been trying to diagnose why I have missing data, and part of the problem appears to be in the separate_multiple_summaries function. The end result is that the script doesn't split some books which are expected to be split in the provided chapter-level-summary-alignments.

An example of this behavior can be seen by stepping through the splitting of A Room With a View from gradesaver. It turns out that the script doesn't account for the <PARAGRAPH> tags, despite a comment in the source which states that it should.

While stepping through the function, you can see that the regex splits the text into lines like so:

<PARAGRAPH>Chapter Two In Santa Croce with No Baedeker:<PARAGRAPH>Summary:<PARAGRAPH>Lucy looks out her window onto the beautiful scene of a Florence morning

Then the first preprocessing function in the loop, remove_prefixes_line, strips off the leading < due to split_aggregate_chaps_all_sources.py:276, which strips all leading punctuation. The resulting line, which now starts with PARAGRAPH>Chapter Two In Santa Croce with No Baedeker:, doesn't match the regex, which expects the chapter marker to be at the beginning of the string.
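A possible shape for the fix (hypothetical; the real remove_prefixes_line does more than this) is to drop any leading <PARAGRAPH> tags before stripping punctuation, so the chapter regex still sees the marker at the start of the string:

```python
import re

def strip_line_prefix(line):
    """Remove leading <PARAGRAPH> tags first, then leading punctuation,
    so 'Chapter ...' survives intact for the chapter-marker regex."""
    line = re.sub(r'^(?:<PARAGRAPH>)+', '', line)
    return line.lstrip(' <>.,:;!?\'"-')
```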

This splitting issue (there may be more issues with splitting, but this is the one I investigated) causes a number of books to fail to split. Here's the list of books that the data collection script downloaded but failed to properly split for me:

gradesaver/A Room With a View
gradesaver/A Tale of Two Cities
gradesaver/Adam Bede
gradesaver/Anne of Green Gables
gradesaver/Antony and Cleopatra
gradesaver/As You Like It
gradesaver/Babbitt
gradesaver/Bleak House
gradesaver/Dombey and Son
gradesaver/Dr. Jekyll and Mr. Hyde
gradesaver/Dracula
gradesaver/Emma
gradesaver/Ethan Frome
gradesaver/Every Man in His Humour
gradesaver/Frankenstein
gradesaver/Gulliver's Travels
gradesaver/Incidents in the Life of a Slave Girl
gradesaver/Jane Eyre
gradesaver/Kidnapped
gradesaver/King Solomon's Mines
gradesaver/Little Women
gradesaver/Middlemarch
gradesaver/My Antonia
gradesaver/Northanger Abbey
gradesaver/Regeneration
gradesaver/Sense and Sensibility
gradesaver/Tess of the D'Urbervilles
gradesaver/The Age of Innocence
gradesaver/The Blithedale Romance
gradesaver/The House of the Seven Gables
gradesaver/The Jungle
gradesaver/The Marrow of Tradition
gradesaver/The Monkey's Paw
gradesaver/The Prince
gradesaver/The Red Badge of Courage
gradesaver/The Rise of Silas Lapham
gradesaver/The Rivals
gradesaver/The School for Scandal
gradesaver/The Spanish Tragedy
gradesaver/The Tempest
gradesaver/The Time Machine
gradesaver/The Turn of the Screw
gradesaver/The Valley of Fear
gradesaver/Troilus and Cressida
gradesaver/Twelve Years a Slave
gradesaver/What Maisie Knew
novelguide/Henry VI Part 1
novelguide/Madame Bovary
novelguide/Merry Wives of Windsor
novelguide/Oliver Twist
novelguide/Persuasion
sparknotes/Adam Bede
sparknotes/Anne of Green Gables
sparknotes/Anthem
sparknotes/Candide
sparknotes/Dr. Jekyll and Mr. Hyde
sparknotes/Dracula
sparknotes/Emma
sparknotes/Far from the Madding Crowd
sparknotes/Frankenstein
sparknotes/Hamlet
sparknotes/Jane Eyre
sparknotes/Kidnapped
sparknotes/Northanger Abbey
sparknotes/Persuasion
sparknotes/Regeneration
sparknotes/Romeo and Juliet
sparknotes/The Brothers Karamazov
sparknotes/The House of the Seven Gables
sparknotes/The Jungle
sparknotes/The Last of the Mohicans
sparknotes/The Picture of Dorian Gray
sparknotes/The Prince
sparknotes/The Red Badge of Courage
sparknotes/The Secret Garden
sparknotes/The Turn of the Screw

NotADirectoryError[WinError 267]

Hello, when I tried to get the summary of "46. Narrative of the Life of Frederick Douglass: An American Slave", a NotADirectoryError [WinError 267] was raised.

The referenced directory looks like:
'../../raw_summaries/cliffnotes/summaries\Narrative of the Life of Frederick Douglass: An American Slave'

(It is easy to work around by modifying the tsv.pruned file.)
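A more general workaround (a sketch with a hypothetical helper, not part of the repo) is to sanitize titles before building directory paths, replacing the characters Windows forbids in file names:

```python
import re

def safe_dirname(title):
    """Replace characters that are invalid in Windows directory names
    (the colon in this book's title being the offender here)."""
    return re.sub(r'[<>:"/\\|?*]', "_", title)
```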
