Preprint average sentences

This is a tool with a graphical user interface that scrapes groups of articles from bioRxiv or medRxiv searches, summarizes each preprint by outputting the most average sentence of each paragraph, and clusters the articles based on their most average sentences. I made it because I wanted a quick way to get an overview of any emerging biomedical topic.

This notebook uses a simple form of extractive summarization: it cuts out and returns the most relevant text in an article. One sentence is returned per paragraph, namely the sentence whose embedding has the highest cosine similarity to the averaged paragraph embedding. I tried a pretrained Pegasus model to add some abstractive summarization, but I found that there was too much text hallucination (i.e. the model outputting unreliable text that is not from the content being summarized).
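The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the notebook's actual code: the function names `cosine_similarity` and `most_average_sentence` are hypothetical, and the embeddings are assumed to be plain lists of floats, one per sentence in the paragraph.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


def most_average_sentence(sentences, embeddings):
    """Return (sentence, score) for the sentence closest to the mean embedding.

    Hypothetical helper: `embeddings[i]` is assumed to be the embedding of
    `sentences[i]`, with all sentences taken from the same paragraph.
    """
    if not sentences or len(sentences) != len(embeddings):
        raise ValueError("need one embedding per sentence and at least one sentence")
    # Average the sentence embeddings dimension by dimension.
    mean = [sum(dim) / len(embeddings) for dim in zip(*embeddings)]
    scores = [cosine_similarity(e, mean) for e in embeddings]
    best = max(range(len(sentences)), key=scores.__getitem__)
    return sentences[best], scores[best]
```

In practice the embeddings would come from a sentence-embedding model rather than hand-written vectors; any model that maps each sentence to a fixed-length vector should work with this sketch.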

The web scraping is done with BeautifulSoup. Sentence embedding is done with the pretrained all-mpnet-base-v2 sentence-transformers model downloaded from Hugging Face, the most average sentences are found via cosine similarity, and clustering is done by running k-means on the embeddings of the most average sentence of each text. An automatic elbow method is used to choose the number of clusters. For each cluster, five keywords are returned. The GUI is built with tkinter.
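To make the clustering step concrete, here is a small pure-Python sketch of k-means combined with one common automatic elbow heuristic. It rests on several assumptions: the README does not say which elbow rule the notebook uses, so the "largest gap below the chord" rule here is only one plausible choice; the function names `kmeans`, `elbow_k`, and `choose_k` are hypothetical; and the farthest-point initialization is chosen only to keep the example deterministic.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: squared_distance(point, centroids[j]))


def kmeans(points, k, iterations=50):
    """Lloyd's algorithm with deterministic farthest-point initialization.

    Returns (labels, inertia), where inertia is the sum of squared
    distances from each point to its assigned centroid.
    """
    points = [list(p) for p in points]
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing seed.
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(
                [sum(dim) / len(members) for dim in zip(*members)] if members else centroids[j]
            )
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    inertia = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, inertia


def elbow_k(inertias):
    """Pick k from inertias[i] (the inertia for k = i + 1).

    Both axes are rescaled to [0, 1]; the elbow is the k whose point lies
    farthest below the straight line joining the first and last points.
    """
    n = len(inertias)
    if n < 3 or inertias[0] == inertias[-1]:
        return 1
    span = inertias[0] - inertias[-1]
    best_k, best_gap = 1, 0.0
    for i, inertia in enumerate(inertias):
        x = i / (n - 1)
        y = (inertia - inertias[-1]) / span
        gap = 1.0 - x - y  # proportional to the distance below the chord
        if gap > best_gap:
            best_k, best_gap = i + 1, gap
    return best_k


def choose_k(points, max_k):
    """Fit k-means for k = 1..max_k and return the elbow k."""
    return elbow_k([kmeans(points, k)[1] for k in range(1, max_k + 1)])
```

Here `points` stands in for the embeddings of each article's most average sentence. A real pipeline would more likely rely on a library k-means implementation with randomized restarts; the version above only illustrates the idea.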

In-use example

Usage

To use this notebook, first install all of the dependencies needed by the first cell. Then download the model weights (418 MB) by running the second cell before running the rest of the cells. The GUI will pop up after you run the final cell. Folders named "scraped_texts" and "processed_texts" are created in the same directory as the notebook during text scraping and processing. The whole process can take 5 to 10 minutes per 100 articles.

Source repository: olya-m / preprint-average-sentences on GitHub