GithubHelp home page GithubHelp logo

olya-m / preprint-average-sentences Goto Github PK

View Code? Open in Web Editor NEW
0.0 1.0 0.0 99 KB

This is a Jupyter notebook for scraping and extractively summarizing biorxiv and medrxiv preprints

Jupyter Notebook 100.00%

preprint-average-sentences's Introduction

Preprint average sentences

This is a tool with a graphical user interface that scrapes groups of articles from biorxiv or medrxiv searches, summarizes the articles by outputting the average sentences for each paragraph in each preprint, and clusters the articles based on their most average sentences. I made this when I wanted there to be a quick way to get an overview of any emergent biomed topic.

This notebook uses a simple version of extractive summarization; it cuts out and returns the most relevant text in an article. One sentence is returned per paragraph based on the highest cosine similarity to an averaged-out paragraph embedding. I tried out pretrained Pegasus model to add some abstractive summarization but I found that there was too much text hallucination (i.e. the model outputting unreliable text that's not from the content being summarized).

The webscraping is done with BeautifulSoup. Sentences embedding is done with an all-mpnet-base-v2 pretrained sentence-transformers model downloaded from Hugging Face, average sentences are found via cosine similarity, and clustering is done with k-means clustering the sentence embeddings for the most average sentence per text. An automatic elbow method is used to choose the number of clusters. For each cluster, five keywords are returned. The GUI is built with tkinter.

In-use example

step1

step2

step3

Usage

If you would like to use this notebook you will need to first install all of the dependencies for the first cell. Then, download the model weights (418mb) by running the second cell before running the rest of the cells. The GUI will pop up after running the final cell. Folders for "scraped_texts" and "processed_texts" will pop up in the same directory as the notebook during text scraping and processing. The whole process can take 5-10 min per 100 articles.

preprint-average-sentences's People

Contributors

olya-m avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.