m-tian / scientific-paper-summarisation

This project forked from edco95/scientific-paper-summarisation

Machine learning models to automatically summarise scientific papers

Python 100.00%

Automatic Summarisation of Scientific Papers

Have you ever had to do a literature review as part of a research project and thought "I wish there was a quicker way of doing this"? This code aims to create that quicker way by developing a supervised-learning-based extractive summarisation system for scientific papers.

For more information on the project, please see:

Ed Collins, Isabelle Augenstein, Sebastian Riedel. A Supervised Approach to Extractive Summarisation of Scientific Papers. To appear in Proceedings of CoNLL, July 2017.

Ed Collins. A supervised approach to extractive summarisation of scientific papers. UCL MEng thesis, May 2017.

Code Description

The various code files and folders are described here. Note that the data used is not uploaded here, but the repository is nonetheless over 1GB in size.

  • Analysis - A folder containing code used to analyse the generated summaries and create various pretty graphs. It is not essential to the functioning of the summarisers and will not work without the data.
  • DataTools - Contains files for manipulating and preprocessing the data. The most important file is useful_functions.py, which contains many of the functions used to run the system.
  • Evaluation - Contains code to evaluate summaries and calculate the ROUGE-L metric, with thanks to hapribot.
  • Models - Contains the code which constructs and trains each of the supervised learning models that form the core of the summarisation system. All models are written in TensorFlow.
  • Summarisers - Contains the code which takes the trained models and uses them to actually create summaries of papers.
  • Visualisations - Contains code which visualises summaries by colouring them and saving them as HTML files. This is not essential to run the system.
  • Word2Vec - Contains the code necessary to train the Word2Vec model used for word embeddings. The actual trained Word2Vec model is not uploaded because it is too large.
  • DataDownloader - Contains code to download the original XML paper files and parse them into the format currently used by this system, in which each section title is delineated by "@&#". This symbol is very unlikely to ever occur in the text, so a paper can be read as a single string and split on it to recover its constituent sections. The important file is acquire_data.py.
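
As a rough illustration of the "@&#" format described for DataDownloader, the following minimal sketch splits a paper string into (heading, body) pairs. It is not taken from the repository: the function name split_into_sections and the example text are hypothetical, and it assumes every heading is wrapped in the delimiter on both sides.

```python
DELIMITER = "@&#"  # section-heading marker described in this README


def split_into_sections(paper_text):
    """Return a list of (heading, body) pairs from a delimited paper string.

    Hypothetical helper: assumes each heading appears as @&#HEADING@&#.
    """
    parts = paper_text.split(DELIMITER)
    # parts[0] is any text before the first heading. After that, headings
    # sit at odd indices and each section body at the following even index.
    sections = []
    for i in range(1, len(parts) - 1, 2):
        heading = parts[i].strip()
        body = parts[i + 1].strip()
        sections.append((heading, body))
    return sections


# Made-up example paper, for illustration only.
example = (
    "@&#ABSTRACT@&#\nWe summarise papers.\n"
    "@&#INTRODUCTION@&#\nLiterature reviews are slow.\n"
)
sections = split_into_sections(example)
for heading, body in sections:
    print(heading, "->", body)
```

Because headings and bodies alternate after the split, no regular expression is needed; the actual parsing code in the repository may handle edge cases differently.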

Running the Code

Before attempting to run this code, set up a suitable virtualenv using Python 2.7, then install the requirements listed in requirements.txt with pip install -r requirements.txt.

To run the code you will need paper data in the following format: each paper is a .txt file in a directory, and every section heading in the paper is surrounded on both sides by the symbol "@&#". You will also need to create:

  • a stopword list;
  • a list of permitted paper section titles;
  • a word embedding model;
  • dictionaries holding a bag-of-words representation of every paper, used for calculating features.

Finally, update all the paths currently listed in the project so that they match your own, and update the loading functions in useful_functions.py so that they load each of these resources from the correct locations.
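
The README does not specify how the bag-of-words dictionaries are structured, so the following is only a sketch of one plausible shape: a dictionary mapping each paper's filename to a word-count table with stopwords removed. The function name bag_of_words, the tokenisation rule, the tiny stopword set and the example paper are all assumptions made for illustration.

```python
import re
from collections import Counter


def bag_of_words(text, stopwords):
    """Count lowercase word tokens in text, skipping stopwords.

    Hypothetical helper: the tokenisation (runs of letters and digits)
    is an assumption, not the repository's own preprocessing.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)


# Made-up inputs, for illustration only.
stopwords = {"the", "of", "a", "is"}
papers = {"paper_001.txt": "The summarisation of a paper is the goal."}

# One bag-of-words table per paper, keyed by filename.
paper_bags = {name: bag_of_words(text, stopwords) for name, text in papers.items()}
print(paper_bags["paper_001.txt"])
```

Whatever structure you choose, the loading functions in useful_functions.py need to be adapted to read it.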

Other Notes

If you have read or are reading the MEng thesis or CoNLL paper corresponding to this code, the model names used there map to the names in this code as follows: SAFNet = SummariserNet, SFNet = SummariserNetV2, SNet = LSTM, SAF+F Ens = EnsembleSummariser, S+F Ens = EnsembleV2Summariser.

Contributors

edco95, isabelleaugenstein
